In the lifecycle of any distributed database system, there comes an inevitable moment when the initial data distribution no longer serves the workload. Data that was once evenly spread across nodes becomes skewed. Hot partitions emerge where none existed. Nodes that were sufficient yesterday become bottlenecks today. Rebalancing—the process of redistributing data across the cluster—becomes not just desirable, but essential for system survival.
Understanding when rebalancing is needed is arguably more important than knowing how to execute it. Rebalancing too early wastes resources and introduces unnecessary risk. Rebalancing too late leads to cascading failures, degraded performance, and potential data loss. The decision of when to rebalance is one of the most consequential choices a database administrator or systems architect can make.
By the end of this page, you will understand the fundamental triggers for rebalancing, how to detect imbalanced states through monitoring and metrics, the difference between proactive and reactive rebalancing, and how to develop a systematic framework for making rebalancing decisions in production environments.
When a distributed database is first deployed, data is typically distributed according to a carefully chosen partitioning strategy. Whether using range-based, hash-based, or directory-based partitioning, the initial distribution is designed to spread load evenly across nodes. However, real-world systems exhibit distribution drift—a gradual deviation from the intended data distribution over time.
Why Distribution Drift Occurs:
Distribution drift is not a bug but an inevitable consequence of how real applications generate and access data. Understanding its causes is the first step toward recognizing when rebalancing is needed.
Distribution drift is not a sign of poor design—it's a fundamental property of dynamic systems. Even the best partitioning scheme will eventually drift from its optimal state. The goal is not to prevent drift but to detect it early and respond appropriately.
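To make drift measurable, the sketch below (Python, with invented partition names and sizes) computes two simple skew indicators from per-partition data volumes: the ratio of the largest partition to the mean, and the coefficient of variation. A real deployment would pull these figures from the database's own partition statistics rather than a hard-coded dictionary.

```python
from statistics import mean, pstdev

# Hypothetical per-partition sizes in GB, e.g. pulled from cluster metadata.
partition_sizes = {"p0": 48, "p1": 52, "p2": 47, "p3": 121, "p4": 50}

sizes = list(partition_sizes.values())
avg = mean(sizes)

# Two simple drift indicators:
#  - max/mean ratio: how far the largest partition is from a perfectly even spread
#  - coefficient of variation: overall dispersion relative to the mean
max_to_mean = max(sizes) / avg
coeff_of_variation = pstdev(sizes) / avg

print(f"max/mean ratio: {max_to_mean:.2f}")           # ~1.9 here: p3 holds ~2x its fair share
print(f"coefficient of variation: {coeff_of_variation:.2f}")

# A ratio drifting well above 1.0, or a steadily rising CV, suggests the initial
# partitioning no longer matches how data is actually accumulating.
```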
The most visible symptom of distribution drift is the emergence of hot spots—partitions that receive disproportionately high load compared to others. Hot spots are particularly dangerous because they create single points of congestion that can limit the entire system's throughput, regardless of how much capacity exists elsewhere.
Anatomy of a Hot Spot:
A hot spot forms when one or more partitions receive significantly more operations than the cluster average. This can manifest as:
| Hot Spot Type | Cause | Symptoms | Examples |
|---|---|---|---|
| Write Hot Spot | Concentrated write operations on specific keys | High write latency, WAL contention, replication lag | Auto-incrementing IDs on range partitions, counters, event logs |
| Read Hot Spot | Popular data accessed far more than average | Cache misses, high read latency, connection exhaustion | Viral content, celebrity users, trending items |
| Size Skew | Data volume imbalance across partitions | Disk space exhaustion, compaction storms, backup delays | Large text fields, media storage, audit logs |
| Mixed Hot Spot | Combination of read/write concentration | All of the above, plus lock contention | Popular user profiles, frequently updated leaderboards |
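As a concrete version of the "significantly more operations than the cluster average" test, here is a sketch that flags read, write, and mixed hot spots from per-partition operation rates. The partition names, rates, and the 2x threshold are assumptions for illustration; with only a handful of partitions, the hot partition itself inflates the average, so the multiplier should stay modest.

```python
from statistics import mean

# Hypothetical per-partition operation rates (ops/sec).
read_rates = {"p0": 900, "p1": 850, "p2": 7200, "p3": 880}
write_rates = {"p0": 300, "p1": 310, "p2": 290, "p3": 2500}

HOT_FACTOR = 2.0  # flag anything running at 2x the cluster average (tunable)

def hot_partitions(rates):
    avg = mean(rates.values())
    return {p for p, r in rates.items() if r > HOT_FACTOR * avg}

read_hot = hot_partitions(read_rates)
write_hot = hot_partitions(write_rates)

for p in sorted(read_hot | write_hot):
    kind = ("mixed" if p in read_hot and p in write_hot
            else "read" if p in read_hot else "write")
    print(f"{p}: {kind} hot spot")
# p2: read hot spot
# p3: write hot spot
```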
The Cascade Effect:
Hot spots rarely remain isolated problems. When one partition becomes overloaded, its queues grow, clients time out and retry, and the extra load spills onto replicas and the nodes coordinating multi-partition operations.
This cascade effect means that a hot spot affecting 10% of data can degrade the entire system by 50% or more. The non-linear relationship between localized overload and global performance is why hot spot detection is critical.
Average metrics hide hot spots. A cluster showing 60% average CPU utilization might have one node at 95% and others at 50%. Always examine distributions (p50, p90, p99, max) rather than averages when assessing partition health.
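A minimal sketch of that advice: compute the distribution of a per-node metric instead of its mean. The node names and utilization figures below are synthetic, mirroring the "one node at 95%, the rest near 50%" scenario.

```python
import random

# Synthetic per-node CPU utilization (%): nine nodes near 50%, one at 95%.
random.seed(7)
cpu_by_node = {f"node-{i}": random.uniform(45, 55) for i in range(9)}
cpu_by_node["node-9"] = 95.0

values = sorted(cpu_by_node.values())

def percentile(sorted_vals, p):
    # Nearest-rank percentile: tiny and dependency-free, good enough for a health check.
    idx = round(p / 100 * (len(sorted_vals) - 1))
    return sorted_vals[idx]

avg = sum(values) / len(values)
print(f"avg: {avg:.1f}%  p50: {percentile(values, 50):.1f}%  "
      f"p90: {percentile(values, 90):.1f}%  max: {values[-1]:.1f}%")
# The average looks comfortable, but max exposes the node at 95%:
# exactly what a cluster-wide average would hide.
```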
Beyond hot spots, rebalancing becomes necessary when individual partitions or nodes approach their capacity limits. Capacity exhaustion can occur along multiple dimensions, each requiring careful monitoring and distinct response strategies.
The Five Dimensions of Capacity:
Modern distributed databases must manage capacity across multiple resource dimensions. Exhaustion in any single dimension can necessitate rebalancing, even if other dimensions have ample headroom.
Predictive Capacity Planning:
Effective capacity management requires not just monitoring current utilization but projecting future needs. Key metrics to track include per-dimension growth rates, peak versus average utilization, and the projected time to exhaustion for each resource.
A common operational guideline is to begin rebalancing planning when any capacity dimension reaches 80% utilization. This provides sufficient headroom for unexpected spikes while allowing time to plan and execute a careful rebalancing operation.
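As an illustration of that guideline, the sketch below projects when a single capacity dimension (disk, in this invented example) will cross the 80% planning threshold, assuming growth stays roughly linear at the recently observed rate.

```python
from datetime import date, timedelta

# Hypothetical disk figures for one node, with a growth rate observed from monitoring.
capacity_gb = 2000
used_gb = 1450
daily_growth_gb = 6.5            # e.g. averaged over the last 30 days

PLANNING_THRESHOLD = 0.80        # start planning rebalancing at 80% utilization

utilization = used_gb / capacity_gb
days_to_threshold = max(0.0, (PLANNING_THRESHOLD * capacity_gb - used_gb) / daily_growth_gb)
days_to_full = (capacity_gb - used_gb) / daily_growth_gb

print(f"current utilization: {utilization:.0%}")
print(f"crosses 80% around:  {date.today() + timedelta(days=days_to_threshold)}")
print(f"projected full:      {date.today() + timedelta(days=days_to_full)}")
```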
Some of the most common triggers for rebalancing are changes to the infrastructure itself. Unlike gradual drift, infrastructure changes create sudden imbalances that may require immediate attention.
Node Failures and Replacements:
When a node fails and is replaced, the new node starts empty. Depending on the database architecture, this might mean that surviving replicas stream their data to the replacement, or that the cluster reassigns partitions so the new node gradually takes over its share.
Regardless of the approach, the period during and immediately after node replacement is one of heightened imbalance.
Hardware Heterogeneity:
Modern clusters often have heterogeneous hardware due to incremental upgrades: older nodes with fewer cores and slower disks running alongside newer nodes with faster CPUs, more memory, and faster storage.
Even if data is evenly distributed by count, performance imbalances arise because newer hardware can handle more load. Rebalancing might be needed to shift more data to more capable nodes—a process called capacity-aware balancing.
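A sketch of the simplest form of capacity-aware balancing: assign each node a target share of the data proportional to a capacity score. The node names, scores, and data volume here are invented; production schemes usually blend in live load and replica placement constraints as well.

```python
# Hypothetical capacity scores (e.g. a weighted blend of CPU, memory, and disk
# benchmarks); newer hardware scores higher. Targets are proportional to the score.
capacity_scores = {"old-1": 1.0, "old-2": 1.0, "new-1": 2.5, "new-2": 2.5}
total_data_gb = 1400

total_score = sum(capacity_scores.values())
targets = {node: total_data_gb * score / total_score
           for node, score in capacity_scores.items()}

for node, gb in sorted(targets.items()):
    print(f"{node}: target {gb:.0f} GB")
# Equal-count placement would put 350 GB on every node; capacity-aware targets
# put 500 GB on each newer node and 200 GB on each older one.
```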
Cloud Considerations:
In cloud environments, additional triggers include instance retirement and maintenance events, autoscaling that adds or removes nodes, changes of instance type, and migrations across availability zones or regions.
Often, the need for rebalancing reveals itself through observable performance degradation. Recognizing these patterns early allows for proactive intervention before degradation becomes critical.
Key Performance Indicators (KPIs) That Signal Rebalancing Need:
| Indicator | Normal Range | Warning Threshold | Critical Threshold | What It Signals |
|---|---|---|---|---|
| Read Latency (p99) | < 10ms | 50ms | 200ms | Hot read partitions or cache inefficiency |
| Write Latency (p99) | < 20ms | 100ms | 500ms | Write contention or replication lag |
| Replication Lag | < 100ms | 1s | 10s | Overloaded primary or network issues |
| Query Queue Depth | < 10 | 100 | 1000 | Compute or I/O saturation |
| Connection Utilization | < 50% | 70% | 90% | Client concentration on specific nodes |
| Disk I/O Wait | < 5% | 15% | 30% | Storage saturation |
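To make the table actionable, here is a sketch that checks a set of current readings against the warning and critical thresholds above. The observed values are invented; in practice they would come from the cluster's metrics pipeline.

```python
# (warning, critical) thresholds, mirroring the KPI table above.
THRESHOLDS = {
    "read_latency_p99_ms":  (50, 200),
    "write_latency_p99_ms": (100, 500),
    "replication_lag_ms":   (1_000, 10_000),
    "query_queue_depth":    (100, 1_000),
    "connection_util_pct":  (70, 90),
    "disk_io_wait_pct":     (15, 30),
}

# Hypothetical current readings for one node.
observed = {
    "read_latency_p99_ms": 12,
    "write_latency_p99_ms": 140,
    "replication_lag_ms": 450,
    "query_queue_depth": 35,
    "connection_util_pct": 91,
    "disk_io_wait_pct": 8,
}

def classify(metric, value):
    warning, critical = THRESHOLDS[metric]
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "ok"

for metric, value in observed.items():
    print(f"{metric:22s} {value:>7} {classify(metric, value)}")
```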
Interpreting Performance Variance:
One of the most telling signs of imbalance is high variance in performance metrics across partitions. When examining cluster health, compare per-partition and per-node percentiles rather than cluster-wide averages; a widening gap between the busiest partitions and the median is often the earliest signal that load is concentrating.
The Long Tail Problem:
In a sharded system, overall system latency is determined by the slowest shard involved in any operation. If a query touches 10 shards and 9 respond in 5ms but one responds in 500ms, the user experiences 500ms latency. This 'long tail' effect means that even a small number of overloaded partitions can dramatically impact user experience.
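The arithmetic of the long tail is easy to demonstrate, as in the sketch below with made-up per-shard latencies matching the nine-fast, one-slow example.

```python
# Hypothetical per-shard response times (ms) for a query that fans out to 10 shards.
shard_latencies_ms = [5, 4, 6, 5, 5, 4, 6, 5, 5, 500]

user_visible_ms = max(shard_latencies_ms)   # the caller waits for the slowest shard
typical_ms = sorted(shard_latencies_ms)[len(shard_latencies_ms) // 2]

print(f"typical shard: {typical_ms} ms, user-visible latency: {user_visible_ms} ms")
# Nine shards answer in about 5 ms, yet the query takes 500 ms:
# a single overloaded partition sets the latency of the whole operation.
```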
Performance degradation often accelerates. Initial slowdowns cause retry storms, which increase load, which causes more slowdowns. By the time degradation is noticeable to users, the system may already be in a dangerous state. Early detection through monitoring is essential.
The timing of rebalancing operations falls into two broad categories: proactive (before problems occur) and reactive (in response to problems). Each approach has its place, but mature operations teams strongly favor proactive rebalancing.
Reactive Rebalancing:
Reactive rebalancing occurs when system health has already degraded: it is triggered by incidents or visible user impact, planned under time pressure, and executed while the cluster is already strained, which makes it riskier and more disruptive.
Proactive Rebalancing:
Proactive rebalancing anticipates future needs: it is driven by capacity projections and trend monitoring, scheduled for low-traffic windows, and executed while the cluster still has headroom to absorb the overhead of data movement.
Building a Proactive Rebalancing Culture:
Organizations that successfully practice proactive rebalancing share common characteristics: per-partition monitoring and alerting, regular capacity reviews, scheduled maintenance windows, and documented runbooks for data movement.
Industry experience suggests that proactive rebalancing is roughly one-third as disruptive as reactive rebalancing. The time invested in monitoring, planning, and scheduled maintenance pays dividends in reduced incidents and better user experience.
Given the complexity and risk of rebalancing operations, a structured decision framework helps ensure that rebalancing is undertaken only when truly necessary and with appropriate planning.
The Rebalancing Decision Matrix:
When evaluating whether to rebalance, consider these factors:
| Factor | Low Priority | Medium Priority | High Priority |
|---|---|---|---|
| Capacity Utilization | < 60% peak | 60-80% peak | > 80% peak |
| Partition Size Variance | < 20% deviation | 20-40% deviation | > 40% deviation |
| Performance Variance | < 20% latency diff | 20-50% latency diff | > 50% latency diff |
| Time to Exhaustion | > 6 months | 3-6 months | < 3 months |
| Upcoming Events | None known | Moderate growth expected | Major traffic spike expected |
| System Stability | Stable, no recent issues | Minor issues observed | Active incidents related to balance |
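One way to apply the matrix mechanically is to bucket raw monitoring readings into the low/medium/high priorities it defines, as in the sketch below. The readings are invented; the cutoffs mirror the first three rows of the table.

```python
# Map a reading to the matrix's priority buckets given its medium and high cutoffs.
def bucket(value, medium_at, high_above):
    if value > high_above:
        return "high"
    if value >= medium_at:
        return "medium"
    return "low"

# Hypothetical readings, each paired with its (medium_at, high_above) cutoffs.
readings = {
    "capacity_utilization_pct":    (84, (60, 80)),
    "partition_size_variance_pct": (27, (20, 40)),
    "performance_variance_pct":    (55, (20, 50)),
}

priorities = {name: bucket(value, *cutoffs)
              for name, (value, cutoffs) in readings.items()}
print(priorities)
# {'capacity_utilization_pct': 'high',
#  'partition_size_variance_pct': 'medium',
#  'performance_variance_pct': 'high'}
```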
Decision Tree for Rebalancing:
Start
  |
  Is there an active incident?
  |
  +-- Yes: Is rebalancing the fix?
  |     +-- Yes: Emergency rebalance
  |     +-- No:  Address root cause first
  |
  +-- No: Any HIGH priority factors?
        +-- Yes: Plan proactive rebalance
        +-- No:  Any 2+ MEDIUM priority factors?
              +-- Yes: Schedule rebalance
              +-- No:  Continue monitoring
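The same tree can be written as a small function, as sketched below; the priority counts would come from bucketing the matrix factors (as in the earlier sketch), and the incident flags from whatever alerting tooling is in place.

```python
def rebalancing_decision(active_incident: bool,
                         rebalancing_is_the_fix: bool,
                         high_priority_factors: int,
                         medium_priority_factors: int) -> str:
    """Translate the decision tree above into a recommendation string."""
    if active_incident:
        return ("emergency rebalance" if rebalancing_is_the_fix
                else "address root cause first")
    if high_priority_factors >= 1:
        return "plan proactive rebalance"
    if medium_priority_factors >= 2:
        return "schedule rebalance"
    return "continue monitoring"

# Example: no active incident, one HIGH priority factor from the capacity dimension.
print(rebalancing_decision(active_incident=False, rebalancing_is_the_fix=False,
                           high_priority_factors=1, medium_priority_factors=1))
# -> plan proactive rebalance
```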
Questions to Ask Before Rebalancing:
Is the imbalance itself the root cause, or a symptom of something else? How much data will move, over what window, and what will that movement cost in I/O and network bandwidth? Can the operation be throttled, paused, or rolled back if live traffic suffers? Who needs to be informed, and how will success be measured?
Every rebalancing decision—whether to proceed or to defer—should be documented. This creates an audit trail, facilitates post-mortems, and helps refine decision criteria over time.
Recognizing when rebalancing is needed is a critical skill for database administrators and system architects. The key insights from this page: distribution drift is an inevitable property of dynamic systems, not a design flaw; hot spots and capacity exhaustion are the primary triggers for rebalancing; infrastructure changes such as node replacements and heterogeneous hardware create sudden imbalance; per-partition distributions reveal problems that averages hide; and proactive, framework-driven rebalancing is far less disruptive than reacting to incidents.
What's Next:
Now that we understand when rebalancing is needed, the next page explores how to execute rebalancing with minimal disruption. We'll examine specific strategies for moving data between partitions, techniques for maintaining availability during rebalancing, and approaches to coordinate rebalancing with ongoing operations.
You now have a comprehensive understanding of the triggers, signals, and decision criteria for database rebalancing. This foundation prepares you for the practical strategies covered in the following pages.