The true measure of operational excellence in distributed databases is not whether you can rebalance data—it's whether you can rebalance without users noticing. In production environments serving real traffic, the goal is to move potentially terabytes of data, reroute millions of requests, and fundamentally restructure data placement—all while maintaining the same response times and availability that users expect.
This is not merely aspirational. The techniques developed by hyperscale operators at companies like Google, Amazon, and Facebook have established that zero-downtime rebalancing is achievable with the right strategies. What was once considered an acceptable maintenance window is now recognized as an unnecessary service disruption.
By the end of this page, you will master the core strategies for minimal-disruption rebalancing: live data migration, throttling and rate limiting, staged rollouts, dual-write patterns, background processing, and the coordination mechanisms that tie these techniques together.
Before diving into specific techniques, it's essential to understand the fundamental principles that underpin all non-disruptive rebalancing strategies. These principles guide design decisions and help evaluate tradeoffs.
The Core Principles:
The Disruption Budget:
Every rebalancing operation has an implicit disruption budget—the amount of additional latency, reduced throughput, or temporary inconsistency that the system can tolerate. This budget is determined by the service-level objectives you have committed to, the headroom between current performance and those objectives, peak-versus-off-peak traffic patterns, and how much temporary staleness downstream consumers can absorb.
A well-planned rebalancing operation stays comfortably within this disruption budget throughout its execution.
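To make the budget concrete, the latency portion can be expressed as the headroom between the current p99 and the SLO target. A minimal sketch with hypothetical numbers:

```python
# Hypothetical numbers: a p99 latency SLO of 200 ms and a current p99 of 140 ms
slo_p99_ms = 200
current_p99_ms = 140
headroom_ms = slo_p99_ms - current_p99_ms   # 60 ms of p99 latency the migration could add

# Keep a safety margin so normal traffic spikes don't consume the whole budget
usable_budget_ms = headroom_ms * 0.5        # plan for at most ~30 ms of added p99 latency
```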
All rebalancing strategies involve tradeoffs. Safer approaches take longer. Faster approaches consume more resources. The goal is not to eliminate tradeoffs but to make them explicit and manageable.
Live data migration is the process of moving data from one partition to another while both partitions continue servicing requests. This is the technical foundation of non-disruptive rebalancing.
The Three-Phase Migration Pattern:
Most live migration strategies follow a three-phase pattern that ensures data consistency while maintaining availability:
| Phase | Duration | Operations | Key Activities |
|---|---|---|---|
| Phase 1: Dual-Write Setup | Minutes to hours | Writes go to both source and destination | Configure replication, validate connectivity, start change capture |
| Phase 2: Bulk Transfer | Hours to days | Historical data copied in background | Copy existing data, reconcile with ongoing changes, verify integrity |
| Phase 3: Cutover | Seconds to minutes | Traffic shifted to new partition | Redirect reads, then writes, validate, cleanup source |
Phase 1: Dual-Write Setup
Before any data moves, you establish a link between the source and destination partitions: provision and size the destination, configure replication or change data capture on the source, and validate connectivity end to end so that incoming writes can begin flowing to both locations.
Phase 2: Bulk Transfer
The bulk of rebalancing time is spent in this phase: existing data is copied to the destination in the background, reconciled against the changes that keep arriving through the change-capture stream, and verified for integrity before cutover is attempted.
Phase 3: Cutover
The final, most delicate phase: reads are redirected to the destination, then writes, the new partition is validated under live traffic, and the source copy is cleaned up once the migration is confirmed healthy.
The cutover phase is where most migration failures occur. Even with extensive bulk transfer success, a botched cutover can cause data loss or extended downtime. Invest heavily in rehearsing and automating the cutover process.
Rebalancing operations compete with user traffic for system resources. Without careful throttling, a well-intentioned rebalancing operation can starve production workloads and cause the very outage it was meant to prevent.
Resource Competition During Rebalancing:
Bulk copies contend with foreground traffic for disk I/O, network bandwidth, CPU cycles, and cache space on both the source and the destination; without limits, the migration's throughput comes directly out of user-facing capacity.
Throttling Strategies:
1. Static Rate Limiting
The simplest approach: configure a fixed rate for rebalancing operations.
rebalancing_bandwidth_limit = 100 MB/s
rebalancing_iops_limit = 5000
rebalancing_concurrent_operations = 4
Pros: Predictable, easy to configure
Cons: Doesn't adapt to actual system load; may be too conservative or too aggressive
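A minimal sketch of how a static limit like the one above might be enforced in the copy loop, using a token bucket (names and structure are illustrative, not tied to any particular database):

```python
import time

class TokenBucket:
    """Caps rebalancing throughput at a fixed byte rate. A sketch, not production code;
    assumes individual chunks are never larger than burst_bytes."""
    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last_refill = time.monotonic()

    def acquire(self, nbytes):
        # Block until enough tokens have accumulated to cover nbytes
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

# Usage in the bulk-copy loop (hypothetical destination object):
#   limiter = TokenBucket(rate_bytes_per_sec=100 * 1024**2, burst_bytes=8 * 1024**2)
#   limiter.acquire(len(chunk)); destination.write(chunk)
```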
2. Adaptive Throttling
Adjust rebalancing rate based on observed system metrics:
if current_latency_p99 > latency_threshold:
    reduce_rebalancing_rate()
elif current_latency_p99 < latency_target * 0.8:
    increase_rebalancing_rate()
Pros: Automatically balances speed and stability
Cons: Requires sophisticated monitoring integration; feedback loops can oscillate
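One way to damp the oscillation risk is an additive-increase / multiplicative-decrease (AIMD) controller that backs off sharply when latency degrades and recovers slowly. A sketch, with illustrative thresholds and step sizes:

```python
def adjust_rebalancing_rate(current_rate, p99_ms, slo_p99_ms,
                            min_rate=10, max_rate=500):
    """AIMD controller: cut the rate hard under SLO pressure, raise it gently otherwise.
    Rates are in MB/s; all numbers here are illustrative."""
    if p99_ms > slo_p99_ms:                 # latency over budget: back off quickly
        return max(min_rate, current_rate * 0.5)
    if p99_ms < slo_p99_ms * 0.8:           # comfortable headroom: speed up slowly
        return min(max_rate, current_rate + 10)
    return current_rate                     # in the gray zone: hold steady

# Called periodically by the migration worker (e.g. every 30 seconds)
# with p99 latency sampled from the monitoring system.
```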
3. Time-Based Windows
Limit rebalancing to specific time windows:
rebalancing_windows:
  - start: 02:00
    end: 06:00
    rate: 500 MB/s   # Aggressive during off-peak
  - start: 06:00
    end: 02:00
    rate: 50 MB/s    # Conservative during business hours
Pros: Aligns with known traffic patterns; predictable behavior
Cons: Doesn't adapt to unexpected traffic; rebalancing takes longer
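A small helper that picks the active rate from a window table like the one above; note that the second window wraps past midnight, which the comparison handles explicitly (the structure mirrors the example config and is not a real configuration schema):

```python
from datetime import time

# Mirrors the example configuration above: (start, end, rate in MB/s)
WINDOWS = [
    (time(2, 0), time(6, 0), 500),   # off-peak: aggressive
    (time(6, 0), time(2, 0), 50),    # business hours: conservative (wraps midnight)
]

def current_rate(now: time) -> int:
    for start, end, rate in WINDOWS:
        if start <= end:
            in_window = start <= now < end
        else:                        # window wraps past midnight
            in_window = now >= start or now < end
        if in_window:
            return rate
    return 50                        # conservative default if no window matches

# e.g. rate = current_rate(datetime.datetime.now().time())
```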
4. Priority Queuing
Use operating-system or database-level priority mechanisms so that foreground queries always win contention: run rebalancing threads at lower CPU and I/O scheduling priority, or push their work through a lower-priority queue, as sketched below.
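One simple application-level form of this, with illustrative names, is a shared work queue in which foreground requests always dequeue ahead of rebalancing chunks:

```python
import queue

FOREGROUND, REBALANCE = 0, 10        # lower number = higher priority
work_queue = queue.PriorityQueue()

def submit_foreground(task_id, task):
    work_queue.put((FOREGROUND, task_id, task))

def submit_rebalance_chunk(chunk_id, task):
    work_queue.put((REBALANCE, chunk_id, task))

def worker_loop():
    while True:
        # Foreground items drain first; rebalancing chunks run only when the
        # queue has no higher-priority work. task_id breaks ties deterministically.
        _priority, _task_id, task = work_queue.get()
        task()
        work_queue.task_done()
```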
A common starting point: rebalancing operations should consume no more than 10% of available resources during peak hours. This can be increased to 50% or more during maintenance windows. Adjust based on observed impact.
Complex rebalancing operations benefit from staged rollouts that limit blast radius and provide validation checkpoints. Rather than rebalancing the entire cluster at once, work in graduated phases.
The Canary Approach:
Start with a small subset (1-5%) of partitions:
| Stage | Scope | Duration | Success Criteria | Rollback Cost |
|---|---|---|---|---|
| Canary | 1% of partitions | 24-48 hours | No errors, latency within 10% of baseline | Minimal - single partition |
| Early Majority | 10% of partitions | 2-3 days | Aggregate metrics stable, no user complaints | Low - limited scope |
| Majority | 50% of partitions | 3-5 days | All health checks passing, SLAs met | Medium - significant coordination |
| Complete | 100% of partitions | 1-2 days | Full validation, cleanup complete | High - complete rollback complex |
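A rollout driver for a schedule like this can be sketched as a loop that migrates each stage's partitions, waits out the bake time, and refuses to advance unless the success criteria hold. The stage definitions and the `migrate` and `healthy` callbacks are placeholders for your own migration and health-check logic:

```python
import time

STAGES = [          # (name, fraction of partitions, bake time in hours)
    ("canary", 0.01, 36),
    ("early_majority", 0.10, 60),
    ("majority", 0.50, 96),
    ("complete", 1.00, 24),
]

def run_rollout(partitions, migrate, healthy):
    """migrate(p) moves one partition; healthy() evaluates the stage's success criteria.
    A real driver would persist progress durably rather than sleep in process."""
    done = set()
    for name, fraction, bake_hours in STAGES:
        target = int(len(partitions) * fraction)
        for p in partitions[:target]:
            if p not in done:
                migrate(p)
                done.add(p)
        time.sleep(bake_hours * 3600)          # bake time: let slow-burning issues surface
        if not healthy():
            raise RuntimeError(f"Stage '{name}' failed its success criteria; halting rollout")
```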
Partition Selection Strategies:
Not all partitions are equally suitable as canaries. Favor partitions that see representative traffic but are not business-critical, and save the largest or hottest partitions for later stages; a canary that carries no meaningful load validates nothing.
The Ring Approach:
For geographic or tiered deployments, organize the rollout into concentric rings: for example, internal or test clusters first, then a single low-traffic region, then the remaining regions in increasing order of traffic.
Each ring completes before the next begins, with explicit approval gates.
The waiting periods between stages—called 'bake time'—are not optional. Many issues only appear under sustained load or after time passes (memory leaks, slow degradation, edge cases). Rushing through stages defeats the purpose of staged rollouts.
Dual-write patterns ensure data consistency during migration by writing to both source and destination simultaneously. Shadow patterns validate new partition behavior by comparing results without affecting users.
The Dual-Write Pattern:
// Dual-Write Implementation
function write(key, value):
    // Write to primary (source) partition
    result = primary_partition.write(key, value)

    if migration_in_progress(key):
        // Asynchronously write to destination
        async destination_partition.write(key, value)

        // Or synchronously for strong consistency
        // dest_result = destination_partition.write(key, value)
        // if dest_result.failed:
        //     handle_inconsistency(key, value)

    return result

// Dual-Write with Conflict Detection
function write_with_conflict_check(key, value, version):
    primary_result = primary_partition.compare_and_swap(key, value, version)

    if primary_result.success AND migration_in_progress(key):
        new_version = primary_result.new_version
        dest_result = destination_partition.write(key, value, new_version)

        if dest_result.failed:
            log_conflict(key, primary_result, dest_result)
            schedule_reconciliation(key)

    return primary_result

Dual-Write Considerations:
Dual writes add write amplification and, in the synchronous variant, extra latency on every write. Decide up front how a failed destination write is handled (block, log and reconcile later, or abort the migration) and make that behavior observable.
The Shadow Read Pattern:
Validate new partition behavior without affecting users:
// Shadow Read Implementation
function read(key):
    // Always read from primary
    primary_result = primary_partition.read(key)

    if shadow_validation_enabled(key):
        // Asynchronously read from destination
        async:
            shadow_result = destination_partition.read(key)
            compare_results(primary_result, shadow_result, key)

    // Return primary result regardless of shadow
    return primary_result

function compare_results(primary, shadow, key):
    if primary != shadow:
        metrics.increment("shadow_mismatch")
        log_divergence(key, primary, shadow)

        if critical_mismatch(primary, shadow):
            alert_oncall("Data divergence detected", key)
            pause_migration_if_threshold_exceeded()

Shadow Pattern Benefits:
The destination is exercised with real query patterns before it serves a single user-visible response, and divergence is caught while the source is still authoritative.
Shadow Pattern Costs:
Every shadowed read roughly doubles read load on the storage layer and adds comparison overhead, so shadowing is usually applied to a sampled fraction of traffic rather than to every request.
Shadow validation works best when you track mismatch rates as percentages rather than alerting on every divergence. A 0.001% mismatch rate might be acceptable; a 1% rate warrants investigation. Define thresholds in advance.
Rebalancing workloads should run as background processes that yield to foreground traffic. Several techniques achieve this goal.
Change Data Capture (CDC):
CDC tails the database's transaction or write-ahead log and replays each committed change against the destination partition, so the copy keeps converging while the source continues to accept writes.
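A schematic consumer loop, assuming a generic change-stream interface; `read_changes`, `apply` semantics, and `checkpoint_store` are placeholders rather than any specific CDC product's API:

```python
def replicate_changes(change_stream, destination, checkpoint_store):
    """Tail the source's change stream and apply each event to the destination,
    checkpointing the log position so the loop can resume after a crash."""
    position = checkpoint_store.load()          # last durably applied log position
    for event in change_stream.read_changes(from_position=position):
        if event.type == "put":
            destination.write(event.key, event.value)
        elif event.type == "delete":
            destination.delete(event.key)
        checkpoint_store.save(event.position)   # at-least-once: applies must be idempotent
```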
Snapshot-Plus-Incremental:
Take a consistent point-in-time snapshot of the source, record the log position at which it was taken, bulk-load the snapshot into the destination, and then apply every change logged after that position until the destination has caught up.
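A sketch of that sequence, reusing the hypothetical change-stream consumer (`replicate_changes`) from the CDC example above:

```python
def snapshot_plus_incremental(source, destination, change_stream, checkpoint_store):
    # 1. Record the log position before the snapshot so no change is missed
    start_position = source.current_log_position()

    # 2. Bulk-copy a consistent snapshot (the long, throttled part of the migration)
    for key, value in source.snapshot():
        destination.write(key, value)

    # 3. Replay everything logged since the snapshot began
    checkpoint_store.save(start_position)
    replicate_changes(change_stream, destination, checkpoint_store)
```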
Copy-on-Read:
Lazily migrate data when it is accessed: a read that misses on the destination falls back to the source, and the result is written into the destination on the way back, so hot data migrates first and a background sweep picks up whatever is never read.
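A minimal read-path sketch of copy-on-read; the partition handles and method names are illustrative:

```python
def read_during_migration(key, source_partition, destination_partition):
    value = destination_partition.read(key)
    if value is not None:
        return value                          # already migrated

    value = source_partition.read(key)
    if value is not None:
        # Populate the destination as a side effect of the read. A real
        # implementation would use a conditional write (write-if-absent or
        # versioned) so this backfill cannot clobber a concurrent dual-write.
        destination_partition.write(key, value)
    return value
```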
Cooperative Background Threads:
Use database-native mechanisms for background work, such as low-priority worker threads that yield to foreground queries or built-in streaming and rebuild operations that are already throttled by the engine.
Production rebalancing often combines multiple techniques: CDC for change capture, snapshot-plus-incremental for bulk transfer, and copy-on-read for lazy migration of cold data. The optimal combination depends on data volume, access patterns, and consistency requirements.
Complex rebalancing operations require careful coordination across multiple components. Without proper orchestration, partial failures can leave the system in inconsistent states.
The Orchestration State Machine:
Model rebalancing as a state machine with defined transitions:
| State | Description | Next States | Failure Action |
|---|---|---|---|
| PENDING | Migration scheduled but not started | PREPARING, CANCELLED | No action needed |
| PREPARING | Setting up destination, starting CDC | COPYING, FAILED | Cleanup destination |
| COPYING | Bulk data transfer in progress | CATCHING_UP, PAUSED, FAILED | Resume or cleanup |
| CATCHING_UP | Applying incremental changes | READY_FOR_CUTOVER, PAUSED, FAILED | Return to COPYING |
| READY_FOR_CUTOVER | Source and destination synchronized | CUTTING_OVER, PAUSED | Maintain sync |
| CUTTING_OVER | Switching traffic to destination | COMPLETED, ROLLING_BACK | Execute rollback |
| COMPLETED | Migration finished successfully | CLEANING_UP | N/A |
| ROLLING_BACK | Reverting to source partition | ROLLED_BACK, FAILED | Manual intervention |
| PAUSED | Temporarily suspended | Previous state | Resume when ready |
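The table above translates directly into a transition map that the orchestrator can enforce before acting. A minimal sketch; state names follow the table, and the persistence call is a placeholder:

```python
VALID_TRANSITIONS = {
    "PENDING":           {"PREPARING", "CANCELLED"},
    "PREPARING":         {"COPYING", "FAILED"},
    "COPYING":           {"CATCHING_UP", "PAUSED", "FAILED"},
    "CATCHING_UP":       {"READY_FOR_CUTOVER", "PAUSED", "FAILED"},
    "READY_FOR_CUTOVER": {"CUTTING_OVER", "PAUSED"},
    "CUTTING_OVER":      {"COMPLETED", "ROLLING_BACK"},
    "COMPLETED":         {"CLEANING_UP"},
    "ROLLING_BACK":      {"ROLLED_BACK", "FAILED"},
}
# PAUSED resumes to whichever state it was entered from, so it is handled separately.

def transition(migration, new_state, metadata_store):
    if new_state not in VALID_TRANSITIONS.get(migration.state, set()):
        raise ValueError(f"Illegal transition {migration.state} -> {new_state}")
    # Persist before acting on the new state, so a restarted orchestrator
    # resumes from the last durable checkpoint.
    metadata_store.save_state(migration.id, new_state)
    migration.state = new_state
```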
Coordination Mechanisms:
1. Distributed Lock Management
Prevent concurrent migrations of the same partition:
// Acquire exclusive lock before migration
lock = lock_service.acquire("migration:" + partition_id, ttl=1hour)
if not lock.acquired:
    abort("Partition already being migrated")

// Refresh lock periodically during migration
while migration_in_progress:
    lock.refresh()
    sleep(lock_refresh_interval)

// Release lock on completion
lock.release()
2. Metadata Coordination
Update routing metadata atomically, so every router observes either the old partition map or the new one, never a mix; a versioned compare-and-swap against the cluster's metadata store is the usual mechanism, as sketched below.
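A sketch of that compare-and-swap; `metadata_store` stands in for an etcd- or ZooKeeper-style store, and the method names are assumptions rather than a real client API:

```python
def switch_partition_owner(metadata_store, partition_id, new_node):
    """Flip one routing entry with a versioned compare-and-swap so that
    concurrent updates to the partition map cannot interleave."""
    while True:
        routing_map, version = metadata_store.get("partition_map")
        updated = dict(routing_map)
        updated[partition_id] = new_node
        # Succeeds only if nobody else modified the map since we read it
        if metadata_store.compare_and_swap("partition_map", updated, expected_version=version):
            return
```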
3. Health Check Integration
Pause rebalancing automatically when system health degrades: feed the orchestrator the same signals your alerting uses (latency SLOs, error rates, replication lag) and treat a degraded signal as a cue to pause rather than push through.
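A sketch of the gating loop, with the health checks and orchestrator interface left abstract (the check names and thresholds are assumptions):

```python
import time

HEALTH_CHECKS = {
    "p99_latency_within_slo": lambda m: m["p99_ms"] < m["slo_p99_ms"],
    "replication_lag_ok":     lambda m: m["replica_lag_s"] < 30,
    "error_rate_ok":          lambda m: m["error_rate"] < 0.001,
}

def health_gate(orchestrator, fetch_metrics, poll_interval_s=15):
    while orchestrator.migration_active():
        metrics = fetch_metrics()
        failed = [name for name, check in HEALTH_CHECKS.items() if not check(metrics)]
        if failed:
            orchestrator.pause(reason=f"health checks failing: {failed}")
        elif orchestrator.is_paused():
            orchestrator.resume()
        time.sleep(poll_interval_s)
```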
4. Cross-Region Coordination
For global deployments, coordination must also span regions: avoid migrating the same data in more than one region at a time, sequence region-by-region cutovers explicitly, and monitor cross-region replication lag before and during each cutover.
The rebalancing orchestrator must be resilient. If it crashes mid-migration, the system should be able to resume from the last checkpoint. Store all state durably and design for orchestrator restarts.
Non-disruptive rebalancing is achievable with the right techniques and careful execution. The key insights from this page:
- Treat disruption as a budget: know how much added latency and inconsistency you can spend, and stay inside it.
- Migrate live data in three phases: dual-write setup, bulk transfer, and a carefully rehearsed cutover.
- Throttle rebalancing so it always yields to user traffic, adaptively where possible.
- Roll out in stages with real bake time, starting with low-risk canary partitions.
- Use dual writes and shadow reads to validate the destination before it serves users.
- Drive the whole operation through a durable, resumable orchestrator with explicit states, locks, and health gates.
What's Next:
With strategies for minimal disruption understood, the next page dives deep into consistent hashing—the algorithmic foundation that makes modern distributed rebalancing possible. We'll explore how consistent hashing minimizes data movement when cluster membership changes and how virtual nodes improve load distribution.
You now understand the operational strategies that enable zero-downtime rebalancing. These techniques, refined by hyperscale operators, transform rebalancing from a risky maintenance window into a routine background operation.