Understanding the algorithms and challenges of rebalancing is necessary but not sufficient for success. The difference between a smooth resharding operation and an incident often comes down to operational discipline: the runbooks, monitoring, coordination, and decision-making frameworks that guide execution.
Production resharding brings together technical complexity with organizational reality. Multiple teams may be affected. On-call engineers need to know when to intervene. Stakeholders need clear communication about potential impacts. This page bridges the gap between theoretical understanding and operational excellence.
By the end of this page, you will understand how to build operational runbooks for resharding, design monitoring dashboards that surface problems early, plan and execute rollbacks safely, coordinate across teams and time zones, and apply real-world lessons learned from production resharding operations.
A runbook is a documented procedure that guides operators through a complex operation. For resharding, runbooks transform ad-hoc decisions into repeatable, tested processes.
Essential Runbook Components:
Example Runbook Section: Starting Migration
## Phase 2: Start Migration
### Prerequisites
- [ ] Phase 1 (Preparation) completed successfully
- [ ] All health checks passing (dashboard link: <url>)
- [ ] On-call engineer aware migration is starting
- [ ] Traffic below 70% of peak (check: <metrics link>)
### Procedure
1. Open terminal session to migration controller (host: migration-ctrl-1)
2. Verify no active migrations:
   ```bash
   $ migctl status --all
   ```
   Expected: "No active migrations"
3. Start migration with throttled rate:
   ```bash
   $ migctl start --migration-id=[MIGRATION_ID] --rate-limit=100MB/s
   ```
4. Verify migration started:
   ```bash
   $ migctl status --migration-id=[MIGRATION_ID]
   ```
   Expected: State=RUNNING, Progress=0%, Rate=~100MB/s
5. Monitor for 15 minutes before proceeding. If anything looks abnormal, pause the migration:
   ```bash
   $ migctl pause --migration-id=[MIGRATION_ID]
   ```

A runbook that hasn't been executed is just a hypothesis. Run through your resharding runbook in a test environment at least quarterly. Every production execution should result in runbook updates based on lessons learned.
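The prerequisite checklist above is also a good candidate for automation, so that a rushed operator cannot skip a step. Below is one possible pre-flight gate, sketched in Python; the three helper functions are hypothetical placeholders for whatever metrics and control-plane clients your environment actually provides.

```python
# Hypothetical pre-flight gate for the runbook prerequisites above.
# The helper functions are placeholders; wire them to your own systems.

def current_traffic_fraction() -> float:
    """Current traffic as a fraction of recent peak (placeholder value)."""
    return 0.55  # replace with a real metrics query

def active_migration_count() -> int:
    """Number of currently active migrations (placeholder value)."""
    return 0     # replace with a call to the migration controller

def health_checks_passing() -> bool:
    """Whether all health checks pass (placeholder value)."""
    return True  # replace with a call to the health-check endpoint

def preflight(max_traffic_fraction: float = 0.70) -> list[str]:
    """Return the list of failed prerequisites; an empty list means clear to start."""
    failures = []
    if active_migration_count() != 0:
        failures.append("another migration is already active")
    if not health_checks_passing():
        failures.append("one or more health checks are failing")
    if current_traffic_fraction() >= max_traffic_fraction:
        failures.append(f"traffic is at or above {max_traffic_fraction:.0%} of peak")
    return failures

if __name__ == "__main__":
    problems = preflight()
    if problems:
        raise SystemExit("Preflight FAILED:\n- " + "\n- ".join(problems))
    print("Preflight passed: clear to start the migration")
```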
During resharding, visibility into system behavior becomes critical. You need to detect problems before they become incidents and understand what's happening well enough to make informed decisions.
Key Metrics to Monitor:
| Category | Metric | Normal Range | Alert Threshold | Why It Matters |
|---|---|---|---|---|
| Migration Progress | Keys/bytes migrated | Increasing steadily | Stalled > 5 min | Detect stuck migrations |
| Migration Progress | Estimated time remaining | Decreasing | Increasing trend | Detect slowdown |
| Latency | Read p99 | < 10ms | 50ms | User experience degradation |
| Latency | Write p99 | < 20ms | 100ms | User experience degradation |
| Error Rates | Failed requests % | < 0.01% | 0.1% | Service degradation |
| Error Rates | Timeout rate | < 0.001% | 0.01% | Resource exhaustion |
| Resources | CPU utilization | < 70% | 85% | Capacity exhaustion |
| Resources | Disk I/O wait | < 10% | 25% | Storage bottleneck |
| Resources | Network throughput | < 70% capacity | 85% capacity | Network saturation |
| Replication | Replica lag | < 1s | 10s | Consistency risk |
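These thresholds are most useful when encoded in tooling rather than applied by eyeball, so that alerting (and any automatic pause) behaves the same way on every migration. The sketch below is one illustrative way to express them; the metric names and the get_metric helper are assumptions standing in for a real monitoring client, not an existing API.

```python
# Illustrative alert evaluation for the thresholds in the table above.
# get_metric() is a stand-in for a real metrics client (Prometheus, Datadog, ...).
from dataclasses import dataclass

@dataclass
class Threshold:
    metric: str         # name assumed to exist in the metrics backend
    alert_above: float  # fire when the current value exceeds this
    reason: str

THRESHOLDS = [
    Threshold("read_p99_ms",        50.0,  "user experience degradation"),
    Threshold("write_p99_ms",       100.0, "user experience degradation"),
    Threshold("failed_request_pct", 0.1,   "service degradation"),
    Threshold("timeout_rate_pct",   0.01,  "resource exhaustion"),
    Threshold("cpu_util_pct",       85.0,  "capacity exhaustion"),
    Threshold("disk_io_wait_pct",   25.0,  "storage bottleneck"),
    Threshold("net_util_pct",       85.0,  "network saturation"),
    Threshold("replica_lag_s",      10.0,  "consistency risk"),
]

def get_metric(name: str) -> float:
    """Placeholder: query the metrics backend for the current value."""
    return 0.0  # replace with a real query

def evaluate_alerts() -> list[str]:
    """Return a human-readable alert for every threshold currently breached."""
    alerts = []
    for t in THRESHOLDS:
        value = get_metric(t.metric)
        if value > t.alert_above:
            alerts.append(f"{t.metric}={value} exceeds {t.alert_above} ({t.reason})")
    return alerts
```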
Dashboard Design:
A well-designed resharding dashboard provides a single view of migration health:
```
┌────────────────────────────────────────────────────────────┐
│ RESHARDING DASHBOARD - Migration ID: mig-2024-01-15-001    │
├────────────────────────────────────────────────────────────┤
│                                                            │
│ STATUS: ■ RUNNING          PROGRESS: ████████████░░░  78%  │
│ Started: 02:15 UTC         ETA: 04:45 UTC                  │
│                                                            │
├────────────────────────────────────────────────────────────┤
│ HEALTH INDICATORS                                          │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐        │
│ │ Latency  │ │ Errors   │ │ CPU      │ │ Network  │        │
│ │  ● OK    │ │  ● OK    │ │  ● WARN  │ │  ● OK    │        │
│ │  12ms    │ │  0.003%  │ │  78%     │ │  45%     │        │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘        │
│                                                            │
├────────────────────────────────────────────────────────────┤
│ MIGRATION RATE             LATENCY TREND (last 30 min)     │
│ ▁▂▃▄▅▆▇█▇▆▆▆▆▆             p99: ─────────▁▂▁───────        │
│ Target: 100MB/s            p50: ─────────────────────      │
│ Actual: 95MB/s                                             │
│                                                            │
└────────────────────────────────────────────────────────────┘
```
Logging Strategy:
Structured logging enables post-hoc analysis:
```json
{
  "timestamp": "2024-01-15T03:45:22.456Z",
  "level": "INFO",
  "migration_id": "mig-2024-01-15-001",
  "event": "chunk_migrated",
  "source_shard": "shard-1",
  "dest_shard": "shard-5",
  "key_range_start": "user:100000",
  "key_range_end": "user:100999",
  "records_count": 1000,
  "bytes_migrated": 524288,
  "duration_ms": 1250,
  "checksum": "a1b2c3d4"
}
```
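One simple way to produce events like this is to serialize one JSON object per migrated chunk through the standard logging machinery. The sketch below mirrors the field names from the example above; it is illustrative and not tied to any particular migration tool.

```python
# Minimal structured-logging sketch: one JSON object per migrated chunk,
# mirroring the example event above.
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("migration")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_chunk_migrated(migration_id, source_shard, dest_shard, key_range,
                       records_count, bytes_migrated, duration_ms, checksum):
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": "INFO",
        "migration_id": migration_id,
        "event": "chunk_migrated",
        "source_shard": source_shard,
        "dest_shard": dest_shard,
        "key_range_start": key_range[0],
        "key_range_end": key_range[1],
        "records_count": records_count,
        "bytes_migrated": bytes_migrated,
        "duration_ms": duration_ms,
        "checksum": checksum,
    }
    logger.info(json.dumps(event))
```

Because each line is a single JSON object, post-hoc analysis (for example, summing bytes_migrated per minute to reconstruct the effective migration rate) becomes a straightforward log-processing task.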
Don't wait until migration day to build dashboards. Create and test monitoring dashboards in advance. Run test migrations to ensure dashboards display useful information. During a real migration is not the time to discover that a critical metric isn't being collected.
Every resharding operation needs a rollback plan. The ability to abort and return to a known-good state is essential for maintaining system stability.
Rollback Phases and Strategies:
Rollback complexity varies by how far the migration has progressed:
| Phase | Rollback Difficulty | Strategy | Data at Risk |
|---|---|---|---|
| Preparation | Trivial | Delete destination setup, restore config | None |
| Early Migration (< 10%) | Easy | Stop migration, delete dest data, restore routing | None - source is authoritative |
| Mid Migration (10-90%) | Medium | Stop migration, reconcile if dual-write, restore routing | Low if dual-write enabled |
| Late Migration (> 90%) | Hard | May need to remigrate to source, complex coordination | Medium - careful handling required |
| Post-Cutover | Very Hard | Reverse migration (dest→source), significant effort | High if writes accepted on dest |
| Cleanup Complete | Impossible | Source data deleted, cannot rollback | Complete data on dest only |
The Rollback Decision Framework:
```
                    Detect Anomaly
                          │
                          ▼
              ┌────────────────────────┐
              │   Is this expected     │
              │  migration behavior?   │
              └───────────┬────────────┘
                  Yes     │     No
            ┌─────────────┴─────────────┐
            │                           │
            ▼                           ▼
      Continue with                Is service
       monitoring                  impacted?
                               No       │      Yes
                          ┌─────────────┴─────────────┐
                          │                           │
                          ▼                           ▼
                   Pause migration                Immediate
                     Investigate                  rollback
                          │
                          ▼
              ┌────────────────────────┐
              │ Can issue be resolved  │
              │  without rollback?     │
              └───────────┬────────────┘
                  Yes     │     No
            ┌─────────────┴─────────────┐
            │                           │
            ▼                           ▼
         Fix and                    Execute
         resume                     rollback
```
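Some teams also encode this decision tree in tooling so that on-call engineers under pressure get a consistent recommendation. The sketch below is a direct, illustrative translation of the tree above; the boolean inputs are assumptions about questions your monitoring and operators can answer, and the output supplements rather than replaces operator judgment.

```python
# Illustrative encoding of the rollback decision tree above.
from enum import Enum
from typing import Optional

class Action(Enum):
    CONTINUE = "continue with monitoring"
    PAUSE_AND_INVESTIGATE = "pause migration and investigate"
    FIX_AND_RESUME = "fix the issue and resume"
    ROLLBACK = "execute rollback"

def recommend_action(expected_behavior: bool,
                     service_impacted: bool,
                     resolvable_without_rollback: Optional[bool] = None) -> Action:
    if expected_behavior:
        return Action.CONTINUE
    if service_impacted:
        return Action.ROLLBACK
    # Unexpected behavior but the service is healthy: pause first, then decide.
    if resolvable_without_rollback is None:
        return Action.PAUSE_AND_INVESTIGATE
    return Action.FIX_AND_RESUME if resolvable_without_rollback else Action.ROLLBACK
```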
Rollback Procedure Example:
## Emergency Rollback Procedure
### Trigger Criteria (ANY of the following)
- Error rate exceeds 1% for > 2 minutes
- p99 latency exceeds 200ms for > 5 minutes
- Any data inconsistency detected
- Operator judgment that service is at risk
### Immediate Actions (< 1 minute)
1. PAUSE migration immediately:
   ```bash
   $ migctl emergency-stop --migration-id=[MIGRATION_ID]
   ```
2. Verify pause successful:
   ```bash
   $ migctl status --migration-id=[MIGRATION_ID]
   ```
   Expected: State=STOPPED
3. Restore routing to source shard:
   ```bash
   $ routectl restore-pre-migration --migration-id=[MIGRATION_ID]
   ```
4. Verify traffic flowing to source
5. Verify service health restored
6. Alert incident channel: "@here Migration [ID] rolled back at [TIME]. Service stable. Investigation ongoing."
7. Collect diagnostic information
8. Document in incident ticket
Identify and document the 'point of no return' in your migration plan—the phase after which rollback is no longer simple. Before crossing this threshold, perform extra verification. Most teams wait 24-48 hours in the 'ready to commit' phase before final cutover.
Resharding affects multiple teams and requires careful coordination. Poor communication is a leading cause of resharding-related incidents.
Stakeholder Communication Plan:
| Stakeholder | What They Need | When to Notify | Channel |
|---|---|---|---|
| Database Team | Full technical details, runbooks | 1 week before, day of | Direct channel + ticket |
| On-Call Engineers | Impact summary, escalation paths | 24 hours before | On-call handoff + runbook |
| Application Teams | Potential latency impact, timeline | 1 week before | Team email + Slack |
| SRE/Platform | Resource impact, monitoring alerts | 1 week before | SRE channel |
| Product/Business | User impact (if any), timing | As needed for user-facing impact | Status page if external |
| Leadership | Risk summary, go/no-go criteria | For major resharding only | Email summary |
War Room Protocol:
For high-risk resharding, establish a war room (physical or virtual):
Before Migration:
During Migration:
Communication Templates:
### Migration Start Announcement
🚀 [Migration ID] STARTED
- Target: Migrate shard-1 data to shard-5 and shard-6
- Expected Duration: 4-6 hours
- Lead: @alice
- War Room: #database-migration-2024-01
Monitoring: <dashboard link>
Runbook: <runbook link>
### Periodic Status Update
📊 [Migration ID] STATUS - T+2h
- Progress: 45% complete
- Health: All systems normal
- ETA: 2-3 more hours
- Next update: T+3h or if status changes
### Completion Announcement
✅ [Migration ID] COMPLETE
- Duration: 5h 23m
- Status: Successful, no incidents
- Verification: All health checks passing
- Next: 24h bake period, then source cleanup
During a long migration, teams watching the war room worry when updates stop. Even if nothing is happening, post periodic 'all clear' messages. Absence of updates is often interpreted as 'something is wrong and they're too busy to communicate.'
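Posting these periodic updates can itself be scripted so the cadence survives a busy war room. The sketch below assumes a generic incoming-webhook style chat integration; the WEBHOOK_URL and payload shape are placeholders, not any specific platform's API.

```python
# Hypothetical periodic status poster. WEBHOOK_URL and the payload shape are
# placeholders for whatever chat integration your team actually uses.
import json
import urllib.request

WEBHOOK_URL = "https://chat.example.com/hooks/REPLACE_ME"  # placeholder

def post_status(migration_id: str, hours_elapsed: int, progress_pct: int,
                health: str, eta: str) -> None:
    text = (
        f"📊 [{migration_id}] STATUS - T+{hours_elapsed}h\n"
        f"- Progress: {progress_pct}% complete\n"
        f"- Health: {health}\n"
        f"- ETA: {eta}\n"
        f"- Next update: T+{hours_elapsed + 1}h or if status changes"
    )
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)  # add retries/error handling in practice
```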
Choosing when to execute resharding significantly impacts risk and recovery options. Strategic timing can mean the difference between a smooth operation and an incident.
Timing Considerations:
The Maintenance Window vs. Online Tradeoff:
| Approach | User Impact | Risk Level | Rollback Time | When to Use |
|---|---|---|---|---|
| Full maintenance window | High (complete outage) | Low | Fast | Very high-risk migrations |
| Degraded service window | Medium (slower response) | Medium | Medium | Moderate risk, latency-sensitive |
| Background online | Low (minimal impact) | Higher | Can be slow | Most migrations |
| Zero-impact online | None (fully transparent) | Highest | Slowest | Low-risk, well-tested systems |
Timezone Considerations:
For globally distributed teams and systems:
- US users: lowest traffic 02:00-06:00 Pacific (10:00-14:00 UTC)
- EU users: lowest traffic 02:00-06:00 CET (01:00-05:00 UTC)
- APAC users: lowest traffic 02:00-06:00 JST (17:00-21:00 UTC)

There is no single window that is quiet everywhere. Either migrate region by region during each region's low-traffic period, or accept that one region will see degraded performance during the migration.
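If you adopt the region-by-region strategy, the low-traffic windows above can be captured as data and used to schedule each region's migration. The sketch below is a minimal illustration; the windows are the approximate UTC times listed above and deliberately ignore daylight-saving shifts and your real traffic curves.

```python
# Sketch: pick the next low-traffic window per region, using the UTC windows above.
from datetime import datetime, time, timedelta, timezone
from typing import Optional

LOW_TRAFFIC_UTC = {          # region -> window start/end in UTC (illustrative)
    "us":   (time(10, 0), time(14, 0)),
    "eu":   (time(1, 0),  time(5, 0)),
    "apac": (time(17, 0), time(21, 0)),
}

def next_window(region: str, now: Optional[datetime] = None) -> datetime:
    """Return the next UTC datetime at which the region's low-traffic window opens."""
    now = now or datetime.now(timezone.utc)
    start, _end = LOW_TRAFFIC_UTC[region]
    candidate = now.replace(hour=start.hour, minute=start.minute,
                            second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate

# Region-by-region plan: migrate each region's shards during its own window.
for region in ("us", "eu", "apac"):
    print(region, "->", next_window(region).isoformat())
```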
Duration Planning:
The 'Never deploy on Friday' rule applies doubly to resharding. If something goes wrong, you want weekday recovery with full staffing, not weekend heroics. Aim to complete resharding by Wednesday to allow for monitoring before the weekend.
The migration isn't complete when data transfer finishes. Comprehensive verification is essential before declaring success and cleaning up source data.
Verification Checklist:
Automated Verification Script Example:
```python
def verify_migration(migration_id):
    """Post-migration checks. Helper clients (source_db, dest_db) and the
    metric lookups are assumed to be provided by the surrounding tooling."""
    config = get_migration_config(migration_id)
    results = MigrationVerificationResults()

    # Row count check
    source_count = source_db.count(config.table)
    dest_count = dest_db.count(config.table)
    results.add_check(
        "row_count",
        passed=(source_count == dest_count),
        details=f"Source: {source_count}, Dest: {dest_count}"
    )

    # Checksum check: full table for small tables, sampled otherwise
    if source_count < 1_000_000:
        source_checksum = source_db.table_checksum(config.table)
        dest_checksum = dest_db.table_checksum(config.table)
    else:
        # Sample 0.1% of rows
        sample_keys = source_db.sample_keys(config.table, fraction=0.001)
        source_checksum = source_db.rows_checksum(config.table, sample_keys)
        dest_checksum = dest_db.rows_checksum(config.table, sample_keys)
    results.add_check(
        "checksum",
        passed=(source_checksum == dest_checksum),
        details=f"Source: {source_checksum[:16]}..., Dest: {dest_checksum[:16]}..."
    )

    # Performance check: current p99 vs. 7-day historical baseline
    baseline_latency = get_historical_p99(config.table, period="7d")
    current_latency = get_current_p99(config.table)
    latency_increase = (current_latency - baseline_latency) / baseline_latency
    results.add_check(
        "performance",
        passed=(latency_increase < 0.1),  # Less than 10% increase
        details=f"Baseline: {baseline_latency}ms, Current: {current_latency}ms "
                f"(+{latency_increase*100:.1f}%)"
    )

    return results


# Run verification
results = verify_migration("mig-2024-01-15-001")
if results.all_passed():
    print("✅ Migration verification PASSED")
else:
    print("❌ Migration verification FAILED")
    for failure in results.failures():
        print(f"  - {failure.name}: {failure.details}")
```
Bake Period:
After verification passes, maintain a bake period before final cleanup:
The rush to declare victory and clean up is dangerous. Corrupted or missing data may not be immediately visible. Take the time to verify thoroughly. The cost of verification is tiny compared to the cost of discovering data loss after source cleanup.
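One way to make the bake period hard to skip is a cleanup gate that refuses to delete source data until enough time has passed with no alerts. The sketch below assumes a 24-hour bake period and a hypothetical alerts_since helper; both are placeholders for your own policy and alerting system.

```python
# Hypothetical cleanup gate: source data may only be deleted after a bake period
# with zero alerts. The 24h figure and alerts_since() are assumptions.
from datetime import datetime, timedelta, timezone

BAKE_PERIOD = timedelta(hours=24)

def alerts_since(when: datetime) -> int:
    """Placeholder: count alerts raised since `when` (query your alerting system)."""
    return 0

def cleanup_allowed(verification_passed_at: datetime) -> bool:
    """verification_passed_at must be a timezone-aware UTC timestamp."""
    now = datetime.now(timezone.utc)
    baked_long_enough = now - verification_passed_at >= BAKE_PERIOD
    no_alerts = alerts_since(verification_passed_at) == 0
    return baked_long_enough and no_alerts
```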
Real-world resharding operations generate valuable lessons. Here are patterns observed across many production resharding projects:
Common Failure Patterns:
| Pattern | What Went Wrong | How to Prevent |
|---|---|---|
| The Slow Burn | Migration took 3x longer than expected, ran into peak hours | Conservative estimates, automatic pause at time boundaries |
| The Cascade | One shard overloaded, caused others to fail | Circuit breakers, per-shard health checks, automatic throttling |
| The Ghost Data | Old routing cached in clients caused stale reads for hours | Client library migration support, forced cache invalidation |
| The Friday Surprise | Issue discovered Friday 5pm, weekend recovery | Never Friday, never before holidays |
| The Silent Failure | Migration 'completed' but 0.1% of data was corrupted | Comprehensive verification, sampling-based integrity checks |
| The Retry Storm | Throttled migration caused timeouts, client retries overwhelmed system | Back-pressure mechanisms, client-side circuit breakers |
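As an illustration of the back-pressure idea behind preventing 'The Retry Storm', the sketch below shows a minimal client-side circuit breaker: after repeated failures the client stops sending requests for a cool-down period instead of retrying immediately. The thresholds are arbitrary placeholders, not recommendations.

```python
# Minimal client-side circuit breaker sketch (back-pressure against retry storms).
import time
from typing import Optional

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at = None   # half-open: allow a trial request
            self.failures = 0
            return True
        return False                # open: shed load instead of retrying

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```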
Organizational Anti-Patterns:
"We'll figure it out live" — Starting migration without detailed runbooks
"The expert will handle it" — Single person knows the system
"It worked in test" — Insufficient production-like testing
"Just push through" — Continuing migration despite warning signs
"We don't have time" — Rushing migration due to deadline pressure
What Successful Teams Do:
After every resharding (successful or not), conduct a retrospective. What worked well? What surprised us? What would we do differently? Feed learnings back into runbooks and tooling. Mature teams have resharding that's almost boring—because they've solved all the interesting problems already.
Successful resharding requires operational discipline as much as technical correctness. The key insights from this page:
Module Conclusion: Rebalancing and Resharding
Across this module, we've journeyed from understanding when rebalancing is needed, through strategies for minimal disruption, the algorithmic foundations of consistent hashing, the deep technical challenges of online resharding, and finally the practical considerations for production execution.
Rebalancing and resharding represent the intersection of distributed systems theory, software engineering, and operational excellence. Mastering these skills enables you to build and maintain systems that scale gracefully, adapt to changing requirements, and remain reliable even as they grow.
The Path Forward:
With completion of this module on rebalancing and resharding, you've covered the major aspects of database replication and partitioning. These concepts form the foundation for building globally distributed, highly available data systems that power modern internet-scale applications.
Congratulations! You've completed the Rebalancing and Resharding module. You now possess the knowledge to plan, execute, and troubleshoot some of the most challenging operations in distributed database management.