Understanding the algorithms and challenges of rebalancing is necessary but not sufficient for success. The difference between a smooth resharding operation and an incident often comes down to operational discipline: the runbooks, monitoring, coordination, and decision-making frameworks that guide execution.
Production resharding brings together technical complexity with organizational reality. Multiple teams may be affected. On-call engineers need to know when to intervene. Stakeholders need clear communication about potential impacts. This page bridges the gap between theoretical understanding and operational excellence.
By the end of this page, you will understand how to build operational runbooks for resharding, design monitoring dashboards that surface problems early, plan and execute rollbacks safely, coordinate across teams and time zones, and apply real-world lessons learned from production resharding operations.
A runbook is a documented procedure that guides operators through a complex operation. For resharding, runbooks transform ad-hoc decisions into repeatable, tested processes.
Essential Runbook Components:
Example Runbook Section: Starting Migration
## Phase 2: Start Migration
### Prerequisites
- [ ] Phase 1 (Preparation) completed successfully
- [ ] All health checks passing (dashboard link: <url>)
- [ ] On-call engineer aware migration is starting
- [ ] Traffic below 70% of peak (check: <metrics link>)
### Procedure
1. Open terminal session to migration controller (host: migration-ctrl-1)
2. Verify no active migrations:
   ```bash
   $ migctl status --all
   ```
   Expected: "No active migrations"
3. Start migration with throttled rate:
   ```bash
   $ migctl start --migration-id=[MIGRATION_ID] --rate-limit=100MB/s
   ```
4. Verify migration started:
   ```bash
   $ migctl status --migration-id=[MIGRATION_ID]
   ```
   Expected: State=RUNNING, Progress=0%, Rate=~100MB/s
5. Monitor for 15 minutes before proceeding. If anything looks abnormal, pause the migration:
   ```bash
   $ migctl pause --migration-id=[MIGRATION_ID]
   ```

A runbook that hasn't been executed is just a hypothesis. Run through your resharding runbook in a test environment at least quarterly. Every production execution should result in runbook updates based on lessons learned.
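The prerequisite checklist above is also a good candidate for automation, so that a rushed operator cannot skip a step. Below is one possible pre-flight gate, sketched in Python; the three helper functions are hypothetical placeholders for whatever metrics and control-plane clients your environment actually provides.

```python
# Hypothetical pre-flight gate for the runbook prerequisites above.
# The helper functions are placeholders; wire them to your own systems.

def current_traffic_fraction() -> float:
    """Current traffic as a fraction of recent peak (placeholder value)."""
    return 0.55  # replace with a real metrics query

def active_migration_count() -> int:
    """Number of currently active migrations (placeholder value)."""
    return 0     # replace with a call to the migration controller

def health_checks_passing() -> bool:
    """Whether all health checks pass (placeholder value)."""
    return True  # replace with a call to the health-check endpoint

def preflight(max_traffic_fraction: float = 0.70) -> list[str]:
    """Return the list of failed prerequisites; an empty list means clear to start."""
    failures = []
    if active_migration_count() != 0:
        failures.append("another migration is already active")
    if not health_checks_passing():
        failures.append("one or more health checks are failing")
    if current_traffic_fraction() >= max_traffic_fraction:
        failures.append(f"traffic is at or above {max_traffic_fraction:.0%} of peak")
    return failures

if __name__ == "__main__":
    problems = preflight()
    if problems:
        raise SystemExit("Preflight FAILED:\n- " + "\n- ".join(problems))
    print("Preflight passed: clear to start the migration")
```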
During resharding, visibility into system behavior becomes critical. You need to detect problems before they become incidents and understand what's happening well enough to make informed decisions.
Key Metrics to Monitor:
| Category | Metric | Normal Range | Alert Threshold | Why It Matters |
|---|---|---|---|---|
| Migration Progress | Keys/bytes migrated | Increasing steadily | Stalled > 5 min | Detect stuck migrations |
| Migration Progress | Estimated time remaining | Decreasing | Increasing trend | Detect slowdown |
| Latency | Read p99 | < 10ms | 50ms | User experience degradation |
| Latency | Write p99 | < 20ms | 100ms | User experience degradation |
| Error Rates | Failed requests % | < 0.01% | 0.1% | Service degradation |
| Error Rates | Timeout rate | < 0.001% | 0.01% | Resource exhaustion |
| Resources | CPU utilization | < 70% | 85% | Capacity exhaustion |
| Resources | Disk I/O wait | < 10% | 25% | Storage bottleneck |
| Resources | Network throughput | < 70% capacity | 85% capacity | Network saturation |
| Replication | Replica lag | < 1s | 10s | Consistency risk |
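These thresholds are most useful when encoded in tooling rather than applied by eyeball, so that alerting (and any automatic pause) behaves the same way on every migration. The sketch below is one illustrative way to express them; the metric names and the get_metric helper are assumptions standing in for a real monitoring client, not an existing API.

```python
# Illustrative alert evaluation for the thresholds in the table above.
# get_metric() is a stand-in for a real metrics client (Prometheus, Datadog, ...).
from dataclasses import dataclass

@dataclass
class Threshold:
    metric: str         # name assumed to exist in the metrics backend
    alert_above: float  # fire when the current value exceeds this
    reason: str

THRESHOLDS = [
    Threshold("read_p99_ms",        50.0,  "user experience degradation"),
    Threshold("write_p99_ms",       100.0, "user experience degradation"),
    Threshold("failed_request_pct", 0.1,   "service degradation"),
    Threshold("timeout_rate_pct",   0.01,  "resource exhaustion"),
    Threshold("cpu_util_pct",       85.0,  "capacity exhaustion"),
    Threshold("disk_io_wait_pct",   25.0,  "storage bottleneck"),
    Threshold("net_util_pct",       85.0,  "network saturation"),
    Threshold("replica_lag_s",      10.0,  "consistency risk"),
]

def get_metric(name: str) -> float:
    """Placeholder: query the metrics backend for the current value."""
    return 0.0  # replace with a real query

def evaluate_alerts() -> list[str]:
    """Return a human-readable alert for every threshold currently breached."""
    alerts = []
    for t in THRESHOLDS:
        value = get_metric(t.metric)
        if value > t.alert_above:
            alerts.append(f"{t.metric}={value} exceeds {t.alert_above} ({t.reason})")
    return alerts
```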
Dashboard Design:
A well-designed resharding dashboard provides a single view of migration health:
```
┌────────────────────────────────────────────────────────────┐
│ RESHARDING DASHBOARD - Migration ID: mig-2024-01-15-001    │
├────────────────────────────────────────────────────────────┤
│                                                            │
│ STATUS: ■ RUNNING          PROGRESS: ████████████░░░  78%  │
│ Started: 02:15 UTC         ETA: 04:45 UTC                  │
│                                                            │
├────────────────────────────────────────────────────────────┤
│ HEALTH INDICATORS                                          │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐        │
│ │ Latency  │ │ Errors   │ │ CPU      │ │ Network  │        │
│ │  ● OK    │ │  ● OK    │ │  ● WARN  │ │  ● OK    │        │
│ │  12ms    │ │  0.003%  │ │  78%     │ │  45%     │        │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘        │
│                                                            │
├────────────────────────────────────────────────────────────┤
│ MIGRATION RATE             LATENCY TREND (last 30 min)     │
│ ▁▂▃▄▅▆▇█▇▆▆▆▆▆             p99: ─────────▁▂▁───────        │
│ Target: 100MB/s            p50: ─────────────────────      │
│ Actual: 95MB/s                                             │
│                                                            │
└────────────────────────────────────────────────────────────┘
```
Logging Strategy:
Structured logging enables post-hoc analysis:
```json
{
  "timestamp": "2024-01-15T03:45:22.456Z",
  "level": "INFO",
  "migration_id": "mig-2024-01-15-001",
  "event": "chunk_migrated",
  "source_shard": "shard-1",
  "dest_shard": "shard-5",
  "key_range_start": "user:100000",
  "key_range_end": "user:100999",
  "records_count": 1000,
  "bytes_migrated": 524288,
  "duration_ms": 1250,
  "checksum": "a1b2c3d4"
}
```
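One simple way to produce events like this is to serialize one JSON object per migrated chunk through the standard logging machinery. The sketch below mirrors the field names from the example above; it is illustrative and not tied to any particular migration tool.

```python
# Minimal structured-logging sketch: one JSON object per migrated chunk,
# mirroring the example event above.
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("migration")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_chunk_migrated(migration_id, source_shard, dest_shard, key_range,
                       records_count, bytes_migrated, duration_ms, checksum):
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": "INFO",
        "migration_id": migration_id,
        "event": "chunk_migrated",
        "source_shard": source_shard,
        "dest_shard": dest_shard,
        "key_range_start": key_range[0],
        "key_range_end": key_range[1],
        "records_count": records_count,
        "bytes_migrated": bytes_migrated,
        "duration_ms": duration_ms,
        "checksum": checksum,
    }
    logger.info(json.dumps(event))
```

Because each line is a single JSON object, post-hoc analysis (for example, summing bytes_migrated per minute to reconstruct the effective migration rate) becomes a straightforward log-processing task.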
Don't wait until migration day to build dashboards. Create and test monitoring dashboards in advance. Run test migrations to ensure dashboards display useful information. During a real migration is not the time to discover that a critical metric isn't being collected.
Every resharding operation needs a rollback plan. The ability to abort and return to a known-good state is essential for maintaining system stability.
Rollback Phases and Strategies:
Rollback complexity varies by how far the migration has progressed:
| Phase | Rollback Difficulty | Strategy | Data at Risk |
|---|---|---|---|
| Preparation | Trivial | Delete destination setup, restore config | None |
| Early Migration (< 10%) | Easy | Stop migration, delete dest data, restore routing | None - source is authoritative |
| Mid Migration (10-90%) | Medium | Stop migration, reconcile if dual-write, restore routing | Low if dual-write enabled |
| Late Migration (> 90%) | Hard | May need to remigrate to source, complex coordination | Medium - careful handling required |
| Post-Cutover | Very Hard | Reverse migration (dest→source), significant effort | High if writes accepted on dest |
| Cleanup Complete | Impossible | Source data deleted, cannot rollback | Complete data on dest only |
The Rollback Decision Framework:
```
                    Detect Anomaly
                          │
                          ▼
              ┌────────────────────────┐
              │   Is this expected     │
              │  migration behavior?   │
              └───────────┬────────────┘
                  Yes     │     No
            ┌─────────────┴─────────────┐
            │                           │
            ▼                           ▼
      Continue with                Is service
       monitoring                  impacted?
                               No       │      Yes
                          ┌─────────────┴─────────────┐
                          │                           │
                          ▼                           ▼
                   Pause migration                Immediate
                     Investigate                  rollback
                          │
                          ▼
              ┌────────────────────────┐
              │ Can issue be resolved  │
              │  without rollback?     │
              └───────────┬────────────┘
                  Yes     │     No
            ┌─────────────┴─────────────┐
            │                           │
            ▼                           ▼
         Fix and                    Execute
         resume                     rollback
```
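Some teams also encode this decision tree in tooling so that on-call engineers under pressure get a consistent recommendation. The sketch below is a direct, illustrative translation of the tree above; the boolean inputs are assumptions about questions your monitoring and operators can answer, and the output supplements rather than replaces operator judgment.

```python
# Illustrative encoding of the rollback decision tree above.
from enum import Enum
from typing import Optional

class Action(Enum):
    CONTINUE = "continue with monitoring"
    PAUSE_AND_INVESTIGATE = "pause migration and investigate"
    FIX_AND_RESUME = "fix the issue and resume"
    ROLLBACK = "execute rollback"

def recommend_action(expected_behavior: bool,
                     service_impacted: bool,
                     resolvable_without_rollback: Optional[bool] = None) -> Action:
    if expected_behavior:
        return Action.CONTINUE
    if service_impacted:
        return Action.ROLLBACK
    # Unexpected behavior but the service is healthy: pause first, then decide.
    if resolvable_without_rollback is None:
        return Action.PAUSE_AND_INVESTIGATE
    return Action.FIX_AND_RESUME if resolvable_without_rollback else Action.ROLLBACK
```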
Rollback Procedure Example:
## Emergency Rollback Procedure
### Trigger Criteria (ANY of the following)
- Error rate exceeds 1% for > 2 minutes
- p99 latency exceeds 200ms for > 5 minutes
- Any data inconsistency detected
- Operator judgment that service is at risk
### Immediate Actions (< 1 minute)
1. PAUSE migration immediately:
   ```bash
   $ migctl emergency-stop --migration-id=[MIGRATION_ID]
   ```
2. Verify pause successful:
   ```bash
   $ migctl status --migration-id=[MIGRATION_ID]
   ```
   Expected: State=STOPPED
3. Restore routing to source shard:
   ```bash
   $ routectl restore-pre-migration --migration-id=[MIGRATION_ID]
   ```
4. Verify traffic flowing to source
5. Verify service health restored
6. Alert incident channel: "@here Migration [ID] rolled back at [TIME]. Service stable. Investigation ongoing."
7. Collect diagnostic information
8. Document in incident ticket
Identify and document the 'point of no return' in your migration plan—the phase after which rollback is no longer simple. Before crossing this threshold, perform extra verification. Most teams wait 24-48 hours in the 'ready to commit' phase before final cutover.
Resharding affects multiple teams and requires careful coordination. Poor communication is a leading cause of resharding-related incidents.
Stakeholder Communication Plan:
| Stakeholder | What They Need | When to Notify | Channel |
|---|---|---|---|
| Database Team | Full technical details, runbooks | 1 week before, day of | Direct channel + ticket |
| On-Call Engineers | Impact summary, escalation paths | 24 hours before | On-call handoff + runbook |
| Application Teams | Potential latency impact, timeline | 1 week before | Team email + Slack |
| SRE/Platform | Resource impact, monitoring alerts | 1 week before | SRE channel |
| Product/Business | User impact (if any), timing | As needed for user-facing impact | Status page if external |
| Leadership | Risk summary, go/no-go criteria | For major resharding only | Email summary |
War Room Protocol:
For high-risk resharding, establish a war room (physical or virtual):
Before Migration:
During Migration:
Communication Templates:
### Migration Start Announcement
🚀 [Migration ID] STARTED
- Target: Migrate shard-1 data to shard-5 and shard-6
- Expected Duration: 4-6 hours
- Lead: @alice
- War Room: #database-migration-2024-01
Monitoring: <dashboard link>
Runbook: <runbook link>
### Periodic Status Update
📊 [Migration ID] STATUS - T+2h
- Progress: 45% complete
- Health: All systems normal
- ETA: 2-3 more hours
- Next update: T+3h or if status changes
### Completion Announcement
✅ [Migration ID] COMPLETE
- Duration: 5h 23m
- Status: Successful, no incidents
- Verification: All health checks passing
- Next: 24h bake period, then source cleanup
During a long migration, teams watching the war room worry when updates stop. Even if nothing is happening, post periodic 'all clear' messages. Absence of updates is often interpreted as 'something is wrong and they're too busy to communicate.'
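Posting these periodic updates can itself be scripted so the cadence survives a busy war room. The sketch below assumes a generic incoming-webhook style chat integration; the WEBHOOK_URL and payload shape are placeholders, not any specific platform's API.

```python
# Hypothetical periodic status poster. WEBHOOK_URL and the payload shape are
# placeholders for whatever chat integration your team actually uses.
import json
import urllib.request

WEBHOOK_URL = "https://chat.example.com/hooks/REPLACE_ME"  # placeholder

def post_status(migration_id: str, hours_elapsed: int, progress_pct: int,
                health: str, eta: str) -> None:
    text = (
        f"📊 [{migration_id}] STATUS - T+{hours_elapsed}h\n"
        f"- Progress: {progress_pct}% complete\n"
        f"- Health: {health}\n"
        f"- ETA: {eta}\n"
        f"- Next update: T+{hours_elapsed + 1}h or if status changes"
    )
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)  # add retries/error handling in practice
```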
Choosing when to execute resharding significantly impacts risk and recovery options. Strategic timing can mean the difference between a smooth operation and an incident.
Timing Considerations:
The Maintenance Window vs. Online Tradeoff:
| Approach | User Impact | Risk Level | Rollback Time | When to Use |
|---|---|---|---|---|
| Full maintenance window | High (complete outage) | Low | Fast | Very high-risk migrations |
| Degraded service window | Medium (slower response) | Medium | Medium | Moderate risk, latency-sensitive |
| Background online | Low (minimal impact) | Higher | Can be slow | Most migrations |
| Zero-impact online | None (fully transparent) | Highest | Slowest | Low-risk, well-tested systems |
Timezone Considerations:
For globally distributed teams and systems:
- US users: lowest traffic 02:00-06:00 Pacific (10:00-14:00 UTC)
- EU users: lowest traffic 02:00-06:00 CET (01:00-05:00 UTC)
- APAC users: lowest traffic 02:00-06:00 JST (17:00-21:00 UTC)

There is no single window that is quiet everywhere. Either migrate region by region during each region's low-traffic period, or accept that one region will see degraded performance during the migration.
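If you adopt the region-by-region strategy, the low-traffic windows above can be captured as data and used to schedule each region's migration. The sketch below is a minimal illustration; the windows are the approximate UTC times listed above and deliberately ignore daylight-saving shifts and your real traffic curves.

```python
# Sketch: pick the next low-traffic window per region, using the UTC windows above.
from datetime import datetime, time, timedelta, timezone
from typing import Optional

LOW_TRAFFIC_UTC = {          # region -> window start/end in UTC (illustrative)
    "us":   (time(10, 0), time(14, 0)),
    "eu":   (time(1, 0),  time(5, 0)),
    "apac": (time(17, 0), time(21, 0)),
}

def next_window(region: str, now: Optional[datetime] = None) -> datetime:
    """Return the next UTC datetime at which the region's low-traffic window opens."""
    now = now or datetime.now(timezone.utc)
    start, _end = LOW_TRAFFIC_UTC[region]
    candidate = now.replace(hour=start.hour, minute=start.minute,
                            second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate

# Region-by-region plan: migrate each region's shards during its own window.
for region in ("us", "eu", "apac"):
    print(region, "->", next_window(region).isoformat())
```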
Duration Planning:
The 'Never deploy on Friday' rule applies doubly to resharding. If something goes wrong, you want weekday recovery with full staffing, not weekend heroics. Aim to complete resharding by Wednesday to allow for monitoring before the weekend.
The migration isn't complete when data transfer finishes. Comprehensive verification is essential before declaring success and cleaning up source data.
Verification Checklist:
Automated Verification Script Example:
```python
def verify_migration(migration_id):
    """Post-migration checks. Helper clients (source_db, dest_db) and the
    metric lookups are assumed to be provided by the surrounding tooling."""
    config = get_migration_config(migration_id)
    results = MigrationVerificationResults()

    # Row count check
    source_count = source_db.count(config.table)
    dest_count = dest_db.count(config.table)
    results.add_check(
        "row_count",
        passed=(source_count == dest_count),
        details=f"Source: {source_count}, Dest: {dest_count}"
    )

    # Checksum check: full table for small tables, sampled otherwise
    if source_count < 1_000_000:
        source_checksum = source_db.table_checksum(config.table)
        dest_checksum = dest_db.table_checksum(config.table)
    else:
        # Sample 0.1% of rows
        sample_keys = source_db.sample_keys(config.table, fraction=0.001)
        source_checksum = source_db.rows_checksum(config.table, sample_keys)
        dest_checksum = dest_db.rows_checksum(config.table, sample_keys)
    results.add_check(
        "checksum",
        passed=(source_checksum == dest_checksum),
        details=f"Source: {source_checksum[:16]}..., Dest: {dest_checksum[:16]}..."
    )

    # Performance check: current p99 vs. 7-day historical baseline
    baseline_latency = get_historical_p99(config.table, period="7d")
    current_latency = get_current_p99(config.table)
    latency_increase = (current_latency - baseline_latency) / baseline_latency
    results.add_check(
        "performance",
        passed=(latency_increase < 0.1),  # Less than 10% increase
        details=f"Baseline: {baseline_latency}ms, Current: {current_latency}ms "
                f"(+{latency_increase*100:.1f}%)"
    )

    return results


# Run verification
results = verify_migration("mig-2024-01-15-001")
if results.all_passed():
    print("✅ Migration verification PASSED")
else:
    print("❌ Migration verification FAILED")
    for failure in results.failures():
        print(f"  - {failure.name}: {failure.details}")
```
Bake Period:
After verification passes, maintain a bake period before final cleanup:
The rush to declare victory and clean up is dangerous. Corrupted or missing data may not be immediately visible. Take the time to verify thoroughly. The cost of verification is tiny compared to the cost of discovering data loss after source cleanup.
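One way to make the bake period hard to skip is a cleanup gate that refuses to delete source data until enough time has passed with no alerts. The sketch below assumes a 24-hour bake period and a hypothetical alerts_since helper; both are placeholders for your own policy and alerting system.

```python
# Hypothetical cleanup gate: source data may only be deleted after a bake period
# with zero alerts. The 24h figure and alerts_since() are assumptions.
from datetime import datetime, timedelta, timezone

BAKE_PERIOD = timedelta(hours=24)

def alerts_since(when: datetime) -> int:
    """Placeholder: count alerts raised since `when` (query your alerting system)."""
    return 0

def cleanup_allowed(verification_passed_at: datetime) -> bool:
    """verification_passed_at must be a timezone-aware UTC timestamp."""
    now = datetime.now(timezone.utc)
    baked_long_enough = now - verification_passed_at >= BAKE_PERIOD
    no_alerts = alerts_since(verification_passed_at) == 0
    return baked_long_enough and no_alerts
```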
Real-world resharding operations generate valuable lessons. Here are patterns observed across many production resharding projects:
Common Failure Patterns:
| Pattern | What Went Wrong | How to Prevent |
|---|---|---|
| The Slow Burn | Migration took 3x longer than expected, ran into peak hours | Conservative estimates, automatic pause at time boundaries |
| The Cascade | One shard overloaded, caused others to fail | Circuit breakers, per-shard health checks, automatic throttling |
| The Ghost Data | Old routing cached in clients caused stale reads for hours | Client library migration support, forced cache invalidation |
| The Friday Surprise | Issue discovered Friday 5pm, weekend recovery | Never Friday, never before holidays |
| The Silent Failure | Migration 'completed' but 0.1% of data was corrupted | Comprehensive verification, sampling-based integrity checks |
| The Retry Storm | Throttled migration caused timeouts, client retries overwhelmed system | Back-pressure mechanisms, client-side circuit breakers |
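As an illustration of the back-pressure idea behind preventing 'The Retry Storm', the sketch below shows a minimal client-side circuit breaker: after repeated failures the client stops sending requests for a cool-down period instead of retrying immediately. The thresholds are arbitrary placeholders, not recommendations.

```python
# Minimal client-side circuit breaker sketch (back-pressure against retry storms).
import time
from typing import Optional

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at = None   # half-open: allow a trial request
            self.failures = 0
            return True
        return False                # open: shed load instead of retrying

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```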
Organizational Anti-Patterns:
"We'll figure it out live" — Starting migration without detailed runbooks
"The expert will handle it" — Single person knows the system
"It worked in test" — Insufficient production-like testing
"Just push through" — Continuing migration despite warning signs
"We don't have time" — Rushing migration due to deadline pressure
What Successful Teams Do:
After every resharding (successful or not), conduct a retrospective. What worked well? What surprised us? What would we do differently? Feed learnings back into runbooks and tooling. Mature teams have resharding that's almost boring—because they've solved all the interesting problems already.
Successful resharding requires operational discipline as much as technical correctness. The key insights from this page:
Module Conclusion: Rebalancing and Resharding
Across this module, we've journeyed from understanding when rebalancing is needed, through strategies for minimal disruption, the algorithmic foundations of consistent hashing, the deep technical challenges of online resharding, and finally the practical considerations for production execution.
Rebalancing and resharding represent the intersection of distributed systems theory, software engineering, and operational excellence. Mastering these skills enables you to build and maintain systems that scale gracefully, adapt to changing requirements, and remain reliable even as they grow.
The Path Forward:
With completion of this module on rebalancing and resharding, you've covered the major aspects of database replication and partitioning. These concepts form the foundation for building globally distributed, highly available data systems that power modern internet-scale applications.
Congratulations! You've completed the Rebalancing and Resharding module. You now possess the knowledge to plan, execute, and troubleshoot some of the most challenging operations in distributed database management.