You've built the routing façade. You've extracted the functionality. You've validated with shadow traffic. Now comes the moment that keeps engineers awake at night: the cutover—the transition from theory to reality, from the old system serving traffic to the new system taking over.
The cutover is where migrations succeed or fail. A well-executed cutover is invisible to users; a poorly executed one results in outages, data corruption, or frantic rollbacks at 3 AM. The difference lies in strategy, preparation, and the ability to respond when things don't go as planned.
This is not merely a technical challenge. It's an orchestration of code, infrastructure, teams, and timing. The strategies you choose determine your risk exposure, your ability to recover, and ultimately, your confidence in the migration.
By the end of this page, you will understand multiple cutover strategies (from percentage rollout to blue-green to feature flags), how to plan and execute cutovers safely, techniques for handling failures during transition, and how to know when a cutover is truly complete.
Cutover strategies exist on a spectrum from maximally gradual (shifting traffic 1% at a time over weeks) to instant (flipping a switch to move 100% at once). Your position on this spectrum should be informed by risk tolerance, technical constraints, and business requirements.
The Fundamental Tradeoff:
Gradual cutovers reduce risk but extend the period of parallel operation, increasing operational complexity and the chance of drift between systems. Instant cutovers minimize parallel operation but concentrate all risk at a single moment.
| Strategy | Risk Level | Rollback Speed | Parallel Period | Best For |
|---|---|---|---|---|
| Percentage Rollout | Very Low | Instant | Long (weeks) | High-traffic production, risk-averse orgs |
| Canary Release | Low | Instant | Medium (days) | Validating with subset before wider release |
| Blue-Green | Medium | Fast (seconds) | Short (hours) | When full traffic testing is needed |
| Feature Flags | Low to Medium | Instant | Variable | Fine-grained control, targeted rollout |
| Shadow to Live | Medium | Fast | Medium | Complex validation requirements |
| Scheduled Cutover | High | Slow (requires deployment) | Minimal | Batch systems, maintenance windows |
Choosing Your Strategy:
Consider these factors:
Traffic Volume: High-traffic systems need gradual rollout to detect issues before they affect many users.
Error Detectability: Can you detect problems quickly? If yes, faster rollout is acceptable. If problems take hours to surface, go slower.
Data Consistency: If the cutover involves data migration, you may need tighter coordination and shorter parallel periods.
Team Experience: First few cutovers should be maximally gradual. As expertise builds, you can accelerate.
Rollback Complexity: If rollback is complex (requires data reconciliation), favor gradual strategies with easier abort points.
Your first migration should use the most gradual strategy you can tolerate. As you gain confidence and build tooling, accelerate. Many organizations move from weeks-long percentage rollouts to same-day blue-green deployments as their platform matures.
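As a rough illustration of how these factors combine, here is a hypothetical decision helper. The factor names and thresholds are assumptions for the sketch, not a standard formula; the point is the bias toward gradual strategies whenever any risk factor is elevated.

```typescript
type Strategy = 'percentage-rollout' | 'blue-green' | 'scheduled-cutover';

interface CutoverFactors {
  requestsPerSecond: number;      // traffic volume
  detectionTimeMinutes: number;   // how long problems take to surface
  rollbackIsComplex: boolean;     // e.g. requires data reconciliation
  teamHasCutoverExperience: boolean;
}

// Illustrative decision helper: prefer the most gradual strategy
// whenever any risk factor argues for caution.
function recommendStrategy(f: CutoverFactors): Strategy {
  if (f.rollbackIsComplex || !f.teamHasCutoverExperience) {
    return 'percentage-rollout'; // most gradual, easiest abort points
  }
  if (f.requestsPerSecond > 100 || f.detectionTimeMinutes > 30) {
    return 'percentage-rollout'; // high traffic or slow detection: go slow
  }
  if (f.requestsPerSecond > 0) {
    return 'blue-green'; // fast detection, simple rollback, experienced team
  }
  return 'scheduled-cutover'; // batch system with no live traffic
}
```

The specific cutoffs matter less than the shape of the logic: every branch that detects risk pushes toward a more gradual strategy.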
Percentage-based rollout is the safest cutover strategy. You gradually shift traffic from the old implementation to the new, starting with a tiny fraction and increasing as confidence grows.
The Typical Progression:
```
0% → 1% → 5% → 10% → 25% → 50% → 75% → 100%
```

| Traffic Level | 1% | 5% | 10% | 25% | 50% | 75% | 100% |
|---|---|---|---|---|---|---|---|
| Monitor For | 1 hour | 4 hours | 1 day | 2 days | 3 days | 3 days | Done |
At each step, you monitor error rates, latency, and business metrics. Any regression triggers investigation and potential rollback to the previous percentage.
Key Considerations:
Sticky vs. Random Assignment: Should the same user always hit the same backend (sticky), or be assigned randomly on each request? Sticky assignment gives each user a consistent experience, but the realized traffic split lags the configured percentage because existing users keep their prior assignment. Random assignment matches the configured percentage exactly, but a single user may see inconsistent behavior across requests.
What to Measure: Error rates, latency (P50, P95, P99), business metrics (conversion, revenue), resource utilization, and customer support tickets.
Hold Time at Each Step: Don't rush. Problems often emerge hours or days after a change, as edge cases accumulate.
Rollback Threshold: Define objective criteria. Example: "If P99 latency exceeds 500ms or error rate exceeds 0.1%, rollback immediately."
```typescript
interface RolloutConfig {
  featureName: string;
  currentPercentage: number;
  targetPercentage: number;
  stickyBuckets: boolean;
  bucketCount: number;
}

interface RolloutDecision {
  useNewImplementation: boolean;
  bucketId: number;
  reason: 'percentage' | 'override' | 'sticky';
}

class PercentageRollout {
  private config: RolloutConfig;
  private stickyAssignments: Map<string, boolean> = new Map();

  constructor(config: RolloutConfig) {
    this.config = config;
  }

  /**
   * Determine which implementation to use for a request
   */
  decideRouting(
    userId: string,
    requestId: string,
    overrides?: { forceNew?: boolean; forceLegacy?: boolean }
  ): RolloutDecision {
    // Check for explicit overrides first
    if (overrides?.forceNew) {
      return { useNewImplementation: true, bucketId: -1, reason: 'override' };
    }
    if (overrides?.forceLegacy) {
      return { useNewImplementation: false, bucketId: -1, reason: 'override' };
    }

    // For sticky sessions, check prior assignment
    if (this.config.stickyBuckets) {
      const prior = this.stickyAssignments.get(userId);
      if (prior !== undefined) {
        return {
          useNewImplementation: prior,
          bucketId: this.computeBucket(userId),
          reason: 'sticky',
        };
      }
    }

    // Compute bucket for this user
    const bucket = this.computeBucket(userId);

    // Buckets 0 to (percentage * bucketCount / 100) go to the new implementation
    const cutoffBucket = Math.floor(
      (this.config.currentPercentage / 100) * this.config.bucketCount
    );
    const useNew = bucket < cutoffBucket;

    // Cache for sticky sessions
    if (this.config.stickyBuckets) {
      this.stickyAssignments.set(userId, useNew);
    }

    return { useNewImplementation: useNew, bucketId: bucket, reason: 'percentage' };
  }

  /**
   * Compute a stable bucket assignment for a user.
   * Uses a stable hash so the same user always gets the same bucket.
   */
  private computeBucket(userId: string): number {
    // Simple hash function for demonstration.
    // In production, use a proper consistent hashing algorithm.
    let hash = 0;
    for (const char of userId) {
      hash = ((hash << 5) - hash) + char.charCodeAt(0);
      hash = hash & hash; // Convert to 32-bit integer
    }
    return Math.abs(hash) % this.config.bucketCount;
  }

  /**
   * Safely increase the rollout percentage
   */
  async increasePercentage(
    newPercentage: number,
    metricsChecker: () => Promise<boolean>
  ): Promise<{ success: boolean; reason?: string }> {
    // Verify metrics are healthy before updating
    const metricsHealthy = await metricsChecker();
    if (!metricsHealthy) {
      return {
        success: false,
        reason: 'Metrics unhealthy, cannot increase percentage',
      };
    }

    const oldPercentage = this.config.currentPercentage;
    this.config.currentPercentage = newPercentage;

    // When not using sticky buckets, clear any cached assignments so the
    // new percentage applies on the next request. (With sticky buckets,
    // existing users keep their prior assignment; some systems accept this.)
    if (!this.config.stickyBuckets) {
      this.stickyAssignments.clear();
    }

    console.log(`Rollout ${this.config.featureName}: ${oldPercentage}% → ${newPercentage}%`);
    return { success: true };
  }

  /**
   * Emergency rollback to 0%
   */
  rollback(): void {
    console.log(`ROLLBACK: ${this.config.featureName} → 0%`);
    this.config.currentPercentage = 0;
    this.stickyAssignments.clear();
  }
}

// Usage example
async function executeRollout() {
  const rollout = new PercentageRollout({
    featureName: 'user-service-v2',
    currentPercentage: 0,
    targetPercentage: 100,
    stickyBuckets: true,
    bucketCount: 1000, // Allows 0.1% granularity
  });

  const stages = [1, 5, 10, 25, 50, 75, 100];

  for (const percentage of stages) {
    const result = await rollout.increasePercentage(percentage, async () => {
      // Check your monitoring system
      const errorRate = await getErrorRate();
      const p99Latency = await getP99Latency();
      return errorRate < 0.001 && p99Latency < 500;
    });

    if (!result.success) {
      console.error(`Failed to increase to ${percentage}%: ${result.reason}`);
      rollout.rollback();
      break;
    }

    // Wait for metrics to stabilize before the next stage
    await waitForMetricsStabilization(percentage > 25 ? '4h' : '1h');
  }
}

async function getErrorRate(): Promise<number> {
  // Integration with monitoring system
  return 0;
}

async function getP99Latency(): Promise<number> {
  // Integration with monitoring system
  return 0;
}

async function waitForMetricsStabilization(duration: string): Promise<void> {
  // Wait for the specified duration
}
```

Never skip the 1% stage, no matter how confident you are. Production traffic always contains surprises not seen in testing. A 1% rollout exposes you to real traffic while limiting blast radius. Many production issues are discovered at 1% that weren't found in any prior testing.
A canary release routes a small subset of traffic to the new implementation while monitoring for problems. The term comes from the practice of using canaries in coal mines to detect dangerous gases—if the canary dies, evacuate before miners are affected.
How Canary Differs from Percentage Rollout:
While similar, canary releases typically:

- Route traffic to a dedicated canary instance or instance group rather than mixing versions within a shared pool, isolating the new version's behavior.
- Hold at a small, fixed fraction of traffic rather than climbing through many percentage stages.
- Compare the canary's metrics against a baseline control running the stable version at the same time, rather than against historical data alone.
Automated Canary Analysis:
Modern canary systems automate the comparison and decision-making:
Baseline Establishment: Collect metrics from the stable implementation over a period.
Canary Deployment: Deploy new implementation and route canary traffic.
Metric Collection: Gather same metrics from canary for comparison period.
Statistical Analysis: Compare distributions using statistical tests (Mann-Whitney U, Kolmogorov-Smirnov).
Verdict: Pass (promote to larger rollout), Fail (rollback), or Inconclusive (extend test).
Tools like Kayenta (Spinnaker), Flagger (Kubernetes), and Argo Rollouts provide this automation.
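The verdict step above can be sketched in a few lines. Real tools use nonparametric statistical tests as noted; this simplified version compares means and enforces a minimum sample size, and all thresholds are assumptions for illustration:

```typescript
type Verdict = 'pass' | 'fail' | 'inconclusive';

// Simplified canary analysis over a latency metric: 'inconclusive' means
// extend the test, 'fail' means rollback, 'pass' means promote.
function canaryVerdict(
  baselineLatencies: number[],
  canaryLatencies: number[],
  minSamples = 100,
  maxRegression = 0.10 // canary mean may be at most 10% worse than baseline
): Verdict {
  if (baselineLatencies.length < minSamples || canaryLatencies.length < minSamples) {
    return 'inconclusive'; // not enough data to decide either way
  }
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const regression =
    (mean(canaryLatencies) - mean(baselineLatencies)) / mean(baselineLatencies);
  return regression <= maxRegression ? 'pass' : 'fail';
}
```

A mean comparison is fragile against outliers, which is exactly why production systems prefer distribution-level tests such as Mann-Whitney U.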
Always route internal user traffic to canary before any external traffic. Engineers and employees are more likely to notice subtle issues, more tolerant of problems, and easier to communicate with. Never let a bug reach customers that your own team hasn't experienced first.
Blue-green deployment maintains two identical production environments: blue (current production) and green (new version). Traffic is switched entirely from one to the other, with instant rollback by switching back.
The Blue-Green Model:
```
Before:    Users → [Load Balancer] → [Blue: v1.0]
                                     [Green: idle]

Deploy:    Users → [Load Balancer] → [Blue: v1.0]
                                     [Green: v2.0 deploying]

Switch:    Users → [Load Balancer] → [Green: v2.0 serving]
                                     [Blue: standby]

Rollback:  Users → [Load Balancer] → [Blue: v1.0 serving]
                                     [Green: failed v2.0]
```
Blue-Green Advantages:

- Instant rollback: switching back to blue is a single routing change that takes seconds.
- Full-environment validation: green can be smoke-tested and warmed with production configuration before it receives any user traffic.
- No mixed-version period: at any moment, exactly one version serves all traffic, which simplifies debugging.

Blue-Green Challenges:

- Cost: two full production environments run in parallel during the transition.
- Data: the database must remain compatible with both versions simultaneously.
- Concentrated risk: 100% of traffic moves at once, so any undetected problem affects all users immediately.
| Phase | Actions | Verification |
|---|---|---|
| Preparation | Deploy to green, run smoke tests, verify health | All health checks pass, smoke tests green |
| Pre-Switch | Warm up green (caches, connections), final validation | Green handling synthetic traffic successfully |
| Switch | Update load balancer, shift traffic | Traffic flowing to green, blue draining |
| Verification | Monitor metrics, watch for errors, check business metrics | Error rate stable, no metric regression |
| Hold | Wait for stability period (30min-2hrs) | No incidents, metrics remain healthy |
| Completion | Mark blue as previous, available for next deployment | Blue can be updated for next release |
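The switch and rollback steps reduce to a single routing flip, which is why blue-green rollback is so fast. A minimal sketch; `setActiveTarget` is an illustrative stand-in for a real load-balancer API (e.g. updating a target group):

```typescript
type Environment = 'blue' | 'green';

interface LoadBalancer {
  // Stand-in for a real load-balancer update call
  setActiveTarget(env: Environment): void;
}

class BlueGreenController {
  private active: Environment = 'blue';

  constructor(private lb: LoadBalancer) {}

  get activeEnvironment(): Environment {
    return this.active;
  }

  // Atomic switch: all traffic moves to the other environment
  switchTraffic(): Environment {
    this.active = this.active === 'blue' ? 'green' : 'blue';
    this.lb.setActiveTarget(this.active);
    return this.active;
  }

  // Rollback is the same operation in reverse; the previous
  // environment kept running untouched, so nothing must be redeployed.
  rollback(): Environment {
    return this.switchTraffic();
  }
}
```

Because the standby environment keeps running unchanged, rollback is structurally identical to the original switch, and just as fast.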
Blue-green works cleanly when the database schema is unchanged. When schema changes are required, you must ensure both blue and green can work with the database simultaneously (expand-contract pattern). Never deploy a schema change that breaks the blue environment—you'd lose your rollback capability.
Feature flags provide the most granular control over cutovers. The routing decision is made within the application based on flag state, allowing instant changes without deployment.
Feature Flag Capabilities:

- Targeting rules: enable the new path for specific users, organizations, email domains, or countries.
- Percentage rollout with stable per-user bucketing for everyone not matched by a rule.
- Instant runtime changes with no deployment, including an emergency kill switch that disables the flag for everyone.
```typescript
interface FlagConfiguration {
  name: string;
  defaultValue: boolean;
  rules: FlagRule[];
  percentageRollout: number; // 0-100
}

interface FlagRule {
  attribute: string;
  operator: 'equals' | 'contains' | 'startsWith' | 'in';
  value: string | string[];
  result: boolean;
}

interface EvaluationContext {
  userId: string;
  userEmail?: string;
  organizationId?: string;
  country?: string;
  userAgent?: string;
  attributes: Record<string, string>;
}

class FeatureFlagService {
  private flags: Map<string, FlagConfiguration> = new Map();

  /**
   * Evaluate a feature flag for a given context
   */
  evaluate(flagName: string, context: EvaluationContext): boolean {
    const flag = this.flags.get(flagName);
    if (!flag) {
      console.warn(`Unknown flag: ${flagName}`);
      return false;
    }

    // Check explicit rules first (overrides percentage)
    for (const rule of flag.rules) {
      const contextValue = this.getContextValue(context, rule.attribute);
      if (this.evaluateRule(rule, contextValue)) {
        return rule.result;
      }
    }

    // Fall through to percentage rollout
    if (flag.percentageRollout > 0) {
      return this.inPercentage(context.userId, flag.percentageRollout);
    }

    return flag.defaultValue;
  }

  private getContextValue(context: EvaluationContext, attribute: string): string | undefined {
    switch (attribute) {
      case 'userId': return context.userId;
      case 'userEmail': return context.userEmail;
      case 'organizationId': return context.organizationId;
      case 'country': return context.country;
      default: return context.attributes[attribute];
    }
  }

  private evaluateRule(rule: FlagRule, value: string | undefined): boolean {
    if (value === undefined) return false;
    switch (rule.operator) {
      case 'equals': return value === rule.value;
      case 'contains': return typeof rule.value === 'string' && value.includes(rule.value);
      case 'startsWith': return typeof rule.value === 'string' && value.startsWith(rule.value);
      case 'in': return Array.isArray(rule.value) && rule.value.includes(value);
      default: return false;
    }
  }

  private inPercentage(userId: string, percentage: number): boolean {
    // Consistent hash for stable assignment
    let hash = 0;
    for (const char of userId) {
      hash = ((hash << 5) - hash) + char.charCodeAt(0);
      hash = hash & hash;
    }
    const bucket = Math.abs(hash) % 100;
    return bucket < percentage;
  }

  /**
   * Update flag configuration (used by feature flag management system)
   */
  updateFlag(config: FlagConfiguration): void {
    this.flags.set(config.name, config);
    console.log(`Flag updated: ${config.name}`, JSON.stringify(config, null, 2));
  }

  /**
   * Emergency kill switch - disable a flag for everyone
   */
  killFlag(flagName: string): void {
    const flag = this.flags.get(flagName);
    if (flag) {
      flag.rules = [];
      flag.percentageRollout = 0;
      flag.defaultValue = false;
      console.log(`KILL SWITCH ACTIVATED: ${flagName}`);
    }
  }
}

// Example configuration for migration cutover
const migrationFlag: FlagConfiguration = {
  name: 'use-new-user-service',
  defaultValue: false,
  rules: [
    // All internal users get the new service
    {
      attribute: 'userEmail',
      operator: 'contains',
      value: '@company.com',
      result: true,
    },
    // Specific beta organizations
    {
      attribute: 'organizationId',
      operator: 'in',
      value: ['org-123', 'org-456', 'org-789'],
      result: true,
    },
  ],
  percentageRollout: 10, // 10% of non-matched users
};

// Illustrative request shape and backend stand-ins for the usage example
interface IncomingRequest {
  userId: string;
  userEmail?: string;
  orgId?: string;
}
declare const newUserService: { handle(req: IncomingRequest): Promise<unknown> };
declare const legacyMonolith: { handle(req: IncomingRequest): Promise<unknown> };

// Usage in application code
async function handleRequest(request: IncomingRequest) {
  const flags = new FeatureFlagService();
  flags.updateFlag(migrationFlag); // in practice, loaded from a flag management system

  const context: EvaluationContext = {
    userId: request.userId,
    userEmail: request.userEmail,
    organizationId: request.orgId,
    attributes: {},
  };

  if (flags.evaluate('use-new-user-service', context)) {
    return await newUserService.handle(request);
  } else {
    return await legacyMonolith.handle(request);
  }
}
```

Feature flags shine when you need surgical precision: 'Enable for employees, plus organizations A, B, C, plus 5% of users in Europe.' This granularity is impossible with infrastructure-level routing. Use flags for business-critical cutovers where you need maximum control.
Despite best efforts, cutovers sometimes go wrong. How you respond determines whether a problem becomes a minor incident or a major outage. The key is preparation: define your responses before you need them.
The Response Framework:
Severity 1: Rollback Immediately

Examples: data corruption or loss, a sharp error-rate spike, or a security exposure.

Action: Execute rollback within 5 minutes. Investigate later.

Severity 2: Reduce Exposure and Investigate

Examples: elevated latency or errors confined to a specific endpoint or user segment.

Action: Reduce rollout percentage to minimize impact. Investigate root cause.

Severity 3: Monitor and Fix Forward

Examples: minor functional bugs or cosmetic issues with a known, low-risk fix.

Action: Deploy fix to new service. Monitor closely. No rollback needed.
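The triage logic above can be made explicit so responders don't improvise under pressure. A hedged sketch; the signal names and cutoffs are illustrative, not a standard:

```typescript
type IncidentResponse = 'rollback-immediately' | 'reduce-exposure' | 'fix-forward';

interface IncidentSignals {
  dataCorruption: boolean;
  errorRateIncrease: number;   // fraction above baseline, e.g. 0.02 = +2 points
  latencyIncreasePct: number;  // percent above baseline
}

function triage(s: IncidentSignals): IncidentResponse {
  // Severity 1: anything that threatens data, or a large error spike
  if (s.dataCorruption || s.errorRateIncrease > 0.01) {
    return 'rollback-immediately';
  }
  // Severity 2: degraded but not catastrophic
  if (s.errorRateIncrease > 0.001 || s.latencyIncreasePct > 25) {
    return 'reduce-exposure';
  }
  // Severity 3: minor issues, fix in the new service
  return 'fix-forward';
}
```

Codifying the thresholds in advance is the point: during an incident, the decision is a lookup, not a debate.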
````markdown
# Cutover Runbook: User Service Migration

## Pre-Cutover Checklist
- [ ] New service deployed and healthy
- [ ] Smoke tests passing
- [ ] Rollback mechanism tested
- [ ] Monitoring dashboards open
- [ ] On-call engineer confirmed
- [ ] Communication channel established

## Rollback Procedure

### Method 1: Routing Change (preferred, <1 minute)
```bash
# Set routing percentage to 0
kubectl patch configmap routing-config \
  --patch '{"data":{"user-service-percentage":"0"}}'
```

### Method 2: Feature Flag Kill Switch (<1 minute)
```bash
# Disable feature flag via API
curl -X POST https://flags.internal/api/flags/use-new-user-service/kill
```

### Method 3: DNS Failover (backup, ~5 minutes)
```bash
# Point user-service DNS back to monolith
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456 \
  --change-batch file://failback-dns.json
```

## Rollback Triggers (automatic recommendation)

| Metric | Threshold | Action |
|--------|-----------|--------|
| Error rate | > 0.5% for 2 min | Auto-rollback |
| P99 latency | > 1000ms for 5 min | Alert, manual evaluation |
| 5xx responses | > 10/minute | Auto-rollback |

## Post-Rollback
1. Confirm traffic flowing to monolith
2. Verify error rates normalized
3. Open incident ticket
4. Notify stakeholders
5. Begin root cause investigation
````

If you're debating whether to roll back, roll back. The cost of a brief rollback is almost always less than the cost of extended user impact while you investigate. You can always try again after understanding the problem. Pride has no place in incident response.
How do you know the cutover succeeded? You need objective, measurable criteria defined before the cutover begins. This prevents mid-incident debates about whether to roll back and gives you clear completion criteria.
| Category | Metric | Success Criterion | Measurement Period |
|---|---|---|---|
| Reliability | Error rate | ≤ baseline + 0.01% | 7 days at 100% |
| Performance | P99 latency | ≤ baseline + 10% | 7 days at 100% |
| Performance | P50 latency | ≤ baseline | 7 days at 100% |
| Correctness | Shadow comparison | 100% match | Pre-cutover validation |
| Business | Conversion rate | ≥ baseline - 0.5% | 14 days at 100% |
| Operations | Incidents | 0 SEV1/SEV2 caused | 30 days at 100% |
| Stability | Rollback required | 0 rollbacks | 30 days at 100% |
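Criteria like these can be checked mechanically at the end of each measurement period. A sketch using a subset of the table's thresholds; the metric shape and how the values are collected are assumptions:

```typescript
interface CutoverMetrics {
  errorRate: number;
  baselineErrorRate: number;
  p99LatencyMs: number;
  baselineP99LatencyMs: number;
  sev1or2Incidents: number;
  rollbacks: number;
}

interface CriterionResult {
  name: string;
  passed: boolean;
}

// Evaluate a subset of the success criteria from the table above
function evaluateCutover(m: CutoverMetrics): CriterionResult[] {
  return [
    // Error rate: at most baseline + 0.01 percentage points
    { name: 'error rate', passed: m.errorRate <= m.baselineErrorRate + 0.0001 },
    // P99 latency: at most baseline + 10%
    { name: 'P99 latency', passed: m.p99LatencyMs <= m.baselineP99LatencyMs * 1.10 },
    { name: 'incidents', passed: m.sev1or2Incidents === 0 },
    { name: 'rollbacks', passed: m.rollbacks === 0 },
  ];
}

function cutoverSucceeded(m: CutoverMetrics): boolean {
  return evaluateCutover(m).every((c) => c.passed);
}
```

Returning per-criterion results rather than a single boolean makes the failure report self-explanatory when a check does not pass.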
The Bake Period:
Even after reaching 100% traffic, maintain a 'bake period' before declaring the cutover complete: keep the legacy path deployed and the rollback mechanism live, continue monitoring the full set of success metrics, and avoid making unrelated changes to the new service while it bakes.

Only after the bake period ends with no issues should you remove the legacy code path, delete the routing logic and feature flags, and decommission the old infrastructure.
A cutover is truly complete when: 100% of traffic successfully served by new implementation for 14+ days, no rollbacks required, all success metrics met, old code removed from production, and the team can handle operational issues with the new system without referring back to monolith knowledge.
The cutover is where migration succeeds or fails. A well-executed cutover is invisible; a poorly executed one causes outages. The key is strategy, preparation, and rapid response capability.
What's Next:
With cutover strategies mastered, the final page addresses Risk Mitigation—the comprehensive approach to identifying, preventing, and managing risks throughout the entire Strangler Fig migration journey.
You now understand the full spectrum of cutover strategies, from gradual percentage rollouts to instant blue-green switches. With this knowledge, you can plan and execute zero-downtime transitions while maintaining the ability to rapidly respond to problems. Next, we'll explore comprehensive risk mitigation strategies.