You've built the routing façade. You've extracted the functionality. You've validated with shadow traffic. Now comes the moment that keeps engineers awake at night: the cutover—the transition from theory to reality, from the old system serving traffic to the new system taking over.
The cutover is where migrations succeed or fail. A well-executed cutover is invisible to users; a poorly executed one results in outages, data corruption, or frantic rollbacks at 3 AM. The difference lies in strategy, preparation, and the ability to respond when things don't go as planned.
This is not merely a technical challenge. It's an orchestration of code, infrastructure, teams, and timing. The strategies you choose determine your risk exposure, your ability to recover, and ultimately, your confidence in the migration.
By the end of this page, you will understand multiple cutover strategies (from percentage rollout to blue-green to feature flags), how to plan and execute cutovers safely, techniques for handling failures during transition, and how to know when a cutover is truly complete.
Cutover strategies exist on a spectrum from maximally gradual (shifting traffic 1% at a time over weeks) to instant (flipping a switch to move 100% at once). Your position on this spectrum should be informed by risk tolerance, technical constraints, and business requirements.
The Fundamental Tradeoff:
Gradual cutovers reduce risk but extend the period of parallel operation, increasing operational complexity and the chance of drift between systems. Instant cutovers minimize parallel operation but concentrate all risk at a single moment.
| Strategy | Risk Level | Rollback Speed | Parallel Period | Best For |
|---|---|---|---|---|
| Percentage Rollout | Very Low | Instant | Long (weeks) | High-traffic production, risk-averse orgs |
| Canary Release | Low | Instant | Medium (days) | Validating with subset before wider release |
| Blue-Green | Medium | Fast (seconds) | Short (hours) | When full traffic testing is needed |
| Feature Flags | Low to Medium | Instant | Variable | Fine-grained control, targeted rollout |
| Shadow to Live | Medium | Fast | Medium | Complex validation requirements |
| Scheduled Cutover | High | Slow (requires deployment) | Minimal | Batch systems, maintenance windows |
Choosing Your Strategy:
Consider these factors:
Traffic Volume: High-traffic systems need gradual rollout to detect issues before they affect many users.
Error Detectability: Can you detect problems quickly? If yes, faster rollout is acceptable. If problems take hours to surface, go slower.
Data Consistency: If the cutover involves data migration, you may need tighter coordination and shorter parallel periods.
Team Experience: First few cutovers should be maximally gradual. As expertise builds, you can accelerate.
Rollback Complexity: If rollback is complex (requires data reconciliation), favor gradual strategies with easier abort points.
Your first migration should use the most gradual strategy you can tolerate. As you gain confidence and build tooling, accelerate. Many organizations move from weeks-long percentage rollouts to same-day blue-green deployments as their platform matures.
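As a rough illustration of how these factors combine, here is a hypothetical decision helper. The factor names and thresholds are assumptions for the sketch, not a standard formula; the point is the bias toward gradual strategies whenever any risk factor is elevated.

```typescript
type Strategy = 'percentage-rollout' | 'blue-green' | 'scheduled-cutover';

interface CutoverFactors {
  requestsPerSecond: number;      // traffic volume
  detectionTimeMinutes: number;   // how long problems take to surface
  rollbackIsComplex: boolean;     // e.g. requires data reconciliation
  teamHasCutoverExperience: boolean;
}

// Illustrative decision helper: prefer the most gradual strategy
// whenever any risk factor argues for caution.
function recommendStrategy(f: CutoverFactors): Strategy {
  if (f.rollbackIsComplex || !f.teamHasCutoverExperience) {
    return 'percentage-rollout'; // most gradual, easiest abort points
  }
  if (f.requestsPerSecond > 100 || f.detectionTimeMinutes > 30) {
    return 'percentage-rollout'; // high traffic or slow detection: go slow
  }
  if (f.requestsPerSecond > 0) {
    return 'blue-green'; // fast detection, simple rollback, experienced team
  }
  return 'scheduled-cutover'; // batch system with no live traffic
}
```

The specific cutoffs matter less than the shape of the logic: every branch that detects risk pushes toward a more gradual strategy.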
Percentage-based rollout is the safest cutover strategy. You gradually shift traffic from the old implementation to the new, starting with a tiny fraction and increasing as confidence grows.
The Typical Progression:
```
0% → 1% → 5% → 10% → 25% → 50% → 75% → 100%
```

| Traffic Level | 1% | 5% | 10% | 25% | 50% | 75% | 100% |
|---|---|---|---|---|---|---|---|
| Monitor For | 1 hour | 4 hours | 1 day | 2 days | 3 days | 3 days | Done |
At each step, you monitor error rates, latency, and business metrics. Any regression triggers investigation and potential rollback to the previous percentage.
Key Considerations:
Sticky vs. Random Assignment: Should the same user always hit the same backend (sticky), or be assigned randomly on each request? Sticky assignment gives each user a consistent experience, but the realized traffic split lags the configured percentage because existing users keep their prior assignment. Random assignment matches the configured percentage exactly, but a single user may see inconsistent behavior across requests.
What to Measure: Error rates, latency (P50, P95, P99), business metrics (conversion, revenue), resource utilization, and customer support tickets.
Hold Time at Each Step: Don't rush. Problems often emerge hours or days after a change, as edge cases accumulate.
Rollback Threshold: Define objective criteria. Example: "If P99 latency exceeds 500ms or error rate exceeds 0.1%, rollback immediately."
```typescript
interface RolloutConfig {
  featureName: string;
  currentPercentage: number;
  targetPercentage: number;
  stickyBuckets: boolean;
  bucketCount: number;
}

interface RolloutDecision {
  useNewImplementation: boolean;
  bucketId: number;
  reason: 'percentage' | 'override' | 'sticky';
}

class PercentageRollout {
  private config: RolloutConfig;
  private stickyAssignments: Map<string, boolean> = new Map();

  constructor(config: RolloutConfig) {
    this.config = config;
  }

  /**
   * Determine which implementation to use for a request
   */
  decideRouting(
    userId: string,
    requestId: string,
    overrides?: { forceNew?: boolean; forceLegacy?: boolean }
  ): RolloutDecision {
    // Check for explicit overrides first
    if (overrides?.forceNew) {
      return { useNewImplementation: true, bucketId: -1, reason: 'override' };
    }
    if (overrides?.forceLegacy) {
      return { useNewImplementation: false, bucketId: -1, reason: 'override' };
    }

    // For sticky sessions, check prior assignment
    if (this.config.stickyBuckets) {
      const prior = this.stickyAssignments.get(userId);
      if (prior !== undefined) {
        return {
          useNewImplementation: prior,
          bucketId: this.computeBucket(userId),
          reason: 'sticky',
        };
      }
    }

    // Compute bucket for this user
    const bucket = this.computeBucket(userId);

    // Buckets 0 to (percentage * bucketCount / 100) go to the new implementation
    const cutoffBucket = Math.floor(
      (this.config.currentPercentage / 100) * this.config.bucketCount
    );
    const useNew = bucket < cutoffBucket;

    // Cache for sticky sessions
    if (this.config.stickyBuckets) {
      this.stickyAssignments.set(userId, useNew);
    }

    return { useNewImplementation: useNew, bucketId: bucket, reason: 'percentage' };
  }

  /**
   * Compute a stable bucket assignment for a user.
   * Uses a stable hash so the same user always gets the same bucket.
   */
  private computeBucket(userId: string): number {
    // Simple hash function for demonstration.
    // In production, use a proper consistent hashing algorithm.
    let hash = 0;
    for (const char of userId) {
      hash = ((hash << 5) - hash) + char.charCodeAt(0);
      hash = hash & hash; // Convert to 32-bit integer
    }
    return Math.abs(hash) % this.config.bucketCount;
  }

  /**
   * Safely increase the rollout percentage
   */
  async increasePercentage(
    newPercentage: number,
    metricsChecker: () => Promise<boolean>
  ): Promise<{ success: boolean; reason?: string }> {
    // Verify metrics are healthy before updating
    const metricsHealthy = await metricsChecker();
    if (!metricsHealthy) {
      return {
        success: false,
        reason: 'Metrics unhealthy, cannot increase percentage',
      };
    }

    const oldPercentage = this.config.currentPercentage;
    this.config.currentPercentage = newPercentage;

    // When not using sticky buckets, clear any cached assignments so the
    // new percentage applies on the next request. (With sticky buckets,
    // existing users keep their prior assignment; some systems accept this.)
    if (!this.config.stickyBuckets) {
      this.stickyAssignments.clear();
    }

    console.log(`Rollout ${this.config.featureName}: ${oldPercentage}% → ${newPercentage}%`);
    return { success: true };
  }

  /**
   * Emergency rollback to 0%
   */
  rollback(): void {
    console.log(`ROLLBACK: ${this.config.featureName} → 0%`);
    this.config.currentPercentage = 0;
    this.stickyAssignments.clear();
  }
}

// Usage example
async function executeRollout() {
  const rollout = new PercentageRollout({
    featureName: 'user-service-v2',
    currentPercentage: 0,
    targetPercentage: 100,
    stickyBuckets: true,
    bucketCount: 1000, // Allows 0.1% granularity
  });

  const stages = [1, 5, 10, 25, 50, 75, 100];

  for (const percentage of stages) {
    const result = await rollout.increasePercentage(percentage, async () => {
      // Check your monitoring system
      const errorRate = await getErrorRate();
      const p99Latency = await getP99Latency();
      return errorRate < 0.001 && p99Latency < 500;
    });

    if (!result.success) {
      console.error(`Failed to increase to ${percentage}%: ${result.reason}`);
      rollout.rollback();
      break;
    }

    // Wait for metrics to stabilize before the next stage
    await waitForMetricsStabilization(percentage > 25 ? '4h' : '1h');
  }
}

async function getErrorRate(): Promise<number> {
  // Integration with monitoring system
  return 0;
}

async function getP99Latency(): Promise<number> {
  // Integration with monitoring system
  return 0;
}

async function waitForMetricsStabilization(duration: string): Promise<void> {
  // Wait for the specified duration
}
```

Never skip the 1% stage, no matter how confident you are. Production traffic always contains surprises not seen in testing. A 1% rollout exposes you to real traffic while limiting blast radius. Many production issues are discovered at 1% that weren't found in any prior testing.
A canary release routes a small subset of traffic to the new implementation while monitoring for problems. The term comes from the practice of using canaries in coal mines to detect dangerous gases—if the canary dies, evacuate before miners are affected.
How Canary Differs from Percentage Rollout:
While similar, canary releases typically:

- Route traffic to a dedicated canary instance or instance group rather than mixing versions within a shared pool, isolating the new version's behavior.
- Hold at a small, fixed fraction of traffic rather than climbing through many percentage stages.
- Compare the canary's metrics against a baseline control running the stable version at the same time, rather than against historical data alone.
Automated Canary Analysis:
Modern canary systems automate the comparison and decision-making:
Baseline Establishment: Collect metrics from the stable implementation over a period.
Canary Deployment: Deploy new implementation and route canary traffic.
Metric Collection: Gather same metrics from canary for comparison period.
Statistical Analysis: Compare distributions using statistical tests (Mann-Whitney U, Kolmogorov-Smirnov).
Verdict: Pass (promote to larger rollout), Fail (rollback), or Inconclusive (extend test).
Tools like Kayenta (Spinnaker), Flagger (Kubernetes), and Argo Rollouts provide this automation.
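The verdict step above can be sketched in a few lines. Real tools use nonparametric statistical tests as noted; this simplified version compares means and enforces a minimum sample size, and all thresholds are assumptions for illustration:

```typescript
type Verdict = 'pass' | 'fail' | 'inconclusive';

// Simplified canary analysis over a latency metric: 'inconclusive' means
// extend the test, 'fail' means rollback, 'pass' means promote.
function canaryVerdict(
  baselineLatencies: number[],
  canaryLatencies: number[],
  minSamples = 100,
  maxRegression = 0.10 // canary mean may be at most 10% worse than baseline
): Verdict {
  if (baselineLatencies.length < minSamples || canaryLatencies.length < minSamples) {
    return 'inconclusive'; // not enough data to decide either way
  }
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const regression =
    (mean(canaryLatencies) - mean(baselineLatencies)) / mean(baselineLatencies);
  return regression <= maxRegression ? 'pass' : 'fail';
}
```

A mean comparison is fragile against outliers, which is exactly why production systems prefer distribution-level tests such as Mann-Whitney U.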
Always route internal user traffic to canary before any external traffic. Engineers and employees are more likely to notice subtle issues, more tolerant of problems, and easier to communicate with. Never let a bug reach customers that your own team hasn't experienced first.
Blue-green deployment maintains two identical production environments: blue (current production) and green (new version). Traffic is switched entirely from one to the other, with instant rollback by switching back.
The Blue-Green Model:
```
Before:    Users → [Load Balancer] → [Blue: v1.0]
                                     [Green: idle]

Deploy:    Users → [Load Balancer] → [Blue: v1.0]
                                     [Green: v2.0 deploying]

Switch:    Users → [Load Balancer] → [Green: v2.0 serving]
                                     [Blue: standby]

Rollback:  Users → [Load Balancer] → [Blue: v1.0 serving]
                                     [Green: failed v2.0]
```
Blue-Green Advantages:

- Instant rollback: switching back to blue is a single routing change that takes seconds.
- Full-environment validation: green can be smoke-tested and warmed with production configuration before it receives any user traffic.
- No mixed-version period: at any moment, exactly one version serves all traffic, which simplifies debugging.

Blue-Green Challenges:

- Cost: two full production environments run in parallel during the transition.
- Data: the database must remain compatible with both versions simultaneously.
- Concentrated risk: 100% of traffic moves at once, so any undetected problem affects all users immediately.
| Phase | Actions | Verification |
|---|---|---|
| Preparation | Deploy to green, run smoke tests, verify health | All health checks pass, smoke tests green |
| Pre-Switch | Warm up green (caches, connections), final validation | Green handling synthetic traffic successfully |
| Switch | Update load balancer, shift traffic | Traffic flowing to green, blue draining |
| Verification | Monitor metrics, watch for errors, check business metrics | Error rate stable, no metric regression |
| Hold | Wait for stability period (30min-2hrs) | No incidents, metrics remain healthy |
| Completion | Mark blue as previous, available for next deployment | Blue can be updated for next release |
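The switch and rollback steps reduce to a single routing flip, which is why blue-green rollback is so fast. A minimal sketch; `setActiveTarget` is an illustrative stand-in for a real load-balancer API (e.g. updating a target group):

```typescript
type Environment = 'blue' | 'green';

interface LoadBalancer {
  // Stand-in for a real load-balancer update call
  setActiveTarget(env: Environment): void;
}

class BlueGreenController {
  private active: Environment = 'blue';

  constructor(private lb: LoadBalancer) {}

  get activeEnvironment(): Environment {
    return this.active;
  }

  // Atomic switch: all traffic moves to the other environment
  switchTraffic(): Environment {
    this.active = this.active === 'blue' ? 'green' : 'blue';
    this.lb.setActiveTarget(this.active);
    return this.active;
  }

  // Rollback is the same operation in reverse; the previous
  // environment kept running untouched, so nothing must be redeployed.
  rollback(): Environment {
    return this.switchTraffic();
  }
}
```

Because the standby environment keeps running unchanged, rollback is structurally identical to the original switch, and just as fast.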
Blue-green works cleanly when the database schema is unchanged. When schema changes are required, you must ensure both blue and green can work with the database simultaneously (expand-contract pattern). Never deploy a schema change that breaks the blue environment—you'd lose your rollback capability.
Feature flags provide the most granular control over cutovers. The routing decision is made within the application based on flag state, allowing instant changes without deployment.
Feature Flag Capabilities:

- Targeting rules: enable the new path for specific users, organizations, email domains, or countries.
- Percentage rollout with stable per-user bucketing for everyone not matched by a rule.
- Instant runtime changes with no deployment, including an emergency kill switch that disables the flag for everyone.
```typescript
interface FlagConfiguration {
  name: string;
  defaultValue: boolean;
  rules: FlagRule[];
  percentageRollout: number; // 0-100
}

interface FlagRule {
  attribute: string;
  operator: 'equals' | 'contains' | 'startsWith' | 'in';
  value: string | string[];
  result: boolean;
}

interface EvaluationContext {
  userId: string;
  userEmail?: string;
  organizationId?: string;
  country?: string;
  userAgent?: string;
  attributes: Record<string, string>;
}

class FeatureFlagService {
  private flags: Map<string, FlagConfiguration> = new Map();

  /**
   * Evaluate a feature flag for a given context
   */
  evaluate(flagName: string, context: EvaluationContext): boolean {
    const flag = this.flags.get(flagName);
    if (!flag) {
      console.warn(`Unknown flag: ${flagName}`);
      return false;
    }

    // Check explicit rules first (overrides percentage)
    for (const rule of flag.rules) {
      const contextValue = this.getContextValue(context, rule.attribute);
      if (this.evaluateRule(rule, contextValue)) {
        return rule.result;
      }
    }

    // Fall through to percentage rollout
    if (flag.percentageRollout > 0) {
      return this.inPercentage(context.userId, flag.percentageRollout);
    }

    return flag.defaultValue;
  }

  private getContextValue(context: EvaluationContext, attribute: string): string | undefined {
    switch (attribute) {
      case 'userId': return context.userId;
      case 'userEmail': return context.userEmail;
      case 'organizationId': return context.organizationId;
      case 'country': return context.country;
      default: return context.attributes[attribute];
    }
  }

  private evaluateRule(rule: FlagRule, value: string | undefined): boolean {
    if (value === undefined) return false;
    switch (rule.operator) {
      case 'equals': return value === rule.value;
      case 'contains': return typeof rule.value === 'string' && value.includes(rule.value);
      case 'startsWith': return typeof rule.value === 'string' && value.startsWith(rule.value);
      case 'in': return Array.isArray(rule.value) && rule.value.includes(value);
      default: return false;
    }
  }

  private inPercentage(userId: string, percentage: number): boolean {
    // Consistent hash for stable assignment
    let hash = 0;
    for (const char of userId) {
      hash = ((hash << 5) - hash) + char.charCodeAt(0);
      hash = hash & hash;
    }
    const bucket = Math.abs(hash) % 100;
    return bucket < percentage;
  }

  /**
   * Update flag configuration (used by feature flag management system)
   */
  updateFlag(config: FlagConfiguration): void {
    this.flags.set(config.name, config);
    console.log(`Flag updated: ${config.name}`, JSON.stringify(config, null, 2));
  }

  /**
   * Emergency kill switch - disable a flag for everyone
   */
  killFlag(flagName: string): void {
    const flag = this.flags.get(flagName);
    if (flag) {
      flag.rules = [];
      flag.percentageRollout = 0;
      flag.defaultValue = false;
      console.log(`KILL SWITCH ACTIVATED: ${flagName}`);
    }
  }
}

// Example configuration for migration cutover
const migrationFlag: FlagConfiguration = {
  name: 'use-new-user-service',
  defaultValue: false,
  rules: [
    // All internal users get the new service
    {
      attribute: 'userEmail',
      operator: 'contains',
      value: '@company.com',
      result: true,
    },
    // Specific beta organizations
    {
      attribute: 'organizationId',
      operator: 'in',
      value: ['org-123', 'org-456', 'org-789'],
      result: true,
    },
  ],
  percentageRollout: 10, // 10% of non-matched users
};

// Illustrative request shape and backend stand-ins for the usage example
interface IncomingRequest {
  userId: string;
  userEmail?: string;
  orgId?: string;
}
declare const newUserService: { handle(req: IncomingRequest): Promise<unknown> };
declare const legacyMonolith: { handle(req: IncomingRequest): Promise<unknown> };

// Usage in application code
async function handleRequest(request: IncomingRequest) {
  const flags = new FeatureFlagService();
  flags.updateFlag(migrationFlag); // in practice, loaded from a flag management system

  const context: EvaluationContext = {
    userId: request.userId,
    userEmail: request.userEmail,
    organizationId: request.orgId,
    attributes: {},
  };

  if (flags.evaluate('use-new-user-service', context)) {
    return await newUserService.handle(request);
  } else {
    return await legacyMonolith.handle(request);
  }
}
```

Feature flags shine when you need surgical precision: 'Enable for employees, plus organizations A, B, C, plus 5% of users in Europe.' This granularity is impossible with infrastructure-level routing. Use flags for business-critical cutovers where you need maximum control.
Despite best efforts, cutovers sometimes go wrong. How you respond determines whether a problem becomes a minor incident or a major outage. The key is preparation: define your responses before you need them.
The Response Framework:
Severity 1: Rollback Immediately

Examples: data corruption or loss, a sharp error-rate spike, or a security exposure.

Action: Execute rollback within 5 minutes. Investigate later.

Severity 2: Reduce Exposure and Investigate

Examples: elevated latency or errors confined to a specific endpoint or user segment.

Action: Reduce rollout percentage to minimize impact. Investigate root cause.

Severity 3: Monitor and Fix Forward

Examples: minor functional bugs or cosmetic issues with a known, low-risk fix.

Action: Deploy fix to new service. Monitor closely. No rollback needed.
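The triage logic above can be made explicit so responders don't improvise under pressure. A hedged sketch; the signal names and cutoffs are illustrative, not a standard:

```typescript
type IncidentResponse = 'rollback-immediately' | 'reduce-exposure' | 'fix-forward';

interface IncidentSignals {
  dataCorruption: boolean;
  errorRateIncrease: number;   // fraction above baseline, e.g. 0.02 = +2 points
  latencyIncreasePct: number;  // percent above baseline
}

function triage(s: IncidentSignals): IncidentResponse {
  // Severity 1: anything that threatens data, or a large error spike
  if (s.dataCorruption || s.errorRateIncrease > 0.01) {
    return 'rollback-immediately';
  }
  // Severity 2: degraded but not catastrophic
  if (s.errorRateIncrease > 0.001 || s.latencyIncreasePct > 25) {
    return 'reduce-exposure';
  }
  // Severity 3: minor issues, fix in the new service
  return 'fix-forward';
}
```

Codifying the thresholds in advance is the point: during an incident, the decision is a lookup, not a debate.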
````markdown
# Cutover Runbook: User Service Migration

## Pre-Cutover Checklist
- [ ] New service deployed and healthy
- [ ] Smoke tests passing
- [ ] Rollback mechanism tested
- [ ] Monitoring dashboards open
- [ ] On-call engineer confirmed
- [ ] Communication channel established

## Rollback Procedure

### Method 1: Routing Change (preferred, <1 minute)
```bash
# Set routing percentage to 0
kubectl patch configmap routing-config \
  --patch '{"data":{"user-service-percentage":"0"}}'
```

### Method 2: Feature Flag Kill Switch (<1 minute)
```bash
# Disable feature flag via API
curl -X POST https://flags.internal/api/flags/use-new-user-service/kill
```

### Method 3: DNS Failover (backup, ~5 minutes)
```bash
# Point user-service DNS back to monolith
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456 \
  --change-batch file://failback-dns.json
```

## Rollback Triggers (automatic recommendation)

| Metric | Threshold | Action |
|--------|-----------|--------|
| Error rate | > 0.5% for 2 min | Auto-rollback |
| P99 latency | > 1000ms for 5 min | Alert, manual evaluation |
| 5xx responses | > 10/minute | Auto-rollback |

## Post-Rollback
1. Confirm traffic flowing to monolith
2. Verify error rates normalized
3. Open incident ticket
4. Notify stakeholders
5. Begin root cause investigation
````

If you're debating whether to roll back, roll back. The cost of a brief rollback is almost always less than the cost of extended user impact while you investigate. You can always try again after understanding the problem. Pride has no place in incident response.
How do you know the cutover succeeded? You need objective, measurable criteria defined before the cutover begins. This prevents mid-incident debates about whether to roll back and gives you clear completion criteria.
| Category | Metric | Success Criterion | Measurement Period |
|---|---|---|---|
| Reliability | Error rate | ≤ baseline + 0.01% | 7 days at 100% |
| Performance | P99 latency | ≤ baseline + 10% | 7 days at 100% |
| Performance | P50 latency | ≤ baseline | 7 days at 100% |
| Correctness | Shadow comparison | 100% match | Pre-cutover validation |
| Business | Conversion rate | ≥ baseline - 0.5% | 14 days at 100% |
| Operations | Incidents | 0 SEV1/SEV2 caused | 30 days at 100% |
| Stability | Rollback required | 0 rollbacks | 30 days at 100% |
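Criteria like these can be checked mechanically at the end of each measurement period. A sketch using a subset of the table's thresholds; the metric shape and how the values are collected are assumptions:

```typescript
interface CutoverMetrics {
  errorRate: number;
  baselineErrorRate: number;
  p99LatencyMs: number;
  baselineP99LatencyMs: number;
  sev1or2Incidents: number;
  rollbacks: number;
}

interface CriterionResult {
  name: string;
  passed: boolean;
}

// Evaluate a subset of the success criteria from the table above
function evaluateCutover(m: CutoverMetrics): CriterionResult[] {
  return [
    // Error rate: at most baseline + 0.01 percentage points
    { name: 'error rate', passed: m.errorRate <= m.baselineErrorRate + 0.0001 },
    // P99 latency: at most baseline + 10%
    { name: 'P99 latency', passed: m.p99LatencyMs <= m.baselineP99LatencyMs * 1.10 },
    { name: 'incidents', passed: m.sev1or2Incidents === 0 },
    { name: 'rollbacks', passed: m.rollbacks === 0 },
  ];
}

function cutoverSucceeded(m: CutoverMetrics): boolean {
  return evaluateCutover(m).every((c) => c.passed);
}
```

Returning per-criterion results rather than a single boolean makes the failure report self-explanatory when a check does not pass.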
The Bake Period:
Even after reaching 100% traffic, maintain a 'bake period' before declaring the cutover complete: keep the legacy path deployed and the rollback mechanism live, continue monitoring the full set of success metrics, and avoid making unrelated changes to the new service while it bakes.

Only after the bake period ends with no issues should you remove the legacy code path, delete the routing logic and feature flags, and decommission the old infrastructure.
A cutover is truly complete when: 100% of traffic successfully served by new implementation for 14+ days, no rollbacks required, all success metrics met, old code removed from production, and the team can handle operational issues with the new system without referring back to monolith knowledge.
The cutover is where migration succeeds or fails. A well-executed cutover is invisible; a poorly executed one causes outages. The key is strategy, preparation, and rapid response capability.
What's Next:
With cutover strategies mastered, the final page addresses Risk Mitigation—the comprehensive approach to identifying, preventing, and managing risks throughout the entire Strangler Fig migration journey.
You now understand the full spectrum of cutover strategies, from gradual percentage rollouts to instant blue-green switches. With this knowledge, you can plan and execute zero-downtime transitions while maintaining the ability to rapidly respond to problems. Next, we'll explore comprehensive risk mitigation strategies.