Every migration story includes moments of crisis—the unexpected behavior, the data inconsistency discovered at midnight, the cascade that almost brought down production. What separates successful migrations from failures is not the absence of problems, but the presence of systematic risk management that prevents problems from becoming catastrophes.
The Strangler Fig Pattern is inherently designed to reduce risk through incrementalism. But the pattern alone is not enough. You need a comprehensive approach to identifying risks before they materialize, building defenses against known risks, detecting problems early when they occur, and recovering quickly when defenses fail.
Risk mitigation is not pessimism—it's professionalism. The best-prepared teams look paranoid before the migration and competent during it. Under-prepared teams look confident before and panicked during.
By the end of this page, you will understand how to identify and categorize migration risks, implement preventive controls for common failure modes, build early detection systems that catch problems before users notice, design recovery mechanisms that minimize blast radius, and create organizational practices that institutionalize safe migration.
Before mitigating risks, you must understand them. Migration risks fall into several categories, each requiring different prevention and response strategies.
| Category | Description | Examples | Impact Level |
|---|---|---|---|
| Functional Correctness | New service produces different results | Wrong calculations, missing data, incorrect state transitions | Critical |
| Performance | New service slower or less efficient | Latency regression, resource exhaustion, timeout cascades | High |
| Data Integrity | Data loss, corruption, or inconsistency | Dual-write failures, migration errors, sync lag | Critical |
| Availability | Service unavailable or degraded | Deployment failures, dependency outages, circuit breaker trips | High |
| Security | New attack vectors or vulnerabilities | Auth bypass, exposed endpoints, weakened controls | Critical |
| Operational | Inability to maintain or debug system | Missing runbooks, inadequate monitoring, knowledge gaps | Medium |
| Organizational | Team or process failures | Communication breakdown, unclear ownership, scope creep | Medium |
| Strategic | Migration direction proves wrong | Technology choice fails, requirements change, budget cut | High |
Prioritize risks using three factors, combined into a single score:

Risk Priority = Likelihood × Impact × Detectability
Likelihood: How probable is this risk? (Based on complexity, team experience, historical data)
Impact: If it occurs, how severe? (User impact, data loss, revenue impact, recovery time)
Detectability: How quickly will we know? Score this factor inversely: a risk that would go unnoticed for a long time scores high, so strong early detection lowers the priority.
A high-impact, low-detectability risk (like subtle data corruption) is far more dangerous than a high-impact, high-detectability risk (like a complete outage).
The most dangerous risks are those with delayed impact or low detectability: gradual performance degradation, subtle data inconsistencies, or slowly increasing error rates. These don't trigger alerts until significant damage is done. Invest heavily in early detection for these 'silent killers.'
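As a concrete illustration, the score can live as plain data next to each risk. The sketch below assumes a 1-to-5 scale for each factor, with detectability scored inversely so hard-to-detect risks rank higher; the type names and sample entries are illustrative, not taken from any particular tool.

```typescript
// Illustrative risk scoring on 1-5 scales. Detection is scored inversely
// (5 = likely to go unnoticed for a long time), so silent risks rank higher.
type Score = 1 | 2 | 3 | 4 | 5;

interface Risk {
  id: string;
  title: string;
  likelihood: Score; // 5 = almost certain
  impact: Score;     // 5 = catastrophic
  detection: Score;  // 5 = very hard to detect
}

const riskPriority = (r: Risk): number => r.likelihood * r.impact * r.detection;

const register: Risk[] = [
  { id: 'RISK-001', title: 'Latency regression', likelihood: 3, impact: 4, detection: 1 },
  { id: 'RISK-002', title: 'Subtle data corruption', likelihood: 2, impact: 5, detection: 5 },
];

// The hard-to-detect data risk outranks the easy-to-spot latency risk.
register
  .sort((a, b) => riskPriority(b) - riskPriority(a))
  .forEach(r => console.log(`${r.id} ${r.title}: priority ${riskPriority(r)}`));
```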
The best way to handle a risk is to prevent it from occurring. Preventive controls are the first line of defense, built into your development and deployment processes.
The Testing Pyramid for Migration:
Apply the standard pyramid (many fast unit tests at the base, fewer integration tests, a thin layer of end-to-end tests), focusing each layer on the behavior being extracted.

Additionally, add a comparison layer not found in normal testing:
Record production requests and responses from the monolith. Replay requests to the new service and compare responses. This gives you a massive test suite derived from real production traffic—far more comprehensive than any manually written tests.
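A minimal sketch of such a replay comparator, assuming recorded traffic is available in memory and the new service is reachable over HTTP via Node 18+ `fetch`; the `RecordedExchange` type and base URL parameter are illustrative:

```typescript
// Sketch of a record-and-replay comparator: replay captured monolith traffic
// against the new service and flag any response that differs.
interface RecordedExchange {
  method: string;
  path: string;
  body?: unknown;
  monolithResponse: { status: number; body: unknown };
}

interface ReplayResult {
  path: string;
  match: boolean;
  detail?: string;
}

async function replayAgainstNewService(
  exchanges: RecordedExchange[],
  newServiceBaseUrl: string
): Promise<ReplayResult[]> {
  const results: ReplayResult[] = [];

  for (const ex of exchanges) {
    const res = await fetch(`${newServiceBaseUrl}${ex.path}`, {
      method: ex.method,
      headers: { 'content-type': 'application/json' },
      body: ex.body === undefined ? undefined : JSON.stringify(ex.body),
    });
    const newBody = await res.json().catch(() => null);

    // Naive equality; a real comparator would normalize timestamps, IDs, and ordering.
    const match =
      res.status === ex.monolithResponse.status &&
      JSON.stringify(newBody) === JSON.stringify(ex.monolithResponse.body);

    results.push({
      path: ex.path,
      match,
      detail: match ? undefined : `monolith ${ex.monolithResponse.status} vs new ${res.status}`,
    });
  }

  return results;
}
```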
When preventive controls fail (and eventually some will), you need detection systems that identify problems before users notice. The goal is to detect problems in seconds or minutes, not hours or days.
| Detection Type | What It Catches | Detection Speed | Implementation |
|---|---|---|---|
| Error Rate Monitoring | Increased failures, new error types | Seconds | Prometheus/Datadog + alerting |
| Latency Monitoring | Performance regression | Seconds | P50/P95/P99 percentile tracking |
| Comparison Scoring | Behavioral differences from baseline | Minutes | Real-time response comparison service |
| Business Metrics | Revenue, conversion impact | Minutes to hours | Real-time analytics dashboards |
| Anomaly Detection | Unusual patterns, edge cases | Minutes | ML-based anomaly detection |
| Synthetic Monitoring | Critical path availability | Seconds | Synthetic probes from multiple locations |
| Log Analysis | Error patterns, unusual behavior | Minutes | ELK/Splunk with pattern detection |
| User Feedback | UX issues, functional problems | Hours to days | Support ticket analysis, error reporting |
The Golden Signals for Migration:
Google's four golden signals, adapted for migration monitoring:
Latency: Time to serve requests. Track separately for monolith vs. new service. Alert on divergence.
Traffic: Request volume. Monitor for unexpected drops (requests failing silently) or spikes (requests being retried).
Errors: Failed request rate. Track by type (4xx vs 5xx) and compare between implementations.
Saturation: Resource utilization. New services often have different resource profiles than expected.
The monitor below tracks these signals per backend, evaluates alert thresholds over sliding windows, and triggers an automatic rollback on critical violations:

```typescript
interface MigrationMetrics {
  service: string;
  endpoint: string;
  timestamp: Date;
  backend: 'monolith' | 'microservice';
  // Golden signals
  latencyMs: number;
  success: boolean;
  errorType?: string;
  // Migration-specific
  responseHash?: string; // For comparison
  shadowComparison?: 'match' | 'mismatch' | 'not-compared';
}

interface AlertThreshold {
  metric: string;
  condition: 'gt' | 'lt' | 'change';
  value: number;
  windowSeconds: number;
  severity: 'info' | 'warning' | 'critical';
}

class MigrationMonitor {
  private metrics: MigrationMetrics[] = [];

  private thresholds: AlertThreshold[] = [
    // Error rate thresholds
    { metric: 'errorRate', condition: 'gt', value: 0.01, windowSeconds: 300, severity: 'warning' },     // 1%
    { metric: 'errorRate', condition: 'gt', value: 0.05, windowSeconds: 60, severity: 'critical' },      // 5%
    // Latency thresholds
    { metric: 'p99Latency', condition: 'gt', value: 1000, windowSeconds: 300, severity: 'warning' },     // 1 second
    // Comparison thresholds
    { metric: 'mismatchRate', condition: 'gt', value: 0.001, windowSeconds: 600, severity: 'critical' }, // 0.1%
    // Traffic thresholds (sudden drops)
    { metric: 'trafficChange', condition: 'lt', value: -0.5, windowSeconds: 300, severity: 'critical' }, // 50% drop
  ];

  recordMetric(metric: MigrationMetrics): void {
    this.metrics.push(metric);
    this.evaluateThresholds();
  }

  private evaluateThresholds(): void {
    for (const threshold of this.thresholds) {
      const value = this.calculateMetric(threshold.metric, threshold.windowSeconds);
      const violated = this.checkCondition(value, threshold.condition, threshold.value);

      if (violated) {
        this.raiseAlert({
          threshold,
          currentValue: value,
          message: `${threshold.metric} ${threshold.condition} ${threshold.value}: current=${value}`,
        });
      }
    }
  }

  private calculateMetric(metric: string, windowSeconds: number): number {
    const cutoff = new Date(Date.now() - windowSeconds * 1000);
    const windowMetrics = this.metrics.filter(m => m.timestamp >= cutoff);
    if (windowMetrics.length === 0) return 0;

    switch (metric) {
      case 'errorRate':
        return windowMetrics.filter(m => !m.success).length / windowMetrics.length;

      case 'p99Latency': {
        const latencies = windowMetrics.map(m => m.latencyMs).sort((a, b) => a - b);
        const p99Index = Math.floor(latencies.length * 0.99);
        return latencies[p99Index] || 0;
      }

      case 'mismatchRate': {
        const compared = windowMetrics.filter(m => m.shadowComparison !== 'not-compared');
        if (compared.length === 0) return 0;
        return compared.filter(m => m.shadowComparison === 'mismatch').length / compared.length;
      }

      case 'trafficChange': {
        // Compare current window to previous window of same size
        const prevCutoff = new Date(cutoff.getTime() - windowSeconds * 1000);
        const prevMetrics = this.metrics.filter(
          m => m.timestamp >= prevCutoff && m.timestamp < cutoff
        );
        if (prevMetrics.length === 0) return 0;
        return (windowMetrics.length - prevMetrics.length) / prevMetrics.length;
      }

      default:
        return 0;
    }
  }

  private checkCondition(value: number, condition: string, threshold: number): boolean {
    switch (condition) {
      case 'gt': return value > threshold;
      case 'lt': return value < threshold;
      case 'change': return Math.abs(value) > threshold;
      default: return false;
    }
  }

  private raiseAlert(alert: {
    threshold: AlertThreshold;
    currentValue: number;
    message: string;
  }): void {
    console.log(`[${alert.threshold.severity.toUpperCase()}] ${alert.message}`);

    if (alert.threshold.severity === 'critical') {
      // Trigger automatic rollback
      this.triggerRollback(alert.message);
    }
  }

  private triggerRollback(reason: string): void {
    console.log(`AUTO-ROLLBACK TRIGGERED: ${reason}`);
    // Integration with rollback mechanism
  }

  /**
   * Dashboard data for migration comparison view
   */
  getComparativeMetrics(): {
    monolith: ServiceMetrics;
    microservice: ServiceMetrics;
  } {
    const recentMetrics = this.metrics.filter(
      m => m.timestamp >= new Date(Date.now() - 3600000) // Last hour
    );

    return {
      monolith: this.aggregateMetrics(recentMetrics.filter(m => m.backend === 'monolith')),
      microservice: this.aggregateMetrics(recentMetrics.filter(m => m.backend === 'microservice')),
    };
  }

  private aggregateMetrics(metrics: MigrationMetrics[]): ServiceMetrics {
    if (metrics.length === 0) {
      return { requestCount: 0, errorRate: 0, p50: 0, p95: 0, p99: 0 };
    }

    const sorted = metrics.map(m => m.latencyMs).sort((a, b) => a - b);

    return {
      requestCount: metrics.length,
      errorRate: metrics.filter(m => !m.success).length / metrics.length,
      p50: sorted[Math.floor(sorted.length * 0.5)],
      p95: sorted[Math.floor(sorted.length * 0.95)],
      p99: sorted[Math.floor(sorted.length * 0.99)],
    };
  }
}

interface ServiceMetrics {
  requestCount: number;
  errorRate: number;
  p50: number;
  p95: number;
  p99: number;
}
```

Too many alerts lead to alert fatigue, where teams start ignoring them. Tune alert thresholds carefully. Every alert should be actionable: if an alert fires and the response is 'ignore it,' either fix the underlying issue or remove the alert.
When problems occur despite prevention and are detected by monitoring, you need recovery mechanisms that limit damage and restore service quickly. The key principle is minimizing blast radius—containing the impact of any failure.
Blast Radius Containment Strategies:
Percentage-Based Limiting: Route only a small, configurable share of traffic (1%, 5%, 25%, ...) to the new service so any failure touches a bounded fraction of users; see the sketch after this list.

Geographic Isolation: Roll out one region or data center at a time, so a regression stays contained to a single geography.

User Segment Isolation: Start with low-risk cohorts such as internal users or opted-in beta customers before exposing the general population.

Time-Based Isolation: Make risky changes during low-traffic windows and time-box experiments, so exposure stays limited even when detection is slow.
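A possible shape for these rules inside the routing façade, assuming a simple rollout configuration; `RolloutConfig`, `RequestContext`, and `chooseBackend` are illustrative names rather than an established API:

```typescript
// Sketch of blast-radius rules inside a routing façade (names illustrative).
interface RolloutConfig {
  percentage: number;        // 0-100: share of traffic allowed on the new service
  allowedRegions: string[];  // geographic isolation, e.g. ['eu-west-1']
  allowedSegments: string[]; // user segment isolation, e.g. ['internal', 'beta']
}

interface RequestContext {
  userId: string;
  region: string;
  segment: string;
}

function chooseBackend(ctx: RequestContext, cfg: RolloutConfig): 'monolith' | 'microservice' {
  // Geographic and segment isolation: anything outside the allowed slice stays on the monolith.
  if (!cfg.allowedRegions.includes(ctx.region)) return 'monolith';
  if (!cfg.allowedSegments.includes(ctx.segment)) return 'monolith';

  // Percentage-based limiting with a stable hash so each user sticks to one backend.
  let hash = 0;
  for (const ch of ctx.userId) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return hash % 100 < cfg.percentage ? 'microservice' : 'monolith';
}
```

Hashing on the user ID keeps routing sticky, so a given user does not flip between backends mid-session as the percentage changes.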
Recovery Time Objectives (RTO):
Define and practice recovery times for each failure mode; a per-request fallback sketch for the fastest tier follows the table:
| Failure Type | Target RTO |
|---|---|
| Service errors | < 1 minute (auto-fallback) |
| Performance regression | < 5 minutes |
| Data discrepancy | < 30 minutes (traffic stop) |
| Data corruption | < 4 hours (full recovery) |
| Security incident | Immediate (full shutdown) |
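For the fastest tier, the '< 1 minute (auto-fallback)' target is typically met per request rather than per deployment. A minimal sketch, assuming both backends are callable as async functions and using an illustrative timeout value:

```typescript
// Sketch of per-request fallback: try the new service, fall back to the monolith
// on error or timeout. Callables and the timeout are illustrative.
async function withFallback<T>(
  callMicroservice: () => Promise<T>,
  callMonolith: () => Promise<T>,
  timeoutMs = 2000
): Promise<T> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error('microservice timeout')), timeoutMs)
  );

  try {
    return await Promise.race([callMicroservice(), timeout]);
  } catch (err) {
    console.warn(`Falling back to monolith: ${(err as Error).message}`);
    return callMonolith();
  }
}
```

The trade-off is added latency on failure: each fallback waits for the error or timeout before the monolith is tried, so the timeout must be tight enough to keep the user experience acceptable.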
Practice Recovery:
Recovery mechanisms that aren't tested don't work when needed. Run regular drills: rehearse the automated fallback, manually roll routing back to the monolith, and restore data from backup, timing each exercise against its RTO.
Every migrated service should have a 'kill switch'—a single action that completely disables it and routes all traffic back to the monolith. This should be accessible via simple command (CLI, button) and tested regularly. In a crisis, you need to act in seconds, not minutes.
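One way such a kill switch might look, assuming the flag lives in a shared store so every façade instance reacts at once; the `FlagStore` interface, key name, and functions are illustrative:

```typescript
// Sketch of a kill switch backed by a shared flag store (Redis, a feature-flag
// service, or a config map in practice). Names are illustrative.
interface FlagStore {
  get(key: string): Promise<boolean>;
  set(key: string, value: boolean): Promise<void>;
}

const KILL_SWITCH_KEY = 'migration.user-service.kill-switch';

// Wired to a CLI command or dashboard button for use during an incident.
async function engageKillSwitch(flags: FlagStore, reason: string): Promise<void> {
  await flags.set(KILL_SWITCH_KEY, true);
  console.log(`KILL SWITCH ENGAGED (${reason}): all traffic returns to the monolith`);
}

// Checked in the routing façade before any other rollout rule.
async function resolveBackend(
  flags: FlagStore,
  backendFromRollout: 'monolith' | 'microservice'
): Promise<'monolith' | 'microservice'> {
  return (await flags.get(KILL_SWITCH_KEY)) ? 'monolith' : backendFromRollout;
}
```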
Data risks deserve special attention because they're often irreversible. A code bug can be fixed; lost or corrupted data may be unrecoverable. Data protection must be a first-class concern throughout migration.
The Data Migration Safety Protocol: back up before every irreversible step, dual-write with verification while both systems are live, run continuous consistency checks, keep a rehearsed reconciliation procedure ready, and never decommission the old data store until the new one has proven itself in production.
Red Lines:
Some data operations are truly irreversible: dropping tables, overwriting without backup, or exceeding backup retention periods. Before any such operation, get explicit sign-off, verify backups twice, and have a verified restore procedure ready. These are 'measure twice, cut once' moments.
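Detection belongs in data protection too. Below is a sketch of the kind of periodic consistency check referenced later in the risk register, assuming both stores can list the records being migrated; `UserStore`, `UserRecord`, and the compared fields are illustrative:

```typescript
// Sketch of a periodic consistency check between the monolith database and the
// new service's store during the dual-write period.
interface UserRecord {
  id: string;
  email: string;
  updatedAt: string;
}

interface UserStore {
  fetchAll(): Promise<UserRecord[]>;
}

interface ConsistencyReport {
  checked: number;
  missingInNew: string[];
  missingInOld: string[];
  mismatched: string[];
  discrepancyRate: number;
}

async function checkConsistency(oldStore: UserStore, newStore: UserStore): Promise<ConsistencyReport> {
  const [oldRecords, newRecords] = await Promise.all([oldStore.fetchAll(), newStore.fetchAll()]);
  const oldById = new Map(oldRecords.map(r => [r.id, r] as const));
  const newById = new Map(newRecords.map(r => [r.id, r] as const));

  const missingInNew = oldRecords.filter(r => !newById.has(r.id)).map(r => r.id);
  const missingInOld = newRecords.filter(r => !oldById.has(r.id)).map(r => r.id);
  // Compare the fields that matter (here just email) for records present in both stores.
  const mismatched = oldRecords
    .filter(r => newById.has(r.id) && newById.get(r.id)!.email !== r.email)
    .map(r => r.id);

  const checked = oldById.size;
  const discrepancies = missingInNew.length + missingInOld.length + mismatched.length;

  return {
    checked,
    missingInNew,
    missingInOld,
    mismatched,
    discrepancyRate: checked === 0 ? 0 : discrepancies / checked,
  };
}
```

A report like this feeds the '>0.01% discrepancy rate' trigger in the risk register below and gives the reconciliation runbook a concrete list of affected records.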
Technical risks are only half the equation. Organizational risks—team dynamics, communication failures, unclear ownership—cause as many migration failures as technical problems. Managing these requires different tools.
The RACI Matrix for Migration:
Define clear responsibilities using RACI (Responsible, Accountable, Consulted, Informed):
| Activity | Platform Team | Service Team | Architecture | Leadership |
|---|---|---|---|---|
| Routing changes | R | C | C | I |
| Service extraction | C | R | A | I |
| Data migration | C | R | C | I |
| Incident response | R | R | C | I |
| Go/No-Go decisions | C | C | R | A |
| Budget approval | I | I | C | A |
Legend: R=Responsible (does the work), A=Accountable (final decision), C=Consulted, I=Informed
If one person leaving would significantly impair migration progress, you have a critical organizational risk. Mitigate by ensuring at least two people deeply understand every critical system or process. Document aggressively—the documentation is the backup for when memory isn't available.
Professional risk management requires a risk register—a living document that captures identified risks, their status, mitigations, and owners. This transforms ad-hoc worry into systematic management.
```yaml
# Migration Risk Register
# Review: Weekly at migration standup
# Owner: Migration Lead

risks:
  - id: RISK-001
    title: "User service latency regression"
    category: performance
    likelihood: medium
    impact: high
    detection: high
    priority: 1
    status: mitigating
    description: |
      New user service may have higher latency due to additional network hop
      and database query patterns. Could affect checkout conversion rate.
    mitigations:
      - type: preventive
        action: "Load test at 3x peak traffic"
        owner: platform-team
        status: complete
      - type: preventive
        action: "Implement caching layer"
        owner: service-team
        status: in-progress
        due: 2024-02-15
      - type: detective
        action: "Set up P99 latency alerting with 500ms threshold"
        owner: sre-team
        status: complete
      - type: recovery
        action: "Automated fallback to monolith on latency spike"
        owner: platform-team
        status: complete
    triggers:
      - P99 latency exceeds 500ms for 2 minutes
    response: |
      1. Automatic fallback will reduce traffic to 0%
      2. On-call investigates and determines root cause
      3. Fix deployed and shadow-tested before retry

  - id: RISK-002
    title: "Data inconsistency during dual-write period"
    category: data-integrity
    likelihood: medium
    impact: critical
    detection: low
    priority: 1
    status: mitigating
    description: |
      During the period when both systems write user data, inconsistencies may
      arise from race conditions, network failures, or application bugs.
    mitigations:
      - type: preventive
        action: "Implement outbox pattern for reliable event propagation"
        owner: service-team
        status: complete
      - type: preventive
        action: "Use optimistic locking with version numbers"
        owner: service-team
        status: in-progress
        due: 2024-02-20
      - type: detective
        action: "Hourly consistency check comparing both databases"
        owner: data-team
        status: complete
      - type: recovery
        action: "Documented reconciliation procedure with runbook"
        owner: data-team
        status: in-progress
        due: 2024-02-10
    triggers:
      - Consistency check finds >0.01% discrepancy rate
      - Any data loss reported
    response: |
      1. Stop writes to new service immediately
      2. Run reconciliation script to identify affected records
      3. Restore from backup if necessary
      4. Root cause analysis before resuming

  - id: RISK-003
    title: "Key team member departure"
    category: organizational
    likelihood: low
    impact: high
    detection: high
    priority: 2
    status: mitigating
    description: |
      If a key migration team member leaves unexpectedly, critical knowledge
      could be lost, slowing migration significantly.
    mitigations:
      - type: preventive
        action: "Document all major decisions in ADRs"
        owner: tech-lead
        status: ongoing
      - type: preventive
        action: "Pair programming for all migration work"
        owner: team-leads
        status: ongoing
      - type: preventive
        action: "Bi-weekly knowledge transfer sessions"
        owner: tech-lead
        status: ongoing
      - type: recovery
        action: "Identify backup resources who can ramp up"
        owner: engineering-manager
        status: complete
    triggers:
      - Resignation announced
      - Extended leave >2 weeks
    response: |
      1. Immediate knowledge transfer sessions
      2. Review and enhance documentation
      3. Pause complex migration work until coverage confirmed

# Review History
reviews:
  - date: 2024-01-15
    attendees: [migration-lead, tech-lead, sre-lead]
    notes: |
      - RISK-001 downgraded after successful load testing
      - Added RISK-003 based on team feedback
      - Next review: 2024-01-22
```

Risk Register Practices:
Weekly Review: Review the risk register at weekly migration standup. Update likelihood/status, check mitigation progress.
New Risk Addition: Any team member can add risks. Lower the bar—it's better to have too many than miss critical ones.
Mitigation Tracking: Every high-priority risk must have at least one mitigation in progress. Unmitigated high-impact risks are blockers.
Closure Criteria: Risks can be closed when mitigations are complete and the risk is no longer applicable. Document why.
Post-Incident Updates: After any incident, review the risk register. Was this risk listed? Should it have been? What mitigations failed?
Before each major migration phase, conduct a pre-mortem: 'Assume the migration failed. What went wrong?' This surfaces risks that optimistic planning overlooks. Add identified risks to the register and ensure mitigations exist.
Risk management transforms migration from an anxious gamble into a controlled journey. With proper prevention, detection, and recovery, you can confidently navigate even complex migrations.
Module Complete:
You have now completed the comprehensive study of the Strangler Fig Pattern. From gradual migration strategy through routing façades, functionality extraction, cutover strategies, and risk mitigation, you have the complete toolkit for successfully migrating from monolith to microservices.
The next step is application. Start with a small extraction, apply these principles, learn from the experience, and iterate. Each migration makes you more skilled, and each extracted service brings you closer to your target architecture.
You now have a complete understanding of the Strangler Fig Pattern for migrating from monolith to microservices. You can identify extraction candidates, build routing façades, execute safe cutovers, and systematically manage risks throughout the journey. With this knowledge, you're equipped to lead complex architecture migrations with confidence.