On March 10, 2021, a major fire at a European data center left millions of websites offline. Some companies recovered within hours; others took days; some lost data permanently. The difference wasn't luck—it was failover architecture.
Companies with robust failover strategies had their traffic redirected to other data centers before most customers noticed anything unusual. Their DNS was already pointing to backup sites. Their databases had been synchronously replicated. Their operational runbooks had been tested monthly.
Companies without these preparations scrambled to restore from backups, reconfigure DNS, and explain to customers why 'cloud hosting' didn't mean 'immune to fire.'
Failover is the ultimate test of distributed system design. It's easy to build systems that work when everything is healthy. The true measure of engineering is what happens when components fail—and how quickly, completely, and automatically the system recovers.
By the end of this page, you will understand the spectrum of failover strategies—from simple hot standby to sophisticated multi-region active-active architectures. You'll learn how to design failover mechanisms, understand the trade-offs between failover speed and data safety, and implement recovery procedures that restore full service after failures.
Failover is the process of automatically or manually switching from a failed component to a backup component. The goal is to maintain service continuity when primary systems fail.
Key Failover Metrics:

- Recovery Time Objective (RTO): the maximum acceptable time between a failure and restored service.
- Recovery Point Objective (RPO): the maximum acceptable data loss, measured as time (for example, "no more than one minute of committed writes").

These metrics drive all failover architecture decisions. A system with an RTO of 1 second requires fundamentally different architecture than one with an RTO of 4 hours.
| RTO Target | RPO Target | Architecture Required | Cost Level |
|---|---|---|---|
| < 1 minute | Zero data loss | Synchronous replication, active-active, automated failover | Very High |
| 1-15 minutes | < 1 minute data loss | Asynchronous replication, warm standby, automated failover | High |
| 15-60 minutes | < 15 minutes data loss | Pilot light infrastructure, semi-automated failover | Medium |
| 1-4 hours | < 1 hour data loss | Cold standby, backup restoration, manual failover | Low |
| 4-24 hours | < 24 hours data loss | Disaster recovery from backups only | Minimal |
Your actual RTO includes human decision time. If failover requires manual approval, add the time to page someone, assess the situation, and make a decision. At 3 AM, this can be 15+ minutes. Fully automated failover is the only way to achieve sub-minute RTO.
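A quick way to sanity-check an RTO target is to write the budget down explicitly: total recovery time is detection plus decision plus execution plus traffic propagation. The sketch below is illustrative only (every number is an assumption), but it shows why the human decision step dominates manual failover:

```typescript
// Illustrative RTO budget calculator; every number here is an assumption.
interface RtoBudget {
  detectionSec: number;   // Health checks noticing the failure
  decisionSec: number;    // 0 for automated failover, minutes for a paged human
  executionSec: number;   // Promote replica, fence old primary, update topology
  propagationSec: number; // DNS TTL expiry or load balancer convergence
}

const totalRtoSec = (b: RtoBudget): number =>
  b.detectionSec + b.decisionSec + b.executionSec + b.propagationSec;

const automated: RtoBudget = { detectionSec: 15, decisionSec: 0, executionSec: 20, propagationSec: 10 };
const manual: RtoBudget    = { detectionSec: 15, decisionSec: 900, executionSec: 120, propagationSec: 60 };

console.log(`Automated: ${totalRtoSec(automated)}s`); // ~45s: sub-minute RTO is feasible
console.log(`Manual:    ${totalRtoSec(manual)}s`);    // ~18 minutes, dominated by the 3 AM page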
Failover architectures exist on a spectrum from simple to sophisticated, with corresponding trade-offs in cost, complexity, and recovery capability.
Architecture Spectrum:

- Backup and restore: data is backed up off-site and infrastructure is rebuilt after a failure; cheapest, slowest to recover.
- Cold standby: backup infrastructure is provisioned but kept offline until a disaster is declared.
- Pilot light: a minimal core (replicated databases, critical services) runs continuously and is scaled out during failover.
- Warm standby: a scaled-down but fully functional copy of production runs continuously and can take over quickly.
- Active-active: multiple sites serve traffic simultaneously, so losing one site simply shifts load to the others.
Active-active is simple for stateless services but complex for stateful data. If users can write to multiple regions, you need conflict resolution strategies: last-write-wins, custom merge logic, or CRDTs. Many systems achieve 'active-active' by routing users to a specific region for all their data, avoiding true multi-master writes.
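Last-write-wins is the simplest of those conflict resolution strategies. The sketch below is a minimal illustration (the record shape and the replica-id tie-breaker are assumptions, and note that LWW silently discards the losing write):

```typescript
// Minimal last-write-wins merge for a replicated key-value record.
// The record shape and tie-breaking rule are illustrative assumptions.
interface ReplicatedRecord<T> {
  value: T;
  updatedAtMs: number; // Wall-clock timestamp of the write
  replicaId: string;   // Tie-breaker when timestamps collide
}

function mergeLastWriteWins<T>(
  a: ReplicatedRecord<T>,
  b: ReplicatedRecord<T>
): ReplicatedRecord<T> {
  // Prefer the later timestamp; break ties deterministically by replica id
  // so every region converges to the same winner.
  if (a.updatedAtMs !== b.updatedAtMs) {
    return a.updatedAtMs > b.updatedAtMs ? a : b;
  }
  return a.replicaId > b.replicaId ? a : b;
}

// Example: the same profile field written in two regions during a partition.
const usWrite: ReplicatedRecord<string> = { value: 'alice@new.example', updatedAtMs: 1710000000500, replicaId: 'us-east-1' };
const euWrite: ReplicatedRecord<string> = { value: 'alice@old.example', updatedAtMs: 1710000000200, replicaId: 'eu-west-1' };
console.log(mergeLastWriteWins(usWrite, euWrite).value); // 'alice@new.example'
```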
Database failover deserves special attention because databases are often the most critical and most challenging component to fail over. The data consistency requirements, replication lag, and connection management complexities make database failover a distinct discipline.
Database Failover Modes:
| Replication Mode | Data Loss Risk | Failover Speed | Performance Impact | Use Case |
|---|---|---|---|---|
| Synchronous | Zero (RPO=0) | Fast | Higher latency (wait for replica ack) | Financial, critical data |
| Semi-synchronous | Minimal (~1 transaction) | Fast | Moderate latency | Most production systems |
| Asynchronous | Some (seconds to minutes) | Fast | Minimal latency | Read replicas, analytics |
| Log shipping | Variable (shipping interval) | Slow (apply logs) | Minimal | Disaster recovery |
Automatic Database Failover Components:
```typescript
// TypeScript: Database Failover Coordinator Pattern

interface DatabaseNode {
  id: string;
  host: string;
  port: number;
  role: 'primary' | 'replica' | 'unknown';
  isHealthy: boolean;
  replicationLag: number; // Seconds behind primary
  lastHeartbeat: number;
}

interface FailoverConfig {
  healthCheckIntervalMs: number;
  failureThresholdMs: number; // Time before declaring failure
  maxReplicationLag: number;  // Max acceptable lag for promotion
  fencingTimeoutMs: number;   // Time to ensure old primary is fenced
  requireQuorum: boolean;     // Require quorum agreement for failover
}

class DatabaseFailoverCoordinator {
  private nodes: Map<string, DatabaseNode> = new Map();
  private config: FailoverConfig;
  private currentPrimary: string | null = null;

  constructor(config: FailoverConfig) {
    this.config = config;
  }

  /**
   * Detect primary failure and initiate failover
   */
  async checkAndFailover(): Promise<{
    failoverOccurred: boolean;
    newPrimary?: string;
    oldPrimary?: string;
    reason?: string;
  }> {
    const primary = this.getCurrentPrimary();
    if (!primary) {
      return { failoverOccurred: false, reason: 'No primary configured' };
    }

    // Check primary health
    const isHealthy = await this.checkNodeHealth(primary);
    if (isHealthy) {
      return { failoverOccurred: false, reason: 'Primary is healthy' };
    }

    // Verify failure with quorum if configured
    if (this.config.requireQuorum) {
      const quorumAgreed = await this.checkQuorumAgreement(primary);
      if (!quorumAgreed) {
        console.log('Quorum does not agree on primary failure - possible network partition');
        return { failoverOccurred: false, reason: 'Quorum disagreement' };
      }
    }

    // Select best replica for promotion
    const newPrimary = await this.selectPromotionCandidate();
    if (!newPrimary) {
      throw new Error('No suitable replica available for promotion');
    }

    // Execute failover
    await this.executeFailover(primary, newPrimary);

    return {
      failoverOccurred: true,
      newPrimary: newPrimary.id,
      oldPrimary: primary.id,
      reason: 'Primary failure detected and failover completed'
    };
  }

  /**
   * Select the best replica to promote to primary
   */
  private async selectPromotionCandidate(): Promise<DatabaseNode | null> {
    const replicas = Array.from(this.nodes.values())
      .filter(n => n.role === 'replica' && n.isHealthy);

    if (replicas.length === 0) {
      return null;
    }

    // Sort by replication lag (lowest first)
    replicas.sort((a, b) => a.replicationLag - b.replicationLag);

    // Find first replica within acceptable lag
    for (const replica of replicas) {
      if (replica.replicationLag <= this.config.maxReplicationLag) {
        return replica;
      }
    }

    // If no replica within threshold, return lowest lag replica with warning
    console.warn(
      `All replicas exceed max replication lag. Selecting ${replicas[0].id} with lag ${replicas[0].replicationLag}s`
    );
    return replicas[0];
  }

  /**
   * Execute the failover sequence
   */
  private async executeFailover(
    oldPrimary: DatabaseNode,
    newPrimary: DatabaseNode
  ): Promise<void> {
    console.log(`[Failover] Starting: ${oldPrimary.id} → ${newPrimary.id}`);

    // Step 1: Fence the old primary (prevent split-brain)
    await this.fenceNode(oldPrimary);
    console.log(`[Failover] Step 1: Old primary ${oldPrimary.id} fenced`);

    // Step 2: Wait for new primary to apply any remaining replication
    await this.waitForReplicationCatchup(newPrimary);
    console.log(`[Failover] Step 2: Replication caught up on ${newPrimary.id}`);

    // Step 3: Promote new primary
    await this.promoteToMaster(newPrimary);
    console.log(`[Failover] Step 3: ${newPrimary.id} promoted to primary`);

    // Step 4: Update topology/discovery service
    await this.updateTopology(newPrimary);
    console.log(`[Failover] Step 4: Topology updated`);

    // Step 5: Redirect replicas to new primary
    await this.redirectReplicas(newPrimary);
    console.log(`[Failover] Step 5: Replicas redirected`);

    // Update internal state
    oldPrimary.role = 'unknown';
    newPrimary.role = 'primary';
    this.currentPrimary = newPrimary.id;

    console.log(`[Failover] Complete: ${newPrimary.id} is now primary`);
  }

  /**
   * Fence a node to prevent it from accepting writes (split-brain prevention)
   */
  private async fenceNode(node: DatabaseNode): Promise<void> {
    // Fencing strategies:
    // 1. STONITH (Shoot The Other Node In The Head) - power off the node
    // 2. Revoke write permissions at network level
    // 3. Use a fencing token that must be held to write
    // 4. Instruct the node to enter read-only mode

    // Implementation depends on infrastructure
    await this.sendFencingCommand(node);

    // Wait for fencing to take effect
    await new Promise(resolve =>
      setTimeout(resolve, this.config.fencingTimeoutMs)
    );
  }

  // Infrastructure-specific implementations
  private async checkNodeHealth(node: DatabaseNode): Promise<boolean> {
    // Implement health check logic
    throw new Error('Implementation required');
  }

  private async checkQuorumAgreement(primary: DatabaseNode): Promise<boolean> {
    // Ask majority of nodes if they can reach primary
    throw new Error('Implementation required');
  }

  private async waitForReplicationCatchup(node: DatabaseNode): Promise<void> {
    throw new Error('Implementation required');
  }

  private async promoteToMaster(node: DatabaseNode): Promise<void> {
    throw new Error('Implementation required');
  }

  private async updateTopology(newPrimary: DatabaseNode): Promise<void> {
    throw new Error('Implementation required');
  }

  private async redirectReplicas(newPrimary: DatabaseNode): Promise<void> {
    throw new Error('Implementation required');
  }

  private async sendFencingCommand(node: DatabaseNode): Promise<void> {
    throw new Error('Implementation required');
  }

  private getCurrentPrimary(): DatabaseNode | null {
    if (!this.currentPrimary) return null;
    return this.nodes.get(this.currentPrimary) || null;
  }
}
```

If both old and new primaries accept writes simultaneously (split-brain), data corruption occurs. Fencing is critical—you must guarantee the old primary cannot write before the new one starts. Use STONITH, network fencing, or distributed locks to prevent split-brain.
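One software-level fencing approach mentioned above is a fencing token: a number that increases on every promotion, which storage uses to reject writes from a deposed primary. A minimal sketch (the token source, for example a lock service that increments it on each acquisition, is assumed and not shown):

```typescript
// Minimal fencing-token check: storage rejects writes that carry a token
// older than the newest one it has seen. Names here are illustrative.
class FencedStorage {
  private highestTokenSeen = 0;
  private data = new Map<string, string>();

  write(fencingToken: number, key: string, value: string): void {
    if (fencingToken < this.highestTokenSeen) {
      throw new Error(
        `Stale fencing token ${fencingToken} (latest is ${this.highestTokenSeen}) - writer has been deposed`
      );
    }
    this.highestTokenSeen = fencingToken;
    this.data.set(key, value);
  }
}

// Usage: the old primary holds token 33, the newly promoted primary token 34.
const storage = new FencedStorage();
storage.write(34, 'balance:42', '100');  // New primary: accepted
try {
  storage.write(33, 'balance:42', '95'); // Old primary waking up after a partition: rejected
} catch (e) {
  console.log((e as Error).message);
}
```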
Even with backend failover in place, clients must be directed to the backup systems. This is where DNS and traffic management strategies become critical. The speed of traffic redirection is often the limiting factor in overall failover time.
Traffic Failover Strategies:
| Mechanism | Failover Speed | Pros | Cons |
|---|---|---|---|
| DNS TTL-based | Minutes to hours (TTL + client caching) | Simple, universal | Slow, client TTL caching unpredictable |
| Low TTL DNS | 30-60 seconds | Faster than standard DNS | Higher DNS query load, still has propagation delay |
| DNS health-aware | 30-60 seconds (automated) | Automated failover | Still DNS propagation delay |
| Global Load Balancer | Seconds | Fast, health-aware | Single vendor dependency |
| Anycast | Near-instant | Fastest possible | Complex BGP management, limited to IP routing |
| Client-side failover | Near-instant | Fastest for aware clients | Requires client implementation |
DNS TTL Considerations:
DNS Time-To-Live (TTL) controls how long resolvers and clients cache DNS responses. For disaster recovery, the trade-off is direct: a low TTL lets clients pick up the failover record quickly but raises query load on your DNS infrastructure, while a high TTL reduces load but pins cached clients to the failed endpoint until the record expires.
Reality check: even with a 60-second TTL, many clients (browsers, operating systems, the JVM) apply their own caching rules and may hold records longer than the published TTL. AWS Route 53 guidance suggests planning for one to two minutes of DNS failover time even with low TTLs.
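Client-side failover, the last row in the table above, sidesteps DNS caching entirely: the client knows both endpoints and retries against the secondary when the primary fails or times out. A minimal sketch, with illustrative URLs and timeouts:

```typescript
// Minimal client-side failover using fetch with an ordered endpoint list.
// Requires a runtime with global fetch and AbortSignal.timeout (Node 18+ / modern browsers).
// The URLs, timeout, and retry policy are illustrative assumptions.
const ENDPOINTS = [
  'https://api.example.com',
  'https://api-secondary.example.com',
];

async function fetchWithFailover(path: string, timeoutMs = 2000): Promise<Response> {
  let lastError: unknown;
  for (const base of ENDPOINTS) {
    try {
      const response = await fetch(`${base}${path}`, {
        signal: AbortSignal.timeout(timeoutMs), // Abort slow endpoints quickly
      });
      if (response.ok) return response;
      lastError = new Error(`HTTP ${response.status} from ${base}`);
    } catch (err) {
      lastError = err; // Network error or timeout - try the next endpoint
    }
  }
  throw new Error(`All endpoints failed: ${String(lastError)}`);
}

// Usage: callers never know which region actually served the request.
// const orders = await (await fetchWithFailover('/v1/orders')).json();
```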
```yaml
# AWS Route 53 Health Check and Failover Configuration
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  # Primary endpoint health check
  PrimaryHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        IPAddress: 192.0.2.1
        Port: 443
        Type: HTTPS
        ResourcePath: /health/ready
        FullyQualifiedDomainName: api.example.com
        RequestInterval: 10      # Check every 10 seconds
        FailureThreshold: 2      # Fail after 2 consecutive failures
        MeasureLatency: true
        EnableSNI: true
      HealthCheckTags:
        - Key: Name
          Value: Primary-DC-Health

  # Secondary endpoint health check
  SecondaryHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        IPAddress: 192.0.2.2
        Port: 443
        Type: HTTPS
        ResourcePath: /health/ready
        FullyQualifiedDomainName: api-secondary.example.com
        RequestInterval: 10
        FailureThreshold: 2

  # Failover DNS record for primary
  PrimaryDNSRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: Z1234567890ABC
      Name: api.example.com
      Type: A
      TTL: 60                    # Low TTL for faster failover
      SetIdentifier: primary
      Failover: PRIMARY          # Primary in failover policy
      HealthCheckId: !Ref PrimaryHealthCheck
      ResourceRecords:
        - 192.0.2.1

  # Failover DNS record for secondary
  SecondaryDNSRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: Z1234567890ABC
      Name: api.example.com
      Type: A
      TTL: 60
      SetIdentifier: secondary
      Failover: SECONDARY        # Secondary in failover policy
      HealthCheckId: !Ref SecondaryHealthCheck
      ResourceRecords:
        - 192.0.2.2

---
# Global Accelerator for faster failover (bypasses DNS)
Resources:
  GlobalAccelerator:
    Type: AWS::GlobalAccelerator::Accelerator
    Properties:
      Name: api-accelerator
      Enabled: true
      IpAddressType: IPV4

  Listener:
    Type: AWS::GlobalAccelerator::Listener
    Properties:
      AcceleratorArn: !Ref GlobalAccelerator
      Protocol: TCP
      PortRanges:
        - FromPort: 443
          ToPort: 443

  EndpointGroup:
    Type: AWS::GlobalAccelerator::EndpointGroup
    Properties:
      ListenerArn: !Ref Listener
      EndpointGroupRegion: us-east-1
      HealthCheckIntervalSeconds: 10
      ThresholdCount: 2
      EndpointConfigurations:
        - EndpointId: !Ref PrimaryALB
          Weight: 100
          ClientIPPreservationEnabled: true
        - EndpointId: !Ref SecondaryALB
          Weight: 0              # Zero weight = standby
```

Anycast routing announces the same IP address from multiple locations. BGP routing automatically directs clients to the nearest healthy instance. When one location fails, BGP reconverges in seconds—no DNS involved. CDNs and DNS providers use this extensively. AWS Global Accelerator and Cloudflare use anycast.
Failover is only half the story. After the primary systems recover, you need to decide whether and how to return to the original configuration. This is failback—and it's often more complex than the initial failover.
Failback Strategies:

- Full failback: return service to the original primary as soon as it is restored and verified.
- Gradual failback: shift traffic back in monitored stages with an automatic rollback path, as automated below.
- Permanent role swap: keep the former standby as the new primary and rebuild the recovered site as the standby.
The Failback Decision Framework:
Before initiating failback, answer these questions:

- Has the recovered primary been continuously healthy for a sustained observation window, not just a single passing check?
- Has every write made during the outage been replicated back to the recovered primary, with replication lag at zero?
- Can traffic be shifted back gradually, with monitoring and a defined error-rate threshold that triggers rollback?
- Is the root cause of the original failure understood and remediated, so failing back won't recreate the outage?
```typescript
// TypeScript: Failback Procedure Automation

interface FailbackConfig {
  verifyHealthDurationMs: number; // How long recovered primary must be healthy
  syncVerificationQuery: string;  // Query to verify data sync completion
  trafficShiftSteps: number[];    // Traffic percentages: [10, 25, 50, 100]
  stepWaitMs: number;             // Time between traffic shift steps
  rollbackThreshold: number;      // Error rate to trigger rollback (0-1)
}

interface FailbackStatus {
  phase: 'not_started' | 'verifying' | 'syncing' | 'shifting' | 'complete' | 'rolled_back';
  currentTrafficPercent: number;
  startTime: number;
  errors: string[];
}

class FailbackController {
  private config: FailbackConfig;
  private status: FailbackStatus;

  constructor(config: FailbackConfig) {
    this.config = config;
    this.status = {
      phase: 'not_started',
      currentTrafficPercent: 0,
      startTime: 0,
      errors: []
    };
  }

  /**
   * Execute controlled failback procedure
   */
  async executeFailback(
    recoveredPrimary: string,
    currentPrimary: string
  ): Promise<FailbackStatus> {
    this.status = {
      phase: 'verifying',
      currentTrafficPercent: 0,
      startTime: Date.now(),
      errors: []
    };

    try {
      // Phase 1: Verify recovered primary health
      console.log('[Failback] Phase 1: Verifying recovered primary health');
      await this.verifyHealth(recoveredPrimary);

      // Phase 2: Verify data synchronization
      console.log('[Failback] Phase 2: Verifying data synchronization');
      this.status.phase = 'syncing';
      await this.verifySyncComplete(recoveredPrimary, currentPrimary);

      // Phase 3: Gradual traffic shift
      console.log('[Failback] Phase 3: Shifting traffic');
      this.status.phase = 'shifting';

      for (const targetPercent of this.config.trafficShiftSteps) {
        await this.shiftTraffic(recoveredPrimary, targetPercent);
        this.status.currentTrafficPercent = targetPercent;

        // Monitor for errors
        const isHealthy = await this.monitorAfterShift();
        if (!isHealthy) {
          console.log(`[Failback] Error threshold exceeded at ${targetPercent}% - rolling back`);
          await this.rollback(currentPrimary);
          this.status.phase = 'rolled_back';
          return this.status;
        }

        console.log(`[Failback] Traffic at ${targetPercent}% - healthy, continuing`);
      }

      // Phase 4: Complete - recovered primary is now current primary
      this.status.phase = 'complete';
      console.log('[Failback] Complete - recovered primary is now serving 100% traffic');

      // Update topology to reflect new primary
      await this.updateTopologyAfterFailback(recoveredPrimary);

    } catch (error) {
      this.status.errors.push(`Failback error: ${(error as Error).message}`);
      console.error('[Failback] Error during failback:', error);

      // Attempt rollback on any error
      try {
        await this.rollback(currentPrimary);
        this.status.phase = 'rolled_back';
      } catch (rollbackError) {
        this.status.errors.push(`Rollback error: ${(rollbackError as Error).message}`);
      }
    }

    return this.status;
  }

  /**
   * Verify primary is consistently healthy
   */
  private async verifyHealth(primary: string): Promise<void> {
    const startTime = Date.now();

    while (Date.now() - startTime < this.config.verifyHealthDurationMs) {
      const isHealthy = await this.checkHealth(primary);
      if (!isHealthy) {
        throw new Error(`Recovered primary ${primary} failed health check`);
      }
      await this.sleep(5000); // Check every 5 seconds
    }

    console.log(`[Failback] Recovered primary healthy for ${this.config.verifyHealthDurationMs}ms`);
  }

  /**
   * Verify data sync is complete (no replication lag)
   */
  private async verifySyncComplete(
    target: string,
    source: string
  ): Promise<void> {
    const maxAttempts = 60;
    let attempts = 0;

    while (attempts < maxAttempts) {
      const lag = await this.getReplicationLag(target, source);
      if (lag === 0) {
        console.log('[Failback] Data synchronization complete');
        return;
      }
      console.log(`[Failback] Replication lag: ${lag}s - waiting`);
      attempts++;
      await this.sleep(5000);
    }

    throw new Error('Replication sync did not complete within timeout');
  }

  /**
   * Monitor error rate after traffic shift
   */
  private async monitorAfterShift(): Promise<boolean> {
    // Wait for traffic to stabilize
    await this.sleep(this.config.stepWaitMs);

    // Get current error rate
    const errorRate = await this.getCurrentErrorRate();

    if (errorRate > this.config.rollbackThreshold) {
      this.status.errors.push(`Error rate ${errorRate} exceeded threshold ${this.config.rollbackThreshold}`);
      return false;
    }

    return true;
  }

  /**
   * Rollback all traffic to specified primary
   */
  private async rollback(target: string): Promise<void> {
    console.log(`[Failback] Rolling back all traffic to ${target}`);
    await this.shiftTraffic(target, 100);
    this.status.currentTrafficPercent = 0;
  }

  // Implementation stubs
  private async checkHealth(primary: string): Promise<boolean> {
    throw new Error('Implementation required');
  }

  private async getReplicationLag(target: string, source: string): Promise<number> {
    throw new Error('Implementation required');
  }

  private async shiftTraffic(target: string, percent: number): Promise<void> {
    throw new Error('Implementation required');
  }

  private async getCurrentErrorRate(): Promise<number> {
    throw new Error('Implementation required');
  }

  private async updateTopologyAfterFailback(primary: string): Promise<void> {
    throw new Error('Implementation required');
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}
```

Most teams test failover but never test failback. When disaster strikes and you need to fail back, you discover the procedure is outdated or doesn't work. Include regular failback tests in your disaster recovery exercises—not just failover.
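When you rehearse failback, the controller above still needs concrete wiring. A sketch of how it might be configured once the infrastructure stubs are implemented (every value and name here is an illustrative assumption):

```typescript
// Example wiring for the FailbackController above; all values are illustrative.
const controller = new FailbackController({
  verifyHealthDurationMs: 10 * 60 * 1000, // Require 10 minutes of clean health checks
  syncVerificationQuery: 'SELECT 1',      // Placeholder sync probe
  trafficShiftSteps: [10, 25, 50, 100],   // Canary-style ramp back to the recovered primary
  stepWaitMs: 5 * 60 * 1000,              // Let each step soak for 5 minutes
  rollbackThreshold: 0.01,                // Roll back if error rate exceeds 1%
});

// From within an async context: 'db-primary-1a' is the recovered node,
// 'db-replica-1b' has been serving as primary since the failover.
const result = await controller.executeFailback('db-primary-1a', 'db-replica-1b');
console.log(`Failback finished in phase: ${result.phase}`);
```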
A failover mechanism that has never been tested is, at best, a hypothesis. Production failures have a way of exposing assumptions that seemed reasonable in design but fail catastrophically in reality.
Types of Failover Tests:
| Test Type | Description | Risk | Frequency |
|---|---|---|---|
| Tabletop Exercise | Walk through procedures without executing | Zero | Monthly |
| Staged Test | Execute in non-production environment | Low | Monthly |
| Partial Production Test | Fail over one component/shard in production | Medium | Quarterly |
| Full Production Test | Complete failover drill in production | High | Semi-annually |
| Chaos Engineering | Random failures injected continuously | Variable | Continuous |
| Game Day | Full scenario simulation with teams | Medium-High | Annually |
Netflix famously runs 'Chaos Monkey', which randomly terminates production instances. The philosophy: if failover only works when you test it, make testing continuous. The result is infrastructure that handles failures as routine events rather than crises.
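You don't need Netflix's tooling to get some of this benefit. A toy version of a scheduled chaos round might look like the sketch below, where the infrastructure client is a stand-in for whatever API actually manages your fleet:

```typescript
// Toy chaos round: terminate one random instance, only during working hours,
// and only if the fleet has spare capacity. The infra client is hypothetical.
interface InfraClient {
  listHealthyInstances(service: string): Promise<string[]>;
  terminateInstance(id: string): Promise<void>;
}

async function runChaosRound(infra: InfraClient, service: string, minHealthy = 3): Promise<void> {
  const hour = new Date().getHours();
  if (hour < 9 || hour > 16) return; // Only inject failures while engineers are around

  const instances = await infra.listHealthyInstances(service);
  if (instances.length <= minHealthy) {
    console.log(`[chaos] Skipping ${service}: only ${instances.length} healthy instances`);
    return;
  }

  const victim = instances[Math.floor(Math.random() * instances.length)];
  console.log(`[chaos] Terminating ${victim} - failover should absorb this automatically`);
  await infra.terminateInstance(victim);
}
```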
Failover is the culmination of all health check and failure detection work. It's the moment when design meets reality—when theoretical redundancy becomes actual service continuity.
Module Complete:
You've now completed a comprehensive study of health checks and failover—from the mechanisms that detect failures, through the endpoints that report health, to the strategies that restore service. Together, these patterns form the foundation of resilient distributed systems.
The next time a server crashes, a network partitions, or a data center catches fire, your systems should handle it automatically while you sleep. That's the promise of well-designed health checking and failover—and now you have the knowledge to deliver it.
Congratulations! You've completed the Health Checks & Failover module. You now understand active and passive health checks, health endpoint design, failure detection algorithms, graceful degradation patterns, and comprehensive failover strategies. These skills are essential for building distributed systems that maintain availability through component failures.