On March 10, 2021, a major fire at a European data center left millions of websites offline. Some companies recovered within hours; others took days; some lost data permanently. The difference wasn't luck—it was failover architecture.
Companies with robust failover strategies had their traffic redirected to other data centers before most customers noticed anything unusual. Their DNS was already pointing to backup sites. Their databases had been synchronously replicated. Their operational runbooks had been tested monthly.
Companies without these preparations scrambled to restore from backups, reconfigure DNS, and explain to customers why 'cloud hosting' didn't mean 'immune to fire.'
Failover is the ultimate test of distributed system design. It's easy to build systems that work when everything is healthy. The true measure of engineering is what happens when components fail—and how quickly, completely, and automatically the system recovers.
By the end of this page, you will understand the spectrum of failover strategies—from simple hot standby to sophisticated multi-region active-active architectures. You'll learn how to design failover mechanisms, understand the trade-offs between failover speed and data safety, and implement recovery procedures that restore full service after failures.
Failover is the process of automatically or manually switching from a failed component to a backup component. The goal is to maintain service continuity when primary systems fail.
Key Failover Metrics:

- Recovery Time Objective (RTO): the maximum acceptable time between a failure and restored service.
- Recovery Point Objective (RPO): the maximum acceptable data loss, measured as time (for example, "no more than one minute of committed writes").

These metrics drive all failover architecture decisions. A system with an RTO of 1 second requires fundamentally different architecture than one with an RTO of 4 hours.
| RTO Target | RPO Target | Architecture Required | Cost Level |
|---|---|---|---|
| < 1 minute | Zero data loss | Synchronous replication, active-active, automated failover | Very High |
| 1-15 minutes | < 1 minute data loss | Asynchronous replication, warm standby, automated failover | High |
| 15-60 minutes | < 15 minutes data loss | Pilot light infrastructure, semi-automated failover | Medium |
| 1-4 hours | < 1 hour data loss | Cold standby, backup restoration, manual failover | Low |
| 4-24 hours | < 24 hours data loss | Disaster recovery from backups only | Minimal |
Your actual RTO includes human decision time. If failover requires manual approval, add the time to page someone, assess the situation, and make a decision. At 3 AM, this can be 15+ minutes. Fully automated failover is the only way to achieve sub-minute RTO.
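A quick way to sanity-check an RTO target is to write the budget down explicitly: total recovery time is detection plus decision plus execution plus traffic propagation. The sketch below is illustrative only (every number is an assumption), but it shows why the human decision step dominates manual failover:

```typescript
// Illustrative RTO budget calculator; every number here is an assumption.
interface RtoBudget {
  detectionSec: number;   // Health checks noticing the failure
  decisionSec: number;    // 0 for automated failover, minutes for a paged human
  executionSec: number;   // Promote replica, fence old primary, update topology
  propagationSec: number; // DNS TTL expiry or load balancer convergence
}

const totalRtoSec = (b: RtoBudget): number =>
  b.detectionSec + b.decisionSec + b.executionSec + b.propagationSec;

const automated: RtoBudget = { detectionSec: 15, decisionSec: 0, executionSec: 20, propagationSec: 10 };
const manual: RtoBudget    = { detectionSec: 15, decisionSec: 900, executionSec: 120, propagationSec: 60 };

console.log(`Automated: ${totalRtoSec(automated)}s`); // ~45s: sub-minute RTO is feasible
console.log(`Manual:    ${totalRtoSec(manual)}s`);    // ~18 minutes, dominated by the 3 AM page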
Failover architectures exist on a spectrum from simple to sophisticated, with corresponding trade-offs in cost, complexity, and recovery capability.
Architecture Spectrum:

- Backup and restore: data is backed up off-site and infrastructure is rebuilt after a failure; cheapest, slowest to recover.
- Cold standby: backup infrastructure is provisioned but kept offline until a disaster is declared.
- Pilot light: a minimal core (replicated databases, critical services) runs continuously and is scaled out during failover.
- Warm standby: a scaled-down but fully functional copy of production runs continuously and can take over quickly.
- Active-active: multiple sites serve traffic simultaneously, so losing one site simply shifts load to the others.
Active-active is simple for stateless services but complex for stateful data. If users can write to multiple regions, you need conflict resolution strategies: last-write-wins, custom merge logic, or CRDTs. Many systems achieve 'active-active' by routing users to a specific region for all their data, avoiding true multi-master writes.
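Last-write-wins is the simplest of those conflict resolution strategies. The sketch below is a minimal illustration (the record shape and the replica-id tie-breaker are assumptions, and note that LWW silently discards the losing write):

```typescript
// Minimal last-write-wins merge for a replicated key-value record.
// The record shape and tie-breaking rule are illustrative assumptions.
interface ReplicatedRecord<T> {
  value: T;
  updatedAtMs: number; // Wall-clock timestamp of the write
  replicaId: string;   // Tie-breaker when timestamps collide
}

function mergeLastWriteWins<T>(
  a: ReplicatedRecord<T>,
  b: ReplicatedRecord<T>
): ReplicatedRecord<T> {
  // Prefer the later timestamp; break ties deterministically by replica id
  // so every region converges to the same winner.
  if (a.updatedAtMs !== b.updatedAtMs) {
    return a.updatedAtMs > b.updatedAtMs ? a : b;
  }
  return a.replicaId > b.replicaId ? a : b;
}

// Example: the same profile field written in two regions during a partition.
const usWrite: ReplicatedRecord<string> = { value: 'alice@new.example', updatedAtMs: 1710000000500, replicaId: 'us-east-1' };
const euWrite: ReplicatedRecord<string> = { value: 'alice@old.example', updatedAtMs: 1710000000200, replicaId: 'eu-west-1' };
console.log(mergeLastWriteWins(usWrite, euWrite).value); // 'alice@new.example'
```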
Database failover deserves special attention because databases are often the most critical and most challenging component to fail over. The data consistency requirements, replication lag, and connection management complexities make database failover a distinct discipline.
Database Failover Modes:
| Replication Mode | Data Loss Risk | Failover Speed | Performance Impact | Use Case |
|---|---|---|---|---|
| Synchronous | Zero (RPO=0) | Fast | Higher latency (wait for replica ack) | Financial, critical data |
| Semi-synchronous | Minimal (~1 transaction) | Fast | Moderate latency | Most production systems |
| Asynchronous | Some (seconds to minutes) | Fast | Minimal latency | Read replicas, analytics |
| Log shipping | Variable (shipping interval) | Slow (apply logs) | Minimal | Disaster recovery |
Automatic Database Failover Components:
```typescript
// TypeScript: Database Failover Coordinator Pattern

interface DatabaseNode {
  id: string;
  host: string;
  port: number;
  role: 'primary' | 'replica' | 'unknown';
  isHealthy: boolean;
  replicationLag: number; // Seconds behind primary
  lastHeartbeat: number;
}

interface FailoverConfig {
  healthCheckIntervalMs: number;
  failureThresholdMs: number; // Time before declaring failure
  maxReplicationLag: number;  // Max acceptable lag for promotion
  fencingTimeoutMs: number;   // Time to ensure old primary is fenced
  requireQuorum: boolean;     // Require quorum agreement for failover
}

class DatabaseFailoverCoordinator {
  private nodes: Map<string, DatabaseNode> = new Map();
  private config: FailoverConfig;
  private currentPrimary: string | null = null;

  constructor(config: FailoverConfig) {
    this.config = config;
  }

  /**
   * Detect primary failure and initiate failover
   */
  async checkAndFailover(): Promise<{
    failoverOccurred: boolean;
    newPrimary?: string;
    oldPrimary?: string;
    reason?: string;
  }> {
    const primary = this.getCurrentPrimary();
    if (!primary) {
      return { failoverOccurred: false, reason: 'No primary configured' };
    }

    // Check primary health
    const isHealthy = await this.checkNodeHealth(primary);
    if (isHealthy) {
      return { failoverOccurred: false, reason: 'Primary is healthy' };
    }

    // Verify failure with quorum if configured
    if (this.config.requireQuorum) {
      const quorumAgreed = await this.checkQuorumAgreement(primary);
      if (!quorumAgreed) {
        console.log('Quorum does not agree on primary failure - possible network partition');
        return { failoverOccurred: false, reason: 'Quorum disagreement' };
      }
    }

    // Select best replica for promotion
    const newPrimary = await this.selectPromotionCandidate();
    if (!newPrimary) {
      throw new Error('No suitable replica available for promotion');
    }

    // Execute failover
    await this.executeFailover(primary, newPrimary);

    return {
      failoverOccurred: true,
      newPrimary: newPrimary.id,
      oldPrimary: primary.id,
      reason: 'Primary failure detected and failover completed'
    };
  }

  /**
   * Select the best replica to promote to primary
   */
  private async selectPromotionCandidate(): Promise<DatabaseNode | null> {
    const replicas = Array.from(this.nodes.values())
      .filter(n => n.role === 'replica' && n.isHealthy);

    if (replicas.length === 0) {
      return null;
    }

    // Sort by replication lag (lowest first)
    replicas.sort((a, b) => a.replicationLag - b.replicationLag);

    // Find first replica within acceptable lag
    for (const replica of replicas) {
      if (replica.replicationLag <= this.config.maxReplicationLag) {
        return replica;
      }
    }

    // If no replica within threshold, return lowest lag replica with warning
    console.warn(
      `All replicas exceed max replication lag. Selecting ${replicas[0].id} with lag ${replicas[0].replicationLag}s`
    );
    return replicas[0];
  }

  /**
   * Execute the failover sequence
   */
  private async executeFailover(
    oldPrimary: DatabaseNode,
    newPrimary: DatabaseNode
  ): Promise<void> {
    console.log(`[Failover] Starting: ${oldPrimary.id} → ${newPrimary.id}`);

    // Step 1: Fence the old primary (prevent split-brain)
    await this.fenceNode(oldPrimary);
    console.log(`[Failover] Step 1: Old primary ${oldPrimary.id} fenced`);

    // Step 2: Wait for new primary to apply any remaining replication
    await this.waitForReplicationCatchup(newPrimary);
    console.log(`[Failover] Step 2: Replication caught up on ${newPrimary.id}`);

    // Step 3: Promote new primary
    await this.promoteToMaster(newPrimary);
    console.log(`[Failover] Step 3: ${newPrimary.id} promoted to primary`);

    // Step 4: Update topology/discovery service
    await this.updateTopology(newPrimary);
    console.log(`[Failover] Step 4: Topology updated`);

    // Step 5: Redirect replicas to new primary
    await this.redirectReplicas(newPrimary);
    console.log(`[Failover] Step 5: Replicas redirected`);

    // Update internal state
    oldPrimary.role = 'unknown';
    newPrimary.role = 'primary';
    this.currentPrimary = newPrimary.id;

    console.log(`[Failover] Complete: ${newPrimary.id} is now primary`);
  }

  /**
   * Fence a node to prevent it from accepting writes (split-brain prevention)
   */
  private async fenceNode(node: DatabaseNode): Promise<void> {
    // Fencing strategies:
    // 1. STONITH (Shoot The Other Node In The Head) - power off the node
    // 2. Revoke write permissions at network level
    // 3. Use a fencing token that must be held to write
    // 4. Instruct the node to enter read-only mode

    // Implementation depends on infrastructure
    await this.sendFencingCommand(node);

    // Wait for fencing to take effect
    await new Promise(resolve =>
      setTimeout(resolve, this.config.fencingTimeoutMs)
    );
  }

  // Infrastructure-specific implementations
  private async checkNodeHealth(node: DatabaseNode): Promise<boolean> {
    // Implement health check logic
    throw new Error('Implementation required');
  }

  private async checkQuorumAgreement(primary: DatabaseNode): Promise<boolean> {
    // Ask majority of nodes if they can reach primary
    throw new Error('Implementation required');
  }

  private async waitForReplicationCatchup(node: DatabaseNode): Promise<void> {
    throw new Error('Implementation required');
  }

  private async promoteToMaster(node: DatabaseNode): Promise<void> {
    throw new Error('Implementation required');
  }

  private async updateTopology(newPrimary: DatabaseNode): Promise<void> {
    throw new Error('Implementation required');
  }

  private async redirectReplicas(newPrimary: DatabaseNode): Promise<void> {
    throw new Error('Implementation required');
  }

  private async sendFencingCommand(node: DatabaseNode): Promise<void> {
    throw new Error('Implementation required');
  }

  private getCurrentPrimary(): DatabaseNode | null {
    if (!this.currentPrimary) return null;
    return this.nodes.get(this.currentPrimary) || null;
  }
}
```

If both old and new primaries accept writes simultaneously (split-brain), data corruption occurs. Fencing is critical—you must guarantee the old primary cannot write before the new one starts. Use STONITH, network fencing, or distributed locks to prevent split-brain.
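One software-level fencing approach mentioned above is a fencing token: a number that increases on every promotion, which storage uses to reject writes from a deposed primary. A minimal sketch (the token source, for example a lock service that increments it on each acquisition, is assumed and not shown):

```typescript
// Minimal fencing-token check: storage rejects writes that carry a token
// older than the newest one it has seen. Names here are illustrative.
class FencedStorage {
  private highestTokenSeen = 0;
  private data = new Map<string, string>();

  write(fencingToken: number, key: string, value: string): void {
    if (fencingToken < this.highestTokenSeen) {
      throw new Error(
        `Stale fencing token ${fencingToken} (latest is ${this.highestTokenSeen}) - writer has been deposed`
      );
    }
    this.highestTokenSeen = fencingToken;
    this.data.set(key, value);
  }
}

// Usage: the old primary holds token 33, the newly promoted primary token 34.
const storage = new FencedStorage();
storage.write(34, 'balance:42', '100');  // New primary: accepted
try {
  storage.write(33, 'balance:42', '95'); // Old primary waking up after a partition: rejected
} catch (e) {
  console.log((e as Error).message);
}
```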
Even with backend failover in place, clients must be directed to the backup systems. This is where DNS and traffic management strategies become critical. The speed of traffic redirection is often the limiting factor in overall failover time.
Traffic Failover Strategies:
| Mechanism | Failover Speed | Pros | Cons |
|---|---|---|---|
| DNS TTL-based | Minutes to hours (TTL + client caching) | Simple, universal | Slow, client TTL caching unpredictable |
| Low TTL DNS | 30-60 seconds | Faster than standard DNS | Higher DNS query load, still has propagation delay |
| DNS health-aware | 30-60 seconds (automated) | Automated failover | Still DNS propagation delay |
| Global Load Balancer | Seconds | Fast, health-aware | Single vendor dependency |
| Anycast | Near-instant | Fastest possible | Complex BGP management, limited to IP routing |
| Client-side failover | Near-instant | Fastest for aware clients | Requires client implementation |
DNS TTL Considerations:
DNS Time-To-Live (TTL) controls how long resolvers and clients cache DNS responses. For disaster recovery, the trade-off is direct: a low TTL lets clients pick up the failover record quickly but raises query load on your DNS infrastructure, while a high TTL reduces load but pins cached clients to the failed endpoint until the record expires.
Reality check: even with a 60-second TTL, many clients (browsers, operating systems, the JVM) apply their own caching rules and may hold records longer than the published TTL. AWS Route 53 guidance suggests planning for one to two minutes of DNS failover time even with low TTLs.
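Client-side failover, the last row in the table above, sidesteps DNS caching entirely: the client knows both endpoints and retries against the secondary when the primary fails or times out. A minimal sketch, with illustrative URLs and timeouts:

```typescript
// Minimal client-side failover using fetch with an ordered endpoint list.
// Requires a runtime with global fetch and AbortSignal.timeout (Node 18+ / modern browsers).
// The URLs, timeout, and retry policy are illustrative assumptions.
const ENDPOINTS = [
  'https://api.example.com',
  'https://api-secondary.example.com',
];

async function fetchWithFailover(path: string, timeoutMs = 2000): Promise<Response> {
  let lastError: unknown;
  for (const base of ENDPOINTS) {
    try {
      const response = await fetch(`${base}${path}`, {
        signal: AbortSignal.timeout(timeoutMs), // Abort slow endpoints quickly
      });
      if (response.ok) return response;
      lastError = new Error(`HTTP ${response.status} from ${base}`);
    } catch (err) {
      lastError = err; // Network error or timeout - try the next endpoint
    }
  }
  throw new Error(`All endpoints failed: ${String(lastError)}`);
}

// Usage: callers never know which region actually served the request.
// const orders = await (await fetchWithFailover('/v1/orders')).json();
```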
```yaml
# AWS Route 53 Health Check and Failover Configuration
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  # Primary endpoint health check
  PrimaryHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        IPAddress: 192.0.2.1
        Port: 443
        Type: HTTPS
        ResourcePath: /health/ready
        FullyQualifiedDomainName: api.example.com
        RequestInterval: 10      # Check every 10 seconds
        FailureThreshold: 2      # Fail after 2 consecutive failures
        MeasureLatency: true
        EnableSNI: true
      HealthCheckTags:
        - Key: Name
          Value: Primary-DC-Health

  # Secondary endpoint health check
  SecondaryHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        IPAddress: 192.0.2.2
        Port: 443
        Type: HTTPS
        ResourcePath: /health/ready
        FullyQualifiedDomainName: api-secondary.example.com
        RequestInterval: 10
        FailureThreshold: 2

  # Failover DNS record for primary
  PrimaryDNSRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: Z1234567890ABC
      Name: api.example.com
      Type: A
      TTL: 60                    # Low TTL for faster failover
      SetIdentifier: primary
      Failover: PRIMARY          # Primary in failover policy
      HealthCheckId: !Ref PrimaryHealthCheck
      ResourceRecords:
        - 192.0.2.1

  # Failover DNS record for secondary
  SecondaryDNSRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: Z1234567890ABC
      Name: api.example.com
      Type: A
      TTL: 60
      SetIdentifier: secondary
      Failover: SECONDARY        # Secondary in failover policy
      HealthCheckId: !Ref SecondaryHealthCheck
      ResourceRecords:
        - 192.0.2.2

---
# Global Accelerator for faster failover (bypasses DNS)
Resources:
  GlobalAccelerator:
    Type: AWS::GlobalAccelerator::Accelerator
    Properties:
      Name: api-accelerator
      Enabled: true
      IpAddressType: IPV4

  Listener:
    Type: AWS::GlobalAccelerator::Listener
    Properties:
      AcceleratorArn: !Ref GlobalAccelerator
      Protocol: TCP
      PortRanges:
        - FromPort: 443
          ToPort: 443

  EndpointGroup:
    Type: AWS::GlobalAccelerator::EndpointGroup
    Properties:
      ListenerArn: !Ref Listener
      EndpointGroupRegion: us-east-1
      HealthCheckIntervalSeconds: 10
      ThresholdCount: 2
      EndpointConfigurations:
        - EndpointId: !Ref PrimaryALB
          Weight: 100
          ClientIPPreservationEnabled: true
        - EndpointId: !Ref SecondaryALB
          Weight: 0              # Zero weight = standby
```

Anycast routing announces the same IP address from multiple locations. BGP routing automatically directs clients to the nearest healthy instance. When one location fails, BGP reconverges in seconds—no DNS involved. CDNs and DNS providers use this extensively. AWS Global Accelerator and Cloudflare use anycast.
Failover is only half the story. After the primary systems recover, you need to decide whether and how to return to the original configuration. This is failback—and it's often more complex than the initial failover.
Failback Strategies:

- Full failback: return service to the original primary as soon as it is restored and verified.
- Gradual failback: shift traffic back in monitored stages with an automatic rollback path, as automated below.
- Permanent role swap: keep the former standby as the new primary and rebuild the recovered site as the standby.
The Failback Decision Framework:
Before initiating failback, answer these questions:

- Has the recovered primary been continuously healthy for a sustained observation window, not just a single passing check?
- Has every write made during the outage been replicated back to the recovered primary, with replication lag at zero?
- Can traffic be shifted back gradually, with monitoring and a defined error-rate threshold that triggers rollback?
- Is the root cause of the original failure understood and remediated, so failing back won't recreate the outage?
```typescript
// TypeScript: Failback Procedure Automation

interface FailbackConfig {
  verifyHealthDurationMs: number; // How long recovered primary must be healthy
  syncVerificationQuery: string;  // Query to verify data sync completion
  trafficShiftSteps: number[];    // Traffic percentages: [10, 25, 50, 100]
  stepWaitMs: number;             // Time between traffic shift steps
  rollbackThreshold: number;      // Error rate to trigger rollback (0-1)
}

interface FailbackStatus {
  phase: 'not_started' | 'verifying' | 'syncing' | 'shifting' | 'complete' | 'rolled_back';
  currentTrafficPercent: number;
  startTime: number;
  errors: string[];
}

class FailbackController {
  private config: FailbackConfig;
  private status: FailbackStatus;

  constructor(config: FailbackConfig) {
    this.config = config;
    this.status = {
      phase: 'not_started',
      currentTrafficPercent: 0,
      startTime: 0,
      errors: []
    };
  }

  /**
   * Execute controlled failback procedure
   */
  async executeFailback(
    recoveredPrimary: string,
    currentPrimary: string
  ): Promise<FailbackStatus> {
    this.status = {
      phase: 'verifying',
      currentTrafficPercent: 0,
      startTime: Date.now(),
      errors: []
    };

    try {
      // Phase 1: Verify recovered primary health
      console.log('[Failback] Phase 1: Verifying recovered primary health');
      await this.verifyHealth(recoveredPrimary);

      // Phase 2: Verify data synchronization
      console.log('[Failback] Phase 2: Verifying data synchronization');
      this.status.phase = 'syncing';
      await this.verifySyncComplete(recoveredPrimary, currentPrimary);

      // Phase 3: Gradual traffic shift
      console.log('[Failback] Phase 3: Shifting traffic');
      this.status.phase = 'shifting';

      for (const targetPercent of this.config.trafficShiftSteps) {
        await this.shiftTraffic(recoveredPrimary, targetPercent);
        this.status.currentTrafficPercent = targetPercent;

        // Monitor for errors
        const isHealthy = await this.monitorAfterShift();
        if (!isHealthy) {
          console.log(`[Failback] Error threshold exceeded at ${targetPercent}% - rolling back`);
          await this.rollback(currentPrimary);
          this.status.phase = 'rolled_back';
          return this.status;
        }

        console.log(`[Failback] Traffic at ${targetPercent}% - healthy, continuing`);
      }

      // Phase 4: Complete - recovered primary is now current primary
      this.status.phase = 'complete';
      console.log('[Failback] Complete - recovered primary is now serving 100% traffic');

      // Update topology to reflect new primary
      await this.updateTopologyAfterFailback(recoveredPrimary);

    } catch (error) {
      this.status.errors.push(`Failback error: ${(error as Error).message}`);
      console.error('[Failback] Error during failback:', error);

      // Attempt rollback on any error
      try {
        await this.rollback(currentPrimary);
        this.status.phase = 'rolled_back';
      } catch (rollbackError) {
        this.status.errors.push(`Rollback error: ${(rollbackError as Error).message}`);
      }
    }

    return this.status;
  }

  /**
   * Verify primary is consistently healthy
   */
  private async verifyHealth(primary: string): Promise<void> {
    const startTime = Date.now();

    while (Date.now() - startTime < this.config.verifyHealthDurationMs) {
      const isHealthy = await this.checkHealth(primary);
      if (!isHealthy) {
        throw new Error(`Recovered primary ${primary} failed health check`);
      }
      await this.sleep(5000); // Check every 5 seconds
    }

    console.log(`[Failback] Recovered primary healthy for ${this.config.verifyHealthDurationMs}ms`);
  }

  /**
   * Verify data sync is complete (no replication lag)
   */
  private async verifySyncComplete(
    target: string,
    source: string
  ): Promise<void> {
    const maxAttempts = 60;
    let attempts = 0;

    while (attempts < maxAttempts) {
      const lag = await this.getReplicationLag(target, source);
      if (lag === 0) {
        console.log('[Failback] Data synchronization complete');
        return;
      }
      console.log(`[Failback] Replication lag: ${lag}s - waiting`);
      attempts++;
      await this.sleep(5000);
    }

    throw new Error('Replication sync did not complete within timeout');
  }

  /**
   * Monitor error rate after traffic shift
   */
  private async monitorAfterShift(): Promise<boolean> {
    // Wait for traffic to stabilize
    await this.sleep(this.config.stepWaitMs);

    // Get current error rate
    const errorRate = await this.getCurrentErrorRate();

    if (errorRate > this.config.rollbackThreshold) {
      this.status.errors.push(`Error rate ${errorRate} exceeded threshold ${this.config.rollbackThreshold}`);
      return false;
    }

    return true;
  }

  /**
   * Rollback all traffic to specified primary
   */
  private async rollback(target: string): Promise<void> {
    console.log(`[Failback] Rolling back all traffic to ${target}`);
    await this.shiftTraffic(target, 100);
    this.status.currentTrafficPercent = 0;
  }

  // Implementation stubs
  private async checkHealth(primary: string): Promise<boolean> {
    throw new Error('Implementation required');
  }

  private async getReplicationLag(target: string, source: string): Promise<number> {
    throw new Error('Implementation required');
  }

  private async shiftTraffic(target: string, percent: number): Promise<void> {
    throw new Error('Implementation required');
  }

  private async getCurrentErrorRate(): Promise<number> {
    throw new Error('Implementation required');
  }

  private async updateTopologyAfterFailback(primary: string): Promise<void> {
    throw new Error('Implementation required');
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}
```

Most teams test failover but never test failback. When disaster strikes and you need to fail back, you discover the procedure is outdated or doesn't work. Include regular failback tests in your disaster recovery exercises—not just failover.
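When you rehearse failback, the controller above still needs concrete wiring. A sketch of how it might be configured once the infrastructure stubs are implemented (every value and name here is an illustrative assumption):

```typescript
// Example wiring for the FailbackController above; all values are illustrative.
const controller = new FailbackController({
  verifyHealthDurationMs: 10 * 60 * 1000, // Require 10 minutes of clean health checks
  syncVerificationQuery: 'SELECT 1',      // Placeholder sync probe
  trafficShiftSteps: [10, 25, 50, 100],   // Canary-style ramp back to the recovered primary
  stepWaitMs: 5 * 60 * 1000,              // Let each step soak for 5 minutes
  rollbackThreshold: 0.01,                // Roll back if error rate exceeds 1%
});

// From within an async context: 'db-primary-1a' is the recovered node,
// 'db-replica-1b' has been serving as primary since the failover.
const result = await controller.executeFailback('db-primary-1a', 'db-replica-1b');
console.log(`Failback finished in phase: ${result.phase}`);
```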
A failover mechanism that has never been tested is, at best, a hypothesis. Production failures have a way of exposing assumptions that seemed reasonable in design but fail catastrophically in reality.
Types of Failover Tests:
| Test Type | Description | Risk | Frequency |
|---|---|---|---|
| Tabletop Exercise | Walk through procedures without executing | Zero | Monthly |
| Staged Test | Execute in non-production environment | Low | Monthly |
| Partial Production Test | Fail over one component/shard in production | Medium | Quarterly |
| Full Production Test | Complete failover drill in production | High | Semi-annually |
| Chaos Engineering | Random failures injected continuously | Variable | Continuous |
| Game Day | Full scenario simulation with teams | Medium-High | Annually |
Netflix famously runs 'Chaos Monkey', which randomly terminates production instances. The philosophy: if failover only works when you test it, make testing continuous. The result is infrastructure that handles failures as routine events rather than crises.
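You don't need Netflix's tooling to get some of this benefit. A toy version of a scheduled chaos round might look like the sketch below, where the infrastructure client is a stand-in for whatever API actually manages your fleet:

```typescript
// Toy chaos round: terminate one random instance, only during working hours,
// and only if the fleet has spare capacity. The infra client is hypothetical.
interface InfraClient {
  listHealthyInstances(service: string): Promise<string[]>;
  terminateInstance(id: string): Promise<void>;
}

async function runChaosRound(infra: InfraClient, service: string, minHealthy = 3): Promise<void> {
  const hour = new Date().getHours();
  if (hour < 9 || hour > 16) return; // Only inject failures while engineers are around

  const instances = await infra.listHealthyInstances(service);
  if (instances.length <= minHealthy) {
    console.log(`[chaos] Skipping ${service}: only ${instances.length} healthy instances`);
    return;
  }

  const victim = instances[Math.floor(Math.random() * instances.length)];
  console.log(`[chaos] Terminating ${victim} - failover should absorb this automatically`);
  await infra.terminateInstance(victim);
}
```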
Failover is the culmination of all health check and failure detection work. It's the moment when design meets reality—when theoretical redundancy becomes actual service continuity.
Module Complete:
You've now completed a comprehensive study of health checks and failover—from the mechanisms that detect failures, through the endpoints that report health, to the strategies that restore service. Together, these patterns form the foundation of resilient distributed systems.
The next time a server crashes, a network partitions, or a data center catches fire, your systems should handle it automatically while you sleep. That's the promise of well-designed health checking and failover—and now you have the knowledge to deliver it.
Congratulations! You've completed the Health Checks & Failover module. You now understand active and passive health checks, health endpoint design, failure detection algorithms, graceful degradation patterns, and comprehensive failover strategies. These skills are essential for building distributed systems that maintain availability through component failures.