Redundancy Patterns: Building Systems That Never Fail

Level: Intermediate

Duration: 90 mins

Topic: Redundancy Patterns

Active-Passive Redundancy: The Foundation of High Availability

When Systems Must Never Stop

On February 28, 2017, Amazon Web Services experienced a major S3 outage in its US-East-1 region that cascaded across thousands of websites and applications. The roughly four-hour disruption cost S&P 500 companies an estimated $150 million. In October 2021, a faulty BGP configuration change took Facebook, Instagram, and WhatsApp offline simultaneously for about six hours, affecting roughly 3.5 billion users and costing the company, by some estimates, close to $100 million in lost revenue.

These incidents share a common lesson: single points of failure are catastrophic at scale. When a system serves millions of users or processes billions of dollars in transactions, even minutes of downtime translate to significant business impact. This reality drives one of the most fundamental principles in system design: redundancy.

Active-passive redundancy represents the foundational pattern for eliminating single points of failure. It's the architectural equivalent of having a spare tire in your car—a standby component ready to take over the moment the primary fails. While conceptually simple, implementing active-passive redundancy correctly requires deep understanding of failure detection, state synchronization, failover mechanics, and the subtle tradeoffs that determine whether your system gracefully handles failures or creates new problems.

What You Will Learn

By the end of this page, you will understand the complete architecture of active-passive redundancy systems, including primary-standby relationships, heartbeat mechanisms, failover triggers, state synchronization strategies, and the critical design decisions that separate robust implementations from fragile ones. You'll gain the knowledge to design, implement, and troubleshoot active-passive systems at any scale.

Understanding Active-Passive Architecture

Active-passive redundancy, also known as primary-standby (or, in older documentation, master-slave) architecture, is a high availability pattern where one component (the active or primary) handles all requests while one or more equivalent components (the passive or standby) serve no production writes and stand ready to assume the primary role if the primary fails.

The Core Principle:

At its heart, active-passive redundancy embodies a simple proposition: maintain a fully functional copy of your critical system that can take over when needed, ideally within seconds. The passive component isn't just a backup of data; it's a complete, operational replica capable of serving production traffic once it's activated.

This pattern appears throughout computing infrastructure:

Database systems: A primary database handles all writes while a standby replica maintains synchronized data, ready for promotion
Load balancers: A primary load balancer routes traffic while a standby monitors health and assumes control during failures
Application servers: Primary instances serve requests while standby instances wait in warm or hot standby states
Network equipment: Primary routers handle traffic while backup routers maintain routing tables for instant failover

Active-Passive Component States
State	Description	Resource Usage	Failover Time
Cold Standby	Standby is powered off or idle; requires startup on failure	Minimal (storage only)	Minutes to hours
Warm Standby	Standby is running but not processing; may need data sync	Moderate (compute + partial data)	Seconds to minutes
Hot Standby	Standby is fully synchronized and ready for instant takeover	High (full compute + full data)	Milliseconds to seconds

Why Choose Active-Passive?

Active-passive redundancy offers several compelling advantages that explain its widespread adoption:

Simplicity: Unlike active-active configurations, there's no need to handle concurrent writes, resolve conflicts, or manage distributed consensus. The primary handles all operations, eliminating coordination complexity.

Consistency: With a single writer, data consistency is inherently maintained. There's no risk of conflicting updates or split-brain scenarios during normal operation.

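Cost Efficiency: A cold or warm standby can remain in a reduced-capacity state, lowering operational costs compared to fully active replicas. The saving has a limit: a standby that is undersized for production load will degrade service after failover, so any reduction must be reversible before traffic arrives.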

Predictable Behavior: Traffic always flows through the primary, making performance characteristics predictable and debugging straightforward.

The Fundamental Tradeoff

Active-passive systems trade resource utilization for simplicity. The standby component consumes resources (compute, memory, storage) while contributing nothing to throughput during normal operation. This is the cost of maintaining instant failover capability. For resource-constrained environments or cost-sensitive deployments, this overhead must be weighed against availability requirements.

Anatomy of Active-Passive Systems

A robust active-passive implementation consists of several interconnected components working in concert. Understanding each component's role is essential for designing systems that fail gracefully.

1. The Primary (Active) Component

The primary component handles all production workload:

Processes incoming requests from clients
Maintains authoritative state (database records, session data, etc.)
Generates replication streams for standby synchronization
Reports health status to monitoring systems
Participates in leader election protocols

2. The Standby (Passive) Component

The standby component maintains failover readiness:

Receives and applies replication streams from the primary
Maintains near-current copy of all critical state
Monitors primary health through heartbeat mechanisms
Stands ready to assume primary role on failure detection
May serve read-only queries in some configurations

3. Health Monitoring System

The monitoring layer detects failures and triggers failover:

Implements heartbeat/ping protocols
Distinguishes between component failure and network partitions
Makes promotion decisions based on configurable policies
Prevents split-brain scenarios through quorum or fencing
Logs all state transitions for debugging and auditing

┌─────────────────────────────────────────────────────────────────┐
│                        CLIENT REQUESTS                          │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                      VIRTUAL IP / DNS                           │
│              (Points to current primary)                        │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
              ┌───────────────────┴───────────────────┐
              │                                       │
              ▼                                       ▼
┌──────────────────────────┐          ┌──────────────────────────┐
│      PRIMARY NODE        │          │     STANDBY NODE         │
│   ┌──────────────────┐   │          │   ┌──────────────────┐   │
│   │   Application    │   │  ──────▶ │   │   Application    │   │
│   │   (Active)       │   │ Replication  │   (Passive)      │   │
│   └──────────────────┘   │  Stream  │   └──────────────────┘   │
│   ┌──────────────────┐   │          │   ┌──────────────────┐   │
│   │   Data Store     │   │          │   │   Data Store     │   │
│   │   (Read/Write)   │   │          │   │   (Read-only)    │   │
│   └──────────────────┘   │          │   └──────────────────┘   │
│   ┌──────────────────┐   │          │   ┌──────────────────┐   │
│   │  Health Agent    │◀──────────────▶│  Health Agent    │   │
│   └──────────────────┘   │ Heartbeat│   └──────────────────┘   │
└──────────────────────────┘          └──────────────────────────┘
              │                                       │
              └───────────────────┬───────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                   MONITORING / ARBITRATOR                       │
│            (Detects failures, triggers failover)                │
└─────────────────────────────────────────────────────────────────┘

4. Traffic Routing Mechanism

Routing ensures clients connect to the active component:

Virtual IP (VIP): A floating IP address that moves between primary and standby during failover
DNS-based routing: DNS records updated to point to the new primary
Load balancer: Health-checked backend pools that remove failed primaries
Service discovery: Dynamic registration systems like Consul or etcd

5. Fencing Mechanism

Fencing prevents split-brain scenarios:

STONITH (Shoot The Other Node In The Head): Physically powers off the failed node
Resource fencing: Revokes access to shared storage or network
Application-level fencing: Invalidates sessions, revokes tokens
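
One common way to build application-level fencing is a monotonically increasing epoch, often called a fencing token: every promotion increments the epoch, and shared resources reject requests that carry an older one. This is a technique swapped in for illustration rather than one the list above names. The sketch below is a minimal in-memory version; `FencedStore` and its methods are hypothetical names.

```typescript
// Hypothetical shared resource that rejects writes from a superseded primary.
class FencedStore {
    private highestEpoch = 0;
    private readonly data = new Map<string, string>();

    // Accept a write only if the caller's epoch is at least the newest one seen.
    write(epoch: number, key: string, value: string): boolean {
        if (epoch < this.highestEpoch) {
            return false; // Caller was replaced by a later promotion
        }
        this.highestEpoch = epoch;
        this.data.set(key, value);
        return true;
    }

    read(key: string): string | undefined {
        return this.data.get(key);
    }
}

const store = new FencedStore();
const oldPrimaryOk = store.write(1, "balance", "100"); // original primary, epoch 1
const newPrimaryOk = store.write(2, "balance", "120"); // promoted standby, epoch 2
const staleWriteOk = store.write(1, "balance", "90");  // old primary resumes: rejected
```

Because the check lives in the resource itself, it still protects data when the old primary cannot be reached to be told it has been demoted.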

Each component must be carefully designed and tested together. A monitoring system that detects failures but can't route traffic is useless. A replication system that falls behind makes failover lossy. These components form an integrated whole.

Failure Detection Mechanisms

Failure detection is the most critical and challenging aspect of active-passive systems. Detecting failures too slowly extends downtime; detecting them too aggressively causes unnecessary failovers that disrupt service. The art of failure detection lies in finding the optimal balance.

Heartbeat Protocols

Heartbeats are periodic signals that indicate a component is alive and functioning. They form the foundation of most failure detection systems:

Push-based heartbeats: The primary actively sends signals to the standby or monitor at regular intervals. Missing heartbeats trigger failure investigation.

Pull-based health checks: The monitor actively queries the primary's health endpoint. Failed queries or timeout responses indicate potential failure.

Bidirectional heartbeats: Both primary and standby exchange heartbeats, enabling either to detect the other's failure.

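interface HeartbeatConfig {
    intervalMs: number;        // Time between heartbeats
    timeoutMs: number;         // Time to wait for response
    failureThreshold: number;  // Consecutive failures before declaring dead
}

// Assumed shape of the monitored component; the original leaves it undefined.
interface HealthCheckTarget {
    healthCheck(): Promise<{ healthy: boolean }>;
}

class HeartbeatMonitor {
    private consecutiveFailures = 0;
    private lastSuccessTime = Date.now();

    constructor(
        private config: HeartbeatConfig,
        private target: HealthCheckTarget,
        private onFailure: () => void
    ) {}

    // Runs a single check; the caller schedules it every config.intervalMs.
    async checkHealth(): Promise<boolean> {
        const timeout = this.timeout(this.config.timeoutMs);
        try {
            const response = await Promise.race([
                this.target.healthCheck(),
                timeout.promise
            ]);

            if (response.healthy) {
                this.consecutiveFailures = 0;
                this.lastSuccessTime = Date.now();
                return true;
            }

            return this.handleFailure('Unhealthy response');
        } catch (error) {
            // A caught value is not guaranteed to be an Error instance
            const reason = error instanceof Error ? error.message : String(error);
            return this.handleFailure(reason);
        } finally {
            timeout.cancel();  // Do not leave a pending timer behind after each check
        }
    }

    // Milliseconds since the last successful check, useful for dashboards.
    msSinceLastSuccess(): number {
        return Date.now() - this.lastSuccessTime;
    }

    private handleFailure(reason: string): boolean {
        this.consecutiveFailures++;
        console.warn(`Heartbeat failure #${this.consecutiveFailures}: ${reason}`);

        // Fire once when the threshold is reached, not again on every later failure
        if (this.consecutiveFailures === this.config.failureThreshold) {
            console.error('Failure threshold reached, triggering failover');
            this.onFailure();
        }

        return false;
    }

    // Returns a promise that rejects after `ms`, plus a way to cancel the timer.
    private timeout(ms: number): { promise: Promise<never>; cancel: () => void } {
        let timer: ReturnType<typeof setTimeout>;
        const promise = new Promise<never>((_, reject) => {
            timer = setTimeout(() => reject(new Error('Timeout')), ms);
        });
        return { promise, cancel: () => clearTimeout(timer) };
    }
}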
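
The push-based variant described above inverts the roles: the receiver does no probing and only tracks how long it has been since the last heartbeat arrived. A minimal sketch, assuming a hypothetical `PushHeartbeatTracker` class and passing the clock in explicitly so the logic can be exercised without real timers:

```typescript
// Hypothetical receiver for push-based heartbeats sent by the primary.
class PushHeartbeatTracker {
    private lastBeatMs: number;

    constructor(
        private readonly intervalMs: number,        // Expected gap between heartbeats
        private readonly missedBeatsAllowed: number, // Tolerated consecutive misses
        startMs: number
    ) {
        this.lastBeatMs = startMs;
    }

    // Called whenever a heartbeat arrives from the primary.
    recordBeat(nowMs: number): void {
        this.lastBeatMs = nowMs;
    }

    // The primary is suspect once the silence exceeds the allowed
    // number of heartbeat intervals.
    isSuspect(nowMs: number): boolean {
        const silenceMs = nowMs - this.lastBeatMs;
        return silenceMs > this.intervalMs * this.missedBeatsAllowed;
    }
}

const tracker = new PushHeartbeatTracker(1000, 3, 0);
tracker.recordBeat(1000);
const suspectAt3500 = tracker.isSuspect(3500); // 2500 ms of silence: not suspect
const suspectAt4500 = tracker.isSuspect(4500); // 3500 ms of silence: suspect
```

A suspect primary should start the confirmation process rather than an immediate failover, since silence can also mean the network path to the receiver is down.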

Configuring Detection Parameters

The three critical parameters for heartbeat configuration are:

Heartbeat Interval: How frequently health checks occur. Shorter intervals detect failures faster but increase network and CPU overhead. Typical values range from 100ms for latency-sensitive systems to 30 seconds for less critical components.

Timeout Period: How long to wait for a response before considering a check failed. Must account for network latency, garbage collection pauses, and temporary load spikes. Setting this too low causes false positives; too high delays failure detection.

Failure Threshold: How many consecutive failures trigger failover. A single missed heartbeat could be a network blip; three consecutive failures suggest genuine failure. Higher thresholds reduce false positives but extend detection time.

The Detection Time Formula:

Maximum Detection Time = (Failure Threshold × Heartbeat Interval) + Timeout

For example, with 5-second intervals, 3-second timeouts, and a threshold of 3:

Maximum detection time = (3 × 5) + 3 = 18 seconds

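This 18-second window is the worst-case time between actual failure and failover initiation, assuming checks start on a fixed schedule and the primary fails just after passing one: the next three checks start at 5, 10, and 15 seconds, and the third times out at 18 seconds.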
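
The formula is easy to encode so that candidate settings can be compared before deployment. A small sketch; `DetectionParams` and `maxDetectionTimeMs` are illustrative names, not part of any library:

```typescript
interface DetectionParams {
    intervalMs: number;       // Heartbeat interval
    timeoutMs: number;        // Per-check timeout
    failureThreshold: number; // Consecutive failures before failover
}

// Worst case: the primary fails just after passing a check, so
// `failureThreshold` further checks must start and the last must time out.
function maxDetectionTimeMs(p: DetectionParams): number {
    return p.failureThreshold * p.intervalMs + p.timeoutMs;
}

// The example from the text: 5 s interval, 3 s timeout, threshold 3.
const worstCaseMs = maxDetectionTimeMs({
    intervalMs: 5000,
    timeoutMs: 3000,
    failureThreshold: 3,
});

// A more aggressive configuration for comparison.
const aggressiveMs = maxDetectionTimeMs({
    intervalMs: 1000,
    timeoutMs: 500,
    failureThreshold: 3,
});
```

Here `worstCaseMs` is 18000 and `aggressiveMs` is 3500. The faster setting cuts detection time by more than 14 seconds, at the cost of treating any pause longer than 500 ms as a failed check.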

Multi-Layer Health Checks

Production systems should implement health checks at multiple layers: network (can we reach the host?), process (is the application running?), application (is it responding correctly?), and business logic (is it processing requests successfully?). A database that responds to TCP connections but can't execute queries should not be considered healthy.
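
The layering can be sketched as an ordered evaluation that reports the lowest failing layer. This is an illustration under stated assumptions: the probe results are gathered elsewhere, and `ProbeResults`, `evaluateHealth`, and the 0.95 default success rate are hypothetical.

```typescript
type Layer = "network" | "process" | "application" | "business";

interface ProbeResults {
    hostReachable: boolean;    // Network layer
    processRunning: boolean;   // Process layer
    queryResponded: boolean;   // Application layer, e.g. a trivial query
    recentSuccessRate: number; // Business layer, 0..1 over a recent window
}

interface HealthReport {
    healthy: boolean;
    failedLayer?: Layer;
}

// Evaluate from the lowest layer upward and report the first failure,
// since higher layers cannot be healthy when a lower one is down.
function evaluateHealth(p: ProbeResults, minSuccessRate: number = 0.95): HealthReport {
    if (!p.hostReachable) return { healthy: false, failedLayer: "network" };
    if (!p.processRunning) return { healthy: false, failedLayer: "process" };
    if (!p.queryResponded) return { healthy: false, failedLayer: "application" };
    if (p.recentSuccessRate < minSuccessRate) {
        return { healthy: false, failedLayer: "business" };
    }
    return { healthy: true };
}

// The database from the text: accepts TCP connections but cannot run queries.
const tcpOnlyDatabase = evaluateHealth({
    hostReachable: true,
    processRunning: true,
    queryResponded: false,
    recentSuccessRate: 0,
});
```

`tcpOnlyDatabase` is reported unhealthy at the application layer, which a network-only check would have missed.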

State Synchronization Strategies

For the standby to assume the primary role effectively, it must maintain sufficiently current state. The synchronization strategy determines how much data might be lost during failover and how quickly the standby can take over.

Synchronous Replication

In synchronous replication, the primary waits for the standby to acknowledge receipt of each change before confirming success to the client:

Client sends write request to primary
Primary applies change locally
Primary sends change to standby
Standby applies change and acknowledges
Primary confirms success to client

Advantages:

Zero data loss during failover (RPO = 0)
Standby always has complete, consistent state
No recovery replay needed after promotion

Disadvantages:

Higher latency for every write operation
Standby failure blocks writes on the primary until the standby recovers or is removed from the synchronous set
Network latency directly impacts performance
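
The synchronous write path can be sketched with an in-memory simulation. Everything here is hypothetical: `Standby`, `SyncPrimary`, and the `online` flag stand in for real nodes and a real network, and undoing the local change on a missing acknowledgment is one possible policy, not the only one.

```typescript
// Hypothetical in-memory standby; `online` simulates reachability.
class Standby {
    online = true;
    readonly log: string[] = [];

    // Returns true as the acknowledgment, false if the standby is unreachable.
    replicate(change: string): boolean {
        if (!this.online) return false;
        this.log.push(change);
        return true;
    }
}

class SyncPrimary {
    readonly log: string[] = [];

    constructor(private readonly standby: Standby) {}

    // Returns true only after the standby has acknowledged the change.
    write(change: string): boolean {
        this.log.push(change);                        // Apply locally
        const acked = this.standby.replicate(change); // Send and wait for the ack
        if (!acked) {
            this.log.pop();                           // Assumed policy: undo and fail the write
            return false;
        }
        return true;                                  // Confirm success to the client
    }
}

const standby = new Standby();
const primary = new SyncPrimary(standby);
const firstWrite = primary.write("set x=1");   // Standby reachable: confirmed
standby.online = false;
const secondWrite = primary.write("set x=2");  // Standby down: write is refused
```

The second write shows the availability cost listed above: with the standby unreachable, the primary refuses the write rather than confirm a change that exists on only one node.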

Asynchronous Replication

In asynchronous replication, the primary confirms success immediately and replicates changes independently:

Client sends write request to primary
Primary applies change and confirms success
Primary queues change for replication
Standby receives and applies change (eventually)

Advantages:

Lower write latency (no waiting for standby)
Standby failures don't affect primary performance
Works across high-latency network links

Disadvantages:

Potential data loss during failover (RPO > 0)
Replication lag creates inconsistency windows
May require recovery replay after promotion

Replication Strategy Comparison
Aspect	Synchronous	Asynchronous
Data Loss Risk	None (RPO = 0)	Possible (RPO > 0)
Write Latency	Higher (network round-trip)	Lower (local only)
Standby Impact	Failure blocks writes	No impact on writes
Geographic Distance	Limited by latency	Works globally
Consistency	Always consistent	Eventually consistent
Use Cases	Financial, critical data	Analytics, caching

Semi-Synchronous Replication

Semi-synchronous replication offers a middle ground:

Primary waits for acknowledgment from at least one standby
If no acknowledgment within timeout, falls back to asynchronous
Balances data safety with availability

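MySQL ships this mode as semisynchronous replication, where the source reverts to asynchronous replication if no replica acknowledges within a configurable timeout. PostgreSQL offers related tuning through its synchronous_commit and synchronous_standby_names settings, although it waits for a synchronous standby rather than falling back automatically. Either way, operators can tune the tradeoff based on requirements.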
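
The timeout fallback can be illustrated with a small simulation. This is a sketch under loud assumptions: `SimulatedStandby.ackLatencyMs` stands in for real network timing, and `SemiSyncPrimary`, `backlog`, and `drainBacklog` are hypothetical names that do not mirror any database's internals.

```typescript
type WriteMode = "sync" | "async-fallback";

class SimulatedStandby {
    ackLatencyMs = 5; // Simulated time the standby needs to acknowledge
    readonly applied: string[] = [];

    apply(change: string): void {
        this.applied.push(change);
    }
}

class SemiSyncPrimary {
    // Changes already confirmed to clients but not yet on the standby.
    readonly backlog: string[] = [];

    constructor(
        private readonly standby: SimulatedStandby,
        private readonly ackTimeoutMs: number
    ) {}

    write(change: string): WriteMode {
        // Stay synchronous only if nothing is queued (to preserve order)
        // and the standby would acknowledge within the timeout.
        const ackInTime = this.standby.ackLatencyMs <= this.ackTimeoutMs;
        if (this.backlog.length === 0 && ackInTime) {
            this.standby.apply(change);
            return "sync";
        }
        this.backlog.push(change); // Fall back: confirm now, replicate later
        return "async-fallback";
    }

    // Ship queued changes in order once the standby is responsive again.
    drainBacklog(): number {
        const shipped = this.backlog.length;
        for (const change of this.backlog) {
            this.standby.apply(change);
        }
        this.backlog.length = 0;
        return shipped;
    }
}

const slowableStandby = new SimulatedStandby();
const semiSync = new SemiSyncPrimary(slowableStandby, 100);
const modeA = semiSync.write("a");   // Fast standby: "sync"
slowableStandby.ackLatencyMs = 500;
const modeB = semiSync.write("b");   // Too slow: "async-fallback"
slowableStandby.ackLatencyMs = 5;
const modeC = semiSync.write("c");   // Backlog pending: still "async-fallback"
const shipped = semiSync.drainBacklog();
const modeD = semiSync.write("d");   // Caught up: back to "sync"
```

While the backlog is non-empty, a failover would lose the queued changes, so the fallback window is exactly the period in which RPO is no longer zero.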

Replication Lag Monitoring

For asynchronous replication, monitoring replication lag is essential. Lag that grows unbounded indicates the standby can't keep up with write volume—a serious problem that could mean significant data loss during failover. Alert on lag thresholds and investigate immediately when triggered.
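
A lag check needs two signals: whether the latest lag exceeds a threshold, and whether lag has grown across the whole sampling window. The sketch below assumes lag is sampled in bytes by some external collector; `LagSample`, `assessLag`, and the threshold value are hypothetical.

```typescript
interface LagSample {
    timestampMs: number;
    lagBytes: number; // How far the standby is behind the primary
}

interface LagAssessment {
    latestLagBytes: number;
    overThreshold: boolean; // Latest sample exceeds the alert threshold
    growing: boolean;       // Lag rose between every pair of consecutive samples
}

function assessLag(samples: LagSample[], thresholdBytes: number): LagAssessment {
    let previous: LagSample | undefined;
    let growing = samples.length >= 2;
    for (const sample of samples) {
        if (previous !== undefined && sample.lagBytes <= previous.lagBytes) {
            growing = false; // At least one interval where lag did not rise
        }
        previous = sample;
    }
    const latestLagBytes = previous?.lagBytes ?? 0;
    return {
        latestLagBytes,
        overThreshold: latestLagBytes > thresholdBytes,
        growing,
    };
}

const fallingBehind = assessLag(
    [
        { timestampMs: 0, lagBytes: 100 },
        { timestampMs: 1000, lagBytes: 200 },
        { timestampMs: 2000, lagBytes: 400 },
    ],
    300
);

const catchingUp = assessLag(
    [
        { timestampMs: 0, lagBytes: 400 },
        { timestampMs: 1000, lagBytes: 200 },
        { timestampMs: 2000, lagBytes: 100 },
    ],
    300
);
```

`fallingBehind` is both over the threshold and growing, the combination that warrants immediate investigation; `catchingUp` is neither.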

Failover Mechanics

When failure is detected, the system must execute a coordinated sequence of steps to promote the standby to primary status. This process must be reliable, fast, and safe from split-brain scenarios.

The Failover Sequence:

Step 1: Failure Confirmation

Multiple monitors confirm the primary is unreachable
Distinguish between primary failure and network partition
Quorum-based decision prevents false positives

Step 2: Fencing the Failed Primary

Ensure the old primary cannot continue serving requests
Revoke access to shared resources (storage, network)
STONITH (power off) for hardware-level certainty

Step 3: Standby Promotion

Stop replication from the failed primary
Apply any pending replication log entries
Transition from read-only to read-write mode
Update internal state to reflect primary role

Step 4: Traffic Redirection

Move Virtual IP to the new primary
Update DNS records (with TTL considerations)
Reconfigure load balancer backends
Update service discovery registrations

Step 5: Validation

Verify the new primary is accepting connections
Confirm data integrity and consistency
Run health checks against the promoted system
Alert operations team of completed failover

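// Assumed minimal shapes for the collaborators; the original leaves them undefined.
// Named ClusterNode to avoid clashing with the DOM's built-in Node type.
interface ClusterNode {
    address: string;
    stopReplication(): Promise<void>;
    applyPendingLogs(): Promise<void>;
    enableWriteMode(): Promise<void>;
}

interface TrafficRouter {
    updatePrimary(address: string): Promise<void>;
}

type FailoverResult =
    | { success: true; newPrimary: string; failoverDurationMs: number }
    | { success: false; reason: string };

// Environment-specific steps (quorum check, fencing, validation, paging)
// are left abstract; a concrete subclass supplies them.
abstract class FailoverOrchestrator {
    protected abstract confirmFailure(node: ClusterNode): Promise<boolean>;
    protected abstract fencePrimary(node: ClusterNode): Promise<void>;
    protected abstract validatePrimary(node: ClusterNode): Promise<void>;
    protected abstract alertOperations(error: unknown): Promise<void>;

    async executeFailover(
        failedPrimary: ClusterNode,
        standby: ClusterNode,
        router: TrafficRouter
    ): Promise<FailoverResult> {
        const startTime = Date.now();

        try {
            // Step 1: Confirm failure with quorum
            console.log('Confirming primary failure...');
            const confirmed = await this.confirmFailure(failedPrimary);
            if (!confirmed) {
                return { success: false, reason: 'Failure not confirmed' };
            }

            // Step 2: Fence the failed primary before anything can accept writes
            console.log('Fencing failed primary...');
            await this.fencePrimary(failedPrimary);

            // Step 3: Promote standby
            console.log('Promoting standby to primary...');
            await standby.stopReplication();
            await standby.applyPendingLogs();
            await standby.enableWriteMode();

            // Step 4: Redirect traffic
            console.log('Redirecting traffic...');
            await router.updatePrimary(standby.address);

            // Step 5: Validate
            console.log('Validating new primary...');
            await this.validatePrimary(standby);

            const duration = Date.now() - startTime;
            console.log(`Failover completed in ${duration}ms`);

            return {
                success: true,
                newPrimary: standby.address,
                failoverDurationMs: duration
            };
        } catch (error) {
            // A caught value is not guaranteed to be an Error instance
            const reason = error instanceof Error ? error.message : String(error);
            console.error('Failover failed:', error);
            await this.alertOperations(error);
            return { success: false, reason };
        }
    }
}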
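
The quorum decision in Step 1 can be sketched as a majority vote over independent monitors. `Verdict` and `quorumConfirmsFailure` are hypothetical names; the key choice, labeled here as an assumption, is to count the majority against all monitors rather than only those that answered, so silence from a monitor never counts as a vote for failover.

```typescript
type Verdict = "reachable" | "unreachable" | "unknown";

// Confirm failure only when a strict majority of ALL monitors
// report the primary as unreachable.
function quorumConfirmsFailure(verdicts: Verdict[]): boolean {
    const unreachable = verdicts.filter((v) => v === "unreachable").length;
    return unreachable > Math.floor(verdicts.length / 2);
}

// Two of three monitors cannot reach the primary: failure confirmed.
const majorityDown = quorumConfirmsFailure(["unreachable", "unreachable", "reachable"]);

// Only one of three is sure: likely a partition near that monitor, so no failover.
const singleReport = quorumConfirmsFailure(["unreachable", "unknown", "reachable"]);

// An even split is not a majority.
const evenSplit = quorumConfirmsFailure(["unreachable", "unreachable", "unknown", "reachable"]);
```

Running an odd number of monitors in separate failure domains avoids even splits and makes it less likely that one network fault hides the primary from a majority.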

The Split-Brain Problem

Split-brain occurs when both primary and standby believe they are the active primary, typically due to network partitions. Both accept writes, creating divergent data that may be impossible to reconcile. Fencing mechanisms and quorum-based decisions are essential to prevent this catastrophic scenario. Never skip or shortcut fencing steps.

Implementation Best Practices

Implementing active-passive redundancy correctly requires attention to numerous details. These best practices, learned from production experience, help avoid common pitfalls.

1. Test Failover Regularly

Failover that isn't tested is failover that doesn't work. Regular testing reveals:

Configuration drift between primary and standby
Replication lag issues
Fencing mechanism failures
Traffic routing problems
Recovery time reality vs expectations

Schedule monthly or quarterly failover drills. Automate what you can, but ensure humans practice the manual steps too.

2. Maintain Standby at Full Capacity

A standby that can't handle production load is not a viable failover target. Ensure:

Standby hardware matches primary specifications
Standby is running same software versions
Standby has equivalent network capacity
Standby configuration mirrors production settings

3. Monitor Replication Continuously

Replication health determines failover quality:

Track replication lag in real-time
Alert on lag exceeding thresholds
Monitor replication throughput and queue depth
Verify data consistency periodically

4. Implement Proper Timeouts

Timeout configuration requires careful tuning:

Account for garbage collection pauses
Consider network latency variability
Allow for temporary load spikes
Balance detection speed vs false positive risk

Active-Passive Checklist

• Identical configurations: Primary and standby must be configured identically to ensure seamless takeover
• Automated failover: Manual intervention should be optional, not required for basic failures
• Fencing implementation: STONITH or equivalent must be in place to prevent split-brain
• Replication monitoring: Continuous visibility into lag, throughput, and consistency
• Regular testing: Scheduled failover drills with documented results
• Runbook documentation: Clear procedures for manual intervention when automation fails
• Alerting integration: Immediate notification of failures and failover events
• Capacity planning: Standby must handle full production load

Real-World Examples

PostgreSQL Streaming Replication

PostgreSQL implements active-passive through streaming replication:

Primary streams Write-Ahead Log (WAL) to standby
Standby replays WAL to maintain synchronized state
Supports both synchronous and asynchronous modes
Tools like Patroni automate failover and leader election

MySQL Group Replication (Single-Primary Mode)

MySQL's Group Replication can operate in single-primary mode:

One member is elected primary and accepts writes
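Other members are secondaries; each transaction is agreed on by the group before it commits (often described as virtually synchronous) and is then applied on each secondary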
Automatic primary election on failure
Built-in conflict detection (though conflicts shouldn't occur)

AWS RDS Multi-AZ

Amazon RDS implements active-passive at the managed service level:

Primary instance in one Availability Zone
Synchronous standby in different AZ
Automatic failover with DNS update
Transparent to applications using the endpoint DNS

Redis Sentinel

Redis uses Sentinel for active-passive high availability:

Sentinel processes monitor Redis master and replicas
Automatic failover when a quorum of Sentinels agrees the master is unreachable
Client libraries query Sentinel for current master
Supports notification and scripting on state changes

Leverage Managed Services

When possible, use managed services that handle active-passive redundancy automatically. AWS RDS Multi-AZ, Google Cloud SQL HA, and Azure SQL Database all implement battle-tested failover mechanisms. Building your own is educational but operationally expensive.

Summary: Active-Passive Redundancy

Active-passive redundancy is the foundational pattern for high availability. While conceptually simple—keep a spare ready—correct implementation requires careful attention to failure detection, state synchronization, failover mechanics, and split-brain prevention.

Key Takeaways

• Active-passive trades utilization for simplicity: Standby resources are idle but eliminate single points of failure
• Failure detection is the critical challenge: Balance speed against false positives through careful timeout configuration
• Synchronization strategy determines data loss risk: Synchronous replication = zero loss; asynchronous = potential loss
• Fencing prevents split-brain catastrophes: Never skip or shortcut fencing mechanisms
• Regular testing is non-negotiable: Untested failover is unreliable failover
• Monitoring enables proactive response: Track replication lag, health status, and failover readiness continuously

Next Steps:

Active-passive redundancy excels at simplicity and consistency but leaves standby resources idle. In the next page, we'll explore active-active redundancy, where all nodes handle production traffic simultaneously—eliminating wasted capacity but introducing new challenges around coordination and conflict resolution.

Page Complete

You now understand active-passive redundancy comprehensively—from architecture and failure detection through state synchronization and failover mechanics. This pattern forms the foundation for most high-availability database deployments, load balancer configurations, and application server setups across the industry.

1 / 5

Loading learning content...

System Design (HLD)Redundancy Patterns

Redundancy Patterns: Building Systems That Never Fail

LevelIntermediate

Duration90 mins

TopicRedundancy Patterns

1 / 5

Active-Passive Redundancy: The Foundation of High Availability

When Systems Must Never Stop

What You Will Learn

Understanding Active-Passive Architecture

The Core Principle:

This pattern appears throughout computing infrastructure:

Database systems: A primary database handles all writes while a standby replica maintains synchronized data, ready for promotion
Load balancers: A primary load balancer routes traffic while a standby monitors health and assumes control during failures
Application servers: Primary instances serve requests while standby instances wait in warm or hot standby states
Network equipment: Primary routers handle traffic while backup routers maintain routing tables for instant failover

Active-Passive Component States
State	Description	Resource Usage	Failover Time
Cold Standby	Standby is powered off or idle; requires startup on failure	Minimal (storage only)	Minutes to hours
Warm Standby	Standby is running but not processing; may need data sync	Moderate (compute + partial data)	Seconds to minutes
Hot Standby	Standby is fully synchronized and ready for instant takeover	High (full compute + full data)	Milliseconds to seconds

Why Choose Active-Passive?

Active-passive redundancy offers several compelling advantages that explain its widespread adoption:

Consistency: With a single writer, data consistency is inherently maintained. There's no risk of conflicting updates or split-brain scenarios during normal operation.

Cost Efficiency: The passive component can run on less powerful hardware or remain in a reduced capacity state, lowering operational costs compared to fully active replicas.

Predictable Behavior: Traffic always flows through the primary, making performance characteristics predictable and debugging straightforward.

The Fundamental Tradeoff

Anatomy of Active-Passive Systems

A robust active-passive implementation consists of several interconnected components working in concert. Understanding each component's role is essential for designing systems that fail gracefully.

1. The Primary (Active) Component

The primary component handles all production workload:

Processes incoming requests from clients
Maintains authoritative state (database records, session data, etc.)
Generates replication streams for standby synchronization
Reports health status to monitoring systems
Participates in leader election protocols

2. The Standby (Passive) Component

The standby component maintains failover readiness:

Receives and applies replication streams from the primary
Maintains near-current copy of all critical state
Monitors primary health through heartbeat mechanisms
Stands ready to assume primary role on failure detection
May serve read-only queries in some configurations

3. Health Monitoring System

The monitoring layer detects failures and triggers failover:

Implements heartbeat/ping protocols
Distinguishes between component failure and network partitions
Makes promotion decisions based on configurable policies
Prevents split-brain scenarios through quorum or fencing
Logs all state transitions for debugging and auditing

┌─────────────────────────────────────────────────────────────────┐
│                        CLIENT REQUESTS                          │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                      VIRTUAL IP / DNS                           │
│              (Points to current primary)                        │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
              ┌───────────────────┴───────────────────┐
              │                                       │
              ▼                                       ▼
┌──────────────────────────┐          ┌──────────────────────────┐
│      PRIMARY NODE        │          │     STANDBY NODE         │
│   ┌──────────────────┐   │          │   ┌──────────────────┐   │
│   │   Application    │   │  ──────▶ │   │   Application    │   │
│   │   (Active)       │   │ Replication  │   (Passive)      │   │
│   └──────────────────┘   │  Stream  │   └──────────────────┘   │
│   ┌──────────────────┐   │          │   ┌──────────────────┐   │
│   │   Data Store     │   │          │   │   Data Store     │   │
│   │   (Read/Write)   │   │          │   │   (Read-only)    │   │
│   └──────────────────┘   │          │   └──────────────────┘   │
│   ┌──────────────────┐   │          │   ┌──────────────────┐   │
│   │  Health Agent    │◀──────────────▶│  Health Agent    │   │
│   └──────────────────┘   │ Heartbeat│   └──────────────────┘   │
└──────────────────────────┘          └──────────────────────────┘
              │                                       │
              └───────────────────┬───────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                   MONITORING / ARBITRATOR                       │
│            (Detects failures, triggers failover)                │
└─────────────────────────────────────────────────────────────────┘

4. Traffic Routing Mechanism

Routing ensures clients connect to the active component:

Virtual IP (VIP): A floating IP address that moves between primary and standby during failover
DNS-based routing: DNS records updated to point to the new primary
Load balancer: Health-checked backend pools that remove failed primaries
Service discovery: Dynamic registration systems like Consul or etcd

5. Fencing Mechanism

Fencing prevents split-brain scenarios:

STONITH (Shoot The Other Node In The Head): Physically powers off the failed node
Resource fencing: Revokes access to shared storage or network
Application-level fencing: Invalidates sessions, revokes tokens

Failure Detection Mechanisms

Heartbeat Protocols

Heartbeats are periodic signals that indicate a component is alive and functioning. They form the foundation of most failure detection systems:

Push-based heartbeats: The primary actively sends signals to the standby or monitor at regular intervals. Missing heartbeats trigger failure investigation.

Pull-based health checks: The monitor actively queries the primary's health endpoint. Failed queries or timeout responses indicate potential failure.

Bidirectional heartbeats: Both primary and standby exchange heartbeats, enabling either to detect the other's failure.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
interface HeartbeatConfig {
    intervalMs: number;        // Time between heartbeats
    timeoutMs: number;         // Time to wait for response
    failureThreshold: number;  // Consecutive failures before declaring dead
}
 
class HeartbeatMonitor {
    private consecutiveFailures = 0;
    private lastSuccessTime = Date.now();
    
    constructor(
        private config: HeartbeatConfig,
        private target: HealthCheckTarget,
        private onFailure: () => void
    ) {}
    
    async checkHealth(): Promise<boolean> {
        try {
            const response = await Promise.race([
                this.target.healthCheck(),
                this.timeout(this.config.timeoutMs)
            ]);
            
            if (response.healthy) {
                this.consecutiveFailures = 0;
                this.lastSuccessTime = Date.now();
                return true;
            }
            
            return this.handleFailure('Unhealthy response');
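        } catch (error) {
            // Thrown values are not guaranteed to be Error instances
            const reason = error instanceof Error ? error.message : String(error);
            return this.handleFailure(reason);
        }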
    }
    
    private handleFailure(reason: string): boolean {
        this.consecutiveFailures++;
        console.warn(`Heartbeat failure #${this.consecutiveFailures}: ${reason}`);
        
        // Fire once when the threshold is reached, not again on every later failure
        if (this.consecutiveFailures === this.config.failureThreshold) {
            console.error('Failure threshold reached, triggering failover');
            this.onFailure();
        }
        
        return false;
    }
    
    private timeout(ms: number): Promise<never> {
        return new Promise((_, reject) => 
            setTimeout(() => reject(new Error('Timeout')), ms)
        );
    }
}
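
The monitor above is pull-based: it calls the target's health check itself. For the push-based variant, the receiving side only needs to record when each heartbeat arrives and measure how long the primary has been silent. Here is a rough sketch, assuming a hypothetical `HeartbeatReceiver` with an injected clock so the logic can be followed without real timers; none of these names come from the original listing.

```typescript
// Hypothetical receiver for push-based heartbeats.
class HeartbeatReceiver {
    private lastHeartbeatAt: number;
    private intervalMs: number;
    private missedThreshold: number; // silent intervals tolerated before suspecting failure
    private now: () => number;

    constructor(intervalMs: number, missedThreshold: number, now: () => number) {
        this.intervalMs = intervalMs;
        this.missedThreshold = missedThreshold;
        this.now = now;
        this.lastHeartbeatAt = now();
    }

    // Called whenever a heartbeat message arrives from the primary.
    recordHeartbeat(): void {
        this.lastHeartbeatAt = this.now();
    }

    // Number of whole heartbeat intervals that have elapsed with no signal.
    missedIntervals(): number {
        return Math.floor((this.now() - this.lastHeartbeatAt) / this.intervalMs);
    }

    isSuspectedDead(): boolean {
        return this.missedIntervals() >= this.missedThreshold;
    }
}

// Simulated clock for the example.
let clockMs = 0;
const receiver = new HeartbeatReceiver(5000, 3, () => clockMs);

clockMs = 4000;
receiver.recordHeartbeat();                        // heartbeat arrives on time
clockMs = 12000;                                   // 8 s of silence: 1 missed interval
const suspectedEarly = receiver.isSuspectedDead(); // false
clockMs = 19000;                                   // 15 s of silence: 3 missed intervals
const suspectedLate = receiver.isSuspectedDead();  // true
```

In a bidirectional setup, both primary and standby would run a receiver like this for the other side's heartbeats.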

Configuring Detection Parameters

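The three critical parameters for heartbeat configuration are the heartbeat interval (time between checks), the timeout (how long to wait for each response), and the failure threshold (consecutive failures before the primary is declared dead). Smaller values detect failures sooner but raise the risk of false positives.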

The Detection Time Formula:

Maximum Detection Time = (Failure Threshold × Heartbeat Interval) + Timeout

For example, with 5-second intervals, 3-second timeouts, and a threshold of 3:

Maximum detection time = (3 × 5) + 3 = 18 seconds

This 18-second window represents the worst-case time between actual failure and failover initiation.
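
To compare candidate configurations, the formula can be wrapped in a small helper. This is only a sketch: `DetectionParams` and `maxDetectionTimeMs` are names chosen for illustration, and the second configuration is a made-up example rather than a recommendation.

```typescript
interface DetectionParams {
    intervalMs: number;
    timeoutMs: number;
    failureThreshold: number;
}

// Worst-case time from actual failure to failover initiation:
// (failure threshold x heartbeat interval) + timeout.
function maxDetectionTimeMs(p: DetectionParams): number {
    return p.failureThreshold * p.intervalMs + p.timeoutMs;
}

// The example from the text: 5 s interval, 3 s timeout, threshold of 3.
const conservative = maxDetectionTimeMs({ intervalMs: 5000, timeoutMs: 3000, failureThreshold: 3 });

// A hypothetical tighter configuration.
const aggressive = maxDetectionTimeMs({ intervalMs: 1000, timeoutMs: 500, failureThreshold: 2 });

console.log(`conservative: ${conservative / 1000}s, aggressive: ${aggressive / 1000}s`);
// prints "conservative: 18s, aggressive: 2.5s"
```

The tighter configuration reacts about seven times faster, but two slow responses in a row (for example during a long garbage collection pause) would be enough to trigger failover.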

Multi-Layer Health Checks

State Synchronization Strategies

Synchronous Replication

In synchronous replication, the primary waits for the standby to acknowledge receipt of each change before confirming success to the client:

Client sends write request to primary
Primary applies change locally
Primary sends change to standby
Standby applies change and acknowledges
Primary confirms success to client

Advantages:

Zero data loss during failover (RPO = 0)
Standby always has complete, consistent state
No recovery replay needed after promotion

Disadvantages:

Higher latency for every write operation
Standby failure blocks primary operations
Network latency directly impacts performance

Asynchronous Replication

In asynchronous replication, the primary confirms success immediately and replicates changes independently:

Client sends write request to primary
Primary applies change and confirms success
Primary queues change for replication
Standby receives and applies change (eventually)

Advantages:

Lower write latency (no waiting for standby)
Standby failures don't affect primary performance
Works across high-latency network links

Disadvantages:

Potential data loss during failover (RPO > 0)
Replication lag creates inconsistency windows
May require recovery replay after promotion

Replication Strategy Comparison
Aspect	Synchronous	Asynchronous
Data Loss Risk	None (RPO = 0)	Possible (RPO > 0)
Write Latency	Higher (network round-trip)	Lower (local only)
Standby Impact	Failure blocks writes	No impact on writes
Geographic Distance	Limited by latency	Works globally
Consistency	Always consistent	Eventually consistent
Use Cases	Financial, critical data	Analytics, caching
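
The difference in data loss risk can be seen in a small in-memory model. The sketch below is a simplification under stated assumptions: `Primary`, `Standby`, and `lostOnFailover` are hypothetical names, replication is a direct method call rather than a network transfer, and "flush" stands in for the background replication step.

```typescript
type ReplicationMode = "sync" | "async";

class Standby {
    log: string[] = [];
    apply(entry: string): void {
        this.log.push(entry);
    }
}

class Primary {
    log: string[] = [];
    private pending: string[] = []; // confirmed to the client, not yet on the standby
    private mode: ReplicationMode;
    private standby: Standby;

    constructor(mode: ReplicationMode, standby: Standby) {
        this.mode = mode;
        this.standby = standby;
    }

    // Returns once the write would be confirmed to the client.
    write(entry: string): void {
        this.log.push(entry);
        if (this.mode === "sync") {
            this.standby.apply(entry); // standby has the change before the client hears back
        } else {
            this.pending.push(entry);  // confirm now, replicate later
        }
    }

    // Background replication step used in async mode.
    flush(): void {
        for (const entry of this.pending) {
            this.standby.apply(entry);
        }
        this.pending = [];
    }
}

// Writes confirmed to clients but missing on the standby if the primary dies right now.
function lostOnFailover(primary: Primary, standby: Standby): number {
    return primary.log.length - standby.log.length;
}

const syncStandby = new Standby();
const syncPrimary = new Primary("sync", syncStandby);
syncPrimary.write("a");
syncPrimary.write("b");

const asyncStandby = new Standby();
const asyncPrimary = new Primary("async", asyncStandby);
asyncPrimary.write("a");
asyncPrimary.flush();
asyncPrimary.write("b"); // primary crashes before the next flush

const syncLoss = lostOnFailover(syncPrimary, syncStandby);    // 0
const asyncLoss = lostOnFailover(asyncPrimary, asyncStandby); // 1
```

The one lost write in the asynchronous case is exactly the replication lag at the moment of failure, which is why lag is the practical measure of RPO.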

Semi-Synchronous Replication

Semi-synchronous replication offers a middle ground:

Primary waits for acknowledgment from at least one standby
If no acknowledgment within timeout, falls back to asynchronous
Balances data safety with availability

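MySQL provides this mode through its semisynchronous replication plugin, which has a configurable acknowledgment timeout. PostgreSQL does not fall back to asynchronous replication automatically, but its synchronous_commit setting lets operators choose how far each commit waits on the standby, so a similar tradeoff can be tuned to requirements.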
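
The fallback decision itself is simple to model. The sketch below is not any database's actual implementation; `semiSyncWrite`, its parameters, and the latency numbers are assumptions made for illustration, with the standby's behavior simulated by a delay value instead of a real network call.

```typescript
type AckMode = "synchronous" | "asynchronous-fallback";

interface WriteOutcome {
    mode: AckMode;
    clientLatencyMs: number;
}

// ackDelayMs is how long the standby takes to acknowledge,
// or null if the standby never answers.
function semiSyncWrite(
    localCommitMs: number,
    ackDelayMs: number | null,
    ackTimeoutMs: number
): WriteOutcome {
    if (ackDelayMs !== null && ackDelayMs <= ackTimeoutMs) {
        // Acknowledged in time: the change exists on two nodes before the client hears back.
        return { mode: "synchronous", clientLatencyMs: localCommitMs + ackDelayMs };
    }
    // No acknowledgment in time: stop waiting and confirm to the client anyway.
    return { mode: "asynchronous-fallback", clientLatencyMs: localCommitMs + ackTimeoutMs };
}

const healthy = semiSyncWrite(2, 5, 1000);        // synchronous, 7 ms
const slowStandby = semiSyncWrite(2, 4000, 1000); // fallback after waiting 1000 ms
const deadStandby = semiSyncWrite(2, null, 1000); // fallback after waiting 1000 ms
```

The timeout is the tuning knob: a long timeout keeps RPO at zero for longer but stalls writes when the standby struggles, while a short one favors availability.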

Replication Lag Monitoring

Failover Mechanics

When failure is detected, the system must execute a coordinated sequence of steps to promote the standby to primary status. This process must be reliable, fast, and safe from split-brain scenarios.

The Failover Sequence:

Step 1: Failure Confirmation

Multiple monitors confirm the primary is unreachable
Distinguish between primary failure and network partition
Quorum-based decision prevents false positives

Step 2: Fencing the Failed Primary

Ensure the old primary cannot continue serving requests
Revoke access to shared resources (storage, network)
STONITH (power off) for hardware-level certainty

Step 3: Standby Promotion

Stop replication from the failed primary
Apply any pending replication log entries
Transition from read-only to read-write mode
Update internal state to reflect primary role

Step 4: Traffic Redirection

Move Virtual IP to the new primary
Update DNS records (with TTL considerations)
Reconfigure load balancer backends
Update service discovery registrations

Step 5: Validation

Verify the new primary is accepting connections
Confirm data integrity and consistency
Run health checks against the promoted system
Alert operations team of completed failover

class FailoverOrchestrator {
    async executeFailover(
        failedPrimary: Node,
        standby: Node,
        router: TrafficRouter
    ): Promise<FailoverResult> {
        const startTime = Date.now();
        
        try {
            // Step 1: Confirm failure with quorum
            console.log('Confirming primary failure...');
            const confirmed = await this.confirmFailure(failedPrimary);
            if (!confirmed) {
                return { success: false, reason: 'Failure not confirmed' };
            }
            
            // Step 2: Fence the failed primary
            console.log('Fencing failed primary...');
            await this.fencePrimary(failedPrimary);
            
            // Step 3: Promote standby
            console.log('Promoting standby to primary...');
            await standby.stopReplication();
            await standby.applyPendingLogs();
            await standby.enableWriteMode();
            
            // Step 4: Redirect traffic
            console.log('Redirecting traffic...');
            await router.updatePrimary(standby.address);
            
            // Step 5: Validate
            console.log('Validating new primary...');
            await this.validatePrimary(standby);
            
            const duration = Date.now() - startTime;
            console.log(`Failover completed in ${duration}ms`);
            
            return { 
                success: true, 
                newPrimary: standby.address,
                failoverDurationMs: duration 
            };
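        } catch (error) {
            console.error('Failover failed:', error);
            await this.alertOperations(error);
            // Thrown values are not guaranteed to be Error instances
            const reason = error instanceof Error ? error.message : String(error);
            return { success: false, reason };
        }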
    }
}
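
The orchestrator calls `confirmFailure` without showing how the quorum decision is made. One possible core for it is a majority vote across independent monitors. The sketch below assumes a hypothetical `MonitorVote` shape and an `isFailureConfirmed` helper; how votes are collected over the network is left out.

```typescript
// Hypothetical vote from one monitor: true means "I cannot reach the primary".
interface MonitorVote {
    monitorId: string;
    primaryUnreachable: boolean;
}

// Failure is confirmed only when a strict majority of ALL configured monitors
// (not just those that answered) report the primary as unreachable.
function isFailureConfirmed(votes: MonitorVote[], totalMonitors: number): boolean {
    const unreachableVotes = votes.filter((v) => v.primaryUnreachable).length;
    return unreachableVotes > Math.floor(totalMonitors / 2);
}

const majorityAgrees = isFailureConfirmed(
    [
        { monitorId: "m1", primaryUnreachable: true },
        { monitorId: "m2", primaryUnreachable: true },
        { monitorId: "m3", primaryUnreachable: false },
    ],
    3
); // true: 2 of 3 monitors agree

const isolatedMonitor = isFailureConfirmed(
    [{ monitorId: "m1", primaryUnreachable: true }],
    3
); // false: only 1 of 3, so the monitor itself may be the one that is partitioned
```

Counting against the total number of configured monitors, rather than the number of responses, is what keeps a single partitioned monitor from triggering failover on its own.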

The Split-Brain Problem

Implementation Best Practices

Implementing active-passive redundancy correctly requires attention to numerous details. These best practices, learned from production experience, help avoid common pitfalls.

1. Test Failover Regularly

Failover that isn't tested is failover that doesn't work. Regular testing reveals:

Configuration drift between primary and standby
Replication lag issues
Fencing mechanism failures
Traffic routing problems
Recovery time reality vs expectations

Schedule monthly or quarterly failover drills. Automate what you can, but ensure humans practice the manual steps too.

2. Maintain Standby at Full Capacity

A standby that can't handle production load is not a viable failover target. Ensure:

Standby hardware matches primary specifications
Standby runs the same software versions
Standby has equivalent network capacity
Standby configuration mirrors production settings
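
Configuration drift is easy to check mechanically if both nodes can export their settings as key-value pairs. The following sketch assumes exactly that; `NodeConfig`, `findConfigDrift`, and the sample settings are hypothetical and stand in for whatever your nodes actually expose.

```typescript
type ConfigValue = string | number | boolean;
type NodeConfig = Record<string, ConfigValue>;

interface DriftEntry {
    key: string;
    primary: ConfigValue | undefined; // undefined means the key is missing on that node
    standby: ConfigValue | undefined;
}

// Returns every setting whose value differs or that exists on only one node,
// sorted by key so reports are stable between runs.
function findConfigDrift(primary: NodeConfig, standby: NodeConfig): DriftEntry[] {
    const keys = Object.keys(primary).concat(
        Object.keys(standby).filter((k) => !(k in primary))
    );
    return keys
        .filter((k) => primary[k] !== standby[k])
        .sort()
        .map((k) => ({ key: k, primary: primary[k], standby: standby[k] }));
}

const drift = findConfigDrift(
    { version: "14.2", maxConnections: 500, tls: true },
    { version: "14.1", maxConnections: 500, tls: true, debug: true }
);
// drift contains two entries: "debug" (missing on the primary) and "version" (14.2 vs 14.1)
```

Running a check like this on a schedule, and before every failover drill, catches drift while it is still cheap to fix.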

3. Monitor Replication Continuously

Replication health determines failover quality:

Track replication lag in real-time
Alert on lag exceeding thresholds
Monitor replication throughput and queue depth
Verify data consistency periodically
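
A lag alert can be as simple as comparing log positions against two thresholds. The sketch below assumes positions are expressed in bytes and that both values are sampled at roughly the same moment; `classifyLag`, the field names, and the threshold values are illustrative, not taken from any particular database.

```typescript
type LagStatus = "ok" | "warning" | "critical";

interface ReplicationSample {
    primaryLogPosition: number; // bytes written on the primary
    standbyLogPosition: number; // bytes replayed on the standby
}

interface LagThresholds {
    warningBytes: number;
    criticalBytes: number;
}

function classifyLag(
    sample: ReplicationSample,
    t: LagThresholds
): { lagBytes: number; status: LagStatus } {
    // Clamp at zero in case the two positions were sampled slightly out of order.
    const lagBytes = Math.max(0, sample.primaryLogPosition - sample.standbyLogPosition);
    let status: LagStatus = "ok";
    if (lagBytes >= t.criticalBytes) {
        status = "critical";
    } else if (lagBytes >= t.warningBytes) {
        status = "warning";
    }
    return { lagBytes, status };
}

const thresholds: LagThresholds = { warningBytes: 1000000, criticalBytes: 16000000 };

const caughtUp = classifyLag({ primaryLogPosition: 50000000, standbyLogPosition: 49500000 }, thresholds);
const behind = classifyLag({ primaryLogPosition: 50000000, standbyLogPosition: 40000000 }, thresholds);
const farBehind = classifyLag({ primaryLogPosition: 50000000, standbyLogPosition: 30000000 }, thresholds);
```

With asynchronous replication, the lag at the moment of failure is the data that would be lost, so the critical threshold is effectively a statement of the RPO you are willing to accept.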

4. Implement Proper Timeouts

Timeout configuration requires careful tuning:

Account for garbage collection pauses
Consider network latency variability
Allow for temporary load spikes
Balance detection speed vs false positive risk

Active-Passive Checklist

• Identical configurations: Primary and standby must be configured identically to ensure seamless takeover
• Automated failover: Manual intervention should be optional, not required for basic failures
• Fencing implementation: STONITH or equivalent must be in place to prevent split-brain
• Replication monitoring: Continuous visibility into lag, throughput, and consistency
• Regular testing: Scheduled failover drills with documented results
• Runbook documentation: Clear procedures for manual intervention when automation fails
• Alerting integration: Immediate notification of failures and failover events
• Capacity planning: Standby must handle full production load

Real-World Examples

PostgreSQL Streaming Replication

PostgreSQL implements active-passive through streaming replication:

Primary streams Write-Ahead Log (WAL) to standby
Standby replays WAL to maintain synchronized state
Supports both synchronous and asynchronous modes
Tools like Patroni automate failover and leader election

MySQL Group Replication (Single-Primary Mode)

MySQL's Group Replication can operate in single-primary mode:

One member is elected primary and accepts writes
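Other members are secondaries that receive every transaction through group consensus (often described as virtually synchronous; secondaries may apply changes slightly after the primary commits)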
Automatic primary election on failure
Built-in conflict detection (though conflicts shouldn't occur)

AWS RDS Multi-AZ

Amazon RDS implements active-passive at the managed service level:

Primary instance in one Availability Zone
Synchronous standby in a different AZ
Automatic failover with DNS update
Transparent to applications using the endpoint DNS

Redis Sentinel

Redis uses Sentinel for active-passive high availability:

Sentinel processes monitor Redis master and replicas
Automatic failover when the master is unreachable
Client libraries query Sentinel for current master
Supports notification and scripting on state changes

Leverage Managed Services

Summary: Active-Passive Redundancy

Key Takeaways

• Active-passive trades utilization for simplicity: Standby resources are idle but eliminate single points of failure
• Failure detection is the critical challenge: Balance speed against false positives through careful timeout configuration
• Synchronization strategy determines data loss risk: Synchronous replication = zero loss; asynchronous = potential loss
• Fencing prevents split-brain catastrophes: Never skip or shortcut fencing mechanisms
• Regular testing is non-negotiable: Untested failover is unreliable failover
• Monitoring enables proactive response: Track replication lag, health status, and failover readiness continuously
