It's 3 AM. Your on-call pager fires. The primary database server—handling all writes for your e-commerce platform during a holiday sale—has just crashed. Orders are piling up. Every second of downtime costs thousands of dollars.
The cluster has two healthy replicas. But how does the system know the primary is really down? Which replica becomes the new primary? What happens to writes that the old primary accepted but hadn't replicated yet? And most critically: how do we ensure we don't end up with two primaries accepting conflicting writes?
This is failover—the process of transitioning from a failed leader to a new one. Done well, it's seamless and automatic. Done poorly, it causes data loss, extended outages, or the dreaded split-brain scenario where data integrity is compromised.
This page explores every aspect of failover handling in leader-follower replication.
By the end of this page, you will understand failure detection mechanisms, the failover decision-making process, leader election algorithms, data consistency guarantees during failover, split-brain prevention, manual vs. automatic failover trade-offs, and operational best practices for reliable failover.
Before initiating failover, the system must determine that the leader has actually failed. This is surprisingly difficult in distributed systems, where networks are unreliable and a failed leader is often indistinguishable from a temporarily slow one.
The Fundamental Challenge:
If a follower can't reach the leader, it could mean:

- The leader has crashed.
- The leader is alive but overloaded and slow to respond.
- The leader is fine, but the network between it and the follower has failed.

From the follower's perspective, all three look identical: messages stop arriving.
Misjudging leads to two failure modes:

- False positive: declaring a healthy leader dead triggers an unnecessary failover, with its attendant downtime and risk.
- False negative: failing to detect a real failure prolongs the outage.
```
TIME ──────────────────────────────────────────────────────────────▶

NORMAL OPERATION:
  Leader    ──♥──♥──♥──♥──♥──♥──♥──   (heartbeats every 1 second)
  Follower    ✓    ✓    ✓    ✓    ✓   (receives heartbeats)

LEADER FAILURE:
            t=0s      t=1s      t=2s      t=3s      t=4s      t=5s
             │         │         │         │         │         │
  Leader   CRASH ✗ ────────────────────────────────────────────────▶
  Follower          ?─missed──?─missed──?─missed──!TIMEOUT!
                                                      │
                                                      ▼
                                             INITIATE FAILOVER

Detection Time = Heartbeat Interval × Missed Count + Processing Time
               = 1s × 3 misses + ~1s processing
               = ~4 seconds

This 4-second window is write unavailability.
```

| Parameter | Lower Value | Higher Value |
|---|---|---|
| Heartbeat Interval | Faster detection, more network load, more false positives | Slower detection, less load, fewer false positives |
| Missed Count Threshold | Faster detection, more false positives (transient network) | Slower detection, fewer false positives |
| Query Timeout | Faster detection, may timeout healthy but slow leader | Slower detection, tolerates slow responses |
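The detection logic the table describes can be sketched as a small heartbeat-based failure detector. This is an illustrative sketch, not any real system's API; the class and parameter names are made up, with defaults matching the timeline above (1-second heartbeats, 3 missed beats).

```python
import time

class FailureDetector:
    """Heartbeat-based failure detector (illustrative sketch; names are hypothetical)."""

    def __init__(self, heartbeat_interval=1.0, missed_threshold=3):
        self.heartbeat_interval = heartbeat_interval
        self.missed_threshold = missed_threshold
        self.last_heartbeat = time.monotonic()

    def record_heartbeat(self):
        # Called whenever a heartbeat arrives from the leader.
        self.last_heartbeat = time.monotonic()

    def leader_suspected(self, now=None):
        # The leader is suspected dead once `missed_threshold` intervals
        # pass with no heartbeat received.
        now = time.monotonic() if now is None else now
        elapsed = now - self.last_heartbeat
        return elapsed > self.heartbeat_interval * self.missed_threshold
```

Tuning `heartbeat_interval` and `missed_threshold` moves you along the trade-off rows in the table: smaller values detect faster but raise the false-positive rate on transient network blips.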
If the network partitions (followers can't reach leader, but leader is still running and accepting writes from its network partition), followers may initiate failover while the old leader continues operating. This is the setup for split-brain. Proper failover requires not just detecting failure, but also preventing the old leader from continuing to operate.
Once a failure is detected, the cluster must elect a new leader from available followers. The election must satisfy two properties:

- Safety: at most one leader is elected per term; two nodes must never both win.
- Liveness: the cluster eventually elects some leader, so writes can resume.
These properties are in tension, as formalized by the FLP impossibility result: in an asynchronous system, no deterministic algorithm can guarantee both safety and liveness when even one node may fail. Practical systems use randomized timeouts to achieve 'mostly safe, mostly live.'
```
ELECTION PROCESS (Simplified Raft):

STEP 1: Follower becomes Candidate
─────────────────────────────────────────────────────────────────────
  Node B (follower) hasn't heard from leader for election_timeout
  Node B increments its term and becomes a candidate
  Node B votes for itself

STEP 2: Request Votes
─────────────────────────────────────────────────────────────────────
  Node B sends RequestVote RPCs to all other nodes
  Request includes: candidate's term, candidate's log position (LSN)

  ┌─────┐            ┌─────┐            ┌─────┐
  │  A  │◀──vote?────│  B  │────vote?──▶│  C  │
  │     │            │CAND │            │     │
  └─────┘            └─────┘            └─────┘
                        ▲
                        │ (B votes for itself)

STEP 3: Collect Votes
─────────────────────────────────────────────────────────────────────
  Each node votes for candidate if:
  - Candidate's term ≥ node's term
  - Node hasn't voted for someone else this term
  - Candidate's log is at least as up-to-date

STEP 4: Become Leader (or retry)
─────────────────────────────────────────────────────────────────────
  If B receives majority of votes:  B becomes leader
  If B doesn't receive majority:    wait random time, restart election
  If B discovers higher term:       step down to follower
```

Choosing the Best Candidate:
Not all followers are equal candidates. Election criteria typically consider:

- Replication position: prefer the follower with the most up-to-date log (highest LSN) to minimize data loss.
- Health and capacity: the candidate must be able to absorb the full write load.
- Locality: a candidate close to most clients and followers avoids added latency.
- Operator priority: many systems let you weight preferred nodes or exclude specific ones from election.
Consensus-based election requires a majority of nodes to agree. With 3 nodes, you need 2; with 5 nodes, you need 3. This ensures that any two majorities overlap by at least one node, preventing two simultaneous elections from both succeeding.
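Step 3's vote-granting rule can be sketched in a few lines. This is a simplified illustration of the Raft rules described above, not a complete implementation; the type and function names are made up. The "at least as up-to-date" check compares (last log term, last log index) lexicographically.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NodeState:
    term: int
    voted_for: Optional[str]   # candidate this node voted for in the current term
    last_log_term: int
    last_log_index: int

def grant_vote(node, cand_id, cand_term, cand_last_log_term, cand_last_log_index):
    """Decide whether this node grants its vote (simplified Raft rules)."""
    if cand_term < node.term:
        return False                       # stale candidate: reject
    if cand_term > node.term:
        node.term = cand_term              # newer term: adopt it, reset our vote
        node.voted_for = None
    if node.voted_for not in (None, cand_id):
        return False                       # already voted for someone else this term
    # Candidate's log must be at least as up-to-date as ours.
    up_to_date = ((cand_last_log_term, cand_last_log_index)
                  >= (node.last_log_term, node.last_log_index))
    if up_to_date:
        node.voted_for = cand_id
    return up_to_date
```

Because `voted_for` is remembered for the whole term, each node hands out at most one vote per term, which is what makes two simultaneous majority wins impossible.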
Failover is a multi-step process that must execute carefully to avoid data loss or corruption. Let's trace through a complete failover sequence.
```
PHASE 1: FAILURE DETECTION
═══════════════════════════════════════════════════════════════════════════
t=0s    Leader crashes (unknown to followers)
t=1s    First missed heartbeat
t=2s    Second missed heartbeat
t=3s    Third missed heartbeat → FAILURE CONFIRMED

PHASE 2: OLD LEADER ISOLATION (FENCING)
═══════════════════════════════════════════════════════════════════════════
t=3.1s  Revoke old leader's access:
        - Cloud: terminate instance or detach storage
        - On-prem: STONITH (Shoot The Other Node In The Head)
        - Network: update firewall to block old leader
        - Storage: revoke write access at SAN level
        WHY: Prevent old leader from accepting writes if it recovers

PHASE 3: LEADER ELECTION
═══════════════════════════════════════════════════════════════════════════
t=3.2s  Election begins
t=3.5s  Votes collected, new leader elected (Node B)
t=3.6s  New leader confirms its role

PHASE 4: NEW LEADER PROMOTION
═══════════════════════════════════════════════════════════════════════════
t=3.7s  New leader:
        - Stops accepting replication from old leader
        - Begins accepting write queries
        - Starts streaming to remaining followers
        - Announces its leadership (to load balancers, apps)

PHASE 5: TRAFFIC ROUTING UPDATE
═══════════════════════════════════════════════════════════════════════════
t=3.8s  Update routing:
        - DNS record updated (if DNS-based routing)
        - Load balancer reconfigured
        - Proxy nodes updated
        - Application connection pools refreshed

PHASE 6: SYSTEM STABILIZATION
═══════════════════════════════════════════════════════════════════════════
t=4.0s  New leader accepting writes
t=4.5s  Followers reconnected to new leader
t=5.0s  Monitoring confirms healthy cluster

TOTAL FAILOVER TIME: ~4-5 seconds (well-tuned automatic failover)
                     10-60 seconds (typical production systems)
                     minutes (manual failover requiring human approval)
```

If the old leader recovers while the new leader is being promoted, you have two nodes that think they're the leader. This is why fencing (Phase 2) must happen BEFORE election completes.
The old leader must be definitively isolated before a new leader can safely assume the role.
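One common way to enforce this isolation is a fencing token: every leadership grant carries a strictly increasing epoch, and the shared resource rejects writes from any holder of an older epoch. The sketch below illustrates the idea only; the class and method names are invented, and real systems implement the check at the storage, proxy, or lock-service layer.

```python
class FencedStorage:
    """Shared resource that accepts writes only under the latest fencing
    token (epoch). Hypothetical sketch, not a real storage API."""

    def __init__(self):
        self.current_epoch = 0
        self.data = {}

    def grant_leadership(self):
        # Each newly elected leader receives a strictly higher epoch.
        self.current_epoch += 1
        return self.current_epoch

    def write(self, epoch, key, value):
        # A recovered old leader still holds a stale epoch, so its
        # writes are rejected even though the process itself is alive.
        if epoch < self.current_epoch:
            raise PermissionError("stale epoch: old leader is fenced off")
        self.data[key] = value
```

The key property: fencing does not require the old leader to cooperate or even know it was deposed; the rejection happens at the resource it is trying to write.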
Failover creates moments of uncertainty about data state. Transactions may be in various stages of completion when the leader fails. Understanding and handling these edge cases is critical for data integrity.
| Transaction State | Client Experience | New Leader Behavior | Data Outcome |
|---|---|---|---|
| Not started on leader | Request never received | N/A | Client retries, succeeds on new leader |
| In progress, not committed | Connection error | Transaction doesn't exist | Client retries or handles error |
| Committed on leader WAL | May or may not see success | May or may not have replicated | Depends on sync vs. async |
| Replicated to new leader | Success already returned | Transaction exists | Data is safe |
The Critical Window: Committed but Not Replicated
The most dangerous state is when the old leader committed a transaction (wrote to WAL) but hadn't replicated to any follower. With synchronous replication, this window is zero—transactions only commit after replication. With asynchronous replication, this window equals the replication lag.
What happens to transactions caught in this window:
With Synchronous Replication — By definition, any transaction the old leader acknowledged is on at least one follower. The new leader (which must be one of those followers) has all acknowledged transactions.
With Asynchronous Replication — Transactions in the replication window are lost. The new leader's WAL is behind the old leader's. Any transactions the old leader acknowledged but hadn't replicated are gone.
If Old Leader Recovers — It may have "orphan" transactions not present on the new leader. These must be rolled back or conflict-resolved—they cannot simply be applied.
```
SCENARIO: Async Replication, Leader Fails, Then Recovers

BEFORE FAILURE:
  Old Leader WAL:  [tx1] [tx2] [tx3] [tx4] [tx5] [tx6]
                                      ▲            ▲
                                      │            └── Last committed
                                      │                (client saw success)
                                      └── Last replicated to follower

  Follower WAL:    [tx1] [tx2] [tx3] [tx4]
                                      ▲
                                      └── Follower's latest

AFTER FAILOVER:
  New Leader WAL:  [tx1] [tx2] [tx3] [tx4] [tx7] [tx8]  ← New writes
                                            ▲
                                            └── New transactions after promotion

  Old Leader WAL:  [tx1] [tx2] [tx3] [tx4] [tx5] [tx6]
                                            ▲
                                            └── tx5 and tx6 are "orphan" transactions

PROBLEM:  tx5 and tx6 exist on old leader but not new leader
          Client was told tx5 and tx6 succeeded, but they're lost
          If old leader rejoins, it has conflicting history

SOLUTIONS:
  - Discard orphan transactions (acknowledge data loss)
  - Attempt to replay orphan transactions on new leader (may conflict)
  - Manual review and reconciliation
  - Prevent this with synchronous replication
```

RPO (Recovery Point Objective) is how much data loss is acceptable. RTO (Recovery Time Objective) is how long downtime is acceptable. Synchronous replication minimizes RPO to zero but increases RTO (writes are slower). Asynchronous replication minimizes latency impact but has non-zero RPO.
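Detecting the divergence point is conceptually simple: the shared history is the longest common prefix of the two WALs, and everything on the old leader past that prefix is orphaned. A toy sketch of that comparison, using transaction ids in commit order (a hypothetical simplification; real systems compare WAL positions/LSNs, not id lists):

```python
def find_orphans(old_leader_wal, new_leader_wal):
    """Return transactions present on the failed leader but absent from the
    new leader's history -- the potential data loss after async failover.
    WALs are modeled as lists of transaction ids in commit order."""
    shared = 0
    # Walk the longest common prefix: the history both nodes agree on.
    while (shared < len(old_leader_wal) and shared < len(new_leader_wal)
           and old_leader_wal[shared] == new_leader_wal[shared]):
        shared += 1
    return old_leader_wal[shared:]

# The scenario from the diagram above:
old = ["tx1", "tx2", "tx3", "tx4", "tx5", "tx6"]
new = ["tx1", "tx2", "tx3", "tx4", "tx7", "tx8"]
orphans = find_orphans(old, new)   # tx5 and tx6
```

Everything after the common prefix on the old leader must be discarded, replayed, or reconciled before that node can rejoin as a follower.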
Split-brain is the nightmare scenario: two nodes both believe they're the leader and accept conflicting writes. This corrupts data and may be unrecoverable. Preventing split-brain is the primary goal of failover design.
```
NETWORK PARTITION SPLITS CLUSTER:

┌─────────────────────┐          ║         ┌─────────────────────┐
│     PARTITION A     │          ║         │     PARTITION B     │
│                     │          ║         │                     │
│  ┌───────────────┐  │  NETWORK PARTITION │  ┌───────────────┐  │
│  │  Old Leader   │  │◀────────X────────▶│  │   Follower    │  │
│  │   (still      │  │          ║         │  │  (promoted!)  │  │
│  │   running)    │  │          ║         │  │               │  │
│  └───────┬───────┘  │          ║         │  └───────┬───────┘  │
│          │          │          ║         │          │          │
│  Client A writes    │          ║         │  Client B writes    │
│  UPDATE x = 100     │          ║         │  UPDATE x = 200     │
│                     │          ║         │                     │
└─────────────────────┘          ║         └─────────────────────┘

RESULT: x = 100 on one node, x = 200 on the other
        Both clients told their write succeeded
        Data is permanently inconsistent
```

The Quorum Approach in Detail:
With quorum-based systems (Raft, Paxos, ZooKeeper), split-brain is mathematically prevented:
For example, with 5 nodes:

- A majority is 3 votes.
- A partition containing 3 nodes can elect a leader; the 2-node side cannot.
- Two disjoint partitions can never both reach 3 votes, so two leaders can never coexist.
Quorum prevents split-brain but also means the minority partition becomes unavailable for writes. With 5 nodes, losing 3 means no writes. This is the fundamental trade-off: you can have consistency or availability during partitions, but not both (CAP theorem).
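The arithmetic behind this guarantee fits in one function. A minimal sketch (the function name is ours, not from any particular library):

```python
def has_quorum(votes, cluster_size):
    """A candidate wins only with a strict majority. Since two disjoint
    groups cannot both exceed half the cluster, two simultaneous
    elections can never both succeed."""
    return votes > cluster_size // 2

# A 5-node cluster split 3/2 by a network partition:
majority_side = has_quorum(3, 5)   # can elect a leader and keep serving writes
minority_side = has_quorum(2, 5)   # cannot elect; becomes unavailable for writes
```

This also makes the CAP trade-off concrete: the minority side stays consistent precisely by refusing to accept writes it cannot get a majority to agree on.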
A critical design decision is whether failover should happen automatically or require human approval. Each approach has strong proponents.
| System Type | Recommended Approach | Rationale |
|---|---|---|
| Web application DB | Automatic | Downtime visibility is high; fast recovery more important |
| Financial transactions | Semi-auto (auto-detect, human-approve) | Data integrity critical; human validation worth delay |
| Multi-region DR | Manual | Failover affects many systems; requires coordination |
| Development/Staging | Automatic (or none) | Low impact; don't waste human time |
| Regulatory-controlled | Manual with audit trail | May require approval documentation |
Hybrid Approach: 'Semi-Automatic'
Many production systems use a hybrid:

- Failure detection and candidate selection run fully automatically.
- The system stages the failover plan and pages an operator with the evidence.
- A human gives one-click approval (or vetoes a suspected false positive); execution is then fully automated.
- Optionally, the failover proceeds automatically if no human responds within a deadline.
This combines fast automated execution with human oversight, catching false positives while minimizing delay.
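The hybrid control flow can be sketched as an approval gate between automated detection and automated execution. Everything here is hypothetical scaffolding: the callables (`detect`, `prepare`, `approved`, `execute`) stand in for whatever monitoring, runbook, and paging integrations a real deployment would use.

```python
import time

def semi_automatic_failover(detect, prepare, approved, execute, timeout_s=300):
    """Hybrid flow: automatic detection and preparation, human-gated
    execution. All callables are placeholders for real integrations."""
    if not detect():
        return "leader healthy"
    plan = prepare()                      # pick candidate, stage the runbook
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if approved():                    # e.g. operator clicks "approve"
            return execute(plan)          # from here on, fully automated
        time.sleep(1)
    return "approval timed out; escalating"
```

The `timeout_s` escalation path matters: a gate with no deadline quietly converts "semi-automatic" into "manual, but nobody noticed the page."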
Whichever approach you choose, practice failover regularly. Run drills monthly. Test that automatic failover works. Ensure operators know the manual process by heart. A failover procedure you've never tested is a procedure that might not work when you need it.
Let's examine how major database systems and HA solutions implement failover in practice.
In Patroni-managed PostgreSQL, for example, the elected replica runs pg_promote() to become primary and updates the DCS (distributed configuration store) with the new leader info.

Cloud-managed databases (RDS, Cloud SQL, Atlas) handle failover internally. You trade control for convenience. For most applications, this is the right choice. Build your own HA only if you have specific requirements managed services can't meet.
We've explored the complete lifecycle of failover in leader-follower replication—from detection through election, promotion, and stabilization. Let's consolidate the critical knowledge:

- Failure detection is probabilistic: heartbeat intervals and missed-count thresholds trade detection speed against false positives.
- Fence the old leader before promoting a new one; a recovering leader that isn't isolated is the road to split-brain.
- Quorum-based elections guarantee at most one leader, because any two majorities overlap in at least one node.
- Synchronous replication gives zero RPO at the cost of write latency; asynchronous replication risks losing the replication window.
- Choose automatic, semi-automatic, or manual failover based on how your system weighs downtime against data integrity.
- Practice failover regularly: an untested procedure is an unreliable one.
What's Next:
With a strong understanding of how writes flow, how followers replicate, the sync/async trade-off, and failover mechanics, we have one final topic to address: replication lag. The next page explores why followers fall behind, how to measure and monitor lag, and strategies for minimizing its impact on application behavior.
You now understand the intricacies of failover handling in leader-follower replication. You can reason about failure detection, leader election, split-brain prevention, and the trade-offs between automatic and manual failover. Next, we'll tackle replication lag—the gap between leader and followers that affects read consistency.