It's 3 AM. Your on-call pager fires. The primary database server—handling all writes for your e-commerce platform during a holiday sale—has just crashed. Orders are piling up. Every second of downtime costs thousands of dollars.
The cluster has two healthy replicas. But how does the system know the primary is really down? Which replica becomes the new primary? What happens to writes that the old primary accepted but hadn't replicated yet? And most critically: how do we ensure we don't end up with two primaries accepting conflicting writes?
This is failover—the process of transitioning from a failed leader to a new one. Done well, it's seamless and automatic. Done poorly, it causes data loss, extended outages, or the dreaded split-brain scenario where data integrity is compromised.
This page explores every aspect of failover handling in leader-follower replication.
By the end of this page, you will understand failure detection mechanisms, the failover decision-making process, leader election algorithms, data consistency guarantees during failover, split-brain prevention, manual vs. automatic failover trade-offs, and operational best practices for reliable failover.
Before initiating failover, the system must determine that the leader has actually failed. This is surprisingly difficult in distributed systems, where networks are unreliable and a failed leader is often indistinguishable from a temporarily slow one.
The Fundamental Challenge:
If a follower can't reach the leader, it could mean:

- The leader has crashed.
- The leader is alive but overloaded and slow to respond.
- The leader is fine, but the network between it and the follower has failed.

From the follower's perspective, all three look identical: messages stop arriving.
Misjudging leads to two failure modes:

- False positive: declaring a healthy leader dead triggers an unnecessary failover, with its attendant downtime and risk.
- False negative: failing to detect a real failure prolongs the outage.
```
TIME ──────────────────────────────────────────────────────────────▶

NORMAL OPERATION:
  Leader    ──♥──♥──♥──♥──♥──♥──♥──   (heartbeats every 1 second)
  Follower    ✓    ✓    ✓    ✓    ✓   (receives heartbeats)

LEADER FAILURE:
            t=0s      t=1s      t=2s      t=3s      t=4s      t=5s
             │         │         │         │         │         │
  Leader   CRASH ✗ ────────────────────────────────────────────────▶
  Follower          ?─missed──?─missed──?─missed──!TIMEOUT!
                                                      │
                                                      ▼
                                             INITIATE FAILOVER

Detection Time = Heartbeat Interval × Missed Count + Processing Time
               = 1s × 3 misses + ~1s processing
               = ~4 seconds

This 4-second window is write unavailability.
```

| Parameter | Lower Value | Higher Value |
|---|---|---|
| Heartbeat Interval | Faster detection, more network load, more false positives | Slower detection, less load, fewer false positives |
| Missed Count Threshold | Faster detection, more false positives (transient network) | Slower detection, fewer false positives |
| Query Timeout | Faster detection, may timeout healthy but slow leader | Slower detection, tolerates slow responses |
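The detection logic the table describes can be sketched as a small heartbeat-based failure detector. This is an illustrative sketch, not any real system's API; the class and parameter names are made up, with defaults matching the timeline above (1-second heartbeats, 3 missed beats).

```python
import time

class FailureDetector:
    """Heartbeat-based failure detector (illustrative sketch; names are hypothetical)."""

    def __init__(self, heartbeat_interval=1.0, missed_threshold=3):
        self.heartbeat_interval = heartbeat_interval
        self.missed_threshold = missed_threshold
        self.last_heartbeat = time.monotonic()

    def record_heartbeat(self):
        # Called whenever a heartbeat arrives from the leader.
        self.last_heartbeat = time.monotonic()

    def leader_suspected(self, now=None):
        # The leader is suspected dead once `missed_threshold` intervals
        # pass with no heartbeat received.
        now = time.monotonic() if now is None else now
        elapsed = now - self.last_heartbeat
        return elapsed > self.heartbeat_interval * self.missed_threshold
```

Tuning `heartbeat_interval` and `missed_threshold` moves you along the trade-off rows in the table: smaller values detect faster but raise the false-positive rate on transient network blips.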
If the network partitions (followers can't reach leader, but leader is still running and accepting writes from its network partition), followers may initiate failover while the old leader continues operating. This is the setup for split-brain. Proper failover requires not just detecting failure, but also preventing the old leader from continuing to operate.
Once a failure is detected, the cluster must elect a new leader from available followers. The election must satisfy two properties:

- Safety: at most one leader is elected per term; two nodes must never both win.
- Liveness: the cluster eventually elects some leader, so writes can resume.
These properties are in tension, as formalized by the FLP impossibility result: in an asynchronous system, no deterministic algorithm can guarantee both safety and liveness when even one node may fail. Practical systems use randomized timeouts to achieve 'mostly safe, mostly live.'
```
ELECTION PROCESS (Simplified Raft):

STEP 1: Follower becomes Candidate
─────────────────────────────────────────────────────────────────────
  Node B (follower) hasn't heard from leader for election_timeout
  Node B increments its term and becomes a candidate
  Node B votes for itself

STEP 2: Request Votes
─────────────────────────────────────────────────────────────────────
  Node B sends RequestVote RPCs to all other nodes
  Request includes: candidate's term, candidate's log position (LSN)

  ┌─────┐            ┌─────┐            ┌─────┐
  │  A  │◀──vote?────│  B  │────vote?──▶│  C  │
  │     │            │CAND │            │     │
  └─────┘            └─────┘            └─────┘
                        ▲
                        │ (B votes for itself)

STEP 3: Collect Votes
─────────────────────────────────────────────────────────────────────
  Each node votes for candidate if:
  - Candidate's term ≥ node's term
  - Node hasn't voted for someone else this term
  - Candidate's log is at least as up-to-date

STEP 4: Become Leader (or retry)
─────────────────────────────────────────────────────────────────────
  If B receives majority of votes:  B becomes leader
  If B doesn't receive majority:    wait random time, restart election
  If B discovers higher term:       step down to follower
```

Choosing the Best Candidate:
Not all followers are equal candidates. Election criteria typically consider:

- Replication position: prefer the follower with the most up-to-date log (highest LSN) to minimize data loss.
- Health and capacity: the candidate must be able to absorb the full write load.
- Locality: a candidate close to most clients and followers avoids added latency.
- Operator priority: many systems let you weight preferred nodes or exclude specific ones from election.
Consensus-based election requires a majority of nodes to agree. With 3 nodes, you need 2; with 5 nodes, you need 3. This ensures that any two majorities overlap by at least one node, preventing two simultaneous elections from both succeeding.
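Step 3's vote-granting rule can be sketched in a few lines. This is a simplified illustration of the Raft rules described above, not a complete implementation; the type and function names are made up. The "at least as up-to-date" check compares (last log term, last log index) lexicographically.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NodeState:
    term: int
    voted_for: Optional[str]   # candidate this node voted for in the current term
    last_log_term: int
    last_log_index: int

def grant_vote(node, cand_id, cand_term, cand_last_log_term, cand_last_log_index):
    """Decide whether this node grants its vote (simplified Raft rules)."""
    if cand_term < node.term:
        return False                       # stale candidate: reject
    if cand_term > node.term:
        node.term = cand_term              # newer term: adopt it, reset our vote
        node.voted_for = None
    if node.voted_for not in (None, cand_id):
        return False                       # already voted for someone else this term
    # Candidate's log must be at least as up-to-date as ours.
    up_to_date = ((cand_last_log_term, cand_last_log_index)
                  >= (node.last_log_term, node.last_log_index))
    if up_to_date:
        node.voted_for = cand_id
    return up_to_date
```

Because `voted_for` is remembered for the whole term, each node hands out at most one vote per term, which is what makes two simultaneous majority wins impossible.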
Failover is a multi-step process that must execute carefully to avoid data loss or corruption. Let's trace through a complete failover sequence.
```
PHASE 1: FAILURE DETECTION
═══════════════════════════════════════════════════════════════════════════
t=0s    Leader crashes (unknown to followers)
t=1s    First missed heartbeat
t=2s    Second missed heartbeat
t=3s    Third missed heartbeat → FAILURE CONFIRMED

PHASE 2: OLD LEADER ISOLATION (FENCING)
═══════════════════════════════════════════════════════════════════════════
t=3.1s  Revoke old leader's access:
        - Cloud: terminate instance or detach storage
        - On-prem: STONITH (Shoot The Other Node In The Head)
        - Network: update firewall to block old leader
        - Storage: revoke write access at SAN level
        WHY: Prevent old leader from accepting writes if it recovers

PHASE 3: LEADER ELECTION
═══════════════════════════════════════════════════════════════════════════
t=3.2s  Election begins
t=3.5s  Votes collected, new leader elected (Node B)
t=3.6s  New leader confirms its role

PHASE 4: NEW LEADER PROMOTION
═══════════════════════════════════════════════════════════════════════════
t=3.7s  New leader:
        - Stops accepting replication from old leader
        - Begins accepting write queries
        - Starts streaming to remaining followers
        - Announces its leadership (to load balancers, apps)

PHASE 5: TRAFFIC ROUTING UPDATE
═══════════════════════════════════════════════════════════════════════════
t=3.8s  Update routing:
        - DNS record updated (if DNS-based routing)
        - Load balancer reconfigured
        - Proxy nodes updated
        - Application connection pools refreshed

PHASE 6: SYSTEM STABILIZATION
═══════════════════════════════════════════════════════════════════════════
t=4.0s  New leader accepting writes
t=4.5s  Followers reconnected to new leader
t=5.0s  Monitoring confirms healthy cluster

TOTAL FAILOVER TIME: ~4-5 seconds (well-tuned automatic failover)
                     10-60 seconds (typical production systems)
                     minutes (manual failover requiring human approval)
```

If the old leader recovers while the new leader is being promoted, you have two nodes that think they're the leader. This is why fencing (Phase 2) must happen BEFORE election completes.
The old leader must be definitively isolated before a new leader can safely assume the role.
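One common way to enforce this isolation is a fencing token: every leadership grant carries a strictly increasing epoch, and the shared resource rejects writes from any holder of an older epoch. The sketch below illustrates the idea only; the class and method names are invented, and real systems implement the check at the storage, proxy, or lock-service layer.

```python
class FencedStorage:
    """Shared resource that accepts writes only under the latest fencing
    token (epoch). Hypothetical sketch, not a real storage API."""

    def __init__(self):
        self.current_epoch = 0
        self.data = {}

    def grant_leadership(self):
        # Each newly elected leader receives a strictly higher epoch.
        self.current_epoch += 1
        return self.current_epoch

    def write(self, epoch, key, value):
        # A recovered old leader still holds a stale epoch, so its
        # writes are rejected even though the process itself is alive.
        if epoch < self.current_epoch:
            raise PermissionError("stale epoch: old leader is fenced off")
        self.data[key] = value
```

The key property: fencing does not require the old leader to cooperate or even know it was deposed; the rejection happens at the resource it is trying to write.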
Failover creates moments of uncertainty about data state. Transactions may be in various stages of completion when the leader fails. Understanding and handling these edge cases is critical for data integrity.
| Transaction State | Client Experience | New Leader Behavior | Data Outcome |
|---|---|---|---|
| Not started on leader | Request never received | N/A | Client retries, succeeds on new leader |
| In progress, not committed | Connection error | Transaction doesn't exist | Client retries or handles error |
| Committed on leader WAL | May or may not see success | May or may not have replicated | Depends on sync vs. async |
| Replicated to new leader | Success already returned | Transaction exists | Data is safe |
The Critical Window: Committed but Not Replicated
The most dangerous state is when the old leader committed a transaction (wrote to WAL) but hadn't replicated to any follower. With synchronous replication, this window is zero—transactions only commit after replication. With asynchronous replication, this window equals the replication lag.
What happens to transactions caught in this window:
With Synchronous Replication — By definition, any transaction the old leader acknowledged is on at least one follower. The new leader (which must be one of those followers) has all acknowledged transactions.
With Asynchronous Replication — Transactions in the replication window are lost. The new leader's WAL is behind the old leader's. Any transactions the old leader acknowledged but hadn't replicated are gone.
If Old Leader Recovers — It may have "orphan" transactions not present on the new leader. These must be rolled back or conflict-resolved—they cannot simply be applied.
```
SCENARIO: Async Replication, Leader Fails, Then Recovers

BEFORE FAILURE:
  Old Leader WAL:  [tx1] [tx2] [tx3] [tx4] [tx5] [tx6]
                                      ▲            ▲
                                      │            └── Last committed
                                      │                (client saw success)
                                      └── Last replicated to follower

  Follower WAL:    [tx1] [tx2] [tx3] [tx4]
                                      ▲
                                      └── Follower's latest

AFTER FAILOVER:
  New Leader WAL:  [tx1] [tx2] [tx3] [tx4] [tx7] [tx8]  ← New writes
                                            ▲
                                            └── New transactions after promotion

  Old Leader WAL:  [tx1] [tx2] [tx3] [tx4] [tx5] [tx6]
                                            ▲
                                            └── tx5 and tx6 are "orphan" transactions

PROBLEM:  tx5 and tx6 exist on old leader but not new leader
          Client was told tx5 and tx6 succeeded, but they're lost
          If old leader rejoins, it has conflicting history

SOLUTIONS:
  - Discard orphan transactions (acknowledge data loss)
  - Attempt to replay orphan transactions on new leader (may conflict)
  - Manual review and reconciliation
  - Prevent this with synchronous replication
```

RPO (Recovery Point Objective) is how much data loss is acceptable. RTO (Recovery Time Objective) is how long downtime is acceptable. Synchronous replication minimizes RPO to zero but increases RTO (writes are slower). Asynchronous replication minimizes latency impact but has non-zero RPO.
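Detecting the divergence point is conceptually simple: the shared history is the longest common prefix of the two WALs, and everything on the old leader past that prefix is orphaned. A toy sketch of that comparison, using transaction ids in commit order (a hypothetical simplification; real systems compare WAL positions/LSNs, not id lists):

```python
def find_orphans(old_leader_wal, new_leader_wal):
    """Return transactions present on the failed leader but absent from the
    new leader's history -- the potential data loss after async failover.
    WALs are modeled as lists of transaction ids in commit order."""
    shared = 0
    # Walk the longest common prefix: the history both nodes agree on.
    while (shared < len(old_leader_wal) and shared < len(new_leader_wal)
           and old_leader_wal[shared] == new_leader_wal[shared]):
        shared += 1
    return old_leader_wal[shared:]

# The scenario from the diagram above:
old = ["tx1", "tx2", "tx3", "tx4", "tx5", "tx6"]
new = ["tx1", "tx2", "tx3", "tx4", "tx7", "tx8"]
orphans = find_orphans(old, new)   # tx5 and tx6
```

Everything after the common prefix on the old leader must be discarded, replayed, or reconciled before that node can rejoin as a follower.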
Split-brain is the nightmare scenario: two nodes both believe they're the leader and accept conflicting writes. This corrupts data and may be unrecoverable. Preventing split-brain is the primary goal of failover design.
```
NETWORK PARTITION SPLITS CLUSTER:

┌─────────────────────┐          ║         ┌─────────────────────┐
│     PARTITION A     │          ║         │     PARTITION B     │
│                     │          ║         │                     │
│  ┌───────────────┐  │  NETWORK PARTITION │  ┌───────────────┐  │
│  │  Old Leader   │  │◀────────X────────▶│  │   Follower    │  │
│  │   (still      │  │          ║         │  │  (promoted!)  │  │
│  │   running)    │  │          ║         │  │               │  │
│  └───────┬───────┘  │          ║         │  └───────┬───────┘  │
│          │          │          ║         │          │          │
│  Client A writes    │          ║         │  Client B writes    │
│  UPDATE x = 100     │          ║         │  UPDATE x = 200     │
│                     │          ║         │                     │
└─────────────────────┘          ║         └─────────────────────┘

RESULT: x = 100 on one node, x = 200 on the other
        Both clients told their write succeeded
        Data is permanently inconsistent
```

The Quorum Approach in Detail:
With quorum-based systems (Raft, Paxos, ZooKeeper), split-brain is mathematically prevented:
For example, with 5 nodes:

- A majority is 3 votes.
- A partition containing 3 nodes can elect a leader; the 2-node side cannot.
- Two disjoint partitions can never both reach 3 votes, so two leaders can never coexist.
Quorum prevents split-brain but also means the minority partition becomes unavailable for writes. With 5 nodes, losing 3 means no writes. This is the fundamental trade-off: you can have consistency or availability during partitions, but not both (CAP theorem).
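The arithmetic behind this guarantee fits in one function. A minimal sketch (the function name is ours, not from any particular library):

```python
def has_quorum(votes, cluster_size):
    """A candidate wins only with a strict majority. Since two disjoint
    groups cannot both exceed half the cluster, two simultaneous
    elections can never both succeed."""
    return votes > cluster_size // 2

# A 5-node cluster split 3/2 by a network partition:
majority_side = has_quorum(3, 5)   # can elect a leader and keep serving writes
minority_side = has_quorum(2, 5)   # cannot elect; becomes unavailable for writes
```

This also makes the CAP trade-off concrete: the minority side stays consistent precisely by refusing to accept writes it cannot get a majority to agree on.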
A critical design decision is whether failover should happen automatically or require human approval. Each approach has strong proponents.
| System Type | Recommended Approach | Rationale |
|---|---|---|
| Web application DB | Automatic | Downtime visibility is high; fast recovery more important |
| Financial transactions | Semi-auto (auto-detect, human-approve) | Data integrity critical; human validation worth delay |
| Multi-region DR | Manual | Failover affects many systems; requires coordination |
| Development/Staging | Automatic (or none) | Low impact; don't waste human time |
| Regulatory-controlled | Manual with audit trail | May require approval documentation |
Hybrid Approach: 'Semi-Automatic'
Many production systems use a hybrid:

- Failure detection and candidate selection run fully automatically.
- The system stages the failover plan and pages an operator with the evidence.
- A human gives one-click approval (or vetoes a suspected false positive); execution is then fully automated.
- Optionally, the failover proceeds automatically if no human responds within a deadline.
This combines fast automated execution with human oversight, catching false positives while minimizing delay.
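The hybrid control flow can be sketched as an approval gate between automated detection and automated execution. Everything here is hypothetical scaffolding: the callables (`detect`, `prepare`, `approved`, `execute`) stand in for whatever monitoring, runbook, and paging integrations a real deployment would use.

```python
import time

def semi_automatic_failover(detect, prepare, approved, execute, timeout_s=300):
    """Hybrid flow: automatic detection and preparation, human-gated
    execution. All callables are placeholders for real integrations."""
    if not detect():
        return "leader healthy"
    plan = prepare()                      # pick candidate, stage the runbook
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if approved():                    # e.g. operator clicks "approve"
            return execute(plan)          # from here on, fully automated
        time.sleep(1)
    return "approval timed out; escalating"
```

The `timeout_s` escalation path matters: a gate with no deadline quietly converts "semi-automatic" into "manual, but nobody noticed the page."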
Whichever approach you choose, practice failover regularly. Run drills monthly. Test that automatic failover works. Ensure operators know the manual process by heart. A failover procedure you've never tested is a procedure that might not work when you need it.
Let's examine how major database systems and HA solutions implement failover in practice.
In Patroni-managed PostgreSQL, for example, the elected replica runs pg_promote() to become primary and updates the DCS (distributed configuration store) with the new leader info.

Cloud-managed databases (RDS, Cloud SQL, Atlas) handle failover internally. You trade control for convenience. For most applications, this is the right choice. Build your own HA only if you have specific requirements managed services can't meet.
We've explored the complete lifecycle of failover in leader-follower replication—from detection through election, promotion, and stabilization. Let's consolidate the critical knowledge:

- Failure detection is probabilistic: heartbeat intervals and missed-count thresholds trade detection speed against false positives.
- Fence the old leader before promoting a new one; a recovering leader that isn't isolated is the road to split-brain.
- Quorum-based elections guarantee at most one leader, because any two majorities overlap in at least one node.
- Synchronous replication gives zero RPO at the cost of write latency; asynchronous replication risks losing the replication window.
- Choose automatic, semi-automatic, or manual failover based on how your system weighs downtime against data integrity.
- Practice failover regularly: an untested procedure is an unreliable one.
What's Next:
With a strong understanding of how writes flow, how followers replicate, the sync/async trade-off, and failover mechanics, we have one final topic to address: replication lag. The next page explores why followers fall behind, how to measure and monitor lag, and strategies for minimizing its impact on application behavior.
You now understand the intricacies of failover handling in leader-follower replication. You can reason about failure detection, leader election, split-brain prevention, and the trade-offs between automatic and manual failover. Next, we'll tackle replication lag—the gap between leader and followers that affects read consistency.