December 2012: GitHub experienced a major outage that lasted hours. The root cause was split-brain in their MySQL infrastructure. During a network partition, both the primary and the intended standby believed they were the authoritative primary. Both accepted writes. When the partition healed, GitHub faced an impossible reconciliation: two divergent histories of the database with conflicting changes. Some data was permanently lost. The recovery took days of painstaking work.
This was not an isolated incident. Split-brain has struck organizations of every size—from startups to tech giants. It's been responsible for data loss at major banks, corruption at healthcare systems, and outages at critical infrastructure providers. Split-brain is not a theoretical concern; it's one of the most dangerous failure modes in distributed systems.
This page arms you with the comprehensive knowledge to understand, detect, and prevent split-brain in your systems. When the network partitions and failover triggers incorrectly, these techniques ensure your data survives intact.
By the end of this page, you will deeply understand: what split-brain is and how it occurs, why it's so dangerous compared to other failure modes, fencing and STONITH mechanisms, quorum-based prevention, consensus protocols that eliminate split-brain by design, and practical prevention strategies for different system architectures.
Split-brain occurs when a distributed system divides into two or more partitions that cannot communicate, and multiple partitions each believe they are the authoritative source of truth. In primary-standby architectures, it means having two nodes simultaneously acting as the primary, both accepting writes.
Why Does Split-Brain Happen?
Split-brain typically emerges from the combination of:
Network partition — Communication between nodes is interrupted (switch failure, cable cut, firewall misconfiguration, cloud provider networking issue)
Failover trigger — The standby node cannot reach the primary and (correctly, from its limited perspective) concludes the primary is dead
Both nodes operational — Unlike actual primary failure, the original primary is still running and accepting writes
The result: two nodes, isolated from each other, each convinced they are the rightful primary. Both accept client connections. Both write data. Neither knows about the other's operations.
The Anatomy of Split-Brain:
```
        Before Partition                        During Partition

 ┌─────────────────────────┐     ┌──────────────────────────────────────┐
 │      ┌─────────┐        │     │   ┌─────────┐    ╳    ┌─────────┐    │
 │      │ Primary │◄─writes│     │   │ Node A  │    ╳    │ Node B  │    │
 │      └────┬────┘        │     │   │(Primary)│    ╳    │(Primary)│    │
 │           │ sync        │     │   └────┬────┘    ╳    └────┬────┘    │
 │           ▼ replication │     │        │     partition     │         │
 │      ┌─────────┐        │     │   Clients A  (nodes can't  Clients B │
 │      │ Standby │        │     │   writing    communicate)  writing   │
 │      └─────────┘        │     │   to Node A                to Node B │
 └─────────────────────────┘     └──────────────────────────────────────┘
```
Clients on either side of the partition see a working primary. They continue operating, unaware that another primary exists. The data diverges.
When the partition heals, you face the nightmare scenario: two divergent datasets with conflicting writes. Row ID 12345 has balance $500 on Node A and $700 on Node B. Which is correct? Both writes were valid when made. There's no automatic resolution. Manual reconciliation is required, and some data may be permanently lost.
Split-brain isn't just another failure mode—it's qualitatively different from other failures because it can cause permanent, irrecoverable data corruption.
Comparison with Other Failure Modes:
| Failure Mode | Impact | Recoverable? | Automatic Fix? |
|---|---|---|---|
| Primary crashes | Downtime until failover | Yes | Yes |
| Standby crashes | Reduced redundancy | Yes | Yes |
| Network packet loss | Retries, latency | Yes | Yes |
| Primary slow/degraded | Latency increase | Yes | Yes |
| Split-brain | Data corruption/loss | Maybe partially | NO |
The Permanence Problem:
Most failures are transient—systems recover, retries succeed, redundancy absorbs the impact. But split-brain creates a lasting mess:
1. Transaction Conflicts
Two users both successfully buy the last item in inventory, one on each primary. Both orders are confirmed; at most one can be fulfilled.
2. Uniqueness Violations
Both primaries hand out the same auto-increment IDs or unique values to different rows, so unique constraints are violated as soon as the datasets are merged.
3. Cascading Inconsistencies
In a financial system, the same account carries a different balance on each side, and downstream operations (transfers, interest, statements) build on the wrong values, compounding the divergence.
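To make the divergence concrete, here is a toy TypeScript sketch (hypothetical account data, not from any real system) of how the $500-versus-$700 conflict described earlier arises: each side applies its own perfectly valid writes to the same starting state.

```typescript
type Ledger = Map<string, number>;

// Apply a list of [account, delta] writes to a copy of the ledger.
function applyWrites(ledger: Ledger, writes: Array<[string, number]>): Ledger {
  const copy = new Map(ledger);
  for (const [account, delta] of writes) {
    copy.set(account, (copy.get(account) ?? 0) + delta);
  }
  return copy;
}

// Shared state before the partition: account 12345 holds $500.
const beforePartition: Ledger = new Map([['acct-12345', 500]]);

// During the partition, only Node B sees a $200 deposit.
const nodeA = applyWrites(beforePartition, []);
const depositOnB: Array<[string, number]> = [['acct-12345', 200]];
const nodeB = applyWrites(beforePartition, depositOnB);

// After healing: $500 on A, $700 on B. Both histories are internally
// valid; no algorithm can pick a winner without human judgment.
console.log(nodeA.get('acct-12345'), nodeB.get('acct-12345')); // 500 700
```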
Even when data can be recovered, the cost is enormous: engineering time for reconciliation (often weeks), customer trust erosion, potential regulatory fines, and the opportunity cost of incident response instead of product development. Prevention is orders of magnitude cheaper than recovery.
Fencing is the first line of defense against split-brain. The concept is brutally simple: before the standby can become primary, it must ensure the old primary is completely stopped—even if that means forcibly killing it.
The colorful acronym STONITH (Shoot The Other Node In The Head) captures this approach. If you can't confirm the old primary is dead, you make it dead.
Why Fencing Is Necessary:
Without fencing, a simple race condition can cause split-brain: the standby's health checks time out (perhaps because the primary is stalled in a garbage-collection pause or briefly partitioned), the standby promotes itself, and moments later the old primary resumes, never knowing a failover happened. Now two primaries accept writes.
Fencing breaks this race by ensuring the old primary cannot continue operating even if it comes back online.
```typescript
interface FencingDevice {
  name: string;
  type: 'power' | 'network' | 'storage' | 'application';
  fence(nodeId: string): Promise<FenceResult>;
  verify(nodeId: string): Promise<boolean>; // Confirm node is fenced
}

interface FenceResult {
  success: boolean;
  method: string;
  verificationConfirmed: boolean;
  timestamp: Date;
}

interface FencingResult {
  success: boolean;
  fencedBy: string | null;
  fenceType: string | null;
  attemptedMethods: string[];
  timestamp: Date;
  error?: string;
}

class FencingCoordinator {
  private devices: FencingDevice[] = [];

  constructor(
    devices: FencingDevice[],
    private requireVerification: boolean = true,
    private retryCount: number = 3,
    private retryDelayMs: number = 5000
  ) {
    // Order devices by reliability (power > storage > network > application)
    this.devices = devices.sort(
      (a, b) => this.getReliabilityScore(b.type) - this.getReliabilityScore(a.type)
    );
  }

  private getReliabilityScore(type: string): number {
    switch (type) {
      case 'power': return 100;      // Most reliable
      case 'storage': return 80;
      case 'network': return 60;
      case 'application': return 40; // Least reliable
      default: return 0;
    }
  }

  /**
   * Fence a node before promoting the standby.
   * Tries multiple fencing devices until one succeeds with verification.
   * Returns success only when we're certain the node is fenced.
   */
  async fenceNode(nodeId: string): Promise<FencingResult> {
    console.log(`Attempting to fence node ${nodeId}`);
    const attemptedMethods: string[] = [];

    for (const device of this.devices) {
      for (let attempt = 0; attempt < this.retryCount; attempt++) {
        try {
          console.log(`Fencing attempt ${attempt + 1} using ${device.name}`);
          const result = await device.fence(nodeId);
          attemptedMethods.push(`${device.name}:attempt${attempt + 1}`);

          if (result.success) {
            // Verify the fence worked before trusting it
            if (!this.requireVerification || await this.verifyFence(device, nodeId)) {
              return {
                success: true,
                fencedBy: device.name,
                fenceType: device.type,
                attemptedMethods,
                timestamp: new Date(),
              };
            }
            console.log(`Fence verification failed: ${device.name}`);
          } else {
            console.log(`Fence operation failed: ${device.name}`);
          }
        } catch (error) {
          console.error(`Fencing error with ${device.name}: ${error}`);
          attemptedMethods.push(`${device.name}:error`);
        }

        // Wait before retrying
        await this.sleep(this.retryDelayMs);
      }
    }

    // All methods exhausted - this is serious
    return {
      success: false,
      fencedBy: null,
      fenceType: null,
      attemptedMethods,
      timestamp: new Date(),
      error: 'All fencing methods exhausted. DO NOT PROCEED WITH FAILOVER.',
    };
  }

  private async verifyFence(device: FencingDevice, nodeId: string): Promise<boolean> {
    // Give the fence a moment to take effect, then check several times
    await this.sleep(2000);
    for (let i = 0; i < 3; i++) {
      if (await device.verify(nodeId)) return true;
      await this.sleep(1000);
    }
    return false;
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage in a failover controller. PowerFencer, StorageFencer, NetworkFencer,
// Node, and promoteStandby are application-specific implementations.
async function executeFailover(standby: Node, oldPrimary: Node): Promise<void> {
  const fencer = new FencingCoordinator([
    new PowerFencer('ipmi-controller'),
    new StorageFencer('san-controller'),
    new NetworkFencer('firewall-api'),
  ]);

  // CRITICAL: Fence before promotion
  const fenceResult = await fencer.fenceNode(oldPrimary.id);

  if (!fenceResult.success) {
    // Alert operators for manual intervention
    throw new Error(`Cannot proceed with failover: fencing failed. ${fenceResult.error}`);
  }

  console.log(`Old primary fenced via ${fenceResult.fencedBy}`);

  // Now safe to proceed with promotion
  await promoteStandby(standby);
}
```

This is the cardinal rule: if you cannot confirm the old primary is fenced, you must NOT promote the standby. The downtime from aborting failover is recoverable. The data corruption from split-brain may not be. Always prefer extended downtime over possible split-brain.
Quorum is a voting-based approach to split-brain prevention. The idea is simple: a node can only become primary if it can communicate with a majority of the cluster. In a 3-node cluster, you need 2 nodes. In a 5-node cluster, you need 3.
Why Quorum Works:
Mathematically, if a partition occurs, at most one side can have a majority: two disjoint groups of nodes cannot both contain more than half of the cluster's N members.
The minority side(s) know they lack quorum and refuse to become primary. Even if they can't reach the current primary, they won't promote themselves.
Quorum Calculation:
Quorum = floor(N/2) + 1
Cluster Size: 3 → Quorum: 2
Cluster Size: 5 → Quorum: 3
Cluster Size: 7 → Quorum: 4
Note: Always use odd cluster sizes. With even sizes, an even split (e.g., 2-2) leaves no majority and the entire cluster deadlocks.
| Cluster Size | Quorum | Max Failures Tolerated | Notes |
|---|---|---|---|
| 2 | 2 | 0 | No fault tolerance! Avoid. |
| 3 | 2 | 1 | Minimum for fault tolerance |
| 5 | 3 | 2 | Good balance, common choice |
| 7 | 4 | 3 | High availability, more complexity |
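The formula and table above reduce to a one-liner in code; computing a few sizes also shows why even-sized clusters waste a node:

```typescript
// Quorum = floor(N/2) + 1
function quorum(clusterSize: number): number {
  return Math.floor(clusterSize / 2) + 1;
}

// Failures tolerated while still retaining quorum: N - quorum(N)
function maxFailures(clusterSize: number): number {
  return clusterSize - quorum(clusterSize);
}

console.log(quorum(3), maxFailures(3)); // 2 1
console.log(quorum(4), maxFailures(4)); // 3 1  (4 nodes tolerate only 1 failure, same as 3)
console.log(quorum(5), maxFailures(5)); // 3 2
```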
Implementing Quorum:
Witness/Tie-Breaker Nodes:
For systems where having multiple data-holding nodes is expensive, you can use lightweight witness nodes that participate in quorum but don't hold data:
Quorum requires 2 of 3. If A (the primary) fails, B and the witness C together hold quorum and can promote B. If instead B is partitioned away, A and C retain quorum and A remains primary, while B, unable to reach a majority, refuses to promote itself.
Quorum in Cloud Environments:
Cloud providers offer quorum-style coordination as managed services: managed consensus stores (hosted ZooKeeper or etcd offerings) and managed databases with built-in high availability (for example, multi-AZ database deployments) handle leader election and failover internally.
These managed services handle the complexity of maintaining quorum across availability zones.
```typescript
interface ClusterNode {
  id: string;
  type: 'data' | 'witness';
  weight: number; // Some nodes may have more voting power
  isReachable: () => Promise<boolean>;
}

interface QuorumResult {
  hasQuorum: boolean;
  reachableWeight: number;
  requiredWeight: number;
  reachableNodes: string[];
  unreachableNodes: string[];
}

class QuorumChecker {
  private nodes: ClusterNode[];
  private totalWeight: number;
  private quorumWeight: number;

  constructor(nodes: ClusterNode[]) {
    this.nodes = nodes;
    this.totalWeight = nodes.reduce((sum, n) => sum + n.weight, 0);
    this.quorumWeight = Math.floor(this.totalWeight / 2) + 1;
    console.log(
      `Cluster initialized: ${nodes.length} nodes, total weight ${this.totalWeight}, quorum ${this.quorumWeight}`
    );
  }

  async checkQuorum(): Promise<QuorumResult> {
    const reachabilityResults = await Promise.all(
      this.nodes.map(async (node) => ({
        node,
        reachable: await this.checkNodeWithTimeout(node, 5000),
      }))
    );

    const reachableNodes = reachabilityResults
      .filter(r => r.reachable)
      .map(r => r.node);

    const reachableWeight = reachableNodes.reduce((sum, n) => sum + n.weight, 0);

    return {
      hasQuorum: reachableWeight >= this.quorumWeight,
      reachableWeight,
      requiredWeight: this.quorumWeight,
      reachableNodes: reachableNodes.map(n => n.id),
      unreachableNodes: reachabilityResults
        .filter(r => !r.reachable)
        .map(r => r.node.id),
    };
  }

  private async checkNodeWithTimeout(node: ClusterNode, timeoutMs: number): Promise<boolean> {
    return Promise.race([
      node.isReachable(),
      new Promise<boolean>(resolve =>
        setTimeout(() => resolve(false), timeoutMs)
      ),
    ]);
  }
}

class QuorumAwareFailoverController {
  constructor(
    private quorumChecker: QuorumChecker,
    private fencer: FencingCoordinator,
    private localNode: ClusterNode
  ) {}

  async canBecomePrimary(): Promise<{ allowed: boolean; reason: string }> {
    // Step 1: Do we have quorum?
    const quorum = await this.quorumChecker.checkQuorum();
    if (!quorum.hasQuorum) {
      return {
        allowed: false,
        reason: `Insufficient quorum: ${quorum.reachableWeight}/${quorum.requiredWeight} weight. Unreachable: ${quorum.unreachableNodes.join(', ')}`,
      };
    }

    // Step 2: Is there another primary we need to fence?
    const currentPrimary = await this.discoverCurrentPrimary();
    if (currentPrimary && currentPrimary.id !== this.localNode.id) {
      // Try to fence the current primary
      const fenceResult = await this.fencer.fenceNode(currentPrimary.id);
      if (!fenceResult.success) {
        return {
          allowed: false,
          reason: `Cannot fence current primary ${currentPrimary.id}: ${fenceResult.error}`,
        };
      }
    }

    // Step 3: Verify quorum still holds after fencing
    const postFenceQuorum = await this.quorumChecker.checkQuorum();
    if (!postFenceQuorum.hasQuorum) {
      return { allowed: false, reason: 'Lost quorum after fencing attempt' };
    }

    return {
      allowed: true,
      reason: `Quorum confirmed (${postFenceQuorum.reachableWeight}/${postFenceQuorum.requiredWeight}), ready for promotion`,
    };
  }

  // Cluster-specific: consult shared metadata or peers to find the current primary
  private async discoverCurrentPrimary(): Promise<ClusterNode | null> {
    return null;
  }
}
```

Quorum prevents the minority partition from promoting, but it doesn't prevent the current primary from continuing to operate in that minority. If the primary is on the minority side, it should step down. This requires self-awareness: the primary must continuously verify it has quorum, not just the standby.
The most robust defense against split-brain is using systems built on consensus protocols like Raft, Paxos, or Zab. These protocols mathematically guarantee that only one leader can exist at any time, making split-brain impossible by design.
How Consensus Protocols Prevent Split-Brain:
1. Leader Election Requires Majority
A node can only become leader if it receives votes from a majority of nodes. In a 5-node cluster, that means 3 nodes must agree. During a partition, only one side can have 3+ nodes.
2. Log Replication Requires Majority Acknowledgment
Writes are only considered committed when a majority of nodes acknowledge them. A minority-partition 'leader' (if one somehow existed) couldn't commit any writes.
3. Term Numbers Prevent Stale Leaders
Each leader election increments a term number. If a stale leader from a previous term receives a message with a higher term, it immediately steps down. This handles network partitions that heal: the old leader learns it's obsolete.
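The term rule can be sketched in a few lines (a simplified illustration, not taken from any real Raft implementation):

```typescript
type Role = 'leader' | 'follower' | 'candidate';

interface NodeState {
  currentTerm: number;
  role: Role;
}

// Raft's stale-leader rule: any message carrying a higher term forces
// the receiver to adopt that term and revert to follower.
function onTermObserved(state: NodeState, observedTerm: number): NodeState {
  if (observedTerm > state.currentTerm) {
    return { currentTerm: observedTerm, role: 'follower' };
  }
  return state;
}

// A leader from term 3 hears about term 5 after a partition heals:
const stale: NodeState = { currentTerm: 3, role: 'leader' };
const after = onTermObserved(stale, 5);
console.log(after); // { currentTerm: 5, role: 'follower' }
```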
| Protocol | Used By | Strengths | Complexity |
|---|---|---|---|
| Raft | etcd, Consul, CockroachDB | Understandable, well-documented | Medium |
| Paxos | Google Spanner, Chubby | Proven, flexible | High |
| Zab | ZooKeeper | Optimized for primary-backup | Medium |
| Viewstamped Replication | Academic systems | Simple variant of Paxos | Medium |
Raft Leader Election (Simplified):
1. Normal Operation:
- Leader sends heartbeats to followers every 150ms
- Followers reset their election timeout on heartbeat
2. Leader Failure Detected:
- Follower's election timeout expires (no heartbeat received)
- Follower becomes Candidate
- Increments term number
- Votes for itself
- Requests votes from other nodes
3. Election Resolution:
- If candidate receives votes from majority → becomes Leader
- If candidate receives heartbeat from valid leader → reverts to Follower
- If election times out with no winner → start new election with higher term
4. Split-Brain Protection:
- Only one candidate can receive majority in any term
- Higher terms always win, forcing stale leaders to step down
- Any 'leader' without majority can't commit writes anyway
```typescript
import { Etcd3 } from 'etcd3';

/**
 * Uses etcd (which implements Raft internally) to coordinate
 * primary election and failover. This approach delegates split-brain
 * prevention to etcd's consensus protocol.
 */
class EtcdBasedFailover {
  private client: Etcd3;
  private leaseId: string | null = null;
  private isPrimary: boolean = false;

  constructor(
    private nodeId: string,
    private serviceName: string,
    private ttlSeconds: number = 10
  ) {
    this.client = new Etcd3();
  }

  async start(): Promise<void> {
    // Create a lease that expires if we stop renewing it
    const lease = this.client.lease(this.ttlSeconds);
    this.leaseId = await lease.grant();

    // If the lease is lost (e.g., we were partitioned away), step down.
    // (In production, grant a fresh lease before retrying.)
    lease.on('lost', () => {
      console.log('Lost primary lease - stepping down');
      this.isPrimary = false;
      this.attemptToBecomePrimary();
    });

    // Try to become primary
    await this.attemptToBecomePrimary();

    // Watch for changes (failover by other nodes)
    await this.watchPrimaryKey();
  }

  private async attemptToBecomePrimary(): Promise<void> {
    const primaryKey = `/services/${this.serviceName}/primary`;
    const myValue = JSON.stringify({
      nodeId: this.nodeId,
      timestamp: Date.now(),
    });

    // Try to create the key only if it doesn't exist yet.
    // The transaction is atomic - only one node can succeed.
    const result = await this.client
      .if(primaryKey, 'Create', '==', 0)
      .then(this.client.put(primaryKey).value(myValue).lease(this.leaseId!))
      .commit();

    if (result.succeeded) {
      console.log('Successfully became primary');
      this.isPrimary = true;
      this.onBecomePrimary();
    } else {
      // Key already exists - someone else is primary
      console.log('Another node is primary, staying as standby');
      this.isPrimary = false;
    }
  }

  private async watchPrimaryKey(): Promise<void> {
    const primaryKey = `/services/${this.serviceName}/primary`;
    const watcher = await this.client.watch().key(primaryKey).create();

    watcher.on('delete', async () => {
      console.log('Primary key deleted - attempting to take over');
      await this.attemptToBecomePrimary();
    });

    watcher.on('put', (kv) => {
      const value = JSON.parse(kv.value.toString());
      if (value.nodeId !== this.nodeId) {
        console.log(`Node ${value.nodeId} is now primary`);
        this.isPrimary = false;
      }
    });
  }

  private onBecomePrimary(): void {
    // Application-specific logic when becoming primary:
    // start accepting writes, etc.
  }
}

// Usage
const failover = new EtcdBasedFailover('node-1', 'my-database', 10);
await failover.start();
```

If your system doesn't natively use consensus (e.g., PostgreSQL, MySQL), you can add an external consensus layer (etcd, ZooKeeper, Consul) for failover coordination. The data system still uses primary-standby replication, but leader election is managed by the consensus cluster. This is how Patroni provides HA for PostgreSQL.
Let's translate theory into practice with concrete strategies for common architectures.
Strategy 1: Defense in Depth
Don't rely on a single split-brain prevention mechanism. Layer multiple defenses: quorum to stop the minority side from promoting, fencing to stop an old primary from continuing to write, lease-based or consensus-backed leadership so stale leaders step down, and monitoring to detect multiple primaries quickly.
If any single mechanism has a bug or failure, the others provide backup.
Strategy 2: Pre-Flight Checklist
Before any failover, verify a safety checklist:
```typescript
interface SafetyCheck {
  name: string;
  check: () => Promise<boolean>;
  fatal: boolean; // If true, abort failover on failure
}

// Assumes quorumChecker, fencer, oldPrimary, standby, pingNode,
// getLastFailoverTime, and MAX_ACCEPTABLE_LAG_BYTES are provided by
// the surrounding failover controller.
const preFailoverChecklist: SafetyCheck[] = [
  {
    name: 'Quorum check',
    check: async () => (await quorumChecker.checkQuorum()).hasQuorum,
    fatal: true,
  },
  {
    name: 'Old primary reachability',
    check: async () => {
      // If reachable, we need to fence it
      const reachable = await pingNode(oldPrimary);
      if (reachable) {
        return (await fencer.fenceNode(oldPrimary.id)).success;
      }
      return true; // Not reachable = safe to proceed
    },
    fatal: true,
  },
  {
    name: 'Pending transaction check',
    check: async () => {
      // Are there in-flight transactions that would be lost?
      const pending = await standby.getPendingReplicatedTransactions();
      return pending.length === 0;
    },
    fatal: false, // Log a warning but allow failover
  },
  {
    name: 'Replication lag check',
    check: async () => {
      const lagBytes = await standby.getReplicationLag();
      return lagBytes < MAX_ACCEPTABLE_LAG_BYTES;
    },
    fatal: true, // Too much lag = data loss risk
  },
  {
    name: 'Recent successful failover check',
    check: async () => {
      // Prevent flapping - ensure no recent failover
      const lastFailover = await getLastFailoverTime();
      const minInterval = 10 * 60 * 1000; // 10 minutes
      return Date.now() - lastFailover > minInterval;
    },
    fatal: true,
  },
];

async function runPreFailoverChecks(): Promise<{ safe: boolean; issues: string[] }> {
  const issues: string[] = [];
  let safe = true;

  for (const check of preFailoverChecklist) {
    try {
      const passed = await check.check();
      if (!passed) {
        issues.push(`FAILED: ${check.name}`);
        if (check.fatal) safe = false;
      }
    } catch (error) {
      issues.push(`ERROR in ${check.name}: ${(error as Error).message}`);
      if (check.fatal) safe = false;
    }
  }

  return { safe, issues };
}
```

Regularly test your split-brain prevention by injecting network partitions (using iptables, tc, or chaos engineering tools). Verify that: 1) the minority side refuses to promote, 2) the majority side can still function, 3) healing the partition doesn't cause conflicts. If you haven't tested it, assume it doesn't work.
Despite prevention efforts, split-brain can still occur. Recognizing it quickly and recovering with minimal damage requires preparation.
Detecting Split-Brain:
1. Multiple Primary Detection
Monitor for the impossible condition: two nodes claiming to be primary.
```sql
-- PostgreSQL: run on every node; more than one node returning FALSE
-- (i.e., not in recovery) indicates split-brain
SELECT pg_is_in_recovery();

-- MySQL: run on every node; more than one node with read_only = OFF
-- indicates split-brain
SHOW GLOBAL VARIABLES LIKE 'read_only';
```
2. Sequence/LSN Divergence
Compare replication positions. If both nodes have advanced beyond their last common point, writes occurred independently.
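A sketch of that check (the function and its arguments are hypothetical; in PostgreSQL you would compare WAL LSNs, in MySQL GTID sets):

```typescript
// Each node reports how far its log has advanced. If BOTH nodes moved
// past the last position they are known to have shared, writes were
// accepted independently on both sides: split-brain.
function hasDiverged(
  lastCommonPosition: bigint,
  positionA: bigint,
  positionB: bigint
): boolean {
  return positionA > lastCommonPosition && positionB > lastCommonPosition;
}

console.log(hasDiverged(1000n, 1500n, 1000n)); // false: only A advanced (normal failover)
console.log(hasDiverged(1000n, 1500n, 1200n)); // true: both advanced independently
```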
3. Write Detection on Both Sides
Monitor write metrics from both nodes. If both show non-zero writes during the same period, split-brain is occurring.
Recovery Procedure:
Recovery from split-brain is painful but methodical:
Step 1: Stop the Bleeding
Immediately stop writes to both primaries. The longer they diverge, the harder reconciliation becomes.
Step 2: Choose the Winner
Decide which node's data is authoritative. Criteria: which side held quorum, which node received more (or more business-critical) writes, which dataset is internally consistent, and which choice minimizes customer impact.
Step 3: Export Loser's Delta
From the 'loser' node, export all writes that occurred since the divergence point. These need manual review.
Step 4: Rebuild from Winner
Reinitialize the loser node as a replica of the winner. This may require a full base backup.
Step 5: Reconcile Lost Writes
Manually review exported writes from the loser. For each: re-apply it if it doesn't conflict, merge it if a sensible merge exists, or escalate to the affected customer or team when the conflict cannot be resolved mechanically.
Step 6: Post-Mortem and Prevention
Conduct thorough analysis: how did split-brain occur? What prevention mechanism failed? What needs to change?
Be deeply skeptical of any system claiming automatic split-brain recovery. True split-brain means conflicting writes—there's no algorithm that can automatically resolve 'User A set balance to $100' vs 'User B set balance to $200' on the same account. Human judgment is required. The goal is minimizing damage, not eliminating it.
Split-brain is the nightmare scenario of high-availability systems: a failure mode that can cause permanent, irrecoverable data corruption. Unlike other failures that systems can heal from automatically, split-brain demands prevention rather than recovery. The essentials: split-brain requires both a network partition and an incorrect failover trigger; fencing (STONITH) guarantees the old primary is stopped before promotion; quorum ensures at most one partition can elect a primary; consensus protocols like Raft eliminate split-brain by design; and layered defenses plus regular partition testing keep every mechanism honest.
What's Next:
With split-brain prevention mastered, we turn to the final phase of failover: failback procedures. When the original primary is restored, how do we safely transition back? The next page explores the complexities of returning to the original configuration after a failover event.
You now understand why split-brain is uniquely dangerous, can implement fencing and quorum mechanisms, know when to use consensus protocols, and have strategies for detection and recovery. Next: Failback Procedures.