December 2012: GitHub experienced a major outage that lasted hours. The root cause was split-brain in their MySQL infrastructure. During a network partition, both the primary and the intended standby believed they were the authoritative primary. Both accepted writes. When the partition healed, GitHub faced an impossible reconciliation: two divergent histories of the database with conflicting changes. Some data was permanently lost. The recovery took days of painstaking work.
This was not an isolated incident. Split-brain has struck organizations of every size—from startups to tech giants. It's been responsible for data loss at major banks, corruption at healthcare systems, and outages at critical infrastructure providers. Split-brain is not a theoretical concern; it's one of the most dangerous failure modes in distributed systems.
This page arms you with the comprehensive knowledge to understand, detect, and prevent split-brain in your systems. When the network partitions and failover triggers incorrectly, these techniques ensure your data survives intact.
By the end of this page, you will deeply understand: what split-brain is and how it occurs, why it's so dangerous compared to other failure modes, fencing and STONITH mechanisms, quorum-based prevention, consensus protocols that eliminate split-brain by design, and practical prevention strategies for different system architectures.
Split-brain occurs when a distributed system divides into two or more partitions that cannot communicate, and multiple partitions each believe they are the authoritative source of truth. In primary-standby architectures, it means having two nodes simultaneously acting as the primary, both accepting writes.
Why Does Split-Brain Happen?
Split-brain typically emerges from the combination of:
Network partition — Communication between nodes is interrupted (switch failure, cable cut, firewall misconfiguration, cloud provider networking issue)
Failover trigger — The standby node cannot reach the primary and (correctly, from its limited perspective) concludes the primary is dead
Both nodes operational — Unlike actual primary failure, the original primary is still running and accepting writes
The result: two nodes, isolated from each other, each convinced they are the rightful primary. Both accept client connections. Both write data. Neither knows about the other's operations.
The Anatomy of Split-Brain:
```
        Before Partition                        During Partition

 ┌─────────────────────────┐     ┌──────────────────────────────────────┐
 │      ┌─────────┐        │     │   ┌─────────┐    ╳    ┌─────────┐    │
 │      │ Primary │◄─writes│     │   │ Node A  │    ╳    │ Node B  │    │
 │      └────┬────┘        │     │   │(Primary)│    ╳    │(Primary)│    │
 │           │ sync        │     │   └────┬────┘    ╳    └────┬────┘    │
 │           ▼ replication │     │        │     partition     │         │
 │      ┌─────────┐        │     │   Clients A  (nodes can't  Clients B │
 │      │ Standby │        │     │   writing    communicate)  writing   │
 │      └─────────┘        │     │   to Node A                to Node B │
 └─────────────────────────┘     └──────────────────────────────────────┘
```
Clients on either side of the partition see a working primary. They continue operating, unaware that another primary exists. The data diverges.
When the partition heals, you face the nightmare scenario: two divergent datasets with conflicting writes. Row ID 12345 has balance $500 on Node A and $700 on Node B. Which is correct? Both writes were valid when made. There's no automatic resolution. Manual reconciliation is required, and some data may be permanently lost.
Split-brain isn't just another failure mode—it's qualitatively different from other failures because it can cause permanent, irrecoverable data corruption.
Comparison with Other Failure Modes:
| Failure Mode | Impact | Recoverable? | Automatic Fix? |
|---|---|---|---|
| Primary crashes | Downtime until failover | Yes | Yes |
| Standby crashes | Reduced redundancy | Yes | Yes |
| Network packet loss | Retries, latency | Yes | Yes |
| Primary slow/degraded | Latency increase | Yes | Yes |
| Split-brain | Data corruption/loss | Maybe partially | NO |
The Permanence Problem:
Most failures are transient—systems recover, retries succeed, redundancy absorbs the impact. But split-brain creates a lasting mess:
1. Transaction Conflicts
Two users both successfully buy the last item in inventory, one on each primary. Both orders are confirmed; at most one can be fulfilled.
2. Uniqueness Violations
Both primaries hand out the same auto-increment IDs or unique values to different rows, so unique constraints are violated as soon as the datasets are merged.
3. Cascading Inconsistencies
In a financial system, the same account carries a different balance on each side, and downstream operations (transfers, interest, statements) build on the wrong values, compounding the divergence.
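To make the divergence concrete, here is a toy TypeScript sketch (hypothetical account data, not from any real system) of how the $500-versus-$700 conflict described earlier arises: each side applies its own perfectly valid writes to the same starting state.

```typescript
type Ledger = Map<string, number>;

// Apply a list of [account, delta] writes to a copy of the ledger.
function applyWrites(ledger: Ledger, writes: Array<[string, number]>): Ledger {
  const copy = new Map(ledger);
  for (const [account, delta] of writes) {
    copy.set(account, (copy.get(account) ?? 0) + delta);
  }
  return copy;
}

// Shared state before the partition: account 12345 holds $500.
const beforePartition: Ledger = new Map([['acct-12345', 500]]);

// During the partition, only Node B sees a $200 deposit.
const nodeA = applyWrites(beforePartition, []);
const depositOnB: Array<[string, number]> = [['acct-12345', 200]];
const nodeB = applyWrites(beforePartition, depositOnB);

// After healing: $500 on A, $700 on B. Both histories are internally
// valid; no algorithm can pick a winner without human judgment.
console.log(nodeA.get('acct-12345'), nodeB.get('acct-12345')); // 500 700
```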
Even when data can be recovered, the cost is enormous: engineering time for reconciliation (often weeks), customer trust erosion, potential regulatory fines, and the opportunity cost of incident response instead of product development. Prevention is orders of magnitude cheaper than recovery.
Fencing is the first line of defense against split-brain. The concept is brutally simple: before the standby can become primary, it must ensure the old primary is completely stopped—even if that means forcibly killing it.
The colorful acronym STONITH (Shoot The Other Node In The Head) captures this approach. If you can't confirm the old primary is dead, you make it dead.
Why Fencing Is Necessary:
Without fencing, a simple race condition can cause split-brain: the standby's health checks time out (perhaps because the primary is stalled in a garbage-collection pause or briefly partitioned), the standby promotes itself, and moments later the old primary resumes, never knowing a failover happened. Now two primaries accept writes.
Fencing breaks this race by ensuring the old primary cannot continue operating even if it comes back online.
```typescript
interface FencingDevice {
  name: string;
  type: 'power' | 'network' | 'storage' | 'application';
  fence(nodeId: string): Promise<FenceResult>;
  verify(nodeId: string): Promise<boolean>; // Confirm node is fenced
}

interface FenceResult {
  success: boolean;
  method: string;
  verificationConfirmed: boolean;
  timestamp: Date;
}

interface FencingResult {
  success: boolean;
  fencedBy: string | null;
  fenceType: string | null;
  attemptedMethods: string[];
  timestamp: Date;
  error?: string;
}

class FencingCoordinator {
  private devices: FencingDevice[] = [];

  constructor(
    devices: FencingDevice[],
    private requireVerification: boolean = true,
    private retryCount: number = 3,
    private retryDelayMs: number = 5000
  ) {
    // Order devices by reliability (power > storage > network > application)
    this.devices = devices.sort(
      (a, b) => this.getReliabilityScore(b.type) - this.getReliabilityScore(a.type)
    );
  }

  private getReliabilityScore(type: string): number {
    switch (type) {
      case 'power': return 100;      // Most reliable
      case 'storage': return 80;
      case 'network': return 60;
      case 'application': return 40; // Least reliable
      default: return 0;
    }
  }

  /**
   * Fence a node before promoting the standby.
   * Tries multiple fencing devices until one succeeds with verification.
   * Returns success only when we're certain the node is fenced.
   */
  async fenceNode(nodeId: string): Promise<FencingResult> {
    console.log(`Attempting to fence node ${nodeId}`);
    const attemptedMethods: string[] = [];

    for (const device of this.devices) {
      for (let attempt = 0; attempt < this.retryCount; attempt++) {
        try {
          console.log(`Fencing attempt ${attempt + 1} using ${device.name}`);
          const result = await device.fence(nodeId);
          attemptedMethods.push(`${device.name}:attempt${attempt + 1}`);

          if (result.success) {
            // Verify the fence worked before trusting it
            if (!this.requireVerification || await this.verifyFence(device, nodeId)) {
              return {
                success: true,
                fencedBy: device.name,
                fenceType: device.type,
                attemptedMethods,
                timestamp: new Date(),
              };
            }
            console.log(`Fence verification failed: ${device.name}`);
          } else {
            console.log(`Fence operation failed: ${device.name}`);
          }
        } catch (error) {
          console.error(`Fencing error with ${device.name}: ${error}`);
          attemptedMethods.push(`${device.name}:error`);
        }

        // Wait before retrying
        await this.sleep(this.retryDelayMs);
      }
    }

    // All methods exhausted - this is serious
    return {
      success: false,
      fencedBy: null,
      fenceType: null,
      attemptedMethods,
      timestamp: new Date(),
      error: 'All fencing methods exhausted. DO NOT PROCEED WITH FAILOVER.',
    };
  }

  private async verifyFence(device: FencingDevice, nodeId: string): Promise<boolean> {
    // Give the fence a moment to take effect, then check several times
    await this.sleep(2000);
    for (let i = 0; i < 3; i++) {
      if (await device.verify(nodeId)) return true;
      await this.sleep(1000);
    }
    return false;
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage in a failover controller. PowerFencer, StorageFencer, NetworkFencer,
// Node, and promoteStandby are application-specific implementations.
async function executeFailover(standby: Node, oldPrimary: Node): Promise<void> {
  const fencer = new FencingCoordinator([
    new PowerFencer('ipmi-controller'),
    new StorageFencer('san-controller'),
    new NetworkFencer('firewall-api'),
  ]);

  // CRITICAL: Fence before promotion
  const fenceResult = await fencer.fenceNode(oldPrimary.id);

  if (!fenceResult.success) {
    // Alert operators for manual intervention
    throw new Error(`Cannot proceed with failover: fencing failed. ${fenceResult.error}`);
  }

  console.log(`Old primary fenced via ${fenceResult.fencedBy}`);

  // Now safe to proceed with promotion
  await promoteStandby(standby);
}
```

This is the cardinal rule: if you cannot confirm the old primary is fenced, you must NOT promote the standby. The downtime from aborting failover is recoverable. The data corruption from split-brain may not be. Always prefer extended downtime over possible split-brain.
Quorum is a voting-based approach to split-brain prevention. The idea is simple: a node can only become primary if it can communicate with a majority of the cluster. In a 3-node cluster, you need 2 nodes. In a 5-node cluster, you need 3.
Why Quorum Works:
Mathematically, if a partition occurs, at most one side can have a majority: two disjoint groups of nodes cannot both contain more than half of the cluster's N members.
The minority side(s) know they lack quorum and refuse to become primary. Even if they can't reach the current primary, they won't promote themselves.
Quorum Calculation:
Quorum = floor(N/2) + 1
Cluster Size: 3 → Quorum: 2
Cluster Size: 5 → Quorum: 3
Cluster Size: 7 → Quorum: 4
Note: Always use odd cluster sizes. With even sizes, an even split (e.g., 2-2) leaves no majority and the entire cluster deadlocks.
| Cluster Size | Quorum | Max Failures Tolerated | Notes |
|---|---|---|---|
| 2 | 2 | 0 | No fault tolerance! Avoid. |
| 3 | 2 | 1 | Minimum for fault tolerance |
| 5 | 3 | 2 | Good balance, common choice |
| 7 | 4 | 3 | High availability, more complexity |
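The formula and table above reduce to a one-liner in code; computing a few sizes also shows why even-sized clusters waste a node:

```typescript
// Quorum = floor(N/2) + 1
function quorum(clusterSize: number): number {
  return Math.floor(clusterSize / 2) + 1;
}

// Failures tolerated while still retaining quorum: N - quorum(N)
function maxFailures(clusterSize: number): number {
  return clusterSize - quorum(clusterSize);
}

console.log(quorum(3), maxFailures(3)); // 2 1
console.log(quorum(4), maxFailures(4)); // 3 1  (4 nodes tolerate only 1 failure, same as 3)
console.log(quorum(5), maxFailures(5)); // 3 2
```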
Implementing Quorum:
Witness/Tie-Breaker Nodes:
For systems where having multiple data-holding nodes is expensive, you can use lightweight witness nodes that participate in quorum but don't hold data:
Quorum requires 2 of 3. If A (the primary) fails, B and the witness C together hold quorum and can promote B. If instead B is partitioned away, A and C retain quorum and A remains primary, while B, unable to reach a majority, refuses to promote itself.
Quorum in Cloud Environments:
Cloud providers offer quorum-style coordination as managed services: managed consensus stores (hosted ZooKeeper or etcd offerings) and managed databases with built-in high availability (for example, multi-AZ database deployments) handle leader election and failover internally.
These managed services handle the complexity of maintaining quorum across availability zones.
```typescript
interface ClusterNode {
  id: string;
  type: 'data' | 'witness';
  weight: number; // Some nodes may have more voting power
  isReachable: () => Promise<boolean>;
}

interface QuorumResult {
  hasQuorum: boolean;
  reachableWeight: number;
  requiredWeight: number;
  reachableNodes: string[];
  unreachableNodes: string[];
}

class QuorumChecker {
  private nodes: ClusterNode[];
  private totalWeight: number;
  private quorumWeight: number;

  constructor(nodes: ClusterNode[]) {
    this.nodes = nodes;
    this.totalWeight = nodes.reduce((sum, n) => sum + n.weight, 0);
    this.quorumWeight = Math.floor(this.totalWeight / 2) + 1;
    console.log(
      `Cluster initialized: ${nodes.length} nodes, total weight ${this.totalWeight}, quorum ${this.quorumWeight}`
    );
  }

  async checkQuorum(): Promise<QuorumResult> {
    const reachabilityResults = await Promise.all(
      this.nodes.map(async (node) => ({
        node,
        reachable: await this.checkNodeWithTimeout(node, 5000),
      }))
    );

    const reachableNodes = reachabilityResults
      .filter(r => r.reachable)
      .map(r => r.node);

    const reachableWeight = reachableNodes.reduce((sum, n) => sum + n.weight, 0);

    return {
      hasQuorum: reachableWeight >= this.quorumWeight,
      reachableWeight,
      requiredWeight: this.quorumWeight,
      reachableNodes: reachableNodes.map(n => n.id),
      unreachableNodes: reachabilityResults
        .filter(r => !r.reachable)
        .map(r => r.node.id),
    };
  }

  private async checkNodeWithTimeout(node: ClusterNode, timeoutMs: number): Promise<boolean> {
    return Promise.race([
      node.isReachable(),
      new Promise<boolean>(resolve =>
        setTimeout(() => resolve(false), timeoutMs)
      ),
    ]);
  }
}

class QuorumAwareFailoverController {
  constructor(
    private quorumChecker: QuorumChecker,
    private fencer: FencingCoordinator,
    private localNode: ClusterNode
  ) {}

  async canBecomePrimary(): Promise<{ allowed: boolean; reason: string }> {
    // Step 1: Do we have quorum?
    const quorum = await this.quorumChecker.checkQuorum();
    if (!quorum.hasQuorum) {
      return {
        allowed: false,
        reason: `Insufficient quorum: ${quorum.reachableWeight}/${quorum.requiredWeight} weight. Unreachable: ${quorum.unreachableNodes.join(', ')}`,
      };
    }

    // Step 2: Is there another primary we need to fence?
    const currentPrimary = await this.discoverCurrentPrimary();
    if (currentPrimary && currentPrimary.id !== this.localNode.id) {
      // Try to fence the current primary
      const fenceResult = await this.fencer.fenceNode(currentPrimary.id);
      if (!fenceResult.success) {
        return {
          allowed: false,
          reason: `Cannot fence current primary ${currentPrimary.id}: ${fenceResult.error}`,
        };
      }
    }

    // Step 3: Verify quorum still holds after fencing
    const postFenceQuorum = await this.quorumChecker.checkQuorum();
    if (!postFenceQuorum.hasQuorum) {
      return { allowed: false, reason: 'Lost quorum after fencing attempt' };
    }

    return {
      allowed: true,
      reason: `Quorum confirmed (${postFenceQuorum.reachableWeight}/${postFenceQuorum.requiredWeight}), ready for promotion`,
    };
  }

  // Cluster-specific: consult shared metadata or peers to find the current primary
  private async discoverCurrentPrimary(): Promise<ClusterNode | null> {
    return null;
  }
}
```

Quorum prevents the minority partition from promoting, but it doesn't prevent the current primary from continuing to operate in that minority. If the primary is on the minority side, it should step down. This requires self-awareness: the primary must continuously verify it has quorum, not just the standby.
The most robust defense against split-brain is using systems built on consensus protocols like Raft, Paxos, or Zab. These protocols mathematically guarantee that only one leader can exist at any time, making split-brain impossible by design.
How Consensus Protocols Prevent Split-Brain:
1. Leader Election Requires Majority
A node can only become leader if it receives votes from a majority of nodes. In a 5-node cluster, that means 3 nodes must agree. During a partition, only one side can have 3+ nodes.
2. Log Replication Requires Majority Acknowledgment
Writes are only considered committed when a majority of nodes acknowledge them. A minority-partition 'leader' (if one somehow existed) couldn't commit any writes.
3. Term Numbers Prevent Stale Leaders
Each leader election increments a term number. If a stale leader from a previous term receives a message with a higher term, it immediately steps down. This handles network partitions that heal: the old leader learns it's obsolete.
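The term rule can be sketched in a few lines (a simplified illustration, not taken from any real Raft implementation):

```typescript
type Role = 'leader' | 'follower' | 'candidate';

interface NodeState {
  currentTerm: number;
  role: Role;
}

// Raft's stale-leader rule: any message carrying a higher term forces
// the receiver to adopt that term and revert to follower.
function onTermObserved(state: NodeState, observedTerm: number): NodeState {
  if (observedTerm > state.currentTerm) {
    return { currentTerm: observedTerm, role: 'follower' };
  }
  return state;
}

// A leader from term 3 hears about term 5 after a partition heals:
const stale: NodeState = { currentTerm: 3, role: 'leader' };
const after = onTermObserved(stale, 5);
console.log(after); // { currentTerm: 5, role: 'follower' }
```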
| Protocol | Used By | Strengths | Complexity |
|---|---|---|---|
| Raft | etcd, Consul, CockroachDB | Understandable, well-documented | Medium |
| Paxos | Google Spanner, Chubby | Proven, flexible | High |
| Zab | ZooKeeper | Optimized for primary-backup | Medium |
| Viewstamped Replication | Academic systems | Simple variant of Paxos | Medium |
Raft Leader Election (Simplified):
1. Normal Operation:
- Leader sends heartbeats to followers every 150ms
- Followers reset their election timeout on heartbeat
2. Leader Failure Detected:
- Follower's election timeout expires (no heartbeat received)
- Follower becomes Candidate
- Increments term number
- Votes for itself
- Requests votes from other nodes
3. Election Resolution:
- If candidate receives votes from majority → becomes Leader
- If candidate receives heartbeat from valid leader → reverts to Follower
- If election times out with no winner → start new election with higher term
4. Split-Brain Protection:
- Only one candidate can receive majority in any term
- Higher terms always win, forcing stale leaders to step down
- Any 'leader' without majority can't commit writes anyway
```typescript
import { Etcd3 } from 'etcd3';

/**
 * Uses etcd (which implements Raft internally) to coordinate
 * primary election and failover. This approach delegates split-brain
 * prevention to etcd's consensus protocol.
 */
class EtcdBasedFailover {
  private client: Etcd3;
  private leaseId: string | null = null;
  private isPrimary: boolean = false;

  constructor(
    private nodeId: string,
    private serviceName: string,
    private ttlSeconds: number = 10
  ) {
    this.client = new Etcd3();
  }

  async start(): Promise<void> {
    // Create a lease that expires if we stop renewing it
    const lease = this.client.lease(this.ttlSeconds);
    this.leaseId = await lease.grant();

    // If the lease is lost (e.g., we were partitioned away), step down.
    // (In production, grant a fresh lease before retrying.)
    lease.on('lost', () => {
      console.log('Lost primary lease - stepping down');
      this.isPrimary = false;
      this.attemptToBecomePrimary();
    });

    // Try to become primary
    await this.attemptToBecomePrimary();

    // Watch for changes (failover by other nodes)
    await this.watchPrimaryKey();
  }

  private async attemptToBecomePrimary(): Promise<void> {
    const primaryKey = `/services/${this.serviceName}/primary`;
    const myValue = JSON.stringify({
      nodeId: this.nodeId,
      timestamp: Date.now(),
    });

    // Try to create the key only if it doesn't exist yet.
    // The transaction is atomic - only one node can succeed.
    const result = await this.client
      .if(primaryKey, 'Create', '==', 0)
      .then(this.client.put(primaryKey).value(myValue).lease(this.leaseId!))
      .commit();

    if (result.succeeded) {
      console.log('Successfully became primary');
      this.isPrimary = true;
      this.onBecomePrimary();
    } else {
      // Key already exists - someone else is primary
      console.log('Another node is primary, staying as standby');
      this.isPrimary = false;
    }
  }

  private async watchPrimaryKey(): Promise<void> {
    const primaryKey = `/services/${this.serviceName}/primary`;
    const watcher = await this.client.watch().key(primaryKey).create();

    watcher.on('delete', async () => {
      console.log('Primary key deleted - attempting to take over');
      await this.attemptToBecomePrimary();
    });

    watcher.on('put', (kv) => {
      const value = JSON.parse(kv.value.toString());
      if (value.nodeId !== this.nodeId) {
        console.log(`Node ${value.nodeId} is now primary`);
        this.isPrimary = false;
      }
    });
  }

  private onBecomePrimary(): void {
    // Application-specific logic when becoming primary:
    // start accepting writes, etc.
  }
}

// Usage
const failover = new EtcdBasedFailover('node-1', 'my-database', 10);
await failover.start();
```

If your system doesn't natively use consensus (e.g., PostgreSQL, MySQL), you can add an external consensus layer (etcd, ZooKeeper, Consul) for failover coordination. The data system still uses primary-standby replication, but leader election is managed by the consensus cluster. This is how Patroni provides HA for PostgreSQL.
Let's translate theory into practice with concrete strategies for common architectures.
Strategy 1: Defense in Depth
Don't rely on a single split-brain prevention mechanism. Layer multiple defenses: quorum to stop the minority side from promoting, fencing to stop an old primary from continuing to write, lease-based or consensus-backed leadership so stale leaders step down, and monitoring to detect multiple primaries quickly.
If any single mechanism has a bug or failure, the others provide backup.
Strategy 2: Pre-Flight Checklist
Before any failover, verify a safety checklist:
```typescript
interface SafetyCheck {
  name: string;
  check: () => Promise<boolean>;
  fatal: boolean; // If true, abort failover on failure
}

// Assumes quorumChecker, fencer, oldPrimary, standby, pingNode,
// getLastFailoverTime, and MAX_ACCEPTABLE_LAG_BYTES are provided by
// the surrounding failover controller.
const preFailoverChecklist: SafetyCheck[] = [
  {
    name: 'Quorum check',
    check: async () => (await quorumChecker.checkQuorum()).hasQuorum,
    fatal: true,
  },
  {
    name: 'Old primary reachability',
    check: async () => {
      // If reachable, we need to fence it
      const reachable = await pingNode(oldPrimary);
      if (reachable) {
        return (await fencer.fenceNode(oldPrimary.id)).success;
      }
      return true; // Not reachable = safe to proceed
    },
    fatal: true,
  },
  {
    name: 'Pending transaction check',
    check: async () => {
      // Are there in-flight transactions that would be lost?
      const pending = await standby.getPendingReplicatedTransactions();
      return pending.length === 0;
    },
    fatal: false, // Log a warning but allow failover
  },
  {
    name: 'Replication lag check',
    check: async () => {
      const lagBytes = await standby.getReplicationLag();
      return lagBytes < MAX_ACCEPTABLE_LAG_BYTES;
    },
    fatal: true, // Too much lag = data loss risk
  },
  {
    name: 'Recent successful failover check',
    check: async () => {
      // Prevent flapping - ensure no recent failover
      const lastFailover = await getLastFailoverTime();
      const minInterval = 10 * 60 * 1000; // 10 minutes
      return Date.now() - lastFailover > minInterval;
    },
    fatal: true,
  },
];

async function runPreFailoverChecks(): Promise<{ safe: boolean; issues: string[] }> {
  const issues: string[] = [];
  let safe = true;

  for (const check of preFailoverChecklist) {
    try {
      const passed = await check.check();
      if (!passed) {
        issues.push(`FAILED: ${check.name}`);
        if (check.fatal) safe = false;
      }
    } catch (error) {
      issues.push(`ERROR in ${check.name}: ${(error as Error).message}`);
      if (check.fatal) safe = false;
    }
  }

  return { safe, issues };
}
```

Regularly test your split-brain prevention by injecting network partitions (using iptables, tc, or chaos engineering tools). Verify that: 1) the minority side refuses to promote, 2) the majority side can still function, 3) healing the partition doesn't cause conflicts. If you haven't tested it, assume it doesn't work.
Despite prevention efforts, split-brain can still occur. Recognizing it quickly and recovering with minimal damage requires preparation.
Detecting Split-Brain:
1. Multiple Primary Detection
Monitor for the impossible condition: two nodes claiming to be primary.
```sql
-- PostgreSQL: run on every node; more than one node returning FALSE
-- (i.e., not in recovery) indicates split-brain
SELECT pg_is_in_recovery();

-- MySQL: run on every node; more than one node with read_only = OFF
-- indicates split-brain
SHOW GLOBAL VARIABLES LIKE 'read_only';
```
2. Sequence/LSN Divergence
Compare replication positions. If both nodes have advanced beyond their last common point, writes occurred independently.
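A sketch of that check (the function and its arguments are hypothetical; in PostgreSQL you would compare WAL LSNs, in MySQL GTID sets):

```typescript
// Each node reports how far its log has advanced. If BOTH nodes moved
// past the last position they are known to have shared, writes were
// accepted independently on both sides: split-brain.
function hasDiverged(
  lastCommonPosition: bigint,
  positionA: bigint,
  positionB: bigint
): boolean {
  return positionA > lastCommonPosition && positionB > lastCommonPosition;
}

console.log(hasDiverged(1000n, 1500n, 1000n)); // false: only A advanced (normal failover)
console.log(hasDiverged(1000n, 1500n, 1200n)); // true: both advanced independently
```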
3. Write Detection on Both Sides
Monitor write metrics from both nodes. If both show non-zero writes during the same period, split-brain is occurring.
Recovery Procedure:
Recovery from split-brain is painful but methodical:
Step 1: Stop the Bleeding
Immediately stop writes to both primaries. The longer they diverge, the harder reconciliation becomes.
Step 2: Choose the Winner
Decide which node's data is authoritative. Criteria: which side held quorum, which node received more (or more business-critical) writes, which dataset is internally consistent, and which choice minimizes customer impact.
Step 3: Export Loser's Delta
From the 'loser' node, export all writes that occurred since the divergence point. These need manual review.
Step 4: Rebuild from Winner
Reinitialize the loser node as a replica of the winner. This may require a full base backup.
Step 5: Reconcile Lost Writes
Manually review exported writes from the loser. For each: re-apply it if it doesn't conflict, merge it if a sensible merge exists, or escalate to the affected customer or team when the conflict cannot be resolved mechanically.
Step 6: Post-Mortem and Prevention
Conduct thorough analysis: how did split-brain occur? What prevention mechanism failed? What needs to change?
Be deeply skeptical of any system claiming automatic split-brain recovery. True split-brain means conflicting writes—there's no algorithm that can automatically resolve 'User A set balance to $100' vs 'User B set balance to $200' on the same account. Human judgment is required. The goal is minimizing damage, not eliminating it.
Split-brain is the nightmare scenario of high-availability systems: a failure mode that can cause permanent, irrecoverable data corruption. Unlike other failures that systems can heal from automatically, split-brain demands prevention rather than recovery. The essentials: split-brain requires both a network partition and an incorrect failover trigger; fencing (STONITH) guarantees the old primary is stopped before promotion; quorum ensures at most one partition can elect a primary; consensus protocols like Raft eliminate split-brain by design; and layered defenses plus regular partition testing keep every mechanism honest.
What's Next:
With split-brain prevention mastered, we turn to the final phase of failover: failback procedures. When the original primary is restored, how do we safely transition back? The next page explores the complexities of returning to the original configuration after a failover event.
You now understand why split-brain is uniquely dangerous, can implement fencing and quorum mechanisms, know when to use consensus protocols, and have strategies for detection and recovery. Next: Failback Procedures.