Imagine you're building a distributed database replicated across five data centers. A client submits a transaction. For the system to function correctly, all replicas must process the same set of transactions in the same order—even if some replicas fail, messages arrive out of order, or network partitions occur. How do you ensure this?
This is the consensus problem: getting a collection of independent nodes to agree on a single value (or a sequence of values) despite failures and asynchrony. Consensus is the theoretical foundation underlying leader election, replicated state machines, distributed transactions, and virtually every coordination mechanism in distributed systems.
Why is consensus hard?
In a single-machine system, agreement is trivial—there's only one perspective. In distributed systems, each node has its own local view, and that view may differ from others due to message delays, failures, and network partitions. Achieving agreement despite these discrepancies is fundamentally challenging.
By the end of this page, you will understand the formal definition of consensus, its safety and liveness properties, the seminal FLP impossibility result, how practical systems circumvent theoretical limitations, and the relationship between consensus and other distributed primitives like leader election and atomic broadcast.
Consensus has a precise formal definition that captures the requirements any correct solution must satisfy. Understanding this definition is essential for evaluating and designing consensus protocols.
The Consensus Problem:
Given N processes, each proposing a value, the consensus problem is to have all correct processes agree on one of the proposed values.
Formal Properties:
A consensus protocol must satisfy three properties:
| Property | Formal Definition | Informal Explanation |
|---|---|---|
| Agreement | No two correct processes decide on different values | Everyone who decides, decides the same thing |
| Validity | If a process decides value v, then v was proposed by some process | The decided value isn't made up—someone actually proposed it |
| Termination | Every correct process eventually decides on a value | The algorithm eventually finishes; it doesn't run forever |
Safety vs. Liveness:
These properties divide into two categories with fundamentally different characteristics:
Safety properties (Agreement, Validity): "Nothing bad ever happens." If violated, there's a specific point in time when the violation occurred. Safety violations are irreversible—once you've decided two different values, you can't undo it.
Liveness properties (Termination): "Something good eventually happens." Liveness violations cannot be detected at any finite point—you can only say termination hasn't happened yet, not that it won't happen.
This distinction matters because safety and liveness have different failure characteristics. Safety can be guaranteed absolutely (never violate agreement), but liveness is best-effort in asynchronous systems (termination might be delayed indefinitely).
In distributed systems design, safety is typically prioritized over liveness. A system that temporarily cannot make progress is inconvenient; a system that corrupts data is catastrophic. Consensus protocols sacrifice termination guarantees under extreme conditions to preserve agreement guarantees absolutely.
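To make the distinction concrete, here is a small, purely illustrative sketch (the `DecisionEvent` record and helper functions are hypothetical, not part of any real protocol). An agreement violation can be pinned to a specific event in a finite trace; a missing decision can never be proven to be a termination failure.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

@dataclass
class DecisionEvent:
    """One node's decision, as observed in an execution trace."""
    node_id: str
    value: str

def find_safety_violation(trace: List[DecisionEvent]) -> Optional[int]:
    """
    Return the index of the first event that violates Agreement,
    or None if the (finite) trace is safe so far.
    """
    first_decision: Optional[str] = None
    for i, event in enumerate(trace):
        if first_decision is None:
            first_decision = event.value
        elif event.value != first_decision:
            return i  # a concrete, irreversible point of violation
    return None

def undecided_nodes(trace: List[DecisionEvent], all_nodes: List[str]) -> Set[str]:
    """
    Nodes with no decision yet. This is *not* a violation of Termination:
    a finite trace can never prove that a decision will not happen later.
    """
    return set(all_nodes) - {e.node_id for e in trace}

# Example: node C deciding a different value violates Agreement at index 2.
trace = [DecisionEvent("A", "x"), DecisionEvent("B", "x"), DecisionEvent("C", "y")]
assert find_safety_violation(trace) == 2
assert undecided_nodes(trace[:2], ["A", "B", "C"]) == {"C"}  # merely "not yet"
```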
The most important theoretical result in distributed systems is the Fischer-Lynch-Paterson (FLP) Impossibility Theorem (1985). It establishes fundamental limits on what consensus can achieve.
The Result:
In an asynchronous distributed system where at least one process may crash, there is no deterministic algorithm that guarantees consensus.
Understanding the Implications:
The FLP result states that you cannot have all three of:
- Safety (correct processes never decide conflicting values)
- Guaranteed termination (every correct process eventually decides)
- Fault tolerance (the protocol survives even a single process crash)

in a purely asynchronous system with a deterministic algorithm.
Why is this true?
The core issue is indistinguishability. In an asynchronous system:
- Message delays have no upper bound, so a process that has not responded may have crashed or may merely be slow.
- No process can ever tell which of the two is the case, because waiting longer never resolves the question.

The proof constructs a sequence of system states where:
- The system starts in a "bivalent" state, one from which either decision value is still reachable.
- Whatever step the algorithm takes next, the adversary can delay messages so that the system remains bivalent.
- Repeating this forever yields an infinite execution in which no process decides, violating Termination.

FLP does not mean consensus is impossible in practice. It says deterministic consensus with guaranteed termination is impossible in purely asynchronous systems. Real systems circumvent FLP by weakening one of these assumptions, as described next.
Circumventing FLP in Practice:
Production consensus protocols circumvent FLP through several mechanisms:
1. Partial Synchrony Model:
Assume the system is asynchronous but eventually becomes synchronous—message delays are unbounded but eventually stabilize. During periods of synchrony, the algorithm can make progress and terminate.
2. Failure Detectors:
Use imperfect failure detectors (discussed in leader election) that eventually become accurate. The algorithm waits until the detector is reliable enough to make correct decisions (a minimal sketch of such a detector follows this list).
3. Randomization:
Introduce randomness into the decision process. The algorithm terminates with probability 1 over infinite time, even if any finite execution might not terminate.
4. Leader-Based Protocols:
Elect a leader who makes decisions. The algorithm terminates quickly under stable leadership and falls back to leader election when instability is detected.
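As a concrete illustration of mechanism 2, here is a minimal sketch of an eventually accurate failure detector based on adaptive timeouts. The class and method names are illustrative assumptions, not an API from any particular system: the detector suspects a peer after a timeout and, whenever a suspicion turns out to be wrong, raises that peer's timeout so that suspicions eventually stop being premature once the network stabilizes.

```python
import time
from typing import Dict, Set

class AdaptiveFailureDetector:
    """
    Eventually accurate failure detector (sketch).

    Suspects a peer when no heartbeat arrives within its timeout.
    Each false suspicion increases that peer's timeout, so once the
    system behaves synchronously the detector stops making mistakes,
    which is enough for a consensus protocol to terminate.
    """

    def __init__(self, peers: Set[str], initial_timeout: float = 1.0):
        self.timeouts: Dict[str, float] = {p: initial_timeout for p in peers}
        self.last_heartbeat: Dict[str, float] = {p: time.monotonic() for p in peers}
        self.suspected: Set[str] = set()

    def on_heartbeat(self, peer: str):
        """Record a heartbeat; revoke suspicion and adapt the timeout if we were wrong."""
        self.last_heartbeat[peer] = time.monotonic()
        if peer in self.suspected:
            self.suspected.discard(peer)
            # We suspected a live peer: it was just slow. Be more patient next time.
            self.timeouts[peer] *= 2

    def check(self) -> Set[str]:
        """Return the current set of suspected peers."""
        now = time.monotonic()
        for peer, timeout in self.timeouts.items():
            if now - self.last_heartbeat[peer] > timeout:
                self.suspected.add(peer)
        return set(self.suspected)
```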
```python
from typing import Any, List, Optional, Set

# Exception types used below, defined here so the sketch is self-contained.
class ValidityViolation(Exception): pass
class AgreementViolation(Exception): pass
class RoundTimeout(Exception): pass


class ConsensusProtocol:
    """
    Abstract base class demonstrating consensus properties.

    Any concrete implementation must satisfy:
    - Agreement: All correct nodes decide the same value
    - Validity: Decided value was proposed by some node
    - Termination: All correct nodes eventually decide

    FLP tells us we cannot guarantee all three in async systems.
    """

    def __init__(self, node_id: str, peers: List[str]):
        self.node_id = node_id
        self.peers = peers
        self.decided_value: Optional[Any] = None
        self.proposed_values: Set[Any] = set()

    def propose(self, value: Any) -> Optional[Any]:
        """
        Propose a value for consensus.

        Returns the decided value, or None if not yet decided.
        Validity: The returned value must be in proposed_values
        """
        self.proposed_values.add(value)
        # Implementation-specific consensus logic
        raise NotImplementedError

    def decide(self, value: Any):
        """
        Record a consensus decision.

        Agreement: Once called, all correct nodes must call with the same value.
        Validity: value must have been proposed by some node.
        """
        if value not in self.proposed_values:
            raise ValidityViolation(
                f"Decided value {value} was never proposed"
            )

        if self.decided_value is not None:
            if self.decided_value != value:
                raise AgreementViolation(
                    f"Already decided {self.decided_value}, cannot decide {value}"
                )
            return  # Already decided this value

        self.decided_value = value
        self.on_decision(value)

    def on_decision(self, value: Any):
        """Hook called when consensus is reached."""
        print(f"Node {self.node_id}: Decided on {value}")


class PartialSynchronyConsensus(ConsensusProtocol):
    """
    Consensus assuming eventual synchrony.

    During asynchronous periods, safety is preserved but liveness
    is not guaranteed. During synchronous periods, the protocol
    makes progress and terminates.
    """

    def __init__(self, node_id: str, peers: List[str]):
        super().__init__(node_id, peers)
        self.round_timeout = 1.0   # Initial timeout
        self.max_timeout = 30.0    # Maximum timeout

    def propose(self, value: Any) -> Optional[Any]:
        """
        Propose value, retrying with exponential backoff
        until synchrony is achieved.
        """
        self.proposed_values.add(value)
        current_round = 0
        timeout = self.round_timeout

        while self.decided_value is None:
            try:
                # Attempt consensus in current round
                result = self._run_round(value, current_round, timeout)
                if result is not None:
                    self.decide(result)
                    return result
            except RoundTimeout:
                # System may be asynchronous
                # Increase timeout and retry
                current_round += 1
                timeout = min(timeout * 2, self.max_timeout)
                print(f"Round {current_round} timeout, new timeout: {timeout}s")

        return self.decided_value

    def _run_round(self, value: Any, round_num: int, timeout: float):
        """
        Execute one round of consensus protocol.

        Returns decided value if consensus reached,
        None if round timed out without decision.
        """
        # Implementation would include:
        # 1. Broadcast proposal for this round
        # 2. Collect votes with timeout
        # 3. If quorum reached, decide
        # 4. If timeout, return None for retry
        raise NotImplementedError
```

State Machine Replication (SMR) is the most important application of consensus. It's the technique underlying virtually all replicated databases, coordination services, and distributed storage systems.
The Replicated State Machine Model:
If every replica starts in the same initial state and applies the same deterministic commands in the same order, every replica ends up in the same state. The hard part is not applying commands; it is agreeing on their order.
Consensus enables SMR by ensuring agreement on:
- Which command occupies each position in the command sequence
- The single, total order in which every replica applies those commands
The Log Abstraction:
Replicated state machines typically use a replicated log as the ordering mechanism:
- Each log entry holds one client command.
- Consensus decides the value at each log index, so all replicas hold the same entry at the same position.
- Replicas apply committed entries strictly in index order, producing identical state.
Multi-Paxos and Multi-Decree Consensus:
A single consensus instance decides one value. For state machine replication, we need repeated consensus—a sequence of consensus instances, one for each log entry. This is what "Multi-Paxos" and similar protocols provide.
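A minimal sketch of this idea, assuming the `ConsensusProtocol` base class from the earlier example: a multi-decree wrapper simply keeps one single-decree instance per log index. The `propose_for_index` method is an illustrative name used by the replicated state machine example later on this page, not the API of Paxos or Raft.

```python
from typing import Any, Dict, List

class MultiDecreeConsensus:
    """
    Repeated consensus: one single-decree instance per log index (sketch).
    """

    def __init__(self, node_id: str, peers: List[str], instance_factory):
        self.node_id = node_id
        self.peers = peers
        # instance_factory builds a fresh single-decree ConsensusProtocol
        self.instance_factory = instance_factory
        self.instances: Dict[int, Any] = {}

    def propose_for_index(self, index: int, value: Any) -> Any:
        """
        Run (or join) the consensus instance for one log index and
        return the value that instance decides.
        """
        if index not in self.instances:
            self.instances[index] = self.instance_factory(self.node_id, self.peers)
        return self.instances[index].propose(value)
```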
The Leader Optimization:
In repeated consensus, the same node typically leads consecutive instances. This optimization eliminates the need to run a full consensus round for each command:
- The leader establishes its authority once (a single prepare/election phase).
- While that leadership is stable, each command needs only one round trip to a quorum to be committed.
- Only when the leader fails or is suspected does the protocol pay the cost of electing a new one.
This is exactly what Raft formalizes as "normal operation" vs. "leader election."
```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# Minimal command record, defined here so the sketch is self-contained.
@dataclass
class Command:
    """A state-machine command; fields beyond 'type' depend on the command."""
    type: str
    key: Optional[str] = None
    value: Any = None
    expected: Any = None
    new_value: Any = None


class ReplicatedStateMachine:
    """
    State machine replication using consensus.

    Maintains a replicated log where each entry is decided by consensus.
    State machines apply log entries sequentially to maintain identical
    state across replicas.
    """

    def __init__(self, node_id: str, consensus: "ConsensusProtocol"):
        self.node_id = node_id
        # Expected to provide propose_for_index (see multi-decree sketch above)
        self.consensus = consensus

        # The replicated log - same on all replicas
        self.log: List[Command] = []
        self.last_applied_index = -1

        # The replicated state - same on all replicas after
        # applying the same log prefix
        self.state: Dict[str, Any] = {}

    def submit_command(self, command: Command) -> Any:
        """
        Submit a command for replicated execution.

        1. Propose command at next log index
        2. Wait for consensus
        3. Apply all committed entries
        4. Return result of this command
        """
        # Leader assigns log index
        log_index = len(self.log)

        # Reach consensus on this command at this index
        decided_command = self.consensus.propose_for_index(
            index=log_index,
            value=command
        )

        # Add to log
        self.log.append(decided_command)

        # Apply all unapplied entries
        result = None
        while self.last_applied_index < len(self.log) - 1:
            self.last_applied_index += 1
            cmd = self.log[self.last_applied_index]
            cmd_result = self._apply(cmd)
            if self.last_applied_index == log_index:
                result = cmd_result

        return result

    def _apply(self, command: Command) -> Any:
        """
        Apply a command to the state machine.

        Must be deterministic - same input always produces
        same state change and output.
        """
        if command.type == "SET":
            self.state[command.key] = command.value
            return "OK"
        elif command.type == "GET":
            return self.state.get(command.key)
        elif command.type == "DELETE":
            old = self.state.pop(command.key, None)
            return old
        elif command.type == "CAS":
            # Compare-and-swap
            current = self.state.get(command.key)
            if current == command.expected:
                self.state[command.key] = command.new_value
                return True
            return False

    def get_state_snapshot(self) -> Dict[str, Any]:
        """
        Get current state.

        State is a deterministic function of the log prefix
        up to last_applied_index.
        """
        return dict(self.state)

    def recover_from_log(self, log: List[Command]):
        """
        Recover state by replaying log.

        After a network partition heals, a node may receive
        committed entries it missed. Apply them in order.
        """
        for i, command in enumerate(log):
            if i > self.last_applied_index:
                # This entry not yet applied
                if i < len(self.log):
                    # Already in our log, just apply
                    self._apply(command)
                else:
                    # Missing from our log, add and apply
                    self.log.append(command)
                    self._apply(command)
                self.last_applied_index = i

        print(f"Recovery complete: applied {len(log)} entries")
```

SMR provides strong consistency: once a command is committed, all subsequent reads will see its effects. Combined with leader-based reads, this achieves linearizability—operations appear to occur atomically at some point between invocation and response. This is the strongest consistency model and is what distinguishes systems like etcd from eventually consistent stores like Cassandra.
Before consensus algorithms like Paxos became widespread, distributed databases used commit protocols for atomic transactions across nodes. Understanding these protocols illuminates why consensus is superior for many use cases.
Two-Phase Commit (2PC):
2PC ensures all participants either commit or abort a transaction atomically. A designated coordinator orchestrates the protocol:
Phase 1: Prepare (Voting)
- The coordinator sends a PREPARE request for the transaction to every participant.
- Each participant acquires the necessary locks, durably logs its prepared state, and votes COMMIT, or votes ABORT if it cannot guarantee the commit.
Phase 2: Commit/Abort
- If every participant voted COMMIT, the coordinator logs and broadcasts COMMIT; if any participant voted ABORT (or timed out), it logs and broadcasts ABORT.
- Participants apply or discard their prepared changes accordingly, release their locks, and acknowledge the decision.
```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Minimal exception and record types, defined here so the sketch is self-contained.
class NetworkTimeout(Exception): pass
class ParticipantUnavailable(Exception): pass
class LockConflict(Exception): pass
class ResourceUnavailable(Exception): pass

@dataclass
class LogRecord:
    state: str     # "PREPARE", "COMMIT", or "ABORT"
    txn_id: str


class TwoPhaseCommitCoordinator:
    """
    Coordinator for Two-Phase Commit protocol.
    Ensures atomic commit/abort across all participants.
    """

    def __init__(self, participants: List[str]):
        self.participants = participants
        self.transaction_log: List[LogRecord] = []

    def log(self, state: str, txn_id: str):
        """Durably record coordinator state before acting on it."""
        self.transaction_log.append(LogRecord(state, txn_id))

    def execute_transaction(self, transaction_id: str, operations: List) -> bool:
        """
        Execute distributed transaction with 2PC.
        Returns True if committed, False if aborted.
        """
        # Log transaction start
        self.log("PREPARE", transaction_id)

        # Phase 1: Prepare
        votes = self._collect_votes(transaction_id, operations)
        all_agreed = all(vote == "COMMIT" for vote in votes.values())

        # Phase 2: Commit or Abort
        if all_agreed:
            self.log("COMMIT", transaction_id)
            self._send_decision(transaction_id, "COMMIT")
            return True
        else:
            self.log("ABORT", transaction_id)
            self._send_decision(transaction_id, "ABORT")
            return False

    def _collect_votes(self, txn_id: str, operations: List) -> Dict[str, str]:
        """Send PREPARE to all participants and collect votes."""
        votes = {}
        for participant in self.participants:
            try:
                # Send prepare request with timeout
                response = self.send_prepare(
                    participant, txn_id, operations, timeout=5.0
                )
                votes[participant] = response
            except (NetworkTimeout, ParticipantUnavailable):
                # Treat timeout/failure as ABORT vote
                votes[participant] = "ABORT"
        return votes

    def _send_decision(self, txn_id: str, decision: str):
        """
        Send final decision to all participants.
        Retry until all acknowledge.
        """
        pending = set(self.participants)
        while pending:
            for participant in list(pending):
                try:
                    self.send_decision(participant, txn_id, decision)
                    pending.remove(participant)
                except NetworkTimeout:
                    # Will retry on next iteration
                    pass

    def recover_after_crash(self):
        """
        Recovery after coordinator crash.

        Read the transaction log to find incomplete transactions
        and resume from the last logged state of each transaction.
        """
        # The last record for each transaction determines its outcome.
        last_state: Dict[str, str] = {}
        for entry in self.transaction_log:
            last_state[entry.txn_id] = entry.state

        for txn_id, state in last_state.items():
            if state == "PREPARE":
                # Crashed before decision - abort
                self._send_decision(txn_id, "ABORT")
            elif state == "COMMIT":
                # Crashed after commit decision - resend COMMIT
                self._send_decision(txn_id, "COMMIT")
            elif state == "ABORT":
                # Crashed after abort decision - resend ABORT
                self._send_decision(txn_id, "ABORT")


class TwoPhaseCommitParticipant:
    """
    Participant in Two-Phase Commit.
    """

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.prepared_transactions: Dict[str, Any] = {}

    def on_prepare(self, txn_id: str, operations: List) -> str:
        """
        Handle PREPARE request.

        If we can commit, vote COMMIT and wait for decision.
        Otherwise, vote ABORT.
        """
        try:
            # Acquire necessary locks
            locks = self._acquire_locks(operations)

            # Write to durable log
            self._log_prepare(txn_id, operations, locks)

            # Store prepared state
            self.prepared_transactions[txn_id] = {
                "operations": operations,
                "locks": locks
            }
            return "COMMIT"
        except (LockConflict, ResourceUnavailable):
            return "ABORT"

    def on_decision(self, txn_id: str, decision: str):
        """Handle final COMMIT or ABORT decision."""
        if txn_id not in self.prepared_transactions:
            # Never prepared - ignore
            return

        txn = self.prepared_transactions[txn_id]

        if decision == "COMMIT":
            # Apply changes
            self._apply_operations(txn["operations"])
            self._log_commit(txn_id)
        else:
            # Discard changes
            self._log_abort(txn_id)

        # Release locks
        self._release_locks(txn["locks"])
        del self.prepared_transactions[txn_id]
```

The Blocking Problem:
2PC has a critical weakness: if the coordinator fails after sending PREPARE but before sending a decision, participants are blocked indefinitely. They've voted COMMIT and hold locks, but don't know whether to commit or abort.
Three-Phase Commit (3PC):
3PC attempts to solve the blocking problem by adding a PRE-COMMIT phase between voting and committing:
- Phase 1 (CanCommit): the coordinator asks participants whether they can commit; participants vote as in 2PC.
- Phase 2 (PreCommit): if all votes are COMMIT, the coordinator sends PRE-COMMIT and waits for acknowledgments; no participant commits yet.
- Phase 3 (DoCommit): once the acknowledgments arrive, the coordinator sends the final COMMIT and participants apply the transaction.
The key insight: any participant who received PRE-COMMIT knows all participants voted COMMIT, so a recovering coordinator can safely commit. Any participant who only saw PREPARE knows they can safely abort.
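A minimal coordinator-side sketch of the three phases, reusing the hypothetical exception type from the 2PC example above; the network helpers (`send_can_commit`, `send_pre_commit`, `send_decision`) are assumed to exist, mirroring that example, and the class name is illustrative rather than a standard API.

```python
from typing import List

class ThreePhaseCommitCoordinator:
    """
    Coordinator for Three-Phase Commit (sketch).

    Adds a PRE-COMMIT phase so that, after a coordinator crash, any
    participant that saw PRE-COMMIT knows everyone voted to commit.
    """

    def __init__(self, participants: List[str]):
        self.participants = participants

    def execute_transaction(self, txn_id: str, operations: List) -> bool:
        # Phase 1: CanCommit - collect votes
        for p in self.participants:
            try:
                if self.send_can_commit(p, txn_id, operations) != "YES":
                    self._broadcast(txn_id, "ABORT")
                    return False
            except NetworkTimeout:  # as defined in the 2PC sketch above
                self._broadcast(txn_id, "ABORT")
                return False

        # Phase 2: PreCommit - everyone voted YES; no one commits yet
        for p in self.participants:
            self.send_pre_commit(p, txn_id)   # participants acknowledge and wait

        # Phase 3: DoCommit - now safe to commit everywhere
        self._broadcast(txn_id, "COMMIT")
        return True

    def _broadcast(self, txn_id: str, decision: str):
        for p in self.participants:
            self.send_decision(p, txn_id, decision)
```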
3PC assumes bounded message delays (synchrony). Under network partitions, both 2PC and 3PC can violate safety or liveness. Modern distributed databases increasingly use consensus (Paxos, Raft) instead of 2PC/3PC, or use 2PC atop a consensus-based log for atomic commit across shards.
Quorums are the fundamental mechanism enabling consensus in the presence of failures. A quorum is a subset of nodes large enough to make authoritative decisions.
The Majority Quorum:
The simplest and most common quorum is a majority: (N/2) + 1 nodes out of N total. Majority quorums have a crucial property:
Any two majority quorums must overlap in at least one node.
This overlap property is what enables consensus:
- Any value accepted by one quorum is known to at least one member of every other quorum.
- A later round (or a new leader) that contacts a quorum is therefore guaranteed to learn about any value that may already have been decided, and can preserve it rather than contradict it.
Why Majority Works:
| Cluster Size N | Majority Quorum | Failures Tolerated | Example Node IDs |
|---|---|---|---|
| 3 nodes | 2 nodes | 1 failure | Any 2 of {A, B, C} |
| 5 nodes | 3 nodes | 2 failures | Any 3 of {A, B, C, D, E} |
| 7 nodes | 4 nodes | 3 failures | Any 4 of {A, B, C, D, E, F, G} |
| 2F+1 nodes | F+1 nodes | F failures | Any majority of cluster |
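The table's guarantee follows from simple counting: two subsets of size F+1 drawn from 2F+1 nodes must share at least one member. A quick brute-force check (illustrative only) confirms this for a 5-node cluster:

```python
from itertools import combinations

nodes = {"A", "B", "C", "D", "E"}       # N = 5
majority = len(nodes) // 2 + 1          # 3 nodes

# Every pair of majority quorums shares at least one node.
quorums = [set(q) for q in combinations(nodes, majority)]
assert all(q1 & q2 for q1 in quorums for q2 in quorums)

# Pigeonhole bound: two quorums of size 3 in a 5-node cluster
# overlap in at least 3 + 3 - 5 = 1 node.
print(f"{len(quorums)} majority quorums, all pairs intersect")
```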
Generalized Quorum Systems:
Majority quorums aren't the only option. A quorum system is any collection of subsets (quorums) where any two quorums intersect.
Examples:
Grid quorums: Arrange N nodes in a √N × √N grid. Any row + column forms a quorum. Requires only 2√N nodes, but any two such quorums still intersect.
Weighted quorums: Assign weights to nodes (reflecting reliability, geographic importance, etc.). A quorum is any set whose total weight exceeds half the total.
Hierarchical quorums: Organize nodes in a tree. Quorums are formed by combining quorums from subtrees.
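As a concrete illustration of the grid idea, here is a small, hypothetical sketch (the class and method names are illustrative): a quorum is one full row plus one full column, so any two quorums intersect where one quorum's row crosses the other quorum's column.

```python
from typing import List, Set

class GridQuorum:
    """
    Grid quorum system (sketch): nodes arranged in a k x k grid.
    A quorum is all nodes of one row plus all nodes of one column,
    roughly 2*sqrt(N) nodes instead of a majority of N.
    """

    def __init__(self, grid: List[List[str]]):
        self.grid = grid
        self.k = len(grid)

    def quorum(self, row: int, col: int) -> Set[str]:
        """The quorum formed by row `row` and column `col`."""
        row_nodes = set(self.grid[row])
        col_nodes = {self.grid[r][col] for r in range(self.k)}
        return row_nodes | col_nodes

    def is_quorum(self, node_set: Set[str]) -> bool:
        """True if the set contains at least one full row and one full column."""
        has_row = any(set(row).issubset(node_set) for row in self.grid)
        has_col = any(
            {self.grid[r][c] for r in range(self.k)}.issubset(node_set)
            for c in range(self.k)
        )
        return has_row and has_col

# 9 nodes in a 3x3 grid: each quorum has 5 nodes (3 + 3 - 1).
grid = GridQuorum([["A", "B", "C"], ["D", "E", "F"], ["G", "H", "I"]])
q1 = grid.quorum(0, 0)   # row 0 + column 0
q2 = grid.quorum(2, 2)   # row 2 + column 2
assert q1 & q2           # they intersect (at "C" and "G")
```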
The Quorum Trade-off:
Smaller quorums mean fewer nodes to contact per operation, so lower latency and less load. But the intersection requirement pushes the cost elsewhere: the complementary quorum must grow (as in the R + W > N rule in the code below), and fewer or more specific failure patterns can be tolerated before no quorum can form.
```python
from itertools import combinations
from typing import Dict, List, Set

class QuorumSystem:
    """
    Abstract quorum system for consensus operations.

    A quorum system is a collection of subsets of nodes (quorums)
    such that any two quorums in the system intersect.
    """

    def __init__(self, nodes: List[str]):
        self.nodes = nodes
        self.n = len(nodes)

    def is_quorum(self, node_set: Set[str]) -> bool:
        """Check if a set of nodes forms a valid quorum."""
        raise NotImplementedError

    def _enumerate_quorums(self) -> List[Set[str]]:
        """
        Enumerate all quorums by brute force (exponential in N;
        suitable only for small clusters / illustration).
        """
        quorums = []
        for size in range(1, self.n + 1):
            for subset in combinations(self.nodes, size):
                if self.is_quorum(set(subset)):
                    quorums.append(set(subset))
        return quorums

    def check_intersection_property(self) -> bool:
        """
        Verify that any two quorums intersect.
        This is the fundamental safety property.
        """
        all_quorums = self._enumerate_quorums()
        for q1 in all_quorums:
            for q2 in all_quorums:
                if not q1.intersection(q2):
                    return False
        return True


class MajorityQuorum(QuorumSystem):
    """
    Simple majority quorum: any set of more than N/2 nodes.
    """

    def is_quorum(self, node_set: Set[str]) -> bool:
        valid_nodes = node_set.intersection(self.nodes)
        return len(valid_nodes) > self.n // 2

    def required_for_quorum(self) -> int:
        """Minimum nodes needed for a quorum."""
        return self.n // 2 + 1

    def failures_tolerated(self) -> int:
        """Maximum failures while still forming quorums."""
        return self.n - self.required_for_quorum()


class WeightedQuorum(QuorumSystem):
    """
    Weighted quorum where nodes have different voting power.

    Useful when some nodes are more reliable or important.
    A quorum requires votes exceeding half the total weight.
    """

    def __init__(self, node_weights: Dict[str, int]):
        self.node_weights = node_weights
        self.nodes = list(node_weights.keys())
        self.n = len(self.nodes)
        self.total_weight = sum(node_weights.values())
        self.quorum_threshold = self.total_weight // 2 + 1

    def is_quorum(self, node_set: Set[str]) -> bool:
        weight = sum(
            self.node_weights.get(node, 0)
            for node in node_set
        )
        return weight >= self.quorum_threshold

    def node_criticality(self, node: str) -> float:
        """
        How critical is this node for forming quorums?
        Higher weight = more critical.
        """
        return self.node_weights.get(node, 0) / self.total_weight


class ReadWriteQuorum(QuorumSystem):
    """
    Separate read and write quorums.

    Allows tuning read vs write availability/latency.
    Must satisfy: R + W > N (read and write quorums intersect)
    """

    def __init__(self, nodes: List[str], read_quorum: int, write_quorum: int):
        super().__init__(nodes)
        self.read_quorum = read_quorum
        self.write_quorum = write_quorum

        # Verify intersection property
        if read_quorum + write_quorum <= self.n:
            raise ValueError(
                f"Read ({read_quorum}) + Write ({write_quorum}) "
                f"must be > N ({self.n})"
            )

    def is_read_quorum(self, node_set: Set[str]) -> bool:
        return len(node_set.intersection(self.nodes)) >= self.read_quorum

    def is_write_quorum(self, node_set: Set[str]) -> bool:
        return len(node_set.intersection(self.nodes)) >= self.write_quorum


# Example usage: Tuning quorums for different workloads

# Read-heavy workload: R=1, W=N
# Fast reads from any node, slow writes to all
read_heavy = ReadWriteQuorum(
    nodes=["A", "B", "C", "D", "E"],
    read_quorum=1,    # Read from any node
    write_quorum=5    # Write to all nodes
)

# Write-heavy workload: R=N, W=1
# Fast writes to any node, slow reads from all
write_heavy = ReadWriteQuorum(
    nodes=["A", "B", "C", "D", "E"],
    read_quorum=5,    # Read from all nodes
    write_quorum=1    # Write to any node
)

# Balanced: R=W=3
balanced = ReadWriteQuorum(
    nodes=["A", "B", "C", "D", "E"],
    read_quorum=3,
    write_quorum=3
)
```

Most production systems use simple majority quorums with odd cluster sizes (3, 5, 7 nodes). This provides a good balance of fault tolerance and operational simplicity.
Larger clusters increase fault tolerance but also increase the probability that at least one node is unavailable at any time—cluster size should match expected failure rates.
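To illustrate how cluster size interacts with per-node reliability, here is a small, illustrative calculation: the probability that a majority quorum is available, assuming (purely for the example) that each node is independently up with probability p.

```python
from math import comb

def quorum_availability(n: int, p: float) -> float:
    """P(at least a majority of n nodes is up), nodes independently up with prob p."""
    majority = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(majority, n + 1))

for n in (3, 5, 7):
    print(n, round(quorum_availability(n, 0.99), 6))
# With p = 0.99, larger clusters push quorum availability higher,
# even though the chance that *some* node is down also grows with n.
```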
Different applications have different requirements, leading to various consensus variants that trade off properties against each other.
Single-Decree vs. Multi-Decree:
A single-decree protocol decides exactly one value. Multi-decree (repeated) consensus decides a sequence of values, one per log position, which is what state machine replication requires.
Crash-Fault vs. Byzantine Consensus:
Crash-fault-tolerant protocols assume faulty nodes simply stop and never lie; they tolerate F failures with 2F+1 nodes. Byzantine fault-tolerant protocols tolerate nodes that behave arbitrarily, including maliciously, and require 3F+1 nodes.
Leader-Based vs. Leaderless:
Leader-based protocols funnel all proposals through a distinguished node, which simplifies ordering and is fast under stable leadership. Leaderless protocols let any node propose directly, avoiding the leader bottleneck at the cost of more complex conflict handling.
| Protocol | Fault Model | Nodes Needed | Leader | Notable Features |
|---|---|---|---|---|
| Paxos | Crash-fault | 2F+1 | Yes (Multi-Paxos) | Foundational, complex |
| Raft | Crash-fault | 2F+1 | Yes | Understandable, widely implemented |
| ZAB | Crash-fault | 2F+1 | Yes | ZooKeeper's protocol |
| PBFT | Byzantine | 3F+1 | Rotating | First practical BFT |
| HotStuff | Byzantine | 3F+1 | Rotating | Linear communication |
| Tendermint | Byzantine | 3F+1 | Rotating | Blockchain consensus |
Performance Trade-offs:
1. Latency vs. Throughput:
Committing each command individually minimizes latency, while batching many commands into one consensus instance (or pipelining instances) raises throughput at the cost of some added per-command delay (see the batching sketch after this list).
2. Read Consistency vs. Availability:
Linearizable reads must go through the leader or a read quorum, adding latency; serving reads from any local replica is faster and more available but may return stale data.
3. Durability vs. Performance:
Flushing every log entry to stable storage before acknowledging protects against crashes but limits throughput; relaxing or batching flushes is faster but risks losing recently acknowledged entries on power loss.
4. Geographic Distribution:
Spreading replicas across regions survives data-center outages, but every commit then pays at least one wide-area round trip to reach a quorum, raising latency substantially.
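A minimal sketch of the batching idea from trade-off 1, assuming the `MultiDecreeConsensus` wrapper sketched earlier (all names here are illustrative): commands are buffered briefly and proposed as a single log entry, trading a little latency for fewer consensus rounds.

```python
import time
from typing import Any, List

class BatchingProposer:
    """
    Buffer commands and propose them as one batch per log index (sketch).
    Larger batches mean fewer consensus rounds and higher throughput,
    but each command may wait up to max_delay before being proposed.
    """

    def __init__(self, consensus, max_batch: int = 64, max_delay: float = 0.005):
        self.consensus = consensus            # e.g. a MultiDecreeConsensus instance
        self.max_batch = max_batch
        self.max_delay = max_delay
        self.buffer: List[Any] = []
        self.next_index = 0
        self.oldest = None                    # arrival time of oldest buffered command

    def submit(self, command: Any) -> None:
        """Queue a command; flush when the batch is full or old enough."""
        if not self.buffer:
            self.oldest = time.monotonic()
        self.buffer.append(command)
        if (len(self.buffer) >= self.max_batch or
                time.monotonic() - self.oldest >= self.max_delay):
            self.flush()

    def flush(self) -> None:
        """Propose the whole buffer as a single log entry."""
        if not self.buffer:
            return
        batch, self.buffer = self.buffer, []
        self.consensus.propose_for_index(index=self.next_index, value=tuple(batch))
        self.next_index += 1
```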
Consensus protocols inherently prioritize Consistency over Availability during partitions. A minority partition cannot form a quorum and thus cannot make progress. This is a deliberate choice: it's better to be unavailable than to allow conflicting decisions. Systems that prefer availability (like Cassandra) sacrifice consistency by avoiding consensus for normal operations.
Consensus implementations are notoriously difficult to get right. Subtle bugs can lie dormant for years before manifesting under specific failure conditions. Here are key principles for safe implementation:
1. Persistence Before Response:
Before a node sends any response that could influence consensus (votes, acknowledgments, decisions), it must durably persist the information that response represents. Otherwise, a restart could cause the node to "forget" its vote and vote differently.
2. State Machine Safety:
The state machine must be purely deterministic. The same sequence of commands applied in the same order must always produce the same state—regardless of wall-clock time, thread scheduling, or any external factors.
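A common pitfall, shown in a brief illustrative sketch: commands that consult wall-clock time or randomness would diverge across replicas, so the leader must resolve any non-determinism before the command is replicated.

```python
import time
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class SetCommand:
    key: str
    value: Any

# WRONG: each replica would record its own local time, so replica states diverge.
def apply_wrong(state: Dict[str, Any], key: str) -> None:
    state[key] = time.time()

# RIGHT: the leader evaluates the non-deterministic part once, then replicates
# a fully concrete command that every replica applies identically.
def make_command(key: str) -> SetCommand:
    return SetCommand(key=key, value=time.time())   # resolved before replication

def apply(state: Dict[str, Any], cmd: SetCommand) -> None:
    state[cmd.key] = cmd.value                       # deterministic
```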
3. Epoch/Term Ordering:
Every message must be tagged with an epoch/term number. Nodes must reject messages from stale epochs to prevent interference from zombie leaders or delayed messages.
4. Log Consistency:
Once a log entry is committed (decided by consensus), it can never be changed or removed. The log is append-only at each position up to the commit point.
```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# Lightweight record and exception types, defined here so the sketch is self-contained.
@dataclass
class LogEntry:
    term: int
    command: Any

@dataclass
class VoteResponse:
    term: int
    vote_granted: bool

@dataclass
class AppendResponse:
    term: int
    success: bool

class SafetyViolation(Exception): pass


class SafeConsensusNode:
    """
    Consensus node with safety invariants enforced.

    Key invariants:
    1. Persistence before response
    2. Epoch ordering enforced
    3. Committed entries never changed
    4. Deterministic state machine
    """

    def __init__(self, node_id: str, storage: "PersistentStorage", state_machine):
        self.node_id = node_id
        self.storage = storage                # assumed durable key-value store
        self.state_machine = state_machine    # assumed deterministic apply(command)

        # Persisted state - recovered from storage on restart
        self.current_term: int = self.storage.get("current_term", 0)
        self.voted_for: Optional[str] = self.storage.get("voted_for", None)
        self.log: List[LogEntry] = self.storage.get("log", [])

        # Volatile state - reconstructed on restart
        self.commit_index: int = -1
        self.last_applied: int = -1
        self.results: Dict[int, Any] = {}
        self.role: str = "FOLLOWER"

    def handle_vote_request(self, term: int, candidate_id: str,
                            last_log_index: int, last_log_term: int) -> VoteResponse:
        """
        Handle RequestVote RPC.

        SAFETY: Must persist voted_for before responding.
        """
        # Epoch ordering: reject if term < current_term
        if term < self.current_term:
            return VoteResponse(term=self.current_term, vote_granted=False)

        # Update term if newer
        if term > self.current_term:
            self._step_down(term)

        # Check if we can vote for this candidate
        can_vote = (
            self.voted_for is None or
            self.voted_for == candidate_id
        )

        # Check if candidate's log is at least as up-to-date as ours
        log_ok = self._log_is_current(last_log_index, last_log_term)

        if can_vote and log_ok:
            # SAFETY: Persist before responding
            self.voted_for = candidate_id
            self.storage.put("voted_for", candidate_id)
            self.storage.flush()  # Ensure durable
            return VoteResponse(term=self.current_term, vote_granted=True)
        else:
            return VoteResponse(term=self.current_term, vote_granted=False)

    def handle_append_entries(self, term: int, leader_id: str,
                              prev_log_index: int, prev_log_term: int,
                              entries: List[LogEntry], leader_commit: int):
        """
        Handle AppendEntries RPC.

        SAFETY: Must persist new entries before responding.
        """
        # Epoch ordering
        if term < self.current_term:
            return AppendResponse(term=self.current_term, success=False)
        if term > self.current_term:
            self._step_down(term)

        # Check log consistency at prev_log_index
        if not self._log_matches(prev_log_index, prev_log_term):
            return AppendResponse(term=self.current_term, success=False)

        # Append new entries
        for i, entry in enumerate(entries):
            log_index = prev_log_index + 1 + i
            if log_index < len(self.log):
                if self.log[log_index].term != entry.term:
                    # Conflict: truncate log from here
                    # SAFETY: Only truncate uncommitted entries
                    if log_index <= self.commit_index:
                        raise SafetyViolation(
                            "Cannot truncate committed entry"
                        )
                    self.log = self.log[:log_index]
            if log_index >= len(self.log):
                self.log.append(entry)

        # SAFETY: Persist before responding
        self.storage.put("log", self.log)
        self.storage.flush()

        # Update commit index
        if leader_commit > self.commit_index:
            self.commit_index = min(leader_commit, len(self.log) - 1)
            self._apply_committed_entries()

        return AppendResponse(term=self.current_term, success=True)

    def _apply_committed_entries(self):
        """
        Apply committed entries to state machine.

        SAFETY: State machine must be deterministic.
        """
        while self.last_applied < self.commit_index:
            self.last_applied += 1
            entry = self.log[self.last_applied]

            # Apply deterministically
            result = self.state_machine.apply(entry.command)

            # Store result (indexed by log position for idempotency)
            self.results[self.last_applied] = result

    # Helper checks below follow the standard Raft rules; they are added here
    # so the sketch is self-contained.
    def _log_is_current(self, last_log_index: int, last_log_term: int) -> bool:
        """Candidate's log is at least as up-to-date as ours (compare last term, then length)."""
        if not self.log:
            return True
        our_last_term = self.log[-1].term
        our_last_index = len(self.log) - 1
        if last_log_term != our_last_term:
            return last_log_term > our_last_term
        return last_log_index >= our_last_index

    def _log_matches(self, prev_log_index: int, prev_log_term: int) -> bool:
        """True if our log contains an entry at prev_log_index with term prev_log_term."""
        if prev_log_index < 0:
            return True
        return (prev_log_index < len(self.log) and
                self.log[prev_log_index].term == prev_log_term)

    def _step_down(self, new_term: int):
        """
        Step down when we see a higher term.

        SAFETY: Persist new term before any other action.
        """
        self.current_term = new_term
        self.voted_for = None
        self.storage.put("current_term", new_term)
        self.storage.put("voted_for", None)
        self.storage.flush()
        self.role = "FOLLOWER"
```

We've now established the theoretical and practical foundations of consensus: its formal safety and liveness properties, the FLP impossibility result and the mechanisms that work around it, quorum intersection, state machine replication, and the invariants a safe implementation must enforce.
What's next:
With the theoretical foundations in place, we'll dive into Paxos—the first consensus algorithm proven correct and the inspiration for all subsequent protocols. Though famously difficult to understand, Paxos embodies the essential insights that make consensus work.
You now understand the theoretical foundations of consensus—the formal properties, the FLP impossibility result that constrains all solutions, and how quorums and state machine replication build upon consensus. Next, we'll see how Paxos concretizes these ideas into the first practical consensus algorithm.