Imagine you're building a distributed database replicated across five data centers. A client submits a transaction. For the system to function correctly, all replicas must process the same set of transactions in the same order—even if some replicas fail, messages arrive out of order, or network partitions occur. How do you ensure this?
This is the consensus problem: getting a collection of independent nodes to agree on a single value (or a sequence of values) despite failures and asynchrony. Consensus is the theoretical foundation underlying leader election, replicated state machines, distributed transactions, and virtually every coordination mechanism in distributed systems.
Why is consensus hard?
In a single-machine system, agreement is trivial—there's only one perspective. In distributed systems, each node has its own local view, and that view may differ from others due to message delays, failures, and network partitions. Achieving agreement despite these discrepancies is fundamentally challenging.
By the end of this page, you will understand the formal definition of consensus, its safety and liveness properties, the seminal FLP impossibility result, how practical systems circumvent theoretical limitations, and the relationship between consensus and other distributed primitives like leader election and atomic broadcast.
Consensus has a precise formal definition that captures the requirements any correct solution must satisfy. Understanding this definition is essential for evaluating and designing consensus protocols.
The Consensus Problem:
Given N processes, each proposing a value, the consensus problem is to have all correct processes agree on one of the proposed values.
Formal Properties:
A consensus protocol must satisfy three properties:
| Property | Formal Definition | Informal Explanation |
|---|---|---|
| Agreement | No two correct processes decide on different values | Everyone who decides, decides the same thing |
| Validity | If a process decides value v, then v was proposed by some process | The decided value isn't made up—someone actually proposed it |
| Termination | Every correct process eventually decides on a value | The algorithm eventually finishes; it doesn't run forever |
Safety vs. Liveness:
These properties divide into two categories with fundamentally different characteristics:
Safety properties (Agreement, Validity): "Nothing bad ever happens." If violated, there's a specific point in time when the violation occurred. Safety violations are irreversible—once you've decided two different values, you can't undo it.
Liveness properties (Termination): "Something good eventually happens." Liveness violations cannot be detected at any finite point—you can only say termination hasn't happened yet, not that it won't happen.
This distinction matters because safety and liveness have different failure characteristics. Safety can be guaranteed absolutely (never violate agreement), but liveness is best-effort in asynchronous systems (termination might be delayed indefinitely).
In distributed systems design, safety is typically prioritized over liveness. A system that temporarily cannot make progress is inconvenient; a system that corrupts data is catastrophic. Consensus protocols sacrifice termination guarantees under extreme conditions to preserve agreement guarantees absolutely.
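To make the distinction concrete, here is a small, purely illustrative sketch (the `DecisionEvent` record and helper functions are hypothetical, not part of any real protocol). An agreement violation can be pinned to a specific event in a finite trace; a missing decision can never be proven to be a termination failure.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

@dataclass
class DecisionEvent:
    """One node's decision, as observed in an execution trace."""
    node_id: str
    value: str

def find_safety_violation(trace: List[DecisionEvent]) -> Optional[int]:
    """
    Return the index of the first event that violates Agreement,
    or None if the (finite) trace is safe so far.
    """
    first_decision: Optional[str] = None
    for i, event in enumerate(trace):
        if first_decision is None:
            first_decision = event.value
        elif event.value != first_decision:
            return i  # a concrete, irreversible point of violation
    return None

def undecided_nodes(trace: List[DecisionEvent], all_nodes: List[str]) -> Set[str]:
    """
    Nodes with no decision yet. This is *not* a violation of Termination:
    a finite trace can never prove that a decision will not happen later.
    """
    return set(all_nodes) - {e.node_id for e in trace}

# Example: node C deciding a different value violates Agreement at index 2.
trace = [DecisionEvent("A", "x"), DecisionEvent("B", "x"), DecisionEvent("C", "y")]
assert find_safety_violation(trace) == 2
assert undecided_nodes(trace[:2], ["A", "B", "C"]) == {"C"}  # merely "not yet"
```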
The most important theoretical result in distributed systems is the Fischer-Lynch-Paterson (FLP) Impossibility Theorem (1985). It establishes fundamental limits on what consensus can achieve.
The Result:
In an asynchronous distributed system where at least one process may crash, there is no deterministic algorithm that guarantees consensus.
Understanding the Implications:
The FLP result states that you cannot have all three of:
- Safety (correct processes never decide conflicting values)
- Guaranteed termination (every correct process eventually decides)
- Fault tolerance (the protocol survives even a single process crash)

in a purely asynchronous system with a deterministic algorithm.
Why is this true?
The core issue is indistinguishability. In an asynchronous system:
- Message delays have no upper bound, so a process that has not responded may have crashed or may merely be slow.
- No process can ever tell which of the two is the case, because waiting longer never resolves the question.

The proof constructs a sequence of system states where:
- The system starts in a "bivalent" state, one from which either decision value is still reachable.
- Whatever step the algorithm takes next, the adversary can delay messages so that the system remains bivalent.
- Repeating this forever yields an infinite execution in which no process decides, violating Termination.

FLP does not mean consensus is impossible in practice. It says deterministic consensus with guaranteed termination is impossible in purely asynchronous systems. Real systems circumvent FLP by weakening one of these assumptions, as described next.
Circumventing FLP in Practice:
Production consensus protocols circumvent FLP through several mechanisms:
1. Partial Synchrony Model:
Assume the system is asynchronous but eventually becomes synchronous—message delays are unbounded but eventually stabilize. During periods of synchrony, the algorithm can make progress and terminate.
2. Failure Detectors:
Use imperfect failure detectors (discussed in leader election) that eventually become accurate. The algorithm waits until the detector is reliable enough to make correct decisions (a minimal sketch of such a detector follows this list).
3. Randomization:
Introduce randomness into the decision process. The algorithm terminates with probability 1 over infinite time, even if any finite execution might not terminate.
4. Leader-Based Protocols:
Elect a leader who makes decisions. The algorithm terminates quickly under stable leadership and falls back to leader election when instability is detected.
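As a concrete illustration of mechanism 2, here is a minimal sketch of an eventually accurate failure detector based on adaptive timeouts. The class and method names are illustrative assumptions, not an API from any particular system: the detector suspects a peer after a timeout and, whenever a suspicion turns out to be wrong, raises that peer's timeout so that suspicions eventually stop being premature once the network stabilizes.

```python
import time
from typing import Dict, Set

class AdaptiveFailureDetector:
    """
    Eventually accurate failure detector (sketch).

    Suspects a peer when no heartbeat arrives within its timeout.
    Each false suspicion increases that peer's timeout, so once the
    system behaves synchronously the detector stops making mistakes,
    which is enough for a consensus protocol to terminate.
    """

    def __init__(self, peers: Set[str], initial_timeout: float = 1.0):
        self.timeouts: Dict[str, float] = {p: initial_timeout for p in peers}
        self.last_heartbeat: Dict[str, float] = {p: time.monotonic() for p in peers}
        self.suspected: Set[str] = set()

    def on_heartbeat(self, peer: str):
        """Record a heartbeat; revoke suspicion and adapt the timeout if we were wrong."""
        self.last_heartbeat[peer] = time.monotonic()
        if peer in self.suspected:
            self.suspected.discard(peer)
            # We suspected a live peer: it was just slow. Be more patient next time.
            self.timeouts[peer] *= 2

    def check(self) -> Set[str]:
        """Return the current set of suspected peers."""
        now = time.monotonic()
        for peer, timeout in self.timeouts.items():
            if now - self.last_heartbeat[peer] > timeout:
                self.suspected.add(peer)
        return set(self.suspected)
```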
```python
from typing import Any, List, Optional, Set

# Exception types used below, defined here so the sketch is self-contained.
class ValidityViolation(Exception): pass
class AgreementViolation(Exception): pass
class RoundTimeout(Exception): pass


class ConsensusProtocol:
    """
    Abstract base class demonstrating consensus properties.

    Any concrete implementation must satisfy:
    - Agreement: All correct nodes decide the same value
    - Validity: Decided value was proposed by some node
    - Termination: All correct nodes eventually decide

    FLP tells us we cannot guarantee all three in async systems.
    """

    def __init__(self, node_id: str, peers: List[str]):
        self.node_id = node_id
        self.peers = peers
        self.decided_value: Optional[Any] = None
        self.proposed_values: Set[Any] = set()

    def propose(self, value: Any) -> Optional[Any]:
        """
        Propose a value for consensus.

        Returns the decided value, or None if not yet decided.
        Validity: The returned value must be in proposed_values
        """
        self.proposed_values.add(value)
        # Implementation-specific consensus logic
        raise NotImplementedError

    def decide(self, value: Any):
        """
        Record a consensus decision.

        Agreement: Once called, all correct nodes must call with the same value.
        Validity: value must have been proposed by some node.
        """
        if value not in self.proposed_values:
            raise ValidityViolation(
                f"Decided value {value} was never proposed"
            )

        if self.decided_value is not None:
            if self.decided_value != value:
                raise AgreementViolation(
                    f"Already decided {self.decided_value}, cannot decide {value}"
                )
            return  # Already decided this value

        self.decided_value = value
        self.on_decision(value)

    def on_decision(self, value: Any):
        """Hook called when consensus is reached."""
        print(f"Node {self.node_id}: Decided on {value}")


class PartialSynchronyConsensus(ConsensusProtocol):
    """
    Consensus assuming eventual synchrony.

    During asynchronous periods, safety is preserved but liveness
    is not guaranteed. During synchronous periods, the protocol
    makes progress and terminates.
    """

    def __init__(self, node_id: str, peers: List[str]):
        super().__init__(node_id, peers)
        self.round_timeout = 1.0   # Initial timeout
        self.max_timeout = 30.0    # Maximum timeout

    def propose(self, value: Any) -> Optional[Any]:
        """
        Propose value, retrying with exponential backoff
        until synchrony is achieved.
        """
        self.proposed_values.add(value)
        current_round = 0
        timeout = self.round_timeout

        while self.decided_value is None:
            try:
                # Attempt consensus in current round
                result = self._run_round(value, current_round, timeout)
                if result is not None:
                    self.decide(result)
                    return result
            except RoundTimeout:
                # System may be asynchronous
                # Increase timeout and retry
                current_round += 1
                timeout = min(timeout * 2, self.max_timeout)
                print(f"Round {current_round} timeout, new timeout: {timeout}s")

        return self.decided_value

    def _run_round(self, value: Any, round_num: int, timeout: float):
        """
        Execute one round of consensus protocol.

        Returns decided value if consensus reached,
        None if round timed out without decision.
        """
        # Implementation would include:
        # 1. Broadcast proposal for this round
        # 2. Collect votes with timeout
        # 3. If quorum reached, decide
        # 4. If timeout, return None for retry
        raise NotImplementedError
```

State Machine Replication (SMR) is the most important application of consensus. It's the technique underlying virtually all replicated databases, coordination services, and distributed storage systems.
The Replicated State Machine Model:
If every replica starts in the same initial state and applies the same deterministic commands in the same order, every replica ends up in the same state. The hard part is not applying commands; it is agreeing on their order.
Consensus enables SMR by ensuring agreement on:
- Which command occupies each position in the command sequence
- The single, total order in which every replica applies those commands
The Log Abstraction:
Replicated state machines typically use a replicated log as the ordering mechanism:
- Each log entry holds one client command.
- Consensus decides the value at each log index, so all replicas hold the same entry at the same position.
- Replicas apply committed entries strictly in index order, producing identical state.
Multi-Paxos and Multi-Decree Consensus:
A single consensus instance decides one value. For state machine replication, we need repeated consensus—a sequence of consensus instances, one for each log entry. This is what "Multi-Paxos" and similar protocols provide.
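A minimal sketch of this idea, assuming the `ConsensusProtocol` base class from the earlier example: a multi-decree wrapper simply keeps one single-decree instance per log index. The `propose_for_index` method is an illustrative name used by the replicated state machine example later on this page, not the API of Paxos or Raft.

```python
from typing import Any, Dict, List

class MultiDecreeConsensus:
    """
    Repeated consensus: one single-decree instance per log index (sketch).
    """

    def __init__(self, node_id: str, peers: List[str], instance_factory):
        self.node_id = node_id
        self.peers = peers
        # instance_factory builds a fresh single-decree ConsensusProtocol
        self.instance_factory = instance_factory
        self.instances: Dict[int, Any] = {}

    def propose_for_index(self, index: int, value: Any) -> Any:
        """
        Run (or join) the consensus instance for one log index and
        return the value that instance decides.
        """
        if index not in self.instances:
            self.instances[index] = self.instance_factory(self.node_id, self.peers)
        return self.instances[index].propose(value)
```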
The Leader Optimization:
In repeated consensus, the same node typically leads consecutive instances. This optimization eliminates the need to run a full consensus round for each command:
- The leader establishes its authority once (a single prepare/election phase).
- While that leadership is stable, each command needs only one round trip to a quorum to be committed.
- Only when the leader fails or is suspected does the protocol pay the cost of electing a new one.
This is exactly what Raft formalizes as "normal operation" vs. "leader election."
```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# Minimal command record, defined here so the sketch is self-contained.
@dataclass
class Command:
    """A state-machine command; fields beyond 'type' depend on the command."""
    type: str
    key: Optional[str] = None
    value: Any = None
    expected: Any = None
    new_value: Any = None


class ReplicatedStateMachine:
    """
    State machine replication using consensus.

    Maintains a replicated log where each entry is decided by consensus.
    State machines apply log entries sequentially to maintain identical
    state across replicas.
    """

    def __init__(self, node_id: str, consensus: "ConsensusProtocol"):
        self.node_id = node_id
        # Expected to provide propose_for_index (see multi-decree sketch above)
        self.consensus = consensus

        # The replicated log - same on all replicas
        self.log: List[Command] = []
        self.last_applied_index = -1

        # The replicated state - same on all replicas after
        # applying the same log prefix
        self.state: Dict[str, Any] = {}

    def submit_command(self, command: Command) -> Any:
        """
        Submit a command for replicated execution.

        1. Propose command at next log index
        2. Wait for consensus
        3. Apply all committed entries
        4. Return result of this command
        """
        # Leader assigns log index
        log_index = len(self.log)

        # Reach consensus on this command at this index
        decided_command = self.consensus.propose_for_index(
            index=log_index,
            value=command
        )

        # Add to log
        self.log.append(decided_command)

        # Apply all unapplied entries
        result = None
        while self.last_applied_index < len(self.log) - 1:
            self.last_applied_index += 1
            cmd = self.log[self.last_applied_index]
            cmd_result = self._apply(cmd)
            if self.last_applied_index == log_index:
                result = cmd_result

        return result

    def _apply(self, command: Command) -> Any:
        """
        Apply a command to the state machine.

        Must be deterministic - same input always produces
        same state change and output.
        """
        if command.type == "SET":
            self.state[command.key] = command.value
            return "OK"
        elif command.type == "GET":
            return self.state.get(command.key)
        elif command.type == "DELETE":
            old = self.state.pop(command.key, None)
            return old
        elif command.type == "CAS":
            # Compare-and-swap
            current = self.state.get(command.key)
            if current == command.expected:
                self.state[command.key] = command.new_value
                return True
            return False

    def get_state_snapshot(self) -> Dict[str, Any]:
        """
        Get current state.

        State is a deterministic function of the log prefix
        up to last_applied_index.
        """
        return dict(self.state)

    def recover_from_log(self, log: List[Command]):
        """
        Recover state by replaying log.

        After a network partition heals, a node may receive
        committed entries it missed. Apply them in order.
        """
        for i, command in enumerate(log):
            if i > self.last_applied_index:
                # This entry not yet applied
                if i < len(self.log):
                    # Already in our log, just apply
                    self._apply(command)
                else:
                    # Missing from our log, add and apply
                    self.log.append(command)
                    self._apply(command)
                self.last_applied_index = i

        print(f"Recovery complete: applied {len(log)} entries")
```

SMR provides strong consistency: once a command is committed, all subsequent reads will see its effects. Combined with leader-based reads, this achieves linearizability—operations appear to occur atomically at some point between invocation and response. This is the strongest consistency model and is what distinguishes systems like etcd from eventually consistent stores like Cassandra.
Before consensus algorithms like Paxos became widespread, distributed databases used commit protocols for atomic transactions across nodes. Understanding these protocols illuminates why consensus is superior for many use cases.
Two-Phase Commit (2PC):
2PC ensures all participants either commit or abort a transaction atomically. A designated coordinator orchestrates the protocol:
Phase 1: Prepare (Voting)
- The coordinator sends a PREPARE request for the transaction to every participant.
- Each participant acquires the necessary locks, durably logs its prepared state, and votes COMMIT, or votes ABORT if it cannot guarantee the commit.
Phase 2: Commit/Abort
- If every participant voted COMMIT, the coordinator logs and broadcasts COMMIT; if any participant voted ABORT (or timed out), it logs and broadcasts ABORT.
- Participants apply or discard their prepared changes accordingly, release their locks, and acknowledge the decision.
```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Minimal exception and record types, defined here so the sketch is self-contained.
class NetworkTimeout(Exception): pass
class ParticipantUnavailable(Exception): pass
class LockConflict(Exception): pass
class ResourceUnavailable(Exception): pass

@dataclass
class LogRecord:
    state: str     # "PREPARE", "COMMIT", or "ABORT"
    txn_id: str


class TwoPhaseCommitCoordinator:
    """
    Coordinator for Two-Phase Commit protocol.
    Ensures atomic commit/abort across all participants.
    """

    def __init__(self, participants: List[str]):
        self.participants = participants
        self.transaction_log: List[LogRecord] = []

    def log(self, state: str, txn_id: str):
        """Durably record coordinator state before acting on it."""
        self.transaction_log.append(LogRecord(state, txn_id))

    def execute_transaction(self, transaction_id: str, operations: List) -> bool:
        """
        Execute distributed transaction with 2PC.
        Returns True if committed, False if aborted.
        """
        # Log transaction start
        self.log("PREPARE", transaction_id)

        # Phase 1: Prepare
        votes = self._collect_votes(transaction_id, operations)
        all_agreed = all(vote == "COMMIT" for vote in votes.values())

        # Phase 2: Commit or Abort
        if all_agreed:
            self.log("COMMIT", transaction_id)
            self._send_decision(transaction_id, "COMMIT")
            return True
        else:
            self.log("ABORT", transaction_id)
            self._send_decision(transaction_id, "ABORT")
            return False

    def _collect_votes(self, txn_id: str, operations: List) -> Dict[str, str]:
        """Send PREPARE to all participants and collect votes."""
        votes = {}
        for participant in self.participants:
            try:
                # Send prepare request with timeout
                response = self.send_prepare(
                    participant, txn_id, operations, timeout=5.0
                )
                votes[participant] = response
            except (NetworkTimeout, ParticipantUnavailable):
                # Treat timeout/failure as ABORT vote
                votes[participant] = "ABORT"
        return votes

    def _send_decision(self, txn_id: str, decision: str):
        """
        Send final decision to all participants.
        Retry until all acknowledge.
        """
        pending = set(self.participants)
        while pending:
            for participant in list(pending):
                try:
                    self.send_decision(participant, txn_id, decision)
                    pending.remove(participant)
                except NetworkTimeout:
                    # Will retry on next iteration
                    pass

    def recover_after_crash(self):
        """
        Recovery after coordinator crash.

        Read the transaction log to find incomplete transactions
        and resume from the last logged state of each transaction.
        """
        # The last record for each transaction determines its outcome.
        last_state: Dict[str, str] = {}
        for entry in self.transaction_log:
            last_state[entry.txn_id] = entry.state

        for txn_id, state in last_state.items():
            if state == "PREPARE":
                # Crashed before decision - abort
                self._send_decision(txn_id, "ABORT")
            elif state == "COMMIT":
                # Crashed after commit decision - resend COMMIT
                self._send_decision(txn_id, "COMMIT")
            elif state == "ABORT":
                # Crashed after abort decision - resend ABORT
                self._send_decision(txn_id, "ABORT")


class TwoPhaseCommitParticipant:
    """
    Participant in Two-Phase Commit.
    """

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.prepared_transactions: Dict[str, Any] = {}

    def on_prepare(self, txn_id: str, operations: List) -> str:
        """
        Handle PREPARE request.

        If we can commit, vote COMMIT and wait for decision.
        Otherwise, vote ABORT.
        """
        try:
            # Acquire necessary locks
            locks = self._acquire_locks(operations)

            # Write to durable log
            self._log_prepare(txn_id, operations, locks)

            # Store prepared state
            self.prepared_transactions[txn_id] = {
                "operations": operations,
                "locks": locks
            }
            return "COMMIT"
        except (LockConflict, ResourceUnavailable):
            return "ABORT"

    def on_decision(self, txn_id: str, decision: str):
        """Handle final COMMIT or ABORT decision."""
        if txn_id not in self.prepared_transactions:
            # Never prepared - ignore
            return

        txn = self.prepared_transactions[txn_id]

        if decision == "COMMIT":
            # Apply changes
            self._apply_operations(txn["operations"])
            self._log_commit(txn_id)
        else:
            # Discard changes
            self._log_abort(txn_id)

        # Release locks
        self._release_locks(txn["locks"])
        del self.prepared_transactions[txn_id]
```

The Blocking Problem:
2PC has a critical weakness: if the coordinator fails after sending PREPARE but before sending a decision, participants are blocked indefinitely. They've voted COMMIT and hold locks, but don't know whether to commit or abort.
Three-Phase Commit (3PC):
3PC attempts to solve the blocking problem by adding a PRE-COMMIT phase between voting and committing:
- Phase 1 (CanCommit): the coordinator asks participants whether they can commit; participants vote as in 2PC.
- Phase 2 (PreCommit): if all votes are COMMIT, the coordinator sends PRE-COMMIT and waits for acknowledgments; no participant commits yet.
- Phase 3 (DoCommit): once the acknowledgments arrive, the coordinator sends the final COMMIT and participants apply the transaction.
The key insight: any participant who received PRE-COMMIT knows all participants voted COMMIT, so a recovering coordinator can safely commit. Any participant who only saw PREPARE knows they can safely abort.
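A minimal coordinator-side sketch of the three phases, reusing the hypothetical exception type from the 2PC example above; the network helpers (`send_can_commit`, `send_pre_commit`, `send_decision`) are assumed to exist, mirroring that example, and the class name is illustrative rather than a standard API.

```python
from typing import List

class ThreePhaseCommitCoordinator:
    """
    Coordinator for Three-Phase Commit (sketch).

    Adds a PRE-COMMIT phase so that, after a coordinator crash, any
    participant that saw PRE-COMMIT knows everyone voted to commit.
    """

    def __init__(self, participants: List[str]):
        self.participants = participants

    def execute_transaction(self, txn_id: str, operations: List) -> bool:
        # Phase 1: CanCommit - collect votes
        for p in self.participants:
            try:
                if self.send_can_commit(p, txn_id, operations) != "YES":
                    self._broadcast(txn_id, "ABORT")
                    return False
            except NetworkTimeout:  # as defined in the 2PC sketch above
                self._broadcast(txn_id, "ABORT")
                return False

        # Phase 2: PreCommit - everyone voted YES; no one commits yet
        for p in self.participants:
            self.send_pre_commit(p, txn_id)   # participants acknowledge and wait

        # Phase 3: DoCommit - now safe to commit everywhere
        self._broadcast(txn_id, "COMMIT")
        return True

    def _broadcast(self, txn_id: str, decision: str):
        for p in self.participants:
            self.send_decision(p, txn_id, decision)
```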
3PC assumes bounded message delays (synchrony). Under network partitions, both 2PC and 3PC can violate safety or liveness. Modern distributed databases increasingly use consensus (Paxos, Raft) instead of 2PC/3PC, or use 2PC atop a consensus-based log for atomic commit across shards.
Quorums are the fundamental mechanism enabling consensus in the presence of failures. A quorum is a subset of nodes large enough to make authoritative decisions.
The Majority Quorum:
The simplest and most common quorum is a majority: (N/2) + 1 nodes out of N total. Majority quorums have a crucial property:
Any two majority quorums must overlap in at least one node.
This overlap property is what enables consensus:
- Any value accepted by one quorum is known to at least one member of every other quorum.
- A later round (or a new leader) that contacts a quorum is therefore guaranteed to learn about any value that may already have been decided, and can preserve it rather than contradict it.
Why Majority Works:
| Cluster Size N | Majority Quorum | Failures Tolerated | Example Node IDs |
|---|---|---|---|
| 3 nodes | 2 nodes | 1 failure | Any 2 of {A, B, C} |
| 5 nodes | 3 nodes | 2 failures | Any 3 of {A, B, C, D, E} |
| 7 nodes | 4 nodes | 3 failures | Any 4 of {A, B, C, D, E, F, G} |
| 2F+1 nodes | F+1 nodes | F failures | Any majority of cluster |
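The table's guarantee follows from simple counting: two subsets of size F+1 drawn from 2F+1 nodes must share at least one member. A quick brute-force check (illustrative only) confirms this for a 5-node cluster:

```python
from itertools import combinations

nodes = {"A", "B", "C", "D", "E"}       # N = 5
majority = len(nodes) // 2 + 1          # 3 nodes

# Every pair of majority quorums shares at least one node.
quorums = [set(q) for q in combinations(nodes, majority)]
assert all(q1 & q2 for q1 in quorums for q2 in quorums)

# Pigeonhole bound: two quorums of size 3 in a 5-node cluster
# overlap in at least 3 + 3 - 5 = 1 node.
print(f"{len(quorums)} majority quorums, all pairs intersect")
```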
Generalized Quorum Systems:
Majority quorums aren't the only option. A quorum system is any collection of subsets (quorums) where any two quorums intersect.
Examples:
Grid quorums: Arrange N nodes in a √N × √N grid. Any row + column forms a quorum. Requires only 2√N nodes, but any two such quorums still intersect.
Weighted quorums: Assign weights to nodes (reflecting reliability, geographic importance, etc.). A quorum is any set whose total weight exceeds half the total.
Hierarchical quorums: Organize nodes in a tree. Quorums are formed by combining quorums from subtrees.
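As a concrete illustration of the grid idea, here is a small, hypothetical sketch (the class and method names are illustrative): a quorum is one full row plus one full column, so any two quorums intersect where one quorum's row crosses the other quorum's column.

```python
from typing import List, Set

class GridQuorum:
    """
    Grid quorum system (sketch): nodes arranged in a k x k grid.
    A quorum is all nodes of one row plus all nodes of one column,
    roughly 2*sqrt(N) nodes instead of a majority of N.
    """

    def __init__(self, grid: List[List[str]]):
        self.grid = grid
        self.k = len(grid)

    def quorum(self, row: int, col: int) -> Set[str]:
        """The quorum formed by row `row` and column `col`."""
        row_nodes = set(self.grid[row])
        col_nodes = {self.grid[r][col] for r in range(self.k)}
        return row_nodes | col_nodes

    def is_quorum(self, node_set: Set[str]) -> bool:
        """True if the set contains at least one full row and one full column."""
        has_row = any(set(row).issubset(node_set) for row in self.grid)
        has_col = any(
            {self.grid[r][c] for r in range(self.k)}.issubset(node_set)
            for c in range(self.k)
        )
        return has_row and has_col

# 9 nodes in a 3x3 grid: each quorum has 5 nodes (3 + 3 - 1).
grid = GridQuorum([["A", "B", "C"], ["D", "E", "F"], ["G", "H", "I"]])
q1 = grid.quorum(0, 0)   # row 0 + column 0
q2 = grid.quorum(2, 2)   # row 2 + column 2
assert q1 & q2           # they intersect (at "C" and "G")
```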
The Quorum Trade-off:
Smaller quorums mean fewer nodes to contact per operation, so lower latency and less load. But the intersection requirement pushes the cost elsewhere: the complementary quorum must grow (as in the R + W > N rule in the code below), and fewer or more specific failure patterns can be tolerated before no quorum can form.
```python
from itertools import combinations
from typing import Dict, List, Set

class QuorumSystem:
    """
    Abstract quorum system for consensus operations.

    A quorum system is a collection of subsets of nodes (quorums)
    such that any two quorums in the system intersect.
    """

    def __init__(self, nodes: List[str]):
        self.nodes = nodes
        self.n = len(nodes)

    def is_quorum(self, node_set: Set[str]) -> bool:
        """Check if a set of nodes forms a valid quorum."""
        raise NotImplementedError

    def _enumerate_quorums(self) -> List[Set[str]]:
        """
        Enumerate all quorums by brute force (exponential in N;
        suitable only for small clusters / illustration).
        """
        quorums = []
        for size in range(1, self.n + 1):
            for subset in combinations(self.nodes, size):
                if self.is_quorum(set(subset)):
                    quorums.append(set(subset))
        return quorums

    def check_intersection_property(self) -> bool:
        """
        Verify that any two quorums intersect.
        This is the fundamental safety property.
        """
        all_quorums = self._enumerate_quorums()
        for q1 in all_quorums:
            for q2 in all_quorums:
                if not q1.intersection(q2):
                    return False
        return True


class MajorityQuorum(QuorumSystem):
    """
    Simple majority quorum: any set of more than N/2 nodes.
    """

    def is_quorum(self, node_set: Set[str]) -> bool:
        valid_nodes = node_set.intersection(self.nodes)
        return len(valid_nodes) > self.n // 2

    def required_for_quorum(self) -> int:
        """Minimum nodes needed for a quorum."""
        return self.n // 2 + 1

    def failures_tolerated(self) -> int:
        """Maximum failures while still forming quorums."""
        return self.n - self.required_for_quorum()


class WeightedQuorum(QuorumSystem):
    """
    Weighted quorum where nodes have different voting power.

    Useful when some nodes are more reliable or important.
    A quorum requires votes exceeding half the total weight.
    """

    def __init__(self, node_weights: Dict[str, int]):
        self.node_weights = node_weights
        self.nodes = list(node_weights.keys())
        self.n = len(self.nodes)
        self.total_weight = sum(node_weights.values())
        self.quorum_threshold = self.total_weight // 2 + 1

    def is_quorum(self, node_set: Set[str]) -> bool:
        weight = sum(
            self.node_weights.get(node, 0)
            for node in node_set
        )
        return weight >= self.quorum_threshold

    def node_criticality(self, node: str) -> float:
        """
        How critical is this node for forming quorums?
        Higher weight = more critical.
        """
        return self.node_weights.get(node, 0) / self.total_weight


class ReadWriteQuorum(QuorumSystem):
    """
    Separate read and write quorums.

    Allows tuning read vs write availability/latency.
    Must satisfy: R + W > N (read and write quorums intersect)
    """

    def __init__(self, nodes: List[str], read_quorum: int, write_quorum: int):
        super().__init__(nodes)
        self.read_quorum = read_quorum
        self.write_quorum = write_quorum

        # Verify intersection property
        if read_quorum + write_quorum <= self.n:
            raise ValueError(
                f"Read ({read_quorum}) + Write ({write_quorum}) "
                f"must be > N ({self.n})"
            )

    def is_read_quorum(self, node_set: Set[str]) -> bool:
        return len(node_set.intersection(self.nodes)) >= self.read_quorum

    def is_write_quorum(self, node_set: Set[str]) -> bool:
        return len(node_set.intersection(self.nodes)) >= self.write_quorum


# Example usage: Tuning quorums for different workloads

# Read-heavy workload: R=1, W=N
# Fast reads from any node, slow writes to all
read_heavy = ReadWriteQuorum(
    nodes=["A", "B", "C", "D", "E"],
    read_quorum=1,    # Read from any node
    write_quorum=5    # Write to all nodes
)

# Write-heavy workload: R=N, W=1
# Fast writes to any node, slow reads from all
write_heavy = ReadWriteQuorum(
    nodes=["A", "B", "C", "D", "E"],
    read_quorum=5,    # Read from all nodes
    write_quorum=1    # Write to any node
)

# Balanced: R=W=3
balanced = ReadWriteQuorum(
    nodes=["A", "B", "C", "D", "E"],
    read_quorum=3,
    write_quorum=3
)
```

Most production systems use simple majority quorums with odd cluster sizes (3, 5, 7 nodes). This provides a good balance of fault tolerance and operational simplicity.
Larger clusters increase fault tolerance but also increase the probability that at least one node is unavailable at any time—cluster size should match expected failure rates.
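To illustrate how cluster size interacts with per-node reliability, here is a small, illustrative calculation: the probability that a majority quorum is available, assuming (purely for the example) that each node is independently up with probability p.

```python
from math import comb

def quorum_availability(n: int, p: float) -> float:
    """P(at least a majority of n nodes is up), nodes independently up with prob p."""
    majority = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(majority, n + 1))

for n in (3, 5, 7):
    print(n, round(quorum_availability(n, 0.99), 6))
# With p = 0.99, larger clusters push quorum availability higher,
# even though the chance that *some* node is down also grows with n.
```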
Different applications have different requirements, leading to various consensus variants that trade off properties against each other.
Single-Decree vs. Multi-Decree:
A single-decree protocol decides exactly one value. Multi-decree (repeated) consensus decides a sequence of values, one per log position, which is what state machine replication requires.
Crash-Fault vs. Byzantine Consensus:
Crash-fault-tolerant protocols assume faulty nodes simply stop and never lie; they tolerate F failures with 2F+1 nodes. Byzantine fault-tolerant protocols tolerate nodes that behave arbitrarily, including maliciously, and require 3F+1 nodes.
Leader-Based vs. Leaderless:
Leader-based protocols funnel all proposals through a distinguished node, which simplifies ordering and is fast under stable leadership. Leaderless protocols let any node propose directly, avoiding the leader bottleneck at the cost of more complex conflict handling.
| Protocol | Fault Model | Nodes Needed | Leader | Notable Features |
|---|---|---|---|---|
| Paxos | Crash-fault | 2F+1 | Yes (Multi-Paxos) | Foundational, complex |
| Raft | Crash-fault | 2F+1 | Yes | Understandable, widely implemented |
| ZAB | Crash-fault | 2F+1 | Yes | ZooKeeper's protocol |
| PBFT | Byzantine | 3F+1 | Rotating | First practical BFT |
| HotStuff | Byzantine | 3F+1 | Rotating | Linear communication |
| Tendermint | Byzantine | 3F+1 | Rotating | Blockchain consensus |
Performance Trade-offs:
1. Latency vs. Throughput:
Committing each command individually minimizes latency, while batching many commands into one consensus instance (or pipelining instances) raises throughput at the cost of some added per-command delay (see the batching sketch after this list).
2. Read Consistency vs. Availability:
Linearizable reads must go through the leader or a read quorum, adding latency; serving reads from any local replica is faster and more available but may return stale data.
3. Durability vs. Performance:
Flushing every log entry to stable storage before acknowledging protects against crashes but limits throughput; relaxing or batching flushes is faster but risks losing recently acknowledged entries on power loss.
4. Geographic Distribution:
Spreading replicas across regions survives data-center outages, but every commit then pays at least one wide-area round trip to reach a quorum, raising latency substantially.
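A minimal sketch of the batching idea from trade-off 1, assuming the `MultiDecreeConsensus` wrapper sketched earlier (all names here are illustrative): commands are buffered briefly and proposed as a single log entry, trading a little latency for fewer consensus rounds.

```python
import time
from typing import Any, List

class BatchingProposer:
    """
    Buffer commands and propose them as one batch per log index (sketch).
    Larger batches mean fewer consensus rounds and higher throughput,
    but each command may wait up to max_delay before being proposed.
    """

    def __init__(self, consensus, max_batch: int = 64, max_delay: float = 0.005):
        self.consensus = consensus            # e.g. a MultiDecreeConsensus instance
        self.max_batch = max_batch
        self.max_delay = max_delay
        self.buffer: List[Any] = []
        self.next_index = 0
        self.oldest = None                    # arrival time of oldest buffered command

    def submit(self, command: Any) -> None:
        """Queue a command; flush when the batch is full or old enough."""
        if not self.buffer:
            self.oldest = time.monotonic()
        self.buffer.append(command)
        if (len(self.buffer) >= self.max_batch or
                time.monotonic() - self.oldest >= self.max_delay):
            self.flush()

    def flush(self) -> None:
        """Propose the whole buffer as a single log entry."""
        if not self.buffer:
            return
        batch, self.buffer = self.buffer, []
        self.consensus.propose_for_index(index=self.next_index, value=tuple(batch))
        self.next_index += 1
```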
Consensus protocols inherently prioritize Consistency over Availability during partitions. A minority partition cannot form a quorum and thus cannot make progress. This is a deliberate choice: it's better to be unavailable than to allow conflicting decisions. Systems that prefer availability (like Cassandra) sacrifice consistency by avoiding consensus for normal operations.
Consensus implementations are notoriously difficult to get right. Subtle bugs can lie dormant for years before manifesting under specific failure conditions. Here are key principles for safe implementation:
1. Persistence Before Response:
Before a node sends any response that could influence consensus (votes, acknowledgments, decisions), it must durably persist the information that response represents. Otherwise, a restart could cause the node to "forget" its vote and vote differently.
2. State Machine Safety:
The state machine must be purely deterministic. The same sequence of commands applied in the same order must always produce the same state—regardless of wall-clock time, thread scheduling, or any external factors.
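A common pitfall, shown in a brief illustrative sketch: commands that consult wall-clock time or randomness would diverge across replicas, so the leader must resolve any non-determinism before the command is replicated.

```python
import time
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class SetCommand:
    key: str
    value: Any

# WRONG: each replica would record its own local time, so replica states diverge.
def apply_wrong(state: Dict[str, Any], key: str) -> None:
    state[key] = time.time()

# RIGHT: the leader evaluates the non-deterministic part once, then replicates
# a fully concrete command that every replica applies identically.
def make_command(key: str) -> SetCommand:
    return SetCommand(key=key, value=time.time())   # resolved before replication

def apply(state: Dict[str, Any], cmd: SetCommand) -> None:
    state[cmd.key] = cmd.value                       # deterministic
```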
3. Epoch/Term Ordering:
Every message must be tagged with an epoch/term number. Nodes must reject messages from stale epochs to prevent interference from zombie leaders or delayed messages.
4. Log Consistency:
Once a log entry is committed (decided by consensus), it can never be changed or removed. The log is append-only at each position up to the commit point.
```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# Lightweight record and exception types, defined here so the sketch is self-contained.
@dataclass
class LogEntry:
    term: int
    command: Any

@dataclass
class VoteResponse:
    term: int
    vote_granted: bool

@dataclass
class AppendResponse:
    term: int
    success: bool

class SafetyViolation(Exception): pass


class SafeConsensusNode:
    """
    Consensus node with safety invariants enforced.

    Key invariants:
    1. Persistence before response
    2. Epoch ordering enforced
    3. Committed entries never changed
    4. Deterministic state machine
    """

    def __init__(self, node_id: str, storage: "PersistentStorage", state_machine):
        self.node_id = node_id
        self.storage = storage                # assumed durable key-value store
        self.state_machine = state_machine    # assumed deterministic apply(command)

        # Persisted state - recovered from storage on restart
        self.current_term: int = self.storage.get("current_term", 0)
        self.voted_for: Optional[str] = self.storage.get("voted_for", None)
        self.log: List[LogEntry] = self.storage.get("log", [])

        # Volatile state - reconstructed on restart
        self.commit_index: int = -1
        self.last_applied: int = -1
        self.results: Dict[int, Any] = {}
        self.role: str = "FOLLOWER"

    def handle_vote_request(self, term: int, candidate_id: str,
                            last_log_index: int, last_log_term: int) -> VoteResponse:
        """
        Handle RequestVote RPC.

        SAFETY: Must persist voted_for before responding.
        """
        # Epoch ordering: reject if term < current_term
        if term < self.current_term:
            return VoteResponse(term=self.current_term, vote_granted=False)

        # Update term if newer
        if term > self.current_term:
            self._step_down(term)

        # Check if we can vote for this candidate
        can_vote = (
            self.voted_for is None or
            self.voted_for == candidate_id
        )

        # Check if candidate's log is at least as up-to-date as ours
        log_ok = self._log_is_current(last_log_index, last_log_term)

        if can_vote and log_ok:
            # SAFETY: Persist before responding
            self.voted_for = candidate_id
            self.storage.put("voted_for", candidate_id)
            self.storage.flush()  # Ensure durable
            return VoteResponse(term=self.current_term, vote_granted=True)
        else:
            return VoteResponse(term=self.current_term, vote_granted=False)

    def handle_append_entries(self, term: int, leader_id: str,
                              prev_log_index: int, prev_log_term: int,
                              entries: List[LogEntry], leader_commit: int):
        """
        Handle AppendEntries RPC.

        SAFETY: Must persist new entries before responding.
        """
        # Epoch ordering
        if term < self.current_term:
            return AppendResponse(term=self.current_term, success=False)
        if term > self.current_term:
            self._step_down(term)

        # Check log consistency at prev_log_index
        if not self._log_matches(prev_log_index, prev_log_term):
            return AppendResponse(term=self.current_term, success=False)

        # Append new entries
        for i, entry in enumerate(entries):
            log_index = prev_log_index + 1 + i
            if log_index < len(self.log):
                if self.log[log_index].term != entry.term:
                    # Conflict: truncate log from here
                    # SAFETY: Only truncate uncommitted entries
                    if log_index <= self.commit_index:
                        raise SafetyViolation(
                            "Cannot truncate committed entry"
                        )
                    self.log = self.log[:log_index]
            if log_index >= len(self.log):
                self.log.append(entry)

        # SAFETY: Persist before responding
        self.storage.put("log", self.log)
        self.storage.flush()

        # Update commit index
        if leader_commit > self.commit_index:
            self.commit_index = min(leader_commit, len(self.log) - 1)
            self._apply_committed_entries()

        return AppendResponse(term=self.current_term, success=True)

    def _apply_committed_entries(self):
        """
        Apply committed entries to state machine.

        SAFETY: State machine must be deterministic.
        """
        while self.last_applied < self.commit_index:
            self.last_applied += 1
            entry = self.log[self.last_applied]

            # Apply deterministically
            result = self.state_machine.apply(entry.command)

            # Store result (indexed by log position for idempotency)
            self.results[self.last_applied] = result

    # Helper checks below follow the standard Raft rules; they are added here
    # so the sketch is self-contained.
    def _log_is_current(self, last_log_index: int, last_log_term: int) -> bool:
        """Candidate's log is at least as up-to-date as ours (compare last term, then length)."""
        if not self.log:
            return True
        our_last_term = self.log[-1].term
        our_last_index = len(self.log) - 1
        if last_log_term != our_last_term:
            return last_log_term > our_last_term
        return last_log_index >= our_last_index

    def _log_matches(self, prev_log_index: int, prev_log_term: int) -> bool:
        """True if our log contains an entry at prev_log_index with term prev_log_term."""
        if prev_log_index < 0:
            return True
        return (prev_log_index < len(self.log) and
                self.log[prev_log_index].term == prev_log_term)

    def _step_down(self, new_term: int):
        """
        Step down when we see a higher term.

        SAFETY: Persist new term before any other action.
        """
        self.current_term = new_term
        self.voted_for = None
        self.storage.put("current_term", new_term)
        self.storage.put("voted_for", None)
        self.storage.flush()
        self.role = "FOLLOWER"
```

We've now established the theoretical and practical foundations of consensus: its formal safety and liveness properties, the FLP impossibility result and the mechanisms that work around it, quorum intersection, state machine replication, and the invariants a safe implementation must enforce.
What's next:
With the theoretical foundations in place, we'll dive into Paxos—the first consensus algorithm proven correct and the inspiration for all subsequent protocols. Though famously difficult to understand, Paxos embodies the essential insights that make consensus work.
You now understand the theoretical foundations of consensus—the formal properties, the FLP impossibility result that constrains all solutions, and how quorums and state machine replication build upon consensus. Next, we'll see how Paxos concretizes these ideas into the first practical consensus algorithm.