Paxos has a reputation for being notoriously difficult to implement correctly. This reputation is well-deserved. Google engineers working on Chubby reported spending years getting their Paxos implementation right. The creators of Raft explicitly designed their algorithm to be "understandable" in direct response to Paxos's complexity.
But what exactly makes Paxos so hard? Is it the algorithm itself, or something else? The truth is nuanced: the core algorithm is elegant and relatively simple, but the gap between the theoretical algorithm and a production-quality implementation is vast.
Understanding this gap is crucial for anyone who works with distributed systems, whether you're implementing consensus yourself or using systems built on Paxos-like protocols.
By the end of this page, you will understand the specific categories of challenges that make Paxos hard, the gap between theory and practice, common bugs found in real implementations, why alternative algorithms like Raft emerged, and lessons for building reliable distributed systems.
Lamport's Paxos papers describe the algorithm at a high level of abstraction. They specify what the algorithm must do, but leave many "how" decisions to the implementer. This is typical for theoretical papers but creates a significant implementation burden.
What the Papers Specify:
- The two-phase protocol: the Prepare/Promise and Accept/Accepted message exchanges
- The rules acceptors must follow (never accept a proposal numbered below a promise already made)
- The adoption rule: a proposer must adopt the highest-numbered value reported in Phase 1
- The safety argument: once a value is chosen, no different value can ever be chosen
What the Papers Leave Unspecified:
| Aspect | What's Missing | Why It Matters |
|---|---|---|
| Leader election | How to choose a leader initially or after failure | Protocol can livelock without stable leadership |
| Failure detection | How to detect leader/acceptor crashes | Too aggressive = thrashing; too slow = long unavailability |
| Log compaction | How to discard old entries safely | Log grows unbounded without compaction |
| Reconfiguration | How to add/remove nodes from the cluster | Static membership is impractical for real systems |
| Client interaction | How clients submit commands and receive responses | Duplicate commands, retry logic, session semantics |
| Persistence format | How to store state on disk efficiently | Performance and correctness depend on implementation |
| Catch-up protocol | How lagging replicas get up to date | Slow replicas may never fully sync |
The Paxos algorithm as described in papers is the tip of the iceberg. Roughly 90% of the code in a production Paxos implementation deals with aspects not covered by the algorithm specification: serialization, networking, persistence, monitoring, reconfiguration, optimization, and error handling.
Google's Chubby Experience:
In their paper "Paxos Made Live" (2007), Google engineers documented their experience implementing Paxos for the Chubby lock service:
"There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system... the algorithm must be supplemented with additional mechanisms in order to provide a complete solution."
They reported that the core Paxos algorithm was "only a small part" of the final codebase, with most of the effort going into handling disk corruption, master leases, group membership changes, snapshots, and building the testing framework needed to gain confidence in the result.
One of the most significant barriers to correct Paxos implementation is simply understanding the algorithm deeply enough to implement it correctly.
The Symmetry Problem:
Basic Paxos is inherently symmetric — any node can be a proposer, and there's no distinguished leader. While elegant theoretically, this symmetry makes reasoning about the system harder: several proposers may be running rounds concurrently, no single node's state summarizes what the system has decided, and every node must implement and handle every role's edge cases.
The Original Presentation Didn't Help:
Lamport's original paper used a Greek parliament allegory that many found confusing rather than clarifying. When he published "Paxos Made Simple" in 2001, he famously wrote:
"The Paxos algorithm, when presented in plain English, is very simple."
Yet even "Paxos Made Simple" is only 14 pages and leaves many practical concerns unaddressed. The simplicity is in the algorithm's invariants, not in the implementation path.
The Raft paper explicitly states that understandability was the primary design goal. The authors conducted user studies to measure understandability. Their thesis: a protocol that engineers can fully understand is one they can implement correctly. Raft makes specific design choices (strong leader, sequential log) that sacrifice some generality for clarity.
Correct persistence is one of the most error-prone aspects of Paxos implementation. The algorithm's safety depends on acceptors never forgetting their promises, but achieving this in practice is surprisingly hard.
The fsync Problem:
```python
import json
import os

# BUGGY: This looks correct but isn't safe!
def handle_accept_BUGGY(self, proposal_number, value):
    if self._can_accept(proposal_number):
        self.accepted = (proposal_number, value)
        # Write to file
        with open(self.state_file, 'w') as f:
            f.write(json.dumps(self.accepted))
        # BUG: No fsync! Data may be in OS buffers, not on disk
        return AcceptedResponse(proposal_number, value)

# CORRECT: Ensure data reaches stable storage
def handle_accept_CORRECT(self, proposal_number, value):
    if self._can_accept(proposal_number):
        self.accepted = (proposal_number, value)
        # Write to file with explicit sync
        with open(self.state_file, 'w') as f:
            f.write(json.dumps(self.accepted))
            f.flush()               # Flush Python buffers
            os.fsync(f.fileno())    # Flush OS buffers to disk
        # Also sync the directory so the file metadata is durable
        dir_fd = os.open(os.path.dirname(self.state_file), os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
        return AcceptedResponse(proposal_number, value)
```
Without proper fsync, data can appear to be written but actually still be sitting in OS buffers. A power failure at that moment loses the data. An acceptor that restarts after the power failure has no record of its promise, potentially violating safety. This bug has caused real outages at multiple companies.
Additional Persistence Challenges:
- Torn or partial writes: a crash in the middle of a write can leave the state file corrupted unless updates are atomic (see the sketch below)
- Hardware and filesystems that acknowledge fsync before data is truly durable (disk write caches, misconfigured mounts)
- Detecting silent corruption of stored state, typically with checksums, so a damaged acceptor does not rejoin with bogus promises
- Keeping persistence off the hot path without weakening durability (batched fsyncs, write-ahead logging)
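The torn-write problem in particular has a standard mitigation: write the new state to a temporary file, fsync it, and atomically rename it over the old file. The following is a minimal sketch of that pattern under the same JSON-on-disk assumption as the code above; the function name and error handling are illustrative, not taken from any particular implementation.

```python
import json
import os
import tempfile

def atomic_write_state(path, state):
    """Sketch: persist acceptor state so a crash leaves either the old
    file or the new one intact, never a torn mixture."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'w') as f:
            f.write(json.dumps(state))
            f.flush()
            os.fsync(f.fileno())        # data reaches the disk
        os.replace(tmp_path, path)      # atomic rename over the old file
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)            # make the rename itself durable
        finally:
            os.close(dir_fd)
    except BaseException:
        try:
            os.unlink(tmp_path)         # best-effort cleanup on failure
        except OSError:
            pass
        raise
```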
Real-World Incident:
In one documented incident, a distributed database using Paxos suffered data loss rooted in exactly this kind of persistence failure: state that looked durably written had never actually reached disk.
The fix required not just code changes but extensive storage-layer verification.
The Paxos protocol has numerous edge cases that are easy to miss. Each edge case can lead to subtle bugs that only manifest under rare conditions.
Edge Case 1: Re-proposing After Failed Accept
```python
# SCENARIO: Our Accept was rejected, now we retry

# WRONG approach:
async def retry_proposal_WRONG(self, value):
    proposal_number = self.generate_higher_proposal_number()
    # BUG: Skipping Phase 1 because "we know our value"
    # But another proposer may have succeeded in the meantime!
    await self.phase2_accept(proposal_number, value)

# CORRECT approach:
async def retry_proposal_CORRECT(self, value):
    proposal_number = self.generate_higher_proposal_number()
    # MUST run Phase 1 to discover any newly-accepted values
    phase1_result = await self.phase1_prepare(proposal_number)
    if phase1_result.discovered_value is not None:
        # Adoption rule: use the discovered value!
        value = phase1_result.discovered_value
    await self.phase2_accept(proposal_number, value)
```
Edge Case 2: Promise Number vs Accept Number
A subtle bug occurs when implementations confuse the prepare condition with the accept condition:
| Operation | Correct Condition | Common Bug | Result of Bug |
|---|---|---|---|
| Handle Prepare(n) | Promise if n > promised | Using n >= promised | Allows duplicate proposal numbers, potential safety violation |
| Handle Accept(n, v) | Accept if n >= promised | Using n > promised | Rejects proposer's own accept after their prepare |
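To make the two conditions concrete, here is a minimal sketch of an acceptor's checks. Persistence is omitted, and the field and return-value names are illustrative assumptions, not a prescribed interface.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1        # highest proposal number promised so far
        self.accepted = None      # (proposal_number, value) or None

    def handle_prepare(self, n):
        # Promise only if n is strictly greater than anything promised.
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)   # report any previously accepted value
        return ("nack", self.promised)

    def handle_accept(self, n, value):
        # Accept if n is at least the promised number (>=, not >); otherwise
        # the acceptor would reject the proposer's own Accept that follows
        # its successful Prepare.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return ("accepted", n)
        return ("nack", self.promised)
```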
Edge Case 3: Handling of No-Responses
What happens when an acceptor doesn't respond at all (network timeout)? The handling is subtle: silence is neither a promise nor a rejection. The proposer must count only explicit positive responses toward its quorum, must not assume a quiet acceptor still honors an old promise, and has to decide whether to keep waiting, retry the same proposal number, or start over with a higher one, as the sketch below illustrates.
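A hedged sketch of the proposer side, assuming asyncio-based acceptor stubs whose prepare() coroutine returns an object with a granted flag; both the stub interface and the timeout value are assumptions for illustration.

```python
import asyncio

async def gather_promises(acceptors, proposal_number, timeout=0.5):
    """Count only explicit Promise replies toward the quorum.
    A timed-out acceptor is neither a promise nor a rejection."""
    quorum = len(acceptors) // 2 + 1
    tasks = [asyncio.create_task(a.prepare(proposal_number)) for a in acceptors]
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for t in pending:
        t.cancel()                      # silence is not consent
    promises = [t.result() for t in done
                if t.exception() is None and t.result().granted]
    if len(promises) >= quorum:
        return promises
    return None                         # caller decides: wait longer, retry, or back off
```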
Edge Case 4: Concurrent Leadership Claims
Two nodes may believe they are the leader simultaneously. The protocol handles this safely, but implementations often add unsafe optimizations:
```python
# UNSAFE optimization that breaks safety:
class Leader:
    def accept_command(self, command):
        if self.is_leader:  # BUG: Stale leadership check
            # Skip Phase 1, assume we can accept
            # But we might have been preempted!
            slot = self.next_slot
            self.next_slot += 1
            self.phase2_accept(self.proposal_number, slot, command)

# SAFE approach:
class Leader:
    def accept_command(self, command):
        if self.is_leader:
            slot = self.next_slot
            self.next_slot += 1
            try:
                result = self.phase2_accept(self.proposal_number, slot, command)
            except RejectedError:
                # We've been preempted! Step down.
                self.is_leader = False
                self.proposal_number = None
                raise NotLeaderAnymore()
```
While Paxos guarantees safety unconditionally, it only guarantees liveness under "reasonable" conditions. Without careful design, Paxos can get stuck in pathological states.
Livelock: Dueling Proposers
The classic livelock scenario: proposer A completes Phase 1 with proposal number n1; before A finishes Phase 2, proposer B completes Phase 1 with a higher number n2, causing the acceptors to reject A's Accept messages; A retries with n3 > n2, which in turn causes B's Accepts to be rejected; the two proposers keep preempting each other indefinitely and no value is ever chosen.
Preventing Livelock:
Production systems use multiple techniques to prevent livelock: randomized backoff before retrying a preempted proposal (see the sketch below), electing a single distinguished proposer (the stable leader of Multi-Paxos) so only one node normally runs rounds, and leader leases so followers do not launch competing proposals while a lease is still valid.
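A hedged sketch of randomized exponential backoff on the proposer side. The run_round() helper, the exception types, and the delay constants are illustrative assumptions, not part of any standard API.

```python
import asyncio
import random

class Preempted(Exception):
    """Hypothetical: raised when a round is rejected by a higher proposal number."""

class ProposalFailed(Exception):
    pass

async def propose_with_backoff(proposer, value, max_attempts=8):
    """Back off for a random, growing interval after each preempted round
    so that duelling proposers eventually stop colliding."""
    delay = 0.05                                     # initial backoff (arbitrary)
    for _ in range(max_attempts):
        try:
            return await proposer.run_round(value)   # hypothetical full Paxos round
        except Preempted:
            await asyncio.sleep(random.uniform(0, delay))
            delay = min(delay * 2, 2.0)              # exponential growth, capped
    raise ProposalFailed(f"gave up after {max_attempts} attempts")
```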
Performance Anti-Patterns:
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Synchronous disk writes on hot path | Each accept waits for fsync | Batch multiple accepts into one fsync; use write-ahead logging |
| Sequential slot processing | One slot at a time limits throughput | Pipeline multiple slots; parallel Phase 2 for different slots |
| Large messages | Sending full values in every message | Separate data plane from control plane; use value IDs |
| Frequent leader elections | Each election costs a Phase 1 round | Tune failure detection; use leases; graceful handoffs |
| No batching | Each command is a separate consensus round | Batch multiple commands into one log entry |
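To make the batching row concrete, here is a hedged sketch of one way to fold commands that queued up during the previous round into a single log entry. The propose_entry() method, the queue-draining policy, and the batch size are assumptions, not a prescribed design.

```python
import asyncio

class BatchingProposer:
    """Sketch: drain whatever queued up while the previous consensus round
    was in flight and propose the whole batch as one log entry."""
    def __init__(self, proposer, max_batch=64):
        self.proposer = proposer          # assumed Multi-Paxos proposer
        self.queue = asyncio.Queue()
        self.max_batch = max_batch

    async def submit(self, command):
        await self.queue.put(command)

    async def run(self):
        while True:
            batch = [await self.queue.get()]          # wait for the first command
            while len(batch) < self.max_batch and not self.queue.empty():
                batch.append(self.queue.get_nowait()) # drain without blocking
            # One consensus round (and one fsync) now covers the whole batch.
            await self.proposer.propose_entry(batch)
```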
Most Paxos performance optimizations assume a 'steady state' with stable leadership. The cost of leadership changes is amortized over many entries. If leadership is unstable (frequent changes), performance degrades significantly. Tune failure detection carefully.
Perhaps the most challenging aspect of Paxos is verifying correctness. Bugs often only manifest under rare failure conditions that are nearly impossible to encounter during normal testing.
Why Standard Testing Fails: unit and integration tests exercise the happy path, but consensus bugs hide in rare interleavings of message loss, reordering, crashes, and restarts; the space of possible executions is astronomically large; many triggering failures (partial partitions, disks that lie about durability, clock jumps) are hard to reproduce on demand; and a buggy implementation can run correctly in production for years before the fatal sequence of events finally occurs.
Verification Techniques for Consensus:
| Technique | Description | Pros | Cons |
|---|---|---|---|
| Model Checking | TLA+ or similar to exhaustively check state space | Finds subtle concurrency bugs; proves properties | Requires specification expertise; state explosion for large systems |
| Jepsen Testing | Black-box chaos testing with failure injection | Tests real implementation; finds real bugs | Cannot prove absence of bugs; coverage is random |
| Formal Verification | Mathematical proof of implementation correctness | Strongest guarantees | Extremely labor-intensive; few teams can do this |
| Simulation Testing | Deterministic replay of distributed executions | Reproducible; fast iteration | Requires significant infrastructure investment |
| Property-Based Testing | Generate random test cases checking invariants | Good coverage with modest effort | May miss adversarial scenarios |
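As a small taste of the property-based row above, the sketch below uses the hypothesis library to check one invariant Paxos relies on: any two majority quorums intersect. It tests a helper function under randomized inputs rather than a full protocol implementation, and the test name is illustrative.

```python
from hypothesis import given, strategies as st

def majority(n):
    return n // 2 + 1

@given(st.integers(min_value=1, max_value=15), st.data())
def test_any_two_quorums_intersect(n, data):
    """Property: any two majority quorums over the same node set share a node,
    which is what lets a new proposer learn about previously chosen values."""
    nodes = list(range(n))
    q1 = set(data.draw(st.lists(st.sampled_from(nodes), min_size=majority(n),
                                max_size=n, unique=True)))
    q2 = set(data.draw(st.lists(st.sampled_from(nodes), min_size=majority(n),
                                max_size=n, unique=True)))
    assert q1 & q2, "two majority quorums must overlap"
```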
Lamport himself advocates for TLA+ for specifying and verifying distributed algorithms. His TLA+ specification of Paxos is a definitive reference. Amazon has used TLA+ to find bugs in their internal systems. For production Paxos, a TLA+ specification is highly recommended.
FoundationDB's Simulation Testing:
FoundationDB pioneered a technique called deterministic simulation, where the entire cluster, including its network, disks, and clocks, runs inside a single process; every source of nondeterminism is drawn from a pseudo-random generator initialized with one seed; failures such as crashes, partitions, and slow disks are injected aggressively; and any failing run can be replayed exactly by reusing its seed.
This approach found bugs that years of production use never surfaced. Many modern consensus implementations now use similar techniques.
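This is not FoundationDB's actual framework, but a minimal sketch of the core idea under the assumption of a simple event-at-a-time simulation loop: every delay, drop, and reordering comes from one seeded PRNG, so identical seeds give identical runs. The class and parameters are illustrative.

```python
import random

class SimulatedNetwork:
    """Sketch of seeded, deterministic message delivery for simulation testing."""
    def __init__(self, seed, drop_rate=0.05):
        self.rng = random.Random(seed)   # the single source of nondeterminism
        self.drop_rate = drop_rate
        self.in_flight = []              # (deliver_at, dest, message)
        self.now = 0.0

    def send(self, dest, message):
        if self.rng.random() < self.drop_rate:
            return                       # injected message loss
        delay = self.rng.expovariate(1 / 0.01)   # injected latency
        self.in_flight.append((self.now + delay, dest, message))

    def step(self):
        """Deliver the next message; replaying the same seed replays the run."""
        if not self.in_flight:
            return None
        self.in_flight.sort(key=lambda m: m[0])
        self.now, dest, message = self.in_flight.pop(0)
        return dest, message
```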
Learning from bugs found in real implementations is invaluable. Here are documented bugs from production Paxos systems:
Bug Category 1: State Corruption on Recovery
A bug in ZooKeeper (which uses ZAB, a Paxos variant) caused corruption when a leader crashed and recovered. The leader would replay its log but skip entries that were already committed, leading to divergent replica state. Fixed by ensuring complete log replay on recovery.
Bug Category 2: Incorrect Quorum Calculations
```python
# BUG 1: Off-by-one / type error in majority calculation
def majority_WRONG(n):
    return n / 2 + 1   # For n=5 this is 3.5, a float; later rounding or
                       # comparison against it can make the quorum check
                       # stricter or looser than intended

def majority_CORRECT(n):
    return n // 2 + 1  # For n=5, returns 3

# BUG 2: Changing cluster size without updating quorum
class Cluster:
    def __init__(self, nodes):
        self.nodes = nodes
        self.majority = len(nodes) // 2 + 1

    def add_node(self, node):
        self.nodes.append(node)
        # BUG: forgot to update self.majority!
        # Going from 5 to 6 nodes, majority is still 3, but should be 4

# BUG 3: Counting the same node multiple times
def check_quorum_WRONG(responses, majority):
    return len(responses) >= majority   # Might count duplicates!

def check_quorum_CORRECT(responses, majority):
    unique_respondents = set(r.node_id for r in responses)
    return len(unique_respondents) >= majority
```
Bug Category 3: Message Handling Edge Cases
Delayed, duplicated, or reordered messages from earlier rounds are another rich source of bugs: an implementation must tolerate receiving the same Promise or Accepted twice, and must not act on responses that belong to a proposal number it has already abandoned.
Bug Category 4: Reconfiguration Bugs
Membership changes (adding/removing nodes) are a source of numerous bugs. The fundamental challenge: during reconfiguration, which quorum applies?
A naive approach to reconfiguration might accept a value with the old majority and then accept a conflicting value with the new majority, before all nodes agree on the new configuration. Correct reconfiguration requires special protocols like joint consensus or single-server changes.
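A hedged sketch of the quorum rule at the heart of joint consensus: while both configurations are active, a value counts as chosen only if a majority of the old membership and a majority of the new membership have accepted it. The function names are illustrative, and the surrounding protocol (how configurations are proposed and committed) is omitted.

```python
def majority(nodes):
    return len(nodes) // 2 + 1

def joint_quorum_reached(responding_nodes, old_config, new_config):
    """Sketch: during a joint configuration, require a majority of BOTH
    the old and the new membership before treating a value as chosen."""
    responders = set(responding_nodes)
    old_ok = len(responders & set(old_config)) >= majority(old_config)
    new_ok = len(responders & set(new_config)) >= majority(new_config)
    return old_ok and new_ok

# Example: old = {A,B,C}, new = {A,B,C,D,E}; responses from {A, B, D}
# satisfy the old majority (2 of 3) but not the new one (3 of 5 needed).
```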
The difficulties of implementing Paxos led to the development of alternative consensus algorithms designed explicitly for understandability and implementability.
Raft's Design Philosophy:
The Raft paper (2014) by Diego Ongaro and John Ousterhout explicitly states:
"Raft is a consensus algorithm that is designed to be easy to understand. It's equivalent to Paxos in fault-tolerance and performance. The difference is that it's decomposed into relatively independent subproblems, and it cleanly addresses all major pieces needed for practical systems."
| Aspect | Paxos | Raft | Raft's Advantage |
|---|---|---|---|
| Leadership | Any node can propose | Strong leader only | Simpler reasoning; no dueling proposers |
| Log structure | Gaps allowed | No gaps in log | Simpler replication; easier catch-up |
| Membership changes | Underspecified | Explicitly defined | Clear, tested procedure |
| Snapshotting | Underspecified | Explicitly defined | Practical from day one |
| Algorithm scope | Minimal core | Complete system | Less implementation burden |
Other Alternatives: Viewstamped Replication (a leader-based protocol that predates the publication of Paxos), ZAB (the atomic broadcast protocol underlying ZooKeeper), and Paxos variants such as Fast Paxos and EPaxos that trade extra complexity for lower latency in specific settings.
For most systems, Raft is a better choice than raw Paxos because of its implementation clarity. Use Paxos when you need specific optimizations it offers (like Fast Paxos or EPaxos), or when you're extending an existing Paxos-based system.
The Spectrum of Correctness:
There's an important tradeoff in algorithm design:
Simpler Algorithm ←→ More Understandable ←→ Likely Correct Implementation
Raft explicitly chose simplicity over theoretical generality, betting that real-world systems benefit more from implementations that are likely to be correct than from algorithms that are theoretically optimal but frequently buggy.
The challenges of implementing Paxos teach us broader lessons about building reliable distributed systems.
Implement consensus yourself only when: (1) Existing solutions don't meet your requirements, (2) You have the expertise and time for thorough testing, (3) You're willing to use formal verification techniques. Otherwise, use etcd, ZooKeeper, or another battle-tested system.
The Humility Principle:
Google, with some of the best distributed systems engineers in the world, spent years debugging their Paxos implementation. If Google finds it hard, so will you.
This isn't a reason to avoid learning Paxos — understanding it deeply makes you better at using and debugging systems built on it. But it is a reason for humility about implementation, and a strong argument for leveraging existing, well-tested implementations wherever possible.
We've explored the many challenges that make Paxos implementation notoriously difficult. Let's consolidate the key takeaways: the published algorithm is only a small fraction of a production system; persistence (fsync, atomic writes, corruption handling) is one of the most error-prone areas; subtle edge cases such as retrying without Phase 1, confusing the promise and accept conditions, and stale leadership checks can break safety; liveness requires engineering on top of the algorithm (stable leadership, randomized backoff, leases); conventional testing is not enough, so use model checking, simulation, or chaos testing; and unless you have a compelling reason, build on a battle-tested implementation rather than writing your own.
Module Complete:
You've now completed the Paxos Algorithm module. You understand the theoretical foundations (Basic Paxos, the two-phase protocol, the adoption rule), the practical extensions (Multi-Paxos, log replication, stable leader), and the real-world challenges (implementation pitfalls, testing, alternatives).
This deep knowledge prepares you for the subsequent modules on Raft, ZAB, and practical coordination services like ZooKeeper and etcd.
Congratulations! You've completed the Paxos Algorithm module. You now have a deep understanding of the canonical distributed consensus algorithm — its elegance, its challenges, and its practical application. This knowledge is foundational for understanding all modern distributed systems that provide strong consistency guarantees.