Paxos has a reputation for being notoriously difficult to implement correctly. This reputation is well-deserved. Google engineers working on Chubby reported spending years getting their Paxos implementation right. The creators of Raft explicitly designed their algorithm to be "understandable" in direct response to Paxos's complexity.
But what exactly makes Paxos so hard? Is it the algorithm itself, or something else? The truth is nuanced: the core algorithm is elegant and relatively simple, but the gap between the theoretical algorithm and a production-quality implementation is vast.
Understanding this gap is crucial for anyone who works with distributed systems, whether you're implementing consensus yourself or using systems built on Paxos-like protocols.
By the end of this page, you will understand the specific categories of challenges that make Paxos hard, the gap between theory and practice, common bugs found in real implementations, why alternative algorithms like Raft emerged, and lessons for building reliable distributed systems.
Lamport's Paxos papers describe the algorithm at a high level of abstraction. They specify what the algorithm must do, but leave many "how" decisions to the implementer. This is typical for theoretical papers but creates a significant implementation burden.
What the Papers Specify:
- The two-phase protocol: the Prepare/Promise and Accept/Accepted message exchanges
- The rules acceptors must follow (never accept a proposal numbered below a promise already made)
- The adoption rule: a proposer must adopt the highest-numbered value reported in Phase 1
- The safety argument: once a value is chosen, no different value can ever be chosen
What the Papers Leave Unspecified:
| Aspect | What's Missing | Why It Matters |
|---|---|---|
| Leader election | How to choose a leader initially or after failure | Protocol can livelock without stable leadership |
| Failure detection | How to detect leader/acceptor crashes | Too aggressive = thrashing; too slow = long unavailability |
| Log compaction | How to discard old entries safely | Log grows unbounded without compaction |
| Reconfiguration | How to add/remove nodes from the cluster | Static membership is impractical for real systems |
| Client interaction | How clients submit commands and receive responses | Duplicate commands, retry logic, session semantics |
| Persistence format | How to store state on disk efficiently | Performance and correctness depend on implementation |
| Catch-up protocol | How lagging replicas get up to date | Slow replicas may never fully sync |
The Paxos algorithm as described in papers is the tip of the iceberg. Roughly 90% of the code in a production Paxos implementation deals with aspects not covered by the algorithm specification: serialization, networking, persistence, monitoring, reconfiguration, optimization, and error handling.
Google's Chubby Experience:
In their paper "Paxos Made Live" (2007), Google engineers documented their experience implementing Paxos for the Chubby lock service:
"There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system... the algorithm must be supplemented with additional mechanisms in order to provide a complete solution."
They reported that the core Paxos algorithm was "only a small part" of the final codebase, with most of the effort going into handling disk corruption, master leases, group membership changes, snapshots, and building the testing framework needed to gain confidence in the result.
One of the most significant barriers to correct Paxos implementation is simply understanding the algorithm deeply enough to implement it correctly.
The Symmetry Problem:
Basic Paxos is inherently symmetric — any node can be a proposer, and there's no distinguished leader. While elegant theoretically, this symmetry makes reasoning about the system harder: several proposers may be running rounds concurrently, no single node's state summarizes what the system has decided, and every node must implement and handle every role's edge cases.
The Original Presentation Didn't Help:
Lamport's original paper used a Greek parliament allegory that many found confusing rather than clarifying. When he published "Paxos Made Simple" in 2001, he famously wrote:
"The Paxos algorithm, when presented in plain English, is very simple."
Yet even "Paxos Made Simple" is only 14 pages and leaves many practical concerns unaddressed. The simplicity is in the algorithm's invariants, not in the implementation path.
The Raft paper explicitly states that understandability was the primary design goal. The authors conducted user studies to measure understandability. Their thesis: a protocol that engineers can fully understand is one they can implement correctly. Raft makes specific design choices (strong leader, sequential log) that sacrifice some generality for clarity.
Correct persistence is one of the most error-prone aspects of Paxos implementation. The algorithm's safety depends on acceptors never forgetting their promises, but achieving this in practice is surprisingly hard.
The fsync Problem:
```python
import json
import os

# BUGGY: This looks correct but isn't safe!
def handle_accept_BUGGY(self, proposal_number, value):
    if self._can_accept(proposal_number):
        self.accepted = (proposal_number, value)
        # Write to file
        with open(self.state_file, 'w') as f:
            f.write(json.dumps(self.accepted))
        # BUG: No fsync! Data may be in OS buffers, not on disk
        return AcceptedResponse(proposal_number, value)

# CORRECT: Ensure data reaches stable storage
def handle_accept_CORRECT(self, proposal_number, value):
    if self._can_accept(proposal_number):
        self.accepted = (proposal_number, value)
        # Write to file with explicit sync
        with open(self.state_file, 'w') as f:
            f.write(json.dumps(self.accepted))
            f.flush()               # Flush Python buffers
            os.fsync(f.fileno())    # Flush OS buffers to disk
        # Also sync the directory so the file metadata is durable
        dir_fd = os.open(os.path.dirname(self.state_file), os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
        return AcceptedResponse(proposal_number, value)
```
Without proper fsync, data can appear to be written but actually still be sitting in OS buffers. A power failure at that moment loses the data. An acceptor that restarts after the power failure has no record of its promise, potentially violating safety. This bug has caused real outages at multiple companies.
Additional Persistence Challenges:
- Torn or partial writes: a crash in the middle of a write can leave the state file corrupted unless updates are atomic (see the sketch below)
- Hardware and filesystems that acknowledge fsync before data is truly durable (disk write caches, misconfigured mounts)
- Detecting silent corruption of stored state, typically with checksums, so a damaged acceptor does not rejoin with bogus promises
- Keeping persistence off the hot path without weakening durability (batched fsyncs, write-ahead logging)
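The torn-write problem in particular has a standard mitigation: write the new state to a temporary file, fsync it, and atomically rename it over the old file. The following is a minimal sketch of that pattern under the same JSON-on-disk assumption as the code above; the function name and error handling are illustrative, not taken from any particular implementation.

```python
import json
import os
import tempfile

def atomic_write_state(path, state):
    """Sketch: persist acceptor state so a crash leaves either the old
    file or the new one intact, never a torn mixture."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'w') as f:
            f.write(json.dumps(state))
            f.flush()
            os.fsync(f.fileno())        # data reaches the disk
        os.replace(tmp_path, path)      # atomic rename over the old file
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)            # make the rename itself durable
        finally:
            os.close(dir_fd)
    except BaseException:
        try:
            os.unlink(tmp_path)         # best-effort cleanup on failure
        except OSError:
            pass
        raise
```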
Real-World Incident:
In one documented incident, a distributed database using Paxos suffered data loss rooted in exactly this kind of persistence failure: state that looked durably written had never actually reached disk.
The fix required not just code changes but extensive storage-layer verification.
The Paxos protocol has numerous edge cases that are easy to miss. Each edge case can lead to subtle bugs that only manifest under rare conditions.
Edge Case 1: Re-proposing After Failed Accept
```python
# SCENARIO: Our Accept was rejected, now we retry

# WRONG approach:
async def retry_proposal_WRONG(self, value):
    proposal_number = self.generate_higher_proposal_number()
    # BUG: Skipping Phase 1 because "we know our value"
    # But another proposer may have succeeded in the meantime!
    await self.phase2_accept(proposal_number, value)

# CORRECT approach:
async def retry_proposal_CORRECT(self, value):
    proposal_number = self.generate_higher_proposal_number()
    # MUST run Phase 1 to discover any newly-accepted values
    phase1_result = await self.phase1_prepare(proposal_number)
    if phase1_result.discovered_value is not None:
        # Adoption rule: use the discovered value!
        value = phase1_result.discovered_value
    await self.phase2_accept(proposal_number, value)
```
Edge Case 2: Promise Number vs Accept Number
A subtle bug occurs when implementations confuse the prepare condition with the accept condition:
| Operation | Correct Condition | Common Bug | Result of Bug |
|---|---|---|---|
| Handle Prepare(n) | Promise if n > promised | Using n >= promised | Allows duplicate proposal numbers, potential safety violation |
| Handle Accept(n, v) | Accept if n >= promised | Using n > promised | Rejects proposer's own accept after their prepare |
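To make the two conditions concrete, here is a minimal sketch of an acceptor's checks. Persistence is omitted, and the field and return-value names are illustrative assumptions, not a prescribed interface.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1        # highest proposal number promised so far
        self.accepted = None      # (proposal_number, value) or None

    def handle_prepare(self, n):
        # Promise only if n is strictly greater than anything promised.
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)   # report any previously accepted value
        return ("nack", self.promised)

    def handle_accept(self, n, value):
        # Accept if n is at least the promised number (>=, not >); otherwise
        # the acceptor would reject the proposer's own Accept that follows
        # its successful Prepare.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return ("accepted", n)
        return ("nack", self.promised)
```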
Edge Case 3: Handling of No-Responses
What happens when an acceptor doesn't respond at all (network timeout)? The handling is subtle: silence is neither a promise nor a rejection. The proposer must count only explicit positive responses toward its quorum, must not assume a quiet acceptor still honors an old promise, and has to decide whether to keep waiting, retry the same proposal number, or start over with a higher one, as the sketch below illustrates.
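A hedged sketch of the proposer side, assuming asyncio-based acceptor stubs whose prepare() coroutine returns an object with a granted flag; both the stub interface and the timeout value are assumptions for illustration.

```python
import asyncio

async def gather_promises(acceptors, proposal_number, timeout=0.5):
    """Count only explicit Promise replies toward the quorum.
    A timed-out acceptor is neither a promise nor a rejection."""
    quorum = len(acceptors) // 2 + 1
    tasks = [asyncio.create_task(a.prepare(proposal_number)) for a in acceptors]
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for t in pending:
        t.cancel()                      # silence is not consent
    promises = [t.result() for t in done
                if t.exception() is None and t.result().granted]
    if len(promises) >= quorum:
        return promises
    return None                         # caller decides: wait longer, retry, or back off
```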
Edge Case 4: Concurrent Leadership Claims
Two nodes may believe they are the leader simultaneously. The protocol handles this safely, but implementations often add unsafe optimizations:
```python
# UNSAFE optimization that breaks safety:
class Leader:
    def accept_command(self, command):
        if self.is_leader:  # BUG: Stale leadership check
            # Skip Phase 1, assume we can accept
            # But we might have been preempted!
            slot = self.next_slot
            self.next_slot += 1
            self.phase2_accept(self.proposal_number, slot, command)

# SAFE approach:
class Leader:
    def accept_command(self, command):
        if self.is_leader:
            slot = self.next_slot
            self.next_slot += 1
            try:
                result = self.phase2_accept(self.proposal_number, slot, command)
            except RejectedError:
                # We've been preempted! Step down.
                self.is_leader = False
                self.proposal_number = None
                raise NotLeaderAnymore()
```
While Paxos guarantees safety unconditionally, it only guarantees liveness under "reasonable" conditions. Without careful design, Paxos can get stuck in pathological states.
Livelock: Dueling Proposers
The classic livelock scenario: proposer A completes Phase 1 with proposal number n1; before A finishes Phase 2, proposer B completes Phase 1 with a higher number n2, causing the acceptors to reject A's Accept messages; A retries with n3 > n2, which in turn causes B's Accepts to be rejected; the two proposers keep preempting each other indefinitely and no value is ever chosen.
Preventing Livelock:
Production systems use multiple techniques to prevent livelock: randomized backoff before retrying a preempted proposal (see the sketch below), electing a single distinguished proposer (the stable leader of Multi-Paxos) so only one node normally runs rounds, and leader leases so followers do not launch competing proposals while a lease is still valid.
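A hedged sketch of randomized exponential backoff on the proposer side. The run_round() helper, the exception types, and the delay constants are illustrative assumptions, not part of any standard API.

```python
import asyncio
import random

class Preempted(Exception):
    """Hypothetical: raised when a round is rejected by a higher proposal number."""

class ProposalFailed(Exception):
    pass

async def propose_with_backoff(proposer, value, max_attempts=8):
    """Back off for a random, growing interval after each preempted round
    so that duelling proposers eventually stop colliding."""
    delay = 0.05                                     # initial backoff (arbitrary)
    for _ in range(max_attempts):
        try:
            return await proposer.run_round(value)   # hypothetical full Paxos round
        except Preempted:
            await asyncio.sleep(random.uniform(0, delay))
            delay = min(delay * 2, 2.0)              # exponential growth, capped
    raise ProposalFailed(f"gave up after {max_attempts} attempts")
```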
Performance Anti-Patterns:
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Synchronous disk writes on hot path | Each accept waits for fsync | Batch multiple accepts into one fsync; use write-ahead logging |
| Sequential slot processing | One slot at a time limits throughput | Pipeline multiple slots; parallel Phase 2 for different slots |
| Large messages | Sending full values in every message | Separate data plane from control plane; use value IDs |
| Frequent leader elections | Each election costs a Phase 1 round | Tune failure detection; use leases; graceful handoffs |
| No batching | Each command is a separate consensus round | Batch multiple commands into one log entry |
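To make the batching row concrete, here is a hedged sketch of one way to fold commands that queued up during the previous round into a single log entry. The propose_entry() method, the queue-draining policy, and the batch size are assumptions, not a prescribed design.

```python
import asyncio

class BatchingProposer:
    """Sketch: drain whatever queued up while the previous consensus round
    was in flight and propose the whole batch as one log entry."""
    def __init__(self, proposer, max_batch=64):
        self.proposer = proposer          # assumed Multi-Paxos proposer
        self.queue = asyncio.Queue()
        self.max_batch = max_batch

    async def submit(self, command):
        await self.queue.put(command)

    async def run(self):
        while True:
            batch = [await self.queue.get()]          # wait for the first command
            while len(batch) < self.max_batch and not self.queue.empty():
                batch.append(self.queue.get_nowait()) # drain without blocking
            # One consensus round (and one fsync) now covers the whole batch.
            await self.proposer.propose_entry(batch)
```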
Most Paxos performance optimizations assume a 'steady state' with stable leadership. The cost of leadership changes is amortized over many entries. If leadership is unstable (frequent changes), performance degrades significantly. Tune failure detection carefully.
Perhaps the most challenging aspect of Paxos is verifying correctness. Bugs often only manifest under rare failure conditions that are nearly impossible to encounter during normal testing.
Why Standard Testing Fails: unit and integration tests exercise the happy path, but consensus bugs hide in rare interleavings of message loss, reordering, crashes, and restarts; the space of possible executions is astronomically large; many triggering failures (partial partitions, disks that lie about durability, clock jumps) are hard to reproduce on demand; and a buggy implementation can run correctly in production for years before the fatal sequence of events finally occurs.
Verification Techniques for Consensus:
| Technique | Description | Pros | Cons |
|---|---|---|---|
| Model Checking | TLA+ or similar to exhaustively check state space | Finds subtle concurrency bugs; proves properties | Requires specification expertise; state explosion for large systems |
| Jepsen Testing | Black-box chaos testing with failure injection | Tests real implementation; finds real bugs | Cannot prove absence of bugs; coverage is random |
| Formal Verification | Mathematical proof of implementation correctness | Strongest guarantees | Extremely labor-intensive; few teams can do this |
| Simulation Testing | Deterministic replay of distributed executions | Reproducible; fast iteration | Requires significant infrastructure investment |
| Property-Based Testing | Generate random test cases checking invariants | Good coverage with modest effort | May miss adversarial scenarios |
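As a small taste of the property-based row above, the sketch below uses the hypothesis library to check one invariant Paxos relies on: any two majority quorums intersect. It tests a helper function under randomized inputs rather than a full protocol implementation, and the test name is illustrative.

```python
from hypothesis import given, strategies as st

def majority(n):
    return n // 2 + 1

@given(st.integers(min_value=1, max_value=15), st.data())
def test_any_two_quorums_intersect(n, data):
    """Property: any two majority quorums over the same node set share a node,
    which is what lets a new proposer learn about previously chosen values."""
    nodes = list(range(n))
    q1 = set(data.draw(st.lists(st.sampled_from(nodes), min_size=majority(n),
                                max_size=n, unique=True)))
    q2 = set(data.draw(st.lists(st.sampled_from(nodes), min_size=majority(n),
                                max_size=n, unique=True)))
    assert q1 & q2, "two majority quorums must overlap"
```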
Lamport himself advocates for TLA+ for specifying and verifying distributed algorithms. His TLA+ specification of Paxos is a definitive reference. Amazon has used TLA+ to find bugs in their internal systems. For production Paxos, a TLA+ specification is highly recommended.
FoundationDB's Simulation Testing:
FoundationDB pioneered a technique called deterministic simulation, where the entire cluster, including its network, disks, and clocks, runs inside a single process; every source of nondeterminism is drawn from a pseudo-random generator initialized with one seed; failures such as crashes, partitions, and slow disks are injected aggressively; and any failing run can be replayed exactly by reusing its seed.
This approach found bugs that years of production use never surfaced. Many modern consensus implementations now use similar techniques.
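This is not FoundationDB's actual framework, but a minimal sketch of the core idea under the assumption of a simple event-at-a-time simulation loop: every delay, drop, and reordering comes from one seeded PRNG, so identical seeds give identical runs. The class and parameters are illustrative.

```python
import random

class SimulatedNetwork:
    """Sketch of seeded, deterministic message delivery for simulation testing."""
    def __init__(self, seed, drop_rate=0.05):
        self.rng = random.Random(seed)   # the single source of nondeterminism
        self.drop_rate = drop_rate
        self.in_flight = []              # (deliver_at, dest, message)
        self.now = 0.0

    def send(self, dest, message):
        if self.rng.random() < self.drop_rate:
            return                       # injected message loss
        delay = self.rng.expovariate(1 / 0.01)   # injected latency
        self.in_flight.append((self.now + delay, dest, message))

    def step(self):
        """Deliver the next message; replaying the same seed replays the run."""
        if not self.in_flight:
            return None
        self.in_flight.sort(key=lambda m: m[0])
        self.now, dest, message = self.in_flight.pop(0)
        return dest, message
```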
Learning from bugs found in real implementations is invaluable. Here are documented bugs from production Paxos systems:
Bug Category 1: State Corruption on Recovery
A bug in ZooKeeper (which uses ZAB, a Paxos variant) caused corruption when a leader crashed and recovered. The leader would replay its log but skip entries that were already committed, leading to divergent replica state. Fixed by ensuring complete log replay on recovery.
Bug Category 2: Incorrect Quorum Calculations
```python
# BUG 1: Off-by-one / type error in majority calculation
def majority_WRONG(n):
    return n / 2 + 1   # For n=5 this is 3.5, a float; later rounding or
                       # comparison against it can make the quorum check
                       # stricter or looser than intended

def majority_CORRECT(n):
    return n // 2 + 1  # For n=5, returns 3

# BUG 2: Changing cluster size without updating quorum
class Cluster:
    def __init__(self, nodes):
        self.nodes = nodes
        self.majority = len(nodes) // 2 + 1

    def add_node(self, node):
        self.nodes.append(node)
        # BUG: forgot to update self.majority!
        # Going from 5 to 6 nodes, majority is still 3, but should be 4

# BUG 3: Counting the same node multiple times
def check_quorum_WRONG(responses, majority):
    return len(responses) >= majority   # Might count duplicates!

def check_quorum_CORRECT(responses, majority):
    unique_respondents = set(r.node_id for r in responses)
    return len(unique_respondents) >= majority
```
Bug Category 3: Message Handling Edge Cases
Delayed, duplicated, or reordered messages from earlier rounds are another rich source of bugs: an implementation must tolerate receiving the same Promise or Accepted twice, and must not act on responses that belong to a proposal number it has already abandoned.
Bug Category 4: Reconfiguration Bugs
Membership changes (adding/removing nodes) are a source of numerous bugs. The fundamental challenge: during reconfiguration, which quorum applies?
A naive approach to reconfiguration might accept a value with the old majority and then accept a conflicting value with the new majority, before all nodes agree on the new configuration. Correct reconfiguration requires special protocols like joint consensus or single-server changes.
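A hedged sketch of the quorum rule at the heart of joint consensus: while both configurations are active, a value counts as chosen only if a majority of the old membership and a majority of the new membership have accepted it. The function names are illustrative, and the surrounding protocol (how configurations are proposed and committed) is omitted.

```python
def majority(nodes):
    return len(nodes) // 2 + 1

def joint_quorum_reached(responding_nodes, old_config, new_config):
    """Sketch: during a joint configuration, require a majority of BOTH
    the old and the new membership before treating a value as chosen."""
    responders = set(responding_nodes)
    old_ok = len(responders & set(old_config)) >= majority(old_config)
    new_ok = len(responders & set(new_config)) >= majority(new_config)
    return old_ok and new_ok

# Example: old = {A,B,C}, new = {A,B,C,D,E}; responses from {A, B, D}
# satisfy the old majority (2 of 3) but not the new one (3 of 5 needed).
```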
The difficulties of implementing Paxos led to the development of alternative consensus algorithms designed explicitly for understandability and implementability.
Raft's Design Philosophy:
The Raft paper (2014) by Diego Ongaro and John Ousterhout explicitly states:
"Raft is a consensus algorithm that is designed to be easy to understand. It's equivalent to Paxos in fault-tolerance and performance. The difference is that it's decomposed into relatively independent subproblems, and it cleanly addresses all major pieces needed for practical systems."
| Aspect | Paxos | Raft | Raft's Advantage |
|---|---|---|---|
| Leadership | Any node can propose | Strong leader only | Simpler reasoning; no dueling proposers |
| Log structure | Gaps allowed | No gaps in log | Simpler replication; easier catch-up |
| Membership changes | Underspecified | Explicitly defined | Clear, tested procedure |
| Snapshotting | Underspecified | Explicitly defined | Practical from day one |
| Algorithm scope | Minimal core | Complete system | Less implementation burden |
Other Alternatives: Viewstamped Replication (a leader-based protocol that predates the publication of Paxos), ZAB (the atomic broadcast protocol underlying ZooKeeper), and Paxos variants such as Fast Paxos and EPaxos that trade extra complexity for lower latency in specific settings.
For most systems, Raft is a better choice than raw Paxos because of its implementation clarity. Use Paxos when you need specific optimizations it offers (like Fast Paxos or EPaxos), or when you're extending an existing Paxos-based system.
The Spectrum of Correctness:
There's an important tradeoff in algorithm design:
Simpler Algorithm ←→ More Understandable ←→ Likely Correct Implementation
Raft explicitly chose simplicity over theoretical generality, betting that real-world systems benefit more from implementations that are likely to be correct than from algorithms that are theoretically optimal but frequently buggy.
The challenges of implementing Paxos teach us broader lessons about building reliable distributed systems.
Implement consensus yourself only when: (1) Existing solutions don't meet your requirements, (2) You have the expertise and time for thorough testing, (3) You're willing to use formal verification techniques. Otherwise, use etcd, ZooKeeper, or another battle-tested system.
The Humility Principle:
Google, with some of the best distributed systems engineers in the world, spent years debugging their Paxos implementation. If Google finds it hard, so will you.
This isn't a reason to avoid learning Paxos — understanding it deeply makes you better at using and debugging systems built on it. But it is a reason for humility about implementation, and a strong argument for leveraging existing, well-tested implementations wherever possible.
We've explored the many challenges that make Paxos implementation notoriously difficult. Let's consolidate the key takeaways: the published algorithm is only a small fraction of a production system; persistence (fsync, atomic writes, corruption handling) is one of the most error-prone areas; subtle edge cases such as retrying without Phase 1, confusing the promise and accept conditions, and stale leadership checks can break safety; liveness requires engineering on top of the algorithm (stable leadership, randomized backoff, leases); conventional testing is not enough, so use model checking, simulation, or chaos testing; and unless you have a compelling reason, build on a battle-tested implementation rather than writing your own.
Module Complete:
You've now completed the Paxos Algorithm module. You understand the theoretical foundations (Basic Paxos, the two-phase protocol, the adoption rule), the practical extensions (Multi-Paxos, log replication, stable leader), and the real-world challenges (implementation pitfalls, testing, alternatives).
This deep knowledge prepares you for the subsequent modules on Raft, ZAB, and practical coordination services like ZooKeeper and etcd.
Congratulations! You've completed the Paxos Algorithm module. You now have a deep understanding of the canonical distributed consensus algorithm — its elegance, its challenges, and its practical application. This knowledge is foundational for understanding all modern distributed systems that provide strong consistency guarantees.