In the landscape of distributed consensus, three protocols dominate production deployments: Paxos, Raft, and ZAB. Each emerged from different contexts, was designed with different priorities, and makes distinct trade-offs. Understanding their similarities and differences isn't merely academic—it's essential for choosing the right foundation for your distributed systems and for understanding the behavior of systems you already depend on.
Paxos is the grandfather of consensus protocols, mathematically elegant but notoriously difficult to implement. Raft was explicitly designed for understandability, decomposing consensus into digestible sub-problems. ZAB was purpose-built for Zookeeper's specific needs as a coordination service. All three achieve the same fundamental goal—distributed agreement—but they arrive there through remarkably different paths.
By the end of this page, you will understand the fundamental design philosophies of each protocol, their key technical differences, performance implications, and guidance for when each is most appropriate. You'll be able to reason about why a system uses one protocol over another.
Each protocol emerged from a specific context that shaped its design. Understanding this context illuminates why each protocol makes the choices it does.
Paxos (1989/1998)
Leslie Lamport developed Paxos to prove that consensus was possible in asynchronous systems with crash failures. The original "The Part-Time Parliament" paper used an allegory about a Greek parliament, making it somewhat obscure. The later "Paxos Made Simple" paper clarified the algorithm but retained its theoretical focus.
Design Goals:
- Prove that consensus is achievable in asynchronous systems with crash failures
- Mathematical rigor and generality over implementation guidance
- Minimal assumptions about the underlying network
ZAB (2007-2008)
ZAB was developed specifically for Apache Zookeeper, a coordination service that needed high-throughput atomic broadcast rather than general consensus. The designers studied Paxos extensively but found it insufficiently specified for their needs.
Design Goals:
- High-throughput atomic broadcast tailored to Zookeeper's workload
- Primary-order guarantees: a leader's proposals are delivered in exactly the order it issued them
- A recovery procedure specified precisely enough for implementers to follow
Raft (2013)
Diego Ongaro and John Ousterhout created Raft explicitly to address Paxos's understandability problems. Their paper "In Search of an Understandable Consensus Algorithm" emphasized pedagogy as a primary design goal.
Design Goals:
- Understandability as an explicit, first-class goal
- Decomposition into separable sub-problems: leader election, log replication, and safety
- A specification complete enough to implement directly
| Protocol | Year | Primary Author(s) | Original Context | Primary Focus |
|---|---|---|---|---|
| Paxos | 1989/1998 | Leslie Lamport | Theoretical research | Proving consensus possible, generality |
| ZAB | 2007-2008 | Yahoo! Research | Zookeeper coordination service | High-throughput atomic broadcast |
| Raft | 2013 | Ongaro, Ousterhout | Stanford research | Understandability, practical implementation |
The Understandability Factor:
Raft's designers conducted user studies showing that students understood Raft significantly faster than Paxos. This isn't just about learning speed—understandable protocols are:
- More likely to be implemented correctly
- Easier to extend with optimizations without breaking safety arguments
- Easier to debug and operate in production
ZAB, while more complex than Raft, was specified with enough detail that implementers could follow it. Paxos papers left many practical details unspecified, leading to many incompatible "Paxos" implementations.
Despite solving the same fundamental problem, the three protocols differ significantly in their core mechanics.
Leader Selection:
Paxos: Leadership is an optimization, not a requirement. Any node can be a proposer at any time. In practice, Multi-Paxos uses a stable leader for efficiency, but the protocol doesn't require it.
ZAB: Leadership is mandatory. Only the leader can broadcast proposals. Fast Leader Election is a defined sub-protocol that selects the node with the most complete history.
Raft: Leadership is mandatory, similar to ZAB. Leader election uses randomized timeouts to break symmetry. The elected leader is guaranteed to hold all committed entries (the Leader Completeness Property).
Log Replication Approach:
Paxos: Each log slot is decided independently. Proposals can be decided out of order, creating 'holes.' The state machine must handle gaps while waiting for earlier slots.
ZAB: Proposals are broadcast in strict order. No gaps are possible—if zxid N is committed, all zxids < N are committed. This is the 'prefix' property.
Raft: Similar to ZAB, log entries are replicated in order. The Log Matching Property ensures that if two logs have an entry with the same index and term, all preceding entries are identical.
| Mechanism | Paxos | ZAB | Raft |
|---|---|---|---|
| Leader requirement | Optional (optimization) | Mandatory | Mandatory |
| Leader selection | Higher proposal number wins | Fast Leader Election (highest zxid) | Randomized timeout election |
| Entry ordering | Per-slot consensus (can have holes) | Strict sequential (no holes) | Strict sequential (no holes) |
| Term/epoch concept | Proposal numbers (interleaved) | Epochs (discrete) | Terms (discrete) |
| Commit condition | Majority accepts proposal | Quorum ACKs proposal | Majority replicates entry |
| Safety approach | Proposal number ordering | Epoch + zxid ordering | Term + index + Log Matching |
Handling Log Divergence:
One critical difference is how each protocol handles divergent logs—situations where different nodes have different log entries for the same position.
Paxos: Divergence is natural since proposals can compete. The protocol resolves by having acceptors promise to higher proposal numbers. Old proposals can be "preempted."
ZAB: Divergence only occurs when a failed leader had uncommitted proposals. The new leader's synchronization phase detects this via zxid comparison and instructs followers to TRUNC (truncate) their logs.
Raft: Divergence occurs when followers have entries from old leaders. Raft resolves by having the leader's log be authoritative—followers delete conflicting entries and accept the leader's entries.
The key difference: Paxos handles divergence as a normal part of operation, while ZAB and Raft treat it as an error condition that only occurs across leader changes.
Raft's Log Matching Property states: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index. This is similar to ZAB's prefix consistency but expressed differently. Both ensure that committed state can never diverge.
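To make these divergence rules concrete, here is a minimal Python sketch of the follower-side logic Raft uses: reject the append if the consistency check fails, otherwise truncate any conflicting suffix and accept the leader's entries. The `Entry` type and function shape are illustrative, not any particular library's API.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    term: int
    command: str

def follower_append(log, prev_index, prev_term, new_entries):
    """Raft follower-side consistency check: reject on mismatch,
    truncate conflicts, then append the leader's entries.
    Indexes are 1-based as in the Raft paper; `log` is a Python list."""
    # Consistency check: our log must contain the leader's previous entry.
    if prev_index > 0:
        if len(log) < prev_index or log[prev_index - 1].term != prev_term:
            return False  # leader will back up nextIndex and retry
    # Resolve divergence: the leader's log is authoritative.
    for i, entry in enumerate(new_entries):
        pos = prev_index + i  # 0-based position of this entry in our log
        if pos < len(log) and log[pos].term != entry.term:
            del log[pos:]      # discard the whole conflicting suffix
        if pos >= len(log):
            log.append(entry)  # accept the leader's entry
    return True
```

Note that a conflicting entry is never overwritten in isolation; a term mismatch discards the entire suffix from that point, which is what keeps the Log Matching Property inductive.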
All three protocols provide the fundamental consensus guarantees, but with subtle differences in how they achieve them and what additional guarantees they provide.
Safety Guarantees (Always True):
Agreement: If any correct node commits value V at position P, all correct nodes eventually commit V at P. None of the protocols allow committed values to be lost.
Validity: Only proposed values can be decided. The protocols don't invent values—they only choose among proposed ones.
Integrity: Each node applies each transaction at most once. No duplicates in the applied log.
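A common way implementations enforce integrity is a monotonic last-applied cursor; a minimal sketch, where the `(key, value)` entry shape and names are assumptions for illustration:

```python
class StateMachine:
    def __init__(self):
        self.last_applied = 0  # index of the last entry applied
        self.data = {}

    def apply_committed(self, log, commit_index):
        # Advance at most once per entry, strictly in log order.
        while self.last_applied < commit_index:
            self.last_applied += 1
            key, value = log[self.last_applied - 1]  # entries as (key, value)
            self.data[key] = value
```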
Liveness Guarantees (Eventually True with Assumptions):
All three protocols require partial synchrony for liveness—they need network delays to be bounded eventually for the system to make progress. In practice:
- Paxos can livelock when competing proposers repeatedly preempt one another; implementations add leader leases or backoff
- ZAB and Raft can stall during partitions or leader failures, but resume once election timeouts fire reliably
- Safety is never at risk during these stalls; only progress is delayed
Additional Guarantees:
Paxos provides just the core consensus guarantees. FIFO ordering, total ordering, and primary ordering must be built on top.
ZAB provides atomic broadcast guarantees:
- Total order: all followers deliver transactions in the same order
- FIFO client order: requests from a single client are applied in the order sent
- Prefix consistency: if zxid N is delivered, every zxid < N has been delivered
Raft provides:
- Total ordering of log entries across the cluster
- The Leader Completeness Property: an elected leader holds every committed entry
- The Log Matching Property, which rules out divergent committed state
| Guarantee | Paxos | ZAB | Raft |
|---|---|---|---|
| Core consensus safety | ✓ | ✓ | ✓ |
| Total ordering (built-in) | Partial (per-slot) | ✓ (all transactions) | ✓ (log entries) |
| FIFO client ordering | ✗ (build on top) | ✓ (native) | ✗ (typically build on top) |
| Prefix consistency | ✗ (holes possible) | ✓ | ✓ |
| Leader completeness | N/A (no fixed leader) | ✓ | ✓ (formal property) |
| Exactly-once semantics | ✗ (add on) | ✓ | Depends on implementation |
FLP Impossibility and All Three Protocols:
The FLP impossibility result states that in an asynchronous system with even one crash failure, no deterministic protocol can guarantee consensus. All three protocols escape this result through similar mechanisms:
- Randomization: Raft's randomized election timeouts break symmetry non-deterministically
- Timeouts as failure detectors: ZAB and Multi-Paxos implementations rely on timing assumptions to suspect failed leaders
- Conditional liveness: all three guarantee termination only under partial synchrony, while safety holds unconditionally
In practice, all three protocols work well in real networks because modern networks are typically synchronous (bounded delays) with occasional asynchronous periods (network partitions, overload). The protocols tolerate the asynchronous periods by blocking progress but maintain safety throughout.
All three protocols prioritize safety over liveness. If forced to choose between possibly committing an incorrect value or blocking indefinitely, they block. This is the correct choice for coordination systems—incorrect coordination is worse than unavailable coordination.
Leader election is one of the most visible differences between the protocols. Each takes a different approach that reflects its design philosophy.
Paxos Leader Election:
Basic Paxos doesn't define leader election—any node can propose at any time. Multi-Paxos implementations typically add leader election, but it's not specified in the original papers. Common approaches include:
- Leader leases with timeouts
- Deterministic rules such as "highest node ID wins"
- Running a consensus instance to agree on the leader itself
ZAB Fast Leader Election:
ZAB defines a specific Fast Leader Election protocol:
- Nodes exchange votes containing (epoch, last zxid, server id)
- A node updates its vote when it sees a candidate with a higher zxid (ties broken by server id)
- Election completes when a quorum converges on the same vote
Raft Election:
- A follower that hears nothing from a leader within its randomized timeout becomes a candidate
- The candidate increments its term, votes for itself, and requests votes from all peers
- A candidate that gathers votes from a majority becomes leader; voters refuse candidates with less up-to-date logs
Election Speed and Stability:
Paxos: Without an explicit election protocol, "elections" happen whenever nodes contend. This can lead to live-lock where competing proposers preempt each other. Implementations add leader leases or backoff to stabilize.
ZAB: Fast Leader Election typically converges in 2-3 message rounds when there's a clear winner. The protocol explicitly favors nodes with the most complete history, reducing the need for post-election synchronization.
Raft: Elections use randomized timeouts (typically 150-300ms). In the common case, election completes in one round. Split votes trigger new elections with fresh random timeouts. Studies show Raft elections converge quickly in practice.
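A sketch of the randomized-timeout mechanic in Python, with the vote RPC supplied as a caller-provided stub since real transports vary; the node fields and bounds are assumptions:

```python
import random

ELECTION_TIMEOUT_MS = (150, 300)

def new_election_timeout():
    # Each follower picks a fresh random timeout, so timers rarely
    # expire simultaneously and split votes are uncommon.
    return random.uniform(*ELECTION_TIMEOUT_MS)

def run_election(node, request_vote):
    """On timeout, a follower becomes a candidate.
    `request_vote` is a caller-supplied RPC stub returning True/False."""
    node.current_term += 1
    node.voted_for = node.id
    votes = 1  # vote for self
    for peer in node.peers:
        if request_vote(peer, node.current_term,
                        node.last_log_index, node.last_log_term):
            votes += 1
    cluster_size = len(node.peers) + 1
    return votes > cluster_size // 2  # True => become leader
```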
The Log Completeness Requirement:
Both ZAB and Raft ensure the new leader has all committed entries:
ZAB: Fast Leader Election prefers higher zxids. The discovery phase additionally verifies and syncs if needed.
Raft: Voters reject candidates whose log is less up-to-date than their own. This voting restriction ensures only candidates holding all committed entries can win a majority (see the sketch below).
Paxos: Multi-Paxos leaders must learn committed values they might be missing. This can add latency to the post-election phase.
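Raft's voting restriction above boils down to a two-field comparison of each log's last entry; a minimal sketch with assumed names:

```python
def candidate_is_up_to_date(candidate_last_term, candidate_last_index,
                            my_last_term, my_last_index):
    # Higher last term wins; equal terms fall back to the longer log.
    # A voter grants its vote only if this returns True.
    if candidate_last_term != my_last_term:
        return candidate_last_term > my_last_term
    return candidate_last_index >= my_last_index
```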
Raft's original election algorithm can cause disruption when a partitioned node rejoins—it may trigger unnecessary elections with higher terms. The Pre-Vote optimization (now widely implemented) has nodes check if they would win before incrementing terms. ZAB's epoch mechanism provides similar protection inherently.
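A sketch of the Pre-Vote idea under the same assumptions as the election sketch above: run a trial election without touching persistent state, and only disrupt the cluster if we could actually win.

```python
def should_start_election(node, pre_vote):
    """Trial election: ask peers if they'd vote for us at term + 1,
    WITHOUT incrementing our term or touching persistent state.
    `pre_vote` is a caller-supplied RPC stub returning True/False."""
    prospective = 1  # our own would-be vote
    for peer in node.peers:
        if pre_vote(peer, node.current_term + 1,
                    node.last_log_index, node.last_log_term):
            prospective += 1
    return prospective > (len(node.peers) + 1) // 2
```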
The mechanics of how each protocol replicates log entries reveal fundamental design differences.
Paxos (Multi-Paxos) Replication:
- The stable leader skips the Prepare phase and sends Accept(slot, n, value) for each new slot
- Acceptors persist the value and reply Accepted
- A slot's value is chosen once a majority accepts; each slot is an independent consensus instance
Challenge: Holes can form in the log. If slot 10 is decided before slot 9, the state machine must wait for slot 9. This requires hole-filling mechanisms.
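A sketch of the apply loop such implementations typically need (structure hypothetical): the replica applies only the contiguous decided prefix and stops at the first hole. Leaders commonly fill lingering holes by proposing no-op values.

```python
def apply_ready_slots(decided, state_machine, next_to_apply):
    """`decided` maps slot number -> chosen value; slots may be
    decided out of order, so gaps ("holes") are possible."""
    # Apply only the contiguous prefix; stop at the first hole.
    while next_to_apply in decided:
        state_machine.apply(decided[next_to_apply])
        next_to_apply += 1
    return next_to_apply  # the first still-undecided slot
```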
ZAB Replication:
- The leader assigns the next zxid and broadcasts a PROPOSAL to all followers
- Followers write the proposal to their transaction log (write-ahead) and reply ACK
- After a quorum of ACKs, the leader broadcasts COMMIT and followers deliver the transaction
Key property: Strict ordering with no gaps. If zxid N is committed, all zxids < N are committed.
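This works because a Zookeeper zxid is a single 64-bit number: the epoch occupies the high 32 bits and a per-epoch counter the low 32 bits, so ordering across leader changes is plain integer comparison.

```python
def make_zxid(epoch, counter):
    return (epoch << 32) | (counter & 0xFFFFFFFF)

def epoch_of(zxid):
    return zxid >> 32

def counter_of(zxid):
    return zxid & 0xFFFFFFFF

# Any transaction from a later epoch orders after every transaction
# from an earlier epoch, regardless of counters:
assert make_zxid(epoch=2, counter=1) > make_zxid(epoch=1, counter=500)
```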
Raft Replication:
- The leader appends the entry to its own log and sends AppendEntries carrying prevLogIndex and prevLogTerm
- Followers verify the consistency check, append, persist, and respond
- Once a majority holds the entry, the leader advances commitIndex; the new commit point rides on subsequent AppendEntries
| Aspect | Paxos | ZAB | Raft |
|---|---|---|---|
| Entry identification | Slot number + proposal number | Epoch + counter (zxid) | Term + index |
| Ordering guarantee | Per-slot only | Global across all slots | Global across all slots |
| Gap handling | Must fill holes | No gaps by construction | No gaps by construction |
| Conflict resolution | Higher proposal wins | Leader is authoritative (TRUNC) | Leader is authoritative (overwrite) |
| Commit granularity | Per-slot | Per-transaction (with COMMIT msg) | Cumulative (commitIndex) |
| Persistence timing | Before accepting | Before ACK (write-ahead) | Before acknowledging AppendEntries |
The Commit Mechanism:
How commits are communicated differs significantly:
Paxos: There's no explicit commit message in basic Paxos. A value is "chosen" when a majority accepts it. Learners must observe majority acceptance or be told by a distinguished learner.
ZAB: Explicit COMMIT messages are sent once quorum ACKs are received. This provides clear commit semantics and enables followers to apply immediately upon receiving COMMIT.
Raft: The leader maintains a commitIndex (highest log index known to be replicated on majority). AppendEntries messages include leaderCommit, which followers use to advance their commit. No separate commit message—commit information piggybacks on normal replication.
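The follower side of Raft's piggybacked commit is a single piece of bookkeeping; a sketch assuming the AppendEntries consistency check already succeeded:

```python
def advance_commit_index(follower, leader_commit, last_new_entry_index):
    # leaderCommit piggybacks on AppendEntries; the follower may not
    # hold the leader's entire log yet, so it never commits past what
    # it has actually appended.
    if leader_commit > follower.commit_index:
        follower.commit_index = min(leader_commit, last_new_entry_index)
```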
Batching and Pipelining:
All three protocols support optimizations for throughput:
- Batching: many client requests packed into one proposal or AppendEntries call
- Pipelining: new proposals sent before acknowledgments for earlier ones return
- Group commit: a single fsync covering a whole batch of entries
These optimizations are crucial for production performance—replicating one entry at a time would be too slow for most workloads.
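A minimal batching sketch (queue-based; the names and batch cap are assumed): the leader drains whatever requests arrived while the previous round was in flight and replicates them in one consensus round.

```python
import queue

def replication_loop(requests: queue.Queue, send_batch, max_batch=1024):
    """Single consumer draining a request queue into consensus rounds."""
    while True:
        batch = [requests.get()]  # block until at least one request
        while len(batch) < max_batch and not requests.empty():
            batch.append(requests.get_nowait())  # opportunistic drain
        send_batch(batch)  # one proposal / AppendEntries for the batch
```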
In ZAB and Raft, the 'commit horizon' (highest committed entry) can only advance—it never goes backward. This monotonicity property is essential for state machine safety. Paxos achieves this through the slot abstraction but must handle non-sequential commits.
How each protocol handles node recovery reveals significant architectural differences.
Paxos Recovery:
Basic Paxos doesn't define recovery procedures. Multi-Paxos implementations must handle:
- Learning values for slots the recovering node missed
- Filling holes, often by proposing no-op values for undecided slots
- Transferring snapshots when the log has been compacted
The lack of specification means recovery procedures vary widely between implementations.
ZAB Recovery (Synchronization Phase):
ZAB has a well-defined synchronization phase with three modes:
DIFF (Differential): the follower is behind but still within the leader's log, so the leader sends just the missing transactions
TRUNC (Truncate): the follower holds uncommitted transactions from a failed leader, so it truncates its log back to the leader's last committed zxid
SNAP (Snapshot): the follower is too far behind for an incremental sync, so the leader transfers a complete state snapshot
```python
# --- ZAB synchronization decision (new leader choosing a sync mode) ---
# Sketch: field and helper names are illustrative, not Zookeeper's API.

def zab_sync(follower_last_zxid, leader_state):
    if follower_last_zxid > leader_state.last_committed:
        # Follower has orphaned transactions from a failed leader
        return ("TRUNC", leader_state.last_committed)
    elif follower_last_zxid >= leader_state.min_log_zxid:
        # Within the leader's log: sync incrementally
        return ("DIFF", leader_state.log_after(follower_last_zxid))
    else:
        # Too far behind: send a full snapshot
        return ("SNAP", leader_state.snapshot)

# --- Raft AppendEntries-based catch-up (leader side) ---

def raft_catchup(follower, prev_log_index, prev_log_term, entries):
    response = send_append_entries(follower, prev_log_index,
                                   prev_log_term, entries)
    if response.success:
        # Follower's log matches: advance nextIndex past the sent entries
        return update_next_index(follower, prev_log_index + len(entries))
    else:
        # Log mismatch: back up and retry.
        # Optimization: response.conflict_index lets the leader
        # jump back in one step instead of decrementing by one.
        return decrement_and_retry(follower, response.conflict_index)

# Paxos typically doesn't standardize recovery; implementations differ.
```

Raft Recovery (Log Consistency Check):
Raft uses the AppendEntries consistency check for catch-up: each call carries the index and term of the entry preceding the new ones (prevLogIndex, prevLogTerm). If the follower's log doesn't contain a matching entry, it rejects the call and the leader retries with an earlier nextIndex until the logs agree.
Optimization: Instead of decrementing by one, followers can return conflict information that allows the leader to skip large gaps.
For far-behind followers: Raft uses InstallSnapshot RPC to transfer state.
Recovery Speed Comparison:
| Protocol | Common Case | Edge Cases | Note |
|---|---|---|---|
| Paxos | Varies | Varies | Implementation-dependent |
| ZAB | DIFF: Very fast | SNAP: Slow | Explicit sync modes help |
| Raft | Single round-trip | Decrement iterations | Conflict-index fast back-up available |
All three protocols require snapshot capabilities for practical deployments. Logs cannot grow unboundedly. ZAB and Raft explicitly define snapshot mechanisms; Paxos implementations must add them. Snapshot interaction with the consensus protocol is a common source of implementation bugs.
Performance comparisons must be made carefully—implementation quality often matters more than protocol choice. However, there are inherent differences worth understanding.
Message Complexity:
Paxos (Multi-Paxos, stable leader case): Accept out, Accepted back. Two messages per follower; learners are informed separately.
ZAB (Broadcast phase): PROPOSAL out, ACK back, then an explicit COMMIT. Three messages per follower.
Raft: AppendEntries out, response back. Two messages per follower; commit information piggybacks on later AppendEntries.
Raft has a slight message advantage because commit information piggybacks on AppendEntries.
| Metric | Paxos | ZAB | Raft |
|---|---|---|---|
| Messages per request (stable) | 2n (accept + response) | 3n (propose + ack + commit) | 2n (append + response) |
| Latency (network bound) | 2 RTT typical | 2 RTT typical | 2 RTT typical |
| Persistence writes per request | 1 per replica | 1 per replica | 1 per replica |
| Leader bottleneck | Can have multiple proposers | Single leader | Single leader |
| Read scaling | Implementation-dependent | Excellent (local reads) | Depends on implementation |
| Election speed | Varies widely | 2-3 rounds typical | 1-2 rounds typical |
Latency Considerations:
All three protocols have similar optimal-case latency:
1. Client sends the request to the leader
2. Leader persists the entry and replicates to followers in parallel
3. Followers persist and acknowledge
4. Leader commits and responds to the client
Total: 2 round-trip times minimum (client-leader + leader-follower)
The key latency factors are:
- Client-to-leader network RTT
- Leader-to-quorum network RTT (the slowest member of the fastest quorum)
- Disk sync (fsync) latency on the leader and followers
- Any batching delay the implementation introduces
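A back-of-envelope example with assumed numbers shows how these combine:

```python
# All values are assumptions for illustration (milliseconds)
client_leader_rtt = 0.5
leader_follower_rtt = 0.5
fsync = 1.0  # leader and followers each fsync once

# Leader fsyncs, then replicates; followers fsync before acking
commit_latency = client_leader_rtt + fsync + leader_follower_rtt + fsync
print(f"~{commit_latency:.1f} ms per write before batching")  # ~3.0 ms
```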
Throughput Considerations:
Single-leader protocols (ZAB, Raft): Throughput limited by leader capacity. Batching is critical.
Paxos with multiple proposers: Theoretically higher throughput via parallelism, but contention can reduce actual throughput.
Read throughput: ZAB excels here—reads are local, no consensus needed. Raft can do the same with relaxed consistency. Paxos varies by implementation.
In practice, a well-implemented Raft often outperforms a poorly-implemented Paxos (or vice versa). Protocol choice matters less than: quality of batching, memory management, disk sync optimization, network handling, and thread management. Choose based on understandability and ecosystem, then optimize the implementation.
Each protocol has established itself in particular niches based on its strengths.
When Paxos Makes Sense:
- You need flexibility the single-leader protocols don't offer (e.g., multiple proposers or per-slot consensus variants)
- You're building on infrastructure that already speaks Paxos
- You benefit from Paxos's large body of formal analysis

Notable Paxos Implementations:
- Google Chubby (lock service)
- Google Spanner (Paxos groups per data shard)
- Apache Cassandra (lightweight transactions)

When ZAB Makes Sense:
- You're adopting Zookeeper as a coordination service rather than implementing consensus yourself
- Your workload is read-heavy coordination, where ZAB's local reads shine
- You need per-client FIFO ordering as a native guarantee

Notable ZAB/Zookeeper Usage:
- Apache Kafka (controller and metadata management before KRaft)
- Apache Hadoop (HDFS NameNode high availability)
- Apache HBase (region coordination)

When Raft Makes Sense:
- You're building a new system that embeds consensus directly
- Implementation clarity and ease of onboarding matter
- You want mature libraries and reference implementations

Notable Raft Implementations:
- etcd (the backing store for Kubernetes)
- HashiCorp Consul
- TiKV and CockroachDB (per-range Raft groups)
The Modern Trend:
Raft has become the dominant choice for new implementations. Reasons include:
- Understandability lowers the cost of correct implementation and review
- The specification is complete enough to implement without guesswork
- High-quality open-source libraries exist in many languages
- A strong production track record in widely deployed systems
We've analyzed the three major consensus protocols across multiple dimensions. Let's consolidate the key insights:
- All three provide the same core safety guarantees; they differ in how they achieve them
- Paxos treats leadership as an optimization and tolerates per-slot divergence; ZAB and Raft mandate a leader and a gap-free log
- ZAB adds native FIFO client ordering and explicit sync modes; Raft adds the Log Matching Property and piggybacked commits
- Implementation quality usually dominates protocol choice, which makes understandability a genuine engineering advantage
What's Next:
Now that you understand how ZAB compares to other consensus protocols, the next page explores ZAB's use in Kafka and Hadoop—the real-world systems that rely on Zookeeper's atomic broadcast for their distributed coordination needs.
You now understand the fundamental differences between Paxos, ZAB, and Raft—their design philosophies, technical mechanisms, performance characteristics, and appropriate use cases. This knowledge enables you to reason about consensus protocol choices and understand the systems that rely on them.