In the landscape of distributed consensus, three protocols dominate production deployments: Paxos, Raft, and ZAB. Each emerged from different contexts, was designed with different priorities, and makes distinct trade-offs. Understanding their similarities and differences isn't merely academic—it's essential for choosing the right foundation for your distributed systems and for understanding the behavior of systems you already depend on.
Paxos is the grandfather of consensus protocols, mathematically elegant but notoriously difficult to implement. Raft was explicitly designed for understandability, decomposing consensus into digestible sub-problems. ZAB was purpose-built for Zookeeper's specific needs as a coordination service. All three achieve the same fundamental goal—distributed agreement—but they arrive there through remarkably different paths.
By the end of this page, you will understand the fundamental design philosophies of each protocol, their key technical differences, performance implications, and guidance for when each is most appropriate. You'll be able to reason about why a system uses one protocol over another.
Each protocol emerged from a specific context that shaped its design. Understanding this context illuminates why each protocol makes the choices it does.
Paxos (1989/1998)
Leslie Lamport developed Paxos to prove that consensus was possible in asynchronous systems with crash failures. The original "The Part-Time Parliament" paper used an allegory about a Greek parliament, making it somewhat obscure. The later "Paxos Made Simple" paper clarified the algorithm but retained its theoretical focus.
Design Goals:
- Prove that consensus is achievable in asynchronous systems with crash failures
- Mathematical rigor and generality over implementation guidance
- Minimal assumptions about the underlying network
ZAB (2007-2008)
ZAB was developed specifically for Apache Zookeeper, a coordination service that needed high-throughput atomic broadcast rather than general consensus. The designers studied Paxos extensively but found it insufficiently specified for their needs.
Design Goals:
- High-throughput atomic broadcast tailored to Zookeeper's workload
- Primary-order guarantees: a leader's proposals are delivered in exactly the order it issued them
- A recovery procedure specified precisely enough for implementers to follow
Raft (2013)
Diego Ongaro and John Ousterhout created Raft explicitly to address Paxos's understandability problems. Their paper "In Search of an Understandable Consensus Algorithm" emphasized pedagogy as a primary design goal.
Design Goals:
- Understandability as an explicit, first-class goal
- Decomposition into separable sub-problems: leader election, log replication, and safety
- A specification complete enough to implement directly
| Protocol | Year | Primary Author(s) | Original Context | Primary Focus |
|---|---|---|---|---|
| Paxos | 1989/1998 | Leslie Lamport | Theoretical research | Proving consensus possible, generality |
| ZAB | 2007-2008 | Yahoo! Research | Zookeeper coordination service | High-throughput atomic broadcast |
| Raft | 2013 | Ongaro, Ousterhout | Stanford research | Understandability, practical implementation |
The Understandability Factor:
Raft's designers conducted user studies showing that students understood Raft significantly faster than Paxos. This isn't just about learning speed—understandable protocols are:
- More likely to be implemented correctly
- Easier to extend with optimizations without breaking safety arguments
- Easier to debug and operate in production
ZAB, while more complex than Raft, was specified with enough detail that implementers could follow it. Paxos papers left many practical details unspecified, leading to many incompatible "Paxos" implementations.
Despite solving the same fundamental problem, the three protocols differ significantly in their core mechanics.
Leader Selection:
Paxos: Leadership is an optimization, not a requirement. Any node can be a proposer at any time. In practice, Multi-Paxos uses a stable leader for efficiency, but the protocol doesn't require it.
ZAB: Leadership is mandatory. Only the leader can broadcast proposals. Fast Leader Election is a defined sub-protocol that selects the node with the most complete history.
Raft: Leadership is mandatory, similar to ZAB. Leader election uses randomized timeouts to break symmetry. The elected leader is guaranteed to hold all committed entries (the Leader Completeness Property).
Log Replication Approach:
Paxos: Each log slot is decided independently. Proposals can be decided out of order, creating 'holes.' The state machine must handle gaps while waiting for earlier slots.
ZAB: Proposals are broadcast in strict order. No gaps are possible—if zxid N is committed, all zxids < N are committed. This is the 'prefix' property.
Raft: Similar to ZAB, log entries are replicated in order. The Log Matching Property ensures that if two logs have an entry with the same index and term, all preceding entries are identical.
| Mechanism | Paxos | ZAB | Raft |
|---|---|---|---|
| Leader requirement | Optional (optimization) | Mandatory | Mandatory |
| Leader selection | Higher proposal number wins | Fast Leader Election (highest zxid) | Randomized timeout election |
| Entry ordering | Per-slot consensus (can have holes) | Strict sequential (no holes) | Strict sequential (no holes) |
| Term/epoch concept | Proposal numbers (interleaved) | Epochs (discrete) | Terms (discrete) |
| Commit condition | Majority accepts proposal | Quorum ACKs proposal | Majority replicates entry |
| Safety approach | Proposal number ordering | Epoch + zxid ordering | Term + index + Log Matching |
Handling Log Divergence:
One critical difference is how each protocol handles divergent logs—situations where different nodes have different log entries for the same position.
Paxos: Divergence is natural since proposals can compete. The protocol resolves by having acceptors promise to higher proposal numbers. Old proposals can be "preempted."
ZAB: Divergence only occurs when a failed leader had uncommitted proposals. The new leader's synchronization phase detects this via zxid comparison and instructs followers to TRUNC (truncate) their logs.
Raft: Divergence occurs when followers have entries from old leaders. Raft resolves by having the leader's log be authoritative—followers delete conflicting entries and accept the leader's entries.
The key difference: Paxos handles divergence as a normal part of operation, while ZAB and Raft treat it as an error condition that only occurs across leader changes.
Raft's Log Matching Property states: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index. This is similar to ZAB's prefix consistency but expressed differently. Both ensure that committed state can never diverge.
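To make these divergence rules concrete, here is a minimal Python sketch of the follower-side logic Raft uses: reject the append if the consistency check fails, otherwise truncate any conflicting suffix and accept the leader's entries. The `Entry` type and function shape are illustrative, not any particular library's API.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    term: int
    command: str

def follower_append(log, prev_index, prev_term, new_entries):
    """Raft follower-side consistency check: reject on mismatch,
    truncate conflicts, then append the leader's entries.
    Indexes are 1-based as in the Raft paper; `log` is a Python list."""
    # Consistency check: our log must contain the leader's previous entry.
    if prev_index > 0:
        if len(log) < prev_index or log[prev_index - 1].term != prev_term:
            return False  # leader will back up nextIndex and retry
    # Resolve divergence: the leader's log is authoritative.
    for i, entry in enumerate(new_entries):
        pos = prev_index + i  # 0-based position of this entry in our log
        if pos < len(log) and log[pos].term != entry.term:
            del log[pos:]      # discard the whole conflicting suffix
        if pos >= len(log):
            log.append(entry)  # accept the leader's entry
    return True
```

Note that a conflicting entry is never overwritten in isolation; a term mismatch discards the entire suffix from that point, which is what keeps the Log Matching Property inductive.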
All three protocols provide the fundamental consensus guarantees, but with subtle differences in how they achieve them and what additional guarantees they provide.
Safety Guarantees (Always True):
Agreement: If any correct node commits value V at position P, all correct nodes eventually commit V at P. None of the protocols allow committed values to be lost.
Validity: Only proposed values can be decided. The protocols don't invent values—they only choose among proposed ones.
Integrity: Each node applies each transaction at most once. No duplicates in the applied log.
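A common way implementations enforce integrity is a monotonic last-applied cursor; a minimal sketch, where the `(key, value)` entry shape and names are assumptions for illustration:

```python
class StateMachine:
    def __init__(self):
        self.last_applied = 0  # index of the last entry applied
        self.data = {}

    def apply_committed(self, log, commit_index):
        # Advance at most once per entry, strictly in log order.
        while self.last_applied < commit_index:
            self.last_applied += 1
            key, value = log[self.last_applied - 1]  # entries as (key, value)
            self.data[key] = value
```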
Liveness Guarantees (Eventually True with Assumptions):
All three protocols require partial synchrony for liveness—they need network delays to be bounded eventually for the system to make progress. In practice:
- Paxos can livelock when competing proposers repeatedly preempt one another; implementations add leader leases or backoff
- ZAB and Raft can stall during partitions or leader failures, but resume once election timeouts fire reliably
- Safety is never at risk during these stalls; only progress is delayed
Additional Guarantees:
Paxos provides just the core consensus guarantees. FIFO ordering, total ordering, and primary ordering must be built on top.
ZAB provides atomic broadcast guarantees:
- Total order: all followers deliver transactions in the same order
- FIFO client order: requests from a single client are applied in the order sent
- Prefix consistency: if zxid N is delivered, every zxid < N has been delivered
Raft provides:
- Total ordering of log entries across the cluster
- The Leader Completeness Property: an elected leader holds every committed entry
- The Log Matching Property, which rules out divergent committed state
| Guarantee | Paxos | ZAB | Raft |
|---|---|---|---|
| Core consensus safety | ✓ | ✓ | ✓ |
| Total ordering (built-in) | Partial (per-slot) | ✓ (all transactions) | ✓ (log entries) |
| FIFO client ordering | ✗ (build on top) | ✓ (native) | ✗ (typically build on top) |
| Prefix consistency | ✗ (holes possible) | ✓ | ✓ |
| Leader completeness | N/A (no fixed leader) | ✓ | ✓ (formal property) |
| Exactly-once semantics | ✗ (add on) | ✓ | Depends on implementation |
FLP Impossibility and All Three Protocols:
The FLP impossibility result states that in an asynchronous system with even one crash failure, no deterministic protocol can guarantee consensus. All three protocols escape this result through similar mechanisms:
- Randomization: Raft's randomized election timeouts break symmetry non-deterministically
- Timeouts as failure detectors: ZAB and Multi-Paxos implementations rely on timing assumptions to suspect failed leaders
- Conditional liveness: all three guarantee termination only under partial synchrony, while safety holds unconditionally
In practice, all three protocols work well in real networks because modern networks are typically synchronous (bounded delays) with occasional asynchronous periods (network partitions, overload). The protocols tolerate the asynchronous periods by blocking progress but maintain safety throughout.
All three protocols prioritize safety over liveness. If forced to choose between possibly committing an incorrect value or blocking indefinitely, they block. This is the correct choice for coordination systems—incorrect coordination is worse than unavailable coordination.
Leader election is one of the most visible differences between the protocols. Each takes a different approach that reflects its design philosophy.
Paxos Leader Election:
Basic Paxos doesn't define leader election—any node can propose at any time. Multi-Paxos implementations typically add leader election, but it's not specified in the original papers. Common approaches include:
- Leader leases with timeouts
- Deterministic rules such as "highest node ID wins"
- Running a consensus instance to agree on the leader itself
ZAB Fast Leader Election:
ZAB defines a specific Fast Leader Election protocol:
- Nodes exchange votes containing (epoch, last zxid, server id)
- A node updates its vote when it sees a candidate with a higher zxid (ties broken by server id)
- Election completes when a quorum converges on the same vote
Raft Election:
- A follower that hears nothing from a leader within its randomized timeout becomes a candidate
- The candidate increments its term, votes for itself, and requests votes from all peers
- A candidate that gathers votes from a majority becomes leader; voters refuse candidates with less up-to-date logs
Election Speed and Stability:
Paxos: Without an explicit election protocol, "elections" happen whenever nodes contend. This can lead to live-lock where competing proposers preempt each other. Implementations add leader leases or backoff to stabilize.
ZAB: Fast Leader Election typically converges in 2-3 message rounds when there's a clear winner. The protocol explicitly favors nodes with the most complete history, reducing the need for post-election synchronization.
Raft: Elections use randomized timeouts (typically 150-300ms). In the common case, election completes in one round. Split votes trigger new elections with fresh random timeouts. Studies show Raft elections converge quickly in practice.
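A sketch of the randomized-timeout mechanic in Python, with the vote RPC supplied as a caller-provided stub since real transports vary; the node fields and bounds are assumptions:

```python
import random

ELECTION_TIMEOUT_MS = (150, 300)

def new_election_timeout():
    # Each follower picks a fresh random timeout, so timers rarely
    # expire simultaneously and split votes are uncommon.
    return random.uniform(*ELECTION_TIMEOUT_MS)

def run_election(node, request_vote):
    """On timeout, a follower becomes a candidate.
    `request_vote` is a caller-supplied RPC stub returning True/False."""
    node.current_term += 1
    node.voted_for = node.id
    votes = 1  # vote for self
    for peer in node.peers:
        if request_vote(peer, node.current_term,
                        node.last_log_index, node.last_log_term):
            votes += 1
    cluster_size = len(node.peers) + 1
    return votes > cluster_size // 2  # True => become leader
```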
The Log Completeness Requirement:
Both ZAB and Raft ensure the new leader has all committed entries:
ZAB: Fast Leader Election prefers higher zxids. The discovery phase additionally verifies and syncs if needed.
Raft: Voters reject candidates whose log is less up-to-date than their own. This voting restriction ensures only candidates holding all committed entries can win a majority (see the sketch below).
Paxos: Multi-Paxos leaders must learn committed values they might be missing. This can add latency to the post-election phase.
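Raft's voting restriction above boils down to a two-field comparison of each log's last entry; a minimal sketch with assumed names:

```python
def candidate_is_up_to_date(candidate_last_term, candidate_last_index,
                            my_last_term, my_last_index):
    # Higher last term wins; equal terms fall back to the longer log.
    # A voter grants its vote only if this returns True.
    if candidate_last_term != my_last_term:
        return candidate_last_term > my_last_term
    return candidate_last_index >= my_last_index
```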
Raft's original election algorithm can cause disruption when a partitioned node rejoins—it may trigger unnecessary elections with higher terms. The Pre-Vote optimization (now widely implemented) has nodes check if they would win before incrementing terms. ZAB's epoch mechanism provides similar protection inherently.
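A sketch of the Pre-Vote idea under the same assumptions as the election sketch above: run a trial election without touching persistent state, and only disrupt the cluster if we could actually win.

```python
def should_start_election(node, pre_vote):
    """Trial election: ask peers if they'd vote for us at term + 1,
    WITHOUT incrementing our term or touching persistent state.
    `pre_vote` is a caller-supplied RPC stub returning True/False."""
    prospective = 1  # our own would-be vote
    for peer in node.peers:
        if pre_vote(peer, node.current_term + 1,
                    node.last_log_index, node.last_log_term):
            prospective += 1
    return prospective > (len(node.peers) + 1) // 2
```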
The mechanics of how each protocol replicates log entries reveal fundamental design differences.
Paxos (Multi-Paxos) Replication:
- The stable leader skips the Prepare phase and sends Accept(slot, n, value) for each new slot
- Acceptors persist the value and reply Accepted
- A slot's value is chosen once a majority accepts; each slot is an independent consensus instance
Challenge: Holes can form in the log. If slot 10 is decided before slot 9, the state machine must wait for slot 9. This requires hole-filling mechanisms.
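A sketch of the apply loop such implementations typically need (structure hypothetical): the replica applies only the contiguous decided prefix and stops at the first hole. Leaders commonly fill lingering holes by proposing no-op values.

```python
def apply_ready_slots(decided, state_machine, next_to_apply):
    """`decided` maps slot number -> chosen value; slots may be
    decided out of order, so gaps ("holes") are possible."""
    # Apply only the contiguous prefix; stop at the first hole.
    while next_to_apply in decided:
        state_machine.apply(decided[next_to_apply])
        next_to_apply += 1
    return next_to_apply  # the first still-undecided slot
```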
ZAB Replication:
- The leader assigns the next zxid and broadcasts a PROPOSAL to all followers
- Followers write the proposal to their transaction log (write-ahead) and reply ACK
- After a quorum of ACKs, the leader broadcasts COMMIT and followers deliver the transaction
Key property: Strict ordering with no gaps. If zxid N is committed, all zxids < N are committed.
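This works because a Zookeeper zxid is a single 64-bit number: the epoch occupies the high 32 bits and a per-epoch counter the low 32 bits, so ordering across leader changes is plain integer comparison.

```python
def make_zxid(epoch, counter):
    return (epoch << 32) | (counter & 0xFFFFFFFF)

def epoch_of(zxid):
    return zxid >> 32

def counter_of(zxid):
    return zxid & 0xFFFFFFFF

# Any transaction from a later epoch orders after every transaction
# from an earlier epoch, regardless of counters:
assert make_zxid(epoch=2, counter=1) > make_zxid(epoch=1, counter=500)
```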
Raft Replication:
- The leader appends the entry to its own log and sends AppendEntries carrying prevLogIndex and prevLogTerm
- Followers verify the consistency check, append, persist, and respond
- Once a majority holds the entry, the leader advances commitIndex; the new commit point rides on subsequent AppendEntries
| Aspect | Paxos | ZAB | Raft |
|---|---|---|---|
| Entry identification | Slot number + proposal number | Epoch + counter (zxid) | Term + index |
| Ordering guarantee | Per-slot only | Global across all slots | Global across all slots |
| Gap handling | Must fill holes | No gaps by construction | No gaps by construction |
| Conflict resolution | Higher proposal wins | Leader is authoritative (TRUNC) | Leader is authoritative (overwrite) |
| Commit granularity | Per-slot | Per-transaction (with COMMIT msg) | Cumulative (commitIndex) |
| Persistence timing | Before accepting | Before ACK (write-ahead) | Before acknowledging AppendEntries |
The Commit Mechanism:
How commits are communicated differs significantly:
Paxos: There's no explicit commit message in basic Paxos. A value is "chosen" when a majority accepts it. Learners must observe majority acceptance or be told by a distinguished learner.
ZAB: Explicit COMMIT messages are sent once quorum ACKs are received. This provides clear commit semantics and enables followers to apply immediately upon receiving COMMIT.
Raft: The leader maintains a commitIndex (highest log index known to be replicated on majority). AppendEntries messages include leaderCommit, which followers use to advance their commit. No separate commit message—commit information piggybacks on normal replication.
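The follower side of Raft's piggybacked commit is a single piece of bookkeeping; a sketch assuming the AppendEntries consistency check already succeeded:

```python
def advance_commit_index(follower, leader_commit, last_new_entry_index):
    # leaderCommit piggybacks on AppendEntries; the follower may not
    # hold the leader's entire log yet, so it never commits past what
    # it has actually appended.
    if leader_commit > follower.commit_index:
        follower.commit_index = min(leader_commit, last_new_entry_index)
```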
Batching and Pipelining:
All three protocols support optimizations for throughput:
- Batching: many client requests packed into one proposal or AppendEntries call
- Pipelining: new proposals sent before acknowledgments for earlier ones return
- Group commit: a single fsync covering a whole batch of entries
These optimizations are crucial for production performance—replicating one entry at a time would be too slow for most workloads.
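A minimal batching sketch (queue-based; the names and batch cap are assumed): the leader drains whatever requests arrived while the previous round was in flight and replicates them in one consensus round.

```python
import queue

def replication_loop(requests: queue.Queue, send_batch, max_batch=1024):
    """Single consumer draining a request queue into consensus rounds."""
    while True:
        batch = [requests.get()]  # block until at least one request
        while len(batch) < max_batch and not requests.empty():
            batch.append(requests.get_nowait())  # opportunistic drain
        send_batch(batch)  # one proposal / AppendEntries for the batch
```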
In ZAB and Raft, the 'commit horizon' (highest committed entry) can only advance—it never goes backward. This monotonicity property is essential for state machine safety. Paxos achieves this through the slot abstraction but must handle non-sequential commits.
How each protocol handles node recovery reveals significant architectural differences.
Paxos Recovery:
Basic Paxos doesn't define recovery procedures. Multi-Paxos implementations must handle:
- Learning values for slots the recovering node missed
- Filling holes, often by proposing no-op values for undecided slots
- Transferring snapshots when the log has been compacted
The lack of specification means recovery procedures vary widely between implementations.
ZAB Recovery (Synchronization Phase):
ZAB has a well-defined synchronization phase with three modes:
DIFF (Differential): the follower is behind but still within the leader's log, so the leader sends just the missing transactions
TRUNC (Truncate): the follower holds uncommitted transactions from a failed leader, so it truncates its log back to the leader's last committed zxid
SNAP (Snapshot): the follower is too far behind for an incremental sync, so the leader transfers a complete state snapshot
```python
# --- ZAB synchronization decision (new leader choosing a sync mode) ---
# Sketch: field and helper names are illustrative, not Zookeeper's API.

def zab_sync(follower_last_zxid, leader_state):
    if follower_last_zxid > leader_state.last_committed:
        # Follower has orphaned transactions from a failed leader
        return ("TRUNC", leader_state.last_committed)
    elif follower_last_zxid >= leader_state.min_log_zxid:
        # Within the leader's log: sync incrementally
        return ("DIFF", leader_state.log_after(follower_last_zxid))
    else:
        # Too far behind: send a full snapshot
        return ("SNAP", leader_state.snapshot)

# --- Raft AppendEntries-based catch-up (leader side) ---

def raft_catchup(follower, prev_log_index, prev_log_term, entries):
    response = send_append_entries(follower, prev_log_index,
                                   prev_log_term, entries)
    if response.success:
        # Follower's log matches: advance nextIndex past the sent entries
        return update_next_index(follower, prev_log_index + len(entries))
    else:
        # Log mismatch: back up and retry.
        # Optimization: response.conflict_index lets the leader
        # jump back in one step instead of decrementing by one.
        return decrement_and_retry(follower, response.conflict_index)

# Paxos typically doesn't standardize recovery; implementations differ.
```

Raft Recovery (Log Consistency Check):
Raft uses the AppendEntries consistency check for catch-up: each call carries the index and term of the entry preceding the new ones (prevLogIndex, prevLogTerm). If the follower's log doesn't contain a matching entry, it rejects the call and the leader retries with an earlier nextIndex until the logs agree.
Optimization: Instead of decrementing by one, followers can return conflict information that allows the leader to skip large gaps.
For far-behind followers: Raft uses InstallSnapshot RPC to transfer state.
Recovery Speed Comparison:
| Protocol | Common Case | Edge Cases | Note |
|---|---|---|---|
| Paxos | Varies | Varies | Implementation-dependent |
| ZAB | DIFF: Very fast | SNAP: Slow | Explicit sync modes help |
| Raft | Single round-trip | Decrement iterations | Conflict-index fast back-up available |
All three protocols require snapshot capabilities for practical deployments. Logs cannot grow unboundedly. ZAB and Raft explicitly define snapshot mechanisms; Paxos implementations must add them. Snapshot interaction with the consensus protocol is a common source of implementation bugs.
Performance comparisons must be made carefully—implementation quality often matters more than protocol choice. However, there are inherent differences worth understanding.
Message Complexity:
Paxos (Multi-Paxos, stable leader case): Accept out, Accepted back. Two messages per follower; learners are informed separately.
ZAB (Broadcast phase): PROPOSAL out, ACK back, then an explicit COMMIT. Three messages per follower.
Raft: AppendEntries out, response back. Two messages per follower; commit information piggybacks on later AppendEntries.
Raft has a slight message advantage because commit information piggybacks on AppendEntries.
| Metric | Paxos | ZAB | Raft |
|---|---|---|---|
| Messages per request (stable) | 2n (accept + response) | 3n (propose + ack + commit) | 2n (append + response) |
| Latency (network bound) | 2 RTT typical | 2 RTT typical | 2 RTT typical |
| Persistence writes per request | 1 per replica | 1 per replica | 1 per replica |
| Leader bottleneck | Can have multiple proposers | Single leader | Single leader |
| Read scaling | Implementation-dependent | Excellent (local reads) | Depends on implementation |
| Election speed | Varies widely | 2-3 rounds typical | 1-2 rounds typical |
Latency Considerations:
All three protocols have similar optimal-case latency:
1. Client sends the request to the leader
2. Leader persists the entry and replicates to followers in parallel
3. Followers persist and acknowledge
4. Leader commits and responds to the client
Total: 2 round-trip times minimum (client-leader + leader-follower)
The key latency factors are:
- Client-to-leader network RTT
- Leader-to-quorum network RTT (the slowest member of the fastest quorum)
- Disk sync (fsync) latency on the leader and followers
- Any batching delay the implementation introduces
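A back-of-envelope example with assumed numbers shows how these combine:

```python
# All values are assumptions for illustration (milliseconds)
client_leader_rtt = 0.5
leader_follower_rtt = 0.5
fsync = 1.0  # leader and followers each fsync once

# Leader fsyncs, then replicates; followers fsync before acking
commit_latency = client_leader_rtt + fsync + leader_follower_rtt + fsync
print(f"~{commit_latency:.1f} ms per write before batching")  # ~3.0 ms
```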
Throughput Considerations:
Single-leader protocols (ZAB, Raft): Throughput limited by leader capacity. Batching is critical.
Paxos with multiple proposers: Theoretically higher throughput via parallelism, but contention can reduce actual throughput.
Read throughput: ZAB excels here—reads are local, no consensus needed. Raft can do the same with relaxed consistency. Paxos varies by implementation.
In practice, a well-implemented Raft often outperforms a poorly-implemented Paxos (or vice versa). Protocol choice matters less than: quality of batching, memory management, disk sync optimization, network handling, and thread management. Choose based on understandability and ecosystem, then optimize the implementation.
Each protocol has established itself in particular niches based on its strengths.
When Paxos Makes Sense:
- You need flexibility the single-leader protocols don't offer (e.g., multiple proposers or per-slot consensus variants)
- You're building on infrastructure that already speaks Paxos
- You benefit from Paxos's large body of formal analysis

Notable Paxos Implementations:
- Google Chubby (lock service)
- Google Spanner (Paxos groups per data shard)
- Apache Cassandra (lightweight transactions)

When ZAB Makes Sense:
- You're adopting Zookeeper as a coordination service rather than implementing consensus yourself
- Your workload is read-heavy coordination, where ZAB's local reads shine
- You need per-client FIFO ordering as a native guarantee

Notable ZAB/Zookeeper Usage:
- Apache Kafka (controller and metadata management before KRaft)
- Apache Hadoop (HDFS NameNode high availability)
- Apache HBase (region coordination)

When Raft Makes Sense:
- You're building a new system that embeds consensus directly
- Implementation clarity and ease of onboarding matter
- You want mature libraries and reference implementations

Notable Raft Implementations:
- etcd (the backing store for Kubernetes)
- HashiCorp Consul
- TiKV and CockroachDB (per-range Raft groups)
The Modern Trend:
Raft has become the dominant choice for new implementations. Reasons include:
- Understandability lowers the cost of correct implementation and review
- The specification is complete enough to implement without guesswork
- High-quality open-source libraries exist in many languages
- A strong production track record in widely deployed systems
We've analyzed the three major consensus protocols across multiple dimensions. Let's consolidate the key insights:
- All three provide the same core safety guarantees; they differ in how they achieve them
- Paxos treats leadership as an optimization and tolerates per-slot divergence; ZAB and Raft mandate a leader and a gap-free log
- ZAB adds native FIFO client ordering and explicit sync modes; Raft adds the Log Matching Property and piggybacked commits
- Implementation quality usually dominates protocol choice, which makes understandability a genuine engineering advantage
What's Next:
Now that you understand how ZAB compares to other consensus protocols, the next page explores ZAB's use in Kafka and Hadoop—the real-world systems that rely on Zookeeper's atomic broadcast for their distributed coordination needs.
You now understand the fundamental differences between Paxos, ZAB, and Raft—their design philosophies, technical mechanisms, performance characteristics, and appropriate use cases. This knowledge enables you to reason about consensus protocol choices and understand the systems that rely on them.