When you think about the systems that keep the modern internet running—Apache Kafka processing trillions of messages, Hadoop clusters coordinating petabytes of computation, distributed databases maintaining consistency across continents—there's an invisible force orchestrating it all. That force is Apache Zookeeper, and at its core lies one of the most elegant yet practical consensus protocols ever designed: ZAB (Zookeeper Atomic Broadcast).
ZAB isn't just another consensus algorithm. It's a protocol purpose-built for the unique demands of coordination services—systems that must provide total order, exactly-once delivery, and strong consistency while maintaining the performance characteristics necessary for production workloads. Where Paxos provides theoretical foundations and Raft offers understandability, ZAB delivers the precise guarantees needed by coordination services: primary-backup replication with atomic broadcast semantics.
By the end of this page, you will understand ZAB's design philosophy, its two operational modes, how it achieves atomic broadcast guarantees, and why these properties make it uniquely suited for coordination services. You'll see how ZAB forms the foundation upon which critical distributed systems infrastructure is built.
Before diving into ZAB's mechanics, we must understand why the Zookeeper team didn't simply adopt Paxos or an existing consensus algorithm. The answer lies in the specific requirements of coordination services—requirements that existing protocols didn't fully address.
The Coordination Service Challenge:
Zookeeper was designed to be a high-performance coordination kernel for distributed systems. It needed to handle millions of read requests per second while maintaining strict ordering guarantees for write operations. More importantly, it needed to support the patterns that coordination services require:
| Requirement | Traditional Consensus (e.g., Multi-Paxos) | ZAB's Approach |
|---|---|---|
| Primary ordering | Any replica can propose in theory | Single designated leader orders all updates |
| State synchronization | Log-based catch-up | Full state snapshot + incremental log |
| FIFO client guarantees | Not built-in | Native FIFO ordering per client |
| Prefix consistency | Requires additional protocols | Built into atomic broadcast |
| Recovery efficiency | Can require many message rounds | Single-round recovery in common case |
The Atomic Broadcast Distinction:
At its heart, ZAB is an atomic broadcast protocol rather than a pure consensus protocol. This distinction is subtle but profound:
Consensus solves the problem of getting distributed nodes to agree on a single value. Once agreement is reached on value V, the consensus instance is complete.
Atomic Broadcast solves the problem of delivering a sequence of messages to all nodes in exactly the same order. Every message must be delivered exactly once, and all nodes must see the same sequence of messages.
Zookeeper needs atomic broadcast because it's maintaining a replicated state machine. Clients issue a stream of state-changing operations, and every replica must apply those operations in exactly the same order to maintain consistency. ZAB provides precisely this guarantee.
Think of atomic broadcast as 'sequenced consensus'—it's like running consensus repeatedly for each message, but with the additional guarantee that all decided messages are delivered in a globally consistent order. ZAB optimizes this pattern specifically for the primary-backup replication model that coordination services use.
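To make the distinction concrete, here is a minimal, illustrative Java sketch contrasting the two abstractions. The interface names are ours, not Zookeeper's: consensus decides one value, while atomic broadcast delivers an ordered stream of messages to every node.

```java
// Illustrative only: these interfaces are not part of Zookeeper's API.

// Consensus: all correct nodes agree on a single value, once.
interface Consensus<V> {
    // Propose a value; returns the single value the group agreed on.
    V decide(V proposal) throws InterruptedException;
}

// Atomic broadcast: all correct nodes deliver the same messages,
// in the same order, message after message.
interface AtomicBroadcast<M> {
    // Submit a message for ordering.
    void broadcast(M message);

    // Register a callback that receives every committed message,
    // in the same global order on every node.
    void onDeliver(java.util.function.Consumer<M> deliver);
}
```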
ZAB operates in two distinct phases, each designed to handle different system states. Understanding these phases is crucial to grasping how ZAB maintains correctness while maximizing availability.
Phase 1: Recovery (Discovery + Synchronization)
The recovery phase executes whenever the ensemble must establish or re-establish a leader. This happens at startup, after leader failure, or when network partitions heal. Recovery has two sub-phases:
Discovery: Nodes elect a prospective leader, which then discovers the most up-to-date transaction history among a quorum of followers.
Synchronization: The leader synchronizes its state with followers, ensuring every follower has a consistent prefix of committed transactions before broadcasting begins.
Phase 2: Broadcast
Once recovery completes successfully, ZAB enters the broadcast phase—its steady-state operation mode. In this phase:

- The leader accepts state-change requests from clients and assigns each one the next zxid.
- The leader proposes each transaction to its followers, which log it durably and acknowledge it.
- Once a quorum has acknowledged a proposal, the leader commits it and all replicas apply it in zxid order.
The Critical Invariant:
ZAB's correctness depends on a fundamental invariant: a new leader must have all committed transactions from previous epochs before starting broadcast. This ensures that no committed state is ever lost, even across leader changes.
This invariant is enforced through the epoch mechanism and quorum intersection. Since any quorum in the current epoch must overlap with the quorum that committed the previous epoch's transactions, the new leader will always discover all committed history during the discovery phase.
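The overlap argument is simple arithmetic. The following small Java sketch (illustrative, not Zookeeper code) shows that any two majority quorums of an ensemble must share at least one member, which is why a new leader's discovery quorum always contains at least one node that saw every previously committed transaction.

```java
public class QuorumMath {
    // Majority quorum size for an ensemble of n nodes.
    static int quorumSize(int n) {
        return n / 2 + 1;
    }

    // Minimum number of members shared by any two majority quorums.
    // By inclusion-exclusion, at least |Q1| + |Q2| - n members must be shared.
    static int minOverlap(int n) {
        return 2 * quorumSize(n) - n;
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 5, 7}) {
            System.out.printf("n=%d quorum=%d min overlap=%d%n",
                    n, quorumSize(n), minOverlap(n));
        }
        // The overlap is always >= 1, so a committed transaction (acked by one
        // quorum) is always visible to at least one node in the next leader's quorum.
    }
}
```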
Moving from broadcast to recovery represents a significant performance disruption. During recovery, the cluster cannot process new write requests. This is why ZAB is optimized to make the broadcast phase as stable as possible and why proper timeout tuning and network stability are crucial for Zookeeper deployments.
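The timeouts that control how quickly followers give up on a leader, and therefore how often the ensemble re-enters recovery, are set in zoo.cfg. A minimal sketch with illustrative values (not recommendations):

```
# zoo.cfg (illustrative values)
tickTime=2000        # base time unit in ms; heartbeats are sent per tick
initLimit=10         # ticks a follower may take to connect and sync with a leader
syncLimit=5          # ticks a follower may fall behind on heartbeats before being dropped
dataDir=/var/lib/zookeeper
clientPort=2181

# 5-node ensemble: host:quorumPort:electionPort
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
server.4=zk4:2888:3888
server.5=zk5:2888:3888
```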
ZAB uses a sophisticated numbering system to uniquely identify and order all transactions across the distributed system. This system consists of epochs and zxids (Zookeeper Transaction IDs).
Epoch: The Leadership Era
An epoch is a 32-bit number that identifies a particular leadership term. Every time a new leader is elected, the epoch is incremented. Epochs serve several critical purposes:

- They distinguish transactions proposed under different leaders.
- They let followers reject proposals from a stale leader whose epoch has been superseded.
- They form the high-order bits of every zxid, so transactions from a newer epoch always order after those of older epochs.
Zxid: The Global Transaction Ordering
A zxid is a 64-bit number composed of two 32-bit parts: the leader's epoch in the high 32 bits and a per-epoch transaction counter in the low 32 bits. The counter starts at zero in each new epoch and increments for every transaction the leader proposes.
```java
// ZXID Structure in Zookeeper
// 64-bit number: [epoch (32 bits)] [counter (32 bits)]

public class ZxidUtils {

    // Extract epoch from zxid
    public static int getEpochFromZxid(long zxid) {
        return (int) (zxid >> 32);
    }

    // Extract counter from zxid
    public static int getCounterFromZxid(long zxid) {
        return (int) (zxid & 0xFFFFFFFFL);
    }

    // Create zxid from epoch and counter
    public static long makeZxid(int epoch, int counter) {
        return ((long) epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    // Compare zxids for ordering
    // This works because epoch is in the high bits
    public static int compare(long zxid1, long zxid2) {
        // Higher epoch always wins
        // Within the same epoch, higher counter wins
        return Long.compare(zxid1, zxid2);
    }

    /*
     * Example ZXID values:
     *
     * Epoch 1, Transaction 1:   0x0000000100000001
     * Epoch 1, Transaction 2:   0x0000000100000002
     * Epoch 2, Transaction 1:   0x0000000200000001  <- Always > any Epoch 1 zxid
     * Epoch 2, Transaction 100: 0x0000000200000064
     *
     * This ensures total ordering even across leader changes!
     */
}
```

Why This Structure Matters:
The genius of this zxid structure is that comparison is trivial—it's just a 64-bit integer comparison! Yet it encodes complete ordering information: a transaction from a later epoch always compares greater than any transaction from an earlier epoch, and within an epoch the counter preserves the order in which the leader proposed transactions.
This structure enables efficient log comparison during recovery. When a new leader syncs with followers, it can quickly determine exactly which transactions each follower is missing by comparing their last zxid with its own log.
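As a concrete illustration, here is a small usage sketch of the ZxidUtils helpers above (the values are made up for the example):

```java
public class ZxidDemo {
    public static void main(String[] args) {
        long lastOnFollower = ZxidUtils.makeZxid(1, 250);  // epoch 1, transaction 250
        long lastOnLeader   = ZxidUtils.makeZxid(2, 3);    // epoch 2, transaction 3

        // A single long comparison is enough to order them:
        // anything from epoch 2 sorts after anything from epoch 1.
        System.out.println(ZxidUtils.compare(lastOnFollower, lastOnLeader) < 0); // true

        // The leader can also see exactly where the follower stopped.
        System.out.printf("follower is at epoch %d, counter %d%n",
                ZxidUtils.getEpochFromZxid(lastOnFollower),
                ZxidUtils.getCounterFromZxid(lastOnFollower));
    }
}
```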
The 32-bit counter allows approximately 4 billion transactions per epoch. In practice, leader changes happen long before a single epoch accumulates that many transactions, so overflow is rarely a concern. Zookeeper nevertheless handles the edge case by forcing a new leader election (and thus a new epoch) before the counter wraps.
The broadcast phase is where ZAB spends most of its time in a healthy cluster. Understanding this protocol is essential for reasoning about Zookeeper's consistency guarantees and performance characteristics.
The Two-Phase Commit Flow:
ZAB's broadcast uses a variant of two-phase commit optimized for the primary-backup model:
Phase 1: Propose

- The leader assigns the next zxid to the client's state-change request and sends a PROPOSAL to all followers.
- Each follower appends the proposal to its transaction log, syncs it to disk, and replies with an ACK.
Phase 2: Commit

- Once the leader has ACKs from a quorum (counting itself), it sends a COMMIT message for that zxid.
- Followers deliver committed transactions to their state machines in zxid order.
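The following Java sketch captures the leader's side of this flow under our own simplifying assumptions (message transport, persistence, and failure handling are omitted); it is not Zookeeper's actual server code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Leader-side broadcast bookkeeping, greatly simplified.
public class LeaderBroadcast {
    private final int quorum;
    private final long epoch;
    private int counter = 0;

    // zxid -> set of server ids that have acknowledged the proposal
    private final Map<Long, Set<Long>> acks = new HashMap<>();

    public LeaderBroadcast(int ensembleSize, long epoch) {
        this.quorum = ensembleSize / 2 + 1;
        this.epoch = epoch;
    }

    // Phase 1: assign the next zxid and (conceptually) send the proposal.
    public long propose(byte[] txn) {
        counter++;
        long zxid = (epoch << 32) | (counter & 0xFFFFFFFFL);
        acks.put(zxid, new HashSet<>());
        // sendToAllFollowers(PROPOSAL, zxid, txn);   // transport omitted
        onAck(zxid, 0L);                              // the leader logs and acks its own proposal
        return zxid;
    }

    // Phase 2: count ACKs; commit once a quorum has logged the proposal.
    public void onAck(long zxid, long serverId) {
        Set<Long> ackers = acks.get(zxid);
        if (ackers == null) return;                   // already committed or unknown zxid
        ackers.add(serverId);
        if (ackers.size() >= quorum) {
            acks.remove(zxid);
            // sendToAllFollowers(COMMIT, zxid);      // transport omitted
            System.out.printf("committed zxid 0x%016x%n", zxid);
        }
    }
}
```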
Critical Properties of the Broadcast Protocol:
1. FIFO Ordering The leader sends proposals in the order it assigns zxids. Followers process proposals in the order received. Since there's only one leader and TCP guarantees in-order delivery, all followers see proposals in the same order.
2. Prefix Consistency If a follower has committed a transaction with zxid Z, it has committed all transactions with zxid < Z. This is the "prefix" property—committed logs are always prefixes of each other (a small sketch of this check follows the list of properties below).
3. Quorum Acknowledgment The leader waits for acknowledgment from a majority (quorum) before committing. For a 5-node cluster, quorum is 3. This means:

- The cluster keeps committing writes as long as at most 2 of the 5 nodes have failed.
- Every committed transaction is durably logged on at least 3 nodes, so any future leader-election quorum includes at least one node that has it.
4. Reliable Delivery Once a transaction is committed, it will eventually be delivered to all non-failed nodes. Followers that were temporarily down will catch up during synchronization.
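As a small illustration of the prefix property, the sketch below checks that one replica's committed zxid sequence is a prefix of another's. It is a test-style helper of our own, not Zookeeper code.

```java
import java.util.List;

public class PrefixCheck {
    // Returns true if the shorter committed-zxid sequence is a prefix of the
    // longer one, i.e. the two replicas never disagree on any committed position.
    static boolean isPrefixConsistent(List<Long> committedA, List<Long> committedB) {
        List<Long> shorter = committedA.size() <= committedB.size() ? committedA : committedB;
        List<Long> longer  = committedA.size() <= committedB.size() ? committedB : committedA;
        return longer.subList(0, shorter.size()).equals(shorter);
    }

    public static void main(String[] args) {
        List<Long> replica1 = List.of(0x100000001L, 0x100000002L, 0x100000003L);
        List<Long> replica2 = List.of(0x100000001L, 0x100000002L);
        System.out.println(isPrefixConsistent(replica1, replica2)); // true: replica2 is a prefix
    }
}
```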
Traditional three-phase commit was designed to let participants make progress even when the coordinator fails mid-protocol. ZAB's two-phase approach works because it relies on quorums and leader election rather than timeouts: if the leader fails mid-protocol, a new leader is elected and the recovery phase ensures consistency. This is simpler and more efficient than 3PC.
ZAB provides three fundamental guarantees that together constitute atomic broadcast. These guarantees form the contract that Zookeeper relies upon to provide its consistency semantics.
Guarantee 1: Agreement
If a correct (non-failed) process delivers message m, then all correct processes eventually deliver m.
This means that once any node commits a transaction, you can be certain that all other nodes will eventually commit the same transaction. There's no possibility of one replica having transaction T committed while another replica permanently lacks it.
Guarantee 2: Total Order
If correct processes p and q both deliver messages m1 and m2, then p delivers m1 before m2 if and only if q delivers m1 before m2.
This is perhaps the most critical guarantee. All replicas apply transactions in exactly the same order. This means their state machines evolve identically, maintaining perfect replica consistency.
Guarantee 3: Local Primary Order
If a primary broadcasts m1 before m2, then a correct process that delivers m2 must have already delivered m1.
This ensures that a client's operations are applied in the order the client issued them. If a client issues operations A, then B, then C, they are applied in that order on all replicas.
The Relationship to Linearizability:
ZAB provides sequential consistency for writes—all writes appear in a total order that respects the order issued by each client. However, reads in Zookeeper are served locally by any replica, which means they might not see the most recent writes.
To achieve linearizable reads (read-your-writes and reading the latest value), Zookeeper provides the sync operation: a client issues sync() on a path before a read, which forces the replica it is connected to to apply all transactions committed up to that point before the read is served.
This design trades some consistency guarantees on reads for dramatically better read performance—reads don't require any inter-node communication.
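For example, with the standard ZooKeeper Java client, a caller that needs an up-to-date value can issue a sync() before the read. The connection string, path, and timeout below are placeholders, and error handling is reduced to a minimum.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SyncedRead {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string and session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { });

        String path = "/config/feature-flag";   // hypothetical znode
        CountDownLatch synced = new CountDownLatch(1);

        // sync() asks the server this client is connected to
        // to catch up with the leader before we read.
        zk.sync(path, (rc, p, ctx) -> synced.countDown(), null);
        synced.await();

        Stat stat = new Stat();
        byte[] data = zk.getData(path, false, stat);
        System.out.println("value: " + new String(data) + " (mzxid " + stat.getMzxid() + ")");

        zk.close();
    }
}
```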
In CAP theorem terms, ZAB chooses Consistency over Availability during partitions. If a Zookeeper cluster loses quorum (more than half the nodes are unavailable), it cannot process write requests. This is the correct trade-off for a coordination service—incorrect coordination data is worse than unavailable coordination data.
When Zookeeper nodes need to synchronize—whether during initial startup, after a follower restarts, or during leader election—ZAB uses the zxid to efficiently bring lagging nodes up to date.
The Synchronization Decision Process:
When a follower connects to a leader, the leader examines the follower's last committed zxid and determines the synchronization strategy:
Case 1: DIFF (Differential Sync) If the follower's last zxid is in the leader's transaction log and the gap is small: the leader sends only the proposals and commits the follower is missing, and the follower replays them to catch up.
Case 2: TRUNC (Truncation) If the follower has transactions beyond the leader's committed point (orphaned transactions from a failed leader): the leader instructs the follower to truncate its log back to the leader's last committed zxid, discarding the uncommitted entries.
Case 3: SNAP (Snapshot) If the follower is too far behind or its log is incompatible: the leader transfers a full snapshot of its current state, and the follower replaces its own state with it before joining the broadcast phase.
```
FUNCTION determineSync(followerLastZxid, leaderState):
    // Extract epoch and counter from follower's last zxid
    followerEpoch   = epoch(followerLastZxid)
    followerCounter = counter(followerLastZxid)

    // Check if follower is from a future epoch (impossible in correct operation)
    IF followerEpoch > leaderState.currentEpoch:
        REJECT connection (follower is misconfigured or byzantine)

    // Check if follower has uncommitted transactions from a failed leader
    IF followerLastZxid > leaderState.lastCommittedZxid:
        IF followerLastZxid in leaderState.log:
            // Follower saw proposal but leader didn't commit
            // These were not committed, safe to lose
            RETURN TRUNC to leaderState.lastCommittedZxid
        ELSE:
            // Follower was part of a different leader's quorum
            // Need full resync
            RETURN SNAP

    // Check if differential sync is possible
    IF followerLastZxid >= leaderState.minLogZxid:
        // Follower's last transaction is still in the log
        missingTxns = leaderState.log[followerLastZxid:]
        IF size(missingTxns) < DIFF_THRESHOLD:
            RETURN DIFF with missingTxns
        ELSE:
            // Too many missing, snapshot is more efficient
            RETURN SNAP

    // Follower is beyond log retention
    RETURN SNAP
```

Log Retention and Snapshots:
Zookeeper maintains a transaction log for recent operations and periodic snapshots of the complete state. The log retention policy is configurable and affects sync behavior: longer retention lets more followers catch up with a cheap DIFF sync, while shorter retention forces more followers onto the expensive SNAP path.
In practice, production deployments tune this based on expected failure patterns. If nodes typically recover quickly, short log retention works well. If network partitions can last hours, longer retention reduces recovery time.
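The main knobs live in zoo.cfg as well; an illustrative sketch (values are examples, not recommendations):

```
# Snapshot and transaction-log housekeeping (illustrative values)
snapCount=100000                       # transactions logged between automatic snapshots
autopurge.snapRetainCount=3            # snapshots (and matching logs) to keep
autopurge.purgeInterval=24             # hours between purge runs (0 disables purging)
dataLogDir=/var/lib/zookeeper-txnlog   # dedicated disk for the transaction log
```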
TRUNC sync represents a subtle but important case: a follower may have logged transactions that were never committed. This happens when a leader fails after proposing but before committing. The new leader's quorum didn't include this follower, so those transactions are 'orphaned.' ZAB correctly identifies and discards them, ensuring only truly committed transactions persist.
While ZAB and Paxos solve related problems, they have fundamental differences in design philosophy and guarantees. Understanding these differences helps clarify when each protocol is appropriate.
Design Philosophy:
Paxos: Designed as a general consensus primitive. It answers: "How do N nodes agree on a single value?" Multi-Paxos extends this to sequences of values but fundamentally views each slot independently.
ZAB: Designed specifically for primary-backup replication. It answers: "How does a single leader reliably broadcast a sequence of updates to followers?" The sequence nature is fundamental, not an extension.
Key Technical Differences:
| Aspect | Paxos | ZAB |
|---|---|---|
| Core abstraction | Consensus on single value | Atomic broadcast (sequence of values) |
| Leadership | Leader is optimization; any node can propose | Leader is mandatory; only leader broadcasts |
| Proposal ordering | Out-of-order proposals possible | Strict FIFO ordering required |
| Epoch handling | Proposal numbers can interleave | New epoch requires full recovery |
| Gaps in sequence | Allowed (holes can exist) | Not allowed (prefix must be complete) |
| Recovery focus | Per-slot recovery | Log prefix recovery |
| Client ordering | Not built-in | FIFO per client guaranteed |
| Primary use case | State machine replication | Coordination services |
The Hole Problem:
One critical difference involves "holes" in the transaction sequence. In Multi-Paxos:

- Slots are decided independently, so slot 7 can be decided before slot 5.
- Holes can exist temporarily, and the system must detect and fill them (for example, by proposing no-ops) before executing later slots.
In ZAB:

- The leader proposes transactions in strict zxid order over FIFO channels, so a follower never acknowledges zxid N+1 before it has logged zxid N.
- Committed logs therefore never contain gaps: the prefix property holds at all times.
The FIFO Guarantee:
ZAB's FIFO ordering per client is crucial for coordination patterns. Consider a client that:

1. Creates a configuration znode.
2. Writes configuration data to it.
3. Creates a "ready" znode signaling that the configuration is complete.
ZAB guarantees these operations are applied in order. If operation 3 were applied before operation 2, data corruption could result. Paxos doesn't inherently provide this guarantee—it would need to be built on top.
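Here is roughly what that pattern looks like with the ZooKeeper Java client; the paths and data are placeholders. Because the client's requests are FIFO-ordered, no other client can observe the ready znode before the configuration data is in place.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ConfigPublisher {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { });

        // 1. Create the configuration znode.
        zk.create("/app/config", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // 2. Write the configuration data.
        zk.setData("/app/config", "db=primary;pool=32".getBytes(), -1);

        // 3. Signal readiness. ZAB's per-client FIFO ordering guarantees this
        //    is applied only after steps 1 and 2 on every replica.
        zk.create("/app/config/ready", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        zk.close();
    }
}
```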
Atomic broadcast and consensus are theoretically equivalent in asynchronous systems—each can implement the other. However, a protocol designed specifically for atomic broadcast (like ZAB) is more efficient for repeated message ordering than using consensus repeatedly. The optimizations are baked into the protocol design.
Understanding ZAB's performance profile is essential for deploying and tuning Zookeeper effectively. The protocol's design creates specific performance characteristics that affect system architecture decisions.
Write Latency:
Every write in ZAB requires:

1. A proposal message from the leader to the followers.
2. Each follower syncing the proposal to its transaction log before acknowledging.
3. ACKs from a quorum back to the leader, followed by a COMMIT message.
The critical path for latency is: network RTT + disk fsync + network RTT
For a 3-node cluster with 1ms network latency and 5ms fsync, the critical path works out to roughly 1 ms (RTT) + 5 ms (fsync) + 1 ms (RTT) ≈ 7 ms per write, consistent with the 2-10 ms range in the table below.
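A back-of-the-envelope calculator for this, treating the formula and example numbers above as rough assumptions:

```java
public class WriteLatencyEstimate {
    // Critical path from the text: network RTT + disk fsync + network RTT.
    static double estimateWriteMillis(double networkRttMs, double fsyncMs) {
        return networkRttMs + fsyncMs + networkRttMs;
    }

    public static void main(String[] args) {
        // ~1 ms network round trip, ~5 ms fsync (the example above)
        System.out.println(estimateWriteMillis(1.0, 5.0) + " ms");  // 7.0 ms

        // Same cluster with a ~10x slower fsync (see the note on disks below)
        System.out.println(estimateWriteMillis(1.0, 50.0) + " ms"); // 52.0 ms
    }
}
```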
Read Latency:
Reads are served locally without any inter-node communication:
| Operation | Latency Components | Typical Latency | Scalability |
|---|---|---|---|
| Write | 2 × network RTT + disk sync | 2-10ms | Limited by leader throughput |
| Read (follower) | Local memory access | < 1ms | Scales with replicas |
| Read + sync | Network RTT + catchup | 2-5ms | Limited by leader |
| Leader election | Discovery + sync + broadcast setup | seconds | Bounded by slowest node |
| Follower sync (DIFF) | Network + log replay | ms-seconds | Depends on gap size |
| Follower sync (SNAP) | Network + snapshot transfer | seconds-minutes | Depends on state size |
Throughput Considerations:
Write Throughput:

- All writes flow through the single leader, so write throughput is bounded by the leader's CPU, network, and above all its disk fsync rate. Adding nodes does not increase write throughput; it slightly reduces it because the quorum grows.
Read Throughput:

- Reads are served locally by any replica without leader involvement, so aggregate read throughput scales roughly linearly with the number of replicas.
The 5-Node Sweet Spot:
Most production Zookeeper deployments use 5 nodes:

- A quorum of 3 tolerates 2 simultaneous node failures.
- One node can be taken down for maintenance while still leaving room for an unplanned failure.
- The quorum is still small enough to keep write latency low.
Going beyond 5 nodes has diminishing returns—write latency increases (larger quorum), and the additional fault tolerance is rarely needed.
ZAB's performance is heavily influenced by disk performance because the protocol requires syncing proposals to disk before acknowledging. Using SSDs instead of HDDs can reduce write latency by 10x. Dedicated disks for the transaction log (separate from snapshots) prevent I/O contention.
We've explored the foundations of ZAB—the atomic broadcast protocol that makes Zookeeper possible. Let's consolidate the essential knowledge:

- ZAB is an atomic broadcast protocol purpose-built for primary-backup replication, not a general consensus primitive.
- It alternates between two phases: recovery (discovery + synchronization) and broadcast (steady-state two-phase commit with quorum acknowledgment).
- Epochs and zxids give every transaction a globally comparable 64-bit identity that survives leader changes.
- Its guarantees (agreement, total order, and local primary order) are exactly what a coordination service needs.
- Writes pay for ordering with leader round trips and disk syncs; reads are local and scale with the number of replicas.
What's Next:
Now that you understand ZAB's core mechanics, the next page dives into leader-based ordering—how ZAB leverages a single leader to simplify the protocol, how leader election works, and what happens during leader transitions. This is where the rubber meets the road for understanding Zookeeper's behavior under failure conditions.
You now understand the foundational concepts of ZAB: its role as an atomic broadcast protocol, its two operational phases, the epoch/zxid numbering system, the broadcast protocol mechanics, and its performance characteristics. Next, we'll explore how leader-based ordering works and why it's central to ZAB's design.