When you think about the systems that keep the modern internet running—Apache Kafka processing trillions of messages, Hadoop clusters coordinating petabytes of computation, distributed databases maintaining consistency across continents—there's an invisible force orchestrating it all. That force is Apache Zookeeper, and at its core lies one of the most elegant yet practical consensus protocols ever designed: ZAB (Zookeeper Atomic Broadcast).
ZAB isn't just another consensus algorithm. It's a protocol purpose-built for the unique demands of coordination services—systems that must provide total order, exactly-once delivery, and strong consistency while maintaining the performance characteristics necessary for production workloads. Where Paxos provides theoretical foundations and Raft offers understandability, ZAB delivers the precise guarantees needed by coordination services: primary-backup replication with atomic broadcast semantics.
By the end of this page, you will understand ZAB's design philosophy, its two operational modes, how it achieves atomic broadcast guarantees, and why these properties make it uniquely suited for coordination services. You'll see how ZAB forms the foundation upon which critical distributed systems infrastructure is built.
Before diving into ZAB's mechanics, we must understand why the Zookeeper team didn't simply adopt Paxos or an existing consensus algorithm. The answer lies in the specific requirements of coordination services—requirements that existing protocols didn't fully address.
The Coordination Service Challenge:
Zookeeper was designed to be a high-performance coordination kernel for distributed systems. It needed to handle millions of read requests per second while maintaining strict ordering guarantees for write operations. More importantly, it needed to support the patterns that coordination services require:
| Requirement | Traditional Consensus (e.g., Multi-Paxos) | ZAB's Approach |
|---|---|---|
| Primary ordering | Any replica can propose in theory | Single designated leader orders all updates |
| State synchronization | Log-based catch-up | Full state snapshot + incremental log |
| FIFO client guarantees | Not built-in | Native FIFO ordering per client |
| Prefix consistency | Requires additional protocols | Built into atomic broadcast |
| Recovery efficiency | Can require many message rounds | Single-round recovery in common case |
The Atomic Broadcast Distinction:
At its heart, ZAB is an atomic broadcast protocol rather than a pure consensus protocol. This distinction is subtle but profound:
Consensus solves the problem of getting distributed nodes to agree on a single value. Once agreement is reached on value V, the consensus instance is complete.
Atomic Broadcast solves the problem of delivering a sequence of messages to all nodes in exactly the same order. Every message must be delivered exactly once, and all nodes must see the same sequence of messages.
Zookeeper needs atomic broadcast because it's maintaining a replicated state machine. Clients issue a stream of state-changing operations, and every replica must apply those operations in exactly the same order to maintain consistency. ZAB provides precisely this guarantee.
Think of atomic broadcast as 'sequenced consensus'—it's like running consensus repeatedly for each message, but with the additional guarantee that all decided messages are delivered in a globally consistent order. ZAB optimizes this pattern specifically for the primary-backup replication model that coordination services use.
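To make the distinction concrete, here is a minimal, illustrative Java sketch contrasting the two abstractions. The interface names are ours, not Zookeeper's: consensus decides one value, while atomic broadcast delivers an ordered stream of messages to every node.

```java
// Illustrative only: these interfaces are not part of Zookeeper's API.

// Consensus: all correct nodes agree on a single value, once.
interface Consensus<V> {
    // Propose a value; returns the single value the group agreed on.
    V decide(V proposal) throws InterruptedException;
}

// Atomic broadcast: all correct nodes deliver the same messages,
// in the same order, message after message.
interface AtomicBroadcast<M> {
    // Submit a message for ordering.
    void broadcast(M message);

    // Register a callback that receives every committed message,
    // in the same global order on every node.
    void onDeliver(java.util.function.Consumer<M> deliver);
}
```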
ZAB operates in two distinct phases, each designed to handle different system states. Understanding these phases is crucial to grasping how ZAB maintains correctness while maximizing availability.
Phase 1: Recovery (Discovery + Synchronization)
The recovery phase executes whenever the ensemble must establish or re-establish a leader. This happens at startup, after leader failure, or when network partitions heal. Recovery has two sub-phases:
Discovery: Nodes elect a prospective leader, which then discovers the most up-to-date transaction history among a quorum of followers.
Synchronization: The leader synchronizes its state with followers, ensuring every follower has a consistent prefix of committed transactions before broadcasting begins.
Phase 2: Broadcast
Once recovery completes successfully, ZAB enters the broadcast phase—its steady-state operation mode. In this phase:

- The leader accepts state-change requests from clients and assigns each one the next zxid.
- The leader proposes each transaction to its followers, which log it durably and acknowledge it.
- Once a quorum has acknowledged a proposal, the leader commits it and all replicas apply it in zxid order.
The Critical Invariant:
ZAB's correctness depends on a fundamental invariant: a new leader must have all committed transactions from previous epochs before starting broadcast. This ensures that no committed state is ever lost, even across leader changes.
This invariant is enforced through the epoch mechanism and quorum intersection. Since any quorum in the current epoch must overlap with the quorum that committed the previous epoch's transactions, the new leader will always discover all committed history during the discovery phase.
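The overlap argument is simple arithmetic. The following small Java sketch (illustrative, not Zookeeper code) shows that any two majority quorums of an ensemble must share at least one member, which is why a new leader's discovery quorum always contains at least one node that saw every previously committed transaction.

```java
public class QuorumMath {
    // Majority quorum size for an ensemble of n nodes.
    static int quorumSize(int n) {
        return n / 2 + 1;
    }

    // Minimum number of members shared by any two majority quorums.
    // By inclusion-exclusion, at least |Q1| + |Q2| - n members must be shared.
    static int minOverlap(int n) {
        return 2 * quorumSize(n) - n;
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 5, 7}) {
            System.out.printf("n=%d quorum=%d min overlap=%d%n",
                    n, quorumSize(n), minOverlap(n));
        }
        // The overlap is always >= 1, so a committed transaction (acked by one
        // quorum) is always visible to at least one node in the next leader's quorum.
    }
}
```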
Moving from broadcast to recovery represents a significant performance disruption. During recovery, the cluster cannot process new write requests. This is why ZAB is optimized to make the broadcast phase as stable as possible and why proper timeout tuning and network stability are crucial for Zookeeper deployments.
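The timeouts that control how quickly followers give up on a leader, and therefore how often the ensemble re-enters recovery, are set in zoo.cfg. A minimal sketch with illustrative values (not recommendations):

```
# zoo.cfg (illustrative values)
tickTime=2000        # base time unit in ms; heartbeats are sent per tick
initLimit=10         # ticks a follower may take to connect and sync with a leader
syncLimit=5          # ticks a follower may fall behind on heartbeats before being dropped
dataDir=/var/lib/zookeeper
clientPort=2181

# 5-node ensemble: host:quorumPort:electionPort
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
server.4=zk4:2888:3888
server.5=zk5:2888:3888
```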
ZAB uses a sophisticated numbering system to uniquely identify and order all transactions across the distributed system. This system consists of epochs and zxids (Zookeeper Transaction IDs).
Epoch: The Leadership Era
An epoch is a 32-bit number that identifies a particular leadership term. Every time a new leader is elected, the epoch is incremented. Epochs serve several critical purposes:

- They distinguish transactions proposed under different leaders.
- They let followers reject proposals from a stale leader whose epoch has been superseded.
- They form the high-order bits of every zxid, so transactions from a newer epoch always order after those of older epochs.
Zxid: The Global Transaction Ordering
A zxid is a 64-bit number composed of two 32-bit parts: the leader's epoch in the high 32 bits and a per-epoch transaction counter in the low 32 bits. The counter starts at zero in each new epoch and increments for every transaction the leader proposes.
```java
// ZXID Structure in Zookeeper
// 64-bit number: [epoch (32 bits)] [counter (32 bits)]

public class ZxidUtils {

    // Extract epoch from zxid
    public static int getEpochFromZxid(long zxid) {
        return (int) (zxid >> 32);
    }

    // Extract counter from zxid
    public static int getCounterFromZxid(long zxid) {
        return (int) (zxid & 0xFFFFFFFFL);
    }

    // Create zxid from epoch and counter
    public static long makeZxid(int epoch, int counter) {
        return ((long) epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    // Compare zxids for ordering
    // This works because epoch is in the high bits
    public static int compare(long zxid1, long zxid2) {
        // Higher epoch always wins
        // Within the same epoch, higher counter wins
        return Long.compare(zxid1, zxid2);
    }

    /*
     * Example ZXID values:
     *
     * Epoch 1, Transaction 1:   0x0000000100000001
     * Epoch 1, Transaction 2:   0x0000000100000002
     * Epoch 2, Transaction 1:   0x0000000200000001  <- Always > any Epoch 1 zxid
     * Epoch 2, Transaction 100: 0x0000000200000064
     *
     * This ensures total ordering even across leader changes!
     */
}
```

Why This Structure Matters:
The genius of this zxid structure is that comparison is trivial—it's just a 64-bit integer comparison! Yet it encodes complete ordering information: a transaction from a later epoch always compares greater than any transaction from an earlier epoch, and within an epoch the counter preserves the order in which the leader proposed transactions.
This structure enables efficient log comparison during recovery. When a new leader syncs with followers, it can quickly determine exactly which transactions each follower is missing by comparing their last zxid with its own log.
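As a concrete illustration, here is a small usage sketch of the ZxidUtils helpers above (the values are made up for the example):

```java
public class ZxidDemo {
    public static void main(String[] args) {
        long lastOnFollower = ZxidUtils.makeZxid(1, 250);  // epoch 1, transaction 250
        long lastOnLeader   = ZxidUtils.makeZxid(2, 3);    // epoch 2, transaction 3

        // A single long comparison is enough to order them:
        // anything from epoch 2 sorts after anything from epoch 1.
        System.out.println(ZxidUtils.compare(lastOnFollower, lastOnLeader) < 0); // true

        // The leader can also see exactly where the follower stopped.
        System.out.printf("follower is at epoch %d, counter %d%n",
                ZxidUtils.getEpochFromZxid(lastOnFollower),
                ZxidUtils.getCounterFromZxid(lastOnFollower));
    }
}
```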
The 32-bit counter allows approximately 4 billion transactions per epoch. In practice, leader changes happen long before a single epoch accumulates that many transactions, so overflow is rarely a concern. Zookeeper nevertheless handles the edge case by forcing a new leader election (and thus a new epoch) before the counter wraps.
The broadcast phase is where ZAB spends most of its time in a healthy cluster. Understanding this protocol is essential for reasoning about Zookeeper's consistency guarantees and performance characteristics.
The Two-Phase Commit Flow:
ZAB's broadcast uses a variant of two-phase commit optimized for the primary-backup model:
Phase 1: Propose

- The leader assigns the next zxid to the client's state-change request and sends a PROPOSAL to all followers.
- Each follower appends the proposal to its transaction log, syncs it to disk, and replies with an ACK.
Phase 2: Commit

- Once the leader has ACKs from a quorum (counting itself), it sends a COMMIT message for that zxid.
- Followers deliver committed transactions to their state machines in zxid order.
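The following Java sketch captures the leader's side of this flow under our own simplifying assumptions (message transport, persistence, and failure handling are omitted); it is not Zookeeper's actual server code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Leader-side broadcast bookkeeping, greatly simplified.
public class LeaderBroadcast {
    private final int quorum;
    private final long epoch;
    private int counter = 0;

    // zxid -> set of server ids that have acknowledged the proposal
    private final Map<Long, Set<Long>> acks = new HashMap<>();

    public LeaderBroadcast(int ensembleSize, long epoch) {
        this.quorum = ensembleSize / 2 + 1;
        this.epoch = epoch;
    }

    // Phase 1: assign the next zxid and (conceptually) send the proposal.
    public long propose(byte[] txn) {
        counter++;
        long zxid = (epoch << 32) | (counter & 0xFFFFFFFFL);
        acks.put(zxid, new HashSet<>());
        // sendToAllFollowers(PROPOSAL, zxid, txn);   // transport omitted
        onAck(zxid, 0L);                              // the leader logs and acks its own proposal
        return zxid;
    }

    // Phase 2: count ACKs; commit once a quorum has logged the proposal.
    public void onAck(long zxid, long serverId) {
        Set<Long> ackers = acks.get(zxid);
        if (ackers == null) return;                   // already committed or unknown zxid
        ackers.add(serverId);
        if (ackers.size() >= quorum) {
            acks.remove(zxid);
            // sendToAllFollowers(COMMIT, zxid);      // transport omitted
            System.out.printf("committed zxid 0x%016x%n", zxid);
        }
    }
}
```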
Critical Properties of the Broadcast Protocol:
1. FIFO Ordering The leader sends proposals in the order it assigns zxids. Followers process proposals in the order received. Since there's only one leader and TCP guarantees in-order delivery, all followers see proposals in the same order.
2. Prefix Consistency If a follower has committed a transaction with zxid Z, it has committed all transactions with zxid < Z. This is the "prefix" property—committed logs are always prefixes of each other (a small sketch of this check follows the list of properties below).
3. Quorum Acknowledgment The leader waits for acknowledgment from a majority (quorum) before committing. For a 5-node cluster, quorum is 3. This means:

- The cluster keeps committing writes as long as at most 2 of the 5 nodes have failed.
- Every committed transaction is durably logged on at least 3 nodes, so any future leader-election quorum includes at least one node that has it.
4. Reliable Delivery Once a transaction is committed, it will eventually be delivered to all non-failed nodes. Followers that were temporarily down will catch up during synchronization.
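As a small illustration of the prefix property, the sketch below checks that one replica's committed zxid sequence is a prefix of another's. It is a test-style helper of our own, not Zookeeper code.

```java
import java.util.List;

public class PrefixCheck {
    // Returns true if the shorter committed-zxid sequence is a prefix of the
    // longer one, i.e. the two replicas never disagree on any committed position.
    static boolean isPrefixConsistent(List<Long> committedA, List<Long> committedB) {
        List<Long> shorter = committedA.size() <= committedB.size() ? committedA : committedB;
        List<Long> longer  = committedA.size() <= committedB.size() ? committedB : committedA;
        return longer.subList(0, shorter.size()).equals(shorter);
    }

    public static void main(String[] args) {
        List<Long> replica1 = List.of(0x100000001L, 0x100000002L, 0x100000003L);
        List<Long> replica2 = List.of(0x100000001L, 0x100000002L);
        System.out.println(isPrefixConsistent(replica1, replica2)); // true: replica2 is a prefix
    }
}
```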
Traditional three-phase commit was designed to let participants make progress even when the coordinator fails mid-protocol. ZAB's two-phase approach works because it relies on quorums and leader election rather than timeouts: if the leader fails mid-protocol, a new leader is elected and the recovery phase ensures consistency. This is simpler and more efficient than 3PC.
ZAB provides three fundamental guarantees that together constitute atomic broadcast. These guarantees form the contract that Zookeeper relies upon to provide its consistency semantics.
Guarantee 1: Agreement
If a correct (non-failed) process delivers message m, then all correct processes eventually deliver m.
This means that once any node commits a transaction, you can be certain that all other nodes will eventually commit the same transaction. There's no possibility of one replica having transaction T committed while another replica permanently lacks it.
Guarantee 2: Total Order
If correct processes p and q both deliver messages m1 and m2, then p delivers m1 before m2 if and only if q delivers m1 before m2.
This is perhaps the most critical guarantee. All replicas apply transactions in exactly the same order. This means their state machines evolve identically, maintaining perfect replica consistency.
Guarantee 3: Local Primary Order
If a primary broadcasts m1 before m2, then a correct process that delivers m2 must have already delivered m1.
This ensures that a client's operations are applied in the order the client issued them. If a client issues operations A, then B, then C, they are applied in that order on all replicas.
The Relationship to Linearizability:
ZAB provides sequential consistency for writes—all writes appear in a total order that respects the order issued by each client. However, reads in Zookeeper are served locally by any replica, which means they might not see the most recent writes.
To achieve linearizable reads (read-your-writes and reading the latest value), Zookeeper provides the sync operation: a client issues sync() on a path before a read, which forces the replica it is connected to to apply all transactions committed up to that point before the read is served.
This design trades some consistency guarantees on reads for dramatically better read performance—reads don't require any inter-node communication.
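For example, with the standard ZooKeeper Java client, a caller that needs an up-to-date value can issue a sync() before the read. The connection string, path, and timeout below are placeholders, and error handling is reduced to a minimum.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SyncedRead {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string and session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { });

        String path = "/config/feature-flag";   // hypothetical znode
        CountDownLatch synced = new CountDownLatch(1);

        // sync() asks the server this client is connected to
        // to catch up with the leader before we read.
        zk.sync(path, (rc, p, ctx) -> synced.countDown(), null);
        synced.await();

        Stat stat = new Stat();
        byte[] data = zk.getData(path, false, stat);
        System.out.println("value: " + new String(data) + " (mzxid " + stat.getMzxid() + ")");

        zk.close();
    }
}
```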
In CAP theorem terms, ZAB chooses Consistency over Availability during partitions. If a Zookeeper cluster loses quorum (more than half the nodes are unavailable), it cannot process write requests. This is the correct trade-off for a coordination service—incorrect coordination data is worse than unavailable coordination data.
When Zookeeper nodes need to synchronize—whether during initial startup, after a follower restarts, or during leader election—ZAB uses the zxid to efficiently bring lagging nodes up to date.
The Synchronization Decision Process:
When a follower connects to a leader, the leader examines the follower's last committed zxid and determines the synchronization strategy:
Case 1: DIFF (Differential Sync) If the follower's last zxid is in the leader's transaction log and the gap is small: the leader sends only the proposals and commits the follower is missing, and the follower replays them to catch up.
Case 2: TRUNC (Truncation) If the follower has transactions beyond the leader's committed point (orphaned transactions from a failed leader): the leader instructs the follower to truncate its log back to the leader's last committed zxid, discarding the uncommitted entries.
Case 3: SNAP (Snapshot) If the follower is too far behind or its log is incompatible: the leader transfers a full snapshot of its current state, and the follower replaces its own state with it before joining the broadcast phase.
```
FUNCTION determineSync(followerLastZxid, leaderState):
    // Extract epoch and counter from follower's last zxid
    followerEpoch   = epoch(followerLastZxid)
    followerCounter = counter(followerLastZxid)

    // Check if follower is from a future epoch (impossible in correct operation)
    IF followerEpoch > leaderState.currentEpoch:
        REJECT connection (follower is misconfigured or byzantine)

    // Check if follower has uncommitted transactions from a failed leader
    IF followerLastZxid > leaderState.lastCommittedZxid:
        IF followerLastZxid in leaderState.log:
            // Follower saw proposal but leader didn't commit
            // These were not committed, safe to lose
            RETURN TRUNC to leaderState.lastCommittedZxid
        ELSE:
            // Follower was part of a different leader's quorum
            // Need full resync
            RETURN SNAP

    // Check if differential sync is possible
    IF followerLastZxid >= leaderState.minLogZxid:
        // Follower's last transaction is still in the log
        missingTxns = leaderState.log[followerLastZxid:]
        IF size(missingTxns) < DIFF_THRESHOLD:
            RETURN DIFF with missingTxns
        ELSE:
            // Too many missing, snapshot is more efficient
            RETURN SNAP

    // Follower is beyond log retention
    RETURN SNAP
```

Log Retention and Snapshots:
Zookeeper maintains a transaction log for recent operations and periodic snapshots of the complete state. The log retention policy is configurable and affects sync behavior: longer retention lets more followers catch up with a cheap DIFF sync, while shorter retention forces more followers onto the expensive SNAP path.
In practice, production deployments tune this based on expected failure patterns. If nodes typically recover quickly, short log retention works well. If network partitions can last hours, longer retention reduces recovery time.
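The main knobs live in zoo.cfg as well; an illustrative sketch (values are examples, not recommendations):

```
# Snapshot and transaction-log housekeeping (illustrative values)
snapCount=100000                       # transactions logged between automatic snapshots
autopurge.snapRetainCount=3            # snapshots (and matching logs) to keep
autopurge.purgeInterval=24             # hours between purge runs (0 disables purging)
dataLogDir=/var/lib/zookeeper-txnlog   # dedicated disk for the transaction log
```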
TRUNC sync represents a subtle but important case: a follower may have logged transactions that were never committed. This happens when a leader fails after proposing but before committing. The new leader's quorum didn't include this follower, so those transactions are 'orphaned.' ZAB correctly identifies and discards them, ensuring only truly committed transactions persist.
While ZAB and Paxos solve related problems, they have fundamental differences in design philosophy and guarantees. Understanding these differences helps clarify when each protocol is appropriate.
Design Philosophy:
Paxos: Designed as a general consensus primitive. It answers: "How do N nodes agree on a single value?" Multi-Paxos extends this to sequences of values but fundamentally views each slot independently.
ZAB: Designed specifically for primary-backup replication. It answers: "How does a single leader reliably broadcast a sequence of updates to followers?" The sequence nature is fundamental, not an extension.
Key Technical Differences:
| Aspect | Paxos | ZAB |
|---|---|---|
| Core abstraction | Consensus on single value | Atomic broadcast (sequence of values) |
| Leadership | Leader is optimization; any node can propose | Leader is mandatory; only leader broadcasts |
| Proposal ordering | Out-of-order proposals possible | Strict FIFO ordering required |
| Epoch handling | Proposal numbers can interleave | New epoch requires full recovery |
| Gaps in sequence | Allowed (holes can exist) | Not allowed (prefix must be complete) |
| Recovery focus | Per-slot recovery | Log prefix recovery |
| Client ordering | Not built-in | FIFO per client guaranteed |
| Primary use case | State machine replication | Coordination services |
The Hole Problem:
One critical difference involves "holes" in the transaction sequence. In Multi-Paxos:

- Slots are decided independently, so slot 7 can be decided before slot 5.
- Holes can exist temporarily, and the system must detect and fill them (for example, by proposing no-ops) before executing later slots.
In ZAB:

- The leader proposes transactions in strict zxid order over FIFO channels, so a follower never acknowledges zxid N+1 before it has logged zxid N.
- Committed logs therefore never contain gaps: the prefix property holds at all times.
The FIFO Guarantee:
ZAB's FIFO ordering per client is crucial for coordination patterns. Consider a client that:

1. Creates a configuration znode.
2. Writes configuration data to it.
3. Creates a "ready" znode signaling that the configuration is complete.
ZAB guarantees these operations are applied in order. If operation 3 were applied before operation 2, data corruption could result. Paxos doesn't inherently provide this guarantee—it would need to be built on top.
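Here is roughly what that pattern looks like with the ZooKeeper Java client; the paths and data are placeholders. Because the client's requests are FIFO-ordered, no other client can observe the ready znode before the configuration data is in place.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ConfigPublisher {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { });

        // 1. Create the configuration znode.
        zk.create("/app/config", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // 2. Write the configuration data.
        zk.setData("/app/config", "db=primary;pool=32".getBytes(), -1);

        // 3. Signal readiness. ZAB's per-client FIFO ordering guarantees this
        //    is applied only after steps 1 and 2 on every replica.
        zk.create("/app/config/ready", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        zk.close();
    }
}
```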
Atomic broadcast and consensus are theoretically equivalent in asynchronous systems—each can implement the other. However, a protocol designed specifically for atomic broadcast (like ZAB) is more efficient for repeated message ordering than using consensus repeatedly. The optimizations are baked into the protocol design.
Understanding ZAB's performance profile is essential for deploying and tuning Zookeeper effectively. The protocol's design creates specific performance characteristics that affect system architecture decisions.
Write Latency:
Every write in ZAB requires:

1. A proposal message from the leader to the followers.
2. Each follower syncing the proposal to its transaction log before acknowledging.
3. ACKs from a quorum back to the leader, followed by a COMMIT message.
The critical path for latency is: network RTT + disk fsync + network RTT
For a 3-node cluster with 1ms network latency and 5ms fsync, the critical path works out to roughly 1 ms (RTT) + 5 ms (fsync) + 1 ms (RTT) ≈ 7 ms per write, consistent with the 2-10 ms range in the table below.
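A back-of-the-envelope calculator for this, treating the formula and example numbers above as rough assumptions:

```java
public class WriteLatencyEstimate {
    // Critical path from the text: network RTT + disk fsync + network RTT.
    static double estimateWriteMillis(double networkRttMs, double fsyncMs) {
        return networkRttMs + fsyncMs + networkRttMs;
    }

    public static void main(String[] args) {
        // ~1 ms network round trip, ~5 ms fsync (the example above)
        System.out.println(estimateWriteMillis(1.0, 5.0) + " ms");  // 7.0 ms

        // Same cluster with a ~10x slower fsync (see the note on disks below)
        System.out.println(estimateWriteMillis(1.0, 50.0) + " ms"); // 52.0 ms
    }
}
```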
Read Latency:
Reads are served locally without any inter-node communication:
| Operation | Latency Components | Typical Latency | Scalability |
|---|---|---|---|
| Write | 2 × network RTT + disk sync | 2-10ms | Limited by leader throughput |
| Read (follower) | Local memory access | < 1ms | Scales with replicas |
| Read + sync | Network RTT + catchup | 2-5ms | Limited by leader |
| Leader election | Discovery + sync + broadcast setup | seconds | Bounded by slowest node |
| Follower sync (DIFF) | Network + log replay | ms-seconds | Depends on gap size |
| Follower sync (SNAP) | Network + snapshot transfer | seconds-minutes | Depends on state size |
Throughput Considerations:
Write Throughput:

- All writes flow through the single leader, so write throughput is bounded by the leader's CPU, network, and above all its disk fsync rate. Adding nodes does not increase write throughput; it slightly reduces it because the quorum grows.
Read Throughput:

- Reads are served locally by any replica without leader involvement, so aggregate read throughput scales roughly linearly with the number of replicas.
The 5-Node Sweet Spot:
Most production Zookeeper deployments use 5 nodes:

- A quorum of 3 tolerates 2 simultaneous node failures.
- One node can be taken down for maintenance while still leaving room for an unplanned failure.
- The quorum is still small enough to keep write latency low.
Going beyond 5 nodes has diminishing returns—write latency increases (larger quorum), and the additional fault tolerance is rarely needed.
ZAB's performance is heavily influenced by disk performance because the protocol requires syncing proposals to disk before acknowledging. Using SSDs instead of HDDs can reduce write latency by 10x. Dedicated disks for the transaction log (separate from snapshots) prevent I/O contention.
We've explored the foundations of ZAB—the atomic broadcast protocol that makes Zookeeper possible. Let's consolidate the essential knowledge:

- ZAB is an atomic broadcast protocol purpose-built for primary-backup replication, not a general consensus primitive.
- It alternates between two phases: recovery (discovery + synchronization) and broadcast (steady-state two-phase commit with quorum acknowledgment).
- Epochs and zxids give every transaction a globally comparable 64-bit identity that survives leader changes.
- Its guarantees (agreement, total order, and local primary order) are exactly what a coordination service needs.
- Writes pay for ordering with leader round trips and disk syncs; reads are local and scale with the number of replicas.
What's Next:
Now that you understand ZAB's core mechanics, the next page dives into leader-based ordering—how ZAB leverages a single leader to simplify the protocol, how leader election works, and what happens during leader transitions. This is where the rubber meets the road for understanding Zookeeper's behavior under failure conditions.
You now understand the foundational concepts of ZAB: its role as an atomic broadcast protocol, its two operational phases, the epoch/zxid numbering system, the broadcast protocol mechanics, and its performance characteristics. Next, we'll explore how leader-based ordering works and why it's central to ZAB's design.