In the world of distributed systems, achieving consensus on a sequence of values is notoriously difficult. Multiple nodes, each with their own view of time and state, must agree not just on what values are chosen, but on the exact order in which they're applied. ZAB solves this challenge through an elegant design principle: designate a single leader as the sole authority for ordering all transactions.
This isn't a limitation—it's a profound simplification. By funneling all state changes through one node, ZAB eliminates the complex multi-proposer coordination that makes protocols like Multi-Paxos challenging to implement correctly. The leader becomes the total ordering authority, and the protocol's job is to ensure this authority is maintained correctly across leader changes, failures, and network partitions.
By the end of this page, you will understand why leader-based ordering simplifies distributed consensus, how ZAB's leader election protocol works, what happens during leader transitions, and how the protocol maintains ordering guarantees even when leaders fail mid-transaction.
The decision to use a single leader for ordering all transactions is one of the most important architectural choices in ZAB. Understanding why this design was chosen illuminates core principles of distributed systems design.
The Multi-Proposer Problem:
In protocols that allow multiple nodes to propose values simultaneously (like basic Paxos), several complications arise:
The Single Leader Solution:
By designating exactly one leader at any given time, ZAB eliminates these problems:
The Coordination Service Sweet Spot:
For a coordination service like Zookeeper, the single-leader trade-offs are favorable:
If Zookeeper had different requirements—high write volume, weak consistency acceptable, or very large clusters—a different design might be appropriate. But for coordination, single-leader ordering is optimal.
The ZAB leader has specific responsibilities that go beyond simply accepting client writes. Understanding these responsibilities clarifies how the protocol maintains its guarantees.
Leader Responsibilities:
1. Transaction Ordering
The leader assigns the next zxid to each incoming write request. This assignment is final—once a zxid is assigned, it's never changed. The zxid embeds both the current epoch and a monotonically increasing counter.

2. Proposal Broadcasting
For each transaction, the leader creates a proposal containing the zxid and transaction data, then broadcasts this proposal to all followers. The leader logs the proposal before broadcasting (write-ahead).

3. Quorum Collection
The leader tracks acknowledgments from followers. Once a quorum (majority) of followers have acknowledged a proposal, it's ready for commit. The leader maintains state tracking which proposals have achieved quorum.

4. Commit Coordination
When a proposal reaches quorum, the leader broadcasts a commit message. This tells all followers to apply the transaction to their state machine. The leader also applies the transaction locally.

5. Follower Synchronization
When new followers join or existing followers recover, the leader determines the synchronization strategy (DIFF, TRUNC, or SNAP) and performs the necessary state transfer.
```java
// Conceptual Leader State Management
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

public class LeaderState {
    // Current epoch - incremented on each new leadership term
    private final int epoch;

    // Next zxid counter to assign
    private final AtomicLong nextZxidCounter;

    // Active followers with their last acked zxid
    private final Map<ServerId, Long> followerAckStatus;

    // Pending proposals awaiting quorum
    private final ConcurrentMap<Long, PendingProposal> pendingProposals;

    // Committed but not yet delivered to some followers
    private final Queue<Long> committedPending;

    public LeaderState(int epoch, long lastZxid) {
        this.epoch = epoch;
        this.nextZxidCounter = new AtomicLong(ZxidUtils.getCounter(lastZxid) + 1);
        this.followerAckStatus = new ConcurrentHashMap<>();
        this.pendingProposals = new ConcurrentHashMap<>();
        this.committedPending = new ConcurrentLinkedQueue<>();
    }

    // Assign next zxid for incoming transaction
    public long assignNextZxid() {
        long counter = nextZxidCounter.getAndIncrement();
        return ZxidUtils.makeZxid(epoch, counter);
    }

    // Process ACK from follower
    public void processAck(ServerId follower, long zxid) {
        followerAckStatus.put(follower, zxid);
        PendingProposal proposal = pendingProposals.get(zxid);
        if (proposal != null) {
            proposal.addAck(follower);
            // Check if quorum reached (including leader)
            if (proposal.hasQuorum(getQuorumSize())) {
                commitProposal(zxid);
            }
        }
    }

    // Commit a proposal that has achieved quorum
    private void commitProposal(long zxid) {
        // Remove from pending
        PendingProposal proposal = pendingProposals.remove(zxid);
        // Add to commit queue
        committedPending.add(zxid);
        // Broadcast commit to all followers
        broadcastCommit(zxid);
        // Apply locally
        applyToStateMachine(proposal.getTransaction());
    }
}
```

The Leader as Serialization Point:
The leader acts as a serialization point—all writes pass through this single point, establishing a total order. This pattern appears throughout distributed systems design:
The key insight is that serialization through a single point is both simple and correct. The complexity comes in maintaining the serialization point across failures—which is what the rest of ZAB's protocol handles.
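The leader-state sketch above leans on a ZxidUtils helper that isn't shown. In Zookeeper, a zxid is a 64-bit value with the epoch in the high 32 bits and the per-epoch counter in the low 32 bits; a minimal version of such a helper (illustrative, not the project's actual class) could look like this:

```java
// Minimal zxid helper: epoch in the high 32 bits, counter in the low 32 bits.
// Because the epoch occupies the high bits, comparing zxids as plain longs
// orders them first by epoch, then by counter - exactly the total order ZAB needs.
public final class ZxidUtils {

    // Build a zxid from an epoch and a counter
    public static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    // Extract the epoch (high 32 bits)
    public static long getEpoch(long zxid) {
        return zxid >>> 32;
    }

    // Extract the counter (low 32 bits)
    public static long getCounter(long zxid) {
        return zxid & 0xFFFFFFFFL;
    }

    private ZxidUtils() {}
}
```

For example, any zxid from epoch 3 compares greater than every zxid from epoch 2, regardless of counter values, which is why a single numeric comparison suffices during leader election.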
When a Zookeeper cluster starts up or loses its leader, it must elect a new leader before processing writes. ZAB uses the Fast Leader Election (FLE) protocol—an optimized election mechanism designed for rapid convergence.
Election Triggers:
Leader election begins when:
The Election Goal:
The election must select a leader that:
Why Most Up-to-Date Matters:
The leader must have all committed transactions from previous epochs. If a less up-to-date node became leader, committed transactions could be lost. The election protocol ensures the node with the highest epoch/zxid wins.
The Fast Leader Election Algorithm:
Step 1: Initial Vote
Each node initially votes for itself, proposing: (myId, lastZxid)

Step 2: Vote Exchange
Nodes broadcast their current vote to all other nodes.

Step 3: Vote Comparison
When receiving another node's vote, compare it with the current vote:

Step 4: Quorum Detection
When a node sees that a candidate has votes from a quorum (including itself), it considers that candidate elected.

Step 5: State Transition
If elected: become LEADING and start the recovery phase. If not elected: become FOLLOWING and connect to the leader.
```
FUNCTION fastLeaderElection():
    // State
    currentVote = (myServerId, myLastZxid, myEpoch)
    votes = new Map()  // votes[serverId] = (proposedLeader, zxid, epoch)

    // Initially vote for self
    votes[myServerId] = currentVote
    broadcastNotification(currentVote)

    WHILE not electionComplete:
        // Receive notification from another server
        (senderId, vote) = receiveNotification()

        // Update our knowledge of sender's vote
        votes[senderId] = vote

        // Compare received vote with our current vote
        IF shouldUpdateVote(vote, currentVote):
            currentVote = vote
            votes[myServerId] = currentVote
            broadcastNotification(currentVote)

        // Check if any candidate has quorum
        FOR EACH candidate IN uniqueCandidates(votes):
            IF countVotes(votes, candidate) >= quorumSize:
                // Wait briefly for any late-arriving higher votes
                IF verifyNoHigherVotes(candidate):
                    IF candidate.serverId == myServerId:
                        RETURN LEADING
                    ELSE:
                        RETURN FOLLOWING(candidate.serverId)

FUNCTION shouldUpdateVote(received, current):
    // Prefer higher epoch
    IF received.epoch > current.epoch: RETURN TRUE
    IF received.epoch < current.epoch: RETURN FALSE

    // Same epoch: prefer higher zxid
    IF received.zxid > current.zxid: RETURN TRUE
    IF received.zxid < current.zxid: RETURN FALSE

    // Same epoch and zxid: tie-break by server ID
    RETURN received.serverId > current.serverId
```

The algorithm is called 'fast' because it converges quickly in the common case. When a clear winner exists (one node has the highest zxid), all nodes update their votes in a single round. Compare this to some election protocols that require multiple rounds of message exchange before converging.
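The vote-comparison rule is compact enough to state as runnable code. Here is a Java sketch of the same logic (the Vote record and class name are illustrative, not Zookeeper's actual types):

```java
// Illustrative Fast Leader Election vote comparison: prefer the higher
// epoch, then the higher zxid, then the higher server ID as a tie-breaker.
public final class VoteComparison {

    public record Vote(long serverId, long zxid, long epoch) {}

    // Returns true if 'received' should replace 'current' as our vote
    public static boolean shouldUpdateVote(Vote received, Vote current) {
        if (received.epoch() != current.epoch()) {
            return received.epoch() > current.epoch();
        }
        if (received.zxid() != current.zxid()) {
            return received.zxid() > current.zxid();
        }
        return received.serverId() > current.serverId();
    }

    private VoteComparison() {}
}
```

Given two nodes in the same epoch, the one with the higher last zxid wins; only nodes with identical histories fall back to the server-ID tie-break, which guarantees the comparison always picks exactly one winner.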
Epochs are the mechanism by which ZAB distinguishes between different leadership terms. Proper epoch management is essential for correctness in the face of failures and network partitions.
Epoch Lifecycle:
1. Epoch Increment
When a new leader begins the discovery phase, it proposes a new epoch = max(known epochs) + 1. Followers respond with their accepted epoch and last zxid.

2. Epoch Establishment
Once the prospective leader receives epoch acknowledgments from a quorum, the new epoch is established. The leader will use this epoch for all zxids in its term.

3. Epoch Recognition
Followers only accept proposals from a leader with the current epoch. Messages from lower epochs are rejected (stale leader).

4. Epoch Persistence
Each node persists its current epoch to stable storage. On restart, nodes use their persisted epoch to avoid accidentally accepting old messages.
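The recognition and persistence rules amount to a small state machine on each follower: the accepted epoch only moves forward, and only proposals tagged with the current epoch get through. A minimal sketch (class and method names are illustrative):

```java
// Sketch of epoch-based fencing on a follower. The accepted epoch is
// monotonic: it never moves backward, so messages from a deposed leader
// (which carry an older epoch) are rejected.
public final class EpochFence {

    private long acceptedEpoch;  // persisted to stable storage in a real node

    public EpochFence(long initialEpoch) {
        this.acceptedEpoch = initialEpoch;
    }

    // Epoch establishment: accept only strictly newer epochs
    public boolean acceptNewEpoch(long proposedEpoch) {
        if (proposedEpoch <= acceptedEpoch) {
            return false;  // stale or duplicate leadership attempt
        }
        acceptedEpoch = proposedEpoch;
        return true;
    }

    // Epoch recognition: reject any proposal not from the current epoch
    public boolean acceptProposal(long proposalEpoch) {
        return proposalEpoch == acceptedEpoch;
    }
}
```

For instance, once a follower has accepted epoch 4, a partitioned old leader still broadcasting with epoch 3 finds all of its proposals rejected, which is how it discovers it has been deposed.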
| Scenario | Epoch Behavior | Outcome |
|---|---|---|
| Clean leader election | New leader uses epoch = old + 1 | Orderly transition to new term |
| Leader crashes mid-transaction | New leader's epoch > crashed leader | Uncommitted proposals are invalidated |
| Network partition heals | Minority side has stale epoch | Minority nodes resync with majority |
| Old leader thinks it's still leader | Messages rejected (old epoch) | Old leader discovers it's no longer leader |
| Node restarts after long outage | Node's epoch is behind | Node follows new leader, updates epoch |
The Split-Brain Prevention:
Epochs are crucial for preventing split-brain scenarios. Consider this case:
How does ZAB handle this?
The epoch acts as a fencing token—a mechanism that ensures only the legitimate current leader can make changes.
Epochs must never go backward. If a node restarts and forgets its epoch (due to storage failure), it could accept messages from an old leader. This is why Zookeeper requires careful attention to storage reliability—lost epoch information is a correctness hazard.
Epoch vs. Term in Raft:
ZAB's epoch is analogous to Raft's term, but there are subtle differences:
The core principle is identical: each leadership term has a unique, monotonically increasing identifier that allows nodes to determine which leader is current.
After Fast Leader Election identifies a prospective leader, the Discovery phase begins. This phase serves two critical purposes:
Discovery Protocol Steps:
Step 1: FOLLOWERINFO
Each follower sends FOLLOWERINFO to the prospective leader, containing:

This tells the prospective leader what history each follower has.

Step 2: LEADERINFO
The prospective leader calculates the new epoch (max of all follower epochs + 1) and sends LEADERINFO:

This proposes the new leadership term to all followers.

Step 3: ACKEPOCH
Followers respond with ACKEPOCH indicating they accept the new epoch. They also include:

Step 4: Epoch Established
Once the prospective leader receives ACKEPOCH from a quorum, the new epoch is established.
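Steps 2 and 4 reduce to a small calculation: take the maximum epoch seen across the responding followers (and the leader itself), add one, and wait for majority acknowledgment. A sketch, with illustrative names:

```java
import java.util.Collection;

// Sketch of the discovery-phase epoch calculation: the prospective leader
// proposes max(all epochs it has seen, including its own) + 1, and the
// epoch counts as established once a majority has acknowledged it.
public final class Discovery {

    public static long proposeNewEpoch(long ownEpoch, Collection<Long> followerEpochs) {
        long max = ownEpoch;
        for (long e : followerEpochs) {
            max = Math.max(max, e);
        }
        return max + 1;
    }

    // Majority of the full ensemble, counting the leader's own acceptance
    public static boolean epochEstablished(int ackCount, int ensembleSize) {
        return ackCount >= ensembleSize / 2 + 1;
    }

    private Discovery() {}
}
```

Taking the maximum over all reported epochs guarantees the new epoch is strictly greater than any epoch a previous leader could have used, even if that leader is still alive on the other side of a partition.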
The Critical Synchronization Insight:
During discovery, the prospective leader learns the highest zxid among the quorum. If any follower has a higher zxid than the prospective leader, those transactions might be committed (if they achieved quorum in the previous epoch).
The prospective leader must sync to the highest zxid in the quorum before proceeding.
This is non-obvious: the elected leader might not have the most up-to-date log! Fast Leader Election prefers higher zxids, but the actual elected leader might have slightly stale data if:
Discovery corrects for this by identifying the true highest zxid in the quorum and syncing the leader accordingly.
It seems counterintuitive that a leader might need to learn transactions from followers. But this is correct and necessary: the quorum intersection property guarantees that any committed transaction is on at least one node in any quorum. The leader must honor this by incorporating all committed history.
After discovery establishes the new epoch and identifies the highest committed state, the Synchronization phase ensures all followers in the quorum have consistent state before broadcast begins.
Synchronization Goals:
Synchronization Protocol:
Step 1: Leader evaluates each follower
Based on each follower's last zxid, the leader determines the sync method:

Step 2: Leader sends sync messages
The appropriate sync messages are sent:

Step 3: Commit pending transactions
The leader sends COMMIT for all synced transactions.

Step 4: NEWLEADER message
The leader sends a NEWLEADER message indicating sync is complete.

Step 5: Followers acknowledge
Followers send ACK for NEWLEADER when they've processed all sync data.

Step 6: Quorum achieved
Once the leader receives NEWLEADER ACKs from a quorum, the synchronization phase completes.
```
FUNCTION determineAndExecuteSync(follower, followerLastZxid):
    leaderMaxZxid = leader.lastCommittedZxid
    leaderMinZxid = leader.minRetainedLogZxid

    // Case 1: Follower is ahead - has uncommitted transactions
    IF followerLastZxid > leaderMaxZxid:
        // These transactions were never committed (leader failed before quorum)
        // Follower must truncate to leader's committed state
        SEND_TRUNC(follower, targetZxid = leaderMaxZxid)
        // After truncation, follower is synchronized
        SEND_NEWLEADER(follower)
        RETURN

    // Case 2: Follower is exactly caught up
    IF followerLastZxid == leaderMaxZxid:
        // No synchronization needed
        SEND_NEWLEADER(follower)
        RETURN

    // Case 3: Follower is behind but within log retention
    IF followerLastZxid >= leaderMinZxid:
        // Can use incremental sync
        missingTxns = leader.log.range(followerLastZxid + 1, leaderMaxZxid)
        FOR EACH txn IN missingTxns:
            SEND_PROPOSAL(follower, txn.zxid, txn.data)
        FOR EACH txn IN missingTxns:
            SEND_COMMIT(follower, txn.zxid)
        SEND_NEWLEADER(follower)
        RETURN

    // Case 4: Follower is too far behind
    // Must send complete snapshot
    snapshot = leader.takeSnapshot()
    SEND_SNAP(follower, snapshot)
    SEND_NEWLEADER(follower)
```

Handling the TRUNC Case:
The TRUNC case deserves special attention as it represents a subtle correctness scenario:
Scenario:
Resolution:
This is correct because T never achieved quorum acknowledgment. The client that proposed T never received a confirmation, so from the client's perspective, T failed (correctly).
The leader cannot process new write requests until synchronization completes with a quorum. If the leader started broadcasting while followers had inconsistent state, the total ordering guarantee would be violated. This is why ZAB has distinct phases—they provide clear state machine transitions.
Understanding how ZAB handles various leader failure scenarios is essential for reasoning about system behavior and correctness. Let's analyze the key failure modes.
Scenario 1: Leader Fails Before Proposing
Client sends request → Leader crashes before creating proposal
Outcome: Request is lost (never entered the system). Client times out and can retry. No consistency issues—transaction was never started.
Scenario 2: Leader Fails After Proposing, Before Any Acknowledgment
Leader creates proposal → Leader crashes → No followers received it
Outcome: Proposal is lost. Leader might have logged it, but since no follower has it, new leader won't discover it. Client times out. Safe—transaction never reached any other node.
Scenario 3: Leader Fails After Some Acknowledgments, Before Quorum
Leader proposes → Some followers ACK → Leader crashes before quorum
Outcome: Some followers have the proposal in their log, others don't. New leader election occurs:
This is the subtle case—the outcome depends on quorum membership.
| Failure Point | Leader State | Follower State | New Leader Action | Client Experience |
|---|---|---|---|---|
| Before proposal | Request received only | No knowledge | N/A | Timeout, can retry |
| After proposal, no ACKs | Logged locally only | No knowledge | N/A | Timeout, can retry |
| Minority ACKs | Logged, pending | Some have proposal | May TRUNC followers | Timeout, can retry |
| Quorum ACKs, no commit sent | Ready to commit | Quorum has proposal | Must commit | Timeout, success on retry check |
| Commit sent to some | Committed | Some committed | Sync lagging followers | May have received success |
| Commit sent to all | Committed | All committed | Just new epoch | Received success |
Scenario 4: Leader Fails After Quorum ACKs, Before Commit (Critical)
This scenario deserves deep analysis:
Analysis:
The quorum intersection property guarantees that T will be discovered and committed by the new leader. This is why quorum-based replication is safe!
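The guarantee rests on simple arithmetic: two majorities of an n-node ensemble must overlap, because 2 * (n/2 + 1) > n. A quick check of that claim:

```java
// The arithmetic behind the quorum intersection property: any two majority
// quorums of an n-node ensemble share at least one node, so the quorum that
// ACKed a proposal always overlaps the quorum that elects the next leader.
public final class QuorumMath {

    // Majority quorum size for an ensemble of the given size
    public static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Minimum possible overlap between any two quorums (by pigeonhole)
    public static int minOverlap(int ensembleSize) {
        return 2 * quorumSize(ensembleSize) - ensembleSize;
    }

    private QuorumMath() {}
}
```

For a 5-node ensemble the quorum is 3, and any two quorums of 3 share at least one node; that shared node carries the acknowledged proposal into the new leader's discovery phase.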
Scenario 5: Leader Fails After Committing Locally
Result: New leader will find T in quorum and commit it. Client might not know T succeeded, but T is durably committed. Idempotency on client side handles the uncertainty.
Because clients can't always know whether a request succeeded during leader failure, requests should be idempotent where possible. Zookeeper's conditional operations (create-if-not-exists, compare-and-set) are designed with this in mind—retrying them is safe because the condition will fail if the operation already succeeded.
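To sketch why conditional operations make retries safe, consider a create-if-not-exists wrapped in a retry loop. The Store interface and AlreadyExistsException below are illustrative stand-ins, not Zookeeper's real client API:

```java
// Sketch of a retry-safe conditional create. On an ambiguous failure
// (e.g. the leader died before replying), the client simply retries;
// if a previous attempt actually committed, the retry observes
// "already exists" and treats that as success.
public final class IdempotentCreate {

    public static class AlreadyExistsException extends Exception {}

    public interface Store {
        void createIfNotExists(String path, byte[] data) throws AlreadyExistsException;
    }

    // Returns true once the node is known to exist
    public static boolean createWithRetry(Store store, String path, byte[] data, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                store.createIfNotExists(path, data);
                return true;  // committed on this attempt
            } catch (AlreadyExistsException e) {
                return true;  // an earlier (timed-out) attempt must have committed
            } catch (RuntimeException ambiguousFailure) {
                // Outcome unknown (leader failed mid-request): safe to retry
            }
        }
        return false;
    }

    private IdempotentCreate() {}
}
```

Note the assumption that the path is uniquely this client's to create (for example, it embeds a client-chosen token); otherwise "already exists" could reflect another client's write rather than a successful earlier attempt.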
While leader failures trigger dramatic protocol transitions, follower failures are handled more gracefully. The system continues operating as long as quorum remains available.
Follower Crash Scenarios:
Scenario A: Follower Crashes, Quorum Still Available
Scenario B: Multiple Followers Crash, Quorum Lost
Scenario C: Follower Recovers After Short Outage
Follower Lagging During Broadcast:
In steady state, a slow follower might fall behind the leader. ZAB handles this gracefully:
This is why ZAB uses asynchronous broadcast—the leader doesn't wait for all followers, only quorum.
Observer Nodes:
Zookeeper also supports observer nodes—replicas that receive broadcasts but don't participate in quorum voting. Benefits:
Observers receive all commits, but their acknowledgments are not required for a proposal to commit. They are eventually consistent with the leader.
We've explored how ZAB uses a single leader to establish total ordering for transactions. Let's consolidate the essential concepts:
What's Next:
Now that you understand ZAB's leader-based ordering mechanics, the next page provides a detailed comparison of ZAB with Raft and Paxos. We'll analyze the design choices each protocol makes, their performance characteristics, and when each is most appropriate.
You now understand how ZAB leverages a single leader to achieve total ordering, how Fast Leader Election works, the role of epochs in preventing split-brain, and how the discovery and synchronization phases ensure consistency across leader transitions. Next, we'll compare ZAB with other consensus protocols.