In the world of distributed systems, achieving consensus on a sequence of values is notoriously difficult. Multiple nodes, each with their own view of time and state, must agree not just on what values are chosen, but on the exact order in which they're applied. ZAB solves this challenge through an elegant design principle: designate a single leader as the sole authority for ordering all transactions.
This isn't a limitation—it's a profound simplification. By funneling all state changes through one node, ZAB eliminates the complex multi-proposer coordination that makes protocols like Multi-Paxos challenging to implement correctly. The leader becomes the total ordering authority, and the protocol's job is to ensure this authority is maintained correctly across leader changes, failures, and network partitions.
By the end of this page, you will understand why leader-based ordering simplifies distributed consensus, how ZAB's leader election protocol works, what happens during leader transitions, and how the protocol maintains ordering guarantees even when leaders fail mid-transaction.
The decision to use a single leader for ordering all transactions is one of the most important architectural choices in ZAB. Understanding why this design was chosen illuminates core principles of distributed systems design.
The Multi-Proposer Problem:
In protocols that allow multiple nodes to propose values simultaneously (like basic Paxos), several complications arise:
The Single Leader Solution:
By designating exactly one leader at any given time, ZAB eliminates these problems:
The Coordination Service Sweet Spot:
For a coordination service like Zookeeper, the single-leader trade-offs are favorable:
If Zookeeper had different requirements—high write volume, weak consistency acceptable, or very large clusters—a different design might be appropriate. But for coordination, single-leader ordering is optimal.
The ZAB leader has specific responsibilities that go beyond simply accepting client writes. Understanding these responsibilities clarifies how the protocol maintains its guarantees.
Leader Responsibilities:
1. Transaction Ordering
The leader assigns the next zxid to each incoming write request. This assignment is final—once a zxid is assigned, it's never changed. The zxid embeds both the current epoch and a monotonically increasing counter.

2. Proposal Broadcasting
For each transaction, the leader creates a proposal containing the zxid and transaction data, then broadcasts this proposal to all followers. The leader logs the proposal before broadcasting (write-ahead).

3. Quorum Collection
The leader tracks acknowledgments from followers. Once a quorum (majority) of followers have acknowledged a proposal, it's ready for commit. The leader maintains state tracking which proposals have achieved quorum.

4. Commit Coordination
When a proposal reaches quorum, the leader broadcasts a commit message. This tells all followers to apply the transaction to their state machine. The leader also applies the transaction locally.

5. Follower Synchronization
When new followers join or existing followers recover, the leader determines the synchronization strategy (DIFF, TRUNC, or SNAP) and performs the necessary state transfer.
```java
// Conceptual Leader State Management
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

public class LeaderState {
    // Current epoch - incremented on each new leadership term
    private final int epoch;

    // Next zxid counter to assign
    private final AtomicLong nextZxidCounter;

    // Active followers with their last acked zxid
    private final Map<ServerId, Long> followerAckStatus;

    // Pending proposals awaiting quorum
    private final ConcurrentMap<Long, PendingProposal> pendingProposals;

    // Committed but not yet delivered to some followers
    private final Queue<Long> committedPending;

    public LeaderState(int epoch, long lastZxid) {
        this.epoch = epoch;
        this.nextZxidCounter = new AtomicLong(ZxidUtils.getCounter(lastZxid) + 1);
        this.followerAckStatus = new ConcurrentHashMap<>();
        this.pendingProposals = new ConcurrentHashMap<>();
        this.committedPending = new ConcurrentLinkedQueue<>();
    }

    // Assign next zxid for incoming transaction
    public long assignNextZxid() {
        long counter = nextZxidCounter.getAndIncrement();
        return ZxidUtils.makeZxid(epoch, counter);
    }

    // Process ACK from follower
    public void processAck(ServerId follower, long zxid) {
        followerAckStatus.put(follower, zxid);
        PendingProposal proposal = pendingProposals.get(zxid);
        if (proposal != null) {
            proposal.addAck(follower);
            // Check if quorum reached (including leader)
            if (proposal.hasQuorum(getQuorumSize())) {
                commitProposal(zxid);
            }
        }
    }

    // Commit a proposal that has achieved quorum
    private void commitProposal(long zxid) {
        // Remove from pending
        PendingProposal proposal = pendingProposals.remove(zxid);
        // Add to commit queue
        committedPending.add(zxid);
        // Broadcast commit to all followers
        broadcastCommit(zxid);
        // Apply locally
        applyToStateMachine(proposal.getTransaction());
    }
}
```

The Leader as Serialization Point:
The leader acts as a serialization point—all writes pass through this single point, establishing a total order. This pattern appears throughout distributed systems design:
The key insight is that serialization through a single point is both simple and correct. The complexity comes in maintaining the serialization point across failures—which is what the rest of ZAB's protocol handles.
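The leader-state sketch above leans on a ZxidUtils helper that isn't shown. In Zookeeper, a zxid is a 64-bit value with the epoch in the high 32 bits and the per-epoch counter in the low 32 bits; a minimal version of such a helper (illustrative, not the project's actual class) could look like this:

```java
// Minimal zxid helper: epoch in the high 32 bits, counter in the low 32 bits.
// Because the epoch occupies the high bits, comparing zxids as plain longs
// orders them first by epoch, then by counter - exactly the total order ZAB needs.
public final class ZxidUtils {

    // Build a zxid from an epoch and a counter
    public static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    // Extract the epoch (high 32 bits)
    public static long getEpoch(long zxid) {
        return zxid >>> 32;
    }

    // Extract the counter (low 32 bits)
    public static long getCounter(long zxid) {
        return zxid & 0xFFFFFFFFL;
    }

    private ZxidUtils() {}
}
```

For example, any zxid from epoch 3 compares greater than every zxid from epoch 2, regardless of counter values, which is why a single numeric comparison suffices during leader election.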
When a Zookeeper cluster starts up or loses its leader, it must elect a new leader before processing writes. ZAB uses the Fast Leader Election (FLE) protocol—an optimized election mechanism designed for rapid convergence.
Election Triggers:
Leader election begins when:
The Election Goal:
The election must select a leader that:
Why Most Up-to-Date Matters:
The leader must have all committed transactions from previous epochs. If a less up-to-date node became leader, committed transactions could be lost. The election protocol ensures the node with the highest epoch/zxid wins.
The Fast Leader Election Algorithm:
Step 1: Initial Vote
Each node initially votes for itself, proposing: (myId, lastZxid)

Step 2: Vote Exchange
Nodes broadcast their current vote to all other nodes.

Step 3: Vote Comparison
When receiving another node's vote, compare it with the current vote:

Step 4: Quorum Detection
When a node sees that a candidate has votes from a quorum (including itself), it considers that candidate elected.

Step 5: State Transition
If elected: become LEADING and start the recovery phase. If not elected: become FOLLOWING and connect to the leader.
```
FUNCTION fastLeaderElection():
    // State
    currentVote = (myServerId, myLastZxid, myEpoch)
    votes = new Map()  // votes[serverId] = (proposedLeader, zxid, epoch)

    // Initially vote for self
    votes[myServerId] = currentVote
    broadcastNotification(currentVote)

    WHILE not electionComplete:
        // Receive notification from another server
        (senderId, vote) = receiveNotification()

        // Update our knowledge of sender's vote
        votes[senderId] = vote

        // Compare received vote with our current vote
        IF shouldUpdateVote(vote, currentVote):
            currentVote = vote
            votes[myServerId] = currentVote
            broadcastNotification(currentVote)

        // Check if any candidate has quorum
        FOR EACH candidate IN uniqueCandidates(votes):
            IF countVotes(votes, candidate) >= quorumSize:
                // Wait briefly for any late-arriving higher votes
                IF verifyNoHigherVotes(candidate):
                    IF candidate.serverId == myServerId:
                        RETURN LEADING
                    ELSE:
                        RETURN FOLLOWING(candidate.serverId)

FUNCTION shouldUpdateVote(received, current):
    // Prefer higher epoch
    IF received.epoch > current.epoch: RETURN TRUE
    IF received.epoch < current.epoch: RETURN FALSE

    // Same epoch: prefer higher zxid
    IF received.zxid > current.zxid: RETURN TRUE
    IF received.zxid < current.zxid: RETURN FALSE

    // Same epoch and zxid: tie-break by server ID
    RETURN received.serverId > current.serverId
```

The algorithm is called 'fast' because it converges quickly in the common case. When a clear winner exists (one node has the highest zxid), all nodes update their votes in a single round. Compare this to some election protocols that require multiple rounds of message exchange before converging.
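The vote-comparison rule is compact enough to state as runnable code. Here is a Java sketch of the same logic (the Vote record and class name are illustrative, not Zookeeper's actual types):

```java
// Illustrative Fast Leader Election vote comparison: prefer the higher
// epoch, then the higher zxid, then the higher server ID as a tie-breaker.
public final class VoteComparison {

    public record Vote(long serverId, long zxid, long epoch) {}

    // Returns true if 'received' should replace 'current' as our vote
    public static boolean shouldUpdateVote(Vote received, Vote current) {
        if (received.epoch() != current.epoch()) {
            return received.epoch() > current.epoch();
        }
        if (received.zxid() != current.zxid()) {
            return received.zxid() > current.zxid();
        }
        return received.serverId() > current.serverId();
    }

    private VoteComparison() {}
}
```

Given two nodes in the same epoch, the one with the higher last zxid wins; only nodes with identical histories fall back to the server-ID tie-break, which guarantees the comparison always picks exactly one winner.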
Epochs are the mechanism by which ZAB distinguishes between different leadership terms. Proper epoch management is essential for correctness in the face of failures and network partitions.
Epoch Lifecycle:
1. Epoch Increment
When a new leader begins the discovery phase, it proposes a new epoch = max(known epochs) + 1. Followers respond with their accepted epoch and last zxid.

2. Epoch Establishment
Once the prospective leader receives epoch acknowledgments from a quorum, the new epoch is established. The leader will use this epoch for all zxids in its term.

3. Epoch Recognition
Followers only accept proposals from a leader with the current epoch. Messages from lower epochs are rejected (stale leader).

4. Epoch Persistence
Each node persists its current epoch to stable storage. On restart, nodes use their persisted epoch to avoid accidentally accepting old messages.
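The recognition and persistence rules amount to a small state machine on each follower: the accepted epoch only moves forward, and only proposals tagged with the current epoch get through. A minimal sketch (class and method names are illustrative):

```java
// Sketch of epoch-based fencing on a follower. The accepted epoch is
// monotonic: it never moves backward, so messages from a deposed leader
// (which carry an older epoch) are rejected.
public final class EpochFence {

    private long acceptedEpoch;  // persisted to stable storage in a real node

    public EpochFence(long initialEpoch) {
        this.acceptedEpoch = initialEpoch;
    }

    // Epoch establishment: accept only strictly newer epochs
    public boolean acceptNewEpoch(long proposedEpoch) {
        if (proposedEpoch <= acceptedEpoch) {
            return false;  // stale or duplicate leadership attempt
        }
        acceptedEpoch = proposedEpoch;
        return true;
    }

    // Epoch recognition: reject any proposal not from the current epoch
    public boolean acceptProposal(long proposalEpoch) {
        return proposalEpoch == acceptedEpoch;
    }
}
```

For instance, once a follower has accepted epoch 4, a partitioned old leader still broadcasting with epoch 3 finds all of its proposals rejected, which is how it discovers it has been deposed.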
| Scenario | Epoch Behavior | Outcome |
|---|---|---|
| Clean leader election | New leader uses epoch = old + 1 | Orderly transition to new term |
| Leader crashes mid-transaction | New leader's epoch > crashed leader | Uncommitted proposals are invalidated |
| Network partition heals | Minority side has stale epoch | Minority nodes resync with majority |
| Old leader thinks it's still leader | Messages rejected (old epoch) | Old leader discovers it's no longer leader |
| Node restarts after long outage | Node's epoch is behind | Node follows new leader, updates epoch |
The Split-Brain Prevention:
Epochs are crucial for preventing split-brain scenarios. Consider this case:
How does ZAB handle this?
The epoch acts as a fencing token—a mechanism that ensures only the legitimate current leader can make changes.
Epochs must never go backward. If a node restarts and forgets its epoch (due to storage failure), it could accept messages from an old leader. This is why Zookeeper requires careful attention to storage reliability—lost epoch information is a correctness hazard.
Epoch vs. Term in Raft:
ZAB's epoch is analogous to Raft's term, but there are subtle differences:
The core principle is identical: each leadership term has a unique, monotonically increasing identifier that allows nodes to determine which leader is current.
After Fast Leader Election identifies a prospective leader, the Discovery phase begins. This phase serves two critical purposes:
Discovery Protocol Steps:
Step 1: FOLLOWERINFO
Each follower sends FOLLOWERINFO to the prospective leader, containing:

This tells the prospective leader what history each follower has.

Step 2: LEADERINFO
The prospective leader calculates the new epoch (max of all follower epochs + 1) and sends LEADERINFO:

This proposes the new leadership term to all followers.

Step 3: ACKEPOCH
Followers respond with ACKEPOCH indicating they accept the new epoch. They also include:

Step 4: Epoch Established
Once the prospective leader receives ACKEPOCH from a quorum, the new epoch is established.
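Steps 2 and 4 reduce to a small calculation: take the maximum epoch seen across the responding followers (and the leader itself), add one, and wait for majority acknowledgment. A sketch, with illustrative names:

```java
import java.util.Collection;

// Sketch of the discovery-phase epoch calculation: the prospective leader
// proposes max(all epochs it has seen, including its own) + 1, and the
// epoch counts as established once a majority has acknowledged it.
public final class Discovery {

    public static long proposeNewEpoch(long ownEpoch, Collection<Long> followerEpochs) {
        long max = ownEpoch;
        for (long e : followerEpochs) {
            max = Math.max(max, e);
        }
        return max + 1;
    }

    // Majority of the full ensemble, counting the leader's own acceptance
    public static boolean epochEstablished(int ackCount, int ensembleSize) {
        return ackCount >= ensembleSize / 2 + 1;
    }

    private Discovery() {}
}
```

Taking the maximum over all reported epochs guarantees the new epoch is strictly greater than any epoch a previous leader could have used, even if that leader is still alive on the other side of a partition.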
The Critical Synchronization Insight:
During discovery, the prospective leader learns the highest zxid among the quorum. If any follower has a higher zxid than the prospective leader, those transactions might be committed (if they achieved quorum in the previous epoch).
The prospective leader must sync to the highest zxid in the quorum before proceeding.
This is non-obvious: the elected leader might not have the most up-to-date log! Fast Leader Election prefers higher zxids, but the actual elected leader might have slightly stale data if:
Discovery corrects for this by identifying the true highest zxid in the quorum and syncing the leader accordingly.
It seems counterintuitive that a leader might need to learn transactions from followers. But this is correct and necessary: the quorum intersection property guarantees that any committed transaction is on at least one node in any quorum. The leader must honor this by incorporating all committed history.
After discovery establishes the new epoch and identifies the highest committed state, the Synchronization phase ensures all followers in the quorum have consistent state before broadcast begins.
Synchronization Goals:
Synchronization Protocol:
Step 1: Leader evaluates each follower
Based on each follower's last zxid, the leader determines the sync method:

Step 2: Leader sends sync messages
The appropriate sync messages are sent:

Step 3: Commit pending transactions
The leader sends COMMIT for all synced transactions.

Step 4: NEWLEADER message
The leader sends a NEWLEADER message indicating sync is complete.

Step 5: Followers acknowledge
Followers send ACK for NEWLEADER when they've processed all sync data.

Step 6: Quorum achieved
Once the leader receives NEWLEADER ACKs from a quorum, the synchronization phase completes.
```
FUNCTION determineAndExecuteSync(follower, followerLastZxid):
    leaderMaxZxid = leader.lastCommittedZxid
    leaderMinZxid = leader.minRetainedLogZxid

    // Case 1: Follower is ahead - has uncommitted transactions
    IF followerLastZxid > leaderMaxZxid:
        // These transactions were never committed (leader failed before quorum)
        // Follower must truncate to leader's committed state
        SEND_TRUNC(follower, targetZxid = leaderMaxZxid)
        // After truncation, follower is synchronized
        SEND_NEWLEADER(follower)
        RETURN

    // Case 2: Follower is exactly caught up
    IF followerLastZxid == leaderMaxZxid:
        // No synchronization needed
        SEND_NEWLEADER(follower)
        RETURN

    // Case 3: Follower is behind but within log retention
    IF followerLastZxid >= leaderMinZxid:
        // Can use incremental sync
        missingTxns = leader.log.range(followerLastZxid + 1, leaderMaxZxid)
        FOR EACH txn IN missingTxns:
            SEND_PROPOSAL(follower, txn.zxid, txn.data)
        FOR EACH txn IN missingTxns:
            SEND_COMMIT(follower, txn.zxid)
        SEND_NEWLEADER(follower)
        RETURN

    // Case 4: Follower is too far behind
    // Must send complete snapshot
    snapshot = leader.takeSnapshot()
    SEND_SNAP(follower, snapshot)
    SEND_NEWLEADER(follower)
```

Handling the TRUNC Case:
The TRUNC case deserves special attention as it represents a subtle correctness scenario:
Scenario:
Resolution:
This is correct because T never achieved quorum acknowledgment. The client that proposed T never received a confirmation, so from the client's perspective, T failed (correctly).
The leader cannot process new write requests until synchronization completes with a quorum. If the leader started broadcasting while followers had inconsistent state, the total ordering guarantee would be violated. This is why ZAB has distinct phases—they provide clear state machine transitions.
Understanding how ZAB handles various leader failure scenarios is essential for reasoning about system behavior and correctness. Let's analyze the key failure modes.
Scenario 1: Leader Fails Before Proposing
Client sends request → Leader crashes before creating proposal
Outcome: Request is lost (never entered the system). Client times out and can retry. No consistency issues—transaction was never started.
Scenario 2: Leader Fails After Proposing, Before Any Acknowledgment
Leader creates proposal → Leader crashes → No followers received it
Outcome: Proposal is lost. Leader might have logged it, but since no follower has it, new leader won't discover it. Client times out. Safe—transaction never reached any other node.
Scenario 3: Leader Fails After Some Acknowledgments, Before Quorum
Leader proposes → Some followers ACK → Leader crashes before quorum
Outcome: Some followers have the proposal in their log, others don't. New leader election occurs:
This is the subtle case—the outcome depends on quorum membership.
| Failure Point | Leader State | Follower State | New Leader Action | Client Experience |
|---|---|---|---|---|
| Before proposal | Request received only | No knowledge | N/A | Timeout, can retry |
| After proposal, no ACKs | Logged locally only | No knowledge | N/A | Timeout, can retry |
| Minority ACKs | Logged, pending | Some have proposal | May TRUNC followers | Timeout, can retry |
| Quorum ACKs, no commit sent | Ready to commit | Quorum has proposal | Must commit | Timeout, success on retry check |
| Commit sent to some | Committed | Some committed | Sync lagging followers | May have received success |
| Commit sent to all | Committed | All committed | Just new epoch | Received success |
Scenario 4: Leader Fails After Quorum ACKs, Before Commit (Critical)
This scenario deserves deep analysis:
Analysis:
The quorum intersection property guarantees that T will be discovered and committed by the new leader. This is why quorum-based replication is safe!
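The guarantee rests on simple arithmetic: two majorities of an n-node ensemble must overlap, because 2 * (n/2 + 1) > n. A quick check of that claim:

```java
// The arithmetic behind the quorum intersection property: any two majority
// quorums of an n-node ensemble share at least one node, so the quorum that
// ACKed a proposal always overlaps the quorum that elects the next leader.
public final class QuorumMath {

    // Majority quorum size for an ensemble of the given size
    public static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Minimum possible overlap between any two quorums (by pigeonhole)
    public static int minOverlap(int ensembleSize) {
        return 2 * quorumSize(ensembleSize) - ensembleSize;
    }

    private QuorumMath() {}
}
```

For a 5-node ensemble the quorum is 3, and any two quorums of 3 share at least one node; that shared node carries the acknowledged proposal into the new leader's discovery phase.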
Scenario 5: Leader Fails After Committing Locally
Result: New leader will find T in quorum and commit it. Client might not know T succeeded, but T is durably committed. Idempotency on client side handles the uncertainty.
Because clients can't always know whether a request succeeded during leader failure, requests should be idempotent where possible. Zookeeper's conditional operations (create-if-not-exists, compare-and-set) are designed with this in mind—retrying them is safe because the condition will fail if the operation already succeeded.
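To sketch why conditional operations make retries safe, consider a create-if-not-exists wrapped in a retry loop. The Store interface and AlreadyExistsException below are illustrative stand-ins, not Zookeeper's real client API:

```java
// Sketch of a retry-safe conditional create. On an ambiguous failure
// (e.g. the leader died before replying), the client simply retries;
// if a previous attempt actually committed, the retry observes
// "already exists" and treats that as success.
public final class IdempotentCreate {

    public static class AlreadyExistsException extends Exception {}

    public interface Store {
        void createIfNotExists(String path, byte[] data) throws AlreadyExistsException;
    }

    // Returns true once the node is known to exist
    public static boolean createWithRetry(Store store, String path, byte[] data, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                store.createIfNotExists(path, data);
                return true;  // committed on this attempt
            } catch (AlreadyExistsException e) {
                return true;  // an earlier (timed-out) attempt must have committed
            } catch (RuntimeException ambiguousFailure) {
                // Outcome unknown (leader failed mid-request): safe to retry
            }
        }
        return false;
    }

    private IdempotentCreate() {}
}
```

Note the assumption that the path is uniquely this client's to create (for example, it embeds a client-chosen token); otherwise "already exists" could reflect another client's write rather than a successful earlier attempt.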
While leader failures trigger dramatic protocol transitions, follower failures are handled more gracefully. The system continues operating as long as quorum remains available.
Follower Crash Scenarios:
Scenario A: Follower Crashes, Quorum Still Available
Scenario B: Multiple Followers Crash, Quorum Lost
Scenario C: Follower Recovers After Short Outage
Follower Lagging During Broadcast:
In steady state, a slow follower might fall behind the leader. ZAB handles this gracefully:
This is why ZAB uses asynchronous broadcast—the leader doesn't wait for all followers, only quorum.
Observer Nodes:
Zookeeper also supports observer nodes—replicas that receive broadcasts but don't participate in quorum voting. Benefits:
Observers receive all commits, but their acknowledgments are not required for a proposal to commit. They are eventually consistent with the leader.
We've explored how ZAB uses a single leader to establish total ordering for transactions. Let's consolidate the essential concepts:
What's Next:
Now that you understand ZAB's leader-based ordering mechanics, the next page provides a detailed comparison of ZAB with Raft and Paxos. We'll analyze the design choices each protocol makes, their performance characteristics, and when each is most appropriate.
You now understand how ZAB leverages a single leader to achieve total ordering, how Fast Leader Election works, the role of epochs in preventing split-brain, and how the discovery and synchronization phases ensure consistency across leader transitions. Next, we'll compare ZAB with other consensus protocols.