When you send a message to Kafka that eventually reaches billions of devices, or run a Hadoop job that processes petabytes of data, there's a silent guardian ensuring the machinery doesn't fly apart: Apache Zookeeper and its ZAB protocol. These systems process staggering volumes of data, but their coordination layer—the part that keeps everything consistent—relies on the atomic broadcast guarantees that ZAB provides.
Understanding how Zookeeper is used in these production systems isn't just academic—it's essential for operating, debugging, and designing large-scale distributed systems. The patterns that Kafka and Hadoop use with Zookeeper have become templates for distributed coordination across the industry.
By the end of this page, you will understand how Kafka uses Zookeeper for controller election, topic metadata, and broker management; how the Hadoop ecosystem leverages Zookeeper for NameNode HA, HBase coordination, and service discovery; and the emerging Kafka Raft (KRaft) architecture that's replacing Zookeeper in Kafka.
Before diving into Kafka and Hadoop specifically, let's establish what Zookeeper provides that makes it so valuable for distributed coordination.
Zookeeper's Core Abstractions:
1. Hierarchical Namespace (ZNode Tree) Zookeeper organizes data in a tree structure similar to a filesystem. Each node (znode) can hold data and have children. This enables natural organization of distributed state.
2. Ephemeral Nodes Znodes can be ephemeral—they automatically disappear when the client session that created them ends. This is powerful for failure detection, service registration, and leader election: when a client dies or loses connectivity, its ephemeral znodes vanish and other clients learn about it automatically.
3. Sequential Nodes Znodes can be created with monotonically increasing sequence numbers appended. Combined with ephemeral nodes, this enables fair queuing, leader election without a thundering herd, and distributed locks.
4. Watches Clients can set watches on znodes. When the znode changes, the client receives a notification. This enables reactive programming without polling.
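An important subtlety is that watches are one-shot: a watch fires for the next change only and must be re-registered by the client. A minimal in-memory sketch of that semantics (the class and method names here are illustrative, not the real ZooKeeper client API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Toy model of ZooKeeper's one-shot watch semantics: a watch registered
// on a read fires once on the next change, then must be set again.
class WatchSketch {
    private final Map<String, byte[]> znodes = new HashMap<>();
    private final Map<String, List<Consumer<String>>> watches = new HashMap<>();

    // Read data and optionally register a one-shot watch on the path
    byte[] getData(String path, Consumer<String> watch) {
        if (watch != null) {
            watches.computeIfAbsent(path, p -> new ArrayList<>()).add(watch);
        }
        return znodes.get(path);
    }

    // Write data; fire and clear any pending watches on the path
    void setData(String path, byte[] data) {
        znodes.put(path, data);
        List<Consumer<String>> toFire = watches.remove(path); // one-shot: cleared before firing
        if (toFire != null) {
            toFire.forEach(w -> w.accept(path));
        }
    }
}
```

Because of the one-shot rule, a second change after the notification goes unseen unless the client re-reads and re-watches—a common source of bugs in watch-based code.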
| Primitive | Zookeeper Feature | Common Use Cases |
|---|---|---|
| Configuration Management | Persistent znodes + watches | Store configs, notify on change |
| Service Discovery | Ephemeral znodes + parent watch | Register instances, detect failures |
| Leader Election | Ephemeral sequential znodes | Elect controller, handle failover |
| Distributed Locks | Ephemeral sequential znodes | Mutual exclusion, read/write locks |
| Barriers | Znode existence + watches | Synchronize distributed processes |
| Group Membership | Ephemeral znodes under parent | Track cluster members |
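The group-membership row of the table can be sketched with a toy in-memory model. The names below are illustrative; in real ZooKeeper, ephemeral znodes are tied to server-side session liveness rather than an explicit close call:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: ephemeral znodes under a parent implement group membership.
// When a session ends, its znodes vanish and the member drops out.
class MembershipSketch {
    // znode path -> owning session id
    private final Map<String, String> znodeToSession = new HashMap<>();

    void createEphemeral(String sessionId, String path) {
        znodeToSession.put(path, sessionId);
    }

    // Session end (client crash, network loss) removes all its ephemeral znodes
    void closeSession(String sessionId) {
        znodeToSession.values().removeIf(s -> s.equals(sessionId));
    }

    // Listing the parent's children yields the current live membership
    List<String> children(String parent) {
        List<String> kids = new ArrayList<>();
        for (String path : znodeToSession.keySet()) {
            if (path.startsWith(parent + "/")) kids.add(path);
        }
        Collections.sort(kids);
        return kids;
    }
}
```

A consumer that sets a watch on the parent znode is notified whenever membership changes, which is exactly how Kafka brokers and HBase RegionServers are tracked.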
Why ZAB Matters for These Primitives:
All of Zookeeper's coordination primitives depend on ZAB's atomic broadcast guarantees:
Total ordering ensures that all observers see state changes in the same order. If node A's configuration change commits before node B's registration, every client observes those two events in that order.
Prefix consistency means you can rely on seeing complete history. If you see znode X created, you've seen all earlier state.
FIFO client ordering ensures that your operations are applied sequentially. If you create a lock then write data, the write won't happen before the lock.
Durability via quorum acknowledgment means committed changes survive failures.
Without these guarantees, Zookeeper's coordination primitives would be unreliable—race conditions and inconsistencies would undermine their purpose.
Apache Kafka has historically been deeply integrated with Zookeeper, relying on it for critical coordination functions. Understanding this architecture is essential both for operating existing Kafka deployments and appreciating the motivations behind KRaft.
What Kafka Stores in Zookeeper:
1. Broker Registration
Each Kafka broker creates an ephemeral znode under /brokers/ids/[broker-id] when it starts. This znode contains the broker's connection information.
2. Topic and Partition Metadata
/brokers/topics/[topic-name]/partitions/[partition-id] — Partition assignments
/config/topics/[topic-name] — Topic-level configuration overrides
3. Controller Election
The Kafka controller—the broker responsible for partition leadership elections and rebalancing—is elected using Zookeeper. First broker to successfully create /controller (ephemeral) becomes the controller.
4. Consumer Group Offsets (Legacy)
Older Kafka versions stored consumer group offsets in Zookeeper. Modern Kafka uses internal topics (__consumer_offsets) instead.
The Kafka Controller's Role:
The Kafka controller is the brain of the cluster, and its election via Zookeeper is critical:
Election Process:
Every broker attempts to create /controller (ephemeral, non-sequential). The first broker whose create of /controller succeeds wins; the others set a watch on the znode and retry when it disappears.
Controller Responsibilities:
- Electing partition leaders when brokers fail
- Tracking broker liveness (by watching /brokers/ids/*)
- Propagating metadata updates to all brokers
The Controller Epoch:
Kafka maintains a controller epoch (like ZAB's epoch) in /controller_epoch. This prevents stale controllers from making changes after being fenced. Every controller command includes the controller epoch; brokers reject commands from old epochs.
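The fencing rule can be sketched in a few lines. This is an illustrative model of the check, not Kafka's actual broker code:

```java
// Toy model of epoch-based fencing: a broker remembers the highest
// controller epoch it has seen and rejects commands from older epochs,
// so a deposed ("zombie") controller can no longer affect the cluster.
class FencingSketch {
    private int highestEpochSeen = 0;

    boolean acceptCommand(int controllerEpoch) {
        if (controllerEpoch < highestEpochSeen) {
            return false; // stale controller: command rejected
        }
        highestEpochSeen = controllerEpoch; // advance on newer epochs
        return true;
    }
}
```

Because /controller_epoch is updated through Zookeeper, ZAB's total ordering guarantees every broker agrees on which epoch is current.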
In traditional Kafka, if Zookeeper becomes unavailable, new broker registrations fail, partition leadership cannot change (existing replicas continue serving), and topic operations halt. Existing message production and consumption continue, but the cluster cannot adapt to failures.
Kafka's partition leadership model is where Zookeeper's coordination capabilities shine most visibly. Each partition has exactly one leader at any time, and all reads/writes go through the leader.
Partition State in Zookeeper:
For each partition, Zookeeper stores:
/brokers/topics/[topic]/partitions/[partition]/state

```json
{
  "controller_epoch": 123,
  "leader": 1,
  "version": 1,
  "leader_epoch": 45,
  "isr": [1, 2, 3]
}
```
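Updates to this state znode use a version-conditioned write, so two controllers racing on the same partition cannot both succeed. A toy sketch of that compare-and-set semantics (real ZooKeeper signals a mismatch with KeeperException.BadVersionException rather than a boolean):

```java
// Toy model of ZooKeeper's versioned setData: the write succeeds only
// when the caller's expected version matches the znode's current version.
class VersionedNode {
    private byte[] data;
    private int version = 0;

    synchronized boolean setData(byte[] newData, int expectedVersion) {
        if (expectedVersion != version) {
            return false; // concurrent writer already advanced the version
        }
        data = newData;
        version++;
        return true;
    }

    synchronized int getVersion() {
        return version;
    }
}
```

The loser of the race re-reads the znode, observes the new state, and decides whether its update still makes sense—an optimistic-concurrency pattern that ZAB's total ordering makes safe.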
ISR (In-Sync Replica) Management:
The ISR is crucial for Kafka's durability guarantees:
Replicas in the ISR are fully caught up with the leader. If a follower fails to keep up within the lag threshold (replica.lag.time.max.ms), it's removed from the ISR; once it catches back up, it rejoins.
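A rough in-memory sketch of that lag rule (names are illustrative; real Kafka tracks each follower's fetch progress on the partition leader):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of ISR shrinking: followers that have not caught up within
// maxLagMs (cf. replica.lag.time.max.ms) are dropped from the ISR.
class IsrSketch {
    private final long maxLagMs;
    private final Map<Integer, Long> lastCaughtUpMs = new HashMap<>();
    private final Set<Integer> isr = new TreeSet<>();

    IsrSketch(long maxLagMs, List<Integer> replicas) {
        this.maxLagMs = maxLagMs;
        for (int r : replicas) {
            isr.add(r);
            lastCaughtUpMs.put(r, 0L); // no fetch observed yet
        }
    }

    // Record that a follower caught up to the leader's log end at nowMs
    void recordFetch(int brokerId, long nowMs) {
        lastCaughtUpMs.put(brokerId, nowMs);
    }

    // Drop any follower whose last catch-up is older than the threshold
    Set<Integer> currentIsr(long nowMs) {
        isr.removeIf(b -> nowMs - lastCaughtUpMs.get(b) > maxLagMs);
        return isr;
    }
}
```

In real Kafka the leader also re-adds a replica once it catches back up; the sketch shows only the shrinking side that protects durability.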
```
// Kafka Controller: Partition Leader Election
// Triggered when current leader fails (ephemeral znode disappears)

FUNCTION electPartitionLeader(topic, partition):
    // 1. Read current state from Zookeeper
    currentState = zk.getData("/brokers/topics/{topic}/partitions/{partition}/state")
    currentLeader = currentState.leader
    currentISR = currentState.isr
    currentLeaderEpoch = currentState.leader_epoch

    // 2. Verify leader is actually failed
    IF brokerIsAlive(currentLeader):
        RETURN  // Leader is fine, no election needed

    // 3. Select new leader from ISR
    newLeader = null
    FOR broker IN currentISR:
        IF brokerIsAlive(broker):
            newLeader = broker
            BREAK  // First alive ISR member becomes leader

    // 4. Handle unclean leader election (if all ISR failed)
    IF newLeader == null AND unclean_leader_election_enabled:
        // Pick any alive replica - may lose data!
        newLeader = selectAnyAliveReplica(topic, partition)
        LOG.WARN("Unclean leader election for {topic}/{partition}")

    IF newLeader == null:
        RETURN ERROR("No eligible leader found")

    // 5. Write new state to Zookeeper (atomic via ZAB)
    newState = {
        controller_epoch: myControllerEpoch,
        leader: newLeader,
        leader_epoch: currentLeaderEpoch + 1,
        isr: [newLeader],  // ISR resets to just the new leader
        version: currentState.version + 1
    }

    // Conditional write - only succeeds if version matches
    zk.setData("/brokers/topics/{topic}/partitions/{partition}/state",
               newState,
               expectedVersion = currentState.version)

    // 6. Notify all brokers of leadership change
    broadcastLeaderChangeToAllBrokers(topic, partition, newLeader)
```

Leader Epoch and Fencing:
The leader epoch is critical for preventing split-brain scenarios: a deposed leader that comes back online cannot overwrite committed data, because requests stamped with an old leader epoch are rejected.
Relationship to ZAB:
Kafka's leader_epoch is conceptually similar to ZAB's epoch: both increase monotonically on every leadership change, and both are used to fence stale leaders.
The difference is scope: ZAB's epoch is for the Zookeeper cluster; Kafka's leader_epoch is per-partition and stored in Zookeeper.
When all ISR replicas fail, Kafka can elect an out-of-sync replica as leader (if unclean.leader.election.enable=true). This trades availability for consistency—you get a leader, but may lose messages. Most production deployments disable this, preferring unavailability to data loss.
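The clean-versus-unclean choice can be condensed into a small selection function. This is an illustrative helper, not Kafka's implementation:

```java
import java.util.List;
import java.util.Set;

// Toy model of the leader-selection rule: prefer the first live ISR
// member (clean election); fall back to any live replica only when
// unclean leader election is enabled -- at the cost of possible data loss.
class LeaderChoice {
    static Integer elect(List<Integer> isr, List<Integer> allReplicas,
                         Set<Integer> aliveBrokers, boolean uncleanEnabled) {
        for (int broker : isr) {
            if (aliveBrokers.contains(broker)) {
                return broker; // clean: in-sync and alive
            }
        }
        if (uncleanEnabled) {
            for (int broker : allReplicas) {
                if (aliveBrokers.contains(broker)) {
                    return broker; // unclean: may have missed committed messages
                }
            }
        }
        return null; // no eligible leader -- partition stays offline
    }
}
```

With unclean election disabled, the partition remains unavailable until an ISR member returns—exactly the consistency-over-availability trade the text describes.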
Despite Zookeeper's proven reliability, Apache Kafka has been migrating away from it toward KRaft (Kafka Raft)—a built-in consensus protocol that eliminates the Zookeeper dependency. Understanding why reveals both Zookeeper's limitations and the maturation of the Kafka project.
Why Move Away from Zookeeper?
1. Operational Complexity — Running Kafka means operating and monitoring two distributed systems, each with its own configuration, upgrade cycle, and failure modes.
2. Scalability Limits — Cluster metadata must fit in Zookeeper and flow through a single controller, which caps the practical number of partitions per cluster.
3. Latency for Metadata Operations — Every metadata change requires round trips to the Zookeeper ensemble, slowing controller failover and administrative operations.
4. Single Point of Failure Perception — Although Zookeeper is replicated, operators must treat the ensemble as an extra critical dependency whose outage freezes all cluster changes.
| Aspect | Zookeeper Mode | KRaft Mode |
|---|---|---|
| Consensus protocol | ZAB (via Zookeeper cluster) | Raft (built into Kafka controllers) |
| Metadata storage | Zookeeper znodes | Kafka internal log topic (__cluster_metadata) |
| Controller count | 1 active (elected via ZK) | Raft quorum (one active leader, hot standbys) |
| Components to operate | Kafka brokers + Zookeeper cluster | Kafka brokers (some in controller role) |
| Partition scalability | ~100K partitions | Millions of partitions targeted |
| Failover time | Seconds (Zookeeper election + discovery) | Milliseconds (Raft election) |
| Monitoring | Kafka + Zookeeper metrics | Kafka metrics only |
KRaft Architecture:
In KRaft mode, a subset of Kafka brokers act as controllers, forming a Raft quorum:
The __cluster_metadata internal topic stores all cluster metadata as an event log; controllers replicate it via Raft, and brokers consume it to stay current.
Benefits Realized: simpler deployment, much faster controller failover, higher partition counts, and a single system to secure and monitor.
Migration Path:
Kafka supports migrating existing clusters from Zookeeper mode to KRaft mode.
KRaft uses Raft rather than ZAB because Kafka's team wanted tight integration without the overhead of the full Zookeeper coordination service. Raft's well-specified implementation guidance and familiarity made it a natural choice. The core consensus guarantees are equivalent—the practical differences are in the implementation details.
Apache Hadoop and its ecosystem have deep dependencies on Zookeeper for coordination. Unlike Kafka's evolution toward independence, most Hadoop components continue to rely on Zookeeper as their coordination service.
HDFS High Availability:
The Hadoop Distributed File System (HDFS) uses Zookeeper for NameNode high availability:
Problem: The NameNode is HDFS's brain, storing all filesystem metadata. Originally, it was a single point of failure.
Solution: Active/Standby NameNode pair with Zookeeper coordination. A ZKFailoverController (ZKFC) process alongside each NameNode monitors its health and competes for an ephemeral lock znode; the holder's NameNode is active. If the active fails, its Zookeeper session expires, the znode disappears, and the standby's ZKFC acquires the lock and promotes its NameNode (after fencing the old active).
Apache HBase Coordination:
HBase, the distributed wide-column store built on Hadoop, is heavily dependent on Zookeeper:
1. RegionServer Registration
Each RegionServer registers an ephemeral znode at /hbase/rs/[server-name]; when a server dies, the znode vanishes and the HMaster reassigns its regions.
2. HMaster Leader Election
Multiple HMasters can run; they compete for an ephemeral znode, and a backup takes over when the active master's session expires.
3. Table and Region Metadata
/hbase/table/[table-name] stores table state (enabled, disabled).
4. Distributed Locks
HBase has used Zookeeper-based locks to serialize administrative operations such as table schema changes.
Critical Path: If Zookeeper is unavailable, HBase cannot:
- Detect RegionServer failures or reassign their regions
- Elect a new HMaster
- Complete region assignments or table state changes
Existing read/write operations to healthy RegionServers continue, but failure handling stops.
HBase creates many ephemeral znodes (one per RegionServer, plus region metadata). Large HBase clusters require careful Zookeeper sizing—more memory, faster disks, and appropriate session timeouts. Session timeout tuning is critical: too short causes false failure detection; too long delays real failure detection.
Both Kafka and Hadoop implement common coordination patterns on top of Zookeeper. These patterns have become templates used throughout the industry.
Pattern 1: Leader Election
The standard pattern used by Kafka controllers, HBase masters, and many other systems:
/[service]/leader ← ephemeral znode, first to create is leader
Pattern 2: Fair Leader Election (Queue-Based)
For fairer ordering or when you need to know your position:
/[service]/candidates/candidate-[sequence]
```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.*;
import org.apache.zookeeper.Watcher.Event.EventType;

// Pattern 1: Simple Leader Election (Kafka-style)
public class SimpleLeaderElection {
    private ZooKeeper zk;
    private String leaderPath = "/kafka/controller";
    private String myId;  // remembered so we can retry when the leader dies

    public boolean tryBecomeLeader(String myId) throws Exception {
        this.myId = myId;
        try {
            // Attempt to create ephemeral node
            zk.create(leaderPath, myId.getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.EPHEMERAL);
            return true;  // I am the leader
        } catch (KeeperException.NodeExistsException e) {
            // Someone else is leader, watch and wait
            zk.exists(leaderPath, this::onLeaderChange);
            return false;
        }
    }

    private void onLeaderChange(WatchedEvent event) {
        if (event.getType() == EventType.NodeDeleted) {
            try {
                tryBecomeLeader(myId);  // Leader died, try to take over
            } catch (Exception e) {
                // handle/log in real code
            }
        }
    }
}

// Pattern 2: Fair Queue-Based Election (HBase-style)
public class FairLeaderElection {
    private ZooKeeper zk;
    private String candidatesPath = "/hbase/master/candidates";
    private String myZnode;

    public void joinElection(String myId) throws Exception {
        // Create ephemeral sequential node
        myZnode = zk.create(
            candidatesPath + "/candidate-",
            myId.getBytes(),
            ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.EPHEMERAL_SEQUENTIAL);
        checkLeadership();
    }

    private void checkLeadership() throws Exception {
        List<String> candidates = zk.getChildren(candidatesPath, false);
        Collections.sort(candidates);  // Sort by sequence number

        int myIndex = candidates.indexOf(
            myZnode.substring(candidatesPath.length() + 1));

        if (myIndex == 0) {
            becomeLeader();  // I have lowest sequence number
        } else {
            // Watch only the node before me (avoids the herd effect)
            String nodeToWatch = candidatesPath + "/" + candidates.get(myIndex - 1);
            zk.exists(nodeToWatch, event -> {
                if (event.getType() == EventType.NodeDeleted) {
                    try {
                        checkLeadership();
                    } catch (Exception e) {
                        // handle/log in real code
                    }
                }
            });
        }
    }

    private void becomeLeader() {
        // assume mastership
    }
}
```

Pattern 3: Configuration Management
Store configuration in znodes, get notified on changes:
/[service]/config/[setting-name] ← persistent znode, clients set a watch
Pattern 4: Service Registry (Service Discovery)
/services/[service-name]/instances/[instance-id] ← ephemeral
/services/[service-name]/instances ← consumers watch this parent for changes
Pattern 5: Distributed Barrier
Synchronize N processes:
/barriers/[barrier-id]/[process-id] ← each process creates a znode; all proceed once N exist
Operating Zookeeper for production Kafka and Hadoop deployments requires careful attention to several factors that directly impact ZAB's performance.
Cluster Sizing:
Why odd numbers? Zookeeper quorums need a majority: with N nodes, quorum = ⌊N/2⌋ + 1.
| Nodes | Quorum | Tolerated Failures |
|---|---|---|
| 3 | 2 | 1 |
| 4 | 3 | 1 |
| 5 | 3 | 2 |
| 6 | 4 | 2 |
| 7 | 4 | 3 |
A 4-node ensemble tolerates only 1 failure (the same as 3 nodes), so avoid even numbers. Most deployments use 3 or 5 nodes.
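The table's arithmetic fits in two lines (majority = ⌊N/2⌋ + 1, tolerated failures = N − majority):

```java
// Quorum arithmetic behind the sizing table above.
class QuorumMath {
    static int majority(int nodes) {
        return nodes / 2 + 1;          // integer division = floor(N/2)
    }

    static int toleratedFailures(int nodes) {
        return nodes - majority(nodes); // ensemble survives losing this many
    }
}
```

The even-number penalty falls out directly: adding a fourth node raises the quorum to 3 without raising the failures tolerated.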
Disk Considerations:
ZAB's performance is heavily disk-bound: every proposal is fsynced to the transaction log before it is acknowledged, so transaction-log latency directly bounds write latency. Put dataLogDir on a dedicated fast SSD, separate from snapshots and other workloads.
| Parameter | Default | Guidance | Why It Matters |
|---|---|---|---|
| tickTime | 2000ms | Keep default for most cases | Base time unit for all intervals |
| initLimit | 10 (× tickTime) | Increase for large snapshots | Time for followers to sync at startup |
| syncLimit | 5 (× tickTime) | Tune for network latency | Time for followers to lag before disconnect |
| maxSessionTimeout | 20 × tickTime | Match client needs | Longest session timeout clients can request |
| autopurge.purgeInterval | 0 (disabled) | Enable for production | Automatic log cleanup interval |
| dataDir | /var/zookeeper | Fast SSD, not shared | Main data directory (snapshots) |
| dataLogDir | (same as dataDir) | Separate fast SSD | Transaction log directory (CRITICAL) |
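An illustrative zoo.cfg fragment applying the table's guidance. The hostnames, paths, and retention values below are placeholders, not recommendations for every environment:

```properties
# Base time unit; most other intervals are multiples of this
tickTime=2000
initLimit=10
syncLimit=5

# Snapshots and transaction log on separate fast disks
dataDir=/var/zookeeper/data
dataLogDir=/var/zookeeper/txlog

# Automatic cleanup of old snapshots/logs (hours between purges)
autopurge.purgeInterval=24
autopurge.snapRetainCount=5

# Three-node ensemble: peer port 2888, leader-election port 3888
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```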
Monitoring ZAB Health:
Key Metrics to Watch: outstanding requests, average and max request latency, pending syncs, and the number of synced followers (all exposed via mntr).
JVM Considerations: size the heap so snapshots fit comfortably without swapping, and tune GC to keep pauses well below session timeouts.
Network Placement: keep ensemble members in low-latency proximity to each other, on a network path isolated from bulk data traffic.
Zookeeper exposes 'four letter word' commands for monitoring: 'stat', 'mntr', 'ruok', 'envi', etc. For production monitoring, prefer 'mntr' (metrics in key=value format) over 'stat'. Note: Admin Server (HTTP) is the modern alternative, enabled by default in newer versions.
When Zookeeper misbehaves, Kafka and Hadoop coordination breaks. Understanding common failure modes helps with rapid diagnosis.
Issue 1: Leader Election Storms
Symptoms:
- Frequent Zookeeper leader elections and rapidly climbing epoch numbers
- Clients repeatedly disconnecting and reconnecting
Causes:
- Network instability between ensemble members
- Long GC pauses on Zookeeper JVMs
- Slow or saturated transaction-log disks
Remediation:
- Isolate the transaction log on a dedicated fast disk
- Review tickTime, syncLimit settings
Issue 2: Session Expiration Floods
Symptoms:
- Many client sessions expiring at once; ephemeral znodes (broker registrations, locks) vanishing in bursts
Causes:
- GC pauses on clients or Zookeeper servers
- Network partitions or an overloaded ensemble
- Session timeouts set too aggressively
Issue 3: Slow Cluster Recovery After Full Outage
Symptoms:
- After restarting the ensemble, followers take a long time to rejoin and the quorum forms slowly
Causes:
- Large snapshots and transaction logs to replay
- initLimit too short for data volume
Remediation:
- Enable autopurge so snapshots and logs stay small
- Increase initLimit during recovery if needed
Issue 4: Kafka Controller Thrashing
Symptoms:
- The controller role bounces between brokers (repeated elections, churn visible in controller metrics and the __controller topic)
Causes:
- Broker sessions to Zookeeper expiring under load or GC pressure
- Unstable network between brokers and the ensemble
Remediation:
- Tune broker GC to avoid long pauses
- Increase zookeeper.session.timeout.ms
We've explored how ZAB powers Zookeeper's role in Kafka and Hadoop—critical production systems processing enormous data volumes.
Module Complete:
You've now completed the comprehensive study of ZAB (Zookeeper Atomic Broadcast). From the core protocol mechanics to leader-based ordering, comparisons with Raft and Paxos, and real-world usage in Kafka and Hadoop, you understand one of the most important consensus protocols in production distributed systems.
This knowledge enables you to operate and debug Zookeeper-backed deployments, reason about coordination failures, and evaluate alternatives such as KRaft.
Congratulations! You've mastered ZAB (Zookeeper Atomic Broadcast). You understand its core mechanics, how it provides leader-based ordering, how it compares to Raft and Paxos, and how it powers critical distributed systems like Kafka and Hadoop. This knowledge is essential for any engineer working with distributed systems at scale.