In 2006, engineers at Yahoo! faced a problem that would become universal across the industry: how do you coordinate the actions of thousands of machines that need to work together as a single coherent system?
The challenges were immense. Servers needed to agree on which node was the leader. Configuration changes had to propagate reliably. Distributed locks had to prevent race conditions across data centers. And all of this had to work even when individual machines failed—which they did, constantly.
The solution they built was Apache ZooKeeper, and it would go on to become the foundational coordination service for some of the world's most critical distributed systems: Hadoop, Kafka, HBase, Solr, and countless internal systems at companies like LinkedIn, Twitter, and Netflix.
By the end of this page, you will understand ZooKeeper's hierarchical data model, its consistency guarantees, the znode abstraction that makes it unique, how sessions and watches enable reactive coordination, and the canonical patterns that make ZooKeeper invaluable for distributed systems. You'll also understand why ZooKeeper remains relevant despite newer alternatives.
Apache ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. At its core, ZooKeeper is a distributed, hierarchical key-value store designed specifically for coordination workloads—not for storing application data.
The key insight behind ZooKeeper is that coordination is hard, but a small set of primitives can address most coordination needs. Rather than building custom coordination logic into every distributed application, ZooKeeper provides a reliable, high-performance coordination kernel that applications can build upon.
ZooKeeper doesn't provide high-level coordination primitives like distributed locks or leader election directly. Instead, it provides simple building blocks—ordered, persistent data nodes with watches—that can be combined to implement any coordination recipe. This design philosophy keeps ZooKeeper's core simple while enabling infinite flexibility.
Why Not Just Use a Database?
A natural question arises: why not use a regular database for coordination? After all, databases can store configuration, support transactions, and provide consistency. The answer lies in the specific requirements of coordination workloads:
Ultra-low latency reads: Coordination operations are often in the critical path of every request. ZooKeeper serves reads from local replicas in microseconds.
Ordered writes: All state changes must be totally ordered. ZooKeeper guarantees that all clients see the same order of updates.
Watches for reactivity: Clients need to react immediately when state changes. Databases require polling; ZooKeeper provides push notifications.
Ephemeral state: Some coordination state (like "which nodes are alive") should automatically disappear when nodes fail. Regular databases don't model this.
Sequential consistency: Clients need strong ordering guarantees on their own operations. ZooKeeper provides per-client FIFO ordering.
| Aspect | ZooKeeper | Traditional Database |
|---|---|---|
| Read Latency | Microseconds (local replica) | Milliseconds (network roundtrip) |
| Write Ordering | Total ordering guaranteed | Depends on isolation level |
| Change Notifications | Built-in watch mechanism | Polling or custom triggers |
| Ephemeral State | Native ephemeral znodes | Requires manual cleanup |
| Data Model | Hierarchical tree (znodes) | Tables/documents |
| Data Size Limit | 1MB per node (by design) | Gigabytes per record |
| Optimized For | Coordination metadata | Application data |
ZooKeeper's data model is a hierarchical namespace, similar to a file system. This is fundamentally different from key-value stores like Redis or etcd, which use a flat namespace. The hierarchy enables natural organization of coordination data and serves as a form of namespacing.
Every node in ZooKeeper's hierarchy is called a znode (ZooKeeper node). Unlike a file system, every znode can have both data (up to 1MB) and children. This dual capability—acting as both file and directory—makes the model surprisingly expressive.
```
# Example ZooKeeper Hierarchy for a Distributed Service
/
├── services/
│   ├── user-service/
│   │   ├── config                 # Data: {"max_connections": 1000}
│   │   ├── leader                 # Data: "host1:8080" (ephemeral)
│   │   └── instances/
│   │       ├── host1-001          # Ephemeral sequential
│   │       ├── host2-001          # Ephemeral sequential
│   │       └── host3-001          # Ephemeral sequential
│   │
│   └── payment-service/
│       ├── config
│       ├── leader
│       └── instances/
│           └── ...
│
├── locks/
│   └── payment-processing/
│       ├── lock-0000000001        # Ephemeral sequential (lock holder)
│       ├── lock-0000000002        # Ephemeral sequential (waiting)
│       └── lock-0000000003        # Ephemeral sequential (waiting)
│
└── election/
    └── cluster-leader/
        ├── candidate-0000000001   # Ephemeral sequential
        └── candidate-0000000002   # Ephemeral sequential
```
The Path Abstraction
Every znode is identified by a path from the root, using slash-separated components (like /services/user-service/config). Paths must be absolute (starting with /). This design keeps names globally unambiguous and lets related coordination state be grouped under a common application prefix.
Znode Types

Persistent: Exists until explicitly deleted. Used for configuration and long-lived structure.

Ephemeral: Automatically deleted when the creating client's session ends. Used for liveness, membership, and lock ownership. Ephemeral znodes cannot have children.

Sequential: ZooKeeper appends a monotonically increasing, zero-padded 10-digit counter to the name (e.g., lock-0000000001). The counter is unique per parent znode. Essential for implementing locks and queues.

ZooKeeper limits znode data to 1MB and is designed for small metadata, not application data. This isn't a limitation—it's a design principle. Large data would increase replication latency, slow down snapshotting, and reduce overall system performance. If you're storing more than a few KB per znode, you're likely misusing ZooKeeper.
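Because the sequence counter is zero-padded to 10 digits, a plain string sort of sibling names matches numeric order—a property the lock and election recipes later on this page depend on. A minimal sketch (hypothetical helper, no ZooKeeper connection needed):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SequentialNames {
    // Extract the numeric counter from a name like "lock-0000000042"
    static long counterOf(String name) {
        return Long.parseLong(name.substring(name.lastIndexOf('-') + 1));
    }

    public static void main(String[] args) {
        List<String> children = new ArrayList<>(
            List.of("lock-0000000003", "lock-0000000001", "lock-0000000002"));
        Collections.sort(children);  // plain lexicographic sort is enough
        // The first element is the lowest sequence number (the lock holder)
        System.out.println(children.get(0) + " -> " + counterOf(children.get(0)));
    }
}
```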
ZooKeeper runs as an ensemble (cluster) of servers, typically 3, 5, or 7 nodes. The architecture is designed for high read throughput and strong ordering guarantees on writes, making it ideal for coordination workloads where reads vastly outnumber writes.
The Leader-Follower Model
ZooKeeper uses a leader-follower architecture where:

- A single leader processes all write requests and broadcasts state changes to the rest of the ensemble.
- Followers serve reads locally, forward write requests to the leader, and vote in leader elections.
- Observers (optional) receive updates and serve reads but do not vote, so they scale read capacity without enlarging the write quorum.
Write Path:

1. A client sends a write to any server; non-leaders forward it to the leader.
2. The leader assigns the write a zxid and proposes it to the followers.
3. Once a quorum (majority) of servers acknowledges, the leader commits the change and broadcasts the commit.
4. The client receives its acknowledgment.
Read Path:

1. A client reads directly from the server it is connected to—no quorum round-trip.
2. The local replica answers from its in-memory copy, which may lag slightly behind the leader.
In coordination workloads, reads typically outnumber writes by 10:1 to 100:1. Configuration is read on every request; it's updated occasionally. By serving reads locally, ZooKeeper can handle millions of reads per second with a small cluster. Adding observers scales reads linearly without impacting write latency.
| Ensemble Size | Failure Tolerance | Write Quorum | Use Case |
|---|---|---|---|
| 3 nodes | 1 failure | 2 nodes | Development, small deployments |
| 5 nodes | 2 failures | 3 nodes | Production standard |
| 7 nodes | 3 failures | 4 nodes | High-availability critical systems |
| 5 + observers | 2 failures | 3 nodes | Global deployments needing read scaling |
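The quorum and tolerance columns above follow from simple majority arithmetic, which is also why even-sized ensembles add cost without adding tolerance. A sketch (hypothetical helper methods):

```java
public class QuorumMath {
    // A write quorum is a strict majority of the ensemble
    static int writeQuorum(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Failures tolerated while a majority can still be formed
    static int failureTolerance(int ensembleSize) {
        return (ensembleSize - 1) / 2;
    }

    public static void main(String[] args) {
        for (int n : new int[]{3, 4, 5, 7}) {
            System.out.printf("%d nodes: quorum=%d, tolerates %d failure(s)%n",
                n, writeQuorum(n), failureTolerance(n));
        }
    }
}
```

Note that a 4-node ensemble tolerates only 1 failure—the same as 3 nodes—which is why ensembles are conventionally odd-sized.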
The ZAB Protocol
ZooKeeper uses the ZooKeeper Atomic Broadcast (ZAB) protocol for consensus, which is similar to Raft but predates it. ZAB provides:

- Reliable delivery: if a transaction commits on one server, it eventually commits on all servers.
- Total order: every server delivers transactions in the same order.
- Primary order: transactions from a given leader are delivered in the order that leader issued them, and all transactions from earlier epochs are delivered before any from a later epoch.
The zxid is a 64-bit number composed of:

- High 32 bits: the epoch, incremented each time a new leader is elected.
- Low 32 bits: a counter, incremented for each transaction within that epoch.
This structure ensures that a new leader's first zxid is always higher than any zxid from previous leaders, enabling crash recovery without conflicts.
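The epoch/counter layout described above can be sketched with plain bit arithmetic (hypothetical helper methods, mirroring the layout rather than any ZooKeeper API):

```java
public class Zxid {
    // Pack epoch into the high 32 bits, counter into the low 32 bits
    static long make(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    static long epochOf(long zxid)   { return zxid >>> 32; }
    static long counterOf(long zxid) { return zxid & 0xFFFFFFFFL; }

    public static void main(String[] args) {
        long lastOfOldLeader = make(4, 0xFFFFFFFFL); // highest possible zxid in epoch 4
        long firstOfNewLeader = make(5, 0);          // new leader starts epoch 5
        // Comparing zxids as plain longs orders transactions across leader changes
        System.out.println(firstOfNewLeader > lastOfOldLeader);
    }
}
```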
ZooKeeper's session abstraction is fundamental to its coordination capabilities. A session represents a long-lived connection between a client and the ZooKeeper ensemble. Sessions enable ephemeral nodes, provide crash detection, and maintain client identity across server failovers.
Session Lifecycle
The critical insight is that disconnection and expiration are different states:

- Disconnected: the client has lost contact with its server, but the session (and its ephemeral nodes) survives until the timeout elapses; the client library automatically tries other servers in the ensemble.
- Expired: the ensemble has declared the session dead; all ephemeral nodes are deleted and all watches removed. Only the ensemble can expire a session—a disconnected client cannot know its session has expired until it reconnects.
```java
// Creating a ZooKeeper session with proper event handling
ZooKeeper zk = new ZooKeeper(
    "zk1:2181,zk2:2181,zk3:2181",  // Connection string
    30000,                          // Session timeout (ms)
    new Watcher() {
        @Override
        public void process(WatchedEvent event) {
            switch (event.getState()) {
                case SyncConnected:
                    // Session established or reconnected
                    log.info("Connected to ZooKeeper");
                    break;
                case Disconnected:
                    // Network issues - ephemeral nodes STILL exist
                    // Session might recover - don't panic
                    log.warn("Disconnected from ZooKeeper");
                    break;
                case Expired:
                    // Session expired - ephemeral nodes DELETED
                    // Must create new session and re-establish state
                    log.error("Session expired - must recreate");
                    recreateSession();
                    break;
                case ConnectedReadOnly:
                    // Connected to a server that's partitioned from quorum
                    // Can read but not write
                    log.warn("Read-only connection");
                    break;
            }
        }
    });

// Session ID is unique and stable across reconnections
long sessionId = zk.getSessionId();
byte[] sessionPasswd = zk.getSessionPasswd();

// Can reconnect with same session (useful for client restart)
ZooKeeper zkReconnected = new ZooKeeper(
    "zk1:2181,zk2:2181,zk3:2181",
    30000,
    watcher,
    sessionId,      // Restore session
    sessionPasswd   // Session authentication
);
```

The client requests a session timeout, but the server clamps the negotiated value between minSessionTimeout (default: 2× tickTime) and maxSessionTimeout (default: 20× tickTime).

Session expiration is the most common source of ZooKeeper-related production incidents. When a session expires: (1) all ephemeral nodes are deleted, (2) all watches are removed, (3) the client loses any exclusive locks it held. Applications must handle this by re-establishing all coordination state—there's no automatic recovery.
Watches are ZooKeeper's mechanism for clients to receive notifications when data changes. Rather than polling, clients register interest in specific znodes and receive callbacks when those znodes are created, deleted, or modified.
This push-based model is essential for responsive coordination. When configuration changes, all interested services can react within milliseconds—no polling interval, no wasted bandwidth, no missed updates.
Watch Semantics
ZooKeeper watches have specific behavioral guarantees that are critical to understand:
One-time trigger: A watch fires once and is then removed. To continue watching, you must re-register.
Ordered delivery: Watch notifications are delivered before any other changes to that znode are visible to the client.
Guaranteed delivery: If a client has a watch set and the znode changes, the watch will fire before the client sees the new data.
Types of watches:

- Data watches: set by getData() or exists(); triggered by a change to the znode's data (setData) or deletion (delete)
- Child watches: set by getChildren(); triggered when a child of the znode is created or deleted
- Existence watches: set by exists() on a znode that may not exist yet; triggered when the znode is created or deleted
```java
// Pattern: Continuous watching with automatic re-registration
public class ConfigWatcher implements Watcher {
    private ZooKeeper zk;
    private String configPath = "/app/config";
    private volatile ConfigData currentConfig;

    public void watchConfig() throws KeeperException, InterruptedException {
        // getData registers a data watch
        byte[] data = zk.getData(
            configPath,
            this,   // Register this watcher
            null    // Stat object (optional)
        );

        // Process the current data
        currentConfig = parseConfig(data);
        log.info("Config updated: {}", currentConfig);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                // Watch fired - re-register and get new data
                watchConfig();  // This re-registers the watch
            } catch (Exception e) {
                log.error("Failed to handle config change", e);
            }
        }
    }
}

// Pattern: Watching for children (membership tracking)
public class MembershipWatcher implements Watcher {
    private ZooKeeper zk;
    private String membershipPath = "/services/myapp/instances";

    public Set<String> watchMembers() throws Exception {
        // getChildren registers a child watch
        List<String> children = zk.getChildren(
            membershipPath,
            this    // Watch for child changes
        );

        Set<String> members = new HashSet<>();
        for (String child : children) {
            // Get each member's data (usually host:port)
            byte[] data = zk.getData(
                membershipPath + "/" + child,
                false,  // No data watch on individual members
                null
            );
            members.add(new String(data, StandardCharsets.UTF_8));
        }
        return members;
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
            // Membership changed - re-fetch and re-watch
            try {
                Set<String> newMembers = watchMembers();
                notifyMembershipChange(newMembers);
            } catch (Exception e) {
                log.error("Failed to handle membership change", e);
            }
        }
    }
}
```

Between when a watch fires and when you re-register, changes can occur that you won't see as events. Always read the current state when re-registering—don't rely solely on the event content.
The pattern 'handle event → read current state → re-register watch' ensures you never miss changes.
| Operation | Data Watch | Child Watch | Existence Watch |
|---|---|---|---|
| create (parent) | — | Yes | — |
| create (node) | — | — | Yes |
| delete (node) | Yes | — | Yes |
| delete (child) | — | Yes | — |
| setData | Yes | — | — |
ZooKeeper provides a specific set of consistency guarantees that are crucial to understand. These guarantees are strong enough for coordination yet carefully bounded to enable high performance:

- Sequential consistency: updates from a single client are applied in the order they were sent.
- Atomicity: an update either succeeds completely or fails—there are no partial results.
- Single system image: a client sees the same view of the service regardless of which server it connects to, and never sees state older than what it has already observed.
- Reliability: once an update is applied, it persists until another update overwrites it.
- Timeliness: a client's view of the system is guaranteed to be up to date within a bounded delay.
ZooKeeper reads are NOT linearizable by default—they may return stale data if the connected server is behind the leader. This is intentional for read performance. For linearizable reads, use the sync() operation before reading, which forces the server to catch up with the leader first.
The sync() Operation
When you absolutely need the latest data, call sync() before your read:
```java
// Ensure we see all updates up to this moment
zk.sync("/", null, null);  // Syncs entire tree
byte[] latestConfig = zk.getData("/config", false, null);
```
sync() is asynchronous but guarantees that subsequent operations see all updates that were committed before the sync was issued. This is expensive (requires a round-trip to the leader), so use sparingly.
ZXID for Ordering
Every ZooKeeper operation returns a zxid (ZooKeeper Transaction ID). This is a globally unique, monotonically increasing identifier that establishes a total order on all changes:
```java
// The Stat object contains crucial metadata
Stat stat = new Stat();
byte[] data = zk.getData("/config", false, stat);

// Important Stat fields:
long czxid = stat.getCzxid();       // zxid of creation
long mzxid = stat.getMzxid();       // zxid of last modification
long pzxid = stat.getPzxid();       // zxid of last child change
int version = stat.getVersion();    // Data version (for optimistic locking)
int cversion = stat.getCversion();  // Children version
int aversion = stat.getAversion();  // ACL version
long ctime = stat.getCtime();       // Creation time (ms since epoch)
long mtime = stat.getMtime();       // Modification time
int dataLength = stat.getDataLength();
int numChildren = stat.getNumChildren();

// Optimistic locking with version checking
try {
    zk.setData(
        "/config",
        newData,
        stat.getVersion()  // Only succeeds if version matches
    );
} catch (KeeperException.BadVersionException e) {
    // Someone else modified it - read again and retry
    handleConflict();
}
```

ZooKeeper's primitives can be combined to implement sophisticated coordination patterns. The Apache Curator library provides production-ready implementations, but understanding the underlying recipes is essential for debugging and customization.
Fair Distributed Lock Recipe
This pattern uses ephemeral-sequential znodes to create a fair lock (FIFO ordering):
1. Create an ephemeral-sequential znode under the lock path: /locks/myresource/lock-
2. Get all children of /locks/myresource (without setting a watch).
3. If your znode has the lowest sequence number, you hold the lock.
4. Otherwise, set a watch only on the child with the next-lowest sequence number (your predecessor).
5. When the predecessor's znode is deleted, go back to step 2.

This is "herd-safe"—only one client wakes up per lock release, avoiding thundering herd.
```java
public class DistributedLock {
    private final ZooKeeper zk;
    private final String lockPath;
    private String lockNode;

    public DistributedLock(ZooKeeper zk, String lockPath) {
        this.zk = zk;
        this.lockPath = lockPath;
    }

    public void lock() throws Exception {
        // Step 1: Create ephemeral-sequential node
        lockNode = zk.create(
            lockPath + "/lock-",
            new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.EPHEMERAL_SEQUENTIAL
        );
        tryAcquire();
    }

    private void tryAcquire() throws Exception {
        while (true) {
            // Step 2: Get all children
            List<String> children = zk.getChildren(lockPath, false);
            Collections.sort(children);

            String myNode = lockNode.substring(lockPath.length() + 1);
            int myIndex = children.indexOf(myNode);

            // Step 3: Am I the lowest?
            if (myIndex == 0) {
                return;  // Lock acquired!
            }

            // Step 4: Watch predecessor only
            String predecessor = children.get(myIndex - 1);
            final Object lock = new Object();
            Stat stat = zk.exists(lockPath + "/" + predecessor, event -> {
                synchronized (lock) {
                    lock.notifyAll();
                }
            });

            if (stat != null) {
                synchronized (lock) {
                    lock.wait();  // Wait for predecessor to die
                }
            }
            // Loop again to check if we got the lock
        }
    }

    public void unlock() throws Exception {
        zk.delete(lockNode, -1);
    }
}
```

While understanding these recipes is essential, don't implement them yourself for production. Apache Curator provides production-tested implementations with proper edge case handling, retry logic, and connection management. Curator's recipes have been battle-tested across thousands of deployments.
We've covered ZooKeeper's architecture, data model, and core mechanisms. Let's consolidate the essential knowledge:
- Hierarchical data model: znodes act as both files and directories, addressed by absolute paths like /services/myapp/config.
- Ephemeral and sequential znodes turn liveness detection and ordering into first-class primitives.
- Sessions and watches enable crash detection and push-based reactivity; handle session expiration explicitly.
- ZAB and zxids impose a total order on all state changes; reads are fast but may be stale unless you sync() first.

When to Use ZooKeeper
✅ Good fit: Leader election, distributed locks, configuration management, service discovery, cluster membership, barrier synchronization
❌ Poor fit: Storing application data, high-volume writes, large values, message queuing (use Kafka instead)
What's Next:
In the next page, we'll explore etcd, a newer coordination service that provides a simpler key-value model with Raft consensus. You'll learn how etcd differs from ZooKeeper and when each is the better choice.
You now understand ZooKeeper's hierarchical data model, its consistency guarantees, and how its primitives compose into coordination patterns. This foundation will help you understand its alternatives and make informed choices between coordination services.