In 2006, engineers at Yahoo! faced a problem that would become universal across the industry: how do you coordinate the actions of thousands of machines that need to work together as a single coherent system?
The challenges were immense. Servers needed to agree on which node was the leader. Configuration changes had to propagate reliably. Distributed locks had to prevent race conditions across data centers. And all of this had to work even when individual machines failed—which they did, constantly.
The solution they built was Apache ZooKeeper, and it would go on to become the foundational coordination service for some of the world's most critical distributed systems: Hadoop, Kafka, HBase, Solr, and countless internal systems at companies like LinkedIn, Twitter, and Netflix.
By the end of this page, you will understand ZooKeeper's hierarchical data model, its consistency guarantees, the znode abstraction that makes it unique, how sessions and watches enable reactive coordination, and the canonical patterns that make ZooKeeper invaluable for distributed systems. You'll also understand why ZooKeeper remains relevant despite newer alternatives.
Apache ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. At its core, ZooKeeper is a distributed, hierarchical key-value store designed specifically for coordination workloads—not for storing application data.
The key insight behind ZooKeeper is that coordination is hard, but a small set of primitives can address most coordination needs. Rather than building custom coordination logic into every distributed application, ZooKeeper provides a reliable, high-performance coordination kernel that applications can build upon.
ZooKeeper doesn't provide high-level coordination primitives like distributed locks or leader election directly. Instead, it provides simple building blocks—ordered, persistent data nodes with watches—that can be combined to implement any coordination recipe. This design philosophy keeps ZooKeeper's core simple while enabling infinite flexibility.
Why Not Just Use a Database?
A natural question arises: why not use a regular database for coordination? After all, databases can store configuration, support transactions, and provide consistency. The answer lies in the specific requirements of coordination workloads:
Ultra-low latency reads: Coordination operations are often in the critical path of every request. ZooKeeper serves reads from local replicas in microseconds.
Ordered writes: All state changes must be totally ordered. ZooKeeper guarantees that all clients see the same order of updates.
Watches for reactivity: Clients need to react immediately when state changes. Databases require polling; ZooKeeper provides push notifications.
Ephemeral state: Some coordination state (like "which nodes are alive") should automatically disappear when nodes fail. Regular databases don't model this.
Sequential consistency: Clients need strong ordering guarantees on their own operations. ZooKeeper provides per-client FIFO ordering.
| Aspect | ZooKeeper | Traditional Database |
|---|---|---|
| Read Latency | Microseconds (local replica) | Milliseconds (network roundtrip) |
| Write Ordering | Total ordering guaranteed | Depends on isolation level |
| Change Notifications | Built-in watch mechanism | Polling or custom triggers |
| Ephemeral State | Native ephemeral znodes | Requires manual cleanup |
| Data Model | Hierarchical tree (znodes) | Tables/documents |
| Data Size Limit | 1MB per node (by design) | Gigabytes per record |
| Optimized For | Coordination metadata | Application data |
ZooKeeper's data model is a hierarchical namespace, similar to a file system. This is fundamentally different from key-value stores like Redis or etcd, which use a flat namespace. The hierarchy enables natural organization of coordination data and serves as a form of namespacing.
Every node in ZooKeeper's hierarchy is called a znode (ZooKeeper node). Unlike a file system, every znode can have both data (up to 1MB) and children. This dual capability—acting as both file and directory—makes the model surprisingly expressive.
```
# Example ZooKeeper Hierarchy for a Distributed Service
/
├── services/
│   ├── user-service/
│   │   ├── config                 # Data: {"max_connections": 1000}
│   │   ├── leader                 # Data: "host1:8080" (ephemeral)
│   │   └── instances/
│   │       ├── host1-001          # Ephemeral sequential
│   │       ├── host2-001          # Ephemeral sequential
│   │       └── host3-001          # Ephemeral sequential
│   │
│   └── payment-service/
│       ├── config
│       ├── leader
│       └── instances/
│           └── ...
│
├── locks/
│   └── payment-processing/
│       ├── lock-0000000001        # Ephemeral sequential (lock holder)
│       ├── lock-0000000002        # Ephemeral sequential (waiting)
│       └── lock-0000000003        # Ephemeral sequential (waiting)
│
└── election/
    └── cluster-leader/
        ├── candidate-0000000001   # Ephemeral sequential
        └── candidate-0000000002   # Ephemeral sequential
```
The Path Abstraction
Every znode is identified by a path from the root, using slash-separated components (like /services/user-service/config). Paths must be absolute (starting with /). This design keeps names globally unambiguous and lets related coordination state be grouped under a common application prefix.
Znode Types

Persistent: Exists until explicitly deleted. Used for configuration and long-lived structure.

Ephemeral: Automatically deleted when the creating client's session ends. Used for liveness, membership, and lock ownership. Ephemeral znodes cannot have children.

Sequential: ZooKeeper appends a monotonically increasing, zero-padded 10-digit counter to the name (e.g., lock-0000000001). The counter is unique per parent znode. Essential for implementing locks and queues.

ZooKeeper limits znode data to 1MB and is designed for small metadata, not application data. This isn't a limitation—it's a design principle. Large data would increase replication latency, slow down snapshotting, and reduce overall system performance. If you're storing more than a few KB per znode, you're likely misusing ZooKeeper.
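Because the sequence counter is zero-padded to 10 digits, a plain string sort of sibling names matches numeric order—a property the lock and election recipes later on this page depend on. A minimal sketch (hypothetical helper, no ZooKeeper connection needed):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SequentialNames {
    // Extract the numeric counter from a name like "lock-0000000042"
    static long counterOf(String name) {
        return Long.parseLong(name.substring(name.lastIndexOf('-') + 1));
    }

    public static void main(String[] args) {
        List<String> children = new ArrayList<>(
            List.of("lock-0000000003", "lock-0000000001", "lock-0000000002"));
        Collections.sort(children);  // plain lexicographic sort is enough
        // The first element is the lowest sequence number (the lock holder)
        System.out.println(children.get(0) + " -> " + counterOf(children.get(0)));
    }
}
```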
ZooKeeper runs as an ensemble (cluster) of servers, typically 3, 5, or 7 nodes. The architecture is designed for high read throughput and strong ordering guarantees on writes, making it ideal for coordination workloads where reads vastly outnumber writes.
The Leader-Follower Model
ZooKeeper uses a leader-follower architecture where:

- A single leader processes all write requests and broadcasts state changes to the rest of the ensemble.
- Followers serve reads locally, forward write requests to the leader, and vote in leader elections.
- Observers (optional) receive updates and serve reads but do not vote, so they scale read capacity without enlarging the write quorum.
Write Path:

1. A client sends a write to any server; non-leaders forward it to the leader.
2. The leader assigns the write a zxid and proposes it to the followers.
3. Once a quorum (majority) of servers acknowledges, the leader commits the change and broadcasts the commit.
4. The client receives its acknowledgment.
Read Path:

1. A client reads directly from the server it is connected to—no quorum round-trip.
2. The local replica answers from its in-memory copy, which may lag slightly behind the leader.
In coordination workloads, reads typically outnumber writes by 10:1 to 100:1. Configuration is read on every request; it's updated occasionally. By serving reads locally, ZooKeeper can handle millions of reads per second with a small cluster. Adding observers scales reads linearly without impacting write latency.
| Ensemble Size | Failure Tolerance | Write Quorum | Use Case |
|---|---|---|---|
| 3 nodes | 1 failure | 2 nodes | Development, small deployments |
| 5 nodes | 2 failures | 3 nodes | Production standard |
| 7 nodes | 3 failures | 4 nodes | High-availability critical systems |
| 5 + observers | 2 failures | 3 nodes | Global deployments needing read scaling |
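The quorum and tolerance columns above follow from simple majority arithmetic, which is also why even-sized ensembles add cost without adding tolerance. A sketch (hypothetical helper methods):

```java
public class QuorumMath {
    // A write quorum is a strict majority of the ensemble
    static int writeQuorum(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Failures tolerated while a majority can still be formed
    static int failureTolerance(int ensembleSize) {
        return (ensembleSize - 1) / 2;
    }

    public static void main(String[] args) {
        for (int n : new int[]{3, 4, 5, 7}) {
            System.out.printf("%d nodes: quorum=%d, tolerates %d failure(s)%n",
                n, writeQuorum(n), failureTolerance(n));
        }
    }
}
```

Note that a 4-node ensemble tolerates only 1 failure—the same as 3 nodes—which is why ensembles are conventionally odd-sized.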
The ZAB Protocol
ZooKeeper uses the ZooKeeper Atomic Broadcast (ZAB) protocol for consensus, which is similar to Raft but predates it. ZAB provides:

- Reliable delivery: if a transaction commits on one server, it eventually commits on all servers.
- Total order: every server delivers transactions in the same order.
- Primary order: transactions from a given leader are delivered in the order that leader issued them, and all transactions from earlier epochs are delivered before any from a later epoch.
The zxid is a 64-bit number composed of:

- High 32 bits: the epoch, incremented each time a new leader is elected.
- Low 32 bits: a counter, incremented for each transaction within that epoch.
This structure ensures that a new leader's first zxid is always higher than any zxid from previous leaders, enabling crash recovery without conflicts.
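The epoch/counter layout described above can be sketched with plain bit arithmetic (hypothetical helper methods, mirroring the layout rather than any ZooKeeper API):

```java
public class Zxid {
    // Pack epoch into the high 32 bits, counter into the low 32 bits
    static long make(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    static long epochOf(long zxid)   { return zxid >>> 32; }
    static long counterOf(long zxid) { return zxid & 0xFFFFFFFFL; }

    public static void main(String[] args) {
        long lastOfOldLeader = make(4, 0xFFFFFFFFL); // highest possible zxid in epoch 4
        long firstOfNewLeader = make(5, 0);          // new leader starts epoch 5
        // Comparing zxids as plain longs orders transactions across leader changes
        System.out.println(firstOfNewLeader > lastOfOldLeader);
    }
}
```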
ZooKeeper's session abstraction is fundamental to its coordination capabilities. A session represents a long-lived connection between a client and the ZooKeeper ensemble. Sessions enable ephemeral nodes, provide crash detection, and maintain client identity across server failovers.
Session Lifecycle
The critical insight is that disconnection and expiration are different states:

- Disconnected: the client has lost contact with its server, but the session (and its ephemeral nodes) survives until the timeout elapses; the client library automatically tries other servers in the ensemble.
- Expired: the ensemble has declared the session dead; all ephemeral nodes are deleted and all watches removed. Only the ensemble can expire a session—a disconnected client cannot know its session has expired until it reconnects.
```java
// Creating a ZooKeeper session with proper event handling
ZooKeeper zk = new ZooKeeper(
    "zk1:2181,zk2:2181,zk3:2181",  // Connection string
    30000,                          // Session timeout (ms)
    new Watcher() {
        @Override
        public void process(WatchedEvent event) {
            switch (event.getState()) {
                case SyncConnected:
                    // Session established or reconnected
                    log.info("Connected to ZooKeeper");
                    break;
                case Disconnected:
                    // Network issues - ephemeral nodes STILL exist
                    // Session might recover - don't panic
                    log.warn("Disconnected from ZooKeeper");
                    break;
                case Expired:
                    // Session expired - ephemeral nodes DELETED
                    // Must create new session and re-establish state
                    log.error("Session expired - must recreate");
                    recreateSession();
                    break;
                case ConnectedReadOnly:
                    // Connected to a server that's partitioned from quorum
                    // Can read but not write
                    log.warn("Read-only connection");
                    break;
            }
        }
    });

// Session ID is unique and stable across reconnections
long sessionId = zk.getSessionId();
byte[] sessionPasswd = zk.getSessionPasswd();

// Can reconnect with same session (useful for client restart)
ZooKeeper zkReconnected = new ZooKeeper(
    "zk1:2181,zk2:2181,zk3:2181",
    30000,
    watcher,
    sessionId,      // Restore session
    sessionPasswd   // Session authentication
);
```

The client requests a session timeout, but the server clamps the negotiated value between minSessionTimeout (default: 2× tickTime) and maxSessionTimeout (default: 20× tickTime).

Session expiration is the most common source of ZooKeeper-related production incidents. When a session expires: (1) all ephemeral nodes are deleted, (2) all watches are removed, (3) the client loses any exclusive locks it held. Applications must handle this by re-establishing all coordination state—there's no automatic recovery.
Watches are ZooKeeper's mechanism for clients to receive notifications when data changes. Rather than polling, clients register interest in specific znodes and receive callbacks when those znodes are created, deleted, or modified.
This push-based model is essential for responsive coordination. When configuration changes, all interested services can react within milliseconds—no polling interval, no wasted bandwidth, no missed updates.
Watch Semantics
ZooKeeper watches have specific behavioral guarantees that are critical to understand:
One-time trigger: A watch fires once and is then removed. To continue watching, you must re-register.
Ordered delivery: Watch notifications are delivered before any other changes to that znode are visible to the client.
Guaranteed delivery: If a client has a watch set and the znode changes, the watch will fire before the client sees the new data.
Types of watches:

- Data watches: set by getData() or exists(); triggered by a change to the znode's data (setData) or deletion (delete)
- Child watches: set by getChildren(); triggered when a child of the znode is created or deleted
- Existence watches: set by exists() on a znode that may not exist yet; triggered when the znode is created or deleted
```java
// Pattern: Continuous watching with automatic re-registration
public class ConfigWatcher implements Watcher {
    private ZooKeeper zk;
    private String configPath = "/app/config";
    private volatile ConfigData currentConfig;

    public void watchConfig() throws KeeperException, InterruptedException {
        // getData registers a data watch
        byte[] data = zk.getData(
            configPath,
            this,   // Register this watcher
            null    // Stat object (optional)
        );

        // Process the current data
        currentConfig = parseConfig(data);
        log.info("Config updated: {}", currentConfig);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                // Watch fired - re-register and get new data
                watchConfig();  // This re-registers the watch
            } catch (Exception e) {
                log.error("Failed to handle config change", e);
            }
        }
    }
}

// Pattern: Watching for children (membership tracking)
public class MembershipWatcher implements Watcher {
    private ZooKeeper zk;
    private String membershipPath = "/services/myapp/instances";

    public Set<String> watchMembers() throws Exception {
        // getChildren registers a child watch
        List<String> children = zk.getChildren(
            membershipPath,
            this    // Watch for child changes
        );

        Set<String> members = new HashSet<>();
        for (String child : children) {
            // Get each member's data (usually host:port)
            byte[] data = zk.getData(
                membershipPath + "/" + child,
                false,  // No data watch on individual members
                null
            );
            members.add(new String(data, StandardCharsets.UTF_8));
        }
        return members;
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
            // Membership changed - re-fetch and re-watch
            try {
                Set<String> newMembers = watchMembers();
                notifyMembershipChange(newMembers);
            } catch (Exception e) {
                log.error("Failed to handle membership change", e);
            }
        }
    }
}
```

Between when a watch fires and when you re-register, changes can occur that you won't see as events. Always read the current state when re-registering—don't rely solely on the event content.
The pattern 'handle event → read current state → re-register watch' ensures you never miss changes.
| Operation | Data Watch | Child Watch | Existence Watch |
|---|---|---|---|
| create (parent) | — | Yes | — |
| create (node) | — | — | Yes |
| delete (node) | Yes | — | Yes |
| delete (child) | — | Yes | — |
| setData | Yes | — | — |
ZooKeeper provides a specific set of consistency guarantees that are crucial to understand. These guarantees are strong enough for coordination yet carefully bounded to enable high performance:

- Sequential consistency: updates from a single client are applied in the order they were sent.
- Atomicity: an update either succeeds completely or fails—there are no partial results.
- Single system image: a client sees the same view of the service regardless of which server it connects to, and never sees state older than what it has already observed.
- Reliability: once an update is applied, it persists until another update overwrites it.
- Timeliness: a client's view of the system is guaranteed to be up to date within a bounded delay.
ZooKeeper reads are NOT linearizable by default—they may return stale data if the connected server is behind the leader. This is intentional for read performance. For linearizable reads, use the sync() operation before reading, which forces the server to catch up with the leader first.
The sync() Operation
When you absolutely need the latest data, call sync() before your read:
```java
// Ensure we see all updates up to this moment
zk.sync("/", null, null);  // Syncs entire tree
byte[] latestConfig = zk.getData("/config", false, null);
```
sync() is asynchronous but guarantees that subsequent operations see all updates that were committed before the sync was issued. This is expensive (requires a round-trip to the leader), so use sparingly.
ZXID for Ordering
Every ZooKeeper operation returns a zxid (ZooKeeper Transaction ID). This is a globally unique, monotonically increasing identifier that establishes a total order on all changes:
```java
// The Stat object contains crucial metadata
Stat stat = new Stat();
byte[] data = zk.getData("/config", false, stat);

// Important Stat fields:
long czxid = stat.getCzxid();       // zxid of creation
long mzxid = stat.getMzxid();       // zxid of last modification
long pzxid = stat.getPzxid();       // zxid of last child change
int version = stat.getVersion();    // Data version (for optimistic locking)
int cversion = stat.getCversion();  // Children version
int aversion = stat.getAversion();  // ACL version
long ctime = stat.getCtime();       // Creation time (ms since epoch)
long mtime = stat.getMtime();       // Modification time
int dataLength = stat.getDataLength();
int numChildren = stat.getNumChildren();

// Optimistic locking with version checking
try {
    zk.setData(
        "/config",
        newData,
        stat.getVersion()  // Only succeeds if version matches
    );
} catch (KeeperException.BadVersionException e) {
    // Someone else modified it - read again and retry
    handleConflict();
}
```

ZooKeeper's primitives can be combined to implement sophisticated coordination patterns. The Apache Curator library provides production-ready implementations, but understanding the underlying recipes is essential for debugging and customization.
Fair Distributed Lock Recipe
This pattern uses ephemeral-sequential znodes to create a fair lock (FIFO ordering):
1. Create an ephemeral-sequential znode under the lock path: /locks/myresource/lock-
2. Get all children of /locks/myresource (without setting a watch).
3. If your znode has the lowest sequence number, you hold the lock.
4. Otherwise, set a watch only on the child with the next-lowest sequence number (your predecessor).
5. When the predecessor's znode is deleted, go back to step 2.

This is "herd-safe"—only one client wakes up per lock release, avoiding thundering herd.
```java
public class DistributedLock {
    private final ZooKeeper zk;
    private final String lockPath;
    private String lockNode;

    public DistributedLock(ZooKeeper zk, String lockPath) {
        this.zk = zk;
        this.lockPath = lockPath;
    }

    public void lock() throws Exception {
        // Step 1: Create ephemeral-sequential node
        lockNode = zk.create(
            lockPath + "/lock-",
            new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.EPHEMERAL_SEQUENTIAL
        );
        tryAcquire();
    }

    private void tryAcquire() throws Exception {
        while (true) {
            // Step 2: Get all children
            List<String> children = zk.getChildren(lockPath, false);
            Collections.sort(children);

            String myNode = lockNode.substring(lockPath.length() + 1);
            int myIndex = children.indexOf(myNode);

            // Step 3: Am I the lowest?
            if (myIndex == 0) {
                return;  // Lock acquired!
            }

            // Step 4: Watch predecessor only
            String predecessor = children.get(myIndex - 1);
            final Object lock = new Object();
            Stat stat = zk.exists(lockPath + "/" + predecessor, event -> {
                synchronized (lock) {
                    lock.notifyAll();
                }
            });

            if (stat != null) {
                synchronized (lock) {
                    lock.wait();  // Wait for predecessor to die
                }
            }
            // Loop again to check if we got the lock
        }
    }

    public void unlock() throws Exception {
        zk.delete(lockNode, -1);
    }
}
```

While understanding these recipes is essential, don't implement them yourself for production. Apache Curator provides production-tested implementations with proper edge case handling, retry logic, and connection management. Curator's recipes have been battle-tested across thousands of deployments.
We've covered ZooKeeper's architecture, data model, and core mechanisms. Let's consolidate the essential knowledge:
- Hierarchical data model: znodes act as both files and directories, addressed by absolute paths like /services/myapp/config.
- Ephemeral and sequential znodes turn liveness detection and ordering into first-class primitives.
- Sessions and watches enable crash detection and push-based reactivity; handle session expiration explicitly.
- ZAB and zxids impose a total order on all state changes; reads are fast but may be stale unless you sync() first.

When to Use ZooKeeper
✅ Good fit: Leader election, distributed locks, configuration management, service discovery, cluster membership, barrier synchronization
❌ Poor fit: Storing application data, high-volume writes, large values, message queuing (use Kafka instead)
What's Next:
In the next page, we'll explore etcd, a newer coordination service that provides a simpler key-value model with Raft consensus. You'll learn how etcd differs from ZooKeeper and when each is the better choice.
You now understand ZooKeeper's hierarchical data model, its consistency guarantees, and how its primitives compose into coordination patterns. This foundation will help you understand its alternatives and make informed choices between coordination services.