Apache Zookeeper has been the workhorse of distributed coordination since its creation at Yahoo! in 2006. Born from the need to coordinate the massive Hadoop ecosystem, Zookeeper now powers critical systems at LinkedIn, Twitter, Netflix, Uber, and countless other organizations. When engineers need reliable distributed locks, Zookeeper is often the first tool they reach for.
But Zookeeper is not just a lock service—it's a hierarchical key-value store with strong consistency guarantees that can be used to build coordination primitives including locks, barriers, queues, and leader election. Understanding how Zookeeper works internally is essential for using it correctly and debugging the inevitable edge cases.
This page takes you deep into Zookeeper's architecture and its lock implementation, from the theoretical foundations to production deployment considerations.
By the end of this page, you will understand:

- Zookeeper's architecture and consistency model
- How ephemeral nodes and sessions enable reliable locks
- The standard lock recipe and its fairness guarantees
- Failure modes and how to handle them
- When to choose Zookeeper over alternatives
Zookeeper is a replicated state machine that maintains a hierarchical namespace (similar to a filesystem) with strong consistency guarantees. To understand how locks work, we must first understand the architecture.
Ensemble Architecture:
A Zookeeper deployment consists of a cluster of servers called an ensemble. The recommended minimum is 3 servers; 5 is common for production deployments needing higher availability.
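The sizing guidance follows directly from quorum arithmetic: a quorum is a strict majority, so an ensemble of 2f+1 servers tolerates f failures. A small sketch of the math (the class and method names here are illustrative, not a Zookeeper API):

```java
public class QuorumMath {

    // Quorum size: a strict majority of the ensemble
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Failures tolerated while a quorum can still form
    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 4, 5}) {
            System.out.println(n + " servers: quorum=" + quorumSize(n)
                    + ", tolerates " + toleratedFailures(n) + " failure(s)");
        }
    }
}
```

Note that 4 servers tolerate only 1 failure, the same as 3, which is why even-sized ensembles are discouraged: you pay for an extra server without gaining fault tolerance.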
```text
Zookeeper Ensemble (5 servers):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 ┌─────────────────────────────────────────────────────┐
 │                       Clients                       │
 │    App1   App2   App3   App4   App5   App6   App7   │
 └──────┬──────┬──────┬──────┬──────┬──────┬──────┬────┘
        │      │      │      │      │      │      │
 ───────┴──────┴──────┴──────┴──────┴──────┴──────┴────
                      Connections
 ──────────────────────────────────────────────────────
        │                 │                 │
   ┌────┴─────┐      ┌────┴─────┐      ┌────┴─────┐
   │  ZK #1   │◄────►│  ZK #2   │◄────►│  ZK #3   │
   │(Follower)│      │ (LEADER) │      │(Follower)│
   └────┬─────┘      └──────────┘      └────┬─────┘
        │                                   │
   ┌────┴─────┐                        ┌────┴─────┐
   │  ZK #4   │◄──────────────────────►│  ZK #5   │
   │(Follower)│                        │(Follower)│
   └──────────┘                        └──────────┘

Key Components:
- Leader: Handles all write requests, coordinates replication
- Followers: Serve read requests, forward writes to leader
- Clients: Connect to any server, session maintained with ensemble
```

Consensus via ZAB Protocol:
Zookeeper uses the Zookeeper Atomic Broadcast (ZAB) protocol for consensus—a leader-based protocol similar to Raft. All writes go through the leader and are replicated to a majority of followers before being acknowledged.
Key Properties:

- Total order: every write is stamped with a monotonically increasing transaction id (zxid), and all servers apply writes in the same order.
- Quorum durability: a write is acknowledged only after a majority of the ensemble has persisted it, so it survives minority failures.
- FIFO client order: requests from a single client session are executed in the order they were sent.
By default, reads can be served by any server and may be slightly stale (relative to the most recent write). For lock implementations, use sync() before reads or only rely on watches and write operations for coordination. The lock recipe we'll describe uses only writes and watches, avoiding stale reads entirely.
Zookeeper's data model is a hierarchical namespace of nodes called znodes, organized like a filesystem tree. Each znode can store data (up to 1MB by default) and have children.
Znode Types:
Zookeeper supports several znode types, each with different behaviors that are crucial for implementing coordination primitives:
| Type | Lifecycle | Use Case |
|---|---|---|
| Persistent | Exists until explicitly deleted | Configuration, metadata, lock container nodes |
| Ephemeral | Exists until session that created it ends | Lock holder markers, presence detection |
| Persistent Sequential | Exists until deleted; gets auto-incremented suffix | Ordered children that persist |
| Ephemeral Sequential | Session-bound; gets auto-incremented suffix | Fair lock queues, leader election |
```text
Zookeeper Namespace:
━━━━━━━━━━━━━━━━━━━━

/ (root)
├── locks/                      (persistent - container for locks)
│   ├── inventory/              (persistent - lock for inventory resource)
│   │   ├── lock-0000000001     (ephemeral-sequential, session A)
│   │   ├── lock-0000000002     (ephemeral-sequential, session B)
│   │   └── lock-0000000003     (ephemeral-sequential, session C)
│   └── order-processing/       (persistent - another lock)
│       └── lock-0000000001     (ephemeral-sequential, session D)
│
├── config/                     (persistent - configuration data)
│   ├── database                (persistent, data: connection string)
│   └── feature-flags           (persistent, data: JSON config)
│
└── services/                   (persistent - service discovery)
    ├── api-gateway/
    │   ├── node-0000000001     (ephemeral-sequential - live instance)
    │   └── node-0000000002     (ephemeral-sequential - live instance)
    └── worker-service/
        └── node-0000000001     (ephemeral-sequential - live instance)

Key Insight: The lock queue is the children of /locks/inventory/
- Holder: Lowest sequence number (lock-0000000001)
- Waiters: Higher sequence numbers, watching predecessor
```

Ephemeral Nodes: The Key to Fault-Tolerant Locks
Ephemeral nodes are automatically deleted when the session that created them ends. This provides automatic cleanup for crashed lock holders:
```text
Session and Ephemeral Node Lifecycle:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Time   Client A               Zookeeper                 Lock State
────   ────────               ─────────                 ──────────
t0     Connect                Session A created         No nodes
t1     Create ephemeral       Node created              /lock/lock-001 (A)
       /lock/lock-001
t2     (doing work)           Session A alive           /lock/lock-001 (A)
t3     CRASH! Connection      Heartbeats stop           /lock/lock-001 (A)
       lost
t4     (dead)                 Session timeout begins    /lock/lock-001 (A)
t5     (dead)                 Session A EXPIRED         (node deleted)
t6     (dead)                 Delete ephemeral nodes    Lock FREE
t7     Client B connects      Session B created         Lock FREE
t8     Client B acquires      Node created              /lock/lock-002 (B)

Key: Client A didn't explicitly release the lock, but:
- Session expiry triggered automatic cleanup
- Lock became available without manual intervention
- No deadlock due to crashed holder
```

Session timeout (default: 30 seconds) is negotiated between client and server. Shorter timeouts mean faster failover but a higher risk of false expiration during GC pauses or network hiccups. Most production deployments use 10-30 seconds. The client library automatically sends heartbeats to keep the session alive.
Zookeeper's watch mechanism allows clients to receive notifications when znodes change, enabling efficient event-driven coordination without polling.
How Watches Work:
```text
Watch Registration and Triggering:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Client A                     Zookeeper                  Client B
────────                     ─────────                  ────────
                             /lock/lock-001 exists

exists(/lock/lock-001,       Store watch:
       watch=true)           "notify A on lock-001"

Result: exists=true          Watch registered
                             (one-time trigger)

(waiting for                 ...                        delete(/lock/lock-001)
 notification)
                             Node deleted
                             Trigger watch!
                             Send: NODE_DELETED
                             to Client A

Receive: NODE_DELETED        Watch cleared              (delete succeeded)
from Zookeeper               (one-time only)

Now re-register if           ...                        ...
still interested

Watch Types:
- getData watch: Triggers on data change or deletion
- exists watch: Triggers on creation, deletion, or data change
- getChildren watch: Triggers on child add/remove
```

Critical Watch Properties:
If all waiters watch the same node (the current lock holder), all are notified when it's deleted. All then attempt to acquire the lock simultaneously, but only one succeeds. The others must re-watch—causing unnecessary traffic. This is the 'herd effect' or 'thundering herd.' The fair lock recipe avoids this by having each waiter watch only its predecessor.
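The cost difference is easy to quantify with a toy count (plain Java, no Zookeeper involved; a simplified model that ignores re-registration messages): with N waiters, holder-watching notifies every remaining waiter on each release, while predecessor-watching wakes exactly one.

```java
public class HerdEffectCount {

    // All waiters watch the holder: each release wakes every remaining waiter
    static int herdNotifications(int waiters) {
        int total = 0;
        for (int remaining = waiters; remaining > 0; remaining--) {
            total += remaining; // every current waiter is notified
        }
        return total; // N + (N-1) + ... + 1 = N(N+1)/2
    }

    // Each waiter watches only its predecessor: one notification per release
    static int chainNotifications(int waiters) {
        return waiters; // each waiter is woken exactly once
    }

    public static void main(String[] args) {
        int n = 100;
        System.out.println("herd:  " + herdNotifications(n));  // 5050
        System.out.println("chain: " + chainNotifications(n)); // 100
    }
}
```

With 100 waiters, the herd approach generates 5,050 notifications versus 100 for the predecessor chain: quadratic versus linear in queue length.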
Zookeeper provides the primitives for locks but not a built-in lock API. The standard lock recipe, described in the Zookeeper documentation and implemented by the Apache Curator project, avoids the herd effect and provides fair, ordered access.
The Lock Algorithm:
```text
ACQUIRE LOCK on resource R:
━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. Create an ephemeral-sequential node under /locks/R:
   CREATE /locks/R/lock- (ephemeral-sequential)
   Result: /locks/R/lock-0000000007 (server assigns sequence number)

2. Get all children of /locks/R:
   GETCHILDREN /locks/R
   Result: [lock-001, lock-003, lock-007, lock-012]

3. If our node has the LOWEST sequence number:
   → We hold the lock! Enter critical section.

4. Otherwise, find the child with the next-lower sequence number:
   Our node: lock-007
   Children lower than us: [lock-001, lock-003]
   Predecessor: lock-003 (highest of those lower than us)

5. Watch the predecessor node:
   EXISTS /locks/R/lock-003 (watch=true)

6. Wait for watch notification:
   - If predecessor deleted → go to step 2
   - If session expires → fail acquisition

RELEASE LOCK:
━━━━━━━━━━━━━

1. Delete our node:
   DELETE /locks/R/lock-0000000007
2. This triggers the watch on our successor (if any)

FULL EXAMPLE:
━━━━━━━━━━━━━

Initial: /locks/resource/ contains [lock-001 (holder A)]

Client B:
1. Creates lock-002
2. Gets children: [lock-001, lock-002]
3. lock-001 < lock-002, so B doesn't hold lock
4. B watches lock-001

Client C:
1. Creates lock-003
2. Gets children: [lock-001, lock-002, lock-003]
3. lock-001 < lock-003, so C doesn't hold lock
4. C's predecessor is lock-002 (highest below 003)
5. C watches lock-002 (NOT lock-001!)

Client A releases:
1. A deletes lock-001
2. B's watch on lock-001 fires
3. B gets children: [lock-002, lock-003]
4. B has lowest number → B now holds lock!
5. C is NOT notified (watches lock-002, not lock-001)

Client B releases:
1. B deletes lock-002
2. C's watch on lock-002 fires
3. C gets children: [lock-003]
4. C has lowest number → C now holds lock!
```

Why This Recipe Is Superior:
| Property | How Recipe Achieves It |
|---|---|
| Mutual Exclusion | Only lowest sequence number is holder |
| Deadlock Freedom | Ephemeral nodes auto-deleted on session expiry |
| Starvation Freedom | Sequential ordering guarantees FIFO |
| Bounded Wait | At most N-1 others served before you (N = queue length when joined) |
| No Herd Effect | Each waiter watches only predecessor, not holder |
```java
// Using Apache Curator (high-level Zookeeper client)
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;

import java.util.concurrent.TimeUnit;

public class ZookeeperLockExample {

    private final CuratorFramework client;
    private final InterProcessMutex lock;

    public ZookeeperLockExample(CuratorFramework client, String lockPath) {
        this.client = client;
        this.lock = new InterProcessMutex(client, lockPath);
    }

    public void doWorkWithLock() throws Exception {
        // Attempt to acquire with timeout
        if (lock.acquire(10, TimeUnit.SECONDS)) {
            try {
                // CRITICAL SECTION
                // Only one process across the cluster executes this
                System.out.println("Lock acquired, doing exclusive work...");
                processExclusiveResource();
            } finally {
                // ALWAYS release in a finally block
                lock.release();
                System.out.println("Lock released");
            }
        } else {
            // Timeout expired, lock not acquired
            System.out.println("Could not acquire lock within timeout");
            handleLockTimeout();
        }
    }

    // Reentrant lock demonstration
    public void reentrantExample() throws Exception {
        lock.acquire();           // Count = 1
        try {
            doNestedWork();
        } finally {
            lock.release();       // Count = 0, lock released
        }
    }

    private void doNestedWork() throws Exception {
        // Same lock, same thread - reentrant!
        lock.acquire();           // Count = 2 (not blocking)
        try {
            System.out.println("Nested work with same lock");
        } finally {
            lock.release();       // Count = 1, still held
        }
    }

    private void processExclusiveResource() {
        // Application-specific exclusive work goes here
    }

    private void handleLockTimeout() {
        // Application-specific fallback when the lock can't be acquired
    }
}
```

The raw Zookeeper API is low-level and error-prone. Curator handles connection management, retry logic, and session recovery, and provides well-tested recipes. Unless you have specific requirements that Curator can't meet, always use Curator for production Zookeeper applications.
While the lock recipe is sound, real-world deployments face numerous failure scenarios. Understanding these is critical for building robust applications.
Scenario 1: Session Expiration During Critical Section
```text
Timeline:
t0:  Client A acquires lock (creates lock-001)
t1:  Client A begins critical section
t2:  Network partition isolates A from Zookeeper
t3:  A's heartbeats stop reaching Zookeeper
t4:  Session timeout begins (e.g., 30 seconds)
t5:  A continues working (unaware of impending expiry)
t6:  Session A EXPIRES, lock-001 deleted
t7:  Client B acquires lock (creates lock-002)
t8:  B begins critical section
t9:  Network heals, A's writes reach database
t10: BOTH A AND B ARE MODIFYING SHARED STATE!

The Problem: A's session expired, but A didn't know and kept working.
The lock service believes B is the holder. A still thinks it holds the lock.
```

Mitigation Strategies:
```java
// Monitor Zookeeper connection state
// (pauseCriticalWork / abortCriticalWorkAndCleanup are application hooks)
client.getConnectionStateListenable().addListener((cf, newState) -> {
    switch (newState) {
        case CONNECTED:
            log.info("Connected to Zookeeper");
            break;
        case RECONNECTED:
            log.info("Reconnected - session still valid");
            // Session survived, locks still held
            break;
        case SUSPENDED:
            log.warn("Connection suspended - may lose session");
            // Stop writing to shared resources!
            pauseCriticalWork();
            break;
        case LOST:
            log.error("SESSION LOST - all locks are gone!");
            // Session expired, ephemeral nodes deleted
            // IMMEDIATELY stop all critical work
            abortCriticalWorkAndCleanup();
            break;
        case READ_ONLY:
            log.warn("Read-only mode - no writes possible");
            break;
    }
});
```

Scenario 2: Create Succeeds But ACK Lost (Uncertain State)
```text
Timeline:
t0: Client A calls create("/locks/resource/lock-", ephemeral-sequential)
t1: Request reaches Zookeeper
t2: Zookeeper creates lock-0000000042
t3: Zookeeper writes to quorum
t4: Zookeeper sends ACK to Client A
t5: Network drops ACK packet
t6: Client A times out waiting for response
t7: Client A doesn't know if create succeeded!

Options for Client A:
a) Assume success and try to release → might release non-existent lock
b) Assume failure and retry → creates DUPLICATE lock-0000000043!
c) Query for existing nodes with our client ID → correct approach
```

Handling Uncertain State:
The solution is to include client identification in the lock node data or name pattern. When uncertain, query for existing locks and check if any belong to us:
```java
// Curator handles this internally, but here's the pattern:

private String acquireLockWithRecovery() throws Exception {
    byte[] ourId = getClientIdentifier();  // Unique per-process
    String lockPath = "/locks/resource/lock-";

    // Check if we already have a lock from a previous attempt
    List<String> children = zk.getChildren("/locks/resource");
    for (String child : children) {
        byte[] data = zk.getData("/locks/resource/" + child);
        if (Arrays.equals(data, ourId)) {
            // We already have a lock from a previous attempt!
            log.info("Found existing lock from previous attempt: " + child);
            return child;
        }
    }

    // No existing lock, create a new one
    String created = zk.create(
        lockPath,
        ourId,                          // Store our ID as node data
        Ids.OPEN_ACL_UNSAFE,
        CreateMode.EPHEMERAL_SEQUENTIAL
    );
    return created;
}
```

Curator's InterProcessMutex handles connection loss, uncertain state, and recovery automatically. It includes client identification and checks for existing locks. Implementing this yourself is error-prone—use Curator unless you have specific requirements it can't meet.
Zookeeper is designed for coordination, not high-throughput data storage. Understanding its performance characteristics is essential for appropriate use.
Latency Characteristics:
| Operation | Latency | Notes |
|---|---|---|
| Read (local server) | ~1-2ms | Served from in-memory cache |
| Read (with sync) | ~5-10ms | Ensures freshness |
| Write (create/update) | ~10-30ms | Requires consensus across majority |
| Lock Acquisition (uncontended) | ~15-40ms | Create + getChildren + possibly watch |
| Lock Acquisition (contended) | ~15ms + wait time | Plus time waiting for predecessor |
| Lock Release | ~10-30ms | Delete operation |
Throughput Considerations:
```text
Scenario: High-throughput order processing

Requirements:
- 1,000 orders/second peak
- Each order requires an exclusive lock on inventory

Lock Operations per Order:
- Create lock node: 1 write
- Delete lock node: 1 write
Total: 2 writes per order

Write Load on Zookeeper:
- 1,000 orders × 2 writes = 2,000 writes/second

Is This Sustainable?
- Typical ZK cluster: 10,000-50,000 writes/second capacity
- Our load: 2,000 writes/second
- Answer: Yes, with headroom

But Consider:
- What about 10,000 orders/second?
- 20,000 writes/second → approaching capacity
- Consider an alternative: partition by product category
  - 10 lock paths instead of 1
  - Reduced contention AND distributed load
```

Coarse-grained locks (one lock for all inventory) are simple but create bottlenecks. Fine-grained locks (one lock per product) reduce contention but increase Zookeeper load and complexity. Find the right granularity for your workload—often a middle ground like category-level locking.
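The back-of-envelope estimate and the partitioning idea above can be sketched as follows (the numbers, path layout, and method names are illustrative assumptions, not a real API):

```java
public class LockLoadEstimate {

    // Each acquire/release pair costs two Zookeeper writes (create + delete)
    static int writesPerSecond(int opsPerSecond, int writesPerOp) {
        return opsPerSecond * writesPerOp;
    }

    // Partition locks by category to spread load and contention across paths
    static String lockPathFor(String category, int partitions) {
        int bucket = Math.floorMod(category.hashCode(), partitions);
        return "/locks/inventory-" + bucket;
    }

    public static void main(String[] args) {
        System.out.println(writesPerSecond(1_000, 2));  // 2000 - comfortable
        System.out.println(writesPerSecond(10_000, 2)); // 20000 - near capacity
        // Same category always maps to the same lock path, preserving exclusion
        System.out.println(lockPathFor("electronics", 10));
    }
}
```

The key property of hash-partitioned lock paths is that a given category always maps to the same path, so mutual exclusion within a category is preserved while total write load is spread across 10 independent queues.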
Beyond simple mutual exclusion, Zookeeper supports more sophisticated coordination patterns.
Read-Write Lock Recipe:
For read-heavy workloads, allowing multiple concurrent readers while ensuring exclusive write access can dramatically improve throughput.
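In the standard recipe, readers and writers both create ephemeral-sequential nodes (with read- and write- prefixes) under the lock path, and a reader may proceed only if no write- node sorts before its own. A minimal standalone sketch of that admission check (illustrative names, plain Java, no Zookeeper client):

```java
import java.util.List;

public class ReadWriteLockCheck {

    // "read-0000000004" -> 4, "write-0000000003" -> 3
    static long sequenceOf(String node) {
        return Long.parseLong(node.substring(node.lastIndexOf('-') + 1));
    }

    // A reader may enter iff no write- node has a lower sequence than ours
    static boolean readerMayEnter(List<String> children, String ourReadNode) {
        long ourSeq = sequenceOf(ourReadNode);
        return children.stream()
                .filter(c -> c.startsWith("write-"))
                .noneMatch(c -> sequenceOf(c) < ourSeq);
    }

    public static void main(String[] args) {
        List<String> children = List.of(
                "read-0000000001", "read-0000000002",
                "write-0000000003", "read-0000000004");
        // Readers 001 and 002 precede the writer: they may enter together
        System.out.println(readerMayEnter(children, "read-0000000002")); // true
        // Reader 004 arrived after writer 003: it must wait
        System.out.println(readerMayEnter(children, "read-0000000004")); // false
    }
}
```

This is exactly what makes the recipe fair: a reader that arrives after a pending writer queues behind it instead of extending the read phase indefinitely.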
```text
READ-WRITE LOCK RECIPE:
━━━━━━━━━━━━━━━━━━━━━━━

Namespace:
/locks/resource/
 ├── read-0000000001    (ephemeral-sequential, reader)
 ├── read-0000000002    (ephemeral-sequential, reader)
 ├── write-0000000003   (ephemeral-sequential, writer)
 ├── read-0000000004    (ephemeral-sequential, reader waiting for writer)
 └── write-0000000005   (ephemeral-sequential, writer waiting)

ACQUIRE READ LOCK:
1. Create ephemeral-sequential /locks/resource/read-
2. Get children sorted by sequence number
3. Check for any WRITE node with a lower sequence number
   - If no writes before us: read lock acquired
   - If a write before us: watch the closest write predecessor, wait

ACQUIRE WRITE LOCK:
1. Create ephemeral-sequential /locks/resource/write-
2. Get children sorted by sequence number
3. Check if our node has the lowest sequence number
   - If lowest: write lock acquired
   - Otherwise: watch the immediate predecessor (read or write), wait

SEMANTICS:
- Multiple concurrent readers allowed (if no writer waiting ahead)
- Writers get exclusive access
- Writers don't starve readers (readers that arrive after a writer wait)
- Readers don't starve writers (writers eventually get their turn)

Example State:
  Children: [read-001, read-002, write-003, read-004]
  read-001 and read-002: Both hold the read lock (no write before them)
  write-003: Waiting for read-001 and read-002 to release
  read-004: Waiting for write-003 to release
```

Other Curator Recipes:
```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.locks.InterProcessReadWriteLock;

public class ZookeeperReadWriteLockExample {

    private final InterProcessReadWriteLock rwLock;

    public ZookeeperReadWriteLockExample(CuratorFramework client, String path) {
        this.rwLock = new InterProcessReadWriteLock(client, path);
    }

    public Data readData() throws Exception {
        // Multiple readers can hold this simultaneously
        rwLock.readLock().acquire();
        try {
            return fetchDataFromDatabase();
        } finally {
            rwLock.readLock().release();
        }
    }

    public void writeData(Data newData) throws Exception {
        // Only one writer, blocks all readers
        rwLock.writeLock().acquire();
        try {
            updateDatabase(newData);
            invalidateCache();
        } finally {
            rwLock.writeLock().release();
        }
    }
}
```

We've explored Zookeeper's architecture, primitives, and lock implementations in depth. Let's consolidate when Zookeeper is the right choice.
What's Next:
The next page explores etcd locks—a modern alternative to Zookeeper with a simpler data model, built on Raft consensus, and increasingly popular in Kubernetes-native environments.
You now understand how Zookeeper provides distributed locks: the architecture enabling strong consistency, the primitives (ephemeral nodes, sequences, watches) that make locks possible, the lock recipe that provides fairness and efficiency, and the failure scenarios you must handle. Next, we'll see how etcd provides similar guarantees with a different approach.