Apache Zookeeper has been the workhorse of distributed coordination since its creation at Yahoo! in 2006. Born from the need to coordinate the massive Hadoop ecosystem, Zookeeper now powers critical systems at LinkedIn, Twitter, Netflix, Uber, and countless other organizations. When engineers need reliable distributed locks, Zookeeper is often the first tool they reach for.
But Zookeeper is not just a lock service—it's a hierarchical key-value store with strong consistency guarantees that can be used to build coordination primitives including locks, barriers, queues, and leader election. Understanding how Zookeeper works internally is essential for using it correctly and debugging the inevitable edge cases.
This page takes you deep into Zookeeper's architecture and its lock implementation, from the theoretical foundations to production deployment considerations.
By the end of this page, you will understand:

- Zookeeper's architecture and consistency model
- How ephemeral nodes and sessions enable reliable locks
- The standard lock recipe and its fairness guarantees
- Failure modes and how to handle them
- When to choose Zookeeper over alternatives
Zookeeper is a replicated state machine that maintains a hierarchical namespace (similar to a filesystem) with strong consistency guarantees. To understand how locks work, we must first understand the architecture.
Ensemble Architecture:
A Zookeeper deployment consists of a cluster of servers called an ensemble. The recommended minimum is 3 servers; 5 is common for production deployments needing higher availability.
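The sizing guidance follows directly from quorum arithmetic: a quorum is a strict majority, so an ensemble of 2f+1 servers tolerates f failures. A small sketch of the math (the class and method names here are illustrative, not a Zookeeper API):

```java
public class QuorumMath {

    // Quorum size: a strict majority of the ensemble
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Failures tolerated while a quorum can still form
    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 4, 5}) {
            System.out.println(n + " servers: quorum=" + quorumSize(n)
                    + ", tolerates " + toleratedFailures(n) + " failure(s)");
        }
    }
}
```

Note that 4 servers tolerate only 1 failure, the same as 3, which is why even-sized ensembles are discouraged: you pay for an extra server without gaining fault tolerance.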
```text
Zookeeper Ensemble (5 servers):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 ┌─────────────────────────────────────────────────────┐
 │                       Clients                       │
 │    App1   App2   App3   App4   App5   App6   App7   │
 └──────┬──────┬──────┬──────┬──────┬──────┬──────┬────┘
        │      │      │      │      │      │      │
 ───────┴──────┴──────┴──────┴──────┴──────┴──────┴────
                      Connections
 ──────────────────────────────────────────────────────
        │                 │                 │
   ┌────┴─────┐      ┌────┴─────┐      ┌────┴─────┐
   │  ZK #1   │◄────►│  ZK #2   │◄────►│  ZK #3   │
   │(Follower)│      │ (LEADER) │      │(Follower)│
   └────┬─────┘      └──────────┘      └────┬─────┘
        │                                   │
   ┌────┴─────┐                        ┌────┴─────┐
   │  ZK #4   │◄──────────────────────►│  ZK #5   │
   │(Follower)│                        │(Follower)│
   └──────────┘                        └──────────┘

Key Components:
- Leader: Handles all write requests, coordinates replication
- Followers: Serve read requests, forward writes to leader
- Clients: Connect to any server, session maintained with ensemble
```

Consensus via ZAB Protocol:
Zookeeper uses the Zookeeper Atomic Broadcast (ZAB) protocol for consensus—a leader-based protocol similar to Raft. All writes go through the leader and are replicated to a majority of followers before being acknowledged.
Key Properties:

- Total order: every write is stamped with a monotonically increasing transaction id (zxid), and all servers apply writes in the same order.
- Quorum durability: a write is acknowledged only after a majority of the ensemble has persisted it, so it survives minority failures.
- FIFO client order: requests from a single client session are executed in the order they were sent.
By default, reads can be served by any server and may be slightly stale (relative to the most recent write). For lock implementations, use sync() before reads or only rely on watches and write operations for coordination. The lock recipe we'll describe uses only writes and watches, avoiding stale reads entirely.
Zookeeper's data model is a hierarchical namespace of nodes called znodes, organized like a filesystem tree. Each znode can store data (up to 1MB by default) and have children.
Znode Types:
Zookeeper supports several znode types, each with different behaviors that are crucial for implementing coordination primitives:
| Type | Lifecycle | Use Case |
|---|---|---|
| Persistent | Exists until explicitly deleted | Configuration, metadata, lock container nodes |
| Ephemeral | Exists until session that created it ends | Lock holder markers, presence detection |
| Persistent Sequential | Exists until deleted; gets auto-incremented suffix | Ordered children that persist |
| Ephemeral Sequential | Session-bound; gets auto-incremented suffix | Fair lock queues, leader election |
```text
Zookeeper Namespace:
━━━━━━━━━━━━━━━━━━━━

/ (root)
├── locks/                      (persistent - container for locks)
│   ├── inventory/              (persistent - lock for inventory resource)
│   │   ├── lock-0000000001     (ephemeral-sequential, session A)
│   │   ├── lock-0000000002     (ephemeral-sequential, session B)
│   │   └── lock-0000000003     (ephemeral-sequential, session C)
│   └── order-processing/       (persistent - another lock)
│       └── lock-0000000001     (ephemeral-sequential, session D)
│
├── config/                     (persistent - configuration data)
│   ├── database                (persistent, data: connection string)
│   └── feature-flags           (persistent, data: JSON config)
│
└── services/                   (persistent - service discovery)
    ├── api-gateway/
    │   ├── node-0000000001     (ephemeral-sequential - live instance)
    │   └── node-0000000002     (ephemeral-sequential - live instance)
    └── worker-service/
        └── node-0000000001     (ephemeral-sequential - live instance)

Key Insight: The lock queue is the children of /locks/inventory/
- Holder: Lowest sequence number (lock-0000000001)
- Waiters: Higher sequence numbers, watching predecessor
```

Ephemeral Nodes: The Key to Fault-Tolerant Locks
Ephemeral nodes are automatically deleted when the session that created them ends. This provides automatic cleanup for crashed lock holders:
```text
Session and Ephemeral Node Lifecycle:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Time   Client A               Zookeeper                 Lock State
────   ────────               ─────────                 ──────────
t0     Connect                Session A created         No nodes
t1     Create ephemeral       Node created              /lock/lock-001 (A)
       /lock/lock-001
t2     (doing work)           Session A alive           /lock/lock-001 (A)
t3     CRASH! Connection      Heartbeats stop           /lock/lock-001 (A)
       lost
t4     (dead)                 Session timeout begins    /lock/lock-001 (A)
t5     (dead)                 Session A EXPIRED         (node deleted)
t6     (dead)                 Delete ephemeral nodes    Lock FREE
t7     Client B connects      Session B created         Lock FREE
t8     Client B acquires      Node created              /lock/lock-002 (B)

Key: Client A didn't explicitly release the lock, but:
- Session expiry triggered automatic cleanup
- Lock became available without manual intervention
- No deadlock due to crashed holder
```

Session timeout (default: 30 seconds) is negotiated between client and server. Shorter timeouts mean faster failover but a higher risk of false expiration during GC pauses or network hiccups. Most production deployments use 10-30 seconds. The client library automatically sends heartbeats to keep the session alive.
Zookeeper's watch mechanism allows clients to receive notifications when znodes change, enabling efficient event-driven coordination without polling.
How Watches Work:
```text
Watch Registration and Triggering:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Client A                     Zookeeper                  Client B
────────                     ─────────                  ────────
                             /lock/lock-001 exists

exists(/lock/lock-001,       Store watch:
       watch=true)           "notify A on lock-001"

Result: exists=true          Watch registered
                             (one-time trigger)

(waiting for                 ...                        delete(/lock/lock-001)
 notification)
                             Node deleted
                             Trigger watch!
                             Send: NODE_DELETED
                             to Client A

Receive: NODE_DELETED        Watch cleared              (delete succeeded)
from Zookeeper               (one-time only)

Now re-register if           ...                        ...
still interested

Watch Types:
- getData watch: Triggers on data change or deletion
- exists watch: Triggers on creation, deletion, or data change
- getChildren watch: Triggers on child add/remove
```

Critical Watch Properties:
If all waiters watch the same node (the current lock holder), all are notified when it's deleted. All then attempt to acquire the lock simultaneously, but only one succeeds. The others must re-watch—causing unnecessary traffic. This is the 'herd effect' or 'thundering herd.' The fair lock recipe avoids this by having each waiter watch only its predecessor.
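The cost difference is easy to quantify with a toy count (plain Java, no Zookeeper involved; a simplified model that ignores re-registration messages): with N waiters, holder-watching notifies every remaining waiter on each release, while predecessor-watching wakes exactly one.

```java
public class HerdEffectCount {

    // All waiters watch the holder: each release wakes every remaining waiter
    static int herdNotifications(int waiters) {
        int total = 0;
        for (int remaining = waiters; remaining > 0; remaining--) {
            total += remaining; // every current waiter is notified
        }
        return total; // N + (N-1) + ... + 1 = N(N+1)/2
    }

    // Each waiter watches only its predecessor: one notification per release
    static int chainNotifications(int waiters) {
        return waiters; // each waiter is woken exactly once
    }

    public static void main(String[] args) {
        int n = 100;
        System.out.println("herd:  " + herdNotifications(n));  // 5050
        System.out.println("chain: " + chainNotifications(n)); // 100
    }
}
```

With 100 waiters, the herd approach generates 5,050 notifications versus 100 for the predecessor chain: quadratic versus linear in queue length.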
Zookeeper provides the primitives for locks but not a built-in lock API. The standard lock recipe, described in the Zookeeper documentation and implemented by the Apache Curator project, avoids the herd effect and provides fair, ordered access.
The Lock Algorithm:
```text
ACQUIRE LOCK on resource R:
━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. Create an ephemeral-sequential node under /locks/R:
   CREATE /locks/R/lock- (ephemeral-sequential)
   Result: /locks/R/lock-0000000007 (server assigns sequence number)

2. Get all children of /locks/R:
   GETCHILDREN /locks/R
   Result: [lock-001, lock-003, lock-007, lock-012]

3. If our node has the LOWEST sequence number:
   → We hold the lock! Enter critical section.

4. Otherwise, find the child with the next-lower sequence number:
   Our node: lock-007
   Children lower than us: [lock-001, lock-003]
   Predecessor: lock-003 (highest of those lower than us)

5. Watch the predecessor node:
   EXISTS /locks/R/lock-003 (watch=true)

6. Wait for watch notification:
   - If predecessor deleted → go to step 2
   - If session expires → fail acquisition

RELEASE LOCK:
━━━━━━━━━━━━━

1. Delete our node:
   DELETE /locks/R/lock-0000000007
2. This triggers the watch on our successor (if any)

FULL EXAMPLE:
━━━━━━━━━━━━━

Initial: /locks/resource/ contains [lock-001 (holder A)]

Client B:
1. Creates lock-002
2. Gets children: [lock-001, lock-002]
3. lock-001 < lock-002, so B doesn't hold lock
4. B watches lock-001

Client C:
1. Creates lock-003
2. Gets children: [lock-001, lock-002, lock-003]
3. lock-001 < lock-003, so C doesn't hold lock
4. C's predecessor is lock-002 (highest below 003)
5. C watches lock-002 (NOT lock-001!)

Client A releases:
1. A deletes lock-001
2. B's watch on lock-001 fires
3. B gets children: [lock-002, lock-003]
4. B has lowest number → B now holds lock!
5. C is NOT notified (watches lock-002, not lock-001)

Client B releases:
1. B deletes lock-002
2. C's watch on lock-002 fires
3. C gets children: [lock-003]
4. C has lowest number → C now holds lock!
```

Why This Recipe Is Superior:
| Property | How Recipe Achieves It |
|---|---|
| Mutual Exclusion | Only lowest sequence number is holder |
| Deadlock Freedom | Ephemeral nodes auto-deleted on session expiry |
| Starvation Freedom | Sequential ordering guarantees FIFO |
| Bounded Wait | At most N-1 others served before you (N = queue length when joined) |
| No Herd Effect | Each waiter watches only predecessor, not holder |
```java
// Using Apache Curator (high-level Zookeeper client)
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;

import java.util.concurrent.TimeUnit;

public class ZookeeperLockExample {

    private final CuratorFramework client;
    private final InterProcessMutex lock;

    public ZookeeperLockExample(CuratorFramework client, String lockPath) {
        this.client = client;
        this.lock = new InterProcessMutex(client, lockPath);
    }

    public void doWorkWithLock() throws Exception {
        // Attempt to acquire with timeout
        if (lock.acquire(10, TimeUnit.SECONDS)) {
            try {
                // CRITICAL SECTION
                // Only one process across the cluster executes this
                System.out.println("Lock acquired, doing exclusive work...");
                processExclusiveResource();
            } finally {
                // ALWAYS release in a finally block
                lock.release();
                System.out.println("Lock released");
            }
        } else {
            // Timeout expired, lock not acquired
            System.out.println("Could not acquire lock within timeout");
            handleLockTimeout();
        }
    }

    // Reentrant lock demonstration
    public void reentrantExample() throws Exception {
        lock.acquire();           // Count = 1
        try {
            doNestedWork();
        } finally {
            lock.release();       // Count = 0, lock released
        }
    }

    private void doNestedWork() throws Exception {
        // Same lock, same thread - reentrant!
        lock.acquire();           // Count = 2 (not blocking)
        try {
            System.out.println("Nested work with same lock");
        } finally {
            lock.release();       // Count = 1, still held
        }
    }

    private void processExclusiveResource() {
        // Application-specific exclusive work goes here
    }

    private void handleLockTimeout() {
        // Application-specific fallback when the lock can't be acquired
    }
}
```

The raw Zookeeper API is low-level and error-prone. Curator handles connection management, retry logic, and session recovery, and provides well-tested recipes. Unless you have specific requirements that Curator can't meet, always use Curator for production Zookeeper applications.
While the lock recipe is sound, real-world deployments face numerous failure scenarios. Understanding these is critical for building robust applications.
Scenario 1: Session Expiration During Critical Section
```text
Timeline:
t0:  Client A acquires lock (creates lock-001)
t1:  Client A begins critical section
t2:  Network partition isolates A from Zookeeper
t3:  A's heartbeats stop reaching Zookeeper
t4:  Session timeout begins (e.g., 30 seconds)
t5:  A continues working (unaware of impending expiry)
t6:  Session A EXPIRES, lock-001 deleted
t7:  Client B acquires lock (creates lock-002)
t8:  B begins critical section
t9:  Network heals, A's writes reach database
t10: BOTH A AND B ARE MODIFYING SHARED STATE!

The Problem: A's session expired, but A didn't know and kept working.
The lock service believes B is the holder. A still thinks it holds the lock.
```

Mitigation Strategies:
```java
// Monitor Zookeeper connection state
// (pauseCriticalWork / abortCriticalWorkAndCleanup are application hooks)
client.getConnectionStateListenable().addListener((cf, newState) -> {
    switch (newState) {
        case CONNECTED:
            log.info("Connected to Zookeeper");
            break;
        case RECONNECTED:
            log.info("Reconnected - session still valid");
            // Session survived, locks still held
            break;
        case SUSPENDED:
            log.warn("Connection suspended - may lose session");
            // Stop writing to shared resources!
            pauseCriticalWork();
            break;
        case LOST:
            log.error("SESSION LOST - all locks are gone!");
            // Session expired, ephemeral nodes deleted
            // IMMEDIATELY stop all critical work
            abortCriticalWorkAndCleanup();
            break;
        case READ_ONLY:
            log.warn("Read-only mode - no writes possible");
            break;
    }
});
```

Scenario 2: Create Succeeds But ACK Lost (Uncertain State)
```text
Timeline:
t0: Client A calls create("/locks/resource/lock-", ephemeral-sequential)
t1: Request reaches Zookeeper
t2: Zookeeper creates lock-0000000042
t3: Zookeeper writes to quorum
t4: Zookeeper sends ACK to Client A
t5: Network drops ACK packet
t6: Client A times out waiting for response
t7: Client A doesn't know if create succeeded!

Options for Client A:
a) Assume success and try to release → might release non-existent lock
b) Assume failure and retry → creates DUPLICATE lock-0000000043!
c) Query for existing nodes with our client ID → correct approach
```

Handling Uncertain State:
The solution is to include client identification in the lock node data or name pattern. When uncertain, query for existing locks and check if any belong to us:
```java
// Curator handles this internally, but here's the pattern:

private String acquireLockWithRecovery() throws Exception {
    byte[] ourId = getClientIdentifier();  // Unique per-process
    String lockPath = "/locks/resource/lock-";

    // Check if we already have a lock from a previous attempt
    List<String> children = zk.getChildren("/locks/resource");
    for (String child : children) {
        byte[] data = zk.getData("/locks/resource/" + child);
        if (Arrays.equals(data, ourId)) {
            // We already have a lock from a previous attempt!
            log.info("Found existing lock from previous attempt: " + child);
            return child;
        }
    }

    // No existing lock, create a new one
    String created = zk.create(
        lockPath,
        ourId,                          // Store our ID as node data
        Ids.OPEN_ACL_UNSAFE,
        CreateMode.EPHEMERAL_SEQUENTIAL
    );
    return created;
}
```

Curator's InterProcessMutex handles connection loss, uncertain state, and recovery automatically. It includes client identification and checks for existing locks. Implementing this yourself is error-prone—use Curator unless you have specific requirements it can't meet.
Zookeeper is designed for coordination, not high-throughput data storage. Understanding its performance characteristics is essential for appropriate use.
Latency Characteristics:
| Operation | Latency | Notes |
|---|---|---|
| Read (local server) | ~1-2ms | Served from in-memory cache |
| Read (with sync) | ~5-10ms | Ensures freshness |
| Write (create/update) | ~10-30ms | Requires consensus across majority |
| Lock Acquisition (uncontended) | ~15-40ms | Create + getChildren + possibly watch |
| Lock Acquisition (contended) | ~15ms + wait time | Plus time waiting for predecessor |
| Lock Release | ~10-30ms | Delete operation |
Throughput Considerations:
```text
Scenario: High-throughput order processing

Requirements:
- 1,000 orders/second peak
- Each order requires an exclusive lock on inventory

Lock Operations per Order:
- Create lock node: 1 write
- Delete lock node: 1 write
Total: 2 writes per order

Write Load on Zookeeper:
- 1,000 orders × 2 writes = 2,000 writes/second

Is This Sustainable?
- Typical ZK cluster: 10,000-50,000 writes/second capacity
- Our load: 2,000 writes/second
- Answer: Yes, with headroom

But Consider:
- What about 10,000 orders/second?
- 20,000 writes/second → approaching capacity
- Consider an alternative: partition by product category
  - 10 lock paths instead of 1
  - Reduced contention AND distributed load
```

Coarse-grained locks (one lock for all inventory) are simple but create bottlenecks. Fine-grained locks (one lock per product) reduce contention but increase Zookeeper load and complexity. Find the right granularity for your workload—often a middle ground like category-level locking.
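The back-of-envelope estimate and the partitioning idea above can be sketched as follows (the numbers, path layout, and method names are illustrative assumptions, not a real API):

```java
public class LockLoadEstimate {

    // Each acquire/release pair costs two Zookeeper writes (create + delete)
    static int writesPerSecond(int opsPerSecond, int writesPerOp) {
        return opsPerSecond * writesPerOp;
    }

    // Partition locks by category to spread load and contention across paths
    static String lockPathFor(String category, int partitions) {
        int bucket = Math.floorMod(category.hashCode(), partitions);
        return "/locks/inventory-" + bucket;
    }

    public static void main(String[] args) {
        System.out.println(writesPerSecond(1_000, 2));  // 2000 - comfortable
        System.out.println(writesPerSecond(10_000, 2)); // 20000 - near capacity
        // Same category always maps to the same lock path, preserving exclusion
        System.out.println(lockPathFor("electronics", 10));
    }
}
```

The key property of hash-partitioned lock paths is that a given category always maps to the same path, so mutual exclusion within a category is preserved while total write load is spread across 10 independent queues.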
Beyond simple mutual exclusion, Zookeeper supports more sophisticated coordination patterns.
Read-Write Lock Recipe:
For read-heavy workloads, allowing multiple concurrent readers while ensuring exclusive write access can dramatically improve throughput.
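In the standard recipe, readers and writers both create ephemeral-sequential nodes (with read- and write- prefixes) under the lock path, and a reader may proceed only if no write- node sorts before its own. A minimal standalone sketch of that admission check (illustrative names, plain Java, no Zookeeper client):

```java
import java.util.List;

public class ReadWriteLockCheck {

    // "read-0000000004" -> 4, "write-0000000003" -> 3
    static long sequenceOf(String node) {
        return Long.parseLong(node.substring(node.lastIndexOf('-') + 1));
    }

    // A reader may enter iff no write- node has a lower sequence than ours
    static boolean readerMayEnter(List<String> children, String ourReadNode) {
        long ourSeq = sequenceOf(ourReadNode);
        return children.stream()
                .filter(c -> c.startsWith("write-"))
                .noneMatch(c -> sequenceOf(c) < ourSeq);
    }

    public static void main(String[] args) {
        List<String> children = List.of(
                "read-0000000001", "read-0000000002",
                "write-0000000003", "read-0000000004");
        // Readers 001 and 002 precede the writer: they may enter together
        System.out.println(readerMayEnter(children, "read-0000000002")); // true
        // Reader 004 arrived after writer 003: it must wait
        System.out.println(readerMayEnter(children, "read-0000000004")); // false
    }
}
```

This is exactly what makes the recipe fair: a reader that arrives after a pending writer queues behind it instead of extending the read phase indefinitely.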
```text
READ-WRITE LOCK RECIPE:
━━━━━━━━━━━━━━━━━━━━━━━

Namespace:
/locks/resource/
 ├── read-0000000001    (ephemeral-sequential, reader)
 ├── read-0000000002    (ephemeral-sequential, reader)
 ├── write-0000000003   (ephemeral-sequential, writer)
 ├── read-0000000004    (ephemeral-sequential, reader waiting for writer)
 └── write-0000000005   (ephemeral-sequential, writer waiting)

ACQUIRE READ LOCK:
1. Create ephemeral-sequential /locks/resource/read-
2. Get children sorted by sequence number
3. Check for any WRITE node with a lower sequence number
   - If no writes before us: read lock acquired
   - If a write before us: watch the closest write predecessor, wait

ACQUIRE WRITE LOCK:
1. Create ephemeral-sequential /locks/resource/write-
2. Get children sorted by sequence number
3. Check if our node has the lowest sequence number
   - If lowest: write lock acquired
   - Otherwise: watch the immediate predecessor (read or write), wait

SEMANTICS:
- Multiple concurrent readers allowed (if no writer waiting ahead)
- Writers get exclusive access
- Writers don't starve readers (readers that arrive after a writer wait)
- Readers don't starve writers (writers eventually get their turn)

Example State:
  Children: [read-001, read-002, write-003, read-004]
  read-001 and read-002: Both hold the read lock (no write before them)
  write-003: Waiting for read-001 and read-002 to release
  read-004: Waiting for write-003 to release
```

Other Curator Recipes:
```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.locks.InterProcessReadWriteLock;

public class ZookeeperReadWriteLockExample {

    private final InterProcessReadWriteLock rwLock;

    public ZookeeperReadWriteLockExample(CuratorFramework client, String path) {
        this.rwLock = new InterProcessReadWriteLock(client, path);
    }

    public Data readData() throws Exception {
        // Multiple readers can hold this simultaneously
        rwLock.readLock().acquire();
        try {
            return fetchDataFromDatabase();
        } finally {
            rwLock.readLock().release();
        }
    }

    public void writeData(Data newData) throws Exception {
        // Only one writer, blocks all readers
        rwLock.writeLock().acquire();
        try {
            updateDatabase(newData);
            invalidateCache();
        } finally {
            rwLock.writeLock().release();
        }
    }
}
```

We've explored Zookeeper's architecture, primitives, and lock implementations in depth. Let's consolidate when Zookeeper is the right choice.
What's Next:
The next page explores etcd locks—a modern alternative to Zookeeper with a simpler data model, built on Raft consensus, and increasingly popular in Kubernetes-native environments.
You now understand how Zookeeper provides distributed locks: the architecture enabling strong consistency, the primitives (ephemeral nodes, sequences, watches) that make locks possible, the lock recipe that provides fairness and efficiency, and the failure scenarios you must handle. Next, we'll see how etcd provides similar guarantees with a different approach.