When Kubernetes needed a reliable coordination system for its control plane, the creators chose etcd—a distributed key-value store built on the Raft consensus algorithm. Today, etcd runs at the heart of virtually every Kubernetes cluster on the planet, making it one of the most battle-tested distributed coordination systems in existence.
But etcd isn't just for Kubernetes. Its clean API, strong consistency guarantees, and modern design make it an excellent choice for distributed locking in cloud-native applications. If Zookeeper is the veteran of distributed coordination, etcd is its modern successor—simpler to operate, easier to understand, and purpose-built for containerized environments.
This page explores how etcd provides distributed locks, from its architectural foundations to production-ready implementation patterns.
By the end of this page, you will understand:

- etcd's architecture and Raft-based consistency model
- How leases enable lock lifecycle management
- The etcd locking mechanism and its guarantees
- How etcd compares to Zookeeper for different use cases
- Practical implementation patterns for production use
etcd is a distributed reliable key-value store that uses the Raft consensus algorithm to ensure consistency across replicas. Unlike Zookeeper's hierarchical model, etcd uses a flat key-value namespace with some hierarchical conventions.
Cluster Architecture:
```
etcd Cluster (3 or 5 nodes typical):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

        ┌─────────────────────────────────────┐
        │               Clients               │
        │  gRPC clients, etcdctl, Kubernetes  │
        └──────────────────┬──────────────────┘
                           │
          ┌────────────────┼─────────────────┐
          │                │                 │
     ┌────┴─────┐     ┌────┴────┐       ┌────┴─────┐
     │   etcd   │     │  etcd   │       │   etcd   │
     │  Node 1  │◄───►│  Node 2 │◄─────►│  Node 3  │
     │(Follower)│     │ (LEADER)│       │(Follower)│
     └────┬─────┘     └────┬────┘       └────┬─────┘
          │                │                 │
     ┌────┴────┐      ┌────┴────┐       ┌────┴────┐
     │  Raft   │      │  Raft   │       │  Raft   │
     │   Log   │      │   Log   │       │   Log   │
     └─────────┘      └─────────┘       └─────────┘

Key Components:
- All nodes run identical etcd binary
- Leader handles all write operations
- Followers replicate log and serve linearizable reads
- Raft ensures safety even with minority failures
- gRPC API (v3) for all client communication
```
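The client side is deliberately simple: you hand the Go client a list of member endpoints, and it spreads requests across them. A minimal connection sketch (the endpoints are illustrative; adjust to your cluster):

```go
package main

import (
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// newClusterClient connects to all three members. The client balances requests
// across the endpoints and fails over if one becomes unreachable; writes are
// forwarded to the current Raft leader internally.
func newClusterClient() (*clientv3.Client, error) {
	return clientv3.New(clientv3.Config{
		Endpoints:   []string{"10.0.0.1:2379", "10.0.0.2:2379", "10.0.0.3:2379"},
		DialTimeout: 5 * time.Second,
	})
}
```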
Raft Consensus: The Foundation of Consistency

Raft was designed explicitly to be understandable, in contrast to Paxos's notorious complexity. The cluster elects a single leader, every write is appended to a replicated log, and a write commits only once a majority of nodes have persisted it, so the cluster remains consistent as long as a quorum survives. For locking, the guarantees that matter most are:
| Guarantee | Description | Implication for Locks |
|---|---|---|
| Linearizable Writes | All writes appear in a single global order | Lock grants are totally ordered—no ambiguity about who acquired first |
| Linearizable Reads (default) | Reads reflect all writes committed before the read | Can verify the current lock holder accurately |
| Serializable Reads (optional) | Reads are served from the local member and may be slightly stale | Lower latency, but not safe for deciding lock ownership |
| Watch Reliability | Watches receive all updates in order | Lock waiters receive accurate notifications |
etcd v3 (released 2016) introduced a completely new API based on gRPC, leases, and a multi-version concurrency control (MVCC) store. etcd v2 is deprecated. All modern etcd deployments and this page focus exclusively on the v3 API.
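To make the read guarantees above concrete, here is a small sketch using the Go v3 client (it assumes a reachable cluster and an already constructed client):

```go
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// readLockHolder contrasts the two read modes from the guarantees table.
func readLockHolder(ctx context.Context, cli *clientv3.Client) error {
	// Default Get: linearizable. It reflects every committed write, so it is
	// the safe way to check who currently holds a lock.
	resp, err := cli.Get(ctx, "/locks/inventory/holder")
	if err != nil {
		return err
	}
	for _, kv := range resp.Kvs {
		log.Printf("current holder: %s", kv.Value)
	}

	// Serializable Get: answered by whichever member the client reaches and
	// may be slightly stale. Fine for dashboards, not for lock decisions.
	_, err = cli.Get(ctx, "/locks/inventory/holder", clientv3.WithSerializable())
	return err
}
```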
Unlike Zookeeper's tree structure, etcd uses a flat key-value namespace. Keys are arbitrary byte sequences, and hierarchical organization is achieved through key prefixes and naming conventions.
Key Structure:
```
etcd Namespace (Flat with Prefix Convention):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Key                                           Value
────────────────────────────────────────      ─────────────────────
/locks/inventory/holder                       client-id-123
/locks/order-processing/holder                client-id-456
/locks/payment/queue/00000000000000000001     client-id-789
/locks/payment/queue/00000000000000000002     client-id-abc

/config/database/connection-string            postgresql://...
/config/feature-flags                         {"dark_mode": true}

/services/api-gateway/192.168.1.10            {"port": 8080}
/services/api-gateway/192.168.1.11            {"port": 8080}
/services/worker/192.168.1.20                 {"port": 9000}

Prefix Queries:
- GET /locks/ (prefix) → Returns all lock-related keys
- GET /services/api-gateway/ (prefix) → Returns all api-gateway instances

No Parent-Child Relationship:
- /locks/inventory can exist without /locks existing
- Deleting /locks does NOT delete /locks/inventory
- This differs fundamentally from Zookeeper's tree!
```
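Prefix queries map directly onto the client API. A small sketch, reusing the service-discovery keys from the listing above (client setup as in the earlier snippet):

```go
// listGatewayInstances fetches every key under /services/api-gateway/ with a
// single prefix query.
func listGatewayInstances(ctx context.Context, cli *clientv3.Client) error {
	resp, err := cli.Get(ctx, "/services/api-gateway/", clientv3.WithPrefix())
	if err != nil {
		return err
	}
	for _, kv := range resp.Kvs {
		log.Printf("instance %s -> %s", kv.Key, kv.Value)
	}
	return nil
}
```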
MVCC and Revisions:

etcd v3 uses Multi-Version Concurrency Control (MVCC), maintaining historical versions of keys. Each modification increments a global revision number.
```
MVCC Revision System:
━━━━━━━━━━━━━━━━━━━━━

Global Revision: Monotonically increasing across ALL keys
                 Similar to Zookeeper's zxid

Operation History:
  Rev 1: PUT /config/db = "postgres://v1"
  Rev 2: PUT /locks/x = "client-A"
  Rev 3: PUT /config/db = "postgres://v2"
  Rev 4: DELETE /locks/x
  Rev 5: PUT /locks/x = "client-B"

Key /config/db at:
  - Rev 1: "postgres://v1"
  - Rev 2: "postgres://v1" (no change to this key)
  - Rev 3: "postgres://v2"
  - Current: "postgres://v2"

Key /locks/x at:
  - Rev 1: (does not exist)
  - Rev 2: "client-A" (create_revision=2, mod_revision=2)
  - Rev 3: "client-A"
  - Rev 4: (deleted)
  - Rev 5: "client-B" (create_revision=5, mod_revision=5)

Use for Locks:
- create_revision: When the key was created (changes on delete/recreate)
- mod_revision: When the key was last modified
- Compare create_revision to determine lock ordering!
```

The create_revision of a lock key serves as a natural fencing token. When acquiring a lock, record the create_revision. Include it in all operations on protected resources. The resource can reject operations with stale revisions, protecting against delayed messages from expired lock holders.
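A sketch of that fencing pattern: read your own lock key's create_revision right after acquiring, keep it for the lock's lifetime, and pass it along with every protected operation. The Storage interface here is hypothetical, standing in for whatever system you are protecting:

```go
// Storage is a hypothetical downstream system that accepts a fencing token and
// rejects writes carrying a token lower than the highest one it has seen.
type Storage interface {
	Write(data []byte, fencingToken int64) error
}

// lockToken returns the create_revision of the lock key we created under the
// lock prefix; call it immediately after acquiring and reuse the value.
func lockToken(ctx context.Context, cli *clientv3.Client, myLockKey string) (int64, error) {
	resp, err := cli.Get(ctx, myLockKey)
	if err != nil {
		return 0, err
	}
	if len(resp.Kvs) == 0 {
		return 0, fmt.Errorf("lock key %s missing; lock lost before first use", myLockKey)
	}
	return resp.Kvs[0].CreateRevision, nil
}

// writeProtected hands the token to the resource, which fences out writes from
// expired lock holders whose tokens are older than the current holder's.
func writeProtected(store Storage, data []byte, token int64) error {
	return store.Write(data, token)
}
```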
etcd's lease primitive is the key building block for distributed locks. A lease is a time-bound contract: as long as the lease is alive (via keep-alives), associated keys exist; when the lease expires, all attached keys are automatically deleted.
Lease Lifecycle:
```
LEASE LIFECYCLE:
━━━━━━━━━━━━━━━━

1. CREATE LEASE
   Client → etcd: LeaseGrant(TTL=30s)
   etcd → Client: Lease ID = 1234567890

2. ATTACH KEYS TO LEASE
   Client → etcd: PUT /locks/resource = "client-A" (lease=1234567890)
   Key now bound to lease: expires when lease expires

3. KEEP LEASE ALIVE (Heartbeat)
   Client → etcd: LeaseKeepAlive(1234567890)
   etcd → Client: TTL refreshed to 30s
   (Client must send keep-alives at ~TTL/3 interval)

4. NORMAL RELEASE
   Client → etcd: LeaseRevoke(1234567890)
   Result: Lease deleted, ALL attached keys deleted immediately

5. EXPIRATION (Client Crash)
   Client crashes, stops sending keep-alives
   etcd: Lease 1234567890 TTL countdown...
   After 30s: Lease EXPIRED
   Result: All attached keys (/locks/resource) deleted automatically

MULTIPLE KEYS PER LEASE:
━━━━━━━━━━━━━━━━━━━━━━━━━

Lease 1234567890 (TTL=30s):
  Attached keys:
  - /locks/resource-A = "client-1"
  - /locks/resource-B = "client-1"
  - /session/client-1/heartbeat = timestamp

If client-1 crashes:
  → Lease expires
  → All three keys deleted atomically
  → Resources A and B both released
```

Leases vs. Zookeeper Sessions:
| Aspect | etcd Leases | Zookeeper Sessions |
|---|---|---|
| Granularity | Multiple leases per client possible | One session per client connection |
| Keep-Alive | Explicit keep-alive RPCs per lease | Automatic heartbeats at connection level |
| TTL Control | TTL set per lease at creation | Session timeout negotiated at connect |
| Multiple Resources | One lease can cover multiple locks | Each ephemeral node tied to single session |
| Revocation | Client can revoke lease proactively | Client disconnects; session may survive briefly |
| Visibility | Leases are first-class objects, queryable | Sessions are internal, less visible |
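Because leases are first-class objects (the Visibility row above), you can inspect any lease directly, including its remaining TTL and attached keys. A small sketch, assuming a client and a previously granted lease ID; the fuller example below then walks through the complete lease lifecycle in Go:

```go
// inspectLease prints the remaining TTL and the keys attached to a lease.
func inspectLease(ctx context.Context, cli *clientv3.Client, id clientv3.LeaseID) error {
	resp, err := cli.TimeToLive(ctx, id, clientv3.WithAttachedKeys())
	if err != nil {
		return err
	}
	if resp.TTL == -1 {
		log.Printf("lease %x has already expired", id)
		return nil
	}
	log.Printf("lease %x: %ds remaining of %ds granted, %d attached keys",
		id, resp.TTL, resp.GrantedTTL, len(resp.Keys))
	return nil
}
```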
```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func leaseExample() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Create lease with 30-second TTL
	lease, err := cli.Grant(ctx, 30)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("Created lease with ID: %x, TTL: %d", lease.ID, lease.TTL)

	// Put key with lease attachment
	_, err = cli.Put(ctx, "/locks/my-resource", "holder-id", clientv3.WithLease(lease.ID))
	if err != nil {
		log.Fatal(err)
	}

	// Start keep-alive stream
	keepAliveCh, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		log.Fatal(err)
	}

	// Monitor keep-alive responses
	go func() {
		for ka := range keepAliveCh {
			log.Printf("Lease renewed, TTL: %d", ka.TTL)
		}
		// Channel closed: keep-alives stopped, lease has expired or soon will
		log.Println("Keep-alive channel closed, lease expired!")
	}()

	// Do critical work...
	time.Sleep(60 * time.Second)

	// Explicit release (optional - can also just let lease expire)
	cli.Revoke(ctx, lease.ID)
}
```

When the keep-alive channel returns nil or closes, your lease has expired or will expire soon. You must immediately stop any critical section work. Continuing to modify shared resources after lease expiration violates mutual exclusion—another process may have acquired the lock.
etcd's concurrency library provides a lock implementation that's conceptually similar to Zookeeper's recipe but adapted to etcd's flat key-value model.
The Lock Algorithm:
```
ETCD LOCK ALGORITHM:
━━━━━━━━━━━━━━━━━━━━━

ACQUIRE LOCK on prefix "/locks/resource/":

1. Create a lease (e.g., 30s TTL)

2. PUT a key with unique suffix under the lock prefix:
   Key: /locks/resource/{lease-id}
   Value: (empty or client info)
   Condition: Only if key doesn't exist (using transaction)

3. GET all keys with prefix /locks/resource/
   Sort by create_revision (ascending)

4. If our key has the LOWEST create_revision:
   → We hold the lock! Enter critical section.

5. Otherwise, find the key with the next-lower create_revision:
   WATCH that key for deletion

6. When watch triggers (predecessor deleted):
   → Re-check if we're now lowest (go to step 3)

EXAMPLE:
━━━━━━━━

Initial state: /locks/resource/ prefix is empty

Client A (lease 1001):
  1. PUT /locks/resource/1001 (create_revision = 100)
  2. GET prefix: [/locks/resource/1001]
  3. We're lowest → Lock acquired!

Client B (lease 1002):
  1. PUT /locks/resource/1002 (create_revision = 101)
  2. GET prefix: [/locks/resource/1001, /locks/resource/1002]
  3. 100 < 101 → We're not lowest
  4. Watch /locks/resource/1001 for deletion

Client C (lease 1003):
  1. PUT /locks/resource/1003 (create_revision = 102)
  2. GET prefix: [/locks/resource/1001, /locks/resource/1002, /locks/resource/1003]
  3. Predecessor with lower revision: /locks/resource/1002
  4. Watch /locks/resource/1002 for deletion (NOT 1001!)

Client A releases (lease 1001 revoked):
  → /locks/resource/1001 deleted
  → B's watch triggers
  → B re-checks: [/locks/resource/1002, /locks/resource/1003]
  → B is now lowest → Lock acquired!
  → C is NOT notified (watching 1002, not 1001)
```
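Before reaching for the library, it can help to see those steps spelled out. The sketch below is a condensed, manual version of the acquire path; it is not the concurrency package's actual implementation and it skips retries and lease-loss handling. It assumes a connected cli, an already granted leaseID, and the mvccpb package (go.etcd.io/etcd/api/v3/mvccpb) for event types:

```go
// acquireManually queues under the lock prefix and blocks until our key is the
// oldest one there, returning the key we created.
func acquireManually(ctx context.Context, cli *clientv3.Client, prefix string, leaseID clientv3.LeaseID) (string, error) {
	// Step 2: create our contender key, named after our lease ID.
	myKey := fmt.Sprintf("%s%x", prefix, leaseID)
	put, err := cli.Put(ctx, myKey, "", clientv3.WithLease(leaseID))
	if err != nil {
		return "", err
	}
	myRev := put.Header.Revision // our key's create_revision (assuming it was just created)

	for {
		// Step 3: list all contenders, lowest create_revision first.
		resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix(),
			clientv3.WithSort(clientv3.SortByCreateRevision, clientv3.SortAscend))
		if err != nil {
			return "", err
		}

		// Step 4: if our key is the oldest, we hold the lock.
		if string(resp.Kvs[0].Key) == myKey {
			return myKey, nil
		}

		// Step 5: find our immediate predecessor (largest create_revision below ours).
		var predecessor string
		for _, kv := range resp.Kvs {
			if kv.CreateRevision >= myRev {
				break
			}
			predecessor = string(kv.Key)
		}

		// Step 6: wait until the predecessor key is deleted, then re-check.
		wctx, cancel := context.WithCancel(ctx)
		wch := cli.Watch(wctx, predecessor, clientv3.WithRev(resp.Header.Revision+1))
		for wresp := range wch {
			if wresp.Canceled {
				cancel()
				return "", wresp.Err()
			}
			deleted := false
			for _, ev := range wresp.Events {
				if ev.Type == mvccpb.DELETE {
					deleted = true
				}
			}
			if deleted {
				break
			}
		}
		cancel()
	}
}
```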
Implementation with etcd Concurrency Library:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func etcdLockExample() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Create a session (wraps lease management)
	// TTL: 10 seconds, auto-renewed via keep-alives
	session, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Create mutex on a given prefix
	mutex := concurrency.NewMutex(session, "/locks/my-critical-resource/")

	// Acquire lock (blocks until acquired)
	ctx := context.Background()
	if err := mutex.Lock(ctx); err != nil {
		log.Fatal(err)
	}
	fmt.Println("Lock acquired!")
	fmt.Printf("Lock key: %s\n", mutex.Key()) // The actual key created

	// CRITICAL SECTION
	// Only one process with this lock prefix executes here
	doExclusiveWork()

	// Release lock
	if err := mutex.Unlock(ctx); err != nil {
		log.Fatal(err)
	}
	fmt.Println("Lock released!")
}

// With timeout
func etcdLockWithTimeout(session *concurrency.Session) {
	// Client and session are created as in etcdLockExample above.
	mutex := concurrency.NewMutex(session, "/locks/resource/")

	// Context with timeout
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	if err := mutex.Lock(ctx); err != nil {
		if err == context.DeadlineExceeded {
			log.Println("Could not acquire lock within timeout")
			return
		}
		log.Fatal(err)
	}
	defer mutex.Unlock(context.Background())

	// Critical section...
}
```

The concurrency.Session type wraps lease creation and keep-alive management. It automatically handles keep-alive failures and provides a Done() channel that closes when the session is lost. Always use Session for locks rather than managing leases manually.
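One way to act on that Done() channel is to check it between small, ideally idempotent units of work inside the critical section. A sketch of that pattern (the work slice is a stand-in for your own steps):

```go
// doGuardedWork runs each step only while the lock's session is still alive,
// aborting as soon as the underlying lease is lost.
func doGuardedWork(ctx context.Context, session *concurrency.Session, work []func() error) error {
	for _, step := range work {
		select {
		case <-session.Done():
			// The lease behind this session is gone; another process may now
			// hold the lock, so stop touching shared state immediately.
			return errors.New("session lost, aborting critical section")
		case <-ctx.Done():
			return ctx.Err()
		default:
			if err := step(); err != nil {
				return err
			}
		}
	}
	return nil
}
```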
etcd v3 provides powerful transactions (Txn) that enable atomic compare-and-set operations. This is crucial for implementing safe lock acquisition.
Transaction Structure:
```
ETCD TRANSACTION (Txn):
━━━━━━━━━━━━━━━━━━━━━━━

Structure:
  IF (conditions)
  THEN (operations)
  ELSE (operations)

All conditions are evaluated atomically.
If ALL conditions pass: THEN operations execute.
If ANY condition fails: ELSE operations execute.

Condition Types:
  - Compare(Key).Version() == 0     // Key doesn't exist
  - Compare(Key).CreateRevision()   // When key was created
  - Compare(Key).ModRevision()      // When key was last modified
  - Compare(Key).Value()            // Key's current value
  - Compare(Key).Lease()            // Key's attached lease

Operations:
  - Put(key, value, opts...)
  - Delete(key)
  - Get(key)                        // Returns value in response
```

Safe Lock Acquisition with Transactions:
The concurrency library uses transactions internally, but understanding them helps debug issues and implement custom patterns:
```go
// Atomic lock acquisition with transaction
func acquireLockWithTxn(cli *clientv3.Client, lockKey string, leaseID clientv3.LeaseID) (bool, error) {
	ctx := context.Background()

	// Transaction: Only create lock key if it doesn't exist
	txn := cli.Txn(ctx).If(
		// Condition: Key doesn't exist (version == 0)
		clientv3.Compare(clientv3.Version(lockKey), "=", 0),
	).Then(
		// If condition passes: Create the lock key
		clientv3.OpPut(lockKey, "holder-info", clientv3.WithLease(leaseID)),
	).Else(
		// If condition fails: Get current holder info
		clientv3.OpGet(lockKey),
	)

	resp, err := txn.Commit()
	if err != nil {
		return false, err
	}

	if resp.Succeeded {
		// We created the lock key, we hold the lock
		return true, nil
	}

	// Lock already held by someone else
	// resp.Responses[0].GetResponseRange() contains current holder
	return false, nil
}

// Safe lock release: Only delete if we're still the holder
func releaseLockSafely(cli *clientv3.Client, lockKey string, expectedLeaseID clientv3.LeaseID) error {
	ctx := context.Background()

	txn := cli.Txn(ctx).If(
		// Condition: Key exists AND has our lease
		clientv3.Compare(clientv3.LeaseValue(lockKey), "=", expectedLeaseID),
	).Then(
		// Only delete if we still hold it
		clientv3.OpDelete(lockKey),
	)

	_, err := txn.Commit()
	return err
}
```

Without transactions, a race condition exists between checking lock status and updating it. Transactions make the check-then-act atomic. etcd's transaction processing is still linearizable—each transaction sees a consistent snapshot and applies atomically.
Like Zookeeper, etcd provides a watch mechanism for receiving notifications when keys change. This enables efficient lock queue management without polling.
Watch Characteristics:

- Watches are continuous streams over gRPC: a single watch keeps delivering events until you cancel it (unlike Zookeeper's one-shot triggers)
- Events arrive in order, each tagged with the revision at which it occurred
- A watch can start from a historical revision, so no events are missed between a read and the watch setup
- Watches can cover a single key or an entire prefix
- Progress notifications let long-lived watchers confirm they are still current
```go
// Requires: "go.etcd.io/etcd/api/v3/mvccpb" for the event type constants.

func waitForLockRelease(ctx context.Context, cli *clientv3.Client, predecessorKey string) error {
	// Watch the predecessor key for deletion
	watchCh := cli.Watch(ctx, predecessorKey)

	for watchResp := range watchCh {
		if watchResp.Canceled {
			return fmt.Errorf("watch canceled: %v", watchResp.Err())
		}
		for _, event := range watchResp.Events {
			if event.Type == mvccpb.DELETE {
				// Predecessor key deleted, we may now hold the lock
				log.Printf("Predecessor %s deleted, checking lock status", predecessorKey)
				return nil
			}
			// Other events (PUT) on this key - predecessor is still holding
		}
	}
	return fmt.Errorf("watch channel closed unexpectedly")
}

// Watch with starting revision (ensure no missed events)
func watchFromRevision(cli *clientv3.Client, key string, startRev int64) {
	ctx := context.Background()

	watchCh := cli.Watch(ctx, key,
		clientv3.WithRev(startRev),     // Start from specific revision
		clientv3.WithPrevKV(),          // Include previous value in events
		clientv3.WithProgressNotify(),  // Periodic progress notifications
	)

	for watchResp := range watchCh {
		if watchResp.IsProgressNotify() {
			log.Printf("Watch progress: revision %d", watchResp.Header.Revision)
			continue
		}
		for _, event := range watchResp.Events {
			log.Printf("Event: Type=%s, Key=%s, Rev=%d",
				event.Type, event.Kv.Key, event.Kv.ModRevision)
		}
	}
}
```

When setting up a watch for lock waiting, note the current revision from your GET response. Watch from that revision + 1 to ensure you don't miss events that occur between your GET and watch setup. The concurrency library handles this correctly.
Both etcd and Zookeeper provide strong consistency and can implement distributed locks. Understanding their differences helps you choose the right tool.
| Aspect | etcd | Zookeeper |
|---|---|---|
| Consensus | Raft | ZAB (Zookeeper Atomic Broadcast) |
| Data Model | Flat key-value with MVCC | Hierarchical tree (znodes) |
| API | gRPC (binary, efficient) | Custom protocol, many client libraries |
| Watch Model | Continuous streams | One-shot triggers |
| Session Management | Explicit leases per key/group | Connection-based sessions |
| Version | Create/Mod revisions per key | Version + cversion + aversion per znode |
| History | Full MVCC history (until compacted) | Current state only |
| Language | Go (single binary) | Java (JVM required) |
| Kubernetes Integration | Native (used by K8s) | Possible but requires extra setup |
When to Choose etcd:

- You run on Kubernetes or other cloud-native infrastructure where etcd is already familiar territory
- You want a single Go binary with no JVM to operate
- You prefer a modern gRPC API, MVCC history, and lease-based key lifecycles
- You are starting fresh, with no existing Zookeeper investment

When to Choose Zookeeper:

- You already operate Zookeeper for systems such as Kafka or Hadoop
- Your team has deep Zookeeper operational expertise
- You depend on mature recipes and client libraries such as Apache Curator
- The hierarchical znode model naturally fits your coordination data
For distributed locking specifically, both etcd and Zookeeper provide equivalent guarantees: linearizable lock grants, automatic cleanup on holder failure, and fair ordering. The choice between them is more about operational preferences, existing infrastructure, and team expertise than lock feature differences.
We've explored etcd's architecture and lock implementation in depth. Let's consolidate the key insights:

- etcd is a Raft-based key-value store: linearizable writes give lock grants a single, unambiguous order
- Leases tie keys to client liveness: keep-alives keep them alive, and expiry deletes every attached key automatically
- The lock recipe queues contenders under a prefix, ordered by create_revision, with each waiter watching only its immediate predecessor
- The create_revision of the lock key doubles as a fencing token for protected resources
- Transactions (Txn) make check-then-act operations atomic, both for acquisition and for safe release
- etcd and Zookeeper offer equivalent lock guarantees; choose based on operations, ecosystem, and team expertise
What's Next:
We've now explored two consensus-based coordination systems (Zookeeper and etcd) that provide strong lock guarantees. The final page examines a fundamentally different approach: Redis Redlock—an algorithm that attempts to provide distributed locks using multiple independent Redis instances without a consensus protocol, and the controversy that surrounds it.
You now understand how etcd provides distributed locks: the Raft-based architecture, the lease mechanism for lifecycle management, the prefix-based lock queue, and how it compares to Zookeeper. Next, we'll explore Redis Redlock—a controversial alternative that trades off consistency guarantees for simpler infrastructure.