When Kubernetes needed a reliable coordination system for its control plane, the creators chose etcd—a distributed key-value store built on the Raft consensus algorithm. Today, etcd runs at the heart of virtually every Kubernetes cluster on the planet, making it one of the most battle-tested distributed coordination systems in existence.
But etcd isn't just for Kubernetes. Its clean API, strong consistency guarantees, and modern design make it an excellent choice for distributed locking in cloud-native applications. If Zookeeper is the veteran of distributed coordination, etcd is its modern successor—simpler to operate, easier to understand, and purpose-built for containerized environments.
This page explores how etcd provides distributed locks, from its architectural foundations to production-ready implementation patterns.
By the end of this page, you will understand:

- etcd's architecture and Raft-based consistency model
- How leases enable lock lifecycle management
- The etcd locking mechanism and its guarantees
- How etcd compares to Zookeeper for different use cases
- Practical implementation patterns for production use
etcd is a distributed reliable key-value store that uses the Raft consensus algorithm to ensure consistency across replicas. Unlike Zookeeper's hierarchical model, etcd uses a flat key-value namespace with some hierarchical conventions.
Cluster Architecture:
```
etcd Cluster (3 or 5 nodes typical):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

        ┌─────────────────────────────────────┐
        │               Clients               │
        │  gRPC clients, etcdctl, Kubernetes  │
        └──────────────────┬──────────────────┘
                           │
          ┌────────────────┼─────────────────┐
          │                │                 │
     ┌────┴─────┐     ┌────┴────┐       ┌────┴─────┐
     │   etcd   │     │  etcd   │       │   etcd   │
     │  Node 1  │◄───►│  Node 2 │◄─────►│  Node 3  │
     │(Follower)│     │ (LEADER)│       │(Follower)│
     └────┬─────┘     └────┬────┘       └────┬─────┘
          │                │                 │
     ┌────┴────┐      ┌────┴────┐       ┌────┴────┐
     │  Raft   │      │  Raft   │       │  Raft   │
     │   Log   │      │   Log   │       │   Log   │
     └─────────┘      └─────────┘       └─────────┘

Key Components:
- All nodes run identical etcd binary
- Leader handles all write operations
- Followers replicate log and serve linearizable reads
- Raft ensures safety even with minority failures
- gRPC API (v3) for all client communication
```
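The client side is deliberately simple: you hand the Go client a list of member endpoints, and it spreads requests across them. A minimal connection sketch (the endpoints are illustrative; adjust to your cluster):

```go
package main

import (
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// newClusterClient connects to all three members. The client balances requests
// across the endpoints and fails over if one becomes unreachable; writes are
// forwarded to the current Raft leader internally.
func newClusterClient() (*clientv3.Client, error) {
	return clientv3.New(clientv3.Config{
		Endpoints:   []string{"10.0.0.1:2379", "10.0.0.2:2379", "10.0.0.3:2379"},
		DialTimeout: 5 * time.Second,
	})
}
```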
Raft Consensus: The Foundation of Consistency

Raft was designed explicitly to be understandable, in contrast to Paxos's notorious complexity. The cluster elects a single leader, every write is appended to a replicated log, and a write commits only once a majority of nodes have persisted it, so the cluster remains consistent as long as a quorum survives. For locking, the guarantees that matter most are:
| Guarantee | Description | Implication for Locks |
|---|---|---|
| Linearizable Writes | All writes appear in a single global order | Lock grants are totally ordered—no ambiguity about who acquired first |
| Linearizable Reads (default) | Reads reflect all writes committed before the read | Can verify the current lock holder accurately |
| Serializable Reads (optional) | Reads are served from the local member and may be slightly stale | Lower latency, but not safe for deciding lock ownership |
| Watch Reliability | Watches receive all updates in order | Lock waiters receive accurate notifications |
etcd v3 (released 2016) introduced a completely new API based on gRPC, leases, and a multi-version concurrency control (MVCC) store. etcd v2 is deprecated. All modern etcd deployments and this page focus exclusively on the v3 API.
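To make the read guarantees above concrete, here is a small sketch using the Go v3 client (it assumes a reachable cluster and an already constructed client):

```go
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// readLockHolder contrasts the two read modes from the guarantees table.
func readLockHolder(ctx context.Context, cli *clientv3.Client) error {
	// Default Get: linearizable. It reflects every committed write, so it is
	// the safe way to check who currently holds a lock.
	resp, err := cli.Get(ctx, "/locks/inventory/holder")
	if err != nil {
		return err
	}
	for _, kv := range resp.Kvs {
		log.Printf("current holder: %s", kv.Value)
	}

	// Serializable Get: answered by whichever member the client reaches and
	// may be slightly stale. Fine for dashboards, not for lock decisions.
	_, err = cli.Get(ctx, "/locks/inventory/holder", clientv3.WithSerializable())
	return err
}
```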
Unlike Zookeeper's tree structure, etcd uses a flat key-value namespace. Keys are arbitrary byte sequences, and hierarchical organization is achieved through key prefixes and naming conventions.
Key Structure:
```
etcd Namespace (Flat with Prefix Convention):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Key                                           Value
────────────────────────────────────────      ─────────────────────
/locks/inventory/holder                       client-id-123
/locks/order-processing/holder                client-id-456
/locks/payment/queue/00000000000000000001     client-id-789
/locks/payment/queue/00000000000000000002     client-id-abc

/config/database/connection-string            postgresql://...
/config/feature-flags                         {"dark_mode": true}

/services/api-gateway/192.168.1.10            {"port": 8080}
/services/api-gateway/192.168.1.11            {"port": 8080}
/services/worker/192.168.1.20                 {"port": 9000}

Prefix Queries:
- GET /locks/ (prefix) → Returns all lock-related keys
- GET /services/api-gateway/ (prefix) → Returns all api-gateway instances

No Parent-Child Relationship:
- /locks/inventory can exist without /locks existing
- Deleting /locks does NOT delete /locks/inventory
- This differs fundamentally from Zookeeper's tree!
```
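Prefix queries map directly onto the client API. A small sketch, reusing the service-discovery keys from the listing above (client setup as in the earlier snippet):

```go
// listGatewayInstances fetches every key under /services/api-gateway/ with a
// single prefix query.
func listGatewayInstances(ctx context.Context, cli *clientv3.Client) error {
	resp, err := cli.Get(ctx, "/services/api-gateway/", clientv3.WithPrefix())
	if err != nil {
		return err
	}
	for _, kv := range resp.Kvs {
		log.Printf("instance %s -> %s", kv.Key, kv.Value)
	}
	return nil
}
```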
MVCC and Revisions:

etcd v3 uses Multi-Version Concurrency Control (MVCC), maintaining historical versions of keys. Each modification increments a global revision number.
```
MVCC Revision System:
━━━━━━━━━━━━━━━━━━━━━

Global Revision: Monotonically increasing across ALL keys
                 Similar to Zookeeper's zxid

Operation History:
  Rev 1: PUT /config/db = "postgres://v1"
  Rev 2: PUT /locks/x = "client-A"
  Rev 3: PUT /config/db = "postgres://v2"
  Rev 4: DELETE /locks/x
  Rev 5: PUT /locks/x = "client-B"

Key /config/db at:
  - Rev 1: "postgres://v1"
  - Rev 2: "postgres://v1" (no change to this key)
  - Rev 3: "postgres://v2"
  - Current: "postgres://v2"

Key /locks/x at:
  - Rev 1: (does not exist)
  - Rev 2: "client-A" (create_revision=2, mod_revision=2)
  - Rev 3: "client-A"
  - Rev 4: (deleted)
  - Rev 5: "client-B" (create_revision=5, mod_revision=5)

Use for Locks:
- create_revision: When the key was created (changes on delete/recreate)
- mod_revision: When the key was last modified
- Compare create_revision to determine lock ordering!
```

The create_revision of a lock key serves as a natural fencing token. When acquiring a lock, record the create_revision. Include it in all operations on protected resources. The resource can reject operations with stale revisions, protecting against delayed messages from expired lock holders.
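A sketch of that fencing pattern: read your own lock key's create_revision right after acquiring, keep it for the lock's lifetime, and pass it along with every protected operation. The Storage interface here is hypothetical, standing in for whatever system you are protecting:

```go
// Storage is a hypothetical downstream system that accepts a fencing token and
// rejects writes carrying a token lower than the highest one it has seen.
type Storage interface {
	Write(data []byte, fencingToken int64) error
}

// lockToken returns the create_revision of the lock key we created under the
// lock prefix; call it immediately after acquiring and reuse the value.
func lockToken(ctx context.Context, cli *clientv3.Client, myLockKey string) (int64, error) {
	resp, err := cli.Get(ctx, myLockKey)
	if err != nil {
		return 0, err
	}
	if len(resp.Kvs) == 0 {
		return 0, fmt.Errorf("lock key %s missing; lock lost before first use", myLockKey)
	}
	return resp.Kvs[0].CreateRevision, nil
}

// writeProtected hands the token to the resource, which fences out writes from
// expired lock holders whose tokens are older than the current holder's.
func writeProtected(store Storage, data []byte, token int64) error {
	return store.Write(data, token)
}
```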
etcd's lease primitive is the key building block for distributed locks. A lease is a time-bound contract: as long as the lease is alive (via keep-alives), associated keys exist; when the lease expires, all attached keys are automatically deleted.
Lease Lifecycle:
```
LEASE LIFECYCLE:
━━━━━━━━━━━━━━━━

1. CREATE LEASE
   Client → etcd: LeaseGrant(TTL=30s)
   etcd → Client: Lease ID = 1234567890

2. ATTACH KEYS TO LEASE
   Client → etcd: PUT /locks/resource = "client-A" (lease=1234567890)
   Key now bound to lease: expires when lease expires

3. KEEP LEASE ALIVE (Heartbeat)
   Client → etcd: LeaseKeepAlive(1234567890)
   etcd → Client: TTL refreshed to 30s
   (Client must send keep-alives at ~TTL/3 interval)

4. NORMAL RELEASE
   Client → etcd: LeaseRevoke(1234567890)
   Result: Lease deleted, ALL attached keys deleted immediately

5. EXPIRATION (Client Crash)
   Client crashes, stops sending keep-alives
   etcd: Lease 1234567890 TTL countdown...
   After 30s: Lease EXPIRED
   Result: All attached keys (/locks/resource) deleted automatically

MULTIPLE KEYS PER LEASE:
━━━━━━━━━━━━━━━━━━━━━━━━━

Lease 1234567890 (TTL=30s):
  Attached keys:
  - /locks/resource-A = "client-1"
  - /locks/resource-B = "client-1"
  - /session/client-1/heartbeat = timestamp

If client-1 crashes:
  → Lease expires
  → All three keys deleted atomically
  → Resources A and B both released
```

Leases vs. Zookeeper Sessions:
| Aspect | etcd Leases | Zookeeper Sessions |
|---|---|---|
| Granularity | Multiple leases per client possible | One session per client connection |
| Keep-Alive | Explicit keep-alive RPCs per lease | Automatic heartbeats at connection level |
| TTL Control | TTL set per lease at creation | Session timeout negotiated at connect |
| Multiple Resources | One lease can cover multiple locks | Each ephemeral node tied to single session |
| Revocation | Client can revoke lease proactively | Client disconnects; session may survive briefly |
| Visibility | Leases are first-class objects, queryable | Sessions are internal, less visible |
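Because leases are first-class objects (the Visibility row above), you can inspect any lease directly, including its remaining TTL and attached keys. A small sketch, assuming a client and a previously granted lease ID; the fuller example below then walks through the complete lease lifecycle in Go:

```go
// inspectLease prints the remaining TTL and the keys attached to a lease.
func inspectLease(ctx context.Context, cli *clientv3.Client, id clientv3.LeaseID) error {
	resp, err := cli.TimeToLive(ctx, id, clientv3.WithAttachedKeys())
	if err != nil {
		return err
	}
	if resp.TTL == -1 {
		log.Printf("lease %x has already expired", id)
		return nil
	}
	log.Printf("lease %x: %ds remaining of %ds granted, %d attached keys",
		id, resp.TTL, resp.GrantedTTL, len(resp.Keys))
	return nil
}
```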
```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func leaseExample() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Create lease with 30-second TTL
	lease, err := cli.Grant(ctx, 30)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("Created lease with ID: %x, TTL: %d", lease.ID, lease.TTL)

	// Put key with lease attachment
	_, err = cli.Put(ctx, "/locks/my-resource", "holder-id", clientv3.WithLease(lease.ID))
	if err != nil {
		log.Fatal(err)
	}

	// Start keep-alive stream
	keepAliveCh, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		log.Fatal(err)
	}

	// Monitor keep-alive responses
	go func() {
		for ka := range keepAliveCh {
			log.Printf("Lease renewed, TTL: %d", ka.TTL)
		}
		// Channel closed: keep-alives stopped, lease has expired or soon will
		log.Println("Keep-alive channel closed, lease expired!")
	}()

	// Do critical work...
	time.Sleep(60 * time.Second)

	// Explicit release (optional - can also just let lease expire)
	cli.Revoke(ctx, lease.ID)
}
```

When the keep-alive channel returns nil or closes, your lease has expired or will expire soon. You must immediately stop any critical section work. Continuing to modify shared resources after lease expiration violates mutual exclusion—another process may have acquired the lock.
etcd's concurrency library provides a lock implementation that's conceptually similar to Zookeeper's recipe but adapted to etcd's flat key-value model.
The Lock Algorithm:
```
ETCD LOCK ALGORITHM:
━━━━━━━━━━━━━━━━━━━━━

ACQUIRE LOCK on prefix "/locks/resource/":

1. Create a lease (e.g., 30s TTL)

2. PUT a key with unique suffix under the lock prefix:
   Key: /locks/resource/{lease-id}
   Value: (empty or client info)
   Condition: Only if key doesn't exist (using transaction)

3. GET all keys with prefix /locks/resource/
   Sort by create_revision (ascending)

4. If our key has the LOWEST create_revision:
   → We hold the lock! Enter critical section.

5. Otherwise, find the key with the next-lower create_revision:
   WATCH that key for deletion

6. When watch triggers (predecessor deleted):
   → Re-check if we're now lowest (go to step 3)

EXAMPLE:
━━━━━━━━

Initial state: /locks/resource/ prefix is empty

Client A (lease 1001):
  1. PUT /locks/resource/1001 (create_revision = 100)
  2. GET prefix: [/locks/resource/1001]
  3. We're lowest → Lock acquired!

Client B (lease 1002):
  1. PUT /locks/resource/1002 (create_revision = 101)
  2. GET prefix: [/locks/resource/1001, /locks/resource/1002]
  3. 100 < 101 → We're not lowest
  4. Watch /locks/resource/1001 for deletion

Client C (lease 1003):
  1. PUT /locks/resource/1003 (create_revision = 102)
  2. GET prefix: [/locks/resource/1001, /locks/resource/1002, /locks/resource/1003]
  3. Predecessor with lower revision: /locks/resource/1002
  4. Watch /locks/resource/1002 for deletion (NOT 1001!)

Client A releases (lease 1001 revoked):
  → /locks/resource/1001 deleted
  → B's watch triggers
  → B re-checks: [/locks/resource/1002, /locks/resource/1003]
  → B is now lowest → Lock acquired!
  → C is NOT notified (watching 1002, not 1001)
```
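Before reaching for the library, it can help to see those steps spelled out. The sketch below is a condensed, manual version of the acquire path; it is not the concurrency package's actual implementation and it skips retries and lease-loss handling. It assumes a connected cli, an already granted leaseID, and the mvccpb package (go.etcd.io/etcd/api/v3/mvccpb) for event types:

```go
// acquireManually queues under the lock prefix and blocks until our key is the
// oldest one there, returning the key we created.
func acquireManually(ctx context.Context, cli *clientv3.Client, prefix string, leaseID clientv3.LeaseID) (string, error) {
	// Step 2: create our contender key, named after our lease ID.
	myKey := fmt.Sprintf("%s%x", prefix, leaseID)
	put, err := cli.Put(ctx, myKey, "", clientv3.WithLease(leaseID))
	if err != nil {
		return "", err
	}
	myRev := put.Header.Revision // our key's create_revision (assuming it was just created)

	for {
		// Step 3: list all contenders, lowest create_revision first.
		resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix(),
			clientv3.WithSort(clientv3.SortByCreateRevision, clientv3.SortAscend))
		if err != nil {
			return "", err
		}

		// Step 4: if our key is the oldest, we hold the lock.
		if string(resp.Kvs[0].Key) == myKey {
			return myKey, nil
		}

		// Step 5: find our immediate predecessor (largest create_revision below ours).
		var predecessor string
		for _, kv := range resp.Kvs {
			if kv.CreateRevision >= myRev {
				break
			}
			predecessor = string(kv.Key)
		}

		// Step 6: wait until the predecessor key is deleted, then re-check.
		wctx, cancel := context.WithCancel(ctx)
		wch := cli.Watch(wctx, predecessor, clientv3.WithRev(resp.Header.Revision+1))
		for wresp := range wch {
			if wresp.Canceled {
				cancel()
				return "", wresp.Err()
			}
			deleted := false
			for _, ev := range wresp.Events {
				if ev.Type == mvccpb.DELETE {
					deleted = true
				}
			}
			if deleted {
				break
			}
		}
		cancel()
	}
}
```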
Implementation with etcd Concurrency Library:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func etcdLockExample() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Create a session (wraps lease management)
	// TTL: 10 seconds, auto-renewed via keep-alives
	session, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Create mutex on a given prefix
	mutex := concurrency.NewMutex(session, "/locks/my-critical-resource/")

	// Acquire lock (blocks until acquired)
	ctx := context.Background()
	if err := mutex.Lock(ctx); err != nil {
		log.Fatal(err)
	}
	fmt.Println("Lock acquired!")
	fmt.Printf("Lock key: %s\n", mutex.Key()) // The actual key created

	// CRITICAL SECTION
	// Only one process with this lock prefix executes here
	doExclusiveWork()

	// Release lock
	if err := mutex.Unlock(ctx); err != nil {
		log.Fatal(err)
	}
	fmt.Println("Lock released!")
}

// With timeout
func etcdLockWithTimeout(session *concurrency.Session) {
	// Client and session are created as in etcdLockExample above.
	mutex := concurrency.NewMutex(session, "/locks/resource/")

	// Context with timeout
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	if err := mutex.Lock(ctx); err != nil {
		if err == context.DeadlineExceeded {
			log.Println("Could not acquire lock within timeout")
			return
		}
		log.Fatal(err)
	}
	defer mutex.Unlock(context.Background())

	// Critical section...
}
```

The concurrency.Session type wraps lease creation and keep-alive management. It automatically handles keep-alive failures and provides a Done() channel that closes when the session is lost. Always use Session for locks rather than managing leases manually.
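One way to act on that Done() channel is to check it between small, ideally idempotent units of work inside the critical section. A sketch of that pattern (the work slice is a stand-in for your own steps):

```go
// doGuardedWork runs each step only while the lock's session is still alive,
// aborting as soon as the underlying lease is lost.
func doGuardedWork(ctx context.Context, session *concurrency.Session, work []func() error) error {
	for _, step := range work {
		select {
		case <-session.Done():
			// The lease behind this session is gone; another process may now
			// hold the lock, so stop touching shared state immediately.
			return errors.New("session lost, aborting critical section")
		case <-ctx.Done():
			return ctx.Err()
		default:
			if err := step(); err != nil {
				return err
			}
		}
	}
	return nil
}
```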
etcd v3 provides powerful transactions (Txn) that enable atomic compare-and-set operations. This is crucial for implementing safe lock acquisition.
Transaction Structure:
```
ETCD TRANSACTION (Txn):
━━━━━━━━━━━━━━━━━━━━━━━

Structure:
  IF (conditions)
  THEN (operations)
  ELSE (operations)

All conditions are evaluated atomically.
If ALL conditions pass: THEN operations execute.
If ANY condition fails: ELSE operations execute.

Condition Types:
  - Compare(Key).Version() == 0     // Key doesn't exist
  - Compare(Key).CreateRevision()   // When key was created
  - Compare(Key).ModRevision()      // When key was last modified
  - Compare(Key).Value()            // Key's current value
  - Compare(Key).Lease()            // Key's attached lease

Operations:
  - Put(key, value, opts...)
  - Delete(key)
  - Get(key)                        // Returns value in response
```

Safe Lock Acquisition with Transactions:
The concurrency library uses transactions internally, but understanding them helps debug issues and implement custom patterns:
```go
// Atomic lock acquisition with transaction
func acquireLockWithTxn(cli *clientv3.Client, lockKey string, leaseID clientv3.LeaseID) (bool, error) {
	ctx := context.Background()

	// Transaction: Only create lock key if it doesn't exist
	txn := cli.Txn(ctx).If(
		// Condition: Key doesn't exist (version == 0)
		clientv3.Compare(clientv3.Version(lockKey), "=", 0),
	).Then(
		// If condition passes: Create the lock key
		clientv3.OpPut(lockKey, "holder-info", clientv3.WithLease(leaseID)),
	).Else(
		// If condition fails: Get current holder info
		clientv3.OpGet(lockKey),
	)

	resp, err := txn.Commit()
	if err != nil {
		return false, err
	}

	if resp.Succeeded {
		// We created the lock key, we hold the lock
		return true, nil
	}

	// Lock already held by someone else
	// resp.Responses[0].GetResponseRange() contains current holder
	return false, nil
}

// Safe lock release: Only delete if we're still the holder
func releaseLockSafely(cli *clientv3.Client, lockKey string, expectedLeaseID clientv3.LeaseID) error {
	ctx := context.Background()

	txn := cli.Txn(ctx).If(
		// Condition: Key exists AND has our lease
		clientv3.Compare(clientv3.LeaseValue(lockKey), "=", expectedLeaseID),
	).Then(
		// Only delete if we still hold it
		clientv3.OpDelete(lockKey),
	)

	_, err := txn.Commit()
	return err
}
```

Without transactions, a race condition exists between checking lock status and updating it. Transactions make the check-then-act atomic. etcd's transaction processing is still linearizable—each transaction sees a consistent snapshot and applies atomically.
Like Zookeeper, etcd provides a watch mechanism for receiving notifications when keys change. This enables efficient lock queue management without polling.
Watch Characteristics:

- Watches are continuous streams over gRPC: a single watch keeps delivering events until you cancel it (unlike Zookeeper's one-shot triggers)
- Events arrive in order, each tagged with the revision at which it occurred
- A watch can start from a historical revision, so no events are missed between a read and the watch setup
- Watches can cover a single key or an entire prefix
- Progress notifications let long-lived watchers confirm they are still current
```go
// Requires: "go.etcd.io/etcd/api/v3/mvccpb" for the event type constants.

func waitForLockRelease(ctx context.Context, cli *clientv3.Client, predecessorKey string) error {
	// Watch the predecessor key for deletion
	watchCh := cli.Watch(ctx, predecessorKey)

	for watchResp := range watchCh {
		if watchResp.Canceled {
			return fmt.Errorf("watch canceled: %v", watchResp.Err())
		}
		for _, event := range watchResp.Events {
			if event.Type == mvccpb.DELETE {
				// Predecessor key deleted, we may now hold the lock
				log.Printf("Predecessor %s deleted, checking lock status", predecessorKey)
				return nil
			}
			// Other events (PUT) on this key - predecessor is still holding
		}
	}
	return fmt.Errorf("watch channel closed unexpectedly")
}

// Watch with starting revision (ensure no missed events)
func watchFromRevision(cli *clientv3.Client, key string, startRev int64) {
	ctx := context.Background()

	watchCh := cli.Watch(ctx, key,
		clientv3.WithRev(startRev),     // Start from specific revision
		clientv3.WithPrevKV(),          // Include previous value in events
		clientv3.WithProgressNotify(),  // Periodic progress notifications
	)

	for watchResp := range watchCh {
		if watchResp.IsProgressNotify() {
			log.Printf("Watch progress: revision %d", watchResp.Header.Revision)
			continue
		}
		for _, event := range watchResp.Events {
			log.Printf("Event: Type=%s, Key=%s, Rev=%d",
				event.Type, event.Kv.Key, event.Kv.ModRevision)
		}
	}
}
```

When setting up a watch for lock waiting, note the current revision from your GET response. Watch from that revision + 1 to ensure you don't miss events that occur between your GET and watch setup. The concurrency library handles this correctly.
Both etcd and Zookeeper provide strong consistency and can implement distributed locks. Understanding their differences helps you choose the right tool.
| Aspect | etcd | Zookeeper |
|---|---|---|
| Consensus | Raft | ZAB (Zookeeper Atomic Broadcast) |
| Data Model | Flat key-value with MVCC | Hierarchical tree (znodes) |
| API | gRPC (binary, efficient) | Custom protocol, many client libraries |
| Watch Model | Continuous streams | One-shot triggers |
| Session Management | Explicit leases per key/group | Connection-based sessions |
| Version | Create/Mod revisions per key | Version + cversion + aversion per znode |
| History | Full MVCC history (until compacted) | Current state only |
| Language | Go (single binary) | Java (JVM required) |
| Kubernetes Integration | Native (used by K8s) | Possible but requires extra setup |
When to Choose etcd:

- You run on Kubernetes or other cloud-native infrastructure where etcd is already familiar territory
- You want a single Go binary with no JVM to operate
- You prefer a modern gRPC API, MVCC history, and lease-based key lifecycles
- You are starting fresh, with no existing Zookeeper investment

When to Choose Zookeeper:

- You already operate Zookeeper for systems such as Kafka or Hadoop
- Your team has deep Zookeeper operational expertise
- You depend on mature recipes and client libraries such as Apache Curator
- The hierarchical znode model naturally fits your coordination data
For distributed locking specifically, both etcd and Zookeeper provide equivalent guarantees: linearizable lock grants, automatic cleanup on holder failure, and fair ordering. The choice between them is more about operational preferences, existing infrastructure, and team expertise than lock feature differences.
We've explored etcd's architecture and lock implementation in depth. Let's consolidate the key insights:

- etcd is a Raft-based key-value store: linearizable writes give lock grants a single, unambiguous order
- Leases tie keys to client liveness: keep-alives keep them alive, and expiry deletes every attached key automatically
- The lock recipe queues contenders under a prefix, ordered by create_revision, with each waiter watching only its immediate predecessor
- The create_revision of the lock key doubles as a fencing token for protected resources
- Transactions (Txn) make check-then-act operations atomic, both for acquisition and for safe release
- etcd and Zookeeper offer equivalent lock guarantees; choose based on operations, ecosystem, and team expertise
What's Next:
We've now explored two consensus-based coordination systems (Zookeeper and etcd) that provide strong lock guarantees. The final page examines a fundamentally different approach: Redis Redlock—an algorithm that attempts to provide distributed locks using multiple independent Redis instances without a consensus protocol, and the controversy that surrounds it.
You now understand how etcd provides distributed locks: the Raft-based architecture, the lease mechanism for lifecycle management, the prefix-based lock queue, and how it compares to Zookeeper. Next, we'll explore Redis Redlock—a controversial alternative that trades off consistency guarantees for simpler infrastructure.