Every time you deploy a container to Kubernetes, schedule a workload, or update a ConfigMap, you're interacting with etcd — even if you've never heard of it. etcd is the distributed key-value store that serves as the brain of Kubernetes, storing all cluster state: every Pod definition, every Service, every Secret, every ConfigMap.
But etcd isn't just a Kubernetes implementation detail. It's a standalone distributed coordination service designed from the ground up for the cloud-native era. Created by CoreOS in 2013 (now maintained by the Cloud Native Computing Foundation), etcd represents a different design philosophy than ZooKeeper — one that prioritizes simplicity, performance, and ease of operation.
By the end of this page, you will understand etcd's flat key-value data model, how it uses Raft for consensus, the role of leases (etcd's equivalent of ephemeral nodes), its powerful watch mechanism, and why it became the coordination backbone for Kubernetes. You'll also understand when to choose etcd over ZooKeeper.
etcd (pronounced "et-see-dee", from the Unix /etc directory + "d" for distributed) is a strongly consistent, distributed key-value store designed for coordination workloads. It provides a reliable way to store data across a cluster of machines with automatic failover and leader election.
etcd's design philosophy differs fundamentally from ZooKeeper:
| Aspect | etcd | ZooKeeper |
|---|---|---|
| Data Model | Flat key-value with prefix queries | Hierarchical tree (znodes) |
| Consensus Protocol | Raft (understandable by design) | ZAB (similar to Paxos) |
| API | gRPC + HTTP/JSON | Custom binary protocol |
| Ephemeral Data | Leases (can span multiple keys) | Ephemeral znodes (per-node basis) |
| Watch Model | Continuous streaming watches | One-shot watches |
| Language | Go | Java |
| Deployment | Static binary, no dependencies | JVM required |
| Configuration | YAML/CLI, minimal | Properties files, more complex |
Why etcd Was Created
When CoreOS started building container orchestration tools in 2013, they needed a coordination service. ZooKeeper existed but had drawbacks for their use case: it required a JVM runtime, spoke a custom binary protocol, and carried configuration and operational complexity that fit poorly with minimal container hosts.
etcd was designed to address these concerns while providing the same fundamental guarantees: strong consistency, high availability, and reliable coordination primitives.
etcd's use of Raft isn't just an implementation choice — it's a core value proposition. Raft was designed to be understandable, making etcd easier to debug, operate, and reason about. When things go wrong (and they do), understanding your consensus algorithm helps you recover faster.
Unlike ZooKeeper's hierarchical namespace, etcd uses a flat key-value model with byte-string keys and values. This simplicity is deliberate: it reduces conceptual overhead and enables simpler implementation.
However, etcd provides powerful prefix-based operations that enable hierarchical organization through convention rather than enforcement.
```shell
# etcd uses flat keys, but conventions create logical hierarchy
# Keys are byte strings; these examples use human-readable paths

# Service registry pattern
/services/user-service/instances/10.0.0.1:8080
/services/user-service/instances/10.0.0.2:8080
/services/user-service/instances/10.0.0.3:8080
/services/payment-service/instances/10.0.1.1:8080
/services/payment-service/instances/10.0.1.2:8080

# Configuration pattern
/config/database/connection-string
/config/database/pool-size
/config/features/new-checkout-enabled
/config/features/dark-mode-enabled

# Leader election pattern
/election/cluster-leader/candidate-001
/election/cluster-leader/candidate-002

# Distributed lock pattern
/locks/payment-processing/lock

# Prefix query: get all user-service instances
etcdctl get /services/user-service/instances/ --prefix

# Range query: get all config between keys
etcdctl get /config/database/a /config/database/z
```

Key Design Considerations
/a/b/c is just a string: deleting /a doesn't delete /a/b. Unlike ZooKeeper, etcd's slash-separated keys don't form a true hierarchy. The key /services/myapp has no special relationship to /services/myapp/config. You can create the latter without the former, and deleting the former doesn't affect the latter. This is simpler but requires different thinking.
Revisions: The Versioning System
etcd maintains a global revision number that increments with every write operation. This is similar to ZooKeeper's zxid but more central to etcd's design:
```shell
# Put a key and see the revision
$ etcdctl put /mykey "value1"
OK

$ etcdctl get /mykey -w json | jq
{
  "header": { "revision": 42 },
  "kvs": [{
    "key": "L215a2V5",        # base64 encoded
    "create_revision": 42,    # Revision when created
    "mod_revision": 42,       # Revision when last modified
    "version": 1,             # Number of modifications
    "value": "dmFsdWUx"       # base64 encoded
  }]
}

# Update the key
$ etcdctl put /mykey "value2"

$ etcdctl get /mykey -w json | jq '.kvs[0] | {mod_revision, version}'
{
  "mod_revision": 43,
  "version": 2
}

# Get historical value at revision 42
$ etcdctl get /mykey --rev=42
/mykey
value1
```

etcd runs as a cluster of nodes (typically 3 or 5), using the Raft consensus protocol to maintain consistent state across all members. Understanding etcd's architecture helps you operate it effectively and debug issues.
Node Roles in Raft

At any moment each etcd member is in one of three Raft roles: a single leader, which accepts all writes and replicates them; followers, which replicate the leader's log and vote in elections; and candidates, a transient role a follower enters when it suspects the leader has failed and starts an election.
Write Path

Every write goes through the leader: the request is appended to the leader's Raft log, replicated to followers, and committed once a quorum (a majority of members) has persisted it. Only then is the write applied to the key-value store and acknowledged to the client.
Read Path and Consistency Levels
etcd provides three read consistency levels:
Serializable: Reads can be served by any node from its local state, so they may return stale data. Lower latency, higher throughput.
Linearizable (the default): Guarantees you see all writes that completed before the read started. Requires coordination with the leader.
Revision-based: Read as of a specific revision. Useful for implementing consistent snapshots.
For coordination use cases, linearizable reads are usually necessary to avoid race conditions.
```shell
# Serializable read (may be stale, but fast)
$ etcdctl get /mykey --consistency=s

# Linearizable read (always up-to-date, may be slower)
$ etcdctl get /mykey --consistency=l

# Read at specific revision
$ etcdctl get /mykey --rev=42
```

```go
// In the Go client
resp, err := client.Get(ctx, "/mykey", clientv3.WithSerializable()) // Don't require leader confirmation
resp, err := client.Get(ctx, "/mykey")                              // Linearizable (the default)
```

For read-heavy workloads, consider 'lease reads' (not to be confused with key leases): the leader confirms its leadership once, then serves reads directly for a time window, providing linearizable-like guarantees with better performance.
While ZooKeeper has ephemeral znodes tied to session lifetime, etcd uses leases: time-bounded tokens that can be attached to multiple keys. When a lease expires (or is revoked), all keys attached to it are deleted.
This design is more flexible than ZooKeeper's approach: a single lease can back many keys, and its TTL is managed explicitly rather than being tied to a client session.
```shell
# Create a lease with 60-second TTL
$ etcdctl lease grant 60
lease 694d7c8c9b6c6b0a granted with TTL(60s)

# Attach keys to the lease
$ etcdctl put /services/myapp/instance-1 "10.0.0.1:8080" --lease=694d7c8c9b6c6b0a
$ etcdctl put /services/myapp/instance-1/health "OK" --lease=694d7c8c9b6c6b0a

# Both keys share the same lease
# If we stop keeping the lease alive, both disappear after 60s

# Keep the lease alive (run this continuously in your app)
$ etcdctl lease keep-alive 694d7c8c9b6c6b0a
lease 694d7c8c9b6c6b0a keepalived with TTL(60)
lease 694d7c8c9b6c6b0a keepalived with TTL(60)
...

# Revoke the lease manually (both keys deleted immediately)
$ etcdctl lease revoke 694d7c8c9b6c6b0a
lease 694d7c8c9b6c6b0a revoked

# Check remaining TTL
$ etcdctl lease timetolive 694d7c8c9b6c6b0a
lease 694d7c8c9b6c6b0a granted with TTL(60s), remaining(45s)
```

Lease Patterns for Service Registration
The typical pattern for service registration:

1. Grant a lease with a TTL matched to your failure-detection needs
2. Put the registration key with the lease attached
3. Call LeaseKeepAlive continuously while the instance is healthy
```go
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// hostname identifies this instance; in practice use os.Hostname()
var hostname = "instance-1"

func registerService(client *clientv3.Client) error {
	ctx := context.Background()

	// Step 1: Grant a lease with 30-second TTL
	lease, err := client.Grant(ctx, 30)
	if err != nil {
		return err
	}

	// Step 2: Create registration key with lease
	_, err = client.Put(ctx,
		"/services/myapp/instances/"+hostname,
		"10.0.0.1:8080",
		clientv3.WithLease(lease.ID),
	)
	if err != nil {
		return err
	}

	// Step 3: Keep the lease alive
	// KeepAlive returns a channel that sends keep-alive responses
	keepAliveChan, err := client.KeepAlive(ctx, lease.ID)
	if err != nil {
		return err
	}

	// Process keep-alive responses in background
	go func() {
		for {
			select {
			case resp := <-keepAliveChan:
				if resp == nil {
					// Lease expired or revoked
					log.Println("Lease expired, re-registering...")
					registerService(client) // Re-register
					return
				}
				// Successfully kept alive
				log.Printf("Lease %x kept alive, TTL: %d", resp.ID, resp.TTL)
			}
		}
	}()

	return nil
}

func discoverServices(client *clientv3.Client) ([]string, error) {
	ctx := context.Background()

	// Get all instances under the prefix
	resp, err := client.Get(ctx, "/services/myapp/instances/",
		clientv3.WithPrefix(),
	)
	if err != nil {
		return nil, err
	}

	instances := make([]string, len(resp.Kvs))
	for i, kv := range resp.Kvs {
		instances[i] = string(kv.Value)
	}
	return instances, nil
}
```

Choose a lease TTL with enough slack over your keep-alive interval to absorb network hiccups: with a 30s TTL and keep-alives every 10s, two keep-alives can be missed before expiry. Too short a TTL causes false positives; too long delays failure detection.
etcd's watch mechanism improves on ZooKeeper's one-shot watches with continuous, streaming watches. Once you create a watch, it stays active and delivers all subsequent changes — no re-registration needed.
This design eliminates the watch re-registration race condition that's common in ZooKeeper programming.
Watch Features

etcd watches are persistent streams that can cover a single key or an entire prefix, replay history from any uncompacted revision, filter by event type, and send periodic progress notifications.
```go
package main

import (
	"context"
	"fmt"

	rpctypes "go.etcd.io/etcd/api/v3/v3rpc/rpctypes"
	clientv3 "go.etcd.io/etcd/client/v3"
)

// lastSeenRevision is persisted across restarts so watches can resume
var lastSeenRevision int64

func watchExamples(client *clientv3.Client) {
	ctx := context.Background()

	// Watch a single key
	watchChan := client.Watch(ctx, "/config/feature-flag")

	// Watch all keys with prefix
	watchChan = client.Watch(ctx, "/services/myapp/", clientv3.WithPrefix())

	// Watch from a specific revision (useful after restart)
	watchChan = client.Watch(ctx, "/services/myapp/",
		clientv3.WithPrefix(),
		clientv3.WithRev(lastSeenRevision+1))

	// Watch with filters (only deletions)
	watchChan = client.Watch(ctx, "/services/myapp/",
		clientv3.WithPrefix(),
		clientv3.WithFilterPut()) // Filter OUT puts, see only deletes

	// Process watch events
	for watchResp := range watchChan {
		// Check for errors
		if watchResp.Err() != nil {
			handleWatchError(watchResp.Err())
			continue
		}

		// Check if this is a progress notification
		if watchResp.IsProgressNotify() {
			// No actual events, just confirming liveness
			continue
		}

		// Process events
		for _, event := range watchResp.Events {
			switch event.Type {
			case clientv3.EventTypePut:
				fmt.Printf("PUT: %s -> %s (mod_rev: %d)\n",
					event.Kv.Key, event.Kv.Value, event.Kv.ModRevision)
			case clientv3.EventTypeDelete:
				fmt.Printf("DELETE: %s (mod_rev: %d)\n",
					event.Kv.Key, event.Kv.ModRevision)
			}
		}

		// Save revision for restart recovery
		lastSeenRevision = watchResp.Header.Revision
	}
}

func handleWatchError(err error) {
	if err == context.Canceled {
		// Normal cancellation
		return
	}

	// Check if it's a compaction error
	if err.Error() == rpctypes.ErrCompacted.Error() {
		// History was compacted, need to resync from current state
		resyncFromCurrentState()
	}
}

// resyncFromCurrentState re-reads the full state and restarts the watch
// from the current revision (implementation omitted)
func resyncFromCurrentState() {}
```

etcd compacts old revisions to save space. If your watch tries to start from a compacted revision, it fails with ErrCompacted. Handle this by reading the current state (full sync) and starting your watch from the current revision. This is the 'list then watch' pattern used by Kubernetes.
| Feature | etcd | ZooKeeper |
|---|---|---|
| Persistence | Continuous (stays active) | One-shot (removed after firing) |
| Re-registration | Not needed | Required after every event |
| Historical replay | Yes, from any revision | No, only current state |
| Prefix/range | Built-in | Requires watching parent |
| Protocol | gRPC streaming | TCP session |
| Delivery | Ordered, guaranteed | Ordered, guaranteed |
etcd provides mini-transactions (the Txn API) that enable atomic read-modify-write operations across multiple keys, plus a higher-level STM (software transactional memory) abstraction built on top of them. This is essential for implementing safe coordination patterns without race conditions.
A transaction has three components: a list of guard comparisons (If), operations executed when all comparisons succeed (Then), and operations executed otherwise (Else).
```go
package main

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

// myIdentity, lease, updatedAlice, updatedBob, parseBalance, and
// formatBalance are assumed to be defined elsewhere in the program.

func transactionExamples(client *clientv3.Client) {
	ctx := context.Background()

	// Compare-And-Swap (CAS): Only set if current value matches
	_, err := client.Txn(ctx).
		If(clientv3.Compare(clientv3.Value("/config/version"), "=", "v1")).
		Then(clientv3.OpPut("/config/version", "v2")).
		Else(clientv3.OpGet("/config/version")).
		Commit()

	// Create-If-Not-Exists: Acquire lock only if key doesn't exist
	resp, err := client.Txn(ctx).
		If(clientv3.Compare(clientv3.CreateRevision("/locks/mylock"), "=", 0)).
		Then(clientv3.OpPut("/locks/mylock", myIdentity, clientv3.WithLease(lease.ID))).
		Else(clientv3.OpGet("/locks/mylock")).
		Commit()

	if resp.Succeeded {
		fmt.Println("Lock acquired!")
	} else {
		holder := string(resp.Responses[0].GetResponseRange().Kvs[0].Value)
		fmt.Printf("Lock held by: %s\n", holder)
	}

	// Multi-key atomic update
	_, err = client.Txn(ctx).
		If(
			clientv3.Compare(clientv3.ModRevision("/users/alice"), ">", 0),
			clientv3.Compare(clientv3.ModRevision("/users/bob"), ">", 0),
		).
		Then(
			clientv3.OpPut("/users/alice", updatedAlice),
			clientv3.OpPut("/users/bob", updatedBob),
			clientv3.OpPut("/audit/log", "updated alice and bob"),
		).
		Commit()
	_ = err
}

// Software Transactional Memory (STM) pattern for complex logic
func stmExample(client *clientv3.Client) {
	// Using the concurrency package for higher-level STM
	concurrency.NewSTM(client, func(stm concurrency.STM) error {
		// Read multiple keys
		balance1 := stm.Get("/accounts/alice/balance")
		balance2 := stm.Get("/accounts/bob/balance")

		// Perform logic
		amount := 100
		newBalance1 := parseBalance(balance1) - amount
		newBalance2 := parseBalance(balance2) + amount

		// Write multiple keys - all or nothing
		stm.Put("/accounts/alice/balance", formatBalance(newBalance1))
		stm.Put("/accounts/bob/balance", formatBalance(newBalance2))
		return nil // Commit
	})
	// STM automatically retries on conflicts
}
```

Comparisons can target several key properties:

- Value("/key") = "expected"
- Version("/key") = 5
- CreateRevision("/key") = 0
- ModRevision("/key") > lastSeen
- Lease("/key") = leaseID

etcd transactions have limits: by default, max 128 operations per transaction and 1.5MB total request size. For larger coordination needs, break into multiple transactions with application-level conflict resolution, or use the STM (Software Transactional Memory) abstraction which handles retries automatically.
etcd's most prominent use is as Kubernetes' backing store. Every Kubernetes object — Pods, Deployments, Services, ConfigMaps, Secrets — is stored as a key-value pair in etcd. Understanding this relationship helps you operate Kubernetes clusters effectively.
How Kubernetes Uses etcd
Object Storage: Each Kubernetes resource is serialized (usually as Protocol Buffers) and stored at a predictable key path:
```shell
/registry/pods/default/my-pod
/registry/deployments/kube-system/coredns
/registry/secrets/production/database-credentials
```

Watch-Based Coordination: Controllers (Deployment controller, ReplicaSet controller, etc.) watch etcd for changes to resources they manage. This enables the declarative, reconciliation-based model.
Leader Election: Kubernetes components like kube-scheduler and kube-controller-manager use etcd-based leader election when running in HA mode.
Lease-Based Health: Components report their health through etcd entries with leases, enabling automatic failure detection.
Note that only the API Server talks directly to etcd. All other components (scheduler, controllers, kubelet) go through the API Server. This centralizes authentication, authorization, validation, and admission control. Never let applications access etcd directly in a Kubernetes deployment.
Watch the etcd_server_slow_apply_total metric to spot an etcd server that is struggling to keep up with writes.

We've covered etcd's architecture, data model, and its role in modern infrastructure.
What's Next:
In the next page, we'll explore Consul, which takes a different approach by combining service mesh, service discovery, and key-value storage in one platform. You'll learn how Consul compares to etcd and ZooKeeper and when its integrated approach is advantageous.
You now understand etcd's key-value model, Raft-based consensus, leases, watches, and its role as Kubernetes' backbone. This foundation will help you compare coordination services and make informed architectural decisions.