Every time you deploy a container to Kubernetes, schedule a workload, or update a ConfigMap, you're interacting with etcd — even if you've never heard of it. etcd is the distributed key-value store that serves as the brain of Kubernetes, storing all cluster state: every Pod definition, every Service, every Secret, every ConfigMap.
But etcd isn't just a Kubernetes implementation detail. It's a standalone distributed coordination service designed from the ground up for the cloud-native era. Created by CoreOS in 2013 (now maintained by the Cloud Native Computing Foundation), etcd represents a different design philosophy than ZooKeeper — one that prioritizes simplicity, performance, and ease of operation.
By the end of this page, you will understand etcd's flat key-value data model, how it uses Raft for consensus, the role of leases (etcd's equivalent of ephemeral nodes), its powerful watch mechanism, and why it became the coordination backbone for Kubernetes. You'll also understand when to choose etcd over ZooKeeper.
etcd (pronounced "et-see-dee", from the Unix /etc directory + "d" for distributed) is a strongly consistent, distributed key-value store designed for coordination workloads. It provides a reliable way to store data across a cluster of machines with automatic failover and leader election.
etcd's design philosophy differs fundamentally from ZooKeeper:
| Aspect | etcd | ZooKeeper |
|---|---|---|
| Data Model | Flat key-value with prefix queries | Hierarchical tree (znodes) |
| Consensus Protocol | Raft (understandable by design) | ZAB (similar to Paxos) |
| API | gRPC + HTTP/JSON | Custom binary protocol |
| Ephemeral Data | Leases (can span multiple keys) | Ephemeral znodes (per-node basis) |
| Watch Model | Continuous streaming watches | One-shot watches |
| Language | Go | Java |
| Deployment | Static binary, no dependencies | JVM required |
| Configuration | YAML/CLI, minimal | Properties files, more complex |
Why etcd Was Created
When CoreOS started building container orchestration tools in 2013, they needed a coordination service. ZooKeeper existed but had drawbacks for their use case: it required a JVM runtime, spoke a custom binary protocol, and carried configuration and operational complexity that fit poorly with minimal container hosts.
etcd was designed to address these concerns while providing the same fundamental guarantees: strong consistency, high availability, and reliable coordination primitives.
etcd's use of Raft isn't just an implementation choice — it's a core value proposition. Raft was designed to be understandable, making etcd easier to debug, operate, and reason about. When things go wrong (and they do), understanding your consensus algorithm helps you recover faster.
Unlike ZooKeeper's hierarchical namespace, etcd uses a flat key-value model with byte-string keys and values. This simplicity is deliberate: it reduces conceptual overhead and enables simpler implementation.
However, etcd provides powerful prefix-based operations that enable hierarchical organization through convention rather than enforcement.
```shell
# etcd uses flat keys, but conventions create logical hierarchy
# Keys are byte strings; these examples use human-readable paths

# Service registry pattern
/services/user-service/instances/10.0.0.1:8080
/services/user-service/instances/10.0.0.2:8080
/services/user-service/instances/10.0.0.3:8080
/services/payment-service/instances/10.0.1.1:8080
/services/payment-service/instances/10.0.1.2:8080

# Configuration pattern
/config/database/connection-string
/config/database/pool-size
/config/features/new-checkout-enabled
/config/features/dark-mode-enabled

# Leader election pattern
/election/cluster-leader/candidate-001
/election/cluster-leader/candidate-002

# Distributed lock pattern
/locks/payment-processing/lock

# Prefix query: get all user-service instances
etcdctl get /services/user-service/instances/ --prefix

# Range query: get all config between keys
etcdctl get /config/database/a /config/database/z
```

Key Design Considerations
/a/b/c is just a string: deleting /a doesn't delete /a/b. Unlike ZooKeeper, etcd's slash-separated keys don't form a true hierarchy. The key /services/myapp has no special relationship to /services/myapp/config. You can create the latter without the former, and deleting the former doesn't affect the latter. This is simpler but requires different thinking.
Revisions: The Versioning System
etcd maintains a global revision number that increments with every write operation. This is similar to ZooKeeper's zxid but more central to etcd's design:
```shell
# Put a key and see the revision
$ etcdctl put /mykey "value1"
OK

$ etcdctl get /mykey -w json | jq
{
  "header": { "revision": 42 },
  "kvs": [{
    "key": "L215a2V5",        # base64 encoded
    "create_revision": 42,    # Revision when created
    "mod_revision": 42,       # Revision when last modified
    "version": 1,             # Number of modifications
    "value": "dmFsdWUx"       # base64 encoded
  }]
}

# Update the key
$ etcdctl put /mykey "value2"

$ etcdctl get /mykey -w json | jq '.kvs[0] | {mod_revision, version}'
{
  "mod_revision": 43,
  "version": 2
}

# Get historical value at revision 42
$ etcdctl get /mykey --rev=42
/mykey
value1
```

etcd runs as a cluster of nodes (typically 3 or 5), using the Raft consensus protocol to maintain consistent state across all members. Understanding etcd's architecture helps you operate it effectively and debug issues.
Node Roles in Raft

At any moment each etcd member is in one of three Raft roles: a single leader, which accepts all writes and replicates them; followers, which replicate the leader's log and vote in elections; and candidates, a transient role a follower enters when it suspects the leader has failed and starts an election.
Write Path

Every write goes through the leader: the request is appended to the leader's Raft log, replicated to followers, and committed once a quorum (a majority of members) has persisted it. Only then is the write applied to the key-value store and acknowledged to the client.
Read Path and Consistency Levels
etcd provides three read consistency levels:
Serializable: Reads can be served by any node from its local state, so they may return stale data. Lower latency, higher throughput.
Linearizable (the default): Guarantees you see all writes that completed before the read started. Requires coordination with the leader.
Revision-based: Read as of a specific revision. Useful for implementing consistent snapshots.
For coordination use cases, linearizable reads are usually necessary to avoid race conditions.
```shell
# Serializable read (may be stale, but fast)
$ etcdctl get /mykey --consistency=s

# Linearizable read (always up-to-date, may be slower)
$ etcdctl get /mykey --consistency=l

# Read at specific revision
$ etcdctl get /mykey --rev=42
```

```go
// In the Go client
resp, err := client.Get(ctx, "/mykey", clientv3.WithSerializable()) // Don't require leader confirmation
resp, err := client.Get(ctx, "/mykey")                              // Linearizable (the default)
```

For read-heavy workloads, consider 'lease reads' (not to be confused with key leases): the leader confirms its leadership once, then serves reads directly for a time window, providing linearizable-like guarantees with better performance.
While ZooKeeper has ephemeral znodes tied to session lifetime, etcd uses leases: time-bounded tokens that can be attached to multiple keys. When a lease expires (or is revoked), all keys attached to it are deleted.
This design is more flexible than ZooKeeper's approach: a single lease can back many keys, and its TTL is managed explicitly rather than being tied to a client session.
```shell
# Create a lease with 60-second TTL
$ etcdctl lease grant 60
lease 694d7c8c9b6c6b0a granted with TTL(60s)

# Attach keys to the lease
$ etcdctl put /services/myapp/instance-1 "10.0.0.1:8080" --lease=694d7c8c9b6c6b0a
$ etcdctl put /services/myapp/instance-1/health "OK" --lease=694d7c8c9b6c6b0a

# Both keys share the same lease
# If we stop keeping the lease alive, both disappear after 60s

# Keep the lease alive (run this continuously in your app)
$ etcdctl lease keep-alive 694d7c8c9b6c6b0a
lease 694d7c8c9b6c6b0a keepalived with TTL(60)
lease 694d7c8c9b6c6b0a keepalived with TTL(60)
...

# Revoke the lease manually (both keys deleted immediately)
$ etcdctl lease revoke 694d7c8c9b6c6b0a
lease 694d7c8c9b6c6b0a revoked

# Check remaining TTL
$ etcdctl lease timetolive 694d7c8c9b6c6b0a
lease 694d7c8c9b6c6b0a granted with TTL(60s), remaining(45s)
```

Lease Patterns for Service Registration
The typical pattern for service registration:

1. Grant a lease with a TTL matched to your failure-detection needs
2. Put the registration key with the lease attached
3. Call LeaseKeepAlive continuously while the instance is healthy
```go
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// hostname identifies this instance; in practice use os.Hostname()
var hostname = "instance-1"

func registerService(client *clientv3.Client) error {
	ctx := context.Background()

	// Step 1: Grant a lease with 30-second TTL
	lease, err := client.Grant(ctx, 30)
	if err != nil {
		return err
	}

	// Step 2: Create registration key with lease
	_, err = client.Put(ctx,
		"/services/myapp/instances/"+hostname,
		"10.0.0.1:8080",
		clientv3.WithLease(lease.ID),
	)
	if err != nil {
		return err
	}

	// Step 3: Keep the lease alive
	// KeepAlive returns a channel that sends keep-alive responses
	keepAliveChan, err := client.KeepAlive(ctx, lease.ID)
	if err != nil {
		return err
	}

	// Process keep-alive responses in background
	go func() {
		for {
			select {
			case resp := <-keepAliveChan:
				if resp == nil {
					// Lease expired or revoked
					log.Println("Lease expired, re-registering...")
					registerService(client) // Re-register
					return
				}
				// Successfully kept alive
				log.Printf("Lease %x kept alive, TTL: %d", resp.ID, resp.TTL)
			}
		}
	}()

	return nil
}

func discoverServices(client *clientv3.Client) ([]string, error) {
	ctx := context.Background()

	// Get all instances under the prefix
	resp, err := client.Get(ctx, "/services/myapp/instances/",
		clientv3.WithPrefix(),
	)
	if err != nil {
		return nil, err
	}

	instances := make([]string, len(resp.Kvs))
	for i, kv := range resp.Kvs {
		instances[i] = string(kv.Value)
	}
	return instances, nil
}
```

Choose a lease TTL with enough slack over your keep-alive interval to absorb network hiccups: with a 30s TTL and keep-alives every 10s, two keep-alives can be missed before expiry. Too short a TTL causes false positives; too long delays failure detection.
etcd's watch mechanism improves on ZooKeeper's one-shot watches with continuous, streaming watches. Once you create a watch, it stays active and delivers all subsequent changes — no re-registration needed.
This design eliminates the watch re-registration race condition that's common in ZooKeeper programming.
Watch Features

etcd watches are persistent streams that can cover a single key or an entire prefix, replay history from any uncompacted revision, filter by event type, and send periodic progress notifications.
```go
package main

import (
	"context"
	"fmt"

	rpctypes "go.etcd.io/etcd/api/v3/v3rpc/rpctypes"
	clientv3 "go.etcd.io/etcd/client/v3"
)

// lastSeenRevision is persisted across restarts so watches can resume
var lastSeenRevision int64

func watchExamples(client *clientv3.Client) {
	ctx := context.Background()

	// Watch a single key
	watchChan := client.Watch(ctx, "/config/feature-flag")

	// Watch all keys with prefix
	watchChan = client.Watch(ctx, "/services/myapp/", clientv3.WithPrefix())

	// Watch from a specific revision (useful after restart)
	watchChan = client.Watch(ctx, "/services/myapp/",
		clientv3.WithPrefix(),
		clientv3.WithRev(lastSeenRevision+1))

	// Watch with filters (only deletions)
	watchChan = client.Watch(ctx, "/services/myapp/",
		clientv3.WithPrefix(),
		clientv3.WithFilterPut()) // Filter OUT puts, see only deletes

	// Process watch events
	for watchResp := range watchChan {
		// Check for errors
		if watchResp.Err() != nil {
			handleWatchError(watchResp.Err())
			continue
		}

		// Check if this is a progress notification
		if watchResp.IsProgressNotify() {
			// No actual events, just confirming liveness
			continue
		}

		// Process events
		for _, event := range watchResp.Events {
			switch event.Type {
			case clientv3.EventTypePut:
				fmt.Printf("PUT: %s -> %s (mod_rev: %d)\n",
					event.Kv.Key, event.Kv.Value, event.Kv.ModRevision)
			case clientv3.EventTypeDelete:
				fmt.Printf("DELETE: %s (mod_rev: %d)\n",
					event.Kv.Key, event.Kv.ModRevision)
			}
		}

		// Save revision for restart recovery
		lastSeenRevision = watchResp.Header.Revision
	}
}

func handleWatchError(err error) {
	if err == context.Canceled {
		// Normal cancellation
		return
	}

	// Check if it's a compaction error
	if err.Error() == rpctypes.ErrCompacted.Error() {
		// History was compacted, need to resync from current state
		resyncFromCurrentState()
	}
}

// resyncFromCurrentState re-reads the full state and restarts the watch
// from the current revision (implementation omitted)
func resyncFromCurrentState() {}
```

etcd compacts old revisions to save space. If your watch tries to start from a compacted revision, it fails with ErrCompacted. Handle this by reading the current state (full sync) and starting your watch from the current revision. This is the 'list then watch' pattern used by Kubernetes.
| Feature | etcd | ZooKeeper |
|---|---|---|
| Persistence | Continuous (stays active) | One-shot (removed after firing) |
| Re-registration | Not needed | Required after every event |
| Historical replay | Yes, from any revision | No, only current state |
| Prefix/range | Built-in | Requires watching parent |
| Protocol | gRPC streaming | TCP session |
| Delivery | Ordered, guaranteed | Ordered, guaranteed |
etcd provides mini-transactions (the Txn API) that enable atomic read-modify-write operations across multiple keys, plus a higher-level STM (software transactional memory) abstraction built on top of them. This is essential for implementing safe coordination patterns without race conditions.
A transaction has three components: a list of guard comparisons (If), operations executed when all comparisons succeed (Then), and operations executed otherwise (Else).
```go
package main

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

// myIdentity, lease, updatedAlice, updatedBob, parseBalance, and
// formatBalance are assumed to be defined elsewhere in the program.

func transactionExamples(client *clientv3.Client) {
	ctx := context.Background()

	// Compare-And-Swap (CAS): Only set if current value matches
	_, err := client.Txn(ctx).
		If(clientv3.Compare(clientv3.Value("/config/version"), "=", "v1")).
		Then(clientv3.OpPut("/config/version", "v2")).
		Else(clientv3.OpGet("/config/version")).
		Commit()

	// Create-If-Not-Exists: Acquire lock only if key doesn't exist
	resp, err := client.Txn(ctx).
		If(clientv3.Compare(clientv3.CreateRevision("/locks/mylock"), "=", 0)).
		Then(clientv3.OpPut("/locks/mylock", myIdentity, clientv3.WithLease(lease.ID))).
		Else(clientv3.OpGet("/locks/mylock")).
		Commit()

	if resp.Succeeded {
		fmt.Println("Lock acquired!")
	} else {
		holder := string(resp.Responses[0].GetResponseRange().Kvs[0].Value)
		fmt.Printf("Lock held by: %s\n", holder)
	}

	// Multi-key atomic update
	_, err = client.Txn(ctx).
		If(
			clientv3.Compare(clientv3.ModRevision("/users/alice"), ">", 0),
			clientv3.Compare(clientv3.ModRevision("/users/bob"), ">", 0),
		).
		Then(
			clientv3.OpPut("/users/alice", updatedAlice),
			clientv3.OpPut("/users/bob", updatedBob),
			clientv3.OpPut("/audit/log", "updated alice and bob"),
		).
		Commit()
	_ = err
}

// Software Transactional Memory (STM) pattern for complex logic
func stmExample(client *clientv3.Client) {
	// Using the concurrency package for higher-level STM
	concurrency.NewSTM(client, func(stm concurrency.STM) error {
		// Read multiple keys
		balance1 := stm.Get("/accounts/alice/balance")
		balance2 := stm.Get("/accounts/bob/balance")

		// Perform logic
		amount := 100
		newBalance1 := parseBalance(balance1) - amount
		newBalance2 := parseBalance(balance2) + amount

		// Write multiple keys - all or nothing
		stm.Put("/accounts/alice/balance", formatBalance(newBalance1))
		stm.Put("/accounts/bob/balance", formatBalance(newBalance2))
		return nil // Commit
	})
	// STM automatically retries on conflicts
}
```

Comparisons can target several key properties:

- Value("/key") = "expected"
- Version("/key") = 5
- CreateRevision("/key") = 0
- ModRevision("/key") > lastSeen
- Lease("/key") = leaseID

etcd transactions have limits: by default, max 128 operations per transaction and 1.5MB total request size. For larger coordination needs, break into multiple transactions with application-level conflict resolution, or use the STM (Software Transactional Memory) abstraction which handles retries automatically.
etcd's most prominent use is as Kubernetes' backing store. Every Kubernetes object — Pods, Deployments, Services, ConfigMaps, Secrets — is stored as a key-value pair in etcd. Understanding this relationship helps you operate Kubernetes clusters effectively.
How Kubernetes Uses etcd
Object Storage: Each Kubernetes resource is serialized (usually as Protocol Buffers) and stored at a predictable key path:
```shell
/registry/pods/default/my-pod
/registry/deployments/kube-system/coredns
/registry/secrets/production/database-credentials
```

Watch-Based Coordination: Controllers (Deployment controller, ReplicaSet controller, etc.) watch etcd for changes to resources they manage. This enables the declarative, reconciliation-based model.
Leader Election: Kubernetes components like kube-scheduler and kube-controller-manager use etcd-based leader election when running in HA mode.
Lease-Based Health: Components report their health through etcd entries with leases, enabling automatic failure detection.
Note that only the API Server talks directly to etcd. All other components (scheduler, controllers, kubelet) go through the API Server. This centralizes authentication, authorization, validation, and admission control. Never let applications access etcd directly in a Kubernetes deployment.
Watch the etcd_server_slow_apply_total metric to spot an etcd server that is struggling to keep up with writes.

We've covered etcd's architecture, data model, and its role in modern infrastructure.
What's Next:
In the next page, we'll explore Consul, which takes a different approach by combining service mesh, service discovery, and key-value storage in one platform. You'll learn how Consul compares to etcd and ZooKeeper and when its integrated approach is advantageous.
You now understand etcd's key-value model, Raft-based consensus, leases, watches, and its role as Kubernetes' backbone. This foundation will help you compare coordination services and make informed architectural decisions.