Understanding the Raft algorithm is one thing. Running it in production, serving millions of requests, maintaining five-nines availability, and handling the messy reality of datacenter operations—that's another challenge entirely.
The academic Raft paper describes the algorithm's core mechanics, but production implementations face additional concerns: unbounded log growth, changing cluster membership, read consistency, and the day-to-day work of deployment, monitoring, and failure handling.
This page examines how real-world systems solve these problems. We'll focus on etcd and Consul—two of the most widely deployed Raft implementations—and extract lessons that apply to any Raft-based system.
By the end of this page, you will understand:
• etcd and Consul — What they are and how they use Raft
• Log compaction and snapshots — Preventing unbounded log growth
• Cluster membership changes — Adding and removing nodes safely
• Linearizable reads — Different strategies for read consistency
• Production deployment patterns — Sizing, placement, and monitoring
• Common failure modes — What goes wrong and how to handle it
etcd is a distributed key-value store that provides a reliable way to store data across a cluster of machines. It's the data store underlying Kubernetes, making it one of the most critical pieces of infrastructure in modern cloud computing.
Key characteristics: a simple key-value API, watch streams for change notification, leases for TTL-based key expiry, and multi-key transactions, all replicated through Raft.
"""etcd: Distributed Key-Value Store with Raft Architecture: ┌─────────────────────────────────────────────────┐ │ etcd Cluster │ │ │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │ etcd-1 │ │ etcd-2 │ │ etcd-3 │ │ │ │ (Leader) │◄──│(Follower)│──►│(Follower)│ │ │ └────┬─────┘ └──────────┘ └──────────┘ │ │ │ │ │ │ Raft Consensus │ │ ▼ │ │ Key-Value Store │ │ Watches / Leases │ └─────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────┐ │ Kubernetes API Server │ │ (All cluster state stored in etcd) │ └─────────────────────────────────────────────────┘ What Kubernetes stores in etcd:- Pod definitions and status- Service endpoints- ConfigMaps and Secrets- RBAC policies- Node registrations- Custom resources """ # Example: etcd operations via Python clientimport etcd3 # Connect to etcd clusterclient = etcd3.client( host='etcd-1.example.com', port=2379, # Typically uses mTLS in production ca_cert='/path/to/ca.crt', cert_key='/path/to/client.key', cert_cert='/path/to/client.crt',) # === BASIC OPERATIONS === # Write a key (goes through Raft consensus)client.put('/services/api/leader', 'pod-abc-123') # Read a key (can be linearizable or serializable)value, metadata = client.get('/services/api/leader')print(f"Leader: {value.decode()}, Revision: {metadata.mod_revision}") # === WATCH API === # Subscribe to changes on a key prefixevents_iterator, cancel = client.watch_prefix('/services/') for event in events_iterator: if isinstance(event, etcd3.events.PutEvent): print(f"Key updated: {event.key} = {event.value}") elif isinstance(event, etcd3.events.DeleteEvent): print(f"Key deleted: {event.key}") # === LEASES (TTL) === # Create a lease that expires in 30 secondslease = client.lease(ttl=30) # Associate key with lease - key deleted when lease expiresclient.put('/locks/resource-1', 'holder-xyz', lease=lease) # Keep the lease alive (heartbeat)while True: lease.refresh() time.sleep(10) # === TRANSACTIONS === # Atomic compare-and-swapsuccess = client.transaction( compare=[ client.transactions.value('/counter') == '5' ], success=[ client.transactions.put('/counter', '6') ], failure=[ client.transactions.get('/counter') # Return current value ])etcd's Raft implementation (go.etcd.io/raft) includes several production enhancements: • Pre-vote — Prevents disruption from partitioned nodes (enabled by default since v3.4) • Learner nodes — Non-voting members for safe cluster expansion • Leadership transfer — Graceful leader handoff for maintenance • Read indexes — Efficient linearizable reads without full consensus
Consul by HashiCorp is a multi-purpose tool for service discovery, configuration, and service mesh. Unlike etcd's focused key-value design, Consul provides a broader feature set built on Raft consensus.
Key features beyond basic KV: service registration and discovery, built-in health checking, sessions for distributed locking, and multi-datacenter federation.
"""Consul Architecture: Multi-Datacenter Federation Unlike etcd (single cluster), Consul is designed for multi-datacenter deployments from the start. Single Datacenter: ┌───────────────────────────────────────────────┐ │ Datacenter 1 │ │ │ │ ┌─────────────────────────────────────────┐ │ │ │ Consul Server Cluster │ │ │ │ │ │ │ │ ┌─────┐ ┌─────┐ ┌─────┐ │ │ │ │ │ S1 │◄──│ S2 │──►│ S3 │ │ │ │ │ │Lead │ │ │ │ │ │ │ │ │ └─────┘ └─────┘ └─────┘ │ │ │ │ (Raft Consensus) │ │ │ └─────────────────────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │ │ │Agent│ │Agent│ │Agent│ │Agent│ │ │ │ C1 │ │ C2 │ │ C3 │ │ C4 │ │ │ └─────┘ └─────┘ └─────┘ └─────┘ │ │ (Every node runs a Consul agent) │ └───────────────────────────────────────────────┘ Multi-Datacenter Federation: ┌─────────────────────┐ ┌─────────────────────┐ │ Datacenter 1 │ │ Datacenter 2 │ │ (US-East) │ │ (EU-West) │ │ │ │ │ │ ┌───┐ ┌───┐ ┌───┐ │ │ ┌───┐ ┌───┐ ┌───┐ │ │ │S1 │ │S2 │ │S3 │◄─┼─WAN─┼─►│S1 │ │S2 │ │S3 │ │ │ └───┘ └───┘ └───┘ │ │ └───┘ └───┘ └───┘ │ │ │ │ │ │ Local Raft │ │ Local Raft │ │ Consensus │ │ Consensus │ └─────────────────────┘ └─────────────────────┘ Each datacenter has independent Raft consensus.WAN gossip protocol for cross-DC communication.No cross-DC Raft (latency would be too high).""" # Consul service registration and discoveryimport consul # Connect to local Consul agentc = consul.Consul() # === SERVICE REGISTRATION === # Register a service (local agent forwards to servers via Raft)c.agent.service.register( name='web-api', service_id='web-api-1', address='10.0.1.50', port=8080, tags=['v1.2.3', 'production'], # Health check - Consul pings this endpoint check=consul.Check.http('http://10.0.1.50:8080/health', interval='10s')) # === SERVICE DISCOVERY === # Get healthy instances of a serviceindex, services = c.health.service('web-api', passing=True) for service in services: node = service['Node']['Node'] address = service['Service']['Address'] port = service['Service']['Port'] print(f"Instance: {node} at {address}:{port}") # === KEY-VALUE STORE === # Put/Get (similar to etcd)c.kv.put('config/database/host', 'db.example.com')index, data = c.kv.get('config/database/host') # === DISTRIBUTED LOCKING === # Create a session (similar to etcd lease)session_id = c.session.create( name='my-lock-session', ttl='30s', behavior='delete' # Delete keys when session expires) # Acquire lockacquired = c.kv.put('locks/my-resource', 'locked', acquire=session_id)if acquired: # We hold the lock pass # Release lockc.kv.put('locks/my-resource', 'locked', release=session_id)| Aspect | etcd | Consul |
|---|---|---|
| Primary Use | Kubernetes data store | Service mesh, discovery, KV |
| Raft Implementation | go.etcd.io/raft (Go) | hashicorp/raft (Go) |
| Multi-DC | Separate clusters | Native federation |
| Health Checking | Via Kubernetes | Built-in |
| Watch API | gRPC streams | Long polling + blocking queries |
| Consistency | Linearizable default | Configurable (default, consistent, stale) |
| Typical Cluster Size | 3-7 nodes | 3-5 servers per DC |
In the basic Raft algorithm, the log grows forever. Every command ever executed remains in the log. This is clearly unsustainable—a system running for years would have a log of unbounded size.
Log compaction solves this by periodically saving the state machine state as a snapshot and discarding log entries that are reflected in the snapshot.
The key insight: We don't need the log entries themselves; we need their effect on the state machine. A snapshot captures that effect directly.
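To see that insight in miniature, the sketch below (toy commands, not etcd's actual log format) replays a short log and checks that it collapses to the same state a snapshot would store directly:

```python
# Toy log of state-machine commands (illustrative format).
log = [
    ("SET", "x", 1),
    ("SET", "y", 2),
    ("SET", "x", 3),
    ("DEL", "y", None),
    ("SET", "z", 4),
]

# Replaying every entry from the beginning...
state = {}
for op, key, value in log:
    if op == "SET":
        state[key] = value
    else:
        state.pop(key, None)

# ...yields exactly the state a snapshot at index 5 would store directly.
snapshot = {"x": 3, "z": 4}
assert state == snapshot
```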
"""Log Compaction via Snapshots Before compaction: Log: [1|SET x=1] [2|SET y=2] [3|SET x=3] [4|DEL y] [5|SET z=4] ... Each entry takes space and must be replayed on recovery. After compaction at index 4: Snapshot: {x: 3, z: 4} (state at index 4) Log: [5|SET z=4] [6|...] [7|...] Entries 1-4 discarded - their effect is in the snapshot. When to snapshot:- When log exceeds a size threshold (e.g., 10MB)- At regular intervals (e.g., every hour)- On explicit request (backup) etcd default: Snapshot every 10,000 applied entries.""" from dataclasses import dataclassfrom typing import Dict, Any, Optionalimport json @dataclassclass Snapshot: """ A snapshot captures the state machine state at a point in time. """ # The index of the last log entry applied to this snapshot last_included_index: int # The term of the last log entry applied last_included_term: int # The actual state machine state (serialized) data: bytes # Cluster configuration at this point (for membership changes) config: Dict[str, Any] class SnapshotManager: """ Manages snapshot creation and restoration. """ SNAPSHOT_THRESHOLD = 10000 # Entries before snapshotting def __init__(self, state_machine, log, storage): self.state_machine = state_machine self.log = log self.storage = storage def maybe_snapshot(self, applied_index: int): """ Check if we should create a snapshot. Called after applying entries. """ entries_since_snapshot = applied_index - self.last_snapshot_index if entries_since_snapshot >= self.SNAPSHOT_THRESHOLD: self.create_snapshot(applied_index) def create_snapshot(self, up_to_index: int): """ Create a snapshot of the current state. """ # Get term of last included entry last_term = self.log.get_term_at(up_to_index) # Serialize state machine state_data = self.state_machine.serialize() # Get cluster config config = self.log.get_config_at(up_to_index) snapshot = Snapshot( last_included_index=up_to_index, last_included_term=last_term, data=state_data, config=config ) # Persist snapshot to disk self.storage.save_snapshot(snapshot) # Compact log - remove entries up to snapshot self.log.discard_up_to(up_to_index) print(f"Created snapshot at index {up_to_index}") def restore_from_snapshot(self, snapshot: Snapshot): """ Restore state machine from a snapshot. Used on startup or when receiving InstallSnapshot RPC. """ # Restore state machine self.state_machine.deserialize(snapshot.data) # Update log to reflect snapshot self.log.reset_to( snapshot.last_included_index, snapshot.last_included_term ) print(f"Restored from snapshot at index {snapshot.last_included_index}") # InstallSnapshot RPC - for slow followers@dataclassclass InstallSnapshotRequest: """ Sent by leader when follower is too far behind to catch up with log entries. """ term: int leader_id: int # Snapshot identification last_included_index: int last_included_term: int # For chunked transfer of large snapshots offset: int data: bytes # Chunk of snapshot data done: bool # True if this is the last chunk def handle_install_snapshot(server, request: InstallSnapshotRequest): """ Follower receives snapshot from leader. 
""" if request.term < server.current_term: return {"term": server.current_term} if request.term > server.current_term: server.become_follower(request.term) server.reset_election_timer() # Write chunk to temporary file server.snapshot_buffer.write_chunk(request.offset, request.data) if request.done: # All chunks received - install snapshot snapshot_data = server.snapshot_buffer.finalize() snapshot = Snapshot( last_included_index=request.last_included_index, last_included_term=request.last_included_term, data=snapshot_data, config={} # Included in data ) server.snapshot_manager.restore_from_snapshot(snapshot) return {"term": server.current_term}Creating a snapshot while the state machine is being modified can result in an inconsistent snapshot. Production systems use techniques like: • Copy-on-write — Fork the process (etcd approach) • Pause writes — Brief pause during snapshot • MVCC — Multi-version concurrency control (etcd's BoltDB) • Incremental snapshots — Only delta since last snapshot
Production clusters need to change their membership over time: replacing failed machines, scaling the cluster up or down, and rotating nodes through maintenance and upgrades.
This is surprisingly tricky. The naive approach of switching every node to the new configuration in one step can create a situation where two disjoint majorities exist simultaneously, violating safety.
The Problem:
Consider a 3-node cluster {A, B, C} transitioning to 5 nodes {A, B, C, D, E}:
During the transition, {A, B} could form an old majority while {C, D, E} forms a new majority. Two leaders could be elected simultaneously!
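A quick sanity check in plain Python, using the node names from the example above, shows that the two majorities really can be disjoint:

```python
old_cluster = {"A", "B", "C"}            # majority size = 2
new_cluster = {"A", "B", "C", "D", "E"}  # majority size = 3

old_majority = {"A", "B"}       # a valid majority under the old configuration
new_majority = {"C", "D", "E"}  # a valid majority under the new configuration

# Disjoint quorums: each side could elect its own leader and commit entries.
assert old_majority <= old_cluster and len(old_majority) > len(old_cluster) // 2
assert new_majority <= new_cluster and len(new_majority) > len(new_cluster) // 2
assert old_majority.isdisjoint(new_majority)
```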
Raft's Solution: Single-server changes
Raft handles membership changes by adding or removing one server at a time. This guarantees that old and new majorities always overlap.
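Running the same kind of check exhaustively for a single-server change (3 to 4 nodes, the case the implementation sketch below also walks through) confirms that every old majority intersects every new majority:

```python
from itertools import combinations

old_cluster = {"A", "B", "C"}       # majority size = 2
new_cluster = {"A", "B", "C", "D"}  # majority size = 3

old_majorities = [set(c) for c in combinations(old_cluster, 2)]
new_majorities = [set(c) for c in combinations(new_cluster, 3)]

# Any old quorum and any new quorum share at least one node, so only one
# leader can be elected during the transition.
assert all(old & new for old in old_majorities for new in new_majorities)
```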
"""Safe Membership Changes: One Node at a Time The Raft paper describes "joint consensus" for multi-node changes,but most implementations (including etcd) use single-node changesbecause they're simpler and sufficient. Why single-node changes are safe: Going from N nodes to N+1 nodes:- Old majority: (N/2) + 1- New majority: ((N+1)/2) + 1 = (N/2) + 1 (same or +1) Any old majority and any new majority must share at least one node! Example: 3 → 4 nodes- Old cluster: {A, B, C}, majority = 2- New cluster: {A, B, C, D}, majority = 3 Old majorities: {A,B}, {A,C}, {B,C}New majorities: {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D} Every new majority contains at least 2 of {A,B,C}.Every pair from {A,B,C} is an old majority.Therefore: overlap guaranteed!""" from dataclasses import dataclassfrom enum import Enumfrom typing import Set class MembershipChangeType(Enum): ADD_VOTER = "add_voter" REMOVE_VOTER = "remove_voter" ADD_LEARNER = "add_learner" PROMOTE_LEARNER = "promote_learner" @dataclassclass MembershipChange: change_type: MembershipChangeType server_id: int server_address: str class MembershipManager: """ Manages cluster membership changes. """ def __init__(self, server): self.server = server self.pending_change: Optional[MembershipChange] = None def propose_change(self, change: MembershipChange) -> bool: """ Propose a membership change. Rules: 1. Only leader can propose changes 2. Only one change at a time 3. Change must be committed before proposing another """ if not self.server.is_leader(): return False if self.pending_change is not None: return False # Already have pending change # Create a special log entry for the membership change config_entry = ConfigChangeEntry( change_type=change.change_type, server_id=change.server_id, server_address=change.server_address ) # Append to log like any other entry self.server.log.append( term=self.server.current_term, command=config_entry ) self.pending_change = change # Replicate to followers self.server.replicate_to_all() return True def on_config_committed(self, change: MembershipChange): """ Called when a membership change entry is committed. """ if change.change_type == MembershipChangeType.ADD_VOTER: self.server.voters.add(change.server_id) elif change.change_type == MembershipChangeType.REMOVE_VOTER: self.server.voters.discard(change.server_id) # If we removed ourselves, step down if change.server_id == self.server.id: self.server.step_down() self.pending_change = None # etcd's learner pattern for safe node addition"""Adding a new node directly is risky:- New node has empty log- Needs to catch up (InstallSnapshot + log entries)- If it becomes a voter before catching up, it can't vote usefully- Reduces cluster availability during catch-up etcd's solution: Learners 1. Add node as LEARNER (non-voting) - Receives log entries like a follower - Does NOT participate in elections - Does NOT count toward majority 2. Wait for learner to catch up - Monitor: raft_apply_index, raft_log_index - Learner should be within a few entries of leader 3. Promote learner to VOTER - Now counts toward majority - Can vote in elections - Safe because it's already caught up""" def safe_add_node(cluster_client, new_node_address: str): """ Safely add a new node to an etcd cluster. 
""" # Step 1: Add as learner print(f"Adding {new_node_address} as learner...") cluster_client.member_add( peer_urls=[new_node_address], is_learner=True ) # Step 2: Start etcd on the new node with --initial-cluster-state=existing # Step 3: Wait for catch-up print("Waiting for learner to catch up...") while True: members = cluster_client.member_list() learner = find_member_by_address(members, new_node_address) leader = get_leader(members) # Check if learner is caught up if leader.applied_index - learner.applied_index < 100: print(f"Learner caught up (lag: {leader.applied_index - learner.applied_index})") break time.sleep(5) # Step 4: Promote to voter print(f"Promoting {new_node_address} to voter...") cluster_client.member_promote(learner.id) print("Node successfully added!")The learner pattern is a production best practice implemented in etcd, Consul, and most Raft systems. It provides: • Safe catch-up — New node syncs data without affecting quorum • Rollback capability — If the new node has issues, remove the learner without impact • Validation — Verify the new node is healthy before it becomes critical
Reading from a Raft cluster is more complex than it appears. The naive approach—read from the leader—has a subtle problem: the leader might not know it's been deposed.
If a network partition occurred and a new leader was elected, the old leader might still think it's the leader and serve stale reads. This violates linearizability.
Production systems use several strategies to provide correct reads:
"""Strategies for Linearizable (Consistent) Reads The problem:1. Client reads from Leader A2. Network partition occurs 3. New Leader B is elected4. Client's read returns stale data (Leader A doesn't know it's stale) Solutions:""" # === STRATEGY 1: Read Through Consensus ==="""Treat reads like writes - put them through Raft. Pros:- Always linearizable- Simple to implement Cons:- Slow (one RTT to majority for every read)- High load on leader and Raft log Implementation:"""async def read_through_consensus(key: str) -> str: # Create a read marker entry entry = {"type": "read", "key": key, "id": generate_id()} # Replicate through Raft (wait for commit) await replicate_and_commit(entry) # After commit, we know we're the leader and can read safely return state_machine.get(key) # === STRATEGY 2: ReadIndex (etcd's approach) ==="""Verify leadership without logging a full entry. Steps:1. Leader records its current commit index (read_index)2. Leader confirms it's still leader (heartbeat to majority)3. Leader waits for state machine to apply up to read_index4. Leader serves the read Pros:- Linearizable without log entries for reads- Much faster than full consensus reads Cons:- Still requires RTT to majority per read (or batch)"""async def read_index(key: str) -> str: if not is_leader(): raise NotLeaderError() # Step 1: Record current commit index read_index = self.commit_index # Step 2: Confirm leadership (heartbeat to majority) # This ensures no new leader has been elected acks = await send_heartbeats_and_wait() if acks < majority: raise NotLeaderError() # Lost leadership # Step 3: Wait for state machine to catch up while self.last_applied < read_index: await asyncio.sleep(0.001) # Step 4: Read is now safe return self.state_machine.get(key) # === STRATEGY 3: Lease-Based Reads ==="""Leader holds a "lease" during which it can serve reads locally. Steps:1. Leader gets heartbeat acks with timestamps2. If all followers acked within lease_duration, leader has a lease3. During lease, leader can serve reads without checking majority4. Lease automatically expires, requiring renewal Pros:- Reads don't require network round-trips during lease- Very fast (local read) Cons:- Depends on bounded clock drift assumption- If clocks skew too much, could serve stale data"""async def lease_read(key: str) -> str: if not is_leader(): raise NotLeaderError() current_time = time.monotonic() # Check if lease is still valid if current_time < self.lease_expiry: # Lease is valid - read locally without network return self.state_machine.get(key) else: # Lease expired - need to renew via heartbeats await send_heartbeats_and_renew_lease() return await lease_read(key) # Retry # === STRATEGY 4: Serializable Reads (Follower Reads) ==="""Allow reads from any server, sacrificing linearizability for performance. The read might be stale, but it's from a consistent point in time.Useful when:- Slight staleness is acceptable- Read load is high- You need horizontal read scaling etcd supports this via "Serializable: true" option."""async def serializable_read(key: str, any_server: bool = False) -> str: if any_server: # Read from local state machine (might be behind) return self.state_machine.get(key) else: # Forward to leader for linearizable read return await forward_to_leader(key) # === etcd's Read Modes ==="""etcd exposes these as client options: 1. Linearizable (default): Uses ReadIndex internally Guarantees seeing latest writes 2. 
Serializable: Reads from any server May return stale data Much higher throughput"""| Strategy | Consistency | Latency | Throughput | Clock Dependency |
|---|---|---|---|---|
| Read Through Raft | Linearizable | High (consensus) | Low | None |
| ReadIndex | Linearizable | Medium (heartbeat) | Medium | None |
| Lease-Based | Linearizable* | Very Low (local) | Very High | Bounded drift required |
| Serializable | Serializable | Very Low | Very High (any node) | None |
Deploying Raft in production requires careful attention to cluster sizing, hardware placement, monitoring, and operational procedures.
```yaml
# Example: etcd production deployment for Kubernetes

# === HARDWARE RECOMMENDATIONS ===
#
# CPU:     2-4 cores (etcd is not CPU-bound)
# Memory:  8-16 GB (for large key-value stores)
# Disk:    SSD with fsync durability - CRITICAL
#          - Avoid network-attached storage if possible
#          - Use local NVMe for best performance
# Network: Low latency between nodes (<2ms RTT ideal)

# === Kubernetes etcd StatefulSet ===
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: etcd
  replicas: 3  # Always odd number
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      # Anti-affinity: Spread across failure domains
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: etcd
            topologyKey: topology.kubernetes.io/zone
      containers:
      - name: etcd
        image: quay.io/coreos/etcd:v3.5.9
        resources:
          requests:
            cpu: "2"
            memory: "8Gi"
          limits:
            cpu: "4"
            memory: "16Gi"
        env:
        - name: ETCD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        command:
        - etcd
        - --name=$(ETCD_NAME)
        - --data-dir=/var/lib/etcd
        # === CRITICAL PERFORMANCE SETTINGS ===
        - --heartbeat-interval=100          # ms (default 100)
        - --election-timeout=1000           # ms (default 1000)
        - --snapshot-count=10000            # Entries before snapshot
        - --quota-backend-bytes=8589934592  # 8GB max size
        # === NETWORK SETTINGS ===
        - --listen-peer-urls=https://0.0.0.0:2380
        - --listen-client-urls=https://0.0.0.0:2379
        - --initial-cluster=etcd-0=https://etcd-0:2380,etcd-1=https://etcd-1:2380,etcd-2=https://etcd-2:2380
        # === SECURITY (always use TLS in production) ===
        - --peer-cert-file=/etc/etcd/peer.crt
        - --peer-key-file=/etc/etcd/peer.key
        - --peer-trusted-ca-file=/etc/etcd/ca.crt
        - --client-cert-auth=true
        volumeMounts:
        - name: data
          mountPath: /var/lib/etcd
        - name: certs
          mountPath: /etc/etcd
  # Use local SSD storage
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: local-ssd  # High-performance local storage
      resources:
        requests:
          storage: 100Gi

---
# === KEY METRICS TO MONITOR ===
#
# etcd_disk_wal_fsync_duration_seconds
#   - Measures disk performance
#   - If p99 > 10ms, disk is too slow for reliable Raft
#
# etcd_network_peer_round_trip_time_seconds
#   - Latency between etcd nodes
#   - High RTT increases election and commit latency
#
# etcd_server_has_leader
#   - 1 if cluster has a leader, 0 otherwise
#   - Alert if 0 for more than a few seconds
#
# etcd_server_leader_changes_seen_total
#   - Counter of leader elections
#   - Frequent changes indicate instability
#
# etcd_mvcc_db_total_size_in_bytes
#   - Current database size
#   - Alert if approaching quota
```

The most common cause of etcd instability is slow disk I/O. Raft requires fsync for durability, and if fsync is slow, heartbeats are delayed, causing cascading election failures. Always use SSDs, and monitor etcd_disk_wal_fsync_duration_seconds. If p99 exceeds 10ms, your disk is the bottleneck.
Understanding common failure patterns helps you diagnose and resolve issues quickly when your 3 AM pager goes off.
| Symptom | Likely Cause | Investigation | Resolution |
|---|---|---|---|
| Frequent leader elections | Slow disk fsync OR network latency | Check etcd_disk_wal_fsync_duration, etcd_network_peer_round_trip_time | Move to faster storage, reduce network latency, increase election timeout |
| Cluster won't elect leader | Quorum lost (too few nodes up) | Check member list, verify network connectivity between nodes | Restore failed nodes, or in emergency: etcdctl snapshot restore |
| One node far behind | Slow disk or network to that node | Compare raft_applied_index across nodes | Fix underlying infrastructure, or remove/replace node |
| Writes blocked but cluster seems healthy | Leader can't reach quorum for commits | Check leader's matchIndex for each follower | Debug network between leader and followers |
| Memory/disk usage growing | Snapshot not triggering, or compaction disabled | Check snapshot-count config, run compaction manually | Enable auto-compaction, trigger manual compaction |
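To tie the table above to concrete monitoring, here is a minimal sketch that scrapes an etcd /metrics endpoint (Prometheus text format) and flags the two signals the table leans on most. The URL, the `requests` dependency, and the choice of the 8ms histogram bucket as a rough stand-in for the 10ms fsync guidance are all assumptions for illustration, not etcd tooling:

```python
import requests

METRICS_URL = "https://etcd-0.example.com:2379/metrics"  # illustrative endpoint


def scrape_metrics(url: str) -> dict:
    """Parse Prometheus text-format exposition into {sample_name: value}."""
    samples = {}
    for line in requests.get(url, timeout=5).text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue
    return samples


def check_etcd_health(samples: dict) -> list:
    """Return human-readable alerts for the two most common failure signals."""
    alerts = []

    # Leadership: etcd_server_has_leader should be 1; 0 means no leader/quorum.
    if samples.get("etcd_server_has_leader", 1.0) < 1.0:
        alerts.append("no leader: cluster cannot commit writes")

    # Disk latency: fraction of WAL fsyncs completing within ~8ms.
    fast = samples.get('etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"}', 0.0)
    total = samples.get("etcd_disk_wal_fsync_duration_seconds_count", 0.0)
    if total and fast / total < 0.99:
        alerts.append("slow fsync: disk likely too slow for stable Raft heartbeats")

    return alerts


if __name__ == "__main__":
    for alert in check_etcd_health(scrape_metrics(METRICS_URL)):
        print("ALERT:", alert)
```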
```bash
#!/bin/bash

# === Essential etcd Debugging Commands ===

# Check cluster health
etcdctl endpoint health --cluster
# Output: https://etcd-0:2379 is healthy: successfully committed proposal

# Check cluster status (who is leader?)
etcdctl endpoint status --cluster -w table
# +---------------------+------------------+---------+-----------+
# |      ENDPOINT       |        ID        | VERSION | IS LEADER |
# +---------------------+------------------+---------+-----------+
# | https://etcd-0:2379 | 8e9e05c52164694d | 3.5.9   | true      |
# | https://etcd-1:2379 | 2e80f96756585b85 | 3.5.9   | false     |
# | https://etcd-2:2379 | 501e5b0c8f37c2d2 | 3.5.9   | false     |
# +---------------------+------------------+---------+-----------+

# Check member list
etcdctl member list -w table

# Check Raft terms and indices (detailed)
curl -s https://etcd-0:2379/metrics | grep -E 'etcd_server_|raft_'
# etcd_server_has_leader 1
# etcd_server_leader_changes_seen_total 3
# etcd_server_proposals_committed_total 145892
# raft_term 4
# raft_applied_index 145892

# === Diagnosing Split-Brain / Network Issues ===

# From each node, test connectivity to others
for i in 0 1 2; do
  echo "Testing from etcd-$i:"
  kubectl exec etcd-$i -- etcdctl endpoint health \
    --endpoints=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379
done

# === Performance Debugging ===

# Check fsync latency histogram
curl -s https://etcd-0:2379/metrics | grep etcd_disk_wal_fsync
# etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 45123
# etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 45400
# etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 45420
# (Most fsyncs < 1ms is healthy)

# Check round-trip time between peers
curl -s https://etcd-0:2379/metrics | grep round_trip
# etcd_network_peer_round_trip_time_seconds_bucket{To="8e9e05c52164694d",le="0.001"} 9000

# === Recovery Operations ===

# Take a snapshot
etcdctl snapshot save backup.db

# Restore from snapshot (ONLY do this if quorum is permanently lost)
etcdctl snapshot restore backup.db \
  --name etcd-0 \
  --initial-cluster etcd-0=https://etcd-0:2380 \
  --initial-advertise-peer-urls https://etcd-0:2380

# Move leader (for maintenance)
etcdctl move-leader <target-member-id>

# Compact revision history (free up space)
# Get current revision
rev=$(etcdctl endpoint status -w json | jq '.[0].Status.header.revision')
# Compact up to current revision
etcdctl compact $rev
```

| System | Primary Use Case | Notable Feature |
|---|---|---|
| etcd | Kubernetes data store | ReadIndex for efficient reads |
| Consul | Service mesh and discovery | Multi-datacenter federation |
| TiKV | Distributed database | Multi-Raft for scalability |
| CockroachDB | SQL database | Ranges as Raft groups |
| RethinkDB | Real-time database | Raft for shard coordination |
Module Complete:
You've now completed a comprehensive study of the Raft consensus algorithm. Starting from its design philosophy of understandability, through leader election, log replication, and safety guarantees, to practical production deployment—you have the knowledge to understand, implement, debug, and operate Raft-based systems.
Raft is the foundation for many of the most critical distributed systems in modern infrastructure. The concepts you've learned here—consensus, quorums, replication, safety properties—are fundamental to distributed systems engineering and will serve you throughout your career.
Congratulations! You've mastered the Raft consensus algorithm—from theoretical foundations to production deployment. You understand why Raft was designed for understandability, how leader election and log replication work, what safety properties Raft guarantees, and how systems like etcd and Consul implement Raft in practice. You're now equipped to work with Raft-based distributed systems at the deepest level.