Understanding the Raft algorithm is one thing. Running it in production, serving millions of requests, maintaining five-nines availability, and handling the messy reality of datacenter operations—that's another challenge entirely.
The academic Raft paper describes the algorithm's core mechanics, but production implementations face additional concerns: unbounded log growth, changing cluster membership, read consistency, and the day-to-day work of deployment, monitoring, and failure handling.
This page examines how real-world systems solve these problems. We'll focus on etcd and Consul—two of the most widely deployed Raft implementations—and extract lessons that apply to any Raft-based system.
By the end of this page, you will understand:
• etcd and Consul — What they are and how they use Raft
• Log compaction and snapshots — Preventing unbounded log growth
• Cluster membership changes — Adding and removing nodes safely
• Linearizable reads — Different strategies for read consistency
• Production deployment patterns — Sizing, placement, and monitoring
• Common failure modes — What goes wrong and how to handle it
etcd is a distributed key-value store that provides a reliable way to store data across a cluster of machines. It's the data store underlying Kubernetes, making it one of the most critical pieces of infrastructure in modern cloud computing.
Key characteristics: a simple key-value API, watch streams for change notification, leases for TTL-based key expiry, and multi-key transactions, all replicated through Raft.
"""etcd: Distributed Key-Value Store with Raft Architecture: ┌─────────────────────────────────────────────────┐ │ etcd Cluster │ │ │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │ etcd-1 │ │ etcd-2 │ │ etcd-3 │ │ │ │ (Leader) │◄──│(Follower)│──►│(Follower)│ │ │ └────┬─────┘ └──────────┘ └──────────┘ │ │ │ │ │ │ Raft Consensus │ │ ▼ │ │ Key-Value Store │ │ Watches / Leases │ └─────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────┐ │ Kubernetes API Server │ │ (All cluster state stored in etcd) │ └─────────────────────────────────────────────────┘ What Kubernetes stores in etcd:- Pod definitions and status- Service endpoints- ConfigMaps and Secrets- RBAC policies- Node registrations- Custom resources """ # Example: etcd operations via Python clientimport etcd3 # Connect to etcd clusterclient = etcd3.client( host='etcd-1.example.com', port=2379, # Typically uses mTLS in production ca_cert='/path/to/ca.crt', cert_key='/path/to/client.key', cert_cert='/path/to/client.crt',) # === BASIC OPERATIONS === # Write a key (goes through Raft consensus)client.put('/services/api/leader', 'pod-abc-123') # Read a key (can be linearizable or serializable)value, metadata = client.get('/services/api/leader')print(f"Leader: {value.decode()}, Revision: {metadata.mod_revision}") # === WATCH API === # Subscribe to changes on a key prefixevents_iterator, cancel = client.watch_prefix('/services/') for event in events_iterator: if isinstance(event, etcd3.events.PutEvent): print(f"Key updated: {event.key} = {event.value}") elif isinstance(event, etcd3.events.DeleteEvent): print(f"Key deleted: {event.key}") # === LEASES (TTL) === # Create a lease that expires in 30 secondslease = client.lease(ttl=30) # Associate key with lease - key deleted when lease expiresclient.put('/locks/resource-1', 'holder-xyz', lease=lease) # Keep the lease alive (heartbeat)while True: lease.refresh() time.sleep(10) # === TRANSACTIONS === # Atomic compare-and-swapsuccess = client.transaction( compare=[ client.transactions.value('/counter') == '5' ], success=[ client.transactions.put('/counter', '6') ], failure=[ client.transactions.get('/counter') # Return current value ])etcd's Raft implementation (go.etcd.io/raft) includes several production enhancements: • Pre-vote — Prevents disruption from partitioned nodes (enabled by default since v3.4) • Learner nodes — Non-voting members for safe cluster expansion • Leadership transfer — Graceful leader handoff for maintenance • Read indexes — Efficient linearizable reads without full consensus
Consul by HashiCorp is a multi-purpose tool for service discovery, configuration, and service mesh. Unlike etcd's focused key-value design, Consul provides a broader feature set built on Raft consensus.
Key features beyond basic KV: service registration and discovery, built-in health checking, sessions for distributed locking, and multi-datacenter federation.
"""Consul Architecture: Multi-Datacenter Federation Unlike etcd (single cluster), Consul is designed for multi-datacenter deployments from the start. Single Datacenter: ┌───────────────────────────────────────────────┐ │ Datacenter 1 │ │ │ │ ┌─────────────────────────────────────────┐ │ │ │ Consul Server Cluster │ │ │ │ │ │ │ │ ┌─────┐ ┌─────┐ ┌─────┐ │ │ │ │ │ S1 │◄──│ S2 │──►│ S3 │ │ │ │ │ │Lead │ │ │ │ │ │ │ │ │ └─────┘ └─────┘ └─────┘ │ │ │ │ (Raft Consensus) │ │ │ └─────────────────────────────────────────┘ │ │ │ │ │ ▼ │ │ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │ │ │Agent│ │Agent│ │Agent│ │Agent│ │ │ │ C1 │ │ C2 │ │ C3 │ │ C4 │ │ │ └─────┘ └─────┘ └─────┘ └─────┘ │ │ (Every node runs a Consul agent) │ └───────────────────────────────────────────────┘ Multi-Datacenter Federation: ┌─────────────────────┐ ┌─────────────────────┐ │ Datacenter 1 │ │ Datacenter 2 │ │ (US-East) │ │ (EU-West) │ │ │ │ │ │ ┌───┐ ┌───┐ ┌───┐ │ │ ┌───┐ ┌───┐ ┌───┐ │ │ │S1 │ │S2 │ │S3 │◄─┼─WAN─┼─►│S1 │ │S2 │ │S3 │ │ │ └───┘ └───┘ └───┘ │ │ └───┘ └───┘ └───┘ │ │ │ │ │ │ Local Raft │ │ Local Raft │ │ Consensus │ │ Consensus │ └─────────────────────┘ └─────────────────────┘ Each datacenter has independent Raft consensus.WAN gossip protocol for cross-DC communication.No cross-DC Raft (latency would be too high).""" # Consul service registration and discoveryimport consul # Connect to local Consul agentc = consul.Consul() # === SERVICE REGISTRATION === # Register a service (local agent forwards to servers via Raft)c.agent.service.register( name='web-api', service_id='web-api-1', address='10.0.1.50', port=8080, tags=['v1.2.3', 'production'], # Health check - Consul pings this endpoint check=consul.Check.http('http://10.0.1.50:8080/health', interval='10s')) # === SERVICE DISCOVERY === # Get healthy instances of a serviceindex, services = c.health.service('web-api', passing=True) for service in services: node = service['Node']['Node'] address = service['Service']['Address'] port = service['Service']['Port'] print(f"Instance: {node} at {address}:{port}") # === KEY-VALUE STORE === # Put/Get (similar to etcd)c.kv.put('config/database/host', 'db.example.com')index, data = c.kv.get('config/database/host') # === DISTRIBUTED LOCKING === # Create a session (similar to etcd lease)session_id = c.session.create( name='my-lock-session', ttl='30s', behavior='delete' # Delete keys when session expires) # Acquire lockacquired = c.kv.put('locks/my-resource', 'locked', acquire=session_id)if acquired: # We hold the lock pass # Release lockc.kv.put('locks/my-resource', 'locked', release=session_id)| Aspect | etcd | Consul |
|---|---|---|
| Primary Use | Kubernetes data store | Service mesh, discovery, KV |
| Raft Implementation | go.etcd.io/raft (Go) | hashicorp/raft (Go) |
| Multi-DC | Separate clusters | Native federation |
| Health Checking | Via Kubernetes | Built-in |
| Watch API | gRPC streams | Long polling + blocking queries |
| Consistency | Linearizable default | Configurable (default, consistent, stale) |
| Typical Cluster Size | 3-7 nodes | 3-5 servers per DC |
In the basic Raft algorithm, the log grows forever. Every command ever executed remains in the log. This is clearly unsustainable—a system running for years would have a log of unbounded size.
Log compaction solves this by periodically saving the state machine state as a snapshot and discarding log entries that are reflected in the snapshot.
The key insight: We don't need the log entries themselves; we need their effect on the state machine. A snapshot captures that effect directly.
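To see that insight in miniature, the sketch below (toy commands, not etcd's actual log format) replays a short log and checks that it collapses to the same state a snapshot would store directly:

```python
# Toy log of state-machine commands (illustrative format).
log = [
    ("SET", "x", 1),
    ("SET", "y", 2),
    ("SET", "x", 3),
    ("DEL", "y", None),
    ("SET", "z", 4),
]

# Replaying every entry from the beginning...
state = {}
for op, key, value in log:
    if op == "SET":
        state[key] = value
    else:
        state.pop(key, None)

# ...yields exactly the state a snapshot at index 5 would store directly.
snapshot = {"x": 3, "z": 4}
assert state == snapshot
```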
"""Log Compaction via Snapshots Before compaction: Log: [1|SET x=1] [2|SET y=2] [3|SET x=3] [4|DEL y] [5|SET z=4] ... Each entry takes space and must be replayed on recovery. After compaction at index 4: Snapshot: {x: 3, z: 4} (state at index 4) Log: [5|SET z=4] [6|...] [7|...] Entries 1-4 discarded - their effect is in the snapshot. When to snapshot:- When log exceeds a size threshold (e.g., 10MB)- At regular intervals (e.g., every hour)- On explicit request (backup) etcd default: Snapshot every 10,000 applied entries.""" from dataclasses import dataclassfrom typing import Dict, Any, Optionalimport json @dataclassclass Snapshot: """ A snapshot captures the state machine state at a point in time. """ # The index of the last log entry applied to this snapshot last_included_index: int # The term of the last log entry applied last_included_term: int # The actual state machine state (serialized) data: bytes # Cluster configuration at this point (for membership changes) config: Dict[str, Any] class SnapshotManager: """ Manages snapshot creation and restoration. """ SNAPSHOT_THRESHOLD = 10000 # Entries before snapshotting def __init__(self, state_machine, log, storage): self.state_machine = state_machine self.log = log self.storage = storage def maybe_snapshot(self, applied_index: int): """ Check if we should create a snapshot. Called after applying entries. """ entries_since_snapshot = applied_index - self.last_snapshot_index if entries_since_snapshot >= self.SNAPSHOT_THRESHOLD: self.create_snapshot(applied_index) def create_snapshot(self, up_to_index: int): """ Create a snapshot of the current state. """ # Get term of last included entry last_term = self.log.get_term_at(up_to_index) # Serialize state machine state_data = self.state_machine.serialize() # Get cluster config config = self.log.get_config_at(up_to_index) snapshot = Snapshot( last_included_index=up_to_index, last_included_term=last_term, data=state_data, config=config ) # Persist snapshot to disk self.storage.save_snapshot(snapshot) # Compact log - remove entries up to snapshot self.log.discard_up_to(up_to_index) print(f"Created snapshot at index {up_to_index}") def restore_from_snapshot(self, snapshot: Snapshot): """ Restore state machine from a snapshot. Used on startup or when receiving InstallSnapshot RPC. """ # Restore state machine self.state_machine.deserialize(snapshot.data) # Update log to reflect snapshot self.log.reset_to( snapshot.last_included_index, snapshot.last_included_term ) print(f"Restored from snapshot at index {snapshot.last_included_index}") # InstallSnapshot RPC - for slow followers@dataclassclass InstallSnapshotRequest: """ Sent by leader when follower is too far behind to catch up with log entries. """ term: int leader_id: int # Snapshot identification last_included_index: int last_included_term: int # For chunked transfer of large snapshots offset: int data: bytes # Chunk of snapshot data done: bool # True if this is the last chunk def handle_install_snapshot(server, request: InstallSnapshotRequest): """ Follower receives snapshot from leader. 
""" if request.term < server.current_term: return {"term": server.current_term} if request.term > server.current_term: server.become_follower(request.term) server.reset_election_timer() # Write chunk to temporary file server.snapshot_buffer.write_chunk(request.offset, request.data) if request.done: # All chunks received - install snapshot snapshot_data = server.snapshot_buffer.finalize() snapshot = Snapshot( last_included_index=request.last_included_index, last_included_term=request.last_included_term, data=snapshot_data, config={} # Included in data ) server.snapshot_manager.restore_from_snapshot(snapshot) return {"term": server.current_term}Creating a snapshot while the state machine is being modified can result in an inconsistent snapshot. Production systems use techniques like: • Copy-on-write — Fork the process (etcd approach) • Pause writes — Brief pause during snapshot • MVCC — Multi-version concurrency control (etcd's BoltDB) • Incremental snapshots — Only delta since last snapshot
Production clusters need to change their membership over time: replacing failed machines, scaling the cluster up or down, and rotating nodes through maintenance and upgrades.
This is surprisingly tricky. The naive approach of switching every node to the new configuration in one step can create a situation where two disjoint majorities exist simultaneously, violating safety.
The Problem:
Consider a 3-node cluster {A, B, C} transitioning to 5 nodes {A, B, C, D, E}:
During the transition, {A, B} could form an old majority while {C, D, E} forms a new majority. Two leaders could be elected simultaneously!
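A quick sanity check in plain Python, using the node names from the example above, shows that the two majorities really can be disjoint:

```python
old_cluster = {"A", "B", "C"}            # majority size = 2
new_cluster = {"A", "B", "C", "D", "E"}  # majority size = 3

old_majority = {"A", "B"}       # a valid majority under the old configuration
new_majority = {"C", "D", "E"}  # a valid majority under the new configuration

# Disjoint quorums: each side could elect its own leader and commit entries.
assert old_majority <= old_cluster and len(old_majority) > len(old_cluster) // 2
assert new_majority <= new_cluster and len(new_majority) > len(new_cluster) // 2
assert old_majority.isdisjoint(new_majority)
```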
Raft's Solution: Single-server changes
Raft handles membership changes by adding or removing one server at a time. This guarantees that old and new majorities always overlap.
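Running the same kind of check exhaustively for a single-server change (3 to 4 nodes, the case the implementation sketch below also walks through) confirms that every old majority intersects every new majority:

```python
from itertools import combinations

old_cluster = {"A", "B", "C"}       # majority size = 2
new_cluster = {"A", "B", "C", "D"}  # majority size = 3

old_majorities = [set(c) for c in combinations(old_cluster, 2)]
new_majorities = [set(c) for c in combinations(new_cluster, 3)]

# Any old quorum and any new quorum share at least one node, so only one
# leader can be elected during the transition.
assert all(old & new for old in old_majorities for new in new_majorities)
```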
"""Safe Membership Changes: One Node at a Time The Raft paper describes "joint consensus" for multi-node changes,but most implementations (including etcd) use single-node changesbecause they're simpler and sufficient. Why single-node changes are safe: Going from N nodes to N+1 nodes:- Old majority: (N/2) + 1- New majority: ((N+1)/2) + 1 = (N/2) + 1 (same or +1) Any old majority and any new majority must share at least one node! Example: 3 → 4 nodes- Old cluster: {A, B, C}, majority = 2- New cluster: {A, B, C, D}, majority = 3 Old majorities: {A,B}, {A,C}, {B,C}New majorities: {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D} Every new majority contains at least 2 of {A,B,C}.Every pair from {A,B,C} is an old majority.Therefore: overlap guaranteed!""" from dataclasses import dataclassfrom enum import Enumfrom typing import Set class MembershipChangeType(Enum): ADD_VOTER = "add_voter" REMOVE_VOTER = "remove_voter" ADD_LEARNER = "add_learner" PROMOTE_LEARNER = "promote_learner" @dataclassclass MembershipChange: change_type: MembershipChangeType server_id: int server_address: str class MembershipManager: """ Manages cluster membership changes. """ def __init__(self, server): self.server = server self.pending_change: Optional[MembershipChange] = None def propose_change(self, change: MembershipChange) -> bool: """ Propose a membership change. Rules: 1. Only leader can propose changes 2. Only one change at a time 3. Change must be committed before proposing another """ if not self.server.is_leader(): return False if self.pending_change is not None: return False # Already have pending change # Create a special log entry for the membership change config_entry = ConfigChangeEntry( change_type=change.change_type, server_id=change.server_id, server_address=change.server_address ) # Append to log like any other entry self.server.log.append( term=self.server.current_term, command=config_entry ) self.pending_change = change # Replicate to followers self.server.replicate_to_all() return True def on_config_committed(self, change: MembershipChange): """ Called when a membership change entry is committed. """ if change.change_type == MembershipChangeType.ADD_VOTER: self.server.voters.add(change.server_id) elif change.change_type == MembershipChangeType.REMOVE_VOTER: self.server.voters.discard(change.server_id) # If we removed ourselves, step down if change.server_id == self.server.id: self.server.step_down() self.pending_change = None # etcd's learner pattern for safe node addition"""Adding a new node directly is risky:- New node has empty log- Needs to catch up (InstallSnapshot + log entries)- If it becomes a voter before catching up, it can't vote usefully- Reduces cluster availability during catch-up etcd's solution: Learners 1. Add node as LEARNER (non-voting) - Receives log entries like a follower - Does NOT participate in elections - Does NOT count toward majority 2. Wait for learner to catch up - Monitor: raft_apply_index, raft_log_index - Learner should be within a few entries of leader 3. Promote learner to VOTER - Now counts toward majority - Can vote in elections - Safe because it's already caught up""" def safe_add_node(cluster_client, new_node_address: str): """ Safely add a new node to an etcd cluster. 
""" # Step 1: Add as learner print(f"Adding {new_node_address} as learner...") cluster_client.member_add( peer_urls=[new_node_address], is_learner=True ) # Step 2: Start etcd on the new node with --initial-cluster-state=existing # Step 3: Wait for catch-up print("Waiting for learner to catch up...") while True: members = cluster_client.member_list() learner = find_member_by_address(members, new_node_address) leader = get_leader(members) # Check if learner is caught up if leader.applied_index - learner.applied_index < 100: print(f"Learner caught up (lag: {leader.applied_index - learner.applied_index})") break time.sleep(5) # Step 4: Promote to voter print(f"Promoting {new_node_address} to voter...") cluster_client.member_promote(learner.id) print("Node successfully added!")The learner pattern is a production best practice implemented in etcd, Consul, and most Raft systems. It provides: • Safe catch-up — New node syncs data without affecting quorum • Rollback capability — If the new node has issues, remove the learner without impact • Validation — Verify the new node is healthy before it becomes critical
Reading from a Raft cluster is more complex than it appears. The naive approach—read from the leader—has a subtle problem: the leader might not know it's been deposed.
If a network partition occurred and a new leader was elected, the old leader might still think it's the leader and serve stale reads. This violates linearizability.
Production systems use several strategies to provide correct reads:
"""Strategies for Linearizable (Consistent) Reads The problem:1. Client reads from Leader A2. Network partition occurs 3. New Leader B is elected4. Client's read returns stale data (Leader A doesn't know it's stale) Solutions:""" # === STRATEGY 1: Read Through Consensus ==="""Treat reads like writes - put them through Raft. Pros:- Always linearizable- Simple to implement Cons:- Slow (one RTT to majority for every read)- High load on leader and Raft log Implementation:"""async def read_through_consensus(key: str) -> str: # Create a read marker entry entry = {"type": "read", "key": key, "id": generate_id()} # Replicate through Raft (wait for commit) await replicate_and_commit(entry) # After commit, we know we're the leader and can read safely return state_machine.get(key) # === STRATEGY 2: ReadIndex (etcd's approach) ==="""Verify leadership without logging a full entry. Steps:1. Leader records its current commit index (read_index)2. Leader confirms it's still leader (heartbeat to majority)3. Leader waits for state machine to apply up to read_index4. Leader serves the read Pros:- Linearizable without log entries for reads- Much faster than full consensus reads Cons:- Still requires RTT to majority per read (or batch)"""async def read_index(key: str) -> str: if not is_leader(): raise NotLeaderError() # Step 1: Record current commit index read_index = self.commit_index # Step 2: Confirm leadership (heartbeat to majority) # This ensures no new leader has been elected acks = await send_heartbeats_and_wait() if acks < majority: raise NotLeaderError() # Lost leadership # Step 3: Wait for state machine to catch up while self.last_applied < read_index: await asyncio.sleep(0.001) # Step 4: Read is now safe return self.state_machine.get(key) # === STRATEGY 3: Lease-Based Reads ==="""Leader holds a "lease" during which it can serve reads locally. Steps:1. Leader gets heartbeat acks with timestamps2. If all followers acked within lease_duration, leader has a lease3. During lease, leader can serve reads without checking majority4. Lease automatically expires, requiring renewal Pros:- Reads don't require network round-trips during lease- Very fast (local read) Cons:- Depends on bounded clock drift assumption- If clocks skew too much, could serve stale data"""async def lease_read(key: str) -> str: if not is_leader(): raise NotLeaderError() current_time = time.monotonic() # Check if lease is still valid if current_time < self.lease_expiry: # Lease is valid - read locally without network return self.state_machine.get(key) else: # Lease expired - need to renew via heartbeats await send_heartbeats_and_renew_lease() return await lease_read(key) # Retry # === STRATEGY 4: Serializable Reads (Follower Reads) ==="""Allow reads from any server, sacrificing linearizability for performance. The read might be stale, but it's from a consistent point in time.Useful when:- Slight staleness is acceptable- Read load is high- You need horizontal read scaling etcd supports this via "Serializable: true" option."""async def serializable_read(key: str, any_server: bool = False) -> str: if any_server: # Read from local state machine (might be behind) return self.state_machine.get(key) else: # Forward to leader for linearizable read return await forward_to_leader(key) # === etcd's Read Modes ==="""etcd exposes these as client options: 1. Linearizable (default): Uses ReadIndex internally Guarantees seeing latest writes 2. 
Serializable: Reads from any server May return stale data Much higher throughput"""| Strategy | Consistency | Latency | Throughput | Clock Dependency |
|---|---|---|---|---|
| Read Through Raft | Linearizable | High (consensus) | Low | None |
| ReadIndex | Linearizable | Medium (heartbeat) | Medium | None |
| Lease-Based | Linearizable* | Very Low (local) | Very High | Bounded drift required |
| Serializable | Serializable | Very Low | Very High (any node) | None |
Deploying Raft in production requires careful attention to cluster sizing, hardware placement, monitoring, and operational procedures.
```yaml
# Example: etcd production deployment for Kubernetes

# === HARDWARE RECOMMENDATIONS ===
#
# CPU:     2-4 cores (etcd is not CPU-bound)
# Memory:  8-16 GB (for large key-value stores)
# Disk:    SSD with fsync durability - CRITICAL
#          - Avoid network-attached storage if possible
#          - Use local NVMe for best performance
# Network: Low latency between nodes (<2ms RTT ideal)

# === Kubernetes etcd StatefulSet ===
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: etcd
  replicas: 3  # Always odd number
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      # Anti-affinity: Spread across failure domains
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: etcd
            topologyKey: topology.kubernetes.io/zone
      containers:
      - name: etcd
        image: quay.io/coreos/etcd:v3.5.9
        resources:
          requests:
            cpu: "2"
            memory: "8Gi"
          limits:
            cpu: "4"
            memory: "16Gi"
        env:
        - name: ETCD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        command:
        - etcd
        - --name=$(ETCD_NAME)
        - --data-dir=/var/lib/etcd
        # === CRITICAL PERFORMANCE SETTINGS ===
        - --heartbeat-interval=100          # ms (default 100)
        - --election-timeout=1000           # ms (default 1000)
        - --snapshot-count=10000            # Entries before snapshot
        - --quota-backend-bytes=8589934592  # 8GB max size
        # === NETWORK SETTINGS ===
        - --listen-peer-urls=https://0.0.0.0:2380
        - --listen-client-urls=https://0.0.0.0:2379
        - --initial-cluster=etcd-0=https://etcd-0:2380,etcd-1=https://etcd-1:2380,etcd-2=https://etcd-2:2380
        # === SECURITY (always use TLS in production) ===
        - --peer-cert-file=/etc/etcd/peer.crt
        - --peer-key-file=/etc/etcd/peer.key
        - --peer-trusted-ca-file=/etc/etcd/ca.crt
        - --client-cert-auth=true
        volumeMounts:
        - name: data
          mountPath: /var/lib/etcd
        - name: certs
          mountPath: /etc/etcd
  # Use local SSD storage
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: local-ssd  # High-performance local storage
      resources:
        requests:
          storage: 100Gi

---
# === KEY METRICS TO MONITOR ===
#
# etcd_disk_wal_fsync_duration_seconds
#   - Measures disk performance
#   - If p99 > 10ms, disk is too slow for reliable Raft
#
# etcd_network_peer_round_trip_time_seconds
#   - Latency between etcd nodes
#   - High RTT increases election and commit latency
#
# etcd_server_has_leader
#   - 1 if cluster has a leader, 0 otherwise
#   - Alert if 0 for more than a few seconds
#
# etcd_server_leader_changes_seen_total
#   - Counter of leader elections
#   - Frequent changes indicate instability
#
# etcd_mvcc_db_total_size_in_bytes
#   - Current database size
#   - Alert if approaching quota
```

The most common cause of etcd instability is slow disk I/O. Raft requires fsync for durability, and if fsync is slow, heartbeats are delayed, causing cascading election failures. Always use SSDs, and monitor etcd_disk_wal_fsync_duration_seconds. If p99 exceeds 10ms, your disk is the bottleneck.
Understanding common failure patterns helps you diagnose and resolve issues quickly when your 3 AM pager goes off.
| Symptom | Likely Cause | Investigation | Resolution |
|---|---|---|---|
| Frequent leader elections | Slow disk fsync OR network latency | Check etcd_disk_wal_fsync_duration, etcd_network_peer_round_trip_time | Move to faster storage, reduce network latency, increase election timeout |
| Cluster won't elect leader | Quorum lost (too few nodes up) | Check member list, verify network connectivity between nodes | Restore failed nodes, or in emergency: etcdctl snapshot restore |
| One node far behind | Slow disk or network to that node | Compare raft_applied_index across nodes | Fix underlying infrastructure, or remove/replace node |
| Writes blocked but cluster seems healthy | Leader can't reach quorum for commits | Check leader's matchIndex for each follower | Debug network between leader and followers |
| Memory/disk usage growing | Snapshot not triggering, or compaction disabled | Check snapshot-count config, run compaction manually | Enable auto-compaction, trigger manual compaction |
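To tie the table above to concrete monitoring, here is a minimal sketch that scrapes an etcd /metrics endpoint (Prometheus text format) and flags the two signals the table leans on most. The URL, the `requests` dependency, and the choice of the 8ms histogram bucket as a rough stand-in for the 10ms fsync guidance are all assumptions for illustration, not etcd tooling:

```python
import requests

METRICS_URL = "https://etcd-0.example.com:2379/metrics"  # illustrative endpoint


def scrape_metrics(url: str) -> dict:
    """Parse Prometheus text-format exposition into {sample_name: value}."""
    samples = {}
    for line in requests.get(url, timeout=5).text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue
    return samples


def check_etcd_health(samples: dict) -> list:
    """Return human-readable alerts for the two most common failure signals."""
    alerts = []

    # Leadership: etcd_server_has_leader should be 1; 0 means no leader/quorum.
    if samples.get("etcd_server_has_leader", 1.0) < 1.0:
        alerts.append("no leader: cluster cannot commit writes")

    # Disk latency: fraction of WAL fsyncs completing within ~8ms.
    fast = samples.get('etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"}', 0.0)
    total = samples.get("etcd_disk_wal_fsync_duration_seconds_count", 0.0)
    if total and fast / total < 0.99:
        alerts.append("slow fsync: disk likely too slow for stable Raft heartbeats")

    return alerts


if __name__ == "__main__":
    for alert in check_etcd_health(scrape_metrics(METRICS_URL)):
        print("ALERT:", alert)
```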
```bash
#!/bin/bash

# === Essential etcd Debugging Commands ===

# Check cluster health
etcdctl endpoint health --cluster
# Output: https://etcd-0:2379 is healthy: successfully committed proposal

# Check cluster status (who is leader?)
etcdctl endpoint status --cluster -w table
# +---------------------+------------------+---------+-----------+
# |      ENDPOINT       |        ID        | VERSION | IS LEADER |
# +---------------------+------------------+---------+-----------+
# | https://etcd-0:2379 | 8e9e05c52164694d | 3.5.9   | true      |
# | https://etcd-1:2379 | 2e80f96756585b85 | 3.5.9   | false     |
# | https://etcd-2:2379 | 501e5b0c8f37c2d2 | 3.5.9   | false     |
# +---------------------+------------------+---------+-----------+

# Check member list
etcdctl member list -w table

# Check Raft terms and indices (detailed)
curl -s https://etcd-0:2379/metrics | grep -E 'etcd_server_|raft_'
# etcd_server_has_leader 1
# etcd_server_leader_changes_seen_total 3
# etcd_server_proposals_committed_total 145892
# raft_term 4
# raft_applied_index 145892

# === Diagnosing Split-Brain / Network Issues ===

# From each node, test connectivity to others
for i in 0 1 2; do
  echo "Testing from etcd-$i:"
  kubectl exec etcd-$i -- etcdctl endpoint health \
    --endpoints=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379
done

# === Performance Debugging ===

# Check fsync latency histogram
curl -s https://etcd-0:2379/metrics | grep etcd_disk_wal_fsync
# etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 45123
# etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 45400
# etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 45420
# (Most fsyncs < 1ms is healthy)

# Check round-trip time between peers
curl -s https://etcd-0:2379/metrics | grep round_trip
# etcd_network_peer_round_trip_time_seconds_bucket{To="8e9e05c52164694d",le="0.001"} 9000

# === Recovery Operations ===

# Take a snapshot
etcdctl snapshot save backup.db

# Restore from snapshot (ONLY do this if quorum is permanently lost)
etcdctl snapshot restore backup.db \
  --name etcd-0 \
  --initial-cluster etcd-0=https://etcd-0:2380 \
  --initial-advertise-peer-urls https://etcd-0:2380

# Move leader (for maintenance)
etcdctl move-leader <target-member-id>

# Compact revision history (free up space)
# Get current revision
rev=$(etcdctl endpoint status -w json | jq '.[0].Status.header.revision')
# Compact up to current revision
etcdctl compact $rev
```

| System | Primary Use Case | Notable Feature |
|---|---|---|
| etcd | Kubernetes data store | ReadIndex for efficient reads |
| Consul | Service mesh and discovery | Multi-datacenter federation |
| TiKV | Distributed database | Multi-Raft for scalability |
| CockroachDB | SQL database | Ranges as Raft groups |
| RethinkDB | Real-time database | Raft for shard coordination |
Module Complete:
You've now completed a comprehensive study of the Raft consensus algorithm. Starting from its design philosophy of understandability, through leader election, log replication, and safety guarantees, to practical production deployment—you have the knowledge to understand, implement, debug, and operate Raft-based systems.
Raft is the foundation for many of the most critical distributed systems in modern infrastructure. The concepts you've learned here—consensus, quorums, replication, safety properties—are fundamental to distributed systems engineering and will serve you throughout your career.
Congratulations! You've mastered the Raft consensus algorithm—from theoretical foundations to production deployment. You understand why Raft was designed for understandability, how leader election and log replication work, what safety properties Raft guarantees, and how systems like etcd and Consul implement Raft in practice. You're now equipped to work with Raft-based distributed systems at the deepest level.