In 2014, Diego Ongaro and John Ousterhout published "In Search of an Understandable Consensus Algorithm," introducing Raft. Their motivation was unusual in academic circles: they made understandability an explicit design goal, not just correctness and performance.
Paxos, while correct, had gained a reputation for being notoriously difficult to understand and implement correctly. Years of academic papers and production outages demonstrated that even experts struggled. Ongaro and Ousterhout argued that understandability isn't a luxury—it's essential for building reliable systems.
Raft's key innovation:
Rather than derive a minimal algorithm from first principles (as Paxos does), Raft decomposes consensus into independent subproblems:
- Leader election: choosing a single node to coordinate all writes
- Log replication: the leader copying its log to followers
- Safety: the rules that guarantee committed entries are never lost
- Membership changes: altering the cluster's node set without downtime
Each subproblem can be understood, implemented, and tested independently. The result is a protocol that's provably equivalent to (Multi-)Paxos in power, but dramatically easier to implement correctly.
By the end of this page, you will understand Raft's design philosophy, the term and election mechanism, log replication and commitment, the safety properties that ensure correctness, membership changes, and how Raft is implemented in production systems like etcd, Consul, and CockroachDB.
Before diving into Raft's mechanics, it's worth understanding why it differs from Paxos and what trade-offs it makes.
Paxos philosophy: specify the minimal, provably correct core (single-decree consensus) and leave the rest—log management, leader election, reconfiguration—as engineering exercises for implementers.

Raft philosophy: specify a complete, practical system, and whenever several correct designs exist, choose the one that is easiest to understand.
| Aspect | Paxos | Raft |
|---|---|---|
| Primary goal | Minimal correctness | Understandability |
| Leader role | Optimization (Multi-Paxos) | Fundamental requirement |
| Log gaps | Allowed (each slot independent) | Prohibited (contiguous only) |
| Entry commitment | Per-slot quorum | Log prefix matching |
| Voting restriction | None (any node can propose) | Only candidates with up-to-date log |
| Specification | Abstract protocol | Complete system |
The key Raft constraint: Contiguous logs only
In Paxos, a proposer can attempt to fill any log slot. Slot 5 might be committed while slot 3 is still undecided. This flexibility is theoretically elegant but complicates log management.
Raft requires logs to be contiguous: a leader appends entry N only at the end of its log, immediately after entry N-1—there are never holes. This constraint simplifies:
- Commitment: a single commitIndex marks everything up to it as committed
- Consistency checks: one prevLogIndex/prevLogTerm pair verifies an entire log prefix
- Follower catch-up: missing entries are always a suffix, never scattered gaps
The trade-off:
Raft sacrifices some theoretical flexibility for practical simplicity. It may be slightly slower in scenarios where out-of-order commitment could help, but the implementation is dramatically easier to get right.
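The contrast can be sketched in a few lines. The `PaxosSlots` and `RaftLog` classes below are illustrative toys, not real implementations: Paxos-style slots form a sparse map that tolerates gaps, while a Raft log only ever grows at its end.

```python
class PaxosSlots:
    """Paxos-style: each slot is decided independently; gaps are legal."""
    def __init__(self):
        self.slots = {}          # index -> command (may be sparse)

    def decide(self, index, command):
        self.slots[index] = command   # slot 5 may be decided before slot 3

class RaftLog:
    """Raft-style: entries may only be appended at the end; no gaps."""
    def __init__(self):
        self.entries = []        # contiguous list; index i is entries[i]

    def append(self, index, command):
        if index != len(self.entries):
            raise ValueError(f"Raft logs are contiguous: next index is "
                             f"{len(self.entries)}, got {index}")
        self.entries.append(command)

paxos = PaxosSlots()
paxos.decide(5, "x=1")           # fine: slot 3 is still undecided

raft = RaftLog()
raft.append(0, "x=1")
try:
    raft.append(5, "y=2")        # rejected: would create a gap
except ValueError as e:
    print(e)
```

The Raft version needs only an integer (`len(self.entries)`) to describe its entire state of progress; the Paxos version needs per-slot bookkeeping.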
In practice, most teams should use Raft (or a Raft-based system like etcd). Paxos is primarily of academic interest today, though some highly specialized systems use Paxos variants for their flexibility. If you're implementing consensus from scratch, Raft is the clear choice.
Raft defines exactly three states a node can be in, and uses terms as a logical clock to order events.
The Three States:
1. Follower — passive; responds to RPCs from leaders and candidates but never initiates requests. All nodes start here.
2. Candidate — a follower whose election timeout expired; it solicits votes to become leader.
3. Leader — handles all client requests and replicates log entries to followers, sending periodic heartbeats to retain authority.
Terms: Raft's Logical Clock
Raft divides time into terms, numbered consecutively starting at 1. Each term has at most one leader.
Term guarantees:
- Terms increase monotonically; every new election increments the term
- Each term has at most one leader (possibly none, if the vote splits)
- A node that sees a higher term immediately adopts it and reverts to follower
- Messages carrying a stale term are rejected
```python
import random
import time
from typing import Dict, List, Optional

class RaftNode:
    """
    Raft node state management.
    """
    def __init__(self, node_id: str, peers: List[str], storage: DurableStorage):
        self.node_id = node_id
        self.peers = peers
        self.storage = storage

        # Persistent state (survives crashes)
        self.current_term: int = storage.get("current_term", 0)
        self.voted_for: Optional[str] = storage.get("voted_for", None)
        self.log: List[LogEntry] = storage.get("log", [])

        # Volatile state
        self.state: NodeState = NodeState.FOLLOWER
        self.commit_index: int = -1
        self.last_applied: int = -1

        # Leader-only volatile state
        self.next_index: Dict[str, int] = {}   # For each peer
        self.match_index: Dict[str, int] = {}  # Highest replicated index

        # Timing
        self.election_timeout = self._random_timeout(150, 300)
        self.last_heartbeat = time.time()

    def _random_timeout(self, min_ms: int, max_ms: int) -> float:
        """
        Random election timeout in range [min_ms, max_ms].
        Randomization prevents split votes.
        """
        return random.randint(min_ms, max_ms) / 1000.0

    def check_election_timeout(self):
        """
        Check if election timeout has expired.
        If so, transition to candidate and start election.
        """
        if self.state == NodeState.LEADER:
            return  # Leaders don't time out

        elapsed = time.time() - self.last_heartbeat
        if elapsed >= self.election_timeout:
            self._start_election()

    def _start_election(self):
        """
        Transition to candidate and start new election.
        """
        # Increment term for new election
        self.current_term += 1
        self._persist("current_term", self.current_term)

        # Vote for ourselves
        self.voted_for = self.node_id
        self._persist("voted_for", self.voted_for)
        self.state = NodeState.CANDIDATE
        votes_received = 1  # Our own vote

        # Reset election timeout
        self.election_timeout = self._random_timeout(150, 300)
        self.last_heartbeat = time.time()

        # Send RequestVote to all peers
        for peer in self.peers:
            response = self._request_vote(peer)
            if response and response.vote_granted:
                votes_received += 1

        # Check if we won
        if votes_received > (len(self.peers) + 1) // 2:
            self._become_leader()
        # Otherwise, either retry on timeout or become follower

    def _become_leader(self):
        """
        Transition to leader state.
        """
        self.state = NodeState.LEADER

        # Initialize leader state
        for peer in self.peers:
            self.next_index[peer] = len(self.log)
            self.match_index[peer] = -1

        # Send initial heartbeats immediately
        self._send_heartbeats()
        print(f"Node {self.node_id}: Became leader for term {self.current_term}")

    def update_term(self, term: int):
        """
        Update to higher term, reverting to follower if needed.
        """
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
            self.state = NodeState.FOLLOWER
            self._persist("current_term", self.current_term)
            self._persist("voted_for", None)
```

Randomized election timeouts (typically 150-300ms) prevent synchronized elections that could cause perpetual split votes. Without randomization, nodes would time out simultaneously, all become candidates, and split the vote—repeating indefinitely. This randomization, together with timing assumptions, is how Raft sidesteps FLP impossibility in practice.
Raft's leader election is simpler than Paxos's because it uses a single voting rule: a voter grants its vote only if the candidate's log is at least as up-to-date as the voter's own log.
The RequestVote RPC:
RequestVote RPC:
term - Candidate's term
candidateId - Candidate requesting vote
lastLogIndex - Index of candidate's last log entry
lastLogTerm - Term of candidate's last log entry
Response:
term - CurrentTerm (for candidate to update itself)
voteGranted - True if candidate received vote
Voting rules:
- A node grants at most one vote per term, first-come-first-served, and persists the vote to disk before responding
- A vote is granted only if the candidate's term is at least the voter's current term
- A vote is granted only if the candidate's log passes the up-to-date check below

Log comparison:

Candidate's log is "at least as up-to-date" if:
- The candidate's last log entry has a higher term than the voter's last entry, or
- The last log terms are equal and the candidate's log is at least as long
```python
class RaftNode:
    """
    Raft leader election implementation.
    """
    def handle_request_vote(self, request: RequestVoteRequest) -> RequestVoteResponse:
        """
        Handle RequestVote RPC from candidate.

        Grant vote if:
        1. Candidate's term >= our term
        2. We haven't voted for someone else in this term
        3. Candidate's log is at least as up-to-date as ours
        """
        # Update term if candidate has higher term
        if request.term > self.current_term:
            self.update_term(request.term)

        # Reject if candidate's term is stale
        if request.term < self.current_term:
            return RequestVoteResponse(
                term=self.current_term,
                vote_granted=False
            )

        # Check if we can vote for this candidate
        can_vote = (
            self.voted_for is None
            or self.voted_for == request.candidate_id
        )

        # Check if candidate's log is up-to-date
        last_log_term = self._get_last_log_term()
        last_log_index = len(self.log) - 1
        log_ok = (
            request.last_log_term > last_log_term
            or (request.last_log_term == last_log_term
                and request.last_log_index >= last_log_index)
        )

        if can_vote and log_ok:
            # Grant vote
            self.voted_for = request.candidate_id
            self._persist("voted_for", self.voted_for)
            # Reset election timeout (we've heard from a valid candidate)
            self.last_heartbeat = time.time()
            return RequestVoteResponse(
                term=self.current_term,
                vote_granted=True
            )
        else:
            return RequestVoteResponse(
                term=self.current_term,
                vote_granted=False
            )

    def _get_last_log_term(self) -> int:
        """Get term of last log entry, or 0 if log is empty."""
        if not self.log:
            return 0
        return self.log[-1].term

    def _request_vote(self, peer: str) -> Optional[RequestVoteResponse]:
        """
        Send RequestVote RPC to a peer.
        """
        request = RequestVoteRequest(
            term=self.current_term,
            candidate_id=self.node_id,
            last_log_index=len(self.log) - 1,
            last_log_term=self._get_last_log_term()
        )
        try:
            return self._send_rpc(peer, "RequestVote", request)
        except NetworkTimeout:
            return None


# Example: Election scenario trace
"""
Scenario: 3-node cluster, Node A is leader, then crashes

Initial state:
- Node A (Leader, term=5):   log = [e1, e2, e3, e4, e5]
- Node B (Follower, term=5): log = [e1, e2, e3, e4, e5]
- Node C (Follower, term=5): log = [e1, e2, e3, e4]  # Slightly behind

Node A crashes. Network delays cause different election timeouts.

1. Node B times out first, becomes candidate, term=6
2. Node B sends RequestVote(term=6, lastLogIndex=4, lastLogTerm=5) to C
3. Node C grants vote:
   - term=6 >= currentTerm=5 ✓
   - Haven't voted in term 6 ✓
   - B's log (term=5, index=4) vs C's log (term=5, index=3)
   - Same term, B has more entries ✓
4. Node B has 2 votes (self + C), becomes leader

If Node C had timed out first:
1. Node C becomes candidate, term=6
2. Node C sends RequestVote(term=6, lastLogIndex=3, lastLogTerm=5) to B
3. Node B rejects:
   - C's log (term=5, index=3) vs B's log (term=5, index=4)
   - Same term but C has fewer entries ✗
4. Node C doesn't get majority, times out
5. Eventually B or C wins with higher log
"""
```

The log up-to-date check ensures the Leader Completeness property: a leader for any term must have all entries committed in previous terms. This is critical for safety—without it, a new leader might overwrite committed entries.
Once a leader is elected, it handles all client requests by appending entries to its log and replicating them to followers.
The AppendEntries RPC:
This single RPC handles both heartbeats (empty entries) and log replication:
AppendEntries RPC:
term - Leader's term
leaderId - For followers to redirect clients
prevLogIndex - Index of log entry immediately preceding new ones
prevLogTerm - Term of prevLogIndex entry
entries[] - Log entries to store (empty for heartbeat)
leaderCommit - Leader's commit index
Response:
term - CurrentTerm (for leader to update itself)
success - True if follower had entry matching prevLogIndex/Term
The Log Matching Property:
Raft maintains a crucial invariant:
If two entries in different logs have the same index and term, they store the same command AND all preceding entries are identical.
This is enforced by:
- Leaders create at most one entry per index within a term, and never modify an existing entry
- The AppendEntries consistency check: a follower accepts new entries only if its log already contains an entry whose prevLogIndex/prevLogTerm matches the leader's
```python
class RaftLeader:
    """
    Raft leader log replication.
    """
    def replicate_command(self, command: Any) -> bool:
        """
        Client-facing: Add command to log and replicate.
        Returns True if command is successfully committed.
        """
        # Append to local log
        entry = LogEntry(
            term=self.current_term,
            command=command,
            index=len(self.log)
        )
        self.log.append(entry)
        self._persist("log", self.log)

        # Replicate to followers
        return self._replicate_to_majority()

    def _replicate_to_majority(self) -> bool:
        """
        Replicate pending entries to all followers.
        Wait for majority acknowledgment.
        """
        while True:
            success_count = 1  # Count ourselves
            for peer in self.peers:
                if self._send_append_entries(peer):
                    success_count += 1

            majority = (len(self.peers) + 1) // 2 + 1
            if success_count >= majority:
                self._advance_commit_index()
                return True

            # If not enough responses, retry
            time.sleep(0.010)  # Brief backoff

    def _send_append_entries(self, peer: str) -> bool:
        """
        Send AppendEntries to a single follower.

        Uses next_index to determine which entries to send.
        Decrements next_index on failure for Log Matching.
        """
        next_idx = self.next_index[peer]

        # Prepare entries to send
        if next_idx < len(self.log):
            entries = self.log[next_idx:]
        else:
            entries = []  # Heartbeat

        # Get prev log info for consistency check
        prev_log_index = next_idx - 1
        prev_log_term = 0
        if prev_log_index >= 0:
            prev_log_term = self.log[prev_log_index].term

        request = AppendEntriesRequest(
            term=self.current_term,
            leader_id=self.node_id,
            prev_log_index=prev_log_index,
            prev_log_term=prev_log_term,
            entries=entries,
            leader_commit=self.commit_index
        )

        try:
            response = self._send_rpc(peer, "AppendEntries", request)
            if response.success:
                # Update tracking for this follower
                self.next_index[peer] = len(self.log)
                self.match_index[peer] = len(self.log) - 1
                return True
            else:
                # Log mismatch - decrement next_index and retry
                self.next_index[peer] = max(0, self.next_index[peer] - 1)
                return False
        except NetworkTimeout:
            return False

    def _advance_commit_index(self):
        """
        Advance commit index based on majority replication.

        An entry is committed when replicated to a majority
        AND the entry's term equals current term.
        """
        for n in range(self.commit_index + 1, len(self.log)):
            if self.log[n].term != self.current_term:
                continue  # Only commit entries from current term

            # Count replications (including ourselves)
            replicated = 1
            for peer in self.peers:
                if self.match_index.get(peer, -1) >= n:
                    replicated += 1

            majority = (len(self.peers) + 1) // 2 + 1
            if replicated >= majority:
                self.commit_index = n
                self._apply_committed_entries()


class RaftFollower:
    """
    Raft follower handling AppendEntries.
    """
    def handle_append_entries(self, request: AppendEntriesRequest) -> AppendEntriesResponse:
        """
        Handle AppendEntries RPC from leader.

        1. Term check
        2. Log consistency check
        3. Append new entries
        4. Update commit index
        """
        # Update term if leader has higher term
        if request.term > self.current_term:
            self.update_term(request.term)

        # Reject if stale term
        if request.term < self.current_term:
            return AppendEntriesResponse(
                term=self.current_term,
                success=False
            )

        # Valid leader - reset election timeout
        self.last_heartbeat = time.time()
        self.state = NodeState.FOLLOWER

        # Log consistency check
        if request.prev_log_index >= 0:
            if request.prev_log_index >= len(self.log):
                # We're missing entries
                return AppendEntriesResponse(
                    term=self.current_term,
                    success=False
                )
            if self.log[request.prev_log_index].term != request.prev_log_term:
                # Conflict - delete this entry and all after it
                self.log = self.log[:request.prev_log_index]
                self._persist("log", self.log)
                return AppendEntriesResponse(
                    term=self.current_term,
                    success=False
                )

        # Append new entries (if any)
        for i, entry in enumerate(request.entries):
            log_index = request.prev_log_index + 1 + i
            if log_index < len(self.log):
                if self.log[log_index].term != entry.term:
                    # Conflict - truncate and append
                    self.log = self.log[:log_index]
                    self.log.append(entry)
                # else: entry already exists correctly
            else:
                self.log.append(entry)
        self._persist("log", self.log)

        # Update commit index
        if request.leader_commit > self.commit_index:
            self.commit_index = min(
                request.leader_commit,
                len(self.log) - 1
            )
            self._apply_committed_entries()

        return AppendEntriesResponse(
            term=self.current_term,
            success=True
        )
```

A leader only commits entries from its current term. Why? If a leader commits entries from a previous term by counting replicas, those entries might be overwritten by a higher-term leader with a different log. By only counting current-term entries, the leader ensures it has the authority to make that commitment.
Raft guarantees five key properties that together ensure correct consensus:
1. Election Safety
At most one leader can be elected in any given term.
Ensured by: Each node votes for at most one candidate per term, and a majority is required to win.
2. Leader Append-Only
A leader never overwrites or deletes entries in its log; it only appends.
Ensured by: Leader's log is authoritative; it only adds entries at the end.
3. Log Matching
If two logs contain an entry with the same index and term, the logs are identical in all preceding entries.
Ensured by: AppendEntries consistency check with prevLogIndex/prevLogTerm.
4. Leader Completeness
If a log entry is committed in a given term, that entry will be present in the logs of all leaders for higher-numbered terms.
Ensured by: Voting restriction requiring candidate's log to be up-to-date.
5. State Machine Safety
If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for that index.
Ensured by: Combination of Log Matching and Leader Completeness.
| Property | Mechanism | What Could Go Wrong Without It |
|---|---|---|
| Election Safety | Majority vote + one vote per term | Two leaders accept conflicting writes |
| Leader Append-Only | Leader never modifies existing entries | Applied entries could be lost |
| Log Matching | prevLogIndex/Term check | Logs diverge, different commands at same index |
| Leader Completeness | Log up-to-date voting restriction | New leader overwrites committed entries |
| State Machine Safety | All above combined | Replicas have different state |
The Leader Completeness Proof Sketch:
Suppose entry E was committed in term T. We show any leader for term T' > T has E in its log:
1. Committing E required replicating it to a majority of nodes.
2. Winning the election for term T' required votes from a majority of nodes.
3. Any two majorities of the same cluster intersect, so at least one voter for the T' leader has E in its log.
4. That voter only granted its vote because the candidate's log was at least as up-to-date as its own, which (by induction over the terms between T and T') means the new leader's log contains E.
This chain of reasoning ensures committed entries can never be lost.
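The pivotal step—the commit majority and the election majority must share at least one node—can be checked by brute force for a small cluster:

```python
from itertools import combinations

# In a 5-node cluster, a majority is any 3 nodes. Enumerate all of them
# and verify that every pair of majorities shares at least one node.
nodes = ["A", "B", "C", "D", "E"]
majorities = [set(c) for c in combinations(nodes, 3)]

overlaps = all(m1 & m2 for m1 in majorities for m2 in majorities)
print(overlaps)  # True: every pair of majorities intersects
```

This is why the voter holding E is guaranteed to exist: the majority that acknowledged E and the majority that elected the new leader cannot be disjoint.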
Raft's safety properties form a logical chain. If you understand why each property holds and how they combine, you understand Raft's correctness. This modularity is precisely what makes Raft more understandable than Paxos, where the safety argument is more tightly interleaved.
Real Raft implementations must handle several edge cases that don't appear in the basic algorithm description.
1. Log Compaction and Snapshots
Logs grow indefinitely, eventually exhausting storage. Raft systems periodically compact the log by:
- Taking a snapshot of the state machine at some committed index
- Recording lastIncludedIndex and lastIncludedTerm in the snapshot metadata
- Discarding all log entries up to and including that index
If a follower is far behind, the leader sends an InstallSnapshot RPC instead of individual log entries.
2. Cluster Membership Changes
Changing the cluster membership (adding/removing nodes) is tricky because there's a window where two different majorities could exist (old cluster vs. new cluster).
Raft solves this with joint consensus:
- The leader first replicates a joint configuration C_old,new, under which elections and commitment require majorities from both the old and the new node sets
- Once C_old,new is committed, the leader replicates the new configuration C_new
- Once C_new is committed, nodes outside the new configuration can shut down
3. Pre-Vote Protocol
In some scenarios, a disconnected node repeatedly times out and increments its term. When it reconnects, its high term disrupts the cluster. The pre-vote extension has candidates probe whether they could win before incrementing their term.
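A minimal sketch of the pre-vote idea follows. The function names and dictionary shapes here are illustrative assumptions, not from any production implementation: the node applies the ordinary up-to-date check hypothetically, without incrementing its term, and only starts a real election if a majority would say yes.

```python
def would_grant(voter_last_term, voter_last_index,
                cand_last_term, cand_last_index):
    """The standard log up-to-date check, applied hypothetically."""
    return (cand_last_term > voter_last_term or
            (cand_last_term == voter_last_term and
             cand_last_index >= voter_last_index))

def pre_vote(candidate, peers):
    """Return True if the candidate could win a real election,
    judged purely from peer log states. No terms are incremented."""
    yes = 1  # the candidate would vote for itself
    for p in peers:
        if would_grant(p["last_term"], p["last_index"],
                       candidate["last_term"], candidate["last_index"]):
            yes += 1
    return yes > (len(peers) + 1) // 2

# A reconnecting node with a stale log fails pre-vote, so it never
# bumps its term and never forces the current leader to step down.
stale = {"last_term": 5, "last_index": 3}
peers = [{"last_term": 5, "last_index": 7},
         {"last_term": 5, "last_index": 7}]
print(pre_vote(stale, peers))  # False
```

Only after pre-vote succeeds does the node increment its term and send real RequestVote RPCs, so disruptive term inflation never happens.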
```python
class RaftSnapshotManager:
    """
    Log compaction via snapshots.
    """
    def __init__(self, node: RaftNode, snapshot_threshold: int = 10000):
        self.node = node
        self.snapshot_threshold = snapshot_threshold

        # Snapshot state
        self.last_included_index: int = -1
        self.last_included_term: int = 0
        self.snapshot_data: Optional[bytes] = None

    def maybe_snapshot(self):
        """
        Take snapshot if log is too long.
        """
        if len(self.node.log) < self.snapshot_threshold:
            return

        # Can only snapshot committed entries
        if self.node.commit_index <= self.last_included_index:
            return

        # Get state machine snapshot
        snapshot_index = self.node.commit_index
        snapshot_data = self.node.state_machine.snapshot()

        # Update snapshot metadata
        self.last_included_index = snapshot_index
        self.last_included_term = self.node.log[snapshot_index].term
        self.snapshot_data = snapshot_data

        # Compact log (keep entries after snapshot)
        self.node.log = self.node.log[snapshot_index + 1:]

        # Persist snapshot and compacted log
        self._persist_snapshot()

    def install_snapshot(self, leader_term: int, last_included_index: int,
                         last_included_term: int, data: bytes) -> bool:
        """
        Handle InstallSnapshot RPC from leader.
        Used when follower is too far behind for normal replication.
        """
        if leader_term < self.node.current_term:
            return False

        if last_included_index <= self.last_included_index:
            return True  # Already have this or newer snapshot

        # Discard entire log if snapshot covers all of it
        if last_included_index >= len(self.node.log):
            self.node.log = []
        else:
            # Keep entries after snapshot
            self.node.log = self.node.log[last_included_index + 1:]

        # Update snapshot state
        self.last_included_index = last_included_index
        self.last_included_term = last_included_term
        self.snapshot_data = data

        # Install snapshot to state machine
        self.node.state_machine.restore(data)
        self.node.commit_index = last_included_index
        self.node.last_applied = last_included_index

        self._persist_snapshot()
        return True


class RaftMembershipChange:
    """
    Joint consensus for membership changes.
    """
    def __init__(self, node: RaftNode):
        self.node = node
        self.configuration = ClusterConfiguration(node.peers + [node.node_id])
        self.transitioning = False

    def change_membership(self, new_nodes: List[str]) -> bool:
        """
        Change cluster membership from current to new configuration.
        Uses joint consensus: C_old+new -> C_new
        """
        if self.transitioning:
            return False  # One change at a time
        self.transitioning = True

        # Phase 1: Enter joint configuration
        joint_config = JointConfiguration(
            old_nodes=self.configuration.nodes,
            new_nodes=new_nodes
        )

        # Replicate joint config as log entry
        joint_entry = ConfigurationEntry(
            type="JOINT",
            config=joint_config
        )
        if not self.node.replicate_command(joint_entry):
            self.transitioning = False
            return False

        # Phase 2: Transition to new configuration
        new_config = ClusterConfiguration(new_nodes)
        new_entry = ConfigurationEntry(
            type="NEW",
            config=new_config
        )
        if not self.node.replicate_command(new_entry):
            # Tricky: we're in joint config, must complete transition
            # In practice, retry until success
            pass

        self.configuration = new_config
        self.transitioning = False
        return True


class JointConfiguration:
    """
    Joint consensus configuration requiring majorities from both sets.
    """
    def __init__(self, old_nodes: List[str], new_nodes: List[str]):
        self.old_nodes = old_nodes
        self.new_nodes = new_nodes

    def is_quorum(self, responding_nodes: Set[str]) -> bool:
        """
        Quorum requires majority from BOTH old and new configurations.
        """
        old_count = len(responding_nodes.intersection(self.old_nodes))
        new_count = len(responding_nodes.intersection(self.new_nodes))

        old_majority = old_count > len(self.old_nodes) // 2
        new_majority = new_count > len(self.new_nodes) // 2

        return old_majority and new_majority
```

An alternative to joint consensus is to limit changes to one server at a time. Adding or removing one server at a time guarantees that old and new majorities overlap, eliminating the need for joint configurations. Many Raft implementations use this simpler approach.
Raft is widely deployed in production systems. Understanding these implementations helps bridge theory and practice.
etcd:
The distributed key-value store used by Kubernetes for cluster state. etcd's Raft implementation (etcd/raft) is a reference implementation used by many other Go projects.
Consul:
HashiCorp's service discovery and configuration platform uses Raft for its catalog and KV store.
CockroachDB:
A distributed SQL database using Raft for replication within each range (data partition).
TiKV:
A distributed transactional key-value store, part of TiDB ecosystem.
| System | Language | Use Case | Notable Features |
|---|---|---|---|
| etcd | Go | Kubernetes, distributed KV | Reference implementation, leases, watch |
| Consul | Go | Service discovery, config | Multi-DC, health checks |
| CockroachDB | Go | Distributed SQL | Multi-Raft, geo-partitioning |
| TiKV | Rust | Transactional KV | Region-based scaling |
| RethinkDB | C++ | Document database | Changefeeds |
| Neo4j (Enterprise) | Java | Graph database | Clustering |
Production Considerations:
1. Backpressure: When the leader is overwhelmed, it must slow down accepting new proposals. Without backpressure, the leader's memory grows unboundedly as pending entries queue.
2. Batching: Committing one entry per round trip is inefficient. Production systems batch multiple entries per AppendEntries RPC and commit them together.
3. Pipelining: Instead of waiting for one AppendEntries to complete before sending the next, pipeline multiple requests. This hides network latency.
4. Read Optimization: Linearizable reads from the leader require confirming leadership. Options:
- ReadIndex: record the current commit index, confirm leadership with a round of heartbeats, then serve the read once the state machine has applied up to that index
- Leader leases: serve reads directly during a time-bounded lease, at the cost of trusting bounded clock drift
- Log reads: append the read to the log like a write—always safe, but slowest
5. Multi-Raft: For large-scale systems, run many independent Raft groups, each managing a data partition. This requires careful connection pooling and heartbeat coalescing.
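As an illustration of the batching idea above (the `BatchingLeader` class is a toy sketch, not how etcd or Consul implement it), a leader can drain its pending command queue into a single AppendEntries RPC instead of sending one RPC per command:

```python
from collections import deque

class BatchingLeader:
    """Toy sketch: amortize RPC cost by shipping many entries per call."""
    def __init__(self, max_batch=64):
        self.pending = deque()   # commands awaiting replication
        self.max_batch = max_batch
        self.rpcs_sent = 0

    def propose(self, command):
        self.pending.append(command)

    def flush(self):
        """Drain up to max_batch pending entries into one AppendEntries RPC."""
        if not self.pending:
            return []
        batch = [self.pending.popleft()
                 for _ in range(min(self.max_batch, len(self.pending)))]
        self.rpcs_sent += 1   # one RPC carries the whole batch
        return batch

leader = BatchingLeader(max_batch=64)
for i in range(100):
    leader.propose(f"set x{i}")

first = leader.flush()   # 64 entries in one RPC
second = leader.flush()  # remaining 36 entries in a second RPC
print(len(first), len(second), leader.rpcs_sent)  # 64 36 2
```

Without batching, those 100 commands would have cost 100 round trips; with it, two. Production systems combine this with pipelining so the second batch is in flight before the first is acknowledged.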
Unless you have very specific requirements, use an existing Raft implementation or Raft-based system. etcd's raft library, HashiCorp's raft, and tikv/raft-rs are battle-tested and handle the many edge cases that catch new implementations. Consensus bugs are among the hardest to detect and most costly to fix.
We've covered Raft comprehensively, from its philosophy to production deployment. Let's consolidate the key insights:
- Raft was designed for understandability, decomposing consensus into leader election, log replication, safety, and membership changes
- Terms act as a logical clock; randomized election timeouts prevent perpetual split votes
- Contiguous logs plus the AppendEntries consistency check yield the Log Matching property
- The voting restriction yields Leader Completeness: committed entries survive leader changes
- A leader only directly commits entries from its own term
- Production systems layer snapshots, batching, pipelining, and read optimizations on top of the core protocol
What's next:
With leader election and consensus covered, the final piece of distributed coordination is distributed locking—ensuring exclusive access to resources across a network. We'll explore how to build correct locks on top of consensus and the subtle pitfalls that trap unwary developers.
You now understand Raft, the dominant consensus algorithm in modern distributed systems. From its elegantly decomposed design to its production implementations in etcd, Consul, and CockroachDB, Raft demonstrates that understandability and correctness can coexist. Next, we'll see how consensus enables distributed locking.