In 2013, Diego Ongaro and John Ousterhout at Stanford University published a paper that would fundamentally reshape how engineers think about distributed consensus. Their creation—the Raft consensus algorithm—wasn't faster than Paxos. It wasn't more efficient. It wasn't more powerful. It was something far more valuable: understandable.
This might seem like a strange design goal. Shouldn't algorithms be optimized for performance, correctness, or scalability? Why prioritize "understandability"?
The answer lies in a profound insight about real-world distributed systems: an algorithm that engineers cannot understand is an algorithm they cannot implement correctly, debug effectively, or extend safely. Paxos, despite its elegance and provable correctness, had become notorious for the gap between its theoretical description and practical implementation. Raft was designed to close that gap.
By the end of this page, you will understand:
• Why understandability is a critical design constraint for consensus algorithms
• The fundamental problems that Raft and Paxos both solve
• How Raft decomposes consensus into manageable subproblems
• The key innovations that make Raft easier to reason about than Paxos
• The state machine replication paradigm that underlies Raft's design
Before we can appreciate Raft's approach, we need to understand the fundamental problem it solves. Consensus is the challenge of getting multiple computers (nodes) in a distributed system to agree on a single value, even when some nodes may fail at any time.
This sounds deceptively simple. Imagine five servers that need to agree on "who is the current leader?" If all network messages arrive instantly and no servers ever fail, this is trivial. But in the real world:
• Messages can be delayed, reordered, duplicated, or lost entirely
• Servers can crash at any moment and come back with only their persisted state
• Network partitions can split the cluster into groups that cannot reach each other
• There is no shared clock, so "which event happened first" is often ambiguous
Despite all these challenges, we need the distributed system to:
| Property | Definition | What It Prevents |
|---|---|---|
| Agreement (Safety) | All nodes that decide must decide on the same value | Split-brain scenarios where different parts of the system believe different things |
| Validity | If a node decides on value v, then v was proposed by some node | Random or malicious values from being chosen |
| Termination (Liveness) | Eventually, all non-faulty nodes decide on some value | Indefinite waiting or deadlock |
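To make the Agreement row concrete, here is a minimal sketch (illustrative, not part of Raft itself) of why majority quorums prevent split-brain: any two majorities of the same cluster must share at least one server, so two conflicting decisions can never both be ratified.

```python
from itertools import combinations

def majorities(cluster_size: int):
    """Yield every subset of servers that forms a strict majority."""
    servers = range(cluster_size)
    need = cluster_size // 2 + 1
    for size in range(need, cluster_size + 1):
        yield from combinations(servers, size)

# Any two majorities of a 5-server cluster overlap in at least one server,
# which is the structural reason a cluster cannot commit two conflicting
# decisions in different halves of a partition.
for q1 in majorities(5):
    for q2 in majorities(5):
        assert set(q1) & set(q2), "two majorities must intersect"
print("every pair of majorities overlaps")
```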
The Fischer-Lynch-Paterson (FLP) result proves that in a fully asynchronous system where even a single node may crash, no deterministic algorithm can guarantee that consensus always terminates. This seems devastating—but Raft sidesteps it by using randomization (its randomized election timeouts) and, like Paxos, by assuming the network eventually stabilizes. In practice, these algorithms work remarkably well.
Raft is not just a consensus algorithm—it's a framework for Replicated State Machine (RSM) systems. Understanding this paradigm is essential to understanding why Raft works the way it does.
A state machine is a simple concept: it's a system that processes a sequence of commands, where each command transitions the machine from one state to another deterministically. Given the same starting state and the same sequence of commands, the state machine always ends in the same final state.
Replicated state machines extend this idea across multiple servers:
```python
# Conceptual model of a replicated state machine
from typing import Any


class ReplicatedStateMachine:
    def __init__(self):
        self.state = {}        # The current state (e.g., key-value store)
        self.log = []          # Ordered list of commands
        self.last_applied = 0  # Index of last command applied to state

    def apply_command(self, command: dict) -> Any:
        """
        Apply a single command to the state machine.
        Commands are deterministic - same input = same output.
        """
        if command["type"] == "SET":
            self.state[command["key"]] = command["value"]
            return f"OK: set {command['key']}"
        elif command["type"] == "GET":
            return self.state.get(command["key"], "NOT_FOUND")
        elif command["type"] == "DELETE":
            if command["key"] in self.state:
                del self.state[command["key"]]
                return f"OK: deleted {command['key']}"
            return "NOT_FOUND"

    def apply_log_entries(self):
        """
        Apply all committed log entries that haven't been applied yet.
        This is called after consensus confirms entries are committed.
        """
        while self.last_applied < len(self.log):
            entry = self.log[self.last_applied]
            result = self.apply_command(entry["command"])
            self.last_applied += 1
            # In practice, we'd notify the client of the result here

    def append_to_log(self, entry: dict):
        """
        Append a new entry to the log.
        The entry contains: term, index, and command.
        """
        self.log.append(entry)


# Example: How identical logs lead to identical states
server1 = ReplicatedStateMachine()
server2 = ReplicatedStateMachine()

# If both servers receive the same log entries in the same order...
commands = [
    {"type": "SET", "key": "x", "value": 1},
    {"type": "SET", "key": "y", "value": 2},
    {"type": "SET", "key": "x", "value": 3},  # Overwrites x
]

for i, cmd in enumerate(commands):
    entry = {"term": 1, "index": i, "command": cmd}
    server1.append_to_log(entry)
    server2.append_to_log(entry)

server1.apply_log_entries()
server2.apply_log_entries()

# ...they will have identical final states
assert server1.state == server2.state  # Both: {"x": 3, "y": 2}
```

The insight here is profound: if we can get all servers to agree on the sequence of log entries, the rest is deterministic. Consensus becomes the problem of agreeing on "what is the next entry in the log?"—a much more concrete problem than "agree on any value."
This is why Raft (and Paxos-based systems like Multi-Paxos) focuses on log replication as the core abstraction. The log provides a single total order over all commands, a durable record of what has been decided, and a simple way for a crashed or lagging server to catch up by replaying entries.
Leslie Lamport's Paxos algorithm, first described in 1989 (though the paper was not published until 1998, and not widely understood for years after that), was a monumental achievement. It provides a provably correct solution to distributed consensus. So why did Raft need to exist?
The short answer: Paxos is notoriously difficult to understand and implement.
This isn't a casual criticism. It's backed by decades of evidence:
• Lamport himself followed up with "Paxos Made Simple" (2001) because so few readers got through the original paper
• Google's "Paxos Made Live" (2007) reported that turning the published algorithm into a production system required substantial engineering not covered by the papers
• Real-world implementations routinely deviate from the published protocol, which makes its correctness arguments hard to carry over
The Raft authors conducted a formal study comparing Paxos and Raft understandability. They asked students who had studied both algorithms to answer questions about each. The results were striking: students scored significantly higher on Raft questions, and a large majority reported finding Raft easier to understand.
Understandability isn't a luxury—it's a safety requirement. When engineers misunderstand an algorithm:
• They introduce subtle bugs that violate the very guarantees the algorithm exists to provide
• They cannot tell correct behavior from broken behavior when debugging production incidents
• They make "optimizations" or extensions that silently break safety properties
Diego Ongaro, Raft's primary author, stated: "In developing Raft, we had the goal of understandability foremost in our minds... For each design question, we evaluated alternatives and chose the one that was easiest to understand and explain." This wasn't about making Raft "dumbed down"—it was about recognizing that human cognition is a legitimate design constraint.
Raft's most important contribution is decomposing consensus into three independently understandable subproblems. While Paxos presents consensus as a single, unified protocol, Raft explicitly separates concerns:
This decomposition is not just pedagogical—it's fundamental to the protocol design. Each subproblem has clear invariants, clear failure modes, and clear recovery procedures. You can understand and verify each component independently.
| Subproblem | What It Means | Key Mechanism | Guarantees |
|---|---|---|---|
| Leader Election | Who coordinates decisions? | Term numbers + majority votes | At most one leader per term |
| Log Replication | How are decisions distributed? | AppendEntries RPC + majority acknowledgment | Committed entries never lost |
| Safety | What prevents inconsistency? | Election restriction + log matching | All servers apply same commands in same order |
```python
# Raft's structure: Each server is in one of three states
from enum import Enum


class ServerState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


class RaftServer:
    """
    A Raft server's state can be decomposed into:

    1. PERSISTENT state (survives crashes):
       - currentTerm: latest term server has seen
       - votedFor: candidateId that received vote in current term
       - log[]: log entries; each entry contains command and term

    2. VOLATILE state (rebuilt after crash):
       - commitIndex: index of highest log entry known to be committed
       - lastApplied: index of highest log entry applied to state machine

    3. VOLATILE state on LEADERS only:
       - nextIndex[]: for each server, index of next log entry to send
       - matchIndex[]: for each server, index of highest log entry replicated
    """

    def __init__(self, server_id: int, peer_ids: list):
        self.id = server_id
        self.peers = peer_ids

        # === PERSISTENT STATE ===
        self.current_term = 0
        self.voted_for = None
        self.log = []  # List of {term, command} entries

        # === VOLATILE STATE ===
        self.commit_index = 0
        self.last_applied = 0

        # === STATE ===
        self.state = ServerState.FOLLOWER

        # === LEADER-ONLY VOLATILE STATE ===
        self.next_index = {}   # Initialized on becoming leader
        self.match_index = {}  # Initialized on becoming leader

    def become_leader(self):
        """
        Called after winning an election.
        Initializes leader-specific state.
        """
        self.state = ServerState.LEADER
        # For each server, assume they're caught up, then discover otherwise
        for peer in self.peers:
            self.next_index[peer] = len(self.log) + 1
            self.match_index[peer] = 0

    def become_follower(self, term: int):
        """
        Called when discovering a higher term.
        This is how leaders step down gracefully.
        """
        self.state = ServerState.FOLLOWER
        self.current_term = term
        self.voted_for = None
        # Clear leader state
        self.next_index = {}
        self.match_index = {}

    def become_candidate(self):
        """
        Called when election timeout expires without hearing from a leader.
        """
        self.state = ServerState.CANDIDATE
        self.current_term += 1    # Increment term for this election
        self.voted_for = self.id  # Vote for self
```

The decomposition also enables independent verification. When implementing Raft, you can verify each piece on its own:
• Leader election: does every term end up with at most one leader, even when elections race?
• Log replication: do committed entries reach every server's log and never get lost?
• Safety: can a server whose log is missing committed entries ever win an election?
This modularity dramatically reduces the cognitive load of both implementing and debugging Raft.
Raft's second key design decision is the strong leader model. In Raft, data flows in one direction only: from leader to followers.
Why is this simpler?
In symmetric protocols (like some Paxos variants), any node can propose values, and the protocol must handle concurrent proposals, conflicts, and merges. This creates a complex state space with many possible interleavings.
In Raft's asymmetric model:
• Only the leader accepts client requests and appends new entries to the log
• Followers never originate entries; they accept what the leader sends (or reject it if it conflicts)
• Log entries flow in exactly one direction, from the leader outward
This dramatically reduces the number of states and transitions the protocol must handle.
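As a rough sketch of that asymmetry (the class and method names below are illustrative, not from any particular Raft library), followers never originate entries; they only point clients at the leader, and only the leader appends to the log:

```python
class Follower:
    """Illustrative only: a follower never accepts writes itself."""

    def __init__(self, known_leader_id: int):
        self.known_leader_id = known_leader_id

    def handle_client_write(self, command: dict) -> dict:
        # Followers don't propose values; they redirect the client.
        return {"redirect_to": self.known_leader_id}


class Leader:
    """Illustrative only: the single entry point for new log entries."""

    def __init__(self, term: int):
        self.term = term
        self.log = []

    def handle_client_write(self, command: dict) -> dict:
        # Only the leader appends new entries; they then flow outward
        # to followers via AppendEntries (covered on the next pages).
        self.log.append({"term": self.term, "command": command})
        return {"accepted": True, "index": len(self.log) - 1}
```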
The trade-off is availability during leader failures. When a leader fails, the cluster cannot accept new writes until a new leader is elected (typically 150-300ms). This is acceptable for most systems—brief unavailability is preferable to complexity-induced bugs.
The strong leader also provides natural read consistency. Because all writes flow through the leader, the leader always has the most up-to-date state. Reads served by the leader therefore see the latest committed data, although for strictly linearizable reads the leader must first confirm it has not been superseded (for example, by exchanging heartbeats with a majority before responding).
Some Raft implementations extend this with lease-based reads, where a leader can serve reads locally during its lease period, eliminating the need to contact a quorum for every read operation.
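As a sketch of the lease idea (the names and structure below are illustrative assumptions, not the API of any specific implementation): the leader tracks when a majority last acknowledged it and serves reads locally only while that acknowledgment is recent enough relative to the election timeout.

```python
import time


class LeaseReadingLeader:
    """Minimal sketch of lease-based reads, assuming bounded clock drift.

    The lease must be shorter than the election timeout so that no new
    leader can be elected while this one still believes its lease is valid.
    """

    LEASE_DURATION_S = 0.1  # comfortably below a 150ms election timeout

    def __init__(self, state: dict):
        self.state = state
        self.lease_expires_at = 0.0

    def on_majority_heartbeat_ack(self, round_started_at: float):
        # Measure the lease from when the heartbeat round *started*, so it
        # stays conservative even if acknowledgments arrive late.
        self.lease_expires_at = round_started_at + self.LEASE_DURATION_S

    def read(self, key: str):
        if time.monotonic() < self.lease_expires_at:
            return self.state.get(key)  # safe to answer locally
        # Lease expired: fall back to confirming leadership with a quorum
        # round before answering (not shown here).
        raise RuntimeError("lease expired; confirm leadership first")
```

A leader following this pattern would call on_majority_heartbeat_ack(time.monotonic()) each time a heartbeat round reaches a majority of followers.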
The strong leader model isn't unique to Raft—it's the dominant pattern in production systems. ZooKeeper (with ZAB), Kafka (with controller), MongoDB (with replica set primaries), and most SQL databases use leader-based replication. Raft made this explicit and central to the algorithm design.
Time is a fundamental challenge in distributed systems—physical clocks drift, network delays are unpredictable, and "what happened first" is often ambiguous. Raft solves this with terms.
A term is a logical clock that divides time into numbered periods. Each term has at most one leader. Terms monotonically increase, and higher terms always supersede lower terms.
"""Terms in Raft serve multiple purposes: 1. LOGICAL CLOCK - Terms order events without physical time2. ELECTION EPOCHS - Each election attempt starts a new term 3. STALE DETECTION - Old messages/leaders are identified by old terms4. TIE-BREAKING - Higher term always wins Timeline example:================= Term 1: [Server A is leader] --crash--> Term 2: [Server B wins election, becomes leader] --crash-->Term 3: [Server C wins election, becomes leader] --> [operating normally] Key invariant: At most one leader per term""" class TermManager: def __init__(self, initial_term: int = 0): self.term = initial_term def discover_higher_term(self, observed_term: int) -> bool: """ Called when any RPC reveals a higher term. If observed term is higher, we: - Update our term - Revert to follower state - Clear any vote we've cast Returns True if our term was updated. """ if observed_term > self.term: self.term = observed_term return True # Caller should become follower return False def start_election(self) -> int: """ Called when starting a new election. Increment term and return new value. """ self.term += 1 return self.term def is_valid_term(self, request_term: int) -> bool: """ Check if an incoming request has a valid (not stale) term. Requests with term < current_term are rejected. """ return request_term >= self.term # Example: How terms prevent stale leadersclass RaftNode: def receive_append_entries(self, leader_term: int, leader_id: int, entries: list): """ Handle AppendEntries RPC from (claimed) leader. """ if leader_term < self.current_term: # This is a stale leader - reject! # Could be: old leader that was partitioned, # or network delays delivered old message return { "success": False, "term": self.current_term # Tell them to update } if leader_term > self.current_term: # They're ahead - we're stale # Step down and update our term self.become_follower(leader_term) # Now process the entries... # Reset election timer (we heard from valid leader) self.reset_election_timer() # ... log replication logic ... return {"success": True, "term": self.current_term}Terms provide automatic obsolescence detection. Every RPC in Raft includes the sender's current term. When a server receives a message:
This simple rule ensures that:
• Deposed or partitioned leaders discover they are stale the moment they talk to anyone with a newer term
• Delayed messages from old terms can never overwrite decisions made in newer terms
• Every server converges on the highest term it has observed
Terms are simpler than Lamport clocks or vector clocks because they only increment on significant events (elections), not on every message. This reduces overhead and makes reasoning easier, at the cost of less fine-grained ordering—which Raft doesn't need because the leader serializes all operations anyway.
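A minimal side-by-side sketch of that difference (illustrative, not production code): a Lamport clock ticks on every event and every received message, while a Raft term only ticks when an election starts and is otherwise just adopted from whoever holds the higher value.

```python
class LamportClock:
    """Ticks on every local event and every received message."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1

    def on_message(self, sender_time: int):
        self.time = max(self.time, sender_time) + 1


class RaftTermClock:
    """Ticks only when an election begins; otherwise just catches up."""

    def __init__(self):
        self.term = 0

    def start_election(self):
        self.term += 1

    def on_message(self, sender_term: int):
        self.term = max(self.term, sender_term)  # no increment on receipt
```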
One of Raft's elegant engineering decisions is the use of randomized election timeouts. This solves the problem of split votes—situations where multiple candidates compete and no one gets a majority.
The Problem:
Imagine a three-server cluster. The leader crashes, leaving two followers. If both followers simultaneously time out, become candidates, increment their terms, and vote for themselves, then neither can reach a majority (2 out of 3 votes): each holds only its own vote, the other candidate has already voted for itself, and the crashed server cannot vote at all. With identical, fixed timeouts, both time out again at the same moment, repeat the process, and the cluster is stuck in an endless loop of split votes.
Raft's Solution:
Each server's election timeout is chosen randomly from a range (e.g., 150-300ms). Because the timeouts are spread across that range, servers rarely become candidates at the same instant: whichever server times out first usually requests and wins votes before any competitor's timer fires. The sketch below shows a randomized election timer and a small simulation of how often candidates would still collide.
```python
import random
import asyncio
from typing import Optional


class ElectionTimer:
    """
    Raft's election timer with randomized timeout.

    Key parameters:
    - HEARTBEAT_INTERVAL: How often leader sends heartbeats (e.g., 50ms)
    - ELECTION_TIMEOUT_MIN: Minimum timeout before starting election (e.g., 150ms)
    - ELECTION_TIMEOUT_MAX: Maximum timeout (e.g., 300ms)

    Rule of thumb: ELECTION_TIMEOUT_MIN > HEARTBEAT_INTERVAL * 2
    to allow for network delays
    """

    HEARTBEAT_INTERVAL_MS = 50
    ELECTION_TIMEOUT_MIN_MS = 150
    ELECTION_TIMEOUT_MAX_MS = 300

    def __init__(self, on_timeout_callback):
        self.on_timeout = on_timeout_callback
        self.timer_task: Optional[asyncio.Task] = None

    def _random_timeout(self) -> float:
        """
        Generate random timeout in seconds.
        Using uniform distribution over the range.
        """
        timeout_ms = random.randint(
            self.ELECTION_TIMEOUT_MIN_MS,
            self.ELECTION_TIMEOUT_MAX_MS
        )
        return timeout_ms / 1000.0

    def reset(self):
        """
        Called when:
        1. Server receives valid AppendEntries from leader
        2. Server grants vote to a candidate

        Resets the timer with a new random timeout.
        This prevents unnecessary elections while leader is alive.
        """
        if self.timer_task:
            self.timer_task.cancel()

        timeout = self._random_timeout()
        self.timer_task = asyncio.create_task(
            self._run_timer(timeout)
        )

    async def _run_timer(self, timeout: float):
        """
        Wait for timeout, then trigger election.
        If reset() is called before timeout, this task is cancelled.
        """
        try:
            await asyncio.sleep(timeout)
            # Timeout expired without hearing from leader
            # Convert to candidate and start election
            await self.on_timeout()
        except asyncio.CancelledError:
            # Timer was reset - normal operation
            pass


# Simulation showing how randomized timeouts prevent split votes
def simulate_election_starts(num_servers: int, num_trials: int = 1000) -> dict:
    """
    Simulate how often servers would start elections 'simultaneously'.
    """
    simultaneous_starts = 0
    single_starter = 0
    threshold_ms = 10  # If within 10ms, consider 'simultaneous'

    for _ in range(num_trials):
        timeouts = [
            random.randint(150, 300)
            for _ in range(num_servers)
        ]
        min_timeout = min(timeouts)
        starters = sum(1 for t in timeouts if t - min_timeout < threshold_ms)

        if starters > 1:
            simultaneous_starts += 1
        else:
            single_starter += 1

    return {
        "simultaneous_starts": simultaneous_starts,
        "single_starter": single_starter,
        "probability_clean_election": single_starter / num_trials
    }

# With 5 servers and a 10ms "simultaneous" window, roughly 70% of trials
# produce a single clear first starter (try it: simulate_election_starts(5)).
```

Why This Works:

Because each server draws its timeout independently from the range, one server almost always fires noticeably earlier than the rest. It becomes a candidate, collects votes, and begins sending heartbeats before its peers' timers expire, so they never become candidates for that term at all. When a split vote does happen, every candidate picks a fresh random timeout for the retry, so the chance of colliding again shrinks round after round and elections typically resolve within a few timeout periods.
The alternative would be more complex mechanisms, such as:
• Ranking candidates so that lower-ranked servers defer to higher-ranked ones (the Raft authors tried this and found it introduced subtle availability problems)
• Explicit rounds of negotiation or backoff between competing candidates
• A separate coordination service whose only job is to pick the leader, which just moves the consensus problem somewhere else
Randomization elegantly sidesteps all of these issues.
In production, timeout parameters are tuned based on network characteristics:
• Stable networks (within a data center): the 150-300ms range works well
• High-latency networks (geo-distributed): consider 500-1000ms
• Very stable clusters: heartbeats can be less frequent to reduce overhead
The key constraint comes from the Raft paper's timing requirement, broadcast time ≪ election timeout ≪ mean time between failures: the election timeout must be much larger than a heartbeat round-trip across the cluster, and much smaller than the interval at which servers typically fail.
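As a small illustration of that constraint (a hypothetical helper, not part of any Raft library), a configuration check might assert that the election timeout leaves room for several heartbeats and dwarfs the typical network round-trip:

```python
from dataclasses import dataclass


@dataclass
class RaftTimingConfig:
    """Hypothetical sanity check for Raft timing parameters.

    Encodes the practical half of the timing requirement
    (broadcast time << election timeout); the MTBF side is
    usually satisfied by a wide margin and isn't checked here.
    """

    heartbeat_interval_ms: int
    election_timeout_min_ms: int
    election_timeout_max_ms: int
    typical_rtt_ms: float  # measured broadcast / round-trip time

    def validate(self) -> None:
        assert self.election_timeout_min_ms < self.election_timeout_max_ms
        # Allow several lost or delayed heartbeats before an election starts.
        assert self.election_timeout_min_ms >= 3 * self.heartbeat_interval_ms, \
            "election timeout should span several heartbeat intervals"
        # Broadcast time must be an order of magnitude below the timeout.
        assert self.typical_rtt_ms * 10 <= self.election_timeout_min_ms, \
            "network round-trips are too slow for this election timeout"


# The data-center defaults discussed above pass the check:
RaftTimingConfig(50, 150, 300, typical_rtt_ms=1.0).validate()
```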
Raft represents a philosophical shift in algorithm design: understandability is a first-class design goal, not a nice-to-have. The algorithm achieves this through several key principles:
• Decomposition of consensus into leader election, log replication, and safety
• A strong leader through which all log entries flow
• Terms as a coarse logical clock that makes stale leaders and messages easy to detect
• Randomized election timeouts that resolve contention without extra coordination machinery
What's Next:
With this philosophical foundation, we're ready to dive into the mechanics. The next page explores Leader Election—how Raft chooses a single leader from among its servers, handles failures during elections, and guarantees that at most one leader exists per term.
Understanding leader election is crucial because every other part of Raft depends on having a functioning leader. Log replication assumes a leader exists. Safety properties rely on leader election constraints. The entire system starts with answering: "Who is in charge?"
You now understand why Raft exists, what problems it solves, and the key design decisions that make it understandable. You've seen how consensus algorithms serve replicated state machines, why Paxos's complexity motivated Raft's creation, and how Raft's decomposition, strong leader model, terms, and randomized timeouts work together to create an algorithm that engineers can actually implement correctly.