In 2013, Diego Ongaro and John Ousterhout at Stanford University published a paper that would fundamentally reshape how engineers think about distributed consensus. Their creation—the Raft consensus algorithm—wasn't faster than Paxos. It wasn't more efficient. It wasn't more powerful. It was something far more valuable: understandable.
This might seem like a strange design goal. Shouldn't algorithms be optimized for performance, correctness, or scalability? Why prioritize "understandability"?
The answer lies in a profound insight about real-world distributed systems: an algorithm that engineers cannot understand is an algorithm they cannot implement correctly, debug effectively, or extend safely. Paxos, despite its elegance and provable correctness, had become notorious for the gap between its theoretical description and practical implementation. Raft was designed to close that gap.
By the end of this page, you will understand:
• Why understandability is a critical design constraint for consensus algorithms
• The fundamental problems that Raft and Paxos both solve
• How Raft decomposes consensus into manageable subproblems
• The key innovations that make Raft easier to reason about than Paxos
• The state machine replication paradigm that underlies Raft's design
Before we can appreciate Raft's approach, we need to understand the fundamental problem it solves. Consensus is the challenge of getting multiple computers (nodes) in a distributed system to agree on a single value, even when some nodes may fail at any time.
This sounds deceptively simple. Imagine five servers that need to agree on "who is the current leader?" If all network messages arrive instantly and no servers ever fail, this is trivial. But in the real world:
• Messages can be delayed, reordered, duplicated, or lost entirely
• Servers can crash at any moment and come back with only their persisted state
• Network partitions can split the cluster into groups that cannot reach each other
• There is no shared clock, so "which event happened first" is often ambiguous
Despite all these challenges, we need the distributed system to:
| Property | Definition | What It Prevents |
|---|---|---|
| Agreement (Safety) | All nodes that decide must decide on the same value | Split-brain scenarios where different parts of the system believe different things |
| Validity | If a node decides on value v, then v was proposed by some node | Random or malicious values from being chosen |
| Termination (Liveness) | Eventually, all non-faulty nodes decide on some value | Indefinite waiting or deadlock |
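To make the Agreement row concrete, here is a minimal sketch (illustrative, not part of Raft itself) of why majority quorums prevent split-brain: any two majorities of the same cluster must share at least one server, so two conflicting decisions can never both be ratified.

```python
from itertools import combinations

def majorities(cluster_size: int):
    """Yield every subset of servers that forms a strict majority."""
    servers = range(cluster_size)
    need = cluster_size // 2 + 1
    for size in range(need, cluster_size + 1):
        yield from combinations(servers, size)

# Any two majorities of a 5-server cluster overlap in at least one server,
# which is the structural reason a cluster cannot commit two conflicting
# decisions in different halves of a partition.
for q1 in majorities(5):
    for q2 in majorities(5):
        assert set(q1) & set(q2), "two majorities must intersect"
print("every pair of majorities overlaps")
```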
The Fischer-Lynch-Paterson (FLP) result proves that in a fully asynchronous system where even a single node may crash, no deterministic algorithm can guarantee that consensus always terminates. This seems devastating—but Raft sidesteps it by using randomization (its randomized election timeouts) and, like Paxos, by assuming the network eventually stabilizes. In practice, these algorithms work remarkably well.
Raft is not just a consensus algorithm—it's a framework for Replicated State Machine (RSM) systems. Understanding this paradigm is essential to understanding why Raft works the way it does.
A state machine is a simple concept: it's a system that processes a sequence of commands, where each command transitions the machine from one state to another deterministically. Given the same starting state and the same sequence of commands, the state machine always ends in the same final state.
Replicated state machines extend this idea across multiple servers:
```python
# Conceptual model of a replicated state machine
from typing import Any


class ReplicatedStateMachine:
    def __init__(self):
        self.state = {}        # The current state (e.g., key-value store)
        self.log = []          # Ordered list of commands
        self.last_applied = 0  # Index of last command applied to state

    def apply_command(self, command: dict) -> Any:
        """
        Apply a single command to the state machine.
        Commands are deterministic - same input = same output.
        """
        if command["type"] == "SET":
            self.state[command["key"]] = command["value"]
            return f"OK: set {command['key']}"
        elif command["type"] == "GET":
            return self.state.get(command["key"], "NOT_FOUND")
        elif command["type"] == "DELETE":
            if command["key"] in self.state:
                del self.state[command["key"]]
                return f"OK: deleted {command['key']}"
            return "NOT_FOUND"

    def apply_log_entries(self):
        """
        Apply all committed log entries that haven't been applied yet.
        This is called after consensus confirms entries are committed.
        """
        while self.last_applied < len(self.log):
            entry = self.log[self.last_applied]
            result = self.apply_command(entry["command"])
            self.last_applied += 1
            # In practice, we'd notify the client of the result here

    def append_to_log(self, entry: dict):
        """
        Append a new entry to the log.
        The entry contains: term, index, and command.
        """
        self.log.append(entry)


# Example: How identical logs lead to identical states
server1 = ReplicatedStateMachine()
server2 = ReplicatedStateMachine()

# If both servers receive the same log entries in the same order...
commands = [
    {"type": "SET", "key": "x", "value": 1},
    {"type": "SET", "key": "y", "value": 2},
    {"type": "SET", "key": "x", "value": 3},  # Overwrites x
]

for i, cmd in enumerate(commands):
    entry = {"term": 1, "index": i, "command": cmd}
    server1.append_to_log(entry)
    server2.append_to_log(entry)

server1.apply_log_entries()
server2.apply_log_entries()

# ...they will have identical final states
assert server1.state == server2.state  # Both: {"x": 3, "y": 2}
```

The insight here is profound: if we can get all servers to agree on the sequence of log entries, the rest is deterministic. Consensus becomes the problem of agreeing on "what is the next entry in the log?"—a much more concrete problem than "agree on any value."
This is why Raft (and Paxos-based systems like Multi-Paxos) focuses on log replication as the core abstraction. The log provides a single total order over all commands, a durable record of what has been decided, and a simple way for a crashed or lagging server to catch up by replaying entries.
Leslie Lamport's Paxos algorithm, first described in 1989 (though the paper was not published until 1998, and not widely understood for years after that), was a monumental achievement. It provides a provably correct solution to distributed consensus. So why did Raft need to exist?
The short answer: Paxos is notoriously difficult to understand and implement.
This isn't a casual criticism. It's backed by decades of evidence:
• Lamport himself followed up with "Paxos Made Simple" (2001) because so few readers got through the original paper
• Google's "Paxos Made Live" (2007) reported that turning the published algorithm into a production system required substantial engineering not covered by the papers
• Real-world implementations routinely deviate from the published protocol, which makes its correctness arguments hard to carry over
The Raft authors conducted a formal study comparing Paxos and Raft understandability. They asked students who had studied both algorithms to answer questions about each. The results were striking: students scored significantly higher on Raft questions, and a large majority reported finding Raft easier to understand.
Understandability isn't a luxury—it's a safety requirement. When engineers misunderstand an algorithm:
• They introduce subtle bugs that violate the very guarantees the algorithm exists to provide
• They cannot tell correct behavior from broken behavior when debugging production incidents
• They make "optimizations" or extensions that silently break safety properties
Diego Ongaro, Raft's primary author, stated: "In developing Raft, we had the goal of understandability foremost in our minds... For each design question, we evaluated alternatives and chose the one that was easiest to understand and explain." This wasn't about making Raft "dumbed down"—it was about recognizing that human cognition is a legitimate design constraint.
Raft's most important contribution is decomposing consensus into three independently understandable subproblems. While Paxos presents consensus as a single, unified protocol, Raft explicitly separates concerns:
This decomposition is not just pedagogical—it's fundamental to the protocol design. Each subproblem has clear invariants, clear failure modes, and clear recovery procedures. You can understand and verify each component independently.
| Subproblem | What It Means | Key Mechanism | Guarantees |
|---|---|---|---|
| Leader Election | Who coordinates decisions? | Term numbers + majority votes | At most one leader per term |
| Log Replication | How are decisions distributed? | AppendEntries RPC + majority acknowledgment | Committed entries never lost |
| Safety | What prevents inconsistency? | Election restriction + log matching | All servers apply same commands in same order |
```python
# Raft's structure: Each server is in one of three states
from enum import Enum


class ServerState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


class RaftServer:
    """
    A Raft server's state can be decomposed into:

    1. PERSISTENT state (survives crashes):
       - currentTerm: latest term server has seen
       - votedFor: candidateId that received vote in current term
       - log[]: log entries; each entry contains command and term

    2. VOLATILE state (rebuilt after crash):
       - commitIndex: index of highest log entry known to be committed
       - lastApplied: index of highest log entry applied to state machine

    3. VOLATILE state on LEADERS only:
       - nextIndex[]: for each server, index of next log entry to send
       - matchIndex[]: for each server, index of highest log entry replicated
    """

    def __init__(self, server_id: int, peer_ids: list):
        self.id = server_id
        self.peers = peer_ids

        # === PERSISTENT STATE ===
        self.current_term = 0
        self.voted_for = None
        self.log = []  # List of {term, command} entries

        # === VOLATILE STATE ===
        self.commit_index = 0
        self.last_applied = 0

        # === STATE ===
        self.state = ServerState.FOLLOWER

        # === LEADER-ONLY VOLATILE STATE ===
        self.next_index = {}   # Initialized on becoming leader
        self.match_index = {}  # Initialized on becoming leader

    def become_leader(self):
        """
        Called after winning an election.
        Initializes leader-specific state.
        """
        self.state = ServerState.LEADER
        # For each server, assume they're caught up, then discover otherwise
        for peer in self.peers:
            self.next_index[peer] = len(self.log) + 1
            self.match_index[peer] = 0

    def become_follower(self, term: int):
        """
        Called when discovering a higher term.
        This is how leaders step down gracefully.
        """
        self.state = ServerState.FOLLOWER
        self.current_term = term
        self.voted_for = None
        # Clear leader state
        self.next_index = {}
        self.match_index = {}

    def become_candidate(self):
        """
        Called when election timeout expires without hearing from a leader.
        """
        self.state = ServerState.CANDIDATE
        self.current_term += 1    # Increment term for this election
        self.voted_for = self.id  # Vote for self
```

The decomposition also enables independent verification. When implementing Raft, you can verify each piece on its own:
• Leader election: does every term end up with at most one leader, even when elections race?
• Log replication: do committed entries reach every server's log and never get lost?
• Safety: can a server whose log is missing committed entries ever win an election?
This modularity dramatically reduces the cognitive load of both implementing and debugging Raft.
Raft's second key design decision is the strong leader model. In Raft, data flows in one direction only: from leader to followers.
Why is this simpler?
In symmetric protocols (like some Paxos variants), any node can propose values, and the protocol must handle concurrent proposals, conflicts, and merges. This creates a complex state space with many possible interleavings.
In Raft's asymmetric model:
• Only the leader accepts client requests and appends new entries to the log
• Followers never originate entries; they accept what the leader sends (or reject it if it conflicts)
• Log entries flow in exactly one direction, from the leader outward
This dramatically reduces the number of states and transitions the protocol must handle.
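As a rough sketch of that asymmetry (the class and method names below are illustrative, not from any particular Raft library), followers never originate entries; they only point clients at the leader, and only the leader appends to the log:

```python
class Follower:
    """Illustrative only: a follower never accepts writes itself."""

    def __init__(self, known_leader_id: int):
        self.known_leader_id = known_leader_id

    def handle_client_write(self, command: dict) -> dict:
        # Followers don't propose values; they redirect the client.
        return {"redirect_to": self.known_leader_id}


class Leader:
    """Illustrative only: the single entry point for new log entries."""

    def __init__(self, term: int):
        self.term = term
        self.log = []

    def handle_client_write(self, command: dict) -> dict:
        # Only the leader appends new entries; they then flow outward
        # to followers via AppendEntries (covered on the next pages).
        self.log.append({"term": self.term, "command": command})
        return {"accepted": True, "index": len(self.log) - 1}
```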
The trade-off is availability during leader failures. When a leader fails, the cluster cannot accept new writes until a new leader is elected (typically 150-300ms). This is acceptable for most systems—brief unavailability is preferable to complexity-induced bugs.
The strong leader also provides natural read consistency. Because all writes flow through the leader, the leader always has the most up-to-date state. Reads served by the leader therefore see the latest committed data, although for strictly linearizable reads the leader must first confirm it has not been superseded (for example, by exchanging heartbeats with a majority before responding).
Some Raft implementations extend this with lease-based reads, where a leader can serve reads locally during its lease period, eliminating the need to contact a quorum for every read operation.
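As a sketch of the lease idea (the names and structure below are illustrative assumptions, not the API of any specific implementation): the leader tracks when a majority last acknowledged it and serves reads locally only while that acknowledgment is recent enough relative to the election timeout.

```python
import time


class LeaseReadingLeader:
    """Minimal sketch of lease-based reads, assuming bounded clock drift.

    The lease must be shorter than the election timeout so that no new
    leader can be elected while this one still believes its lease is valid.
    """

    LEASE_DURATION_S = 0.1  # comfortably below a 150ms election timeout

    def __init__(self, state: dict):
        self.state = state
        self.lease_expires_at = 0.0

    def on_majority_heartbeat_ack(self, round_started_at: float):
        # Measure the lease from when the heartbeat round *started*, so it
        # stays conservative even if acknowledgments arrive late.
        self.lease_expires_at = round_started_at + self.LEASE_DURATION_S

    def read(self, key: str):
        if time.monotonic() < self.lease_expires_at:
            return self.state.get(key)  # safe to answer locally
        # Lease expired: fall back to confirming leadership with a quorum
        # round before answering (not shown here).
        raise RuntimeError("lease expired; confirm leadership first")
```

A leader following this pattern would call on_majority_heartbeat_ack(time.monotonic()) each time a heartbeat round reaches a majority of followers.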
The strong leader model isn't unique to Raft—it's the dominant pattern in production systems. ZooKeeper (with ZAB), Kafka (with controller), MongoDB (with replica set primaries), and most SQL databases use leader-based replication. Raft made this explicit and central to the algorithm design.
Time is a fundamental challenge in distributed systems—physical clocks drift, network delays are unpredictable, and "what happened first" is often ambiguous. Raft solves this with terms.
A term is a logical clock that divides time into numbered periods. Each term has at most one leader. Terms monotonically increase, and higher terms always supersede lower terms.
"""Terms in Raft serve multiple purposes: 1. LOGICAL CLOCK - Terms order events without physical time2. ELECTION EPOCHS - Each election attempt starts a new term 3. STALE DETECTION - Old messages/leaders are identified by old terms4. TIE-BREAKING - Higher term always wins Timeline example:================= Term 1: [Server A is leader] --crash--> Term 2: [Server B wins election, becomes leader] --crash-->Term 3: [Server C wins election, becomes leader] --> [operating normally] Key invariant: At most one leader per term""" class TermManager: def __init__(self, initial_term: int = 0): self.term = initial_term def discover_higher_term(self, observed_term: int) -> bool: """ Called when any RPC reveals a higher term. If observed term is higher, we: - Update our term - Revert to follower state - Clear any vote we've cast Returns True if our term was updated. """ if observed_term > self.term: self.term = observed_term return True # Caller should become follower return False def start_election(self) -> int: """ Called when starting a new election. Increment term and return new value. """ self.term += 1 return self.term def is_valid_term(self, request_term: int) -> bool: """ Check if an incoming request has a valid (not stale) term. Requests with term < current_term are rejected. """ return request_term >= self.term # Example: How terms prevent stale leadersclass RaftNode: def receive_append_entries(self, leader_term: int, leader_id: int, entries: list): """ Handle AppendEntries RPC from (claimed) leader. """ if leader_term < self.current_term: # This is a stale leader - reject! # Could be: old leader that was partitioned, # or network delays delivered old message return { "success": False, "term": self.current_term # Tell them to update } if leader_term > self.current_term: # They're ahead - we're stale # Step down and update our term self.become_follower(leader_term) # Now process the entries... # Reset election timer (we heard from valid leader) self.reset_election_timer() # ... log replication logic ... return {"success": True, "term": self.current_term}Terms provide automatic obsolescence detection. Every RPC in Raft includes the sender's current term. When a server receives a message:
This simple rule ensures that:
• Deposed or partitioned leaders discover they are stale the moment they talk to anyone with a newer term
• Delayed messages from old terms can never overwrite decisions made in newer terms
• Every server converges on the highest term it has observed
Terms are simpler than Lamport clocks or vector clocks because they only increment on significant events (elections), not on every message. This reduces overhead and makes reasoning easier, at the cost of less fine-grained ordering—which Raft doesn't need because the leader serializes all operations anyway.
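A minimal side-by-side sketch of that difference (illustrative, not production code): a Lamport clock ticks on every event and every received message, while a Raft term only ticks when an election starts and is otherwise just adopted from whoever holds the higher value.

```python
class LamportClock:
    """Ticks on every local event and every received message."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1

    def on_message(self, sender_time: int):
        self.time = max(self.time, sender_time) + 1


class RaftTermClock:
    """Ticks only when an election begins; otherwise just catches up."""

    def __init__(self):
        self.term = 0

    def start_election(self):
        self.term += 1

    def on_message(self, sender_term: int):
        self.term = max(self.term, sender_term)  # no increment on receipt
```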
One of Raft's elegant engineering decisions is the use of randomized election timeouts. This solves the problem of split votes—situations where multiple candidates compete and no one gets a majority.
The Problem:
Imagine a three-server cluster. The leader crashes, leaving two followers. If both followers simultaneously time out, become candidates, increment their terms, and vote for themselves, then neither can reach a majority (2 out of 3 votes): each holds only its own vote, the other candidate has already voted for itself, and the crashed server cannot vote at all. With identical, fixed timeouts, both time out again at the same moment, repeat the process, and the cluster is stuck in an endless loop of split votes.
Raft's Solution:
Each server's election timeout is chosen randomly from a range (e.g., 150-300ms). Because the timeouts are spread across that range, servers rarely become candidates at the same instant: whichever server times out first usually requests and wins votes before any competitor's timer fires. The sketch below shows a randomized election timer and a small simulation of how often candidates would still collide.
```python
import random
import asyncio
from typing import Optional


class ElectionTimer:
    """
    Raft's election timer with randomized timeout.

    Key parameters:
    - HEARTBEAT_INTERVAL: How often leader sends heartbeats (e.g., 50ms)
    - ELECTION_TIMEOUT_MIN: Minimum timeout before starting election (e.g., 150ms)
    - ELECTION_TIMEOUT_MAX: Maximum timeout (e.g., 300ms)

    Rule of thumb: ELECTION_TIMEOUT_MIN > HEARTBEAT_INTERVAL * 2
    to allow for network delays
    """

    HEARTBEAT_INTERVAL_MS = 50
    ELECTION_TIMEOUT_MIN_MS = 150
    ELECTION_TIMEOUT_MAX_MS = 300

    def __init__(self, on_timeout_callback):
        self.on_timeout = on_timeout_callback
        self.timer_task: Optional[asyncio.Task] = None

    def _random_timeout(self) -> float:
        """
        Generate random timeout in seconds.
        Using uniform distribution over the range.
        """
        timeout_ms = random.randint(
            self.ELECTION_TIMEOUT_MIN_MS,
            self.ELECTION_TIMEOUT_MAX_MS
        )
        return timeout_ms / 1000.0

    def reset(self):
        """
        Called when:
        1. Server receives valid AppendEntries from leader
        2. Server grants vote to a candidate

        Resets the timer with a new random timeout.
        This prevents unnecessary elections while leader is alive.
        """
        if self.timer_task:
            self.timer_task.cancel()

        timeout = self._random_timeout()
        self.timer_task = asyncio.create_task(
            self._run_timer(timeout)
        )

    async def _run_timer(self, timeout: float):
        """
        Wait for timeout, then trigger election.
        If reset() is called before timeout, this task is cancelled.
        """
        try:
            await asyncio.sleep(timeout)
            # Timeout expired without hearing from leader
            # Convert to candidate and start election
            await self.on_timeout()
        except asyncio.CancelledError:
            # Timer was reset - normal operation
            pass


# Simulation showing how randomized timeouts prevent split votes
def simulate_election_starts(num_servers: int, num_trials: int = 1000) -> dict:
    """
    Simulate how often servers would start elections 'simultaneously'.
    """
    simultaneous_starts = 0
    single_starter = 0
    threshold_ms = 10  # If within 10ms, consider 'simultaneous'

    for _ in range(num_trials):
        timeouts = [
            random.randint(150, 300)
            for _ in range(num_servers)
        ]
        min_timeout = min(timeouts)
        starters = sum(1 for t in timeouts if t - min_timeout < threshold_ms)

        if starters > 1:
            simultaneous_starts += 1
        else:
            single_starter += 1

    return {
        "simultaneous_starts": simultaneous_starts,
        "single_starter": single_starter,
        "probability_clean_election": single_starter / num_trials
    }

# With 5 servers and a 10ms "simultaneous" window, roughly 70% of trials
# produce a single clear first starter (try it: simulate_election_starts(5)).
```

Why This Works:

Because each server draws its timeout independently from the range, one server almost always fires noticeably earlier than the rest. It becomes a candidate, collects votes, and begins sending heartbeats before its peers' timers expire, so they never become candidates for that term at all. When a split vote does happen, every candidate picks a fresh random timeout for the retry, so the chance of colliding again shrinks round after round and elections typically resolve within a few timeout periods.
The alternative would be more complex mechanisms, such as:
• Ranking candidates so that lower-ranked servers defer to higher-ranked ones (the Raft authors tried this and found it introduced subtle availability problems)
• Explicit rounds of negotiation or backoff between competing candidates
• A separate coordination service whose only job is to pick the leader, which just moves the consensus problem somewhere else
Randomization elegantly sidesteps all of these issues.
In production, timeout parameters are tuned based on network characteristics:
• Stable networks (within a data center): the 150-300ms range works well
• High-latency networks (geo-distributed): consider 500-1000ms
• Very stable clusters: heartbeats can be less frequent to reduce overhead
The key constraint comes from the Raft paper's timing requirement, broadcast time ≪ election timeout ≪ mean time between failures: the election timeout must be much larger than a heartbeat round-trip across the cluster, and much smaller than the interval at which servers typically fail.
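As a small illustration of that constraint (a hypothetical helper, not part of any Raft library), a configuration check might assert that the election timeout leaves room for several heartbeats and dwarfs the typical network round-trip:

```python
from dataclasses import dataclass


@dataclass
class RaftTimingConfig:
    """Hypothetical sanity check for Raft timing parameters.

    Encodes the practical half of the timing requirement
    (broadcast time << election timeout); the MTBF side is
    usually satisfied by a wide margin and isn't checked here.
    """

    heartbeat_interval_ms: int
    election_timeout_min_ms: int
    election_timeout_max_ms: int
    typical_rtt_ms: float  # measured broadcast / round-trip time

    def validate(self) -> None:
        assert self.election_timeout_min_ms < self.election_timeout_max_ms
        # Allow several lost or delayed heartbeats before an election starts.
        assert self.election_timeout_min_ms >= 3 * self.heartbeat_interval_ms, \
            "election timeout should span several heartbeat intervals"
        # Broadcast time must be an order of magnitude below the timeout.
        assert self.typical_rtt_ms * 10 <= self.election_timeout_min_ms, \
            "network round-trips are too slow for this election timeout"


# The data-center defaults discussed above pass the check:
RaftTimingConfig(50, 150, 300, typical_rtt_ms=1.0).validate()
```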
Raft represents a philosophical shift in algorithm design: understandability is a first-class design goal, not a nice-to-have. The algorithm achieves this through several key principles:
• Decomposition of consensus into leader election, log replication, and safety
• A strong leader through which all log entries flow
• Terms as a coarse logical clock that makes stale leaders and messages easy to detect
• Randomized election timeouts that resolve contention without extra coordination machinery
What's Next:
With this philosophical foundation, we're ready to dive into the mechanics. The next page explores Leader Election—how Raft chooses a single leader from among its servers, handles failures during elections, and guarantees that at most one leader exists per term.
Understanding leader election is crucial because every other part of Raft depends on having a functioning leader. Log replication assumes a leader exists. Safety properties rely on leader election constraints. The entire system starts with answering: "Who is in charge?"
You now understand why Raft exists, what problems it solves, and the key design decisions that make it understandable. You've seen how consensus algorithms serve replicated state machines, why Paxos's complexity motivated Raft's creation, and how Raft's decomposition, strong leader model, terms, and randomized timeouts work together to create an algorithm that engineers can actually implement correctly.