In 2014, Diego Ongaro and John Ousterhout published "In Search of an Understandable Consensus Algorithm," introducing Raft. Their motivation was unusual in academic circles: they made understandability an explicit design goal, not just correctness and performance.
Paxos, while correct, had gained a reputation for being notoriously difficult to understand and implement correctly. Years of academic papers and production outages demonstrated that even experts struggled. Ongaro and Ousterhout argued that understandability isn't a luxury—it's essential for building reliable systems.
Raft's key innovation:
Rather than derive a minimal algorithm from first principles (as Paxos does), Raft decomposes consensus into independent subproblems:
- Leader election: choosing a single node to coordinate all writes
- Log replication: the leader copying its log to followers
- Safety: the rules that guarantee committed entries are never lost
- Membership changes: altering the cluster's node set without downtime
Each subproblem can be understood, implemented, and tested independently. The result is a protocol that's provably equivalent to (Multi-)Paxos in power, but dramatically easier to implement correctly.
By the end of this page, you will understand Raft's design philosophy, the term and election mechanism, log replication and commitment, the safety properties that ensure correctness, membership changes, and how Raft is implemented in production systems like etcd, Consul, and CockroachDB.
Before diving into Raft's mechanics, it's worth understanding why it differs from Paxos and what trade-offs it makes.
Paxos philosophy: specify the minimal, provably correct core (single-decree consensus) and leave the rest—log management, leader election, reconfiguration—as engineering exercises for implementers.

Raft philosophy: specify a complete, practical system, and whenever several correct designs exist, choose the one that is easiest to understand.
| Aspect | Paxos | Raft |
|---|---|---|
| Primary goal | Minimal correctness | Understandability |
| Leader role | Optimization (Multi-Paxos) | Fundamental requirement |
| Log gaps | Allowed (each slot independent) | Prohibited (contiguous only) |
| Entry commitment | Per-slot quorum | Log prefix matching |
| Voting restriction | None (any node can propose) | Only candidates with up-to-date log |
| Specification | Abstract protocol | Complete system |
The key Raft constraint: Contiguous logs only
In Paxos, a proposer can attempt to fill any log slot. Slot 5 might be committed while slot 3 is still undecided. This flexibility is theoretically elegant but complicates log management.
Raft requires logs to be contiguous: a leader appends entry N only at the end of its log, immediately after entry N-1—there are never holes. This constraint simplifies:
- Commitment: a single commitIndex marks everything up to it as committed
- Consistency checks: one prevLogIndex/prevLogTerm pair verifies an entire log prefix
- Follower catch-up: missing entries are always a suffix, never scattered gaps
The trade-off:
Raft sacrifices some theoretical flexibility for practical simplicity. It may be slightly slower in scenarios where out-of-order commitment could help, but the implementation is dramatically easier to get right.
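The contrast can be sketched in a few lines. The `PaxosSlots` and `RaftLog` classes below are illustrative toys, not real implementations: Paxos-style slots form a sparse map that tolerates gaps, while a Raft log only ever grows at its end.

```python
class PaxosSlots:
    """Paxos-style: each slot is decided independently; gaps are legal."""
    def __init__(self):
        self.slots = {}          # index -> command (may be sparse)

    def decide(self, index, command):
        self.slots[index] = command   # slot 5 may be decided before slot 3

class RaftLog:
    """Raft-style: entries may only be appended at the end; no gaps."""
    def __init__(self):
        self.entries = []        # contiguous list; index i is entries[i]

    def append(self, index, command):
        if index != len(self.entries):
            raise ValueError(f"Raft logs are contiguous: next index is "
                             f"{len(self.entries)}, got {index}")
        self.entries.append(command)

paxos = PaxosSlots()
paxos.decide(5, "x=1")           # fine: slot 3 is still undecided

raft = RaftLog()
raft.append(0, "x=1")
try:
    raft.append(5, "y=2")        # rejected: would create a gap
except ValueError as e:
    print(e)
```

The Raft version needs only an integer (`len(self.entries)`) to describe its entire state of progress; the Paxos version needs per-slot bookkeeping.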
In practice, most teams should use Raft (or a Raft-based system like etcd). Paxos is primarily of academic interest today, though some highly specialized systems use Paxos variants for their flexibility. If you're implementing consensus from scratch, Raft is the clear choice.
Raft defines exactly three states a node can be in, and uses terms as a logical clock to order events.
The Three States:
1. Follower — passive; responds to RPCs from leaders and candidates but never initiates requests. All nodes start here.
2. Candidate — a follower whose election timeout expired; it solicits votes to become leader.
3. Leader — handles all client requests and replicates log entries to followers, sending periodic heartbeats to retain authority.
Terms: Raft's Logical Clock
Raft divides time into terms, numbered consecutively starting at 1. Each term has at most one leader.
Term guarantees:
- Terms increase monotonically; every new election increments the term
- Each term has at most one leader (possibly none, if the vote splits)
- A node that sees a higher term immediately adopts it and reverts to follower
- Messages carrying a stale term are rejected
```python
import random
import time
from typing import Dict, List, Optional

class RaftNode:
    """
    Raft node state management.
    """
    def __init__(self, node_id: str, peers: List[str], storage: DurableStorage):
        self.node_id = node_id
        self.peers = peers
        self.storage = storage

        # Persistent state (survives crashes)
        self.current_term: int = storage.get("current_term", 0)
        self.voted_for: Optional[str] = storage.get("voted_for", None)
        self.log: List[LogEntry] = storage.get("log", [])

        # Volatile state
        self.state: NodeState = NodeState.FOLLOWER
        self.commit_index: int = -1
        self.last_applied: int = -1

        # Leader-only volatile state
        self.next_index: Dict[str, int] = {}   # For each peer
        self.match_index: Dict[str, int] = {}  # Highest replicated index

        # Timing
        self.election_timeout = self._random_timeout(150, 300)
        self.last_heartbeat = time.time()

    def _random_timeout(self, min_ms: int, max_ms: int) -> float:
        """
        Random election timeout in range [min_ms, max_ms].
        Randomization prevents split votes.
        """
        return random.randint(min_ms, max_ms) / 1000.0

    def check_election_timeout(self):
        """
        Check if election timeout has expired.
        If so, transition to candidate and start election.
        """
        if self.state == NodeState.LEADER:
            return  # Leaders don't time out

        elapsed = time.time() - self.last_heartbeat
        if elapsed >= self.election_timeout:
            self._start_election()

    def _start_election(self):
        """
        Transition to candidate and start new election.
        """
        # Increment term for new election
        self.current_term += 1
        self._persist("current_term", self.current_term)

        # Vote for ourselves
        self.voted_for = self.node_id
        self._persist("voted_for", self.voted_for)
        self.state = NodeState.CANDIDATE
        votes_received = 1  # Our own vote

        # Reset election timeout
        self.election_timeout = self._random_timeout(150, 300)
        self.last_heartbeat = time.time()

        # Send RequestVote to all peers
        for peer in self.peers:
            response = self._request_vote(peer)
            if response and response.vote_granted:
                votes_received += 1

        # Check if we won
        if votes_received > (len(self.peers) + 1) // 2:
            self._become_leader()
        # Otherwise, either retry on timeout or become follower

    def _become_leader(self):
        """
        Transition to leader state.
        """
        self.state = NodeState.LEADER

        # Initialize leader state
        for peer in self.peers:
            self.next_index[peer] = len(self.log)
            self.match_index[peer] = -1

        # Send initial heartbeats immediately
        self._send_heartbeats()
        print(f"Node {self.node_id}: Became leader for term {self.current_term}")

    def update_term(self, term: int):
        """
        Update to higher term, reverting to follower if needed.
        """
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
            self.state = NodeState.FOLLOWER
            self._persist("current_term", self.current_term)
            self._persist("voted_for", None)
```

Randomized election timeouts (typically 150-300ms) prevent synchronized elections that could cause perpetual split votes. Without randomization, nodes would time out simultaneously, all become candidates, and split the vote—repeating indefinitely. This randomization, together with timing assumptions, is how Raft sidesteps FLP impossibility in practice.
Raft's leader election is simpler than Paxos's because it uses a single voting rule: a voter grants its vote only if the candidate's log is at least as up-to-date as the voter's own log.
The RequestVote RPC:
RequestVote RPC:
term - Candidate's term
candidateId - Candidate requesting vote
lastLogIndex - Index of candidate's last log entry
lastLogTerm - Term of candidate's last log entry
Response:
term - CurrentTerm (for candidate to update itself)
voteGranted - True if candidate received vote
Voting rules:
- A node grants at most one vote per term, first-come-first-served, and persists the vote to disk before responding
- A vote is granted only if the candidate's term is at least the voter's current term
- A vote is granted only if the candidate's log passes the up-to-date check below

Log comparison:

Candidate's log is "at least as up-to-date" if:
- The candidate's last log entry has a higher term than the voter's last entry, or
- The last log terms are equal and the candidate's log is at least as long
```python
class RaftNode:
    """
    Raft leader election implementation.
    """
    def handle_request_vote(self, request: RequestVoteRequest) -> RequestVoteResponse:
        """
        Handle RequestVote RPC from candidate.

        Grant vote if:
        1. Candidate's term >= our term
        2. We haven't voted for someone else in this term
        3. Candidate's log is at least as up-to-date as ours
        """
        # Update term if candidate has higher term
        if request.term > self.current_term:
            self.update_term(request.term)

        # Reject if candidate's term is stale
        if request.term < self.current_term:
            return RequestVoteResponse(
                term=self.current_term,
                vote_granted=False
            )

        # Check if we can vote for this candidate
        can_vote = (
            self.voted_for is None
            or self.voted_for == request.candidate_id
        )

        # Check if candidate's log is up-to-date
        last_log_term = self._get_last_log_term()
        last_log_index = len(self.log) - 1
        log_ok = (
            request.last_log_term > last_log_term
            or (request.last_log_term == last_log_term
                and request.last_log_index >= last_log_index)
        )

        if can_vote and log_ok:
            # Grant vote
            self.voted_for = request.candidate_id
            self._persist("voted_for", self.voted_for)
            # Reset election timeout (we've heard from a valid candidate)
            self.last_heartbeat = time.time()
            return RequestVoteResponse(
                term=self.current_term,
                vote_granted=True
            )
        else:
            return RequestVoteResponse(
                term=self.current_term,
                vote_granted=False
            )

    def _get_last_log_term(self) -> int:
        """Get term of last log entry, or 0 if log is empty."""
        if not self.log:
            return 0
        return self.log[-1].term

    def _request_vote(self, peer: str) -> Optional[RequestVoteResponse]:
        """
        Send RequestVote RPC to a peer.
        """
        request = RequestVoteRequest(
            term=self.current_term,
            candidate_id=self.node_id,
            last_log_index=len(self.log) - 1,
            last_log_term=self._get_last_log_term()
        )
        try:
            return self._send_rpc(peer, "RequestVote", request)
        except NetworkTimeout:
            return None


# Example: Election scenario trace
"""
Scenario: 3-node cluster, Node A is leader, then crashes

Initial state:
- Node A (Leader, term=5):   log = [e1, e2, e3, e4, e5]
- Node B (Follower, term=5): log = [e1, e2, e3, e4, e5]
- Node C (Follower, term=5): log = [e1, e2, e3, e4]  # Slightly behind

Node A crashes. Network delays cause different election timeouts.

1. Node B times out first, becomes candidate, term=6
2. Node B sends RequestVote(term=6, lastLogIndex=4, lastLogTerm=5) to C
3. Node C grants vote:
   - term=6 >= currentTerm=5 ✓
   - Haven't voted in term 6 ✓
   - B's log (term=5, index=4) vs C's log (term=5, index=3)
   - Same term, B has more entries ✓
4. Node B has 2 votes (self + C), becomes leader

If Node C had timed out first:
1. Node C becomes candidate, term=6
2. Node C sends RequestVote(term=6, lastLogIndex=3, lastLogTerm=5) to B
3. Node B rejects:
   - C's log (term=5, index=3) vs B's log (term=5, index=4)
   - Same term but C has fewer entries ✗
4. Node C doesn't get majority, times out
5. Eventually B or C wins with higher log
"""
```

The log up-to-date check ensures the Leader Completeness property: a leader for any term must have all entries committed in previous terms. This is critical for safety—without it, a new leader might overwrite committed entries.
Once a leader is elected, it handles all client requests by appending entries to its log and replicating them to followers.
The AppendEntries RPC:
This single RPC handles both heartbeats (empty entries) and log replication:
AppendEntries RPC:
term - Leader's term
leaderId - For followers to redirect clients
prevLogIndex - Index of log entry immediately preceding new ones
prevLogTerm - Term of prevLogIndex entry
entries[] - Log entries to store (empty for heartbeat)
leaderCommit - Leader's commit index
Response:
term - CurrentTerm (for leader to update itself)
success - True if follower had entry matching prevLogIndex/Term
The Log Matching Property:
Raft maintains a crucial invariant:
If two entries in different logs have the same index and term, they store the same command AND all preceding entries are identical.
This is enforced by:
- Leaders create at most one entry per index within a term, and never modify an existing entry
- The AppendEntries consistency check: a follower accepts new entries only if its log already contains an entry whose prevLogIndex/prevLogTerm matches the leader's
```python
class RaftLeader:
    """
    Raft leader log replication.
    """
    def replicate_command(self, command: Any) -> bool:
        """
        Client-facing: Add command to log and replicate.
        Returns True if command is successfully committed.
        """
        # Append to local log
        entry = LogEntry(
            term=self.current_term,
            command=command,
            index=len(self.log)
        )
        self.log.append(entry)
        self._persist("log", self.log)

        # Replicate to followers
        return self._replicate_to_majority()

    def _replicate_to_majority(self) -> bool:
        """
        Replicate pending entries to all followers.
        Wait for majority acknowledgment.
        """
        while True:
            success_count = 1  # Count ourselves
            for peer in self.peers:
                if self._send_append_entries(peer):
                    success_count += 1

            majority = (len(self.peers) + 1) // 2 + 1
            if success_count >= majority:
                self._advance_commit_index()
                return True

            # If not enough responses, retry
            time.sleep(0.010)  # Brief backoff

    def _send_append_entries(self, peer: str) -> bool:
        """
        Send AppendEntries to a single follower.

        Uses next_index to determine which entries to send.
        Decrements next_index on failure for Log Matching.
        """
        next_idx = self.next_index[peer]

        # Prepare entries to send
        if next_idx < len(self.log):
            entries = self.log[next_idx:]
        else:
            entries = []  # Heartbeat

        # Get prev log info for consistency check
        prev_log_index = next_idx - 1
        prev_log_term = 0
        if prev_log_index >= 0:
            prev_log_term = self.log[prev_log_index].term

        request = AppendEntriesRequest(
            term=self.current_term,
            leader_id=self.node_id,
            prev_log_index=prev_log_index,
            prev_log_term=prev_log_term,
            entries=entries,
            leader_commit=self.commit_index
        )

        try:
            response = self._send_rpc(peer, "AppendEntries", request)
            if response.success:
                # Update tracking for this follower
                self.next_index[peer] = len(self.log)
                self.match_index[peer] = len(self.log) - 1
                return True
            else:
                # Log mismatch - decrement next_index and retry
                self.next_index[peer] = max(0, self.next_index[peer] - 1)
                return False
        except NetworkTimeout:
            return False

    def _advance_commit_index(self):
        """
        Advance commit index based on majority replication.

        An entry is committed when replicated to a majority
        AND the entry's term equals current term.
        """
        for n in range(self.commit_index + 1, len(self.log)):
            if self.log[n].term != self.current_term:
                continue  # Only commit entries from current term

            # Count replications (including ourselves)
            replicated = 1
            for peer in self.peers:
                if self.match_index.get(peer, -1) >= n:
                    replicated += 1

            majority = (len(self.peers) + 1) // 2 + 1
            if replicated >= majority:
                self.commit_index = n
                self._apply_committed_entries()


class RaftFollower:
    """
    Raft follower handling AppendEntries.
    """
    def handle_append_entries(self, request: AppendEntriesRequest) -> AppendEntriesResponse:
        """
        Handle AppendEntries RPC from leader.

        1. Term check
        2. Log consistency check
        3. Append new entries
        4. Update commit index
        """
        # Update term if leader has higher term
        if request.term > self.current_term:
            self.update_term(request.term)

        # Reject if stale term
        if request.term < self.current_term:
            return AppendEntriesResponse(
                term=self.current_term,
                success=False
            )

        # Valid leader - reset election timeout
        self.last_heartbeat = time.time()
        self.state = NodeState.FOLLOWER

        # Log consistency check
        if request.prev_log_index >= 0:
            if request.prev_log_index >= len(self.log):
                # We're missing entries
                return AppendEntriesResponse(
                    term=self.current_term,
                    success=False
                )
            if self.log[request.prev_log_index].term != request.prev_log_term:
                # Conflict - delete this entry and all after it
                self.log = self.log[:request.prev_log_index]
                self._persist("log", self.log)
                return AppendEntriesResponse(
                    term=self.current_term,
                    success=False
                )

        # Append new entries (if any)
        for i, entry in enumerate(request.entries):
            log_index = request.prev_log_index + 1 + i
            if log_index < len(self.log):
                if self.log[log_index].term != entry.term:
                    # Conflict - truncate and append
                    self.log = self.log[:log_index]
                    self.log.append(entry)
                # else: entry already exists correctly
            else:
                self.log.append(entry)
        self._persist("log", self.log)

        # Update commit index
        if request.leader_commit > self.commit_index:
            self.commit_index = min(
                request.leader_commit,
                len(self.log) - 1
            )
            self._apply_committed_entries()

        return AppendEntriesResponse(
            term=self.current_term,
            success=True
        )
```

A leader only commits entries from its current term. Why? If a leader commits entries from a previous term by counting replicas, those entries might be overwritten by a higher-term leader with a different log. By only counting current-term entries, the leader ensures it has the authority to make that commitment.
Raft guarantees five key properties that together ensure correct consensus:
1. Election Safety
At most one leader can be elected in any given term.
Ensured by: Each node votes for at most one candidate per term, and a majority is required to win.
2. Leader Append-Only
A leader never overwrites or deletes entries in its log; it only appends.
Ensured by: Leader's log is authoritative; it only adds entries at the end.
3. Log Matching
If two logs contain an entry with the same index and term, the logs are identical in all preceding entries.
Ensured by: AppendEntries consistency check with prevLogIndex/prevLogTerm.
4. Leader Completeness
If a log entry is committed in a given term, that entry will be present in the logs of all leaders for higher-numbered terms.
Ensured by: Voting restriction requiring candidate's log to be up-to-date.
5. State Machine Safety
If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for that index.
Ensured by: Combination of Log Matching and Leader Completeness.
| Property | Mechanism | What Could Go Wrong Without It |
|---|---|---|
| Election Safety | Majority vote + one vote per term | Two leaders accept conflicting writes |
| Leader Append-Only | Leader never modifies existing entries | Applied entries could be lost |
| Log Matching | prevLogIndex/Term check | Logs diverge, different commands at same index |
| Leader Completeness | Log up-to-date voting restriction | New leader overwrites committed entries |
| State Machine Safety | All above combined | Replicas have different state |
The Leader Completeness Proof Sketch:
Suppose entry E was committed in term T. We show any leader for term T' > T has E in its log:
1. Committing E required replicating it to a majority of nodes.
2. Winning the election for term T' required votes from a majority of nodes.
3. Any two majorities of the same cluster intersect, so at least one voter for the T' leader has E in its log.
4. That voter only granted its vote because the candidate's log was at least as up-to-date as its own, which (by induction over the terms between T and T') means the new leader's log contains E.
This chain of reasoning ensures committed entries can never be lost.
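The pivotal step—the commit majority and the election majority must share at least one node—can be checked by brute force for a small cluster:

```python
from itertools import combinations

# In a 5-node cluster, a majority is any 3 nodes. Enumerate all of them
# and verify that every pair of majorities shares at least one node.
nodes = ["A", "B", "C", "D", "E"]
majorities = [set(c) for c in combinations(nodes, 3)]

overlaps = all(m1 & m2 for m1 in majorities for m2 in majorities)
print(overlaps)  # True: every pair of majorities intersects
```

This is why the voter holding E is guaranteed to exist: the majority that acknowledged E and the majority that elected the new leader cannot be disjoint.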
Raft's safety properties form a logical chain. If you understand why each property holds and how they combine, you understand Raft's correctness. This modularity is precisely what makes Raft more understandable than Paxos, where the safety argument is more tightly interleaved.
Real Raft implementations must handle several edge cases that don't appear in the basic algorithm description.
1. Log Compaction and Snapshots
Logs grow indefinitely, eventually exhausting storage. Raft systems periodically compact the log by:
- Taking a snapshot of the state machine at some committed index
- Recording lastIncludedIndex and lastIncludedTerm in the snapshot metadata
- Discarding all log entries up to and including that index
If a follower is far behind, the leader sends an InstallSnapshot RPC instead of individual log entries.
2. Cluster Membership Changes
Changing the cluster membership (adding/removing nodes) is tricky because there's a window where two different majorities could exist (old cluster vs. new cluster).
Raft solves this with joint consensus:
- The leader first replicates a joint configuration C_old,new, under which elections and commitment require majorities from both the old and the new node sets
- Once C_old,new is committed, the leader replicates the new configuration C_new
- Once C_new is committed, nodes outside the new configuration can shut down
3. Pre-Vote Protocol
In some scenarios, a disconnected node repeatedly times out and increments its term. When it reconnects, its high term disrupts the cluster. The pre-vote extension has candidates probe whether they could win before incrementing their term.
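A minimal sketch of the pre-vote idea follows. The function names and dictionary shapes here are illustrative assumptions, not from any production implementation: the node applies the ordinary up-to-date check hypothetically, without incrementing its term, and only starts a real election if a majority would say yes.

```python
def would_grant(voter_last_term, voter_last_index,
                cand_last_term, cand_last_index):
    """The standard log up-to-date check, applied hypothetically."""
    return (cand_last_term > voter_last_term or
            (cand_last_term == voter_last_term and
             cand_last_index >= voter_last_index))

def pre_vote(candidate, peers):
    """Return True if the candidate could win a real election,
    judged purely from peer log states. No terms are incremented."""
    yes = 1  # the candidate would vote for itself
    for p in peers:
        if would_grant(p["last_term"], p["last_index"],
                       candidate["last_term"], candidate["last_index"]):
            yes += 1
    return yes > (len(peers) + 1) // 2

# A reconnecting node with a stale log fails pre-vote, so it never
# bumps its term and never forces the current leader to step down.
stale = {"last_term": 5, "last_index": 3}
peers = [{"last_term": 5, "last_index": 7},
         {"last_term": 5, "last_index": 7}]
print(pre_vote(stale, peers))  # False
```

Only after pre-vote succeeds does the node increment its term and send real RequestVote RPCs, so disruptive term inflation never happens.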
```python
class RaftSnapshotManager:
    """
    Log compaction via snapshots.
    """
    def __init__(self, node: RaftNode, snapshot_threshold: int = 10000):
        self.node = node
        self.snapshot_threshold = snapshot_threshold

        # Snapshot state
        self.last_included_index: int = -1
        self.last_included_term: int = 0
        self.snapshot_data: Optional[bytes] = None

    def maybe_snapshot(self):
        """
        Take snapshot if log is too long.
        """
        if len(self.node.log) < self.snapshot_threshold:
            return

        # Can only snapshot committed entries
        if self.node.commit_index <= self.last_included_index:
            return

        # Get state machine snapshot
        snapshot_index = self.node.commit_index
        snapshot_data = self.node.state_machine.snapshot()

        # Update snapshot metadata
        self.last_included_index = snapshot_index
        self.last_included_term = self.node.log[snapshot_index].term
        self.snapshot_data = snapshot_data

        # Compact log (keep entries after snapshot)
        self.node.log = self.node.log[snapshot_index + 1:]

        # Persist snapshot and compacted log
        self._persist_snapshot()

    def install_snapshot(self, leader_term: int, last_included_index: int,
                         last_included_term: int, data: bytes) -> bool:
        """
        Handle InstallSnapshot RPC from leader.
        Used when follower is too far behind for normal replication.
        """
        if leader_term < self.node.current_term:
            return False

        if last_included_index <= self.last_included_index:
            return True  # Already have this or newer snapshot

        # Discard entire log if snapshot covers all of it
        if last_included_index >= len(self.node.log):
            self.node.log = []
        else:
            # Keep entries after snapshot
            self.node.log = self.node.log[last_included_index + 1:]

        # Update snapshot state
        self.last_included_index = last_included_index
        self.last_included_term = last_included_term
        self.snapshot_data = data

        # Install snapshot to state machine
        self.node.state_machine.restore(data)
        self.node.commit_index = last_included_index
        self.node.last_applied = last_included_index

        self._persist_snapshot()
        return True


class RaftMembershipChange:
    """
    Joint consensus for membership changes.
    """
    def __init__(self, node: RaftNode):
        self.node = node
        self.configuration = ClusterConfiguration(node.peers + [node.node_id])
        self.transitioning = False

    def change_membership(self, new_nodes: List[str]) -> bool:
        """
        Change cluster membership from current to new configuration.
        Uses joint consensus: C_old+new -> C_new
        """
        if self.transitioning:
            return False  # One change at a time
        self.transitioning = True

        # Phase 1: Enter joint configuration
        joint_config = JointConfiguration(
            old_nodes=self.configuration.nodes,
            new_nodes=new_nodes
        )

        # Replicate joint config as log entry
        joint_entry = ConfigurationEntry(
            type="JOINT",
            config=joint_config
        )
        if not self.node.replicate_command(joint_entry):
            self.transitioning = False
            return False

        # Phase 2: Transition to new configuration
        new_config = ClusterConfiguration(new_nodes)
        new_entry = ConfigurationEntry(
            type="NEW",
            config=new_config
        )
        if not self.node.replicate_command(new_entry):
            # Tricky: we're in joint config, must complete transition
            # In practice, retry until success
            pass

        self.configuration = new_config
        self.transitioning = False
        return True


class JointConfiguration:
    """
    Joint consensus configuration requiring majorities from both sets.
    """
    def __init__(self, old_nodes: List[str], new_nodes: List[str]):
        self.old_nodes = old_nodes
        self.new_nodes = new_nodes

    def is_quorum(self, responding_nodes: Set[str]) -> bool:
        """
        Quorum requires majority from BOTH old and new configurations.
        """
        old_count = len(responding_nodes.intersection(self.old_nodes))
        new_count = len(responding_nodes.intersection(self.new_nodes))

        old_majority = old_count > len(self.old_nodes) // 2
        new_majority = new_count > len(self.new_nodes) // 2

        return old_majority and new_majority
```

An alternative to joint consensus is to limit changes to one server at a time. Adding or removing one server at a time guarantees that old and new majorities overlap, eliminating the need for joint configurations. Many Raft implementations use this simpler approach.
Raft is widely deployed in production systems. Understanding these implementations helps bridge theory and practice.
etcd:
The distributed key-value store used by Kubernetes for cluster state. etcd's Raft implementation (etcd/raft) is a reference implementation used by many other Go projects.
Consul:
HashiCorp's service discovery and configuration platform uses Raft for its catalog and KV store.
CockroachDB:
A distributed SQL database using Raft for replication within each range (data partition).
TiKV:
A distributed transactional key-value store, part of TiDB ecosystem.
| System | Language | Use Case | Notable Features |
|---|---|---|---|
| etcd | Go | Kubernetes, distributed KV | Reference implementation, leases, watch |
| Consul | Go | Service discovery, config | Multi-DC, health checks |
| CockroachDB | Go | Distributed SQL | Multi-Raft, geo-partitioning |
| TiKV | Rust | Transactional KV | Region-based scaling |
| RethinkDB | C++ | Document database | Changefeeds |
| Neo4j (Enterprise) | Java | Graph database | Clustering |
Production Considerations:
1. Backpressure: When the leader is overwhelmed, it must slow down accepting new proposals. Without backpressure, the leader's memory grows unboundedly as pending entries queue.
2. Batching: Committing one entry per round trip is inefficient. Production systems batch multiple entries per AppendEntries RPC and commit them together.
3. Pipelining: Instead of waiting for one AppendEntries to complete before sending the next, pipeline multiple requests. This hides network latency.
4. Read Optimization: Linearizable reads from the leader require confirming leadership. Options:
- ReadIndex: record the current commit index, confirm leadership with a round of heartbeats, then serve the read once the state machine has applied up to that index
- Leader leases: serve reads directly during a time-bounded lease, at the cost of trusting bounded clock drift
- Log reads: append the read to the log like a write—always safe, but slowest
5. Multi-Raft: For large-scale systems, run many independent Raft groups, each managing a data partition. This requires careful connection pooling and heartbeat coalescing.
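As an illustration of the batching idea above (the `BatchingLeader` class is a toy sketch, not how etcd or Consul implement it), a leader can drain its pending command queue into a single AppendEntries RPC instead of sending one RPC per command:

```python
from collections import deque

class BatchingLeader:
    """Toy sketch: amortize RPC cost by shipping many entries per call."""
    def __init__(self, max_batch=64):
        self.pending = deque()   # commands awaiting replication
        self.max_batch = max_batch
        self.rpcs_sent = 0

    def propose(self, command):
        self.pending.append(command)

    def flush(self):
        """Drain up to max_batch pending entries into one AppendEntries RPC."""
        if not self.pending:
            return []
        batch = [self.pending.popleft()
                 for _ in range(min(self.max_batch, len(self.pending)))]
        self.rpcs_sent += 1   # one RPC carries the whole batch
        return batch

leader = BatchingLeader(max_batch=64)
for i in range(100):
    leader.propose(f"set x{i}")

first = leader.flush()   # 64 entries in one RPC
second = leader.flush()  # remaining 36 entries in a second RPC
print(len(first), len(second), leader.rpcs_sent)  # 64 36 2
```

Without batching, those 100 commands would have cost 100 round trips; with it, two. Production systems combine this with pipelining so the second batch is in flight before the first is acknowledged.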
Unless you have very specific requirements, use an existing Raft implementation or Raft-based system. etcd's raft library, HashiCorp's raft, and tikv/raft-rs are battle-tested and handle the many edge cases that catch new implementations. Consensus bugs are among the hardest to detect and most costly to fix.
We've covered Raft comprehensively, from its philosophy to production deployment. Let's consolidate the key insights:
- Raft was designed for understandability, decomposing consensus into leader election, log replication, safety, and membership changes
- Terms act as a logical clock; randomized election timeouts prevent perpetual split votes
- Contiguous logs plus the AppendEntries consistency check yield the Log Matching property
- The voting restriction yields Leader Completeness: committed entries survive leader changes
- A leader only directly commits entries from its own term
- Production systems layer snapshots, batching, pipelining, and read optimizations on top of the core protocol
What's next:
With leader election and consensus covered, the final piece of distributed coordination is distributed locking—ensuring exclusive access to resources across a network. We'll explore how to build correct locks on top of consensus and the subtle pitfalls that trap unwary developers.
You now understand Raft, the dominant consensus algorithm in modern distributed systems. From its elegantly decomposed design to its production implementations in etcd, Consul, and CockroachDB, Raft demonstrates that understandability and correctness can coexist. Next, we'll see how consensus enables distributed locking.