In any leader-based distributed system, leader election is the foundation upon which everything else rests. Without a functioning leader, Raft cannot accept client requests, replicate log entries, or maintain consistency. The leader election protocol must be fast, reliable, and—above all—safe.
"Safe" in this context has a precise meaning: at most one leader may exist per term. If two leaders existed simultaneously, they could accept conflicting client requests, leading to divergent state across the cluster—the worst possible outcome for a consensus algorithm.
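The safety claim follows from simple arithmetic: each server casts at most one vote per term, so two candidates in the same term would need two disjoint majorities, which cannot both fit in n servers. A small back-of-the-envelope sketch (the function name `max_concurrent_winners` is ours for illustration, not from any Raft implementation):

```python
# Hypothetical sketch: why "one vote per server per term" forbids two leaders.
def max_concurrent_winners(n_servers: int) -> int:
    """With each server casting at most one vote per term, count how many
    candidates could simultaneously hold a majority of n_servers votes."""
    majority = n_servers // 2 + 1
    # Only n_servers votes exist in total; each winner consumes `majority`
    # of them, and no vote can be counted twice.
    return n_servers // majority

for n in (3, 5, 7):
    assert max_concurrent_winners(n) == 1  # two majorities would need > n votes
```

This is the same quorum-intersection argument that reappears later when we discuss the election restriction.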
This page dissects Raft's leader election mechanism in complete detail. We'll examine exactly when elections start, how candidates request votes, why the voting rules guarantee safety, and what happens in edge cases like network partitions and simultaneous elections.
By the end of this page, you will understand:
• The three server states (Follower, Candidate, Leader) and all transitions between them
• Election triggers — when and why elections begin
• The RequestVote RPC — the complete protocol for requesting and granting votes
• The election restriction — how Raft prevents unsafe leaders from being elected
• Split vote handling — how randomized timeouts ensure progress
• Pre-vote optimization — a common extension that prevents disruption
Every Raft server is in exactly one of three states at any time:
1. Follower — The default, passive state. Followers respond to requests from leaders and candidates but do not initiate communication. They simply respond to AppendEntries and RequestVote RPCs, reset their election timer on each valid leader contact or granted vote, and apply newly committed entries to their state machine.
2. Candidate — A transitional state during elections. A server becomes a candidate when its election timeout expires without hearing from a current leader; it then increments its term, votes for itself, and requests votes from its peers.
3. Leader — The active, authoritative state. Exactly one leader exists per term. The leader sends periodic heartbeats to suppress new elections, accepts client requests, and replicates log entries to followers.
"""Raft State Machine - Complete State Transitions ┌──────────────────────────────────────────────────────────┐ │ │ │ ┌─────────┐ timeout ┌───────────┐ wins vote ┌────────┐ │ │ │────────────►│ │──────────────►│ │ │ │ Follower│ │ Candidate │ │ Leader │ │ │ │◄────────────│ │◄──────────────│ │ │ └─────────┘ discovers └───────────┘ discovers └────────┘ │ ▲ higher term │ higher term │ │ │ │ │ │ │ │ split vote │ │ │ │ (times out) │ │ │ ▼ │ │ │ [restart election] │ │ │ │ │ └───────────────────────────────────────────────────┘ │ discovers higher term │ └── All servers start as Followers """ from enum import Enumfrom dataclasses import dataclassfrom typing import Optional, Callableimport asyncio class ServerState(Enum): FOLLOWER = "follower" CANDIDATE = "candidate" LEADER = "leader" @dataclassclass StateTransition: from_state: ServerState to_state: ServerState reason: str class RaftStateMachine: """ Complete state machine for Raft server states. Handles all legal transitions and invariants. """ def __init__(self, on_become_leader: Callable, on_step_down: Callable): self.state = ServerState.FOLLOWER self.on_become_leader = on_become_leader self.on_step_down = on_step_down self.transition_log: list[StateTransition] = [] def _transition(self, new_state: ServerState, reason: str) -> bool: """ Record and execute a state transition. Returns True if transition occurred. """ if new_state == self.state: return False # No-op old_state = self.state self.transition_log.append(StateTransition(old_state, new_state, reason)) self.state = new_state # Trigger callbacks if new_state == ServerState.LEADER: self.on_become_leader() elif old_state == ServerState.LEADER: self.on_step_down() return True # === TRANSITION: Follower → Candidate === def start_election(self) -> bool: """ Called when election timeout expires. Only valid from Follower state. 
""" if self.state != ServerState.FOLLOWER: return False return self._transition( ServerState.CANDIDATE, "election timeout expired" ) # === TRANSITION: Candidate → Leader === def win_election(self) -> bool: """ Called when candidate receives majority votes. Only valid from Candidate state. """ if self.state != ServerState.CANDIDATE: return False return self._transition( ServerState.LEADER, "received majority votes" ) # === TRANSITION: Candidate → Candidate (restart) === def restart_election(self) -> bool: """ Called when candidate's election timeout expires without winning or losing. Only valid from Candidate state. """ if self.state != ServerState.CANDIDATE: return False # Technically stays Candidate, but increments term # and starts new vote collection return True # Term increment handled elsewhere # === TRANSITION: Any → Follower (step down) === def step_down(self, reason: str) -> bool: """ Called when discovering a higher term from any state. Always valid (though no-op if already Follower). """ if self.state == ServerState.FOLLOWER: return False return self._transition(ServerState.FOLLOWER, reason) # === QUERY METHODS === def is_leader(self) -> bool: return self.state == ServerState.LEADER def is_candidate(self) -> bool: return self.state == ServerState.CANDIDATE def is_follower(self) -> bool: return self.state == ServerState.FOLLOWER| From | To | Trigger | Actions |
|---|---|---|---|
| Follower | Candidate | Election timeout | Increment term, vote for self, request votes |
| Candidate | Leader | Receives majority votes | Start sending heartbeats, initialize leader state |
| Candidate | Follower | Discovers higher term OR receives valid AppendEntries | Update term, clear voted_for |
| Candidate | Candidate | Election timeout (split vote) | Increment term, restart election |
| Leader | Follower | Discovers higher term | Stop heartbeats, clear leader state |
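The transition table above can itself be encoded as data and used as a guard in tests. A minimal illustrative sketch (the `LEGAL` map and `is_legal` helper are hypothetical, not part of any Raft library):

```python
# Hypothetical sketch: the transition table encoded directly as data.
LEGAL = {
    ("follower", "candidate"): "election timeout",
    ("candidate", "leader"): "received majority votes",
    ("candidate", "follower"): "higher term or valid AppendEntries",
    ("candidate", "candidate"): "election timeout (split vote)",
    ("leader", "follower"): "discovered higher term",
}

def is_legal(frm: str, to: str) -> bool:
    """Check whether a state transition appears in the table."""
    return (frm, to) in LEGAL

assert is_legal("follower", "candidate")
assert is_legal("leader", "follower")
assert not is_legal("follower", "leader")   # never jumps straight to leader
assert not is_legal("leader", "candidate")  # a deposed leader steps down, never campaigns
```

Encoding legality as data keeps the invariants in one place, so a transition function can reject anything outside the table rather than relying on scattered `if` checks.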
Elections are triggered by a single condition: a follower's election timeout expires without receiving communication from a valid leader.
This simple rule encapsulates several scenarios:
Leader failure — The leader crashes or becomes unreachable. Followers stop receiving heartbeats and eventually time out.
Network partition — A follower is partitioned from the leader. From the follower's perspective, this looks identical to leader failure.
Cluster startup — When a Raft cluster starts, no leader exists. All servers are followers with running election timers. The first server to time out starts an election.
Slow leader — If the leader is so overloaded that heartbeats don't arrive in time, followers may start spurious elections.
```python
import asyncio
import random
from typing import Optional


class ElectionController:
    """
    Controls when elections are triggered based on heartbeat receipt.

    The election timer is RESET when:
    1. Receiving valid AppendEntries from the current leader
    2. Granting a vote to a candidate

    The election timer FIRES when:
    - No reset occurs within the timeout period
    """

    ELECTION_TIMEOUT_MIN_MS = 150
    ELECTION_TIMEOUT_MAX_MS = 300

    def __init__(self, on_election_timeout):
        self.on_election_timeout = on_election_timeout
        self._timer_task: Optional[asyncio.Task] = None
        self._current_timeout_ms: int = 0

    def _random_timeout_ms(self) -> int:
        """
        Generate a random election timeout.

        Why random?
        - Prevents all servers from timing out simultaneously
        - Usually ONE server times out first and wins cleanly
        - Even in a split vote, random timeouts break the tie
        """
        return random.randint(
            self.ELECTION_TIMEOUT_MIN_MS,
            self.ELECTION_TIMEOUT_MAX_MS
        )

    async def _timer_loop(self):
        """
        The actual timer coroutine.
        Sleeps for the timeout, then triggers an election.
        """
        try:
            timeout_sec = self._current_timeout_ms / 1000.0
            await asyncio.sleep(timeout_sec)
            # Timeout expired! Time to become a candidate.
            print(f"Election timeout ({self._current_timeout_ms}ms) - starting election")
            await self.on_election_timeout()
        except asyncio.CancelledError:
            # Timer was reset before it fired - this is normal
            pass

    def reset(self):
        """
        Reset the election timer. Called when receiving valid communication
        from the leader or when granting a vote to a candidate.

        Key insight: the timer resets with a NEW random timeout.
        This prevents synchronized elections after recovery.
        """
        # Cancel existing timer
        if self._timer_task and not self._timer_task.done():
            self._timer_task.cancel()
        # Start new timer with a fresh random timeout
        self._current_timeout_ms = self._random_timeout_ms()
        self._timer_task = asyncio.create_task(self._timer_loop())

    def stop(self):
        """
        Stop the election timer entirely. Called when becoming leader
        (leaders don't have election timeouts).
        """
        if self._timer_task and not self._timer_task.done():
            self._timer_task.cancel()
        self._timer_task = None


# Example: Simulating election trigger scenarios
class Scenario:
    """Demonstration of when elections are triggered."""

    @staticmethod
    async def scenario_leader_failure():
        """
        Leader crashes at t=0. A follower with a 200ms timeout
        starts an election at t=200ms.
        """
        print("Time 0: Leader crashes")
        print("Time 0-200ms: Follower waiting...")
        await asyncio.sleep(0.2)
        print("Time 200ms: Follower starts election")

    @staticmethod
    async def scenario_heartbeat_keeps_alive():
        """
        Leader sends a heartbeat every 50ms; the follower resets its 200ms
        timer on each one. Result: the follower never times out while the
        leader is healthy.
        """
        print("Leader heartbeat schedule: t=50, 100, 150, 200, ...")
        print("Follower timeout 200ms, but resets at t=50")
        print("Next timeout would be at t=250, but resets at t=100")
        print("Follower never starts election while heartbeats arrive")
```

If the leader is slow but not dead, followers may start elections unnecessarily. This creates an "election storm" where the cluster constantly holds elections instead of doing useful work. Production systems tune timeouts carefully and may implement "pre-vote" (discussed later) to prevent spurious elections.
When a server becomes a candidate, it initiates an election by sending RequestVote RPCs to all other servers. This RPC is the heart of leader election and carries specific information that voters use to decide whether to grant their vote.
```python
from dataclasses import dataclass


@dataclass
class RequestVoteRequest:
    """
    Arguments sent by candidates to request votes.
    Each field serves a specific purpose:
    """

    # === Candidate's term ===
    # Used for:
    # 1. Voters reject if their term is higher (stale candidate)
    # 2. Voters update term if candidate's is higher
    # 3. Identifies which election this vote is for
    term: int

    # === Candidate's ID ===
    # Used for:
    # 1. Recording who we voted for
    # 2. Knowing who to recognize as leader if they win
    candidate_id: int

    # === Index of candidate's last log entry ===
    # Used for the ELECTION RESTRICTION (critical for safety):
    # Voters only vote for candidates with "at least as up-to-date" logs
    last_log_index: int

    # === Term of candidate's last log entry ===
    # Combined with last_log_index to determine "up-to-date-ness"
    # A log is more up-to-date if:
    # 1. It has a higher last term, OR
    # 2. Same last term but longer log
    last_log_term: int


@dataclass
class RequestVoteResponse:
    """Response from voters."""

    # The voter's current term.
    # If higher than the candidate's, the candidate must step down.
    term: int

    # Whether the vote was granted
    vote_granted: bool


class VoteHandler:
    """Handles incoming RequestVote RPCs as a voter."""

    def __init__(self, server):
        self.server = server

    def handle_request_vote(self, request: RequestVoteRequest) -> RequestVoteResponse:
        """
        Process a vote request. The complete decision logic:
        """
        # === CHECK 1: Term comparison ===
        if request.term < self.server.current_term:
            # Candidate is from a past term - reject and inform them
            return RequestVoteResponse(
                term=self.server.current_term,
                vote_granted=False
            )

        # If the candidate's term is higher, update our term and become follower
        if request.term > self.server.current_term:
            self.server.become_follower(request.term)

        # === CHECK 2: Have we already voted this term? ===
        if (self.server.voted_for is not None
                and self.server.voted_for != request.candidate_id):
            # Already voted for someone else this term
            return RequestVoteResponse(
                term=self.server.current_term,
                vote_granted=False
            )

        # === CHECK 3: Is the candidate's log at least as up-to-date as ours? ===
        # This is the ELECTION RESTRICTION - critical for safety
        if not self._is_candidate_log_up_to_date(request):
            return RequestVoteResponse(
                term=self.server.current_term,
                vote_granted=False
            )

        # === All checks passed - grant vote ===
        self.server.voted_for = request.candidate_id
        self.server.persist()  # Must persist before responding!

        # Reset election timer - we found a viable candidate
        self.server.reset_election_timer()

        return RequestVoteResponse(
            term=self.server.current_term,
            vote_granted=True
        )

    def _is_candidate_log_up_to_date(self, request: RequestVoteRequest) -> bool:
        """
        Determine if the candidate's log is at least as up-to-date as ours.

        "Up-to-date" comparison:
        1. If last log terms differ: higher term wins
        2. If last log terms equal: longer log wins

        This ensures the elected leader has all committed entries.
        """
        my_last_index, my_last_term = self.server.get_last_log_info()

        # Rule 1: Compare terms first
        if request.last_log_term != my_last_term:
            return request.last_log_term > my_last_term

        # Rule 2: Same term - compare lengths
        return request.last_log_index >= my_last_index
```

The Three Conditions for Granting a Vote:
A server grants its vote if and only if ALL of these conditions hold:

1. The candidate's term is at least as large as the voter's current term.
2. The voter has not already voted for a different candidate in this term.
3. The candidate's log is at least as up-to-date as the voter's log.
The first two are straightforward. The third—the election restriction—is subtle but critical for safety. We'll examine it in detail next.
The election restriction is one of Raft's most important safety mechanisms. It ensures that any elected leader contains all entries that have been committed in previous terms.
Why is this necessary?
Consider what would happen without this restriction:

1. A leader replicates entry X to a majority of servers and marks it committed, then crashes.
2. Server B, whose log is missing entry X, times out and starts an election.
3. Server B wins the election, becomes leader, and overwrites entry X on the other servers — destroying a committed entry.
The election restriction prevents step 3. Since entry X was replicated to a majority before being committed, and elections require a majority vote, at least one voter must have entry X. That voter will refuse to vote for Server B because B's log is "less up-to-date."
"""The Election Restriction Explained with Concrete Example Cluster: Servers A, B, C (majority = 2) === SCENARIO WHERE ELECTION RESTRICTION SAVES US === Initial state: A (Leader, term=1): log = [1:x, 1:y, 1:z] (committed up to z) B (Follower): log = [1:x, 1:y, 1:z] (has all entries) C (Follower): log = [1:x, 1:y] (missed z) Entry z was committed (replicated to A and B, majority achieved). Now A crashes. B and C start election (term 2). === IF C TRIES TO BECOME LEADER === C sends RequestVote to B: term = 2 last_log_index = 2 (entries x, y) last_log_term = 1 B checks C's log against its own: B's last_log_index = 3 B's last_log_term = 1 Comparison: Same last term, but B has longer log → B's log is more up-to-date Result: B REJECTS C's vote request. C cannot get majority (only has its own vote). === IF B TRIES TO BECOME LEADER === B sends RequestVote to C: term = 2 last_log_index = 3 last_log_term = 1 C checks B's log against its own: C's last_log_index = 2 C's last_log_term = 1 Comparison: Same last term, but B has longer log → B's log is more up-to-date Result: C GRANTS vote to B. B gets majority (B + C) and becomes leader.Entry z is preserved!""" def compare_logs( candidate_last_index: int, candidate_last_term: int, voter_last_index: int, voter_last_term: int) -> str: """ Returns which log is 'more up-to-date'. 
The comparison works like version numbers: - Compare major version (term) first - Compare minor version (index) if major versions match """ if candidate_last_term > voter_last_term: return "CANDIDATE is more up-to-date" elif candidate_last_term < voter_last_term: return "VOTER is more up-to-date" else: # Same term if candidate_last_index >= voter_last_index: return "CANDIDATE is at least as up-to-date (vote granted)" else: return "VOTER is more up-to-date (vote denied)" # More complex scenario: Term matters more than length"""Scenario where term matters more than length: A: log = [1:a, 1:b, 1:c, 1:d] (4 entries, all term 1) B: log = [1:a, 1:b, 2:x] (3 entries, last is term 2) Who has the more up-to-date log? B does! Even though B's log is shorter, B has an entry from term 2.Entries from higher terms are "newer" than entries from lower terms,regardless of log length. Why? Because an entry from term 2 means B was replicated up to at leastthe leader of term 2. Whatever A has in term 1 must be older.""" print(compare_logs( candidate_last_index=4, candidate_last_term=1, # A as candidate voter_last_index=3, voter_last_term=2 # B as voter))# Output: "VOTER is more up-to-date (vote denied)" print(compare_logs( candidate_last_index=3, candidate_last_term=2, # B as candidate voter_last_index=4, voter_last_term=1 # A as voter))# Output: "CANDIDATE is more up-to-date"The election restriction works because of quorum intersection. Committing an entry requires a majority. Winning an election requires a majority. Any two majorities must share at least one server. Therefore, any election winner must have contacted at least one server that has every committed entry. The election restriction ensures the winner actually has those entries (or won't win).
Let's walk through exactly what happens when a server runs an election, from the moment it times out to winning (or losing).
```python
import asyncio
from dataclasses import dataclass
from typing import Set


@dataclass
class VoteResult:
    server_id: int
    vote_granted: bool
    term: int


class ElectionCoordinator:
    """Coordinates the election process for a candidate."""

    def __init__(self, server):
        self.server = server

    async def run_election(self) -> bool:
        """
        Execute an election. Returns True if we became leader.

        Step 1: Transition to candidate
        Step 2: Increment term
        Step 3: Vote for self
        Step 4: Send RequestVote RPCs to all peers
        Step 5: Wait for responses
        Step 6: Process results
        """
        # ========================================
        # STEP 1: Become candidate
        # ========================================
        self.server.state = ServerState.CANDIDATE
        print(f"Server {self.server.id}: Became candidate")

        # ========================================
        # STEP 2: Increment term
        # ========================================
        self.server.current_term += 1
        election_term = self.server.current_term
        print(f"Server {self.server.id}: Starting election for term {election_term}")

        # ========================================
        # STEP 3: Vote for self
        # ========================================
        self.server.voted_for = self.server.id
        self.server.persist()  # Must persist before sending RPCs!

        votes_received: Set[int] = {self.server.id}  # Self vote
        votes_needed = (len(self.server.peers) + 1) // 2 + 1  # Majority
        print(f"Server {self.server.id}: Voted for self. "
              f"Need {votes_needed} votes total.")

        # ========================================
        # STEP 4: Build RequestVote message
        # ========================================
        last_log_index, last_log_term = self.server.get_last_log_info()
        request = RequestVoteRequest(
            term=election_term,
            candidate_id=self.server.id,
            last_log_index=last_log_index,
            last_log_term=last_log_term
        )

        # ========================================
        # STEP 5: Send RPCs in parallel, collect results
        # ========================================
        tasks = []
        for peer_id in self.server.peers:
            task = asyncio.create_task(
                self._request_vote_with_retry(peer_id, request)
            )
            tasks.append(task)

        # Process responses as they arrive
        for completed in asyncio.as_completed(tasks):
            try:
                result = await completed

                # Check if we're still in the same election
                if self.server.current_term != election_term:
                    print(f"Server {self.server.id}: Term changed during election, aborting")
                    return False
                if self.server.state != ServerState.CANDIDATE:
                    print(f"Server {self.server.id}: No longer candidate, aborting")
                    return False

                # Process the vote
                outcome = self._process_vote_response(result, election_term)
                if outcome == "STEP_DOWN":
                    return False
                if outcome == "VOTE_GRANTED":
                    votes_received.add(result.server_id)
                    print(f"Server {self.server.id}: Got vote from {result.server_id}. "
                          f"Total: {len(votes_received)}/{votes_needed}")

                # ========================================
                # STEP 6: Check if we won
                # ========================================
                if len(votes_received) >= votes_needed:
                    print(f"Server {self.server.id}: Won election for term {election_term}!")
                    self.server.become_leader()
                    return True

            except asyncio.TimeoutError:
                continue  # This peer didn't respond in time

        # Didn't get enough votes
        print(f"Server {self.server.id}: Election failed. "
              f"Got {len(votes_received)}/{votes_needed}")
        return False

    def _process_vote_response(self, result: VoteResult, election_term: int) -> str:
        """
        Process a single vote response.
        Returns: "VOTE_GRANTED", "VOTE_DENIED", or "STEP_DOWN"
        """
        # If the response contains a higher term, step down immediately
        if result.term > election_term:
            print(f"Server {self.server.id}: Discovered higher term {result.term}, stepping down")
            self.server.become_follower(result.term)
            return "STEP_DOWN"
        if result.vote_granted:
            return "VOTE_GRANTED"
        return "VOTE_DENIED"

    async def _request_vote_with_retry(
        self,
        peer_id: int,
        request: RequestVoteRequest,
        timeout_ms: int = 100
    ) -> VoteResult:
        """
        Send a RequestVote RPC with a timeout.
        In production, might retry on transient failures.
        """
        # Simulated RPC - in real code, this is network I/O.
        # wait_for raises asyncio.TimeoutError if the peer is too slow,
        # which the election loop above treats as a missing vote.
        response = await asyncio.wait_for(
            self.server.rpc_client.request_vote(peer_id, request),
            timeout=timeout_ms / 1000.0
        )
        return VoteResult(
            server_id=peer_id,
            vote_granted=response.vote_granted,
            term=response.term
        )
```

A split vote occurs when no candidate receives a majority in an election. This can happen when:

• Two or more followers time out almost simultaneously, each becomes a candidate, and the votes divide among them
• So many servers are down or unreachable that no candidate can assemble a majority
When a split vote occurs:

1. No candidate collects a majority, so no leader emerges for the term.
2. Each candidate's election timer eventually expires again.
3. Each candidate increments its term and restarts the election with a fresh random timeout.
"""Split Vote Scenario Analysis Cluster: 5 servers (A, B, C, D, E), majority = 3 === TIMELINE OF A SPLIT VOTE === t=0: Leader crashest=150: A's timeout expires (150ms), A becomes candidate term 2t=155: B's timeout expires (155ms), B becomes candidate term 2t=160: C's timeout expires (160ms), but receives A's RequestVote first Vote distribution: A votes for: A (self) B votes for: B (self) C votes for: A (first valid request received) D votes for: B (if B's request arrived first) E votes for: A (if A's request arrived first) Results vary by network timing, but let's say: A gets: A, C, E = 3 votes → WINS! B gets: B, D = 2 votes → loses But what if timing was different: A gets: A, C = 2 votes B gets: B, D, E = 3 votes → B WINS! And adversarial timing: A gets: A, C = 2 votes B gets: B, D = 2 votes E's response delayed... → SPLIT VOTE === HOW RANDOMIZATION HELPS === After split vote, candidates restart election:- A chooses new timeout: 250ms- B chooses new timeout: 180ms t=0: Split vote endedt=180: B's timeout expires, B starts term 3 electiont=180+: B collects votes before A even startst=180+: B wins with majorityt=250: A's timeout expires, but B is already leadert=250+: A discovers B's higher term, steps down Probability of repeated splits decreases exponentiallywith each retry due to randomization.""" import randomimport statistics def simulate_elections(num_servers: int, timeout_range: tuple, trials: int = 10000): """ Simulate how often clean elections occur vs split votes. 
""" clean_wins = 0 split_votes = 0 min_timeout, max_timeout = timeout_range threshold = 5 # ms within which we consider "simultaneous" for _ in range(trials): # Each server picks a random timeout timeouts = [random.randint(min_timeout, max_timeout) for _ in range(num_servers)] # Find the minimum timeout (first to become candidate) min_time = min(timeouts) # Count how many are within threshold of minimum (potential split) simultaneous = sum(1 for t in timeouts if t - min_time <= threshold) if simultaneous == 1: clean_wins += 1 else: split_votes += 1 return { "clean_election_rate": clean_wins / trials, "split_vote_rate": split_votes / trials } # With standard 150-300ms range, 5 servers:result = simulate_elections(5, (150, 300))print(f"Clean election probability: {result['clean_election_rate']:.1%}")print(f"Split vote probability: {result['split_vote_rate']:.1%}") # Typical output: Clean ~93%, Split ~7%# Even with a split, the next round usually resolves cleanlyWhile split votes can occur, they are probabilistically rare and don't compound. Each election attempt is independent, so the probability of N consecutive split votes is exponentially small. In practice, elections complete within a few hundred milliseconds, even with occasional splits.
Standard Raft has a potential problem: a partitioned server can disrupt the cluster when it reconnects.
The Scenario:

1. A server is partitioned away from the rest of the cluster.
2. It repeatedly times out, starts elections, and increments its term — but never wins, because it cannot reach a majority.
3. Its term climbs far above the cluster's current term.
4. When the partition heals, its higher term forces the current leader to step down, triggering an unnecessary election.
This is technically "safe" (the cluster will recover), but it causes unnecessary disruption. The pre-vote extension prevents this.
"""Pre-Vote: Check Viability Before Real Election The idea: Before incrementing term and requesting real votes,send "pre-vote" requests that don't affect state. Pre-vote asks: "Would you vote for me if I started an election?" Servers respond based on:1. Their current leader (if recently heard from leader, answer NO)2. The candidate's log (same election restriction) Only if pre-vote succeeds does the candidate actually start the election. This prevents partitioned servers from disrupting the cluster:- They send pre-votes- Connected servers say "no, we have a leader"- Partitioned server never increments term- When reconnected, term is still low → no disruption""" from dataclasses import dataclassfrom typing import Optional @dataclassclass PreVoteRequest: """ Pre-vote request - does NOT change any state. Note: Uses proposed term (current + 1), not current term. """ term: int # Would-be election term (current_term + 1) candidate_id: int last_log_index: int last_log_term: int @dataclass class PreVoteResponse: term: int vote_granted: bool class PreVoteHandler: """ Handle pre-vote requests. Key difference from regular vote: We don't update state! """ # How recently we must have heard from leader to reject pre-vote LEADER_STICKINESS_MS = 500 # Typically 2-3 heartbeat intervals def __init__(self, server): self.server = server def handle_pre_vote(self, request: PreVoteRequest) -> PreVoteResponse: """ Respond to pre-vote request. Critical differences from regular vote: 1. Don't update our term (even if request term is higher) 2. Don't record vote (can pre-vote for multiple candidates) 3. Check if we've recently heard from leader """ # Check 1: Is our term higher? # (Use current_term + 1 because candidate hasn't incremented yet) if request.term < self.server.current_term: return PreVoteResponse( term=self.server.current_term, vote_granted=False ) # Check 2: Have we heard from a leader recently? 
# If yes, the "election" is probably due to network issues if self._recently_heard_from_leader(): return PreVoteResponse( term=self.server.current_term, vote_granted=False ) # Check 3: Is candidate's log up-to-date? (Same as regular vote) if not self._is_log_up_to_date(request): return PreVoteResponse( term=self.server.current_term, vote_granted=False ) # Pre-vote granted - but NO STATE CHANGES return PreVoteResponse( term=self.server.current_term, vote_granted=True ) def _recently_heard_from_leader(self) -> bool: """ Check if we've received a valid message from current leader recently. """ if self.server.last_leader_contact is None: return False elapsed = self.server.current_time() - self.server.last_leader_contact return elapsed < self.LEADER_STICKINESS_MS def _is_log_up_to_date(self, request: PreVoteRequest) -> bool: my_last_index, my_last_term = self.server.get_last_log_info() if request.last_log_term != my_last_term: return request.last_log_term > my_last_term return request.last_log_index >= my_last_index # Election with Pre-Voteclass PreVoteElection: """ Election procedure with pre-vote phase. """ async def run_election_with_pre_vote(self) -> bool: """ Two-phase election: 1. Pre-vote: Check if election would succeed 2. Real vote: If pre-vote succeeds, run actual election """ # Phase 1: Pre-vote pre_vote_success = await self._run_pre_vote() if not pre_vote_success: # Pre-vote failed - don't start real election # This prevents term inflation for partitioned nodes print("Pre-vote failed - not starting election") return False # Phase 2: Real election print("Pre-vote succeeded - starting real election") return await self._run_real_election()Pre-vote is implemented in most production Raft systems including etcd (enabled by default since v3.4). It adds a small latency cost (one extra round-trip before elections) but significantly improves cluster stability, especially in networks with occasional partitions.
Raft's leader election mechanism is deceptively simple on the surface but provides powerful guarantees:
| Component | Purpose | Mechanism |
|---|---|---|
| Term numbers | Logical clock for ordering | Monotonically increasing; higher term always wins |
| Random timeouts | Break symmetry | 150-300ms range prevents synchronized elections |
| RequestVote RPC | Collect votes | Includes log info for election restriction |
| Election restriction | Ensure safety | Candidate must have up-to-date log |
| Pre-vote (optional) | Prevent disruption | Check viability before real election |
What's Next:
With leader election understood, we can examine what leaders actually do. The next page covers Log Replication—how the leader accepts client commands, appends them to its log, replicates them to followers, and determines when entries are safely committed.
You now understand Raft's leader election mechanism in depth: the three server states and their transitions, when elections trigger, the RequestVote protocol, the critical election restriction, how split votes are handled, and the pre-vote optimization. You can trace through exactly what happens when a leader fails and a new leader is elected.