In any leader-based distributed system, leader election is the foundation upon which everything else rests. Without a functioning leader, Raft cannot accept client requests, replicate log entries, or maintain consistency. The leader election protocol must be fast, reliable, and—above all—safe.
"Safe" in this context has a precise meaning: at most one leader may exist per term. If two leaders existed simultaneously, they could accept conflicting client requests, leading to divergent state across the cluster—the worst possible outcome for a consensus algorithm.
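The safety claim follows from simple arithmetic: each server casts at most one vote per term, so two candidates in the same term would need two disjoint majorities, which cannot both fit in n servers. A small back-of-the-envelope sketch (the function name `max_concurrent_winners` is ours for illustration, not from any Raft implementation):

```python
# Hypothetical sketch: why "one vote per server per term" forbids two leaders.
def max_concurrent_winners(n_servers: int) -> int:
    """With each server casting at most one vote per term, count how many
    candidates could simultaneously hold a majority of n_servers votes."""
    majority = n_servers // 2 + 1
    # Only n_servers votes exist in total; each winner consumes `majority`
    # of them, and no vote can be counted twice.
    return n_servers // majority

for n in (3, 5, 7):
    assert max_concurrent_winners(n) == 1  # two majorities would need > n votes
```

This is the same quorum-intersection argument that reappears later when we discuss the election restriction.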
This page dissects Raft's leader election mechanism in complete detail. We'll examine exactly when elections start, how candidates request votes, why the voting rules guarantee safety, and what happens in edge cases like network partitions and simultaneous elections.
By the end of this page, you will understand:
• The three server states (Follower, Candidate, Leader) and all transitions between them
• Election triggers — when and why elections begin
• The RequestVote RPC — the complete protocol for requesting and granting votes
• The election restriction — how Raft prevents unsafe leaders from being elected
• Split vote handling — how randomized timeouts ensure progress
• Pre-vote optimization — a common extension that prevents disruption
Every Raft server is in exactly one of three states at any time:
1. Follower — The default, passive state. Followers respond to requests from leaders and candidates but do not initiate communication. They simply respond to AppendEntries and RequestVote RPCs, reset their election timer on each valid leader contact or granted vote, and apply newly committed entries to their state machine.
2. Candidate — A transitional state during elections. A server becomes a candidate when its election timeout expires without hearing from a current leader; it then increments its term, votes for itself, and requests votes from its peers.
3. Leader — The active, authoritative state. Exactly one leader exists per term. The leader sends periodic heartbeats to suppress new elections, accepts client requests, and replicates log entries to followers.
"""Raft State Machine - Complete State Transitions ┌──────────────────────────────────────────────────────────┐ │ │ │ ┌─────────┐ timeout ┌───────────┐ wins vote ┌────────┐ │ │ │────────────►│ │──────────────►│ │ │ │ Follower│ │ Candidate │ │ Leader │ │ │ │◄────────────│ │◄──────────────│ │ │ └─────────┘ discovers └───────────┘ discovers └────────┘ │ ▲ higher term │ higher term │ │ │ │ │ │ │ │ split vote │ │ │ │ (times out) │ │ │ ▼ │ │ │ [restart election] │ │ │ │ │ └───────────────────────────────────────────────────┘ │ discovers higher term │ └── All servers start as Followers """ from enum import Enumfrom dataclasses import dataclassfrom typing import Optional, Callableimport asyncio class ServerState(Enum): FOLLOWER = "follower" CANDIDATE = "candidate" LEADER = "leader" @dataclassclass StateTransition: from_state: ServerState to_state: ServerState reason: str class RaftStateMachine: """ Complete state machine for Raft server states. Handles all legal transitions and invariants. """ def __init__(self, on_become_leader: Callable, on_step_down: Callable): self.state = ServerState.FOLLOWER self.on_become_leader = on_become_leader self.on_step_down = on_step_down self.transition_log: list[StateTransition] = [] def _transition(self, new_state: ServerState, reason: str) -> bool: """ Record and execute a state transition. Returns True if transition occurred. """ if new_state == self.state: return False # No-op old_state = self.state self.transition_log.append(StateTransition(old_state, new_state, reason)) self.state = new_state # Trigger callbacks if new_state == ServerState.LEADER: self.on_become_leader() elif old_state == ServerState.LEADER: self.on_step_down() return True # === TRANSITION: Follower → Candidate === def start_election(self) -> bool: """ Called when election timeout expires. Only valid from Follower state. 
""" if self.state != ServerState.FOLLOWER: return False return self._transition( ServerState.CANDIDATE, "election timeout expired" ) # === TRANSITION: Candidate → Leader === def win_election(self) -> bool: """ Called when candidate receives majority votes. Only valid from Candidate state. """ if self.state != ServerState.CANDIDATE: return False return self._transition( ServerState.LEADER, "received majority votes" ) # === TRANSITION: Candidate → Candidate (restart) === def restart_election(self) -> bool: """ Called when candidate's election timeout expires without winning or losing. Only valid from Candidate state. """ if self.state != ServerState.CANDIDATE: return False # Technically stays Candidate, but increments term # and starts new vote collection return True # Term increment handled elsewhere # === TRANSITION: Any → Follower (step down) === def step_down(self, reason: str) -> bool: """ Called when discovering a higher term from any state. Always valid (though no-op if already Follower). """ if self.state == ServerState.FOLLOWER: return False return self._transition(ServerState.FOLLOWER, reason) # === QUERY METHODS === def is_leader(self) -> bool: return self.state == ServerState.LEADER def is_candidate(self) -> bool: return self.state == ServerState.CANDIDATE def is_follower(self) -> bool: return self.state == ServerState.FOLLOWER| From | To | Trigger | Actions |
|---|---|---|---|
| Follower | Candidate | Election timeout | Increment term, vote for self, request votes |
| Candidate | Leader | Receives majority votes | Start sending heartbeats, initialize leader state |
| Candidate | Follower | Discovers higher term OR receives valid AppendEntries | Update term, clear voted_for |
| Candidate | Candidate | Election timeout (split vote) | Increment term, restart election |
| Leader | Follower | Discovers higher term | Stop heartbeats, clear leader state |
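The transition table above can itself be encoded as data and used as a guard in tests. A minimal illustrative sketch (the `LEGAL` map and `is_legal` helper are hypothetical, not part of any Raft library):

```python
# Hypothetical sketch: the transition table encoded directly as data.
LEGAL = {
    ("follower", "candidate"): "election timeout",
    ("candidate", "leader"): "received majority votes",
    ("candidate", "follower"): "higher term or valid AppendEntries",
    ("candidate", "candidate"): "election timeout (split vote)",
    ("leader", "follower"): "discovered higher term",
}

def is_legal(frm: str, to: str) -> bool:
    """Check whether a state transition appears in the table."""
    return (frm, to) in LEGAL

assert is_legal("follower", "candidate")
assert is_legal("leader", "follower")
assert not is_legal("follower", "leader")   # never jumps straight to leader
assert not is_legal("leader", "candidate")  # a deposed leader steps down, never campaigns
```

Encoding legality as data keeps the invariants in one place, so a transition function can reject anything outside the table rather than relying on scattered `if` checks.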
Elections are triggered by a single condition: a follower's election timeout expires without receiving communication from a valid leader.
This simple rule encapsulates several scenarios:
Leader failure — The leader crashes or becomes unreachable. Followers stop receiving heartbeats and eventually time out.
Network partition — A follower is partitioned from the leader. From the follower's perspective, this looks identical to leader failure.
Cluster startup — When a Raft cluster starts, no leader exists. All servers are followers with running election timers. The first server to time out starts an election.
Slow leader — If the leader is so overloaded that heartbeats don't arrive in time, followers may start spurious elections.
```python
import asyncio
import random
from typing import Optional


class ElectionController:
    """
    Controls when elections are triggered based on heartbeat receipt.

    The election timer is RESET when:
    1. Receiving valid AppendEntries from the current leader
    2. Granting a vote to a candidate

    The election timer FIRES when:
    - No reset occurs within the timeout period
    """

    ELECTION_TIMEOUT_MIN_MS = 150
    ELECTION_TIMEOUT_MAX_MS = 300

    def __init__(self, on_election_timeout):
        self.on_election_timeout = on_election_timeout
        self._timer_task: Optional[asyncio.Task] = None
        self._current_timeout_ms: int = 0

    def _random_timeout_ms(self) -> int:
        """
        Generate a random election timeout.

        Why random?
        - Prevents all servers from timing out simultaneously
        - Usually ONE server times out first and wins cleanly
        - Even in a split vote, random timeouts break the tie
        """
        return random.randint(
            self.ELECTION_TIMEOUT_MIN_MS,
            self.ELECTION_TIMEOUT_MAX_MS
        )

    async def _timer_loop(self):
        """
        The actual timer coroutine.
        Sleeps for the timeout, then triggers an election.
        """
        try:
            timeout_sec = self._current_timeout_ms / 1000.0
            await asyncio.sleep(timeout_sec)
            # Timeout expired! Time to become a candidate.
            print(f"Election timeout ({self._current_timeout_ms}ms) - starting election")
            await self.on_election_timeout()
        except asyncio.CancelledError:
            # Timer was reset before it fired - this is normal
            pass

    def reset(self):
        """
        Reset the election timer. Called when receiving valid communication
        from the leader or when granting a vote to a candidate.

        Key insight: the timer resets with a NEW random timeout.
        This prevents synchronized elections after recovery.
        """
        # Cancel existing timer
        if self._timer_task and not self._timer_task.done():
            self._timer_task.cancel()
        # Start new timer with a fresh random timeout
        self._current_timeout_ms = self._random_timeout_ms()
        self._timer_task = asyncio.create_task(self._timer_loop())

    def stop(self):
        """
        Stop the election timer entirely. Called when becoming leader
        (leaders don't have election timeouts).
        """
        if self._timer_task and not self._timer_task.done():
            self._timer_task.cancel()
        self._timer_task = None


# Example: Simulating election trigger scenarios
class Scenario:
    """Demonstration of when elections are triggered."""

    @staticmethod
    async def scenario_leader_failure():
        """
        Leader crashes at t=0. A follower with a 200ms timeout
        starts an election at t=200ms.
        """
        print("Time 0: Leader crashes")
        print("Time 0-200ms: Follower waiting...")
        await asyncio.sleep(0.2)
        print("Time 200ms: Follower starts election")

    @staticmethod
    async def scenario_heartbeat_keeps_alive():
        """
        Leader sends a heartbeat every 50ms; the follower resets its 200ms
        timer on each one. Result: the follower never times out while the
        leader is healthy.
        """
        print("Leader heartbeat schedule: t=50, 100, 150, 200, ...")
        print("Follower timeout 200ms, but resets at t=50")
        print("Next timeout would be at t=250, but resets at t=100")
        print("Follower never starts election while heartbeats arrive")
```

If the leader is slow but not dead, followers may start elections unnecessarily. This creates an "election storm" where the cluster constantly holds elections instead of doing useful work. Production systems tune timeouts carefully and may implement "pre-vote" (discussed later) to prevent spurious elections.
When a server becomes a candidate, it initiates an election by sending RequestVote RPCs to all other servers. This RPC is the heart of leader election and carries specific information that voters use to decide whether to grant their vote.
```python
from dataclasses import dataclass


@dataclass
class RequestVoteRequest:
    """
    Arguments sent by candidates to request votes.
    Each field serves a specific purpose:
    """

    # === Candidate's term ===
    # Used for:
    # 1. Voters reject if their term is higher (stale candidate)
    # 2. Voters update term if candidate's is higher
    # 3. Identifies which election this vote is for
    term: int

    # === Candidate's ID ===
    # Used for:
    # 1. Recording who we voted for
    # 2. Knowing who to recognize as leader if they win
    candidate_id: int

    # === Index of candidate's last log entry ===
    # Used for the ELECTION RESTRICTION (critical for safety):
    # Voters only vote for candidates with "at least as up-to-date" logs
    last_log_index: int

    # === Term of candidate's last log entry ===
    # Combined with last_log_index to determine "up-to-date-ness"
    # A log is more up-to-date if:
    # 1. It has a higher last term, OR
    # 2. Same last term but longer log
    last_log_term: int


@dataclass
class RequestVoteResponse:
    """Response from voters."""

    # The voter's current term.
    # If higher than the candidate's, the candidate must step down.
    term: int

    # Whether the vote was granted
    vote_granted: bool


class VoteHandler:
    """Handles incoming RequestVote RPCs as a voter."""

    def __init__(self, server):
        self.server = server

    def handle_request_vote(self, request: RequestVoteRequest) -> RequestVoteResponse:
        """
        Process a vote request. The complete decision logic:
        """
        # === CHECK 1: Term comparison ===
        if request.term < self.server.current_term:
            # Candidate is from a past term - reject and inform them
            return RequestVoteResponse(
                term=self.server.current_term,
                vote_granted=False
            )

        # If the candidate's term is higher, update our term and become follower
        if request.term > self.server.current_term:
            self.server.become_follower(request.term)

        # === CHECK 2: Have we already voted this term? ===
        if (self.server.voted_for is not None
                and self.server.voted_for != request.candidate_id):
            # Already voted for someone else this term
            return RequestVoteResponse(
                term=self.server.current_term,
                vote_granted=False
            )

        # === CHECK 3: Is the candidate's log at least as up-to-date as ours? ===
        # This is the ELECTION RESTRICTION - critical for safety
        if not self._is_candidate_log_up_to_date(request):
            return RequestVoteResponse(
                term=self.server.current_term,
                vote_granted=False
            )

        # === All checks passed - grant vote ===
        self.server.voted_for = request.candidate_id
        self.server.persist()  # Must persist before responding!

        # Reset election timer - we found a viable candidate
        self.server.reset_election_timer()

        return RequestVoteResponse(
            term=self.server.current_term,
            vote_granted=True
        )

    def _is_candidate_log_up_to_date(self, request: RequestVoteRequest) -> bool:
        """
        Determine if the candidate's log is at least as up-to-date as ours.

        "Up-to-date" comparison:
        1. If last log terms differ: higher term wins
        2. If last log terms equal: longer log wins

        This ensures the elected leader has all committed entries.
        """
        my_last_index, my_last_term = self.server.get_last_log_info()

        # Rule 1: Compare terms first
        if request.last_log_term != my_last_term:
            return request.last_log_term > my_last_term

        # Rule 2: Same term - compare lengths
        return request.last_log_index >= my_last_index
```

The Three Conditions for Granting a Vote:
A server grants its vote if and only if ALL of these conditions hold:

1. The candidate's term is at least as large as the voter's current term.
2. The voter has not already voted for a different candidate in this term.
3. The candidate's log is at least as up-to-date as the voter's log.
The first two are straightforward. The third—the election restriction—is subtle but critical for safety. We'll examine it in detail next.
The election restriction is one of Raft's most important safety mechanisms. It ensures that any elected leader contains all entries that have been committed in previous terms.
Why is this necessary?
Consider what would happen without this restriction:

1. A leader replicates entry X to a majority of servers and marks it committed, then crashes.
2. Server B, whose log is missing entry X, times out and starts an election.
3. Server B wins the election, becomes leader, and overwrites entry X on the other servers — destroying a committed entry.
The election restriction prevents step 3. Since entry X was replicated to a majority before being committed, and elections require a majority vote, at least one voter must have entry X. That voter will refuse to vote for Server B because B's log is "less up-to-date."
"""The Election Restriction Explained with Concrete Example Cluster: Servers A, B, C (majority = 2) === SCENARIO WHERE ELECTION RESTRICTION SAVES US === Initial state: A (Leader, term=1): log = [1:x, 1:y, 1:z] (committed up to z) B (Follower): log = [1:x, 1:y, 1:z] (has all entries) C (Follower): log = [1:x, 1:y] (missed z) Entry z was committed (replicated to A and B, majority achieved). Now A crashes. B and C start election (term 2). === IF C TRIES TO BECOME LEADER === C sends RequestVote to B: term = 2 last_log_index = 2 (entries x, y) last_log_term = 1 B checks C's log against its own: B's last_log_index = 3 B's last_log_term = 1 Comparison: Same last term, but B has longer log → B's log is more up-to-date Result: B REJECTS C's vote request. C cannot get majority (only has its own vote). === IF B TRIES TO BECOME LEADER === B sends RequestVote to C: term = 2 last_log_index = 3 last_log_term = 1 C checks B's log against its own: C's last_log_index = 2 C's last_log_term = 1 Comparison: Same last term, but B has longer log → B's log is more up-to-date Result: C GRANTS vote to B. B gets majority (B + C) and becomes leader.Entry z is preserved!""" def compare_logs( candidate_last_index: int, candidate_last_term: int, voter_last_index: int, voter_last_term: int) -> str: """ Returns which log is 'more up-to-date'. 
The comparison works like version numbers: - Compare major version (term) first - Compare minor version (index) if major versions match """ if candidate_last_term > voter_last_term: return "CANDIDATE is more up-to-date" elif candidate_last_term < voter_last_term: return "VOTER is more up-to-date" else: # Same term if candidate_last_index >= voter_last_index: return "CANDIDATE is at least as up-to-date (vote granted)" else: return "VOTER is more up-to-date (vote denied)" # More complex scenario: Term matters more than length"""Scenario where term matters more than length: A: log = [1:a, 1:b, 1:c, 1:d] (4 entries, all term 1) B: log = [1:a, 1:b, 2:x] (3 entries, last is term 2) Who has the more up-to-date log? B does! Even though B's log is shorter, B has an entry from term 2.Entries from higher terms are "newer" than entries from lower terms,regardless of log length. Why? Because an entry from term 2 means B was replicated up to at leastthe leader of term 2. Whatever A has in term 1 must be older.""" print(compare_logs( candidate_last_index=4, candidate_last_term=1, # A as candidate voter_last_index=3, voter_last_term=2 # B as voter))# Output: "VOTER is more up-to-date (vote denied)" print(compare_logs( candidate_last_index=3, candidate_last_term=2, # B as candidate voter_last_index=4, voter_last_term=1 # A as voter))# Output: "CANDIDATE is more up-to-date"The election restriction works because of quorum intersection. Committing an entry requires a majority. Winning an election requires a majority. Any two majorities must share at least one server. Therefore, any election winner must have contacted at least one server that has every committed entry. The election restriction ensures the winner actually has those entries (or won't win).
Let's walk through exactly what happens when a server runs an election, from the moment it times out to winning (or losing).
```python
import asyncio
from dataclasses import dataclass
from typing import Set


@dataclass
class VoteResult:
    server_id: int
    vote_granted: bool
    term: int


class ElectionCoordinator:
    """Coordinates the election process for a candidate."""

    def __init__(self, server):
        self.server = server

    async def run_election(self) -> bool:
        """
        Execute an election. Returns True if we became leader.

        Step 1: Transition to candidate
        Step 2: Increment term
        Step 3: Vote for self
        Step 4: Send RequestVote RPCs to all peers
        Step 5: Wait for responses
        Step 6: Process results
        """
        # ========================================
        # STEP 1: Become candidate
        # ========================================
        self.server.state = ServerState.CANDIDATE
        print(f"Server {self.server.id}: Became candidate")

        # ========================================
        # STEP 2: Increment term
        # ========================================
        self.server.current_term += 1
        election_term = self.server.current_term
        print(f"Server {self.server.id}: Starting election for term {election_term}")

        # ========================================
        # STEP 3: Vote for self
        # ========================================
        self.server.voted_for = self.server.id
        self.server.persist()  # Must persist before sending RPCs!

        votes_received: Set[int] = {self.server.id}  # Self vote
        votes_needed = (len(self.server.peers) + 1) // 2 + 1  # Majority
        print(f"Server {self.server.id}: Voted for self. "
              f"Need {votes_needed} votes total.")

        # ========================================
        # STEP 4: Build RequestVote message
        # ========================================
        last_log_index, last_log_term = self.server.get_last_log_info()
        request = RequestVoteRequest(
            term=election_term,
            candidate_id=self.server.id,
            last_log_index=last_log_index,
            last_log_term=last_log_term
        )

        # ========================================
        # STEP 5: Send RPCs in parallel, collect results
        # ========================================
        tasks = []
        for peer_id in self.server.peers:
            task = asyncio.create_task(
                self._request_vote_with_retry(peer_id, request)
            )
            tasks.append(task)

        # Process responses as they arrive
        for completed in asyncio.as_completed(tasks):
            try:
                result = await completed

                # Check if we're still in the same election
                if self.server.current_term != election_term:
                    print(f"Server {self.server.id}: Term changed during election, aborting")
                    return False
                if self.server.state != ServerState.CANDIDATE:
                    print(f"Server {self.server.id}: No longer candidate, aborting")
                    return False

                # Process the vote
                outcome = self._process_vote_response(result, election_term)
                if outcome == "STEP_DOWN":
                    return False
                if outcome == "VOTE_GRANTED":
                    votes_received.add(result.server_id)
                    print(f"Server {self.server.id}: Got vote from {result.server_id}. "
                          f"Total: {len(votes_received)}/{votes_needed}")

                # ========================================
                # STEP 6: Check if we won
                # ========================================
                if len(votes_received) >= votes_needed:
                    print(f"Server {self.server.id}: Won election for term {election_term}!")
                    self.server.become_leader()
                    return True

            except asyncio.TimeoutError:
                continue  # This peer didn't respond in time

        # Didn't get enough votes
        print(f"Server {self.server.id}: Election failed. "
              f"Got {len(votes_received)}/{votes_needed}")
        return False

    def _process_vote_response(self, result: VoteResult, election_term: int) -> str:
        """
        Process a single vote response.
        Returns: "VOTE_GRANTED", "VOTE_DENIED", or "STEP_DOWN"
        """
        # If the response contains a higher term, step down immediately
        if result.term > election_term:
            print(f"Server {self.server.id}: Discovered higher term {result.term}, stepping down")
            self.server.become_follower(result.term)
            return "STEP_DOWN"
        if result.vote_granted:
            return "VOTE_GRANTED"
        return "VOTE_DENIED"

    async def _request_vote_with_retry(
        self,
        peer_id: int,
        request: RequestVoteRequest,
        timeout_ms: int = 100
    ) -> VoteResult:
        """
        Send a RequestVote RPC with a timeout.
        In production, might retry on transient failures.
        """
        # Simulated RPC - in real code, this is network I/O.
        # wait_for raises asyncio.TimeoutError if the peer is too slow,
        # which the election loop above treats as a missing vote.
        response = await asyncio.wait_for(
            self.server.rpc_client.request_vote(peer_id, request),
            timeout=timeout_ms / 1000.0
        )
        return VoteResult(
            server_id=peer_id,
            vote_granted=response.vote_granted,
            term=response.term
        )
```

A split vote occurs when no candidate receives a majority in an election. This can happen when:

• Two or more followers time out almost simultaneously, each becomes a candidate, and the votes divide among them
• So many servers are down or unreachable that no candidate can assemble a majority
When a split vote occurs:

1. No candidate collects a majority, so no leader emerges for the term.
2. Each candidate's election timer eventually expires again.
3. Each candidate increments its term and restarts the election with a fresh random timeout.
"""Split Vote Scenario Analysis Cluster: 5 servers (A, B, C, D, E), majority = 3 === TIMELINE OF A SPLIT VOTE === t=0: Leader crashest=150: A's timeout expires (150ms), A becomes candidate term 2t=155: B's timeout expires (155ms), B becomes candidate term 2t=160: C's timeout expires (160ms), but receives A's RequestVote first Vote distribution: A votes for: A (self) B votes for: B (self) C votes for: A (first valid request received) D votes for: B (if B's request arrived first) E votes for: A (if A's request arrived first) Results vary by network timing, but let's say: A gets: A, C, E = 3 votes → WINS! B gets: B, D = 2 votes → loses But what if timing was different: A gets: A, C = 2 votes B gets: B, D, E = 3 votes → B WINS! And adversarial timing: A gets: A, C = 2 votes B gets: B, D = 2 votes E's response delayed... → SPLIT VOTE === HOW RANDOMIZATION HELPS === After split vote, candidates restart election:- A chooses new timeout: 250ms- B chooses new timeout: 180ms t=0: Split vote endedt=180: B's timeout expires, B starts term 3 electiont=180+: B collects votes before A even startst=180+: B wins with majorityt=250: A's timeout expires, but B is already leadert=250+: A discovers B's higher term, steps down Probability of repeated splits decreases exponentiallywith each retry due to randomization.""" import randomimport statistics def simulate_elections(num_servers: int, timeout_range: tuple, trials: int = 10000): """ Simulate how often clean elections occur vs split votes. 
""" clean_wins = 0 split_votes = 0 min_timeout, max_timeout = timeout_range threshold = 5 # ms within which we consider "simultaneous" for _ in range(trials): # Each server picks a random timeout timeouts = [random.randint(min_timeout, max_timeout) for _ in range(num_servers)] # Find the minimum timeout (first to become candidate) min_time = min(timeouts) # Count how many are within threshold of minimum (potential split) simultaneous = sum(1 for t in timeouts if t - min_time <= threshold) if simultaneous == 1: clean_wins += 1 else: split_votes += 1 return { "clean_election_rate": clean_wins / trials, "split_vote_rate": split_votes / trials } # With standard 150-300ms range, 5 servers:result = simulate_elections(5, (150, 300))print(f"Clean election probability: {result['clean_election_rate']:.1%}")print(f"Split vote probability: {result['split_vote_rate']:.1%}") # Typical output: Clean ~93%, Split ~7%# Even with a split, the next round usually resolves cleanlyWhile split votes can occur, they are probabilistically rare and don't compound. Each election attempt is independent, so the probability of N consecutive split votes is exponentially small. In practice, elections complete within a few hundred milliseconds, even with occasional splits.
Standard Raft has a potential problem: a partitioned server can disrupt the cluster when it reconnects.
The Scenario:

1. A server is partitioned away from the rest of the cluster.
2. It repeatedly times out, starts elections, and increments its term — but never wins, because it cannot reach a majority.
3. Its term climbs far above the cluster's current term.
4. When the partition heals, its higher term forces the current leader to step down, triggering an unnecessary election.
This is technically "safe" (the cluster will recover), but it causes unnecessary disruption. The pre-vote extension prevents this.
"""Pre-Vote: Check Viability Before Real Election The idea: Before incrementing term and requesting real votes,send "pre-vote" requests that don't affect state. Pre-vote asks: "Would you vote for me if I started an election?" Servers respond based on:1. Their current leader (if recently heard from leader, answer NO)2. The candidate's log (same election restriction) Only if pre-vote succeeds does the candidate actually start the election. This prevents partitioned servers from disrupting the cluster:- They send pre-votes- Connected servers say "no, we have a leader"- Partitioned server never increments term- When reconnected, term is still low → no disruption""" from dataclasses import dataclassfrom typing import Optional @dataclassclass PreVoteRequest: """ Pre-vote request - does NOT change any state. Note: Uses proposed term (current + 1), not current term. """ term: int # Would-be election term (current_term + 1) candidate_id: int last_log_index: int last_log_term: int @dataclass class PreVoteResponse: term: int vote_granted: bool class PreVoteHandler: """ Handle pre-vote requests. Key difference from regular vote: We don't update state! """ # How recently we must have heard from leader to reject pre-vote LEADER_STICKINESS_MS = 500 # Typically 2-3 heartbeat intervals def __init__(self, server): self.server = server def handle_pre_vote(self, request: PreVoteRequest) -> PreVoteResponse: """ Respond to pre-vote request. Critical differences from regular vote: 1. Don't update our term (even if request term is higher) 2. Don't record vote (can pre-vote for multiple candidates) 3. Check if we've recently heard from leader """ # Check 1: Is our term higher? # (Use current_term + 1 because candidate hasn't incremented yet) if request.term < self.server.current_term: return PreVoteResponse( term=self.server.current_term, vote_granted=False ) # Check 2: Have we heard from a leader recently? 
# If yes, the "election" is probably due to network issues if self._recently_heard_from_leader(): return PreVoteResponse( term=self.server.current_term, vote_granted=False ) # Check 3: Is candidate's log up-to-date? (Same as regular vote) if not self._is_log_up_to_date(request): return PreVoteResponse( term=self.server.current_term, vote_granted=False ) # Pre-vote granted - but NO STATE CHANGES return PreVoteResponse( term=self.server.current_term, vote_granted=True ) def _recently_heard_from_leader(self) -> bool: """ Check if we've received a valid message from current leader recently. """ if self.server.last_leader_contact is None: return False elapsed = self.server.current_time() - self.server.last_leader_contact return elapsed < self.LEADER_STICKINESS_MS def _is_log_up_to_date(self, request: PreVoteRequest) -> bool: my_last_index, my_last_term = self.server.get_last_log_info() if request.last_log_term != my_last_term: return request.last_log_term > my_last_term return request.last_log_index >= my_last_index # Election with Pre-Voteclass PreVoteElection: """ Election procedure with pre-vote phase. """ async def run_election_with_pre_vote(self) -> bool: """ Two-phase election: 1. Pre-vote: Check if election would succeed 2. Real vote: If pre-vote succeeds, run actual election """ # Phase 1: Pre-vote pre_vote_success = await self._run_pre_vote() if not pre_vote_success: # Pre-vote failed - don't start real election # This prevents term inflation for partitioned nodes print("Pre-vote failed - not starting election") return False # Phase 2: Real election print("Pre-vote succeeded - starting real election") return await self._run_real_election()Pre-vote is implemented in most production Raft systems including etcd (enabled by default since v3.4). It adds a small latency cost (one extra round-trip before elections) but significantly improves cluster stability, especially in networks with occasional partitions.
Raft's leader election mechanism is deceptively simple on the surface but provides powerful guarantees:
| Component | Purpose | Mechanism |
|---|---|---|
| Term numbers | Logical clock for ordering | Monotonically increasing; higher term always wins |
| Random timeouts | Break symmetry | 150-300ms range prevents synchronized elections |
| RequestVote RPC | Collect votes | Includes log info for election restriction |
| Election restriction | Ensure safety | Candidate must have up-to-date log |
| Pre-vote (optional) | Prevent disruption | Check viability before real election |
What's Next:
With leader election understood, we can examine what leaders actually do. The next page covers Log Replication—how the leader accepts client commands, appends them to its log, replicates them to followers, and determines when entries are safely committed.
You now understand Raft's leader election mechanism in depth: the three server states and their transitions, when elections trigger, the RequestVote protocol, the critical election restriction, how split votes are handled, and the pre-vote optimization. You can trace through exactly what happens when a leader fails and a new leader is elected.