Basic Paxos, as elegant as it is, solves a limited problem: agreeing on a single value. Real distributed systems, however, need to agree on sequences of values — commands to execute, transactions to commit, configuration changes to apply. A replicated database, for instance, needs a replicated log of operations that all replicas apply in the same order.
The naive approach would be to run a separate instance of Basic Paxos for each log entry. This works but is extremely inefficient — each entry requires two round trips (Prepare/Promise, Accept/Accepted). For a system processing thousands of commands per second, this overhead is prohibitive.
Multi-Paxos is the set of optimizations that transforms Basic Paxos into a practical protocol for log replication. It's what Google's Chubby and Spanner run under the hood, and the same ideas underpin the consensus layers of etcd (Raft) and Apache ZooKeeper (ZAB).
By the end of this page, you will understand how to extend Basic Paxos for log replication, the stable leader optimization that amortizes Phase 1, how log slots and indices work, handling gaps in the log, and why Multi-Paxos is the foundation for replicated state machines.
Let's understand why naive multi-instance Paxos is impractical.
The Naive Approach:
To agree on a sequence of values V₀, V₁, V₂, ..., run a separate Paxos instance for each index:
Each instance is completely independent — its own proposal numbers, its own Phase 1 and Phase 2.
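As a rough sketch of what one-instance-per-index looks like in code (assuming a `BasicPaxosInstance` class exposing the two-phase `propose` call from the Basic Paxos discussion; the name is illustrative, not from the original):

```python
from typing import Any, Dict


class NaiveMultiPaxos:
    """Naive log replication: one fully independent Basic Paxos run per index."""

    def __init__(self, acceptors: list):
        self.acceptors = acceptors
        # One independent Basic Paxos instance per log index,
        # each with its own proposal numbers and its own two phases.
        self.instances: Dict[int, "BasicPaxosInstance"] = {}  # BasicPaxosInstance is assumed

    async def agree_on(self, index: int, value: Any) -> Any:
        # Every index pays the full price: Prepare/Promise, then Accept/Accepted.
        instance = self.instances.setdefault(
            index, BasicPaxosInstance(self.acceptors)
        )
        return await instance.propose(value)  # 2 RTT per value, every time
```

The table below quantifies why this per-index cost does not scale.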
| Metric | Basic Paxos (Single Value) | Naive Multi-Instance (N Values) |
|---|---|---|
| Message rounds | 2 (Prepare + Accept) | 2N rounds |
| Messages per value | 4F+4 (to and from a quorum of F+1, for F fault tolerance) | (4F+4) × N messages |
| Latency per value | 2 RTT | 2 RTT per value (if sequential) or requires complex parallelization |
| Stable storage writes | 2 per acceptor | 2N per acceptor |
Consider a system processing 10,000 commands per second. With naive multi-instance Paxos, that's 10,000 Phase 1 Prepare rounds plus 10,000 Phase 2 Accept rounds every second (20,000 consensus rounds in total), plus the associated message overhead and disk writes. This is simply not practical.
The Key Insight:
If the same proposer is making proposals for consecutive log slots, do we really need to run Phase 1 for each slot? Phase 1 establishes the proposer's right to propose — if that right extends across all log slots (not just one), we can skip Phase 1 for most operations.
This is the core optimization of Multi-Paxos:
A proposer that completes Phase 1 for all slots (or a range of slots) can skip Phase 1 for subsequent proposals, as long as no higher-numbered proposal preempts it.
The stable leader optimization is the centerpiece of Multi-Paxos. Instead of having any proposer compete for each log slot, we establish a distinguished leader that drives consensus.
How It Works:

1. A candidate leader picks a proposal number and sends a single Prepare that covers all log slots (or all slots above some index).
2. Acceptors that promise return every value they have already accepted, per slot.
3. Once a majority has promised, the leader holds the right to propose in every slot under that proposal number.
4. Each new client command is assigned the next free slot and goes straight to Phase 2 (Accept), skipping Phase 1 entirely.
5. The leader keeps skipping Phase 1 until an acceptor reports a higher promised number, at which point a new Phase 1 (and possibly a new leader) is required.
The Performance Gain:
With the stable leader optimization, the latency for each log entry is reduced from 2 RTT to 1 RTT (just Phase 2). The number of messages per entry is halved. Phase 1 overhead is amortized across potentially millions of log entries.
| Metric | Naive Multi-Instance | Multi-Paxos with Stable Leader |
|---|---|---|
| Phase 1 rounds | 1 per value | 1 per leadership term (amortized to ~0) |
| Messages per value | (4F+4) | (2F+2) — only Accept/Accepted |
| Latency per value | 2 RTT | 1 RTT (in steady state) |
| Leadership change cost | N/A | 1 RTT for new Phase 1 |
The efficiency of Multi-Paxos depends on leader stability. Every leadership change requires a new Phase 1 (Prepare for all slots). Systems often use leader leases, heartbeats, and careful timeout tuning to maximize leader tenure.
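As an illustration of that tuning (not part of the core protocol), here is a minimal lease sketch; the two-second duration and the field names are assumptions for the example:

```python
import time


class LeaderLease:
    """
    Minimal leader-lease bookkeeping (illustrative sketch only).

    The leader renews the lease each time a heartbeat round is acknowledged
    by a majority; it should stop serving once the lease expires, and
    followers should only start a new Phase 1 after a comparable timeout.
    """

    def __init__(self, lease_duration: float = 2.0):
        self.lease_duration = lease_duration    # seconds; tuning is deployment-specific
        self.last_majority_ack: float = 0.0     # time of the last majority-acked heartbeat

    def renew(self) -> None:
        """Call when a heartbeat is acknowledged by a majority of acceptors."""
        self.last_majority_ack = time.monotonic()

    def is_valid(self) -> bool:
        """The leader keeps skipping Phase 1 (and serving reads) only while this holds."""
        return time.monotonic() - self.last_majority_ack < self.lease_duration
```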
The Replicated Log Model:
Multi-Paxos maintains a replicated log where each entry is identified by an integer index (or slot number). The log grows monotonically — entries can be added but not removed or reordered.
```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Dict, List, Optional


class LogStatus(Enum):
    EMPTY = auto()
    PROPOSED = auto()
    ACCEPTED = auto()
    CHOSEN = auto()


@dataclass
class LogEntry:
    slot: int
    value: Any
    status: LogStatus  # EMPTY, PROPOSED, ACCEPTED, CHOSEN
    proposal_number: Optional[int] = None


class ReplicatedLog:
    """
    The replicated log structure that Multi-Paxos manages.

    Each slot may be in one of several states:
    - Empty: No value proposed yet
    - Proposed: A value has been proposed but not yet chosen
    - Chosen: Value has been accepted by a majority (finalized)
    """

    def __init__(self):
        # Map from slot index to LogEntry
        self.entries: Dict[int, LogEntry] = {}

        # The highest slot index we've seen
        self.highest_slot: int = -1

        # The highest slot index that is chosen and contiguously
        # connected to slot 0 (the "commit index")
        self.commit_index: int = -1

    def get_next_empty_slot(self) -> int:
        """Return the next slot index to use for a new command."""
        return self.highest_slot + 1

    def set_chosen(self, slot: int, value: Any):
        """Mark a slot as chosen with the given value."""
        self.entries[slot] = LogEntry(
            slot=slot,
            value=value,
            status=LogStatus.CHOSEN
        )
        self.highest_slot = max(self.highest_slot, slot)
        self._update_commit_index()

    def _update_commit_index(self):
        """
        Update commit_index to the highest contiguous chosen slot.

        Commands can only be executed after they're committed, meaning
        all prior slots are also chosen. This ensures ordered execution.
        """
        while self.commit_index + 1 in self.entries:
            if self.entries[self.commit_index + 1].status == LogStatus.CHOSEN:
                self.commit_index += 1
            else:
                break

    # Command is the application-level command type (defined elsewhere in this lesson).
    def get_commands_to_execute(self, last_executed: int) -> List[Command]:
        """
        Return the list of commands that can be executed.
        These are committed entries after last_executed.
        """
        commands = []
        for slot in range(last_executed + 1, self.commit_index + 1):
            commands.append(self.entries[slot].value)
        return commands
```

Slot Allocation:
The leader assigns slot numbers to incoming commands. This creates a total ordering of commands across the cluster.
The commit_index tracks the highest contiguous chosen slot (all slots 0 through commit_index are chosen). The highest_slot may be higher if there are gaps. Commands are only executed up to commit_index to maintain order.
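A quick usage sketch of the `ReplicatedLog` class above shows how `commit_index` lags behind `highest_slot` when a slot is missing (the command strings are just placeholders):

```python
log = ReplicatedLog()

log.set_chosen(0, "SET x=1")
log.set_chosen(1, "SET y=2")
log.set_chosen(3, "SET z=3")             # slot 2 is still empty: a gap

print(log.highest_slot)                  # 3  (a value is chosen at slot 3)
print(log.commit_index)                  # 1  (only slots 0..1 are contiguously chosen)
print(log.get_commands_to_execute(-1))   # ['SET x=1', 'SET y=2']; slot 3 must wait

log.set_chosen(2, "NOOP")                # the gap is filled (e.g., with a no-op)
print(log.commit_index)                  # 3  (execution can now advance through slot 3)
```

This is exactly the situation the next subsection addresses: a chosen entry stuck behind an empty slot.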
Gaps in the log are a subtle but important challenge in Multi-Paxos. They arise when:

- The leader pipelines several slots at once and some Phase 2 rounds complete out of order, so a later slot is chosen while an earlier one is still in flight.
- A leader crashes after proposing values for some slots but before those proposals reach a majority.
- A new leader takes over and starts assigning slots beyond entries the old leader never completed.
The Gap Problem:
If slot 5 is chosen but slot 4 is empty, we cannot execute the command at slot 5 — commands must be executed in order. The gap at slot 4 must be resolved.
Filling Gaps with No-Ops:
The solution is to fill gaps with no-op (no operation) commands. A no-op is a special command that does nothing when executed but occupies a log slot, allowing subsequent slots to commit.
Gap Detection and Filling Process:
```python
async def fill_gaps_on_leadership(self, highest_known_slot: int):
    """
    When becoming leader, identify and fill any gaps in the log.

    This is typically done as part of the Phase 1 processing,
    where the new leader discovers partially-filled slots from
    acceptor promise responses.

    Args:
        highest_known_slot: The highest slot index discovered during Phase 1
    """
    for slot in range(0, highest_known_slot + 1):
        slot_status = await self._get_slot_status(slot)

        if slot_status == SlotStatus.EMPTY:
            # Gap detected! Fill with no-op
            await self._propose_noop(slot)
        elif slot_status == SlotStatus.PROPOSED_NOT_CHOSEN:
            # Slot has a proposed value but not chosen.
            # Re-drive this slot to completion.
            previous_value = await self._get_proposed_value(slot)
            await self._drive_slot_to_completion(slot, previous_value)


async def _propose_noop(self, slot: int):
    """
    Propose a no-op command for the given slot.
    No-ops are chosen via normal Paxos and then skipped during execution.
    """
    noop = Command(type="NOOP", data=None)
    # Skip Phase 1 if we're the stable leader
    # (our global Prepare already covers this slot)
    await self._phase2_accept(slot, noop)


def execute_command(self, command: Command):
    """Execute a command, handling no-ops appropriately."""
    if command.type == "NOOP":
        # No-op: do nothing, but log it for debugging
        self.logger.debug(f"Executing NOOP at slot {command.slot}")
        return NoopResult()
    else:
        # Normal command execution
        return self.state_machine.apply(command)
```

New leaders typically fill gaps as part of the Phase 1 process. The acceptor promise responses include all accepted-but-not-chosen entries. The new leader re-proposes these (adopting the value per the adoption rule) or fills empty slots with no-ops.
Concurrency Considerations:
In a highly concurrent system, multiple slots may be in-flight simultaneously. The leader might be waiting for acknowledgments on slots 10, 11, and 12 while already proposing slots 13, 14, 15. Gaps can temporarily appear and disappear as messages arrive out of order.
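A sketch of that pipelining, reusing the leader's per-slot `_phase2_accept` from the code above (the batching helper itself is not from the original):

```python
import asyncio


async def propose_batch(leader, commands):
    """
    Assign consecutive slots to a batch of commands and run their Phase 2
    rounds concurrently. Accepted acknowledgments may arrive out of order,
    which is fine: execution still waits for commit_index to advance.
    """
    tasks = []
    slots = []
    for command in commands:
        slot = leader.next_slot
        leader.next_slot += 1
        slots.append(slot)
        # Phase 2 only: the stable leader's global Prepare already covers these slots.
        tasks.append(asyncio.create_task(leader._phase2_accept(slot, command)))

    await asyncio.gather(*tasks)
    return slots
```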
Multi-Paxos requires more sophisticated state management than Basic Paxos, particularly for acceptors who must track per-slot information.
Acceptor State in Multi-Paxos:
```python
class MultiPaxosAcceptorState:
    """
    State for a Multi-Paxos acceptor.

    Key insight: The global promised_number applies to ALL slots,
    while accepted values are tracked per-slot.
    """

    def __init__(self, acceptor_id: int, storage: DurableStorage):
        self.acceptor_id = acceptor_id
        self.storage = storage

        # Global: The highest proposal number we've promised
        # Applies to ALL slots (this is the Multi-Paxos optimization)
        self.promised_number: Optional[int] = None

        # Per-slot: The accepted value for each slot
        # Map: slot_index -> (proposal_number, value)
        self.accepted_values: Dict[int, Tuple[int, Any]] = {}

        # The highest slot index we've seen
        self.max_slot: int = -1

    def handle_global_prepare(self, proposal_number: int) -> GlobalPromiseResponse:
        """
        Handle a Prepare that applies to all slots.
        Returns promise + all accepted values discovered.
        """
        if self._can_promise(proposal_number):
            self.promised_number = proposal_number
            self._persist()
            return GlobalPromiseResponse(
                promised=proposal_number,
                accepted_slots=self.accepted_values.copy(),
                max_slot=self.max_slot
            )
        else:
            return RejectionResponse(
                highest_promised=self.promised_number
            )

    def handle_accept(self, proposal_number: int, slot: int, value: Any) -> AcceptResponse:
        """
        Handle an Accept for a specific slot.
        """
        # Accept unless we have promised a higher proposal number
        # (promised_number is None before any promise has been made).
        if self.promised_number is None or proposal_number >= self.promised_number:
            # Accept the value for this slot
            self.accepted_values[slot] = (proposal_number, value)
            self.promised_number = proposal_number
            self.max_slot = max(self.max_slot, slot)
            self._persist()
            return AcceptedResponse(
                proposal_number=proposal_number,
                slot=slot,
                value=value
            )
        else:
            return RejectionResponse(
                highest_promised=self.promised_number
            )

    def _persist(self):
        """Persist all state to durable storage."""
        self.storage.save(f"acceptor_{self.acceptor_id}", {
            "promised": self.promised_number,
            "accepted": self.accepted_values,
            "max_slot": self.max_slot,
        })
        self.storage.sync()
```

Leader State in Multi-Paxos:
```python
class MultiPaxosLeaderState:
    """
    State for a Multi-Paxos leader.
    """

    def __init__(self, leader_id: int, acceptors: List[str]):
        self.leader_id = leader_id
        self.acceptors = acceptors
        self.majority = len(acceptors) // 2 + 1

        # Our current proposal number (applies to all slots)
        self.proposal_number: int = 0

        # Whether we've completed Phase 1 for our current proposal number
        self.phase1_complete: bool = False

        # The log we're managing
        self.log = ReplicatedLog()

        # Next slot to use for new commands
        self.next_slot: int = 0

        # In-flight slots (accepted but not yet chosen)
        self.pending_slots: Dict[int, PendingSlot] = {}

    async def become_leader(self):
        """
        Execute global Phase 1 to become the leader.
        """
        self.proposal_number = self._generate_proposal_number()

        # Send global Prepare to all acceptors
        responses = await self._send_global_prepare(self.proposal_number)

        # Need majority of promises
        promises = [r for r in responses if isinstance(r, GlobalPromiseResponse)]

        if len(promises) >= self.majority:
            # Collect all accepted values from acceptors
            await self._reconcile_accepted_values(promises)
            self.phase1_complete = True
        else:
            raise LeadershipFailed("Could not get majority promises")

    async def propose(self, command: Command) -> int:
        """
        Propose a command (assumes we're the leader with Phase 1 complete).
        Returns the slot index where the command was placed.
        """
        if not self.phase1_complete:
            raise NotLeaderError("Must complete Phase 1 first")

        slot = self.next_slot
        self.next_slot += 1

        # Skip Phase 1, go directly to Phase 2
        await self._phase2_accept(slot, command)

        return slot
```

The 'global Prepare' that applies to all slots is an optimization. An equivalent approach is to run slot-specific prepares but cache the promise for reuse. The key insight is the same: Phase 1 approval extends across slots for the same proposal number.
Leadership changes are the most complex part of Multi-Paxos. When the leader fails or becomes partitioned, a new leader must take over without losing any chosen values.
Leadership Change Protocol:

1. The candidate picks a proposal number higher than any it has seen.
2. It sends a global Prepare (covering all slots) to the acceptors.
3. Each promising acceptor returns its accepted (proposal number, value) pairs for every slot it knows about.
4. Once a majority has promised, the new leader reconciles the log: for each slot with an accepted value, it adopts the value from the highest-numbered proposal and re-drives Phase 2; empty slots below the highest known slot are filled with no-ops.
5. Only after reconciliation does the leader begin assigning new slots to client commands.
Reconciliation Example:
New leader L2 becomes leader and discovers from acceptor promises:
| Slot | Acceptor A1 | Acceptor A2 | Acceptor A3 | L2's Action |
|---|---|---|---|---|
| Slot 0 | Accepted (1, C0) | Accepted (1, C0) | Accepted (1, C0) | Already chosen, no action |
| Slot 1 | Accepted (1, C1) | Accepted (1, C1) | Accepted (1, C1) | Already chosen, no action |
| Slot 2 | Accepted (1, C2) | Empty | Empty | Adopt C2 (adoption rule), re-drive Phase 2 |
| Slot 3 | Empty | Empty | Empty | Gap! Propose no-op |
| Slot 4 | Accepted (1, C4) | Accepted (1, C4) | Empty | Already chosen by majority, confirm |
The new leader MUST reconcile the log before accepting new commands. If it skips this step and starts proposing in slot 5, it might conflict with the old leader's partially-complete proposal in slot 2. The adoption rule during reconciliation prevents this.
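The per-slot decisions in the table can be written as a small reconciliation pass. This is a sketch that assumes each promise carries the per-slot `accepted_slots` map shown in the acceptor code above:

```python
async def reconcile_slots(self, promises, highest_known_slot: int):
    """
    Sketch of the new leader's reconciliation pass over slots 0..highest_known_slot.

    For each slot, adopt the accepted value with the highest proposal number
    (the adoption rule) and re-drive Phase 2; fill completely empty slots with
    a no-op. Re-proposing an already-chosen value is harmless, since Paxos
    guarantees only that same value can be chosen again.
    """
    for slot in range(0, highest_known_slot + 1):
        reported = [
            p.accepted_slots[slot]              # (proposal_number, value) pairs
            for p in promises
            if slot in p.accepted_slots
        ]
        if reported:
            _, value = max(reported, key=lambda pv: pv[0])
            await self._phase2_accept(slot, value)
        else:
            await self._propose_noop(slot)      # gap: occupy the slot with a no-op
```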
Avoiding Split Brain:
Two leaders believing they're active (split brain) can cause confusion but not safety violations. Paxos handles this through proposal numbers:

- The new leader's Prepare carries a higher proposal number, so a majority of acceptors promise it and will reject the old leader's lower-numbered Accepts.
- The old leader can no longer assemble a majority of Accepted responses, so it cannot get a conflicting value chosen.
- When the old leader sees rejections reporting a higher promised number, it learns it has been preempted and must step down or retry Phase 1 with a higher number.
The protocol is safe even during split brain, but liveness suffers (commands may time out).
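A sketch of how the old leader learns it has been preempted, using the `RejectionResponse.highest_promised` field from the acceptor code above (`_send_accept` and `PreemptedError` are illustrative names, not from the original):

```python
async def _accept_with_preemption_check(self, slot: int, value):
    """
    Send Accept for one slot; if any rejection reports a higher promised
    number, another proposer has completed Phase 1 and we must step down.
    """
    responses = await self._send_accept(self.proposal_number, slot, value)

    for response in responses:
        if isinstance(response, RejectionResponse) and \
                response.highest_promised > self.proposal_number:
            # A higher-numbered Prepare has been promised: we are no longer leader.
            self.phase1_complete = False
            raise PreemptedError(
                f"Preempted by proposal {response.highest_promised}"
            )

    accepted = [r for r in responses if isinstance(r, AcceptedResponse)]
    if len(accepted) >= self.majority:
        self.log.set_chosen(slot, value)
```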
Multi-Paxos is typically used to implement replicated state machines (RSM). The idea:

- Every replica starts from the same initial state and runs the same deterministic state machine.
- Clients send commands to the leader, which appends them to the replicated log via Multi-Paxos.
- Each replica applies committed log entries in slot order.
- Because all replicas apply the same commands in the same order, they all arrive at the same state.
Multi-Paxos provides the consistent ordering. The state machine provides the semantics.
```python
class ReplicatedStateMachine:
    """
    A state machine replicated via Multi-Paxos.
    """

    def __init__(self, paxos: MultiPaxos, state_machine: StateMachine):
        self.paxos = paxos
        self.state_machine = state_machine

        # Last applied log index
        self.last_applied: int = -1

    async def execute(self, command: Command) -> Result:
        """
        Execute a command against the replicated state machine.

        1. Propose the command via Multi-Paxos
        2. Wait for the command to be chosen and committed
        3. Apply all commands up to the committed command
        4. Return the result of executing the command
        """
        # Step 1: Propose command
        slot = await self.paxos.propose(command)

        # Step 2: Wait for command to be committed
        # (commit_index must reach our slot)
        await self.paxos.wait_for_commit(slot)

        # Step 3: Apply all commands up to our slot
        result = None
        for i in range(self.last_applied + 1, slot + 1):
            entry = self.paxos.log.get(i)
            execution_result = self.state_machine.apply(entry.value)
            if i == slot:
                result = execution_result
            self.last_applied = i

        return result

    async def read(self, query: Query) -> Result:
        """
        Execute a read-only query.

        For linearizable reads, we must ensure we see all committed writes.
        Options:
        1. Read through consensus (expensive)
        2. Leader lease + local read (if leader)
        3. Read from any replica (stale reads allowed)
        """
        if query.consistency == Consistency.LINEARIZABLE:
            # Ensure we're leader and our lease is valid
            await self.paxos.confirm_leadership()
            # Apply any pending committed commands
            await self._apply_pending()

        return self.state_machine.read(query)
```

Virtually every distributed database, coordination service, and lock manager is a replicated state machine of some form. The state machine could be a key-value store (etcd), a file system (Chubby), a SQL database (Spanner), or any other deterministic system.
Implementing Multi-Paxos in production requires addressing numerous practical concerns beyond the core protocol.
| Aspect | Consideration | Impact if Missed |
|---|---|---|
| Persistence | Sync to disk before responding | Safety violation on crash |
| Proposal numbers | Global uniqueness guaranteed | Protocol correctness failure |
| Gap handling | Fill with no-ops on leadership change | Log application stalls |
| Leadership detection | Heartbeats + timeouts tuned | Frequent failovers or split brain |
| Log compaction | Periodic snapshots | Unbounded disk/memory usage |
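For the log compaction row, a common approach is snapshot-then-truncate. This sketch uses assumed names (`state_machine.snapshot`, `storage.save_snapshot`, and a `last_snapshot_index` field) on top of the replicated-state-machine fields from earlier:

```python
def compact_log(self, snapshot_interval: int = 10_000):
    """
    Sketch of snapshot-based log compaction.

    Once the state machine has applied everything up to last_applied,
    persist a snapshot and discard log entries at or below that index.
    A replica that falls behind can be caught up from the snapshot
    instead of from the discarded entries.
    """
    if self.last_applied - self.last_snapshot_index < snapshot_interval:
        return  # not enough new entries to justify a snapshot yet

    snapshot = self.state_machine.snapshot()              # serialized application state (assumed API)
    self.storage.save_snapshot(self.last_applied, snapshot)

    # Drop entries covered by the snapshot.
    for slot in list(self.paxos.log.entries):
        if slot <= self.last_applied:
            del self.paxos.log.entries[slot]

    self.last_snapshot_index = self.last_applied
```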
Many of these production concerns are simpler in Raft because Raft makes specific design choices (e.g., only the leader can accept new entries, logs are always contiguous). Multi-Paxos's generality makes it more flexible but also more complex to implement correctly.
We've explored how Multi-Paxos transforms the single-value Basic Paxos into a practical protocol for replicated logs. Let's consolidate the key takeaways:
What's next:
With a solid understanding of how Multi-Paxos works, we'll examine Why Paxos is Hard to Implement — the practical challenges that trip up implementers and why algorithms like Raft were specifically designed to be easier to understand and implement correctly.
You now understand Multi-Paxos, the practical extension of Basic Paxos for replicated logs. This is the foundation for understanding production consensus systems like etcd, ZooKeeper, and Spanner. Next, we'll explore why implementing Paxos correctly is so challenging.