Basic Paxos, as elegant as it is, solves a limited problem: agreeing on a single value. Real distributed systems, however, need to agree on sequences of values — commands to execute, transactions to commit, configuration changes to apply. A replicated database, for instance, needs a replicated log of operations that all replicas apply in the same order.
The naive approach would be to run a separate instance of Basic Paxos for each log entry. This works but is extremely inefficient — each entry requires two round trips (Prepare/Promise, Accept/Accepted). For a system processing thousands of commands per second, this overhead is prohibitive.
Multi-Paxos is the set of optimizations that transforms Basic Paxos into a practical protocol for log replication. It's what Google's Chubby and Spanner run under the hood, and the same ideas underpin the consensus layers of etcd (Raft) and Apache ZooKeeper (ZAB).
By the end of this page, you will understand how to extend Basic Paxos for log replication, the stable leader optimization that amortizes Phase 1, how log slots and indices work, handling gaps in the log, and why Multi-Paxos is the foundation for replicated state machines.
Let's understand why naive multi-instance Paxos is impractical.
The Naive Approach:
To agree on a sequence of values V₀, V₁, V₂, ..., run a separate Paxos instance for each index:
Each instance is completely independent — its own proposal numbers, its own Phase 1 and Phase 2.
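As a rough sketch of what one-instance-per-index looks like in code (assuming a `BasicPaxosInstance` class exposing the two-phase `propose` call from the Basic Paxos discussion; the name is illustrative, not from the original):

```python
from typing import Any, Dict


class NaiveMultiPaxos:
    """Naive log replication: one fully independent Basic Paxos run per index."""

    def __init__(self, acceptors: list):
        self.acceptors = acceptors
        # One independent Basic Paxos instance per log index,
        # each with its own proposal numbers and its own two phases.
        self.instances: Dict[int, "BasicPaxosInstance"] = {}  # BasicPaxosInstance is assumed

    async def agree_on(self, index: int, value: Any) -> Any:
        # Every index pays the full price: Prepare/Promise, then Accept/Accepted.
        instance = self.instances.setdefault(
            index, BasicPaxosInstance(self.acceptors)
        )
        return await instance.propose(value)  # 2 RTT per value, every time
```

The table below quantifies why this per-index cost does not scale.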
| Metric | Basic Paxos (Single Value) | Naive Multi-Instance (N Values) |
|---|---|---|
| Message rounds | 2 (Prepare + Accept) | 2N rounds |
| Messages per value | 4F+4 (to and from a quorum of F+1, for F fault tolerance) | (4F+4) × N messages |
| Latency per value | 2 RTT | 2 RTT per value (if sequential) or requires complex parallelization |
| Stable storage writes | 2 per acceptor | 2N per acceptor |
Consider a system processing 10,000 commands per second. With naive multi-instance Paxos, that's 10,000 Phase 1 Prepare rounds plus 10,000 Phase 2 Accept rounds every second (20,000 consensus rounds in total), plus the associated message overhead and disk writes. This is simply not practical.
The Key Insight:
If the same proposer is making proposals for consecutive log slots, do we really need to run Phase 1 for each slot? Phase 1 establishes the proposer's right to propose — if that right extends across all log slots (not just one), we can skip Phase 1 for most operations.
This is the core optimization of Multi-Paxos:
A proposer that completes Phase 1 for all slots (or a range of slots) can skip Phase 1 for subsequent proposals, as long as no higher-numbered proposal preempts it.
The stable leader optimization is the centerpiece of Multi-Paxos. Instead of having any proposer compete for each log slot, we establish a distinguished leader that drives consensus.
How It Works:

1. A candidate leader picks a proposal number and sends a single Prepare that covers all log slots (or all slots above some index).
2. Acceptors that promise return every value they have already accepted, per slot.
3. Once a majority has promised, the leader holds the right to propose in every slot under that proposal number.
4. Each new client command is assigned the next free slot and goes straight to Phase 2 (Accept), skipping Phase 1 entirely.
5. The leader keeps skipping Phase 1 until an acceptor reports a higher promised number, at which point a new Phase 1 (and possibly a new leader) is required.
The Performance Gain:
With the stable leader optimization, the latency for each log entry is reduced from 2 RTT to 1 RTT (just Phase 2). The number of messages per entry is halved. Phase 1 overhead is amortized across potentially millions of log entries.
| Metric | Naive Multi-Instance | Multi-Paxos with Stable Leader |
|---|---|---|
| Phase 1 rounds | 1 per value | 1 per leadership term (amortized to ~0) |
| Messages per value | (4F+4) | (2F+2) — only Accept/Accepted |
| Latency per value | 2 RTT | 1 RTT (in steady state) |
| Leadership change cost | N/A | 1 RTT for new Phase 1 |
The efficiency of Multi-Paxos depends on leader stability. Every leadership change requires a new Phase 1 (Prepare for all slots). Systems often use leader leases, heartbeats, and careful timeout tuning to maximize leader tenure.
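As an illustration of that tuning (not part of the core protocol), here is a minimal lease sketch; the two-second duration and the field names are assumptions for the example:

```python
import time


class LeaderLease:
    """
    Minimal leader-lease bookkeeping (illustrative sketch only).

    The leader renews the lease each time a heartbeat round is acknowledged
    by a majority; it should stop serving once the lease expires, and
    followers should only start a new Phase 1 after a comparable timeout.
    """

    def __init__(self, lease_duration: float = 2.0):
        self.lease_duration = lease_duration    # seconds; tuning is deployment-specific
        self.last_majority_ack: float = 0.0     # time of the last majority-acked heartbeat

    def renew(self) -> None:
        """Call when a heartbeat is acknowledged by a majority of acceptors."""
        self.last_majority_ack = time.monotonic()

    def is_valid(self) -> bool:
        """The leader keeps skipping Phase 1 (and serving reads) only while this holds."""
        return time.monotonic() - self.last_majority_ack < self.lease_duration
```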
The Replicated Log Model:
Multi-Paxos maintains a replicated log where each entry is identified by an integer index (or slot number). The log grows monotonically — entries can be added but not removed or reordered.
```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Dict, List, Optional


class LogStatus(Enum):
    EMPTY = auto()
    PROPOSED = auto()
    ACCEPTED = auto()
    CHOSEN = auto()


@dataclass
class LogEntry:
    slot: int
    value: Any
    status: LogStatus  # EMPTY, PROPOSED, ACCEPTED, CHOSEN
    proposal_number: Optional[int] = None


class ReplicatedLog:
    """
    The replicated log structure that Multi-Paxos manages.

    Each slot may be in one of several states:
    - Empty: No value proposed yet
    - Proposed: A value has been proposed but not yet chosen
    - Chosen: Value has been accepted by a majority (finalized)
    """

    def __init__(self):
        # Map from slot index to LogEntry
        self.entries: Dict[int, LogEntry] = {}

        # The highest slot index we've seen
        self.highest_slot: int = -1

        # The highest slot index that is chosen and contiguously
        # connected to slot 0 (the "commit index")
        self.commit_index: int = -1

    def get_next_empty_slot(self) -> int:
        """Return the next slot index to use for a new command."""
        return self.highest_slot + 1

    def set_chosen(self, slot: int, value: Any):
        """Mark a slot as chosen with the given value."""
        self.entries[slot] = LogEntry(
            slot=slot,
            value=value,
            status=LogStatus.CHOSEN
        )
        self.highest_slot = max(self.highest_slot, slot)
        self._update_commit_index()

    def _update_commit_index(self):
        """
        Update commit_index to the highest contiguous chosen slot.

        Commands can only be executed after they're committed, meaning
        all prior slots are also chosen. This ensures ordered execution.
        """
        while self.commit_index + 1 in self.entries:
            if self.entries[self.commit_index + 1].status == LogStatus.CHOSEN:
                self.commit_index += 1
            else:
                break

    # Command is the application-level command type (defined elsewhere in this lesson).
    def get_commands_to_execute(self, last_executed: int) -> List[Command]:
        """
        Return the list of commands that can be executed.
        These are committed entries after last_executed.
        """
        commands = []
        for slot in range(last_executed + 1, self.commit_index + 1):
            commands.append(self.entries[slot].value)
        return commands
```

Slot Allocation:
The leader assigns slot numbers to incoming commands. This creates a total ordering of commands across the cluster.
The commit_index tracks the highest contiguous chosen slot (all slots 0 through commit_index are chosen). The highest_slot may be higher if there are gaps. Commands are only executed up to commit_index to maintain order.
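A quick usage sketch of the `ReplicatedLog` class above shows how `commit_index` lags behind `highest_slot` when a slot is missing (the command strings are just placeholders):

```python
log = ReplicatedLog()

log.set_chosen(0, "SET x=1")
log.set_chosen(1, "SET y=2")
log.set_chosen(3, "SET z=3")             # slot 2 is still empty: a gap

print(log.highest_slot)                  # 3  (a value is chosen at slot 3)
print(log.commit_index)                  # 1  (only slots 0..1 are contiguously chosen)
print(log.get_commands_to_execute(-1))   # ['SET x=1', 'SET y=2']; slot 3 must wait

log.set_chosen(2, "NOOP")                # the gap is filled (e.g., with a no-op)
print(log.commit_index)                  # 3  (execution can now advance through slot 3)
```

This is exactly the situation the next subsection addresses: a chosen entry stuck behind an empty slot.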
Gaps in the log are a subtle but important challenge in Multi-Paxos. They arise when:

- The leader pipelines several slots at once and some Phase 2 rounds complete out of order, so a later slot is chosen while an earlier one is still in flight.
- A leader crashes after proposing values for some slots but before those proposals reach a majority.
- A new leader takes over and starts assigning slots beyond entries the old leader never completed.
The Gap Problem:
If slot 5 is chosen but slot 4 is empty, we cannot execute the command at slot 5 — commands must be executed in order. The gap at slot 4 must be resolved.
Filling Gaps with No-Ops:
The solution is to fill gaps with no-op (no operation) commands. A no-op is a special command that does nothing when executed but occupies a log slot, allowing subsequent slots to commit.
Gap Detection and Filling Process:
```python
async def fill_gaps_on_leadership(self, highest_known_slot: int):
    """
    When becoming leader, identify and fill any gaps in the log.

    This is typically done as part of the Phase 1 processing,
    where the new leader discovers partially-filled slots from
    acceptor promise responses.

    Args:
        highest_known_slot: The highest slot index discovered during Phase 1
    """
    for slot in range(0, highest_known_slot + 1):
        slot_status = await self._get_slot_status(slot)

        if slot_status == SlotStatus.EMPTY:
            # Gap detected! Fill with no-op
            await self._propose_noop(slot)
        elif slot_status == SlotStatus.PROPOSED_NOT_CHOSEN:
            # Slot has a proposed value but not chosen.
            # Re-drive this slot to completion.
            previous_value = await self._get_proposed_value(slot)
            await self._drive_slot_to_completion(slot, previous_value)


async def _propose_noop(self, slot: int):
    """
    Propose a no-op command for the given slot.
    No-ops are chosen via normal Paxos and then skipped during execution.
    """
    noop = Command(type="NOOP", data=None)
    # Skip Phase 1 if we're the stable leader
    # (our global Prepare already covers this slot)
    await self._phase2_accept(slot, noop)


def execute_command(self, command: Command):
    """Execute a command, handling no-ops appropriately."""
    if command.type == "NOOP":
        # No-op: do nothing, but log it for debugging
        self.logger.debug(f"Executing NOOP at slot {command.slot}")
        return NoopResult()
    else:
        # Normal command execution
        return self.state_machine.apply(command)
```

New leaders typically fill gaps as part of the Phase 1 process. The acceptor promise responses include all accepted-but-not-chosen entries. The new leader re-proposes these (adopting the value per the adoption rule) or fills empty slots with no-ops.
Concurrency Considerations:
In a highly concurrent system, multiple slots may be in-flight simultaneously. The leader might be waiting for acknowledgments on slots 10, 11, and 12 while already proposing slots 13, 14, 15. Gaps can temporarily appear and disappear as messages arrive out of order.
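A sketch of that pipelining, reusing the leader's per-slot `_phase2_accept` from the code above (the batching helper itself is not from the original):

```python
import asyncio


async def propose_batch(leader, commands):
    """
    Assign consecutive slots to a batch of commands and run their Phase 2
    rounds concurrently. Accepted acknowledgments may arrive out of order,
    which is fine: execution still waits for commit_index to advance.
    """
    tasks = []
    slots = []
    for command in commands:
        slot = leader.next_slot
        leader.next_slot += 1
        slots.append(slot)
        # Phase 2 only: the stable leader's global Prepare already covers these slots.
        tasks.append(asyncio.create_task(leader._phase2_accept(slot, command)))

    await asyncio.gather(*tasks)
    return slots
```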
Multi-Paxos requires more sophisticated state management than Basic Paxos, particularly for acceptors who must track per-slot information.
Acceptor State in Multi-Paxos:
```python
class MultiPaxosAcceptorState:
    """
    State for a Multi-Paxos acceptor.

    Key insight: The global promised_number applies to ALL slots,
    while accepted values are tracked per-slot.
    """

    def __init__(self, acceptor_id: int, storage: DurableStorage):
        self.acceptor_id = acceptor_id
        self.storage = storage

        # Global: The highest proposal number we've promised
        # Applies to ALL slots (this is the Multi-Paxos optimization)
        self.promised_number: Optional[int] = None

        # Per-slot: The accepted value for each slot
        # Map: slot_index -> (proposal_number, value)
        self.accepted_values: Dict[int, Tuple[int, Any]] = {}

        # The highest slot index we've seen
        self.max_slot: int = -1

    def handle_global_prepare(self, proposal_number: int) -> GlobalPromiseResponse:
        """
        Handle a Prepare that applies to all slots.
        Returns promise + all accepted values discovered.
        """
        if self._can_promise(proposal_number):
            self.promised_number = proposal_number
            self._persist()
            return GlobalPromiseResponse(
                promised=proposal_number,
                accepted_slots=self.accepted_values.copy(),
                max_slot=self.max_slot
            )
        else:
            return RejectionResponse(
                highest_promised=self.promised_number
            )

    def handle_accept(self, proposal_number: int, slot: int, value: Any) -> AcceptResponse:
        """
        Handle an Accept for a specific slot.
        """
        # Accept unless we have promised a higher proposal number
        # (promised_number is None before any promise has been made).
        if self.promised_number is None or proposal_number >= self.promised_number:
            # Accept the value for this slot
            self.accepted_values[slot] = (proposal_number, value)
            self.promised_number = proposal_number
            self.max_slot = max(self.max_slot, slot)
            self._persist()
            return AcceptedResponse(
                proposal_number=proposal_number,
                slot=slot,
                value=value
            )
        else:
            return RejectionResponse(
                highest_promised=self.promised_number
            )

    def _persist(self):
        """Persist all state to durable storage."""
        self.storage.save(f"acceptor_{self.acceptor_id}", {
            "promised": self.promised_number,
            "accepted": self.accepted_values,
            "max_slot": self.max_slot,
        })
        self.storage.sync()
```

Leader State in Multi-Paxos:
```python
class MultiPaxosLeaderState:
    """
    State for a Multi-Paxos leader.
    """

    def __init__(self, leader_id: int, acceptors: List[str]):
        self.leader_id = leader_id
        self.acceptors = acceptors
        self.majority = len(acceptors) // 2 + 1

        # Our current proposal number (applies to all slots)
        self.proposal_number: int = 0

        # Whether we've completed Phase 1 for our current proposal number
        self.phase1_complete: bool = False

        # The log we're managing
        self.log = ReplicatedLog()

        # Next slot to use for new commands
        self.next_slot: int = 0

        # In-flight slots (accepted but not yet chosen)
        self.pending_slots: Dict[int, PendingSlot] = {}

    async def become_leader(self):
        """
        Execute global Phase 1 to become the leader.
        """
        self.proposal_number = self._generate_proposal_number()

        # Send global Prepare to all acceptors
        responses = await self._send_global_prepare(self.proposal_number)

        # Need majority of promises
        promises = [r for r in responses if isinstance(r, GlobalPromiseResponse)]

        if len(promises) >= self.majority:
            # Collect all accepted values from acceptors
            await self._reconcile_accepted_values(promises)
            self.phase1_complete = True
        else:
            raise LeadershipFailed("Could not get majority promises")

    async def propose(self, command: Command) -> int:
        """
        Propose a command (assumes we're the leader with Phase 1 complete).
        Returns the slot index where the command was placed.
        """
        if not self.phase1_complete:
            raise NotLeaderError("Must complete Phase 1 first")

        slot = self.next_slot
        self.next_slot += 1

        # Skip Phase 1, go directly to Phase 2
        await self._phase2_accept(slot, command)

        return slot
```

The 'global Prepare' that applies to all slots is an optimization. An equivalent approach is to run slot-specific prepares but cache the promise for reuse. The key insight is the same: Phase 1 approval extends across slots for the same proposal number.
Leadership changes are the most complex part of Multi-Paxos. When the leader fails or becomes partitioned, a new leader must take over without losing any chosen values.
Leadership Change Protocol:

1. The candidate picks a proposal number higher than any it has seen.
2. It sends a global Prepare (covering all slots) to the acceptors.
3. Each promising acceptor returns its accepted (proposal number, value) pairs for every slot it knows about.
4. Once a majority has promised, the new leader reconciles the log: for each slot with an accepted value, it adopts the value from the highest-numbered proposal and re-drives Phase 2; empty slots below the highest known slot are filled with no-ops.
5. Only after reconciliation does the leader begin assigning new slots to client commands.
Reconciliation Example:
New leader L2 becomes leader and discovers from acceptor promises:
| Slot | Acceptor A1 | Acceptor A2 | Acceptor A3 | L2's Action |
|---|---|---|---|---|
| Slot 0 | Accepted (1, C0) | Accepted (1, C0) | Accepted (1, C0) | Already chosen, no action |
| Slot 1 | Accepted (1, C1) | Accepted (1, C1) | Accepted (1, C1) | Already chosen, no action |
| Slot 2 | Accepted (1, C2) | Empty | Empty | Adopt C2 (adoption rule), re-drive Phase 2 |
| Slot 3 | Empty | Empty | Empty | Gap! Propose no-op |
| Slot 4 | Accepted (1, C4) | Accepted (1, C4) | Empty | Already chosen by majority, confirm |
The new leader MUST reconcile the log before accepting new commands. If it skips this step and starts proposing in slot 5, it might conflict with the old leader's partially-complete proposal in slot 2. The adoption rule during reconciliation prevents this.
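The per-slot decisions in the table can be written as a small reconciliation pass. This is a sketch that assumes each promise carries the per-slot `accepted_slots` map shown in the acceptor code above:

```python
async def reconcile_slots(self, promises, highest_known_slot: int):
    """
    Sketch of the new leader's reconciliation pass over slots 0..highest_known_slot.

    For each slot, adopt the accepted value with the highest proposal number
    (the adoption rule) and re-drive Phase 2; fill completely empty slots with
    a no-op. Re-proposing an already-chosen value is harmless, since Paxos
    guarantees only that same value can be chosen again.
    """
    for slot in range(0, highest_known_slot + 1):
        reported = [
            p.accepted_slots[slot]              # (proposal_number, value) pairs
            for p in promises
            if slot in p.accepted_slots
        ]
        if reported:
            _, value = max(reported, key=lambda pv: pv[0])
            await self._phase2_accept(slot, value)
        else:
            await self._propose_noop(slot)      # gap: occupy the slot with a no-op
```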
Avoiding Split Brain:
Two leaders believing they're active (split brain) can cause confusion but not safety violations. Paxos handles this through proposal numbers:

- The new leader's Prepare carries a higher proposal number, so a majority of acceptors promise it and will reject the old leader's lower-numbered Accepts.
- The old leader can no longer assemble a majority of Accepted responses, so it cannot get a conflicting value chosen.
- When the old leader sees rejections reporting a higher promised number, it learns it has been preempted and must step down or retry Phase 1 with a higher number.
The protocol is safe even during split brain, but liveness suffers (commands may time out).
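A sketch of how the old leader learns it has been preempted, using the `RejectionResponse.highest_promised` field from the acceptor code above (`_send_accept` and `PreemptedError` are illustrative names, not from the original):

```python
async def _accept_with_preemption_check(self, slot: int, value):
    """
    Send Accept for one slot; if any rejection reports a higher promised
    number, another proposer has completed Phase 1 and we must step down.
    """
    responses = await self._send_accept(self.proposal_number, slot, value)

    for response in responses:
        if isinstance(response, RejectionResponse) and \
                response.highest_promised > self.proposal_number:
            # A higher-numbered Prepare has been promised: we are no longer leader.
            self.phase1_complete = False
            raise PreemptedError(
                f"Preempted by proposal {response.highest_promised}"
            )

    accepted = [r for r in responses if isinstance(r, AcceptedResponse)]
    if len(accepted) >= self.majority:
        self.log.set_chosen(slot, value)
```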
Multi-Paxos is typically used to implement replicated state machines (RSM). The idea:

- Every replica starts from the same initial state and runs the same deterministic state machine.
- Clients send commands to the leader, which appends them to the replicated log via Multi-Paxos.
- Each replica applies committed log entries in slot order.
- Because all replicas apply the same commands in the same order, they all arrive at the same state.
Multi-Paxos provides the consistent ordering. The state machine provides the semantics.
```python
class ReplicatedStateMachine:
    """
    A state machine replicated via Multi-Paxos.
    """

    def __init__(self, paxos: MultiPaxos, state_machine: StateMachine):
        self.paxos = paxos
        self.state_machine = state_machine

        # Last applied log index
        self.last_applied: int = -1

    async def execute(self, command: Command) -> Result:
        """
        Execute a command against the replicated state machine.

        1. Propose the command via Multi-Paxos
        2. Wait for the command to be chosen and committed
        3. Apply all commands up to the committed command
        4. Return the result of executing the command
        """
        # Step 1: Propose command
        slot = await self.paxos.propose(command)

        # Step 2: Wait for command to be committed
        # (commit_index must reach our slot)
        await self.paxos.wait_for_commit(slot)

        # Step 3: Apply all commands up to our slot
        result = None
        for i in range(self.last_applied + 1, slot + 1):
            entry = self.paxos.log.get(i)
            execution_result = self.state_machine.apply(entry.value)
            if i == slot:
                result = execution_result
            self.last_applied = i

        return result

    async def read(self, query: Query) -> Result:
        """
        Execute a read-only query.

        For linearizable reads, we must ensure we see all committed writes.
        Options:
        1. Read through consensus (expensive)
        2. Leader lease + local read (if leader)
        3. Read from any replica (stale reads allowed)
        """
        if query.consistency == Consistency.LINEARIZABLE:
            # Ensure we're leader and our lease is valid
            await self.paxos.confirm_leadership()
            # Apply any pending committed commands
            await self._apply_pending()

        return self.state_machine.read(query)
```

Virtually every distributed database, coordination service, and lock manager is a replicated state machine of some form. The state machine could be a key-value store (etcd), a file system (Chubby), a SQL database (Spanner), or any other deterministic system.
Implementing Multi-Paxos in production requires addressing numerous practical concerns beyond the core protocol.
| Aspect | Consideration | Impact if Missed |
|---|---|---|
| Persistence | Sync to disk before responding | Safety violation on crash |
| Proposal numbers | Global uniqueness guaranteed | Protocol correctness failure |
| Gap handling | Fill with no-ops on leadership change | Log application stalls |
| Leadership detection | Heartbeats + timeouts tuned | Frequent failovers or split brain |
| Log compaction | Periodic snapshots | Unbounded disk/memory usage |
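For the log compaction row, a common approach is snapshot-then-truncate. This sketch uses assumed names (`state_machine.snapshot`, `storage.save_snapshot`, and a `last_snapshot_index` field) on top of the replicated-state-machine fields from earlier:

```python
def compact_log(self, snapshot_interval: int = 10_000):
    """
    Sketch of snapshot-based log compaction.

    Once the state machine has applied everything up to last_applied,
    persist a snapshot and discard log entries at or below that index.
    A replica that falls behind can be caught up from the snapshot
    instead of from the discarded entries.
    """
    if self.last_applied - self.last_snapshot_index < snapshot_interval:
        return  # not enough new entries to justify a snapshot yet

    snapshot = self.state_machine.snapshot()              # serialized application state (assumed API)
    self.storage.save_snapshot(self.last_applied, snapshot)

    # Drop entries covered by the snapshot.
    for slot in list(self.paxos.log.entries):
        if slot <= self.last_applied:
            del self.paxos.log.entries[slot]

    self.last_snapshot_index = self.last_applied
```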
Many of these production concerns are simpler in Raft because Raft makes specific design choices (e.g., only the leader can accept new entries, logs are always contiguous). Multi-Paxos's generality makes it more flexible but also more complex to implement correctly.
We've explored how Multi-Paxos transforms the single-value Basic Paxos into a practical protocol for replicated logs. Let's consolidate the key takeaways:
What's next:
With a solid understanding of how Multi-Paxos works, we'll examine Why Paxos is Hard to Implement — the practical challenges that trip up implementers and why algorithms like Raft were specifically designed to be easier to understand and implement correctly.
You now understand Multi-Paxos, the practical extension of Basic Paxos for replicated logs. This is the foundation for understanding production consensus systems like etcd, ZooKeeper, and Spanner. Next, we'll explore why implementing Paxos correctly is so challenging.