We've seen that 2PC blocks under failures and 3PC breaks under partitions. Yet modern distributed systems like Google Spanner, CockroachDB, and etcd provide both safety and liveness. How?
The answer lies in consensus protocols—algorithms that allow a group of nodes to agree on a value even when some nodes fail or messages are lost. Unlike atomic commit protocols, consensus protocols don't promise that all nodes will participate in every decision. Instead, they guarantee that a majority of working nodes will agree, and that agreement is permanent.
The two most influential consensus protocols are Paxos (theoretically elegant, notoriously difficult to implement) and Raft (designed for understandability, widely deployed). Understanding these protocols unlocks your ability to reason about any modern distributed system.
By the end of this page, you will understand the consensus problem, how Paxos achieves agreement through quorums and proposal numbers, how Raft simplifies consensus with leader election, and how these protocols power real systems like etcd, Consul, and distributed databases.
Consensus is the problem of getting multiple nodes to agree on a single value. It sounds simple, but in a distributed system with failures, it's remarkably subtle.
Formal Requirements:

- Agreement: no two non-faulty nodes decide on different values.
- Validity: the decided value must have been proposed by some node.
- Termination: every non-faulty node eventually decides on a value.
Why is this hard? The FLP impossibility theorem tells us that in an asynchronous system with even one faulty process, no deterministic protocol can guarantee all three of agreement, validity, and termination.
Consensus protocols work around FLP by either:

- Giving up guaranteed termination: Paxos and Raft always preserve safety (agreement and validity) but guarantee progress only while the network behaves synchronously enough, e.g. while timeouts are long enough to elect a stable leader.
- Using randomization, so that termination holds with probability 1 rather than deterministically.
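Both escape hatches rest on majority quorums: any two majorities of the same cluster overlap in at least one node, so a new majority can always learn what an earlier one decided. A minimal illustration (not tied to any particular library):

```typescript
// Majority quorum size for a cluster of n nodes.
function quorumSize(n: number): number {
  return Math.floor(n / 2) + 1;
}

// Any two majorities must overlap because 2 * quorumSize(n) > n.
// Example for a 5-node cluster: two quorums of 3 always share at least 1 node.
const n = 5;
console.log(quorumSize(n));         // 3
console.log(2 * quorumSize(n) > n); // true -> overlap guaranteed
```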
| Property | Consensus (Paxos/Raft) | Atomic Commit (2PC/3PC) |
|---|---|---|
| Goal | Agree on one value from proposals | All-or-nothing transaction commit |
| Participants | Can proceed with majority | Requires all participants |
| Failure handling | Minority failures tolerated | Single failure blocks (2PC) or breaks (3PC) |
| Typical use | Replicated state machines, leader election | Distributed transactions |
| Safety | Always maintained | 2PC: safe, 3PC: unsafe under partition |
Consensus isn't used directly for transactions—it's used to build replicated logs and leader election. These become the foundation for distributed databases, configuration stores, and coordination services. You can even build atomic commit ON TOP of consensus (Paxos Commit).
Paxos, invented by Leslie Lamport in 1989, is the theoretical foundation of distributed consensus. Despite its reputation for complexity, the core protocol is elegant.
Paxos Roles:

- Proposers: propose values and drive the protocol forward.
- Acceptors: vote on proposals; a value is chosen once a majority of acceptors accept it.
- Learners: learn the chosen value without taking part in voting.

In practice, a single server usually plays all three roles.
The Key Insight: Proposal Numbers
Paxos uses monotonically increasing proposal numbers to order operations. A proposal with a higher number supersedes lower-numbered proposals. This creates a total ordering without a central coordinator.
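One common way to realize such numbers, shown here as an illustrative sketch rather than a scheme mandated by Paxos itself, is to pair a locally incremented round with a unique server ID and compare the pairs lexicographically:

```typescript
// A proposal number as (round, serverId): globally unique and totally ordered.
interface ProposalNumber {
  round: number;     // incremented locally each time a proposer retries
  serverId: number;  // breaks ties between proposers in the same round
}

function compare(a: ProposalNumber, b: ProposalNumber): number {
  if (a.round !== b.round) return a.round - b.round;
  return a.serverId - b.serverId;
}

// To outbid a proposal it has seen, a proposer bumps its round past it.
function nextProposal(seen: ProposalNumber, serverId: number): ProposalNumber {
  return { round: seen.round + 1, serverId };
}
```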
Basic Paxos (Single-Decree):
Phase 1: Prepare

- The proposer picks a proposal number n (higher than any it's seen)
- It sends Prepare(n) to a majority of acceptors
- Each acceptor that has not already promised a higher number replies with a promise to ignore lower-numbered proposals, along with any value it has already accepted

Phase 2: Accept

- If a majority of acceptors promise, the proposer sends Accept(n, value) to a majority
- If any acceptor reported a previously accepted value, the proposer must propose the value from the highest-numbered accepted proposal
- The value is chosen once a majority of acceptors accept it
```typescript
// Basic Paxos (Single-Decree) Implementation

interface PrepareResponse {
  promised: boolean;
  acceptedProposal?: number;
  acceptedValue?: string;
}

class PaxosAcceptor {
  private promisedNumber: number = 0;
  private acceptedNumber: number = 0;
  private acceptedValue: string | null = null;

  // Phase 1: Respond to Prepare
  prepare(proposalNumber: number): PrepareResponse {
    if (proposalNumber > this.promisedNumber) {
      this.promisedNumber = proposalNumber;
      // Report any previously accepted proposal so the proposer can adopt it
      if (this.acceptedValue !== null) {
        return {
          promised: true,
          acceptedProposal: this.acceptedNumber,
          acceptedValue: this.acceptedValue
        };
      }
      return { promised: true };
    }
    return { promised: false };
  }

  // Phase 2: Respond to Accept
  accept(proposalNumber: number, value: string): boolean {
    if (proposalNumber >= this.promisedNumber) {
      this.promisedNumber = proposalNumber;
      this.acceptedNumber = proposalNumber;
      this.acceptedValue = value;
      return true;
    }
    return false;
  }
}

class PaxosProposer {
  private serverId: number;
  private acceptors: PaxosAcceptor[];
  private quorumSize: number;

  constructor(serverId: number, acceptors: PaxosAcceptor[]) {
    this.serverId = serverId;
    this.acceptors = acceptors;
    this.quorumSize = Math.floor(acceptors.length / 2) + 1;
  }

  async propose(value: string): Promise<string | null> {
    // Generate a unique proposal number (timestamp + serverId)
    const proposalNumber = this.generateProposalNumber();

    // ====== PHASE 1: PREPARE ======
    const responses = await Promise.all(
      this.acceptors.map(a => a.prepare(proposalNumber))
    );
    const successfulPromises = responses.filter(r => r.promised);

    if (successfulPromises.length < this.quorumSize) {
      // Failed to get a majority - retry with a higher number
      return null;
    }

    // Determine the value to propose
    let proposedValue = value;
    const previouslyAccepted = successfulPromises
      .filter(r => r.acceptedProposal !== undefined)
      .sort((a, b) => b.acceptedProposal! - a.acceptedProposal!);

    if (previouslyAccepted.length > 0) {
      // MUST use the value from the highest-numbered accepted proposal
      // This is how Paxos ensures safety
      proposedValue = previouslyAccepted[0].acceptedValue!;
    }

    // ====== PHASE 2: ACCEPT ======
    const accepts = await Promise.all(
      this.acceptors.map(a => a.accept(proposalNumber, proposedValue))
    );
    const acceptCount = accepts.filter(a => a).length;

    if (acceptCount >= this.quorumSize) {
      return proposedValue; // Consensus achieved!
    }
    return null; // Failed - retry
  }

  private generateProposalNumber(): number {
    // Proposal numbers must be globally unique and increasing:
    // combine a timestamp with the server ID
    return Date.now() * 1000 + this.serverId;
  }
}
```

Basic Paxos agrees on a SINGLE value. Real systems need Multi-Paxos to agree on a sequence of values (a log). Multi-Paxos adds leader election, log indexing, and recovery—all with subtle corner cases. This complexity is why Raft was created.
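Before moving on, here is a hypothetical driver showing how the classes above fit together. The cluster size and values are made up; the point is that once a value is accepted by a majority, a later proposer is forced to adopt it:

```typescript
async function demoBasicPaxos(): Promise<void> {
  // Three acceptors -> a quorum is any two of them
  const acceptors = [new PaxosAcceptor(), new PaxosAcceptor(), new PaxosAcceptor()];

  const proposerA = new PaxosProposer(1, acceptors);
  const proposerB = new PaxosProposer(2, acceptors);

  // Proposer A gets its value chosen first
  const first = await proposerA.propose('value-from-A');
  console.log('A decided:', first); // 'value-from-A'

  // Proposer B proposes something else, but Phase 1 reveals the already
  // accepted value, so B must adopt it and the decision never changes
  const second = await proposerB.propose('value-from-B');
  console.log('B decided:', second); // still 'value-from-A'
}

demoBasicPaxos();
```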
Raft was designed in 2013 by Diego Ongaro and John Ousterhout with one goal: make consensus understandable. It achieves the same guarantees as Multi-Paxos but with a cleaner decomposition.
Raft's Key Simplifications:

- Strong leader: log entries flow only from the leader to followers, never the other way around.
- Problem decomposition: the protocol is split into leader election, log replication, and safety, which can be understood separately.
- Terms as a logical clock: time is divided into numbered terms, and each term has at most one leader.
- Randomized election timeouts: plain randomness resolves split votes without extra protocol machinery.
Raft Node States:
```
FOLLOWER ──timeout──→ CANDIDATE ──wins election──→ LEADER
   ▲                      │                          │
   │                      │ loses                    │
   │                      ▼                          │
   └── discovers higher term ◄───────────────────────┘
```
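The transitions in the diagram can be written as a small pure function. This is an illustrative sketch, not code from the Raft paper: the event names are mine, the snippet redeclares NodeState so it stands alone, and the fuller implementation below folds these transitions into its methods.

```typescript
type NodeState = 'FOLLOWER' | 'CANDIDATE' | 'LEADER';

type RaftEvent =
  | 'ELECTION_TIMEOUT'         // no heartbeat seen -> start (or restart) an election
  | 'WON_ELECTION'             // received votes from a majority
  | 'DISCOVERED_HIGHER_TERM';  // saw a message carrying a newer term

function nextState(state: NodeState, event: RaftEvent): NodeState {
  switch (event) {
    case 'ELECTION_TIMEOUT':
      // Followers and candidates start a new election; a leader keeps leading
      return state === 'LEADER' ? 'LEADER' : 'CANDIDATE';
    case 'WON_ELECTION':
      return state === 'CANDIDATE' ? 'LEADER' : state;
    case 'DISCOVERED_HIGHER_TERM':
      // Any node that sees a higher term immediately steps down
      return 'FOLLOWER';
  }
}
```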
```typescript
// Raft Protocol Implementation (simplified)

type NodeState = 'FOLLOWER' | 'CANDIDATE' | 'LEADER';

interface LogEntry {
  term: number;
  index: number;
  command: string;
}

interface RaftNode {
  id: string;
  state: NodeState;
  currentTerm: number;
  votedFor: string | null;
  log: LogEntry[];
  commitIndex: number;
  lastApplied: number;
}

class RaftServer implements RaftNode {
  id: string;
  state: NodeState = 'FOLLOWER';
  currentTerm: number = 0;
  votedFor: string | null = null;
  log: LogEntry[] = [];
  commitIndex: number = 0;
  lastApplied: number = 0;

  private peers: RaftServer[] = [];
  private electionTimeout: NodeJS.Timeout | null = null;
  private heartbeatInterval: NodeJS.Timeout | null = null;

  constructor(id: string) {
    this.id = id;
  }

  setPeers(peers: RaftServer[]): void {
    this.peers = peers;
  }

  // ====== LEADER ELECTION ======

  startElectionTimer(): void {
    // Randomized timeout: 150-300ms (prevents split votes)
    const timeout = 150 + Math.random() * 150;
    this.electionTimeout = setTimeout(() => {
      this.startElection();
    }, timeout);
  }

  resetElectionTimer(): void {
    if (this.electionTimeout) clearTimeout(this.electionTimeout);
    this.startElectionTimer();
  }

  async startElection(): Promise<void> {
    this.state = 'CANDIDATE';
    this.currentTerm++;
    this.votedFor = this.id;
    console.log(`[${this.id}] Starting election for term ${this.currentTerm}`);

    const votes = await Promise.all(
      this.peers.map(peer => this.requestVote(peer))
    );
    const voteCount = votes.filter(v => v).length + 1; // +1 for self
    const majority = Math.floor((this.peers.length + 1) / 2) + 1;

    if (voteCount >= majority && this.state === 'CANDIDATE') {
      this.becomeLeader();
    } else {
      this.state = 'FOLLOWER';
      this.startElectionTimer();
    }
  }

  async requestVote(peer: RaftServer): Promise<boolean> {
    const lastLogIndex = this.log.length - 1;
    const lastLogTerm = this.log[lastLogIndex]?.term ?? 0;

    // Peer grants the vote if:
    // 1. Candidate's term >= peer's term
    // 2. Peer hasn't voted for someone else this term
    // 3. Candidate's log is at least as up-to-date
    return peer.handleVoteRequest(
      this.currentTerm,
      this.id,
      lastLogIndex,
      lastLogTerm
    );
  }

  handleVoteRequest(
    term: number,
    candidateId: string,
    lastLogIndex: number,
    lastLogTerm: number
  ): boolean {
    if (term < this.currentTerm) return false;

    if (term > this.currentTerm) {
      this.currentTerm = term;
      this.votedFor = null;
      this.state = 'FOLLOWER';
    }

    const myLastIndex = this.log.length - 1;
    const myLastTerm = this.log[myLastIndex]?.term ?? 0;
    const logOk =
      lastLogTerm > myLastTerm ||
      (lastLogTerm === myLastTerm && lastLogIndex >= myLastIndex);

    if ((this.votedFor === null || this.votedFor === candidateId) && logOk) {
      this.votedFor = candidateId;
      this.resetElectionTimer();
      return true;
    }
    return false;
  }

  // ====== LOG REPLICATION ======

  becomeLeader(): void {
    this.state = 'LEADER';
    console.log(`[${this.id}] Became leader for term ${this.currentTerm}`);

    // Send an initial empty heartbeat, then heartbeat every 50ms
    this.sendHeartbeats();
    this.heartbeatInterval = setInterval(() => {
      this.sendHeartbeats();
    }, 50);
  }

  private sendHeartbeats(): void {
    // Heartbeats are empty AppendEntries calls that reset follower timers
    this.peers.forEach(peer => peer.resetElectionTimer());
  }

  private async replicateToFollower(peer: RaftServer, entry: LogEntry): Promise<boolean> {
    // Simplified: a real implementation sends AppendEntries with
    // prevLogIndex/prevLogTerm consistency checks and retries on mismatch
    if (entry.term < peer.currentTerm) return false;
    peer.log[entry.index] = entry;
    return true;
  }

  async appendEntry(command: string): Promise<boolean> {
    if (this.state !== 'LEADER') {
      throw new Error('Not the leader');
    }

    const entry: LogEntry = {
      term: this.currentTerm,
      index: this.log.length,
      command
    };
    this.log.push(entry);

    // Replicate to followers
    const results = await Promise.all(
      this.peers.map(peer => this.replicateToFollower(peer, entry))
    );
    const successCount = results.filter(r => r).length + 1; // +1 for self
    const majority = Math.floor((this.peers.length + 1) / 2) + 1;

    if (successCount >= majority) {
      this.commitIndex = entry.index;
      return true;
    }
    return false;
  }
}
```

Consensus protocols power critical infrastructure across the industry:
| System | Protocol | Use Case |
|---|---|---|
| etcd | Raft | Kubernetes configuration store |
| Consul | Raft | Service discovery, KV store |
| CockroachDB | Raft | Distributed SQL replication |
| TiDB (TiKV) | Raft | Distributed KV storage layer |
| Google Spanner | Paxos | Global database replication |
| Apache ZooKeeper | ZAB (Paxos-like) | Distributed coordination |
| MongoDB | Raft-based (since 3.2) | Replica set elections |
You can build atomic commit on top of consensus: run one Paxos instance per participant to decide that participant's vote (prepared or abort), and derive the global outcome deterministically: the transaction commits only if every instance chooses 'prepared'. This is Gray and Lamport's 'Paxos Commit', and it provides non-blocking atomic commitment that stays safe under partitions and makes progress as long as a majority of acceptors remains reachable.
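A sketch of just the decision rule, leaving out the per-participant Paxos instances themselves and using made-up type names:

```typescript
type Vote = 'PREPARED' | 'ABORT';

// In Paxos Commit, each participant's vote is itself decided by a Paxos
// instance, so every node learns the same array of decided votes. The global
// outcome is then a pure function of those votes.
function globalOutcome(decidedVotes: Vote[]): 'COMMIT' | 'ABORT' {
  return decidedVotes.every(v => v === 'PREPARED') ? 'COMMIT' : 'ABORT';
}

// Example: three participants
console.log(globalOutcome(['PREPARED', 'PREPARED', 'PREPARED'])); // COMMIT
console.log(globalOutcome(['PREPARED', 'ABORT', 'PREPARED']));    // ABORT
```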
What's Next:
The next page explores asynchronous replication patterns—approaches that sacrifice strong consistency for better performance and availability. We'll cover primary-replica replication, multi-primary replication, and the trade-offs each approach makes.
You now understand consensus protocols—the foundation of modern distributed coordination. From Paxos's theoretical elegance to Raft's practical clarity, these algorithms enable the distributed systems that power today's infrastructure.