We've seen that 2PC blocks under failures and 3PC breaks under partitions. Yet modern distributed systems like Google Spanner, CockroachDB, and etcd provide both safety and liveness. How?
The answer lies in consensus protocols—algorithms that allow a group of nodes to agree on a value even when some nodes fail or messages are lost. Unlike atomic commit protocols, consensus protocols don't promise that all nodes will participate in every decision. Instead, they guarantee that a majority of working nodes will agree, and that agreement is permanent.
The two most influential consensus protocols are Paxos (theoretically elegant, notoriously difficult to implement) and Raft (designed for understandability, widely deployed). Understanding these protocols unlocks your ability to reason about any modern distributed system.
By the end of this page, you will understand the consensus problem, how Paxos achieves agreement through quorums and proposal numbers, how Raft simplifies consensus with leader election, and how these protocols power real systems like etcd, Consul, and distributed databases.
Consensus is the problem of getting multiple nodes to agree on a single value. It sounds simple, but in a distributed system with failures, it's remarkably subtle.
Formal Requirements:

- Agreement: no two non-faulty nodes decide on different values.
- Validity: the decided value must have been proposed by some node.
- Termination: every non-faulty node eventually decides on a value.
Why is this hard? The FLP impossibility theorem tells us that in an asynchronous system with even one faulty process, no deterministic protocol can guarantee all three of agreement, validity, and termination.
Consensus protocols work around FLP by either:

- Giving up guaranteed termination: Paxos and Raft always preserve safety (agreement and validity) but guarantee progress only while the network behaves synchronously enough, e.g. while timeouts are long enough to elect a stable leader.
- Using randomization, so that termination holds with probability 1 rather than deterministically.
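Both escape hatches rest on majority quorums: any two majorities of the same cluster overlap in at least one node, so a new majority can always learn what an earlier one decided. A minimal illustration (not tied to any particular library):

```typescript
// Majority quorum size for a cluster of n nodes.
function quorumSize(n: number): number {
  return Math.floor(n / 2) + 1;
}

// Any two majorities must overlap because 2 * quorumSize(n) > n.
// Example for a 5-node cluster: two quorums of 3 always share at least 1 node.
const n = 5;
console.log(quorumSize(n));         // 3
console.log(2 * quorumSize(n) > n); // true -> overlap guaranteed
```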
| Property | Consensus (Paxos/Raft) | Atomic Commit (2PC/3PC) |
|---|---|---|
| Goal | Agree on one value from proposals | All-or-nothing transaction commit |
| Participants | Can proceed with majority | Requires all participants |
| Failure handling | Minority failures tolerated | Single failure blocks (2PC) or breaks (3PC) |
| Typical use | Replicated state machines, leader election | Distributed transactions |
| Safety | Always maintained | 2PC: safe, 3PC: unsafe under partition |
Consensus isn't used directly for transactions—it's used to build replicated logs and leader election. These become the foundation for distributed databases, configuration stores, and coordination services. You can even build atomic commit ON TOP of consensus (Paxos Commit).
Paxos, invented by Leslie Lamport in 1989, is the theoretical foundation of distributed consensus. Despite its reputation for complexity, the core protocol is elegant.
Paxos Roles:

- Proposers: propose values and drive the protocol forward.
- Acceptors: vote on proposals; a value is chosen once a majority of acceptors accept it.
- Learners: learn the chosen value without taking part in voting.

In practice, a single server usually plays all three roles.
The Key Insight: Proposal Numbers
Paxos uses monotonically increasing proposal numbers to order operations. A proposal with a higher number supersedes lower-numbered proposals. This creates a total ordering without a central coordinator.
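One common way to realize such numbers, shown here as an illustrative sketch rather than a scheme mandated by Paxos itself, is to pair a locally incremented round with a unique server ID and compare the pairs lexicographically:

```typescript
// A proposal number as (round, serverId): globally unique and totally ordered.
interface ProposalNumber {
  round: number;     // incremented locally each time a proposer retries
  serverId: number;  // breaks ties between proposers in the same round
}

function compare(a: ProposalNumber, b: ProposalNumber): number {
  if (a.round !== b.round) return a.round - b.round;
  return a.serverId - b.serverId;
}

// To outbid a proposal it has seen, a proposer bumps its round past it.
function nextProposal(seen: ProposalNumber, serverId: number): ProposalNumber {
  return { round: seen.round + 1, serverId };
}
```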
Basic Paxos (Single-Decree):
Phase 1: Prepare

- The proposer picks a proposal number n (higher than any it's seen)
- It sends Prepare(n) to a majority of acceptors
- Each acceptor that has not already promised a higher number replies with a promise to ignore lower-numbered proposals, along with any value it has already accepted

Phase 2: Accept

- If a majority of acceptors promise, the proposer sends Accept(n, value) to a majority
- If any acceptor reported a previously accepted value, the proposer must propose the value from the highest-numbered accepted proposal
- The value is chosen once a majority of acceptors accept it
```typescript
// Basic Paxos (Single-Decree) Implementation

interface PrepareResponse {
  promised: boolean;
  acceptedProposal?: number;
  acceptedValue?: string;
}

class PaxosAcceptor {
  private promisedNumber: number = 0;
  private acceptedNumber: number = 0;
  private acceptedValue: string | null = null;

  // Phase 1: Respond to Prepare
  prepare(proposalNumber: number): PrepareResponse {
    if (proposalNumber > this.promisedNumber) {
      this.promisedNumber = proposalNumber;
      // Report any previously accepted proposal so the proposer can adopt it
      if (this.acceptedValue !== null) {
        return {
          promised: true,
          acceptedProposal: this.acceptedNumber,
          acceptedValue: this.acceptedValue
        };
      }
      return { promised: true };
    }
    return { promised: false };
  }

  // Phase 2: Respond to Accept
  accept(proposalNumber: number, value: string): boolean {
    if (proposalNumber >= this.promisedNumber) {
      this.promisedNumber = proposalNumber;
      this.acceptedNumber = proposalNumber;
      this.acceptedValue = value;
      return true;
    }
    return false;
  }
}

class PaxosProposer {
  private serverId: number;
  private acceptors: PaxosAcceptor[];
  private quorumSize: number;

  constructor(serverId: number, acceptors: PaxosAcceptor[]) {
    this.serverId = serverId;
    this.acceptors = acceptors;
    this.quorumSize = Math.floor(acceptors.length / 2) + 1;
  }

  async propose(value: string): Promise<string | null> {
    // Generate a unique proposal number (timestamp + serverId)
    const proposalNumber = this.generateProposalNumber();

    // ====== PHASE 1: PREPARE ======
    const responses = await Promise.all(
      this.acceptors.map(a => a.prepare(proposalNumber))
    );
    const successfulPromises = responses.filter(r => r.promised);

    if (successfulPromises.length < this.quorumSize) {
      // Failed to get a majority - retry with a higher number
      return null;
    }

    // Determine the value to propose
    let proposedValue = value;
    const previouslyAccepted = successfulPromises
      .filter(r => r.acceptedProposal !== undefined)
      .sort((a, b) => b.acceptedProposal! - a.acceptedProposal!);

    if (previouslyAccepted.length > 0) {
      // MUST use the value from the highest-numbered accepted proposal
      // This is how Paxos ensures safety
      proposedValue = previouslyAccepted[0].acceptedValue!;
    }

    // ====== PHASE 2: ACCEPT ======
    const accepts = await Promise.all(
      this.acceptors.map(a => a.accept(proposalNumber, proposedValue))
    );
    const acceptCount = accepts.filter(a => a).length;

    if (acceptCount >= this.quorumSize) {
      return proposedValue; // Consensus achieved!
    }
    return null; // Failed - retry
  }

  private generateProposalNumber(): number {
    // Proposal numbers must be globally unique and increasing:
    // combine a timestamp with the server ID
    return Date.now() * 1000 + this.serverId;
  }
}
```

Basic Paxos agrees on a SINGLE value. Real systems need Multi-Paxos to agree on a sequence of values (a log). Multi-Paxos adds leader election, log indexing, and recovery—all with subtle corner cases. This complexity is why Raft was created.
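Before moving on, here is a hypothetical driver showing how the classes above fit together. The cluster size and values are made up; the point is that once a value is accepted by a majority, a later proposer is forced to adopt it:

```typescript
async function demoBasicPaxos(): Promise<void> {
  // Three acceptors -> a quorum is any two of them
  const acceptors = [new PaxosAcceptor(), new PaxosAcceptor(), new PaxosAcceptor()];

  const proposerA = new PaxosProposer(1, acceptors);
  const proposerB = new PaxosProposer(2, acceptors);

  // Proposer A gets its value chosen first
  const first = await proposerA.propose('value-from-A');
  console.log('A decided:', first); // 'value-from-A'

  // Proposer B proposes something else, but Phase 1 reveals the already
  // accepted value, so B must adopt it and the decision never changes
  const second = await proposerB.propose('value-from-B');
  console.log('B decided:', second); // still 'value-from-A'
}

demoBasicPaxos();
```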
Raft was designed in 2013 by Diego Ongaro and John Ousterhout with one goal: make consensus understandable. It achieves the same guarantees as Multi-Paxos but with a cleaner decomposition.
Raft's Key Simplifications:

- Strong leader: log entries flow only from the leader to followers, never the other way around.
- Problem decomposition: the protocol is split into leader election, log replication, and safety, which can be understood separately.
- Terms as a logical clock: time is divided into numbered terms, and each term has at most one leader.
- Randomized election timeouts: plain randomness resolves split votes without extra protocol machinery.
Raft Node States:
```
FOLLOWER ──timeout──→ CANDIDATE ──wins election──→ LEADER
   ▲                      │                          │
   │                      │ loses                    │
   │                      ▼                          │
   └── discovers higher term ◄───────────────────────┘
```
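The transitions in the diagram can be written as a small pure function. This is an illustrative sketch, not code from the Raft paper: the event names are mine, the snippet redeclares NodeState so it stands alone, and the fuller implementation below folds these transitions into its methods.

```typescript
type NodeState = 'FOLLOWER' | 'CANDIDATE' | 'LEADER';

type RaftEvent =
  | 'ELECTION_TIMEOUT'         // no heartbeat seen -> start (or restart) an election
  | 'WON_ELECTION'             // received votes from a majority
  | 'DISCOVERED_HIGHER_TERM';  // saw a message carrying a newer term

function nextState(state: NodeState, event: RaftEvent): NodeState {
  switch (event) {
    case 'ELECTION_TIMEOUT':
      // Followers and candidates start a new election; a leader keeps leading
      return state === 'LEADER' ? 'LEADER' : 'CANDIDATE';
    case 'WON_ELECTION':
      return state === 'CANDIDATE' ? 'LEADER' : state;
    case 'DISCOVERED_HIGHER_TERM':
      // Any node that sees a higher term immediately steps down
      return 'FOLLOWER';
  }
}
```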
```typescript
// Raft Protocol Implementation (simplified)

type NodeState = 'FOLLOWER' | 'CANDIDATE' | 'LEADER';

interface LogEntry {
  term: number;
  index: number;
  command: string;
}

interface RaftNode {
  id: string;
  state: NodeState;
  currentTerm: number;
  votedFor: string | null;
  log: LogEntry[];
  commitIndex: number;
  lastApplied: number;
}

class RaftServer implements RaftNode {
  id: string;
  state: NodeState = 'FOLLOWER';
  currentTerm: number = 0;
  votedFor: string | null = null;
  log: LogEntry[] = [];
  commitIndex: number = 0;
  lastApplied: number = 0;

  private peers: RaftServer[] = [];
  private electionTimeout: NodeJS.Timeout | null = null;
  private heartbeatInterval: NodeJS.Timeout | null = null;

  constructor(id: string) {
    this.id = id;
  }

  setPeers(peers: RaftServer[]): void {
    this.peers = peers;
  }

  // ====== LEADER ELECTION ======

  startElectionTimer(): void {
    // Randomized timeout: 150-300ms (prevents split votes)
    const timeout = 150 + Math.random() * 150;
    this.electionTimeout = setTimeout(() => {
      this.startElection();
    }, timeout);
  }

  resetElectionTimer(): void {
    if (this.electionTimeout) clearTimeout(this.electionTimeout);
    this.startElectionTimer();
  }

  async startElection(): Promise<void> {
    this.state = 'CANDIDATE';
    this.currentTerm++;
    this.votedFor = this.id;
    console.log(`[${this.id}] Starting election for term ${this.currentTerm}`);

    const votes = await Promise.all(
      this.peers.map(peer => this.requestVote(peer))
    );
    const voteCount = votes.filter(v => v).length + 1; // +1 for self
    const majority = Math.floor((this.peers.length + 1) / 2) + 1;

    if (voteCount >= majority && this.state === 'CANDIDATE') {
      this.becomeLeader();
    } else {
      this.state = 'FOLLOWER';
      this.startElectionTimer();
    }
  }

  async requestVote(peer: RaftServer): Promise<boolean> {
    const lastLogIndex = this.log.length - 1;
    const lastLogTerm = this.log[lastLogIndex]?.term ?? 0;

    // Peer grants the vote if:
    // 1. Candidate's term >= peer's term
    // 2. Peer hasn't voted for someone else this term
    // 3. Candidate's log is at least as up-to-date
    return peer.handleVoteRequest(
      this.currentTerm,
      this.id,
      lastLogIndex,
      lastLogTerm
    );
  }

  handleVoteRequest(
    term: number,
    candidateId: string,
    lastLogIndex: number,
    lastLogTerm: number
  ): boolean {
    if (term < this.currentTerm) return false;

    if (term > this.currentTerm) {
      this.currentTerm = term;
      this.votedFor = null;
      this.state = 'FOLLOWER';
    }

    const myLastIndex = this.log.length - 1;
    const myLastTerm = this.log[myLastIndex]?.term ?? 0;
    const logOk =
      lastLogTerm > myLastTerm ||
      (lastLogTerm === myLastTerm && lastLogIndex >= myLastIndex);

    if ((this.votedFor === null || this.votedFor === candidateId) && logOk) {
      this.votedFor = candidateId;
      this.resetElectionTimer();
      return true;
    }
    return false;
  }

  // ====== LOG REPLICATION ======

  becomeLeader(): void {
    this.state = 'LEADER';
    console.log(`[${this.id}] Became leader for term ${this.currentTerm}`);

    // Send an initial empty heartbeat, then heartbeat every 50ms
    this.sendHeartbeats();
    this.heartbeatInterval = setInterval(() => {
      this.sendHeartbeats();
    }, 50);
  }

  private sendHeartbeats(): void {
    // Heartbeats are empty AppendEntries calls that reset follower timers
    this.peers.forEach(peer => peer.resetElectionTimer());
  }

  private async replicateToFollower(peer: RaftServer, entry: LogEntry): Promise<boolean> {
    // Simplified: a real implementation sends AppendEntries with
    // prevLogIndex/prevLogTerm consistency checks and retries on mismatch
    if (entry.term < peer.currentTerm) return false;
    peer.log[entry.index] = entry;
    return true;
  }

  async appendEntry(command: string): Promise<boolean> {
    if (this.state !== 'LEADER') {
      throw new Error('Not the leader');
    }

    const entry: LogEntry = {
      term: this.currentTerm,
      index: this.log.length,
      command
    };
    this.log.push(entry);

    // Replicate to followers
    const results = await Promise.all(
      this.peers.map(peer => this.replicateToFollower(peer, entry))
    );
    const successCount = results.filter(r => r).length + 1; // +1 for self
    const majority = Math.floor((this.peers.length + 1) / 2) + 1;

    if (successCount >= majority) {
      this.commitIndex = entry.index;
      return true;
    }
    return false;
  }
}
```

Consensus protocols power critical infrastructure across the industry:
| System | Protocol | Use Case |
|---|---|---|
| etcd | Raft | Kubernetes configuration store |
| Consul | Raft | Service discovery, KV store |
| CockroachDB | Raft | Distributed SQL replication |
| TiDB (TiKV) | Raft | Distributed KV storage layer |
| Google Spanner | Paxos | Global database replication |
| Apache ZooKeeper | ZAB (Paxos-like) | Distributed coordination |
| MongoDB | Raft-based (since 3.2) | Replica set elections |
You can build atomic commit on top of consensus: run one Paxos instance per participant to decide that participant's vote (prepared or abort), and derive the global outcome deterministically: the transaction commits only if every instance chooses 'prepared'. This is Gray and Lamport's 'Paxos Commit', and it provides non-blocking atomic commitment that stays safe under partitions and makes progress as long as a majority of acceptors remains reachable.
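A sketch of just the decision rule, leaving out the per-participant Paxos instances themselves and using made-up type names:

```typescript
type Vote = 'PREPARED' | 'ABORT';

// In Paxos Commit, each participant's vote is itself decided by a Paxos
// instance, so every node learns the same array of decided votes. The global
// outcome is then a pure function of those votes.
function globalOutcome(decidedVotes: Vote[]): 'COMMIT' | 'ABORT' {
  return decidedVotes.every(v => v === 'PREPARED') ? 'COMMIT' : 'ABORT';
}

// Example: three participants
console.log(globalOutcome(['PREPARED', 'PREPARED', 'PREPARED'])); // COMMIT
console.log(globalOutcome(['PREPARED', 'ABORT', 'PREPARED']));    // ABORT
```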
What's Next:
The next page explores asynchronous replication patterns—approaches that sacrifice strong consistency for better performance and availability. We'll cover primary-replica replication, multi-primary replication, and the trade-offs each approach makes.
You now understand consensus protocols—the foundation of modern distributed coordination. From Paxos's theoretical elegance to Raft's practical clarity, these algorithms enable the distributed systems that power today's infrastructure.