When you swipe your credit card at a coffee shop, that transaction must be recorded correctly—once and only once—before it's visible to your banking app. When you post a tweet, it must appear on your profile before your followers can see it in their feeds. When you update your profile picture on a social network, every server around the world eventually needs that update, but there can be no confusion about which picture is the 'real' one.
These scenarios share a fundamental challenge: how do we replicate data across multiple database servers while maintaining consistency?
The answer that has powered the majority of production database systems for decades is surprisingly elegant: designate a single node as the leader (also called the primary or master), and ensure that all writes flow through this one authoritative source.
By the end of this page, you will understand why single-leader replication is the dominant paradigm in database systems, how it guarantees write consistency, the architectural patterns behind leader election and write routing, and when this model is the right (or wrong) choice for your system.
At its core, leader-follower replication (also known as primary-replica, master-slave, or active-passive replication) follows a deceptively simple rule:
All write operations must be processed by exactly one node—the leader.
This single constraint solves one of the most complex problems in distributed systems: write conflicts. When multiple nodes can accept writes independently, you face the nightmare of concurrent, conflicting updates. Did User A's edit or User B's edit happen first? What if they edited different fields of the same record? What if network partitions mean neither node knows about the other's write?
By funneling all writes through a single leader, we sidestep these questions entirely. The leader sees every write in sequence, applies them in order, and produces a single, authoritative history of changes. This history—called the replication log, write-ahead log (WAL), or binlog depending on the database—becomes the source of truth that all other nodes follow.
The 'single' in single-leader isn't a limitation—it's a feature. By accepting a single point of write coordination, we gain something invaluable: a total ordering of all writes. This makes reasoning about consistency vastly simpler than multi-leader or leaderless alternatives.
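To make the total-ordering idea concrete, here is a minimal Python sketch (hypothetical `Leader` and `Follower` classes, not any real database's API): every write gets a global sequence number, and any follower that replays the log in order converges to exactly the same state, with no conflict resolution needed.

```python
import itertools

class Leader:
    """Accepts all writes and assigns each a global sequence number."""
    def __init__(self):
        self._seq = itertools.count(1)
        self.log = []  # append-only replication log: (seq, key, value)

    def write(self, key, value):
        entry = (next(self._seq), key, value)
        self.log.append(entry)
        return entry

class Follower:
    """Replays the leader's log in order; never accepts writes directly."""
    def __init__(self):
        self.data = {}
        self.applied_up_to = 0

    def replay(self, log):
        for seq, key, value in log:
            if seq > self.applied_up_to:  # skip entries already applied
                self.data[key] = value
                self.applied_up_to = seq

leader = Leader()
leader.write("alice", 900)
leader.write("alice", 800)  # later write wins: the total order removes ambiguity

f1, f2 = Follower(), Follower()
f1.replay(leader.log)
f2.replay(leader.log)
assert f1.data == f2.data == {"alice": 800}
```

Because every entry carries a sequence number, replaying the log is idempotent: a follower can safely re-read the log from the beginning and skip what it has already applied.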
To truly understand single-leader replication, let's trace the journey of a single write operation from client to durable storage across all replicas. This flow is the heartbeat of every leader-follower database, from PostgreSQL streaming replication to MySQL master-slave setups to MongoDB replica sets.
```
┌──────────────────────────────────────────────────────────────────────┐
│ CLIENT APPLICATION                                                   │
│                                                                      │
│ UPDATE users SET balance = balance - 100 WHERE user_id = 'alice';    │
└───────────────────────────────────┬──────────────────────────────────┘
                                    │
                                    ▼  (1) Client sends write to Leader
┌──────────────────────────────────────────────────────────────────────┐
│ LEADER (PRIMARY)                                                     │
│                                                                      │
│ ┌──────────────────┐   ┌───────────────────┐   ┌─────────────────┐   │
│ │ (2) Validate &   │──▶│ (3) Write to WAL  │──▶│ (4) Apply to    │   │
│ │     Parse Query  │   │     (durable log) │   │     Data Files  │   │
│ └──────────────────┘   └─────────┬─────────┘   └─────────────────┘   │
│                                  │                                   │
│                                  ▼                                   │
│                      ┌──────────────────────┐                        │
│                      │ (5) Acknowledge      │  (synchronous mode)    │
│                      │     or Stream        │  (asynchronous mode)   │
│                      │     to Followers     │                        │
│                      └──────────┬───────────┘                        │
└─────────────────────────────────┼────────────────────────────────────┘
                                  │
                ┌─────────────────┼─────────────────┐
                │                 │                 │
                ▼                 ▼                 ▼
        ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
        │  FOLLOWER 1   │ │  FOLLOWER 2   │ │  FOLLOWER 3   │
        │               │ │               │ │               │
        │ (6) Receive   │ │ (6) Receive   │ │ (6) Receive   │
        │     WAL Entry │ │     WAL Entry │ │     WAL Entry │
        │               │ │               │ │               │
        │ (7) Apply to  │ │ (7) Apply to  │ │ (7) Apply to  │
        │     Local DB  │ │     Local DB  │ │     Local DB  │
        └───────────────┘ └───────────────┘ └───────────────┘
```

Step-by-Step Breakdown:
Step 1: Client Routes Write to Leader
The client application (or a connection proxy) must route the write to the current leader. This requires knowing which node is the leader—typically determined by querying the database cluster, using a service discovery mechanism, or through a load balancer that routes writes to the primary.

Step 2: Query Validation and Parsing
The leader validates the write—checking permissions, parsing the SQL, and planning the execution. This is identical to a single-node database.

Step 3: Write-Ahead Log (WAL)
Before modifying any data files, the leader writes the change to a durable, append-only log. This is the critical durability guarantee: if the server crashes after the WAL write but before the data file update, recovery replays the WAL to restore consistency.

Step 4: Apply to Data Files
The leader applies the change to the actual data files (tables, indexes). At this point, the change is visible to local queries.

Step 5: Propagate to Followers
The leader streams the WAL entry to followers. The timing of this step relative to client acknowledgment determines whether replication is synchronous or asynchronous (covered in detail in Page 3).

Steps 6-7: Follower Application
Each follower receives the WAL entry, writes it to its own log, and applies it to its local data files. The follower's state converges with the leader's.
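The acknowledgment timing in step 5 can be sketched in a few lines of Python (a hypothetical `handle_write` helper; real databases pipeline and batch this heavily):

```python
def handle_write(entry, leader_wal, followers, mode="async"):
    """Sketch of step 5: when does the client get its acknowledgment?

    async -> ack right after the leader's own durable WAL write; followers
             catch up in the background (fast, but a leader crash can lose
             a write the client believed was committed)
    sync  -> ack only after at least one follower has the entry
             (slower, but the write survives loss of the leader)
    """
    leader_wal.append(entry)                  # steps 3-4 on the leader
    if mode == "sync":
        followers[0].append(entry)            # wait for one follower first
        ack = "acked-after-follower"
    else:
        ack = "acked-after-leader-wal"        # followers catch up later
    for f in followers:
        if entry not in f:
            f.append(entry)                   # background replication
    return ack

f1, f2 = [], []
assert handle_write("w1", [], [f1, f2], mode="sync") == "acked-after-follower"
assert handle_write("w2", [], [f1, f2], mode="async") == "acked-after-leader-wal"
assert f1 == f2 == ["w1", "w2"]
```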
The write-ahead log is the single source of truth in replication. It's an append-only, sequential record of every change. Both durability (surviving crashes) and replication (synchronizing followers) depend on the WAL being correct and complete. Corrupting the WAL means corrupting the entire cluster's consistency.
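The crash-recovery guarantee can be sketched with a toy `Database` class (a Python list stands in for an fsync'd log file; none of this is a real database's API): if the server dies after the WAL append but before the data-file update, replaying the log on restart restores the lost change.

```python
class Database:
    """Minimal WAL sketch: log first, apply second, replay on recovery."""
    def __init__(self):
        self.wal = []    # durable, append-only (stands in for an fsync'd file)
        self.data = {}   # the "data files"

    def write(self, key, value, crash_before_apply=False):
        self.wal.append((key, value))          # step 1: durable log record
        if crash_before_apply:
            raise RuntimeError("crash!")       # simulated crash mid-write
        self.data[key] = value                 # step 2: apply to data files

    def recover(self):
        # Rebuild state purely from the log, exactly as crash recovery does.
        self.data = {}
        for key, value in self.wal:
            self.data[key] = value

db = Database()
db.write("balance:alice", 900)
try:
    db.write("balance:alice", 800, crash_before_apply=True)
except RuntimeError:
    pass                                       # the apply never happened...
db.recover()                                   # ...but WAL replay restores it
assert db.data["balance:alice"] == 800
```

Note that the same log serves double duty: locally it is the recovery mechanism, and shipped over the network it is the replication stream the followers consume.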
The elegance of single-leader replication emerges from how it transforms a complex distributed systems problem into a much simpler sequential one. Let's examine the properties that make this model so effective.
| Aspect | Single-Leader | Multi-Leader | Leaderless |
|---|---|---|---|
| Write Conflicts | Impossible by design | Must be resolved | Must be resolved |
| Write Latency | One round-trip to leader | One round-trip to nearest leader | Quorum-dependent |
| Write Throughput | Limited by single node | Higher (multiple writers) | Higher (any node accepts) |
| Consistency | Strong (leader) or eventual (followers) | Eventual with conflicts | Eventual with conflicts |
| Complexity | Low | High (conflict resolution) | High (quorum logic) |
| CAP Tradeoff | CP-leaning | AP-leaning | Tunable |
The Sequential Abstraction:
Perhaps the most powerful aspect of single-leader replication is that it allows developers to reason about the database as if it were a single machine. Despite running on multiple servers across multiple data centers, the write path behaves like a single-threaded program processing one write at a time.
This abstraction is a lie, of course—modern databases use extensive concurrency and batching. But it's a useful lie that dramatically simplifies application development. You can use familiar transaction semantics, ACID properties, and isolation levels without worrying about distributed consensus on every operation.
Don't underestimate the value of a simple consistency model. Multi-leader and leaderless systems offer higher write availability, but at the cost of conflict resolution complexity. Most applications don't need that complexity—single-leader covers 80%+ of use cases more reliably.
A single-leader system requires more than just routing writes to a leader—it requires a robust mechanism for determining which node is the leader and making that information available to clients and other nodes. This is the domain of leader election and service discovery.
The Leader Election Problem:
At any given moment, exactly one node should be the leader. If zero nodes are the leader, the system can't accept writes (unavailable). If two or more nodes believe they're the leader, we have a split-brain scenario—both accept writes, and the system loses consistency.
Leader election must solve several problems at once: detecting that the current leader has failed (not merely slow or partitioned), getting the surviving nodes to agree on exactly one new leader, preventing a deposed leader from continuing to accept writes, and informing clients of the change.
```
   INITIAL STATE            LEADER FAILURE           NEW LEADER ELECTED
  ──────────────────       ──────────────────       ──────────────────
  ┌─────────────┐          ┌─────────────┐          ┌─────────────┐
  │   Node A    │          │   Node A    │          │   Node A    │
  │  (Leader)   │◀─ writes │    ✗✗✗      │  ─────▶  │ (Follower)  │
  │   ★ ★ ★     │          │   FAILED    │          │             │
  └─────────────┘          └─────────────┘          └─────────────┘

  ┌─────────────┐          ┌─────────────┐          ┌─────────────┐
  │   Node B    │          │   Node B    │ promoted!│   Node B    │
  │ (Follower)  │          │ (Follower)  │ ───────▶ │  (Leader)   │◀─ writes
  │             │          │             │          │   ★ ★ ★     │
  └─────────────┘          └─────────────┘          └─────────────┘

  ┌─────────────┐          ┌─────────────┐          ┌─────────────┐
  │   Node C    │          │   Node C    │          │   Node C    │
  │ (Follower)  │          │ (Follower)  │          │ (Follower)  │
  └─────────────┘          └─────────────┘          └─────────────┘

  Heartbeats flow          Nodes B and C detect     Consensus reached:
  Leader → Followers       missing heartbeats       Node B is new leader
```

Common Leader Election Approaches:
1. External Consensus Systems
Many databases delegate leader election to purpose-built distributed consensus systems like ZooKeeper, etcd, or Consul. These systems implement consensus algorithms (Raft, Paxos, ZAB) and provide primitives like distributed locks and leader election APIs.
Example: Apache Kafka (older versions) uses ZooKeeper for controller election. Clients query ZooKeeper to find the current controller.
2. Built-in Consensus
Modern databases increasingly embed consensus algorithms directly. Patroni (PostgreSQL's high-availability framework) still delegates to etcd or Consul, but systems like CockroachDB and TiDB implement Raft internally.
Example: CockroachDB uses Raft for consensus at the range (partition) level—each range has its own Raft group and leader.
3. Heartbeat + Lease-Based Election
The leader periodically renews a lease (a time-limited lock). If the lease expires, followers can attempt to acquire it. This is simpler but relies on clock synchronization.
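A minimal sketch of lease-based election (a toy `LeaseStore` with an injected clock, standing in for a record in a system like etcd or Consul; the clock-synchronization caveat above is exactly why `now` must be trustworthy):

```python
class LeaseStore:
    """A single shared lease record: whoever holds it is the leader."""
    def __init__(self, ttl):
        self.ttl = ttl          # lease lifetime in seconds
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node, now):
        # A node may take (or renew) the lease only if it is free,
        # expired, or already held by that same node.
        if self.holder is None or now >= self.expires_at or self.holder == node:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False

store = LeaseStore(ttl=5.0)
assert store.try_acquire("node-a", now=0.0)      # node A becomes leader
assert not store.try_acquire("node-b", now=2.0)  # lease still valid -> rejected
assert store.try_acquire("node-a", now=4.0)      # leader renews before expiry
assert store.try_acquire("node-b", now=20.0)     # lease expired -> failover
```

The fragility is visible in the last line: if node A's clock runs slow, it may believe it still holds the lease after node B has legitimately acquired it, which is one road to split-brain.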
Example: MongoDB replica sets use a Raft-like election protocol based on heartbeats and election terms.
Split-brain occurs when two nodes both believe they're the leader and accept conflicting writes. This can corrupt data irreparably. Proper leader election requires consensus (not just timeouts) and fencing mechanisms (like STONITH in HA clusters) to guarantee only one leader operates at a time.
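One standard fencing technique is a monotonically increasing token issued with each new leadership term: the storage layer rejects any write carrying an older token, so a deposed "zombie" leader cannot corrupt data. A minimal sketch (hypothetical `FencedStorage` class):

```python
class FencedStorage:
    """Rejects writes that carry a stale fencing token (an old leader)."""
    def __init__(self):
        self.highest_token = 0   # highest leadership term seen so far
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            raise PermissionError(f"stale leader token {token}")
        self.highest_token = token
        self.data[key] = value

storage = FencedStorage()
storage.write(1, "x", "from-old-leader")   # term-1 leader writes normally
storage.write(2, "x", "from-new-leader")   # failover: term-2 leader takes over
try:
    storage.write(1, "x", "zombie write")  # old leader wakes up -> fenced off
except PermissionError:
    pass
assert storage.data["x"] == "from-new-leader"
```

The key property is that the check happens at the storage layer, not in the leader itself: a node that wrongly believes it is still leader cannot bypass it.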
Once a leader exists, clients must route writes to it. This sounds simple but becomes complex in production environments with connection pooling, load balancers, and dynamic cluster membership. Let's examine the primary patterns for write routing.
DNS-Based Routing: A DNS record (e.g., primary.db.internal) always points to the current leader. On failover, update the DNS record. Relies on low TTLs and proper cache invalidation.
Used by: AWS RDS and many cloud-managed databases.

| Pattern | Failover Speed | Client Complexity | Operational Overhead | Best For |
|---|---|---|---|---|
| Direct Connection | Slow (app restart) | High | Low | Simple setups, dev environments |
| DNS-Based | Medium (TTL-bound) | Low | Medium | Cloud-managed databases |
| Proxy-Based | Fast (proxy-aware) | None | High | High-traffic production systems |
| Smart Driver | Fast (native) | Low (driver handles) | Low | Modern database clusters |
| Service Mesh | Fast | None | Very High | Kubernetes-native environments |
Proxies add latency (typically 1-5ms) and operational complexity, but they provide transparent failover, connection pooling, and the ability to implement read/write splitting without application changes. For high-traffic systems, the benefits usually outweigh the costs.
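A proxy's read/write splitting logic can be sketched as follows (a toy `ReadWriteRouter`; real proxies such as ProxySQL are far more sophisticated, handling transactions, prepared statements, and replication-lag awareness):

```python
import itertools

class ReadWriteRouter:
    """Sends writes to the current primary; round-robins reads to replicas."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in ("INSERT", "UPDATE", "DELETE"):
            return self.primary        # all writes go to the leader
        return next(self._replicas)    # reads are spread across followers

    def promote(self, new_primary):
        # On failover, the proxy repoints writes with no application change.
        self.primary = new_primary

router = ReadWriteRouter("db-1:5432", ["db-2:5432", "db-3:5432"])
assert router.route("UPDATE users SET balance = 0") == "db-1:5432"
assert router.route("SELECT * FROM users") == "db-2:5432"
router.promote("db-2:5432")
assert router.route("INSERT INTO users VALUES (1)") == "db-2:5432"
```

This is exactly the transparent failover benefit described above: the application keeps one connection string (the proxy), while `promote` absorbs the leadership change.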
Leader-follower replication isn't just theory—it's the operational backbone of virtually every major database system. Let's examine how different databases implement the single-leader model.
- PostgreSQL (streaming replication): the leader ships WAL records to standbys; synchronous_commit = on/remote_apply/remote_write/local/off controls how far replication must progress before a commit is acknowledged.
- MySQL: the leader records writes in the binlog, which replicas replay; Group Replication provides built-in elections.
- MongoDB (replica sets): the oplog—a capped collection in the local database containing all write operations—drives replication. Read preferences (primary, primaryPreferred, secondary, secondaryPreferred, nearest) control read routing, and write concerns (w: 1 for leader only, w: majority, w: <number>, w: <tag set>) control acknowledgment.

Despite surface differences (WAL vs. binlog vs. oplog, Patroni vs. Group Replication vs. built-in elections), all these systems implement the same fundamental model: one leader writes, followers replicate, consensus determines leadership. Understanding the pattern lets you adapt to any specific implementation.
Single-leader replication is powerful and widely applicable, but it's not without costs. Understanding its limitations helps you know when to use it—and when to consider alternatives.
When Single-Leader is the Right Choice:
✅ Your write throughput fits within a single node's capacity
✅ You need strong consistency without conflict resolution complexity
✅ Your users are geographically concentrated (or latency is acceptable)
✅ Read scaling is more important than write scaling
✅ You value simplicity and operational familiarity
When to Consider Alternatives:
⚠️ Write throughput exceeds single-node limits (consider sharding first, then multi-leader)
⚠️ You have globally distributed users requiring low-latency writes in multiple regions
⚠️ Your availability requirements can't tolerate any failover downtime
⚠️ Your data model naturally supports conflict resolution (e.g., CRDTs)
Many engineers jump to multi-leader or leaderless systems prematurely. Before abandoning single-leader, ask: Can we shard the data? Can we tolerate the failover window? Is conflict resolution complexity worth the benefit? Usually, single-leader with proper sharding handles far more scale than expected.
We've established the foundational principle of leader-follower replication: funneling all writes through a single node to achieve consistency without conflict resolution. Let's consolidate the key insights:

- All writes flow through one leader, producing a total ordering of changes in the replication log.
- The write-ahead log serves double duty: local crash recovery and the replication stream followers consume.
- Leader election (external consensus, built-in consensus, or leases) must guarantee exactly one leader, with fencing to prevent split-brain.
- Write routing (DNS, proxies, smart drivers) delivers writes to whichever node is currently the leader.
- Single-leader trades write scalability and failover downtime for simplicity and strong consistency.
What's Next:
Now that we understand how writes flow through the leader, the next page examines the other side of the replication equation: how followers replicate from the leader. We'll explore replication protocols, log streaming, log shipping, and how followers stay in sync with the leader's state.
You now understand the core principle of single-leader replication: one node accepts all writes, creating a total ordering that simplifies distributed consistency. Next, we'll see how followers receive and apply these writes to maintain synchronized copies of the data.