Knowing that you need replication is only the beginning. The critical question is: how should replicas be organized and coordinated? Should there be a single node accepting writes? Multiple? Should replicas be daisy-chained for efficiency or fully meshed for resilience? Should consensus protocols govern every write?

These questions define your replication strategy—the architectural pattern that determines how your distributed database behaves under normal operation and during failures. Each strategy makes specific trade-offs between consistency, availability, latency, and operational complexity.

Choosing the wrong strategy for your workload leads to systems that are either too slow (over-synchronized), too fragile (under-replicated), or too complex to operate (needlessly sophisticated). This page equips you to make informed architectural decisions.
By the end of this page, you will understand the major replication strategies—primary-replica, multi-primary, chain replication, and consensus-based replication. You'll learn the mechanics, trade-offs, failure modes, and ideal use cases for each strategy, enabling you to select and configure the right approach for your system's requirements.
The primary-replica strategy (also called leader-follower or master-slave) is the most common replication pattern. A single designated node—the primary—handles all write operations. One or more replica nodes receive copies of these writes and serve read traffic.
How It Works:

1. All writes go to the primary: The primary is the single source of truth for data modifications.
2. Primary logs changes: Writes are recorded in a transaction log (WAL, binlog, oplog).
3. Replicas consume the log: Replicas connect to the primary and continuously receive log entries.
4. Replicas apply changes: Each replica applies log entries to its local data, eventually matching the primary.
5. Reads are distributed: Read queries can be served by any replica (with eventual consistency) or by the primary (for strong consistency).

Conflict Resolution: None Required

Because only the primary accepts writes, there are no write conflicts. This is the fundamental simplification that makes primary-replica replication operationally manageable.
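To make the log-shipping loop concrete, here is a minimal sketch in Python. The class names (`Primary`, `Replica`) and the in-memory log are illustrative assumptions, not any particular database's API; real systems stream WAL/binlog/oplog entries over a network connection.

```python
class Primary:
    """Accepts all writes and appends them to an ordered transaction log."""
    def __init__(self):
        self.data = {}
        self.log = []          # ordered log of (index, key, value) entries

    def write(self, key, value):
        self.data[key] = value
        self.log.append((len(self.log), key, value))


class Replica:
    """Consumes the primary's log and applies entries in order."""
    def __init__(self):
        self.data = {}
        self.applied_index = -1   # last log index applied locally

    def catch_up(self, primary):
        # Pull any log entries not yet applied; replication lag is simply
        # the gap between the primary's log end and applied_index.
        for index, key, value in primary.log[self.applied_index + 1:]:
            self.data[key] = value
            self.applied_index = index


primary, replica = Primary(), Replica()
primary.write("user:1", {"name": "Ada"})
replica.catch_up(primary)          # replica now serves eventually consistent reads
assert replica.data == primary.data
```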
Failover Mechanics:

When the primary fails, a replica must be promoted:

1. Detection: Monitoring detects primary unresponsiveness (heartbeat failure, connection timeout).
2. Selection: A replica is chosen for promotion (most up-to-date, highest priority, consensus among replicas).
3. Promotion: The selected replica stops receiving replication and begins accepting writes.
4. Reconfiguration: Other replicas redirect to the new primary. Clients update connection endpoints.
5. Old primary handling: If the old primary recovers, it must become a replica (cannot accept writes).

Automatic vs. Manual Failover:
- Automatic: Tools like PostgreSQL Patroni, MySQL Orchestrator, or MongoDB replica sets handle failover automatically. Faster recovery but risk of split-brain.
- Manual: Operators decide when to failover. Slower but avoids premature promotion during transient issues.
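A hedged sketch of the selection step, assuming each replica reports the last log index it has applied (the field names are illustrative): the most up-to-date reachable replica is promoted, which minimizes lost writes after an asynchronous primary failure.

```python
def choose_new_primary(replicas):
    """Pick the reachable replica that has applied the most log entries.

    `replicas` is assumed to be a list of dicts such as
    {"name": "replica-2", "reachable": True, "applied_index": 1042}.
    """
    candidates = [r for r in replicas if r["reachable"]]
    if not candidates:
        raise RuntimeError("no replica available for promotion")
    return max(candidates, key=lambda r: r["applied_index"])


replicas = [
    {"name": "replica-1", "reachable": True,  "applied_index": 1040},
    {"name": "replica-2", "reachable": True,  "applied_index": 1042},
    {"name": "replica-3", "reachable": False, "applied_index": 1043},  # partitioned away
]
print(choose_new_primary(replicas)["name"])   # replica-2
```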
If a network partition separates the primary from its replicas while both sides remain operational, you risk split-brain: two nodes believe they are the primary and accept different writes. This causes data divergence that is extremely difficult to resolve. Consensus-based leader election (using an odd number of nodes and majority quorums) prevents split-brain.
The multi-primary strategy (also called multi-master or active-active) allows multiple nodes to accept writes simultaneously. Changes made on any primary are replicated to all other primaries. This eliminates the write bottleneck of single-primary but introduces significant complexity.
When Multi-Primary Is Needed:

1. Geographic distribution: Users in multiple regions need low-latency writes. Routing all writes to a single primary in one region adds 100-200ms latency for distant users.

2. High write availability: If the single primary fails, writes are blocked. Multi-primary allows writes to continue on surviving nodes.

3. Write throughput: Write-heavy workloads may exceed single-node capacity.

The Central Challenge: Conflict Resolution

When two primaries accept writes to the same data simultaneously, a conflict occurs. Unlike primary-replica, where conflicts are impossible, multi-primary must detect and resolve conflicts.
Conflict Resolution Strategies:

1. Last-Writer-Wins (LWW)
Use timestamps to determine which write is "later." The later write overwrites the earlier one.
- Pros: Simple, deterministic, no manual intervention
- Cons: Silently loses data; clock synchronization is imperfect; "later" may not mean "correct"

2. Application-Level Resolution
Present both values to the application and let business logic decide.
- Pros: Business-correct resolution; no data loss
- Cons: Application complexity; may require user intervention

3. Merge Functions
Define data type-specific merge operations (e.g., union of sets, merged counters).
- Pros: Automatic, semantically meaningful
- Cons: Only works for mergeable data types; complex implementation

4. CRDTs (Conflict-free Replicated Data Types)
Data structures mathematically designed to merge without conflicts.
- Pros: Guaranteed convergence without coordination
- Cons: Limited data types; not suitable for all use cases

A short code sketch of last-writer-wins and a mergeable counter follows the summary table below.
| Strategy | Data Loss Risk | Complexity | Suitable For |
|---|---|---|---|
| Last-Writer-Wins | High (silent) | Low | Logging, caching, non-critical data |
| Application Resolution | None | High | Business-critical data requiring human judgment |
| Merge Functions | None | Medium | Counters, sets, append-only data |
| CRDTs | None | High | Collaborative editing, distributed counters |
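To ground the first and third strategies, here is a minimal sketch, assuming each replica tags writes with a wall-clock timestamp (for LWW) or keeps a per-node increment count (a grow-only counter merge). The function names are illustrative, not any particular database's API.

```python
def lww_resolve(version_a, version_b):
    """Last-writer-wins: keep whichever write carries the later timestamp.

    Each version is assumed to be a (timestamp, value) pair. Note the silent
    data loss: the losing value is simply discarded.
    """
    return version_a if version_a[0] >= version_b[0] else version_b


def merge_gcounter(counter_a, counter_b):
    """Merge two grow-only counters (a simple mergeable data type / CRDT).

    Each counter maps node_id -> number of increments seen on that node.
    Taking the per-node maximum makes the merge commutative, associative,
    and idempotent, so replicas converge regardless of merge order.
    """
    nodes = set(counter_a) | set(counter_b)
    return {n: max(counter_a.get(n, 0), counter_b.get(n, 0)) for n in nodes}


# Two primaries concurrently update the same profile field.
print(lww_resolve((1700000005, "Ada L."), (1700000009, "Ada Lovelace")))
# -> (1700000009, 'Ada Lovelace'); the earlier write is silently lost.

# Two primaries concurrently increment a page-view counter.
print(merge_gcounter({"node-a": 3, "node-b": 1}, {"node-a": 2, "node-b": 4}))
# -> {'node-a': 3, 'node-b': 4}; total views = 7, nothing lost.
```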
The best conflict resolution is no conflict at all. Partition your data so that different records are owned by different primaries. User profile updates always go to the user's 'home' primary. Order processing happens on the primary where the order was created. This 'sticky routing' model gets multi-primary benefits without multi-primary conflicts.
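A minimal sketch of this 'sticky routing' idea, assuming writes for a record are always sent to a deterministically chosen home primary (the endpoint names and hashing scheme are illustrative):

```python
import hashlib

def home_primary(user_id, primaries):
    """Route all of a user's writes to a fixed 'home' primary.

    Hashing the user id with a stable hash means any two writes to the same
    user always land on the same node, so they can never conflict.
    """
    digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return primaries[digest % len(primaries)]


primaries = ["us-east-primary", "eu-west-primary", "ap-south-primary"]
print(home_primary("user:42", primaries))   # always the same primary for user:42
```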
Chain replication is a specialized strategy that arranges nodes in a linear chain. Writes enter at the head of the chain and propagate through each node until reaching the tail. Reads are served exclusively by the tail. This design provides strong consistency with high throughput.
How Chain Replication Works:

1. Writes enter at the head: The head node receives all write requests.
2. Sequential forwarding: The head forwards to the next node in the chain, which forwards to the next, and so on.
3. Tail acknowledges: Only when the write reaches the tail is it considered committed. The tail sends acknowledgment back to the client.
4. Reads from the tail only: Since the tail has all committed writes, reads from the tail are strongly consistent.
5. No reads from middle nodes: Middle nodes may have uncommitted writes; reading from them would violate consistency.

Why This Works:

The tail is guaranteed to have every committed write. Any write in the chain but not at the tail is "in flight"—not yet committed. By serving reads from the tail exclusively, you get linearizable (strongly consistent) reads without any coordination overhead.
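Here is a minimal sketch of the write path, assuming an in-process list of nodes stands in for the real chain (class and method names are illustrative, and network hops and acknowledgments are collapsed into function calls):

```python
class ChainNode:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.next = None            # successor in the chain; None for the tail

    def write(self, key, value):
        """Apply the write locally, then forward it down the chain.

        Only the tail's application counts as the commit point; the
        acknowledgment conceptually flows back to the client from there.
        """
        self.store[key] = value
        if self.next is not None:
            return self.next.write(key, value)
        return "committed"          # reached the tail


# Build a chain: head -> middle -> tail.
head, middle, tail = ChainNode("head"), ChainNode("middle"), ChainNode("tail")
head.next, middle.next = middle, tail

print(head.write("order:7", "paid"))   # 'committed' once the tail has applied it
print(tail.store["order:7"])           # strongly consistent read from the tail
```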
Failure Handling in Chain Replication:

Head Failure:
The next node in the chain becomes the new head. Pending writes that hadn't propagated past the old head are lost.

Middle Node Failure:
The chain is "spliced"—the predecessor of the failed node connects directly to its successor. In-flight writes must be recovered from the predecessor.

Tail Failure:
The predecessor of the tail becomes the new tail. Some committed writes (acked by the old tail) may need re-acknowledgment.

CRAQ: Chain Replication with Apportioned Queries

A variant called CRAQ (Chain Replication with Apportioned Queries) allows reads from any node, not just the tail. Each node tracks which writes are "clean" (committed) vs. "dirty" (in-flight). Reads of clean data can be served locally; reads of dirty data are forwarded to the tail. This distributes read load while maintaining strong consistency.
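A hedged sketch of the CRAQ idea, assuming each node keeps a per-key dirty flag (the structure is simplified and illustrative): clean keys are served locally, dirty keys are resolved by asking the tail, which only ever holds committed values.

```python
class CraqNode:
    def __init__(self, tail=None):
        self.versions = {}      # key -> (value, is_dirty)
        self.tail = tail        # reference to the tail; None if this node is the tail

    def apply_write(self, key, value):
        # A write passing through a non-tail node is dirty until the tail commits it.
        self.versions[key] = (value, self.tail is not None)

    def mark_clean(self, key):
        value, _ = self.versions[key]
        self.versions[key] = (value, False)

    def read(self, key):
        value, is_dirty = self.versions[key]
        if is_dirty:
            return self.tail.read(key)   # dirty: ask the tail for the committed value
        return value                     # clean: safe to serve locally


tail = CraqNode()
middle = CraqNode(tail=tail)
middle.apply_write("x", 1); tail.apply_write("x", 1)   # write propagating down the chain
print(middle.read("x"))    # dirty at the middle node, so forwarded to the tail -> 1
middle.mark_clean("x")     # commit notification flows back up
print(middle.read("x"))    # now served locally
```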
Chain replication is less common in traditional databases but appears in distributed storage systems. Microsoft Azure Storage uses a variant of chain replication. Apache Kafka's replication model has chain-like properties. Object storage systems benefit from chain replication's bandwidth efficiency.
Consensus-based replication uses distributed consensus protocols (Paxos, Raft, Zab) to ensure that all replicas agree on the order of writes. This provides the strongest consistency guarantees—linearizability—while tolerating failures of a minority of nodes.
The Core Idea:

Before a write is committed, a majority (quorum) of nodes must agree to accept it. This ensures:

- Any two quorums overlap by at least one node
- Committed writes are never lost (at least one surviving node has every commit)
- No two different values can be committed for the same slot

Common Consensus Protocols:

Raft
- Leader-based: a leader handles all client requests
- Log replication: the leader appends entries to its log and replicates them to followers
- Leader election: if the leader fails, a majority elects a new leader
- Used by: etcd, CockroachDB, TiKV, Consul

Paxos
- More general and flexible than Raft
- Separates proposers, acceptors, and learners
- Multi-Paxos optimizes for sequences of values
- Used by: Google Spanner (variant), Chubby

Zab (Zookeeper Atomic Broadcast)
- Optimized for primary-backup with strong ordering
- Used by: Apache Zookeeper, Apache Kafka (for metadata)
Raft Protocol Deep Dive:

Leader Election:
1. Nodes start as followers with randomized election timeouts
2. If a follower receives no heartbeat, it becomes a candidate
3. The candidate requests votes from all nodes
4. The node receiving a majority of votes becomes leader
5. The leader sends periodic heartbeats to maintain authority

Log Replication:
1. Client sends a write to the leader
2. Leader appends the entry to its log with a term number and index
3. Leader sends AppendEntries RPCs to all followers in parallel
4. Followers append the entry to their logs and acknowledge
5. When a majority acknowledge, the leader commits the entry
6. Leader notifies followers to commit

Safety Guarantees:
- Only one leader per term
- Leaders never overwrite entries in their own logs
- Only a candidate whose log contains all committed entries can become leader
- Committed entries are durable across failures
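A minimal sketch of the commit rule in log replication, assuming the leader tracks the highest log index each follower has replicated (names are illustrative, and terms, elections, and retries are omitted):

```python
def leader_commit_index(leader_last_index, follower_match_indexes, cluster_size):
    """Return the highest log index replicated on a strict majority of the cluster.

    The leader counts itself; an entry is committed once it exists on
    cluster_size // 2 + 1 nodes, i.e. a majority.
    """
    majority = cluster_size // 2 + 1
    # Check every index in the leader's log, highest first.
    for index in range(leader_last_index, 0, -1):
        replicas_with_entry = 1 + sum(1 for m in follower_match_indexes if m >= index)
        if replicas_with_entry >= majority:
            return index
    return 0    # nothing committed yet


# 5-node cluster: the leader's log reaches index 7; the four followers have
# replicated up to indexes 7, 6, 3, and 2 respectively.
print(leader_commit_index(7, [7, 6, 3, 2], cluster_size=5))   # -> 6
```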
| Characteristic | Description | Impact |
|---|---|---|
| Fault Tolerance | Tolerates (N-1)/2 failures for N nodes | 3 nodes → 1 failure; 5 nodes → 2 failures |
| Write Latency | Requires majority acknowledgment | Higher than async, similar to sync |
| Consistency | Linearizable (strongest) | No stale reads possible |
| Availability | Available if majority is reachable | CAP: CP system (not AP) |
| Complexity | Complex protocol implementation | Use battle-tested libraries (etcd, Consul) |
Consensus requires a majority quorum. With 3 nodes, majority is 2, tolerating 1 failure. With 5 nodes, majority is 3, tolerating 2 failures. Even-numbered clusters (2, 4, 6) provide no benefit: 4 nodes still only tolerate 1 failure (majority is 3), same as 3 nodes but with more overhead. Always use odd numbers.
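The arithmetic is easy to check with a throwaway sketch (nothing database-specific here, just the majority formula):

```python
def fault_tolerance(n):
    """Failures a consensus group of n nodes can survive: n minus a majority."""
    majority = n // 2 + 1
    return n - majority

for n in range(2, 7):
    print(n, "nodes ->", fault_tolerance(n), "failure(s) tolerated")
# 2 -> 0, 3 -> 1, 4 -> 1, 5 -> 2, 6 -> 2: even sizes add nodes but no extra tolerance.
```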
Quorum-based replication (also called leaderless or Dynamo-style replication) eliminates the leader entirely. Clients can write to and read from any node. Consistency is achieved through quorum arithmetic: if you write to W nodes and read from R nodes, and W + R > N (total nodes), at least one node you read from has the latest write.
The Quorum Formula:

If W + R > N, reads will observe the latest write.

Where:
- N = Total number of replicas
- W = Number of nodes that must acknowledge a write
- R = Number of nodes that must respond to a read

Common Configurations:

| Config | N | W | R | Write Tolerance | Read Tolerance | Characteristic |
|--------|---|---|---|-----------------|----------------|----------------|
| Strong | 3 | 2 | 2 | 1 failure | 1 failure | Balanced |
| Write-heavy | 3 | 1 | 3 | 2 failures | 0 failures | Fast writes |
| Read-heavy | 3 | 3 | 1 | 0 failures | 2 failures | Fast reads |
| Eventual | 3 | 1 | 1 | 2 failures | 2 failures | No guarantee |
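A minimal sketch of quorum reads and writes over an in-memory cluster, assuming each stored value carries a version number so the reader can pick the newest copy (the names and the version scheme are illustrative):

```python
N, W, R = 3, 2, 2                      # W + R > N, so reads see the latest write
replicas = [dict() for _ in range(N)]  # each replica: key -> (version, value)

def quorum_write(key, value, version, available):
    """Write to every reachable replica; succeed only if at least W acknowledged."""
    acks = 0
    for i in available:
        replicas[i][key] = (version, value)
        acks += 1
    if acks < W:
        raise RuntimeError("write failed: quorum not reached")

def quorum_read(key, available):
    """Read from R replicas and return the value with the highest version."""
    responses = [replicas[i].get(key, (0, None)) for i in available[:R]]
    if len(responses) < R:
        raise RuntimeError("read failed: quorum not reached")
    return max(responses, key=lambda r: r[0])[1]

quorum_write("cart:9", ["book"], version=1, available=[0, 1, 2])
quorum_write("cart:9", ["book", "pen"], version=2, available=[0, 1])  # replica 2 is down
print(quorum_read("cart:9", available=[1, 2]))
# ['book', 'pen']: the read quorum overlaps the write quorum, so it sees the newest version.
```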
Read Repair and Anti-Entropy:

Quorum replication requires mechanisms to ensure replicas eventually converge:

Read Repair:
When a client reads from R nodes and receives different versions, it writes the newest version back to stale nodes. This piggy-backs consistency maintenance on read operations.

Anti-Entropy (Merkle Trees):
Background processes compare data across replicas using hash trees. When differences are detected, the stale replica receives updates. This catches inconsistencies that read repair misses (data never read).

Hinted Handoff:
If a node is temporarily unavailable during a write, another node stores the write temporarily (a "hint"). When the target node recovers, the hint is delivered. This maintains write availability during transient failures.
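A hedged sketch of read repair, using the same versioned-replica layout as the previous example (the data and names are illustrative): after a quorum read, any replica that returned a stale version is written back with the newest one.

```python
R = 2
replicas = [
    {"cart:9": (2, ["book", "pen"])},   # up to date
    {"cart:9": (1, ["book"])},          # stale copy
    {},                                  # missed the write entirely
]

def quorum_read_with_repair(key, available):
    """Read from R replicas, return the newest value, and repair stale copies."""
    responses = [(i, replicas[i].get(key, (0, None))) for i in available[:R]]
    _, (newest_version, newest_value) = max(responses, key=lambda r: r[1][0])
    for i, (version, _) in responses:
        if version < newest_version:
            replicas[i][key] = (newest_version, newest_value)   # piggy-backed repair
    return newest_value

print(quorum_read_with_repair("cart:9", available=[0, 1]))   # ['book', 'pen']
print(replicas[1]["cart:9"])            # repaired to (2, ['book', 'pen'])
```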
A 'sloppy quorum' allows writes to succeed even if some of the designated N replicas are unavailable, by writing to other nodes instead. This increases write availability but weakens durability guarantees. Amazon's Dynamo uses sloppy quorums—sacrificing strict consistency for high availability.
Selecting the right replication strategy requires matching your workload characteristics and requirements to strategy capabilities. Here's a decision framework:
| Requirement | Best Strategy | Rationale |
|---|---|---|
| Simple operations, strong consistency | Primary-Replica + Sync | Simplest model; sync ensures durability |
| Read scalability, eventual consistency OK | Primary-Replica + Async | Add replicas freely; low write latency |
| Multi-region writes, low latency | Multi-Primary | Local writes in each region |
| Strong consistency, fault tolerance | Consensus (Raft/Paxos) | Linearizable; survives minority failures |
| High write availability, tunable consistency | Quorum-Based | No single point of failure; flexible W/R |
| Simple high throughput, strong consistency | Chain Replication | Efficient network usage; strong guarantees |
Decision Tree:

1. Do you need multi-region writes with low latency?
   Yes → Multi-Primary (accept conflict resolution complexity)
   No → Continue

2. Is strong (linearizable) consistency required?
   Yes → Consensus-Based (Raft/Paxos)
   No → Continue

3. Is write availability more important than consistency?
   Yes → Quorum-Based (tunable W/R)
   No → Continue

4. Is read scalability the primary concern?
   Yes → Primary-Replica with Async (many replicas)
   No → Primary-Replica with Sync (simple, durable)
Start simple. Primary-replica covers 90% of use cases. Multi-primary and consensus add operational complexity that you must be prepared to manage. Choose the simplest strategy that meets your requirements—complexity is a cost, not a feature.
Real-world systems often combine multiple replication strategies to achieve complex requirements. Understanding these hybrid approaches reveals the flexibility of replication architecture:
Example: CockroachDB Architecture

CockroachDB demonstrates sophisticated hybrid replication:

1. Data is sharded into ranges (~64MB each)
2. Each range uses Raft consensus for strong consistency
3. Ranges have 3-5 replicas spread across failure domains
4. Leaseholder optimization: one replica per range holds a lease for serving reads without consensus
5. Multi-region: ranges can span regions with geo-partitioning

This combines sharding, consensus, and lease-based optimization into a globally consistent yet locally fast system.
Hybrid architectures are often the result of evolution, not initial design. Start with a simple strategy (primary-replica). As requirements grow, add components: a synchronous replica for durability, a remote async replica for DR, sharding for write scale. Each addition should address a specific, measured need.
We've explored the major replication strategies and their trade-offs; the summary below consolidates the key insights.
What's Next:

Now that we understand replication strategies, we'll dive into the most critical aspect of distributed database design: Consistency Trade-offs. We'll explore the spectrum from strong to eventual consistency, the CAP and PACELC theorems, and how to choose the right consistency model for your application's semantics.
You now understand the major replication strategies—primary-replica, multi-primary, chain replication, consensus-based, and quorum-based. Each strategy makes specific trade-offs between consistency, availability, latency, and operational complexity. Select the simplest strategy that meets your requirements, and be prepared to evolve as needs grow. Next, we'll explore consistency trade-offs in depth.