Knowing that you need replication is only the beginning. The critical question is: how should replicas be organized and coordinated? Should there be a single node accepting writes? Multiple? Should replicas be daisy-chained for efficiency or fully meshed for resilience? Should consensus protocols govern every write?

These questions define your replication strategy—the architectural pattern that determines how your distributed database behaves under normal operation and during failures. Each strategy makes specific trade-offs between consistency, availability, latency, and operational complexity.

Choosing the wrong strategy for your workload leads to systems that are either too slow (over-synchronized), too fragile (under-replicated), or too complex to operate (needlessly sophisticated). This page equips you to make informed architectural decisions.
By the end of this page, you will understand the major replication strategies—primary-replica, multi-primary, chain replication, and consensus-based replication. You'll learn the mechanics, trade-offs, failure modes, and ideal use cases for each strategy, enabling you to select and configure the right approach for your system's requirements.
The primary-replica strategy (also called leader-follower or master-slave) is the most common replication pattern. A single designated node—the primary—handles all write operations. One or more replica nodes receive copies of these writes and serve read traffic.
How It Works:

1. All writes go to the primary: The primary is the single source of truth for data modifications.
2. Primary logs changes: Writes are recorded in a transaction log (WAL, binlog, oplog).
3. Replicas consume the log: Replicas connect to the primary and continuously receive log entries.
4. Replicas apply changes: Each replica applies log entries to its local data, eventually matching the primary.
5. Reads are distributed: Read queries can be served by any replica (with eventual consistency) or by the primary (for strong consistency).

Conflict Resolution: None Required

Because only the primary accepts writes, there are no write conflicts. This is the fundamental simplification that makes primary-replica replication operationally manageable.
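To make the log-shipping loop concrete, here is a minimal sketch in Python. The class names (`Primary`, `Replica`) and the in-memory log are illustrative assumptions, not any particular database's API; real systems stream WAL/binlog/oplog entries over a network connection.

```python
class Primary:
    """Accepts all writes and appends them to an ordered transaction log."""
    def __init__(self):
        self.data = {}
        self.log = []          # ordered log of (index, key, value) entries

    def write(self, key, value):
        self.data[key] = value
        self.log.append((len(self.log), key, value))


class Replica:
    """Consumes the primary's log and applies entries in order."""
    def __init__(self):
        self.data = {}
        self.applied_index = -1   # last log index applied locally

    def catch_up(self, primary):
        # Pull any log entries not yet applied; replication lag is simply
        # the gap between the primary's log end and applied_index.
        for index, key, value in primary.log[self.applied_index + 1:]:
            self.data[key] = value
            self.applied_index = index


primary, replica = Primary(), Replica()
primary.write("user:1", {"name": "Ada"})
replica.catch_up(primary)          # replica now serves eventually consistent reads
assert replica.data == primary.data
```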
Failover Mechanics:

When the primary fails, a replica must be promoted:

1. Detection: Monitoring detects primary unresponsiveness (heartbeat failure, connection timeout).
2. Selection: A replica is chosen for promotion (most up-to-date, highest priority, consensus among replicas).
3. Promotion: The selected replica stops receiving replication and begins accepting writes.
4. Reconfiguration: Other replicas redirect to the new primary. Clients update connection endpoints.
5. Old primary handling: If the old primary recovers, it must become a replica (cannot accept writes).

Automatic vs. Manual Failover:
- Automatic: Tools like PostgreSQL Patroni, MySQL Orchestrator, or MongoDB replica sets handle failover automatically. Faster recovery but risk of split-brain.
- Manual: Operators decide when to failover. Slower but avoids premature promotion during transient issues.
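A hedged sketch of the selection step, assuming each replica reports the last log index it has applied (the field names are illustrative): the most up-to-date reachable replica is promoted, which minimizes lost writes after an asynchronous primary failure.

```python
def choose_new_primary(replicas):
    """Pick the reachable replica that has applied the most log entries.

    `replicas` is assumed to be a list of dicts such as
    {"name": "replica-2", "reachable": True, "applied_index": 1042}.
    """
    candidates = [r for r in replicas if r["reachable"]]
    if not candidates:
        raise RuntimeError("no replica available for promotion")
    return max(candidates, key=lambda r: r["applied_index"])


replicas = [
    {"name": "replica-1", "reachable": True,  "applied_index": 1040},
    {"name": "replica-2", "reachable": True,  "applied_index": 1042},
    {"name": "replica-3", "reachable": False, "applied_index": 1043},  # partitioned away
]
print(choose_new_primary(replicas)["name"])   # replica-2
```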
If a network partition separates the primary from its replicas while both sides remain operational, you risk split-brain: two nodes believe they are the primary and accept different writes. This causes data divergence that is extremely difficult to resolve. Consensus-based leader election (using an odd number of nodes and majority quorums) prevents split-brain.
The multi-primary strategy (also called multi-master or active-active) allows multiple nodes to accept writes simultaneously. Changes made on any primary are replicated to all other primaries. This eliminates the write bottleneck of single-primary but introduces significant complexity.
When Multi-Primary Is Needed:

1. Geographic distribution: Users in multiple regions need low-latency writes. Routing all writes to a single primary in one region adds 100-200ms latency for distant users.

2. High write availability: If the single primary fails, writes are blocked. Multi-primary allows writes to continue on surviving nodes.

3. Write throughput: Write-heavy workloads may exceed single-node capacity.

The Central Challenge: Conflict Resolution

When two primaries accept writes to the same data simultaneously, a conflict occurs. Unlike primary-replica, where conflicts are impossible, multi-primary must detect and resolve conflicts.
Conflict Resolution Strategies:

1. Last-Writer-Wins (LWW)
Use timestamps to determine which write is "later." The later write overwrites the earlier one.
- Pros: Simple, deterministic, no manual intervention
- Cons: Silently loses data; clock synchronization is imperfect; "later" may not mean "correct"

2. Application-Level Resolution
Present both values to the application and let business logic decide.
- Pros: Business-correct resolution; no data loss
- Cons: Application complexity; may require user intervention

3. Merge Functions
Define data type-specific merge operations (e.g., union of sets, merged counters).
- Pros: Automatic, semantically meaningful
- Cons: Only works for mergeable data types; complex implementation

4. CRDTs (Conflict-free Replicated Data Types)
Data structures mathematically designed to merge without conflicts.
- Pros: Guaranteed convergence without coordination
- Cons: Limited data types; not suitable for all use cases

A short code sketch of last-writer-wins and a mergeable counter follows the summary table below.
| Strategy | Data Loss Risk | Complexity | Suitable For |
|---|---|---|---|
| Last-Writer-Wins | High (silent) | Low | Logging, caching, non-critical data |
| Application Resolution | None | High | Business-critical data requiring human judgment |
| Merge Functions | None | Medium | Counters, sets, append-only data |
| CRDTs | None | High | Collaborative editing, distributed counters |
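To ground the first and third strategies, here is a minimal sketch, assuming each replica tags writes with a wall-clock timestamp (for LWW) or keeps a per-node increment count (a grow-only counter merge). The function names are illustrative, not any particular database's API.

```python
def lww_resolve(version_a, version_b):
    """Last-writer-wins: keep whichever write carries the later timestamp.

    Each version is assumed to be a (timestamp, value) pair. Note the silent
    data loss: the losing value is simply discarded.
    """
    return version_a if version_a[0] >= version_b[0] else version_b


def merge_gcounter(counter_a, counter_b):
    """Merge two grow-only counters (a simple mergeable data type / CRDT).

    Each counter maps node_id -> number of increments seen on that node.
    Taking the per-node maximum makes the merge commutative, associative,
    and idempotent, so replicas converge regardless of merge order.
    """
    nodes = set(counter_a) | set(counter_b)
    return {n: max(counter_a.get(n, 0), counter_b.get(n, 0)) for n in nodes}


# Two primaries concurrently update the same profile field.
print(lww_resolve((1700000005, "Ada L."), (1700000009, "Ada Lovelace")))
# -> (1700000009, 'Ada Lovelace'); the earlier write is silently lost.

# Two primaries concurrently increment a page-view counter.
print(merge_gcounter({"node-a": 3, "node-b": 1}, {"node-a": 2, "node-b": 4}))
# -> {'node-a': 3, 'node-b': 4}; total views = 7, nothing lost.
```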
The best conflict resolution is no conflict at all. Partition your data so that different records are owned by different primaries. User profile updates always go to the user's 'home' primary. Order processing happens on the primary where the order was created. This 'sticky routing' model gets multi-primary benefits without multi-primary conflicts.
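A minimal sketch of this 'sticky routing' idea, assuming writes for a record are always sent to a deterministically chosen home primary (the endpoint names and hashing scheme are illustrative):

```python
import hashlib

def home_primary(user_id, primaries):
    """Route all of a user's writes to a fixed 'home' primary.

    Hashing the user id with a stable hash means any two writes to the same
    user always land on the same node, so they can never conflict.
    """
    digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return primaries[digest % len(primaries)]


primaries = ["us-east-primary", "eu-west-primary", "ap-south-primary"]
print(home_primary("user:42", primaries))   # always the same primary for user:42
```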
Chain replication is a specialized strategy that arranges nodes in a linear chain. Writes enter at the head of the chain and propagate through each node until reaching the tail. Reads are served exclusively by the tail. This design provides strong consistency with high throughput.
How Chain Replication Works:

1. Writes enter at the head: The head node receives all write requests.
2. Sequential forwarding: The head forwards to the next node in the chain, which forwards to the next, and so on.
3. Tail acknowledges: Only when the write reaches the tail is it considered committed. The tail sends acknowledgment back to the client.
4. Reads from the tail only: Since the tail has all committed writes, reads from the tail are strongly consistent.
5. No reads from middle nodes: Middle nodes may have uncommitted writes; reading from them would violate consistency.

Why This Works:

The tail is guaranteed to have every committed write. Any write in the chain but not at the tail is "in flight"—not yet committed. By serving reads from the tail exclusively, you get linearizable (strongly consistent) reads without any coordination overhead.
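Here is a minimal sketch of the write path, assuming an in-process list of nodes stands in for the real chain (class and method names are illustrative, and network hops and acknowledgments are collapsed into function calls):

```python
class ChainNode:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.next = None            # successor in the chain; None for the tail

    def write(self, key, value):
        """Apply the write locally, then forward it down the chain.

        Only the tail's application counts as the commit point; the
        acknowledgment conceptually flows back to the client from there.
        """
        self.store[key] = value
        if self.next is not None:
            return self.next.write(key, value)
        return "committed"          # reached the tail


# Build a chain: head -> middle -> tail.
head, middle, tail = ChainNode("head"), ChainNode("middle"), ChainNode("tail")
head.next, middle.next = middle, tail

print(head.write("order:7", "paid"))   # 'committed' once the tail has applied it
print(tail.store["order:7"])           # strongly consistent read from the tail
```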
Failure Handling in Chain Replication:

Head Failure:
The next node in the chain becomes the new head. Pending writes that hadn't propagated past the old head are lost.

Middle Node Failure:
The chain is "spliced"—the predecessor of the failed node connects directly to its successor. In-flight writes must be recovered from the predecessor.

Tail Failure:
The predecessor of the tail becomes the new tail. Some committed writes (acked by the old tail) may need re-acknowledgment.

CRAQ: Chain Replication with Apportioned Queries

A variant called CRAQ (Chain Replication with Apportioned Queries) allows reads from any node, not just the tail. Each node tracks which writes are "clean" (committed) vs. "dirty" (in-flight). Reads of clean data can be served locally; reads of dirty data are forwarded to the tail. This distributes read load while maintaining strong consistency.
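A hedged sketch of the CRAQ idea, assuming each node keeps a per-key dirty flag (the structure is simplified and illustrative): clean keys are served locally, dirty keys are resolved by asking the tail, which only ever holds committed values.

```python
class CraqNode:
    def __init__(self, tail=None):
        self.versions = {}      # key -> (value, is_dirty)
        self.tail = tail        # reference to the tail; None if this node is the tail

    def apply_write(self, key, value):
        # A write passing through a non-tail node is dirty until the tail commits it.
        self.versions[key] = (value, self.tail is not None)

    def mark_clean(self, key):
        value, _ = self.versions[key]
        self.versions[key] = (value, False)

    def read(self, key):
        value, is_dirty = self.versions[key]
        if is_dirty:
            return self.tail.read(key)   # dirty: ask the tail for the committed value
        return value                     # clean: safe to serve locally


tail = CraqNode()
middle = CraqNode(tail=tail)
middle.apply_write("x", 1); tail.apply_write("x", 1)   # write propagating down the chain
print(middle.read("x"))    # dirty at the middle node, so forwarded to the tail -> 1
middle.mark_clean("x")     # commit notification flows back up
print(middle.read("x"))    # now served locally
```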
Chain replication is less common in traditional databases but appears in distributed storage systems. Microsoft Azure Storage uses a variant of chain replication. Apache Kafka's replication model has chain-like properties. Object storage systems benefit from chain replication's bandwidth efficiency.
Consensus-based replication uses distributed consensus protocols (Paxos, Raft, Zab) to ensure that all replicas agree on the order of writes. This provides the strongest consistency guarantees—linearizability—while tolerating failures of a minority of nodes.
The Core Idea:

Before a write is committed, a majority (quorum) of nodes must agree to accept it. This ensures:

- Any two quorums overlap by at least one node
- Committed writes are never lost (at least one surviving node has every commit)
- No two different values can be committed for the same slot

Common Consensus Protocols:

Raft
- Leader-based: a leader handles all client requests
- Log replication: the leader appends entries to its log and replicates them to followers
- Leader election: if the leader fails, a majority elects a new leader
- Used by: etcd, CockroachDB, TiKV, Consul

Paxos
- More general and flexible than Raft
- Separates proposers, acceptors, and learners
- Multi-Paxos optimizes for sequences of values
- Used by: Google Spanner (variant), Chubby

Zab (Zookeeper Atomic Broadcast)
- Optimized for primary-backup with strong ordering
- Used by: Apache Zookeeper, Apache Kafka (for metadata)
Raft Protocol Deep Dive:

Leader Election:
1. Nodes start as followers with randomized election timeouts
2. If a follower receives no heartbeat, it becomes a candidate
3. The candidate requests votes from all nodes
4. The node receiving a majority of votes becomes leader
5. The leader sends periodic heartbeats to maintain authority

Log Replication:
1. Client sends a write to the leader
2. Leader appends the entry to its log with a term number and index
3. Leader sends AppendEntries RPCs to all followers in parallel
4. Followers append the entry to their logs and acknowledge
5. When a majority acknowledge, the leader commits the entry
6. Leader notifies followers to commit

Safety Guarantees:
- Only one leader per term
- Leaders never overwrite entries in their own logs
- Only a candidate whose log contains all committed entries can become leader
- Committed entries are durable across failures
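A minimal sketch of the commit rule in log replication, assuming the leader tracks the highest log index each follower has replicated (names are illustrative, and terms, elections, and retries are omitted):

```python
def leader_commit_index(leader_last_index, follower_match_indexes, cluster_size):
    """Return the highest log index replicated on a strict majority of the cluster.

    The leader counts itself; an entry is committed once it exists on
    cluster_size // 2 + 1 nodes, i.e. a majority.
    """
    majority = cluster_size // 2 + 1
    # Check every index in the leader's log, highest first.
    for index in range(leader_last_index, 0, -1):
        replicas_with_entry = 1 + sum(1 for m in follower_match_indexes if m >= index)
        if replicas_with_entry >= majority:
            return index
    return 0    # nothing committed yet


# 5-node cluster: the leader's log reaches index 7; the four followers have
# replicated up to indexes 7, 6, 3, and 2 respectively.
print(leader_commit_index(7, [7, 6, 3, 2], cluster_size=5))   # -> 6
```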
| Characteristic | Description | Impact |
|---|---|---|
| Fault Tolerance | Tolerates (N-1)/2 failures for N nodes | 3 nodes → 1 failure; 5 nodes → 2 failures |
| Write Latency | Requires majority acknowledgment | Higher than async, similar to sync |
| Consistency | Linearizable (strongest) | No stale reads possible |
| Availability | Available if majority is reachable | CAP: CP system (not AP) |
| Complexity | Complex protocol implementation | Use battle-tested libraries (etcd, Consul) |
Consensus requires a majority quorum. With 3 nodes, majority is 2, tolerating 1 failure. With 5 nodes, majority is 3, tolerating 2 failures. Even-numbered clusters (2, 4, 6) provide no benefit: 4 nodes still only tolerate 1 failure (majority is 3), same as 3 nodes but with more overhead. Always use odd numbers.
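The arithmetic is easy to check with a throwaway sketch (nothing database-specific here, just the majority formula):

```python
def fault_tolerance(n):
    """Failures a consensus group of n nodes can survive: n minus a majority."""
    majority = n // 2 + 1
    return n - majority

for n in range(2, 7):
    print(n, "nodes ->", fault_tolerance(n), "failure(s) tolerated")
# 2 -> 0, 3 -> 1, 4 -> 1, 5 -> 2, 6 -> 2: even sizes add nodes but no extra tolerance.
```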
Quorum-based replication (also called leaderless or Dynamo-style replication) eliminates the leader entirely. Clients can write to and read from any node. Consistency is achieved through quorum arithmetic: if you write to W nodes and read from R nodes, and W + R > N (total nodes), at least one node you read from has the latest write.
The Quorum Formula:

If W + R > N, reads will observe the latest write.

Where:
- N = Total number of replicas
- W = Number of nodes that must acknowledge a write
- R = Number of nodes that must respond to a read

Common Configurations:

| Config | N | W | R | Write Tolerance | Read Tolerance | Characteristic |
|--------|---|---|---|-----------------|----------------|----------------|
| Strong | 3 | 2 | 2 | 1 failure | 1 failure | Balanced |
| Write-heavy | 3 | 1 | 3 | 2 failures | 0 failures | Fast writes |
| Read-heavy | 3 | 3 | 1 | 0 failures | 2 failures | Fast reads |
| Eventual | 3 | 1 | 1 | 2 failures | 2 failures | No guarantee |
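A minimal sketch of quorum reads and writes over an in-memory cluster, assuming each stored value carries a version number so the reader can pick the newest copy (the names and the version scheme are illustrative):

```python
N, W, R = 3, 2, 2                      # W + R > N, so reads see the latest write
replicas = [dict() for _ in range(N)]  # each replica: key -> (version, value)

def quorum_write(key, value, version, available):
    """Write to every reachable replica; succeed only if at least W acknowledged."""
    acks = 0
    for i in available:
        replicas[i][key] = (version, value)
        acks += 1
    if acks < W:
        raise RuntimeError("write failed: quorum not reached")

def quorum_read(key, available):
    """Read from R replicas and return the value with the highest version."""
    responses = [replicas[i].get(key, (0, None)) for i in available[:R]]
    if len(responses) < R:
        raise RuntimeError("read failed: quorum not reached")
    return max(responses, key=lambda r: r[0])[1]

quorum_write("cart:9", ["book"], version=1, available=[0, 1, 2])
quorum_write("cart:9", ["book", "pen"], version=2, available=[0, 1])  # replica 2 is down
print(quorum_read("cart:9", available=[1, 2]))
# ['book', 'pen']: the read quorum overlaps the write quorum, so it sees the newest version.
```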
Read Repair and Anti-Entropy:

Quorum replication requires mechanisms to ensure replicas eventually converge:

Read Repair:
When a client reads from R nodes and receives different versions, it writes the newest version back to stale nodes. This piggy-backs consistency maintenance on read operations.

Anti-Entropy (Merkle Trees):
Background processes compare data across replicas using hash trees. When differences are detected, the stale replica receives updates. This catches inconsistencies that read repair misses (data never read).

Hinted Handoff:
If a node is temporarily unavailable during a write, another node stores the write temporarily (a "hint"). When the target node recovers, the hint is delivered. This maintains write availability during transient failures.
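A hedged sketch of read repair, using the same versioned-replica layout as the previous example (the data and names are illustrative): after a quorum read, any replica that returned a stale version is written back with the newest one.

```python
R = 2
replicas = [
    {"cart:9": (2, ["book", "pen"])},   # up to date
    {"cart:9": (1, ["book"])},          # stale copy
    {},                                  # missed the write entirely
]

def quorum_read_with_repair(key, available):
    """Read from R replicas, return the newest value, and repair stale copies."""
    responses = [(i, replicas[i].get(key, (0, None))) for i in available[:R]]
    _, (newest_version, newest_value) = max(responses, key=lambda r: r[1][0])
    for i, (version, _) in responses:
        if version < newest_version:
            replicas[i][key] = (newest_version, newest_value)   # piggy-backed repair
    return newest_value

print(quorum_read_with_repair("cart:9", available=[0, 1]))   # ['book', 'pen']
print(replicas[1]["cart:9"])            # repaired to (2, ['book', 'pen'])
```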
A 'sloppy quorum' allows writes to succeed even if some of the designated N replicas are unavailable, by writing to other nodes instead. This increases write availability but weakens durability guarantees. Amazon's Dynamo uses sloppy quorums—sacrificing strict consistency for high availability.
Selecting the right replication strategy requires matching your workload characteristics and requirements to strategy capabilities. Here's a decision framework:
| Requirement | Best Strategy | Rationale |
|---|---|---|
| Simple operations, strong consistency | Primary-Replica + Sync | Simplest model; sync ensures durability |
| Read scalability, eventual consistency OK | Primary-Replica + Async | Add replicas freely; low write latency |
| Multi-region writes, low latency | Multi-Primary | Local writes in each region |
| Strong consistency, fault tolerance | Consensus (Raft/Paxos) | Linearizable; survives minority failures |
| High write availability, tunable consistency | Quorum-Based | No single point of failure; flexible W/R |
| Simple high throughput, strong consistency | Chain Replication | Efficient network usage; strong guarantees |
Decision Tree:

1. Do you need multi-region writes with low latency?
   Yes → Multi-Primary (accept conflict resolution complexity)
   No → Continue

2. Is strong (linearizable) consistency required?
   Yes → Consensus-Based (Raft/Paxos)
   No → Continue

3. Is write availability more important than consistency?
   Yes → Quorum-Based (tunable W/R)
   No → Continue

4. Is read scalability the primary concern?
   Yes → Primary-Replica with Async (many replicas)
   No → Primary-Replica with Sync (simple, durable)
Start simple. Primary-replica covers 90% of use cases. Multi-primary and consensus add operational complexity that you must be prepared to manage. Choose the simplest strategy that meets your requirements—complexity is a cost, not a feature.
Real-world systems often combine multiple replication strategies to achieve complex requirements. Understanding these hybrid approaches reveals the flexibility of replication architecture:
Example: CockroachDB Architecture

CockroachDB demonstrates sophisticated hybrid replication:

1. Data is sharded into ranges (~64MB each)
2. Each range uses Raft consensus for strong consistency
3. Ranges have 3-5 replicas spread across failure domains
4. Leaseholder optimization: one replica per range holds a lease for serving reads without consensus
5. Multi-region: ranges can span regions with geo-partitioning

This combines sharding, consensus, and lease-based optimization into a globally consistent yet locally fast system.
Hybrid architectures are often the result of evolution, not initial design. Start with a simple strategy (primary-replica). As requirements grow, add components: a synchronous replica for durability, a remote async replica for DR, sharding for write scale. Each addition should address a specific, measured need.
We've explored the major replication strategies and their trade-offs; the summary below consolidates the key insights.
What's Next:

Now that we understand replication strategies, we'll dive into the most critical aspect of distributed database design: Consistency Trade-offs. We'll explore the spectrum from strong to eventual consistency, the CAP and PACELC theorems, and how to choose the right consistency model for your application's semantics.
You now understand the major replication strategies—primary-replica, multi-primary, chain replication, consensus-based, and quorum-based. Each strategy makes specific trade-offs between consistency, availability, latency, and operational complexity. Select the simplest strategy that meets your requirements, and be prepared to evolve as needs grow. Next, we'll explore consistency trade-offs in depth.