In a distributed file system spanning thousands of machines, failure is not a possibility—it's a certainty. Hard drives fail. Servers crash. Network links go down. Power outages affect entire racks. The question isn't whether components will fail, but when—and how often.
Google's early infrastructure experience is instructive: with thousands of commodity servers, they expected 1-2 machine failures per day in a large cluster. Traditional approaches—expensive hardware, careful maintenance—couldn't scale to this environment.
Replication is the fundamental answer to this challenge. By maintaining multiple copies of data on different machines, a distributed file system can tolerate failures without data loss. When a storage node fails, replicas on surviving nodes ensure continuous availability. When the failed node recovers or is replaced, the system automatically restores the target replication level.
But replication is far more than "just copy the data." The design of a replication system determines durability guarantees, write performance, read distribution, recovery time, and storage costs. Getting replication right is essential to building reliable distributed storage.
By the end of this page, you will understand replication strategies in distributed file systems, including replica placement, synchronization protocols, consistency guarantees, and recovery mechanisms. You'll see how systems balance durability, availability, performance, and cost.
Replication serves multiple critical purposes in distributed file systems. Understanding these purposes helps in designing appropriate replication strategies.
The four pillars of replication:
- Durability: data survives disk, node, and rack failures because other copies remain
- Availability: data stays accessible while some replicas are down or unreachable
- Performance: reads can be served from the closest or least-loaded replica
- Recovery: surviving replicas are the source for restoring the target replication level after a failure
Calculating failure probability:
The probability of data loss with independent replicas follows a simple model:
Given:
- Probability of single disk failure in a year: p = 2% (typical for HDDs)
- Replication factor: n = number of copies
- Assumption: failures are independent (requires careful placement)
Probability of losing all replicas:
- 1 replica: P(loss) = p = 2%
- 2 replicas: P(loss) = p² = 0.04%
- 3 replicas: P(loss) = p³ = 0.0008% (8 in a million)
- 4 replicas: P(loss) = p⁴ = 0.000016% (16 in 100 million)
With 3 replicas and 1 million files:
- Expected files lost per year: 8
- With 4 replicas: 0.16 files lost per year
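These figures are easy to reproduce. The short sketch below (function names are illustrative, not from any particular system) recomputes the loss probabilities and expected annual losses under the same simplifying assumption of independent failures:

```python
# Probability of losing every replica of an object, assuming independent
# failures with annual per-disk failure probability p.
def p_all_replicas_lost(p: float, replicas: int) -> float:
    return p ** replicas

def expected_objects_lost(p: float, replicas: int, num_objects: int) -> float:
    return num_objects * p_all_replicas_lost(p, replicas)

p = 0.02  # 2% annual disk failure probability (typical HDD)
for n in (1, 2, 3, 4):
    print(f"{n} replica(s): P(loss) = {p_all_replicas_lost(p, n):.6%}")

print(expected_objects_lost(p, 3, 1_000_000))  # ~8 files lost per year
print(expected_objects_lost(p, 4, 1_000_000))  # ~0.16 files lost per year
```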
The independence assumption is critical:
These calculations assume failures are independent. But failures are often correlated: nodes in the same rack share a power supply and top-of-rack switch, disks from the same manufacturing batch can fail together, and a datacenter-wide outage takes down every replica stored there.
Effective replication must place replicas to minimize correlation—across racks, zones, or data centers.
Durability means data isn't lost; availability means data is accessible. Three replicas provide high durability, but if a write requires all three replicas to acknowledge, a single slow or failed replica blocks the write—reducing availability. This tension between durability and availability is fundamental to replication design.
Different distributed file systems employ different replication strategies, each with distinct tradeoffs. The choice depends on workload characteristics, consistency requirements, and cost constraints.
| Strategy | Description | Pros | Cons | Use Case |
|---|---|---|---|---|
| Full Replication | Every replica stores complete, identical copy | Simple, any replica can serve reads | Storage overhead = n× | HDFS, GFS, most DFS |
| Erasure Coding | Data encoded into fragments; subset sufficient to reconstruct | Lower storage overhead (e.g., 1.5×) | Higher CPU, complex writes | Cold data, archival storage |
| Primary-Backup | One primary handles writes; backups replicate | Clear consistency model | Primary is bottleneck | Traditional databases |
| Chain Replication | Writes flow through a chain; tail confirms | Strong consistency, simple | Chain length affects latency | Facebook Haystack |
| Quorum-Based | Writes to W replicas; reads from R; W+R>N | Tunable consistency/availability | Complex reasoning | Dynamo, Cassandra |
Full replication in HDFS/GFS:
The GFS (and later HDFS) model uses simple full replication:
File: /data/sales.csv (256 MB)
Replication factor: 3
Block 1 (128MB): stored on DataNode[4], DataNode[7], DataNode[15]
Block 2 (128MB): stored on DataNode[2], DataNode[11], DataNode[19]
Total storage used: 256MB × 3 = 768 MB
Read operation:
- Client gets block locations from NameNode
- Picks closest replica (by network topology)
- Reads directly from that DataNode
Failure scenario:
- DataNode[7] fails
- Block 1 now has 2 replicas (under-replicated)
- NameNode schedules re-replication: DataNode[4] → DataNode[22]
- Block 1 restored to 3 replicas
This approach is simple, proven, and works well for hot (frequently accessed) data. The 3× storage overhead is acceptable for data that needs fast access.
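The "pick the closest replica" step can be expressed with HDFS-style topology distances (0 for the same node, 2 for the same rack, 4 for a different rack). The sketch below is a simplification for illustration, not the actual HDFS client code:

```python
# HDFS-style network distance: 0 = same node, 2 = same rack, 4 = different rack.
def network_distance(reader: tuple[str, str], node: tuple[str, str]) -> int:
    """reader and node are (rack, host) pairs."""
    if reader == node:
        return 0
    if reader[0] == node[0]:
        return 2
    return 4

def pick_replica(reader, replicas):
    """Choose the replica closest to the reader in the topology tree."""
    return min(replicas, key=lambda node: network_distance(reader, node))

block1 = [("rack1", "DN4"), ("rack1", "DN7"), ("rack3", "DN15")]
print(pick_replica(("rack1", "DN9"), block1))   # ('rack1', 'DN4') - same rack
print(pick_replica(("rack3", "DN15"), block1))  # ('rack3', 'DN15') - local node
```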
For cold or archival data, erasure coding reduces storage overhead dramatically. HDFS EC with a 6+3 scheme stores 9 blocks for every 6 data blocks (1.5× overhead vs. 3× for replication), while still tolerating 3 failures. The tradeoff: higher CPU cost for encoding/decoding and degraded read performance when blocks are missing.
Where replicas are placed is as important as how many exist. Poor placement can make theoretically redundant data vulnerable to correlated failures. Good placement balances failure independence against network efficiency.
HDFS default placement policy (3 replicas):
                         Cluster
             ┌──────────────┴──────────────┐
           Rack 1                        Rack 2
    ┌────────────────────┐      ┌────────────────────┐
    │ Node A ◄─ Replica 1│      │ Node D ◄─ Replica 2│
    │   (writer's node)  │      │ Node E ◄─ Replica 3│
    │ Node B             │      │                    │
    │ Node C             │      │                    │
    └────────────────────┘      └────────────────────┘
Placement rationale:
- Replica 1: The writer's own node (or its rack) for maximum write throughput
- Replica 2: A node in a different rack, so the data survives a whole-rack failure
- Replica 3: A different node in that same remote rack, so only one copy crosses the rack boundary during the write (intra-rack bandwidth is cheaper than inter-rack)
This policy provides data safety (survives rack failure) while optimizing for common hardware configurations where intra-rack bandwidth exceeds inter-rack bandwidth.
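The default policy can be summarized as a short placement function. The sketch below is a simplification (no load, free-space, or decommissioning checks, unlike HDFS's real BlockPlacementPolicyDefault): the writer's node first, then a node in a different rack, then a second node in that same remote rack.

```python
import random

def place_replicas(topology: dict[str, str], writer: str) -> list[str]:
    """Simplified 3-replica rack-aware placement.
    topology maps node name -> rack name; writer is the client's DataNode."""
    local_rack = topology[writer]
    # Replica 2: any node in a different rack
    remote_nodes = [n for n, rack in topology.items() if rack != local_rack]
    replica2 = random.choice(remote_nodes)
    # Replica 3: a different node in the same remote rack as replica 2
    remote_rack = topology[replica2]
    candidates = [n for n, rack in topology.items()
                  if rack == remote_rack and n != replica2]
    replica3 = random.choice(candidates)
    return [writer, replica2, replica3]

topology = {"A": "rack1", "B": "rack1", "C": "rack1",
            "D": "rack2", "E": "rack2"}
print(place_replicas(topology, writer="A"))  # e.g. ['A', 'D', 'E']
```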
Ceph CRUSH placement:
Ceph's CRUSH algorithm provides highly configurable placement through placement rules:
rule replicated_ruleset {
ruleset 1
type replicated
min_size 1
max_size 10
step take default # Start at cluster root
step chooseleaf firstn 0 type rack # Pick N racks
step emit # Place on OSDs in chosen racks
}
# This rule spreads replicas across racks
# "firstn 0" means use the replication factor
# Each replica lands on a different rack
# More complex example: cross-datacenter
rule cross_dc {
step take datacenter1
step chooseleaf firstn 2 type host # 2 replicas in DC1
step emit
step take datacenter2
step chooseleaf firstn 1 type host # 1 replica in DC2
step emit
}
CRUSH allows encoding arbitrary placement policies—across racks, across buildings, across continents—all computed algorithmically without central lookup.
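CRUSH itself is more involved, but its key property (every client computes placement deterministically from the object name, with no lookup table) can be illustrated with rendezvous, i.e. highest-random-weight, hashing. This is a stand-in for CRUSH, not the real algorithm, and the rack/OSD names are made up:

```python
import hashlib

def score(obj: str, bucket: str) -> int:
    """Deterministic pseudo-random score for an (object, bucket) pair."""
    digest = hashlib.sha256(f"{obj}/{bucket}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def place(obj: str, racks: dict[str, list[str]], replicas: int = 3) -> list[str]:
    """Rendezvous-style placement: rank racks by score, then pick the
    highest-scoring host in each of the top `replicas` racks. Any client
    computes the same answer from the object name alone."""
    top_racks = sorted(racks, key=lambda r: score(obj, r), reverse=True)[:replicas]
    return [max(racks[r], key=lambda h: score(obj, h)) for r in top_racks]

racks = {"rack1": ["osd1", "osd2"], "rack2": ["osd3", "osd4"], "rack3": ["osd5"]}
print(place("/data/sales.csv", racks))  # one OSD from each of three racks
```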
When data is written, all replicas must eventually receive the write. The write protocol determines when writes are acknowledged and how replicas are synchronized.
Synchronous vs. asynchronous replication:
Synchronous Replication
Write acknowledged only after all replicas confirm persistence.
Client → Primary → Replica1 → Replica2
↑________________________________│
ACK (after all persist)
Characteristics:
- Write latency includes the slowest replica's persistence time
- No acknowledged write is lost as long as any replica survives
- A slow or failed replica delays or blocks writes, reducing availability
Asynchronous Replication
Write acknowledged after primary persists; replicas updated in background.
Client → Primary
↑______│ (immediate ACK)
└→ Replica1, Replica2 (background)
Characteristics:
- Write latency depends only on the primary's persistence time
- Acknowledged writes can be lost if the primary fails before replicating them
- Replicas may lag behind, so reads served from them can return stale data
GFS/HDFS write pipeline:
GFS and HDFS use an innovative pipelined write protocol that's synchronous but efficient:
HDFS Write Pipeline:
1. Client asks NameNode for block allocation
NameNode returns: [DN1, DN2, DN3] (ordered by distance)
2. Client connects to DN1 (first in pipeline)
DN1 connects to DN2
DN2 connects to DN3
3. Client streams data to DN1 in packets (64KB each)
Each packet is forwarded down the pipeline:
Client → DN1 → DN2 → DN3
4. Acknowledgments flow back up:
DN3 → DN2 → DN1 → Client
5. When all packets ACKed, block write complete
Data flows down the pipeline; acknowledgments flow back up:

  data:  Client ──► DN1 ──► DN2 ──► DN3
  ACKs:  Client ◄── DN1 ◄── DN2 ◄── DN3
Why pipelining?
Without pipelining, the client would send the data three times (once to each replica), saturating its own uplink. With pipelining:
- The client sends each byte exactly once; each DataNode forwards data to the next node while still receiving it, so every machine's full outbound bandwidth is used.
- End-to-end time is close to the time to transfer the block once, plus a small per-hop delay (roughly one packet per additional replica), as the rough model below illustrates.
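A back-of-the-envelope model (not from the GFS or HDFS papers; the bandwidth and packet-size numbers are assumptions) shows the difference between fanning out from the client and forwarding along the pipeline:

```python
def fanout_write_time(block_mb: float, client_bw_mb_s: float, replicas: int) -> float:
    """Client sends the full block to each replica over its own uplink."""
    return replicas * block_mb / client_bw_mb_s

def pipelined_write_time(block_mb: float, link_bw_mb_s: float,
                         replicas: int, packet_mb: float = 0.064) -> float:
    """Client sends once; each extra hop adds roughly one packet of delay.
    Ignores per-packet overhead and assumes equal link bandwidth."""
    return block_mb / link_bw_mb_s + (replicas - 1) * packet_mb / link_bw_mb_s

# 128 MB block, 100 MB/s links, 3 replicas
print(fanout_write_time(128, 100, 3))     # 3.84 s: client uplink is the bottleneck
print(pipelined_write_time(128, 100, 3))  # ~1.28 s: close to a single transfer
```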
Pipelined writes are only as fast as the slowest replica. If DN2 has a failing disk, all writes through that pipeline slow down. HDFS addresses this through: (1) Pipeline recovery—excluding the slow node and continuing with remaining nodes, (2) Hedged reads—reading from multiple replicas in parallel and using the first response, (3) Aggressive failure detection—marking slow nodes as dead.
With multiple replicas, ensuring consistency—that all replicas agree on the data—requires careful protocol design. Different systems offer different consistency guarantees based on their requirements.
Consistency models for replicated data:
| Model | Guarantee | Implementation | Cost |
|---|---|---|---|
| Strong Consistency | All reads see latest write | Synchronous replication or consensus | High latency, reduced availability |
| Sequential Consistency | All clients see operations in same order | Single writer or global ordering | Medium overhead |
| Eventual Consistency | Replicas converge given time without updates | Asynchronous replication + conflict resolution | Low latency, complex handling |
| Read-Your-Writes | Client sees its own writes immediately | Session-sticky reads or version tracking | Low overhead |
| Monotonic Reads | Client never sees older data than already seen | Version tracking, minimum read version | Low overhead |
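As a concrete example of the lightweight session guarantees in the table, the sketch below (a hypothetical client-side helper, not any system's API) tracks the highest version a client has written or read and refuses older replica responses, which yields read-your-writes and monotonic reads:

```python
class Session:
    """Client-side version tracking for read-your-writes / monotonic reads."""

    def __init__(self) -> None:
        self.min_version = 0  # highest version this client has written or seen

    def on_write_ack(self, version: int) -> None:
        self.min_version = max(self.min_version, version)

    def accept_read(self, version: int) -> bool:
        """Return False for responses older than what we already saw;
        the caller should retry against another (fresher) replica."""
        if version < self.min_version:
            return False
        self.min_version = version
        return True

s = Session()
s.on_write_ack(7)          # this client wrote version 7
print(s.accept_read(6))    # False: replica is lagging behind our own write
print(s.accept_read(8))    # True, and future reads must now be >= 8
```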
Quorum protocols:
Many distributed systems use quorum-based protocols to balance consistency and availability:
Quorum Protocol:
- N = total replicas
- W = replicas that must acknowledge a write
- R = replicas that must respond to a read
Rule: W + R > N → Guarantees read sees latest write
Examples with N=3:
W=2, R=2: Moderate read/write latency, strong consistency
Writes: need 2 of 3
Reads: need 2 of 3 (at least 1 overlap with write)
W=3, R=1: Slow writes, fast reads, strong consistency
Good for read-heavy workloads
W=1, R=3: Fast writes, slow reads, strong consistency
Good for write-heavy workloads
W=1, R=1: Fast both, but NO consistency guarantee
May read stale data
Object: X
Version: 5 ← Written with W=2
┌─────┐ ┌─────┐ ┌─────┐
│Rep1 │ │Rep2 │ │Rep3 │
│v=5 │ │v=5 │ │v=4 │ ← Rep3 missed update
└─────┘ └─────┘ └─────┘
Read with R=2 returns v=5: any two of the three replicas must include
at least one of Rep1 or Rep2, which hold the latest version, and the
client takes the highest version among the responses.
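A minimal sketch of the read side of this protocol (illustrative data structures, not Dynamo's or Cassandra's API): query any R replicas and return the highest-versioned response, which must include the latest committed write when W + R > N.

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    version: int
    value: str

# Replica state for object X: Rep3 missed the latest update.
replicas = {
    "Rep1": Versioned(5, "new"),
    "Rep2": Versioned(5, "new"),
    "Rep3": Versioned(4, "old"),
}

def quorum_read(respondents: list[str], R: int) -> Versioned:
    """Collect responses from any R replicas and keep the highest version."""
    responses = [replicas[name] for name in respondents[:R]]
    return max(responses, key=lambda v: v.version)

print(quorum_read(["Rep2", "Rep3"], R=2))  # Versioned(version=5, value='new')
```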
HDFS largely sidesteps consistency complexity through its append-only model. Blocks are immutable once written; there's no conflict between readers and writers of the same block. Consistency is ensured by completing the write pipeline before making the block visible. The NameNode is the single source of truth for which blocks are finalized.
Replication provides the raw material for fault tolerance, but failure detection and recovery mechanisms actually realize the benefits. The system must detect when replicas are lost and restore the replication factor.
Failure detection mechanisms:
- Heartbeats: storage nodes report to the metadata server (the NameNode in HDFS) every few seconds; missed heartbeats first mark a node stale, then dead
- Block reports: nodes periodically enumerate the replicas they hold, letting the metadata server detect missing or unexpected replicas
- Checksum verification: data is checksummed on write and verified on read and by background scanners, so silently corrupted replicas are detected and replaced
HDFS recovery workflow:
HDFS Block Recovery Process:
1. DETECTION: DataNode stops heartbeating
NameNode: no heartbeat from DN7 within the timeout
(marked stale after ~30 seconds; declared dead after ~10.5 minutes by default)
NameNode marks DN7 as dead
2. IDENTIFICATION: Find under-replicated blocks
NameNode scans: "DN7 stored blocks: [blk_101, blk_102, blk_103,...]"
These blocks now have replication < target
3. PRIORITIZATION: Rank blocks by urgency
Priority 1: Only 1 replica remaining (critical!)
Priority 2: Only 2 replicas remaining
Priority 3: Decommissioning node's blocks
4. SCHEDULING: Choose source and destination
For blk_101:
Current replicas: [DN3, DN12] (need 3rd copy)
Choose source: DN3 (lowest load)
Choose destination: DN19 (different rack, space available)
5. REPLICATION: Execute copy
NameNode → DN3: "Send blk_101 to DN19"
DN3 streams data to DN19
DN19 confirms receipt to NameNode
6. COMPLETION: Update metadata
NameNode updates block locations: blk_101 → [DN3, DN12, DN19]
Replication factor restored!
Typical recovery time:
- Detection: ~30 seconds to mark a node stale; roughly 10 minutes (by default) before it is declared dead
- Per-block copy: a 128 MB block at 100 MB/s is ~1.3 seconds of transfer; a few seconds each with scheduling overhead and throttling
- Full recovery after node failure: minutes to hours
(depends on data volume and cluster bandwidth)
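The "minutes to hours" range follows directly from data volume and how much bandwidth recovery is allowed to use. The sketch below estimates it under illustrative assumptions (per-node bandwidth, throttling fraction, and node counts are made up, not measurements):

```python
def recovery_time_hours(data_tb: float, source_nodes: int,
                        node_bw_mb_s: float = 100.0,
                        recovery_fraction: float = 0.25) -> float:
    """Time to re-replicate a failed node's data when the work is spread
    over many source nodes, each throttled to a fraction of its bandwidth."""
    data_mb = data_tb * 1_000_000
    aggregate_mb_s = source_nodes * node_bw_mb_s * recovery_fraction
    return data_mb / aggregate_mb_s / 3600

# 8 TB on the failed node, 100 surviving source nodes, 25% of 100 MB/s each
print(f"{recovery_time_hours(8, 100):.1f} h")  # ~0.9 h
# Same data, but only 10 nodes share the recovery work
print(f"{recovery_time_hours(8, 10):.1f} h")   # ~8.9 h
```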
When a heavily-loaded node fails, the system must replicate potentially terabytes of data. This 'recovery storm' can saturate network bandwidth and slow normal operations. Well-designed systems throttle recovery traffic, prioritize critical blocks, and spread recovery load across many source nodes rather than concentrating on a few.
Full replication's 3× storage overhead is substantial. For cold or infrequently accessed data, erasure coding provides fault tolerance with significantly lower storage costs.
How erasure coding works:
Erasure coding transforms k data blocks into n coded blocks (n > k), such that any k of the n blocks can reconstruct the original data.
Reed-Solomon (6,3) Example:
Original data blocks: D1, D2, D3, D4, D5, D6 (6 data blocks)
Parity blocks computed: P1, P2, P3 (3 parity blocks)
Total blocks stored: 9
Storage overhead: 9/6 = 1.5× (vs. 3× for replication)
Recovery capability:
- Can lose ANY 3 blocks (any combination of data + parity)
- Reconstruct lost blocks from surviving 6 blocks
Example:
Stored: [D1, D2, D3, D4, D5, D6, P1, P2, P3]
After 3 failures: [D1, __, D3, D4, __, D6, P1, __, P3]
Reconstruction: Use D1, D3, D4, D6, P1, P3 → recover D2, D5, P2
D1 D2 D3 D4 D5 D6
│ │ │ │ │ │
└────┴────┴────┴────┴────┴──→ Encoder
│
┌──────────────────────────────────┘
↓
P1 = f(D1,D2,D3,D4,D5,D6)
P2 = g(D1,D2,D3,D4,D5,D6)
P3 = h(D1,D2,D3,D4,D5,D6)
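Full Reed-Solomon arithmetic is too long for a short example, but the core idea, recomputing a missing block from the survivors, already appears with a single XOR parity block (the RAID-5 scheme). The sketch below tolerates one lost block; RS(6,3) generalizes the same idea to three parity blocks and any three losses.

```python
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# Six data blocks plus one XOR parity block (a 6+1 code, tolerating 1 loss).
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD", b"EEEE", b"FFFF"]
parity = xor_blocks(data)

# Lose D2; reconstruct it from the five surviving data blocks plus parity.
survivors = data[:1] + data[2:]
recovered = xor_blocks(survivors + [parity])
assert recovered == b"BBBB"
print("recovered D2:", recovered)
```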
| Aspect | 3× Replication | RS (6,3) | RS (10,4) |
|---|---|---|---|
| Storage Overhead | 3.0× | 1.5× | 1.4× |
| Fault Tolerance | 2 failures | 3 failures | 4 failures |
| Read Performance | Any replica | All k blocks needed | All k blocks needed |
| Degraded Read | Read from other replica | Decode from k blocks (CPU expensive) | Decode from k blocks |
| Write Performance | Write 3 copies | Encode + write n blocks | Encode + write n blocks |
| Recovery I/O | Copy 1 block | Read k blocks, reconstruct | Read k blocks, reconstruct |
| Best For | Hot data, low latency | Cold data, cost-sensitive | Archival data |
HDFS Erasure Coding:
HDFS 3.x added native erasure coding support:
# Enable EC for a directory
hdfs ec -enablePolicy -policy RS-6-3-1024k
hdfs ec -setPolicy -path /cold-data -policy RS-6-3-1024k
# Files in /cold-data now use erasure coding
# Available policies:
# RS-3-2-1024k:  3 data + 2 parity,  1 MB striping cell
# RS-6-3-1024k:  6 data + 3 parity,  1 MB striping cell (default policy)
# RS-10-4-1024k: 10 data + 4 parity, 1 MB striping cell
When to use erasure coding:
Erasure coding works best for large files. With an RS 6+3 layout, a file needs at least six full data cells to fill one stripe. A 3 MB file with 1 MB cells occupies only 3 data cells yet still requires 3 parity cells, so it pays 2× overhead with none of the intended savings, and produces many tiny internal blocks. Small files should use replication or be combined into larger objects before EC.
For global services, data must be available across multiple geographic regions. Geo-replication extends the principles of local replication to wide-area networks, adding new challenges around latency, bandwidth costs, and regional failures.
Why geo-replicate?
- Latency: serving reads from a nearby region avoids 50-200 ms cross-continent round trips
- Disaster tolerance: a replica in another region survives the loss of an entire datacenter or region
- Locality: some deployments must keep or serve data in specific geographic regions
Challenges of geo-replication:
| Metric | Intra-datacenter | Inter-datacenter |
|---|---|---|
| Latency | < 1 ms | 50-200 ms |
| Bandwidth | 10-100 Gbps | 1-10 Gbps |
| Cost | Low (internal) | High (WAN/transit) |
| Reliability | High | Variable |
Implication:
- Synchronous replication across datacenters adds 100-400ms to writes
- For most applications, this is unacceptable
- Asynchronous geo-replication is common but risks data loss in disaster
Geo-replication strategies:
- Asynchronous replication with an RPO: writes commit locally and stream to remote regions in the background; the recovery point objective bounds how much recently written data a regional disaster can lose
- Synchronous replication for critical data: a small subset of data pays the cross-region write latency so that no acknowledged write is ever lost
- Tiered geo-replication: synchronous within a region or between nearby datacenters, asynchronous to distant regions, balancing write latency against disaster tolerance
When multiple regions can accept writes, conflicts may arise. A user in US-West modifies file A while a user in EU-Central simultaneously modifies the same file. Without careful coordination, one update overwrites the other. Solutions include: single-master architectures (one region accepts writes, others are read-only), conflict-free replicated data types (CRDTs), last-writer-wins with vector clocks, or application-level conflict resolution.
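As one concrete piece of that toolbox, vector clocks let replicas distinguish "one update strictly follows the other" from "the updates were concurrent and need resolution". The comparison below is a minimal sketch, not any particular system's implementation; the region names mirror the example above.

```python
def compare(vc_a: dict[str, int], vc_b: dict[str, int]) -> str:
    """Compare two vector clocks: 'a<b', 'b<a', 'equal', or 'concurrent'."""
    keys = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(k, 0) <= vc_b.get(k, 0) for k in keys)
    b_le_a = all(vc_b.get(k, 0) <= vc_a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a<b"
    if b_le_a:
        return "b<a"
    return "concurrent"  # a true conflict: neither update saw the other

# US-West and EU-Central both edit file A starting from the same version:
base    = {"us-west": 3, "eu-central": 7}
us_edit = {"us-west": 4, "eu-central": 7}  # US-West bumps its own entry
eu_edit = {"us-west": 3, "eu-central": 8}  # EU-Central bumps its own entry

print(compare(base, us_edit))     # a<b: the edit supersedes the base
print(compare(us_edit, eu_edit))  # concurrent: must be resolved, not overwritten
```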
We've explored how distributed file systems use replication to achieve durability, availability, and performance. Let's consolidate the key insights:
- Replication turns individually unreliable hardware into a reliable system: with independent failures, three replicas cut the annual loss probability per object from roughly 2% to 0.0008%.
- Placement matters as much as the replica count; replicas must be spread across racks (and sometimes regions) so that failures stay close to independent.
- Write protocols trade latency for durability: pipelined synchronous replication (GFS/HDFS) and quorum protocols (W + R > N) sit at different points on that spectrum.
- Failure detection and automatic re-replication turn redundancy into availability, but recovery traffic must be prioritized and throttled to avoid recovery storms.
- Erasure coding cuts storage overhead from 3× to roughly 1.4-1.5× for cold data, at the cost of CPU and degraded reads.
- Geo-replication extends these ideas across regions, where WAN latency usually forces asynchronous replication and explicit conflict handling.
What's next:
With caching and replication in place, we still need to address the fundamental question: what guarantees does the system provide about the view of data seen by different clients? Next, we'll explore consistency models in depth—the formal frameworks that define what applications can expect from distributed file systems.
You now understand how distributed file systems use replication for fault tolerance, including replica placement strategies, write protocols, failure recovery, and the tradeoffs between full replication and erasure coding. Next, we'll formally explore consistency models.