In a distributed file system spanning thousands of machines, failure is not a possibility—it's a certainty. Hard drives fail. Servers crash. Network links go down. Power outages affect entire racks. The question isn't whether components will fail, but when—and how often.
Google's early infrastructure experience is instructive: with thousands of commodity servers, they expected 1-2 machine failures per day in a large cluster. Traditional approaches—expensive hardware, careful maintenance—couldn't scale to this environment.
Replication is the fundamental answer to this challenge. By maintaining multiple copies of data on different machines, a distributed file system can tolerate failures without data loss. When a storage node fails, replicas on surviving nodes ensure continuous availability. When the failed node recovers or is replaced, the system automatically restores the target replication level.
But replication is far more than "just copy the data." The design of a replication system determines durability guarantees, write performance, read distribution, recovery time, and storage costs. Getting replication right is essential to building reliable distributed storage.
By the end of this page, you will understand replication strategies in distributed file systems, including replica placement, synchronization protocols, consistency guarantees, and recovery mechanisms. You'll see how systems balance durability, availability, performance, and cost.
Replication serves multiple critical purposes in distributed file systems. Understanding these purposes helps in designing appropriate replication strategies.
The four pillars of replication:
- Durability: data survives disk, node, and rack failures because other copies remain
- Availability: data stays accessible while some replicas are down or unreachable
- Performance: reads can be served from the closest or least-loaded replica
- Recovery: surviving replicas are the source for restoring the target replication level after a failure
Calculating failure probability:
The probability of data loss with independent replicas follows a simple model:
Given:
- Probability of single disk failure in a year: p = 2% (typical for HDDs)
- Replication factor: n = number of copies
- Assumption: failures are independent (requires careful placement)
Probability of losing all replicas:
- 1 replica: P(loss) = p = 2%
- 2 replicas: P(loss) = p² = 0.04%
- 3 replicas: P(loss) = p³ = 0.0008% (8 in a million)
- 4 replicas: P(loss) = p⁴ = 0.000016% (16 in 100 million)
With 3 replicas and 1 million files:
- Expected files lost per year: 8
- With 4 replicas: 0.16 files lost per year
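These figures are easy to reproduce. The short sketch below (function names are illustrative, not from any particular system) recomputes the loss probabilities and expected annual losses under the same simplifying assumption of independent failures:

```python
# Probability of losing every replica of an object, assuming independent
# failures with annual per-disk failure probability p.
def p_all_replicas_lost(p: float, replicas: int) -> float:
    return p ** replicas

def expected_objects_lost(p: float, replicas: int, num_objects: int) -> float:
    return num_objects * p_all_replicas_lost(p, replicas)

p = 0.02  # 2% annual disk failure probability (typical HDD)
for n in (1, 2, 3, 4):
    print(f"{n} replica(s): P(loss) = {p_all_replicas_lost(p, n):.6%}")

print(expected_objects_lost(p, 3, 1_000_000))  # ~8 files lost per year
print(expected_objects_lost(p, 4, 1_000_000))  # ~0.16 files lost per year
```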
The independence assumption is critical:
These calculations assume failures are independent. But failures are often correlated: nodes in the same rack share a power supply and top-of-rack switch, disks from the same manufacturing batch can fail together, and a datacenter-wide outage takes down every replica stored there.
Effective replication must place replicas to minimize correlation—across racks, zones, or data centers.
Durability means data isn't lost; availability means data is accessible. Three replicas provide high durability, but if a write requires all three replicas to acknowledge, a single slow or failed replica blocks the write—reducing availability. This tension between durability and availability is fundamental to replication design.
Different distributed file systems employ different replication strategies, each with distinct tradeoffs. The choice depends on workload characteristics, consistency requirements, and cost constraints.
| Strategy | Description | Pros | Cons | Use Case |
|---|---|---|---|---|
| Full Replication | Every replica stores complete, identical copy | Simple, any replica can serve reads | Storage overhead = n× | HDFS, GFS, most DFS |
| Erasure Coding | Data encoded into fragments; subset sufficient to reconstruct | Lower storage overhead (e.g., 1.5×) | Higher CPU, complex writes | Cold data, archival storage |
| Primary-Backup | One primary handles writes; backups replicate | Clear consistency model | Primary is bottleneck | Traditional databases |
| Chain Replication | Writes flow through a chain; tail confirms | Strong consistency, simple | Chain length affects latency | Facebook Haystack |
| Quorum-Based | Writes to W replicas; reads from R; W+R>N | Tunable consistency/availability | Complex reasoning | Dynamo, Cassandra |
Full replication in HDFS/GFS:
The GFS (and later HDFS) model uses simple full replication:
File: /data/sales.csv (256 MB)
Replication factor: 3
Block 1 (128MB): stored on DataNode[4], DataNode[7], DataNode[15]
Block 2 (128MB): stored on DataNode[2], DataNode[11], DataNode[19]
Total storage used: 256MB × 3 = 768 MB
Read operation:
- Client gets block locations from NameNode
- Picks closest replica (by network topology)
- Reads directly from that DataNode
Failure scenario:
- DataNode[7] fails
- Block 1 now has 2 replicas (under-replicated)
- NameNode schedules re-replication: DataNode[4] → DataNode[22]
- Block 1 restored to 3 replicas
This approach is simple, proven, and works well for hot (frequently accessed) data. The 3× storage overhead is acceptable for data that needs fast access.
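The "pick the closest replica" step can be expressed with HDFS-style topology distances (0 for the same node, 2 for the same rack, 4 for a different rack). The sketch below is a simplification for illustration, not the actual HDFS client code:

```python
# HDFS-style network distance: 0 = same node, 2 = same rack, 4 = different rack.
def network_distance(reader: tuple[str, str], node: tuple[str, str]) -> int:
    """reader and node are (rack, host) pairs."""
    if reader == node:
        return 0
    if reader[0] == node[0]:
        return 2
    return 4

def pick_replica(reader, replicas):
    """Choose the replica closest to the reader in the topology tree."""
    return min(replicas, key=lambda node: network_distance(reader, node))

block1 = [("rack1", "DN4"), ("rack1", "DN7"), ("rack3", "DN15")]
print(pick_replica(("rack1", "DN9"), block1))   # ('rack1', 'DN4') - same rack
print(pick_replica(("rack3", "DN15"), block1))  # ('rack3', 'DN15') - local node
```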
For cold or archival data, erasure coding reduces storage overhead dramatically. HDFS EC with a 6+3 scheme stores 9 blocks for every 6 data blocks (1.5× overhead vs. 3× for replication), while still tolerating 3 failures. The tradeoff: higher CPU cost for encoding/decoding and degraded read performance when blocks are missing.
Where replicas are placed is as important as how many exist. Poor placement can make theoretically redundant data vulnerable to correlated failures. Good placement balances failure independence against network efficiency.
HDFS default placement policy (3 replicas):
                         Cluster
             ┌──────────────┴──────────────┐
           Rack 1                        Rack 2
    ┌────────────────────┐      ┌────────────────────┐
    │ Node A ◄─ Replica 1│      │ Node D ◄─ Replica 2│
    │   (writer's node)  │      │ Node E ◄─ Replica 3│
    │ Node B             │      │                    │
    │ Node C             │      │                    │
    └────────────────────┘      └────────────────────┘
Placement rationale:
- Replica 1: The writer's own node (or its rack) for maximum write throughput
- Replica 2: A node in a different rack, so the data survives a whole-rack failure
- Replica 3: A different node in that same remote rack, so only one copy crosses the rack boundary during the write (intra-rack bandwidth is cheaper than inter-rack)
This policy provides data safety (survives rack failure) while optimizing for common hardware configurations where intra-rack bandwidth exceeds inter-rack bandwidth.
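The default policy can be summarized as a short placement function. The sketch below is a simplification (no load, free-space, or decommissioning checks, unlike HDFS's real BlockPlacementPolicyDefault): the writer's node first, then a node in a different rack, then a second node in that same remote rack.

```python
import random

def place_replicas(topology: dict[str, str], writer: str) -> list[str]:
    """Simplified 3-replica rack-aware placement.
    topology maps node name -> rack name; writer is the client's DataNode."""
    local_rack = topology[writer]
    # Replica 2: any node in a different rack
    remote_nodes = [n for n, rack in topology.items() if rack != local_rack]
    replica2 = random.choice(remote_nodes)
    # Replica 3: a different node in the same remote rack as replica 2
    remote_rack = topology[replica2]
    candidates = [n for n, rack in topology.items()
                  if rack == remote_rack and n != replica2]
    replica3 = random.choice(candidates)
    return [writer, replica2, replica3]

topology = {"A": "rack1", "B": "rack1", "C": "rack1",
            "D": "rack2", "E": "rack2"}
print(place_replicas(topology, writer="A"))  # e.g. ['A', 'D', 'E']
```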
Ceph CRUSH placement:
Ceph's CRUSH algorithm provides highly configurable placement through placement rules:
rule replicated_ruleset {
ruleset 1
type replicated
min_size 1
max_size 10
step take default # Start at cluster root
step chooseleaf firstn 0 type rack # Pick N racks
step emit # Place on OSDs in chosen racks
}
# This rule spreads replicas across racks
# "firstn 0" means use the replication factor
# Each replica lands on a different rack
# More complex example: cross-datacenter
rule cross_dc {
step take datacenter1
step chooseleaf firstn 2 type host # 2 replicas in DC1
step emit
step take datacenter2
step chooseleaf firstn 1 type host # 1 replica in DC2
step emit
}
CRUSH allows encoding arbitrary placement policies—across racks, across buildings, across continents—all computed algorithmically without central lookup.
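CRUSH itself is more involved, but its key property (every client computes placement deterministically from the object name, with no lookup table) can be illustrated with rendezvous, i.e. highest-random-weight, hashing. This is a stand-in for CRUSH, not the real algorithm, and the rack/OSD names are made up:

```python
import hashlib

def score(obj: str, bucket: str) -> int:
    """Deterministic pseudo-random score for an (object, bucket) pair."""
    digest = hashlib.sha256(f"{obj}/{bucket}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def place(obj: str, racks: dict[str, list[str]], replicas: int = 3) -> list[str]:
    """Rendezvous-style placement: rank racks by score, then pick the
    highest-scoring host in each of the top `replicas` racks. Any client
    computes the same answer from the object name alone."""
    top_racks = sorted(racks, key=lambda r: score(obj, r), reverse=True)[:replicas]
    return [max(racks[r], key=lambda h: score(obj, h)) for r in top_racks]

racks = {"rack1": ["osd1", "osd2"], "rack2": ["osd3", "osd4"], "rack3": ["osd5"]}
print(place("/data/sales.csv", racks))  # one OSD from each of three racks
```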
When data is written, all replicas must eventually receive the write. The write protocol determines when writes are acknowledged and how replicas are synchronized.
Synchronous vs. asynchronous replication:
Synchronous Replication
Write acknowledged only after all replicas confirm persistence.
Client → Primary → Replica1 → Replica2
↑________________________________│
ACK (after all persist)
Characteristics:
- Write latency includes the slowest replica's persistence time
- No acknowledged write is lost as long as any replica survives
- A slow or failed replica delays or blocks writes, reducing availability
Asynchronous Replication
Write acknowledged after primary persists; replicas updated in background.
Client → Primary
↑______│ (immediate ACK)
└→ Replica1, Replica2 (background)
Characteristics:
- Write latency depends only on the primary's persistence time
- Acknowledged writes can be lost if the primary fails before replicating them
- Replicas may lag behind, so reads served from them can return stale data
GFS/HDFS write pipeline:
GFS and HDFS use an innovative pipelined write protocol that's synchronous but efficient:
HDFS Write Pipeline:
1. Client asks NameNode for block allocation
NameNode returns: [DN1, DN2, DN3] (ordered by distance)
2. Client connects to DN1 (first in pipeline)
DN1 connects to DN2
DN2 connects to DN3
3. Client streams data to DN1 in packets (64KB each)
Each packet is forwarded down the pipeline:
Client → DN1 → DN2 → DN3
4. Acknowledgments flow back up:
DN3 → DN2 → DN1 → Client
5. When all packets ACKed, block write complete
Data flows down the pipeline; acknowledgments flow back up:

  data:  Client ──► DN1 ──► DN2 ──► DN3
  ACKs:  Client ◄── DN1 ◄── DN2 ◄── DN3
Why pipelining?
Without pipelining, the client would send the data three times (once to each replica), saturating its own uplink. With pipelining:
- The client sends each byte exactly once; each DataNode forwards data to the next node while still receiving it, so every machine's full outbound bandwidth is used.
- End-to-end time is close to the time to transfer the block once, plus a small per-hop delay (roughly one packet per additional replica), as the rough model below illustrates.
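A back-of-the-envelope model (not from the GFS or HDFS papers; the bandwidth and packet-size numbers are assumptions) shows the difference between fanning out from the client and forwarding along the pipeline:

```python
def fanout_write_time(block_mb: float, client_bw_mb_s: float, replicas: int) -> float:
    """Client sends the full block to each replica over its own uplink."""
    return replicas * block_mb / client_bw_mb_s

def pipelined_write_time(block_mb: float, link_bw_mb_s: float,
                         replicas: int, packet_mb: float = 0.064) -> float:
    """Client sends once; each extra hop adds roughly one packet of delay.
    Ignores per-packet overhead and assumes equal link bandwidth."""
    return block_mb / link_bw_mb_s + (replicas - 1) * packet_mb / link_bw_mb_s

# 128 MB block, 100 MB/s links, 3 replicas
print(fanout_write_time(128, 100, 3))     # 3.84 s: client uplink is the bottleneck
print(pipelined_write_time(128, 100, 3))  # ~1.28 s: close to a single transfer
```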
Pipelined writes are only as fast as the slowest replica. If DN2 has a failing disk, all writes through that pipeline slow down. HDFS addresses this through: (1) Pipeline recovery—excluding the slow node and continuing with remaining nodes, (2) Hedged reads—reading from multiple replicas in parallel and using the first response, (3) Aggressive failure detection—marking slow nodes as dead.
With multiple replicas, ensuring consistency—that all replicas agree on the data—requires careful protocol design. Different systems offer different consistency guarantees based on their requirements.
Consistency models for replicated data:
| Model | Guarantee | Implementation | Cost |
|---|---|---|---|
| Strong Consistency | All reads see latest write | Synchronous replication or consensus | High latency, reduced availability |
| Sequential Consistency | All clients see operations in same order | Single writer or global ordering | Medium overhead |
| Eventual Consistency | Replicas converge given time without updates | Asynchronous replication + conflict resolution | Low latency, complex handling |
| Read-Your-Writes | Client sees its own writes immediately | Session-sticky reads or version tracking | Low overhead |
| Monotonic Reads | Client never sees older data than already seen | Version tracking, minimum read version | Low overhead |
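As a concrete example of the lightweight session guarantees in the table, the sketch below (a hypothetical client-side helper, not any system's API) tracks the highest version a client has written or read and refuses older replica responses, which yields read-your-writes and monotonic reads:

```python
class Session:
    """Client-side version tracking for read-your-writes / monotonic reads."""

    def __init__(self) -> None:
        self.min_version = 0  # highest version this client has written or seen

    def on_write_ack(self, version: int) -> None:
        self.min_version = max(self.min_version, version)

    def accept_read(self, version: int) -> bool:
        """Return False for responses older than what we already saw;
        the caller should retry against another (fresher) replica."""
        if version < self.min_version:
            return False
        self.min_version = version
        return True

s = Session()
s.on_write_ack(7)          # this client wrote version 7
print(s.accept_read(6))    # False: replica is lagging behind our own write
print(s.accept_read(8))    # True, and future reads must now be >= 8
```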
Quorum protocols:
Many distributed systems use quorum-based protocols to balance consistency and availability:
Quorum Protocol:
- N = total replicas
- W = replicas that must acknowledge a write
- R = replicas that must respond to a read
Rule: W + R > N → Guarantees read sees latest write
Examples with N=3:
W=2, R=2: Moderate read/write latency, strong consistency
Writes: need 2 of 3
Reads: need 2 of 3 (at least 1 overlap with write)
W=3, R=1: Slow writes, fast reads, strong consistency
Good for read-heavy workloads
W=1, R=3: Fast writes, slow reads, strong consistency
Good for write-heavy workloads
W=1, R=1: Fast both, but NO consistency guarantee
May read stale data
Object: X
Version: 5 ← Written with W=2
┌─────┐ ┌─────┐ ┌─────┐
│Rep1 │ │Rep2 │ │Rep3 │
│v=5 │ │v=5 │ │v=4 │ ← Rep3 missed update
└─────┘ └─────┘ └─────┘
Read with R=2 returns v=5: any two of the three replicas must include
at least one of Rep1 or Rep2, which hold the latest version, and the
client takes the highest version among the responses.
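A minimal sketch of the read side of this protocol (illustrative data structures, not Dynamo's or Cassandra's API): query any R replicas and return the highest-versioned response, which must include the latest committed write when W + R > N.

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    version: int
    value: str

# Replica state for object X: Rep3 missed the latest update.
replicas = {
    "Rep1": Versioned(5, "new"),
    "Rep2": Versioned(5, "new"),
    "Rep3": Versioned(4, "old"),
}

def quorum_read(respondents: list[str], R: int) -> Versioned:
    """Collect responses from any R replicas and keep the highest version."""
    responses = [replicas[name] for name in respondents[:R]]
    return max(responses, key=lambda v: v.version)

print(quorum_read(["Rep2", "Rep3"], R=2))  # Versioned(version=5, value='new')
```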
HDFS largely sidesteps consistency complexity through its append-only model. Blocks are immutable once written; there's no conflict between readers and writers of the same block. Consistency is ensured by completing the write pipeline before making the block visible. The NameNode is the single source of truth for which blocks are finalized.
Replication provides the raw material for fault tolerance, but failure detection and recovery mechanisms actually realize the benefits. The system must detect when replicas are lost and restore the replication factor.
Failure detection mechanisms:
- Heartbeats: storage nodes report to the metadata server (the NameNode in HDFS) every few seconds; missed heartbeats first mark a node stale, then dead
- Block reports: nodes periodically enumerate the replicas they hold, letting the metadata server detect missing or unexpected replicas
- Checksum verification: data is checksummed on write and verified on read and by background scanners, so silently corrupted replicas are detected and replaced
HDFS recovery workflow:
HDFS Block Recovery Process:
1. DETECTION: DataNode stops heartbeating
NameNode: no heartbeat from DN7 within the timeout
(marked stale after ~30 seconds; declared dead after ~10.5 minutes by default)
NameNode marks DN7 as dead
2. IDENTIFICATION: Find under-replicated blocks
NameNode scans: "DN7 stored blocks: [blk_101, blk_102, blk_103,...]"
These blocks now have replication < target
3. PRIORITIZATION: Rank blocks by urgency
Priority 1: Only 1 replica remaining (critical!)
Priority 2: Only 2 replicas remaining
Priority 3: Decommissioning node's blocks
4. SCHEDULING: Choose source and destination
For blk_101:
Current replicas: [DN3, DN12] (need 3rd copy)
Choose source: DN3 (lowest load)
Choose destination: DN19 (different rack, space available)
5. REPLICATION: Execute copy
NameNode → DN3: "Send blk_101 to DN19"
DN3 streams data to DN19
DN19 confirms receipt to NameNode
6. COMPLETION: Update metadata
NameNode updates block locations: blk_101 → [DN3, DN12, DN19]
Replication factor restored!
Typical recovery time:
- Detection: ~30 seconds to mark a node stale; roughly 10 minutes (by default) before it is declared dead
- Per-block copy: a 128 MB block at 100 MB/s is ~1.3 seconds of transfer; a few seconds each with scheduling overhead and throttling
- Full recovery after node failure: minutes to hours
(depends on data volume and cluster bandwidth)
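The "minutes to hours" range follows directly from data volume and how much bandwidth recovery is allowed to use. The sketch below estimates it under illustrative assumptions (per-node bandwidth, throttling fraction, and node counts are made up, not measurements):

```python
def recovery_time_hours(data_tb: float, source_nodes: int,
                        node_bw_mb_s: float = 100.0,
                        recovery_fraction: float = 0.25) -> float:
    """Time to re-replicate a failed node's data when the work is spread
    over many source nodes, each throttled to a fraction of its bandwidth."""
    data_mb = data_tb * 1_000_000
    aggregate_mb_s = source_nodes * node_bw_mb_s * recovery_fraction
    return data_mb / aggregate_mb_s / 3600

# 8 TB on the failed node, 100 surviving source nodes, 25% of 100 MB/s each
print(f"{recovery_time_hours(8, 100):.1f} h")  # ~0.9 h
# Same data, but only 10 nodes share the recovery work
print(f"{recovery_time_hours(8, 10):.1f} h")   # ~8.9 h
```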
When a heavily-loaded node fails, the system must replicate potentially terabytes of data. This 'recovery storm' can saturate network bandwidth and slow normal operations. Well-designed systems throttle recovery traffic, prioritize critical blocks, and spread recovery load across many source nodes rather than concentrating on a few.
Full replication's 3× storage overhead is substantial. For cold or infrequently accessed data, erasure coding provides fault tolerance with significantly lower storage costs.
How erasure coding works:
Erasure coding transforms k data blocks into n coded blocks (n > k), such that any k of the n blocks can reconstruct the original data.
Reed-Solomon (6,3) Example:
Original data blocks: D1, D2, D3, D4, D5, D6 (6 data blocks)
Parity blocks computed: P1, P2, P3 (3 parity blocks)
Total blocks stored: 9
Storage overhead: 9/6 = 1.5× (vs. 3× for replication)
Recovery capability:
- Can lose ANY 3 blocks (any combination of data + parity)
- Reconstruct lost blocks from surviving 6 blocks
Example:
Stored: [D1, D2, D3, D4, D5, D6, P1, P2, P3]
After 3 failures: [D1, __, D3, D4, __, D6, P1, __, P3]
Reconstruction: Use D1, D3, D4, D6, P1, P3 → recover D2, D5, P2
D1 D2 D3 D4 D5 D6
│ │ │ │ │ │
└────┴────┴────┴────┴────┴──→ Encoder
│
┌──────────────────────────────────┘
↓
P1 = f(D1,D2,D3,D4,D5,D6)
P2 = g(D1,D2,D3,D4,D5,D6)
P3 = h(D1,D2,D3,D4,D5,D6)
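Full Reed-Solomon arithmetic is too long for a short example, but the core idea, recomputing a missing block from the survivors, already appears with a single XOR parity block (the RAID-5 scheme). The sketch below tolerates one lost block; RS(6,3) generalizes the same idea to three parity blocks and any three losses.

```python
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# Six data blocks plus one XOR parity block (a 6+1 code, tolerating 1 loss).
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD", b"EEEE", b"FFFF"]
parity = xor_blocks(data)

# Lose D2; reconstruct it from the five surviving data blocks plus parity.
survivors = data[:1] + data[2:]
recovered = xor_blocks(survivors + [parity])
assert recovered == b"BBBB"
print("recovered D2:", recovered)
```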
| Aspect | 3× Replication | RS (6,3) | RS (10,4) |
|---|---|---|---|
| Storage Overhead | 3.0× | 1.5× | 1.4× |
| Fault Tolerance | 2 failures | 3 failures | 4 failures |
| Read Performance | Any replica | All k blocks needed | All k blocks needed |
| Degraded Read | Read from other replica | Decode from k blocks (CPU expensive) | Decode from k blocks |
| Write Performance | Write 3 copies | Encode + write n blocks | Encode + write n blocks |
| Recovery I/O | Copy 1 block | Read k blocks, reconstruct | Read k blocks, reconstruct |
| Best For | Hot data, low latency | Cold data, cost-sensitive | Archival data |
HDFS Erasure Coding:
HDFS 3.x added native erasure coding support:
# Enable EC for a directory
hdfs ec -enablePolicy -policy RS-6-3-1024k
hdfs ec -setPolicy -path /cold-data -policy RS-6-3-1024k
# Files in /cold-data now use erasure coding
# Available policies:
# RS-3-2-1024k:  3 data + 2 parity,  1 MB striping cell
# RS-6-3-1024k:  6 data + 3 parity,  1 MB striping cell (default policy)
# RS-10-4-1024k: 10 data + 4 parity, 1 MB striping cell
When to use erasure coding:
Erasure coding works best for large files. With an RS 6+3 layout, a file needs at least six full data cells to fill one stripe. A 3 MB file with 1 MB cells occupies only 3 data cells yet still requires 3 parity cells, so it pays 2× overhead with none of the intended savings, and produces many tiny internal blocks. Small files should use replication or be combined into larger objects before EC.
For global services, data must be available across multiple geographic regions. Geo-replication extends the principles of local replication to wide-area networks, adding new challenges around latency, bandwidth costs, and regional failures.
Why geo-replicate?
- Latency: serving reads from a nearby region avoids 50-200 ms cross-continent round trips
- Disaster tolerance: a replica in another region survives the loss of an entire datacenter or region
- Locality: some deployments must keep or serve data in specific geographic regions
Challenges of geo-replication:
| Metric | Intra-datacenter | Inter-datacenter |
|---|---|---|
| Latency | < 1 ms | 50-200 ms |
| Bandwidth | 10-100 Gbps | 1-10 Gbps |
| Cost | Low (internal) | High (WAN/transit) |
| Reliability | High | Variable |
Implication:
- Synchronous replication across datacenters adds 100-400ms to writes
- For most applications, this is unacceptable
- Asynchronous geo-replication is common but risks data loss in disaster
Geo-replication strategies:
- Asynchronous replication with an RPO: writes commit locally and stream to remote regions in the background; the recovery point objective bounds how much recently written data a regional disaster can lose
- Synchronous replication for critical data: a small subset of data pays the cross-region write latency so that no acknowledged write is ever lost
- Tiered geo-replication: synchronous within a region or between nearby datacenters, asynchronous to distant regions, balancing write latency against disaster tolerance
When multiple regions can accept writes, conflicts may arise. A user in US-West modifies file A while a user in EU-Central simultaneously modifies the same file. Without careful coordination, one update overwrites the other. Solutions include: single-master architectures (one region accepts writes, others are read-only), conflict-free replicated data types (CRDTs), last-writer-wins with vector clocks, or application-level conflict resolution.
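As one concrete piece of that toolbox, vector clocks let replicas distinguish "one update strictly follows the other" from "the updates were concurrent and need resolution". The comparison below is a minimal sketch, not any particular system's implementation; the region names mirror the example above.

```python
def compare(vc_a: dict[str, int], vc_b: dict[str, int]) -> str:
    """Compare two vector clocks: 'a<b', 'b<a', 'equal', or 'concurrent'."""
    keys = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(k, 0) <= vc_b.get(k, 0) for k in keys)
    b_le_a = all(vc_b.get(k, 0) <= vc_a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a<b"
    if b_le_a:
        return "b<a"
    return "concurrent"  # a true conflict: neither update saw the other

# US-West and EU-Central both edit file A starting from the same version:
base    = {"us-west": 3, "eu-central": 7}
us_edit = {"us-west": 4, "eu-central": 7}  # US-West bumps its own entry
eu_edit = {"us-west": 3, "eu-central": 8}  # EU-Central bumps its own entry

print(compare(base, us_edit))     # a<b: the edit supersedes the base
print(compare(us_edit, eu_edit))  # concurrent: must be resolved, not overwritten
```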
We've explored how distributed file systems use replication to achieve durability, availability, and performance. Let's consolidate the key insights:
- Replication turns individually unreliable hardware into a reliable system: with independent failures, three replicas cut the annual loss probability per object from roughly 2% to 0.0008%.
- Placement matters as much as the replica count; replicas must be spread across racks (and sometimes regions) so that failures stay close to independent.
- Write protocols trade latency for durability: pipelined synchronous replication (GFS/HDFS) and quorum protocols (W + R > N) sit at different points on that spectrum.
- Failure detection and automatic re-replication turn redundancy into availability, but recovery traffic must be prioritized and throttled to avoid recovery storms.
- Erasure coding cuts storage overhead from 3× to roughly 1.4-1.5× for cold data, at the cost of CPU and degraded reads.
- Geo-replication extends these ideas across regions, where WAN latency usually forces asynchronous replication and explicit conflict handling.
What's next:
With caching and replication in place, we still need to address the fundamental question: what guarantees does the system provide about the view of data seen by different clients? Next, we'll explore consistency models in depth—the formal frameworks that define what applications can expect from distributed file systems.
You now understand how distributed file systems use replication for fault tolerance, including replica placement strategies, write protocols, failure recovery, and the tradeoffs between full replication and erasure coding. Next, we'll formally explore consistency models.