In a single-machine file system, the question "what data will I read?" has a simple answer: the most recent write. There's one copy of the data, one kernel managing access, and a well-defined order of operations.
But distributed file systems shatter this simplicity. With multiple replicas, network delays, and concurrent clients, fundamental questions become surprisingly complex: Which write will my read observe? When do other clients see my changes? What happens when two clients write the same file at the same time?
Consistency models are the formal frameworks that answer these questions. They define the contract between the distributed file system and its applications—what guarantees the system provides, and what assumptions applications can safely make.
By the end of this page, you will understand the spectrum of consistency models from strict to eventual, how different distributed file systems implement these models, and how to reason about the tradeoffs between consistency, availability, and performance. You'll be able to analyze what guarantees a given DFS provides and whether they match your application's requirements.
Before diving into specific consistency models, let's understand why consistency in distributed systems is fundamentally challenging.
The core problem: no global clock or instant communication
In a distributed system:
- There is no global clock: every node keeps its own clock, clocks drift, and timestamps alone cannot order events across machines.
- Communication is not instantaneous: messages take milliseconds (or more, across continents) to arrive, and can be delayed, reordered, or lost.
- Nodes fail independently: a replica can miss updates while it is down or partitioned and must catch up later.
These fundamental limitations make it impossible to provide the same consistency guarantees as a single-machine system without paying significant performance costs.
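To see why the lack of a global clock matters, here is a minimal sketch (the node names and skew values are hypothetical) showing how clock drift can reverse the apparent order of two writes:

# Hypothetical clock skew per node: node_b's clock runs 120 ms behind real time.
NODE_CLOCK_SKEW = {"node_a": 0.000, "node_b": -0.120}

def local_timestamp(node, true_time):
    """Timestamp an event using the node's own (skewed) wall clock."""
    return true_time + NODE_CLOCK_SKEW[node]

# Real order: node_a writes first, node_b writes 50 ms later.
write_a = ("node_a writes x", local_timestamp("node_a", 10.000))  # stamped 10.000
write_b = ("node_b writes x", local_timestamp("node_b", 10.050))  # stamped  9.930

# Sorting by local timestamps puts B's later write before A's earlier one.
print(sorted([write_a, write_b], key=lambda event: event[1]))
# [('node_b writes x', 9.93), ('node_a writes x', 10.0)]

Ordering by wall-clock timestamps silently produces the wrong history, which is why distributed systems fall back on logical clocks, version vectors, or consensus to order operations.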
A thought experiment:
Consider this scenario:
Initially: x = "red" on both nodes

Time      Node A (US-West)      Node B (US-East)       Network
─────────────────────────────────────────────────────────────────────
t=0       write(x, "blue")                             ↓ (send update to B)
t=50ms                                                 [update in transit, 100ms delay]
t=75ms                          read(x) → ???
t=100ms                                                [update arrives at B]
t=125ms                         read(x) → "blue"
At t=75ms, what should B's read return?
Option 1: Block until update arrives (100ms delay) → Consistent but slow
Option 2: Return stale/default value ("red") → Fast but inconsistent
Option 3: Error (value not yet known) → Honest but unhelpful
There's no "right" answer—only tradeoffs. Different consistency models represent different choices in this tradeoff space.
The CAP theorem (Brewer's theorem) formalizes this: during a network partition, a distributed system must choose between Consistency (all nodes see the same data) and Availability (all requests receive a response). You cannot have both. Real distributed file systems position themselves somewhere in this tradeoff space.
Consistency models form a spectrum from strongest (most intuitive but slowest) to weakest (fastest but most surprising). Understanding this spectrum helps you choose appropriate models for different use cases.
The consistency hierarchy:
| Model | Guarantee | Performance | DFS Examples |
|---|---|---|---|
| Strict Consistency | All reads return the absolute latest write | Impossible in distributed systems | Theoretical ideal only |
| Linearizability | Operations appear instantaneous, in real-time order | Slow (requires synchronization) | Spanner (with TrueTime) |
| Sequential Consistency | All processes see operations in same order | Medium (requires ordering) | Some POSIX file operations |
| Causal Consistency | Causally related operations seen in order | Medium (requires dependency tracking) | Some distributed databases |
| Session Consistency | Client's own operations seen in order | Good (requires session affinity) | Many DFS (close-to-open) |
| Eventual Consistency | All replicas converge eventually (no time bound) | Best (async propagation) | S3, DNS, many object stores |
Key insight:
Stronger consistency models guarantee more intuitive behavior but require more coordination (synchronous communication, consensus protocols, locking). This coordination adds latency and can reduce availability during partitions.
Weaker consistency models allow more concurrent, independent operation but require applications to handle potentially stale or out-of-order data.
Linearizability is the gold standard of consistency—it provides the illusion that operations happen instantaneously, in an order consistent with real time.
Definition:
A system is linearizable if:
- Every operation appears to take effect atomically at a single instant (its linearization point) somewhere between its invocation and its response.
- The order of these instants is consistent with real time: if operation A completes before operation B begins, A is ordered before B.
- Every read returns the value written by the most recent write in that order.
Linearizable execution:

Real time:
────────────────────────────────────────────────────────────────
Client A:  |--write(x,1)---|                        |--read(x)--→3|
Client B:                    |--read(x)--→1|  |--write(x,3)--|
Client C:                         |--read(x)--→1 or 3|
────────────────────────────────────────────────────────────────

Linearization points (one valid ordering):
  write(x,1)  →  read(x)→1  →  write(x,3)  →  read(x)→3
   Client A      Client B       Client B       Client A

Client C's read overlaps with write(x,3), so it can return 1 or 3
(but if it returns 3, any later read must also return 3)
Why linearizability is expensive:
To achieve linearizability, the system must:
- Funnel every write through a single point of serialization (a leader or consensus group) so all replicas agree on one order.
- Wait for a quorum of replicas to durably acknowledge each write before replying to the client.
- Ensure reads observe the latest committed write, for example by serving reads through the leader or using read quorums or leases.
Linearizable write protocol (simplified):
1. Client sends write to leader/coordinator
2. Leader assigns a sequence number
3. Leader replicates to quorum of replicas
4. All quorum members acknowledge
5. Leader commits and responds to client
Latency ≈ RTT(client ↔ leader)
        + RTT(leader ↔ slowest quorum member)
        + consensus protocol overhead
        + slowest quorum member's disk sync time
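A minimal sketch of that protocol follows, with toy leader and follower classes (real systems such as Raft or Multi-Paxos add leader election, retries, and log reconciliation on top of this skeleton):

class FollowerReplica:
    def __init__(self):
        self.log = []

    def append(self, entry):
        self.log.append(entry)      # a real replica would fsync before acknowledging
        return True                 # acknowledgement

class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.sequence = 0
        self.log = []

    def linearizable_write(self, key, value):
        # Step 2: assign a sequence number that fixes this write's position.
        self.sequence += 1
        entry = (self.sequence, key, value)

        # Steps 3-4: replicate and count acknowledgements until a majority is reached.
        acks = 1                                    # the leader votes for itself
        for follower in self.followers:
            if follower.append(entry):
                acks += 1
        quorum = (len(self.followers) + 1) // 2 + 1
        if acks < quorum:
            raise RuntimeError("write not committed: quorum unavailable")

        # Step 5: commit locally and respond to the client.
        self.log.append(entry)
        return self.sequence

leader = Leader([FollowerReplica(), FollowerReplica()])
print(leader.linearizable_write("x", "blue"))       # -> 1 (committed at quorum)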
Real-world linearizable systems:
- Coordination services such as ZooKeeper and etcd, which run consensus protocols (ZAB, Raft) for their metadata.
- Google Spanner, which combines Paxos groups with TrueTime to order transactions consistently with real time across datacenters.
Most distributed file systems do NOT provide linearizability for all operations—the performance cost is too high for file system workloads.
These terms are often confused. Linearizability is about single-object operations appearing atomic and real-time ordered. Serializability is about multi-object transactions appearing as if executed serially. A system can be linearizable but not serializable (no transactions), or serializable but not linearizable (transactions may reorder in non-real-time ways).
Sequential consistency and causal consistency offer useful middle-ground positions between linearizability and eventual consistency.
Sequential Consistency:
All processes see all operations in the same order. The order must be consistent with each process's program order, but need not match real time.
Sequentially consistent (valid):
────────────────────────────────────────────────────────────────
Client A:  write(x,1)          write(x,2)
Client B:          read(x)→?             read(x)→?
────────────────────────────────────────────────────────────────
Valid orderings:
  w(x,1) → w(x,2) → r(x)→2 → r(x)→2   ✓
  w(x,1) → r(x)→1 → w(x,2) → r(x)→2   ✓
  r(x)→1 → w(x,1) → ...               ✗ (B cannot read 1 before A writes it,
                                         unless an earlier write already stored 1)

Key: A's writes are seen by B in A's program order (1 before 2).
     B's reads need not reflect real time (a read issued after write(x,2)
     completes may still return 1), but every client observes the same
     total order of operations. A small checker for the program-order
     requirement is sketched below.
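This minimal sketch (with hypothetical operation labels) verifies that a proposed total order keeps each client's operations in the order that client issued them; a full sequential-consistency check would additionally verify that every read returns the value of the most recent preceding write in the chosen order.

def respects_program_order(total_order, programs):
    """True if each process's operations appear in total_order in program order."""
    position = {op: i for i, op in enumerate(total_order)}
    return all(
        position[earlier] < position[later]
        for program in programs
        for earlier, later in zip(program, program[1:])
    )

programs = [
    ["w(x,1)@A", "w(x,2)@A"],     # Client A's program order
    ["r1(x)@B", "r2(x)@B"],       # Client B's program order
]

valid   = ["w(x,1)@A", "r1(x)@B", "w(x,2)@A", "r2(x)@B"]
invalid = ["w(x,2)@A", "w(x,1)@A", "r1(x)@B", "r2(x)@B"]   # A's writes swapped

print(respects_program_order(valid, programs))    # True
print(respects_program_order(invalid, programs))  # False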
Causal Consistency:
Causally-related operations are seen by all processes in the same order. Concurrent (causally-independent) operations may be seen in different orders by different processes.
# Causal Consistency Example

"""
Causal relationship: If operation A could have influenced operation B,
then all observers must see A before B.

Examples of causal relationships:
1. Same process: A happens-before B in program order
2. Message: A is a write, B reads the value A wrote
3. Transitivity: A→B and B→C implies A→C

Concurrent (non-causal): If neither operation could have influenced
the other, they are concurrent and may be observed in any order.
"""

# Scenario: Social media post and comment
# ========================================

# Process A (Alice):
def alice_actions(client):
    # T1: Alice posts a photo
    client.write("/posts/photo_123", "Beach vacation pic")

    # T2: Alice sends a message to Bob about the post
    client.send_message("bob", "Check out my new post!")

# Process B (Bob):
def bob_actions(client):
    # T3: Bob receives the message from Alice
    msg = client.receive_message()  # Causal dependency on T2

    # T4: Bob reads Alice's post
    # Under causal consistency, Bob MUST see Alice's post because:
    # T1 → T2 (same process), T2 → T3 (message), T3 → T4 (program order)
    # Therefore: T1 → T4 by transitivity
    post = client.read("/posts/photo_123")  # Must return "Beach vacation pic"

# Process C (Carol) - independent:
def carol_actions(client):
    # T5: Carol reads the post (no causal link to Alice's message)
    # Carol might or might not see the post depending on timing
    post = client.read("/posts/photo_123")  # May return old value or new

"""
Causal consistency guarantees:
- Bob will see Alice's post (causally linked through the message)
- Carol might not see it yet (no causal link)

This is weaker than sequential consistency:
- Sequential: Carol would also see updates in a single global order
- Causal: Carol is only guaranteed to see causally-related updates in order

Implementation: vector clocks or dependency tracking
"""

class CausalClient:
    def __init__(self, node_id, num_nodes, server):
        self.node_id = node_id
        self.server = server
        # Vector clock: [node_0_time, node_1_time, ...]
        self.vector_clock = [0] * num_nodes

    def write(self, key, value):
        # Increment own logical time
        self.vector_clock[self.node_id] += 1
        # Send the write tagged with the current vector clock
        return self.server.write(key, value, self.vector_clock.copy())

    def read(self, key):
        # Include our vector clock so the server can respect causal dependencies:
        # it only returns values whose dependencies we have already observed
        value, server_clock = self.server.read(key, self.vector_clock)
        # Merge the server's clock into ours
        for i in range(len(self.vector_clock)):
            self.vector_clock[i] = max(self.vector_clock[i], server_clock[i])
        return value

Causal consistency is attractive because it captures what humans intuitively expect about cause and effect, while allowing more concurrency than sequential consistency. Systems like COPS and Eiger implement causal consistency for geo-replicated data stores. However, tracking causal dependencies adds overhead, and the programming model is still more complex than strong consistency.
Many distributed file systems provide client-centric consistency guarantees—promises made to each individual client about its own view of the data, rather than global guarantees about all clients.
The four client-centric guarantees:
- Read-your-writes: after a client writes a value, that client's subsequent reads return that value (or a newer one).
- Monotonic reads: once a client has seen a value, its later reads never return anything older.
- Monotonic writes: a client's writes are applied at every replica in the order the client issued them.
- Writes-follow-reads: a write issued after a read is ordered after the write whose value was read.

A minimal sketch of enforcing the first guarantee with a session token follows.
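This sketch uses toy Primary, LaggingReplica, and Session classes (all names hypothetical): the session remembers the highest version it has written and refuses any replica answer older than that.

class Primary:
    """Toy primary copy: assigns a monotonically increasing version per write."""
    def __init__(self):
        self.data = {}
        self.version = 0

    def write(self, key, value):
        self.version += 1
        self.data[key] = (value, self.version)
        return self.version

    def read(self, key):
        return self.data.get(key, (None, 0))

class LaggingReplica:
    """Toy follower that has only replicated writes up to applied_version."""
    def __init__(self, primary, applied_version=0):
        self.primary = primary
        self.applied_version = applied_version

    def read(self, key):
        value, version = self.primary.read(key)
        if version <= self.applied_version:
            return value, version
        return None, self.applied_version        # newer write not yet replicated

class Session:
    """Client session enforcing read-your-writes via a version token."""
    def __init__(self, primary, read_replicas):
        self.primary = primary
        self.read_replicas = read_replicas
        self.last_written_version = 0            # the session token

    def write(self, key, value):
        self.last_written_version = self.primary.write(key, value)

    def read(self, key):
        for replica in self.read_replicas:
            value, version = replica.read(key)
            if version >= self.last_written_version:
                return value                     # new enough for this session
        raise TimeoutError("no replica has caught up to this session's writes")

primary = Primary()
stale_follower = LaggingReplica(primary, applied_version=0)
session = Session(primary, read_replicas=[stale_follower, primary])

session.write("profile", "new avatar")
print(session.read("profile"))   # "new avatar" -- the stale follower is skipped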
Session semantics in distributed file systems:
Many POSIX-like distributed file systems implement close-to-open consistency (a form of session semantics):
Close-to-Open (Session) Consistency:
1. When a file is closed, all modifications are flushed to the server
2. When a file is opened, the client fetches the latest version from server
3. Between open and close, the client may work with a cached version
4. Concurrent writers to the same file have undefined behavior
Timeline:
────────────────────────────────────────────────────────────────
Client A:  open()───write("v1")───close()
                                     ↓ (flush to server)
Client B:      open()───read()→"old"                 close()
                           ↑
                  (A's write not visible yet)
Client B:                                  open()───read()→"v1"
                                                       ↑
                                               (now sees A's write)
────────────────────────────────────────────────────────────────
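A minimal sketch of how a client library might implement this behavior (ToyServer and the method names are hypothetical): fetch on open, buffer locally, flush on close.

class ToyServer:
    """Stand-in for the file server: holds the authoritative copy of each file."""
    def __init__(self):
        self.files = {"/doc.txt": "old"}

    def fetch(self, path):
        return self.files.get(path, "")

    def store(self, path, data):
        self.files[path] = data

class CloseToOpenFile:
    def __init__(self, server, path):
        self.server = server
        self.path = path
        self.cache = server.fetch(path)   # open(): fetch the latest version
        self.dirty = False

    def read(self):
        return self.cache                 # reads served from the local cache

    def write(self, data):
        self.cache = data                 # writes stay local until close()
        self.dirty = True

    def close(self):
        if self.dirty:
            self.server.store(self.path, self.cache)   # close(): flush to server

server = ToyServer()
a = CloseToOpenFile(server, "/doc.txt")
b = CloseToOpenFile(server, "/doc.txt")   # B opened before A closed
a.write("v1")
a.close()                                 # A's change reaches the server here
print(b.read())                           # "old" -- B's open predates A's close
print(CloseToOpenFile(server, "/doc.txt").read())   # "v1" -- a fresh open sees it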
Why session semantics are popular:
- They map naturally onto client-side caching: the server is contacted on open() and close(), not on every read and write.
- They match how most applications actually use files: open, work on a private copy, close.
- They avoid per-operation coordination, keeping latency low and letting servers scale to many clients.
Session semantics assume files are not shared concurrently. They break down when: (1) Multiple users edit the same file simultaneously—one user's changes may be silently lost. (2) Applications expect immediate visibility of writes to other processes. (3) Append-only logs where multiple writers must coordinate. For such use cases, stronger consistency or explicit locking is required.
Eventual consistency is the weakest useful consistency model: it only guarantees that if no new updates are made, all replicas will eventually converge to the same value. There's no bound on how long this takes.
Definition:
If no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.
What eventual consistency does NOT guarantee:
- Any bound on convergence time ("eventually" may be milliseconds or minutes).
- Read-your-writes: a client may not see its own update immediately.
- Monotonic reads: successive reads can return older values if they hit different replicas.
- Any particular ordering of concurrent updates while propagation is still in flight.
Amazon S3: Eventual consistency in practice
S3 historically provided eventual consistency for certain operations:
S3 Consistency Before (2020):
PUT (new object): Read-after-write consistent for new objects
PUT (overwrite): Eventual consistency (might read old version)
DELETE: Eventual consistency (might still read deleted object)
LIST after PUT: Eventual consistency (new object might not appear)
Real example:
T=0: PUT s3://bucket/key with content "v2" (overwrites "v1")
T=100ms: GET s3://bucket/key → might return "v1" (stale!)
T=1s: GET s3://bucket/key → returns "v2"
Workaround for applications:
- Use unique keys instead of overwriting
- Include version in key: s3://bucket/data-v2
- Wait before reading (poor solution)
- Use DynamoDB for consistent metadata
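The first two workarounds combine into a simple pattern, sketched below with toy dict-backed stores (the helper names are hypothetical): never overwrite an object; write each version under a fresh key and track the current key in a strongly consistent metadata store.

import uuid

class ToyStore:
    """Stand-in for an object store or metadata table."""
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data[key]

def put_new_version(object_store, metadata_store, logical_name, content):
    # Write the data under a unique, never-reused key: new objects were
    # read-after-write consistent even under the pre-2020 S3 model.
    versioned_key = f"{logical_name}-{uuid.uuid4().hex}"
    object_store.put(versioned_key, content)
    # Record which key is current in a strongly consistent metadata store
    # (the workaround list above suggests DynamoDB for this role).
    metadata_store.put(logical_name, versioned_key)
    return versioned_key

def get_current_version(object_store, metadata_store, logical_name):
    versioned_key = metadata_store.get(logical_name)   # consistent lookup
    return object_store.get(versioned_key)             # unique key: never stale

objects, metadata = ToyStore(), ToyStore()
put_new_version(objects, metadata, "report", "v2 contents")
print(get_current_version(objects, metadata, "report"))   # "v2 contents"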
S3 Strong Consistency (2020):
AWS updated S3 to provide strong read-after-write consistency:
S3 Consistency After (2020):
PUT: Read-after-write consistent for all objects
DELETE: Read-after-write consistent
LIST: Strongly consistent with PUTs and DELETEs
How? Amazon rebuilt the indexing layer to use consensus
for metadata while keeping data path highly available.
Many modern systems offer tunable consistency—clients choose per-operation. For example, you might use eventual consistency for logging (high throughput, loss acceptable) but strong consistency for financial records. Cassandra's consistency level (ONE, QUORUM, ALL) exemplifies this pattern.
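A minimal sketch of the idea with toy replicas (plain dicts rather than a real Cassandra cluster): the caller names a consistency level, which determines how many acknowledgements a write must collect before it is reported successful.

def required_acks(level, num_replicas):
    """Map a consistency level to the number of acknowledgements needed."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return num_replicas // 2 + 1
    if level == "ALL":
        return num_replicas
    raise ValueError(f"unknown consistency level: {level}")

def tunable_write(replicas, key, value, level="QUORUM"):
    needed = required_acks(level, len(replicas))
    acks = 0
    for replica in replicas:
        try:
            replica[key] = value          # toy replica = plain dict
            acks += 1
        except Exception:
            pass                          # a down replica simply does not ack
        if acks >= needed:
            return True                   # stragglers catch up asynchronously
    return acks >= needed

replicas = [{}, {}, {}]
print(tunable_write(replicas, "page_view_count", 1, level="ONE"))    # fast, weak
print(tunable_write(replicas, "account_balance", 100, level="ALL"))  # slow, strong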
POSIX file system semantics define specific consistency requirements that local file systems meet but distributed file systems struggle with. Understanding these helps explain why DFS implementations diverge from traditional expectations.
Key POSIX consistency requirements:
| Requirement | POSIX Expectation | DFS Challenge |
|---|---|---|
| Sequential write ordering | Bytes written are visible in write order | Network reordering, replica lag |
| Read-after-write | write() followed by read() returns written data | Cache consistency delays |
| Atomic rename | rename() is atomic: file fully moved or not at all | Cross-node operations not atomic |
| Last-close semantics | Writes visible to other processes on close | Already a relaxation; DFS compatible |
| Lock coherence | flock() coordinates access across processes | Distributed locking is slow/complex |
| Persisted after fsync | fsync() guarantees data is on stable storage | Unclear what "stable" means when data is replicated: one node's disk, or a quorum of replicas? |
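To ground these expectations, here is a small local example using Python's os-level file APIs: on a single machine, read-after-write and fsync() durability are straightforward, while on a DFS each call raises the question of which replica it touches.

import os

fd = os.open("/tmp/posix_demo.txt", os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
os.write(fd, b"hello")           # write(): bytes enter the file
os.fsync(fd)                     # fsync(): forced to local stable storage
os.lseek(fd, 0, os.SEEK_SET)
print(os.read(fd, 5))            # b"hello" -- read-after-write holds locally
os.close(fd)

# On a distributed file system the same calls become questions:
# which replica does read() hit, and how many replicas must hold the
# data before fsync() is allowed to return?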
How different DFS handle POSIX:
1. Full POSIX compliance (attempts): parallel file systems such as GPFS, Lustre, and CephFS coordinate clients (for example with distributed locks or leases) to preserve POSIX semantics.
2. Relaxed POSIX with clarifications: NFS keeps the POSIX API but documents weaker guarantees, most notably close-to-open consistency.
3. Non-POSIX API: HDFS offers a write-once, append-only model and omits operations it cannot support efficiently at scale.
4. Object storage (S3): no POSIX at all; a flat key/object API with its own consistency rules.
Spectrum of POSIX compliance:
Full POSIX ←━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━→ No POSIX
    |                  |                |              |
  GPFS                NFS             HDFS            S3
  Lustre            (relaxed)     (append-only)   (object API)
  CephFS
POSIX compliance lets legacy applications work without modification—a huge benefit. But achieving POSIX semantics in a distributed system requires expensive coordination that can negate the scalability benefits of distribution. Many modern applications are better served by DFS with well-defined, non-POSIX semantics optimized for distributed operation.
When multiple clients write concurrently, especially under eventual consistency, conflicts can arise. How a system resolves these conflicts determines the final state of data.
Conflict scenarios:
Conflict Example:
T=0: File version: v1, content: "Hello"
T=100: Client A (US-West): write(content="Hello World")
T=100: Client B (US-East): write(content="Hello Universe")
(concurrent writes, neither sees the other)
T=200: Updates propagate... what's the final content?
• "Hello World"? (A wins)
• "Hello Universe"? (B wins)
• Both preserved somehow? (conflict preserved)
• Error/rejected? (conflict avoided)
Resolution strategies:
# Conflict Resolution Strategies in Code

import time
from dataclasses import dataclass
from typing import Optional, List, Dict, Any

# ============================================================
# 1. LAST-WRITER-WINS (LWW) - Simple but can lose data
# ============================================================

@dataclass
class LWWValue:
    value: Any
    timestamp: float  # Lamport or wall-clock time

def lww_merge(a: LWWValue, b: LWWValue) -> LWWValue:
    """Simple: highest timestamp wins."""
    return a if a.timestamp >= b.timestamp else b

# Usage:
# Client A writes: LWWValue("Hello World", 1000.001)
# Client B writes: LWWValue("Hello Universe", 1000.002)
# Merge result: "Hello Universe" (higher timestamp)
# Problem: "Hello World" is silently lost!

# ============================================================
# 2. VERSION VECTORS - Detect conflicts precisely
# ============================================================

@dataclass
class VersionedValue:
    value: Any
    vector_clock: Dict[str, int]  # node_id -> counter

def version_dominates(a: Dict[str, int], b: Dict[str, int]) -> bool:
    """True if 'a' is strictly newer than 'b'."""
    all_keys = set(a.keys()) | set(b.keys())
    a_higher_somewhere = False
    for key in all_keys:
        a_val = a.get(key, 0)
        b_val = b.get(key, 0)
        if a_val < b_val:
            return False  # b is ahead on this node
        if a_val > b_val:
            a_higher_somewhere = True
    return a_higher_somewhere

def detect_conflict(a: VersionedValue, b: VersionedValue) -> str:
    """Determine the relationship between two versions."""
    if version_dominates(a.vector_clock, b.vector_clock):
        return "a_dominates"  # a is newer
    elif version_dominates(b.vector_clock, a.vector_clock):
        return "b_dominates"  # b is newer
    else:
        return "conflict"     # neither dominates - true conflict!

# Usage:
# Client A (node "a"): writes with clock {a: 1, b: 0}
# Client B (node "b"): writes with clock {a: 0, b: 1}
# → Conflict detected! Neither dominates.
# → Application must decide how to resolve

# ============================================================
# 3. CRDT: Grow-Only Counter (no conflicts possible)
# ============================================================

class GCounter:
    """
    Conflict-free grow-only counter.
    Each node increments its own slot; the sum is the total.
    Merge is element-wise maximum - always converges!
    """
    def __init__(self, node_id: str, state: Optional[Dict[str, int]] = None):
        self.node_id = node_id
        self.state = state or {}

    def increment(self, amount: int = 1):
        """Increment this node's counter."""
        self.state[self.node_id] = self.state.get(self.node_id, 0) + amount

    def value(self) -> int:
        """Get the total count across all nodes."""
        return sum(self.state.values())

    def merge(self, other: 'GCounter') -> 'GCounter':
        """
        Merge two counters - always works, never conflicts!
        Element-wise maximum preserves all increments.
        """
        merged_state = {}
        all_keys = set(self.state.keys()) | set(other.state.keys())
        for key in all_keys:
            merged_state[key] = max(
                self.state.get(key, 0),
                other.state.get(key, 0)
            )
        return GCounter(self.node_id, merged_state)

# Usage:
# Node A: GCounter("a", state={"a": 5, "b": 0})  # A incremented 5 times
# Node B: GCounter("b", state={"a": 3, "b": 8})  # B saw an older A, incremented 8
# Merge:  state={"a": 5, "b": 8}                 # keep the max of each slot
# Total:  13 - all increments preserved, no conflict!

Selecting an appropriate consistency model requires understanding your application's requirements and the tradeoffs you can accept.
Decision framework:
| Use Case | Recommended Model | Rationale |
|---|---|---|
| Financial transactions | Linearizability / Strong | Lost or duplicated transactions unacceptable |
| User profile data | Read-your-writes + Eventual | User should see own changes; global latency matters |
| Real-time collaboration | Operational transformation (OT) or CRDTs | Multiple writers, merge without coordination |
| Analytics data lake | Eventual consistency | Batch processing tolerates staleness |
| Social media feed | Causal or Session | Comments should appear under correct post |
| Inventory counts | Linearizable or Strong | Overselling is costly; need accurate counts |
| Logging/Metrics | Eventual consistency | Some data loss acceptable for throughput |
| Configuration files | Strong (or versioned) | Nodes should see consistent config |
Questions to ask when choosing:
What's the cost of stale data? If a user sees an old value, what happens? Minor inconvenience or system failure?
What's the cost of coordination? Adding synchronization increases latency. Can your application tolerate 100ms more per operation?
How concurrent are accesses? Single-writer workloads often work fine with weaker consistency. Multi-writer concurrent access needs stronger guarantees or conflict resolution.
What are your availability requirements? Stronger consistency often means operations fail during partitions. Can you tolerate unavailability?
Can application logic handle inconsistency? Some applications can check for conflicts and retry. Others assume consistent views.
Hybrid approach:
Many systems use different consistency models for different operations:
Example: E-commerce platform
- Product catalog: Eventual consistency (reads don't need to be perfect)
- Shopping cart: Session consistency (user sees own cart accurately)
- Inventory decrements: Serializable/Strong (prevent overselling)
- Order placement: Linearizable (order creation is critical)
- Recommendation cache: Eventual (stale recommendations OK)
When in doubt, start with stronger consistency and weaken only after proving performance is insufficient. It's easier to reason about strong consistency, and weakening is often possible without code changes (just configuration). Going from weak to strong consistency often requires fundamental redesign.
We've explored the consistency models that define the behavior of distributed file systems. Let's consolidate the key insights:
- Consistency models form a spectrum from linearizability to eventual consistency; stronger models need more coordination, which costs latency and availability.
- Most distributed file systems relax strict POSIX semantics; close-to-open (session) consistency is the most common compromise.
- Client-centric guarantees such as read-your-writes and monotonic reads are often sufficient and far cheaper than global guarantees.
- Under weak consistency, concurrent writes can conflict; systems resolve conflicts with last-writer-wins, version vectors, or CRDTs.
- Choose the model per use case by weighing the cost of stale data against the cost of coordination; when in doubt, start strong and weaken later.
Module complete:
You've now explored the complete architecture of distributed file systems:
- How distributed file systems are architected
- How they handle naming and file location
- How caching and replication work
- The consistency models that govern their behavior
These concepts form the foundation for understanding how modern distributed storage systems—from enterprise NAS to cloud object storage to big data platforms—actually work under the hood.
Congratulations! You've completed the Distributed File Systems module. You now understand how distributed file systems are architected, how they handle naming and location, how caching and replication work, and the consistency models that govern their behavior. This knowledge is fundamental to designing and operating distributed storage at scale.