In a single-machine file system, the question "what data will I read?" has a simple answer: the most recent write. There's one copy of the data, one kernel managing access, and a well-defined order of operations.
But distributed file systems shatter this simplicity. With multiple replicas, network delays, and concurrent clients, fundamental questions become surprisingly complex: Which write will my read observe? When do other clients see my changes? What happens when two clients write the same file at the same time?
Consistency models are the formal frameworks that answer these questions. They define the contract between the distributed file system and its applications—what guarantees the system provides, and what assumptions applications can safely make.
By the end of this page, you will understand the spectrum of consistency models from strict to eventual, how different distributed file systems implement these models, and how to reason about the tradeoffs between consistency, availability, and performance. You'll be able to analyze what guarantees a given DFS provides and whether they match your application's requirements.
Before diving into specific consistency models, let's understand why consistency in distributed systems is fundamentally challenging.
The core problem: no global clock or instant communication
In a distributed system:
- There is no global clock: every node keeps its own clock, clocks drift, and timestamps alone cannot order events across machines.
- Communication is not instantaneous: messages take milliseconds (or more, across continents) to arrive, and can be delayed, reordered, or lost.
- Nodes fail independently: a replica can miss updates while it is down or partitioned and must catch up later.
These fundamental limitations make it impossible to provide the same consistency guarantees as a single-machine system without paying significant performance costs.
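To see why the lack of a global clock matters, here is a minimal sketch (the node names and skew values are hypothetical) showing how clock drift can reverse the apparent order of two writes:

# Hypothetical clock skew per node: node_b's clock runs 120 ms behind real time.
NODE_CLOCK_SKEW = {"node_a": 0.000, "node_b": -0.120}

def local_timestamp(node, true_time):
    """Timestamp an event using the node's own (skewed) wall clock."""
    return true_time + NODE_CLOCK_SKEW[node]

# Real order: node_a writes first, node_b writes 50 ms later.
write_a = ("node_a writes x", local_timestamp("node_a", 10.000))  # stamped 10.000
write_b = ("node_b writes x", local_timestamp("node_b", 10.050))  # stamped  9.930

# Sorting by local timestamps puts B's later write before A's earlier one.
print(sorted([write_a, write_b], key=lambda event: event[1]))
# [('node_b writes x', 9.93), ('node_a writes x', 10.0)]

Ordering by wall-clock timestamps silently produces the wrong history, which is why distributed systems fall back on logical clocks, version vectors, or consensus to order operations.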
A thought experiment:
Consider this scenario:
Initially: x = "red" on both nodes

Time      Node A (US-West)      Node B (US-East)       Network
─────────────────────────────────────────────────────────────────────
t=0       write(x, "blue")                             ↓ (send update to B)
t=50ms                                                 [update in transit, 100ms delay]
t=75ms                          read(x) → ???
t=100ms                                                [update arrives at B]
t=125ms                         read(x) → "blue"
At t=75ms, what should B's read return?
Option 1: Block until update arrives (100ms delay) → Consistent but slow
Option 2: Return stale/default value ("red") → Fast but inconsistent
Option 3: Error (value not yet known) → Honest but unhelpful
There's no "right" answer—only tradeoffs. Different consistency models represent different choices in this tradeoff space.
The CAP theorem (Brewer's theorem) formalizes this: during a network partition, a distributed system must choose between Consistency (all nodes see the same data) and Availability (all requests receive a response). You cannot have both. Real distributed file systems position themselves somewhere in this tradeoff space.
Consistency models form a spectrum from strongest (most intuitive but slowest) to weakest (fastest but most surprising). Understanding this spectrum helps you choose appropriate models for different use cases.
The consistency hierarchy:
| Model | Guarantee | Performance | DFS Examples |
|---|---|---|---|
| Strict Consistency | All reads return the absolute latest write | Impossible in distributed systems | Theoretical ideal only |
| Linearizability | Operations appear instantaneous, in real-time order | Slow (requires synchronization) | Spanner (with TrueTime) |
| Sequential Consistency | All processes see operations in same order | Medium (requires ordering) | Some POSIX file operations |
| Causal Consistency | Causally related operations seen in order | Medium (requires dependency tracking) | Some distributed databases |
| Session Consistency | Client's own operations seen in order | Good (requires session affinity) | Many DFS (close-to-open) |
| Eventual Consistency | All replicas converge eventually (no time bound) | Best (async propagation) | S3, DNS, many object stores |
Key insight:
Stronger consistency models guarantee more intuitive behavior but require more coordination (synchronous communication, consensus protocols, locking). This coordination adds latency and can reduce availability during partitions.
Weaker consistency models allow more concurrent, independent operation but require applications to handle potentially stale or out-of-order data.
Linearizability is the gold standard of consistency—it provides the illusion that operations happen instantaneously, in an order consistent with real time.
Definition:
A system is linearizable if:
- Every operation appears to take effect atomically at a single instant (its linearization point) somewhere between its invocation and its response.
- The order of these instants is consistent with real time: if operation A completes before operation B begins, A is ordered before B.
- Every read returns the value written by the most recent write in that order.
Linearizable execution:

Real time:
────────────────────────────────────────────────────────────────
Client A:  |--write(x,1)---|                        |--read(x)--→3|
Client B:                    |--read(x)--→1|  |--write(x,3)--|
Client C:                         |--read(x)--→1 or 3|
────────────────────────────────────────────────────────────────

Linearization points (one valid ordering):
  write(x,1)  →  read(x)→1  →  write(x,3)  →  read(x)→3
   Client A      Client B       Client B       Client A

Client C's read overlaps with write(x,3), so it can return 1 or 3
(but if it returns 3, any later read must also return 3)
Why linearizability is expensive:
To achieve linearizability, the system must:
- Funnel every write through a single point of serialization (a leader or consensus group) so all replicas agree on one order.
- Wait for a quorum of replicas to durably acknowledge each write before replying to the client.
- Ensure reads observe the latest committed write, for example by serving reads through the leader or using read quorums or leases.
Linearizable write protocol (simplified):
1. Client sends write to leader/coordinator
2. Leader assigns a sequence number
3. Leader replicates to quorum of replicas
4. All quorum members acknowledge
5. Leader commits and responds to client
Latency ≈ RTT(client ↔ leader)
        + RTT(leader ↔ slowest quorum member)
        + consensus protocol overhead
        + slowest quorum member's disk sync time
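A minimal sketch of that protocol follows, with toy leader and follower classes (real systems such as Raft or Multi-Paxos add leader election, retries, and log reconciliation on top of this skeleton):

class FollowerReplica:
    def __init__(self):
        self.log = []

    def append(self, entry):
        self.log.append(entry)      # a real replica would fsync before acknowledging
        return True                 # acknowledgement

class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.sequence = 0
        self.log = []

    def linearizable_write(self, key, value):
        # Step 2: assign a sequence number that fixes this write's position.
        self.sequence += 1
        entry = (self.sequence, key, value)

        # Steps 3-4: replicate and count acknowledgements until a majority is reached.
        acks = 1                                    # the leader votes for itself
        for follower in self.followers:
            if follower.append(entry):
                acks += 1
        quorum = (len(self.followers) + 1) // 2 + 1
        if acks < quorum:
            raise RuntimeError("write not committed: quorum unavailable")

        # Step 5: commit locally and respond to the client.
        self.log.append(entry)
        return self.sequence

leader = Leader([FollowerReplica(), FollowerReplica()])
print(leader.linearizable_write("x", "blue"))       # -> 1 (committed at quorum)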
Real-world linearizable systems:
- Coordination services such as ZooKeeper and etcd, which run consensus protocols (ZAB, Raft) for their metadata.
- Google Spanner, which combines Paxos groups with TrueTime to order transactions consistently with real time across datacenters.
Most distributed file systems do NOT provide linearizability for all operations—the performance cost is too high for file system workloads.
These terms are often confused. Linearizability is about single-object operations appearing atomic and real-time ordered. Serializability is about multi-object transactions appearing as if executed serially. A system can be linearizable but not serializable (no transactions), or serializable but not linearizable (transactions may reorder in non-real-time ways).
Sequential consistency and causal consistency offer useful middle-ground positions between linearizability and eventual consistency.
Sequential Consistency:
All processes see all operations in the same order. The order must be consistent with each process's program order, but need not match real time.
Sequentially consistent (valid):
────────────────────────────────────────────────────────────────
Client A:  write(x,1)          write(x,2)
Client B:          read(x)→?             read(x)→?
────────────────────────────────────────────────────────────────
Valid orderings:
  w(x,1) → w(x,2) → r(x)→2 → r(x)→2   ✓
  w(x,1) → r(x)→1 → w(x,2) → r(x)→2   ✓
  r(x)→1 → w(x,1) → ...               ✗ (B cannot read 1 before A writes it,
                                         unless an earlier write already stored 1)

Key: A's writes are seen by B in A's program order (1 before 2).
     B's reads need not reflect real time (a read issued after write(x,2)
     completes may still return 1), but every client observes the same
     total order of operations. A small checker for the program-order
     requirement is sketched below.
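This minimal sketch (with hypothetical operation labels) verifies that a proposed total order keeps each client's operations in the order that client issued them; a full sequential-consistency check would additionally verify that every read returns the value of the most recent preceding write in the chosen order.

def respects_program_order(total_order, programs):
    """True if each process's operations appear in total_order in program order."""
    position = {op: i for i, op in enumerate(total_order)}
    return all(
        position[earlier] < position[later]
        for program in programs
        for earlier, later in zip(program, program[1:])
    )

programs = [
    ["w(x,1)@A", "w(x,2)@A"],     # Client A's program order
    ["r1(x)@B", "r2(x)@B"],       # Client B's program order
]

valid   = ["w(x,1)@A", "r1(x)@B", "w(x,2)@A", "r2(x)@B"]
invalid = ["w(x,2)@A", "w(x,1)@A", "r1(x)@B", "r2(x)@B"]   # A's writes swapped

print(respects_program_order(valid, programs))    # True
print(respects_program_order(invalid, programs))  # False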
Causal Consistency:
Causally-related operations are seen by all processes in the same order. Concurrent (causally-independent) operations may be seen in different orders by different processes.
# Causal Consistency Example

"""
Causal relationship: If operation A could have influenced operation B,
then all observers must see A before B.

Examples of causal relationships:
1. Same process: A happens-before B in program order
2. Message: A is a write, B reads the value A wrote
3. Transitivity: A→B and B→C implies A→C

Concurrent (non-causal): If neither operation could have influenced
the other, they are concurrent and may be observed in any order.
"""

# Scenario: Social media post and comment
# ========================================

# Process A (Alice):
def alice_actions(client):
    # T1: Alice posts a photo
    client.write("/posts/photo_123", "Beach vacation pic")

    # T2: Alice sends a message to Bob about the post
    client.send_message("bob", "Check out my new post!")

# Process B (Bob):
def bob_actions(client):
    # T3: Bob receives the message from Alice
    msg = client.receive_message()  # Causal dependency on T2

    # T4: Bob reads Alice's post
    # Under causal consistency, Bob MUST see Alice's post because:
    # T1 → T2 (same process), T2 → T3 (message), T3 → T4 (program order)
    # Therefore: T1 → T4 by transitivity
    post = client.read("/posts/photo_123")  # Must return "Beach vacation pic"

# Process C (Carol) - independent:
def carol_actions(client):
    # T5: Carol reads the post (no causal link to Alice's message)
    # Carol might or might not see the post depending on timing
    post = client.read("/posts/photo_123")  # May return old value or new

"""
Causal consistency guarantees:
- Bob will see Alice's post (causally linked through the message)
- Carol might not see it yet (no causal link)

This is weaker than sequential consistency:
- Sequential: Carol would also see updates in a single global order
- Causal: Carol is only guaranteed to see causally-related updates in order

Implementation: vector clocks or dependency tracking
"""

class CausalClient:
    def __init__(self, node_id, num_nodes, server):
        self.node_id = node_id
        self.server = server
        # Vector clock: [node_0_time, node_1_time, ...]
        self.vector_clock = [0] * num_nodes

    def write(self, key, value):
        # Increment own logical time
        self.vector_clock[self.node_id] += 1
        # Send the write tagged with the current vector clock
        return self.server.write(key, value, self.vector_clock.copy())

    def read(self, key):
        # Include our vector clock so the server can respect causal dependencies:
        # it only returns values whose dependencies we have already observed
        value, server_clock = self.server.read(key, self.vector_clock)
        # Merge the server's clock into ours
        for i in range(len(self.vector_clock)):
            self.vector_clock[i] = max(self.vector_clock[i], server_clock[i])
        return value

Causal consistency is attractive because it captures what humans intuitively expect about cause and effect, while allowing more concurrency than sequential consistency. Systems like COPS and Eiger implement causal consistency for geo-replicated data stores. However, tracking causal dependencies adds overhead, and the programming model is still more complex than strong consistency.
Many distributed file systems provide client-centric consistency guarantees—promises made to each individual client about its own view of the data, rather than global guarantees about all clients.
The four client-centric guarantees:
- Read-your-writes: after a client writes a value, that client's subsequent reads return that value (or a newer one).
- Monotonic reads: once a client has seen a value, its later reads never return anything older.
- Monotonic writes: a client's writes are applied at every replica in the order the client issued them.
- Writes-follow-reads: a write issued after a read is ordered after the write whose value was read.

A minimal sketch of enforcing the first guarantee with a session token follows.
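This sketch uses toy Primary, LaggingReplica, and Session classes (all names hypothetical): the session remembers the highest version it has written and refuses any replica answer older than that.

class Primary:
    """Toy primary copy: assigns a monotonically increasing version per write."""
    def __init__(self):
        self.data = {}
        self.version = 0

    def write(self, key, value):
        self.version += 1
        self.data[key] = (value, self.version)
        return self.version

    def read(self, key):
        return self.data.get(key, (None, 0))

class LaggingReplica:
    """Toy follower that has only replicated writes up to applied_version."""
    def __init__(self, primary, applied_version=0):
        self.primary = primary
        self.applied_version = applied_version

    def read(self, key):
        value, version = self.primary.read(key)
        if version <= self.applied_version:
            return value, version
        return None, self.applied_version        # newer write not yet replicated

class Session:
    """Client session enforcing read-your-writes via a version token."""
    def __init__(self, primary, read_replicas):
        self.primary = primary
        self.read_replicas = read_replicas
        self.last_written_version = 0            # the session token

    def write(self, key, value):
        self.last_written_version = self.primary.write(key, value)

    def read(self, key):
        for replica in self.read_replicas:
            value, version = replica.read(key)
            if version >= self.last_written_version:
                return value                     # new enough for this session
        raise TimeoutError("no replica has caught up to this session's writes")

primary = Primary()
stale_follower = LaggingReplica(primary, applied_version=0)
session = Session(primary, read_replicas=[stale_follower, primary])

session.write("profile", "new avatar")
print(session.read("profile"))   # "new avatar" -- the stale follower is skipped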
Session semantics in distributed file systems:
Many POSIX-like distributed file systems implement close-to-open consistency (a form of session semantics):
Close-to-Open (Session) Consistency:
1. When a file is closed, all modifications are flushed to the server
2. When a file is opened, the client fetches the latest version from server
3. Between open and close, the client may work with a cached version
4. Concurrent writers to the same file have undefined behavior
Timeline:
────────────────────────────────────────────────────────────────
Client A:  open()───write("v1")───close()
                                     ↓ (flush to server)
Client B:      open()───read()→"old"                 close()
                           ↑
                  (A's write not visible yet)
Client B:                                  open()───read()→"v1"
                                                       ↑
                                               (now sees A's write)
────────────────────────────────────────────────────────────────
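A minimal sketch of how a client library might implement this behavior (ToyServer and the method names are hypothetical): fetch on open, buffer locally, flush on close.

class ToyServer:
    """Stand-in for the file server: holds the authoritative copy of each file."""
    def __init__(self):
        self.files = {"/doc.txt": "old"}

    def fetch(self, path):
        return self.files.get(path, "")

    def store(self, path, data):
        self.files[path] = data

class CloseToOpenFile:
    def __init__(self, server, path):
        self.server = server
        self.path = path
        self.cache = server.fetch(path)   # open(): fetch the latest version
        self.dirty = False

    def read(self):
        return self.cache                 # reads served from the local cache

    def write(self, data):
        self.cache = data                 # writes stay local until close()
        self.dirty = True

    def close(self):
        if self.dirty:
            self.server.store(self.path, self.cache)   # close(): flush to server

server = ToyServer()
a = CloseToOpenFile(server, "/doc.txt")
b = CloseToOpenFile(server, "/doc.txt")   # B opened before A closed
a.write("v1")
a.close()                                 # A's change reaches the server here
print(b.read())                           # "old" -- B's open predates A's close
print(CloseToOpenFile(server, "/doc.txt").read())   # "v1" -- a fresh open sees it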
Why session semantics are popular:
- They map naturally onto client-side caching: the server is contacted on open() and close(), not on every read and write.
- They match how most applications actually use files: open, work on a private copy, close.
- They avoid per-operation coordination, keeping latency low and letting servers scale to many clients.
Session semantics assume files are not shared concurrently. They break down when: (1) Multiple users edit the same file simultaneously—one user's changes may be silently lost. (2) Applications expect immediate visibility of writes to other processes. (3) Append-only logs where multiple writers must coordinate. For such use cases, stronger consistency or explicit locking is required.
Eventual consistency is the weakest useful consistency model: it only guarantees that if no new updates are made, all replicas will eventually converge to the same value. There's no bound on how long this takes.
Definition:
If no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.
What eventual consistency does NOT guarantee:
- Any bound on convergence time ("eventually" may be milliseconds or minutes).
- Read-your-writes: a client may not see its own update immediately.
- Monotonic reads: successive reads can return older values if they hit different replicas.
- Any particular ordering of concurrent updates while propagation is still in flight.
Amazon S3: Eventual consistency in practice
S3 historically provided eventual consistency for certain operations:
S3 Consistency Before (2020):
PUT (new object): Read-after-write consistent for new objects
PUT (overwrite): Eventual consistency (might read old version)
DELETE: Eventual consistency (might still read deleted object)
LIST after PUT: Eventual consistency (new object might not appear)
Real example:
T=0: PUT s3://bucket/key with content "v2" (overwrites "v1")
T=100ms: GET s3://bucket/key → might return "v1" (stale!)
T=1s: GET s3://bucket/key → returns "v2"
Workaround for applications:
- Use unique keys instead of overwriting
- Include version in key: s3://bucket/data-v2
- Wait before reading (poor solution)
- Use DynamoDB for consistent metadata
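The first two workarounds combine into a simple pattern, sketched below with toy dict-backed stores (the helper names are hypothetical): never overwrite an object; write each version under a fresh key and track the current key in a strongly consistent metadata store.

import uuid

class ToyStore:
    """Stand-in for an object store or metadata table."""
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data[key]

def put_new_version(object_store, metadata_store, logical_name, content):
    # Write the data under a unique, never-reused key: new objects were
    # read-after-write consistent even under the pre-2020 S3 model.
    versioned_key = f"{logical_name}-{uuid.uuid4().hex}"
    object_store.put(versioned_key, content)
    # Record which key is current in a strongly consistent metadata store
    # (the workaround list above suggests DynamoDB for this role).
    metadata_store.put(logical_name, versioned_key)
    return versioned_key

def get_current_version(object_store, metadata_store, logical_name):
    versioned_key = metadata_store.get(logical_name)   # consistent lookup
    return object_store.get(versioned_key)             # unique key: never stale

objects, metadata = ToyStore(), ToyStore()
put_new_version(objects, metadata, "report", "v2 contents")
print(get_current_version(objects, metadata, "report"))   # "v2 contents"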
S3 Strong Consistency (2020):
AWS updated S3 to provide strong read-after-write consistency:
S3 Consistency After (2020):
PUT: Read-after-write consistent for all objects
DELETE: Read-after-write consistent
LIST: Strongly consistent with PUTs and DELETEs
How? Amazon rebuilt the indexing layer to use consensus
for metadata while keeping data path highly available.
Many modern systems offer tunable consistency—clients choose per-operation. For example, you might use eventual consistency for logging (high throughput, loss acceptable) but strong consistency for financial records. Cassandra's consistency level (ONE, QUORUM, ALL) exemplifies this pattern.
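A minimal sketch of the idea with toy replicas (plain dicts rather than a real Cassandra cluster): the caller names a consistency level, which determines how many acknowledgements a write must collect before it is reported successful.

def required_acks(level, num_replicas):
    """Map a consistency level to the number of acknowledgements needed."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return num_replicas // 2 + 1
    if level == "ALL":
        return num_replicas
    raise ValueError(f"unknown consistency level: {level}")

def tunable_write(replicas, key, value, level="QUORUM"):
    needed = required_acks(level, len(replicas))
    acks = 0
    for replica in replicas:
        try:
            replica[key] = value          # toy replica = plain dict
            acks += 1
        except Exception:
            pass                          # a down replica simply does not ack
        if acks >= needed:
            return True                   # stragglers catch up asynchronously
    return acks >= needed

replicas = [{}, {}, {}]
print(tunable_write(replicas, "page_view_count", 1, level="ONE"))    # fast, weak
print(tunable_write(replicas, "account_balance", 100, level="ALL"))  # slow, strong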
POSIX file system semantics define specific consistency requirements that local file systems meet but distributed file systems struggle with. Understanding these helps explain why DFS implementations diverge from traditional expectations.
Key POSIX consistency requirements:
| Requirement | POSIX Expectation | DFS Challenge |
|---|---|---|
| Sequential write ordering | Bytes written are visible in write order | Network reordering, replica lag |
| Read-after-write | write() followed by read() returns written data | Cache consistency delays |
| Atomic rename | rename() is atomic: file fully moved or not at all | Cross-node operations not atomic |
| Last-close semantics | Writes visible to other processes on close | Already a relaxation; DFS compatible |
| Lock coherence | flock() coordinates access across processes | Distributed locking is slow/complex |
| Persisted after fsync | fsync() guarantees data is on stable storage | Unclear what "stable" means when data is replicated: one node's disk, or a quorum of replicas? |
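To ground these expectations, here is a small local example using Python's os-level file APIs: on a single machine, read-after-write and fsync() durability are straightforward, while on a DFS each call raises the question of which replica it touches.

import os

fd = os.open("/tmp/posix_demo.txt", os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
os.write(fd, b"hello")           # write(): bytes enter the file
os.fsync(fd)                     # fsync(): forced to local stable storage
os.lseek(fd, 0, os.SEEK_SET)
print(os.read(fd, 5))            # b"hello" -- read-after-write holds locally
os.close(fd)

# On a distributed file system the same calls become questions:
# which replica does read() hit, and how many replicas must hold the
# data before fsync() is allowed to return?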
How different DFS handle POSIX:
1. Full POSIX compliance (attempts): parallel file systems such as GPFS, Lustre, and CephFS coordinate clients (for example with distributed locks or leases) to preserve POSIX semantics.
2. Relaxed POSIX with clarifications: NFS keeps the POSIX API but documents weaker guarantees, most notably close-to-open consistency.
3. Non-POSIX API: HDFS offers a write-once, append-only model and omits operations it cannot support efficiently at scale.
4. Object storage (S3): no POSIX at all; a flat key/object API with its own consistency rules.
Spectrum of POSIX compliance:
Full POSIX ←━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━→ No POSIX
    |                  |                |              |
  GPFS                NFS             HDFS            S3
  Lustre            (relaxed)     (append-only)   (object API)
  CephFS
POSIX compliance lets legacy applications work without modification—a huge benefit. But achieving POSIX semantics in a distributed system requires expensive coordination that can negate the scalability benefits of distribution. Many modern applications are better served by DFS with well-defined, non-POSIX semantics optimized for distributed operation.
When multiple clients write concurrently, especially under eventual consistency, conflicts can arise. How a system resolves these conflicts determines the final state of data.
Conflict scenarios:
Conflict Example:
T=0: File version: v1, content: "Hello"
T=100: Client A (US-West): write(content="Hello World")
T=100: Client B (US-East): write(content="Hello Universe")
(concurrent writes, neither sees the other)
T=200: Updates propagate... what's the final content?
• "Hello World"? (A wins)
• "Hello Universe"? (B wins)
• Both preserved somehow? (conflict preserved)
• Error/rejected? (conflict avoided)
Resolution strategies:
# Conflict Resolution Strategies in Code

import time
from dataclasses import dataclass
from typing import Optional, List, Dict, Any

# ============================================================
# 1. LAST-WRITER-WINS (LWW) - Simple but can lose data
# ============================================================

@dataclass
class LWWValue:
    value: Any
    timestamp: float  # Lamport or wall-clock time

def lww_merge(a: LWWValue, b: LWWValue) -> LWWValue:
    """Simple: highest timestamp wins."""
    return a if a.timestamp >= b.timestamp else b

# Usage:
# Client A writes: LWWValue("Hello World", 1000.001)
# Client B writes: LWWValue("Hello Universe", 1000.002)
# Merge result: "Hello Universe" (higher timestamp)
# Problem: "Hello World" is silently lost!

# ============================================================
# 2. VERSION VECTORS - Detect conflicts precisely
# ============================================================

@dataclass
class VersionedValue:
    value: Any
    vector_clock: Dict[str, int]  # node_id -> counter

def version_dominates(a: Dict[str, int], b: Dict[str, int]) -> bool:
    """True if 'a' is strictly newer than 'b'."""
    all_keys = set(a.keys()) | set(b.keys())
    a_higher_somewhere = False
    for key in all_keys:
        a_val = a.get(key, 0)
        b_val = b.get(key, 0)
        if a_val < b_val:
            return False  # b is ahead on this node
        if a_val > b_val:
            a_higher_somewhere = True
    return a_higher_somewhere

def detect_conflict(a: VersionedValue, b: VersionedValue) -> str:
    """Determine the relationship between two versions."""
    if version_dominates(a.vector_clock, b.vector_clock):
        return "a_dominates"  # a is newer
    elif version_dominates(b.vector_clock, a.vector_clock):
        return "b_dominates"  # b is newer
    else:
        return "conflict"     # neither dominates - true conflict!

# Usage:
# Client A (node "a"): writes with clock {a: 1, b: 0}
# Client B (node "b"): writes with clock {a: 0, b: 1}
# → Conflict detected! Neither dominates.
# → Application must decide how to resolve

# ============================================================
# 3. CRDT: Grow-Only Counter (no conflicts possible)
# ============================================================

class GCounter:
    """
    Conflict-free grow-only counter.
    Each node increments its own slot; the sum is the total.
    Merge is element-wise maximum - always converges!
    """
    def __init__(self, node_id: str, state: Optional[Dict[str, int]] = None):
        self.node_id = node_id
        self.state = state or {}

    def increment(self, amount: int = 1):
        """Increment this node's counter."""
        self.state[self.node_id] = self.state.get(self.node_id, 0) + amount

    def value(self) -> int:
        """Get the total count across all nodes."""
        return sum(self.state.values())

    def merge(self, other: 'GCounter') -> 'GCounter':
        """
        Merge two counters - always works, never conflicts!
        Element-wise maximum preserves all increments.
        """
        merged_state = {}
        all_keys = set(self.state.keys()) | set(other.state.keys())
        for key in all_keys:
            merged_state[key] = max(
                self.state.get(key, 0),
                other.state.get(key, 0)
            )
        return GCounter(self.node_id, merged_state)

# Usage:
# Node A: GCounter("a", state={"a": 5, "b": 0})  # A incremented 5 times
# Node B: GCounter("b", state={"a": 3, "b": 8})  # B saw an older A, incremented 8
# Merge:  state={"a": 5, "b": 8}                 # keep the max of each slot
# Total:  13 - all increments preserved, no conflict!

Selecting an appropriate consistency model requires understanding your application's requirements and the tradeoffs you can accept.
Decision framework:
| Use Case | Recommended Model | Rationale |
|---|---|---|
| Financial transactions | Linearizability / Strong | Lost or duplicated transactions unacceptable |
| User profile data | Read-your-writes + Eventual | User should see own changes; global latency matters |
| Real-time collaboration | Operational transformation (OT) or CRDTs | Multiple writers, merge without coordination |
| Analytics data lake | Eventual consistency | Batch processing tolerates staleness |
| Social media feed | Causal or Session | Comments should appear under correct post |
| Inventory counts | Linearizable or Strong | Overselling is costly; need accurate counts |
| Logging/Metrics | Eventual consistency | Some data loss acceptable for throughput |
| Configuration files | Strong (or versioned) | Nodes should see consistent config |
Questions to ask when choosing:
What's the cost of stale data? If a user sees an old value, what happens? Minor inconvenience or system failure?
What's the cost of coordination? Adding synchronization increases latency. Can your application tolerate 100ms more per operation?
How concurrent are accesses? Single-writer workloads often work fine with weaker consistency. Multi-writer concurrent access needs stronger guarantees or conflict resolution.
What are your availability requirements? Stronger consistency often means operations fail during partitions. Can you tolerate unavailability?
Can application logic handle inconsistency? Some applications can check for conflicts and retry. Others assume consistent views.
Hybrid approach:
Many systems use different consistency models for different operations:
Example: E-commerce platform
- Product catalog: Eventual consistency (reads don't need to be perfect)
- Shopping cart: Session consistency (user sees own cart accurately)
- Inventory decrements: Serializable/Strong (prevent overselling)
- Order placement: Linearizable (order creation is critical)
- Recommendation cache: Eventual (stale recommendations OK)
When in doubt, start with stronger consistency and weaken only after proving performance is insufficient. It's easier to reason about strong consistency, and weakening is often possible without code changes (just configuration). Going from weak to strong consistency often requires fundamental redesign.
We've explored the consistency models that define the behavior of distributed file systems. Let's consolidate the key insights:
- Consistency models form a spectrum from linearizability to eventual consistency; stronger models need more coordination, which costs latency and availability.
- Most distributed file systems relax strict POSIX semantics; close-to-open (session) consistency is the most common compromise.
- Client-centric guarantees such as read-your-writes and monotonic reads are often sufficient and far cheaper than global guarantees.
- Under weak consistency, concurrent writes can conflict; systems resolve conflicts with last-writer-wins, version vectors, or CRDTs.
- Choose the model per use case by weighing the cost of stale data against the cost of coordination; when in doubt, start strong and weaken later.
Module complete:
You've now explored the complete architecture of distributed file systems:
- How distributed file systems are architected
- How they handle naming and file location
- How caching and replication work
- The consistency models that govern their behavior
These concepts form the foundation for understanding how modern distributed storage systems—from enterprise NAS to cloud object storage to big data platforms—actually work under the hood.
Congratulations! You've completed the Distributed File Systems module. You now understand how distributed file systems are architected, how they handle naming and location, how caching and replication work, and the consistency models that govern their behavior. This knowledge is fundamental to designing and operating distributed storage at scale.