Every distributed file system faces a fundamental tension: network access is orders of magnitude slower than local access. A local SSD read takes microseconds; a network read takes milliseconds—a difference of three orders of magnitude. For a file system to feel responsive, it must minimize network round trips.
Caching is the primary technique for bridging this performance gap. By keeping frequently accessed data closer to the client—in RAM, on local disk, or at intermediate servers—a DFS can serve many requests without touching the network at all.
But caching in distributed systems introduces a profound challenge: when multiple clients cache the same data, how do we ensure they see consistent views? Local file systems have one authoritative copy; distributed systems have many copies in various states of freshness. Managing this complexity while preserving performance is one of the most challenging aspects of DFS design.
By the end of this page, you will understand the layered caching architecture of distributed file systems, the cache coherence protocols that maintain consistency, and the tradeoffs between different caching strategies. You'll see how systems balance performance against correctness.
To appreciate the importance of caching, consider the latency hierarchy in a distributed file system:
| Cache Level | Typical Latency | Relative Speed | Capacity |
|---|---|---|---|
| CPU L1 Cache | 1 ns | 1x (baseline) | 64 KB |
| CPU L3 Cache | 10-20 ns | 10-20x slower | 32 MB |
| Local RAM | 100 ns | 100x slower | 64-256 GB |
| Local SSD | 50-100 μs | 50,000-100,000x slower | 1-4 TB |
| Local HDD | 5-10 ms | 5,000,000-10,000,000x slower | 4-16 TB |
| Network + Remote SSD | 1-10 ms | 1,000,000-10,000,000x slower | Petabytes |
| Cross-Datacenter | 50-150 ms | 50,000,000-150,000,000x slower | Exabytes |
The numbers tell the story: if every read required a network round trip, even the simplest operations would be painfully slow. Reading a 1 MB file in 4 KB chunks would require 256 network requests, potentially seconds of cumulative latency. With effective caching, most of those requests are served locally and never touch the network.

Workload characteristics also favor caching. Studies of file system access patterns consistently show strong temporal and spatial locality: recently accessed files are likely to be accessed again soon, files are usually read sequentially, and most files are read far more often than they are written. These patterns make caching highly effective: a well-designed cache achieves 90%+ hit rates in many workloads.
Effective bandwidth = (Cache hit rate × Cache bandwidth) + (Cache miss rate × Network bandwidth). With a 95% cache hit rate, even if the network is 100x slower than the cache, effective bandwidth is still about 95% of cache bandwidth. This is why cache hit rate is the dominant performance factor.
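A quick back-of-the-envelope check of that formula, as a minimal Python sketch (the bandwidth figures are illustrative, not measurements):

import math

def effective_bandwidth(hit_rate, cache_bw_gbps=10.0, network_bw_gbps=0.1):
    """Weighted average of cache and network bandwidth."""
    return hit_rate * cache_bw_gbps + (1.0 - hit_rate) * network_bw_gbps

for hit_rate in (0.50, 0.90, 0.95, 0.99):
    bw = effective_bandwidth(hit_rate)
    print(f"hit rate {hit_rate:.0%}: {bw:.3f} GB/s "
          f"({bw / 10.0:.1%} of cache bandwidth)")
# With a 95% hit rate and a 100x slower network, effective bandwidth
# works out to about 95% of cache bandwidth (9.505 GB/s here).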
Distributed file systems employ caching at multiple layers, each with distinct characteristics. Understanding where caches exist helps in reasoning about consistency and performance.
The multi-layer cache architecture, from the application outward:

- Application-level buffers inside the process (e.g., stdio buffering)
- The client's kernel page cache
- The DFS client cache (the file system's own data and metadata cache)
- Server-side caches (the server's page cache and any dedicated cache tier)
- The storage devices themselves

Cache coherence must span all layers: a change made at any layer must eventually become visible at every other layer. If a client writes data, the update propagates outward:

1. Application → client kernel page cache
2. Kernel page cache → DFS client cache
3. DFS client cache → Server (on write-through, flush, or close)

And propagation to other clients:

4. Server → Other clients' DFS caches (invalidation/update)
5. Other clients' DFS caches → Their kernel caches
6. Kernel caches → Applications (read sees new data)
Each step has latency and potential for failure. The consistency model defines guarantees about when these propagations occur.
Cache coherence refers to the consistency of cached copies across multiple clients. When Client A writes a file, when does Client B see the update? Different answers to this question define different coherence models.
The coherence spectrum:
| Model | Definition | Performance | Complexity | Example |
|---|---|---|---|---|
| No Caching | Every access goes to server | Poor | Simple | Early NFS without caching |
| Write-Through | Writes go to server immediately; reads from cache | Reads fast, writes slow | Medium | Traditional NFS |
| Write-Back (Delayed) | Writes cached locally, flushed later | Fast | Complex | NFS async mounts |
| Session Semantics | Changes visible globally on file close | Fast | Medium | AFS, early GFS |
| Leases/Delegations | Server grants exclusive rights; recalls on conflict | Very fast when uncontended | Complex | NFSv4, SMB |
Understanding session semantics:
Many distributed file systems use session semantics (also called close-to-open consistency):
Session Semantics Rules:
1. Changes made by a client are immediately visible to that client
2. Changes are NOT visible to other clients until the file is closed
3. When a client opens a file, it sees all changes from sessions that were closed before the open
4. Concurrent writers to the same file have undefined behavior
Timeline Example:
─────────────────────────────────────────────────────────────
Client A: open()──write("v1")──────────────close()
Client B:  open()──read() → sees old data────close()
Client C:                                           open()──read() → sees "v1"
─────────────────────────────────────────────────────────────
                                           ↑
                             A closes the file; changes propagate
Session semantics trade immediate consistency for significant performance gains—writes are batched and cached, only syncing on close.
Session semantics work well for 'file-per-user' access patterns but break down for shared files. If two users edit the same file simultaneously, the last to close 'wins,' potentially overwriting the other's changes with no warning. Applications requiring real-time collaboration must use different approaches (locks, conflict resolution, CRDTs).
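A minimal Python sketch of close-to-open consistency (the `SessionServer`/`SessionClient` names are hypothetical, not any real DFS API): writes stay in a per-client buffer until close, and an open snapshots whatever the server holds at that moment.

class SessionServer:
    """Holds the authoritative copy; only sees new data on close()."""
    def __init__(self):
        self.files = {}  # path -> bytes

class SessionClient:
    def __init__(self, server):
        self.server = server
        self.open_files = {}  # path -> [contents, dirty]

    def open(self, path):
        # Opening snapshots the last *closed* version (rule 3).
        self.open_files[path] = [self.server.files.get(path, b""), False]

    def write(self, path, data):
        # Visible immediately to this client only (rules 1 and 2).
        self.open_files[path] = [data, True]

    def read(self, path):
        return self.open_files[path][0]

    def close(self, path):
        contents, dirty = self.open_files.pop(path)
        if dirty:
            # Dirty data propagates to the server only on close.
            self.server.files[path] = contents

server = SessionServer()
a, b = SessionClient(server), SessionClient(server)
a.open("f"); a.write("f", b"v1")
b.open("f"); print(b.read("f"))   # b'' - A's write not yet visible
b.close("f")
a.close("f")                       # now the server holds b'v1'
c = SessionClient(server)
c.open("f"); print(c.read("f"))   # b'v1' - visible after A's close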
Leases (called delegations in NFSv4) provide a sophisticated cache coherence mechanism that combines aggressive caching with strong consistency guarantees. The key insight: if only one client is accessing a file, it can cache freely; conflicts only require coordination when they actually occur.
How leases work:
import time

# Lease-based cache coherence mechanism
class LeaseServer:
    """
    Manages file leases for cache coherence.
    """
    def __init__(self):
        self.leases = {}  # file_id -> {client, type, expiry}
        self.pending_recalls = {}

    def request_lease(self, client_id: str, file_id: str, lease_type: str) -> dict:
        """
        Grant a lease if possible, or queue for recall.
        lease_type: 'read' (shared) or 'write' (exclusive)
        """
        current = self.leases.get(file_id)
        if current is None:
            # No existing lease - grant immediately
            return self._grant_lease(client_id, file_id, lease_type)
        if current['client'] == client_id:
            # Same client, possibly upgrade
            if lease_type == 'write' and current['type'] == 'read':
                return self._upgrade_lease(client_id, file_id)
            return current  # Already have sufficient lease
        # Conflict with another client
        if lease_type == 'write' or current['type'] == 'write':
            # Need to recall existing lease
            return self._initiate_recall(client_id, file_id, current, lease_type)
        # Both are read leases - can coexist
        return self._grant_shared_lease(client_id, file_id)

    def _initiate_recall(self, requesting_client: str, file_id: str,
                         current_lease: dict, requested_type: str) -> dict:
        """
        Recall existing lease before granting new one.
        """
        # Notify lease holder to flush and release
        recall_id = f"recall_{file_id}_{time.time()}"
        self.pending_recalls[recall_id] = {
            'file_id': file_id,
            'requesting_client': requesting_client,
            'requested_type': requested_type,
            'current_holder': current_lease['client'],
            'state': 'pending'
        }
        # Send recall to current holder (async)
        self._send_recall_notification(
            current_lease['client'], file_id, recall_id
        )
        # Return pending status to requesting client
        return {
            'status': 'pending',
            'recall_id': recall_id,
            'message': 'Waiting for lease recall'
        }

    def handle_recall_response(self, client_id: str, file_id: str,
                               recall_id: str, flushed_data: bytes):
        """
        Process lease surrender from recalled client.
        """
        recall = self.pending_recalls.get(recall_id)
        if not recall:
            return
        # Apply flushed data from recalled client
        self._apply_flushed_changes(file_id, flushed_data)
        # Clear old lease
        del self.leases[file_id]
        # Grant lease to waiting client
        self._grant_lease(
            recall['requesting_client'], file_id, recall['requested_type']
        )
        del self.pending_recalls[recall_id]

    def _grant_lease(self, client_id: str, file_id: str, lease_type: str) -> dict:
        """Grant a new lease."""
        lease = {
            'client': client_id,
            'type': lease_type,
            'expiry': time.time() + LEASE_DURATION,
            'granted': time.time()
        }
        self.leases[file_id] = lease
        return {'status': 'granted', 'lease': lease}

    # (Helper methods such as _upgrade_lease, _grant_shared_lease,
    #  _send_recall_notification, and _apply_flushed_changes are omitted
    #  for brevity in this sketch.)

# Lease duration and timing considerations
LEASE_DURATION = 90  # seconds
# Trade-offs:
# - Longer leases: Better caching, slower conflict resolution
# - Shorter leases: Faster conflict resolution, more renewals
# - NFSv4 default: 90 seconds, renewed automatically

NFSv4 delegation types:

- Read delegation: shared; the client may satisfy reads and attribute checks from its cache without contacting the server
- Write delegation: exclusive; the client may additionally cache writes locally until the delegation is recalled
The beauty of leases:
For uncontended files (the common case), leases provide near-local-file-system performance—reads and writes hit the local cache with no network traffic. Only when conflict occurs does the server intervene. This optimistic approach matches real-world access patterns where most files are accessed by a single user at a time.
Leases have expiration times to handle client failures. If a client crashes while holding a lease, the server can reclaim the lease after expiration and grant access to other clients. The lease duration balances the cost of renewals (short leases) against the delay in failure recovery (long leases). NFSv4 uses 90-second leases with automatic renewal during active use.
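A sketch of how expiry might be enforced, extending the hypothetical LeaseServer above (these helper functions are illustrative, not NFSv4's actual implementation): active clients push the expiry forward with renewals, and a server-side sweep reclaims leases whose holders have gone silent.

import time

def renew_lease(server, client_id, file_id, duration=90):
    """Called periodically by an active client to keep its lease alive."""
    lease = server.leases.get(file_id)
    if lease and lease['client'] == client_id:
        lease['expiry'] = time.time() + duration
        return True
    return False  # lease already expired and was reclaimed; re-request it

def reclaim_expired_leases(server):
    """Server-side sweep: a crashed client stops renewing, so its leases
    lapse and the files become available to other clients again."""
    now = time.time()
    expired = [f for f, lease in server.leases.items() if lease['expiry'] < now]
    for file_id in expired:
        del server.leases[file_id]
    return expired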
Write caching presents unique challenges—a cached write that's lost before reaching stable storage represents data loss. DFS implementations offer various write caching strategies with different safety and performance tradeoffs.
Write-through caching:
Every write is immediately sent to the server. The write call returns only after the server acknowledges permanent storage.
Client: write(fd, data)
↓
Send data to server
↓
Wait for server ACK
↓
Server writes to disk
↓
Server sends ACK
↓
write() returns success
Pros: Strong durability, simple consistency
Cons: Every write pays network latency
Write-back (delayed write) caching:
Writes are cached locally and sent to the server later—on sync, close, or when the cache fills.
write(fd, data)
↓
Cache locally (returns immediately)
...
[Later: Flush triggered]
↓
Send accumulated data to server
↓
Server acknowledges
Advantages:

- write() returns at memory speed; network latency is hidden from the application
- Many small writes can be combined into fewer, larger transfers to the server
- Repeated writes to the same region cross the network only once

Risks:

- Dirty data cached on the client is lost if the client crashes before it is flushed
- Write errors (server full, quota exceeded) are reported late, at flush or close time rather than at write()
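A minimal sketch of client-side write-back caching (hypothetical names; real clients track dirty pages rather than whole files): writes land in a local dirty buffer and only reach the server on fsync, close, or when the buffer grows too large.

class WriteBackCache:
    """Buffers writes locally; flushes on sync, close, or memory pressure."""
    def __init__(self, server, max_dirty_bytes=4 * 1024 * 1024):
        self.server = server                  # assumed to expose flush(path, data)
        self.max_dirty_bytes = max_dirty_bytes
        self.dirty = {}                       # path -> bytearray of unflushed data

    def write(self, path, data):
        # Returns immediately - no network round trip on the write path.
        buf = self.dirty.setdefault(path, bytearray())
        buf.extend(data)
        if sum(len(b) for b in self.dirty.values()) >= self.max_dirty_bytes:
            self.flush_all()                  # cache full: force a flush

    def fsync(self, path):
        # Durability point: data is lost if the client crashes before this.
        if path in self.dirty:
            self.server.flush(path, bytes(self.dirty.pop(path)))

    def close(self, path):
        self.fsync(path)                      # close-time flush, as in NFS

    def flush_all(self):
        for path in list(self.dirty):
            self.fsync(path)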
Write-behind with server-side buffering:
Writes sent to server but ACKed before disk persistence.
write(fd, data)
↓
Send to server
↓
Server buffers in RAM
↓
Server ACKs immediately
↓
[Later: Server flushes to disk]
Advantages:

- Write latency is a single network round trip; the client does not wait for the disk
- The server can coalesce, reorder, and batch buffered writes for efficient disk I/O

Risks:

- A server crash can lose acknowledged but unflushed data unless the buffer is battery-backed or replicated
- Clients may believe data is durable before it reaches stable storage
NFS write behavior (controlled by mount options):
# Write-through (synchronous)
mount -t nfs -o sync server:/export /mnt
# Every write waits for server ACK
# Safe but slow
# Write-back (asynchronous) - Default
mount -t nfs -o async server:/export /mnt
# Writes cached, flushed later
# Fast but less safe
# Server-side commit behavior: wdelay is an export option (set in /etc/exports
# on the server, not a client mount option)
#   /export  clients(rw,wdelay)
# The server briefly delays committing writes, expecting related writes to arrive
# This improves write combining
The sync flag and close semantics:
Applications can request specific durability:
fd = open("file.txt", O_WRONLY);
write(fd, data, len); // May be cached
fsync(fd); // Force cache flush to server's disk
close(fd); // Also typically forces flush
Critical applications (databases, logs) use fsync explicitly; casual applications rely on close-time flushing.
File data isn't the only thing worth caching—metadata caching can provide even larger performance benefits. Consider that a simple ls -l command in a directory with 1000 files requires at least 1001 stat calls (one for the directory, one for each file). Without metadata caching, that's 1001 network round trips.
What metadata is cached:

- File attributes: size, timestamps, ownership, permissions, and the change/version attribute
- Directory entries: name → file handle (inode) mappings used during path lookup
- Negative results: the fact that a name does not exist (see negative caching below)
Attribute caching in NFS:
NFS aggressively caches file attributes to minimize server load. The ac* mount options control this behavior:
# Attribute cache controls
# acregmin/acregmax: regular file attributes cached for 3-60 seconds
# acdirmin/acdirmax: directory attributes cached for 30-60 seconds
mount -t nfs -o acregmin=3,acregmax=60,acdirmin=30,acdirmax=60 \
    server:/export /mnt

# Behavior:
# - First access: fetch attributes from the server, cache locally
# - Within the current timeout: use the cached value without contacting the server
# - After the timeout expires: revalidate with the server on next access
# - The timeout adapts between the min and max based on how recently the file changed

# Disable attribute caching entirely (for debugging or strict consistency)
mount -t nfs -o noac server:/export /mnt
The attribute timeout tradeoff:
Longer attribute cache times reduce server load but increase the window for stale data:
Timeline with 60-second attribute cache:
──────────────────────────────────────────────────────
Client A: stat() → cached           stat() → cached (stale!)
Client B:             write()──close()
──────────────────────────────────────────────────────
                      ↑                     ↑
               File modified      A still sees old size/mtime
For many workloads (where files are modified rarely), this is acceptable. For workloads requiring immediate visibility, use shorter timeouts or noac.
Caching 'file not found' results is surprisingly important. Many applications probe for optional files or search multiple paths. Without negative caching, each failed lookup hits the server. NFS caches negative lookups for a configurable duration, significantly improving applications that do path searching (like shell $PATH lookups).
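A small sketch of negative caching (an illustrative, hypothetical interface, not NFS's actual code): failed lookups are remembered for a short TTL so repeated probes for the same missing path stay local.

import time

class NegativeLookupCache:
    """Remembers 'file not found' results for a short time."""
    def __init__(self, lookup_on_server, ttl=3.0):
        self.lookup_on_server = lookup_on_server  # function(path) -> attrs or None
        self.ttl = ttl
        self.misses = {}  # path -> time the miss was cached

    def lookup(self, path):
        cached_at = self.misses.get(path)
        if cached_at is not None and time.time() - cached_at < self.ttl:
            return None                      # served from negative cache, no RPC
        attrs = self.lookup_on_server(path)  # one network round trip
        if attrs is None:
            self.misses[path] = time.time()  # cache the miss
        else:
            self.misses.pop(path, None)      # path exists now; drop the stale miss
        return attrs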
Phil Karlton famously said, "There are only two hard things in Computer Science: cache invalidation and naming things." In distributed file systems, we face both. Cache invalidation—determining when cached data is stale and needs refreshing—is particularly challenging.
Invalidation approaches:
| Method | How It Works | Pros | Cons |
|---|---|---|---|
| Time-Based Expiry | Cache entries expire after fixed/computed TTL | Simple, no server state needed | Stale reads possible within TTL |
| Server Callbacks | Server notifies clients of changes | Immediate invalidation | Server must track all clients |
| Version Validation | Cache stores version; server validates on access | Consistent, scalable | Requires round-trip to check |
| Lease Recall | Server recalls leases before allowing changes | Strong consistency when needed | Complex, delay for conflicts |
NFS client-driven validation:
NFS uses a clever validation scheme based on change attributes:
Validation request (simplified):
Client → Server: GETATTR file_id, cached_change_id=12345
Server → Client: change_id=12345, cache_valid=true
OR
change_id=12347, cache_valid=false
This is lighter than re-reading the data—just checking if it changed. The client controls when to validate based on ac* mount options.
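A sketch of that validation flow in Python (the server interface and helper names are hypothetical): the client keeps the change attribute alongside cached data and issues a cheap GETATTR-style check instead of re-reading the file.

import time

class ValidatingCache:
    """Caches file data plus the change attribute it was read with."""
    def __init__(self, server, revalidate_after=3.0):
        self.server = server              # assumed to expose getattr_rpc(fid), read(fid)
        self.revalidate_after = revalidate_after
        self.entries = {}                 # fid -> (data, change_id, checked_at)

    def read(self, fid):
        entry = self.entries.get(fid)
        if entry is not None:
            data, change_id, checked_at = entry
            if time.time() - checked_at < self.revalidate_after:
                return data               # within attribute timeout: trust the cache
            # Cheap validation: only attributes cross the network.
            if self.server.getattr_rpc(fid)['change_id'] == change_id:
                self.entries[fid] = (data, change_id, time.time())
                return data               # unchanged: cached data is still valid
        # Miss or stale: re-read the data and its current change attribute.
        data = self.server.read(fid)
        change_id = self.server.getattr_rpc(fid)['change_id']
        self.entries[fid] = (data, change_id, time.time())
        return data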
AFS callback mechanism:
AFS (Andrew File System) pioneered server callbacks for cache invalidation:
                     Server
                ┌─────────────┐
  Callback      │ File: F     │      Callback
  break         │ Clients:    │      break
 ←──────────────│ [A, B, C]   │──────────────→
                │             │
┌────────┐      │ Modifying   │      ┌────────┐
│Client A│←─────│ write by D  │─────→│Client B│
└────────┘      └─────────────┘      └────────┘
     ↓                                    ↓
Invalidate F                        Invalidate F
Advantage: cached data can be trusted until a callback break arrives.
Disadvantage: the server must maintain callback state for every (cached file × client) pair.
If a heavily-read file (like a shared library) is modified, the server must send callback breaks to potentially thousands of clients simultaneously. This 'callback storm' can overwhelm the server. Systems handle this through rate limiting, batching, or simply invalidating heavily-shared files less aggressively.
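A sketch of AFS-style callbacks (hypothetical names, not the AFS source): the server remembers which clients cache which file and breaks those callbacks before applying a write.

class CallbackServer:
    """Tracks which clients cache which files; breaks callbacks before writes."""
    def __init__(self, notify):
        self.notify = notify          # function(client_id, file_id): send invalidation
        self.callbacks = {}           # file_id -> set of client_ids holding a callback

    def register(self, client_id, file_id):
        # Client fetched the file; the server promises to notify it before changes.
        self.callbacks.setdefault(file_id, set()).add(client_id)

    def apply_write(self, file_id, writer_id):
        # Break every outstanding callback, then let the write proceed.
        # (Real servers rate-limit or batch these to avoid callback storms.)
        for client_id in self.callbacks.pop(file_id, set()) - {writer_id}:
            self.notify(client_id, file_id)   # "callback break": invalidate the cache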
Beyond caching data that was already accessed, distributed file systems can anticipate future accesses through prefetching (also called readahead). By predicting what data will be needed and fetching it before it's requested, the system can hide network latency.
Sequential readahead:
The most common and effective prefetching strategy. If a client reads bytes 0-4095, they'll probably want bytes 4096-8191 next.
Sequential Read Pattern:
────────────────────────────────────────────────────
Application: read(0-4K) read(4K-8K) read(8K-12K)
Without RA: [wait] [wait] [wait]
↓ ↓ ↓
fetch fetch fetch
With Readahead:
────────────────────────────────────────────────────
Application: read(0-4K) read(4K-8K) read(8K-12K)
With RA: [wait] [cache hit] [cache hit]
↓
fetch 0-16K (data already prefetched)
Readahead dramatically improves sequential read performance by overlapping I/O with processing.
Readahead adaptation:
Smart implementations adapt readahead size based on access patterns:
Adaptive Readahead Algorithm:
1. Start with small readahead (e.g., 32KB)
2. If pattern is confirmed sequential, increase readahead
3. Double readahead each time sequential access continues
4. Cap at maximum (e.g., 512KB or 2MB)
5. If random access detected, disable or reduce readahead
Linux kernel implementation:
- Initial window: 32KB
- Maximum window: /sys/block/<dev>/queue/read_ahead_kb (default 128KB)
- Pattern detection: tracks last few read positions
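A minimal sketch of that adaptive policy (illustrative constants, not the kernel's actual code): the readahead window doubles while reads stay sequential and collapses when access turns random.

class AdaptiveReadahead:
    """Grows the readahead window on sequential access, shrinks it on random."""
    INITIAL = 32 * 1024          # 32 KB starting window
    MAXIMUM = 2 * 1024 * 1024    # 2 MB cap

    def __init__(self):
        self.window = self.INITIAL
        self.next_expected = 0   # offset a sequential reader would request next

    def on_read(self, offset, length):
        """Returns how many bytes to prefetch beyond this read (0 = none)."""
        if offset == self.next_expected:
            # Sequential pattern confirmed: double the window up to the cap.
            self.window = min(self.window * 2, self.MAXIMUM)
            prefetch = self.window
        else:
            # Random access detected: reset and stop prefetching.
            self.window = self.INITIAL
            prefetch = 0
        self.next_expected = offset + length
        return prefetch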
DFS-specific prefetching:
Distributed file systems can implement additional prefetching strategies:
HDFS readahead:
HDFS clients implement their own readahead since data comes from remote DataNodes:
// HDFS read strategy
DFSInputStream {
// When reading block B, prefetch next blocks
prefetchBlocks = configuration.get("dfs.client.read.prefetch.size");
// Parallel reads from multiple DataNodes
hedgedReadThreadpool = ...; // Hedged reads: retry slow reads on a backup replica in parallel
}
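The hedged read mentioned above deserves a closer look. A minimal Python sketch of the idea (the `read_from` callable and replica list are hypothetical, not HDFS's actual API): issue the read to the primary replica and, if it has not returned within a small delay, issue the same read to a backup and take whichever finishes first.

import concurrent.futures

def hedged_read(read_from, replicas, hedge_after=0.05):
    """Read from the primary replica; if it is slow, also ask a backup and
    return whichever response arrives first (illustrative sketch)."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(read_from, replicas[0])]
    done, _ = concurrent.futures.wait(futures, timeout=hedge_after)
    if not done and len(replicas) > 1:
        futures.append(pool.submit(read_from, replicas[1]))  # hedge on a backup
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    pool.shutdown(wait=False)    # let the slower read finish in the background
    return next(iter(done)).result()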
Prefetching isn't free—it consumes network bandwidth and memory. Poorly tuned readahead on random access patterns wastes resources and can actually decrease performance. Modern systems detect access patterns and disable readahead when appropriate.
We've explored the critical role of caching in distributed file systems and the mechanisms that make it work. The key insights:

- Network latency dwarfs local access latency, so cache hit rate dominates DFS performance
- Caches exist at multiple layers (application, kernel, DFS client, server), and coherence must span them all
- Coherence models range from no caching through write-through, write-back, session semantics, and leases, trading consistency for performance
- Write caching improves latency but opens durability windows; fsync and close define the flush points
- Metadata and negative-lookup caching matter as much as data caching for many workloads
- Invalidation relies on TTL expiry, version validation, server callbacks, or lease recall, each with distinct scaling properties
- Prefetching (readahead) hides network latency for sequential access but must back off for random access
What's next:
Caching optimizes access to individual copies of data, but distributed file systems maintain multiple copies through replication. Next, we'll explore how DFS implementations create, maintain, and synchronize replicas to provide fault tolerance and improved read performance.
You now understand the caching strategies used in distributed file systems, from basic caching principles through sophisticated lease-based coherence protocols, and you can analyze the tradeoffs between performance and consistency in different caching approaches.