Every distributed file system faces a fundamental tension: network access is orders of magnitude slower than local access. A local SSD read takes microseconds; a network read takes milliseconds—a difference of three orders of magnitude. For a file system to feel responsive, it must minimize network round trips.
Caching is the primary technique for bridging this performance gap. By keeping frequently accessed data closer to the client—in RAM, on local disk, or at intermediate servers—a DFS can serve many requests without touching the network at all.
But caching in distributed systems introduces a profound challenge: when multiple clients cache the same data, how do we ensure they see consistent views? Local file systems have one authoritative copy; distributed systems have many copies in various states of freshness. Managing this complexity while preserving performance is one of the most challenging aspects of DFS design.
By the end of this page, you will understand the layered caching architecture of distributed file systems, the cache coherence protocols that maintain consistency, and the tradeoffs between different caching strategies. You'll see how systems balance performance against correctness.
To appreciate the importance of caching, consider the latency hierarchy in a distributed file system:
| Cache Level | Typical Latency | Relative Speed | Capacity |
|---|---|---|---|
| CPU L1 Cache | 1 ns | 1x (baseline) | 64 KB |
| CPU L3 Cache | 10-20 ns | 10-20x slower | 32 MB |
| Local RAM | 100 ns | 100x slower | 64-256 GB |
| Local SSD | 50-100 μs | 50,000-100,000x slower | 1-4 TB |
| Local HDD | 5-10 ms | 5,000,000-10,000,000x slower | 4-16 TB |
| Network + Remote SSD | 1-10 ms | 1,000,000-10,000,000x slower | Petabytes |
| Cross-Datacenter | 50-150 ms | 50,000,000-150,000,000x slower | Exabytes |
The numbers tell the story: if every read required a network round trip, even the simplest operations would be painfully slow. Reading a 1 MB file in 4 KB chunks would require 256 network requests, potentially seconds of cumulative latency. With effective caching, most of those requests are served locally and never touch the network.

Workload characteristics also favor caching. Studies of file system access patterns consistently show strong temporal and spatial locality: recently accessed files are likely to be accessed again soon, files are usually read sequentially, and most files are read far more often than they are written. These patterns make caching highly effective: a well-designed cache achieves 90%+ hit rates in many workloads.
Effective bandwidth = (Cache hit rate × Cache bandwidth) + (Cache miss rate × Network bandwidth). With a 95% cache hit rate, even if the network is 100x slower than the cache, effective bandwidth is still about 95% of cache bandwidth. This is why cache hit rate is the dominant performance factor.
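A quick back-of-the-envelope check of that formula, as a minimal Python sketch (the bandwidth figures are illustrative, not measurements):

import math

def effective_bandwidth(hit_rate, cache_bw_gbps=10.0, network_bw_gbps=0.1):
    """Weighted average of cache and network bandwidth."""
    return hit_rate * cache_bw_gbps + (1.0 - hit_rate) * network_bw_gbps

for hit_rate in (0.50, 0.90, 0.95, 0.99):
    bw = effective_bandwidth(hit_rate)
    print(f"hit rate {hit_rate:.0%}: {bw:.3f} GB/s "
          f"({bw / 10.0:.1%} of cache bandwidth)")
# With a 95% hit rate and a 100x slower network, effective bandwidth
# works out to about 95% of cache bandwidth (9.505 GB/s here).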
Distributed file systems employ caching at multiple layers, each with distinct characteristics. Understanding where caches exist helps in reasoning about consistency and performance.
The multi-layer cache architecture, from the application outward:

- Application-level buffers inside the process (e.g., stdio buffering)
- The client's kernel page cache
- The DFS client cache (the file system's own data and metadata cache)
- Server-side caches (the server's page cache and any dedicated cache tier)
- The storage devices themselves

Cache coherence must span all layers: a change made at any layer must eventually become visible at every other layer. If a client writes data, the update propagates outward:

1. Application → client kernel page cache
2. Kernel page cache → DFS client cache
3. DFS client cache → Server (on write-through, flush, or close)

And propagation to other clients:

4. Server → Other clients' DFS caches (invalidation/update)
5. Other clients' DFS caches → Their kernel caches
6. Kernel caches → Applications (read sees new data)
Each step has latency and potential for failure. The consistency model defines guarantees about when these propagations occur.
Cache coherence refers to the consistency of cached copies across multiple clients. When Client A writes a file, when does Client B see the update? Different answers to this question define different coherence models.
The coherence spectrum:
| Model | Definition | Performance | Complexity | Example |
|---|---|---|---|---|
| No Caching | Every access goes to server | Poor | Simple | Early NFS without caching |
| Write-Through | Writes go to server immediately; reads from cache | Reads fast, writes slow | Medium | Traditional NFS |
| Write-Back (Delayed) | Writes cached locally, flushed later | Fast | Complex | NFS async mounts |
| Session Semantics | Changes visible globally on file close | Fast | Medium | AFS, early GFS |
| Leases/Delegations | Server grants exclusive rights; recalls on conflict | Very fast when uncontended | Complex | NFSv4, SMB |
Understanding session semantics:
Many distributed file systems use session semantics (also called close-to-open consistency):
Session Semantics Rules:
1. Changes made by a client are immediately visible to that client
2. Changes are NOT visible to other clients until the file is closed
3. When a client opens a file, it sees all changes from sessions that were closed before the open
4. Concurrent writers to the same file have undefined behavior
Timeline Example:
─────────────────────────────────────────────────────────────
Client A: open()──write("v1")──────────────close()
Client B:  open()──read() → sees old data────close()
Client C:                                           open()──read() → sees "v1"
─────────────────────────────────────────────────────────────
                                           ↑
                             A closes the file; changes propagate
Session semantics trade immediate consistency for significant performance gains—writes are batched and cached, only syncing on close.
Session semantics work well for 'file-per-user' access patterns but break down for shared files. If two users edit the same file simultaneously, the last to close 'wins,' potentially overwriting the other's changes with no warning. Applications requiring real-time collaboration must use different approaches (locks, conflict resolution, CRDTs).
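A minimal Python sketch of close-to-open consistency (the `SessionServer`/`SessionClient` names are hypothetical, not any real DFS API): writes stay in a per-client buffer until close, and an open snapshots whatever the server holds at that moment.

class SessionServer:
    """Holds the authoritative copy; only sees new data on close()."""
    def __init__(self):
        self.files = {}  # path -> bytes

class SessionClient:
    def __init__(self, server):
        self.server = server
        self.open_files = {}  # path -> [contents, dirty]

    def open(self, path):
        # Opening snapshots the last *closed* version (rule 3).
        self.open_files[path] = [self.server.files.get(path, b""), False]

    def write(self, path, data):
        # Visible immediately to this client only (rules 1 and 2).
        self.open_files[path] = [data, True]

    def read(self, path):
        return self.open_files[path][0]

    def close(self, path):
        contents, dirty = self.open_files.pop(path)
        if dirty:
            # Dirty data propagates to the server only on close.
            self.server.files[path] = contents

server = SessionServer()
a, b = SessionClient(server), SessionClient(server)
a.open("f"); a.write("f", b"v1")
b.open("f"); print(b.read("f"))   # b'' - A's write not yet visible
b.close("f")
a.close("f")                       # now the server holds b'v1'
c = SessionClient(server)
c.open("f"); print(c.read("f"))   # b'v1' - visible after A's close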
Leases (called delegations in NFSv4) provide a sophisticated cache coherence mechanism that combines aggressive caching with strong consistency guarantees. The key insight: if only one client is accessing a file, it can cache freely; conflicts only require coordination when they actually occur.
How leases work:
import time

# Lease-based cache coherence mechanism
class LeaseServer:
    """
    Manages file leases for cache coherence.
    """
    def __init__(self):
        self.leases = {}  # file_id -> {client, type, expiry}
        self.pending_recalls = {}

    def request_lease(self, client_id: str, file_id: str, lease_type: str) -> dict:
        """
        Grant a lease if possible, or queue for recall.
        lease_type: 'read' (shared) or 'write' (exclusive)
        """
        current = self.leases.get(file_id)
        if current is None:
            # No existing lease - grant immediately
            return self._grant_lease(client_id, file_id, lease_type)
        if current['client'] == client_id:
            # Same client, possibly upgrade
            if lease_type == 'write' and current['type'] == 'read':
                return self._upgrade_lease(client_id, file_id)
            return current  # Already have sufficient lease
        # Conflict with another client
        if lease_type == 'write' or current['type'] == 'write':
            # Need to recall existing lease
            return self._initiate_recall(client_id, file_id, current, lease_type)
        # Both are read leases - can coexist
        return self._grant_shared_lease(client_id, file_id)

    def _initiate_recall(self, requesting_client: str, file_id: str,
                         current_lease: dict, requested_type: str) -> dict:
        """
        Recall existing lease before granting new one.
        """
        # Notify lease holder to flush and release
        recall_id = f"recall_{file_id}_{time.time()}"
        self.pending_recalls[recall_id] = {
            'file_id': file_id,
            'requesting_client': requesting_client,
            'requested_type': requested_type,
            'current_holder': current_lease['client'],
            'state': 'pending'
        }
        # Send recall to current holder (async)
        self._send_recall_notification(
            current_lease['client'], file_id, recall_id
        )
        # Return pending status to requesting client
        return {
            'status': 'pending',
            'recall_id': recall_id,
            'message': 'Waiting for lease recall'
        }

    def handle_recall_response(self, client_id: str, file_id: str,
                               recall_id: str, flushed_data: bytes):
        """
        Process lease surrender from recalled client.
        """
        recall = self.pending_recalls.get(recall_id)
        if not recall:
            return
        # Apply flushed data from recalled client
        self._apply_flushed_changes(file_id, flushed_data)
        # Clear old lease
        del self.leases[file_id]
        # Grant lease to waiting client
        self._grant_lease(
            recall['requesting_client'], file_id, recall['requested_type']
        )
        del self.pending_recalls[recall_id]

    def _grant_lease(self, client_id: str, file_id: str, lease_type: str) -> dict:
        """Grant a new lease."""
        lease = {
            'client': client_id,
            'type': lease_type,
            'expiry': time.time() + LEASE_DURATION,
            'granted': time.time()
        }
        self.leases[file_id] = lease
        return {'status': 'granted', 'lease': lease}

    # (Helper methods such as _upgrade_lease, _grant_shared_lease,
    #  _send_recall_notification, and _apply_flushed_changes are omitted
    #  for brevity in this sketch.)

# Lease duration and timing considerations
LEASE_DURATION = 90  # seconds
# Trade-offs:
# - Longer leases: Better caching, slower conflict resolution
# - Shorter leases: Faster conflict resolution, more renewals
# - NFSv4 default: 90 seconds, renewed automatically

NFSv4 delegation types:

- Read delegation: shared; the client may satisfy reads and attribute checks from its cache without contacting the server
- Write delegation: exclusive; the client may additionally cache writes locally until the delegation is recalled
The beauty of leases:
For uncontended files (the common case), leases provide near-local-file-system performance—reads and writes hit the local cache with no network traffic. Only when conflict occurs does the server intervene. This optimistic approach matches real-world access patterns where most files are accessed by a single user at a time.
Leases have expiration times to handle client failures. If a client crashes while holding a lease, the server can reclaim the lease after expiration and grant access to other clients. The lease duration balances the cost of renewals (short leases) against the delay in failure recovery (long leases). NFSv4 uses 90-second leases with automatic renewal during active use.
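A sketch of how expiry might be enforced, extending the hypothetical LeaseServer above (these helper functions are illustrative, not NFSv4's actual implementation): active clients push the expiry forward with renewals, and a server-side sweep reclaims leases whose holders have gone silent.

import time

def renew_lease(server, client_id, file_id, duration=90):
    """Called periodically by an active client to keep its lease alive."""
    lease = server.leases.get(file_id)
    if lease and lease['client'] == client_id:
        lease['expiry'] = time.time() + duration
        return True
    return False  # lease already expired and was reclaimed; re-request it

def reclaim_expired_leases(server):
    """Server-side sweep: a crashed client stops renewing, so its leases
    lapse and the files become available to other clients again."""
    now = time.time()
    expired = [f for f, lease in server.leases.items() if lease['expiry'] < now]
    for file_id in expired:
        del server.leases[file_id]
    return expired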
Write caching presents unique challenges—a cached write that's lost before reaching stable storage represents data loss. DFS implementations offer various write caching strategies with different safety and performance tradeoffs.
Write-through caching:
Every write is immediately sent to the server. The write call returns only after the server acknowledges permanent storage.
Client: write(fd, data)
↓
Send data to server
↓
Wait for server ACK
↓
Server writes to disk
↓
Server sends ACK
↓
write() returns success
Pros: Strong durability, simple consistency
Cons: Every write pays network latency
Write-back (delayed write) caching:
Writes are cached locally and sent to the server later—on sync, close, or when the cache fills.
write(fd, data)
↓
Cache locally (returns immediately)
...
[Later: Flush triggered]
↓
Send accumulated data to server
↓
Server acknowledges
Advantages:

- write() returns at memory speed; network latency is hidden from the application
- Many small writes can be combined into fewer, larger transfers to the server
- Repeated writes to the same region cross the network only once

Risks:

- Dirty data cached on the client is lost if the client crashes before it is flushed
- Write errors (server full, quota exceeded) are reported late, at flush or close time rather than at write()
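A minimal sketch of client-side write-back caching (hypothetical names; real clients track dirty pages rather than whole files): writes land in a local dirty buffer and only reach the server on fsync, close, or when the buffer grows too large.

class WriteBackCache:
    """Buffers writes locally; flushes on sync, close, or memory pressure."""
    def __init__(self, server, max_dirty_bytes=4 * 1024 * 1024):
        self.server = server                  # assumed to expose flush(path, data)
        self.max_dirty_bytes = max_dirty_bytes
        self.dirty = {}                       # path -> bytearray of unflushed data

    def write(self, path, data):
        # Returns immediately - no network round trip on the write path.
        buf = self.dirty.setdefault(path, bytearray())
        buf.extend(data)
        if sum(len(b) for b in self.dirty.values()) >= self.max_dirty_bytes:
            self.flush_all()                  # cache full: force a flush

    def fsync(self, path):
        # Durability point: data is lost if the client crashes before this.
        if path in self.dirty:
            self.server.flush(path, bytes(self.dirty.pop(path)))

    def close(self, path):
        self.fsync(path)                      # close-time flush, as in NFS

    def flush_all(self):
        for path in list(self.dirty):
            self.fsync(path)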
Write-behind with server-side buffering:
Writes sent to server but ACKed before disk persistence.
write(fd, data)
↓
Send to server
↓
Server buffers in RAM
↓
Server ACKs immediately
↓
[Later: Server flushes to disk]
Advantages:

- Write latency is a single network round trip; the client does not wait for the disk
- The server can coalesce, reorder, and batch buffered writes for efficient disk I/O

Risks:

- A server crash can lose acknowledged but unflushed data unless the buffer is battery-backed or replicated
- Clients may believe data is durable before it reaches stable storage
NFS write behavior (controlled by mount options):
# Write-through (synchronous)
mount -t nfs -o sync server:/export /mnt
# Every write waits for server ACK
# Safe but slow
# Write-back (asynchronous) - Default
mount -t nfs -o async server:/export /mnt
# Writes cached, flushed later
# Fast but less safe
# Server-side commit behavior: wdelay is an export option (set in /etc/exports
# on the server, not a client mount option)
#   /export  clients(rw,wdelay)
# The server briefly delays committing writes, expecting related writes to arrive
# This improves write combining
The sync flag and close semantics:
Applications can request specific durability:
fd = open("file.txt", O_WRONLY);
write(fd, data, len); // May be cached
fsync(fd); // Force cache flush to server's disk
close(fd); // Also typically forces flush
Critical applications (databases, logs) use fsync explicitly; casual applications rely on close-time flushing.
File data isn't the only thing worth caching—metadata caching can provide even larger performance benefits. Consider that a simple ls -l command in a directory with 1000 files requires at least 1001 stat calls (one for the directory, one for each file). Without metadata caching, that's 1001 network round trips.
What metadata is cached:

- File attributes: size, timestamps, ownership, permissions, and the change/version attribute
- Directory entries: name → file handle (inode) mappings used during path lookup
- Negative results: the fact that a name does not exist (see negative caching below)
Attribute caching in NFS:
NFS aggressively caches file attributes to minimize server load. The ac* mount options control this behavior:
# Attribute cache controls
# acregmin/acregmax: regular file attributes cached for 3-60 seconds
# acdirmin/acdirmax: directory attributes cached for 30-60 seconds
mount -t nfs -o acregmin=3,acregmax=60,acdirmin=30,acdirmax=60 \
    server:/export /mnt

# Behavior:
# - First access: fetch attributes from the server, cache locally
# - Within the current timeout: use the cached value without contacting the server
# - After the timeout expires: revalidate with the server on next access
# - The timeout adapts between the min and max based on how recently the file changed

# Disable attribute caching entirely (for debugging or strict consistency)
mount -t nfs -o noac server:/export /mnt
The attribute timeout tradeoff:
Longer attribute cache times reduce server load but increase the window for stale data:
Timeline with 60-second attribute cache:
──────────────────────────────────────────────────────
Client A: stat() → cached           stat() → cached (stale!)
Client B:             write()──close()
──────────────────────────────────────────────────────
                      ↑                     ↑
               File modified      A still sees old size/mtime
For many workloads (where files are modified rarely), this is acceptable. For workloads requiring immediate visibility, use shorter timeouts or noac.
Caching 'file not found' results is surprisingly important. Many applications probe for optional files or search multiple paths. Without negative caching, each failed lookup hits the server. NFS caches negative lookups for a configurable duration, significantly improving applications that do path searching (like shell $PATH lookups).
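A small sketch of negative caching (an illustrative, hypothetical interface, not NFS's actual code): failed lookups are remembered for a short TTL so repeated probes for the same missing path stay local.

import time

class NegativeLookupCache:
    """Remembers 'file not found' results for a short time."""
    def __init__(self, lookup_on_server, ttl=3.0):
        self.lookup_on_server = lookup_on_server  # function(path) -> attrs or None
        self.ttl = ttl
        self.misses = {}  # path -> time the miss was cached

    def lookup(self, path):
        cached_at = self.misses.get(path)
        if cached_at is not None and time.time() - cached_at < self.ttl:
            return None                      # served from negative cache, no RPC
        attrs = self.lookup_on_server(path)  # one network round trip
        if attrs is None:
            self.misses[path] = time.time()  # cache the miss
        else:
            self.misses.pop(path, None)      # path exists now; drop the stale miss
        return attrs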
Phil Karlton famously said, "There are only two hard things in Computer Science: cache invalidation and naming things." In distributed file systems, we face both. Cache invalidation—determining when cached data is stale and needs refreshing—is particularly challenging.
Invalidation approaches:
| Method | How It Works | Pros | Cons |
|---|---|---|---|
| Time-Based Expiry | Cache entries expire after fixed/computed TTL | Simple, no server state needed | Stale reads possible within TTL |
| Server Callbacks | Server notifies clients of changes | Immediate invalidation | Server must track all clients |
| Version Validation | Cache stores version; server validates on access | Consistent, scalable | Requires round-trip to check |
| Lease Recall | Server recalls leases before allowing changes | Strong consistency when needed | Complex, delay for conflicts |
NFS client-driven validation:
NFS uses a clever validation scheme based on change attributes:
Validation request (simplified):
Client → Server: GETATTR file_id, cached_change_id=12345
Server → Client: change_id=12345, cache_valid=true
OR
change_id=12347, cache_valid=false
This is lighter than re-reading the data—just checking if it changed. The client controls when to validate based on ac* mount options.
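A sketch of that validation flow in Python (the server interface and helper names are hypothetical): the client keeps the change attribute alongside cached data and issues a cheap GETATTR-style check instead of re-reading the file.

import time

class ValidatingCache:
    """Caches file data plus the change attribute it was read with."""
    def __init__(self, server, revalidate_after=3.0):
        self.server = server              # assumed to expose getattr_rpc(fid), read(fid)
        self.revalidate_after = revalidate_after
        self.entries = {}                 # fid -> (data, change_id, checked_at)

    def read(self, fid):
        entry = self.entries.get(fid)
        if entry is not None:
            data, change_id, checked_at = entry
            if time.time() - checked_at < self.revalidate_after:
                return data               # within attribute timeout: trust the cache
            # Cheap validation: only attributes cross the network.
            if self.server.getattr_rpc(fid)['change_id'] == change_id:
                self.entries[fid] = (data, change_id, time.time())
                return data               # unchanged: cached data is still valid
        # Miss or stale: re-read the data and its current change attribute.
        data = self.server.read(fid)
        change_id = self.server.getattr_rpc(fid)['change_id']
        self.entries[fid] = (data, change_id, time.time())
        return data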
AFS callback mechanism:
AFS (Andrew File System) pioneered server callbacks for cache invalidation:
                     Server
                ┌─────────────┐
  Callback      │ File: F     │      Callback
  break         │ Clients:    │      break
 ←──────────────│ [A, B, C]   │──────────────→
                │             │
┌────────┐      │ Modifying   │      ┌────────┐
│Client A│←─────│ write by D  │─────→│Client B│
└────────┘      └─────────────┘      └────────┘
     ↓                                    ↓
Invalidate F                        Invalidate F
Advantage: cached data can be trusted until a callback break arrives.
Disadvantage: the server must maintain callback state for every (cached file × client) pair.
If a heavily-read file (like a shared library) is modified, the server must send callback breaks to potentially thousands of clients simultaneously. This 'callback storm' can overwhelm the server. Systems handle this through rate limiting, batching, or simply invalidating heavily-shared files less aggressively.
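A sketch of AFS-style callbacks (hypothetical names, not the AFS source): the server remembers which clients cache which file and breaks those callbacks before applying a write.

class CallbackServer:
    """Tracks which clients cache which files; breaks callbacks before writes."""
    def __init__(self, notify):
        self.notify = notify          # function(client_id, file_id): send invalidation
        self.callbacks = {}           # file_id -> set of client_ids holding a callback

    def register(self, client_id, file_id):
        # Client fetched the file; the server promises to notify it before changes.
        self.callbacks.setdefault(file_id, set()).add(client_id)

    def apply_write(self, file_id, writer_id):
        # Break every outstanding callback, then let the write proceed.
        # (Real servers rate-limit or batch these to avoid callback storms.)
        for client_id in self.callbacks.pop(file_id, set()) - {writer_id}:
            self.notify(client_id, file_id)   # "callback break": invalidate the cache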
Beyond caching data that was already accessed, distributed file systems can anticipate future accesses through prefetching (also called readahead). By predicting what data will be needed and fetching it before it's requested, the system can hide network latency.
Sequential readahead:
The most common and effective prefetching strategy. If a client reads bytes 0-4095, they'll probably want bytes 4096-8191 next.
Sequential Read Pattern:
────────────────────────────────────────────────────
Application: read(0-4K) read(4K-8K) read(8K-12K)
Without RA: [wait] [wait] [wait]
↓ ↓ ↓
fetch fetch fetch
With Readahead:
────────────────────────────────────────────────────
Application: read(0-4K) read(4K-8K) read(8K-12K)
With RA: [wait] [cache hit] [cache hit]
↓
fetch 0-16K (data already prefetched)
Readahead dramatically improves sequential read performance by overlapping I/O with processing.
Readahead adaptation:
Smart implementations adapt readahead size based on access patterns:
Adaptive Readahead Algorithm:
1. Start with small readahead (e.g., 32KB)
2. If pattern is confirmed sequential, increase readahead
3. Double readahead each time sequential access continues
4. Cap at maximum (e.g., 512KB or 2MB)
5. If random access detected, disable or reduce readahead
Linux kernel implementation:
- Initial window: 32KB
- Maximum window: /sys/block/<dev>/queue/read_ahead_kb (default 128KB)
- Pattern detection: tracks last few read positions
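A minimal sketch of that adaptive policy (illustrative constants, not the kernel's actual code): the readahead window doubles while reads stay sequential and collapses when access turns random.

class AdaptiveReadahead:
    """Grows the readahead window on sequential access, shrinks it on random."""
    INITIAL = 32 * 1024          # 32 KB starting window
    MAXIMUM = 2 * 1024 * 1024    # 2 MB cap

    def __init__(self):
        self.window = self.INITIAL
        self.next_expected = 0   # offset a sequential reader would request next

    def on_read(self, offset, length):
        """Returns how many bytes to prefetch beyond this read (0 = none)."""
        if offset == self.next_expected:
            # Sequential pattern confirmed: double the window up to the cap.
            self.window = min(self.window * 2, self.MAXIMUM)
            prefetch = self.window
        else:
            # Random access detected: reset and stop prefetching.
            self.window = self.INITIAL
            prefetch = 0
        self.next_expected = offset + length
        return prefetch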
DFS-specific prefetching:
Distributed file systems can implement additional prefetching strategies:
HDFS readahead:
HDFS clients implement their own readahead since data comes from remote DataNodes:
// HDFS read strategy
DFSInputStream {
// When reading block B, prefetch next blocks
prefetchBlocks = configuration.get("dfs.client.read.prefetch.size");
// Parallel reads from multiple DataNodes
hedgedReadThreadpool = ...; // Hedged reads: retry slow reads on a backup replica in parallel
}
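The hedged read mentioned above deserves a closer look. A minimal Python sketch of the idea (the `read_from` callable and replica list are hypothetical, not HDFS's actual API): issue the read to the primary replica and, if it has not returned within a small delay, issue the same read to a backup and take whichever finishes first.

import concurrent.futures

def hedged_read(read_from, replicas, hedge_after=0.05):
    """Read from the primary replica; if it is slow, also ask a backup and
    return whichever response arrives first (illustrative sketch)."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(read_from, replicas[0])]
    done, _ = concurrent.futures.wait(futures, timeout=hedge_after)
    if not done and len(replicas) > 1:
        futures.append(pool.submit(read_from, replicas[1]))  # hedge on a backup
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    pool.shutdown(wait=False)    # let the slower read finish in the background
    return next(iter(done)).result()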
Prefetching isn't free—it consumes network bandwidth and memory. Poorly tuned readahead on random access patterns wastes resources and can actually decrease performance. Modern systems detect access patterns and disable readahead when appropriate.
We've explored the critical role of caching in distributed file systems and the mechanisms that make it work. The key insights:

- Network latency dwarfs local access latency, so cache hit rate dominates DFS performance
- Caches exist at multiple layers (application, kernel, DFS client, server), and coherence must span them all
- Coherence models range from no caching through write-through, write-back, session semantics, and leases, trading consistency for performance
- Write caching improves latency but opens durability windows; fsync and close define the flush points
- Metadata and negative-lookup caching matter as much as data caching for many workloads
- Invalidation relies on TTL expiry, version validation, server callbacks, or lease recall, each with distinct scaling properties
- Prefetching (readahead) hides network latency for sequential access but must back off for random access
What's next:
Caching optimizes access to individual copies of data, but distributed file systems maintain multiple copies through replication. Next, we'll explore how DFS implementations create, maintain, and synchronize replicas to provide fault tolerance and improved read performance.
You now understand the caching strategies used in distributed file systems, from basic caching principles through sophisticated lease-based coherence protocols, and you can analyze the tradeoffs between performance and consistency in different caching approaches.