The buffer cache creates an illusion that file data exists in fast memory. But this illusion must be carefully maintained—if the cache becomes inconsistent with storage, or with itself across processes, catastrophic consequences follow: corrupted files, data loss, and confused applications.
Cache consistency addresses several related challenges:

- Multi-process coherence: processes on one machine must see a single, consistent view of a file
- Memory-mapped files: stores through mmap must stay consistent with read() and write()
- Cache invalidation: knowing when cached data no longer reflects the source of truth
- Consistency semantics: the visibility guarantees a file system promises (POSIX, session, eventual)
- Network file systems: keeping caches coherent across multiple client machines (NFS, CIFS)
- Crash recovery: returning to a valid state after cached data is lost
Each of these challenges requires specific mechanisms and protocols. Understanding them is essential for building reliable software that correctly handles concurrent access and survives failures.
By the end of this page, you will understand: (1) How the buffer cache maintains coherence across multiple processes, (2) Memory-mapped file consistency challenges and solutions, (3) The cache invalidation problem and when it occurs, (4) Consistency semantics (POSIX, session, eventual), (5) Network file system consistency (NFS, CIFS), and (6) Recovery mechanisms after crashes.
On a single machine, multiple processes may simultaneously access the same file. The operating system must ensure they see a consistent view of the file's contents. This is achieved through unified caching—all processes share a single cache buffer for each file block.
The Unified Cache Model:
When two processes open the same file, they share the underlying page cache entries. This is the foundation of single-system consistency:
```c
/*
 * Unified Page Cache Architecture
 *
 * Key insight: The page cache is indexed by (inode, offset), NOT by
 * (fd, offset). This means all file descriptors for the same file
 * share cache pages.
 *
 * Process A: fd = open("/data/file.txt", O_RDWR);
 * Process B: fd = open("/data/file.txt", O_RDWR);
 *
 * Both file descriptors point to the same inode.
 * Both read from/write to the same page cache entries.
 */

struct address_space {
    struct inode *host;               /* Owning inode */
    struct radix_tree_root page_tree; /* Radix tree of cached pages */
    spinlock_t tree_lock;             /* Protects page_tree */
    atomic_t i_mmap_writable;         /* Count of writable mappings */
    struct rb_root i_mmap;            /* Tree of private/shared mappings */
    unsigned long nrpages;            /* Number of cached pages */
    /* ... */
};

/*
 * When any process reads offset X of a file:
 * 1. Look up the page in address_space->page_tree
 * 2. If found: return the cached page (shared with all other readers)
 * 3. If not found: allocate a page, read from disk, insert into the tree
 */
struct page *find_or_create_page(struct address_space *mapping,
                                 pgoff_t index, gfp_t gfp)
{
    struct page *page;

repeat:
    page = find_get_page(mapping, index);
    if (!page) {
        /* Not in cache - need to create */
        page = alloc_page(gfp);
        if (!page)
            return NULL;

        /* Try to insert into the page cache.
         * This is atomic - if another thread beat us, we'll find their page. */
        if (add_to_page_cache_lru(page, mapping, index, gfp) < 0) {
            /* Race: someone else added a page first */
            put_page(page);
            goto repeat;  /* Return the other thread's page */
        }
    }
    return page;
}

/*
 * When any process writes offset X:
 * 1. Find/create the page (as above)
 * 2. Modify the page in place
 * 3. Mark the page dirty
 * 4. All other processes immediately see the modification!
 *
 * There's only ONE copy of the data, so writes are immediately visible.
 */
```

Why This Works:
Because all processes share the same page cache entries:
- Writes are immediately visible: when Process A modifies a cached page, Process B sees the change instantly (they're literally reading the same memory)
- No coherence protocol needed: unlike CPU caches (which need MESI/MOESI protocols), there's only one copy to keep consistent
- Efficient memory usage: a file opened by 100 processes uses the same cache memory as a file opened once
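To see this from user space, here is a minimal, hypothetical demonstration (the filename demo.txt is illustrative): two descriptors from independent open() calls observe each other's writes with no synchronization, because both resolve to the same inode's cache pages.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd1 = open("demo.txt", O_RDWR | O_CREAT | O_TRUNC, 0644);
    int fd2 = open("demo.txt", O_RDONLY);  /* separate open, same inode */
    if (fd1 < 0 || fd2 < 0) {
        perror("open");
        return 1;
    }

    /* Write through the first descriptor: goes to the shared page cache */
    if (write(fd1, "Hello", 5) != 5) {
        perror("write");
        return 1;
    }

    /* Read through the second descriptor: same cache page, no fsync needed */
    char buf[6] = {0};
    if (read(fd2, buf, 5) != 5) {
        perror("read");
        return 1;
    }
    printf("fd2 sees: %s\n", buf);  /* Prints "Hello" */

    close(fd1);
    close(fd2);
    unlink("demo.txt");
    return 0;
}
```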
Per-Process Buffering:
While the kernel cache is unified, applications often add their own buffering (stdio, language runtime buffers). This can create apparent inconsistencies:
```c
/*
 * Application buffering can create apparent inconsistency
 */

/* Process A */
FILE *f = fopen("shared.txt", "w");
fprintf(f, "Hello");   /* In the libc buffer, NOT in the kernel cache yet */
/* Process B reads: sees nothing */

fflush(f);             /* Pushes to the kernel page cache */
/* Now Process B sees "Hello" */

/* The kernel cache is consistent - the libc buffer is the issue */

/*
 * Solution: use unbuffered I/O or explicit flushing
 */

/* Option 1: Unbuffered I/O (direct system calls) */
int fd = open("shared.txt", O_RDWR);
write(fd, "Hello", 5);          /* Directly to the kernel cache */

/* Option 2: Disable stdio buffering */
setvbuf(f, NULL, _IONBF, 0);

/* Option 3: Explicit flush after writes */
fprintf(f, "Hello");
fflush(f);
```

POSIX requires that after a successful write(), subsequent read() calls from ANY process will return the written data. This is called 'sequential consistency' at the file level. The unified page cache makes this guarantee naturally efficient—it's just shared memory.
Memory-mapped files (mmap) allow processes to access file contents directly as memory, without explicit read/write system calls. This creates consistency challenges because modifications happen through memory stores, not tracked system calls.
The mmap Consistency Model:
```c
/*
 * Memory-mapped file consistency
 *
 * When a file is mmap'd, its page cache pages are mapped directly
 * into the process's address space. Modifications through the mapping
 * modify the page cache directly.
 */

#include <sys/mman.h>

void mmap_example()
{
    int fd = open("data.bin", O_RDWR);

    /* Map 4KB of the file into memory */
    char *ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);

    /*
     * ptr now points directly to the page cache page.
     * Writing through ptr modifies the page cache.
     */
    ptr[0] = 'X';  /* Directly modifies the page cache */
    /* Page automatically marked dirty by the MMU */

    /*
     * Another process reading the same file:
     * - If using read(): sees 'X' (same page cache page)
     * - If using mmap(): shares the SAME page, sees 'X' immediately
     */
}

/*
 * MAP_SHARED vs MAP_PRIVATE
 *
 * MAP_SHARED:  Modifications are visible to other processes
 *              and written back to the file
 *
 * MAP_PRIVATE: Copy-on-write - modifications are private
 *              to this process and NOT written back
 */

/*
 * Consistency between mmap and read/write
 *
 * POSIX guarantees: read/write and mmap operate on the same data.
 * Linux achieves this by having both use the same page cache.
 */

void mixed_access_example(int fd)
{
    char buf[10];

    /* mmap part of the file */
    char *ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);

    /* Write through the mapping */
    ptr[0] = 'A';

    /* Read through read() */
    lseek(fd, 0, SEEK_SET);
    read(fd, buf, 1);
    /* buf[0] == 'A' -- guaranteed by POSIX */

    /* Write through write() */
    lseek(fd, 0, SEEK_SET);
    write(fd, "B", 1);
    /* ptr[0] == 'B' -- visible through the mapping */
}
```

The Dirty Page Problem with mmap:
When a process modifies memory through an mmap mapping, the CPU's MMU sets the dirty bit in the page table entry. The OS must:

1. Discover which pages were dirtied (by scanning page-table dirty bits or taking write-protect faults)
2. Propagate that dirty state to the corresponding page cache pages
3. Schedule those pages for writeback to the file
This happens automatically for MAP_SHARED mappings.
Modifications made through mmap are eventually written back by the kernel, but msync() is required for durability guarantees: without msync(MS_SYNC), you have no control over when modified pages reach storage. For crash safety, call msync() before considering data committed. The MS_ASYNC flag schedules writeback without waiting, while MS_SYNC waits for completion.
```c
/*
 * msync() for mmap consistency and durability
 */

void durable_mmap_write(char *ptr, size_t len)
{
    /* Modify through the mapping */
    memcpy(ptr, "important data", 14);

    /* Option 1: Synchronous - wait for the write to complete */
    if (msync(ptr, len, MS_SYNC) != 0) {
        perror("msync MS_SYNC failed");
        /* Data may not be durable! */
    }
    /* After this returns, the data is on stable storage */

    /* Option 2: Asynchronous - schedule the write, don't wait */
    msync(ptr, len, MS_ASYNC);
    /* Data will be written eventually, but there is no guarantee yet */

    /* Option 3: Invalidate - for coherence with external changes */
    msync(ptr, len, MS_INVALIDATE);
    /* Cached pages may be discarded, forcing a re-read from disk */
}
```

Cache invalidation is one of the hardest problems in computer science. When should cached data be discarded because it no longer reflects the source of truth? Getting this wrong leads to stale data; overdoing it destroys performance.
"There are only two hard things in Computer Science: cache invalidation and naming things." This quip reflects the genuine difficulty of knowing when cached data is stale.
When Cache Invalidation Occurs:
| Scenario | Invalidation Mechanism | Example |
|---|---|---|
| File deleted | All pages for inode removed | rm removes file |
| File truncated | Pages beyond new size removed | truncate() shrinks file |
| Direct I/O write | Overlapping cache pages invalidated | Database bypassing the cache |
| Device removed | All pages for device invalidated | USB drive ejected |
| NFS cache timeout | Attribute cache expires | Remote file changed |
| Explicit invalidation | Application requests it | posix_fadvise(POSIX_FADV_DONTNEED) |
```c
/*
 * Cache invalidation mechanisms in Linux (simplified)
 */

/*
 * Truncate: remove pages beyond the new size
 */
void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
{
    struct pagevec pvec;
    pgoff_t index = lstart >> PAGE_SHIFT;

    pagevec_init(&pvec);

    /* Find all pages from 'index' onwards */
    while (pagevec_lookup(&pvec, mapping, &index, PAGEVEC_SIZE)) {
        for (int i = 0; i < pagevec_count(&pvec); i++) {
            struct page *page = pvec.pages[i];

            lock_page(page);
            if (page->mapping == mapping) {
                /* Zero the tail of a partial page at the boundary */
                if (page->index == (lstart >> PAGE_SHIFT)) {
                    unsigned offset = lstart & (PAGE_SIZE - 1);
                    zero_user_segment(page, offset, PAGE_SIZE);
                }
                /* Remove pages fully beyond lstart from the cache */
                if (page->index > (lstart >> PAGE_SHIFT))
                    delete_from_page_cache(page);
            }
            unlock_page(page);
        }
        pagevec_release(&pvec);
    }
}

/*
 * Direct I/O: must invalidate overlapping cached pages
 */
ssize_t generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
{
    struct file *file = iocb->ki_filp;
    struct address_space *mapping = file->f_mapping;
    loff_t pos = iocb->ki_pos;
    size_t count = iov_iter_count(from);

    /*
     * CRITICAL: invalidate cached pages in the write range.
     * This ensures read() returns the newly direct-written data.
     */
    invalidate_inode_pages2_range(mapping,
                                  pos >> PAGE_SHIFT,
                                  (pos + count - 1) >> PAGE_SHIFT);

    /* Now do the direct I/O, bypassing the cache entirely */
    return __generic_file_write_iter(iocb, from);
}

/*
 * posix_fadvise: application-directed cache control
 * (sketch: mapping/start_index/end_index are derived from fd, offset, len)
 */
int posix_fadvise(int fd, off_t offset, off_t len, int advice)
{
    switch (advice) {
    case POSIX_FADV_DONTNEED:
        /*
         * Application says: "I won't need this data again."
         * Invalidate these pages from the cache.
         */
        invalidate_mapping_pages(mapping, start_index, end_index);
        break;

    case POSIX_FADV_WILLNEED:
        /* Start reading pages into the cache */
        force_page_cache_readahead(mapping, file, start_index, nrpages);
        break;

    case POSIX_FADV_NOREUSE:
        /*
         * Hint: data will be accessed once; don't prioritize it
         * in the LRU (implementation varies).
         */
        break;
    }
    return 0;
}
```

Linux provides /proc/sys/vm/drop_caches to manually invalidate caches for testing and benchmarking: `echo 1 > /proc/sys/vm/drop_caches` frees the page cache, `echo 2` frees dentries and inodes, and `echo 3` frees both. This is useful for testing but should never be used in production—it destroys a carefully accumulated cache.
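To observe invalidation from user space, here is a hedged sketch (the filename bigfile.dat is illustrative, and eviction behavior varies by kernel, since POSIX_FADV_DONTNEED is advisory): it populates the cache by reading, requests invalidation, and checks page residency with mincore().

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the file's first page is cache-resident, 0 if not, -1 on error */
static int first_page_resident(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;
    unsigned char vec;
    int rc = mincore(p, 1, &vec);   /* residency of the first page */
    munmap(p, len);                 /* unmap so DONTNEED can evict */
    return rc == 0 ? (vec & 1) : -1;
}

int main(void)
{
    int fd = open("bigfile.dat", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    read(fd, buf, sizeof(buf));     /* populate the page cache */

    printf("resident before: %d\n", first_page_resident(fd, 4096));

    /* Application-directed invalidation: "I won't need this again" */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    printf("resident after:  %d\n", first_page_resident(fd, 4096));
    close(fd);
    return 0;
}
```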
Different file systems and access methods provide different consistency semantics—guarantees about when changes become visible to other processes. Understanding these semantics is crucial for correct concurrent programming.
Major Consistency Models:
| Model | Guarantee | Examples |
|---|---|---|
| UNIX/POSIX | Changes visible immediately to all processes | Local ext4, XFS, NTFS |
| Session Semantics | Changes visible on close(); new opens see updates | AFS (Andrew File System) |
| Immutable Files | Once created, files never change | WORM storage, IPFS |
| Eventual Consistency | Changes propagate eventually, no ordering | Some distributed systems |
| Release Consistency | Changes visible after explicit sync point | Some network FS |
```c
/*
 * POSIX/UNIX semantics (sequential consistency)
 *
 * Guarantee: after write() returns, any subsequent read() by any
 * process will see the written data.
 */

/* Process A */
write(fd, "Hello", 5);   /* Modifies the page cache, returns */

/* Process B (no synchronization needed) */
read(fd, buf, 5);        /* Sees "Hello" */

/* This works because both share the same page cache page */

/*
 * Session semantics (e.g., AFS)
 *
 * Guarantee: changes are visible only after close().
 * Opens get a snapshot as of open time.
 */

/* Process A on Machine 1 */
fd = open("shared.txt", O_RDWR);   /* Gets the current version */
write(fd, "Version 1", 9);         /* Writes to the local cache */
/* Process B on Machine 2 doesn't see this yet */
close(fd);                         /* Pushes changes to the server */
/* NOW Process B will see "Version 1" on its next open */

/* Process B on Machine 2 */
fd = open("shared.txt", O_RDONLY); /* Gets the updated version */
read(fd, buf, 9);                  /* Sees "Version 1" */

/*
 * Close-to-open consistency (NFS default)
 *
 * Client caches are validated on open().
 * Between open and close, a client may use stale data.
 */

/* NFSv3 default behavior */
/* Client A */
fd = open("/nfs/file", O_RDWR);    /* Validates with the server */
write(fd, "data", 4);              /* Buffers locally */
close(fd);                         /* Flushes to the server */

/* Client B */
fd = open("/nfs/file", O_RDONLY);  /* Checks if its cache is valid */
/* If the file changed on the server, the cache is invalidated and re-read */
read(fd, buf, 4);                  /* Gets the current data */

/*
 * Handling inconsistency
 *
 * When the consistency model is weaker than needed,
 * applications must add synchronization.
 */

/* Lock file for mutual exclusion */
int lock_fd = open("/nfs/file.lock", O_CREAT | O_EXCL, 0644);
if (lock_fd < 0) {
    /* Lock held by someone else */
    handle_contention();
}

/* Do work under the lock */
int fd = open("/nfs/file", O_RDWR);
/* ... modifications ... */
close(fd);

/* Release the lock */
close(lock_fd);
unlink("/nfs/file.lock");
```

The biggest mistake developers make with NFS is assuming POSIX consistency. Two processes on different machines writing to the same NFS file can silently corrupt each other's data without proper locking. Always use file locking (flock/fcntl) or a coordination service when multiple machines access shared files.
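As a sketch of the recommended approach, here is byte-range locking with fcntl() (the path /nfs/file is illustrative; this assumes the NFS server supports locking, via NLM on v3 or natively on v4):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/nfs/file", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct flock fl = {
        .l_type   = F_WRLCK,   /* exclusive write lock */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,         /* 0 = lock the whole file */
    };

    /* F_SETLKW blocks until the lock is granted */
    if (fcntl(fd, F_SETLKW, &fl) != 0) {
        perror("fcntl(F_SETLKW)");
        return 1;
    }

    /* ... modify the file; other lockers are excluded ... */

    fl.l_type = F_UNLCK;       /* release explicitly */
    fcntl(fd, F_SETLK, &fl);
    close(fd);
    return 0;
}
```

Note one classic pitfall of POSIX record locks: they are released when the process closes *any* descriptor for the file, so avoid opening the same file twice while holding a lock.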
Network File System (NFS) faces unique consistency challenges because multiple clients may cache the same file simultaneously. The NFS protocol has evolved through several versions, each with different consistency approaches.
NFS Caching Architecture:
```c
/*
 * NFS client caching
 *
 * NFS clients maintain several caches:
 * 1. Data cache:        file contents (like the local page cache)
 * 2. Attribute cache:   file metadata (mtime, size, etc.)
 * 3. Name lookup cache: directory entry to inode mappings
 */

struct nfs_inode {
    struct inode vfs_inode;            /* VFS inode */

    /* Attribute cache */
    unsigned long attrtimeo;           /* Attribute cache timeout */
    unsigned long attr_gencount;       /* Generation for validity */
    unsigned long read_cache_jiffies;  /* When the data cache was last validated */

    /* Delegation (NFSv4) */
    struct nfs_delegation *delegation; /* Server-granted rights */
    /* ... */
};

/*
 * Attribute cache timeout
 *
 * NFS caches file attributes (size, mtime, etc.) to avoid
 * constant round-trips to the server.
 *
 * The timeout is typically 3-60 seconds, dynamically adjusted:
 * - Recently changing files: short timeout
 * - Stable files: longer timeout
 */

bool nfs_attribute_cache_valid(struct inode *inode)
{
    struct nfs_inode *nfsi = NFS_I(inode);

    /* Check whether the attribute cache has expired */
    if (time_after(jiffies, nfsi->read_cache_jiffies + nfsi->attrtimeo))
        return false;
    return true;
}

/*
 * Close-to-open consistency
 *
 * On open():  validate cached attributes with the server
 * On close(): flush dirty data to the server
 */

int nfs_open(struct inode *inode, struct file *filp)
{
    /* Revalidate: ask the server whether our cache is still valid */
    int ret = nfs_revalidate_inode(NFS_SERVER(inode), inode);
    if (ret < 0)
        return ret;

    /* If the file changed, invalidate the data cache */
    if (nfs_ctime_has_changed(inode))
        invalidate_inode_pages2(inode->i_mapping);

    return 0;
}

int nfs_close(struct inode *inode, struct file *filp)
{
    /* Flush all dirty data to the server */
    nfs_sync_inode(inode);
    return 0;
}
```

| Feature | NFSv3 | NFSv4 | NFSv4.1+ |
|---|---|---|---|
| Protocol | Stateless | Stateful | Stateful + Sessions |
| Locking | Separate (NLM) | Integrated | Integrated |
| Delegations | No | Yes | Yes + pNFS |
| Close-to-open | Yes (client impl) | Yes (protocol) | Yes |
| Lease renewal | Lock manager | RENEW/sequences | Sessions |
NFSv4 Delegations:
NFSv4 introduced delegations—the server grants a client exclusive (write) or shared (read) rights to a file. While holding a delegation:

- The client can serve reads from its local cache without revalidating with the server
- With a write delegation, the client can also buffer writes and handle lock requests locally
- No attribute revalidation round-trips are needed; the server promises to notify the client of conflicts
The server recalls delegations when another client wants access, forcing cache flush and revalidation.
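A conceptual sketch of the client-side logic this implies (all types and helper names below are invented for illustration, not actual kernel interfaces):

```c
#include <stdbool.h>

enum deleg_type { DELEG_NONE, DELEG_READ, DELEG_WRITE };

struct nfs_client_file {
    enum deleg_type delegation;   /* granted by the server at open time */
    bool cache_valid;
};

/* With a delegation held, the client may serve I/O from its local cache
 * without contacting the server; otherwise it must revalidate. */
bool can_use_cache_without_revalidation(const struct nfs_client_file *f,
                                        bool is_write)
{
    if (f->delegation == DELEG_WRITE)
        return true;                  /* reads AND writes stay local */
    if (f->delegation == DELEG_READ && !is_write)
        return true;                  /* reads stay local */
    return false;                     /* must round-trip to the server */
}

/* When the server recalls the delegation (another client wants access),
 * the holder must flush dirty data and return the delegation. */
void on_delegation_recall(struct nfs_client_file *f)
{
    /* flush_dirty_pages_to_server(f);  -- hypothetical helper */
    f->delegation = DELEG_NONE;
    f->cache_valid = false;           /* revalidate on next access */
}
```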
Several NFS mount options trade performance for consistency:

- noac: disable attribute caching (slower but more consistent)
- actimeo=N: set the attribute cache timeout to N seconds
- lookupcache=none: disable name lookup caching
- sync: synchronous writes (slower, more durable)

For critical applications, consider noac or very short actimeo values.
When a system crashes (power failure, kernel panic, hardware fault), the buffer cache's contents are lost. Crash consistency ensures the system can recover to a valid state—even if not the very latest state—without file system corruption.
The Crash Consistency Problem:
File system operations typically require multiple block writes. For example, creating a file requires:

1. Allocating and writing the file's data block(s)
2. Allocating and initializing the inode that points to those blocks
3. Adding a directory entry that points to the new inode (plus updating allocation bitmaps along the way)
If the system crashes between any of these writes, the file system could be left in an inconsistent state: an inode with no directory entry (an orphaned file), a directory entry pointing to an uninitialized inode, or allocated blocks that no inode references.
```c
/*
 * Crash consistency mechanisms
 */

/*
 * 1. SYNCHRONOUS WRITES (simple but slow)
 *
 * Write each block synchronously, in a careful order.
 * A crash at any point leaves a consistent state.
 */
void create_file_sync(const char *name, const char *data)
{
    /* Allocate and write the data block */
    int block = allocate_block();
    write_block_sync(block, data);          /* Wait for completion */

    /* Allocate and write the inode */
    int inum = allocate_inode();
    struct inode in = { .blocks[0] = block, .size = strlen(data) };
    write_inode_sync(inum, &in);            /* Wait for completion */

    /* Update the directory */
    add_directory_entry_sync(name, inum);   /* Wait */

    /* Now safe even if a crash occurs */
}

/*
 * 2. JOURNALING (widely used)
 *
 * Write operations to a journal log before applying them.
 * On crash, replay the journal to complete/undo operations.
 */
void create_file_journal(journal_t *j, const char *name, const char *data)
{
    transaction_t *txn = journal_start(j);

    /* Log all changes to the journal */
    journal_dirty_data(txn, data_block);
    journal_dirty_metadata(txn, inode_block);
    journal_dirty_metadata(txn, directory_block);

    /* Commit atomically to the journal */
    journal_commit(txn);   /* Includes barriers/fsync */

    /* Now safe to write to the final locations */
    /* (or do it lazily - the journal protects us) */
}

/*
 * 3. COPY-ON-WRITE (modern approach)
 *
 * Never overwrite existing blocks.
 * Atomically update the root pointer to the new tree.
 * (Used by ZFS, Btrfs)
 */
void create_file_cow(struct btrfs_root *root, const char *name,
                     const char *data)
{
    /* Write the new data block (new location) */
    struct btrfs_item *data_item = btrfs_alloc_write(root, data);

    /* Create a new inode pointing to the data (new location) */
    struct btrfs_inode *inode = btrfs_alloc_inode(root);
    inode->items[0] = data_item;

    /* Update the directory tree (creates new nodes up to the root) */
    btrfs_insert_dir_item(root, name, inode);

    /* Atomic root pointer update */
    /* The old version remains valid until this commits */
    btrfs_commit_root(root);

    /* Old blocks can now be freed (or kept for snapshots) */
}

/*
 * 4. LOG-STRUCTURED (LFS)
 *
 * Append all writes to a log; never overwrite.
 * Periodic garbage collection reclaims space.
 */
void create_file_lfs(struct lfs *fs, const char *name, const char *data)
{
    /* Append the data to the log */
    lfs_addr_t data_addr = lfs_append(fs, data, strlen(data));

    /* Append the inode to the log */
    struct lfs_inode in = { .blocks[0] = data_addr };
    lfs_addr_t inode_addr = lfs_append(fs, &in, sizeof(in));

    /* Append the directory update */
    struct lfs_dirent de = { .name = name, .inode = inode_addr };
    lfs_append(fs, &de, sizeof(de));

    /* Update the inode map (where to find the latest inode version) */
    lfs_update_imap(fs, inode_num, inode_addr);

    /* Checkpoint */
    lfs_checkpoint(fs);
}
```

Before journaling, file systems required a full fsck (file system check) after crashes—potentially hours for large disks. With journaling, recovery replays only the recent journal entries—typically seconds. Modern journaling file systems (ext4, XFS, NTFS) make fsck rarely necessary.
While file systems provide basic consistency guarantees, applications often need stronger guarantees. Here are proven patterns for achieving application-level consistency.
Pattern 1: Write-Replace (Atomic Update)
```c
/*
 * PATTERN 1: Atomic file update via rename
 *
 * rename() is atomic on POSIX systems.
 * Write to a temp file, then rename it over the target.
 * Either the old or the new version exists, never a partial one.
 */
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int atomic_file_update(const char *path, const char *data, size_t len)
{
    char tmp_path[PATH_MAX];
    snprintf(tmp_path, sizeof(tmp_path), "%s.tmp.XXXXXX", path);

    /* Create the temp file */
    int fd = mkstemp(tmp_path);
    if (fd < 0)
        return -1;

    /* Write all the data */
    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }

    /* CRITICAL: fsync before rename */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    close(fd);

    /* Atomic rename */
    if (rename(tmp_path, path) != 0) {
        unlink(tmp_path);
        return -1;
    }

    /* Sync the directory for rename durability.
     * dirname() may modify its argument, so work on a copy. */
    char dir_buf[PATH_MAX];
    snprintf(dir_buf, sizeof(dir_buf), "%s", path);
    int dir_fd = open(dirname(dir_buf), O_RDONLY);
    if (dir_fd >= 0) {
        fsync(dir_fd);
        close(dir_fd);
    }
    return 0;   /* Fully durable and atomic */
}

/*
 * PATTERN 2: Write-ahead log (WAL)
 *
 * Log intended changes before applying them.
 * On crash, replay the log to recover.
 */
struct wal_entry {
    uint64_t lsn;        /* Log sequence number */
    uint32_t operation;  /* Which operation */
    uint32_t data_len;   /* How much data */
    char     data[];     /* The data itself */
};

int wal_write(int wal_fd, struct wal_entry *entry)
{
    /* Write the log entry */
    write(wal_fd, entry, sizeof(*entry) + entry->data_len);

    /* CRITICAL: fsync the log before applying the changes */
    fsync(wal_fd);

    /* Now safe to apply the actual change;
     * if a crash occurs, log replay will redo it */
    return 0;
}

/*
 * PATTERN 3: Two-phase commit (database-style)
 *
 * Phase 1: Prepare - log all changes, fsync
 * Phase 2: Commit  - write a commit record, fsync
 *
 * Crash during Phase 1: the transaction never happened.
 * Crash during Phase 2: the transaction committed (replay).
 */

/*
 * PATTERN 4: Lock files for coordination
 */
int acquire_lock(const char *path)
{
    char lock_path[PATH_MAX];
    snprintf(lock_path, sizeof(lock_path), "%s.lock", path);

    /* O_EXCL makes this atomic - only one process can create the file */
    int fd = open(lock_path, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0) {
        if (errno == EEXIST)
            return -1;   /* Lock held by another process */
        return -2;       /* Other error */
    }

    /* Write our PID for debugging */
    dprintf(fd, "%d", getpid());
    close(fd);
    return 0;            /* Lock acquired */
}

void release_lock(const char *path)
{
    char lock_path[PATH_MAX];
    snprintf(lock_path, sizeof(lock_path), "%s.lock", path);
    unlink(lock_path);
}
```

Lock files have failure modes: (1) stale locks if a process crashes without cleanup, (2) race conditions if not using O_EXCL, (3) not safe on all NFS versions. Consider flock()/fcntl() locking for single-machine scenarios, or a proper distributed lock service (etcd, ZooKeeper) for multi-machine coordination.
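For single-machine coordination, here is a minimal flock() sketch (the filename shared.db is illustrative): the kernel drops the lock automatically when the process exits, which avoids the stale-lock problem entirely.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
    int fd = open("shared.db", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Blocks until the exclusive lock is granted */
    if (flock(fd, LOCK_EX) != 0) {
        perror("flock");
        return 1;
    }

    /* ... critical section: mutate the file safely ... */

    flock(fd, LOCK_UN);   /* or simply close(fd) */
    close(fd);
    return 0;
}
```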
We've explored the multifaceted world of cache consistency—the mechanisms that ensure cached data accurately reflects its source and remains coherent across processes, machines, and failures. Let's consolidate the key insights:

- On a single machine, the unified page cache (indexed by inode and offset) makes writes immediately visible to all processes; apparent inconsistencies usually come from user-space buffering.
- mmap and read()/write() operate on the same page cache pages, but durability requires msync() or fsync().
- Cache invalidation occurs on deletion, truncation, direct I/O, device removal, NFS cache timeouts, and explicit requests such as posix_fadvise().
- Consistency semantics differ: POSIX (immediate visibility), session (visible at close), close-to-open (NFS), and eventual consistency. Know which one you are getting.
- NFS clients cache data, attributes, and name lookups; close-to-open consistency and NFSv4 delegations bound staleness but do not eliminate it.
- Crash consistency comes from write ordering: synchronous writes, journaling, copy-on-write, or log structuring, with application-level patterns (write-replace, WAL, locking) layered on top.
What's Next:
With cache consistency understood, we'll complete this module by examining sync operations—the system calls and mechanisms that force data from the buffer cache to stable storage, providing the durability guarantees that applications depend on.
You now understand the layered consistency challenges in cached file systems: single-system coherence, memory mapping, cache invalidation, network file systems, crash recovery, and application patterns. This knowledge is essential for building reliable distributed systems and debugging subtle data corruption issues.