The buffer cache creates an illusion that file data exists in fast memory. But this illusion must be carefully maintained—if the cache becomes inconsistent with storage, or with itself across processes, catastrophic consequences follow: corrupted files, data loss, and confused applications.
Cache consistency addresses several related challenges:

- Multi-process coherence: processes on one machine must see a single, consistent view of a file
- Memory-mapped files: stores through mmap must stay consistent with read() and write()
- Cache invalidation: knowing when cached data no longer reflects the source of truth
- Consistency semantics: the visibility guarantees a file system promises (POSIX, session, eventual)
- Network file systems: keeping caches coherent across multiple client machines (NFS, CIFS)
- Crash recovery: returning to a valid state after cached data is lost
Each of these challenges requires specific mechanisms and protocols. Understanding them is essential for building reliable software that correctly handles concurrent access and survives failures.
By the end of this page, you will understand: (1) How the buffer cache maintains coherence across multiple processes, (2) Memory-mapped file consistency challenges and solutions, (3) The cache invalidation problem and when it occurs, (4) Consistency semantics (POSIX, session, eventual), (5) Network file system consistency (NFS, CIFS), and (6) Recovery mechanisms after crashes.
On a single machine, multiple processes may simultaneously access the same file. The operating system must ensure they see a consistent view of the file's contents. This is achieved through unified caching—all processes share a single cache buffer for each file block.
The Unified Cache Model:
When two processes open the same file, they share the underlying page cache entries. This is the foundation of single-system consistency:
```c
/*
 * Unified Page Cache Architecture
 *
 * Key insight: The page cache is indexed by (inode, offset), NOT by
 * (fd, offset). This means all file descriptors for the same file
 * share cache pages.
 *
 * Process A: fd = open("/data/file.txt", O_RDWR);
 * Process B: fd = open("/data/file.txt", O_RDWR);
 *
 * Both file descriptors point to the same inode.
 * Both read from/write to the same page cache entries.
 */

struct address_space {
    struct inode *host;               /* Owning inode */
    struct radix_tree_root page_tree; /* Radix tree of cached pages */
    spinlock_t tree_lock;             /* Protects page_tree */
    atomic_t i_mmap_writable;         /* Count of writable mappings */
    struct rb_root i_mmap;            /* Tree of private/shared mappings */
    unsigned long nrpages;            /* Number of cached pages */
    /* ... */
};

/*
 * When any process reads offset X of a file:
 * 1. Look up the page in address_space->page_tree
 * 2. If found: return the cached page (shared with all other readers)
 * 3. If not found: allocate a page, read from disk, insert into the tree
 */
struct page *find_or_create_page(struct address_space *mapping,
                                 pgoff_t index, gfp_t gfp)
{
    struct page *page;

repeat:
    page = find_get_page(mapping, index);
    if (!page) {
        /* Not in cache - need to create */
        page = alloc_page(gfp);
        if (!page)
            return NULL;

        /* Try to insert into the page cache.
         * This is atomic - if another thread beat us, we'll find their page. */
        if (add_to_page_cache_lru(page, mapping, index, gfp) < 0) {
            /* Race: someone else added a page first */
            put_page(page);
            goto repeat;  /* Return the other thread's page */
        }
    }
    return page;
}

/*
 * When any process writes offset X:
 * 1. Find/create the page (as above)
 * 2. Modify the page in place
 * 3. Mark the page dirty
 * 4. All other processes immediately see the modification!
 *
 * There's only ONE copy of the data, so writes are immediately visible.
 */
```

Why This Works:
Because all processes share the same page cache entries:
- Writes are immediately visible: when Process A modifies a cached page, Process B sees the change instantly (they're literally reading the same memory)
- No coherence protocol needed: unlike CPU caches (which need MESI/MOESI protocols), there's only one copy to keep consistent
- Efficient memory usage: a file opened by 100 processes uses the same cache memory as a file opened once
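To see this from user space, here is a minimal, hypothetical demonstration (the filename demo.txt is illustrative): two descriptors from independent open() calls observe each other's writes with no synchronization, because both resolve to the same inode's cache pages.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd1 = open("demo.txt", O_RDWR | O_CREAT | O_TRUNC, 0644);
    int fd2 = open("demo.txt", O_RDONLY);  /* separate open, same inode */
    if (fd1 < 0 || fd2 < 0) {
        perror("open");
        return 1;
    }

    /* Write through the first descriptor: goes to the shared page cache */
    if (write(fd1, "Hello", 5) != 5) {
        perror("write");
        return 1;
    }

    /* Read through the second descriptor: same cache page, no fsync needed */
    char buf[6] = {0};
    if (read(fd2, buf, 5) != 5) {
        perror("read");
        return 1;
    }
    printf("fd2 sees: %s\n", buf);  /* Prints "Hello" */

    close(fd1);
    close(fd2);
    unlink("demo.txt");
    return 0;
}
```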
Per-Process Buffering:
While the kernel cache is unified, applications often add their own buffering (stdio, language runtime buffers). This can create apparent inconsistencies:
```c
/*
 * Application buffering can create apparent inconsistency
 */

/* Process A */
FILE *f = fopen("shared.txt", "w");
fprintf(f, "Hello");   /* In the libc buffer, NOT in the kernel cache yet */
/* Process B reads: sees nothing */

fflush(f);             /* Pushes to the kernel page cache */
/* Now Process B sees "Hello" */

/* The kernel cache is consistent - the libc buffer is the issue */

/*
 * Solution: use unbuffered I/O or explicit flushing
 */

/* Option 1: Unbuffered I/O (direct system calls) */
int fd = open("shared.txt", O_RDWR);
write(fd, "Hello", 5);          /* Directly to the kernel cache */

/* Option 2: Disable stdio buffering */
setvbuf(f, NULL, _IONBF, 0);

/* Option 3: Explicit flush after writes */
fprintf(f, "Hello");
fflush(f);
```

POSIX requires that after a successful write(), subsequent read() calls from ANY process will return the written data. This is called 'sequential consistency' at the file level. The unified page cache makes this guarantee naturally efficient—it's just shared memory.
Memory-mapped files (mmap) allow processes to access file contents directly as memory, without explicit read/write system calls. This creates consistency challenges because modifications happen through memory stores, not tracked system calls.
The mmap Consistency Model:
```c
/*
 * Memory-mapped file consistency
 *
 * When a file is mmap'd, its page cache pages are mapped directly
 * into the process's address space. Modifications through the mapping
 * modify the page cache directly.
 */

#include <sys/mman.h>

void mmap_example()
{
    int fd = open("data.bin", O_RDWR);

    /* Map 4KB of the file into memory */
    char *ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);

    /*
     * ptr now points directly to the page cache page.
     * Writing through ptr modifies the page cache.
     */
    ptr[0] = 'X';  /* Directly modifies the page cache */
    /* Page automatically marked dirty by the MMU */

    /*
     * Another process reading the same file:
     * - If using read(): sees 'X' (same page cache page)
     * - If using mmap(): shares the SAME page, sees 'X' immediately
     */
}

/*
 * MAP_SHARED vs MAP_PRIVATE
 *
 * MAP_SHARED:  Modifications are visible to other processes
 *              and written back to the file
 *
 * MAP_PRIVATE: Copy-on-write - modifications are private
 *              to this process and NOT written back
 */

/*
 * Consistency between mmap and read/write
 *
 * POSIX guarantees: read/write and mmap operate on the same data.
 * Linux achieves this by having both use the same page cache.
 */

void mixed_access_example(int fd)
{
    char buf[10];

    /* mmap part of the file */
    char *ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);

    /* Write through the mapping */
    ptr[0] = 'A';

    /* Read through read() */
    lseek(fd, 0, SEEK_SET);
    read(fd, buf, 1);
    /* buf[0] == 'A' -- guaranteed by POSIX */

    /* Write through write() */
    lseek(fd, 0, SEEK_SET);
    write(fd, "B", 1);
    /* ptr[0] == 'B' -- visible through the mapping */
}
```

The Dirty Page Problem with mmap:
When a process modifies memory through an mmap mapping, the CPU's MMU sets the dirty bit in the page table entry. The OS must:

1. Discover which pages were dirtied (by scanning page-table dirty bits or taking write-protect faults)
2. Propagate that dirty state to the corresponding page cache pages
3. Schedule those pages for writeback to the file
This happens automatically for MAP_SHARED mappings.
Modifications made through mmap are eventually written back by the kernel, but msync() is required for durability guarantees: without msync(MS_SYNC), you have no control over when modified pages reach storage. For crash safety, call msync() before considering data committed. The MS_ASYNC flag schedules writeback without waiting, while MS_SYNC waits for completion.
```c
/*
 * msync() for mmap consistency and durability
 */

void durable_mmap_write(char *ptr, size_t len)
{
    /* Modify through the mapping */
    memcpy(ptr, "important data", 14);

    /* Option 1: Synchronous - wait for the write to complete */
    if (msync(ptr, len, MS_SYNC) != 0) {
        perror("msync MS_SYNC failed");
        /* Data may not be durable! */
    }
    /* After this returns, the data is on stable storage */

    /* Option 2: Asynchronous - schedule the write, don't wait */
    msync(ptr, len, MS_ASYNC);
    /* Data will be written eventually, but there is no guarantee yet */

    /* Option 3: Invalidate - for coherence with external changes */
    msync(ptr, len, MS_INVALIDATE);
    /* Cached pages may be discarded, forcing a re-read from disk */
}
```

Cache invalidation is one of the hardest problems in computer science. When should cached data be discarded because it no longer reflects the source of truth? Getting this wrong leads to stale data; overdoing it destroys performance.
"There are only two hard things in Computer Science: cache invalidation and naming things." This quip reflects the genuine difficulty of knowing when cached data is stale.
When Cache Invalidation Occurs:
| Scenario | Invalidation Mechanism | Example |
|---|---|---|
| File deleted | All pages for inode removed | rm removes file |
| File truncated | Pages beyond new size removed | truncate() shrinks file |
| Direct I/O write | Overlapping cache pages invalidated | Database bypassing the cache |
| Device removed | All pages for device invalidated | USB drive ejected |
| NFS cache timeout | Attribute cache expires | Remote file changed |
| Explicit invalidation | Application requests it | posix_fadvise(POSIX_FADV_DONTNEED) |
```c
/*
 * Cache invalidation mechanisms in Linux (simplified)
 */

/*
 * Truncate: remove pages beyond the new size
 */
void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
{
    struct pagevec pvec;
    pgoff_t index = lstart >> PAGE_SHIFT;

    pagevec_init(&pvec);

    /* Find all pages from 'index' onwards */
    while (pagevec_lookup(&pvec, mapping, &index, PAGEVEC_SIZE)) {
        for (int i = 0; i < pagevec_count(&pvec); i++) {
            struct page *page = pvec.pages[i];

            lock_page(page);
            if (page->mapping == mapping) {
                /* Zero the tail of a partial page at the boundary */
                if (page->index == (lstart >> PAGE_SHIFT)) {
                    unsigned offset = lstart & (PAGE_SIZE - 1);
                    zero_user_segment(page, offset, PAGE_SIZE);
                }
                /* Remove pages fully beyond lstart from the cache */
                if (page->index > (lstart >> PAGE_SHIFT))
                    delete_from_page_cache(page);
            }
            unlock_page(page);
        }
        pagevec_release(&pvec);
    }
}

/*
 * Direct I/O: must invalidate overlapping cached pages
 */
ssize_t generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
{
    struct file *file = iocb->ki_filp;
    struct address_space *mapping = file->f_mapping;
    loff_t pos = iocb->ki_pos;
    size_t count = iov_iter_count(from);

    /*
     * CRITICAL: invalidate cached pages in the write range.
     * This ensures read() returns the newly direct-written data.
     */
    invalidate_inode_pages2_range(mapping,
                                  pos >> PAGE_SHIFT,
                                  (pos + count - 1) >> PAGE_SHIFT);

    /* Now do the direct I/O, bypassing the cache entirely */
    return __generic_file_write_iter(iocb, from);
}

/*
 * posix_fadvise: application-directed cache control
 * (sketch: mapping/start_index/end_index are derived from fd, offset, len)
 */
int posix_fadvise(int fd, off_t offset, off_t len, int advice)
{
    switch (advice) {
    case POSIX_FADV_DONTNEED:
        /*
         * Application says: "I won't need this data again."
         * Invalidate these pages from the cache.
         */
        invalidate_mapping_pages(mapping, start_index, end_index);
        break;

    case POSIX_FADV_WILLNEED:
        /* Start reading pages into the cache */
        force_page_cache_readahead(mapping, file, start_index, nrpages);
        break;

    case POSIX_FADV_NOREUSE:
        /*
         * Hint: data will be accessed once; don't prioritize it
         * in the LRU (implementation varies).
         */
        break;
    }
    return 0;
}
```

Linux provides /proc/sys/vm/drop_caches to manually invalidate caches for testing and benchmarking: `echo 1 > /proc/sys/vm/drop_caches` frees the page cache, `echo 2` frees dentries and inodes, and `echo 3` frees both. This is useful for testing but should never be used in production—it destroys a carefully accumulated cache.
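To observe invalidation from user space, here is a hedged sketch (the filename bigfile.dat is illustrative, and eviction behavior varies by kernel, since POSIX_FADV_DONTNEED is advisory): it populates the cache by reading, requests invalidation, and checks page residency with mincore().

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the file's first page is cache-resident, 0 if not, -1 on error */
static int first_page_resident(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;
    unsigned char vec;
    int rc = mincore(p, 1, &vec);   /* residency of the first page */
    munmap(p, len);                 /* unmap so DONTNEED can evict */
    return rc == 0 ? (vec & 1) : -1;
}

int main(void)
{
    int fd = open("bigfile.dat", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    read(fd, buf, sizeof(buf));     /* populate the page cache */

    printf("resident before: %d\n", first_page_resident(fd, 4096));

    /* Application-directed invalidation: "I won't need this again" */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    printf("resident after:  %d\n", first_page_resident(fd, 4096));
    close(fd);
    return 0;
}
```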
Different file systems and access methods provide different consistency semantics—guarantees about when changes become visible to other processes. Understanding these semantics is crucial for correct concurrent programming.
Major Consistency Models:
| Model | Guarantee | Examples |
|---|---|---|
| UNIX/POSIX | Changes visible immediately to all processes | Local ext4, XFS, NTFS |
| Session Semantics | Changes visible on close(); new opens see updates | AFS (Andrew File System) |
| Immutable Files | Once created, files never change | WORM storage, IPFS |
| Eventual Consistency | Changes propagate eventually, no ordering | Some distributed systems |
| Release Consistency | Changes visible after explicit sync point | Some network FS |
```c
/*
 * POSIX/UNIX semantics (sequential consistency)
 *
 * Guarantee: after write() returns, any subsequent read() by any
 * process will see the written data.
 */

/* Process A */
write(fd, "Hello", 5);   /* Modifies the page cache, returns */

/* Process B (no synchronization needed) */
read(fd, buf, 5);        /* Sees "Hello" */

/* This works because both share the same page cache page */

/*
 * Session semantics (e.g., AFS)
 *
 * Guarantee: changes are visible only after close().
 * Opens get a snapshot as of open time.
 */

/* Process A on Machine 1 */
fd = open("shared.txt", O_RDWR);   /* Gets the current version */
write(fd, "Version 1", 9);         /* Writes to the local cache */
/* Process B on Machine 2 doesn't see this yet */
close(fd);                         /* Pushes changes to the server */
/* NOW Process B will see "Version 1" on its next open */

/* Process B on Machine 2 */
fd = open("shared.txt", O_RDONLY); /* Gets the updated version */
read(fd, buf, 9);                  /* Sees "Version 1" */

/*
 * Close-to-open consistency (NFS default)
 *
 * Client caches are validated on open().
 * Between open and close, a client may use stale data.
 */

/* NFSv3 default behavior */
/* Client A */
fd = open("/nfs/file", O_RDWR);    /* Validates with the server */
write(fd, "data", 4);              /* Buffers locally */
close(fd);                         /* Flushes to the server */

/* Client B */
fd = open("/nfs/file", O_RDONLY);  /* Checks if its cache is valid */
/* If the file changed on the server, the cache is invalidated and re-read */
read(fd, buf, 4);                  /* Gets the current data */

/*
 * Handling inconsistency
 *
 * When the consistency model is weaker than needed,
 * applications must add synchronization.
 */

/* Lock file for mutual exclusion */
int lock_fd = open("/nfs/file.lock", O_CREAT | O_EXCL, 0644);
if (lock_fd < 0) {
    /* Lock held by someone else */
    handle_contention();
}

/* Do work under the lock */
int fd = open("/nfs/file", O_RDWR);
/* ... modifications ... */
close(fd);

/* Release the lock */
close(lock_fd);
unlink("/nfs/file.lock");
```

The biggest mistake developers make with NFS is assuming POSIX consistency. Two processes on different machines writing to the same NFS file can silently corrupt each other's data without proper locking. Always use file locking (flock/fcntl) or a coordination service when multiple machines access shared files.
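As a sketch of the recommended approach, here is byte-range locking with fcntl() (the path /nfs/file is illustrative; this assumes the NFS server supports locking, via NLM on v3 or natively on v4):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/nfs/file", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct flock fl = {
        .l_type   = F_WRLCK,   /* exclusive write lock */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,         /* 0 = lock the whole file */
    };

    /* F_SETLKW blocks until the lock is granted */
    if (fcntl(fd, F_SETLKW, &fl) != 0) {
        perror("fcntl(F_SETLKW)");
        return 1;
    }

    /* ... modify the file; other lockers are excluded ... */

    fl.l_type = F_UNLCK;       /* release explicitly */
    fcntl(fd, F_SETLK, &fl);
    close(fd);
    return 0;
}
```

Note one classic pitfall of POSIX record locks: they are released when the process closes *any* descriptor for the file, so avoid opening the same file twice while holding a lock.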
Network File System (NFS) faces unique consistency challenges because multiple clients may cache the same file simultaneously. The NFS protocol has evolved through several versions, each with different consistency approaches.
NFS Caching Architecture:
```c
/*
 * NFS client caching
 *
 * NFS clients maintain several caches:
 * 1. Data cache:        file contents (like the local page cache)
 * 2. Attribute cache:   file metadata (mtime, size, etc.)
 * 3. Name lookup cache: directory entry to inode mappings
 */

struct nfs_inode {
    struct inode vfs_inode;            /* VFS inode */

    /* Attribute cache */
    unsigned long attrtimeo;           /* Attribute cache timeout */
    unsigned long attr_gencount;       /* Generation for validity */
    unsigned long read_cache_jiffies;  /* When the data cache was last validated */

    /* Delegation (NFSv4) */
    struct nfs_delegation *delegation; /* Server-granted rights */
    /* ... */
};

/*
 * Attribute cache timeout
 *
 * NFS caches file attributes (size, mtime, etc.) to avoid
 * constant round-trips to the server.
 *
 * The timeout is typically 3-60 seconds, dynamically adjusted:
 * - Recently changing files: short timeout
 * - Stable files: longer timeout
 */

bool nfs_attribute_cache_valid(struct inode *inode)
{
    struct nfs_inode *nfsi = NFS_I(inode);

    /* Check whether the attribute cache has expired */
    if (time_after(jiffies, nfsi->read_cache_jiffies + nfsi->attrtimeo))
        return false;
    return true;
}

/*
 * Close-to-open consistency
 *
 * On open():  validate cached attributes with the server
 * On close(): flush dirty data to the server
 */

int nfs_open(struct inode *inode, struct file *filp)
{
    /* Revalidate: ask the server whether our cache is still valid */
    int ret = nfs_revalidate_inode(NFS_SERVER(inode), inode);
    if (ret < 0)
        return ret;

    /* If the file changed, invalidate the data cache */
    if (nfs_ctime_has_changed(inode))
        invalidate_inode_pages2(inode->i_mapping);

    return 0;
}

int nfs_close(struct inode *inode, struct file *filp)
{
    /* Flush all dirty data to the server */
    nfs_sync_inode(inode);
    return 0;
}
```

| Feature | NFSv3 | NFSv4 | NFSv4.1+ |
|---|---|---|---|
| Protocol | Stateless | Stateful | Stateful + Sessions |
| Locking | Separate (NLM) | Integrated | Integrated |
| Delegations | No | Yes | Yes + pNFS |
| Close-to-open | Yes (client impl) | Yes (protocol) | Yes |
| Lease renewal | Lock manager | RENEW/sequences | Sessions |
NFSv4 Delegations:
NFSv4 introduced delegations—the server grants a client exclusive (write) or shared (read) rights to a file. While holding a delegation:

- The client can serve reads from its local cache without revalidating with the server
- With a write delegation, the client can also buffer writes and handle lock requests locally
- No attribute revalidation round-trips are needed; the server promises to notify the client of conflicts
The server recalls delegations when another client wants access, forcing cache flush and revalidation.
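A conceptual sketch of the client-side logic this implies (all types and helper names below are invented for illustration, not actual kernel interfaces):

```c
#include <stdbool.h>

enum deleg_type { DELEG_NONE, DELEG_READ, DELEG_WRITE };

struct nfs_client_file {
    enum deleg_type delegation;   /* granted by the server at open time */
    bool cache_valid;
};

/* With a delegation held, the client may serve I/O from its local cache
 * without contacting the server; otherwise it must revalidate. */
bool can_use_cache_without_revalidation(const struct nfs_client_file *f,
                                        bool is_write)
{
    if (f->delegation == DELEG_WRITE)
        return true;                  /* reads AND writes stay local */
    if (f->delegation == DELEG_READ && !is_write)
        return true;                  /* reads stay local */
    return false;                     /* must round-trip to the server */
}

/* When the server recalls the delegation (another client wants access),
 * the holder must flush dirty data and return the delegation. */
void on_delegation_recall(struct nfs_client_file *f)
{
    /* flush_dirty_pages_to_server(f);  -- hypothetical helper */
    f->delegation = DELEG_NONE;
    f->cache_valid = false;           /* revalidate on next access */
}
```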
Several NFS mount options trade performance for consistency:

- noac: disable attribute caching (slower but more consistent)
- actimeo=N: set the attribute cache timeout to N seconds
- lookupcache=none: disable name lookup caching
- sync: synchronous writes (slower, more durable)

For critical applications, consider noac or very short actimeo values.
When a system crashes (power failure, kernel panic, hardware fault), the buffer cache's contents are lost. Crash consistency ensures the system can recover to a valid state—even if not the very latest state—without file system corruption.
The Crash Consistency Problem:
File system operations typically require multiple block writes. For example, creating a file requires:

1. Allocating and writing the file's data block(s)
2. Allocating and initializing the inode that points to those blocks
3. Adding a directory entry that points to the new inode (plus updating allocation bitmaps along the way)
If the system crashes between any of these writes, the file system could be left in an inconsistent state: an inode with no directory entry (an orphaned file), a directory entry pointing to an uninitialized inode, or allocated blocks that no inode references.
```c
/*
 * Crash consistency mechanisms
 */

/*
 * 1. SYNCHRONOUS WRITES (simple but slow)
 *
 * Write each block synchronously, in a careful order.
 * A crash at any point leaves a consistent state.
 */
void create_file_sync(const char *name, const char *data)
{
    /* Allocate and write the data block */
    int block = allocate_block();
    write_block_sync(block, data);          /* Wait for completion */

    /* Allocate and write the inode */
    int inum = allocate_inode();
    struct inode in = { .blocks[0] = block, .size = strlen(data) };
    write_inode_sync(inum, &in);            /* Wait for completion */

    /* Update the directory */
    add_directory_entry_sync(name, inum);   /* Wait */

    /* Now safe even if a crash occurs */
}

/*
 * 2. JOURNALING (widely used)
 *
 * Write operations to a journal log before applying them.
 * On crash, replay the journal to complete/undo operations.
 */
void create_file_journal(journal_t *j, const char *name, const char *data)
{
    transaction_t *txn = journal_start(j);

    /* Log all changes to the journal */
    journal_dirty_data(txn, data_block);
    journal_dirty_metadata(txn, inode_block);
    journal_dirty_metadata(txn, directory_block);

    /* Commit atomically to the journal */
    journal_commit(txn);   /* Includes barriers/fsync */

    /* Now safe to write to the final locations */
    /* (or do it lazily - the journal protects us) */
}

/*
 * 3. COPY-ON-WRITE (modern approach)
 *
 * Never overwrite existing blocks.
 * Atomically update the root pointer to the new tree.
 * (Used by ZFS, Btrfs)
 */
void create_file_cow(struct btrfs_root *root, const char *name,
                     const char *data)
{
    /* Write the new data block (new location) */
    struct btrfs_item *data_item = btrfs_alloc_write(root, data);

    /* Create a new inode pointing to the data (new location) */
    struct btrfs_inode *inode = btrfs_alloc_inode(root);
    inode->items[0] = data_item;

    /* Update the directory tree (creates new nodes up to the root) */
    btrfs_insert_dir_item(root, name, inode);

    /* Atomic root pointer update */
    /* The old version remains valid until this commits */
    btrfs_commit_root(root);

    /* Old blocks can now be freed (or kept for snapshots) */
}

/*
 * 4. LOG-STRUCTURED (LFS)
 *
 * Append all writes to a log; never overwrite.
 * Periodic garbage collection reclaims space.
 */
void create_file_lfs(struct lfs *fs, const char *name, const char *data)
{
    /* Append the data to the log */
    lfs_addr_t data_addr = lfs_append(fs, data, strlen(data));

    /* Append the inode to the log */
    struct lfs_inode in = { .blocks[0] = data_addr };
    lfs_addr_t inode_addr = lfs_append(fs, &in, sizeof(in));

    /* Append the directory update */
    struct lfs_dirent de = { .name = name, .inode = inode_addr };
    lfs_append(fs, &de, sizeof(de));

    /* Update the inode map (where to find the latest inode version) */
    lfs_update_imap(fs, inode_num, inode_addr);

    /* Checkpoint */
    lfs_checkpoint(fs);
}
```

Before journaling, file systems required a full fsck (file system check) after crashes—potentially hours for large disks. With journaling, recovery replays only the recent journal entries—typically seconds. Modern journaling file systems (ext4, XFS, NTFS) make fsck rarely necessary.
While file systems provide basic consistency guarantees, applications often need stronger guarantees. Here are proven patterns for achieving application-level consistency.
Pattern 1: Write-Replace (Atomic Update)
```c
/*
 * PATTERN 1: Atomic file update via rename
 *
 * rename() is atomic on POSIX systems.
 * Write to a temp file, then rename it over the target.
 * Either the old or the new version exists, never a partial one.
 */
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int atomic_file_update(const char *path, const char *data, size_t len)
{
    char tmp_path[PATH_MAX];
    snprintf(tmp_path, sizeof(tmp_path), "%s.tmp.XXXXXX", path);

    /* Create the temp file */
    int fd = mkstemp(tmp_path);
    if (fd < 0)
        return -1;

    /* Write all the data */
    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }

    /* CRITICAL: fsync before rename */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    close(fd);

    /* Atomic rename */
    if (rename(tmp_path, path) != 0) {
        unlink(tmp_path);
        return -1;
    }

    /* Sync the directory for rename durability.
     * dirname() may modify its argument, so work on a copy. */
    char dir_buf[PATH_MAX];
    snprintf(dir_buf, sizeof(dir_buf), "%s", path);
    int dir_fd = open(dirname(dir_buf), O_RDONLY);
    if (dir_fd >= 0) {
        fsync(dir_fd);
        close(dir_fd);
    }
    return 0;   /* Fully durable and atomic */
}

/*
 * PATTERN 2: Write-ahead log (WAL)
 *
 * Log intended changes before applying them.
 * On crash, replay the log to recover.
 */
struct wal_entry {
    uint64_t lsn;        /* Log sequence number */
    uint32_t operation;  /* Which operation */
    uint32_t data_len;   /* How much data */
    char     data[];     /* The data itself */
};

int wal_write(int wal_fd, struct wal_entry *entry)
{
    /* Write the log entry */
    write(wal_fd, entry, sizeof(*entry) + entry->data_len);

    /* CRITICAL: fsync the log before applying the changes */
    fsync(wal_fd);

    /* Now safe to apply the actual change;
     * if a crash occurs, log replay will redo it */
    return 0;
}

/*
 * PATTERN 3: Two-phase commit (database-style)
 *
 * Phase 1: Prepare - log all changes, fsync
 * Phase 2: Commit  - write a commit record, fsync
 *
 * Crash during Phase 1: the transaction never happened.
 * Crash during Phase 2: the transaction committed (replay).
 */

/*
 * PATTERN 4: Lock files for coordination
 */
int acquire_lock(const char *path)
{
    char lock_path[PATH_MAX];
    snprintf(lock_path, sizeof(lock_path), "%s.lock", path);

    /* O_EXCL makes this atomic - only one process can create the file */
    int fd = open(lock_path, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0) {
        if (errno == EEXIST)
            return -1;   /* Lock held by another process */
        return -2;       /* Other error */
    }

    /* Write our PID for debugging */
    dprintf(fd, "%d", getpid());
    close(fd);
    return 0;            /* Lock acquired */
}

void release_lock(const char *path)
{
    char lock_path[PATH_MAX];
    snprintf(lock_path, sizeof(lock_path), "%s.lock", path);
    unlink(lock_path);
}
```

Lock files have failure modes: (1) stale locks if a process crashes without cleanup, (2) race conditions if not using O_EXCL, (3) not safe on all NFS versions. Consider flock()/fcntl() locking for single-machine scenarios, or a proper distributed lock service (etcd, ZooKeeper) for multi-machine coordination.
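For single-machine coordination, here is a minimal flock() sketch (the filename shared.db is illustrative): the kernel drops the lock automatically when the process exits, which avoids the stale-lock problem entirely.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
    int fd = open("shared.db", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Blocks until the exclusive lock is granted */
    if (flock(fd, LOCK_EX) != 0) {
        perror("flock");
        return 1;
    }

    /* ... critical section: mutate the file safely ... */

    flock(fd, LOCK_UN);   /* or simply close(fd) */
    close(fd);
    return 0;
}
```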
We've explored the multifaceted world of cache consistency—the mechanisms that ensure cached data accurately reflects its source and remains coherent across processes, machines, and failures. Let's consolidate the key insights:

- On a single machine, the unified page cache (indexed by inode and offset) makes writes immediately visible to all processes; apparent inconsistencies usually come from user-space buffering.
- mmap and read()/write() operate on the same page cache pages, but durability requires msync() or fsync().
- Cache invalidation occurs on deletion, truncation, direct I/O, device removal, NFS cache timeouts, and explicit requests such as posix_fadvise().
- Consistency semantics differ: POSIX (immediate visibility), session (visible at close), close-to-open (NFS), and eventual consistency. Know which one you are getting.
- NFS clients cache data, attributes, and name lookups; close-to-open consistency and NFSv4 delegations bound staleness but do not eliminate it.
- Crash consistency comes from write ordering: synchronous writes, journaling, copy-on-write, or log structuring, with application-level patterns (write-replace, WAL, locking) layered on top.
What's Next:
With cache consistency understood, we'll complete this module by examining sync operations—the system calls and mechanisms that force data from the buffer cache to stable storage, providing the durability guarantees that applications depend on.
You now understand the layered consistency challenges in cached file systems: single-system coherence, memory mapping, cache invalidation, network file systems, crash recovery, and application patterns. This knowledge is essential for building reliable distributed systems and debugging subtle data corruption issues.