Why is reading a file often instantaneous after the first access? How can a system with 32GB of RAM effectively have "infinite" speed for its most-accessed data? The answer lies in the page cache—Linux's primary mechanism for caching file system data in memory.
The page cache is not a separate memory pool; it's a dynamic, opportunistic consumer of otherwise-unused RAM. It grows to fill available memory, shrinks under pressure, and decides which data to keep based on access patterns. Understanding the page cache is essential for anyone who needs predictable I/O performance, whether you're running databases, web servers, or data processing pipelines.
By the end of this page, you will understand the complete architecture of the Linux page cache: how pages are indexed and looked up, the address_space abstraction that connects files to cached pages, read-ahead algorithms, writeback policies, memory pressure handling, and practical tuning for different workloads. You'll gain the knowledge to diagnose cache behavior and optimize systems for I/O-intensive applications.
The page cache stores file data in memory at page granularity (typically 4KB, though huge pages exist). Every read from a file first checks the page cache; only on a cache miss does actual disk I/O occur.
Using page-sized chunks aligns the cache with the hardware page size the MMU works in, so the same cached pages can back read()/write(), mmap(), and memory reclaim without any translation between granularities.
The trade-off is that even reading one byte loads an entire page. For random access to small records, this causes read amplification: fetching a 100-byte record still pulls a full 4KB page into memory, a 40x amplification.
Cached pages exist in various states:
| State | Description | Transitions |
|---|---|---|
| Clean | Content matches disk | After read or writeback |
| Dirty | Modified, needs writeback | After write, before sync |
| Writeback | Currently being written to disk | During writeback I/O |
| Uptodate | Contains valid data | After successful read |
| Locked | Under exclusive operation | During I/O or manipulation |
```c
/* Key page flags related to page cache */
enum pageflags {
	PG_locked,        /* Page is locked (I/O in progress) */
	PG_referenced,    /* Page was recently accessed */
	PG_uptodate,      /* Page contains valid data */
	PG_dirty,         /* Page modified, needs writeback */
	PG_lru,           /* Page is on an LRU list */
	PG_active,        /* Page is on active LRU list */
	PG_workingset,    /* Recently evicted and refaulted */
	PG_waiters,       /* Waiters for lock/writeback */
	PG_error,         /* I/O error occurred */
	PG_slab,          /* Used by slab allocator */
	PG_owner_priv_1,  /* Owner-specific (e.g., buffer_head attached) */
	PG_private,       /* Has private data (buffers) */
	PG_private_2,     /* Additional private flag */
	PG_writeback,     /* Currently being written out */
	PG_head,          /* Head of compound page */
	PG_mappedtodisk,  /* Has blocks allocated on disk */
	PG_reclaim,       /* About to be reclaimed */
	PG_swapbacked,    /* Backed by swap (anon pages) */
	PG_unevictable,   /* Cannot be evicted (mlocked, etc.) */
	/* ... */
};

/* Macros to test/set/clear flags */
#define PageDirty(page)      test_bit(PG_dirty, &(page)->flags)
#define SetPageDirty(page)   set_bit(PG_dirty, &(page)->flags)
#define ClearPageDirty(page) clear_bit(PG_dirty, &(page)->flags)
/* ... similar for other flags */
```

The struct address_space is the central data structure connecting files to their cached pages. Every inode has an associated address_space that manages its page cache entries.
```c
struct address_space {
	struct inode            *host;            /* Owner inode */
	struct xarray           i_pages;          /* Cached pages (radix tree) */
	struct rw_semaphore     invalidate_lock;  /* For page invalidation */
	gfp_t                   gfp_mask;         /* Memory allocation flags */
	atomic_t                i_mmap_writable;  /* Writable mmap count */
	struct rb_root_cached   i_mmap;           /* Tree of VMAs mapping this */
	struct rw_semaphore     i_mmap_rwsem;     /* Protects i_mmap tree */
	unsigned long           nrpages;          /* Number of cached pages */
	pgoff_t                 writeback_index;  /* Writeback cursor */
	const struct address_space_operations *a_ops; /* Operations */
	unsigned long           flags;            /* Error and state flags */
	errseq_t                wb_err;           /* Writeback errors */
	spinlock_t              private_lock;     /* For private_list */
	struct list_head        private_list;     /* Associated buffer_heads */
	void                    *private_data;    /* File system private data */
};
```

Pages are indexed by their page index (pgoff_t)—the file offset divided by page size. The i_pages field is an XArray (formerly a radix tree), providing O(log n) lookup from file offset to page.
For a 1GB file with 4KB pages, there are 262,144 possible page indices. The XArray efficiently handles sparse files where only a subset of pages are cached.
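The offset-to-index mapping is simple shift-and-mask arithmetic; here is a small worked example for a 4KB page size (a standalone sketch, not kernel code):

```c
#include <stdio.h>

#define PAGE_SHIFT 12                 /* 4KB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

int main(void)
{
    unsigned long pos = 1000000;      /* byte offset into the file */

    unsigned long index  = pos >> PAGE_SHIFT;     /* page index: 244 */
    unsigned long offset = pos & (PAGE_SIZE - 1); /* offset in page: 576 */

    printf("offset %lu -> page %lu, byte %lu within that page\n",
           pos, index, offset);
    return 0;
}
```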
```c
/* Find a page for a file offset */
struct page *find_get_page(struct address_space *mapping, pgoff_t index)
{
	return pagecache_get_page(mapping, index, 0, 0);
}

/* Find page, or allocate if not present */
struct page *find_or_create_page(struct address_space *mapping,
				 pgoff_t index, gfp_t gfp)
{
	return pagecache_get_page(mapping, index,
				  FGP_LOCK | FGP_CREAT | FGP_ACCESSED, gfp);
}

/* Core lookup function (simplified) */
struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
				fgf_t fgp_flags, gfp_t gfp)
{
	struct page *page;
	int err;

repeat:
	/* Look up in XArray */
	page = find_get_entry(mapping, index);
	if (!page) {
		if (!(fgp_flags & FGP_CREAT))
			return NULL;

		/* Allocate new page */
		page = __page_cache_alloc(gfp);
		if (!page)
			return NULL;

		/* Add to XArray */
		err = add_to_page_cache_lru(page, mapping, index, gfp);
		if (err) {
			put_page(page);
			goto repeat; /* Race, retry */
		}
	}

	if (fgp_flags & FGP_LOCK)
		lock_page(page);

	return page;
}
```

Modern Linux is transitioning from 'struct page' to 'struct folio' for page cache operations. A folio represents one or more contiguous pages (for huge pages). New code uses folio APIs (filemap_get_folio, filemap_add_folio, etc.), but the underlying concepts remain the same.
When an application reads from a file, the kernel follows a sophisticated path that balances cache hits, disk I/O, and speculative prefetching.
```c
/* Simplified sketch of the buffered read path */
ssize_t generic_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	struct file *file = iocb->ki_filp;
	struct address_space *mapping = file->f_mapping;
	loff_t pos = iocb->ki_pos;
	loff_t start_pos = pos;
	size_t count = iov_iter_count(to);
	pgoff_t last_index = (pos + count + PAGE_SIZE - 1) >> PAGE_SHIFT;
	struct page *page;
	ssize_t ret;

	/* For each page in the requested range */
	while (count > 0) {
		pgoff_t index = pos >> PAGE_SHIFT;
		unsigned long offset = pos & ~PAGE_MASK;
		size_t bytes = min(PAGE_SIZE - offset, count);

		/* Trigger read-ahead if appropriate */
		page_cache_sync_readahead(mapping, &file->f_ra, file,
					  index, last_index - index);

		/* Find page in cache */
		page = find_get_page(mapping, index);
		if (!page) {
			/* Cache miss: read page from disk */
			page = read_cache_page(mapping, index,
					       mapping->a_ops->readpage, file);
			if (IS_ERR(page))
				return PTR_ERR(page);
		}

		/* Wait for page to be uptodate */
		wait_on_page_locked_killable(page);
		if (!PageUptodate(page)) {
			put_page(page);
			return -EIO;
		}

		/* Copy data to user buffer */
		ret = copy_page_to_iter(page, offset, bytes, to);
		put_page(page);

		pos += ret;
		count -= ret;
	}

	return pos - start_pos;
}
```

Read-ahead predicts sequential access patterns and prefetches pages before they're requested. This hides disk latency for streaming workloads.
The read-ahead state is tracked per file handle in struct file_ra_state:
```c
struct file_ra_state {
	pgoff_t      start;       /* Start of current read-ahead window */
	unsigned int size;        /* Window size in pages */
	unsigned int async_size;  /* Trigger async readahead when this many pages are left */
	unsigned int ra_pages;    /* Maximum readahead in pages */
	unsigned int mmap_miss;   /* Cache misses in mmap accesses */
	loff_t       prev_pos;    /* Previous read position */
};

/* Default maximum read-ahead (can be tuned per-device) */
/* Typically 128KB-256KB, up to a few MB */
```

The kernel uses an adaptive algorithm:
1. Initial read-ahead: small window (2-4 pages) to detect the access pattern
2. Sequential detection: if reads progress sequentially, enlarge the window
3. Window growth: double the window size (up to the ra_pages maximum) on confirmed sequential access (see the sketch after this list)
4. Async trigger: when reading into the last async_size pages of the current window, trigger the next window asynchronously
5. Random access: if the pattern is random, disable read-ahead to avoid wasted I/O
6. Interleaved streams: track multiple read positions to support concurrent sequential readers
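To make the growth policy concrete, here is a toy model of the window-doubling step—illustrative only (struct ra_window and ra_grow are invented names; the kernel's real logic lives in mm/readahead.c):

```c
/* Toy model of read-ahead window growth (not kernel code) */
struct ra_window {
    unsigned long start;      /* first page of the current window */
    unsigned int  size;       /* window size in pages */
    unsigned int  async_size; /* remaining pages that trigger the next window */
    unsigned int  ra_pages;   /* per-device maximum, e.g. 32 pages = 128KB */
};

/* Called when a sequential read reaches page `index` */
static void ra_grow(struct ra_window *ra, unsigned long index)
{
    if (index >= ra->start + ra->size - ra->async_size) {
        /* Reader entered the async trigger zone: open the next window */
        ra->start += ra->size;
        if (ra->size < ra->ra_pages)
            ra->size *= 2;               /* double on confirmed sequential access */
        if (ra->size > ra->ra_pages)
            ra->size = ra->ra_pages;     /* cap at the per-device maximum */
        ra->async_size = ra->size;       /* fully asynchronous from now on */
        /* ...submit readahead I/O for [ra->start, ra->start + ra->size)... */
    }
}
```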
Read-ahead can be tuned per-device: 'blockdev --setra 4096 /dev/sda' sets read-ahead to 4096 sectors (2MB). For streaming workloads (video, backups), larger values help. For random I/O (databases using O_DIRECT), read-ahead is wasted. The default 256 sectors (128KB) is a compromise.
When applications write to files, data typically goes to the page cache first—not directly to disk. This is write-back caching: data is written immediately to memory (cache), and later flushed to disk.
```c
/* Simplified sketch of the buffered write path */
ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	struct file *file = iocb->ki_filp;
	struct address_space *mapping = file->f_mapping;
	const struct address_space_operations *a_ops = mapping->a_ops;
	loff_t pos = iocb->ki_pos;
	loff_t start_pos = pos;
	size_t count = iov_iter_count(from);

	/* For each page in the requested range */
	while (count > 0) {
		pgoff_t index = pos >> PAGE_SHIFT;
		unsigned long offset = pos & ~PAGE_MASK;
		size_t len = min(PAGE_SIZE - offset, count);
		struct page *page;
		size_t copied;
		void *fsdata;
		int status;

		/* Prepare the page for writing (gets/creates it in the cache) */
		status = a_ops->write_begin(file, mapping, pos, len,
					    &page, &fsdata);
		if (status)
			return status;

		/* Copy data from user buffer */
		copied = copy_from_iter(page_address(page) + offset, len, from);

		/* Finalize the write */
		status = a_ops->write_end(file, mapping, pos, len,
					  copied, page, fsdata);

		/* Mark page dirty */
		set_page_dirty(page);
		put_page(page);

		pos += copied;
		count -= copied;
	}

	return pos - start_pos;
}
```

When a page is modified, it's marked dirty. The kernel tracks dirty pages in multiple ways:
- The PG_dirty page flag marks the page itself as needing writeback
- The address_space's XArray tags dirty entries (PAGECACHE_TAG_DIRTY), so writeback can find dirty pages without scanning the whole file
- The owning inode is placed on its backing device's dirty list (b_dirty in bdi_writeback), so flusher threads know which files to process

Dirty pages accumulate until writeback is triggered.
Dirty pages exist only in memory. If the system crashes before writeback, that data is lost! This is why databases use fsync(), why file systems have journaling, and why the dirty page ratio is carefully controlled. Always call fsync() on critical data.
Writeback occurs when:

- Explicit sync: fsync(), sync(), or sync_file_range() is called (see the example after this list)
- Dirty ratio exceeded: total dirty pages exceed dirty_ratio (default: 20% of RAM)
- Dirty timeout: pages have been dirty longer than dirty_expire_centisecs (default: 30 seconds)
- Memory pressure: the system needs to reclaim memory
- Background writeback: dirty pages exceed dirty_background_ratio (default: 10%), waking the background flusher threads
- Unmount: all dirty pages are written before a filesystem is unmounted
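For the explicit-sync case, here is a minimal user-space sketch (the filename journal.log is illustrative):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char buf[] = "critical record\n";
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return 1;

    /* The write lands in the page cache; the page is merely marked dirty */
    if (write(fd, buf, strlen(buf)) < 0)
        return 1;

    /* Force writeback now; returns only after data and metadata
     * have reached the device */
    if (fsync(fd) < 0)
        return 1;

    /* Alternative (Linux-specific, needs _GNU_SOURCE): flush only a
     * byte range, without metadata guarantees:
     * sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE); */

    close(fd);
    return 0;
}
```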
| Parameter | Default | Description |
|---|---|---|
| dirty_background_ratio | 10 | % of RAM at which background writeback starts |
| dirty_ratio | 20 | % of RAM at which processes doing writes are throttled |
| dirty_expire_centisecs | 3000 | Age at which dirty data is written (hundredths of a second) |
| dirty_writeback_centisecs | 500 | Interval between flusher thread wakeups |
| dirty_background_bytes | 0 | Absolute threshold (overrides the ratio if set) |
| dirty_bytes | 0 | Absolute throttling threshold (overrides dirty_ratio if set) |
Background writeback is performed by flusher threads—kernel threads dedicated to writing dirty pages to disk. The writeback subsystem balances multiple concerns: device throughput, fairness across devices, and responsiveness.
Each block device has a backing_dev_info structure that tracks its writeback state:
```c
struct backing_dev_info {
	u64 id;                            /* BDI identifier */
	struct rb_node rb_node;            /* In bdi_tree */
	struct list_head bdi_list;         /* Global BDI list */
	unsigned long ra_pages;            /* Max readahead */
	unsigned long io_pages;            /* Max I/O size */
	struct kref refcnt;                /* Reference count */
	unsigned int capabilities;         /* Device capabilities */
	unsigned int min_ratio;            /* Minimum dirty ratio */
	unsigned int max_ratio;            /* Maximum dirty ratio */
	unsigned int max_prop_frac;        /* Max proportion */
	atomic_long_t tot_write_bandwidth; /* Write speed estimate */
	struct bdi_writeback wb;           /* Default writeback state */
	struct list_head wb_list;          /* All writeback instances */
	wait_queue_head_t wb_waitq;        /* For sync waiters */
	struct device *dev;                /* Associated device */
	char dev_name[64];
	/* ... */
};

struct bdi_writeback {
	struct backing_dev_info *bdi;      /* Parent BDI */
	unsigned long state;               /* WB_* flags */
	unsigned long last_old_flush;      /* Last flush time */
	struct list_head b_dirty;          /* Dirty inodes */
	struct list_head b_io;             /* For I/O */
	struct list_head b_more_io;        /* More I/O pending */
	struct list_head b_dirty_time;     /* Dirty for time only */
	spinlock_t list_lock;
	atomic_t writeback_inodes;
	struct percpu_counter stat[NR_WB_STAT_ITEMS];
	unsigned long bw_time_stamp;       /* Last bandwidth update */
	unsigned long dirtied_stamp;
	unsigned long written_stamp;       /* Pages written */
	unsigned long write_bandwidth;     /* Estimated throughput */
	struct delayed_work dwork;         /* Writeback work */
	struct delayed_work bw_dwork;      /* Bandwidth estimation */
	struct list_head work_list;
	/* ... */
};
```

The writeback system estimates each device's write bandwidth so it can throttle processes that dirty pages at a rate the device can actually sustain, and divide the global dirty limits proportionally among devices.
The estimation uses a sliding window of recent write completions, adapting dynamically to device performance changes (e.g., SSD vs HDD, device congestion).
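As a rough picture of how such an estimate behaves, here is a toy sliding-window estimator—illustrative only, not the kernel's actual algorithm (which lives in mm/page-writeback.c); the 200ms window and 7/8 smoothing factor are invented for the example:

```c
#include <stdint.h>

/* Toy write-bandwidth estimator (not kernel code) */
struct bw_estimator {
    uint64_t window_bytes;    /* bytes completed in the current window */
    uint64_t window_start_ms; /* when the window opened */
    uint64_t bandwidth_bps;   /* smoothed estimate, bytes/second */
};

/* Call on each write completion */
static void bw_update(struct bw_estimator *bw, uint64_t bytes, uint64_t now_ms)
{
    bw->window_bytes += bytes;

    /* Close the window every 200ms and fold the sample into the estimate */
    if (now_ms - bw->window_start_ms >= 200) {
        uint64_t elapsed = now_ms - bw->window_start_ms;
        uint64_t sample = bw->window_bytes * 1000 / elapsed;

        /* Exponential smoothing: 7/8 old, 1/8 new, so the estimate
         * adapts to the device without jittering on every sample */
        bw->bandwidth_bps = (bw->bandwidth_bps * 7 + sample) / 8;
        bw->window_bytes = 0;
        bw->window_start_ms = now_ms;
    }
}
```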
Check current dirty page counts with 'grep -i dirty /proc/meminfo'. The 'iostat' tool (or 'cat /sys/block/sda/stat') shows per-device I/O statistics. Per-BDI readahead is visible at /sys/class/bdi/&lt;major:minor&gt;/read_ahead_kb, and the kernel's write-bandwidth estimate appears as BdiWriteBandwidth in /sys/kernel/debug/bdi/&lt;major:minor&gt;/stats (requires debugfs).
The page cache isn't fixed-size—it grows to fill available memory. When memory runs low, the kernel must reclaim pages to satisfy new allocations. The page cache is typically the first target for reclaim.
The kernel maintains LRU (Least Recently Used) lists to track page activity. Pages move between lists based on access patterns:
| List | Description | Reclaim Priority |
|---|---|---|
| Inactive Anonymous | Unused anonymous memory (stack, heap) | Medium-high |
| Active Anonymous | Recently used anonymous memory | Low |
| Inactive File | Unused page cache pages | High (easy to reclaim) |
| Active File | Recently used page cache | Medium |
| Unevictable | Locked or pinned pages (mlock) | Never reclaimed |
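The movement between the inactive and active lists follows a second-chance pattern: a page must be referenced twice before it is promoted. The sketch below is a simplified model of that promotion logic (struct lru_page and page_accessed are illustrative names; the real code, in mm/swap.c and mm/vmscan.c, is considerably more involved):

```c
#include <stdbool.h>

/* Simplified model of LRU promotion (not kernel code) */
struct lru_page {
    bool referenced;  /* mirrors PG_referenced */
    bool active;      /* mirrors PG_active: which list the page is on */
};

/* Called on each access (cf. mark_page_accessed in the kernel) */
static void page_accessed(struct lru_page *page)
{
    if (!page->referenced) {
        /* First access: just remember it happened */
        page->referenced = true;
    } else if (!page->active) {
        /* Second access while on the inactive list: promote */
        page->active = true;       /* move to the active LRU */
        page->referenced = false;  /* restart the reference window */
    }
}

/* During reclaim, active pages that have not been referenced again
 * are demoted back to the inactive list before eviction is considered. */
```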
When memory is needed, the kernel's page reclaim (via kswapd or direct reclaim) scans the inactive lists first: clean file pages are dropped immediately (their contents can be re-read from disk), dirty file pages are written back before being freed, and anonymous pages are swapped out if swap is available.
```sh
# Memory statistics
$ cat /proc/meminfo
MemTotal:       32780848 kB
MemFree:          234012 kB
MemAvailable:   28451248 kB   # This is what you usually care about
Buffers:          516428 kB
Cached:         26942704 kB   # Page cache!
SwapCached:            0 kB
Active:         14580976 kB
Inactive:       15984420 kB
Active(anon):    1872124 kB
Inactive(anon):    23256 kB
Active(file):   12708852 kB   # Active page cache
Inactive(file): 15961164 kB   # Inactive page cache
Dirty:            102848 kB   # Needs writeback
Writeback:             0 kB   # Currently writing
AnonPages:       1892876 kB
Mapped:           834820 kB   # Memory-mapped files
Shmem:              4584 kB
Slab:            1378476 kB

# Drop caches (for testing, not production!)
$ echo 1 > /proc/sys/vm/drop_caches   # Page cache only
$ echo 2 > /proc/sys/vm/drop_caches   # Slab objects
$ echo 3 > /proc/sys/vm/drop_caches   # Both
```

The vm.swappiness parameter (0-200, default 60) influences the balance between reclaiming file pages and anonymous pages. Lower values prefer keeping anonymous memory (less swapping); higher values treat file and anonymous pages more equally. On systems without swap, the setting is moot: anonymous pages cannot be swapped out at all.
Memory mapping (mmap()) creates a direct connection between a process's virtual address space and the page cache. This enables file access through pointer dereferences rather than read/write system calls.
Private mapping (MAP_PRIVATE): pages are read from the page cache, but the first write to a page triggers copy-on-write—the process gets a private anonymous copy, and changes are never written back to the file.

Shared mapping (MAP_SHARED): the process's page tables point directly at page cache pages, so modifications are visible to every process mapping the file and are flushed to disk by normal writeback (or msync()).
```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Read-only mapping of a file */
int fd = open("data.bin", O_RDONLY);
size_t size = lseek(fd, 0, SEEK_END);
void *data = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);

/* Now access data directly */
int value = ((int *)data)[1000];  /* No syscall, reads from page cache */

/* Shared writable mapping */
int fd2 = open("shared.bin", O_RDWR);
void *shared = mmap(NULL, size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd2, 0);

/* Modifications persist to the file */
((int *)shared)[1000] = 42;

/* msync to ensure data reaches disk */
msync(shared, size, MS_SYNC);

/* Unmap when done */
munmap(data, size);
munmap(shared, size);
```

When a process touches a mapped address whose page isn't yet present, the CPU raises a page fault. The kernel's fault handler looks up the page in the file's page cache (reading it from disk on a miss), installs a page table entry pointing at the cached page, and resumes the faulting instruction—all transparently to the program.
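You can observe how much of a mapping is resident in the page cache with mincore(). A self-contained sketch (data.bin is an illustrative filename):

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0)
        return 1;

    size_t size = lseek(fd, 0, SEEK_END);
    void *map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        return 1;

    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (size + page - 1) / page;
    unsigned char *vec = malloc(npages);

    /* The low bit of vec[i] reports whether page i of the mapping
     * is currently resident in memory */
    if (mincore(map, size, vec) == 0) {
        size_t resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;
        printf("%zu of %zu pages resident in page cache\n",
               resident, npages);
    }

    free(vec);
    munmap(map, size);
    close(fd);
    return 0;
}
```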
Databases take different approaches to the page cache. PostgreSQL uses buffered read/write and deliberately relies on the kernel page cache alongside its own shared buffers; others (e.g., Oracle, or MySQL's InnoDB configured with O_DIRECT) bypass it and manage their own cache. Relying on the page cache gives transparent integration, but careful tuning is needed: for very large datasets, the kernel's LRU may evict pages the database would have kept, hurting performance.
Page cache behavior significantly impacts application performance. Here are strategies for different workload types:
Problem: Large sequential reads pollute the cache, evicting frequently-used data.
Solutions:
```c
/* Use posix_fadvise to hint about usage patterns */
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL); /* enable aggressive readahead */
posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);    /* data won't be reused; don't cache */

/* Or use O_DIRECT for critical streaming paths */
fd = open("large.iso", O_RDONLY | O_DIRECT);
```
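A common refinement, sketched below under the assumption that the data is consumed exactly once, is to drop pages behind the read position with POSIX_FADV_DONTNEED so a multi-gigabyte stream never displaces hot data (stream_file is an illustrative helper):

```c
#include <fcntl.h>
#include <unistd.h>

/* Stream a file once without polluting the page cache (sketch) */
static void stream_file(int fd)
{
    static char buf[1 << 20];   /* 1MB read chunks */
    off_t done = 0;
    ssize_t n;

    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* ...process buf... */
        done += n;

        /* Drop fully-consumed pages behind the read position; the
         * kernel frees clean cached pages in that range immediately */
        posix_fadvise(fd, 0, done, POSIX_FADV_DONTNEED);
    }
}
```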
Problem: Database has its own cache; double-buffering wastes memory.
Solutions:
```c
/* Use O_DIRECT to bypass the page cache */
fd = open("data.db", O_RDWR | O_DIRECT | O_SYNC);

/* Reduce cache pollution from scans */
posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
```

```sh
# If not using O_DIRECT, tune dirty parameters
sysctl vm.dirty_ratio=5
sysctl vm.dirty_background_ratio=2
```
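One caveat worth illustrating: O_DIRECT requires the buffer, file offset, and transfer size to be aligned to the device's logical block size. A minimal read using posix_memalign (data.db and the 4096-byte alignment are illustrative):

```c
#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.db", O_RDONLY | O_DIRECT);
    if (fd < 0)
        return 1;

    /* O_DIRECT I/O must use an aligned buffer; 4096 covers most devices */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;

    /* Offset and length must also be block-aligned */
    ssize_t n = pread(fd, buf, 4096, 0);
    /* n < 0 with errno == EINVAL usually means an alignment violation */

    free(buf);
    close(fd);
    return n < 0 ? 1 : 0;
}
```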
Problem: Hot content should stay cached; cold content shouldn't evict it.
Solutions:
```sh
# Increase inactive/active ratio
# (done automatically by kernel heuristics)

# Consider preloading hot content
cat /var/www/hot/* > /dev/null

# Tune readahead for expected file sizes
blockdev --setra 512 /dev/sda   # 256KB
```
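Preloading can also be done programmatically with the Linux-specific readahead(2) system call, which populates the page cache without copying data into user space. A sketch (preload is an illustrative helper):

```c
#define _GNU_SOURCE            /* for readahead(2) */
#include <fcntl.h>
#include <unistd.h>

/* Warm the page cache for one file without reading it into user space */
static int preload(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    off_t size = lseek(fd, 0, SEEK_END);

    /* Populate the page cache with the file's contents */
    readahead(fd, 0, size);

    close(fd);
    return 0;
}
```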
| Workload | dirty_ratio | dirty_background_ratio | read_ahead_kb | Notes |
|---|---|---|---|---|
| General purpose | 20 (default) | 10 (default) | 128 | Balanced defaults |
| Database (OLTP) | 5-10 | 2-5 | 32-64 | Keep dirty data low, less readahead |
| Backup/streaming | 40-60 | 20-30 | 1024-2048 | Buffer more, larger readahead |
| Video serving | 20 | 10 | 2048-4096 | Large readahead for sequential |
| Low memory system | 10 | 5 | 64 | Conservative to avoid OOM |
Always test tuning changes under realistic load before production deployment. Monitor with 'sar -B' for page statistics, 'vmstat' for memory pressure, and 'iostat' for I/O patterns. Changes that help one workload may hurt others.
We've explored the Linux page cache in depth. Let's consolidate the key concepts:

- File data is cached at page granularity; every read checks the cache before touching disk
- struct address_space ties each inode to its cached pages, indexed by page offset in an XArray
- Adaptive read-ahead detects sequential access and prefetches ahead of the reader, hiding disk latency
- Writes go to the cache first; dirty pages are flushed by flusher threads according to the dirty_* thresholds, or explicitly via fsync()
- Under memory pressure, LRU-based reclaim evicts page cache pages, preferring inactive file pages
- mmap() maps page cache pages directly into process address spaces, with faults filled from the cache
- Workload-specific tuning (posix_fadvise, O_DIRECT, dirty ratios, readahead) trades memory for I/O performance
What's next:
With the page cache understood, we complete our deep dive into Linux file systems with File Operations. The next page examines the actual implementation of VFS operations—how open, read, write, close, and other system calls flow through the kernel from user space to disk.
You now understand the Linux page cache—the critical layer that transforms slow disk I/O into fast memory access. This knowledge enables you to diagnose cache behavior, tune systems for specific workloads, and understand the tradeoffs between memory consumption and I/O performance.