Why is reading a file often instantaneous after the first access? How can a system with 32GB of RAM effectively have "infinite" speed for its most-accessed data? The answer lies in the page cache—Linux's primary mechanism for caching file system data in memory.
The page cache is not a separate memory pool; it's a dynamic, opportunistic consumer of otherwise-unused RAM. It grows to fill available memory, shrinks under pressure, and decides which data to keep based on access patterns. Understanding the page cache is essential for anyone who needs predictable I/O performance, whether you're running databases, web servers, or data processing pipelines.
By the end of this page, you will understand the complete architecture of the Linux page cache: how pages are indexed and looked up, the address_space abstraction that connects files to cached pages, read-ahead algorithms, writeback policies, memory pressure handling, and practical tuning for different workloads. You'll gain the knowledge to diagnose cache behavior and optimize systems for I/O-intensive applications.
The page cache stores file data in memory at page granularity (typically 4KB, though huge pages exist). Every read from a file first checks the page cache; only on a cache miss does actual disk I/O occur.
Using page-sized chunks aligns the cache with the hardware page size the MMU works in, so the same cached pages can back read()/write(), mmap(), and memory reclaim without any translation between granularities.
The trade-off is that even reading one byte loads an entire page. For random access to small records, this causes read amplification: fetching a 100-byte record still pulls a full 4KB page into memory, a 40x amplification.
Cached pages exist in various states:
| State | Description | Transitions |
|---|---|---|
| Clean | Content matches disk | After read or writeback |
| Dirty | Modified, needs writeback | After write, before sync |
| Writeback | Currently being written to disk | During writeback I/O |
| Uptodate | Contains valid data | After successful read |
| Locked | Under exclusive operation | During I/O or manipulation |
```c
/* Key page flags related to page cache */
enum pageflags {
	PG_locked,        /* Page is locked (I/O in progress) */
	PG_referenced,    /* Page was recently accessed */
	PG_uptodate,      /* Page contains valid data */
	PG_dirty,         /* Page modified, needs writeback */
	PG_lru,           /* Page is on an LRU list */
	PG_active,        /* Page is on active LRU list */
	PG_workingset,    /* Recently evicted and refaulted */
	PG_waiters,       /* Waiters for lock/writeback */
	PG_error,         /* I/O error occurred */
	PG_slab,          /* Used by slab allocator */
	PG_owner_priv_1,  /* Owner-specific (e.g., buffer_head attached) */
	PG_private,       /* Has private data (buffers) */
	PG_private_2,     /* Additional private flag */
	PG_writeback,     /* Currently being written out */
	PG_head,          /* Head of compound page */
	PG_mappedtodisk,  /* Has blocks allocated on disk */
	PG_reclaim,       /* About to be reclaimed */
	PG_swapbacked,    /* Backed by swap (anon pages) */
	PG_unevictable,   /* Cannot be evicted (mlocked, etc.) */
	/* ... */
};

/* Macros to test/set/clear flags */
#define PageDirty(page)      test_bit(PG_dirty, &(page)->flags)
#define SetPageDirty(page)   set_bit(PG_dirty, &(page)->flags)
#define ClearPageDirty(page) clear_bit(PG_dirty, &(page)->flags)
/* ... similar for other flags */
```

The struct address_space is the central data structure connecting files to their cached pages. Every inode has an associated address_space that manages its page cache entries.
```c
struct address_space {
	struct inode            *host;            /* Owner inode */
	struct xarray           i_pages;          /* Cached pages (radix tree) */
	struct rw_semaphore     invalidate_lock;  /* For page invalidation */
	gfp_t                   gfp_mask;         /* Memory allocation flags */
	atomic_t                i_mmap_writable;  /* Writable mmap count */
	struct rb_root_cached   i_mmap;           /* Tree of VMAs mapping this */
	struct rw_semaphore     i_mmap_rwsem;     /* Protects i_mmap tree */
	unsigned long           nrpages;          /* Number of cached pages */
	pgoff_t                 writeback_index;  /* Writeback cursor */
	const struct address_space_operations *a_ops; /* Operations */
	unsigned long           flags;            /* Error and state flags */
	errseq_t                wb_err;           /* Writeback errors */
	spinlock_t              private_lock;     /* For private_list */
	struct list_head        private_list;     /* Associated buffer_heads */
	void                    *private_data;    /* File system private data */
};
```

Pages are indexed by their page index (pgoff_t)—the file offset divided by page size. The i_pages field is an XArray (formerly a radix tree), providing O(log n) lookup from file offset to page.
For a 1GB file with 4KB pages, there are 262,144 possible page indices. The XArray efficiently handles sparse files where only a subset of pages are cached.
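The offset-to-index mapping is simple shift-and-mask arithmetic; here is a small worked example for a 4KB page size (a standalone sketch, not kernel code):

```c
#include <stdio.h>

#define PAGE_SHIFT 12                 /* 4KB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

int main(void)
{
    unsigned long pos = 1000000;      /* byte offset into the file */

    unsigned long index  = pos >> PAGE_SHIFT;     /* page index: 244 */
    unsigned long offset = pos & (PAGE_SIZE - 1); /* offset in page: 576 */

    printf("offset %lu -> page %lu, byte %lu within that page\n",
           pos, index, offset);
    return 0;
}
```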
```c
/* Find a page for a file offset */
struct page *find_get_page(struct address_space *mapping, pgoff_t index)
{
	return pagecache_get_page(mapping, index, 0, 0);
}

/* Find page, or allocate if not present */
struct page *find_or_create_page(struct address_space *mapping,
				 pgoff_t index, gfp_t gfp)
{
	return pagecache_get_page(mapping, index,
				  FGP_LOCK | FGP_CREAT | FGP_ACCESSED, gfp);
}

/* Core lookup function (simplified) */
struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
				fgf_t fgp_flags, gfp_t gfp)
{
	struct page *page;
	int err;

repeat:
	/* Look up in XArray */
	page = find_get_entry(mapping, index);
	if (!page) {
		if (!(fgp_flags & FGP_CREAT))
			return NULL;

		/* Allocate new page */
		page = __page_cache_alloc(gfp);
		if (!page)
			return NULL;

		/* Add to XArray */
		err = add_to_page_cache_lru(page, mapping, index, gfp);
		if (err) {
			put_page(page);
			goto repeat; /* Race, retry */
		}
	}

	if (fgp_flags & FGP_LOCK)
		lock_page(page);

	return page;
}
```

Modern Linux is transitioning from 'struct page' to 'struct folio' for page cache operations. A folio represents one or more contiguous pages (for huge pages). New code uses folio APIs (filemap_get_folio, filemap_add_folio, etc.), but the underlying concepts remain the same.
When an application reads from a file, the kernel follows a sophisticated path that balances cache hits, disk I/O, and speculative prefetching.
```c
/* Simplified sketch of the buffered read path */
ssize_t generic_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	struct file *file = iocb->ki_filp;
	struct address_space *mapping = file->f_mapping;
	loff_t pos = iocb->ki_pos;
	loff_t start_pos = pos;
	size_t count = iov_iter_count(to);
	pgoff_t last_index = (pos + count + PAGE_SIZE - 1) >> PAGE_SHIFT;
	struct page *page;
	ssize_t ret;

	/* For each page in the requested range */
	while (count > 0) {
		pgoff_t index = pos >> PAGE_SHIFT;
		unsigned long offset = pos & ~PAGE_MASK;
		size_t bytes = min(PAGE_SIZE - offset, count);

		/* Trigger read-ahead if appropriate */
		page_cache_sync_readahead(mapping, &file->f_ra, file,
					  index, last_index - index);

		/* Find page in cache */
		page = find_get_page(mapping, index);
		if (!page) {
			/* Cache miss: read page from disk */
			page = read_cache_page(mapping, index,
					       mapping->a_ops->readpage, file);
			if (IS_ERR(page))
				return PTR_ERR(page);
		}

		/* Wait for page to be uptodate */
		wait_on_page_locked_killable(page);
		if (!PageUptodate(page)) {
			put_page(page);
			return -EIO;
		}

		/* Copy data to user buffer */
		ret = copy_page_to_iter(page, offset, bytes, to);
		put_page(page);

		pos += ret;
		count -= ret;
	}

	return pos - start_pos;
}
```

Read-ahead predicts sequential access patterns and prefetches pages before they're requested. This hides disk latency for streaming workloads.
The read-ahead state is tracked per file handle in struct file_ra_state:
```c
struct file_ra_state {
	pgoff_t      start;       /* Start of current read-ahead window */
	unsigned int size;        /* Window size in pages */
	unsigned int async_size;  /* Trigger async readahead when this many pages are left */
	unsigned int ra_pages;    /* Maximum readahead in pages */
	unsigned int mmap_miss;   /* Cache misses in mmap accesses */
	loff_t       prev_pos;    /* Previous read position */
};

/* Default maximum read-ahead (can be tuned per-device) */
/* Typically 128KB-256KB, up to a few MB */
```

The kernel uses an adaptive algorithm:
1. Initial read-ahead: small window (2-4 pages) to detect the access pattern
2. Sequential detection: if reads progress sequentially, enlarge the window
3. Window growth: double the window size (up to the ra_pages maximum) on confirmed sequential access (see the sketch after this list)
4. Async trigger: when reading into the last async_size pages of the current window, trigger the next window asynchronously
5. Random access: if the pattern is random, disable read-ahead to avoid wasted I/O
6. Interleaved streams: track multiple read positions to support concurrent sequential readers
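To make the growth policy concrete, here is a toy model of the window-doubling step—illustrative only (struct ra_window and ra_grow are invented names; the kernel's real logic lives in mm/readahead.c):

```c
/* Toy model of read-ahead window growth (not kernel code) */
struct ra_window {
    unsigned long start;      /* first page of the current window */
    unsigned int  size;       /* window size in pages */
    unsigned int  async_size; /* remaining pages that trigger the next window */
    unsigned int  ra_pages;   /* per-device maximum, e.g. 32 pages = 128KB */
};

/* Called when a sequential read reaches page `index` */
static void ra_grow(struct ra_window *ra, unsigned long index)
{
    if (index >= ra->start + ra->size - ra->async_size) {
        /* Reader entered the async trigger zone: open the next window */
        ra->start += ra->size;
        if (ra->size < ra->ra_pages)
            ra->size *= 2;               /* double on confirmed sequential access */
        if (ra->size > ra->ra_pages)
            ra->size = ra->ra_pages;     /* cap at the per-device maximum */
        ra->async_size = ra->size;       /* fully asynchronous from now on */
        /* ...submit readahead I/O for [ra->start, ra->start + ra->size)... */
    }
}
```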
Read-ahead can be tuned per-device: 'blockdev --setra 4096 /dev/sda' sets read-ahead to 4096 sectors (2MB). For streaming workloads (video, backups), larger values help. For random I/O (databases using O_DIRECT), read-ahead is wasted. The default 256 sectors (128KB) is a compromise.
When applications write to files, data typically goes to the page cache first—not directly to disk. This is write-back caching: data is written immediately to memory (cache), and later flushed to disk.
```c
/* Simplified sketch of the buffered write path */
ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	struct file *file = iocb->ki_filp;
	struct address_space *mapping = file->f_mapping;
	const struct address_space_operations *a_ops = mapping->a_ops;
	loff_t pos = iocb->ki_pos;
	loff_t start_pos = pos;
	size_t count = iov_iter_count(from);

	/* For each page in the requested range */
	while (count > 0) {
		pgoff_t index = pos >> PAGE_SHIFT;
		unsigned long offset = pos & ~PAGE_MASK;
		size_t len = min(PAGE_SIZE - offset, count);
		struct page *page;
		size_t copied;
		void *fsdata;
		int status;

		/* Prepare the page for writing (gets/creates it in the cache) */
		status = a_ops->write_begin(file, mapping, pos, len,
					    &page, &fsdata);
		if (status)
			return status;

		/* Copy data from user buffer */
		copied = copy_from_iter(page_address(page) + offset, len, from);

		/* Finalize the write */
		status = a_ops->write_end(file, mapping, pos, len,
					  copied, page, fsdata);

		/* Mark page dirty */
		set_page_dirty(page);
		put_page(page);

		pos += copied;
		count -= copied;
	}

	return pos - start_pos;
}
```

When a page is modified, it's marked dirty. The kernel tracks dirty pages in multiple ways:
- The PG_dirty page flag marks the page itself as needing writeback
- The address_space's XArray tags dirty entries (PAGECACHE_TAG_DIRTY), so writeback can find dirty pages without scanning the whole file
- The owning inode is placed on its backing device's dirty list (b_dirty in bdi_writeback), so flusher threads know which files to process

Dirty pages accumulate until writeback is triggered.
Dirty pages exist only in memory. If the system crashes before writeback, that data is lost! This is why databases use fsync(), why file systems have journaling, and why the dirty page ratio is carefully controlled. Always call fsync() on critical data.
Writeback occurs when:

- Explicit sync: fsync(), sync(), or sync_file_range() is called (see the example after this list)
- Dirty ratio exceeded: total dirty pages exceed dirty_ratio (default: 20% of RAM)
- Dirty timeout: pages have been dirty longer than dirty_expire_centisecs (default: 30 seconds)
- Memory pressure: the system needs to reclaim memory
- Background writeback: dirty pages exceed dirty_background_ratio (default: 10%), waking the background flusher threads
- Unmount: all dirty pages are written before a filesystem is unmounted
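For the explicit-sync case, here is a minimal user-space sketch (the filename journal.log is illustrative):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char buf[] = "critical record\n";
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return 1;

    /* The write lands in the page cache; the page is merely marked dirty */
    if (write(fd, buf, strlen(buf)) < 0)
        return 1;

    /* Force writeback now; returns only after data and metadata
     * have reached the device */
    if (fsync(fd) < 0)
        return 1;

    /* Alternative (Linux-specific, needs _GNU_SOURCE): flush only a
     * byte range, without metadata guarantees:
     * sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE); */

    close(fd);
    return 0;
}
```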
| Parameter | Default | Description |
|---|---|---|
| dirty_background_ratio | 10 | % of RAM at which background writeback starts |
| dirty_ratio | 20 | % of RAM at which processes doing writes are throttled |
| dirty_expire_centisecs | 3000 | Age at which dirty data is written (hundredths of a second) |
| dirty_writeback_centisecs | 500 | Interval between flusher thread wakeups |
| dirty_background_bytes | 0 | Absolute threshold (overrides the ratio if set) |
| dirty_bytes | 0 | Absolute throttling threshold (overrides dirty_ratio if set) |
Background writeback is performed by flusher threads—kernel threads dedicated to writing dirty pages to disk. The writeback subsystem balances multiple concerns: device throughput, fairness across devices, and responsiveness.
Each block device has a backing_dev_info structure that tracks its writeback state:
```c
struct backing_dev_info {
	u64 id;                            /* BDI identifier */
	struct rb_node rb_node;            /* In bdi_tree */
	struct list_head bdi_list;         /* Global BDI list */
	unsigned long ra_pages;            /* Max readahead */
	unsigned long io_pages;            /* Max I/O size */
	struct kref refcnt;                /* Reference count */
	unsigned int capabilities;         /* Device capabilities */
	unsigned int min_ratio;            /* Minimum dirty ratio */
	unsigned int max_ratio;            /* Maximum dirty ratio */
	unsigned int max_prop_frac;        /* Max proportion */
	atomic_long_t tot_write_bandwidth; /* Write speed estimate */
	struct bdi_writeback wb;           /* Default writeback state */
	struct list_head wb_list;          /* All writeback instances */
	wait_queue_head_t wb_waitq;        /* For sync waiters */
	struct device *dev;                /* Associated device */
	char dev_name[64];
	/* ... */
};

struct bdi_writeback {
	struct backing_dev_info *bdi;      /* Parent BDI */
	unsigned long state;               /* WB_* flags */
	unsigned long last_old_flush;      /* Last flush time */
	struct list_head b_dirty;          /* Dirty inodes */
	struct list_head b_io;             /* For I/O */
	struct list_head b_more_io;        /* More I/O pending */
	struct list_head b_dirty_time;     /* Dirty for time only */
	spinlock_t list_lock;
	atomic_t writeback_inodes;
	struct percpu_counter stat[NR_WB_STAT_ITEMS];
	unsigned long bw_time_stamp;       /* Last bandwidth update */
	unsigned long dirtied_stamp;
	unsigned long written_stamp;       /* Pages written */
	unsigned long write_bandwidth;     /* Estimated throughput */
	struct delayed_work dwork;         /* Writeback work */
	struct delayed_work bw_dwork;      /* Bandwidth estimation */
	struct list_head work_list;
	/* ... */
};
```

The writeback system estimates each device's write bandwidth so it can throttle processes that dirty pages at a rate the device can actually sustain, and divide the global dirty limits proportionally among devices.
The estimation uses a sliding window of recent write completions, adapting dynamically to device performance changes (e.g., SSD vs HDD, device congestion).
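As a rough picture of how such an estimate behaves, here is a toy sliding-window estimator—illustrative only, not the kernel's actual algorithm (which lives in mm/page-writeback.c); the 200ms window and 7/8 smoothing factor are invented for the example:

```c
#include <stdint.h>

/* Toy write-bandwidth estimator (not kernel code) */
struct bw_estimator {
    uint64_t window_bytes;    /* bytes completed in the current window */
    uint64_t window_start_ms; /* when the window opened */
    uint64_t bandwidth_bps;   /* smoothed estimate, bytes/second */
};

/* Call on each write completion */
static void bw_update(struct bw_estimator *bw, uint64_t bytes, uint64_t now_ms)
{
    bw->window_bytes += bytes;

    /* Close the window every 200ms and fold the sample into the estimate */
    if (now_ms - bw->window_start_ms >= 200) {
        uint64_t elapsed = now_ms - bw->window_start_ms;
        uint64_t sample = bw->window_bytes * 1000 / elapsed;

        /* Exponential smoothing: 7/8 old, 1/8 new, so the estimate
         * adapts to the device without jittering on every sample */
        bw->bandwidth_bps = (bw->bandwidth_bps * 7 + sample) / 8;
        bw->window_bytes = 0;
        bw->window_start_ms = now_ms;
    }
}
```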
Check current dirty page counts with 'grep -i dirty /proc/meminfo'. The 'iostat' tool (or 'cat /sys/block/sda/stat') shows per-device I/O statistics. Per-BDI readahead is visible at /sys/class/bdi/&lt;major:minor&gt;/read_ahead_kb, and the kernel's write-bandwidth estimate appears as BdiWriteBandwidth in /sys/kernel/debug/bdi/&lt;major:minor&gt;/stats (requires debugfs).
The page cache isn't fixed-size—it grows to fill available memory. When memory runs low, the kernel must reclaim pages to satisfy new allocations. The page cache is typically the first target for reclaim.
The kernel maintains LRU (Least Recently Used) lists to track page activity. Pages move between lists based on access patterns:
| List | Description | Reclaim Priority |
|---|---|---|
| Inactive Anonymous | Unused anonymous memory (stack, heap) | Medium-high |
| Active Anonymous | Recently used anonymous memory | Low |
| Inactive File | Unused page cache pages | High (easy to reclaim) |
| Active File | Recently used page cache | Medium |
| Unevictable | Locked or pinned pages (mlock) | Never reclaimed |
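The movement between the inactive and active lists follows a second-chance pattern: a page must be referenced twice before it is promoted. The sketch below is a simplified model of that promotion logic (struct lru_page and page_accessed are illustrative names; the real code, in mm/swap.c and mm/vmscan.c, is considerably more involved):

```c
#include <stdbool.h>

/* Simplified model of LRU promotion (not kernel code) */
struct lru_page {
    bool referenced;  /* mirrors PG_referenced */
    bool active;      /* mirrors PG_active: which list the page is on */
};

/* Called on each access (cf. mark_page_accessed in the kernel) */
static void page_accessed(struct lru_page *page)
{
    if (!page->referenced) {
        /* First access: just remember it happened */
        page->referenced = true;
    } else if (!page->active) {
        /* Second access while on the inactive list: promote */
        page->active = true;       /* move to the active LRU */
        page->referenced = false;  /* restart the reference window */
    }
}

/* During reclaim, active pages that have not been referenced again
 * are demoted back to the inactive list before eviction is considered. */
```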
When memory is needed, the kernel's page reclaim (via kswapd or direct reclaim) scans the inactive lists first: clean file pages are dropped immediately (their contents can be re-read from disk), dirty file pages are written back before being freed, and anonymous pages are swapped out if swap is available.
```sh
# Memory statistics
$ cat /proc/meminfo
MemTotal:       32780848 kB
MemFree:          234012 kB
MemAvailable:   28451248 kB   # This is what you usually care about
Buffers:          516428 kB
Cached:         26942704 kB   # Page cache!
SwapCached:            0 kB
Active:         14580976 kB
Inactive:       15984420 kB
Active(anon):    1872124 kB
Inactive(anon):    23256 kB
Active(file):   12708852 kB   # Active page cache
Inactive(file): 15961164 kB   # Inactive page cache
Dirty:            102848 kB   # Needs writeback
Writeback:             0 kB   # Currently writing
AnonPages:       1892876 kB
Mapped:           834820 kB   # Memory-mapped files
Shmem:              4584 kB
Slab:            1378476 kB

# Drop caches (for testing, not production!)
$ echo 1 > /proc/sys/vm/drop_caches   # Page cache only
$ echo 2 > /proc/sys/vm/drop_caches   # Slab objects
$ echo 3 > /proc/sys/vm/drop_caches   # Both
```

The vm.swappiness parameter (0-200, default 60) influences the balance between reclaiming file pages and anonymous pages. Lower values prefer keeping anonymous memory (less swapping); higher values treat file and anonymous pages more equally. On systems without swap, the setting is moot: anonymous pages cannot be swapped out at all.
Memory mapping (mmap()) creates a direct connection between a process's virtual address space and the page cache. This enables file access through pointer dereferences rather than read/write system calls.
Private mapping (MAP_PRIVATE): pages are read from the page cache, but the first write to a page triggers copy-on-write—the process gets a private anonymous copy, and changes are never written back to the file.

Shared mapping (MAP_SHARED): the process's page tables point directly at page cache pages, so modifications are visible to every process mapping the file and are flushed to disk by normal writeback (or msync()).
```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Read-only mapping of a file */
int fd = open("data.bin", O_RDONLY);
size_t size = lseek(fd, 0, SEEK_END);
void *data = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);

/* Now access data directly */
int value = ((int *)data)[1000];  /* No syscall, reads from page cache */

/* Shared writable mapping */
int fd2 = open("shared.bin", O_RDWR);
void *shared = mmap(NULL, size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd2, 0);

/* Modifications persist to the file */
((int *)shared)[1000] = 42;

/* msync to ensure data reaches disk */
msync(shared, size, MS_SYNC);

/* Unmap when done */
munmap(data, size);
munmap(shared, size);
```

When a process touches a mapped address whose page isn't yet present, the CPU raises a page fault. The kernel's fault handler looks up the page in the file's page cache (reading it from disk on a miss), installs a page table entry pointing at the cached page, and resumes the faulting instruction—all transparently to the program.
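You can observe how much of a mapping is resident in the page cache with mincore(). A self-contained sketch (data.bin is an illustrative filename):

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0)
        return 1;

    size_t size = lseek(fd, 0, SEEK_END);
    void *map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        return 1;

    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (size + page - 1) / page;
    unsigned char *vec = malloc(npages);

    /* The low bit of vec[i] reports whether page i of the mapping
     * is currently resident in memory */
    if (mincore(map, size, vec) == 0) {
        size_t resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;
        printf("%zu of %zu pages resident in page cache\n",
               resident, npages);
    }

    free(vec);
    munmap(map, size);
    close(fd);
    return 0;
}
```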
Databases take different approaches to the page cache. PostgreSQL uses buffered read/write and deliberately relies on the kernel page cache alongside its own shared buffers; others (e.g., Oracle, or MySQL's InnoDB configured with O_DIRECT) bypass it and manage their own cache. Relying on the page cache gives transparent integration, but careful tuning is needed: for very large datasets, the kernel's LRU may evict pages the database would have kept, hurting performance.
Page cache behavior significantly impacts application performance. Here are strategies for different workload types:
Problem: Large sequential reads pollute the cache, evicting frequently-used data.
Solutions:
```c
/* Use posix_fadvise to hint about usage patterns */
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL); /* enable aggressive readahead */
posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);    /* data won't be reused; don't cache */

/* Or use O_DIRECT for critical streaming paths */
fd = open("large.iso", O_RDONLY | O_DIRECT);
```
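A common refinement, sketched below under the assumption that the data is consumed exactly once, is to drop pages behind the read position with POSIX_FADV_DONTNEED so a multi-gigabyte stream never displaces hot data (stream_file is an illustrative helper):

```c
#include <fcntl.h>
#include <unistd.h>

/* Stream a file once without polluting the page cache (sketch) */
static void stream_file(int fd)
{
    static char buf[1 << 20];   /* 1MB read chunks */
    off_t done = 0;
    ssize_t n;

    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* ...process buf... */
        done += n;

        /* Drop fully-consumed pages behind the read position; the
         * kernel frees clean cached pages in that range immediately */
        posix_fadvise(fd, 0, done, POSIX_FADV_DONTNEED);
    }
}
```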
Problem: Database has its own cache; double-buffering wastes memory.
Solutions:
```c
/* Use O_DIRECT to bypass the page cache */
fd = open("data.db", O_RDWR | O_DIRECT | O_SYNC);

/* Reduce cache pollution from scans */
posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
```

```sh
# If not using O_DIRECT, tune dirty parameters
sysctl vm.dirty_ratio=5
sysctl vm.dirty_background_ratio=2
```
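One caveat worth illustrating: O_DIRECT requires the buffer, file offset, and transfer size to be aligned to the device's logical block size. A minimal read using posix_memalign (data.db and the 4096-byte alignment are illustrative):

```c
#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.db", O_RDONLY | O_DIRECT);
    if (fd < 0)
        return 1;

    /* O_DIRECT I/O must use an aligned buffer; 4096 covers most devices */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;

    /* Offset and length must also be block-aligned */
    ssize_t n = pread(fd, buf, 4096, 0);
    /* n < 0 with errno == EINVAL usually means an alignment violation */

    free(buf);
    close(fd);
    return n < 0 ? 1 : 0;
}
```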
Problem: Hot content should stay cached; cold content shouldn't evict it.
Solutions:
```sh
# Increase inactive/active ratio
# (done automatically by kernel heuristics)

# Consider preloading hot content
cat /var/www/hot/* > /dev/null

# Tune readahead for expected file sizes
blockdev --setra 512 /dev/sda   # 256KB
```
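Preloading can also be done programmatically with the Linux-specific readahead(2) system call, which populates the page cache without copying data into user space. A sketch (preload is an illustrative helper):

```c
#define _GNU_SOURCE            /* for readahead(2) */
#include <fcntl.h>
#include <unistd.h>

/* Warm the page cache for one file without reading it into user space */
static int preload(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    off_t size = lseek(fd, 0, SEEK_END);

    /* Populate the page cache with the file's contents */
    readahead(fd, 0, size);

    close(fd);
    return 0;
}
```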
| Workload | dirty_ratio | dirty_background_ratio | read_ahead_kb | Notes |
|---|---|---|---|---|
| General purpose | 20 (default) | 10 (default) | 128 | Balanced defaults |
| Database (OLTP) | 5-10 | 2-5 | 32-64 | Keep dirty data low, less readahead |
| Backup/streaming | 40-60 | 20-30 | 1024-2048 | Buffer more, larger readahead |
| Video serving | 20 | 10 | 2048-4096 | Large readahead for sequential |
| Low memory system | 10 | 5 | 64 | Conservative to avoid OOM |
Always test tuning changes under realistic load before production deployment. Monitor with 'sar -B' for page statistics, 'vmstat' for memory pressure, and 'iostat' for I/O patterns. Changes that help one workload may hurt others.
We've explored the Linux page cache in depth. Let's consolidate the key concepts:

- File data is cached at page granularity; every read checks the cache before touching disk
- struct address_space ties each inode to its cached pages, indexed by page offset in an XArray
- Adaptive read-ahead detects sequential access and prefetches ahead of the reader, hiding disk latency
- Writes go to the cache first; dirty pages are flushed by flusher threads according to the dirty_* thresholds, or explicitly via fsync()
- Under memory pressure, LRU-based reclaim evicts page cache pages, preferring inactive file pages
- mmap() maps page cache pages directly into process address spaces, with faults filled from the cache
- Workload-specific tuning (posix_fadvise, O_DIRECT, dirty ratios, readahead) trades memory for I/O performance
What's next:
With the page cache understood, we complete our deep dive into Linux file systems with File Operations. The next page examines the actual implementation of VFS operations—how open, read, write, close, and other system calls flow through the kernel from user space to disk.
You now understand the Linux page cache—the critical layer that transforms slow disk I/O into fast memory access. This knowledge enables you to diagnose cache behavior, tune systems for specific workloads, and understand the tradeoffs between memory consumption and I/O performance.