Every modern computing system faces a fundamental architectural challenge: the processor operates at billions of cycles per second, while storage devices—even the fastest NVMe SSDs—deliver data tens of thousands of times slower. This speed disparity, spanning roughly four to seven orders of magnitude depending on the device, would cripple system performance entirely if not for one of computing's most elegant solutions: caching.
I/O caching represents far more than a simple performance optimization. It is the architectural foundation that makes interactive computing possible, transforming what would be unbearable waits into imperceptible delays. Without caching, even the fastest modern computers would feel slower than machines from decades past, spending the vast majority of their time idle while waiting for data from storage devices.
By the end of this page, you will understand the fundamental principles of I/O caching, why it is architecturally necessary, how operating systems implement caching layers, and the profound impact caching has on every aspect of system performance. You'll develop the mental models that distinguish engineers who truly understand system behavior from those who merely use the system.
To truly appreciate I/O caching, we must first understand why it exists—not as an optional enhancement, but as an architectural necessity born from the fundamental physics of how different system components operate.
Modern computer systems exhibit a dramatic hierarchy of component speeds, each level separated by orders of magnitude:
| Component | Typical Latency | Relative Speed | Capacity |
|---|---|---|---|
| CPU Registers | ~0.3 ns | 1x (baseline) | ~1 KB |
| L1 Cache | ~1 ns | 3-4x slower | 32-64 KB |
| L2 Cache | ~3-10 ns | 10-30x slower | 256 KB - 1 MB |
| L3 Cache | ~10-20 ns | 30-70x slower | 8-128 MB |
| Main Memory (DRAM) | ~50-100 ns | 150-300x slower | 8-512 GB |
| NVMe SSD | ~10-50 µs | 30,000-150,000x slower | 256 GB - 8 TB |
| SATA SSD | ~50-150 µs | 150,000-500,000x slower | 256 GB - 8 TB |
| Hard Disk Drive | ~5-10 ms | 15,000,000-30,000,000x slower | 1-20 TB |
| Network Storage | ~1-100 ms | 3,000,000-300,000,000x slower | Unlimited |
Consider what happens when a process needs to read a file from disk without caching:

1. The process issues a read() system call and blocks.
2. The kernel translates the request into a block I/O operation and hands it to the device driver.
3. The disk seeks to the target track.
4. The platter rotates until the requested sectors pass under the head and the data is transferred.
5. The controller raises a completion interrupt; the kernel copies the data to the process and wakes it.
During steps 3-5, the CPU executes zero useful instructions for that process. At 3 GHz, a 5ms disk seek represents 15 million wasted CPU cycles—enough to execute the entirety of many programs multiple times over.
This isn't merely inefficient; it fundamentally changes the nature of computer system design. Without mitigation, I/O-bound operations would dominate all system behavior, reducing expensive processors to expensive heat generators that spend 99%+ of their time waiting.
Unlike CPU and memory speeds, which have improved exponentially over decades following Moore's Law, storage latency has improved much more slowly. Disk seek times have improved perhaps 10x over 30 years, while CPU speeds improved 10,000x. This 'latency wall' makes caching increasingly important, not less, as systems evolve.
A cache is a smaller, faster storage layer that temporarily holds copies of data from a larger, slower storage layer. The fundamental insight behind caching—applicable from CPU caches to CDN networks—rests on two empirical observations about how programs actually behave:
Temporal Locality: Data accessed recently is likely to be accessed again soon. When you read a configuration file, you're likely to read it again. When you access a database record, you'll probably access it again within the same session.
Spatial Locality: Data near recently accessed data is also likely to be accessed soon. When you read byte N of a file, you'll probably read bytes N+1, N+2, etc. When you access one field of a structure, you'll likely access adjacent fields.
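To make the two forms of locality concrete, here is a minimal user-space sketch using ordinary POSIX calls (the file path is chosen only for illustration, and error handling is kept minimal):

```c
#include <fcntl.h>
#include <unistd.h>

/* Read an entire file front to back, discarding the contents. */
static void slurp(const char *path)
{
    char buf[4096];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return;

    /*
     * Spatial locality: each read() touches bytes adjacent to the last,
     * so one block brought into the cache (or prefetched by read-ahead)
     * satisfies many subsequent requests.
     */
    while (read(fd, buf, sizeof(buf)) > 0)
        ;
    close(fd);
}

int main(void)
{
    /*
     * Temporal locality: the same file is read again shortly after the
     * first pass.  The second slurp() is typically served entirely from
     * the page cache and triggers no disk I/O at all.
     */
    slurp("/etc/hosts");
    slurp("/etc/hosts");
    return 0;
}
```

On a warm cache the second pass completes in microseconds, because every request is satisfied from memory rather than the device.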
These aren't theoretical abstractions—they emerge from the fundamental nature of computation: programs spend most of their time in loops that revisit the same code and data, files are typically processed from beginning to end, and related values are laid out adjacently in arrays, structures, and records.
Every caching system, regardless of level, shares fundamental architectural components:
Cache Memory: Fast storage holding cached data. For I/O caching, this is typically main memory (RAM). The cache memory is orders of magnitude faster than the backing store but smaller in capacity.
Cache Lines/Blocks: The unit of data transfer between cache and backing store. Operating systems typically use pages (4KB) as the block size for I/O caching, though this varies. Block sizes balance overhead (smaller blocks mean more metadata) against waste (larger blocks may cache unused data).
Tags and Metadata: Information identifying what backing store data each cache entry holds, plus state information like modification status and access history. This metadata enables the cache to answer: "Is this data cached?" and "What should be evicted?"
Replacement Policy: Algorithm determining which cached data to evict when space is needed. This is perhaps the most critical design decision, as it determines cache effectiveness under real workloads.
```c
/*
 * Conceptual structure of an I/O cache entry
 * Real implementations vary by OS and filesystem
 */
struct cache_block {
    /* Identification */
    dev_t device;                  /* Device containing this block */
    blkcnt_t block_number;         /* Block number on device */

    /* Data */
    void *data;                    /* Pointer to cached data (page-aligned) */
    size_t size;                   /* Size of cached data */

    /* State flags */
    unsigned int valid  : 1;       /* Data is valid */
    unsigned int dirty  : 1;       /* Data modified, not yet written back */
    unsigned int locked : 1;       /* Block is locked (I/O in progress) */
    unsigned int error  : 1;       /* I/O error occurred */

    /* Replacement policy support */
    time_t access_time;            /* Last access timestamp (LRU) */
    unsigned int access_count;     /* Access frequency (LFU) */
    struct list_head lru_list;     /* Position in LRU list */

    /* Concurrency control */
    rwlock_t lock;                 /* Reader-writer lock */
    wait_queue_head_t waiters;     /* Threads waiting on this block */

    /* Hash chain for fast lookup */
    struct hlist_node hash_node;   /* Position in hash table */
};

/*
 * Cache lookup: O(1) average via hash table
 */
struct cache_block *cache_lookup(dev_t dev, blkcnt_t block)
{
    unsigned int hash = hash_block(dev, block);
    struct cache_block *entry;

    rcu_read_lock();
    hlist_for_each_entry_rcu(entry, &cache_hash[hash], hash_node) {
        if (entry->device == dev && entry->block_number == block) {
            /* Found - update access statistics */
            entry->access_time = current_time();
            entry->access_count++;
            rcu_read_unlock();
            return entry;
        }
    }
    rcu_read_unlock();

    return NULL;    /* Cache miss */
}
```

The buffer cache (also called the block cache) operates at the block device level, caching raw disk blocks regardless of their contents. This is the oldest and most fundamental form of I/O caching, dating back to early Unix systems.
In traditional Unix systems (and preserved in some modern implementations), the buffer cache served as the single point of caching between filesystems and block devices. Every read from disk first checked the buffer cache; every write went through it. This unified approach had elegance: one caching layer served all filesystem types.
The buffer cache presents a simple abstraction: given a device and block number, return the block data. Behind this simplicity lies sophisticated machinery:
```c
/*
 * Buffer cache: Traditional Unix block caching layer
 *
 * This implements the classic buffer cache algorithm with
 * hash table lookup and LRU replacement
 */

#define BUFFER_HASH_SIZE 1024
#define BUFFER_SIZE      4096    /* Typically matches page size */

struct buffer_head {
    dev_t b_dev;                 /* Device identifier */
    blkcnt_t b_blocknr;          /* Block number */
    char *b_data;                /* Pointer to data */

    /* State management */
    unsigned long b_state;       /* State flags */
    atomic_t b_count;            /* Reference count */

    /* List management */
    struct list_head b_lru;      /* LRU list position */
    struct hlist_node b_hash;    /* Hash chain */

    /* I/O completion */
    void (*b_end_io)(struct buffer_head *, int);
    wait_queue_head_t b_wait;
};

/* State flags */
#define BH_Uptodate 0    /* Contains valid data */
#define BH_Dirty    1    /* Data modified */
#define BH_Lock     2    /* Locked for I/O */
#define BH_Req      3    /* Has been submitted for I/O */
#define BH_Mapped   4    /* Has disk mapping */

/* Hash table for O(1) lookup */
static struct hlist_head buffer_hash[BUFFER_HASH_SIZE];
static DEFINE_SPINLOCK(buffer_hash_lock);

/* LRU list for replacement */
static LIST_HEAD(lru_list);
static DEFINE_SPINLOCK(lru_lock);

/*
 * Get a buffer for the specified block
 * Returns cached buffer if present, or allocates new one
 */
struct buffer_head *getblk(dev_t dev, blkcnt_t block, int size)
{
    struct buffer_head *bh;
    unsigned int hash = hash_buffer(dev, block);

    /* First, try to find in cache */
    spin_lock(&buffer_hash_lock);
    bh = find_buffer(dev, block, hash);
    if (bh) {
        /* Cache hit */
        atomic_inc(&bh->b_count);
        spin_unlock(&buffer_hash_lock);

        /* Move to end of LRU (most recently used) */
        spin_lock(&lru_lock);
        list_move_tail(&bh->b_lru, &lru_list);
        spin_unlock(&lru_lock);

        wait_on_buffer(bh);    /* Wait if I/O in progress */
        return bh;
    }
    spin_unlock(&buffer_hash_lock);

    /* Cache miss: allocate new buffer */
    bh = allocate_buffer(dev, block, size);
    if (!bh) {
        /* Memory pressure: reclaim from LRU */
        bh = reclaim_buffer();
        if (!bh)
            return NULL;    /* Out of memory */

        /* Reinitialize for new block */
        init_buffer(bh, dev, block, size);
    }

    /* Insert into hash table */
    insert_buffer_hash(bh, hash);

    return bh;
}

/*
 * Read a block, using cache if possible
 */
struct buffer_head *bread(dev_t dev, blkcnt_t block, int size)
{
    struct buffer_head *bh;

    bh = getblk(dev, block, size);
    if (!bh)
        return NULL;

    /* Check if already has valid data */
    if (test_bit(BH_Uptodate, &bh->b_state))
        return bh;    /* Data already in cache */

    /* Need to read from disk */
    lock_buffer(bh);

    if (test_bit(BH_Uptodate, &bh->b_state)) {
        /* Someone else loaded it while we waited */
        unlock_buffer(bh);
        return bh;
    }

    /* Submit I/O request */
    bh->b_end_io = end_buffer_read;
    submit_bh(READ, bh);

    /* Wait for completion */
    wait_on_buffer(bh);

    if (!test_bit(BH_Uptodate, &bh->b_state)) {
        brelse(bh);
        return NULL;    /* Read failed */
    }

    return bh;
}

/*
 * Release a buffer (decrement reference count)
 */
void brelse(struct buffer_head *bh)
{
    if (!bh)
        return;

    if (atomic_dec_and_test(&bh->b_count)) {
        /* Buffer now unreferenced - eligible for reclaim */
        if (test_bit(BH_Dirty, &bh->b_state)) {
            /* Schedule deferred writeback */
            mark_buffer_dirty(bh);
        }
    }
}
```

The buffer cache possesses several important characteristics that shape system behavior:
Block-Oriented: The buffer cache operates on fixed-size blocks aligned to device block boundaries. This matches how block devices actually work but can be inefficient for small reads or non-aligned access patterns.
Device-Independent: The buffer cache doesn't know or care what filesystem format is stored on the device. This enables caching for any block device, including raw device access.
Write Aggregation: Multiple writes to the same block are coalesced in the cache. Only the final state needs to be written to disk, dramatically reducing I/O traffic for frequently-modified data.
Synchronous Semantics Option: Applications can request synchronous writes (O_SYNC) that bypass caching benefits in exchange for durability guarantees.
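As an illustration of the last two points, the following sketch contrasts an ordinary buffered write with an O_SYNC write using standard POSIX flags (the file name is a placeholder and error checking is omitted for brevity):

```c
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "transaction committed\n";

    /*
     * Buffered write: the data lands in the cache and the call returns
     * immediately; the kernel writes the block back later, and repeated
     * writes to the same block may be coalesced into one disk write.
     */
    int fd_fast = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    write(fd_fast, msg, sizeof(msg) - 1);
    close(fd_fast);

    /*
     * Synchronous write: O_SYNC makes write() block until the data (and
     * the metadata needed to retrieve it) has reached stable storage,
     * trading throughput for a durability guarantee.
     */
    int fd_durable = open("journal.log",
                          O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
    write(fd_durable, msg, sizeof(msg) - 1);
    close(fd_durable);

    return 0;
}
```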
Modern Linux unifies the buffer cache with the page cache—they share the same memory. Buffer heads now serve primarily as metadata describing how page cache pages map to disk blocks. This unification eliminated the double-caching problem where data could exist in both caches simultaneously, wasting memory.
The page cache represents a more sophisticated approach to I/O caching, operating at the file level rather than the block level. Instead of caching arbitrary disk blocks, the page cache caches file contents indexed by file identity and offset.
The page cache organizes cached data by file rather than by device location. This enables several powerful optimizations: cached pages survive across opens of the same file, every process reading or mapping a file shares the same physical pages, and read-ahead can reason in terms of logical file offsets rather than physical block addresses.
```c
/*
 * Page Cache Implementation Concepts
 *
 * The page cache maps (inode, offset) -> page
 * Real Linux implementation uses radix trees and XArrays
 */

struct address_space {
    struct inode *host;        /* Owner inode */
    struct xarray i_pages;     /* Cached pages (radix tree) */
    atomic_t i_nrpages;        /* Number of cached pages */
    rwlock_t tree_lock;        /* Lock for page tree */

    /* Address space operations */
    const struct address_space_operations *a_ops;

    /* Writeback state */
    unsigned long flags;
    spinlock_t private_lock;
    struct list_head private_list;
};

/*
 * Find a page in the page cache
 * Returns NULL if not cached
 */
struct page *find_get_page(struct address_space *mapping, pgoff_t index)
{
    struct page *page;

    rcu_read_lock();
    page = xa_load(&mapping->i_pages, index);
    if (page && !page_cache_get_speculative(page)) {
        page = NULL;    /* Page was being reclaimed */
    }
    rcu_read_unlock();

    return page;
}

/*
 * Find page or create if not present
 * This is the workhorse function for file reads
 */
struct page *find_or_create_page(struct address_space *mapping,
                                 pgoff_t index, gfp_t gfp_mask)
{
    struct page *page;
    int error;

    /* First, try simple lookup */
    page = find_get_page(mapping, index);
    if (page)
        return page;

    /* Not in cache: allocate new page */
    page = alloc_page(gfp_mask);
    if (!page)
        return NULL;

    /* Try to add to cache */
    error = add_to_page_cache_locked(page, mapping, index, gfp_mask);
    if (error) {
        /* Someone else added it first - use theirs */
        put_page(page);
        page = find_get_page(mapping, index);
    }

    return page;
}

/*
 * Generic file read implementation
 * Shows how page cache integrates with file I/O
 */
ssize_t generic_file_read(struct file *filp, char __user *buf,
                          size_t count, loff_t *ppos)
{
    struct inode *inode = filp->f_inode;
    struct address_space *mapping = inode->i_mapping;
    loff_t pos = *ppos;
    ssize_t read = 0;

    while (count > 0) {
        pgoff_t index = pos >> PAGE_SHIFT;
        size_t offset = pos & ~PAGE_MASK;
        size_t bytes = min(PAGE_SIZE - offset, count);
        struct page *page;

        /* Try to get page from cache */
        page = find_get_page(mapping, index);

        if (!page) {
            /* Cache miss - need to read from disk */
            page = page_cache_alloc(mapping);
            if (!page)
                return read ? read : -ENOMEM;

            /* Add to cache and initiate read */
            add_to_page_cache(page, mapping, index);

            /* Trigger read-ahead for sequential access */
            if (mapping->a_ops->readahead)
                trigger_readahead(mapping, filp, index);

            /* Read the page from disk */
            read_page_from_disk(mapping, page);
        }

        /* Wait for page to be uptodate */
        wait_on_page_locked(page);

        if (!PageUptodate(page)) {
            put_page(page);
            return read ? read : -EIO;
        }

        /* Copy data to user buffer */
        if (copy_to_user(buf, page_address(page) + offset, bytes)) {
            put_page(page);
            return read ? read : -EFAULT;
        }

        put_page(page);

        buf += bytes;
        count -= bytes;
        pos += bytes;
        read += bytes;
    }

    *ppos = pos;
    return read;
}
```

The page cache uses sophisticated data structures to enable efficient operations:
XArray (Radix Tree): Each file's cached pages are stored in an XArray (formerly radix tree) indexed by page offset. This enables O(log n) lookup, insertion, and range queries—essential for operations like 'invalidate pages 100-200'.
LRU Lists: Pages are maintained on LRU-style lists for replacement decisions. Modern Linux uses multiple lists (active/inactive, file-backed/anonymous) to implement sophisticated replacement policies.
Page Flags: Each page structure contains flags indicating state: PG_locked (under I/O), PG_uptodate (contains valid data), PG_dirty (modified), PG_referenced (recently accessed), etc.
Address Space: Each cached file has an address_space structure containing its page tree and operations. The address_space_operations define how to read/write pages for that file type.
Read-ahead (or prefetching) is one of the most impactful I/O optimizations, exploiting spatial locality to initiate I/O before data is actually requested. By reading ahead of the access pattern, the kernel hides I/O latency entirely—when the application requests the next block, it's already in memory.
Modern Linux uses an adaptive read-ahead algorithm that adjusts dynamically based on access patterns:
```c
/*
 * Simplified adaptive read-ahead algorithm
 *
 * The kernel tracks read patterns and adjusts
 * read-ahead window size dynamically
 */

struct file_ra_state {
    pgoff_t start;              /* Window start */
    unsigned int size;          /* Window size */
    unsigned int async_size;    /* Async readahead trigger */
    unsigned int ra_pages;      /* Maximum readahead */

    /* Pattern detection */
    pgoff_t prev_pos;           /* Previous read position */
    unsigned int prev_count;    /* Previous read count */
    unsigned int pattern;       /* Detected pattern flags */
};

/* Readahead pattern flags */
#define RA_SEQUENTIAL 1    /* Sequential access detected */
#define RA_RANDOM     2    /* Random access detected */
#define RA_MMAP       4    /* Memory-mapped access */

/*
 * Main readahead decision function
 */
void page_cache_readahead(struct address_space *mapping,
                          struct file_ra_state *ra,
                          struct file *filp,
                          pgoff_t offset,
                          unsigned long req_size)
{
    unsigned int max = ra->ra_pages;    /* Typically 32 pages (128KB) */
    pgoff_t expected, prev_index;

    prev_index = ra->prev_pos >> PAGE_SHIFT;
    expected = ra->start + ra->size;

    /* Case 1: Sequential read continues where expected */
    if (offset == expected || offset == prev_index + 1) {
        /* Sequential pattern confirmed - expand window */
        ra->pattern = RA_SEQUENTIAL;

        if (offset == ra->start + ra->async_size) {
            /* Hit async trigger - time to prefetch more */
            ra->start = offset;
            ra->size = calc_readahead_size(ra, max);
            ra->async_size = ra->size * 3 / 4;

            /* Submit async readahead */
            do_async_readahead(mapping, filp, offset, ra->size);
        }
    }
    /* Case 2: Start of sequential read or new sequential stream */
    else if (offset == 0 || offset == prev_index) {
        /* Possible new sequential stream - start conservatively */
        ra->start = offset;
        ra->size = initial_readahead_size(max);
        ra->async_size = ra->size / 2;

        do_sync_readahead(mapping, filp, offset, ra->size);
    }
    /* Case 3: Random access pattern */
    else if (abs(offset - prev_index) > ra->ra_pages) {
        ra->pattern = RA_RANDOM;
        /* Disable readahead for random access */
        ra->size = 0;
        ra->async_size = 0;
    }
    /* Case 4: Stride pattern detection */
    else if (detect_stride(ra, offset)) {
        /* Read stride pattern (e.g., every Nth page) */
        handle_stride_readahead(mapping, ra, offset);
    }

    /* Update history */
    ra->prev_pos = (loff_t)offset << PAGE_SHIFT;
}

/*
 * Calculate optimal readahead window size
 */
static unsigned int calc_readahead_size(struct file_ra_state *ra,
                                        unsigned int max)
{
    unsigned int size = ra->size;

    /* Exponential growth up to maximum */
    if (ra->pattern == RA_SEQUENTIAL) {
        size = size * 2;
        if (size > max)
            size = max;
    }

    /* Consider memory pressure */
    if (memory_pressure_high())
        size = size / 2;

    return max(size, 4U);    /* Minimum 4 pages */
}

/*
 * Submit asynchronous readahead
 * Returns immediately; I/O completes in background
 */
static void do_async_readahead(struct address_space *mapping,
                               struct file *filp,
                               pgoff_t offset,
                               unsigned long nr_to_read)
{
    struct blk_plug plug;
    unsigned long i;

    /* Plug block layer to batch requests */
    blk_start_plug(&plug);

    for (i = 0; i < nr_to_read; i++) {
        struct page *page;
        pgoff_t page_offset = offset + i;

        /* Skip if already cached */
        page = find_get_page(mapping, page_offset);
        if (page) {
            put_page(page);
            continue;
        }

        /* Allocate and add new page */
        page = page_cache_alloc_readahead(mapping);
        if (!page)
            break;

        if (add_to_page_cache_lru(page, mapping, page_offset)) {
            put_page(page);
            continue;
        }

        /* Mark as readahead page */
        SetPageReadahead(page);

        /* Submit read I/O */
        mapping->a_ops->readpage(filp, page);
    }

    /* Unplug - submit batched requests */
    blk_finish_plug(&plug);
}
```

Read-ahead provides substantial benefits but isn't free:
Benefits:

- Sequential reads hit the cache almost every time, because the next pages are already in memory by the time the application asks for them; device latency is hidden entirely.
- Fewer, larger I/O requests use device bandwidth far more efficiently than many small ones.

Costs:

- Mispredicted read-ahead wastes device bandwidth on data that is never used.
- Prefetched pages consume memory and can evict genuinely useful cached data (cache pollution).
The key is adaptation: aggressive read-ahead for sequential access, conservative or disabled for random access. The kernel tracks patterns and adjusts dynamically.
Applications can influence read-ahead behavior via posix_fadvise() with POSIX_FADV_SEQUENTIAL (hint for aggressive read-ahead), POSIX_FADV_RANDOM (disable read-ahead), or POSIX_FADV_WILLNEED (explicit prefetch request). Database systems often disable kernel read-ahead in favor of application-controlled prefetching.
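A minimal sketch of these hints in use—the file name and prefetch length are placeholders, and return values should be checked in real code:

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("large_dataset.bin", O_RDONLY);
    if (fd < 0)
        return 1;

    /* Whole-file hint: expect sequential access, so read ahead aggressively. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* Alternative hint for index-style lookups: disable read-ahead. */
    /* posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM); */

    /* Explicit prefetch: ask the kernel to start loading the first 1 MiB now. */
    posix_fadvise(fd, 0, 1024 * 1024, POSIX_FADV_WILLNEED);

    /* ... read the file ... */

    close(fd);
    return 0;
}
```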
The page cache enables one of the most powerful I/O mechanisms: memory-mapped files. With mmap(), file contents appear directly in process address space, allowing file access with simple memory operations instead of explicit read/write system calls.
When a process maps a file, the kernel creates page table entries that point to the page cache:
```c
/*
 * Memory-mapped file I/O through page cache
 */

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Example: Processing a file via memory mapping
 *
 * The kernel handles all caching automatically:
 * - Pages are loaded on-demand (page fault)
 * - Pages are shared with other mappers
 * - Modified pages are marked dirty
 * - Writeback occurs automatically or on msync()
 */
void process_file_via_mmap(const char *filename)
{
    int fd;
    struct stat sb;
    char *mapped;

    fd = open(filename, O_RDWR);
    fstat(fd, &sb);

    /* Map file into address space */
    mapped = mmap(NULL,          /* Let kernel choose address */
                  sb.st_size,    /* Map entire file */
                  PROT_READ | PROT_WRITE,
                  MAP_SHARED,    /* Modifications affect file */
                  fd, 0);

    close(fd);    /* File stays mapped; fd can be closed */

    /*
     * Now file contents are accessible as memory.
     *
     * Reading: byte = mapped[offset]
     *   - If page not in cache: page fault
     *   - Kernel loads page from disk to page cache
     *   - Page table updated to map page cache page
     *   - Access resumes, now using cached page
     *
     * Writing: mapped[offset] = byte
     *   - If page not in cache: same as read
     *   - Mark page dirty
     *   - Eventually written back by kernel or msync()
     */

    /* Example: capitalize first 1000 characters */
    for (size_t i = 0; i < 1000 && i < sb.st_size; i++) {
        if (mapped[i] >= 'a' && mapped[i] <= 'z')
            mapped[i] -= 32;    /* Modifies page cache directly */
    }

    /* Ensure changes are written to disk */
    msync(mapped, sb.st_size, MS_SYNC);

    munmap(mapped, sb.st_size);
}

/*
 * Kernel page fault handler for memory-mapped files
 * (Conceptual implementation)
 */
int filemap_fault(struct vm_fault *vmf)
{
    struct vm_area_struct *vma = vmf->vma;
    struct file *file = vma->vm_file;
    struct address_space *mapping = file->f_mapping;
    pgoff_t offset = vmf->pgoff;
    struct page *page;

    /* Try to find page in cache */
    page = find_get_page(mapping, offset);

    if (!page) {
        /* Not cached - need to read from file */
        page = find_or_create_page(mapping, offset, GFP_KERNEL);
        if (!page)
            return VM_FAULT_OOM;

        /* Read page from disk if not uptodate */
        if (!PageUptodate(page)) {
            mapping->a_ops->readpage(file, page);
            wait_on_page_locked(page);
        }
    }

    /* Lock page for mapping */
    lock_page(page);

    /*
     * Now install page in process page tables.
     * Multiple processes can map same page cache page.
     */
    vmf->page = page;
    return VM_FAULT_LOCKED;
}
```

Memory mapping through the page cache provides several advantages:

Zero-Copy Access: Data never needs to be copied from kernel buffers to user buffers. The process directly accesses the page cache through its page tables.
Automatic Caching: The kernel manages all caching decisions. Pages are loaded on-demand, shared across processes, and written back automatically.
Simplified Programming: File access becomes simple memory operations. No read()/write() calls, no buffer management, no positioning.
Huge File Support: Even files larger than physical memory can be mapped—only accessed regions actually consume memory.
These benefits come with corresponding costs:

Page Fault Overhead: Every access to a page not yet loaded incurs a page fault. For highly random access patterns, explicit read() with proper buffering may outperform mmap().
Signal Handling: I/O errors become SIGBUS signals, which are harder to handle than read() error returns.
TLB Pressure: Large mappings consume TLB entries, potentially impacting other memory accesses.
Writeback Semantics: Modifications may be written back at unpredictable times. Applications requiring durability must explicitly call msync().
Understanding cache behavior requires systematic measurement. Several key metrics reveal cache effectiveness:
| Metric | Definition | Implications |
|---|---|---|
| Hit Rate | Fraction of accesses served from cache | Higher is better; target >90% for file-intensive workloads |
| Miss Rate | Fraction of accesses requiring I/O (1 - hit rate) | Lower is better; indicates cache sizing and replacement effectiveness |
| Fill Rate | Rate at which cache fills with new data | Indicates I/O pressure and miss frequency |
| Eviction Rate | Rate at which data is removed from cache | High rate suggests cache is too small or access pattern is pathological |
| Dirty Ratio | Fraction of cached data modified but not written | Risk indicator: high dirty ratio risks data loss on failure |
| Average Miss Penalty | Time cost of a cache miss | Depends on backing store; SSDs have 100x lower miss penalty than HDDs |
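System-wide counters tell only part of the story. Whether a particular file is resident can be probed with mincore(2)—essentially what the vmtouch tool used in the script below does. A simplified sketch, with minimal error handling and an example path:

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/var/log/syslog"; /* example path */
    struct stat sb;
    int fd = open(path, O_RDONLY);
    if (fd < 0 || fstat(fd, &sb) < 0 || sb.st_size == 0)
        return 1;

    /* Map the file read-only so we can ask about its pages. */
    void *map = mmap(NULL, sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return 1;

    long page = sysconf(_SC_PAGESIZE);
    size_t pages = (sb.st_size + page - 1) / page;
    unsigned char *vec = malloc(pages);

    /* mincore() fills one byte per page; bit 0 set means the page is resident. */
    if (vec && mincore(map, sb.st_size, vec) == 0) {
        size_t resident = 0;
        for (size_t i = 0; i < pages; i++)
            resident += vec[i] & 1;
        printf("%s: %zu of %zu pages in page cache (%.1f%%)\n",
               path, resident, pages, 100.0 * resident / pages);
    }

    free(vec);
    munmap(map, sb.st_size);
    close(fd);
    return 0;
}
```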
```bash
#!/bin/bash
# Monitor page cache performance on Linux

# Current cache size
echo "=== Page Cache Size ==="
grep -E "^(Cached|Buffers|Active\(file\)|Inactive\(file\)):" /proc/meminfo

# Cache hit/miss statistics (requires perf)
echo -e "\n=== Cache Hit/Miss (10 second sample) ==="
perf stat -e cache-references,cache-misses,page-faults -a sleep 10

# Per-file cache status (requires vmtouch)
echo -e "\n=== File Cache Status ==="
vmtouch -v /var/log/syslog

# Detailed I/O statistics
echo -e "\n=== Block I/O Statistics ==="
cat /proc/diskstats | awk 'NF >= 14 {
    dev = $3
    reads = $4
    read_sectors = $6
    writes = $8
    write_sectors = $10
    if (reads + writes > 0)
        printf "%-10s: %10d reads, %10d writes\n", dev, reads, writes
}'

# Page cache activity (per second)
echo -e "\n=== Page Cache Activity (5 seconds) ==="
sar -B 1 5 2>/dev/null || echo "Install sysstat for detailed stats"

# Watch dirty page writeback
echo -e "\n=== Dirty Pages ==="
grep -E "^(Dirty|Writeback):" /proc/meminfo
```

The effective access time (EAT) quantifies the average time to access data considering cache behavior:
EAT = (hit_rate × cache_access_time) + (miss_rate × miss_penalty)
Where:
miss_penalty = cache_access_time + backing_store_access_time
Example Calculation (assume a 95% hit rate, a 100 ns cache access time, 50 µs SSD access, and 8 ms HDD access):
With SSD backing:
EAT = 0.95 × 100ns + 0.05 × 50,100ns = 95ns + 2,505ns = 2,600ns
With HDD backing:
EAT = 0.95 × 100ns + 0.05 × 8,000,100ns = 95ns + 400,005ns = 400,100ns
This illustrates both the power of caching (reducing HDD access from 8ms to 400µs average) and why the backing store still matters (SSD provides 150x lower EAT than HDD even with identical caching).
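For readers who want to replay this arithmetic with their own numbers, here is a small self-contained program implementing the EAT formula above (the constants mirror the worked example's assumptions):

```c
#include <stdio.h>

/* EAT = hit_rate * cache_time + miss_rate * (cache_time + backing_time) */
static double eat_ns(double hit_rate, double cache_ns, double backing_ns)
{
    double miss_rate = 1.0 - hit_rate;
    return hit_rate * cache_ns + miss_rate * (cache_ns + backing_ns);
}

int main(void)
{
    double hit_rate = 0.95;
    double cache_ns = 100.0;                  /* DRAM-backed cache access */
    double ssd_ns   = 50.0 * 1000.0;          /* 50 microseconds */
    double hdd_ns   = 8.0 * 1000.0 * 1000.0;  /* 8 milliseconds */

    printf("EAT with SSD backing: %.0f ns\n", eat_ns(hit_rate, cache_ns, ssd_ns));
    printf("EAT with HDD backing: %.0f ns\n", eat_ns(hit_rate, cache_ns, hdd_ns));
    return 0;
}
```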
Improving cache hit rate has diminishing returns. Going from 80% to 90% hit rate halves the miss rate (huge benefit). Going from 98% to 99% also halves the miss rate (same relative improvement, smaller absolute benefit). Going from 99.9% to 99.95% barely matters. Focus optimization effort where it has the most impact.
We've explored the fundamental concepts of I/O caching, establishing the foundation for understanding more advanced caching topics that follow. Let's consolidate the key insights:

- The speed gap between processors and storage spans many orders of magnitude; without caching, systems would spend nearly all their time waiting on I/O.
- Caching works because real programs exhibit temporal and spatial locality.
- The buffer cache caches raw disk blocks; the page cache caches file contents indexed by file and offset, and modern Linux unifies the two.
- Read-ahead exploits sequential access patterns to hide I/O latency, and memory mapping exposes the page cache directly to processes.
- Hit rate, miss penalty, and effective access time quantify how well a cache is actually working.
What's Next:
The next page examines write caching—how operating systems handle write operations through the cache, the tradeoffs between performance and durability, and the mechanisms that ensure data reaches persistent storage despite the intervening cache layer. Understanding write caching is essential for building systems that are both fast and reliable.
You now understand the fundamental principles of I/O caching: why it exists, how it works at both the buffer and page cache levels, how read-ahead and memory mapping enhance it, and how to measure cache effectiveness. This foundation prepares you to explore the more complex topics of write caching, cache policies, coherence, and optimization.