Once an operating system detects sequential access, it faces a crucial decision: how much data should be prefetched? Too little, and the application will still wait for I/O. Too much, and precious memory is wasted on data that may never be used.
This decision is governed by the read-ahead window—a conceptual sliding window that determines the size of prefetch operations. Think of it as an adjustable net cast ahead of the current read position, capturing data before the application needs it.
The read-ahead window is one of the most carefully tuned components in any operating system. Its size directly impacts I/O throughput, memory consumption, and how quickly the system adapts when the access pattern changes.
This page explores the algorithms and strategies that govern window management.
By the end of this page, you will understand the anatomy of a read-ahead window, how windows grow and shrink based on access patterns, the mathematical principles behind window sizing, and how modern systems balance prefetching aggressiveness against resource constraints.
A read-ahead window is defined by several key parameters that together determine its behavior. Understanding these parameters is essential for understanding how prefetching works.
Window Parameters:
| Parameter | Linux Name | Description |
|---|---|---|
| Start Position | start | File offset where the current read-ahead window begins |
| Window Size | size | Number of pages/blocks in the current window |
| Async Size | async_size | Pages that trigger async prefetch when accessed |
| Maximum Size | ra_pages | Maximum allowed window size (configurable) |
| Previous Position | prev_pos | Last read position (for pattern detection) |
Window Regions:
Already-Read Region: Data that was previously in the window but has been consumed by the application. This region is no longer part of the active window.
Sync Region: Data prefetched and waiting in the page cache. The application can read this immediately without I/O wait.
Async Marker: A special position in the window. When the application reads data at this marker, the kernel triggers the next batch of asynchronous prefetching.
Async Region: Data currently being prefetched asynchronously. The I/O is in progress but not yet complete.
The Critical Invariant:
For optimal performance, the sync region must contain enough data to satisfy application reads while the async region is being loaded. This is the fundamental timing relationship that the window sizing algorithm must maintain.
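To make this concrete, here is a back-of-envelope sketch of the invariant; the consumption rate, I/O latency, and page size below are assumed round numbers, not measurements:

```c
/* Minimal sketch: how many sync pages are needed so the application
 * never stalls while an async batch is in flight. All numbers are
 * illustrative assumptions.
 */
#include <stdio.h>

int main(void)
{
    double consume_mb_s  = 500.0;  /* application read rate (assumed) */
    double io_latency_ms = 4.0;    /* async batch completion time (assumed) */
    double page_kb       = 4.0;    /* 4KB pages */

    /* Data consumed while the async batch is still loading */
    double consumed_kb    = consume_mb_s * 1024.0 * (io_latency_ms / 1000.0);
    double min_sync_pages = consumed_kb / page_kb;

    /* 500 MB/s for 4 ms = 2048 KB = 512 pages of sync headroom */
    printf("Sync region must hold at least %.0f pages (%.0f KB)\n",
           min_sync_pages, consumed_kb);
    return 0;
}
```

If the sync region holds less than this, the application catches up to the async boundary before the I/O completes and stalls.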
In Linux, this state is captured in the kernel's struct file_ra_state:

```c
/* Read-ahead window state structure (from Linux kernel) */
struct file_ra_state {
    /* Start of current read-ahead window (page index) */
    pgoff_t start;

    /* Number of pages in current window */
    unsigned int size;

    /* Number of pages that are for async readahead.
     * When the application reaches this point, trigger the next prefetch.
     * Setting async_size = size means the entire window triggers async.
     */
    unsigned int async_size;

    /* Maximum number of pages to read ahead.
     * Configurable via /sys/block/<device>/queue/read_ahead_kb
     */
    unsigned int ra_pages;

    /* Cache miss counter for mmap fault-based access */
    unsigned int mmap_miss;

    /* Previous read position for sequential detection */
    loff_t prev_pos;
};

/*
 * Example window state:
 *
 * File:   [0][1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]
 *
 * Window:       |<--- sync --->|<-- async -->|
 *            start=2        marker=6       end=10
 *            size=8, async_size=4
 *
 * Application reading at: [3]
 *   Status: block 3 served from cache (sync region)
 *
 * Application reads block [6] (hits marker):
 *   Action: kernel triggers async prefetch of blocks 10-17
 */

void visualize_window(struct file_ra_state *ra)
{
    printk("Window: start=%lu, size=%u, async_size=%u, max=%u\n",
           ra->start, ra->size, ra->async_size, ra->ra_pages);

    /* pgoff_t (unsigned long) matches the %lu format specifiers */
    pgoff_t sync_end = ra->start + ra->size - ra->async_size;
    pgoff_t async_start = sync_end;
    pgoff_t window_end = ra->start + ra->size;

    printk("  Regions: sync=[%lu,%lu), async=[%lu,%lu)\n",
           ra->start, sync_end, async_start, window_end);
}
```

When sequential access is detected, the read-ahead window grows to improve performance. The growth strategy determines how quickly the window expands and how large it can become.
Common Growth Strategies:
- Static: a fixed window size regardless of history; simple, but poorly suited to both small and large streams at once.
- Linear: add a fixed increment after each sequential access; predictable, but slow to reach large sizes.
- Exponential: double the window after each sequential access; reaches the maximum quickly, at the risk of temporary over-prefetching.
Linux Uses Exponential Growth:
The Linux kernel employs an exponential growth strategy, starting conservative and doubling the window with each successful sequential access until reaching the maximum:
```c
/* Simplified Linux read-ahead growth algorithm */

/* Initial window size (conservative start) */
#define INIT_RA_SIZE 32  /* 32 pages = 128KB with 4KB pages */

/* Calculate next window size */
static unsigned long get_next_ra_size(struct file_ra_state *ra,
                                      unsigned long max_pages)
{
    unsigned long cur_size = ra->size;
    unsigned long new_size;

    /*
     * Exponential growth: double the window.
     * This rapidly finds the optimal size for streaming workloads.
     */
    new_size = cur_size * 2;

    /*
     * But don't exceed the maximum.
     * Typical max is 512 pages (2MB with 4KB pages).
     */
    if (new_size > max_pages)
        new_size = max_pages;

    return new_size;
}

/* Initial window size calculation */
static unsigned long get_init_ra_size(unsigned long request_size,
                                      unsigned long max_pages)
{
    unsigned long init_size = INIT_RA_SIZE;

    /*
     * If the application requests more than the default init size,
     * start with the request size (rounded up).
     */
    if (request_size > init_size)
        init_size = request_size;

    /*
     * But still cap at 1/4 of max for the first access.
     * This prevents immediate over-allocation.
     */
    unsigned long quarter_max = max_pages / 4;
    if (init_size > quarter_max)
        init_size = quarter_max;

    return init_size;
}

/*
 * Window growth example:
 *
 * max_pages = 512 (2MB)
 *
 * Access 1:  init_size = 32 pages  (128KB)
 * Access 2:  new_size  = 64 pages  (256KB)
 * Access 3:  new_size  = 128 pages (512KB)
 * Access 4:  new_size  = 256 pages (1MB)
 * Access 5:  new_size  = 512 pages (2MB)  <- Maximum reached
 * Access 6+: size      = 512 pages (stable at max)
 */
```

| Sequential Read # | Window Size (Pages) | Window Size (KB) | Total Prefetched (KB) |
|---|---|---|---|
| 1 | 32 | 128 | 128 |
| 2 | 64 | 256 | 384 |
| 3 | 128 | 512 | 896 |
| 4 | 256 | 1024 | 1920 |
| 5 | 512 (max) | 2048 | 3968 |
| 6+ | 512 (stable) | 2048 | N/A |
Exponential growth minimizes the number of window adjustments needed to reach optimal size. For a streaming workload, reaching maximum prefetch in 5 steps is far better than the ~15 steps linear growth would require. The potential downside—temporary over-prefetching—is usually less costly than under-prefetching.
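A quick sketch verifying those step counts; the linear policy here (adding one initial-window increment per access) is one plausible interpretation, not a specific kernel algorithm:

```c
/* Count accesses needed to reach the maximum window under
 * exponential doubling versus a fixed linear increment.
 */
#include <stdio.h>

int main(void)
{
    unsigned int init = 32, max = 512;

    unsigned int size = init, steps = 1;
    while (size < max) { size *= 2; steps++; }
    printf("exponential: max reached on access %u\n", steps);  /* 5 */

    size = init; steps = 1;
    while (size < max) { size += init; steps++; }
    printf("linear:      max reached on access %u\n", steps);  /* 16 */
    return 0;
}
```

Counting growth adjustments rather than accesses gives 4 versus 15, matching the "~15 steps" figure above.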
Just as important as growing the window is knowing when to shrink or reset it. An oversized window for a random workload wastes significant resources.
Shrinking Triggers:
- Pattern break: a seek or random access means the sequential stream has ended, so the window resets to its initial size.
- Memory pressure: prefetched pages are evicted before they are used (tracked via mmap_miss), so the window is cut back.
A simplified sketch of this shrink-and-reset logic:

```c
/* Window shrinking and reset logic */

struct file_ra_state {
    pgoff_t start;
    unsigned int size;
    unsigned int async_size;
    unsigned int ra_pages;
    loff_t prev_pos;

    /* Additional tracking for shrink decisions */
    unsigned int mmap_miss;   /* Cache misses for mmap */
    unsigned int prev_size;   /* Previous window size */
};

/* Check if the access pattern has broken */
static bool is_sequential_broken(struct file_ra_state *ra, pgoff_t offset)
{
    pgoff_t prev_page = ra->prev_pos >> PAGE_SHIFT;
    pgoff_t expected = prev_page + 1;  /* Simple expectation */

    /* Perfect sequential continues from the previous end */
    if (offset == expected)
        return false;

    /* Small forward skip is OK (tolerance) */
    if (offset > prev_page && offset - prev_page <= 8)
        return false;

    /* Everything else is a break in the pattern */
    return true;
}

/* Reset to initial conservative state */
void reset_ra_state(struct file_ra_state *ra)
{
    ra->prev_size = ra->size;
    ra->size = INIT_RA_SIZE;
    ra->async_size = 0;  /* No async prefetch until pattern confirmed */
    /* Keep ra->start at current position for a fresh start */
}

/* Handle window adjustment based on access pattern */
void adjust_window(struct file_ra_state *ra, pgoff_t offset)
{
    if (is_sequential_broken(ra, offset)) {
        /*
         * Pattern is broken - reset to initial state.
         *
         * We don't shrink gradually because:
         * 1. Random access doesn't benefit from any prefetch
         * 2. A quick reset minimizes wasted I/O
         * 3. If access becomes sequential again, growth is fast
         */
        reset_ra_state(ra);
        return;
    }

    /* Check for memory pressure indicators */
    if (ra->mmap_miss > 0) {
        /*
         * Pages we prefetched were evicted before use!
         * This means the window is too large for available memory.
         * Reduce aggressively.
         */
        ra->size = max(ra->size / 2, INIT_RA_SIZE);
        ra->mmap_miss = 0;
        return;
    }

    /* Sequential continues - maybe grow */
    if (should_grow_window(ra)) {
        ra->prev_size = ra->size;
        ra->size = get_next_ra_size(ra, ra->ra_pages);
    }
}

/*
 * Window lifecycle example:
 *
 * 1. File open       -> size=32,  async=0   (initial)
 * 2. Sequential read -> size=64,  async=32  (growing)
 * 3. Sequential read -> size=128, async=64
 * 4. Sequential read -> size=256, async=128
 * 5. SEEK (random)   -> size=32,  async=0   (reset!)
 * 6. Sequential read -> size=64,  async=32  (regrows)
 */
```

Linux tracks mmap_miss to detect when the system is under memory pressure. If prefetched pages are being evicted before the application reads them, the read-ahead window is too large for current memory conditions. This feedback loop automatically reduces prefetching when the system is constrained.
One of the most elegant aspects of Linux's read-ahead implementation is the async marker mechanism. This provides precise timing for prefetch operations without constant monitoring or polling.
The Problem:
How does the kernel know when to trigger the next batch of prefetching? If it waits too long, the application stalls. If it triggers too early, resources are wasted.
The Solution:
The kernel marks certain pages in the read-ahead window as "async read-ahead markers." When the application reads a marked page (through normal page fault handling), the kernel automatically triggers the next batch of prefetching. This provides perfect timing—the trigger happens exactly when needed.
```c
/* The async marker mechanism in Linux read-ahead */

/*
 * Page flag used for read-ahead marking:
 *   PG_readahead - this page triggers async readahead when accessed
 */

/* When setting up a read-ahead window */
void setup_readahead_window(struct file_ra_state *ra,
                            struct address_space *mapping)
{
    pgoff_t start = ra->start;
    unsigned int size = ra->size;
    unsigned int async_size = ra->async_size;

    /*
     * Place the async marker at the transition point.
     * All pages from the marker on are "async" - the kernel
     * will prefetch more when they're accessed.
     */
    pgoff_t marker_index = start + size - async_size;

    /* Read pages into cache (sync portion first) */
    for (pgoff_t i = start; i < marker_index; i++) {
        struct page *page = find_or_create_page(mapping, i);
        /* No special marking - these are sync pages */
        submit_page_read(page);
    }

    /* Mark the transition page */
    struct page *marker_page = find_or_create_page(mapping, marker_index);
    SetPageReadahead(marker_page);  /* Set PG_readahead flag */
    submit_page_read(marker_page);

    /* Remaining pages are async */
    for (pgoff_t i = marker_index + 1; i < start + size; i++) {
        struct page *page = find_or_create_page(mapping, i);
        submit_page_read(page);
    }
}

/* When a page is accessed and triggers readahead */
void handle_page_access(struct page *page, struct file_ra_state *ra)
{
    if (PageReadahead(page)) {
        /*
         * Application reached the marker!
         * Time to prefetch the next window.
         */
        ClearPageReadahead(page);  /* Clear flag (one-time trigger) */

        /* Calculate the next window */
        ra->start = ra->start + ra->size;
        ra->size = get_next_ra_size(ra, ra->ra_pages);
        ra->async_size = calculate_async_size(ra);

        /* Submit async I/O for the next window */
        async_readahead(ra);
    }
}

/*
 * Async marker visualization:
 *
 * Window state: [already read] [sync region] [M][async region]
 *                                             ^
 *                                             |
 *                              Async marker (PG_readahead flag)
 *
 * When the application reads page [M]:
 *   1. Normal read completes (page was already prefetched)
 *   2. Kernel notices the PG_readahead flag
 *   3. Kernel clears the flag (avoid double-trigger)
 *   4. Kernel calculates the next window position and size
 *   5. Kernel submits async I/O for the next window
 *   6. Application continues reading (unaware of #2-5)
 *
 * Result: the next window is prefetching while the app reads this one
 */
```

Calculating Async Size:
The async_size parameter determines where the marker is placed: the marker sits async_size pages before the end of the window. A larger async_size places the marker earlier in the window, starting the next prefetch sooner and leaving a wider safety margin against stalls; a smaller async_size triggers later, deferring I/O at the risk of a stall if the prefetch cannot complete in time.
Typical formula:
async_size = size / 2 (trigger at 50% through window)
Or for large windows:
async_size = size - min_sync_pages (ensure minimum sync ahead)
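A small sketch of how each policy places the marker; MIN_SYNC_PAGES is an illustrative constant, not a kernel tunable:

```c
/* Marker position for the two policies above. The marker sits at
 * offset (size - async_size) from the window start.
 */
#include <stdio.h>

#define MIN_SYNC_PAGES 16  /* assumed minimum sync headroom */

int main(void)
{
    unsigned int size = 256;  /* window size in pages */

    /* Policy 1: trigger at 50% through the window */
    unsigned int async_half = size / 2;

    /* Policy 2: keep a fixed minimum of sync pages before the marker */
    unsigned int async_min = (size > MIN_SYNC_PAGES) ? size - MIN_SYNC_PAGES : 0;

    printf("50%% policy:      async_size=%u, marker at page %u\n",
           async_half, size - async_half);  /* marker at page 128 */
    printf("min-sync policy: async_size=%u, marker at page %u\n",
           async_min, size - async_min);    /* marker at page 16 */
    return 0;
}
```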
The beauty of the async marker is that it requires no polling, no timers, and no explicit checks. The page fault handling path—which must execute anyway when a page is accessed—handles the marker check with negligible additional cost. This is a classic example of piggybacking functionality onto existing code paths.
The maximum read-ahead window size (ra_pages in Linux) is a critical tunable parameter. Setting it correctly requires understanding your workload and system characteristics.
| Larger Maximum | Smaller Maximum |
|---|---|
| Better for large streaming reads | Better for small files/random access |
| Higher memory usage per file | Lower memory usage per file |
| More efficient large I/O operations | Faster response for non-streaming |
| Risk of wasted prefetch on pattern change | Quicker adaptation to new patterns |
| Better for HDDs (amortize seek time) | Fine for SSDs (low seek cost) |
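The HDD row is easy to quantify. With an assumed 8 ms average seek and 150 MB/s sequential transfer rate (illustrative device numbers, not measurements), the window size determines how much of each I/O is wasted on seeking:

```c
/* Seek amortization for two window sizes, using assumed HDD numbers. */
#include <stdio.h>

int main(void)
{
    double seek_ms   = 8.0;    /* average seek + rotational delay (assumed) */
    double xfer_mb_s = 150.0;  /* sequential transfer rate (assumed) */
    double window_kb[] = { 128.0, 2048.0 };

    for (int i = 0; i < 2; i++) {
        double xfer_ms  = window_kb[i] / 1024.0 / xfer_mb_s * 1000.0;
        double overhead = seek_ms / (seek_ms + xfer_ms) * 100.0;
        printf("%6.0f KB window: transfer %5.2f ms, seek overhead %2.0f%%\n",
               window_kb[i], xfer_ms, overhead);
    }
    /* 128 KB: ~0.8 ms transfer -> ~91% of each I/O is seek time.
     * 2 MB:  ~13.3 ms transfer -> seek overhead drops to ~38%.
     */
    return 0;
}
```

On an SSD with negligible seek cost, this amortization argument largely disappears, which is why smaller maximums remain fine there.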
Default Values in Major Systems:
| OS/System | Default Max Window | Notes |
|---|---|---|
| Linux | 128-256KB | Conservative for general use |
| Linux (RAID/SAN) | 2-4MB | Larger for high-throughput storage |
| Windows | 256KB-1MB | Varies by file system and hints |
| ZFS | 128KB (tunable) | Can be increased for streaming |
| PostgreSQL | OS default | Relies on OS, can hint |
| MySQL InnoDB | 64 pages (1MB) | Own prefetch separate from OS |
The maximum can be inspected and tuned from user space:

```bash
#!/bin/bash
# Tuning read-ahead window size on Linux

# View current setting for a device
echo "Current read-ahead for /dev/sda:"
cat /sys/block/sda/queue/read_ahead_kb
# Typical output: 128 (KB)

# Set read-ahead size (requires root)
# WARNING: Test thoroughly before production use

# Conservative setting (good for mixed workloads)
echo 128 > /sys/block/sda/queue/read_ahead_kb

# Aggressive setting (for streaming/sequential workloads)
echo 2048 > /sys/block/sda/queue/read_ahead_kb   # 2MB

# Very aggressive (for large file processing, HDDs)
echo 8192 > /sys/block/sda/queue/read_ahead_kb   # 8MB

# Minimal (for random workloads, SSDs with very fast random I/O)
echo 64 > /sys/block/sda/queue/read_ahead_kb

# Apply to all block devices
for dev in /sys/block/sd*/queue/read_ahead_kb; do
    echo 256 > "$dev"
done

# Verification with hdparm
hdparm -a /dev/sda
# Shows: readahead = 512 (256.00 KB)  [note: sectors * 0.5 = KB]

# Per-file override using fadvise (in C code)
cat << 'EOF'
/* In application code, override for specific files */
int fd = open("large_file.dat", O_RDONLY);

/* Request aggressive read-ahead for this file */
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

/* Or request minimal read-ahead */
posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
EOF
```

Let's examine the complete window management algorithm used in Linux, bringing together all the concepts we've discussed.
```c
/* Complete read-ahead window management algorithm
 * Simplified from Linux mm/readahead.c
 */

/* Window management state */
struct ra_state {
    pgoff_t start;       /* Window start (page index) */
    pgoff_t size;        /* Window size (pages) */
    pgoff_t async_size;  /* Async trigger size */
    pgoff_t ra_pages;    /* Maximum window */
    loff_t prev_pos;     /* Previous read position */
};

/* Core decision function: called on every read() */
void ondemand_readahead(struct file *filp, struct ra_state *ra,
                        pgoff_t offset, unsigned long req_size)
{
    pgoff_t max = ra->ra_pages;

    /*
     * CASE 1: No existing window (first read or after reset)
     */
    if (ra->size == 0) {
        /* Start with an initial window */
        ra->start = offset;
        ra->size = get_init_ra_size(req_size, max);
        ra->async_size = ra->size > 4 ? ra->size - 4 : 0;
        goto do_readahead;
    }

    /*
     * CASE 2: Hit the async marker (within the async portion of the
     * window). Note the upper bound: offsets beyond the window fall
     * through to CASE 4 instead of being treated as marker hits.
     */
    if (offset >= ra->start + ra->size - ra->async_size &&
        offset < ra->start + ra->size) {
        /* We've reached the trigger point - time to queue more */

        /* Move window forward */
        pgoff_t new_start = ra->start + ra->size;

        /* Grow window (exponential growth) */
        pgoff_t new_size = min(ra->size * 2, max);

        /* Update state */
        ra->start = new_start;
        ra->size = new_size;
        ra->async_size = new_size;  /* All async for background load */
        goto do_readahead;
    }

    /*
     * CASE 3: Sequential but before the async marker.
     * Everything is fine - data should be in cache, no action needed.
     */
    if (offset >= ra->start &&
        offset < ra->start + ra->size - ra->async_size) {
        /* Within sync portion - page should be cached */
        return;
    }

    /*
     * CASE 4: Miss - offset is outside the current window
     */
    if (offset < ra->start || offset >= ra->start + ra->size) {
        /*
         * Could be:
         *   - Random access (far from window)
         *   - Large sequential skip (jumped ahead)
         *   - Backward seek
         */

        /* Check if this looks like a big skip (still sequential) */
        pgoff_t prev_page = ra->prev_pos >> PAGE_SHIFT;

        if (offset > prev_page && offset - prev_page < max * 2) {
            /* Likely still sequential, just skipped ahead */
            ra->start = offset;
            ra->size = get_init_ra_size(req_size, max);
            ra->async_size = ra->size > 4 ? ra->size - 4 : 0;
            goto do_readahead;
        }

        /* Otherwise, assume random access - minimal readahead */
        ra->start = offset;
        ra->size = min(req_size + 4, max);
        ra->async_size = 0;
        goto do_readahead;
    }

do_readahead:
    /* Submit I/O for the readahead window */
    for (pgoff_t i = 0; i < ra->size; i++) {
        pgoff_t page_idx = ra->start + i;
        struct page *page = find_or_create_page(filp->f_mapping, page_idx);

        /* Set the async marker at the appropriate page */
        if (i == ra->size - ra->async_size)
            SetPageReadahead(page);

        if (!PageUptodate(page)) {
            /* Page not in cache - submit read */
            submit_read_request(page);
        }
    }

    /* Update position tracking (convert pages to a byte offset) */
    ra->prev_pos = (loff_t)(offset + req_size) << PAGE_SHIFT;
}

/*
 * Algorithm summary:
 *
 * 1. On first access: create initial window, set async marker
 * 2. On read within sync region: do nothing (cache hit expected)
 * 3. On async marker hit: create new window, grow size, submit I/O
 * 4. On window miss: evaluate whether sequential skip or random
 * 5. Update prev_pos after every decision for pattern tracking
 */
```

The algorithm maintains several key invariants: (1) window size is always ≤ max and ≥ 0; (2) async_size ≤ size; (3) a new window always starts at or before the current read position; (4) prev_pos is updated on every read to track the pattern. Violating these invariants can cause missed prefetches or resource waste.
Let's quantify the impact of read-ahead window sizing with concrete measurements.
| Window Size | Throughput (MB/s) | CPU Wait % | Speedup |
|---|---|---|---|
| 0 (disabled) | 45 | 85% | 1.0x (baseline) |
| 32KB | 120 | 60% | 2.7x |
| 128KB | 380 | 25% | 8.4x |
| 512KB | 520 | 8% | 11.6x |
| 2MB | 550 | 3% | 12.2x |
| 8MB | 552 | 2% | 12.3x |
Observations:
Dramatic Improvement: Moving from no prefetch to proper window sizing can yield 10x+ throughput improvement for sequential reads.
Diminishing Returns: Most gain comes from initial increases. Going from 512KB to 8MB adds only 6% more throughput.
CPU Efficiency: With prefetching, CPU spends time processing rather than waiting. The 85% wait time without prefetching represents massive CPU waste.
Optimal Point: For this workload, 512KB-2MB provides near-maximum performance. Larger windows add complexity without proportional benefit.
A benchmark script for measuring this effect on your own system:

```python
#!/usr/bin/env python3
"""
Benchmark script to measure read-ahead window impact.
Run with: sudo python3 benchmark_window_size.py
"""

import os
import subprocess
import time


def set_readahead_kb(device: str, size_kb: int) -> None:
    """Set read-ahead size for a block device."""
    path = f"/sys/block/{device}/queue/read_ahead_kb"
    with open(path, 'w') as f:
        f.write(str(size_kb))


def drop_caches() -> None:
    """Drop all caches to ensure cold-start measurement."""
    subprocess.run(['sync'], check=True)
    with open('/proc/sys/vm/drop_caches', 'w') as f:
        f.write('3')
    time.sleep(1)  # Allow caches to clear


def measure_sequential_read(filepath: str, iterations: int = 3) -> dict:
    """Measure sequential read throughput."""
    file_size = os.path.getsize(filepath)
    block_size = 64 * 1024  # 64KB reads

    times = []
    for _ in range(iterations):
        drop_caches()

        start = time.perf_counter()
        with open(filepath, 'rb') as f:
            while f.read(block_size):
                pass
        elapsed = time.perf_counter() - start
        times.append(elapsed)

    avg_time = sum(times) / len(times)
    throughput_mb = (file_size / 1024 / 1024) / avg_time

    return {
        'file_size_mb': file_size / 1024 / 1024,
        'avg_time_s': avg_time,
        'throughput_mb_s': throughput_mb,
    }


def benchmark_readahead_sizes(device: str, test_file: str) -> None:
    """Benchmark various read-ahead window sizes."""
    sizes_kb = [0, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]

    print(f"{'Window (KB)':<12} {'Time (s)':<10} {'MB/s':<10} {'Speedup':<10}")
    print("-" * 42)

    baseline = None
    for size_kb in sizes_kb:
        set_readahead_kb(device, size_kb)
        result = measure_sequential_read(test_file)

        if baseline is None:
            baseline = result['throughput_mb_s']

        speedup = result['throughput_mb_s'] / max(baseline, 0.001)
        print(f"{size_kb:<12} {result['avg_time_s']:<10.2f} "
              f"{result['throughput_mb_s']:<10.1f} {speedup:<10.2f}x")


if __name__ == "__main__":
    # Configure for your system.
    # NOTE: /tmp may be tmpfs (RAM-backed) on some distributions;
    # place the test file on the device you are actually tuning.
    DEVICE = "sda"
    TEST_FILE = "/tmp/benchmark_1gb.dat"

    # Create test file if needed
    if not os.path.exists(TEST_FILE):
        print("Creating 1GB test file...")
        subprocess.run(['dd', 'if=/dev/urandom', f'of={TEST_FILE}',
                        'bs=1M', 'count=1024', 'status=progress'],
                       check=True)

    benchmark_readahead_sizes(DEVICE, TEST_FILE)
```

The read-ahead window is the mechanism through which prefetching decisions are implemented. Let's consolidate the key concepts:
- A window is defined by its start, size, async_size, and the ra_pages maximum.
- Windows grow exponentially on sequential access and reset to a small initial size when the pattern breaks.
- The async marker (PG_readahead) triggers the next prefetch at exactly the right moment, with no polling or timers.
- The maximum window size is a tunable trade-off: larger favors streaming throughput, smaller favors random access and low memory use.
What's Next:
The window algorithms we've explored use static rules for growth and shrinkage. But what if the optimal window size varies during execution? The next page explores Adaptive Read-Ahead—algorithms that dynamically adjust prefetching behavior based on real-time performance feedback, achieving even better efficiency.
You now understand the read-ahead window mechanism in depth—its structure, growth and shrink strategies, the elegant async marker system, and how to tune window size for your workloads. The key insight: the window is a self-adjusting structure that balances aggressive prefetching against resource conservation.