Every time you open a file, edit a document, or run an application, your operating system faces a fundamental performance challenge: main memory operates roughly 100,000 times faster than traditional hard drives, and still 100 to 1,000 times faster than SSDs. Without an intelligent intermediary layer to absorb this staggering speed difference, modern computing would grind to an unbearable crawl.
This intermediary layer is the buffer cache (also called the block cache or disk cache)—a region of main memory that temporarily stores copies of disk blocks. The buffer cache stands as one of the most critical performance optimizations in any operating system, transforming what would be painfully slow storage operations into fast memory operations.
Understanding the buffer cache isn't merely academic—it's essential knowledge for any engineer who needs to debug performance issues, optimize database systems, configure servers, or architect high-throughput applications. The buffer cache determines whether your system feels responsive or sluggish, whether your servers handle thousands or millions of requests per second, and whether your data survives power failures.
By the end of this page, you will understand: (1) Why the block cache exists and the problem it solves, (2) The architectural design of block caches in modern operating systems, (3) How blocks are organized, addressed, and managed in cache memory, (4) The lifecycle of a cached block from initial read to eventual eviction, and (5) Critical performance characteristics that determine real-world system behavior.
To appreciate why the buffer cache is so critical, we must first understand the enormous performance gap it bridges. Computer memory hierarchies span multiple orders of magnitude in access time, creating a fundamental asymmetry that pervades all system design.
Access Time Comparison (approximate):
Consider what happens when a CPU needs to access data:
| Storage Level | Typical Access Time | Relative Speed | Human-Scale Analogy |
|---|---|---|---|
| CPU L1 Cache | ~1 nanosecond | 1x (baseline) | 1 second |
| CPU L2 Cache | ~4 nanoseconds | 4x slower | 4 seconds |
| CPU L3 Cache | ~12 nanoseconds | 12x slower | 12 seconds |
| Main Memory (RAM) | ~100 nanoseconds | 100x slower | 1.5 minutes |
| NVMe SSD (best case) | ~10-25 microseconds | 10,000-25,000x slower | 3-7 hours |
| SATA SSD | ~50-100 microseconds | 50,000-100,000x slower | 14-28 hours |
| Hard Disk Drive (HDD) | ~5-10 milliseconds | 5-10 million x slower | 2-4 months |
The Human-Scale Perspective:
If we scale up the CPU L1 cache access to 1 second of human time, accessing a hard drive would take 2 to 4 months. Even the fastest NVMe SSDs would require hours of waiting. This isn't hyperbole—it's the reality that the buffer cache must address.
Why Such a Gap Exists:
The speed difference stems from fundamental physical constraints:
RAM is electronic — Data moves at near light-speed through transistor gates. Access involves no mechanical components, and modern DDR5 RAM can deliver 50+ GB/s bandwidth.
HDDs are mechanical — Reading data requires physically moving an actuator arm to the correct track (seek time: 2-15ms) and waiting for the correct sector to rotate under the read head (rotational latency: 2-8ms). Even at 7,200 RPM, a disk completes only 120 rotations per second.
SSDs are electronic but constrained — While SSDs eliminate mechanical delays, they still face NAND flash physics: read/program operations require voltage sensing across cells, controllers must manage wear leveling, and the interface (SATA, NVMe) adds protocol overhead.
Persistence has costs — Unlike volatile RAM, storage devices must retain data without power. This permanence requirement fundamentally limits how fast data can be written and committed.
If even 1% of memory accesses required disk I/O without caching, and each disk access took 5ms, a program making one million memory accesses would spend 50 seconds just waiting for disk—regardless of how fast the other 99% executed. The buffer cache transforms this dynamic by ensuring that repeated accesses hit fast memory instead of slow storage.
The block cache (or buffer cache) is a memory region managed by the operating system that stores recently accessed disk blocks. When a process reads from or writes to a file, the OS first checks the buffer cache. If the required block is present (a cache hit), the operation completes immediately from memory. If absent (a cache miss), the OS reads the block from disk into the cache before returning it to the process.
Key Terminology:
Block: The fundamental unit of disk I/O. File systems divide storage into fixed-size blocks (typically 4KB in modern systems, though sizes from 512 bytes to 64KB exist).
Buffer: A memory region holding one or more blocks. Historically, "buffer cache" specifically referred to the caching layer, while "page cache" referred to cached file contents. Modern systems often unify these concepts.
Cache Hit: A requested block is already present in memory. The operation completes without disk I/O.
Cache Miss: A requested block is not in memory. The OS must read it from disk.
Hit Ratio: The fraction of requests served from cache. A hit ratio of 0.95 means 95% of accesses avoid disk I/O.
The Basic Cache Structure:
```c
/* Simplified buffer cache entry structure */
struct buffer_head {
    /* Identification */
    unsigned long b_blocknr;          /* Block number on device */
    dev_t b_dev;                      /* Device identifier */

    /* State management */
    unsigned long b_state;            /* Bitfield: BH_Uptodate, BH_Dirty, BH_Lock, etc. */
    atomic_t b_count;                 /* Reference count - how many users */

    /* Memory management */
    char *b_data;                     /* Pointer to actual block data in memory */
    unsigned int b_size;              /* Block size in bytes (e.g., 4096) */

    /* Linkage */
    struct buffer_head *b_this_page;  /* Circular list of buffers for same page */
    struct hlist_node b_hash;         /* Hash chain for lookup */
    struct list_head b_lru;           /* LRU list for replacement */

    /* I/O completion */
    bio_end_io_t *b_end_io;           /* Callback function when I/O completes */
    void *b_private;                  /* Private data for I/O completion */
};

/* Key state flags */
#define BH_Uptodate 0  /* Contains valid data from disk */
#define BH_Dirty    1  /* Contains data not yet written to disk */
#define BH_Lock     2  /* Currently locked for I/O */
#define BH_Req      3  /* Has been submitted for I/O */
#define BH_Mapped   4  /* Has a valid mapping on disk */
#define BH_New      5  /* Newly allocated, not yet written */
```
Block Identification:
Every cached block must be uniquely identifiable. The cache uses a two-part key:
Device identifier (b_dev): Specifies which storage device (disk, partition, or logical volume) the block belongs to. This is typically a device major/minor number pair.
Block number (b_blocknr): The logical block number within that device. For a device with 4KB blocks, block 0 contains bytes 0-4095, block 1 contains bytes 4096-8191, and so on.
This (device, block number) pair uniquely identifies any block in the system and serves as the hash key for cache lookups.
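As a minimal illustration, the mapping from a byte offset on a device to this lookup key is simple integer arithmetic. The sketch below uses a hypothetical block_key type (standing in for the b_dev/b_blocknr fields above) rather than any real kernel structure:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical lookup key mirroring the (b_dev, b_blocknr) pair above */
struct block_key {
    uint32_t dev;        /* device identifier (e.g., encoded major/minor) */
    uint64_t blocknr;    /* logical block number within the device */
};

/* Map a byte offset on a device to its block number and cache key */
static struct block_key key_for_offset(uint32_t dev, uint64_t byte_offset,
                                       uint32_t block_size)
{
    struct block_key key = { .dev = dev, .blocknr = byte_offset / block_size };
    return key;
}

int main(void)
{
    /* With 4KB blocks, byte 10,000 lives in block 2 (bytes 8192-12287) */
    struct block_key k = key_for_offset(8, 10000, 4096);
    printf("dev=%u block=%llu\n", k.dev, (unsigned long long)k.blocknr);
    return 0;
}
```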
In Linux, the terms 'buffer cache' and 'page cache' have distinct historical meanings. The buffer cache operated on raw disk blocks, while the page cache operated on file-mapped memory pages. Modern Linux unifies these: the page cache serves as the primary caching layer, with buffer heads (struct buffer_head) serving as descriptors for sub-page block tracking when needed. The terms are often used interchangeably in practice.
A buffer cache may contain millions of cached blocks. When a process requests a specific block, the operating system must determine whether that block exists in cache—and find it quickly. Linear search through millions of entries would defeat the purpose of caching entirely.
The Hash Table Solution:
Buffer caches use hash tables to achieve O(1) average-case lookup. The hash function maps (device, block_number) pairs to a fixed set of hash buckets. Each bucket contains a linked list of buffer heads that hash to that bucket.
Hash Function Design:
A good hash function must spread (device, block number) pairs evenly across buckets, be cheap to compute, and avoid clustering nearby block numbers—which are often accessed together—into the same bucket. The sketch below illustrates one such scheme:
```c
/*
 * Simplified buffer cache hash implementation
 * Real implementations have more sophisticated hashing
 */

#define BUFFER_HASH_BITS 15
#define BUFFER_HASH_SIZE (1 << BUFFER_HASH_BITS)   /* 32768 buckets */
#define BUFFER_HASH_MASK (BUFFER_HASH_SIZE - 1)

/* The hash table: array of list heads */
static struct hlist_head buffer_hash_table[BUFFER_HASH_SIZE];
static spinlock_t buffer_hash_lock[BUFFER_HASH_SIZE];

/*
 * Hash function combining device and block number
 * Uses multiplicative hashing for good distribution
 */
static inline unsigned int buffer_hash(dev_t dev, unsigned long block)
{
    /*
     * Golden ratio prime for multiplicative hashing:
     * 0x9e370001 is a prime near 2^32 * (golden_ratio - 1)
     */
    unsigned long hash = (unsigned long)dev + block;
    hash = hash * 0x9e370001UL;
    return (hash >> (32 - BUFFER_HASH_BITS)) & BUFFER_HASH_MASK;
}

/*
 * Look up a buffer in the cache
 * Returns NULL if not found, buffer head pointer if found
 */
struct buffer_head *find_buffer(dev_t dev, unsigned long block, unsigned int size)
{
    unsigned int hash = buffer_hash(dev, block);
    struct hlist_head *bucket = &buffer_hash_table[hash];
    struct buffer_head *bh;

    spin_lock(&buffer_hash_lock[hash]);
    hlist_for_each_entry(bh, bucket, b_hash) {
        if (bh->b_dev == dev && bh->b_blocknr == block && bh->b_size == size) {
            /* Found it - increment reference count */
            atomic_inc(&bh->b_count);
            spin_unlock(&buffer_hash_lock[hash]);
            return bh;
        }
    }
    spin_unlock(&buffer_hash_lock[hash]);
    return NULL;   /* Not in cache */
}

/*
 * Insert a buffer into the hash table
 */
void insert_buffer_hash(struct buffer_head *bh)
{
    unsigned int hash = buffer_hash(bh->b_dev, bh->b_blocknr);

    spin_lock(&buffer_hash_lock[hash]);
    hlist_add_head(&bh->b_hash, &buffer_hash_table[hash]);
    spin_unlock(&buffer_hash_lock[hash]);
}
```
Performance Characteristics:
With a well-distributed hash function, cached blocks spread evenly across the buckets, so a lookup traverses a short chain rather than scanning the whole cache; real kernels size the table in proportion to available memory to keep chains short.
Per-Bucket Locking:
Notice the per-bucket spinlocks rather than a single global lock. This fine-grained locking dramatically improves scalability on multi-core systems: CPUs searching or updating different buckets proceed in parallel, and contention arises only when two operations happen to hash to the same bucket.
While hash tables work well, modern Linux increasingly uses radix trees (tries) for the page cache index. Radix trees offer cache-friendly traversal, efficient range operations (useful for read-ahead), and don't require resizing. The XArray data structure in Linux is a notable example, providing O(log n) operations with excellent cache behavior.
Every cached block progresses through a well-defined lifecycle from initial creation to eventual eviction. Understanding this lifecycle is crucial for debugging cache behavior and optimizing performance.
Lifecycle States and Transitions:
1. Allocation (Cache Miss): When a block is requested but not in cache, the OS allocates a buffer head, records its (device, block number) identity, and inserts it into the hash table.
2. Reading (I/O In Progress): The OS submits a read request to the device driver; the buffer is locked until the data arrives from disk.
3. Uptodate (Clean, Valid): The buffer contains valid data matching disk contents and can satisfy reads directly from memory.
4. In-Use (Referenced): A process actively holds a reference; the nonzero reference count keeps the buffer from being evicted.
5. Dirty (Modified): The buffer contains data that differs from disk; it must be written back before its memory can be reclaimed.
6. Writing (Write I/O In Progress): Write-back is occurring; the buffer is locked during the I/O and marked clean once the write completes.
7. Eviction: When memory is needed, clean, unreferenced buffers are removed from the hash table and LRU lists, and their memory is returned to the system.
```c
/* Example: Reading a block with proper lifecycle handling */
struct buffer_head *bread(dev_t dev, unsigned long block, unsigned int size)
{
    struct buffer_head *bh;

    /* Step 1: Check if already in cache */
    bh = find_buffer(dev, block, size);
    if (bh) {
        /* Cache hit! But need to ensure it's uptodate */
        if (buffer_uptodate(bh))
            return bh;   /* Fast path: return immediately */
        /* Block is in cache but not valid - fall through to read */
    } else {
        /* Cache miss: allocate new buffer */
        bh = alloc_buffer_head();
        if (!bh)
            return NULL;   /* Out of memory */
        bh->b_dev = dev;
        bh->b_blocknr = block;
        bh->b_size = size;
        bh->b_data = alloc_page_data(size);
        insert_buffer_hash(bh);
    }

    /* Step 2: Lock buffer for I/O */
    lock_buffer(bh);

    /* Double-check after getting lock (another thread may have read it) */
    if (buffer_uptodate(bh)) {
        unlock_buffer(bh);
        return bh;
    }

    /* Step 3: Submit read I/O */
    bh->b_end_io = buffer_read_end_io;
    submit_bh(READ, bh);

    /* Step 4: Wait for I/O completion */
    wait_on_buffer(bh);

    /* Step 5: Check for errors */
    if (!buffer_uptodate(bh)) {
        release_buffer(bh);
        return NULL;   /* I/O error */
    }

    return bh;   /* Success: buffer contains valid data */
}

/* Release a buffer when done using it */
void brelse(struct buffer_head *bh)
{
    if (!bh)
        return;

    /* Decrement reference count */
    if (atomic_dec_and_test(&bh->b_count)) {
        /* Reference count hit zero - buffer can be reclaimed */
        /* Don't free immediately; add to LRU for potential reuse */
        add_to_lru(bh);
    }
}
```
The buffer cache doesn't exist in isolation—it must integrate with the broader memory management subsystem. This integration determines how much memory is available for caching, how cache memory is allocated, and how the cache responds to memory pressure.
Memory Allocation for Buffers:
Modern operating systems typically use the same page frames for both process memory and cache memory. This creates a flexible, adaptive system:
Unified Memory Pool: Rather than dedicating a fixed amount of RAM to caching, the OS dynamically balances between process needs and cache needs.
Demand-Driven Allocation: Cache grows when there's spare memory and shrinks when applications need more.
Page Frame Recycling: When a page is evicted from process memory, it might become cache memory, and vice versa.
| Memory Source | Characteristics | Use Case |
|---|---|---|
| Buddy Allocator Pages | Contiguous, efficient for large allocations | Multi-page buffers, large I/O |
| Slab Allocator | Pre-allocated, fast, fixed-size objects | Buffer head structures |
| High Memory (32-bit) | Not directly addressable, requires mapping | Extra cache on 32-bit systems |
| NUMA Local Memory | Fastest access for specific CPUs | CPU-local caching strategies |
Integration with Page Cache:
In modern Unix-like systems (Linux, FreeBSD, etc.), the buffer cache operates within or alongside the page cache:
Page Cache: Caches file data at page granularity (typically 4KB). When you read a file, the data is cached as pages.
Buffer Cache: Provides block-level addressing of the underlying device. A single page may contain multiple buffers if the file system block size is smaller than the page size.
Example: 1KB blocks on 4KB pages
If a file system uses 1KB blocks but the system page size is 4KB, each page holds four blocks, each described by its own buffer head:
```c
/*
 * A single 4KB page can contain multiple buffer_heads
 * when filesystem block size < page size
 *
 * Page layout for 1KB blocks:
 * +------------------+------------------+------------------+------------------+
 * |  Buffer 0 (1KB)  |  Buffer 1 (1KB)  |  Buffer 2 (1KB)  |  Buffer 3 (1KB)  |
 * |     Block N      |    Block N+1     |    Block N+2     |    Block N+3     |
 * +------------------+------------------+------------------+------------------+
 *          ^                  ^                  ^                  ^
 *          |                  |                  |                  |
 *    buffer_head 0      buffer_head 1      buffer_head 2      buffer_head 3
 *          |                  |                  |                  |
 *          +------ b_this_page (circular linked list) --------------+
 */

struct page *find_or_create_page_with_buffers(struct address_space *mapping,
                                              unsigned long index,
                                              unsigned int blocksize)
{
    struct page *page;
    struct buffer_head *bh, *head, *prev;
    unsigned int buffers_per_page = PAGE_SIZE / blocksize;
    int i;

    page = find_or_create_page(mapping, index, GFP_KERNEL);
    if (!page)
        return NULL;

    /* Check if page already has buffers */
    if (page_has_buffers(page))
        return page;

    /* Allocate one buffer head per block and chain them in order */
    head = NULL;
    prev = NULL;
    for (i = 0; i < buffers_per_page; i++) {
        bh = alloc_buffer_head();
        bh->b_size = blocksize;
        bh->b_data = page_address(page) + (i * blocksize);

        if (prev)
            prev->b_this_page = bh;   /* link previous buffer to this one */
        else
            head = bh;                /* remember the first buffer */
        prev = bh;
    }

    /* Close the circular list: last buffer points back to the first */
    prev->b_this_page = head;

    /* ... and attach the head of the list to page->private */
    return page;
}
```
When system memory runs low, the kernel's memory reclaim mechanism (e.g., Linux's kswapd) scans for pages to free. Clean cache pages are prime targets: they're immediately freeable without I/O. Dirty pages require write-back first, so the kernel prioritizes writing dirty data during idle periods to maintain a pool of easily reclaimable clean pages.
Evaluating buffer cache effectiveness requires understanding key performance metrics. These metrics help diagnose performance problems and guide tuning decisions.
Hit Ratio: The Primary Metric
The cache hit ratio measures the fraction of I/O requests satisfied from cache without disk access:
Hit Ratio = Cache Hits / (Cache Hits + Cache Misses)
A hit ratio of 0.95 means 95% of block requests are served from memory. The remaining 5% require physical I/O.
Effective Access Time:
The hit ratio directly determines effective access time:
T_effective = (Hit Ratio × T_memory) + ((1 - Hit Ratio) × T_disk)
Example calculation:
| Hit Ratio | Memory Access (100ns) | Disk Access (5ms) | Effective Time | Speedup vs No Cache |
|---|---|---|---|---|
| 0% (no cache) | 0ns | 5,000,000ns | 5,000,000ns | 1x |
| 50% | 50ns | 2,500,000ns | 2,500,050ns | 2x |
| 90% | 90ns | 500,000ns | 500,090ns | 10x |
| 95% | 95ns | 250,000ns | 250,095ns | 20x |
| 99% | 99ns | 50,000ns | 50,099ns | 100x |
| 99.9% | 99.9ns | 5,000ns | 5,100ns | ~1000x |
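The table values follow directly from the formula above. Here is a minimal sketch (assuming the same 100 ns memory and 5 ms disk latencies) that reproduces them:

```c
#include <stdio.h>

/* Assumed latencies: 100 ns for a cache hit, 5 ms (5,000,000 ns) for a miss */
#define T_MEMORY_NS 100.0
#define T_DISK_NS   5000000.0

int main(void)
{
    const double hit_ratios[] = { 0.0, 0.50, 0.90, 0.95, 0.99, 0.999 };

    for (int i = 0; i < 6; i++) {
        double h = hit_ratios[i];
        /* T_effective = h * T_memory + (1 - h) * T_disk */
        double t_eff = h * T_MEMORY_NS + (1.0 - h) * T_DISK_NS;
        double speedup = T_DISK_NS / t_eff;   /* versus the no-cache case */
        printf("hit ratio %5.1f%%: effective %10.1f ns, speedup %5.0fx\n",
               h * 100.0, t_eff, speedup);
    }
    return 0;
}
```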
The Non-Linear Impact of Hit Ratio:
Notice how dramatically performance improves as the hit ratio climbs from 90% to 99.9%. Even a small percentage of cache misses translates into many slow disk operations: raising the hit ratio from 90% to 95% halves the effective access time, and going from 95% to 99% improves it roughly fivefold. The last fraction of a percent matters enormously.
Monitoring Cache Performance:
Operating systems provide tools to observe cache behavior:
```
# Linux: View memory usage including cache
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi       8.2Gi        12Gi       1.2Gi        10Gi        21Gi
Swap:          8.0Gi          0B       8.0Gi

# Note: "buff/cache" shows combined buffer and page cache usage
# "available" shows memory available for applications (includes reclaimable cache)

# View detailed buffer/cache breakdown
$ cat /proc/meminfo | grep -E "(Buffers|Cached|Dirty|Writeback)"
Buffers:          456789 kB
Cached:         10234567 kB
SwapCached:            0 kB
Dirty:              1234 kB
Writeback:             0 kB

# Real-time I/O statistics showing cache effectiveness
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 12345M   456M 10234M    0    0    12   345  987 1234 10  5 84  1  0
#                                              ^     ^
# bi: blocks read from disk per second
# bo: blocks written to disk per second
# Low bi/bo with active I/O workload = good cache hit ratio

# Detailed block I/O statistics
$ iostat -x 1
Device        r/s      w/s     rkB/s     wkB/s   %util
sda         10.00    25.00    160.00    400.00    5.00
# Low utilization despite I/O-heavy workload = cache is effective
```
Hit ratios vary enormously by workload. A database server with a working set smaller than memory might achieve 99%+ hit ratios. A video streaming server accessing large, sequentially-read files once might see near-0% hit ratios (but benefits from read-ahead). Understanding your workload's access patterns is essential for cache tuning.
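For programmatic monitoring, the same figures the tools above report can be read from /proc/meminfo. A minimal userspace sketch (Linux-specific, with error handling kept brief):

```c
#include <stdio.h>
#include <string.h>

/* Print the buffer/page cache related lines from /proc/meminfo (Linux) */
int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) {
        perror("fopen /proc/meminfo");
        return 1;
    }

    char line[256];
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "Buffers:", 8) == 0 ||
            strncmp(line, "Cached:", 7) == 0 ||
            strncmp(line, "Dirty:", 6) == 0 ||
            strncmp(line, "Writeback:", 10) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}
```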
Caching only works because real programs exhibit locality of reference—the tendency to access the same data repeatedly, or to access data near previously accessed data. Without locality, every access would be random, and caching would provide no benefit.
Types of Locality:
1. Temporal Locality
Data accessed recently is likely to be accessed again soon. Examples: a database index page consulted by every query, a program's hot code paths, and file system metadata (superblocks, directory blocks) touched by many operations.
2. Spatial Locality
Data near recently accessed data is likely to be needed. Examples: reading consecutive records within a file, or accessing several fields of a structure stored in the same block.
3. Sequential Locality (special case of spatial)
Accessing data in sequential order. File systems commonly see this in log file appends, media streaming, backups, and whole-file copies—patterns that read-ahead is designed to exploit.
Working Set Theory:
Peter Denning's working set model formalizes these intuitions. At any moment, a process's working set is the set of pages (or blocks) it has referenced within a recent time window. The working set typically changes slowly (programs run in phases), is much smaller than the total data the process touches over its lifetime, and must fit in memory for the process to run efficiently.
If the buffer cache can hold the working sets of all active processes and the file system metadata, hit ratios will be excellent. If the total working set exceeds cache capacity, thrashing occurs—excessive eviction and re-fetching that devastates performance.
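To make the model concrete, the sketch below estimates working-set size for a block reference trace by counting the distinct blocks among the last W references (the trace and window size are invented for illustration):

```c
#include <stdbool.h>
#include <stdio.h>

/* Count distinct blocks among the last `window` references ending at index `end` */
static int working_set_size(const int *trace, int end, int window)
{
    int start = (end - window + 1 > 0) ? end - window + 1 : 0;
    int distinct = 0;
    for (int i = start; i <= end; i++) {
        bool seen_earlier = false;
        for (int j = start; j < i; j++)
            if (trace[j] == trace[i])
                seen_earlier = true;
        if (!seen_earlier)
            distinct++;
    }
    return distinct;
}

int main(void)
{
    /* Hypothetical block reference trace, window of 5 references */
    int trace[] = { 1, 2, 1, 3, 1, 2, 4, 4, 2, 1 };
    int n = sizeof(trace) / sizeof(trace[0]);

    for (int end = 0; end < n; end++)
        printf("after reference %2d: working set = %d blocks\n",
               end + 1, working_set_size(trace, end, 5));
    return 0;
}
```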
Quantifying Locality:
The stack distance (or reuse distance) metric measures locality: for each memory reference, how many distinct blocks were accessed since the last access to this block? Low stack distances indicate high temporal locality.
Access Sequence: A B C A B D A C B A
Stack Distance: ∞ ∞ ∞ 2 2 ∞ 2 3 3 2
Interpretation:
- Block A reused after 2 intervening blocks (distance 2)
- Block D accessed for first time (distance ∞)
- Short distances = likely cache hits with sufficient cache size
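The distances above can be computed mechanically. A minimal sketch (returning -1 for a first access, in place of ∞):

```c
#include <stdbool.h>
#include <stdio.h>

/* Reuse distance: distinct blocks accessed since the previous access to
 * trace[pos], or -1 if this is the first access (the "infinite" case above). */
static int reuse_distance(const char *trace, int pos)
{
    int prev = -1;
    for (int i = pos - 1; i >= 0; i--) {
        if (trace[i] == trace[pos]) {
            prev = i;
            break;
        }
    }
    if (prev < 0)
        return -1;

    /* Count distinct blocks strictly between the two accesses */
    int distinct = 0;
    for (int i = prev + 1; i < pos; i++) {
        bool counted = false;
        for (int j = prev + 1; j < i; j++)
            if (trace[j] == trace[i])
                counted = true;
        if (!counted)
            distinct++;
    }
    return distinct;
}

int main(void)
{
    const char trace[] = "ABCABDACBA";   /* the access sequence above */
    for (int i = 0; trace[i]; i++) {
        int d = reuse_distance(trace, i);
        if (d < 0)
            printf("%c: inf  ", trace[i]);
        else
            printf("%c: %d  ", trace[i], d);
    }
    printf("\n");
    return 0;
}
```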
Low-locality workloads can 'pollute' the cache—evicting useful data to store data that won't be accessed again. This is why some systems implement scan detection to avoid caching data from sequential scans, and why applications like databases often implement their own buffer pools with application-specific knowledge about access patterns.
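One way applications cooperate with the cache is through advisory hints. The sketch below uses POSIX posix_fadvise (the hints are advisory, and behavior varies by kernel) to declare a sequential scan and then release the scanned pages, limiting cache pollution:

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Hint: the file will be read sequentially, so aggressive read-ahead helps */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[64 * 1024];
    while (read(fd, buf, sizeof(buf)) > 0)
        ;   /* ... process the data ... */

    /* Hint: this data won't be reused; let the kernel reclaim its cache pages */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    close(fd);
    return 0;
}
```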
We've established the foundational understanding of block caching—the essential mechanism that makes file system performance acceptable despite the enormous speed gap between memory and storage. The key concepts: blocks are identified by a (device, block number) pair, located through fast hash (or radix tree) lookup, tracked through a well-defined lifecycle from allocation to eviction, and balanced against application memory within a unified memory pool, with the hit ratio ultimately determining effective I/O performance.
What's Next:
With the block cache foundation established, we'll next explore cache management—the policies and algorithms that determine which blocks to cache, how to prioritize them, and which to evict when memory is needed. These replacement policies (LRU, LFU, ARC, etc.) are critical because poor eviction decisions can devastate performance even with large caches.
You now understand the architectural foundation of block caching: why it exists, how blocks are organized and addressed, and the lifecycle of cached data. This foundation is essential for understanding the management policies, write strategies, and consistency mechanisms covered in subsequent pages.