Between the abstract world of files and extents and the raw command interface of storage devices lies the basic file system—a seemingly simple layer with profound performance implications. This layer handles block-level I/O: reading and writing generic blocks without understanding what data they contain or which files own them.
Yet calling this layer 'basic' understates its importance. This is where the buffer cache lives—one of the most critical performance mechanisms in any operating system. This is where I/O requests are scheduled, coalesced, and optimized. The efficiency of this layer can make a file system feel instantaneous or painfully slow.
When Netflix streams video to millions of users or Facebook serves billions of image requests, the basic file system is working furiously to cache hot blocks, minimize disk seeks, and keep data flowing. Understanding this layer is essential for anyone tuning system performance or designing for high I/O workloads.
By the end of this page, you will understand the basic file system's role in block-level I/O, the architecture and algorithms of the buffer cache, write policies (write-through vs write-back), and how modern systems unify file system and virtual memory caching through the page cache.
The basic file system sits at layer 3 of our five-layer stack. It receives physical block requests from the file organization module above and issues device commands to the I/O control layer below. Its primary job is to manage the actual reading and writing of blocks.
Critically, this layer has no knowledge of files. It doesn't understand inodes, directories, or file names. It only knows device identifiers, block numbers, and the raw data stored in each block.
This separation is powerful—the same basic file system can serve any allocation strategy implemented above it.
// Requests FROM file organization module:
buffer_cache_read(device_id, block_number, buffer);
buffer_cache_write(device_id, block_number, buffer);
buffer_cache_sync(device_id); // Force all dirty blocks to disk
// Requests TO I/O control layer:
device_read_blocks(device_id, start_block, count, buffer);
device_write_blocks(device_id, start_block, count, buffer);
The basic file system insulates upper layers from direct device communication and provides the essential abstraction of a reliable block store.
The buffer cache (also called block cache) is an in-memory cache of disk blocks. Its purpose is simple: avoid disk I/O whenever possible by keeping frequently accessed blocks in RAM.
Disk I/O is extraordinarily slow compared to memory access:
| Operation | Latency |
|-----------|---------|
| L1 cache access | ~1 nanosecond |
| RAM access | ~100 nanoseconds |
| SSD random read | ~100,000 nanoseconds (100 μs) |
| HDD random read | ~10,000,000 nanoseconds (10 ms) |
An HDD random read is 10 million times slower than an L1 cache hit. Even SSDs are 1,000 times slower than RAM. The buffer cache's job is to exploit temporal locality (recently accessed blocks are likely to be accessed again soon) to serve as many requests as possible from memory.
A typical buffer cache consists of:
1. Buffer Pool: A collection of fixed-size memory regions (buffers), each capable of holding one disk block.
2. Hash Table: Maps (device_id, block_number) pairs to buffers for O(1) lookup.
3. Free List: Doubly-linked list of buffers not currently in use, ordered by least-recently-used (LRU).
4. Buffer Headers: Metadata for each buffer, including the (device, block) identity, valid and dirty flags, an "in I/O" flag, a reference count, and the hash-chain and free-list pointers.
┌────────────────────────────────────────────────────────────┐
│ BUFFER CACHE │
├────────────────────────────────────────────────────────────┤
│ │
│ Hash Table │
│ ┌─────┐ │
│ │ 0 │──→ [buf:d1,b5] ──→ [buf:d2,b100] ──→ NULL │
│ │ 1 │──→ NULL │
│ │ 2 │──→ [buf:d1,b42] ──→ NULL │
│ │ ... │ │
│ └─────┘ │
│ │
│ LRU Free List │
│ HEAD ←→ [buf:d1,b99] ←→ [buf:d3,b1] ←→ ... ←→ TAIL │
│ (oldest) (newest) │
│ │
│ Buffer Pool (memory area) │
│ ┌────────┬────────┬────────┬────────┬────────┐ │
│ │4KB buf │4KB buf │4KB buf │ ... │4KB buf │ │
│ └────────┴────────┴────────┴────────┴────────┘ │
│ │
└────────────────────────────────────────────────────────────┘
The hash table provides O(1) lookup for cache hits. The free list provides fast access to reusable buffers for cache misses. Together, they support the complete cache operation: quickly find cached blocks AND quickly find memory for new blocks. A buffer moves between these structures as its state changes.
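A minimal C sketch of these structures, assuming 4 KB blocks and a fixed number of hash buckets (all names and sizes are illustrative, not taken from any particular kernel):

```c
#include <pthread.h>
#include <stdint.h>

#define BLOCK_SIZE   4096
#define HASH_BUCKETS 1024              /* assumed power of two */

struct buffer_header {
    uint32_t device_id;                /* which device this block belongs to  */
    uint64_t block_number;             /* which block on that device          */
    int      valid;                    /* contents reflect the on-disk block  */
    int      dirty;                    /* modified in memory, not yet written */
    int      in_io;                    /* a read or write is in flight        */
    int      reference_count;          /* callers currently using this buffer */
    struct buffer_header *hash_next;   /* chain within one hash bucket        */
    struct buffer_header *lru_prev;    /* doubly linked LRU free list         */
    struct buffer_header *lru_next;
    unsigned char data[BLOCK_SIZE];    /* the cached block itself             */
};

struct buffer_cache {
    pthread_mutex_t       lock;                 /* protects all lists below   */
    struct buffer_header *hash[HASH_BUCKETS];   /* (device, block) -> buffer  */
    struct buffer_header *lru_head;             /* least recently used end    */
    struct buffer_header *lru_tail;             /* most recently used end     */
};
```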
When the basic file system receives a read request for a block:
buffer_cache_read(device, block):
1. LOCK buffer cache
2. hash_entry = hash_lookup(device, block)
3. IF hash_entry exists: # CACHE HIT
a. IF buffer is currently in I/O:
- wait for I/O to complete
b. buffer.reference_count++
c. move buffer to MRU end of free list
d. UNLOCK buffer cache
e. copy buffer data to caller's buffer
f. buffer.reference_count-- # caller releases the buffer after the copy
g. RETURN
4. ELSE: # CACHE MISS
a. victim = remove least-recently-used buffer from free list
b. IF victim is dirty:
- write victim to disk (synchronously or async)
c. remove victim from its hash chain
d. reinitialize victim for new (device, block)
e. add victim to hash chain for (device, block)
f. mark buffer as "in I/O"
g. UNLOCK buffer cache
h. issue device read (device, block, buffer)
i. wait for I/O completion
j. mark buffer valid, clear "in I/O"
k. RETURN buffer data
The algorithm ensures correctness with concurrent access while minimizing lock hold time. The lock is released before blocking I/O operations.
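As an illustration, the hash_lookup in step 2 could be a simple modular hash over the (device, block) pair followed by a chain walk. A sketch reusing the structures above (the hash function itself is an arbitrary choice):

```c
/* Illustrative hash: combine device and block, then reduce to a bucket index. */
static unsigned int bucket_for(uint32_t device_id, uint64_t block_number)
{
    uint64_t key = ((uint64_t)device_id << 32) ^ block_number;
    return (unsigned int)(key % HASH_BUCKETS);
}

/* Walk the chain for that bucket; the caller must hold cache->lock. */
static struct buffer_header *hash_lookup(struct buffer_cache *cache,
                                         uint32_t device_id,
                                         uint64_t block_number)
{
    struct buffer_header *b = cache->hash[bucket_for(device_id, block_number)];
    while (b != NULL) {
        if (b->device_id == device_id && b->block_number == block_number)
            return b;                  /* cache hit */
        b = b->hash_next;
    }
    return NULL;                       /* cache miss */
}
```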
buffer_cache_write(device, block, data):
1. buffer = buffer_cache_read(device, block) # Get buffer (may be cache miss)
2. LOCK buffer
3. copy data into buffer
4. mark buffer as DIRTY
5. UNLOCK buffer
6. IF write-through policy:
- issue device write immediately
- wait for completion
- clear DIRTY flag
7. ELSE (write-back policy):
- RETURN immediately (data only in buffer)
- dirty buffer will be written later
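Rendered as C against the earlier struct sketch, the policy branch might look like this (device_write_blocks is the call into the I/O control layer shown earlier; locking is omitted for brevity):

```c
#include <stdint.h>
#include <string.h>

enum write_policy { WRITE_THROUGH, WRITE_BACK };

/* Provided by the I/O control layer (see the interface above). */
void device_write_blocks(uint32_t device_id, uint64_t start_block,
                         uint32_t count, const void *buffer);

/* Copy new contents into a cached buffer and apply the chosen write policy. */
static void cache_write_block(struct buffer_header *buf, const void *data,
                              size_t len, enum write_policy policy)
{
    memcpy(buf->data, data, len);        /* update the in-memory copy          */
    buf->dirty = 1;                      /* it now differs from the disk copy  */

    if (policy == WRITE_THROUGH) {
        device_write_blocks(buf->device_id, buf->block_number, 1, buf->data);
        buf->dirty = 0;                  /* on disk; buffer is clean again     */
    }
    /* WRITE_BACK: return immediately; background writeback flushes it later. */
}
```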
The choice between write-through and write-back policies has significant performance and reliability implications.
| Aspect | Write-Through | Write-Back |
|---|---|---|
| Definition | Write to disk immediately on every write | Write to buffer; sync to disk later |
| Write latency | High (includes disk I/O) | Low (memory speed only) |
| Reliability | High (data on disk immediately) | Risk of data loss on crash |
| Disk I/O volume | High (every write goes to disk) | Lower (coalesce multiple writes) |
| Use case | Databases with durability needs | General file system performance |
Write-back caching means data can be 'written' (from the application's perspective) but not yet on disk. A power failure or system crash can lose data that the application believes is safely stored. This is why databases use fsync() and O_SYNC to force durability, and why sudden power loss can corrupt file systems.
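For files that need write-through behavior, O_SYNC forces every write() to wait for the device. A minimal sketch (the path and record parameters are illustrative):

```c
#include <fcntl.h>
#include <unistd.h>

/* Append one record durably: O_SYNC makes each write() wait for the device. */
int append_durable(const char *path, const void *record, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, record, len);  /* returns only after data is durable */
    close(fd);
    return (n == (ssize_t)len) ? 0 : -1;
}
```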
When the buffer cache is full and a new block must be loaded, a victim buffer must be selected for eviction. The replacement policy dramatically affects cache hit rate and overall system performance.
The classic approach: evict the buffer that was accessed longest ago. Based on the assumption that recently accessed blocks will be accessed again (temporal locality).
Implementation: Each access moves the buffer to the MRU (most recently used) end of the free list. Eviction takes from the LRU end.
Access order: A, B, C, A, D, B, E
After access A: [A]
After access B: [A, B]
After access C: [A, B, C]
After access A: [B, C, A] ← A moves to end
After access D: [B, C, A, D]
Cache full (capacity 4), need to load E:
Evict B (LRU): [C, A, D, E]
LRU advantages: Excellent for workloads with temporal locality. Simple to implement.
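Indeed, the core operations reduce to a few pointer updates on the doubly linked free list. A sketch against the structures introduced earlier (the caller is assumed to hold the cache lock):

```c
/* Detach a buffer from wherever it currently sits in the LRU list. */
static void lru_unlink(struct buffer_cache *c, struct buffer_header *b)
{
    if (b->lru_prev)
        b->lru_prev->lru_next = b->lru_next;
    else
        c->lru_head = b->lru_next;

    if (b->lru_next)
        b->lru_next->lru_prev = b->lru_prev;
    else
        c->lru_tail = b->lru_prev;

    b->lru_prev = b->lru_next = NULL;
}

/* On every access: move the buffer to the MRU (tail) end. */
static void lru_touch(struct buffer_cache *c, struct buffer_header *b)
{
    lru_unlink(c, b);
    b->lru_prev = c->lru_tail;
    if (c->lru_tail)
        c->lru_tail->lru_next = b;
    else
        c->lru_head = b;
    c->lru_tail = b;
}

/* On a miss with a full cache: take the victim from the LRU (head) end. */
static struct buffer_header *lru_evict(struct buffer_cache *c)
{
    struct buffer_header *victim = c->lru_head;
    if (victim)
        lru_unlink(c, victim);
    return victim;   /* caller writes it back if dirty, then reuses it */
}
```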
LRU weaknesses: a single sequential scan through a large file floods the cache and evicts the entire working set (cache pollution), and pure LRU tracks only recency—a block read hundreds of times ranks no higher than one touched once a moment ago. Several refinements address these gaps:
LRU-K: Track the K-th most recent access, not just the most recent. Better captures access frequency patterns.
2Q (Two Queue): Maintain separate queues for new and established blocks: a block enters a small probationary queue on first access and is promoted to the main LRU queue only if it is accessed again, so one-pass scans never displace established blocks.
ARC (Adaptive Replacement Cache): Self-tuning algorithm that balances recency and frequency.
CLOCK (Second-Chance): Approximate LRU cheaply by arranging buffers in a circle, each with a reference bit that is set on access. A rotating "hand" sweeps the circle: a buffer with its bit set gets a second chance (the bit is cleared and the hand moves on); the first buffer found with its bit clear is evicted.
Clock Algorithm:
┌───┐
┌→─→│ 1 │──→──┐ 1 = referenced recently
│ └───┘ │ 0 = not referenced
│ ▼
┌───┐ ┌───┐
│ 0 │← HAND │ 1 │
└───┘ ▲ └───┘
▲ │ │
│ ┌───┐ │
└──←│ 1 │←←──┘
└───┘
Hand at buffer with 0: EVICT
Hand at buffer with 1: Clear to 0, advance
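A compact sketch of the clock sweep, assuming the buffers live in a circular array with a per-buffer reference bit (names and sizes are illustrative):

```c
#define NBUF 256

struct clock_buf {
    int referenced;            /* set to 1 on every access to this buffer */
    /* ... block identity and data would live here ... */
};

static struct clock_buf bufs[NBUF];
static int hand = 0;           /* index of the clock hand */

/* Sweep until a buffer with referenced == 0 is found; clear bits on the way. */
static int clock_pick_victim(void)
{
    for (;;) {
        if (bufs[hand].referenced == 0) {
            int victim = hand;
            hand = (hand + 1) % NBUF;    /* leave the hand just past the victim */
            return victim;
        }
        bufs[hand].referenced = 0;       /* second chance: clear bit, move on   */
        hand = (hand + 1) % NBUF;
    }
}
```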
Linux's page cache uses a two-list approach similar to 2Q: 'active' and 'inactive' lists. Pages must be accessed twice to become 'active' and are protected from eviction. Scanning patterns pollute only the inactive list, preserving working set blocks in the active list.
Modern operating systems unify the buffer cache with the virtual memory page cache. Rather than maintaining separate caches for file system blocks and memory-mapped pages, they use a single unified page cache.
Early Unix systems had a separate buffer cache—a fixed pool of memory dedicated to caching disk blocks. This created problems: the same file data could end up cached twice (once as disk blocks, once as memory-mapped pages), the two copies could become inconsistent, and memory was statically divided between the two caches regardless of the workload.
Traditional Design: Unified Design:
┌──────────────────────┐ ┌──────────────────────┐
│ User Processes │ │ User Processes │
└──────────┬───────────┘ └──────────┬───────────┘
│ │
┌──────────▼───────────┐ ┌──────────▼───────────┐
│ Virtual Memory │ │ Virtual Memory │
│ (page cache) │ │ (PAGE CACHE) │
└──────────┬───────────┘ └──────────┬───────────┘
│ │
┌──────────▼───────────┐ │
│ Buffer Cache │←─ Separate caches │
└──────────┬───────────┘ for same data │
│ │
┌──────────▼───────────┐ ┌──────────▼───────────┐
│ Block Device │ │ Block Device │
└──────────────────────┘ └──────────────────────┘
In the unified model, file data is cached exactly once—as pages indexed by (inode, file offset)—and read(), write(), and mmap() all operate on those same pages, with memory flowing dynamically between file caching and other uses as demand shifts:
read() system call:
1. Calculate which page(s) of the file contain requested bytes
2. For each page:
a. Look up page in page cache (indexed by inode + file offset)
b. If present: copy data to user buffer
c. If absent: allocate page, initiate disk read, wait, then copy
3. Update access time metadata
write() system call:
1. Calculate which page(s) need modification
2. For each page:
a. Look up page in page cache (load if absent)
b. Copy user data into page
c. Mark page as dirty
3. Return to user (data in cache, not on disk)
4. Background writeback will sync dirty pages to disk
mmap() system call:
1. Create virtual memory mapping: offset in file → pages in page cache
2. Pages loaded on demand via page faults
3. Modifications to mapped memory mark pages dirty
4. msync() or munmap() triggers writeback
With mmap(), applications can access file data with zero copies—the page cache pages are mapped directly into user space. This is why databases like PostgreSQL and MongoDB use mmap() for data files, and why sendfile() can stream files to network sockets without copying data through user space.
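A minimal sketch of reading a file through the page cache with mmap()—the returned pointer addresses the cached pages directly, so no read() copies are involved (error handling is omitted and the checksum loop is just a placeholder workload):

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

uint64_t checksum_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* Map the whole file; pages enter the page cache on first touch. */
    unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

    uint64_t sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += p[i];                     /* page faults load missing pages */

    munmap(p, st.st_size);
    close(fd);
    return sum;
}
```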
With write-back caching, modified pages accumulate in memory. The operating system must periodically flush dirty pages to disk to bound how much data a crash can lose, to keep pages available for reclamation, and to smooth write I/O out over time rather than issuing one enormous burst when memory fills.
Linux uses tunable thresholds to control dirty page writeback:
/proc/sys/vm/dirty_ratio = 20 # % of memory; hard limit
/proc/sys/vm/dirty_background_ratio = 10 # % of memory; background writeback starts
/proc/sys/vm/dirty_writeback_centisecs = 500 # Background thread interval (5 sec)
/proc/sys/vm/dirty_expire_centisecs = 3000 # Age before forced writeback (30 sec)
Background writeback: When dirty pages exceed dirty_background_ratio, the pdflush (older) or bdi-flush (newer) kernel thread begins writing pages asynchronously.
Blocking writeback: When dirty pages exceed dirty_ratio, writing processes are blocked until pages are flushed.
Age-based writeback: Pages older than dirty_expire_centisecs are written regardless of memory pressure.
Applications needing durability can force writeback:
| Function | Scope | Behavior |
|---|---|---|
| sync() | All file systems | Schedule all dirty pages for writeback; may return before I/O completes |
| fsync(fd) | One file | Write all dirty pages of file AND metadata; block until complete |
| fdatasync(fd) | One file | Write dirty data pages plus only the metadata needed to retrieve them (skips e.g. timestamps); faster than fsync |
| syncfs(fd) | One file system | Like sync() but for one FS |
| msync() | mmap region | Sync memory-mapped pages |
fsync() is critical for durability:
// Without fsync: data may be lost on crash
int fd = open("important.txt", O_WRONLY);
write(fd, data, size);   // Data in page cache
close(fd);               // Data STILL in page cache
// CRASH HERE = DATA LOST
// With fsync: data durably on disk
int fd = open("important.txt", O_WRONLY);
write(fd, data, size);
fsync(fd);               // Data on disk, metadata on disk
close(fd);
// CRASH HERE = DATA SAFE
fsync() blocks until data is physically on the storage device, waiting for disk rotation or flash programming. Calling fsync() after every write can reduce throughput by 10-100x. Databases batch writes and call fsync() only at transaction commit, balancing durability and performance.
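A sketch of that batching pattern: write many records, then pay the fsync() cost once at "commit" (the record size and function name are illustrative):

```c
#include <fcntl.h>
#include <unistd.h>

/* Append a batch of fixed-size records, then make the whole batch durable
   with a single fsync() instead of one per record. */
int commit_batch(int fd, const char records[][128], int count)
{
    for (int i = 0; i < count; i++)
        write(fd, records[i], 128);   /* fast: lands in the page cache */

    return fsync(fd);                 /* one durability point for the batch */
}
```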
Read-ahead (also called prefetching) is a critical optimization: load blocks into the cache before they're requested, anticipating future access patterns. When sequential access is detected, read-ahead can hide disk latency almost completely.
Access pattern: Block 0, Block 1, Block 2, Block 3, ...
The file system detects: "This looks sequential!"
When Block 3 is requested, proactively read:
→ Blocks 4, 5, 6, 7, 8, 9, 10, 11 (8 blocks ahead)
By the time application needs Block 4,
it's already in cache. Zero wait!
Modern systems use adaptive read-ahead that adjusts the window size based on whether the sequential pattern keeps holding, whether previously prefetched pages are actually being used, and how much memory is available:
Linux Read-Ahead Growth:
Initial: Read-ahead window = 128 KB (32 pages)
Sequential read confirmed:
→ Double window: 256 KB
Still sequential:
→ Double again: 512 KB
Continue until:
→ Maximum: 2 MB (or as configured)
Random read detected:
→ Reset to minimum
// Advise the kernel about access patterns
posix_fadvise(fd, offset, len, POSIX_FADV_SEQUENTIAL); // Will read sequentially
posix_fadvise(fd, offset, len, POSIX_FADV_RANDOM); // Random access
posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED); // Prefetch this region
posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED); // Won't need; can evict
// System-wide read-ahead settings
/sys/block/sda/queue/read_ahead_kb // Default read-ahead size
Application hints matter: A database scanning an index might hint FADV_RANDOM, preventing wasteful read-ahead. A video player might hint FADV_SEQUENTIAL, encouraging aggressive prefetching.
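For instance, a streaming reader might declare its intent up front so the kernel ramps up read-ahead immediately, then release the pages when done. A sketch (the buffer size is arbitrary):

```c
#include <fcntl.h>
#include <unistd.h>

void stream_file(const char *path)
{
    int fd = open(path, O_RDONLY);

    /* Tell the kernel the whole file will be read sequentially (len 0 = to EOF). */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[1 << 16];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* ... consume n bytes ... */
    }

    /* Done with the data; let the kernel evict these pages early. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
}
```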
Aggressive read-ahead improves sequential throughput, but on random or short-lived access patterns it wastes I/O bandwidth and memory on blocks that are never used, and the prefetched pages can evict data the application actually needs.
This is why read-ahead is adaptive—the system learns from actual access patterns rather than assuming one strategy fits all workloads.
SSDs have no seek penalty, so read-ahead provides less latency benefit than on HDDs. However, SSDs still benefit from larger I/O operations (reduced command overhead) and internal parallelism (multiple channels active). Read-ahead remains valuable but the optimal strategy differs from spinning disks.
The basic file system doesn't simply forward every block request to the device. It schedules and coalesces requests to optimize device utilization.
When multiple block requests can be combined into a single I/O operation:
Individual requests: Coalesced request:
Read block 100 Read blocks 100-105
Read block 101 (one command to device)
Read block 102
Read block 103
Read block 104
Read block 105
6 device commands 1 device command
6 × command overhead 1 × command overhead
Coalescing is especially valuable for flushing runs of adjacent dirty blocks during write-back, for sequential read-ahead, and for devices where per-command overhead dominates small transfers; a sketch of the merge check follows.
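A sketch of the merge check: if a new request continues exactly where a queued one ends (same device, same direction, adjacent blocks), extend the queued request instead of adding another (structure and field names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

struct io_request {
    uint32_t device_id;
    uint64_t start_block;
    uint32_t block_count;
    bool     is_write;
};

/* Try to fold an incoming request into an already-queued one. */
static bool try_coalesce(struct io_request *queued, const struct io_request *incoming)
{
    bool same_stream = queued->device_id == incoming->device_id &&
                       queued->is_write  == incoming->is_write;
    bool adjacent = queued->start_block + queued->block_count == incoming->start_block;

    if (same_stream && adjacent) {
        queued->block_count += incoming->block_count;   /* one larger command */
        return true;
    }
    return false;   /* caller queues it as a separate request */
}
```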
For HDDs, the order of requests matters enormously due to seek time. I/O schedulers reorder requests to minimize head movement:
NOOP (No-Operation): Simple FIFO queue. Used for SSDs and virtual disks where seek time doesn't matter.
Deadline: Assigns deadlines to reads and writes, preventing starvation. Good for databases.
CFQ (Completely Fair Queuing): Allocates I/O bandwidth fairly among processes. The longtime desktop default, since retired in favor of multi-queue schedulers.
mq-deadline / BFQ / Kyber: Modern multi-queue schedulers optimized for NVMe.
Disk Head at block 50, requests for: 10, 100, 30, 70, 20
FIFO order: 50→10→100→30→70→20 (total: 40+90+70+40+50 = 290 cylinders)
Elevator order: 50→70→100→30→20→10 (total: 20+30+70+10+10 = 140 cylinders)
(ascending, then descending)
HALF the seek distance!
The basic file system queues requests to the scheduler, which reorders them for efficiency before passing to the device driver.
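A toy version of the elevator (SCAN) ordering used in the example above: service pending requests at or beyond the head position in ascending order, then the rest in descending order. Real schedulers maintain sorted queues incrementally; this is only a sketch:

```c
#include <stdio.h>
#include <stdlib.h>

static int ascending(const void *a, const void *b)
{
    return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
}

/* Print requests in SCAN order starting from the current head position. */
void elevator_order(int head, int *req, int n)
{
    qsort(req, n, sizeof *req, ascending);

    for (int i = 0; i < n; i++)           /* upward pass: requests >= head */
        if (req[i] >= head)
            printf("%d ", req[i]);

    for (int i = n - 1; i >= 0; i--)      /* downward pass: requests < head */
        if (req[i] < head)
            printf("%d ", req[i]);
    printf("\n");
}

/* With head at 50 and requests {10, 100, 30, 70, 20}, this prints
   70 100 30 20 10 -- the elevator order from the example above. */
```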
NVMe drives have multiple hardware queues (up to 65,535) and microsecond latencies. Complex I/O scheduling adds overhead without benefit. Modern kernels use simple 'none' or 'mq-deadline' schedulers for NVMe, focusing on low latency rather than seek optimization.
We've explored the basic file system—the layer that manages block-level I/O and provides the critical buffer cache. Let's consolidate the key concepts: the layer reads and writes generic blocks with no knowledge of files; the buffer cache (hash table plus LRU free list) serves repeated requests from RAM; write-through trades latency for durability while write-back trades durability for speed; modern kernels unify block caching with virtual memory in the page cache; dirty pages reach disk through background writeback or explicit sync calls; and read-ahead, request coalescing, and I/O scheduling extract maximum performance from the device.
What's Next:
The basic file system issues device commands but doesn't speak directly to hardware. The I/O control layer handles the translation between abstract device commands and the specific protocols and registers required by each device type. Next, we'll explore device drivers and the I/O control architecture.
You now understand the basic file system layer—its buffer cache architecture, caching policies, and the optimizations that make file systems performant. This knowledge is essential for tuning production systems and understanding I/O-bound performance issues.