Between the abstract world of files and extents and the raw command interface of storage devices lies the basic file system—a seemingly simple layer with profound performance implications. This layer handles block-level I/O: reading and writing generic blocks without understanding what data they contain or which files own them.
Yet calling this layer 'basic' understates its importance. This is where the buffer cache lives—one of the most critical performance mechanisms in any operating system. This is where I/O requests are scheduled, coalesced, and optimized. The efficiency of this layer can make a file system feel instantaneous or painfully slow.
When Netflix streams video to millions of users or Facebook serves billions of image requests, the basic file system is working furiously to cache hot blocks, minimize disk seeks, and keep data flowing. Understanding this layer is essential for anyone tuning system performance or designing for high I/O workloads.
By the end of this page, you will understand the basic file system's role in block-level I/O, the architecture and algorithms of the buffer cache, write policies (write-through vs write-back), and how modern systems unify file system and virtual memory caching through the page cache.
The basic file system sits at layer 3 of our five-layer stack. It receives physical block requests from the file organization module above and issues device commands to the I/O control layer below. Its primary job is to manage the actual reading and writing of blocks.
Critically, this layer has no knowledge of files. It doesn't understand inodes, directories, or file names. It only knows device identifiers, block numbers, and the raw data stored in each block.
This separation is powerful—the same basic file system can serve any allocation strategy implemented above it.
// Requests FROM file organization module:
buffer_cache_read(device_id, block_number, buffer);
buffer_cache_write(device_id, block_number, buffer);
buffer_cache_sync(device_id); // Force all dirty blocks to disk
// Requests TO I/O control layer:
device_read_blocks(device_id, start_block, count, buffer);
device_write_blocks(device_id, start_block, count, buffer);
The basic file system insulates upper layers from direct device communication and provides the essential abstraction of a reliable block store.
The buffer cache (also called block cache) is an in-memory cache of disk blocks. Its purpose is simple: avoid disk I/O whenever possible by keeping frequently accessed blocks in RAM.
Disk I/O is extraordinarily slow compared to memory access:
| Operation | Latency |
|-----------|---------|
| L1 cache access | ~1 nanosecond |
| RAM access | ~100 nanoseconds |
| SSD random read | ~100,000 nanoseconds (100 μs) |
| HDD random read | ~10,000,000 nanoseconds (10 ms) |
An HDD random read is 10 million times slower than an L1 cache hit. Even SSDs are 1,000 times slower than RAM. The buffer cache's job is to exploit temporal locality (recently accessed blocks are likely to be accessed again soon) to serve as many requests as possible from memory.
A typical buffer cache consists of:
1. Buffer Pool: A collection of fixed-size memory regions (buffers), each capable of holding one disk block.
2. Hash Table: Maps (device_id, block_number) pairs to buffers for O(1) lookup.
3. Free List: Doubly-linked list of buffers not currently in use, ordered by least-recently-used (LRU).
4. Buffer Headers: Metadata for each buffer, including the (device, block) identity, valid and dirty flags, an "in I/O" flag, a reference count, and the hash-chain and free-list pointers.
┌────────────────────────────────────────────────────────────┐
│ BUFFER CACHE │
├────────────────────────────────────────────────────────────┤
│ │
│ Hash Table │
│ ┌─────┐ │
│ │ 0 │──→ [buf:d1,b5] ──→ [buf:d2,b100] ──→ NULL │
│ │ 1 │──→ NULL │
│ │ 2 │──→ [buf:d1,b42] ──→ NULL │
│ │ ... │ │
│ └─────┘ │
│ │
│ LRU Free List │
│ HEAD ←→ [buf:d1,b99] ←→ [buf:d3,b1] ←→ ... ←→ TAIL │
│ (oldest) (newest) │
│ │
│ Buffer Pool (memory area) │
│ ┌────────┬────────┬────────┬────────┬────────┐ │
│ │4KB buf │4KB buf │4KB buf │ ... │4KB buf │ │
│ └────────┴────────┴────────┴────────┴────────┘ │
│ │
└────────────────────────────────────────────────────────────┘
The hash table provides O(1) lookup for cache hits. The free list provides fast access to reusable buffers for cache misses. Together, they support the complete cache operation: quickly find cached blocks AND quickly find memory for new blocks. A buffer moves between these structures as its state changes.
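A minimal C sketch of these structures, assuming 4 KB blocks and a fixed number of hash buckets (all names and sizes are illustrative, not taken from any particular kernel):

```c
#include <pthread.h>
#include <stdint.h>

#define BLOCK_SIZE   4096
#define HASH_BUCKETS 1024              /* assumed power of two */

struct buffer_header {
    uint32_t device_id;                /* which device this block belongs to  */
    uint64_t block_number;             /* which block on that device          */
    int      valid;                    /* contents reflect the on-disk block  */
    int      dirty;                    /* modified in memory, not yet written */
    int      in_io;                    /* a read or write is in flight        */
    int      reference_count;          /* callers currently using this buffer */
    struct buffer_header *hash_next;   /* chain within one hash bucket        */
    struct buffer_header *lru_prev;    /* doubly linked LRU free list         */
    struct buffer_header *lru_next;
    unsigned char data[BLOCK_SIZE];    /* the cached block itself             */
};

struct buffer_cache {
    pthread_mutex_t       lock;                 /* protects all lists below   */
    struct buffer_header *hash[HASH_BUCKETS];   /* (device, block) -> buffer  */
    struct buffer_header *lru_head;             /* least recently used end    */
    struct buffer_header *lru_tail;             /* most recently used end     */
};
```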
When the basic file system receives a read request for a block:
buffer_cache_read(device, block):
1. LOCK buffer cache
2. hash_entry = hash_lookup(device, block)
3. IF hash_entry exists: # CACHE HIT
a. IF buffer is currently in I/O:
- wait for I/O to complete
b. buffer.reference_count++
c. move buffer to MRU end of free list
d. UNLOCK buffer cache
e. copy buffer data to caller's buffer
f. buffer.reference_count-- # caller releases the buffer after the copy
g. RETURN
4. ELSE: # CACHE MISS
a. victim = remove least-recently-used buffer from free list
b. IF victim is dirty:
- write victim to disk (synchronously or async)
c. remove victim from its hash chain
d. reinitialize victim for new (device, block)
e. add victim to hash chain for (device, block)
f. mark buffer as "in I/O"
g. UNLOCK buffer cache
h. issue device read (device, block, buffer)
i. wait for I/O completion
j. mark buffer valid, clear "in I/O"
k. RETURN buffer data
The algorithm ensures correctness with concurrent access while minimizing lock hold time. The lock is released before blocking I/O operations.
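As an illustration, the hash_lookup in step 2 could be a simple modular hash over the (device, block) pair followed by a chain walk. A sketch reusing the structures above (the hash function itself is an arbitrary choice):

```c
/* Illustrative hash: combine device and block, then reduce to a bucket index. */
static unsigned int bucket_for(uint32_t device_id, uint64_t block_number)
{
    uint64_t key = ((uint64_t)device_id << 32) ^ block_number;
    return (unsigned int)(key % HASH_BUCKETS);
}

/* Walk the chain for that bucket; the caller must hold cache->lock. */
static struct buffer_header *hash_lookup(struct buffer_cache *cache,
                                         uint32_t device_id,
                                         uint64_t block_number)
{
    struct buffer_header *b = cache->hash[bucket_for(device_id, block_number)];
    while (b != NULL) {
        if (b->device_id == device_id && b->block_number == block_number)
            return b;                  /* cache hit */
        b = b->hash_next;
    }
    return NULL;                       /* cache miss */
}
```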
buffer_cache_write(device, block, data):
1. buffer = buffer_cache_read(device, block) # Get buffer (may be cache miss)
2. LOCK buffer
3. copy data into buffer
4. mark buffer as DIRTY
5. UNLOCK buffer
6. IF write-through policy:
- issue device write immediately
- wait for completion
- clear DIRTY flag
7. ELSE (write-back policy):
- RETURN immediately (data only in buffer)
- dirty buffer will be written later
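Rendered as C against the earlier struct sketch, the policy branch might look like this (device_write_blocks is the call into the I/O control layer shown earlier; locking is omitted for brevity):

```c
#include <stdint.h>
#include <string.h>

enum write_policy { WRITE_THROUGH, WRITE_BACK };

/* Provided by the I/O control layer (see the interface above). */
void device_write_blocks(uint32_t device_id, uint64_t start_block,
                         uint32_t count, const void *buffer);

/* Copy new contents into a cached buffer and apply the chosen write policy. */
static void cache_write_block(struct buffer_header *buf, const void *data,
                              size_t len, enum write_policy policy)
{
    memcpy(buf->data, data, len);        /* update the in-memory copy          */
    buf->dirty = 1;                      /* it now differs from the disk copy  */

    if (policy == WRITE_THROUGH) {
        device_write_blocks(buf->device_id, buf->block_number, 1, buf->data);
        buf->dirty = 0;                  /* on disk; buffer is clean again     */
    }
    /* WRITE_BACK: return immediately; background writeback flushes it later. */
}
```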
The choice between write-through and write-back policies has significant performance and reliability implications.
| Aspect | Write-Through | Write-Back |
|---|---|---|
| Definition | Write to disk immediately on every write | Write to buffer; sync to disk later |
| Write latency | High (includes disk I/O) | Low (memory speed only) |
| Reliability | High (data on disk immediately) | Risk of data loss on crash |
| Disk I/O volume | High (every write goes to disk) | Lower (coalesce multiple writes) |
| Use case | Databases with durability needs | General file system performance |
Write-back caching means data can be 'written' (from the application's perspective) but not yet on disk. A power failure or system crash can lose data that the application believes is safely stored. This is why databases use fsync() and O_SYNC to force durability, and why sudden power loss can corrupt file systems.
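For files that need write-through behavior, O_SYNC forces every write() to wait for the device. A minimal sketch (the path and record parameters are illustrative):

```c
#include <fcntl.h>
#include <unistd.h>

/* Append one record durably: O_SYNC makes each write() wait for the device. */
int append_durable(const char *path, const void *record, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, record, len);  /* returns only after data is durable */
    close(fd);
    return (n == (ssize_t)len) ? 0 : -1;
}
```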
When the buffer cache is full and a new block must be loaded, a victim buffer must be selected for eviction. The replacement policy dramatically affects cache hit rate and overall system performance.
The classic approach: evict the buffer that was accessed longest ago. Based on the assumption that recently accessed blocks will be accessed again (temporal locality).
Implementation: Each access moves the buffer to the MRU (most recently used) end of the free list. Eviction takes from the LRU end.
Access order: A, B, C, A, D, B, E
After access A: [A]
After access B: [A, B]
After access C: [A, B, C]
After access A: [B, C, A] ← A moves to end
After access D: [B, C, A, D]
Cache full (capacity 4), need to load E:
Evict B (LRU): [C, A, D, E]
LRU advantages: Excellent for workloads with temporal locality. Simple to implement.
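Indeed, the core operations reduce to a few pointer updates on the doubly linked free list. A sketch against the structures introduced earlier (the caller is assumed to hold the cache lock):

```c
/* Detach a buffer from wherever it currently sits in the LRU list. */
static void lru_unlink(struct buffer_cache *c, struct buffer_header *b)
{
    if (b->lru_prev)
        b->lru_prev->lru_next = b->lru_next;
    else
        c->lru_head = b->lru_next;

    if (b->lru_next)
        b->lru_next->lru_prev = b->lru_prev;
    else
        c->lru_tail = b->lru_prev;

    b->lru_prev = b->lru_next = NULL;
}

/* On every access: move the buffer to the MRU (tail) end. */
static void lru_touch(struct buffer_cache *c, struct buffer_header *b)
{
    lru_unlink(c, b);
    b->lru_prev = c->lru_tail;
    if (c->lru_tail)
        c->lru_tail->lru_next = b;
    else
        c->lru_head = b;
    c->lru_tail = b;
}

/* On a miss with a full cache: take the victim from the LRU (head) end. */
static struct buffer_header *lru_evict(struct buffer_cache *c)
{
    struct buffer_header *victim = c->lru_head;
    if (victim)
        lru_unlink(c, victim);
    return victim;   /* caller writes it back if dirty, then reuses it */
}
```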
LRU weaknesses: a single sequential scan through a large file floods the cache and evicts the entire working set (cache pollution), and pure LRU tracks only recency—a block read hundreds of times ranks no higher than one touched once a moment ago. Several refinements address these gaps:
LRU-K: Track the K-th most recent access, not just the most recent. Better captures access frequency patterns.
2Q (Two Queue): Maintain separate queues for new and established blocks: a block enters a small probationary queue on first access and is promoted to the main LRU queue only if it is accessed again, so one-pass scans never displace established blocks.
ARC (Adaptive Replacement Cache): Self-tuning algorithm that balances recency and frequency.
CLOCK (Second-Chance): Approximate LRU cheaply by arranging buffers in a circle, each with a reference bit that is set on access. A rotating "hand" sweeps the circle: a buffer with its bit set gets a second chance (the bit is cleared and the hand moves on); the first buffer found with its bit clear is evicted.
Clock Algorithm:
┌───┐
┌→─→│ 1 │──→──┐ 1 = referenced recently
│ └───┘ │ 0 = not referenced
│ ▼
┌───┐ ┌───┐
│ 0 │← HAND │ 1 │
└───┘ ▲ └───┘
▲ │ │
│ ┌───┐ │
└──←│ 1 │←←──┘
└───┘
Hand at buffer with 0: EVICT
Hand at buffer with 1: Clear to 0, advance
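A compact sketch of the clock sweep, assuming the buffers live in a circular array with a per-buffer reference bit (names and sizes are illustrative):

```c
#define NBUF 256

struct clock_buf {
    int referenced;            /* set to 1 on every access to this buffer */
    /* ... block identity and data would live here ... */
};

static struct clock_buf bufs[NBUF];
static int hand = 0;           /* index of the clock hand */

/* Sweep until a buffer with referenced == 0 is found; clear bits on the way. */
static int clock_pick_victim(void)
{
    for (;;) {
        if (bufs[hand].referenced == 0) {
            int victim = hand;
            hand = (hand + 1) % NBUF;    /* leave the hand just past the victim */
            return victim;
        }
        bufs[hand].referenced = 0;       /* second chance: clear bit, move on   */
        hand = (hand + 1) % NBUF;
    }
}
```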
Linux's page cache uses a two-list approach similar to 2Q: 'active' and 'inactive' lists. Pages must be accessed twice to become 'active' and are protected from eviction. Scanning patterns pollute only the inactive list, preserving working set blocks in the active list.
Modern operating systems unify the buffer cache with the virtual memory page cache. Rather than maintaining separate caches for file system blocks and memory-mapped pages, they use a single unified page cache.
Early Unix systems had a separate buffer cache—a fixed pool of memory dedicated to caching disk blocks. This created problems: the same file data could end up cached twice (once as disk blocks, once as memory-mapped pages), the two copies could become inconsistent, and memory was statically divided between the two caches regardless of the workload.
Traditional Design: Unified Design:
┌──────────────────────┐ ┌──────────────────────┐
│ User Processes │ │ User Processes │
└──────────┬───────────┘ └──────────┬───────────┘
│ │
┌──────────▼───────────┐ ┌──────────▼───────────┐
│ Virtual Memory │ │ Virtual Memory │
│ (page cache) │ │ (PAGE CACHE) │
└──────────┬───────────┘ └──────────┬───────────┘
│ │
┌──────────▼───────────┐ │
│ Buffer Cache │←─ Separate caches │
└──────────┬───────────┘ for same data │
│ │
┌──────────▼───────────┐ ┌──────────▼───────────┐
│ Block Device │ │ Block Device │
└──────────────────────┘ └──────────────────────┘
In the unified model, file data is cached exactly once—as pages indexed by (inode, file offset)—and read(), write(), and mmap() all operate on those same pages, with memory flowing dynamically between file caching and other uses as demand shifts:
read() system call:
1. Calculate which page(s) of the file contain requested bytes
2. For each page:
a. Look up page in page cache (indexed by inode + file offset)
b. If present: copy data to user buffer
c. If absent: allocate page, initiate disk read, wait, then copy
3. Update access time metadata
write() system call:
1. Calculate which page(s) need modification
2. For each page:
a. Look up page in page cache (load if absent)
b. Copy user data into page
c. Mark page as dirty
3. Return to user (data in cache, not on disk)
4. Background writeback will sync dirty pages to disk
mmap() system call:
1. Create virtual memory mapping: offset in file → pages in page cache
2. Pages loaded on demand via page faults
3. Modifications to mapped memory mark pages dirty
4. msync() or munmap() triggers writeback
With mmap(), applications can access file data with zero copies—the page cache pages are mapped directly into user space. This is why databases like PostgreSQL and MongoDB use mmap() for data files, and why sendfile() can stream files to network sockets without copying data through user space.
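A minimal sketch of reading a file through the page cache with mmap()—the returned pointer addresses the cached pages directly, so no read() copies are involved (error handling is omitted and the checksum loop is just a placeholder workload):

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

uint64_t checksum_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* Map the whole file; pages enter the page cache on first touch. */
    unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

    uint64_t sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += p[i];                     /* page faults load missing pages */

    munmap(p, st.st_size);
    close(fd);
    return sum;
}
```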
With write-back caching, modified pages accumulate in memory. The operating system must periodically flush dirty pages to disk to bound how much data a crash can lose, to keep pages available for reclamation, and to smooth write I/O out over time rather than issuing one enormous burst when memory fills.
Linux uses tunable thresholds to control dirty page writeback:
/proc/sys/vm/dirty_ratio = 20 # % of memory; hard limit
/proc/sys/vm/dirty_background_ratio = 10 # % of memory; background writeback starts
/proc/sys/vm/dirty_writeback_centisecs = 500 # Background thread interval (5 sec)
/proc/sys/vm/dirty_expire_centisecs = 3000 # Age before forced writeback (30 sec)
Background writeback: When dirty pages exceed dirty_background_ratio, the pdflush (older) or bdi-flush (newer) kernel thread begins writing pages asynchronously.
Blocking writeback: When dirty pages exceed dirty_ratio, writing processes are blocked until pages are flushed.
Age-based writeback: Pages older than dirty_expire_centisecs are written regardless of memory pressure.
Applications needing durability can force writeback:
| Function | Scope | Behavior |
|---|---|---|
| sync() | All file systems | Schedule all dirty pages for writeback; may return before I/O completes |
| fsync(fd) | One file | Write all dirty pages of file AND metadata; block until complete |
| fdatasync(fd) | One file | Write dirty data pages plus only the metadata needed to retrieve them (skips e.g. timestamps); faster than fsync |
| syncfs(fd) | One file system | Like sync() but for one FS |
| msync() | mmap region | Sync memory-mapped pages |
fsync() is critical for durability:
// Without fsync: data may be lost on crash
int fd = open("important.txt", O_WRONLY);
write(fd, data, size);   // Data in page cache
close(fd);               // Data STILL in page cache
// CRASH HERE = DATA LOST
// With fsync: data durably on disk
int fd = open("important.txt", O_WRONLY);
write(fd, data, size);
fsync(fd);               // Data on disk, metadata on disk
close(fd);
// CRASH HERE = DATA SAFE
fsync() blocks until data is physically on the storage device, waiting for disk rotation or flash programming. Calling fsync() after every write can reduce throughput by 10-100x. Databases batch writes and call fsync() only at transaction commit, balancing durability and performance.
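A sketch of that batching pattern: write many records, then pay the fsync() cost once at "commit" (the record size and function name are illustrative):

```c
#include <fcntl.h>
#include <unistd.h>

/* Append a batch of fixed-size records, then make the whole batch durable
   with a single fsync() instead of one per record. */
int commit_batch(int fd, const char records[][128], int count)
{
    for (int i = 0; i < count; i++)
        write(fd, records[i], 128);   /* fast: lands in the page cache */

    return fsync(fd);                 /* one durability point for the batch */
}
```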
Read-ahead (also called prefetching) is a critical optimization: load blocks into the cache before they're requested, anticipating future access patterns. When sequential access is detected, read-ahead can hide disk latency almost completely.
Access pattern: Block 0, Block 1, Block 2, Block 3, ...
The file system detects: "This looks sequential!"
When Block 3 is requested, proactively read:
→ Blocks 4, 5, 6, 7, 8, 9, 10, 11 (8 blocks ahead)
By the time application needs Block 4,
it's already in cache. Zero wait!
Modern systems use adaptive read-ahead that adjusts the window size based on whether the sequential pattern keeps holding, whether previously prefetched pages are actually being used, and how much memory is available:
Linux Read-Ahead Growth:
Initial: Read-ahead window = 128 KB (32 pages)
Sequential read confirmed:
→ Double window: 256 KB
Still sequential:
→ Double again: 512 KB
Continue until:
→ Maximum: 2 MB (or as configured)
Random read detected:
→ Reset to minimum
// Advise the kernel about access patterns
posix_fadvise(fd, offset, len, POSIX_FADV_SEQUENTIAL); // Will read sequentially
posix_fadvise(fd, offset, len, POSIX_FADV_RANDOM); // Random access
posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED); // Prefetch this region
posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED); // Won't need; can evict
// System-wide read-ahead settings
/sys/block/sda/queue/read_ahead_kb // Default read-ahead size
Application hints matter: A database scanning an index might hint FADV_RANDOM, preventing wasteful read-ahead. A video player might hint FADV_SEQUENTIAL, encouraging aggressive prefetching.
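For instance, a streaming reader might declare its intent up front so the kernel ramps up read-ahead immediately, then release the pages when done. A sketch (the buffer size is arbitrary):

```c
#include <fcntl.h>
#include <unistd.h>

void stream_file(const char *path)
{
    int fd = open(path, O_RDONLY);

    /* Tell the kernel the whole file will be read sequentially (len 0 = to EOF). */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[1 << 16];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* ... consume n bytes ... */
    }

    /* Done with the data; let the kernel evict these pages early. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
}
```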
Aggressive read-ahead improves sequential throughput, but on random or short-lived access patterns it wastes I/O bandwidth and memory on blocks that are never used, and the prefetched pages can evict data the application actually needs.
This is why read-ahead is adaptive—the system learns from actual access patterns rather than assuming one strategy fits all workloads.
SSDs have no seek penalty, so read-ahead provides less latency benefit than on HDDs. However, SSDs still benefit from larger I/O operations (reduced command overhead) and internal parallelism (multiple channels active). Read-ahead remains valuable but the optimal strategy differs from spinning disks.
The basic file system doesn't simply forward every block request to the device. It schedules and coalesces requests to optimize device utilization.
When multiple block requests can be combined into a single I/O operation:
Individual requests: Coalesced request:
Read block 100 Read blocks 100-105
Read block 101 (one command to device)
Read block 102
Read block 103
Read block 104
Read block 105
6 device commands 1 device command
6 × command overhead 1 × command overhead
Coalescing is especially valuable for flushing runs of adjacent dirty blocks during write-back, for sequential read-ahead, and for devices where per-command overhead dominates small transfers; a sketch of the merge check follows.
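A sketch of the merge check: if a new request continues exactly where a queued one ends (same device, same direction, adjacent blocks), extend the queued request instead of adding another (structure and field names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

struct io_request {
    uint32_t device_id;
    uint64_t start_block;
    uint32_t block_count;
    bool     is_write;
};

/* Try to fold an incoming request into an already-queued one. */
static bool try_coalesce(struct io_request *queued, const struct io_request *incoming)
{
    bool same_stream = queued->device_id == incoming->device_id &&
                       queued->is_write  == incoming->is_write;
    bool adjacent = queued->start_block + queued->block_count == incoming->start_block;

    if (same_stream && adjacent) {
        queued->block_count += incoming->block_count;   /* one larger command */
        return true;
    }
    return false;   /* caller queues it as a separate request */
}
```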
For HDDs, the order of requests matters enormously due to seek time. I/O schedulers reorder requests to minimize head movement:
NOOP (No-Operation): Simple FIFO queue. Used for SSDs and virtual disks where seek time doesn't matter.
Deadline: Assigns deadlines to reads and writes, preventing starvation. Good for databases.
CFQ (Completely Fair Queuing): Allocates I/O bandwidth fairly among processes. The longtime desktop default, since retired in favor of multi-queue schedulers.
mq-deadline / BFQ / Kyber: Modern multi-queue schedulers optimized for NVMe.
Disk Head at block 50, requests for: 10, 100, 30, 70, 20
FIFO order: 50→10→100→30→70→20 (total: 40+90+70+40+50 = 290 cylinders)
Elevator order: 50→70→100→30→20→10 (total: 20+30+70+10+10 = 140 cylinders)
(ascending, then descending)
HALF the seek distance!
The basic file system queues requests to the scheduler, which reorders them for efficiency before passing to the device driver.
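A toy version of the elevator (SCAN) ordering used in the example above: service pending requests at or beyond the head position in ascending order, then the rest in descending order. Real schedulers maintain sorted queues incrementally; this is only a sketch:

```c
#include <stdio.h>
#include <stdlib.h>

static int ascending(const void *a, const void *b)
{
    return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
}

/* Print requests in SCAN order starting from the current head position. */
void elevator_order(int head, int *req, int n)
{
    qsort(req, n, sizeof *req, ascending);

    for (int i = 0; i < n; i++)           /* upward pass: requests >= head */
        if (req[i] >= head)
            printf("%d ", req[i]);

    for (int i = n - 1; i >= 0; i--)      /* downward pass: requests < head */
        if (req[i] < head)
            printf("%d ", req[i]);
    printf("\n");
}

/* With head at 50 and requests {10, 100, 30, 70, 20}, this prints
   70 100 30 20 10 -- the elevator order from the example above. */
```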
NVMe drives have multiple hardware queues (up to 65,535) and microsecond latencies. Complex I/O scheduling adds overhead without benefit. Modern kernels use simple 'none' or 'mq-deadline' schedulers for NVMe, focusing on low latency rather than seek optimization.
We've explored the basic file system—the layer that manages block-level I/O and provides the critical buffer cache. Let's consolidate the key concepts: the layer reads and writes generic blocks with no knowledge of files; the buffer cache (hash table plus LRU free list) serves repeated requests from RAM; write-through trades latency for durability while write-back trades durability for speed; modern kernels unify block caching with virtual memory in the page cache; dirty pages reach disk through background writeback or explicit sync calls; and read-ahead, request coalescing, and I/O scheduling extract maximum performance from the device.
What's Next:
The basic file system issues device commands but doesn't speak directly to hardware. The I/O control layer handles the translation between abstract device commands and the specific protocols and registers required by each device type. Next, we'll explore device drivers and the I/O control architecture.
You now understand the basic file system layer—its buffer cache architecture, caching policies, and the optimizations that make file systems performant. This knowledge is essential for tuning production systems and understanding I/O-bound performance issues.