Between the file system's logical view of data and the actual disk platters (or flash cells) lies a critical subsystem: the block I/O layer, usually just called the block layer.
The block layer is invisible to applications yet determines much of your system's I/O performance. Understanding it is essential for anyone working on storage systems, databases, or kernel development.
By the end of this page, you will understand the complete architecture of the Linux block I/O subsystem: the bio structure and request lifecycle, I/O schedulers and their algorithms, the transition from single-queue to multi-queue (blk-mq), and how to analyze block layer performance. You'll gain the deep knowledge required for storage performance optimization and kernel-level debugging.
The block layer sits between file systems (and other higher-level consumers) and block device drivers. Its primary responsibilities include:

- Representing I/O operations (bios) and assembling them into requests
- Merging and sorting adjacent requests to reduce the number of device operations
- Scheduling and prioritizing I/O through pluggable I/O schedulers
- Managing per-device queues, tags, and flow control (back-pressure)
- Dispatching requests to device drivers and handling their completions
The block layer has undergone significant evolution:
Single-Queue Era (Pre-3.13)
Every block device had a single request queue protected by one lock, serviced by elevator-style schedulers (noop, deadline, CFQ). This worked well for spinning disks but became a lock-contention bottleneck on fast SSDs and many-core systems.

Multi-Queue Era (3.13+)
The blk-mq framework introduced per-CPU software queues that map onto one or more hardware queues, removing the single submission lock and exposing the parallelism of modern devices.
As of Linux 5.0, the legacy single-queue path (request_fn) was removed entirely. All block devices now use the blk-mq (multi-queue) infrastructure, even if they only expose a single hardware queue. This unified architecture simplifies the codebase while supporting everything from USB sticks to NVMe arrays.
The bio (block I/O) structure is the primary unit of I/O in the block layer. It describes a single I/O operation: a contiguous range of sectors to read or write, along with the memory buffers involved.
```c
struct bio {
	struct bio		*bi_next;	/* Next bio in list */
	struct block_device	*bi_bdev;	/* Target block device */
	blk_opf_t		bi_opf;		/* Operation and flags */
	unsigned short		bi_flags;	/* BIO_* flags */
	unsigned short		bi_ioprio;	/* I/O priority */
	blk_status_t		bi_status;	/* Completion status */
	atomic_t		__bi_remaining;	/* Pending segment count */
	struct bvec_iter	bi_iter;	/* Current iterator state */
	blk_qc_t		bi_cookie;	/* For polling */
	bio_end_io_t		*bi_end_io;	/* Completion callback */
	void			*bi_private;	/* Owner private data */
	unsigned short		bi_vcnt;	/* Number of segments */
	unsigned short		bi_max_vecs;	/* Max segments capacity */
	atomic_t		__bi_cnt;	/* Reference count */
	struct bio_vec		*bi_io_vec;	/* Segment array */
	struct bio_set		*bi_pool;	/* Pool for allocation */

	/* Inline bio_vecs for small I/Os */
	struct bio_vec		bi_inline_vecs[];
};

/* Iterator tracks current position in bio */
struct bvec_iter {
	sector_t	bi_sector;	/* Current sector */
	unsigned int	bi_size;	/* Remaining bytes */
	unsigned int	bi_idx;		/* Current vector index */
	unsigned int	bi_bvec_done;	/* Bytes completed in current vec */
};

/* Each segment points to a page range */
struct bio_vec {
	struct page	*bv_page;	/* Page containing data */
	unsigned int	bv_len;		/* Length in bytes */
	unsigned int	bv_offset;	/* Offset within page */
};
```

Sector Addressing
The bio specifies disk locations in 512-byte sectors, regardless of the device's actual sector size. The bi_iter.bi_sector field holds the starting sector.
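Because bi_iter.bi_sector is always in 512-byte units, code that works in byte offsets has to convert explicitly. A minimal sketch of the arithmetic using the kernel's SECTOR_SHIFT/SECTOR_SIZE constants (the helper names here are illustrative, not kernel APIs):

```c
#include <linux/blkdev.h>	/* SECTOR_SHIFT (9), SECTOR_SIZE (512) */

/* Illustrative helpers: convert between byte offsets and 512-byte sectors. */
static inline sector_t bytes_to_sector(loff_t byte_off)
{
	return byte_off >> SECTOR_SHIFT;	/* divide by 512 */
}

static inline loff_t sector_to_bytes(sector_t sector)
{
	return (loff_t)sector << SECTOR_SHIFT;	/* multiply by 512 */
}

/* Example: a 4096-byte block at byte offset 40960 starts at sector 80,
 * even if the device reports a 4096-byte logical block size. */
```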
Scatter-Gather

A single bio can reference multiple non-contiguous memory pages through the bio_vec array. This enables scatter-gather I/O: reading into or writing from discontiguous memory in a single operation.
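The kernel provides iterator macros for walking these segments. A small sketch using bio_for_each_segment (the byte-counting function itself is just for illustration):

```c
#include <linux/bio.h>

/* Count the bytes described by a bio by walking its segments. */
static unsigned int count_bio_bytes(struct bio *bio)
{
	struct bio_vec bvec;
	struct bvec_iter iter;
	unsigned int bytes = 0;

	/* bio_for_each_segment() advances a private copy of bi_iter,
	 * yielding one (page, offset, length) triple per segment. */
	bio_for_each_segment(bvec, bio, iter)
		bytes += bvec.bv_len;

	return bytes;	/* equals bio->bi_iter.bi_size for an unprocessed bio */
}
```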
Operations and Flags
The bi_opf field encodes both the operation type and modifier flags:
```c
/* Operation types (lower bits of bi_opf) */
enum req_op {
	REQ_OP_READ,		/* Read sectors */
	REQ_OP_WRITE,		/* Write sectors */
	REQ_OP_FLUSH,		/* Flush device cache */
	REQ_OP_DISCARD,		/* Trim/unmap sectors */
	REQ_OP_SECURE_ERASE,	/* Securely erase sectors */
	REQ_OP_WRITE_ZEROES,	/* Write zeros (may not allocate) */
	REQ_OP_ZONE_OPEN,	/* Open a zone (zoned devices) */
	REQ_OP_ZONE_CLOSE,	/* Close a zone */
	REQ_OP_ZONE_FINISH,	/* Finish a zone */
	REQ_OP_ZONE_APPEND,	/* Append to zone */
	REQ_OP_ZONE_RESET,	/* Reset zone write pointer */
	REQ_OP_DRV_IN,		/* Driver-specific input */
	REQ_OP_DRV_OUT,		/* Driver-specific output */
};

/* Modifier flags (upper bits of bi_opf) */
#define REQ_FAILFAST_DEV	(1ULL << __REQ_FAILFAST_DEV)
#define REQ_FAILFAST_TRANSPORT	(1ULL << __REQ_FAILFAST_TRANSPORT)
#define REQ_FAILFAST_DRIVER	(1ULL << __REQ_FAILFAST_DRIVER)
#define REQ_SYNC		(1ULL << __REQ_SYNC)		/* Synchronous I/O */
#define REQ_META		(1ULL << __REQ_META)		/* Metadata I/O */
#define REQ_PRIO		(1ULL << __REQ_PRIO)		/* High priority */
#define REQ_NOMERGE		(1ULL << __REQ_NOMERGE)		/* Don't merge */
#define REQ_IDLE		(1ULL << __REQ_IDLE)		/* Low priority */
#define REQ_INTEGRITY		(1ULL << __REQ_INTEGRITY)	/* Data integrity */
#define REQ_FUA			(1ULL << __REQ_FUA)		/* Force unit access */
#define REQ_PREFLUSH		(1ULL << __REQ_PREFLUSH)	/* Issue flush before */
#define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)		/* Read-ahead */
#define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)	/* Background I/O */
#define REQ_POLLED		(1ULL << __REQ_POLLED)		/* Caller will poll */
```

REQ_FUA (Force Unit Access) and REQ_PREFLUSH are critical for data integrity. REQ_PREFLUSH ensures all previous writes reach durable storage before this write executes. REQ_FUA ensures this specific write bypasses the device cache. Together, they implement the ordering guarantees that file system journals depend on.
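To make this concrete, here is a minimal sketch of how a journaling file system might flag its commit-block write. The function name, bdev, page, and context arguments are placeholders; the bio helpers follow the modern bio_alloc() signature used elsewhere on this page:

```c
#include <linux/bio.h>
#include <linux/blkdev.h>

/* Hypothetical: write one commit block so that (a) all previously submitted
 * journal writes are durable first (REQ_PREFLUSH) and (b) the commit block
 * itself bypasses the volatile device cache (REQ_FUA). */
static void submit_commit_block(struct block_device *bdev, struct page *page,
				sector_t sector, bio_end_io_t *done, void *ctx)
{
	struct bio *bio;

	bio = bio_alloc(bdev, 1,
			REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH | REQ_FUA,
			GFP_NOFS);
	bio->bi_iter.bi_sector = sector;
	bio_add_page(bio, page, PAGE_SIZE, 0);

	bio->bi_end_io = done;		/* called once the write is durable */
	bio->bi_private = ctx;
	submit_bio(bio);
}
```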
Let's trace the lifecycle of a block I/O request, from file system submission to device completion.
File systems allocate and initialize bios for their I/O operations:
```c
/* Allocate a bio from a bio_set */
struct bio *bio = bio_alloc(bdev, nr_vecs, opf, GFP_KERNEL);

/* Or use the generic bio_alloc_bioset */
bio = bio_alloc_bioset(bdev, nr_vecs, opf, GFP_KERNEL, &fs_bio_set);

/* Set the starting sector */
bio->bi_iter.bi_sector = sector;

/* Add pages to the bio */
for (each page to include) {
	bio_add_page(bio, page, len, offset);
}

/* Set completion callback */
bio->bi_end_io = my_completion_handler;
bio->bi_private = my_context;

/* Submit the bio */
submit_bio(bio);
```

When submit_bio() is called, the kernel performs several operations:

- Validates the bio against the device's limits and remaps partition offsets
- Accounts the I/O for statistics and cgroup/throttling policies
- Hands the bio to the blk-mq submission path, which first tries to merge it into a request already sitting in the task's plug list or in the scheduler
- If no merge is possible, allocates a new request (and tag) for the bio and queues it for dispatch
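The completion side is symmetric: when the device finishes, the block layer invokes bi_end_io, typically in interrupt or softirq context. A sketch of what my_completion_handler above might look like; the handler name and context structure are placeholders:

```c
#include <linux/bio.h>
#include <linux/completion.h>

/* Hypothetical context the submitter waits on. */
struct my_io_ctx {
	struct completion done;
	blk_status_t status;
};

/* Runs when the device reports completion; keep it short and non-blocking. */
static void my_completion_handler(struct bio *bio)
{
	struct my_io_ctx *ctx = bio->bi_private;

	ctx->status = bio->bi_status;	/* BLK_STS_OK on success */
	complete(&ctx->done);		/* wake the submitting task */
	bio_put(bio);			/* drop the reference from bio_alloc() */
}
```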
Plugging is a critical optimization. When a process submits I/O, it often submits multiple bios in quick succession (e.g., writeback of many dirty pages). Instead of immediately dispatching each bio, the kernel "plugs" the queue:
Between blk_start_plug(&plug) and blk_finish_plug(&plug), submitted bios accumulate in a per-task plug list; when the plug is released (or the list grows too large), they are dispatched together. This reduces hardware overhead (fewer interrupts, better command coalescing) and enables better merging decisions.
```c
void ext4_writepages(struct address_space *mapping, ...)
{
	struct blk_plug plug;

	/* Start plugging - bios will accumulate */
	blk_start_plug(&plug);

	/* Submit many bios for dirty pages */
	for (each extent to write) {
		struct bio *bio = bio_alloc(...);
		/* ... set up bio ... */
		submit_bio(bio);	/* Doesn't dispatch yet */
	}

	/* Unplug - all bios dispatched together */
	blk_finish_plug(&plug);
}
```

The block layer attempts to merge adjacent requests to reduce the number of I/O operations:

- Back merge: the new bio begins exactly where an existing request ends, so it is appended to that request
- Front merge: the new bio ends exactly where an existing request begins, so it is prepended
- Request merge: two existing requests that have become contiguous are combined into one
Merging is only possible when:

- The bios target the same device and are the same operation type (both reads or both writes)
- The sector ranges are physically adjacent on the device
- Neither side carries REQ_NOMERGE
- The combined request stays within the device's limits (maximum sectors and segments)

The adjacency checks are sketched below.
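A simplified sketch of the adjacency test, using the real blk_rq_pos(), blk_rq_sectors(), and bio_end_sector() helpers; the surrounding function is illustrative, not the kernel's actual merge code:

```c
#include <linux/blkdev.h>
#include <linux/bio.h>

/* Illustrative: classify how a new bio could merge into an existing request. */
enum merge_kind { NO_MERGE, BACK_MERGE, FRONT_MERGE };

static enum merge_kind classify_merge(struct request *rq, struct bio *bio)
{
	/* Back merge: bio starts right where the request's data ends. */
	if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
		return BACK_MERGE;

	/* Front merge: bio ends right where the request's data begins. */
	if (bio_end_sector(bio) == blk_rq_pos(rq))
		return FRONT_MERGE;

	return NO_MERGE;	/* the real code also checks op type, flags, limits */
}
```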
As bios proceed through the block layer, they may be converted to or merged into 'struct request' objects. A request can contain multiple bios chained together. The request is what ultimately gets dispatched to the device driver.
I/O schedulers reorder and prioritize requests to optimize device performance. Different schedulers are suited for different workloads and device types.
| Scheduler | Best For | Key Characteristics |
|---|---|---|
| mq-deadline | HDDs, databases | Deadline-based, prevents starvation, maintains read/write queues |
| bfq (Budget Fair Queuing) | Desktop, interactive | Per-process fair queuing, optimizes latency for interactive apps |
| kyber | NVMe SSDs | Minimal overhead, uses token buckets for latency targets |
| none | Fast NVMe, VMs | No reordering, lowest latency, relies on device queue |
mq-deadline is the default for most block devices. It maintains:

- Separate sector-sorted queues for reads and writes, so requests can be dispatched in roughly ascending order
- Separate FIFO lists for reads and writes, where each request carries an expiration deadline
- A preference for reads over writes, bounded by the writes_starved counter so writes are never starved indefinitely

When the request at the head of a FIFO exceeds its deadline, the scheduler switches to FIFO order to service it, preventing starvation.
Configurable parameters:
- read_expire: Milliseconds before read is considered starved (default: 500)
- write_expire: Milliseconds before write is considered starved (default: 5000)
- writes_starved: Reads to dispatch before servicing starved writes (default: 2)
- fifo_batch: Requests to dispatch in FIFO mode (default: 16)
```bash
# View current scheduler
$ cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none

# Change scheduler
$ echo "bfq" > /sys/block/sda/queue/scheduler

# View scheduler parameters (mq-deadline)
$ ls /sys/block/sda/queue/iosched/
fifo_batch  front_merges  read_expire  write_expire  writes_starved

# Tune read deadline
$ echo 100 > /sys/block/sda/queue/iosched/read_expire
```

BFQ provides fair scheduling at the process level. Each process (or cgroup) gets a fair share of I/O bandwidth, based on assigned weights. Key features:

- Budgets measured in sectors rather than time slices, assigned per process according to its weight
- Low-latency heuristics that detect interactive and soft real-time workloads and temporarily privilege them
- Integration with the cgroup I/O controller for hierarchical, weight-based bandwidth sharing
BFQ is excellent for desktops and mixed workloads but adds overhead that may be unnecessary for data center SSDs.
For high-end NVMe devices with intelligent internal schedulers, the 'none' scheduler often provides the best performance. These devices have massive internal parallelism and can reorder requests themselves. The software scheduler just adds latency.
The multi-queue block layer (blk-mq) was designed to address the scalability limitations of the legacy single-queue architecture. With NVMe devices capable of millions of IOPS, a single lock-protected queue became an unacceptable bottleneck.
blk-mq uses a two-level queue structure:
Software Queues (struct blk_mq_ctx)
One per CPU. Submitting tasks place requests in their local software queue, so the submission path needs no cross-CPU locking.

Hardware Queues (struct blk_mq_hw_ctx)
One per hardware submission queue exposed by the device (NVMe devices typically offer one per CPU; a SATA disk has just one). Software queues are mapped onto hardware queues, and the driver drains hardware queues into the device.
```c
/* Driver provides ops to handle requests */
static const struct blk_mq_ops nvme_mq_ops = {
	.queue_rq	= nvme_queue_rq,	/* Dispatch to HW */
	.complete	= nvme_pci_complete_rq,	/* Handle completion */
	.init_hctx	= nvme_init_hctx,	/* Init HW queue ctx */
	.init_request	= nvme_init_request,	/* Init request */
	.map_queues	= nvme_pci_map_queues,	/* Map SW->HW queues */
	.poll		= nvme_poll,		/* For polling I/O */
};

/* The queue_rq function dispatches to hardware */
static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
				  const struct blk_mq_queue_data *bd)
{
	struct nvme_queue *nvmeq = hctx->driver_data;
	struct request *req = bd->rq;
	struct nvme_command cmd;
	blk_status_t ret;

	/* Build NVMe command from request */
	ret = nvme_setup_cmd(req->q->queuedata, req, &cmd);
	if (ret)
		return ret;

	/* Start tracking for timeout */
	blk_mq_start_request(req);

	/* Submit to NVMe submission queue */
	nvme_submit_cmd(nvmeq, &cmd);
	return BLK_STS_OK;
}
```

blk-mq uses a tag-based request allocation scheme. Each hardware queue has a fixed number of tags (typically 128-4096). A tag is allocated when a request starts and freed on completion. This provides natural back-pressure: if all tags are in use, new requests must wait. The tag also serves as an index for completion tracking.
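For context, here is a rough sketch of how a driver describes its queues and tag space to blk-mq. The field values, my_mq_ops, and struct my_cmd are placeholders, and the exact helpers used to build the gendisk on top of the tag set have shifted slightly across kernel versions:

```c
#include <linux/blk-mq.h>

struct my_cmd { u32 opcode; };			/* hypothetical per-request driver data */
static const struct blk_mq_ops my_mq_ops;	/* would contain .queue_rq etc., as above */

static struct blk_mq_tag_set tag_set;

static int my_driver_register_queues(void)
{
	tag_set.ops          = &my_mq_ops;
	tag_set.nr_hw_queues = 4;			/* one per HW submission queue */
	tag_set.queue_depth  = 128;			/* tags per hardware queue */
	tag_set.numa_node    = NUMA_NO_NODE;
	tag_set.cmd_size     = sizeof(struct my_cmd);	/* allocated alongside each request */
	tag_set.flags        = 0;			/* BLK_MQ_F_* flags; 0 keeps defaults */

	/* Allocates the tag bitmaps and per-queue state; a request_queue and
	 * gendisk are then created on top of the tag set (for example with
	 * blk_mq_alloc_disk() on recent kernels). */
	return blk_mq_alloc_tag_set(&tag_set);
}
```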
The block layer includes frameworks for creating virtual block devices that transform or redirect I/O to underlying physical devices.
Device Mapper is a kernel framework that creates virtual block devices by mapping I/O to one or more underlying devices through target modules:
| Target | Function | Use Case |
|---|---|---|
| linear | Maps range to another device range | Concatenating disks |
| striped | Stripes data across devices | Simple RAID-0 |
| mirror | Mirrors writes to multiple devices | Simple RAID-1 |
| crypt | Encrypts/decrypts I/O | LUKS disk encryption |
| thin | Thin provisioning with snapshots | LVM thin pools |
| cache | Caches slow device with fast device | dm-cache, bcache |
| multipath | Load balances across multiple paths | SAN storage, redundancy |
| snapshot | Copy-on-write snapshots | Backup, testing |
| verity | Integrity verification | Android verified boot |
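To give a feel for what a target module actually does, here is a sketch of a map function modeled on the linear target: device mapper calls it for every bio, and the target redirects the bio by rewriting its device and starting sector. The context structure and names are illustrative, and real targets handle extra cases such as empty flush bios.

```c
#include <linux/device-mapper.h>

/* Hypothetical per-target context, filled in by the target's constructor. */
struct linearish_ctx {
	struct dm_dev	*dev;	/* underlying device, from dm_get_device() */
	sector_t	start;	/* offset of this mapping on that device */
};

/* Called for every bio submitted to the virtual device. */
static int linearish_map(struct dm_target *ti, struct bio *bio)
{
	struct linearish_ctx *ctx = ti->private;

	/* Redirect the bio to the underlying device... */
	bio_set_dev(bio, ctx->dev->bdev);
	/* ...and shift its sector by this target's offset within the table. */
	bio->bi_iter.bi_sector = ctx->start +
				 dm_target_offset(ti, bio->bi_iter.bi_sector);

	return DM_MAPIO_REMAPPED;	/* tell DM to resubmit the remapped bio */
}
```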
```bash
# View device mapper devices
$ dmsetup ls
vg0-root	(253:0)
vg0-swap	(253:1)

# View mapping table
$ dmsetup table vg0-root
0 209715200 linear 8:2 2048

# Create a simple linear mapping
$ echo "0 1048576 linear /dev/sda1 0" | dmsetup create mydev

# Create an encryption layer with cryptsetup
$ cryptsetup luksFormat /dev/sdb1
$ cryptsetup open /dev/sdb1 encrypted_volume
# Creates /dev/mapper/encrypted_volume

# LVM is built on device mapper
$ lvcreate -L 10G -n data vg0
# Creates /dev/mapper/vg0-data
```

The MD subsystem provides software RAID implementations:
| Level | Name | Min Disks | Redundancy | Performance |
|---|---|---|---|---|
| RAID-0 | Stripe | 2 | None | Read/Write: N× |
| RAID-1 | Mirror | 2 | N-1 disk failures | Read: N×, Write: 1× |
| RAID-5 | Striped parity | 3 | 1 disk failure | Read: (N-1)×, Write: reduced |
| RAID-6 | Double parity | 4 | 2 disk failures | Read: (N-2)×, Write: reduced |
| RAID-10 | Striped mirrors | 4 | Depends on layout | Read: N×, Write: N/2× |
Use Device Mapper for flexible volume management (LVM), encryption (LUKS), thin provisioning, and multipath. Use MD for straightforward software RAID where you want kernel-level redundancy. In practice, many setups combine both: MD for RAID, then LVM (device mapper) on top for flexible partitioning.
Understanding block layer performance requires the right observability tools. Linux provides several mechanisms for analyzing I/O behavior.
```bash
# Disk statistics
$ cat /proc/diskstats
   8       0 sda 15324 847 1247658 13072 8521 3812 451720 21504 0 11392 34576

# Fields: major minor name
#   reads completed, reads merged, sectors read, ms reading
#   writes completed, writes merged, sectors written, ms writing
#   I/Os in progress, ms doing I/O, weighted ms

# Per-device queue configuration
$ ls /sys/block/nvme0n1/queue/
add_random     discard_max_bytes     io_poll_delay  max_sectors
chunk_sectors  discard_max_hw_bytes  iostats        max_segment_size
dax            discard_zeroes_data   iosched        max_segments
depth          fua                   logical_block  minimum_io_size
...

# Important queue parameters
$ cat /sys/block/nvme0n1/queue/nr_requests        # Queue depth
$ cat /sys/block/nvme0n1/queue/max_hw_sectors_kb  # Max request size
$ cat /sys/block/nvme0n1/queue/rotational         # 0=SSD, 1=HDD
```

blktrace provides detailed per-request tracing of the entire block I/O path:
```bash
# Capture trace for 30 seconds
$ blktrace -d /dev/sda -o trace -w 30

# Parse and display
$ blkparse -i trace
  8,0    0    1    0.000000000   123  A   W 12345678 + 8 <- (8,1) 12312312
  8,0    0    2    0.000001234   123  Q   W 12345678 + 8 [dd]
  8,0    0    3    0.000002345   123  G   W 12345678 + 8 [dd]
  8,0    0    4    0.000003456   123  I   W 12345678 + 8 [dd]
  8,0    0    5    0.000004567   123  D   W 12345678 + 8 [dd]
  8,0    0    6    0.000505678   123  C   W 12345678 + 8 [0]

# Action codes:
#   A = remap (from another device)
#   Q = queued (bio enters block layer)
#   G = get request (allocated from pool)
#   I = insert to scheduler
#   D = dispatch to driver
#   C = complete

# Generate aggregate statistics
$ btt -i trace.blktrace.0
...D2C (dispatch to complete) latency statistics...
```
```bash
# biolatency: Histogram of I/O latency
$ biolatency-bpfcc
Tracing block device I/O... Hit Ctrl-C to end.

     usecs          : count     distribution
         0 -> 1     : 0        |                    |
         2 -> 3     : 0        |                    |
         4 -> 7     : 35       |*                   |
         8 -> 15    : 1582     |********            |
        16 -> 31    : 4021     |********************|
        32 -> 63    : 2518     |************        |
        64 -> 127   : 851      |****                |
       128 -> 255   : 194      |                    |
       256 -> 511   : 45       |                    |
       512 -> 1023  : 12       |                    |

# biotop: top-like for block I/O
$ biotop-bpfcc
PID    COMM        D MAJ MIN DISK  I/O  Kbytes  AVGms
2481   postgres    W 253 0   dm-0  124  1984    0.78
2481   postgres    R 253 0   dm-0  89   1424    0.52
521    jbd2/dm-0   W 253 0   dm-0  12   48      0.34

# biostacks: I/O with kernel stack traces
$ biostacks-bpfcc
...
```

The classic 'iostat -x 1' remains invaluable for quick I/O analysis. Key metrics: %util (saturation for spinning disks), await (average latency), r_await/w_await (read/write latency), and rkB/s, wkB/s (throughput). For NVMe, focus on latency rather than utilization since parallel queues make %util misleading.
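To show where these metrics come from, here is a small user-space sketch that derives IOPS, await, and %util for one device from two /proc/diskstats samples. The field positions follow the layout shown above, and the device name is hard-coded for brevity:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Subset of /proc/diskstats fields used by iostat-style metrics. */
struct dstat {
	unsigned long long reads, read_ms;	/* reads completed, ms reading */
	unsigned long long writes, write_ms;	/* writes completed, ms writing */
	unsigned long long io_ms;		/* ms spent doing I/O */
};

static int sample(const char *dev, struct dstat *s)
{
	char line[512], name[64];
	unsigned long long f[11];
	FILE *fp = fopen("/proc/diskstats", "r");

	if (!fp)
		return -1;
	while (fgets(line, sizeof(line), fp)) {
		/* major minor name + at least the 11 classic fields */
		if (sscanf(line, "%*u %*u %63s %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
			   name, &f[0], &f[1], &f[2], &f[3], &f[4], &f[5],
			   &f[6], &f[7], &f[8], &f[9], &f[10]) < 12)
			continue;
		if (strcmp(name, dev))
			continue;
		s->reads = f[0];  s->read_ms = f[3];
		s->writes = f[4]; s->write_ms = f[7];
		s->io_ms = f[9];
		fclose(fp);
		return 0;
	}
	fclose(fp);
	return -1;
}

int main(void)
{
	const char *dev = "sda";	/* adjust for your system */
	struct dstat a, b;

	if (sample(dev, &a) < 0)
		return 1;
	sleep(1);			/* roughly 1000 ms sampling interval */
	if (sample(dev, &b) < 0)
		return 1;

	unsigned long long ios = (b.reads - a.reads) + (b.writes - a.writes);
	unsigned long long wait_ms = (b.read_ms - a.read_ms) + (b.write_ms - a.write_ms);

	/* await: average time per completed I/O; %util: busy time / wall time */
	printf("%s: %llu IOPS, await %.2f ms, util %.1f%%\n", dev, ios,
	       ios ? (double)wait_ms / ios : 0.0,
	       (b.io_ms - a.io_ms) / 10.0);
	return 0;
}
```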
Standard buffered I/O passes through the page cache. For certain workloads, bypassing the cache provides better performance. Additionally, modern high-performance I/O uses new interfaces that minimize system call overhead.
Direct I/O bypasses the page cache, transferring data directly between user buffers and the device. A file is opened with the O_DIRECT flag, and buffers, file offsets, and transfer lengths generally must be aligned to the device's logical block size. Databases that manage their own buffer caches are the classic users.
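A minimal user-space sketch of a direct read, assuming a 4096-byte logical block size and a test file path (check /sys/block/<dev>/queue/logical_block_size on a real system):

```c
#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	const size_t blksz = 4096;	/* assumed logical block size */
	void *buf;

	/* O_DIRECT: transfers bypass the page cache entirely. */
	int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Buffer, offset, and length must all be block-aligned. */
	if (posix_memalign(&buf, blksz, blksz)) {
		close(fd);
		return 1;
	}

	ssize_t n = pread(fd, buf, blksz, 0);	/* aligned offset 0 */
	if (n < 0)
		perror("pread");
	else
		printf("read %zd bytes directly from disk\n", n);

	free(buf);
	close(fd);
	return 0;
}
```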
io_uring (added in Linux 5.1) provides high-performance asynchronous I/O through a pair of ring buffers (submission and completion queues) shared between kernel and user space.

Key io_uring features:

- Batched submission: many operations can be queued and submitted with a single io_uring_enter() call
- True asynchrony for both buffered and direct I/O, without the O_DIRECT-only limitation of the older AIO interface
- Registered (fixed) files and buffers that avoid per-I/O reference and mapping overhead
- SQPOLL mode, where a kernel thread polls the submission ring so submissions need no system call at all
- A growing set of operations beyond read/write: fsync, socket sends and receives, accept, timeouts, and more
```c
#include <liburing.h>

struct io_uring ring;
struct io_uring_sqe *sqe;
struct io_uring_cqe *cqe;

/* Initialize ring with 256 entries */
io_uring_queue_init(256, &ring, 0);

/* Prepare a read operation */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, size, offset);
sqe->user_data = my_context;	/* For identification */

/* Submit (can batch multiple sqes) */
io_uring_submit(&ring);

/* Wait for completion */
io_uring_wait_cqe(&ring, &cqe);
if (cqe->res >= 0) {
	/* Success: cqe->res is bytes read */
}
io_uring_cqe_seen(&ring, cqe);

/* Cleanup */
io_uring_queue_exit(&ring);
```

io_uring can achieve millions of IOPS on modern NVMe devices, approaching raw device performance. With SQPOLL enabled, the kernel continuously polls for submissions, eliminating system call overhead entirely for high-throughput workloads. This makes io_uring essential for storage-intensive applications like databases and network file servers.
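Enabling SQPOLL is a matter of setup flags. A sketch using liburing's io_uring_queue_init_params(); note that SQPOLL has historically required elevated privileges and works best with registered files, and the idle timeout value here is arbitrary:

```c
#include <liburing.h>
#include <string.h>

int setup_sqpoll_ring(struct io_uring *ring)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_SQPOLL;	/* kernel thread polls the SQ ring */
	p.sq_thread_idle = 2000;	/* ms of idleness before the thread sleeps */

	/* Afterwards, io_uring_submit() usually just updates the shared ring;
	 * no system call is needed while the poller thread is awake. */
	return io_uring_queue_init_params(256, ring, &p);
}
```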
We've explored the complete Linux block I/O layer. Let's consolidate the key concepts:

- The bio is the unit of block I/O: a sector range plus a scatter-gather list of memory segments
- Plugging, merging, and I/O schedulers (mq-deadline, bfq, kyber, none) shape bios into efficient requests
- blk-mq's per-CPU software queues, hardware queues, and tag-based allocation scale to millions of IOPS
- Device Mapper and MD stack virtual devices (LVM, dm-crypt, software RAID) on top of the same bio interface
- blktrace, the BCC/eBPF tools, and iostat expose where block-layer time is spent
- Direct I/O and io_uring let applications bypass the page cache and minimize per-I/O overhead
What's next:
With the block I/O layer understood, we'll explore the Page Cache—the kernel's primary memory cache for file data. The next page examines how the page cache works, its interaction with file systems and the block layer, writeback policies, and memory pressure handling.
You now understand the Linux block I/O layer—from bio submission through scheduling to device dispatch. This knowledge is essential for storage performance optimization, understanding how file systems interact with hardware, and debugging I/O bottlenecks in production systems.