Between the file system's logical view of data and the actual disk platters (or flash cells) lies a critical subsystem: the block I/O layer, usually just called the block layer.
The block layer is invisible to applications yet determines much of your system's I/O performance. Understanding it is essential for anyone working on storage systems, databases, or kernel development.
By the end of this page, you will understand the complete architecture of the Linux block I/O subsystem: the bio structure and request lifecycle, I/O schedulers and their algorithms, the transition from single-queue to multi-queue (blk-mq), and how to analyze block layer performance. You'll gain the deep knowledge required for storage performance optimization and kernel-level debugging.
The block layer sits between file systems (and other higher-level consumers) and block device drivers. Its primary responsibilities include:

- Representing I/O operations (bios) and assembling them into requests
- Merging and sorting adjacent requests to reduce the number of device operations
- Scheduling and prioritizing I/O through pluggable I/O schedulers
- Managing per-device queues, tags, and flow control (back-pressure)
- Dispatching requests to device drivers and handling their completions
The block layer has undergone significant evolution:
Single-Queue Era (Pre-3.13)
Every block device had a single request queue protected by one lock, serviced by elevator-style schedulers (noop, deadline, CFQ). This worked well for spinning disks but became a lock-contention bottleneck on fast SSDs and many-core systems.

Multi-Queue Era (3.13+)
The blk-mq framework introduced per-CPU software queues that map onto one or more hardware queues, removing the single submission lock and exposing the parallelism of modern devices.
As of Linux 5.0, the legacy single-queue path (request_fn) was removed entirely. All block devices now use the blk-mq (multi-queue) infrastructure, even if they only expose a single hardware queue. This unified architecture simplifies the codebase while supporting everything from USB sticks to NVMe arrays.
The bio (block I/O) structure is the primary unit of I/O in the block layer. It describes a single I/O operation: a contiguous range of sectors to read or write, along with the memory buffers involved.
```c
struct bio {
	struct bio		*bi_next;	/* Next bio in list */
	struct block_device	*bi_bdev;	/* Target block device */
	blk_opf_t		bi_opf;		/* Operation and flags */
	unsigned short		bi_flags;	/* BIO_* flags */
	unsigned short		bi_ioprio;	/* I/O priority */
	blk_status_t		bi_status;	/* Completion status */
	atomic_t		__bi_remaining;	/* Pending segment count */
	struct bvec_iter	bi_iter;	/* Current iterator state */
	blk_qc_t		bi_cookie;	/* For polling */
	bio_end_io_t		*bi_end_io;	/* Completion callback */
	void			*bi_private;	/* Owner private data */
	unsigned short		bi_vcnt;	/* Number of segments */
	unsigned short		bi_max_vecs;	/* Max segments capacity */
	atomic_t		__bi_cnt;	/* Reference count */
	struct bio_vec		*bi_io_vec;	/* Segment array */
	struct bio_set		*bi_pool;	/* Pool for allocation */

	/* Inline bio_vecs for small I/Os */
	struct bio_vec		bi_inline_vecs[];
};

/* Iterator tracks current position in bio */
struct bvec_iter {
	sector_t	bi_sector;	/* Current sector */
	unsigned int	bi_size;	/* Remaining bytes */
	unsigned int	bi_idx;		/* Current vector index */
	unsigned int	bi_bvec_done;	/* Bytes completed in current vec */
};

/* Each segment points to a page range */
struct bio_vec {
	struct page	*bv_page;	/* Page containing data */
	unsigned int	bv_len;		/* Length in bytes */
	unsigned int	bv_offset;	/* Offset within page */
};
```

Sector Addressing
The bio specifies disk locations in 512-byte sectors, regardless of the device's actual sector size. The bi_iter.bi_sector field holds the starting sector.
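Because bi_iter.bi_sector is always in 512-byte units, code that works in byte offsets has to convert explicitly. A minimal sketch of the arithmetic using the kernel's SECTOR_SHIFT/SECTOR_SIZE constants (the helper names here are illustrative, not kernel APIs):

```c
#include <linux/blkdev.h>	/* SECTOR_SHIFT (9), SECTOR_SIZE (512) */

/* Illustrative helpers: convert between byte offsets and 512-byte sectors. */
static inline sector_t bytes_to_sector(loff_t byte_off)
{
	return byte_off >> SECTOR_SHIFT;	/* divide by 512 */
}

static inline loff_t sector_to_bytes(sector_t sector)
{
	return (loff_t)sector << SECTOR_SHIFT;	/* multiply by 512 */
}

/* Example: a 4096-byte block at byte offset 40960 starts at sector 80,
 * even if the device reports a 4096-byte logical block size. */
```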
Scatter-Gather

A single bio can reference multiple non-contiguous memory pages through the bio_vec array. This enables scatter-gather I/O: reading into or writing from discontiguous memory in a single operation.
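The kernel provides iterator macros for walking these segments. A small sketch using bio_for_each_segment (the byte-counting function itself is just for illustration):

```c
#include <linux/bio.h>

/* Count the bytes described by a bio by walking its segments. */
static unsigned int count_bio_bytes(struct bio *bio)
{
	struct bio_vec bvec;
	struct bvec_iter iter;
	unsigned int bytes = 0;

	/* bio_for_each_segment() advances a private copy of bi_iter,
	 * yielding one (page, offset, length) triple per segment. */
	bio_for_each_segment(bvec, bio, iter)
		bytes += bvec.bv_len;

	return bytes;	/* equals bio->bi_iter.bi_size for an unprocessed bio */
}
```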
Operations and Flags
The bi_opf field encodes both the operation type and modifier flags:
```c
/* Operation types (lower bits of bi_opf) */
enum req_op {
	REQ_OP_READ,		/* Read sectors */
	REQ_OP_WRITE,		/* Write sectors */
	REQ_OP_FLUSH,		/* Flush device cache */
	REQ_OP_DISCARD,		/* Trim/unmap sectors */
	REQ_OP_SECURE_ERASE,	/* Securely erase sectors */
	REQ_OP_WRITE_ZEROES,	/* Write zeros (may not allocate) */
	REQ_OP_ZONE_OPEN,	/* Open a zone (zoned devices) */
	REQ_OP_ZONE_CLOSE,	/* Close a zone */
	REQ_OP_ZONE_FINISH,	/* Finish a zone */
	REQ_OP_ZONE_APPEND,	/* Append to zone */
	REQ_OP_ZONE_RESET,	/* Reset zone write pointer */
	REQ_OP_DRV_IN,		/* Driver-specific input */
	REQ_OP_DRV_OUT,		/* Driver-specific output */
};

/* Modifier flags (upper bits of bi_opf) */
#define REQ_FAILFAST_DEV	(1ULL << __REQ_FAILFAST_DEV)
#define REQ_FAILFAST_TRANSPORT	(1ULL << __REQ_FAILFAST_TRANSPORT)
#define REQ_FAILFAST_DRIVER	(1ULL << __REQ_FAILFAST_DRIVER)
#define REQ_SYNC		(1ULL << __REQ_SYNC)		/* Synchronous I/O */
#define REQ_META		(1ULL << __REQ_META)		/* Metadata I/O */
#define REQ_PRIO		(1ULL << __REQ_PRIO)		/* High priority */
#define REQ_NOMERGE		(1ULL << __REQ_NOMERGE)		/* Don't merge */
#define REQ_IDLE		(1ULL << __REQ_IDLE)		/* Low priority */
#define REQ_INTEGRITY		(1ULL << __REQ_INTEGRITY)	/* Data integrity */
#define REQ_FUA			(1ULL << __REQ_FUA)		/* Force unit access */
#define REQ_PREFLUSH		(1ULL << __REQ_PREFLUSH)	/* Issue flush before */
#define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)		/* Read-ahead */
#define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)	/* Background I/O */
#define REQ_POLLED		(1ULL << __REQ_POLLED)		/* Caller will poll */
```

REQ_FUA (Force Unit Access) and REQ_PREFLUSH are critical for data integrity. REQ_PREFLUSH ensures all previous writes reach durable storage before this write executes. REQ_FUA ensures this specific write bypasses the device cache. Together, they implement the ordering guarantees that file system journals depend on.
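To make this concrete, here is a minimal sketch of how a journaling file system might flag its commit-block write. The function name, bdev, page, and context arguments are placeholders; the bio helpers follow the modern bio_alloc() signature used elsewhere on this page:

```c
#include <linux/bio.h>
#include <linux/blkdev.h>

/* Hypothetical: write one commit block so that (a) all previously submitted
 * journal writes are durable first (REQ_PREFLUSH) and (b) the commit block
 * itself bypasses the volatile device cache (REQ_FUA). */
static void submit_commit_block(struct block_device *bdev, struct page *page,
				sector_t sector, bio_end_io_t *done, void *ctx)
{
	struct bio *bio;

	bio = bio_alloc(bdev, 1,
			REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH | REQ_FUA,
			GFP_NOFS);
	bio->bi_iter.bi_sector = sector;
	bio_add_page(bio, page, PAGE_SIZE, 0);

	bio->bi_end_io = done;		/* called once the write is durable */
	bio->bi_private = ctx;
	submit_bio(bio);
}
```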
Let's trace the lifecycle of a block I/O request, from file system submission to device completion.
File systems allocate and initialize bios for their I/O operations:
```c
/* Allocate a bio from a bio_set */
struct bio *bio = bio_alloc(bdev, nr_vecs, opf, GFP_KERNEL);

/* Or use the generic bio_alloc_bioset */
bio = bio_alloc_bioset(bdev, nr_vecs, opf, GFP_KERNEL, &fs_bio_set);

/* Set the starting sector */
bio->bi_iter.bi_sector = sector;

/* Add pages to the bio */
for (each page to include) {
	bio_add_page(bio, page, len, offset);
}

/* Set completion callback */
bio->bi_end_io = my_completion_handler;
bio->bi_private = my_context;

/* Submit the bio */
submit_bio(bio);
```

When submit_bio() is called, the kernel performs several operations:

- Validates the bio against the device's limits and remaps partition offsets
- Accounts the I/O for statistics and cgroup/throttling policies
- Hands the bio to the blk-mq submission path, which first tries to merge it into a request already sitting in the task's plug list or in the scheduler
- If no merge is possible, allocates a new request (and tag) for the bio and queues it for dispatch
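The completion side is symmetric: when the device finishes, the block layer invokes bi_end_io, typically in interrupt or softirq context. A sketch of what my_completion_handler above might look like; the handler name and context structure are placeholders:

```c
#include <linux/bio.h>
#include <linux/completion.h>

/* Hypothetical context the submitter waits on. */
struct my_io_ctx {
	struct completion done;
	blk_status_t status;
};

/* Runs when the device reports completion; keep it short and non-blocking. */
static void my_completion_handler(struct bio *bio)
{
	struct my_io_ctx *ctx = bio->bi_private;

	ctx->status = bio->bi_status;	/* BLK_STS_OK on success */
	complete(&ctx->done);		/* wake the submitting task */
	bio_put(bio);			/* drop the reference from bio_alloc() */
}
```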
Plugging is a critical optimization. When a process submits I/O, it often submits multiple bios in quick succession (e.g., writeback of many dirty pages). Instead of immediately dispatching each bio, the kernel "plugs" the queue:
Between blk_start_plug(&plug) and blk_finish_plug(&plug), submitted bios accumulate in a per-task plug list; when the plug is released (or the list grows too large), they are dispatched together. This reduces hardware overhead (fewer interrupts, better command coalescing) and enables better merging decisions.
```c
void ext4_writepages(struct address_space *mapping, ...)
{
	struct blk_plug plug;

	/* Start plugging - bios will accumulate */
	blk_start_plug(&plug);

	/* Submit many bios for dirty pages */
	for (each extent to write) {
		struct bio *bio = bio_alloc(...);
		/* ... set up bio ... */
		submit_bio(bio);	/* Doesn't dispatch yet */
	}

	/* Unplug - all bios dispatched together */
	blk_finish_plug(&plug);
}
```

The block layer attempts to merge adjacent requests to reduce the number of I/O operations:

- Back merge: the new bio begins exactly where an existing request ends, so it is appended to that request
- Front merge: the new bio ends exactly where an existing request begins, so it is prepended
- Request merge: two existing requests that have become contiguous are combined into one
Merging is only possible when:

- The bios target the same device and are the same operation type (both reads or both writes)
- The sector ranges are physically adjacent on the device
- Neither side carries REQ_NOMERGE
- The combined request stays within the device's limits (maximum sectors and segments)

The adjacency checks are sketched below.
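A simplified sketch of the adjacency test, using the real blk_rq_pos(), blk_rq_sectors(), and bio_end_sector() helpers; the surrounding function is illustrative, not the kernel's actual merge code:

```c
#include <linux/blkdev.h>
#include <linux/bio.h>

/* Illustrative: classify how a new bio could merge into an existing request. */
enum merge_kind { NO_MERGE, BACK_MERGE, FRONT_MERGE };

static enum merge_kind classify_merge(struct request *rq, struct bio *bio)
{
	/* Back merge: bio starts right where the request's data ends. */
	if (blk_rq_pos(rq) + blk_rq_sectors(rq) == bio->bi_iter.bi_sector)
		return BACK_MERGE;

	/* Front merge: bio ends right where the request's data begins. */
	if (bio_end_sector(bio) == blk_rq_pos(rq))
		return FRONT_MERGE;

	return NO_MERGE;	/* the real code also checks op type, flags, limits */
}
```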
As bios proceed through the block layer, they may be converted to or merged into 'struct request' objects. A request can contain multiple bios chained together. The request is what ultimately gets dispatched to the device driver.
I/O schedulers reorder and prioritize requests to optimize device performance. Different schedulers are suited for different workloads and device types.
| Scheduler | Best For | Key Characteristics |
|---|---|---|
| mq-deadline | HDDs, databases | Deadline-based, prevents starvation, maintains read/write queues |
| bfq (Budget Fair Queuing) | Desktop, interactive | Per-process fair queuing, optimizes latency for interactive apps |
| kyber | NVMe SSDs | Minimal overhead, uses token buckets for latency targets |
| none | Fast NVMe, VMs | No reordering, lowest latency, relies on device queue |
mq-deadline is the default for most block devices. It maintains:

- Separate sector-sorted queues for reads and writes, so requests can be dispatched in roughly ascending order
- Separate FIFO lists for reads and writes, where each request carries an expiration deadline
- A preference for reads over writes, bounded by the writes_starved counter so writes are never starved indefinitely

When the request at the head of a FIFO exceeds its deadline, the scheduler switches to FIFO order to service it, preventing starvation.
Configurable parameters:
- read_expire: Milliseconds before read is considered starved (default: 500)
- write_expire: Milliseconds before write is considered starved (default: 5000)
- writes_starved: Reads to dispatch before servicing starved writes (default: 2)
- fifo_batch: Requests to dispatch in FIFO mode (default: 16)
```bash
# View current scheduler
$ cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none

# Change scheduler
$ echo "bfq" > /sys/block/sda/queue/scheduler

# View scheduler parameters (mq-deadline)
$ ls /sys/block/sda/queue/iosched/
fifo_batch  front_merges  read_expire  write_expire  writes_starved

# Tune read deadline
$ echo 100 > /sys/block/sda/queue/iosched/read_expire
```

BFQ provides fair scheduling at the process level. Each process (or cgroup) gets a fair share of I/O bandwidth, based on assigned weights. Key features:

- Budgets measured in sectors rather than time slices, assigned per process according to its weight
- Low-latency heuristics that detect interactive and soft real-time workloads and temporarily privilege them
- Integration with the cgroup I/O controller for hierarchical, weight-based bandwidth sharing
BFQ is excellent for desktops and mixed workloads but adds overhead that may be unnecessary for data center SSDs.
For high-end NVMe devices with intelligent internal schedulers, the 'none' scheduler often provides the best performance. These devices have massive internal parallelism and can reorder requests themselves. The software scheduler just adds latency.
The multi-queue block layer (blk-mq) was designed to address the scalability limitations of the legacy single-queue architecture. With NVMe devices capable of millions of IOPS, a single lock-protected queue became an unacceptable bottleneck.
blk-mq uses a two-level queue structure:
Software Queues (struct blk_mq_ctx)
One per CPU. Submitting tasks place requests in their local software queue, so the submission path needs no cross-CPU locking.

Hardware Queues (struct blk_mq_hw_ctx)
One per hardware submission queue exposed by the device (NVMe devices typically offer one per CPU; a SATA disk has just one). Software queues are mapped onto hardware queues, and the driver drains hardware queues into the device.
```c
/* Driver provides ops to handle requests */
static const struct blk_mq_ops nvme_mq_ops = {
	.queue_rq	= nvme_queue_rq,	/* Dispatch to HW */
	.complete	= nvme_pci_complete_rq,	/* Handle completion */
	.init_hctx	= nvme_init_hctx,	/* Init HW queue ctx */
	.init_request	= nvme_init_request,	/* Init request */
	.map_queues	= nvme_pci_map_queues,	/* Map SW->HW queues */
	.poll		= nvme_poll,		/* For polling I/O */
};

/* The queue_rq function dispatches to hardware */
static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
				  const struct blk_mq_queue_data *bd)
{
	struct nvme_queue *nvmeq = hctx->driver_data;
	struct request *req = bd->rq;
	struct nvme_command cmd;
	blk_status_t ret;

	/* Build NVMe command from request */
	ret = nvme_setup_cmd(req->q->queuedata, req, &cmd);
	if (ret)
		return ret;

	/* Start tracking for timeout */
	blk_mq_start_request(req);

	/* Submit to NVMe submission queue */
	nvme_submit_cmd(nvmeq, &cmd);
	return BLK_STS_OK;
}
```

blk-mq uses a tag-based request allocation scheme. Each hardware queue has a fixed number of tags (typically 128-4096). A tag is allocated when a request starts and freed on completion. This provides natural back-pressure: if all tags are in use, new requests must wait. The tag also serves as an index for completion tracking.
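For context, here is a rough sketch of how a driver describes its queues and tag space to blk-mq. The field values, my_mq_ops, and struct my_cmd are placeholders, and the exact helpers used to build the gendisk on top of the tag set have shifted slightly across kernel versions:

```c
#include <linux/blk-mq.h>

struct my_cmd { u32 opcode; };			/* hypothetical per-request driver data */
static const struct blk_mq_ops my_mq_ops;	/* would contain .queue_rq etc., as above */

static struct blk_mq_tag_set tag_set;

static int my_driver_register_queues(void)
{
	tag_set.ops          = &my_mq_ops;
	tag_set.nr_hw_queues = 4;			/* one per HW submission queue */
	tag_set.queue_depth  = 128;			/* tags per hardware queue */
	tag_set.numa_node    = NUMA_NO_NODE;
	tag_set.cmd_size     = sizeof(struct my_cmd);	/* allocated alongside each request */
	tag_set.flags        = 0;			/* BLK_MQ_F_* flags; 0 keeps defaults */

	/* Allocates the tag bitmaps and per-queue state; a request_queue and
	 * gendisk are then created on top of the tag set (for example with
	 * blk_mq_alloc_disk() on recent kernels). */
	return blk_mq_alloc_tag_set(&tag_set);
}
```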
The block layer includes frameworks for creating virtual block devices that transform or redirect I/O to underlying physical devices.
Device Mapper is a kernel framework that creates virtual block devices by mapping I/O to one or more underlying devices through target modules:
| Target | Function | Use Case |
|---|---|---|
| linear | Maps range to another device range | Concatenating disks |
| striped | Stripes data across devices | Simple RAID-0 |
| mirror | Mirrors writes to multiple devices | Simple RAID-1 |
| crypt | Encrypts/decrypts I/O | LUKS disk encryption |
| thin | Thin provisioning with snapshots | LVM thin pools |
| cache | Caches slow device with fast device | dm-cache, bcache |
| multipath | Load balances across multiple paths | SAN storage, redundancy |
| snapshot | Copy-on-write snapshots | Backup, testing |
| verity | Integrity verification | Android verified boot |
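To give a feel for what a target module actually does, here is a sketch of a map function modeled on the linear target: device mapper calls it for every bio, and the target redirects the bio by rewriting its device and starting sector. The context structure and names are illustrative, and real targets handle extra cases such as empty flush bios.

```c
#include <linux/device-mapper.h>

/* Hypothetical per-target context, filled in by the target's constructor. */
struct linearish_ctx {
	struct dm_dev	*dev;	/* underlying device, from dm_get_device() */
	sector_t	start;	/* offset of this mapping on that device */
};

/* Called for every bio submitted to the virtual device. */
static int linearish_map(struct dm_target *ti, struct bio *bio)
{
	struct linearish_ctx *ctx = ti->private;

	/* Redirect the bio to the underlying device... */
	bio_set_dev(bio, ctx->dev->bdev);
	/* ...and shift its sector by this target's offset within the table. */
	bio->bi_iter.bi_sector = ctx->start +
				 dm_target_offset(ti, bio->bi_iter.bi_sector);

	return DM_MAPIO_REMAPPED;	/* tell DM to resubmit the remapped bio */
}
```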
```bash
# View device mapper devices
$ dmsetup ls
vg0-root	(253:0)
vg0-swap	(253:1)

# View mapping table
$ dmsetup table vg0-root
0 209715200 linear 8:2 2048

# Create a simple linear mapping
$ echo "0 1048576 linear /dev/sda1 0" | dmsetup create mydev

# Create an encryption layer with cryptsetup
$ cryptsetup luksFormat /dev/sdb1
$ cryptsetup open /dev/sdb1 encrypted_volume
# Creates /dev/mapper/encrypted_volume

# LVM is built on device mapper
$ lvcreate -L 10G -n data vg0
# Creates /dev/mapper/vg0-data
```

The MD subsystem provides software RAID implementations:
| Level | Name | Min Disks | Redundancy | Performance |
|---|---|---|---|---|
| RAID-0 | Stripe | 2 | None | Read/Write: N× |
| RAID-1 | Mirror | 2 | N-1 disk failures | Read: N×, Write: 1× |
| RAID-5 | Striped parity | 3 | 1 disk failure | Read: (N-1)×, Write: reduced |
| RAID-6 | Double parity | 4 | 2 disk failures | Read: (N-2)×, Write: reduced |
| RAID-10 | Striped mirrors | 4 | Depends on layout | Read: N×, Write: N/2× |
Use Device Mapper for flexible volume management (LVM), encryption (LUKS), thin provisioning, and multipath. Use MD for straightforward software RAID where you want kernel-level redundancy. In practice, many setups combine both: MD for RAID, then LVM (device mapper) on top for flexible partitioning.
Understanding block layer performance requires the right observability tools. Linux provides several mechanisms for analyzing I/O behavior.
```bash
# Disk statistics
$ cat /proc/diskstats
   8       0 sda 15324 847 1247658 13072 8521 3812 451720 21504 0 11392 34576

# Fields: major minor name
#   reads completed, reads merged, sectors read, ms reading
#   writes completed, writes merged, sectors written, ms writing
#   I/Os in progress, ms doing I/O, weighted ms

# Per-device queue configuration
$ ls /sys/block/nvme0n1/queue/
add_random     discard_max_bytes     io_poll_delay  max_sectors
chunk_sectors  discard_max_hw_bytes  iostats        max_segment_size
dax            discard_zeroes_data   iosched        max_segments
depth          fua                   logical_block  minimum_io_size
...

# Important queue parameters
$ cat /sys/block/nvme0n1/queue/nr_requests        # Queue depth
$ cat /sys/block/nvme0n1/queue/max_hw_sectors_kb  # Max request size
$ cat /sys/block/nvme0n1/queue/rotational         # 0=SSD, 1=HDD
```

blktrace provides detailed per-request tracing of the entire block I/O path:
```bash
# Capture trace for 30 seconds
$ blktrace -d /dev/sda -o trace -w 30

# Parse and display
$ blkparse -i trace
  8,0    0    1    0.000000000   123  A   W 12345678 + 8 <- (8,1) 12312312
  8,0    0    2    0.000001234   123  Q   W 12345678 + 8 [dd]
  8,0    0    3    0.000002345   123  G   W 12345678 + 8 [dd]
  8,0    0    4    0.000003456   123  I   W 12345678 + 8 [dd]
  8,0    0    5    0.000004567   123  D   W 12345678 + 8 [dd]
  8,0    0    6    0.000505678   123  C   W 12345678 + 8 [0]

# Action codes:
#   A = remap (from another device)
#   Q = queued (bio enters block layer)
#   G = get request (allocated from pool)
#   I = insert to scheduler
#   D = dispatch to driver
#   C = complete

# Generate aggregate statistics
$ btt -i trace.blktrace.0
...D2C (dispatch to complete) latency statistics...
```
```bash
# biolatency: Histogram of I/O latency
$ biolatency-bpfcc
Tracing block device I/O... Hit Ctrl-C to end.

     usecs          : count     distribution
         0 -> 1     : 0        |                    |
         2 -> 3     : 0        |                    |
         4 -> 7     : 35       |*                   |
         8 -> 15    : 1582     |********            |
        16 -> 31    : 4021     |********************|
        32 -> 63    : 2518     |************        |
        64 -> 127   : 851      |****                |
       128 -> 255   : 194      |                    |
       256 -> 511   : 45       |                    |
       512 -> 1023  : 12       |                    |

# biotop: top-like for block I/O
$ biotop-bpfcc
PID    COMM        D MAJ MIN DISK  I/O  Kbytes  AVGms
2481   postgres    W 253 0   dm-0  124  1984    0.78
2481   postgres    R 253 0   dm-0  89   1424    0.52
521    jbd2/dm-0   W 253 0   dm-0  12   48      0.34

# biostacks: I/O with kernel stack traces
$ biostacks-bpfcc
...
```

The classic 'iostat -x 1' remains invaluable for quick I/O analysis. Key metrics: %util (saturation for spinning disks), await (average latency), r_await/w_await (read/write latency), and rkB/s, wkB/s (throughput). For NVMe, focus on latency rather than utilization since parallel queues make %util misleading.
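To show where these metrics come from, here is a small user-space sketch that derives IOPS, await, and %util for one device from two /proc/diskstats samples. The field positions follow the layout shown above, and the device name is hard-coded for brevity:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Subset of /proc/diskstats fields used by iostat-style metrics. */
struct dstat {
	unsigned long long reads, read_ms;	/* reads completed, ms reading */
	unsigned long long writes, write_ms;	/* writes completed, ms writing */
	unsigned long long io_ms;		/* ms spent doing I/O */
};

static int sample(const char *dev, struct dstat *s)
{
	char line[512], name[64];
	unsigned long long f[11];
	FILE *fp = fopen("/proc/diskstats", "r");

	if (!fp)
		return -1;
	while (fgets(line, sizeof(line), fp)) {
		/* major minor name + at least the 11 classic fields */
		if (sscanf(line, "%*u %*u %63s %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
			   name, &f[0], &f[1], &f[2], &f[3], &f[4], &f[5],
			   &f[6], &f[7], &f[8], &f[9], &f[10]) < 12)
			continue;
		if (strcmp(name, dev))
			continue;
		s->reads = f[0];  s->read_ms = f[3];
		s->writes = f[4]; s->write_ms = f[7];
		s->io_ms = f[9];
		fclose(fp);
		return 0;
	}
	fclose(fp);
	return -1;
}

int main(void)
{
	const char *dev = "sda";	/* adjust for your system */
	struct dstat a, b;

	if (sample(dev, &a) < 0)
		return 1;
	sleep(1);			/* roughly 1000 ms sampling interval */
	if (sample(dev, &b) < 0)
		return 1;

	unsigned long long ios = (b.reads - a.reads) + (b.writes - a.writes);
	unsigned long long wait_ms = (b.read_ms - a.read_ms) + (b.write_ms - a.write_ms);

	/* await: average time per completed I/O; %util: busy time / wall time */
	printf("%s: %llu IOPS, await %.2f ms, util %.1f%%\n", dev, ios,
	       ios ? (double)wait_ms / ios : 0.0,
	       (b.io_ms - a.io_ms) / 10.0);
	return 0;
}
```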
Standard buffered I/O passes through the page cache. For certain workloads, bypassing the cache provides better performance. Additionally, modern high-performance I/O uses new interfaces that minimize system call overhead.
Direct I/O bypasses the page cache, transferring data directly between user buffers and the device. A file is opened with the O_DIRECT flag, and buffers, file offsets, and transfer lengths generally must be aligned to the device's logical block size. Databases that manage their own buffer caches are the classic users.
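A minimal user-space sketch of a direct read, assuming a 4096-byte logical block size and a test file path (check /sys/block/<dev>/queue/logical_block_size on a real system):

```c
#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	const size_t blksz = 4096;	/* assumed logical block size */
	void *buf;

	/* O_DIRECT: transfers bypass the page cache entirely. */
	int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Buffer, offset, and length must all be block-aligned. */
	if (posix_memalign(&buf, blksz, blksz)) {
		close(fd);
		return 1;
	}

	ssize_t n = pread(fd, buf, blksz, 0);	/* aligned offset 0 */
	if (n < 0)
		perror("pread");
	else
		printf("read %zd bytes directly from disk\n", n);

	free(buf);
	close(fd);
	return 0;
}
```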
io_uring (added in Linux 5.1) provides high-performance asynchronous I/O through a pair of ring buffers (submission and completion queues) shared between kernel and user space.

Key io_uring features:

- Batched submission: many operations can be queued and submitted with a single io_uring_enter() call
- True asynchrony for both buffered and direct I/O, without the O_DIRECT-only limitation of the older AIO interface
- Registered (fixed) files and buffers that avoid per-I/O reference and mapping overhead
- SQPOLL mode, where a kernel thread polls the submission ring so submissions need no system call at all
- A growing set of operations beyond read/write: fsync, socket sends and receives, accept, timeouts, and more
```c
#include <liburing.h>

struct io_uring ring;
struct io_uring_sqe *sqe;
struct io_uring_cqe *cqe;

/* Initialize ring with 256 entries */
io_uring_queue_init(256, &ring, 0);

/* Prepare a read operation */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, size, offset);
sqe->user_data = my_context;	/* For identification */

/* Submit (can batch multiple sqes) */
io_uring_submit(&ring);

/* Wait for completion */
io_uring_wait_cqe(&ring, &cqe);
if (cqe->res >= 0) {
	/* Success: cqe->res is bytes read */
}
io_uring_cqe_seen(&ring, cqe);

/* Cleanup */
io_uring_queue_exit(&ring);
```

io_uring can achieve millions of IOPS on modern NVMe devices, approaching raw device performance. With SQPOLL enabled, the kernel continuously polls for submissions, eliminating system call overhead entirely for high-throughput workloads. This makes io_uring essential for storage-intensive applications like databases and network file servers.
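Enabling SQPOLL is a matter of setup flags. A sketch using liburing's io_uring_queue_init_params(); note that SQPOLL has historically required elevated privileges and works best with registered files, and the idle timeout value here is arbitrary:

```c
#include <liburing.h>
#include <string.h>

int setup_sqpoll_ring(struct io_uring *ring)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_SQPOLL;	/* kernel thread polls the SQ ring */
	p.sq_thread_idle = 2000;	/* ms of idleness before the thread sleeps */

	/* Afterwards, io_uring_submit() usually just updates the shared ring;
	 * no system call is needed while the poller thread is awake. */
	return io_uring_queue_init_params(256, ring, &p);
}
```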
We've explored the complete Linux block I/O layer. Let's consolidate the key concepts:

- The bio is the unit of block I/O: a sector range plus a scatter-gather list of memory segments
- Plugging, merging, and I/O schedulers (mq-deadline, bfq, kyber, none) shape bios into efficient requests
- blk-mq's per-CPU software queues, hardware queues, and tag-based allocation scale to millions of IOPS
- Device Mapper and MD stack virtual devices (LVM, dm-crypt, software RAID) on top of the same bio interface
- blktrace, the BCC/eBPF tools, and iostat expose where block-layer time is spent
- Direct I/O and io_uring let applications bypass the page cache and minimize per-I/O overhead
What's next:
With the block I/O layer understood, we'll explore the Page Cache—the kernel's primary memory cache for file data. The next page examines how the page cache works, its interaction with file systems and the block layer, writeback policies, and memory pressure handling.
You now understand the Linux block I/O layer—from bio submission through scheduling to device dispatch. This knowledge is essential for storage performance optimization, understanding how file systems interact with hardware, and debugging I/O bottlenecks in production systems.