Above the I/O control layer, everything is abstraction—files, blocks, buffers. Below it, everything is hardware—electrical signals, register values, protocol timings. The I/O control layer is the translator, converting high-level block requests into the precise sequences of commands that make physical devices respond.
This layer is where the rubber meets the road. A request like 'read block 1000' becomes a carefully orchestrated sequence: set up DMA buffers, write to controller registers, issue the command, wait for the interrupt, check status, handle errors, and return data. Each device type—SATA, NVMe, SAS, USB—speaks a different dialect, yet all must be handled through a consistent interface.
For operating system developers, this layer is critical. Bugs here cause data corruption, system crashes, and security vulnerabilities. For system administrators and performance engineers, understanding I/O control explains why some devices offer better performance than others and how to maximize throughput.
By the end of this page, you will understand how the I/O control layer bridges software and hardware: the role of device controllers, the differences between programmed I/O and DMA, how interrupts enable efficient I/O, and the architecture of modern storage protocols like NVMe.
The I/O control layer sits at layer 2 in our five-layer file system stack, between the basic file system above and device drivers below. However, in many implementations, I/O control and device drivers are tightly integrated or even combined.
The key distinction: think of I/O control as the shared framework for device communication, and device drivers as the per-device implementations that plug into it.
Basic File System:
"Read physical block 5000 into this buffer"
│
▼
I/O Control Layer:
- Set up DMA descriptor pointing to buffer
- Create command: READ, LBA=5000, count=1
- Submit command to device queue
- Wait for completion interrupt
- Check status, handle errors
- Return result to caller
│
▼
Device Driver:
- Format command for specific device protocol (SATA, NVMe, etc.)
- Write to device-specific registers
- Handle device-specific quirks and error codes
│
▼
Device Controller Hardware:
- Parse command registers
- Perform DMA transfer
- Read data from platters/flash
- Signal completion via interrupt
The I/O control layer implements the mechanics of I/O while device drivers implement device-specific semantics.
Common patterns—like interrupt handling, DMA management, and command queuing—are shared across many devices. The I/O control layer provides these reusable mechanisms, while device drivers only implement device-specific details. This reduces code duplication and simplifies driver development.
Every storage device has a device controller (also called a host bus adapter or storage adapter)—hardware that mediates between the system bus and the device itself. The controller exposes a set of registers that software uses to communicate with the device.
| Register Type | Purpose | Example |
|---|---|---|
| Command Register | Specifies the operation to perform | READ, WRITE, IDENTIFY |
| Status Register | Reports device/controller state | Ready, Busy, Error |
| Data Registers | Hold data for programmed I/O | 8/16-bit data port |
| Address/LBA Registers | Specify block address for operation | Logical Block Address |
| Count Registers | Number of blocks to transfer | Sector count |
| Control Registers | Device configuration | Interrupt enable, reset |
| Error Registers | Error code when status indicates failure | Read/write fault, seek error |
SATA Controller Registers (simplified):
┌─────────────────────────────────────────────────┐
│ Command FIS: │
│ ├─ Command: 0x25 (READ DMA EXT) │
│ ├─ LBA [47:0]: Block address │
│ ├─ Count [15:0]: Sector count │
│ └─ Device: 0xE0 (LBA mode) │
│ │
│ Status Register: │
│ ├─ BSY: Controller busy │
│ ├─ DRQ: Data request (ready for transfer) │
│ ├─ ERR: Error occurred │
│ └─ RDY: Device ready │
│ │
│ DMA Setup: │
│ ├─ PRD Table Address: Points to DMA buffer │
│ └─ Transfer Direction: Read or Write │
└─────────────────────────────────────────────────┘
The CPU accesses controller registers through two mechanisms:
1. Port-Mapped I/O (PMIO): a separate I/O address space, accessed with special instructions (IN, OUT)
; Poll the ATA status register (port 0x1F7) until the drive is no longer busy
wait_busy:
    IN   AL, 0x1F7    ; Read status register
    TEST AL, 0x80     ; Check BSY bit
    JNZ  wait_busy    ; If busy, keep waiting
2. Memory-Mapped I/O (MMIO): device registers mapped into the normal memory address space, accessed with ordinary loads and stores
// Access NVMe register at memory-mapped address
volatile uint32_t *nvme_cap = (uint32_t *)(nvme_bar + 0x00);
uint32_t capabilities = *nvme_cap; // Just read memory!
Modern devices (NVMe, AHCI) use MMIO exclusively, gaining better performance and far larger register spaces; port I/O survives mainly for legacy compatibility.
When accessing MMIO registers in C, the pointer must be declared 'volatile'. Otherwise, the compiler may cache register values in CPU registers, missing updates from the hardware. This is a common source of bugs in driver development.
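A minimal illustration of the failure mode; the base address `bar`, offset `STATUS_OFFSET`, and bit mask `BSY` are placeholders for a real controller's definitions:
// 'volatile' forces a fresh load from the device on every iteration.
// Without it, the compiler may hoist the read out of the loop and
// spin forever on a stale value cached in a CPU register.
volatile uint32_t *status = (volatile uint32_t *)(bar + STATUS_OFFSET);
while (*status & BSY)
    ;  // each check re-reads the hardware status register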
There are two fundamental ways to transfer data between devices and memory: having the CPU move every byte (Programmed I/O), or having the device controller move data directly (DMA).
With PIO, the CPU explicitly reads from or writes to device data registers:
// Read 512 bytes (256 16-bit words) from disk using PIO
uint16_t buffer[256];
for (int i = 0; i < 256; i++) {
    // Each word requires a separate CPU instruction
    buffer[i] = inw(0x1F0);  // Read 16-bit data port
}
// The CPU was busy for the entire transfer!
PIO characteristics:
- Simple: no descriptors or buffer setup required
- The CPU is fully occupied for the duration of the transfer
- Throughput is limited by the rate of CPU load/store instructions
- Acceptable for tiny transfers and early boot code; far too slow for bulk data
With DMA, the device controller transfers data directly to/from memory without CPU involvement:
// Read 512 bytes from disk using DMA
// 1. Set up DMA descriptor
dma_descriptor.buffer_address = phys_addr(buffer);
dma_descriptor.byte_count = 512;
// 2. Tell controller where the descriptor is
write_reg(DMA_DESCRIPTOR_ADDR, phys_addr(&dma_descriptor));
// 3. Start the transfer
write_reg(COMMAND, READ_DMA);
// 4. CPU does other work while DMA runs
schedule_other_work();
// 5. Interrupt signals completion
// (handled by interrupt handler)
DMA characteristics:
- The CPU issues a few register writes to set up the transfer, then is free for other work
- The controller moves data directly between the device and memory over the bus
- Completion is signaled by an interrupt
- Requires physical addresses and pinned buffers, as described below
DMA introduces complexity: the device needs physical memory addresses (not virtual), and the buffer must remain valid during the transfer:
Virtual Address Space: Physical Memory:
┌─────────────────────┐ ┌──────────────────┐
│ │ │ │
│ Application │ │ Kernel memory │
│ buffer │ │ │
│ (page 1) ─────────┼────────▶│ Physical page │
│ (page 2) ─────────┼────┐ │ 0x00200000 │
│ │ │ │ │
└─────────────────────┘ │ │ Physical page │
└───▶│ 0x00500000 │
│ │
└──────────────────┘
Problem: Virtual buffer may span multiple
non-contiguous physical pages!
Scatter-Gather DMA solves this: the DMA descriptor contains a list of (physical address, length) pairs, allowing transfer to/from non-contiguous physical regions.
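As a sketch, a scatter-gather list for the two-page buffer in the diagram might look like this; the structure layout and `SG_LAST_ENTRY` flag are hypothetical, as real formats (ATA PRD tables, NVMe PRPs/SGLs) differ in detail:
// Hypothetical scatter-gather list: each entry describes one physically
// contiguous segment; the controller walks the entries in order.
#define SG_LAST_ENTRY 0x1   // invented flag marking the final entry
struct sg_entry {
    uint64_t phys_addr;     // physical address of this segment
    uint32_t length;        // bytes in this segment
    uint32_t flags;         // e.g., SG_LAST_ENTRY on the last entry
};
// An 8 KiB buffer spanning two non-contiguous pages becomes two entries:
struct sg_entry sg_list[2] = {
    { .phys_addr = 0x00200000, .length = 4096, .flags = 0 },
    { .phys_addr = 0x00500000, .length = 4096, .flags = SG_LAST_ENTRY },
};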
Buffer pinning: DMA buffers must be 'pinned'—prevented from being paged out or moved—for the duration of the transfer. Otherwise, DMA would write to the wrong physical location.
When DMA writes to memory, the CPU cache may contain stale data at that address. Before reading DMA'ed data, the CPU must invalidate those cache lines. Similarly, before DMA reads from memory, dirty cache lines must be flushed. Cache-coherent DMA (supported by most modern systems) handles this automatically, but embedded systems often require explicit cache management.
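On such systems, the explicit management looks roughly like this, using the Linux streaming DMA API; `dma_handle` is assumed to be the buffer's bus address from an earlier mapping call:
// Before the device reads the buffer (memory -> device):
// flush dirty cache lines so the device sees the latest data.
dma_sync_single_for_device(dev, dma_handle, len, DMA_TO_DEVICE);
// ... start the device-bound transfer ...

// After the device writes the buffer (device -> memory):
// invalidate stale cache lines before the CPU touches the data.
dma_sync_single_for_cpu(dev, dma_handle, len, DMA_FROM_DEVICE);
// ... the CPU may now safely read the buffer ...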
When the CPU initiates an I/O operation, it can't simply wait (polling would waste CPU cycles). Instead, devices signal completion via interrupts—hardware signals that immediately grab the CPU's attention.
1. CPU issues command to device
│
▼
2. CPU continues other work
│ Device executing...
│
3. Device completes, asserts interrupt line
│
▼
4. CPU detects interrupt, saves current state
│
▼
5. CPU jumps to Interrupt Handler
│
▼
6. Handler processes completion
│
▼
7. CPU resumes previous work
// Interrupt handler for storage controller (simplified)
void storage_interrupt_handler(int irq, void *dev_id) {
    struct storage_device *dev = dev_id;

    // 1. Read status to determine what happened
    uint32_t status = read_reg(dev, STATUS_REG);

    // 2. Acknowledge interrupt (clear pending bits by writing them back)
    write_reg(dev, STATUS_REG, status);

    // 3. Find the completed request
    struct request *req = dev->current_request;

    // 4. Check for errors
    if (status & ERROR_BIT) {
        req->status = -EIO;
        handle_error(dev, status);
    } else {
        req->status = SUCCESS;
    }

    // 5. Wake up the waiting process
    complete(&req->completion);

    // 6. Start the next queued request, if any
    if (!queue_empty(&dev->request_queue)) {
        struct request *next = dequeue(&dev->request_queue);
        start_transfer(dev, next);
    }
}
| Type | Description | Typical Use |
|---|---|---|
| Level-triggered | Interrupt asserted while the line is held at its active level | Legacy PCI |
| Edge-triggered | Interrupt on low-to-high transition | MSI, modern |
| MSI (Message Signaled) | Interrupt via memory write | PCIe, NVMe |
| MSI-X | Multiple interrupt vectors per device | High-performance |
MSI/MSI-X advantages:
- No shared interrupt lines, so there is no ambiguity about which device signaled
- Delivered as an ordinary memory write over the bus; no dedicated interrupt pins
- MSI-X provides many vectors per device, which can be steered to different CPU cores and queues
High-speed devices can generate millions of interrupts per second, overwhelming the CPU. Interrupt coalescing delays interrupt delivery, allowing multiple completions to be processed per interrupt. NVMe controllers are often configured to coalesce interrupts, trading slightly higher latency for dramatically lower CPU overhead.
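On the host side, the effect is that one interrupt drains a whole batch of completions. A sketch, with hypothetical helper and field names:
// With coalescing, each interrupt typically finds several completion
// entries queued; drain them all, then acknowledge with one doorbell.
void cq_interrupt_handler(struct nvme_queue_pair *qp) {
    while (cq_entry_valid(qp))             // phase bit marks fresh entries
        handle_completion(cq_pop(qp));     // cq_pop also advances qp->cq_head
    writel(qp->cq_head, qp->cq_doorbell);  // one doorbell write per batch
}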
While interrupts enable efficient waiting, they aren't always optimal. Polling—repeatedly checking status—can outperform interrupts in certain scenarios.
Scenario: Slow device (HDD, ~10ms per operation)
Polling:
while (!device_ready()) {
    /* CPU spins, doing nothing */
    /* 10ms of wasted CPU time! */
}
Interrupts:
start_operation();
sleep_until_interrupt(); /* CPU runs other tasks */
// Woken by interrupt after 10ms
// CPU utilization ≈ 0% during wait
Scenario: Fast device (NVMe, ~10μs per operation)
Interrupts:
start_operation();
sleep_until_interrupt();
// Context switch: ~2μs
// Interrupt handler: ~1μs
// Wake up: ~2μs
// Total overhead: ~5μs (50% of operation time!)
Polling:
start_operation();
while (!device_ready()) { /* spin */ }
// 10μs of spinning, but no context switch
// Often completes faster than interrupt path
| Aspect | Polling | Interrupts |
|---|---|---|
| CPU during wait | 100% occupied | Available for other work |
| Latency | Lower (no ctx switch) | Higher (ctx switch overhead) |
| Throughput | Lower (CPU blocked) | Higher (parallelism) |
| Power consumption | High (CPU active) | Low (CPU can sleep) |
| Best for | Ultra-low latency, dedicated systems | General purpose, slow devices |
Adaptive polling: Start with polling; if operation takes too long, switch to interrupt-based waiting.
// Linux io_uring style hybrid
for (int spins = 0; spins < 1000; spins++) {
    if (request_complete())
        return;  // Fast path: no syscall
}
// Slow path: wait for interrupt
wait_for_completion_interruptible(&req->done);
NAPI (Network API): Linux network stack starts in interrupt mode, then switches to polling under high load. Balances latency (interrupts when idle) and throughput (polling when busy).
io_uring polling mode: Modern Linux kernel interface where the kernel polls completion queues in a dedicated thread, eliminating both syscalls and context switches.
For ultra-low latency applications (trading, real-time systems), kernel bypass with user-space polling is common. SPDK (Storage Performance Development Kit) polls NVMe queues from user space, achieving sub-microsecond latencies by avoiding all kernel overhead.
Modern storage devices can execute multiple commands simultaneously. Command queuing allows the host to submit many requests, letting the device reorder and parallelize them for efficiency.
Queue depth is the number of outstanding (submitted but not complete) commands:
| Technology | Max Queue Depth |
|---|---|
| IDE/ATA | 1 (no queuing) |
| SATA NCQ | 32 |
| SAS | 256 |
| NVMe | 65,535 per queue × 65,535 queues |
Higher queue depth enables (sketched below):
- The device can reorder commands to minimize seek time or flash-channel contention
- Internal parallelism (multiple flash channels and dies) stays busy
- Data transfer for one command overlaps execution of others, keeping the bus utilized
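For example, a driver can keep many requests in flight instead of waiting for each to finish; `submit_async` and the `reqs` array are hypothetical:
// Submit a batch of 32 reads back-to-back; the device is free to
// reorder and overlap them internally for maximum parallelism.
for (int i = 0; i < 32; i++)
    submit_async(dev, &reqs[i]);          // queues the command, returns at once
for (int i = 0; i < 32; i++)
    wait_for_completion(&reqs[i].done);   // completions may arrive in any order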
NVMe (Non-Volatile Memory Express) represents the state of the art in storage I/O:
NVMe Architecture:
┌─────────────────────────────────────────────────────────────┐
│ CPU / System Memory │
├─────────────────────────────────────────────────────────────┤
│ Submission Queue 1 Completion Queue 1 │
│ ┌─────┬─────┬─────┬─────┐ ┌─────┬─────┬─────┬─────┐ │
│ │ Cmd │ Cmd │ Cmd │ │ │ Cpl │ Cpl │ │ │ │
│ └──┬──┴─────┴─────┴─────┘ └─────┴──┬──┴─────┴─────┘ │
│ │ │ │
│ ▼ Doorbell write │ Doorbell write │
│ ╔══════════════════════════════════════════════════════╗ │
│ ║ NVMe Controller ║ │
│ ║ DMA Engine ← fetches commands ║ │
│ ║ Flash Controller ← executes ║ │
│ ║ DMA Engine → posts completions ║ │
│ ╚══════════════════════════════════════════════════════╝ │
│ │
│ Note: Queues are in HOST memory, not device memory! │
│ Controller uses DMA to access them. │
└─────────────────────────────────────────────────────────────┘
NVMe workflow:
1. The host writes a command into a submission queue (in host memory)
2. The host writes the queue's tail doorbell register on the controller
3. The controller fetches the command via DMA, executes it, and DMAs the data
4. The controller posts an entry to the completion queue and raises an interrupt (typically MSI-X)
5. The host processes the completion and writes the completion queue's head doorbell
Why NVMe is faster:
- Queues live in host memory: submitting a command is a memory write plus one doorbell write, not a register-by-register dialogue
- Up to 65,535 queues with up to 65,535 entries each, versus SATA's single 32-entry queue
- MSI-X steers each queue's completions to the core that submitted the work
- A lean command set designed for parallel flash rather than mechanical disks
NVMe allows creating separate submission/completion queue pairs per CPU core. Each core submits to its own queue with no locking required. The controller manages fairness across queues. This eliminates I/O stack lock contention on many-core systems.
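A minimal sketch of lockless per-core submission; the queue-pair structure and field names are hypothetical, while `smp_processor_id` and `writel` are the usual Linux helpers:
// Each core owns a submission/completion queue pair, so no lock is
// needed: only this core ever touches this queue's tail.
struct nvme_queue_pair *qp = &dev->queues[smp_processor_id()];
qp->sq[qp->sq_tail] = *cmd;                    // copy command into the SQ slot
qp->sq_tail = (qp->sq_tail + 1) % qp->depth;   // advance the tail, wrapping
writel(qp->sq_tail, qp->sq_doorbell);          // MMIO doorbell: new command ready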
Storage devices can fail in numerous ways. The I/O control layer must detect, report, and when possible, recover from errors.
| Category | Examples | Typical Handling |
|---|---|---|
| Transient | Bit flip, timeout | Retry operation |
| Correctable | ECC-corrected read | Log, continue |
| Uncorrectable | Media failure | Return error, mark bad sector |
| Device failure | Controller hang | Reset, failover |
1. Device detects error
│
▼
2. Error status in register/completion entry
│
▼
3. I/O control layer reads error code
│
▼
4. Determine if retryable
│
├── YES: Retry (typically 3 attempts)
│ │
│ └── Still failing? Continue below
│
▼
5. Log error to kernel log
│
▼
6. Return error to caller (EIO, etc.)
│
▼
7. File system / application decides how to proceed
// Standard error codes returned to the file system
EIO        // Generic I/O error
EBUSY      // Device busy
ETIMEDOUT  // Operation timed out
EMEDIUM    // Bad medium (unreadable sector)
EFAULT     // Bad DMA address/buffer

// Device-specific error detail (varies by protocol)
// SATA: ATA status/error registers
// NVMe: Status Field in the completion entry
//   - Status Code Type (SCT)
//   - Status Code (SC)
//   - e.g., UNRECOVERED_READ_ERROR, WRITE_FAULT
Operations can hang due to device failure or firmware bugs. Timeouts prevent indefinite waits:
// Typical timeout handling (simplified)
rc = wait_for_completion_timeout(&req->done, 30 * HZ);  // 30 seconds
if (rc == 0) {
    // Timed out!
    dev_err(dev, "Command timed out, aborting\n");
    if (send_abort(dev, req->tag) != 0) {
        // The abort failed too; nuclear option: reset the controller
        reset_controller(dev);
    }
    return -ETIMEDOUT;
}
Cascading recovery: Modern stacks implement tiered recovery: retry → abort command → reset port → reset controller → mark device offline.
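A sketch of that escalation logic, with hypothetical helper names:
// Tiered recovery: try the cheapest remedy first and escalate only
// when the previous step fails.
int recover(struct storage_device *dev, struct request *req) {
    for (int attempt = 0; attempt < 3; attempt++)
        if (retry_request(dev, req) == 0)
            return 0;                 // a retry succeeded
    if (send_abort(dev, req->tag) == 0)
        return -EIO;                  // command aborted; report the error
    if (reset_port(dev) == 0)
        return -EIO;                  // link recovered, but the command is lost
    if (reset_controller(dev) == 0)
        return -EIO;                  // controller recovered, command is lost
    mark_device_offline(dev);         // nothing worked: stop using the device
    return -ENODEV;
}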
The most dangerous errors are silent: data is corrupted but no error is indicated. Bit rot, firmware bugs, and cosmic rays can flip bits without detection. This is why enterprise storage uses end-to-end checksums (ZFS, Btrfs) and why databases verify page checksums after reading.
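The read-side defense looks roughly like this; `read_block`, `block_checksum`, and `lookup_checksum` are hypothetical helpers standing in for a real implementation:
// End-to-end verification: recompute a checksum over the data just
// read and compare it to one stored separately (ZFS, for example,
// keeps block checksums in the parent metadata).
int read_verified(struct storage_device *dev, uint64_t lba, void *buf) {
    if (read_block(dev, lba, buf) != 0)
        return -EIO;        // the device itself reported an error
    if (block_checksum(buf, BLOCK_SIZE) != lookup_checksum(lba))
        return -EILSEQ;     // silent corruption caught here
    return 0;
}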
The I/O control layer defines a standard interface that device drivers implement. This allows the upper layers to work with any device that conforms to the interface.
struct blk_mq_ops {
    // Submit a request to the device
    blk_status_t (*queue_rq)(struct blk_mq_hw_ctx *,
                             struct blk_mq_queue_data *);
    // Timeout handling
    enum blk_eh_timer_return (*timeout)(struct request *, bool);
    // Initialize hardware queue
    int (*init_hctx)(struct blk_mq_hw_ctx *, void *, unsigned int);
    // Poll for completions (optional)
    int (*poll)(struct blk_mq_hw_ctx *hctx);
    // ... more operations ...
};
Each device driver implements these operations. The block layer calls them without knowing whether it's talking to SATA, NVMe, virtio, or a RAM disk.
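A driver plugs in by pointing the table at its own functions; the `mydisk_*` names below are hypothetical:
// A hypothetical driver supplies its implementations of the generic ops.
static const struct blk_mq_ops mydisk_mq_ops = {
    .queue_rq  = mydisk_queue_rq,    // translate generic request to device command
    .timeout   = mydisk_timeout,     // abort or reset when a command hangs
    .init_hctx = mydisk_init_hctx,   // per-hardware-queue initialization
};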
Requests flowing through the interface carry:
struct request {
    struct request_queue *q;           // Queue this request belongs to
    struct gendisk *rq_disk;           // Disk device
    sector_t __sector;                 // Starting LBA
    unsigned int __data_len;           // Bytes to transfer
    struct bio *bio;                   // Bio chain (scatter-gather list)
    unsigned short nr_phys_segments;   // Number of DMA segments
    enum req_opf cmd_flags;            // READ, WRITE, FLUSH, etc.
    // ... timing, status, priority, etc.
};
The driver translates this generic request into device-specific commands.
A BIO (Block I/O) structure describes a logically contiguous I/O operation that may span multiple physical pages. BIOs form the atomic unit of I/O in the Linux block layer. Multiple BIOs can be merged into a single request for efficiency.
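The segments themselves are described by `bio_vec` entries, whose shape maps directly onto scatter-gather DMA (simplified from the Linux definition):
// Each bio_vec names one physically contiguous chunk of the I/O:
// exactly the (address, length) form a scatter-gather list needs.
struct bio_vec {
    struct page *bv_page;     // the physical page holding the data
    unsigned int bv_len;      // number of bytes in this segment
    unsigned int bv_offset;   // starting offset within the page
};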
We've explored the I/O control layer—the bridge between abstract block requests and physical device communication. Let's consolidate the key concepts:
- Device controllers expose command, status, and data registers, accessed via port-mapped or memory-mapped I/O
- DMA moves data without CPU involvement, at the cost of physical addressing, buffer pinning, and cache-coherence concerns
- Interrupts free the CPU during I/O waits; polling wins for ultra-fast devices and hybrid schemes combine both
- Command queuing (SATA NCQ, NVMe queue pairs) lets devices reorder and parallelize requests
- The layer detects errors and escalates recovery: retry, abort, reset, offline
- A standard operations interface (e.g., blk_mq_ops) keeps upper layers device-agnostic
What's Next:
The I/O control layer defines how to communicate with devices, but the actual device-specific code lives in device drivers. Next, we'll explore how device drivers implement protocols like SATA, NVMe, and SCSI, and how the operating system manages the driver ecosystem.
You now understand the I/O control layer—how the operating system communicates with storage devices through controllers, DMA, interrupts, and command queues. This knowledge is essential for understanding storage performance and debugging I/O issues.