Throughout this module, we've descended through the I/O software stack—from user-level libraries through the kernel's device-independent layer, into device drivers, and down to interrupt handlers. Now we reach the foundation: the hardware layer itself. This is where electrical signals become data, where timing matters in nanoseconds, and where the physical realities of silicon and copper constrain what software can accomplish.
Understanding the hardware layer isn't about memorizing chip specifications—it's about understanding the fundamental constraints and capabilities that shape every layer of I/O software above. When a driver seems inexplicably slow or a device behaves strangely, the answer often lies in hardware details that no amount of software can circumvent.
By completing this page, you will understand: the physical interface between CPU and devices, how device controllers work, the role of buses and interconnects, memory-mapped I/O and port I/O, DMA operation from the hardware perspective, timing and synchronization requirements, and how all the I/O software layers connect to form a complete system.
The CPU doesn't communicate directly with physical devices like disk platters or network cables. Instead, each device has a controller (also called an adapter or host bus adapter)—a specialized processor that mediates between the device's physical characteristics and the system bus.
What a Controller Does:
The controller converts between the electrical, mechanical, or optical signals of the device and the digital data that software can understand. It buffers data to absorb speed mismatches between the device and the system, performs error detection, and presents the device to software as a small set of registers.
Controller Registers:
Software controls devices through controller registers—small memory locations on the controller that trigger actions when written or reveal status when read:
| Register Type | Purpose | Direction | Example |
|---|---|---|---|
| Command | Tell device what to do | Write | Start transfer, seek, reset |
| Status | Report device state | Read | Ready, busy, error, interrupt pending |
| Data-In | Receive data from device | Read | Byte from keyboard, disk sector |
| Data-Out | Send data to device | Write | Character to display, block to write |
| Control | Configure device behavior | Write | Set speed, enable DMA, enable interrupts |
| Address | Specify location | Write | Block number, DMA address |
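To make the table concrete, here is a minimal sketch of a programmed-I/O routine that drives a hypothetical controller purely through these register types, polling the status register between steps. The register offsets, bit masks, and the read_reg()/write_reg() helpers are invented for illustration; a real driver would use the platform's MMIO or port accessors and a bounded timeout.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical controller layout: offsets and bits are invented */
#define REG_COMMAND     0x00  /* write: start an operation            */
#define REG_STATUS      0x04  /* read:  busy / error bits             */
#define REG_DATA_OUT    0x08  /* write: data to send to the device    */
#define REG_ADDRESS     0x0C  /* write: target block number           */

#define STATUS_BUSY     0x01
#define STATUS_ERROR    0x02
#define CMD_WRITE_BLOCK 0x03

/* Stand-ins for the platform's real MMIO/PMIO accessor functions */
extern uint32_t read_reg(uint32_t offset);
extern void write_reg(uint32_t offset, uint32_t value);

/* Programmed I/O: write one block by driving the registers directly */
int pio_write_block(uint32_t block, const uint32_t *data, size_t words)
{
    size_t i;

    /* Wait until the controller is idle */
    while (read_reg(REG_STATUS) & STATUS_BUSY)
        ;                                        /* spin (a real driver would time out) */

    write_reg(REG_ADDRESS, block);               /* where to write            */
    for (i = 0; i < words; i++)
        write_reg(REG_DATA_OUT, data[i]);        /* feed data word by word    */
    write_reg(REG_COMMAND, CMD_WRITE_BLOCK);     /* kick off the operation    */

    /* Poll for completion and check for errors */
    while (read_reg(REG_STATUS) & STATUS_BUSY)
        ;
    return (read_reg(REG_STATUS) & STATUS_ERROR) ? -1 : 0;
}
```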
Today's controllers are sophisticated computers in their own right. An NVMe SSD controller has multiple ARM cores, a sizeable DRAM cache, and runs a complex firmware stack. A network card controller runs its own firmware (often a small RTOS) and may have specialized packet-processing engines. The "simple register interface" masks enormous complexity.
The CPU needs a way to read and write controller registers. There are two fundamental approaches:
1. Port-Mapped I/O (PMIO):
The CPU has a separate address space for I/O ports, accessed via special instructions (IN, OUT on x86). ISA-era devices often use this.
```c
/*
 * Port-mapped I/O on x86
 * I/O ports have their own address space (0x0000 to 0xFFFF)
 */

#include <stdint.h>

/* Read a byte from an I/O port */
static inline uint8_t inb(uint16_t port)
{
    uint8_t value;
    __asm__ __volatile__("inb %1, %0" : "=a"(value) : "dN"(port));
    return value;
}

/* Write a byte to an I/O port */
static inline void outb(uint16_t port, uint8_t value)
{
    __asm__ __volatile__("outb %0, %1" : : "a"(value), "dN"(port));
}

/* Example: Reading keyboard status */
#define KBD_STATUS_PORT 0x64
#define KBD_DATA_PORT   0x60
#define KBD_OUTPUT_FULL 0x01

uint8_t read_keyboard_char(void)
{
    /* Wait until keyboard has data */
    while (!(inb(KBD_STATUS_PORT) & KBD_OUTPUT_FULL))
        ;   /* Spin wait */

    /* Read the scan code */
    return inb(KBD_DATA_PORT);
}

/* Example: Legacy PIC programming */
#define PIC1_CMD  0x20
#define PIC1_DATA 0x21

void send_eoi(void)
{
    outb(PIC1_CMD, 0x20);   /* End-of-interrupt command */
}
```

2. Memory-Mapped I/O (MMIO):
Controller registers are mapped into the normal memory address space. Reading/writing memory addresses actually accesses device registers. Modern PCIe devices exclusively use MMIO.
```c
/*
 * Memory-Mapped I/O
 * Device registers appear at physical memory addresses
 */

#include <linux/io.h>
#include <linux/pci.h>

struct mydev_regs {
    uint32_t control;    /* Offset 0x00 */
    uint32_t status;     /* Offset 0x04 */
    uint32_t data;       /* Offset 0x08 */
    uint32_t address;    /* Offset 0x0C */
    uint32_t interrupt;  /* Offset 0x10 */
};

/* Map device memory during probe */
void __iomem *regs;   /* __iomem marks I/O memory pointers */

int map_device(struct pci_dev *pdev)
{
    resource_size_t base = pci_resource_start(pdev, 0);  /* BAR 0 */
    resource_size_t size = pci_resource_len(pdev, 0);

    /* Request the memory region (claims exclusive access) */
    if (!request_mem_region(base, size, "mydevice"))
        return -EBUSY;

    /* Map physical address to kernel virtual address */
    regs = ioremap(base, size);
    if (!regs) {
        release_mem_region(base, size);
        return -ENOMEM;
    }
    return 0;
}

/* Reading and writing MMIO - MUST use accessor functions! */
void device_start_operation(uint32_t address, uint32_t cmd)
{
    /* Wrong: *(volatile uint32_t *)regs = ...
     * The compiler might reorder, combine, or eliminate these! */

    /* Correct: use the kernel accessors */
    iowrite32(address, regs + offsetof(struct mydev_regs, address));
    wmb();   /* Write memory barrier - ensure ordering */
    iowrite32(cmd, regs + offsetof(struct mydev_regs, control));
}

uint32_t device_read_status(void)
{
    return ioread32(regs + offsetof(struct mydev_regs, status));
}
```

MMIO locations aren't normal memory—reads and writes have side effects. The compiler's optimizer doesn't know this and may reorder, combine, or eliminate accesses. Always use the kernel accessor functions (ioread32, iowrite32, etc.) and memory barriers (wmb(), rmb()) to ensure correct behavior.
Direct Memory Access (DMA) allows device controllers to transfer data directly to/from main memory without involving the CPU for each byte. This is essential for high-bandwidth devices—the CPU simply couldn't keep up with gigabytes per second of data.
How DMA Works (From the Hardware Side):
1. The driver programs the controller's DMA registers with a memory address, a transfer length, and a direction, then issues a start command.
2. The controller, acting as bus master, issues memory transactions on the interconnect and moves data directly between the device and RAM, one burst at a time.
3. When the transfer completes (or fails), the controller updates its status registers and raises an interrupt.
4. The CPU does other work throughout; it is involved only at setup and completion.
Critical DMA Concepts:
| Concept | Description | Why It Matters |
|---|---|---|
| Bus Mastering | Device can initiate bus transactions | Required for DMA—device drives the bus |
| Physical Address | Actual RAM address | DMA uses physical addresses, not virtual |
| IOVA | I/O Virtual Address (with IOMMU) | What device sees; translated by IOMMU |
| Scatter-Gather | DMA from/to non-contiguous buffers | Avoids copying to make contiguous buffer |
| Cache Coherence | CPU cache vs DMA data consistency | DMA may bypass CPU cache—must sync |
| Bounce Buffer | Intermediate buffer for DMA limitations | When device can't address buffer directly |
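The scatter-gather row deserves a concrete illustration. Below is a hedged sketch of how a Linux driver might map a non-contiguous buffer described by a scatterlist; the descriptor-programming helper mydev_program_descriptor() is invented, since that part is entirely device-specific.

```c
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Hypothetical helper that writes one hardware DMA descriptor (invented) */
extern void mydev_program_descriptor(int idx, dma_addr_t addr, unsigned int len);

/*
 * Map a non-contiguous buffer (described by a scatterlist) for DMA.
 * The device's scatter-gather engine (or the IOMMU) walks the list,
 * so no copy into one big contiguous bounce buffer is needed.
 */
int mydev_map_sg(struct device *dev, struct scatterlist *sgl, int nents)
{
    struct scatterlist *sg;
    int i, mapped;

    /* Map every segment; with an IOMMU some segments may be merged,
     * so the returned count can be smaller than nents */
    mapped = dma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);
    if (mapped == 0)
        return -EIO;

    /* Hand each mapped segment to the device as one descriptor */
    for_each_sg(sgl, sg, mapped, i)
        mydev_program_descriptor(i, sg_dma_address(sg), sg_dma_len(sg));

    /* ... start the transfer, wait for completion ... */

    /* Unmap with the ORIGINAL nents, not the mapped count */
    dma_unmap_sg(dev, sgl, nents, DMA_TO_DEVICE);
    return 0;
}
```

The full streaming-DMA example below shows the simpler single-buffer case with dma_map_single().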
```c
#include <linux/dma-mapping.h>
#include <linux/pci.h>

/*
 * DMA operations in Linux drivers
 */

/* Device-specific register offsets (placeholders for illustration) */
#define DMA_ADDR_REG   0x20
#define DMA_LEN_REG    0x24
#define DMA_CMD_REG    0x28
#define CMD_START_DMA  0x01

struct mydev_data {
    struct pci_dev *pdev;
    void __iomem *regs;       /* Mapped MMIO registers */
    void *tx_buffer;          /* Kernel virtual address */
    dma_addr_t tx_dma;        /* DMA (physical/IOVA) address */
    size_t tx_size;
};

/*
 * Allocate coherent DMA buffer (CPU and device see same data)
 */
int mydev_alloc_dma(struct mydev_data *dev)
{
    dev->tx_size = 4096;

    /* dma_alloc_coherent:
     * - Allocates physically contiguous memory
     * - Returns both kernel VA and DMA address
     * - Memory is cache-coherent (no manual sync needed)
     */
    dev->tx_buffer = dma_alloc_coherent(&dev->pdev->dev, dev->tx_size,
                                        &dev->tx_dma, GFP_KERNEL);
    if (!dev->tx_buffer)
        return -ENOMEM;

    dev_info(&dev->pdev->dev, "DMA buffer: VA=%p, DMA=0x%llx\n",
             dev->tx_buffer, (unsigned long long)dev->tx_dma);
    return 0;
}

/*
 * Streaming DMA: Map existing buffer for one-time transfer
 */
int mydev_transmit(struct mydev_data *dev, void *data, size_t len)
{
    dma_addr_t dma_addr;

    /* Map the buffer for DMA
     * DMA_TO_DEVICE: CPU wrote, device will read
     */
    dma_addr = dma_map_single(&dev->pdev->dev, data, len, DMA_TO_DEVICE);
    if (dma_mapping_error(&dev->pdev->dev, dma_addr)) {
        dev_err(&dev->pdev->dev, "DMA mapping failed\n");
        return -EIO;
    }

    /* dma_map_single() already synced the buffer for the device;
     * an explicit sync is only needed if the CPU touches it again */
    dma_sync_single_for_device(&dev->pdev->dev, dma_addr, len,
                               DMA_TO_DEVICE);

    /* Program the device to DMA from this address
     * (the cast assumes the device only accepts 32-bit DMA addresses) */
    iowrite32((uint32_t)dma_addr, dev->regs + DMA_ADDR_REG);
    iowrite32(len, dev->regs + DMA_LEN_REG);
    iowrite32(CMD_START_DMA, dev->regs + DMA_CMD_REG);

    /* Wait for completion... */

    /* Unmap when done */
    dma_unmap_single(&dev->pdev->dev, dma_addr, len, DMA_TO_DEVICE);
    return 0;
}
```

Modern systems include an IOMMU (Intel VT-d, AMD-Vi) that translates device memory accesses just as the MMU translates CPU accesses. The IOMMU prevents devices from accessing memory they shouldn't, isolates each VM's device access, and enables DMA to work with I/O virtual addresses. It's essential for security in systems with untrusted devices (such as GPUs in cloud VMs).
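Related to the bounce-buffer row and the IOMMU note above: a driver declares how many address bits its device can drive using the standard DMA-mask API, and the kernel falls back to bounce buffering or IOMMU remapping when a buffer is out of reach. A minimal sketch of the common 64-bit-with-32-bit-fallback pattern (the function name mydev_set_dma_mask is ours):

```c
#include <linux/dma-mapping.h>
#include <linux/pci.h>

/* Tell the DMA layer how wide the device's address lines are.
 * Buffers the device can't reach are handled by bounce buffers
 * (swiotlb) or remapped by the IOMMU. */
static int mydev_set_dma_mask(struct pci_dev *pdev)
{
    /* Prefer full 64-bit addressing... */
    if (!dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)))
        return 0;

    /* ...otherwise fall back to 32-bit (more bounce buffering likely) */
    return dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
}
```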
The system bus (or system fabric) connects CPUs, memory, and I/O devices. Modern systems have evolved from simple shared buses to complex point-to-point interconnects.
Bus Evolution:
| Bus | Era | Bandwidth | Characteristics |
|---|---|---|---|
| ISA | 1981-2000s | 8-16 MB/s | Simple, slow, shared, legacy devices |
| PCI | 1992-2010s | 133-533 MB/s | Parallel, shared, plug-and-play |
| PCI Express 3.0 | 2010-2017 | 1 GB/s per lane | Serial, point-to-point, scalable |
| PCI Express 4.0 | 2017-2022 | 2 GB/s per lane | 16 lanes = 32 GB/s for GPUs |
| PCI Express 5.0 | 2022+ | 4 GB/s per lane | Current high-end servers, GPUs, NVMe SSDs |
| CXL | 2019+ | Same as PCIe (shares its PHY) | Cache-coherent accelerator and memory attach |
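The per-lane figures in the table follow directly from the signaling rate and the line encoding (8b/10b for Gen1/2, 128b/130b for Gen3 and later). A small sketch of the arithmetic; the helper function is ours, the rates are the published PCIe values:

```c
#include <stdio.h>

/* Approximate usable bandwidth in one direction:
 * GB/s = transfer_rate (GT/s) * encoding_efficiency / 8 bits * lanes */
static double pcie_gb_per_s(double gt_per_s, double efficiency, int lanes)
{
    return gt_per_s * efficiency / 8.0 * lanes;
}

int main(void)
{
    double g3plus = 128.0 / 130.0;   /* 128b/130b encoding, ~98.5% efficient */

    printf("Gen3 x1:  %.2f GB/s\n", pcie_gb_per_s(8.0,  g3plus, 1));   /* ~0.98  */
    printf("Gen4 x16: %.2f GB/s\n", pcie_gb_per_s(16.0, g3plus, 16));  /* ~31.5  */
    printf("Gen5 x4:  %.2f GB/s\n", pcie_gb_per_s(32.0, g3plus, 4));   /* ~15.8  */
    return 0;
}
```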
PCIe Architecture:
PCIe is the dominant I/O interconnect in modern systems. Key concepts:
- Lanes and links: each link is a point-to-point serial connection made of 1 to 16 lanes; wider links multiply bandwidth.
- Root complex: the CPU-side bridge that originates the PCIe hierarchy and connects it to memory.
- Switches and endpoints: switches fan the hierarchy out; endpoints are the actual devices (NICs, SSDs, GPUs).
- Configuration space and BARs: every function exposes standardized configuration registers, and its Base Address Registers describe the MMIO regions the OS maps.
- MSI/MSI-X: interrupts are delivered as memory writes rather than dedicated wires.
The lspci session below shows this hierarchy on a real system.
```bash
# Explore PCIe hierarchy on Linux

# List all PCIe devices
$ lspci
00:00.0 Host bridge: Intel Corporation Device 9a14 (rev 01)
00:02.0 VGA compatible controller: Intel Corporation Device 9a49
00:14.0 USB controller: Intel Corporation Device a0ed
01:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD

# Detailed view with link speed
$ lspci -vvv -s 02:00.0 | grep -E "(LnkCap|LnkSta|Width)"
        LnkCap: Port #0, Speed 8GT/s, Width x4
        LnkSta: Speed 8GT/s, Width x4
# 8 GT/s = Gen3, Width x4 = 4 lanes => ~4 GB/s theoretical

# Tree view showing hierarchy
$ lspci -tv
-[0000:00]-+-00.0  Intel Corporation Host bridge
           +-02.0  Intel Corporation VGA
           +-14.0  Intel Corporation USB Controller
           +-1c.0-[01]----00.0  Intel Corporation I210 NIC
           +-1d.0-[02]----00.0  Samsung Electronics NVMe SSD

# Configuration space dump
$ lspci -xxx -s 02:00.0
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd
00: 4d 14 a8 a8 06 04 10 00 00 02 08 01 00 00 00 00
10: 04 00 08 d4 00 00 00 00 00 00 00 00 00 00 00 00
...
# First bytes (little-endian): 144d = Samsung vendor ID, a8a8 = device ID

# BARs (Base Address Registers) show MMIO regions
$ lspci -vv -s 02:00.0 | grep "Memory at"
        Region 0: Memory at d4080000 (64-bit, non-prefetchable) [size=16K]
```

In multi-socket systems, PCIe devices are attached to a specific CPU socket. Accessing a device from the "wrong" CPU incurs cross-socket latency. High-performance applications pin interrupt handling and memory allocation to the same NUMA node as the device. Check device locality with lspci -vvv | grep 'NUMA node'.
Hardware operates in the time domain of nanoseconds and microseconds. Software that interacts with hardware must respect timing constraints that are often undocumented or surprising.
Common Timing Concerns:
| Scenario | Typical Timing | What Goes Wrong |
|---|---|---|
| Register write propagation | 1-100 ns | Read-after-write may see old value |
| Device reset completion | 10 µs - 1 s | Access during reset causes errors |
| Interrupt latency | 1-10 µs | Real-time requirements missed |
| DMA completion | µs to ms | Polling too fast wastes CPU |
| Device initialization | ms to seconds | Timeout if startup is slow |
| Power state transitions | 10 ms - 1 s | Device unavailable during transition |
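Several of these rows reduce to the same defensive pattern: never wait on a device register without a bound. Below is a hedged sketch of a reset-and-wait routine using the kernel's readl_poll_timeout() helper; the register offset, RESET bit, and the 100 ms budget are invented for illustration.

```c
#include <linux/io.h>
#include <linux/iopoll.h>

/* Hypothetical device: writing CTRL_RESET starts a reset; the bit
 * self-clears when the device is ready again. Values are invented. */
#define REG_CTRL    0x00
#define CTRL_RESET  0x01

int mydev_reset(void __iomem *regs)
{
    u32 val;

    iowrite32(CTRL_RESET, regs + REG_CTRL);

    /* Poll every 10 us, give up after 100 ms instead of hanging the CPU.
     * readl_poll_timeout() sleeps between polls, so it must not be used
     * in atomic context (use readl_poll_timeout_atomic() there). */
    return readl_poll_timeout(regs + REG_CTRL, val,
                              !(val & CTRL_RESET),   /* done when bit clears */
                              10, 100 * 1000);       /* delay_us, timeout_us */
}
```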
Memory Barriers and Ordering:
Modern CPUs and compilers reorder memory operations for performance. This is invisible to normal code but dangerous for device I/O, where the order of register writes matters:
```c
#include <linux/io.h>

/*
 * Memory barriers ensure ordering of memory operations
 */

void start_device_operation(void __iomem *regs)
{
    /* WRONG: with relaxed accessors (or raw pointer stores), the
     * CPU and compiler may reorder these! */
    writel_relaxed(0x1234, regs + ADDR_REG);
    writel_relaxed(0x5678, regs + DATA_REG);
    writel_relaxed(CMD_START, regs + CMD_REG);
    /* Hardware might see CMD_START before ADDR is set! */

    /* CORRECT: use barriers to enforce ordering */
    writel_relaxed(0x1234, regs + ADDR_REG);
    wmb();   /* Write Memory Barrier - all prior writes complete first */
    writel_relaxed(0x5678, regs + DATA_REG);
    wmb();
    writel_relaxed(CMD_START, regs + CMD_REG);
    /* Now hardware is guaranteed to see the writes in order.
     * (Plain writel()/iowrite32() already include the needed barriers;
     * the explicit wmb() matters with the _relaxed variants.) */
}

/*
 * Linux barrier types:
 *
 * mb()   - Full memory barrier (reads and writes)
 * rmb()  - Read memory barrier (force ordering of reads)
 * wmb()  - Write memory barrier (force ordering of writes)
 *
 * smp_mb()  - SMP-safe barrier (no-op on uniprocessor builds)
 * smp_rmb(), smp_wmb() - SMP variants
 *
 * For I/O memory specifically:
 * mmiowb() - ordered MMIO writes between CPUs (legacy; modern kernels
 *            fold this guarantee into spin_unlock())
 *
 * Compiler-only barriers:
 * barrier() - prevents compiler reordering, not CPU reordering
 */

/* Read-modify-write with explicit ordering */
uint32_t safely_update_register(void __iomem *regs)
{
    uint32_t value;

    /* Read current value */
    value = ioread32(regs + STATUS_REG);
    rmb();   /* Ensure the read completes before we use the value
              * (redundant with ioread32, shown for illustration) */

    /* Modify */
    value |= STATUS_ENABLE;

    /* Write back */
    wmb();   /* Ensure all prior writes are visible first */
    iowrite32(value, regs + STATUS_REG);

    return value;
}
```

PCIe memory writes are "posted"—they complete immediately from the CPU's perspective but may take time to reach the device. If you need to know that a write has actually arrived at the device, read from the device after writing: the read cannot complete until all prior posted writes have been delivered. Forgetting this is a common source of driver bugs.
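A common idiom for the posted-write issue just described is to read back any harmless register after the write that must land. A minimal sketch; the register name and offset are invented.

```c
#include <linux/io.h>

#define REG_IRQ_MASK  0x30   /* Hypothetical interrupt-mask register */

/* Disable device interrupts and make sure the device has really seen
 * the write before we return (e.g. before freeing the IRQ handler). */
void mydev_mask_irqs(void __iomem *regs)
{
    iowrite32(0, regs + REG_IRQ_MASK);

    /* Flush the posted write: a read from the same device cannot
     * complete until all earlier posted writes have reached it.
     * The value itself is ignored. */
    (void)ioread32(regs + REG_IRQ_MASK);
}
```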
Let's trace a complete I/O operation—a disk write—through all the layers we've studied in this module. This synthesizes everything: user-level I/O, device-independent software, device drivers, interrupt handlers, and hardware.
Scenario: A program calls write(fd, buffer, 4096) to write 4KB to an NVMe SSD.
Detailed Step-by-Step:
1. User Level:
write(fd, buffer, 4096): the C library issues the system call and the CPU traps into the kernel.
2. VFS / Device-Independent Layer:
The kernel validates the file descriptor and permissions, finds the file's inode, and copies the data into the page cache, marking those pages dirty. For a normal buffered write the call can return here; the disk I/O happens later during writeback (O_DIRECT or fsync() forces it sooner).
3. Block Layer:
Writeback converts the dirty pages into block I/O requests (bios), which are merged and ordered by the I/O scheduler and queued for the device.
4. Device Driver:
The NVMe driver maps the pages for DMA, builds an NVMe write command containing the target LBA and the DMA addresses, places it in a submission queue in host memory, and rings the device's doorbell register with an MMIO write.
5. Hardware:
The SSD controller fetches the command from the submission queue by DMA, then DMAs the 4KB of data from host memory into its internal buffer; its firmware programs the data into flash cells. It writes a completion entry into the completion queue and raises an MSI-X interrupt.
6. Completion:
The interrupt handler processes the completion queue, marks the request complete, and wakes any thread waiting on it; write() returns (or, in the buffered case, returned long before the data reached flash).
Notice how each layer has a specific, focused responsibility. The application doesn't know about DMA or interrupts. The block layer doesn't know whether it's writing to NVMe or SATA. The driver doesn't know about file systems. This separation enables the modularity and flexibility we rely on—you can swap any layer without affecting the others.
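One practical corollary of this walkthrough: a buffered write() can return at step 2, long before the data reaches flash, which is why durability-sensitive applications follow it with fsync(). A minimal user-space sketch (the file name is arbitrary):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buffer[4096];
    memset(buffer, 'x', sizeof(buffer));

    int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Step 1: traps into the kernel; for a buffered write this typically
     * returns once the data sits in the page cache (step 2). */
    if (write(fd, buffer, sizeof(buffer)) != (ssize_t)sizeof(buffer)) {
        perror("write");
        return 1;
    }

    /* Force steps 3-6 to happen now: block layer, NVMe driver, DMA,
     * flash programming, completion interrupt. Returns only after the
     * device reports the data durable. */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}
```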
We've completed our journey through the entire I/O software stack. From the user's simple write() call to the firmware programming flash cells, you now understand how modern operating systems manage the enormous complexity of I/O operations.
The Five Layers Revisited:
| Layer | Key Responsibility | Key Insight |
|---|---|---|
| User-Level I/O | Application interface, buffering, formatting | Buffering reduces system calls by 100-1000x |
| Device-Independent | Uniform naming, protection, caching | Everything is a file; one interface for all |
| Device Drivers | Hardware abstraction, command translation | 70% of OS code; most bugs live here |
| Interrupt Handlers | Asynchronous hardware response | Microsecond constraints; cannot sleep |
| Hardware Layer | Physical I/O, DMA, timing | Nanosecond domain; ordering matters |
Looking Forward:
This module provided the foundation for understanding I/O software architecture. The subsequent modules in this chapter explore specific aspects in greater depth, and with the layered model firmly in mind, each of those topics will make more sense because you'll know where it fits in the overall architecture.
Congratulations! You've completed the I/O Software Layers module. You now understand the complete path from user-space write() calls to hardware flash programming—and all the intricate software layers that make it work seamlessly. This foundation is essential for systems programming, performance optimization, and understanding how operating systems interact with the physical world.