Memory-Mapped I/O (MMIO) represents a fundamentally different philosophy from Port-Mapped I/O: rather than treating devices as residents of a separate address realm, MMIO integrates device registers directly into the processor's memory address space. Every device register receives a memory address, and CPU memory instructions (load, store) become the universal interface for all hardware communication.
This architectural choice has profound implications. It eliminates the need for dedicated I/O instructions, enables the full power of the processor's addressing modes for device access, and allows devices with large register spaces or memory buffers to be efficiently addressed. The simplicity and elegance of "everything is memory" has made MMIO the dominant paradigm in modern computer architecture.
From graphics cards with gigabytes of VRAM to network interfaces with thousands of registers, from embedded microcontrollers to the most powerful server processors—Memory-Mapped I/O enables them all.
By the end of this page, you will understand: (1) The fundamental principles of memory-mapped device access, (2) How device registers appear as ordinary memory locations, (3) Memory region configuration and the role of MTRRs/PAT, (4) Programming techniques for MMIO with memory barrier considerations, (5) Hardware address translation and bus routing, and (6) How modern high-performance devices leverage MMIO for maximum throughput.
In Memory-Mapped I/O, device registers and memory buffers are assigned addresses within the processor's physical memory address space. When the CPU issues a memory transaction to such an address, the memory controller recognizes it as a device region and routes the transaction to the appropriate hardware.
The Unified Address Model
Consider a 64-bit processor with a 48-bit physical address space (256 TB). Within this vast space:
| Address Range | Size | Designation | Contents |
|---|---|---|---|
| 0x0000_0000_0000 - 0x0000_0009_FFFF | 640 KB | Conventional Memory | Legacy DOS area, BIOS data |
| 0x0000_000A_0000 - 0x0000_000B_FFFF | 128 KB | Video Memory | Legacy VGA frame buffer |
| 0x0000_000C_0000 - 0x0000_000F_FFFF | 256 KB | ROM Area | BIOS, option ROMs |
| 0x0000_0010_0000 - Low Memory End | Variable | Extended Memory | Main RAM |
| Low Memory End - 4 GB | Variable | PCI MMIO Region | 32-bit device BARs |
| 0x0000_FED0_0000 - 0x0000_FED0_3FFF | 16 KB | HPET | High Precision Event Timer |
| 0x0000_FEE0_0000 - 0x0000_FEE0_0FFF | 4 KB | APIC | Local APIC registers |
| Above 4 GB | Variable | High MMIO | 64-bit device BARs, large devices |
From the CPU's perspective, all accesses use the same instruction set: MOV, LOAD, STORE (or architecture equivalents). The address bus carries the target address, the data bus carries the data, and control signals indicate read or write. The destination (memory vs. device) is determined solely by address decoding, not by instruction type.
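As a concrete illustration of this uniformity, the short sketch below stores to ordinary RAM and to a hypothetical device register using the same C assignment; only the target address (and the volatile qualifier that device access requires) differs. The register address is illustrative, not a real assignment.

```c
#include <stdint.h>

/* Hypothetical MMIO register address - real addresses come from BARs/firmware */
#define DEVICE_CTRL_REG 0xFED00000UL

void unified_store_example(uint32_t *ram_word)
{
    /* Store to RAM: the address decoder routes this to the DRAM controller */
    *ram_word = 0x1234;

    /* Store to a device register: the compiler emits the same store
     * instruction, but the address decoder routes it to the device.
     * volatile prevents the compiler from caching or eliding the access. */
    volatile uint32_t *ctrl = (volatile uint32_t *)DEVICE_CTRL_REG;
    *ctrl = 0x1234;
}
```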
The Memory Hole Concept
Incorporating MMIO into the memory address space creates memory holes—regions of the address space that map to devices instead of RAM. Even if physical RAM exists at those addresses, it becomes inaccessible (hidden or remapped) when devices claim those ranges.
The most significant memory hole on x86 systems exists between the top of usable memory (typically 2-3 GB on legacy systems) and the 4 GB boundary. This MMIO gap accommodates 32-bit PCI device BARs, fixed chipset ranges such as the local APIC and HPET register pages, and the firmware ROM.
On modern systems with more than 4 GB of RAM, the "hidden" memory is remapped to physical addresses above 4 GB through memory controller remapping features.
Modern chipsets implement memory remapping to recover RAM addresses claimed by MMIO. For example, if MMIO occupies addresses 0xC000_0000 to 0xFFFF_FFFF (1 GB), the RAM that would have occupied that space is remapped to 0x1_0000_0000 and above, preserving total usable memory.
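The arithmetic behind remapping is straightforward. The sketch below works through an illustrative configuration (8 GB of installed RAM with a 1 GB MMIO hole below 4 GB); the sizes are assumptions chosen to mirror the example above.

```c
#include <stdint.h>

#define GiB (1ULL << 30)

/* Illustrative remapping arithmetic - the RAM and hole sizes are assumed */
void remap_example(void)
{
    uint64_t installed_ram = 8 * GiB;
    uint64_t mmio_hole     = 1 * GiB;               /* 0xC000_0000 - 0xFFFF_FFFF */
    uint64_t ram_below_4g  = 4 * GiB - mmio_hole;   /* 3 GiB visible below 4 GB */

    /* RAM displaced by the hole reappears above the 4 GB boundary */
    uint64_t remapped    = installed_ram - ram_below_4g;   /* 5 GiB */
    uint64_t top_of_high = 4 * GiB + remapped;              /* 0x2_4000_0000 */

    /* Usable RAM = below-4GB portion + remapped portion = all 8 GiB */
    (void)top_of_high;
}
```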
Understanding MMIO requires tracing a memory transaction from CPU instruction through bus hierarchies to the final device. Let's examine this journey in detail.
The Memory Transaction Lifecycle
When the CPU executes a memory instruction targeting an MMIO address, the following sequence occurs:
Virtual Address Translation: If virtual memory is enabled, the MMU translates the virtual address to a physical address. The page table entry may contain attributes (caching behavior, memory type) affecting MMIO.
Cache Lookup: The CPU checks if the address is cached. For properly configured MMIO regions, the access bypasses cache (marked uncacheable or write-combining).
Memory Controller Decode: The integrated or discrete memory controller examines the physical address against configured ranges. DRAM ranges route to memory; MMIO ranges route to I/O bus hierarchy.
Bus Bridge Traversal: On PCIe systems, the transaction traverses the Root Complex, possibly crossing PCIe switches, until reaching the target device's bridge.
Device BAR Match: The target device compares the address against its configured BAR ranges. A match triggers device register access.
Device Response: The device reads or writes the register and returns data (for reads) or acknowledgment (for writes).
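To tie the lifecycle together, here is a deliberately simplified model of the decode step: a table of physical address ranges, each routed either to DRAM or to the I/O hierarchy. The ranges and structure are illustrative, not how any particular memory controller is implemented.

```c
#include <stddef.h>
#include <stdint.h>

enum route { ROUTE_DRAM, ROUTE_MMIO };

struct addr_range {
    uint64_t base;
    uint64_t size;
    enum route target;
};

/* Illustrative decode map, loosely following the table earlier on this page */
static const struct addr_range decode_map[] = {
    { 0x0000000000000000ULL, 0xC0000000ULL,  ROUTE_DRAM },  /* low RAM */
    { 0x00000000FED00000ULL, 0x4000ULL,      ROUTE_MMIO },  /* HPET */
    { 0x00000000FEE00000ULL, 0x1000ULL,      ROUTE_MMIO },  /* Local APIC */
    { 0x0000000100000000ULL, 0x140000000ULL, ROUTE_DRAM },  /* remapped RAM */
};

/* Returns where a physical address would be routed, or -1 if unclaimed */
int decode_physical_address(uint64_t addr)
{
    for (size_t i = 0; i < sizeof(decode_map) / sizeof(decode_map[0]); i++) {
        if (addr >= decode_map[i].base &&
            addr < decode_map[i].base + decode_map[i].size)
            return decode_map[i].target;
    }
    return -1;  /* Unclaimed: master abort / unsupported request on real hardware */
}
```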
Base Address Registers (BARs)
PCI and PCIe devices advertise their MMIO requirements through Base Address Registers. BARs are configuration space registers that define whether a region decodes memory or I/O space, its base address and size, whether it is 64-bit addressable, and whether it is prefetchable.
During system initialization (BIOS/UEFI or OS enumeration), software assigns addresses to each device's BARs, constructing the system's MMIO map. This dynamic assignment allows the same device to reside at different addresses on different systems.
```c
/*
 * PCI BAR Enumeration and Size Detection
 *
 * This code demonstrates the standard algorithm for discovering
 * a PCI device's MMIO requirements by probing its Base Address Registers.
 */

#include <stdint.h>

/* PCI Configuration Space access (x86 uses ports 0xCF8/0xCFC) */
#define PCI_CONFIG_ADDR 0x0CF8
#define PCI_CONFIG_DATA 0x0CFC

/* BAR types */
#define BAR_TYPE_MASK          0x01
#define BAR_TYPE_MEMORY        0x00
#define BAR_TYPE_IO            0x01
#define BAR_MEM_TYPE_MASK      0x06
#define BAR_MEM_TYPE_32        0x00
#define BAR_MEM_TYPE_64        0x04
#define BAR_MEM_PREFETCH_MASK  0x08

extern void outl(uint16_t port, uint32_t value);
extern uint32_t inl(uint16_t port);

/*
 * Build PCI configuration address for a specific register.
 */
static uint32_t pci_config_addr(uint8_t bus, uint8_t device,
                                uint8_t function, uint8_t offset)
{
    return (1UL << 31) |                       /* Enable bit */
           ((uint32_t)bus << 16) |
           ((uint32_t)(device & 0x1F) << 11) |
           ((uint32_t)(function & 0x07) << 8) |
           (offset & 0xFC);                    /* Dword aligned */
}

/*
 * Read a 32-bit value from PCI configuration space.
 */
uint32_t pci_config_read32(uint8_t bus, uint8_t device,
                           uint8_t function, uint8_t offset)
{
    outl(PCI_CONFIG_ADDR, pci_config_addr(bus, device, function, offset));
    return inl(PCI_CONFIG_DATA);
}

/*
 * Write a 32-bit value to PCI configuration space.
 */
void pci_config_write32(uint8_t bus, uint8_t device,
                        uint8_t function, uint8_t offset, uint32_t value)
{
    outl(PCI_CONFIG_ADDR, pci_config_addr(bus, device, function, offset));
    outl(PCI_CONFIG_DATA, value);
}

/*
 * Determine the size of a MMIO BAR by the standard probe algorithm.
 *
 * Algorithm:
 *   1. Save original BAR value
 *   2. Write all 1s to BAR
 *   3. Read back - device returns size mask (cleared bits = size)
 *   4. Restore original BAR value
 *   5. Calculate size from mask
 */
struct bar_info {
    uint64_t base_address;   /* Current BAR value (assigned address) */
    uint64_t size;           /* Size in bytes */
    uint8_t  type;           /* 0=MMIO, 1=Port I/O */
    uint8_t  is_64bit;       /* 1 if 64-bit BAR */
    uint8_t  prefetchable;   /* 1 if prefetchable memory */
};

int pci_get_bar_info(uint8_t bus, uint8_t device, uint8_t function,
                     uint8_t bar_index, struct bar_info *info)
{
    /* BAR registers start at offset 0x10, each is 4 bytes */
    uint8_t bar_offset = 0x10 + (bar_index * 4);

    /* Read original BAR value */
    uint32_t original = pci_config_read32(bus, device, function, bar_offset);

    /* Determine BAR type */
    if (original & BAR_TYPE_IO) {
        /* I/O BAR - not MMIO */
        info->type = 1;
        info->base_address = original & ~0x03;

        /* Probe size */
        pci_config_write32(bus, device, function, bar_offset, 0xFFFFFFFF);
        uint32_t size_mask = pci_config_read32(bus, device, function, bar_offset);
        pci_config_write32(bus, device, function, bar_offset, original);

        size_mask &= ~0x03;                      /* Clear type bits */
        info->size = (~size_mask + 1) & 0xFFFF;  /* I/O limited to 64 KB */
        info->is_64bit = 0;
        info->prefetchable = 0;
        return 0;
    }

    /* Memory BAR */
    info->type = 0;
    info->is_64bit = ((original & BAR_MEM_TYPE_MASK) == BAR_MEM_TYPE_64);
    info->prefetchable = (original & BAR_MEM_PREFETCH_MASK) ? 1 : 0;

    /* Probe lower 32 bits */
    pci_config_write32(bus, device, function, bar_offset, 0xFFFFFFFF);
    uint32_t low_mask = pci_config_read32(bus, device, function, bar_offset);
    pci_config_write32(bus, device, function, bar_offset, original);

    low_mask &= ~0x0F;                           /* Clear type bits */

    if (info->is_64bit) {
        /* Read upper 32 bits from next BAR */
        uint32_t original_high = pci_config_read32(bus, device, function, bar_offset + 4);

        pci_config_write32(bus, device, function, bar_offset + 4, 0xFFFFFFFF);
        uint32_t high_mask = pci_config_read32(bus, device, function, bar_offset + 4);
        pci_config_write32(bus, device, function, bar_offset + 4, original_high);

        /* Combine to form 64-bit mask and calculate size */
        uint64_t full_mask = ((uint64_t)high_mask << 32) | low_mask;
        info->size = ~full_mask + 1;
        info->base_address = ((uint64_t)original_high << 32) | (original & ~0x0F);
    } else {
        info->size = (~low_mask + 1) & 0xFFFFFFFF;
        info->base_address = original & ~0x0F;
    }

    return 0;
}
```

One of the most critical aspects of MMIO is proper configuration of memory caching behavior. Unlike RAM, where caching improves performance, MMIO regions require careful cache control to ensure correct device behavior.
Why Caching MMIO is Dangerous
Device registers are side-effecting: reading a register might clear an interrupt flag, and writing might trigger an action. If these accesses were cached, reads could return stale values from the cache instead of the device's current state, writes could sit in the cache (or be merged with later writes) and never reach the device, and side effects would fire at unpredictable times, if at all.
To prevent these issues, x86 provides memory type configuration through MTRRs (Memory Type Range Registers) and the PAT (Page Attribute Table).
| Memory Type | Code | Characteristics | MMIO Use Case |
|---|---|---|---|
| Uncacheable (UC) | 0 | No caching, serialized access, no speculation | Standard device registers |
| Write Combining (WC) | 1 | No caching but writes combine, reads may be speculative | Frame buffers, DMA buffers |
| Write Through (WT) | 4 | Cached reads, writes propagate immediately | Rarely used for MMIO |
| Write Protect (WP) | 5 | Cached reads, writes bypass the cache and invalidate cached copies | Not for MMIO |
| Write Back (WB) | 6 | Fully cached, writes delayed | Never use for MMIO |
| Uncacheable Minus (UC-) | 7 | Like UC but can be overridden by WC MTRR | Default MMIO fallback |
Memory Type Range Registers (MTRRs)
MTRRs provide a mechanism to assign memory types to physical address ranges at the hardware level, independent of page tables. There are two types: fixed-range MTRRs, which cover the first 1 MB of the address space in small fixed chunks (for legacy regions such as VGA memory and ROM), and variable-range MTRRs, which describe programmable base/mask pairs covering the rest of physical memory.
Modern systems typically configure variable MTRRs during boot to mark RAM as WB and MMIO regions as UC. The BIOS/UEFI firmware establishes these settings.
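For a sense of what firmware establishes, the sketch below walks the variable-range MTRRs using their architectural MSR numbers; the rdmsr() and kprintf() helpers are assumed to exist in the surrounding environment.

```c
#include <stdint.h>

/* x86 MSR numbers for the MTRR registers */
#define MSR_MTRRCAP        0x0FE
#define MSR_MTRR_PHYSBASE0 0x200   /* PHYSBASEn = 0x200 + 2n */
#define MSR_MTRR_PHYSMASK0 0x201   /* PHYSMASKn = 0x201 + 2n */

/* Assumed helpers: read a 64-bit MSR and print a formatted message */
extern uint64_t rdmsr(uint32_t msr);
extern void kprintf(const char *fmt, ...);

void dump_variable_mtrrs(void)
{
    unsigned int count = rdmsr(MSR_MTRRCAP) & 0xFF;   /* VCNT: number of variable ranges */

    for (unsigned int i = 0; i < count; i++) {
        uint64_t base = rdmsr(MSR_MTRR_PHYSBASE0 + 2 * i);
        uint64_t mask = rdmsr(MSR_MTRR_PHYSMASK0 + 2 * i);

        if (!(mask & (1ULL << 11)))     /* Valid bit clear: range disabled */
            continue;

        /* Bits 7:0 of PHYSBASE hold the memory type (UC=0, WC=1, WB=6, ...) */
        kprintf("MTRR%u: base=%#llx type=%llu mask=%#llx\n",
                i, base & ~0xFFFULL, base & 0xFF, mask & ~0xFFFULL);
    }
}
```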
Page Attribute Table (PAT)
While MTRRs work at the physical level, the PAT allows memory type configuration in page tables—enabling per-page control visible to the operating system. Each page table entry can specify a PAT index that, combined with the PCD and PWT bits, selects one of eight memory types from the PAT configuration register.
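The index computation itself is simple, as the following snippet shows for 4 KB pages: the PAT, PCD, and PWT bits of the page table entry form a 3-bit index into the IA32_PAT MSR.

```c
#include <stdint.h>

/* Page table entry bits relevant to memory type selection (x86, 4 KB pages) */
#define PTE_PWT  (1ULL << 3)   /* Page Write-Through */
#define PTE_PCD  (1ULL << 4)   /* Page Cache Disable */
#define PTE_PAT  (1ULL << 7)   /* PAT bit (bit 12 for large pages) */

/* The 3-bit index PAT:PCD:PWT selects one of the eight PAT entries */
unsigned int pat_index(uint64_t pte)
{
    return ((pte & PTE_PAT) ? 4 : 0) |
           ((pte & PTE_PCD) ? 2 : 0) |
           ((pte & PTE_PWT) ? 1 : 0);
}
```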
The operating system's memory mapping functions must use appropriate flags when creating MMIO mappings. On Linux, this is handled through ioremap() variants.
```c
/*
 * Linux MMIO Mapping Functions
 *
 * The kernel provides several ioremap variants that configure
 * appropriate memory types for different MMIO characteristics.
 */

#include <linux/io.h>
#include <linux/pci.h>
#include <linux/types.h>

/*
 * ioremap() / ioremap_nocache()
 *
 * Maps MMIO region with Uncacheable (UC) memory type.
 * Use for: Standard device registers where every access must
 *          reach the device in order and without caching.
 *
 * @param phys_addr: Physical address of MMIO region (from BAR)
 * @param size:      Size of the region in bytes
 * @returns:         Virtual address for kernel access, or NULL on failure
 */
void __iomem *map_device_registers(phys_addr_t phys_addr, size_t size)
{
    void __iomem *base;

    /* Standard uncacheable mapping for device registers */
    base = ioremap(phys_addr, size);
    if (!base) {
        pr_err("Failed to map MMIO region at %pa\n", &phys_addr);
        return NULL;
    }

    return base;
}

/*
 * ioremap_wc()
 *
 * Maps MMIO region with Write-Combining (WC) memory type.
 * Use for: Frame buffers, large write-only or write-mostly regions
 *          where combining writes improves throughput.
 */
void __iomem *map_framebuffer(phys_addr_t phys_addr, size_t size)
{
    void __iomem *base;

    /* Write-combining for better frame buffer performance */
    base = ioremap_wc(phys_addr, size);
    if (!base) {
        /* Fall back to uncacheable if WC not available */
        base = ioremap(phys_addr, size);
    }

    return base;
}

/*
 * Example: Complete MMIO mapping and access workflow
 */
struct my_device_regs {
    uint32_t control;    /* Offset 0x00: Control register */
    uint32_t status;     /* Offset 0x04: Status register */
    uint32_t interrupt;  /* Offset 0x08: Interrupt status */
    uint32_t data;       /* Offset 0x0C: Data register */
};

struct my_device {
    void __iomem *regs;         /* MMIO register base */
    void __iomem *framebuffer;  /* Write-combining frame buffer */
    phys_addr_t   regs_phys;
    phys_addr_t   fb_phys;
    size_t        fb_size;
};

int my_device_probe(struct pci_dev *pdev, struct my_device *dev)
{
    /* Get BAR 0 for registers (uncacheable) */
    dev->regs_phys = pci_resource_start(pdev, 0);
    dev->regs = ioremap(dev->regs_phys, pci_resource_len(pdev, 0));
    if (!dev->regs)
        return -ENOMEM;

    /* Get BAR 1 for frame buffer (write-combining) */
    dev->fb_phys = pci_resource_start(pdev, 1);
    dev->fb_size = pci_resource_len(pdev, 1);
    dev->framebuffer = ioremap_wc(dev->fb_phys, dev->fb_size);
    if (!dev->framebuffer) {
        iounmap(dev->regs);
        return -ENOMEM;
    }

    pr_info("Device mapped: regs=%pa, fb=%pa (size %zu)\n",
            &dev->regs_phys, &dev->fb_phys, dev->fb_size);
    return 0;
}

void my_device_remove(struct my_device *dev)
{
    if (dev->framebuffer)
        iounmap(dev->framebuffer);
    if (dev->regs)
        iounmap(dev->regs);
}
```

Mapping MMIO as write-back (WB) cacheable memory will cause catastrophic failures. Writes may never reach the device, reads return stale data, and the resulting behavior is undefined. Always use ioremap() or equivalent uncacheable mapping functions for device registers.
Accessing MMIO locations requires more than simple pointer dereferences. Compilers may reorder, combine, or eliminate memory accesses as optimizations, and processors may reorder memory operations for performance. For MMIO, these behaviors can cause device miscommunication.
The Problem with Direct Pointer Access
Consider this naive MMIO access:
```c
void __iomem *regs = ioremap(phys_addr, size);
uint32_t *control = (uint32_t *)regs;

*control = 0x01;  // Start operation
*control = 0x02;  // Don't do this!
```
Problems: the compiler may merge the two stores into one, reorder them relative to surrounding accesses, or discard the first store as a redundant write; the CPU's write buffers may likewise combine or reorder them. The device would then observe a different register sequence than the one programmed.
For device correctness, every write must reach the device, in order, when programmed.
```c
/*
 * Proper MMIO Access Functions
 *
 * Modern kernels provide typed accessor functions that ensure:
 *   1. Volatile semantics (no compiler optimization)
 *   2. Correct memory ordering where needed
 *   3. Architecture-appropriate implementation
 */

#include <linux/io.h>

/* Example register offsets and bits for the hypothetical device used below */
#define STATUS_OFFSET  0x04
#define DATA_OFFSET    0x0C
#define FIFO_OFFSET    0x100
#define READY_BIT      (1U << 0)

/*
 * Basic Accessors (Linux: ioread/iowrite)
 *
 * These functions provide the minimum guarantee: the access will occur.
 * They do NOT provide ordering guarantees with respect to other accesses.
 */
void demonstrate_basic_accessors(void __iomem *regs)
{
    uint32_t status;

    /* Read 32-bit value from MMIO location */
    status = ioread32(regs + 0x04);

    /* Write 32-bit value to MMIO location */
    iowrite32(0x00000001, regs + 0x00);

    /* Size variants */
    iowrite8(0xFF, regs + 0x10);      /* 8-bit write */
    iowrite16(0x1234, regs + 0x12);   /* 16-bit write */

    /* Big-endian variants (if device uses BE register layout) */
    iowrite32be(0x12345678, regs + 0x20);
    uint32_t be_value = ioread32be(regs + 0x20);
}

/*
 * Ordered Accessors (Linux: readl/writel family)
 *
 * These provide stronger ordering guarantees:
 *   - writel: Write is complete before the function returns
 *             (typically includes a read-back fence on weakly-ordered archs)
 *   - readl:  Preceding writes are visible before this read
 */
void demonstrate_ordered_accessors(void __iomem *regs)
{
    uint32_t status;

    /* Strongly-ordered write: flushes write buffers */
    writel(0x00000001, regs + 0x00);

    /* Read back to confirm write completion */
    status = readl(regs + 0x04);

    /* The write at offset 0x00 is guaranteed to have reached
     * the device before the read at offset 0x04 returns */
}

/*
 * Relaxed Accessors (Linux: writel_relaxed/readl_relaxed)
 *
 * Maximum performance, minimum ordering. Use when:
 *   - Batching many writes before a final ordered access
 *   - Writes can be safely reordered
 *   - You explicitly manage barriers
 */
void demonstrate_relaxed_accessors(void __iomem *regs, const uint32_t *data)
{
    int i;

    /* Batch of relaxed writes - may be reordered and combined */
    for (i = 0; i < 100; i++) {
        writel_relaxed(data[i], regs + 0x100 + (i * 4));
    }

    /* Explicit write memory barrier ensures all above writes complete */
    wmb();

    /* Final ordered write to trigger device action */
    writel(0x01, regs + 0x00);   /* Start processing */

    /* Read status - guarantees all writes visible to device */
    uint32_t status = readl(regs + 0x04);
}

/*
 * Memory Barrier Types
 *
 * Different barriers provide different ordering guarantees.
 */
void demonstrate_barriers(void __iomem *regs)
{
    /* Read memory barrier: prior reads complete before subsequent reads */
    rmb();

    /* Write memory barrier: prior writes complete before subsequent writes */
    wmb();

    /* Full memory barrier: orders all prior memory ops before subsequent */
    mb();

    /* MMIO-specific barrier: ensures MMIO writes reach devices */
    mmiowb();   /* Deprecated in favor of spin_unlock ordering */

    /* Compiler barrier only: prevents compiler reordering but not CPU */
    barrier();

    /* Example: Read status until ready, then read data */
    uint32_t status;
    do {
        status = readl(regs + STATUS_OFFSET);
        cpu_relax();   /* Hint for spin-waiting, may include barrier */
    } while (!(status & READY_BIT));

    rmb();   /* Ensure status read completes before data read */
    uint32_t data = readl(regs + DATA_OFFSET);
}

/*
 * String/Block Operations
 *
 * For bulk data transfer, string variants operate on buffers.
 */
void demonstrate_block_transfers(void __iomem *regs, void *buffer, size_t count)
{
    /* Read block of 32-bit values */
    ioread32_rep(regs + FIFO_OFFSET, buffer, count);

    /* Write block of 32-bit values */
    iowrite32_rep(regs + FIFO_OFFSET, buffer, count);

    /* Memory copy to/from MMIO (with proper access width handling) */
    memcpy_toio(regs + 0x1000, buffer, count * 4);
    memcpy_fromio(buffer, regs + 0x1000, count * 4);
}
```

On x86/x64, MMIO regions are typically marked UC (uncacheable), which provides strong ordering guarantees. On ARM and other weakly-ordered architectures, explicit barriers become critical. The kernel's accessor macros abstract these differences—always use them instead of raw pointer dereferences.
Memory-Mapped I/O enables modern devices to achieve extraordinary performance levels by leveraging large address spaces, sophisticated memory types, and direct CPU-device communication patterns.
Graphics Processing Units (GPUs)
Modern GPUs exemplify advanced MMIO usage:
| BAR | Size | Memory Type | Purpose |
|---|---|---|---|
| BAR 0 | 16-256 MB | Uncacheable | Control registers, doorbells |
| BAR 1 | 256 MB - 16 GB | Write-Combining | Direct GPU memory aperture |
| BAR 2/3 | Variable | Write-Combining | Extended memory aperture |
| Resizable BAR | Up to 100% VRAM | Write-Combining | Full VRAM access (if enabled) |
The combination of large WC-mapped data regions and small UC-mapped control registers enables GPUs to achieve bandwidth-efficient bulk transfers while maintaining precise control semantics.
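The following sketch shows the control-plus-aperture pattern implied by this layout: bulk data streamed through the write-combining aperture, followed by a single uncacheable doorbell write. The register offsets are hypothetical, not taken from any real GPU.

```c
#include <linux/io.h>

/* Illustrative offsets for a hypothetical GPU - not a real device's layout */
#define GPU_DOORBELL_REG   0x0040    /* In the UC-mapped BAR 0 */
#define GPU_UPLOAD_OFFSET  0x00000   /* In the WC-mapped BAR 1 aperture */

/*
 * Stream bulk data through the write-combining aperture, then issue one
 * uncacheable doorbell write to tell the GPU the data is ready.
 */
void gpu_upload_and_kick(void __iomem *bar0_regs, void __iomem *bar1_aperture,
                         const void *vertices, size_t len)
{
    /* Bulk copy into GPU memory via the WC aperture: the write-combining
     * buffers merge these stores into efficient burst transactions */
    memcpy_toio(bar1_aperture + GPU_UPLOAD_OFFSET, vertices, len);

    /* Drain the write-combining buffers before notifying the GPU */
    wmb();

    /* Single UC doorbell write triggers processing */
    writel(1, bar0_regs + GPU_DOORBELL_REG);
}
```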
NVMe Solid-State Drives
NVMe leverages MMIO for its high-performance storage interface: controller capability, configuration, and status registers are exposed through BAR 0, per-queue doorbell registers follow at a stride advertised in the capability register, and the submission and completion queues themselves live in ordinary host memory.
The elegance of NVMe's MMIO design enables it to achieve millions of IOPS with minimal CPU overhead—each command submission requires only a single 4-byte doorbell write.
Network Interface Cards
High-speed NICs (40 Gbps, 100 Gbps, and beyond) use MMIO extensively: doorbell registers notify the hardware of newly posted transmit and receive descriptors, configuration and statistics registers are memory-mapped, and per-queue register blocks can be mapped directly into user space for kernel-bypass frameworks.
Modern NICs minimize MMIO latency through register layout optimization, cache line alignment of frequently accessed registers, and separation of read-heavy and write-heavy regions.
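A minimal sketch of the transmit path illustrates how little MMIO traffic is involved: the descriptor is written to host memory and a single doorbell write hands it to the NIC. The register offset and descriptor layout are invented for illustration, not taken from a specific NIC.

```c
#include <linux/io.h>

/* Hypothetical tail doorbell offset and descriptor layout */
#define NIC_TX_TAIL_REG 0x3818

struct tx_descriptor {
    uint64_t buffer_addr;   /* DMA address of the packet buffer */
    uint32_t length;
    uint32_t flags;
};

void nic_post_packet(void __iomem *regs, struct tx_descriptor *ring,
                     uint16_t *tail, uint16_t ring_size,
                     uint64_t dma_addr, uint32_t len)
{
    /* Fill the descriptor in host memory; the NIC fetches it via DMA */
    ring[*tail].buffer_addr = dma_addr;
    ring[*tail].length      = len;
    ring[*tail].flags       = 0x1;   /* e.g. end-of-packet marker */

    /* Descriptor must be visible in memory before the doorbell write */
    wmb();

    /* Advance the tail and perform the single MMIO doorbell write */
    *tail = (*tail + 1) % ring_size;
    writel(*tail, regs + NIC_TX_TAIL_REG);
}
```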
```c
/*
 * NVMe MMIO Register Access Example
 *
 * Demonstrates the MMIO structure of NVMe controllers,
 * showcasing minimal-overhead command submission.
 */

#include <linux/io.h>
#include <linux/types.h>

/* NVMe Controller Registers (from NVMe specification) */
struct nvme_controller_regs {
    uint64_t cap;        /* 0x00: Controller Capabilities */
    uint32_t vs;         /* 0x08: Version */
    uint32_t intms;      /* 0x0C: Interrupt Mask Set */
    uint32_t intmc;      /* 0x10: Interrupt Mask Clear */
    uint32_t cc;         /* 0x14: Controller Configuration */
    uint32_t reserved1;  /* 0x18 */
    uint32_t csts;       /* 0x1C: Controller Status */
    uint32_t nssr;       /* 0x20: NVM Subsystem Reset */
    uint32_t aqa;        /* 0x24: Admin Queue Attributes */
    uint64_t asq;        /* 0x28: Admin Submission Queue Base */
    uint64_t acq;        /* 0x30: Admin Completion Queue Base */
    /* ... additional registers ... */
};

/* Doorbell stride is determined by CAP.DSTRD, typically 4 bytes */
#define NVME_DOORBELL_BASE 0x1000

/* Simplified command, completion, and queue structures for this example */
struct nvme_command {
    uint32_t dwords[16];         /* 64-byte submission queue entry */
};

struct nvme_completion {
    uint32_t result;
    uint32_t reserved;
    uint16_t sq_head;
    uint16_t sq_id;
    uint16_t command_id;
    uint16_t status;             /* Bit 0 is the phase tag */
};

struct nvme_queue {
    struct nvme_command    *sq;  /* Submission queue (host memory) */
    struct nvme_completion *cq;  /* Completion queue (host memory) */
    uint16_t sq_depth, cq_depth;
    uint16_t cq_head;
    uint8_t  cq_phase;
};

struct nvme_device {
    void __iomem *regs;             /* Base of MMIO registers */
    void __iomem *doorbells;        /* Base of doorbell registers */
    uint32_t      doorbell_stride;  /* Bytes between doorbells */

    /* Admin and I/O queues live in host memory (not MMIO) */
    struct nvme_queue queues[16];
};

/*
 * Initialize NVMe controller - probe and configure.
 */
int nvme_init(struct nvme_device *dev, void __iomem *bar0)
{
    uint64_t cap;
    uint32_t vs;

    dev->regs = bar0;

    /* Read capabilities to determine controller parameters */
    cap = readq(dev->regs + offsetof(struct nvme_controller_regs, cap));

    /* Extract doorbell stride: 2^(2+DSTRD) bytes */
    uint8_t dstrd = (cap >> 32) & 0x0F;
    dev->doorbell_stride = 4 << dstrd;
    dev->doorbells = dev->regs + NVME_DOORBELL_BASE;

    /* Read version */
    vs = readl(dev->regs + offsetof(struct nvme_controller_regs, vs));
    pr_info("NVMe version: %u.%u.%u\n",
            (vs >> 16), (vs >> 8) & 0xFF, vs & 0xFF);

    /* Wait for controller ready (CSTS.RDY = 1) after enabling */
    /* ... configuration sequence ... */

    return 0;
}

/*
 * Submit a command to an NVMe queue.
 *
 * This demonstrates the minimal MMIO access pattern:
 *   1. Write command to host memory submission queue
 *   2. Single 4-byte doorbell write to notify controller
 */
void nvme_submit_command(struct nvme_device *dev, uint16_t qid,
                         struct nvme_command *cmd, uint16_t *sq_tail)
{
    /* Step 1: Copy command to submission queue (host memory, not MMIO) */
    memcpy(&dev->queues[qid].sq[*sq_tail], cmd, sizeof(*cmd));

    /* Ensure command is written before doorbell */
    wmb();

    /* Step 2: Advance tail pointer */
    *sq_tail = (*sq_tail + 1) % dev->queues[qid].sq_depth;

    /* Step 3: Write doorbell - single 4-byte MMIO write triggers controller */
    /* Submission Queue y Tail Doorbell offset = 0x1000 + (2y * doorbell_stride) */
    writel(*sq_tail, dev->doorbells + (2 * qid * dev->doorbell_stride));

    /* That's it - controller will now fetch and process the command */
}

/*
 * Check for completions - poll completion queue and ring doorbell.
 */
int nvme_poll_completions(struct nvme_device *dev, uint16_t qid,
                          struct nvme_completion *completions,
                          int max_completions)
{
    int count = 0;
    struct nvme_queue *q = &dev->queues[qid];

    while (count < max_completions) {
        struct nvme_completion *cqe = &q->cq[q->cq_head];

        /* Check phase tag to see if entry is valid */
        if ((cqe->status & 1) != q->cq_phase)
            break;   /* No more completions */

        /* Copy completion to caller's buffer */
        completions[count++] = *cqe;

        /* Advance head */
        q->cq_head++;
        if (q->cq_head >= q->cq_depth) {
            q->cq_head = 0;
            q->cq_phase ^= 1;   /* Toggle phase on wrap */
        }
    }

    if (count > 0) {
        /* Ring completion queue head doorbell */
        /* Completion Queue y Head Doorbell offset = 0x1000 + ((2y+1) * doorbell_stride) */
        writel(q->cq_head,
               dev->doorbells + ((2 * qid + 1) * dev->doorbell_stride));
    }

    return count;
}
```

Memory-Mapped I/O has become the dominant paradigm for good reasons. Its advantages extend across hardware design, software development, and system performance.
MMIO's greatest strength is uniformity. Memory allocation, protection, virtual addressing, and access primitives all work the same whether the target is DRAM or a device. This reduces cognitive load for developers and enables reuse of operating system infrastructure.
This page has provided a thorough exploration of Memory-Mapped I/O, the dominant paradigm for modern device communication. Let's consolidate the key concepts: device registers are assigned physical addresses and accessed with ordinary load/store instructions; PCI BARs advertise each device's address-space requirements and are assigned during enumeration; MMIO regions must be mapped with appropriate memory types (UC for registers, WC for frame buffers) through MTRRs and the PAT; kernel accessors and memory barriers ensure that accesses reach the device in program order; and modern GPUs, NVMe drives, and NICs build doorbell- and aperture-based designs on these foundations.
Looking Ahead
With both Port-Mapped and Memory-Mapped I/O understood, we're ready to examine how these paradigms consume address space—the critical system design consideration that influences everything from firmware layout to operating system memory management.
You now have mastery over Memory-Mapped I/O concepts—from hardware address translation through kernel access primitives to modern device usage patterns. This knowledge directly applies to understanding device drivers, debugging I/O issues, and designing high-performance systems.