What if accessing a device register was exactly the same as reading or writing a variable in memory? What if there were no special instructions, no separate address space, no architectural distinction between 'memory' and 'hardware'?
This is the promise of Memory-Mapped I/O (MMIO)—a paradigm that maps device registers directly into the processor's memory address space, allowing software to interact with hardware using ordinary load and store instructions.
MMIO has become the dominant I/O access method in modern computing. Nearly every contemporary device—from GPU registers to NVMe controller queues, from network interface cards to USB controllers—uses memory-mapped registers for CPU communication. Understanding MMIO is essential for systems programming, driver development, and low-level debugging.
By the end of this page, you will understand how MMIO integrates devices into the memory map, the architectural advantages over port-mapped I/O, critical considerations around caching and memory ordering, how operating systems manage MMIO regions, and practical implementation patterns used in real device drivers.
Memory-Mapped I/O is an architectural technique where device registers are assigned addresses within the processor's physical memory address space. When the CPU issues a memory access to an MMIO address, the memory controller routes the request to the appropriate device instead of actual RAM.
Conceptual Model:
Imagine the physical address space as a city divided into districts. Some districts contain apartment buildings (RAM modules), while others contain specialized facilities (devices). The postal system (memory controller) delivers mail to the correct destination based solely on the address—it doesn't need to know whether the destination is an apartment or a factory.
Physical Address Space (Example x86-64 System)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
0x0000_0000_0000_0000 ─┬─ Low Memory (Legacy PC regions)
                       ├─ 0x000A_0000 - VGA Frame Buffer (MMIO)
                       ├─ 0x000C_0000 - Option ROMs (MMIO)
                       └─ 0x000F_0000 - BIOS Area
0x0000_0000_0010_0000 ─┬─ First 4GB (mixed RAM/MMIO)
                       ├─ Various RAM regions
                       ├─ PCI MMIO windows
                       └─ Local APIC: 0xFEE0_0000 (MMIO)
0x0000_0001_0000_0000+ ─┬─ Extended Memory (mostly RAM)
                        ├─ Large BAR devices (GPU VRAM)
                        └─ PCIe MMCONFIG: ~256MB region
How MMIO Works at the Hardware Level:
Address Decode: When the CPU issues a memory access, address decode logic in the memory controller examines the address.
Routing Decision: Based on configured address ranges, the request is routed either to the DRAM controller (for addresses backed by RAM) or to the appropriate device or bus bridge (for MMIO addresses).
Transaction Execution: The target device receives the read/write request, processes it, and returns data or acknowledgment.
Completion: The response travels back through the fabric to the requesting CPU core.
This routing is configured by firmware during boot (via ACPI and PCI enumeration) and maintained by the operating system thereafter.
MMIO mapping occurs at the physical address level. When an OS kernel accesses MMIO, it creates virtual memory mappings (page table entries) that translate kernel virtual addresses to the device's physical MMIO addresses. This mapping must have special attributes (non-cacheable, typically) as we'll discuss shortly.
The previous page covered port-mapped I/O (PMIO), which uses a separate address space accessed via IN/OUT instructions. MMIO takes a fundamentally different approach. Let's compare these paradigms systematically:
| Aspect | Memory-Mapped I/O (MMIO) | Port-Mapped I/O (PMIO) |
|---|---|---|
| Address Space | Shared with memory (huge: 2^48 on x86-64) | Separate I/O space (limited: 64K on x86) |
| Instructions | Normal MOV, load/store | Special IN/OUT instructions |
| Programming Languages | Any language with pointers | Requires inline assembly or intrinsics |
| Address Width | Full pointer width (32/64-bit) | Fixed 16-bit port addresses |
| Access Sizes | Arbitrary (1, 2, 4, 8 bytes, even larger) | 1, 2, or 4 bytes only |
| Caching | Must be explicitly disabled for device registers | Never cached (I/O space is uncacheable by design) |
| Virtual Memory | Full MMU support (mapping, protection) | Bypasses MMU (IOPL/TSS bitmap protection only) |
| Compiler Optimization | Can be problematic (must use volatile) | Compilers aware of I/O instructions |
| Architecture Support | Universal (all architectures) | x86-specific (and some legacy systems) |
| Performance | Can be faster with optimized paths | Often slower due to legacy bus protocols |
Despite the challenges, the industry has overwhelmingly standardized on MMIO. All modern CPUs (ARM, RISC-V, MIPS, PowerPC) use MMIO exclusively. x86 maintains port I/O for backward compatibility, but nearly all new device functionality uses MMIO. The PCIe specification strongly encourages MMIO over port I/O.
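To make the contrast concrete, the sketch below reads the same hypothetical device status register both ways: through a volatile pointer to a memory-mapped address, and through the x86 IN instruction on an I/O port. Both the MMIO address (0xFEB00000) and the port number (0x5000) are invented for illustration; the inline assembly is GCC-style and x86-specific.

```c
#include <stdint.h>

/* Hypothetical addresses, for illustration only */
#define DEV_MMIO_STATUS ((volatile uint32_t *)0xFEB00000UL)  /* memory-mapped register */
#define DEV_PIO_STATUS  ((uint16_t)0x5000)                   /* I/O port number */

/* MMIO: an ordinary load through a volatile pointer (compiles to a plain MOV) */
static inline uint32_t read_status_mmio(void)
{
    return *DEV_MMIO_STATUS;
}

/* PMIO: requires the special IN instruction (x86 only, ring 0 or IOPL granted) */
static inline uint32_t read_status_pio(void)
{
    uint32_t value;
    __asm__ volatile("inl %w1, %0" : "=a"(value) : "Nd"(DEV_PIO_STATUS));
    return value;
}
```

Note that the MMIO path works unchanged on ARM or RISC-V (only the address differs), while the PMIO path has no equivalent outside x86.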
At first glance, MMIO access appears trivially simple: cast an address to a pointer and dereference it. However, this naive approach is fraught with subtle pitfalls that can cause device malfunction, data corruption, or unpredictable behavior.
Why Pointer Semantics are Dangerous for MMIO:
Compiler Optimization: Compilers assume memory accesses have no side effects. They may eliminate writes they consider redundant, merge or reorder accesses, or keep a previously read value in a register instead of reading memory again.
CPU Reordering: Modern CPUs don't execute instructions in program order. A store followed by a load might execute as load-then-store.
Cache Interference: If an MMIO region is accidentally cacheable, CPU caches will return stale data instead of reading the device.
```c
/*
 * MMIO Access Patterns: The Right Way
 *
 * This demonstrates proper MMIO access techniques that prevent
 * compiler and CPU reordering issues.
 */

#include <stdint.h>

/*
 * WRONG: Naive pointer access
 *
 * The compiler can optimize this disastrously:
 * - It might combine the two writes into one
 * - It might reorder them
 * - It might cache the read result and never read the device again
 */
void wrong_mmio_access(void)
{
    uint32_t *device_reg = (uint32_t *)0xFEED0000;   /* WRONG: not volatile */

    *device_reg = 0x1;   /* Command: start operation */
    *device_reg = 0x2;   /* Command: acknowledge - compiler might skip! */

    while (*device_reg & 0x1) {
        /* Wait for busy flag - infinite loop! */
        /* Compiler caches the read, never sees flag clear */
    }
}

/*
 * CORRECT: Proper volatile access
 *
 * The 'volatile' keyword tells the compiler:
 * - Every access must actually touch memory
 * - Accesses cannot be reordered with respect to other volatile accesses
 * - Values cannot be cached in registers
 */

/* Type-safe MMIO access macros */
#define mmio_read8(addr)      (*(volatile uint8_t *)(addr))
#define mmio_write8(addr, v)  (*(volatile uint8_t *)(addr) = (v))
#define mmio_read16(addr)     (*(volatile uint16_t *)(addr))
#define mmio_write16(addr, v) (*(volatile uint16_t *)(addr) = (v))
#define mmio_read32(addr)     (*(volatile uint32_t *)(addr))
#define mmio_write32(addr, v) (*(volatile uint32_t *)(addr) = (v))
#define mmio_read64(addr)     (*(volatile uint64_t *)(addr))
#define mmio_write64(addr, v) (*(volatile uint64_t *)(addr) = (v))

/* Better: Inline functions for type safety */
static inline uint32_t mmio_read_32(void *addr)
{
    return *(volatile uint32_t *)addr;
}

static inline void mmio_write_32(void *addr, uint32_t value)
{
    *(volatile uint32_t *)addr = value;
}

/*
 * CORRECT: Using proper MMIO access
 */
void correct_mmio_access(void *device_base)
{
    mmio_write32(device_base, 0x1);   /* Write #1 actually happens */
    mmio_write32(device_base, 0x2);   /* Write #2 actually happens */

    while (mmio_read32(device_base) & 0x1) {
        /* Actually reads from device each iteration */
    }
}

/*
 * MEMORY BARRIERS
 *
 * Volatile prevents COMPILER reordering, but CPUs can still reorder
 * at the hardware level. Memory barriers are required for strict ordering.
 */

/* Compiler-only barrier (prevents compiler reordering) */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Full memory barrier (prevents CPU reordering) */
#define mb()  __asm__ __volatile__("mfence" ::: "memory")

/* Read barrier (loads before barrier complete before loads after) */
#define rmb() __asm__ __volatile__("lfence" ::: "memory")

/* Write barrier (stores before barrier complete before stores after) */
#define wmb() __asm__ __volatile__("sfence" ::: "memory")

/*
 * Example: Device initialization requiring strict ordering
 *
 * Some devices require commands to be observed in exact sequence.
 * Memory barriers ensure the device sees operations in order.
 */
void init_device_with_ordering(void *base)
{
    /* Step 1: Reset the device */
    mmio_write32(base + 0x00, 0xDEAD);   /* Reset command */
    wmb();   /* Ensure reset is complete before configuration */

    /* Step 2: Configure operating mode */
    mmio_write32(base + 0x04, 0x0100);   /* Mode register */
    mmio_write32(base + 0x08, 0x0200);   /* Settings register */
    wmb();   /* Ensure configuration is complete */

    /* Step 3: Enable the device */
    mmio_write32(base + 0x00, 0x0001);   /* Enable bit */
    mb();    /* Full barrier before reading status */

    /* Step 4: Wait for device ready */
    uint32_t status;
    do {
        status = mmio_read32(base + 0x10);
    } while (!(status & 0x8000));   /* Ready bit */
}
```

The 'volatile' keyword prevents compiler reordering and caching, but it does NOT prevent CPU instruction reordering. For device programming where operation order matters (which is nearly always), you need both volatile access AND appropriate memory barriers. Modern kernel APIs (like Linux's readl/writel) encapsulate both concerns.
CPU caches are designed to accelerate memory access by keeping frequently used data close to the processor. This optimization is catastrophic for MMIO—if a device register is cached, the CPU returns stale data instead of querying the device.
The Caching Problem:
Consider a device status register at address 0xFEED0000: the first read pulls the register's value into a cache line, and every subsequent read is satisfied from that cache line, so the CPU keeps seeing the old value even after the device has updated the register.
The solution is to mark MMIO regions as uncacheable in the processor's page tables.
x86 Memory Type Attributes:
The x86 architecture defines several memory types, controlled via page table entries (PAT, PCD, PWT bits) and Memory Type Range Registers (MTRRs):
UC (Uncacheable): No caching whatsoever. Every access goes to the device. Required for most device registers.
WC (Write Combining): Writes can be combined and reordered, but not cached. Reads are uncacheable. Ideal for frame buffers and write-intensive devices.
WT (Write Through): Writes go to both cache and device. Reads may return cached data. Rarely used for MMIO.
WB (Write Back): Full caching—writes may complete only to cache. Dangerous for MMIO!
UC- (Uncached Minus): Like UC, but can be overridden by MTRRs. Used for compatibility.
| Memory Type | Caching Behavior | Ordering | Use Case |
|---|---|---|---|
| UC (Uncacheable) | No caching at all | Strongly ordered | Control registers, configuration space |
| WC (Write Combining) | Writes combined, reads uncached | Weakly ordered | Frame buffers, bulk write regions |
| WT (Write Through) | Read cached, write through | Strongly ordered | Rarely used for MMIO |
| WP (Write Protect) | Reads cached, writes not allowed | N/A | Read-only memory regions |
| WB (Write Back) | Full caching | Weakly ordered | NEVER use for device MMIO |
```c
/*
 * Mapping MMIO Regions in an OS Kernel
 *
 * This demonstrates how to create page table entries for device registers
 * with proper memory attributes.
 */

#include <stddef.h>
#include <stdint.h>

/* Memory types accepted by mmio_map() (illustrative constants) */
enum { MEMORY_TYPE_UC, MEMORY_TYPE_WC };

/* Assumed kernel helpers, provided elsewhere in this hypothetical kernel */
extern void *allocate_kernel_virtual_pages(size_t num_pages);
extern void  free_kernel_virtual_pages(void *addr, size_t num_pages);
extern void  set_page_table_entry(uint64_t virt_addr, uint64_t pte);

/* Page Table Entry flags for x86-64 */
#define PTE_PRESENT  (1UL << 0)   /* Page is present in memory */
#define PTE_WRITABLE (1UL << 1)   /* Page is writable */
#define PTE_USER     (1UL << 2)   /* Page accessible from ring 3 */
#define PTE_PWT      (1UL << 3)   /* Page Write-Through */
#define PTE_PCD      (1UL << 4)   /* Page Cache Disable */
#define PTE_ACCESSED (1UL << 5)   /* Page has been accessed */
#define PTE_DIRTY    (1UL << 6)   /* Page has been written */
#define PTE_PAT      (1UL << 7)   /* High bit of the Page Attribute Table index */
#define PTE_GLOBAL   (1UL << 8)   /* Global page (don't flush on CR3 change) */
#define PTE_NX       (1UL << 63)  /* No Execute (if NX enabled) */

/*
 * Page Attribute Table (PAT) indices for memory types
 * The PAT allows fine-grained control over memory type per page.
 *
 * Default PAT configuration:
 *   Entry 0: WB (Write Back)
 *   Entry 1: WT (Write Through)
 *   Entry 2: UC- (Uncached Minus)
 *   Entry 3: UC (Uncacheable)
 *   Entry 4: WB (Write Back)
 *   Entry 5: WT (Write Through)
 *   Entry 6: UC- (Uncached Minus)
 *   Entry 7: UC (Uncacheable)
 */

/* Construct PTE flags for UC (Uncacheable) memory type */
#define PTE_UC (PTE_PWT | PTE_PCD)   /* PAT index 3 = UC */

/* Construct PTE flags for WC (Write Combining) - requires PAT programming */
/* Assuming PAT entry 5 is programmed as WC */
#define PTE_WC (PTE_PAT | PTE_PWT)   /* PAT index 5 = WC (with custom PAT) */

/*
 * Map a physical MMIO region to kernel virtual address space
 *
 * In a real OS, this would:
 * 1. Allocate virtual address range
 * 2. Create page table entries
 * 3. Set proper memory type attributes
 * 4. Flush TLB for affected pages
 */
void *mmio_map(uint64_t phys_addr, size_t size, int memory_type)
{
    /* Align to page boundary */
    uint64_t page_offset  = phys_addr & 0xFFF;
    uint64_t aligned_phys = phys_addr & ~0xFFFUL;
    size_t   aligned_size = (size + page_offset + 0xFFF) & ~0xFFFUL;

    /* Allocate virtual address range (simplified) */
    void *virt_addr = allocate_kernel_virtual_pages(aligned_size / 4096);
    if (!virt_addr)
        return NULL;

    /* Determine PTE flags based on requested memory type */
    uint64_t type_flags;
    switch (memory_type) {
    case MEMORY_TYPE_UC:
        type_flags = PTE_UC;
        break;
    case MEMORY_TYPE_WC:
        type_flags = PTE_WC;
        break;
    default:
        type_flags = PTE_UC;   /* Default to uncacheable for safety */
    }

    /* Create page table mappings */
    for (size_t i = 0; i < aligned_size; i += 4096) {
        uint64_t pte = (aligned_phys + i)
                     | PTE_PRESENT
                     | PTE_WRITABLE
                     | type_flags
                     | PTE_NX;   /* Device registers should never be executable */
        set_page_table_entry((uint64_t)virt_addr + i, pte);
    }

    /* Flush TLB for the mapped range */
    for (size_t i = 0; i < aligned_size; i += 4096) {
        __asm__ volatile("invlpg (%0)"
                         : : "r"((uint64_t)virt_addr + i) : "memory");
    }

    /* Return pointer adjusted for original offset */
    return (void *)((uint64_t)virt_addr + page_offset);
}

/*
 * Unmap an MMIO region
 */
void mmio_unmap(void *virt_addr, size_t size)
{
    uint64_t aligned_addr = (uint64_t)virt_addr & ~0xFFFUL;
    size_t   aligned_size = (size + ((uint64_t)virt_addr & 0xFFF) + 0xFFF) & ~0xFFFUL;

    /* Clear page table entries */
    for (size_t i = 0; i < aligned_size; i += 4096) {
        set_page_table_entry(aligned_addr + i, 0);
    }

    /* Flush TLB */
    for (size_t i = 0; i < aligned_size; i += 4096) {
        __asm__ volatile("invlpg (%0)" : : "r"(aligned_addr + i) : "memory");
    }

    /* Free virtual address range */
    free_kernel_virtual_pages((void *)aligned_addr, aligned_size / 4096);
}
```

Linux provides several ioremap variants for different use cases: ioremap() creates UC mappings (the default for device registers), ioremap_wc() creates Write-Combining mappings (for frame buffers), ioremap_cache() creates cacheable mappings (for device memory that supports coherent caching), and ioremap_uc() explicitly creates Uncacheable mappings. Always choose the appropriate variant for your device's requirements.
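As a rough illustration of how these pieces fit together in a Linux PCI driver, here is a minimal probe-path sketch. The driver name, register offsets (0x00 and 0x10), and the "enable" bit are hypothetical; pci_iomap(), readl(), and writel() are the real kernel APIs that wrap the mapping and ordering details discussed above.

```c
#include <linux/pci.h>
#include <linux/io.h>

#define DEMO_REG_ID   0x00   /* hypothetical ID register */
#define DEMO_REG_CTRL 0x10   /* hypothetical control register */

static int demo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    void __iomem *regs;
    int err;

    err = pci_enable_device(pdev);
    if (err)
        return err;

    /* Map BAR 0 with uncacheable attributes; pci_iomap() wraps ioremap() */
    regs = pci_iomap(pdev, 0, pci_resource_len(pdev, 0));
    if (!regs)
        return -ENOMEM;

    /* readl()/writel() provide the ordering guarantees most drivers need */
    dev_info(&pdev->dev, "id register: 0x%08x\n", readl(regs + DEMO_REG_ID));
    writel(0x1, regs + DEMO_REG_CTRL);   /* hypothetical enable bit */

    pci_set_drvdata(pdev, (void __force *)regs);
    return 0;
}
```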
Modern devices request MMIO address ranges through PCI/PCIe Base Address Registers (BARs). The system firmware and operating system negotiate where each device's registers will appear in the physical address space.
PCI BAR Mechanism:
Each PCI function can have up to 6 BARs (BAR0-BAR5). A BAR can be either an I/O BAR (describing a port I/O range) or a memory BAR (describing an MMIO region), and memory BARs may be 32-bit or 64-bit.
Memory BARs have specific formats that indicate their requirements:
Memory BAR Format:
Bit 0 = 0 (Memory Space indicator)
Bits 2:1 = Type: 00=32-bit, 10=64-bit
Bit 3 = Prefetchable flag
Bits 31:4 = Base Address (16-byte aligned minimum)
For 64-bit BARs, the next BAR (BAR[n+1]) contains the upper 32 bits.
Size Discovery Algorithm:
To determine the size of an MMIO region, software writes all 1s (0xFFFFFFFF) to the BAR, reads the value back (the device returns zeros in the address bits it does not decode), and then restores the original BAR contents.
The lowest set bit in the mask indicates the size. For example, reading back 0xFFFFC000 means the bottom 14 bits are hardwired to 0, indicating a 16KB region (2^14).
```c
/*
 * PCI BAR Size Detection and MMIO Mapping
 *
 * Demonstrates how the OS discovers device MMIO requirements
 * from PCI configuration space BARs.
 */

#include <stdint.h>
#include <stdbool.h>

/* PCI configuration space access (from the previous page) */
extern uint32_t pci_config_read(uint8_t bus, uint8_t dev, uint8_t func, uint8_t offset);
extern void     pci_config_write(uint8_t bus, uint8_t dev, uint8_t func, uint8_t offset,
                                 uint32_t val);

#define PCI_BAR0 0x10

/* BAR type detection */
#define BAR_TYPE_IO    1
#define BAR_TYPE_MEM32 2
#define BAR_TYPE_MEM64 3

struct pci_bar_info {
    uint64_t base_address;   /* Physical base address */
    uint64_t size;           /* Size in bytes */
    int      type;           /* BAR_TYPE_IO, BAR_TYPE_MEM32, or BAR_TYPE_MEM64 */
    bool     prefetchable;   /* Can the region be prefetched? */
};

/*
 * Detect BAR type, size, and base address
 *
 * Returns: Number of BARs consumed (1 for 32-bit, 2 for 64-bit)
 */
int detect_pci_bar(uint8_t bus, uint8_t dev, uint8_t func, int bar_index,
                   struct pci_bar_info *info)
{
    uint8_t bar_offset = PCI_BAR0 + (bar_index * 4);

    /* Read original BAR value */
    uint32_t original = pci_config_read(bus, dev, func, bar_offset);

    if (original == 0) {
        /* BAR not implemented */
        info->type = 0;
        return 1;
    }

    /* Check if this is I/O or Memory BAR */
    if (original & 0x1) {
        /* I/O BAR */
        info->type = BAR_TYPE_IO;
        info->prefetchable = false;

        /* Write all 1s to detect size */
        pci_config_write(bus, dev, func, bar_offset, 0xFFFFFFFF);
        uint32_t size_mask = pci_config_read(bus, dev, func, bar_offset);
        pci_config_write(bus, dev, func, bar_offset, original);   /* Restore */

        /* Calculate size from mask */
        size_mask |= 0x3;                 /* Ignore type bits */
        info->size = (~size_mask) + 1;    /* Size is power of 2 */
        info->base_address = original & ~0x3UL;
        return 1;
    } else {
        /* Memory BAR */
        int mem_type = (original >> 1) & 0x3;
        info->prefetchable = (original >> 3) & 0x1;

        if (mem_type == 0x0) {
            /* 32-bit Memory BAR */
            info->type = BAR_TYPE_MEM32;

            /* Write all 1s to detect size */
            pci_config_write(bus, dev, func, bar_offset, 0xFFFFFFFF);
            uint32_t size_mask = pci_config_read(bus, dev, func, bar_offset);
            pci_config_write(bus, dev, func, bar_offset, original);

            size_mask |= 0xF;   /* Ignore type/prefetch bits */
            info->size = (~size_mask) + 1;
            info->base_address = original & ~0xFUL;
            return 1;
        } else if (mem_type == 0x2) {
            /* 64-bit Memory BAR */
            info->type = BAR_TYPE_MEM64;

            /* Read upper 32 bits from next BAR */
            uint32_t original_hi = pci_config_read(bus, dev, func, bar_offset + 4);

            /* Write all 1s to both BARs */
            pci_config_write(bus, dev, func, bar_offset, 0xFFFFFFFF);
            pci_config_write(bus, dev, func, bar_offset + 4, 0xFFFFFFFF);
            uint32_t size_mask_lo = pci_config_read(bus, dev, func, bar_offset);
            uint32_t size_mask_hi = pci_config_read(bus, dev, func, bar_offset + 4);

            /* Restore original values */
            pci_config_write(bus, dev, func, bar_offset, original);
            pci_config_write(bus, dev, func, bar_offset + 4, original_hi);

            /* Combine into 64-bit mask */
            uint64_t size_mask = ((uint64_t)size_mask_hi << 32) | size_mask_lo;
            size_mask |= 0xF;   /* Ignore type bits in low BAR */
            info->size = (~size_mask) + 1;
            info->base_address = ((uint64_t)original_hi << 32) | (original & ~0xFUL);
            return 2;   /* Consumed 2 BARs */
        }
    }

    return 1;
}

/*
 * Example: Initialize a hypothetical network card
 * (mmio_map, mmio_read32, mmio_write32, and the MEMORY_TYPE_* constants
 * come from the earlier examples on this page.)
 */
void init_network_card(uint8_t bus, uint8_t dev, uint8_t func)
{
    struct pci_bar_info bars[6] = {0};   /* Zero-init so skipped slots read as unimplemented */
    int bar = 0;

    /* Enumerate all BARs */
    while (bar < 6) {
        bar += detect_pci_bar(bus, dev, func, bar, &bars[bar]);
    }

    /* Typically BAR0 contains the main register space */
    if (bars[0].type == BAR_TYPE_MEM64 || bars[0].type == BAR_TYPE_MEM32) {
        /* Map the device registers into kernel virtual memory */
        void *regs = mmio_map(bars[0].base_address, bars[0].size, MEMORY_TYPE_UC);

        if (regs) {
            /* Now we can access device registers */
            uint32_t device_id = mmio_read32(regs + 0x00);   /* Example: ID register */
            uint32_t status    = mmio_read32(regs + 0x04);   /* Example: Status register */

            /* Configure the device... */
            mmio_write32(regs + 0x10, 0x00000001);   /* Example: Enable bit */
        }
    }

    /* BAR2 might be the frame buffer or DMA buffer (prefetchable) */
    if (bars[2].type != 0 && bars[2].prefetchable) {
        /* Use WC mapping for better write performance */
        void *buffer = mmio_map(bars[2].base_address, bars[2].size, MEMORY_TYPE_WC);
        /* This region can handle bulk writes efficiently */
    }
}
```

The prefetchable bit indicates whether the region has memory-like semantics (reads have no side effects, writes can be combined). GPU frame buffers are typically prefetchable. Control registers are NEVER prefetchable because reading them may clear status flags or trigger actions. Using WC (Write-Combining) on prefetchable regions can dramatically improve write throughput.
Modern devices have evolved sophisticated MMIO usage patterns that go far beyond simple register access. Understanding these patterns is essential for driver development and system debugging.
NVMe: Command Queue Architecture
NVMe (Non-Volatile Memory Express) SSDs exemplify advanced MMIO usage. Instead of transferring data through registers, NVMe keeps its command (submission) and completion queues in ordinary host RAM: the driver writes commands into a submission queue, rings a memory-mapped doorbell register to signal new work, and the controller fetches commands and data via DMA.
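A simplified sketch of that submission path is shown below. The doorbell offset follows the NVMe formula 0x1000 + (2 * qid) * (4 << DSTRD) with a stride of 0; the queue structure and 64-byte command buffer are simplified stand-ins for illustration, not the full specification, and the mmio_write32() macro repeats the helper defined earlier on this page.

```c
#include <stdint.h>
#include <string.h>

#define mmio_write32(addr, v) (*(volatile uint32_t *)(addr) = (v))

/* Simplified submission queue: entries live in ordinary, DMA-visible host RAM */
struct nvme_sq {
    uint8_t  *entries;   /* queue memory, depth * 64 bytes */
    uint16_t  tail;      /* next free slot */
    uint16_t  depth;     /* number of entries */
    uint16_t  qid;       /* queue ID */
};

static void nvme_submit(void *bar0, struct nvme_sq *sq, const void *cmd64)
{
    /* 1. Copy the 64-byte command into the queue -- plain RAM, no MMIO */
    memcpy(sq->entries + (size_t)sq->tail * 64, cmd64, 64);

    /* 2. Advance the tail index, wrapping at the queue depth */
    sq->tail = (uint16_t)((sq->tail + 1) % sq->depth);

    /* 3. Ensure the command is globally visible before the doorbell write */
    __asm__ __volatile__("sfence" ::: "memory");

    /* 4. Ring the submission-queue tail doorbell: the only MMIO access here,
     *    at 0x1000 + (2 * qid) * 4, assuming a doorbell stride (DSTRD) of 0 */
    mmio_write32((uint8_t *)bar0 + 0x1000 + (2 * sq->qid) * 4, sq->tail);
}
```

The key point is that bulk data never crosses the MMIO path; only the lightweight doorbell write does, which is what makes the design fast.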
GPU MMIO Regions:
Modern GPUs expose multiple MMIO regions with distinct purposes: a register BAR for control and command submission, a VRAM aperture BAR that windows directly into video memory, and often a separate doorbell region for user-mode work submission.
GPU drivers carefully choose memory types for each region to optimize performance while maintaining correctness.
| Device Type | Typical MMIO Size | Memory Type | Access Pattern |
|---|---|---|---|
| NVMe Controller | 16KB - 64KB | UC (registers) | Queue doorbells + config registers |
| Network Card (10GbE+) | 128KB - 1MB | UC (registers), WC (descriptors) | Descriptor rings, MSI-X tables |
| GPU Registers | 16MB - 256MB | UC (strictly ordered) | Command submission, state programming |
| GPU VRAM | 256MB - 32GB | WC (writes), Cache (coherent) | Texture uploads, render targets |
| USB xHCI | 64KB | UC | Transfer rings, event rings |
| PCIe MMCONFIG | 256MB | UC | Extended configuration space |
| Local APIC | 4KB | UC (page-aligned) | Interrupt controller programming |
Modern GPUs can expose BARs of 16GB or more for direct CPU access to VRAM. This exceeds the entire 32-bit address space! Such devices require 64-bit BARs and can only be fully utilized on 64-bit operating systems with sufficient virtual address space. The 'Resizable BAR' (ReBAR) feature allows runtime negotiation of BAR sizes for optimal CPU-GPU data transfer.
MMIO brings device access into the purview of the Memory Management Unit (MMU), enabling powerful security controls—but also introducing unique attack surfaces.
Protection via Page Tables:
Unlike port I/O (protected by IOPL and TSS bitmaps), MMIO is protected by standard page table permissions: present, read/write, user/supervisor, and no-execute bits, so the kernel decides exactly which MMIO pages are visible, to whom, and with what access rights.
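Because of this, the kernel can selectively expose a single device's registers to a suitably privileged process. The sketch below maps BAR 0 of a PCI device through Linux's sysfs resource0 file, which is a real interface; the device address in the path is a placeholder, and reading offset 0 is purely illustrative.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder bus/device/function: substitute your own device */
    const char *path = "/sys/bus/pci/devices/0000:03:00.0/resource0";
    size_t map_len = 4096;   /* map one page of BAR 0 */

    int fd = open(path, O_RDWR | O_SYNC);   /* O_SYNC requests an uncached mapping */
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *regs = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* The page tables now confine this process to exactly this BAR */
    printf("register at offset 0x0: 0x%08x\n", regs[0]);

    munmap((void *)regs, map_len);
    close(fd);
    return 0;
}
```

The same mechanism underlies frameworks such as UIO and VFIO, which add interrupt delivery and IOMMU containment on top of the basic mapping.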
IOMMU Protection:
The I/O Memory Management Unit (IOMMU—Intel VT-d, AMD-Vi, ARM SMMU) provides the same address translation and protection for device DMA that the CPU MMU provides for processor accesses: devices operate on I/O virtual addresses, and the IOMMU restricts each device to the memory regions the OS has explicitly mapped for it.
Without IOMMU protection, a compromised or malicious device has unrestricted access to all physical memory—a catastrophic security hole.
Many desktop systems ship with IOMMU disabled by default for performance and compatibility. This means a malicious Thunderbolt/USB4 device, PCIe card, or even some network attacks against vulnerable NICs can read your memory contents, capture encryption keys, or install rootkits. Enable VT-d/AMD-Vi in BIOS if security is a concern.
Memory-Mapped I/O is the dominant paradigm for modern device communication, unifying hardware access with memory semantics while introducing critical considerations around caching, ordering, and security.
What's Next:
With our understanding of how MMIO and port I/O work, the next page examines Programmed I/O (PIO)—the simplest (but often least efficient) technique for transferring data between CPU and devices. We'll see how the CPU manually moves every byte, when this approach is appropriate, and its fundamental performance limitations that drive the need for more sophisticated techniques.
You now understand Memory-Mapped I/O comprehensively—from basic concepts through volatile access patterns, caching implications, PCI BAR mechanics, and security considerations. This knowledge is fundamental for device driver development, kernel engineering, and understanding modern computer architecture.