The I/O addressing paradigms we've studied—Port-Mapped and Memory-Mapped I/O—are abstractions that exist in software and documentation. Their realization in silicon requires sophisticated hardware support: address decoders, bus bridges, memory controllers, translation units, and intricate signal routing. Understanding this hardware layer transforms I/O from a mysterious black box into a transparent, debuggable system.
Modern computer architectures have evolved from simple processor-to-device connections into complex hierarchies of interconnects. A single memory or I/O transaction may traverse multiple buses, pass through several bridges, undergo address translation, and finally reach the target device—all within nanoseconds. The hardware that enables this orchestration is the focus of this page.
We'll explore chipset architectures, the evolution from legacy buses to PCIe, memory controllers and their I/O routing logic, IOMMUs for address translation, and the specialized hardware features that accelerate I/O operations.
By the end of this page, you will understand: (1) The evolution of PC chipset architecture from North/South Bridge to integrated designs, (2) How address decode logic routes transactions to memory vs I/O, (3) PCIe architecture and its role in modern MMIO, (4) IOMMU (VT-d, AMD-Vi) for device virtual addressing, (5) Memory Type Range Registers (MTRRs) and Page Attribute Tables (PAT), and (6) Hardware-assisted virtualization for I/O.
The "chipset" is the collection of silicon that bridges the CPU to memory and I/O devices. Its architecture profoundly affects I/O performance, latency, and capabilities.
The Classic North/South Bridge Architecture (1990s-2010)
For two decades, PC chipsets followed a two-chip design:
North Bridge (MCH - Memory Controller Hub): connected directly to the CPU over the Front-Side Bus and handled the latency-critical paths—the DRAM controller and the high-bandwidth graphics interface (AGP, later PCIe x16).
South Bridge (ICH - I/O Controller Hub): hung off the North Bridge and aggregated the slower I/O—USB, SATA/IDE, audio, the PCI bus, and legacy devices behind the LPC bus (keyboard controller, serial ports, BIOS flash).
This architecture made sense when memory bandwidth and CPU-memory latency were critical bottlenecks—the North Bridge optimized this path while slower I/O was delegated downstream.
Modern Integrated Architecture (2008-Present)
Starting with Intel's Nehalem (2008), the memory controller migrated into the CPU die. AMD made this move earlier with their Athlon 64 (2003). This integration fundamentally changed the chipset landscape:
CPU/SoC Contains: the DRAM controller, the System Agent (address decode and transaction routing), PCIe lanes for graphics and NVMe, and often an integrated GPU.
Platform Controller Hub (PCH): connected to the CPU via DMI (a PCIe-like link), providing USB, SATA, audio, SPI flash, additional PCIe lanes, and legacy I/O.
| Aspect | North/South Bridge | Modern Integrated |
|---|---|---|
| Memory Latency | ~60-100 ns (through NB) | ~50-80 ns (direct to CPU) |
| Graphics Bandwidth | AGP 8x: 2.1 GB/s | PCIe 4.0 x16: 32 GB/s |
| I/O Chip Count | 2 major chips + bridges | 1 PCH (CPU has memory controller) |
| MMIO Routing | NB decodes, routes to NB or SB | CPU System Agent + PCH |
| Power Efficiency | Higher power (external links) | Lower power (integration) |
| Die Area Trade-off | CPU smaller, chipset larger | CPU larger, simpler chipset |
The integration trend continues—Apple M-series, AMD APUs, and Intel's hybrid designs incorporate GPU, NPU, memory controller, and I/O complex on a single die or package. The distinction between 'CPU' and 'chipset' increasingly blurs into 'System-on-Chip' architecture.
The heart of I/O routing is address decode logic—hardware that examines each memory transaction's address and determines whether it targets memory (DRAM) or I/O (devices). This logic resides in the CPU's System Agent and the PCH.
Memory Address Decode in System Agent
The System Agent contains programmable registers defining address ranges:
When a memory transaction arrives:
1. Check if address < TOLM (Top of Low Memory)
→ If yes and not in VGA/ROM region: route to DRAM controller
→ If in VGA region: route to graphics
→ If in ROM region: route to flash interface
2. Check if address >= 4GB and < TOUUD (Top of Upper Usable DRAM)
→ Route to DRAM (high memory)
3. Check if address in MMIO ranges (between TOLM and 4GB)
→ Route to PCIe Root Complex or integrated devices
4. Check if address matches integrated device registers (LAPIC, etc.)
→ Route to integrated target
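The decision sequence above can be sketched as a small classifier. This is a minimal userspace model, using hypothetical TOLM/TOUUD values for an invented 8 GB RAM layout (3.5 GB low memory, 512 MB MMIO hole, remapped high memory up to 8.5 GB)—not real register reads:

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical platform layout (values invented for illustration):
 * TOLM = 3.5 GB, TOUUD = 8.5 GB, 512 MB MMIO hole below 4 GB. */
#define TOLM    0xE0000000ULL   /* Top of Low Memory        */
#define TOUUD   0x220000000ULL  /* Top of Upper Usable DRAM */
#define FOUR_GB 0x100000000ULL

enum target { TGT_DRAM, TGT_VGA, TGT_ROM, TGT_MMIO, TGT_NONE };

/* Route one physical address the way the System Agent would:
 * legacy holes first, then low DRAM, then high (remapped) DRAM,
 * then the 32-bit MMIO hole. */
enum target decode_address(uint64_t pa)
{
    if (pa >= 0xA0000 && pa <= 0xBFFFF)
        return TGT_VGA;              /* legacy VGA window  */
    if (pa >= 0xC0000 && pa <= 0xFFFFF)
        return TGT_ROM;              /* legacy BIOS/ROM    */
    if (pa < TOLM)
        return TGT_DRAM;             /* low DRAM           */
    if (pa >= FOUR_GB && pa < TOUUD)
        return TGT_DRAM;             /* remapped high DRAM */
    if (pa < FOUR_GB)
        return TGT_MMIO;             /* 32-bit MMIO hole   */
    return TGT_NONE;                 /* unclaimed          */
}
```

A real System Agent performs these comparisons in parallel in hardware, but the priority ordering is the same idea.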
```c
/*
 * System Agent Address Decode Configuration
 *
 * These registers (accessed via MSR or MMIO) define how the
 * CPU's System Agent routes memory transactions.
 */

/* Example: Intel System Agent registers (conceptual model) */

/* Top of Low Memory (TOLM)
 * Address ranges below this and not in ROM/VGA go to DRAM.
 * Typically set to 3GB-3.5GB to leave room for 32-bit MMIO.
 */
#define TOLM_REG   0x??   /* Example offset */

/* Top of Upper Usable DRAM (TOUUD)
 * High RAM extends from 4GB to this address.
 * Set based on installed RAM + remapping.
 */
#define TOUUD_REG  0x??   /* Example offset */

/* MMIO Hole Definition Example
 *
 * With 8GB RAM and 512MB MMIO space:
 *
 *   TOLM  = 0xE000_0000 (3.5 GB)   - RAM below this
 *   MMIO  = 0xE000_0000 to 0xFFFF_FFFF (512 MB hole)
 *   TOUUD = 0x2_2000_0000 (8.5 GB) - RAM continues above 4GB
 *
 * RAM Regions:
 *   0x0010_0000   - 0xDFFF_FFFF   : 3.5 GB (below hole)
 *   0x1_0000_0000 - 0x2_1FFF_FFFF : 4.5 GB (remapped)
 */

/* PCIe Base Address (PCIEXBAR)
 * Defines where PCIe ECAM (configuration space) is mapped.
 * Typically 256 MB starting at 0xE000_0000 or similar.
 */
#define PCIEXBAR_REG  0x60

struct pciexbar {
    uint64_t enable    : 1;   /* ECAM enabled */
    uint64_t length    : 2;   /* 00=256MB, 01=128MB, 10=64MB */
    uint64_t reserved  : 23;
    uint64_t base_addr : 38;  /* Base address (256MB aligned) */
};

/* APIC Base Address
 * Local APIC is typically at 0xFEE0_0000, fixed.
 */
#define LAPIC_BASE_MSR  0x1B

/* Reading LAPIC base */
uint64_t get_lapic_base(void)
{
    uint32_t lo, hi;

    asm volatile("rdmsr" : "=a"(lo), "=d"(hi) : "c"(LAPIC_BASE_MSR));
    return ((uint64_t)hi << 32) | (lo & 0xFFFFF000); /* Bits 12-35 are base */
}

/* VGA Legacy Range: 0xA0000 - 0xBFFFF
 * Always routes to VGA-compatible device or integrated graphics.
 * This is hardcoded in the System Agent.
 */

/* Legacy ROM Range: 0xC0000 - 0xFFFFF
 * Routes to SPI flash via LPC/eSPI for BIOS compatibility.
 */
```

PCH Address Decode
The PCH receives transactions from the CPU via DMI and must further route them:
Port I/O Decode Path
For Port I/O (when M/IO# indicates an I/O cycle on legacy buses, or via the dedicated I/O TLP encoding on PCIe), a typical transaction travels: the CPU executes IN/OUT → the System Agent recognizes an I/O cycle → the transaction crosses DMI to the PCH → the PCH decodes the port number → the access is routed to an internal controller or to a legacy device behind LPC/eSPI.
This multi-hop path adds latency to Port I/O—another reason MMIO is faster.
PCI Express (PCIe) has become the universal interconnect for high-performance I/O devices. Understanding PCIe's transaction layer is essential for comprehending modern MMIO at the hardware level.
PCIe Transaction Types
PCIe is a packet-based protocol. Transactions are encoded in Transaction Layer Packets (TLPs):
| TLP Type | Description | MMIO Use |
|---|---|---|
| Memory Read (MRd) | Request to read memory/MMIO | CPU reading device register |
| Memory Write (MWr) | Write to memory/MMIO | CPU writing device register |
| Configuration Read (CfgRd) | Read PCI config space | Reading BAR values |
| Configuration Write (CfgWr) | Write PCI config space | Programming BARs |
| I/O Read (IORd) | Legacy port read | Legacy Port I/O |
| I/O Write (IOWr) | Legacy port write | Legacy Port I/O |
| Completion (Cpl) | Return data/status | Response to reads |
Note that PCIe retains explicit I/O transaction types (IORd/IOWr) so legacy Port I/O can be carried over PCIe, but modern devices rarely use them.
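As a rough sketch of how these types appear on the wire: the first byte of a TLP header packs the 3-bit Fmt and 5-bit Type fields, which is where a value like 0x60 (MWr64) comes from. A simplified encoder for the header's first DWORD (TC, attribute, and digest bits left zero):

```c
#include <stdint.h>
#include <assert.h>

/* First DWORD of a TLP header (simplified: Fmt, Type, Length only). */
static uint32_t tlp_dw0(uint8_t fmt, uint8_t type, uint16_t len_dw)
{
    return ((uint32_t)(fmt & 0x7) << 29) |
           ((uint32_t)(type & 0x1F) << 24) |
           (len_dw & 0x3FF);            /* length in DWORDs, 10 bits */
}

/* Fmt: 0=3DW no data, 1=4DW no data, 2=3DW with data, 3=4DW with data */
#define FMT_3DW       0
#define FMT_4DW       1
#define FMT_3DW_DATA  2
#define FMT_4DW_DATA  3

#define TYPE_MEM      0x00   /* MRd / MWr       */
#define TYPE_IO       0x02   /* IORd / IOWr     */
#define TYPE_CFG0     0x04   /* CfgRd0 / CfgWr0 */

/* Byte 0 of the header: Fmt and Type together.
 * A 64-bit memory write (4DW header + data, Type=0) yields 0x60. */
static uint8_t tlp_byte0(uint8_t fmt, uint8_t type)
{
    return (uint8_t)(tlp_dw0(fmt, type, 0) >> 24);
}
```

So an MWr64 begins with byte 0x60, an MRd64 with 0x20, and a legacy IORd with 0x02—compact evidence that Port I/O survives in PCIe only as one more TLP encoding.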
Address Routing in PCIe
PCIe uses a hierarchical address routing model: memory TLPs travel downstream from the Root Complex through switches until they reach the endpoint whose BAR claims the address.
This routing is enabled by bridge configuration: each PCIe bridge (Root Port, Switch) has registers defining its memory range. If an address falls within a bridge's range, the bridge forwards it downstream.
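That forwarding decision can be modeled with the non-prefetchable Memory Base/Limit registers from a bridge's Type 1 config header, where the upper 12 bits of each 16-bit register select address bits 31:20 (1 MB granularity). A minimal sketch:

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* Non-prefetchable Memory Base/Limit registers from a bridge's
 * config space (offsets 0x20/0x22): upper 12 bits give address
 * bits 31:20, so windows are 1 MB aligned. */
struct bridge_window {
    uint16_t mem_base;
    uint16_t mem_limit;
};

/* Does this bridge claim the address (and forward downstream)? */
static bool bridge_claims(const struct bridge_window *br, uint32_t addr)
{
    uint32_t base  = (uint32_t)(br->mem_base  & 0xFFF0) << 16;
    uint32_t limit = ((uint32_t)(br->mem_limit & 0xFFF0) << 16) | 0xFFFFF;

    if (base > limit)       /* base > limit means window disabled */
        return false;
    return addr >= base && addr <= limit;
}
```

Firmware or the OS programs these windows during enumeration so that each bridge's range exactly covers the BARs of everything below it.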
Posted vs Non-Posted Transactions
Memory writes are posted: the TLP is sent and no completion is returned, so the CPU continues immediately. Memory reads (and configuration transactions) are non-posted: the requester must wait for a Completion TLP. This distinction is why MMIO writes can be fast (posted) while reads incur full round-trip latency to the device.
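A toy model makes the semantics concrete: posted writes queue and return immediately, while a read must first drain prior writes (PCIe ordering forbids reads from passing writes) and then wait for the completion. This is a behavioral sketch, not driver code:

```c
#include <stdint.h>
#include <assert.h>

#define QDEPTH 8

/* Toy root port + device register: posted writes are queued and
 * complete later; a non-posted read drains the queue first, then
 * returns the device's current value. */
struct link_model {
    uint32_t dev_reg;            /* device-side register    */
    uint32_t posted[QDEPTH];     /* in-flight posted writes */
    int      pending;
};

/* Posted MWr: enqueue and return immediately - the CPU moves on. */
static void mmio_write(struct link_model *l, uint32_t val)
{
    if (l->pending < QDEPTH)
        l->posted[l->pending++] = val;
}

/* Non-posted MRd: prior posted writes land first (ordering rule),
 * then the completion carries the data back. */
static uint32_t mmio_read(struct link_model *l)
{
    for (int i = 0; i < l->pending; i++)
        l->dev_reg = l->posted[i];   /* writes land in order */
    l->pending = 0;
    return l->dev_reg;               /* CplD back to the CPU */
}
```

This is also why drivers often issue a read-back after a critical MMIO write: the read's completion guarantees the earlier posted write has reached the device.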
```c
/*
 * PCIe Transaction Flow for MMIO Operations
 *
 * This conceptual code illustrates how a driver MMIO operation
 * becomes a PCIe transaction.
 */

#include <linux/io.h>
#include <linux/pci.h>

struct nvme_device {
    void __iomem *bar0;  /* Controller registers, BAR0 */
};

/*
 * MMIO Write Example: Program NVMe controller
 *
 * This single write becomes a PCIe Memory Write TLP.
 */
void nvme_enable_controller(struct nvme_device *dev)
{
    uint32_t cc;

    /* Read current Controller Configuration */
    cc = readl(dev->bar0 + 0x14);  /* Generates PCIe MRd TLP */

    /* What happens for the read:
     * 1. CPU issues load to virtual address (mapped to BAR0 + 0x14)
     * 2. MMU translates to physical: e.g., 0x6000_0000_0014
     * 3. System Agent recognizes MMIO region, not DRAM
     * 4. Root Complex creates MRd TLP:
     *    - Header: Type=0x00 (MRd64), Address=0x6000_0000_0014, Length=1DW
     * 5. TLP routed through PCIe hierarchy to NVMe device
     * 6. NVMe returns CplD (Completion with Data) TLP
     * 7. Data placed in CPU register, execution continues
     * Latency: ~100-500 ns depending on device and topology
     */

    /* Modify enable bit */
    cc |= 0x01;  /* EN bit */

    /* Write back modified value */
    writel(cc, dev->bar0 + 0x14);  /* Generates PCIe MWr TLP */

    /* What happens for the write:
     * 1. CPU issues store to virtual address (BAR0 + 0x14)
     * 2. MMU translates to physical
     * 3. System Agent creates MWr TLP:
     *    - Header: Type=0x60 (MWr64), Address=0x6000_0000_0014, Length=1DW
     *    - Data: 0x00000001 (or whatever CC value with EN set)
     * 4. TLP sent as POSTED - CPU does not wait!
     * 5. Transaction eventually reaches device, register updated
     * 6. CPU has already continued to next instruction
     * Observed CPU latency: ~10-50 ns (posted write)
     */
}

/*
 * BAR Configuration During Enumeration
 *
 * The OS/firmware assigns BAR addresses by writing to config space.
 * These use Configuration TLPs.
 */
void example_bar_programming(struct pci_dev *pdev)
{
    uint32_t bar0_value;

    /* Reading BAR0 current value generates CfgRd TLP */
    pci_read_config_dword(pdev, PCI_BASE_ADDRESS_0, &bar0_value);

    /* Writing new BAR value generates CfgWr TLP */
    pci_write_config_dword(pdev, PCI_BASE_ADDRESS_0, 0x60000000);

    /* Configuration space access uses special addressing:
     * - Bus/Device/Function encoded in config address
     * - PCIe ECAM maps config space to MMIO region
     * - Or legacy CF8/CFC ports generate I/O TLPs
     */
}
```

The Input/Output Memory Management Unit (IOMMU) revolutionizes I/O hardware support by providing virtual addressing for devices. Just as the MMU translates CPU virtual addresses to physical, the IOMMU translates device DMA addresses.
Why IOMMU Matters
Without IOMMU: devices issue DMA using raw physical addresses, any device can read or write any region of RAM, and DMA buffers must be physically contiguous and within the device's addressing reach (forcing bounce buffers for 32-bit devices).
With IOMMU: devices issue DMA to I/O virtual addresses (IOVAs) that the IOMMU translates per device, access is confined to explicitly mapped pages, scattered physical pages can appear contiguous to the device, and devices can be safely assigned to virtual machines.
| Vendor | Technology Name | Key Capabilities |
|---|---|---|
| Intel | VT-d (Virtualization Technology for Directed I/O) | Address translation, interrupt remapping, device isolation |
| AMD | AMD-Vi (AMD I/O Virtualization) | Similar to VT-d, integrated in AMD platforms |
| ARM | SMMU (System Memory Management Unit) | ARM-equivalent, used in mobile/embedded |
| IBM | TCE (Translation Control Entry) | Power architecture IOMMU |
IOMMU Translation Flow
When a device issues a DMA transaction: (1) the device emits a read or write to an IOVA; (2) the IOMMU intercepts the transaction at the Root Complex; (3) it locates the device's translation context by Bus/Device/Function; (4) it walks that device's IOMMU page tables, caching translations in an IOTLB; (5) if a valid, permitted mapping exists, the transaction proceeds to the translated physical address—otherwise it is blocked and a DMA fault is logged.
This creates per-device "virtual memory" where each device sees only its granted memory regions.
IOMMU Page Tables
IOMMU page tables have structure similar to CPU page tables: a multi-level radix tree indexed by IOVA bits, with present and read/write permission bits in each entry, and an IOTLB caching recent translations just as the TLB does for the CPU.
The OS maintains IOMMU page tables separately from CPU page tables, granting specific pages for DMA use.
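A toy two-level walk illustrates the mechanism; real VT-d/AMD-Vi tables have more levels and richer fault reporting, and all names here are invented for the sketch:

```c
#include <stdint.h>
#include <assert.h>

/* Toy two-level IOMMU table: 4 KB pages, 10-bit indices. */
#define ENTRIES    1024
#define PG_PRESENT 0x1ULL
#define PG_WRITE   0x2ULL

typedef uint64_t iommu_pte;

struct iommu_table {
    iommu_pte *root[ENTRIES];  /* level 1: pointers to level-2 tables */
};

/* Translate an IOVA; returns physical address, or -1 for a blocked
 * DMA (no mapping, or a write to a read-only page). */
static int64_t iommu_translate(struct iommu_table *t,
                               uint64_t iova, int is_write)
{
    uint64_t l1 = (iova >> 22) & 0x3FF;
    uint64_t l2 = (iova >> 12) & 0x3FF;
    iommu_pte *lvl2 = t->root[l1];

    if (!lvl2 || !(lvl2[l2] & PG_PRESENT))
        return -1;                      /* no mapping: DMA fault */
    if (is_write && !(lvl2[l2] & PG_WRITE))
        return -1;                      /* write to RO page      */
    return (int64_t)((lvl2[l2] & ~0xFFFULL) | (iova & 0xFFF));
}
```

Every IOVA outside the granted mappings simply faults instead of touching memory—this is the isolation property the surrounding text describes.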
```c
/*
 * IOMMU Usage in Linux Device Drivers
 *
 * The DMA API abstracts IOMMU details, but understanding
 * what happens under the hood is valuable.
 */

#include <linux/dma-mapping.h>
#include <linux/pci.h>

struct my_device {
    struct pci_dev *pdev;
    void *dma_buffer;       /* CPU virtual address */
    dma_addr_t dma_handle;  /* Device-visible address (IOVA if IOMMU) */
    size_t buffer_size;
};

/*
 * Allocate DMA buffer with IOMMU mapping
 */
int setup_dma_buffer(struct my_device *dev)
{
    dev->buffer_size = 64 * 1024;  /* 64 KB buffer */

    /* dma_alloc_coherent does multiple things:
     * 1. Allocates physically contiguous memory
     * 2. If IOMMU present: creates IOMMU mapping, returns IOVA
     * 3. If no IOMMU: returns physical address
     * 4. Ensures coherency (CPU and device see same data)
     */
    dev->dma_buffer = dma_alloc_coherent(&dev->pdev->dev,
                                         dev->buffer_size,
                                         &dev->dma_handle,
                                         GFP_KERNEL);
    if (!dev->dma_buffer)
        return -ENOMEM;

    /*
     * At this point:
     * - dev->dma_buffer is CPU virtual address (for driver use)
     * - dev->dma_handle is device address (program into device DMA registers)
     *
     * With IOMMU:
     *   dma_handle might be 0x0000_0001_0000_0000 (IOVA)
     *   Actual physical might be 0x0000_0008_1234_0000 (scattered pages)
     *   IOMMU translates when device accesses dma_handle
     *
     * Without IOMMU:
     *   dma_handle equals physical address
     *   Memory must be physically contiguous
     */

    pr_info("DMA buffer: CPU VA=%p, DMA addr=0x%llx",
            dev->dma_buffer, (unsigned long long)dev->dma_handle);

    return 0;
}

/*
 * Program device to DMA to/from the buffer
 */
void start_dma_transfer(struct my_device *dev)
{
    void __iomem *regs = pci_iomap(dev->pdev, 0, 0);

    /* Write the DMA address to device registers.
     * The device will use this address for DMA.
     * With IOMMU: device issues DMA to this IOVA
     * IOMMU translates to actual physical pages
     */
    writeq(dev->dma_handle, regs + DMA_ADDR_REG);
    writel(dev->buffer_size, regs + DMA_SIZE_REG);
    writel(DMA_START_CMD, regs + DMA_CONTROL_REG);

    /* Device now DMA's data. IOMMU ensures:
     * - Device can only access mapped pages
     * - Any out-of-bounds access is blocked
     * - VM isolation maintained (if multiple VMs)
     */
}

/*
 * IOMMU Isolation for VM Passthrough (VFIO)
 *
 * When passing a device to a VM:
 * 1. Device assigned to VFIO-managed IOMMU group
 * 2. IOMMU page tables built from VM's memory layout
 * 3. Device DMA addresses are within VM's "physical" space
 * 4. VM can directly program device without hypervisor trap
 * 5. IOMMU prevents device from accessing host memory
 */
```

Without IOMMU, a malicious PCIe device (or a device with a bug) can read or write any physical memory, bypassing CPU memory protection. This is why IOMMU must be enabled for security-sensitive systems and is mandatory for technologies like Thunderbolt security and VM device passthrough.
The CPU's caching behavior profoundly affects MMIO correctness and performance. Hardware mechanisms to control memory types are essential for proper I/O operation.
Memory Type Range Registers (MTRRs)
MTRRs are Model-Specific Registers (MSRs) that define memory types for physical address ranges:
Fixed MTRRs: Cover the first 1 MB with fixed-size regions
Variable MTRRs: Define arbitrary power-of-2 aligned regions (typically 8-20 available)
Firmware typically initializes MTRRs during boot: the default memory type is set to UC, variable MTRRs mark installed DRAM ranges as WB, and regions like the graphics aperture may be marked WC.
| Type | Value | Caching | Write Policy | MMIO Use |
|---|---|---|---|---|
| UC (Uncacheable) | 0 | None | Direct to device | Standard device registers |
| WC (Write Combining) | 1 | None, but writes combine | Buffered, batched | Frame buffers |
| WT (Write Through) | 4 | Read cached | Direct + cache update | Rarely for MMIO |
| WP (Write Protect) | 5 | Read cached | Writes ignored | Not for MMIO |
| WB (Write Back) | 6 | Full caching | Delayed to memory | NEVER for MMIO! |
Page Attribute Table (PAT)
While MTRRs operate on physical ranges, PAT allows memory type specification in page table entries. This provides per-page control visible to the operating system.
Each page table entry has three relevant bits: PWT (Page Write-Through), PCD (Page Cache Disable), and PAT (bit 7 in 4 KB page table entries).
These 3 bits form an index (0-7) into the PAT register, which holds 8 memory type values. The OS can configure PAT to provide useful combinations:
| PAT Index | Typical Configuration | Use Case |
|---|---|---|
| 0 (PWT=0, PCD=0, PAT=0) | WB | Normal RAM |
| 1 (PWT=1, PCD=0, PAT=0) | WC | Frame buffers |
| 2 (PWT=0, PCD=1, PAT=0) | UC- | MMIO (fallback) |
| 3 (PWT=1, PCD=1, PAT=0) | UC | MMIO (strict) |
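The index computation can be written out directly. The type strings here follow the Linux-style layout shown in the table and code on this page, with entries 4-7 assumed to duplicate 0-3 except for PAT5 = WP:

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* The three page-table bits form a 3-bit index into the 8-entry
 * PAT MSR: index = PAT<<2 | PCD<<1 | PWT. */
static unsigned pat_index(unsigned pwt, unsigned pcd, unsigned pat)
{
    return ((pat & 1) << 2) | ((pcd & 1) << 1) | (pwt & 1);
}

/* Memory types per the Linux-style PAT layout described above. */
static const char *pat_entry[8] = {
    "WB", "WC", "UC-", "UC", "WB", "WP", "UC-", "UC"
};

static const char *effective_type(unsigned pwt, unsigned pcd, unsigned pat)
{
    return pat_entry[pat_index(pwt, pcd, pat)];
}
```

So a mapping created with PWT=1, PCD=1 lands on index 3 (strict UC), while PWT=1, PCD=0 selects index 1 (WC)—exactly the distinction between ioremap() and ioremap_wc().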
MTRR + PAT Interaction
When both are configured, the effective memory type follows a combination rule: roughly, the more restrictive type wins—a PAT type of UC or WC takes effect regardless of the MTRR type, a PAT type of WB defers to the MTRR type, and UC- behaves like UC except that an MTRR type of WC overrides it to WC.
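That rule can be sketched for the four PAT types actually used on this page (UC, UC-, WC, WB). This is a simplified reading of the vendor combining tables, not a complete implementation:

```c
#include <assert.h>

/* Memory type encodings, matching the MTRR type values used on
 * this page. */
enum mtype { UC = 0, WC = 1, WT = 4, WP = 5, WB = 6 };

enum pat_type { PAT_UC, PAT_UC_MINUS, PAT_WC, PAT_WB };

/* Effective type for an access when MTRR and PAT both apply
 * (subset of the combining table: only the four PAT types in the
 * Linux-style layout are handled). */
static enum mtype effective(enum mtype mtrr, enum pat_type pat)
{
    switch (pat) {
    case PAT_UC:
        return UC;                     /* strict UC always wins      */
    case PAT_WC:
        return WC;                     /* WC regardless of MTRR      */
    case PAT_UC_MINUS:
        return (mtrr == WC) ? WC : UC; /* UC- yields to MTRR WC      */
    case PAT_WB:
        return mtrr;                   /* WB defers to the MTRR type */
    }
    return UC;                         /* conservative fallback      */
}
```

The practical consequence: even if firmware left an MTRR marking a region WB, a driver that maps the region UC through PAT still gets uncached access—MMIO correctness does not depend on MTRRs being perfect.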
```c
/*
 * MTRR and PAT Configuration Examples
 *
 * This code illustrates how memory types are configured at the
 * hardware level for MMIO correctness.
 */

#include <asm/msr.h>
#include <asm/mtrr.h>

/* MSR addresses for MTRRs */
#define MSR_MTRRdefType      0x2FF
#define MSR_MTRRfix64K_00000 0x250
#define MSR_MTRRphysBase0    0x200
#define MSR_MTRRphysMask0    0x201

/* Memory type encodings */
#define MTRR_TYPE_UC 0  /* Uncacheable */
#define MTRR_TYPE_WC 1  /* Write Combining */
#define MTRR_TYPE_WT 4  /* Write Through */
#define MTRR_TYPE_WP 5  /* Write Protect */
#define MTRR_TYPE_WB 6  /* Write Back */

/*
 * Read current MTRR configuration
 */
void dump_mtrr_config(void)
{
    uint64_t def_type;
    int i;

    /* Read default type */
    rdmsrl(MSR_MTRRdefType, def_type);
    pr_info("MTRR default type: %lld, enabled: %s, fixed enabled: %s",
            def_type & 0xFF,
            (def_type & (1 << 11)) ? "yes" : "no",
            (def_type & (1 << 10)) ? "yes" : "no");

    /* Read variable MTRRs (typically 8-20 pairs) */
    for (i = 0; i < 8; i++) {
        uint64_t base, mask;

        rdmsrl(MSR_MTRRphysBase0 + i*2, base);
        rdmsrl(MSR_MTRRphysMask0 + i*2, mask);

        if (mask & (1 << 11)) {  /* Valid bit */
            uint64_t start = base & ~0xFFF;
            uint64_t size  = ~(mask & ~0xFFF) + 1;
            uint8_t  type  = base & 0xFF;

            pr_info("MTRR %d: %016llx-%016llx type=%d (%s)",
                    i, start, start + size - 1, type,
                    type == 0 ? "UC" :
                    type == 1 ? "WC" :
                    type == 6 ? "WB" : "other");
        }
    }
}

/*
 * Configure PAT register for useful memory types
 *
 * Linux default PAT configuration:
 *   PAT0 = WB  (normal RAM)
 *   PAT1 = WC  (frame buffers)
 *   PAT2 = UC- (MMIO)
 *   PAT3 = UC  (strict MMIO)
 *   PAT4 = WB  (duplicate)
 *   PAT5 = WP  (not commonly used)
 *   PAT6 = UC- (duplicate)
 *   PAT7 = UC  (duplicate)
 */
#define MSR_IA32_CR_PAT 0x277

void setup_pat(void)
{
    uint64_t pat = 0;

    /* Construct PAT value */
    pat |= ((uint64_t)MTRR_TYPE_WB << 0);   /* PAT0 = WB */
    pat |= ((uint64_t)MTRR_TYPE_WC << 8);   /* PAT1 = WC */
    pat |= ((uint64_t)MTRR_TYPE_UC << 16);  /* PAT2 = UC- (using UC) */
    pat |= ((uint64_t)MTRR_TYPE_UC << 24);  /* PAT3 = UC */
    pat |= ((uint64_t)MTRR_TYPE_WB << 32);  /* PAT4 = WB */
    pat |= ((uint64_t)MTRR_TYPE_WP << 40);  /* PAT5 = WP */
    pat |= ((uint64_t)MTRR_TYPE_UC << 48);  /* PAT6 = UC- (using UC) */
    pat |= ((uint64_t)MTRR_TYPE_UC << 56);  /* PAT7 = UC */

    wrmsrl(MSR_IA32_CR_PAT, pat);
    pr_info("PAT configured: 0x%016llx", pat);
}

/*
 * In ioremap context, the kernel sets page table bits to select
 * the appropriate PAT entry:
 *
 *   ioremap()    -> UC (PAT3: PWT=1, PCD=1, PAT=0)
 *   ioremap_wc() -> WC (PAT1: PWT=1, PCD=0, PAT=0)
 */
```

Modern I/O hardware includes extensive support for virtualization, enabling efficient I/O handling in virtual machine environments.
I/O Virtualization Challenges
Without hardware support, VM I/O requires trap-and-emulate: each guest MMIO access causes a VM exit, the hypervisor decodes the faulting instruction, emulates the device register's behavior in software, and then resumes the guest.
This trap overhead devastates I/O performance—each operation adds thousands of cycles.
Hardware Solutions
Modern platforms attack this overhead from several directions: the IOMMU enables safe direct device assignment, SR-IOV lets one physical device be shared by many VMs, interrupt remapping (and posted interrupts) delivers device interrupts to guests directly, and EPT/NPT removes the cost of software page-table shadowing.
SR-IOV in Detail
SR-IOV allows a single physical device to appear as multiple virtual devices: one Physical Function (PF), managed by the host for configuration and resource allocation, and many lightweight Virtual Functions (VFs) that can each be assigned to a different VM.
Each VF has: its own PCI function number and configuration space, its own BARs (MMIO register set), its own DMA queues, and its own MSI-X interrupt vectors.
A VM with an assigned VF can perform MMIO and DMA directly to the hardware without hypervisor involvement—achieving near-native performance.
MMIO in VM Context
With EPT/NPT (Extended/Nested Page Tables), guest MMIO can be hardware-accelerated: for an assigned device, the hypervisor maps the device's BARs into the guest's second-level page tables with an uncached memory type, so guest loads and stores reach the hardware directly with no VM exit.
For emulated devices, the hypervisor sets EPT entries to trap on MMIO access, enabling efficient emulation only when needed.
SR-IOV and IOMMU are foundational for cloud computing. AWS, Azure, and GCP use these technologies to provide VMs with direct access to network and storage hardware, achieving the low latency and high bandwidth required for demanding workloads.
This page has provided a comprehensive exploration of the hardware mechanisms that enable efficient I/O addressing. Let's consolidate the key concepts: chipset integration moved the memory controller and I/O routing into the CPU; address decode logic in the System Agent and PCH steers every transaction to DRAM or to a device; PCIe carries MMIO as Memory Read/Write TLPs, with posted writes hiding latency; the IOMMU gives devices virtual addressing and isolation; and MTRRs plus PAT ensure MMIO regions are never cached incorrectly.
Module Completion
You have now completed the comprehensive exploration of Memory-Mapped I/O and I/O addressing paradigms. From the conceptual foundations through hardware implementation details, you possess the knowledge to: trace an MMIO access from a driver's readl/writel call through address decode and PCIe TLPs to the device, reason about posted-write versus read latency, set up DMA with the IOMMU in mind, and choose correct memory types for device mappings.
Congratulations! You now have mastery over I/O addressing paradigms—from PMIO and MMIO concepts through hardware implementation. This foundational knowledge directly applies to device driver development, system debugging, performance optimization, and low-level systems programming.