What if accessing a device register was exactly the same as reading or writing a variable in memory? What if there were no special instructions, no separate address space, no architectural distinction between 'memory' and 'hardware'?
This is the promise of Memory-Mapped I/O (MMIO)—a paradigm that maps device registers directly into the processor's memory address space, allowing software to interact with hardware using ordinary load and store instructions.
MMIO has become the dominant I/O access method in modern computing. Nearly every contemporary device—from GPU registers to NVMe controller queues, from network interface cards to USB controllers—uses memory-mapped registers for CPU communication. Understanding MMIO is essential for systems programming, driver development, and low-level debugging.
By the end of this page, you will understand how MMIO integrates devices into the memory map, the architectural advantages over port-mapped I/O, critical considerations around caching and memory ordering, how operating systems manage MMIO regions, and practical implementation patterns used in real device drivers.
Memory-Mapped I/O is an architectural technique where device registers are assigned addresses within the processor's physical memory address space. When the CPU issues a memory access to an MMIO address, the memory controller routes the request to the appropriate device instead of actual RAM.
Conceptual Model:
Imagine the physical address space as a city divided into districts. Some districts contain apartment buildings (RAM modules), while others contain specialized facilities (devices). The postal system (memory controller) delivers mail to the correct destination based solely on the address—it doesn't need to know whether the destination is an apartment or a factory.
Physical Address Space (Example x86-64 System)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
0x0000_0000_0000_0000 ─┬─ Low Memory (Legacy PC regions)
                       ├─ 0x000A_0000 - VGA Frame Buffer (MMIO)
                       ├─ 0x000C_0000 - Option ROMs (MMIO)
                       └─ 0x000F_0000 - BIOS Area
0x0000_0000_0010_0000 ─┬─ First 4GB (mixed RAM/MMIO)
                       ├─ Various RAM regions
                       ├─ PCI MMIO windows
                       └─ Local APIC: 0xFEE0_0000 (MMIO)
0x0000_0001_0000_0000+ ─┬─ Extended Memory (mostly RAM)
                        ├─ Large BAR devices (GPU VRAM)
                        └─ PCIe MMCONFIG: ~256MB region
How MMIO Works at the Hardware Level:
Address Decode: When the CPU issues a memory access, address decode logic in the memory controller examines the address.
Routing Decision: Based on configured address ranges, the request is routed either to the DRAM controller (for addresses backed by RAM) or to the appropriate device or bus bridge (for MMIO addresses).
Transaction Execution: The target device receives the read/write request, processes it, and returns data or acknowledgment.
Completion: The response travels back through the fabric to the requesting CPU core.
This routing is configured by firmware during boot (via ACPI and PCI enumeration) and maintained by the operating system thereafter.
MMIO mapping occurs at the physical address level. When an OS kernel accesses MMIO, it creates virtual memory mappings (page table entries) that translate kernel virtual addresses to the device's physical MMIO addresses. This mapping must have special attributes (non-cacheable, typically) as we'll discuss shortly.
The previous page covered port-mapped I/O (PMIO), which uses a separate address space accessed via IN/OUT instructions. MMIO takes a fundamentally different approach. Let's compare these paradigms systematically:
| Aspect | Memory-Mapped I/O (MMIO) | Port-Mapped I/O (PMIO) |
|---|---|---|
| Address Space | Shared with memory (huge: 2^48 on x86-64) | Separate I/O space (limited: 64K on x86) |
| Instructions | Normal MOV, load/store | Special IN/OUT instructions |
| Programming Languages | Any language with pointers | Requires inline assembly or intrinsics |
| Address Width | Full pointer width (32/64-bit) | Fixed 16-bit port addresses |
| Access Sizes | Arbitrary (1, 2, 4, 8 bytes, even larger) | 1, 2, or 4 bytes only |
| Caching | Must be explicitly disabled for device registers | Never cached (I/O space is uncacheable by design) |
| Virtual Memory | Full MMU support (mapping, protection) | Bypasses MMU (IOPL/TSS bitmap protection only) |
| Compiler Optimization | Can be problematic (must use volatile) | Compilers aware of I/O instructions |
| Architecture Support | Universal (all architectures) | x86-specific (and some legacy systems) |
| Performance | Can be faster with optimized paths | Often slower due to legacy bus protocols |
Despite the challenges, the industry has overwhelmingly standardized on MMIO. All modern CPUs (ARM, RISC-V, MIPS, PowerPC) use MMIO exclusively. x86 maintains port I/O for backward compatibility, but nearly all new device functionality uses MMIO. The PCIe specification strongly encourages MMIO over port I/O.
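To make the contrast concrete, the sketch below reads the same hypothetical device status register both ways: through a volatile pointer to a memory-mapped address, and through the x86 IN instruction on an I/O port. Both the MMIO address (0xFEB00000) and the port number (0x5000) are invented for illustration; the inline assembly is GCC-style and x86-specific.

```c
#include <stdint.h>

/* Hypothetical addresses, for illustration only */
#define DEV_MMIO_STATUS ((volatile uint32_t *)0xFEB00000UL)  /* memory-mapped register */
#define DEV_PIO_STATUS  ((uint16_t)0x5000)                   /* I/O port number */

/* MMIO: an ordinary load through a volatile pointer (compiles to a plain MOV) */
static inline uint32_t read_status_mmio(void)
{
    return *DEV_MMIO_STATUS;
}

/* PMIO: requires the special IN instruction (x86 only, ring 0 or IOPL granted) */
static inline uint32_t read_status_pio(void)
{
    uint32_t value;
    __asm__ volatile("inl %w1, %0" : "=a"(value) : "Nd"(DEV_PIO_STATUS));
    return value;
}
```

Note that the MMIO path works unchanged on ARM or RISC-V (only the address differs), while the PMIO path has no equivalent outside x86.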
At first glance, MMIO access appears trivially simple: cast an address to a pointer and dereference it. However, this naive approach is fraught with subtle pitfalls that can cause device malfunction, data corruption, or unpredictable behavior.
Why Pointer Semantics are Dangerous for MMIO:
Compiler Optimization: Compilers assume memory accesses have no side effects. They may eliminate writes they consider redundant, merge or reorder accesses, or keep a previously read value in a register instead of reading memory again.
CPU Reordering: Modern CPUs don't execute instructions in program order. A store followed by a load might execute as load-then-store.
Cache Interference: If an MMIO region is accidentally cacheable, CPU caches will return stale data instead of reading the device.
```c
/*
 * MMIO Access Patterns: The Right Way
 *
 * This demonstrates proper MMIO access techniques that prevent
 * compiler and CPU reordering issues.
 */

#include <stdint.h>

/*
 * WRONG: Naive pointer access
 *
 * The compiler can optimize this disastrously:
 * - It might combine the two writes into one
 * - It might reorder them
 * - It might cache the read result and never read the device again
 */
void wrong_mmio_access(void)
{
    uint32_t *device_reg = (uint32_t *)0xFEED0000;   /* WRONG: not volatile */

    *device_reg = 0x1;   /* Command: start operation */
    *device_reg = 0x2;   /* Command: acknowledge - compiler might skip! */

    while (*device_reg & 0x1) {
        /* Wait for busy flag - infinite loop! */
        /* Compiler caches the read, never sees flag clear */
    }
}

/*
 * CORRECT: Proper volatile access
 *
 * The 'volatile' keyword tells the compiler:
 * - Every access must actually touch memory
 * - Accesses cannot be reordered with respect to other volatile accesses
 * - Values cannot be cached in registers
 */

/* Type-safe MMIO access macros */
#define mmio_read8(addr)      (*(volatile uint8_t *)(addr))
#define mmio_write8(addr, v)  (*(volatile uint8_t *)(addr) = (v))
#define mmio_read16(addr)     (*(volatile uint16_t *)(addr))
#define mmio_write16(addr, v) (*(volatile uint16_t *)(addr) = (v))
#define mmio_read32(addr)     (*(volatile uint32_t *)(addr))
#define mmio_write32(addr, v) (*(volatile uint32_t *)(addr) = (v))
#define mmio_read64(addr)     (*(volatile uint64_t *)(addr))
#define mmio_write64(addr, v) (*(volatile uint64_t *)(addr) = (v))

/* Better: Inline functions for type safety */
static inline uint32_t mmio_read_32(void *addr)
{
    return *(volatile uint32_t *)addr;
}

static inline void mmio_write_32(void *addr, uint32_t value)
{
    *(volatile uint32_t *)addr = value;
}

/*
 * CORRECT: Using proper MMIO access
 */
void correct_mmio_access(void *device_base)
{
    mmio_write32(device_base, 0x1);   /* Write #1 actually happens */
    mmio_write32(device_base, 0x2);   /* Write #2 actually happens */

    while (mmio_read32(device_base) & 0x1) {
        /* Actually reads from device each iteration */
    }
}

/*
 * MEMORY BARRIERS
 *
 * Volatile prevents COMPILER reordering, but CPUs can still reorder
 * at the hardware level. Memory barriers are required for strict ordering.
 */

/* Compiler-only barrier (prevents compiler reordering) */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Full memory barrier (prevents CPU reordering) */
#define mb()  __asm__ __volatile__("mfence" ::: "memory")

/* Read barrier (loads before barrier complete before loads after) */
#define rmb() __asm__ __volatile__("lfence" ::: "memory")

/* Write barrier (stores before barrier complete before stores after) */
#define wmb() __asm__ __volatile__("sfence" ::: "memory")

/*
 * Example: Device initialization requiring strict ordering
 *
 * Some devices require commands to be observed in exact sequence.
 * Memory barriers ensure the device sees operations in order.
 */
void init_device_with_ordering(void *base)
{
    /* Step 1: Reset the device */
    mmio_write32(base + 0x00, 0xDEAD);   /* Reset command */
    wmb();   /* Ensure reset is complete before configuration */

    /* Step 2: Configure operating mode */
    mmio_write32(base + 0x04, 0x0100);   /* Mode register */
    mmio_write32(base + 0x08, 0x0200);   /* Settings register */
    wmb();   /* Ensure configuration is complete */

    /* Step 3: Enable the device */
    mmio_write32(base + 0x00, 0x0001);   /* Enable bit */
    mb();    /* Full barrier before reading status */

    /* Step 4: Wait for device ready */
    uint32_t status;
    do {
        status = mmio_read32(base + 0x10);
    } while (!(status & 0x8000));   /* Ready bit */
}
```

The 'volatile' keyword prevents compiler reordering and caching, but it does NOT prevent CPU instruction reordering. For device programming where operation order matters (which is nearly always), you need both volatile access AND appropriate memory barriers. Modern kernel APIs (like Linux's readl/writel) encapsulate both concerns.
CPU caches are designed to accelerate memory access by keeping frequently used data close to the processor. This optimization is catastrophic for MMIO—if a device register is cached, the CPU returns stale data instead of querying the device.
The Caching Problem:
Consider a device status register at address 0xFEED0000: the first read pulls the register's value into a cache line, and every subsequent read is satisfied from that cache line, so the CPU keeps seeing the old value even after the device has updated the register.
The solution is to mark MMIO regions as uncacheable in the processor's page tables.
x86 Memory Type Attributes:
The x86 architecture defines several memory types, controlled via page table entries (PAT, PCD, PWT bits) and Memory Type Range Registers (MTRRs):
UC (Uncacheable): No caching whatsoever. Every access goes to the device. Required for most device registers.
WC (Write Combining): Writes can be combined and reordered, but not cached. Reads are uncacheable. Ideal for frame buffers and write-intensive devices.
WT (Write Through): Writes go to both cache and device. Reads may return cached data. Rarely used for MMIO.
WB (Write Back): Full caching—writes may complete only to cache. Dangerous for MMIO!
UC- (Uncached Minus): Like UC, but can be overridden by MTRRs. Used for compatibility.
| Memory Type | Caching Behavior | Ordering | Use Case |
|---|---|---|---|
| UC (Uncacheable) | No caching at all | Strongly ordered | Control registers, configuration space |
| WC (Write Combining) | Writes combined, reads uncached | Weakly ordered | Frame buffers, bulk write regions |
| WT (Write Through) | Read cached, write through | Strongly ordered | Rarely used for MMIO |
| WP (Write Protect) | Reads cached, writes not allowed | N/A | Read-only memory regions |
| WB (Write Back) | Full caching | Weakly ordered | NEVER use for device MMIO |
```c
/*
 * Mapping MMIO Regions in an OS Kernel
 *
 * This demonstrates how to create page table entries for device registers
 * with proper memory attributes.
 */

#include <stddef.h>
#include <stdint.h>

/* Memory types accepted by mmio_map() (illustrative constants) */
enum { MEMORY_TYPE_UC, MEMORY_TYPE_WC };

/* Assumed kernel helpers, provided elsewhere in this hypothetical kernel */
extern void *allocate_kernel_virtual_pages(size_t num_pages);
extern void  free_kernel_virtual_pages(void *addr, size_t num_pages);
extern void  set_page_table_entry(uint64_t virt_addr, uint64_t pte);

/* Page Table Entry flags for x86-64 */
#define PTE_PRESENT  (1UL << 0)   /* Page is present in memory */
#define PTE_WRITABLE (1UL << 1)   /* Page is writable */
#define PTE_USER     (1UL << 2)   /* Page accessible from ring 3 */
#define PTE_PWT      (1UL << 3)   /* Page Write-Through */
#define PTE_PCD      (1UL << 4)   /* Page Cache Disable */
#define PTE_ACCESSED (1UL << 5)   /* Page has been accessed */
#define PTE_DIRTY    (1UL << 6)   /* Page has been written */
#define PTE_PAT      (1UL << 7)   /* High bit of the Page Attribute Table index */
#define PTE_GLOBAL   (1UL << 8)   /* Global page (don't flush on CR3 change) */
#define PTE_NX       (1UL << 63)  /* No Execute (if NX enabled) */

/*
 * Page Attribute Table (PAT) indices for memory types
 * The PAT allows fine-grained control over memory type per page.
 *
 * Default PAT configuration:
 *   Entry 0: WB (Write Back)
 *   Entry 1: WT (Write Through)
 *   Entry 2: UC- (Uncached Minus)
 *   Entry 3: UC (Uncacheable)
 *   Entry 4: WB (Write Back)
 *   Entry 5: WT (Write Through)
 *   Entry 6: UC- (Uncached Minus)
 *   Entry 7: UC (Uncacheable)
 */

/* Construct PTE flags for UC (Uncacheable) memory type */
#define PTE_UC (PTE_PWT | PTE_PCD)   /* PAT index 3 = UC */

/* Construct PTE flags for WC (Write Combining) - requires PAT programming */
/* Assuming PAT entry 5 is programmed as WC */
#define PTE_WC (PTE_PAT | PTE_PWT)   /* PAT index 5 = WC (with custom PAT) */

/*
 * Map a physical MMIO region to kernel virtual address space
 *
 * In a real OS, this would:
 * 1. Allocate virtual address range
 * 2. Create page table entries
 * 3. Set proper memory type attributes
 * 4. Flush TLB for affected pages
 */
void *mmio_map(uint64_t phys_addr, size_t size, int memory_type)
{
    /* Align to page boundary */
    uint64_t page_offset  = phys_addr & 0xFFF;
    uint64_t aligned_phys = phys_addr & ~0xFFFUL;
    size_t   aligned_size = (size + page_offset + 0xFFF) & ~0xFFFUL;

    /* Allocate virtual address range (simplified) */
    void *virt_addr = allocate_kernel_virtual_pages(aligned_size / 4096);
    if (!virt_addr)
        return NULL;

    /* Determine PTE flags based on requested memory type */
    uint64_t type_flags;
    switch (memory_type) {
    case MEMORY_TYPE_UC:
        type_flags = PTE_UC;
        break;
    case MEMORY_TYPE_WC:
        type_flags = PTE_WC;
        break;
    default:
        type_flags = PTE_UC;   /* Default to uncacheable for safety */
    }

    /* Create page table mappings */
    for (size_t i = 0; i < aligned_size; i += 4096) {
        uint64_t pte = (aligned_phys + i)
                     | PTE_PRESENT
                     | PTE_WRITABLE
                     | type_flags
                     | PTE_NX;   /* Device registers should never be executable */
        set_page_table_entry((uint64_t)virt_addr + i, pte);
    }

    /* Flush TLB for the mapped range */
    for (size_t i = 0; i < aligned_size; i += 4096) {
        __asm__ volatile("invlpg (%0)"
                         : : "r"((uint64_t)virt_addr + i) : "memory");
    }

    /* Return pointer adjusted for original offset */
    return (void *)((uint64_t)virt_addr + page_offset);
}

/*
 * Unmap an MMIO region
 */
void mmio_unmap(void *virt_addr, size_t size)
{
    uint64_t aligned_addr = (uint64_t)virt_addr & ~0xFFFUL;
    size_t   aligned_size = (size + ((uint64_t)virt_addr & 0xFFF) + 0xFFF) & ~0xFFFUL;

    /* Clear page table entries */
    for (size_t i = 0; i < aligned_size; i += 4096) {
        set_page_table_entry(aligned_addr + i, 0);
    }

    /* Flush TLB */
    for (size_t i = 0; i < aligned_size; i += 4096) {
        __asm__ volatile("invlpg (%0)" : : "r"(aligned_addr + i) : "memory");
    }

    /* Free virtual address range */
    free_kernel_virtual_pages((void *)aligned_addr, aligned_size / 4096);
}
```

Linux provides several ioremap variants for different use cases: ioremap() creates UC mappings (the default for device registers), ioremap_wc() creates Write-Combining mappings (for frame buffers), ioremap_cache() creates cacheable mappings (for device memory that supports coherent caching), and ioremap_uc() explicitly creates Uncacheable mappings. Always choose the appropriate variant for your device's requirements.
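As a rough illustration of how these pieces fit together in a Linux PCI driver, here is a minimal probe-path sketch. The driver name, register offsets (0x00 and 0x10), and the "enable" bit are hypothetical; pci_iomap(), readl(), and writel() are the real kernel APIs that wrap the mapping and ordering details discussed above.

```c
#include <linux/pci.h>
#include <linux/io.h>

#define DEMO_REG_ID   0x00   /* hypothetical ID register */
#define DEMO_REG_CTRL 0x10   /* hypothetical control register */

static int demo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    void __iomem *regs;
    int err;

    err = pci_enable_device(pdev);
    if (err)
        return err;

    /* Map BAR 0 with uncacheable attributes; pci_iomap() wraps ioremap() */
    regs = pci_iomap(pdev, 0, pci_resource_len(pdev, 0));
    if (!regs)
        return -ENOMEM;

    /* readl()/writel() provide the ordering guarantees most drivers need */
    dev_info(&pdev->dev, "id register: 0x%08x\n", readl(regs + DEMO_REG_ID));
    writel(0x1, regs + DEMO_REG_CTRL);   /* hypothetical enable bit */

    pci_set_drvdata(pdev, (void __force *)regs);
    return 0;
}
```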
Modern devices request MMIO address ranges through PCI/PCIe Base Address Registers (BARs). The system firmware and operating system negotiate where each device's registers will appear in the physical address space.
PCI BAR Mechanism:
Each PCI function can have up to 6 BARs (BAR0-BAR5). A BAR can be either an I/O BAR (describing a port I/O range) or a memory BAR (describing an MMIO region), and memory BARs may be 32-bit or 64-bit.
Memory BARs have specific formats that indicate their requirements:
Memory BAR Format:
Bit 0 = 0 (Memory Space indicator)
Bits 2:1 = Type: 00=32-bit, 10=64-bit
Bit 3 = Prefetchable flag
Bits 31:4 = Base Address (16-byte aligned minimum)
For 64-bit BARs, the next BAR (BAR[n+1]) contains the upper 32 bits.
Size Discovery Algorithm:
To determine the size of an MMIO region, software writes all 1s (0xFFFFFFFF) to the BAR, reads the value back (the device returns zeros in the address bits it does not decode), and then restores the original BAR contents.
The lowest set bit in the mask indicates the size. For example, reading back 0xFFFFC000 means the bottom 14 bits are hardwired to 0, indicating a 16KB region (2^14).
```c
/*
 * PCI BAR Size Detection and MMIO Mapping
 *
 * Demonstrates how the OS discovers device MMIO requirements
 * from PCI configuration space BARs.
 */

#include <stdint.h>
#include <stdbool.h>

/* PCI configuration space access (from the previous page) */
extern uint32_t pci_config_read(uint8_t bus, uint8_t dev, uint8_t func, uint8_t offset);
extern void     pci_config_write(uint8_t bus, uint8_t dev, uint8_t func, uint8_t offset,
                                 uint32_t val);

#define PCI_BAR0 0x10

/* BAR type detection */
#define BAR_TYPE_IO    1
#define BAR_TYPE_MEM32 2
#define BAR_TYPE_MEM64 3

struct pci_bar_info {
    uint64_t base_address;   /* Physical base address */
    uint64_t size;           /* Size in bytes */
    int      type;           /* BAR_TYPE_IO, BAR_TYPE_MEM32, or BAR_TYPE_MEM64 */
    bool     prefetchable;   /* Can the region be prefetched? */
};

/*
 * Detect BAR type, size, and base address
 *
 * Returns: Number of BARs consumed (1 for 32-bit, 2 for 64-bit)
 */
int detect_pci_bar(uint8_t bus, uint8_t dev, uint8_t func, int bar_index,
                   struct pci_bar_info *info)
{
    uint8_t bar_offset = PCI_BAR0 + (bar_index * 4);

    /* Read original BAR value */
    uint32_t original = pci_config_read(bus, dev, func, bar_offset);

    if (original == 0) {
        /* BAR not implemented */
        info->type = 0;
        return 1;
    }

    /* Check if this is I/O or Memory BAR */
    if (original & 0x1) {
        /* I/O BAR */
        info->type = BAR_TYPE_IO;
        info->prefetchable = false;

        /* Write all 1s to detect size */
        pci_config_write(bus, dev, func, bar_offset, 0xFFFFFFFF);
        uint32_t size_mask = pci_config_read(bus, dev, func, bar_offset);
        pci_config_write(bus, dev, func, bar_offset, original);   /* Restore */

        /* Calculate size from mask */
        size_mask |= 0x3;                 /* Ignore type bits */
        info->size = (~size_mask) + 1;    /* Size is power of 2 */
        info->base_address = original & ~0x3UL;
        return 1;
    } else {
        /* Memory BAR */
        int mem_type = (original >> 1) & 0x3;
        info->prefetchable = (original >> 3) & 0x1;

        if (mem_type == 0x0) {
            /* 32-bit Memory BAR */
            info->type = BAR_TYPE_MEM32;

            /* Write all 1s to detect size */
            pci_config_write(bus, dev, func, bar_offset, 0xFFFFFFFF);
            uint32_t size_mask = pci_config_read(bus, dev, func, bar_offset);
            pci_config_write(bus, dev, func, bar_offset, original);

            size_mask |= 0xF;   /* Ignore type/prefetch bits */
            info->size = (~size_mask) + 1;
            info->base_address = original & ~0xFUL;
            return 1;
        } else if (mem_type == 0x2) {
            /* 64-bit Memory BAR */
            info->type = BAR_TYPE_MEM64;

            /* Read upper 32 bits from next BAR */
            uint32_t original_hi = pci_config_read(bus, dev, func, bar_offset + 4);

            /* Write all 1s to both BARs */
            pci_config_write(bus, dev, func, bar_offset, 0xFFFFFFFF);
            pci_config_write(bus, dev, func, bar_offset + 4, 0xFFFFFFFF);
            uint32_t size_mask_lo = pci_config_read(bus, dev, func, bar_offset);
            uint32_t size_mask_hi = pci_config_read(bus, dev, func, bar_offset + 4);

            /* Restore original values */
            pci_config_write(bus, dev, func, bar_offset, original);
            pci_config_write(bus, dev, func, bar_offset + 4, original_hi);

            /* Combine into 64-bit mask */
            uint64_t size_mask = ((uint64_t)size_mask_hi << 32) | size_mask_lo;
            size_mask |= 0xF;   /* Ignore type bits in low BAR */
            info->size = (~size_mask) + 1;
            info->base_address = ((uint64_t)original_hi << 32) | (original & ~0xFUL);
            return 2;   /* Consumed 2 BARs */
        }
    }

    return 1;
}

/*
 * Example: Initialize a hypothetical network card
 * (mmio_map, mmio_read32, mmio_write32, and the MEMORY_TYPE_* constants
 * come from the earlier examples on this page.)
 */
void init_network_card(uint8_t bus, uint8_t dev, uint8_t func)
{
    struct pci_bar_info bars[6] = {0};   /* Zero-init so skipped slots read as unimplemented */
    int bar = 0;

    /* Enumerate all BARs */
    while (bar < 6) {
        bar += detect_pci_bar(bus, dev, func, bar, &bars[bar]);
    }

    /* Typically BAR0 contains the main register space */
    if (bars[0].type == BAR_TYPE_MEM64 || bars[0].type == BAR_TYPE_MEM32) {
        /* Map the device registers into kernel virtual memory */
        void *regs = mmio_map(bars[0].base_address, bars[0].size, MEMORY_TYPE_UC);

        if (regs) {
            /* Now we can access device registers */
            uint32_t device_id = mmio_read32(regs + 0x00);   /* Example: ID register */
            uint32_t status    = mmio_read32(regs + 0x04);   /* Example: Status register */

            /* Configure the device... */
            mmio_write32(regs + 0x10, 0x00000001);   /* Example: Enable bit */
        }
    }

    /* BAR2 might be the frame buffer or DMA buffer (prefetchable) */
    if (bars[2].type != 0 && bars[2].prefetchable) {
        /* Use WC mapping for better write performance */
        void *buffer = mmio_map(bars[2].base_address, bars[2].size, MEMORY_TYPE_WC);
        /* This region can handle bulk writes efficiently */
    }
}
```

The prefetchable bit indicates whether the region has memory-like semantics (reads have no side effects, writes can be combined). GPU frame buffers are typically prefetchable. Control registers are NEVER prefetchable because reading them may clear status flags or trigger actions. Using WC (Write-Combining) on prefetchable regions can dramatically improve write throughput.
Modern devices have evolved sophisticated MMIO usage patterns that go far beyond simple register access. Understanding these patterns is essential for driver development and system debugging.
NVMe: Command Queue Architecture
NVMe (Non-Volatile Memory Express) SSDs exemplify advanced MMIO usage. Instead of transferring data through registers, NVMe keeps its command (submission) and completion queues in ordinary host RAM: the driver writes commands into a submission queue, rings a memory-mapped doorbell register to signal new work, and the controller fetches commands and data via DMA.
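A simplified sketch of that submission path is shown below. The doorbell offset follows the NVMe formula 0x1000 + (2 * qid) * (4 << DSTRD) with a stride of 0; the queue structure and 64-byte command buffer are simplified stand-ins for illustration, not the full specification, and the mmio_write32() macro repeats the helper defined earlier on this page.

```c
#include <stdint.h>
#include <string.h>

#define mmio_write32(addr, v) (*(volatile uint32_t *)(addr) = (v))

/* Simplified submission queue: entries live in ordinary, DMA-visible host RAM */
struct nvme_sq {
    uint8_t  *entries;   /* queue memory, depth * 64 bytes */
    uint16_t  tail;      /* next free slot */
    uint16_t  depth;     /* number of entries */
    uint16_t  qid;       /* queue ID */
};

static void nvme_submit(void *bar0, struct nvme_sq *sq, const void *cmd64)
{
    /* 1. Copy the 64-byte command into the queue -- plain RAM, no MMIO */
    memcpy(sq->entries + (size_t)sq->tail * 64, cmd64, 64);

    /* 2. Advance the tail index, wrapping at the queue depth */
    sq->tail = (uint16_t)((sq->tail + 1) % sq->depth);

    /* 3. Ensure the command is globally visible before the doorbell write */
    __asm__ __volatile__("sfence" ::: "memory");

    /* 4. Ring the submission-queue tail doorbell: the only MMIO access here,
     *    at 0x1000 + (2 * qid) * 4, assuming a doorbell stride (DSTRD) of 0 */
    mmio_write32((uint8_t *)bar0 + 0x1000 + (2 * sq->qid) * 4, sq->tail);
}
```

The key point is that bulk data never crosses the MMIO path; only the lightweight doorbell write does, which is what makes the design fast.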
GPU MMIO Regions:
Modern GPUs expose multiple MMIO regions with distinct purposes: a register BAR for control and command submission, a VRAM aperture BAR that windows directly into video memory, and often a separate doorbell region for user-mode work submission.
GPU drivers carefully choose memory types for each region to optimize performance while maintaining correctness.
| Device Type | Typical MMIO Size | Memory Type | Access Pattern |
|---|---|---|---|
| NVMe Controller | 16KB - 64KB | UC (registers) | Queue doorbells + config registers |
| Network Card (10GbE+) | 128KB - 1MB | UC (registers), WC (descriptors) | Descriptor rings, MSI-X tables |
| GPU Registers | 16MB - 256MB | UC (strictly ordered) | Command submission, state programming |
| GPU VRAM | 256MB - 32GB | WC (writes), Cache (coherent) | Texture uploads, render targets |
| USB xHCI | 64KB | UC | Transfer rings, event rings |
| PCIe MMCONFIG | 256MB | UC | Extended configuration space |
| Local APIC | 4KB | UC (page-aligned) | Interrupt controller programming |
Modern GPUs can expose BARs of 16GB or more for direct CPU access to VRAM. This exceeds the entire 32-bit address space! Such devices require 64-bit BARs and can only be fully utilized on 64-bit operating systems with sufficient virtual address space. The 'Resizable BAR' (ReBAR) feature allows runtime negotiation of BAR sizes for optimal CPU-GPU data transfer.
MMIO brings device access into the purview of the Memory Management Unit (MMU), enabling powerful security controls—but also introducing unique attack surfaces.
Protection via Page Tables:
Unlike port I/O (protected by IOPL and TSS bitmaps), MMIO is protected by standard page table permissions: present, read/write, user/supervisor, and no-execute bits, so the kernel decides exactly which MMIO pages are visible, to whom, and with what access rights.
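Because of this, the kernel can selectively expose a single device's registers to a suitably privileged process. The sketch below maps BAR 0 of a PCI device through Linux's sysfs resource0 file, which is a real interface; the device address in the path is a placeholder, and reading offset 0 is purely illustrative.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Placeholder bus/device/function: substitute your own device */
    const char *path = "/sys/bus/pci/devices/0000:03:00.0/resource0";
    size_t map_len = 4096;   /* map one page of BAR 0 */

    int fd = open(path, O_RDWR | O_SYNC);   /* O_SYNC requests an uncached mapping */
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *regs = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* The page tables now confine this process to exactly this BAR */
    printf("register at offset 0x0: 0x%08x\n", regs[0]);

    munmap((void *)regs, map_len);
    close(fd);
    return 0;
}
```

The same mechanism underlies frameworks such as UIO and VFIO, which add interrupt delivery and IOMMU containment on top of the basic mapping.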
IOMMU Protection:
The I/O Memory Management Unit (IOMMU—Intel VT-d, AMD-Vi, ARM SMMU) provides the same address translation and protection for device DMA that the CPU MMU provides for processor accesses: devices operate on I/O virtual addresses, and the IOMMU restricts each device to the memory regions the OS has explicitly mapped for it.
Without IOMMU protection, a compromised or malicious device has unrestricted access to all physical memory—a catastrophic security hole.
Many desktop systems ship with IOMMU disabled by default for performance and compatibility. This means a malicious Thunderbolt/USB4 device, PCIe card, or even some network attacks against vulnerable NICs can read your memory contents, capture encryption keys, or install rootkits. Enable VT-d/AMD-Vi in BIOS if security is a concern.
Memory-Mapped I/O is the dominant paradigm for modern device communication, unifying hardware access with memory semantics while introducing critical considerations around caching, ordering, and security.
What's Next:
With our understanding of how MMIO and port I/O work, the next page examines Programmed I/O (PIO)—the simplest (but often least efficient) technique for transferring data between CPU and devices. We'll see how the CPU manually moves every byte, when this approach is appropriate, and its fundamental performance limitations that drive the need for more sophisticated techniques.
You now understand Memory-Mapped I/O comprehensively—from basic concepts through volatile access patterns, caching implications, PCI BAR mechanics, and security considerations. This knowledge is fundamental for device driver development, kernel engineering, and understanding modern computer architecture.