Every memory access your program makes—every variable read, every function call, every stack operation—passes through a specialized piece of hardware before reaching physical memory. This hardware, the Memory Management Unit (MMU), performs the address translation we've been discussing, checks permissions, manages the TLB, and raises exceptions when something goes wrong. All of this happens transparently, billions of times per second, inside your CPU.
The MMU is the hardware embodiment of memory management policy. The operating system sets up the data structures (page tables), but the MMU enforces them in real time. Understanding the MMU—its capabilities, limitations, and interface with software—is essential for systems programming, OS development, and performance optimization.
By the end of this page, you will understand the MMU's role in the memory system, its internal components and architecture, how it performs translation and protection checking at hardware speed, TLB organization and management, the MMU-OS interface for page faults and TLB control, and how modern MMU features enable virtualization and security.
The Memory Management Unit is a hardware component responsible for handling all memory references made by the CPU. It sits logically between the CPU's execution units and the memory system, intercepting every memory access.
Formally:
The MMU is a hardware component that translates logical (virtual) addresses to physical addresses, enforces memory protection policies, and manages address translation caches (TLB) to maintain performance.
In modern processors, the MMU is integrated directly into the CPU die, typically as part of the core's load/store unit. Historically, it was sometimes a separate chip.
The MMU sits logically in front of the cache hierarchy, but translation is overlapped with cache access. Modern CPUs index the L1 cache with virtual address bits while tagging it with physical address bits (Virtually Indexed, Physically Tagged—VIPT), so the TLB lookup and the L1 cache lookup proceed in parallel; the MMU then supplies the physical tag for the final comparison. This overlap is crucial for maintaining performance.
The MMU comprises several specialized hardware structures, each optimized for its specific function. Understanding these components reveals how the MMU achieves its remarkable performance.
| Component | Function | Typical Implementation | Performance Characteristics |
|---|---|---|---|
| TLB | Cache virtual-to-physical translations | Fully associative or set-associative SRAM | 1-cycle access, 64-1536 entries |
| Page Table Walker (PTW) | Walk page tables on TLB miss | State machine + memory access logic | ≈10-100 cycles per walk |
| Permission Checker | Validate access against PTE flags | Combinational logic | 0 extra cycles (parallel with TLB) |
| Address Space ID Register | Hold current process's ASID/PCID | Register + comparator | Enables TLB sharing across contexts |
| Control Registers | Configure MMU behavior (CR0, CR3, CR4 on x86) | Privileged registers | Set by OS kernel only |
| Page Walk Cache | Cache intermediate page table entries | Small associative cache | Reduces multi-level walk cost |
The Translation Lookaside Buffer (TLB):
The TLB is the heart of MMU performance. It's a specialized cache storing recent virtual-to-physical address mappings. Unlike data caches that store actual data, the TLB stores metadata—translation information.
TLB Organization:
Modern CPUs typically have multiple TLB levels and types:
| TLB Type | Entries | Associativity | Page Sizes | Access Time |
|---|---|---|---|---|
| L1 ITLB (Instructions) | 64-128 | 4-8 way | 4KB | 1 cycle |
| L1 DTLB (Data) | 64-128 | 4-8 way | 4KB | 1 cycle |
| L2 STLB (Shared/Unified) | 512-2048 | 4-16 way | 4KB + 2MB | 5-10 cycles |
| Huge Page TLB | 16-64 | Fully assoc. | 2MB, 1GB | 1-2 cycles |
```c
/*
 * TLB Entry Structure and Lookup
 *
 * This shows the logical structure of TLB entries and lookup.
 * Real hardware implements this in transistors, not software.
 */

#include <stdint.h>
#include <stdbool.h>

// TLB entry format (simplified; real entries have more fields)
typedef struct {
    uint64_t vpn;            // Virtual Page Number (the key)
    uint64_t pfn;            // Physical Frame Number (the value)
    uint16_t asid;           // Address Space ID (for multi-process TLB)

    // Permission and attribute bits
    unsigned present : 1;    // Entry is valid
    unsigned writable : 1;   // Page is writable
    unsigned user : 1;       // Page accessible from user mode
    unsigned executable : 1; // Page can be executed
    unsigned global : 1;     // Entry not flushed on ASID change
    unsigned dirty : 1;      // Page has been written
    unsigned accessed : 1;   // Page has been accessed
} TLBEntry;

#define TLB_SIZE 128
#define TLB_WAYS 8
#define TLB_SETS (TLB_SIZE / TLB_WAYS)

// L1 Data TLB: 8-way set associative, 128 entries = 16 sets
TLBEntry l1_dtlb[TLB_SETS][TLB_WAYS];

/*
 * TLB Lookup Process:
 *
 * 1. Extract set index from virtual address
 *    - For 16 sets, use log2(16) = 4 bits of VPN
 *
 * 2. Compare VPN against all entries in the set (parallel!)
 *    - Also compare ASID (or check global bit)
 *
 * 3. If match found (hit), return PFN and permissions
 *    If no match (miss), invoke page table walker
 */

typedef struct {
    bool hit;
    uint64_t pfn;
    bool writable;
    bool executable;
    bool user;
} TLBLookupResult;

TLBLookupResult tlb_lookup(uint64_t vpn, uint16_t current_asid) {
    TLBLookupResult result = {false, 0, false, false, false};

    // Calculate set index
    int set = vpn % TLB_SETS;

    // Search all ways in the set (this is parallel in hardware)
    for (int way = 0; way < TLB_WAYS; way++) {
        TLBEntry* entry = &l1_dtlb[set][way];

        // Check if entry matches
        bool vpn_match = (entry->vpn == vpn);
        bool asid_ok = entry->global || (entry->asid == current_asid);

        if (entry->present && vpn_match && asid_ok) {
            // TLB hit!
            result.hit = true;
            result.pfn = entry->pfn;
            result.writable = entry->writable;
            result.executable = entry->executable;
            result.user = entry->user;

            // Update accessed bit (may be done lazily in real hardware)
            entry->accessed = 1;
            return result;
        }
    }

    // TLB miss - need a page table walk
    return result;
}

/*
 * TLB Entry Insertion (after page table walk):
 *
 * When a TLB miss occurs and the page table walker finds the translation,
 * the MMU inserts the result into the TLB for future lookups.
 *
 * The replacement policy (typically pseudo-LRU) chooses which
 * way in the set to evict.
 */
void tlb_insert(uint64_t vpn, uint64_t pfn, uint16_t asid,
                bool writable, bool user, bool executable, bool global) {
    int set = vpn % TLB_SETS;

    // Find a way to replace (use pseudo-LRU or similar)
    int victim_way = 0; // Simplified; real hardware tracks LRU state

    TLBEntry* entry = &l1_dtlb[set][victim_way];
    entry->vpn = vpn;
    entry->pfn = pfn;
    entry->asid = asid;
    entry->present = 1;
    entry->writable = writable;
    entry->user = user;
    entry->executable = executable;
    entry->global = global;
    entry->dirty = 0;
    entry->accessed = 1;
}

/*
 * In real hardware:
 * - All 8 comparisons happen simultaneously in one clock cycle
 * - The comparators are CAM (Content-Addressable Memory) cells
 * - Power consumption is significant due to parallel comparison
 * - This is why TLB size is limited (more entries = more power)
 */
```

TLB Reach = Number of TLB Entries × Page Size. This is the total amount of memory that can be translated without a TLB miss. For example, 1024 entries × 4 KB = 4 MB of reach. If a program's working set exceeds TLB reach, it will suffer continuous TLB misses. Huge pages (2 MB, 1 GB) dramatically increase TLB reach—1024 entries × 2 MB = 2 GB of reach!
When the TLB doesn't contain a translation (TLB miss), the Page Table Walker (PTW) hardware automatically reads the page table structures from memory to find the mapping. This process is called a page table walk.
Page Table Walk Steps (x86-64, 4-level):
```c
/*
 * Page Table Walker (PTW) Logic
 *
 * This is implemented in hardware as a state machine.
 * The following represents its logical behavior.
 */

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PML4_SHIFT 39
#define PDPT_SHIFT 30
#define PD_SHIFT   21
#define PT_SHIFT   12
#define INDEX_MASK 0x1FF // 9 bits

#define PTE_PRESENT   (1ULL << 0)
#define PTE_WRITABLE  (1ULL << 1)
#define PTE_USER      (1ULL << 2)
#define PTE_PS        (1ULL << 7)            // Page Size (huge page)
#define PTE_ADDR_MASK 0x000FFFFFFFFFF000ULL // Physical address bits

typedef enum {
    WALK_SUCCESS,
    WALK_PAGE_FAULT,
    WALK_ACCESS_FAULT
} WalkResult;

typedef struct {
    WalkResult result;
    uint64_t physical_frame;
    uint64_t page_size; // 4KB, 2MB, or 1GB
    bool writable;
    bool user;
    bool executable;
} PTWResult;

// Simulate a memory read (in hardware, this goes to cache/memory)
uint64_t read_memory_64(uint64_t physical_addr) {
    // In real hardware, this is a memory bus transaction
    // The PTW has its own path to memory, bypassing the TLB
    return 0; // Placeholder
}

PTWResult page_table_walk(uint64_t cr3, uint64_t virtual_addr,
                          bool is_write, bool is_user, bool is_execute) {
    PTWResult result = {WALK_SUCCESS, 0, 4096, true, true, true};

    // Extract the index for each level
    int pml4_idx = (virtual_addr >> PML4_SHIFT) & INDEX_MASK;
    int pdpt_idx = (virtual_addr >> PDPT_SHIFT) & INDEX_MASK;
    int pd_idx   = (virtual_addr >> PD_SHIFT) & INDEX_MASK;
    int pt_idx   = (virtual_addr >> PT_SHIFT) & INDEX_MASK;

    printf("Page Table Walk for VA 0x%016llx\n",
           (unsigned long long)virtual_addr);

    // Level 4: PML4
    uint64_t pml4_base = cr3 & PTE_ADDR_MASK;
    uint64_t pml4_entry_addr = pml4_base + pml4_idx * 8;
    uint64_t pml4_entry = read_memory_64(pml4_entry_addr);

    printf("  PML4[%d] @ 0x%llx = 0x%llx\n", pml4_idx,
           (unsigned long long)pml4_entry_addr,
           (unsigned long long)pml4_entry);

    if (!(pml4_entry & PTE_PRESENT)) {
        result.result = WALK_PAGE_FAULT;
        return result;
    }

    // Check permissions at this level
    if (is_user && !(pml4_entry & PTE_USER)) {
        result.result = WALK_ACCESS_FAULT;
        return result;
    }

    // Level 3: PDPT
    uint64_t pdpt_base = pml4_entry & PTE_ADDR_MASK;
    uint64_t pdpt_entry_addr = pdpt_base + pdpt_idx * 8;
    uint64_t pdpt_entry = read_memory_64(pdpt_entry_addr);

    printf("  PDPT[%d] @ 0x%llx = 0x%llx\n", pdpt_idx,
           (unsigned long long)pdpt_entry_addr,
           (unsigned long long)pdpt_entry);

    if (!(pdpt_entry & PTE_PRESENT)) {
        result.result = WALK_PAGE_FAULT;
        return result;
    }

    // Check for a 1GB huge page
    if (pdpt_entry & PTE_PS) {
        result.physical_frame = (pdpt_entry & 0x000FFFFFC0000000ULL) >> 30;
        result.page_size = 1ULL << 30; // 1 GB
        printf("  1GB huge page! Frame = 0x%llx\n",
               (unsigned long long)result.physical_frame);
        return result;
    }

    // Level 2: PD
    uint64_t pd_base = pdpt_entry & PTE_ADDR_MASK;
    uint64_t pd_entry_addr = pd_base + pd_idx * 8;
    uint64_t pd_entry = read_memory_64(pd_entry_addr);

    printf("  PD[%d] @ 0x%llx = 0x%llx\n", pd_idx,
           (unsigned long long)pd_entry_addr,
           (unsigned long long)pd_entry);

    if (!(pd_entry & PTE_PRESENT)) {
        result.result = WALK_PAGE_FAULT;
        return result;
    }

    // Check for a 2MB huge page
    if (pd_entry & PTE_PS) {
        result.physical_frame = (pd_entry & 0x000FFFFFFFE00000ULL) >> 21;
        result.page_size = 1ULL << 21; // 2 MB
        printf("  2MB huge page! Frame = 0x%llx\n",
               (unsigned long long)result.physical_frame);
        return result;
    }

    // Level 1: PT
    uint64_t pt_base = pd_entry & PTE_ADDR_MASK;
    uint64_t pt_entry_addr = pt_base + pt_idx * 8;
    uint64_t pt_entry = read_memory_64(pt_entry_addr);

    printf("  PT[%d] @ 0x%llx = 0x%llx\n", pt_idx,
           (unsigned long long)pt_entry_addr,
           (unsigned long long)pt_entry);

    if (!(pt_entry & PTE_PRESENT)) {
        result.result = WALK_PAGE_FAULT;
        return result;
    }

    // 4KB page
    result.physical_frame = (pt_entry & PTE_ADDR_MASK) >> 12;
    result.page_size = 4096;
    result.writable = !!(pt_entry & PTE_WRITABLE);
    result.user = !!(pt_entry & PTE_USER);

    printf("  4KB page. Frame = 0x%llx\n",
           (unsigned long long)result.physical_frame);
    return result;
}

/*
 * Hardware Optimization: Page Walk Cache (PWC)
 *
 * Modern CPUs cache intermediate page table entries.
 * If we recently translated a nearby address, chances are
 * the PML4/PDPT/PD entries are the same—only the PT level differs.
 *
 * Example:
 *   VA 0x7FFF12340000 and VA 0x7FFF12341000 share
 *   the same PML4, PDPT, and PD entries—only PT differs.
 *
 * With a PWC, the second walk only reads the PT level.
 * This reduces average walk cost significantly.
 */
```

Each page table walk requires 4 memory accesses (for 4-level paging). Even if those accesses hit L1 cache (~4 cycles each), a walk costs 16+ cycles. If they go to main memory (~100 ns each), a walk costs 400+ cycles. This is why TLB hit rate is critical—a 1% miss rate still means millions of expensive walks per second in a high-throughput workload.
The operating system controls the MMU through special hardware registers. These registers are privileged—only kernel-mode code can modify them. They configure fundamental aspects of MMU behavior and are essential for understanding how OS kernel code manages memory.
| Register | Purpose | Key Bits | Modified When |
|---|---|---|---|
| CR0 | System control modes | PG (paging enable), WP (write protect) | Boot time, rarely changed |
| CR2 | Page fault linear address | Faulting virtual address | Set by hardware on page fault |
| CR3 | Page table base + PCID | PML4 physical address, PCID | Every context switch |
| CR4 | Extended features | PAE, PSE, PCIDE, SMEP, SMAP | Boot time, feature enable |
| EFER (MSR) | Long mode control | LME (Long Mode Enable), NXE | Boot time for 64-bit mode |
CR3: The Page Table Base Register
CR3 is the most frequently modified MMU register. It holds the physical address of the top-level page table (PML4 in x86-64). Changing CR3 effectively switches the entire address space.
CR3 Contents: bits 51:12 hold the physical address of the PML4 table (which must be 4 KB-aligned); when CR4.PCIDE is enabled, bits 11:0 hold the current PCID, and otherwise they contain the PWT/PCD cache-control flags.
On a context switch from Process A to Process B:
```c
/*
 * MMU Control Register Operations
 *
 * These operations are performed in kernel mode only.
 * Attempting them from user mode causes a protection fault.
 */

#include <stdint.h>

/*
 * Read CR3 - get the current page table base
 */
static inline uint64_t read_cr3(void) {
    uint64_t val;
    __asm__ volatile("mov %%cr3, %0" : "=r"(val));
    return val;
}

/*
 * Write CR3 - switch address space
 *
 * This is the heart of context-switching memory.
 * After this instruction executes, all memory translations
 * are based on the new page table.
 *
 * WARNING: This instruction implicitly flushes TLB entries
 * (except global pages and PCID-tagged entries on modern CPUs)
 */
static inline void write_cr3(uint64_t val) {
    __asm__ volatile("mov %0, %%cr3" : : "r"(val) : "memory");
}

/*
 * Read CR2 - get the faulting address after a page fault
 */
static inline uint64_t read_cr2(void) {
    uint64_t val;
    __asm__ volatile("mov %%cr2, %0" : "=r"(val));
    return val;
}

/*
 * Context switch between processes (simplified)
 */
typedef struct {
    uint64_t cr3; // Page table base
    // ... other saved state (GPRs, etc.)
} ProcessContext;

void switch_address_space(ProcessContext* from, ProcessContext* to) {
    // Only switch CR3 if actually changing address spaces
    // (Threads of one process share an address space; no CR3 switch needed)
    if (from->cr3 != to->cr3) {
        from->cr3 = read_cr3(); // Save old
        write_cr3(to->cr3);     // Load new

        /*
         * At this point:
         * - TLB entries for the old address space are invalid
         *   (unless using PCID or they're global)
         * - All memory accesses use the new page table
         * - The next instruction fetch is translated through new tables!
         */
    }
}

/*
 * Enable/Disable paging (only at boot time)
 */
static inline uint64_t read_cr0(void) {
    uint64_t val;
    __asm__ volatile("mov %%cr0, %0" : "=r"(val));
    return val;
}

static inline void write_cr0(uint64_t val) {
    __asm__ volatile("mov %0, %%cr0" : : "r"(val) : "memory");
}

#define CR0_PG (1UL << 31) // Paging enable bit
#define CR0_WP (1UL << 16) // Write protect bit

void enable_paging(void) {
    // Set up page tables in CR3 first!
    // Then enable paging
    uint64_t cr0 = read_cr0();
    cr0 |= CR0_PG | CR0_WP;
    write_cr0(cr0);

    // Paging is now active!
    // All subsequent addresses are virtual and translated
}

/*
 * IMPORTANT: CR4 security features
 *
 * CR4.SMEP (Supervisor Mode Execution Prevention):
 * - If set, kernel mode cannot execute user-mode pages
 * - Defends against ret2usr attacks
 *
 * CR4.SMAP (Supervisor Mode Access Prevention):
 * - If set, kernel mode cannot read/write user-mode pages
 *   unless explicitly enabled (EFLAGS.AC = 1)
 * - Prevents accidental kernel access to user data
 * - Defends against many exploit primitives
 */
```

Without PCID, writing CR3 flushes the entire TLB (a "full shootdown"). With PCID, TLB entries are tagged with a 12-bit process-context ID. The MMU only uses entries matching the current PCID, so entries from other processes remain cached. On a switch back, those entries are still valid! This can reduce context switch overhead by 40-50% in TLB-sensitive workloads.
The TLB caches translations, but this cache must be kept consistent with the actual page tables. When the OS modifies page tables—changing mappings, permissions, or removing pages—it must ensure the TLB doesn't contain stale entries. This is TLB management, one of the most performance-critical aspects of OS memory management.
TLB Invalidation Instructions:
The x86-64 architecture provides several ways to invalidate TLB entries:
| Instruction | Effect | Use Case |
|---|---|---|
| `MOV` to CR3 | Flush all non-global entries | Context switch |
| `INVLPG addr` | Flush single page entry | Single page change |
| `INVPCID` | Flush by PCID, address, or both | Fine-grained control |
| `INVLPGA` (AMD) | Invalidate by ASID and address | Guest VM management |
```c
/*
 * TLB Invalidation Operations
 *
 * Critical for maintaining TLB coherency with page tables.
 * Incorrect invalidation leads to using stale translations—
 * data corruption, security vulnerabilities, crashes.
 */

#include <stdint.h>

/*
 * INVLPG - Invalidate a single page
 *
 * This is the most common invalidation operation.
 * Used when modifying a single page table entry.
 */
static inline void invlpg(void* addr) {
    __asm__ volatile("invlpg (%0)" : : "r"(addr) : "memory");
}

/*
 * Full TLB flush (via CR3 reload)
 *
 * Expensive but sometimes necessary.
 * On systems without PCID, this is what a context switch does.
 */
static inline void flush_tlb_all(void) {
    uint64_t cr3;
    __asm__ volatile("mov %%cr3, %0" : "=r"(cr3));
    __asm__ volatile("mov %0, %%cr3" : : "r"(cr3) : "memory");
}

/*
 * Unmap a page: update the PTE and invalidate the TLB
 */
void unmap_page(uint64_t* pte, void* virtual_addr) {
    // Step 1: Clear the page table entry
    *pte = 0; // Mark not present

    // Step 2: Memory barrier - ensure the PTE write is visible
    __asm__ volatile("mfence" ::: "memory");

    // Step 3: Invalidate the TLB entry
    invlpg(virtual_addr);

    /*
     * Order matters! If we invalidated the TLB before clearing the PTE:
     * - Another CPU might cache the old entry between our operations
     * - We must ensure the PTE is cleared before TLB invalidation
     */
}

/*
 * Change page permissions (e.g., make a writable page read-only)
 */
void make_page_readonly(uint64_t* pte, void* virtual_addr) {
    // Clear the writable bit
    *pte &= ~(1ULL << 1); // Clear R/W bit

    // Barrier
    __asm__ volatile("mfence" ::: "memory");

    // Invalidate
    invlpg(virtual_addr);
}

/*
 * TLB Shootdown: Multi-processor TLB coherency
 *
 * Problem: When CPU 0 modifies a page table entry, CPU 1's TLB
 * might still have the old translation cached.
 *
 * Solution: TLB shootdown via Inter-Processor Interrupt (IPI)
 *
 * 1. CPU 0 modifies the PTE
 * 2. CPU 0 invalidates its own TLB (INVLPG)
 * 3. CPU 0 sends an IPI to all other CPUs running the affected process
 * 4. Other CPUs receive the interrupt, execute INVLPG, acknowledge
 * 5. CPU 0 waits for acknowledgments before proceeding
 *
 * This is expensive! ~10,000 cycles for a full shootdown.
 */

typedef struct {
    void* address;          // Virtual address to invalidate
    uint16_t asid;          // Process/address space identifier
    volatile int ack_count; // How many CPUs have acknowledged
    int target_count;       // How many CPUs need to acknowledge
} TLBShootdownRequest;

// IPI handler on a remote CPU
void tlb_shootdown_ipi_handler(TLBShootdownRequest* req) {
    // Check if this address space is active on this CPU
    // If so, invalidate
    invlpg(req->address);

    // Acknowledge
    __sync_fetch_and_add(&req->ack_count, 1);
}

/*
 * Performance optimization: Lazy TLB
 *
 * If a CPU is running a kernel thread (no user address space),
 * we can skip sending an IPI for user-space TLB invalidations.
 * We mark the CPU as "lazy" and do the invalidation if/when
 * it switches back to user mode.
 */
```

In heavily multi-threaded applications with frequent memory mapping changes, TLB shootdowns can become a major bottleneck. Each shootdown requires interrupting all CPUs, saving their state, invalidating entries, and acknowledging. Workloads like JVMs (with garbage collection), databases, and hypervisors can suffer significantly. This is a key motivation for persistent memory mappings and huge pages.
When address translation cannot proceed normally, the MMU raises an exception (also called a fault or trap). The OS kernel handles these exceptions to implement demand paging, copy-on-write, memory protection, and more. Understanding MMU exceptions is essential for kernel development.
| Exception | Cause | CR2 Contains | Typical OS Response |
|---|---|---|---|
| Page Fault (not present) | PTE.Present = 0 | Faulting address | Load page from disk, create mapping |
| Page Fault (write to RO) | Write to PTE.R/W = 0 | Faulting address | COW copy, or signal SIGSEGV |
| Page Fault (user to kernel) | User access to PTE.U/S = 0 | Faulting address | Signal SIGSEGV (security violation) |
| Page Fault (execute NX) | Execute on PTE.NX = 1 | Faulting address | Signal SIGSEGV (security violation) |
| General Protection Fault | Various invalid operations | Varies | Signal SIGSEGV or kernel panic |
```c
/*
 * Page Fault Handler (Simplified)
 *
 * This is one of the most critical OS kernel routines.
 * It runs in kernel mode, triggered by an MMU exception.
 * (VMA, find_vma, and the page-loading helpers below stand in
 * for the kernel's real data structures and routines.)
 */

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

// Error code bits pushed by hardware on a page fault
#define PF_PRESENT (1 << 0) // 0 = page not present, 1 = protection violation
#define PF_WRITE   (1 << 1) // Fault caused by write (1) or read (0)
#define PF_USER    (1 << 2) // Fault occurred in user mode (1) or kernel (0)
#define PF_RSVD    (1 << 3) // Fault caused by reserved bit violation
#define PF_INSTR   (1 << 4) // Fault caused by instruction fetch

typedef enum {
    FAULT_HANDLED,      // Fault resolved, resume execution
    FAULT_SIGNAL_SEGV,  // Send SIGSEGV to the process
    FAULT_KERNEL_PANIC, // Unrecoverable kernel error
} FaultResolution;

FaultResolution handle_page_fault(uint64_t error_code) {
    // Read the faulting address from CR2
    uint64_t fault_addr = read_cr2();

    printf("Page fault: addr=0x%016llx, error=0x%llx\n",
           (unsigned long long)fault_addr,
           (unsigned long long)error_code);

    // Decode the error code
    bool is_present     = error_code & PF_PRESENT;
    bool is_write       = error_code & PF_WRITE;
    bool is_user        = error_code & PF_USER;
    bool is_reserved    = error_code & PF_RSVD;
    bool is_instruction = error_code & PF_INSTR;

    // Reserved bit violation: always an error (corrupted page table)
    if (is_reserved) {
        return FAULT_KERNEL_PANIC;
    }

    // Find the VMA (Virtual Memory Area) containing the fault address
    // A VMA describes a valid region of the address space
    VMA* vma = find_vma(current_process, fault_addr);
    if (vma == NULL) {
        // Address is not in any valid region
        printf("  No VMA for this address\n");
        return is_user ? FAULT_SIGNAL_SEGV : FAULT_KERNEL_PANIC;
    }

    // Check that the access type matches the VMA permissions
    if (is_write && !(vma->flags & VM_WRITE)) {
        printf("  Write to read-only VMA\n");
        return is_user ? FAULT_SIGNAL_SEGV : FAULT_KERNEL_PANIC;
    }
    if (is_instruction && !(vma->flags & VM_EXEC)) {
        printf("  Execute on non-executable VMA\n");
        return is_user ? FAULT_SIGNAL_SEGV : FAULT_KERNEL_PANIC;
    }

    // Fault in a valid VMA - check the specific cause
    if (!is_present) {
        // Page not present: demand paging
        printf("  Page not present - loading...\n");

        if (vma->type == VMA_ANONYMOUS) {
            // Anonymous memory: allocate a zero page
            allocate_anonymous_page(fault_addr);
        } else if (vma->type == VMA_FILE_MAPPED) {
            // File-mapped: read from the file
            load_page_from_file(vma->file, fault_addr);
        } else if (vma->type == VMA_SWAP) {
            // Swapped out: read from swap
            load_page_from_swap(fault_addr);
        }
        return FAULT_HANDLED;
    }

    if (is_present && is_write) {
        // Present but write fault: likely COW
        printf("  Write to present page - checking COW...\n");

        if (is_cow_page(fault_addr)) {
            // Copy-on-Write: make a private copy
            handle_cow(fault_addr);
            return FAULT_HANDLED;
        }
    }

    // Shouldn't reach here if the VMA matches
    printf("  Unhandled case\n");
    return is_user ? FAULT_SIGNAL_SEGV : FAULT_KERNEL_PANIC;
}

/*
 * The page fault handler is one of the most performance-sensitive
 * kernel routines. Optimizations include:
 *
 * - Fast paths for common cases (demand paging of anonymous memory)
 * - VMA lookup using red-black trees or radix trees for O(log n)
 * - Prefaulting: loading nearby pages when accessing one
 * - Avoiding unnecessary TLB flushes
 * - Lock-free paths where possible
 */
```

Not all page faults indicate errors. Many are "soft" faults—the page is valid but not loaded yet (demand paging) or needs copying (COW). Only "hard" faults (accessing truly invalid memory) result in SIGSEGV. A healthy system has many soft page faults; watching page fault counters without understanding this leads to false alarms.
Modern MMUs include advanced features that go beyond basic address translation. These features enable virtualization, enhance security, and improve performance in ways that weren't possible with earlier MMU designs.
Extended Page Tables (EPT) for Virtualization:
In a virtualized system, the guest OS maintains its own page tables, mapping guest virtual to guest physical addresses. But the hypervisor has a second layer—Extended Page Tables—that map guest physical to host physical addresses.
Memory Access Path:
1. Guest Virtual Address (what guest process sees)
2. Guest Page Tables → Guest Physical Address
3. Extended Page Tables → Host Physical Address (actual RAM)
Without hardware support, every guest page table walk would require multiple VM exits (hypervisor calls), crippling performance. EPT performs both translations in hardware—the MMU walks both tables automatically.
```c
/*
 * Memory Protection Keys (MPK) - Intel PKU Feature
 *
 * MPK allows fast permission changes without modifying page tables
 * or flushing the TLB. Perfect for sandboxing and managed runtimes.
 */

#include <stdint.h>
#include <sys/mman.h>

// Read PKRU (Protection Key Rights for User pages)
static inline uint32_t read_pkru(void) {
    uint32_t eax, ecx = 0, edx;
    __asm__ volatile("rdpkru" : "=a"(eax), "=d"(edx) : "c"(ecx));
    return eax;
}

// Write PKRU
static inline void write_pkru(uint32_t val) {
    uint32_t eax = val, ecx = 0, edx = 0;
    __asm__ volatile("wrpkru" : : "a"(eax), "d"(edx), "c"(ecx));
}

/*
 * PKRU format:
 * - 32 bits total, 2 bits per protection key (16 keys total)
 * - Bit 2*i:   Disable access for key i
 * - Bit 2*i+1: Disable write for key i
 *
 * Page table entries have a 4-bit protection key field.
 * On every memory access, the MMU checks:
 *   1. Normal page-table permissions (R/W/X)
 *   2. PKRU permissions for the page's protection key
 *
 * Both must allow the access, or a fault occurs.
 */

#define PKEY_DISABLE_ACCESS 0x1
#define PKEY_DISABLE_WRITE  0x2

// Allocate a protection key
int pkey_alloc(unsigned int flags, unsigned int access_rights) {
    // Syscall: allocate an unused protection key
    // Returns a key number (0-15) or an error
    return 0; // Simplified
}

// Associate a key with a memory range
void protect_region(void* addr, size_t len, int pkey) {
    // mprotect variant that also sets the protection key:
    // pkey_mprotect(addr, len, PROT_READ | PROT_WRITE, pkey);
}

/*
 * Example: Sandboxing untrusted code
 *
 * 1. Allocate a protection key for sensitive data
 *      int key = pkey_alloc(0, PKEY_DISABLE_ACCESS);
 *
 * 2. Associate the key with sensitive memory
 *      protect_region(secret_buffer, 4096, key);
 *
 * 3. Sensitive memory is now inaccessible
 *
 * 4. When trusted code needs access:
 *      uint32_t old_pkru = read_pkru();
 *      write_pkru(old_pkru & ~(0x3 << (2 * key))); // Enable key
 *      // ... access memory ...
 *      write_pkru(old_pkru); // Restore protection
 *
 * Benefit: wrpkru is ~20 cycles vs ~1000 cycles for mprotect()
 * No TLB flush, no syscall overhead
 */
```

After Spectre/Meltdown (2018), MMU features gained new importance. KPTI (Kernel Page Table Isolation) maintains separate page tables for kernel and user mode to prevent speculative execution attacks. This roughly doubles TLB misses on syscalls. Features like PCID became essential to mitigate this performance impact. Modern MMU design is now inseparable from security considerations.
We've explored the MMU—the hardware foundation of memory management. This specialized processor component enables everything from simple address translation to advanced virtualization and security features.
What's Next:
We've examined the MMU and its role in address translation. The final topic in this module is base and limit registers—the simpler, historical predecessor to paging that's still conceptually important and used in some contexts today.
You now understand the MMU as the hardware that makes virtual memory, memory protection, and modern multiprogramming possible. This knowledge is essential for kernel development, performance tuning, security analysis, and understanding system behavior at a deep level. The MMU is where software policy meets hardware enforcement—the critical boundary in system design.