Every memory access your program makes—every variable read, every function call, every stack operation—passes through a specialized piece of hardware before reaching physical memory. This hardware, the Memory Management Unit (MMU), performs the address translation we've been discussing, checks permissions, manages the TLB, and raises exceptions when something goes wrong. All of this happens transparently, billions of times per second, inside your CPU.
The MMU is the hardware embodiment of memory management policy. The operating system sets up the data structures (page tables), but the MMU enforces them in real time. Understanding the MMU—its capabilities, limitations, and interface with software—is essential for systems programming, OS development, and performance optimization.
By the end of this page, you will understand the MMU's role in the memory system, its internal components and architecture, how it performs translation and protection checking at hardware speed, TLB organization and management, the MMU-OS interface for page faults and TLB control, and how modern MMU features enable virtualization and security.
The Memory Management Unit is a hardware component responsible for handling all memory references made by the CPU. It sits logically between the CPU's execution units and the memory system, intercepting every memory access.
Formally:
The MMU is a hardware component that translates logical (virtual) addresses to physical addresses, enforces memory protection policies, and manages address translation caches (TLB) to maintain performance.
In modern processors, the MMU is integrated directly into the CPU die, typically as part of the core's load/store unit. Historically, it was sometimes a separate chip.
The MMU sits logically in front of the cache hierarchy, but translation is overlapped with cache access. Modern CPUs index the L1 cache with virtual address bits while tagging it with physical address bits (Virtually Indexed, Physically Tagged—VIPT), so the TLB lookup and the L1 cache lookup proceed in parallel; the MMU then supplies the physical tag for the final comparison. This overlap is crucial for maintaining performance.
The MMU comprises several specialized hardware structures, each optimized for its specific function. Understanding these components reveals how the MMU achieves its remarkable performance.
| Component | Function | Typical Implementation | Performance Characteristics |
|---|---|---|---|
| TLB | Cache virtual-to-physical translations | Fully associative or set-associative SRAM | 1-cycle access, 64-1536 entries |
| Page Table Walker (PTW) | Walk page tables on TLB miss | State machine + memory access logic | ≈10-100 cycles per walk |
| Permission Checker | Validate access against PTE flags | Combinational logic | 0 extra cycles (parallel with TLB) |
| Address Space ID Register | Hold current process's ASID/PCID | Register + comparator | Enables TLB sharing across contexts |
| Control Registers | Configure MMU behavior (CR0, CR3, CR4 on x86) | Privileged registers | Set by OS kernel only |
| Page Walk Cache | Cache intermediate page table entries | Small associative cache | Reduces multi-level walk cost |
The Translation Lookaside Buffer (TLB):
The TLB is the heart of MMU performance. It's a specialized cache storing recent virtual-to-physical address mappings. Unlike data caches that store actual data, the TLB stores metadata—translation information.
TLB Organization:
Modern CPUs typically have multiple TLB levels and types:
| TLB Type | Entries | Associativity | Page Sizes | Access Time |
|---|---|---|---|---|
| L1 ITLB (Instructions) | 64-128 | 4-8 way | 4KB | 1 cycle |
| L1 DTLB (Data) | 64-128 | 4-8 way | 4KB | 1 cycle |
| L2 STLB (Shared/Unified) | 512-2048 | 4-16 way | 4KB + 2MB | 5-10 cycles |
| Huge Page TLB | 16-64 | Fully assoc. | 2MB, 1GB | 1-2 cycles |
```c
/*
 * TLB Entry Structure and Lookup
 *
 * This shows the logical structure of TLB entries and lookup.
 * Real hardware implements this in transistors, not software.
 */

#include <stdint.h>
#include <stdbool.h>

// TLB entry format (simplified; real entries have more fields)
typedef struct {
    uint64_t vpn;            // Virtual Page Number (the key)
    uint64_t pfn;            // Physical Frame Number (the value)
    uint16_t asid;           // Address Space ID (for multi-process TLB)

    // Permission and attribute bits
    unsigned present : 1;    // Entry is valid
    unsigned writable : 1;   // Page is writable
    unsigned user : 1;       // Page accessible from user mode
    unsigned executable : 1; // Page can be executed
    unsigned global : 1;     // Entry not flushed on ASID change
    unsigned dirty : 1;      // Page has been written
    unsigned accessed : 1;   // Page has been accessed
} TLBEntry;

#define TLB_SIZE 128
#define TLB_WAYS 8
#define TLB_SETS (TLB_SIZE / TLB_WAYS)

// L1 Data TLB: 8-way set associative, 128 entries = 16 sets
TLBEntry l1_dtlb[TLB_SETS][TLB_WAYS];

/*
 * TLB Lookup Process:
 *
 * 1. Extract set index from virtual address
 *    - For 16 sets, use log2(16) = 4 bits of VPN
 *
 * 2. Compare VPN against all entries in the set (parallel!)
 *    - Also compare ASID (or check global bit)
 *
 * 3. If match found (hit), return PFN and permissions
 *    If no match (miss), invoke page table walker
 */

typedef struct {
    bool hit;
    uint64_t pfn;
    bool writable;
    bool executable;
    bool user;
} TLBLookupResult;

TLBLookupResult tlb_lookup(uint64_t vpn, uint16_t current_asid) {
    TLBLookupResult result = {false, 0, false, false, false};

    // Calculate set index
    int set = vpn % TLB_SETS;

    // Search all ways in the set (this is parallel in hardware)
    for (int way = 0; way < TLB_WAYS; way++) {
        TLBEntry* entry = &l1_dtlb[set][way];

        // Check if entry matches
        bool vpn_match = (entry->vpn == vpn);
        bool asid_ok = entry->global || (entry->asid == current_asid);

        if (entry->present && vpn_match && asid_ok) {
            // TLB hit!
            result.hit = true;
            result.pfn = entry->pfn;
            result.writable = entry->writable;
            result.executable = entry->executable;
            result.user = entry->user;

            // Update accessed bit (may be done lazily in real hardware)
            entry->accessed = 1;
            return result;
        }
    }

    // TLB miss - need a page table walk
    return result;
}

/*
 * TLB Entry Insertion (after page table walk):
 *
 * When a TLB miss occurs and the page table walker finds the translation,
 * the MMU inserts the result into the TLB for future lookups.
 *
 * The replacement policy (typically pseudo-LRU) chooses which
 * way in the set to evict.
 */
void tlb_insert(uint64_t vpn, uint64_t pfn, uint16_t asid,
                bool writable, bool user, bool executable, bool global) {
    int set = vpn % TLB_SETS;

    // Find a way to replace (use pseudo-LRU or similar)
    int victim_way = 0; // Simplified; real hardware tracks LRU state

    TLBEntry* entry = &l1_dtlb[set][victim_way];
    entry->vpn = vpn;
    entry->pfn = pfn;
    entry->asid = asid;
    entry->present = 1;
    entry->writable = writable;
    entry->user = user;
    entry->executable = executable;
    entry->global = global;
    entry->dirty = 0;
    entry->accessed = 1;
}

/*
 * In real hardware:
 * - All 8 comparisons happen simultaneously in one clock cycle
 * - The comparators are CAM (Content-Addressable Memory) cells
 * - Power consumption is significant due to parallel comparison
 * - This is why TLB size is limited (more entries = more power)
 */
```

TLB Reach = Number of TLB Entries × Page Size. This is the total amount of memory that can be translated without a TLB miss. For example, 1024 entries × 4 KB = 4 MB of reach. If a program's working set exceeds TLB reach, it will suffer continuous TLB misses. Huge pages (2 MB, 1 GB) dramatically increase TLB reach—1024 entries × 2 MB = 2 GB of reach!
When the TLB doesn't contain a translation (TLB miss), the Page Table Walker (PTW) hardware automatically reads the page table structures from memory to find the mapping. This process is called a page table walk.
Page Table Walk Steps (x86-64, 4-level):
```c
/*
 * Page Table Walker (PTW) Logic
 *
 * This is implemented in hardware as a state machine.
 * The following represents its logical behavior.
 */

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PML4_SHIFT 39
#define PDPT_SHIFT 30
#define PD_SHIFT   21
#define PT_SHIFT   12
#define INDEX_MASK 0x1FF // 9 bits

#define PTE_PRESENT   (1ULL << 0)
#define PTE_WRITABLE  (1ULL << 1)
#define PTE_USER      (1ULL << 2)
#define PTE_PS        (1ULL << 7)            // Page Size (huge page)
#define PTE_ADDR_MASK 0x000FFFFFFFFFF000ULL // Physical address bits

typedef enum {
    WALK_SUCCESS,
    WALK_PAGE_FAULT,
    WALK_ACCESS_FAULT
} WalkResult;

typedef struct {
    WalkResult result;
    uint64_t physical_frame;
    uint64_t page_size; // 4KB, 2MB, or 1GB
    bool writable;
    bool user;
    bool executable;
} PTWResult;

// Simulate a memory read (in hardware, this goes to cache/memory)
uint64_t read_memory_64(uint64_t physical_addr) {
    // In real hardware, this is a memory bus transaction
    // The PTW has its own path to memory, bypassing the TLB
    return 0; // Placeholder
}

PTWResult page_table_walk(uint64_t cr3, uint64_t virtual_addr,
                          bool is_write, bool is_user, bool is_execute) {
    PTWResult result = {WALK_SUCCESS, 0, 4096, true, true, true};

    // Extract the index for each level
    int pml4_idx = (virtual_addr >> PML4_SHIFT) & INDEX_MASK;
    int pdpt_idx = (virtual_addr >> PDPT_SHIFT) & INDEX_MASK;
    int pd_idx   = (virtual_addr >> PD_SHIFT) & INDEX_MASK;
    int pt_idx   = (virtual_addr >> PT_SHIFT) & INDEX_MASK;

    printf("Page Table Walk for VA 0x%016llx\n",
           (unsigned long long)virtual_addr);

    // Level 4: PML4
    uint64_t pml4_base = cr3 & PTE_ADDR_MASK;
    uint64_t pml4_entry_addr = pml4_base + pml4_idx * 8;
    uint64_t pml4_entry = read_memory_64(pml4_entry_addr);

    printf("  PML4[%d] @ 0x%llx = 0x%llx\n", pml4_idx,
           (unsigned long long)pml4_entry_addr,
           (unsigned long long)pml4_entry);

    if (!(pml4_entry & PTE_PRESENT)) {
        result.result = WALK_PAGE_FAULT;
        return result;
    }

    // Check permissions at this level
    if (is_user && !(pml4_entry & PTE_USER)) {
        result.result = WALK_ACCESS_FAULT;
        return result;
    }

    // Level 3: PDPT
    uint64_t pdpt_base = pml4_entry & PTE_ADDR_MASK;
    uint64_t pdpt_entry_addr = pdpt_base + pdpt_idx * 8;
    uint64_t pdpt_entry = read_memory_64(pdpt_entry_addr);

    printf("  PDPT[%d] @ 0x%llx = 0x%llx\n", pdpt_idx,
           (unsigned long long)pdpt_entry_addr,
           (unsigned long long)pdpt_entry);

    if (!(pdpt_entry & PTE_PRESENT)) {
        result.result = WALK_PAGE_FAULT;
        return result;
    }

    // Check for a 1GB huge page
    if (pdpt_entry & PTE_PS) {
        result.physical_frame = (pdpt_entry & 0x000FFFFFC0000000ULL) >> 30;
        result.page_size = 1ULL << 30; // 1 GB
        printf("  1GB huge page! Frame = 0x%llx\n",
               (unsigned long long)result.physical_frame);
        return result;
    }

    // Level 2: PD
    uint64_t pd_base = pdpt_entry & PTE_ADDR_MASK;
    uint64_t pd_entry_addr = pd_base + pd_idx * 8;
    uint64_t pd_entry = read_memory_64(pd_entry_addr);

    printf("  PD[%d] @ 0x%llx = 0x%llx\n", pd_idx,
           (unsigned long long)pd_entry_addr,
           (unsigned long long)pd_entry);

    if (!(pd_entry & PTE_PRESENT)) {
        result.result = WALK_PAGE_FAULT;
        return result;
    }

    // Check for a 2MB huge page
    if (pd_entry & PTE_PS) {
        result.physical_frame = (pd_entry & 0x000FFFFFFFE00000ULL) >> 21;
        result.page_size = 1ULL << 21; // 2 MB
        printf("  2MB huge page! Frame = 0x%llx\n",
               (unsigned long long)result.physical_frame);
        return result;
    }

    // Level 1: PT
    uint64_t pt_base = pd_entry & PTE_ADDR_MASK;
    uint64_t pt_entry_addr = pt_base + pt_idx * 8;
    uint64_t pt_entry = read_memory_64(pt_entry_addr);

    printf("  PT[%d] @ 0x%llx = 0x%llx\n", pt_idx,
           (unsigned long long)pt_entry_addr,
           (unsigned long long)pt_entry);

    if (!(pt_entry & PTE_PRESENT)) {
        result.result = WALK_PAGE_FAULT;
        return result;
    }

    // 4KB page
    result.physical_frame = (pt_entry & PTE_ADDR_MASK) >> 12;
    result.page_size = 4096;
    result.writable = !!(pt_entry & PTE_WRITABLE);
    result.user = !!(pt_entry & PTE_USER);

    printf("  4KB page. Frame = 0x%llx\n",
           (unsigned long long)result.physical_frame);
    return result;
}

/*
 * Hardware Optimization: Page Walk Cache (PWC)
 *
 * Modern CPUs cache intermediate page table entries.
 * If we recently translated a nearby address, chances are
 * the PML4/PDPT/PD entries are the same—only the PT level differs.
 *
 * Example:
 *   VA 0x7FFF12340000 and VA 0x7FFF12341000 share
 *   the same PML4, PDPT, and PD entries—only PT differs.
 *
 * With a PWC, the second walk only reads the PT level.
 * This reduces average walk cost significantly.
 */
```

Each page table walk requires 4 memory accesses (for 4-level paging). Even if those accesses hit L1 cache (~4 cycles each), a walk costs 16+ cycles. If they go to main memory (~100 ns each), a walk costs 400+ cycles. This is why TLB hit rate is critical—a 1% miss rate still means millions of expensive walks per second in a high-throughput workload.
The operating system controls the MMU through special hardware registers. These registers are privileged—only kernel-mode code can modify them. They configure fundamental aspects of MMU behavior and are essential for understanding how OS kernel code manages memory.
| Register | Purpose | Key Bits | Modified When |
|---|---|---|---|
| CR0 | System control modes | PG (paging enable), WP (write protect) | Boot time, rarely changed |
| CR2 | Page fault linear address | Faulting virtual address | Set by hardware on page fault |
| CR3 | Page table base + PCID | PML4 physical address, PCID | Every context switch |
| CR4 | Extended features | PAE, PSE, PCIDE, SMEP, SMAP | Boot time, feature enable |
| EFER (MSR) | Long mode control | LME (Long Mode Enable), NXE | Boot time for 64-bit mode |
CR3: The Page Table Base Register
CR3 is the most frequently modified MMU register. It holds the physical address of the top-level page table (PML4 in x86-64). Changing CR3 effectively switches the entire address space.
CR3 Contents: bits 51:12 hold the physical address of the PML4 table (which must be 4 KB-aligned); when CR4.PCIDE is enabled, bits 11:0 hold the current PCID, and otherwise they contain the PWT/PCD cache-control flags.
On a context switch from Process A to Process B:
```c
/*
 * MMU Control Register Operations
 *
 * These operations are performed in kernel mode only.
 * Attempting them from user mode causes a protection fault.
 */

#include <stdint.h>

/*
 * Read CR3 - get the current page table base
 */
static inline uint64_t read_cr3(void) {
    uint64_t val;
    __asm__ volatile("mov %%cr3, %0" : "=r"(val));
    return val;
}

/*
 * Write CR3 - switch address space
 *
 * This is the heart of context-switching memory.
 * After this instruction executes, all memory translations
 * are based on the new page table.
 *
 * WARNING: This instruction implicitly flushes TLB entries
 * (except global pages and PCID-tagged entries on modern CPUs)
 */
static inline void write_cr3(uint64_t val) {
    __asm__ volatile("mov %0, %%cr3" : : "r"(val) : "memory");
}

/*
 * Read CR2 - get the faulting address after a page fault
 */
static inline uint64_t read_cr2(void) {
    uint64_t val;
    __asm__ volatile("mov %%cr2, %0" : "=r"(val));
    return val;
}

/*
 * Context switch between processes (simplified)
 */
typedef struct {
    uint64_t cr3; // Page table base
    // ... other saved state (GPRs, etc.)
} ProcessContext;

void switch_address_space(ProcessContext* from, ProcessContext* to) {
    // Only switch CR3 if actually changing address spaces
    // (Threads of one process share an address space; no CR3 switch needed)
    if (from->cr3 != to->cr3) {
        from->cr3 = read_cr3(); // Save old
        write_cr3(to->cr3);     // Load new

        /*
         * At this point:
         * - TLB entries for the old address space are invalid
         *   (unless using PCID or they're global)
         * - All memory accesses use the new page table
         * - The next instruction fetch is translated through new tables!
         */
    }
}

/*
 * Enable/Disable paging (only at boot time)
 */
static inline uint64_t read_cr0(void) {
    uint64_t val;
    __asm__ volatile("mov %%cr0, %0" : "=r"(val));
    return val;
}

static inline void write_cr0(uint64_t val) {
    __asm__ volatile("mov %0, %%cr0" : : "r"(val) : "memory");
}

#define CR0_PG (1UL << 31) // Paging enable bit
#define CR0_WP (1UL << 16) // Write protect bit

void enable_paging(void) {
    // Set up page tables in CR3 first!
    // Then enable paging
    uint64_t cr0 = read_cr0();
    cr0 |= CR0_PG | CR0_WP;
    write_cr0(cr0);

    // Paging is now active!
    // All subsequent addresses are virtual and translated
}

/*
 * IMPORTANT: CR4 security features
 *
 * CR4.SMEP (Supervisor Mode Execution Prevention):
 * - If set, kernel mode cannot execute user-mode pages
 * - Defends against ret2usr attacks
 *
 * CR4.SMAP (Supervisor Mode Access Prevention):
 * - If set, kernel mode cannot read/write user-mode pages
 *   unless explicitly enabled (EFLAGS.AC = 1)
 * - Prevents accidental kernel access to user data
 * - Defends against many exploit primitives
 */
```

Without PCID, writing CR3 flushes the entire TLB (a "full shootdown"). With PCID, TLB entries are tagged with a 12-bit process-context ID. The MMU only uses entries matching the current PCID, so entries from other processes remain cached. On a switch back, those entries are still valid! This can reduce context switch overhead by 40-50% in TLB-sensitive workloads.
The TLB caches translations, but this cache must be kept consistent with the actual page tables. When the OS modifies page tables—changing mappings, permissions, or removing pages—it must ensure the TLB doesn't contain stale entries. This is TLB management, one of the most performance-critical aspects of OS memory management.
TLB Invalidation Instructions:
The x86-64 architecture provides several ways to invalidate TLB entries:
| Instruction | Effect | Use Case |
|---|---|---|
| `MOV` to CR3 | Flush all non-global entries | Context switch |
| `INVLPG addr` | Flush single page entry | Single page change |
| `INVPCID` | Flush by PCID, address, or both | Fine-grained control |
| `INVLPGA` (AMD) | Invalidate by ASID and address | Guest VM management |
```c
/*
 * TLB Invalidation Operations
 *
 * Critical for maintaining TLB coherency with page tables.
 * Incorrect invalidation leads to using stale translations—
 * data corruption, security vulnerabilities, crashes.
 */

#include <stdint.h>

/*
 * INVLPG - Invalidate a single page
 *
 * This is the most common invalidation operation.
 * Used when modifying a single page table entry.
 */
static inline void invlpg(void* addr) {
    __asm__ volatile("invlpg (%0)" : : "r"(addr) : "memory");
}

/*
 * Full TLB flush (via CR3 reload)
 *
 * Expensive but sometimes necessary.
 * On systems without PCID, this is what a context switch does.
 */
static inline void flush_tlb_all(void) {
    uint64_t cr3;
    __asm__ volatile("mov %%cr3, %0" : "=r"(cr3));
    __asm__ volatile("mov %0, %%cr3" : : "r"(cr3) : "memory");
}

/*
 * Unmap a page: update the PTE and invalidate the TLB
 */
void unmap_page(uint64_t* pte, void* virtual_addr) {
    // Step 1: Clear the page table entry
    *pte = 0; // Mark not present

    // Step 2: Memory barrier - ensure the PTE write is visible
    __asm__ volatile("mfence" ::: "memory");

    // Step 3: Invalidate the TLB entry
    invlpg(virtual_addr);

    /*
     * Order matters! If we invalidated the TLB before clearing the PTE:
     * - Another CPU might cache the old entry between our operations
     * - We must ensure the PTE is cleared before TLB invalidation
     */
}

/*
 * Change page permissions (e.g., make a writable page read-only)
 */
void make_page_readonly(uint64_t* pte, void* virtual_addr) {
    // Clear the writable bit
    *pte &= ~(1ULL << 1); // Clear R/W bit

    // Barrier
    __asm__ volatile("mfence" ::: "memory");

    // Invalidate
    invlpg(virtual_addr);
}

/*
 * TLB Shootdown: Multi-processor TLB coherency
 *
 * Problem: When CPU 0 modifies a page table entry, CPU 1's TLB
 * might still have the old translation cached.
 *
 * Solution: TLB shootdown via Inter-Processor Interrupt (IPI)
 *
 * 1. CPU 0 modifies the PTE
 * 2. CPU 0 invalidates its own TLB (INVLPG)
 * 3. CPU 0 sends an IPI to all other CPUs running the affected process
 * 4. Other CPUs receive the interrupt, execute INVLPG, acknowledge
 * 5. CPU 0 waits for acknowledgments before proceeding
 *
 * This is expensive! ~10,000 cycles for a full shootdown.
 */

typedef struct {
    void* address;          // Virtual address to invalidate
    uint16_t asid;          // Process/address space identifier
    volatile int ack_count; // How many CPUs have acknowledged
    int target_count;       // How many CPUs need to acknowledge
} TLBShootdownRequest;

// IPI handler on a remote CPU
void tlb_shootdown_ipi_handler(TLBShootdownRequest* req) {
    // Check if this address space is active on this CPU
    // If so, invalidate
    invlpg(req->address);

    // Acknowledge
    __sync_fetch_and_add(&req->ack_count, 1);
}

/*
 * Performance optimization: Lazy TLB
 *
 * If a CPU is running a kernel thread (no user address space),
 * we can skip sending an IPI for user-space TLB invalidations.
 * We mark the CPU as "lazy" and do the invalidation if/when
 * it switches back to user mode.
 */
```

In heavily multi-threaded applications with frequent memory mapping changes, TLB shootdowns can become a major bottleneck. Each shootdown requires interrupting all CPUs, saving their state, invalidating entries, and acknowledging. Workloads like JVMs (with garbage collection), databases, and hypervisors can suffer significantly. This is a key motivation for persistent memory mappings and huge pages.
When address translation cannot proceed normally, the MMU raises an exception (also called a fault or trap). The OS kernel handles these exceptions to implement demand paging, copy-on-write, memory protection, and more. Understanding MMU exceptions is essential for kernel development.
| Exception | Cause | CR2 Contains | Typical OS Response |
|---|---|---|---|
| Page Fault (not present) | PTE.Present = 0 | Faulting address | Load page from disk, create mapping |
| Page Fault (write to RO) | Write to PTE.R/W = 0 | Faulting address | COW copy, or signal SIGSEGV |
| Page Fault (user to kernel) | User access to PTE.U/S = 0 | Faulting address | Signal SIGSEGV (security violation) |
| Page Fault (execute NX) | Execute on PTE.NX = 1 | Faulting address | Signal SIGSEGV (security violation) |
| General Protection Fault | Various invalid operations | Varies | Signal SIGSEGV or kernel panic |
```c
/*
 * Page Fault Handler (Simplified)
 *
 * This is one of the most critical OS kernel routines.
 * It runs in kernel mode, triggered by an MMU exception.
 * (VMA, find_vma, and the page-loading helpers below stand in
 * for the kernel's real data structures and routines.)
 */

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

// Error code bits pushed by hardware on a page fault
#define PF_PRESENT (1 << 0) // 0 = page not present, 1 = protection violation
#define PF_WRITE   (1 << 1) // Fault caused by write (1) or read (0)
#define PF_USER    (1 << 2) // Fault occurred in user mode (1) or kernel (0)
#define PF_RSVD    (1 << 3) // Fault caused by reserved bit violation
#define PF_INSTR   (1 << 4) // Fault caused by instruction fetch

typedef enum {
    FAULT_HANDLED,      // Fault resolved, resume execution
    FAULT_SIGNAL_SEGV,  // Send SIGSEGV to the process
    FAULT_KERNEL_PANIC, // Unrecoverable kernel error
} FaultResolution;

FaultResolution handle_page_fault(uint64_t error_code) {
    // Read the faulting address from CR2
    uint64_t fault_addr = read_cr2();

    printf("Page fault: addr=0x%016llx, error=0x%llx\n",
           (unsigned long long)fault_addr,
           (unsigned long long)error_code);

    // Decode the error code
    bool is_present     = error_code & PF_PRESENT;
    bool is_write       = error_code & PF_WRITE;
    bool is_user        = error_code & PF_USER;
    bool is_reserved    = error_code & PF_RSVD;
    bool is_instruction = error_code & PF_INSTR;

    // Reserved bit violation: always an error (corrupted page table)
    if (is_reserved) {
        return FAULT_KERNEL_PANIC;
    }

    // Find the VMA (Virtual Memory Area) containing the fault address
    // A VMA describes a valid region of the address space
    VMA* vma = find_vma(current_process, fault_addr);
    if (vma == NULL) {
        // Address is not in any valid region
        printf("  No VMA for this address\n");
        return is_user ? FAULT_SIGNAL_SEGV : FAULT_KERNEL_PANIC;
    }

    // Check that the access type matches the VMA permissions
    if (is_write && !(vma->flags & VM_WRITE)) {
        printf("  Write to read-only VMA\n");
        return is_user ? FAULT_SIGNAL_SEGV : FAULT_KERNEL_PANIC;
    }
    if (is_instruction && !(vma->flags & VM_EXEC)) {
        printf("  Execute on non-executable VMA\n");
        return is_user ? FAULT_SIGNAL_SEGV : FAULT_KERNEL_PANIC;
    }

    // Fault in a valid VMA - check the specific cause
    if (!is_present) {
        // Page not present: demand paging
        printf("  Page not present - loading...\n");

        if (vma->type == VMA_ANONYMOUS) {
            // Anonymous memory: allocate a zero page
            allocate_anonymous_page(fault_addr);
        } else if (vma->type == VMA_FILE_MAPPED) {
            // File-mapped: read from the file
            load_page_from_file(vma->file, fault_addr);
        } else if (vma->type == VMA_SWAP) {
            // Swapped out: read from swap
            load_page_from_swap(fault_addr);
        }
        return FAULT_HANDLED;
    }

    if (is_present && is_write) {
        // Present but write fault: likely COW
        printf("  Write to present page - checking COW...\n");

        if (is_cow_page(fault_addr)) {
            // Copy-on-Write: make a private copy
            handle_cow(fault_addr);
            return FAULT_HANDLED;
        }
    }

    // Shouldn't reach here if the VMA matches
    printf("  Unhandled case\n");
    return is_user ? FAULT_SIGNAL_SEGV : FAULT_KERNEL_PANIC;
}

/*
 * The page fault handler is one of the most performance-sensitive
 * kernel routines. Optimizations include:
 *
 * - Fast paths for common cases (demand paging of anonymous memory)
 * - VMA lookup using red-black trees or radix trees for O(log n)
 * - Prefaulting: loading nearby pages when accessing one
 * - Avoiding unnecessary TLB flushes
 * - Lock-free paths where possible
 */
```

Not all page faults indicate errors. Many are "soft" faults—the page is valid but not loaded yet (demand paging) or needs copying (COW). Only "hard" faults (accessing truly invalid memory) result in SIGSEGV. A healthy system has many soft page faults; watching page fault counters without understanding this leads to false alarms.
Modern MMUs include advanced features that go beyond basic address translation. These features enable virtualization, enhance security, and improve performance in ways that weren't possible with earlier MMU designs.
Extended Page Tables (EPT) for Virtualization:
In a virtualized system, the guest OS maintains its own page tables, mapping guest virtual to guest physical addresses. But the hypervisor has a second layer—Extended Page Tables—that map guest physical to host physical addresses.
Memory Access Path:
1. Guest Virtual Address (what guest process sees)
2. Guest Page Tables → Guest Physical Address
3. Extended Page Tables → Host Physical Address (actual RAM)
Without hardware support, every guest page table walk would require multiple VM exits (hypervisor calls), crippling performance. EPT performs both translations in hardware—the MMU walks both tables automatically.
```c
/*
 * Memory Protection Keys (MPK) - Intel PKU Feature
 *
 * MPK allows fast permission changes without modifying page tables
 * or flushing the TLB. Perfect for sandboxing and managed runtimes.
 */

#include <stdint.h>
#include <sys/mman.h>

// Read PKRU (Protection Key Rights for User pages)
static inline uint32_t read_pkru(void) {
    uint32_t eax, ecx = 0, edx;
    __asm__ volatile("rdpkru" : "=a"(eax), "=d"(edx) : "c"(ecx));
    return eax;
}

// Write PKRU
static inline void write_pkru(uint32_t val) {
    uint32_t eax = val, ecx = 0, edx = 0;
    __asm__ volatile("wrpkru" : : "a"(eax), "d"(edx), "c"(ecx));
}

/*
 * PKRU format:
 * - 32 bits total, 2 bits per protection key (16 keys total)
 * - Bit 2*i:   Disable access for key i
 * - Bit 2*i+1: Disable write for key i
 *
 * Page table entries have a 4-bit protection key field.
 * On every memory access, the MMU checks:
 *   1. Normal page-table permissions (R/W/X)
 *   2. PKRU permissions for the page's protection key
 *
 * Both must allow the access, or a fault occurs.
 */

#define PKEY_DISABLE_ACCESS 0x1
#define PKEY_DISABLE_WRITE  0x2

// Allocate a protection key
int pkey_alloc(unsigned int flags, unsigned int access_rights) {
    // Syscall: allocate an unused protection key
    // Returns a key number (0-15) or an error
    return 0; // Simplified
}

// Associate a key with a memory range
void protect_region(void* addr, size_t len, int pkey) {
    // mprotect variant that also sets the protection key:
    // pkey_mprotect(addr, len, PROT_READ | PROT_WRITE, pkey);
}

/*
 * Example: Sandboxing untrusted code
 *
 * 1. Allocate a protection key for sensitive data
 *      int key = pkey_alloc(0, PKEY_DISABLE_ACCESS);
 *
 * 2. Associate the key with sensitive memory
 *      protect_region(secret_buffer, 4096, key);
 *
 * 3. Sensitive memory is now inaccessible
 *
 * 4. When trusted code needs access:
 *      uint32_t old_pkru = read_pkru();
 *      write_pkru(old_pkru & ~(0x3 << (2 * key))); // Enable key
 *      // ... access memory ...
 *      write_pkru(old_pkru); // Restore protection
 *
 * Benefit: wrpkru is ~20 cycles vs ~1000 cycles for mprotect()
 * No TLB flush, no syscall overhead
 */
```

After Spectre/Meltdown (2018), MMU features gained new importance. KPTI (Kernel Page Table Isolation) maintains separate page tables for kernel and user mode to prevent speculative execution attacks. This roughly doubles TLB misses on syscalls. Features like PCID became essential to mitigate this performance impact. Modern MMU design is now inseparable from security considerations.
We've explored the MMU—the hardware foundation of memory management. This specialized processor component enables everything from simple address translation to advanced virtualization and security features.
What's Next:
We've examined the MMU and its role in address translation. The final topic in this module is base and limit registers—the simpler, historical predecessor to paging that's still conceptually important and used in some contexts today.
You now understand the MMU as the hardware that makes virtual memory, memory protection, and modern multiprogramming possible. This knowledge is essential for kernel development, performance tuning, security analysis, and understanding system behavior at a deep level. The MMU is where software policy meets hardware enforcement—the critical boundary in system design.