Every virtual memory access requires translating through the page table—but the page table itself must be stored somewhere in memory. This creates a fascinating bootstrapping question: How do we access the page table if we need the page table to access memory?
The answer lies in hardwired CPU registers, carefully designed memory layouts, and a fundamental asymmetry: page tables are stored at physical addresses that the CPU can access directly, bypassing the translation mechanism entirely for this specific purpose.
Understanding page table location is essential for OS kernel development, hypervisor implementation, and deep performance optimization. It connects abstract virtual memory concepts to concrete hardware mechanisms.
By the end of this page, you will understand CR3 and equivalent registers, how page tables are allocated and managed in kernel memory, the relationship between page table physical addresses and virtual mappings, and strategies for efficient page table memory management.
Every architecture has a dedicated register that holds the physical address of the root page table. The MMU reads this register on each memory access (or TLB miss) to begin the page table walk.
Architecture-Specific Registers:
| Architecture | Register | Width | Contents | Additional Features |
|---|---|---|---|---|
| x86 (32-bit) | CR3 | 32 bits | Page Directory physical address | PWT, PCD flags for caching |
| x86-64 | CR3 | 64 bits | PML4 physical address (bits 12-51) | PCID (12 bits) if CR4.PCIDE=1 |
| ARM AArch64 | TTBR0_EL1 / TTBR1_EL1 | 64 bits | Translation table base | Separate tables for user/kernel |
| RISC-V | SATP | 64 bits (Sv39/48) | Sv39: mode + ASID + PPN | Mode selects page table format |
| MIPS | EntryHi/Context | 32/64 bits | Part of TLB management | Software-managed TLB |
x86-64 CR3 Register Format:

```
Without PCID (CR4.PCIDE = 0):
┌─────────────┬────────────────────────────────────────┬─────────┐
│  63 ... 52  │ 51                                  12 │ 11    0 │
│  Reserved   │      PML4 Table Physical Address       │  Flags  │
│   (MBZ)     │          (40-bit frame number)         │ PWT,PCD │
└─────────────┴────────────────────────────────────────┴─────────┘

With PCID (CR4.PCIDE = 1):
┌─────────────┬────────────────────────────────────────┬─────────┐
│  63 ... 52  │ 51                                  12 │ 11    0 │
│  Reserved   │      PML4 Table Physical Address       │  PCID   │
│   (MBZ)     │          (40-bit frame number)         │(12 bits)│
└─────────────┴────────────────────────────────────────┴─────────┘

Key Points:
• Bits 0-11 contain flags or the PCID (not part of the address)
• The PML4 must be 4KB-aligned, so bits 0-11 of its address are zero anyway
• Hardware uses bits 12-51 as the physical address
• Writing to CR3 triggers a TLB flush (unless PCID is in use and the
  noflush bit, bit 63, is set on the write)

Example: CR3 = 0x0000000001234000 (PCID disabled)

This means:
- The PML4 table is at physical address 0x1234000
- That's 19,087,360 in decimal, about 18MB into physical memory
- The PML4 table occupies physical addresses 0x1234000-0x1234FFF (4KB)
```

Critical Property: Physical Address
The value in CR3 is a physical address, not a virtual address. This is essential—if it were virtual, we'd need a page table to translate CR3 to find the page table, creating infinite regress.
The MMU accesses CR3's contents directly through the memory bus, bypassing its own translation logic for this one read. All subsequent page table entries also contain physical addresses for the same reason.
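To make the field layout concrete, here is a minimal sketch of extracting the hardware-relevant fields from a CR3 value. The mask and function names are our own; the bit positions follow the format above.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 12-51 hold the PML4 physical address; bits 0-11 hold the PCID
 * when CR4.PCIDE=1 (otherwise flags). Mask names are illustrative. */
#define CR3_ADDR_MASK  0x000FFFFFFFFFF000ULL  /* bits 12-51 */
#define CR3_PCID_MASK  0x0000000000000FFFULL  /* bits 0-11  */

static uint64_t cr3_pml4_phys(uint64_t cr3) { return cr3 & CR3_ADDR_MASK; }
static uint16_t cr3_pcid(uint64_t cr3)      { return (uint16_t)(cr3 & CR3_PCID_MASK); }
```

For example, a CR3 value of 0x1234005 decodes to a PML4 at physical address 0x1234000 with PCID 5.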
ARM AArch64 uses two separate base registers: TTBR0_EL1 for user space (lower addresses) and TTBR1_EL1 for kernel space (upper addresses). This allows completely independent user and kernel page tables without the upper-level entries needing to match. It's the hardware foundation for KPTI-like separation.
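The selection between the two base registers is determined purely by the top bits of the virtual address. A sketch of the rule, assuming 48-bit address spaces (TCR_EL1.T0SZ = T1SZ = 16; the type and function names are ours):

```c
#include <assert.h>
#include <stdint.h>

/* AArch64 picks the translation table by the upper address bits:
 * all-zero top bits select TTBR0 (user), all-one select TTBR1 (kernel);
 * anything in between is a non-canonical address and faults. */
typedef enum { TTBR0, TTBR1, FAULT } ttbr_sel_t;

static ttbr_sel_t select_ttbr(uint64_t va) {
    uint64_t top = va >> 48;           /* bits 63..48 */
    if (top == 0x0000) return TTBR0;   /* low half: user space */
    if (top == 0xFFFF) return TTBR1;   /* high half: kernel space */
    return FAULT;                      /* translation fault */
}
```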
Page tables are themselves allocated from physical memory. The kernel must manage this allocation carefully—page tables can't be swapped (that would make accessing swap impossible!), and their physical addresses must be known.
Key Requirements:
```c
/* Linux kernel page table allocation (simplified) */

/* Allocate a single page table (one page of PTEs) */
pte_t *alloc_pte_table(void)
{
    struct page *page;
    pte_t *pte;

    /* Allocate from the kernel's page allocator.
     * GFP_KERNEL: may sleep, normal allocation.
     * __GFP_ZERO: zero the page (security: don't leak stale PTEs).
     */
    page = alloc_page(GFP_KERNEL | __GFP_ZERO);
    if (!page)
        return NULL;

    /* Get a virtual address the kernel can use to manipulate the table */
    pte = (pte_t *)page_address(page);

    /* Track that this page is a page table */
    __SetPageTable(page);

    return pte;
}

/* Free a page table */
void free_pte_table(pte_t *pte)
{
    struct page *page = virt_to_page(pte);

    __free_page(page);   /* takes a struct page *, unlike free_page() */
}

/* Create a complete page table hierarchy for a new process */
pgd_t *create_page_table(void)
{
    pgd_t *pgd;

    /* Allocate the top-level table (PGD/PML4) */
    pgd = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
    if (!pgd)
        return NULL;

    /* Copy kernel mappings from init_mm's page table:
     * all processes share the same kernel mapping.
     */
    memcpy(pgd + KERNEL_PGD_BOUNDARY,
           init_mm.pgd + KERNEL_PGD_BOUNDARY,
           (PTRS_PER_PGD - KERNEL_PGD_BOUNDARY) * sizeof(pgd_t));

    /* The user portion starts empty and is populated on demand */
    return pgd;
}

/* Get the physical address to load into CR3 */
unsigned long pgd_to_cr3(pgd_t *pgd)
{
    /* Convert a kernel virtual address to a physical address */
    return __pa(pgd);   /* __pa = "physical address of" */
}

/* Switch to a different page table */
void switch_page_table(struct mm_struct *mm)
{
    unsigned long cr3_value = pgd_to_cr3(mm->pgd);

    /* On PCID systems, include the PCID and possibly the noflush bit */
    if (cpu_has_pcid) {
        cr3_value |= mm->context.pcid;
        if (can_skip_flush)
            cr3_value |= X86_CR3_PCID_NOFLUSH;
    }

    write_cr3(cr3_value);   /* atomic write to CR3 */
}
```

Memory Pool for Page Tables:
Linux uses the regular page allocator for page tables, but some systems maintain a dedicated pool:
Page tables collectively can consume significant memory. A process with 4GB of fully mapped memory might need roughly 2,048 PTE tables (8MB of PTEs), plus 4 PMD tables, one PUD, and one PGD: about 8MB of page-table memory in total.
In practice, due to multi-level sparsity, overhead is typically 0.1-1% of mapped size.
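Assuming a fully dense, contiguous 4-level mapping with 4KB pages and 512 entries per table, the overhead can be computed with a small sketch (the helper names are ours):

```c
#include <assert.h>
#include <stdint.h>

static uint64_t div_round_up(uint64_t n, uint64_t d) { return (n + d - 1) / d; }

/* Count the 4KB table pages needed to map `bytes` of contiguous memory */
static uint64_t table_pages(uint64_t bytes)
{
    uint64_t ptes = div_round_up(bytes, 4096); /* 4KB pages mapped        */
    uint64_t pt   = div_round_up(ptes, 512);   /* PTE tables (2MB each)   */
    uint64_t pmd  = div_round_up(pt, 512);     /* PMD tables (1GB each)   */
    uint64_t pud  = div_round_up(pmd, 512);    /* PUD tables (512GB each) */
    return pt + pmd + pud + 1;                 /* plus one PGD/PML4       */
}
```

For 4GB this yields 2,054 table pages, about 8MB, or roughly 0.2% of the mapped size, consistent with the range above.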
Page tables must remain in physical memory. If we swapped out a page table, we'd need the page table to find it in swap—circular dependency. This makes page table memory overhead a 'real' cost that can't be reclaimed under memory pressure, unlike user pages which can be evicted.
While page table entries contain physical addresses, the kernel often needs to manipulate page tables using virtual addresses. This creates an interesting requirement: the kernel must have a way to access any physical page through virtual memory.
Common Approaches:
1. Direct Mapping (Linux, FreeBSD): The kernel maintains a direct (identity-like) mapping of all physical memory:
Virtual = Physical + PAGE_OFFSET
For example, physical address 0x1234000 might be accessible at virtual address 0xFFFF888001234000. Any physical address can be accessed by adding a constant.
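A minimal self-contained model of these conversions; the PAGE_OFFSET value is the Linux x86-64 default, and the helpers mirror the kernel's __pa/__va macros (treat the names as illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET 0xFFFF888000000000ULL  /* base of the direct map */

/* Direct-map conversions: a constant offset in each direction */
static uint64_t virt_to_phys(uint64_t va) { return va - PAGE_OFFSET; } /* __pa */
static uint64_t phys_to_virt(uint64_t pa) { return pa + PAGE_OFFSET; } /* __va */
```

So physical address 0x1234000 is reachable at virtual address 0xFFFF888001234000, and the conversion round-trips exactly.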
Linux x86-64 Virtual Memory Layout (simplified):

```
Virtual Address        Purpose
─────────────────────────────────────────────────────────────────────
0xFFFFFFFFFFFFFFFF ┌────────────────┐
                   │ Unused/Guard   │
0xFFFFFFFF80000000 ├────────────────┤
                   │ Kernel Text    │  Kernel code, loaded here
                   │ (.text)        │
0xFFFFFFFF00000000 ├────────────────┤
                   │ Kernel Module  │  Loadable modules
                   │ Space          │
0xFFFFFFFE80000000 ├────────────────┤
                   │ ...            │
0xFFFF888000000000 ├────────────────┤
                   │ DIRECT MAP     │ ← All of physical memory mapped here!
                   │ (up to 64TB)   │
                   │                │   Virtual = Physical + 0xFFFF888000000000
0xFFFF880000000000 ├────────────────┤
                   │ ...vmalloc...  │
0x0000800000000000 ├────────────────┤ ← Non-canonical hole
0x00007FFFFFFFFFFF ├────────────────┤
                   │ User Space     │  Per-process user mappings
                   │ (~128 TB)      │
0x0000000000000000 └────────────────┘
```

Page Table Access via the Direct Map:

```c
pgd_t *pgd = current->mm->pgd;      /* virtual address (in direct map) */

/* To get the physical address for CR3: */
phys_addr_t pgd_phys = __pa(pgd);   /* = pgd - PAGE_OFFSET */

/* To access a physical address as virtual: */
void *virt = __va(phys);            /* = phys + PAGE_OFFSET */
```

2. Temporary Mappings (if no direct map): On systems without full direct mapping (e.g., 32-bit with >4GB RAM):
```c
void *kmap(struct page *page);   /* Create a temporary mapping */
void kunmap(struct page *page);  /* Remove the mapping */
```
3. Recursive Page Table Mapping: A clever trick: make one entry in the top-level table point to itself!
PML4[self_entry] = physical_address_of_PML4
This creates a virtual address range where the MMU's table walk equals the tables themselves.
Why Direct Mapping is Preferred:
With recursive mapping, accessing virtual address 0xFFFFFFFFFFFFF000 (with self-entry at index 511) causes the MMU to use PML4[511]→PML4[511]→PML4[511]→PML4[511]→PML4[0], giving you access to the first entry of your own PML4! This was common in educational OS implementations.
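The address arithmetic behind this trick can be verified with a short sketch: build a canonical virtual address from its four table indices and confirm that putting the self-entry index in every slot lands exactly on the address quoted above (the helper is ours, assuming 4-level x86-64 translation):

```c
#include <assert.h>
#include <stdint.h>

/* Assemble a canonical x86-64 virtual address from 4-level indices */
static uint64_t va_from_indices(uint64_t pml4, uint64_t pdpt,
                                uint64_t pd, uint64_t pt, uint64_t off)
{
    uint64_t va = (pml4 << 39) | (pdpt << 30) | (pd << 21) | (pt << 12) | off;
    if (va & (1ULL << 47))            /* sign-extend bit 47 into bits 48-63 */
        va |= 0xFFFF000000000000ULL;
    return va;
}
```

With the self-entry at index 511 in all four slots and offset 0, this produces 0xFFFFFFFFFFFFF000: the four-level walk resolves to the PML4 page itself.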
Most operating systems map the kernel into every process's virtual address space. This raises important questions about layout, sharing, and the isolation required after attacks like Meltdown.
The Traditional Model (pre-KPTI):
Each process has one page table. The upper portion (kernel space) is identical across all processes—same virtual addresses map to same physical frames. The lower portion (user space) is unique per process.
Traditional Virtual Address Space Layout:

```
Process A's Page Table:            Process B's Page Table:
┌────────────────────────┐         ┌────────────────────────┐
│ Kernel Space (shared)  │  ====   │ Kernel Space (shared)  │
│ PML4[256-511] →        │ same    │ PML4[256-511] →        │
│ (all point to same     │ PTs     │ (all point to same     │
│  physical kernel)      │         │  physical kernel)      │
├────────────────────────┤         ├────────────────────────┤
│ User Space (unique)    │         │ User Space (unique)    │
│ PML4[0-255] →          │ separate│ PML4[0-255] →          │
│ (Process A's mappings) │ PTs     │ (Process B's mappings) │
└────────────────────────┘         └────────────────────────┘

The PML4 itself is per-process, but entries 256-511 are copies that
point to the SAME physical kernel page tables.

Context Switch (A → B):
1. Save A's state
2. write_cr3(B's PML4 physical address)
3. TLB flushed (non-global entries)
4. Kernel entries may be marked Global (stay in TLB)
5. Resume B's execution
```

Kernel Page Table Isolation (KPTI):
After the Meltdown attack (2018), operating systems implemented KPTI. The idea: when running user code, the kernel pages should be completely unmapped, not just marked supervisor-only.
KPTI Model:
Two page tables per process:
- A kernel table that maps everything (kernel and user), used while executing in the kernel
- A user table that maps only user space plus the tiny trampoline needed to enter the kernel
Switching on syscall/interrupt: the entry trampoline switches CR3 from the user table to the kernel table, and the exit path switches back before returning to user mode.
Cost: Extra CR3 writes, potential TLB flushes
Mitigation: PCID allows keeping both tables' entries in TLB
| Aspect | Traditional | KPTI |
|---|---|---|
| Tables per process | 1 | 2 (user table + kernel table) |
| Kernel visible when in user mode | Yes (U/S=0) | No (not mapped) |
| CR3 changes on syscall | No | Yes |
| Meltdown vulnerable | Yes | No |
| Performance impact | Baseline | 1-5% overhead typical |
ARM's dual TTBR design (TTBR0 for user, TTBR1 for kernel) naturally supports KPTI-like separation without duplicating full page tables. The kernel simply doesn't map its pages in TTBR0. This is one reason ARM systems were less affected by Meltdown-class vulnerabilities.
When the operating system switches from one process to another, it changes CR3 to point to the new process's page table. This seemingly simple operation has profound implications for performance and correctness.
What Happens on CR3 Write:
```c
/* Linux context switch (simplified from arch/x86/mm/tlb.c) */

void switch_mm(struct mm_struct *prev, struct mm_struct *next,
               struct task_struct *tsk)
{
    unsigned long cr3;

    if (prev == next)
        return;   /* Same address space, no switch needed */

    /* Prepare the CR3 value */
    cr3 = __sme_pa(next->pgd);   /* Physical address of the new PGD */

    if (static_cpu_has(X86_FEATURE_PCID)) {
        /* With PCID: tag entries instead of flushing */
        u16 pcid = next->context.ctx_id;

        if (next->context.is_fresh) {
            /* First time using this PCID - must flush old entries */
            cr3 |= pcid;
        } else {
            /* Reusing the PCID - cached entries may still be valid */
            cr3 |= pcid | X86_CR3_PCID_NOFLUSH;
        }
    }

    /* The actual switch */
    write_cr3(cr3);

    /* Track that we're now in a different address space */
    this_cpu_write(cpu_tlbstate.loaded_mm, next);
    this_cpu_write(cpu_tlbstate.loaded_mm_asid, next->context.ctx_id);
}

/*
 * Cost analysis of a CR3 write:
 *
 * WITHOUT PCID:
 *  - ~150-300 cycles for the CR3 write itself
 *  - TLB completely flushed (except global entries)
 *  - First memory accesses after the switch cause TLB misses
 *  - Each miss: 4 memory accesses (4-level page walk)
 *  - Effective cost can be thousands of cycles
 *
 * WITH PCID (and NOFLUSH):
 *  - ~50-100 cycles for the CR3 write
 *  - TLB entries from the previous use of this PCID may still be valid!
 *  - Dramatically reduced miss rate after the switch
 *  - Only helps while cycling through a limited set of processes
 */
```

PCID Strategy:
With only 4096 PCIDs available, the OS must manage them carefully:
Simple rotation: Assign PCIDs 0, 1, 2, ... to new processes. When exhausted, reuse from 0 with a global flush.
Per-CPU pools: Each CPU tracks which PCIDs are in use. Context switches on same CPU can reuse.
Priority-based: Hot processes (high-priority, frequently scheduled) get dedicated PCIDs.
Lazy invalidation: Don't flush immediately on free; just mark stale. Flush when PCID is reassigned.
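A minimal sketch of the simple-rotation strategy above (all names are ours; a real kernel tracks per-CPU state, handles concurrency, and reserves PCID 0):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCID_COUNT 4096          /* PCIDs are 12 bits wide */

static uint16_t next_pcid = 1;   /* PCID 0 often reserved */

/* Hand out PCIDs in order; on wraparound, signal that the caller must
 * flush TLB entries still tagged with the PCID being reused. */
static uint16_t alloc_pcid(bool *need_flush)
{
    *need_flush = false;
    if (next_pcid == PCID_COUNT) {   /* exhausted: wrap and flush */
        next_pcid = 1;
        *need_flush = true;
    }
    return next_pcid++;
}
```

The first allocation returns PCID 1 with no flush required; only after all 4,095 usable PCIDs have been handed out does reuse force a flush.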
Kernel/User Transitions with KPTI:
With KPTI, CR3 switches happen not just on context switch but on every syscall:
User calls read():
1. SYSCALL instruction
2. Trampoline code in the user page table
3. Switch CR3 to the kernel page table
4. Execute sys_read() in the kernel
5. Switch CR3 back to the user page table
6. SYSRET to user
This doubles CR3 switches, making PCID even more critical for performance.
Kernel threads often don't need user-space mappings. Linux uses 'lazy TLB' mode: when a CPU is running kernel-only work, it doesn't switch CR3 to the new process's page table. It uses the previous process's tables but only accesses kernel addresses. This saves TLB flush overhead.
Page tables are allocated from kernel memory, but their layout matters for performance. Cache effects, NUMA placement, and memory fragmentation all impact translation speed.
Cache Considerations:
A 4-level page table walk with a cold cache requires four dependent memory accesses, one per level; at roughly 100ns per DRAM access, a fully uncached walk can cost on the order of 400ns.
To improve cache behavior, systems rely on several effects: page table entries are cached like ordinary data, so hot tables stay in the data caches; the MMU's paging-structure caches let walks skip upper levels; huge pages remove a level from the walk entirely; and allocating page tables on the process's NUMA node keeps walk latency local.
Page Table Cache Behavior Analysis:

```
4KB Page Table Structure:
┌──────────────────────────────────────────────────┐
│ PTE[0]   PTE[1]   PTE[2]   ...  PTE[7]           │ ← Cache line 0 (64 bytes, 8 PTEs)
├──────────────────────────────────────────────────┤
│ PTE[8]   PTE[9]   PTE[10]  ...  PTE[15]          │ ← Cache line 1
├──────────────────────────────────────────────────┤
│ ...                                              │
├──────────────────────────────────────────────────┤
│ PTE[504] PTE[505] PTE[506] ...  PTE[511]         │ ← Cache line 63
└──────────────────────────────────────────────────┘

Each cache line holds 8 PTEs (8 bytes × 8 = 64 bytes).
Each PTE covers one 4KB page.
Each cache line covers 8 × 4KB = 32KB of virtual addresses.

Implications:
• Accesses within a 32KB range likely hit the same page-table cache line
• Sequential access benefits from prefetching/spatial locality
• Random access across large ranges thrashes page-table cache entries

NUMA Considerations:
• Page tables for a process should be on the same NUMA node as the process
• Poor placement: PT on node 0, data on node 1 → every walk crosses the interconnect
• Linux: MPOL_BIND or numactl can control placement
```

Huge Page Impact on Page Tables:
Using 2MB or 1GB pages dramatically reduces page table depth and entry count:
| Page Size | Offset Bits | PT Levels | Entries per GB |
|---|---|---|---|
| 4KB | 12 | 4 | 262,144 |
| 2MB | 21 | 3 | 512 |
| 1GB | 30 | 2 | 1 |
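The entries-per-GB column follows directly from the page size, as a one-line sketch shows:

```c
#include <assert.h>
#include <stdint.h>

/* Leaf entries needed to map 1GB at a given page size */
static uint64_t entries_per_gb(uint64_t page_size)
{
    return (1ULL << 30) / page_size;
}
```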
With 2MB pages, the walk stops at the PMD level: one entry maps 2MB directly, the PTE level disappears, a fully mapped 1GB region needs 512 PMD entries instead of 262,144 PTEs, and each TLB entry covers 512 times more address space.
Page Table Fragmentation:
Over time, physical memory for page tables can become fragmented: tables are allocated and freed as processes map and unmap memory, leaving page-table pages scattered across physical memory. Unlike user pages, they cannot easily be migrated or compacted, because their physical addresses are embedded in parent-level entries and in CR3.
Under extreme memory pressure, page table memory becomes a real constraint. A fork-heavy workload might create thousands of processes, each needing ~20KB minimum for page tables. With 10,000 processes, that's 200MB just for page tables. The kernel must track and potentially limit page table memory consumption.
When a computer boots, paging is initially disabled (CR0.PG=0). The CPU runs in real mode or protected mode without paging, accessing physical addresses directly. The bootloader and early kernel must set up initial page tables before enabling paging.
Boot Sequence (x86-64 Linux, simplified):
Linux x86-64 Early Page Tables (arch/x86/kernel/head_64.S):

```asm
/*
 * Build initial page tables to map:
 *  - Identity map for the first few megabytes (for the transition)
 *  - Kernel at its linked virtual address (__START_KERNEL_map)
 *
 * Tables are built at compile time at known physical locations:
 * level4_pgt, level3_kernel_pgt, level2_kernel_pgt, etc.
 */
SYM_DATA_START_PAGE_ALIGNED(level4_pgt)
    .quad  level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
    .fill  510, 8, 0
    .quad  level3_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
SYM_DATA_END(level4_pgt)

/*
 * Enabling paging (simplified):
 *
 *  movq  $level4_pgt - __START_KERNEL_map, %rax  # Physical addr of PML4
 *  movq  %rax, %cr3                              # Load into CR3
 *
 *  movq  %cr0, %rax
 *  orq   $CR0_PG, %rax                           # Set the paging bit
 *  movq  %rax, %cr0                              # Enable paging!
 *
 *  # IMMEDIATELY after this, all addresses are virtual!
 *  # We're still executing from identity-mapped low addresses.
 *
 *  # Jump to the high (kernel) virtual address:
 *  movabs $1f, %rax
 *  jmp    *%rax
 * 1:
 *  # Now running at the proper kernel address
 */
```

The Transition Moment:
The instant CR0.PG is set, all instruction fetches go through paging. But the CPU is still executing code at a low physical address. The solution: the early page tables identity-map the low region the CPU is currently executing from, and also map the kernel at its high virtual address; the code then performs an absolute jump to the high address, after which the identity mapping can be discarded.
This is one of the most delicate moments in OS bootstrap—any mistake means triple fault and reset.
UEFI firmware can set up initial paging before transferring to the OS (UEFI runs in long mode with paging enabled). The OS must understand and potentially modify these tables rather than building from scratch. This makes UEFI boot simpler in some ways but requires handling existing page tables.
Page table bugs are among the most difficult to diagnose—symptoms may be subtle or catastrophic, and the debugging tools themselves need working page tables.
Common Page Table Bugs:
- Stale TLB entries: a PTE is modified but the TLB is not flushed, so old translations linger
- Storing a virtual address where a physical address belongs (or vice versa)
- Use-after-free of a page table page that a parent entry still references
- Incorrect permission bits, such as leaving a mapping both writable and executable (W+X)
```shell
# Debugging page tables in Linux

# 1. Read current CR3 and page tables (requires a debugger or crash dump)
$ crash vmlinux vmcore
crash> p $cr3
CR3: 0x0000000123456789

crash> vtop 0x7fffc3400000          # Virtual-to-physical translation
VIRTUAL          PHYSICAL
0x7fffc3400000 -> 0x456789000

# 2. Walk page tables manually in crash/gdb
crash> rd -64 0xFFFF888123456000 8  # Read PML4 entries

# 3. Check /proc for process mappings
$ cat /proc/$(pidof myprogram)/maps
00400000-00452000 r-xp 00000000 08:01 1234567   /usr/bin/myprogram
7fffc3200000-7fffc3221000 rw-p 00000000 00:00 0 [stack]

# 4. Check /proc for page table info
$ cat /proc/$(pidof myprogram)/smaps | grep -A 5 "^7fffc"
7fffc3200000-7fffc3221000 rw-p 00000000 00:00 0 [stack]
Size:                132 kB
Rss:                  16 kB
Pss:                  16 kB

# 5. Use page walk debugging (kernel config)
# CONFIG_PAGE_TABLE_CHECK=y adds validation of page table operations

# 6. Hardware performance counters for TLB statistics
$ perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses ./myprogram

 Performance counter stats:
     1,234,567,890   dTLB-loads
        12,345,678   dTLB-load-misses   # High miss rate = lots of page walks
```

Hardware Debug Support:
Software Hardening:
```
// Linux: Page table integrity checking
CONFIG_DEBUG_WX=y          // Warn on W+X mappings
CONFIG_PAGE_TABLE_CHECK=y  // Validate page table operations
CONFIG_DEBUG_PAGEALLOC=y   // Detect use-after-free of pages
```
If a page fault handler itself faults, that's a double fault. If the double fault handler faults, that's a triple fault—the CPU resets. Page table bugs during early boot or in fault handlers often manifest as mysterious system resets with no error message.
Page tables must be stored somewhere, and understanding that 'somewhere'—physically in memory, logically organized, and found via hardware registers—is essential for OS implementation. To consolidate: a hardwired register (CR3, TTBRn_EL1, SATP) holds the physical address of the root table; the kernel allocates tables from unswappable physical memory and manipulates them through the direct map; each process gets its own table (or two, under KPTI); and the very first tables are built by hand at boot, before paging is enabled.
Module Complete:
With this page, we've completed our deep dive into Page Tables—from overall structure through PTEs, valid bits, protection bits, and finally location. You now understand the data structures that enable paging-based virtual memory, forming the foundation for address translation, protection, and efficient memory management.
The next module covers Address Translation—the actual process of converting virtual addresses to physical addresses using these page table structures.
You now understand where page tables live in memory, how the hardware finds them, how the kernel manages them, and the bootstrap process that creates the first page tables. This knowledge is essential for kernel development, hypervisor implementation, and understanding memory-related security mechanisms.