In the previous pages, we explored two-level and three-level paging as specific solutions to the page table size problem. But stepping back, we can see these as instances of a more general pattern: hierarchical page tables.
A hierarchical page table is fundamentally a tree data structure: the root is the top-level directory that the hardware register points to, interior nodes are directories whose entries point to lower-level tables, and leaves are page tables whose entries hold physical frame numbers.
This tree structure is the key to achieving O(memory used) overhead instead of O(address space size) overhead. Whether a system uses two levels (32-bit x86), four levels (x86-64), or any other depth, the same principles apply.
This page abstracts multi-level paging into general principles. You'll understand hierarchical page tables as a tree structure, learn how sparse allocation works across any number of levels, explore recursive traversal algorithms, and see how these concepts unify all multi-level paging implementations.
A hierarchical page table is a specific type of tree called a radix tree (or trie). Each level of the tree corresponds to a portion of the virtual address, and navigation through the tree is determined by successive fields of the address.
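As a concrete illustration of how successive address fields steer the walk, here is a minimal, self-contained sketch that splits a virtual address into per-level indices. It assumes the x86-64 style layout of four 9-bit index fields over a 12-bit page offset; the address value itself is made up.

```c
#include <stdio.h>
#include <stdint.h>

// Split a 48-bit virtual address into four 9-bit radix-tree indices
// plus a 12-bit page offset (x86-64 style layout, assumed for illustration).
int main(void) {
    uint64_t vaddr = 0x00007f12345678ABULL;  // hypothetical user address

    for (int level = 0; level < 4; level++) {
        int shift = 12 + 9 * (3 - level);           // level 0 = root
        uint64_t index = (vaddr >> shift) & 0x1FF;  // 9-bit field
        printf("Level %d index: %3llu\n", level, (unsigned long long)index);
    }
    printf("Page offset:   %llu\n", (unsigned long long)(vaddr & 0xFFF));
    return 0;
}
```

Each printed index selects one entry at its level; the page offset is carried through unchanged and appended to the final frame address.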
Tree structure properties:
```
               ┌─────────────┐
               │    Root     │ ← CR3/TTBR points here
               │  (Level 0)  │   Always allocated for each process
               └──────┬──────┘
          ┌───────────┼───────────┐
          ▼           ▼           ▼
     ┌────────┐  ┌────────┐  ┌────────┐
     │Level 1 │  │Level 1 │  │Level 1 │ ← Allocated on demand
     │  Dir   │  │  Dir   │  │  Dir   │   (when region used)
     └────┬───┘  └────┬───┘  └────┬───┘
          │           │           │
     ┌────┼────┐     ...         ...
     ▼    ▼    ▼
   ┌────┐┌────┐┌────┐
   │ PT ││ PT ││ PT │ ← Leaf level
   │    ││    ││    │   Contains frame numbers
   └────┘└────┘└────┘
     │     │     │
     ▼     ▼     ▼
  [Frame][Frame][Frame]  ← Physical memory
```
```c
#include <stdio.h>
#include <stdint.h>

// Hierarchical Page Table as a Radix Tree

// Generic multi-level page table configuration
typedef struct {
    int levels;            // Number of levels (2, 3, 4, 5...)
    int bits_per_level;    // Bits used for indexing at each level
    int offset_bits;       // Bits for page offset
    int entry_size;        // Bytes per entry (4 or 8)
    int entries_per_table; // 2^bits_per_level
} PageTableConfig;

// Example configurations
const PageTableConfig CONFIG_X86 = {
    .levels = 2, .bits_per_level = 10, .offset_bits = 12,
    .entry_size = 4, .entries_per_table = 1024
};

const PageTableConfig CONFIG_X86_64 = {
    .levels = 4, .bits_per_level = 9, .offset_bits = 12,
    .entry_size = 8, .entries_per_table = 512
};

// Calculate tree properties
void analyze_tree(const PageTableConfig* config) {
    int address_bits = config->levels * config->bits_per_level + config->offset_bits;
    uint64_t max_pages = 1ULL << (address_bits - config->offset_bits);
    int table_size = config->entries_per_table * config->entry_size;

    printf("Page Table Tree Analysis:\n");
    printf("  Address bits:     %d\n", address_bits);
    printf("  Tree depth:       %d levels\n", config->levels);
    printf("  Branching factor: %d\n", config->entries_per_table);
    printf("  Max leaf entries: %llu (max pages)\n", (unsigned long long)max_pages);
    printf("  Table size:       %d bytes\n", table_size);

    // Best case: minimal allocation (1 page mapped)
    // Need: root + one table per level down to the leaf
    int min_tables = config->levels;
    printf("  Min tables (1 page): %d (%d bytes)\n",
           min_tables, min_tables * table_size);

    // Worst case: full address space mapped
    // Level i has entries_per_table^i tables
    printf("  Max tables (full):   astronomical\n");
}
```

Unlike binary search trees, radix trees have O(K) lookup time, where K is the key length (the number of levels), not O(log N), where N is the number of entries. For page tables, this means translation time is constant regardless of how many pages are mapped; it depends only on tree depth.
The fundamental advantage of hierarchical page tables is sparse allocation: we only allocate table nodes for paths that lead to actual mappings. This principle underlies all memory savings.
Mathematical analysis:
Consider a general hierarchical page table with L levels, a branching factor of B entries per table, and P mapped pages.
Flat page table allocation: a flat table must reserve all B^L entries up front, regardless of how many are actually used.
Hierarchical allocation bounds: at minimum, L tables (a single path for one mapped page); at most, roughly ⌈P/B⌉ leaf tables plus their ancestors at each higher level. The overhead therefore grows with P, the number of mapped pages, not with B^L, the size of the address space.
```python
# Sparse Allocation Analysis

import math

def analyze_allocation(levels, branching, entry_size, mapped_pages, page_distribution):
    """
    Analyze page table memory usage.

    page_distribution: how pages are spread across the address space
      - 'clustered': pages are in contiguous regions
      - 'scattered': pages are randomly distributed
    """
    table_size = branching * entry_size

    # Flat table would need:
    total_possible_pages = branching ** levels
    flat_size = total_possible_pages * entry_size

    # Hierarchical allocation depends on distribution
    if page_distribution == 'clustered':
        # Pages in contiguous regions share upper-level tables
        tables_needed = estimate_clustered(levels, branching, mapped_pages)
    else:
        # Scattered pages need more separate paths
        tables_needed = estimate_scattered(levels, branching, mapped_pages)

    hierarchical_size = tables_needed * table_size
    savings_ratio = (flat_size - hierarchical_size) / flat_size * 100

    print("Allocation Analysis:")
    print(f"  Levels: {levels}, Branching: {branching}")
    print(f"  Mapped pages: {mapped_pages:,}")
    print(f"  Distribution: {page_distribution}")
    print()
    print(f"  Flat table size: {flat_size:,} bytes ({flat_size/1024/1024:.2f} MB)")
    print(f"  Hierarchical:    ~{tables_needed} tables = {hierarchical_size:,} bytes")
    print(f"  Savings:         {savings_ratio:.2f}%")

    return hierarchical_size


def estimate_clustered(levels, branching, pages):
    """Estimate tables needed for a clustered page distribution."""
    # Contiguous pages share upper-level tables: a table at level i
    # covers branching^(levels - i) pages.
    tables = 0
    for level in range(levels):
        coverage_per_table = branching ** (levels - level)
        tables_at_level = math.ceil(pages / coverage_per_table)
        tables += min(tables_at_level, branching ** level)  # Can't exceed max
    return max(tables, levels)  # At least one path


def estimate_scattered(levels, branching, pages):
    """Estimate tables needed for a scattered page distribution."""
    # Scattered pages share fewer tables; each level up needs roughly
    # 1/branching as many tables as the level below it.
    leaf_tables = math.ceil(pages / branching)
    total = leaf_tables
    tables_below = leaf_tables
    for _ in range(levels - 1):
        parent_tables = math.ceil(tables_below / branching)
        total += parent_tables
        tables_below = parent_tables
    return max(total, levels)


# Example: x86-64 with 4 levels, 512 entries, 8-byte entries
# Process with 10,000 mapped pages (40MB)
print("=" * 60)
analyze_allocation(4, 512, 8, 10000, 'clustered')
print()
analyze_allocation(4, 512, 8, 10000, 'scattered')
print()

# Same system, very sparse process (100 pages = 400KB)
print("=" * 60)
analyze_allocation(4, 512, 8, 100, 'clustered')
print()

# Comparison: flat 48-bit page table
# 2^36 pages × 8 bytes = 549,755,813,888 bytes = 512GB!
```

| Pages Mapped | Used Memory | Hierarchical Overhead | Flat Overhead | Savings |
|---|---|---|---|---|
| 100 | 400 KB | ~20 KB | 512 GB | 99.9999996% |
| 10,000 | 40 MB | ~400 KB | 512 GB | 99.9999% |
| 1,000,000 | 4 GB | ~40 MB | 512 GB | 99.99% |
| 100,000,000 | 400 GB | ~4 GB | 512 GB | 99.2% |
Hierarchical page tables transform the overhead from O(address space) to O(mapped memory). For typical processes using a tiny fraction of their address space, savings exceed 99.99%. This makes huge address spaces practical.
The tree structure of hierarchical page tables naturally suggests a recursive algorithm for traversal. This perspective unifies all multi-level paging implementations into a single conceptual framework.
Recursive translation:
At each level, the algorithm extracts that level's index bits from the virtual address, reads the corresponding entry, and checks the present bit. If the entry is a leaf, it combines the frame address with the remaining offset bits; otherwise it descends into the child table and repeats.
```c
// Recursive Page Table Traversal - General Algorithm

typedef uint64_t* PageTable;
typedef uint64_t  Entry;

// Configuration
#define MAX_LEVELS     5
#define ENTRY_PRESENT  (1ULL << 0)
#define ENTRY_LEAF     (1ULL << 7)   // Large page flag
#define ADDR_MASK      0x0000FFFFFFFFF000ULL

// Level-specific bit extraction
int bits_per_level[MAX_LEVELS];  // Configured at init
int level_shift[MAX_LEVELS];     // Precomputed shifts
int num_levels;                  // Configured at init

// Initialize shifts: level 0 is the root, level (n-1) is the leaf
void init_level_shifts(int levels, int bits_per_lvl, int offset_bits) {
    num_levels = levels;
    int pos = offset_bits;
    for (int i = levels - 1; i >= 0; i--) {
        level_shift[i]    = pos;
        bits_per_level[i] = bits_per_lvl;
        pos += bits_per_lvl;
    }
}

// Extract the index for a given level from a virtual address
uint64_t get_index(uint64_t vaddr, int level) {
    uint64_t mask = (1ULL << bits_per_level[level]) - 1;
    return (vaddr >> level_shift[level]) & mask;
}

// Recursive translation function
// Returns the physical address, or 0 on fault
uint64_t translate_recursive(
    PageTable table,    // Current level's table (mapped into kernel virtual memory)
    uint64_t  vaddr,    // Virtual address being translated
    int       level,    // Current level (0 = root)
    int       max_level // Deepest level (leaves)
) {
    // Get the index for this level
    uint64_t index = get_index(vaddr, level);
    Entry entry = table[index];

    // Check if the entry is present
    if (!(entry & ENTRY_PRESENT)) {
        // Page fault: this subtree doesn't exist
        page_fault(vaddr, level);
        return 0; // Not reached
    }

    // Check if this is a large page (leaf before max level)
    if ((entry & ENTRY_LEAF) && level < max_level) {
        // Large page: combine the frame address with the remaining address bits
        uint64_t frame_addr = entry & ADDR_MASK;
        uint64_t offset = vaddr & ((1ULL << level_shift[level]) - 1);
        return frame_addr | offset;
    }

    // Check if we've reached the leaf level
    if (level == max_level) {
        // Leaf entry: extract the frame number
        uint64_t frame_addr = entry & ADDR_MASK;
        uint64_t offset = vaddr & ((1ULL << level_shift[level]) - 1);
        return frame_addr | offset;
    }

    // Internal node: recursively descend
    uint64_t child_phys = entry & ADDR_MASK;
    PageTable child_table = phys_to_virt(child_phys);
    return translate_recursive(child_table, vaddr, level + 1, max_level);
}

// Top-level translation function
uint64_t translate(uint64_t vaddr) {
    // Get the root table from the hardware register
    PageTable root = phys_to_virt(read_cr3());

    // Start the recursion at level 0
    return translate_recursive(root, vaddr, 0, num_levels - 1);
}
```

The elegance of recursion:
The recursive formulation reveals that all multi-level paging schemes are structurally identical. They differ only in the number of levels, the index bits (and hence branching factor) at each level, and the size and format of the entries.
This uniformity explains why operating systems can support multiple page table formats with shared code. The Linux kernel, for example, uses a single pgd_t → p4d_t → pud_t → pmd_t → pte_t framework that works whether the architecture has 2, 3, 4, or 5 actual levels.
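As a rough sketch of how one walk loop can serve several depths (this is not the kernel's actual code; Linux folds a missing level by making its accessors pass through, and the names level_bits, read_entry, and walk here are invented for illustration), a level can be given zero index bits and simply skipped:

```c
#include <stdint.h>

#define MAX_WALK_LEVELS 5

typedef uint64_t pt_entry_t;

// Per-configuration index bits at each level, root first.
// A zero means that level is "folded out" on this configuration.
static const int level_bits[MAX_WALK_LEVELS] = { 9, 0, 9, 9, 9 }; // a 4-level setup

// Assumed helper: read one entry from a table page given its physical address.
extern pt_entry_t read_entry(uint64_t table_phys, uint64_t index);

uint64_t walk(uint64_t root_phys, uint64_t vaddr) {
    uint64_t table = root_phys;
    int shift = 12;                               // page offset bits
    for (int i = 0; i < MAX_WALK_LEVELS; i++)     // total index width above the offset
        shift += level_bits[i];

    for (int level = 0; level < MAX_WALK_LEVELS; level++) {
        if (level_bits[level] == 0)
            continue;                             // folded level: pass straight through
        shift -= level_bits[level];
        uint64_t index = (vaddr >> shift) & ((1ULL << level_bits[level]) - 1);
        pt_entry_t e = read_entry(table, index);
        if (!(e & 1))                             // present bit assumed at bit 0
            return 0;                             // fault
        table = e & ~0xFFFULL;                    // next table, or the final frame
    }
    return table | (vaddr & 0xFFF);
}
```

Changing the level_bits array is all it takes to retarget the same walk at a 3-, 4-, or 5-level tree, which is the essence of the shared-framework approach.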
Hardware optimization:
While the recursive model is conceptually clean, hardware MMUs typically implement iterative traversal with a fixed number of stages. Each stage reads one entry from the current table, checks its present and permission bits, and either produces the final physical address or supplies the base address of the next table.
Modern MMUs include page table walkers that perform this traversal in parallel with other operations, hiding some latency.
The recursive translation is tail-recursive: the recursive call is the last operation. This means it can be trivially converted to an iterative loop with no stack growth—exactly what hardware MMUs implement.
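To make that concrete, here is the same walk unrolled into a loop. It is a sketch that reuses the types, helpers, and flag bits from the recursive example above (PageTable, Entry, get_index, level_shift, phys_to_virt, page_fault) and adds nothing new:

```c
// Iterative equivalent of translate_recursive: same checks, no call stack.
uint64_t translate_iterative(PageTable root, uint64_t vaddr, int max_level) {
    PageTable table = root;

    for (int level = 0; ; level++) {
        uint64_t index = get_index(vaddr, level);
        Entry entry = table[index];

        if (!(entry & ENTRY_PRESENT)) {
            page_fault(vaddr, level);              // missing subtree
            return 0;
        }

        // Leaf entry: either a large page or the final level
        if ((entry & ENTRY_LEAF) || level == max_level) {
            uint64_t frame_addr = entry & ADDR_MASK;
            uint64_t offset = vaddr & ((1ULL << level_shift[level]) - 1);
            return frame_addr | offset;
        }

        // Internal node: step down one level and keep walking
        table = phys_to_virt(entry & ADDR_MASK);
    }
}
```

The loop body is exactly one "stage" of a hardware page table walker: fetch an entry, check it, and either finish or move to the next table.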
Efficient management of hierarchical page tables requires careful allocation and deallocation strategies. The goal is to minimize overhead while ensuring fast allocation when processes grow.
Allocation triggers:
New page tables are allocated when:
mmap(), brk(), stack growth, or fault handling requires new mappings
```c
// Page Table Allocation Strategies

// Strategy 1: Lazy Allocation (most common)
// Allocate tables only when the mapping is first established

int map_page_lazy(PageTable root, uint64_t vaddr, uint64_t paddr,
                  uint64_t flags, int num_levels) {
    PageTable current = root;

    for (int level = 0; level < num_levels - 1; level++) {
        uint64_t index = get_index(vaddr, level);
        Entry* entry = &current[index];

        if (!(*entry & ENTRY_PRESENT)) {
            // Allocate a new child table on demand
            uint64_t new_table = allocate_page_table();
            if (new_table == 0) {
                return -ENOMEM; // Out of memory
            }

            // Clear the new table
            memset(phys_to_virt(new_table), 0, PAGE_SIZE);

            // Install the pointer to the new table
            *entry = new_table | ENTRY_PRESENT | ENTRY_USER | ENTRY_WRITE;
        }

        // Descend to the next level
        current = phys_to_virt(*entry & ADDR_MASK);
    }

    // Install the final mapping
    uint64_t leaf_index = get_index(vaddr, num_levels - 1);
    current[leaf_index] = paddr | flags | ENTRY_PRESENT;
    return 0;
}

// Strategy 2: Pre-allocation for kernel space
// Allocate all tables for a region upfront

void preallocate_region(PageTable root, uint64_t start, uint64_t end,
                        int num_levels) {
    // For kernel mappings that won't change, pre-allocate all tables;
    // avoids allocation during critical paths
    for (uint64_t addr = start; addr < end; addr += coverage_per_l1_entry) {
        // Walk down and allocate at each level
        ensure_path_exists(root, addr, num_levels);
    }
}

// Strategy 3: Pooled allocation
// Maintain a cache of pre-allocated page table pages

#define PT_POOL_SIZE 64

typedef struct {
    uint64_t pages[PT_POOL_SIZE];
    int count;
    spinlock_t lock;
} PageTablePool;

static PageTablePool pt_pool;

void pt_pool_refill(void) {
    // Called periodically or when the pool is low;
    // allocate pages in batches for efficiency
    while (pt_pool.count < PT_POOL_SIZE / 2) {
        uint64_t page = allocate_physical_page();
        if (page == 0) break;
        pt_pool.pages[pt_pool.count++] = page;
    }
}

uint64_t pt_pool_allocate(void) {
    spin_lock(&pt_pool.lock);

    if (pt_pool.count == 0) {
        spin_unlock(&pt_pool.lock);
        // Emergency: allocate directly
        return allocate_physical_page();
    }

    uint64_t page = pt_pool.pages[--pt_pool.count];
    spin_unlock(&pt_pool.lock);

    // Trigger an async refill if the pool is getting low
    if (pt_pool.count < PT_POOL_SIZE / 4) {
        schedule_refill();
    }

    return page;
}
```

Production systems often use hybrid strategies: lazy allocation for user space (where usage varies), pre-allocation for kernel space (which is fixed and critical), and pooled allocation to accelerate the common case while handling corner cases gracefully.
Deallocating page tables correctly is more complex than allocation due to the tree structure and the need to avoid memory leaks. A single intermediate table can only be freed when all its entries are empty.
Deallocation triggers:
- A region is unmapped (munmap())
- Memory is reclaimed (MADV_FREE, heap shrinking)
- A process exits and its entire page table tree is torn down
```c
// Page Table Deallocation Strategies

// Structure to track table usage
typedef struct {
    uint64_t phys_addr;
    uint16_t valid_entries; // Number of present entries
    uint16_t level;
} PageTableInfo;

// Clear a leaf entry, then free any tables left empty along the path
void unmap_page(PageTable root, uint64_t vaddr, int num_levels) {
    PageTable tables[MAX_LEVELS];
    uint64_t  indices[MAX_LEVELS];

    // Walk down, recording the path
    PageTable current = root;
    for (int level = 0; level < num_levels; level++) {
        tables[level]  = current;
        indices[level] = get_index(vaddr, level);

        if (level < num_levels - 1) {
            Entry entry = current[indices[level]];
            if (!(entry & ENTRY_PRESENT)) {
                return; // Nothing mapped here
            }
            current = phys_to_virt(entry & ADDR_MASK);
        }
    }

    // Clear the leaf entry
    tables[num_levels - 1][indices[num_levels - 1]] = 0;

    // Walk back up, freeing empty tables
    for (int level = num_levels - 1; level > 0; level--) {
        if (table_is_empty(tables[level])) {
            // Free this table
            free_physical_page(virt_to_phys(tables[level]));

            // Clear the parent's pointer to us
            tables[level - 1][indices[level - 1]] = 0;
        } else {
            // Table still has entries, stop walking up
            break;
        }
    }

    // Invalidate the TLB entry
    invlpg(vaddr);
}

// Full page table tree deletion (process exit)
void destroy_page_tables(PageTable root, int level, int max_level) {
    if (root == NULL) return;

    for (int i = 0; i < entries_per_level; i++) {
        Entry entry = root[i];

        if (!(entry & ENTRY_PRESENT)) {
            continue; // Skip empty entries
        }

        if (level < max_level && !(entry & ENTRY_LEAF)) {
            // Recurse into the child table first
            PageTable child = phys_to_virt(entry & ADDR_MASK);
            destroy_page_tables(child, level + 1, max_level);
        }

        // If this is a leaf (actual page mapping),
        // optionally free the frame too, depending on policy
        if (level == max_level || (entry & ENTRY_LEAF)) {
            // Frame cleanup depends on whether it's shared, file-backed, etc.
            handle_frame_cleanup(entry);
        }
    }

    // Free this table itself
    free_physical_page(virt_to_phys(root));
}

// Process exit cleanup
void process_exit_cleanup(Process* proc) {
    PageTable root = phys_to_virt(proc->cr3);

    // Recursive deletion of the entire tree
    destroy_page_tables(root, 0, proc->pt_levels - 1);

    // Note: don't free the kernel portion of the address space,
    // as it's shared across all processes
}
```

Freeing a page table while another CPU holds a cached TLB entry pointing to it is a critical bug. The other CPU might walk through freed or reused memory, causing corruption or security exploits. Multi-core systems require careful TLB shootdown protocols before freeing tables.
The hierarchical structure enables powerful sharing optimizations. Multiple processes can share portions of the page table tree, reducing memory overhead and simplifying updates.
Common sharing scenarios include the kernel portion of the address space (shared by every process), parent and child address spaces after fork() (shared copy-on-write), and shared libraries mapped by many processes.
```c
// Page Table Sharing Strategies

// 1. Kernel space sharing at process creation
void setup_kernel_mappings(PageTable new_root) {
    // Copy pointers to the kernel page tables.
    // In x86-64 with 48-bit addresses, entries 256-511 are kernel space
    PageTable kernel_root = get_kernel_page_table();

    for (int i = 256; i < 512; i++) {
        // Share a reference to the kernel's tables (not a copy)
        new_root[i] = kernel_root[i];
    }

    // Now kernel tables are shared; updates to kernel mappings
    // are automatically visible to this process
}

// 2. Fork sharing (copy-on-write)
PageTable fork_page_tables(PageTable parent_root) {
    // Allocate a new root for the child
    PageTable child_root = allocate_page_table();

    // USER SPACE: initially share all subtables, mark entries COW
    for (int i = 0; i < 256; i++) {
        if (parent_root[i] & ENTRY_PRESENT) {
            // Share the subtree reference
            child_root[i] = parent_root[i];

            // Mark both parent and child as read-only for COW
            parent_root[i] &= ~ENTRY_WRITE;
            child_root[i]  &= ~ENTRY_WRITE;

            // Increment the reference count on the shared table
            incref_page_table(parent_root[i] & ADDR_MASK);
        }
    }

    // KERNEL SPACE: share as usual
    for (int i = 256; i < 512; i++) {
        child_root[i] = parent_root[i];
        // Kernel tables are globally shared, no COW needed
    }

    return child_root;
}

// 3. COW fault handling
void handle_cow_fault(PageTable root, uint64_t vaddr) {
    // Walk to find the highest shared table that needs splitting
    PageTable path[MAX_LEVELS];
    uint64_t  indices[MAX_LEVELS];
    // ... walk and identify shared tables ...

    for (int level = fault_level; level < num_levels; level++) {
        if (get_refcount(path[level]) > 1) {
            // This table is shared, need to copy it
            PageTable new_table = allocate_page_table();
            memcpy(new_table, path[level], PAGE_SIZE);

            // Update the parent to point to the new copy
            path[level - 1][indices[level - 1]] =
                virt_to_phys(new_table) |
                get_flags(path[level - 1][indices[level - 1]]);

            // Decrement the refcount on the old shared table
            decref_page_table(path[level]);
            path[level] = new_table;
        }
    }

    // Now the leaf entry is private; make it writable
    path[num_levels - 1][indices[num_levels - 1]] |= ENTRY_WRITE;

    // If the frame itself is shared, it may also need to be copied
    // ... handle frame COW ...
}
```

Benefits of sharing:
| Mechanism | Memory Saved | Complexity |
|---|---|---|
| Kernel sharing | Significant (kernel tables shared by all) | Low (static sharing) |
| Fork COW | Huge for fork-heavy workloads | Moderate (refcounts needed) |
| Shared library sharing | Moderate (common mappings shared) | Moderate (alignment requirements) |
Reference counting granularity:
To support sharing, each page table needs a reference count. When a table is shared between processes, the count is incremented. When a process unmaps or exits, it's decremented. The table is freed only when the count reaches zero.
This can be stored in the struct page for the table's physical frame, leveraging existing memory management infrastructure.
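As a minimal sketch of that bookkeeping (the pt_ref array, pt_get, and pt_put names are invented for illustration; a real kernel keeps the count in its per-frame metadata such as struct page):

```c
#include <stdint.h>

#define MAX_FRAMES (1 << 20)              // assumed physical memory size, in 4 KiB frames

static uint32_t pt_ref[MAX_FRAMES];       // reference count per page table page

// Assumed helper from the surrounding examples.
extern void free_physical_page(uint64_t phys);

// Called when another process starts sharing this table page.
void pt_get(uint64_t table_phys) {
    pt_ref[table_phys >> 12]++;
}

// Called when a process unmaps the subtree or exits.
void pt_put(uint64_t table_phys) {
    uint32_t frame = table_phys >> 12;
    if (--pt_ref[frame] == 0)
        free_physical_page(table_phys);   // last user gone: the table can be freed
}
```

The same pattern appears in the fork and COW examples above as incref_page_table and decref_page_table.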
In containerized environments running many identical instances, page table sharing dramatically reduces overhead. If 100 containers run the same image, they share most page table entries for code, libraries, and read-only data, multiplying memory efficiency.
Implementing hierarchical page tables correctly requires attention to several subtle issues that don't arise in simpler data structures.
When a mapping changes, explicit TLB invalidation (invlpg on x86) is required. On multi-core systems, TLB shootdowns must reach all cores.
```c
// Critical Implementation Details

// 1. Proper page table entry update sequence
void update_pte_safe(Entry* entry, uint64_t new_value, uint64_t vaddr) {
    // Ensure we don't create inconsistent transient states

    // Non-present -> present: write the new value atomically
    if (!(*entry & ENTRY_PRESENT)) {
        // The child table must be ready before it is marked present
        barrier();                    // Ensure child table writes complete
        WRITE_ONCE(*entry, new_value);
        // No TLB invalidation needed: the entry wasn't cached
        return;
    }

    // Present -> non-present: clear the entry, then invalidate the TLB
    if (!(new_value & ENTRY_PRESENT)) {
        WRITE_ONCE(*entry, 0);        // Clear the entry atomically
        barrier();
        tlb_invalidate(vaddr);
        return;
    }

    // Present -> present (changing the mapping): the complex case;
    // the stale TLB entry must be invalidated after the update
    WRITE_ONCE(*entry, new_value);
    barrier();
    tlb_invalidate(vaddr);
}

// 2. Multi-core TLB shootdown
void tlb_shootdown(uint64_t vaddr, cpumask_t cores) {
    // Send an IPI to each core that might have cached this entry
    for_each_cpu(cpu, cores) {
        if (cpu == current_cpu()) {
            invlpg(vaddr);            // Local invalidation
        } else {
            send_ipi(cpu, IPI_TLB_SHOOTDOWN, vaddr);
        }
    }

    // Wait for acknowledgment from all cores
    wait_for_all_acks(cores);
}

// 3. Memory barriers for table initialization
PageTable create_initialized_table(Entry* parent_entry, uint64_t flags) {
    PageTable new_table = allocate_page_table();

    // Initialize the table contents
    memset(new_table, 0, PAGE_SIZE);

    // CRITICAL: the memory barrier ensures the table is fully written
    // before the parent entry becomes valid
    wmb(); // Write memory barrier

    // Now safe to link from the parent
    *parent_entry = virt_to_phys(new_table) | flags | ENTRY_PRESENT;

    return new_table;
}

// 4. Self-mapping (recursive mapping) for page table access
// Map the PML4 into itself at a known index
#define RECURSIVE_SLOT 510 // Choose an unused slot

void setup_recursive_mapping(PageTable pml4) {
    // Point slot 510 back to the PML4 itself
    pml4[RECURSIVE_SLOT] = virt_to_phys(pml4) | ENTRY_PRESENT | ENTRY_WRITE;

    // Now a virtual address built from indices {510, 510, 510, 510} with
    // offset 0 reads PML4[0]: the walk terminates at the PML4 itself, so
    // page tables become accessible through ordinary loads and stores.
    // To reach entry N of a table at a deeper level, substitute that table's
    // path indices into the address and use N to form the low bits.
}
```

Page table bugs can compromise system security. A write to the wrong entry can expose kernel memory to user processes (privilege escalation). Modern CPUs include SMEP/SMAP (which prevent executing or accessing user memory from the kernel) and PKE (protection keys) to limit the damage from page table errors.
Hierarchical page tables are the universal solution to the page table size problem, implementing a radix tree structure that achieves sparse, efficient memory management.
What's next:
We've understood the structure of hierarchical page tables. In the next page, we'll examine the address translation process in detail—following a virtual address through all levels of the hierarchy, understanding how hardware MMUs implement this walk, and seeing how TLB caching makes multi-level translation practical.
You now understand hierarchical page tables as a general pattern: the tree structure, sparse allocation principle, recursive traversal, allocation/deallocation strategies, sharing mechanisms, and implementation considerations. Next, we'll dive into the mechanics of multi-level address translation.