In the previous pages, we explored two-level and three-level paging as specific solutions to the page table size problem. But stepping back, we can see these as instances of a more general pattern: hierarchical page tables.
A hierarchical page table is fundamentally a tree data structure: the root is the top-level directory that the hardware register points to, interior nodes are directories whose entries point to lower-level tables, and leaves are page tables whose entries hold physical frame numbers.
This tree structure is the key to achieving O(memory used) overhead instead of O(address space size) overhead. Whether a system uses two levels (32-bit x86), four levels (x86-64), or any other depth, the same principles apply.
This page abstracts multi-level paging into general principles. You'll understand hierarchical page tables as a tree structure, learn how sparse allocation works across any number of levels, explore recursive traversal algorithms, and see how these concepts unify all multi-level paging implementations.
A hierarchical page table is a specific type of tree called a radix tree (or trie). Each level of the tree corresponds to a portion of the virtual address, and navigation through the tree is determined by successive fields of the address.
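As a concrete illustration of how successive address fields steer the walk, here is a minimal, self-contained sketch that splits a virtual address into per-level indices. It assumes the x86-64 style layout of four 9-bit index fields over a 12-bit page offset; the address value itself is made up.

```c
#include <stdio.h>
#include <stdint.h>

// Split a 48-bit virtual address into four 9-bit radix-tree indices
// plus a 12-bit page offset (x86-64 style layout, assumed for illustration).
int main(void) {
    uint64_t vaddr = 0x00007f12345678ABULL;  // hypothetical user address

    for (int level = 0; level < 4; level++) {
        int shift = 12 + 9 * (3 - level);           // level 0 = root
        uint64_t index = (vaddr >> shift) & 0x1FF;  // 9-bit field
        printf("Level %d index: %3llu\n", level, (unsigned long long)index);
    }
    printf("Page offset:   %llu\n", (unsigned long long)(vaddr & 0xFFF));
    return 0;
}
```

Each printed index selects one entry at its level; the page offset is carried through unchanged and appended to the final frame address.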
Tree structure properties:
```
               ┌─────────────┐
               │    Root     │ ← CR3/TTBR points here
               │  (Level 0)  │   Always allocated for each process
               └──────┬──────┘
          ┌───────────┼───────────┐
          ▼           ▼           ▼
     ┌────────┐  ┌────────┐  ┌────────┐
     │Level 1 │  │Level 1 │  │Level 1 │ ← Allocated on demand
     │  Dir   │  │  Dir   │  │  Dir   │   (when region used)
     └────┬───┘  └────┬───┘  └────┬───┘
          │           │           │
     ┌────┼────┐     ...         ...
     ▼    ▼    ▼
   ┌────┐┌────┐┌────┐
   │ PT ││ PT ││ PT │ ← Leaf level
   │    ││    ││    │   Contains frame numbers
   └────┘└────┘└────┘
     │     │     │
     ▼     ▼     ▼
  [Frame][Frame][Frame]  ← Physical memory
```
```c
#include <stdio.h>
#include <stdint.h>

// Hierarchical Page Table as a Radix Tree

// Generic multi-level page table configuration
typedef struct {
    int levels;            // Number of levels (2, 3, 4, 5...)
    int bits_per_level;    // Bits used for indexing at each level
    int offset_bits;       // Bits for page offset
    int entry_size;        // Bytes per entry (4 or 8)
    int entries_per_table; // 2^bits_per_level
} PageTableConfig;

// Example configurations
const PageTableConfig CONFIG_X86 = {
    .levels = 2, .bits_per_level = 10, .offset_bits = 12,
    .entry_size = 4, .entries_per_table = 1024
};

const PageTableConfig CONFIG_X86_64 = {
    .levels = 4, .bits_per_level = 9, .offset_bits = 12,
    .entry_size = 8, .entries_per_table = 512
};

// Calculate tree properties
void analyze_tree(const PageTableConfig* config) {
    int address_bits = config->levels * config->bits_per_level + config->offset_bits;
    uint64_t max_pages = 1ULL << (address_bits - config->offset_bits);
    int table_size = config->entries_per_table * config->entry_size;

    printf("Page Table Tree Analysis:\n");
    printf("  Address bits:     %d\n", address_bits);
    printf("  Tree depth:       %d levels\n", config->levels);
    printf("  Branching factor: %d\n", config->entries_per_table);
    printf("  Max leaf entries: %llu (max pages)\n", (unsigned long long)max_pages);
    printf("  Table size:       %d bytes\n", table_size);

    // Best case: minimal allocation (1 page mapped)
    // Need: root + one table per level down to the leaf
    int min_tables = config->levels;
    printf("  Min tables (1 page): %d (%d bytes)\n",
           min_tables, min_tables * table_size);

    // Worst case: full address space mapped
    // Level i has entries_per_table^i tables
    printf("  Max tables (full):   astronomical\n");
}
```

Unlike binary search trees, radix trees have O(K) lookup time, where K is the key length (the number of levels), not O(log N), where N is the number of entries. For page tables, this means translation time is constant regardless of how many pages are mapped; it depends only on tree depth.
The fundamental advantage of hierarchical page tables is sparse allocation: we only allocate table nodes for paths that lead to actual mappings. This principle underlies all memory savings.
Mathematical analysis:
Consider a general hierarchical page table with L levels, a branching factor of B entries per table, and P mapped pages.
Flat page table allocation: a flat table must reserve all B^L entries up front, regardless of how many are actually used.
Hierarchical allocation bounds: at minimum, L tables (a single path for one mapped page); at most, roughly ⌈P/B⌉ leaf tables plus their ancestors at each higher level. The overhead therefore grows with P, the number of mapped pages, not with B^L, the size of the address space.
```python
# Sparse Allocation Analysis

import math

def analyze_allocation(levels, branching, entry_size, mapped_pages, page_distribution):
    """
    Analyze page table memory usage.

    page_distribution: how pages are spread across the address space
      - 'clustered': pages are in contiguous regions
      - 'scattered': pages are randomly distributed
    """
    table_size = branching * entry_size

    # Flat table would need:
    total_possible_pages = branching ** levels
    flat_size = total_possible_pages * entry_size

    # Hierarchical allocation depends on distribution
    if page_distribution == 'clustered':
        # Pages in contiguous regions share upper-level tables
        tables_needed = estimate_clustered(levels, branching, mapped_pages)
    else:
        # Scattered pages need more separate paths
        tables_needed = estimate_scattered(levels, branching, mapped_pages)

    hierarchical_size = tables_needed * table_size
    savings_ratio = (flat_size - hierarchical_size) / flat_size * 100

    print("Allocation Analysis:")
    print(f"  Levels: {levels}, Branching: {branching}")
    print(f"  Mapped pages: {mapped_pages:,}")
    print(f"  Distribution: {page_distribution}")
    print()
    print(f"  Flat table size: {flat_size:,} bytes ({flat_size/1024/1024:.2f} MB)")
    print(f"  Hierarchical:    ~{tables_needed} tables = {hierarchical_size:,} bytes")
    print(f"  Savings:         {savings_ratio:.2f}%")

    return hierarchical_size


def estimate_clustered(levels, branching, pages):
    """Estimate tables needed for a clustered page distribution."""
    # Contiguous pages share upper-level tables: a table at level i
    # covers branching^(levels - i) pages.
    tables = 0
    for level in range(levels):
        coverage_per_table = branching ** (levels - level)
        tables_at_level = math.ceil(pages / coverage_per_table)
        tables += min(tables_at_level, branching ** level)  # Can't exceed max
    return max(tables, levels)  # At least one path


def estimate_scattered(levels, branching, pages):
    """Estimate tables needed for a scattered page distribution."""
    # Scattered pages share fewer tables; each level up needs roughly
    # 1/branching as many tables as the level below it.
    leaf_tables = math.ceil(pages / branching)
    total = leaf_tables
    tables_below = leaf_tables
    for _ in range(levels - 1):
        parent_tables = math.ceil(tables_below / branching)
        total += parent_tables
        tables_below = parent_tables
    return max(total, levels)


# Example: x86-64 with 4 levels, 512 entries, 8-byte entries
# Process with 10,000 mapped pages (40MB)
print("=" * 60)
analyze_allocation(4, 512, 8, 10000, 'clustered')
print()
analyze_allocation(4, 512, 8, 10000, 'scattered')
print()

# Same system, very sparse process (100 pages = 400KB)
print("=" * 60)
analyze_allocation(4, 512, 8, 100, 'clustered')
print()

# Comparison: flat 48-bit page table
# 2^36 pages × 8 bytes = 549,755,813,888 bytes = 512GB!
```

| Pages Mapped | Used Memory | Hierarchical Overhead | Flat Overhead | Savings |
|---|---|---|---|---|
| 100 | 400 KB | ~20 KB | 512 GB | 99.9999996% |
| 10,000 | 40 MB | ~400 KB | 512 GB | 99.9999% |
| 1,000,000 | 4 GB | ~40 MB | 512 GB | 99.99% |
| 100,000,000 | 400 GB | ~4 GB | 512 GB | 99.2% |
Hierarchical page tables transform the overhead from O(address space) to O(mapped memory). For typical processes using a tiny fraction of their address space, savings exceed 99.99%. This makes huge address spaces practical.
The tree structure of hierarchical page tables naturally suggests a recursive algorithm for traversal. This perspective unifies all multi-level paging implementations into a single conceptual framework.
Recursive translation:
At each level, the algorithm extracts that level's index bits from the virtual address, reads the corresponding entry, and checks the present bit. If the entry is a leaf, it combines the frame address with the remaining offset bits; otherwise it descends into the child table and repeats.
```c
// Recursive Page Table Traversal - General Algorithm

typedef uint64_t* PageTable;
typedef uint64_t  Entry;

// Configuration
#define MAX_LEVELS     5
#define ENTRY_PRESENT  (1ULL << 0)
#define ENTRY_LEAF     (1ULL << 7)   // Large page flag
#define ADDR_MASK      0x0000FFFFFFFFF000ULL

// Level-specific bit extraction
int bits_per_level[MAX_LEVELS];  // Configured at init
int level_shift[MAX_LEVELS];     // Precomputed shifts
int num_levels;                  // Configured at init

// Initialize shifts: level 0 is the root, level (n-1) is the leaf
void init_level_shifts(int levels, int bits_per_lvl, int offset_bits) {
    num_levels = levels;
    int pos = offset_bits;
    for (int i = levels - 1; i >= 0; i--) {
        level_shift[i]    = pos;
        bits_per_level[i] = bits_per_lvl;
        pos += bits_per_lvl;
    }
}

// Extract the index for a given level from a virtual address
uint64_t get_index(uint64_t vaddr, int level) {
    uint64_t mask = (1ULL << bits_per_level[level]) - 1;
    return (vaddr >> level_shift[level]) & mask;
}

// Recursive translation function
// Returns the physical address, or 0 on fault
uint64_t translate_recursive(
    PageTable table,    // Current level's table (mapped into kernel virtual memory)
    uint64_t  vaddr,    // Virtual address being translated
    int       level,    // Current level (0 = root)
    int       max_level // Deepest level (leaves)
) {
    // Get the index for this level
    uint64_t index = get_index(vaddr, level);
    Entry entry = table[index];

    // Check if the entry is present
    if (!(entry & ENTRY_PRESENT)) {
        // Page fault: this subtree doesn't exist
        page_fault(vaddr, level);
        return 0; // Not reached
    }

    // Check if this is a large page (leaf before max level)
    if ((entry & ENTRY_LEAF) && level < max_level) {
        // Large page: combine the frame address with the remaining address bits
        uint64_t frame_addr = entry & ADDR_MASK;
        uint64_t offset = vaddr & ((1ULL << level_shift[level]) - 1);
        return frame_addr | offset;
    }

    // Check if we've reached the leaf level
    if (level == max_level) {
        // Leaf entry: extract the frame number
        uint64_t frame_addr = entry & ADDR_MASK;
        uint64_t offset = vaddr & ((1ULL << level_shift[level]) - 1);
        return frame_addr | offset;
    }

    // Internal node: recursively descend
    uint64_t child_phys = entry & ADDR_MASK;
    PageTable child_table = phys_to_virt(child_phys);
    return translate_recursive(child_table, vaddr, level + 1, max_level);
}

// Top-level translation function
uint64_t translate(uint64_t vaddr) {
    // Get the root table from the hardware register
    PageTable root = phys_to_virt(read_cr3());

    // Start the recursion at level 0
    return translate_recursive(root, vaddr, 0, num_levels - 1);
}
```

The elegance of recursion:
The recursive formulation reveals that all multi-level paging schemes are structurally identical. They differ only in the number of levels, the index bits (and hence branching factor) at each level, and the size and format of the entries.
This uniformity explains why operating systems can support multiple page table formats with shared code. The Linux kernel, for example, uses a single pgd_t → p4d_t → pud_t → pmd_t → pte_t framework that works whether the architecture has 2, 3, 4, or 5 actual levels.
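As a rough sketch of how one walk loop can serve several depths (this is not the kernel's actual code; Linux folds a missing level by making its accessors pass through, and the names level_bits, read_entry, and walk here are invented for illustration), a level can be given zero index bits and simply skipped:

```c
#include <stdint.h>

#define MAX_WALK_LEVELS 5

typedef uint64_t pt_entry_t;

// Per-configuration index bits at each level, root first.
// A zero means that level is "folded out" on this configuration.
static const int level_bits[MAX_WALK_LEVELS] = { 9, 0, 9, 9, 9 }; // a 4-level setup

// Assumed helper: read one entry from a table page given its physical address.
extern pt_entry_t read_entry(uint64_t table_phys, uint64_t index);

uint64_t walk(uint64_t root_phys, uint64_t vaddr) {
    uint64_t table = root_phys;
    int shift = 12;                               // page offset bits
    for (int i = 0; i < MAX_WALK_LEVELS; i++)     // total index width above the offset
        shift += level_bits[i];

    for (int level = 0; level < MAX_WALK_LEVELS; level++) {
        if (level_bits[level] == 0)
            continue;                             // folded level: pass straight through
        shift -= level_bits[level];
        uint64_t index = (vaddr >> shift) & ((1ULL << level_bits[level]) - 1);
        pt_entry_t e = read_entry(table, index);
        if (!(e & 1))                             // present bit assumed at bit 0
            return 0;                             // fault
        table = e & ~0xFFFULL;                    // next table, or the final frame
    }
    return table | (vaddr & 0xFFF);
}
```

Changing the level_bits array is all it takes to retarget the same walk at a 3-, 4-, or 5-level tree, which is the essence of the shared-framework approach.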
Hardware optimization:
While the recursive model is conceptually clean, hardware MMUs typically implement iterative traversal with a fixed number of stages. Each stage reads one entry from the current table, checks its present and permission bits, and either produces the final physical address or supplies the base address of the next table.
Modern MMUs include page table walkers that perform this traversal in parallel with other operations, hiding some latency.
The recursive translation is tail-recursive: the recursive call is the last operation. This means it can be trivially converted to an iterative loop with no stack growth—exactly what hardware MMUs implement.
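To make that concrete, here is the same walk unrolled into a loop. It is a sketch that reuses the types, helpers, and flag bits from the recursive example above (PageTable, Entry, get_index, level_shift, phys_to_virt, page_fault) and adds nothing new:

```c
// Iterative equivalent of translate_recursive: same checks, no call stack.
uint64_t translate_iterative(PageTable root, uint64_t vaddr, int max_level) {
    PageTable table = root;

    for (int level = 0; ; level++) {
        uint64_t index = get_index(vaddr, level);
        Entry entry = table[index];

        if (!(entry & ENTRY_PRESENT)) {
            page_fault(vaddr, level);              // missing subtree
            return 0;
        }

        // Leaf entry: either a large page or the final level
        if ((entry & ENTRY_LEAF) || level == max_level) {
            uint64_t frame_addr = entry & ADDR_MASK;
            uint64_t offset = vaddr & ((1ULL << level_shift[level]) - 1);
            return frame_addr | offset;
        }

        // Internal node: step down one level and keep walking
        table = phys_to_virt(entry & ADDR_MASK);
    }
}
```

The loop body is exactly one "stage" of a hardware page table walker: fetch an entry, check it, and either finish or move to the next table.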
Efficient management of hierarchical page tables requires careful allocation and deallocation strategies. The goal is to minimize overhead while ensuring fast allocation when processes grow.
Allocation triggers:
New page tables are allocated when:
mmap(), brk(), stack growth, or fault handling requires new mappings
```c
// Page Table Allocation Strategies

// Strategy 1: Lazy Allocation (most common)
// Allocate tables only when the mapping is first established

int map_page_lazy(PageTable root, uint64_t vaddr, uint64_t paddr,
                  uint64_t flags, int num_levels) {
    PageTable current = root;

    for (int level = 0; level < num_levels - 1; level++) {
        uint64_t index = get_index(vaddr, level);
        Entry* entry = &current[index];

        if (!(*entry & ENTRY_PRESENT)) {
            // Allocate a new child table on demand
            uint64_t new_table = allocate_page_table();
            if (new_table == 0) {
                return -ENOMEM; // Out of memory
            }

            // Clear the new table
            memset(phys_to_virt(new_table), 0, PAGE_SIZE);

            // Install the pointer to the new table
            *entry = new_table | ENTRY_PRESENT | ENTRY_USER | ENTRY_WRITE;
        }

        // Descend to the next level
        current = phys_to_virt(*entry & ADDR_MASK);
    }

    // Install the final mapping
    uint64_t leaf_index = get_index(vaddr, num_levels - 1);
    current[leaf_index] = paddr | flags | ENTRY_PRESENT;
    return 0;
}

// Strategy 2: Pre-allocation for kernel space
// Allocate all tables for a region upfront

void preallocate_region(PageTable root, uint64_t start, uint64_t end,
                        int num_levels) {
    // For kernel mappings that won't change, pre-allocate all tables;
    // avoids allocation during critical paths
    for (uint64_t addr = start; addr < end; addr += coverage_per_l1_entry) {
        // Walk down and allocate at each level
        ensure_path_exists(root, addr, num_levels);
    }
}

// Strategy 3: Pooled allocation
// Maintain a cache of pre-allocated page table pages

#define PT_POOL_SIZE 64

typedef struct {
    uint64_t pages[PT_POOL_SIZE];
    int count;
    spinlock_t lock;
} PageTablePool;

static PageTablePool pt_pool;

void pt_pool_refill(void) {
    // Called periodically or when the pool is low;
    // allocate pages in batches for efficiency
    while (pt_pool.count < PT_POOL_SIZE / 2) {
        uint64_t page = allocate_physical_page();
        if (page == 0) break;
        pt_pool.pages[pt_pool.count++] = page;
    }
}

uint64_t pt_pool_allocate(void) {
    spin_lock(&pt_pool.lock);

    if (pt_pool.count == 0) {
        spin_unlock(&pt_pool.lock);
        // Emergency: allocate directly
        return allocate_physical_page();
    }

    uint64_t page = pt_pool.pages[--pt_pool.count];
    spin_unlock(&pt_pool.lock);

    // Trigger an async refill if the pool is getting low
    if (pt_pool.count < PT_POOL_SIZE / 4) {
        schedule_refill();
    }

    return page;
}
```

Production systems often use hybrid strategies: lazy allocation for user space (where usage varies), pre-allocation for kernel space (which is fixed and critical), and pooled allocation to accelerate the common case while handling corner cases gracefully.
Deallocating page tables correctly is more complex than allocation due to the tree structure and the need to avoid memory leaks. A single intermediate table can only be freed when all its entries are empty.
Deallocation triggers:
- A region is unmapped (munmap())
- Memory is reclaimed (MADV_FREE, heap shrinking)
- A process exits and its entire page table tree is torn down
```c
// Page Table Deallocation Strategies

// Structure to track table usage
typedef struct {
    uint64_t phys_addr;
    uint16_t valid_entries; // Number of present entries
    uint16_t level;
} PageTableInfo;

// Clear a leaf entry, then free any tables left empty along the path
void unmap_page(PageTable root, uint64_t vaddr, int num_levels) {
    PageTable tables[MAX_LEVELS];
    uint64_t  indices[MAX_LEVELS];

    // Walk down, recording the path
    PageTable current = root;
    for (int level = 0; level < num_levels; level++) {
        tables[level]  = current;
        indices[level] = get_index(vaddr, level);

        if (level < num_levels - 1) {
            Entry entry = current[indices[level]];
            if (!(entry & ENTRY_PRESENT)) {
                return; // Nothing mapped here
            }
            current = phys_to_virt(entry & ADDR_MASK);
        }
    }

    // Clear the leaf entry
    tables[num_levels - 1][indices[num_levels - 1]] = 0;

    // Walk back up, freeing empty tables
    for (int level = num_levels - 1; level > 0; level--) {
        if (table_is_empty(tables[level])) {
            // Free this table
            free_physical_page(virt_to_phys(tables[level]));

            // Clear the parent's pointer to us
            tables[level - 1][indices[level - 1]] = 0;
        } else {
            // Table still has entries, stop walking up
            break;
        }
    }

    // Invalidate the TLB entry
    invlpg(vaddr);
}

// Full page table tree deletion (process exit)
void destroy_page_tables(PageTable root, int level, int max_level) {
    if (root == NULL) return;

    for (int i = 0; i < entries_per_level; i++) {
        Entry entry = root[i];

        if (!(entry & ENTRY_PRESENT)) {
            continue; // Skip empty entries
        }

        if (level < max_level && !(entry & ENTRY_LEAF)) {
            // Recurse into the child table first
            PageTable child = phys_to_virt(entry & ADDR_MASK);
            destroy_page_tables(child, level + 1, max_level);
        }

        // If this is a leaf (actual page mapping),
        // optionally free the frame too, depending on policy
        if (level == max_level || (entry & ENTRY_LEAF)) {
            // Frame cleanup depends on whether it's shared, file-backed, etc.
            handle_frame_cleanup(entry);
        }
    }

    // Free this table itself
    free_physical_page(virt_to_phys(root));
}

// Process exit cleanup
void process_exit_cleanup(Process* proc) {
    PageTable root = phys_to_virt(proc->cr3);

    // Recursive deletion of the entire tree
    destroy_page_tables(root, 0, proc->pt_levels - 1);

    // Note: don't free the kernel portion of the address space,
    // as it's shared across all processes
}
```

Freeing a page table while another CPU holds a cached TLB entry pointing to it is a critical bug. The other CPU might walk through freed or reused memory, causing corruption or security exploits. Multi-core systems require careful TLB shootdown protocols before freeing tables.
The hierarchical structure enables powerful sharing optimizations. Multiple processes can share portions of the page table tree, reducing memory overhead and simplifying updates.
Common sharing scenarios include the kernel portion of the address space (shared by every process), parent and child address spaces after fork() (shared copy-on-write), and shared libraries mapped by many processes.
```c
// Page Table Sharing Strategies

// 1. Kernel space sharing at process creation
void setup_kernel_mappings(PageTable new_root) {
    // Copy pointers to the kernel page tables.
    // In x86-64 with 48-bit addresses, entries 256-511 are kernel space
    PageTable kernel_root = get_kernel_page_table();

    for (int i = 256; i < 512; i++) {
        // Share a reference to the kernel's tables (not a copy)
        new_root[i] = kernel_root[i];
    }

    // Now kernel tables are shared; updates to kernel mappings
    // are automatically visible to this process
}

// 2. Fork sharing (copy-on-write)
PageTable fork_page_tables(PageTable parent_root) {
    // Allocate a new root for the child
    PageTable child_root = allocate_page_table();

    // USER SPACE: initially share all subtables, mark entries COW
    for (int i = 0; i < 256; i++) {
        if (parent_root[i] & ENTRY_PRESENT) {
            // Share the subtree reference
            child_root[i] = parent_root[i];

            // Mark both parent and child as read-only for COW
            parent_root[i] &= ~ENTRY_WRITE;
            child_root[i]  &= ~ENTRY_WRITE;

            // Increment the reference count on the shared table
            incref_page_table(parent_root[i] & ADDR_MASK);
        }
    }

    // KERNEL SPACE: share as usual
    for (int i = 256; i < 512; i++) {
        child_root[i] = parent_root[i];
        // Kernel tables are globally shared, no COW needed
    }

    return child_root;
}

// 3. COW fault handling
void handle_cow_fault(PageTable root, uint64_t vaddr) {
    // Walk to find the highest shared table that needs splitting
    PageTable path[MAX_LEVELS];
    uint64_t  indices[MAX_LEVELS];
    // ... walk and identify shared tables ...

    for (int level = fault_level; level < num_levels; level++) {
        if (get_refcount(path[level]) > 1) {
            // This table is shared, need to copy it
            PageTable new_table = allocate_page_table();
            memcpy(new_table, path[level], PAGE_SIZE);

            // Update the parent to point to the new copy
            path[level - 1][indices[level - 1]] =
                virt_to_phys(new_table) |
                get_flags(path[level - 1][indices[level - 1]]);

            // Decrement the refcount on the old shared table
            decref_page_table(path[level]);
            path[level] = new_table;
        }
    }

    // Now the leaf entry is private; make it writable
    path[num_levels - 1][indices[num_levels - 1]] |= ENTRY_WRITE;

    // If the frame itself is shared, it may also need to be copied
    // ... handle frame COW ...
}
```

Benefits of sharing:
| Mechanism | Memory Saved | Complexity |
|---|---|---|
| Kernel sharing | Significant (kernel tables shared by all) | Low (static sharing) |
| Fork COW | Huge for fork-heavy workloads | Moderate (refcounts needed) |
| Shared library sharing | Moderate (common mappings shared) | Moderate (alignment requirements) |
Reference counting granularity:
To support sharing, each page table needs a reference count. When a table is shared between processes, the count is incremented. When a process unmaps or exits, it's decremented. The table is freed only when the count reaches zero.
This can be stored in the struct page for the table's physical frame, leveraging existing memory management infrastructure.
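As a minimal sketch of that bookkeeping (the pt_ref array, pt_get, and pt_put names are invented for illustration; a real kernel keeps the count in its per-frame metadata such as struct page):

```c
#include <stdint.h>

#define MAX_FRAMES (1 << 20)              // assumed physical memory size, in 4 KiB frames

static uint32_t pt_ref[MAX_FRAMES];       // reference count per page table page

// Assumed helper from the surrounding examples.
extern void free_physical_page(uint64_t phys);

// Called when another process starts sharing this table page.
void pt_get(uint64_t table_phys) {
    pt_ref[table_phys >> 12]++;
}

// Called when a process unmaps the subtree or exits.
void pt_put(uint64_t table_phys) {
    uint32_t frame = table_phys >> 12;
    if (--pt_ref[frame] == 0)
        free_physical_page(table_phys);   // last user gone: the table can be freed
}
```

The same pattern appears in the fork and COW examples above as incref_page_table and decref_page_table.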
In containerized environments running many identical instances, page table sharing dramatically reduces overhead. If 100 containers run the same image, they share most page table entries for code, libraries, and read-only data, multiplying memory efficiency.
Implementing hierarchical page tables correctly requires attention to several subtle issues that don't arise in simpler data structures.
When a mapping changes, explicit TLB invalidation (invlpg on x86) is required. On multi-core systems, TLB shootdowns must reach all cores.
```c
// Critical Implementation Details

// 1. Proper page table entry update sequence
void update_pte_safe(Entry* entry, uint64_t new_value, uint64_t vaddr) {
    // Ensure we don't create inconsistent transient states

    // Non-present -> present: write the new value atomically
    if (!(*entry & ENTRY_PRESENT)) {
        // The child table must be ready before it is marked present
        barrier();                    // Ensure child table writes complete
        WRITE_ONCE(*entry, new_value);
        // No TLB invalidation needed: the entry wasn't cached
        return;
    }

    // Present -> non-present: clear the entry, then invalidate the TLB
    if (!(new_value & ENTRY_PRESENT)) {
        WRITE_ONCE(*entry, 0);        // Clear the entry atomically
        barrier();
        tlb_invalidate(vaddr);
        return;
    }

    // Present -> present (changing the mapping): the complex case;
    // the stale TLB entry must be invalidated after the update
    WRITE_ONCE(*entry, new_value);
    barrier();
    tlb_invalidate(vaddr);
}

// 2. Multi-core TLB shootdown
void tlb_shootdown(uint64_t vaddr, cpumask_t cores) {
    // Send an IPI to each core that might have cached this entry
    for_each_cpu(cpu, cores) {
        if (cpu == current_cpu()) {
            invlpg(vaddr);            // Local invalidation
        } else {
            send_ipi(cpu, IPI_TLB_SHOOTDOWN, vaddr);
        }
    }

    // Wait for acknowledgment from all cores
    wait_for_all_acks(cores);
}

// 3. Memory barriers for table initialization
PageTable create_initialized_table(Entry* parent_entry, uint64_t flags) {
    PageTable new_table = allocate_page_table();

    // Initialize the table contents
    memset(new_table, 0, PAGE_SIZE);

    // CRITICAL: the memory barrier ensures the table is fully written
    // before the parent entry becomes valid
    wmb(); // Write memory barrier

    // Now safe to link from the parent
    *parent_entry = virt_to_phys(new_table) | flags | ENTRY_PRESENT;

    return new_table;
}

// 4. Self-mapping (recursive mapping) for page table access
// Map the PML4 into itself at a known index
#define RECURSIVE_SLOT 510 // Choose an unused slot

void setup_recursive_mapping(PageTable pml4) {
    // Point slot 510 back to the PML4 itself
    pml4[RECURSIVE_SLOT] = virt_to_phys(pml4) | ENTRY_PRESENT | ENTRY_WRITE;

    // Now a virtual address built from indices {510, 510, 510, 510} with
    // offset 0 reads PML4[0]: the walk terminates at the PML4 itself, so
    // page tables become accessible through ordinary loads and stores.
    // To reach entry N of a table at a deeper level, substitute that table's
    // path indices into the address and use N to form the low bits.
}
```

Page table bugs can compromise system security. A write to the wrong entry can expose kernel memory to user processes (privilege escalation). Modern CPUs include SMEP/SMAP (which prevent executing or accessing user memory from the kernel) and PKE (protection keys) to limit the damage from page table errors.
Hierarchical page tables are the universal solution to the page table size problem, implementing a radix tree structure that achieves sparse, efficient memory management.
What's next:
We've understood the structure of hierarchical page tables. In the next page, we'll examine the address translation process in detail—following a virtual address through all levels of the hierarchy, understanding how hardware MMUs implement this walk, and seeing how TLB caching makes multi-level translation practical.
You now understand hierarchical page tables as a general pattern: the tree structure, sparse allocation principle, recursive traversal, allocation/deallocation strategies, sharing mechanisms, and implementation considerations. Next, we'll dive into the mechanics of multi-level address translation.