Every memory access in a modern operating system involves address translation—converting the virtual address used by software into the physical address where data actually resides. In multi-level paging systems, this translation requires navigating through multiple page table levels, each adding latency to memory access.
The translation process sits at the heart of memory management, executed billions of times per second on a typical system. Its efficiency directly impacts system performance, as even small improvements compound across countless iterations.
This page examines the complete translation pipeline: how the virtual address is decomposed, how the page table hierarchy is traversed, how the Translation Lookaside Buffer (TLB) accelerates common cases, and how hardware page table walkers handle misses—all while maintaining correctness and security across multiple protection boundaries.
By the end of this page, you will understand the complete multi-level translation algorithm step-by-step, the role of hardware MMU and page table walkers, how the TLB integrates with multi-level tables, the performance implications of table depth, and how protection checks are enforced during translation.
In a multi-level paging system, translating a virtual address to a physical address requires traversing a tree of page tables. Each level of the tree must be accessed sequentially—you cannot read level N until you've read the pointer from level N-1.
The fundamental challenge:
With D levels of page tables, every memory access needs D+1 physical memory reads: D reads to walk the table hierarchy, plus one read for the data itself.
For a 4-level system (like x86-64), this means 5 memory accesses per load/store without caching.
| Levels | Table Reads | Total Reads (with data) | Slowdown Factor |
|---|---|---|---|
| 1 (flat) | 1 | 2 | 2× |
| 2 | 2 | 3 | 3× |
| 3 | 3 | 4 | 4× |
| 4 | 4 | 5 | 5× |
| 5 | 5 | 6 | 6× |
Why this overhead is unacceptable:
Consider a system with 100ns memory latency. A single 4-level page table walk requires four table reads plus the final data access: roughly 500ns in total, versus 100ns for the data alone.
Without any caching, this would make memory 5× slower than raw physical access. At 10 billion memory operations per second on a modern system, this overhead would be catastrophic.
The solution: The Translation Lookaside Buffer (TLB)
The TLB is a specialized cache that stores recent virtual-to-physical translations. A TLB hit provides the physical frame number immediately, bypassing the entire page table walk. With typical TLB hit rates of 99%+, the average translation cost drops dramatically:
Average translation time = (hit_rate × TLB_time) + (miss_rate × walk_time)
= (0.99 × 1ns) + (0.01 × 400ns)
= 0.99ns + 4ns
= ~5ns average
This reduces the effective overhead from 5× to approximately 5% on average—a critical optimization that makes multi-level paging viable.
While TLB hits are fast, TLB misses are very expensive. Code patterns with poor locality (jumping to random addresses, accessing scattered data structures) cause frequent TLB misses and suffer the full multi-level walk penalty. Understanding TLB behavior is crucial for high-performance software.
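To make the locality effect concrete, here is a minimal sketch (the buffer size and access pattern are illustrative assumptions, not a calibrated benchmark): the first loop makes many accesses per page and reuses each TLB entry heavily, while the second touches pages in a pseudo-random order, so nearly every access can miss the TLB and pay for a walk.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE  4096
#define PAGES (1 << 18)   // 1GB of address space, assumed for illustration

// TLB-friendly: 64 sequential accesses per page reuse a single TLB entry.
void touch_sequential(uint8_t* buf) {
    for (size_t i = 0; i < (size_t)PAGES * PAGE; i += 64)
        buf[i]++;
}

// TLB-hostile: one access per page, pages visited in pseudo-random order,
// so a working set far larger than the TLB is touched with no reuse.
void touch_scattered(uint8_t* buf) {
    for (size_t p = 0; p < PAGES; p++)
        buf[((p * 2654435761ULL) % PAGES) * PAGE]++;
}
```

Both functions touch the same amount of memory; only the order differs, yet the second typically shows a far higher dTLB miss count.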
Let's trace through the complete translation algorithm for a 4-level system, showing every step from virtual address input to physical address output.
```c
// Complete x86-64 Four-Level Translation Algorithm

#include <stdint.h>
#include <stdbool.h>

// Address field extraction for 48-bit virtual addresses
#define VA_OFFSET(va)   ((va) & 0xFFF)            // Bits 0-11
#define VA_PT_IDX(va)   (((va) >> 12) & 0x1FF)    // Bits 12-20
#define VA_PD_IDX(va)   (((va) >> 21) & 0x1FF)    // Bits 21-29
#define VA_PDPT_IDX(va) (((va) >> 30) & 0x1FF)    // Bits 30-38
#define VA_PML4_IDX(va) (((va) >> 39) & 0x1FF)    // Bits 39-47

// Entry flags
#define ENTRY_P   (1ULL << 0)   // Present
#define ENTRY_RW  (1ULL << 1)   // Read/Write
#define ENTRY_US  (1ULL << 2)   // User/Supervisor
#define ENTRY_PWT (1ULL << 3)   // Write-Through
#define ENTRY_PCD (1ULL << 4)   // Cache Disable
#define ENTRY_A   (1ULL << 5)   // Accessed
#define ENTRY_D   (1ULL << 6)   // Dirty (leaf entries only)
#define ENTRY_PS  (1ULL << 7)   // Page Size (large page)
#define ENTRY_G   (1ULL << 8)   // Global
#define ENTRY_XD  (1ULL << 63)  // Execute Disable (NX bit)

#define ADDR_MASK 0x000FFFFFFFFFF000ULL // Physical address bits

typedef uint64_t pte_t;
typedef int TranslationResult;

// Error codes
#define TRANS_SUCCESS     0
#define TRANS_FAULT_PML4 -1
#define TRANS_FAULT_PDPT -2
#define TRANS_FAULT_PD   -3
#define TRANS_FAULT_PT   -4
#define TRANS_PERM_ERROR -5

// Page sizes recorded in the TLB
#define PAGE_SIZE_4KB 0
#define PAGE_SIZE_2MB 1
#define PAGE_SIZE_1GB 2

// TLB lookup (returns 0 if miss)
uint64_t tlb_lookup(uint64_t va, uint64_t* pa);

// TLB insert after a successful walk
void update_tlb(uint64_t va, uint64_t pa, pte_t entry, int page_size);

// Physical memory read (for page table walking)
uint64_t phys_read64(uint64_t pa);

// Permission check based on access type
bool check_permissions(pte_t entry, int access_type, int cpl);

// The complete translation function
TranslationResult translate_x64(
    uint64_t cr3,     // Page table root (physical)
    uint64_t va,      // Virtual address to translate
    int access_type,  // READ, WRITE, or EXEC
    int cpl,          // Current privilege level (0-3)
    uint64_t* pa      // Output: physical address
) {
    pte_t entry;
    uint64_t table_pa;

    // STEP 1: Check TLB first
    if (tlb_lookup(va, pa)) {
        return TRANS_SUCCESS; // TLB hit!
    }

    // STEP 2: TLB miss - start page table walk
    // Read PML4 entry
    table_pa = cr3 & ADDR_MASK;
    entry = phys_read64(table_pa + VA_PML4_IDX(va) * 8);
    if (!(entry & ENTRY_P)) {
        return TRANS_FAULT_PML4; // PML4 entry not present
    }
    if (!check_permissions(entry, access_type, cpl)) {
        return TRANS_PERM_ERROR;
    }

    // STEP 3: Read PDPT entry
    table_pa = entry & ADDR_MASK;
    entry = phys_read64(table_pa + VA_PDPT_IDX(va) * 8);
    if (!(entry & ENTRY_P)) {
        return TRANS_FAULT_PDPT;
    }
    if (!check_permissions(entry, access_type, cpl)) {
        return TRANS_PERM_ERROR;
    }

    // STEP 4: Check for 1GB page
    if (entry & ENTRY_PS) {
        // 1GB page: PA bits 47:30 come from the entry, VA bits 29:0 are the offset
        *pa = (entry & 0x000FFFFFC0000000ULL) | (va & 0x3FFFFFFF);
        update_tlb(va, *pa, entry, PAGE_SIZE_1GB);
        return TRANS_SUCCESS;
    }

    // STEP 5: Read PD entry
    table_pa = entry & ADDR_MASK;
    entry = phys_read64(table_pa + VA_PD_IDX(va) * 8);
    if (!(entry & ENTRY_P)) {
        return TRANS_FAULT_PD;
    }
    if (!check_permissions(entry, access_type, cpl)) {
        return TRANS_PERM_ERROR;
    }

    // STEP 6: Check for 2MB page
    if (entry & ENTRY_PS) {
        // 2MB page: PA bits 47:21 come from the entry, VA bits 20:0 are the offset
        *pa = (entry & 0x000FFFFFFFE00000ULL) | (va & 0x1FFFFF);
        update_tlb(va, *pa, entry, PAGE_SIZE_2MB);
        return TRANS_SUCCESS;
    }

    // STEP 7: Read PT entry
    table_pa = entry & ADDR_MASK;
    entry = phys_read64(table_pa + VA_PT_IDX(va) * 8);
    if (!(entry & ENTRY_P)) {
        return TRANS_FAULT_PT;
    }
    if (!check_permissions(entry, access_type, cpl)) {
        return TRANS_PERM_ERROR;
    }

    // STEP 8: Calculate physical address
    *pa = (entry & ADDR_MASK) | VA_OFFSET(va);

    // STEP 9: Update TLB
    update_tlb(va, *pa, entry, PAGE_SIZE_4KB);

    return TRANS_SUCCESS;
}
```

Permissions are checked at every level during the walk.
Each level can restrict access but cannot grant permissions that a higher level denied. The effective permission is the intersection (most restrictive combination) of all levels.
Modern CPUs include dedicated hardware units called Page Table Walkers (PTW) that perform address translation on TLB misses. These specialized circuits can walk page tables much faster than software could.
Key PTW design decisions include how many walks can be outstanding at once, whether table reads are allowed to hit in the regular data caches, and whether intermediate entries are cached in a dedicated structure (see the Page Walk Cache below). The walker itself is a small state machine:
```
// Conceptual Hardware Page Table Walker State Machine

STATES:
  IDLE        // Waiting for TLB miss
  WALK_PML4   // Reading PML4 entry
  WALK_PDPT   // Reading PDPT entry
  WALK_PD     // Reading PD entry
  WALK_PT     // Reading PT entry
  COMPLETE    // Walk finished, updating TLB
  FAULT       // Page fault detected

// State machine transitions
on TLB_MISS(virtual_address, access_type):
    va = virtual_address
    access = access_type
    state = WALK_PML4
    issue_read(CR3 + PML4_INDEX(va) * 8)

on MEMORY_RESPONSE(data):
    entry = data
    if not (entry.present):
        state = FAULT
        raise_page_fault(va, LEVEL)

    switch state:
      case WALK_PML4:
          pml4e = entry
          if !check_perms(entry, access): raise_fault()
          state = WALK_PDPT
          issue_read(entry.addr + PDPT_INDEX(va) * 8)

      case WALK_PDPT:
          pdpte = entry
          if !check_perms(entry, access): raise_fault()
          if entry.page_size:
              // 1GB page
              result = form_1gb_address(entry, va)
              state = COMPLETE
          else:
              state = WALK_PD
              issue_read(entry.addr + PD_INDEX(va) * 8)

      case WALK_PD:
          pde = entry
          if !check_perms(entry, access): raise_fault()
          if entry.page_size:
              // 2MB page
              result = form_2mb_address(entry, va)
              state = COMPLETE
          else:
              state = WALK_PT
              issue_read(entry.addr + PT_INDEX(va) * 8)

      case WALK_PT:
          pte = entry
          if !check_perms(entry, access): raise_fault()
          result = form_4kb_address(entry, va)
          state = COMPLETE

on state == COMPLETE:
    // Combine permissions from all levels
    effective_perms = pml4e.perms & pdpte.perms & pde.perms & pte.perms
    update_tlb(va, result.frame, effective_perms)
    signal_walk_complete(va, result)
    state = IDLE
```

Page Walk Cache (PWC):
Some processors include a Page Walk Cache that stores entries from intermediate levels of recent walks. This provides significant speedups when nearby virtual addresses share the same upper-level entries—for example, sequential accesses within the same 2MB or 1GB region reuse the cached PML4, PDPT, and PD entries (a conceptual sketch follows the example below).
Example: Accessing pages 0x1000, 0x2000, 0x3000 in sequence
Without PWC:
Page 0x1000: 4 table reads (PML4 → PDPT → PD → PT)
Page 0x2000: 4 table reads
Page 0x3000: 4 table reads
Total: 12 reads
With PWC (entries shared across these pages):
Page 0x1000: 4 table reads (populate cache)
Page 0x2000: 1 table read (PT only, others cached)
Page 0x3000: 1 table read (PT only, others cached)
Total: 6 reads (50% reduction)
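A page walk cache can be pictured as a tiny table keyed by the upper virtual-address bits. The sketch below is purely conceptual—the structure, names, and direct-mapped organization are assumptions, not any vendor's actual design. It remembers which page table (PT) covers a given 2MB region, so later walks in that region skip the PML4, PDPT, and PD reads.

```c
#include <stdint.h>
#include <stdbool.h>

#define PWC_ENTRIES 32

typedef struct {
    uint64_t va_tag;   // VA bits [47:21], identifying one 2MB region
    uint64_t pt_pa;    // Physical address of the PT covering that region
    bool     valid;
} PWCEntry;

static PWCEntry pwc[PWC_ENTRIES];

// Returns true and sets *pt_pa if the walk can start directly at the PT level.
bool pwc_lookup(uint64_t va, uint64_t *pt_pa) {
    uint64_t tag = va >> 21;                 // drop page offset and PT index
    PWCEntry *e = &pwc[tag % PWC_ENTRIES];
    if (e->valid && e->va_tag == tag) {
        *pt_pa = e->pt_pa;                   // skip the PML4, PDPT, and PD reads
        return true;
    }
    return false;
}

// Called after a full walk so later pages in the same 2MB region reuse it.
void pwc_fill(uint64_t va, uint64_t pt_pa) {
    uint64_t tag = va >> 21;
    PWCEntry *e = &pwc[tag % PWC_ENTRIES];
    e->va_tag = tag;
    e->pt_pa = pt_pa;
    e->valid = true;
}
```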
Most modern architectures (x86, ARM, RISC-V Sv39/48) use hardware page table walkers. Some older or simpler architectures use software TLB miss handlers, giving the OS full control but at higher latency cost.
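For contrast, here is a rough sketch of a software-managed TLB refill in the MIPS style. All of the helpers (read_bad_vaddr, sw_walk_page_table, tlb_write_random) are hypothetical stand-ins for architecture-specific registers and instructions; the point is only that the OS, not the hardware, performs the walk and inserts the entry.

```c
#include <stdint.h>

extern uint64_t read_bad_vaddr(void);              // faulting VA (assumed helper)
extern uint64_t sw_walk_page_table(uint64_t va);   // OS walks its own tables, 0 if unmapped
extern void     tlb_write_random(uint64_t va, uint64_t pte); // insert into a random TLB slot
extern void     handle_page_fault(uint64_t va);    // full fault path (allocate, swap in, ...)

// Invoked by the CPU on every TLB miss: higher latency than a hardware walker,
// but the OS is free to use any page table format it likes.
void tlb_refill_exception(void) {
    uint64_t va  = read_bad_vaddr();
    uint64_t pte = sw_walk_page_table(va);   // may itself take cache misses
    if (pte == 0) {
        handle_page_fault(va);                // not mapped: real page fault
        return;
    }
    tlb_write_random(va, pte);                // hardware then retries the access
}
```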
The Translation Lookaside Buffer (TLB) is the critical optimization that makes multi-level paging practical. Understanding how the TLB interacts with hierarchical page tables is essential for both OS developers and performance-conscious application programmers.
TLB organization:
| TLB Level | Entries (typical) | Associativity | Latency | Page Sizes |
|---|---|---|---|---|
| L1 ITLB (instruction) | 64-128 | 4-8 way | 1 cycle | 4KB, 2MB |
| L1 DTLB (data) | 64-128 | 4-8 way | 1 cycle | 4KB, 2MB |
| L2 STLB (unified) | 512-2048 | 8-12 way | 6-10 cycles | 4KB, 2MB, 1GB |
```c
// TLB Operation with Multi-Level Page Tables

#include <stdint.h>
#include <stdbool.h>

// Helpers assumed to exist elsewhere
int count_bits(uint64_t mask);       // population count
int page_shift(uint8_t page_size);   // 12 (4KB), 21 (2MB), or 30 (1GB)
int find_victim(uint64_t set);       // replacement policy (LRU or pseudo-random)
struct { uint64_t hits, misses; } tlb_stats;

// TLB Entry structure (conceptual)
typedef struct {
    uint64_t vpn;          // Virtual page number (tag)
    uint64_t ppn;          // Physical page number (translation)
    uint8_t  permissions;  // Combined permissions from all levels
    uint16_t asid;         // Address Space ID (PCID, 12 bits)
    bool     global;       // Global mapping (not flushed on context switch)
    uint8_t  page_size;    // 4KB, 2MB, or 1GB
    bool     valid;
} TLBEntry;

// TLB sets (simplified 4-way set-associative)
#define TLB_SETS 64
#define TLB_WAYS 4

TLBEntry tlb[TLB_SETS][TLB_WAYS];

// TLB lookup
bool tlb_lookup(uint64_t va, uint16_t asid, uint64_t* pa, uint8_t* perms) {
    // Calculate virtual page number and set index
    uint64_t vpn = va >> 12; // For 4KB pages
    uint64_t set = vpn % TLB_SETS;

    // Check all ways in the set
    for (int way = 0; way < TLB_WAYS; way++) {
        TLBEntry* e = &tlb[set][way];

        // Handle different page sizes: ignore the low VPN bits covered by the page
        uint64_t mask = (e->page_size == 2) ? 0x1FF :   // 2MB: mask 9 bits
                        (e->page_size == 3) ? 0x3FFFF : // 1GB: mask 18 bits
                        0;                              // 4KB: no mask

        // Compare VPN (respecting page size)
        if (e->valid &&
            ((e->vpn >> (mask ? count_bits(mask) : 0)) ==
             (vpn   >> (mask ? count_bits(mask) : 0))) &&
            (e->global || e->asid == asid)) {
            // TLB Hit!
            *pa = (e->ppn << 12) |
                  (va & ((1ULL << page_shift(e->page_size)) - 1));
            *perms = e->permissions;

            // Update access statistics
            tlb_stats.hits++;
            return true;
        }
    }

    // TLB Miss
    tlb_stats.misses++;
    return false;
}

// TLB update after page walk completes
void tlb_update(uint64_t va, uint64_t pa, uint8_t perms,
                uint8_t page_size, uint16_t asid, bool global) {
    uint64_t vpn = va >> 12;
    uint64_t set = vpn % TLB_SETS;

    // Find empty or victim entry
    int victim = find_victim(set);
    TLBEntry* e = &tlb[set][victim];

    e->vpn = vpn;
    e->ppn = pa >> 12;
    e->permissions = perms;
    e->asid = asid;
    e->global = global;
    e->page_size = page_size;
    e->valid = true;
}

// TLB invalidation (needed when page tables change)
void tlb_invalidate_page(uint64_t va) {
    // Invalidate specific page
    asm volatile("invlpg (%0)" :: "r"(va) : "memory");
}

void tlb_invalidate_asid(uint16_t asid) {
    // Invalidate all entries for an address space
    // On x86-64 with PCID: reload CR3 with the flush bit
    // Or iterate and invalidate matching entries
    (void)asid;
}

void tlb_flush_all(void) {
    // Reload CR3 to flush entire TLB (non-global entries)
    uint64_t cr3;
    asm volatile("mov %%cr3, %0" : "=r"(cr3));
    asm volatile("mov %0, %%cr3" :: "r"(cr3) : "memory");
}
```

Critical TLB concepts:
ASID/PCID: Address Space Identifiers allow TLB entries from multiple processes to coexist. Without PCID, context switches require full TLB flushes. With PCID, only ASID-specific invalidations are needed.
Global entries: Kernel mappings are typically marked global. These entries persist across context switches, avoiding redundant kernel TLB misses.
Large page TLB entries: A single TLB entry for a 2MB page covers 512× more address space than a 4KB entry. Large pages dramatically improve TLB coverage.
TLB shootdown: When page tables change, all CPUs that might have cached the old translation must invalidate their TLB entries. This requires inter-processor interrupts (IPIs), which are expensive.
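The shootdown pattern looks roughly like the sketch below. The primitives (write_pte, invlpg_local, smp_send_ipi, and the CPU-mask helper) are assumed OS-internal functions, not any particular kernel's API.

```c
#include <stdint.h>
#include <stdbool.h>

extern void write_pte(uint64_t va, uint64_t new_pte);  // update the page table in memory
extern void invlpg_local(uint64_t va);                  // invalidate this CPU's TLB entry
extern void smp_send_ipi(int cpu, void (*fn)(void*), void* arg);
extern bool cpu_may_have_mapping(int cpu);              // e.g. has run this address space
extern int  num_cpus(void);

static void shootdown_handler(void* arg) {
    invlpg_local((uint64_t)arg);          // each target CPU drops its stale entry
}

void change_mapping(uint64_t va, uint64_t new_pte) {
    write_pte(va, new_pte);               // 1. change the mapping
    invlpg_local(va);                     // 2. invalidate the local TLB
    for (int cpu = 0; cpu < num_cpus(); cpu++) {
        if (cpu_may_have_mapping(cpu))    // 3. IPI every CPU that might cache it
            smp_send_ipi(cpu, shootdown_handler, (void*)va);
    }
    // 4. A real kernel waits for all targets to acknowledge before reusing the frame.
}
```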
Spectre and Meltdown vulnerabilities led to significant TLB-related mitigations. KPTI (Kernel Page Table Isolation) requires separate page tables for user and kernel mode, causing more TLB flushes. PCID helps mitigate this cost by preserving some entries across switches.
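On x86-64 with PCIDs enabled (CR4.PCIDE = 1), the low 12 bits of CR3 carry the PCID, and setting bit 63 on a CR3 write asks the CPU not to flush that PCID's TLB entries. A minimal sketch of composing such a value (the constants and function are illustrative, not kernel code):

```c
#include <stdint.h>

#define CR3_PCID_MASK 0xFFFULL       // bits 11:0 hold the PCID
#define CR3_NOFLUSH   (1ULL << 63)   // don't invalidate this PCID's TLB entries

// Compose a CR3 value for a context switch that preserves TLB entries.
// pml4_pa must be 4KB-aligned; pcid identifies the target address space.
static inline uint64_t make_cr3(uint64_t pml4_pa, uint16_t pcid, int preserve_tlb) {
    uint64_t cr3 = (pml4_pa & ~CR3_PCID_MASK) | (pcid & CR3_PCID_MASK);
    if (preserve_tlb)
        cr3 |= CR3_NOFLUSH;
    return cr3;   // a kernel would load this with "mov %0, %%cr3"
}
```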
Address translation isn't just about finding physical addresses—it's also the enforcement point for memory protection. At each level of the page table walk, the hardware checks protection bits to ensure the access is permitted.
Permission bits checked:
| Bit | Name | Effect | Checked Against |
|---|---|---|---|
| P (bit 0) | Present | Page/table exists in memory | Always |
| R/W (bit 1) | Read/Write | 0=Read-only, 1=Read-Write | Write accesses |
| U/S (bit 2) | User/Supervisor | 0=Kernel only, 1=User accessible | CPL (privilege level) |
| NX (bit 63) | No Execute | 1=Cannot execute code from page | Instruction fetches |
```c
// Permission Checking During Multi-Level Walk
// Uses pte_t, the ENTRY_* flags, and ADDR_MASK from the translation example above.

#include <stdint.h>
#include <stdbool.h>

typedef enum {
    ACCESS_READ,
    ACCESS_WRITE,
    ACCESS_EXECUTE
} AccessType;

// Fault information reported to the OS (mirrors the x86 page fault error code)
typedef struct {
    bool present;    // fault on a present page (protection violation)?
    bool write;      // was the access a write?
    bool user;       // was the CPU in user mode?
    bool reserved;   // reserved-bit violation?
    bool ifetch;     // was the access an instruction fetch?
} PageFaultCode;

// Richer result type for this example (distinct from the integer codes above)
typedef struct {
    bool success;
    uint64_t physical_addr;
    uint64_t permissions;
    PageFaultCode fault_code;
    int fault_level;
} TranslationResult;

typedef pte_t* PageTable;   // one table = array of 512 entries

// Check permissions for one level's entry
bool check_entry_permissions(pte_t entry, AccessType access, int cpl) {
    // Present check
    if (!(entry & ENTRY_P)) {
        return false; // Will cause page fault
    }

    // User/Supervisor check
    bool is_user_mode = (cpl == 3);
    bool page_user_accessible = (entry & ENTRY_US) != 0;
    if (is_user_mode && !page_user_accessible) {
        return false; // User trying to access kernel page
    }

    // Write permission check
    if (access == ACCESS_WRITE) {
        bool page_writable = (entry & ENTRY_RW) != 0;
        if (!page_writable) {
            return false; // Write to read-only page
        }
    }

    // Execute permission check (when NX supported)
    if (access == ACCESS_EXECUTE) {
        bool no_execute = (entry & ENTRY_XD) != 0;
        if (no_execute) {
            return false; // Execute on NX page
        }
    }

    return true;
}

// Permission accumulation across levels
// The effective permission is the intersection (most restrictive)
uint64_t combine_permissions(pte_t* entries, int levels) {
    uint64_t rw_us = ENTRY_RW | ENTRY_US; // start with write and user allowed
    uint64_t nx = 0;                      // start with execute allowed

    for (int i = 0; i < levels; i++) {
        // R/W and U/S: use AND (must be set at ALL levels)
        rw_us &= entries[i] & (ENTRY_RW | ENTRY_US);

        // NX: use OR (if set at ANY level, cannot execute)
        nx |= entries[i] & ENTRY_XD;
    }

    return rw_us | nx;
}

// Example translation with full permission checking
TranslationResult translate_with_checks(
    PageTable tables[4],    // one table per level, already selected for this VA
    uint64_t indices[4],    // index into each level, extracted from the VA
    uint64_t offset,        // low 12 bits of the VA
    AccessType access,
    int cpl
) {
    pte_t entries[4];

    for (int level = 0; level < 4; level++) {
        entries[level] = tables[level][indices[level]];

        // Check permissions at this level
        if (!check_entry_permissions(entries[level], access, cpl)) {
            // Generate appropriate page fault
            PageFaultCode code;
            code.present = (entries[level] & ENTRY_P) != 0;
            code.write = (access == ACCESS_WRITE);
            code.user = (cpl == 3);
            code.reserved = false;
            code.ifetch = (access == ACCESS_EXECUTE);

            return (TranslationResult){
                .success = false,
                .fault_code = code,
                .fault_level = level
            };
        }
    }

    // All levels passed - calculate effective permissions
    uint64_t effective_perms = combine_permissions(entries, 4);

    return (TranslationResult){
        .success = true,
        .physical_addr = (entries[3] & ADDR_MASK) | offset,
        .permissions = effective_perms
    };
}
```

SMEP, SMAP, and PKE:
Modern processors add additional protection mechanisms:
SMEP (Supervisor Mode Execution Prevention): Kernel cannot execute code from user-mode pages, even if mapped as executable.
SMAP (Supervisor Mode Access Prevention): Kernel cannot read/write user-mode pages without explicit permission, preventing certain exploit techniques.
PKE (Protection Keys for Userspace): Pages can be tagged with 4-bit keys. Access permissions per key can be changed with a single register write, enabling fast permission changes without modifying page tables.
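As an illustration of PKE from user space, Linux exposes protection keys through pkey_alloc, pkey_mprotect, and pkey_set (glibc 2.27+, x86 CPUs with PKU); the sketch below omits error handling and assumes that hardware and kernel support is present.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>
#include <stdio.h>

int main(void) {
    size_t len = 4096;
    char* buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    int pkey = pkey_alloc(0, 0);                            // allocate a protection key
    pkey_mprotect(buf, len, PROT_READ | PROT_WRITE, pkey);  // tag the page with it

    pkey_set(pkey, PKEY_DISABLE_WRITE); // revoke write with one register write (WRPKRU):
                                        // no page table change, no TLB flush
    // buf[0] = 1;                      // would now fault with SIGSEGV

    pkey_set(pkey, 0);                  // restore full access
    buf[0] = 1;                         // fine again

    pkey_free(pkey);
    printf("protection keys demo done\n");
    return 0;
}
```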
The permission checking at translation time is a critical security boundary. Combined with features like SMEP, SMAP, and NX, the page table walk enforces the memory isolation that keeps processes and the kernel separated—the foundation of system security.
Understanding translation performance is crucial for building high-performance systems. Let's analyze the costs at each stage of the translation pipeline.
| Component | Latency | Notes |
|---|---|---|
| TLB hit (L1) | 1 cycle (~0.3ns) | Parallel with cache access |
| TLB miss + L2 TLB hit | 6-10 cycles (~3ns) | Additional lookup stage |
| Page walk (L1 cache hits) | ~20-30 cycles | Best case: all tables cached |
| Page walk (L2 cache hits) | ~50-100 cycles | Tables in L2 cache |
| Page walk (L3 cache hits) | ~100-200 cycles | Tables in L3 cache |
| Page walk (memory) | ~400+ cycles (~100ns+) | Tables must be fetched from RAM |
```c
// Effective Memory Access Time (EMAT) Calculation

#include <stdio.h>

typedef struct {
    double tlb_hit_rate;
    double tlb_hit_time;
    double pwc_hit_rate;        // Page walk cache hit rate
    int    page_table_levels;
    double memory_access_time;
    double cache_hit_rate;
    double cache_hit_time;
} TranslationParams;

double calculate_emat(TranslationParams* p) {
    // TLB miss rate
    double tlb_miss_rate = 1.0 - p->tlb_hit_rate;

    // Average page walk time (considering caching of tables)
    double walk_time = 0;
    for (int level = 0; level < p->page_table_levels; level++) {
        // Each level access might hit in cache or go to memory
        double level_time = p->cache_hit_rate * p->cache_hit_time +
                            (1 - p->cache_hit_rate) * p->memory_access_time;

        // PWC might avoid some of the upper levels
        if (level < p->page_table_levels - 1) {
            level_time *= (1 - p->pwc_hit_rate);
        }

        walk_time += level_time;
    }

    // Translation time
    double trans_time = p->tlb_hit_rate * p->tlb_hit_time +
                        tlb_miss_rate * walk_time;

    // Total: translation + actual memory access
    double emat = trans_time + p->memory_access_time;
    return emat;
}

// Example: Modern x86-64 system
void example_analysis(void) {
    TranslationParams modern = {
        .tlb_hit_rate = 0.995,       // 99.5% TLB hit rate
        .tlb_hit_time = 1.0,         // 1 cycle
        .pwc_hit_rate = 0.90,        // 90% page walk cache hit rate
        .page_table_levels = 4,
        .memory_access_time = 100.0, // 100 cycles to memory
        .cache_hit_rate = 0.85,      // 85% cache hit rate for tables
        .cache_hit_time = 12.0       // L3 hit
    };

    double emat = calculate_emat(&modern);
    printf("EMAT: %.2f cycles\n", emat); // ~101 cycles with these parameters

    // Compare with perfect translation (no overhead)
    double perfect = modern.memory_access_time;
    printf("Overhead: %.1f%%\n", (emat - perfect) / perfect * 100);
}

// Output:
// EMAT: ~101 cycles
// Overhead: ~1%
//
// A high TLB hit rate keeps translation overhead very low!
```

Factors affecting translation performance:
Working set size: Applications touching more pages have lower TLB hit rates. Large working sets overwhelm the TLB.
Access patterns: Sequential access provides excellent TLB performance. Random access causes frequent TLB misses.
Page size: Using 2MB huge pages provides 512× better TLB coverage than 4KB pages for contiguous mappings.
Context switches: Each switch may flush TLB entries (without PCID) or consume TLB capacity (with PCID).
Multi-threaded scaling: when shared page tables change, every core that may have cached the mapping must participate in the TLB shootdown, so the cost grows with core count.
Optimization strategies: use huge pages for large, contiguous working sets (one way to request them is sketched below); improve access locality so the working set stays within TLB reach; and batch page table updates to minimize TLB shootdowns.
Use CPU performance counters to measure TLB behavior: `perf stat -e dTLB-load-misses,dTLB-store-misses,iTLB-load-misses`. High miss rates indicate opportunity for huge pages or working set reduction.
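One concrete way to act on this: on Linux, a large contiguous allocation can be backed by 2MB pages either explicitly (MAP_HUGETLB, which requires huge pages reserved by the administrator) or via a transparent-huge-page hint (MADV_HUGEPAGE). A hedged sketch, with error handling reduced to a fallback path:

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

void* alloc_hugepage_backed(size_t len) {
    // Explicit hugetlbfs-backed mapping (fails if no huge pages are reserved)
    void* p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        return p;

    // Fallback: normal mapping plus a transparent-huge-page hint
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p != MAP_FAILED)
        madvise(p, len, MADV_HUGEPAGE);
    return p;
}
```

A single 2MB entry covers 512× the address range of a 4KB entry, so this directly increases TLB reach for large arrays, heaps, and buffer pools.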
Let's examine how translation works on two major architectures: x86-64 and ARM64.
x86-64 Translation (Intel/AMD):
Virtual Address Format (48-bit canonical):
┌────────┬─────────┬───────┬────────┬─────────┬──────────┐
│Sign ext│ PML4 │ PDPT │ PD │ PT │ Offset │
│16 bits │ 9 bits │9 bits │ 9 bits │ 9 bits │ 12 bits │
│ unused │ [47:39] │[38:30]│ [29:21]│ [20:12] │ [11:0] │
└────────┴─────────┴───────┴────────┴─────────┴──────────┘
Key Features:
- CR3 holds PML4 physical address
- 9-9-9-9-12 bit split
- 512 entries per table (4KB tables)
- Large pages: 2MB (PD level), 1GB (PDPT level)
- NX bit (bit 63) for execute disable
- PCID support (12-bit ASIDs in CR3)
- 5-level paging (PML5) for 57-bit VA
ARM64 Translation (AArch64):
With a 4KB granule and 48-bit virtual addresses, ARM64 performs a four-level walk much like x86-64, using 9-bit indices per level and two translation table base registers (TTBR0/TTBR1) as roots.
ARM's contiguous bit optimization:
ARM64 includes a "contiguous" hint bit. When set, it indicates that 16 consecutive entries share the same attributes. This allows the TLB to cache a single entry covering 16 pages (64KB), effectively creating intermediate-sized pages without needing huge page alignment.
Split address spaces:
Both architectures support split address spaces—different page table roots for user and kernel: x86-64 has a single root (CR3), so isolation schemes such as KPTI must swap page tables on every kernel entry and exit, while ARM64 dedicates TTBR0 to the lower (user) half of the address space and TTBR1 to the upper (kernel) half.
The ARM approach avoids the overhead of switching page tables on kernel entry, providing a natural security boundary.
RISC-V defines Sv39 (3-level, 39-bit VA) and Sv48 (4-level, 48-bit VA) paging modes. Like x86-64, it uses 9 bits per level with 4KB pages and support for megapages (2MB) and gigapages (1GB). The simplicity of RISC-V's Privileged Architecture makes it excellent for learning.
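For reference, a small sketch of Sv39's field layout and PTE flags as defined by the RISC-V privileged specification (the macro names themselves are ours):

```c
#include <stdint.h>

// Sv39 virtual address decomposition: three 9-bit VPN fields + 12-bit offset.
// The root table's physical page number lives in the satp CSR.
#define SV39_OFFSET(va) ((va) & 0xFFFULL)        // bits 11:0
#define SV39_VPN0(va)   (((va) >> 12) & 0x1FF)   // bits 20:12, indexes the leaf table
#define SV39_VPN1(va)   (((va) >> 21) & 0x1FF)   // bits 29:21
#define SV39_VPN2(va)   (((va) >> 30) & 0x1FF)   // bits 38:30, indexes the root table

// PTE flag bits
#define PTE_V (1u << 0)   // Valid
#define PTE_R (1u << 1)   // Readable
#define PTE_W (1u << 2)   // Writable
#define PTE_X (1u << 3)   // Executable
#define PTE_U (1u << 4)   // User accessible
#define PTE_G (1u << 5)   // Global
#define PTE_A (1u << 6)   // Accessed
#define PTE_D (1u << 7)   // Dirty

// A non-leaf PTE has R = W = X = 0; a leaf found at level 2 or 1 maps a
// 1GB gigapage or 2MB megapage, mirroring x86-64's large pages.
static inline int pte_is_leaf(uint64_t pte) {
    return (pte & (PTE_R | PTE_W | PTE_X)) != 0;
}
```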
Address translation is the core mechanism that connects virtual memory abstractions to physical reality. Multi-level translation requires careful engineering but enables the memory isolation essential to modern systems.
What's next:
We've thoroughly examined multi-level paging from structure to translation. In the final page of this module, we'll analyze the fundamental space-time trade-offs inherent in multi-level paging—understanding when to add more levels, when to use large pages, and how to optimize for different workload characteristics.
You now understand multi-level address translation in depth: the complete algorithm, hardware page table walkers, TLB interaction, permission checking, performance characteristics, and real-world architecture examples. Next, we'll explore the space-time trade-offs that shape page table design decisions.