Every memory access in a modern system must pass through a critical checkpoint: address translation. The CPU core issues virtual addresses, but DRAM responds only to physical addresses. This translation must happen for every load, every store, every instruction fetch—potentially billions of times per second.
If the CPU performed a full page table walk for every memory access, modern systems would grind to a halt. A four-level page table walk requires four sequential memory reads before the actual data can be fetched. At ~100ns per memory access, this would add roughly 400ns of overhead to every operation, orders of magnitude more than the few nanoseconds a cached access normally takes.
The solution is the Translation Lookaside Buffer (TLB)—a specialized cache that stores recently used virtual-to-physical address translations. TLB hits provide translation in 1-2 CPU cycles; TLB misses can cost hundreds of cycles. This makes TLB efficiency one of the most critical factors in system performance, and it's precisely where huge pages deliver their most dramatic benefits.
By the end of this page, you will understand how the TLB works at the hardware level, master the mathematics of TLB coverage, analyze the true cost of TLB misses, and see exactly how huge pages multiply TLB effectiveness by orders of magnitude.
The TLB is a content-addressable memory (CAM) located within the Memory Management Unit (MMU) of the CPU. Unlike normal caches that are indexed by memory address, the TLB is indexed by virtual page number and returns physical frame numbers.
TLB Structure:

Each TLB entry contains a tag and a translation: the virtual page number (together with an address-space identifier such as x86's PCID), the corresponding physical frame number, and the permission bits copied from the page table entry (present, writable, user/supervisor, no-execute, accessed, dirty).

When the CPU needs to access memory, it extracts the virtual page number from the address and presents it to the TLB. On a hit, the physical frame number comes back in a cycle or two and the access proceeds immediately. On a miss, the translation must first be fetched from the in-memory page tables.
| TLB Level | Entries (4KB) | Entries (2MB) | Entries (1GB) | Associativity | Access Latency |
|---|---|---|---|---|---|
| L1 ITLB (Instructions) | 128 | 8 | — | 8-way | 1 cycle |
| L1 DTLB (Data) | 64 | 32 | 4 | 4-way | 1 cycle |
| L2 STLB (Unified) | 1536-4096 | 1536 | 16 | 8-12 way | 7-8 cycles |
Critical observation:
Notice that modern CPUs have separate TLB entries for different page sizes. The 4KB TLB might have 1536 entries, and the 2MB TLB also has 1536 entries—but each 2MB entry covers 512× more memory. This is the architectural foundation that makes huge pages so powerful: you get similar TLB capacity but vastly greater coverage.
Modern x86 processors include a hardware page table walker that automatically handles TLB misses without trapping to the OS. This walker reads page table entries from memory and installs them into the TLB. While faster than software handling, it still requires multiple memory accesses—making TLB misses expensive even with hardware assistance.
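As a rough illustration of what the walker does, the sketch below decodes a 48-bit x86-64 virtual address into the four table indices and the page offset used by a four-level walk. The bit positions are architectural; the program itself is only for demonstration.

```c
#include <stdio.h>
#include <stdint.h>

/* Decompose a 48-bit x86-64 virtual address into the indices the
 * hardware page walker uses at each level, plus the 4 KB page offset. */
static void decode_va(uint64_t va) {
    unsigned pml4_idx = (va >> 39) & 0x1FF;  /* bits 47:39 -> PML4 entry  */
    unsigned pdpt_idx = (va >> 30) & 0x1FF;  /* bits 38:30 -> PDPT entry  */
    unsigned pd_idx   = (va >> 21) & 0x1FF;  /* bits 29:21 -> PD entry    */
    unsigned pt_idx   = (va >> 12) & 0x1FF;  /* bits 20:12 -> PT entry    */
    unsigned offset   = va & 0xFFF;          /* bits 11:0  -> byte offset */

    printf("VA 0x%012lx -> PML4[%u] PDPT[%u] PD[%u] PT[%u] + 0x%03x\n",
           (unsigned long)va, pml4_idx, pdpt_idx, pd_idx, pt_idx, offset);
    /* With a 2 MB page the walk stops at the PD entry (bits 20:0 become the
     * offset); with a 1 GB page it stops at the PDPT entry (bits 29:0). */
}

int main(void) {
    decode_va(0x00007f1234567abcULL);
    return 0;
}
```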
TLB coverage is the total amount of memory that can be translated without a TLB miss at any given moment. This single metric captures the essence of TLB efficiency:
TLB Coverage = Number of TLB Entries × Page Size
Let's calculate coverage for a typical Intel Skylake-class processor with a unified L2 TLB of 1536 entries:

4KB Pages: 1536 entries × 4KB = 6MB coverage
2MB Pages: 1536 entries × 2MB = 3GB coverage

The 512× amplification:
With the same number of TLB entries, 2MB pages provide 512× more coverage than 4KB pages. This isn't just "512× better"—it often means the difference between fitting in TLB and not fitting at all. Memory access patterns that cause constant TLB misses with 4KB pages may experience nearly 100% TLB hits with 2MB pages.
For 1GB pages, the effect is even more dramatic:
1GB Pages: 16 entries × 1GB = 16GB coverage
Even with the limited 1GB TLB (typically 4-16 entries), you can cover more memory than most application working sets.
The script below tabulates this coverage for each page size and estimates miss rates for representative workloads:

```python
#!/usr/bin/env python3
"""TLB Coverage Analysis Tool

Calculates TLB coverage for different page sizes and demonstrates
the dramatic impact of huge pages on effective memory reach.
"""

from dataclasses import dataclass
from typing import List


@dataclass
class TLBConfig:
    """TLB configuration for a specific page size"""
    page_size: int
    page_name: str
    l1_dtlb_entries: int
    l1_itlb_entries: int
    l2_stlb_entries: int


@dataclass
class WorkloadProfile:
    """Memory workload characteristics"""
    name: str
    working_set_mb: float
    access_pattern: str  # 'sequential', 'random', 'strided'


def format_bytes(bytes_val: float) -> str:
    """Format bytes as human-readable string"""
    for unit in ['B', 'KB', 'MB', 'GB', 'TB']:
        if bytes_val < 1024:
            return f"{bytes_val:.1f} {unit}"
        bytes_val /= 1024
    return f"{bytes_val:.1f} PB"


def calculate_tlb_coverage(tlb_entries: int, page_size: int) -> int:
    """Calculate total memory coverage for TLB"""
    return tlb_entries * page_size


def estimate_tlb_miss_rate(working_set: int, tlb_coverage: int,
                           access_pattern: str) -> float:
    """
    Estimate TLB miss rate based on working set and coverage.

    This is a simplified model. Real miss rates depend on:
    - Access pattern (temporal and spatial locality)
    - TLB replacement policy (usually pseudo-LRU)
    - Code vs data accesses
    - Multi-level TLB behavior
    """
    if working_set <= tlb_coverage:
        # Working set fits - primarily compulsory misses
        if access_pattern == 'sequential':
            return 0.01   # 1% - excellent locality
        elif access_pattern == 'random':
            return 0.05   # 5% - occasional conflicts
        else:  # strided
            return 0.02   # 2%
    else:
        # Working set exceeds coverage - capacity misses dominate
        overflow_ratio = working_set / tlb_coverage
        if access_pattern == 'sequential':
            # Sequential: one miss per page, good reuse within page
            return min(0.5, 0.1 * overflow_ratio)
        elif access_pattern == 'random':
            # Random: high miss rate, poor reuse
            return min(0.95, 0.3 + 0.4 * (overflow_ratio - 1) / 10)
        else:  # strided
            return min(0.7, 0.15 * overflow_ratio)


# Intel Skylake/Ice Lake class TLB configuration
TLB_CONFIGS = [
    TLBConfig(4 * 1024, "4KB",
              l1_dtlb_entries=64, l1_itlb_entries=128, l2_stlb_entries=1536),
    TLBConfig(2 * 1024 * 1024, "2MB",
              l1_dtlb_entries=32, l1_itlb_entries=8, l2_stlb_entries=1536),
    TLBConfig(1024 * 1024 * 1024, "1GB",
              l1_dtlb_entries=4, l1_itlb_entries=0, l2_stlb_entries=16),
]

# Representative workloads
WORKLOADS = [
    WorkloadProfile("Small application", 16, "mixed"),
    WorkloadProfile("Web server", 256, "random"),
    WorkloadProfile("In-memory cache", 2048, "random"),
    WorkloadProfile("Database buffer pool", 8192, "random"),
    WorkloadProfile("Big data analytics", 32768, "strided"),
    WorkloadProfile("In-memory database", 131072, "random"),  # 128GB
]

print("=" * 90)
print("TLB COVERAGE ANALYSIS - MODERN x86 PROCESSOR")
print("=" * 90)

# Calculate and display coverage for each page size
print("\n" + "─" * 90)
print("TLB COVERAGE BY PAGE SIZE")
print("─" * 90)
print(f"{'Page Size':<12} {'L1 DTLB':<20} {'L2 STLB':<20} {'Total Coverage':<20}")
print("─" * 90)

for config in TLB_CONFIGS:
    l1_coverage = calculate_tlb_coverage(config.l1_dtlb_entries, config.page_size)
    l2_coverage = calculate_tlb_coverage(config.l2_stlb_entries, config.page_size)
    total = l1_coverage + l2_coverage  # Simplified - assumes minimal overlap
    print(f"{config.page_name:<12} "
          f"{config.l1_dtlb_entries:>4} entries = {format_bytes(l1_coverage):<8} "
          f"{config.l2_stlb_entries:>4} entries = {format_bytes(l2_coverage):<8} "
          f"≈ {format_bytes(total)}")

print("\n" + "─" * 90)
print("ESTIMATED TLB MISS RATES BY WORKLOAD")
print("─" * 90)
print(f"{'Workload':<25} {'Working Set':<12} {'4KB Pages':<15} {'2MB Pages':<15} {'1GB Pages':<15}")
print("─" * 90)

for workload in WORKLOADS:
    ws_bytes = workload.working_set_mb * 1024 * 1024
    row = f"{workload.name:<25} {format_bytes(ws_bytes):<12} "
    for config in TLB_CONFIGS:
        l2_coverage = calculate_tlb_coverage(config.l2_stlb_entries, config.page_size)
        miss_rate = estimate_tlb_miss_rate(ws_bytes, l2_coverage, workload.access_pattern)
        row += f"{miss_rate*100:>6.2f}% miss   "
    print(row)

print("\n" + "=" * 90)
print("KEY INSIGHT: 2MB pages reduce miss rates by 10-100x for large workloads")
print("=" * 90)
```

A TLB miss is deceptively expensive. While the hardware page table walker handles misses automatically, each miss triggers a cascade of memory operations that stall the CPU pipeline.
Anatomy of a TLB miss on x86-64:
| Operation | Memory Accesses | Latency (cycles) | Notes |
|---|---|---|---|
| Read PML4 entry | 1 | ~5-50 | Few entries, heavy reuse; usually hits the paging-structure or data caches |
| Read PDPT entry | 1 | ~5-50 | Frequently cached |
| Read PD entry | 1 | ~20-150 | Sometimes cached |
| Read PT entry | 1 | ~100-200 | Millions of entries, poor reuse; often a DRAM access |
| Install TLB entry | 0 | ~10 | TLB fill overhead |
| Total (worst case) | 4 | ~600-700 | All four levels miss the caches: ~150-175ns |
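To put the per-miss cost in context, here is a back-of-the-envelope model (a sketch, not a measurement) using the 1-cycle hit and ~600-cycle worst-case walk figures from the tables above:

```c
#include <stdio.h>

/* Average translation cost = hit_cost + miss_rate * walk_cost
 * (the same structure as the classic average-memory-access-time formula). */
int main(void) {
    const double hit_cycles  = 1.0;    /* L1 TLB hit, from the table above */
    const double walk_cycles = 600.0;  /* worst-case 4-level walk          */
    const double miss_rates[] = {0.0001, 0.001, 0.01, 0.05};

    for (int i = 0; i < 4; i++) {
        double avg = hit_cycles + miss_rates[i] * walk_cycles;
        printf("miss rate %6.2f%%  ->  average translation cost %6.2f cycles\n",
               miss_rates[i] * 100.0, avg);
    }
    /* Even a 1% miss rate makes translation ~7 cycles on average --
     * several times the cost of a perfectly-hitting TLB. */
    return 0;
}
```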
Why the walk is slow:

Each level's lookup depends on the physical address read at the previous level, so the four memory accesses are strictly sequential and cannot be overlapped or prefetched. The walker's loads also compete with application data for cache capacity and memory bandwidth, and the pipeline stalls on the dependent access until the translation arrives.
The huge page advantage:
With 2MB pages, the walk stops one level earlier:
| Page Size | Levels Walked | Memory Accesses | Typical Latency | Under Virtualization |
|---|---|---|---|---|
| 4 KB | 4 (PML4→PT) | 4 | ~600 cycles | Up to 24 accesses |
| 2 MB | 3 (PML4→PD) | 3 | ~450 cycles | Up to 15 accesses |
| 1 GB | 2 (PML4→PDPT) | 2 | ~300 cycles | Up to 8 accesses |
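The virtualization column follows the standard 2D-walk accounting: every guest level, plus the guest-physical address of the data itself, must be translated through the host's tables. A minimal sketch of that arithmetic, assuming the common (G + 1) × (H + 1) - 1 formula with the same page size used at both guest and host level:

```c
#include <stdio.h>

/* Memory accesses for a nested ("2D") page walk:
 * every guest-walk level, plus the guest-physical data address itself,
 * must be translated through the host tables.
 * accesses = (guest_levels + 1) * (host_levels + 1) - 1 */
static int nested_walk_accesses(int guest_levels, int host_levels) {
    return (guest_levels + 1) * (host_levels + 1) - 1;
}

int main(void) {
    /* Page size -> levels walked: 4 KB = 4, 2 MB = 3, 1 GB = 2 */
    const char *names[]  = {"4 KB", "2 MB", "1 GB"};
    const int   levels[] = {4, 3, 2};

    printf("%-6s %-14s %s\n", "Page", "Native walk", "Nested walk (same page size on host)");
    for (int i = 0; i < 3; i++) {
        printf("%-6s %-14d %d\n",
               names[i], levels[i], nested_walk_accesses(levels[i], levels[i]));
    }
    return 0;   /* prints 24, 15 and 8 -- the figures in the table above */
}
```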
The benchmark below makes the cost visible in practice: a pointer-chasing loop whose working set grows until it overflows TLB coverage, run with both 4KB and 2MB pages.

```c
/*
 * Benchmark demonstrating TLB miss impact
 *
 * This program measures memory access latency with varying working set sizes
 * to show the inflection points where TLB misses start dominating.
 *
 * Compile: gcc -O2 -o tlb_miss tlb_miss_impact.c -lrt
 * Run:     ./tlb_miss   (requires root for huge pages)
 */

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <time.h>
#include <sys/mman.h>
#include <errno.h>

#define ITERATIONS      100000000
#define CACHE_LINE_SIZE 64

// Get current time in nanoseconds
static inline uint64_t get_time_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

// Pointer-chasing benchmark - dependent loads serialize each access
// Returns average access latency in nanoseconds
double measure_access_latency(void *region, size_t size, int iterations) {
    size_t num_elements = size / sizeof(void*);
    void **pointers = (void**)region;

    // Create a page-strided chase pattern (each hop lands one 4 KB page ahead)
    for (size_t i = 0; i < num_elements; i++) {
        pointers[i] = &pointers[(i + 4096/sizeof(void*)) % num_elements];
    }

    // Warm up
    void **p = &pointers[0];
    for (int i = 0; i < 10000; i++) {
        p = (void**)*p;
    }

    // Measure
    uint64_t start = get_time_ns();
    p = &pointers[0];
    for (int i = 0; i < iterations; i++) {
        p = (void**)*p;   // Chase the pointer
    }
    uint64_t end = get_time_ns();

    // Prevent optimization
    volatile void *sink = p;
    (void)sink;

    return (double)(end - start) / iterations;
}

void* allocate_region(size_t size, int use_huge_pages) {
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    if (use_huge_pages) {
        flags |= MAP_HUGETLB;
    }

    void *region = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (region == MAP_FAILED) {
        if (use_huge_pages) {
            fprintf(stderr, "Huge page allocation failed (errno=%d). "
                            "Try: echo 1024 > /proc/sys/vm/nr_hugepages\n", errno);
        }
        return NULL;
    }

    // Touch all pages to ensure allocation
    memset(region, 0, size);
    return region;
}

int main() {
    printf("TLB Miss Impact Benchmark\n");
    printf("═══════════════════════════════════════════════════════════════\n\n");

    // Test different working set sizes (in MB)
    size_t sizes_mb[] = {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024};
    int num_sizes = sizeof(sizes_mb) / sizeof(sizes_mb[0]);

    printf("Working Set    4KB Pages      2MB Huge Pages    Speedup\n");
    printf("─────────────────────────────────────────────────────────────\n");

    for (int i = 0; i < num_sizes; i++) {
        size_t size = sizes_mb[i] * 1024 * 1024;

        // Skip if system doesn't have enough memory
        void *region_4kb = allocate_region(size, 0);
        if (!region_4kb) continue;

        void *region_huge = allocate_region(size, 1);

        double latency_4kb = measure_access_latency(region_4kb, size,
                                                    ITERATIONS / (i + 1));
        double latency_huge = -1;
        if (region_huge) {
            latency_huge = measure_access_latency(region_huge, size,
                                                  ITERATIONS / (i + 1));
            munmap(region_huge, size);
        }

        printf("%6zu MB     %6.2f ns", sizes_mb[i], latency_4kb);
        if (latency_huge > 0) {
            double speedup = latency_4kb / latency_huge;
            printf("       %6.2f ns        %.2fx\n", latency_huge, speedup);
        } else {
            printf("          N/A             N/A\n");
        }

        munmap(region_4kb, size);
    }

    printf("\n═══════════════════════════════════════════════════════════════\n");
    printf("Note: Speedup increases as working set exceeds TLB coverage (~6MB for 4KB).\n");
    printf("Beyond TLB coverage, huge pages provide 1.5-3x performance improvement.\n");

    return 0;
}

/*
 * Expected output pattern (actual numbers vary by system):
 *
 * Working Set    4KB Pages      2MB Huge Pages    Speedup
 * ─────────────────────────────────────────────────────────────
 *      1 MB       4.50 ns         4.45 ns        1.01x
 *      2 MB       4.55 ns         4.48 ns        1.02x
 *      4 MB       4.80 ns         4.52 ns        1.06x
 *      8 MB      12.30 ns         4.65 ns        2.65x  ← TLB overflow starts
 *     16 MB      28.50 ns         5.20 ns        5.48x
 *     32 MB      45.80 ns         8.40 ns        5.45x
 *     64 MB      62.40 ns        15.30 ns        4.08x
 *    128 MB      78.90 ns        24.60 ns        3.21x
 *    256 MB      95.40 ns        38.70 ns        2.47x
 *    512 MB     112.60 ns        52.80 ns        2.13x
 *   1024 MB     128.30 ns        68.40 ns        1.88x
 */
```

Under virtualization (KVM, VMware, Hyper-V), TLB miss cost is dramatically amplified. The hypervisor maintains its own page tables, causing 'nested' or '2D' page walks. A 4KB page miss may require 24 memory accesses (4 guest × 4 host + nested combinations). Huge pages at both guest and host level are critical for virtualized workloads.
Modern TLB architecture is remarkably sophisticated, evolved over decades to maximize hit rates while minimizing latency. Understanding this architecture reveals why huge pages are so effective.
Split vs Unified TLBs:
Most processors use a split L1 TLB (separate for instructions and data) and a unified L2 TLB (shared). This mirrors the instruction/data cache split:
| TLB Type | 4KB Entries | 2MB Entries | 1GB Entries | Associativity |
|---|---|---|---|---|
| L1 ITLB | 128 | 8 | — | 8-way |
| L1 DTLB | 64 | 32 | 4 | 4-way |
| L2 STLB | 2048 | 2048 | 16 | 16-way |
Associativity and Conflicts:
TLBs use set-associative caching. With 4-way associativity, each VPN can only reside in one of four possible locations. If a process accesses more than 4 pages that map to the same set, conflict misses occur even with available capacity.
Huge pages dramatically reduce conflict probability: covering the same working set requires 512× fewer entries, so far fewer virtual page numbers compete for slots in any given set (see the sketch below).
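To see why, the sketch below assumes a simple VPN-mod-sets index (real TLB indexing is microarchitecture-specific) and counts how many pages of a 512MB working set land in each set of the L2 STLB:

```c
#include <stdio.h>
#include <stdint.h>

/* Assume a simple VPN-mod-sets index (real hardware may hash differently). */
static void tlb_pressure(const char *name, uint64_t page_size,
                         int entries, int ways, uint64_t working_set) {
    int sets = entries / ways;
    uint64_t pages_touched = (working_set + page_size - 1) / page_size;
    double avg_per_set = (double)pages_touched / sets;

    printf("%-4s pages: %8llu pages touched, %3d sets x %2d ways, "
           "avg %.1f pages per set%s\n",
           name, (unsigned long long)pages_touched, sets, ways, avg_per_set,
           avg_per_set > ways ? "  <-- conflict misses likely" : "");
}

int main(void) {
    uint64_t working_set = 512ULL << 20;   /* 512 MB working set */
    /* L2 STLB figures from the table above */
    tlb_pressure("4KB", 4096,            2048, 16, working_set);
    tlb_pressure("2MB", 2 * 1024 * 1024, 2048, 16, working_set);
    return 0;
}
```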
PCID (Process Context ID):
Modern x86 processors support PCID—a 12-bit tag attached to TLB entries that identifies which address space they belong to. This allows TLB entries to survive context switches, dramatically improving performance for systems running many processes.
```c
/*
 * Conceptual representation of TLB entry structure
 * Actual hardware implementation varies by architecture
 */

#include <stdint.h>
#include <stdbool.h>

/* TLB Entry for 4KB pages */
typedef struct {
    /* Tag portion - identifies the virtual page */
    uint64_t vpn       : 52;   // Virtual Page Number (bits 63:12 of VA)
    uint16_t asid      : 12;   // Address Space ID (PCID on x86)

    /* Data portion - translation and permissions */
    uint64_t pfn       : 40;   // Physical Frame Number (bits 51:12 of PA)

    /* Permission flags (from PTE) */
    bool present       : 1;    // Page present in memory
    bool writable      : 1;    // Write permission
    bool user          : 1;    // User-mode accessible
    bool global        : 1;    // Global page (skip ASID check)
    bool accessed      : 1;    // Page has been accessed
    bool dirty         : 1;    // Page has been written
    bool page_size     : 1;    // PS bit - is this a huge page entry?
    bool no_execute    : 1;    // Execute permission (inverted)

    /* TLB metadata */
    bool valid         : 1;    // Entry is valid
    uint8_t mesi_state : 2;    // Coherence state (for cache-coherent TLB)

    /* LRU tracking for replacement (architectural, not in actual entry) */
    uint8_t lru_counter : 4;   // Pseudo-LRU for replacement
} TLB_Entry_4KB;

/* TLB Entry for 2MB huge pages */
typedef struct {
    /* Tag portion - fewer bits needed for VPN */
    uint64_t vpn  : 43;   // Virtual Page Number (bits 63:21 of VA)
    uint16_t asid : 12;   // Address Space ID

    /* Data portion */
    uint64_t pfn  : 31;   // Physical Frame Number (bits 51:21 of PA)

    /* Same permission flags as 4KB */
    bool present    : 1;
    bool writable   : 1;
    bool user       : 1;
    bool global     : 1;
    bool accessed   : 1;
    bool dirty      : 1;
    bool page_size  : 1;  // Always 1 for 2MB entries
    bool no_execute : 1;

    bool valid      : 1;
    uint8_t lru_counter : 4;
} TLB_Entry_2MB;

/* TLB Entry for 1GB giant pages */
typedef struct {
    /* Tag portion - even fewer bits */
    uint64_t vpn  : 34;   // Virtual Page Number (bits 63:30 of VA)
    uint16_t asid : 12;

    /* Data portion */
    uint64_t pfn  : 22;   // Physical Frame Number (bits 51:30 of PA)

    /* Permission flags */
    bool present    : 1;
    bool writable   : 1;
    bool user       : 1;
    bool global     : 1;
    bool accessed   : 1;
    bool dirty      : 1;
    bool page_size  : 1;  // Always 1 for 1GB entries
    bool no_execute : 1;

    bool valid      : 1;
    uint8_t lru_counter : 4;
} TLB_Entry_1GB;

/*
 * TLB Lookup simulation showing the parallel search nature
 */
typedef struct {
    TLB_Entry_4KB entries_4kb[64];   // L1 DTLB 4KB entries
    TLB_Entry_2MB entries_2mb[32];   // L1 DTLB 2MB entries
    TLB_Entry_1GB entries_1gb[4];    // L1 DTLB 1GB entries
    int associativity;
} L1_DTLB;

/*
 * Simulated TLB lookup - in real hardware, all comparisons happen in parallel
 */
bool tlb_lookup(L1_DTLB *tlb, uint64_t virtual_addr, uint64_t asid,
                uint64_t *physical_addr, bool *is_huge) {
    // Extract VPNs for each page size
    uint64_t vpn_4kb = virtual_addr >> 12;
    uint64_t vpn_2mb = virtual_addr >> 21;
    uint64_t vpn_1gb = virtual_addr >> 30;

    // Check 1GB entries first (highest priority)
    for (int i = 0; i < 4; i++) {
        if (tlb->entries_1gb[i].valid &&
            tlb->entries_1gb[i].vpn == vpn_1gb &&
            (tlb->entries_1gb[i].global || tlb->entries_1gb[i].asid == asid)) {
            *physical_addr = ((uint64_t)tlb->entries_1gb[i].pfn << 30) |
                             (virtual_addr & 0x3FFFFFFF);   // 30-bit offset
            *is_huge = true;
            return true;   // TLB Hit!
        }
    }

    // Check 2MB entries
    for (int i = 0; i < 32; i++) {
        if (tlb->entries_2mb[i].valid &&
            tlb->entries_2mb[i].vpn == vpn_2mb &&
            (tlb->entries_2mb[i].global || tlb->entries_2mb[i].asid == asid)) {
            *physical_addr = ((uint64_t)tlb->entries_2mb[i].pfn << 21) |
                             (virtual_addr & 0x1FFFFF);      // 21-bit offset
            *is_huge = true;
            return true;   // TLB Hit!
        }
    }

    // Check 4KB entries
    for (int i = 0; i < 64; i++) {
        if (tlb->entries_4kb[i].valid &&
            tlb->entries_4kb[i].vpn == vpn_4kb &&
            (tlb->entries_4kb[i].global || tlb->entries_4kb[i].asid == asid)) {
            *physical_addr = ((uint64_t)tlb->entries_4kb[i].pfn << 12) |
                             (virtual_addr & 0xFFF);         // 12-bit offset
            *is_huge = false;
            return true;   // TLB Hit!
        }
    }

    return false;   // TLB Miss - need page table walk
}
```

Understanding TLB behavior requires measuring it on real systems. Modern CPUs provide performance counters specifically for TLB events, accessible through tools like perf on Linux.
Key TLB Performance Counters:
| Counter | Event Name | What It Measures |
|---|---|---|
| dtlb_load_misses.miss_causes_a_walk | DTLB Load Miss → Walk | Data loads that missed L1/L2 TLB |
| dtlb_load_misses.walk_completed | Walk Completed | Page walks that completed successfully |
| dtlb_load_misses.walk_duration | Walk Cycles | Cycles spent doing page walks |
| dtlb_store_misses.miss_causes_a_walk | DTLB Store Miss → Walk | Data stores that missed TLB |
| itlb_misses.miss_causes_a_walk | ITLB Miss → Walk | Instruction fetches that missed TLB |
| page_walker_loads.dtlb_* | Walk Memory Access | Memory ops by page walker (by level) |
The script below collects these counters with perf stat and flags workloads under TLB pressure:

```bash
#!/bin/bash
#
# TLB Performance Analysis Script
# Measures TLB behavior for a given application
#
# Usage: ./measure_tlb.sh <command>
#

if [ -z "$1" ]; then
    echo "Usage: $0 <command_to_profile>"
    exit 1
fi

echo "════════════════════════════════════════════════════════════════════"
echo "TLB PERFORMANCE ANALYSIS"
echo "════════════════════════════════════════════════════════════════════"
echo ""

# Check for perf availability
if ! command -v perf &> /dev/null; then
    echo "Error: 'perf' not found. Install linux-tools-generic."
    exit 1
fi

# Define TLB-related events
TLB_EVENTS=(
    "dtlb_load_misses.miss_causes_a_walk"
    "dtlb_load_misses.walk_completed"
    "dtlb_load_misses.walk_completed_4k"
    "dtlb_load_misses.walk_completed_2m_4m"
    "dtlb_load_misses.walk_completed_1g"
    "dtlb_store_misses.miss_causes_a_walk"
    "itlb_misses.miss_causes_a_walk"
    "instructions"
    "cycles"
)

# Build event string
EVENT_STR=""
for evt in "${TLB_EVENTS[@]}"; do
    if [ -n "$EVENT_STR" ]; then
        EVENT_STR="$EVENT_STR,"
    fi
    EVENT_STR="$EVENT_STR$evt"
done

echo "Running: $@"
echo "Events: $EVENT_STR"
echo ""
echo "────────────────────────────────────────────────────────────────────"

# Run with perf stat
perf stat -e "$EVENT_STR" -- "$@" 2>&1 | tee /tmp/tlb_analysis.txt

echo ""
echo "────────────────────────────────────────────────────────────────────"
echo "ANALYSIS SUMMARY"
echo "────────────────────────────────────────────────────────────────────"

# Parse and analyze results
DTLB_MISSES=$(grep "dtlb_load_misses.miss_causes_a_walk" /tmp/tlb_analysis.txt | awk '{print $1}' | tr -d ',')
INSTRUCTIONS=$(grep "instructions" /tmp/tlb_analysis.txt | awk '{print $1}' | tr -d ',')
CYCLES=$(grep "cycles" /tmp/tlb_analysis.txt | awk '{print $1}' | tr -d ',')

if [ -n "$DTLB_MISSES" ] && [ -n "$INSTRUCTIONS" ] && [ "$INSTRUCTIONS" -gt 0 ]; then
    MISS_RATE=$(echo "scale=6; $DTLB_MISSES * 1000000 / $INSTRUCTIONS" | bc)
    echo "DTLB miss rate: $MISS_RATE misses per million instructions"

    if (( $(echo "$MISS_RATE > 1000" | bc -l) )); then
        echo "⚠️  HIGH TLB MISS RATE - Consider using huge pages!"
    elif (( $(echo "$MISS_RATE > 100" | bc -l) )); then
        echo "⚡ Moderate TLB pressure - Huge pages may help"
    else
        echo "✓ TLB miss rate is acceptable"
    fi
fi

echo ""
echo "════════════════════════════════════════════════════════════════════"

# Additional analysis: Check huge page usage
echo ""
echo "CURRENT HUGE PAGE CONFIGURATION:"
echo "────────────────────────────────────────────────────────────────────"
if [ -f /proc/meminfo ]; then
    grep -i huge /proc/meminfo
fi

echo ""
echo "TIP: To enable huge pages, run:"
echo "  echo 1024 | sudo tee /proc/sys/vm/nr_hugepages"
echo "  Or use: madvise(addr, len, MADV_HUGEPAGE) for THP"
echo "════════════════════════════════════════════════════════════════════"
```

Interpreting the numbers:
TLB miss rates are typically expressed as misses per thousand (or million) instructions, or as the fraction of CPU cycles spent in page walks (walk_duration ÷ cycles).

A well-tuned application with huge pages should show a DTLB miss rate well under ~100 misses per million instructions, page-walk cycles in the low single digits as a percentage of total cycles, and completed walks dominated by 2MB/1GB translations rather than 4KB ones.
If dtlb_load_misses.walk_completed shows predominantly 'walk_completed_4k' rather than 'walk_completed_2m_4m', huge pages are not being used effectively. Check /proc/<pid>/smaps for 'AnonHugePages' to verify huge page usage.
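As a quick programmatic check, the sketch below sums the AnonHugePages field for the current process, assuming a kernel that provides /proc/self/smaps_rollup (Linux 4.14+); for another process, read /proc/<pid>/smaps_rollup instead:

```c
#include <stdio.h>
#include <string.h>

/* Report how much of the current process's anonymous memory is backed by
 * transparent huge pages, by reading the AnonHugePages field. */
int main(void) {
    FILE *f = fopen("/proc/self/smaps_rollup", "r");  /* kernel >= 4.14 */
    if (!f) {
        perror("smaps_rollup");
        return 1;
    }

    char line[256];
    long anon_huge_kb = -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "AnonHugePages:", 14) == 0) {
            sscanf(line + 14, "%ld", &anon_huge_kb);
            break;
        }
    }
    fclose(f);

    if (anon_huge_kb >= 0)
        printf("AnonHugePages: %ld kB backed by 2 MB THP\n", anon_huge_kb);
    else
        printf("No AnonHugePages line found\n");
    return 0;
}
```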
Let's consolidate everything we've learned to understand exactly how huge pages transform TLB efficiency:
1. Coverage Multiplication:
With the same number of TLB entries, huge pages cover 512× (2MB) or 262,144× (1GB) more memory. This often means the difference between constant TLB misses and nearly perfect hits.
| Metric | 4KB Pages | 2MB Pages | Improvement |
|---|---|---|---|
| TLB Coverage (1536 entries) | 6 MB | 3 GB | 512× |
| TLB hit rate (32GB database buffer pool) | ~85% | ~99% | +14 points |
| Page walk cycles (memory scan) | 35% of time | 2% of time | 17.5× reduction |
| Random access latency overhead | +80ns avg | +5ns avg | 16× reduction |
| VM exit cost (EPT walks) | ~4000 cycles | ~1500 cycles | 2.7× reduction |
2. Reduced Walk Depth:
When a TLB miss does occur, the page walk is shorter. 2MB pages eliminate one level; 1GB pages eliminate two levels. Under virtualization, this cascades—each eliminated guest level saves multiple host accesses.
3. Better Cache Behavior:

With 512× fewer leaf entries needed to map the same memory, page-table data occupies far fewer cache lines, so the walker's loads are much more likely to hit in the data caches and paging-structure caches instead of going to DRAM.
4. Reduced Memory Overhead:

The page table structures themselves consume less memory: mapping 1GB with 4KB pages requires 262,144 leaf PTEs (about 2MB of page tables), while the same gigabyte mapped with 2MB pages needs only 512 PD entries (about 4KB).
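The sketch below works through the same arithmetic for a larger mapping (8-byte entries, counting only the leaf tables, which dominate the total):

```c
#include <stdio.h>
#include <stdint.h>

/* Leaf page-table memory needed to map a region:
 * one 8-byte entry per page, whatever the page size. */
static uint64_t leaf_table_bytes(uint64_t region, uint64_t page_size) {
    return (region / page_size) * 8;
}

int main(void) {
    uint64_t region = 64ULL << 30;   /* a 64 GB mapping */

    printf("Mapping %llu GB:\n", (unsigned long long)(region >> 30));
    printf("  4 KB pages: %8llu KB of page-table entries\n",
           (unsigned long long)(leaf_table_bytes(region, 4096) >> 10));
    printf("  2 MB pages: %8llu KB of page-table entries\n",
           (unsigned long long)(leaf_table_bytes(region, 2 << 20) >> 10));
    printf("  1 GB pages: %8llu bytes of page-table entries\n",
           (unsigned long long)leaf_table_bytes(region, 1ULL << 30));
    return 0;   /* 128 MB vs 256 KB vs 512 bytes */
}
```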
For any workload with a working set larger than ~6MB that involves significant memory access, huge pages will typically improve performance. The improvement ranges from 5-10% for modest workloads to 50% or more for memory-intensive applications like databases and virtualization.
The TLB is the critical path for every memory access. Its efficiency determines whether address translation is a single-cycle operation or a hundreds-of-cycles penalty. Here's what we've learned:

- The TLB is a small cache of virtual-to-physical translations inside the MMU, answering hits in 1-2 cycles.
- TLB coverage = entries × page size, so 2MB pages stretch roughly the same 1536 entries from ~6MB to ~3GB of reach.
- A miss triggers a multi-level page walk costing hundreds of cycles, and several times more under virtualization's nested walks.
- Hardware counters exposed through perf (dtlb_load_misses.*, itlb_misses.*) make TLB behavior directly measurable on real workloads.
What's next:
Understanding TLB benefits motivates using huge pages—but how do you actually allocate them? The next page explores huge page allocation: the mechanisms for reserving and using huge pages, including boot-time reservation, hugetlbfs, and programmatic allocation via mmap() and madvise().
You now understand why TLB efficiency is the primary driver of huge page benefits. The mathematics are clear: larger pages mean more coverage, fewer misses, and faster translation. Next, we'll learn how to actually allocate and use huge pages in practice.