Every memory access in a modern system must pass through a critical checkpoint: address translation. The CPU core issues virtual addresses, but DRAM responds only to physical addresses. This translation must happen for every load, every store, every instruction fetch—potentially billions of times per second.
If the CPU performed a full page table walk for every memory access, modern systems would grind to a halt. A four-level page table walk requires four sequential memory reads before the actual data can be fetched. At ~100ns per memory access, this would add roughly 400ns of overhead to every operation, orders of magnitude more than the few nanoseconds a cached access normally takes.
The solution is the Translation Lookaside Buffer (TLB)—a specialized cache that stores recently used virtual-to-physical address translations. TLB hits provide translation in 1-2 CPU cycles; TLB misses can cost hundreds of cycles. This makes TLB efficiency one of the most critical factors in system performance, and it's precisely where huge pages deliver their most dramatic benefits.
By the end of this page, you will understand how the TLB works at the hardware level, master the mathematics of TLB coverage, analyze the true cost of TLB misses, and see exactly how huge pages multiply TLB effectiveness by orders of magnitude.
The TLB is a content-addressable memory (CAM) located within the Memory Management Unit (MMU) of the CPU. Unlike normal caches that are indexed by memory address, the TLB is indexed by virtual page number and returns physical frame numbers.
TLB Structure:

Each TLB entry contains a tag and a translation: the virtual page number (together with an address-space identifier such as x86's PCID), the corresponding physical frame number, and the permission bits copied from the page table entry (present, writable, user/supervisor, no-execute, accessed, dirty).

When the CPU needs to access memory, it extracts the virtual page number from the address and presents it to the TLB. On a hit, the physical frame number comes back in a cycle or two and the access proceeds immediately. On a miss, the translation must first be fetched from the in-memory page tables.
| TLB Level | Entries (4KB) | Entries (2MB) | Entries (1GB) | Associativity | Access Latency |
|---|---|---|---|---|---|
| L1 ITLB (Instructions) | 128 | 8 | — | 8-way | 1 cycle |
| L1 DTLB (Data) | 64 | 32 | 4 | 4-way | 1 cycle |
| L2 STLB (Unified) | 1536-4096 | 1536 | 16 | 8-12 way | 7-8 cycles |
Critical observation:
Notice that modern CPUs have separate TLB entries for different page sizes. The 4KB TLB might have 1536 entries, and the 2MB TLB also has 1536 entries—but each 2MB entry covers 512× more memory. This is the architectural foundation that makes huge pages so powerful: you get similar TLB capacity but vastly greater coverage.
Modern x86 processors include a hardware page table walker that automatically handles TLB misses without trapping to the OS. This walker reads page table entries from memory and installs them into the TLB. While faster than software handling, it still requires multiple memory accesses—making TLB misses expensive even with hardware assistance.
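As a rough illustration of what the walker does, the sketch below decodes a 48-bit x86-64 virtual address into the four table indices and the page offset used by a four-level walk. The bit positions are architectural; the program itself is only for demonstration.

```c
#include <stdio.h>
#include <stdint.h>

/* Decompose a 48-bit x86-64 virtual address into the indices the
 * hardware page walker uses at each level, plus the 4 KB page offset. */
static void decode_va(uint64_t va) {
    unsigned pml4_idx = (va >> 39) & 0x1FF;  /* bits 47:39 -> PML4 entry  */
    unsigned pdpt_idx = (va >> 30) & 0x1FF;  /* bits 38:30 -> PDPT entry  */
    unsigned pd_idx   = (va >> 21) & 0x1FF;  /* bits 29:21 -> PD entry    */
    unsigned pt_idx   = (va >> 12) & 0x1FF;  /* bits 20:12 -> PT entry    */
    unsigned offset   = va & 0xFFF;          /* bits 11:0  -> byte offset */

    printf("VA 0x%012lx -> PML4[%u] PDPT[%u] PD[%u] PT[%u] + 0x%03x\n",
           (unsigned long)va, pml4_idx, pdpt_idx, pd_idx, pt_idx, offset);
    /* With a 2 MB page the walk stops at the PD entry (bits 20:0 become the
     * offset); with a 1 GB page it stops at the PDPT entry (bits 29:0). */
}

int main(void) {
    decode_va(0x00007f1234567abcULL);
    return 0;
}
```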
TLB coverage is the total amount of memory that can be translated without a TLB miss at any given moment. This single metric captures the essence of TLB efficiency:
TLB Coverage = Number of TLB Entries × Page Size
Let's calculate coverage for a typical Intel Skylake-class processor with a unified L2 TLB of 1536 entries:

4KB Pages: 1536 entries × 4KB = 6MB coverage
2MB Pages: 1536 entries × 2MB = 3GB coverage

The 512× amplification:
With the same number of TLB entries, 2MB pages provide 512× more coverage than 4KB pages. This isn't just "512× better"—it often means the difference between fitting in TLB and not fitting at all. Memory access patterns that cause constant TLB misses with 4KB pages may experience nearly 100% TLB hits with 2MB pages.
For 1GB pages, the effect is even more dramatic:
1GB Pages: 16 entries × 1GB = 16GB coverage
Even with the limited 1GB TLB (typically 4-16 entries), you can cover more memory than most application working sets.
The script below tabulates this coverage for each page size and estimates miss rates for representative workloads:

```python
#!/usr/bin/env python3
"""TLB Coverage Analysis Tool

Calculates TLB coverage for different page sizes and demonstrates
the dramatic impact of huge pages on effective memory reach.
"""

from dataclasses import dataclass
from typing import List


@dataclass
class TLBConfig:
    """TLB configuration for a specific page size"""
    page_size: int
    page_name: str
    l1_dtlb_entries: int
    l1_itlb_entries: int
    l2_stlb_entries: int


@dataclass
class WorkloadProfile:
    """Memory workload characteristics"""
    name: str
    working_set_mb: float
    access_pattern: str  # 'sequential', 'random', 'strided'


def format_bytes(bytes_val: float) -> str:
    """Format bytes as human-readable string"""
    for unit in ['B', 'KB', 'MB', 'GB', 'TB']:
        if bytes_val < 1024:
            return f"{bytes_val:.1f} {unit}"
        bytes_val /= 1024
    return f"{bytes_val:.1f} PB"


def calculate_tlb_coverage(tlb_entries: int, page_size: int) -> int:
    """Calculate total memory coverage for TLB"""
    return tlb_entries * page_size


def estimate_tlb_miss_rate(working_set: int, tlb_coverage: int,
                           access_pattern: str) -> float:
    """
    Estimate TLB miss rate based on working set and coverage.

    This is a simplified model. Real miss rates depend on:
    - Access pattern (temporal and spatial locality)
    - TLB replacement policy (usually pseudo-LRU)
    - Code vs data accesses
    - Multi-level TLB behavior
    """
    if working_set <= tlb_coverage:
        # Working set fits - primarily compulsory misses
        if access_pattern == 'sequential':
            return 0.01   # 1% - excellent locality
        elif access_pattern == 'random':
            return 0.05   # 5% - occasional conflicts
        else:  # strided
            return 0.02   # 2%
    else:
        # Working set exceeds coverage - capacity misses dominate
        overflow_ratio = working_set / tlb_coverage
        if access_pattern == 'sequential':
            # Sequential: one miss per page, good reuse within page
            return min(0.5, 0.1 * overflow_ratio)
        elif access_pattern == 'random':
            # Random: high miss rate, poor reuse
            return min(0.95, 0.3 + 0.4 * (overflow_ratio - 1) / 10)
        else:  # strided
            return min(0.7, 0.15 * overflow_ratio)


# Intel Skylake/Ice Lake class TLB configuration
TLB_CONFIGS = [
    TLBConfig(4 * 1024, "4KB",
              l1_dtlb_entries=64, l1_itlb_entries=128, l2_stlb_entries=1536),
    TLBConfig(2 * 1024 * 1024, "2MB",
              l1_dtlb_entries=32, l1_itlb_entries=8, l2_stlb_entries=1536),
    TLBConfig(1024 * 1024 * 1024, "1GB",
              l1_dtlb_entries=4, l1_itlb_entries=0, l2_stlb_entries=16),
]

# Representative workloads
WORKLOADS = [
    WorkloadProfile("Small application", 16, "mixed"),
    WorkloadProfile("Web server", 256, "random"),
    WorkloadProfile("In-memory cache", 2048, "random"),
    WorkloadProfile("Database buffer pool", 8192, "random"),
    WorkloadProfile("Big data analytics", 32768, "strided"),
    WorkloadProfile("In-memory database", 131072, "random"),  # 128GB
]

print("=" * 90)
print("TLB COVERAGE ANALYSIS - MODERN x86 PROCESSOR")
print("=" * 90)

# Calculate and display coverage for each page size
print("\n" + "─" * 90)
print("TLB COVERAGE BY PAGE SIZE")
print("─" * 90)
print(f"{'Page Size':<12} {'L1 DTLB':<20} {'L2 STLB':<20} {'Total Coverage':<20}")
print("─" * 90)

for config in TLB_CONFIGS:
    l1_coverage = calculate_tlb_coverage(config.l1_dtlb_entries, config.page_size)
    l2_coverage = calculate_tlb_coverage(config.l2_stlb_entries, config.page_size)
    total = l1_coverage + l2_coverage  # Simplified - assumes minimal overlap
    print(f"{config.page_name:<12} "
          f"{config.l1_dtlb_entries:>4} entries = {format_bytes(l1_coverage):<8} "
          f"{config.l2_stlb_entries:>4} entries = {format_bytes(l2_coverage):<8} "
          f"≈ {format_bytes(total)}")

print("\n" + "─" * 90)
print("ESTIMATED TLB MISS RATES BY WORKLOAD")
print("─" * 90)
print(f"{'Workload':<25} {'Working Set':<12} {'4KB Pages':<15} {'2MB Pages':<15} {'1GB Pages':<15}")
print("─" * 90)

for workload in WORKLOADS:
    ws_bytes = workload.working_set_mb * 1024 * 1024
    row = f"{workload.name:<25} {format_bytes(ws_bytes):<12} "
    for config in TLB_CONFIGS:
        l2_coverage = calculate_tlb_coverage(config.l2_stlb_entries, config.page_size)
        miss_rate = estimate_tlb_miss_rate(ws_bytes, l2_coverage, workload.access_pattern)
        row += f"{miss_rate*100:>6.2f}% miss   "
    print(row)

print("\n" + "=" * 90)
print("KEY INSIGHT: 2MB pages reduce miss rates by 10-100x for large workloads")
print("=" * 90)
```

A TLB miss is deceptively expensive. While the hardware page table walker handles misses automatically, each miss triggers a cascade of memory operations that stall the CPU pipeline.
Anatomy of a TLB miss on x86-64:
| Operation | Memory Accesses | Latency (cycles) | Notes |
|---|---|---|---|
| Read PML4 entry | 1 | ~5-50 | Few entries, heavy reuse; usually hits the paging-structure or data caches |
| Read PDPT entry | 1 | ~5-50 | Frequently cached |
| Read PD entry | 1 | ~20-150 | Sometimes cached |
| Read PT entry | 1 | ~100-200 | Millions of entries, poor reuse; often a DRAM access |
| Install TLB entry | 0 | ~10 | TLB fill overhead |
| Total (worst case) | 4 | ~600-700 | All four levels miss the caches: ~150-175ns |
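To put the per-miss cost in context, here is a back-of-the-envelope model (a sketch, not a measurement) using the 1-cycle hit and ~600-cycle worst-case walk figures from the tables above:

```c
#include <stdio.h>

/* Average translation cost = hit_cost + miss_rate * walk_cost
 * (the same structure as the classic average-memory-access-time formula). */
int main(void) {
    const double hit_cycles  = 1.0;    /* L1 TLB hit, from the table above */
    const double walk_cycles = 600.0;  /* worst-case 4-level walk          */
    const double miss_rates[] = {0.0001, 0.001, 0.01, 0.05};

    for (int i = 0; i < 4; i++) {
        double avg = hit_cycles + miss_rates[i] * walk_cycles;
        printf("miss rate %6.2f%%  ->  average translation cost %6.2f cycles\n",
               miss_rates[i] * 100.0, avg);
    }
    /* Even a 1% miss rate makes translation ~7 cycles on average --
     * several times the cost of a perfectly-hitting TLB. */
    return 0;
}
```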
Why the walk is slow:

Each level's lookup depends on the physical address read at the previous level, so the four memory accesses are strictly sequential and cannot be overlapped or prefetched. The walker's loads also compete with application data for cache capacity and memory bandwidth, and the pipeline stalls on the dependent access until the translation arrives.
The huge page advantage:
With 2MB pages, the walk stops one level earlier:
| Page Size | Levels Walked | Memory Accesses | Typical Latency | Under Virtualization |
|---|---|---|---|---|
| 4 KB | 4 (PML4→PT) | 4 | ~600 cycles | Up to 24 accesses |
| 2 MB | 3 (PML4→PD) | 3 | ~450 cycles | Up to 15 accesses |
| 1 GB | 2 (PML4→PDPT) | 2 | ~300 cycles | Up to 8 accesses |
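The virtualization column follows the standard 2D-walk accounting: every guest level, plus the guest-physical address of the data itself, must be translated through the host's tables. A minimal sketch of that arithmetic, assuming the common (G + 1) × (H + 1) - 1 formula with the same page size used at both guest and host level:

```c
#include <stdio.h>

/* Memory accesses for a nested ("2D") page walk:
 * every guest-walk level, plus the guest-physical data address itself,
 * must be translated through the host tables.
 * accesses = (guest_levels + 1) * (host_levels + 1) - 1 */
static int nested_walk_accesses(int guest_levels, int host_levels) {
    return (guest_levels + 1) * (host_levels + 1) - 1;
}

int main(void) {
    /* Page size -> levels walked: 4 KB = 4, 2 MB = 3, 1 GB = 2 */
    const char *names[]  = {"4 KB", "2 MB", "1 GB"};
    const int   levels[] = {4, 3, 2};

    printf("%-6s %-14s %s\n", "Page", "Native walk", "Nested walk (same page size on host)");
    for (int i = 0; i < 3; i++) {
        printf("%-6s %-14d %d\n",
               names[i], levels[i], nested_walk_accesses(levels[i], levels[i]));
    }
    return 0;   /* prints 24, 15 and 8 -- the figures in the table above */
}
```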
The benchmark below makes the cost visible in practice: a pointer-chasing loop whose working set grows until it overflows TLB coverage, run with both 4KB and 2MB pages.

```c
/*
 * Benchmark demonstrating TLB miss impact
 *
 * This program measures memory access latency with varying working set sizes
 * to show the inflection points where TLB misses start dominating.
 *
 * Compile: gcc -O2 -o tlb_miss tlb_miss_impact.c -lrt
 * Run:     ./tlb_miss   (requires root for huge pages)
 */

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <time.h>
#include <sys/mman.h>
#include <errno.h>

#define ITERATIONS      100000000
#define CACHE_LINE_SIZE 64

// Get current time in nanoseconds
static inline uint64_t get_time_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

// Pointer-chasing benchmark - dependent loads serialize each access
// Returns average access latency in nanoseconds
double measure_access_latency(void *region, size_t size, int iterations) {
    size_t num_elements = size / sizeof(void*);
    void **pointers = (void**)region;

    // Create a page-strided chase pattern (each hop lands one 4 KB page ahead)
    for (size_t i = 0; i < num_elements; i++) {
        pointers[i] = &pointers[(i + 4096/sizeof(void*)) % num_elements];
    }

    // Warm up
    void **p = &pointers[0];
    for (int i = 0; i < 10000; i++) {
        p = (void**)*p;
    }

    // Measure
    uint64_t start = get_time_ns();
    p = &pointers[0];
    for (int i = 0; i < iterations; i++) {
        p = (void**)*p;   // Chase the pointer
    }
    uint64_t end = get_time_ns();

    // Prevent optimization
    volatile void *sink = p;
    (void)sink;

    return (double)(end - start) / iterations;
}

void* allocate_region(size_t size, int use_huge_pages) {
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    if (use_huge_pages) {
        flags |= MAP_HUGETLB;
    }

    void *region = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (region == MAP_FAILED) {
        if (use_huge_pages) {
            fprintf(stderr, "Huge page allocation failed (errno=%d). "
                            "Try: echo 1024 > /proc/sys/vm/nr_hugepages\n", errno);
        }
        return NULL;
    }

    // Touch all pages to ensure allocation
    memset(region, 0, size);
    return region;
}

int main() {
    printf("TLB Miss Impact Benchmark\n");
    printf("═══════════════════════════════════════════════════════════════\n\n");

    // Test different working set sizes (in MB)
    size_t sizes_mb[] = {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024};
    int num_sizes = sizeof(sizes_mb) / sizeof(sizes_mb[0]);

    printf("Working Set    4KB Pages      2MB Huge Pages    Speedup\n");
    printf("─────────────────────────────────────────────────────────────\n");

    for (int i = 0; i < num_sizes; i++) {
        size_t size = sizes_mb[i] * 1024 * 1024;

        // Skip if system doesn't have enough memory
        void *region_4kb = allocate_region(size, 0);
        if (!region_4kb) continue;

        void *region_huge = allocate_region(size, 1);

        double latency_4kb = measure_access_latency(region_4kb, size,
                                                    ITERATIONS / (i + 1));
        double latency_huge = -1;
        if (region_huge) {
            latency_huge = measure_access_latency(region_huge, size,
                                                  ITERATIONS / (i + 1));
            munmap(region_huge, size);
        }

        printf("%6zu MB     %6.2f ns", sizes_mb[i], latency_4kb);
        if (latency_huge > 0) {
            double speedup = latency_4kb / latency_huge;
            printf("       %6.2f ns        %.2fx\n", latency_huge, speedup);
        } else {
            printf("          N/A             N/A\n");
        }

        munmap(region_4kb, size);
    }

    printf("\n═══════════════════════════════════════════════════════════════\n");
    printf("Note: Speedup increases as working set exceeds TLB coverage (~6MB for 4KB).\n");
    printf("Beyond TLB coverage, huge pages provide 1.5-3x performance improvement.\n");

    return 0;
}

/*
 * Expected output pattern (actual numbers vary by system):
 *
 * Working Set    4KB Pages      2MB Huge Pages    Speedup
 * ─────────────────────────────────────────────────────────────
 *      1 MB       4.50 ns         4.45 ns        1.01x
 *      2 MB       4.55 ns         4.48 ns        1.02x
 *      4 MB       4.80 ns         4.52 ns        1.06x
 *      8 MB      12.30 ns         4.65 ns        2.65x  ← TLB overflow starts
 *     16 MB      28.50 ns         5.20 ns        5.48x
 *     32 MB      45.80 ns         8.40 ns        5.45x
 *     64 MB      62.40 ns        15.30 ns        4.08x
 *    128 MB      78.90 ns        24.60 ns        3.21x
 *    256 MB      95.40 ns        38.70 ns        2.47x
 *    512 MB     112.60 ns        52.80 ns        2.13x
 *   1024 MB     128.30 ns        68.40 ns        1.88x
 */
```

Under virtualization (KVM, VMware, Hyper-V), TLB miss cost is dramatically amplified. The hypervisor maintains its own page tables, causing 'nested' or '2D' page walks. A 4KB page miss may require 24 memory accesses (4 guest × 4 host + nested combinations). Huge pages at both guest and host level are critical for virtualized workloads.
Modern TLB architecture is remarkably sophisticated, evolved over decades to maximize hit rates while minimizing latency. Understanding this architecture reveals why huge pages are so effective.
Split vs Unified TLBs:
Most processors use a split L1 TLB (separate for instructions and data) and a unified L2 TLB (shared). This mirrors the instruction/data cache split:
| TLB Type | 4KB Entries | 2MB Entries | 1GB Entries | Associativity |
|---|---|---|---|---|
| L1 ITLB | 128 | 8 | — | 8-way |
| L1 DTLB | 64 | 32 | 4 | 4-way |
| L2 STLB | 2048 | 2048 | 16 | 16-way |
Associativity and Conflicts:
TLBs use set-associative caching. With 4-way associativity, each VPN can only reside in one of four possible locations. If a process accesses more than 4 pages that map to the same set, conflict misses occur even with available capacity.
Huge pages dramatically reduce conflict probability: covering the same working set requires 512× fewer entries, so far fewer virtual page numbers compete for slots in any given set (see the sketch below).
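To see why, the sketch below assumes a simple VPN-mod-sets index (real TLB indexing is microarchitecture-specific) and counts how many pages of a 512MB working set land in each set of the L2 STLB:

```c
#include <stdio.h>
#include <stdint.h>

/* Assume a simple VPN-mod-sets index (real hardware may hash differently). */
static void tlb_pressure(const char *name, uint64_t page_size,
                         int entries, int ways, uint64_t working_set) {
    int sets = entries / ways;
    uint64_t pages_touched = (working_set + page_size - 1) / page_size;
    double avg_per_set = (double)pages_touched / sets;

    printf("%-4s pages: %8llu pages touched, %3d sets x %2d ways, "
           "avg %.1f pages per set%s\n",
           name, (unsigned long long)pages_touched, sets, ways, avg_per_set,
           avg_per_set > ways ? "  <-- conflict misses likely" : "");
}

int main(void) {
    uint64_t working_set = 512ULL << 20;   /* 512 MB working set */
    /* L2 STLB figures from the table above */
    tlb_pressure("4KB", 4096,            2048, 16, working_set);
    tlb_pressure("2MB", 2 * 1024 * 1024, 2048, 16, working_set);
    return 0;
}
```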
PCID (Process Context ID):
Modern x86 processors support PCID—a 12-bit tag attached to TLB entries that identifies which address space they belong to. This allows TLB entries to survive context switches, dramatically improving performance for systems running many processes.
```c
/*
 * Conceptual representation of TLB entry structure
 * Actual hardware implementation varies by architecture
 */

#include <stdint.h>
#include <stdbool.h>

/* TLB Entry for 4KB pages */
typedef struct {
    /* Tag portion - identifies the virtual page */
    uint64_t vpn       : 52;   // Virtual Page Number (bits 63:12 of VA)
    uint16_t asid      : 12;   // Address Space ID (PCID on x86)

    /* Data portion - translation and permissions */
    uint64_t pfn       : 40;   // Physical Frame Number (bits 51:12 of PA)

    /* Permission flags (from PTE) */
    bool present       : 1;    // Page present in memory
    bool writable      : 1;    // Write permission
    bool user          : 1;    // User-mode accessible
    bool global        : 1;    // Global page (skip ASID check)
    bool accessed      : 1;    // Page has been accessed
    bool dirty         : 1;    // Page has been written
    bool page_size     : 1;    // PS bit - is this a huge page entry?
    bool no_execute    : 1;    // Execute permission (inverted)

    /* TLB metadata */
    bool valid         : 1;    // Entry is valid
    uint8_t mesi_state : 2;    // Coherence state (for cache-coherent TLB)

    /* LRU tracking for replacement (architectural, not in actual entry) */
    uint8_t lru_counter : 4;   // Pseudo-LRU for replacement
} TLB_Entry_4KB;

/* TLB Entry for 2MB huge pages */
typedef struct {
    /* Tag portion - fewer bits needed for VPN */
    uint64_t vpn  : 43;   // Virtual Page Number (bits 63:21 of VA)
    uint16_t asid : 12;   // Address Space ID

    /* Data portion */
    uint64_t pfn  : 31;   // Physical Frame Number (bits 51:21 of PA)

    /* Same permission flags as 4KB */
    bool present    : 1;
    bool writable   : 1;
    bool user       : 1;
    bool global     : 1;
    bool accessed   : 1;
    bool dirty      : 1;
    bool page_size  : 1;  // Always 1 for 2MB entries
    bool no_execute : 1;

    bool valid      : 1;
    uint8_t lru_counter : 4;
} TLB_Entry_2MB;

/* TLB Entry for 1GB giant pages */
typedef struct {
    /* Tag portion - even fewer bits */
    uint64_t vpn  : 34;   // Virtual Page Number (bits 63:30 of VA)
    uint16_t asid : 12;

    /* Data portion */
    uint64_t pfn  : 22;   // Physical Frame Number (bits 51:30 of PA)

    /* Permission flags */
    bool present    : 1;
    bool writable   : 1;
    bool user       : 1;
    bool global     : 1;
    bool accessed   : 1;
    bool dirty      : 1;
    bool page_size  : 1;  // Always 1 for 1GB entries
    bool no_execute : 1;

    bool valid      : 1;
    uint8_t lru_counter : 4;
} TLB_Entry_1GB;

/*
 * TLB Lookup simulation showing the parallel search nature
 */
typedef struct {
    TLB_Entry_4KB entries_4kb[64];   // L1 DTLB 4KB entries
    TLB_Entry_2MB entries_2mb[32];   // L1 DTLB 2MB entries
    TLB_Entry_1GB entries_1gb[4];    // L1 DTLB 1GB entries
    int associativity;
} L1_DTLB;

/*
 * Simulated TLB lookup - in real hardware, all comparisons happen in parallel
 */
bool tlb_lookup(L1_DTLB *tlb, uint64_t virtual_addr, uint64_t asid,
                uint64_t *physical_addr, bool *is_huge) {
    // Extract VPNs for each page size
    uint64_t vpn_4kb = virtual_addr >> 12;
    uint64_t vpn_2mb = virtual_addr >> 21;
    uint64_t vpn_1gb = virtual_addr >> 30;

    // Check 1GB entries first (highest priority)
    for (int i = 0; i < 4; i++) {
        if (tlb->entries_1gb[i].valid &&
            tlb->entries_1gb[i].vpn == vpn_1gb &&
            (tlb->entries_1gb[i].global || tlb->entries_1gb[i].asid == asid)) {
            *physical_addr = ((uint64_t)tlb->entries_1gb[i].pfn << 30) |
                             (virtual_addr & 0x3FFFFFFF);   // 30-bit offset
            *is_huge = true;
            return true;   // TLB Hit!
        }
    }

    // Check 2MB entries
    for (int i = 0; i < 32; i++) {
        if (tlb->entries_2mb[i].valid &&
            tlb->entries_2mb[i].vpn == vpn_2mb &&
            (tlb->entries_2mb[i].global || tlb->entries_2mb[i].asid == asid)) {
            *physical_addr = ((uint64_t)tlb->entries_2mb[i].pfn << 21) |
                             (virtual_addr & 0x1FFFFF);      // 21-bit offset
            *is_huge = true;
            return true;   // TLB Hit!
        }
    }

    // Check 4KB entries
    for (int i = 0; i < 64; i++) {
        if (tlb->entries_4kb[i].valid &&
            tlb->entries_4kb[i].vpn == vpn_4kb &&
            (tlb->entries_4kb[i].global || tlb->entries_4kb[i].asid == asid)) {
            *physical_addr = ((uint64_t)tlb->entries_4kb[i].pfn << 12) |
                             (virtual_addr & 0xFFF);         // 12-bit offset
            *is_huge = false;
            return true;   // TLB Hit!
        }
    }

    return false;   // TLB Miss - need page table walk
}
```

Understanding TLB behavior requires measuring it on real systems. Modern CPUs provide performance counters specifically for TLB events, accessible through tools like perf on Linux.
Key TLB Performance Counters:
| Counter | Event Name | What It Measures |
|---|---|---|
| dtlb_load_misses.miss_causes_a_walk | DTLB Load Miss → Walk | Data loads that missed L1/L2 TLB |
| dtlb_load_misses.walk_completed | Walk Completed | Page walks that completed successfully |
| dtlb_load_misses.walk_duration | Walk Cycles | Cycles spent doing page walks |
| dtlb_store_misses.miss_causes_a_walk | DTLB Store Miss → Walk | Data stores that missed TLB |
| itlb_misses.miss_causes_a_walk | ITLB Miss → Walk | Instruction fetches that missed TLB |
| page_walker_loads.dtlb_* | Walk Memory Access | Memory ops by page walker (by level) |
The script below collects these counters with perf stat and flags workloads under TLB pressure:

```bash
#!/bin/bash
#
# TLB Performance Analysis Script
# Measures TLB behavior for a given application
#
# Usage: ./measure_tlb.sh <command>
#

if [ -z "$1" ]; then
    echo "Usage: $0 <command_to_profile>"
    exit 1
fi

echo "════════════════════════════════════════════════════════════════════"
echo "TLB PERFORMANCE ANALYSIS"
echo "════════════════════════════════════════════════════════════════════"
echo ""

# Check for perf availability
if ! command -v perf &> /dev/null; then
    echo "Error: 'perf' not found. Install linux-tools-generic."
    exit 1
fi

# Define TLB-related events
TLB_EVENTS=(
    "dtlb_load_misses.miss_causes_a_walk"
    "dtlb_load_misses.walk_completed"
    "dtlb_load_misses.walk_completed_4k"
    "dtlb_load_misses.walk_completed_2m_4m"
    "dtlb_load_misses.walk_completed_1g"
    "dtlb_store_misses.miss_causes_a_walk"
    "itlb_misses.miss_causes_a_walk"
    "instructions"
    "cycles"
)

# Build event string
EVENT_STR=""
for evt in "${TLB_EVENTS[@]}"; do
    if [ -n "$EVENT_STR" ]; then
        EVENT_STR="$EVENT_STR,"
    fi
    EVENT_STR="$EVENT_STR$evt"
done

echo "Running: $@"
echo "Events: $EVENT_STR"
echo ""
echo "────────────────────────────────────────────────────────────────────"

# Run with perf stat
perf stat -e "$EVENT_STR" -- "$@" 2>&1 | tee /tmp/tlb_analysis.txt

echo ""
echo "────────────────────────────────────────────────────────────────────"
echo "ANALYSIS SUMMARY"
echo "────────────────────────────────────────────────────────────────────"

# Parse and analyze results
DTLB_MISSES=$(grep "dtlb_load_misses.miss_causes_a_walk" /tmp/tlb_analysis.txt | awk '{print $1}' | tr -d ',')
INSTRUCTIONS=$(grep "instructions" /tmp/tlb_analysis.txt | awk '{print $1}' | tr -d ',')
CYCLES=$(grep "cycles" /tmp/tlb_analysis.txt | awk '{print $1}' | tr -d ',')

if [ -n "$DTLB_MISSES" ] && [ -n "$INSTRUCTIONS" ] && [ "$INSTRUCTIONS" -gt 0 ]; then
    MISS_RATE=$(echo "scale=6; $DTLB_MISSES * 1000000 / $INSTRUCTIONS" | bc)
    echo "DTLB miss rate: $MISS_RATE misses per million instructions"

    if (( $(echo "$MISS_RATE > 1000" | bc -l) )); then
        echo "⚠️  HIGH TLB MISS RATE - Consider using huge pages!"
    elif (( $(echo "$MISS_RATE > 100" | bc -l) )); then
        echo "⚡ Moderate TLB pressure - Huge pages may help"
    else
        echo "✓ TLB miss rate is acceptable"
    fi
fi

echo ""
echo "════════════════════════════════════════════════════════════════════"

# Additional analysis: Check huge page usage
echo ""
echo "CURRENT HUGE PAGE CONFIGURATION:"
echo "────────────────────────────────────────────────────────────────────"
if [ -f /proc/meminfo ]; then
    grep -i huge /proc/meminfo
fi

echo ""
echo "TIP: To enable huge pages, run:"
echo "  echo 1024 | sudo tee /proc/sys/vm/nr_hugepages"
echo "  Or use: madvise(addr, len, MADV_HUGEPAGE) for THP"
echo "════════════════════════════════════════════════════════════════════"
```

Interpreting the numbers:
TLB miss rates are typically expressed as misses per thousand (or million) instructions, or as the fraction of CPU cycles spent in page walks (walk_duration ÷ cycles).

A well-tuned application with huge pages should show a DTLB miss rate well under ~100 misses per million instructions, page-walk cycles in the low single digits as a percentage of total cycles, and completed walks dominated by 2MB/1GB translations rather than 4KB ones.
If dtlb_load_misses.walk_completed shows predominantly 'walk_completed_4k' rather than 'walk_completed_2m_4m', huge pages are not being used effectively. Check /proc/<pid>/smaps for 'AnonHugePages' to verify huge page usage.
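As a quick programmatic check, the sketch below sums the AnonHugePages field for the current process, assuming a kernel that provides /proc/self/smaps_rollup (Linux 4.14+); for another process, read /proc/<pid>/smaps_rollup instead:

```c
#include <stdio.h>
#include <string.h>

/* Report how much of the current process's anonymous memory is backed by
 * transparent huge pages, by reading the AnonHugePages field. */
int main(void) {
    FILE *f = fopen("/proc/self/smaps_rollup", "r");  /* kernel >= 4.14 */
    if (!f) {
        perror("smaps_rollup");
        return 1;
    }

    char line[256];
    long anon_huge_kb = -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "AnonHugePages:", 14) == 0) {
            sscanf(line + 14, "%ld", &anon_huge_kb);
            break;
        }
    }
    fclose(f);

    if (anon_huge_kb >= 0)
        printf("AnonHugePages: %ld kB backed by 2 MB THP\n", anon_huge_kb);
    else
        printf("No AnonHugePages line found\n");
    return 0;
}
```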
Let's consolidate everything we've learned to understand exactly how huge pages transform TLB efficiency:
1. Coverage Multiplication:
With the same number of TLB entries, huge pages cover 512× (2MB) or 262,144× (1GB) more memory. This often means the difference between constant TLB misses and nearly perfect hits.
| Metric | 4KB Pages | 2MB Pages | Improvement |
|---|---|---|---|
| TLB Coverage (1536 entries) | 6 MB | 3 GB | 512× |
| TLB hit rate (32GB database buffer pool) | ~85% | ~99% | +14 points |
| Page walk cycles (memory scan) | 35% of time | 2% of time | 17.5× reduction |
| Random access latency overhead | +80ns avg | +5ns avg | 16× reduction |
| VM exit cost (EPT walks) | ~4000 cycles | ~1500 cycles | 2.7× reduction |
2. Reduced Walk Depth:
When a TLB miss does occur, the page walk is shorter. 2MB pages eliminate one level; 1GB pages eliminate two levels. Under virtualization, this cascades—each eliminated guest level saves multiple host accesses.
3. Better Cache Behavior:

With 512× fewer leaf entries needed to map the same memory, page-table data occupies far fewer cache lines, so the walker's loads are much more likely to hit in the data caches and paging-structure caches instead of going to DRAM.
4. Reduced Memory Overhead:

The page table structures themselves consume less memory: mapping 1GB with 4KB pages requires 262,144 leaf PTEs (about 2MB of page tables), while the same gigabyte mapped with 2MB pages needs only 512 PD entries (about 4KB).
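The sketch below works through the same arithmetic for a larger mapping (8-byte entries, counting only the leaf tables, which dominate the total):

```c
#include <stdio.h>
#include <stdint.h>

/* Leaf page-table memory needed to map a region:
 * one 8-byte entry per page, whatever the page size. */
static uint64_t leaf_table_bytes(uint64_t region, uint64_t page_size) {
    return (region / page_size) * 8;
}

int main(void) {
    uint64_t region = 64ULL << 30;   /* a 64 GB mapping */

    printf("Mapping %llu GB:\n", (unsigned long long)(region >> 30));
    printf("  4 KB pages: %8llu KB of page-table entries\n",
           (unsigned long long)(leaf_table_bytes(region, 4096) >> 10));
    printf("  2 MB pages: %8llu KB of page-table entries\n",
           (unsigned long long)(leaf_table_bytes(region, 2 << 20) >> 10));
    printf("  1 GB pages: %8llu bytes of page-table entries\n",
           (unsigned long long)leaf_table_bytes(region, 1ULL << 30));
    return 0;   /* 128 MB vs 256 KB vs 512 bytes */
}
```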
For any workload with a working set larger than ~6MB that involves significant memory access, huge pages will typically improve performance. The improvement ranges from 5-10% for modest workloads to 50% or more for memory-intensive applications like databases and virtualization.
The TLB is the critical path for every memory access. Its efficiency determines whether address translation is a single-cycle operation or a hundreds-of-cycles penalty. Here's what we've learned:

- The TLB is a small cache of virtual-to-physical translations inside the MMU, answering hits in 1-2 cycles.
- TLB coverage = entries × page size, so 2MB pages stretch roughly the same 1536 entries from ~6MB to ~3GB of reach.
- A miss triggers a multi-level page walk costing hundreds of cycles, and several times more under virtualization's nested walks.
- Hardware counters exposed through perf (dtlb_load_misses.*, itlb_misses.*) make TLB behavior directly measurable on real workloads.
What's next:
Understanding TLB benefits motivates using huge pages—but how do you actually allocate them? The next page explores huge page allocation: the mechanisms for reserving and using huge pages, including boot-time reservation, hugetlbfs, and programmatic allocation via mmap() and madvise().
You now understand why TLB efficiency is the primary driver of huge page benefits. The mathematics are clear: larger pages mean more coverage, fewer misses, and faster translation. Next, we'll learn how to actually allocate and use huge pages in practice.