Every memory access your program makes—every instruction fetch, every data load, every store—results in one of two TLB outcomes: hit or miss. This binary event has an astounding performance differential. A TLB hit adds perhaps 1-2 nanoseconds to memory access; a TLB miss can add 100-500 nanoseconds or more. That's a difference of roughly two orders of magnitude in translation time.
The aggregate effect of millions of these outcomes per second defines your program's memory performance. A program with 99.5% TLB hit rate runs dramatically faster than one with 95% hit rate—the miss rate differs by 10×, and misses dominate runtime.
Understanding TLB hit and miss behavior is essential for both systems programmers who optimize memory management and application developers who design data structures for performance. This page examines what causes hits and misses, how systems handle misses, and techniques for maximizing hit rates.
By the end of this page, you will understand the conditions for TLB hits and misses, how hardware and software handle miss scenarios, the different types of TLB misses, page walk mechanics, and optimization techniques for achieving high hit rates.
A TLB hit occurs when the translation for a virtual page is found in the TLB. This is the desired outcome—fast, efficient, and adding minimal latency to memory access.
Conditions for a TLB hit:
- An entry with a matching VPN (virtual page number) is present in the TLB
- The entry's valid bit is set
- The entry's ASID matches the current process (or the entry is marked global)
What happens on a TLB hit:
1. TLB receives virtual address from CPU
2. Parallel comparison in 1 cycle
3. Match found → entry selected
4. Valid bit checked → entry is valid
5. Protection bits checked → access permitted
6. PFN extracted from entry
7. Physical address formed: PFN concatenated with the page offset
8. Memory access proceeds
Total TLB hit time: 1-2 cycles (~0.5-1 ns)
The hit remains fast even with additional checks:
Modern TLBs perform several validations in parallel with the content-addressable lookup:
All of these happen in the same 1-2 cycle window. If any check fails, the result is either a miss (for ASID/valid issues) or a protection fault (for permission violations).
A TLB hit with permission violation is NOT a TLB miss—the translation exists. Instead, it triggers a protection fault, which the OS handles differently (usually by terminating the process for illegal access). Only missing or invalid translations trigger the slow page walk path.
A TLB miss occurs when no valid translation for the requested virtual page exists in the TLB. This triggers the expensive page table walk to find the translation in memory.
Conditions for a TLB miss:
- No entry with a matching VPN exists in the TLB
- A matching entry exists, but its valid bit is clear
- A matching entry exists, but its ASID does not match the current process
What happens on a TLB miss (hardware-managed, like x86):
1. TLB lookup fails—no valid matching entry
2. The CPU's page walk engine walks the page table (four levels on x86-64: PML4 → PDPT → PD → PT)
3. Each level requires one memory read to fetch the next-level entry
4. The final translation is installed in the TLB (evicting a victim entry if full)
5. The original memory access is retried and now hits
The brutal arithmetic of TLB misses:
Assuming each page table memory access is a cache miss (worst case): 4 levels × ~100 ns DRAM latency ≈ 400 ns per miss—hundreds of times the 1-2 cycle hit path.
Even with cached page table entries (common case): 4 accesses at cache latency (~10-40 cycles each) still add tens of nanoseconds per miss.
This extreme penalty is why TLB hit rates must be extremely high (>99%) for good performance.
TLB miss penalty isn't just the page walk time—it also blocks the CPU pipeline, prevents dependent instructions from executing, and consumes memory bandwidth. A single TLB miss can stall an out-of-order CPU for hundreds of cycles.
Not all TLB misses are created equal. Understanding miss types helps diagnose performance issues and choose appropriate optimizations.
1. Compulsory (Cold) Misses:
The first access to a page always misses—the translation has never been loaded. These are unavoidable.
Characteristics:
- Occur exactly once per page, on first touch
- Low overall frequency for long-running programs
- Reduced by larger pages (fewer pages to touch), prepaging, and working set optimization
2. Capacity Misses:
The working set exceeds TLB capacity—evicted translations are accessed again.
Characteristics:
- Frequency depends on how far the working set exceeds TLB capacity
- Typically dominate after warmup
- Mitigated by huge pages, larger TLBs, and improved locality
3. Conflict Misses (Set-Associative TLBs):
For set-associative TLBs, multiple pages might map to the same set.
Characteristics:
- Rare in practice—modern TLBs are highly or fully associative
- Caused by unlucky address patterns mapping to the same set
- Mitigated by higher associativity or a small victim TLB
| Miss Type | Cause | Frequency | Mitigation Strategies |
|---|---|---|---|
| Compulsory | First access to page | Low (once per page) | Larger pages, prepaging, working set optimization |
| Capacity | Working set > TLB size | Variable (workload dependent) | Larger TLBs, huge pages, locality improvement |
| Conflict | Set collision (set-assoc only) | Rare for TLBs | Higher associativity, victim TLB |
| Coherence | Cross-core invalidation | Shared memory dependent | ASID design, lazy shootdown |
4. Coherence/Invalidation Misses:
In multiprocessor systems, TLB entries may be invalidated due to:
- Page table updates on another core, propagated via TLB shootdown IPIs
- Unmapping or permission changes (e.g., munmap, mprotect, copy-on-write)
- Page migration between NUMA nodes
Characteristics:
- Frequency depends on how much memory is shared and how often mappings change
- Mitigated by careful ASID design and lazy shootdown strategies
5. Context Switch Misses:
When ASID exhaustion forces TLB flush on context switch:
- The switched-in process starts with a completely cold TLB
- Its first access to every page misses, even for pages it used moments earlier
- Frequent switching among many processes makes these misses a recurring tax
For most workloads, capacity misses dominate after warmup. The TLB simply cannot hold all actively-used translations. This is why huge page support is so valuable—a single 2MB TLB entry covers what 512 4KB entries would.
Architectures differ fundamentally in how they handle TLB misses: some use hardware-managed page walks, others software-managed TLB miss handlers. Each approach has significant implications.
Hardware-Managed TLB (x86, ARM, modern RISC-V):
The CPU itself contains a dedicated page walk engine that automatically handles TLB misses.
Software-Managed TLB (MIPS, older SPARC, some research architectures):
On TLB miss, hardware traps to a kernel exception handler that finds and installs the translation.
The TLB miss handler challenge (software-managed):
The TLB miss handler has a circular dependency problem: it must access memory (to read page tables) but memory access requires TLB entries. Solutions include:
MIPS systems dedicate the first 8-16 TLB entries to the kernel, ensuring the miss handler can always run.
The modern consensus: Hardware management wins
Most modern architectures use hardware TLB management because:
- Misses are handled without trapping to the kernel—no pipeline flush or handler overhead
- Walks can overlap with out-of-order execution and even run speculatively
- Dedicated walk hardware can exploit page walk caches and multiple parallel walkers
Modern processors don't just cache final translations—they cache intermediate page table entries to accelerate page walks.
Page Walk Cache / Paging Structure Cache:
A 4-level page walk accesses entries at each level. But for a given process, the upper levels change infrequently: a single PML4 entry covers 512 GB and a PDPT entry covers 1 GB, so consecutive accesses almost always resolve through the same upper-level entries.
By caching upper-level entries, most page walks reduce from 4 memory accesses to 1-2.
Page Walk Cache structure:
| Cache Level | Entry Coverage | Typical Entries | Benefit |
|---|---|---|---|
| PML4 cache | 512 GB | 4-16 | Rarely misses after warmup |
| PDPT cache | 1 GB | 16-32 | Covers most process memory |
| PD cache | 2 MB | 64-128 | Reduces walks to 1 access usually |
| TLB (PT level) | 4 KB/2 MB/1 GB | 64-2048 | Final translation, well-known |
The math of page walk caches:
Assuming a 99% hit rate at each cache level, checked deepest-first, the expected number of memory accesses per walk is roughly 0.99 × 1 + 0.01 × 0.99 × 2 + 0.0001 × 0.99 × 3 + 0.000001 × 4 ≈ 1.01.
With all caches: Most walks are single-access (just read the PT entry)!
Speculative page walks:
Modern out-of-order CPUs can predict TLB misses and begin page walks speculatively:
Intel's Skylake and later aggressively speculate on both L1 and L2 TLB misses.
High-end server CPUs include multiple page table walk engines that can service several TLB misses in parallel. Intel's Sapphire Rapids has 4 parallel page walkers, ensuring that heavy TLB miss workloads don't serialize on a single walker.
Diagnosing TLB performance requires hardware performance counters and specialized tools. Modern CPUs provide extensive TLB monitoring capabilities.
Key performance counters (Intel naming):
- `dTLB-load-misses`: Data TLB misses on loads
- `dTLB-store-misses`: Data TLB misses on stores
- `iTLB-misses`: Instruction TLB misses
- `dtlb_load_misses.miss_causes_a_walk`: Misses triggering page walks
- `dtlb_load_misses.walk_completed`: Completed page walks
- `dtlb_load_misses.walk_completed_4k`: Walks for 4KB pages
- `dtlb_load_misses.walk_pending`: Cycles with pending walks

Using perf to measure TLB behavior:
```shell
# Basic TLB miss measurement
perf stat -e dTLB-load-misses,dTLB-store-misses,iTLB-misses ./my_program

# Detailed TLB analysis
perf stat -e dtlb_load_misses.miss_causes_a_walk,\
  dtlb_load_misses.walk_completed,\
  dtlb_load_misses.walk_pending,\
  dtlb_load_misses.stlb_hit ./my_program

# TLB miss ratio (loads)
perf stat -e L1-dcache-loads,dTLB-load-misses ./my_program
# Miss rate = dTLB-load-misses / L1-dcache-loads
```

Interpreting TLB metrics:
| Metric | Good | Concerning | Critical |
|---|---|---|---|
| DTLB miss rate | < 0.5% | 0.5-2% | > 2% |
| ITLB miss rate | < 0.1% | 0.1-0.5% | > 0.5% |
| Walk cycles/miss | < 50 | 50-200 | > 200 |
| L2 TLB hit rate (on L1 miss) | > 90% | 70-90% | < 70% |
Common TLB performance patterns:
- Sequential scans: high hit rates—each page is touched many times before moving on
- Random access over large structures (hash tables, graph traversal): capacity misses dominate
- Very large code footprints or heavy JIT activity: iTLB misses become visible
TLB misses often masquerade as cache misses in profiling. The page walk accesses are memory references that show up as LLC misses, not as TLB events. Always measure TLB-specific counters when diagnosing memory performance issues.
Given the extreme penalty of TLB misses, optimizing for hit rate is crucial for high-performance applications.
1. Use Huge Pages:
The most effective TLB optimization is using larger pages:
| Page Size | TLB Coverage (64 entries) | Vs 4KB |
|---|---|---|
| 4 KB | 256 KB | 1× |
| 2 MB | 128 MB | 512× |
| 1 GB | 64 GB | 262,144× |
A single 2MB TLB entry covers what 512 4KB entries would. If your application has a large, stable working set, huge pages can eliminate TLB misses almost entirely.
```c
// Linux: Allocate memory with 2MB huge pages
#include <sys/mman.h>

void* ptr = mmap(NULL, size,
                 PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                 -1, 0);

// Alternative: Use transparent huge pages
madvise(ptr, size, MADV_HUGEPAGE);

// For databases: Reserve huge pages at boot
// /etc/sysctl.conf: vm.nr_hugepages = 1024
```

2. Improve Memory Locality:
Arrange data structures so related data is on the same page:
3. Reduce Working Set:
4. Align to Page Boundaries:
5. NUMA-Aware Allocation:
On multi-socket systems:
- Use `numactl` and libnuma for control

High-performance databases (PostgreSQL, MySQL, Oracle) often recommend huge pages. A database with a 100GB buffer pool at 4KB pages needs ~25 million TLB entries—impossible. With 2MB huge pages, it needs only ~50,000 entries—a 512× reduction in translation pressure.
We've thoroughly examined TLB hit and miss behavior—the binary outcome that dramatically affects memory performance. Let's consolidate the key insights:
- Hits cost 1-2 cycles; misses trigger a page walk costing hundreds of cycles in the worst case
- Hit rates must exceed ~99% for good performance—small rate differences have outsized effects
- Misses come in several flavors (compulsory, capacity, conflict, coherence, context-switch), each with its own mitigations
- Hardware page walkers, page walk caches, and speculative/parallel walks keep miss handling fast
- Huge pages are the single most effective lever for expanding TLB reach
What's next:
Understanding that hits are good and misses are expensive leads to a natural question: exactly how expensive? The next page examines Effective Access Time—the mathematical framework for quantifying memory access performance given TLB hit rates, miss penalties, and cache behavior.
You now understand TLB hit and miss behavior in depth—from the fast path of immediate translation to the slow path of page table walking, the types of misses, hardware and software management approaches, and optimization techniques for maximizing hit rates.