The TLB is a cache, and like all caches, it has limited capacity. When a TLB miss occurs and a new translation must be installed, a fundamental question arises: which existing entry should be evicted to make room?
This decision—TLB replacement—seems simple but has profound implications. Evict the wrong entry, and you'll need that translation again soon, triggering another expensive page walk. Evict wisely, and the TLB continues to serve the program's working set efficiently. A good versus a poor replacement decision can mean a difference of 100× or more in access time (a TLB hit versus a full page walk).
TLB replacement faces unique constraints compared to data cache replacement. TLBs are small (64-2048 entries), often fully associative (every entry is a candidate for eviction), and must make decisions in hardware at CPU speed. These constraints shape the replacement algorithms used in practice.
By the end of this page, you will understand why TLB replacement is challenging, the common replacement algorithms (random, FIFO, LRU, pseudo-LRU), hardware versus software implementation, special considerations like wired entries and ASIDs, and how replacement policies affect real system performance.
Before examining specific algorithms, let's establish the fundamental constraints and goals of TLB replacement.
The replacement trigger:
TLB replacement occurs when a TLB miss triggers a page walk, the walk returns a valid translation, and the TLB has no free or invalid entry left to hold it, so an existing entry must be overwritten.
Note that replacement only happens after a successful page walk. If the page walk discovers the page is not present (page fault), the OS handles it; no TLB replacement occurs until the page is loaded.
Goals of replacement policy: retain the translations most likely to be used again soon (minimizing future misses), decide quickly enough that installing the new entry isn't delayed, and do it with minimal hardware state and logic.
Unique TLB constraints vs data caches:
1. Full associativity is common:
Data caches are typically 4-16 way set-associative, limiting victim selection to one set. L1 TLBs are often fully associative, meaning any of 64-128 entries could be victims. This expands the decision space.
2. Entries cover pages, not cache lines:
A TLB entry covers 4KB-1GB, while a cache line is 64 bytes. Evicting a TLB entry affects many more potential memory accesses. The cost of a bad eviction is higher.
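To put that coverage gap in numbers: one 4KB-page entry spans 4096 / 64 = 64 cache lines' worth of addresses, and a single 1GB huge-page entry covers 2^30 / 2^6 = 2^24 (about 16.8 million) cache lines.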
3. No spatial prefetching for TLB:
Cache replacement benefits from spatial locality—nearby cache lines are often accessed. But page translations don't benefit from this. Knowing the translation for page N tells you nothing about page N+1's translation.
4. Working set is process-specific:
Context switches may invalidate entries or change the relevant working set. Replacement policy must handle transitions gracefully.
Unlike data caches where significant research has optimized replacement policies (RRIP, DRRIP, Hawkeye, etc.), TLB replacement policies remain relatively simple. The small TLB size means even simple policies work reasonably well, and hardware complexity favors simplicity.
The simplest replacement policy: randomly select an entry to evict. This may seem naive, but it's surprisingly effective and widely used in TLBs.
How random replacement works: on each miss, pick a victim index (pseudo-)uniformly at random and overwrite that entry; no per-access bookkeeping is needed.
Implementation:
Random selection typically uses a Linear Feedback Shift Register (LFSR)—a simple hardware circuit that produces a deterministic but seemingly random sequence of indices. The LFSR advances on each replacement, providing the next "random" victim index.
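As a concrete sketch (not any particular CPU's design), here is a 6-bit LFSR-based victim selector for a 64-entry TLB; the taps, seed, and names are illustrative:

```c
#include <stdint.h>

#define TLB_SIZE 64

/* 6-bit maximal-length LFSR using taps for x^6 + x^5 + 1.
 * It cycles through all 63 nonzero states, which is uniform enough
 * for victim selection; index 0 is simply never produced (a real
 * design would handle this, e.g., by XOR-ing in an offset). */
static uint8_t lfsr = 0x2A;  /* any nonzero seed works */

int random_victim(void) {
    uint8_t bit = ((lfsr >> 5) ^ (lfsr >> 4)) & 1;  /* feedback from taps */
    lfsr = (uint8_t)(((lfsr << 1) | bit) & 0x3F);   /* shift in, keep 6 bits */
    return lfsr;                                    /* victim index 1..63 */
}
```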
Why random works well for TLBs:
Working sets are often smaller than TLB: If your program's hot pages fit in the TLB, it doesn't matter which cold entry you evict—they're all cold.
Probabilistic protection of hot entries: With n entries and 1 replacement, probability of evicting any specific entry is 1/n. A hot entry will likely survive many replacements.
No pathological patterns: LRU can thrash if the working set is exactly TLB size + 1; random doesn't thrash.
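To put numbers on the probabilistic-protection argument: with n entries, a given entry survives k consecutive random replacements with probability (1 − 1/n)^k. For n = 64 and k = 64, that's (63/64)^64 ≈ e^(−1) ≈ 0.37, and any hot entry that does get evicted is simply reinstalled on its next access, at the cost of one page walk.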
Example: MIPS TLB
Classic MIPS R4000 uses random replacement for its TLB. The CPU maintains a "Random" register that cycles through valid TLB indices, excluding wired entries. On TLB miss, the entry at index Random is replaced.
// Simplified MIPS random update
Random = (Random == Wired) ? (TLB_SIZE - 1) : (Random - 1)
On real hardware the Random register decrements continuously as instructions execute (not only on TLB activity), wrapping from Wired back to TLB_SIZE - 1 so that wired (pinned) entries are never selected.
Random replacement establishes a useful baseline: any policy worse than random is definitively bad, and many complex policies only marginally beat random. For TLBs, 'marginally better than random' often isn't worth the hardware cost.
Least Recently Used (LRU) is the theoretically attractive replacement policy: evict the entry that hasn't been used for the longest time. The intuition is that recently-used translations are likely to be used again (temporal locality), so we should keep them.
True LRU for TLBs:
Implementing exact LRU requires tracking the access order of all entries:
For a TLB with n entries, this requires O(log n) bits per entry and O(n) updates on every access—expensive!
For a 64-entry TLB: log2(64) = 6 age bits per entry × 64 entries = 384 bits of state, with up to 64 counter updates on every single access.
True LRU is rarely used for TLBs due to this overhead.
Pseudo-LRU (PLRU) algorithms:
PLRU algorithms approximate LRU with much less hardware. The most common is Tree PLRU:
Tree PLRU operation: the TLB entries sit at the leaves of a binary tree whose internal nodes each hold a single bit pointing toward the less-recently-used half. On every hit, the bits along the accessed entry's path are set to point away from it; on a miss, the bits are followed from the root to pick the victim.
For 64 entries: 63 internal nodes, so the whole policy needs just 63 bits of state (see the table below) and log2(64) = 6 bit updates per access.
How Tree PLRU selects a victim:
victim = root
while victim is an internal node:
    if victim's pointer bit points left:
        victim = left child
    else:
        victim = right child
// victim is now a leaf (TLB entry)
return victim
The victim is approximately (not exactly) the LRU entry. Tree PLRU misses may be 5-10% higher than true LRU but uses far less hardware.
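As a concrete sketch, here is an array-based Tree PLRU for a 64-entry fully associative TLB, covering both victim selection and the touch-update; the names (plru_bits, plru_touch) are illustrative, not from any particular design:

```c
#include <stdint.h>

#define ENTRIES  64
#define INTERNAL (ENTRIES - 1)   /* 63 internal nodes, one bit each */

/* Heap-style complete binary tree: node n has children 2n+1 and 2n+2;
 * leaves occupy heap indices INTERNAL..2*INTERNAL. Bit = 0 means the
 * less-recently-used side is the left child, 1 means the right. */
static uint8_t plru_bits[INTERNAL];

/* On a miss: follow the bits from the root to the approximate-LRU leaf. */
int plru_victim(void) {
    int node = 0;
    while (node < INTERNAL)
        node = 2 * node + 1 + plru_bits[node];   /* +0 = left, +1 = right */
    return node - INTERNAL;                      /* leaf index = TLB entry */
}

/* On every hit (and after installing a new entry): flip the bits on the
 * path so they point AWAY from the just-used entry. */
void plru_touch(int entry) {
    int node = entry + INTERNAL;                 /* heap index of the leaf */
    while (node != 0) {
        int parent = (node - 1) / 2;
        plru_bits[parent] = (node == 2 * parent + 1) ? 1 : 0;
        node = parent;
    }
}
```

After installing a translation at plru_victim()'s index, calling plru_touch(victim) keeps the fresh entry from being immediately re-selected; note the update cost is exactly the promised O(log n) = 6 bit writes.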
| Method | State Bits (64 entries) | Update Cost | Accuracy | Hardware Complexity |
|---|---|---|---|---|
| True LRU | ~384 bits | O(n) operations | Perfect | High (counters, comparators) |
| Tree PLRU | 63 bits | O(log n) operations | ~95% of LRU | Low (bit flips) |
| Bit PLRU | 64 bits | O(1) operation | ~85% of LRU | Very Low |
| Random | ~6 bits (LFSR) | O(1) operation | ~80% of LRU | Minimal |
Bit PLRU (NRU - Not Recently Used):
An even simpler approximation: keep one "referenced" bit per entry. Set an entry's bit whenever it is used; to replace, evict any entry whose bit is clear. When marking an entry would set the last remaining clear bit, reset all the other bits and begin a new round (see the sketch below).
This approximates LRU at the granularity of "referenced since last round" vs "not referenced." It's less accurate but trivially cheap.
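A minimal sketch of the bit-per-entry scheme for the same illustrative 64-entry TLB (a real design would use a priority encoder rather than a scan loop):

```c
#include <stdint.h>

#define ENTRIES 64

static uint64_t used_bits;  /* bit i set = entry i referenced this round */

/* On every access: mark the entry; if that would leave no candidates,
 * start a new round with only this entry marked. */
void nru_touch(int entry) {
    used_bits |= 1ULL << entry;
    if (used_bits == ~0ULL)
        used_bits = 1ULL << entry;
}

/* On a miss: any unmarked entry is "not recently used"; take the first. */
int nru_victim(void) {
    for (int i = 0; i < ENTRIES; i++)
        if (!(used_bits & (1ULL << i)))
            return i;
    return 0;  /* unreachable: nru_touch never leaves all bits set */
}
```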
Example: ARM TLB uses pseudo-random/PLRU hybrid
ARM Cortex designs often use a combination of these simple policies; exact details vary by core and are frequently undocumented, though some ARM cores have let the OS select between pseudo-random and round-robin replacement via a CP15 control bit. The common theme is cheap approximate recency tracking rather than true LRU.
First-In-First-Out (FIFO) replaces the entry that has been in the TLB longest, regardless of how recently it was used.
FIFO implementation:
// FIFO replacement
victim = entries[victim_pointer]
entries[victim_pointer] = new_entry // install the new translation
victim_pointer = (victim_pointer + 1) % TLB_SIZE
FIFO characteristics: trivially cheap (one pointer, O(1) updates) and fair in that every entry gets equal residency time, but it ignores recency entirely: the oldest entry is evicted even if it is the hottest translation in the TLB.
Bélády's Anomaly:
FIFO exhibits a counterintuitive property: adding more TLB entries can sometimes increase the miss rate! This is called Bélády's anomaly.
Example access sequence: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
With 3 entries: 9 misses
With 4 entries: 10 misses (!)
This anomaly occurs because FIFO doesn't consider recency. More entries can create different replacement timing that's accidentally worse.
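A quick way to see the anomaly is to simulate FIFO directly; this standalone sketch replays the sequence above with 3 and then 4 entries (all names are illustrative):

```c
#include <stdio.h>
#include <string.h>

/* Replay an access sequence against a FIFO-managed TLB and count misses. */
static int fifo_misses(const int *seq, int n, int size) {
    int slots[8];
    int next = 0, misses = 0;
    memset(slots, -1, sizeof slots);         /* start with an empty TLB */
    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int j = 0; j < size; j++)
            if (slots[j] == seq[i]) hit = 1;
        if (!hit) {
            slots[next] = seq[i];            /* overwrite the oldest entry */
            next = (next + 1) % size;
            misses++;
        }
    }
    return misses;
}

int main(void) {
    const int seq[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    const int n = (int)(sizeof seq / sizeof seq[0]);
    printf("3 entries: %d misses\n", fifo_misses(seq, n, 3));  /* prints 9  */
    printf("4 entries: %d misses\n", fifo_misses(seq, n, 4));  /* prints 10 */
    return 0;
}
```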
Clock Algorithm (Second-Chance FIFO):
A hybrid of FIFO and NRU that avoids evicting recently-used entries: entries form a circle with a reference bit each, and a "clock hand" sweeps forward on each replacement. An entry with its reference bit set has the bit cleared and is skipped; the first entry found with a clear bit becomes the victim (see the sketch below).
This gives recently-accessed entries a "second chance"—they won't be evicted until the clock hand comes around again after their reference bit was cleared.
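A minimal sketch of Clock victim selection, continuing the same illustrative 64-entry setup:

```c
#include <stdint.h>

#define ENTRIES 64

static uint8_t ref_bit[ENTRIES];  /* set on each access to the entry */
static int hand;                  /* current clock-hand position */

/* Sweep the hand: referenced entries get their bit cleared and are
 * skipped (the "second chance"); the first unreferenced entry loses.
 * Terminates within two sweeps even if every bit starts out set. */
int clock_victim(void) {
    for (;;) {
        if (ref_bit[hand]) {
            ref_bit[hand] = 0;
            hand = (hand + 1) % ENTRIES;
        } else {
            int victim = hand;
            hand = (hand + 1) % ENTRIES;
            return victim;
        }
    }
}
```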
The Clock algorithm is popular for page replacement (choosing which pages to evict from physical memory) but less common for TLB replacement. Page replacement can afford the extra logic; TLB replacement prioritizes speed. Still, some TLBs use Clock-like policies.
TLB replacement can be handled entirely in hardware, entirely in software, or with a hybrid approach. The choice affects complexity, flexibility, and performance.
Hardware-Managed Replacement (x86, modern ARM):
The CPU contains dedicated logic that detects the miss, walks the page table, selects a victim (typically using random or PLRU state), and installs the new entry.
All of this happens transparently to software. The OS doesn't even know replacements are occurring.
| Aspect | Hardware-Managed | Software-Managed |
|---|---|---|
| Replacement policy | Fixed (random, PLRU) | Programmable (any policy) |
| Performance | Fast (no trap overhead) | Slower (trap + handler) |
| Flexibility | None (hardwired) | Complete (any algorithm) |
| Policy tuning | Impossible | Per-workload optimization possible |
| Debugging | Black box | Traceable and modifiable |
| Examples | x86, ARM Cortex | MIPS, SPARC, some research CPUs |
Software-Managed Replacement (MIPS, SPARC):
On TLB miss: the hardware raises an exception and traps into an OS handler, which walks the page table, chooses a victim by whatever policy it likes, writes the new entry with privileged TLB instructions, and returns to retry the faulting access.
MIPS TLB replacement example:
// Simplified MIPS TLB miss handler
void tlb_miss_handler(void) {
vaddr_t faulting_addr = read_badvaddr(); // Get faulting address
pte_t *pte = lookup_page_table(faulting_addr); // Find translation
if (pte == NULL || !pte->valid) {
// Page fault - not a simple TLB miss
page_fault_handler(faulting_addr);
return;
}
// Have valid translation, install in TLB
int victim_index = read_random_register(); // Hardware provides random index
// Or use software-chosen victim:
// victim_index = get_lru_entry(); // Software-maintained LRU
write_tlb_entry(victim_index, faulting_addr, pte);
}
MIPS provides a "Random" register with a hardware-maintained random value, but software can override and write to any index.
Some systems combine approaches: hardware manages L1 TLB replacement for speed, while software manages L2 TLB. This balances performance (fast path in hardware) with flexibility (sophisticated policies for larger L2).
Not all TLB entries should be candidates for replacement. Some translations are so critical that evicting them would be catastrophic. These are wired or locked entries.
Why wire TLB entries: the miss handler's own code, stack, and data must always be translatable (see the recursion warning below), and a few hot kernel mappings are cheap to pin but catastrophic to miss at the wrong moment.
MIPS wired entry implementation:
MIPS R4000 has a "Wired" register. TLB entries 0 through (Wired-1) are never selected by random replacement:
// Wired = 4 means entries 0-3 are wired
// Random register only cycles through 4 to 47
if (Random < TLB_SIZE && Random >= Wired) {
// Valid victim index
} else {
// Impossible - hardware ensures Random >= Wired
}
The OS sets Wired during initialization to protect kernel entries. An illustrative layout:
| Entry Range | Contents | Why Wired |
|---|---|---|
| 0 | Exception vector page | TLB miss handler code lives here |
| 1-2 | Kernel stack mappings | Handler needs stack access |
| 3-7 | Core kernel text/data | Frequently accessed, must always hit |
| 8+ | User space / normal kernel | Normal replacement candidates |
Global entries (x86):
Alternative to wiring: mark entries as "global." Global entries are not flushed on context switch (when CR3/ASID changes). This effectively pins kernel translations while allowing user translations to be replaced.
// x86 PTE flags
#define PTE_GLOBAL (1 << 8) // Entry not flushed on CR3 write
// Kernel mappings are global
kernel_pte.flags |= PTE_GLOBAL;
// User mappings are not global
user_pte.flags &= ~PTE_GLOBAL;
With PCID/ASID support, global entries coexist with multiple process address spaces in the TLB. They're shared across all ASIDs.
Trade-offs of wired entries: every wired slot shrinks the pool available for normal replacement, raising the miss rate for everything else, so the OS should wire only what correctness strictly requires.
If the TLB miss handler's code or data isn't in the TLB, handling the miss causes another miss, causing another handler invocation—infinite recursion! Wired entries prevent this by ensuring the handler is always accessible.
Address Space Identifiers (ASIDs) tag TLB entries with their owning process. This creates interesting interactions with replacement policy.
Traditional (pre-ASID) behavior:
Without ASIDs, context switching required flushing the entire TLB:
// Context switch: old process → new process
flush_entire_tlb(); // All entries become invalid
load_new_page_table();
// New process starts with cold TLB
Every context switch reset the TLB—terrible for systems with frequent switching.
With ASIDs:
Each TLB entry is tagged with the ASID of its owning process:
TLB Entry: [ASID | VPN | PFN | flags]
On lookup, current ASID must match entry ASID (or entry must be global):
match = (entry.VPN == lookup_VPN) &&
(entry.ASID == current_ASID || entry.global);
Context switches update the current ASID but don't flush the TLB. Entries from multiple processes coexist.
Replacement implications:
1. Should replacement consider ASID?
Option A: Replace any entry regardless of ASID. Simplest hardware, but the running process's hot entries compete with stale entries left behind by processes that may not run again soon.
Option B: Prefer replacing entries from non-current ASIDs. Protects the current working set at the cost of ASID comparisons in the victim-selection logic.
Most hardware uses Option A (ASID-unaware replacement) for simplicity.
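For illustration only, a software rendering of Option B's preference might look like the following; tlb_entry_t and pick_victim are hypothetical names, not a real interface:

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint16_t asid;     /* owning address space */
    uint8_t  global;   /* shared across all ASIDs, never preferred */
    /* VPN, PFN, and permission flags omitted for brevity */
} tlb_entry_t;

/* Prefer evicting an entry that belongs to some other address space;
 * fall back to a random victim only when every entry is ours. */
int pick_victim(const tlb_entry_t *tlb, int size, uint16_t cur_asid) {
    for (int i = 0; i < size; i++)
        if (!tlb[i].global && tlb[i].asid != cur_asid)
            return i;
    return rand() % size;
}
```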
2. ASID exhaustion:
ASIDs have limited bits (8-16 typically, supporting 256-65536 ASIDs). When ASIDs are exhausted:
// Need new ASID, but all are in use
if (no_free_asids()) {
flush_all_non_global_entries();
reset_asid_allocator();
assign_new_asid(process);
}
This "ASID rollover" flushes entries for all processes, effectively restarting the TLB. Entries from the flushed ASIDs are now invalid despite not being explicitly replaced.
3. Lazy TLB invalidation:
When unmapping a page or tearing down a process, the OS must invalidate the corresponding TLB entries. With ASIDs this doesn't require a full flush:
// Efficient per-process flush
invalidate_tlb_by_asid(old_process_asid);
Research has explored ASID-aware replacement that predicts which process will run next and biases replacement toward other processes' entries. While effective in simulation, the hardware complexity has prevented widespread adoption.
We've comprehensively examined TLB replacement—the decision of which translation to evict when installing a new one. The key insights: replacement happens only after a successful page walk; random replacement is a cheap and surprisingly strong baseline; Tree PLRU buys near-LRU accuracy for a handful of bits; software-managed TLBs trade trap overhead for complete policy flexibility; and wired entries, global bits, and ASIDs carve out translations that replacement must not, or need not, touch.
Module Complete:
This page concludes our comprehensive exploration of the Translation Lookaside Buffer, covering its fundamental purpose, its hardware implementation, and the replacement policies examined here.
You now have deep, operational understanding of one of the most critical hardware structures in modern computing—the enabler that makes virtual memory practical.
Congratulations! You've mastered the Translation Lookaside Buffer—from its fundamental purpose through its hardware implementation to its replacement policies. This knowledge is essential for OS development, systems performance engineering, and understanding modern computer architecture.