The TLB is a cache, and like all caches, it has limited capacity. When a TLB miss occurs and a new translation must be installed, a fundamental question arises: which existing entry should be evicted to make room?
This decision—TLB replacement—seems simple but has profound implications. Evict the wrong entry, and you'll need that translation again soon, triggering another expensive page walk. Evict wisely, and the TLB continues to serve the program's working set efficiently. A good versus a poor replacement decision can mean a difference of 100× or more in access time (a TLB hit versus a full page walk).
TLB replacement faces unique constraints compared to data cache replacement. TLBs are small (64-2048 entries), often fully associative (every entry is a candidate for eviction), and must make decisions in hardware at CPU speed. These constraints shape the replacement algorithms used in practice.
By the end of this page, you will understand why TLB replacement is challenging, the common replacement algorithms (random, FIFO, LRU, pseudo-LRU), hardware versus software implementation, special considerations like wired entries and ASIDs, and how replacement policies affect real system performance.
Before examining specific algorithms, let's establish the fundamental constraints and goals of TLB replacement.
The replacement trigger:
TLB replacement occurs when a TLB miss triggers a page walk, the walk returns a valid translation, and the TLB has no free or invalid entry left to hold it, so an existing entry must be overwritten.
Note that replacement only happens after a successful page walk. If the page walk discovers the page is not present (page fault), the OS handles it; no TLB replacement occurs until the page is loaded.
Goals of replacement policy: retain the translations most likely to be used again soon (minimizing future misses), decide quickly enough that installing the new entry isn't delayed, and do it with minimal hardware state and logic.
Unique TLB constraints vs data caches:
1. Full associativity is common:
Data caches are typically 4-16 way set-associative, limiting victim selection to one set. L1 TLBs are often fully associative, meaning any of 64-128 entries could be victims. This expands the decision space.
2. Entries cover pages, not cache lines:
A TLB entry covers 4KB-1GB, while a cache line is 64 bytes. Evicting a TLB entry affects many more potential memory accesses. The cost of a bad eviction is higher.
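To put that coverage gap in numbers: one 4KB-page entry spans 4096 / 64 = 64 cache lines' worth of addresses, and a single 1GB huge-page entry covers 2^30 / 2^6 = 2^24 (about 16.8 million) cache lines.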
3. No spatial prefetching for TLB:
Cache replacement benefits from spatial locality—nearby cache lines are often accessed. But page translations don't benefit from this. Knowing the translation for page N tells you nothing about page N+1's translation.
4. Working set is process-specific:
Context switches may invalidate entries or change the relevant working set. Replacement policy must handle transitions gracefully.
Unlike data caches where significant research has optimized replacement policies (RRIP, DRRIP, Hawkeye, etc.), TLB replacement policies remain relatively simple. The small TLB size means even simple policies work reasonably well, and hardware complexity favors simplicity.
The simplest replacement policy: randomly select an entry to evict. This may seem naive, but it's surprisingly effective and widely used in TLBs.
How random replacement works: on each miss, pick a victim index (pseudo-)uniformly at random and overwrite that entry; no per-access bookkeeping is needed.
Implementation:
Random selection typically uses a Linear Feedback Shift Register (LFSR)—a simple hardware circuit that produces a deterministic but seemingly random sequence of indices. The LFSR advances on each replacement, providing the next "random" victim index.
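As a concrete sketch (not any particular CPU's design), here is a 6-bit LFSR-based victim selector for a 64-entry TLB; the taps, seed, and names are illustrative:

```c
#include <stdint.h>

#define TLB_SIZE 64

/* 6-bit maximal-length LFSR using taps for x^6 + x^5 + 1.
 * It cycles through all 63 nonzero states, which is uniform enough
 * for victim selection; index 0 is simply never produced (a real
 * design would handle this, e.g., by XOR-ing in an offset). */
static uint8_t lfsr = 0x2A;  /* any nonzero seed works */

int random_victim(void) {
    uint8_t bit = ((lfsr >> 5) ^ (lfsr >> 4)) & 1;  /* feedback from taps */
    lfsr = (uint8_t)(((lfsr << 1) | bit) & 0x3F);   /* shift in, keep 6 bits */
    return lfsr;                                    /* victim index 1..63 */
}
```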
Why random works well for TLBs:
Working sets are often smaller than TLB: If your program's hot pages fit in the TLB, it doesn't matter which cold entry you evict—they're all cold.
Probabilistic protection of hot entries: With n entries and 1 replacement, probability of evicting any specific entry is 1/n. A hot entry will likely survive many replacements.
No pathological patterns: LRU can thrash if the working set is exactly TLB size + 1; random doesn't thrash.
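To put numbers on the probabilistic-protection argument: with n entries, a given entry survives k consecutive random replacements with probability (1 − 1/n)^k. For n = 64 and k = 64, that's (63/64)^64 ≈ e^(−1) ≈ 0.37, and any hot entry that does get evicted is simply reinstalled on its next access, at the cost of one page walk.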
Example: MIPS TLB
Classic MIPS R4000 uses random replacement for its TLB. The CPU maintains a "Random" register that cycles through valid TLB indices, excluding wired entries. On TLB miss, the entry at index Random is replaced.
// Simplified MIPS random update
Random = (Random == Wired) ? (TLB_SIZE - 1) : (Random - 1)
On real hardware the Random register decrements continuously as instructions execute (not only on TLB activity), wrapping from Wired back to TLB_SIZE - 1 so that wired (pinned) entries are never selected.
Random replacement establishes a useful baseline: any policy worse than random is definitively bad, and many complex policies only marginally beat random. For TLBs, 'marginally better than random' often isn't worth the hardware cost.
Least Recently Used (LRU) is the theoretically attractive replacement policy: evict the entry that hasn't been used for the longest time. The intuition is that recently-used translations are likely to be used again (temporal locality), so we should keep them.
True LRU for TLBs:
Implementing exact LRU requires tracking the access order of all entries:
For a TLB with n entries, this requires O(log n) bits per entry and O(n) updates on every access—expensive!
For a 64-entry TLB: log2(64) = 6 age bits per entry × 64 entries = 384 bits of state, with up to 64 counter updates on every single access.
True LRU is rarely used for TLBs due to this overhead.
Pseudo-LRU (PLRU) algorithms:
PLRU algorithms approximate LRU with much less hardware. The most common is Tree PLRU:
Tree PLRU operation: the TLB entries sit at the leaves of a binary tree whose internal nodes each hold a single bit pointing toward the less-recently-used half. On every hit, the bits along the accessed entry's path are set to point away from it; on a miss, the bits are followed from the root to pick the victim.
For 64 entries: 63 internal nodes, so the whole policy needs just 63 bits of state (see the table below) and log2(64) = 6 bit updates per access.
How Tree PLRU selects a victim:
victim = root
while victim is an internal node:
    if victim's pointer bit points left:
        victim = left child
    else:
        victim = right child
// victim is now a leaf (TLB entry)
return victim
The victim is approximately (not exactly) the LRU entry. Tree PLRU misses may be 5-10% higher than true LRU but uses far less hardware.
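As a concrete sketch, here is an array-based Tree PLRU for a 64-entry fully associative TLB, covering both victim selection and the touch-update; the names (plru_bits, plru_touch) are illustrative, not from any particular design:

```c
#include <stdint.h>

#define ENTRIES  64
#define INTERNAL (ENTRIES - 1)   /* 63 internal nodes, one bit each */

/* Heap-style complete binary tree: node n has children 2n+1 and 2n+2;
 * leaves occupy heap indices INTERNAL..2*INTERNAL. Bit = 0 means the
 * less-recently-used side is the left child, 1 means the right. */
static uint8_t plru_bits[INTERNAL];

/* On a miss: follow the bits from the root to the approximate-LRU leaf. */
int plru_victim(void) {
    int node = 0;
    while (node < INTERNAL)
        node = 2 * node + 1 + plru_bits[node];   /* +0 = left, +1 = right */
    return node - INTERNAL;                      /* leaf index = TLB entry */
}

/* On every hit (and after installing a new entry): flip the bits on the
 * path so they point AWAY from the just-used entry. */
void plru_touch(int entry) {
    int node = entry + INTERNAL;                 /* heap index of the leaf */
    while (node != 0) {
        int parent = (node - 1) / 2;
        plru_bits[parent] = (node == 2 * parent + 1) ? 1 : 0;
        node = parent;
    }
}
```

After installing a translation at plru_victim()'s index, calling plru_touch(victim) keeps the fresh entry from being immediately re-selected; note the update cost is exactly the promised O(log n) = 6 bit writes.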
| Method | State Bits (64 entries) | Update Cost | Accuracy | Hardware Complexity |
|---|---|---|---|---|
| True LRU | ~384 bits | O(n) operations | Perfect | High (counters, comparators) |
| Tree PLRU | 63 bits | O(log n) operations | ~95% of LRU | Low (bit flips) |
| Bit PLRU | 64 bits | O(1) operation | ~85% of LRU | Very Low |
| Random | ~6 bits (LFSR) | O(1) operation | ~80% of LRU | Minimal |
Bit PLRU (NRU - Not Recently Used):
An even simpler approximation: keep one "referenced" bit per entry. Set an entry's bit whenever it is used; to replace, evict any entry whose bit is clear. When marking an entry would set the last remaining clear bit, reset all the other bits and begin a new round (see the sketch below).
This approximates LRU at the granularity of "referenced since last round" vs "not referenced." It's less accurate but trivially cheap.
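A minimal sketch of the bit-per-entry scheme for the same illustrative 64-entry TLB (a real design would use a priority encoder rather than a scan loop):

```c
#include <stdint.h>

#define ENTRIES 64

static uint64_t used_bits;  /* bit i set = entry i referenced this round */

/* On every access: mark the entry; if that would leave no candidates,
 * start a new round with only this entry marked. */
void nru_touch(int entry) {
    used_bits |= 1ULL << entry;
    if (used_bits == ~0ULL)
        used_bits = 1ULL << entry;
}

/* On a miss: any unmarked entry is "not recently used"; take the first. */
int nru_victim(void) {
    for (int i = 0; i < ENTRIES; i++)
        if (!(used_bits & (1ULL << i)))
            return i;
    return 0;  /* unreachable: nru_touch never leaves all bits set */
}
```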
Example: ARM TLB uses pseudo-random/PLRU hybrid
ARM Cortex designs often use a combination of these simple policies; exact details vary by core and are frequently undocumented, though some ARM cores have let the OS select between pseudo-random and round-robin replacement via a CP15 control bit. The common theme is cheap approximate recency tracking rather than true LRU.
First-In-First-Out (FIFO) replaces the entry that has been in the TLB longest, regardless of how recently it was used.
FIFO implementation:
// FIFO replacement
victim = entries[victim_pointer]
entries[victim_pointer] = new_entry // install the new translation
victim_pointer = (victim_pointer + 1) % TLB_SIZE
FIFO characteristics: trivially cheap (one pointer, O(1) updates) and fair in that every entry gets equal residency time, but it ignores recency entirely: the oldest entry is evicted even if it is the hottest translation in the TLB.
Bélády's Anomaly:
FIFO exhibits a counterintuitive property: adding more TLB entries can sometimes increase the miss rate! This is called Bélády's anomaly.
Example access sequence: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
With 3 entries: 9 misses
With 4 entries: 10 misses (!)
This anomaly occurs because FIFO doesn't consider recency. More entries can create different replacement timing that's accidentally worse.
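A quick way to see the anomaly is to simulate FIFO directly; this standalone sketch replays the sequence above with 3 and then 4 entries (all names are illustrative):

```c
#include <stdio.h>
#include <string.h>

/* Replay an access sequence against a FIFO-managed TLB and count misses. */
static int fifo_misses(const int *seq, int n, int size) {
    int slots[8];
    int next = 0, misses = 0;
    memset(slots, -1, sizeof slots);         /* start with an empty TLB */
    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int j = 0; j < size; j++)
            if (slots[j] == seq[i]) hit = 1;
        if (!hit) {
            slots[next] = seq[i];            /* overwrite the oldest entry */
            next = (next + 1) % size;
            misses++;
        }
    }
    return misses;
}

int main(void) {
    const int seq[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    const int n = (int)(sizeof seq / sizeof seq[0]);
    printf("3 entries: %d misses\n", fifo_misses(seq, n, 3));  /* prints 9  */
    printf("4 entries: %d misses\n", fifo_misses(seq, n, 4));  /* prints 10 */
    return 0;
}
```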
Clock Algorithm (Second-Chance FIFO):
A hybrid of FIFO and NRU that avoids evicting recently-used entries: entries form a circle with a reference bit each, and a "clock hand" sweeps forward on each replacement. An entry with its reference bit set has the bit cleared and is skipped; the first entry found with a clear bit becomes the victim (see the sketch below).
This gives recently-accessed entries a "second chance"—they won't be evicted until the clock hand comes around again after their reference bit was cleared.
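A minimal sketch of Clock victim selection, continuing the same illustrative 64-entry setup:

```c
#include <stdint.h>

#define ENTRIES 64

static uint8_t ref_bit[ENTRIES];  /* set on each access to the entry */
static int hand;                  /* current clock-hand position */

/* Sweep the hand: referenced entries get their bit cleared and are
 * skipped (the "second chance"); the first unreferenced entry loses.
 * Terminates within two sweeps even if every bit starts out set. */
int clock_victim(void) {
    for (;;) {
        if (ref_bit[hand]) {
            ref_bit[hand] = 0;
            hand = (hand + 1) % ENTRIES;
        } else {
            int victim = hand;
            hand = (hand + 1) % ENTRIES;
            return victim;
        }
    }
}
```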
The Clock algorithm is popular for page replacement (choosing which pages to evict from physical memory) but less common for TLB replacement. Page replacement can afford the extra logic; TLB replacement prioritizes speed. Still, some TLBs use Clock-like policies.
TLB replacement can be handled entirely in hardware, entirely in software, or with a hybrid approach. The choice affects complexity, flexibility, and performance.
Hardware-Managed Replacement (x86, modern ARM):
The CPU contains dedicated logic that detects the miss, walks the page table, selects a victim (typically using random or PLRU state), and installs the new entry.
All of this happens transparently to software. The OS doesn't even know replacements are occurring.
| Aspect | Hardware-Managed | Software-Managed |
|---|---|---|
| Replacement policy | Fixed (random, PLRU) | Programmable (any policy) |
| Performance | Fast (no trap overhead) | Slower (trap + handler) |
| Flexibility | None (hardwired) | Complete (any algorithm) |
| Policy tuning | Impossible | Per-workload optimization possible |
| Debugging | Black box | Traceable and modifiable |
| Examples | x86, ARM Cortex | MIPS, SPARC, some research CPUs |
Software-Managed Replacement (MIPS, SPARC):
On TLB miss: the hardware raises an exception and traps into an OS handler, which walks the page table, chooses a victim by whatever policy it likes, writes the new entry with privileged TLB instructions, and returns to retry the faulting access.
MIPS TLB replacement example:
// Simplified MIPS TLB miss handler
void tlb_miss_handler(void) {
vaddr_t faulting_addr = read_badvaddr(); // Get faulting address
pte_t *pte = lookup_page_table(faulting_addr); // Find translation
if (pte == NULL || !pte->valid) {
// Page fault - not a simple TLB miss
page_fault_handler(faulting_addr);
return;
}
// Have valid translation, install in TLB
int victim_index = read_random_register(); // Hardware provides random index
// Or use software-chosen victim:
// victim_index = get_lru_entry(); // Software-maintained LRU
write_tlb_entry(victim_index, faulting_addr, pte);
}
MIPS provides a "Random" register with a hardware-maintained random value, but software can override and write to any index.
Some systems combine approaches: hardware manages L1 TLB replacement for speed, while software manages L2 TLB. This balances performance (fast path in hardware) with flexibility (sophisticated policies for larger L2).
Not all TLB entries should be candidates for replacement. Some translations are so critical that evicting them would be catastrophic. These are wired or locked entries.
Why wire TLB entries: the miss handler's own code, stack, and data must always be translatable (see the recursion warning below), and a few hot kernel mappings are cheap to pin but catastrophic to miss at the wrong moment.
MIPS wired entry implementation:
MIPS R4000 has a "Wired" register. TLB entries 0 through (Wired-1) are never selected by random replacement:
// Wired = 4 means entries 0-3 are wired
// Random register only cycles through 4 to 47
if (Random < TLB_SIZE && Random >= Wired) {
// Valid victim index
} else {
// Impossible - hardware ensures Random >= Wired
}
The OS sets Wired during initialization to protect kernel entries. An illustrative layout:
| Entry Range | Contents | Why Wired |
|---|---|---|
| 0 | Exception vector page | TLB miss handler code lives here |
| 1-2 | Kernel stack mappings | Handler needs stack access |
| 3-7 | Core kernel text/data | Frequently accessed, must always hit |
| 8+ | User space / normal kernel | Normal replacement candidates |
Global entries (x86):
Alternative to wiring: mark entries as "global." Global entries are not flushed on context switch (when CR3/ASID changes). This effectively pins kernel translations while allowing user translations to be replaced.
// x86 PTE flags
#define PTE_GLOBAL (1 << 8) // Entry not flushed on CR3 write
// Kernel mappings are global
kernel_pte.flags |= PTE_GLOBAL;
// User mappings are not global
user_pte.flags &= ~PTE_GLOBAL;
With PCID/ASID support, global entries coexist with multiple process address spaces in the TLB. They're shared across all ASIDs.
Trade-offs of wired entries: every wired slot shrinks the pool available for normal replacement, raising the miss rate for everything else, so the OS should wire only what correctness strictly requires.
If the TLB miss handler's code or data isn't in the TLB, handling the miss causes another miss, causing another handler invocation—infinite recursion! Wired entries prevent this by ensuring the handler is always accessible.
Address Space Identifiers (ASIDs) tag TLB entries with their owning process. This creates interesting interactions with replacement policy.
Traditional (pre-ASID) behavior:
Without ASIDs, context switching required flushing the entire TLB:
// Context switch: old process → new process
flush_entire_tlb(); // All entries become invalid
load_new_page_table();
// New process starts with cold TLB
Every context switch reset the TLB—terrible for systems with frequent switching.
With ASIDs:
Each TLB entry is tagged with the ASID of its owning process:
TLB Entry: [ASID | VPN | PFN | flags]
On lookup, current ASID must match entry ASID (or entry must be global):
match = (entry.VPN == lookup_VPN) &&
(entry.ASID == current_ASID || entry.global);
Context switches update the current ASID but don't flush the TLB. Entries from multiple processes coexist.
Replacement implications:
1. Should replacement consider ASID?
Option A: Replace any entry regardless of ASID. Simplest hardware, but the running process's hot entries compete with stale entries left behind by processes that may not run again soon.
Option B: Prefer replacing entries from non-current ASIDs. Protects the current working set at the cost of ASID comparisons in the victim-selection logic.
Most hardware uses Option A (ASID-unaware replacement) for simplicity.
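For illustration only, a software rendering of Option B's preference might look like the following; tlb_entry_t and pick_victim are hypothetical names, not a real interface:

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint16_t asid;     /* owning address space */
    uint8_t  global;   /* shared across all ASIDs, never preferred */
    /* VPN, PFN, and permission flags omitted for brevity */
} tlb_entry_t;

/* Prefer evicting an entry that belongs to some other address space;
 * fall back to a random victim only when every entry is ours. */
int pick_victim(const tlb_entry_t *tlb, int size, uint16_t cur_asid) {
    for (int i = 0; i < size; i++)
        if (!tlb[i].global && tlb[i].asid != cur_asid)
            return i;
    return rand() % size;
}
```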
2. ASID exhaustion:
ASIDs have limited bits (8-16 typically, supporting 256-65536 ASIDs). When ASIDs are exhausted:
// Need new ASID, but all are in use
if (no_free_asids()) {
flush_all_non_global_entries();
reset_asid_allocator();
assign_new_asid(process);
}
This "ASID rollover" flushes entries for all processes, effectively restarting the TLB. Entries from the flushed ASIDs are now invalid despite not being explicitly replaced.
3. Lazy TLB invalidation:
When unmapping a page or tearing down a process, the OS must invalidate the corresponding TLB entries. With ASIDs this doesn't require a full flush:
// Efficient per-process flush
invalidate_tlb_by_asid(old_process_asid);
Research has explored ASID-aware replacement that predicts which process will run next and biases replacement toward other processes' entries. While effective in simulation, the hardware complexity has prevented widespread adoption.
We've comprehensively examined TLB replacement—the decision of which translation to evict when installing a new one. The key insights: replacement happens only after a successful page walk; random replacement is a cheap and surprisingly strong baseline; Tree PLRU buys near-LRU accuracy for a handful of bits; software-managed TLBs trade trap overhead for complete policy flexibility; and wired entries, global bits, and ASIDs carve out translations that replacement must not, or need not, touch.
Module Complete:
This page concludes our comprehensive exploration of the Translation Lookaside Buffer, covering its fundamental purpose, its hardware implementation, and the replacement policies examined here.
You now have deep, operational understanding of one of the most critical hardware structures in modern computing—the enabler that makes virtual memory practical.
Congratulations! You've mastered the Translation Lookaside Buffer—from its fundamental purpose through its hardware implementation to its replacement policies. This knowledge is essential for OS development, systems performance engineering, and understanding modern computer architecture.