LRU is elegant in theory: maintain recency ordering, evict the oldest page. We've shown that a doubly linked list combined with a hash table achieves O(1) operations. So why isn't every operating system running true LRU?
The answer lies in a fundamental mismatch between what LRU requires and what computer systems can practically provide. True LRU needs to know the recency order of every page based on every memory access. With modern systems executing billions of memory references per second, this requirement creates a chasm between theoretical elegance and practical implementation.
This page dissects the implementation challenges in depth, explaining precisely why pure LRU is impractical and setting the stage for the approximation techniques that make LRU-like behavior achievable.
By the end of this page, you will understand: (1) the fundamental update frequency problem, (2) hardware limitations for tracking recency, (3) software overhead analysis, (4) scalability challenges with large address spaces, (5) the trap of OS involvement in every memory access, and (6) why approximations are the only practical path forward.
The Core Issue:
LRU requires updating the recency ordering on every memory reference. Let's quantify this: a modern multi-core machine, with each core fetching and executing billions of instructions per second and most instructions touching memory at least once (instruction fetch plus any loads and stores), easily generates on the order of 10 billion memory references per second.
Every single one of these references would need to:
- locate the page's node in the recency structure (a hash table lookup),
- unlink the node from its current position in the list,
- relink it at the MRU head, and
- do all of this atomically with respect to every other core.
Even with O(1) operations, performing this 10+ billion times per second is impossible in software.
| Approach | Cycles per Update | Updates/Second | Total Cycles/Second | % of CPU |
|---|---|---|---|---|
| Pure software LRU | 50-100 | 10 billion | 500B-1000B | 100% (impossible) |
| Lock-protected software | 200-500 | 10 billion | 2T-5T | 100% (impossible) |
| Hardware-assisted software | 10-20 | 10 billion | 100B-200B | 30-60% (unusable) |
| No recency tracking | 0 | N/A | 0 | 0% ✓ |
The Mathematics of Impossibility:
Consider a 3 GHz CPU core with 3 billion cycles per second available. If each LRU update takes just 10 cycles (extremely optimistic for software), performing 10 billion updates per second would require 100 billion cycles per second, over 30x more than the core can provide.
Even if we could reduce update cost to 1 cycle (hardware only), we'd need 10 billion cycles per second just for LRU tracking—consuming over 3x the CPU capacity.
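A quick back-of-the-envelope check of these figures, sketched in C. The 10-billion-references-per-second rate and the per-update costs are the working assumptions used above, not measurements:

```c
#include <stdio.h>

int main(void) {
    const double refs_per_sec    = 10e9;   // assumed memory references per second
    const double cycles_per_core = 3e9;    // one 3 GHz core
    const double cost[] = {10.0, 1.0};     // cycles per LRU update: software vs. pure hardware

    for (int i = 0; i < 2; i++) {
        double needed = refs_per_sec * cost[i];               // cycles/second spent on tracking
        printf("%2.0f cycles/update -> %.1fx one core's cycle budget\n",
               cost[i], needed / cycles_per_core);            // prints ~33.3x and ~3.3x
    }
    return 0;
}
```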
Conclusion: No software-based true LRU can work. Hardware-based true LRU would consume an impractical fraction of CPU resources.
This is not an engineering challenge to overcome—it's a fundamental impossibility given current technology.
The CPU executes memory references at billions per second in its critical path. Any tracking mechanism that adds overhead to this path is immediately impractical. This is why true LRU cannot be implemented in software and why even hardware support must be minimal.
Could hardware solve the LRU tracking problem? In theory, yes. In practice, hardware limitations make true LRU impractical even with silicon assistance.
Why CPU Caches Can Do LRU (approximately):
L1/L2/L3 caches do implement pseudo-LRU or true LRU for small associativity (2-16 ways). This works because:
- each cache set tracks only a handful of candidate lines (at most ~16),
- encoding their order takes just a few bits per set,
- updates are local to one set rather than a global ordering, and
- the logic lives entirely in dedicated hardware, off the software path.
Why Page Tables Cannot:
Page tables have fundamentally different characteristics:
| Aspect | CPU Cache | Page Tables | Comparison |
|---|---|---|---|
| Entries to track | 16 (per set) | 1,000,000+ | 62,500x more |
| Bits for ordering | 4-5 per set | 20+ per entry | 20M+ bits total |
| Update frequency | Every cache access | Every memory ref | Similar |
| Update scope | One set | Global ordering | Much harder |
| Hardware cost | Minimal per set | Massive global | Impractical |
| Encoding | Position bits | Counter/timestamp | Much larger |
The Storage Problem:
To maintain true LRU ordering for n pages, each page needs a position value from 1 to n. This requires log₂(n) bits per page.
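As a rough illustration, here is a sketch with assumed figures (256 GB of RAM and 4 KB pages, numbers not taken from the original) showing how much ordering metadata exact LRU would need:

```c
// Rough storage estimate for exact per-page LRU ordering (illustrative assumptions).
#include <stdio.h>
#include <math.h>

int main(void) {
    const double ram_bytes     = 256.0 * 1024 * 1024 * 1024;  // assume 256 GB of RAM
    const double pages         = ram_bytes / 4096.0;           // ~67 million 4 KB pages
    const double bits_per_page = ceil(log2(pages));            // ~26 bits for a position 1..n
    const double total_mb      = pages * bits_per_page / 8.0 / (1024 * 1024);

    printf("%.0f pages, %.0f bits each, ~%.0f MB of ordering state\n",
           pages, bits_per_page, total_mb);                    // ~67M pages, 26 bits, ~208 MB
    return 0;
}
```

And unlike a static table, this state would have to be rewritten on essentially every access, which is the next problem.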
The Update Circuit Problem:
When a page moves to the MRU position, all pages previously more recent must shift by one position. This is O(n) updates atomically—requiring circuits that can update millions of values simultaneously.
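To see the scale, here is a hedged software analogue of what such a circuit would have to accomplish on every single access (the `position` array, where 0 means most recently used, is purely illustrative):

```c
#include <stddef.h>

// Hypothetical: maintain an exact LRU position for every resident page.
// position[p] == 0 means page p is the most recently used.
void touch_page(unsigned *position, size_t num_pages, size_t accessed) {
    unsigned old = position[accessed];
    for (size_t p = 0; p < num_pages; p++) {
        if (position[p] < old)   // every page that was more recent than 'accessed'
            position[p]++;       // slides back by one position
    }
    position[accessed] = 0;      // the accessed page becomes MRU
}
// With millions of resident pages and billions of accesses per second,
// this O(n) sweep would have to complete billions of times per second.
```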
No practical hardware design can achieve this. Modern CPUs don't even try—they provide only a single reference bit per page table entry, which is vastly simpler.
Instead of true LRU, hardware provides a reference bit (R-bit) in each page table entry. The MMU sets this bit to 1 when the page is accessed. The OS can read and clear these bits periodically. This minimal hardware support is the foundation for all practical LRU approximations.
Let's examine exactly what a software-based true LRU would require, demonstrating why it's impractical.
Option 1: Trap on Every Memory Access
```c
// HYPOTHETICAL: Trap to OS on every memory access
// THIS IS WHAT TRUE SOFTWARE LRU WOULD REQUIRE

void memory_access_trap(void* virtual_address) {  // Called billions/sec
    // 1. Disable interrupts (prevent concurrent access)
    disable_interrupts();

    // 2. Get the page number
    int page = (uintptr_t)virtual_address >> PAGE_SHIFT;

    // 3. Find in LRU data structure (even O(1) is too slow)
    PageNode* node = hashmap_lookup(lru_map, page);

    // 4. Move to MRU position
    if (node) {
        unlink_node(node);
        insert_at_head(node);
    }

    // 5. Re-enable interrupts
    enable_interrupts();

    // 6. Return to user code
}

// Each trap: ~500-1000 cycles minimum
// With 10 billion memory accesses/second:
// = 5-10 TRILLION cycles/second required
// = 1600-3300 seconds of CPU time PER SECOND
// = ABSOLUTELY IMPOSSIBLE
```

Option 2: Store Timestamp on Every Access (No Trap)
What if we could update a timestamp without trapping to the OS?
```c
// HYPOTHETICAL: Update timestamp on every access
// Requires hardware to automatically update a counter

typedef struct {
    uint64_t frame_number : 52;   // Physical frame
    uint64_t flags        : 12;   // Permissions, etc.
    uint64_t last_access_time;    // 64-bit timestamp (8 bytes extra per PTE!)
} PageTableEntry;

// Problems:
// 1. Page table entries would double in size (8 → 16 bytes)
// 2. Hardware must atomically write a 64-bit timestamp on every access
// 3. Each memory access becomes a write operation
// 4. Cache coherency for timestamps is nightmarish
// 5. Finding the minimum timestamp still requires an O(n) scan

// Even if we had this, finding the LRU page:
PageTableEntry* find_lru_victim() {
    uint64_t min_time = UINT64_MAX;
    PageTableEntry* victim = NULL;

    // Scan ALL pages in memory - O(n) where n can be millions
    for (int i = 0; i < num_pages_in_memory; i++) {
        if (memory_pages[i].last_access_time < min_time) {
            min_time = memory_pages[i].last_access_time;
            victim = &memory_pages[i];
        }
    }
    return victim;  // Takes milliseconds for large memory!
}
```

Option 3: Maintain Sorted List (Software, No Per-Access Trap)
What if we maintained a sorted list that we update only during page faults?
```c
// Update LRU only at page fault time
// This is NOT true LRU - many accesses are missed!

void page_fault_handler(void* virtual_address) {
    int page = get_page_number(virtual_address);

    // Problem: We only know about THIS access
    // All memory accesses since last fault are UNKNOWN!

    // Example scenario:
    // - Page A faults at t=0, loaded
    // - Page B faults at t=1, loaded
    // - ... 1 million accesses to A, B, C, D without faults ...
    // - Page E faults at t=1000000

    // At t=1000000, our data structure thinks:
    //   LRU order: [E, B, A] (based on fault times)
    //
    // But ACTUAL recency order (from access times):
    //   [E, D, C, B, A] or any permutation depending on accesses
    //
    // We have ZERO information about the middle million accesses!

    update_lru_structure_incorrectly(page);
}

// This is fundamentally broken - we can't know true recency
// without tracking every access, which we've shown is impossible
```

Between page faults, millions of memory references occur that the OS never observes. Without tracking these, software cannot know true recency order. This is why LRU must be approximated using periodic sampling or hardware-assisted reference bit tracking.
As systems grow in memory size, LRU implementation becomes even harder. Modern servers have terabytes of RAM, meaning millions of pages to manage.
Scaling the Page Count:
| RAM Size | Page Count (4 KB pages) | LRU List Size (~8 B/page) | Scan Time (O(n)) | Scans/Year @ 1/sec |
|---|---|---|---|---|
| 1 GB | 256K | ~2 MB | ~0.3 ms | ~31.5 million |
| 16 GB | 4M | ~32 MB | ~5 ms | ~31.5 million |
| 64 GB | 16M | ~128 MB | ~20 ms | ~31.5 million |
| 256 GB | 64M | ~512 MB | ~80 ms | ~31.5 million |
| 1 TB | 256M | ~2 GB | ~320 ms | ~31.5 million |
| 4 TB | 1B | ~8 GB | ~1.3 sec | ~31.5 million |
The O(n) Scan Problem:
With a doubly linked list, finding the LRU victim is actually O(1) (it sits at the tail); the O(n) scan times in the table apply to timestamp- or reference-bit-based schemes that must examine every page to choose a victim. The deeper issue is that the update cost scales with access frequency, not page count: with 256M pages resident, every one of billions of accesses per second must move some node within a 256M-node list.
Lock Contention:
On multi-core systems (common: 64-128 cores, future: 256+), multiple cores simultaneously access memory. A single LRU list would be a massive lock contention point:
```c
// Multi-core LRU contention disaster

// Single global LRU lock - TERRIBLE for scalability
pthread_mutex_t global_lru_lock;

void update_recency(int page) {
    pthread_mutex_lock(&global_lru_lock);   // All cores serialize here!

    // Move page to MRU position
    move_to_head(page);

    pthread_mutex_unlock(&global_lru_lock);
}

// With 64 cores, each doing millions of memory accesses/sec:
// - All 64 cores contend for one lock
// - Lock operations take 50-200 cycles each
// - Cores spend most time waiting for lock, not computing
// - System throughput collapses

// Even lock-free structures don't help:
// - Atomic operations (CAS) still cause cache line bouncing
// - Memory bandwidth consumed by coherency traffic
// - O(cores²) scaling for contention
```

One might consider per-CPU LRU lists to avoid contention. But then we lose global LRU ordering—what's LRU on CPU 0 might be MRU on CPU 1. Balancing between lists would require synchronization, reintroducing contention. Real systems (like Linux) do use per-CPU structures but for batching, not true LRU.
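As a rough sketch of that batching idea (not any real kernel's code; `cpu_batch`, `note_access`, and `BATCH_SIZE` are invented names), each CPU buffers recently touched pages privately and takes the global lock only once per batch, amortizing the contention shown above:

```c
#include <pthread.h>

#define BATCH_SIZE 32

struct cpu_batch {
    int pages[BATCH_SIZE];
    int count;
};

static pthread_mutex_t global_lru_lock = PTHREAD_MUTEX_INITIALIZER;

// Stand-in for the global-list operation from the example above.
static void move_to_head(int page) { (void)page; /* global LRU list update elided */ }

void note_access(struct cpu_batch *local, int page) {
    local->pages[local->count++] = page;       // no lock: the batch is private to this CPU
    if (local->count == BATCH_SIZE) {
        pthread_mutex_lock(&global_lru_lock);  // one lock round-trip per BATCH_SIZE accesses
        for (int i = 0; i < local->count; i++)
            move_to_head(local->pages[i]);
        pthread_mutex_unlock(&global_lru_lock);
        local->count = 0;
    }
}
```

Note that this only reduces lock traffic; the global ordering it maintains is already stale by up to a batch's worth of accesses per CPU, which is exactly the loss of true LRU described above.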
Understanding why LRU tracking is impossible requires understanding the memory access critical path—the sequence of events for every memory reference.
Normal Memory Access Path:
```
NORMAL MEMORY ACCESS (without tracking):

  CPU Pipeline     MMU (TLB/Page Table)     Cache        Memory
  ============     ====================     =====        ======
  1. Execute       2. VA→PA translation     3. Cache     4. DRAM
     instruction      (TLB lookup)             lookup       (if miss)
     ~1-2 cycles      (miss: ~100)             ~4-12        ~50-100
                                               cycles       cycles

  Total for HIT:        5-14 cycles (in L1/L2)
  Total for cache MISS: 60-110 cycles
```

The CPU expects memory operations to complete in single-digit cycles. Any additional overhead is immediately visible in performance.

If We Added LRU Tracking:
Adding even minimal overhead to this path is devastating:
```
MEMORY ACCESS WITH LRU TRACKING (hypothetical):

  CPU Pipeline     MMU             LRU Update        Cache        Memory
  ============     ===             ==========        =====        ======
  1. Execute       2. VA→PA        3. Update LRU     4. Cache     5. DRAM
     instruction      translation     structure         lookup       (if miss)
     ~1-2 cycles                      ~20-50 cycles     ~4-12        ~50-100
                                      (optimistic!)     cycles       cycles

  Total for HIT: 25-66 cycles (vs. 5-14 without tracking)
  Slowdown factor: 3-5x even for cache hits!
```

But wait, it's worse:
- The LRU update touches the LRU data structure
- This data structure is itself in the cache
- LRU updates POLLUTE the cache with LRU metadata
- Actual program data is evicted to make room
- More cache misses for real work → even more overhead

The cascading effect makes the actual slowdown 10-100x.

Attempting to observe (track) memory access patterns disturbs those patterns. LRU tracking metadata competes with actual data for cache space. The act of tracking LRU causes more cache misses, which in turn generate more tracking data, in a destructive feedback loop.
Architecture designers recognized the impossibility of true LRU and provided a minimal, efficient alternative: reference bits in page table entries.
The Reference Bit (R-bit or Accessed Bit):
```
x86-64 Page Table Entry (64 bits)
=================================

Bit  63:     No Execute (NX) bit
Bits 62-52:  Available for OS use
Bits 51-12:  Physical frame number (40 bits → up to 4096 TB of physical address space)
Bits 11-9:   Available for OS use
Bit  8:      Global page (G)
Bit  7:      Page Size (PS) - for huge pages
Bit  6:      Dirty bit (D)    - set when page is WRITTEN          ← Hardware sets
Bit  5:      Accessed bit (A) - set when page is READ or WRITTEN  ← Hardware sets
Bit  4:      Page Cache Disable (PCD)
Bit  3:      Page Write Through (PWT)
Bit  2:      User/Supervisor (U/S)
Bit  1:      Read/Write (R/W)
Bit  0:      Present (P)

The ACCESSED BIT (bit 5) is our only hardware help for LRU:
- MMU sets it to 1 on ANY access (read or write)
- OS can clear it to 0
- That's it - no timestamp, no ordering, just a single bit
```

What the Reference Bit Tells Us:
| Reference Bit Value | Meaning |
|---|---|
| R = 1 | Page was accessed since last cleared |
| R = 0 | Page was NOT accessed since last cleared |
What It Doesn't Tell Us:
- When the page was accessed within the interval
- How many times it was accessed
- The order of its accesses relative to other pages
The reference bit provides existence information ("was it used?"), not recency information ("when was it used?"). This is what makes LRU approximation necessary.
The dirty bit (D) indicates whether a page was written. This matters for page replacement: dirty pages must be written to disk before eviction, while clean pages can be discarded immediately. Smart algorithms (Enhanced Second-Chance) use both reference and dirty bits together.
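A minimal sketch of how an OS might combine the two bits. The mask names and the `eviction_class` helper are illustrative, the bit positions follow the x86-64 layout shown above, and the four classes are in the spirit of the Enhanced Second-Chance idea just mentioned:

```c
#include <stdint.h>

// Bit masks follow the x86-64 PTE layout shown above.
#define PTE_ACCESSED (1ULL << 5)   // A bit: set by the MMU on any access
#define PTE_DIRTY    (1ULL << 6)   // D bit: set by the MMU on any write

// Classify a page by its (Accessed, Dirty) bits; lower class = better victim.
//   0: not referenced, clean  -> can be discarded immediately
//   1: not referenced, dirty  -> must be written back first
//   2: referenced, clean      -> recently used
//   3: referenced, dirty      -> recently used and must be written back
int eviction_class(uint64_t pte) {
    int referenced = (pte & PTE_ACCESSED) != 0;
    int dirty      = (pte & PTE_DIRTY)    != 0;
    return (referenced << 1) | dirty;
}
```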
Given hardware's minimal support (just a reference bit), how can we approximate LRU? The gap between "exact recency" and "was it accessed?" must be bridged.
Observation: Temporal Sampling
If we periodically sample the reference bits, we can observe which pages were accessed during each interval. This gives us coarse-grained recency information.
Example: Sampling at 10ms intervals
| Page | 0-10ms | 10-20ms | 20-30ms | 30-40ms | Inferred Recency |
|---|---|---|---|---|---|
| A | R=1 | R=1 | R=1 | R=1 | Very hot (always accessed) |
| B | R=0 | R=1 | R=0 | R=1 | Warm (intermittent) |
| C | R=0 | R=0 | R=0 | R=1 | Recently touched |
| D | R=1 | R=0 | R=0 | R=0 | Old (used only in past) |
| E | R=0 | R=0 | R=0 | R=0 | Cold → Eviction candidate |
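One compact way to turn such samples into an approximate ordering is the classic aging technique: each page keeps a small history value, and at every sampling interval the freshly read reference bit is shifted in as the new top bit. A minimal sketch (function and parameter names are illustrative; how the reference bits are harvested from the page tables is OS- and hardware-specific):

```c
#include <stdint.h>
#include <stddef.h>

// One sampling tick: fold this interval's reference bits into each page's history.
// r_bits[p] holds the reference bit read (and then cleared) for page p this interval.
void aging_tick(uint8_t *age, const uint8_t *r_bits, size_t num_pages) {
    for (size_t p = 0; p < num_pages; p++)
        age[p] = (uint8_t)((age[p] >> 1) | (r_bits[p] << 7));  // newest sample in the top bit
}

// Eviction candidate: the page with the numerically smallest history.
size_t pick_victim(const uint8_t *age, size_t num_pages) {
    size_t victim = 0;
    for (size_t p = 1; p < num_pages; p++)
        if (age[p] < age[victim])
            victim = p;
    return victim;
}
```

With an 8-bit history and 10 ms intervals, a page untouched for 80 ms decays to zero, which is exactly the "cold" signal identified for page E in the table above.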
The Approximation Trade-off:
| Sampling Approach | Pros | Cons |
|---|---|---|
| Frequent sampling (1ms) | Fine-grained recency | High OS overhead |
| Infrequent sampling (100ms) | Low overhead | Coarse recency |
| Multiple history bits | Better ordering | More memory |
| Single reference bit | Simple, standard | Binary (used/not used) |
What We Lose:
- Exact recency ordering of every page
- Any visibility into accesses that occur between samples
- The guarantee of always evicting the true least recently used page
What We Gain:
- Near-zero overhead on the memory access critical path
- Tracking cost proportional to the sampling rate, not the access rate
- A scheme that works with the single reference bit that hardware actually provides
For most workloads, identifying the roughly coldest pages is sufficient—we don't need perfect LRU ordering. The difference between evicting the 10th-coldest page vs. the coldest page is often negligible. Approximations exploit this by being "good enough" with practical overhead.
Before the community accepted that approximations were necessary, several systems attempted true LRU. Their failures are instructive.
The IBM System/370 and Reference Counter Approach (1970s):
IBM's VM/370 experimented with per-page reference counters that hardware would increment on each access. This required:
- dedicated counter storage for every page,
- hardware that incremented the counter on each access, and
- OS logic to scan and compare the counters when choosing a victim.
Result: The counter increment overhead was measurable even with hardware support. More problematically, counter overflow required periodic normalization (subtracting minimum from all counters)—a global O(n) operation.
The VAX/VMS Working Set Approach (1977):
Digital's VMS used a complex working set manager with per-process LRU. It attempted fine-grained recency tracking through expensive bookkeeping, and the system became notorious for the CPU overhead of that bookkeeping and for the amount of hand-tuning its memory-management parameters demanded.
By the late 1980s, the industry reached consensus: hardware provides reference bits, software uses approximation algorithms. This division of labor—minimal hardware support, clever software approximation—remains the foundation of all modern memory management.
We've established definitively why true LRU cannot be practically implemented. Let's consolidate these findings:
- The recency ordering would have to change on every one of billions of memory references per second.
- Hardware can afford only a single reference bit per page, not timestamps or a global ordering.
- Software tracking, whether by trapping, timestamping, or updating at fault time, is either ruinously slow or blind to most accesses.
- Large memories and many cores add O(n) scans and lock contention on top.
- Practical systems therefore approximate LRU from periodically sampled reference bits.
What's Next:
Now that we understand why true LRU is impractical, we'll explore the implementations that do work. The next page examines the Counter Implementation, one of the two theoretical approaches to exact LRU; understanding it helps appreciate both its elegance and its limitations.
You now understand why every operating system uses LRU approximations rather than true LRU. The combination of update frequency requirements, hardware limitations, software overhead, and multi-core scaling makes exact LRU fundamentally impractical. This understanding motivates the approximation techniques in modern systems.