The most intuitive approach to implementing LRU is to record when each page was last accessed. If we associate a timestamp or counter value with each page, finding the least recently used page reduces to finding the page with the oldest (smallest) timestamp.
This counter-based or timestamp-based approach is conceptually straightforward—but as we established in the previous page, implementing it efficiently is fraught with challenges. This page examines the counter approach in detail, exploring both hardware-assisted and software-only variants, understanding their trade-offs, and seeing how they inform the approximation techniques used in practice.
Even though pure counter implementations are rarely used in modern operating systems, understanding them is essential: they represent the ideal that practical approximation algorithms strive to match.
By the end of this page, you will understand: (1) the logical clock model for recency tracking, (2) hardware counter implementation and its costs, (3) software counter alternatives, (4) the counter overflow problem, (5) finding the minimum counter efficiently, and (6) why counters alone are insufficient for practical systems.
LRU cares about the ordering of events, not absolute time. We don't need to know that page A was accessed at 3:42:15.123 PM—we just need to know whether it was accessed before or after page B.
This insight leads to using logical clocks rather than real-time clocks.
Definition: Logical Clock for LRU
A logical clock is a monotonically increasing counter that advances with each memory reference:
```
logical_time = 0                 // Global counter

on_memory_reference(page):
    logical_time = logical_time + 1
    last_access[page] = logical_time
```
After any sequence of memory references, the last_access values give perfect LRU ordering: smaller values = older access = candidates for eviction.
| Time | Reference | Logical Clock | last_access[A] | last_access[B] | last_access[C] | last_access[D] | LRU Order |
|---|---|---|---|---|---|---|---|
| t=1 | A | 1 | 1 | — | — | — | [A] |
| t=2 | B | 2 | 1 | 2 | — | — | [A, B] |
| t=3 | C | 3 | 1 | 2 | 3 | — | [A, B, C] |
| t=4 | B | 4 | 1 | 4 | 3 | — | [A, C, B] |
| t=5 | A | 5 | 5 | 4 | 3 | — | [C, B, A] |
| t=6 | D | 6 | 5 | 4 | 3 | 6 | [C, B, A, D] |
At time t=6:
| Page | last_access | How Long Ago |
|---|---|---|
| D | 6 | 0 (most recent) |
| A | 5 | 1 |
| B | 4 | 2 |
| C | 3 | 3 (least recent → LRU victim) |
The beauty of this model is its simplicity: the LRU victim is always the page with the minimum last_access value.
Logical time perfectly captures recency because LRU only cares about relative ordering. Whether 1 nanosecond or 1 hour passed between references is irrelevant—only the sequence matters. Logical clocks are simpler, more portable, and avoid real-time clock issues (drift, synchronization, resolution).
The most efficient counter implementation would have hardware maintain the timestamp. Let's examine what this would require.
Conceptual Design:
```
HARDWARE LRU COUNTER DESIGN
===========================

Component 1: Global Logical Clock Register
- N-bit counter (N typically 32 or 64)
- Increments automatically on every memory reference
- Must be accessible from all cores (coherent)

Component 2: Extended Page Table Entry
- Standard PTE fields (52 bits for address, 12 bits for flags)
- Added: N-bit timestamp field
- On EVERY memory access, hardware writes current clock to timestamp field

 63                                          12 11         0
+-----------------------------------------------+------------+
| Physical Frame Number + Timestamp             | Flags      |
+-----------------------------------------------+------------+
                        ^
                        |
          Must accommodate 32-64 bit timestamp

Component 3: Minimum Finder
- When eviction needed, hardware or software scans all PTEs
- Finds entry with minimum timestamp
- That page is the LRU victim
```

Hardware Implementation Costs:
| Component | Cost | Impact |
|---|---|---|
| PTE size increase | +32 to +64 bits per entry | Page tables grow 1.5-2x, more cache misses |
| Clock register | One 64-bit register per core | Minimal silicon cost |
| Clock coherency | Cross-core synchronization | Adds memory bus traffic |
| Timestamp writes | Write on EVERY memory access | Huge memory bandwidth consumption |
| Cache line pollution | Timestamp updates evict data | Reduced effective cache size |
The Fatal Flaw: Write Amplification
Every memory read becomes a memory write (to record the timestamp). This is catastrophic for performance: memory traffic roughly doubles, and every timestamp write competes with the program's own accesses for bandwidth and cache space.
This is why no modern architecture implements hardware timestamp-based LRU for page tables. The write amplification alone makes it infeasible.
CPU caches use pseudo-LRU with small state (a few bits per set) stored in dedicated SRAM cells, updated in parallel with data access. Page tables would need per-entry timestamps in main memory, touching the memory system for every access. The storage hierarchy difference is fundamental.
Since hardware counters are impractical, can we implement counters in software? Let's examine the options.
Option 1: Trap-Based Counter Update
The OS could trap on every memory access and update a software counter table.
```c
// TRAP-BASED COUNTER LRU (Hypothetical - NOT practical)

// Global shared state
uint64_t logical_clock = 0;
uint64_t page_timestamp[MAX_PAGES];  // Timestamp per page
spinlock_t lru_lock;

// Called on EVERY memory access (would require hardware trap)
void memory_access_trap_handler(int page_number) {
    // This runs BILLIONS of times per second!
    acquire_spinlock(&lru_lock);
    logical_clock++;
    page_timestamp[page_number] = logical_clock;
    release_spinlock(&lru_lock);
}

// COST ANALYSIS:
// - Trap overhead: ~500-1000 cycles minimum
// - Lock acquisition: ~50-200 cycles
// - Memory updates: ~10-50 cycles
// - Total: ~600-1300 cycles per memory reference
//
// With 10 billion memory references/second:
// = 6-13 trillion additional cycles/second
// = CPU would need to be 2000-4000 GHz to keep up
//
// CONCLUSION: Completely impossible
```

Option 2: Page Fault Time Only
What if we only update counters during page faults? This is actually feasible—but it's not true LRU.
```c
// PAGE FAULT TIME COUNTER (Feasible but NOT true LRU)

uint64_t fault_clock = 0;
uint64_t page_fault_time[MAX_PAGES];

void page_fault_handler(int page_number) {
    // Only runs on page faults (thousands/second, not billions)
    fault_clock++;
    page_fault_time[page_number] = fault_clock;
    // Load page from disk...
    // Update page tables...
}

int find_lru_victim() {
    // Find page with minimum fault time
    uint64_t min_time = UINT64_MAX;
    int victim = -1;
    for (int i = 0; i < num_resident_pages; i++) {
        if (page_fault_time[resident_pages[i]] < min_time) {
            min_time = page_fault_time[resident_pages[i]];
            victim = resident_pages[i];
        }
    }
    return victim;
}

// PROBLEM: This is FIFO for resident pages, not LRU!
//
// Example:
// Page A faults at t=1, loaded
// Page B faults at t=2, loaded
// ... A is accessed 1 million times, B never touched ...
// Need to evict at t=3: Algorithm says "evict A" (fault_time=1)
// But A is HOT! B should be evicted.
//
// We never learn about accesses between faults.
```

Between page faults, the OS has no visibility into memory accesses. Software counters updated only on faults capture fault ordering (essentially FIFO among resident pages), not access recency (LRU). This fundamental information gap is why pure software counters cannot implement true LRU.
Even if we could maintain counters (hypothetically), we face the overflow problem. Counters have finite bits and eventually wrap around.
Overflow Timeline:
| Counter Size | Maximum Value | Refs/Second | Time to Overflow |
|---|---|---|---|
| 16-bit | 65,535 | 10 billion | 6.5 microseconds |
| 32-bit | 4.3 billion | 10 billion | 0.43 seconds |
| 48-bit | 281 trillion | 10 billion | 7.8 hours |
| 64-bit | 18.4 quintillion | 10 billion | 58 years |
32-bit counters overflow in under a second—clearly useless.
64-bit counters are safe from overflow (58 years at 10 billion refs/sec) but require 8 bytes per page table entry, doubling PTE size.
The Overflow Solution: Normalization
When counters approach overflow, we can normalize by subtracting the minimum counter value from all counters:
```c
// Counter normalization to prevent overflow

void normalize_counters() {
    // Find minimum counter value
    uint64_t min_val = UINT64_MAX;
    for (int i = 0; i < num_pages; i++) {
        if (page_timestamp[i] < min_val) {
            min_val = page_timestamp[i];
        }
    }

    // Subtract minimum from all counters
    for (int i = 0; i < num_pages; i++) {
        page_timestamp[i] -= min_val;
    }

    // Reset global clock to highest page value
    logical_clock = 0;
    for (int i = 0; i < num_pages; i++) {
        if (page_timestamp[i] > logical_clock) {
            logical_clock = page_timestamp[i];
        }
    }
}

// COST: O(n) scan of ALL pages
// With 1 million pages: ~1-10 milliseconds
// During normalization: no memory accesses can be tracked
// Must acquire global lock, blocking all cores
//
// If we normalize every second: 1-10ms of stall per second = 0.1-1% overhead
// But with 32-bit counters, we'd normalize 2x/second = unacceptable
```

For counter-based LRU to be remotely practical, 64-bit counters are mandatory. This means 8 additional bytes per PTE—a significant memory cost. For a system with 16GB RAM (4M pages), that's 32MB just for LRU counters. For 1TB (256M pages), it's 2GB.
Even with perfect counter maintenance, we face another challenge: finding the LRU victim requires locating the page with the minimum counter value.
The Naïve Approach: O(n) Scan
```c
int find_lru_victim_naive() {
    uint64_t min_time = UINT64_MAX;
    int victim = -1;

    // Scan ALL resident pages
    for (int i = 0; i < num_resident_pages; i++) {
        if (page_timestamp[resident_pages[i]] < min_time) {
            min_time = page_timestamp[resident_pages[i]];
            victim = resident_pages[i];
        }
    }
    return victim;
}

// With 1 million pages:
// - 1 million comparisons
// - ~1 million memory reads (8 bytes each = 8 MB data)
// - Cache misses galore
// - Total time: 1-10 milliseconds
//
// Page faults typically occur thousands of times per second.
// 1000 faults × 10ms each = 10 seconds of scanning per second
// IMPOSSIBLE.
```

Better Data Structures: Min-Heap
We could maintain a min-heap keyed by timestamp:
```c
// Min-heap based LRU (better but still problematic)

typedef struct {
    int page_number;
    uint64_t timestamp;
} HeapEntry;

HeapEntry lru_heap[MAX_PAGES];
int heap_size;
HashMap page_to_heap_position;  // For O(1) position lookup

// Find LRU victim: O(1) - just return heap root
int find_lru_victim() {
    return lru_heap[0].page_number;  // Root is minimum
}

// Problem: UPDATE on access is expensive
void update_timestamp_on_access(int page_number) {
    // 1. Find page's position in heap: O(1) with hash map
    int pos = hashmap_get(page_to_heap_position, page_number);

    // 2. Update timestamp (increases value)
    lru_heap[pos].timestamp = ++logical_clock;

    // 3. Re-heapify: timestamp increased, so bubble DOWN
    //    O(log n) operations
    heapify_down(pos);
}

// The heap "fix" operation is O(log n) per access
// With n=1M pages: log(1M) ≈ 20 heap operations per access
// With 10 billion accesses/second: 200 billion heap ops/second
// Still impossible in software!
```

| Structure | Find Min | Update on Access | Space | Practical? |
|---|---|---|---|---|
| Unsorted array | O(n) | O(1) | O(n) | No (find too slow) |
| Sorted array | O(1) | O(n) | O(n) | No (update too slow) |
| Min-heap | O(1) | O(log n) | O(n) | No (update too frequent) |
| BST/RB-tree | O(1)* | O(log n) | O(n) | No (same issue) |
| Doubly linked list | O(1) | O(1)** | O(n) | Best possible |

(* assuming a cached pointer to the minimum element; ** move-to-head on access)
The doubly linked list (as we saw in page 2) achieves O(1) for both finding the LRU victim (tail) and updating on access (move to head). But the update still must happen on every access—the problem isn't the data structure, it's the frequency of updates.
Some historical and embedded systems used register-based LRU where each page has a dedicated hardware register holding its recency information. While impractical for general-purpose systems, understanding this approach illuminates the trade-offs.
The Conceptual Design:
```
REGISTER-BASED LRU
==================

For N frames, maintain N hardware registers (8 bits each for simplicity):

 Frame 0 Reg   Frame 1 Reg   Frame 2 Reg   Frame 3 Reg
+-----------+ +-----------+ +-----------+ +-----------+
| 10010011  | | 01001001  | | 11000001  | | 00100100  |
+-----------+ +-----------+ +-----------+ +-----------+
     147           73           193           36

On each memory access to frame F:
1. Right-shift all registers by 1 (divide by 2, losing LSB)
2. Set MSB of frame F's register to 1

Example: Access to Frame 2

Before:              [147, 73, 193, 36]
After shift:         [73, 36, 96, 18]    (all values halved, rounded down)
Set MSB of Frame 2:  [73, 36, 224, 18]   (96 + 128 = 224)

Interpretation:
- Higher values = more recent accesses
- Find minimum = LRU victim
- MSB = accessed this period, LSB = accessed long ago
```

The Shift Register Insight:
The beauty of this scheme is that the register value encodes a history of accesses:
```
Register: 10110001  (177 decimal)
          ↑  ↑   ↑
          │  │   └─ Accessed 8 units ago
          │  └─ Accessed 2-3 units ago
          └─ Accessed this unit
```
Higher values = more recent/frequent access. The register naturally "ages" information as time passes (via right shifts).
Why This Works for Small Systems: with only a handful of frames, the register file is tiny, and the hardware can shift every register in parallel in a single cycle, with no software involvement at all.
Why This Fails for Modern Systems:
| RAM Size | Pages (4KB) | Registers Needed | Register Storage | Feasibility |
|---|---|---|---|---|
| 64 KB | 16 | 16 × 8 bits | 128 bits | ✓ Easy (embedded) |
| 1 MB | 256 | 256 × 8 bits | 2 Kbits | Marginal |
| 16 MB | 4K | 4K × 16 bits | 64 Kbits | Expensive |
| 1 GB | 256K | 256K × 32 bits | 8 Mbits | Impractical |
| 64 GB | 16M | 16M × 64 bits | 1 Gbit | Impossible |
Register-based LRU variants are actually used in small structures: TLBs (32-64 entries), L1 cache sets (2-8 ways). With few entries, the hardware cost is acceptable. This "micro-LRU" works well because N is tiny. Page-level LRU has N in millions—a completely different engineering challenge.
A practical approach combines the hardware reference bit with periodic software counter updates. This trades true LRU for an approximation with acceptable overhead.
The Aging Algorithm:
```c
// THE AGING ALGORITHM
// Periodic sampling combines reference bits with counters

// Each page has an 8-bit aging counter
uint8_t age_counter[MAX_PAGES];

// Every T milliseconds (e.g., T=100ms), the OS runs:
void aging_timer_interrupt() {
    for (int page = 0; page < num_resident_pages; page++) {
        // Right-shift counter (age the value)
        age_counter[page] >>= 1;

        // If reference bit is set, add 128 (set MSB)
        if (get_reference_bit(page)) {
            age_counter[page] |= 0x80;  // 0b10000000
        }

        // Clear reference bit for next period
        clear_reference_bit(page);
    }
}

// To find LRU victim: find page with minimum age_counter
int find_lru_victim() {
    uint8_t min_age = 0xFF;
    int victim = -1;
    for (int page = 0; page < num_resident_pages; page++) {
        if (age_counter[page] < min_age) {
            min_age = age_counter[page];
            victim = page;
        }
    }
    return victim;
}
```

How Aging Works:
With 8-bit counters and 100ms intervals, each counter captures the last 8 sampling periods (800ms) of history; a reference made 8 or more periods ago has been shifted out entirely. The counter value is essentially a weighted history in which a reference k periods ago contributes 2^(7−k), so recent references count far more than old ones.
Example Trace:
| Time | Page A Ref? | Page A Counter | Page B Ref? | Page B Counter |
|---|---|---|---|---|
| 0ms | Yes | 10000000 (128) | Yes | 10000000 (128) |
| 100ms | Yes | 11000000 (192) | No | 01000000 (64) |
| 200ms | Yes | 11100000 (224) | Yes | 10100000 (160) |
| 300ms | No | 01110000 (112) | Yes | 11010000 (208) |
| 400ms | No | 00111000 (56) | Yes | 11101000 (232) |
| 500ms | No | 00011100 (28) | No | 01110100 (116) |
The aging algorithm runs O(n) once every sampling period (say, 100ms), not on every memory access. With 10 scans per second and 1M pages, that is about 10M operations per second—completely feasible, and roughly a thousandth of the 10 billion per-access updates true LRU would require.
Given all the challenges, how do real systems use counter-like mechanisms?
Linux: Multi-Generation LRU (MGLRU)
Modern Linux (5.18+) uses MGLRU which maintains multiple "generations" of pages:
```
LINUX MGLRU (Multi-Generation LRU)
==================================

Instead of per-page counters, pages are placed in GENERATIONS:

 Gen 0 (oldest)    Gen 1         Gen 2        Gen 3 (newest)
+-----------+   +-----------+ +-----------+ +-----------+
| Page list |   | Page list | | Page list | | Page list |
+-----------+   +-----------+ +-----------+ +-----------+
 Evict first        ...           ...        Most protected

Promotion: When page is accessed, move to higher generation
Demotion:  Periodically, pages age down generations
Eviction:  Select victims from generation 0

Benefits:
- No per-page counter storage
- Constant-time generation membership check
- Batch aging (move whole generations down)
- Configurable number of generations (typically 4)

This is counter-like behavior with practical overhead.
```

Database Buffer Pools
Databases (PostgreSQL, MySQL, Oracle) manage their own buffer pools and can implement more sophisticated LRU variants because every page access goes through an explicit buffer-pool call (a pin or fetch), so the software observes each reference directly, with no hardware traps required.
Many databases use LRU-K variants that consider last K references, or 2Q which separates first-access from subsequent accesses.
Pure counter-based LRU is impractical for OS page replacement, but counter-inspired techniques (aging, generations, LRU-K) provide good approximations. The key is reducing update frequency from per-access to per-interval, and using hardware reference bits to bridge the gap.
We've explored counter-based LRU in depth, understanding both its elegance and its limitations.
What's Next:
The next page examines the other exact LRU implementation: the Stack Implementation. While the counter approach associates a timestamp with each page, the stack approach maintains explicit ordering through data structure manipulation. Understanding both approaches completes our picture of exact LRU before we move to practical approximations.
You now understand counter-based LRU implementation in full depth: the logical clock model, hardware and software variants, overflow issues, minimum-finding challenges, and the aging algorithm that bridges theory and practice. This foundation prepares you for the stack-based alternative and the approximation algorithms that real systems use.