The most intuitive approach to implementing LRU is to record when each page was last accessed. If we associate a timestamp or counter value with each page, finding the least recently used page reduces to finding the page with the oldest (smallest) timestamp.
This counter-based or timestamp-based approach is conceptually straightforward—but as we established in the previous page, implementing it efficiently is fraught with challenges. This page examines the counter approach in detail, exploring both hardware-assisted and software-only variants, understanding their trade-offs, and seeing how they inform the approximation techniques used in practice.
Even though pure counter implementations are rarely used in modern operating systems, understanding them is essential: they represent the ideal that practical approximation algorithms strive to match.
By the end of this page, you will understand: (1) the logical clock model for recency tracking, (2) hardware counter implementation and its costs, (3) software counter alternatives, (4) the counter overflow problem, (5) finding the minimum counter efficiently, and (6) why counters alone are insufficient for practical systems.
LRU cares about the ordering of events, not absolute time. We don't need to know that page A was accessed at 3:42:15.123 PM—we just need to know whether it was accessed before or after page B.
This insight leads to using logical clocks rather than real-time clocks.
Definition: Logical Clock for LRU
A logical clock is a monotonically increasing counter that advances with each memory reference:
```
logical_time = 0                 // Global counter

on_memory_reference(page):
    logical_time = logical_time + 1
    last_access[page] = logical_time
```
After any sequence of memory references, the last_access values give perfect LRU ordering: smaller values = older access = candidates for eviction.
| Time | Reference | Logical Clock | last_access[A] | last_access[B] | last_access[C] | last_access[D] | LRU Order |
|---|---|---|---|---|---|---|---|
| t=1 | A | 1 | 1 | — | — | — | [A] |
| t=2 | B | 2 | 1 | 2 | — | — | [A, B] |
| t=3 | C | 3 | 1 | 2 | 3 | — | [A, B, C] |
| t=4 | B | 4 | 1 | 4 | 3 | — | [A, C, B] |
| t=5 | A | 5 | 5 | 4 | 3 | — | [C, B, A] |
| t=6 | D | 6 | 5 | 4 | 3 | 6 | [C, B, A, D] |
At time t=6:
| Page | last_access | How Long Ago |
|---|---|---|
| D | 6 | 0 (most recent) |
| A | 5 | 1 |
| B | 4 | 2 |
| C | 3 | 3 (least recent → LRU victim) |
The beauty of this model is its simplicity: the LRU victim is always the page with the minimum last_access value.
Logical time perfectly captures recency because LRU only cares about relative ordering. Whether 1 nanosecond or 1 hour passed between references is irrelevant—only the sequence matters. Logical clocks are simpler, more portable, and avoid real-time clock issues (drift, synchronization, resolution).
The most efficient counter implementation would have hardware maintain the timestamp. Let's examine what this would require.
Conceptual Design:
```
HARDWARE LRU COUNTER DESIGN
===========================

Component 1: Global Logical Clock Register
- N-bit counter (N typically 32 or 64)
- Increments automatically on every memory reference
- Must be accessible from all cores (coherent)

Component 2: Extended Page Table Entry
- Standard PTE fields (52 bits for address, 12 bits for flags)
- Added: N-bit timestamp field
- On EVERY memory access, hardware writes current clock to timestamp field

 63                                          12 11         0
+-----------------------------------------------+------------+
| Physical Frame Number + Timestamp             | Flags      |
+-----------------------------------------------+------------+
                        ^
                        |
          Must accommodate 32-64 bit timestamp

Component 3: Minimum Finder
- When eviction needed, hardware or software scans all PTEs
- Finds entry with minimum timestamp
- That page is the LRU victim
```

Hardware Implementation Costs:
| Component | Cost | Impact |
|---|---|---|
| PTE size increase | +32 to +64 bits per entry | Page tables grow 1.5-2x, more cache misses |
| Clock register | One 64-bit register per core | Minimal silicon cost |
| Clock coherency | Cross-core synchronization | Adds memory bus traffic |
| Timestamp writes | Write on EVERY memory access | Huge memory bandwidth consumption |
| Cache line pollution | Timestamp updates evict data | Reduced effective cache size |
The Fatal Flaw: Write Amplification
Every memory read becomes a memory write (to record the timestamp). This is catastrophic for performance: memory traffic roughly doubles, and every timestamp write competes with the program's own accesses for bandwidth and cache space.
This is why no modern architecture implements hardware timestamp-based LRU for page tables. The write amplification alone makes it infeasible.
CPU caches use pseudo-LRU with small state (a few bits per set) stored in dedicated SRAM cells, updated in parallel with data access. Page tables would need per-entry timestamps in main memory, touching the memory system for every access. The storage hierarchy difference is fundamental.
Since hardware counters are impractical, can we implement counters in software? Let's examine the options.
Option 1: Trap-Based Counter Update
The OS could trap on every memory access and update a software counter table.
```c
// TRAP-BASED COUNTER LRU (Hypothetical - NOT practical)

// Global shared state
uint64_t logical_clock = 0;
uint64_t page_timestamp[MAX_PAGES];  // Timestamp per page
spinlock_t lru_lock;

// Called on EVERY memory access (would require hardware trap)
void memory_access_trap_handler(int page_number) {
    // This runs BILLIONS of times per second!
    acquire_spinlock(&lru_lock);
    logical_clock++;
    page_timestamp[page_number] = logical_clock;
    release_spinlock(&lru_lock);
}

// COST ANALYSIS:
// - Trap overhead: ~500-1000 cycles minimum
// - Lock acquisition: ~50-200 cycles
// - Memory updates: ~10-50 cycles
// - Total: ~600-1300 cycles per memory reference
//
// With 10 billion memory references/second:
// = 6-13 trillion additional cycles/second
// = CPU would need to be 2000-4000 GHz to keep up
//
// CONCLUSION: Completely impossible
```

Option 2: Page Fault Time Only
What if we only update counters during page faults? This is actually feasible—but it's not true LRU.
```c
// PAGE FAULT TIME COUNTER (Feasible but NOT true LRU)

uint64_t fault_clock = 0;
uint64_t page_fault_time[MAX_PAGES];

void page_fault_handler(int page_number) {
    // Only runs on page faults (thousands/second, not billions)
    fault_clock++;
    page_fault_time[page_number] = fault_clock;
    // Load page from disk...
    // Update page tables...
}

int find_lru_victim() {
    // Find page with minimum fault time
    uint64_t min_time = UINT64_MAX;
    int victim = -1;
    for (int i = 0; i < num_resident_pages; i++) {
        if (page_fault_time[resident_pages[i]] < min_time) {
            min_time = page_fault_time[resident_pages[i]];
            victim = resident_pages[i];
        }
    }
    return victim;
}

// PROBLEM: This is FIFO for resident pages, not LRU!
//
// Example:
// Page A faults at t=1, loaded
// Page B faults at t=2, loaded
// ... A is accessed 1 million times, B never touched ...
// Need to evict at t=3: Algorithm says "evict A" (fault_time=1)
// But A is HOT! B should be evicted.
//
// We never learn about accesses between faults.
```

Between page faults, the OS has no visibility into memory accesses. Software counters updated only on faults capture fault ordering (essentially FIFO among resident pages), not access recency (LRU). This fundamental information gap is why pure software counters cannot implement true LRU.
Even if we could maintain counters (hypothetically), we face the overflow problem. Counters have finite bits and eventually wrap around.
Overflow Timeline:
| Counter Size | Maximum Value | Refs/Second | Time to Overflow |
|---|---|---|---|
| 16-bit | 65,535 | 10 billion | 6.5 microseconds |
| 32-bit | 4.3 billion | 10 billion | 0.43 seconds |
| 48-bit | 281 trillion | 10 billion | 7.8 hours |
| 64-bit | 18.4 quintillion | 10 billion | 58 years |
32-bit counters overflow in under a second—clearly useless.
64-bit counters are safe from overflow (58 years at 10 billion refs/sec) but require 8 bytes per page table entry, doubling PTE size.
The Overflow Solution: Normalization
When counters approach overflow, we can normalize by subtracting the minimum counter value from all counters:
```c
// Counter normalization to prevent overflow

void normalize_counters() {
    // Find minimum counter value
    uint64_t min_val = UINT64_MAX;
    for (int i = 0; i < num_pages; i++) {
        if (page_timestamp[i] < min_val) {
            min_val = page_timestamp[i];
        }
    }

    // Subtract minimum from all counters
    for (int i = 0; i < num_pages; i++) {
        page_timestamp[i] -= min_val;
    }

    // Reset global clock to highest page value
    logical_clock = 0;
    for (int i = 0; i < num_pages; i++) {
        if (page_timestamp[i] > logical_clock) {
            logical_clock = page_timestamp[i];
        }
    }
}

// COST: O(n) scan of ALL pages
// With 1 million pages: ~1-10 milliseconds
// During normalization: no memory accesses can be tracked
// Must acquire global lock, blocking all cores
//
// If we normalize every second: 1-10ms of stall per second = 0.1-1% overhead
// But with 32-bit counters, we'd normalize 2x/second = unacceptable
```

For counter-based LRU to be remotely practical, 64-bit counters are mandatory. This means 8 additional bytes per PTE—a significant memory cost. For a system with 16GB RAM (4M pages), that's 32MB just for LRU counters. For 1TB (256M pages), it's 2GB.
Even with perfect counter maintenance, we face another challenge: finding the LRU victim requires locating the page with the minimum counter value.
The Naïve Approach: O(n) Scan
```c
int find_lru_victim_naive() {
    uint64_t min_time = UINT64_MAX;
    int victim = -1;

    // Scan ALL resident pages
    for (int i = 0; i < num_resident_pages; i++) {
        if (page_timestamp[resident_pages[i]] < min_time) {
            min_time = page_timestamp[resident_pages[i]];
            victim = resident_pages[i];
        }
    }
    return victim;
}

// With 1 million pages:
// - 1 million comparisons
// - ~1 million memory reads (8 bytes each = 8 MB data)
// - Cache misses galore
// - Total time: 1-10 milliseconds
//
// Page faults typically occur thousands of times per second.
// 1000 faults × 10ms each = 10 seconds of scanning per second
// IMPOSSIBLE.
```

Better Data Structures: Min-Heap
We could maintain a min-heap keyed by timestamp:
```c
// Min-heap based LRU (better but still problematic)

typedef struct {
    int page_number;
    uint64_t timestamp;
} HeapEntry;

HeapEntry lru_heap[MAX_PAGES];
int heap_size;
HashMap page_to_heap_position;  // For O(1) position lookup

// Find LRU victim: O(1) - just return heap root
int find_lru_victim() {
    return lru_heap[0].page_number;  // Root is minimum
}

// Problem: UPDATE on access is expensive
void update_timestamp_on_access(int page_number) {
    // 1. Find page's position in heap: O(1) with hash map
    int pos = hashmap_get(page_to_heap_position, page_number);

    // 2. Update timestamp (increases value)
    lru_heap[pos].timestamp = ++logical_clock;

    // 3. Re-heapify: timestamp increased, so bubble DOWN
    //    O(log n) operations
    heapify_down(pos);
}

// The heap "fix" operation is O(log n) per access
// With n=1M pages: log(1M) ≈ 20 heap operations per access
// With 10 billion accesses/second: 200 billion heap ops/second
// Still impossible in software!
```

| Structure | Find Min | Update on Access | Space | Practical? |
|---|---|---|---|---|
| Unsorted array | O(n) | O(1) | O(n) | No (find too slow) |
| Sorted array | O(1) | O(n) | O(n) | No (update too slow) |
| Min-heap | O(1) | O(log n) | O(n) | No (update too frequent) |
| BST/RB-tree | O(1)* | O(log n) | O(n) | No (same issue) |
| Doubly linked list | O(1) | O(1)** | O(n) | Best possible |

(* assuming a cached pointer to the minimum element; ** move-to-head on access)
The doubly linked list (as we saw in page 2) achieves O(1) for both finding the LRU victim (tail) and updating on access (move to head). But the update still must happen on every access—the problem isn't the data structure, it's the frequency of updates.
Some historical and embedded systems used register-based LRU where each page has a dedicated hardware register holding its recency information. While impractical for general-purpose systems, understanding this approach illuminates the trade-offs.
The Conceptual Design:
```
REGISTER-BASED LRU
==================

For N frames, maintain N hardware registers (8 bits each for simplicity):

 Frame 0 Reg   Frame 1 Reg   Frame 2 Reg   Frame 3 Reg
+-----------+ +-----------+ +-----------+ +-----------+
| 10010011  | | 01001001  | | 11000001  | | 00100100  |
+-----------+ +-----------+ +-----------+ +-----------+
     147           73           193           36

On each memory access to frame F:
1. Right-shift all registers by 1 (divide by 2, losing LSB)
2. Set MSB of frame F's register to 1

Example: Access to Frame 2

Before:              [147, 73, 193, 36]
After shift:         [73, 36, 96, 18]    (all values halved, rounded down)
Set MSB of Frame 2:  [73, 36, 224, 18]   (96 + 128 = 224)

Interpretation:
- Higher values = more recent accesses
- Find minimum = LRU victim
- MSB = accessed this period, LSB = accessed long ago
```

The Shift Register Insight:
The beauty of this scheme is that the register value encodes a history of accesses:
```
Register: 10110001  (177 decimal)
          ↑  ↑   ↑
          │  │   └─ Accessed 8 units ago
          │  └─ Accessed 2-3 units ago
          └─ Accessed this unit
```
Higher values = more recent/frequent access. The register naturally "ages" information as time passes (via right shifts).
Why This Works for Small Systems: with only a handful of frames, the register file is tiny, and the hardware can shift every register in parallel in a single cycle, with no software involvement at all.
Why This Fails for Modern Systems:
| RAM Size | Pages (4KB) | Registers Needed | Register Storage | Feasibility |
|---|---|---|---|---|
| 64 KB | 16 | 16 × 8 bits | 128 bits | ✓ Easy (embedded) |
| 1 MB | 256 | 256 × 8 bits | 2 Kbits | Marginal |
| 16 MB | 4K | 4K × 16 bits | 64 Kbits | Expensive |
| 1 GB | 256K | 256K × 32 bits | 8 Mbits | Impractical |
| 64 GB | 16M | 16M × 64 bits | 1 Gbit | Impossible |
Register-based LRU variants are actually used in small structures: TLBs (32-64 entries), L1 cache sets (2-8 ways). With few entries, the hardware cost is acceptable. This "micro-LRU" works well because N is tiny. Page-level LRU has N in millions—a completely different engineering challenge.
A practical approach combines the hardware reference bit with periodic software counter updates. This trades true LRU for an approximation with acceptable overhead.
The Aging Algorithm:
```c
// THE AGING ALGORITHM
// Periodic sampling combines reference bits with counters

// Each page has an 8-bit aging counter
uint8_t age_counter[MAX_PAGES];

// Every T milliseconds (e.g., T=100ms), the OS runs:
void aging_timer_interrupt() {
    for (int page = 0; page < num_resident_pages; page++) {
        // Right-shift counter (age the value)
        age_counter[page] >>= 1;

        // If reference bit is set, add 128 (set MSB)
        if (get_reference_bit(page)) {
            age_counter[page] |= 0x80;  // 0b10000000
        }

        // Clear reference bit for next period
        clear_reference_bit(page);
    }
}

// To find LRU victim: find page with minimum age_counter
int find_lru_victim() {
    uint8_t min_age = 0xFF;
    int victim = -1;
    for (int page = 0; page < num_resident_pages; page++) {
        if (age_counter[page] < min_age) {
            min_age = age_counter[page];
            victim = page;
        }
    }
    return victim;
}
```

How Aging Works:
With 8-bit counters and 100ms intervals, each counter captures the last 8 sampling periods (800ms) of history; a reference made 8 or more periods ago has been shifted out entirely. The counter value is essentially a weighted history in which a reference k periods ago contributes 2^(7−k), so recent references count far more than old ones.
Example Trace:
| Time | Page A Ref? | Page A Counter | Page B Ref? | Page B Counter |
|---|---|---|---|---|
| 0ms | Yes | 10000000 (128) | Yes | 10000000 (128) |
| 100ms | Yes | 11000000 (192) | No | 01000000 (64) |
| 200ms | Yes | 11100000 (224) | Yes | 10100000 (160) |
| 300ms | No | 01110000 (112) | Yes | 11010000 (208) |
| 400ms | No | 00111000 (56) | Yes | 11101000 (232) |
| 500ms | No | 00011100 (28) | No | 01110100 (116) |
The aging algorithm runs O(n) once every sampling period (say, 100ms), not on every memory access. With 10 scans per second and 1M pages, that is about 10M operations per second—completely feasible, and roughly a thousandth of the 10 billion per-access updates true LRU would require.
Given all the challenges, how do real systems use counter-like mechanisms?
Linux: Multi-Generation LRU (MGLRU)
Modern Linux (5.18+) uses MGLRU which maintains multiple "generations" of pages:
```
LINUX MGLRU (Multi-Generation LRU)
==================================

Instead of per-page counters, pages are placed in GENERATIONS:

 Gen 0 (oldest)    Gen 1         Gen 2        Gen 3 (newest)
+-----------+   +-----------+ +-----------+ +-----------+
| Page list |   | Page list | | Page list | | Page list |
+-----------+   +-----------+ +-----------+ +-----------+
 Evict first        ...           ...        Most protected

Promotion: When page is accessed, move to higher generation
Demotion:  Periodically, pages age down generations
Eviction:  Select victims from generation 0

Benefits:
- No per-page counter storage
- Constant-time generation membership check
- Batch aging (move whole generations down)
- Configurable number of generations (typically 4)

This is counter-like behavior with practical overhead.
```

Database Buffer Pools
Databases (PostgreSQL, MySQL, Oracle) manage their own buffer pools and can implement more sophisticated LRU variants because every page access goes through an explicit buffer-pool call (a pin or fetch), so the software observes each reference directly, with no hardware traps required.
Many databases use LRU-K variants that consider last K references, or 2Q which separates first-access from subsequent accesses.
Pure counter-based LRU is impractical for OS page replacement, but counter-inspired techniques (aging, generations, LRU-K) provide good approximations. The key is reducing update frequency from per-access to per-interval, and using hardware reference bits to bridge the gap.
We've explored counter-based LRU in depth, understanding both its elegance and its limitations.
What's Next:
The next page examines the other exact LRU implementation: the Stack Implementation. While the counter approach associates a timestamp with each page, the stack approach maintains explicit ordering through data structure manipulation. Understanding both approaches completes our picture of exact LRU before we move to practical approximations.
You now understand counter-based LRU implementation in full depth: the logical clock model, hardware and software variants, overflow issues, minimum-finding challenges, and the aging algorithm that bridges theory and practice. This foundation prepares you for the stack-based alternative and the approximation algorithms that real systems use.