In 1966, László Bélády at IBM Research published a proof that would shape memory management for decades. He demonstrated that the optimal page to replace is the one that will not be used for the longest time in the future. This sounds obvious in retrospect, but Bélády proved something profound: knowing which page to evict is computationally equivalent to predicting the future.
Since we cannot predict the future, every practical page replacement algorithm is an approximation—a heuristic guessing which page is least likely to be needed soon. Some approximations are better than others. Some fall prey to pathological behaviors. Some are simple to implement; others require elaborate data structures.
Victim selection is the moment of truth in page replacement. Choose well, and the evicted page isn't needed again for a long time. Choose poorly, and you'll fault it back almost immediately—wasting two disk I/O operations and degrading system performance.
By the end of this page, you will understand: (1) The theoretical optimal solution and why it's impossible to implement, (2) How heuristics approximate optimal behavior using observable information, (3) The tradeoffs between selection accuracy and implementation cost, (4) Criteria for evaluating victim selection quality, and (5) The concept of stack algorithms and Bélády's anomaly.
Before exploring practical algorithms, we must understand the theoretical ideal. Bélády's Optimal Algorithm (OPT) provides the theoretical minimum number of page faults possible for any given reference string and frame count.
The OPT Rule:
Replace the page that will not be used for the longest time in the future.
This is provably optimal: any other choice could be improved by swapping it with the OPT choice without increasing faults.
Example:
Reference string: 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1
Frames available: 3
Step-by-step OPT execution (positions below index the reference string starting from 0):
Ref: 7 Frames: [7, -, -] Fault (cold start)
Ref: 0 Frames: [7, 0, -] Fault (cold start)
Ref: 1 Frames: [7, 0, 1] Fault (cold start)
Ref: 2 Frames: [2, 0, 1] Fault: Replace 7 (used at position 17)
Ref: 0 Frames: [2, 0, 1] Hit
Ref: 3 Frames: [2, 0, 3] Fault: Replace 1 (used at position 13)
Ref: 0 Frames: [2, 0, 3] Hit
Ref: 4 Frames: [2, 4, 3] Fault: Replace 0 (used at position 10, later than 2 at 8 or 3 at 9)
Ref: 2 Frames: [2, 4, 3] Hit
Ref: 3 Frames: [2, 4, 3] Hit
... continuing this process (the next fault replaces 4, which is never used again, when 0 is referenced) yields 9 total faults for 20 references.
No algorithm can do better than OPT for this reference string with 3 frames.
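To see the rule work mechanically, here is a minimal C simulation of OPT (a sketch, not production code; the reference string and frame count are hard-coded to the example above, and the helper `next_use` stands in for the "future knowledge" a real system cannot have). It reports 9 faults.

#include <stdio.h>

#define NFRAMES 3

/* Position of the next use of 'page' after 'pos' (past the end if never used again) */
static int next_use(const int *refs, int n, int pos, int page) {
    for (int i = pos + 1; i < n; i++)
        if (refs[i] == page)
            return i;
    return n + 1;                       /* never used again: ideal victim */
}

int main(void) {
    int refs[] = {7,0,1,2,0,3,0,4,2,3,0,3,2,1,2,0,1,7,0,1};
    int n = sizeof(refs) / sizeof(refs[0]);
    int frames[NFRAMES];
    int used = 0, faults = 0;

    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int f = 0; f < used; f++)
            if (frames[f] == refs[i]) { hit = 1; break; }
        if (hit)
            continue;

        faults++;
        if (used < NFRAMES) {           /* cold start: fill a free frame */
            frames[used++] = refs[i];
        } else {                        /* OPT: evict the page used farthest in the future */
            int victim = 0;
            for (int f = 1; f < NFRAMES; f++)
                if (next_use(refs, n, i, frames[f]) > next_use(refs, n, i, frames[victim]))
                    victim = f;
            frames[victim] = refs[i];
        }
    }
    printf("OPT faults: %d\n", faults);  /* prints 9 for this string and 3 frames */
    return 0;
}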
OPT requires knowledge of future memory accesses—which doesn't exist. At runtime, the OS only knows the past, not the future. OPT is useful only for: (1) Analyzing trace data after execution, (2) Benchmarking other algorithms (how close do they get to optimal?), and (3) Theoretical understanding of limits.
Why OPT Matters Despite Being Impractical:
OPT establishes an upper bound on achievable performance. When evaluating real algorithms, we compare them to OPT:
Algorithm Quality = OPT Faults / Algorithm Faults × 100%
An algorithm achieving 80% of OPT quality (only 25% more faults) is reasonably good. An algorithm achieving only 40% (150% more faults) has significant room for improvement.
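A quick worked check of the formula, applied to the example above (the 12-fault figure is simply an assumed result for some practical algorithm on the same reference string):

    Quality      = 9 OPT faults / 12 algorithm faults × 100% = 75%
    Extra faults = 12 / 9 − 1 ≈ 33% more than OPT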
The Insight That Enabled Progress:
Since we can't know the future, we must infer it from the past. This is where the principle of locality becomes crucial:
Pages accessed in the recent past are likely to be accessed in the near future.
This principle—backed by extensive empirical observation—allows us to approximate OPT:
OPT asks: "Which page won't be used for the longest future time?"
We approximate: "Which page hasn't been used for the longest past time?"
This transforms an impossible question into a tractable one.
This approximation is the foundation of LRU (Least Recently Used) and its variants—the most successful family of page replacement algorithms.
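The symmetry between the two questions shows up directly in code. In this sketch, `struct frame` and its fields are purely illustrative: `next_use` is the unknowable quantity OPT would need, while `last_used` is the observable quantity LRU tracks. The two victim pickers differ only in which direction they look.

/* A frame with the two timestamps the algorithms care about.
 * 'next_use' exists only to illustrate OPT -- a running kernel cannot know it. */
struct frame {
    int  page;
    long last_used;   /* observable: updated on each access            */
    long next_use;    /* unknowable at runtime; known only from a trace */
};

/* OPT: evict the page whose NEXT use lies farthest in the future */
int pick_victim_opt(const struct frame *frames, int nframes) {
    int victim = 0;
    for (int f = 1; f < nframes; f++)
        if (frames[f].next_use > frames[victim].next_use)
            victim = f;
    return victim;
}

/* LRU: evict the page whose LAST use lies farthest in the past */
int pick_victim_lru(const struct frame *frames, int nframes) {
    int victim = 0;
    for (int f = 1; f < nframes; f++)
        if (frames[f].last_used < frames[victim].last_used)
            victim = f;
    return victim;
}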
Practical algorithms can only use information that's actually observable. Understanding what's available helps us appreciate why different algorithms exist and what tradeoffs they make.
Category 1: Temporal Information
When was the page last used?
| Information | Collection Method | Algorithm Using It |
|---|---|---|
| Load time | Record timestamp when page loaded | FIFO |
| Last access time | Update timestamp on every access | LRU |
| Last reference window | Reference bit + periodic clearing | Clock/Second-Chance |
Category 2: Frequency Information
How often is the page used?
| Information | Collection Method | Algorithm Using It |
|---|---|---|
| Access count | Increment counter on each access | LFU |
| Recent access count | Counter with decay over time | LRFU |
Category 3: State Information
What is the current state of the page?
| Information | Source | Why It Matters |
|---|---|---|
| Dirty (modified) bit | Hardware | Dirty pages require write before eviction |
| Reference bit | Hardware | Recently referenced pages likely needed |
| Page type | OS metadata | Different page types have different value |
| Sharing count | Reverse mapping | Shared pages affect multiple processes |
Category 4: Process Context
What do we know about the owning process?
| Information | Source | Why It Matters |
|---|---|---|
| Process priority | Scheduler | High-priority processes may need faster access |
| Process activity | Scheduler | Idle processes unlikely to need pages |
| Working set membership | Working set tracking | Pages in WS are actively needed |
| Memory allocation | Resource accounting | Over-allocating processes may be deprioritized |
How do these information sources compare in prediction quality and collection cost?
| Information Type | Quality of Prediction | Collection Cost | Example Algorithm |
|---|---|---|---|
| Load time only | Poor | Very Low | FIFO |
| Reference bit (1-bit) | Moderate | Low | Clock |
| Reference + Dirty bits | Good | Low | Enhanced Clock |
| Exact access timestamps | Excellent | High | LRU (ideal) |
| Access frequency + recency | Excellent | Very High | ARC, LRFU |
More information enables better decisions, but collecting and maintaining that information has overhead. The best algorithms find the sweet spot: enough information to make good decisions without spending more resources on tracking than they save by avoiding faults.
Victim selection typically involves multiple criteria applied in priority order. Understanding this hierarchy clarifies how algorithms combine different signals.
Priority 1: Evictability
Some pages simply cannot be evicted:
Unevictable Pages (Absolute Exclusions):
├── Kernel text and data pages
├── Locked pages (mlock, mlockall)
├── Pages undergoing DMA
├── Active file system buffers
└── Pages pinned by drivers
These must be skipped entirely.
Priority 2: Cost Reduction
Among evictable pages, prefer those cheaper to evict:
Cost Preference Order:
1. Clean pages (no disk write needed)
2. Dirty pages backed by swap (write to swap)
3. Dirty pages backed by file (write to file)
4. Pages requiring special handling
Rationale: Clean page eviction is instant.
Dirty page eviction requires disk I/O.
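One way to express this preference order in code is a small cost-classification helper, shown below as a sketch: the `page_info` fields and enum names are illustrative, not from any real kernel. Lower class means cheaper eviction.

#include <stdbool.h>

/* Illustrative per-page state */
struct page_info {
    bool dirty;           /* modified since load?                    */
    bool swap_backed;     /* anonymous page backed by swap space     */
    bool needs_special;   /* e.g. under writeback or DMA setup       */
};

enum evict_cost {
    COST_CLEAN      = 0,  /* no I/O needed: just drop the page       */
    COST_DIRTY_SWAP = 1,  /* one write to swap                       */
    COST_DIRTY_FILE = 2,  /* one write back to the backing file      */
    COST_SPECIAL    = 3   /* extra handling required before eviction */
};

static enum evict_cost eviction_cost(const struct page_info *p) {
    if (p->needs_special) return COST_SPECIAL;
    if (!p->dirty)        return COST_CLEAN;
    return p->swap_backed ? COST_DIRTY_SWAP : COST_DIRTY_FILE;
}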
Composite Selection Example: Enhanced Second-Chance
This algorithm combines reference and dirty bits into a 4-class priority system:
Class 0: Reference=0, Dirty=0 → Best victim
Not recently used, clean
Class 1: Reference=0, Dirty=1 → Good victim
Not recently used, but dirty
Class 2: Reference=1, Dirty=0 → Poor victim
Recently used, but clean
Class 3: Reference=1, Dirty=1 → Worst victim
Recently used AND dirty
The algorithm scans the frames in circular order looking for a Class 0 page; if none is found, it scans again accepting Class 1 (clearing reference bits along the way), and repeats for higher classes if necessary. This multi-pass scan ensures we prefer clean, unreferenced pages.
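Because a page's class is just its two bits read together as a number, it can be computed directly. A minimal sketch (the helper names are hypothetical):

/* Enhanced second-chance class: 0 = best victim, 3 = worst.
 * Equivalent to reading (reference, dirty) as a 2-bit number. */
static int victim_class(int referenced, int dirty) {
    return (referenced << 1) | dirty;
}

/* During pass p of the circular scan, accept the first page whose
 * class is <= p: pass 0 takes only Class 0, pass 1 also takes Class 1, ... */
static int acceptable(int referenced, int dirty, int pass) {
    return victim_class(referenced, dirty) <= pass;
}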
Priority 3: Page Type Differentiation (Linux)
Linux distinguishes page types:
Page Type Priority (lower = evict first):
1. Clean file-backed pages (can reload from file)
2. Dirty file-backed pages (must write, but recoverable)
3. Anonymous pages (must swap, no backing file)
4. Mapped executable pages (may cause major faults)
File pages can be evicted without swap; anonymous pages require swap space.
The vm.swappiness parameter (0-100) controls the relative weight of reclaiming anonymous vs file pages. At 0, Linux strongly prefers evicting file pages. At 100, it equally considers both. This allows tuning for workloads that are file-I/O heavy (prefer low swappiness) vs compute-heavy with large anonymous allocations (may tolerate higher swappiness).
What happens when victim selection goes wrong? Understanding the consequences helps appreciate why algorithm choice matters.
Scenario: Immediate Re-fault
The worst outcome is evicting a page that's immediately needed:
Time T+0: Page fault on page A
Select page B as victim
Evict B, load A
Cost: 1 disk read (+ 1 write if dirty)
Time T+1: Instruction needs page B
Page fault on page B
Select page C as victim
Evict C, load B
Cost: 1 disk read (+ 1 write if dirty)
Time T+2: Instruction needs page C
... cascade continues ...
This is thrashing: constant page faults, no useful work.
Quantifying the Impact:
Let's compare good vs poor victim selection:
Good Selection (pages not needed soon):
- Evict page X at T+0
- Page X re-faulted at T+1000
- 1000 useful memory accesses between faults
- Effective overhead: 1 fault / 1000 accesses = 0.1%
Poor Selection (pages needed immediately):
- Evict page X at T+0
- Page X re-faulted at T+10
- 10 useful memory accesses between faults
- Effective overhead: 1 fault / 10 accesses = 10%
The poor selection is 100x worse!
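To put rough times on those fault rates, here is a back-of-the-envelope calculation; the 100 ns memory access and 100 µs fault-service costs are assumed for illustration, not taken from the text. The added per-access overhead differs by the same factor of 100.

#include <stdio.h>

int main(void) {
    /* Assumed costs (illustrative): 100 ns per memory access,
     * 100 us to service a page fault from SSD-class swap. */
    const double mem_ns   = 100.0;
    const double fault_ns = 100000.0;

    const double fault_rates[] = { 1.0 / 1000.0,   /* good selection */
                                   1.0 / 10.0 };   /* poor selection */

    for (int i = 0; i < 2; i++) {
        double p        = fault_rates[i];
        double overhead = p * fault_ns;        /* extra ns per access, on average */
        double eat      = mem_ns + overhead;   /* effective access time           */
        printf("fault rate %5.2f%%: +%6.0f ns per access, %5.0fx slowdown\n",
               p * 100.0, overhead, eat / mem_ns);
    }
    return 0;
}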
| Selection Quality | Avg Time Between Re-faults | Page Fault Rate | System Behavior |
|---|---|---|---|
| Excellent (near OPT) | 10,000 references | < 0.01% | Near-native performance |
| Good | 1,000 - 10,000 references | 0.01-0.1% | Slightly elevated latency |
| Acceptable | 100 - 1,000 references | 0.1-1% | Noticeable slowdowns |
| Poor | 10 - 100 references | 1-10% | Significant degradation |
| Pathological | < 10 references | > 10% | Thrashing, unusable |
Real-World Example: FIFO Under Scan Workload
Consider a sequential scan through data larger than memory:
Scenario:
- 1 GB file, 4 KB pages = 262,144 pages
- 512 MB memory = 131,072 frames
- Sequential read from start to end
With FIFO:
- Pages loaded in order: 0, 1, 2, ... 131,071
- Memory full, start evicting oldest
- Evict page 0, load page 131,072
- Evict page 1, load page 131,073
- ... continue for entire file ...
Result: Every single page faults once.
Total faults: 262,144 (100% fault rate)
This is actually acceptable for sequential access!
The Disaster Case: Looping Scan
Scenario:
- Same 1 GB file
- Program loops: read entire file, repeat
With FIFO after first pass:
- Memory contains pages 131,072 - 262,143
- Loop restarts, needs page 0
- Page 0 was evicted! Fault.
- Evict 131,072 to load 0
- Next access: page 1 was evicted! Fault.
- Evict 131,073 to load 1
- ... every single access faults ...
With Perfect LRU:
- Memory contains pages 131,072 - 262,143
- Loop restarts, needs page 0
- Evict page 131,072 (least recently used)
- Access page 1: also a fault (it was evicted during the first pass)
- Still bad—the looping pattern defeats recency-based replacement just as badly
This demonstrates: no replacement policy can rescue a workload whose working set exceeds available memory.
When the working set exceeds available memory, no victim selection algorithm can prevent thrashing. The only solutions are: (1) Add more memory, (2) Reduce working set size, (3) Swap out entire processes to free bulk memory, or (4) Terminate processes (OOM killer). Victim selection algorithms optimize within feasibility bounds—they can't create physical memory that doesn't exist.
One might expect that adding more frames to a system would always reduce page faults—more memory means more pages can stay resident. Surprisingly, this intuition is wrong for some algorithms.
Bélády's Anomaly is the counterintuitive phenomenon where increasing the number of frames can increase the number of page faults for certain algorithms and reference strings.
Classic Example with FIFO:
Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
With 3 frames:
┌─────┬───────────────┬───────────────┐
│ Ref │ Frames │ Event │
├─────┼───────────────┼───────────────┤
│ 1 │ [1, -, -] │ Fault │
│ 2 │ [1, 2, -] │ Fault │
│ 3 │ [1, 2, 3] │ Fault │
│ 4 │ [4, 2, 3] │ Fault (evict 1) │
│ 1 │ [4, 1, 3] │ Fault (evict 2) │
│ 2 │ [4, 1, 2] │ Fault (evict 3) │
│ 5 │ [5, 1, 2] │ Fault (evict 4) │
│ 1 │ [5, 1, 2] │ Hit │
│ 2 │ [5, 1, 2] │ Hit │
│ 3 │ [5, 3, 2] │ Fault (evict 1) │
│ 4 │ [5, 3, 4] │ Fault (evict 2) │
│ 5 │ [5, 3, 4] │ Hit │
└─────┴───────────────┴───────────────┘
Total faults with 3 frames: 9
With 4 frames:
┌─────┬─────────────────┬───────────────┐
│ Ref │ Frames │ Event │
├─────┼─────────────────┼───────────────┤
│ 1 │ [1, -, -, -] │ Fault │
│ 2 │ [1, 2, -, -] │ Fault │
│ 3 │ [1, 2, 3, -] │ Fault │
│ 4 │ [1, 2, 3, 4] │ Fault │
│ 1 │ [1, 2, 3, 4] │ Hit │
│ 2 │ [1, 2, 3, 4] │ Hit │
│ 5 │ [5, 2, 3, 4] │ Fault (evict 1) │
│ 1 │ [5, 1, 3, 4] │ Fault (evict 2) │
│ 2 │ [5, 1, 2, 4] │ Fault (evict 3) │
│ 3 │ [5, 1, 2, 3] │ Fault (evict 4) │
│ 4 │ [4, 1, 2, 3] │ Fault (evict 5) │
│ 5 │ [4, 5, 2, 3] │ Fault (evict 1) │
└─────┴─────────────────┴───────────────┘
Total faults with 4 frames: 10
10 > 9: Adding a frame INCREASED faults!
Bélády's anomaly occurs because FIFO's eviction order depends on load order, which changes when more frames are available. With different eviction timing, pages that were hits become misses and vice versa. The net effect can be negative. This counterintuitive behavior reveals that FIFO doesn't capture the true 'value' of keeping a page resident.
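A short FIFO simulation reproduces both counts; this is a sketch with the reference string hard-coded, and it prints 9 faults for 3 frames and 10 for 4.

#include <stdio.h>

/* Count FIFO page faults for a reference string with 'nframes' frames */
static int fifo_faults(const int *refs, int n, int nframes) {
    int frames[16];                        /* assumes nframes <= 16 */
    int used = 0, oldest = 0, faults = 0;

    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int f = 0; f < used; f++)
            if (frames[f] == refs[i]) { hit = 1; break; }
        if (hit)
            continue;

        faults++;
        if (used < nframes) {
            frames[used++] = refs[i];      /* fill a free frame           */
        } else {
            frames[oldest] = refs[i];      /* evict the oldest-loaded page */
            oldest = (oldest + 1) % nframes;
        }
    }
    return faults;
}

int main(void) {
    int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    int n = sizeof(refs) / sizeof(refs[0]);
    printf("3 frames: %d faults\n", fifo_faults(refs, n, 3));  /* 9  */
    printf("4 frames: %d faults\n", fifo_faults(refs, n, 4));  /* 10 */
    return 0;
}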
Stack Algorithms: Immune to Bélády's Anomaly
A class of algorithms called stack algorithms are provably immune to Bélády's anomaly. These algorithms maintain the property:
The set of pages in memory with n frames is always a subset of pages in memory with n+1 frames.
Formal Definition:
An algorithm is a stack algorithm if, at every point in time:
S(t, n) ⊆ S(t, n+1) for all n
Where S(t, n) = set of pages in memory at time t with n frames
Examples: LRU and OPT are stack algorithms; FIFO is not—which is exactly why it can exhibit Bélády's anomaly.
The Stack Property Guarantees:
For stack algorithms:
Faults(n frames) ≥ Faults(n+1 frames)
More frames never hurts.
Why LRU Has the Stack Property:
LRU maintains a strict ordering by recency. With n frames, the n most recently used pages are resident. With n+1 frames, the n+1 most recently used are resident—which includes all of the previous n. Thus:
LRU: Pages in memory with 3 frames at time t ⊆ Pages with 4 frames at time t
This is always true because both are "most recent k pages."
Practical Implication:
When choosing between algorithms, the stack property matters for predictability: with LRU or another stack algorithm, adding memory is guaranteed not to increase faults, so capacity upgrades behave as expected. With FIFO-like algorithms there is no such guarantee—performance after adding frames must be measured, not assumed.
Each victim selection strategy involves tradeoffs between multiple dimensions. Understanding these helps explain why simple algorithms often win in practice.
Dimension 1: Selection Accuracy vs Overhead
More Accurate Selection:
+ Fewer page faults
+ Better system throughput
- Higher per-access bookkeeping
- More complex data structures
- Longer critical sections (potential contention)
Less Accurate Selection:
+ Simpler implementation
+ Lower per-access overhead
+ Faster victim identification
- More page faults
- Potentially pathological cases
Example: Perfect LRU vs Clock
| Aspect | Perfect LRU | Clock |
|---|---|---|
| Selection quality | Optimal among history-based | ~80-90% of LRU |
| Per-access cost | Update linked list or counter | None (hardware reference bit) |
| Selection cost | O(1) with proper data structure | O(n) worst case scan |
| Memory overhead | 8+ bytes per page | 1 bit per page |
| Scalability | Limited by list operations | Excellent |
Dimension 2: Memory Overhead
Per-Page Metadata Requirements:
FIFO: ~8 bytes (position in queue)
LRU: ~16 bytes (list pointers + timestamp)
LFU: ~12 bytes (pointer + counter)
Clock: ~0 bytes (uses hardware reference bit)
ARC: ~32 bytes (dual lists + ghost entries)
With 16 GB RAM and 4 KB pages:
Total pages: 4 million
FIFO: 32 MB metadata
LRU: 64 MB metadata
LFU: 48 MB metadata
Clock: 0 MB additional
ARC: 128 MB metadata
Dimension 3: Scalability with CPUs
Multiprocessor systems face contention on shared data structures:
Global LRU list (naive):
Every memory access updates list position
List protected by lock
Lock becomes severe bottleneck
Per-CPU or partitioned structures:
Each CPU has local list
Global coordination only on eviction
Much better scalability
Slightly less accurate
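A sketch of how partitioning reduces contention; the types, names, and MAX_CPUS value are illustrative (real kernels are far more elaborate, and lock initialization via pthread_spin_init is omitted here). Each CPU updates only its own list under a local lock; cross-CPU coordination happens only when a victim must be chosen.

#include <pthread.h>
#include <stddef.h>

#define MAX_CPUS 8   /* illustrative */

struct page_frame {
    struct page_frame *prev, *next;
    unsigned long pfn;
    unsigned long last_used;
};

/* One LRU list per CPU, each with its own lock */
struct lru_list {
    pthread_spinlock_t lock;   /* protects only this CPU's list */
    struct page_frame *mru;    /* most recently used end        */
    struct page_frame *lru;    /* least recently used end       */
};

static struct lru_list percpu_lru[MAX_CPUS];

/* Fast path: an access touches only the local CPU's lock */
void note_access(int cpu, struct page_frame *pf, unsigned long now) {
    struct lru_list *l = &percpu_lru[cpu];
    pthread_spin_lock(&l->lock);
    pf->last_used = now;       /* real code would also relink pf to the MRU end */
    pthread_spin_unlock(&l->lock);
}

/* Slow path: eviction compares the tails of all per-CPU lists.
 * This is the only cross-CPU coordination, so it is rare -- and slightly
 * less accurate than one globally ordered list (a real implementation
 * would also re-check or pin the chosen victim before evicting it). */
struct page_frame *pick_global_victim(void) {
    struct page_frame *victim = NULL;
    for (int cpu = 0; cpu < MAX_CPUS; cpu++) {
        struct lru_list *l = &percpu_lru[cpu];
        pthread_spin_lock(&l->lock);
        if (l->lru && (!victim || l->lru->last_used < victim->last_used))
            victim = l->lru;
        pthread_spin_unlock(&l->lock);
    }
    return victim;
}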
| Algorithm | Accuracy | Implementation Complexity | Scalability | Real-World Use |
|---|---|---|---|---|
| OPT | Perfect | Impossible online | N/A | Benchmark only |
| LRU (exact) | Excellent | High | Poor | Rarely used |
| LRU (approx) | Very Good | Medium | Good | Common basis |
| Clock | Good | Low | Excellent | Very common |
| FIFO | Poor-Fair | Very Low | Excellent | Backup option |
| LFU | Situation-dependent | Medium | Medium | Specific workloads |
| ARC | Excellent | High | Medium | ZFS, databases |
Clock (and its variants) dominate production systems because they achieve ~85-90% of perfect LRU performance with near-zero overhead. The hardware reference bit does the work that would otherwise require expensive software tracking. This is why most real operating systems use Clock variants rather than pure LRU.
Real operating systems use sophisticated combinations of the principles we've discussed. Let's examine how major systems implement victim selection.
Linux Page Frame Reclamation Algorithm (PFRA):
Linux maintains two LRU-like lists per memory zone:
┌────────────────────────────────┐
│ Active List │
│ Recently accessed pages │
│ Protected from reclamation │
└────────────┬───────────────────┘
│ Demotion (aging)
▼
┌────────────────────────────────┐
│ Inactive List │
│ Candidates for reclamation │
│ Evicted from tail │
└────────────────────────────────┘
Pages enter at active list.
Periodic scanning demotes old pages to inactive.
Eviction takes from inactive tail.
Accessed inactive pages promoted back to active.
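A heavily simplified sketch of that two-list flow—illustrative structures only, not actual kernel code: periodic aging demotes unreferenced pages from active to inactive, and a reference observed while a page waits on the inactive list promotes it back.

#include <stdbool.h>
#include <stddef.h>

/* Illustrative page descriptor and list -- not the kernel's real structures */
struct page {
    struct page *next;
    bool referenced;          /* reference bit sampled by the scanner */
};

struct list {
    struct page *oldest;      /* demotion / reclaim happen at this end */
    struct page *youngest;    /* new arrivals are linked at this end   */
};

static struct list active_list, inactive_list;

static void list_add(struct list *l, struct page *p) {
    p->next = NULL;
    if (l->youngest) l->youngest->next = p;
    else             l->oldest = p;
    l->youngest = p;
}

static struct page *list_take(struct list *l) {
    struct page *p = l->oldest;
    if (p) {
        l->oldest = p->next;
        if (!l->oldest) l->youngest = NULL;
    }
    return p;
}

/* Periodic aging: unreferenced pages at the old end of the active list
 * are demoted; referenced ones get another round on the active list. */
void age_active(int nr_to_scan) {
    for (int i = 0; i < nr_to_scan; i++) {
        struct page *p = list_take(&active_list);
        if (!p) break;
        if (p->referenced) {
            p->referenced = false;
            list_add(&active_list, p);
        } else {
            list_add(&inactive_list, p);   /* now a reclaim candidate */
        }
    }
}

/* Reclaim: evict from the old end of the inactive list, promoting any
 * page that was referenced again while waiting there. */
struct page *reclaim_one(void) {
    struct page *p;
    while ((p = list_take(&inactive_list)) != NULL) {
        if (p->referenced) {
            p->referenced = false;
            list_add(&active_list, p);     /* promote back to active */
            continue;
        }
        return p;                          /* victim */
    }
    return NULL;
}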
Linux Selection Criteria (Simplified):
1. Scan inactive list for unreferenced pages
2. Prefer clean pages over dirty
3. Prefer file-backed over anonymous (tunable via swappiness)
4. Consider cgroup memory limits
5. Avoid recently dirtied pages (give time to coalesce writes)
6. Skip locked/pinned pages
7. Balance across memory zones
FreeBSD: Multi-Queue LRU
FreeBSD uses four page queues:
┌──────────────────────────────────────────────────┐
│ Active Queue │ Recently referenced │
├───────────────────┼──────────────────────────────┤
│ Inactive Queue │ Not recently referenced │
├───────────────────┼──────────────────────────────┤
│ Laundry Queue │ Dirty, scheduled for write │
├───────────────────┼──────────────────────────────┤
│ Free Queue │ Available for allocation │
└───────────────────┴──────────────────────────────┘
Daemon (pagedaemon) moves pages through queues:
Active → Inactive (aging)
Inactive dirty → Laundry (write scheduling)
Inactive clean → Free (available)
Laundry after write → Free
Windows Memory Manager:
Windows uses Working Set management:
1. Each process has a Working Set (resident pages)
2. Working Set has min and max size
3. When memory pressure high:
a. Trim Working Sets toward min
b. Aged pages go to Standby List
c. Modified pages go to Modified List
4. Standby List: Clean pages, still valid
"Soft" faults if re-accessed
5. Modified List: Dirty pages queued for write
Become Standby after written
6. Free List: Zeroed pages ready for allocation
/* Simplified Victim Selection (Clock-like) */
struct page_frame *select_victim(void)
{
    struct page_frame *victim = NULL;
    int passes = 0;

    /* Circular scan through all frames */
    while (victim == NULL && passes < 2) {
        struct page_frame *pf = get_clock_hand();

        /* Skip unevictable pages */
        if (!(pf->flags & (PF_LOCKED | PF_KERNEL | PF_DMA))) {

            /* First pass: accept only unreferenced, clean pages */
            if (passes == 0 && !pf->referenced && !pf->dirty) {
                victim = pf;
                break;
            }

            /* Second pass: accept any unreferenced page */
            if (passes == 1 && !pf->referenced) {
                victim = pf;
                break;
            }

            /* Referenced pages get a second chance: clear the bit */
            if (pf->referenced)
                pf->referenced = 0;
        }

        advance_clock_hand();

        /* Count complete sweeps of the clock */
        if (clock_hand_at_start())
            passes++;
    }

    return victim;    /* May be NULL if all pages are locked */
}

We've explored victim selection—the heart of page replacement: OPT defines the unreachable ideal, locality lets us approximate it from the observable past, and every practical algorithm trades selection accuracy against tracking overhead.
What's Next:
We've established what victim selection aims to achieve and how different approaches compare. The next page examines the dirty bit—a crucial piece of hardware support that dramatically affects the cost of evicting specific pages and influences which victims algorithms prefer.
You now understand victim selection at a deep level—from the theoretical optimal through practical approximations, and from simple heuristics to production implementations. This foundation prepares you for understanding the specific hardware and software mechanisms that enable efficient page replacement.