When the buffer pool is full and a query needs a page that isn't cached, the buffer manager faces a critical decision: which page should be evicted to make room for the new one? This decision—made thousands of times per second in a busy database—can mean the difference between a query waiting microseconds versus milliseconds.
A poor eviction choice forces the database to re-fetch a page that will be needed again soon. An optimal choice evicts pages that won't be accessed again, maximizing the cache's effectiveness. The algorithms that make these decisions are called page replacement algorithms or eviction policies, and they represent some of the most studied problems in computer systems.
By the end of this page, you will understand the theoretical foundations of page replacement, the practical algorithms used in production databases, and the trade-offs each algorithm makes between accuracy, memory overhead, and CPU cost. You'll learn why simple LRU isn't enough and how databases adapt eviction policies to their specific workload patterns.
Before examining practical algorithms, let's understand what optimal page replacement would look like. In 1966, László Bélády showed that the optimal page replacement strategy is to evict the page that will be used furthest in the future (or never again).
Bélády's OPT Algorithm:
The OPT algorithm requires knowledge of the future—which pages will be accessed and in what order. This information is unavailable in real systems. OPT serves as a theoretical benchmark: the best possible hit rate against which practical algorithms can be measured. Any real algorithm's hit rate will be at or below OPT's.
The value of OPT:
While impractical for runtime use, OPT is invaluable for offline analysis: replaying a recorded access trace through OPT yields the best achievable hit rate for a given buffer size, which tells engineers how much headroom a practical algorithm leaves on the table.
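As a hedged illustration, the sketch below simulates OPT offline over a recorded page-access trace. The trace contents and cache capacity are invented for the example; the two-pass structure (precompute each access's next use, then always evict the cached page whose next use is furthest away) is the standard offline formulation.

```cpp
// Offline OPT (Belady) simulation over a recorded trace.
// Illustrative sketch: trace contents and capacity are made up.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

size_t optHits(const std::vector<int>& trace, size_t capacity) {
    // Pass 1: for each position, find where the same page is next accessed.
    std::unordered_map<int, size_t> next_pos;
    std::vector<size_t> next_use(trace.size());
    for (size_t i = trace.size(); i-- > 0;) {
        auto it = next_pos.find(trace[i]);
        next_use[i] = (it == next_pos.end()) ? SIZE_MAX : it->second;
        next_pos[trace[i]] = i;
    }

    // Pass 2: simulate, evicting the page used furthest ahead (or never again).
    std::unordered_map<int, size_t> cache;  // page -> position of its next use
    size_t hits = 0;
    for (size_t i = 0; i < trace.size(); i++) {
        if (cache.count(trace[i])) {
            hits++;
        } else if (cache.size() == capacity) {
            auto victim = cache.begin();
            for (auto c = cache.begin(); c != cache.end(); ++c) {
                if (c->second > victim->second) victim = c;
            }
            cache.erase(victim);
        }
        cache[trace[i]] = next_use[i];  // (re)insert with updated next use
    }
    return hits;
}

int main() {
    std::vector<int> trace = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    std::cout << "OPT hits with 3 frames: " << optHits(trace, 3) << "\n";  // 5
}
```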
Research has shown that for typical database workloads, well-designed practical algorithms achieve 90-95% of OPT's hit rate. The remaining gap represents inherent uncertainty about the future. The table below shows illustrative hit rates for such a workload:
| Algorithm | Hit Rate | % of OPT |
|---|---|---|
| OPT (theoretical) | 95.2% | 100% |
| LRU-K (K=2) | 91.8% | 96.4% |
| Clock | 89.1% | 93.6% |
| LRU | 88.5% | 92.9% |
| FIFO | 82.3% | 86.4% |
| Random | 74.6% | 78.4% |
The Least Recently Used (LRU) algorithm is based on a simple but powerful observation: pages that have been accessed recently are likely to be accessed again soon. This is the temporal locality principle that underlies most caching strategies.
LRU Algorithm:
LRU approximates OPT by using the past as a predictor of the future. If a page hasn't been used recently, it's likely not to be used in the near future.
```cpp
// LRU implementation using a doubly-linked list + hash map
#include <cstdint>
#include <list>
#include <mutex>
#include <unordered_map>

using FrameId = int32_t;

class LRUReplacer {
private:
    size_t capacity;

    // Doubly-linked list: front = most recently used, back = least recently used
    std::list<FrameId> lru_list;

    // Map from frame_id to iterator in the list (for O(1) removal)
    std::unordered_map<FrameId, std::list<FrameId>::iterator> frame_map;

    std::mutex latch;

public:
    explicit LRUReplacer(size_t capacity) : capacity(capacity) {}

    // Record that a frame was accessed
    void recordAccess(FrameId frame_id) {
        std::lock_guard<std::mutex> guard(latch);
        auto it = frame_map.find(frame_id);
        if (it != frame_map.end()) {
            // Already in list - move to front (most recently used)
            lru_list.erase(it->second);
        }
        // Add to front of list
        lru_list.push_front(frame_id);
        frame_map[frame_id] = lru_list.begin();
    }

    // Select a victim frame for eviction (returns least recently used)
    bool selectVictim(FrameId* victim_frame) {
        std::lock_guard<std::mutex> guard(latch);
        if (lru_list.empty()) {
            return false;  // No frames available for eviction
        }
        // Return the least recently used frame (back of list)
        *victim_frame = lru_list.back();
        return true;
    }

    // Remove a frame from consideration (e.g., when pinned)
    void remove(FrameId frame_id) {
        std::lock_guard<std::mutex> guard(latch);
        auto it = frame_map.find(frame_id);
        if (it != frame_map.end()) {
            lru_list.erase(it->second);
            frame_map.erase(it);
        }
    }

    // Add a frame back (e.g., when unpinned)
    void add(FrameId frame_id) {
        recordAccess(frame_id);  // Same as recording an access
    }
};
```

Time and space complexity:

- recordAccess, remove, and selectVictim are all O(1): the hash map points directly at list nodes, so no traversal is ever needed.
- Space overhead is O(n): one list node plus one hash-map entry per tracked frame.
Pros and cons of LRU:

- Pros: conceptually simple, captures temporal locality well, and supports O(1) operations with the structure above.
- Cons: every access mutates a shared list under a latch (CPU overhead and lock contention on hot paths), and recency alone is a poor signal for some access patterns, most notably sequential scans, as described below.
Consider a full table scan that reads every page in a large table sequentially. Pure LRU would fill the buffer pool with scan pages (accessed once, never again) while evicting frequently-used hot pages. This is called 'scan pollution' and severely degrades cache efficiency. Production databases use modified LRU variants to address this.
The Clock algorithm (also called Second-Chance or NRU - Not Recently Used) is an LRU approximation that trades some accuracy for significantly lower overhead. Instead of maintaining an exact LRU ordering, Clock uses a single reference bit per page to track whether the page has been accessed recently.
How Clock works:
Imagine all buffer frames arranged in a circle, with a "clock hand" pointing to one frame:

1. When a page is accessed, its reference bit is set to 1. The hand does not move on access.
2. To find a victim, examine the frame under the hand. If its reference bit is 1, clear the bit and advance the hand; the page gets a second chance.
3. If the reference bit is 0, evict that frame and advance the hand.
```cpp
// Clock (Second-Chance) replacement algorithm
#include <cstddef>
#include <mutex>
#include <vector>

using FrameId = size_t;

class ClockReplacer {
private:
    size_t capacity;
    size_t clock_hand;  // Current position in the circular buffer

    struct FrameInfo {
        bool in_replacer = false;  // Is this frame available for replacement?
        bool ref_bit = false;      // Has this frame been recently accessed?
    };
    std::vector<FrameInfo> frames;

    std::mutex latch;

public:
    explicit ClockReplacer(size_t capacity)
        : capacity(capacity), clock_hand(0), frames(capacity) {}

    // Record that a frame was accessed (set reference bit)
    void recordAccess(FrameId frame_id) {
        std::lock_guard<std::mutex> guard(latch);
        frames[frame_id].ref_bit = true;
    }

    // Find a victim frame for eviction
    bool selectVictim(FrameId* victim_frame) {
        std::lock_guard<std::mutex> guard(latch);
        size_t start = clock_hand;
        size_t passes = 0;  // Avoid infinite loop if all frames are pinned

        while (true) {
            FrameInfo& f = frames[clock_hand];
            if (f.in_replacer) {
                if (f.ref_bit) {
                    // Second chance: clear ref bit and move on
                    f.ref_bit = false;
                } else {
                    // Found victim: ref_bit is 0
                    *victim_frame = clock_hand;
                    f.in_replacer = false;
                    clock_hand = (clock_hand + 1) % capacity;
                    return true;
                }
            }
            clock_hand = (clock_hand + 1) % capacity;

            // Detect if we've gone full circle twice (all frames checked)
            if (clock_hand == start) {
                passes++;
                if (passes >= 2) {
                    return false;  // No victim found
                }
            }
        }
    }

    // Remove frame from replacer (e.g., when pinned)
    void pin(FrameId frame_id) {
        std::lock_guard<std::mutex> guard(latch);
        frames[frame_id].in_replacer = false;
    }

    // Add frame to replacer (e.g., when unpinned)
    void unpin(FrameId frame_id) {
        std::lock_guard<std::mutex> guard(latch);
        frames[frame_id].in_replacer = true;
        // Note: ref_bit is NOT set on unpin - only on actual access
    }
};
```

Why Clock is more practical than LRU:

- An access just sets a bit; there is no list to splice and no hash map to update on the hot path.
- Metadata is two booleans per frame instead of a list node plus a map entry.
- Because accesses never reorder a shared structure, lock contention under concurrency is far lower.
The "second chance" intuition:
The name "second chance" captures the algorithm's fairness: a page that was recently accessed (ref_bit = 1) gets a second chance—its bit is cleared and the clock moves on. Only pages that aren't accessed during a full clock rotation get evicted.
On most workloads, Clock achieves 90-95% of LRU's hit rate while requiring a fraction of the overhead. This makes Clock the preferred choice for systems where page access is very frequent and the cost of exact LRU tracking would be prohibitive.
Standard LRU considers only the most recent access time. But consider two pages:

- Page A: accessed hundreds of times over the past hour, most recently 10 seconds ago.
- Page B: accessed exactly once, 5 seconds ago.
LRU would evict Page A despite its clearly higher value. LRU-K and 2Q address this by incorporating access frequency into eviction decisions.
LRU-K Algorithm:
LRU-K tracks the K-th most recent access to each page. Eviction decisions are based on the time of the K-th-to-last access ("backward K-distance").
For K=2 (LRU-2, the most common variant): each page's eviction priority is the timestamp of its second-to-last access. A page that has been accessed only once has no second-to-last access, so its backward K-distance is infinite and it becomes a preferred eviction victim, which is exactly what defeats one-touch scan pages.
```cpp
// LRU-K conceptual structure (K=2)
#include <cstdint>
#include <deque>
#include <map>

using PageId = int32_t;
using Timestamp = int64_t;

constexpr Timestamp INFINITE_PAST = INT64_MIN;  // "Fewer than K accesses so far"
constexpr Timestamp FUTURE = INT64_MAX;         // Timestamp far in the future

struct PageHistory {
    PageId page_id;
    std::deque<Timestamp> access_history;  // Most recent K accesses

    // Get the K-th most recent access time (backward K-distance)
    Timestamp getKthAccess(size_t k) const {
        if (access_history.size() < k) {
            return INFINITE_PAST;  // Not accessed K times yet
        }
        return access_history[access_history.size() - k];
    }
};

class LRUK_Replacer {
public:
    static constexpr size_t K = 2;  // Track last 2 accesses

    void recordAccess(PageId page_id, Timestamp now) {
        auto& hist = histories[page_id];
        hist.access_history.push_back(now);
        // Keep only the last K accesses
        while (hist.access_history.size() > K) {
            hist.access_history.pop_front();
        }
    }

    // Evict the page with the oldest K-th most recent access; pages with
    // fewer than K accesses (INFINITE_PAST) are preferred victims.
    PageId selectVictim() {
        PageId victim = -1;  // -1 signals "no pages tracked"
        Timestamp oldest_kth_access = FUTURE;
        for (const auto& [page_id, hist] : histories) {
            Timestamp kth_access = hist.getKthAccess(K);
            if (kth_access < oldest_kth_access) {
                oldest_kth_access = kth_access;
                victim = page_id;
            }
        }
        return victim;
    }

private:
    std::map<PageId, PageHistory> histories;
};
```

The 2Q Algorithm:
2Q (Two Queue) is a practical simplification of LRU-2 that uses two separate queues instead of tracking exact timestamps:
How 2Q works:

- New pages enter A1, a FIFO queue that holds pages seen only once.
- If a page is accessed again while in A1, it is promoted to Am, the main queue, which is managed as an LRU.
- Eviction drains A1 in FIFO order first; pages in Am are evicted from the LRU tail only when necessary.
This simple structure prevents scan pollution because scanned pages never leave A1, protecting the "hot" pages in Am.
Both LRU-K and 2Q are 'scan resistant': a sequential scan that touches pages once cannot evict hot pages that are accessed frequently. This is critical for mixed OLTP/analytical workloads where occasional large scans shouldn't destroy the cache.
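The hedged sketch below shows the queue mechanics of a simplified 2Q under the names used above. The full published algorithm also keeps an A1out ghost queue of pages recently evicted from A1, which is omitted here for brevity:

```cpp
// Simplified 2Q sketch: A1 is a FIFO of pages seen once, Am is an LRU of
// re-referenced ("hot") pages. Illustrative only; full 2Q adds an A1out
// ghost queue that this sketch omits.
#include <cstdint>
#include <iterator>
#include <list>
#include <optional>
#include <unordered_map>

using PageId = int32_t;

class TwoQReplacer {
    std::list<PageId> a1;  // FIFO: front = oldest entry
    std::list<PageId> am;  // LRU: front = most recently used
    std::unordered_map<PageId, std::list<PageId>::iterator> in_a1, in_am;

public:
    void recordAccess(PageId page) {
        if (auto it = in_am.find(page); it != in_am.end()) {
            // Hot page: move to the LRU front of Am
            am.erase(it->second);
            am.push_front(page);
            it->second = am.begin();
        } else if (auto it = in_a1.find(page); it != in_a1.end()) {
            // Second access: promote from A1 to Am
            a1.erase(it->second);
            in_a1.erase(it);
            am.push_front(page);
            in_am[page] = am.begin();
        } else {
            // First access: enter A1; scan pages stay here and age out
            a1.push_back(page);
            in_a1[page] = std::prev(a1.end());
        }
    }

    // Evict from A1 first; hot pages in Am are touched only if A1 is empty.
    std::optional<PageId> selectVictim() {
        if (!a1.empty()) {
            PageId victim = a1.front();
            a1.pop_front();
            in_a1.erase(victim);
            return victim;
        }
        if (!am.empty()) {
            PageId victim = am.back();
            am.pop_back();
            in_am.erase(victim);
            return victim;
        }
        return std::nullopt;
    }
};
```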
Modern workloads are dynamic, shifting between different access patterns. Static algorithms may excel on one pattern but perform poorly when the pattern changes. Adaptive Replacement Cache (ARC) and LIRS dynamically adjust their behavior based on observed workload characteristics.
ARC (Adaptive Replacement Cache):
Developed at IBM, ARC maintains two LRU lists of cached pages plus two "ghost" lists of recently evicted page IDs, and adaptively balances capacity between them (see the table below).
The key insight: ARC uses the ghost lists to learn what's working. A hit in B1 means a page recently evicted from T1 was needed again, a sign that recency is undervalued, so ARC grows T1's target size. A hit in B2 means a page evicted from T2 was needed again, so ARC shifts capacity toward T2 instead.
This self-tuning mechanism adapts to whether the workload favors recency or frequency.
| List | Contents | Purpose |
|---|---|---|
| T1 | Pages accessed exactly once recently | Recency-focused cache portion |
| T2 | Pages accessed multiple times recently | Frequency-focused cache portion |
| B1 | Ghost entries: recently evicted from T1 | Learn when to grow T1 |
| B2 | Ghost entries: recently evicted from T2 | Learn when to grow T2 |
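A complete ARC implementation is intricate, but the adaptation rule itself is compact. The hedged sketch below isolates just that rule: a target size p for T1 that grows on B1 ghost hits and shrinks on B2 ghost hits. The delta formulas follow the published ARC description; all of the surrounding list maintenance is omitted.

```cpp
// ARC's self-tuning rule in isolation (the REPLACE subroutine and list
// maintenance of full ARC are omitted). p is the target size of T1;
// c is the total cache capacity in pages.
#include <algorithm>
#include <cstddef>

struct ArcTuner {
    size_t c;      // cache capacity in pages
    size_t p = 0;  // target size for T1 (the recency portion)

    // Hit in ghost list B1: a page evicted from T1 was needed again,
    // so recency is undervalued; grow T1's target.
    void onB1GhostHit(size_t b1_size, size_t b2_size) {
        size_t delta = std::max<size_t>(1, b2_size / std::max<size_t>(1, b1_size));
        p = std::min(c, p + delta);
    }

    // Hit in ghost list B2: a page evicted from T2 was needed again,
    // so frequency is undervalued; shrink T1's target in favor of T2.
    void onB2GhostHit(size_t b1_size, size_t b2_size) {
        size_t delta = std::max<size_t>(1, b1_size / std::max<size_t>(1, b2_size));
        p = (delta > p) ? 0 : p - delta;
    }
};
```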
LIRS (Low Inter-reference Recency Set):
LIRS distinguishes between "hot" and "cold" pages based on Inter-Reference Recency (IRR)—the number of distinct pages accessed between consecutive accesses to the same page.
LIRS dedicates most of the buffer to hot pages and uses a small portion for cold pages, evicting cold pages first.
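To make IRR concrete, here is a small illustrative helper (not LIRS itself) that computes the IRR of every access in a trace, i.e., how many distinct other pages were touched since that page's previous access:

```cpp
// Compute Inter-Reference Recency (IRR) for each access in a trace: the
// number of distinct pages accessed between consecutive accesses to the
// same page. O(n^2) for clarity; illustrative helper, not LIRS itself.
#include <cstddef>
#include <iostream>
#include <set>
#include <unordered_map>
#include <vector>

std::vector<int> computeIRR(const std::vector<int>& trace) {
    std::unordered_map<int, size_t> last_seen;  // page -> previous access index
    std::vector<int> irr(trace.size(), -1);     // -1 = first access, IRR undefined
    for (size_t i = 0; i < trace.size(); i++) {
        auto it = last_seen.find(trace[i]);
        if (it != last_seen.end()) {
            std::set<int> distinct(
                trace.begin() + static_cast<std::ptrdiff_t>(it->second) + 1,
                trace.begin() + static_cast<std::ptrdiff_t>(i));
            irr[i] = static_cast<int>(distinct.size());
        }
        last_seen[trace[i]] = i;
    }
    return irr;
}

int main() {
    // Page 1 recurs with low IRR (hot); page 3 recurs with high IRR (cold)
    std::vector<int> trace = {1, 2, 1, 3, 4, 5, 1, 3};
    for (int v : computeIRR(trace)) std::cout << v << " ";
    std::cout << "\n";  // prints: -1 -1 1 -1 -1 -1 3 3
}
```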
PostgreSQL uses a Clock variant with some scan resistance. MySQL InnoDB uses a modified LRU with "young" and "old" sublists. ZFS uses ARC as its primary cache algorithm. The choice depends on workload characteristics and implementation complexity tolerance.
Database buffer pools have unique requirements beyond general-purpose caching. Page replacement algorithms must account for these database-specific concerns:

- Pinning: pages currently in use by a query have a nonzero pin count and must never be evicted, regardless of replacement priority.
- Write-ahead logging: a dirty page cannot be written back until the WAL records up to that page's LSN are durable on disk.
- Predictable access patterns: the query layer knows when it is running a sequential scan or an index traversal, and the buffer manager can exploit these hints for scan resistance.
MySQL InnoDB's LRU Implementation:
InnoDB uses a modified LRU with two key adaptations:
Midpoint Insertion: New pages enter the LRU list at the 3/8 point (not the head). This gives pages a chance to prove their value before occupying the "young" portion.
Young/Old Sublist: The list is divided into a "young" (head) and "old" (tail) portion. Pages must be accessed again after a configurable delay (innodb_old_blocks_time) to move from old to young.
This prevents full table scans from polluting the cache. Scanned pages enter the old sublist and get evicted before they can displace hot pages in the young sublist.
```cpp
// Conceptual InnoDB LRU list structure
#include <chrono>
#include <cstdint>
#include <list>

using FrameId = int32_t;
using Clock = std::chrono::steady_clock;

class InnoDBLRU {
    // Single list, but logically divided
    std::list<FrameId> lru_list;

    // Iterator to the midpoint (3/8 from tail)
    std::list<FrameId>::iterator old_sublist_start;

    // Configuration
    static constexpr double OLD_RATIO = 0.375;         // 3/8
    std::chrono::milliseconds old_blocks_time{1000};   // Time before page can become "young"

    void insertNew(FrameId frame) {
        // Insert at midpoint, not head
        lru_list.insert(old_sublist_start, frame);
        recordInsertTime(frame, Clock::now());
    }

    void recordAccess(FrameId frame) {
        if (isInOldSublist(frame)) {
            // Check if enough time has passed
            if (Clock::now() - getInsertTime(frame) > old_blocks_time) {
                // Promote to young sublist (move to head)
                moveToHead(frame);
            }
            // Otherwise, don't promote - prevents scan pollution
        } else {
            // Already in young sublist - move to head
            moveToHead(frame);
        }
    }

    // Bookkeeping helpers, omitted in this conceptual sketch
    void recordInsertTime(FrameId frame, Clock::time_point t);
    Clock::time_point getInsertTime(FrameId frame) const;
    bool isInOldSublist(FrameId frame) const;
    void moveToHead(FrameId frame);
};
```

When the replacement algorithm selects a victim, the buffer manager must handle eviction carefully, especially for dirty pages. The eviction process involves multiple steps with important correctness and performance implications.
```cpp
// Buffer pool eviction process
void BufferPool::evictPage(FrameId frame_id) {
    FrameDescriptor& desc = descriptors[frame_id];

    // Step 1: Verify the frame can be evicted
    assert(desc.pin_count == 0);  // Must be unpinned

    // Step 2: Handle dirty page
    if (desc.is_dirty) {
        // Step 2a: Ensure WAL records are flushed first
        // This is the Write-Ahead Logging constraint
        log_manager->flushToLSN(desc.page_lsn);

        // Step 2b: Write the dirty page to disk
        disk_manager->writePage(desc.page_id, &frames[frame_id]);

        // Step 2c: Clear dirty bit
        desc.is_dirty = false;
    }

    // Step 3: Remove from page table
    page_table.erase(desc.page_id);

    // Step 4: Clear frame metadata
    desc.page_id = INVALID_PAGE_ID;
    desc.ref_bit = false;

    // Step 5: Add frame to free list (optional, depends on caller)
    // free_list.push_back(frame_id);
}

// Finding a victim that's ready for eviction
FrameId BufferPool::findEvictableVictim() {
    while (true) {
        FrameId candidate = replacer->selectVictim();
        if (candidate == INVALID_FRAME_ID) {
            // No unpinned frames available
            throw BufferPoolFullException("All frames are pinned");
        }

        FrameDescriptor& desc = descriptors[candidate];
        if (desc.pin_count == 0) {
            return candidate;  // Found an evictable victim
        }
        // Victim was pinned after selection - try again
        // (Race condition between selection and eviction)
    }
}
```

The cost of evicting dirty pages:
Evicting a dirty page requires synchronous disk I/O (the page must be written before the frame can be reused). This can significantly increase latency for the requesting thread.
Mitigation strategies:
Background flushing: A background thread ("page cleaner" in MySQL, "bgwriter" in PostgreSQL) periodically writes dirty pages to disk, keeping the ratio of clean pages high.
Prefer clean victims: When multiple candidates have similar replacement priority, prefer evicting clean pages to avoid disk I/O.
Victim selection look-ahead: Instead of selecting one victim, identify several candidates and prefer the clean ones, as sketched after this list.
Checkpoint coordination: During checkpoints, many pages become clean simultaneously, increasing the clean victim pool.
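As a hedged sketch of the look-ahead strategy, the function below asks the replacer for several candidates and takes the first clean one, falling back to the best dirty candidate. It reuses the BufferPool members from the eviction listing above; selectVictimCandidates is a hypothetical API, since real replacer interfaces vary.

```cpp
// Hedged sketch: prefer a clean victim among the top-k candidates.
// selectVictimCandidates() is a hypothetical replacer method returning
// up to k eviction candidates in priority order.
FrameId BufferPool::findVictimPreferClean(size_t look_ahead) {
    std::vector<FrameId> candidates =
        replacer->selectVictimCandidates(look_ahead);
    if (candidates.empty()) {
        throw BufferPoolFullException("All frames are pinned");
    }
    // First pass: a clean candidate can be reused with no disk write
    for (FrameId frame : candidates) {
        if (!descriptors[frame].is_dirty) {
            return frame;
        }
    }
    // All candidates are dirty: take the highest-priority one; the caller
    // must flush it (WAL first, then the page) before reusing the frame.
    return candidates.front();
}
```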
If the buffer pool runs out of clean frames, every new page fetch requires waiting for a dirty page to be written to disk. This creates latency spikes visible to applications. Monitoring the ratio of dirty pages and background flush rates is essential for stable performance.
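A minimal sketch of such a background flusher follows, again reusing the BufferPool members from the eviction listing. The wake interval, batch size, dirty-ratio threshold, and the dirtyRatio helper are all illustrative assumptions, and a production cleaner would also latch each frame against concurrent modification.

```cpp
// Hedged sketch of a background page cleaner ("bgwriter"-style loop).
// Wakes periodically and flushes a batch of dirty, unpinned pages,
// honoring the WAL-before-data constraint. Thresholds are illustrative.
void BufferPool::backgroundFlusherLoop(std::atomic<bool>& running) {
    const double dirty_ratio_target = 0.10;  // assumed target, not a real default
    const size_t batch_size = 64;            // assumed batch size

    while (running.load()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        if (dirtyRatio() < dirty_ratio_target) continue;  // dirtyRatio: hypothetical helper

        size_t flushed = 0;
        for (size_t f = 0; f < descriptors.size() && flushed < batch_size; f++) {
            FrameDescriptor& desc = descriptors[f];
            if (desc.is_dirty && desc.pin_count == 0) {
                log_manager->flushToLSN(desc.page_lsn);             // WAL first
                disk_manager->writePage(desc.page_id, &frames[f]);  // then the page
                desc.is_dirty = false;
                flushed++;
            }
        }
    }
}
```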
Page replacement is a critical component of buffer pool management, directly impacting cache hit rates and query performance. The choice of algorithm depends on workload characteristics, implementation complexity tolerance, and system requirements.
What's next:
We've seen that dirty pages complicate eviction. The next page explores dirty page management in depth: how databases track modifications, when to flush pages to disk, and how to balance durability requirements against performance.
You now understand page replacement algorithms, from the theoretical optimum to practical implementations used in production databases. This knowledge helps you tune buffer pool behavior and understand performance characteristics.