When the buffer pool is full and a query needs a page that isn't cached, the buffer manager faces a critical decision: which page should be evicted to make room for the new one? This decision—made thousands of times per second in a busy database—can mean the difference between a query waiting microseconds versus milliseconds.
A poor eviction choice forces the database to re-fetch a page that will be needed again soon. An optimal choice evicts pages that won't be accessed again, maximizing the cache's effectiveness. The algorithms that make these decisions are called page replacement algorithms or eviction policies, and they represent some of the most studied problems in computer systems.
By the end of this page, you will understand the theoretical foundations of page replacement, the practical algorithms used in production databases, and the trade-offs each algorithm makes between accuracy, memory overhead, and CPU cost. You'll learn why simple LRU isn't enough and how databases adapt eviction policies to their specific workload patterns.
Before examining practical algorithms, let's understand what optimal page replacement would look like. In 1966, László Bélády showed that the optimal page replacement strategy is to evict the page that will be used furthest in the future (or never again).
Bélády's OPT Algorithm:
The OPT algorithm requires knowledge of the future—which pages will be accessed and in what order. This information is unavailable in real systems. OPT serves as a theoretical benchmark: the best possible hit rate against which practical algorithms can be measured. Any real algorithm's hit rate will be at or below OPT's.
The value of OPT:
While impractical for runtime use, OPT is invaluable for offline analysis: replaying a recorded access trace through OPT yields the best achievable hit rate for a given buffer size, which tells engineers how much headroom a practical algorithm leaves on the table.
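As a hedged illustration, the sketch below simulates OPT offline over a recorded page-access trace. The trace contents and cache capacity are invented for the example; the two-pass structure (precompute each access's next use, then always evict the cached page whose next use is furthest away) is the standard offline formulation.

```cpp
// Offline OPT (Belady) simulation over a recorded trace.
// Illustrative sketch: trace contents and capacity are made up.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

size_t optHits(const std::vector<int>& trace, size_t capacity) {
    // Pass 1: for each position, find where the same page is next accessed.
    std::unordered_map<int, size_t> next_pos;
    std::vector<size_t> next_use(trace.size());
    for (size_t i = trace.size(); i-- > 0;) {
        auto it = next_pos.find(trace[i]);
        next_use[i] = (it == next_pos.end()) ? SIZE_MAX : it->second;
        next_pos[trace[i]] = i;
    }

    // Pass 2: simulate, evicting the page used furthest ahead (or never again).
    std::unordered_map<int, size_t> cache;  // page -> position of its next use
    size_t hits = 0;
    for (size_t i = 0; i < trace.size(); i++) {
        if (cache.count(trace[i])) {
            hits++;
        } else if (cache.size() == capacity) {
            auto victim = cache.begin();
            for (auto c = cache.begin(); c != cache.end(); ++c) {
                if (c->second > victim->second) victim = c;
            }
            cache.erase(victim);
        }
        cache[trace[i]] = next_use[i];  // (re)insert with updated next use
    }
    return hits;
}

int main() {
    std::vector<int> trace = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    std::cout << "OPT hits with 3 frames: " << optHits(trace, 3) << "\n";  // 5
}
```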
Research has shown that for typical database workloads, well-designed practical algorithms achieve 90-95% of OPT's hit rate. The remaining gap represents inherent uncertainty about the future. The table below shows illustrative hit rates for such a workload:
| Algorithm | Hit Rate | % of OPT |
|---|---|---|
| OPT (theoretical) | 95.2% | 100% |
| LRU-K (K=2) | 91.8% | 96.4% |
| Clock | 89.1% | 93.6% |
| LRU | 88.5% | 92.9% |
| FIFO | 82.3% | 86.4% |
| Random | 74.6% | 78.4% |
The Least Recently Used (LRU) algorithm is based on a simple but powerful observation: pages that have been accessed recently are likely to be accessed again soon. This is the temporal locality principle that underlies most caching strategies.
LRU Algorithm:
LRU approximates OPT by using the past as a predictor of the future. If a page hasn't been used recently, it's likely not to be used in the near future.
```cpp
// LRU implementation using a doubly-linked list + hash map
#include <cstdint>
#include <list>
#include <mutex>
#include <unordered_map>

using FrameId = int32_t;

class LRUReplacer {
private:
    size_t capacity;

    // Doubly-linked list: front = most recently used, back = least recently used
    std::list<FrameId> lru_list;

    // Map from frame_id to iterator in the list (for O(1) removal)
    std::unordered_map<FrameId, std::list<FrameId>::iterator> frame_map;

    std::mutex latch;

public:
    explicit LRUReplacer(size_t capacity) : capacity(capacity) {}

    // Record that a frame was accessed
    void recordAccess(FrameId frame_id) {
        std::lock_guard<std::mutex> guard(latch);
        auto it = frame_map.find(frame_id);
        if (it != frame_map.end()) {
            // Already in list - move to front (most recently used)
            lru_list.erase(it->second);
        }
        // Add to front of list
        lru_list.push_front(frame_id);
        frame_map[frame_id] = lru_list.begin();
    }

    // Select a victim frame for eviction (returns least recently used)
    bool selectVictim(FrameId* victim_frame) {
        std::lock_guard<std::mutex> guard(latch);
        if (lru_list.empty()) {
            return false;  // No frames available for eviction
        }
        // Return the least recently used frame (back of list)
        *victim_frame = lru_list.back();
        return true;
    }

    // Remove a frame from consideration (e.g., when pinned)
    void remove(FrameId frame_id) {
        std::lock_guard<std::mutex> guard(latch);
        auto it = frame_map.find(frame_id);
        if (it != frame_map.end()) {
            lru_list.erase(it->second);
            frame_map.erase(it);
        }
    }

    // Add a frame back (e.g., when unpinned)
    void add(FrameId frame_id) {
        recordAccess(frame_id);  // Same as recording an access
    }
};
```

Time and space complexity:

- recordAccess, remove, and selectVictim are all O(1): the hash map points directly at list nodes, so no traversal is ever needed.
- Space overhead is O(n): one list node plus one hash-map entry per tracked frame.
Pros and cons of LRU:

- Pros: conceptually simple, captures temporal locality well, and supports O(1) operations with the structure above.
- Cons: every access mutates a shared list under a latch (CPU overhead and lock contention on hot paths), and recency alone is a poor signal for some access patterns, most notably sequential scans, as described below.
Consider a full table scan that reads every page in a large table sequentially. Pure LRU would fill the buffer pool with scan pages (accessed once, never again) while evicting frequently-used hot pages. This is called 'scan pollution' and severely degrades cache efficiency. Production databases use modified LRU variants to address this.
The Clock algorithm (also called Second-Chance or NRU - Not Recently Used) is an LRU approximation that trades some accuracy for significantly lower overhead. Instead of maintaining an exact LRU ordering, Clock uses a single reference bit per page to track whether the page has been accessed recently.
How Clock works:
Imagine all buffer frames arranged in a circle, with a "clock hand" pointing to one frame:

1. When a page is accessed, its reference bit is set to 1. The hand does not move on access.
2. To find a victim, examine the frame under the hand. If its reference bit is 1, clear the bit and advance the hand; the page gets a second chance.
3. If the reference bit is 0, evict that frame and advance the hand.
```cpp
// Clock (Second-Chance) replacement algorithm
#include <cstddef>
#include <mutex>
#include <vector>

using FrameId = size_t;

class ClockReplacer {
private:
    size_t capacity;
    size_t clock_hand;  // Current position in the circular buffer

    struct FrameInfo {
        bool in_replacer = false;  // Is this frame available for replacement?
        bool ref_bit = false;      // Has this frame been recently accessed?
    };
    std::vector<FrameInfo> frames;

    std::mutex latch;

public:
    explicit ClockReplacer(size_t capacity)
        : capacity(capacity), clock_hand(0), frames(capacity) {}

    // Record that a frame was accessed (set reference bit)
    void recordAccess(FrameId frame_id) {
        std::lock_guard<std::mutex> guard(latch);
        frames[frame_id].ref_bit = true;
    }

    // Find a victim frame for eviction
    bool selectVictim(FrameId* victim_frame) {
        std::lock_guard<std::mutex> guard(latch);
        size_t start = clock_hand;
        size_t passes = 0;  // Avoid infinite loop if all frames are pinned

        while (true) {
            FrameInfo& f = frames[clock_hand];
            if (f.in_replacer) {
                if (f.ref_bit) {
                    // Second chance: clear ref bit and move on
                    f.ref_bit = false;
                } else {
                    // Found victim: ref_bit is 0
                    *victim_frame = clock_hand;
                    f.in_replacer = false;
                    clock_hand = (clock_hand + 1) % capacity;
                    return true;
                }
            }
            clock_hand = (clock_hand + 1) % capacity;

            // Detect if we've gone full circle twice (all frames checked)
            if (clock_hand == start) {
                passes++;
                if (passes >= 2) {
                    return false;  // No victim found
                }
            }
        }
    }

    // Remove frame from replacer (e.g., when pinned)
    void pin(FrameId frame_id) {
        std::lock_guard<std::mutex> guard(latch);
        frames[frame_id].in_replacer = false;
    }

    // Add frame to replacer (e.g., when unpinned)
    void unpin(FrameId frame_id) {
        std::lock_guard<std::mutex> guard(latch);
        frames[frame_id].in_replacer = true;
        // Note: ref_bit is NOT set on unpin - only on actual access
    }
};
```

Why Clock is more practical than LRU:

- An access just sets a bit; there is no list to splice and no hash map to update on the hot path.
- Metadata is two booleans per frame instead of a list node plus a map entry.
- Because accesses never reorder a shared structure, lock contention under concurrency is far lower.
The "second chance" intuition:
The name "second chance" captures the algorithm's fairness: a page that was recently accessed (ref_bit = 1) gets a second chance—its bit is cleared and the clock moves on. Only pages that aren't accessed during a full clock rotation get evicted.
On most workloads, Clock achieves 90-95% of LRU's hit rate while requiring a fraction of the overhead. This makes Clock the preferred choice for systems where page access is very frequent and the cost of exact LRU tracking would be prohibitive.
Standard LRU considers only the most recent access time. But consider two pages:

- Page A: accessed hundreds of times over the past hour, most recently 10 seconds ago.
- Page B: accessed exactly once, 5 seconds ago.
LRU would evict Page A despite its clearly higher value. LRU-K and 2Q address this by incorporating access frequency into eviction decisions.
LRU-K Algorithm:
LRU-K tracks the K-th most recent access to each page. Eviction decisions are based on the time of the K-th-to-last access ("backward K-distance").
For K=2 (LRU-2, the most common variant): each page's eviction priority is the timestamp of its second-to-last access. A page that has been accessed only once has no second-to-last access, so its backward K-distance is infinite and it becomes a preferred eviction victim, which is exactly what defeats one-touch scan pages.
```cpp
// LRU-K conceptual structure (K=2)
#include <cstdint>
#include <deque>
#include <map>

using PageId = int32_t;
using Timestamp = int64_t;

constexpr Timestamp INFINITE_PAST = INT64_MIN;  // "Fewer than K accesses so far"
constexpr Timestamp FUTURE = INT64_MAX;         // Timestamp far in the future

struct PageHistory {
    PageId page_id;
    std::deque<Timestamp> access_history;  // Most recent K accesses

    // Get the K-th most recent access time (backward K-distance)
    Timestamp getKthAccess(size_t k) const {
        if (access_history.size() < k) {
            return INFINITE_PAST;  // Not accessed K times yet
        }
        return access_history[access_history.size() - k];
    }
};

class LRUK_Replacer {
public:
    static constexpr size_t K = 2;  // Track last 2 accesses

    void recordAccess(PageId page_id, Timestamp now) {
        auto& hist = histories[page_id];
        hist.access_history.push_back(now);
        // Keep only the last K accesses
        while (hist.access_history.size() > K) {
            hist.access_history.pop_front();
        }
    }

    // Evict the page with the oldest K-th most recent access; pages with
    // fewer than K accesses (INFINITE_PAST) are preferred victims.
    PageId selectVictim() {
        PageId victim = -1;  // -1 signals "no pages tracked"
        Timestamp oldest_kth_access = FUTURE;
        for (const auto& [page_id, hist] : histories) {
            Timestamp kth_access = hist.getKthAccess(K);
            if (kth_access < oldest_kth_access) {
                oldest_kth_access = kth_access;
                victim = page_id;
            }
        }
        return victim;
    }

private:
    std::map<PageId, PageHistory> histories;
};
```

The 2Q Algorithm:
2Q (Two Queue) is a practical simplification of LRU-2 that uses two separate queues instead of tracking exact timestamps:
How 2Q works:

- New pages enter A1, a FIFO queue that holds pages seen only once.
- If a page is accessed again while in A1, it is promoted to Am, the main queue, which is managed as an LRU.
- Eviction drains A1 in FIFO order first; pages in Am are evicted from the LRU tail only when necessary.
This simple structure prevents scan pollution because scanned pages never leave A1, protecting the "hot" pages in Am.
Both LRU-K and 2Q are 'scan resistant': a sequential scan that touches pages once cannot evict hot pages that are accessed frequently. This is critical for mixed OLTP/analytical workloads where occasional large scans shouldn't destroy the cache.
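The hedged sketch below shows the queue mechanics of a simplified 2Q under the names used above. The full published algorithm also keeps an A1out ghost queue of pages recently evicted from A1, which is omitted here for brevity:

```cpp
// Simplified 2Q sketch: A1 is a FIFO of pages seen once, Am is an LRU of
// re-referenced ("hot") pages. Illustrative only; full 2Q adds an A1out
// ghost queue that this sketch omits.
#include <cstdint>
#include <iterator>
#include <list>
#include <optional>
#include <unordered_map>

using PageId = int32_t;

class TwoQReplacer {
    std::list<PageId> a1;  // FIFO: front = oldest entry
    std::list<PageId> am;  // LRU: front = most recently used
    std::unordered_map<PageId, std::list<PageId>::iterator> in_a1, in_am;

public:
    void recordAccess(PageId page) {
        if (auto it = in_am.find(page); it != in_am.end()) {
            // Hot page: move to the LRU front of Am
            am.erase(it->second);
            am.push_front(page);
            it->second = am.begin();
        } else if (auto it = in_a1.find(page); it != in_a1.end()) {
            // Second access: promote from A1 to Am
            a1.erase(it->second);
            in_a1.erase(it);
            am.push_front(page);
            in_am[page] = am.begin();
        } else {
            // First access: enter A1; scan pages stay here and age out
            a1.push_back(page);
            in_a1[page] = std::prev(a1.end());
        }
    }

    // Evict from A1 first; hot pages in Am are touched only if A1 is empty.
    std::optional<PageId> selectVictim() {
        if (!a1.empty()) {
            PageId victim = a1.front();
            a1.pop_front();
            in_a1.erase(victim);
            return victim;
        }
        if (!am.empty()) {
            PageId victim = am.back();
            am.pop_back();
            in_am.erase(victim);
            return victim;
        }
        return std::nullopt;
    }
};
```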
Modern workloads are dynamic, shifting between different access patterns. Static algorithms may excel on one pattern but perform poorly when the pattern changes. Adaptive Replacement Cache (ARC) and LIRS dynamically adjust their behavior based on observed workload characteristics.
ARC (Adaptive Replacement Cache):
Developed at IBM, ARC maintains two LRU lists of cached pages plus two "ghost" lists of recently evicted page IDs, and adaptively balances capacity between them (see the table below).
The key insight: ARC uses the ghost lists to learn what's working. A hit in B1 means a page recently evicted from T1 was needed again, a sign that recency is undervalued, so ARC grows T1's target size. A hit in B2 means a page evicted from T2 was needed again, so ARC shifts capacity toward T2 instead.
This self-tuning mechanism adapts to whether the workload favors recency or frequency.
| List | Contents | Purpose |
|---|---|---|
| T1 | Pages accessed exactly once recently | Recency-focused cache portion |
| T2 | Pages accessed multiple times recently | Frequency-focused cache portion |
| B1 | Ghost entries: recently evicted from T1 | Learn when to grow T1 |
| B2 | Ghost entries: recently evicted from T2 | Learn when to grow T2 |
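A complete ARC implementation is intricate, but the adaptation rule itself is compact. The hedged sketch below isolates just that rule: a target size p for T1 that grows on B1 ghost hits and shrinks on B2 ghost hits. The delta formulas follow the published ARC description; all of the surrounding list maintenance is omitted.

```cpp
// ARC's self-tuning rule in isolation (the REPLACE subroutine and list
// maintenance of full ARC are omitted). p is the target size of T1;
// c is the total cache capacity in pages.
#include <algorithm>
#include <cstddef>

struct ArcTuner {
    size_t c;      // cache capacity in pages
    size_t p = 0;  // target size for T1 (the recency portion)

    // Hit in ghost list B1: a page evicted from T1 was needed again,
    // so recency is undervalued; grow T1's target.
    void onB1GhostHit(size_t b1_size, size_t b2_size) {
        size_t delta = std::max<size_t>(1, b2_size / std::max<size_t>(1, b1_size));
        p = std::min(c, p + delta);
    }

    // Hit in ghost list B2: a page evicted from T2 was needed again,
    // so frequency is undervalued; shrink T1's target in favor of T2.
    void onB2GhostHit(size_t b1_size, size_t b2_size) {
        size_t delta = std::max<size_t>(1, b1_size / std::max<size_t>(1, b2_size));
        p = (delta > p) ? 0 : p - delta;
    }
};
```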
LIRS (Low Inter-reference Recency Set):
LIRS distinguishes between "hot" and "cold" pages based on Inter-Reference Recency (IRR)—the number of distinct pages accessed between consecutive accesses to the same page.
LIRS dedicates most of the buffer to hot pages and uses a small portion for cold pages, evicting cold pages first.
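To make IRR concrete, here is a small illustrative helper (not LIRS itself) that computes the IRR of every access in a trace, i.e., how many distinct other pages were touched since that page's previous access:

```cpp
// Compute Inter-Reference Recency (IRR) for each access in a trace: the
// number of distinct pages accessed between consecutive accesses to the
// same page. O(n^2) for clarity; illustrative helper, not LIRS itself.
#include <cstddef>
#include <iostream>
#include <set>
#include <unordered_map>
#include <vector>

std::vector<int> computeIRR(const std::vector<int>& trace) {
    std::unordered_map<int, size_t> last_seen;  // page -> previous access index
    std::vector<int> irr(trace.size(), -1);     // -1 = first access, IRR undefined
    for (size_t i = 0; i < trace.size(); i++) {
        auto it = last_seen.find(trace[i]);
        if (it != last_seen.end()) {
            std::set<int> distinct(
                trace.begin() + static_cast<std::ptrdiff_t>(it->second) + 1,
                trace.begin() + static_cast<std::ptrdiff_t>(i));
            irr[i] = static_cast<int>(distinct.size());
        }
        last_seen[trace[i]] = i;
    }
    return irr;
}

int main() {
    // Page 1 recurs with low IRR (hot); page 3 recurs with high IRR (cold)
    std::vector<int> trace = {1, 2, 1, 3, 4, 5, 1, 3};
    for (int v : computeIRR(trace)) std::cout << v << " ";
    std::cout << "\n";  // prints: -1 -1 1 -1 -1 -1 3 3
}
```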
PostgreSQL uses a Clock variant with some scan resistance. MySQL InnoDB uses a modified LRU with "young" and "old" sublists. ZFS uses ARC as its primary cache algorithm. The choice depends on workload characteristics and implementation complexity tolerance.
Database buffer pools have unique requirements beyond general-purpose caching. Page replacement algorithms must account for these database-specific concerns:

- Pinning: pages currently in use by a query have a nonzero pin count and must never be evicted, regardless of replacement priority.
- Write-ahead logging: a dirty page cannot be written back until the WAL records up to that page's LSN are durable on disk.
- Predictable access patterns: the query layer knows when it is running a sequential scan or an index traversal, and the buffer manager can exploit these hints for scan resistance.
MySQL InnoDB's LRU Implementation:
InnoDB uses a modified LRU with two key adaptations:
Midpoint Insertion: New pages enter the LRU list at the 3/8 point (not the head). This gives pages a chance to prove their value before occupying the "young" portion.
Young/Old Sublist: The list is divided into a "young" (head) and "old" (tail) portion. Pages must be accessed again after a configurable delay (innodb_old_blocks_time) to move from old to young.
This prevents full table scans from polluting the cache. Scanned pages enter the old sublist and get evicted before they can displace hot pages in the young sublist.
```cpp
// Conceptual InnoDB LRU list structure
#include <chrono>
#include <cstdint>
#include <list>

using FrameId = int32_t;
using Clock = std::chrono::steady_clock;

class InnoDBLRU {
    // Single list, but logically divided
    std::list<FrameId> lru_list;

    // Iterator to the midpoint (3/8 from tail)
    std::list<FrameId>::iterator old_sublist_start;

    // Configuration
    static constexpr double OLD_RATIO = 0.375;         // 3/8
    std::chrono::milliseconds old_blocks_time{1000};   // Time before page can become "young"

    void insertNew(FrameId frame) {
        // Insert at midpoint, not head
        lru_list.insert(old_sublist_start, frame);
        recordInsertTime(frame, Clock::now());
    }

    void recordAccess(FrameId frame) {
        if (isInOldSublist(frame)) {
            // Check if enough time has passed
            if (Clock::now() - getInsertTime(frame) > old_blocks_time) {
                // Promote to young sublist (move to head)
                moveToHead(frame);
            }
            // Otherwise, don't promote - prevents scan pollution
        } else {
            // Already in young sublist - move to head
            moveToHead(frame);
        }
    }

    // Bookkeeping helpers, omitted in this conceptual sketch
    void recordInsertTime(FrameId frame, Clock::time_point t);
    Clock::time_point getInsertTime(FrameId frame) const;
    bool isInOldSublist(FrameId frame) const;
    void moveToHead(FrameId frame);
};
```

When the replacement algorithm selects a victim, the buffer manager must handle eviction carefully, especially for dirty pages. The eviction process involves multiple steps with important correctness and performance implications.
```cpp
// Buffer pool eviction process
void BufferPool::evictPage(FrameId frame_id) {
    FrameDescriptor& desc = descriptors[frame_id];

    // Step 1: Verify the frame can be evicted
    assert(desc.pin_count == 0);  // Must be unpinned

    // Step 2: Handle dirty page
    if (desc.is_dirty) {
        // Step 2a: Ensure WAL records are flushed first
        // This is the Write-Ahead Logging constraint
        log_manager->flushToLSN(desc.page_lsn);

        // Step 2b: Write the dirty page to disk
        disk_manager->writePage(desc.page_id, &frames[frame_id]);

        // Step 2c: Clear dirty bit
        desc.is_dirty = false;
    }

    // Step 3: Remove from page table
    page_table.erase(desc.page_id);

    // Step 4: Clear frame metadata
    desc.page_id = INVALID_PAGE_ID;
    desc.ref_bit = false;

    // Step 5: Add frame to free list (optional, depends on caller)
    // free_list.push_back(frame_id);
}

// Finding a victim that's ready for eviction
FrameId BufferPool::findEvictableVictim() {
    while (true) {
        FrameId candidate = replacer->selectVictim();
        if (candidate == INVALID_FRAME_ID) {
            // No unpinned frames available
            throw BufferPoolFullException("All frames are pinned");
        }

        FrameDescriptor& desc = descriptors[candidate];
        if (desc.pin_count == 0) {
            return candidate;  // Found an evictable victim
        }
        // Victim was pinned after selection - try again
        // (Race condition between selection and eviction)
    }
}
```

The cost of evicting dirty pages:
Evicting a dirty page requires synchronous disk I/O (the page must be written before the frame can be reused). This can significantly increase latency for the requesting thread.
Mitigation strategies:
Background flushing: A background thread ("page cleaner" in MySQL, "bgwriter" in PostgreSQL) periodically writes dirty pages to disk, keeping the ratio of clean pages high.
Prefer clean victims: When multiple candidates have similar replacement priority, prefer evicting clean pages to avoid disk I/O.
Victim selection look-ahead: Instead of selecting one victim, identify several candidates and prefer the clean ones, as sketched after this list.
Checkpoint coordination: During checkpoints, many pages become clean simultaneously, increasing the clean victim pool.
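As a hedged sketch of the look-ahead strategy, the function below asks the replacer for several candidates and takes the first clean one, falling back to the best dirty candidate. It reuses the BufferPool members from the eviction listing above; selectVictimCandidates is a hypothetical API, since real replacer interfaces vary.

```cpp
// Hedged sketch: prefer a clean victim among the top-k candidates.
// selectVictimCandidates() is a hypothetical replacer method returning
// up to k eviction candidates in priority order.
FrameId BufferPool::findVictimPreferClean(size_t look_ahead) {
    std::vector<FrameId> candidates =
        replacer->selectVictimCandidates(look_ahead);
    if (candidates.empty()) {
        throw BufferPoolFullException("All frames are pinned");
    }
    // First pass: a clean candidate can be reused with no disk write
    for (FrameId frame : candidates) {
        if (!descriptors[frame].is_dirty) {
            return frame;
        }
    }
    // All candidates are dirty: take the highest-priority one; the caller
    // must flush it (WAL first, then the page) before reusing the frame.
    return candidates.front();
}
```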
If the buffer pool runs out of clean frames, every new page fetch requires waiting for a dirty page to be written to disk. This creates latency spikes visible to applications. Monitoring the ratio of dirty pages and background flush rates is essential for stable performance.
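A minimal sketch of such a background flusher follows, again reusing the BufferPool members from the eviction listing. The wake interval, batch size, dirty-ratio threshold, and the dirtyRatio helper are all illustrative assumptions, and a production cleaner would also latch each frame against concurrent modification.

```cpp
// Hedged sketch of a background page cleaner ("bgwriter"-style loop).
// Wakes periodically and flushes a batch of dirty, unpinned pages,
// honoring the WAL-before-data constraint. Thresholds are illustrative.
void BufferPool::backgroundFlusherLoop(std::atomic<bool>& running) {
    const double dirty_ratio_target = 0.10;  // assumed target, not a real default
    const size_t batch_size = 64;            // assumed batch size

    while (running.load()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        if (dirtyRatio() < dirty_ratio_target) continue;  // dirtyRatio: hypothetical helper

        size_t flushed = 0;
        for (size_t f = 0; f < descriptors.size() && flushed < batch_size; f++) {
            FrameDescriptor& desc = descriptors[f];
            if (desc.is_dirty && desc.pin_count == 0) {
                log_manager->flushToLSN(desc.page_lsn);             // WAL first
                disk_manager->writePage(desc.page_id, &frames[f]);  // then the page
                desc.is_dirty = false;
                flushed++;
            }
        }
    }
}
```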
Page replacement is a critical component of buffer pool management, directly impacting cache hit rates and query performance. The choice of algorithm depends on workload characteristics, implementation complexity tolerance, and system requirements.
What's next:
We've seen that dirty pages complicate eviction. The next page explores dirty page management in depth: how databases track modifications, when to flush pages to disk, and how to balance durability requirements against performance.
You now understand page replacement algorithms, from the theoretical optimum to practical implementations used in production databases. This knowledge helps you tune buffer pool behavior and understand performance characteristics.