Of all the page replacement algorithms developed over decades of computer science research, two have emerged as the dominant choices in production database systems: Least Recently Used (LRU) and Clock.
LRU provides optimal recency-based replacement when overhead is acceptable. Clock approximates LRU with dramatically lower cost. Together, they form the foundation of page replacement in systems ranging from tiny embedded databases to planet-scale distributed systems. Understanding their implementation details, variations, and trade-offs is essential for anyone working with database internals.
By the end of this page, you will master the implementation details of LRU and Clock algorithms, understand the production variations used in real databases like PostgreSQL and MySQL, and know how to tune these algorithms for different workload characteristics.
Implementing LRU efficiently requires careful data structure design. The core challenge is supporting two operations efficiently:

- Access tracking: on every page hit, move the frame to the most-recently-used (MRU) position
- Victim selection: on eviction, find and remove the least-recently-used (LRU) frame
Several implementation strategies achieve these requirements:
Strategy 1: Doubly-Linked List + Hash Map
The classic implementation uses two data structures in tandem:

- A doubly-linked list ordering frames from most to least recently used
- A hash map from frame ID to list node, giving O(1) lookup of any frame's position in the list
This achieves O(1) for both operations: access moves a node (O(1) with direct pointer), and victim selection returns the tail (O(1)).
```cpp
// Classic LRU with doubly-linked list + hash map
class LRUList {
    struct Node {
        FrameId frame_id;
        Node* prev;
        Node* next;
    };

    Node* head;   // Most recently used
    Node* tail;   // Least recently used
    unordered_map<FrameId, Node*> node_map;
    size_t size;
    mutex latch;

public:
    LRUList() : head(nullptr), tail(nullptr), size(0) {}

    // Move or add frame to MRU position (head)
    void touch(FrameId frame_id) {
        lock_guard<mutex> guard(latch);
        auto it = node_map.find(frame_id);
        if (it != node_map.end()) {
            // Already in list - move to head
            Node* node = it->second;
            removeFromList(node);
            addToHead(node);
        } else {
            // New frame - create node and add to head
            Node* node = new Node{frame_id, nullptr, nullptr};
            node_map[frame_id] = node;
            addToHead(node);
            size++;
        }
    }

    // Get LRU frame (don't remove - caller decides)
    FrameId getLRU() {
        lock_guard<mutex> guard(latch);
        return (tail != nullptr) ? tail->frame_id : INVALID_FRAME_ID;
    }

    // Remove a specific frame (when pinned or evicted)
    void remove(FrameId frame_id) {
        lock_guard<mutex> guard(latch);
        auto it = node_map.find(frame_id);
        if (it != node_map.end()) {
            Node* node = it->second;
            removeFromList(node);
            node_map.erase(it);
            delete node;
            size--;
        }
    }

private:
    void addToHead(Node* node) {
        node->prev = nullptr;
        node->next = head;
        if (head) head->prev = node;
        head = node;
        if (!tail) tail = node;
    }

    void removeFromList(Node* node) {
        if (node->prev) node->prev->next = node->next;
        else head = node->next;
        if (node->next) node->next->prev = node->prev;
        else tail = node->prev;
    }
};
```

Strategy 2: Timestamp-Based LRU
Instead of maintaining explicit ordering, each frame stores a timestamp of its last access. Victim selection scans for the frame with the oldest timestamp.
This trades faster access tracking for slower victim selection. It's useful when accesses vastly outnumber evictions.
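A minimal sketch of this approach, assuming the FrameId and INVALID_FRAME_ID conventions used elsewhere on this page (a logical tick stands in for wall-clock time):

```cpp
// Timestamp-based LRU sketch (illustrative; assumes <atomic>, <vector>,
// and the FrameId / INVALID_FRAME_ID definitions used on this page)
class TimestampLRU {
    vector<atomic<uint64_t>> last_access;  // 0 = not in the pool
    atomic<uint64_t> tick{1};              // Logical clock

public:
    explicit TimestampLRU(size_t capacity) : last_access(capacity) {}

    // O(1): one atomic store, no latch, no list manipulation
    void touch(FrameId frame_id) {
        last_access[frame_id].store(tick.fetch_add(1));
    }

    // O(n): scan all frames for the oldest timestamp
    FrameId getLRU() const {
        FrameId victim = INVALID_FRAME_ID;
        uint64_t oldest = UINT64_MAX;
        for (size_t f = 0; f < last_access.size(); f++) {
            uint64_t t = last_access[f].load();
            if (t != 0 && t < oldest) {
                oldest = t;
                victim = static_cast<FrameId>(f);
            }
        }
        return victim;
    }
};
```

Using a monotonically increasing tick rather than wall-clock time avoids clock syscalls and guarantees that no two accesses tie.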
| Strategy | Access Cost | Victim Cost | Memory Overhead | Contention |
|---|---|---|---|---|
| List + HashMap | O(1) | O(1) | ~24 bytes/frame | High (list manipulation) |
| Timestamp | O(1) | O(n) | 8 bytes/frame | Low (atomic timestamp) |
| Approximate (sampling) | O(1) | O(k) samples | 8 bytes/frame | Very low |
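The sampling row deserves a note: instead of scanning every frame, a sampled variant examines only k randomly chosen frames and evicts the oldest among them. Redis's approximated LRU works this way (its maxmemory-samples setting defaults to 5). A minimal sketch, reusing the hypothetical timestamp array from the TimestampLRU sketch above:

```cpp
// Sampled LRU victim selection: O(k) per eviction instead of O(n).
// Illustrative sketch; requires <random> and the assumptions above.
FrameId sampledVictim(const vector<atomic<uint64_t>>& last_access,
                      size_t sample_size, mt19937& rng) {
    uniform_int_distribution<size_t> pick(0, last_access.size() - 1);
    FrameId victim = INVALID_FRAME_ID;
    uint64_t oldest = UINT64_MAX;
    for (size_t i = 0; i < sample_size; i++) {
        size_t f = pick(rng);
        uint64_t t = last_access[f].load();
        if (t != 0 && t < oldest) {  // 0 means "not evictable"
            oldest = t;
            victim = static_cast<FrameId>(f);
        }
    }
    return victim;  // INVALID_FRAME_ID if every sample was unevictable
}
```

The trade is a small hit-rate loss for constant, predictable eviction cost with essentially no shared state beyond the timestamps.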
Pure LRU becomes a significant bottleneck in multi-core systems. Every page access requires updating the LRU list, and this update must be synchronized across threads. With dozens or hundreds of cores, the LRU list's mutex becomes a severe contention point.
The concurrency problem:

- Every page hit must acquire the list latch, serializing all accesses through a single lock
- The list head is a hot spot: every touch writes the same few cache lines, causing cache-line ping-ponging between cores
- Even read-only workloads, which should scale linearly, are bottlenecked by these writes to replacement metadata
Solutions to LRU concurrency:

- Partitioning: split the LRU into many independent lists, each with its own latch (see the sketch below)
- Approximation: replace exact ordering with timestamps or sampling, as in the strategies above
- Batching: record accesses in per-thread buffers and apply them to the list lazily
```cpp
// Partitioned LRU to reduce contention
class PartitionedLRU {
    static const size_t NUM_PARTITIONS = 64;  // Often matches core count
    LRUList partitions[NUM_PARTITIONS];

    size_t getPartition(FrameId frame_id) {
        // Simple hash-based partitioning
        return frame_id % NUM_PARTITIONS;
    }

public:
    void touch(FrameId frame_id) {
        size_t p = getPartition(frame_id);
        partitions[p].touch(frame_id);
        // Only contends with accesses to same partition
    }

    FrameId getVictim() {
        // Try each partition until we find a victim
        // More sophisticated: select partition with oldest LRU
        for (size_t i = 0; i < NUM_PARTITIONS; i++) {
            FrameId victim = partitions[i].getLRU();
            if (victim != INVALID_FRAME_ID) {
                return victim;
            }
        }
        return INVALID_FRAME_ID;
    }

    void remove(FrameId frame_id) {
        size_t p = getPartition(frame_id);
        partitions[p].remove(frame_id);
    }
};
```

The overhead of maintaining exact LRU order is rarely justified. Modern databases almost universally use Clock or LRU approximations. The hit rate difference is typically 1-5%, but the throughput difference under concurrency can be 10x or more.
The Clock algorithm (also called Second-Chance or Not Recently Used) is the dominant replacement algorithm in production databases due to its excellent performance-to-overhead ratio. Let's examine it in detail.
The fundamental insight:
Clock replaces the expensive "move to head" operation with a cheap "set bit" operation. Instead of tracking exact recency order, it only distinguishes between "recently used" and "not recently used." This binary distinction, surprisingly, captures most of LRU's benefit.
How Clock maintains state:

- One reference bit per frame, set on every access
- A single "clock hand" index that sweeps circularly over the frames
- During a sweep, frames with the bit set have it cleared; frames with the bit already clear are eviction candidates
```cpp
// Detailed Clock algorithm implementation
class ClockReplacer {
private:
    struct FrameState {
        bool in_pool;   // Is this frame in the replacement pool?
        bool ref_bit;   // Has this frame been accessed recently?
    };

    vector<FrameState> frames;
    size_t capacity;
    atomic<size_t> clock_hand;  // Atomic for better concurrency
    size_t num_in_pool;         // Count of replaceable frames
    mutex latch;                // Protects in_pool and num_in_pool

public:
    ClockReplacer(size_t capacity)
        : frames(capacity), capacity(capacity),
          clock_hand(0), num_in_pool(0) {
        for (auto& f : frames) {
            f.in_pool = false;
            f.ref_bit = false;
        }
    }

    // Called when a frame is accessed
    // Note: This can be lockless - just a single bit store
    // (in production, an atomic/relaxed store)
    void recordAccess(FrameId frame_id) {
        frames[frame_id].ref_bit = true;  // No latch required
    }

    // Find a victim for eviction
    FrameId selectVictim() {
        lock_guard<mutex> guard(latch);
        if (num_in_pool == 0) {
            return INVALID_FRAME_ID;
        }

        // Scan until we find a victim
        // Worst case: 2 full rotations (first clears all ref bits)
        size_t scanned = 0;
        size_t max_scans = 2 * capacity;

        while (scanned < max_scans) {
            size_t hand = clock_hand.load();
            FrameState& frame = frames[hand];

            // Advance clock hand first (for next iteration)
            clock_hand.store((hand + 1) % capacity);

            if (!frame.in_pool) {
                // Frame is pinned - skip it
                scanned++;
                continue;
            }

            if (frame.ref_bit) {
                // Recently used - give it a second chance
                frame.ref_bit = false;
                scanned++;
                continue;
            }

            // Found victim: in_pool == true && ref_bit == false
            frame.in_pool = false;
            num_in_pool--;
            return hand;
        }

        // Shouldn't happen if num_in_pool > 0
        return INVALID_FRAME_ID;
    }

    // Add frame to replacement pool (when unpinned)
    void unpin(FrameId frame_id) {
        lock_guard<mutex> guard(latch);
        if (!frames[frame_id].in_pool) {
            frames[frame_id].in_pool = true;
            num_in_pool++;
            // Note: ref_bit is NOT set here
            // It will be set on next access
        }
    }

    // Remove from replacement pool (when pinned)
    void pin(FrameId frame_id) {
        lock_guard<mutex> guard(latch);
        if (frames[frame_id].in_pool) {
            frames[frame_id].in_pool = false;
            num_in_pool--;
        }
    }
};
```

Clock's efficiency advantages:

- Access tracking is a single bit store: no latch, no pointer manipulation, no hot list head
- Victim selection is a bounded scan (at most two full rotations) rather than a list traversal
- Metadata is roughly one bit per frame plus the clock hand, versus ~24 bytes per frame for list-based LRU
The name 'Second Chance' comes from the algorithm's behavior: a frame with ref_bit=1 isn't evicted immediately. Instead, its bit is cleared and the clock moves on. The frame gets a 'second chance' to be accessed before the clock returns. Only frames not accessed during a full rotation are evicted.
Production databases often use enhanced variants of the basic Clock algorithm to address specific workload characteristics:
Enhanced Clock (Two-handed Clock):
Uses two clock hands instead of one:

- The leading hand sweeps ahead, clearing reference bits
- The trailing hand follows at a fixed distance, evicting frames whose bits are still clear when it arrives
This provides more time for pages to prove their worth, reducing the chance of evicting truly hot pages.
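A minimal sketch of the two-handed variant (illustrative only; pinning and latching are omitted, and the FrameId conventions follow the rest of this page):

```cpp
// Two-handed Clock sketch: the gap between the hands determines
// how long a cleared frame has to be re-accessed before eviction.
class TwoHandedClock {
    vector<bool> ref_bit;
    size_t front_hand = 0;  // Leading hand: clears reference bits
    size_t back_hand;       // Trailing hand: evicts still-clear frames

public:
    TwoHandedClock(size_t capacity, size_t hand_gap)
        : ref_bit(capacity, false),
          back_hand((capacity - hand_gap) % capacity) {}  // Trails by hand_gap

    void recordAccess(FrameId frame_id) { ref_bit[frame_id] = true; }

    FrameId selectVictim() {
        while (true) {
            // Leading hand clears the bit; the frame now has hand_gap
            // steps to be re-accessed before the trailing hand arrives
            ref_bit[front_hand] = false;
            front_hand = (front_hand + 1) % ref_bit.size();

            // Trailing hand evicts frames not touched since being cleared
            size_t candidate = back_hand;
            back_hand = (back_hand + 1) % ref_bit.size();
            if (!ref_bit[candidate]) {
                return candidate;  // Not re-accessed within the gap
            }
        }
    }
};
```

The hand gap is the tuning knob: a larger gap behaves more like LRU (pages get longer to prove themselves), while a smaller gap approaches FIFO.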
Clock with Priority Classes:
Different pages get different treatment, for example:

- Index pages (especially upper B-tree levels) may be given extra sweeps before becoming eviction candidates
- Clean pages may be preferred over dirty pages, avoiding a write before the read
- System catalog pages may be kept resident outright
This is often implemented using multiple reference bits or a reference counter instead of a single bit.
```cpp
// Enhanced Clock with reference counter
class EnhancedClockReplacer {
    struct FrameState {
        bool in_pool;
        uint8_t ref_count;  // 0-3 instead of single bit
    };

    static const uint8_t MAX_REF_COUNT = 3;
    vector<FrameState> frames;
    size_t clock_hand;

public:
    // Increment reference count (capped at MAX)
    void recordAccess(FrameId frame_id) {
        auto& ref = frames[frame_id].ref_count;
        if (ref < MAX_REF_COUNT) {
            ref++;  // Atomic increment if needed
        }
    }

    FrameId selectVictim() {
        while (true) {
            FrameState& frame = frames[clock_hand];
            if (frame.in_pool) {
                if (frame.ref_count == 0) {
                    // Found victim
                    frame.in_pool = false;
                    return clock_hand;
                } else {
                    // Decrement and continue
                    frame.ref_count--;
                }
            }
            clock_hand = (clock_hand + 1) % frames.size();
        }
    }
};

// Clock with scan resistance (MySQL InnoDB-inspired)
class ScanResistantClock {
    struct FrameState {
        bool in_pool;
        bool ref_bit;
        bool is_young;          // In the "hot" section
        Timestamp last_access;
    };

    vector<FrameState> frames;
    Duration young_threshold = 1s;  // innodb_old_blocks_time

public:
    void recordAccess(FrameId frame_id, bool is_new_page) {
        auto& frame = frames[frame_id];
        frame.ref_bit = true;
        Timestamp now = getCurrentTime();

        if (is_new_page) {
            // New pages start as "old" - not immediately promoted
            frame.is_young = false;
            frame.last_access = now;
        } else if (!frame.is_young) {
            // Page in "old" section - check if it should be promoted
            if (now - frame.last_access > young_threshold) {
                // Accessed again after threshold - promote to young
                frame.is_young = true;
            }
            frame.last_access = now;
        }
        // Pages already young just update ref_bit
    }

    FrameId selectVictim() {
        // First try to evict from "old" pages
        // Only evict "young" pages if no old pages available
        // This protects hot data from sequential scans
        // Implementation: scan old pages first, then young
        // Or: maintain separate lists for young/old
        return INVALID_FRAME_ID;  // Placeholder - see comments above
    }
};
```

| Variant | Key Feature | Use Case |
|---|---|---|
| Basic Clock | Single ref bit | Simple systems, low contention |
| Two-handed Clock | Leading/trailing hands | Better approximation of LRU |
| Multi-bit Clock | Reference counter | Distinguishing hot from warm pages |
| Scan-resistant Clock | Young/old sections | Mixed OLTP/scan workloads |
PostgreSQL uses a Clock-based replacement algorithm with several interesting refinements. Understanding its implementation provides insight into production-grade buffer management.
PostgreSQL's buffer management architecture:
- A single buffer pool in shared memory, sized by shared_buffers
- Each buffer has a header (buf_hdr) containing its state (including a usage_count)
- A freelist of unused buffers, consulted before the clock sweep runs
- A clock sweep over the buffer headers that selects eviction victims
```c
// PostgreSQL-style Clock implementation (simplified)

#define BM_MAX_USAGE_COUNT 5

typedef struct BufferDesc {
    int32 buf_id;
    uint32 state;  // Includes usage_count, refcount, flags
    pg_atomic_uint32 state_spinlock;
    // usage_count is stored in state bits
    // Extract: (state >> BUF_USAGE_COUNT_SHIFT) & BUF_USAGE_COUNT_MASK
} BufferDesc;

// Called when buffer is accessed
void PinBuffer_common(BufferDesc *buf) {
    for (;;) {
        uint32 old_state = pg_atomic_read_u32(&buf->state);
        uint32 new_state = old_state;

        // Increment refcount
        new_state += BUF_REFCOUNT_ONE;

        // Increment usage_count if not at max
        if (BUF_STATE_GET_USAGECOUNT(new_state) < BM_MAX_USAGE_COUNT) {
            new_state += BUF_USAGECOUNT_ONE;
        }

        // Atomic compare-and-swap
        if (pg_atomic_compare_exchange_u32(&buf->state,
                                           &old_state, new_state)) {
            break;
        }
        // Retry if state changed
    }
}

// Find a buffer to evict (ClockSweep)
BufferDesc* StrategyGetBuffer(BufferStrategyControl *strategy) {
    // Try freelist first
    if (strategy->firstFreeBuffer >= 0) {
        return GetBufferFromFreeList(strategy);
    }

    // Clock sweep
    int tries = NBuffers * 2;  // Avoid infinite loop
    while (tries > 0) {
        BufferDesc *buf = GetBufferDescriptor(strategy->nextVictimBuffer);

        // Advance clock hand
        strategy->nextVictimBuffer =
            (strategy->nextVictimBuffer + 1) % NBuffers;

        // Try to victimize this buffer
        uint32 state = LockBufHdr(buf);

        // Skip if pinned
        if (BUF_STATE_GET_REFCOUNT(state) > 0) {
            UnlockBufHdr(buf, state);
            tries--;
            continue;
        }

        // Check usage count
        uint32 usage_count = BUF_STATE_GET_USAGECOUNT(state);
        if (usage_count > 0) {
            // Decrement and continue searching
            state -= BUF_USAGECOUNT_ONE;
            UnlockBufHdr(buf, state);
            tries--;
            continue;
        }

        // Found victim: refcount == 0 && usage_count == 0
        return buf;
    }

    // No victim found - all buffers pinned
    elog(ERROR, "no unpinned buffers available");
    return NULL;
}
```

PostgreSQL's usage_count (0-5) provides more granularity than a single bit. A page that's accessed repeatedly accumulates a higher usage_count, requiring multiple Clock passes to evict. This approximates LRU-K behavior without the complexity of tracking access timestamps.
MySQL InnoDB uses a modified LRU algorithm with explicit protections against scan pollution. Its implementation is more complex than PostgreSQL's Clock but provides better scan resistance.
InnoDB's LRU structure:
- A single LRU list split into two sublists: young (new) pages at the head, old pages at the tail
- By default, the old sublist occupies 3/8 (37%) of the list, controlled by innodb_old_blocks_pct
- Pages read from disk are inserted at the head of the old sublist, not the head of the whole list
- A page must be accessed again after innodb_old_blocks_time to move to young

The key insight:
Sequential scan pages enter the old sublist. If the scan continues without re-accessing pages, they remain in old and get evicted first. Hot OLTP data in the young sublist is protected.
```cpp
// InnoDB-style LRU with young/old sublists (conceptual)
class InnoDBLRU {
    struct BufferPage {
        PageId page_id;
        BufferPage* prev;
        BufferPage* next;
        Timestamp access_time;  // First-access time (set on insert)
        bool is_young;          // In young sublist
    };

    BufferPage* lru_head;   // Head = most recently used in young
    BufferPage* old_head;   // Start of old sublist
    BufferPage* lru_tail;   // Tail = least recently used in old

    Duration old_blocks_time;   // innodb_old_blocks_time (default 1000ms)
    double old_ratio = 0.375;   // 3/8 for old sublist

public:
    // Insert new page at old/young boundary
    void insertNewPage(BufferPage* page) {
        page->is_young = false;
        page->access_time = currentTime();
        // Insert just before old_head (at boundary)
        insertBefore(old_head, page);
    }

    // Handle page access
    void accessPage(BufferPage* page) {
        Timestamp now = currentTime();

        if (page->is_young) {
            // Already in young sublist - move to head if not there
            if (page != lru_head) {
                removeFromList(page);
                insertAtHead(page);
            }
            page->access_time = now;
        } else {
            // In old sublist - check if should promote
            Duration time_since_first = now - page->access_time;
            if (time_since_first > old_blocks_time) {
                // Accessed again after threshold - promote to young
                removeFromList(page);
                page->is_young = true;
                insertAtHead(page);
                page->access_time = now;
            }
            // If accessed within old_blocks_time, don't promote and keep
            // the original access_time so the threshold keeps measuring
            // from the first access. This prevents read-ahead and scans
            // from polluting young.
        }
    }

    // Get victim for eviction (from tail of old sublist)
    BufferPage* getVictim() {
        // Scan from tail looking for unpinned page
        BufferPage* candidate = lru_tail;
        while (candidate != nullptr) {
            if (!candidate->isPinned()) {
                return candidate;
            }
            candidate = candidate->prev;
            // Prefer victims from old sublist
            if (candidate == old_head) {
                // Reached young sublist - ideally don't evict from here
                // But may need to if old is all pinned
            }
        }
        return nullptr;  // All pages pinned
    }

    // Adjust sublist sizes after operations
    void rebalance() {
        // Ensure old sublist maintains target ratio
        size_t total = getTotalPageCount();
        size_t target_old = total * old_ratio;
        size_t current_old = getOldSublistCount();

        while (current_old < target_old && lru_head != old_head) {
            // Move pages from young to old
            moveLastYoungToOld();
            current_old++;
        }
    }
};
```

| Parameter | Default | Description |
|---|---|---|
| innodb_buffer_pool_size | 128MB | Total buffer pool memory |
| innodb_buffer_pool_instances | 8 (if > 1GB) | Number of buffer pool partitions |
| innodb_old_blocks_pct | 37 (3/8) | Percentage for old sublist |
| innodb_old_blocks_time | 1000 (ms) | Time before promoting to young |
| innodb_lru_scan_depth | 1024 | Pages scanned for flushing per second |
While LRU and Clock algorithms are largely self-tuning, understanding key tuning parameters helps optimize performance for specific workloads.
Workload-specific tuning:
For OLTP workloads:

- Size the buffer pool so the hot working set fits entirely in memory; hit ratios should stay near 99%
- Use multiple buffer pool instances (innodb_buffer_pool_instances) to reduce latch contention under high concurrency
- The default young/old split usually works well, since OLTP access patterns are naturally skewed

For analytical/scan workloads:

- Increase innodb_old_blocks_time so a scan's pages age out of the old sublist before being promoted
- Consider a larger old sublist (innodb_old_blocks_pct) so scan pages have room without touching hot data
- Accept lower hit ratios; scans are bounded by I/O bandwidth, not replacement policy

For mixed workloads:

- Start with the defaults (37% old, 1000ms promotion threshold), which are designed for this case; see the configuration sketch below
- Monitor eviction rates from the young versus old sublists and adjust the split if hot pages are being evicted
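As an illustrative starting point, the MySQL parameters from the table above might be set like this in my.cnf (the values here are examples to adapt, not recommendations):

```ini
# Illustrative my.cnf fragment for a mixed workload (example values only)
[mysqld]
innodb_buffer_pool_size      = 8G    # Size to the hot working set
innodb_buffer_pool_instances = 8     # Reduce latch contention
innodb_old_blocks_pct        = 37    # Default old-sublist share (3/8)
innodb_old_blocks_time       = 1000  # ms before an old page can turn young
```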
Before tuning, gather metrics: buffer pool hit ratio, pages read/written per second, and evictions from the young vs. old sublists. In MySQL, SHOW ENGINE INNODB STATUS reports buffer pool statistics, including young/old sublist activity. In PostgreSQL, monitor pg_statio_user_tables for per-table buffer hits and reads, and pg_stat_statements for per-query I/O.
LRU and Clock algorithms represent the practical consensus for page replacement in database systems. Their variations address real-world challenges like concurrency, scan pollution, and workload diversity.
Module complete:
This concludes our deep dive into buffer management. You now understand the buffer pool architecture, page replacement algorithms, dirty page handling, the buffer manager's role, and the specific implementations used in production databases. This knowledge is foundational for database performance tuning, storage engine development, and understanding how databases achieve high performance despite the inherent slowness of persistent storage.
Congratulations! You've mastered buffer management—one of the most critical components of database system architecture. You understand how databases bridge the memory-storage divide to deliver fast query performance while maintaining data durability.