Having established that the buffer cache is essential for bridging the memory-storage speed gap, we now face the central challenge of caching: memory is finite, but the data we might want to cache is unbounded. A server might have 128 GB of RAM available for caching, but it accesses petabytes of storage. How do we decide what to keep in that precious 128 GB?
This question—cache replacement policy—is one of the most studied problems in computer science. The ideal policy would keep exactly the blocks that will be accessed soonest, evicting blocks that won't be needed. Of course, predicting the future is impossible, so practical policies must infer future access patterns from past behavior.
The choice of replacement policy profoundly impacts system performance. A poor policy might achieve a 70% hit ratio where a better policy achieves 95%—and as we saw, that difference translates to 5x or more performance improvement. Furthermore, policies must be computationally cheap; we can't spend milliseconds deciding what to evict when the eviction itself takes microseconds.
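To make that arithmetic concrete: average access time is roughly hit_ratio × hit_time + (1 − hit_ratio) × miss_time. Assuming, purely for illustration, a 0.1 ms cached access and a 10 ms disk access, a 70% hit ratio gives 0.70 × 0.1 + 0.30 × 10 ≈ 3.1 ms per access, while a 95% hit ratio gives 0.95 × 0.1 + 0.05 × 10 ≈ 0.6 ms, about a 5x difference from the replacement policy alone.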
By the end of this page, you will understand: (1) The theoretical optimal replacement policy (Bélády's algorithm) and why it's unachievable, (2) LRU (Least Recently Used) and its dominance in caching, (3) LFU (Least Frequently Used) and when it outperforms LRU, (4) Advanced adaptive policies like ARC, (5) Practical implementation considerations including approximate algorithms, and (6) How modern operating systems actually implement cache replacement.
Before examining practical replacement policies, let's understand what "optimal" means. In 1966, László Bélády showed that the optimal cache replacement policy is to evict the block whose next access is furthest in the future. This algorithm, called OPT (or Bélády's algorithm), minimizes cache misses for any given cache size.
Why OPT is Optimal:
Intuition: If we must evict something, we lose the least by evicting what we won't need for the longest time. Evicting something needed immediately causes an immediate miss; evicting something needed far in the future delays the miss as long as possible.
Proof sketch (exchange argument): Take any policy P and look at the first decision where P evicts a block B while OPT evicts a block C whose next access comes later than B's. Changing that one decision to evict C instead can only delay or avoid a miss, never create an extra one. Repeating the exchange transforms P into OPT without ever increasing the miss count, so no policy can beat OPT.
Example:
Cache size: 3 blocks
Access sequence: A B C A B D A D B C A

Applying OPT:

| Access | Cache | Action | Reasoning |
|---|---|---|---|
| A | [A - -] | Miss, load A | Empty cache |
| B | [A B -] | Miss, load B | Empty slot |
| C | [A B C] | Miss, load C | Empty slot |
| A | [A B C] | Hit | A in cache |
| B | [A B C] | Hit | B in cache |
| D | [A B D] | Miss, evict C | A's next access is position 7, B's is position 9, C's is position 10 → evict C (furthest in future) |
| A | [A B D] | Hit | A in cache |
| D | [A B D] | Hit | D in cache |
| B | [A B D] | Hit | B in cache |
| C | [A B C] | Miss, evict D | D never accessed again |
| A | [A B C] | Hit | A in cache |

Total: 5 misses out of 11 accesses (54.5% hit ratio). This is the best ANY policy can achieve for this sequence.

Why OPT is Impractical:
OPT requires knowing the future access sequence—information we simply don't have. When a block is accessed, we don't know when (or if) it will be accessed again. OPT is still valuable, though: it defines the theoretical upper bound on hit ratio, it can be computed offline from a recorded trace to benchmark a real policy, and it guides the design of heuristics that try to approximate it.
The Gap to Close:
Research shows that good practical policies typically achieve 70-95% of OPT's hit ratio on real workloads. The remaining gap represents fundamental uncertainty about the future.
When analyzing a system's cache performance, you can record the access trace and compute what OPT would have achieved. If your real policy gets 80% hit ratio and OPT would have gotten 82%, you're doing very well. If your policy gets 60% and OPT would have gotten 95%, there's significant room for improvement.
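As an illustration of that kind of offline analysis, here is a minimal, self-contained sketch (not taken from any particular tool) that replays a recorded trace and counts OPT's hits by always evicting the cached block whose next use lies furthest away:

```c
#include <stdio.h>

#define CACHE_SIZE 3

/* Return the index in cache[] of the block whose next use is furthest away
 * (or which is never used again), scanning the trace from position pos. */
static int opt_victim(const char *cache, int n, const char *trace, int len, int pos)
{
    int victim = 0, best_dist = -1;

    for (int i = 0; i < n; i++) {
        int dist = len - pos;               /* sentinel: never used again */
        for (int j = pos; j < len; j++) {
            if (trace[j] == cache[i]) { dist = j - pos; break; }
        }
        if (dist > best_dist) { best_dist = dist; victim = i; }
    }
    return victim;
}

int main(void)
{
    const char trace[] = "ABCABDADBCA";     /* the example sequence above */
    int len = sizeof(trace) - 1;
    char cache[CACHE_SIZE];
    int used = 0, hits = 0;

    for (int pos = 0; pos < len; pos++) {
        int found = 0;
        for (int i = 0; i < used; i++)
            if (cache[i] == trace[pos]) { found = 1; break; }

        if (found) {
            hits++;                         /* already cached */
        } else if (used < CACHE_SIZE) {
            cache[used++] = trace[pos];     /* cold miss: fill an empty slot */
        } else {
            /* Evict the block whose next use is furthest in the future */
            cache[opt_victim(cache, used, trace, len, pos + 1)] = trace[pos];
        }
    }

    printf("OPT: %d hits / %d accesses (%.1f%%)\n",
           hits, len, 100.0 * hits / len);
    return 0;
}
```

Run on the example sequence above, this reports 6 hits out of 11 accesses (54.5%), matching the hand-worked table; a real trace is replayed the same way, just with block numbers instead of letters.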
Least Recently Used (LRU) is the most widely used cache replacement policy. The principle is simple: evict the block that has gone the longest time without being accessed. LRU exploits temporal locality—if something hasn't been used recently, it's probably not going to be used soon.
LRU Intuition:
LRU approximates OPT by using "time since last access" as a proxy for "time until next access." This works remarkably well because access patterns exhibit temporal locality. A block accessed 1 second ago is more likely to be accessed again than a block accessed 1 hour ago.
Formal Definition:
Maintain a total ordering of cached blocks by last-access time. On cache miss, evict the block with the oldest (earliest) last-access time.
LRU Implementation with a Doubly-Linked List:
```c
/*
 * LRU Cache Implementation using Doubly-Linked List + Hash Table
 *
 * The list maintains LRU ordering:
 *   HEAD <-> MRU <-> ... <-> LRU <-> TAIL
 *
 * Most Recently Used (MRU) is at the front
 * Least Recently Used (LRU) is at the back
 */

struct lru_cache {
    unsigned int capacity;          /* Maximum number of entries */
    unsigned int size;              /* Current number of entries */
    struct list_head lru_list;      /* Doubly-linked list head */
    struct hlist_head *hash_table;  /* Hash table for O(1) lookup */
    spinlock_t lock;                /* Protects the cache */
};

struct cache_entry {
    unsigned long key;              /* Block identifier */
    void *data;                     /* Cached block data */
    struct list_head lru_node;      /* Position in LRU list */
    struct hlist_node hash_node;    /* Position in hash chain */
};

/*
 * Look up a block in the cache
 * If found, move to front of LRU list (most recently used)
 * Time complexity: O(1) average (hash lookup + list operations)
 */
void *lru_cache_get(struct lru_cache *cache, unsigned long key)
{
    struct cache_entry *entry;
    unsigned int hash = key % HASH_SIZE;

    spin_lock(&cache->lock);

    /* O(1) hash table lookup */
    hlist_for_each_entry(entry, &cache->hash_table[hash], hash_node) {
        if (entry->key == key) {
            /* Found! Move to front of LRU list (mark as MRU) */
            list_del(&entry->lru_node);
            list_add(&entry->lru_node, &cache->lru_list);
            spin_unlock(&cache->lock);
            return entry->data;
        }
    }

    spin_unlock(&cache->lock);
    return NULL;                    /* Cache miss */
}

/*
 * Insert a block into the cache
 * If cache is full, evict the least recently used entry
 * Time complexity: O(1)
 */
void lru_cache_put(struct lru_cache *cache, unsigned long key, void *data)
{
    struct cache_entry *entry, *victim;
    unsigned int hash = key % HASH_SIZE;

    spin_lock(&cache->lock);

    /* Check if already in cache (update case) */
    hlist_for_each_entry(entry, &cache->hash_table[hash], hash_node) {
        if (entry->key == key) {
            entry->data = data;
            list_del(&entry->lru_node);
            list_add(&entry->lru_node, &cache->lru_list);
            spin_unlock(&cache->lock);
            return;
        }
    }

    /* Need to insert new entry */
    if (cache->size >= cache->capacity) {
        /* Cache full: evict LRU entry (last in list) */
        victim = list_last_entry(&cache->lru_list, struct cache_entry, lru_node);

        /* Remove from hash table */
        hlist_del(&victim->hash_node);

        /* Remove from LRU list */
        list_del(&victim->lru_node);

        /* Free the evicted entry */
        kfree(victim->data);
        kfree(victim);
        cache->size--;
    }

    /* Allocate and insert new entry */
    entry = kmalloc(sizeof(*entry), GFP_KERNEL);
    entry->key = key;
    entry->data = data;

    /* Add to front of LRU list (most recently used) */
    list_add(&entry->lru_node, &cache->lru_list);

    /* Add to hash table */
    hlist_add_head(&entry->hash_node, &cache->hash_table[hash]);
    cache->size++;

    spin_unlock(&cache->lock);
}
```

LRU Complexity Analysis:
| Operation | Time Complexity | Space Complexity |
|---|---|---|
| Lookup (get) | O(1) average | — |
| Insert (put) | O(1) | — |
| Eviction | O(1) | — |
| Total space | — | O(n) for n entries |
Why LRU Works Well:

LRU earns its dominance for two reasons: most real workloads exhibit strong temporal locality, so "not used recently" is a genuinely good predictor of "not needed soon," and, as the implementation above shows, every operation runs in O(1) time with modest memory overhead.
When LRU Fails:
LRU has known weaknesses:
Scan resistance: A single sequential scan through many blocks evicts everything, even frequently-used blocks that don't happen to be accessed during the scan.
Recency vs. frequency: LRU ignores access frequency. A block accessed once 1 minute ago beats a block accessed 100 times over the last hour but last accessed 2 minutes ago.
Loop patterns: If a workload cycles through N+1 blocks with cache size N, every access misses (worst case).
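For example, with a 3-block cache and the cyclic sequence A B C D A B C D ..., LRU always evicts exactly the block that will be referenced next, so after the cold start every access misses; OPT on the same loop hits on roughly half of the accesses in steady state.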
Consider a database with 1000 frequently-accessed rows cached. An admin runs a full table scan touching 50,000 rows once each. Pure LRU would evict all 1000 hot rows to make room for rows that will never be accessed again. After the scan, performance tanks until the hot rows are re-cached. This 'cache pollution' problem motivates more sophisticated policies.
Least Frequently Used (LFU) takes a different approach: evict the block with the lowest access count. Where LRU asks "when was this last used?", LFU asks "how often is this used?"
LFU Intuition:
Blocks accessed many times are probably important and worth keeping. A block accessed 1000 times is more valuable than one accessed twice, regardless of recency. LFU captures this frequency information that LRU ignores.
LFU Implementation Challenges:
Unlike LRU, naive LFU is expensive:
Finding minimum frequency: Which block has the lowest count? Requires a priority queue or sorted structure.
Count accumulation: Old blocks accumulate high counts even if access stopped. A block accessed 1000 times years ago might never be evicted.
Ties: Many blocks may have the same frequency. Which to evict among ties?
Efficient LFU using Frequency Lists:
```c
/*
 * LFU Cache using O(1) operations
 *
 * Key insight: Maintain a list of frequency buckets.
 * Each bucket contains all entries with that access count.
 *
 * Structure:
 *   freq_list[1] -> [entries accessed 1 time]
 *   freq_list[2] -> [entries accessed 2 times]
 *   freq_list[3] -> [entries accessed 3 times]
 *   ...
 *
 * min_freq tracks the lowest non-empty frequency bucket
 */

struct lfu_entry {
    unsigned long key;
    void *data;
    unsigned int freq;              /* Access count */
    struct list_head freq_node;     /* In frequency bucket */
    struct hlist_node hash_node;    /* In hash table */
};

struct freq_bucket {
    unsigned int frequency;
    struct list_head entries;       /* LRU-ordered within same freq */
};

struct lfu_cache {
    unsigned int capacity;
    unsigned int size;
    unsigned int min_freq;          /* Lowest frequency with entries */
    struct hlist_head *hash_table;
    struct freq_bucket *freq_buckets;
    unsigned int max_freq;
    spinlock_t lock;
};

/*
 * Get operation: O(1)
 * 1. Hash lookup to find entry
 * 2. Increment frequency, move to higher bucket
 */
void *lfu_cache_get(struct lfu_cache *cache, unsigned long key)
{
    struct lfu_entry *entry;
    unsigned int hash = key % HASH_SIZE;
    unsigned int old_freq, new_freq;

    spin_lock(&cache->lock);

    entry = hash_lookup(cache->hash_table, hash, key);
    if (!entry) {
        spin_unlock(&cache->lock);
        return NULL;
    }

    /* Remove from current frequency bucket */
    old_freq = entry->freq;
    list_del(&entry->freq_node);

    /* If this was the only entry at min_freq, update min_freq */
    if (old_freq == cache->min_freq &&
        list_empty(&cache->freq_buckets[old_freq].entries)) {
        cache->min_freq++;
    }

    /* Move to higher frequency bucket */
    new_freq = old_freq + 1;
    entry->freq = new_freq;

    /* Ensure bucket exists (may need dynamic expansion) */
    ensure_bucket_exists(cache, new_freq);

    /* Add to new frequency bucket (at head = most recent within freq) */
    list_add(&entry->freq_node, &cache->freq_buckets[new_freq].entries);

    spin_unlock(&cache->lock);
    return entry->data;
}

/*
 * Put operation: O(1)
 * If full, evict from min_freq bucket (LRU within that bucket)
 */
void lfu_cache_put(struct lfu_cache *cache, unsigned long key, void *data)
{
    struct lfu_entry *entry, *victim;
    unsigned int hash = key % HASH_SIZE;

    spin_lock(&cache->lock);

    /* Check for existing entry */
    entry = hash_lookup(cache->hash_table, hash, key);
    if (entry) {
        entry->data = data;
        /* Promote frequency (same as get) */
        promote_entry(cache, entry);
        spin_unlock(&cache->lock);
        return;
    }

    /* Eviction if needed */
    if (cache->size >= cache->capacity) {
        /* Find victim: last entry in min_freq bucket */
        /* (LRU among least-frequently-used) */
        victim = list_last_entry(
            &cache->freq_buckets[cache->min_freq].entries,
            struct lfu_entry, freq_node);

        list_del(&victim->freq_node);
        hlist_del(&victim->hash_node);
        kfree(victim->data);
        kfree(victim);
        cache->size--;

        /* Update min_freq if bucket now empty */
        if (list_empty(&cache->freq_buckets[cache->min_freq].entries)) {
            /* Next put will set min_freq to 1 anyway */
        }
    }

    /* Insert new entry with frequency 1 */
    entry = kmalloc(sizeof(*entry), GFP_KERNEL);
    entry->key = key;
    entry->data = data;
    entry->freq = 1;

    list_add(&entry->freq_node, &cache->freq_buckets[1].entries);
    hlist_add_head(&entry->hash_node, &cache->hash_table[hash]);

    cache->min_freq = 1;            /* New entry always has freq 1 */
    cache->size++;

    spin_unlock(&cache->lock);
}
```

To prevent frequency counts from growing unboundedly and stagnating the cache, practical LFU implementations use 'aging': periodically dividing all frequencies by 2 (or decaying exponentially).
This allows the cache to adapt when access patterns shift, while still favoring frequently-accessed items.
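As a minimal sketch of what an aging pass might look like (the entry type and the choice of when to run it are illustrative, not taken from a specific implementation):

```c
#include <stddef.h>

/* Hypothetical entry: a key plus a decaying access counter. */
struct lfu_counter {
    unsigned long key;
    unsigned int freq;
};

/*
 * Aging pass: halve every frequency counter.
 * Run periodically (e.g., every N seconds or every M accesses) so that a
 * block accessed heavily long ago cannot hold its slot forever: after k
 * passes with no new accesses, its count shrinks by a factor of 2^k.
 */
void lfu_age_counters(struct lfu_counter *entries, size_t n)
{
    for (size_t i = 0; i < n; i++)
        entries[i].freq >>= 1;      /* exponential decay toward 0 */
}
```

In the bucket-based cache above, an aging pass would also have to move each entry into the bucket matching its halved count; some implementations avoid that cost by decaying a counter lazily, only when the entry is next touched.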
Adaptive Replacement Cache (ARC), developed by IBM researchers Megiddo and Modha, elegantly combines the benefits of LRU and LFU while avoiding their weaknesses. ARC adapts its behavior based on workload characteristics—automatically favoring recency or frequency depending on what works better.
The Core Insight:
ARC maintains two LRU lists plus two "ghost" lists:

- T1: blocks seen only once recently (captures recency)
- T2: blocks seen at least twice recently (captures frequency)
- B1: ghost entries for blocks recently evicted from T1
- B2: ghost entries for blocks recently evicted from T2
Ghost lists don't hold actual data, only metadata about recently evicted items. They enable ARC to learn from eviction mistakes.
How ARC Adapts:

A hit in ghost list B1 means a block evicted from T1 was needed again, so ARC concludes it evicted from the recency side too eagerly and increases T1's target size p. A hit in B2 means the frequency side was squeezed too hard, so p is decreased, giving T2 more room.
This feedback loop automatically tunes the cache to the workload.
```c
/*
 * ARC (Adaptive Replacement Cache) Overview
 *
 * Cache is divided between T1 (recency) and T2 (frequency)
 * Parameter p determines the balance point
 * p adapts based on ghost list hits
 */

struct arc_cache {
    unsigned int c;                 /* Total cache capacity */
    unsigned int p;                 /* Target size for T1 (0 <= p <= c) */

    /* The four lists */
    struct lru_list T1;             /* Recent (accessed once recently) */
    struct lru_list B1;             /* Ghost entries evicted from T1 */
    struct lru_list T2;             /* Frequent (accessed >= twice) */
    struct lru_list B2;             /* Ghost entries evicted from T2 */

    /* T1 + T2 <= c (total data cached) */
    /* B1 + B2 <= c (history we keep) */
};

void arc_access(struct arc_cache *arc, unsigned long key)
{
    /* Case 1: Hit in T1 or T2 */
    if (in_list(&arc->T1, key)) {
        move_to_mru(&arc->T2, key);     /* Promote to T2 */
        return;
    }
    if (in_list(&arc->T2, key)) {
        move_to_mru(&arc->T2, key);     /* Keep in T2, refresh position */
        return;
    }

    /* Case 2: Hit in ghost list B1 */
    if (in_list(&arc->B1, key)) {
        /*
         * We evicted from T1 too eagerly!
         * Adapt: increase p (give more space to T1)
         */
        unsigned int delta = MAX(1, size(&arc->B2) / size(&arc->B1));
        arc->p = MIN(arc->c, arc->p + delta);

        replace(arc, key, /* p_in_B2 = */ false);
        remove_from(&arc->B1, key);
        insert_mru(&arc->T2, key);      /* Now it's accessed twice */
        return;
    }

    /* Case 3: Hit in ghost list B2 */
    if (in_list(&arc->B2, key)) {
        /*
         * We evicted from T2 too eagerly!
         * Adapt: decrease p (give more space to T2)
         */
        unsigned int delta = MAX(1, size(&arc->B1) / size(&arc->B2));
        arc->p = MAX(0, (int)arc->p - delta);

        replace(arc, key, /* p_in_B2 = */ true);
        remove_from(&arc->B2, key);
        insert_mru(&arc->T2, key);
        return;
    }

    /* Case 4: Complete miss (not in any list) */
    if (size(&arc->T1) + size(&arc->B1) == arc->c) {
        /* B1 is full, need to evict from B1 */
        if (size(&arc->T1) < arc->c) {
            evict_lru(&arc->B1);        /* Make room in B1 */
            replace(arc, key, false);
        } else {
            /* T1 == c, evict from T1 directly (B1 empty) */
            evict_lru(&arc->T1);
        }
    } else if (size(&arc->T1) + size(&arc->T2) +
               size(&arc->B1) + size(&arc->B2) >= arc->c) {
        if (size(&arc->T1) + size(&arc->T2) +
            size(&arc->B1) + size(&arc->B2) == 2 * arc->c) {
            evict_lru(&arc->B2);
        }
        replace(arc, key, false);
    }

    insert_mru(&arc->T1, key);          /* New entry starts in T1 */
}

/*
 * Replace: evict one item from cache to make room
 */
void replace(struct arc_cache *arc, unsigned long key, bool p_in_B2)
{
    if (size(&arc->T1) > 0 &&
        (size(&arc->T1) > arc->p ||
         (p_in_B2 && size(&arc->T1) == arc->p))) {
        /* Evict from T1 */
        unsigned long victim = evict_lru(&arc->T1);
        insert_mru(&arc->B1, victim);   /* Move to ghost list */
    } else {
        /* Evict from T2 */
        unsigned long victim = evict_lru(&arc->T2);
        insert_mru(&arc->B2, victim);   /* Move to ghost list */
    }
}
```

ARC Properties:
Scan resistant: Sequential scans only fill T1; T2 (frequently used) remains protected
Self-tuning: The parameter p automatically adapts to workload characteristics
Bounded overhead: Ghost lists are bounded to cache size, storing only keys
Low overhead, strong results: like LRU, ARC does a constant amount of work per request, and the original paper reports it matching or outperforming LRU across a wide range of real traces
Performance Results:
In published benchmarks, ARC typically delivers noticeably higher hit ratios than LRU on the same traces, with the largest gains on workloads that mix scans or one-time accesses with a stable hot set.
ARC is famously used in the ZFS file system, where it's a cornerstone of ZFS's performance. ZFS extends ARC with additional features like prefetching integration and tunable ghost list sizes. When people praise ZFS's caching performance, they're largely praising ARC's adaptivity.
While true LRU with doubly-linked lists works perfectly, it has a practical problem: every access requires list manipulation. For caches with millions of entries accessed millions of times per second, these constant list updates become a performance bottleneck—especially due to lock contention on multi-core systems.
Approximate LRU algorithms sacrifice perfect LRU ordering for dramatically reduced overhead. They maintain "roughly LRU" behavior with orders of magnitude less bookkeeping.
The Clock Algorithm (Second-Chance):
The most famous approximate LRU algorithm uses a circular buffer and a "clock hand":
```c
/*
 * Clock Algorithm (Second-Chance) Implementation
 *
 * Each cache entry has a "referenced" bit.
 * A circular "clock hand" sweeps through entries.
 * To find an eviction victim:
 *   - If referenced bit is set: clear it, move hand forward (second chance)
 *   - If referenced bit is clear: evict this entry
 */

struct clock_cache {
    struct cache_entry *entries;
    unsigned int capacity;
    unsigned int size;
    unsigned int hand;          /* Current position of clock hand */
    spinlock_t lock;
};

struct cache_entry {
    unsigned long key;
    void *data;
    bool referenced;            /* Access bit */
    bool valid;                 /* Is this slot in use? */
};

/*
 * Access: O(1) - just set the referenced bit
 * Much cheaper than LRU list manipulation!
 */
void *clock_cache_get(struct clock_cache *cache, unsigned long key)
{
    struct cache_entry *entry;

    entry = hash_lookup(cache, key);    /* O(1) hash lookup */
    if (entry && entry->valid) {
        entry->referenced = true;       /* Mark as recently used */
        return entry->data;
    }

    return NULL;
}

/*
 * Find victim for eviction using clock sweep
 * Average case: O(1), worst case: O(n)
 */
unsigned int clock_find_victim(struct clock_cache *cache)
{
    unsigned int start = cache->hand;

    while (true) {
        struct cache_entry *entry = &cache->entries[cache->hand];

        if (!entry->valid) {
            /* Empty slot, use it directly */
            return cache->hand;
        }

        if (!entry->referenced) {
            /* Not recently used, evict this one */
            return cache->hand;
        }

        /* Give second chance: clear bit and move on */
        entry->referenced = false;
        cache->hand = (cache->hand + 1) % cache->capacity;

        /*
         * Note: In degenerate case where all entries are referenced,
         * clock makes a full sweep clearing all bits before evicting.
         * Still O(n) worst case, but amortized O(1) per eviction.
         */
    }
}

/*
 * Insert: Find slot and insert
 */
void clock_cache_put(struct clock_cache *cache, unsigned long key, void *data)
{
    struct cache_entry *entry;
    unsigned int slot;

    spin_lock(&cache->lock);

    /* Check for existing entry */
    entry = hash_lookup(cache, key);
    if (entry && entry->valid) {
        entry->data = data;
        entry->referenced = true;
        spin_unlock(&cache->lock);
        return;
    }

    /* Find slot for new entry */
    slot = clock_find_victim(cache);
    entry = &cache->entries[slot];

    if (entry->valid) {
        /* Evicting existing entry */
        hash_remove(cache, entry->key);
    } else {
        cache->size++;
    }

    /* Insert new entry */
    entry->key = key;
    entry->data = data;
    entry->referenced = true;
    entry->valid = true;
    hash_insert(cache, key, slot);

    /* Advance hand past this entry */
    cache->hand = (slot + 1) % cache->capacity;

    spin_unlock(&cache->lock);
}
```

Clock Algorithm Analysis:
| Aspect | True LRU | Clock (Second-Chance) |
|---|---|---|
| Memory overhead | 2 pointers per entry | 1 bit per entry |
| Access cost | O(1) list operations | O(1) bit set |
| Lock contention | High (modifies list) | Low (only bit update) |
| Eviction accuracy | Perfect LRU | Approximate LRU |
| Implementation | Doubly-linked list | Circular array |
Enhanced Clock Variations:
CLOCK-Pro: Uses three "clock hands" and tracks both recency and reuse distance. Provides near-ARC performance with clock-level overhead.
2Q (Two-Queue): Maintains two queues, like a simplified ARC. New entries go to a FIFO queue; if accessed again before eviction, they move to an LRU queue (see the sketch after this list).
LIRS (Low Inter-reference Recency Set): Tracks both recency and inter-reference recency (time between consecutive accesses to the same block). Highly effective but complex.
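Here is a minimal sketch of the simplified 2Q idea described above, using fixed-size arrays as queues; the sizes and names are illustrative, and the full published 2Q algorithm additionally keeps a ghost queue (A1out) of recently evicted keys:

```c
#include <stdbool.h>
#include <string.h>

#define A1_CAP 4    /* FIFO queue for blocks seen once (illustrative size) */
#define AM_CAP 12   /* LRU queue for blocks seen again (illustrative size) */

struct two_q {
    unsigned long a1[A1_CAP]; int a1_len;   /* index 0 = newest */
    unsigned long am[AM_CAP]; int am_len;   /* index 0 = most recently used */
};

/* Remove the key at position i from a queue, shifting the tail up. */
static void q_remove(unsigned long *q, int *len, int i)
{
    memmove(&q[i], &q[i + 1], (*len - i - 1) * sizeof(q[0]));
    (*len)--;
}

/* Push a key at the front, dropping the tail element if the queue is full. */
static void q_push_front(unsigned long *q, int *len, int cap, unsigned long key)
{
    if (*len == cap)
        (*len)--;                           /* evict FIFO/LRU tail */
    memmove(&q[1], &q[0], (*len) * sizeof(q[0]));
    q[0] = key;
    (*len)++;
}

/* Returns true on a cache hit. */
bool two_q_access(struct two_q *c, unsigned long key)
{
    for (int i = 0; i < c->am_len; i++)
        if (c->am[i] == key) {              /* hit in Am: refresh LRU position */
            q_remove(c->am, &c->am_len, i);
            q_push_front(c->am, &c->am_len, AM_CAP, key);
            return true;
        }

    for (int i = 0; i < c->a1_len; i++)
        if (c->a1[i] == key) {              /* second access: promote to Am */
            q_remove(c->a1, &c->a1_len, i);
            q_push_front(c->am, &c->am_len, AM_CAP, key);
            return true;
        }

    /* Miss: new blocks enter the FIFO queue only, so a one-pass scan can
     * only churn A1 and never displaces the reused blocks held in Am. */
    q_push_front(c->a1, &c->a1_len, A1_CAP, key);
    return false;
}
```

Initialize with `struct two_q cache = {0};` and call `two_q_access()` once per block reference; because one-time accesses never reach Am, a long sequential scan only cycles through the small A1 queue.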
Linux uses a sophisticated variant called 'Two-List LRU' or 'Split LRU'. Pages are divided into active and inactive lists. New pages start on the inactive list; if accessed again, they're promoted to active. This approximates the 2Q/ARC insight that 'accessed twice' is a strong signal of future value.
Real-world caches often serve diverse access patterns simultaneously. A database cache might hold frequently-accessed index pages, randomly-accessed data pages, and rarely-accessed archival data. Treating all blocks identically ignores crucial information about their access patterns and importance.
Multi-Queue (MQ) Caching:
The Multi-Queue algorithm maintains multiple LRU queues representing different "temperature" levels (how hot/frequently-accessed an item is):
```c
/*
 * Multi-Queue (MQ) Cache
 *
 * Multiple queues Q0, Q1, ..., Qn represent access frequency tiers.
 *   - Q0: Items accessed 1 time (cold)
 *   - Q1: Items accessed 2-3 times
 *   - Q2: Items accessed 4-7 times
 *   - Qn: Items accessed >= 2^n times (hottest)
 *
 * Each queue is LRU-ordered.
 * Eviction starts from the coldest non-empty queue.
 */

#define MQ_NUM_QUEUES 8             /* 8 temperature levels */

struct mq_cache {
    unsigned int capacity;
    unsigned int size;

    /* Array of LRU queues, indexed by access tier */
    struct lru_list queues[MQ_NUM_QUEUES];

    /* Ghost queue for evicted items */
    struct lru_list ghost_queue;

    /* Hash table for O(1) lookup */
    struct hlist_head *hash_table;

    /* Configuration */
    unsigned int lifetime;          /* How long before items fade */
    unsigned long current_time;
};

struct mq_entry {
    unsigned long key;
    void *data;
    unsigned int access_count;      /* Total accesses */
    unsigned int current_queue;     /* Which queue this is in */
    unsigned long last_access;      /* For temporal fading */
    unsigned long expire_time;      /* When to demote to lower queue */
    struct list_head queue_node;
    struct hlist_node hash_node;
};

/*
 * Compute queue level from access count
 */
unsigned int compute_queue_level(unsigned int access_count)
{
    /* Queue i holds items with 2^(i-1) < accesses <= 2^i */
    if (access_count == 0)
        return 0;

    unsigned int level = 0;
    unsigned int threshold = 1;

    while (access_count > threshold && level < MQ_NUM_QUEUES - 1) {
        threshold *= 2;
        level++;
    }

    return level;
}

/*
 * Access an item
 */
void mq_cache_access(struct mq_cache *cache, unsigned long key)
{
    struct mq_entry *entry;
    unsigned int old_queue, new_queue;

    entry = hash_lookup(cache->hash_table, key);
    if (!entry) {
        /* Cache miss - handle separately */
        return;
    }

    /* Increment access count */
    entry->access_count++;
    entry->last_access = cache->current_time;

    /* Compute new queue based on access count */
    old_queue = entry->current_queue;
    new_queue = compute_queue_level(entry->access_count);

    if (new_queue != old_queue) {
        /* Promote to higher queue */
        list_del(&entry->queue_node);
        list_add(&entry->queue_node, &cache->queues[new_queue].head);
        entry->current_queue = new_queue;
    } else {
        /* Move to front of same queue (LRU within queue) */
        list_del(&entry->queue_node);
        list_add(&entry->queue_node, &cache->queues[old_queue].head);
    }

    /* Reset expiration timer */
    entry->expire_time = cache->current_time + cache->lifetime;
}

/*
 * Find eviction victim: LRU from coldest non-empty queue
 */
struct mq_entry *mq_find_victim(struct mq_cache *cache)
{
    for (int q = 0; q < MQ_NUM_QUEUES; q++) {
        if (!list_empty(&cache->queues[q].head)) {
            /* Return LRU from this queue (tail of list) */
            return list_last_entry(&cache->queues[q].head,
                                   struct mq_entry, queue_node);
        }
    }
    return NULL;                    /* Cache empty */
}

/*
 * Periodic maintenance: demote expired items
 * Items that haven't been accessed within 'lifetime' lose a queue level
 */
void mq_expire_check(struct mq_cache *cache)
{
    for (int q = MQ_NUM_QUEUES - 1; q > 0; q--) {
        struct mq_entry *entry, *tmp;

        list_for_each_entry_safe(entry, tmp, &cache->queues[q].head, queue_node) {
            if (cache->current_time > entry->expire_time) {
                /* Demote to lower queue */
                list_del(&entry->queue_node);
                entry->current_queue--;
                list_add_tail(&entry->queue_node,
                              &cache->queues[entry->current_queue].head);
                entry->expire_time = cache->current_time + cache->lifetime;
            }
        }
    }
}
```

Segmented Buffer Pools in Databases:
Database systems often segment their buffer pools explicitly:
| Segment | Contents | Replacement Policy | Size |
|---|---|---|---|
| Hot | Frequently accessed pages | LRU | 70% |
| Warm | Recently accessed once | FIFO | 20% |
| Cold | Read-ahead, scan data | FIFO | 10% |
This segmentation keeps scans and read-ahead confined to the cold segment, protects hot pages from pollution, and lets each segment use the policy and size best suited to its access pattern.
MySQL's InnoDB uses a sophisticated buffer pool with 'young' and 'old' sublists. New pages enter at the midpoint (not the head) of the LRU list. If accessed again within a short window, they're promoted to the 'young' sublist. This elegantly handles both scan resistance and genuine hot data promotion.
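A minimal sketch of the two decisions involved, with constants modeled on InnoDB's documented settings (innodb_old_blocks_pct, default 37, and innodb_old_blocks_time, default 1000 ms in recent versions); the surrounding list manipulation is omitted:

```c
#include <stdbool.h>

/* The old sublist holds roughly innodb_old_blocks_pct (37%) of the pool;
 * a page must stay there this long before an access can promote it. */
#define OLD_BLOCKS_TIME_MS 1000

struct buf_page {
    bool in_old_sublist;            /* linked in below the LRU midpoint? */
    unsigned long first_access_ms;  /* time the page entered the old sublist */
};

/* A newly read page is linked in at the midpoint, i.e. at the head of the
 * old sublist, not at the head of the whole LRU list. */
void page_inserted(struct buf_page *page, unsigned long now_ms)
{
    page->in_old_sublist = true;
    page->first_access_ms = now_ms;
}

/* On a later access, promote the page to the young sublist only if it has
 * survived in the old sublist for the configured window.  A table scan
 * touches a page several times in quick succession, so those re-accesses
 * fall inside the window and never displace the young sublist. */
bool should_make_young(const struct buf_page *page, unsigned long now_ms)
{
    return page->in_old_sublist &&
           now_ms - page->first_access_ms >= OLD_BLOCKS_TIME_MS;
}
```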
Understanding theoretical algorithms is essential, but production systems face additional challenges that pure algorithms don't address. Let's examine how real operating systems handle cache management.
Linux Page Reclaim:
Linux maintains two LRU lists per memory zone: an active list for pages with demonstrated reuse, and an inactive list of colder pages that are candidates for reclaim.
The kernel daemon kswapd continuously monitors memory pressure and performs reclamation:
```c
/*
 * Simplified view of Linux page reclaim (2-list LRU)
 *
 * Real implementation in mm/vmscan.c is much more complex
 */

/*
 * Page flags relevant to LRU management
 */
#define PG_referenced   /* Accessed since last check */
#define PG_active       /* On active list (vs inactive) */
#define PG_lru          /* On one of the LRU lists */

/*
 * Shrink the inactive list
 * Called when memory is needed
 */
unsigned long shrink_inactive_list(struct lruvec *lruvec, unsigned long nr_to_scan)
{
    LIST_HEAD(pages_to_free);
    LIST_HEAD(pages_to_keep);
    LIST_HEAD(page_list);
    unsigned long nr_reclaimed = 0;

    /* Isolate pages from LRU for processing */
    isolate_lru_pages(lruvec, &page_list, nr_to_scan, ISOLATE_INACTIVE);

    while (!list_empty(&page_list)) {
        struct page *page = lru_to_page(&page_list);
        list_del(&page->lru);

        /* Can we reclaim this page? */
        if (page_mapped(page)) {
            /* Page is mapped by a process - try to unmap */
            if (!try_to_unmap(page)) {
                /* Can't unmap, put back on LRU */
                activate_page(page);    /* Maybe promote to active */
                continue;
            }
        }

        /* Check if page is dirty */
        if (PageDirty(page)) {
            /* Must write back before freeing */
            if (may_writeback()) {
                /* Queue for writeback, check again later */
                pageout(page);
                continue;
            }
            /* Can't writeback now, keep the page */
            list_add(&page->lru, &pages_to_keep);
            continue;
        }

        /* Page is clean and unmapped - free it! */
        __clear_page_lru_flags(page);
        list_add(&page->lru, &pages_to_free);
        nr_reclaimed++;
    }

    /* Put kept pages back on LRU */
    putback_lru_pages(&pages_to_keep);

    /* Actually free the reclaimed pages */
    free_unref_page_list(&pages_to_free);

    return nr_reclaimed;
}

/*
 * Balance between active and inactive lists
 * Move cold pages from active to inactive
 */
void shrink_active_list(struct lruvec *lruvec, unsigned long nr_to_scan)
{
    LIST_HEAD(pages_to_deactivate);
    LIST_HEAD(pages_to_keep);
    LIST_HEAD(page_list);

    /* Isolate pages from active list */
    isolate_lru_pages(lruvec, &page_list, nr_to_scan, ISOLATE_ACTIVE);

    while (!list_empty(&page_list)) {
        struct page *page = lru_to_page(&page_list);

        /* Check referenced bit */
        if (PageReferenced(page)) {
            /* Recently accessed - keep on active list */
            ClearPageReferenced(page);      /* Clear for next cycle */
            list_add(&page->lru, &pages_to_keep);
        } else {
            /* Not accessed recently - demote to inactive */
            ClearPageActive(page);
            list_add(&page->lru, &pages_to_deactivate);
        }
    }

    /* Move demoted pages to inactive list */
    move_pages_to_lru(lruvec, &pages_to_deactivate);

    /* Keep others on active list */
    move_pages_to_lru(lruvec, &pages_to_keep);
}
```

| OS | Algorithm | Key Features |
|---|---|---|
| Linux | Split LRU (Active/Inactive) | Workingset detection, MGLRU (multi-gen LRU) in recent kernels |
| FreeBSD | Modified Clock | Separate queues for file vs anonymous pages |
| Windows | Working Set + Standby Lists | Per-process working sets, global standby/modified lists |
| Solaris/illumos | Cyclic scanning | Pageout scanner, lotsfree/minfree thresholds |
| ZFS | ARC | Adaptive, self-tuning, metadata vs data separation |
Linux 6.1+ introduced MGLRU, representing the most significant page reclaim improvement in decades. Instead of binary active/inactive, MGLRU tracks multiple generations with bloom-filter-based aging. It dramatically improves performance on memory-constrained systems and reduces CPU overhead in page reclaim by 40% or more in benchmarks.
We've explored the full spectrum of cache replacement policies, from the theoretical optimum to production implementations. The key insights: OPT defines the unreachable upper bound and serves as a benchmark; LRU is the practical default thanks to temporal locality and O(1) cost; LFU and ARC bring frequency information into the decision, with ARC adapting automatically; Clock and split-LRU approximations trade a little accuracy for far less overhead; and real operating systems layer these ideas with working-set detection, writeback handling, and scan resistance.
What's Next:
With cache management policies understood, we'll next examine write-back policies—the strategies that determine when and how cached modifications are persisted to storage. Write-back policies involve fundamental tradeoffs between performance and durability, making them critical for both performance optimization and data integrity.
You now understand the major cache replacement policies, their tradeoffs, and how they're implemented in production systems. This knowledge is essential for debugging cache performance issues, tuning system behavior, and understanding why some workloads hit cache better than others.