Every modern computer faces a fundamental tension: applications demand more memory than physically exists. For decades, the primary solution was virtual memory—using disk storage as an extension of RAM. But this comes at a staggering cost: disk access is roughly 100,000 times slower than memory access on spinning disks, and still around 100 times slower even on fast NVMe SSDs.
What if there were a middle ground? What if, instead of immediately evicting pages to disk when memory pressure rises, the operating system could compress those pages and keep them in RAM—trading CPU cycles for memory capacity?
This is the promise of memory compression, and modern operating systems have elevated it from a simple optimization to a sophisticated subsystem that dramatically improves performance under memory pressure.
By the end of this page, you will understand how compressed page caches work at a deep technical level—including the memory hierarchy positioning, compression pool management, page selection strategies, and the architectural decisions that make compression viable in production operating systems.
To understand why memory compression exists, we must first appreciate the massive latency gaps in modern memory hierarchies. Consider the time to access data at different levels:
| Level | Latency | Relative Speed |
|---|---|---|
| L1 Cache | ~1 ns | 1x |
| L2 Cache | ~4 ns | 4x slower |
| L3 Cache | ~12 ns | 12x slower |
| Main Memory (DRAM) | ~100 ns | 100x slower |
| SSD (NVMe) | ~10,000 ns (10 μs) | 10,000x slower |
| HDD | ~10,000,000 ns (10 ms) | 10,000,000x slower |
When the operating system needs to swap out a memory page, it traditionally writes to disk. Even with the fastest NVMe SSDs, this operation is 100x slower than a memory access. With HDDs, it's 100,000x slower.
A single page fault requiring disk I/O can stall a process for 10 microseconds (SSD) to 10 milliseconds (HDD). During that time, a modern CPU at 3 GHz could have executed 30,000 to 30,000,000 instructions. Swapping is devastatingly expensive.
The Compression Insight:
Compression introduces a new tier in this hierarchy: compressed memory. Instead of immediately sending pages to disk, we compress them and store the compressed data in RAM:
| Level | Latency | Capacity Impact |
|---|---|---|
| Uncompressed Memory | ~100 ns | 1x capacity |
| Compressed Memory | ~300-1000 ns | 2-4x effective capacity |
| Swap (SSD) | ~10,000 ns | Disk-limited |
| Swap (HDD) | ~10,000,000 ns | Disk-limited |
Compression is roughly 10-30x faster than SSD access and 10,000-30,000x faster than HDD access, while achieving 2-4x compression ratios on typical workloads.
Memory compression trades CPU cycles for memory capacity. This tradeoff is favorable when: (1) memory is scarce but CPU has idle cycles, (2) disk I/O would otherwise occur, and (3) the data compresses well. Modern systems dynamically evaluate these conditions.
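A quick back-of-the-envelope check, assuming a 4KB page costs about 2 μs to compress and 1 μs to decompress (in line with the latency figures later in this section), and using the swap latencies from the table above:

Keep in pool: ~2 μs compress + ~1 μs decompress on refault ≈ 3 μs
Evict to SSD swap: ~10 μs write + ~10 μs read back ≈ 20 μs
Evict to HDD swap: ~10 ms write + ~10 ms read back ≈ 20 ms

Each refault served from the pool instead of swap saves on the order of tens of microseconds (SSD) or tens of milliseconds (HDD), which is why condition (2) dominates in practice.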
A compressed page cache is a region of memory that stores compressed versions of memory pages that would otherwise be swapped to disk. The key architectural components include:
1. Compression Pool: A contiguous or virtually contiguous memory region reserved for storing compressed pages. Unlike the regular page cache (which stores pages at their natural 4KB granularity), the compression pool stores variable-size compressed data.
2. Page Metadata: For each compressed page, the system maintains metadata: the original page frame number, the compressed size, the algorithm used, LRU linkage, a reference count, and status flags (see the structures below).
3. Index Structure: A data structure (typically a radix tree or hash table) that maps from original page identifiers to their compressed representations.
```c
/* Simplified compressed page cache data structures */

struct compressed_page {
	unsigned long original_pfn;      /* Original page frame number */
	void *compressed_data;           /* Pointer into compression pool */
	unsigned int compressed_size;    /* Size after compression */
	unsigned int compression_algo;   /* Algorithm identifier */
	struct list_head lru_list;       /* LRU list linkage */
	struct rb_node rb_node;          /* Red-black tree for lookup */
	atomic_t refcount;               /* Reference counting */
	unsigned long flags;             /* Status flags */
};

struct compression_pool {
	void *pool_base;                 /* Base address of pool */
	unsigned long pool_size;         /* Total pool size */
	unsigned long used_bytes;        /* Currently used bytes */
	unsigned long compressed_pages;  /* Number of compressed pages */
	spinlock_t lock;                 /* Pool access lock */

	/* Memory allocator for variable-size compressed data */
	struct z3fold_pool *allocator;   /* Or zbud, zsmalloc, etc. */

	/* Statistics */
	atomic64_t compress_ops;         /* Compression operations */
	atomic64_t decompress_ops;       /* Decompression operations */
	atomic64_t bytes_saved;          /* Memory saved by compression */
};

struct compressed_cache {
	struct rb_root page_tree;        /* Fast lookup by PFN */
	struct list_head lru_hot;        /* Recently accessed */
	struct list_head lru_cold;       /* Candidates for eviction */
	struct compression_pool *pool;   /* The actual data storage */

	/* Configuration */
	unsigned long max_pages;         /* Maximum pages to store */
	unsigned long max_pool_size;     /* Maximum pool memory */

	/* Synchronization */
	rwlock_t tree_lock;              /* Protects page_tree */
	struct mutex config_lock;        /* Protects configuration */
};
```

Variable-Size Storage Challenge:
One of the fundamental challenges in compressed cache design is variable-size storage. Uncompressed pages are always 4KB (on most systems), enabling simple array-based allocation. Compressed pages range from a few bytes (highly compressible data) to nearly 4KB (incompressible data).
This requires specialized allocators: zbud packs up to two compressed pages into each physical page, z3fold packs up to three, and zsmalloc groups objects into size classes for higher storage density.
The choice of allocator significantly impacts both memory efficiency and CPU overhead.
General-purpose allocators like malloc() are designed for user-space with different constraints. Kernel compressed caches need: (1) atomic/interrupt-safe allocation, (2) minimal fragmentation over long periods, (3) predictable latency, and (4) efficient reclaim under memory pressure. Specialized allocators like zbud and zsmalloc are tuned for these requirements.
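To make variable-size packing concrete, here is a minimal user-space sketch of the zbud idea: each physical page holds at most two compressed "buddies", one filled from each end. The names (zbud_page, zbud_alloc) and layout are illustrative simplifications, not the kernel's actual zbud API:

```c
/* Minimal sketch of zbud-style packing: each 4 KB "pool page" holds at
 * most two compressed buffers, one from each end. Illustrative only --
 * the real allocator lives in mm/zbud.c and is considerably richer. */
#include <stdio.h>

#define POOL_PAGE_SIZE 4096

struct zbud_page {
	unsigned char data[POOL_PAGE_SIZE];
	size_t first_size; /* buddy allocated from the front, 0 if free */
	size_t last_size;  /* buddy allocated from the back, 0 if free */
};

/* Try to place a compressed buffer of 'len' bytes into 'zp'.
 * Returns a pointer to the slot, or NULL if it does not fit. */
static void *zbud_alloc(struct zbud_page *zp, size_t len)
{
	size_t used = zp->first_size + zp->last_size;

	if (len > POOL_PAGE_SIZE - used)
		return NULL;
	if (zp->first_size == 0) {
		zp->first_size = len;
		return zp->data;                        /* fill from the front */
	}
	if (zp->last_size == 0) {
		zp->last_size = len;
		return zp->data + POOL_PAGE_SIZE - len; /* fill from the back */
	}
	return NULL; /* both buddies already taken */
}

int main(void)
{
	struct zbud_page zp = {0};
	void *a = zbud_alloc(&zp, 1800); /* a 4 KB page compressed to 1800 B */
	void *b = zbud_alloc(&zp, 2100);
	void *c = zbud_alloc(&zp, 100);  /* rejected: both buddies in use */

	printf("a=%p b=%p c=%p\n", a, b, c);
	return 0;
}
```

zsmalloc generalizes this by grouping objects into size classes that span several pages, trading zbud's simplicity and predictable worst case for higher density.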
Not all pages are equally suitable for compression. The operating system must intelligently select which pages to compress, considering multiple factors:
1. Compressibility: Some pages compress extremely well (90%+ reduction), while others are nearly incompressible (encrypted data, already-compressed media). Compressing incompressible data wastes CPU cycles.
2. Access Frequency: Frequently accessed (hot) pages incur decompression costs on every access. These should remain uncompressed. Infrequently accessed (cold) pages are ideal candidates.
3. Page Type: Different page types have different compression characteristics: anonymous heap and stack data typically compress well, zero-filled pages compress trivially (and can be deduplicated rather than stored), while encrypted buffers and already-compressed media such as JPEG or video barely compress at all.
4. Temporal Patterns: Pages that were recently active may become active again soon. The system considers recency and frequency patterns.
```c
/* Page selection heuristics for compression */

enum compression_decision {
	COMPRESS_YES,       /* Good candidate, compress */
	COMPRESS_NO_HOT,    /* Too frequently accessed */
	COMPRESS_NO_TYPE,   /* Incompressible page type */
	COMPRESS_NO_SIZE,   /* Estimated poor compression */
	COMPRESS_DEFER,     /* Check again later */
};

static unsigned int estimate_entropy(void *data, size_t len);

static enum compression_decision evaluate_page(struct page *page)
{
	/* Never compress kernel pages */
	if (PageKernel(page))
		return COMPRESS_NO_TYPE;

	/* Skip pages with pending I/O */
	if (PageWriteback(page) || PageLocked(page))
		return COMPRESS_DEFER;

	/* Check access frequency via hardware reference bit */
	if (page_recently_accessed(page))
		return COMPRESS_NO_HOT;

	/* Zero pages get special handling - don't waste pool space */
	if (page_all_zeros(page))
		return COMPRESS_YES; /* Will use zero-page deduplication */

	/* Sample first 64 bytes for entropy estimation */
	unsigned int entropy = estimate_entropy(page_address(page), 64);

	/* High entropy suggests incompressible data */
	if (entropy > ENTROPY_THRESHOLD)
		return COMPRESS_NO_SIZE;

	/* Check page age - prefer older pages */
	unsigned long age = get_page_age(page);
	if (age < MIN_COMPRESSION_AGE)
		return COMPRESS_DEFER;

	return COMPRESS_YES;
}

/* Fast entropy estimation using byte histogram */
static unsigned int estimate_entropy(void *data, size_t len)
{
	unsigned int histogram[256] = {0};
	unsigned char *bytes = data;

	/* Build histogram */
	for (size_t i = 0; i < len; i++)
		histogram[bytes[i]]++;

	/* Estimate entropy via histogram spread */
	unsigned int unique_bytes = 0;
	for (int i = 0; i < 256; i++) {
		if (histogram[i] > 0)
			unique_bytes++;
	}

	/* More unique values = higher entropy = less compressible */
	return (unique_bytes * 100) / 256;
}
```

Managing the compression pool efficiently is critical to overall system performance. This involves several key operations:
1. Pool Sizing: The compression pool size determines how many compressed pages can be stored. Too small, and pages overflow to swap. Too large, and uncompressed memory is needlessly sacrificed.
Modern systems use dynamic sizing based on current memory pressure, the compression ratio actually being achieved, available CPU headroom, and recent pool hit and eviction rates; a sketch of one such heuristic follows.
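A hedged sketch of such a sizing heuristic; every constant, threshold, and name here is hypothetical, chosen only to illustrate the shape of the policy:

```c
/* Hypothetical pool-sizing heuristic: grow the pool when memory is
 * tight and compression is paying off, shrink it when memory is
 * plentiful. All constants are illustrative, not from any kernel. */
#include <stdio.h>

#define POOL_MIN_PAGES 1024UL /* never shrink below 4 MB */
#define POOL_MAX_PCT   20UL   /* never exceed 20% of total RAM */

static unsigned long target_pool_pages(unsigned long total_pages,
				       unsigned long free_pages,
				       unsigned int ratio_x100) /* 234 = 2.34x */
{
	unsigned long max_pool = total_pages * POOL_MAX_PCT / 100;
	unsigned long free_pct = free_pages * 100 / total_pages;
	unsigned long target;

	if (ratio_x100 < 150)   /* compression barely paying off */
		return POOL_MIN_PAGES;

	if (free_pct > 25)      /* plenty of free memory: small pool */
		target = max_pool / 4;
	else if (free_pct > 10) /* moderate pressure: medium pool */
		target = max_pool / 2;
	else                    /* heavy pressure: allow the full pool */
		target = max_pool;

	return target < POOL_MIN_PAGES ? POOL_MIN_PAGES : target;
}

int main(void)
{
	/* 4 GB of 4 KB pages, ~8% free, 2.34x ratio -> large pool */
	printf("target = %lu pages\n",
	       target_pool_pages(1048576, 83886, 234));
	return 0;
}
```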
2. Space Allocation: Within the pool, compressed data of varying sizes must be allocated efficiently. This is where specialized allocators (zbud, z3fold, zsmalloc) prove essential.
3. Fragmentation Management: Over time, as pages are compressed and decompressed, the pool can become fragmented—small free spaces scattered throughout. Defragmentation strategies include compacting live objects toward one end of the pool, packing multiple objects into the same physical page (which zbud, z3fold, and zsmalloc do by design), and preferentially evicting entries whose release frees entire underlying pages.
Pool Eviction Policy:
When the compression pool is full and new pages need compression, the system must evict existing compressed pages. The eviction decision considers age (time since the page was compressed), compressed size (evicting one large object frees more pool space than several tiny ones), and whether an identical copy already exists on swap.
This creates a two-tier LRU system: one LRU ranks uncompressed pages as candidates for compression, and a second LRU ranks compressed pages within the pool as candidates for writeback. Cold pages flow from uncompressed memory into the pool, and from the pool out to swap.
When evicting from the compression pool to swap, the system can write compressed data directly to disk. This 'compressed swap' saves both disk space and I/O bandwidth. The decompression happens when the page is faulted back into memory.
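The following user-space sketch illustrates this compressed writeback: the coldest entry on the pool's LRU is unlinked and only its compressed bytes are written to a stand-in swap file. The structures and helpers are illustrative, not kernel APIs:

```c
/* Illustrative pool-eviction sketch: write the coldest compressed
 * entry (tail of the LRU) to a file standing in for the swap device.
 * Real kernels issue bio-based swap I/O instead. */
#include <stdio.h>
#include <stdlib.h>

struct centry {
	unsigned long pfn;          /* original page frame number */
	unsigned char *data;        /* compressed bytes in the pool */
	size_t size;                /* compressed size */
	struct centry *prev, *next; /* LRU: head = hot, tail = cold */
};

/* Write the coldest entry to 'swap' and unlink it from the LRU. */
static int evict_coldest(struct centry **head, struct centry **tail,
			 FILE *swap)
{
	struct centry *victim = *tail;

	if (!victim)
		return -1; /* pool empty */

	/* Compressed writeback: only 'size' bytes hit disk, not 4096. */
	fwrite(&victim->pfn, sizeof(victim->pfn), 1, swap);
	fwrite(&victim->size, sizeof(victim->size), 1, swap);
	fwrite(victim->data, 1, victim->size, swap);

	*tail = victim->prev;
	if (*tail)
		(*tail)->next = NULL;
	else
		*head = NULL;
	free(victim->data);
	free(victim);
	return 0;
}

int main(void)
{
	FILE *swap = tmpfile();
	struct centry *e = calloc(1, sizeof(*e));

	e->pfn = 42;
	e->size = 1800;
	e->data = calloc(1, e->size);

	struct centry *head = e, *tail = e;
	evict_coldest(&head, &tail, swap); /* writes 1800 B, not 4096 B */
	fclose(swap);
	return 0;
}
```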
When a process accesses a page that has been compressed, the system must transparently decompress it and provide the original data. This decompression path is performance-critical.
The Decompression Fault Path:
```c
/* Simplified decompression fault handler */

static int handle_compressed_page_fault(struct mm_struct *mm,
					struct vm_area_struct *vma,
					unsigned long address, pte_t *pte)
{
	struct compressed_page *cpage;
	struct page *new_page;
	void *dest;
	int ret;

	/* 1. Lookup the compressed page */
	cpage = lookup_compressed_page(mm, address >> PAGE_SHIFT);
	if (!cpage)
		return VM_FAULT_SIGBUS; /* Not found - fatal */

	/* 2. Allocate a new physical page */
	new_page = alloc_page(GFP_HIGHUSER_MOVABLE);
	if (!new_page) {
		/* Memory pressure - try to reclaim and retry */
		return VM_FAULT_OOM;
	}

	/* 3. Decompress into the new page */
	dest = kmap_atomic(new_page);
	ret = decompress(cpage->compressed_data, cpage->compressed_size,
			 dest, PAGE_SIZE, cpage->compression_algo);
	kunmap_atomic(dest);

	if (ret != PAGE_SIZE) {
		put_page(new_page);
		return VM_FAULT_SIGBUS; /* Decompression failed */
	}

	/* 4. Update page tables atomically */
	spin_lock(&mm->page_table_lock);
	pte_t entry = mk_pte(new_page, vma->vm_page_prot);
	set_pte_at(mm, address, pte, entry);

	/* 5. Update reverse mapping */
	page_add_anon_rmap(new_page, vma, address);
	spin_unlock(&mm->page_table_lock);

	/* 6. Remove from compression pool */
	remove_compressed_page(cpage);

	/* 7. Add to active LRU list */
	lru_cache_add_active_anon(new_page);

	/* Statistics */
	count_vm_event(DECOMPRESS_FAULT);

	return 0; /* Success - instruction will retry */
}
```

Latency Considerations:
Decompression latency directly impacts application performance. Key factors include:
| Factor | Impact | Mitigation |
|---|---|---|
| Algorithm speed | 100-1000 ns per 4KB | Choose fast algorithms (LZ4, LZO) |
| Cache effects | Misses to DRAM add ~100 ns each | Prefetch compressed data |
| Memory allocation | O(100) ns typical | Pre-allocate pages |
| Lock contention | Highly variable | Fine-grained locking |
| TLB updates | ~10-100 ns | Minimize TLB shootdowns |
Total latency: 300-1500 ns for typical decompression
Compare this to: an SSD swap-in at ~10,000 ns (10 μs), or an HDD swap-in at ~10,000,000 ns (10 ms).
Even at 10x the latency of an uncompressed access, compression is 10-10,000x faster than swap.
Some systems implement speculative decompression: when a compressed page is accessed, they also decompress nearby pages in the same region. This exploits spatial locality—if one page in a region is accessed, neighbors are likely to follow. This amortizes decompression overhead across multiple page accesses.
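A minimal sketch of this read-around policy, with the pool simulated by a residency bitmap and decompress() standing in for a real codec such as LZ4; all names are illustrative:

```c
/* Sketch of speculative decompression (read-around): when page 'pfn'
 * faults, also decompress resident neighbors in the same window. */
#include <stdio.h>
#include <stdbool.h>

#define NPAGES 16
#define WINDOW 4

static bool in_pool[NPAGES] = {
	[5] = true, [6] = true, [7] = true, [9] = true
};

static void decompress(unsigned long pfn) /* stub for a real codec */
{
	in_pool[pfn] = false;
	printf("decompressed pfn %lu\n", pfn);
}

static void fault_with_readaround(unsigned long pfn)
{
	if (!in_pool[pfn])
		return;      /* not compressed: nothing to do */
	decompress(pfn);     /* the faulting page: mandatory */

	/* Neighbors: best-effort, amortizing the fault over the region. */
	for (unsigned long n = pfn + 1; n < NPAGES && n <= pfn + WINDOW; n++) {
		if (!in_pool[n])
			break; /* stop at the first gap */
		decompress(n);
	}
}

int main(void)
{
	fault_with_readaround(5); /* decompresses 5, 6, 7; stops before 8 */
	return 0;
}
```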
Compressed page caches must integrate seamlessly with the kernel's memory reclaim subsystem—the component responsible for freeing memory when pressure rises. This integration is complex and critical.
Memory Zones and Compression:
Linux organizes memory into zones (DMA, DMA32, Normal, HighMem). The compression pool typically resides in the Normal zone. When reclaim runs, it must coordinate: which pages get compressed into the pool versus written to swap, how much the pool may grow at the expense of a zone's free memory, and when pool contents themselves must be written back to make room.
The Reclaim Decision Tree:
For each candidate page, reclaim conceptually walks a decision tree: clean file-backed pages are simply dropped (they can be re-read from their files); dirty file-backed pages are written back; anonymous pages are evaluated for compression (see evaluate_page above) and land in the pool if they qualify, or in swap if they do not.
kswapd Integration:
The kernel's background reclaim daemon (kswapd) is extended to consider compression: on waking at a watermark, it first tries to compress cold anonymous pages into the pool, falls back to swapping when the pool is full or a page compresses poorly, and writes old pool entries back to swap when the pool itself needs trimming. A sketch of this decision follows.
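A compact sketch of this compression-aware reclaim decision; the enum values and predicates are illustrative, not kernel symbols:

```c
/* Sketch of the compression-aware reclaim decision described above. */
#include <stdio.h>
#include <stdbool.h>

enum reclaim_action {
	RECLAIM_DROP,      /* clean file page: just free it */
	RECLAIM_WRITEBACK, /* dirty file page: write to its file */
	RECLAIM_COMPRESS,  /* anonymous page: into the pool */
	RECLAIM_SWAP,      /* anonymous page: pool full or incompressible */
};

static enum reclaim_action classify(bool file_backed, bool dirty,
				    bool pool_has_room, bool compresses_well)
{
	if (file_backed)
		return dirty ? RECLAIM_WRITEBACK : RECLAIM_DROP;
	if (pool_has_room && compresses_well)
		return RECLAIM_COMPRESS;
	return RECLAIM_SWAP;
}

int main(void)
{
	printf("%d\n", classify(false, true, true, true));  /* COMPRESS */
	printf("%d\n", classify(false, true, false, true)); /* SWAP */
	printf("%d\n", classify(true, false, true, true));  /* DROP */
	return 0;
}
```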
Watermarks and Thresholds:
The system maintains multiple watermarks:
| Watermark | Action | Purpose |
|---|---|---|
| min_free | Wake kswapd | Start background reclaim |
| low_free | Aggressive compression | Compress before swapping |
| high_free | Decompress if idle | Opportunistically expand |
| pool_full | Pool eviction | Move pool→swap |
| critical | Direct reclaim | Synchronous reclaim |
These watermarks are tunable and adaptive—they adjust based on observed workload patterns.
Compression adds CPU load during reclaim. If the system is already CPU-bound, aggressive compression can worsen overall performance. Modern implementations throttle compression based on CPU usage and prioritize swap when CPU is scarce.
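A hypothetical throttle check along these lines; the 75% CPU threshold and the 1.5x minimum ratio are illustrative values, not from any real implementation:

```c
/* Hypothetical compression throttle: compress only when the CPU has
 * headroom and compression has recently been paying off; otherwise
 * fall back to plain swap. */
#include <stdio.h>
#include <stdbool.h>

#define CPU_BUSY_LIMIT_PCT 75

static bool should_compress(unsigned int cpu_busy_pct,
			    unsigned int recent_ratio_x100)
{
	if (cpu_busy_pct > CPU_BUSY_LIMIT_PCT)
		return false; /* CPU-bound: swap instead */
	if (recent_ratio_x100 < 150)
		return false; /* compression not paying off */
	return true;
}

int main(void)
{
	printf("%d\n", should_compress(40, 234)); /* 1: compress */
	printf("%d\n", should_compress(90, 234)); /* 0: swap */
	return 0;
}
```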
Operating a compressed page cache effectively requires understanding its performance characteristics and tuning accordingly. Key metrics include:
Compression Ratio:
Ratio = Original Size / Compressed Size
Higher is better. Typical ratios: around 2-4x for anonymous heap and application data, effectively unbounded for zero-filled pages (deduplicated rather than stored), and close to 1x for encrypted or already-compressed content.
Hit Rate:
Hit Rate = Decompressions / (Decompressions + Swap-ins)
Measures how often compressed pages satisfy requests vs. requiring swap access. Higher is better.
Compression/Decompression Latency: Time per operation. Lower is better. Typical targets with fast algorithms (LZ4, LZO): roughly 1-3 μs per 4KB page to compress and under 1 μs to decompress.
Pool Efficiency:
Efficiency = Logical Bytes Stored / Physical Pool Size
Measures effective compression including fragmentation overhead.
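These metrics fall out directly from raw counters. Here is a small sketch computing pool efficiency and hit rate with the formulas above; the sample counters are chosen to line up with the hit-rate figure in the zswap example below:

```c
/* Computing the section's metrics from raw counters. The counter
 * values are illustrative samples, not live kernel output. */
#include <stdio.h>

int main(void)
{
	unsigned long stored_pages = 125432;    /* pages held in the pool */
	unsigned long loaded_pages = 98234;     /* faults served from pool */
	unsigned long swapped_in   = 4567;      /* faults served from disk */
	unsigned long pool_bytes   = 268435456; /* physical pool size */
	unsigned long page_size    = 4096;

	double logical = (double)stored_pages * page_size;

	/* Efficiency = Logical Bytes Stored / Physical Pool Size */
	printf("Pool efficiency: %.2fx\n", logical / pool_bytes);

	/* Hit Rate = Decompressions / (Decompressions + Swap-ins) */
	printf("Hit rate:        %.1f%%\n",
	       100.0 * loaded_pages / (loaded_pages + swapped_in));
	return 0;
}
```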
```
# Example zswap statistics from /sys/kernel/debug/zswap

# Pages stored and retrieved
stored_pages: 125432
same_filled_pages: 23456       # Zero-page deduplication
loaded_pages: 98234

# Pool utilization
pool_total_size: 268435456     # 256 MB
compressed_size: 134217728     # 128 MB effective
pool_limit_hit: 0              # Never hit limit

# Compression performance
compress_time_us: 2345678      # Total compression CPU time
decompress_time_us: 1234567    # Total decompression CPU time

# Rejection statistics
reject_compress_poor: 12345    # Poor compression ratio
reject_kmem_alloc_fail: 0      # Allocation failures
reject_reclaim_fail: 23        # Could not reclaim pool space

# Writeback to swap
written_back_pages: 4567       # Evicted from pool to swap

# Calculated metrics
Average compression ratio: 2.34x
Hit rate: 95.6%
Average compress latency: 1.87 μs/page
Average decompress latency: 0.98 μs/page
```

Use `cat /sys/kernel/debug/zswap/*` for zswap stats, `zramctl` for zram info, and `vmstat -s` for overall memory stats. Track these over time to understand workload patterns and optimize configuration.
We've explored the architecture and operation of compressed page caches—a critical technique for improving memory efficiency in modern operating systems. Let's consolidate the key concepts: compressed memory forms a new tier between DRAM and swap, roughly an order of magnitude faster than SSD swap; variable-size compressed data demands specialized allocators such as zbud, z3fold, and zsmalloc; page selection weighs compressibility, hotness, type, and age; the decompression fault path must stay in the microsecond range to preserve the latency advantage; and the pool must cooperate with reclaim watermarks while throttling under CPU pressure.
What's Next:
Now that we understand compressed page cache fundamentals, the next page explores zswap—Linux's implementation of a compressed swap cache that sits between the reclaim path and the swap device, providing first-class compressed caching in production systems.
You now understand the architecture, operation, and performance characteristics of compressed page caches. This foundation prepares you for understanding specific implementations (zswap, zram) and the tradeoffs involved in deploying memory compression in production systems.