Every modern computer faces a fundamental tension: applications demand more memory than physically exists. For decades, the primary solution was virtual memory—using disk storage as an extension of RAM. But this comes at a staggering cost: disk access is roughly 100,000 times slower than memory access on spinning disks, and still around 100 times slower even on fast NVMe SSDs.
What if there were a middle ground? What if, instead of immediately evicting pages to disk when memory pressure rises, the operating system could compress those pages and keep them in RAM—trading CPU cycles for memory capacity?
This is the promise of memory compression, and modern operating systems have elevated it from a simple optimization to a sophisticated subsystem that dramatically improves performance under memory pressure.
By the end of this page, you will understand how compressed page caches work at a deep technical level—including the memory hierarchy positioning, compression pool management, page selection strategies, and the architectural decisions that make compression viable in production operating systems.
To understand why memory compression exists, we must first appreciate the massive latency gaps in modern memory hierarchies. Consider the time to access data at different levels:
| Level | Latency | Relative Speed |
|---|---|---|
| L1 Cache | ~1 ns | 1x |
| L2 Cache | ~4 ns | 4x slower |
| L3 Cache | ~12 ns | 12x slower |
| Main Memory (DRAM) | ~100 ns | 100x slower |
| SSD (NVMe) | ~10,000 ns (10 μs) | 10,000x slower |
| HDD | ~10,000,000 ns (10 ms) | 10,000,000x slower |
When the operating system needs to swap out a memory page, it traditionally writes to disk. Even with the fastest NVMe SSDs, this operation is 100x slower than a memory access. With HDDs, it's 100,000x slower.
A single page fault requiring disk I/O can stall a process for 10 microseconds (SSD) to 10 milliseconds (HDD). During that time, a modern CPU at 3 GHz could have executed 30,000 to 30,000,000 instructions. Swapping is devastatingly expensive.
The Compression Insight:
Compression introduces a new tier in this hierarchy: compressed memory. Instead of immediately sending pages to disk, we compress them and store the compressed data in RAM:
| Level | Latency | Capacity Impact |
|---|---|---|
| Uncompressed Memory | ~100 ns | 1x capacity |
| Compressed Memory | ~300-1000 ns | 2-4x effective capacity |
| Swap (SSD) | ~10,000 ns | Disk-limited |
| Swap (HDD) | ~10,000,000 ns | Disk-limited |
Compression is roughly 10-30x faster than SSD access and 10,000-30,000x faster than HDD access, while achieving 2-4x compression ratios on typical workloads.
Memory compression trades CPU cycles for memory capacity. This tradeoff is favorable when: (1) memory is scarce but CPU has idle cycles, (2) disk I/O would otherwise occur, and (3) the data compresses well. Modern systems dynamically evaluate these conditions.
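A quick back-of-the-envelope check, assuming a 4KB page costs about 2 μs to compress and 1 μs to decompress (in line with the latency figures later in this section), and using the swap latencies from the table above:

Keep in pool: ~2 μs compress + ~1 μs decompress on refault ≈ 3 μs
Evict to SSD swap: ~10 μs write + ~10 μs read back ≈ 20 μs
Evict to HDD swap: ~10 ms write + ~10 ms read back ≈ 20 ms

Each refault served from the pool instead of swap saves on the order of tens of microseconds (SSD) or tens of milliseconds (HDD), which is why condition (2) dominates in practice.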
A compressed page cache is a region of memory that stores compressed versions of memory pages that would otherwise be swapped to disk. The key architectural components include:
1. Compression Pool: A contiguous or virtually contiguous memory region reserved for storing compressed pages. Unlike the regular page cache (which stores pages at their natural 4KB granularity), the compression pool stores variable-size compressed data.
2. Page Metadata: For each compressed page, the system maintains metadata: the original page frame number, the compressed size, the algorithm used, LRU linkage, a reference count, and status flags (see the structures below).
3. Index Structure: A data structure (typically a radix tree or hash table) that maps from original page identifiers to their compressed representations.
```c
/* Simplified compressed page cache data structures */

struct compressed_page {
	unsigned long original_pfn;      /* Original page frame number */
	void *compressed_data;           /* Pointer into compression pool */
	unsigned int compressed_size;    /* Size after compression */
	unsigned int compression_algo;   /* Algorithm identifier */
	struct list_head lru_list;       /* LRU list linkage */
	struct rb_node rb_node;          /* Red-black tree for lookup */
	atomic_t refcount;               /* Reference counting */
	unsigned long flags;             /* Status flags */
};

struct compression_pool {
	void *pool_base;                 /* Base address of pool */
	unsigned long pool_size;         /* Total pool size */
	unsigned long used_bytes;        /* Currently used bytes */
	unsigned long compressed_pages;  /* Number of compressed pages */
	spinlock_t lock;                 /* Pool access lock */

	/* Memory allocator for variable-size compressed data */
	struct z3fold_pool *allocator;   /* Or zbud, zsmalloc, etc. */

	/* Statistics */
	atomic64_t compress_ops;         /* Compression operations */
	atomic64_t decompress_ops;       /* Decompression operations */
	atomic64_t bytes_saved;          /* Memory saved by compression */
};

struct compressed_cache {
	struct rb_root page_tree;        /* Fast lookup by PFN */
	struct list_head lru_hot;        /* Recently accessed */
	struct list_head lru_cold;       /* Candidates for eviction */
	struct compression_pool *pool;   /* The actual data storage */

	/* Configuration */
	unsigned long max_pages;         /* Maximum pages to store */
	unsigned long max_pool_size;     /* Maximum pool memory */

	/* Synchronization */
	rwlock_t tree_lock;              /* Protects page_tree */
	struct mutex config_lock;        /* Protects configuration */
};
```

Variable-Size Storage Challenge:
One of the fundamental challenges in compressed cache design is variable-size storage. Uncompressed pages are always 4KB (on most systems), enabling simple array-based allocation. Compressed pages range from a few bytes (highly compressible data) to nearly 4KB (incompressible data).
This requires specialized allocators: zbud packs up to two compressed pages into each physical page, z3fold packs up to three, and zsmalloc groups objects into size classes for higher storage density.
The choice of allocator significantly impacts both memory efficiency and CPU overhead.
General-purpose allocators like malloc() are designed for user-space with different constraints. Kernel compressed caches need: (1) atomic/interrupt-safe allocation, (2) minimal fragmentation over long periods, (3) predictable latency, and (4) efficient reclaim under memory pressure. Specialized allocators like zbud and zsmalloc are tuned for these requirements.
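To make variable-size packing concrete, here is a minimal user-space sketch of the zbud idea: each physical page holds at most two compressed "buddies", one filled from each end. The names (zbud_page, zbud_alloc) and layout are illustrative simplifications, not the kernel's actual zbud API:

```c
/* Minimal sketch of zbud-style packing: each 4 KB "pool page" holds at
 * most two compressed buffers, one from each end. Illustrative only --
 * the real allocator lives in mm/zbud.c and is considerably richer. */
#include <stdio.h>

#define POOL_PAGE_SIZE 4096

struct zbud_page {
	unsigned char data[POOL_PAGE_SIZE];
	size_t first_size; /* buddy allocated from the front, 0 if free */
	size_t last_size;  /* buddy allocated from the back, 0 if free */
};

/* Try to place a compressed buffer of 'len' bytes into 'zp'.
 * Returns a pointer to the slot, or NULL if it does not fit. */
static void *zbud_alloc(struct zbud_page *zp, size_t len)
{
	size_t used = zp->first_size + zp->last_size;

	if (len > POOL_PAGE_SIZE - used)
		return NULL;
	if (zp->first_size == 0) {
		zp->first_size = len;
		return zp->data;                        /* fill from the front */
	}
	if (zp->last_size == 0) {
		zp->last_size = len;
		return zp->data + POOL_PAGE_SIZE - len; /* fill from the back */
	}
	return NULL; /* both buddies already taken */
}

int main(void)
{
	struct zbud_page zp = {0};
	void *a = zbud_alloc(&zp, 1800); /* a 4 KB page compressed to 1800 B */
	void *b = zbud_alloc(&zp, 2100);
	void *c = zbud_alloc(&zp, 100);  /* rejected: both buddies in use */

	printf("a=%p b=%p c=%p\n", a, b, c);
	return 0;
}
```

zsmalloc generalizes this by grouping objects into size classes that span several pages, trading zbud's simplicity and predictable worst case for higher density.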
Not all pages are equally suitable for compression. The operating system must intelligently select which pages to compress, considering multiple factors:
1. Compressibility: Some pages compress extremely well (90%+ reduction), while others are nearly incompressible (encrypted data, already-compressed media). Compressing incompressible data wastes CPU cycles.
2. Access Frequency: Frequently accessed (hot) pages incur decompression costs on every access. These should remain uncompressed. Infrequently accessed (cold) pages are ideal candidates.
3. Page Type: Different page types have different compression characteristics: anonymous heap and stack data typically compress well, zero-filled pages compress trivially (and can be deduplicated rather than stored), while encrypted buffers and already-compressed media such as JPEG or video barely compress at all.
4. Temporal Patterns: Pages that were recently active may become active again soon. The system considers recency and frequency patterns.
```c
/* Page selection heuristics for compression */

enum compression_decision {
	COMPRESS_YES,       /* Good candidate, compress */
	COMPRESS_NO_HOT,    /* Too frequently accessed */
	COMPRESS_NO_TYPE,   /* Incompressible page type */
	COMPRESS_NO_SIZE,   /* Estimated poor compression */
	COMPRESS_DEFER,     /* Check again later */
};

static unsigned int estimate_entropy(void *data, size_t len);

static enum compression_decision evaluate_page(struct page *page)
{
	/* Never compress kernel pages */
	if (PageKernel(page))
		return COMPRESS_NO_TYPE;

	/* Skip pages with pending I/O */
	if (PageWriteback(page) || PageLocked(page))
		return COMPRESS_DEFER;

	/* Check access frequency via hardware reference bit */
	if (page_recently_accessed(page))
		return COMPRESS_NO_HOT;

	/* Zero pages get special handling - don't waste pool space */
	if (page_all_zeros(page))
		return COMPRESS_YES; /* Will use zero-page deduplication */

	/* Sample first 64 bytes for entropy estimation */
	unsigned int entropy = estimate_entropy(page_address(page), 64);

	/* High entropy suggests incompressible data */
	if (entropy > ENTROPY_THRESHOLD)
		return COMPRESS_NO_SIZE;

	/* Check page age - prefer older pages */
	unsigned long age = get_page_age(page);
	if (age < MIN_COMPRESSION_AGE)
		return COMPRESS_DEFER;

	return COMPRESS_YES;
}

/* Fast entropy estimation using byte histogram */
static unsigned int estimate_entropy(void *data, size_t len)
{
	unsigned int histogram[256] = {0};
	unsigned char *bytes = data;

	/* Build histogram */
	for (size_t i = 0; i < len; i++)
		histogram[bytes[i]]++;

	/* Estimate entropy via histogram spread */
	unsigned int unique_bytes = 0;
	for (int i = 0; i < 256; i++) {
		if (histogram[i] > 0)
			unique_bytes++;
	}

	/* More unique values = higher entropy = less compressible */
	return (unique_bytes * 100) / 256;
}
```

Managing the compression pool efficiently is critical to overall system performance. This involves several key operations:
1. Pool Sizing: The compression pool size determines how many compressed pages can be stored. Too small, and pages overflow to swap. Too large, and uncompressed memory is needlessly sacrificed.
Modern systems use dynamic sizing based on current memory pressure, the compression ratio actually being achieved, available CPU headroom, and recent pool hit and eviction rates; a sketch of one such heuristic follows.
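A hedged sketch of such a sizing heuristic; every constant, threshold, and name here is hypothetical, chosen only to illustrate the shape of the policy:

```c
/* Hypothetical pool-sizing heuristic: grow the pool when memory is
 * tight and compression is paying off, shrink it when memory is
 * plentiful. All constants are illustrative, not from any kernel. */
#include <stdio.h>

#define POOL_MIN_PAGES 1024UL /* never shrink below 4 MB */
#define POOL_MAX_PCT   20UL   /* never exceed 20% of total RAM */

static unsigned long target_pool_pages(unsigned long total_pages,
				       unsigned long free_pages,
				       unsigned int ratio_x100) /* 234 = 2.34x */
{
	unsigned long max_pool = total_pages * POOL_MAX_PCT / 100;
	unsigned long free_pct = free_pages * 100 / total_pages;
	unsigned long target;

	if (ratio_x100 < 150)   /* compression barely paying off */
		return POOL_MIN_PAGES;

	if (free_pct > 25)      /* plenty of free memory: small pool */
		target = max_pool / 4;
	else if (free_pct > 10) /* moderate pressure: medium pool */
		target = max_pool / 2;
	else                    /* heavy pressure: allow the full pool */
		target = max_pool;

	return target < POOL_MIN_PAGES ? POOL_MIN_PAGES : target;
}

int main(void)
{
	/* 4 GB of 4 KB pages, ~8% free, 2.34x ratio -> large pool */
	printf("target = %lu pages\n",
	       target_pool_pages(1048576, 83886, 234));
	return 0;
}
```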
2. Space Allocation: Within the pool, compressed data of varying sizes must be allocated efficiently. This is where specialized allocators (zbud, z3fold, zsmalloc) prove essential.
3. Fragmentation Management: Over time, as pages are compressed and decompressed, the pool can become fragmented—small free spaces scattered throughout. Defragmentation strategies include compacting live objects toward one end of the pool, packing multiple objects into the same physical page (which zbud, z3fold, and zsmalloc do by design), and preferentially evicting entries whose release frees entire underlying pages.
Pool Eviction Policy:
When the compression pool is full and new pages need compression, the system must evict existing compressed pages. The eviction decision considers age (time since the page was compressed), compressed size (evicting one large object frees more pool space than several tiny ones), and whether an identical copy already exists on swap.
This creates a two-tier LRU system: one LRU ranks uncompressed pages as candidates for compression, and a second LRU ranks compressed pages within the pool as candidates for writeback. Cold pages flow from uncompressed memory into the pool, and from the pool out to swap.
When evicting from the compression pool to swap, the system can write compressed data directly to disk. This 'compressed swap' saves both disk space and I/O bandwidth. The decompression happens when the page is faulted back into memory.
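The following user-space sketch illustrates this compressed writeback: the coldest entry on the pool's LRU is unlinked and only its compressed bytes are written to a stand-in swap file. The structures and helpers are illustrative, not kernel APIs:

```c
/* Illustrative pool-eviction sketch: write the coldest compressed
 * entry (tail of the LRU) to a file standing in for the swap device.
 * Real kernels issue bio-based swap I/O instead. */
#include <stdio.h>
#include <stdlib.h>

struct centry {
	unsigned long pfn;          /* original page frame number */
	unsigned char *data;        /* compressed bytes in the pool */
	size_t size;                /* compressed size */
	struct centry *prev, *next; /* LRU: head = hot, tail = cold */
};

/* Write the coldest entry to 'swap' and unlink it from the LRU. */
static int evict_coldest(struct centry **head, struct centry **tail,
			 FILE *swap)
{
	struct centry *victim = *tail;

	if (!victim)
		return -1; /* pool empty */

	/* Compressed writeback: only 'size' bytes hit disk, not 4096. */
	fwrite(&victim->pfn, sizeof(victim->pfn), 1, swap);
	fwrite(&victim->size, sizeof(victim->size), 1, swap);
	fwrite(victim->data, 1, victim->size, swap);

	*tail = victim->prev;
	if (*tail)
		(*tail)->next = NULL;
	else
		*head = NULL;
	free(victim->data);
	free(victim);
	return 0;
}

int main(void)
{
	FILE *swap = tmpfile();
	struct centry *e = calloc(1, sizeof(*e));

	e->pfn = 42;
	e->size = 1800;
	e->data = calloc(1, e->size);

	struct centry *head = e, *tail = e;
	evict_coldest(&head, &tail, swap); /* writes 1800 B, not 4096 B */
	fclose(swap);
	return 0;
}
```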
When a process accesses a page that has been compressed, the system must transparently decompress it and provide the original data. This decompression path is performance-critical.
The Decompression Fault Path:
```c
/* Simplified decompression fault handler */

static int handle_compressed_page_fault(struct mm_struct *mm,
					struct vm_area_struct *vma,
					unsigned long address, pte_t *pte)
{
	struct compressed_page *cpage;
	struct page *new_page;
	void *dest;
	int ret;

	/* 1. Lookup the compressed page */
	cpage = lookup_compressed_page(mm, address >> PAGE_SHIFT);
	if (!cpage)
		return VM_FAULT_SIGBUS; /* Not found - fatal */

	/* 2. Allocate a new physical page */
	new_page = alloc_page(GFP_HIGHUSER_MOVABLE);
	if (!new_page) {
		/* Memory pressure - try to reclaim and retry */
		return VM_FAULT_OOM;
	}

	/* 3. Decompress into the new page */
	dest = kmap_atomic(new_page);
	ret = decompress(cpage->compressed_data, cpage->compressed_size,
			 dest, PAGE_SIZE, cpage->compression_algo);
	kunmap_atomic(dest);

	if (ret != PAGE_SIZE) {
		put_page(new_page);
		return VM_FAULT_SIGBUS; /* Decompression failed */
	}

	/* 4. Update page tables atomically */
	spin_lock(&mm->page_table_lock);
	pte_t entry = mk_pte(new_page, vma->vm_page_prot);
	set_pte_at(mm, address, pte, entry);

	/* 5. Update reverse mapping */
	page_add_anon_rmap(new_page, vma, address);
	spin_unlock(&mm->page_table_lock);

	/* 6. Remove from compression pool */
	remove_compressed_page(cpage);

	/* 7. Add to active LRU list */
	lru_cache_add_active_anon(new_page);

	/* Statistics */
	count_vm_event(DECOMPRESS_FAULT);

	return 0; /* Success - instruction will retry */
}
```

Latency Considerations:
Decompression latency directly impacts application performance. Key factors include:
| Factor | Impact | Mitigation |
|---|---|---|
| Algorithm speed | 100-1000 ns per 4KB | Choose fast algorithms (LZ4, LZO) |
| Cache effects | Misses to DRAM add ~100 ns each | Prefetch compressed data |
| Memory allocation | O(100) ns typical | Pre-allocate pages |
| Lock contention | Highly variable | Fine-grained locking |
| TLB updates | ~10-100 ns | Minimize TLB shootdowns |
Total latency: 300-1500 ns for typical decompression
Compare this to: an SSD swap-in at ~10,000 ns (10 μs), or an HDD swap-in at ~10,000,000 ns (10 ms).
Even at 10x the latency of an uncompressed access, compression is 10-10,000x faster than swap.
Some systems implement speculative decompression: when a compressed page is accessed, they also decompress nearby pages in the same region. This exploits spatial locality—if one page in a region is accessed, neighbors are likely to follow. This amortizes decompression overhead across multiple page accesses.
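A minimal sketch of this read-around policy, with the pool simulated by a residency bitmap and decompress() standing in for a real codec such as LZ4; all names are illustrative:

```c
/* Sketch of speculative decompression (read-around): when page 'pfn'
 * faults, also decompress resident neighbors in the same window. */
#include <stdio.h>
#include <stdbool.h>

#define NPAGES 16
#define WINDOW 4

static bool in_pool[NPAGES] = {
	[5] = true, [6] = true, [7] = true, [9] = true
};

static void decompress(unsigned long pfn) /* stub for a real codec */
{
	in_pool[pfn] = false;
	printf("decompressed pfn %lu\n", pfn);
}

static void fault_with_readaround(unsigned long pfn)
{
	if (!in_pool[pfn])
		return;      /* not compressed: nothing to do */
	decompress(pfn);     /* the faulting page: mandatory */

	/* Neighbors: best-effort, amortizing the fault over the region. */
	for (unsigned long n = pfn + 1; n < NPAGES && n <= pfn + WINDOW; n++) {
		if (!in_pool[n])
			break; /* stop at the first gap */
		decompress(n);
	}
}

int main(void)
{
	fault_with_readaround(5); /* decompresses 5, 6, 7; stops before 8 */
	return 0;
}
```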
Compressed page caches must integrate seamlessly with the kernel's memory reclaim subsystem—the component responsible for freeing memory when pressure rises. This integration is complex and critical.
Memory Zones and Compression:
Linux organizes memory into zones (DMA, DMA32, Normal, HighMem). The compression pool typically resides in the Normal zone. When reclaim runs, it must coordinate: which pages get compressed into the pool versus written to swap, how much the pool may grow at the expense of a zone's free memory, and when pool contents themselves must be written back to make room.
The Reclaim Decision Tree:
For each candidate page, reclaim conceptually walks a decision tree: clean file-backed pages are simply dropped (they can be re-read from their files); dirty file-backed pages are written back; anonymous pages are evaluated for compression (see evaluate_page above) and land in the pool if they qualify, or in swap if they do not.
kswapd Integration:
The kernel's background reclaim daemon (kswapd) is extended to consider compression: on waking at a watermark, it first tries to compress cold anonymous pages into the pool, falls back to swapping when the pool is full or a page compresses poorly, and writes old pool entries back to swap when the pool itself needs trimming. A sketch of this decision follows.
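A compact sketch of this compression-aware reclaim decision; the enum values and predicates are illustrative, not kernel symbols:

```c
/* Sketch of the compression-aware reclaim decision described above. */
#include <stdio.h>
#include <stdbool.h>

enum reclaim_action {
	RECLAIM_DROP,      /* clean file page: just free it */
	RECLAIM_WRITEBACK, /* dirty file page: write to its file */
	RECLAIM_COMPRESS,  /* anonymous page: into the pool */
	RECLAIM_SWAP,      /* anonymous page: pool full or incompressible */
};

static enum reclaim_action classify(bool file_backed, bool dirty,
				    bool pool_has_room, bool compresses_well)
{
	if (file_backed)
		return dirty ? RECLAIM_WRITEBACK : RECLAIM_DROP;
	if (pool_has_room && compresses_well)
		return RECLAIM_COMPRESS;
	return RECLAIM_SWAP;
}

int main(void)
{
	printf("%d\n", classify(false, true, true, true));  /* COMPRESS */
	printf("%d\n", classify(false, true, false, true)); /* SWAP */
	printf("%d\n", classify(true, false, true, true));  /* DROP */
	return 0;
}
```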
Watermarks and Thresholds:
The system maintains multiple watermarks:
| Watermark | Action | Purpose |
|---|---|---|
| min_free | Wake kswapd | Start background reclaim |
| low_free | Aggressive compression | Compress before swapping |
| high_free | Decompress if idle | Opportunistically expand |
| pool_full | Pool eviction | Move pool→swap |
| critical | Direct reclaim | Synchronous reclaim |
These watermarks are tunable and adaptive—they adjust based on observed workload patterns.
Compression adds CPU load during reclaim. If the system is already CPU-bound, aggressive compression can worsen overall performance. Modern implementations throttle compression based on CPU usage and prioritize swap when CPU is scarce.
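A hypothetical throttle check along these lines; the 75% CPU threshold and the 1.5x minimum ratio are illustrative values, not from any real implementation:

```c
/* Hypothetical compression throttle: compress only when the CPU has
 * headroom and compression has recently been paying off; otherwise
 * fall back to plain swap. */
#include <stdio.h>
#include <stdbool.h>

#define CPU_BUSY_LIMIT_PCT 75

static bool should_compress(unsigned int cpu_busy_pct,
			    unsigned int recent_ratio_x100)
{
	if (cpu_busy_pct > CPU_BUSY_LIMIT_PCT)
		return false; /* CPU-bound: swap instead */
	if (recent_ratio_x100 < 150)
		return false; /* compression not paying off */
	return true;
}

int main(void)
{
	printf("%d\n", should_compress(40, 234)); /* 1: compress */
	printf("%d\n", should_compress(90, 234)); /* 0: swap */
	return 0;
}
```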
Operating a compressed page cache effectively requires understanding its performance characteristics and tuning accordingly. Key metrics include:
Compression Ratio:
Ratio = Original Size / Compressed Size
Higher is better. Typical ratios: around 2-4x for anonymous heap and application data, effectively unbounded for zero-filled pages (deduplicated rather than stored), and close to 1x for encrypted or already-compressed content.
Hit Rate:
Hit Rate = Decompressions / (Decompressions + Swap-ins)
Measures how often compressed pages satisfy requests vs. requiring swap access. Higher is better.
Compression/Decompression Latency: Time per operation. Lower is better. Typical targets with fast algorithms (LZ4, LZO): roughly 1-3 μs per 4KB page to compress and under 1 μs to decompress.
Pool Efficiency:
Efficiency = Logical Bytes Stored / Physical Pool Size
Measures effective compression including fragmentation overhead.
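These metrics fall out directly from raw counters. Here is a small sketch computing pool efficiency and hit rate with the formulas above; the sample counters are chosen to line up with the hit-rate figure in the zswap example below:

```c
/* Computing the section's metrics from raw counters. The counter
 * values are illustrative samples, not live kernel output. */
#include <stdio.h>

int main(void)
{
	unsigned long stored_pages = 125432;    /* pages held in the pool */
	unsigned long loaded_pages = 98234;     /* faults served from pool */
	unsigned long swapped_in   = 4567;      /* faults served from disk */
	unsigned long pool_bytes   = 268435456; /* physical pool size */
	unsigned long page_size    = 4096;

	double logical = (double)stored_pages * page_size;

	/* Efficiency = Logical Bytes Stored / Physical Pool Size */
	printf("Pool efficiency: %.2fx\n", logical / pool_bytes);

	/* Hit Rate = Decompressions / (Decompressions + Swap-ins) */
	printf("Hit rate:        %.1f%%\n",
	       100.0 * loaded_pages / (loaded_pages + swapped_in));
	return 0;
}
```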
```
# Example zswap statistics from /sys/kernel/debug/zswap

# Pages stored and retrieved
stored_pages: 125432
same_filled_pages: 23456       # Zero-page deduplication
loaded_pages: 98234

# Pool utilization
pool_total_size: 268435456     # 256 MB
compressed_size: 134217728     # 128 MB effective
pool_limit_hit: 0              # Never hit limit

# Compression performance
compress_time_us: 2345678      # Total compression CPU time
decompress_time_us: 1234567    # Total decompression CPU time

# Rejection statistics
reject_compress_poor: 12345    # Poor compression ratio
reject_kmem_alloc_fail: 0      # Allocation failures
reject_reclaim_fail: 23        # Could not reclaim pool space

# Writeback to swap
written_back_pages: 4567       # Evicted from pool to swap

# Calculated metrics
Average compression ratio: 2.34x
Hit rate: 95.6%
Average compress latency: 1.87 μs/page
Average decompress latency: 0.98 μs/page
```

Use `cat /sys/kernel/debug/zswap/*` for zswap stats, `zramctl` for zram info, and `vmstat -s` for overall memory stats. Track these over time to understand workload patterns and optimize configuration.
We've explored the architecture and operation of compressed page caches—a critical technique for improving memory efficiency in modern operating systems. Let's consolidate the key concepts: compressed memory forms a new tier between DRAM and swap, roughly an order of magnitude faster than SSD swap; variable-size compressed data demands specialized allocators such as zbud, z3fold, and zsmalloc; page selection weighs compressibility, hotness, type, and age; the decompression fault path must stay in the microsecond range to preserve the latency advantage; and the pool must cooperate with reclaim watermarks while throttling under CPU pressure.
What's Next:
Now that we understand compressed page cache fundamentals, the next page explores zswap—Linux's implementation of a compressed swap cache that sits between the reclaim path and the swap device, providing first-class compressed caching in production systems.
You now understand the architecture, operation, and performance characteristics of compressed page caches. This foundation prepares you for understanding specific implementations (zswap, zram) and the tradeoffs involved in deploying memory compression in production systems.