Memory compression systems like zswap and zram are only as effective as their underlying compression algorithms. The choice of algorithm fundamentally determines the tradeoff between compression ratio (how much memory you save) and latency (how long compression/decompression takes).
This is not an abstract tradeoff. Consider a single page fault: the faulting thread stalls for the algorithm's full decompression time. An algorithm that adds 600 ns per fault costs 6 milliseconds of extra latency per second at 10,000 page faults per second. But Zstd might compress 30% better, fitting more pages in RAM and causing fewer faults overall.
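The arithmetic behind that claim is worth making explicit (the 600 ns per-fault delta is implied by the figures above, not separately measured):

```c
#include <stdint.h>

/* Added latency per second = per-fault decompression delta x fault rate.
 * 600 ns/fault at 10,000 faults/s -> 6,000,000 ns/s = 6 ms/s. */
uint64_t added_latency_ns_per_sec(uint64_t delta_ns_per_fault,
                                  uint64_t faults_per_sec)
{
    return delta_ns_per_fault * faults_per_sec;
}
```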
Understanding these algorithms—their mechanics, strengths, and optimal use cases—is essential for tuning memory compression effectively.
By the end of this page, you will understand the internal mechanics of major compression algorithms (LZ4, LZO, Zstd, DEFLATE), their performance characteristics, asymmetric compression strategies, and how to select the right algorithm for specific workloads and system constraints.
All compression algorithms exploit one core principle: redundancy elimination. Data that contains patterns—repeated sequences, predictable structures, uneven symbol distributions—can be represented more compactly.
Two Fundamental Approaches:
1. Dictionary-Based (LZ-family): These algorithms find repeated sequences and replace them with references to earlier occurrences. The 'dictionary' is the already-processed data.
Original: "the quick brown fox the quick brown cat"
Compressed: "the quick brown fox <ref:offset=20,len=16>cat"
The reference says 'copy 16 bytes starting 20 bytes back'—shorter than repeating the text.
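A naive match finder makes the idea concrete. This is a minimal illustrative sketch (an O(n²) scan; real compressors use hash tables to find candidates), and the `match_t` type and `find_match` name are made up for this example:

```c
#include <stddef.h>

/* LZ77-style match finding: at position `pos`, search the
 * already-processed prefix for the longest earlier occurrence
 * of the bytes starting at `pos`. */
typedef struct { size_t offset; size_t len; } match_t;

match_t find_match(const unsigned char *data, size_t pos, size_t size)
{
    match_t best = {0, 0};
    for (size_t cand = 0; cand < pos; cand++) {
        size_t len = 0;
        while (pos + len < size && data[cand + len] == data[pos + len])
            len++;
        if (len > best.len) {
            best.offset = pos - cand;  /* distance back to the match */
            best.len = len;
        }
    }
    return best;
}
```

Run on the example string, the second "the quick brown " (starting at byte 20) resolves to a 16-byte match 20 bytes back—exactly the reference shown above.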
2. Entropy Coding (Huffman, ANS): These algorithms assign shorter codes to more frequent symbols. In English text, 'e' appears more often than 'z', so 'e' gets a shorter binary code.
Traditional ASCII: 8 bits per character
Huffman for English: ~4.5 bits per character (average)
Modern algorithms combine both approaches: LZ-style matching finds repeated patterns, then entropy coding compresses the match references and literal bytes.
Key Performance Metrics:
| Metric | Definition | Impact |
|---|---|---|
| Compression Ratio | Original Size / Compressed Size | Higher = more memory saved |
| Compression Speed | MB/s during compression | Higher = less CPU during page-out |
| Decompression Speed | MB/s during decompression | Higher = lower latency on page-in |
| Memory Usage | Working memory during operation | Lower = less overhead per operation |
For memory compression, decompression speed is typically most critical because it directly impacts page fault latency—the moment a user waits for a page.
Memory compression has an asymmetric read/write pattern: pages are typically compressed once (page-out) but potentially decompressed multiple times (page faults). This asymmetry favors algorithms with fast decompression, even at the cost of slower compression.
LZ4 is the gold standard for speed-optimized compression. Developed by Yann Collet, it achieves decompression speeds exceeding 4 GB/s on modern hardware while maintaining reasonable compression ratios.
LZ4 Format:
LZ4 uses an extremely simple format: a sequence of 'tokens' where each token contains:
Token byte: [literal_len:4][match_len:4]
- High 4 bits: literal length (15 = extended with extra bytes)
- Low 4 bits: match length minus the 4-byte minimum (15 = extended with extra bytes)
- Packing both lengths into one byte keeps the format compact and cache-friendly
The decompression loop below illustrates the algorithm (simplified, with bounds checks omitted for clarity):

```c
#include <stdint.h>
#include <string.h>

/* Simplified LZ4 decompression core - illustrates the algorithm */
static int lz4_decompress(const uint8_t *src, size_t src_len,
                          uint8_t *dst, size_t dst_len)
{
    const uint8_t *src_end = src + src_len;
    uint8_t *dst_ptr = dst;

    while (src < src_end) {
        /* Read token byte */
        uint8_t token = *src++;
        uint32_t literal_len = token >> 4;    /* High 4 bits */
        uint32_t match_len = token & 0x0F;    /* Low 4 bits */

        /* Extended literal length */
        if (literal_len == 15) {
            uint8_t extra;
            do {
                extra = *src++;
                literal_len += extra;
            } while (extra == 255);
        }

        /* Copy literal bytes */
        if (literal_len > 0) {
            memcpy(dst_ptr, src, literal_len);
            src += literal_len;
            dst_ptr += literal_len;
        }

        /* Last sequence has no match */
        if (src >= src_end)
            break;

        /* Read match offset (2 bytes, little-endian) */
        uint16_t offset = src[0] | (src[1] << 8);
        src += 2;

        /* Extended match length */
        match_len += 4;    /* Minimum match is 4 bytes */
        if ((token & 0x0F) == 15) {
            uint8_t extra;
            do {
                extra = *src++;
                match_len += extra;
            } while (extra == 255);
        }

        /* Copy match (may overlap with destination!) */
        uint8_t *match_src = dst_ptr - offset;
        if (offset < match_len) {
            /* Overlapping copy: byte-by-byte for correctness */
            for (size_t i = 0; i < match_len; i++)
                dst_ptr[i] = match_src[i];
            dst_ptr += match_len;
        } else {
            /* Non-overlapping: fast memcpy */
            memcpy(dst_ptr, match_src, match_len);
            dst_ptr += match_len;
        }
    }

    return dst_ptr - dst;    /* Return decompressed size */
}
```

| Variant | Speed | Ratio | Use Case |
|---|---|---|---|
| lz4 | 4+ GB/s decomp | ~2.1:1 | Default for many systems |
| lz4hc | Slower comp, same decomp | ~2.5:1 | Batch compression, storage |
LZ4's speed comes from: (1) No entropy coding—bytes are copied directly, (2) Simple format with minimal branching, (3) Memory access patterns optimized for CPU cache, (4) SIMD-friendly design for vectorized copying. The tradeoff: lower compression ratio than more complex algorithms.
LZO (Lempel-Ziv-Oberhumer) is a mature compression algorithm that predates LZ4. It offers a balance between speed and compression ratio, and has been the historical default for many Linux memory compression features.
LZO vs LZ4:
| Aspect | LZO | LZ4 |
|---|---|---|
| Decompression speed | ~550 MB/s | ~4000 MB/s |
| Compression speed | ~400 MB/s | ~450 MB/s |
| Compression ratio | ~2.2:1 | ~2.1:1 |
| Memory usage | Higher | Lower |
| Code complexity | Higher | Lower |
| Maturity | 1996 | 2011 |
LZO1X Algorithm Overview:
LZO1X is the most commonly used variant. It uses a similar approach to LZ4 but with different encoding:
LZO instruction format:
- Control byte determines instruction type
- Literal run: 1-31 literal bytes follow
- Short match: 2-3 byte match at close offset
- Long match: 3+ bytes at larger offset
- Special encodings for common patterns
LZO-RLE Enhancement:
The LZO-RLE variant adds Run-Length Encoding for repeated byte sequences:
Before: [0x00][0x00][0x00][0x00][0x00][0x00][0x00][0x00]
After: [RLE_MARKER][0x00][8] (3 bytes instead of 8)
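A toy encoder shows the mechanism. This is NOT the real LZO-RLE bitstream—the `0xFF` marker and the 4-byte run threshold are illustrative choices (a real format must also escape literal marker bytes):

```c
#include <stddef.h>

/* Toy run-length encoder: emits [0xFF][byte][count] for runs of
 * 4 or more identical bytes, and raw literal bytes otherwise.
 * Returns the encoded length. */
size_t rle_encode(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < n; ) {
        size_t run = 1;
        while (i + run < n && src[i + run] == src[i] && run < 255)
            run++;
        if (run >= 4) {
            dst[out++] = 0xFF;                  /* run marker */
            dst[out++] = src[i];                /* repeated byte */
            dst[out++] = (unsigned char)run;    /* run length */
            i += run;
        } else {
            dst[out++] = src[i++];              /* literal byte */
        }
    }
    return out;
}
```

Eight zero bytes encode to the 3-byte sequence `[0xFF][0x00][8]`, matching the example above.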
This is particularly effective for memory pages containing zero-filled regions, which are common in freshly allocated (zero-on-demand) anonymous memory, sparse data structures, and buffers that were zeroed but never fully written.
```c
/* LZO compression in the Linux kernel (simplified) */
#include <linux/highmem.h>
#include <linux/lzo.h>
#include <linux/mm.h>
#include <linux/slab.h>

/* Kernel-provided interface (linux/lzo.h). wrkmem must be
 * LZO1X_1_MEM_COMPRESS bytes; returns LZO_E_OK on success. */
int lzo1x_1_compress(const unsigned char *src, size_t src_len,
                     unsigned char *dst, size_t *dst_len,
                     void *wrkmem);

/* The 'safe' variant checks bounds during decompression -
 * critical for security with untrusted input. */
int lzo1x_decompress_safe(const unsigned char *src, size_t src_len,
                          unsigned char *dst, size_t *dst_len);

/* Example kernel module usage */
static int compress_page(struct page *page, void *dst, size_t *dst_len)
{
    void *src;
    void *wrkmem;
    int ret;

    wrkmem = kmalloc(LZO1X_1_MEM_COMPRESS, GFP_KERNEL);
    if (!wrkmem)
        return -ENOMEM;

    src = kmap_atomic(page);
    *dst_len = PAGE_SIZE;    /* Maximum output size */
    ret = lzo1x_1_compress(src, PAGE_SIZE, dst, dst_len, wrkmem);
    kunmap_atomic(src);

    kfree(wrkmem);
    return (ret == LZO_E_OK) ? 0 : -EIO;
}
```

LZO was zswap's original default compressor and remains widely used. Its 'lzo-rle' variant is now the kernel default for zram due to improved handling of sparse pages. If you're unsure which to choose and want stability over maximum speed, LZO-RLE is a safe choice.
Zstandard (Zstd) represents the state of the art in general-purpose compression. Developed by Yann Collet (also LZ4's creator) at Facebook/Meta, it achieves compression ratios approaching DEFLATE/zlib while maintaining speed competitive with LZ4.
Internal Architecture:
Zstd combines multiple techniques: LZ77-style match finding, modern entropy coding (Finite State Entropy plus Huffman), and tunable compression levels that trade compression speed for ratio:
| Level | Compression | Decompression | Ratio | Use Case |
|---|---|---|---|---|
| 1 | ~400 MB/s | ~1500 MB/s | ~2.8:1 | Real-time, balanced |
| 3 (default) | ~250 MB/s | ~1500 MB/s | ~2.9:1 | General purpose |
| 5 | ~120 MB/s | ~1500 MB/s | ~3.0:1 | Better ratio, still fast |
| 9 | ~50 MB/s | ~1400 MB/s | ~3.2:1 | High compression |
| 19 | ~5 MB/s | ~1300 MB/s | ~3.5:1 | Maximum compression |
| 22 (max) | ~2 MB/s | ~1200 MB/s | ~3.6:1 | Archival only |
Critical insight: Decompression speed is nearly constant across levels, while compression speed varies dramatically. This makes Zstd ideal for write-once-read-many scenarios.
For memory compression, levels 1-3 are typical—we need fast compression during page-out.
Finite State Entropy (FSE):
Zstd's entropy coding uses FSE, a modern technique that's faster than Huffman while achieving similar compression:
Huffman: Build tree → Traverse tree per symbol → ~1 branch/symbol
FSE: State machine → Table lookup per symbol → ~0 branches/symbol
FSE also supports fractional bit precision—a symbol might use 4.7 bits on average, not just 4 or 5. This improves compression without computational cost during decompression.
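A two-symbol example (with illustrative probabilities) makes the fractional-bit advantage concrete. For a source emitting A with probability 0.9 and B with probability 0.1, Shannon entropy gives the lower bound:

```latex
H = -\sum_i p_i \log_2 p_i
  = -(0.9 \log_2 0.9 + 0.1 \log_2 0.1)
  \approx 0.469 \ \text{bits/symbol}
```

Huffman must assign at least one whole bit per symbol, so it cannot do better than 1.0 bit/symbol on this source; FSE can approach the 0.469-bit entropy limit, roughly halving the coded size for this distribution.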
```c
/* Zstd usage in kernel for zswap/zram (simplified) */
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/zstd.h>

/* Compression context - allocated once, reused */
struct zstd_compress_ctx {
    zstd_cctx *cctx;
    void *workspace;
    size_t workspace_size;
    int level;
};

static int zstd_ctx_init(struct zstd_compress_ctx *ctx, int level)
{
    zstd_parameters params;

    /* Get parameters for this level and page-sized input */
    params = zstd_get_params(level, PAGE_SIZE);

    /* Calculate workspace size */
    ctx->workspace_size = zstd_cctx_workspace_bound(&params.cParams);
    ctx->workspace = vmalloc(ctx->workspace_size);
    if (!ctx->workspace)
        return -ENOMEM;

    /* Initialize compression context */
    ctx->cctx = zstd_init_cctx(ctx->workspace, ctx->workspace_size);
    if (!ctx->cctx) {
        vfree(ctx->workspace);
        return -EINVAL;
    }

    ctx->level = level;
    return 0;
}

static int zstd_compress_page(struct zstd_compress_ctx *ctx,
                              const void *src, void *dst, size_t *dst_len)
{
    zstd_parameters params;
    size_t ret;

    params = zstd_get_params(ctx->level, PAGE_SIZE);
    ret = zstd_compress_cctx(ctx->cctx, dst, *dst_len,
                             src, PAGE_SIZE, &params);
    if (zstd_is_error(ret))
        return -EIO;

    *dst_len = ret;
    return 0;
}

/* Decompression is simpler - no persistent context needed for single-shot */
static int zstd_decompress_page(const void *src, size_t src_len,
                                void *dst, size_t *dst_len)
{
    size_t ws_size = zstd_dctx_workspace_bound();
    void *workspace = kmalloc(ws_size, GFP_KERNEL);
    zstd_dctx *dctx;
    size_t ret;

    if (!workspace)
        return -ENOMEM;

    dctx = zstd_init_dctx(workspace, ws_size);
    ret = zstd_decompress_dctx(dctx, dst, *dst_len, src, src_len);
    kfree(workspace);

    if (zstd_is_error(ret))
        return -EIO;

    *dst_len = ret;
    return 0;
}
```

Choose Zstd when: (1) Memory is scarce and every page counts, (2) CPU is plentiful relative to memory pressure, (3) Workload has good compressibility where extra ratio matters. The ~20-30% better compression over LZ4 can significantly reduce swap activity under heavy pressure.
DEFLATE is the venerable compression algorithm used in ZIP files, gzip, and PNG images. While largely superseded by Zstd for new applications, understanding DEFLATE provides historical context and explains why it's still found in some systems.
DEFLATE Architecture:
DEFLATE combines LZ77 dictionary matching (over a 32 KB sliding window) with Huffman entropy coding of the resulting literals and match references.
Key Characteristics:
| Aspect | DEFLATE/zlib | Zstd |
|---|---|---|
| Age | 1996 (RFC 1951) | 2015 |
| Compression | ~20 MB/s | ~250 MB/s |
| Decompression | ~300 MB/s | ~1500 MB/s |
| Ratio | ~3.0:1 | ~3.0:1 |
| Entropy coder | Huffman | FSE |
| Compatibility | Universal | Growing |
Why DEFLATE is Slower: Huffman decoding walks a code tree with roughly one branch per symbol, the bitstream must be unpacked bit by bit, and the 32 KB window limits match finding—per-symbol overheads that table-driven designs like Zstd's FSE largely avoid.
zlib in the Kernel:
The Linux kernel includes a zlib implementation used by compressed filesystems, network protocols, and boot-time image decompression, among other subsystems.
For memory compression, zlib is generally not recommended due to speed limitations. However, it remains available:
```bash
# Check if zlib/deflate is available for zswap
cat /sys/module/zswap/parameters/compressor
cat /proc/crypto | grep deflate
```
For zswap/zram, DEFLATE's slow decompression (300 MB/s vs LZ4's 4000 MB/s) makes it unsuitable. Page fault latency would increase 10x+. Use LZ4, LZO, or Zstd instead. DEFLATE is included in the kernel but optimized for file/network compression, not real-time memory operations.
Choosing the right algorithm requires understanding how each performs across different dimensions. Here's a comprehensive comparison based on real-world kernel benchmarks:
The figures below are representative numbers for 4 KiB pages; exact results vary by CPU and kernel version:
| Algorithm | Compress MB/s | Decompress MB/s | Ratio | Latency/4KB |
|---|---|---|---|---|
| LZ4 | 780 | 4200 | 2.1:1 | ~200 ns |
| LZ4HC | 45 | 4200 | 2.5:1 | ~200 ns |
| LZO-RLE | 400 | 550 | 2.2:1 | ~700 ns |
| LZO | 380 | 520 | 2.1:1 | ~750 ns |
| Zstd-1 | 400 | 1500 | 2.8:1 | ~270 ns |
| Zstd-3 | 250 | 1500 | 2.9:1 | ~270 ns |
| DEFLATE | 20 | 300 | 3.0:1 | ~1300 ns |
| 842 (HW) | varies | varies | ~2.5:1 | ~100 ns |
Benchmarks vary with actual workload. Test with YOUR data: run the system under typical load, monitor compression ratio and latency with each algorithm, then decide. A workload with highly compressible data benefits more from Zstd; one with poor compressibility gains little from any algorithm.
Modern CPUs and accelerators increasingly include dedicated compression hardware, offloading the CPU and achieving higher throughput with lower latency.
IBM POWER - NX842:
IBM POWER processors include the NX coprocessor with hardware compression supporting the '842' algorithm.
Intel QAT (QuickAssist Technology):
Intel offers PCIe accelerator cards and on-chip acceleration that can offload compression and cryptography.
```bash
#!/bin/bash
# Check for hardware compression support

echo "=== Hardware Compression Detection ==="
echo ""

# Check for IBM NX 842 (POWER)
if [ -d "/sys/devices/vio" ]; then
    if dmesg | grep -q "nx-842"; then
        echo "IBM NX 842: DETECTED"
        echo "  Driver: nx-842-powernv or nx-842-pseries"
    fi
fi

# Check for Intel QAT
if lspci | grep -i "QuickAssist" > /dev/null 2>&1; then
    echo "Intel QAT: DETECTED"
    lspci | grep -i "QuickAssist"
    # Check if driver loaded
    if lsmod | grep -q "qat_"; then
        echo "  Driver: LOADED"
    else
        echo "  Driver: NOT LOADED (install qat driver)"
    fi
fi

# Check available crypto algorithms
echo ""
echo "=== Available Compression Algorithms ==="
echo "From /proc/crypto:"
grep "^name" /proc/crypto | grep -E "lz4|lzo|zstd|deflate|842" | sort -u

# Check what zswap supports
echo ""
echo "Current zswap compressor: $(cat /sys/module/zswap/parameters/compressor 2>/dev/null || echo 'N/A')"
```

Hardware vs Software Trade-offs:
| Aspect | Software | Hardware |
|---|---|---|
| Latency (single page) | ~200-1000 ns | ~1-10 μs |
| Throughput (bulk) | ~1-5 GB/s | ~10-200 GB/s |
| CPU usage | 100% | Near 0% |
| Availability | Always | Specific hardware |
| Flexibility | High (any algorithm) | Fixed algorithms |
Key Insight: Hardware compression has higher single-operation latency due to the round-trip to the accelerator. It excels at high throughput with zero CPU usage. For memory compression of individual pages, software is often faster; hardware shines when offloading sustained high-volume compression.
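That crossover can be modeled with back-of-envelope arithmetic. The figures below are illustrative, loosely echoing the table above, and `setup_us` is a stand-in for the accelerator round-trip latency:

```c
/* Software compression time: just bytes / throughput.
 * 1 GB/s = 1000 bytes per microsecond. */
double sw_time_us(double bytes, double sw_gbps)
{
    return bytes / (sw_gbps * 1e3);
}

/* Hardware pays a fixed round-trip setup cost up front,
 * then streams data at much higher throughput. */
double hw_time_us(double bytes, double hw_gbps, double setup_us)
{
    return setup_us + bytes / (hw_gbps * 1e3);
}
```

With, say, 4 GB/s software vs 100 GB/s hardware plus a 5 µs round trip, software wins for a single 4 KiB page (~1 µs vs ~5 µs) while hardware wins decisively for a 10 MB batch—the fixed latency amortizes away.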
CPU vendors are integrating compression into the memory controller path, enabling transparent compression with near-zero latency. This could revolutionize memory compression—imagine effective doubling of memory capacity with no software overhead. Watch for developments in CXL (Compute Express Link) memory compression.
The effectiveness of compression depends heavily on the data being compressed. Understanding what makes data compressible helps predict and optimize compression performance.
Highly Compressible Data:
- Zero-filled regions — freshly allocated memory, sparse data structures, buffers that were zeroed but never fully written
- Text and code — natural language and source code, whose skewed symbol frequencies and repeated identifiers compress well
- Structured data — JSON, XML, protocol buffers, with repeated field names and markup
- Database buffers — table rows and indexes, with recurring values and padding
Entropy Analysis:
Compressibility correlates with Shannon entropy—a measure of information content. Higher entropy means less compression potential:
Entropy range: 0 (perfectly predictable) to 8 (random bytes)
Typical memory pages:
- Zero page: 0.0 bits/byte → infinite compression
- Text page: 4-5 bits/byte → 2:1 compression
- Binary data: 6-7 bits/byte → 1.3:1 compression
- Encrypted/random: 8.0 bits/byte → no compression
Memory compression systems often sample pages to estimate entropy before attempting compression—rejecting high-entropy pages to avoid wasting CPU.
Full-disk encryption (dm-crypt, LUKS) encrypts swap devices. Encrypted data is incompressible. If using zswap with encrypted swap, compression happens BEFORE encryption during writeback—so zswap still provides benefit. But a zram device layered beneath an encryption layer (e.g., dm-crypt on top of zram) sees only ciphertext and will compress poorly.
We've explored the compression algorithms that power memory compression systems: the dictionary-matching and entropy-coding foundations, speed-focused LZ4, the balanced LZO family, ratio-focused Zstd, legacy DEFLATE, hardware offload options, and the data characteristics that determine compressibility.
What's Next:
The final page of this module explores performance tradeoffs—bringing together everything we've learned to understand when memory compression helps, when it hurts, and how to optimize the complete system for specific workloads.
You now understand the internals of major compression algorithms, their performance characteristics, and selection criteria. This knowledge enables informed decisions about which algorithm to deploy for your specific memory compression needs.