Memory compression systems like zswap and zram are only as effective as their underlying compression algorithms. The choice of algorithm fundamentally determines the tradeoff between compression ratio (how much memory you save) and latency (how long compression/decompression takes).
This is not an abstract tradeoff. Consider a single page fault: the faulting thread stalls for the algorithm's full decompression time. An algorithm that adds 600 ns per fault costs 6 milliseconds of extra latency per second at 10,000 page faults per second. But Zstd might compress 30% better, fitting more pages in RAM and causing fewer faults overall.
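The arithmetic behind that claim is worth making explicit (the 600 ns per-fault delta is implied by the figures above, not separately measured):

```c
#include <stdint.h>

/* Added latency per second = per-fault decompression delta x fault rate.
 * 600 ns/fault at 10,000 faults/s -> 6,000,000 ns/s = 6 ms/s. */
uint64_t added_latency_ns_per_sec(uint64_t delta_ns_per_fault,
                                  uint64_t faults_per_sec)
{
    return delta_ns_per_fault * faults_per_sec;
}
```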
Understanding these algorithms—their mechanics, strengths, and optimal use cases—is essential for tuning memory compression effectively.
By the end of this page, you will understand the internal mechanics of major compression algorithms (LZ4, LZO, Zstd, DEFLATE), their performance characteristics, asymmetric compression strategies, and how to select the right algorithm for specific workloads and system constraints.
All compression algorithms exploit one core principle: redundancy elimination. Data that contains patterns—repeated sequences, predictable structures, uneven symbol distributions—can be represented more compactly.
Two Fundamental Approaches:
1. Dictionary-Based (LZ-family): These algorithms find repeated sequences and replace them with references to earlier occurrences. The 'dictionary' is the already-processed data.
Original: "the quick brown fox the quick brown cat"
Compressed: "the quick brown fox <ref:offset=20,len=16>cat"
The reference says 'copy 16 bytes starting 20 bytes back'—shorter than repeating the text.
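A naive match finder makes the idea concrete. This is a minimal illustrative sketch (an O(n²) scan; real compressors use hash tables to find candidates), and the `match_t` type and `find_match` name are made up for this example:

```c
#include <stddef.h>

/* LZ77-style match finding: at position `pos`, search the
 * already-processed prefix for the longest earlier occurrence
 * of the bytes starting at `pos`. */
typedef struct { size_t offset; size_t len; } match_t;

match_t find_match(const unsigned char *data, size_t pos, size_t size)
{
    match_t best = {0, 0};
    for (size_t cand = 0; cand < pos; cand++) {
        size_t len = 0;
        while (pos + len < size && data[cand + len] == data[pos + len])
            len++;
        if (len > best.len) {
            best.offset = pos - cand;  /* distance back to the match */
            best.len = len;
        }
    }
    return best;
}
```

Run on the example string, the second "the quick brown " (starting at byte 20) resolves to a 16-byte match 20 bytes back—exactly the reference shown above.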
2. Entropy Coding (Huffman, ANS): These algorithms assign shorter codes to more frequent symbols. In English text, 'e' appears more often than 'z', so 'e' gets a shorter binary code.
Traditional ASCII: 8 bits per character
Huffman for English: ~4.5 bits per character (average)
Modern algorithms combine both approaches: LZ-style matching finds repeated patterns, then entropy coding compresses the match references and literal bytes.
Key Performance Metrics:
| Metric | Definition | Impact |
|---|---|---|
| Compression Ratio | Original Size / Compressed Size | Higher = more memory saved |
| Compression Speed | MB/s during compression | Higher = less CPU during page-out |
| Decompression Speed | MB/s during decompression | Higher = lower latency on page-in |
| Memory Usage | Working memory during operation | Lower = less overhead per operation |
For memory compression, decompression speed is typically most critical because it directly impacts page fault latency—the moment a user waits for a page.
Memory compression has an asymmetric read/write pattern: pages are typically compressed once (page-out) but potentially decompressed multiple times (page faults). This asymmetry favors algorithms with fast decompression, even at the cost of slower compression.
LZ4 is the gold standard for speed-optimized compression. Developed by Yann Collet, it achieves decompression speeds exceeding 4 GB/s on modern hardware while maintaining reasonable compression ratios.
LZ4 Format:
LZ4 uses an extremely simple format: a sequence of 'tokens' where each token contains:
Token byte: [literal_len:4][match_len:4]
- High 4 bits: literal length (15 = extended with extra bytes)
- Low 4 bits: match length minus the 4-byte minimum (15 = extended with extra bytes)
- Packing both lengths into one byte keeps the format compact and cache-friendly
The decompression loop below illustrates the algorithm (simplified, with bounds checks omitted for clarity):

```c
#include <stdint.h>
#include <string.h>

/* Simplified LZ4 decompression core - illustrates the algorithm */
static int lz4_decompress(const uint8_t *src, size_t src_len,
                          uint8_t *dst, size_t dst_len)
{
    const uint8_t *src_end = src + src_len;
    uint8_t *dst_ptr = dst;

    while (src < src_end) {
        /* Read token byte */
        uint8_t token = *src++;
        uint32_t literal_len = token >> 4;    /* High 4 bits */
        uint32_t match_len = token & 0x0F;    /* Low 4 bits */

        /* Extended literal length */
        if (literal_len == 15) {
            uint8_t extra;
            do {
                extra = *src++;
                literal_len += extra;
            } while (extra == 255);
        }

        /* Copy literal bytes */
        if (literal_len > 0) {
            memcpy(dst_ptr, src, literal_len);
            src += literal_len;
            dst_ptr += literal_len;
        }

        /* Last sequence has no match */
        if (src >= src_end)
            break;

        /* Read match offset (2 bytes, little-endian) */
        uint16_t offset = src[0] | (src[1] << 8);
        src += 2;

        /* Extended match length */
        match_len += 4;    /* Minimum match is 4 bytes */
        if ((token & 0x0F) == 15) {
            uint8_t extra;
            do {
                extra = *src++;
                match_len += extra;
            } while (extra == 255);
        }

        /* Copy match (may overlap with destination!) */
        uint8_t *match_src = dst_ptr - offset;
        if (offset < match_len) {
            /* Overlapping copy: byte-by-byte for correctness */
            for (size_t i = 0; i < match_len; i++)
                dst_ptr[i] = match_src[i];
            dst_ptr += match_len;
        } else {
            /* Non-overlapping: fast memcpy */
            memcpy(dst_ptr, match_src, match_len);
            dst_ptr += match_len;
        }
    }

    return dst_ptr - dst;    /* Return decompressed size */
}
```

| Variant | Speed | Ratio | Use Case |
|---|---|---|---|
| lz4 | 4+ GB/s decomp | ~2.1:1 | Default for many systems |
| lz4hc | Slower comp, same decomp | ~2.5:1 | Batch compression, storage |
LZ4's speed comes from: (1) No entropy coding—bytes are copied directly, (2) Simple format with minimal branching, (3) Memory access patterns optimized for CPU cache, (4) SIMD-friendly design for vectorized copying. The tradeoff: lower compression ratio than more complex algorithms.
LZO (Lempel-Ziv-Oberhumer) is a mature compression algorithm that predates LZ4. It offers a balance between speed and compression ratio, and has been the historical default for many Linux memory compression features.
LZO vs LZ4:
| Aspect | LZO | LZ4 |
|---|---|---|
| Decompression speed | ~550 MB/s | ~4000 MB/s |
| Compression speed | ~400 MB/s | ~450 MB/s |
| Compression ratio | ~2.2:1 | ~2.1:1 |
| Memory usage | Higher | Lower |
| Code complexity | Higher | Lower |
| Maturity | 1996 | 2011 |
LZO1X Algorithm Overview:
LZO1X is the most commonly used variant. It uses a similar approach to LZ4 but with different encoding:
LZO instruction format:
- Control byte determines instruction type
- Literal run: 1-31 literal bytes follow
- Short match: 2-3 byte match at close offset
- Long match: 3+ bytes at larger offset
- Special encodings for common patterns
LZO-RLE Enhancement:
The LZO-RLE variant adds Run-Length Encoding for repeated byte sequences:
Before: [0x00][0x00][0x00][0x00][0x00][0x00][0x00][0x00]
After: [RLE_MARKER][0x00][8] (3 bytes instead of 8)
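A toy encoder shows the mechanism. This is NOT the real LZO-RLE bitstream—the `0xFF` marker and the 4-byte run threshold are illustrative choices (a real format must also escape literal marker bytes):

```c
#include <stddef.h>

/* Toy run-length encoder: emits [0xFF][byte][count] for runs of
 * 4 or more identical bytes, and raw literal bytes otherwise.
 * Returns the encoded length. */
size_t rle_encode(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < n; ) {
        size_t run = 1;
        while (i + run < n && src[i + run] == src[i] && run < 255)
            run++;
        if (run >= 4) {
            dst[out++] = 0xFF;                  /* run marker */
            dst[out++] = src[i];                /* repeated byte */
            dst[out++] = (unsigned char)run;    /* run length */
            i += run;
        } else {
            dst[out++] = src[i++];              /* literal byte */
        }
    }
    return out;
}
```

Eight zero bytes encode to the 3-byte sequence `[0xFF][0x00][8]`, matching the example above.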
This is particularly effective for memory pages containing zero-filled regions, which are common in freshly allocated (zero-on-demand) anonymous memory, sparse data structures, and buffers that were zeroed but never fully written.
```c
/* LZO compression in the Linux kernel (simplified) */
#include <linux/highmem.h>
#include <linux/lzo.h>
#include <linux/mm.h>
#include <linux/slab.h>

/* Kernel-provided interface (linux/lzo.h). wrkmem must be
 * LZO1X_1_MEM_COMPRESS bytes; returns LZO_E_OK on success. */
int lzo1x_1_compress(const unsigned char *src, size_t src_len,
                     unsigned char *dst, size_t *dst_len,
                     void *wrkmem);

/* The 'safe' variant checks bounds during decompression -
 * critical for security with untrusted input. */
int lzo1x_decompress_safe(const unsigned char *src, size_t src_len,
                          unsigned char *dst, size_t *dst_len);

/* Example kernel module usage */
static int compress_page(struct page *page, void *dst, size_t *dst_len)
{
    void *src;
    void *wrkmem;
    int ret;

    wrkmem = kmalloc(LZO1X_1_MEM_COMPRESS, GFP_KERNEL);
    if (!wrkmem)
        return -ENOMEM;

    src = kmap_atomic(page);
    *dst_len = PAGE_SIZE;    /* Maximum output size */
    ret = lzo1x_1_compress(src, PAGE_SIZE, dst, dst_len, wrkmem);
    kunmap_atomic(src);

    kfree(wrkmem);
    return (ret == LZO_E_OK) ? 0 : -EIO;
}
```

LZO was zswap's original default compressor and remains widely used. Its 'lzo-rle' variant is now the kernel default for zram due to improved handling of sparse pages. If you're unsure which to choose and want stability over maximum speed, LZO-RLE is a safe choice.
Zstandard (Zstd) represents the state of the art in general-purpose compression. Developed by Yann Collet (also LZ4's creator) at Facebook/Meta, it achieves compression ratios approaching DEFLATE/zlib while maintaining speed competitive with LZ4.
Internal Architecture:
Zstd combines multiple techniques: LZ77-style match finding, modern entropy coding (Finite State Entropy plus Huffman), and tunable compression levels that trade compression speed for ratio:
| Level | Compression | Decompression | Ratio | Use Case |
|---|---|---|---|---|
| 1 | ~400 MB/s | ~1500 MB/s | ~2.8:1 | Real-time, balanced |
| 3 (default) | ~250 MB/s | ~1500 MB/s | ~2.9:1 | General purpose |
| 5 | ~120 MB/s | ~1500 MB/s | ~3.0:1 | Better ratio, still fast |
| 9 | ~50 MB/s | ~1400 MB/s | ~3.2:1 | High compression |
| 19 | ~5 MB/s | ~1300 MB/s | ~3.5:1 | Maximum compression |
| 22 (max) | ~2 MB/s | ~1200 MB/s | ~3.6:1 | Archival only |
Critical insight: Decompression speed is nearly constant across levels, while compression speed varies dramatically. This makes Zstd ideal for write-once-read-many scenarios.
For memory compression, levels 1-3 are typical—we need fast compression during page-out.
Finite State Entropy (FSE):
Zstd's entropy coding uses FSE, a modern technique that's faster than Huffman while achieving similar compression:
Huffman: Build tree → Traverse tree per symbol → ~1 branch/symbol
FSE: State machine → Table lookup per symbol → ~0 branches/symbol
FSE also supports fractional bit precision—a symbol might use 4.7 bits on average, not just 4 or 5. This improves compression without computational cost during decompression.
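A two-symbol example (with illustrative probabilities) makes the fractional-bit advantage concrete. For a source emitting A with probability 0.9 and B with probability 0.1, Shannon entropy gives the lower bound:

```latex
H = -\sum_i p_i \log_2 p_i
  = -(0.9 \log_2 0.9 + 0.1 \log_2 0.1)
  \approx 0.469 \ \text{bits/symbol}
```

Huffman must assign at least one whole bit per symbol, so it cannot do better than 1.0 bit/symbol on this source; FSE can approach the 0.469-bit entropy limit, roughly halving the coded size for this distribution.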
```c
/* Zstd usage in kernel for zswap/zram (simplified) */
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/zstd.h>

/* Compression context - allocated once, reused */
struct zstd_compress_ctx {
    zstd_cctx *cctx;
    void *workspace;
    size_t workspace_size;
    int level;
};

static int zstd_ctx_init(struct zstd_compress_ctx *ctx, int level)
{
    zstd_parameters params;

    /* Get parameters for this level and page-sized input */
    params = zstd_get_params(level, PAGE_SIZE);

    /* Calculate workspace size */
    ctx->workspace_size = zstd_cctx_workspace_bound(&params.cParams);
    ctx->workspace = vmalloc(ctx->workspace_size);
    if (!ctx->workspace)
        return -ENOMEM;

    /* Initialize compression context */
    ctx->cctx = zstd_init_cctx(ctx->workspace, ctx->workspace_size);
    if (!ctx->cctx) {
        vfree(ctx->workspace);
        return -EINVAL;
    }

    ctx->level = level;
    return 0;
}

static int zstd_compress_page(struct zstd_compress_ctx *ctx,
                              const void *src, void *dst, size_t *dst_len)
{
    zstd_parameters params;
    size_t ret;

    params = zstd_get_params(ctx->level, PAGE_SIZE);
    ret = zstd_compress_cctx(ctx->cctx, dst, *dst_len,
                             src, PAGE_SIZE, &params);
    if (zstd_is_error(ret))
        return -EIO;

    *dst_len = ret;
    return 0;
}

/* Decompression is simpler - no persistent context needed for single-shot */
static int zstd_decompress_page(const void *src, size_t src_len,
                                void *dst, size_t *dst_len)
{
    size_t ws_size = zstd_dctx_workspace_bound();
    void *workspace = kmalloc(ws_size, GFP_KERNEL);
    zstd_dctx *dctx;
    size_t ret;

    if (!workspace)
        return -ENOMEM;

    dctx = zstd_init_dctx(workspace, ws_size);
    ret = zstd_decompress_dctx(dctx, dst, *dst_len, src, src_len);
    kfree(workspace);

    if (zstd_is_error(ret))
        return -EIO;

    *dst_len = ret;
    return 0;
}
```

Choose Zstd when: (1) Memory is scarce and every page counts, (2) CPU is plentiful relative to memory pressure, (3) Workload has good compressibility where extra ratio matters. The ~20-30% better compression over LZ4 can significantly reduce swap activity under heavy pressure.
DEFLATE is the venerable compression algorithm used in ZIP files, gzip, and PNG images. While largely superseded by Zstd for new applications, understanding DEFLATE provides historical context and explains why it's still found in some systems.
DEFLATE Architecture:
DEFLATE combines LZ77 dictionary matching (over a 32 KB sliding window) with Huffman entropy coding of the resulting literals and match references.
Key Characteristics:
| Aspect | DEFLATE/zlib | Zstd |
|---|---|---|
| Age | 1996 (RFC 1951) | 2015 |
| Compression | ~20 MB/s | ~250 MB/s |
| Decompression | ~300 MB/s | ~1500 MB/s |
| Ratio | ~3.0:1 | ~3.0:1 |
| Entropy coder | Huffman | FSE |
| Compatibility | Universal | Growing |
Why DEFLATE is Slower: Huffman decoding walks a code tree with roughly one branch per symbol, the bitstream must be unpacked bit by bit, and the 32 KB window limits match finding—per-symbol overheads that table-driven designs like Zstd's FSE largely avoid.
zlib in the Kernel:
The Linux kernel includes a zlib implementation used by compressed filesystems, network protocols, and boot-time image decompression, among other subsystems.
For memory compression, zlib is generally not recommended due to speed limitations. However, it remains available:
```bash
# Check if zlib/deflate is available for zswap
cat /sys/module/zswap/parameters/compressor
cat /proc/crypto | grep deflate
```
For zswap/zram, DEFLATE's slow decompression (300 MB/s vs LZ4's 4000 MB/s) makes it unsuitable. Page fault latency would increase 10x+. Use LZ4, LZO, or Zstd instead. DEFLATE is included in the kernel but optimized for file/network compression, not real-time memory operations.
Choosing the right algorithm requires understanding how each performs across different dimensions. Here's a comprehensive comparison based on real-world kernel benchmarks:
The figures below are representative numbers for 4 KiB pages; exact results vary by CPU and kernel version:
| Algorithm | Compress MB/s | Decompress MB/s | Ratio | Latency/4KB |
|---|---|---|---|---|
| LZ4 | 780 | 4200 | 2.1:1 | ~200 ns |
| LZ4HC | 45 | 4200 | 2.5:1 | ~200 ns |
| LZO-RLE | 400 | 550 | 2.2:1 | ~700 ns |
| LZO | 380 | 520 | 2.1:1 | ~750 ns |
| Zstd-1 | 400 | 1500 | 2.8:1 | ~270 ns |
| Zstd-3 | 250 | 1500 | 2.9:1 | ~270 ns |
| DEFLATE | 20 | 300 | 3.0:1 | ~1300 ns |
| 842 (HW) | varies | varies | ~2.5:1 | ~100 ns |
Benchmarks vary with actual workload. Test with YOUR data: run the system under typical load, monitor compression ratio and latency with each algorithm, then decide. A workload with highly compressible data benefits more from Zstd; one with poor compressibility gains little from any algorithm.
Modern CPUs and accelerators increasingly include dedicated compression hardware, offloading the CPU and achieving higher throughput with lower latency.
IBM POWER - NX842:
IBM POWER processors include the NX coprocessor with hardware compression supporting the '842' algorithm.
Intel QAT (QuickAssist Technology):
Intel offers PCIe accelerator cards and on-chip acceleration that can offload compression and cryptography.
```bash
#!/bin/bash
# Check for hardware compression support

echo "=== Hardware Compression Detection ==="
echo ""

# Check for IBM NX 842 (POWER)
if [ -d "/sys/devices/vio" ]; then
    if dmesg | grep -q "nx-842"; then
        echo "IBM NX 842: DETECTED"
        echo "  Driver: nx-842-powernv or nx-842-pseries"
    fi
fi

# Check for Intel QAT
if lspci | grep -i "QuickAssist" > /dev/null 2>&1; then
    echo "Intel QAT: DETECTED"
    lspci | grep -i "QuickAssist"
    # Check if driver loaded
    if lsmod | grep -q "qat_"; then
        echo "  Driver: LOADED"
    else
        echo "  Driver: NOT LOADED (install qat driver)"
    fi
fi

# Check available crypto algorithms
echo ""
echo "=== Available Compression Algorithms ==="
echo "From /proc/crypto:"
grep "^name" /proc/crypto | grep -E "lz4|lzo|zstd|deflate|842" | sort -u

# Check what zswap supports
echo ""
echo "Current zswap compressor: $(cat /sys/module/zswap/parameters/compressor 2>/dev/null || echo 'N/A')"
```

Hardware vs Software Trade-offs:
| Aspect | Software | Hardware |
|---|---|---|
| Latency (single page) | ~200-1000 ns | ~1-10 μs |
| Throughput (bulk) | ~1-5 GB/s | ~10-200 GB/s |
| CPU usage | 100% | Near 0% |
| Availability | Always | Specific hardware |
| Flexibility | High (any algorithm) | Fixed algorithms |
Key Insight: Hardware compression has higher single-operation latency due to the round-trip to the accelerator. It excels at high throughput with zero CPU usage. For memory compression of individual pages, software is often faster; hardware shines when offloading sustained high-volume compression.
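That crossover can be modeled with back-of-envelope arithmetic. The figures below are illustrative, loosely echoing the table above, and `setup_us` is a stand-in for the accelerator round-trip latency:

```c
/* Software compression time: just bytes / throughput.
 * 1 GB/s = 1000 bytes per microsecond. */
double sw_time_us(double bytes, double sw_gbps)
{
    return bytes / (sw_gbps * 1e3);
}

/* Hardware pays a fixed round-trip setup cost up front,
 * then streams data at much higher throughput. */
double hw_time_us(double bytes, double hw_gbps, double setup_us)
{
    return setup_us + bytes / (hw_gbps * 1e3);
}
```

With, say, 4 GB/s software vs 100 GB/s hardware plus a 5 µs round trip, software wins for a single 4 KiB page (~1 µs vs ~5 µs) while hardware wins decisively for a 10 MB batch—the fixed latency amortizes away.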
CPU vendors are integrating compression into the memory controller path, enabling transparent compression with near-zero latency. This could revolutionize memory compression—imagine effective doubling of memory capacity with no software overhead. Watch for developments in CXL (Compute Express Link) memory compression.
The effectiveness of compression depends heavily on the data being compressed. Understanding what makes data compressible helps predict and optimize compression performance.
Highly Compressible Data:
- Zero-filled regions — freshly allocated memory, sparse data structures, buffers that were zeroed but never fully written
- Text and code — natural language and source code, whose skewed symbol frequencies and repeated identifiers compress well
- Structured data — JSON, XML, protocol buffers, with repeated field names and markup
- Database buffers — table rows and indexes, with recurring values and padding
Entropy Analysis:
Compressibility correlates with Shannon entropy—a measure of information content. Higher entropy means less compression potential:
Entropy range: 0 (perfectly predictable) to 8 (random bytes)
Typical memory pages:
- Zero page: 0.0 bits/byte → infinite compression
- Text page: 4-5 bits/byte → 2:1 compression
- Binary data: 6-7 bits/byte → 1.3:1 compression
- Encrypted/random: 8.0 bits/byte → no compression
Memory compression systems often sample pages to estimate entropy before attempting compression—rejecting high-entropy pages to avoid wasting CPU.
Full-disk encryption (dm-crypt, LUKS) encrypts swap devices. Encrypted data is incompressible. If using zswap with encrypted swap, compression happens BEFORE encryption during writeback—so zswap still provides benefit. But a zram device layered beneath an encryption layer (e.g., dm-crypt on top of zram) sees only ciphertext and will compress poorly.
We've explored the compression algorithms that power memory compression systems: the dictionary-matching and entropy-coding foundations, speed-focused LZ4, the balanced LZO family, ratio-focused Zstd, legacy DEFLATE, hardware offload options, and the data characteristics that determine compressibility.
What's Next:
The final page of this module explores performance tradeoffs—bringing together everything we've learned to understand when memory compression helps, when it hurts, and how to optimize the complete system for specific workloads.
You now understand the internals of major compression algorithms, their performance characteristics, and selection criteria. This knowledge enables informed decisions about which algorithm to deploy for your specific memory compression needs.