Every byte stored, transmitted, or processed carries a cost. Storage isn't free, and neither are bandwidth, memory, or I/O time. Compression is the science and engineering of representing data in fewer bits without losing the information we care about.
Consider the scale: Netflix transmits over 400 petabytes of video per month. A 40% improvement in compression efficiency would save 160 petabytes of bandwidth—translating to hundreds of millions of dollars in infrastructure savings annually. AWS stores over 200 trillion objects in S3; even a 1% storage reduction represents exabytes of capacity.
But compression is not a free lunch. It trades computation for space. Every byte saved requires CPU cycles to compress and decompress. In storage systems, the choice of compression algorithm—and whether to compress at all—is a nuanced engineering decision that depends on data characteristics, access patterns, and hardware capabilities.
This page takes you deep into the world of compression algorithms, from foundational theory to practical implementation considerations for large-scale storage systems.
By the end of this page, you will understand the fundamental principles of data compression, master the differences between lossless and lossy compression, explore classic algorithms (LZ77, DEFLATE, LZW) and modern alternatives (LZ4, Zstandard, Brotli), and learn how to select the right algorithm for your storage architecture based on compression ratio, speed, and memory requirements.
Data compression is possible because most data contains redundancy—patterns, repetitions, and predictable structures. Compression exploits these regularities to represent data more efficiently.
Claude Shannon's information theory provides the mathematical foundation. The entropy of data measures its intrinsic information content—the minimum number of bits needed to represent it.
Entropy H(X) = -Σ P(x) * log₂(P(x))
For example, if a file contains only the letters A and B, each appearing 50% of the time, the entropy is H = -(0.5 log₂ 0.5 + 0.5 log₂ 0.5) = 1 bit per symbol, so no lossless coder can do better than 1 bit per character.
If the file is 90% A and 10% B, the entropy drops to H = -(0.9 log₂ 0.9 + 0.1 log₂ 0.1) ≈ 0.47 bits per symbol, so an ideal coder could average less than half a bit per character.
The fundamental limit: No lossless compression can beat entropy. Data that already appears random (encrypted data, already-compressed data) approaches maximum entropy and cannot be meaningfully compressed further.
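As a concrete illustration, the sketch below computes byte-level Shannon entropy for a buffer; the function name is an illustrative choice, not part of any standard library, but the formula is exactly the one above.

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: H = -sum(p * log2(p))."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(byte_entropy(b"ABABABAB"))             # 1.0 bit/byte: only two symbols
print(byte_entropy(bytes(range(256)) * 16))  # 8.0 bits/byte: uniform, looks random
# A value near 8 bits/byte suggests the data is already compressed or
# encrypted and is unlikely to shrink further.
```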
| Data Type | Typical Redundancy | Compression Ratio Potential | Best Approach |
|---|---|---|---|
| Plain text | Very high (letters, words repeat) | 3:1 to 10:1 | Dictionary + Huffman coding |
| Source code | High (keywords, patterns) | 3:1 to 5:1 | Dictionary-based |
| Log files | Very high (timestamps, repeated messages) | 10:1 to 50:1 | Dictionary + run-length |
| JSON/XML | High (repeated tags, structure) | 5:1 to 15:1 | Dictionary-based |
| Database pages | Moderate (depends on data) | 2:1 to 4:1 | Block compression |
| Images (lossless) | Moderate (spatial redundancy) | 1.5:1 to 3:1 | PNG, WEBP lossless |
| Images (lossy) | High (perceptual redundancy) | 10:1 to 100:1 | JPEG, WEBP lossy |
| Encrypted data | None (appears random) | 1:1 (no compression) | Don't compress |
| Already compressed | None | 1:1 or worse | Don't re-compress |
Attempting to compress already-compressed or random data can actually INCREASE file size. Compression algorithms add overhead (headers, dictionaries), and if no redundancy is found, the output is larger than the input. Production systems must detect uncompressible data and skip compression—checking entropy or using fast-abort heuristics.
Compression divides into two fundamentally different categories based on whether original data can be perfectly recovered.
Definition: The decompressed data is bit-for-bit identical to the original. No information is lost.
How it works: Lossless algorithms exploit statistical redundancy: repeated byte sequences, skewed symbol frequencies, and predictable structure are encoded more compactly using dictionary references, shorter codes for frequent symbols, and run-length encoding.
Use cases: text, source code, databases, log files, backups, and executables: any data where every bit must be recovered exactly.
Examples: gzip, LZ4, Zstandard, bzip2, LZMA
Definition: Original data cannot be perfectly recovered. Some information is permanently discarded, but the loss is (ideally) imperceptible or acceptable.
How it works: Lossy algorithms exploit human perception: fine visual detail, colors the eye barely distinguishes, and frequencies the ear cannot hear are discarded or coarsely quantized.
Use cases: images (JPEG, WebP), audio (MP3, AAC, Opus), and video (H.264, HEVC, AV1).
Trade-off: Lossy compression achieves dramatically higher ratios (10x-100x+) but requires careful tuning to avoid unacceptable quality loss.
For storage systems: We focus primarily on lossless compression because data integrity is paramount. Lossy compression is typically applied at the application layer (image processing, video transcoding) before storage.
The most widely used compression algorithms are built on dictionary-based techniques, pioneered by Abraham Lempel and Jacob Ziv in the 1970s. These algorithms build a "dictionary" of previously seen patterns and replace subsequent occurrences with references.
LZ77 uses a sliding window of recently processed data as an implicit dictionary. When a pattern repeats, it's encoded as a (distance, length) reference to where it appeared before.
Example:
Input: "ABRACADABRA"
1. Output 'A' (literal) - dictionary: "A"
2. Output 'B' (literal) - dictionary: "AB"
3. Output 'R' (literal) - dictionary: "ABR"
4. Output 'A' (back 3, length 1) - references first 'A'
5. Output 'C' (literal) - dictionary: "ABRAC"
6. Output 'A' (back 2, length 1) - references recent 'A'
7. Output 'D' (literal)
8. Output (back 7, length 4) - copies "ABRA"
Compressed: A, B, R, (3,1), C, (2,1), D, (7,4) (a decoder sketch for this token stream follows below)
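To make the (distance, length) mechanics concrete, here is a small decoder sketch for tokens like those above; the tuple-based token format is an illustrative simplification, not a standard on-disk layout.

```python
def lz77_decode(tokens) -> bytes:
    """Replay a token stream: plain ints are literal bytes, (distance, length) are copies."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            start = len(out) - distance
            for i in range(length):
                # Copy byte by byte so overlapping matches (distance < length)
                # work correctly, e.g. run-length-style repeats.
                out.append(out[start + i])
        else:
            out.append(token)
    return bytes(out)

tokens = [ord('A'), ord('B'), ord('R'), (3, 1), ord('C'), (2, 1), ord('D'), (7, 4)]
print(lz77_decode(tokens))  # b'ABRACADABRA'
```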
Sliding window trade-offs: a larger window finds matches that are farther apart and improves the ratio, but costs more memory and makes match searching slower. DEFLATE settled on a 32KB window, LZ4 uses 64KB, and Zstandard can use windows of many megabytes.
```python
from dataclasses import dataclass
from typing import Union, List

@dataclass
class Literal:
    """Raw byte that couldn't be matched."""
    byte: int

@dataclass
class BackReference:
    """Reference to previously seen data."""
    distance: int  # How far back to look
    length: int    # How many bytes to copy

Token = Union[Literal, BackReference]

def lz77_compress(data: bytes, window_size: int = 32768) -> List[Token]:
    """
    Simplified LZ77 compression.

    Real implementations use hash tables for O(1) pattern matching
    rather than this O(n) scan approach.
    """
    tokens = []
    pos = 0

    while pos < len(data):
        # Search backwards in sliding window for longest match
        best_distance = 0
        best_length = 0

        # Define search window (last window_size bytes)
        search_start = max(0, pos - window_size)

        # Try each position in window
        for search_pos in range(search_start, pos):
            # Count matching bytes
            match_length = 0
            while (pos + match_length < len(data) and
                   match_length < 258 and  # Max match length
                   data[search_pos + match_length] == data[pos + match_length]):
                match_length += 1

            # Keep best match (minimum 3 bytes to be worth referencing)
            if match_length >= 3 and match_length > best_length:
                best_distance = pos - search_pos
                best_length = match_length

        if best_length >= 3:
            # Output back-reference
            tokens.append(BackReference(best_distance, best_length))
            pos += best_length
        else:
            # Output literal byte
            tokens.append(Literal(data[pos]))
            pos += 1

    return tokens
```

LZ78/LZW build an explicit dictionary during compression, adding new phrases as they're encountered. LZW (Lempel-Ziv-Welch) became famous through its use in GIF images and the early Unix compress utility.
LZW Process:
1. Initialize the dictionary with all 256 single-byte strings.
2. Read the longest string W that is already in the dictionary.
3. Output the dictionary code for W.
4. Add W plus the next input byte as a new dictionary entry.
5. Repeat from step 2 until the input is exhausted.
Advantages: the dictionary never needs to be transmitted (the decoder rebuilds it symmetrically), and both encoding and decoding are simple and fast.
Disadvantages: the dictionary grows without bound and must eventually be reset, compression ratios generally trail LZ77 combined with entropy coding, and LZW was encumbered by Unisys patents until the early 2000s, which pushed much of the industry toward DEFLATE. A minimal encoder sketch is shown below.
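A minimal LZW encoder, using nothing beyond the standard library; real implementations also cap or reset the dictionary and pack codes into variable-width bit fields, which this sketch omits.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Classic LZW: emit dictionary codes, growing the dictionary as we go."""
    dictionary = {bytes([i]): i for i in range(256)}  # All single-byte strings
    next_code = 256
    current = b""
    output = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate                    # Keep extending the current phrase
        else:
            output.append(dictionary[current])     # Emit code for longest known phrase
            dictionary[candidate] = next_code      # Learn the new phrase
            next_code += 1
            current = bytes([byte])
    if current:
        output.append(dictionary[current])
    return output

print(lzw_compress(b"ABABABA"))  # [65, 66, 256, 258] -- later codes reuse learned phrases
```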
DEFLATE, used by gzip and zlib, combines LZ77 with Huffman coding in a two-stage pipeline: first, LZ77 replaces repeated strings with (distance, length) references; second, Huffman coding packs the resulting literals, lengths, and distances into variable-length bit codes.
Huffman coding assigns shorter bit sequences to more frequent symbols. If 'E' appears 100 times and 'Q' appears once, 'E' might get a 3-bit code while 'Q' gets a 10-bit code.
Together, the two stages achieve excellent compression ratios (typically 3:1 to 8:1 for text), and DEFLATE became the de facto standard for decades.
DEFLATE's combination of good compression ratio, reasonable speed, and patent-free status made it ubiquitous. It's the algorithm inside gzip, zlib, PNG images, ZIP files, and HTTP compression. Despite being from 1993, it remains widely used—though modern alternatives offer better trade-offs.
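Because DEFLATE ships in Python's standard library (zlib), checking the ratio-versus-level trade-off takes only a few lines; the sample input and the printed ratios are illustrative and obviously depend on the data.

```python
import zlib

text = b"GET /api/v1/users HTTP/1.1\r\nHost: example.com\r\n" * 200

for level in (1, 6, 9):  # 1 = fastest, 6 = zlib default, 9 = best ratio
    compressed = zlib.compress(text, level)
    print(f"level {level}: {len(text) / len(compressed):.1f}:1")

assert zlib.decompress(zlib.compress(text)) == text  # lossless round trip
```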
While DEFLATE remains common, modern algorithms offer dramatically better trade-offs between compression ratio and speed. Storage systems increasingly adopt these newer alternatives.
Developed by Yann Collet in 2011, LZ4 prioritizes decompression speed above all else. It's designed for scenarios where data is compressed once but read many times.
Characteristics: compression around 500 MB/s per core, decompression in the multiple-GB/s range, modest ratios (roughly 2:1 on text), and minimal memory requirements.
Use cases: the default compression in ZFS, block compression in RocksDB and other databases, in-memory caches, and real-time pipelines where decompression must never become the bottleneck.
Why so fast? LZ4 skips entropy coding entirely, uses a fixed 64KB window, and relies on simple byte-wise copy operations that modern CPUs (and their SIMD units) execute extremely quickly.
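A quick usage sketch, assuming the third-party lz4 package (pip install lz4); the frame API shown is its high-level interface, and the synthetic log data is only there to give the compressor something repetitive to work on.

```python
import lz4.frame

# Repetitive, log-like data compresses well even without an entropy-coding stage
data = b"2024-01-15T10:30:00Z INFO request completed status=200 path=/api/users\n" * 50_000

compressed = lz4.frame.compress(data)
restored = lz4.frame.decompress(compressed)

assert restored == data  # lossless round trip
print(f"ratio: {len(data) / len(compressed):.1f}:1")
```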
Developed by Yann Collet at Facebook (2016), Zstandard achieves the best of both worlds—compression ratios matching or exceeding gzip/bzip2 with speeds approaching LZ4.
Characteristics: 22 compression levels (1 = fastest, 22 = best ratio), decompression speed that stays high regardless of level, optional multithreaded compression, and built-in dictionary training for small objects.
Key innovations: Finite State Entropy (FSE) coding in place of plain Huffman, long-distance matching over very large windows, trainable dictionaries for small payloads, and a frame format designed for streaming and multithreaded compression.
Industry adoption: the Linux kernel (Btrfs, zram), RocksDB, Kafka, MongoDB's WiredTiger engine, and Facebook's internal storage systems, among many others.
```python
import zstandard as zstd

class ZstdCompressor:
    """
    Zstandard compression wrapper demonstrating key features.
    """

    def __init__(self, level: int = 3, threads: int = 0):
        """
        Initialize compressor.

        Args:
            level: 1 (fastest) to 22 (best ratio). Default 3 is balanced.
            threads: 0 = single-threaded (default); -1 = use all detected cores
        """
        self.level = level
        self.cctx = zstd.ZstdCompressor(level=level, threads=threads)
        self.dctx = zstd.ZstdDecompressor()

    def compress(self, data: bytes) -> bytes:
        """Compress data in memory."""
        return self.cctx.compress(data)

    def decompress(self, data: bytes) -> bytes:
        """Decompress data in memory."""
        return self.dctx.decompress(data)

    def compress_stream(self, input_file: str, output_file: str):
        """Stream compress from file to file (memory efficient)."""
        with open(input_file, 'rb') as fin:
            with open(output_file, 'wb') as fout:
                self.cctx.copy_stream(fin, fout)

    @staticmethod
    def create_dictionary(samples: list[bytes], dict_size: int = 110 * 1024):
        """
        Train a dictionary on sample data.

        Dictionaries dramatically improve compression for small, similar
        data (e.g., JSON API responses, log entries). Returns a dictionary
        that can be shared between compressor and decompressor.
        """
        return zstd.train_dictionary(dict_size, samples)

    def compress_with_dict(self, data: bytes, dictionary: zstd.ZstdCompressionDict):
        """
        Compress using pre-trained dictionary.

        For 1KB JSON objects, dictionaries can improve ratio from 1.5:1
        to 3:1 or better, because the dictionary captures common structure.
        """
        cctx = zstd.ZstdCompressor(dict_data=dictionary, level=self.level)
        return cctx.compress(data)

# Example: Benchmark different levels
def benchmark_levels(data: bytes):
    """Compare compression at different levels."""
    import time

    results = []
    for level in [1, 3, 5, 9, 15, 19]:
        compressor = ZstdCompressor(level=level)

        start = time.time()
        compressed = compressor.compress(data)
        compress_time = time.time() - start

        start = time.time()
        decompressed = compressor.decompress(compressed)
        decompress_time = time.time() - start

        ratio = len(data) / len(compressed)
        results.append({
            'level': level,
            'ratio': f"{ratio:.2f}:1",
            'compress_speed': f"{len(data)/compress_time/1e6:.1f} MB/s",
            'decompress_speed': f"{len(data)/decompress_time/1e6:.1f} MB/s",
        })

    return results
```

Developed by Google (2015), Brotli is optimized for web content delivery. It achieves 20-25% better compression than gzip on HTML/CSS/JavaScript.
Characteristics: a built-in static dictionary tuned for web text (common HTML tags, English words), quality levels 0-11, a window of up to 16MB, and decompression fast enough to run in browsers on every page load.
Use cases: pre-compressed static web assets (HTML, CSS, JavaScript), HTTP responses delivered via Content-Encoding: br, and WOFF2 web fonts.
Trade-off: Brotli excels when you can afford slow compression for fast, bandwidth-efficient delivery. CDNs pre-compress static assets with Brotli; dynamic content still uses gzip.
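A small sketch using the brotli Python bindings (pip install brotli); quality 11 mirrors how CDNs pre-compress static assets offline, while a lower quality stands in for on-the-fly compression of dynamic responses.

```python
import brotli

html = (b"<div class='item'><span class='name'>Example</span>"
        b"<span class='price'>9.99</span></div>\n" * 10_000)

ahead_of_time = brotli.compress(html, quality=11)  # slow, smallest output
on_the_fly = brotli.compress(html, quality=4)      # much faster, larger output

assert brotli.decompress(ahead_of_time) == html
print(f"q11: {len(html) / len(ahead_of_time):.1f}:1   q4: {len(html) / len(on_the_fly):.1f}:1")
```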
Snappy, Google's in-house compression algorithm (open-sourced in 2011), is optimized for speed over ratio.
Characteristics: LZ77-style matching with no entropy-coding stage, compression and decompression speeds in the hundreds of MB/s per core, and modest ratios around 2:1 on text.
Use cases: Bigtable and LevelDB (where it originated), Hadoop and Spark intermediate data, and default block compression in RocksDB and MongoDB's WiredTiger engine.
| Algorithm | Compress Speed | Decompress Speed | Ratio (text) | Best For |
|---|---|---|---|---|
| LZ4 | ★★★★★ (~500 MB/s) | ★★★★★ (~3000 MB/s) | ★★☆☆☆ (2.1:1) | Real-time, databases |
| Snappy | ★★★★★ (~400 MB/s) | ★★★★★ (~1500 MB/s) | ★★☆☆☆ (2.0:1) | Log streaming, fast stores |
| Zstd (fast) | ★★★★☆ (~300 MB/s) | ★★★★★ (~1400 MB/s) | ★★★☆☆ (2.9:1) | General purpose, balanced |
| Zstd (default) | ★★★☆☆ (~150 MB/s) | ★★★★★ (~1300 MB/s) | ★★★★☆ (3.1:1) | Storage, archives |
| gzip | ★★☆☆☆ (~30 MB/s) | ★★★☆☆ (~300 MB/s) | ★★★☆☆ (3.0:1) | Legacy compatibility |
| Brotli | ★☆☆☆☆ (~5 MB/s) | ★★★☆☆ (~350 MB/s) | ★★★★★ (3.5:1) | Static web content |
| LZMA/xz | ★☆☆☆☆ (~3 MB/s) | ★★☆☆☆ (~100 MB/s) | ★★★★★ (3.8:1) | Archives, distributions |
For new storage systems, Zstandard is typically the best choice. It dominates the speed-ratio Pareto frontier—at any given compression ratio, Zstd is usually the fastest option. Its dictionary mode is particularly valuable for compressing many small, similar objects. Facebook reports 2-3x compression speed improvements over gzip with equal or better ratios.
Dictionary compression (LZ77/LZ78) produces a stream of symbols—literals and back-references. Entropy coding is the final step that converts these symbols into minimum-length bit sequences.
Huffman coding assigns variable-length codes based on symbol frequency: frequent symbols get short codes, rare symbols get long codes.
Building a Huffman Tree:
1. Count the frequency of every symbol.
2. Create a leaf node for each symbol and place all nodes in a min-heap keyed by frequency.
3. Repeatedly remove the two lowest-frequency nodes and join them under a new parent whose frequency is their sum.
4. When one node remains, it is the root; label each left edge 0 and each right edge 1, and read each symbol's code from its root-to-leaf path.
Example:
```
Symbols: A(45)  B(25)  C(15)  D(10)  E(5)

Huffman tree:
            (100)
           /     \
        A(45)    (55)
                /    \
             B(25)   (30)
                    /    \
                 C(15)   (15)
                        /    \
                     D(10)   E(5)

Codes: A=0, B=10, C=110, D=1110, E=1111

Original: 8 bits per symbol
Huffman:  (45×1 + 25×2 + 15×3 + 10×4 + 5×4) / 100 = 2.0 bits/symbol
Compression: 4:1 ratio
```
Limitation: Huffman requires at least 1 bit per symbol. For highly skewed distributions (e.g., 99% zeros), Huffman approaches but can't reach the theoretical entropy.
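A compact sketch of the tree-building procedure above using a min-heap; the helper's name is arbitrary, and the exact bit patterns can differ from the worked example depending on tie-breaking, even though the code lengths match.

```python
import heapq

def huffman_codes(freqs: dict[str, int]) -> dict[str, str]:
    """Build Huffman codes from symbol frequencies via repeated min-merges."""
    # Heap entries: (frequency, unique tie-breaker, {symbol: code_so_far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in heap[0][2]}  # degenerate single-symbol case
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)      # two least-frequent subtrees
        f2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}   # left edge = 0
        merged.update({s: "1" + c for s, c in codes2.items()})  # right edge = 1
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman_codes({"A": 45, "B": 25, "C": 15, "D": 10, "E": 5}))
# {'A': '0', 'B': '10', 'C': '110', 'D': '1111', 'E': '1110'}
# Same code lengths as the worked example (1, 2, 3, 4, 4 bits).
```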
Arithmetic coding encodes the entire message as a single fractional number (0 to 1), achieving near-optimal entropy without the 1-bit-per-symbol limitation.
Concept: start with the interval [0, 1); for each symbol, narrow the interval to the sub-range whose width is proportional to that symbol's probability; the final, very precise fraction inside the last interval identifies the entire message.
Advantages: approaches the entropy limit to within a tiny constant, effectively spending fractional bits per symbol, and adapts naturally to highly skewed or context-dependent probability models.
Disadvantages: markedly slower than Huffman coding (more arithmetic per symbol), historically entangled in patents, and trickier to implement correctly with finite-precision integers.
Used in Zstandard, FSE achieves near-arithmetic-coding ratios with near-Huffman speeds.
How FSE works: FSE is a table-based variant of asymmetric numeral systems (ANS). The encoder keeps a single integer state and steps through a precomputed table; more probable symbols cause smaller state changes and therefore consume fewer output bits, which lets the coder spend fractional bits per symbol using only table lookups, shifts, and adds.
FSE vs Huffman vs Arithmetic:
| Entropy Coder | Compression Quality | Encoding Speed | Decoding Speed |
|---|---|---|---|
| Huffman | Good (but ≥1 bit/symbol) | Very fast | Very fast |
| Arithmetic | Optimal | Slow | Slow |
| FSE | Near-optimal | Fast | Very fast |
FSE's speed comes from table-based operations that vectorize well and avoid floating-point math.
Most storage systems don't implement custom entropy coders—they rely on complete algorithms like Zstd or LZ4. But understanding entropy coding helps diagnose why certain data compresses poorly (already near maximum entropy) and why modern algorithms outperform older ones (better entropy coding stages).
Integrating compression into storage systems requires careful architectural decisions about where, when, and how to compress.
The most common approach in storage systems: compress fixed-size blocks (typically 4KB-128KB) independently.
Advantages: any block can be decompressed independently, which preserves random access; blocks can be compressed and decompressed in parallel; and a corrupted block damages only that block.
Disadvantages: patterns that span block boundaries are missed, so ratios are lower than whole-file compression; every block pays its own header overhead; and variable-sized compressed blocks complicate on-disk space allocation.
Examples: ZFS compresses per record (128KB by default), Btrfs per extent, and RocksDB per SST data block, as the production table later in this section shows.
```python
import zstandard as zstd
from dataclasses import dataclass
from typing import Optional

@dataclass
class CompressedBlock:
    """A storage block with compression metadata."""
    compressed_data: bytes
    original_size: int
    compression_algorithm: str
    is_compressed: bool  # False if compression didn't help

class BlockCompressor:
    """
    Block-level compression for storage systems.

    Key features:
    - Skip compression if data is incompressible
    - Handle blocks of various sizes
    - Track compression statistics
    """

    def __init__(
        self,
        block_size: int = 65536,        # 64KB default
        compression_level: int = 3,
        min_savings_percent: int = 10,  # Skip if <10% space saved
    ):
        self.block_size = block_size
        self.min_savings = min_savings_percent / 100.0
        self.cctx = zstd.ZstdCompressor(level=compression_level)
        self.dctx = zstd.ZstdDecompressor()

        # Statistics
        self.stats = {
            'blocks_processed': 0,
            'blocks_compressed': 0,
            'blocks_skipped': 0,
            'bytes_before': 0,
            'bytes_after': 0,
        }

    def compress_block(self, data: bytes) -> CompressedBlock:
        """
        Compress a block, skipping if compression isn't beneficial.

        Real storage systems check this to avoid wasting CPU on
        incompressible data (already compressed, encrypted, random).
        """
        self.stats['blocks_processed'] += 1
        self.stats['bytes_before'] += len(data)
        original_size = len(data)

        # Try compression
        compressed = self.cctx.compress(data)

        # Check if compression actually helped
        savings = 1.0 - (len(compressed) / original_size)

        if savings >= self.min_savings:
            # Compression worthwhile
            self.stats['blocks_compressed'] += 1
            self.stats['bytes_after'] += len(compressed)
            return CompressedBlock(
                compressed_data=compressed,
                original_size=original_size,
                compression_algorithm='zstd',
                is_compressed=True,
            )
        else:
            # Compression not worth it - store original
            self.stats['blocks_skipped'] += 1
            self.stats['bytes_after'] += original_size
            return CompressedBlock(
                compressed_data=data,
                original_size=original_size,
                compression_algorithm='none',
                is_compressed=False,
            )

    def decompress_block(self, block: CompressedBlock) -> bytes:
        """Decompress a block if it was compressed."""
        if block.is_compressed:
            return self.dctx.decompress(block.compressed_data)
        else:
            return block.compressed_data

    def get_ratio(self) -> float:
        """Calculate overall compression ratio."""
        if self.stats['bytes_after'] == 0:
            return 1.0
        return self.stats['bytes_before'] / self.stats['bytes_after']
```

Like deduplication, compression timing affects performance:
Inline compression: data is compressed on the write path, before it reaches disk. Writes pay the CPU cost immediately and latency rises slightly, but data is always stored in its final, compact form and never has to be rewritten.
Background compression: data is written uncompressed first and compressed later by a background task (or during compaction). Write latency is unaffected, but the system temporarily consumes extra space and must spend additional I/O rewriting the data.
Always compress before encrypting:
✓ Correct: PlainText → Compress → Encrypt → Store
✗ Wrong: PlainText → Encrypt → Compress → Store (no compression)
Security note: Compression before encryption can leak information through compressed size (see CRIME/BREACH attacks on HTTPS). For sensitive data, consider padding or skipping compression.
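A minimal sketch of the correct ordering, assuming the zstandard and cryptography packages; Fernet stands in here for whatever encryption layer your system actually uses.

```python
import zstandard as zstd
from cryptography.fernet import Fernet

key = Fernet.generate_key()
fernet = Fernet(key)
cctx, dctx = zstd.ZstdCompressor(), zstd.ZstdDecompressor()

def store(plaintext: bytes) -> bytes:
    # Compress first, while the redundancy is still visible, then encrypt
    return fernet.encrypt(cctx.compress(plaintext))

def load(stored: bytes) -> bytes:
    # Reverse order on the read path: decrypt, then decompress
    return dctx.decompress(fernet.decrypt(stored))

record = b'{"user": "alice", "action": "login"}' * 100
assert load(store(record)) == record
# Encrypting first would leave the compressor nothing but random-looking
# bytes to work with, so the compression step would save nothing.
```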
| System | Default Algorithm | Block Size | Inline/Background |
|---|---|---|---|
| ZFS | LZ4 (default) | Record size (128KB default) | Inline |
| Btrfs | zstd (default kernel 5.1+) | Extent-based | Inline |
| RocksDB | Snappy/LZ4/Zstd | SST block (4KB-64KB) | Compaction-time |
| PostgreSQL | pglz/LZ4/Zstd | TOAST threshold (2KB) | Inline |
| MongoDB | Snappy/Zstd | WiredTiger blocks | Inline |
| Kafka | Snappy/LZ4/Zstd/gzip | Batch-level | Producer-side |
Standard compression struggles with small data. A 500-byte JSON object can't build meaningful patterns from just 500 bytes—there isn't enough context. Dictionary compression solves this by providing a pre-trained dictionary of common patterns.
Compression algorithms need context to find patterns. For small inputs, the sliding window never fills up, there is little repetition to exploit within the object itself, and fixed per-stream overhead (headers, block framing) eats a large fraction of any savings, so ratios often land near 1:1.
Yet many storage systems handle millions of small objects: cached API responses, individual log lines, key-value pairs, queue messages, and per-user configuration records.
A compression dictionary captures common patterns from a training set. When compressing new data, the algorithm has immediate access to these patterns.
Zstandard Dictionary Training:
1. Collect thousands of representative samples of the data you intend to compress.
2. Run the trainer (zstd --train on the command line, or train_dictionary in the library) to produce a dictionary, typically around 100KB.
3. Ship the same dictionary to every compressor and decompressor; data compressed with a dictionary can only be decompressed with that dictionary.
Improvements: For 1KB JSON objects, a trained dictionary commonly improves the ratio from roughly 1.5:1 to 3:1 or better, because the dictionary already contains the shared field names and structure.
```python
import zstandard as zstd
import json
import random

def train_json_dictionary(sample_generator, num_samples: int = 10000):
    """
    Train a Zstandard dictionary on sample JSON objects.

    The dictionary captures common patterns (field names, value formats,
    structure) that appear across many objects.
    """
    # Collect samples
    samples = []
    for i, obj in enumerate(sample_generator):
        if i >= num_samples:
            break
        samples.append(json.dumps(obj).encode('utf-8'))

    print(f"Training dictionary on {len(samples)} samples")
    print(f"Total sample size: {sum(len(s) for s in samples) / 1e6:.2f} MB")

    # Train dictionary
    dictionary = zstd.train_dictionary(
        dict_size=110 * 1024,  # 110KB dictionary
        samples=samples,
        level=3,  # Training level
    )

    print(f"Dictionary size: {len(dictionary.as_bytes())} bytes")
    print(f"Dictionary ID: {dictionary.dict_id()}")

    return dictionary

def evaluate_dictionary(dictionary, test_samples: list[bytes]):
    """
    Compare compression with and without dictionary.
    """
    # Without dictionary
    cctx_plain = zstd.ZstdCompressor(level=3)

    # With dictionary
    cctx_dict = zstd.ZstdCompressor(level=3, dict_data=dictionary)

    plain_total = 0
    dict_total = 0
    original_total = 0

    for sample in test_samples:
        original_total += len(sample)
        plain_total += len(cctx_plain.compress(sample))
        dict_total += len(cctx_dict.compress(sample))

    print(f"\nResults on {len(test_samples)} test samples:")
    print(f"  Original size: {original_total:,} bytes")
    print(f"  Without dict:  {plain_total:,} bytes ({original_total/plain_total:.2f}:1)")
    print(f"  With dict:     {dict_total:,} bytes ({original_total/dict_total:.2f}:1)")
    print(f"  Dictionary gain: {plain_total/dict_total:.2f}x improvement")

# Example usage for API response compression
def api_response_generator():
    """Generate sample API responses for dictionary training."""
    endpoints = ['user', 'order', 'product', 'cart', 'inventory']
    statuses = ['active', 'pending', 'completed', 'cancelled']

    for _ in range(100000):
        yield {
            'id': random.randint(1, 1000000),
            'type': random.choice(endpoints),
            'status': random.choice(statuses),
            'created_at': '2024-01-15T10:30:00Z',
            'updated_at': '2024-01-15T11:45:00Z',
            'metadata': {
                'version': '2.0',
                'source': 'api-gateway',
                'region': random.choice(['us-east-1', 'eu-west-1', 'ap-south-1']),
            },
            'data': {
                'name': f'Item_{random.randint(1, 10000)}',
                'price_cents': random.randint(100, 100000),
                'quantity': random.randint(1, 100),
            }
        }
```

Dictionary compression shines when: (1) Objects are small (<10KB), (2) Objects share common structure (JSON schemas, log formats), (3) You can train on representative samples, (4) You can distribute dictionaries to decompressors. Perfect for: API caching, log storage, key-value stores, message queues with typed messages.
Compression is a fundamental tool in storage engineering, but there's no universally "best" algorithm. Selection depends on your specific requirements.
1. What's your priority? If it's maximum space savings, reach for Zstandard at a high level, Brotli, or xz; if it's throughput and latency, reach for LZ4 or Snappy; Zstandard at its default level is the sensible middle ground.
2. What's your access pattern? Write-once, read-many data (archives, static assets) justifies slow, high-ratio compression; hot, frequently rewritten data needs fast compression and even faster decompression.
3. What's your data type? Text, JSON, and logs compress extremely well (and benefit from dictionaries when small); media files and encrypted or already-compressed data should be detected and skipped.
4. What hardware do you have? Plenty of spare CPU cores favor higher compression levels and multithreaded compressors; CPU-constrained or latency-critical paths favor LZ4 or Snappy, and some platforms can offload compression to dedicated accelerators. The sketch below condenses these questions into code.
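As a closing sketch, the checklist above can be condensed into a tiny (and deliberately opinionated) helper; the function and its categories are illustrative, not a library API.

```python
def choose_algorithm(priority: str, data_type: str, hot_path: bool) -> str:
    """Very rough mapping from the checklist above to an algorithm choice."""
    if data_type in {"encrypted", "already_compressed", "media"}:
        return "none"       # incompressible: skip compression entirely
    if hot_path or priority == "speed":
        return "lz4"        # latency-sensitive reads and writes
    if priority == "ratio":
        return "zstd-19"    # archives, cold data, pre-compressed static assets
    return "zstd-3"         # balanced default for most storage workloads

print(choose_algorithm("speed", "json", hot_path=True))      # lz4
print(choose_algorithm("ratio", "logs", hot_path=False))     # zstd-19
print(choose_algorithm("balanced", "text", hot_path=False))  # zstd-3
```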
You now understand the fundamentals of data compression—from information theory to practical algorithm selection. You can evaluate trade-offs between compression ratio, speed, and memory usage to choose the right algorithm for your storage system. Next, we'll explore how deduplication and compression work together to maximize overall storage efficiency.