Every byte stored, transmitted, or processed carries a cost. Storage isn't free, and neither are bandwidth, memory, or I/O time. Compression is the science and engineering of representing data in fewer bits without losing the information we care about.
Consider the scale: Netflix transmits over 400 petabytes of video per month. A 40% improvement in compression efficiency would save 160 petabytes of bandwidth—translating to hundreds of millions of dollars in infrastructure savings annually. AWS stores over 200 trillion objects in S3; even a 1% storage reduction represents exabytes of capacity.
But compression is not a free lunch. It trades computation for space. Every byte saved requires CPU cycles to compress and decompress. In storage systems, the choice of compression algorithm—and whether to compress at all—is a nuanced engineering decision that depends on data characteristics, access patterns, and hardware capabilities.
This page takes you deep into the world of compression algorithms, from foundational theory to practical implementation considerations for large-scale storage systems.
By the end of this page, you will understand the fundamental principles of data compression, master the differences between lossless and lossy compression, explore classic algorithms (LZ77, DEFLATE, LZW) and modern alternatives (LZ4, Zstandard, Brotli), and learn how to select the right algorithm for your storage architecture based on compression ratio, speed, and memory requirements.
Data compression is possible because most data contains redundancy—patterns, repetitions, and predictable structures. Compression exploits these regularities to represent data more efficiently.
Claude Shannon's information theory provides the mathematical foundation. The entropy of data measures its intrinsic information content—the minimum number of bits needed to represent it.
Entropy H(X) = -Σ P(x) * log₂(P(x))
For example, if a file contains only the letters A and B, each appearing 50% of the time, the entropy is H = -(0.5 log₂ 0.5 + 0.5 log₂ 0.5) = 1 bit per symbol, so no lossless coder can do better than 1 bit per character.
If the file is 90% A and 10% B, the entropy drops to H = -(0.9 log₂ 0.9 + 0.1 log₂ 0.1) ≈ 0.47 bits per symbol, so an ideal coder could average less than half a bit per character.
The fundamental limit: No lossless compression can beat entropy. Data that already appears random (encrypted data, already-compressed data) approaches maximum entropy and cannot be meaningfully compressed further.
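As a concrete illustration, the sketch below computes byte-level Shannon entropy for a buffer; the function name is an illustrative choice, not part of any standard library, but the formula is exactly the one above.

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: H = -sum(p * log2(p))."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(byte_entropy(b"ABABABAB"))             # 1.0 bit/byte: only two symbols
print(byte_entropy(bytes(range(256)) * 16))  # 8.0 bits/byte: uniform, looks random
# A value near 8 bits/byte suggests the data is already compressed or
# encrypted and is unlikely to shrink further.
```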
| Data Type | Typical Redundancy | Compression Ratio Potential | Best Approach |
|---|---|---|---|
| Plain text | Very high (letters, words repeat) | 3:1 to 10:1 | Dictionary + Huffman coding |
| Source code | High (keywords, patterns) | 3:1 to 5:1 | Dictionary-based |
| Log files | Very high (timestamps, repeated messages) | 10:1 to 50:1 | Dictionary + run-length |
| JSON/XML | High (repeated tags, structure) | 5:1 to 15:1 | Dictionary-based |
| Database pages | Moderate (depends on data) | 2:1 to 4:1 | Block compression |
| Images (lossless) | Moderate (spatial redundancy) | 1.5:1 to 3:1 | PNG, WEBP lossless |
| Images (lossy) | High (perceptual redundancy) | 10:1 to 100:1 | JPEG, WEBP lossy |
| Encrypted data | None (appears random) | 1:1 (no compression) | Don't compress |
| Already compressed | None | 1:1 or worse | Don't re-compress |
Attempting to compress already-compressed or random data can actually INCREASE file size. Compression algorithms add overhead (headers, dictionaries), and if no redundancy is found, the output is larger than the input. Production systems must detect uncompressible data and skip compression—checking entropy or using fast-abort heuristics.
Compression divides into two fundamentally different categories based on whether original data can be perfectly recovered.
Definition: The decompressed data is bit-for-bit identical to the original. No information is lost.
How it works: Lossless algorithms exploit statistical redundancy: repeated byte sequences, skewed symbol frequencies, and predictable structure are encoded more compactly using dictionary references, shorter codes for frequent symbols, and run-length encoding.
Use cases: text, source code, databases, log files, backups, and executables: any data where every bit must be recovered exactly.
Examples: gzip, LZ4, Zstandard, bzip2, LZMA
Definition: Original data cannot be perfectly recovered. Some information is permanently discarded, but the loss is (ideally) imperceptible or acceptable.
How it works: Lossy algorithms exploit human perception: fine visual detail, colors the eye barely distinguishes, and frequencies the ear cannot hear are discarded or coarsely quantized.
Use cases: images (JPEG, WebP), audio (MP3, AAC, Opus), and video (H.264, HEVC, AV1).
Trade-off: Lossy compression achieves dramatically higher ratios (10x-100x+) but requires careful tuning to avoid unacceptable quality loss.
For storage systems: We focus primarily on lossless compression because data integrity is paramount. Lossy compression is typically applied at the application layer (image processing, video transcoding) before storage.
The most widely used compression algorithms are built on dictionary-based techniques, pioneered by Abraham Lempel and Jacob Ziv in the 1970s. These algorithms build a "dictionary" of previously seen patterns and replace subsequent occurrences with references.
LZ77 uses a sliding window of recently processed data as an implicit dictionary. When a pattern repeats, it's encoded as a (distance, length) reference to where it appeared before.
Example:
Input: "ABRACADABRA"
1. Output 'A' (literal) - dictionary: "A"
2. Output 'B' (literal) - dictionary: "AB"
3. Output 'R' (literal) - dictionary: "ABR"
4. Output 'A' (back 3, length 1) - references first 'A'
5. Output 'C' (literal) - dictionary: "ABRAC"
6. Output 'A' (back 2, length 1) - references recent 'A'
7. Output 'D' (literal)
8. Output (back 7, length 4) - copies "ABRA"
Compressed: A, B, R, (3,1), C, (2,1), D, (7,4) (a decoder sketch for this token stream follows below)
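To make the (distance, length) mechanics concrete, here is a small decoder sketch for tokens like those above; the tuple-based token format is an illustrative simplification, not a standard on-disk layout.

```python
def lz77_decode(tokens) -> bytes:
    """Replay a token stream: plain ints are literal bytes, (distance, length) are copies."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            start = len(out) - distance
            for i in range(length):
                # Copy byte by byte so overlapping matches (distance < length)
                # work correctly, e.g. run-length-style repeats.
                out.append(out[start + i])
        else:
            out.append(token)
    return bytes(out)

tokens = [ord('A'), ord('B'), ord('R'), (3, 1), ord('C'), (2, 1), ord('D'), (7, 4)]
print(lz77_decode(tokens))  # b'ABRACADABRA'
```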
Sliding window trade-offs: a larger window finds matches that are farther apart and improves the ratio, but costs more memory and makes match searching slower. DEFLATE settled on a 32KB window, LZ4 uses 64KB, and Zstandard can use windows of many megabytes.
```python
from dataclasses import dataclass
from typing import Union, List

@dataclass
class Literal:
    """Raw byte that couldn't be matched."""
    byte: int

@dataclass
class BackReference:
    """Reference to previously seen data."""
    distance: int  # How far back to look
    length: int    # How many bytes to copy

Token = Union[Literal, BackReference]

def lz77_compress(data: bytes, window_size: int = 32768) -> List[Token]:
    """
    Simplified LZ77 compression.

    Real implementations use hash tables for O(1) pattern matching
    rather than this O(n) scan approach.
    """
    tokens = []
    pos = 0

    while pos < len(data):
        # Search backwards in sliding window for longest match
        best_distance = 0
        best_length = 0

        # Define search window (last window_size bytes)
        search_start = max(0, pos - window_size)

        # Try each position in window
        for search_pos in range(search_start, pos):
            # Count matching bytes
            match_length = 0
            while (pos + match_length < len(data) and
                   match_length < 258 and  # Max match length
                   data[search_pos + match_length] == data[pos + match_length]):
                match_length += 1

            # Keep best match (minimum 3 bytes to be worth referencing)
            if match_length >= 3 and match_length > best_length:
                best_distance = pos - search_pos
                best_length = match_length

        if best_length >= 3:
            # Output back-reference
            tokens.append(BackReference(best_distance, best_length))
            pos += best_length
        else:
            # Output literal byte
            tokens.append(Literal(data[pos]))
            pos += 1

    return tokens
```

LZ78/LZW build an explicit dictionary during compression, adding new phrases as they're encountered. LZW (Lempel-Ziv-Welch) became famous through its use in GIF images and the early Unix compress utility.
LZW Process:
1. Initialize the dictionary with all 256 single-byte strings.
2. Read the longest string W that is already in the dictionary.
3. Output the dictionary code for W.
4. Add W plus the next input byte as a new dictionary entry.
5. Repeat from step 2 until the input is exhausted.
Advantages: the dictionary never needs to be transmitted (the decoder rebuilds it symmetrically), and both encoding and decoding are simple and fast.
Disadvantages: the dictionary grows without bound and must eventually be reset, compression ratios generally trail LZ77 combined with entropy coding, and LZW was encumbered by Unisys patents until the early 2000s, which pushed much of the industry toward DEFLATE. A minimal encoder sketch is shown below.
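A minimal LZW encoder, using nothing beyond the standard library; real implementations also cap or reset the dictionary and pack codes into variable-width bit fields, which this sketch omits.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Classic LZW: emit dictionary codes, growing the dictionary as we go."""
    dictionary = {bytes([i]): i for i in range(256)}  # All single-byte strings
    next_code = 256
    current = b""
    output = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate                    # Keep extending the current phrase
        else:
            output.append(dictionary[current])     # Emit code for longest known phrase
            dictionary[candidate] = next_code      # Learn the new phrase
            next_code += 1
            current = bytes([byte])
    if current:
        output.append(dictionary[current])
    return output

print(lzw_compress(b"ABABABA"))  # [65, 66, 256, 258] -- later codes reuse learned phrases
```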
DEFLATE, used by gzip and zlib, combines LZ77 with Huffman coding in a two-stage pipeline: first, LZ77 replaces repeated strings with (distance, length) references; second, Huffman coding packs the resulting literals, lengths, and distances into variable-length bit codes.
Huffman coding assigns shorter bit sequences to more frequent symbols. If 'E' appears 100 times and 'Q' appears once, 'E' might get a 3-bit code while 'Q' gets a 10-bit code.
Together, the two stages achieve excellent compression ratios (typically 3:1 to 8:1 for text), and DEFLATE became the de facto standard for decades.
DEFLATE's combination of good compression ratio, reasonable speed, and patent-free status made it ubiquitous. It's the algorithm inside gzip, zlib, PNG images, ZIP files, and HTTP compression. Despite being from 1993, it remains widely used—though modern alternatives offer better trade-offs.
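Because DEFLATE ships in Python's standard library (zlib), checking the ratio-versus-level trade-off takes only a few lines; the sample input and the printed ratios are illustrative and obviously depend on the data.

```python
import zlib

text = b"GET /api/v1/users HTTP/1.1\r\nHost: example.com\r\n" * 200

for level in (1, 6, 9):  # 1 = fastest, 6 = zlib default, 9 = best ratio
    compressed = zlib.compress(text, level)
    print(f"level {level}: {len(text) / len(compressed):.1f}:1")

assert zlib.decompress(zlib.compress(text)) == text  # lossless round trip
```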
While DEFLATE remains common, modern algorithms offer dramatically better trade-offs between compression ratio and speed. Storage systems increasingly adopt these newer alternatives.
Developed by Yann Collet in 2011, LZ4 prioritizes decompression speed above all else. It's designed for scenarios where data is compressed once but read many times.
Characteristics: compression around 500 MB/s per core, decompression in the multiple-GB/s range, modest ratios (roughly 2:1 on text), and minimal memory requirements.
Use cases: the default compression in ZFS, block compression in RocksDB and other databases, in-memory caches, and real-time pipelines where decompression must never become the bottleneck.
Why so fast? LZ4 skips entropy coding entirely, uses a fixed 64KB window, and relies on simple byte-wise copy operations that modern CPUs (and their SIMD units) execute extremely quickly.
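A quick usage sketch, assuming the third-party lz4 package (pip install lz4); the frame API shown is its high-level interface, and the synthetic log data is only there to give the compressor something repetitive to work on.

```python
import lz4.frame

# Repetitive, log-like data compresses well even without an entropy-coding stage
data = b"2024-01-15T10:30:00Z INFO request completed status=200 path=/api/users\n" * 50_000

compressed = lz4.frame.compress(data)
restored = lz4.frame.decompress(compressed)

assert restored == data  # lossless round trip
print(f"ratio: {len(data) / len(compressed):.1f}:1")
```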
Developed by Yann Collet at Facebook (2016), Zstandard achieves the best of both worlds—compression ratios matching or exceeding gzip/bzip2 with speeds approaching LZ4.
Characteristics: 22 compression levels (1 = fastest, 22 = best ratio), decompression speed that stays high regardless of level, optional multithreaded compression, and built-in dictionary training for small objects.
Key innovations: Finite State Entropy (FSE) coding in place of plain Huffman, long-distance matching over very large windows, trainable dictionaries for small payloads, and a frame format designed for streaming and multithreaded compression.
Industry adoption: the Linux kernel (Btrfs, zram), RocksDB, Kafka, MongoDB's WiredTiger engine, and Facebook's internal storage systems, among many others.
```python
import zstandard as zstd

class ZstdCompressor:
    """
    Zstandard compression wrapper demonstrating key features.
    """

    def __init__(self, level: int = 3, threads: int = 0):
        """
        Initialize compressor.

        Args:
            level: 1 (fastest) to 22 (best ratio). Default 3 is balanced.
            threads: 0 = single-threaded (default); -1 = use all detected cores
        """
        self.level = level
        self.cctx = zstd.ZstdCompressor(level=level, threads=threads)
        self.dctx = zstd.ZstdDecompressor()

    def compress(self, data: bytes) -> bytes:
        """Compress data in memory."""
        return self.cctx.compress(data)

    def decompress(self, data: bytes) -> bytes:
        """Decompress data in memory."""
        return self.dctx.decompress(data)

    def compress_stream(self, input_file: str, output_file: str):
        """Stream compress from file to file (memory efficient)."""
        with open(input_file, 'rb') as fin:
            with open(output_file, 'wb') as fout:
                self.cctx.copy_stream(fin, fout)

    @staticmethod
    def create_dictionary(samples: list[bytes], dict_size: int = 110 * 1024):
        """
        Train a dictionary on sample data.

        Dictionaries dramatically improve compression for small, similar
        data (e.g., JSON API responses, log entries). Returns a dictionary
        that can be shared between compressor and decompressor.
        """
        return zstd.train_dictionary(dict_size, samples)

    def compress_with_dict(self, data: bytes, dictionary: zstd.ZstdCompressionDict):
        """
        Compress using pre-trained dictionary.

        For 1KB JSON objects, dictionaries can improve ratio from 1.5:1
        to 3:1 or better, because the dictionary captures common structure.
        """
        cctx = zstd.ZstdCompressor(dict_data=dictionary, level=self.level)
        return cctx.compress(data)

# Example: Benchmark different levels
def benchmark_levels(data: bytes):
    """Compare compression at different levels."""
    import time

    results = []
    for level in [1, 3, 5, 9, 15, 19]:
        compressor = ZstdCompressor(level=level)

        start = time.time()
        compressed = compressor.compress(data)
        compress_time = time.time() - start

        start = time.time()
        decompressed = compressor.decompress(compressed)
        decompress_time = time.time() - start

        ratio = len(data) / len(compressed)
        results.append({
            'level': level,
            'ratio': f"{ratio:.2f}:1",
            'compress_speed': f"{len(data)/compress_time/1e6:.1f} MB/s",
            'decompress_speed': f"{len(data)/decompress_time/1e6:.1f} MB/s",
        })

    return results
```

Developed by Google (2015), Brotli is optimized for web content delivery. It achieves 20-25% better compression than gzip on HTML/CSS/JavaScript.
Characteristics: a built-in static dictionary tuned for web text (common HTML tags, English words), quality levels 0-11, a window of up to 16MB, and decompression fast enough to run in browsers on every page load.
Use cases: pre-compressed static web assets (HTML, CSS, JavaScript), HTTP responses delivered via Content-Encoding: br, and WOFF2 web fonts.
Trade-off: Brotli excels when you can afford slow compression for fast, bandwidth-efficient delivery. CDNs pre-compress static assets with Brotli; dynamic content still uses gzip.
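A small sketch using the brotli Python bindings (pip install brotli); quality 11 mirrors how CDNs pre-compress static assets offline, while a lower quality stands in for on-the-fly compression of dynamic responses.

```python
import brotli

html = (b"<div class='item'><span class='name'>Example</span>"
        b"<span class='price'>9.99</span></div>\n" * 10_000)

ahead_of_time = brotli.compress(html, quality=11)  # slow, smallest output
on_the_fly = brotli.compress(html, quality=4)      # much faster, larger output

assert brotli.decompress(ahead_of_time) == html
print(f"q11: {len(html) / len(ahead_of_time):.1f}:1   q4: {len(html) / len(on_the_fly):.1f}:1")
```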
Snappy, Google's in-house compression algorithm (open-sourced in 2011), is optimized for speed over ratio.
Characteristics: LZ77-style matching with no entropy-coding stage, compression and decompression speeds in the hundreds of MB/s per core, and modest ratios around 2:1 on text.
Use cases: Bigtable and LevelDB (where it originated), Hadoop and Spark intermediate data, and default block compression in RocksDB and MongoDB's WiredTiger engine.
| Algorithm | Compress Speed | Decompress Speed | Ratio (text) | Best For |
|---|---|---|---|---|
| LZ4 | ★★★★★ (~500 MB/s) | ★★★★★ (~3000 MB/s) | ★★☆☆☆ (2.1:1) | Real-time, databases |
| Snappy | ★★★★★ (~400 MB/s) | ★★★★★ (~1500 MB/s) | ★★☆☆☆ (2.0:1) | Log streaming, fast stores |
| Zstd (fast) | ★★★★☆ (~300 MB/s) | ★★★★★ (~1400 MB/s) | ★★★☆☆ (2.9:1) | General purpose, balanced |
| Zstd (default) | ★★★☆☆ (~150 MB/s) | ★★★★★ (~1300 MB/s) | ★★★★☆ (3.1:1) | Storage, archives |
| gzip | ★★☆☆☆ (~30 MB/s) | ★★★☆☆ (~300 MB/s) | ★★★☆☆ (3.0:1) | Legacy compatibility |
| Brotli | ★☆☆☆☆ (~5 MB/s) | ★★★☆☆ (~350 MB/s) | ★★★★★ (3.5:1) | Static web content |
| LZMA/xz | ★☆☆☆☆ (~3 MB/s) | ★★☆☆☆ (~100 MB/s) | ★★★★★ (3.8:1) | Archives, distributions |
For new storage systems, Zstandard is typically the best choice. It dominates the speed-ratio Pareto frontier—at any given compression ratio, Zstd is usually the fastest option. Its dictionary mode is particularly valuable for compressing many small, similar objects. Facebook reports 2-3x compression speed improvements over gzip with equal or better ratios.
Dictionary compression (LZ77/LZ78) produces a stream of symbols—literals and back-references. Entropy coding is the final step that converts these symbols into minimum-length bit sequences.
Huffman coding assigns variable-length codes based on symbol frequency: frequent symbols get short codes, rare symbols get long codes.
Building a Huffman Tree:
1. Count the frequency of every symbol.
2. Create a leaf node for each symbol and place all nodes in a min-heap keyed by frequency.
3. Repeatedly remove the two lowest-frequency nodes and join them under a new parent whose frequency is their sum.
4. When one node remains, it is the root; label each left edge 0 and each right edge 1, and read each symbol's code from its root-to-leaf path.
Example:
```
Symbols: A(45)  B(25)  C(15)  D(10)  E(5)

Huffman tree:
            (100)
           /     \
        A(45)    (55)
                /    \
             B(25)   (30)
                    /    \
                 C(15)   (15)
                        /    \
                     D(10)   E(5)

Codes: A=0, B=10, C=110, D=1110, E=1111

Original: 8 bits per symbol
Huffman:  (45×1 + 25×2 + 15×3 + 10×4 + 5×4) / 100 = 2.0 bits/symbol
Compression: 4:1 ratio
```
Limitation: Huffman requires at least 1 bit per symbol. For highly skewed distributions (e.g., 99% zeros), Huffman approaches but can't reach the theoretical entropy.
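A compact sketch of the tree-building procedure above using a min-heap; the helper's name is arbitrary, and the exact bit patterns can differ from the worked example depending on tie-breaking, even though the code lengths match.

```python
import heapq

def huffman_codes(freqs: dict[str, int]) -> dict[str, str]:
    """Build Huffman codes from symbol frequencies via repeated min-merges."""
    # Heap entries: (frequency, unique tie-breaker, {symbol: code_so_far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in heap[0][2]}  # degenerate single-symbol case
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)      # two least-frequent subtrees
        f2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}   # left edge = 0
        merged.update({s: "1" + c for s, c in codes2.items()})  # right edge = 1
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman_codes({"A": 45, "B": 25, "C": 15, "D": 10, "E": 5}))
# {'A': '0', 'B': '10', 'C': '110', 'D': '1111', 'E': '1110'}
# Same code lengths as the worked example (1, 2, 3, 4, 4 bits).
```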
Arithmetic coding encodes the entire message as a single fractional number (0 to 1), achieving near-optimal entropy without the 1-bit-per-symbol limitation.
Concept: start with the interval [0, 1); for each symbol, narrow the interval to the sub-range whose width is proportional to that symbol's probability; the final, very precise fraction inside the last interval identifies the entire message.
Advantages: approaches the entropy limit to within a tiny constant, effectively spending fractional bits per symbol, and adapts naturally to highly skewed or context-dependent probability models.
Disadvantages: markedly slower than Huffman coding (more arithmetic per symbol), historically entangled in patents, and trickier to implement correctly with finite-precision integers.
Used in Zstandard, FSE achieves near-arithmetic-coding ratios with near-Huffman speeds.
How FSE works: FSE is a table-based variant of asymmetric numeral systems (ANS). The encoder keeps a single integer state and steps through a precomputed table; more probable symbols cause smaller state changes and therefore consume fewer output bits, which lets the coder spend fractional bits per symbol using only table lookups, shifts, and adds.
FSE vs Huffman vs Arithmetic:
| Entropy Coder | Compression Quality | Encoding Speed | Decoding Speed |
|---|---|---|---|
| Huffman | Good (but ≥1 bit/symbol) | Very fast | Very fast |
| Arithmetic | Optimal | Slow | Slow |
| FSE | Near-optimal | Fast | Very fast |
FSE's speed comes from table-based operations that vectorize well and avoid floating-point math.
Most storage systems don't implement custom entropy coders—they rely on complete algorithms like Zstd or LZ4. But understanding entropy coding helps diagnose why certain data compresses poorly (already near maximum entropy) and why modern algorithms outperform older ones (better entropy coding stages).
Integrating compression into storage systems requires careful architectural decisions about where, when, and how to compress.
The most common approach in storage systems: compress fixed-size blocks (typically 4KB-128KB) independently.
Advantages: any block can be decompressed independently, which preserves random access; blocks can be compressed and decompressed in parallel; and a corrupted block damages only that block.
Disadvantages: patterns that span block boundaries are missed, so ratios are lower than whole-file compression; every block pays its own header overhead; and variable-sized compressed blocks complicate on-disk space allocation.
Examples: ZFS compresses per record (128KB by default), Btrfs per extent, and RocksDB per SST data block, as the production table later in this section shows.
```python
import zstandard as zstd
from dataclasses import dataclass
from typing import Optional

@dataclass
class CompressedBlock:
    """A storage block with compression metadata."""
    compressed_data: bytes
    original_size: int
    compression_algorithm: str
    is_compressed: bool  # False if compression didn't help

class BlockCompressor:
    """
    Block-level compression for storage systems.

    Key features:
    - Skip compression if data is incompressible
    - Handle blocks of various sizes
    - Track compression statistics
    """

    def __init__(
        self,
        block_size: int = 65536,        # 64KB default
        compression_level: int = 3,
        min_savings_percent: int = 10,  # Skip if <10% space saved
    ):
        self.block_size = block_size
        self.min_savings = min_savings_percent / 100.0
        self.cctx = zstd.ZstdCompressor(level=compression_level)
        self.dctx = zstd.ZstdDecompressor()

        # Statistics
        self.stats = {
            'blocks_processed': 0,
            'blocks_compressed': 0,
            'blocks_skipped': 0,
            'bytes_before': 0,
            'bytes_after': 0,
        }

    def compress_block(self, data: bytes) -> CompressedBlock:
        """
        Compress a block, skipping if compression isn't beneficial.

        Real storage systems check this to avoid wasting CPU on
        incompressible data (already compressed, encrypted, random).
        """
        self.stats['blocks_processed'] += 1
        self.stats['bytes_before'] += len(data)
        original_size = len(data)

        # Try compression
        compressed = self.cctx.compress(data)

        # Check if compression actually helped
        savings = 1.0 - (len(compressed) / original_size)

        if savings >= self.min_savings:
            # Compression worthwhile
            self.stats['blocks_compressed'] += 1
            self.stats['bytes_after'] += len(compressed)
            return CompressedBlock(
                compressed_data=compressed,
                original_size=original_size,
                compression_algorithm='zstd',
                is_compressed=True,
            )
        else:
            # Compression not worth it - store original
            self.stats['blocks_skipped'] += 1
            self.stats['bytes_after'] += original_size
            return CompressedBlock(
                compressed_data=data,
                original_size=original_size,
                compression_algorithm='none',
                is_compressed=False,
            )

    def decompress_block(self, block: CompressedBlock) -> bytes:
        """Decompress a block if it was compressed."""
        if block.is_compressed:
            return self.dctx.decompress(block.compressed_data)
        else:
            return block.compressed_data

    def get_ratio(self) -> float:
        """Calculate overall compression ratio."""
        if self.stats['bytes_after'] == 0:
            return 1.0
        return self.stats['bytes_before'] / self.stats['bytes_after']
```

Like deduplication, compression timing affects performance:
Inline compression: data is compressed on the write path, before it reaches disk. Writes pay the CPU cost immediately and latency rises slightly, but data is always stored in its final, compact form and never has to be rewritten.
Background compression: data is written uncompressed first and compressed later by a background task (or during compaction). Write latency is unaffected, but the system temporarily consumes extra space and must spend additional I/O rewriting the data.
Always compress before encrypting:
✓ Correct: PlainText → Compress → Encrypt → Store
✗ Wrong: PlainText → Encrypt → Compress → Store (no compression)
Security note: Compression before encryption can leak information through compressed size (see CRIME/BREACH attacks on HTTPS). For sensitive data, consider padding or skipping compression.
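A minimal sketch of the correct ordering, assuming the zstandard and cryptography packages; Fernet stands in here for whatever encryption layer your system actually uses.

```python
import zstandard as zstd
from cryptography.fernet import Fernet

key = Fernet.generate_key()
fernet = Fernet(key)
cctx, dctx = zstd.ZstdCompressor(), zstd.ZstdDecompressor()

def store(plaintext: bytes) -> bytes:
    # Compress first, while the redundancy is still visible, then encrypt
    return fernet.encrypt(cctx.compress(plaintext))

def load(stored: bytes) -> bytes:
    # Reverse order on the read path: decrypt, then decompress
    return dctx.decompress(fernet.decrypt(stored))

record = b'{"user": "alice", "action": "login"}' * 100
assert load(store(record)) == record
# Encrypting first would leave the compressor nothing but random-looking
# bytes to work with, so the compression step would save nothing.
```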
| System | Default Algorithm | Block Size | Inline/Background |
|---|---|---|---|
| ZFS | LZ4 (default) | Record size (128KB default) | Inline |
| Btrfs | zstd (default kernel 5.1+) | Extent-based | Inline |
| RocksDB | Snappy/LZ4/Zstd | SST block (4KB-64KB) | Compaction-time |
| PostgreSQL | pglz/LZ4/Zstd | TOAST threshold (2KB) | Inline |
| MongoDB | Snappy/Zstd | WiredTiger blocks | Inline |
| Kafka | Snappy/LZ4/Zstd/gzip | Batch-level | Producer-side |
Standard compression struggles with small data. A 500-byte JSON object can't build meaningful patterns from just 500 bytes—there isn't enough context. Dictionary compression solves this by providing a pre-trained dictionary of common patterns.
Compression algorithms need context to find patterns. For small inputs, the sliding window never fills up, there is little repetition to exploit within the object itself, and fixed per-stream overhead (headers, block framing) eats a large fraction of any savings, so ratios often land near 1:1.
Yet many storage systems handle millions of small objects: cached API responses, individual log lines, key-value pairs, queue messages, and per-user configuration records.
A compression dictionary captures common patterns from a training set. When compressing new data, the algorithm has immediate access to these patterns.
Zstandard Dictionary Training:
1. Collect thousands of representative samples of the data you intend to compress.
2. Run the trainer (zstd --train on the command line, or train_dictionary in the library) to produce a dictionary, typically around 100KB.
3. Ship the same dictionary to every compressor and decompressor; data compressed with a dictionary can only be decompressed with that dictionary.
Improvements: For 1KB JSON objects, a trained dictionary commonly improves the ratio from roughly 1.5:1 to 3:1 or better, because the dictionary already contains the shared field names and structure.
```python
import zstandard as zstd
import json
import random

def train_json_dictionary(sample_generator, num_samples: int = 10000):
    """
    Train a Zstandard dictionary on sample JSON objects.

    The dictionary captures common patterns (field names, value formats,
    structure) that appear across many objects.
    """
    # Collect samples
    samples = []
    for i, obj in enumerate(sample_generator):
        if i >= num_samples:
            break
        samples.append(json.dumps(obj).encode('utf-8'))

    print(f"Training dictionary on {len(samples)} samples")
    print(f"Total sample size: {sum(len(s) for s in samples) / 1e6:.2f} MB")

    # Train dictionary
    dictionary = zstd.train_dictionary(
        dict_size=110 * 1024,  # 110KB dictionary
        samples=samples,
        level=3,  # Training level
    )

    print(f"Dictionary size: {len(dictionary.as_bytes())} bytes")
    print(f"Dictionary ID: {dictionary.dict_id()}")

    return dictionary

def evaluate_dictionary(dictionary, test_samples: list[bytes]):
    """
    Compare compression with and without dictionary.
    """
    # Without dictionary
    cctx_plain = zstd.ZstdCompressor(level=3)

    # With dictionary
    cctx_dict = zstd.ZstdCompressor(level=3, dict_data=dictionary)

    plain_total = 0
    dict_total = 0
    original_total = 0

    for sample in test_samples:
        original_total += len(sample)
        plain_total += len(cctx_plain.compress(sample))
        dict_total += len(cctx_dict.compress(sample))

    print(f"\nResults on {len(test_samples)} test samples:")
    print(f"  Original size: {original_total:,} bytes")
    print(f"  Without dict:  {plain_total:,} bytes ({original_total/plain_total:.2f}:1)")
    print(f"  With dict:     {dict_total:,} bytes ({original_total/dict_total:.2f}:1)")
    print(f"  Dictionary gain: {plain_total/dict_total:.2f}x improvement")

# Example usage for API response compression
def api_response_generator():
    """Generate sample API responses for dictionary training."""
    endpoints = ['user', 'order', 'product', 'cart', 'inventory']
    statuses = ['active', 'pending', 'completed', 'cancelled']

    for _ in range(100000):
        yield {
            'id': random.randint(1, 1000000),
            'type': random.choice(endpoints),
            'status': random.choice(statuses),
            'created_at': '2024-01-15T10:30:00Z',
            'updated_at': '2024-01-15T11:45:00Z',
            'metadata': {
                'version': '2.0',
                'source': 'api-gateway',
                'region': random.choice(['us-east-1', 'eu-west-1', 'ap-south-1']),
            },
            'data': {
                'name': f'Item_{random.randint(1, 10000)}',
                'price_cents': random.randint(100, 100000),
                'quantity': random.randint(1, 100),
            }
        }
```

Dictionary compression shines when: (1) Objects are small (<10KB), (2) Objects share common structure (JSON schemas, log formats), (3) You can train on representative samples, (4) You can distribute dictionaries to decompressors. Perfect for: API caching, log storage, key-value stores, message queues with typed messages.
Compression is a fundamental tool in storage engineering, but there's no universally "best" algorithm. Selection depends on your specific requirements.
1. What's your priority? If it's maximum space savings, reach for Zstandard at a high level, Brotli, or xz; if it's throughput and latency, reach for LZ4 or Snappy; Zstandard at its default level is the sensible middle ground.
2. What's your access pattern? Write-once, read-many data (archives, static assets) justifies slow, high-ratio compression; hot, frequently rewritten data needs fast compression and even faster decompression.
3. What's your data type? Text, JSON, and logs compress extremely well (and benefit from dictionaries when small); media files and encrypted or already-compressed data should be detected and skipped.
4. What hardware do you have? Plenty of spare CPU cores favor higher compression levels and multithreaded compressors; CPU-constrained or latency-critical paths favor LZ4 or Snappy, and some platforms can offload compression to dedicated accelerators. The sketch below condenses these questions into code.
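As a closing sketch, the checklist above can be condensed into a tiny (and deliberately opinionated) helper; the function and its categories are illustrative, not a library API.

```python
def choose_algorithm(priority: str, data_type: str, hot_path: bool) -> str:
    """Very rough mapping from the checklist above to an algorithm choice."""
    if data_type in {"encrypted", "already_compressed", "media"}:
        return "none"       # incompressible: skip compression entirely
    if hot_path or priority == "speed":
        return "lz4"        # latency-sensitive reads and writes
    if priority == "ratio":
        return "zstd-19"    # archives, cold data, pre-compressed static assets
    return "zstd-3"         # balanced default for most storage workloads

print(choose_algorithm("speed", "json", hot_path=True))      # lz4
print(choose_algorithm("ratio", "logs", hot_path=False))     # zstd-19
print(choose_algorithm("balanced", "text", hot_path=False))  # zstd-3
```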
You now understand the fundamentals of data compression—from information theory to practical algorithm selection. You can evaluate trade-offs between compression ratio, speed, and memory usage to choose the right algorithm for your storage system. Next, we'll explore how deduplication and compression work together to maximize overall storage efficiency.