Every storage optimization technique extracts value by trading one resource for another. Compression trades CPU cycles for storage capacity—spending compute to make data smaller. Deduplication trades memory (for indexes) and CPU (for hashing) for storage savings. Even choosing not to optimize is a trade-off: storage capacity traded for CPU headroom.
The economics are complex. Storage costs ($/GB/month) have declined exponentially, while compute costs have declined more slowly; a strategy that was cost-effective in 2020 might be wasteful today. Hardware matters too: SSDs change the calculus relative to spinning disks, and ARM processors have a different efficiency profile than x86.
This page develops the analytical framework for making informed trade-off decisions. We'll build cost models, analyze real-world scenarios, and establish guidelines for when to optimize aggressively, when to use lightweight optimization, and when to skip optimization entirely.
By the end of this page, you will understand how to quantify the cost of compression and deduplication in terms of CPU time and memory, build economic models comparing optimization cost to storage savings, recognize when lightweight optimization beats aggressive optimization, and make data-driven decisions about optimization configuration for different storage tiers.
Before making trade-off decisions, we need precise measurements of what optimization actually costs.
Compression algorithms consume CPU on both write (compression) and read (decompression) paths. The costs vary dramatically:
Compression throughput (single core, modern x86-64):
| Algorithm | Compression | Decompression | Ratio (text) | Compress CPU cost/TB |
|---|---|---|---|---|
| LZ4 fast | 750 MB/s | 4,000 MB/s | 2.1:1 | 0.37 core-hours |
| LZ4 HC | 100 MB/s | 4,000 MB/s | 2.4:1 | 2.78 core-hours |
| Zstd -1 | 500 MB/s | 1,400 MB/s | 2.8:1 | 0.56 core-hours |
| Zstd -3 | 250 MB/s | 1,300 MB/s | 3.0:1 | 1.11 core-hours |
| Zstd -9 | 60 MB/s | 1,200 MB/s | 3.2:1 | 4.63 core-hours |
| Zstd -19 | 3 MB/s | 1,100 MB/s | 3.5:1 | 92.6 core-hours |
| gzip -6 | 30 MB/s | 400 MB/s | 3.0:1 | 9.26 core-hours |
Key insight: Decompression is almost always faster than compression. Read-heavy workloads pay a much lower CPU tax than write-heavy workloads.
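The last column is just data volume divided by single-core throughput. A minimal sketch of that arithmetic, using the illustrative throughputs from the table above rather than measured values:

```python
def compress_core_hours_per_tb(throughput_mb_s: float, tb: float = 1.0) -> float:
    """Core-hours of single-threaded compression for `tb` terabytes."""
    seconds = (tb * 1_000_000) / throughput_mb_s  # 1 TB ~= 10^6 MB
    return seconds / 3600

# Illustrative per-core throughputs from the table (MB/s)
for name, mb_s in [('lz4', 750), ('zstd-3', 250), ('zstd-19', 3), ('gzip-6', 30)]:
    print(f'{name:8} {compress_core_hours_per_tb(mb_s):6.2f} core-hours/TB')
```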
Deduplication requires hashing every chunk plus index lookups:
Hashing throughput: SHA-256 runs at roughly 500 MB/s on a single modern core; faster cryptographic hashes such as BLAKE3 reach about 2,000 MB/s.
Index lookup cost: every chunk fingerprint must be checked against the fingerprint index; the model below assumes roughly 100 μs per lookup.
Per-TB cost calculation: With 8 KB average chunk size, 1 TB = 134 million chunks.
Hashing cost:
SHA-256: 1 TB / 500 MB/s = 2048 seconds = 0.57 core-hours
BLAKE3: 1 TB / 2000 MB/s = 512 seconds = 0.14 core-hours
Index lookup cost:
134 million lookups × 100 μs = 13,400 seconds = 3.7 core-hours
Total dedup CPU per TB: ~4-5 core-hours (dominated by index lookups)
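The same arithmetic, wrapped up as a quick estimator. A sketch using the illustrative figures above (8 KB chunks, SHA-256 at ~500 MB/s, ~100 μs per index lookup); these are planning numbers, not benchmarks:

```python
def dedup_cpu_hours_per_tb(
    hash_mb_s: float = 500.0,      # e.g. SHA-256 on one core
    avg_chunk_kb: float = 8.0,
    lookup_us: float = 100.0,      # assumed cost of one fingerprint index lookup
) -> float:
    """Estimated core-hours to dedup 1 TB (TiB) of incoming data."""
    tib_mb = 1024 * 1024                    # MB per TiB
    chunks = tib_mb * 1024 / avg_chunk_kb   # ~134 million chunks at 8 KB
    hash_seconds = tib_mb / hash_mb_s
    lookup_seconds = chunks * lookup_us / 1_000_000
    return (hash_seconds + lookup_seconds) / 3600

print(f"{dedup_cpu_hours_per_tb():.1f} core-hours/TB")  # ~4.3, dominated by lookups
```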
Fingerprint index sizing:
For each unique chunk, we store roughly: the fingerprint (e.g., 32 bytes for SHA-256), the chunk's on-disk location, and a reference count, which comes to around 48-64 bytes per entry.
Memory requirements: with 8 KB chunks, 1 TB of unique data means ~134 million entries, or roughly 6-8 GB of index.
At scale: 100 TB of unique data implies an index in the 600-800 GB range, more than fits comfortably in RAM on a single node.
Mitigation strategies: common approaches are caching only hot fingerprints in memory, using Bloom filters to short-circuit lookups for chunks that are definitely new, spilling the full index to SSD, or sharding the index across nodes. A sizing sketch follows below.
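A hedged sketch of the sizing arithmetic above; the 56-byte entry size is an assumption chosen to match the ~6-8 GB/TB figure, not a measured data structure:

```python
def dedup_index_ram_gb(
    unique_tb: float,
    avg_chunk_kb: float = 8.0,
    bytes_per_entry: int = 56,   # assumed: 32 B fingerprint + location + refcount + overhead
) -> float:
    """Rough RAM estimate for an in-memory fingerprint index."""
    unique_chunks = unique_tb * 1024 ** 3 / avg_chunk_kb   # chunks per TiB of unique data
    return unique_chunks * bytes_per_entry / 1024 ** 3     # bytes -> GiB

print(f"{dedup_index_ram_gb(1):.1f} GB per TB unique")        # ~7 GB
print(f"{dedup_index_ram_gb(100):.0f} GB for 100 TB unique")  # ~700 GB
```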
| Configuration | CPU (core-hours/TB) | RAM (GB/TB unique) | Latency Impact |
|---|---|---|---|
| LZ4 compression only | ~0.4 compress, ~0.07 decompress | Minimal | Low (+2-5%) |
| Zstd-3 compression | ~1.1 compress, ~0.25 decompress | Minimal | Moderate (+5-15%) |
| Dedup only (in-memory index) | ~4-5 | ~6-8 GB per TB unique | Moderate (+10-30%) |
| Dedup + LZ4 | ~5-6 | ~6-8 GB | Moderate |
| Dedup + Zstd-3 | ~6-7 | ~6-8 GB | Higher (+20-40%) |
| Dedup + Zstd-15 (archival) | ~30+ | ~6-8 GB | High (acceptable for cold) |
Modern hardware can dramatically change these numbers. Intel QAT accelerators can compress at 100 Gbps; AWS Graviton3 excels at Zstd; GPUs can accelerate certain operations. Always benchmark on your actual hardware—published numbers are guidelines, not guarantees.
The core economic question: Does the cost of compression/dedup exceed the cost of the storage it saves?
Storage cost components: raw capacity ($/GB/month), scaled by retention time and by any replication or redundancy overhead the system adds.
Compute cost components: CPU time for compression and decompression ($/core-hour), plus RAM held by dedup indexes ($/GB of RAM per month).
Simplified model:
Optimization ROI = Storage Saved × Storage Cost − Compute Cost
Where:
Storage Saved = Original Size × (1 - 1/Compression Ratio)
Storage Cost = $/GB/month × Retention Months
Compute Cost = Core-Hours × $/Core-Hour + RAM-GB × $/GB/month
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageCosts:
    """Storage cost parameters."""
    cost_per_gb_month: float   # $/GB/month
    retention_months: float    # How long data is kept

    @property
    def cost_per_gb_lifetime(self) -> float:
        return self.cost_per_gb_month * self.retention_months


@dataclass
class ComputeCosts:
    """Compute cost parameters."""
    cost_per_core_hour: float      # $/core-hour
    cost_per_gb_ram_month: float   # $/GB RAM/month


@dataclass
class OptimizationProfile:
    """Profile for a compression/dedup configuration."""
    name: str
    compression_ratio: float         # e.g., 3.0 for 3:1
    cpu_hours_per_tb: float          # Core-hours to compress 1 TB
    ram_gb_per_tb_unique: float      # RAM needed per TB unique data (0 for compression only)
    decompression_cpu_per_tb: float  # Core-hours to decompress 1 TB
    typical_dedup_ratio: float = 1.0  # Additional dedup savings


class OptimizationROICalculator:
    """
    Calculate ROI of different storage optimization strategies.
    """

    # Common optimization profiles
    PROFILES = {
        'none': OptimizationProfile(
            name='No optimization',
            compression_ratio=1.0,
            cpu_hours_per_tb=0,
            ram_gb_per_tb_unique=0,
            decompression_cpu_per_tb=0,
        ),
        'lz4_fast': OptimizationProfile(
            name='LZ4 Fast',
            compression_ratio=2.1,
            cpu_hours_per_tb=0.4,
            ram_gb_per_tb_unique=0,
            decompression_cpu_per_tb=0.07,
        ),
        'zstd_fast': OptimizationProfile(
            name='Zstd Level 1',
            compression_ratio=2.8,
            cpu_hours_per_tb=0.6,
            ram_gb_per_tb_unique=0,
            decompression_cpu_per_tb=0.2,
        ),
        'zstd_default': OptimizationProfile(
            name='Zstd Level 3',
            compression_ratio=3.0,
            cpu_hours_per_tb=1.1,
            ram_gb_per_tb_unique=0,
            decompression_cpu_per_tb=0.25,
        ),
        'zstd_high': OptimizationProfile(
            name='Zstd Level 15',
            compression_ratio=3.4,
            cpu_hours_per_tb=25,
            ram_gb_per_tb_unique=0,
            decompression_cpu_per_tb=0.25,
        ),
        'dedup_lz4': OptimizationProfile(
            name='Dedup + LZ4',
            compression_ratio=2.1,
            cpu_hours_per_tb=5.5,     # Hash + index + LZ4
            ram_gb_per_tb_unique=7,
            decompression_cpu_per_tb=0.1,
            typical_dedup_ratio=3.0,  # Common for backup workloads
        ),
        'dedup_zstd': OptimizationProfile(
            name='Dedup + Zstd',
            compression_ratio=3.0,
            cpu_hours_per_tb=6.5,
            ram_gb_per_tb_unique=7,
            decompression_cpu_per_tb=0.3,
            typical_dedup_ratio=3.0,
        ),
    }

    def __init__(self, storage_costs: StorageCosts, compute_costs: ComputeCosts):
        self.storage = storage_costs
        self.compute = compute_costs

    def calculate_roi(
        self,
        profile: OptimizationProfile,
        data_size_tb: float,
        reads_per_tb_per_month: float = 1.0,  # Read amplification
    ) -> dict:
        """
        Calculate ROI for an optimization profile.

        Returns cost breakdown and net savings.
        """
        data_size_gb = data_size_tb * 1024

        # Combined reduction from dedup + compression
        total_ratio = profile.compression_ratio * profile.typical_dedup_ratio

        # Storage calculations
        physical_gb = data_size_gb / total_ratio
        storage_saved_gb = data_size_gb - physical_gb
        storage_savings = storage_saved_gb * self.storage.cost_per_gb_lifetime

        # Compute costs for compression
        compress_cpu_hours = data_size_tb * profile.cpu_hours_per_tb
        compress_cost = compress_cpu_hours * self.compute.cost_per_core_hour

        # Compute costs for decompression (ongoing reads)
        total_reads_tb = data_size_tb * reads_per_tb_per_month * self.storage.retention_months
        decompress_cpu_hours = total_reads_tb * profile.decompression_cpu_per_tb
        decompress_cost = decompress_cpu_hours * self.compute.cost_per_core_hour

        # Memory costs (for dedup index, over retention period)
        unique_tb = data_size_tb / profile.typical_dedup_ratio
        ram_gb_needed = unique_tb * profile.ram_gb_per_tb_unique
        ram_cost = ram_gb_needed * self.compute.cost_per_gb_ram_month * self.storage.retention_months

        # Total compute cost
        total_compute_cost = compress_cost + decompress_cost + ram_cost

        # Net ROI
        net_savings = storage_savings - total_compute_cost
        roi_percent = (net_savings / max(total_compute_cost, 0.01)) * 100

        return {
            'profile': profile.name,
            'effective_ratio': f'{total_ratio:.1f}:1',
            'storage_saved_gb': storage_saved_gb,
            'storage_savings_usd': storage_savings,
            'compress_cost_usd': compress_cost,
            'decompress_cost_usd': decompress_cost,
            'ram_cost_usd': ram_cost,
            'total_compute_cost_usd': total_compute_cost,
            'net_savings_usd': net_savings,
            'roi_percent': roi_percent,
            'cost_effective': net_savings > 0,
        }

    def compare_profiles(
        self,
        data_size_tb: float,
        reads_per_tb_per_month: float = 1.0,
    ) -> list[dict]:
        """Compare all profiles for a given workload."""
        results = []
        for profile in self.PROFILES.values():
            result = self.calculate_roi(profile, data_size_tb, reads_per_tb_per_month)
            results.append(result)

        # Sort by net savings
        results.sort(key=lambda x: x['net_savings_usd'], reverse=True)
        return results


# Example usage with real-world costs
if __name__ == '__main__':
    # AWS-like pricing (illustrative)
    storage = StorageCosts(
        cost_per_gb_month=0.023,   # S3 Standard
        retention_months=12,
    )
    compute = ComputeCosts(
        cost_per_core_hour=0.04,   # Approximation
        cost_per_gb_ram_month=0.005,
    )

    calculator = OptimizationROICalculator(storage, compute)

    # Analyze 100 TB backup workload
    results = calculator.compare_profiles(
        data_size_tb=100,
        reads_per_tb_per_month=0.5,  # Read 50% of data per month
    )

    for r in results:
        print(f"{r['profile']:20} | Ratio: {r['effective_ratio']:6} | "
              f"Net: ${r['net_savings_usd']:,.0f}")
```

Scenario 1: Cold Archival Storage
Result: Even expensive compression (Zstd -19) is highly profitable. Storage cost dominates, and aggressive compression saves $2.80/GB over lifetime while costing <$0.10/GB in compute.
Scenario 2: Hot OLTP Database
Result: Only lightweight compression (LZ4) is profitable. Decompression CPU for frequent reads exceeds storage savings for complex algorithms.
Scenario 3: Backup to Cloud
Result: Dedup + moderate compression (Zstd -3) is optimal. High dedup ratio on backup data makes the index RAM cost worthwhile.
There's always a crossover point where optimization switches from profitable to wasteful. For compression, it depends on how often data is re-read relative to the decompression cost of the chosen algorithm: in pure dollar terms the break-even read rate is quite high for fast decompressors, so for hot data the binding constraint is usually latency (covered next) rather than compute cost. For dedup, it depends heavily on data redundancy: mostly unique data shouldn't be deduped at all.
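The crossover can be located by equating monthly storage savings with the CPU cost of re-reading the data. A sketch under the simplified model above, with the illustrative S3-like prices used earlier (not a general result):

```python
def breakeven_reads_per_month(
    ratio: float,                     # compression ratio, e.g. 3.0 for 3:1
    decompress_hours_per_tb: float,   # core-hours to decompress 1 TB
    storage_usd_gb_month: float = 0.023,
    core_hour_usd: float = 0.04,
) -> float:
    """Full reads of the data set per month at which decompression CPU cost
    equals the monthly storage savings (one-time compression cost ignored)."""
    monthly_savings = 1024 * (1 - 1 / ratio) * storage_usd_gb_month  # per logical TB
    cost_per_full_read = decompress_hours_per_tb * core_hour_usd
    return monthly_savings / cost_per_full_read

print(f"Zstd-3: {breakeven_reads_per_month(3.0, 0.25):.0f} full reads/month")  # ~1570
```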
Economic models assume CPU time is fungible with money. But for latency-sensitive workloads, time on the critical path isn't just expensive—it's unacceptable.
OLTP database scenario: consider a query with a tight tail-latency budget, say 43 ms left for storage work after the rest of the request path.
With a 43 ms budget and 1,300 MB/s Zstd decompression, a single 8 KB page decompresses in a few microseconds, a negligible fraction of the budget; point reads are effectively free to decompress.
But with a sequential scan, decompression cost scales with every byte touched: scanning 10 GB of logical data costs roughly 8 seconds of single-core CPU, which can dominate query time and starve concurrent queries.
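A quick check of both cases; the 8 KB page and the 10 GB scan are illustrative sizes, and 1,300 MB/s is the single-core Zstd figure used throughout:

```python
def decompress_ms(logical_mb: float, bandwidth_mb_s: float = 1300.0) -> float:
    """Milliseconds of single-core Zstd decompression for `logical_mb` of data."""
    return logical_mb / bandwidth_mb_s * 1000

print(f"8 KB page:  {decompress_ms(8 / 1024):.3f} ms")           # microseconds, fits a 43 ms budget
print(f"10 GB scan: {decompress_ms(10 * 1024) / 1000:.1f} s CPU per query")
```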
| Tier | Latency Requirement | Recommended Compression | Recommended Dedup | Rationale |
|---|---|---|---|---|
| L1 Cache (RAM) | Microseconds | None | None | CPU overhead exceeds memory savings |
| Hot Storage (NVMe) | <1ms reads | LZ4 or none | Rarely | Latency-critical; lightweight only |
| Warm Storage (SSD) | 1-10ms reads | LZ4 or Zstd-1 | Per-volume | Balance savings with latency |
| Cold Storage (HDD) | 10-100ms reads | Zstd-9 | Global | Disk latency dominates anyway |
| Archive (Tape/Glacier) | Minutes-hours | Zstd-19 or LZMA | Global | Retrieval latency dwarfs any compression cost |
Larger blocks compress better but increase read amplification: the compressor sees more context per block, so ratios improve, but any read, however small, must fetch and decompress an entire block.
Example: random 4 KB row lookups against data stored in 64 KB compressed blocks.
The math: for random 4 KB reads from 64 KB compressed blocks, read amplification is 64 / 4 = 16x. Every lookup reads and decompresses 16 times more data than the application asked for.
Best practice: size blocks to the access pattern. Use small blocks (4-32 KB) for random-read, latency-sensitive data and large blocks (128 KB-1 MB) for scan-heavy or archival data; the analysis code below sweeps both dimensions.
```python
from dataclasses import dataclass


@dataclass
class StorageMedia:
    """Characteristics of a storage medium."""
    name: str
    read_latency_ms: float      # Time to first byte
    read_bandwidth_mb_s: float  # Sequential read speed
    iops: int                   # Random read IOPS


@dataclass
class CompressionConfig:
    """Compression configuration."""
    algorithm: str
    decompress_bandwidth_mb_s: float
    compression_ratio: float


# Common storage media profiles
MEDIA = {
    'nvme': StorageMedia('NVMe SSD', 0.02, 3000, 500000),
    'sata_ssd': StorageMedia('SATA SSD', 0.1, 500, 50000),
    'hdd': StorageMedia('HDD', 10, 150, 150),
    'cloud_ssd': StorageMedia('Cloud SSD (gp3)', 0.5, 400, 16000),
    'cloud_hdd': StorageMedia('Cloud HDD (st1)', 20, 250, 500),
}

# Common compression profiles (decompression matters for reads)
COMPRESSION = {
    'none': CompressionConfig('None', float('inf'), 1.0),
    'lz4': CompressionConfig('LZ4', 4000, 2.1),
    'zstd_fast': CompressionConfig('Zstd-1', 1400, 2.8),
    'zstd_default': CompressionConfig('Zstd-3', 1300, 3.0),
    'zstd_high': CompressionConfig('Zstd-15', 1100, 3.4),
}


def analyze_read_latency(
    logical_size_kb: float,
    block_size_kb: float,
    media: StorageMedia,
    compression: CompressionConfig,
) -> dict:
    """
    Analyze read latency for a request.

    Accounts for:
    1. Read amplification (reading full block for partial access)
    2. Decompression time
    3. Media latency
    """
    # Read amplification: how many blocks must we read?
    blocks_needed = max(1, (logical_size_kb + block_size_kb - 1) // block_size_kb)
    logical_read_kb = blocks_needed * block_size_kb

    # Compressed size to read from disk
    physical_read_kb = logical_read_kb / compression.compression_ratio

    # Time components
    disk_latency_ms = media.read_latency_ms
    disk_transfer_ms = (physical_read_kb / 1024) / media.read_bandwidth_mb_s * 1000
    decompress_ms = (logical_read_kb / 1024) / compression.decompress_bandwidth_mb_s * 1000

    total_ms = disk_latency_ms + disk_transfer_ms + decompress_ms

    return {
        'request_size_kb': logical_size_kb,
        'block_size_kb': block_size_kb,
        'blocks_read': blocks_needed,
        'read_amplification': logical_read_kb / logical_size_kb,
        'physical_read_kb': physical_read_kb,
        'disk_latency_ms': disk_latency_ms,
        'disk_transfer_ms': disk_transfer_ms,
        'decompress_ms': decompress_ms,
        'total_latency_ms': total_ms,
        'decompress_fraction': decompress_ms / total_ms,
    }


def find_optimal_config(
    logical_size_kb: float,
    latency_budget_ms: float,
    media: StorageMedia,
) -> list[dict]:
    """
    Find compression configurations that meet latency budget.
    """
    results = []

    for block_size in [4, 8, 16, 32, 64, 128]:
        for comp_name, compression in COMPRESSION.items():
            analysis = analyze_read_latency(
                logical_size_kb, block_size, media, compression
            )
            meets_budget = analysis['total_latency_ms'] <= latency_budget_ms

            results.append({
                'block_size_kb': block_size,
                'compression': comp_name,
                'ratio': compression.compression_ratio,
                'latency_ms': analysis['total_latency_ms'],
                'meets_budget': meets_budget,
                'decompress_fraction': analysis['decompress_fraction'],
            })

    # Sort by compression ratio (best first) among those meeting budget
    meeting = [r for r in results if r['meets_budget']]
    meeting.sort(key=lambda x: x['ratio'], reverse=True)

    return meeting[:5]  # Top 5 options
```

Average decompression speed is misleading. Some blocks compress poorly and decompress slowly. GC pauses can spike latency. Always measure P99/P99.9 latency under real load—not average throughput. A single slow decompression on the critical path can cause an SLA breach.
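As a usage example of the `find_optimal_config` sketch above, asking which configurations keep a 16 KB random read under 1 ms on NVMe might look like this (the request size and budget are illustrative parameters):

```python
options = find_optimal_config(
    logical_size_kb=16,
    latency_budget_ms=1.0,
    media=MEDIA['nvme'],
)
for opt in options:
    print(f"{opt['compression']:12} block={opt['block_size_kb']:>3} KB  "
          f"ratio={opt['ratio']:.1f}  latency={opt['latency_ms']:.3f} ms")
```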
The optimal trade-off point shifts with hardware capabilities. Modern systems offer various acceleration options.
x86-64 (Intel/AMD): the baseline for the throughput figures quoted earlier; high per-core throughput and mature, heavily optimized compression libraries.
ARM (Graviton, Apple Silicon): competitive per-core compression throughput at lower cost and power; AWS positions Graviton3 at roughly 25% better efficiency for these workloads (see the table below).
Different calculus: ARM's efficiency advantage means compression is "cheaper" in terms of cost and power. Strategies that were marginal on x86 become clearly profitable on ARM.
Intel QAT (Quick Assist Technology): a dedicated accelerator that offloads compression (and crypto) from the CPU at up to roughly 100 Gbps, but it requires driver and software integration.
NVIDIA GPU compression: GPUs can batch-compress at tens of GB/s, but data must cross PCIe to the device and back, so the win is largest for large, throughput-oriented batch jobs.
Computational storage (CSD): drives such as ScaleFlux compress transparently inside the SSD (on the order of 3 GB/s per drive), freeing host CPU entirely at the cost of vendor lock-in.
| Hardware | Type | Throughput | Use Case | Considerations |
|---|---|---|---|---|
| Intel QAT | PCIe Accelerator | 100 Gbps | High-volume compression | Requires driver integration |
| AMD CDNA GPU | GPU Offload | 50+ GB/s | Batch compression | PCIe transfer overhead |
| ScaleFlux CSD | Computational SSD | 3 GB/s per drive | Transparent compression | Vendor lock-in |
| AWS Graviton3 | ARM CPU | ~25% more efficient | Cloud workloads | Software compatibility check |
| FPGA (Xilinx) | Programmable | Custom | Specialized algorithms | Development effort |
SSD vs. HDD trade-offs:
| Factor | NVMe SSD | HDD |
|---|---|---|
| Cost per TB | $80-150 | $15-25 |
| Random read latency | 0.02ms | 10ms |
| Sequential bandwidth | 3-7 GB/s | 150-250 MB/s |
| Power per TB | Higher | Lower |
| Compression value | Lower (disk is fast, CPU is bottleneck) | Higher (CPU faster than disk) |
On HDD: the disk delivers only 150-250 MB/s while even mid-level Zstd decompresses at over 1 GB/s per core, so compression effectively multiplies disk bandwidth and higher-ratio algorithms are almost always worth it.
On NVMe SSD: the device delivers 3-7 GB/s, which outruns single-core Zstd decompression (~1.3 GB/s), so heavier compression can become the read bottleneck; only LZ4-class decompression (~4 GB/s) keeps pace without parallel decompression.
Recommendation: For NVMe-based systems, LZ4 is often optimal. For HDD-based systems, higher-ratio compression pays off.
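One way to make this concrete is to compare effective sequential read throughput: compression multiplies the medium's bandwidth until single-core decompression becomes the ceiling. A sketch using the bandwidth figures from earlier (per-core decompression; parallel decompression raises the ceiling):

```python
def effective_read_mb_s(disk_mb_s: float, ratio: float, decompress_mb_s: float) -> float:
    """Logical sequential throughput: limited by the medium (times the ratio) or the decompressor."""
    return min(disk_mb_s * ratio, decompress_mb_s)

for medium, disk in [('HDD', 200), ('NVMe', 3000)]:
    for algo, ratio, dec in [('none', 1.0, float('inf')), ('LZ4', 2.1, 4000), ('Zstd-3', 3.0, 1300)]:
        print(f'{medium:5} {algo:7} {effective_read_mb_s(disk, ratio, dec):7.0f} MB/s logical')
```

On HDD the compressed options roughly triple logical throughput, while on NVMe the Zstd-3 line drops below the uncompressed one, matching the recommendation above.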
Cloud providers charge for egress ($0.02-0.09/GB). Reducing transferred bytes saves real money. Compression on read (decompress locally) can be more valuable than compression on write if data crosses network boundaries. Calculate egress savings into your ROI models.
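If transfers are metered, the egress term is easy to bolt onto the ROI model. A small sketch with an assumed $0.05/GB egress price (within the range quoted above):

```python
def egress_savings_usd(logical_gb_transferred: float, ratio: float,
                       egress_usd_per_gb: float = 0.05) -> float:
    """Dollars saved by transferring compressed rather than raw bytes."""
    return logical_gb_transferred * (1 - 1 / ratio) * egress_usd_per_gb

print(f"${egress_savings_usd(10_000, 3.0):,.0f} saved per 10 TB egressed at 3:1")
```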
Static optimization configurations are suboptimal. Data characteristics vary, access patterns shift, and resource availability fluctuates. Advanced systems adapt dynamically.
Detect and skip incompressible data:
Benefit: Avoid wasting CPU on already-compressed data.
Implementation sketch:
```python
import lz4.frame

def should_compress(data: bytes) -> bool:
    # Trial-compress a small sample as a cheap compressibility check
    sample = data[:4096]
    compressed = lz4.frame.compress(sample)
    ratio = len(sample) / len(compressed)
    return ratio > 1.1  # Only compress if >10% savings
```
Vary compression level by time of day: compress with a fast level during peak hours to protect latency, then recompress at a higher level in off-peak windows when CPU would otherwise sit idle.
Hot data: frequently accessed; use LZ4 or no compression at all.
Warm data: occasional access; a low Zstd level (around 3) balances ratio and CPU.
Cold data: rare access; a high Zstd level (9 and up), typically applied as data migrates to the cold tier.
```python
from enum import Enum
from dataclasses import dataclass
import time

import zstandard as zstd
import lz4.frame


class DataTemperature(Enum):
    HOT = 'hot'
    WARM = 'warm'
    COLD = 'cold'
    FROZEN = 'frozen'


@dataclass
class CompressionStrategy:
    """Strategy for a data temperature tier."""
    algorithm: str
    level: int
    skip_threshold: float  # Skip if ratio below this


class AdaptiveCompressor:
    """
    Compression system that adapts to data characteristics
    and access patterns.
    """

    STRATEGIES = {
        DataTemperature.HOT: CompressionStrategy('lz4', 0, 1.05),
        DataTemperature.WARM: CompressionStrategy('zstd', 3, 1.1),
        DataTemperature.COLD: CompressionStrategy('zstd', 9, 1.15),
        DataTemperature.FROZEN: CompressionStrategy('zstd', 19, 1.2),
    }

    # File extensions that shouldn't be compressed
    SKIP_EXTENSIONS = {
        '.jpg', '.jpeg', '.png', '.gif', '.webp',  # Images
        '.mp4', '.avi', '.mkv', '.webm',           # Video
        '.mp3', '.aac', '.flac', '.ogg',           # Audio
        '.zip', '.gz', '.bz2', '.xz', '.7z',       # Already compressed
        '.pdf',                                    # Often has compressed streams
    }

    def __init__(self):
        self.zstd_compressors = {
            level: zstd.ZstdCompressor(level=level)
            for level in [3, 9, 15, 19]
        }
        self.zstd_decompressor = zstd.ZstdDecompressor()

    def _should_skip_by_extension(self, filename: str) -> bool:
        """Check if file extension indicates incompressible data."""
        ext = ('.' + filename.rsplit('.', 1)[-1].lower()) if '.' in filename else ''
        return ext in self.SKIP_EXTENSIONS

    def _check_compressibility(self, data: bytes, threshold: float) -> tuple[bool, float]:
        """
        Quick check if data is worth compressing.

        Returns (should_compress, sample_ratio)
        """
        sample = data[:4096]
        if len(sample) < 100:
            return True, 0  # Too small to check

        try:
            compressed = lz4.frame.compress(sample)
            ratio = len(sample) / len(compressed)
            return ratio >= threshold, ratio
        except Exception:
            return True, 0  # If check fails, try compressing

    def compress(
        self,
        data: bytes,
        temperature: DataTemperature,
        filename: str = '',
    ) -> tuple[bytes, dict]:
        """
        Compress data adaptively based on temperature and content.

        Returns (compressed_data, metadata)
        """
        strategy = self.STRATEGIES[temperature]

        # Skip by extension
        if filename and self._should_skip_by_extension(filename):
            return data, {
                'compressed': False,
                'reason': 'extension_skip',
                'algorithm': 'none',
            }

        # Compressibility check
        worth_it, sample_ratio = self._check_compressibility(data, strategy.skip_threshold)
        if not worth_it:
            return data, {
                'compressed': False,
                'reason': 'low_ratio',
                'sample_ratio': sample_ratio,
                'algorithm': 'none',
            }

        # Compress with appropriate algorithm
        start = time.time()
        if strategy.algorithm == 'lz4':
            compressed = lz4.frame.compress(data)
        else:
            compressor = self.zstd_compressors.get(strategy.level)
            compressed = compressor.compress(data)
        elapsed = time.time() - start

        ratio = len(data) / len(compressed)

        # Final check: did compression actually help?
        if ratio < strategy.skip_threshold:
            return data, {
                'compressed': False,
                'reason': 'actual_ratio_low',
                'actual_ratio': ratio,
                'algorithm': 'none',
            }

        return compressed, {
            'compressed': True,
            'algorithm': strategy.algorithm,
            'level': strategy.level,
            'original_size': len(data),
            'compressed_size': len(compressed),
            'ratio': ratio,
            'compress_time_ms': elapsed * 1000,
        }

    def decompress(self, data: bytes, metadata: dict) -> bytes:
        """Decompress based on stored metadata."""
        if not metadata.get('compressed', False):
            return data

        algorithm = metadata['algorithm']
        if algorithm == 'lz4':
            return lz4.frame.decompress(data)
        elif algorithm == 'zstd':
            return self.zstd_decompressor.decompress(data)
        else:
            raise ValueError(f"Unknown algorithm: {algorithm}")
```

Advanced storage systems like ZFS can auto-tune compression. ZFS's 'compress=auto' (proposed feature) would analyze block content and choose the optimal algorithm. NetApp's adaptive compression varies level based on system load. These features reduce operator burden and improve efficiency.
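Returning to the `AdaptiveCompressor` sketch, a round trip might look like the following (assuming the `zstandard` and `lz4` packages are installed; the repeated log line is just an illustrative compressible payload):

```python
compressor = AdaptiveCompressor()

blob = b'a highly repetitive log line\n' * 10_000
packed, meta = compressor.compress(blob, DataTemperature.WARM, filename='app.log')
print(meta['algorithm'], round(meta.get('ratio', 0), 1))  # zstd at level 3, high ratio

restored = compressor.decompress(packed, meta)
assert restored == blob
```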
Synthesizing everything covered, here's a practical decision framework.
Questions to answer:

- How often is the data read, and how tight are the latency SLOs on that path?
- How compressible is the data, and how much redundancy is there for dedup to exploit?
- What do storage, CPU, and RAM actually cost in your environment, and how long is the data retained?
- Which tier and hardware (media type, CPU architecture, available accelerators) will hold it?
| Workload | Dedup | Compression | Rationale |
|---|---|---|---|
| OLTP Database (hot) | No | LZ4 page-level | Latency critical |
| OLAP Data Warehouse | No | Zstd-9 columnar | Scan-oriented, ratio matters |
| Primary File Storage | Per-volume | Zstd-3 | Balance savings and performance |
| Backup Target | Global | Zstd-9 | Maximize savings, tolerate CPU |
| Long-term Archive | Global | Zstd-19 | Storage cost dominates everything |
| Video/Media Library | File-hash only | None | Pre-compressed content |
| Log Aggregation | No | Zstd-3 + dictionary | Highly compressible text |
| Container Registry | Content-addressable | Per-layer Zstd | Natural dedup via layers |
CPU versus storage is the fundamental trade-off in storage optimization. There's no universal "right" answer—only the right answer for your specific workload, hardware, and economics.
Key principles:
Measure before optimizing: Benchmark your actual data to understand compressibility and deduplication potential.
Build cost models: Account for CPU, memory, latency, and storage over the data's full lifecycle.
Match strategy to tier: Hot data needs lightweight optimization; cold data benefits from aggressive optimization.
Consider latency budgets: SLA requirements may preclude certain algorithms regardless of cost.
Adapt dynamically: Time-of-day, access frequency, and content type should influence strategy.
Hardware shapes trade-offs: ARM vs. x86, SSD vs. HDD, accelerators—all shift the economics.
Skip when appropriate: Sometimes no optimization is optimal. Don't waste CPU on incompressible data.
You now have a comprehensive framework for analyzing CPU vs. storage trade-offs. You can build economic models, evaluate latency impacts, adapt to hardware capabilities, and make data-driven optimization decisions. Next, we'll explore implementation considerations—the practical engineering challenges of building deduplication and compression into production storage systems.