When deduplication and compression are deployed together, their effects compound multiplicatively. If deduplication alone achieves a 4:1 ratio and compression achieves 2.5:1, the combined system might achieve 10:1—10 terabytes of logical data stored in 1 terabyte of physical capacity.
But achieving optimal storage efficiency isn't as simple as enabling both features and walking away. The order of operations matters, chunk size affects both stages, and certain data types benefit from one but not the other. Production systems require careful tuning to maximize the data reduction ratio while controlling CPU and memory overhead.
This page examines storage efficiency holistically—how to measure it, how to optimize it, and how industry-leading systems achieve 20:1 or better ratios for the right workloads.
By the end of this page, you will understand how to calculate and benchmark data reduction ratios, know the correct order of operations for combining deduplication and compression, recognize which optimizations apply to which data types, and be able to design storage systems that maximize efficiency for your specific workloads.
Before optimizing storage efficiency, you must measure it accurately. Several metrics capture different aspects of data reduction.
The most commonly cited metric: the ratio of logical (pre-optimization) data to physical (post-optimization) storage.
Data Reduction Ratio = Logical Data Size / Physical Data Size
Example: a system that stores 50 TB of logical data in 5 TB of physical capacity has a data reduction ratio of 10:1.
Components of DRR: When both deduplication and compression are active:
Total DRR = Dedup Ratio × Compression Ratio
Example:
- Dedup ratio: 4:1 (50 TB → 12.5 TB unique)
- Compression ratio: 2.5:1 (12.5 TB → 5 TB physical)
- Total DRR: 4 × 2.5 = 10:1
An alternative representation that's often more intuitive:
Capacity Savings % = (1 - 1/DRR) × 100
Examples:
- 2:1 ratio → 50% savings
- 5:1 ratio → 80% savings
- 10:1 ratio → 90% savings
- 20:1 ratio → 95% savings
Note the diminishing returns: going from 10:1 to 20:1 only saves an additional 5 percentage points.
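A minimal sketch of these calculations (the helper names are illustrative, not part of any particular product's API):

```python
def data_reduction_ratio(logical_bytes: int, physical_bytes: int) -> float:
    """Overall DRR: logical data divided by physical storage consumed."""
    return logical_bytes / physical_bytes

def combined_drr(dedup_ratio: float, compression_ratio: float) -> float:
    """When both stages are active, their ratios multiply."""
    return dedup_ratio * compression_ratio

def capacity_savings_percent(drr: float) -> float:
    """Convert a ratio into the more intuitive percent-saved form."""
    return (1 - 1 / drr) * 100

# The example from above: 4:1 dedup combined with 2.5:1 compression
total = combined_drr(4.0, 2.5)  # 10.0
print(f"{total:.0f}:1 -> {capacity_savings_percent(total):.0f}% savings")  # 10:1 -> 90% savings
```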
| Metric | Formula | Use Case | Caveats |
|---|---|---|---|
| Data Reduction Ratio | Logical / Physical | Overall efficiency marketing | Can be manipulated by counting method |
| Deduplication Ratio | Pre-dedup / Post-dedup | Dedup effectiveness | Varies wildly by workload |
| Compression Ratio | Uncompressed / Compressed | Compression effectiveness | Measured on already-deduped data |
| Effective Capacity | Physical × DRR | Usable storage | Depends on future data patterns |
| Thin Provisioning Ratio | Allocated / Used | Over-provisioning measurement | Not data reduction |
Warning: Storage vendors often report DRR under ideal conditions—backup workloads with identical VMs, synthetic datasets with high redundancy, or cherry-picked customer examples. Real-world ratios are often 2-5x lower than advertised maximums.
Benchmark your actual workload: The only reliable efficiency metric is one measured on your actual production data. Run tests with representative samples before making purchasing decisions.
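One way to get a rough first-order estimate before a full proof of concept is to sample your own files, fingerprint fixed-size chunks, and compress the unique ones. This is only a sketch under stated assumptions: fixed 128 KB chunks and zlib level 6 are placeholders, and real appliances use content-defined chunking and different codecs, so treat the output as an indicator rather than a prediction.

```python
import hashlib
import zlib
from pathlib import Path

CHUNK_SIZE = 128 * 1024  # assumption: fixed-size chunks; real systems often use CDC

def estimate_reduction(sample_dir: str) -> None:
    """Estimate dedup and compression ratios for a directory of sample files."""
    seen: set[str] = set()
    logical = unique = compressed = 0

    for path in Path(sample_dir).rglob("*"):
        if not path.is_file():
            continue
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                logical += len(chunk)
                digest = hashlib.sha256(chunk).hexdigest()
                if digest in seen:
                    continue  # duplicate chunk: consumes no new physical space
                seen.add(digest)
                unique += len(chunk)
                compressed += len(zlib.compress(chunk, 6))

    if compressed:
        print(f"dedup {logical/unique:.2f}:1, compression {unique/compressed:.2f}:1, "
              f"combined {logical/compressed:.2f}:1")

# estimate_reduction("/data/representative-sample")  # hypothetical sample path
```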
Storage efficiency isn't static; it shifts as data ages and workload composition changes.
Monitor efficiency trends, not just point-in-time measurements. A declining trend might indicate changing workloads or retention issues.
Some vendors report 'effective' or 'normalized' ratios that include thin provisioning (allocated but unused space). A volume allocated at 100TB but using 10TB physical appears as 10:1—even with zero dedup/compression. Always distinguish actual data reduction from allocation accounting.
The sequence in which you apply storage optimizations dramatically affects final efficiency. Getting this wrong can nullify an entire optimization stage.
For maximum efficiency, apply in this sequence:
1. Deduplication → Eliminates identical chunks
2. Compression → Shrinks remaining unique chunks
3. Encryption → Secures compressed data
4. Erasure coding → Adds redundancy for durability
Deduplication before compression: compressed output is highly sensitive to its input, so two copies of the same data rarely compress to identical byte sequences. Deduplicate the raw chunks first, then compress whatever remains unique.
Compression before encryption: well-encrypted data is indistinguishable from random noise and will not compress, so compression applied after encryption accomplishes nothing.
Encryption before erasure coding: encoding the already-encrypted chunks lets the durability layer operate purely on ciphertext, so parity fragments never expose plaintext and repair operations don't require access to keys.
When clients encrypt before sending (zero-knowledge model): the server sees only ciphertext, and identical plaintext from different clients (or different keys) encrypts to different bytes, so server-side deduplication and compression are largely defeated.
Mitigation strategies: deduplicate and compress on the client before encrypting, or use convergent encryption (keys derived from the content itself) so identical plaintexts still produce identical ciphertexts; each approach trades some confidentiality for efficiency.
Deduplication chunk boundaries must be consistent for matching to work. If compression changes data layout, chunk boundaries shift.
Solution: In multi-stage pipelines, chunk and fingerprint the uncompressed data first, then compress each unique chunk independently after deduplication; boundaries computed on raw data stay stable regardless of codec or compression level (see the sketch below).
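A sketch of a write path that follows this ordering: deduplicate on raw chunks, then compress only the chunks that survive. Encryption and erasure coding would follow on the compressed output; they are left as comments to keep the example self-contained, and the class and method names are illustrative.

```python
import hashlib
import zlib

class WritePath:
    """Toy ingest pipeline: dedup on raw chunks, then compress unique chunks."""

    def __init__(self, chunk_size: int = 64 * 1024):
        self.chunk_size = chunk_size
        self.store: dict[str, bytes] = {}  # fingerprint -> compressed chunk

    def ingest(self, data: bytes) -> list[str]:
        """Return the recipe (list of fingerprints) needed to rebuild `data`."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()    # 1. dedup key computed on raw bytes
            if fp not in self.store:
                payload = zlib.compress(chunk)        # 2. compress unique chunks only
                # 3. encrypt(payload) and 4. erasure-code the ciphertext would go here
                self.store[fp] = payload
            recipe.append(fp)
        return recipe

    def read(self, recipe: list[str]) -> bytes:
        return b"".join(zlib.decompress(self.store[fp]) for fp in recipe)
```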
Netflix's Open Connect CDN doesn't use traditional dedup or compression on video files—they're already highly compressed. Instead, efficiency comes from caching (temporal dedup—many users watch the same content) and erasure coding only for durability. Know your data type to avoid wasting CPU on ineffective optimization.
Different workloads have radically different efficiency characteristics. A strategy that achieves 30:1 for virtual machines might achieve only 1.2:1 for video files. Understanding your workload is essential.
Characteristics:
Optimization strategy:
Typical efficiency: 15:1 to 50:1 for enterprise VMs, 10:1 to 30:1 for databases.
Characteristics:
Optimization strategy:
Typical efficiency: 20:1 to 70:1 (higher with linked clones).
Characteristics:
Optimization strategy:
Typical efficiency: 2:1 to 5:1 compression; dedup often minimal on actively updated data.
| Workload | Dedup Potential | Compression Potential | Recommended Strategy |
|---|---|---|---|
| Backup / DR | ★★★★★ (high repeat) | ★★★★☆ | CDC + global dedup + Zstd high |
| VDI / Virtual Servers | ★★★★★ (shared OS) | ★★★☆☆ | Fixed block + LZ4 |
| Primary File Storage | ★★★☆☆ (some copies) | ★★★★☆ (text, docs) | Per-volume dedup + Zstd |
| OLTP Databases | ★☆☆☆☆ (low) | ★★★☆☆ (structured) | Page compression only |
| OLAP / Data Warehouse | ★★☆☆☆ | ★★★★★ (columnar) | Column compression + partitioning |
| Media (video/images) | ★☆☆☆☆ (unique) | ★☆☆☆☆ (already compressed) | Skip optimization |
| Log aggregation | ★★★☆☆ (patterns) | ★★★★★ (text) | Time-based batching + Zstd + dict |
Characteristics:
Optimization strategy:
Example efficiency: Elasticsearch with best_compression codec achieves 5:1 to 15:1 on structured logs.
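For reference, the codec is chosen at index creation time. A sketch using the plain REST API via the standard library; the cluster URL and index name are placeholders:

```python
import json
import urllib.request

# "best_compression" trades some stored-field read speed for a smaller index
settings = {"settings": {"index": {"codec": "best_compression"}}}

req = urllib.request.Request(
    "http://localhost:9200/logs-2024",   # placeholder cluster URL and index name
    data=json.dumps(settings).encode(),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)
```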
Characteristics:
Optimization strategy:
Typical efficiency: 10:1 to 100:1 across a container registry with many images sharing base layers.
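The registry pattern is easy to see in miniature: layers are stored under the digest of their own content, so pushing an image whose base layers already exist writes nothing new. A minimal in-memory sketch with hypothetical names:

```python
import hashlib

class LayerStore:
    """Content-addressable blob store: identical layers are stored exactly once."""

    def __init__(self):
        self.blobs: dict[str, bytes] = {}

    def put(self, layer: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(layer).hexdigest()
        # If the digest already exists, the push is a no-op: deduplication by construction
        self.blobs.setdefault(digest, layer)
        return digest

store = LayerStore()
base = b"shared base layer bytes"
d1 = store.put(base)
d2 = store.put(base)          # a second image sharing the same base layer
assert d1 == d2 and len(store.blobs) == 1
```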
High efficiency ratios don't mean high performance. A 50:1 dedup ratio on cold archive data might come with 10x read latency due to chunk reassembly. Balance efficiency against access requirements—hot data might justify lower ratios for better performance.
While not true data reduction, thin provisioning is a complementary technique that maximizes utilization of physical capacity.
Thin provisioning allocates virtual capacity to applications without immediately reserving equivalent physical storage. Applications see their full allocation; physical capacity is consumed only when data is actually written.
Example:
Without thin provisioning: 10 TB physical required, 7 TB wasted. With thin provisioning: 3-4 TB physical sufficient, with room to grow.
These techniques compound. For a pool of ten virtual machines, each allocated 1 TB, the pipeline works like this:
10 VMs × 1 TB allocated = 10 TB virtual
↓ (thin: only 30% used)
3 TB actual written
↓ (dedup: 4:1)
750 GB unique data
↓ (compression: 2.5:1)
300 GB physical storage
Total effective ratio: 10 TB / 300 GB = 33:1
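The same chain expressed as a few lines of arithmetic:

```python
virtual = 10 * 1_000          # 10 VMs x 1 TB allocated, in GB
written = virtual * 0.30      # thin provisioning: only ~30% actually written
unique = written / 4          # deduplication at 4:1
physical = unique / 2.5       # compression at 2.5:1
print(f"{physical:.0f} GB physical, effective ratio {virtual / physical:.0f}:1")
# 300 GB physical, effective ratio 33:1
```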
Thin provisioning enables over-subscription—allocating more virtual capacity than physical capacity exists. This is powerful but dangerous.
The capacity runaway scenario: many thin volumes grow at once, the shared physical pool fills faster than anyone expects, and writes start failing across every volume simultaneously.
Safeguards: enforce an over-subscription limit at volume creation, alert at escalating utilization thresholds (the code below uses 70/85/90%), and keep reserve capacity or a tested expansion plan for when the pool runs hot.
A specific thin-provisioning optimization: never store blocks that are all zeros.
Why this matters: filesystems and databases routinely pre-allocate and zero large regions (freshly formatted volumes, empty tablespaces, zero-filled VM disks), and storing those zeros verbatim spends physical capacity on data with no information content.
Implementation: check each incoming block against an all-zero pattern at write time; if it matches, record the range as an unallocated hole in the volume's mapping instead of writing anything, and synthesize zeros on read (sketched below).
Impact: For databases with pre-allocated tablespaces, zero-detection alone might achieve 3:1 efficiency on 'empty' space.
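A sketch of the write-time check; the comparison against a zero buffer is fast because bytes comparison runs in C. The `volume` methods are hypothetical stand-ins for a sparse-mapping layer.

```python
def is_zero_block(block: bytes) -> bool:
    """True when every byte is zero."""
    return block == bytes(len(block))

def write_block(volume, offset: int, block: bytes) -> None:
    """Skip physical allocation for all-zero blocks and record a hole instead."""
    if is_zero_block(block):
        volume.mark_unallocated(offset)            # hypothetical sparse-mapping call
    else:
        volume.allocate_and_write(offset, block)   # hypothetical backing-store call
```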
```python
from dataclasses import dataclass
from typing import Optional
import threading


@dataclass
class VirtualVolume:
    """A thinly provisioned volume."""
    name: str
    virtual_size: int   # Allocated capacity visible to client
    physical_used: int  # Actual physical consumption

    @property
    def usage_percent(self) -> float:
        return (self.physical_used / self.virtual_size) * 100


class ThinProvisionedPool:
    """
    Storage pool with thin provisioning support.
    Demonstrates capacity management with over-subscription.
    """

    def __init__(self, physical_capacity: int, oversubscription_limit: float = 3.0):
        self.physical_capacity = physical_capacity
        self.physical_used = 0
        self.oversubscription_limit = oversubscription_limit
        self.volumes: dict[str, VirtualVolume] = {}
        self.lock = threading.Lock()

        # Alert thresholds
        self.warn_threshold = 0.70
        self.alert_threshold = 0.85
        self.critical_threshold = 0.90

    def create_volume(self, name: str, virtual_size: int) -> Optional[VirtualVolume]:
        """
        Create a new thinly provisioned volume.
        Checks over-subscription limits but doesn't reserve physical space.
        """
        with self.lock:
            total_virtual = sum(v.virtual_size for v in self.volumes.values())
            new_total = total_virtual + virtual_size

            # Check over-subscription limit
            if new_total > self.physical_capacity * self.oversubscription_limit:
                raise ValueError(
                    f"Over-subscription limit exceeded: "
                    f"{new_total / self.physical_capacity:.1f}x > {self.oversubscription_limit}x"
                )

            volume = VirtualVolume(name=name, virtual_size=virtual_size, physical_used=0)
            self.volumes[name] = volume
            return volume

    def write_block(self, volume_name: str, size: int) -> bool:
        """
        Write data to a volume, consuming physical capacity.
        Returns False if physical pool is exhausted.
        """
        with self.lock:
            if self.physical_used + size > self.physical_capacity:
                return False  # Out of physical space

            self.physical_used += size
            self.volumes[volume_name].physical_used += size

            # Check thresholds
            usage = self.physical_used / self.physical_capacity
            self._check_thresholds(usage)
            return True

    def _check_thresholds(self, usage: float):
        if usage >= self.critical_threshold:
            print(f"CRITICAL: Pool at {usage:.1%} capacity!")
        elif usage >= self.alert_threshold:
            print(f"ALERT: Pool at {usage:.1%} capacity")
        elif usage >= self.warn_threshold:
            print(f"Warning: Pool at {usage:.1%} capacity")

    def get_efficiency_metrics(self) -> dict:
        """Calculate pool-wide efficiency metrics."""
        total_virtual = sum(v.virtual_size for v in self.volumes.values())

        return {
            'physical_capacity': self.physical_capacity,
            'physical_used': self.physical_used,
            'virtual_allocated': total_virtual,
            'oversubscription_ratio': total_virtual / self.physical_capacity,
            'utilization': self.physical_used / self.physical_capacity,
            'thin_ratio': total_virtual / max(self.physical_used, 1),
        }
```

Storage efficiency isn't a set-and-forget configuration. Workloads evolve, data patterns change, and optimization overhead must be balanced against benefit.
Efficiency metrics: dedup ratio, compression ratio, and the combined data reduction ratio over time.
Resource metrics: CPU spent on compression, memory consumed by fingerprint indexes, and fingerprint cache hit rate.
Capacity metrics: physical utilization, growth rate, and projected time to exhaustion.
```python
import time
from dataclasses import dataclass
from typing import List
import statistics


@dataclass
class EfficiencySnapshot:
    """Point-in-time efficiency measurement."""
    timestamp: float
    logical_bytes: int
    physical_bytes: int
    dedup_ratio: float
    compression_ratio: float
    combined_ratio: float
    fingerprint_cache_hits: int
    fingerprint_cache_misses: int


class EfficiencyMonitor:
    """
    Continuous monitoring of storage efficiency with trend analysis.
    """

    def __init__(self, storage_system, sample_interval_seconds: int = 300):
        self.storage = storage_system
        self.interval = sample_interval_seconds
        self.history: List[EfficiencySnapshot] = []
        self.max_history = 1000  # ~3.5 days at 5-minute intervals

    def capture_snapshot(self) -> EfficiencySnapshot:
        """Capture current efficiency state."""
        stats = self.storage.get_efficiency_stats()

        snapshot = EfficiencySnapshot(
            timestamp=time.time(),
            logical_bytes=stats['logical_bytes'],
            physical_bytes=stats['physical_bytes'],
            dedup_ratio=stats['dedup_ratio'],
            compression_ratio=stats['compression_ratio'],
            combined_ratio=stats['logical_bytes'] / max(stats['physical_bytes'], 1),
            fingerprint_cache_hits=stats['cache_hits'],
            fingerprint_cache_misses=stats['cache_misses'],
        )

        self.history.append(snapshot)
        if len(self.history) > self.max_history:
            self.history.pop(0)
        return snapshot

    def detect_efficiency_degradation(self, lookback_samples: int = 50) -> dict:
        """
        Detect if efficiency has degraded significantly.

        Returns warnings if current efficiency is notably worse
        than the recent historical average.
        """
        if len(self.history) < lookback_samples:
            return {'status': 'insufficient_data'}

        recent = self.history[-lookback_samples // 5:]
        historical = self.history[-lookback_samples:-lookback_samples // 5]

        recent_ratio = statistics.mean(s.combined_ratio for s in recent)
        historical_ratio = statistics.mean(s.combined_ratio for s in historical)

        # Cache hit rate analysis
        recent_hits = sum(s.fingerprint_cache_hits for s in recent)
        recent_misses = sum(s.fingerprint_cache_misses for s in recent)
        recent_cache_rate = recent_hits / max(recent_hits + recent_misses, 1)

        warnings = []

        # Check for ratio degradation (>15% drop)
        if recent_ratio < historical_ratio * 0.85:
            warnings.append({
                'type': 'efficiency_drop',
                'message': f'Combined ratio dropped from {historical_ratio:.2f}:1 '
                           f'to {recent_ratio:.2f}:1',
                'severity': 'warning',
            })

        # Check for cache efficiency problems
        if recent_cache_rate < 0.60:
            warnings.append({
                'type': 'low_cache_hit_rate',
                'message': f'Fingerprint cache hit rate is {recent_cache_rate:.1%}',
                'severity': 'warning',
            })

        return {
            'status': 'ok' if not warnings else 'degraded',
            'current_ratio': recent_ratio,
            'historical_ratio': historical_ratio,
            'cache_hit_rate': recent_cache_rate,
            'warnings': warnings,
        }

    def project_capacity_exhaustion(self) -> dict:
        """
        Project when storage will be exhausted based on growth trends.
        """
        if len(self.history) < 100:
            return {'status': 'insufficient_data'}

        # Calculate growth rate over last 100 samples
        first = self.history[-100]
        last = self.history[-1]

        bytes_growth = last.physical_bytes - first.physical_bytes
        time_seconds = last.timestamp - first.timestamp
        growth_rate_per_day = (bytes_growth / time_seconds) * 86400

        capacity = self.storage.get_physical_capacity()
        remaining = capacity - last.physical_bytes

        if growth_rate_per_day > 0:
            days_remaining = remaining / growth_rate_per_day
        else:
            days_remaining = float('inf')

        return {
            'physical_used': last.physical_bytes,
            'physical_capacity': capacity,
            'utilization': last.physical_bytes / capacity,
            'growth_rate_gb_day': growth_rate_per_day / (1024 ** 3),
            'days_until_full': days_remaining,
        }
```

When efficiency degrades or costs rise, several adjustments can help:
1. Chunk size adjustments: smaller chunks find more duplicates but inflate the fingerprint index; larger chunks cut metadata and memory overhead at the cost of missed matches.
2. Compression level tuning: drop to a faster level (for example, Zstd level 3 instead of 19) when CPU is the bottleneck, and reserve high levels for cold data.
3. Dedup scope: widen from per-volume to global deduplication when cross-volume redundancy is high; narrow it when index memory becomes the constraint.
4. Cache sizing: grow the fingerprint cache when hit rates fall, since every miss turns a dedup lookup into disk I/O.
5. Selective optimization: skip compression and dedup entirely for data that won't benefit, such as already-compressed media (see the sketch after this list).
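A sketch of the selective-optimization check from item 5: skip compression for file types that are already compressed, and fall back to a cheap probe (compress a small sample) for unknown types. The extension list, sample size, and threshold are assumptions.

```python
import zlib

ALREADY_COMPRESSED = {".mp4", ".mkv", ".jpg", ".png", ".zip", ".gz", ".zst"}  # assumption

def should_compress(filename: str, sample: bytes, min_gain: float = 0.05) -> bool:
    """Return False when compression is unlikely to pay for its CPU cost."""
    ext = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext in ALREADY_COMPRESSED:
        return False
    # Cheap probe: compress the first few KB and check whether anything is gained
    probe = sample[:8192]
    if not probe:
        return False
    gain = 1 - len(zlib.compress(probe, 1)) / len(probe)
    return gain >= min_gain
```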
Frame efficiency tuning in cost terms: If compression saves $1,000/month in storage but costs $500/month in compute, it's profitable. But if moving from Zstd level 3 to level 19 saves $50 in storage while costing $200 in compute, it's counterproductive. Monitor and model costs, not just ratios.
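The same reasoning as a back-of-the-envelope model; the unit prices are placeholders, not quotes from any provider:

```python
def monthly_net_savings(gb_saved: float, storage_cost_per_gb: float,
                        extra_cpu_hours: float, cpu_cost_per_hour: float) -> float:
    """Positive means the optimization pays for itself; negative means it does not."""
    return gb_saved * storage_cost_per_gb - extra_cpu_hours * cpu_cost_per_hour

# The two scenarios from the text, with placeholder unit prices
print(monthly_net_savings(50_000, 0.02, 5_000, 0.10))  # saves $1,000, costs $500 -> +500.0
print(monthly_net_savings(2_500, 0.02, 2_000, 0.10))   # saves $50, costs $200 -> -150.0
```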
Real-world implementations reveal how efficiency strategies play out at scale.
A Fortune 500 company's backup infrastructure:
Before optimization:
After Data Domain with dedup:
Key enablers:
A SaaS platform storing customer documents:
Workload characteristics:
Efficiency strategy:
Results:
Lesson: General file storage sees lower efficiency than backup workloads—users create unique content. Dictionary compression was key for small documents.
A large Kubernetes platform's container registry:
Before optimization:
Content-addressable storage:
Results:
Lesson: Content-addressable storage is deduplication by construction; layers shared across images are stored only once.
Storage efficiency is a multi-dimensional optimization problem. Success requires understanding your workload, selecting appropriate techniques, monitoring continuously, and tuning deliberately.
Core principles:
Measure accurately: Use true data reduction ratios, not marketing numbers. Benchmark with your actual data.
Order matters: Dedup → Compress → Encrypt → Encode. Wrong order nullifies optimization.
Know your workload: VDI and backup are goldmines for efficiency. Media files are not. Apply techniques appropriately.
Compound effects: Thin provisioning, dedup, and compression together can achieve 30:1+ ratios for ideal workloads.
Monitor and tune: Efficiency degrades over time. Track metrics, project capacity, and adjust configurations.
Balance cost: CPU for compression, RAM for dedup indexes, I/O for garbage collection—all have costs that offset savings.
You now understand how to measure, achieve, and maintain storage efficiency at scale. You know the correct order of operations, how different workloads behave, and how to monitor and tune for sustained efficiency. Next, we'll examine the CPU vs. storage trade-offs—when the computational cost of optimization outweighs the storage savings.