Every organization with significant data infrastructure harbors a secret: they're storing far more duplicate data than they realize. Studies consistently show that enterprise storage environments contain 50-80% redundant data. Email attachments forwarded to hundreds of recipients. Virtual machine images with 95% identical operating system files. Backup snapshots that re-copy unchanged data night after night.
This redundancy isn't merely wasteful—it's catastrophically expensive at scale. Consider a cloud storage provider managing exabytes of data where 60% is redundant. Eliminating that redundancy doesn't just save storage costs; it reduces replication bandwidth, accelerates backup windows, decreases power consumption, and improves data retrieval performance. The financial implications run into hundreds of millions of dollars annually for large-scale operators.
Data deduplication is the systematic practice of identifying and eliminating this redundancy. But implementing deduplication correctly requires deep understanding of trade-offs that span computer science fundamentals, systems engineering, and business requirements. Get it wrong, and you introduce latency, consume excessive CPU, or worse—lose data integrity.
By the end of this page, you will understand the complete landscape of data deduplication strategies—from file-level to sub-block deduplication, inline versus post-process approaches, source versus target deduplication, and the critical role of content-defined chunking. You'll learn how industry leaders like NetApp, Dell EMC, and cloud providers implement deduplication, and how to select the right strategy for your storage architecture.
At its core, data deduplication is conceptually simple: store each unique piece of data exactly once, and replace all duplicates with references to that single copy. The complexity—and the engineering challenge—lies entirely in implementation details.
The Deduplication Process:
Every deduplication system, regardless of granularity or timing, follows a common workflow:
Deduplication relies on cryptographic hashes to identify matching content. If two different data segments ever produced the same hash (a collision), the system would incorrectly treat them as duplicates, silently corrupting data. Modern systems therefore use SHA-256 or stronger hashes: by the birthday bound, you would need to fingerprint on the order of 2^128 (roughly 10^38) chunks before expecting a single collision, orders of magnitude more data than exists in the entire digital universe.
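To see why this is safe in practice, the birthday bound approximates the probability of any collision among n random b-bit fingerprints as n²/2^(b+1). The quick calculation below is a plain-Python sketch using a hypothetical workload of one exabyte of unique data chunked at 8KB; no deduplication library is involved.

```python
def collision_probability(num_chunks: float, hash_bits: int = 256) -> float:
    """Birthday-bound estimate: P(collision) ~= n^2 / 2^(b+1) for n << 2^(b/2)."""
    return num_chunks ** 2 / 2 ** (hash_bits + 1)

# Hypothetical workload: 1 EB of unique data in 8KB chunks.
chunks = 1e18 / 8192
print(f"{chunks:.2e} chunks -> collision probability {collision_probability(chunks):.1e}")
# ~1.2e+14 chunks -> collision probability ~6e-50, i.e. effectively zero.
```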
The Deduplication Ratio:
The effectiveness of deduplication is measured by the deduplication ratio—the ratio of logical data size (original data) to physical data size (after deduplication).
Deduplication Ratio = Logical Data Size / Physical Data Size
A ratio of 10:1 means you're storing 10 TB of logical data in just 1 TB of physical storage. Typical ratios vary dramatically by workload:
| Workload Type | Typical Ratio | Reason | Best Dedup Strategy |
|---|---|---|---|
| Virtual Desktop Infrastructure (VDI) | 20:1 to 70:1 | Identical OS/application binaries across thousands of desktops | Fixed-size blocks with caching |
| Database Backups | 10:1 to 30:1 | Daily backups differ only in changed records | Variable-length chunking |
| Email Archives | 3:1 to 15:1 | Attachments forwarded to multiple recipients | Sub-file deduplication |
| General File Shares | 2:1 to 6:1 | Document versions, copied files | File or block-level dedup |
| Video/Audio Media | 1.1:1 to 1.5:1 | Already compressed, each file unique | Often better to skip dedup |
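As a worked example of the ratio formula above (a trivial sketch; the workload numbers are hypothetical):

```python
def dedup_ratio(logical_bytes: int, physical_bytes: int) -> float:
    """Deduplication ratio = logical data size / physical data size."""
    return logical_bytes / physical_bytes

def space_savings(logical_bytes: int, physical_bytes: int) -> float:
    """Fraction of raw capacity saved, e.g. 0.90 for a 10:1 ratio."""
    return 1 - physical_bytes / logical_bytes

# Example: 50 TB of VDI images reduced to 2 TB of unique blocks.
logical, physical = 50 * 10**12, 2 * 10**12
print(f"{dedup_ratio(logical, physical):.0f}:1 ratio, "
      f"{space_savings(logical, physical):.0%} space savings")   # 25:1 ratio, 96% space savings
```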
The granularity of deduplication—the size of chunks being compared—is perhaps the most critical architectural decision. Finer granularity finds more duplicates but requires more metadata overhead and CPU cycles. Coarser granularity is faster but misses redundancy within files.
The simplest approach: compare entire files. If two files are byte-for-byte identical, store one copy and create references (hard links or pointers) from both logical paths.
How It Works:
Advantages:
Disadvantages:
Use Cases: Email attachments, static file archives, software distribution (identical binaries).
```python
import hashlib
from typing import Dict
from dataclasses import dataclass


@dataclass
class FileReference:
    file_id: str
    original_path: str


class FileDeduplicationIndex:
    """
    File-level deduplication index.
    Maps content hashes to storage locations.
    """

    def __init__(self, storage_backend):
        self.index: Dict[str, str] = {}  # hash -> file_id
        self.storage = storage_backend

    def compute_file_hash(self, file_path: str) -> str:
        """Compute SHA-256 hash of entire file."""
        sha256 = hashlib.sha256()
        with open(file_path, 'rb') as f:
            # Read in chunks to handle large files
            for chunk in iter(lambda: f.read(8192), b''):
                sha256.update(chunk)
        return sha256.hexdigest()

    def store_file(self, file_path: str) -> FileReference:
        """
        Store file with deduplication.
        Returns reference to stored (or existing) file.
        """
        content_hash = self.compute_file_hash(file_path)

        if content_hash in self.index:
            # Duplicate found - return reference to existing file
            existing_file_id = self.index[content_hash]
            return FileReference(
                file_id=existing_file_id,
                original_path=file_path
            )
        else:
            # New unique file - store it
            file_id = self.storage.write_file(file_path)
            self.index[content_hash] = file_id
            return FileReference(
                file_id=file_id,
                original_path=file_path
            )
```

Divide files into fixed-size blocks (typically 4KB to 128KB) and deduplicate at the block level. This catches redundancy within files and across files that share common segments.
How It Works:
Advantages:
Disadvantages:
Typical Block Sizes:
Consider a 1MB file divided into 16 × 64KB blocks. Now insert 1 byte at position 0. With fixed-size chunking, the contents of EVERY block shift: block 1 now holds the inserted byte plus original bytes 0-65534 instead of bytes 0-65535, block 2 starts one byte earlier than before, and so on. All 16 hashes change. Zero deduplication occurs despite 99.99% content similarity. This fundamental limitation drove the development of content-defined chunking.
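The failure mode is easy to demonstrate: chunk a buffer into fixed-size blocks, insert one byte at the front, and compare block hashes. This sketch uses only the standard library; the 1MB buffer and 64KB block size mirror the example above.

```python
import hashlib
import os

def fixed_block_hashes(data: bytes, block_size: int = 65536) -> list[str]:
    """Split data into fixed-size blocks and return each block's SHA-256."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]

original = os.urandom(1024 * 1024)    # 1MB -> 16 x 64KB blocks
edited = b"\x00" + original           # insert a single byte at position 0

before = set(fixed_block_hashes(original))
after = set(fixed_block_hashes(edited))
print(f"blocks shared after a 1-byte insert: {len(before & after)}")
# Every boundary shifted by one byte, so essentially no block hashes match.
```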
The most sophisticated approach: determine chunk boundaries based on content rather than position. This makes boundaries immune to insertions and deletions—shifting data doesn't shift boundaries.
How Content-Defined Chunking (CDC) Works:
The magic: Insertions or deletions only affect the local region. Chunks before and after the modification retain their original boundaries and hashes, preserving deduplication potential.
```python
import math


class RabinFingerprint:
    """
    Rabin fingerprint for rolling hash computation.
    Enables efficient sliding window hash updates.
    """

    def __init__(self, window_size: int = 48, modulus: int = (1 << 31) - 1):
        self.window_size = window_size
        self.modulus = modulus
        self.base = 257  # Prime base
        self.window = bytearray(window_size)
        self.window_pos = 0
        self.hash_value = 0
        # Precompute base^(window_size-1) for removing old byte
        self.pow_base = pow(self.base, window_size - 1, self.modulus)

    def slide(self, new_byte: int) -> int:
        """Add new byte, remove oldest byte, return new hash."""
        old_byte = self.window[self.window_pos]

        # Remove contribution of oldest byte
        self.hash_value -= old_byte * self.pow_base
        self.hash_value %= self.modulus

        # Add new byte
        self.hash_value = (self.hash_value * self.base + new_byte) % self.modulus

        # Update window
        self.window[self.window_pos] = new_byte
        self.window_pos = (self.window_pos + 1) % self.window_size

        return self.hash_value


def content_defined_chunking(
    data: bytes,
    min_chunk: int = 2048,    # Minimum chunk size (2KB)
    max_chunk: int = 65536,   # Maximum chunk size (64KB)
    avg_chunk: int = 8192,    # Target average (8KB)
) -> list[tuple[int, int]]:   # Returns [(start, end), ...]
    """
    Split data into variable-length chunks using content-defined boundaries.

    Chunk boundaries occur when:
    - the low N bits of the Rabin hash match the boundary pattern (all ones here)
    - OR maximum chunk size is reached (prevents unbounded chunks)

    Minimum chunk size is enforced to prevent tiny chunks.
    """
    chunks = []
    fingerprint = RabinFingerprint(window_size=48)

    # Mask determines average chunk size: avg = 2^(mask_bits)
    # For 8KB average: 13 bits, mask = 0x1FFF
    mask_bits = int(math.log2(avg_chunk))
    mask = (1 << mask_bits) - 1

    chunk_start = 0
    for i, byte in enumerate(data):
        hash_value = fingerprint.slide(byte)
        chunk_size = i - chunk_start + 1

        # Check for chunk boundary
        is_boundary = False
        if chunk_size >= max_chunk:
            # Force boundary at max size
            is_boundary = True
        elif chunk_size >= min_chunk:
            # Check if hash indicates natural boundary
            if (hash_value & mask) == mask:
                is_boundary = True

        if is_boundary:
            chunks.append((chunk_start, i + 1))
            chunk_start = i + 1

    # Handle remaining data
    if chunk_start < len(data):
        chunks.append((chunk_start, len(data)))

    return chunks
```

Variable-Length Chunking Trade-offs:
| Aspect | Fixed-Size Blocks | Variable-Length (CDC) |
|---|---|---|
| Dedup effectiveness | Lower | Higher (30-50% better) |
| Metadata overhead | Predictable | Variable |
| Computational cost | Low | Higher (rolling hash) |
| Resilience to edits | Poor | Excellent |
| Implementation complexity | Simple | Complex |
| Best for | VDI, static content | Backups, file sync |
Traditional CDC built on Rabin fingerprints is computationally expensive. FastCDC replaces the modular arithmetic with a gear-based rolling hash (one table lookup, one shift, and one add per byte) and is reported to be 3-10x faster while maintaining similar deduplication ratios. Modern backup tools such as restic and BorgBackup likewise rely on optimized rolling-hash chunkers.
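The gear idea itself fits in a few lines: a 256-entry table of random 64-bit values stands in for Rabin's modular arithmetic, and each byte costs one shift, one add, and one table lookup. The sketch below is a simplified illustration of that style of chunker, not FastCDC itself; it omits FastCDC's normalized chunking and its specific boundary masks, and the table, names, and parameters are illustrative.

```python
import random

# 256 pseudo-random 64-bit values; production chunkers ship a fixed table
# so that chunk boundaries stay stable across runs and machines.
random.seed(2024)
GEAR = [random.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1

def gear_chunks(data: bytes, min_size: int = 2048, mask_bits: int = 13,
                max_size: int = 65536) -> list[tuple[int, int]]:
    """Return (start, end) boundaries using a gear-style rolling hash.

    Expected average chunk size is roughly min_size + 2**mask_bits bytes.
    """
    # Test the high bits of the hash so the boundary decision depends on a
    # window of recent bytes rather than just the last few.
    boundary_mask = ((1 << mask_bits) - 1) << (64 - mask_bits)
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & MASK64   # one shift + one add per byte
        size = i - start + 1
        if size >= max_size or (size >= min_size and (h & boundary_mask) == 0):
            chunks.append((start, i + 1))
            start, h = i + 1, 0
    if start < len(data):
        chunks.append((start, len(data)))
    return chunks
```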
When deduplication occurs has profound implications for performance, storage efficiency, and system complexity. The two primary approaches—inline and post-process—offer fundamentally different trade-offs.
Deduplication occurs during the write path, before data is persisted to storage. Every incoming block is fingerprinted and checked against the index in real-time.
Workflow:
Critical Implications:
Latency Impact: Write latency increases due to hash computation and index lookup. For storage systems with SLA requirements (databases, real-time applications), this can be unacceptable.
Index in Memory: The dedup index must be accessible with low latency—typically in RAM. For every 1TB of unique data with 8KB chunks, you need roughly 128 million index entries. At 40 bytes per entry (hash + pointer), that's about 5GB of RAM per TB of unique data.
Storage Efficiency: Data is never stored redundantly. You never over-provision, and you see immediate capacity savings.
Backup Windows: Since dedup happens inline, backup windows can be shorter—less data is actually written to storage.
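A stripped-down inline write path can be sketched as follows. The chunker argument is assumed to be a function like content_defined_chunking above, and chunk_store.put() is an assumed interface for a durable chunk container; batching, compression, and crash consistency are ignored.

```python
import hashlib
from typing import Dict, List

class InlineDedupWriter:
    """Fingerprint every incoming chunk before anything is persisted."""

    def __init__(self, chunker, chunk_store):
        self.chunker = chunker              # e.g. content_defined_chunking
        self.chunk_store = chunk_store      # durable chunk container (assumed API)
        self.index: Dict[str, str] = {}     # fingerprint -> storage location, kept in RAM

    def write(self, data: bytes) -> List[str]:
        """Write a stream; return the file recipe as a list of fingerprints."""
        recipe = []
        for start, end in self.chunker(data):
            chunk = data[start:end]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            if fingerprint not in self.index:
                # Unique chunk: pay the hash + lookup + write cost once.
                self.index[fingerprint] = self.chunk_store.put(fingerprint, chunk)
            # Duplicate chunk: only a reference is recorded, nothing is written.
            recipe.append(fingerprint)
        return recipe
```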
Data is written to storage immediately without deduplication. A background process later scans the stored data, identifies duplicates, and consolidates them.
Workflow:
Critical Implications:
No Write Latency Impact: Applications see raw storage performance. Essential for latency-sensitive workloads.
Temporary Over-Provisioning: You need storage for undeduped data until post-process completes. If you expect 3:1 dedup ratio, you need 3x the final capacity during the landing window.
Computational Flexibility: Deduplication runs during off-peak hours, using otherwise idle resources. Can be throttled or paused.
Recovery Complexity: If system fails between write and dedup, data is still safe (just not deduplicated yet).
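For contrast, a post-process pass amounts to a background scan over blocks that were already written at full speed. The volume interface below (block_ids, read_block, remap, free) is purely illustrative; real systems throttle the scan and update references atomically.

```python
import hashlib

def post_process_dedup(volume) -> int:
    """Scan previously written blocks, remap duplicates, and return blocks freed."""
    seen = {}    # fingerprint -> canonical block id
    freed = 0
    for block_id in volume.block_ids():
        fingerprint = hashlib.sha256(volume.read_block(block_id)).hexdigest()
        if fingerprint in seen:
            volume.remap(block_id, seen[fingerprint])   # point metadata at the canonical copy
            volume.free(block_id)                       # reclaim the redundant block
            freed += 1
        else:
            seen[fingerprint] = block_id
    return freed
```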
| Factor | Inline | Post-Process |
|---|---|---|
| Write latency | Higher (+10-50%) | No impact |
| Storage efficiency | Immediate savings | Delayed savings |
| Temporary capacity needed | None | 100% + overhead |
| CPU during writes | High | Minimal |
| Background resources | Minimal | Significant |
| Best for | Backup targets, VDI | Primary storage, databases |
| Implementation complexity | Higher | Moderate |
Many enterprise systems use hybrid deduplication. Dell EMC Data Domain uses 'inline with fingerprint caching'—common patterns are deduplicated inline using a hot fingerprint cache, while cold patterns are written and post-processed later. NetApp ONTAP offers both modes configurable per volume. Pure Storage FlashBlade uses inline dedup for metadata and post-process for large blobs.
Where deduplication occurs—at the source (client) or target (storage system)—determines bandwidth usage, client complexity, and security implications.
Deduplication happens on the client machine before data is transmitted. The client queries the target to determine which chunks already exist, then transmits only unique chunks.
Workflow:
Advantages:
Disadvantages:
```python
import hashlib


class SourceDedupClient:
    """
    Client-side deduplication implementation.
    Minimizes network transfer by only sending unique chunks.
    """

    def __init__(self, server_connection, chunker):
        self.server = server_connection
        self.chunker = chunker

    def backup_file(self, file_path: str) -> "BackupReceipt":  # receipt type defined by the server library
        """Backup file with source-side deduplication."""
        # Step 1: Chunk the file locally
        with open(file_path, 'rb') as f:
            data = f.read()

        chunks = self.chunker.chunk(data)
        chunk_hashes = []
        chunk_data = {}

        # Step 2: Compute hashes for all chunks
        for start, end in chunks:
            chunk_bytes = data[start:end]
            chunk_hash = hashlib.sha256(chunk_bytes).hexdigest()
            chunk_hashes.append(chunk_hash)
            chunk_data[chunk_hash] = chunk_bytes

        # Step 3: Query server for existing chunks
        # This is the key optimization - we only send hashes first
        needed_hashes = self.server.query_missing_chunks(chunk_hashes)

        # Step 4: Upload only missing chunks
        upload_count = 0
        upload_bytes = 0
        for chunk_hash in needed_hashes:
            self.server.upload_chunk(chunk_hash, chunk_data[chunk_hash])
            upload_count += 1
            upload_bytes += len(chunk_data[chunk_hash])

        # Step 5: Register the file as a sequence of chunk references
        file_recipe = {
            'path': file_path,
            'chunks': chunk_hashes,
            'size': len(data),
        }
        receipt = self.server.register_file(file_recipe)

        print(f"Deduplication savings: {len(data)} bytes -> {upload_bytes} bytes")
        print(f"Chunks: {len(chunks)} total, {upload_count} uploaded")

        return receipt
```

All data is transmitted to the target storage system, which performs deduplication locally. The client sends complete data without awareness of deduplication.
Workflow:
Advantages:
Disadvantages:
| Scenario | Recommended | Rationale |
|---|---|---|
| WAN backup (remote office) | Source | Bandwidth is expensive; minimize transfer |
| LAN backup (datacenter) | Target | Bandwidth is cheap; simpler architecture |
| Mobile/IoT devices | Target | Client CPU/battery constraints |
| Backup appliances | Target | Dedicated hardware handles load |
| Cloud-native apps | Source | Reduce egress costs to cloud storage |
| Enterprise sync (Dropbox-like) | Source | Minimize sync time and bandwidth |
In source deduplication, attackers can probe whether specific content exists on the server by computing its hash and querying. If the server says 'already exists,' the attacker knows that content is stored by some user. Mitigations include: requiring proof of possession (client must demonstrate it has the full chunk, not just the hash), per-user keys that make hashes user-specific, and rate limiting queries.
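The per-user-key mitigation is straightforward to sketch: derive fingerprints with an HMAC keyed by a per-user secret rather than a bare hash. The construction below is illustrative (not taken from any particular product) and shows the trade-off explicitly: it blinds cross-user probing, but it also gives up cross-user deduplication.

```python
import hashlib
import hmac

def user_scoped_fingerprint(user_key: bytes, chunk: bytes) -> str:
    """Fingerprint a chunk under a per-user secret key.

    The same chunk yields different fingerprints for different users, so an
    attacker cannot probe the server for someone else's content by hash alone.
    """
    return hmac.new(user_key, chunk, hashlib.sha256).hexdigest()

attachment = b"...contents of quarterly-report.pdf..."
alice = user_scoped_fingerprint(b"alice-secret-key", attachment)
bob = user_scoped_fingerprint(b"bob-secret-key", attachment)
print(alice == bob)   # False: identical content, different fingerprints per user
```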
The deduplication index—the data structure mapping content hashes to storage locations—is the critical bottleneck in any dedup system. Its design determines system scalability, performance, and reliability.
For a system storing 100TB of unique data with an 8KB average chunk size, the index must track roughly 12 billion chunks; at around 40 bytes per entry, the raw index alone approaches 500GB.
This index must support low-latency lookups and inserts at full ingest speed, and it must survive crashes without corruption. Three common architectures address these constraints:
1. In-Memory Index
Store the entire index in RAM for fastest access.
2. Tiered Index with Bloom Filters
Keep a Bloom filter in memory as a first-pass filter, with the full index on SSD.
3. Locality-Based Indexing
Exploit the fact that similar files have similar chunk sequences. Cache index entries for recently-seen chunk neighborhoods.
```python
from typing import Optional

import redis  # Example using Redis as SSD-backed index
from bloom_filter import BloomFilter


class TieredDedupIndex:
    """
    Two-tier deduplication index using Bloom filter and Redis.

    Bloom filter in memory catches definite non-matches instantly.
    Redis (on SSD) stores full index for confirmed lookups.
    """

    def __init__(
        self,
        expected_chunks: int = 10_000_000_000,      # 10 billion chunks
        bloom_false_positive_rate: float = 0.01,    # 1% false positive
        redis_client: redis.Redis = None,
    ):
        # Bloom filter: ~10GB for 10B items at 1% FP rate
        self.bloom = BloomFilter(
            max_elements=expected_chunks,
            error_rate=bloom_false_positive_rate
        )
        self.redis = redis_client or redis.Redis(
            host='localhost', port=6379, decode_responses=False
        )
        self.stats = {
            'lookups': 0,
            'bloom_negatives': 0,    # Definite new chunks
            'bloom_positives': 0,    # Potential duplicates
            'actual_duplicates': 0,  # Confirmed duplicates
            'false_positives': 0,    # Bloom said yes, but new
        }

    def lookup(self, chunk_hash: bytes) -> Optional[bytes]:
        """
        Look up chunk hash, return storage location if exists.

        Returns:
            Storage location (bytes) if chunk exists, None otherwise.
        """
        self.stats['lookups'] += 1

        # Step 1: Check Bloom filter (in-memory, ~100ns)
        if chunk_hash not in self.bloom:
            # Definitely not in index - no need to check SSD
            self.stats['bloom_negatives'] += 1
            return None

        self.stats['bloom_positives'] += 1

        # Step 2: Bloom says maybe - check Redis (SSD, ~1ms)
        location = self.redis.get(f"chunk:{chunk_hash.hex()}")

        if location is not None:
            self.stats['actual_duplicates'] += 1
            return location
        else:
            # False positive from Bloom filter
            self.stats['false_positives'] += 1
            return None

    def insert(self, chunk_hash: bytes, location: bytes) -> None:
        """Add chunk to index."""
        # Add to Bloom filter (cannot be removed later)
        self.bloom.add(chunk_hash)

        # Add to Redis with reference count
        key = f"chunk:{chunk_hash.hex()}"
        self.redis.set(key, location)
        self.redis.set(f"refcount:{chunk_hash.hex()}", 1)

    def increment_ref(self, chunk_hash: bytes) -> int:
        """Increment reference count for existing chunk."""
        return self.redis.incr(f"refcount:{chunk_hash.hex()}")
```

The 'sparse indexing' technique used by some enterprise backup systems goes further: it indexes only sampled chunks (e.g., every 10th chunk). When a sample matches, the system follows the chunk sequence on disk to check adjacent chunks. This reduces index size by 10x with minimal loss in dedup effectiveness, because chunk sequences tend to repeat together.
When a file is deleted from a deduplicated storage system, you cannot simply remove its chunks—those chunks might be referenced by other files. This creates the complex problem of garbage collection in deduplicated environments.
Each chunk maintains a reference count tracking how many files (or "recipes") reference it:
Chunk A (hash: abc123)
- Referenced by: File1, File2, File3
- Reference count: 3
When File1 is deleted:
- Decrement reference count to 2
- Chunk still needed, do not delete
When File2 and File3 are deleted:
- Reference count becomes 0
- Chunk becomes garbage, can be reclaimed
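In code, this bookkeeping is little more than a counter keyed by fingerprint. The toy class below ignores the crash-consistency and distribution problems discussed next; all names are illustrative.

```python
from collections import Counter
from typing import Dict, List

class RefCountedChunkStore:
    """Toy chunk store with reference-counted garbage collection."""

    def __init__(self):
        self.chunks: Dict[str, bytes] = {}
        self.refcounts: Counter = Counter()

    def add_file(self, recipe: List[str], chunk_bytes: Dict[str, bytes]) -> None:
        """Register a file; one reference per recipe entry."""
        for fingerprint in recipe:
            if fingerprint not in self.chunks:
                self.chunks[fingerprint] = chunk_bytes[fingerprint]
            self.refcounts[fingerprint] += 1

    def delete_file(self, recipe: List[str]) -> int:
        """Drop one reference per recipe entry; reclaim chunks that reach zero."""
        freed = 0
        for fingerprint in recipe:
            self.refcounts[fingerprint] -= 1
            if self.refcounts[fingerprint] == 0:
                del self.chunks[fingerprint]
                del self.refcounts[fingerprint]
                freed += 1
        return freed
```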
Problems with reference counting: a crash between writing data and updating counts can leave counts permanently wrong, risking either leaked space or premature deletion, and keeping counts consistent across distributed nodes adds significant coordination overhead.
An alternative approach borrowed from programming language runtimes:
Phase 1: Mark
Phase 2: Sweep
Advantages:
Disadvantages:
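A minimal mark-and-sweep pass over such a store can be sketched in a dozen lines, assuming the collector can enumerate every live file recipe while the system is quiesced:

```python
from typing import Dict, Iterable, List

def mark_and_sweep(chunks: Dict[str, bytes],
                   live_recipes: Iterable[List[str]]) -> int:
    """Reclaim every chunk not reachable from any live file recipe.

    Phase 1 (mark): walk all recipes and record reachable fingerprints.
    Phase 2 (sweep): delete chunks whose fingerprint was never marked.
    Returns the number of chunks reclaimed.
    """
    marked = set()
    for recipe in live_recipes:              # mark phase
        marked.update(recipe)

    garbage = [fp for fp in chunks if fp not in marked]
    for fp in garbage:                       # sweep phase
        del chunks[fp]
    return len(garbage)
```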
| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Reference counting | Immediate reclamation, Simple logic | Crash vulnerability, Distributed complexity | Single-node systems |
| Mark-and-sweep | Crash-safe, Handles all cases | Expensive, Requires quiescence | Periodic deep clean |
| Generational | Efficient for typical workloads | Complex implementation | Large-scale backup |
| Log-structured merge | Write-optimized, Background compaction | Temporary space amplification | Cloud object stores |
Chunks are typically stored in containers (files holding many chunks) for I/O efficiency. When some chunks in a container are garbage-collected while others remain live, the container becomes fragmented. Eventually, you need 'container compaction'—copying live chunks to new containers and reclaiming the old. This is disk-I/O intensive and must be carefully scheduled.
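Compaction follows the same copy-live-then-drop pattern used by log-structured systems. The sketch below is illustrative (container layout, index shape, and names are assumptions): live chunks are copied into a new container and the index is repointed, after which the old container can be deleted wholesale.

```python
from typing import Dict, List, Set, Tuple

def compact_container(container: List[Tuple[str, bytes]],
                      live_fingerprints: Set[str],
                      index: Dict[str, Tuple[int, int]],
                      new_container_id: int) -> List[Tuple[str, bytes]]:
    """Copy only live chunks into a fresh container and repoint the index.

    `container` is a list of (fingerprint, chunk_bytes) entries; dead entries
    are simply not copied, which is where the space comes back.
    """
    new_container = []
    for fingerprint, chunk in container:
        if fingerprint in live_fingerprints:
            new_container.append((fingerprint, chunk))
            # Repoint the dedup index at the chunk's new (container, slot) location.
            index[fingerprint] = (new_container_id, len(new_container) - 1)
    return new_container
```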
Understanding how industry leaders implement deduplication reveals the practical trade-offs made at scale.
The original purpose-built deduplication appliance, Data Domain pioneered many techniques still used today:
Major cloud providers typically do NOT offer deduplication as a customer-visible feature for general object storage, for several reasons:
However, within their infrastructure:
These open-source backup tools demonstrate elegant dedup implementations:
Data deduplication is far more than a simple storage optimization—it's a complex engineering discipline that touches hashing, data structures, distributed systems, and resource management.
Core Concepts:
You now have a comprehensive understanding of data deduplication strategies: the trade-offs between granularity levels, timing approaches, and location decisions. Next, we'll explore compression algorithms, which are often used alongside deduplication to further reduce storage requirements, and examine when to apply each technique.