NAND flash has an uncomfortable truth: pages cannot be overwritten in place. When you update a file, the new data is written to fresh pages while the old pages become invalid ghosts—consuming physical space but no longer representing current data. Left unchecked, these invalid pages would eventually consume the entire SSD.
Garbage collection (GC) is the SSD firmware's continuous battle against this accumulation of obsolete data. By consolidating valid pages and erasing blocks full of invalid data, GC reclaims physical space and maintains the free block pool that enables continued writing. Yet GC is not free—it consumes flash bandwidth, increases wear, and can cause the dreaded performance cliff that brings even premium SSDs to their knees.
By the end of this page, you will understand why garbage collection is necessary, comprehend the algorithms that select victim blocks and relocate data, analyze write amplification and its performance impact, and recognize how GC behavior influences SSD performance under sustained workloads.
To understand why garbage collection is necessary, we must revisit the fundamental asymmetry of NAND flash:
Program Operation: Individual pages (4-16KB) can be written (programmed) independently.
Erase Operation: An entire block (64-512 pages, or 256KB-4MB) must be erased as a unit.
Write-Once Constraint: Pages can only transition from erased (all 1s) to programmed; any change requires erasing the entire block.
This asymmetry creates a dilemma: when you update 4KB of data, the SSD writes to a new page (the old page becomes invalid). You cannot overwrite the old page—it must wait for a block erase. But erasing the block destroys all 63-511 sibling pages, which may contain valid data.
The Accumulation Problem:
Consider a simple scenario: a file that is updated over and over. Each update programs fresh pages for the new contents and leaves the previous copies behind as invalid pages.
Event after event, invalid pages accumulate across blocks. The SSD appears full even when logical capacity is available because physical space is consumed by obsolete data.
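To make the accumulation concrete, here is a minimal Python sketch of out-of-place updates with no garbage collection. All geometry numbers (page size, pages per block, block count) are invented for illustration; the point is only that a tiny amount of live data can exhaust the physical pool.

```python
# Toy model: out-of-place page writes with no GC. Sizes are illustrative only.
PAGE_SIZE_KB = 4
PAGES_PER_BLOCK = 64
TOTAL_BLOCKS = 256                       # 64 MB of physical flash in this toy

free_pages = TOTAL_BLOCKS * PAGES_PER_BLOCK
live = set()                             # logical pages with a current (valid) copy
invalid_pages = 0

def host_write(lpn):
    """Program a fresh page for lpn; the old copy (if any) becomes invalid."""
    global free_pages, invalid_pages
    if free_pages == 0:
        raise RuntimeError("no erased pages left; a block erase (GC) is required")
    if lpn in live:
        invalid_pages += 1               # previous copy is now an obsolete ghost
    live.add(lpn)
    free_pages -= 1

# Update the same 1 MB working set (256 logical pages) over and over.
try:
    for round_no in range(1000):
        for lpn in range(256):
            host_write(lpn)
except RuntimeError as exc:
    print(f"stopped during round {round_no}: {exc}")
    print(f"live data: {len(live) * PAGE_SIZE_KB / 1024:.0f} MB")
    print(f"invalid pages: {invalid_pages * PAGE_SIZE_KB / 1024:.0f} MB")
```

With only 1 MB of live data, this toy drive runs out of erased pages after 64 MB of host writes: exactly the "appears full even when logical capacity is available" situation described above.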
Free Block Exhaustion:
When free blocks are depleted, writes cannot proceed. Before the pending write can complete, the SSD must select a victim block, copy its remaining valid pages elsewhere, and erase it to create free space.
This is foreground GC—blocking host operations until space is reclaimed.
When an SSD runs out of free blocks and must perform foreground GC, write latency can spike from microseconds to hundreds of milliseconds. A GC cycle relocating valid data from one 2MB block requires reading and rewriting potentially 1MB+ of data—all while the host waits. This is the 'performance cliff' that turns a 500,000 IOPS SSD into a stuttering mess.
Modern SSDs implement sophisticated garbage collection architectures designed to minimize host impact while maintaining free block availability.
GC Modes:
| Mode | Trigger | Host Impact | Urgency |
|---|---|---|---|
| Idle GC | No pending host I/O; free blocks above threshold | None (background) | Low |
| Background GC | Free blocks below soft threshold | Minimal (interleaved) | Medium |
| Foreground GC | Free blocks critically low | Severe (blocking) | Critical |
Idle Garbage Collection:
When the SSD detects no pending host commands, it executes GC in the background, quietly selecting victims, relocating valid data, and erasing blocks during the idle window.
Idle GC is the ideal scenario—invisible to the host, completed during naturally occurring pauses in I/O activity.
Background Garbage Collection:
When free blocks drop below a soft threshold (e.g., 20% of over-provisioned space), background GC activates even during host activity: GC operations are interleaved with host commands, giving up a little bandwidth and latency to keep the free block pool from collapsing.
Foreground Garbage Collection:
If free blocks are exhausted before GC can reclaim space, foreground GC stalls host operations: incoming writes wait while the controller copies valid pages out of a victim block and erases it, which is the source of the millisecond-scale latency spikes described earlier.
SSDs that never experience idle periods cannot perform invisible GC. Sustained write workloads without breaks force GC into background or foreground modes, degrading performance. For write-intensive applications, ensuring periodic idle windows (even seconds of inactivity) dramatically improves sustained performance.
Victim selection is the algorithm that chooses which blocks to garbage-collect. The objective: maximize reclaimed space while minimizing data movement (and thus write amplification).
Greedy Algorithm (Basic):
The simplest approach: always select the block with the most invalid pages.
Victim Selection:
1. For each block, count valid pages
2. Select block with minimum valid page count
3. GC that block: copy valid pages, erase block
Greedy Pros: Simple to implement, and it moves the least valid data per reclaimed block, minimizing immediate write amplification.
Greedy Cons: It ignores block wear and data temperature, so it can repeatedly punish the same blocks and waste effort copying hot data that is about to be invalidated anyway.
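A minimal sketch of the greedy policy described above; the block representation (a list of valid-page counts) and the 64-page block size are assumptions for illustration.

```python
from typing import Sequence, Tuple

PAGES_PER_BLOCK = 64    # illustrative block geometry

def pick_greedy_victim(valid_counts: Sequence[int]) -> int:
    """Greedy policy: choose the block with the fewest valid pages."""
    return min(range(len(valid_counts)), key=lambda blk: valid_counts[blk])

def greedy_gc_cost(valid_counts: Sequence[int]) -> Tuple[int, int, int]:
    """Return (victim block, pages to copy, pages reclaimed) for one GC cycle."""
    victim = pick_greedy_victim(valid_counts)
    to_copy = valid_counts[victim]
    reclaimed = PAGES_PER_BLOCK - to_copy
    return victim, to_copy, reclaimed

# Block 2 holds only 5 valid pages, so greedy selects it: 5 copies buy 59 free pages.
print(greedy_gc_cost([60, 48, 5, 33]))    # -> (2, 5, 59)
```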
| Algorithm | Selection Criteria | Pros | Cons |
|---|---|---|---|
| Greedy | Most invalid pages | Minimal data movement | Ignores wear, data temperature |
| Cost-Benefit | Age × invalidity ratio | Balances WAF and stability | Complex computation |
| Adaptive GC | Workload-aware scoring | Optimal for mixed workloads | Requires workload modeling |
| Wear-Aware | High wear + invalidity | Extends lifespan | May increase immediate WAF |
Cost-Benefit Algorithm:
Analyzes the cost (valid data to relocate) versus benefit (space reclaimed and time until re-invalidation):
Cost-Benefit Score = (1 - u) × age / (2 × u)
Where:
- u = valid page ratio (valid_pages / total_pages)
- age = time since last data modification in block
Higher score = better victim candidate
Intuition: a low valid ratio u means little data to copy, so the cost is small; a large age means the surviving data is cold and unlikely to be invalidated soon, so copying it is not wasted work.
Cost-benefit reduces write amplification by avoiding blocks with recently-written data that will likely be invalidated anyway.
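A small sketch of the scoring function above, assuming each candidate block is described by its valid-page count, total pages, and the age of its data; the example numbers are invented.

```python
import math

def cost_benefit_score(valid_pages: int, total_pages: int, age: float) -> float:
    """Score = (1 - u) * age / (2 * u); higher means a better victim."""
    u = valid_pages / total_pages
    if u == 0.0:
        return math.inf                     # nothing to copy: erase it for free
    return (1.0 - u) * age / (2.0 * u)

def pick_cost_benefit_victim(blocks):
    """blocks: iterable of (block_id, valid_pages, total_pages, age_seconds)."""
    return max(blocks, key=lambda b: cost_benefit_score(b[1], b[2], b[3]))[0]

candidates = [
    ("blk_young", 32, 64, 5.0),             # half invalid, but written 5 s ago
    ("blk_old",   40, 64, 3600.0),          # more valid data, but an hour old
]
# Greedy would take blk_young; cost-benefit prefers blk_old, whose surviving
# data is cold and will not need to be moved again soon.
print(pick_cost_benefit_victim(candidates))     # -> "blk_old"
```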
Advanced algorithms track data temperature: frequently updated data (hot) versus rarely modified data (cold). Keeping hot and cold data in separate blocks improves GC efficiency. When GC runs on a hot-data block, most data is already invalid. When it runs on a cold-data block, most data remains valid—but only rarely needs to run. Explicit temperature tracking enables optimal placement.
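The effect is easy to see in a toy simulation. The sketch below, with invented geometry and workload sizes, writes a mix of write-once (cold) and frequently rewritten (hot) logical pages, either into shared blocks or into per-temperature blocks, and reports how many valid pages the cheapest GC victim would force the controller to copy.

```python
import itertools, random
from collections import defaultdict

PAGES_PER_BLOCK = 64
COLD_PAGES, HOT_PAGES, HOT_REWRITES = 192, 32, 8    # illustrative workload

def min_victim_valid_fraction(separate_streams: bool) -> float:
    next_block = itertools.count()
    used = defaultdict(int)                       # block -> pages programmed
    open_block = {}                               # stream key -> current open block
    location = {}                                 # logical page -> block of newest copy

    def write(lpn, stream):
        key = stream if separate_streams else "any"
        if key not in open_block or used[open_block[key]] == PAGES_PER_BLOCK:
            open_block[key] = next(next_block)    # open a fresh block for this stream
        blk = open_block[key]
        used[blk] += 1
        location[lpn] = blk

    ops = [("cold", lpn) for lpn in range(COLD_PAGES)]
    ops += [("hot", COLD_PAGES + lpn) for lpn in range(HOT_PAGES)] * HOT_REWRITES
    random.shuffle(ops)                           # interleave hot and cold arrivals
    for stream, lpn in ops:
        write(lpn, stream)

    valid = defaultdict(int)                      # block -> pages still valid
    for blk in location.values():
        valid[blk] += 1
    full = [b for b, n in used.items() if n == PAGES_PER_BLOCK]
    return min(valid[b] for b in full) / PAGES_PER_BLOCK

random.seed(0)
print(f"mixed placement:    cheapest victim is {min_victim_valid_fraction(False):.0%} valid")
print(f"separated hot/cold: cheapest victim is {min_victim_valid_fraction(True):.0%} valid")
```

With mixed placement, every victim still holds a large amount of cold, valid data that must be copied; with separation, the stale hot blocks can be erased almost for free.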
Write amplification factor (WAF) is the ratio of physical data written to flash versus logical data written by the host. It is the single most important metric for understanding SSD efficiency under write workloads.
WAF Calculation:
WAF = Physical Data Written to NAND / Logical Data Written by Host
Example:
- Host writes 100 GB of data
- SSD writes 300 GB to flash (including GC relocations)
- WAF = 300 / 100 = 3.0
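The calculation itself is trivial, but worth writing down. Many drives expose both host-write and NAND-write counters (via NVMe SMART data and vendor-specific logs; the exact attribute names vary by vendor), and dividing them gives the running WAF. The numbers below repeat the example from the text; the daily workload figure is hypothetical.

```python
def write_amplification(host_bytes: int, nand_bytes: int) -> float:
    """WAF = physical bytes programmed to NAND / logical bytes written by the host."""
    return nand_bytes / host_bytes

# Example from the text: the host wrote 100 GB, the drive programmed 300 GB.
waf = write_amplification(100 * 10**9, 300 * 10**9)
print(f"WAF = {waf:.1f}")                         # -> WAF = 3.0

# At WAF 3, a hypothetical workload of 0.2 TB of host writes per day consumes
# 0.6 TB of the flash's program/erase budget per day, i.e. endurance is used
# up three times faster than the host traffic alone would suggest.
host_tb_per_day = 0.2
print(f"NAND writes per day: {host_tb_per_day * waf:.1f} TB")
```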
Ideal vs Reality: the ideal WAF is 1.0, where every byte the host writes costs exactly one byte of flash programming. Real drives land well above that because several mechanisms add physical writes:
| Factor | Contribution to WAF | Mitigation Strategy |
|---|---|---|
| GC data relocation | Primary contributor; relocating valid pages | Efficient victim selection, idle GC |
| Partial page writes | Writing < page size requires read-modify-write | Host-side write coalescing |
| SLC cache folding | Cache data rewritten as TLC/QLC | Larger SLC cache, write batching |
| FTL metadata updates | Mapping table changes | Log-structured updates |
| ECC/Metadata overhead | Parity and metadata per page | Fixed cost, ~10-15% |
WAF Impact:
Reduced Endurance: Every extra write consumes P/E cycles. A WAF of 3 means flash endurance is consumed three times faster than the host's write volume alone would suggest.
Lower Sustained Performance: Flash bandwidth spent on GC is unavailable for host I/O. High WAF degrades throughput.
Increased Latency: GC operations add latency variability; high WAF means more frequent GC.
Power Consumption: Extra writes consume extra power—particularly impactful on mobile devices.
Minimizing Write Amplification: the main levers are generous over-provisioning, keeping utilization low, issuing TRIM so the drive knows which pages hold dead data, favoring large sequential writes over small random ones, and leaving idle windows so GC can run in the background.
WAF increases dramatically as SSDs approach full capacity. At 90% utilization, there's minimal over-provisioning left for GC. Every new write requires immediate valid-data relocation, and GC can barely keep up. This is why enterprise best practices mandate keeping SSDs below 80% capacity—and why advertised TB Written (TBW) ratings assume moderate utilization.
Understanding garbage collection behavior is essential for predicting SSD performance under sustained workloads. The transition from cached to steady-state performance defines the practical experience of using an SSD.
Three Performance States:
| State | Duration | Behavior | Typical Performance |
|---|---|---|---|
| Fresh/Bursting | First 20-100GB writes | SLC cache absorbs writes; no GC | Peak speeds (3,000+ MB/s) |
| Transition | Cache exhaustion | Cache full; folding begins; GC starts | Rapidly declining (1,500→500 MB/s) |
| Steady-State | Indefinite sustained writes | Continuous GC + folding; equilibrium | Lowest sustained (80-300 MB/s) |
The Steady-State Reality:
Most SSD benchmarks measure only burst performance—the first few gigabytes before caches fill and GC activates. Real-world sustained performance depends on how full the drive is, how much over-provisioning it has, whether the workload is sequential or random, and whether idle periods give background GC a chance to run.
Benchmarking Steady-State:
Professional SSD testing uses SNIA (Storage Networking Industry Association) protocols: the drive is purged (secure erased), preconditioned by writing far more than its full capacity, and then measured continuously until performance settles into a defined steady-state window.
This methodology reveals the true sustained performance—often 10-50× lower than peak specifications.
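For a quick, informal look at the burst-to-steady-state transition on your own hardware, a simple fill loop is enough. This is not an SNIA-compliant test (no purge, no controlled preconditioning, and it goes through the filesystem); the path and sizes below are placeholders.

```python
import os, time

PATH = "/mnt/scratch/fill.bin"       # assumption: a scratch file on the SSD under test
CHUNK = 256 * 1024 * 1024            # 256 MiB per measurement interval
TOTAL = 200 * 1024**3                # 200 GiB, well past a typical SLC cache

buf = os.urandom(CHUNK)              # incompressible data defeats compressing controllers
written = 0
with open(PATH, "wb") as f:
    while written < TOTAL:
        t0 = time.perf_counter()
        f.write(buf)
        f.flush()
        os.fsync(f.fileno())         # push each interval's data to the device
        dt = time.perf_counter() - t0
        written += CHUNK
        print(f"{written / 1024**3:6.1f} GiB  {CHUNK / dt / 1024**2:8.1f} MiB/s")
```

Plotting the per-interval throughput typically shows the three states from the table above: a fast plateau while the SLC cache absorbs writes, a sharp drop as folding and GC begin, and a lower, noisier steady-state floor.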
When evaluating SSDs, seek reviews that include steady-state testing. Peak sequential speeds are nearly identical across modern NVMe drives; the differentiation is in sustained random write performance, GC efficiency, and latency consistency. AnandTech, Tom's Hardware, and StorageReview.com typically include proper steady-state methodology.
Latency During GC:
GC causes latency variability—the enemy of consistent performance: a host command that arrives while the controller is midway through a GC cycle can wait behind page copies and a block erase, stretching a microsecond operation into milliseconds.
For latency-sensitive applications (databases, real-time systems), QoS percentiles matter more than average latency. Look for published P99, P99.9, and P99.99 latency figures and for explicit latency-consistency or QoS claims in enterprise datasheets.
An SSD with 30μs average but 50ms P99 will cause timeouts; one with 50μs average and 100μs P99 may be superior for your workload.
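Given a set of per-command completion latencies (from fio output, a trace capture, or application instrumentation), tail percentiles are straightforward to compute. The sample data below is fabricated to show how a few GC stalls hide in the average but dominate the tail.

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = round(p / 100 * len(ordered)) - 1
    return ordered[max(0, min(len(ordered) - 1, rank))]

def latency_report(samples_us):
    return {
        "avg_us": sum(samples_us) / len(samples_us),
        "p99_us": percentile(samples_us, 99),
        "p99.9_us": percentile(samples_us, 99.9),
        "max_us": max(samples_us),
    }

# 10,000 fast completions plus 20 GC-induced 50 ms stalls: the average stays
# around 130 us and the P99 still looks clean, but the P99.9 is the full 50 ms stall.
samples = [30.0] * 10_000 + [50_000.0] * 20
print(latency_report(samples))
```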
Modern SSD controllers implement sophisticated GC strategies beyond basic victim selection, aiming to minimize host impact while maintaining free space.
Multi-Stream Writes:
NVMe supports streams: a per-write hint from the host indicating the data's expected temperature or lifetime, for example separating short-lived log and temp-file writes from long-lived user data.
The SSD places each stream in separate blocks, so data with similar lifetimes is invalidated together. Blocks then tend to be either almost entirely valid or almost entirely invalid, which is exactly what efficient GC wants, and write amplification drops accordingly.
| Technique | Mechanism | Benefit | Requirement |
|---|---|---|---|
| Multi-Stream | Host-directed data placement | Reduced WAF from temperature separation | Application/FS support |
| ZNS (Zoned Namespaces) | Host controls write sequencing | Minimal GC; host manages zones | Application redesign |
| Predictive GC | ML-based invalidation prediction | Proactive GC before pressure | Training data, compute overhead |
| Cooperative GC | Host-SSD coordination | Host delays writes during GC | System-level integration |
Zoned Namespaces (ZNS):
ZNS represents a paradigm shift in SSD management: capacity is exposed as large, sequential-write-only zones, and the host writes each zone in order and explicitly resets it when its contents are no longer needed, taking over the placement decisions a conventional FTL makes internally.
By shifting GC responsibility to the host (which has application-level knowledge), ZNS achieves near-zero write amplification. Adoption is growing in cloud and hyperscale environments where the software stack can be modified.
Rate-Limited GC:
To prevent GC storms from overwhelming host I/O, the controller budgets GC work, performing only a bounded number of page relocations and erases per unit time and interleaving them between host commands.
This provides predictable latency at the cost of potentially slower free space recovery.
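One common way to implement such a budget is a token bucket, sketched below with invented rate numbers: each valid-page relocation spends a token, tokens refill at a fixed rate, and when the bucket is empty GC simply yields to host I/O.

```python
import time

class GcRateLimiter:
    """Token bucket capping how many GC page copies may run per second."""

    def __init__(self, copies_per_second: float, burst: int):
        self.rate = copies_per_second
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_copy_page(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True        # budget available: relocate one valid page now
        return False           # budget exhausted: defer GC, serve host I/O first

# Illustrative numbers: at most 4,096 page copies per second (~16 MB/s of 4 KB
# pages), with short bursts of up to 256 copies allowed.
limiter = GcRateLimiter(copies_per_second=4096, burst=256)
if limiter.try_copy_page():
    pass   # firmware would perform one page relocation here
```

In a real controller, foreground GC would still override the budget once free blocks reach the critical threshold, preserving correctness while keeping the common case predictable.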
Emerging computational storage devices perform data processing within the SSD itself. For GC, this could mean: compressing data to reduce physical footprint, deduplicating before write to eliminate redundancy, or application-aware GC that understands data lifecycle. These technologies are early-stage but promise further WAF reduction.
Different workload patterns create dramatically different GC behaviors. Understanding these patterns helps predict and optimize SSD performance for specific use cases.
Sequential vs Random Writes: large sequential writes fill blocks with data that tends to be invalidated together, so victim blocks are nearly empty of valid pages and WAF stays close to 1. Small random writes scatter invalid pages across many blocks, forcing GC to copy large amounts of still-valid data.
Workload Size Impact: a write working set small enough to fit comfortably within the SLC cache and over-provisioned space keeps GC pressure low, while a working set that spans most of the drive leaves every block partly valid and keeps GC permanently busy.
SSD Capacity Utilization:
GC efficiency degrades non-linearly with utilization:
| Utilization | Free Block Pool | GC Frequency | WAF Impact |
|---|---|---|---|
| 50% | Large | Rare | Minimal |
| 70% | Moderate | Regular | Noticeable |
| 85% | Small | Frequent | Significant |
| 95% | Minimal | Constant | Severe |
| 99% | Near-zero | Blocking | Catastrophic |
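The shape of this table falls out of a deliberately simplified model: if the average victim block is a fraction u valid when it is erased, reclaiming it frees (1 - u) of a block for new host data but costs u of a block in copying, giving WAF = 1 / (1 - u). Treating u as roughly tracking how full the drive is (spare area included) is an approximation, but it reproduces the non-linear blow-up.

```python
def waf_simple(victim_valid_fraction: float) -> float:
    """Steady-state WAF if every GC victim is this fraction valid (simplified model)."""
    return 1.0 / (1.0 - victim_valid_fraction)

for u in (0.50, 0.70, 0.85, 0.95, 0.99):
    print(f"victim {u:.0%} valid -> WAF ~= {waf_simple(u):.1f}")
# 50% -> 2.0, 70% -> 3.3, 85% -> 6.7, 95% -> 20.0, 99% -> 100.0
```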
Enterprise recommendation: Keep SSDs below 80% utilization.
Consumer guidance: Avoid filling SSDs beyond 90%.
Database workloads (small random writes, high update frequency, transaction logging) are among the most challenging for SSD GC. For database servers: use enterprise SSDs with high over-provisioning, consider write-caching to battery-backed RAID controllers, and monitor WAF/endurance consumption closely.
We've explored the mechanics and implications of garbage collection—the essential process that reclaims flash space and enables continued SSD operation. Let's consolidate the key insights: GC exists because NAND programs individual pages but erases whole blocks, so out-of-place updates continually leave invalid pages behind; victim selection policies (greedy, cost-benefit, wear-aware) trade immediate data movement against long-term stability and wear; write amplification multiplies wear and steals bandwidth, and it rises sharply as capacity utilization grows; and sustained, steady-state behavior, not burst benchmarks, is what write-heavy workloads actually experience.
What's Next:
Garbage collection requires knowing which blocks contain invalid data. But SSDs cannot detect file deletion from host commands alone—when a file is deleted, the SSD remains unaware. The next page covers the TRIM command, which bridges this information gap by informing SSDs of deallocated blocks, enabling proactive cleanup and improved GC efficiency.
You now understand why garbage collection is necessary, how victim selection algorithms work, the meaning and impact of write amplification, and how workload patterns affect GC behavior. This knowledge is essential for understanding sustained SSD performance and making informed decisions about SSD deployment in different workload scenarios.