NAND flash has an uncomfortable truth: pages cannot be overwritten in place. When you update a file, the new data is written to fresh pages while the old pages become invalid ghosts—consuming physical space but no longer representing current data. Left unchecked, these invalid pages would eventually consume the entire SSD.
Garbage collection (GC) is the SSD firmware's continuous battle against this accumulation of obsolete data. By consolidating valid pages and erasing blocks full of invalid data, GC reclaims physical space and maintains the free block pool that enables continued writing. Yet GC is not free—it consumes flash bandwidth, increases wear, and can cause the dreaded performance cliff that brings even premium SSDs to their knees.
By the end of this page, you will understand why garbage collection is necessary, comprehend the algorithms that select victim blocks and relocate data, analyze write amplification and its performance impact, and recognize how GC behavior influences SSD performance under sustained workloads.
To understand why garbage collection is necessary, we must revisit the fundamental asymmetry of NAND flash:
Program Operation: Individual pages (4-16KB) can be written (programmed) independently.
Erase Operation: An entire block (64-512 pages, or 256KB-4MB) must be erased as a unit.
Write-Once Constraint: Pages can only transition from erased (all 1s) to programmed; any change requires erasing the entire block.
This asymmetry creates a dilemma: when you update 4KB of data, the SSD writes to a new page (the old page becomes invalid). You cannot overwrite the old page—it must wait for a block erase. But erasing the block destroys all 63-511 sibling pages, which may contain valid data.
The Accumulation Problem:
Consider a simple scenario: a file that is updated over and over. Each update programs fresh pages for the new contents and leaves the previous copies behind as invalid pages.
Event after event, invalid pages accumulate across blocks. The SSD appears full even when logical capacity is available because physical space is consumed by obsolete data.
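To make the accumulation concrete, here is a minimal Python sketch of out-of-place updates with no garbage collection. All geometry numbers (page size, pages per block, block count) are invented for illustration; the point is only that a tiny amount of live data can exhaust the physical pool.

```python
# Toy model: out-of-place page writes with no GC. Sizes are illustrative only.
PAGE_SIZE_KB = 4
PAGES_PER_BLOCK = 64
TOTAL_BLOCKS = 256                       # 64 MB of physical flash in this toy

free_pages = TOTAL_BLOCKS * PAGES_PER_BLOCK
live = set()                             # logical pages with a current (valid) copy
invalid_pages = 0

def host_write(lpn):
    """Program a fresh page for lpn; the old copy (if any) becomes invalid."""
    global free_pages, invalid_pages
    if free_pages == 0:
        raise RuntimeError("no erased pages left; a block erase (GC) is required")
    if lpn in live:
        invalid_pages += 1               # previous copy is now an obsolete ghost
    live.add(lpn)
    free_pages -= 1

# Update the same 1 MB working set (256 logical pages) over and over.
try:
    for round_no in range(1000):
        for lpn in range(256):
            host_write(lpn)
except RuntimeError as exc:
    print(f"stopped during round {round_no}: {exc}")
    print(f"live data: {len(live) * PAGE_SIZE_KB / 1024:.0f} MB")
    print(f"invalid pages: {invalid_pages * PAGE_SIZE_KB / 1024:.0f} MB")
```

With only 1 MB of live data, this toy drive runs out of erased pages after 64 MB of host writes: exactly the "appears full even when logical capacity is available" situation described above.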
Free Block Exhaustion:
When free blocks are depleted, writes cannot proceed. Before the pending write can complete, the SSD must select a victim block, copy its remaining valid pages elsewhere, and erase it to create free space.
This is foreground GC—blocking host operations until space is reclaimed.
When an SSD runs out of free blocks and must perform foreground GC, write latency can spike from microseconds to hundreds of milliseconds. A GC cycle relocating valid data from one 2MB block requires reading and rewriting potentially 1MB+ of data—all while the host waits. This is the 'performance cliff' that turns a 500,000 IOPS SSD into a stuttering mess.
Modern SSDs implement sophisticated garbage collection architectures designed to minimize host impact while maintaining free block availability.
GC Modes:
| Mode | Trigger | Host Impact | Urgency |
|---|---|---|---|
| Idle GC | No pending host I/O; free blocks above threshold | None (background) | Low |
| Background GC | Free blocks below soft threshold | Minimal (interleaved) | Medium |
| Foreground GC | Free blocks critically low | Severe (blocking) | Critical |
Idle Garbage Collection:
When the SSD detects no pending host commands, it executes GC in the background, quietly selecting victims, relocating valid data, and erasing blocks during the idle window.
Idle GC is the ideal scenario—invisible to the host, completed during naturally occurring pauses in I/O activity.
Background Garbage Collection:
When free blocks drop below a soft threshold (e.g., 20% of over-provisioned space), background GC activates even during host activity: GC operations are interleaved with host commands, giving up a little bandwidth and latency to keep the free block pool from collapsing.
Foreground Garbage Collection:
If free blocks are exhausted before GC can reclaim space, foreground GC stalls host operations: incoming writes wait while the controller copies valid pages out of a victim block and erases it, which is the source of the millisecond-scale latency spikes described earlier.
SSDs that never experience idle periods cannot perform invisible GC. Sustained write workloads without breaks force GC into background or foreground modes, degrading performance. For write-intensive applications, ensuring periodic idle windows (even seconds of inactivity) dramatically improves sustained performance.
Victim selection is the algorithm that chooses which blocks to garbage-collect. The objective: maximize reclaimed space while minimizing data movement (and thus write amplification).
Greedy Algorithm (Basic):
The simplest approach: always select the block with the most invalid pages.
Victim Selection:
1. For each block, count valid pages
2. Select block with minimum valid page count
3. GC that block: copy valid pages, erase block
Greedy Pros: Simple to implement, and it moves the least valid data per reclaimed block, minimizing immediate write amplification.
Greedy Cons: It ignores block wear and data temperature, so it can repeatedly punish the same blocks and waste effort copying hot data that is about to be invalidated anyway.
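A minimal sketch of the greedy policy described above; the block representation (a list of valid-page counts) and the 64-page block size are assumptions for illustration.

```python
from typing import Sequence, Tuple

PAGES_PER_BLOCK = 64    # illustrative block geometry

def pick_greedy_victim(valid_counts: Sequence[int]) -> int:
    """Greedy policy: choose the block with the fewest valid pages."""
    return min(range(len(valid_counts)), key=lambda blk: valid_counts[blk])

def greedy_gc_cost(valid_counts: Sequence[int]) -> Tuple[int, int, int]:
    """Return (victim block, pages to copy, pages reclaimed) for one GC cycle."""
    victim = pick_greedy_victim(valid_counts)
    to_copy = valid_counts[victim]
    reclaimed = PAGES_PER_BLOCK - to_copy
    return victim, to_copy, reclaimed

# Block 2 holds only 5 valid pages, so greedy selects it: 5 copies buy 59 free pages.
print(greedy_gc_cost([60, 48, 5, 33]))    # -> (2, 5, 59)
```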
| Algorithm | Selection Criteria | Pros | Cons |
|---|---|---|---|
| Greedy | Most invalid pages | Minimal data movement | Ignores wear, data temperature |
| Cost-Benefit | Age × invalidity ratio | Balances WAF and stability | Complex computation |
| Adaptive GC | Workload-aware scoring | Optimal for mixed workloads | Requires workload modeling |
| Wear-Aware | High wear + invalidity | Extends lifespan | May increase immediate WAF |
Cost-Benefit Algorithm:
Analyzes the cost (valid data to relocate) versus benefit (space reclaimed and time until re-invalidation):
Cost-Benefit Score = (1 - u) × age / (2 × u)
Where:
- u = valid page ratio (valid_pages / total_pages)
- age = time since last data modification in block
Higher score = better victim candidate
Intuition: a low valid ratio u means little data to copy, so the cost is small; a large age means the surviving data is cold and unlikely to be invalidated soon, so copying it is not wasted work.
Cost-benefit reduces write amplification by avoiding blocks with recently-written data that will likely be invalidated anyway.
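A small sketch of the scoring function above, assuming each candidate block is described by its valid-page count, total pages, and the age of its data; the example numbers are invented.

```python
import math

def cost_benefit_score(valid_pages: int, total_pages: int, age: float) -> float:
    """Score = (1 - u) * age / (2 * u); higher means a better victim."""
    u = valid_pages / total_pages
    if u == 0.0:
        return math.inf                     # nothing to copy: erase it for free
    return (1.0 - u) * age / (2.0 * u)

def pick_cost_benefit_victim(blocks):
    """blocks: iterable of (block_id, valid_pages, total_pages, age_seconds)."""
    return max(blocks, key=lambda b: cost_benefit_score(b[1], b[2], b[3]))[0]

candidates = [
    ("blk_young", 32, 64, 5.0),             # half invalid, but written 5 s ago
    ("blk_old",   40, 64, 3600.0),          # more valid data, but an hour old
]
# Greedy would take blk_young; cost-benefit prefers blk_old, whose surviving
# data is cold and will not need to be moved again soon.
print(pick_cost_benefit_victim(candidates))     # -> "blk_old"
```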
Advanced algorithms track data temperature: frequently updated data (hot) versus rarely modified data (cold). Keeping hot and cold data in separate blocks improves GC efficiency. When GC runs on a hot-data block, most data is already invalid. When it runs on a cold-data block, most data remains valid—but only rarely needs to run. Explicit temperature tracking enables optimal placement.
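The effect is easy to see in a toy simulation. The sketch below, with invented geometry and workload sizes, writes a mix of write-once (cold) and frequently rewritten (hot) logical pages, either into shared blocks or into per-temperature blocks, and reports how many valid pages the cheapest GC victim would force the controller to copy.

```python
import itertools, random
from collections import defaultdict

PAGES_PER_BLOCK = 64
COLD_PAGES, HOT_PAGES, HOT_REWRITES = 192, 32, 8    # illustrative workload

def min_victim_valid_fraction(separate_streams: bool) -> float:
    next_block = itertools.count()
    used = defaultdict(int)                       # block -> pages programmed
    open_block = {}                               # stream key -> current open block
    location = {}                                 # logical page -> block of newest copy

    def write(lpn, stream):
        key = stream if separate_streams else "any"
        if key not in open_block or used[open_block[key]] == PAGES_PER_BLOCK:
            open_block[key] = next(next_block)    # open a fresh block for this stream
        blk = open_block[key]
        used[blk] += 1
        location[lpn] = blk

    ops = [("cold", lpn) for lpn in range(COLD_PAGES)]
    ops += [("hot", COLD_PAGES + lpn) for lpn in range(HOT_PAGES)] * HOT_REWRITES
    random.shuffle(ops)                           # interleave hot and cold arrivals
    for stream, lpn in ops:
        write(lpn, stream)

    valid = defaultdict(int)                      # block -> pages still valid
    for blk in location.values():
        valid[blk] += 1
    full = [b for b, n in used.items() if n == PAGES_PER_BLOCK]
    return min(valid[b] for b in full) / PAGES_PER_BLOCK

random.seed(0)
print(f"mixed placement:    cheapest victim is {min_victim_valid_fraction(False):.0%} valid")
print(f"separated hot/cold: cheapest victim is {min_victim_valid_fraction(True):.0%} valid")
```

With mixed placement, every victim still holds a large amount of cold, valid data that must be copied; with separation, the stale hot blocks can be erased almost for free.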
Write amplification factor (WAF) is the ratio of physical data written to flash versus logical data written by the host. It is the single most important metric for understanding SSD efficiency under write workloads.
WAF Calculation:
WAF = Physical Data Written to NAND / Logical Data Written by Host
Example:
- Host writes 100 GB of data
- SSD writes 300 GB to flash (including GC relocations)
- WAF = 300 / 100 = 3.0
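The calculation itself is trivial, but worth writing down. Many drives expose both host-write and NAND-write counters (via NVMe SMART data and vendor-specific logs; the exact attribute names vary by vendor), and dividing them gives the running WAF. The numbers below repeat the example from the text; the daily workload figure is hypothetical.

```python
def write_amplification(host_bytes: int, nand_bytes: int) -> float:
    """WAF = physical bytes programmed to NAND / logical bytes written by the host."""
    return nand_bytes / host_bytes

# Example from the text: the host wrote 100 GB, the drive programmed 300 GB.
waf = write_amplification(100 * 10**9, 300 * 10**9)
print(f"WAF = {waf:.1f}")                         # -> WAF = 3.0

# At WAF 3, a hypothetical workload of 0.2 TB of host writes per day consumes
# 0.6 TB of the flash's program/erase budget per day, i.e. endurance is used
# up three times faster than the host traffic alone would suggest.
host_tb_per_day = 0.2
print(f"NAND writes per day: {host_tb_per_day * waf:.1f} TB")
```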
Ideal vs Reality: the ideal WAF is 1.0, where every byte the host writes costs exactly one byte of flash programming. Real drives land well above that because several mechanisms add physical writes:
| Factor | Contribution to WAF | Mitigation Strategy |
|---|---|---|
| GC data relocation | Primary contributor; relocating valid pages | Efficient victim selection, idle GC |
| Partial page writes | Writing < page size requires read-modify-write | Host-side write coalescing |
| SLC cache folding | Cache data rewritten as TLC/QLC | Larger SLC cache, write batching |
| FTL metadata updates | Mapping table changes | Log-structured updates |
| ECC/Metadata overhead | Parity and metadata per page | Fixed cost, ~10-15% |
WAF Impact:
Reduced Endurance: Every extra write consumes P/E cycles. A WAF of 3 means flash endurance is consumed three times faster than the host's write volume alone would suggest.
Lower Sustained Performance: Flash bandwidth spent on GC is unavailable for host I/O. High WAF degrades throughput.
Increased Latency: GC operations add latency variability; high WAF means more frequent GC.
Power Consumption: Extra writes consume extra power—particularly impactful on mobile devices.
Minimizing Write Amplification: the main levers are generous over-provisioning, keeping utilization low, issuing TRIM so the drive knows which pages hold dead data, favoring large sequential writes over small random ones, and leaving idle windows so GC can run in the background.
WAF increases dramatically as SSDs approach full capacity. At 90% utilization, there's minimal over-provisioning left for GC. Every new write requires immediate valid-data relocation, and GC can barely keep up. This is why enterprise best practices mandate keeping SSDs below 80% capacity—and why advertised TB Written (TBW) ratings assume moderate utilization.
Understanding garbage collection behavior is essential for predicting SSD performance under sustained workloads. The transition from cached to steady-state performance defines the practical experience of using an SSD.
Three Performance States:
| State | Duration | Behavior | Typical Performance |
|---|---|---|---|
| Fresh/Bursting | First 20-100GB writes | SLC cache absorbs writes; no GC | Peak speeds (3,000+ MB/s) |
| Transition | Cache exhaustion | Cache full; folding begins; GC starts | Rapidly declining (1,500→500 MB/s) |
| Steady-State | Indefinite sustained writes | Continuous GC + folding; equilibrium | Lowest sustained (80-300 MB/s) |
The Steady-State Reality:
Most SSD benchmarks measure only burst performance—the first few gigabytes before caches fill and GC activates. Real-world sustained performance depends on how full the drive is, how much over-provisioning it has, whether the workload is sequential or random, and whether idle periods give background GC a chance to run.
Benchmarking Steady-State:
Professional SSD testing uses SNIA (Storage Networking Industry Association) protocols: the drive is purged (secure erased), preconditioned by writing far more than its full capacity, and then measured continuously until performance settles into a defined steady-state window.
This methodology reveals the true sustained performance—often 10-50× lower than peak specifications.
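For a quick, informal look at the burst-to-steady-state transition on your own hardware, a simple fill loop is enough. This is not an SNIA-compliant test (no purge, no controlled preconditioning, and it goes through the filesystem); the path and sizes below are placeholders.

```python
import os, time

PATH = "/mnt/scratch/fill.bin"       # assumption: a scratch file on the SSD under test
CHUNK = 256 * 1024 * 1024            # 256 MiB per measurement interval
TOTAL = 200 * 1024**3                # 200 GiB, well past a typical SLC cache

buf = os.urandom(CHUNK)              # incompressible data defeats compressing controllers
written = 0
with open(PATH, "wb") as f:
    while written < TOTAL:
        t0 = time.perf_counter()
        f.write(buf)
        f.flush()
        os.fsync(f.fileno())         # push each interval's data to the device
        dt = time.perf_counter() - t0
        written += CHUNK
        print(f"{written / 1024**3:6.1f} GiB  {CHUNK / dt / 1024**2:8.1f} MiB/s")
```

Plotting the per-interval throughput typically shows the three states from the table above: a fast plateau while the SLC cache absorbs writes, a sharp drop as folding and GC begin, and a lower, noisier steady-state floor.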
When evaluating SSDs, seek reviews that include steady-state testing. Peak sequential speeds are nearly identical across modern NVMe drives; the differentiation is in sustained random write performance, GC efficiency, and latency consistency. AnandTech, Tom's Hardware, and StorageReview.com typically include proper steady-state methodology.
Latency During GC:
GC causes latency variability—the enemy of consistent performance: a host command that arrives while the controller is midway through a GC cycle can wait behind page copies and a block erase, stretching a microsecond operation into milliseconds.
For latency-sensitive applications (databases, real-time systems), QoS percentiles matter more than average latency. Look for published P99, P99.9, and P99.99 latency figures and for explicit latency-consistency or QoS claims in enterprise datasheets.
An SSD with 30μs average but 50ms P99 will cause timeouts; one with 50μs average and 100μs P99 may be superior for your workload.
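Given a set of per-command completion latencies (from fio output, a trace capture, or application instrumentation), tail percentiles are straightforward to compute. The sample data below is fabricated to show how a few GC stalls hide in the average but dominate the tail.

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = round(p / 100 * len(ordered)) - 1
    return ordered[max(0, min(len(ordered) - 1, rank))]

def latency_report(samples_us):
    return {
        "avg_us": sum(samples_us) / len(samples_us),
        "p99_us": percentile(samples_us, 99),
        "p99.9_us": percentile(samples_us, 99.9),
        "max_us": max(samples_us),
    }

# 10,000 fast completions plus 20 GC-induced 50 ms stalls: the average stays
# around 130 us and the P99 still looks clean, but the P99.9 is the full 50 ms stall.
samples = [30.0] * 10_000 + [50_000.0] * 20
print(latency_report(samples))
```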
Modern SSD controllers implement sophisticated GC strategies beyond basic victim selection, aiming to minimize host impact while maintaining free space.
Multi-Stream Writes:
NVMe supports streams: a per-write hint from the host indicating the data's expected temperature or lifetime, for example separating short-lived log and temp-file writes from long-lived user data.
The SSD places each stream in separate blocks, so data with similar lifetimes is invalidated together. Blocks then tend to be either almost entirely valid or almost entirely invalid, which is exactly what efficient GC wants, and write amplification drops accordingly.
| Technique | Mechanism | Benefit | Requirement |
|---|---|---|---|
| Multi-Stream | Host-directed data placement | Reduced WAF from temperature separation | Application/FS support |
| ZNS (Zoned Namespaces) | Host controls write sequencing | Minimal GC; host manages zones | Application redesign |
| Predictive GC | ML-based invalidation prediction | Proactive GC before pressure | Training data, compute overhead |
| Cooperative GC | Host-SSD coordination | Host delays writes during GC | System-level integration |
Zoned Namespaces (ZNS):
ZNS represents a paradigm shift in SSD management: capacity is exposed as large, sequential-write-only zones, and the host writes each zone in order and explicitly resets it when its contents are no longer needed, taking over the placement decisions a conventional FTL makes internally.
By shifting GC responsibility to the host (which has application-level knowledge), ZNS achieves near-zero write amplification. Adoption is growing in cloud and hyperscale environments where the software stack can be modified.
Rate-Limited GC:
To prevent GC storms from overwhelming host I/O, the controller budgets GC work, performing only a bounded number of page relocations and erases per unit time and interleaving them between host commands.
This provides predictable latency at the cost of potentially slower free space recovery.
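One common way to implement such a budget is a token bucket, sketched below with invented rate numbers: each valid-page relocation spends a token, tokens refill at a fixed rate, and when the bucket is empty GC simply yields to host I/O.

```python
import time

class GcRateLimiter:
    """Token bucket capping how many GC page copies may run per second."""

    def __init__(self, copies_per_second: float, burst: int):
        self.rate = copies_per_second
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_copy_page(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True        # budget available: relocate one valid page now
        return False           # budget exhausted: defer GC, serve host I/O first

# Illustrative numbers: at most 4,096 page copies per second (~16 MB/s of 4 KB
# pages), with short bursts of up to 256 copies allowed.
limiter = GcRateLimiter(copies_per_second=4096, burst=256)
if limiter.try_copy_page():
    pass   # firmware would perform one page relocation here
```

In a real controller, foreground GC would still override the budget once free blocks reach the critical threshold, preserving correctness while keeping the common case predictable.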
Emerging computational storage devices perform data processing within the SSD itself. For GC, this could mean: compressing data to reduce physical footprint, deduplicating before write to eliminate redundancy, or application-aware GC that understands data lifecycle. These technologies are early-stage but promise further WAF reduction.
Different workload patterns create dramatically different GC behaviors. Understanding these patterns helps predict and optimize SSD performance for specific use cases.
Sequential vs Random Writes: large sequential writes fill blocks with data that tends to be invalidated together, so victim blocks are nearly empty of valid pages and WAF stays close to 1. Small random writes scatter invalid pages across many blocks, forcing GC to copy large amounts of still-valid data.
Workload Size Impact: a write working set small enough to fit comfortably within the SLC cache and over-provisioned space keeps GC pressure low, while a working set that spans most of the drive leaves every block partly valid and keeps GC permanently busy.
SSD Capacity Utilization:
GC efficiency degrades non-linearly with utilization:
| Utilization | Free Block Pool | GC Frequency | WAF Impact |
|---|---|---|---|
| 50% | Large | Rare | Minimal |
| 70% | Moderate | Regular | Noticeable |
| 85% | Small | Frequent | Significant |
| 95% | Minimal | Constant | Severe |
| 99% | Near-zero | Blocking | Catastrophic |
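The shape of this table falls out of a deliberately simplified model: if the average victim block is a fraction u valid when it is erased, reclaiming it frees (1 - u) of a block for new host data but costs u of a block in copying, giving WAF = 1 / (1 - u). Treating u as roughly tracking how full the drive is (spare area included) is an approximation, but it reproduces the non-linear blow-up.

```python
def waf_simple(victim_valid_fraction: float) -> float:
    """Steady-state WAF if every GC victim is this fraction valid (simplified model)."""
    return 1.0 / (1.0 - victim_valid_fraction)

for u in (0.50, 0.70, 0.85, 0.95, 0.99):
    print(f"victim {u:.0%} valid -> WAF ~= {waf_simple(u):.1f}")
# 50% -> 2.0, 70% -> 3.3, 85% -> 6.7, 95% -> 20.0, 99% -> 100.0
```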
Enterprise recommendation: Keep SSDs below 80% utilization.
Consumer guidance: Avoid filling SSDs beyond 90%.
Database workloads (small random writes, high update frequency, transaction logging) are among the most challenging for SSD GC. For database servers: use enterprise SSDs with high over-provisioning, consider write-caching to battery-backed RAID controllers, and monitor WAF/endurance consumption closely.
We've explored the mechanics and implications of garbage collection—the essential process that reclaims flash space and enables continued SSD operation. Let's consolidate the key insights: GC exists because NAND programs individual pages but erases whole blocks, so out-of-place updates continually leave invalid pages behind; victim selection policies (greedy, cost-benefit, wear-aware) trade immediate data movement against long-term stability and wear; write amplification multiplies wear and steals bandwidth, and it rises sharply as capacity utilization grows; and sustained, steady-state behavior, not burst benchmarks, is what write-heavy workloads actually experience.
What's Next:
Garbage collection requires knowing which blocks contain invalid data. But SSDs cannot detect file deletion from host commands alone—when a file is deleted, the SSD remains unaware. The next page covers the TRIM command, which bridges this information gap by informing SSDs of deallocated blocks, enabling proactive cleanup and improved GC efficiency.
You now understand why garbage collection is necessary, how victim selection algorithms work, the meaning and impact of write amplification, and how workload patterns affect GC behavior. This knowledge is essential for understanding sustained SSD performance and making informed decisions about SSD deployment in different workload scenarios.