Log-structured file systems achieve remarkable write performance by never updating data in place. But this elegant solution creates a fundamental problem: dead data accumulates.
Every time a file is modified, a new version is written to the log head. The old version remains on disk, consuming space but serving no purpose. Every deleted file leaves behind orphaned blocks. Every directory rename creates obsolete entries.
Without intervention, an LFS would quickly fill with garbage—dead data that will never be read again. Garbage collection (GC) is the mechanism that reclaims this space, and its efficiency determines whether LFS is practical for real workloads.
This page examines garbage collection in depth: why dead data accumulates, how to identify what's alive, the fundamental mechanics of reclaiming space, and why this is one of the most challenging problems in LFS design.
By the end of this page, you will understand why garbage collection is necessary, how liveness of data is determined, the basic mechanics of reclaiming segments, the write amplification cost of GC, and how GC performance fundamentally constrains LFS at high utilization.
To understand garbage collection, we must first understand what creates garbage. In LFS, every modification leaves behind dead data.
Sources of Dead Data:
Superseded blocks: every modification writes a new version of the data (and its inode) at the log head, leaving the old version behind as garbage
Deleted files: their data blocks and inode remain in the log as orphans
Obsolete metadata: old inode versions, stale inode-map blocks, and directory entries invalidated by renames and updates
Unlike traditional file systems where updates happen in place (freeing the space implicitly), LFS must explicitly reclaim dead space through garbage collection.
Garbage Creation in a Log-Structured File System (scenario: a 4 KB file, "readme.txt", is repeatedly modified and then deleted):
Initial state (T0): The log holds [Seg1: Data_v1, Inode_v1] at the log head. The file points to Data_v1 and Inode_v1, both live. Live: 100%, dead: 0%.
After the first modification (T1): The log is now [Seg1: Data_v1, Inode_v1][Seg2: Data_v2, Inode_v2]. The copies in Seg1 are superseded by v2 and are dead; the file points to Data_v2 and Inode_v2. Live: 50%, dead: 50%.
After the second modification (T2): The log is [Seg1][Seg2][Seg3: Data_v3, Inode_v3]. Everything in Seg1 and Seg2 is dead; only v3 is live. Live: 33%, dead: 67%.
After the file is deleted (T3): The log is [Seg1: dead][Seg2: dead][Seg3: dead][Seg4: directory update]. The only live data left is the small directory update in Seg4; the earlier segments now contain nothing but garbage, yet they still occupy disk space.
The Compounding Effect:
Garbage accumulation is not linear—it compounds:
Small files dominate many workloads: Modifying a 1 KB file in a 4 MB segment leaves roughly 99.98% of the segment live and only about 0.02% dead, yet the entire segment cannot be reused until it is cleaned
Mixed lifetimes: A segment with 90% long-lived data and 10% short-lived data becomes 90% live/10% dead quickly, then stays that way for a long time
Workload patterns: Some applications (databases, logs) update the same data repeatedly, creating garbage rapidly
The Critical Metric: Segment Utilization
Each segment has a utilization (or liveness) ratio:
Utilization = Live Bytes / Total Segment Size
Garbage collection's job is to consolidate live data from low-utilization segments, freeing those segments for reuse.
As disk utilization increases, finding low-utilization segments becomes harder. At 90% disk full, even 'garbage-heavy' segments might be 80% live. Cleaning them reclaims little space at high cost. This creates a performance cliff that makes LFS challenging for nearly-full disks.
Before we can collect garbage, we must distinguish live data (still referenced, potentially needed) from dead data (superseded or deleted, never needed again).
The Liveness Problem:
A block is live if and only if:
The inode that owns it still exists (the file has not been deleted)
That inode's current block pointer for the corresponding file offset still points to this block's on-disk location (the block has not been superseded by a newer version)
A block is dead if any of these conditions fail.
Two Approaches to Liveness Determination:
Check at cleaning time: use the segment summary stored with each segment, together with the inode map, to test each block's liveness when the segment is about to be cleaned
Track continuously: maintain bookkeeping (version numbers, a segment usage table) that is updated as writes and deletes happen, so liveness information is already on hand when GC runs
Liveness Determination Using the Segment Summary (example: cleaning Segment 42, a 4 MB segment written at T=1000):
The segment summary records, for every block, which inode owned it and at what file offset it was written:
Block 0: inode 17, offset 0
Block 1: inode 17, offset 4096
Block 2: inode 23, offset 0
Block 3: inode 23, offset 4096
Block 4: inode block for inode 17
Block 5: inode block for inode 23
...
The liveness check then proceeds block by block:
Block 0 (inode 17, offset 0): imap[17] says the current inode 17 lives in Segment 99 at offset 2048. Reading that inode, its pointer for offset 0 points back to Segment 42, offset 0, so the block is LIVE.
Block 1 (inode 17, offset 4096): Using the same inode (cached from the previous check), the pointer for offset 4096 points to Segment 88, offset 1024. The block was superseded, so it is DEAD.
Block 2 (inode 23, offset 0): imap[23] has no entry; file 23 was deleted, so the block is DEAD.
... and so on for every block in the segment.
Result: Segment 42 contains 2 live data blocks (8 KB), 5 dead data blocks (20 KB), 1 live inode (inode 17), and 1 dead inode (inode 23, deleted). Utilization is 8 KB / 4 MB ≈ 0.2%, making it an excellent candidate for cleaning.
Optimization: Version Numbers
The forward reference check requires reading inodes, which is expensive. Version numbers optimize this:
if (block.owner_inode == inode_number AND
    block.version == imap[inode_number].version AND    // cheap check against the imap, no inode read
    inode.block_pointers[block.offset] points here)     // only evaluated when the version still matches
    → LIVE
else
    → DEAD
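To make the check concrete, here is a minimal Python sketch of summary-based liveness determination with the version-number shortcut. The structures (`SummaryEntry`, the `imap` dictionary, the `read_inode` callback) are simplified stand-ins for the on-disk formats, not any real implementation's API.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Address = Tuple[int, int]              # (segment number, offset within segment)

@dataclass
class Inode:
    version: int
    block_pointers: Dict[int, Address]  # file offset -> on-disk address of that block

@dataclass
class SummaryEntry:                     # one entry of a segment summary
    owner_inode: int                    # inode that owned the block when it was written
    file_offset: int                    # offset of the block within that file
    version: int                        # file version at the time of the write

@dataclass
class ImapEntry:
    version: int                        # current version (bumped on delete/truncate)
    inode_addr: Address                 # where the current inode lives in the log

def is_live(block_addr: Address, entry: SummaryEntry,
            imap: Dict[int, ImapEntry], read_inode) -> bool:
    """Return True if the block at block_addr is still referenced by its file."""
    imap_entry: Optional[ImapEntry] = imap.get(entry.owner_inode)
    if imap_entry is None:
        return False                    # file was deleted: the block is dead
    if imap_entry.version != entry.version:
        return False                    # file was deleted/truncated and recreated: dead,
                                        # and we skipped the expensive inode read entirely
    inode: Inode = read_inode(imap_entry.inode_addr)     # one read from the log
    return inode.block_pointers.get(entry.file_offset) == block_addr
```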
Segment Usage Table:
Rather than computing liveness at GC time, LFS implementations typically maintain a segment usage table that tracks, for every segment:
The number of live bytes it currently holds
The time of its most recent write (its age), which age-aware cleaning policies use
This table enables instant lookup of segment utilization without scanning.
When a file is modified, we know which old blocks are being superseded (from the inode being replaced). We can immediately decrement those blocks' segments' live byte counts. When a file is deleted, we traverse its inode and decrement all of its segments. This keeps segment utilization accurate without expensive scans.
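As a sketch of this bookkeeping, assuming hypothetical class and method names rather than any particular implementation, each overwrite or delete immediately adjusts the affected segments' live-byte counts so GC never has to rescan:

```python
import time
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Address = Tuple[int, int]               # (segment number, offset within segment)
BLOCK_SIZE = 4096

class SegmentUsageTable:
    """Per-segment live-byte counts and last-write times, kept current at write time."""

    def __init__(self) -> None:
        self.live_bytes: Dict[int, int] = defaultdict(int)
        self.last_write: Dict[int, float] = {}

    def record_write(self, segment: int, nbytes: int) -> None:
        # New data written into this segment is live by definition.
        self.live_bytes[segment] += nbytes
        self.last_write[segment] = time.time()

    def record_supersede(self, old_addr: Address, nbytes: int = BLOCK_SIZE) -> None:
        # The old copy of an overwritten block just became garbage.
        self.live_bytes[old_addr[0]] -= nbytes

    def record_delete(self, block_addrs: Iterable[Address]) -> None:
        # On file deletion, every block the inode referenced becomes garbage.
        for addr in block_addrs:
            self.live_bytes[addr[0]] -= BLOCK_SIZE

    def utilization(self, segment: int, segment_size: int) -> float:
        return self.live_bytes[segment] / segment_size
```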
With the ability to identify live data, we can now understand how garbage collection actually works.
The GC Algorithm (Simplified):
Garbage Collection Process:
Before GC, the disk holds segments with widely varying utilization, for example [Seg 1: 20% live][Seg 2: 95% live][Seg 3: 5% live][Seg 4: 100% live][Seg 5: 80% live][Seg 6: 15% live] plus a few FREE segments available for writes. Segments 1, 3, and 6 are GC candidates; segments 2, 4, and 5 are worth skipping.
Step 1, select victims (segments with the lowest utilization): Segment 3 is the best candidate. At 5% live it holds only 200 KB of live data in a 4 MB segment, so cleaning it reclaims 3.8 MB at the cost of moving 200 KB, an efficiency of roughly 19:1.
Step 2, scan and identify live blocks: Walking Segment 3's summary: block 0 (inode 42, offset 0) is DEAD because the file was deleted, blocks 1 and 2 (inode 42) are likewise DEAD, block 3 (inode 58, offset 0) is LIVE, block 4 (inode 58, offset 4K) is DEAD because it was superseded, block 5 (inode 71, offset 0) is LIVE, and so on through the segment. Of 50 blocks checked, 5 are live (200 KB) and 45 are dead.
Step 3, copy live blocks to the log head: The live blocks are appended to the current segment buffer alongside ordinary application writes. No special marking is needed; they are just regular writes now.
Step 4, update the inode map: The metadata describing the moved blocks is rewritten to point at the new locations, for example inode 58's block 0 moves from Segment 3, offset 12K to Segment 7, offset 2048, and inode 71's block 0 moves from Segment 3, offset 20K to Segment 7, offset 6144. These updates go into the same segment buffer being written.
Step 5, free the victim segment: Segment 3 is marked FREE and becomes available for future writes. Free space increases by 4 MB at a write cost of 200 KB of live data plus overhead (roughly 250 KB), for a net gain of about 3.75 MB of usable space.
Key Observations:
GC induces writes: Moving live data requires writing it again. This is called GC write amplification.
GC competes with applications: While GC is copying live blocks, applications may also be writing. They share segment buffer space and disk bandwidth.
Utilization determines efficiency: Low-utilization segments yield high returns. A 10% utilized segment gives 90% free space for 10% write cost. A 90% utilized segment gives only 10% free space for 90% write cost.
Metadata moves too: When data blocks move, their imap entries must update. When inodes move, the master imap must update. GC cascades through the metadata hierarchy.
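Putting the steps above together, here is an illustrative sketch of the cleaning loop. It reuses the `SegmentUsageTable` sketched earlier and selects victims by lowest utilization only; `free_segments` is a set of segment numbers, and `read_live_blocks`, `append_to_log`, and `update_imap` are hypothetical helpers standing in for the real machinery.

```python
SEGMENT_SIZE = 4 * 1024 * 1024          # 4 MB segments, as in the example above
FREE_SEGMENT_TARGET = 8                 # stop once this many segments are free

def garbage_collect(usage_table, free_segments, read_live_blocks,
                    append_to_log, update_imap):
    """Clean low-utilization segments until enough free segments exist."""
    while len(free_segments) < FREE_SEGMENT_TARGET:
        # Step 1: pick the victim with the lowest utilization.
        candidates = [s for s in usage_table.live_bytes if s not in free_segments]
        if not candidates:
            break
        victim = min(candidates,
                     key=lambda s: usage_table.utilization(s, SEGMENT_SIZE))

        # Step 2: identify which blocks in the victim are still live
        # (summary + imap checks, as described earlier).
        live_blocks = read_live_blocks(victim)

        # Steps 3 and 4: copy live blocks to the log head and update metadata.
        for block in live_blocks:
            new_addr = append_to_log(block.data)        # an ordinary write at the log head
            update_imap(block.owner_inode, block.file_offset, new_addr)

        # Step 5: the victim now holds only dead data; mark it free.
        usage_table.live_bytes.pop(victim, None)
        free_segments.add(victim)
```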
When Does GC Run?
Common triggers:
Free-segment count drops below a threshold (foreground or urgent GC, run on demand)
The system goes idle (background GC uses quiet periods to clean ahead of need)
Periodic timers or an explicit administrative request
Every byte of live data in a cleaned segment must be written again. This means GC can double, triple, or even further multiply the total write volume. On flash storage, this accelerates wear. On any storage, it consumes bandwidth that could serve applications.
Write amplification is the ratio of total physical writes to logical writes requested by applications. Understanding this metric is crucial for predicting LFS performance.
Sources of Write Amplification in LFS:
GC copying: live blocks in cleaned segments are written a second (or third, or fourth) time
Metadata cascades: moving a data block forces its inode to be rewritten, which in turn forces an inode-map update
Log overhead: segment summaries, checkpoints, and partially filled segments add writes beyond the application's own data
Calculating GC Write Amplification:
Let's derive the write amplification from GC mathematically.
Assume:
Segments are S bytes each
The segments selected for cleaning have average utilization u (fraction of bytes that are live)
Cleaning one segment therefore copies u × S bytes of live data and frees (1 − u) × S bytes for new writes
Write amplification factor W:
W = (live bytes copied by GC) / (new bytes stored)
W = (u × S) / ((1 - u) × S)
W = u / (1 - u)
| Avg Segment Utilization | Write Amplification (u/(1-u)) | Total Writes per App Write | Interpretation |
|---|---|---|---|
| 10% | 0.11× | 1.11× | Minimal GC impact—most segments nearly empty |
| 30% | 0.43× | 1.43× | Healthy—GC is efficient |
| 50% | 1.0× | 2.0× | Every app write causes equal GC write |
| 70% | 2.33× | 3.33× | Significant GC overhead |
| 80% | 4.0× | 5.0× | Heavy GC penalty—disk mostly busy with GC |
| 90% | 9.0× | 10.0× | Severe degradation—90% of writes are GC |
| 95% | 19.0× | 20.0× | Near unusable—almost all bandwidth to GC |
The Utilization Crisis:
Notice how write amplification explodes as utilization approaches 100%: going from 80% to 90% utilization doubles the total write volume (5× to 10×), going from 90% to 95% doubles it again (10× to 20×), and as u approaches 1, u / (1 − u) grows without bound.
This explains the common advice: never run LFS volumes above 80-90% full.
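The figures in the table above follow directly from the formula; a quick sketch of the arithmetic:

```python
def gc_write_amplification(u: float) -> float:
    """Extra GC bytes written per byte of new application data: W = u / (1 - u)."""
    return u / (1.0 - u)

def total_writes_per_app_write(u: float) -> float:
    """One byte of application data plus its share of GC copying: 1 / (1 - u)."""
    return 1.0 + gc_write_amplification(u)

for u in (0.10, 0.30, 0.50, 0.70, 0.80, 0.90, 0.95):
    print(f"u = {u:.0%}:  GC amplification {gc_write_amplification(u):5.2f}x,"
          f"  total {total_writes_per_app_write(u):5.2f}x")
```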
But wait—what determines segment utilization at cleaning time?
Segment Utilization and Write Amplification (disk of 100 segments of 4 MB each = 400 MB, holding 320 MB of live data, i.e., 80% disk utilization). The question: what is the average utilization of the segments GC actually cleans?
Scenario A, uniform distribution: Every segment is exactly 80% live, so GC always cleans 80%-full segments. Write amplification = 0.8 / 0.2 = 4.0×; every new write requires 4× writes in total (1 new + 3 GC).
Scenario B, bimodal distribution (better): 80 segments are 100% live (no garbage, never cleaned) and 20 segments are 0% live (all garbage). Disk utilization is still 80% (320 MB / 400 MB), but GC cleans the 0% segments, so there is no live data to copy at all and GC adds no extra writes.
Scenario C, bimodal with some mixed segments (realistic): 60 segments are 100% live (stable data, never touched), 20 segments are 50% live, and 20 segments are 20% live. Total live data: 60×4 + 20×2 + 20×0.8 = 240 + 40 + 16 = 296 MB, or 74% disk utilization. GC targets the 20%-utilized segments, so write amplification = 0.2 / 0.8 = 0.25×, only 25% extra writes.
Key insight: the distribution of segment utilization matters more than the average. If hot and cold data stay separated, GC becomes cheap. The secret to efficient LFS garbage collection is creating a bimodal distribution: some segments with nearly all live data (cold, stable), others with nearly all dead data (former hot spots). This allows GC to target low-utilization segments with minimal copying, which is why data separation strategies (hot/cold/warm) matter so much.
Which segments should GC clean? This decision profoundly affects performance. Different policies optimize for different goals.
Greedy Policy:
Always clean the segment with the lowest utilization.
Pros:
Simple to implement and reason about
Maximizes the space reclaimed per byte of live data copied right now (best short-term efficiency)
Cons:
Ignores how old the data is
Tends to pick recently written segments whose surviving blocks are still hot, so the same hot data gets copied again and again (see the comparison below)
Cost-Benefit Policy:
Weight both utilization AND age. Prefer cleaning segments that are:
Low in utilization, so little live data has to be copied
Old, because data that has survived a long time is probably cold and, once moved, will not need to be moved again soon
The Rosenblum-Ousterhout formula:
Benefit = (1 - u) × age / (1 + u)
where:
u = utilization (0 to 1)
age = time since segment was written
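A short Python sketch that computes both scores for a handful of example segments (the same figures used in the worked comparison further below); the segment names and values are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    name: str
    utilization: float      # u, fraction of the segment that is live (0 to 1)
    age_hours: float        # time since the segment was written

def greedy_score(seg: Segment) -> float:
    # Greedy: lower utilization is better, so the policy minimizes u.
    return seg.utilization

def cost_benefit_score(seg: Segment) -> float:
    # Rosenblum-Ousterhout: benefit = free space gained x age, cost = 1 + u.
    u = seg.utilization
    return (1.0 - u) * seg.age_hours / (1.0 + u)

segments = [
    Segment("A", 0.10, 1),      # nearly empty but very young (probably hot)
    Segment("B", 0.15, 48),     # nearly empty and old (probably cold)
    Segment("C", 0.20, 24),
    Segment("D", 0.50, 72),
    Segment("E", 0.80, 168),
]

greedy_pick = min(segments, key=greedy_score)              # picks A (u = 0.10)
cost_benefit_pick = max(segments, key=cost_benefit_score)  # picks B (score ~ 35.5)
print(greedy_pick.name, cost_benefit_pick.name)
```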
| Policy | Criteria | Best For | Weakness |
|---|---|---|---|
| Greedy | Lowest utilization | Short-term efficiency | Moves cold data repeatedly |
| Cost-Benefit | Low utilization + old age | Mixed workloads | Complex to tune age weight |
| Least Recently Written | Oldest segment | Uniform access | May clean high-utilization segments |
| CAT (Cost-Age-Time) | Composite score | Production systems | Requires extensive tuning |
| Adaptive | Varies with conditions | Dynamic workloads | Implementation complexity |
Victim Selection Policy Comparison (five representative candidate segments):
| Segment | Utilization | Age (hours) | Greedy Score | Cost-Benefit Score |
|---|---|---|---|---|
| A | 10% | 1 | 0.10 | (0.9 × 1) / 1.1 = 0.82 |
| B | 15% | 48 | 0.15 | (0.85 × 48) / 1.15 = 35.5 |
| C | 20% | 24 | 0.20 | (0.8 × 24) / 1.2 = 16.0 |
| D | 50% | 72 | 0.50 | (0.5 × 72) / 1.5 = 24.0 |
| E | 80% | 168 | 0.80 | (0.2 × 168) / 1.8 = 18.7 |
The greedy policy selects Segment A (10% utilization): copying 10% live data reclaims 90% of the space. But A is only one hour old, so its data is probably hot and will likely be modified again soon; GC has just moved data that will immediately create more garbage. The failure mode plays out like this: at T=0 Segment A is written (a hot file is created); at T=1 GC cleans A and moves the hot file to Segment N; at T=2 the hot file is modified, the new version lands in Segment N+1, and the copy in N becomes garbage; at T=3 GC cleans N and moves the file yet again. GC spins its wheels moving the same hot data over and over.
Cost-benefit selects Segment B (score 35.5): only 15% utilized, but 48 hours old. Its surviving data is probably cold and stable, so it is worth moving because it will not need to be moved again soon; 85% of the segment is reclaimed and the relocated data is unlikely to change. By T=48, B's hot data has already been superseded and only cold data remains; once GC copies that cold data to a new segment, that segment stays stable at high utilization with no future cleaning needed. Moved data stays moved.
Modern Adaptive Approaches:
Production systems often use adaptive policies that adjust based on:
The current level of free space (how urgently segments must be reclaimed)
The current application write load (how much bandwidth GC may safely take)
The observed temperature of the data (how hot or cold candidate segments appear to be)
F2FS Example:
F2FS uses a modified cost-benefit approach: foreground (urgent) GC selects victims greedily, because reclaiming space quickly matters most, while background GC uses cost-benefit selection to minimize long-term copying.
The optimal victim selection policy depends on workload characteristics. Write-heavy databases benefit from cost-benefit. Sequential log files work well with greedy. Most production systems use tunable hybrid policies that administrators adjust based on observed performance.
When and how garbage collection runs significantly impacts application performance. GC competes with applications for disk bandwidth and can cause latency spikes if not managed carefully.
GC Scheduling Strategies:
GC Performance Impact (scenario: an LFS disk at 85% utilization under moderate write load):
Without background GC (blocking only), normal writes alternate with GC stalls: latency sits near the baseline most of the time, then spikes whenever the system must stop everything and clean. Latency percentiles: p50 ≈ 5 ms, p99 ≈ 2,000 ms (400× the 5 ms baseline), p99.9 ≈ 5,000 ms.
With background GC (concurrent and tuned), cleaning work is spread continuously alongside application writes. Latency is slightly elevated but consistent: p50 ≈ 7 ms (+40%, acceptable), p99 ≈ 15 ms (only 3× normal), p99.9 ≈ 25 ms.
Background GC bandwidth allocation:
Greedy allocation: GC hogs the bandwidth (say 80%), and application latency becomes unacceptable.
Throttled allocation: GC is capped at a small share (say 20%), keeping application latency acceptable, but GC may fall behind during peak load.
Adaptive allocation (the modern approach): GC bandwidth varies with free space and application demand. At low application load GC gets most of the bandwidth; as load rises GC's share shrinks; when free space becomes critical, both are throttled so GC can catch up.
GC Threading and Parallelism:
Modern LFS implementations use sophisticated GC scheduling:
A background cleaner thread that wakes during idle periods (see the F2FS sleep-time tunables below)
Foreground GC invoked directly on the write path when free space runs low
Rate limiting so that cleaning never monopolizes device bandwidth
Handling GC Emergencies:
When free space drops to critical levels (e.g., <5%):
GC switches to foreground (urgent) mode and runs on the write path
Application writes may be throttled or briefly blocked until enough segments have been reclaimed
Victim selection typically falls back to greedy, trading long-term efficiency for immediate space
Most LFS implementations reserve 10-20% of disk capacity as 'overprovision' space—always kept free, never written by applications. This buffer ensures GC always has segments to write to, preventing deadlock where GC needs to write but has no space to write into.
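As an illustration of the adaptive-allocation idea, and not any particular file system's policy, here is a sketch that picks a GC bandwidth share from the free-space level and the application demand; the threshold values are assumptions chosen to echo the urgency levels discussed above.

```python
def gc_bandwidth_share(free_space_frac: float, app_load_frac: float) -> float:
    """Fraction of device bandwidth to hand to the cleaner.

    free_space_frac: fraction of segments currently free (0 to 1)
    app_load_frac:   fraction of bandwidth applications are trying to use (0 to 1)
    The thresholds below are illustrative only.
    """
    if free_space_frac < 0.02:
        return 0.90            # emergency: throttle applications hard, clean aggressively
    if free_space_frac < 0.05:
        return 0.50            # urgent: split bandwidth, applications will feel it
    # Normal operation: clean opportunistically with leftover bandwidth,
    # but always keep a small trickle so GC never falls hopelessly behind.
    idle = max(0.0, 1.0 - app_load_frac)
    return max(0.05, min(idle, 0.80))

# Plenty of free space, light load: GC may use most of the bandwidth.
print(gc_bandwidth_share(free_space_frac=0.30, app_load_frac=0.20))   # 0.8
# Space getting tight under heavy load: GC takes half despite the load.
print(gc_bandwidth_share(free_space_frac=0.04, app_load_frac=0.90))   # 0.5
```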
Let's examine how production file systems implement garbage collection.
NILFS2 Approach: NILFS2 keeps the log as a history of checkpoints and snapshots; a segment becomes reclaimable once no remaining checkpoint or snapshot references its data, so reclaiming space is largely a matter of letting old checkpoints expire (see the table below).
F2FS Approach: F2FS separates writes into multiple logs by temperature (hot, warm, cold), cleans at section granularity to match SSD erase blocks, and selects victims greedily in the foreground and by cost-benefit in the background (details below).
| File System | GC Trigger | Victim Selection | Live Data Handling | Special Features |
|---|---|---|---|---|
| NILFS2 | Snapshot-based | Unreferenced segments | N/A (no copying) | GC = snapshot deletion |
| F2FS | Free section count | Greedy or Cost-Benefit | Copy to head of temperature-matched log | Section-aligned for SSD |
| ZFS (SPA) | Metaslab fullness | Best-fit within metaslab | COW + space maps | Not pure LFS; hybrid approach |
| Btrfs | Block group balance | Fullness ratio | Relocate extents | User-triggered or auto |
F2FS GC Deep Dive:
F2FS's GC is particularly sophisticated for mobile and SSD use cases:
Section Granularity: GC operates on "sections" (multiple segments grouped to match SSD erase blocks). This reduces SSD internal GC conflicts.
Two-Phase Cleaning: A victim section is processed in two passes: the first walks the segment summaries and checks which blocks are still valid against the current metadata, and the second reads the still-valid blocks and writes them out at the head of the appropriate log.
Foreground GC Limits: If foreground GC runs too long, F2FS yields to prevent application starvation.
Dirty Segment Bitmap: Efficient tracking of which segments need GC consideration.
SSR (Slack Space Recycling): For emergency space reclaim, F2FS can reuse partial segments without full cleaning—sacrificing sequentiality for space.
F2FS GC Thresholds (typical Android configuration):
- gc_min_sleep_time: 10 ms (minimum between GC rounds)
- gc_max_sleep_time: 30 seconds (maximum idle time before GC)
- gc_no_gc_sleep_time: 5 minutes (sleep when nothing to clean)
- gc_urgent_mid_space: 5% (threshold for mid-urgency GC)
- gc_urgent_high_space: 2% (threshold for emergency GC)
For F2FS on Linux, monitor GC performance via /sys/fs/f2fs/<device>/gc_*. Key metrics: gc_call (how often GC runs), gc_segment (segments cleaned), gc_block (blocks moved). High gc_call with low gc_segment indicates GC is frequently needed but finding no good candidates—a sign of poor data separation or near-full disk.
Garbage collection is the critical maintenance operation that makes log-structured file systems viable. Without effective GC, LFS would quickly fill with dead data and become unusable.
What's Next:
Garbage collection reclaims space, but the segment cleaning process determines its efficiency. The next page explores cleaning strategies in depth: how to merge segments, when to copy vs. compact, threading models for cleaning, and how modern implementations minimize GC overhead through intelligent data placement.
You now understand the fundamentals of garbage collection in log-structured file systems—why it's necessary, how liveness is determined, the mechanics of reclamation, and the critical role of segment utilization. This foundation prepares you for the advanced cleaning strategies covered next.