Log-structured file systems achieve remarkable write performance by never updating data in place. But this elegant solution creates a fundamental problem: dead data accumulates.
Every time a file is modified, a new version is written to the log head. The old version remains on disk, consuming space but serving no purpose. Every deleted file leaves behind orphaned blocks. Every directory rename creates obsolete entries.
Without intervention, an LFS would quickly fill with garbage—dead data that will never be read again. Garbage collection (GC) is the mechanism that reclaims this space, and its efficiency determines whether LFS is practical for real workloads.
This page examines garbage collection in depth: why dead data accumulates, how to identify what's alive, the fundamental mechanics of reclaiming space, and why this is one of the most challenging problems in LFS design.
By the end of this page, you will understand why garbage collection is necessary, how liveness of data is determined, the basic mechanics of reclaiming segments, the write amplification cost of GC, and how GC performance fundamentally constrains LFS at high utilization.
To understand garbage collection, we must first understand what creates garbage. In LFS, every modification leaves behind dead data.
Sources of Dead Data:
Superseded blocks: every modification writes a new version of the data (and its inode) at the log head, leaving the old version behind as garbage
Deleted files: their data blocks and inode remain in the log as orphans
Obsolete metadata: old inode versions, stale inode-map blocks, and directory entries invalidated by renames and updates
Unlike traditional file systems where updates happen in place (freeing the space implicitly), LFS must explicitly reclaim dead space through garbage collection.
Garbage Creation in a Log-Structured File System (scenario: a 4 KB file, "readme.txt", is repeatedly modified and then deleted):
Initial state (T0): The log holds [Seg1: Data_v1, Inode_v1] at the log head. The file points to Data_v1 and Inode_v1, both live. Live: 100%, dead: 0%.
After the first modification (T1): The log is now [Seg1: Data_v1, Inode_v1][Seg2: Data_v2, Inode_v2]. The copies in Seg1 are superseded by v2 and are dead; the file points to Data_v2 and Inode_v2. Live: 50%, dead: 50%.
After the second modification (T2): The log is [Seg1][Seg2][Seg3: Data_v3, Inode_v3]. Everything in Seg1 and Seg2 is dead; only v3 is live. Live: 33%, dead: 67%.
After the file is deleted (T3): The log is [Seg1: dead][Seg2: dead][Seg3: dead][Seg4: directory update]. The only live data left is the small directory update in Seg4; the earlier segments now contain nothing but garbage, yet they still occupy disk space.
The Compounding Effect:
Garbage accumulation is not linear—it compounds:
Small files dominate many workloads: Modifying a 1 KB file in a 4 MB segment leaves roughly 99.98% of the segment live and only about 0.02% dead, yet the entire segment cannot be reused until it is cleaned
Mixed lifetimes: A segment with 90% long-lived data and 10% short-lived data becomes 90% live/10% dead quickly, then stays that way for a long time
Workload patterns: Some applications (databases, logs) update the same data repeatedly, creating garbage rapidly
The Critical Metric: Segment Utilization
Each segment has a utilization (or liveness) ratio:
Utilization = Live Bytes / Total Segment Size
Garbage collection's job is to consolidate live data from low-utilization segments, freeing those segments for reuse.
As disk utilization increases, finding low-utilization segments becomes harder. At 90% disk full, even 'garbage-heavy' segments might be 80% live. Cleaning them reclaims little space at high cost. This creates a performance cliff that makes LFS challenging for nearly-full disks.
Before we can collect garbage, we must distinguish live data (still referenced, potentially needed) from dead data (superseded or deleted, never needed again).
The Liveness Problem:
A block is live if and only if:
The inode that owns it still exists (the file has not been deleted)
That inode's current block pointer for the corresponding file offset still points to this block's on-disk location (the block has not been superseded by a newer version)
A block is dead if any of these conditions fail.
Two Approaches to Liveness Determination:
Check at cleaning time: use the segment summary stored with each segment, together with the inode map, to test each block's liveness when the segment is about to be cleaned
Track continuously: maintain bookkeeping (version numbers, a segment usage table) that is updated as writes and deletes happen, so liveness information is already on hand when GC runs
Liveness Determination Using the Segment Summary (example: cleaning Segment 42, a 4 MB segment written at T=1000):
The segment summary records, for every block, which inode owned it and at what file offset it was written:
Block 0: inode 17, offset 0
Block 1: inode 17, offset 4096
Block 2: inode 23, offset 0
Block 3: inode 23, offset 4096
Block 4: inode block for inode 17
Block 5: inode block for inode 23
...
The liveness check then proceeds block by block:
Block 0 (inode 17, offset 0): imap[17] says the current inode 17 lives in Segment 99 at offset 2048. Reading that inode, its pointer for offset 0 points back to Segment 42, offset 0, so the block is LIVE.
Block 1 (inode 17, offset 4096): Using the same inode (cached from the previous check), the pointer for offset 4096 points to Segment 88, offset 1024. The block was superseded, so it is DEAD.
Block 2 (inode 23, offset 0): imap[23] has no entry; file 23 was deleted, so the block is DEAD.
... and so on for every block in the segment.
Result: Segment 42 contains 2 live data blocks (8 KB), 5 dead data blocks (20 KB), 1 live inode (inode 17), and 1 dead inode (inode 23, deleted). Utilization is 8 KB / 4 MB ≈ 0.2%, making it an excellent candidate for cleaning.
Optimization: Version Numbers
The forward reference check requires reading inodes, which is expensive. Version numbers optimize this:
if (block.owner_inode == inode_number AND
    block.version == imap[inode_number].version AND    // cheap check against the imap, no inode read
    inode.block_pointers[block.offset] points here)     // only evaluated when the version still matches
    → LIVE
else
    → DEAD
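To make the check concrete, here is a minimal Python sketch of summary-based liveness determination with the version-number shortcut. The structures (`SummaryEntry`, the `imap` dictionary, the `read_inode` callback) are simplified stand-ins for the on-disk formats, not any real implementation's API.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Address = Tuple[int, int]              # (segment number, offset within segment)

@dataclass
class Inode:
    version: int
    block_pointers: Dict[int, Address]  # file offset -> on-disk address of that block

@dataclass
class SummaryEntry:                     # one entry of a segment summary
    owner_inode: int                    # inode that owned the block when it was written
    file_offset: int                    # offset of the block within that file
    version: int                        # file version at the time of the write

@dataclass
class ImapEntry:
    version: int                        # current version (bumped on delete/truncate)
    inode_addr: Address                 # where the current inode lives in the log

def is_live(block_addr: Address, entry: SummaryEntry,
            imap: Dict[int, ImapEntry], read_inode) -> bool:
    """Return True if the block at block_addr is still referenced by its file."""
    imap_entry: Optional[ImapEntry] = imap.get(entry.owner_inode)
    if imap_entry is None:
        return False                    # file was deleted: the block is dead
    if imap_entry.version != entry.version:
        return False                    # file was deleted/truncated and recreated: dead,
                                        # and we skipped the expensive inode read entirely
    inode: Inode = read_inode(imap_entry.inode_addr)     # one read from the log
    return inode.block_pointers.get(entry.file_offset) == block_addr
```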
Segment Usage Table:
Rather than computing liveness at GC time, LFS implementations typically maintain a segment usage table that tracks, for every segment:
The number of live bytes it currently holds
The time of its most recent write (its age), which age-aware cleaning policies use
This table enables instant lookup of segment utilization without scanning.
When a file is modified, we know which old blocks are being superseded (from the inode being replaced). We can immediately decrement those blocks' segments' live byte counts. When a file is deleted, we traverse its inode and decrement all of its segments. This keeps segment utilization accurate without expensive scans.
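As a sketch of this bookkeeping, assuming hypothetical class and method names rather than any particular implementation, each overwrite or delete immediately adjusts the affected segments' live-byte counts so GC never has to rescan:

```python
import time
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Address = Tuple[int, int]               # (segment number, offset within segment)
BLOCK_SIZE = 4096

class SegmentUsageTable:
    """Per-segment live-byte counts and last-write times, kept current at write time."""

    def __init__(self) -> None:
        self.live_bytes: Dict[int, int] = defaultdict(int)
        self.last_write: Dict[int, float] = {}

    def record_write(self, segment: int, nbytes: int) -> None:
        # New data written into this segment is live by definition.
        self.live_bytes[segment] += nbytes
        self.last_write[segment] = time.time()

    def record_supersede(self, old_addr: Address, nbytes: int = BLOCK_SIZE) -> None:
        # The old copy of an overwritten block just became garbage.
        self.live_bytes[old_addr[0]] -= nbytes

    def record_delete(self, block_addrs: Iterable[Address]) -> None:
        # On file deletion, every block the inode referenced becomes garbage.
        for addr in block_addrs:
            self.live_bytes[addr[0]] -= BLOCK_SIZE

    def utilization(self, segment: int, segment_size: int) -> float:
        return self.live_bytes[segment] / segment_size
```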
With the ability to identify live data, we can now understand how garbage collection actually works.
The GC Algorithm (Simplified):
Garbage Collection Process:
Before GC, the disk holds segments with widely varying utilization, for example [Seg 1: 20% live][Seg 2: 95% live][Seg 3: 5% live][Seg 4: 100% live][Seg 5: 80% live][Seg 6: 15% live] plus a few FREE segments available for writes. Segments 1, 3, and 6 are GC candidates; segments 2, 4, and 5 are worth skipping.
Step 1, select victims (segments with the lowest utilization): Segment 3 is the best candidate. At 5% live it holds only 200 KB of live data in a 4 MB segment, so cleaning it reclaims 3.8 MB at the cost of moving 200 KB, an efficiency of roughly 19:1.
Step 2, scan and identify live blocks: Walking Segment 3's summary: block 0 (inode 42, offset 0) is DEAD because the file was deleted, blocks 1 and 2 (inode 42) are likewise DEAD, block 3 (inode 58, offset 0) is LIVE, block 4 (inode 58, offset 4K) is DEAD because it was superseded, block 5 (inode 71, offset 0) is LIVE, and so on through the segment. Of 50 blocks checked, 5 are live (200 KB) and 45 are dead.
Step 3, copy live blocks to the log head: The live blocks are appended to the current segment buffer alongside ordinary application writes. No special marking is needed; they are just regular writes now.
Step 4, update the inode map: The metadata describing the moved blocks is rewritten to point at the new locations, for example inode 58's block 0 moves from Segment 3, offset 12K to Segment 7, offset 2048, and inode 71's block 0 moves from Segment 3, offset 20K to Segment 7, offset 6144. These updates go into the same segment buffer being written.
Step 5, free the victim segment: Segment 3 is marked FREE and becomes available for future writes. Free space increases by 4 MB at a write cost of 200 KB of live data plus overhead (roughly 250 KB), for a net gain of about 3.75 MB of usable space.
Key Observations:
GC induces writes: Moving live data requires writing it again. This is called GC write amplification.
GC competes with applications: While GC is copying live blocks, applications may also be writing. They share segment buffer space and disk bandwidth.
Utilization determines efficiency: Low-utilization segments yield high returns. A 10% utilized segment gives 90% free space for 10% write cost. A 90% utilized segment gives only 10% free space for 90% write cost.
Metadata moves too: When data blocks move, their imap entries must update. When inodes move, the master imap must update. GC cascades through the metadata hierarchy.
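Putting the steps above together, here is an illustrative sketch of the cleaning loop. It reuses the `SegmentUsageTable` sketched earlier and selects victims by lowest utilization only; `free_segments` is a set of segment numbers, and `read_live_blocks`, `append_to_log`, and `update_imap` are hypothetical helpers standing in for the real machinery.

```python
SEGMENT_SIZE = 4 * 1024 * 1024          # 4 MB segments, as in the example above
FREE_SEGMENT_TARGET = 8                 # stop once this many segments are free

def garbage_collect(usage_table, free_segments, read_live_blocks,
                    append_to_log, update_imap):
    """Clean low-utilization segments until enough free segments exist."""
    while len(free_segments) < FREE_SEGMENT_TARGET:
        # Step 1: pick the victim with the lowest utilization.
        candidates = [s for s in usage_table.live_bytes if s not in free_segments]
        if not candidates:
            break
        victim = min(candidates,
                     key=lambda s: usage_table.utilization(s, SEGMENT_SIZE))

        # Step 2: identify which blocks in the victim are still live
        # (summary + imap checks, as described earlier).
        live_blocks = read_live_blocks(victim)

        # Steps 3 and 4: copy live blocks to the log head and update metadata.
        for block in live_blocks:
            new_addr = append_to_log(block.data)        # an ordinary write at the log head
            update_imap(block.owner_inode, block.file_offset, new_addr)

        # Step 5: the victim now holds only dead data; mark it free.
        usage_table.live_bytes.pop(victim, None)
        free_segments.add(victim)
```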
When Does GC Run?
Common triggers:
Free-segment count drops below a threshold (foreground or urgent GC, run on demand)
The system goes idle (background GC uses quiet periods to clean ahead of need)
Periodic timers or an explicit administrative request
Every byte of live data in a cleaned segment must be written again. This means GC can double, triple, or even further multiply the total write volume. On flash storage, this accelerates wear. On any storage, it consumes bandwidth that could serve applications.
Write amplification is the ratio of total physical writes to logical writes requested by applications. Understanding this metric is crucial for predicting LFS performance.
Sources of Write Amplification in LFS:
GC copying: live blocks in cleaned segments are written a second (or third, or fourth) time
Metadata cascades: moving a data block forces its inode to be rewritten, which in turn forces an inode-map update
Log overhead: segment summaries, checkpoints, and partially filled segments add writes beyond the application's own data
Calculating GC Write Amplification:
Let's derive the write amplification from GC mathematically.
Assume:
Segments are S bytes each
The segments selected for cleaning have average utilization u (fraction of bytes that are live)
Cleaning one segment therefore copies u × S bytes of live data and frees (1 − u) × S bytes for new writes
Write amplification factor W:
W = (live bytes copied by GC) / (new bytes stored)
W = (u × S) / ((1 - u) × S)
W = u / (1 - u)
| Avg Segment Utilization | Write Amplification (u/(1-u)) | Total Writes per App Write | Interpretation |
|---|---|---|---|
| 10% | 0.11× | 1.11× | Minimal GC impact—most segments nearly empty |
| 30% | 0.43× | 1.43× | Healthy—GC is efficient |
| 50% | 1.0× | 2.0× | Every app write causes equal GC write |
| 70% | 2.33× | 3.33× | Significant GC overhead |
| 80% | 4.0× | 5.0× | Heavy GC penalty—disk mostly busy with GC |
| 90% | 9.0× | 10.0× | Severe degradation—90% of writes are GC |
| 95% | 19.0× | 20.0× | Near unusable—almost all bandwidth to GC |
The Utilization Crisis:
Notice how write amplification explodes as utilization approaches 100%: going from 80% to 90% utilization doubles the total write volume (5× to 10×), going from 90% to 95% doubles it again (10× to 20×), and as u approaches 1, u / (1 − u) grows without bound.
This explains the common advice: never run LFS volumes above 80-90% full.
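The figures in the table above follow directly from the formula; a quick sketch of the arithmetic:

```python
def gc_write_amplification(u: float) -> float:
    """Extra GC bytes written per byte of new application data: W = u / (1 - u)."""
    return u / (1.0 - u)

def total_writes_per_app_write(u: float) -> float:
    """One byte of application data plus its share of GC copying: 1 / (1 - u)."""
    return 1.0 + gc_write_amplification(u)

for u in (0.10, 0.30, 0.50, 0.70, 0.80, 0.90, 0.95):
    print(f"u = {u:.0%}:  GC amplification {gc_write_amplification(u):5.2f}x,"
          f"  total {total_writes_per_app_write(u):5.2f}x")
```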
But wait—what determines segment utilization at cleaning time?
Segment Utilization and Write Amplification (disk of 100 segments of 4 MB each = 400 MB, holding 320 MB of live data, i.e., 80% disk utilization). The question: what is the average utilization of the segments GC actually cleans?
Scenario A, uniform distribution: Every segment is exactly 80% live, so GC always cleans 80%-full segments. Write amplification = 0.8 / 0.2 = 4.0×; every new write requires 4× writes in total (1 new + 3 GC).
Scenario B, bimodal distribution (better): 80 segments are 100% live (no garbage, never cleaned) and 20 segments are 0% live (all garbage). Disk utilization is still 80% (320 MB / 400 MB), but GC cleans the 0% segments, so there is no live data to copy at all and GC adds no extra writes.
Scenario C, bimodal with some mixed segments (realistic): 60 segments are 100% live (stable data, never touched), 20 segments are 50% live, and 20 segments are 20% live. Total live data: 60×4 + 20×2 + 20×0.8 = 240 + 40 + 16 = 296 MB, or 74% disk utilization. GC targets the 20%-utilized segments, so write amplification = 0.2 / 0.8 = 0.25×, only 25% extra writes.
Key insight: the distribution of segment utilization matters more than the average. If hot and cold data stay separated, GC becomes cheap. The secret to efficient LFS garbage collection is creating a bimodal distribution: some segments with nearly all live data (cold, stable), others with nearly all dead data (former hot spots). This allows GC to target low-utilization segments with minimal copying, which is why data separation strategies (hot/cold/warm) matter so much.
Which segments should GC clean? This decision profoundly affects performance. Different policies optimize for different goals.
Greedy Policy:
Always clean the segment with the lowest utilization.
Pros:
Simple to implement and reason about
Maximizes the space reclaimed per byte of live data copied right now (best short-term efficiency)
Cons:
Ignores how old the data is
Tends to pick recently written segments whose surviving blocks are still hot, so the same hot data gets copied again and again (see the comparison below)
Cost-Benefit Policy:
Weight both utilization AND age. Prefer cleaning segments that are:
Low in utilization, so little live data has to be copied
Old, because data that has survived a long time is probably cold and, once moved, will not need to be moved again soon
The Rosenblum-Ousterhout formula:
Benefit = (1 - u) × age / (1 + u)
where:
u = utilization (0 to 1)
age = time since segment was written
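A short Python sketch that computes both scores for a handful of example segments (the same figures used in the worked comparison further below); the segment names and values are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    name: str
    utilization: float      # u, fraction of the segment that is live (0 to 1)
    age_hours: float        # time since the segment was written

def greedy_score(seg: Segment) -> float:
    # Greedy: lower utilization is better, so the policy minimizes u.
    return seg.utilization

def cost_benefit_score(seg: Segment) -> float:
    # Rosenblum-Ousterhout: benefit = free space gained x age, cost = 1 + u.
    u = seg.utilization
    return (1.0 - u) * seg.age_hours / (1.0 + u)

segments = [
    Segment("A", 0.10, 1),      # nearly empty but very young (probably hot)
    Segment("B", 0.15, 48),     # nearly empty and old (probably cold)
    Segment("C", 0.20, 24),
    Segment("D", 0.50, 72),
    Segment("E", 0.80, 168),
]

greedy_pick = min(segments, key=greedy_score)              # picks A (u = 0.10)
cost_benefit_pick = max(segments, key=cost_benefit_score)  # picks B (score ~ 35.5)
print(greedy_pick.name, cost_benefit_pick.name)
```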
| Policy | Criteria | Best For | Weakness |
|---|---|---|---|
| Greedy | Lowest utilization | Short-term efficiency | Moves cold data repeatedly |
| Cost-Benefit | Low utilization + old age | Mixed workloads | Complex to tune age weight |
| Least Recently Written | Oldest segment | Uniform access | May clean high-utilization segments |
| CAT (Cost-Age-Time) | Composite score | Production systems | Requires extensive tuning |
| Adaptive | Varies with conditions | Dynamic workloads | Implementation complexity |
Victim Selection Policy Comparison (five representative candidate segments):
| Segment | Utilization | Age (hours) | Greedy Score | Cost-Benefit Score |
|---|---|---|---|---|
| A | 10% | 1 | 0.10 | (0.9 × 1) / 1.1 = 0.82 |
| B | 15% | 48 | 0.15 | (0.85 × 48) / 1.15 = 35.5 |
| C | 20% | 24 | 0.20 | (0.8 × 24) / 1.2 = 16.0 |
| D | 50% | 72 | 0.50 | (0.5 × 72) / 1.5 = 24.0 |
| E | 80% | 168 | 0.80 | (0.2 × 168) / 1.8 = 18.7 |
The greedy policy selects Segment A (10% utilization): copying 10% live data reclaims 90% of the space. But A is only one hour old, so its data is probably hot and will likely be modified again soon; GC has just moved data that will immediately create more garbage. The failure mode plays out like this: at T=0 Segment A is written (a hot file is created); at T=1 GC cleans A and moves the hot file to Segment N; at T=2 the hot file is modified, the new version lands in Segment N+1, and the copy in N becomes garbage; at T=3 GC cleans N and moves the file yet again. GC spins its wheels moving the same hot data over and over.
Cost-benefit selects Segment B (score 35.5): only 15% utilized, but 48 hours old. Its surviving data is probably cold and stable, so it is worth moving because it will not need to be moved again soon; 85% of the segment is reclaimed and the relocated data is unlikely to change. By T=48, B's hot data has already been superseded and only cold data remains; once GC copies that cold data to a new segment, that segment stays stable at high utilization with no future cleaning needed. Moved data stays moved.
Modern Adaptive Approaches:
Production systems often use adaptive policies that adjust based on:
The current level of free space (how urgently segments must be reclaimed)
The current application write load (how much bandwidth GC may safely take)
The observed temperature of the data (how hot or cold candidate segments appear to be)
F2FS Example:
F2FS uses a modified cost-benefit approach: foreground (urgent) GC selects victims greedily, because reclaiming space quickly matters most, while background GC uses cost-benefit selection to minimize long-term copying.
The optimal victim selection policy depends on workload characteristics. Write-heavy databases benefit from cost-benefit. Sequential log files work well with greedy. Most production systems use tunable hybrid policies that administrators adjust based on observed performance.
When and how garbage collection runs significantly impacts application performance. GC competes with applications for disk bandwidth and can cause latency spikes if not managed carefully.
GC Scheduling Strategies:
GC Performance Impact (scenario: an LFS disk at 85% utilization under moderate write load):
Without background GC (blocking only), normal writes alternate with GC stalls: latency sits near the baseline most of the time, then spikes whenever the system must stop everything and clean. Latency percentiles: p50 ≈ 5 ms, p99 ≈ 2,000 ms (400× the 5 ms baseline), p99.9 ≈ 5,000 ms.
With background GC (concurrent and tuned), cleaning work is spread continuously alongside application writes. Latency is slightly elevated but consistent: p50 ≈ 7 ms (+40%, acceptable), p99 ≈ 15 ms (only 3× normal), p99.9 ≈ 25 ms.
Background GC bandwidth allocation:
Greedy allocation: GC hogs the bandwidth (say 80%), and application latency becomes unacceptable.
Throttled allocation: GC is capped at a small share (say 20%), keeping application latency acceptable, but GC may fall behind during peak load.
Adaptive allocation (the modern approach): GC bandwidth varies with free space and application demand. At low application load GC gets most of the bandwidth; as load rises GC's share shrinks; when free space becomes critical, both are throttled so GC can catch up.
GC Threading and Parallelism:
Modern LFS implementations use sophisticated GC scheduling:
A background cleaner thread that wakes during idle periods (see the F2FS sleep-time tunables below)
Foreground GC invoked directly on the write path when free space runs low
Rate limiting so that cleaning never monopolizes device bandwidth
Handling GC Emergencies:
When free space drops to critical levels (e.g., <5%):
GC switches to foreground (urgent) mode and runs on the write path
Application writes may be throttled or briefly blocked until enough segments have been reclaimed
Victim selection typically falls back to greedy, trading long-term efficiency for immediate space
Most LFS implementations reserve 10-20% of disk capacity as 'overprovision' space—always kept free, never written by applications. This buffer ensures GC always has segments to write to, preventing deadlock where GC needs to write but has no space to write into.
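As an illustration of the adaptive-allocation idea, and not any particular file system's policy, here is a sketch that picks a GC bandwidth share from the free-space level and the application demand; the threshold values are assumptions chosen to echo the urgency levels discussed above.

```python
def gc_bandwidth_share(free_space_frac: float, app_load_frac: float) -> float:
    """Fraction of device bandwidth to hand to the cleaner.

    free_space_frac: fraction of segments currently free (0 to 1)
    app_load_frac:   fraction of bandwidth applications are trying to use (0 to 1)
    The thresholds below are illustrative only.
    """
    if free_space_frac < 0.02:
        return 0.90            # emergency: throttle applications hard, clean aggressively
    if free_space_frac < 0.05:
        return 0.50            # urgent: split bandwidth, applications will feel it
    # Normal operation: clean opportunistically with leftover bandwidth,
    # but always keep a small trickle so GC never falls hopelessly behind.
    idle = max(0.0, 1.0 - app_load_frac)
    return max(0.05, min(idle, 0.80))

# Plenty of free space, light load: GC may use most of the bandwidth.
print(gc_bandwidth_share(free_space_frac=0.30, app_load_frac=0.20))   # 0.8
# Space getting tight under heavy load: GC takes half despite the load.
print(gc_bandwidth_share(free_space_frac=0.04, app_load_frac=0.90))   # 0.5
```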
Let's examine how production file systems implement garbage collection.
NILFS2 Approach: NILFS2 keeps the log as a history of checkpoints and snapshots; a segment becomes reclaimable once no remaining checkpoint or snapshot references its data, so reclaiming space is largely a matter of letting old checkpoints expire (see the table below).
F2FS Approach: F2FS separates writes into multiple logs by temperature (hot, warm, cold), cleans at section granularity to match SSD erase blocks, and selects victims greedily in the foreground and by cost-benefit in the background (details below).
| File System | GC Trigger | Victim Selection | Live Data Handling | Special Features |
|---|---|---|---|---|
| NILFS2 | Snapshot-based | Unreferenced segments | N/A (no copying) | GC = snapshot deletion |
| F2FS | Free section count | Greedy or Cost-Benefit | Copy to head of temperature-matched log | Section-aligned for SSD |
| ZFS (SPA) | Metaslab fullness | Best-fit within metaslab | COW + space maps | Not pure LFS; hybrid approach |
| Btrfs | Block group balance | Fullness ratio | Relocate extents | User-triggered or auto |
F2FS GC Deep Dive:
F2FS's GC is particularly sophisticated for mobile and SSD use cases:
Section Granularity: GC operates on "sections" (multiple segments grouped to match SSD erase blocks). This reduces SSD internal GC conflicts.
Two-Phase Cleaning: A victim section is processed in two passes: the first walks the segment summaries and checks which blocks are still valid against the current metadata, and the second reads the still-valid blocks and writes them out at the head of the appropriate log.
Foreground GC Limits: If foreground GC runs too long, F2FS yields to prevent application starvation.
Dirty Segment Bitmap: Efficient tracking of which segments need GC consideration.
SSR (Slack Space Recycling): For emergency space reclaim, F2FS can reuse partial segments without full cleaning—sacrificing sequentiality for space.
F2FS GC Thresholds (typical Android configuration):
- gc_min_sleep_time: 10 ms (minimum between GC rounds)
- gc_max_sleep_time: 30 seconds (maximum idle time before GC)
- gc_no_gc_sleep_time: 5 minutes (sleep when nothing to clean)
- gc_urgent_mid_space: 5% (threshold for mid-urgency GC)
- gc_urgent_high_space: 2% (threshold for emergency GC)
For F2FS on Linux, monitor GC performance via /sys/fs/f2fs/<device>/gc_*. Key metrics: gc_call (how often GC runs), gc_segment (segments cleaned), gc_block (blocks moved). High gc_call with low gc_segment indicates GC is frequently needed but finding no good candidates—a sign of poor data separation or near-full disk.
Garbage collection is the critical maintenance operation that makes log-structured file systems viable. Without effective GC, LFS would quickly fill with dead data and become unusable.
What's Next:
Garbage collection reclaims space, but the segment cleaning process determines its efficiency. The next page explores cleaning strategies in depth: how to merge segments, when to copy vs. compact, threading models for cleaning, and how modern implementations minimize GC overhead through intelligent data placement.
You now understand the fundamentals of garbage collection in log-structured file systems—why it's necessary, how liveness is determined, the mechanics of reclamation, and the critical role of segment utilization. This foundation prepares you for the advanced cleaning strategies covered next.