In the previous page, we explored the fundamentals of garbage collection—identifying dead data and reclaiming segments. But knowing what to clean is only half the battle. How we clean determines whether LFS achieves its performance promises.
Segment cleaning encompasses the strategies and optimizations that make garbage collection efficient at scale: segment merging and compaction, concurrent cleaning alongside application I/O, hot/cold data separation, hole-plugging and partial segment reclamation, and adaptive policies that respond to system conditions.
These advanced strategies separate production-ready LFS implementations from naive prototypes. Understanding them is essential for anyone designing, tuning, or debugging log-structured storage systems.
By the end of this page, you will understand segment merging and compaction strategies, concurrent cleaning with foreground/background threads, hot/cold data separation for cleaning efficiency, hole-plugging and partial segment reclamation, and adaptive cleaning that responds to system conditions.
When multiple segments have low utilization, cleaning them individually is inefficient. Segment merging combines live data from several dirty segments into fewer, fuller segments.
Simple Merge:
Clean multiple segments at once:
The efficiency depends on the total live data across victims:
Merge Input: N segments with average utilization U
Live data: N × U × segment_size
Output: ceil(N × U) segments
Space reclaimed: N - ceil(N × U) segments
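As a quick sanity check, this arithmetic can be sketched in a few lines of C. The function below simply reproduces the numbers of the example that follows; the name merge_efficiency and the MB-based units are illustrative, not part of any real implementation.

#include <math.h>
#include <stdio.h>

/* Illustrative merge-efficiency arithmetic for N victim segments
 * with average utilization U (0.0-1.0) and a given segment size. */
static void merge_efficiency(int n_segments, double avg_util, double seg_mb)
{
    double live_mb  = n_segments * avg_util * seg_mb;      /* data to copy */
    int    out_segs = (int)ceil(n_segments * avg_util);    /* segments written */
    int    freed    = n_segments - out_segs;               /* segments reclaimed */
    double amp      = (double)out_segs / freed;            /* writes per segment reclaimed */

    printf("live=%.1f MB, output=%d seg, reclaimed=%d seg, amplification=%.2fx\n",
           live_mb, out_segs, freed, amp);
}

int main(void)
{
    /* 4 victims, 3.4 MB live out of 16 MB => average utilization 0.2125 */
    merge_efficiency(4, 0.2125, 4.0);   /* prints ~0.33x, matching the example below */
    return 0;
}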
Segment Merging Example:

BEFORE MERGE:
═══════════════════════════════════════════════════════════════════════
Segment 15: [████░░░░░░░░] 30% live (1.2 MB live, 2.8 MB dead)
Segment 22: [██░░░░░░░░░░] 20% live (0.8 MB live, 3.2 MB dead)
Segment 31: [███░░░░░░░░░] 25% live (1.0 MB live, 3.0 MB dead)
Segment 47: [█░░░░░░░░░░░] 10% live (0.4 MB live, 3.6 MB dead)
────────────────────────────────────────────────────────────────────────
Total: 4 segments × 4 MB = 16 MB disk space
       3.4 MB live data, 12.6 MB dead data

MERGE PROCESS:
═══════════════════════════════════════════════════════════════════════
Step 1: Read all live blocks from segments 15, 22, 31, 47
        Total live data: 3.4 MB
Step 2: Write consolidated data to new segment(s)
        Output: 1 new segment (85% full)
Step 3: Update imap entries for all moved blocks
Step 4: Free segments 15, 22, 31, 47

AFTER MERGE:
═══════════════════════════════════════════════════════════════════════
New Segment 99: [██████████░░] 85% live (3.4 MB live, 0.6 MB slack)
Segments 15, 22, 31, 47: [FREE for reuse]
────────────────────────────────────────────────────────────────────────
Result: 4 segments merged → 1 segment
        3 segments reclaimed for new writes

EFFICIENCY CALCULATION:
  Read cost:     3.4 MB (live data from 4 segments)
  Write cost:    4 MB (one new segment with padding)
  Space gain:    12 MB (3 free segments)
  Amplification: 4 MB written / 12 MB reclaimed = 0.33×
  (We write 1 byte for every 3 bytes we reclaim—excellent!)

Greedy Merge vs. Optimal Merge:
Greedy approach: Merge whichever low-utilization segments are available now.
Optimal approach: Wait until many low-utilization segments are available, then find the best combination.
Practical compromise: Set thresholds—always clean when below critical space, opportunistically merge during idle periods.
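One way to express that compromise is a small policy predicate. This is a hedged sketch; the thresholds (10% critical, 40% opportunistic) are illustrative rather than taken from any particular system.

#include <stdbool.h>

/* Illustrative "when to merge" policy: always clean when space is critical,
 * opportunistically merge during idle periods, otherwise stay out of the way. */
bool should_merge_now(double free_pct, bool system_idle)
{
    if (free_pct < 10.0)
        return true;                    /* critical: clean regardless of load */
    if (system_idle && free_pct < 40.0)
        return true;                    /* idle: opportunistic merging */
    return false;                       /* plenty of space and busy: defer */
}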
Partial Segment Output:
What if merged live data doesn't fill a complete segment?
Option 1: Padding — Write partial segment, padding with zeros
Option 2: Coalescing with application writes — Add merge output to current write buffer
Option 3: Dedicated partial segment area — Special zone for GC output that doesn't require full segments
Larger merge batches improve efficiency but require more memory for buffering and may delay cleaning progress. Production systems typically merge 4-16 segments at a time, balancing memory usage against merge efficiency.
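The batching itself can be sketched as a loop over a small victim array. This is only an outline under the assumptions above; the helpers (select_dirty_victims, read_live_blocks, append_to_log, update_imap, free_segment) and types are hypothetical stand-ins for an implementation's own routines.

#include <stdint.h>

#define MERGE_BATCH 8                  /* production systems typically use 4-16 */

struct segment;                        /* opaque; the implementation's own type */
struct block { uint64_t inode, offset; struct block *next; };

/* Hypothetical helpers standing in for the implementation's own routines. */
int      select_dirty_victims(struct segment **out, int max);  /* lowest utilization first */
struct block *read_live_blocks(struct segment *s);             /* liveness via summary + imap */
uint64_t append_to_log(struct block *b);                       /* coalesced with app writes */
void     update_imap(uint64_t inode, uint64_t offset, uint64_t new_addr);
void     free_segment(struct segment *s);

/* Illustrative batched merge: consolidate live data from up to MERGE_BATCH
 * low-utilization victims into the segment open at the log head. */
void merge_round(void)
{
    struct segment *victims[MERGE_BATCH];
    int n = select_dirty_victims(victims, MERGE_BATCH);

    for (int i = 0; i < n; i++) {
        for (struct block *b = read_live_blocks(victims[i]); b; b = b->next) {
            uint64_t new_addr = append_to_log(b);          /* copy the live block forward */
            update_imap(b->inode, b->offset, new_addr);    /* repoint metadata at the copy */
        }
        free_segment(victims[i]);                          /* whole victim is now reclaimable */
    }
}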
Cleaning must run concurrently with application I/O to prevent blocking. Modern LFS implementations use sophisticated threading models to balance cleaning progress with application responsiveness.
Thread Architecture: application threads, a background GC thread, a checkpoint thread, and an inline foreground GC path all feed the shared segment buffer at the log head.
Coordination Challenges: lock ordering, making imap updates safe for concurrent readers, protecting segments pinned by in-flight reads, and sharing bandwidth between GC and application I/O.
Concurrent Cleaning Thread Model:

THREAD ARCHITECTURE:
═══════════════════════════════════════════════════════════════════════
APPLICATION THREADS (handle read(), write(), fsync() calls)
        │
        ▼
SEGMENT BUFFER (shared write buffer for log head)
  [App data][App data][GC data][App data]...
        │
        ├── BACKGROUND GC THREAD: low priority, runs when idle,
        │     rate-limited, victim selection: cost-benefit
        ├── CHECKPOINT THREAD: periodic (≈30 s interval), writes the
        │     checkpoint region, persists the imap
        └── FOREGROUND GC (inline): high priority, blocks the caller
              until space is freed, victim selection: greedy

COORDINATION MECHANISMS:
═══════════════════════════════════════════════════════════════════════
1. LOCK HIERARCHY (prevents deadlock):
   acquire(segment_table_lock) < acquire(imap_lock) < acquire(segment_X_lock)
   GC must follow this order when updating structures.

2. READ-COPY-UPDATE (RCU) for imap:
   - Readers never block (use current imap snapshot)
   - GC creates new imap version atomically
   - Old imap freed after all readers finish
   Benefit: read-heavy workloads see no GC latency impact

3. SEGMENT PINNING:
   - In-flight reads "pin" segments from GC reclamation
   - GC skips pinned segments or waits
   - Reference counting tracks pin state

   App reading Segment 42:   pin_count[42]++
   GC wants Segment 42:      if pin_count[42] > 0: skip or wait
   App finishes read:        pin_count[42]--
   When pin_count[42] == 0:  GC can proceed

4. BANDWIDTH PARTITIONING:
   - I/O scheduler tags GC I/O as low-priority
   - App I/O preempts GC I/O when needed
   - GC throttles itself based on free space level

Background GC Throttling:
Background GC must be throttled to avoid starving applications:
while (should_continue_gc()) {
    segment = select_victim();
    clean_segment(segment);

    // Throttle based on conditions
    if (free_space > high_watermark) {
        sleep(long_interval);      // Plenty of space, relax
    } else if (free_space > low_watermark) {
        sleep(medium_interval);    // Normal operation
    } else {
        // Critical! No sleeping, maximize cleaning
        continue;
    }

    // Yield to applications
    yield_if_pending_io();
}
Foreground GC Behavior:
When free space drops below the critical threshold, the thread issuing the write performs (or waits on) cleaning inline: victims are selected greedily for the fastest possible reclaim, and the write blocks until enough segments are freed for it to proceed. This ensures applications never permanently block, but they do experience latency spikes during space crises.
If foreground GC runs too often, it can cascade: GC writes consume what little free space exists, triggering more GC. Proper thresholds and background GC pacing prevent this death spiral. The rule of thumb: foreground GC should run less than 1% of the time under normal workloads.
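A sketch of how the write path might invoke foreground GC when the reserve is exhausted. The watermark value and the helpers (free_segments, clean_one_segment_greedy, allocate_free_segment) are illustrative assumptions, not a specific implementation's API.

#include <errno.h>

#define CRITICAL_FREE_SEGMENTS 8       /* illustrative: roughly the 5% emergency level */

/* Hypothetical helpers standing in for the implementation's own routines. */
int free_segments(void);               /* current count of free segments */
int clean_one_segment_greedy(void);    /* clean the emptiest victim; <0 if none reclaimable */
int allocate_free_segment(void);       /* hand a free segment to the writer */

/* Illustrative write path: the calling application thread cleans inline
 * (foreground GC) until space exists, so it pays the latency itself. */
int reserve_segment_for_write(void)
{
    while (free_segments() < CRITICAL_FREE_SEGMENTS) {
        /* Greedy victim selection: immediate reclaim matters more here
         * than long-term write amplification. */
        if (clean_one_segment_greedy() < 0)
            return -ENOSPC;            /* nothing reclaimable: genuinely out of space */
    }
    return allocate_free_segment();
}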
The most effective way to minimize cleaning overhead is to prevent low-utilization segments from forming in the first place. Hot/cold separation achieves this by grouping data based on expected lifetime.
The Core Insight:
If we write hot (frequently changing) and cold (stable) data to the same segment, the hot blocks die quickly while the cold blocks linger. Segments get stuck at middling utilization, and every cleaning pass re-copies cold data that was never going to die.
If we separate them, hot segments drift toward 100% dead and can be reclaimed without copying anything, while cold segments stay nearly 100% live and rarely need cleaning at all.
Hot/Cold Separation Impact on Cleaning:

SCENARIO: 100 MB of hot data (updates every hour)
          100 MB of cold data (updates yearly)
          Writing at 10 MB/hour

WITHOUT SEPARATION (mixed segments):
═══════════════════════════════════════════════════════════════════════
Hour 0:   [Hot+Cold][Hot+Cold][Hot+Cold]...
          50 segments, each 50% hot, 50% cold

Hour 1:   Hot data updated → written to new segments
          Old segments: 50% dead (old hot), 50% live (cold)
          New segments: 50% hot, 50% cold
          50 segments at 50% utilization → need cleaning
          But cleaning moves cold data that won't die for a year!

Hour 2-24: Each hour, hot data cycles
           Cold data keeps getting moved during cleaning
           Massive write amplification: ~10× for cleaning alone

ANALYSIS:
  Every hot data update triggers cleaning
  Every clean moves stable cold data
  Cold data moved 24×/day × 365 days = 8,760 moves/year

  Total writes/year: 100 MB hot × 24 × 365 + 100 MB cold × 8,760
                   = 876 GB hot + 876 GB cold
                   = 1,752 GB written

WITH SEPARATION:
═══════════════════════════════════════════════════════════════════════
Hour 0:   [Hot][Hot][Hot]...[Cold][Cold]...
          25 hot segments, 25 cold segments

Hour 1:   Hot data updated → new hot segments
          Old hot segments: 100% dead → FREE immediately!
          Cold segments: untouched (100% live)
          No cleaning needed—hot segments naturally become empty

Hour 2-24: Hot segments cycle; cold segments stable
           No cleaning if hot segment count stays constant

ANALYSIS:
  Hot data updates: 100 MB × 24 × 365 = 876 GB/year
  Cold data moves:  100 MB × 1 (yearly update) = 0.1 GB/year

  Total writes/year: ~876 GB

WRITE REDUCTION: 1,752 GB → 876 GB = 50% reduction
(And in practice, further improvements from better hot segment sizing)

Data Temperature Classification:
How do we identify hot vs. cold data?
Write frequency tracking: Count writes per inode/block
Age heuristics: Newly written → likely hot, survives long → cold
File type hints: temporary and database files tend to run hot, while media and archive files tend to stay cold
Directory classification: e.g., /tmp → hot, /var/log → cold
Application hints: fadvise() and fallocate() flags let applications declare expected access patterns
| Temperature | Data Characteristics | Examples | Segment Handling |
|---|---|---|---|
| HOT_NODE | Frequently updated metadata | Directory inodes, active file inodes | Dedicated hot node log |
| HOT_DATA | Frequently updated file content | Database blocks, temp files | Dedicated hot data log |
| WARM_NODE | Occasionally updated metadata | Normal file inodes | Warm node log |
| WARM_DATA | Occasionally updated content | Document files, code | Warm data log |
| COLD_NODE | Rarely updated metadata | Archived file inodes | Cold node log |
| COLD_DATA | Rarely updated content | Media, backups | Cold data log |
Initial temperature is a guess. Good implementations reclassify data based on observed behavior. If 'cold' data gets updated, promote it to warm/hot. If 'hot' data stabilizes, demote it to cold. But don't move data for reclassification—just remember the new temperature for future writes.
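A hedged sketch of how these heuristics might combine into a temperature decision. The thresholds, the write_stats fields, and the hint encoding are illustrative assumptions and do not correspond to F2FS's actual policy.

#include <stdint.h>
#include <time.h>

enum temp { TEMP_HOT, TEMP_WARM, TEMP_COLD };

struct write_stats {
    uint64_t write_count;    /* write-frequency tracking per inode/block range */
    time_t   last_write;     /* age heuristic input */
    int      hint;           /* application/file-type hint: -1 = none, else an enum temp */
};

/* The result only steers where FUTURE writes go (which log/segment); existing
 * data is never moved just because its classification changed. */
enum temp classify_temperature(const struct write_stats *s, time_t now)
{
    if (s->hint >= 0)
        return (enum temp)s->hint;              /* explicit hint wins */
    if (s->write_count > 100 || now - s->last_write < 60)
        return TEMP_HOT;                        /* frequently or very recently written */
    if (now - s->last_write > 7 * 24 * 3600)
        return TEMP_COLD;                       /* untouched for a week: demote */
    return TEMP_WARM;
}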
Standard LFS cleaning reclaims entire segments. But what about partially-dead segments where full cleaning isn't efficient? Hole-plugging offers an alternative.
The Hole-Plugging Concept:
Instead of copying live data out of a segment, write new data into the holes (dead areas) within that segment.
Traditional LFS: Append-only to log head; never reuse partial segments
Hole-plugging: Can write into dead regions of existing segments
Advantages: No live data has to be copied, dead space is reused immediately, and it still works when every segment sits at moderate utilization and conventional cleaning would be inefficient.
Disadvantages: Writes into holes are random rather than sequential, which erodes the core LFS benefit, costs seeks on HDDs, fights the internal GC on SSDs, and is not worth the bookkeeping when a segment has only a few holes.
Hole-Plugging vs. Traditional Cleaning:

SCENARIO: Segment with scattered dead blocks

SEGMENT 42 STATE:
  [LIVE][DEAD][LIVE][DEAD][DEAD][LIVE][DEAD][LIVE][DEAD][DEAD]
    A     ░     B     ░     ░     C     ░     D     ░     ░
  Utilization: 40% live (4 blocks), 60% dead (6 blocks)

TRADITIONAL CLEANING:
═══════════════════════════════════════════════════════════════════════
1. Read live blocks A, B, C, D
2. Append to current segment buffer at log head
3. Mark Segment 42 as FREE

   New segment at log head:
   [...existing data...][A][B][C][D][...new app data...]

   Cost:  4 blocks written (moved)
   Gain:  10 blocks free
   Ratio: 0.4× write amplification
   Property: Sequential writes maintained ✓

HOLE-PLUGGING:
═══════════════════════════════════════════════════════════════════════
1. Identify dead regions in Segment 42: blocks 1, 3, 4, 6, 8, 9
2. Write NEW application data directly into these holes

   Before: [A ][░ ][B ][░ ][░ ][C ][░ ][D ][░ ][░ ]
   After:  [A ][X ][B ][Y ][Y2][C ][Z ][D ][  ][  ]
           (new data X, Y, Y2, Z written into holes)

   Cost:  0 blocks moved (only new data written)
   Gain:  4 blocks reused, 2 still available
   Property: Random writes occur! Non-sequential ✗

WHEN HOLE-PLUGGING MAKES SENSE:
═══════════════════════════════════════════════════════════════════════
• All segments have roughly equal, moderate utilization (50-70%)
• Normal cleaning would achieve poor efficiency
• System is nearly full (little choice)
• Workload is already random I/O (sequential benefit lost anyway)
• Emergency reclamation needed

WHEN TO AVOID HOLE-PLUGGING:
═══════════════════════════════════════════════════════════════════════
• SSD with internal GC (random writes hurt SSD)
• HDD (seek time for each hole access)
• Segment has few holes (overhead exceeds benefit)
• Sequential workload (would regress performance)

F2FS SSR (Slack Space Recycling):
F2FS implements hole-plugging as SSR mode, a fallback used when normal append-only logging cannot proceed because free sections have run low.
SSR is intentionally a degradation mode:
// F2FS SSR policy (simplified)
if (free_sections < threshold) {
    // Switch to SSR mode
    target = find_segment_with_most_holes();
    write_to_holes(target, new_data);
} else {
    // Normal append-only mode
    append_to_current_segment(new_data);
}
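The helper names above are placeholders; real F2FS tracks per-segment validity in its Segment Information Table (SIT). What victim selection for hole-plugging could look like is sketched below, with illustrative structures (seg_table, BLOCKS_PER_SEG, nr_segments) in place of the real ones.

#define BLOCKS_PER_SEG 512          /* illustrative segment geometry */

struct seg_info {
    int valid_blocks;               /* live blocks; dead blocks are the "holes" */
};

extern struct seg_info seg_table[];
extern int nr_segments;

/* Pick the segment offering the most plug-able holes; skip segments that are
 * entirely free (nothing to plug around) or entirely live (no holes at all). */
static int find_segment_with_most_holes(void)
{
    int best = -1, best_holes = 0;

    for (int i = 0; i < nr_segments; i++) {
        int holes = BLOCKS_PER_SEG - seg_table[i].valid_blocks;

        if (holes == 0 || holes == BLOCKS_PER_SEG)
            continue;
        if (holes > best_holes) {
            best_holes = holes;
            best = i;
        }
    }
    return best;                    /* -1: no SSR candidate available */
}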
SSR keeps the file system operational when space is critical, but it's not a solution—it's a band-aid. High SSR usage indicates the disk is too full for healthy LFS operation. The proper fix is to delete data, increase disk capacity, or tune GC to be more aggressive.
No single cleaning policy works optimally for all workloads. Adaptive cleaning adjusts strategy based on observed conditions.
What Can Be Adapted: how aggressively GC runs, how victims are selected, and how much parallelism and bandwidth cleaning is allowed, as the framework below illustrates.
Adaptive Cleaning Decision Framework:

INPUT SIGNALS:
═══════════════════════════════════════════════════════════════════════
1. Free space level (current_free / total_capacity)
2. Free space trend (increasing, stable, decreasing)
3. Application I/O rate (bytes/sec from apps)
4. GC efficiency (bytes_reclaimed / bytes_written by GC)
5. Segment utilization (distribution of segment liveness)
6. Write pattern (sequential, random, mixed)
7. Time of day (peak hours, off-peak)

ADAPTIVE POLICIES:
═══════════════════════════════════════════════════════════════════════
Policy: GC_AGGRESSIVENESS
  free_space > 30%: LAZY
    • Background GC only
    • Long sleep between rounds
    • Wait for very low-utilization victims

  30% > free_space > 15%: NORMAL
    • Background GC with moderate pacing
    • Cost-benefit victim selection
    • Aim to maintain space level

  15% > free_space > 5%: AGGRESSIVE
    • Background GC at high rate
    • Greedy victim selection
    • Foreground GC if apps blocked too long

  free_space < 5%: EMERGENCY
    • Foreground GC blocks all writes
    • SSR mode enabled
    • Accept any reclaimable segment

Policy: VICTIM_SELECTION
  Stable workload (low variance in I/O rate):
    → Cost-benefit: optimize for minimum total writes over time

  Bursty workload (high variance):
    → Greedy during bursts: maximize immediate reclaim
    → Cost-benefit during lulls: optimize long-term

  Sequential write pattern detected:
    → Prefer oldest segments (data likely cold)

  Random write pattern detected:
    → Prefer lowest utilization (greedy)

Policy: CLEANING_PARALLELISM
  App I/O rate low, free space low:
    → Max parallelism: multiple GC threads, high bandwidth

  App I/O rate high, free space comfortable:
    → Minimal parallelism: single thread, throttled

  Off-peak hours detected:
    → Proactive cleaning: build free space reserve for peak

Machine Learning in GC:
Advanced systems are exploring ML-based approaches that learn workload patterns to predict data lifetimes and choose cleaning policies automatically. Facebook's RocksDB, an LSM-tree store that faces the analogous compaction-scheduling problem, is a commonly cited target for this kind of data-driven tuning.
| Parameter | Conservative | Normal | Aggressive | Emergency |
|---|---|---|---|---|
| GC sleep interval | 10 sec | 1 sec | 100 ms | 0 ms |
| Victim utilization threshold | < 20% | < 40% | < 70% | Any |
| Foreground GC trigger | Never | < 5% | < 10% | Always |
| Background GC bandwidth % | 5% | 20% | 50% | 100% |
| Merge batch size | 1 segment | 4 segments | 8 segments | As many as needed |
Adaptive strategies require good observability. Track GC metrics: segments cleaned, bytes moved, cleaning latency, write amplification, foreground GC frequency. Without metrics, you can't adapt intelligently—you're just guessing.
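Mapping the table to code is straightforward; the sketch below mirrors the thresholds and values above, but the struct layout and function are illustrative rather than drawn from a real system.

struct gc_params {
    int sleep_ms;            /* GC sleep interval between rounds */
    int victim_util_pct;     /* only clean segments below this utilization */
    int fg_trigger_pct;      /* free-space level that triggers foreground GC (-1 = never) */
    int bandwidth_pct;       /* share of device bandwidth GC may consume */
    int merge_batch;         /* segments merged per round */
};

/* Pick a parameter set from the current free-space percentage. */
struct gc_params pick_gc_params(double free_pct)
{
    if (free_pct > 30.0)                                     /* conservative */
        return (struct gc_params){ 10000, 20, -1,   5,  1 };
    if (free_pct > 15.0)                                     /* normal */
        return (struct gc_params){  1000, 40,  5,  20,  4 };
    if (free_pct > 5.0)                                      /* aggressive */
        return (struct gc_params){   100, 70, 10,  50,  8 };
    /* emergency: clean anything, use all bandwidth, merge as much as needed */
    return (struct gc_params){ 0, 100, 100, 100, 16 };
}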
Log-structured file systems naturally support snapshots since old data is never overwritten. But snapshots complicate cleaning: data that appears "dead" (not in current version) may be "live" (referenced by a snapshot).
Snapshot-LFS Interaction:
Snapshot-Aware Garbage Collection:

FILE HISTORY:
═══════════════════════════════════════════════════════════════════════
T=0: Create file.txt = "Hello"      [Seg 1: "Hello"]
T=1: ★ SNAPSHOT S1 created ★        (references file.txt v1)
T=2: Modify file.txt = "World"      [Seg 2: "World"]
T=3: ★ SNAPSHOT S2 created ★        (references file.txt v2)
T=4: Modify file.txt = "Final"      [Seg 3: "Final"] (current)
T=5: Delete file.txt                (current: no file.txt)

DATA LIVENESS VIEW:
═══════════════════════════════════════════════════════════════════════
                        Without Snapshots      With Snapshots S1, S2
                        ──────────────────     ─────────────────────
Seg 1: "Hello" (T=0)    DEAD (superseded)      LIVE (S1 reference)
Seg 2: "World" (T=2)    DEAD (superseded)      LIVE (S2 reference)
Seg 3: "Final" (T=4)    DEAD (deleted)         DEAD (no snapshot kept)

GC Decision with Snapshots:
  Segment 1 Analysis:
    Current imap:     file.txt → DELETED
    Snapshot S1 imap: file.txt → Seg 1 → LIVE in S1
    Snapshot S2 imap: file.txt → Seg 2 (does not reference Seg 1)
    Conclusion: "Hello" is LIVE (S1 needs it)
                Cannot free Segment 1 until S1 deleted

  Segment 2 Analysis:
    Snapshot S2 imap: file.txt → Seg 2 → LIVE in S2
    Conclusion: "World" is LIVE (S2 needs it)

  Segment 3 Analysis:
    Current imap: DELETED
    S1 imap: Not this version
    S2 imap: Not this version (S2 taken before this write)
    Conclusion: "Final" is DEAD (no reference anywhere)

SNAPSHOT DELETION TRIGGERS GC:
═══════════════════════════════════════════════════════════════════════
T=6: Admin deletes Snapshot S1

  Before S1 deletion:  Seg 1 "Hello": LIVE (S1)
  After S1 deletion:   Check other snapshots... S2 doesn't reference Seg 1
                       Check current...         current doesn't reference Seg 1
                       Seg 1 "Hello": DEAD → GC can reclaim!

IMPLEMENTATION: Reference Counting per Segment
═══════════════════════════════════════════════════════════════════════
  Segment 1: refcount = 1 (S1)
  Segment 2: refcount = 1 (S2)
  Segment 3: refcount = 0 (no references)

  When a snapshot is deleted, decrement the refcount for all its blocks
  When a refcount reaches 0, the segment becomes eligible for cleaning

NILFS2 Snapshot Approach:
NILFS2 takes snapshots to the extreme: every checkpoint it writes is retained, and any checkpoint can be promoted by the administrator into a permanent snapshot, so recent history is continuously addressable.
This simplifies GC dramatically: the cleaner only reclaims segments whose checkpoints are older than a configured protection period and are not pinned by a snapshot, turning liveness into an age-and-pin check rather than a per-block reference search.
Reference Counting Overhead:
Tracking snapshot references adds overhead: counts must be kept per segment (or per block), updated on every snapshot creation and deletion, and made crash-safe along with the rest of the metadata.
Some implementations use epoch-based GC instead: each segment records the range of checkpoint epochs it belongs to, and it becomes reclaimable once no retained snapshot falls inside that range, avoiding fine-grained counters.
With snapshots, the concept of 'dead' data becomes more nuanced. Data is only truly dead when no snapshot (including the implicit 'current' snapshot) references it. This fundamentally changes GC from 'find unreferenced blocks' to 'find blocks unreferenced by ALL snapshots.'
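A minimal sketch of that "unreferenced by ALL snapshots" check, assuming each snapshot retains its own imap version; the types, the imap_points_to lookup, and the list layout are assumptions made for illustration.

#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

typedef uint64_t blk_addr_t;

struct imap;                                    /* maps (inode, offset) -> disk address */
struct snapshot { struct imap *imap; struct snapshot *next; };

extern struct imap     *current_imap;
extern struct snapshot *snapshot_list;

/* Hypothetical lookup: does imap m map (ino, off) to disk address addr? */
bool imap_points_to(struct imap *m, ino_t ino, off_t off, blk_addr_t addr);

/* A block at 'addr' is live if the current version OR any snapshot still
 * points at it; it is dead only when no imap anywhere references it. */
bool block_is_live(ino_t ino, off_t off, blk_addr_t addr)
{
    if (imap_points_to(current_imap, ino, off, addr))
        return true;

    for (struct snapshot *s = snapshot_list; s; s = s->next)
        if (imap_points_to(s->imap, ino, off, addr))
            return true;

    return false;
}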
Proper tuning of cleaning parameters can significantly improve LFS performance. Here are key parameters and tuning guidelines.
Critical Parameters:
| Parameter | Impact | Too Low | Too High | Typical Range |
|---|---|---|---|---|
| Free space reserve % | Emergency buffer | Frequent foreground GC, stalls | Wasted capacity | 10-20% |
| GC background ratio | Cleaning pace | Falls behind, crises | Starves applications | 10-30% |
| Victim utilization threshold | Cleaning trigger | Too picky, runs out of space | Cleans efficient segments | 20-50% |
| Segment merge count | Batch efficiency | Low merge efficiency | Memory pressure | 4-16 |
| Hot/cold temperature threshold | Data separation | Poor separation | Too many temperature zones | 2-3 levels |
Workload-Specific Tuning:
Database workloads (heavy random writes, sync-heavy): keep background GC running frequently (short sleep intervals), reserve more free space, and lean on hot/cold separation so frequently rewritten pages don't pin whole segments.
Media storage (large sequential, append-mostly): data is mostly cold and segments die in bulk, so long GC sleep intervals, larger merge batches, and a smaller reserve are fine.
General-purpose server: start from the "Normal" column in the table above and adjust based on observed foreground GC frequency and write amplification.
Mobile/embedded (limited RAM, flash storage): keep merge batches small to limit buffering memory, and schedule proactive background GC during idle periods (e.g., while charging).
F2FS Tuning Examples (via sysfs):

# View current settings
cat /sys/fs/f2fs/<device>/gc_*

# Example settings for different workloads:

# WORKLOAD: Database (PostgreSQL, MySQL)
# High random writes, frequent fsync, latency sensitive
echo 2000  > /sys/fs/f2fs/<device>/gc_urgent_sleep_time   # Fast emergency GC
echo 10000 > /sys/fs/f2fs/<device>/gc_min_sleep_time      # Frequent background GC
echo 30    > /sys/fs/f2fs/<device>/gc_urgent_mid_space    # Higher threshold

# WORKLOAD: Media server (video streaming, large files)
# Sequential writes, read-heavy, throughput sensitive
echo 10000 > /sys/fs/f2fs/<device>/gc_urgent_sleep_time   # Slower emergency GC OK
echo 60000 > /sys/fs/f2fs/<device>/gc_min_sleep_time      # Infrequent GC
echo 10    > /sys/fs/f2fs/<device>/gc_urgent_mid_space    # Lower threshold

# WORKLOAD: Android phone (mixed, battery sensitive)
# Default settings are usually well-tuned for this
echo 5000  > /sys/fs/f2fs/<device>/gc_urgent_sleep_time
echo 30000 > /sys/fs/f2fs/<device>/gc_min_sleep_time

# Monitor GC health:
watch -n 1 'cat /sys/fs/f2fs/<device>/stat'
# Look for:
#   gc_call:     Total GC invocations
#   gc_fg_calls: Foreground GC (bad if high)
#   gc_segment:  Segments cleaned
#   gc_block:    Blocks moved (write amplification indicator)

Segment cleaning is where the practical realities of log-structured storage meet sophisticated engineering. Effective cleaning strategies determine whether LFS delivers on its performance promises.
What's Next:
The final page explores SSD optimization—how LFS designs adapt for flash storage. We'll examine how SSDs' unique characteristics (erase before write, wear leveling, internal GC) interact with LFS, and how modern implementations like F2FS optimize specifically for flash hardware.
You now understand advanced segment cleaning strategies—from merging and concurrent cleaning to hot/cold separation and adaptive policies. This knowledge enables you to tune and troubleshoot LFS implementations and understand why SSDs require specialized LFS designs.