In the previous page, we explored the fundamentals of garbage collection—identifying dead data and reclaiming segments. But knowing what to clean is only half the battle. How we clean determines whether LFS achieves its performance promises.
Segment cleaning encompasses the strategies and optimizations that make garbage collection efficient at scale: segment merging and compaction, concurrent cleaning alongside application I/O, hot/cold data separation, hole-plugging and partial segment reclamation, and adaptive policies that respond to system conditions.
These advanced strategies separate production-ready LFS implementations from naive prototypes. Understanding them is essential for anyone designing, tuning, or debugging log-structured storage systems.
By the end of this page, you will understand segment merging and compaction strategies, concurrent cleaning with foreground/background threads, hot/cold data separation for cleaning efficiency, hole-plugging and partial segment reclamation, and adaptive cleaning that responds to system conditions.
When multiple segments have low utilization, cleaning them individually is inefficient. Segment merging combines live data from several dirty segments into fewer, fuller segments.
Simple Merge:
Clean multiple segments at once:
The efficiency depends on the total live data across victims:
Merge Input: N segments with average utilization U
Live data: N × U × segment_size
Output: ceil(N × U) segments
Space reclaimed: N - ceil(N × U) segments
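As a quick sanity check, this arithmetic can be sketched in a few lines of C. The function below simply reproduces the numbers of the example that follows; the name merge_efficiency and the MB-based units are illustrative, not part of any real implementation.

#include <math.h>
#include <stdio.h>

/* Illustrative merge-efficiency arithmetic for N victim segments
 * with average utilization U (0.0-1.0) and a given segment size. */
static void merge_efficiency(int n_segments, double avg_util, double seg_mb)
{
    double live_mb  = n_segments * avg_util * seg_mb;      /* data to copy */
    int    out_segs = (int)ceil(n_segments * avg_util);    /* segments written */
    int    freed    = n_segments - out_segs;               /* segments reclaimed */
    double amp      = (double)out_segs / freed;            /* writes per segment reclaimed */

    printf("live=%.1f MB, output=%d seg, reclaimed=%d seg, amplification=%.2fx\n",
           live_mb, out_segs, freed, amp);
}

int main(void)
{
    /* 4 victims, 3.4 MB live out of 16 MB => average utilization 0.2125 */
    merge_efficiency(4, 0.2125, 4.0);   /* prints ~0.33x, matching the example below */
    return 0;
}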
Segment Merging Example:

BEFORE MERGE:
═══════════════════════════════════════════════════════════════════════
Segment 15: [████░░░░░░░░] 30% live (1.2 MB live, 2.8 MB dead)
Segment 22: [██░░░░░░░░░░] 20% live (0.8 MB live, 3.2 MB dead)
Segment 31: [███░░░░░░░░░] 25% live (1.0 MB live, 3.0 MB dead)
Segment 47: [█░░░░░░░░░░░] 10% live (0.4 MB live, 3.6 MB dead)
────────────────────────────────────────────────────────────────────────
Total: 4 segments × 4 MB = 16 MB disk space
       3.4 MB live data, 12.6 MB dead data

MERGE PROCESS:
═══════════════════════════════════════════════════════════════════════
Step 1: Read all live blocks from segments 15, 22, 31, 47
        Total live data: 3.4 MB
Step 2: Write consolidated data to new segment(s)
        Output: 1 new segment (85% full)
Step 3: Update imap entries for all moved blocks
Step 4: Free segments 15, 22, 31, 47

AFTER MERGE:
═══════════════════════════════════════════════════════════════════════
New Segment 99: [██████████░░] 85% live (3.4 MB live, 0.6 MB slack)
Segments 15, 22, 31, 47: [FREE for reuse]
────────────────────────────────────────────────────────────────────────
Result: 4 segments merged → 1 segment
        3 segments reclaimed for new writes

EFFICIENCY CALCULATION:
  Read cost:     3.4 MB (live data from 4 segments)
  Write cost:    4 MB (one new segment with padding)
  Space gain:    12 MB (3 free segments)
  Amplification: 4 MB written / 12 MB reclaimed = 0.33×
  (We write 1 byte for every 3 bytes we reclaim—excellent!)

Greedy Merge vs. Optimal Merge:
Greedy approach: Merge whichever low-utilization segments are available now.
Optimal approach: Wait until many low-utilization segments are available, then find the best combination.
Practical compromise: Set thresholds—always clean when below critical space, opportunistically merge during idle periods.
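One way to express that compromise is a small policy predicate. This is a hedged sketch; the thresholds (10% critical, 40% opportunistic) are illustrative rather than taken from any particular system.

#include <stdbool.h>

/* Illustrative "when to merge" policy: always clean when space is critical,
 * opportunistically merge during idle periods, otherwise stay out of the way. */
bool should_merge_now(double free_pct, bool system_idle)
{
    if (free_pct < 10.0)
        return true;                    /* critical: clean regardless of load */
    if (system_idle && free_pct < 40.0)
        return true;                    /* idle: opportunistic merging */
    return false;                       /* plenty of space and busy: defer */
}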
Partial Segment Output:
What if merged live data doesn't fill a complete segment?
Option 1: Padding — Write partial segment, padding with zeros
Option 2: Coalescing with application writes — Add merge output to current write buffer
Option 3: Dedicated partial segment area — Special zone for GC output that doesn't require full segments
Larger merge batches improve efficiency but require more memory for buffering and may delay cleaning progress. Production systems typically merge 4-16 segments at a time, balancing memory usage against merge efficiency.
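The batching itself can be sketched as a loop over a small victim array. This is only an outline under the assumptions above; the helpers (select_dirty_victims, read_live_blocks, append_to_log, update_imap, free_segment) and types are hypothetical stand-ins for an implementation's own routines.

#include <stdint.h>

#define MERGE_BATCH 8                  /* production systems typically use 4-16 */

struct segment;                        /* opaque; the implementation's own type */
struct block { uint64_t inode, offset; struct block *next; };

/* Hypothetical helpers standing in for the implementation's own routines. */
int      select_dirty_victims(struct segment **out, int max);  /* lowest utilization first */
struct block *read_live_blocks(struct segment *s);             /* liveness via summary + imap */
uint64_t append_to_log(struct block *b);                       /* coalesced with app writes */
void     update_imap(uint64_t inode, uint64_t offset, uint64_t new_addr);
void     free_segment(struct segment *s);

/* Illustrative batched merge: consolidate live data from up to MERGE_BATCH
 * low-utilization victims into the segment open at the log head. */
void merge_round(void)
{
    struct segment *victims[MERGE_BATCH];
    int n = select_dirty_victims(victims, MERGE_BATCH);

    for (int i = 0; i < n; i++) {
        for (struct block *b = read_live_blocks(victims[i]); b; b = b->next) {
            uint64_t new_addr = append_to_log(b);          /* copy the live block forward */
            update_imap(b->inode, b->offset, new_addr);    /* repoint metadata at the copy */
        }
        free_segment(victims[i]);                          /* whole victim is now reclaimable */
    }
}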
Cleaning must run concurrently with application I/O to prevent blocking. Modern LFS implementations use sophisticated threading models to balance cleaning progress with application responsiveness.
Thread Architecture: application threads, a background GC thread, a checkpoint thread, and an inline foreground GC path all feed the shared segment buffer at the log head.
Coordination Challenges: lock ordering, making imap updates safe for concurrent readers, protecting segments pinned by in-flight reads, and sharing bandwidth between GC and application I/O.
Concurrent Cleaning Thread Model:

THREAD ARCHITECTURE:
═══════════════════════════════════════════════════════════════════════
APPLICATION THREADS (handle read(), write(), fsync() calls)
        │
        ▼
SEGMENT BUFFER (shared write buffer for log head)
  [App data][App data][GC data][App data]...
        │
        ├── BACKGROUND GC THREAD: low priority, runs when idle,
        │     rate-limited, victim selection: cost-benefit
        ├── CHECKPOINT THREAD: periodic (≈30 s interval), writes the
        │     checkpoint region, persists the imap
        └── FOREGROUND GC (inline): high priority, blocks the caller
              until space is freed, victim selection: greedy

COORDINATION MECHANISMS:
═══════════════════════════════════════════════════════════════════════
1. LOCK HIERARCHY (prevents deadlock):
   acquire(segment_table_lock) < acquire(imap_lock) < acquire(segment_X_lock)
   GC must follow this order when updating structures.

2. READ-COPY-UPDATE (RCU) for imap:
   - Readers never block (use current imap snapshot)
   - GC creates new imap version atomically
   - Old imap freed after all readers finish
   Benefit: read-heavy workloads see no GC latency impact

3. SEGMENT PINNING:
   - In-flight reads "pin" segments from GC reclamation
   - GC skips pinned segments or waits
   - Reference counting tracks pin state

   App reading Segment 42:   pin_count[42]++
   GC wants Segment 42:      if pin_count[42] > 0: skip or wait
   App finishes read:        pin_count[42]--
   When pin_count[42] == 0:  GC can proceed

4. BANDWIDTH PARTITIONING:
   - I/O scheduler tags GC I/O as low-priority
   - App I/O preempts GC I/O when needed
   - GC throttles itself based on free space level

Background GC Throttling:
Background GC must be throttled to avoid starving applications:
while (should_continue_gc()) {
    segment = select_victim();
    clean_segment(segment);

    // Throttle based on conditions
    if (free_space > high_watermark) {
        sleep(long_interval);      // Plenty of space, relax
    } else if (free_space > low_watermark) {
        sleep(medium_interval);    // Normal operation
    } else {
        // Critical! No sleeping, maximize cleaning
        continue;
    }

    // Yield to applications
    yield_if_pending_io();
}
Foreground GC Behavior:
When free space drops below the critical threshold, the thread issuing the write performs (or waits on) cleaning inline: victims are selected greedily for the fastest possible reclaim, and the write blocks until enough segments are freed for it to proceed. This ensures applications never permanently block, but they do experience latency spikes during space crises.
If foreground GC runs too often, it can cascade: GC writes consume what little free space exists, triggering more GC. Proper thresholds and background GC pacing prevent this death spiral. The rule of thumb: foreground GC should run less than 1% of the time under normal workloads.
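A sketch of how the write path might invoke foreground GC when the reserve is exhausted. The watermark value and the helpers (free_segments, clean_one_segment_greedy, allocate_free_segment) are illustrative assumptions, not a specific implementation's API.

#include <errno.h>

#define CRITICAL_FREE_SEGMENTS 8       /* illustrative: roughly the 5% emergency level */

/* Hypothetical helpers standing in for the implementation's own routines. */
int free_segments(void);               /* current count of free segments */
int clean_one_segment_greedy(void);    /* clean the emptiest victim; <0 if none reclaimable */
int allocate_free_segment(void);       /* hand a free segment to the writer */

/* Illustrative write path: the calling application thread cleans inline
 * (foreground GC) until space exists, so it pays the latency itself. */
int reserve_segment_for_write(void)
{
    while (free_segments() < CRITICAL_FREE_SEGMENTS) {
        /* Greedy victim selection: immediate reclaim matters more here
         * than long-term write amplification. */
        if (clean_one_segment_greedy() < 0)
            return -ENOSPC;            /* nothing reclaimable: genuinely out of space */
    }
    return allocate_free_segment();
}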
The most effective way to minimize cleaning overhead is to prevent low-utilization segments from forming in the first place. Hot/cold separation achieves this by grouping data based on expected lifetime.
The Core Insight:
If we write hot (frequently changing) and cold (stable) data to the same segment, the hot blocks die quickly while the cold blocks linger. Segments get stuck at middling utilization, and every cleaning pass re-copies cold data that was never going to die.
If we separate them, hot segments drift toward 100% dead and can be reclaimed without copying anything, while cold segments stay nearly 100% live and rarely need cleaning at all.
Hot/Cold Separation Impact on Cleaning:

SCENARIO: 100 MB of hot data (updates every hour)
          100 MB of cold data (updates yearly)
          Writing at 10 MB/hour

WITHOUT SEPARATION (mixed segments):
═══════════════════════════════════════════════════════════════════════
Hour 0:   [Hot+Cold][Hot+Cold][Hot+Cold]...
          50 segments, each 50% hot, 50% cold

Hour 1:   Hot data updated → written to new segments
          Old segments: 50% dead (old hot), 50% live (cold)
          New segments: 50% hot, 50% cold
          50 segments at 50% utilization → need cleaning
          But cleaning moves cold data that won't die for a year!

Hour 2-24: Each hour, hot data cycles
           Cold data keeps getting moved during cleaning
           Massive write amplification: ~10× for cleaning alone

ANALYSIS:
  Every hot data update triggers cleaning
  Every clean moves stable cold data
  Cold data moved 24×/day × 365 days = 8,760 moves/year

  Total writes/year: 100 MB hot × 24 × 365 + 100 MB cold × 8,760
                   = 876 GB hot + 876 GB cold
                   = 1,752 GB written

WITH SEPARATION:
═══════════════════════════════════════════════════════════════════════
Hour 0:   [Hot][Hot][Hot]...[Cold][Cold]...
          25 hot segments, 25 cold segments

Hour 1:   Hot data updated → new hot segments
          Old hot segments: 100% dead → FREE immediately!
          Cold segments: untouched (100% live)
          No cleaning needed—hot segments naturally become empty

Hour 2-24: Hot segments cycle; cold segments stable
           No cleaning if hot segment count stays constant

ANALYSIS:
  Hot data updates: 100 MB × 24 × 365 = 876 GB/year
  Cold data moves:  100 MB × 1 (yearly update) = 0.1 GB/year

  Total writes/year: ~876 GB

WRITE REDUCTION: 1,752 GB → 876 GB = 50% reduction
(And in practice, further improvements from better hot segment sizing)

Data Temperature Classification:
How do we identify hot vs. cold data?
Write frequency tracking: Count writes per inode/block
Age heuristics: Newly written → likely hot, survives long → cold
File type hints: temporary and database files tend to run hot, while media and archive files tend to stay cold
Directory classification: e.g., /tmp → hot, /var/log → cold
Application hints: fadvise() and fallocate() flags let applications declare expected access patterns
| Temperature | Data Characteristics | Examples | Segment Handling |
|---|---|---|---|
| HOT_NODE | Frequently updated metadata | Directory inodes, active file inodes | Dedicated hot node log |
| HOT_DATA | Frequently updated file content | Database blocks, temp files | Dedicated hot data log |
| WARM_NODE | Occasionally updated metadata | Normal file inodes | Warm node log |
| WARM_DATA | Occasionally updated content | Document files, code | Warm data log |
| COLD_NODE | Rarely updated metadata | Archived file inodes | Cold node log |
| COLD_DATA | Rarely updated content | Media, backups | Cold data log |
Initial temperature is a guess. Good implementations reclassify data based on observed behavior. If 'cold' data gets updated, promote it to warm/hot. If 'hot' data stabilizes, demote it to cold. But don't move data for reclassification—just remember the new temperature for future writes.
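A hedged sketch of how these heuristics might combine into a temperature decision. The thresholds, the write_stats fields, and the hint encoding are illustrative assumptions and do not correspond to F2FS's actual policy.

#include <stdint.h>
#include <time.h>

enum temp { TEMP_HOT, TEMP_WARM, TEMP_COLD };

struct write_stats {
    uint64_t write_count;    /* write-frequency tracking per inode/block range */
    time_t   last_write;     /* age heuristic input */
    int      hint;           /* application/file-type hint: -1 = none, else an enum temp */
};

/* The result only steers where FUTURE writes go (which log/segment); existing
 * data is never moved just because its classification changed. */
enum temp classify_temperature(const struct write_stats *s, time_t now)
{
    if (s->hint >= 0)
        return (enum temp)s->hint;              /* explicit hint wins */
    if (s->write_count > 100 || now - s->last_write < 60)
        return TEMP_HOT;                        /* frequently or very recently written */
    if (now - s->last_write > 7 * 24 * 3600)
        return TEMP_COLD;                       /* untouched for a week: demote */
    return TEMP_WARM;
}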
Standard LFS cleaning reclaims entire segments. But what about partially-dead segments where full cleaning isn't efficient? Hole-plugging offers an alternative.
The Hole-Plugging Concept:
Instead of copying live data out of a segment, write new data into the holes (dead areas) within that segment.
Traditional LFS: Append-only to log head; never reuse partial segments
Hole-plugging: Can write into dead regions of existing segments
Advantages: No live data has to be copied, dead space is reused immediately, and it still works when every segment sits at moderate utilization and conventional cleaning would be inefficient.
Disadvantages: Writes into holes are random rather than sequential, which erodes the core LFS benefit, costs seeks on HDDs, fights the internal GC on SSDs, and is not worth the bookkeeping when a segment has only a few holes.
Hole-Plugging vs. Traditional Cleaning:

SCENARIO: Segment with scattered dead blocks

SEGMENT 42 STATE:
  [LIVE][DEAD][LIVE][DEAD][DEAD][LIVE][DEAD][LIVE][DEAD][DEAD]
    A     ░     B     ░     ░     C     ░     D     ░     ░
  Utilization: 40% live (4 blocks), 60% dead (6 blocks)

TRADITIONAL CLEANING:
═══════════════════════════════════════════════════════════════════════
1. Read live blocks A, B, C, D
2. Append to current segment buffer at log head
3. Mark Segment 42 as FREE

   New segment at log head:
   [...existing data...][A][B][C][D][...new app data...]

   Cost:  4 blocks written (moved)
   Gain:  10 blocks free
   Ratio: 0.4× write amplification
   Property: Sequential writes maintained ✓

HOLE-PLUGGING:
═══════════════════════════════════════════════════════════════════════
1. Identify dead regions in Segment 42: blocks 1, 3, 4, 6, 8, 9
2. Write NEW application data directly into these holes

   Before: [A ][░ ][B ][░ ][░ ][C ][░ ][D ][░ ][░ ]
   After:  [A ][X ][B ][Y ][Y2][C ][Z ][D ][  ][  ]
           (new data X, Y, Y2, Z written into holes)

   Cost:  0 blocks moved (only new data written)
   Gain:  4 blocks reused, 2 still available
   Property: Random writes occur! Non-sequential ✗

WHEN HOLE-PLUGGING MAKES SENSE:
═══════════════════════════════════════════════════════════════════════
• All segments have roughly equal, moderate utilization (50-70%)
• Normal cleaning would achieve poor efficiency
• System is nearly full (little choice)
• Workload is already random I/O (sequential benefit lost anyway)
• Emergency reclamation needed

WHEN TO AVOID HOLE-PLUGGING:
═══════════════════════════════════════════════════════════════════════
• SSD with internal GC (random writes hurt SSD)
• HDD (seek time for each hole access)
• Segment has few holes (overhead exceeds benefit)
• Sequential workload (would regress performance)

F2FS SSR (Slack Space Recycling):
F2FS implements hole-plugging as SSR mode, a fallback used when normal append-only logging cannot proceed because free sections have run low.
SSR is intentionally a degradation mode:
// F2FS SSR policy (simplified)
if (free_sections < threshold) {
    // Switch to SSR mode
    target = find_segment_with_most_holes();
    write_to_holes(target, new_data);
} else {
    // Normal append-only mode
    append_to_current_segment(new_data);
}
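The helper names above are placeholders; real F2FS tracks per-segment validity in its Segment Information Table (SIT). What victim selection for hole-plugging could look like is sketched below, with illustrative structures (seg_table, BLOCKS_PER_SEG, nr_segments) in place of the real ones.

#define BLOCKS_PER_SEG 512          /* illustrative segment geometry */

struct seg_info {
    int valid_blocks;               /* live blocks; dead blocks are the "holes" */
};

extern struct seg_info seg_table[];
extern int nr_segments;

/* Pick the segment offering the most plug-able holes; skip segments that are
 * entirely free (nothing to plug around) or entirely live (no holes at all). */
static int find_segment_with_most_holes(void)
{
    int best = -1, best_holes = 0;

    for (int i = 0; i < nr_segments; i++) {
        int holes = BLOCKS_PER_SEG - seg_table[i].valid_blocks;

        if (holes == 0 || holes == BLOCKS_PER_SEG)
            continue;
        if (holes > best_holes) {
            best_holes = holes;
            best = i;
        }
    }
    return best;                    /* -1: no SSR candidate available */
}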
SSR keeps the file system operational when space is critical, but it's not a solution—it's a band-aid. High SSR usage indicates the disk is too full for healthy LFS operation. The proper fix is to delete data, increase disk capacity, or tune GC to be more aggressive.
No single cleaning policy works optimally for all workloads. Adaptive cleaning adjusts strategy based on observed conditions.
What Can Be Adapted: how aggressively GC runs, how victims are selected, and how much parallelism and bandwidth cleaning is allowed, as the framework below illustrates.
Adaptive Cleaning Decision Framework:

INPUT SIGNALS:
═══════════════════════════════════════════════════════════════════════
1. Free space level (current_free / total_capacity)
2. Free space trend (increasing, stable, decreasing)
3. Application I/O rate (bytes/sec from apps)
4. GC efficiency (bytes_reclaimed / bytes_written by GC)
5. Segment utilization (distribution of segment liveness)
6. Write pattern (sequential, random, mixed)
7. Time of day (peak hours, off-peak)

ADAPTIVE POLICIES:
═══════════════════════════════════════════════════════════════════════
Policy: GC_AGGRESSIVENESS
  free_space > 30%: LAZY
    • Background GC only
    • Long sleep between rounds
    • Wait for very low-utilization victims

  30% > free_space > 15%: NORMAL
    • Background GC with moderate pacing
    • Cost-benefit victim selection
    • Aim to maintain space level

  15% > free_space > 5%: AGGRESSIVE
    • Background GC at high rate
    • Greedy victim selection
    • Foreground GC if apps blocked too long

  free_space < 5%: EMERGENCY
    • Foreground GC blocks all writes
    • SSR mode enabled
    • Accept any reclaimable segment

Policy: VICTIM_SELECTION
  Stable workload (low variance in I/O rate):
    → Cost-benefit: optimize for minimum total writes over time

  Bursty workload (high variance):
    → Greedy during bursts: maximize immediate reclaim
    → Cost-benefit during lulls: optimize long-term

  Sequential write pattern detected:
    → Prefer oldest segments (data likely cold)

  Random write pattern detected:
    → Prefer lowest utilization (greedy)

Policy: CLEANING_PARALLELISM
  App I/O rate low, free space low:
    → Max parallelism: multiple GC threads, high bandwidth

  App I/O rate high, free space comfortable:
    → Minimal parallelism: single thread, throttled

  Off-peak hours detected:
    → Proactive cleaning: build free space reserve for peak

Machine Learning in GC:
Advanced systems are exploring ML-based approaches that learn workload patterns to predict data lifetimes and choose cleaning policies automatically. Facebook's RocksDB, an LSM-tree store that faces the analogous compaction-scheduling problem, is a commonly cited target for this kind of data-driven tuning.
| Parameter | Conservative | Normal | Aggressive | Emergency |
|---|---|---|---|---|
| GC sleep interval | 10 sec | 1 sec | 100 ms | 0 ms |
| Victim utilization threshold | < 20% | < 40% | < 70% | Any |
| Foreground GC trigger | Never | < 5% | < 10% | Always |
| Background GC bandwidth % | 5% | 20% | 50% | 100% |
| Merge batch size | 1 segment | 4 segments | 8 segments | As many as needed |
Adaptive strategies require good observability. Track GC metrics: segments cleaned, bytes moved, cleaning latency, write amplification, foreground GC frequency. Without metrics, you can't adapt intelligently—you're just guessing.
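Mapping the table to code is straightforward; the sketch below mirrors the thresholds and values above, but the struct layout and function are illustrative rather than drawn from a real system.

struct gc_params {
    int sleep_ms;            /* GC sleep interval between rounds */
    int victim_util_pct;     /* only clean segments below this utilization */
    int fg_trigger_pct;      /* free-space level that triggers foreground GC (-1 = never) */
    int bandwidth_pct;       /* share of device bandwidth GC may consume */
    int merge_batch;         /* segments merged per round */
};

/* Pick a parameter set from the current free-space percentage. */
struct gc_params pick_gc_params(double free_pct)
{
    if (free_pct > 30.0)                                     /* conservative */
        return (struct gc_params){ 10000, 20, -1,   5,  1 };
    if (free_pct > 15.0)                                     /* normal */
        return (struct gc_params){  1000, 40,  5,  20,  4 };
    if (free_pct > 5.0)                                      /* aggressive */
        return (struct gc_params){   100, 70, 10,  50,  8 };
    /* emergency: clean anything, use all bandwidth, merge as much as needed */
    return (struct gc_params){ 0, 100, 100, 100, 16 };
}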
Log-structured file systems naturally support snapshots since old data is never overwritten. But snapshots complicate cleaning: data that appears "dead" (not in current version) may be "live" (referenced by a snapshot).
Snapshot-LFS Interaction:
Snapshot-Aware Garbage Collection:

FILE HISTORY:
═══════════════════════════════════════════════════════════════════════
T=0: Create file.txt = "Hello"      [Seg 1: "Hello"]
T=1: ★ SNAPSHOT S1 created ★        (references file.txt v1)
T=2: Modify file.txt = "World"      [Seg 2: "World"]
T=3: ★ SNAPSHOT S2 created ★        (references file.txt v2)
T=4: Modify file.txt = "Final"      [Seg 3: "Final"] (current)
T=5: Delete file.txt                (current: no file.txt)

DATA LIVENESS VIEW:
═══════════════════════════════════════════════════════════════════════
                        Without Snapshots      With Snapshots S1, S2
                        ──────────────────     ─────────────────────
Seg 1: "Hello" (T=0)    DEAD (superseded)      LIVE (S1 reference)
Seg 2: "World" (T=2)    DEAD (superseded)      LIVE (S2 reference)
Seg 3: "Final" (T=4)    DEAD (deleted)         DEAD (no snapshot kept)

GC Decision with Snapshots:
  Segment 1 Analysis:
    Current imap:     file.txt → DELETED
    Snapshot S1 imap: file.txt → Seg 1 → LIVE in S1
    Snapshot S2 imap: file.txt → Seg 2 (does not reference Seg 1)
    Conclusion: "Hello" is LIVE (S1 needs it)
                Cannot free Segment 1 until S1 deleted

  Segment 2 Analysis:
    Snapshot S2 imap: file.txt → Seg 2 → LIVE in S2
    Conclusion: "World" is LIVE (S2 needs it)

  Segment 3 Analysis:
    Current imap: DELETED
    S1 imap: Not this version
    S2 imap: Not this version (S2 taken before this write)
    Conclusion: "Final" is DEAD (no reference anywhere)

SNAPSHOT DELETION TRIGGERS GC:
═══════════════════════════════════════════════════════════════════════
T=6: Admin deletes Snapshot S1

  Before S1 deletion:  Seg 1 "Hello": LIVE (S1)
  After S1 deletion:   Check other snapshots... S2 doesn't reference Seg 1
                       Check current...         current doesn't reference Seg 1
                       Seg 1 "Hello": DEAD → GC can reclaim!

IMPLEMENTATION: Reference Counting per Segment
═══════════════════════════════════════════════════════════════════════
  Segment 1: refcount = 1 (S1)
  Segment 2: refcount = 1 (S2)
  Segment 3: refcount = 0 (no references)

  When a snapshot is deleted, decrement the refcount for all its blocks
  When a refcount reaches 0, the segment becomes eligible for cleaning

NILFS2 Snapshot Approach:
NILFS2 takes snapshots to the extreme: every checkpoint it writes is retained, and any checkpoint can be promoted by the administrator into a permanent snapshot, so recent history is continuously addressable.
This simplifies GC dramatically: the cleaner only reclaims segments whose checkpoints are older than a configured protection period and are not pinned by a snapshot, turning liveness into an age-and-pin check rather than a per-block reference search.
Reference Counting Overhead:
Tracking snapshot references adds overhead: counts must be kept per segment (or per block), updated on every snapshot creation and deletion, and made crash-safe along with the rest of the metadata.
Some implementations use epoch-based GC instead: each segment records the range of checkpoint epochs it belongs to, and it becomes reclaimable once no retained snapshot falls inside that range, avoiding fine-grained counters.
With snapshots, the concept of 'dead' data becomes more nuanced. Data is only truly dead when no snapshot (including the implicit 'current' snapshot) references it. This fundamentally changes GC from 'find unreferenced blocks' to 'find blocks unreferenced by ALL snapshots.'
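A minimal sketch of that "unreferenced by ALL snapshots" check, assuming each snapshot retains its own imap version; the types, the imap_points_to lookup, and the list layout are assumptions made for illustration.

#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

typedef uint64_t blk_addr_t;

struct imap;                                    /* maps (inode, offset) -> disk address */
struct snapshot { struct imap *imap; struct snapshot *next; };

extern struct imap     *current_imap;
extern struct snapshot *snapshot_list;

/* Hypothetical lookup: does imap m map (ino, off) to disk address addr? */
bool imap_points_to(struct imap *m, ino_t ino, off_t off, blk_addr_t addr);

/* A block at 'addr' is live if the current version OR any snapshot still
 * points at it; it is dead only when no imap anywhere references it. */
bool block_is_live(ino_t ino, off_t off, blk_addr_t addr)
{
    if (imap_points_to(current_imap, ino, off, addr))
        return true;

    for (struct snapshot *s = snapshot_list; s; s = s->next)
        if (imap_points_to(s->imap, ino, off, addr))
            return true;

    return false;
}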
Proper tuning of cleaning parameters can significantly improve LFS performance. Here are key parameters and tuning guidelines.
Critical Parameters:
| Parameter | Impact | Too Low | Too High | Typical Range |
|---|---|---|---|---|
| Free space reserve % | Emergency buffer | Frequent foreground GC, stalls | Wasted capacity | 10-20% |
| GC background ratio | Cleaning pace | Falls behind, crises | Starves applications | 10-30% |
| Victim utilization threshold | Cleaning trigger | Too picky, runs out of space | Cleans efficient segments | 20-50% |
| Segment merge count | Batch efficiency | Low merge efficiency | Memory pressure | 4-16 |
| Hot/cold temperature threshold | Data separation | Poor separation | Too many temperature zones | 2-3 levels |
Workload-Specific Tuning:
Database workloads (heavy random writes, sync-heavy): keep background GC running frequently (short sleep intervals), reserve more free space, and lean on hot/cold separation so frequently rewritten pages don't pin whole segments.
Media storage (large sequential, append-mostly): data is mostly cold and segments die in bulk, so long GC sleep intervals, larger merge batches, and a smaller reserve are fine.
General-purpose server: start from the "Normal" column in the table above and adjust based on observed foreground GC frequency and write amplification.
Mobile/embedded (limited RAM, flash storage): keep merge batches small to limit buffering memory, and schedule proactive background GC during idle periods (e.g., while charging).
F2FS Tuning Examples (via sysfs):

# View current settings
cat /sys/fs/f2fs/<device>/gc_*

# Example settings for different workloads:

# WORKLOAD: Database (PostgreSQL, MySQL)
# High random writes, frequent fsync, latency sensitive
echo 2000  > /sys/fs/f2fs/<device>/gc_urgent_sleep_time   # Fast emergency GC
echo 10000 > /sys/fs/f2fs/<device>/gc_min_sleep_time      # Frequent background GC
echo 30    > /sys/fs/f2fs/<device>/gc_urgent_mid_space    # Higher threshold

# WORKLOAD: Media server (video streaming, large files)
# Sequential writes, read-heavy, throughput sensitive
echo 10000 > /sys/fs/f2fs/<device>/gc_urgent_sleep_time   # Slower emergency GC OK
echo 60000 > /sys/fs/f2fs/<device>/gc_min_sleep_time      # Infrequent GC
echo 10    > /sys/fs/f2fs/<device>/gc_urgent_mid_space    # Lower threshold

# WORKLOAD: Android phone (mixed, battery sensitive)
# Default settings are usually well-tuned for this
echo 5000  > /sys/fs/f2fs/<device>/gc_urgent_sleep_time
echo 30000 > /sys/fs/f2fs/<device>/gc_min_sleep_time

# Monitor GC health:
watch -n 1 'cat /sys/fs/f2fs/<device>/stat'
# Look for:
#   gc_call:     Total GC invocations
#   gc_fg_calls: Foreground GC (bad if high)
#   gc_segment:  Segments cleaned
#   gc_block:    Blocks moved (write amplification indicator)

Segment cleaning is where the practical realities of log-structured storage meet sophisticated engineering. Effective cleaning strategies determine whether LFS delivers on its performance promises.
What's Next:
The final page explores SSD optimization—how LFS designs adapt for flash storage. We'll examine how SSDs' unique characteristics (erase before write, wear leveling, internal GC) interact with LFS, and how modern implementations like F2FS optimize specifically for flash hardware.
You now understand advanced segment cleaning strategies—from merging and concurrent cleaning to hot/cold separation and adaptive policies. This knowledge enables you to tune and troubleshoot LFS implementations and understand why SSDs require specialized LFS designs.