When Rosenblum and Ousterhout designed the original LFS in 1992, they were optimizing for hard disk drives. Fast-forward three decades, and log-structured file systems have found an even more compelling application: solid-state drives (SSDs).
LFS and flash storage turn out to be a remarkably good match: the append-only, sequential write pattern of LFS is exactly what flash performs best at, and it naturally sidesteps flash's erase-before-write constraint.
This page explores how modern log-structured file systems like F2FS (Flash-Friendly File System) optimize specifically for flash storage, why traditional file systems struggle on SSDs, and how understanding flash characteristics leads to better file system design.
By the end of this page, you will understand flash memory fundamentals (pages, blocks, erase-before-write), how SSD internal operations interact with file system writes, why LFS is naturally suited for flash storage, F2FS architecture and optimizations, TRIM command integration, and practical considerations for deploying LFS on SSDs.
To understand why LFS is ideal for SSDs, we must first understand how flash memory operates. Unlike magnetic disks, flash has asymmetric read/write/erase characteristics.
Flash Memory Hierarchy:
An SSD package contains multiple dies; each die contains planes; each plane contains blocks; and each block contains pages (commonly 256 or more). Operations apply at different granularities with very different costs:

| Operation | Granularity | Latency | Notes |
|---|---|---|---|
| Read | Page (4 KB) | ~25 μs | Fast; random access OK |
| Write | Page (4 KB) | ~200-500 μs | Must write to an erased page |
| Erase | Block (2 MB) | ~1-3 ms | Slow; affects 256-512 pages |

Key constraint: a page cannot be overwritten!

- A page starts "erased" (all 1s)
- A write changes some 1s to 0s
- To write the page again, the entire block must be erased (setting all bits back to 1)
- This is the "erase-before-write" constraint

The Erase-Before-Write Problem:
To modify 4 KB of existing data on an SSD, the drive cannot simply overwrite the page. It must either write the new version to an already-erased page elsewhere and mark the old page invalid, or erase the containing block before rewriting.
But erasing a block affects 256+ pages. If only one page was stale, the SSD must:

- Copy every still-valid page in the block to fresh locations elsewhere
- Erase the entire block
- Only then write the new data
This is called SSD internal write amplification—the SSD writes more data than requested.
The Flash Translation Layer (FTL):
SSDs don't expose this complexity directly. The Flash Translation Layer provides a familiar block device interface by:

- Mapping logical block addresses (LBAs) to physical flash pages
- Writing out of place: each write goes to a fresh pre-erased page, and the old page is marked invalid
- Running internal garbage collection to reclaim blocks full of invalid pages
- Spreading erases across blocks for wear leveling
The FTL essentially implements a log-structured approach inside the SSD. LFS and the FTL can work synergistically or conflict, depending on alignment.
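To make the out-of-place behavior concrete, here is a minimal sketch of a page-level FTL in Python. It is a toy model (the class and its fields are invented for illustration, not a real driver), but it captures the essential mechanic: every write is remapped to a fresh page, and the previous copy becomes stale garbage for internal GC.

```python
# Toy page-level FTL (names invented for illustration, not a real driver).
# Every logical write is redirected to the next erased page ("out of place");
# the page that held the previous version is marked invalid.

class SimpleFTL:
    def __init__(self):
        self.mapping = {}     # logical page number -> physical page number
        self.invalid = set()  # physical pages holding stale (overwritten) data
        self.next_free = 0    # append point: next pre-erased physical page

    def write(self, lpn):
        """Write one logical page, remapping it to a fresh physical page."""
        if lpn in self.mapping:
            self.invalid.add(self.mapping[lpn])  # old copy becomes garbage
        self.mapping[lpn] = self.next_free
        self.next_free += 1

ftl = SimpleFTL()
ftl.write(7)   # first write of logical page 7 -> physical page 0
ftl.write(7)   # "overwrite" -> physical page 1; page 0 is now stale
print(ftl.mapping[7])   # 1
print(ftl.invalid)      # {0}
```

Internal GC (not modeled here) would later copy still-valid pages out of mostly-invalid blocks and erase them, which is exactly the source of SSD-internal write amplification discussed above.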
If the file system does GC (moving blocks) and the SSD does internal GC (moving pages), you can have DOUBLE write amplification. This is why SSD-aware file systems are critical—they coordinate with SSD characteristics to minimize total write amplification.
Traditional file systems like ext4 and NTFS were designed for HDDs. Their design assumptions conflict with SSD characteristics.
Problem 1: Random In-Place Updates
Traditional file systems update data in place. On SSDs, each in-place update:

- Invalidates the flash page holding the old version
- Forces the FTL to allocate and program a fresh page
- Leaves stale pages scattered across many erase blocks, feeding internal GC
Problem 2: Metadata Hot Spots
Traditional file systems have fixed metadata locations:

- Superblocks, inode tables, and allocation bitmaps sit at fixed logical addresses
- These hot spots are rewritten constantly, concentrating writes on a few LBAs
- The FTL must continuously remap and migrate them to level the wear
Problem 3: Alignment Mismatch
File system block size (4 KB) vs. erase block size (2 MB):

- One erase block holds 512 file-system blocks
- Freeing individual 4 KB blocks leaves erase blocks only partially invalid
- The SSD cannot erase a partially valid block without first copying out its live pages
| Characteristic | Traditional FS (ext4) | SSD Impact | LFS Approach |
|---|---|---|---|
| Update style | In-place | Triggers SSD GC | Append-only, sequential |
| Metadata location | Fixed | Hot spots, wear concentration | Roaming, spread across log |
| Block alignment | 4 KB | Mismatched with erase blocks | Segment = erase block aligned |
| GC awareness | None | Conflicts with SSD GC | Coordinates with SSD |
| TRIM support | Added later | Critical for reclaim | Native integration |
| Write pattern | Mixed random/seq | Suboptimal parallelism | Pure sequential per log |
Write Amplification Example: Traditional FS vs. LFS on SSD

Scenario: update 100 small files (4 KB each, scattered across disk).

Traditional FS (ext4):

- File system operations: 100 data block updates (in place, at scattered LBAs), ~100 inode updates (at fixed inode table locations), and multiple bitmap updates
- FS write volume: ~800 KB (100 files × 4 KB + metadata overhead)
- SSD internal impact (simplified): each 4 KB write goes to a new page; 200+ old pages are marked invalid, scattered across blocks; internal GC is triggered frequently, and each GC pass may copy 50+ valid pages to free one block
- SSD actual writes: ~8 MB (10× write amplification is typical); 4-8 block erases triggered

Log-Structured FS:

- File system operations: all 100 updates are collected in the segment buffer and written as one 4 MB sequential segment containing data + inodes + imap updates
- FS write volume: ~4 MB (one segment with all updates)
- SSD internal impact: the sequential write fills contiguous pages, aligns with erase block boundaries, and lands on blocks the SSD pre-erased in the background; minimal internal GC during the write
- SSD actual writes: ~4.5 MB (1.1× write amplification); 2 block erases (sequential, pre-erased)

Comparison:

| | ext4 | LFS | Improvement |
|---|---|---|---|
| FS write volume | 0.8 MB | 4 MB | (LFS batches more) |
| SSD actual writes | 8 MB | 4.5 MB | 1.8× less |
| SSD erases | 4-8 blocks | 2 blocks | 2-4× less |
| Flash wear | High | Low | Significant lifespan gain |

Note: LFS appears to write more at the FS level (4 MB vs. 0.8 MB) but causes much less SSD-level write amplification, resulting in less actual flash wear. What matters for SSD lifespan isn't file-system write volume; it's total flash writes, including internal amplification.
LFS may write slightly more data, but by writing sequentially and aligning with erase blocks, it triggers far less SSD-internal copying. Less flash wear = longer SSD life.
F2FS (Flash-Friendly File System) was developed by Samsung and introduced to Linux in 2012. It's designed from the ground up for flash storage, combining log-structured principles with flash-specific optimizations.
Design Goals:

- Exploit flash characteristics: sequential writes and erase-block-aligned allocation
- Avoid the classic LFS "wandering tree" problem, where one data update propagates rewrites up the entire pointer chain
- Keep cleaning (GC) cost low through hot/cold data separation and cheap victim selection
- Remain a practical, general-purpose Linux file system with standard tooling
F2FS divides the device into six areas, in on-disk order:

- Superblock (SB): fixed location; contains FS parameters; 2 copies for redundancy
- Checkpoint Area (CP): two checkpoint packs (alternating); contains NAT/SIT journals, orphan inode lists, and summaries
- Segment Info Table (SIT): per-segment metadata (valid block count, validity bitmap); used for GC victim selection
- Node Address Table (NAT): maps node_id → physical block address (like the LFS imap); updated at checkpoint, cached in memory
- Segment Summary Area (SSA): owner information for each block in the main area; used for GC liveness checking
- Main Area (data + node logs): sections of segments written by 6 log heads (HOT_NODE, WARM_NODE, COLD_NODE, HOT_DATA, WARM_DATA, COLD_DATA)

F2FS terminology:

- Block = 4 KB (same as the flash page size)
- Segment = 512 blocks = 2 MB (matches a common SSD erase block)
- Section = 1+ segments (configurable; the GC unit)
- Zone = 1+ sections (for large-scale organization)

Key F2FS Structures:
Node Address Table (NAT): maps every node ID (inodes and indirect node blocks) to its current physical block address, much like the LFS inode map. Because node locations are resolved through the NAT, relocating a data block does not force rewriting the whole chain of pointers above it, which sidesteps the "wandering tree" problem. The NAT is persisted at checkpoint time and cached in memory.

Segment Info Table (SIT): records per-segment metadata, including the valid block count and a block validity bitmap. The cleaner consults the SIT to select GC victims and to detect when a section has become completely free.

Multi-Head Logging: instead of a single log, F2FS maintains six concurrent append points (HOT_NODE, WARM_NODE, COLD_NODE, HOT_DATA, WARM_DATA, COLD_DATA). Blocks with similar update frequency land in the same segments, so hot segments tend to become entirely invalid together and cleaning stays cheap.
The most effective LFS implementations are designed with explicit knowledge of SSD architecture. This alignment reduces total write amplification and maximizes SSD performance.
Erase Block Alignment:
SSDs erase in large chunks (128 KB - 4 MB). If FS operations don't align with these boundaries:

- Freed FS blocks scatter across many erase blocks
- No erase block ever becomes fully invalid on its own
- The SSD must copy live pages before every erase, inflating write amplification
F2FS sections (2 MB) are designed to match common erase block sizes. GC operates on sections, ensuring that when FS frees space, it's freeing complete erase blocks the SSD can reclaim cleanly.
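A toy model makes the alignment benefit concrete. The sizes below are deliberately tiny for readability (real erase blocks hold 256+ pages); the point is that an erase block is reclaimable without any copying only when all of its pages have been freed.

```python
# Toy model of erase-block reclaim (tiny sizes for readability; real erase
# blocks hold 256+ pages). A block can be erased without copying only if
# every page in it has been freed.

PAGES_PER_BLOCK = 4
NUM_BLOCKS = 4   # 16 pages total

def reclaimable_blocks(freed_pages):
    """Count erase blocks whose pages have ALL been freed."""
    count = 0
    for b in range(NUM_BLOCKS):
        pages = set(range(b * PAGES_PER_BLOCK, (b + 1) * PAGES_PER_BLOCK))
        if pages <= freed_pages:   # fully invalid -> direct erase
            count += 1
    return count

# Free 8 of the 16 pages, either scattered or erase-block-aligned:
scattered = set(range(0, 16, 2))   # every other page, across all blocks
aligned   = set(range(0, 8))       # exactly blocks 0 and 1

print(reclaimable_blocks(scattered))  # 0 - every block still half valid
print(reclaimable_blocks(aligned))    # 2 - two blocks erasable immediately
```

Both cases free the same number of pages, but only the aligned pattern lets the SSD erase anything without internal copying; that is exactly what segment-aligned GC buys.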
Erase Block Alignment Impact:

Misaligned (traditional FS): a 2 MB SSD erase block ends up holding a mix of live FS blocks and scattered free space. When the FS deletes blocks A, B, D, and F, the erase block still contains valid pages from other FS blocks, so the SSD can't erase it yet; internal GC must eventually copy those remaining valid pages elsewhere to consolidate. The FS freed four blocks, but the SSD couldn't reclaim the erase block.

Aligned (F2FS section): an SSD erase block (2 MB) corresponds exactly to one F2FS section, and all data within it comes from one logical grouping. When F2FS GC cleans that section, all 512 blocks become invalid together, so the SSD can erase the entire block immediately with no internal copying. FS GC directly enables SSD erase.

Practical alignment: query the SSD's erase block size (not always exposed; common values are 512 KB, 1 MB, 2 MB, and 4 MB), then create F2FS with a matching section size:

```shell
# mkfs.f2fs -s <segments_per_section> -z <sections_per_zone> <device>

# Example: 2 MB erase block, 2 MB segment (the default)
mkfs.f2fs -s 1 /dev/sda1

# Example: 4 MB erase block (2 segments x 2 MB = 4 MB section)
mkfs.f2fs -s 2 /dev/sda1
```

Channel and Die Parallelism:
SSDs achieve high performance through parallelism:

- Multiple channels connect the controller to the flash packages
- Multiple dies per channel, and planes per die, can program pages concurrently
- Peak bandwidth is reached only when requests keep all of these units busy
How LFS exploits parallelism: large sequential segment writes hand the controller long runs of consecutive pages, which it can stripe round-robin across channels and dies; small scattered writes cannot keep every unit busy.
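A small sketch (the channel count is an assumption for illustration) shows why long sequential writes keep the parallel units busy while tiny writes do not:

```python
# Sketch of round-robin striping across channels (channel count is an
# assumption for illustration). A long sequential write keeps every channel
# equally loaded; bandwidth scales with the number of busy channels.

NUM_CHANNELS = 4

def channel_loads(num_pages):
    """Pages per channel when consecutive pages are striped round-robin."""
    loads = [0] * NUM_CHANNELS
    for page in range(num_pages):
        loads[page % NUM_CHANNELS] += 1
    return loads

# A 16-page sequential segment write keeps all 4 channels busy:
print(channel_loads(16))   # [4, 4, 4, 4]
# A single 1-page random write uses only one channel:
print(channel_loads(1))    # [1, 0, 0, 0]
```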
Coordinating with SSD GC:
SSD and FS both do GC. Coordination prevents conflicts:

- TRIM tells the SSD which pages FS-level GC has invalidated, so internal GC skips them
- Erase-block-aligned segments mean FS cleaning frees whole blocks the SSD can erase directly
- Keeping free space in reserve gives the SSD headroom, so its GC rarely runs in the foreground
| Principle | Implementation | Benefit |
|---|---|---|
| Erase block alignment | Segment/section = erase block | Direct erase, no internal copy |
| Sequential writes | Append-only logging | Maximizes channel parallelism |
| Large I/Os | Segment buffering (MB-sized) | Saturates SSD bandwidth |
| TRIM support | Immediate discard on delete | Enables proactive SSD GC |
| Hot/cold separation | Multi-temperature logs | Reduces both FS and SSD GC |
| Over-provisioning awareness | Reserve FS free space | Ensures SSD has GC headroom |
While SSDs don't always expose internal geometry, you can often infer it through benchmarking or manufacturer specs. The Linux hdparm and nvme tools provide some information. Matching FS parameters to SSD characteristics yields measurable performance improvements.
TRIM (ATA) or UNMAP (SCSI/NVMe) commands inform the SSD that certain logical blocks are no longer in use. This is critical for SSD performance and longevity.
Why TRIM Matters:
Without TRIM:

- The SSD never learns that files were deleted; it still treats their pages as valid
- Internal GC keeps copying dead data, wasting bandwidth and flash endurance
- Effective over-provisioning shrinks as the drive fills with unreclaimed garbage
With TRIM:

- Deleted LBA ranges are marked invalid inside the SSD immediately
- GC skips them, and whole blocks become erasable sooner
- The drive retains a larger pool of free pages, keeping write performance stable
TRIM Operation Flow:

Without TRIM:

- T=0: a file occupies LBAs 1000-1100; the SSD maps those LBAs to pages in Block 5
- T=1: the user deletes the file; the FS marks the blocks free in its own structures, but the SSD's state is unchanged (it still considers LBAs 1000-1100 valid)
- T=2: a new file is written to different LBAs (2000-2100); the SSD maps it to fresh pages in Block 6
- T=3: SSD internal GC scans Block 5, sees "all pages valid," and copies every page, wasting bandwidth. Result: deleted data is copied around pointlessly

With TRIM:

- T=0: same starting state
- T=1: on delete, the FS marks the blocks free and issues TRIM for LBAs 1000-1100; the SSD marks those pages INVALID
- T=2: the new file goes to fresh pages in Block 6, as before
- T=3: GC scans Block 5, finds all pages invalid, skips the copy, and erases the block whenever convenient. Result: no wasted copying, efficient space reclaim

TRIM in F2FS: the GC process (1) selects a victim section (2 MB), (2) copies live blocks to a new location, (3) updates NAT entries, (4) issues TRIM for the entire section range, and (5) marks the section free in the SIT.

```shell
# Mount options for TRIM control:
mount -o discard /dev/nvme0n1p1 /mnt      # real-time discard
mount -o nodiscard /dev/nvme0n1p1 /mnt    # batch discard via fstrim

# Periodic TRIM (recommended for some SSDs):
fstrim -v /mnt    # manually trim all free space
# or run fstrim weekly via a cron job / systemd timer
```

TRIM Strategies:
Real-time Discard (mount -o discard): TRIM is issued inline with every delete. Space is reclaimed immediately, but on some drives per-command TRIM latency slows deletion-heavy workloads.

Batched Discard (periodic fstrim): mount with nodiscard and run fstrim on a schedule (e.g., weekly). TRIMs arrive in large, efficient batches, at the cost of delayed reclaim.

Hybrid Approach: use scheduled fstrim as the baseline and enable real-time discard only where prompt reclaim matters, such as frequently recycled scratch space.
F2FS TRIM Behavior:

F2FS sends TRIM at section granularity during GC: once a victim section's live blocks have been migrated, the entire section range (an erase-block-sized region) is discarded in one command, telling the SSD that the whole block is invalid at once.

Checkpoints also trigger discard processing: segments freed since the last checkpoint are trimmed as part of writing the checkpoint, keeping the SSD's view of free space current.
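The GC-copy savings from TRIM can be illustrated with a toy model (invented names, simplified liveness tracking): GC must copy every page it still believes is live before erasing a block, and TRIM is what lets the SSD know that deleted pages are actually dead.

```python
# Toy model of TRIM's effect on SSD GC. GC must copy every page it still
# believes is live before erasing a block; TRIM tells the SSD that deleted
# pages are dead. Names and sizes are illustrative.

def gc_copies(block_pages, deleted_pages, trimmed):
    """Number of pages GC must copy out of a block before erasing it."""
    live = set(block_pages)
    if trimmed:
        live -= deleted_pages   # TRIM marked these invalid inside the SSD
    return len(live)

block = set(range(100, 108))     # 8 pages in one erase block
deleted = set(range(100, 108))   # the user deleted the file using them

print(gc_copies(block, deleted, trimmed=False))  # 8 - dead data copied pointlessly
print(gc_copies(block, deleted, trimmed=True))   # 0 - block erasable directly
```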
Some SSDs use TRIM to improve wear leveling by avoiding writes to frequently-erased blocks. Without TRIM, the SSD may concentrate erases on a subset of blocks, accelerating wear-out. Regular TRIM is important for SSD longevity, not just performance.
Total write amplification (WA) on an SSD system combines two multiplicative levels: the file system's own WA and the SSD's internal WA.
Minimizing TOTAL WA requires optimizing both levels.
Write Amplification Formula:
Total_WA = FS_WA × SSD_WA
Where:
FS_WA = (FS bytes written to device) / (application bytes written)
SSD_WA = (flash bytes actually written) / (bytes received from FS)
Total_WA = (flash bytes written) / (application bytes)
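Plugging illustrative numbers into these formulas shows why the multiplicative structure matters: a modest FS-level WA combined with a high SSD-internal WA is far worse than the reverse. The figures below are hypothetical, not measurements.

```python
# The multiplicative WA relationship from the formulas above, with
# illustrative (not measured) numbers.

def total_wa(fs_wa, ssd_wa):
    """Total write amplification = FS-level WA x SSD-internal WA."""
    return fs_wa * ssd_wa

# Traditional FS: low FS-level WA, but random writes cause high SSD WA:
print(round(total_wa(1.2, 5.0), 2))   # 6.0
# LFS: batching raises FS-level WA slightly, but SSD WA stays near 1:
print(round(total_wa(1.5, 1.1), 2))   # 1.65
```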
| Source | Traditional FS (ext4) | Log-Structured (F2FS) | Reduction |
|---|---|---|---|
| Metadata updates | In-place (high SSD WA) | Append to log (low SSD WA) | 2-5x |
| File modification | Read-modify-write | Append new version | 1.5-3x |
| Journal writes | Separate journal area | Integrated in log | 1.2-2x |
| FS garbage collection | N/A (in-place) | Segment cleaning | -1x to +2x* |
| SSD internal GC | Random pattern = high | Sequential = low | 2-10x |
| Overall typical | 3-10x | 1.5-3x | ~3x improvement |
*Note: FS GC can add write amplification, but by coordinating with SSD, it often reduces TOTAL amplification by enabling efficient SSD GC.
F2FS Techniques for WA Reduction:
Inline data: Store small files (≤ 3.4 KB) inside inode
Inline dentry: Store small directories inside inode
Atomic writes: Combine related updates
Hot/cold separation: Temperature-based logging
IPU (In-Place Update) for specific cases: when the device is nearly full or for small synchronous rewrites, F2FS can fall back to updating a block in place instead of appending, trading some flash-friendliness for reduced GC pressure.
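The payoff of temperature separation can be sketched with a toy model (illustrative, not F2FS's actual policy): hot blocks are soon overwritten and die, cold blocks stay live, and what matters for GC cost is how the two are packed into segments.

```python
# Toy model of hot/cold separation (illustrative, not F2FS's actual policy).
# Hot blocks are overwritten soon after being written (they "die"); cold
# blocks stay live. GC cost depends on how much live data victims hold.

SEG_SIZE = 4

def fill_segments(blocks):
    """Pack blocks into segments of SEG_SIZE in write order."""
    return [blocks[i:i + SEG_SIZE] for i in range(0, len(blocks), SEG_SIZE)]

def live_fraction(segment):
    """Fraction of a segment still live (hot blocks have since died)."""
    return sum(1 for b in segment if b == "cold") / len(segment)

mixed     = fill_segments(["hot", "cold"] * 4)                 # interleaved writes
separated = fill_segments(["hot"] * 4) + fill_segments(["cold"] * 4)

print([live_fraction(s) for s in mixed])      # [0.5, 0.5] - any victim is half live
print([live_fraction(s) for s in separated])  # [0.0, 1.0] - free one, skip the other
```

With mixing, every GC victim forces copying half a segment of cold data; with separation, the hot segment frees itself and the cold segment never needs cleaning.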
Write Amplification Optimization Techniques:

Inline data: a small file (< 3.4 KB) normally requires two block writes, the inode block (holding a pointer) plus a separate 4 KB data block. With inline data, the content is stored inside the inode block itself, so only one block is written: a 50% WA reduction for small files.

Atomic write (an F2FS feature): a database transaction (e.g., UPDATE table SET col=val WHERE id=123) normally involves writing the data page, writing a log record, an fsync, writing metadata, and a second fsync: multiple small writes and multiple fsyncs, with heavy SSD random-write overhead. With F2FS atomic writes, the application starts an atomic operation, buffers all its writes, and commits with a single fsync that lands as one large sequential segment write:

```c
ioctl(fd, F2FS_IOC_START_ATOMIC_WRITE);
/* ... perform writes ... */
ioctl(fd, F2FS_IOC_COMMIT_ATOMIC_WRITE);
```

Hot/cold separation impact: consider 1 GB of hot data and 9 GB of cold data. Without separation, segments end up roughly 50% hot and 50% cold, GC constantly re-copies cold data, FS WA from GC is around 3×, and combined with SSD WA the total reaches roughly 6-10×. With F2FS's 6-log separation, hot data lands in the HOT_DATA log and dies together (segments drop toward 0% live), cold data sits in the COLD_DATA log at ~100% live and is rarely touched by GC, FS WA from GC falls to around 1.2×, and the total to around 2×: a 3-5× reduction in flash wear.

SSDs reserve 7-28% of capacity as over-provisioning for GC and wear leveling. Log-structured file systems should leave additional free space (~10-20%) beyond this.
Total reserved space (SSD OP + FS reserve) of 25-40% enables optimal performance and longevity for write-heavy workloads.
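The arithmetic behind that guidance, as a small sketch (the percentages come from the ranges quoted above; they are planning numbers, not measurements):

```python
# Illustrative arithmetic for total reserved flash: SSD over-provisioning
# is a fraction of raw capacity, while FS free space is a fraction of the
# user-visible capacity. Percentages follow the guidance above.

def total_reserve(ssd_op, fs_free):
    """Fraction of raw flash kept free overall."""
    # User-visible capacity is raw * (1 - ssd_op), so FS free space
    # contributes fs_free * (1 - ssd_op) of the raw capacity.
    return ssd_op + fs_free * (1 - ssd_op)

print(round(total_reserve(0.10, 0.20), 3))  # 0.28  -> ~28% of raw flash reserved
print(round(total_reserve(0.28, 0.15), 3))  # 0.388 -> ~39% reserved
```

Both examples fall inside the 25-40% band recommended for write-heavy workloads.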
Deploying log-structured file systems on SSDs requires attention to configuration, monitoring, and workload-specific tuning.
Recommended F2FS Mount Options for SSDs:
F2FS SSD Deployment Configuration:

```shell
# Create F2FS with SSD-optimized settings:
#   -s 1 : 1 segment per section (2 MB section = common erase block size)
#   -z 1 : 1 section per zone
mkfs.f2fs -f \
  -O extra_attr,inode_checksum,sb_checksum,compression \
  -s 1 -z 1 \
  /dev/nvme0n1p1

# For larger erase-block SSDs (e.g., 4 MB):
mkfs.f2fs -s 2 /dev/nvme0n1p1
```

Mount options:

```shell
# Basic SSD-optimized mount:
mount -t f2fs -o noatime,discard /dev/nvme0n1p1 /mnt/data

# General purpose (desktop/laptop):
#   noatime           reduce metadata writes
#   discard           real-time TRIM
#   background_gc=on  enable background GC
#   gc_merge          merge GC I/O with regular I/O
mount -t f2fs -o noatime,discard,background_gc=on,gc_merge \
  /dev/nvme0n1p1 /mnt/data

# Database server (high random write; nobarrier only if battery backup exists):
mount -t f2fs \
  -o noatime,nodiratime,discard,active_logs=6,gc_merge,fsync_mode=nobarrier \
  /dev/nvme0n1p1 /mnt/data

# Mobile/embedded (limited RAM):
mount -t f2fs -o noatime,background_gc=sync,gc_merge \
  /dev/nvme0n1p1 /mnt/data
```

Monitoring:

```shell
# F2FS statistics (sysfs). Key metrics:
#   gc_call       total GC invocations
#   gc_fg_calls   foreground GC (should be rare)
#   gc_data_blks  data blocks moved by GC (a write amplification indicator)
#   dirty_*       dirty segments per temperature zone
cat /sys/fs/f2fs/<device>/stat

# Write amplification estimate: gc_data_blks / total_data_written

# Free space (keep above 10-20% for optimal performance):
df -h /mnt/data

# SMART data for SSD health:
smartctl -a /dev/nvme0n1 | grep -E '(Wear|Written|Available)'
```

Common Deployment Mistakes:
- Running the disk too full: keep at least 10-20% free.
- Ignoring TRIM: mounting without discard and never running fstrim. Fix: use the discard mount option or schedule periodic fstrim.
- Misaligned partitions: partitions not aligned to the erase block. Fix: use parted with proper alignment, or let the installer handle it.
- Wrong FS for the workload: using F2FS for a read-heavy workload where LFS offers little benefit.
- Over-aggressive fsync: too-frequent syncs defeat LFS write batching.
| Workload | Best FS Choice | Key Configuration | Why |
|---|---|---|---|
| Android phone | F2FS | Default settings | Optimized for flash, small files, app installs |
| Linux desktop SSD | F2FS or ext4 | F2FS for write-heavy | ext4 mature; F2FS for browser/dev |
| Database server | F2FS with tuning | Active logs, GC tuning | Random writes benefit from LFS |
| NAS/File server | ext4 or XFS | Large reads dominate | LFS overhead not justified |
| Build server | F2FS | Fast for many small files | Compiler creates/deletes many files |
| Container storage | F2FS or overlay2 | F2FS for base layer | Image layers + container writes |
Generic recommendations are starting points. Use fio to simulate your actual workload on ext4 vs. F2FS. Measure throughput, latency, and (if possible) SSD SMART data for writes over time. Real workload testing beats theoretical optimization.
Log-structured file systems have found their ideal medium in flash storage. The fundamental properties of LFS—sequential writes, append-only updates, and coordinated garbage collection—align perfectly with flash memory's constraints and strengths.
The Future: Zoned Storage and Beyond:
Emerging Zoned Namespace (ZNS) SSDs expose log-structured interfaces directly:

- The drive is divided into zones that must be written sequentially and reset as a unit
- Much of the FTL's remapping and internal GC disappears; the host's log becomes the on-flash layout
- F2FS already supports zoned block devices, mapping its sections onto zones
As storage technology evolves toward persistent memory and new non-volatile media, the principles of log-structured storage—sequential writes, append-only modifications, and explicit garbage collection—will remain foundational to efficient storage system design.
Congratulations! You've completed the comprehensive study of log-structured file systems. From the foundational LFS concept through sequential writes, garbage collection, segment cleaning, and SSD optimization, you now possess deep understanding of this revolutionary storage paradigm. This knowledge forms the basis for working with modern storage systems, databases (LSM trees), and distributed storage infrastructure.