When Rosenblum and Ousterhout designed the original LFS in 1992, they were optimizing for hard disk drives. Fast-forward three decades, and log-structured file systems have found an even more compelling application: solid-state drives (SSDs).
LFS and flash storage turn out to be a remarkably good match: the append-only, sequential write pattern of LFS is exactly what flash performs best at, and it naturally sidesteps flash's erase-before-write constraint.
This page explores how modern log-structured file systems like F2FS (Flash-Friendly File System) optimize specifically for flash storage, why traditional file systems struggle on SSDs, and how understanding flash characteristics leads to better file system design.
By the end of this page, you will understand flash memory fundamentals (pages, blocks, erase-before-write), how SSD internal operations interact with file system writes, why LFS is naturally suited for flash storage, F2FS architecture and optimizations, TRIM command integration, and practical considerations for deploying LFS on SSDs.
To understand why LFS is ideal for SSDs, we must first understand how flash memory operates. Unlike magnetic disks, flash has asymmetric read/write/erase characteristics.
Flash Memory Hierarchy:
An SSD package contains multiple dies; each die contains planes; each plane contains blocks; and each block contains pages (commonly 256 or more). Operations apply at different granularities with very different costs:

| Operation | Granularity | Latency | Notes |
|---|---|---|---|
| Read | Page (4 KB) | ~25 μs | Fast; random access OK |
| Write | Page (4 KB) | ~200-500 μs | Must write to an erased page |
| Erase | Block (2 MB) | ~1-3 ms | Slow; affects 256-512 pages |

Key constraint: a page cannot be overwritten!

- A page starts "erased" (all 1s)
- A write changes some 1s to 0s
- To write the page again, the entire block must be erased (setting all bits back to 1)
- This is the "erase-before-write" constraint

The Erase-Before-Write Problem:
To modify 4 KB of existing data on an SSD, the drive cannot simply overwrite the page. It must either write the new version to an already-erased page elsewhere and mark the old page invalid, or erase the containing block before rewriting.
But erasing a block affects 256+ pages. If only one page was stale, the SSD must:

- Copy every still-valid page in the block to fresh locations elsewhere
- Erase the entire block
- Only then write the new data
This is called SSD internal write amplification—the SSD writes more data than requested.
The Flash Translation Layer (FTL):
SSDs don't expose this complexity directly. The Flash Translation Layer provides a familiar block device interface by:

- Mapping logical block addresses (LBAs) to physical flash pages
- Writing out of place: each write goes to a fresh pre-erased page, and the old page is marked invalid
- Running internal garbage collection to reclaim blocks full of invalid pages
- Spreading erases across blocks for wear leveling
The FTL essentially implements a log-structured approach inside the SSD. LFS and the FTL can work synergistically or conflict, depending on alignment.
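To make the out-of-place behavior concrete, here is a minimal sketch of a page-level FTL in Python. It is a toy model (the class and its fields are invented for illustration, not a real driver), but it captures the essential mechanic: every write is remapped to a fresh page, and the previous copy becomes stale garbage for internal GC.

```python
# Toy page-level FTL (names invented for illustration, not a real driver).
# Every logical write is redirected to the next erased page ("out of place");
# the page that held the previous version is marked invalid.

class SimpleFTL:
    def __init__(self):
        self.mapping = {}     # logical page number -> physical page number
        self.invalid = set()  # physical pages holding stale (overwritten) data
        self.next_free = 0    # append point: next pre-erased physical page

    def write(self, lpn):
        """Write one logical page, remapping it to a fresh physical page."""
        if lpn in self.mapping:
            self.invalid.add(self.mapping[lpn])  # old copy becomes garbage
        self.mapping[lpn] = self.next_free
        self.next_free += 1

ftl = SimpleFTL()
ftl.write(7)   # first write of logical page 7 -> physical page 0
ftl.write(7)   # "overwrite" -> physical page 1; page 0 is now stale
print(ftl.mapping[7])   # 1
print(ftl.invalid)      # {0}
```

Internal GC (not modeled here) would later copy still-valid pages out of mostly-invalid blocks and erase them, which is exactly the source of SSD-internal write amplification discussed above.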
If the file system does GC (moving blocks) and the SSD does internal GC (moving pages), you can have DOUBLE write amplification. This is why SSD-aware file systems are critical—they coordinate with SSD characteristics to minimize total write amplification.
Traditional file systems like ext4 and NTFS were designed for HDDs. Their design assumptions conflict with SSD characteristics.
Problem 1: Random In-Place Updates
Traditional file systems update data in place. On SSDs, each in-place update:

- Invalidates the flash page holding the old version
- Forces the FTL to allocate and program a fresh page
- Leaves stale pages scattered across many erase blocks, feeding internal GC
Problem 2: Metadata Hot Spots
Traditional file systems have fixed metadata locations:

- Superblocks, inode tables, and allocation bitmaps sit at fixed logical addresses
- These hot spots are rewritten constantly, concentrating writes on a few LBAs
- The FTL must continuously remap and migrate them to level the wear
Problem 3: Alignment Mismatch
File system block size (4 KB) vs. erase block size (2 MB):

- One erase block holds 512 file-system blocks
- Freeing individual 4 KB blocks leaves erase blocks only partially invalid
- The SSD cannot erase a partially valid block without first copying out its live pages
| Characteristic | Traditional FS (ext4) | SSD Impact | LFS Approach |
|---|---|---|---|
| Update style | In-place | Triggers SSD GC | Append-only, sequential |
| Metadata location | Fixed | Hot spots, wear concentration | Roaming, spread across log |
| Block alignment | 4 KB | Mismatched with erase blocks | Segment = erase block aligned |
| GC awareness | None | Conflicts with SSD GC | Coordinates with SSD |
| TRIM support | Added later | Critical for reclaim | Native integration |
| Write pattern | Mixed random/seq | Suboptimal parallelism | Pure sequential per log |
Write Amplification Example: Traditional FS vs. LFS on SSD

Scenario: update 100 small files (4 KB each, scattered across disk).

Traditional FS (ext4):

- File system operations: 100 data block updates (in place, at scattered LBAs), ~100 inode updates (at fixed inode table locations), and multiple bitmap updates
- FS write volume: ~800 KB (100 files × 4 KB + metadata overhead)
- SSD internal impact (simplified): each 4 KB write goes to a new page; 200+ old pages are marked invalid, scattered across blocks; internal GC is triggered frequently, and each GC pass may copy 50+ valid pages to free one block
- SSD actual writes: ~8 MB (10× write amplification is typical); 4-8 block erases triggered

Log-Structured FS:

- File system operations: all 100 updates are collected in the segment buffer and written as one 4 MB sequential segment containing data + inodes + imap updates
- FS write volume: ~4 MB (one segment with all updates)
- SSD internal impact: the sequential write fills contiguous pages, aligns with erase block boundaries, and lands on blocks the SSD pre-erased in the background; minimal internal GC during the write
- SSD actual writes: ~4.5 MB (1.1× write amplification); 2 block erases (sequential, pre-erased)

Comparison:

| | ext4 | LFS | Improvement |
|---|---|---|---|
| FS write volume | 0.8 MB | 4 MB | (LFS batches more) |
| SSD actual writes | 8 MB | 4.5 MB | 1.8× less |
| SSD erases | 4-8 blocks | 2 blocks | 2-4× less |
| Flash wear | High | Low | Significant lifespan gain |

Note: LFS appears to write more at the FS level (4 MB vs. 0.8 MB) but causes much less SSD-level write amplification, resulting in less actual flash wear. What matters for SSD lifespan isn't file-system write volume; it's total flash writes, including internal amplification.
LFS may write slightly more data, but by writing sequentially and aligning with erase blocks, it triggers far less SSD-internal copying. Less flash wear = longer SSD life.
F2FS (Flash-Friendly File System) was developed by Samsung and introduced to Linux in 2012. It's designed from the ground up for flash storage, combining log-structured principles with flash-specific optimizations.
Design Goals:

- Exploit flash characteristics: sequential writes and erase-block-aligned allocation
- Avoid the classic LFS "wandering tree" problem, where one data update propagates rewrites up the entire pointer chain
- Keep cleaning (GC) cost low through hot/cold data separation and cheap victim selection
- Remain a practical, general-purpose Linux file system with standard tooling
F2FS divides the device into six areas, in on-disk order:

- Superblock (SB): fixed location; contains FS parameters; 2 copies for redundancy
- Checkpoint Area (CP): two checkpoint packs (alternating); contains NAT/SIT journals, orphan inode lists, and summaries
- Segment Info Table (SIT): per-segment metadata (valid block count, validity bitmap); used for GC victim selection
- Node Address Table (NAT): maps node_id → physical block address (like the LFS imap); updated at checkpoint, cached in memory
- Segment Summary Area (SSA): owner information for each block in the main area; used for GC liveness checking
- Main Area (data + node logs): sections of segments written by 6 log heads (HOT_NODE, WARM_NODE, COLD_NODE, HOT_DATA, WARM_DATA, COLD_DATA)

F2FS terminology:

- Block = 4 KB (same as the flash page size)
- Segment = 512 blocks = 2 MB (matches a common SSD erase block)
- Section = 1+ segments (configurable; the GC unit)
- Zone = 1+ sections (for large-scale organization)

Key F2FS Structures:
Node Address Table (NAT): maps every node ID (inodes and indirect node blocks) to its current physical block address, much like the LFS inode map. Because node locations are resolved through the NAT, relocating a data block does not force rewriting the whole chain of pointers above it, which sidesteps the "wandering tree" problem. The NAT is persisted at checkpoint time and cached in memory.

Segment Info Table (SIT): records per-segment metadata, including the valid block count and a block validity bitmap. The cleaner consults the SIT to select GC victims and to detect when a section has become completely free.

Multi-Head Logging: instead of a single log, F2FS maintains six concurrent append points (HOT_NODE, WARM_NODE, COLD_NODE, HOT_DATA, WARM_DATA, COLD_DATA). Blocks with similar update frequency land in the same segments, so hot segments tend to become entirely invalid together and cleaning stays cheap.
The most effective LFS implementations are designed with explicit knowledge of SSD architecture. This alignment reduces total write amplification and maximizes SSD performance.
Erase Block Alignment:
SSDs erase in large chunks (128 KB - 4 MB). If FS operations don't align with these boundaries:

- Freed FS blocks scatter across many erase blocks
- No erase block ever becomes fully invalid on its own
- The SSD must copy live pages before every erase, inflating write amplification
F2FS sections (2 MB) are designed to match common erase block sizes. GC operates on sections, ensuring that when FS frees space, it's freeing complete erase blocks the SSD can reclaim cleanly.
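A toy model makes the alignment benefit concrete. The sizes below are deliberately tiny for readability (real erase blocks hold 256+ pages); the point is that an erase block is reclaimable without any copying only when all of its pages have been freed.

```python
# Toy model of erase-block reclaim (tiny sizes for readability; real erase
# blocks hold 256+ pages). A block can be erased without copying only if
# every page in it has been freed.

PAGES_PER_BLOCK = 4
NUM_BLOCKS = 4   # 16 pages total

def reclaimable_blocks(freed_pages):
    """Count erase blocks whose pages have ALL been freed."""
    count = 0
    for b in range(NUM_BLOCKS):
        pages = set(range(b * PAGES_PER_BLOCK, (b + 1) * PAGES_PER_BLOCK))
        if pages <= freed_pages:   # fully invalid -> direct erase
            count += 1
    return count

# Free 8 of the 16 pages, either scattered or erase-block-aligned:
scattered = set(range(0, 16, 2))   # every other page, across all blocks
aligned   = set(range(0, 8))       # exactly blocks 0 and 1

print(reclaimable_blocks(scattered))  # 0 - every block still half valid
print(reclaimable_blocks(aligned))    # 2 - two blocks erasable immediately
```

Both cases free the same number of pages, but only the aligned pattern lets the SSD erase anything without internal copying; that is exactly what segment-aligned GC buys.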
Erase Block Alignment Impact:

Misaligned (traditional FS): a 2 MB SSD erase block ends up holding a mix of live FS blocks and scattered free space. When the FS deletes blocks A, B, D, and F, the erase block still contains valid pages from other FS blocks, so the SSD can't erase it yet; internal GC must eventually copy those remaining valid pages elsewhere to consolidate. The FS freed four blocks, but the SSD couldn't reclaim the erase block.

Aligned (F2FS section): an SSD erase block (2 MB) corresponds exactly to one F2FS section, and all data within it comes from one logical grouping. When F2FS GC cleans that section, all 512 blocks become invalid together, so the SSD can erase the entire block immediately with no internal copying. FS GC directly enables SSD erase.

Practical alignment: query the SSD's erase block size (not always exposed; common values are 512 KB, 1 MB, 2 MB, and 4 MB), then create F2FS with a matching section size:

```shell
# mkfs.f2fs -s <segments_per_section> -z <sections_per_zone> <device>

# Example: 2 MB erase block, 2 MB segment (the default)
mkfs.f2fs -s 1 /dev/sda1

# Example: 4 MB erase block (2 segments x 2 MB = 4 MB section)
mkfs.f2fs -s 2 /dev/sda1
```

Channel and Die Parallelism:
SSDs achieve high performance through parallelism:

- Multiple channels connect the controller to the flash packages
- Multiple dies per channel, and planes per die, can program pages concurrently
- Peak bandwidth is reached only when requests keep all of these units busy
How LFS exploits parallelism: large sequential segment writes hand the controller long runs of consecutive pages, which it can stripe round-robin across channels and dies; small scattered writes cannot keep every unit busy.
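A small sketch (the channel count is an assumption for illustration) shows why long sequential writes keep the parallel units busy while tiny writes do not:

```python
# Sketch of round-robin striping across channels (channel count is an
# assumption for illustration). A long sequential write keeps every channel
# equally loaded; bandwidth scales with the number of busy channels.

NUM_CHANNELS = 4

def channel_loads(num_pages):
    """Pages per channel when consecutive pages are striped round-robin."""
    loads = [0] * NUM_CHANNELS
    for page in range(num_pages):
        loads[page % NUM_CHANNELS] += 1
    return loads

# A 16-page sequential segment write keeps all 4 channels busy:
print(channel_loads(16))   # [4, 4, 4, 4]
# A single 1-page random write uses only one channel:
print(channel_loads(1))    # [1, 0, 0, 0]
```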
Coordinating with SSD GC:
SSD and FS both do GC. Coordination prevents conflicts:

- TRIM tells the SSD which pages FS-level GC has invalidated, so internal GC skips them
- Erase-block-aligned segments mean FS cleaning frees whole blocks the SSD can erase directly
- Keeping free space in reserve gives the SSD headroom, so its GC rarely runs in the foreground
| Principle | Implementation | Benefit |
|---|---|---|
| Erase block alignment | Segment/section = erase block | Direct erase, no internal copy |
| Sequential writes | Append-only logging | Maximizes channel parallelism |
| Large I/Os | Segment buffering (MB-sized) | Saturates SSD bandwidth |
| TRIM support | Immediate discard on delete | Enables proactive SSD GC |
| Hot/cold separation | Multi-temperature logs | Reduces both FS and SSD GC |
| Over-provisioning awareness | Reserve FS free space | Ensures SSD has GC headroom |
While SSDs don't always expose internal geometry, you can often infer it through benchmarking or manufacturer specs. The Linux hdparm and nvme tools provide some information. Matching FS parameters to SSD characteristics yields measurable performance improvements.
TRIM (ATA) or UNMAP (SCSI/NVMe) commands inform the SSD that certain logical blocks are no longer in use. This is critical for SSD performance and longevity.
Why TRIM Matters:
Without TRIM:

- The SSD never learns that files were deleted; it still treats their pages as valid
- Internal GC keeps copying dead data, wasting bandwidth and flash endurance
- Effective over-provisioning shrinks as the drive fills with unreclaimed garbage
With TRIM:

- Deleted LBA ranges are marked invalid inside the SSD immediately
- GC skips them, and whole blocks become erasable sooner
- The drive retains a larger pool of free pages, keeping write performance stable
TRIM Operation Flow:

Without TRIM:

- T=0: a file occupies LBAs 1000-1100; the SSD maps those LBAs to pages in Block 5
- T=1: the user deletes the file; the FS marks the blocks free in its own structures, but the SSD's state is unchanged (it still considers LBAs 1000-1100 valid)
- T=2: a new file is written to different LBAs (2000-2100); the SSD maps it to fresh pages in Block 6
- T=3: SSD internal GC scans Block 5, sees "all pages valid," and copies every page, wasting bandwidth. Result: deleted data is copied around pointlessly

With TRIM:

- T=0: same starting state
- T=1: on delete, the FS marks the blocks free and issues TRIM for LBAs 1000-1100; the SSD marks those pages INVALID
- T=2: the new file goes to fresh pages in Block 6, as before
- T=3: GC scans Block 5, finds all pages invalid, skips the copy, and erases the block whenever convenient. Result: no wasted copying, efficient space reclaim

TRIM in F2FS: the GC process (1) selects a victim section (2 MB), (2) copies live blocks to a new location, (3) updates NAT entries, (4) issues TRIM for the entire section range, and (5) marks the section free in the SIT.

```shell
# Mount options for TRIM control:
mount -o discard /dev/nvme0n1p1 /mnt      # real-time discard
mount -o nodiscard /dev/nvme0n1p1 /mnt    # batch discard via fstrim

# Periodic TRIM (recommended for some SSDs):
fstrim -v /mnt    # manually trim all free space
# or run fstrim weekly via a cron job / systemd timer
```

TRIM Strategies:
Real-time Discard (mount -o discard): TRIM is issued inline with every delete. Space is reclaimed immediately, but on some drives per-command TRIM latency slows deletion-heavy workloads.

Batched Discard (periodic fstrim): mount with nodiscard and run fstrim on a schedule (e.g., weekly). TRIMs arrive in large, efficient batches, at the cost of delayed reclaim.

Hybrid Approach: use scheduled fstrim as the baseline and enable real-time discard only where prompt reclaim matters, such as frequently recycled scratch space.
F2FS TRIM Behavior:

F2FS sends TRIM at section granularity during GC: once a victim section's live blocks have been migrated, the entire section range (an erase-block-sized region) is discarded in one command, telling the SSD that the whole block is invalid at once.

Checkpoints also trigger discard processing: segments freed since the last checkpoint are trimmed as part of writing the checkpoint, keeping the SSD's view of free space current.
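The GC-copy savings from TRIM can be illustrated with a toy model (invented names, simplified liveness tracking): GC must copy every page it still believes is live before erasing a block, and TRIM is what lets the SSD know that deleted pages are actually dead.

```python
# Toy model of TRIM's effect on SSD GC. GC must copy every page it still
# believes is live before erasing a block; TRIM tells the SSD that deleted
# pages are dead. Names and sizes are illustrative.

def gc_copies(block_pages, deleted_pages, trimmed):
    """Number of pages GC must copy out of a block before erasing it."""
    live = set(block_pages)
    if trimmed:
        live -= deleted_pages   # TRIM marked these invalid inside the SSD
    return len(live)

block = set(range(100, 108))     # 8 pages in one erase block
deleted = set(range(100, 108))   # the user deleted the file using them

print(gc_copies(block, deleted, trimmed=False))  # 8 - dead data copied pointlessly
print(gc_copies(block, deleted, trimmed=True))   # 0 - block erasable directly
```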
Some SSDs use TRIM to improve wear leveling by avoiding writes to frequently-erased blocks. Without TRIM, the SSD may concentrate erases on a subset of blocks, accelerating wear-out. Regular TRIM is important for SSD longevity, not just performance.
Total write amplification (WA) on an SSD system combines two multiplicative levels: the file system's own WA and the SSD's internal WA.
Minimizing TOTAL WA requires optimizing both levels.
Write Amplification Formula:
Total_WA = FS_WA × SSD_WA
Where:
FS_WA = (FS bytes written to device) / (application bytes written)
SSD_WA = (flash bytes actually written) / (bytes received from FS)
Total_WA = (flash bytes written) / (application bytes)
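Plugging illustrative numbers into these formulas shows why the multiplicative structure matters: a modest FS-level WA combined with a high SSD-internal WA is far worse than the reverse. The figures below are hypothetical, not measurements.

```python
# The multiplicative WA relationship from the formulas above, with
# illustrative (not measured) numbers.

def total_wa(fs_wa, ssd_wa):
    """Total write amplification = FS-level WA x SSD-internal WA."""
    return fs_wa * ssd_wa

# Traditional FS: low FS-level WA, but random writes cause high SSD WA:
print(round(total_wa(1.2, 5.0), 2))   # 6.0
# LFS: batching raises FS-level WA slightly, but SSD WA stays near 1:
print(round(total_wa(1.5, 1.1), 2))   # 1.65
```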
| Source | Traditional FS (ext4) | Log-Structured (F2FS) | Reduction |
|---|---|---|---|
| Metadata updates | In-place (high SSD WA) | Append to log (low SSD WA) | 2-5x |
| File modification | Read-modify-write | Append new version | 1.5-3x |
| Journal writes | Separate journal area | Integrated in log | 1.2-2x |
| FS garbage collection | N/A (in-place) | Segment cleaning | -1x to +2x* |
| SSD internal GC | Random pattern = high | Sequential = low | 2-10x |
| Overall typical | 3-10x | 1.5-3x | ~3x improvement |
*Note: FS GC can add write amplification, but by coordinating with SSD, it often reduces TOTAL amplification by enabling efficient SSD GC.
F2FS Techniques for WA Reduction:
Inline data: Store small files (≤ 3.4 KB) inside inode
Inline dentry: Store small directories inside inode
Atomic writes: Combine related updates
Hot/cold separation: Temperature-based logging
IPU (In-Place Update) for specific cases: when the device is nearly full or for small synchronous rewrites, F2FS can fall back to updating a block in place instead of appending, trading some flash-friendliness for reduced GC pressure.
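The payoff of temperature separation can be sketched with a toy model (illustrative, not F2FS's actual policy): hot blocks are soon overwritten and die, cold blocks stay live, and what matters for GC cost is how the two are packed into segments.

```python
# Toy model of hot/cold separation (illustrative, not F2FS's actual policy).
# Hot blocks are overwritten soon after being written (they "die"); cold
# blocks stay live. GC cost depends on how much live data victims hold.

SEG_SIZE = 4

def fill_segments(blocks):
    """Pack blocks into segments of SEG_SIZE in write order."""
    return [blocks[i:i + SEG_SIZE] for i in range(0, len(blocks), SEG_SIZE)]

def live_fraction(segment):
    """Fraction of a segment still live (hot blocks have since died)."""
    return sum(1 for b in segment if b == "cold") / len(segment)

mixed     = fill_segments(["hot", "cold"] * 4)                 # interleaved writes
separated = fill_segments(["hot"] * 4) + fill_segments(["cold"] * 4)

print([live_fraction(s) for s in mixed])      # [0.5, 0.5] - any victim is half live
print([live_fraction(s) for s in separated])  # [0.0, 1.0] - free one, skip the other
```

With mixing, every GC victim forces copying half a segment of cold data; with separation, the hot segment frees itself and the cold segment never needs cleaning.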
Write Amplification Optimization Techniques:

Inline data: a small file (< 3.4 KB) normally requires two block writes, the inode block (holding a pointer) plus a separate 4 KB data block. With inline data, the content is stored inside the inode block itself, so only one block is written: a 50% WA reduction for small files.

Atomic write (an F2FS feature): a database transaction (e.g., UPDATE table SET col=val WHERE id=123) normally involves writing the data page, writing a log record, an fsync, writing metadata, and a second fsync: multiple small writes and multiple fsyncs, with heavy SSD random-write overhead. With F2FS atomic writes, the application starts an atomic operation, buffers all its writes, and commits with a single fsync that lands as one large sequential segment write:

```c
ioctl(fd, F2FS_IOC_START_ATOMIC_WRITE);
/* ... perform writes ... */
ioctl(fd, F2FS_IOC_COMMIT_ATOMIC_WRITE);
```

Hot/cold separation impact: consider 1 GB of hot data and 9 GB of cold data. Without separation, segments end up roughly 50% hot and 50% cold, GC constantly re-copies cold data, FS WA from GC is around 3×, and combined with SSD WA the total reaches roughly 6-10×. With F2FS's 6-log separation, hot data lands in the HOT_DATA log and dies together (segments drop toward 0% live), cold data sits in the COLD_DATA log at ~100% live and is rarely touched by GC, FS WA from GC falls to around 1.2×, and the total to around 2×: a 3-5× reduction in flash wear.

SSDs reserve 7-28% of capacity as over-provisioning for GC and wear leveling. Log-structured file systems should leave additional free space (~10-20%) beyond this.
Total reserved space (SSD OP + FS reserve) of 25-40% enables optimal performance and longevity for write-heavy workloads.
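The arithmetic behind that guidance, as a small sketch (the percentages come from the ranges quoted above; they are planning numbers, not measurements):

```python
# Illustrative arithmetic for total reserved flash: SSD over-provisioning
# is a fraction of raw capacity, while FS free space is a fraction of the
# user-visible capacity. Percentages follow the guidance above.

def total_reserve(ssd_op, fs_free):
    """Fraction of raw flash kept free overall."""
    # User-visible capacity is raw * (1 - ssd_op), so FS free space
    # contributes fs_free * (1 - ssd_op) of the raw capacity.
    return ssd_op + fs_free * (1 - ssd_op)

print(round(total_reserve(0.10, 0.20), 3))  # 0.28  -> ~28% of raw flash reserved
print(round(total_reserve(0.28, 0.15), 3))  # 0.388 -> ~39% reserved
```

Both examples fall inside the 25-40% band recommended for write-heavy workloads.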
Deploying log-structured file systems on SSDs requires attention to configuration, monitoring, and workload-specific tuning.
Recommended F2FS Mount Options for SSDs:
F2FS SSD Deployment Configuration:

```shell
# Create F2FS with SSD-optimized settings:
#   -s 1 : 1 segment per section (2 MB section = common erase block size)
#   -z 1 : 1 section per zone
mkfs.f2fs -f \
  -O extra_attr,inode_checksum,sb_checksum,compression \
  -s 1 -z 1 \
  /dev/nvme0n1p1

# For larger erase-block SSDs (e.g., 4 MB):
mkfs.f2fs -s 2 /dev/nvme0n1p1
```

Mount options:

```shell
# Basic SSD-optimized mount:
mount -t f2fs -o noatime,discard /dev/nvme0n1p1 /mnt/data

# General purpose (desktop/laptop):
#   noatime           reduce metadata writes
#   discard           real-time TRIM
#   background_gc=on  enable background GC
#   gc_merge          merge GC I/O with regular I/O
mount -t f2fs -o noatime,discard,background_gc=on,gc_merge \
  /dev/nvme0n1p1 /mnt/data

# Database server (high random write; nobarrier only if battery backup exists):
mount -t f2fs \
  -o noatime,nodiratime,discard,active_logs=6,gc_merge,fsync_mode=nobarrier \
  /dev/nvme0n1p1 /mnt/data

# Mobile/embedded (limited RAM):
mount -t f2fs -o noatime,background_gc=sync,gc_merge \
  /dev/nvme0n1p1 /mnt/data
```

Monitoring:

```shell
# F2FS statistics (sysfs). Key metrics:
#   gc_call       total GC invocations
#   gc_fg_calls   foreground GC (should be rare)
#   gc_data_blks  data blocks moved by GC (a write amplification indicator)
#   dirty_*       dirty segments per temperature zone
cat /sys/fs/f2fs/<device>/stat

# Write amplification estimate: gc_data_blks / total_data_written

# Free space (keep above 10-20% for optimal performance):
df -h /mnt/data

# SMART data for SSD health:
smartctl -a /dev/nvme0n1 | grep -E '(Wear|Written|Available)'
```

Common Deployment Mistakes:
- Running the disk too full: keep at least 10-20% free.
- Ignoring TRIM: mounting without discard and never running fstrim. Fix: use the discard mount option or schedule periodic fstrim.
- Misaligned partitions: partitions not aligned to the erase block. Fix: use parted with proper alignment, or let the installer handle it.
- Wrong FS for the workload: using F2FS for a read-heavy workload where LFS offers little benefit.
- Over-aggressive fsync: too-frequent syncs defeat LFS write batching.
| Workload | Best FS Choice | Key Configuration | Why |
|---|---|---|---|
| Android phone | F2FS | Default settings | Optimized for flash, small files, app installs |
| Linux desktop SSD | F2FS or ext4 | F2FS for write-heavy | ext4 mature; F2FS for browser/dev |
| Database server | F2FS with tuning | Active logs, GC tuning | Random writes benefit from LFS |
| NAS/File server | ext4 or XFS | Large reads dominate | LFS overhead not justified |
| Build server | F2FS | Fast for many small files | Compiler creates/deletes many files |
| Container storage | F2FS or overlay2 | F2FS for base layer | Image layers + container writes |
Generic recommendations are starting points. Use fio to simulate your actual workload on ext4 vs. F2FS. Measure throughput, latency, and (if possible) SSD SMART data for writes over time. Real workload testing beats theoretical optimization.
Log-structured file systems have found their ideal medium in flash storage. The fundamental properties of LFS—sequential writes, append-only updates, and coordinated garbage collection—align perfectly with flash memory's constraints and strengths.
The Future: Zoned Storage and Beyond:
Emerging Zoned Namespace (ZNS) SSDs expose log-structured interfaces directly:

- The drive is divided into zones that must be written sequentially and reset as a unit
- Much of the FTL's remapping and internal GC disappears; the host's log becomes the on-flash layout
- F2FS already supports zoned block devices, mapping its sections onto zones
As storage technology evolves toward persistent memory and new non-volatile media, the principles of log-structured storage—sequential writes, append-only modifications, and explicit garbage collection—will remain foundational to efficient storage system design.
Congratulations! You've completed the comprehensive study of log-structured file systems. From the foundational LFS concept through sequential writes, garbage collection, segment cleaning, and SSD optimization, you now possess deep understanding of this revolutionary storage paradigm. This knowledge forms the basis for working with modern storage systems, databases (LSM trees), and distributed storage infrastructure.