In the previous page, we established that log-structured file systems convert random writes to sequential writes. But why is sequential access so powerful? And how does LFS actually achieve this transformation?
The answer lies in understanding the fundamental physics of storage devices—and how LFS exploits that physics to deliver performance that seems almost magical: taking workloads that would cripple traditional file systems and running them at near-disk-bandwidth speeds.
This page dives deep into the mechanics of sequential writes in LFS: how segment buffers work, why size matters, how LFS handles different types of data, and the engineering decisions that determine real-world performance.
By the end of this page, you will understand the physics behind sequential vs. random I/O performance, how LFS segment buffers transform scattered writes into sequential bursts, the trade-offs involved in segment sizing, and how modern implementations handle multi-stream logging for different data types.
To understand why sequential access dominates random access in performance, we need to examine how storage devices physically operate.
Hard Disk Drives (HDDs):
An HDD consists of magnetic platters spinning at 5,400 to 15,000 RPM, with read/write heads mounted on actuator arms. To access data, the drive must:

1. Seek: move the head assembly to the correct track
2. Rotate: wait for the target sector to spin under the head
3. Transfer: read or write the data as it passes beneath the head

The approximate time for each phase on a typical 7,200 RPM drive:

- Seek: ~8 ms on average
- Rotational delay: ~4 ms (half a revolution)
- Transfer: ~0.03 ms for a 4 KB block
For a 4 KB random read: Total ≈ 12 ms = ~83 IOPS (I/O operations per second)
For 4 KB sequential reads (no seek, no rotational delay between blocks): Transfer rate ≈ 150 MB/s = ~37,500 4KB blocks/second
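These figures are simple to reproduce. The sketch below recomputes the random and sequential numbers from the per-phase times above; the constants are the illustrative values used on this page, not measurements of any particular drive.

```python
# Back-of-the-envelope HDD model using the illustrative numbers above.
SEEK_MS = 8.0          # average seek time
ROTATION_MS = 4.0      # average rotational delay (half a revolution at 7,200 RPM)
TRANSFER_MB_S = 150.0  # sustained sequential transfer rate
BLOCK_KB = 4

transfer_ms = (BLOCK_KB / 1024) / TRANSFER_MB_S * 1000   # ~0.026 ms per 4 KB block

random_ms = SEEK_MS + ROTATION_MS + transfer_ms          # every block pays full positioning cost
sequential_ms = transfer_ms                              # head keeps moving; no positioning cost

random_iops = 1000 / random_ms
sequential_blocks_per_s = 1000 / sequential_ms

print(f"Random 4 KB:     {random_ms:.1f} ms/block -> {random_iops:.0f} IOPS, "
      f"{random_iops * BLOCK_KB / 1024:.2f} MB/s")
print(f"Sequential 4 KB: {sequential_ms:.3f} ms/block -> {sequential_blocks_per_s:.0f} blocks/s, "
      f"{sequential_blocks_per_s * BLOCK_KB / 1024:.0f} MB/s")
print(f"Speedup: ~{random_ms / sequential_ms:.0f}x")
```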
| Access Pattern | Operations/Second | Throughput | Relative Performance |
|---|---|---|---|
| Random 4KB Read | ~100 IOPS | ~0.4 MB/s | 1x (baseline) |
| Sequential 4KB Read | ~37,500 blocks/s | ~150 MB/s | 375x faster |
| Random 4KB Write | ~80 IOPS | ~0.3 MB/s | 0.8x |
| Sequential 4KB Write | ~35,000 blocks/s | ~140 MB/s | ~350x (≈440x vs. random write) |
Sequential writes can be 400x faster than random writes on HDDs. This isn't a software limitation—it's physics. The only way to close this gap is to eliminate seeks and rotational delays by keeping the head moving forward.
Solid State Drives (SSDs):
SSDs have no moving parts, eliminating seek and rotational latency. However, they have their own characteristics that favor sequential access:

- Flash can only be erased in large blocks (typically 128 KB to 4 MB), so scattered small writes force the flash translation layer (FTL) to copy and consolidate data, increasing write amplification
- Sequential writes fill erase blocks cleanly and map well onto the SSD's internal channel and die parallelism
- Large, queued writes amortize per-command overhead in the controller and host interface
While the sequential advantage is smaller for SSDs (~5-10x vs. 400x for HDDs), it remains significant—especially for write-heavy workloads.
Storage Performance Characteristics:

```
HDD (7200 RPM, 150 MB/s sequential):

  Random 4KB write:
    ├─ Seek:     8 ms
    ├─ Rotation: 4 ms
    ├─ Transfer: 0.03 ms
    └─ Total:    ~12 ms → 83 IOPS → 0.33 MB/s

  Sequential 4KB writes:
    ├─ Seek:     0 ms   (head already positioned)
    ├─ Rotation: 0 ms   (next sector immediately available)
    ├─ Transfer: 0.03 ms
    └─ Total:    ~0.03 ms → 35,000 IOPS → 140 MB/s

  Performance ratio: 140 / 0.33 ≈ 424x

SSD (NVMe, 3000 MB/s sequential):

  Random 4KB write (QD=1):
    ├─ Latency: ~50-100 μs
    └─ IOPS:    10,000-20,000 → ~50-80 MB/s

  Sequential 4KB writes (QD=32):
    ├─ Latency:    ~10-20 μs per operation (amortized)
    └─ Throughput: 500,000+ IOPS → 2,000-3,000 MB/s

  Performance ratio: 3000 / 60 = 50x
  (Still significant, even without mechanical delays!)
```

The Fundamental Insight:
Storage devices are optimized for sequential access at every level:

- HDDs lay sectors out along tracks so that a head that keeps moving forward never pays seek or rotational cost
- Drive and controller firmware prefetch and write-cache aggressively for streaming patterns
- SSDs erase flash in large blocks and stripe sequential writes across channels and dies
- Operating system I/O schedulers merge adjacent requests into larger ones
LFS aligns with this reality by never asking the storage device to do what it does poorly: scattered random writes. Instead, LFS accumulates writes in memory and issues them in large, sequential bursts—exactly what storage devices do best.
The segment buffer is the central mechanism that transforms scattered application writes into sequential disk writes. Understanding its operation is key to understanding LFS performance.
How the Segment Buffer Works:
1. Accumulation Phase: Application writes (data blocks, inodes, directory updates) are collected in a memory buffer
2. Segment Construction: When the buffer is full (or a sync is requested), all pending writes are organized into a segment
3. Sequential Write: The entire segment is written to disk in one contiguous operation
4. Imap Update: The inode map is updated to reflect new block locations
This batching converts what might be hundreds of scattered 4 KB writes into a single multi-megabyte sequential write.
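A minimal sketch of this mechanism in Python is shown below. The class and callback names are invented for illustration (this is not code from any LFS implementation); it only demonstrates the accumulate-then-flush pattern described above.

```python
# Minimal sketch of a segment buffer: accumulate small writes in memory,
# then emit them as one large sequential write. Names are illustrative.

class SegmentBuffer:
    def __init__(self, segment_size=4 * 1024 * 1024, flush_fn=None):
        self.segment_size = segment_size       # e.g. 4 MB segments
        self.pending = []                      # (kind, payload) pairs: data, inode, imap, ...
        self.used = 0
        self.flush_fn = flush_fn               # callback that performs ONE sequential write

    def add(self, kind, payload):
        """Buffer a block (file data, inode, directory update, ...)."""
        if self.used + len(payload) > self.segment_size:
            self.flush()                       # buffer full: write a whole segment
        self.pending.append((kind, payload))
        self.used += len(payload)

    def flush(self):
        """Build a segment image and hand it to the device as one write."""
        if not self.pending:
            return
        summary = f"SUMMARY:{len(self.pending)} blocks".encode()
        segment = summary + b"".join(p for _, p in self.pending)
        if self.flush_fn:
            self.flush_fn(segment)             # single sequential write to the log head
        self.pending.clear()
        self.used = 0


# Usage: hundreds of scattered 4 KB writes become one multi-megabyte write.
buf = SegmentBuffer(flush_fn=lambda seg: print(f"sequential write of {len(seg)} bytes"))
for _ in range(100):
    buf.add("data", bytes(4096))               # application writes
buf.add("inode", bytes(256))                   # metadata rides along in the same segment
buf.flush()                                    # e.g., on sync or when the buffer fills
```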
Segment Buffer Operation:

```
Time T1: Application writes file A (4 KB)

  SEGMENT BUFFER (in memory)
  ┌─────────────┐
  │ File A data │  (4 KB)
  └─────────────┘
  Buffer usage: 4 KB / 4 MB = 0.1%

Time T2: Application writes file B (8 KB), updates directory

  SEGMENT BUFFER (in memory)
  ┌─────────────┬─────────────┬─────────────┬─────────────┐
  │ File A data │ File B data │ File B data │ Dir update  │
  │   (4 KB)    │ (4 KB pg1)  │ (4 KB pg2)  │   (4 KB)    │
  └─────────────┴─────────────┴─────────────┴─────────────┘
  Buffer usage: 16 KB / 4 MB = 0.4%

  ... more writes accumulate ...

Time T100: Buffer full, trigger segment write

  SEGMENT BUFFER (ready to flush)
  [Seg Summary][Data A][Data B][Data C]...[Inode 1][Inode 2][Inode 3][Imap Block 1][Imap Block 2]
  Buffer usage: 4 MB / 4 MB = 100%

  FLUSH TO DISK → single 4 MB sequential write
    HDD: 4 MB at 140 MB/s  = ~29 ms
    SSD: 4 MB at 2000 MB/s = ~2 ms

CONTRAST with a traditional FS (same 100 write operations):

  Each write immediately goes to disk:
    - Write data block  → seek + write
    - Update inode      → seek + write
    - Update directory  → seek + write
    - Update bitmap     → seek + write

  100 operations × 4 disk writes × 12 ms per seek ≈ 4,800 ms

  LFS: ~29 ms vs. traditional: ~4,800 ms ≈ 165x faster
```

Segment Buffer Contents:
A segment isn't just a random collection of blocks. It's carefully organized for both write efficiency and future read performance:
- Segment Summary Block (beginning): records, for every block in the segment, which inode it belongs to and at which file offset, plus a checksum used during crash recovery
- Data Blocks: the file contents written by applications, grouped by file
- Inode Blocks: the updated inodes for the files whose data appears in the segment
- Inode Map Updates: the pieces of the imap that record the new on-disk locations of those inodes
- Segment Trailer (optional, some implementations): a timestamp or checksum marking the segment as completely written
When reading a file, we need both the inode (for metadata) and the data blocks. By placing a file's inode immediately after its data in the segment, LFS ensures that reading a recently-written file requires minimal disk seeks—the inode and data are physically adjacent.
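As a rough mental model of those contents (the field names below are assumptions for exposition, not the on-disk format of any particular file system):

```python
# Illustrative in-memory model of a segment's layout. Field names are
# assumptions for exposition, not an actual on-disk format.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class SegmentSummary:
    # For every block in the segment: which inode owns it and at what file offset.
    block_owners: List[Tuple[int, int]]   # e.g. [(inode_no, file_offset), ...]
    checksum: int = 0                     # lets recovery detect a partially written segment

@dataclass
class Segment:
    summary: SegmentSummary
    data_blocks: List[bytes]              # file contents, grouped per file
    inodes: Dict[int, bytes]              # updated inodes, placed right after their data
    imap_updates: Dict[int, int]          # inode number -> new on-disk inode address
    trailer_timestamp: float = 0.0        # optional trailer marking a complete segment
```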
The choice of segment size profoundly affects LFS performance. There's no universally optimal size—the right choice depends on workload characteristics, hardware, and system requirements.
The Trade-offs:

- Larger segments amortize positioning overhead across more data, so throughput approaches the device's sequential bandwidth; but they consume more memory, hold more unwritten data at risk across a crash, and raise the latency of syncs that must flush a partial segment
- Smaller segments flush sooner and keep fsync() latency low, but each write pays a larger fraction of its time in seek and rotational overhead, and the cleaner has more segments to track
Mathematical Analysis:
To achieve a target percentage of maximum disk bandwidth, segment size must exceed a threshold. Consider an HDD with:

- Positioning time (seek + rotation) T_seek = 10 ms
- Maximum transfer rate R_max = 100 MB/s
To achieve 90% of maximum throughput:
```
Efficiency = Segment_Size / (Segment_Size + T_seek × R_max)

  0.90 = S / (S + 0.01 s × 100 MB/s)
  0.90 = S / (S + 1 MB)
     S = 0.90 × (S + 1 MB)
     S = 0.90 S + 0.9 MB
0.10 S = 0.9 MB
     S = 9 MB
```
For 90% efficiency on this HDD, segments must be at least 9 MB.
For 99% efficiency: approximately 99 MB segments required.
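Solving the efficiency equation for S gives S = E / (1 − E) × T_seek × R_max. The short sketch below (the function name is illustrative) reproduces the numbers above and the table that follows:

```python
# Required segment size S for a target fraction E of peak bandwidth:
#   E = S / (S + T_seek * R_max)   =>   S = E / (1 - E) * T_seek * R_max

def required_segment_mb(efficiency, seek_s=0.010, rate_mb_s=100.0):
    overhead_mb = seek_s * rate_mb_s          # bandwidth "lost" per positioning, in MB
    return efficiency / (1.0 - efficiency) * overhead_mb

for e in (0.50, 0.80, 0.90, 0.99):
    print(f"{e:.0%} efficiency -> segment >= {required_segment_mb(e):.0f} MB")
# 50% -> 1 MB, 80% -> 4 MB, 90% -> 9 MB, 99% -> 99 MB
```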
| Segment Size | Bandwidth Utilization | Effective Throughput (of 100 MB/s peak) | Use Case |
|---|---|---|---|
| 512 KB | 34% | 34 MB/s | Interactive, low-latency |
| 1 MB | 50% | 50 MB/s | Balanced workloads |
| 4 MB | 80% | 80 MB/s | Typical LFS default |
| 16 MB | 94% | 94 MB/s | High throughput |
| 64 MB | 98% | 98 MB/s | Batch/archive workloads |
For SSDs, the optimal segment size is often determined by flash erase block size (128 KB - 4 MB) rather than seek time. Aligning segment size with erase block size minimizes write amplification within the SSD's flash translation layer.
Common Segment Sizes in Practice:

- The original Sprite LFS used segments of 512 KB to 1 MB
- F2FS uses 2 MB segments (512 blocks of 4 KB), optionally grouped into larger sections for allocation and cleaning
- NILFS2 defaults to 8 MB segments
Dynamic Segment Sizing:
Some advanced implementations adjust segment size dynamically based on:

- Sync frequency: sync-heavy workloads favor smaller segments to bound fsync() latency
- Free-space pressure: as the disk fills, smaller segments can ease garbage collection
- Device characteristics: erase-block size on SSDs, track and zone geometry on HDDs and SMR drives
- The workload's current throughput vs. latency target

A hypothetical policy along these lines is sketched below.
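Purely as an illustration—no real file system uses exactly this policy—a dynamic sizing heuristic might look like:

```python
# Hypothetical policy sketch: shrink segments when syncs are frequent (to keep
# fsync latency low), grow them when the workload is throughput-bound.
def choose_segment_size(sync_rate_hz, free_space_frac,
                        min_mb=1, default_mb=4, max_mb=16):
    if sync_rate_hz > 50:            # sync-heavy: favor latency over bandwidth
        return min_mb
    if free_space_frac < 0.15:       # low free space: smaller segments ease cleaning
        return default_mb
    return max_mb                    # streaming/batch: amortize positioning cost
```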
Writing data sequentially sounds simple, but ensuring consistency requires careful attention to write ordering. LFS must guarantee that the file system remains valid even if a crash interrupts a segment write.
The Ordering Problem:
Consider what's in a segment:

- A segment summary block (header) describing everything that follows
- Data blocks containing file contents
- Inode blocks referencing those data blocks
- Imap blocks recording where the new inodes live
If the disk reorders these writes and crashes mid-segment:

- The imap could point to inodes that never reached the disk
- Inodes could point to data blocks that were never written
- The summary could describe blocks that are missing or stale
Any of these could corrupt the file system.
Safe Write Ordering in LFS:

```
SEGMENT STRUCTURE (issued as one sequential write):

  ┌────────────┬────────────────┬──────────┬──────────┐
  │ Seg Summary│  DATA BLOCKS   │  INODES  │   IMAP   │
  │  (header)  │  (file data)   │(metadata)│ (index)  │
  └────────────┴────────────────┴──────────┴──────────┘
        │              │             │           │
        ▼              ▼             ▼           ▼
     Write 1        Write 2       Write 3     Write 4
     (first)        (second)      (third)     (last)

WHY THIS ORDER WORKS:

  1. Header written first (but not yet referenced by anything)
     - Contains checksums; can detect partial writes

  2. Data blocks written (but not yet reachable)
     - Inodes don't point here yet
     - Safe: orphan data just wastes space

  3. Inodes written (but not yet reachable)
     - Imap doesn't point here yet
     - Inodes reference data blocks (which are already written)
     - Safe: orphan inodes just waste space

  4. Imap blocks written last
     - Now inodes become reachable
     - Inodes point to valid data blocks
     - Safe: either old imap (pre-write) or new imap (post-write)

  5. Checkpoint update (separate operation)
     - Only after segment fully written
     - Points to updated imap blocks
     - Makes the entire segment visible atomically

CRASH SCENARIOS:

  Crash during step 2 (data blocks):
    → Recovery ignores partial segment (checksum fails)
    → Imap still points to old inode locations
    → Result: write is lost, but FS consistent ✓

  Crash during step 3 (inodes):
    → Recovery ignores partial segment
    → Imap points to old inodes (in previous segments)
    → Result: write is lost, but FS consistent ✓

  Crash during step 4 (imap):
    → Recovery uses previous checkpoint
    → Old imap blocks still valid
    → Result: write is lost, but FS consistent ✓

  Crash after step 4, before checkpoint:
    → Roll-forward recovery can recover this segment
    → Segment is complete; checkpoint update is atomic
    → Result: write may be recovered ✓
```

Checkpoint Atomicity:
The checkpoint region is updated only after a segment is completely written. This creates an atomic commit point:

- Before the checkpoint write, the on-disk state still describes the previous checkpoint: the new segment is invisible
- After the checkpoint write, all of the segment's inodes and data become reachable at once
- LFS keeps two checkpoint regions and alternates between them, each stamped with a timestamp, so a torn checkpoint write can be detected and the older valid checkpoint used instead
Handling Disk Write Reordering:
Modern disks may reorder writes for performance. To ensure ordering, LFS uses:

- Write barriers and cache-flush commands (e.g., FLUSH CACHE or FUA writes) between the segment write and the checkpoint update
- Checksums over the segment and checkpoint, so that a reordered or partially written segment is detected and discarded during recovery

A minimal sketch of this ordering follows.
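In the sketch below, a plain file stands in for the device and os.fsync() stands in for a cache-flush barrier; a real implementation would issue FLUSH/FUA commands at the block layer.

```python
import os
import zlib

def write_segment_then_checkpoint(dev, segment_bytes, checkpoint_bytes,
                                  seg_offset, cp_offset):
    """Write the segment, barrier, then publish it via the checkpoint."""
    # 1) Write the whole segment (summary + data + inodes + imap) sequentially.
    dev.seek(seg_offset)
    dev.write(segment_bytes)
    dev.flush()
    os.fsync(dev.fileno())          # barrier: the segment must be durable first

    # 2) Only now update the checkpoint so the new imap becomes reachable.
    dev.seek(cp_offset)
    dev.write(checkpoint_bytes)
    dev.flush()
    os.fsync(dev.fileno())          # barrier: this is the commit point

# Recovery side: a checksum over the segment lets a partially written or
# reordered segment be detected and ignored.
def segment_is_complete(segment_bytes, expected_crc):
    return zlib.crc32(segment_bytes) == expected_crc
```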
F2FS Approach:
F2FS (Flash-Friendly File System) uses a sophisticated technique: it maintains two checkpoint packs and alternates between them. Each checkpoint carries a version number and a checksum, so after a crash F2FS mounts from the most recent checkpoint that verifies, and can then roll forward over node blocks written by fsync() after that checkpoint.
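The selection logic can be sketched as follows; the dictionary fields are simplified stand-ins, not the actual F2FS on-disk checkpoint format:

```python
import zlib

def compute_crc(payload: bytes) -> int:
    return zlib.crc32(payload)

# Illustrative: pick the newest checkpoint whose checksum verifies.
# A log-structured FS that alternates between two checkpoint regions
# can always fall back to the older valid one after a torn write.
def pick_checkpoint(cp_a, cp_b):
    candidates = [cp for cp in (cp_a, cp_b)
                  if cp is not None and cp["crc"] == compute_crc(cp["payload"])]
    if not candidates:
        raise RuntimeError("no valid checkpoint found")
    return max(candidates, key=lambda cp: cp["version"])   # highest valid version wins
```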
Because LFS never modifies existing data in place, a crash can only result in lost writes—never corruption of existing data. This is fundamentally safer than update-in-place file systems where a crash during an in-place update can corrupt the original data.
Modern LFS implementations often use multiple logs rather than a single append point. This enables separation of data with different characteristics and better optimization for specific access patterns.
Why Multiple Logs?
Not all data is created equal:

- Hot data (directory blocks, inodes of actively modified files, frequently rewritten file blocks) is overwritten quickly, so its old copies die fast
- Warm data sees moderate, mixed access
- Cold data (archives, large media files written once) is rarely touched again
- Metadata (node) blocks and file data blocks also follow different update patterns
Mixing hot and cold data in the same segment leads to inefficient garbage collection: cleaning a segment with 10% live data still requires moving that 10%.
F2FS Multi-Log Architecture:

```
F2FS DISK LAYOUT:

  SUPERBLOCK / CHECKPOINT (fixed)
  SEGMENT INFO TABLE      (segment metadata + usage)

  MAIN AREA:
  ┌───────────────────┬───────────────────┬───────────────────┐
  │ HOT NODE LOG      │ WARM NODE LOG     │ COLD NODE LOG     │
  │ (hot metadata)    │ (warm metadata)   │ (cold metadata)   │
  ├───────────────────┼───────────────────┼───────────────────┤
  │ HOT DATA LOG      │ WARM DATA LOG     │ COLD DATA LOG     │
  │ (hot file data)   │ (warm file data)  │ (cold file data)  │
  └───────────────────┴───────────────────┴───────────────────┘

DATA SEPARATION STRATEGY:

  HOT  = frequently modified data
    ├── Direct blocks of frequently written files
    ├── Directory entries for active directories
    └── Inode blocks for actively modified files

  WARM = moderately accessed data
    ├── Data from files with mixed access patterns
    └── Default for most application data

  COLD = rarely modified data
    ├── Data from files not accessed recently
    ├── Archived or backup files
    └── Large media files (after initial write)

BENEFITS:

  1. Hot data segregated  → when GC runs on the hot area, most blocks are dead
  2. Cold data separate   → rarely needs GC, high utilization
  3. Parallel writing     → each log head can accept writes independently
  4. Flash optimization   → wear leveling per temperature zone
```

Data Classification Techniques:
File systems use various heuristics to classify data:
- Path and directory heuristics (e.g., /tmp is hot, /archive is cold)
- File name extensions (large media and archive formats are treated as cold)
- File age and recent update frequency
- Explicit application hints (e.g., posix_fadvise() or file-system-specific flags)
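A toy classifier along these lines might look like the following; the thresholds, extension list, and function name are invented for illustration, and real file systems combine several such signals:

```python
import os

COLD_EXTENSIONS = {".mp4", ".mkv", ".jpg", ".zip", ".tar", ".bak"}   # illustrative

def classify_temperature(path, size_bytes, user_hint=None):
    """Return 'hot', 'warm', or 'cold' for a new write. Purely a heuristic sketch."""
    if user_hint in ("hot", "warm", "cold"):
        return user_hint                          # explicit hint wins
    if path.startswith("/tmp/"):
        return "hot"                              # short-lived, rewritten often
    ext = os.path.splitext(path)[1].lower()
    if path.startswith("/archive/") or ext in COLD_EXTENSIONS:
        return "cold"                             # written once, rarely touched again
    if size_bytes > 64 * 1024 * 1024:
        return "cold"                             # large streaming files
    return "warm"                                 # default bucket
```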
NILFS2 Approach:

NILFS2 uses a single log but with a different strategy: every write, regardless of temperature, is appended to one continuous log, and each log write creates a checkpoint that can be promoted to a mountable snapshot. Rather than separating hot and cold data up front, NILFS2 relies on its userspace cleaner to reclaim space later, trading some garbage-collection efficiency for simplicity and continuous snapshots.
| File System | Log Structure | Classification Method | Optimization Target |
|---|---|---|---|
| F2FS | 6 logs (3 temps × 2 types) | Heuristic + hints | SSD wear leveling, GC efficiency |
| NILFS2 | Single log + snapshots | None (all mixed) | Continuous snapshots, simplicity |
| ZFS (ZIL) | Separate intent log | Sync vs async | Low-latency sync writes |
| Btrfs | Multiple allocation groups | Size + type | Concurrent allocation |
Multi-log architectures improve garbage collection efficiency but add complexity. The file system must track which log each block belongs to, decide classification in real-time, and handle data migration between logs. This complexity is justified for performance-critical systems like SSDs but may be overkill for simpler use cases.
LFS's segment buffering creates a tension with applications that require durability guarantees. When an application calls fsync(), it expects data to be safely on disk—not lingering in a memory buffer.
The Sync Challenge:
If we wait for the segment buffer to fill before writing, fsync() could take arbitrarily long. But if we write partial segments on every sync, we lose the sequential write advantage.
Solutions:
1. Partial Segment Writes: write whatever is in the buffer immediately, accepting poor write efficiency for that flush
2. Dedicated Sync Log: maintain a separate small log for sync-heavy workloads, keeping the main log's large segments intact
3. Sync Coalescing (segment aggregation): combine sync requests that arrive close together into a single flush

Each strategy is illustrated below.
LFS Sync Handling Strategies:

```
STRATEGY 1: Partial Segment Write
─────────────────────────────────────────────────────────────
  Timeline:
    T1: Write A (4 KB) → buffer
    T2: Write B (8 KB) → buffer
    T3: fsync() called!

  Buffer state: [A: 4 KB][B: 8 KB] = 12 KB   (buffer capacity: 4 MB)

  ACTION: flush a 12 KB partial segment immediately
    [Summary][A][B][Inodes][Imap] → disk   (12 KB + overhead)

  EFFICIENCY: 12 KB / 4 MB = 0.3% buffer utilization
  COST: full seek overhead for a small write

STRATEGY 2: Dedicated Sync Log (ZFS ZIL-style)
─────────────────────────────────────────────────────────────
  Dual-log architecture:

    MAIN LOG (large segments)         SYNC LOG (small, fast)
    [Seg N-1][Seg N ][ ...            [sync1][sync2][sync3]
     (4 MB)  (4 MB)                   (4 KB) (8 KB) (2 KB)

  Write flow:
    T1: Write A → buffer + sync log entry
    T2: Write B → buffer + sync log entry
    T3: fsync() → flush sync log only (fast, small)
    T4: (later) buffer fills → main log gets a full segment
    T5: sync log entries cleared (data now in main log)

  ADVANTAGE: fsync() latency low; main log throughput high

STRATEGY 3: Sync Coalescing
─────────────────────────────────────────────────────────────
  Multiple syncs consolidated:

    T1: App1 writes A, calls fsync() ─┐
    T2: App2 writes B, calls fsync()  ├── all within a 5 ms window
    T3: App3 writes C, calls fsync() ─┘

  Without coalescing: 3 separate segment writes (3 seeks)
  With coalescing:    1 segment write containing A, B, C (1 seek)

  Implementation:
    - Collect syncs for N milliseconds (e.g., 5-10 ms)
    - Flush once with all pending data
    - Wake all waiting threads
```
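Strategy 3 is easy to sketch. The snippet below batches fsync() callers that arrive within a short window behind a single flush; the structure and timings are illustrative, not taken from any particular implementation.

```python
import threading
import time

class SyncCoalescer:
    """Batch fsync() requests that arrive within a short window into one flush."""

    def __init__(self, flush_fn, window_ms=5):
        self.flush_fn = flush_fn              # writes the current (partial) segment once
        self.window = window_ms / 1000.0
        self.flushed = threading.Condition()
        self.generation = 0                   # incremented after every group flush

    def fsync(self):
        """Called by application threads; returns once the next group flush completes."""
        with self.flushed:
            target = self.generation + 1
            while self.generation < target:
                self.flushed.wait()

    def run_flusher(self):
        """Background loop: one segment write per window covers all pending syncs."""
        while True:
            time.sleep(self.window)
            self.flush_fn()                   # a single flush for everything buffered so far
            with self.flushed:
                self.generation += 1
                self.flushed.notify_all()

# Usage (illustrative): start run_flusher() in a daemon thread, then have
# application threads call coalescer.fsync() after buffering their writes.
# threading.Thread(target=coalescer.run_flusher, daemon=True).start()
```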
Database Workload Optimization:

Databases issue frequent fsync() calls for transaction durability. This is particularly challenging for LFS:

- Every commit can force a partial-segment flush, so segments shrink toward the size of a single transaction
- Tiny segments reintroduce per-write positioning overhead and multiply the number of segments the cleaner must later process
- The sequential-write advantage erodes on exactly the workload that needs it most
Group Commit:
Many databases implement "group commit"—collecting multiple transaction commits and issuing a single fsync() for all. This aligns perfectly with LFS:

- The commits accumulate in one segment buffer rather than forcing several partial segments
- A single flush amortizes the per-segment overhead across every transaction in the group
- The database's batching window and the file system's sync coalescing reinforce each other
Combined with LFS segment batching, this achieves high transaction throughput with durability.
Applications can significantly improve LFS performance by batching their own writes before calling fsync(), using O_DSYNC instead of fsync() when metadata durability isn't needed, and using posix_fadvise() to hint at access patterns. Well-designed applications and LFS can achieve remarkable synergy.
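From the application side, the batching pattern might look like this sketch using POSIX calls via Python's os module (os.posix_fadvise and os.fdatasync are available on Linux; the function name is illustrative):

```python
import os

def append_records_durably(path, records):
    """Batch many small writes, then pay the durability cost once."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        # Hint that we will write sequentially and not re-read soon.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)

        os.write(fd, b"".join(records))   # one large write instead of many tiny ones
        os.fdatasync(fd)                  # one durability point for the whole batch
        # Alternative: open with os.O_DSYNC to make each write itself durable for
        # data without forcing unrelated metadata (timestamps, etc.) to disk.
    finally:
        os.close(fd)
```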
Let's examine how LFS sequential write strategy performs under different workloads compared to traditional file systems.
Workload Comparison on HDDs (ext4 vs. LFS):
| Workload | ext4 | LFS | Advantage | Why |
|---|---|---|---|---|
| Small random writes (4KB × 10000) | 1x | 50-100x | LFS | Batching eliminates seeks |
| Large sequential (1GB file) | 1x | 0.95-1.05x | Similar | Both sequential; LFS has slight overhead |
| File creation burst (1000 files) | 1x | 20-50x | LFS | Metadata + data batched together |
| Sync-heavy (1000 fsyncs) | 1x | 0.5-2x | Depends | Sync coalescing helps; frequent syncs hurt |
| 90% disk full state | 0.8x | 0.3-0.5x | ext4 | LFS GC overhead dominates |
SSD Performance Characteristics:
On SSDs, the performance landscape changes:
| Workload | ext4 | F2FS | Advantage | Notes |
|---|---|---|---|---|
| Small random writes | 1x | 2-5x | F2FS | Reduces SSD write amplification |
| Large sequential | 1x | 0.9-1.1x | Similar | Both saturate SSD bandwidth |
| File creation burst | 1x | 3-8x | F2FS | Metadata batching still helps |
| Sustained random (hours) | 1x | 1.5-3x | F2FS | SSD GC benefits from log structure |
| Mixed read/write | 1x | 0.8-1.2x | Depends | Read fragmentation can hurt F2FS |
LFS performance degrades significantly as the disk fills up. With less free space, garbage collection runs more frequently and must work harder to find clean segments. Best practice: keep 10-20% free space on LFS volumes. This issue is explored in detail in the Garbage Collection page.
Real-World Benchmarks:
The Samsung F2FS paper (FAST 2015) reported that F2FS outperformed ext4 on mobile devices, with the largest gains on random-write and sync-heavy workloads such as SQLite, and roughly comparable performance on large sequential transfers.
NILFS2 evaluations report a similar qualitative pattern, with the benefits concentrated in write-heavy workloads.
These results demonstrate that LFS's sequential write strategy delivers substantial real-world benefits, particularly for write-intensive workloads on both HDDs and SSDs.
We've explored the mechanics of sequential writes in LFS and how this fundamental design decision transforms file system performance.
What's Next:
Sequential writes solve the performance problem but create a new one: garbage collection. Since we never update in place, old versions of data accumulate and consume space. The next page explores how LFS reclaims this space through garbage collection—a critical and challenging aspect of log-structured storage.
You now understand how LFS achieves near-optimal write performance through sequential I/O. This knowledge is foundational for understanding garbage collection challenges and why segment cleaning strategies matter for sustained LFS performance.