In the previous page, we established that log-structured file systems convert random writes to sequential writes. But why is sequential access so powerful? And how does LFS actually achieve this transformation?
The answer lies in understanding the fundamental physics of storage devices—and how LFS exploits that physics to deliver performance that seems almost magical: taking workloads that would cripple traditional file systems and running them at near-disk-bandwidth speeds.
This page dives deep into the mechanics of sequential writes in LFS: how segment buffers work, why size matters, how LFS handles different types of data, and the engineering decisions that determine real-world performance.
By the end of this page, you will understand the physics behind sequential vs. random I/O performance, how LFS segment buffers transform scattered writes into sequential bursts, the trade-offs involved in segment sizing, and how modern implementations handle multi-stream logging for different data types.
To understand why sequential access dominates random access in performance, we need to examine how storage devices physically operate.
Hard Disk Drives (HDDs):
An HDD consists of magnetic platters spinning at 5,400 to 15,000 RPM, with read/write heads mounted on actuator arms. To access data, the drive must:

1. Seek: move the head assembly to the correct track
2. Rotate: wait for the target sector to spin under the head
3. Transfer: read or write the data as it passes beneath the head

The approximate time for each phase on a typical 7,200 RPM drive:

- Seek: ~8 ms on average
- Rotational delay: ~4 ms (half a revolution)
- Transfer: ~0.03 ms for a 4 KB block
For a 4 KB random read: Total ≈ 12 ms = ~83 IOPS (I/O operations per second)
For 4 KB sequential reads (no seek, no rotational delay between blocks): Transfer rate ≈ 150 MB/s = ~37,500 4KB blocks/second
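These figures are simple to reproduce. The sketch below recomputes the random and sequential numbers from the per-phase times above; the constants are the illustrative values used on this page, not measurements of any particular drive.

```python
# Back-of-the-envelope HDD model using the illustrative numbers above.
SEEK_MS = 8.0          # average seek time
ROTATION_MS = 4.0      # average rotational delay (half a revolution at 7,200 RPM)
TRANSFER_MB_S = 150.0  # sustained sequential transfer rate
BLOCK_KB = 4

transfer_ms = (BLOCK_KB / 1024) / TRANSFER_MB_S * 1000   # ~0.026 ms per 4 KB block

random_ms = SEEK_MS + ROTATION_MS + transfer_ms          # every block pays full positioning cost
sequential_ms = transfer_ms                              # head keeps moving; no positioning cost

random_iops = 1000 / random_ms
sequential_blocks_per_s = 1000 / sequential_ms

print(f"Random 4 KB:     {random_ms:.1f} ms/block -> {random_iops:.0f} IOPS, "
      f"{random_iops * BLOCK_KB / 1024:.2f} MB/s")
print(f"Sequential 4 KB: {sequential_ms:.3f} ms/block -> {sequential_blocks_per_s:.0f} blocks/s, "
      f"{sequential_blocks_per_s * BLOCK_KB / 1024:.0f} MB/s")
print(f"Speedup: ~{random_ms / sequential_ms:.0f}x")
```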
| Access Pattern | Operations/Second | Throughput | Relative Performance |
|---|---|---|---|
| Random 4KB Read | ~100 IOPS | ~0.4 MB/s | 1x (baseline) |
| Sequential 4KB Read | ~37,500 blocks/s | ~150 MB/s | 375x faster |
| Random 4KB Write | ~80 IOPS | ~0.3 MB/s | 0.8x |
| Sequential 4KB Write | ~35,000 blocks/s | ~140 MB/s | ~350x (≈440x vs. random write) |
Sequential writes can be 400x faster than random writes on HDDs. This isn't a software limitation—it's physics. The only way to close this gap is to eliminate seeks and rotational delays by keeping the head moving forward.
Solid State Drives (SSDs):
SSDs have no moving parts, eliminating seek and rotational latency. However, they have their own characteristics that favor sequential access:

- Flash can only be erased in large blocks (typically 128 KB to 4 MB), so scattered small writes force the flash translation layer (FTL) to copy and consolidate data, increasing write amplification
- Sequential writes fill erase blocks cleanly and map well onto the SSD's internal channel and die parallelism
- Large, queued writes amortize per-command overhead in the controller and host interface
While the sequential advantage is smaller for SSDs (~5-10x vs. 400x for HDDs), it remains significant—especially for write-heavy workloads.
Storage Performance Characteristics:

```
HDD (7200 RPM, 150 MB/s sequential):

  Random 4KB write:
    ├─ Seek:     8 ms
    ├─ Rotation: 4 ms
    ├─ Transfer: 0.03 ms
    └─ Total:    ~12 ms → 83 IOPS → 0.33 MB/s

  Sequential 4KB writes:
    ├─ Seek:     0 ms   (head already positioned)
    ├─ Rotation: 0 ms   (next sector immediately available)
    ├─ Transfer: 0.03 ms
    └─ Total:    ~0.03 ms → 35,000 IOPS → 140 MB/s

  Performance ratio: 140 / 0.33 ≈ 424x

SSD (NVMe, 3000 MB/s sequential):

  Random 4KB write (QD=1):
    ├─ Latency: ~50-100 μs
    └─ IOPS:    10,000-20,000 → ~50-80 MB/s

  Sequential 4KB writes (QD=32):
    ├─ Latency:    ~10-20 μs per operation (amortized)
    └─ Throughput: 500,000+ IOPS → 2,000-3,000 MB/s

  Performance ratio: 3000 / 60 = 50x
  (Still significant, even without mechanical delays!)
```

The Fundamental Insight:
Storage devices are optimized for sequential access at every level:

- HDDs lay sectors out along tracks so that a head that keeps moving forward never pays seek or rotational cost
- Drive and controller firmware prefetch and write-cache aggressively for streaming patterns
- SSDs erase flash in large blocks and stripe sequential writes across channels and dies
- Operating system I/O schedulers merge adjacent requests into larger ones
LFS aligns with this reality by never asking the storage device to do what it does poorly: scattered random writes. Instead, LFS accumulates writes in memory and issues them in large, sequential bursts—exactly what storage devices do best.
The segment buffer is the central mechanism that transforms scattered application writes into sequential disk writes. Understanding its operation is key to understanding LFS performance.
How the Segment Buffer Works:
1. Accumulation Phase: Application writes (data blocks, inodes, directory updates) are collected in a memory buffer
2. Segment Construction: When the buffer is full (or a sync is requested), all pending writes are organized into a segment
3. Sequential Write: The entire segment is written to disk in one contiguous operation
4. Imap Update: The inode map is updated to reflect new block locations
This batching converts what might be hundreds of scattered 4 KB writes into a single multi-megabyte sequential write.
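A minimal sketch of this mechanism in Python is shown below. The class and callback names are invented for illustration (this is not code from any LFS implementation); it only demonstrates the accumulate-then-flush pattern described above.

```python
# Minimal sketch of a segment buffer: accumulate small writes in memory,
# then emit them as one large sequential write. Names are illustrative.

class SegmentBuffer:
    def __init__(self, segment_size=4 * 1024 * 1024, flush_fn=None):
        self.segment_size = segment_size       # e.g. 4 MB segments
        self.pending = []                      # (kind, payload) pairs: data, inode, imap, ...
        self.used = 0
        self.flush_fn = flush_fn               # callback that performs ONE sequential write

    def add(self, kind, payload):
        """Buffer a block (file data, inode, directory update, ...)."""
        if self.used + len(payload) > self.segment_size:
            self.flush()                       # buffer full: write a whole segment
        self.pending.append((kind, payload))
        self.used += len(payload)

    def flush(self):
        """Build a segment image and hand it to the device as one write."""
        if not self.pending:
            return
        summary = f"SUMMARY:{len(self.pending)} blocks".encode()
        segment = summary + b"".join(p for _, p in self.pending)
        if self.flush_fn:
            self.flush_fn(segment)             # single sequential write to the log head
        self.pending.clear()
        self.used = 0


# Usage: hundreds of scattered 4 KB writes become one multi-megabyte write.
buf = SegmentBuffer(flush_fn=lambda seg: print(f"sequential write of {len(seg)} bytes"))
for _ in range(100):
    buf.add("data", bytes(4096))               # application writes
buf.add("inode", bytes(256))                   # metadata rides along in the same segment
buf.flush()                                    # e.g., on sync or when the buffer fills
```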
Segment Buffer Operation:

```
Time T1: Application writes file A (4 KB)

  SEGMENT BUFFER (in memory)
  ┌─────────────┐
  │ File A data │  (4 KB)
  └─────────────┘
  Buffer usage: 4 KB / 4 MB = 0.1%

Time T2: Application writes file B (8 KB), updates directory

  SEGMENT BUFFER (in memory)
  ┌─────────────┬─────────────┬─────────────┬─────────────┐
  │ File A data │ File B data │ File B data │ Dir update  │
  │   (4 KB)    │ (4 KB pg1)  │ (4 KB pg2)  │   (4 KB)    │
  └─────────────┴─────────────┴─────────────┴─────────────┘
  Buffer usage: 16 KB / 4 MB = 0.4%

  ... more writes accumulate ...

Time T100: Buffer full, trigger segment write

  SEGMENT BUFFER (ready to flush)
  [Seg Summary][Data A][Data B][Data C]...[Inode 1][Inode 2][Inode 3][Imap Block 1][Imap Block 2]
  Buffer usage: 4 MB / 4 MB = 100%

  FLUSH TO DISK → single 4 MB sequential write
    HDD: 4 MB at 140 MB/s  = ~29 ms
    SSD: 4 MB at 2000 MB/s = ~2 ms

CONTRAST with a traditional FS (same 100 write operations):

  Each write immediately goes to disk:
    - Write data block  → seek + write
    - Update inode      → seek + write
    - Update directory  → seek + write
    - Update bitmap     → seek + write

  100 operations × 4 disk writes × 12 ms per seek ≈ 4,800 ms

  LFS: ~29 ms vs. traditional: ~4,800 ms ≈ 165x faster
```

Segment Buffer Contents:
A segment isn't just a random collection of blocks. It's carefully organized for both write efficiency and future read performance:
- Segment Summary Block (beginning): records, for every block in the segment, which inode it belongs to and at which file offset, plus a checksum used during crash recovery
- Data Blocks: the file contents written by applications, grouped by file
- Inode Blocks: the updated inodes for the files whose data appears in the segment
- Inode Map Updates: the pieces of the imap that record the new on-disk locations of those inodes
- Segment Trailer (optional, some implementations): a timestamp or checksum marking the segment as completely written
When reading a file, we need both the inode (for metadata) and the data blocks. By placing a file's inode immediately after its data in the segment, LFS ensures that reading a recently-written file requires minimal disk seeks—the inode and data are physically adjacent.
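As a rough mental model of those contents (the field names below are assumptions for exposition, not the on-disk format of any particular file system):

```python
# Illustrative in-memory model of a segment's layout. Field names are
# assumptions for exposition, not an actual on-disk format.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class SegmentSummary:
    # For every block in the segment: which inode owns it and at what file offset.
    block_owners: List[Tuple[int, int]]   # e.g. [(inode_no, file_offset), ...]
    checksum: int = 0                     # lets recovery detect a partially written segment

@dataclass
class Segment:
    summary: SegmentSummary
    data_blocks: List[bytes]              # file contents, grouped per file
    inodes: Dict[int, bytes]              # updated inodes, placed right after their data
    imap_updates: Dict[int, int]          # inode number -> new on-disk inode address
    trailer_timestamp: float = 0.0        # optional trailer marking a complete segment
```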
The choice of segment size profoundly affects LFS performance. There's no universally optimal size—the right choice depends on workload characteristics, hardware, and system requirements.
The Trade-offs:

- Larger segments amortize positioning overhead across more data, so throughput approaches the device's sequential bandwidth; but they consume more memory, hold more unwritten data at risk across a crash, and raise the latency of syncs that must flush a partial segment
- Smaller segments flush sooner and keep fsync() latency low, but each write pays a larger fraction of its time in seek and rotational overhead, and the cleaner has more segments to track
Mathematical Analysis:
To achieve a target percentage of maximum disk bandwidth, segment size must exceed a threshold. Consider an HDD with:

- Positioning time (seek + rotation) T_seek = 10 ms
- Maximum transfer rate R_max = 100 MB/s
To achieve 90% of maximum throughput:
```
Efficiency = Segment_Size / (Segment_Size + T_seek × R_max)

  0.90 = S / (S + 0.01 s × 100 MB/s)
  0.90 = S / (S + 1 MB)
     S = 0.90 × (S + 1 MB)
     S = 0.90 S + 0.9 MB
0.10 S = 0.9 MB
     S = 9 MB
```
For 90% efficiency on this HDD, segments must be at least 9 MB.
For 99% efficiency: approximately 99 MB segments required.
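Solving the efficiency equation for S gives S = E / (1 − E) × T_seek × R_max. The short sketch below (the function name is illustrative) reproduces the numbers above and the table that follows:

```python
# Required segment size S for a target fraction E of peak bandwidth:
#   E = S / (S + T_seek * R_max)   =>   S = E / (1 - E) * T_seek * R_max

def required_segment_mb(efficiency, seek_s=0.010, rate_mb_s=100.0):
    overhead_mb = seek_s * rate_mb_s          # bandwidth "lost" per positioning, in MB
    return efficiency / (1.0 - efficiency) * overhead_mb

for e in (0.50, 0.80, 0.90, 0.99):
    print(f"{e:.0%} efficiency -> segment >= {required_segment_mb(e):.0f} MB")
# 50% -> 1 MB, 80% -> 4 MB, 90% -> 9 MB, 99% -> 99 MB
```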
| Segment Size | Bandwidth Utilization | Effective Throughput (of 100 MB/s peak) | Use Case |
|---|---|---|---|
| 512 KB | 34% | 34 MB/s | Interactive, low-latency |
| 1 MB | 50% | 50 MB/s | Balanced workloads |
| 4 MB | 80% | 80 MB/s | Typical LFS default |
| 16 MB | 94% | 94 MB/s | High throughput |
| 64 MB | 98% | 98 MB/s | Batch/archive workloads |
For SSDs, the optimal segment size is often determined by flash erase block size (128 KB - 4 MB) rather than seek time. Aligning segment size with erase block size minimizes write amplification within the SSD's flash translation layer.
Common Segment Sizes in Practice:

- The original Sprite LFS used segments of 512 KB to 1 MB
- F2FS uses 2 MB segments (512 blocks of 4 KB), optionally grouped into larger sections for allocation and cleaning
- NILFS2 defaults to 8 MB segments
Dynamic Segment Sizing:
Some advanced implementations adjust segment size dynamically based on:

- Sync frequency: sync-heavy workloads favor smaller segments to bound fsync() latency
- Free-space pressure: as the disk fills, smaller segments can ease garbage collection
- Device characteristics: erase-block size on SSDs, track and zone geometry on HDDs and SMR drives
- The workload's current throughput vs. latency target

A hypothetical policy along these lines is sketched below.
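Purely as an illustration—no real file system uses exactly this policy—a dynamic sizing heuristic might look like:

```python
# Hypothetical policy sketch: shrink segments when syncs are frequent (to keep
# fsync latency low), grow them when the workload is throughput-bound.
def choose_segment_size(sync_rate_hz, free_space_frac,
                        min_mb=1, default_mb=4, max_mb=16):
    if sync_rate_hz > 50:            # sync-heavy: favor latency over bandwidth
        return min_mb
    if free_space_frac < 0.15:       # low free space: smaller segments ease cleaning
        return default_mb
    return max_mb                    # streaming/batch: amortize positioning cost
```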
Writing data sequentially sounds simple, but ensuring consistency requires careful attention to write ordering. LFS must guarantee that the file system remains valid even if a crash interrupts a segment write.
The Ordering Problem:
Consider what's in a segment:

- A segment summary block (header) describing everything that follows
- Data blocks containing file contents
- Inode blocks referencing those data blocks
- Imap blocks recording where the new inodes live
If the disk reorders these writes and crashes mid-segment:

- The imap could point to inodes that never reached the disk
- Inodes could point to data blocks that were never written
- The summary could describe blocks that are missing or stale
Any of these could corrupt the file system.
Safe Write Ordering in LFS:

```
SEGMENT STRUCTURE (issued as one sequential write):

  ┌────────────┬────────────────┬──────────┬──────────┐
  │ Seg Summary│  DATA BLOCKS   │  INODES  │   IMAP   │
  │  (header)  │  (file data)   │(metadata)│ (index)  │
  └────────────┴────────────────┴──────────┴──────────┘
        │              │             │           │
        ▼              ▼             ▼           ▼
     Write 1        Write 2       Write 3     Write 4
     (first)        (second)      (third)     (last)

WHY THIS ORDER WORKS:

  1. Header written first (but not yet referenced by anything)
     - Contains checksums; can detect partial writes

  2. Data blocks written (but not yet reachable)
     - Inodes don't point here yet
     - Safe: orphan data just wastes space

  3. Inodes written (but not yet reachable)
     - Imap doesn't point here yet
     - Inodes reference data blocks (which are already written)
     - Safe: orphan inodes just waste space

  4. Imap blocks written last
     - Now inodes become reachable
     - Inodes point to valid data blocks
     - Safe: either old imap (pre-write) or new imap (post-write)

  5. Checkpoint update (separate operation)
     - Only after segment fully written
     - Points to updated imap blocks
     - Makes the entire segment visible atomically

CRASH SCENARIOS:

  Crash during step 2 (data blocks):
    → Recovery ignores partial segment (checksum fails)
    → Imap still points to old inode locations
    → Result: write is lost, but FS consistent ✓

  Crash during step 3 (inodes):
    → Recovery ignores partial segment
    → Imap points to old inodes (in previous segments)
    → Result: write is lost, but FS consistent ✓

  Crash during step 4 (imap):
    → Recovery uses previous checkpoint
    → Old imap blocks still valid
    → Result: write is lost, but FS consistent ✓

  Crash after step 4, before checkpoint:
    → Roll-forward recovery can recover this segment
    → Segment is complete; checkpoint update is atomic
    → Result: write may be recovered ✓
```

Checkpoint Atomicity:
The checkpoint region is updated only after a segment is completely written. This creates an atomic commit point:

- Before the checkpoint write, the on-disk state still describes the previous checkpoint: the new segment is invisible
- After the checkpoint write, all of the segment's inodes and data become reachable at once
- LFS keeps two checkpoint regions and alternates between them, each stamped with a timestamp, so a torn checkpoint write can be detected and the older valid checkpoint used instead
Handling Disk Write Reordering:
Modern disks may reorder writes for performance. To ensure ordering, LFS uses:

- Write barriers and cache-flush commands (e.g., FLUSH CACHE or FUA writes) between the segment write and the checkpoint update
- Checksums over the segment and checkpoint, so that a reordered or partially written segment is detected and discarded during recovery

A minimal sketch of this ordering follows.
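In the sketch below, a plain file stands in for the device and os.fsync() stands in for a cache-flush barrier; a real implementation would issue FLUSH/FUA commands at the block layer.

```python
import os
import zlib

def write_segment_then_checkpoint(dev, segment_bytes, checkpoint_bytes,
                                  seg_offset, cp_offset):
    """Write the segment, barrier, then publish it via the checkpoint."""
    # 1) Write the whole segment (summary + data + inodes + imap) sequentially.
    dev.seek(seg_offset)
    dev.write(segment_bytes)
    dev.flush()
    os.fsync(dev.fileno())          # barrier: the segment must be durable first

    # 2) Only now update the checkpoint so the new imap becomes reachable.
    dev.seek(cp_offset)
    dev.write(checkpoint_bytes)
    dev.flush()
    os.fsync(dev.fileno())          # barrier: this is the commit point

# Recovery side: a checksum over the segment lets a partially written or
# reordered segment be detected and ignored.
def segment_is_complete(segment_bytes, expected_crc):
    return zlib.crc32(segment_bytes) == expected_crc
```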
F2FS Approach:
F2FS (Flash-Friendly File System) uses a sophisticated technique: it maintains two checkpoint packs and alternates between them. Each checkpoint carries a version number and a checksum, so after a crash F2FS mounts from the most recent checkpoint that verifies, and can then roll forward over node blocks written by fsync() after that checkpoint.
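The selection logic can be sketched as follows; the dictionary fields are simplified stand-ins, not the actual F2FS on-disk checkpoint format:

```python
import zlib

def compute_crc(payload: bytes) -> int:
    return zlib.crc32(payload)

# Illustrative: pick the newest checkpoint whose checksum verifies.
# A log-structured FS that alternates between two checkpoint regions
# can always fall back to the older valid one after a torn write.
def pick_checkpoint(cp_a, cp_b):
    candidates = [cp for cp in (cp_a, cp_b)
                  if cp is not None and cp["crc"] == compute_crc(cp["payload"])]
    if not candidates:
        raise RuntimeError("no valid checkpoint found")
    return max(candidates, key=lambda cp: cp["version"])   # highest valid version wins
```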
Because LFS never modifies existing data in place, a crash can only result in lost writes—never corruption of existing data. This is fundamentally safer than update-in-place file systems where a crash during an in-place update can corrupt the original data.
Modern LFS implementations often use multiple logs rather than a single append point. This enables separation of data with different characteristics and better optimization for specific access patterns.
Why Multiple Logs?
Not all data is created equal:

- Hot data (directory blocks, inodes of actively modified files, frequently rewritten file blocks) is overwritten quickly, so its old copies die fast
- Warm data sees moderate, mixed access
- Cold data (archives, large media files written once) is rarely touched again
- Metadata (node) blocks and file data blocks also follow different update patterns
Mixing hot and cold data in the same segment leads to inefficient garbage collection: cleaning a segment with 10% live data still requires moving that 10%.
F2FS Multi-Log Architecture:

```
F2FS DISK LAYOUT:

  SUPERBLOCK / CHECKPOINT (fixed)
  SEGMENT INFO TABLE      (segment metadata + usage)

  MAIN AREA:
  ┌───────────────────┬───────────────────┬───────────────────┐
  │ HOT NODE LOG      │ WARM NODE LOG     │ COLD NODE LOG     │
  │ (hot metadata)    │ (warm metadata)   │ (cold metadata)   │
  ├───────────────────┼───────────────────┼───────────────────┤
  │ HOT DATA LOG      │ WARM DATA LOG     │ COLD DATA LOG     │
  │ (hot file data)   │ (warm file data)  │ (cold file data)  │
  └───────────────────┴───────────────────┴───────────────────┘

DATA SEPARATION STRATEGY:

  HOT  = frequently modified data
    ├── Direct blocks of frequently written files
    ├── Directory entries for active directories
    └── Inode blocks for actively modified files

  WARM = moderately accessed data
    ├── Data from files with mixed access patterns
    └── Default for most application data

  COLD = rarely modified data
    ├── Data from files not accessed recently
    ├── Archived or backup files
    └── Large media files (after initial write)

BENEFITS:

  1. Hot data segregated  → when GC runs on the hot area, most blocks are dead
  2. Cold data separate   → rarely needs GC, high utilization
  3. Parallel writing     → each log head can accept writes independently
  4. Flash optimization   → wear leveling per temperature zone
```

Data Classification Techniques:
File systems use various heuristics to classify data:
- Path and directory heuristics (e.g., /tmp is hot, /archive is cold)
- File name extensions (large media and archive formats are treated as cold)
- File age and recent update frequency
- Explicit application hints (e.g., posix_fadvise() or file-system-specific flags)
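A toy classifier along these lines might look like the following; the thresholds, extension list, and function name are invented for illustration, and real file systems combine several such signals:

```python
import os

COLD_EXTENSIONS = {".mp4", ".mkv", ".jpg", ".zip", ".tar", ".bak"}   # illustrative

def classify_temperature(path, size_bytes, user_hint=None):
    """Return 'hot', 'warm', or 'cold' for a new write. Purely a heuristic sketch."""
    if user_hint in ("hot", "warm", "cold"):
        return user_hint                          # explicit hint wins
    if path.startswith("/tmp/"):
        return "hot"                              # short-lived, rewritten often
    ext = os.path.splitext(path)[1].lower()
    if path.startswith("/archive/") or ext in COLD_EXTENSIONS:
        return "cold"                             # written once, rarely touched again
    if size_bytes > 64 * 1024 * 1024:
        return "cold"                             # large streaming files
    return "warm"                                 # default bucket
```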
NILFS2 Approach:

NILFS2 uses a single log but with a different strategy: every write, regardless of temperature, is appended to one continuous log, and each log write creates a checkpoint that can be promoted to a mountable snapshot. Rather than separating hot and cold data up front, NILFS2 relies on its userspace cleaner to reclaim space later, trading some garbage-collection efficiency for simplicity and continuous snapshots.
| File System | Log Structure | Classification Method | Optimization Target |
|---|---|---|---|
| F2FS | 6 logs (3 temps × 2 types) | Heuristic + hints | SSD wear leveling, GC efficiency |
| NILFS2 | Single log + snapshots | None (all mixed) | Continuous snapshots, simplicity |
| ZFS (ZIL) | Separate intent log | Sync vs async | Low-latency sync writes |
| Btrfs | Multiple allocation groups | Size + type | Concurrent allocation |
Multi-log architectures improve garbage collection efficiency but add complexity. The file system must track which log each block belongs to, decide classification in real-time, and handle data migration between logs. This complexity is justified for performance-critical systems like SSDs but may be overkill for simpler use cases.
LFS's segment buffering creates a tension with applications that require durability guarantees. When an application calls fsync(), it expects data to be safely on disk—not lingering in a memory buffer.
The Sync Challenge:
If we wait for the segment buffer to fill before writing, fsync() could take arbitrarily long. But if we write partial segments on every sync, we lose the sequential write advantage.
Solutions:
1. Partial Segment Writes: write whatever is in the buffer immediately, accepting poor write efficiency for that flush
2. Dedicated Sync Log: maintain a separate small log for sync-heavy workloads, keeping the main log's large segments intact
3. Sync Coalescing (segment aggregation): combine sync requests that arrive close together into a single flush

Each strategy is illustrated below.
LFS Sync Handling Strategies:

```
STRATEGY 1: Partial Segment Write
─────────────────────────────────────────────────────────────
  Timeline:
    T1: Write A (4 KB) → buffer
    T2: Write B (8 KB) → buffer
    T3: fsync() called!

  Buffer state: [A: 4 KB][B: 8 KB] = 12 KB   (buffer capacity: 4 MB)

  ACTION: flush a 12 KB partial segment immediately
    [Summary][A][B][Inodes][Imap] → disk   (12 KB + overhead)

  EFFICIENCY: 12 KB / 4 MB = 0.3% buffer utilization
  COST: full seek overhead for a small write

STRATEGY 2: Dedicated Sync Log (ZFS ZIL-style)
─────────────────────────────────────────────────────────────
  Dual-log architecture:

    MAIN LOG (large segments)         SYNC LOG (small, fast)
    [Seg N-1][Seg N ][ ...            [sync1][sync2][sync3]
     (4 MB)  (4 MB)                   (4 KB) (8 KB) (2 KB)

  Write flow:
    T1: Write A → buffer + sync log entry
    T2: Write B → buffer + sync log entry
    T3: fsync() → flush sync log only (fast, small)
    T4: (later) buffer fills → main log gets a full segment
    T5: sync log entries cleared (data now in main log)

  ADVANTAGE: fsync() latency low; main log throughput high

STRATEGY 3: Sync Coalescing
─────────────────────────────────────────────────────────────
  Multiple syncs consolidated:

    T1: App1 writes A, calls fsync() ─┐
    T2: App2 writes B, calls fsync()  ├── all within a 5 ms window
    T3: App3 writes C, calls fsync() ─┘

  Without coalescing: 3 separate segment writes (3 seeks)
  With coalescing:    1 segment write containing A, B, C (1 seek)

  Implementation:
    - Collect syncs for N milliseconds (e.g., 5-10 ms)
    - Flush once with all pending data
    - Wake all waiting threads
```
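Strategy 3 is easy to sketch. The snippet below batches fsync() callers that arrive within a short window behind a single flush; the structure and timings are illustrative, not taken from any particular implementation.

```python
import threading
import time

class SyncCoalescer:
    """Batch fsync() requests that arrive within a short window into one flush."""

    def __init__(self, flush_fn, window_ms=5):
        self.flush_fn = flush_fn              # writes the current (partial) segment once
        self.window = window_ms / 1000.0
        self.flushed = threading.Condition()
        self.generation = 0                   # incremented after every group flush

    def fsync(self):
        """Called by application threads; returns once the next group flush completes."""
        with self.flushed:
            target = self.generation + 1
            while self.generation < target:
                self.flushed.wait()

    def run_flusher(self):
        """Background loop: one segment write per window covers all pending syncs."""
        while True:
            time.sleep(self.window)
            self.flush_fn()                   # a single flush for everything buffered so far
            with self.flushed:
                self.generation += 1
                self.flushed.notify_all()

# Usage (illustrative): start run_flusher() in a daemon thread, then have
# application threads call coalescer.fsync() after buffering their writes.
# threading.Thread(target=coalescer.run_flusher, daemon=True).start()
```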
Database Workload Optimization:

Databases issue frequent fsync() calls for transaction durability. This is particularly challenging for LFS:

- Every commit can force a partial-segment flush, so segments shrink toward the size of a single transaction
- Tiny segments reintroduce per-write positioning overhead and multiply the number of segments the cleaner must later process
- The sequential-write advantage erodes on exactly the workload that needs it most
Group Commit:
Many databases implement "group commit"—collecting multiple transaction commits and issuing a single fsync() for all. This aligns perfectly with LFS:

- The commits accumulate in one segment buffer rather than forcing several partial segments
- A single flush amortizes the per-segment overhead across every transaction in the group
- The database's batching window and the file system's sync coalescing reinforce each other
Combined with LFS segment batching, this achieves high transaction throughput with durability.
Applications can significantly improve LFS performance by batching their own writes before calling fsync(), using O_DSYNC instead of fsync() when metadata durability isn't needed, and using posix_fadvise() to hint at access patterns. Well-designed applications and LFS can achieve remarkable synergy.
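From the application side, the batching pattern might look like this sketch using POSIX calls via Python's os module (os.posix_fadvise and os.fdatasync are available on Linux; the function name is illustrative):

```python
import os

def append_records_durably(path, records):
    """Batch many small writes, then pay the durability cost once."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        # Hint that we will write sequentially and not re-read soon.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)

        os.write(fd, b"".join(records))   # one large write instead of many tiny ones
        os.fdatasync(fd)                  # one durability point for the whole batch
        # Alternative: open with os.O_DSYNC to make each write itself durable for
        # data without forcing unrelated metadata (timestamps, etc.) to disk.
    finally:
        os.close(fd)
```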
Let's examine how LFS sequential write strategy performs under different workloads compared to traditional file systems.
Workload Comparison on HDDs (ext4 vs. LFS):
| Workload | ext4 | LFS | Advantage | Why |
|---|---|---|---|---|
| Small random writes (4KB × 10000) | 1x | 50-100x | LFS | Batching eliminates seeks |
| Large sequential (1GB file) | 1x | 0.95-1.05x | Similar | Both sequential; LFS has slight overhead |
| File creation burst (1000 files) | 1x | 20-50x | LFS | Metadata + data batched together |
| Sync-heavy (1000 fsyncs) | 1x | 0.5-2x | Depends | Sync coalescing helps; frequent syncs hurt |
| 90% disk full state | 0.8x | 0.3-0.5x | ext4 | LFS GC overhead dominates |
SSD Performance Characteristics:
On SSDs, the performance landscape changes:
| Workload | ext4 | F2FS | Advantage | Notes |
|---|---|---|---|---|
| Small random writes | 1x | 2-5x | F2FS | Reduces SSD write amplification |
| Large sequential | 1x | 0.9-1.1x | Similar | Both saturate SSD bandwidth |
| File creation burst | 1x | 3-8x | F2FS | Metadata batching still helps |
| Sustained random (hours) | 1x | 1.5-3x | F2FS | SSD GC benefits from log structure |
| Mixed read/write | 1x | 0.8-1.2x | Depends | Read fragmentation can hurt F2FS |
LFS performance degrades significantly as the disk fills up. With less free space, garbage collection runs more frequently and must work harder to find clean segments. Best practice: keep 10-20% free space on LFS volumes. This issue is explored in detail in the Garbage Collection page.
Real-World Benchmarks:
The Samsung F2FS paper (FAST 2015) reported that F2FS outperformed ext4 on mobile devices, with the largest gains on random-write and sync-heavy workloads such as SQLite, and roughly comparable performance on large sequential transfers.
NILFS2 evaluations report a similar qualitative pattern, with the benefits concentrated in write-heavy workloads.
These results demonstrate that LFS's sequential write strategy delivers substantial real-world benefits, particularly for write-intensive workloads on both HDDs and SSDs.
We've explored the mechanics of sequential writes in LFS and how this fundamental design decision transforms file system performance.
What's Next:
Sequential writes solve the performance problem but create a new one: garbage collection. Since we never update in place, old versions of data accumulate and consume space. The next page explores how LFS reclaims this space through garbage collection—a critical and challenging aspect of log-structured storage.
You now understand how LFS achieves near-optimal write performance through sequential I/O. This knowledge is foundational for understanding garbage collection challenges and why segment cleaning strategies matter for sustained LFS performance.