In the previous page, we explored write-through caching—the conservative approach that guarantees durability at the cost of performance. Every write waited for disk acknowledgment, creating a direct relationship between I/O latency and application throughput.
But consider this scenario: your application writes 100 small records per second to the same file region. With write-through, that's 100 disk I/O operations. If each takes 15µs (NVMe), you've consumed 1.5ms of disk time. Acceptable. But on an HDD at 10ms per write, you've saturated the disk entirely with just 100 writes/second.
Now imagine those 100 writes all modify the same 4KB block—perhaps updating a counter, appending to a log, or modifying an in-memory data structure that gets serialized. With write-through, you write the same block 100 times. All but the final write are pure waste: only the last version of the block matters.
Write-back caching eliminates this waste. Instead of writing immediately, data accumulates in cache, and only the final state is eventually written to disk. The same 100 writes become 1 I/O operation. Performance improves by 100×—but at a cost.
This page explores that cost, and more importantly, how to manage it correctly.
By the end of this page, you will understand the complete lifecycle of dirty data in a write-back system, the mechanisms that eventually force data to disk, the failure modes that can cause data loss, and the design patterns that allow systems to achieve write-back performance with acceptable durability guarantees.
Write-back (also called write-behind) is a caching strategy where writes update only the cache initially. The write operation returns success immediately after the cache is updated. The actual write to the backing store happens later, asynchronously, controlled by various policies.
The key behavioral characteristic: the write() call returns as soon as data is copied to the kernel's page cache—no waiting for disk. The actual flush to the backing store happens later, asynchronously.
The Write-Back Data Flow
Application: write(fd, buffer, 4096)
│
▼
┌─────────────────────────────────────────────────────────┐
│ User Space → Kernel Space Transition (syscall) │
│ [Time: ~1-5µs] │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ VFS Layer: Validate fd, check permissions │
│ [Time: ~0.5µs] │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Page Cache: │
│ 1. Find or allocate page for this file offset │
│ 2. Copy data from user buffer to page │
│ 3. Mark page as DIRTY │
│ 4. Update file metadata (size, mtime) │
│ [Time: ~0.1-1µs per KB] │
└─────────────────────────────────────────────────────────┘
│
▼ ← RETURN IMMEDIATELY HERE (write-back)
┌─────────────────────────────────────────────────────────┐
│ Application: write() returns 4096 (success) │
│ [Total time: ~2-10µs] │
└─────────────────────────────────────────────────────────┘
... later ...
┌─────────────────────────────────────────────────────────┐
│ Background Flusher Thread (pdflush/flush-X): │
│ 1. Scan for dirty pages │
│ 2. Batch into I/O requests │
│ 3. Submit to block layer │
│ 4. Clear dirty flags on completion │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Storage Device: Write to persistent medium │
│ [Data now durable] │
└─────────────────────────────────────────────────────────┘
Notice the fundamental difference: the application sees a 2-10µs operation regardless of whether the disk would have taken 10ms. From the application's perspective, every write is essentially a memory copy.
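You can observe this asymmetry directly with a small experiment (illustrative only, not a benchmark—absolute numbers depend heavily on hardware and filesystem): time a plain buffered write() against the same write followed by fsync().

```python
import os
import tempfile
import time

def timed_write(fd: int, data: bytes, sync: bool) -> float:
    """Time one write(); optionally force durability with fsync()."""
    start = time.perf_counter()
    os.write(fd, data)
    if sync:
        os.fsync(fd)  # block until the data reaches the storage device
    return time.perf_counter() - start

fd, path = tempfile.mkstemp()
buf = b"x" * 4096

buffered = timed_write(fd, buf, sync=False)  # returns after the cache copy
durable = timed_write(fd, buf, sync=True)    # waits for the storage stack

os.close(fd)
os.unlink(path)
print(f"buffered: {buffered * 1e6:.1f}us  fsync: {durable * 1e6:.1f}us")
```

On a typical system the buffered write completes in single-digit microseconds while the fsync'd write reflects the full device round trip.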
Between the write() call returning and the background flush completing, data exists ONLY in volatile RAM. A power failure, kernel panic, or forced system reset during this window loses all uncommitted writes. This window can range from milliseconds to 30+ seconds depending on configuration.
The Linux page cache is the central data structure enabling write-back behavior. Understanding its mechanics is essential for understanding write-back performance and failure modes.
Page Cache Organization
Every file's content is represented in memory as a collection of 4KB pages (matching the typical filesystem block size and CPU page size). These pages are organized in a radix tree indexed by file offset:
struct address_space {
struct inode *host; /* Owner inode */
struct xarray i_pages; /* Radix tree of pages */
unsigned long nrpages; /* Number of cached pages */
pgoff_t writeback_index; /* Writeback starts here */
const struct address_space_operations *a_ops;
unsigned long flags; /* Various flags */
/* ... more fields ... */
};
Each page has state flags indicating its current status:
| Flag | Meaning | Implications |
|---|---|---|
| PG_uptodate | Page contains valid data | Safe to read from cache |
| PG_dirty | Page differs from disk | Must be written eventually |
| PG_writeback | Write to disk in progress | Cannot modify until complete |
| PG_locked | Page is locked for I/O | Other accessors must wait |
| PG_referenced | Recently accessed | Used for eviction decisions |
The Dirty Page Lifecycle
Application write()
│
▼
┌──────────────────┐
│ CLEAN PAGE │ Page matches disk content
│ (or no page) │ (or page doesn't exist)
└────────┬─────────┘
│ write() copies data
▼
┌──────────────────┐
│ DIRTY PAGE │ Page differs from disk
│ PG_dirty = 1 │ Needs eventual flush
└────────┬─────────┘
│ Background flusher or fsync()
▼
┌──────────────────┐
│ WRITEBACK PAGE │ I/O in progress
│ PG_writeback = 1 │ Temporary state
└────────┬─────────┘
│ I/O completion
▼
┌──────────────────┐
│ CLEAN PAGE │ Page matches disk again
│ PG_dirty = 0 │ Can be evicted if needed
└──────────────────┘
Memory Pressure and Dirty Pages
The page cache competes for memory with applications and other kernel subsystems. When memory becomes scarce, the kernel must evict pages. But dirty pages cannot simply be discarded—they must be written first.
This creates an important interaction: under memory pressure, reclaim cannot simply discard a dirty page—it must write the page back first. Heavy write workloads can therefore turn memory allocation stalls into disk I/O stalls.
The /proc/meminfo file shows current dirty page status:
Dirty: 1234 kB # Currently dirty, not being written
Writeback: 567 kB # Currently being written to disk
AnonPages: 4567890 kB # Anonymous memory (not file-backed)
Mapped: 123456 kB # Memory-mapped files
PageTables: 12345 kB # Page table overhead
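A small helper can turn that text into bytes for monitoring (a sketch; it parses the "Key: value kB" format shown above):

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style 'Key:  value kB' lines into bytes."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0]) * 1024  # values are in kB
    return fields

sample = "Dirty:     1234 kB\nWriteback:   567 kB\nAnonPages: 4567890 kB\n"
info = parse_meminfo(sample)
```

On a live Linux system you would feed it `open("/proc/meminfo").read()` and alert when Dirty grows toward the configured limits.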
Dirty Limits and Throttling
Linux prevents dirty pages from consuming all memory through configurable limits:
# View current dirty page limits
cat /proc/sys/vm/dirty_background_ratio
# Default: 10 - Start background writeback at 10% RAM dirty

cat /proc/sys/vm/dirty_ratio
# Default: 20 - Block writers at 20% RAM dirty (throttling)

cat /proc/sys/vm/dirty_writeback_centisecs
# Default: 500 - Flush daemon wakes every 5 seconds

cat /proc/sys/vm/dirty_expire_centisecs
# Default: 3000 - Pages older than 30 seconds get written

# For absolute byte limits (useful on high-memory systems):
cat /proc/sys/vm/dirty_background_bytes
cat /proc/sys/vm/dirty_bytes

# Example: System with 64GB RAM, default settings
# dirty_background_ratio = 10%
#   → Background flush starts at 6.4GB dirty
# dirty_ratio = 20%
#   → Application writes block at 12.8GB dirty

# Tuning for low-latency workloads (reduce dirty window):
echo 5 > /proc/sys/vm/dirty_background_ratio
echo 10 > /proc/sys/vm/dirty_ratio
echo 100 > /proc/sys/vm/dirty_writeback_centisecs
echo 500 > /proc/sys/vm/dirty_expire_centisecs

# Tuning for throughput (allow more buffering):
echo 20 > /proc/sys/vm/dirty_background_ratio
echo 40 > /proc/sys/vm/dirty_ratio
echo 1000 > /proc/sys/vm/dirty_writeback_centisecs
echo 6000 > /proc/sys/vm/dirty_expire_centisecs

Lower dirty ratios reduce data-at-risk during crash but increase I/O pressure and may cause write stalls. Higher ratios improve throughput but increase potential data loss and memory pressure. Production systems typically reduce defaults for safety, especially on systems with large RAM where default percentages represent enormous data loss windows.
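The arithmetic in those comments can be captured in a small helper (a sketch; the ratios are the integer percentages the kernel exposes):

```python
def dirty_limits(ram_bytes: int, background_ratio: int = 10,
                 ratio: int = 20) -> tuple[int, int]:
    """Translate percentage-based vm.dirty_* settings into absolute bytes."""
    background = ram_bytes * background_ratio // 100  # flusher threads start here
    blocking = ram_bytes * ratio // 100               # write() callers throttle here
    return background, blocking

GiB = 1024 ** 3
bg, block = dirty_limits(64 * GiB)
print(f"background: {bg / GiB:.1f} GiB, blocking: {block / GiB:.1f} GiB")
```

For a 64GB machine with defaults this reproduces the 6.4GB / 12.8GB figures above, and makes it obvious why percentage limits scale badly with RAM size.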
The Linux kernel employs dedicated kernel threads to write dirty pages to disk. Understanding these mechanisms is crucial for predicting writeback behavior.
Writeback Threads
Linux performs writeback with per-backing-device threads. On older kernels these appear as dedicated flush-X:Y threads (on recent kernels the same work runs on generic kworker threads):
$ ps aux | grep flush
root 1234 0.0 0.0 0 0 ? S Jan01 0:23 [flush-8:0]
root 1235 0.0 0.0 0 0 ? S Jan01 0:15 [flush-8:16]
root 1236 0.0 0.0 0 0 ? S Jan01 0:08 [flush-253:0]
Each flush-X:Y thread handles writeback for a specific block device (major:minor numbers). This allows parallel writeback across multiple devices.
Writeback Triggers
Writeback occurs in several circumstances:
- Periodic flushing (dirty_writeback_centisecs): the flusher wakes at a fixed interval and writes pages older than dirty_expire_centisecs
- Background threshold (dirty_background_ratio): asynchronous flushing starts once dirty pages exceed this fraction of RAM
- Blocking threshold (dirty_ratio): above this fraction, writing processes are throttled and forced to wait for writeback
- Explicit flush (fsync, syncfs, sync): the application demands durability now
- Memory pressure: reclaim must clean dirty pages before it can evict them
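The first two triggers can be sketched as a toy, single-threaded model (purely illustrative—the kernel's real flusher is far more sophisticated; the class name and thresholds here are invented for the example):

```python
class ToyWriteBackCache:
    """Toy model of two writeback triggers: a background dirty-count
    threshold (like dirty_background_ratio) and page-age expiry
    (like dirty_expire_centisecs)."""

    def __init__(self, backing: dict, background_threshold: int = 4,
                 expire_s: float = 3.0):
        self.backing = backing              # stands in for the disk
        self.dirty = {}                     # page -> (data, dirtied_at)
        self.background_threshold = background_threshold
        self.expire_s = expire_s

    def write(self, page: int, data: bytes, now: float):
        """Cache-only update; 'returns' immediately like a buffered write()."""
        self.dirty[page] = (data, now)
        if len(self.dirty) >= self.background_threshold:
            self.flush()                    # background-threshold trigger

    def tick(self, now: float):
        """Periodic flusher wakeup: write pages older than expire_s."""
        for page in [p for p, (_, t) in self.dirty.items()
                     if now - t >= self.expire_s]:
            self.backing[page] = self.dirty.pop(page)[0]

    def flush(self):
        """Write everything dirty (like sync())."""
        for page, (data, _) in self.dirty.items():
            self.backing[page] = data
        self.dirty.clear()
```

A page written once sits dirty until a tick expires it; a burst of writes crosses the threshold and triggers an immediate flush—the same two behaviors the kernel parameters control.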
Writeback Request Flow
┌────────────────────────────────────────────────────────────┐
│ Flusher Thread: "Time to write dirty pages for /dev/sda" │
└────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ 1. Walk dirty inode list for this block device │
│ 2. For each inode, walk its dirty page tree │
│ 3. Batch contiguous pages into single I/O requests │
│ 4. Apply I/O scheduling (CFQ, BFQ, mq-deadline, etc.) │
└────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ Block Layer: │
│ - Merge adjacent requests │
│ - Reorder for seek optimization (HDDs) │
│ - Split requests exceeding device limits │
│ - Track request completion for accounting │
└────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ Device Driver: Submit to hardware command queue │
│ [For NVMe: may use multiple hardware queues] │
└────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ Storage Device: Writes data to medium │
│ [Returns completion interrupt/poll response] │
└────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ Completion Path: │
│ - Clear PG_writeback flag on pages │
│ - Clear PG_dirty flag (data is clean now) │
│ - Update writeback accounting │
│ - Wake any waiters (fsync callers) │
└────────────────────────────────────────────────────────────┘
The block layer's merging capability is a crucial write-back advantage. A thousand 4KB writes to sequential addresses become a single 4MB I/O operation. This dramatically improves HDD performance (one seek instead of 1000) and SSD efficiency (optimal write unit alignment).
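The merging idea is easy to model: given dirty-page indices, fold adjacent pages into (start, length) extents. This is a sketch—the real block layer merges request structures, not page lists:

```python
def coalesce(pages: list[int]) -> list[tuple[int, int]]:
    """Merge page indices into (start, length) extents, the way the
    block layer merges adjacent requests into one larger I/O."""
    extents: list[tuple[int, int]] = []
    for p in sorted(pages):
        if extents and extents[-1][0] + extents[-1][1] == p:
            # Page is contiguous with the current extent: extend it
            extents[-1] = (extents[-1][0], extents[-1][1] + 1)
        else:
            extents.append((p, 1))
    return extents

print(coalesce([0, 1, 2, 5, 6, 9]))  # [(0, 3), (5, 2), (9, 1)]
```

Six page writes collapse into three device I/Os; a thousand sequential pages would collapse into one.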
Let's develop quantitative models for write-back performance to understand when and why it provides such dramatic improvements over write-through.
Write Latency Model
For an individual write operation:
Write-Through: T_write = T_syscall + T_cache + T_io_complete
≈ 5µs + 1µs + T_device (10ms HDD, 20µs SSD)
Write-Back: T_write = T_syscall + T_cache + T_return
≈ 5µs + 1µs + 0.1µs ≈ 6µs (regardless of device)
For an HDD, write-back is 1600× faster per operation (10ms vs 6µs). For NVMe SSD, write-back is 4× faster (26µs vs 6µs).
Write Absorption Model
When multiple writes hit the same cache page, only one I/O operation eventually occurs:
N writes to same page:
Write-Through: N × T_io = N × 10ms (HDD)
Write-Back: N × T_cache + 1 × T_io = N × 1µs + 10ms
For N=100 on HDD:
Write-Through: 1000ms (1 second!)
Write-Back: ~10.1ms (100 × 1µs + 10ms)
Effective speedup: ~99× from absorption alone
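The absorption model can be captured in a few lines (timing constants are the illustrative figures used throughout this page):

```python
def absorption_times(n: int, t_cache: float, t_io: float) -> tuple[float, float]:
    """Total time in seconds for n overlapping writes under each strategy."""
    write_through = n * t_io            # every write pays a full device I/O
    write_back = n * t_cache + t_io     # n cache copies, one final flush
    return write_through, write_back

# 100 writes to the same block on an HDD: 1µs cache copy, 10ms device I/O
wt, wb = absorption_times(100, t_cache=1e-6, t_io=10e-3)
print(f"write-through: {wt*1e3:.0f}ms  write-back: {wb*1e3:.1f}ms  "
      f"speedup: {wt/wb:.0f}x")
```

With these constants the model gives 1000ms versus ~10.1ms, a ~99× speedup from absorption alone.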
Sequential Write Optimization
Consider writing a 1GB file as 262,144 sequential 4KB writes:
| Storage | Strategy | Operations | Total Time | Throughput |
|---|---|---|---|---|
| HDD 7200 | Write-through | 262,144 × 10ms | 43.7 min | 0.39 MB/s |
| HDD 7200 | Write-back (merged) | ~256 × 10ms | 2.56 sec | 400 MB/s |
| NVMe SSD | Write-through | 262,144 × 20µs | 5.24 sec | 195 MB/s |
| NVMe SSD | Write-back (merged) | ~1024 × 10µs | 10.2 ms | 100+ GB/s* |
*Limited by memory bandwidth, not storage.
The Dirty Window Cost
The performance benefit has a durability cost. At any moment, the "data at risk" is:
Data_At_Risk = Total_Dirty_Pages × Page_Size
With default settings (10% dirty background, 20% dirty max):
32GB RAM system:
Normal operation: up to 3.2GB dirty (10% threshold active)
Heavy write load: up to 6.4GB dirty (blocking at 20%)
256GB RAM system:
Normal operation: up to 25.6GB dirty!
Heavy write load: up to 51.2GB dirty!
The default percentage-based limits were designed when 512MB was a lot of RAM. On modern high-memory systems, using absolute byte limits (dirty_background_bytes, dirty_bytes) is strongly recommended.
Latency Variability
Write-back introduces latency variance that write-through lacks. Consider:
Single write, no memory pressure: ~6µs (P50, P99, P99.9 all similar)
Single write, dirty ratio exceeded: ~50ms-1s (blocked on writeback)
Single write, memory pressure: ~10ms-10s (direct reclaim)
Applications sensitive to tail latency (real-time systems, financial trading) often prefer write-through or hybrid approaches precisely because of this variance.
Systems crossing the dirty_ratio threshold experience sudden, severe latency degradation. A system happily doing 100,000 writes/second can drop to 100 writes/second when writers start blocking. Monitor dirty page levels proactively and tune thresholds before you hit them in production.
Write-back's Achilles heel is crash behavior. Let's analyze exactly what happens when a system crashes with dirty pages in memory.
Crash Scenario Analysis
Timeline during normal operation:
T=0.000s: Application writes block A (goes to cache, dirty)
T=0.001s: Application writes block B (goes to cache, dirty)
T=0.002s: Application writes block C (goes to cache, dirty)
...
T=5.000s: Background flush starts, writes A to disk
T=5.100s: Background flush writes B to disk
---CRASH--- (power loss at T=5.15s)
T=5.200s: Block C would have been written... but system is dead
Post-crash state:
Block A: On disk (correct, latest version)
Block B: On disk (correct, latest version)
Block C: Lost forever (only existed in RAM)
For individual files, this might mean a lost document save, a truncated log file, or a record that was acknowledged to a client but never persisted.
Filesystem Metadata Corruption
More insidiously, write-back can corrupt filesystem metadata:
Application creates new file:
1. Allocate inode (metadata write, goes to cache)
2. Allocate data blocks (metadata write, goes to cache)
3. Update directory entry (metadata write, goes to cache)
4. Write file content (data write, goes to cache)
If crash occurs after step 2 but before all metadata is flushed:
- Inode exists but directory doesn't reference it (orphan inode)
- Blocks allocated but file doesn't exist (space leak)
- Directory references file but inode missing (dangling reference)
Result: Filesystem inconsistency requiring fsck
Write Barriers: Ordering Without Full Sync
To address metadata ordering issues, filesystems use write barriers:
Barrier semantics:
All writes before the barrier must complete
before any write after the barrier begins
Without barriers (write-back, free reordering):
write A ─┐
write B ─┼─→ Disk sees: B, C, A (or any order)
write C ─┘
With barriers:
write A
write B
---BARRIER---
write C
write D
Disk sees: (A,B in some order), then (C,D in some order)
A and B definitely before C and D
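A barrier's effect on ordering can be modeled as partitioning the write stream into epochs (a toy model; real barriers are implemented with flush/FUA requests in the block layer):

```python
def barrier_epochs(ops: list[str]) -> list[list[str]]:
    """Split a write stream at 'BARRIER' markers into epochs. Writes within
    an epoch may be freely reordered, but every write in one epoch must
    complete before any write in the next epoch begins."""
    epochs: list[list[str]] = []
    current: list[str] = []
    for op in ops:
        if op == "BARRIER":
            if current:
                epochs.append(current)
                current = []
        else:
            current.append(op)
    if current:
        epochs.append(current)
    return epochs

print(barrier_epochs(["A", "B", "BARRIER", "C", "D"]))  # [['A', 'B'], ['C', 'D']]
```

The device is free to reorder within ['A', 'B'] and within ['C', 'D'], but never across the boundary—exactly the guarantee the diagram above describes.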
Barriers are cheaper than full sync because they enforce only ordering: the kernel does not wait for every write to become durable before returning—it only prevents the device from reordering writes across the barrier.
Journaling Filesystems
Modern filesystems (ext4, XFS, NTFS, APFS) use journaling to handle crash recovery without requiring write-through for all data: intended metadata changes are first appended to a sequential journal and made durable, and only afterward applied (via write-back) to their final on-disk locations.
If a crash occurs, recovery replays completed journal transactions and discards incomplete ones, restoring metadata consistency without a full fsck.
This achieves the safety of write-through for metadata with the performance of write-back for data.
Applications requiring durability should not rely on filesystem writeback behavior. They should explicitly call fsync() or use O_SYNC/O_DSYNC for critical writes. Assume write-back is the default and code defensively.
Operating system write-back is only part of the story. Storage devices have their own write caches, creating a multi-level caching hierarchy.
Device Write Cache Architecture
┌─────────────────────────────────────────────────────────────┐
│ Application │
└─────────────────────────────────────────────────────────────┘
│ write()
▼
┌─────────────────────────────────────────────────────────────┐
│ OS Page Cache (Write-Back) │
│ [Volatile RAM, 100GB+ possible] │
└─────────────────────────────────────────────────────────────┘
│ Background flush
▼
┌─────────────────────────────────────────────────────────────┐
│ RAID Controller Cache (If present) │
│ [Volatile or Battery-Backed, 256MB-8GB typical] │
└─────────────────────────────────────────────────────────────┘
│ RAID write
▼
┌─────────────────────────────────────────────────────────────┐
│ Drive Write Cache │
│ • HDD: volatile DRAM, 8-256MB │
│ • SATA SSD: volatile DRAM, 256MB-2GB │
│ • NVMe SSD: volatile DRAM + sometimes capacitor-backed │
└─────────────────────────────────────────────────────────────┘
│ Internal staging
▼
┌─────────────────────────────────────────────────────────────┐
│ Persistent Medium │
│ • HDD: magnetic platters │
│ • SSD: NAND flash cells │
└─────────────────────────────────────────────────────────────┘
Each layer can independently use write-back, and all layers must be considered for durability analysis.
SSD Internal Write-Back
SSDs have complex internal caching for performance and wear management:
When the OS "writes" and the SSD "acknowledges", the data may still be sitting in the drive's volatile DRAM buffer or SLC cache; the controller migrates it to its final NAND location later, as part of garbage collection and wear leveling. Until that happens, the acknowledgment does not imply durability.
Power Loss Protection (PLP)
Enterprise SSDs often include capacitors that provide enough power to flush DRAM to NAND during power loss:
| SSD Class | PLP | Typical Flush Time | Data Safety |
|---|---|---|---|
| Consumer SATA | None | N/A | NOT safe, data loss possible |
| Prosumer NVMe | Partial | ~10ms | Metadata safe, data may be lost |
| Enterprise NVMe | Full | ~50-100ms | All in-flight data flushed |
| Intel Optane | N/A | Inherently safe | Persistent memory, no flush needed |
Controlling Device Write Cache
# Check device write caching status
hdparm -W /dev/sda
# /dev/sda:
#  write-caching = 1 (on)   ← Device is using write-back

# Disable device write cache (for maximum safety)
hdparm -W0 /dev/sda
# Warning: Significant performance impact!

# For NVMe devices:
nvme get-feature -f 0x06 /dev/nvme0      # Check volatile write cache
nvme set-feature -f 0x06 -v 0 /dev/nvme0 # Disable VWC

# Verify effect on write performance
# With cache enabled (typical NVMe):
fio --name=test --rw=randwrite --bs=4k --numjobs=1 \
    --size=1G --runtime=10 --direct=1 --sync=1
# Results: ~50,000-100,000 IOPS

# With cache disabled (same drive):
# Results: ~10,000-30,000 IOPS (FTL overhead visible)

Rather than globally disabling device write cache, use Forced Unit Access (FUA) for specific critical writes. FUA bypasses the volatile cache for that operation only. Linux exposes this via O_DSYNC or REQ_FUA at the block layer. This preserves performance for non-critical writes while ensuring durability for critical ones.
Pure write-through sacrifices too much performance. Pure write-back sacrifices too much durability. Production systems use hybrid patterns that provide controlled durability with minimal performance impact.
Pattern 1: Group Commit
Batch multiple operations, sync once:
import os
import threading
from concurrent.futures import Future

class GroupCommitLog:
    """
    Batches multiple log entries, commits together.
    Amortizes fsync cost across many operations.
    """
    def __init__(self, path, commit_interval_ms=10, max_batch_size=100):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.pending = []
        self.pending_lock = threading.Lock()
        self.commit_interval = commit_interval_ms / 1000.0
        self.max_batch = max_batch_size
        self.commit_event = threading.Event()
        self.committer = threading.Thread(target=self._commit_loop, daemon=True)
        self.committer.start()

    def append(self, record):
        """Non-blocking: adds record to pending batch, returns a Future
        that resolves once the record is durable."""
        with self.pending_lock:
            future = Future()
            self.pending.append((record, future))
            if len(self.pending) >= self.max_batch:
                self.commit_event.set()  # wake the committer early
        return future

    def _commit_loop(self):
        while True:
            self.commit_event.wait(timeout=self.commit_interval)
            self.commit_event.clear()
            with self.pending_lock:
                batch = self.pending
                self.pending = []
            if not batch:
                continue
            # Write all records (goes to OS cache)
            for record, future in batch:
                os.write(self.fd, record.encode() + b'\n')
            # Single fsync for entire batch
            os.fsync(self.fd)
            # Signal all waiters
            for _, future in batch:
                future.set_result(True)
This pattern achieves bounded commit latency (at most one commit interval plus one fsync), a single fsync amortized over up to max_batch operations, and a clear durability contract: a record is durable once its future resolves.
Pattern 2: Write-Ahead Logging (WAL)
The canonical pattern for database durability:
Transaction execution:
1. Write change descriptions to log (sequential, fsync at commit)
2. Apply changes to data pages (write-back, no fsync)
3. Eventually checkpoint: flush dirty pages, truncate log
Crash recovery:
1. Read log from last checkpoint
2. Replay logged operations against possibly-stale data pages
3. System is consistent
Key insight: Random writes to data pages become sequential writes to log.
Sequential + fsync is much faster than random + fsync.
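A minimal sketch of the pattern (the MiniWAL class is invented for illustration; real databases add checksums, LSNs, torn-write detection, and group commit):

```python
import json
import os
import tempfile

class MiniWAL:
    """Minimal write-ahead log: each mutation is appended and fsynced to the
    log before being applied to the (write-back) in-memory table; replay()
    rebuilds the table from the log alone after a crash."""

    def __init__(self, log_path: str):
        self.table = {}
        self.fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)

    def put(self, key: str, value: str):
        record = json.dumps({"k": key, "v": value}).encode() + b"\n"
        os.write(self.fd, record)
        os.fsync(self.fd)           # durable before we acknowledge the commit
        self.table[key] = value     # the data-page update can stay write-back

    @staticmethod
    def replay(log_path: str) -> dict:
        table = {}
        with open(log_path, "rb") as f:
            for line in f:
                rec = json.loads(line)
                table[rec["k"]] = rec["v"]
        return table

# Demo: commit three updates, then "recover" from the log alone.
_fd, _path = tempfile.mkstemp()
os.close(_fd)
wal = MiniWAL(_path)
wal.put("a", "1")
wal.put("a", "2")
wal.put("b", "3")
recovered = MiniWAL.replay(_path)
```

Note where the fsync lives: on the sequential log append, never on the randomly-updated table—the key insight from the flow above.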
Pattern 3: Periodic Checkpoints
Allow write-back most of the time, force sync at intervals:
import os
import threading
import time

class CheckpointedCache:
    def __init__(self, checkpoint_interval_sec=60):
        self.dirty_files = set()
        self.lock = threading.Lock()
        self.interval = checkpoint_interval_sec
        threading.Thread(target=self._checkpoint_loop, daemon=True).start()

    def write(self, path, data):
        # Normal write-back write
        with open(path, 'wb') as f:
            f.write(data)
        with self.lock:
            self.dirty_files.add(path)

    def _checkpoint_loop(self):
        while True:
            time.sleep(self.interval)
            # Atomically take ownership of the dirty set
            with self.lock:
                to_sync = self.dirty_files
                self.dirty_files = set()
            # Sync all dirty files
            for path in to_sync:
                fd = os.open(path, os.O_RDONLY)
                os.fsync(fd)
                os.close(fd)
            # At this point, system can survive crash with
            # at most 'interval' seconds of data loss
| Strategy | Max Data Loss | Latency (P50) | Latency (P99) | Use Case |
|---|---|---|---|---|
| Pure write-through | None | 10-20ms (HDD) | 100ms+ | Compliance-critical |
| Pure write-back | 30+ seconds | ~6µs | ~100ms (threshold) | Temporary data |
| Group commit (10ms) | 10ms | 10ms | 11ms | Transaction logs |
| Checkpoints (60s) | 60 seconds | ~6µs | ~60s (checkpoint) | Application data |
| WAL + write-back | 0 (committed) | ~10ms commit | ~15ms | Databases |
These hybrid patterns demonstrate that the write-through vs write-back choice isn't binary. By understanding both strategies deeply, you can design systems that achieve required durability guarantees with minimal performance overhead. The key is identifying exactly which data requires immediate persistence.
Write-back caching is the default behavior in modern operating systems, and for good reason—its performance benefits are dramatic. But those benefits come with complexity and risk that systems engineers must understand and manage.
What's Next
We've now explored both ends of the spectrum: write-through (immediate durability, poor performance) and write-back (excellent performance, deferred durability). Next, we'll examine delayed writes—a nuanced strategy that provides explicit control over when write-back data is flushed, enabling applications to make fine-grained durability/performance trade-offs.
You now understand write-back caching comprehensively—its mechanisms, performance characteristics, crash implications, and how production systems combine it with synchronization to achieve both performance and reliability.