When your application calls write(), something remarkable happens—or rather, something that should happen. The data must traverse multiple layers of abstraction, crossing kernel boundaries, navigating buffer caches, and ultimately finding its way to magnetic platters or flash cells where it can survive power failures, crashes, and the entropy of time.
But here's the question that has occupied operating system designers for decades: When exactly should data be committed to persistent storage?
This isn't a theoretical concern. The answer determines whether your database can guarantee ACID properties, whether your file system survives sudden power loss without corruption, and whether your application's performance suffers under write-heavy workloads. The "write strategy" you choose—or that your operating system chooses for you—represents one of the most consequential trade-offs in systems engineering.
We begin our exploration with the most conservative approach: write-through. It's conceptually simple, provides the strongest guarantees, and pays the highest performance cost. Understanding write-through deeply will provide the foundation for appreciating why more complex strategies exist and when each is appropriate.
By the end of this page, you will understand the precise mechanics of write-through caching, its implementation at the hardware and software levels, the mathematical basis for its performance characteristics, and the specific scenarios where write-through is not just acceptable but essential. You'll be equipped to make informed decisions about write strategies in production systems.
Write-through is a caching strategy where every write operation updates both the cache and the backing store synchronously before the write is considered complete. The operation only returns success to the caller after the data has been durably committed to persistent storage.
Let's be precise about what "synchronous" means in this context:
The write() system call does not return until the backing store confirms successful persistence. This creates a strict happens-before relationship: if write() returns success, the data is guaranteed to exist on the persistent medium. If the system crashes immediately after the call returns, the data survives.
Write-through inherently maintains cache coherence—the cache and backing store are never inconsistent. At any moment, the cache contains only data that also exists identically on the persistent storage. This property eliminates an entire category of failure modes and simplifies recovery logic.
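To make the invariant concrete, here is a minimal sketch of a toy write-through cache in C. The structure and function names (wt_cache, wt_cache_put) and the append-only backing file are illustrative assumptions, not a real API; the point is the ordering: persist and fsync() first, update the in-memory copy second, report success last.

#include <unistd.h>

#define CACHE_SLOTS 256

struct cache_entry {
    long key;
    long value;
    int  valid;
};

struct wt_cache {
    struct cache_entry slots[CACHE_SLOTS];
    int backing_fd;   /* file descriptor for the persistent backing store */
};

/* Returns 0 on success, -1 if the backing store write failed. */
int wt_cache_put(struct wt_cache *c, long key, long value) {
    long record[2] = { key, value };

    /* 1. Persist first: the write is not acknowledged until the
     *    backing store has it. fsync() forces it to the medium. */
    if (write(c->backing_fd, record, sizeof record) != (ssize_t)sizeof record)
        return -1;
    if (fsync(c->backing_fd) != 0)
        return -1;

    /* 2. Only now update the cache, so cache and backing store
     *    are never inconsistent (the coherence property above). */
    struct cache_entry *e = &c->slots[(unsigned long)key % CACHE_SLOTS];
    e->key = key;
    e->value = value;
    e->valid = 1;

    return 0;  /* success implies durability */
}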
The Write-Through Data Flow
Consider what happens when an application writes 4KB of data with write-through semantics:
Application: write(fd, buffer, 4096)
│
▼
┌─────────────────────────────────────────────────────────┐
│ User Space → Kernel Space Transition (syscall) │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ VFS Layer: Validate fd, check permissions │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Page Cache: Copy data to kernel buffer │
│ [Data now in RAM cache] │
└─────────────────────────────────────────────────────────┘
│
▼ (Write-through: DO NOT return yet)
┌─────────────────────────────────────────────────────────┐
│ Block Layer: Translate to block device request │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Device Driver: Issue I/O command to hardware │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Storage Device: Write to persistent medium │
│ [BARRIER: Wait for device confirmation] │
└─────────────────────────────────────────────────────────┘
│
▼ (Confirmation received)
┌─────────────────────────────────────────────────────────┐
│ Return path: Propagate success back to application │
└─────────────────────────────────────────────────────────┘
│
▼
Application: write() returns 4096 (success)
The critical observation is the BARRIER step. The application is blocked—suspended—until the storage device confirms the write completed. This waiting period is the source of both write-through's strength (durability guarantees) and its weakness (latency).
To truly understand write-through performance, we must descend to the hardware level. The latency penalty depends entirely on the physical characteristics of the storage medium.
Traditional Hard Disk Drives (HDDs)
For a spinning disk, a write operation involves seek time (moving the head to the target track, typically 4-10ms), rotational latency (waiting for the target sector to pass under the head, about 4.2ms on average at 7200 RPM), and the data transfer itself (negligible for a 4KB block).
A single write-through operation on a 7200 RPM HDD thus costs approximately 8-20ms. This yields a maximum of only 50-125 random write IOPS.
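To see where those numbers come from, here is a small sketch of the arithmetic; the seek and transfer figures are typical assumptions, not measurements of any specific drive.

#include <stdio.h>

int main(void) {
    /* Typical (assumed) figures for a 7200 RPM drive */
    double rpm            = 7200.0;
    double avg_seek_ms    = 8.0;                 /* head movement            */
    double rotation_ms    = 60000.0 / rpm;       /* 8.33 ms per revolution   */
    double avg_rot_lat_ms = rotation_ms / 2.0;   /* wait half a turn on avg  */
    double transfer_ms    = 0.02;                /* 4KB at ~200 MB/s         */

    double total_ms = avg_seek_ms + avg_rot_lat_ms + transfer_ms;  /* ~12.2 ms */
    double iops     = 1000.0 / total_ms;                           /* ~82      */

    printf("per-write latency: %.2f ms, random write IOPS: %.0f\n",
           total_ms, iops);
    printf("throughput at 4KB/write: %.2f MB/s\n", iops * 4096 / 1e6);
    return 0;
}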
| Storage Type | Random Write Latency | Effective Write IOPS | Write-Through Viability |
|---|---|---|---|
| HDD 7200 RPM | 8-20ms | 50-125 IOPS | Severely limited |
| HDD 15000 RPM | 4-10ms | 100-250 IOPS | Production-marginal |
| SATA SSD | 20-100µs | 10,000-50,000 IOPS | Viable for many workloads |
| NVMe SSD | 10-20µs | 50,000-500,000 IOPS | Excellent for most cases |
| Intel Optane | 7-10µs | 100,000-550,000 IOPS | Near write-back performance |
| NVRAM/Battery-backed | <1µs | 1,000,000 IOPS | Write-through recommended |
Solid State Drives: A Complicated Picture
SSDs dramatically reduce write latency, but the story has nuances:
Program/Erase Asymmetry: NAND flash can only be written to erased blocks. If no erased blocks are available, the SSD must erase first, which takes 1-5ms per block.
Write Amplification: The Flash Translation Layer (FTL) may need to write more data than requested (read-modify-write for partial page writes).
Device Caching: Most SSDs have internal DRAM caches. A write-through policy at the OS level doesn't guarantee immediate NAND persistence unless the device is configured appropriately or Forced Unit Access (FUA) is used.
The Hidden Problem: Device-Level Write Caching
Modern storage devices have their own caches. A "write-through" policy at the OS level only means the kernel waits for the device to acknowledge the write before write() returns; that acknowledgement may come from the device's volatile cache, not from the persistent medium. True durability requires either: disabling the device's volatile write cache, issuing explicit cache-flush or Forced Unit Access (FUA) commands, or using a device whose cache is non-volatile (battery- or capacitor-backed).
Many systems claiming write-through semantics actually implement 'write-through to device cache,' not 'write-through to persistent medium.' Verify your storage stack end-to-end. On Linux, use hdparm -W /dev/sdX to check the device write cache and hdparm -W0 /dev/sdX to disable it, or ensure your enterprise SSDs/RAID controllers have battery-backed cache.
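Beyond hdparm, reasonably recent Linux kernels also expose the device cache mode at /sys/block/<dev>/queue/write_cache, which reports either "write back" or "write through". A small sketch that reads it (the default device name is an assumption):

#include <stdio.h>

int main(int argc, char **argv) {
    const char *dev = (argc > 1) ? argv[1] : "sda";
    char path[256];
    snprintf(path, sizeof path, "/sys/block/%s/queue/write_cache", dev);

    FILE *f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }

    char mode[64] = {0};
    if (fgets(mode, sizeof mode, f)) {
        /* "write back" means the device may hold acknowledged writes in
         * volatile cache; flushes or FUA are then needed for durability. */
        printf("%s write cache mode: %s", dev, mode);
    }
    fclose(f);
    return 0;
}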
Operating systems provide multiple mechanisms for applications to request write-through semantics. Understanding these mechanisms is essential for building reliable systems.
Linux Implementation Mechanisms
#define _GNU_SOURCE     /* needed for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>     /* posix_memalign, free */
#include <string.h>     /* memset */
#include <unistd.h>

int main() {
    /*
     * Method 1: O_SYNC flag at open()
     *
     * Every write() blocks until data AND metadata are
     * durably on the storage medium. Strongest guarantee.
     */
    int fd_sync = open("/data/critical.dat",
                       O_WRONLY | O_CREAT | O_SYNC, 0644);
    if (fd_sync < 0) {
        perror("O_SYNC open failed");
        return 1;
    }

    // This write blocks until data is on disk
    // Guaranteed durable when write() returns
    write(fd_sync, "critical data", 13);
    close(fd_sync);

    /*
     * Method 2: O_DSYNC flag at open()
     *
     * Every write() blocks until DATA is durable.
     * Metadata (timestamps, etc.) may be buffered.
     * Slightly better performance than O_SYNC.
     */
    int fd_dsync = open("/data/important.dat",
                        O_WRONLY | O_CREAT | O_DSYNC, 0644);
    write(fd_dsync, "important data", 14);
    close(fd_dsync);

    /*
     * Method 3: fsync() / fdatasync() after write()
     *
     * Allows selective synchronization - batch multiple
     * writes then force durability at commit points.
     */
    int fd_batch = open("/data/batched.dat",
                        O_WRONLY | O_CREAT, 0644);

    // These writes go to page cache (fast)
    write(fd_batch, "record1\n", 8);
    write(fd_batch, "record2\n", 8);
    write(fd_batch, "record3\n", 8);

    // Now force everything to disk (slow, but amortized)
    if (fsync(fd_batch) < 0) {
        perror("fsync failed - data may not be durable!");
    }
    close(fd_batch);

    /*
     * Method 4: O_DIRECT + fsync()
     *
     * Bypass page cache entirely, write directly to device.
     * Must use aligned buffers and sizes.
     * Application manages its own caching.
     */
    void *aligned_buf;
    posix_memalign(&aligned_buf, 4096, 4096);
    memset(aligned_buf, 0, 4096);
    sprintf((char *)aligned_buf, "Direct I/O data");

    int fd_direct = open("/data/direct.dat",
                         O_WRONLY | O_CREAT | O_DIRECT, 0644);

    // Write bypasses page cache
    ssize_t written = write(fd_direct, aligned_buf, 4096);
    (void)written;

    // Still need fsync for durability guarantee
    fsync(fd_direct);

    free(aligned_buf);
    close(fd_direct);

    return 0;
}

Understanding the Semantic Differences
The distinctions between these methods matter enormously for correctness:
| Method | Data Durable | Metadata Durable | Cache State | Performance Impact |
|---|---|---|---|---|
| O_SYNC | On return | On return | Updated | Highest latency |
| O_DSYNC | On return | Eventually | Updated | Slightly better |
| fsync() | After call | After call | Updated | Amortizable |
| fdatasync() | After call | Partially | Updated | Better than fsync |
| O_DIRECT | Manual control | Manual control | Bypassed | Predictable latency |
What "Metadata Durable" Means
File metadata includes the file's size, its modification and access timestamps, ownership and permissions, and the block-allocation information that maps file offsets to locations on disk.
For pure data integrity (e.g., database files), O_DSYNC is often sufficient. For filesystem consistency (e.g., after creating a new file), full O_SYNC or fsync() may be required.
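One concrete case where metadata durability matters: after creating a new file, the directory entry itself is metadata of the parent directory, so crash-safe creation needs an fsync() on the directory as well as on the file. A minimal sketch, assuming a hypothetical /data/new.dat:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Durably create /data/new.dat: fsync the file AND its parent directory.
 * Without the directory fsync, a crash can leave the data persisted but
 * the file name missing from the directory. */
int create_durable(void) {
    int fd = open("/data/new.dat", O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0) { perror("open"); return -1; }

    if (write(fd, "hello\n", 6) != 6) { perror("write"); close(fd); return -1; }
    if (fsync(fd) != 0)               { perror("fsync file"); close(fd); return -1; }
    close(fd);

    /* Persist the new directory entry itself */
    int dirfd = open("/data", O_RDONLY | O_DIRECTORY);
    if (dirfd < 0) { perror("open dir"); return -1; }
    if (fsync(dirfd) != 0) { perror("fsync dir"); close(dirfd); return -1; }
    close(dirfd);

    return 0;
}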
Production databases typically use O_DIRECT with explicit fsync() at commit points. This gives them full control over what's cached (their buffer pool), when writes happen (after transaction log), and when durability is enforced (at commit). PostgreSQL, MySQL InnoDB, and RocksDB all follow this pattern.
Let's develop a rigorous understanding of write-through performance through mathematical modeling. This analysis enables capacity planning and helps determine when write-through is viable.
Single-Threaded Write Throughput Model
For sequential write-through operations:
Throughput = Write_Size / (T_syscall + T_cache + T_io + T_confirm)
Where:
T_syscall = System call overhead (~1-5µs)
T_cache = Page cache copy time (~0.1µs per KB)
T_io = Device I/O latency (varies by device)
T_confirm = Confirmation propagation (~1-2µs)
For a 4KB write to various devices:
| Device | T_io | Total Latency | Throughput (4KB) | Throughput (MB/s) |
|---|---|---|---|---|
| HDD 7200 RPM | 12ms | 12.01ms | 83 ops/s | 0.33 MB/s |
| SATA SSD | 50µs | 56µs | 17,857 ops/s | 70 MB/s |
| NVMe SSD | 15µs | 21µs | 47,619 ops/s | 186 MB/s |
| Optane | 8µs | 14µs | 71,428 ops/s | 279 MB/s |
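As a sanity check, the table rows above can be approximately reproduced from the model; the fixed overheads in the sketch below are assumed values consistent with the ranges given earlier.

#include <stdio.h>

int main(void) {
    /* Assumed fixed overheads from the model above (~6 µs total) */
    double t_syscall_us = 4.0;    /* syscall entry/exit         */
    double t_cache_us   = 0.5;    /* copy 4KB into page cache   */
    double t_confirm_us = 1.5;    /* completion propagation     */

    const char *names[] = { "HDD 7200 RPM", "SATA SSD", "NVMe SSD", "Optane" };
    double t_io_us[]    = { 12000.0, 50.0, 15.0, 8.0 };

    for (int i = 0; i < 4; i++) {
        double total_us = t_syscall_us + t_cache_us + t_io_us[i] + t_confirm_us;
        double ops      = 1e6 / total_us;
        /* Throughput reported in binary MB (MiB), matching the table */
        printf("%-14s total %8.1f us  %9.0f ops/s  %7.2f MB/s\n",
               names[i], total_us, ops, ops * 4096.0 / (1024.0 * 1024.0));
    }
    return 0;
}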
Queue Depth Impact
Write-through with queue depth 1 (one operation at a time) is the worst case. Modern storage supports concurrent operations:
Effective_Throughput = Base_Throughput × min(Queue_Depth, Device_Parallelism)
NVMe SSDs support queue depths of 64K+ across multiple queues. Even with write-through, parallelism helps: multiple threads (or asynchronous I/O) can each have a synchronous write in flight, so aggregate throughput scales with queue depth up to the device's internal parallelism.
However, write-through with multiple threads complicates ordering guarantees. Each thread's writes are individually ordered, but inter-thread ordering requires additional synchronization.
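The sketch below illustrates one way to exploit that parallelism under stated assumptions: an O_DSYNC file on a fast SSD, with each thread issuing its own individually durable pwrite() calls to a distinct region. It is not a benchmark harness, and cross-thread ordering is deliberately left to the application, as noted above. Build with -pthread.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define THREADS     8
#define WRITES      1000
#define BLOCK_SIZE  4096

static int g_fd;  /* opened with O_DSYNC: every write is individually durable */

static void *writer(void *arg) {
    long id = (long)arg;
    char block[BLOCK_SIZE] = {0};

    for (int i = 0; i < WRITES; i++) {
        /* Each thread writes its own region; pwrite() is safe to use
         * concurrently because it does not touch the shared file offset. */
        off_t off = ((off_t)id * WRITES + i) * BLOCK_SIZE;
        if (pwrite(g_fd, block, BLOCK_SIZE, off) != BLOCK_SIZE) {
            perror("pwrite");
            break;
        }
    }
    return NULL;
}

int main(void) {
    g_fd = open("/data/parallel.dat", O_WRONLY | O_CREAT | O_DSYNC, 0644);
    if (g_fd < 0) { perror("open"); return 1; }

    pthread_t tids[THREADS];
    for (long i = 0; i < THREADS; i++)
        pthread_create(&tids[i], NULL, writer, (void *)i);
    for (int i = 0; i < THREADS; i++)
        pthread_join(tids[i], NULL);

    close(g_fd);
    return 0;
}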
Latency Distribution Analysis
Real-world write latency follows a distribution, not a fixed value:
Typical NVMe SSD Write Latency Distribution:
P50 (median): 15µs — Half of writes faster than this
P90: 25µs — 90% of writes faster
P99: 50µs — 99% of writes faster
P99.9: 200µs — Occasional GC delays
P99.99: 2ms — Rare block erase delays
For write-through applications, tail latency matters enormously. A single slow write blocks the entire transaction. This is why write-heavy workloads often prefer write-back with periodic synchronization—the latency variance is absorbed.
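The distribution above is illustrative; to measure your own device's tail, a minimal sketch (assuming O_DSYNC writes to a hypothetical /data/latency.dat) is to record per-write timings and report percentiles:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define N 10000

static int cmp(const void *a, const void *b) {
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

int main(void) {
    int fd = open("/data/latency.dat", O_WRONLY | O_CREAT | O_DSYNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    static double lat_us[N];
    char block[4096] = {0};

    for (int i = 0; i < N; i++) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        pwrite(fd, block, sizeof block, (off_t)i * (off_t)sizeof block);
        clock_gettime(CLOCK_MONOTONIC, &b);
        lat_us[i] = (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
    }
    close(fd);

    /* Sort latencies and read off the percentiles */
    qsort(lat_us, N, sizeof lat_us[0], cmp);
    printf("P50 %.1f us  P90 %.1f us  P99 %.1f us  P99.9 %.1f us\n",
           lat_us[N / 2], lat_us[N * 9 / 10], lat_us[N * 99 / 100],
           lat_us[N * 999 / 1000]);
    return 0;
}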
Cost-Benefit Trade-off Formula
For a given workload, the viability of write-through can be expressed as:
Device_Utilization = Required_Throughput × Latency_P99
Write_Through_Viable = Device_Utilization < 1.0 (i.e., well below 100% utilization)
Example:
Required: 1000 writes/second
P99 latency: 50µs (NVMe SSD)
1000 × 0.00005s = 0.05 = 5% device utilization
✓ Write-through is viable with substantial headroom
Counter-example:
Required: 50,000 writes/second
P99 latency: 50µs
50,000 × 0.00005s = 2.5 = 250% device utilization
✗ Write-through cannot meet throughput requirements
When pure write-through cannot meet throughput requirements, the common solution is batched synchronization: buffer N writes, then issue one fsync(). This achieves write-back performance with periodic durability guarantees. Most databases use variants of this approach with their write-ahead log.
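A minimal sketch of that batched pattern follows; the names wal_append() and wal_commit() and the log path are illustrative, not any particular database's API. Writes accumulate in the page cache, and a single fsync() at the commit point makes the whole batch durable at once.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Append a record without forcing it to disk yet (fast path). */
static int wal_append(int fd, const void *rec, size_t len) {
    return write(fd, rec, len) == (ssize_t)len ? 0 : -1;
}

/* Commit point: one fsync() covers every record appended since the
 * previous commit, amortizing the device latency across the batch. */
static int wal_commit(int fd) {
    return fsync(fd);
}

int main(void) {
    int fd = open("/data/wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int batch = 0; batch < 100; batch++) {
        /* Buffer a batch of records... */
        for (int i = 0; i < 64; i++) {
            char rec[64];
            int n = snprintf(rec, sizeof rec, "batch=%d record=%d\n", batch, i);
            if (wal_append(fd, rec, (size_t)n) != 0) { perror("append"); return 1; }
        }
        /* ...then pay the device latency once per batch. */
        if (wal_commit(fd) != 0) { perror("fsync"); return 1; }
    }
    close(fd);
    return 0;
}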
Despite its performance costs, write-through is the correct choice—sometimes the only correct choice—in specific scenarios. Understanding these cases helps you make principled trade-offs rather than defaulting to "fastest possible."
Case Study: PostgreSQL WAL Synchronization
PostgreSQL provides a canonical example of calibrated write-through usage. Its synchronous_commit setting controls when WAL records are durable:
| Setting | Behavior | Durability | Performance |
|---|---|---|---|
| off | Return immediately | May lose last ~600ms | Fastest |
| local | fsync to local disk | Survives local crash | Good |
| remote_write | Wait for replica receipt | Survives local crash | Moderate |
| on (default) | Wait for replica flush | Survives one node failure | Baseline |
| remote_apply | Wait for replica visibility | Full consistency | Slowest |
Each level adds latency but strengthens guarantees. Production systems choose based on their tolerance for data loss versus performance requirements.
The "Two-Second Rule"
A practical heuristic: if losing the last 2 seconds of data would be catastrophic, you need write-through or near-synchronous semantics. If losing 2 seconds is inconvenient but recoverable (e.g., user can retry), write-back with periodic sync is usually acceptable.
This rule emerges from typical power failure scenarios—a UPS can usually sustain systems long enough to flush buffers if the outage is brief, but sudden failures give no warning.
Use write-through for data whose loss would violate correctness guarantees (transactions, consensus) or external requirements (compliance, contracts). Use write-back for data that can be reconstructed or whose loss is merely inconvenient (caches, temporary files, recoverable state).
The write-through concept originated in CPU cache design, and understanding this context illuminates why the trade-offs work differently at different system layers.
CPU Cache Write-Through Architecture
In a write-through CPU cache:
CPU Core 0 CPU Core 1
│ │
▼ ▼
┌─────────┐ ┌─────────┐
│ L1 Cache│ │ L1 Cache│
│ (8KB) │ │ (8KB) │
└────┬────┘ └────┬────┘
│ Write propagates immediately │
▼ ▼
┌──────────────────────────────────────────┐
│ L2 Cache (256KB) │
│ [Always coherent with L1 writes] │
└──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ Main Memory (DDR) │
└──────────────────────────────────────────┘
Why CPU Caches Favor Write-Back
Modern CPUs overwhelmingly use write-back caches because:
Memory Bus Saturation: Write-through generates memory traffic for every store instruction. A CPU executing 1 billion stores/second would saturate any memory bus.
Write Combining: Write-back allows multiple writes to the same cache line to be absorbed, sending only the final value to memory.
Power Consumption: Memory writes consume significantly more power than cache writes. Write-back reduces memory access frequency.
However, examining when write-through is used in CPU caches reveals instructive patterns:
| Context | Cache Strategy | Reason |
|---|---|---|
| L1 Data Cache | Write-back | Performance critical |
| L1 Instruction Cache | Write-through (effectively) | Rarely written |
| Embedded Systems | Sometimes write-through | Simplicity, real-time predictability |
| Graphics Frame Buffers | Write-combining (variant) | Streaming writes to display |
| Memory-Mapped I/O | Uncached or write-through | Device expects immediate visibility |
Notice that write-through becomes more viable as the speed gap between cache and backing store narrows. Main memory is roughly 100x slower than an L1 cache, so write-through at the CPU level is prohibitive. An NVMe SSD is roughly 100x slower than RAM, so write-through at the OS level is painful but usable. Optane-class persistent memory is only about 3x slower than RAM, making write-through nearly transparent.
Implementing write-through correctly requires attention to subtle details that are easy to overlook. A system that appears to be write-through but isn't can silently lose data.
Essential Verification Steps
Verify Device Cache Settings
# Check HDD/SSD write cache status
hdparm -W /dev/sda
# Disable write cache for true write-through
hdparm -W0 /dev/sda
Confirm File System Mount Options
# Mount with sync option (all writes synchronous)
mount -o sync /dev/sda1 /data
# Or use dirsync for directory operations only
mount -o dirsync /dev/sda1 /data
Validate at Application Level
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

// Verify O_SYNC/O_DSYNC is honored by timing synchronous writes
int main(void) {
    int fd = open("/data/test", O_WRONLY | O_CREAT | O_SYNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    char buf[4096] = {0};

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < 1000; i++) {
        write(fd, buf, 4096);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ms = (end.tv_sec - start.tv_sec) * 1000 +
                        (end.tv_nsec - start.tv_nsec) / 1000000.0;
    // For NVMe: expect ~15-50ms total (15-50µs × 1000)
    // If <<1ms total, writes are being cached!
    printf("1000 sync writes took %.2f ms\n", elapsed_ms);
    return 0;
}
Error Handling Imperatives
Write-through is only as reliable as your error handling. Every synchronous write can fail, and the failure MUST be handled:
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* log_error() is assumed to be provided elsewhere by the application. */

ssize_t safe_sync_write(int fd, const void *buf, size_t count) {
    size_t written = 0;
    while (written < count) {
        ssize_t result = write(fd, (const char *)buf + written, count - written);
        if (result < 0) {
            if (errno == EINTR) {
                continue;  // Retry on interrupt
            }
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                // Should not happen with sync writes
                // Log and abort - something is wrong
                log_error("Unexpected EAGAIN on sync write");
                return -1;
            }
            // Hard errors: ENOSPC, EIO, EDQUOT, etc.
            log_error("Sync write failed: %s", strerror(errno));
            return -1;
        }
        written += (size_t)result;
    }
    // For O_SYNC files, data is durable here
    // For regular files, still need fsync
    return (ssize_t)written;
}
RAID Controller Considerations
Enterprise RAID controllers often have battery-backed write caches (BBU). This changes the durability equation: the controller can acknowledge a write as soon as it lands in its cache, because the battery (or flash backup) preserves the cache contents across a power failure until they can be flushed to the disks.
This is the best of both worlds: write-through semantics with write-back performance. However: the battery degrades and must be monitored and replaced; many controllers fall back to a much slower true write-through mode while the battery is failed or charging; and the protection covers only the controller's own cache, so any volatile caches on the drives behind it still need to be disabled or flushed.
Virtual machines add layers of caching invisible to the guest OS. A guest's O_SYNC write may sit in the hypervisor's cache, the host's page cache, or a SAN's cache. Always verify durability end-to-end, especially in cloud environments where you don't control the storage stack.
We have comprehensively examined write-through as the most conservative write strategy—one that trades performance for immediate durability guarantees. Let's consolidate the essential insights: every write updates the cache and the backing store synchronously, so a successful write() implies the data is already persistent; that guarantee costs one full device round trip per write, which is crippling on spinning disks and acceptable on fast flash and NVRAM; device-level caches can silently weaken the guarantee unless they are disabled, flushed, or battery-backed; and write-through is the right choice wherever losing acknowledged data would violate correctness, consensus, or compliance requirements.
What's Next
Write-through establishes our baseline—the safest, slowest approach. Next, we'll explore write-back caching, the strategy that most systems use by default. Write-back trades immediate durability for dramatically improved performance by allowing dirty data to accumulate in cache. Understanding write-back's mechanisms, risks, and proper usage is essential for any systems engineer working with storage.
You now understand write-through caching at a fundamental level—its mechanics, performance characteristics, hardware considerations, and appropriate use cases. This foundation is essential for appreciating why more complex write strategies exist and when to apply them.