When your application calls write(), something remarkable happens—or rather, something that should happen. The data must traverse multiple layers of abstraction, crossing kernel boundaries, navigating buffer caches, and ultimately finding its way to magnetic platters or flash cells where it can survive power failures, crashes, and the entropy of time.
But here's the question that has occupied operating system designers for decades: When exactly should data be committed to persistent storage?
This isn't a theoretical concern. The answer determines whether your database can guarantee ACID properties, whether your file system survives sudden power loss without corruption, and whether your application's performance suffers under write-heavy workloads. The "write strategy" you choose—or that your operating system chooses for you—represents one of the most consequential trade-offs in systems engineering.
We begin our exploration with the most conservative approach: write-through. It's conceptually simple, provides the strongest guarantees, and pays the highest performance cost. Understanding write-through deeply will provide the foundation for appreciating why more complex strategies exist and when each is appropriate.
By the end of this page, you will understand the precise mechanics of write-through caching, its implementation at the hardware and software levels, the mathematical basis for its performance characteristics, and the specific scenarios where write-through is not just acceptable but essential. You'll be equipped to make informed decisions about write strategies in production systems.
Write-through is a caching strategy where every write operation updates both the cache and the backing store synchronously before the write is considered complete. The operation only returns success to the caller after the data has been durably committed to persistent storage.
Let's be precise about what "synchronous" means in this context:
The write() system call does not return until the backing store confirms successful persistence. This creates a strict happens-before relationship: if write() returns success, the data is guaranteed to exist on the persistent medium. If the system crashes immediately after the call returns, the data survives.
Write-through inherently maintains cache coherence—the cache and backing store are never inconsistent. At any moment, the cache contains only data that also exists identically on the persistent storage. This property eliminates an entire category of failure modes and simplifies recovery logic.
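To make the invariant concrete, here is a minimal sketch of a toy write-through cache in C. The structure and function names (wt_cache, wt_cache_put) and the append-only backing file are illustrative assumptions, not a real API; the point is the ordering: persist and fsync() first, update the in-memory copy second, report success last.

#include <unistd.h>

#define CACHE_SLOTS 256

struct cache_entry {
    long key;
    long value;
    int  valid;
};

struct wt_cache {
    struct cache_entry slots[CACHE_SLOTS];
    int backing_fd;   /* file descriptor for the persistent backing store */
};

/* Returns 0 on success, -1 if the backing store write failed. */
int wt_cache_put(struct wt_cache *c, long key, long value) {
    long record[2] = { key, value };

    /* 1. Persist first: the write is not acknowledged until the
     *    backing store has it. fsync() forces it to the medium. */
    if (write(c->backing_fd, record, sizeof record) != (ssize_t)sizeof record)
        return -1;
    if (fsync(c->backing_fd) != 0)
        return -1;

    /* 2. Only now update the cache, so cache and backing store
     *    are never inconsistent (the coherence property above). */
    struct cache_entry *e = &c->slots[(unsigned long)key % CACHE_SLOTS];
    e->key = key;
    e->value = value;
    e->valid = 1;

    return 0;  /* success implies durability */
}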
The Write-Through Data Flow
Consider what happens when an application writes 4KB of data with write-through semantics:
Application: write(fd, buffer, 4096)
│
▼
┌─────────────────────────────────────────────────────────┐
│ User Space → Kernel Space Transition (syscall) │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ VFS Layer: Validate fd, check permissions │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Page Cache: Copy data to kernel buffer │
│ [Data now in RAM cache] │
└─────────────────────────────────────────────────────────┘
│
▼ (Write-through: DO NOT return yet)
┌─────────────────────────────────────────────────────────┐
│ Block Layer: Translate to block device request │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Device Driver: Issue I/O command to hardware │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Storage Device: Write to persistent medium │
│ [BARRIER: Wait for device confirmation] │
└─────────────────────────────────────────────────────────┘
│
▼ (Confirmation received)
┌─────────────────────────────────────────────────────────┐
│ Return path: Propagate success back to application │
└─────────────────────────────────────────────────────────┘
│
▼
Application: write() returns 4096 (success)
The critical observation is the BARRIER step. The application is blocked—suspended—until the storage device confirms the write completed. This waiting period is the source of both write-through's strength (durability guarantees) and its weakness (latency).
To truly understand write-through performance, we must descend to the hardware level. The latency penalty depends entirely on the physical characteristics of the storage medium.
Traditional Hard Disk Drives (HDDs)
For a spinning disk, a write operation involves seek time (moving the head to the target track, typically 4-10ms), rotational latency (waiting for the target sector to pass under the head, about 4.2ms on average at 7200 RPM), and the data transfer itself (negligible for a 4KB block).
A single write-through operation on a 7200 RPM HDD thus costs approximately 8-20ms. This yields a maximum of only 50-125 random write IOPS.
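To see where those numbers come from, here is a small sketch of the arithmetic; the seek and transfer figures are typical assumptions, not measurements of any specific drive.

#include <stdio.h>

int main(void) {
    /* Typical (assumed) figures for a 7200 RPM drive */
    double rpm            = 7200.0;
    double avg_seek_ms    = 8.0;                 /* head movement            */
    double rotation_ms    = 60000.0 / rpm;       /* 8.33 ms per revolution   */
    double avg_rot_lat_ms = rotation_ms / 2.0;   /* wait half a turn on avg  */
    double transfer_ms    = 0.02;                /* 4KB at ~200 MB/s         */

    double total_ms = avg_seek_ms + avg_rot_lat_ms + transfer_ms;  /* ~12.2 ms */
    double iops     = 1000.0 / total_ms;                           /* ~82      */

    printf("per-write latency: %.2f ms, random write IOPS: %.0f\n",
           total_ms, iops);
    printf("throughput at 4KB/write: %.2f MB/s\n", iops * 4096 / 1e6);
    return 0;
}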
| Storage Type | Random Write Latency | Effective Write IOPS | Write-Through Viability |
|---|---|---|---|
| HDD 7200 RPM | 8-20ms | 50-125 IOPS | Severely limited |
| HDD 15000 RPM | 4-10ms | 100-250 IOPS | Production-marginal |
| SATA SSD | 20-100µs | 10,000-50,000 IOPS | Viable for many workloads |
| NVMe SSD | 10-20µs | 50,000-500,000 IOPS | Excellent for most cases |
| Intel Optane | 7-10µs | 100,000-550,000 IOPS | Near write-back performance |
| NVRAM/Battery-backed | <1µs | 1,000,000 IOPS | Write-through recommended |
Solid State Drives: A Complicated Picture
SSDs dramatically reduce write latency, but the story has nuances:
Program/Erase Asymmetry: NAND flash can only be written to erased blocks. If no erased blocks are available, the SSD must erase first, which takes 1-5ms per block.
Write Amplification: The Flash Translation Layer (FTL) may need to write more data than requested (read-modify-write for partial page writes).
Device Caching: Most SSDs have internal DRAM caches. A write-through policy at the OS level doesn't guarantee immediate NAND persistence unless the device is configured appropriately or Forced Unit Access (FUA) is used.
The Hidden Problem: Device-Level Write Caching
Modern storage devices have their own caches. A "write-through" policy at the OS level only means the kernel waits for the device to acknowledge the write before write() returns; that acknowledgement may come from the device's volatile cache, not from the persistent medium. True durability requires either: disabling the device's volatile write cache, issuing explicit cache-flush or Forced Unit Access (FUA) commands, or using a device whose cache is non-volatile (battery- or capacitor-backed).
Many systems claiming write-through semantics actually implement 'write-through to device cache,' not 'write-through to persistent medium.' Verify your storage stack end-to-end. On Linux, use hdparm -W /dev/sdX to check the device write cache and hdparm -W0 /dev/sdX to disable it, or ensure your enterprise SSDs/RAID controllers have battery-backed cache.
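Beyond hdparm, reasonably recent Linux kernels also expose the device cache mode at /sys/block/<dev>/queue/write_cache, which reports either "write back" or "write through". A small sketch that reads it (the default device name is an assumption):

#include <stdio.h>

int main(int argc, char **argv) {
    const char *dev = (argc > 1) ? argv[1] : "sda";
    char path[256];
    snprintf(path, sizeof path, "/sys/block/%s/queue/write_cache", dev);

    FILE *f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }

    char mode[64] = {0};
    if (fgets(mode, sizeof mode, f)) {
        /* "write back" means the device may hold acknowledged writes in
         * volatile cache; flushes or FUA are then needed for durability. */
        printf("%s write cache mode: %s", dev, mode);
    }
    fclose(f);
    return 0;
}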
Operating systems provide multiple mechanisms for applications to request write-through semantics. Understanding these mechanisms is essential for building reliable systems.
Linux Implementation Mechanisms
#define _GNU_SOURCE     /* needed for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>     /* posix_memalign, free */
#include <string.h>     /* memset */
#include <unistd.h>

int main() {
    /*
     * Method 1: O_SYNC flag at open()
     *
     * Every write() blocks until data AND metadata are
     * durably on the storage medium. Strongest guarantee.
     */
    int fd_sync = open("/data/critical.dat",
                       O_WRONLY | O_CREAT | O_SYNC, 0644);
    if (fd_sync < 0) {
        perror("O_SYNC open failed");
        return 1;
    }

    // This write blocks until data is on disk
    // Guaranteed durable when write() returns
    write(fd_sync, "critical data", 13);
    close(fd_sync);

    /*
     * Method 2: O_DSYNC flag at open()
     *
     * Every write() blocks until DATA is durable.
     * Metadata (timestamps, etc.) may be buffered.
     * Slightly better performance than O_SYNC.
     */
    int fd_dsync = open("/data/important.dat",
                        O_WRONLY | O_CREAT | O_DSYNC, 0644);
    write(fd_dsync, "important data", 14);
    close(fd_dsync);

    /*
     * Method 3: fsync() / fdatasync() after write()
     *
     * Allows selective synchronization - batch multiple
     * writes then force durability at commit points.
     */
    int fd_batch = open("/data/batched.dat",
                        O_WRONLY | O_CREAT, 0644);

    // These writes go to page cache (fast)
    write(fd_batch, "record1\n", 8);
    write(fd_batch, "record2\n", 8);
    write(fd_batch, "record3\n", 8);

    // Now force everything to disk (slow, but amortized)
    if (fsync(fd_batch) < 0) {
        perror("fsync failed - data may not be durable!");
    }
    close(fd_batch);

    /*
     * Method 4: O_DIRECT + fsync()
     *
     * Bypass page cache entirely, write directly to device.
     * Must use aligned buffers and sizes.
     * Application manages its own caching.
     */
    void *aligned_buf;
    posix_memalign(&aligned_buf, 4096, 4096);
    memset(aligned_buf, 0, 4096);
    sprintf((char *)aligned_buf, "Direct I/O data");

    int fd_direct = open("/data/direct.dat",
                         O_WRONLY | O_CREAT | O_DIRECT, 0644);

    // Write bypasses page cache
    ssize_t written = write(fd_direct, aligned_buf, 4096);
    (void)written;

    // Still need fsync for durability guarantee
    fsync(fd_direct);

    free(aligned_buf);
    close(fd_direct);

    return 0;
}

Understanding the Semantic Differences
The distinctions between these methods matter enormously for correctness:
| Method | Data Durable | Metadata Durable | Cache State | Performance Impact |
|---|---|---|---|---|
| O_SYNC | On return | On return | Updated | Highest latency |
| O_DSYNC | On return | Eventually | Updated | Slightly better |
| fsync() | After call | After call | Updated | Amortizable |
| fdatasync() | After call | Partially | Updated | Better than fsync |
| O_DIRECT | Manual control | Manual control | Bypassed | Predictable latency |
What "Metadata Durable" Means
File metadata includes the file's size, its modification and access timestamps, ownership and permissions, and the block-allocation information that maps file offsets to locations on disk.
For pure data integrity (e.g., database files), O_DSYNC is often sufficient. For filesystem consistency (e.g., after creating a new file), full O_SYNC or fsync() may be required.
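One concrete case where metadata durability matters: after creating a new file, the directory entry itself is metadata of the parent directory, so crash-safe creation needs an fsync() on the directory as well as on the file. A minimal sketch, assuming a hypothetical /data/new.dat:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Durably create /data/new.dat: fsync the file AND its parent directory.
 * Without the directory fsync, a crash can leave the data persisted but
 * the file name missing from the directory. */
int create_durable(void) {
    int fd = open("/data/new.dat", O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0) { perror("open"); return -1; }

    if (write(fd, "hello\n", 6) != 6) { perror("write"); close(fd); return -1; }
    if (fsync(fd) != 0)               { perror("fsync file"); close(fd); return -1; }
    close(fd);

    /* Persist the new directory entry itself */
    int dirfd = open("/data", O_RDONLY | O_DIRECTORY);
    if (dirfd < 0) { perror("open dir"); return -1; }
    if (fsync(dirfd) != 0) { perror("fsync dir"); close(dirfd); return -1; }
    close(dirfd);

    return 0;
}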
Production databases typically use O_DIRECT with explicit fsync() at commit points. This gives them full control over what's cached (their buffer pool), when writes happen (after transaction log), and when durability is enforced (at commit). PostgreSQL, MySQL InnoDB, and RocksDB all follow this pattern.
Let's develop a rigorous understanding of write-through performance through mathematical modeling. This analysis enables capacity planning and helps determine when write-through is viable.
Single-Threaded Write Throughput Model
For sequential write-through operations:
Throughput = Write_Size / (T_syscall + T_cache + T_io + T_confirm)
Where:
T_syscall = System call overhead (~1-5µs)
T_cache = Page cache copy time (~0.1µs per KB)
T_io = Device I/O latency (varies by device)
T_confirm = Confirmation propagation (~1-2µs)
For a 4KB write to various devices:
| Device | T_io | Total Latency | Throughput (4KB) | Throughput (MB/s) |
|---|---|---|---|---|
| HDD 7200 RPM | 12ms | 12.01ms | 83 ops/s | 0.33 MB/s |
| SATA SSD | 50µs | 56µs | 17,857 ops/s | 70 MB/s |
| NVMe SSD | 15µs | 21µs | 47,619 ops/s | 186 MB/s |
| Optane | 8µs | 14µs | 71,428 ops/s | 279 MB/s |
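As a sanity check, the table rows above can be approximately reproduced from the model; the fixed overheads in the sketch below are assumed values consistent with the ranges given earlier.

#include <stdio.h>

int main(void) {
    /* Assumed fixed overheads from the model above (~6 µs total) */
    double t_syscall_us = 4.0;    /* syscall entry/exit         */
    double t_cache_us   = 0.5;    /* copy 4KB into page cache   */
    double t_confirm_us = 1.5;    /* completion propagation     */

    const char *names[] = { "HDD 7200 RPM", "SATA SSD", "NVMe SSD", "Optane" };
    double t_io_us[]    = { 12000.0, 50.0, 15.0, 8.0 };

    for (int i = 0; i < 4; i++) {
        double total_us = t_syscall_us + t_cache_us + t_io_us[i] + t_confirm_us;
        double ops      = 1e6 / total_us;
        /* Throughput reported in binary MB (MiB), matching the table */
        printf("%-14s total %8.1f us  %9.0f ops/s  %7.2f MB/s\n",
               names[i], total_us, ops, ops * 4096.0 / (1024.0 * 1024.0));
    }
    return 0;
}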
Queue Depth Impact
Write-through with queue depth 1 (one operation at a time) is the worst case. Modern storage supports concurrent operations:
Effective_Throughput = Base_Throughput × min(Queue_Depth, Device_Parallelism)
NVMe SSDs support queue depths of 64K+ across multiple queues. Even with write-through, parallelism helps: multiple threads (or asynchronous I/O) can each have a synchronous write in flight, so aggregate throughput scales with queue depth up to the device's internal parallelism.
However, write-through with multiple threads complicates ordering guarantees. Each thread's writes are individually ordered, but inter-thread ordering requires additional synchronization.
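The sketch below illustrates one way to exploit that parallelism under stated assumptions: an O_DSYNC file on a fast SSD, with each thread issuing its own individually durable pwrite() calls to a distinct region. It is not a benchmark harness, and cross-thread ordering is deliberately left to the application, as noted above. Build with -pthread.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define THREADS     8
#define WRITES      1000
#define BLOCK_SIZE  4096

static int g_fd;  /* opened with O_DSYNC: every write is individually durable */

static void *writer(void *arg) {
    long id = (long)arg;
    char block[BLOCK_SIZE] = {0};

    for (int i = 0; i < WRITES; i++) {
        /* Each thread writes its own region; pwrite() is safe to use
         * concurrently because it does not touch the shared file offset. */
        off_t off = ((off_t)id * WRITES + i) * BLOCK_SIZE;
        if (pwrite(g_fd, block, BLOCK_SIZE, off) != BLOCK_SIZE) {
            perror("pwrite");
            break;
        }
    }
    return NULL;
}

int main(void) {
    g_fd = open("/data/parallel.dat", O_WRONLY | O_CREAT | O_DSYNC, 0644);
    if (g_fd < 0) { perror("open"); return 1; }

    pthread_t tids[THREADS];
    for (long i = 0; i < THREADS; i++)
        pthread_create(&tids[i], NULL, writer, (void *)i);
    for (int i = 0; i < THREADS; i++)
        pthread_join(tids[i], NULL);

    close(g_fd);
    return 0;
}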
Latency Distribution Analysis
Real-world write latency follows a distribution, not a fixed value:
Typical NVMe SSD Write Latency Distribution:
P50 (median): 15µs — Half of writes faster than this
P90: 25µs — 90% of writes faster
P99: 50µs — 99% of writes faster
P99.9: 200µs — Occasional GC delays
P99.99: 2ms — Rare block erase delays
For write-through applications, tail latency matters enormously. A single slow write blocks the entire transaction. This is why write-heavy workloads often prefer write-back with periodic synchronization—the latency variance is absorbed.
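The distribution above is illustrative; to measure your own device's tail, a minimal sketch (assuming O_DSYNC writes to a hypothetical /data/latency.dat) is to record per-write timings and report percentiles:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define N 10000

static int cmp(const void *a, const void *b) {
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

int main(void) {
    int fd = open("/data/latency.dat", O_WRONLY | O_CREAT | O_DSYNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    static double lat_us[N];
    char block[4096] = {0};

    for (int i = 0; i < N; i++) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        pwrite(fd, block, sizeof block, (off_t)i * (off_t)sizeof block);
        clock_gettime(CLOCK_MONOTONIC, &b);
        lat_us[i] = (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
    }
    close(fd);

    /* Sort latencies and read off the percentiles */
    qsort(lat_us, N, sizeof lat_us[0], cmp);
    printf("P50 %.1f us  P90 %.1f us  P99 %.1f us  P99.9 %.1f us\n",
           lat_us[N / 2], lat_us[N * 9 / 10], lat_us[N * 99 / 100],
           lat_us[N * 999 / 1000]);
    return 0;
}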
Cost-Benefit Trade-off Formula
For a given workload, the viability of write-through can be expressed as:
Device_Utilization = Required_Throughput × Latency_P99
Write_Through_Viable = Device_Utilization < 1.0 (i.e., well below 100% utilization)
Example:
Required: 1000 writes/second
P99 latency: 50µs (NVMe SSD)
1000 × 0.00005s = 0.05 = 5% device utilization
✓ Write-through is viable with substantial headroom
Counter-example:
Required: 50,000 writes/second
P99 latency: 50µs
50,000 × 0.00005s = 2.5 = 250% device utilization
✗ Write-through cannot meet throughput requirements
When pure write-through cannot meet throughput requirements, the common solution is batched synchronization: buffer N writes, then issue one fsync(). This achieves write-back performance with periodic durability guarantees. Most databases use variants of this approach with their write-ahead log.
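A minimal sketch of that batched pattern follows; the names wal_append() and wal_commit() and the log path are illustrative, not any particular database's API. Writes accumulate in the page cache, and a single fsync() at the commit point makes the whole batch durable at once.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Append a record without forcing it to disk yet (fast path). */
static int wal_append(int fd, const void *rec, size_t len) {
    return write(fd, rec, len) == (ssize_t)len ? 0 : -1;
}

/* Commit point: one fsync() covers every record appended since the
 * previous commit, amortizing the device latency across the batch. */
static int wal_commit(int fd) {
    return fsync(fd);
}

int main(void) {
    int fd = open("/data/wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int batch = 0; batch < 100; batch++) {
        /* Buffer a batch of records... */
        for (int i = 0; i < 64; i++) {
            char rec[64];
            int n = snprintf(rec, sizeof rec, "batch=%d record=%d\n", batch, i);
            if (wal_append(fd, rec, (size_t)n) != 0) { perror("append"); return 1; }
        }
        /* ...then pay the device latency once per batch. */
        if (wal_commit(fd) != 0) { perror("fsync"); return 1; }
    }
    close(fd);
    return 0;
}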
Despite its performance costs, write-through is the correct choice—sometimes the only correct choice—in specific scenarios. Understanding these cases helps you make principled trade-offs rather than defaulting to "fastest possible."
Case Study: PostgreSQL WAL Synchronization
PostgreSQL provides a canonical example of calibrated write-through usage. Its synchronous_commit setting controls when WAL records are durable:
| Setting | Behavior | Durability | Performance |
|---|---|---|---|
| off | Return immediately | May lose last ~600ms | Fastest |
| local | fsync to local disk | Survives local crash | Good |
| remote_write | Wait for replica receipt | Survives local crash | Moderate |
| on (default) | Wait for replica flush | Survives one node failure | Baseline |
| remote_apply | Wait for replica visibility | Full consistency | Slowest |
Each level adds latency but strengthens guarantees. Production systems choose based on their tolerance for data loss versus performance requirements.
The "Two-Second Rule"
A practical heuristic: if losing the last 2 seconds of data would be catastrophic, you need write-through or near-synchronous semantics. If losing 2 seconds is inconvenient but recoverable (e.g., user can retry), write-back with periodic sync is usually acceptable.
This rule emerges from typical power failure scenarios—a UPS can usually sustain systems long enough to flush buffers if the outage is brief, but sudden failures give no warning.
Use write-through for data whose loss would violate correctness guarantees (transactions, consensus) or external requirements (compliance, contracts). Use write-back for data that can be reconstructed or whose loss is merely inconvenient (caches, temporary files, recoverable state).
The write-through concept originated in CPU cache design, and understanding this context illuminates why the trade-offs work differently at different system layers.
CPU Cache Write-Through Architecture
In a write-through CPU cache:
CPU Core 0 CPU Core 1
│ │
▼ ▼
┌─────────┐ ┌─────────┐
│ L1 Cache│ │ L1 Cache│
│ (8KB) │ │ (8KB) │
└────┬────┘ └────┬────┘
│ Write propagates immediately │
▼ ▼
┌──────────────────────────────────────────┐
│ L2 Cache (256KB) │
│ [Always coherent with L1 writes] │
└──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ Main Memory (DDR) │
└──────────────────────────────────────────┘
Why CPU Caches Favor Write-Back
Modern CPUs overwhelmingly use write-back caches because:
Memory Bus Saturation: Write-through generates memory traffic for every store instruction. A CPU executing 1 billion stores/second would saturate any memory bus.
Write Combining: Write-back allows multiple writes to the same cache line to be absorbed, sending only the final value to memory.
Power Consumption: Memory writes consume significantly more power than cache writes. Write-back reduces memory access frequency.
However, examining when write-through is used in CPU caches reveals instructive patterns:
| Context | Cache Strategy | Reason |
|---|---|---|
| L1 Data Cache | Write-back | Performance critical |
| L1 Instruction Cache | Write-through (effectively) | Rarely written |
| Embedded Systems | Sometimes write-through | Simplicity, real-time predictability |
| Graphics Frame Buffers | Write-combining (variant) | Streaming writes to display |
| Memory-Mapped I/O | Uncached or write-through | Device expects immediate visibility |
Notice that write-through becomes more viable as the speed gap between cache and backing store narrows. Main memory is roughly 100x slower than an L1 cache, so write-through at the CPU level is prohibitive. An NVMe SSD is roughly 100x slower than RAM, so write-through at the OS level is painful but usable. Optane-class persistent memory is only about 3x slower than RAM, making write-through nearly transparent.
Implementing write-through correctly requires attention to subtle details that are easy to overlook. A system that appears to be write-through but isn't can silently lose data.
Essential Verification Steps
Verify Device Cache Settings
# Check HDD/SSD write cache status
hdparm -W /dev/sda
# Disable write cache for true write-through
hdparm -W0 /dev/sda
Confirm File System Mount Options
# Mount with sync option (all writes synchronous)
mount -o sync /dev/sda1 /data
# Or use dirsync for directory operations only
mount -o dirsync /dev/sda1 /data
Validate at Application Level
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

// Verify O_SYNC/O_DSYNC is honored by timing synchronous writes
int main(void) {
    int fd = open("/data/test", O_WRONLY | O_CREAT | O_SYNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    char buf[4096] = {0};

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < 1000; i++) {
        write(fd, buf, 4096);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ms = (end.tv_sec - start.tv_sec) * 1000 +
                        (end.tv_nsec - start.tv_nsec) / 1000000.0;
    // For NVMe: expect ~15-50ms total (15-50µs × 1000)
    // If <<1ms total, writes are being cached!
    printf("1000 sync writes took %.2f ms\n", elapsed_ms);
    return 0;
}
Error Handling Imperatives
Write-through is only as reliable as your error handling. Every synchronous write can fail, and the failure MUST be handled:
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* log_error() is assumed to be provided elsewhere by the application. */

ssize_t safe_sync_write(int fd, const void *buf, size_t count) {
    size_t written = 0;
    while (written < count) {
        ssize_t result = write(fd, (const char *)buf + written, count - written);
        if (result < 0) {
            if (errno == EINTR) {
                continue;  // Retry on interrupt
            }
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                // Should not happen with sync writes
                // Log and abort - something is wrong
                log_error("Unexpected EAGAIN on sync write");
                return -1;
            }
            // Hard errors: ENOSPC, EIO, EDQUOT, etc.
            log_error("Sync write failed: %s", strerror(errno));
            return -1;
        }
        written += (size_t)result;
    }
    // For O_SYNC files, data is durable here
    // For regular files, still need fsync
    return (ssize_t)written;
}
RAID Controller Considerations
Enterprise RAID controllers often have battery-backed write caches (BBU). This changes the durability equation: the controller can acknowledge a write as soon as it lands in its cache, because the battery (or flash backup) preserves the cache contents across a power failure until they can be flushed to the disks.
This is the best of both worlds: write-through semantics with write-back performance. However: the battery degrades and must be monitored and replaced; many controllers fall back to a much slower true write-through mode while the battery is failed or charging; and the protection covers only the controller's own cache, so any volatile caches on the drives behind it still need to be disabled or flushed.
Virtual machines add layers of caching invisible to the guest OS. A guest's O_SYNC write may sit in the hypervisor's cache, the host's page cache, or a SAN's cache. Always verify durability end-to-end, especially in cloud environments where you don't control the storage stack.
We have comprehensively examined write-through as the most conservative write strategy—one that trades performance for immediate durability guarantees. Let's consolidate the essential insights: every write updates the cache and the backing store synchronously, so a successful write() implies the data is already persistent; that guarantee costs one full device round trip per write, which is crippling on spinning disks and acceptable on fast flash and NVRAM; device-level caches can silently weaken the guarantee unless they are disabled, flushed, or battery-backed; and write-through is the right choice wherever losing acknowledged data would violate correctness, consensus, or compliance requirements.
What's Next
Write-through establishes our baseline—the safest, slowest approach. Next, we'll explore write-back caching, the strategy that most systems use by default. Write-back trades immediate durability for dramatically improved performance by allowing dirty data to accumulate in cache. Understanding write-back's mechanisms, risks, and proper usage is essential for any systems engineer working with storage.
You now understand write-through caching at a fundamental level—its mechanics, performance characteristics, hardware considerations, and appropriate use cases. This foundation is essential for appreciating why more complex write strategies exist and when to apply them.