We've examined write-through (immediate persistence) and write-back (eventual persistence). But real-world applications often need something in between: delayed write strategies that provide explicit control over when cached data becomes durable.
Consider a database transaction: the commit record in the write-ahead log must be durable before the client is told the transaction committed, while the modified data pages can stay in cache and be flushed whenever convenient.
Neither pure write-through nor pure write-back fits this pattern. Write-through would force every operation to disk—devastating for performance. Write-back would defer everything, including the critical log entries—catastrophic for correctness.
Delayed write strategies allow applications to explicitly control which writes are immediately durable and which can remain cached. The operating system provides the mechanisms; the application provides the intelligence about what data is critical.
This page explores those mechanisms in depth and provides the patterns for using them correctly.
By the end of this page, you will understand the complete set of mechanisms for controlling write timing—from file-level flags to explicit sync calls. You'll learn the precise semantics of each mechanism, their performance characteristics, and how to combine them to build systems with fine-grained durability guarantees.
Delayed write refers to the practice of initially writing data to cache (like write-back) but giving applications explicit control over when that data is flushed to persistent storage. The "delay" is not a fixed time period—it's a gap between write completion and application-initiated synchronization.
The Delayed Write Timeline
T=0.000: write(fd, data1, size) → Returns immediately (data in cache)
T=0.001: write(fd, data2, size) → Returns immediately (data in cache)
T=0.002: write(fd, data3, size) → Returns immediately (data in cache)
...
(Application performs other work)
(Data remains in cache, at risk of loss)
...
T=0.500: fsync(fd) → Blocks until data1, data2, data3 on disk
→ At this point, data is durable
Key: The delay from T=0.000 to T=0.500 is controlled explicitly by the application,
     not by an OS periodic flush (as in write-back),
     and not forced to zero (as in write-through)
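A minimal sketch of this timeline in C, with an illustrative file name and record set (assumptions of mine, not taken from any particular application): the write() calls return as soon as the data reaches the page cache, and durability arrives only at the explicit fsync().

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *records[] = { "data1\n", "data2\n", "data3\n" };
    for (int i = 0; i < 3; i++) {
        // Returns as soon as the data is in the page cache (T=0.000-0.002)
        if (write(fd, records[i], strlen(records[i])) < 0) {
            perror("write"); return 1;
        }
    }

    /* ... application performs other work; data is still at risk ... */

    // Durability point chosen explicitly by the application (T=0.500)
    if (fsync(fd) < 0) { perror("fsync"); return 1; }
    printf("records are now durable\n");

    close(fd);
    return 0;
}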
This model is distinct from write-back in an important way:
| Aspect | Write-Back | Delayed Write |
|---|---|---|
| Initial write | To cache only | To cache only |
| Return timing | Immediate | Immediate |
| Durability trigger | OS timer/threshold | Application sync call |
| Durability timing | Unpredictable | Application-controlled |
| Data-at-risk awareness | OS doesn't know | Application tracks explicitly |
The Conceptual Model
Think of delayed write as a contract between application and OS:
Application: "I'm writing data, but I'll tell you when I need it durable."
OS: "Understood. I'll keep it in fast cache until then."
OS: "I might also write it opportunistically if I have nothing else to do."
Application: "Fine, but I won't assume it's durable until I explicitly ask."
This contract shifts responsibility: the OS is no longer guaranteeing durability on any schedule. The application must explicitly request durability at appropriate points.
When you open a file normally (without O_SYNC or O_DSYNC) and call write(), you're using delayed write semantics. The data goes to cache, and it's your responsibility to call fsync() if you need durability guarantees. Most developers don't realize this—they assume write() is durable when it's not.
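A brief sketch of that difference, with error handling trimmed to keep the shapes visible (the helper names are mine): the default open gives delayed-write semantics that need a later fsync(), while O_DSYNC turns every write() into a synchronous, durable one.

#include <fcntl.h>
#include <unistd.h>

// Delayed write (default flags): write() fills the cache; fsync() makes it durable.
void delayed_write_path(const char *path, const void *buf, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    write(fd, buf, len);   // returns once the data is in the page cache
    fsync(fd);             // application-chosen durability point
    close(fd);
}

// Synchronous write: O_DSYNC makes each write() block until the data
// (and the metadata needed to read it back) reaches stable storage.
void sync_write_path(const char *path, const void *buf, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0644);
    write(fd, buf, len);   // no separate fsync()/fdatasync() needed for this data
    close(fd);
}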
POSIX and Linux provide a family of synchronization calls that force cached data to persistent storage. Understanding the precise semantics of each is critical for correct system design.
The Synchronization Hierarchy
#define _GNU_SOURCE     // for syncfs() and sync_file_range()
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>      // perror()

/*
 * 1. sync() - System-wide synchronization
 *
 * Schedules all dirty buffers for writeout across ALL filesystems.
 * IMPORTANT: POSIX allows sync() to return after merely SCHEDULING
 * the writes. Modern Linux does wait for completion, but sync()
 * cannot report which files failed to write.
 */
void example_sync(void) {
    // Flush all dirty data, system-wide
    sync();

    // WARNING: portable code cannot assume the data reached disk,
    // and write errors are not reported per file.
    // Do NOT use sync() to verify durability of specific data.
}

/*
 * 2. syncfs(fd) - Filesystem-wide synchronization
 *
 * Synchronizes all dirty data for the filesystem containing fd.
 * Blocks until all data and metadata for that filesystem are durable.
 * More targeted than sync(), guarantees completion.
 */
void example_syncfs(int fd) {
    // Wait for all dirty data on this filesystem to be written
    if (syncfs(fd) < 0) {
        perror("syncfs failed");
    }
    // All data on this filesystem is now durable
}

/*
 * 3. fsync(fd) - File-specific synchronization
 *
 * Forces all modified data and metadata for the specific file
 * to be written to the underlying storage device.
 * Blocks until complete.
 */
void example_fsync(int fd, const void *buffer, size_t size) {
    // Write data to file
    write(fd, buffer, size);

    // Force data AND metadata to disk
    if (fsync(fd) < 0) {
        perror("fsync failed");
        // Must handle! Data may not be durable
    }
    // File data and metadata are now durable
}

/*
 * 4. fdatasync(fd) - Data-only synchronization
 *
 * Like fsync() but only guarantees file data is synchronized.
 * Metadata (size, timestamps) may not be synchronized unless
 * required for data retrieval.
 *
 * POSIX: "transfers all in-core modified data... but only
 * metadata necessary to retrieve the data"
 */
void example_fdatasync(int fd, const void *buffer, size_t size) {
    write(fd, buffer, size);

    // Force data to disk, skip non-essential metadata
    if (fdatasync(fd) < 0) {
        perror("fdatasync failed");
    }
    // File data is durable
    // Modification time might NOT be updated on disk
}

/*
 * 5. sync_file_range(fd, offset, nbytes, flags)
 *    Linux-specific: fine-grained range synchronization
 *    (requires _GNU_SOURCE, defined above)
 *
 * Provides low-level control over synchronization of specific
 * byte ranges. Powerful but dangerous if misused.
 */
void example_sync_file_range(int fd) {
    /*
     * SYNC_FILE_RANGE_WRITE:
     * Start writeout (async) - returns immediately
     */
    sync_file_range(fd, 0, 4096, SYNC_FILE_RANGE_WRITE);

    // Do other work while I/O is in progress...

    /*
     * SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WAIT_AFTER:
     * Wait for previously initiated writes to complete
     */
    sync_file_range(fd, 0, 4096,
                    SYNC_FILE_RANGE_WAIT_BEFORE |
                    SYNC_FILE_RANGE_WAIT_AFTER);

    // Dirty pages in range 0-4096 have been written out
    // (no metadata sync, and no guarantee the device cache was flushed)

    /*
     * Common pattern: initiate write, do work, wait for completion
     * This overlaps I/O with computation
     */
}

| Call | Scope | Data | Metadata | Blocking | Performance |
|---|---|---|---|---|---|
| sync() | All filesystems | Scheduled | Scheduled | May return early (POSIX) | No per-file errors; don't rely on it for durability |
| syncfs(fd) | One filesystem | Guaranteed | Guaranteed | Until complete | Heavy, affects all files |
| fsync(fd) | One file | Guaranteed | Guaranteed | Until complete | Full metadata sync; slower than fdatasync |
| fdatasync(fd) | One file | Guaranteed | Partial* | Until complete | Faster than fsync |
| sync_file_range() | Byte range | Controllable | Not synced | Controllable | Most flexible |
*fdatasync() syncs metadata required for data retrieval (file size if extended), but may skip non-essential metadata (modification time).
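This distinction is why append-heavy logs are often preallocated: if the file already has its final size, appends within that space change no essential metadata, so fdatasync() avoids the metadata write that fsync() would pay for. A sketch under that assumption (helper names are mine):

#include <fcntl.h>
#include <unistd.h>

// Preallocate the log once so later appends never grow the file.
int open_prealloc_log(const char *path, off_t size) {
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return -1;
    if (posix_fallocate(fd, 0, size) != 0) {   // reserve blocks up front
        close(fd);
        return -1;
    }
    return fd;
}

// Append a record at a known offset inside the preallocated region,
// then sync only the data: the file size and block map are unchanged.
int append_record(int fd, off_t offset, const void *rec, size_t len) {
    if (pwrite(fd, rec, len, offset) != (ssize_t)len) return -1;
    return fdatasync(fd);
}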
The Directory Sync Problem
Calling fsync() on a file ensures the file's contents are durable. But what about the file's existence? If you create a new file, the durability of the file depends on the durability of its parent directory entry.
// INCORRECT: Creating a durable file
int fd = open("/data/newfile.txt", O_WRONLY | O_CREAT, 0644);
write(fd, "important data", 14);
fsync(fd); // Data is durable
close(fd);
// CRASH: File may not exist after recovery!
// The directory entry pointing to the file might have been cached
// CORRECT: Creating a durable file
int fd = open("/data/newfile.txt", O_WRONLY | O_CREAT, 0644);
write(fd, "important data", 14);
fsync(fd); // Data is durable
close(fd);
int dir_fd = open("/data", O_RDONLY);
fsync(dir_fd); // Directory entry is now durable
close(dir_fd);
// File definitely exists after crash
This requirement is frequently overlooked, leading to data loss after crashes.
ext4's delayed allocation changes crash behavior for newly written files: an application that replaces a file by writing a new copy and renaming it (or by truncating and rewriting) without an fsync() can find the file empty after a crash. ext4's auto_da_alloc heuristic and the nodelalloc mount option mitigate this, but the portable fix is an explicit fsync() on the new file before rename or close.
Synchronization is expensive. Understanding the cost structure helps design systems that minimize sync overhead while maintaining durability.
fsync Latency Components
fsync(fd) latency = T_flush_dirty_pages + T_flush_device_cache + T_journal_sync
Where:
T_flush_dirty_pages: Write all dirty pages for this file to device
Linear in number of dirty pages
T_flush_device_cache: Issue flush/FUA to device
Constant time (~50µs - 5ms depending on device)
Required to ensure device cache is also durable
T_journal_sync: If journaling, wait for journal commit
Includes journal write + device flush
Affects other files on same filesystem
Measured fsync Latencies
| Scenario | Dirty Data | Measured Latency | Bottleneck |
|---|---|---|---|
| Single 4KB page | 4 KB | 50-100µs | Device flush command |
| Single 4KB, no journal | 4 KB | 15-30µs | Page write only |
| 100 dirty pages | 400 KB | 200-500µs | I/O time + flush |
| File + directory sync | 8 KB | 100-200µs | Two independent syncs |
| Heavily journaled FS | 4 KB | 200-2000µs | Journal contention |
| Same on HDD | 4 KB | 10-30ms | Rotational latency |
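These figures vary widely across devices and filesystems, so measure on your own hardware. A minimal timing harness along these lines (probe file name and iteration count are arbitrary) reproduces the single-page case:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    char page[4096];
    memset(page, 'x', sizeof(page));

    int fd = open("fsync_probe.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 10; i++) {
        write(fd, page, sizeof(page));            // dirty one 4 KB page

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (fsync(fd) < 0) { perror("fsync"); return 1; }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e3;
        printf("fsync #%d: %.1f us\n", i, us);
    }

    close(fd);
    return 0;
}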
The Batching Opportunity
The key performance insight: fsync cost is largely fixed overhead (device flush), not proportional to data size. Batching multiple writes between syncs amortizes this overhead:
# Naive: fsync after every write
for record in records:
    os.write(fd, record)
    os.fsync(fd)              # ~100µs each, i.e. ~10ms of sync time per 100 records
# Max throughput: ~10,000 ops/second

# Batched: fsync once per N writes
batch = []
for record in records:
    batch.append(record)
    if len(batch) >= 100:
        for r in batch:
            os.write(fd, r)
        os.fsync(fd)          # ~100µs once per 100 writes
        batch = []
# (flush any remaining partial batch the same way)
# Max throughput: ~1,000,000 ops/second
# Speedup: ~100x from simple batching
The Interference Problem
fsync affects other operations on the same filesystem due to journal dependencies:
Thread A: write to file1, fsync(file1)
Thread B: write to file2 (no sync)
On journaled filesystem:
- fsync(file1) must commit journal
- Journal commit includes file2's changes too
- Thread B's data becomes durable as side effect
- But also: fsync(file1) waits for journal commit
- Journal commit may wait for other dirty metadata
- Thread A's latency increases due to Thread B's writes
This is why database engines often use O_DIRECT or dedicated filesystems for WAL: to isolate sync latency from other I/O.
High-performance databases (PostgreSQL, MySQL) often recommend placing transaction logs on a separate filesystem or dedicated SSD. This isolates WAL fsync() from data file I/O, ensuring consistent commit latency regardless of data workload size.
Production systems use sophisticated patterns to achieve durability with minimal performance impact. Let's examine the most important ones.
Pattern 1: Double-Write Buffer
Used by MySQL InnoDB to prevent torn pages (partial writes during crash):
Problem: 16KB InnoDB page ≠ 4KB filesystem block
Power loss mid-write leaves page half-old, half-new
Checksum detects corruption but can't recover
Solution: Double-write buffer
1. Write page to sequential "doublewrite buffer" area
2. fsync the doublewrite buffer
3. Write page to actual location (can use write-back!)
4. No fsync needed for actual write
Recovery:
- If actual page checksums fail, read from doublewrite buffer
- Doublewrite buffer is always intact (sequential write, one sync)
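A simplified sketch of the idea, not InnoDB's actual code (the page size, file layout, and checksum handling are assumptions): the page is made durable in the doublewrite area before the in-place write, so a torn in-place write can always be repaired from that copy.

#include <fcntl.h>
#include <unistd.h>

#define DB_PAGE_SIZE 16384   // assumed database page size

// Step 1+2: copy the page into the doublewrite area and make it durable.
// Step 3:   write the page in place; this write may be torn by a crash,
//           but recovery restores it from the doublewrite copy.
int doublewrite_page(int dblwr_fd, off_t dblwr_off,
                     int data_fd, off_t page_off,
                     const void *page) {
    if (pwrite(dblwr_fd, page, DB_PAGE_SIZE, dblwr_off) != DB_PAGE_SIZE) return -1;
    if (fdatasync(dblwr_fd) < 0) return -1;        // doublewrite copy is durable

    if (pwrite(data_fd, page, DB_PAGE_SIZE, page_off) != DB_PAGE_SIZE) return -1;
    // No immediate fsync needed here; background writeback or a later
    // checkpoint sync can flush it. A torn page is detected by checksum
    // at recovery and replaced with the doublewrite copy.
    return 0;
}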
Pattern 2: Write Ordering with Barriers
For operations where order matters more than immediate durability:
// Ensure critical metadata is on disk before dependent data
// Write metadata (e.g., allocation bitmap)
write(meta_fd, new_allocation_info, meta_size);
fsync(meta_fd); // Barrier: metadata durable
// Now safe to write data that depends on this allocation
write(data_fd, user_data, data_size);
// No fsync needed immediately - can batch with other writes
// Eventually, sync data too (or let background writeback handle)
Pattern 3: Epoch-Based Flushing
Group writes by "epoch", sync each epoch as a unit:
import threading
import time
import os

class EpochFlusher:
    """
    Groups writes into epochs, syncs entire epochs atomically.
    Provides both batching benefits and predictable durability.
    """

    def __init__(self, epoch_duration_ms=100):
        self.current_epoch = []
        self.epoch_lock = threading.Lock()
        self.epoch_duration = epoch_duration_ms / 1000.0
        self.file_handles = {}  # path -> fd

        # Start epoch flusher thread
        self.flusher = threading.Thread(target=self._flush_loop, daemon=True)
        self.flusher.start()

    def write(self, path, offset, data):
        """
        Add write to current epoch.
        Returns future for durability notification.
        """
        future = EpochFuture()
        with self.epoch_lock:
            self.current_epoch.append({
                'path': path,
                'offset': offset,
                'data': data,
                'future': future
            })
        return future

    def _flush_loop(self):
        while True:
            time.sleep(self.epoch_duration)

            # Atomically swap epoch
            with self.epoch_lock:
                epoch = self.current_epoch
                self.current_epoch = []

            if not epoch:
                continue

            # Group writes by file for efficiency
            by_file = {}
            for write in epoch:
                if write['path'] not in by_file:
                    by_file[write['path']] = []
                by_file[write['path']].append(write)

            # Apply all writes (goes to cache)
            for path, writes in by_file.items():
                fd = self._get_fd(path)
                for w in sorted(writes, key=lambda x: x['offset']):
                    os.lseek(fd, w['offset'], os.SEEK_SET)
                    os.write(fd, w['data'])

            # Single sync per file for entire epoch
            for path in by_file:
                fd = self._get_fd(path)
                os.fsync(fd)

            # Notify all waiters in this epoch
            for write in epoch:
                write['future'].set_done()

    def _get_fd(self, path):
        if path not in self.file_handles:
            self.file_handles[path] = os.open(
                path, os.O_WRONLY | os.O_CREAT, 0o644
            )
        return self.file_handles[path]

class EpochFuture:
    def __init__(self):
        self._done = threading.Event()

    def set_done(self):
        self._done.set()

    def wait(self):
        """Block until this write is durable."""
        self._done.wait()

Pattern 4: Asynchronous fsync with io_uring
Linux io_uring provides true asynchronous fsync:
#include <liburing.h>
// Traditional: fsync blocks calling thread
fsync(fd); // Thread sleeps until complete
// io_uring: submit fsync, continue working, check later
struct io_uring ring;
io_uring_queue_init(32, &ring, 0);
// Submit async fsync
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);
io_uring_submit(&ring);
// Do other work...
// Later: check if fsync completed
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe); // Or io_uring_peek_cqe for non-blocking
if (cqe->res < 0) {
// fsync failed!
handle_error(-cqe->res);
} else {
// Data is durable
}
io_uring_cqe_seen(&ring, cqe);
This enables high-throughput durability-aware applications that overlap computation with sync operations.
Production systems often combine these patterns. A database might use: WAL on separate filesystem (isolation) + group commit (batching) + io_uring (concurrency) + double-write buffer (torn page protection). Each pattern addresses a specific concern while maintaining overall durability guarantees.
Different application types have different durability requirements. Let's examine optimal strategies for common cases.
Case Study: SQLite Durability Modes
SQLite provides a perfect example of configurable delayed write strategies:
-- 1. FULL synchronous (safest, slowest)
PRAGMA synchronous = FULL;
-- fsync after every transaction AND after journal write
-- Survives both OS crash and power loss
-- 2. NORMAL synchronous (balanced)
PRAGMA synchronous = NORMAL;
-- fsync at critical moments only
-- Survives OS crash; may corrupt on power loss at wrong moment
-- 3. OFF synchronous (fastest, risky)
PRAGMA synchronous = OFF;
-- Never fsync automatically
-- Data corruption likely on any crash
-- Only for ephemeral/reconstructible data
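From application code the same knob is just a PRAGMA statement; a minimal sketch using the SQLite C API (database path and helper name are illustrative):

#include <sqlite3.h>
#include <stdio.h>

int open_with_normal_sync(const char *path, sqlite3 **db) {
    if (sqlite3_open(path, db) != SQLITE_OK) {
        fprintf(stderr, "open: %s\n", sqlite3_errmsg(*db));
        return -1;
    }
    // Balanced mode: survives OS crashes; see the trade-offs above.
    if (sqlite3_exec(*db, "PRAGMA synchronous = NORMAL;",
                     NULL, NULL, NULL) != SQLITE_OK) {
        fprintf(stderr, "pragma: %s\n", sqlite3_errmsg(*db));
        return -1;
    }
    return 0;
}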
Case Study: PostgreSQL fsync Settings
PostgreSQL's wal_sync_method controls how WAL is synced:
| Method | Mechanism | Durability | Performance |
|---|---|---|---|
| fsync (default) | fsync() | Full | Baseline |
| fdatasync | fdatasync() | Full | 10-20% faster |
| open_sync | O_SYNC flag | Full | Varies |
| open_datasync | O_DSYNC flag | Full | Usually fastest |
The choice depends on OS and filesystem. Always benchmark on your specific configuration.
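In syscall terms the options differ mainly in where the sync happens. A hedged sketch, not PostgreSQL's code, of the two most common methods:

#include <fcntl.h>
#include <unistd.h>

// wal_sync_method = fdatasync: ordinary write, explicit sync per commit
void commit_with_fdatasync(int wal_fd, const void *rec, size_t len) {
    write(wal_fd, rec, len);
    fdatasync(wal_fd);                // one extra syscall per commit
}

// wal_sync_method = open_datasync: the WAL file is opened with O_DSYNC,
// so each write() returns only after the data is durable
int open_wal_with_odsync(const char *path) {
    return open(path, O_WRONLY | O_CREAT | O_DSYNC, 0600);
}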
Never assume your sync calls work correctly. Use tools like diskchecker or dm-flakey (Linux device-mapper target) to simulate power loss and verify your application recovers correctly. Many 'durable' systems fail these tests.
Even experienced developers make subtle mistakes with delayed write semantics. Let's examine the most common pitfalls and their corrections.
- write() only guarantees data is in the kernel cache. Always call fsync() if durability matters.
- fsync() can fail! Failures indicate data may not be durable. Always check the return value and handle errors.
- sync() may return before writes complete (POSIX permits this) and reports no per-file errors. Never use it for durability verification.
- close() does NOT guarantee data is on disk. It releases the file descriptor. fsync() before close() if durability is needed.
#include <fcntl.h>
#include <libgen.h>   // dirname()
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// MISTAKE 1: Assuming write() is durable
void mistake_1(void) {
    int fd = open("data.txt", O_WRONLY | O_CREAT, 0644);
    write(fd, "important", 9);
    close(fd);
    printf("Data saved!\n");   // WRONG: Data is in cache, not disk!
}

// MISTAKE 2: Ignoring fsync() return value
void mistake_2(void) {
    int fd = open("data.txt", O_WRONLY | O_CREAT, 0644);
    write(fd, "important", 9);
    fsync(fd);   // What if this fails? We don't know!
    close(fd);
}

// MISTAKE 3: Forgetting directory sync
void mistake_3(void) {
    int fd = open("/data/newfile.txt", O_WRONLY | O_CREAT, 0644);
    write(fd, "important", 9);
    fsync(fd);   // File content is durable
    close(fd);
    // BUT: /data directory entry not synced!
    // After crash, file may not exist
}

// CORRECT IMPLEMENTATION
int durable_create_file(const char *path, const void *data, size_t len) {
    int fd = -1, dir_fd = -1;
    char *path_copy = NULL;
    int result = -1;

    // Write file
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) goto cleanup;

    if (write(fd, data, len) != (ssize_t)len) goto cleanup;

    // fsync file - CHECK RETURN VALUE
    if (fsync(fd) < 0) {
        perror("fsync file failed");
        goto cleanup;
    }

    // Sync parent directory
    path_copy = strdup(path);
    char *dir_path = dirname(path_copy);
    dir_fd = open(dir_path, O_RDONLY);
    if (dir_fd < 0) goto cleanup;

    if (fsync(dir_fd) < 0) {
        perror("fsync directory failed");
        goto cleanup;
    }

    result = 0;   // Success

cleanup:
    if (fd >= 0) close(fd);
    if (dir_fd >= 0) close(dir_fd);
    free(path_copy);
    return result;
}

Many developers believe that write-new-file, sync, rename is atomic and crash-safe. It's NOT unless you also sync the destination directory after rename. The rename may be lost on crash even if the new file is durable.
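For completeness, the crash-safe version of the write-new-file-then-rename idiom looks roughly like this (helper name is mine; error handling reduced to the essential checks):

#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// Atomically replace 'path' with new contents, surviving a crash at any point.
int atomic_replace(const char *path, const void *data, size_t len) {
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0) {
        close(fd);
        return -1;                     // temp file contents must be durable first
    }
    close(fd);

    if (rename(tmp, path) < 0) return -1;   // atomic swap in the namespace

    // The rename itself lives in the directory; sync it or the swap can be lost.
    char *copy = strdup(path);
    int dir_fd = open(dirname(copy), O_RDONLY);
    free(copy);
    if (dir_fd < 0) return -1;
    int rc = fsync(dir_fd);
    close(dir_fd);
    return rc;
}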
Implementing delayed write correctly is hard. Testing it is harder. Normal testing can't verify durability because tests don't involve power failures. Here are techniques for rigorous durability testing.
Technique 1: dm-flakey (Linux)
Device-mapper target that can simulate various failure modes:
# Create a flakey device: up for 60 seconds, then drops writes for 5 seconds, repeating
dmsetup create mydata --table "0 $(blockdev --getsz /dev/sdb) \
    flakey /dev/sdb 0 60 5 \
    1 drop_writes"
# 60: Seconds device works normally
# 5: Seconds device "fails" (drops writes)
# Then it repeats
# Mount filesystem on flakey device
mkfs.ext4 /dev/mapper/mydata
mount /dev/mapper/mydata /mnt/test
# Run your application
./my_durable_application /mnt/test/data
# During "drop" window, writes are lost
# Application must either:
# - Fail gracefully (detect write failures)
# - Retry when device recovers
# - Have state on disk that allows recovery
Technique 2: libeatmydata
LD_PRELOAD library that makes fsync() return immediately without doing anything:
# Run application with fsync as no-op
LD_PRELOAD=libeatmydata.so ./my_application
# If recovery fails after a crash test under libeatmydata: the application
#   really was depending on those fsync() calls for correctness
# If it still recovers: its correctness does not hinge on the skipped syncs
# Useful for:
# - Performance testing without I/O overhead
# - Identifying excessive fsync() calls
# - Verifying retry logic handles transient failures
Technique 3: Crash Injection in SQLite's Test Harness
SQLite's test infrastructure can inject crashes:
// SQLite's test builds include a crash-simulating VFS (exercised by the
// crash*.test scripts in its TCL test suite) that buffers writes and can
// discard them at a chosen sync point to emulate power loss
// Or use --enable-debug build with -DSQLITE_CRASH_TEST
// Allows triggering crashes at various recovery points
Technique 4: Hardware-Assisted Testing
For ultimate verification, use actual power control:
Test setup:
- Controlled power strip (smart plug with API)
- Test system writing continuously
- Script that cuts power at random intervals
- Second system to verify state after power-on
Verification:
- Boot test system
- Run consistency check
- Verify all "committed" data is present
- Repeat thousands of times
Major database vendors run continuous crash testing (often called 'powerfailtest' or 'crashtest') as part of their CI/CD. SQLite, PostgreSQL, and MySQL all publish results. If you're building durable systems, you need similar testing infrastructure.
Delayed writes represent the practical middle ground between immediate durability and maximum performance. By understanding the synchronization primitives and their proper use, you can build systems that are both fast and correct.
What's Next
We've now covered three fundamental strategies: write-through (immediate), write-back (eventual), and delayed write (explicit control). Next, we'll examine ordered writes—the technique of controlling which writes complete before others, enabling crash-safe data structures without syncing every operation. Ordered writes are the foundation of journaling filesystems and ACID databases.
You now understand delayed write strategies comprehensively—from the fsync family semantics to advanced patterns like epoch flushing and io_uring. You're equipped to design applications that achieve their durability requirements with minimal performance overhead.