We've examined write-through (immediate persistence) and write-back (eventual persistence). But real-world applications often need something in between: delayed write strategies that provide explicit control over when cached data becomes durable.
Consider a database transaction: the commit record in the write-ahead log must be durable before the client is told the transaction committed, while the modified data pages can stay in cache and be flushed whenever convenient.
Neither pure write-through nor pure write-back fits this pattern. Write-through would force every operation to disk—devastating for performance. Write-back would defer everything, including the critical log entries—catastrophic for correctness.
Delayed write strategies allow applications to explicitly control which writes are immediately durable and which can remain cached. The operating system provides the mechanisms; the application provides the intelligence about what data is critical.
This page explores those mechanisms in depth and provides the patterns for using them correctly.
By the end of this page, you will understand the complete set of mechanisms for controlling write timing—from file-level flags to explicit sync calls. You'll learn the precise semantics of each mechanism, their performance characteristics, and how to combine them to build systems with fine-grained durability guarantees.
Delayed write refers to the practice of initially writing data to cache (like write-back) but giving applications explicit control over when that data is flushed to persistent storage. The "delay" is not a fixed time period—it's a gap between write completion and application-initiated synchronization.
The Delayed Write Timeline
T=0.000: write(fd, data1, size) → Returns immediately (data in cache)
T=0.001: write(fd, data2, size) → Returns immediately (data in cache)
T=0.002: write(fd, data3, size) → Returns immediately (data in cache)
...
(Application performs other work)
(Data remains in cache, at risk of loss)
...
T=0.500: fsync(fd) → Blocks until data1, data2, data3 on disk
→ At this point, data is durable
Key: The delay from T=0.000 to T=0.500 is controlled explicitly by the application,
     not by an OS periodic flush (as in write-back),
     and not forced to zero (as in write-through)
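A minimal sketch of this timeline in C, with an illustrative file name and record set (assumptions of mine, not taken from any particular application): the write() calls return as soon as the data reaches the page cache, and durability arrives only at the explicit fsync().

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *records[] = { "data1\n", "data2\n", "data3\n" };
    for (int i = 0; i < 3; i++) {
        // Returns as soon as the data is in the page cache (T=0.000-0.002)
        if (write(fd, records[i], strlen(records[i])) < 0) {
            perror("write"); return 1;
        }
    }

    /* ... application performs other work; data is still at risk ... */

    // Durability point chosen explicitly by the application (T=0.500)
    if (fsync(fd) < 0) { perror("fsync"); return 1; }
    printf("records are now durable\n");

    close(fd);
    return 0;
}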
This model is distinct from write-back in an important way:
| Aspect | Write-Back | Delayed Write |
|---|---|---|
| Initial write | To cache only | To cache only |
| Return timing | Immediate | Immediate |
| Durability trigger | OS timer/threshold | Application sync call |
| Durability timing | Unpredictable | Application-controlled |
| Data-at-risk awareness | OS doesn't know | Application tracks explicitly |
The Conceptual Model
Think of delayed write as a contract between application and OS:
Application: "I'm writing data, but I'll tell you when I need it durable."
OS: "Understood. I'll keep it in fast cache until then."
OS: "I might also write it opportunistically if I have nothing else to do."
Application: "Fine, but I won't assume it's durable until I explicitly ask."
This contract shifts responsibility: the OS is no longer guaranteeing durability on any schedule. The application must explicitly request durability at appropriate points.
When you open a file normally (without O_SYNC or O_DSYNC) and call write(), you're using delayed write semantics. The data goes to cache, and it's your responsibility to call fsync() if you need durability guarantees. Most developers don't realize this—they assume write() is durable when it's not.
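A brief sketch of that difference, with error handling trimmed to keep the shapes visible (the helper names are mine): the default open gives delayed-write semantics that need a later fsync(), while O_DSYNC turns every write() into a synchronous, durable one.

#include <fcntl.h>
#include <unistd.h>

// Delayed write (default flags): write() fills the cache; fsync() makes it durable.
void delayed_write_path(const char *path, const void *buf, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    write(fd, buf, len);   // returns once the data is in the page cache
    fsync(fd);             // application-chosen durability point
    close(fd);
}

// Synchronous write: O_DSYNC makes each write() block until the data
// (and the metadata needed to read it back) reaches stable storage.
void sync_write_path(const char *path, const void *buf, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0644);
    write(fd, buf, len);   // no separate fsync()/fdatasync() needed for this data
    close(fd);
}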
POSIX and Linux provide a family of synchronization calls that force cached data to persistent storage. Understanding the precise semantics of each is critical for correct system design.
The Synchronization Hierarchy
#define _GNU_SOURCE     // for syncfs() and sync_file_range()
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>      // perror()

/*
 * 1. sync() - System-wide synchronization
 *
 * Schedules all dirty buffers for writeout across ALL filesystems.
 * IMPORTANT: POSIX allows sync() to return after merely SCHEDULING
 * the writes. Modern Linux does wait for completion, but sync()
 * cannot report which files failed to write.
 */
void example_sync(void) {
    // Flush all dirty data, system-wide
    sync();

    // WARNING: portable code cannot assume the data reached disk,
    // and write errors are not reported per file.
    // Do NOT use sync() to verify durability of specific data.
}

/*
 * 2. syncfs(fd) - Filesystem-wide synchronization
 *
 * Synchronizes all dirty data for the filesystem containing fd.
 * Blocks until all data and metadata for that filesystem are durable.
 * More targeted than sync(), guarantees completion.
 */
void example_syncfs(int fd) {
    // Wait for all dirty data on this filesystem to be written
    if (syncfs(fd) < 0) {
        perror("syncfs failed");
    }
    // All data on this filesystem is now durable
}

/*
 * 3. fsync(fd) - File-specific synchronization
 *
 * Forces all modified data and metadata for the specific file
 * to be written to the underlying storage device.
 * Blocks until complete.
 */
void example_fsync(int fd, const void *buffer, size_t size) {
    // Write data to file
    write(fd, buffer, size);

    // Force data AND metadata to disk
    if (fsync(fd) < 0) {
        perror("fsync failed");
        // Must handle! Data may not be durable
    }
    // File data and metadata are now durable
}

/*
 * 4. fdatasync(fd) - Data-only synchronization
 *
 * Like fsync() but only guarantees file data is synchronized.
 * Metadata (size, timestamps) may not be synchronized unless
 * required for data retrieval.
 *
 * POSIX: "transfers all in-core modified data... but only
 * metadata necessary to retrieve the data"
 */
void example_fdatasync(int fd, const void *buffer, size_t size) {
    write(fd, buffer, size);

    // Force data to disk, skip non-essential metadata
    if (fdatasync(fd) < 0) {
        perror("fdatasync failed");
    }
    // File data is durable
    // Modification time might NOT be updated on disk
}

/*
 * 5. sync_file_range(fd, offset, nbytes, flags)
 *    Linux-specific: fine-grained range synchronization
 *    (requires _GNU_SOURCE, defined above)
 *
 * Provides low-level control over synchronization of specific
 * byte ranges. Powerful but dangerous if misused.
 */
void example_sync_file_range(int fd) {
    /*
     * SYNC_FILE_RANGE_WRITE:
     * Start writeout (async) - returns immediately
     */
    sync_file_range(fd, 0, 4096, SYNC_FILE_RANGE_WRITE);

    // Do other work while I/O is in progress...

    /*
     * SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WAIT_AFTER:
     * Wait for previously initiated writes to complete
     */
    sync_file_range(fd, 0, 4096,
                    SYNC_FILE_RANGE_WAIT_BEFORE |
                    SYNC_FILE_RANGE_WAIT_AFTER);

    // Dirty pages in range 0-4096 have been written out
    // (no metadata sync, and no guarantee the device cache was flushed)

    /*
     * Common pattern: initiate write, do work, wait for completion
     * This overlaps I/O with computation
     */
}

| Call | Scope | Data | Metadata | Blocking | Performance |
|---|---|---|---|---|---|
| sync() | All filesystems | Scheduled | Scheduled | May return early (POSIX) | No per-file errors; don't rely on it for durability |
| syncfs(fd) | One filesystem | Guaranteed | Guaranteed | Until complete | Heavy, affects all files |
| fsync(fd) | One file | Guaranteed | Guaranteed | Until complete | Full metadata sync; slower than fdatasync |
| fdatasync(fd) | One file | Guaranteed | Partial* | Until complete | Faster than fsync |
| sync_file_range() | Byte range | Controllable | Not synced | Controllable | Most flexible |
*fdatasync() syncs metadata required for data retrieval (file size if extended), but may skip non-essential metadata (modification time).
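This distinction is why append-heavy logs are often preallocated: if the file already has its final size, appends within that space change no essential metadata, so fdatasync() avoids the metadata write that fsync() would pay for. A sketch under that assumption (helper names are mine):

#include <fcntl.h>
#include <unistd.h>

// Preallocate the log once so later appends never grow the file.
int open_prealloc_log(const char *path, off_t size) {
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return -1;
    if (posix_fallocate(fd, 0, size) != 0) {   // reserve blocks up front
        close(fd);
        return -1;
    }
    return fd;
}

// Append a record at a known offset inside the preallocated region,
// then sync only the data: the file size and block map are unchanged.
int append_record(int fd, off_t offset, const void *rec, size_t len) {
    if (pwrite(fd, rec, len, offset) != (ssize_t)len) return -1;
    return fdatasync(fd);
}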
The Directory Sync Problem
Calling fsync() on a file ensures the file's contents are durable. But what about the file's existence? If you create a new file, the durability of the file depends on the durability of its parent directory entry.
// INCORRECT: Creating a durable file
int fd = open("/data/newfile.txt", O_WRONLY | O_CREAT, 0644);
write(fd, "important data", 14);
fsync(fd); // Data is durable
close(fd);
// CRASH: File may not exist after recovery!
// The directory entry pointing to the file might have been cached
// CORRECT: Creating a durable file
int fd = open("/data/newfile.txt", O_WRONLY | O_CREAT, 0644);
write(fd, "important data", 14);
fsync(fd); // Data is durable
close(fd);
int dir_fd = open("/data", O_RDONLY);
fsync(dir_fd); // Directory entry is now durable
close(dir_fd);
// File definitely exists after crash
This requirement is frequently overlooked, leading to data loss after crashes.
ext4's delayed allocation changes crash behavior for newly written files: an application that replaces a file by writing a new copy and renaming it (or by truncating and rewriting) without an fsync() can find the file empty after a crash. ext4's auto_da_alloc heuristic and the nodelalloc mount option mitigate this, but the portable fix is an explicit fsync() on the new file before rename or close.
Synchronization is expensive. Understanding the cost structure helps design systems that minimize sync overhead while maintaining durability.
fsync Latency Components
fsync(fd) latency = T_flush_dirty_pages + T_flush_device_cache + T_journal_sync
Where:
T_flush_dirty_pages: Write all dirty pages for this file to device
Linear in number of dirty pages
T_flush_device_cache: Issue flush/FUA to device
Constant time (~50µs - 5ms depending on device)
Required to ensure device cache is also durable
T_journal_sync: If journaling, wait for journal commit
Includes journal write + device flush
Affects other files on same filesystem
Measured fsync Latencies
| Scenario | Dirty Data | Measured Latency | Bottleneck |
|---|---|---|---|
| Single 4KB page | 4 KB | 50-100µs | Device flush command |
| Single 4KB, no journal | 4 KB | 15-30µs | Page write only |
| 100 dirty pages | 400 KB | 200-500µs | I/O time + flush |
| File + directory sync | 8 KB | 100-200µs | Two independent syncs |
| Heavily journaled FS | 4 KB | 200-2000µs | Journal contention |
| Same on HDD | 4 KB | 10-30ms | Rotational latency |
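These figures vary widely across devices and filesystems, so measure on your own hardware. A minimal timing harness along these lines (probe file name and iteration count are arbitrary) reproduces the single-page case:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    char page[4096];
    memset(page, 'x', sizeof(page));

    int fd = open("fsync_probe.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 10; i++) {
        write(fd, page, sizeof(page));            // dirty one 4 KB page

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (fsync(fd) < 0) { perror("fsync"); return 1; }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e3;
        printf("fsync #%d: %.1f us\n", i, us);
    }

    close(fd);
    return 0;
}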
The Batching Opportunity
The key performance insight: fsync cost is largely fixed overhead (device flush), not proportional to data size. Batching multiple writes between syncs amortizes this overhead:
# Naive: fsync after every write
for record in records:
    os.write(fd, record)
    os.fsync(fd)              # ~100µs each, i.e. ~10ms of sync time per 100 records
# Max throughput: ~10,000 ops/second

# Batched: fsync once per N writes
batch = []
for record in records:
    batch.append(record)
    if len(batch) >= 100:
        for r in batch:
            os.write(fd, r)
        os.fsync(fd)          # ~100µs once per 100 writes
        batch = []
# (flush any remaining partial batch the same way)
# Max throughput: ~1,000,000 ops/second
# Speedup: ~100x from simple batching
The Interference Problem
fsync affects other operations on the same filesystem due to journal dependencies:
Thread A: write to file1, fsync(file1)
Thread B: write to file2 (no sync)
On journaled filesystem:
- fsync(file1) must commit journal
- Journal commit includes file2's changes too
- Thread B's data becomes durable as side effect
- But also: fsync(file1) waits for journal commit
- Journal commit may wait for other dirty metadata
- Thread A's latency increases due to Thread B's writes
This is why database engines often use O_DIRECT or dedicated filesystems for WAL: to isolate sync latency from other I/O.
High-performance databases (PostgreSQL, MySQL) often recommend placing transaction logs on a separate filesystem or dedicated SSD. This isolates WAL fsync() from data file I/O, ensuring consistent commit latency regardless of data workload size.
Production systems use sophisticated patterns to achieve durability with minimal performance impact. Let's examine the most important ones.
Pattern 1: Double-Write Buffer
Used by MySQL InnoDB to prevent torn pages (partial writes during crash):
Problem: 16KB InnoDB page ≠ 4KB filesystem block
Power loss mid-write leaves page half-old, half-new
Checksum detects corruption but can't recover
Solution: Double-write buffer
1. Write page to sequential "doublewrite buffer" area
2. fsync the doublewrite buffer
3. Write page to actual location (can use write-back!)
4. No fsync needed for actual write
Recovery:
- If actual page checksums fail, read from doublewrite buffer
- Doublewrite buffer is always intact (sequential write, one sync)
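A simplified sketch of the idea, not InnoDB's actual code (the page size, file layout, and checksum handling are assumptions): the page is made durable in the doublewrite area before the in-place write, so a torn in-place write can always be repaired from that copy.

#include <fcntl.h>
#include <unistd.h>

#define DB_PAGE_SIZE 16384   // assumed database page size

// Step 1+2: copy the page into the doublewrite area and make it durable.
// Step 3:   write the page in place; this write may be torn by a crash,
//           but recovery restores it from the doublewrite copy.
int doublewrite_page(int dblwr_fd, off_t dblwr_off,
                     int data_fd, off_t page_off,
                     const void *page) {
    if (pwrite(dblwr_fd, page, DB_PAGE_SIZE, dblwr_off) != DB_PAGE_SIZE) return -1;
    if (fdatasync(dblwr_fd) < 0) return -1;        // doublewrite copy is durable

    if (pwrite(data_fd, page, DB_PAGE_SIZE, page_off) != DB_PAGE_SIZE) return -1;
    // No immediate fsync needed here; background writeback or a later
    // checkpoint sync can flush it. A torn page is detected by checksum
    // at recovery and replaced with the doublewrite copy.
    return 0;
}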
Pattern 2: Write Ordering with Barriers
For operations where order matters more than immediate durability:
// Ensure critical metadata is on disk before dependent data
// Write metadata (e.g., allocation bitmap)
write(meta_fd, new_allocation_info, meta_size);
fsync(meta_fd); // Barrier: metadata durable
// Now safe to write data that depends on this allocation
write(data_fd, user_data, data_size);
// No fsync needed immediately - can batch with other writes
// Eventually, sync data too (or let background writeback handle)
Pattern 3: Epoch-Based Flushing
Group writes by "epoch", sync each epoch as a unit:
import threading
import time
import os

class EpochFlusher:
    """
    Groups writes into epochs, syncs entire epochs atomically.
    Provides both batching benefits and predictable durability.
    """

    def __init__(self, epoch_duration_ms=100):
        self.current_epoch = []
        self.epoch_lock = threading.Lock()
        self.epoch_duration = epoch_duration_ms / 1000.0
        self.file_handles = {}  # path -> fd

        # Start epoch flusher thread
        self.flusher = threading.Thread(target=self._flush_loop, daemon=True)
        self.flusher.start()

    def write(self, path, offset, data):
        """
        Add write to current epoch.
        Returns future for durability notification.
        """
        future = EpochFuture()
        with self.epoch_lock:
            self.current_epoch.append({
                'path': path,
                'offset': offset,
                'data': data,
                'future': future
            })
        return future

    def _flush_loop(self):
        while True:
            time.sleep(self.epoch_duration)

            # Atomically swap epoch
            with self.epoch_lock:
                epoch = self.current_epoch
                self.current_epoch = []

            if not epoch:
                continue

            # Group writes by file for efficiency
            by_file = {}
            for write in epoch:
                if write['path'] not in by_file:
                    by_file[write['path']] = []
                by_file[write['path']].append(write)

            # Apply all writes (goes to cache)
            for path, writes in by_file.items():
                fd = self._get_fd(path)
                for w in sorted(writes, key=lambda x: x['offset']):
                    os.lseek(fd, w['offset'], os.SEEK_SET)
                    os.write(fd, w['data'])

            # Single sync per file for entire epoch
            for path in by_file:
                fd = self._get_fd(path)
                os.fsync(fd)

            # Notify all waiters in this epoch
            for write in epoch:
                write['future'].set_done()

    def _get_fd(self, path):
        if path not in self.file_handles:
            self.file_handles[path] = os.open(
                path, os.O_WRONLY | os.O_CREAT, 0o644
            )
        return self.file_handles[path]

class EpochFuture:
    def __init__(self):
        self._done = threading.Event()

    def set_done(self):
        self._done.set()

    def wait(self):
        """Block until this write is durable."""
        self._done.wait()

Pattern 4: Asynchronous fsync with io_uring
Linux io_uring provides true asynchronous fsync:
#include <liburing.h>
// Traditional: fsync blocks calling thread
fsync(fd); // Thread sleeps until complete
// io_uring: submit fsync, continue working, check later
struct io_uring ring;
io_uring_queue_init(32, &ring, 0);
// Submit async fsync
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);
io_uring_submit(&ring);
// Do other work...
// Later: check if fsync completed
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe); // Or io_uring_peek_cqe for non-blocking
if (cqe->res < 0) {
// fsync failed!
handle_error(-cqe->res);
} else {
// Data is durable
}
io_uring_cqe_seen(&ring, cqe);
This enables high-throughput durability-aware applications that overlap computation with sync operations.
Production systems often combine these patterns. A database might use: WAL on separate filesystem (isolation) + group commit (batching) + io_uring (concurrency) + double-write buffer (torn page protection). Each pattern addresses a specific concern while maintaining overall durability guarantees.
Different application types have different durability requirements. Let's examine optimal strategies for common cases.
Case Study: SQLite Durability Modes
SQLite provides a perfect example of configurable delayed write strategies:
-- 1. FULL synchronous (safest, slowest)
PRAGMA synchronous = FULL;
-- fsync after every transaction AND after journal write
-- Survives both OS crash and power loss
-- 2. NORMAL synchronous (balanced)
PRAGMA synchronous = NORMAL;
-- fsync at critical moments only
-- Survives OS crash; may corrupt on power loss at wrong moment
-- 3. OFF synchronous (fastest, risky)
PRAGMA synchronous = OFF;
-- Never fsync automatically
-- Data corruption likely on any crash
-- Only for ephemeral/reconstructible data
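From application code the same knob is just a PRAGMA statement; a minimal sketch using the SQLite C API (database path and helper name are illustrative):

#include <sqlite3.h>
#include <stdio.h>

int open_with_normal_sync(const char *path, sqlite3 **db) {
    if (sqlite3_open(path, db) != SQLITE_OK) {
        fprintf(stderr, "open: %s\n", sqlite3_errmsg(*db));
        return -1;
    }
    // Balanced mode: survives OS crashes; see the trade-offs above.
    if (sqlite3_exec(*db, "PRAGMA synchronous = NORMAL;",
                     NULL, NULL, NULL) != SQLITE_OK) {
        fprintf(stderr, "pragma: %s\n", sqlite3_errmsg(*db));
        return -1;
    }
    return 0;
}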
Case Study: PostgreSQL fsync Settings
PostgreSQL's wal_sync_method controls how WAL is synced:
| Method | Mechanism | Durability | Performance |
|---|---|---|---|
| fsync (default) | fsync() | Full | Baseline |
| fdatasync | fdatasync() | Full | 10-20% faster |
| open_sync | O_SYNC flag | Full | Varies |
| open_datasync | O_DSYNC flag | Full | Usually fastest |
The choice depends on OS and filesystem. Always benchmark on your specific configuration.
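In syscall terms the options differ mainly in where the sync happens. A hedged sketch, not PostgreSQL's code, of the two most common methods:

#include <fcntl.h>
#include <unistd.h>

// wal_sync_method = fdatasync: ordinary write, explicit sync per commit
void commit_with_fdatasync(int wal_fd, const void *rec, size_t len) {
    write(wal_fd, rec, len);
    fdatasync(wal_fd);                // one extra syscall per commit
}

// wal_sync_method = open_datasync: the WAL file is opened with O_DSYNC,
// so each write() returns only after the data is durable
int open_wal_with_odsync(const char *path) {
    return open(path, O_WRONLY | O_CREAT | O_DSYNC, 0600);
}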
Never assume your sync calls work correctly. Use tools like diskchecker or dm-flakey (Linux device-mapper target) to simulate power loss and verify your application recovers correctly. Many 'durable' systems fail these tests.
Even experienced developers make subtle mistakes with delayed write semantics. Let's examine the most common pitfalls and their corrections.
- write() only guarantees data is in the kernel cache. Always call fsync() if durability matters.
- fsync() can fail! Failures indicate data may not be durable. Always check the return value and handle errors.
- sync() may return before writes complete (POSIX permits this) and reports no per-file errors. Never use it for durability verification.
- close() does NOT guarantee data is on disk. It releases the file descriptor. fsync() before close() if durability is needed.
#include <fcntl.h>
#include <libgen.h>   // dirname()
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// MISTAKE 1: Assuming write() is durable
void mistake_1(void) {
    int fd = open("data.txt", O_WRONLY | O_CREAT, 0644);
    write(fd, "important", 9);
    close(fd);
    printf("Data saved!\n");   // WRONG: Data is in cache, not disk!
}

// MISTAKE 2: Ignoring fsync() return value
void mistake_2(void) {
    int fd = open("data.txt", O_WRONLY | O_CREAT, 0644);
    write(fd, "important", 9);
    fsync(fd);   // What if this fails? We don't know!
    close(fd);
}

// MISTAKE 3: Forgetting directory sync
void mistake_3(void) {
    int fd = open("/data/newfile.txt", O_WRONLY | O_CREAT, 0644);
    write(fd, "important", 9);
    fsync(fd);   // File content is durable
    close(fd);
    // BUT: /data directory entry not synced!
    // After crash, file may not exist
}

// CORRECT IMPLEMENTATION
int durable_create_file(const char *path, const void *data, size_t len) {
    int fd = -1, dir_fd = -1;
    char *path_copy = NULL;
    int result = -1;

    // Write file
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) goto cleanup;

    if (write(fd, data, len) != (ssize_t)len) goto cleanup;

    // fsync file - CHECK RETURN VALUE
    if (fsync(fd) < 0) {
        perror("fsync file failed");
        goto cleanup;
    }

    // Sync parent directory
    path_copy = strdup(path);
    char *dir_path = dirname(path_copy);
    dir_fd = open(dir_path, O_RDONLY);
    if (dir_fd < 0) goto cleanup;

    if (fsync(dir_fd) < 0) {
        perror("fsync directory failed");
        goto cleanup;
    }

    result = 0;   // Success

cleanup:
    if (fd >= 0) close(fd);
    if (dir_fd >= 0) close(dir_fd);
    free(path_copy);
    return result;
}

Many developers believe that write-new-file, sync, rename is atomic and crash-safe. It's NOT unless you also sync the destination directory after rename. The rename may be lost on crash even if the new file is durable.
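For completeness, the crash-safe version of the write-new-file-then-rename idiom looks roughly like this (helper name is mine; error handling reduced to the essential checks):

#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// Atomically replace 'path' with new contents, surviving a crash at any point.
int atomic_replace(const char *path, const void *data, size_t len) {
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0) {
        close(fd);
        return -1;                     // temp file contents must be durable first
    }
    close(fd);

    if (rename(tmp, path) < 0) return -1;   // atomic swap in the namespace

    // The rename itself lives in the directory; sync it or the swap can be lost.
    char *copy = strdup(path);
    int dir_fd = open(dirname(copy), O_RDONLY);
    free(copy);
    if (dir_fd < 0) return -1;
    int rc = fsync(dir_fd);
    close(dir_fd);
    return rc;
}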
Implementing delayed write correctly is hard. Testing it is harder. Normal testing can't verify durability because tests don't involve power failures. Here are techniques for rigorous durability testing.
Technique 1: dm-flakey (Linux)
Device-mapper target that can simulate various failure modes:
# Create a flakey device: up for 60 seconds, then drops writes for 5 seconds, repeating
dmsetup create mydata --table "0 $(blockdev --getsz /dev/sdb) \
    flakey /dev/sdb 0 60 5 \
    1 drop_writes"
# 60: Seconds device works normally
# 5: Seconds device "fails" (drops writes)
# Then it repeats
# Mount filesystem on flakey device
mkfs.ext4 /dev/mapper/mydata
mount /dev/mapper/mydata /mnt/test
# Run your application
./my_durable_application /mnt/test/data
# During "drop" window, writes are lost
# Application must either:
# - Fail gracefully (detect write failures)
# - Retry when device recovers
# - Have state on disk that allows recovery
Technique 2: libeatmydata
LD_PRELOAD library that makes fsync() return immediately without doing anything:
# Run application with fsync as no-op
LD_PRELOAD=libeatmydata.so ./my_application
# If recovery fails after a crash test under libeatmydata: the application
#   really was depending on those fsync() calls for correctness
# If it still recovers: its correctness does not hinge on the skipped syncs
# Useful for:
# - Performance testing without I/O overhead
# - Identifying excessive fsync() calls
# - Verifying retry logic handles transient failures
Technique 3: Crash Injection in SQLite's Test Harness
SQLite's test infrastructure can inject crashes:
// SQLite's test builds include a crash-simulating VFS (exercised by the
// crash*.test scripts in its TCL test suite) that buffers writes and can
// discard them at a chosen sync point to emulate power loss
// Or use --enable-debug build with -DSQLITE_CRASH_TEST
// Allows triggering crashes at various recovery points
Technique 4: Hardware-Assisted Testing
For ultimate verification, use actual power control:
Test setup:
- Controlled power strip (smart plug with API)
- Test system writing continuously
- Script that cuts power at random intervals
- Second system to verify state after power-on
Verification:
- Boot test system
- Run consistency check
- Verify all "committed" data is present
- Repeat thousands of times
Major database vendors run continuous crash testing (often called 'powerfailtest' or 'crashtest') as part of their CI/CD. SQLite, PostgreSQL, and MySQL all publish results. If you're building durable systems, you need similar testing infrastructure.
Delayed writes represent the practical middle ground between immediate durability and maximum performance. By understanding the synchronization primitives and their proper use, you can build systems that are both fast and correct.
What's Next
We've now covered three fundamental strategies: write-through (immediate), write-back (eventual), and delayed write (explicit control). Next, we'll examine ordered writes—the technique of controlling which writes complete before others, enabling crash-safe data structures without syncing every operation. Ordered writes are the foundation of journaling filesystems and ACID databases.
You now understand delayed write strategies comprehensively—from the fsync family semantics to advanced patterns like epoch flushing and io_uring. You're equipped to design applications that achieve their durability requirements with minimal performance overhead.