Writing data presents a fundamentally different challenge than reading it. When you read data, the worst case is performance degradation—waiting for data to load. When you write data, the worst case is data loss—believing data is saved when it isn't. This asymmetry places write caching at the intersection of the two most critical concerns in computing: performance and reliability.
Write caching dramatically accelerates applications by absorbing writes into fast memory instead of waiting for slow storage devices. But this creates a window of vulnerability: data in the cache but not yet on disk can be lost to power failure, system crashes, or hardware faults. Every operating system must navigate this tradeoff, and understanding how they do so is essential for building systems that are both fast and reliable.
By the end of this page, you will understand how operating systems handle write caching, the various write-back strategies and their tradeoffs, the mechanisms that balance performance against durability, and how applications can control write behavior to meet their specific reliability requirements.
To appreciate write caching, consider what happens without it. Every write operation would block until the storage device physically completed the I/O: for a hard disk, the seek, the rotational delay, and the data transfer itself.
For a hard disk drive with 5-10ms latency, this means at most 100-200 synchronous writes per second. For an application like a database logging transactions, a mail server receiving messages, or even a text editor saving files, this would create unbearable delays.
The situation is actually worse than simple latency suggests. Consider writing a single byte to a file: the kernel must read the containing block from disk, modify the byte in memory, write the block back, and update the file's metadata (timestamps, and possibly the inode and allocation structures).
Without caching, writing a single byte could require 4+ disk operations, each taking 5-10ms—a total of 20-40ms for one byte. That's 25-50 bytes per second maximum throughput!
Write caching transforms this pathological case into something manageable:
The application perceives write latency of microseconds instead of milliseconds. Multiple writes to the same block are coalesced—only the final state needs to reach disk. Sequential writes to adjacent blocks are batched into single large I/O operations.
The result: throughput improvements of 100x to 10,000x compared to synchronous writes, with latency improvements of similar magnitude.
This performance comes with a critical caveat: until data is written to stable storage, it exists only in volatile memory. Power loss erases it completely. Applications that can't tolerate any data loss (databases, financial systems) must use additional mechanisms to ensure durability, which we'll explore throughout this page.
Operating systems implement several strategies for handling write operations, each occupying a different point on the performance-durability spectrum:
In write-through mode, every write goes immediately to both the cache and the backing storage:
Application Write → Cache → Backing Store (synchronously)
↓
Return when both complete
Characteristics: every write pays full storage latency, but the cache and backing store never diverge. There is no window of data loss, and subsequent reads of written data hit the cache.
In write-back (or write-behind) mode, writes go only to the cache initially:
Application Write → Cache (mark dirty)
↓
Return immediately
↓
Later: Writeback to Backing Store
Characteristics: writes complete at memory speed, and repeated writes to the same block coalesce into a single device write. The cost is a window during which dirty data can be lost to a crash or power failure.
In write-around mode, writes bypass the cache entirely:
Application Write → Backing Store (directly)
↓
Return when complete
↓
Cache NOT populated (or invalidated)
Characteristics: write latency is storage-bound, but the cache is not polluted by data that may never be read again. A read soon after the write misses the cache and must go back to storage.
| Strategy | Write Latency | Data Safety | Read-after-Write | Use Case |
|---|---|---|---|---|
| Write-Through | High (disk bound) | Maximum | Cache hit | Critical data, simple systems |
| Write-Back | Low (memory bound) | Deferred | Cache hit | General purpose, performance focus |
| Write-Around | High (disk bound) | Maximum | Cache miss | Bulk writes, cold data |
Modern operating systems predominantly use write-back caching for general file I/O, with mechanisms to provide stronger guarantees when applications request them. The performance benefit is simply too substantial to abandon for most workloads.
When using write-back caching, the operating system must carefully track which cache pages have been modified but not yet written to storage. These are called dirty pages, and managing them correctly is essential for both performance and reliability.
Every page in the page cache carries a dirty flag. When an application writes to a cached page, the kernel sets this flag. The page remains dirty until its contents are written to backing storage, at which point the flag is cleared.
The kernel maintains data structures to efficiently enumerate dirty pages:
```c
/*
 * Dirty page tracking in the Linux kernel
 *
 * Pages are organized in the page cache by file (address_space)
 * and tracked for writeback via tags in the radix tree
 */

/* Page flags for dirty state */
#define PG_dirty     4   /* Page has been modified */
#define PG_writeback 15  /* Page is being written out */

/* Address space tags for efficient dirty page lookup */
#define PAGECACHE_TAG_DIRTY     0  /* Pages needing writeback */
#define PAGECACHE_TAG_WRITEBACK 1  /* Pages currently writing */
#define PAGECACHE_TAG_TOWRITE   2  /* Pages marked for this writeback cycle */

/*
 * Mark a page dirty
 * Called when a write modifies a cached page
 */
void set_page_dirty(struct page *page)
{
    struct address_space *mapping = page->mapping;

    if (!TestSetPageDirty(page)) {
        /* Page was clean - now dirty */

        /* Add dirty tag to radix tree for efficient lookup */
        xa_lock_irq(&mapping->i_pages);
        __xa_set_mark(&mapping->i_pages, page_index(page),
                      PAGECACHE_TAG_DIRTY);
        xa_unlock_irq(&mapping->i_pages);

        /* Account for dirty pages */
        account_page_dirtied(page, mapping);

        /* Wake writeback thread if needed */
        __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
    }
}

/*
 * Clear dirty state after successful writeback
 */
void clear_page_dirty_for_io(struct page *page)
{
    struct address_space *mapping = page->mapping;

    if (TestClearPageDirty(page)) {
        /* Page was dirty - now clean */
        dec_zone_page_state(page, NR_FILE_DIRTY);

        /* Clear tag in radix tree */
        xa_lock_irq(&mapping->i_pages);
        __xa_clear_mark(&mapping->i_pages, page_index(page),
                        PAGECACHE_TAG_DIRTY);
        xa_unlock_irq(&mapping->i_pages);
    }
}

/*
 * Find all dirty pages for a file
 * Used to determine what needs writeback
 */
unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *start,
                            xa_mark_t tag, unsigned int nr_pages,
                            struct page **pages)
{
    XA_STATE(xas, &mapping->i_pages, *start);
    struct page *page;
    unsigned int ret = 0;

    rcu_read_lock();
    xas_for_each_marked(&xas, page, ULONG_MAX, tag) {
        if (xas_retry(&xas, page))
            continue;
        if (!page_cache_get_speculative(page))
            continue;

        pages[ret++] = page;
        if (ret >= nr_pages) {
            *start = page->index + 1;
            break;
        }
    }
    rcu_read_unlock();

    return ret;
}
```

The kernel doesn't allow dirty pages to accumulate indefinitely. Excessive dirty pages create multiple problems:
Memory Pressure: Dirty pages cannot be freed without writing them first, reducing memory available for other uses. Under memory pressure, the system must wait for writeback before reclaiming pages.
Write Cliffs: If too much dirty data accumulates, a flush (from sync, file close, or memory pressure) suddenly needs to write gigabytes, creating long pauses.
Data Loss Risk: More dirty data means more potential data loss on failure.
Linux implements two dirty page thresholds, complemented by two time-based writeback parameters:

| Tunable | Default Value | Effect |
|---|---|---|
| dirty_background_ratio | 10% of memory | Background writeback begins |
| dirty_ratio | 20% of memory | Writing process blocks until dirty pages decrease |
| dirty_writeback_centisecs | 500 (5 seconds) | Writeback thread wakes periodically |
| dirty_expire_centisecs | 3000 (30 seconds) | Pages older than this are written back |
```bash
#!/bin/bash
# View and configure dirty page limits on Linux

echo "=== Current Dirty Page Settings ==="
echo "Background ratio: $(cat /proc/sys/vm/dirty_background_ratio)%"
echo "Foreground ratio: $(cat /proc/sys/vm/dirty_ratio)%"
echo "Writeback interval: $(cat /proc/sys/vm/dirty_writeback_centisecs) centiseconds"
echo "Expire time: $(cat /proc/sys/vm/dirty_expire_centisecs) centiseconds"

echo -e "\n=== Current Dirty Page State ==="
grep -E "^(Dirty|Writeback|MemTotal):" /proc/meminfo

echo -e "\n=== Per-Device Writeback Status ==="
for bdi in /sys/class/bdi/*/; do
    if [ -d "$bdi" ]; then
        name=$(basename "$bdi")
        dirty=$(cat "$bdi/dirty_inflight_bytes" 2>/dev/null || echo "N/A")
        echo "$name: $dirty bytes in flight"
    fi
done

# Example: Reduce dirty limits for faster writeback (SSD-optimized)
# sudo sysctl -w vm.dirty_background_ratio=5
# sudo sysctl -w vm.dirty_ratio=10
# sudo sysctl -w vm.dirty_expire_centisecs=1500

# Example: Increase for high-throughput batch workloads
# sudo sysctl -w vm.dirty_background_ratio=20
# sudo sysctl -w vm.dirty_ratio=40
```

The kernel uses several mechanisms to move dirty data from the cache to stable storage. Understanding these mechanisms is crucial for predicting system behavior and tuning performance.
Linux runs kernel threads (one per block device backing store) responsible for continuous background writeback. These threads wake periodically (every 5 seconds by default) to write dirty pages that have been in cache too long (30 seconds by default):
```c
/*
 * Simplified writeback logic (conceptual)
 *
 * The actual Linux implementation is more complex,
 * handling multiple inodes, congestion, and priorities
 */

/* Main writeback work structure */
struct bdi_writeback {
    struct backing_dev_info *bdi;  /* Device info */
    struct task_struct *task;      /* Writeback thread */
    struct list_head b_dirty;      /* Dirty inodes */
    struct list_head b_io;         /* Inodes ready for I/O */
    struct list_head b_more_io;    /* More work after current batch */
    unsigned long last_old_flush;  /* When we last flushed old data */
};

/*
 * Writeback thread main loop
 */
int wb_workfn(void *data)
{
    struct bdi_writeback *wb = data;

    while (!kthread_should_stop()) {
        /* Wait for work or timeout */
        wait_event_interruptible_timeout(
            wb->wait,
            has_work(wb) || kthread_should_stop(),
            dirty_writeback_interval);

        if (kthread_should_stop())
            break;

        /* Process queued work items */
        while (!list_empty(&wb->work_list)) {
            struct wb_writeback_work *work;

            work = list_first_entry(&wb->work_list,
                                    struct wb_writeback_work, list);
            list_del(&work->list);

            wb_do_writeback(wb, work);

            if (work->single)
                wake_up_process(work->waiter);
            else
                kfree(work);
        }

        /* Check for old dirty pages */
        if (time_after(jiffies,
                       wb->last_old_flush + dirty_expire_interval)) {
            wb_flush_old_pages(wb);
            wb->last_old_flush = jiffies;
        }
    }

    return 0;
}

/*
 * Flush old dirty pages
 */
void wb_flush_old_pages(struct bdi_writeback *wb)
{
    struct inode *inode;
    unsigned long expire_time = jiffies - dirty_expire_interval;

    list_for_each_entry(inode, &wb->b_dirty, i_wb_list) {
        /* Skip if dirtied too recently */
        if (time_after(inode->dirtied_when, expire_time))
            continue;

        /* Write back this inode's dirty pages */
        writeback_single_inode(inode, WB_SYNC_NONE, LONG_MAX);
    }
}

/*
 * Perform writeback for a single inode
 */
int writeback_single_inode(struct inode *inode,
                           enum writeback_sync_modes sync,
                           long nr_to_write)
{
    struct address_space *mapping = inode->i_mapping;
    struct writeback_control wbc = {
        .sync_mode   = sync,
        .nr_to_write = nr_to_write,
        .range_start = 0,
        .range_end   = LLONG_MAX,
    };

    /* Write dirty data pages */
    do_writepages(mapping, &wbc);

    /* Write inode metadata if dirty */
    if (inode->i_state & I_DIRTY)
        write_inode(inode, &wbc);

    return wbc.nr_to_write;
}
```

Beyond periodic background writeback, several events trigger immediate or prioritized writeback:
Memory Pressure: When free memory drops below thresholds, the kernel must reclaim memory. Dirty pages must be written before they can be reclaimed, so memory pressure triggers writeback.
Sync Operations: System calls like sync(), fsync(), and fdatasync() explicitly request writeback. We'll examine these in detail.
File Close: When the last reference to a file is closed, some systems trigger writeback (though not required by POSIX).
Unmount: Unmounting a filesystem requires all dirty data to be written first.
Threshold Exceeded: When dirty data exceeds dirty_ratio, the writing process blocks until dirty pages decrease.
The kernel uses a writeback_control structure to specify writeback parameters:
```c
/*
 * Writeback control structure - parameters for write operations
 */
struct writeback_control {
    /* How many pages to write (decremented as pages written) */
    long nr_to_write;

    /* Writeback mode */
    enum writeback_sync_modes sync_mode;
#define WB_SYNC_NONE 0  /* Don't wait for I/O completion */
#define WB_SYNC_ALL  1  /* Wait for all I/O to complete */

    /* Range to write (for partial file writeback) */
    loff_t range_start;
    loff_t range_end;

    /* Output: pages skipped/written */
    unsigned long pages_skipped;
    unsigned long nr_written;

    /* Flags */
    unsigned for_kupdate:1;       /* Periodic writeback */
    unsigned for_background:1;    /* Background threshold writeback */
    unsigned range_cyclic:1;      /* Cyclic writeback (wrap around) */
    unsigned for_sync:1;          /* sync() or fsync() */
    unsigned for_reclaim:1;       /* Memory reclaim triggered */
    unsigned tagged_writepages:1; /* Use tagged writepages */
};

/*
 * Example: fsync implementation
 */
int fsync_file(struct file *file)
{
    struct address_space *mapping = file->f_mapping;
    struct inode *inode = mapping->host;
    int ret;

    /*
     * Write and wait on all dirty pages for this file.
     * Internally this performs a WB_SYNC_ALL writeback over
     * the whole range [0, LLONG_MAX], waiting for completion.
     */
    ret = filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
    if (ret)
        return ret;

    /* Write inode metadata */
    ret = sync_inode_metadata(inode, 1);

    /* Flush device write cache */
    if (!ret && inode->i_sb->s_bdev)
        blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL);

    return ret;
}
```

Applications requiring durability guarantees must explicitly synchronize data to storage. POSIX defines several primitives with different guarantees and performance characteristics:
The sync() system call schedules all buffered modifications for writing to all filesystems:
void sync(void);
Semantics: Schedules writeback of all dirty data and metadata across all filesystems. Traditionally returns immediately after scheduling; modern Linux waits for completion.
Use Case: Emergency shutdown, system maintenance. Rarely used by applications due to global scope.
int syncfs(int fd);
Semantics: Writes all dirty data and metadata for the filesystem containing the file referred to by fd.
Use Case: Ensuring a specific filesystem is synchronized without affecting others.
int fsync(int fd);
Semantics: Transfers all modified data and metadata for the file to storage and flushes device write caches. Blocks until complete.
Guarantees: After successful return, file data and all metadata necessary to retrieve the file ARE on stable storage.
int fdatasync(int fd);
Semantics: Like fsync(), but only writes data and metadata necessary to access the data (e.g., file size if extended, but not modification time).
Use Case: When you need data durability but don't care about preserving exact timestamps. Can be significantly faster than fsync() on some filesystems.
| Primitive | Scope | Waits for Completion | Flushes Device Cache | Relative Speed |
|---|---|---|---|---|
| sync() | All filesystems | Yes (modern Linux) | Usually | Slowest |
| syncfs() | Single filesystem | Yes | Yes | Slow |
| fsync() | Single file + metadata | Yes | Yes | Medium |
| fdatasync() | Single file data | Yes | Yes | Fastest |
| O_SYNC writes | Each write operation | Yes | Usually | Very slow |
```c
/*
 * Synchronization primitive usage examples
 * Demonstrating different durability guarantees
 */

#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

/*
 * Case 1: Transaction log with strong durability
 *
 * For a database transaction log, we need guaranteed
 * durability before acknowledging a transaction commit.
 */
int write_transaction_log(int log_fd, const char *entry, size_t len)
{
    ssize_t written;

    /* Write the log entry */
    written = write(log_fd, entry, len);
    if (written != (ssize_t)len)
        return -1;

    /*
     * Use fdatasync() for durability.
     * We need the data on disk, but don't care about
     * modification time - fdatasync() suffices and
     * is faster than fsync().
     */
    if (fdatasync(log_fd) == -1)
        return -1;

    return 0;  /* Entry is now durably stored */
}

/*
 * Case 2: Configuration file atomic update
 *
 * For configuration files, we need the file to be
 * either fully old or fully new - never partial.
 * We also need metadata (for the rename).
 */
int atomic_config_update(const char *config_path,
                         const char *new_content, size_t content_len)
{
    char temp_path[256];
    int fd;

    /* Create temporary file in same directory */
    snprintf(temp_path, sizeof(temp_path), "%s.tmp", config_path);

    fd = open(temp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    /* Write new content */
    if (write(fd, new_content, content_len) != (ssize_t)content_len) {
        close(fd);
        unlink(temp_path);
        return -1;
    }

    /*
     * fsync() the file - we need both data AND metadata
     * (specifically, the file's directory entry) to be
     * stable before the rename.
     */
    if (fsync(fd) == -1) {
        close(fd);
        unlink(temp_path);
        return -1;
    }
    close(fd);

    /* Atomically replace old config with new */
    if (rename(temp_path, config_path) == -1) {
        unlink(temp_path);
        return -1;
    }

    /*
     * fsync() the directory to ensure the rename is durable.
     * This is often overlooked but necessary for full durability.
     * (In general, open the directory containing config_path.)
     */
    int dir_fd = open(".", O_RDONLY | O_DIRECTORY);
    if (dir_fd != -1) {
        fsync(dir_fd);
        close(dir_fd);
    }

    return 0;
}

/*
 * Case 3: High-throughput logging with batched sync
 *
 * For applications needing high write throughput with
 * eventual durability (e.g., application logs), batch
 * multiple writes before syncing.
 */
struct log_buffer {
    int fd;
    char buffer[65536];
    size_t used;
    size_t sync_threshold;
};

int buffered_log_write(struct log_buffer *lb, const char *msg, size_t len)
{
    if (len > sizeof(lb->buffer))
        return -1;  /* Message larger than the whole buffer */

    /* Buffer the write */
    if (lb->used + len > sizeof(lb->buffer)) {
        /* Buffer full - flush first */
        if (write(lb->fd, lb->buffer, lb->used) != (ssize_t)lb->used)
            return -1;

        /* Sync periodically based on threshold
         * (static counter: illustrative, not thread-safe) */
        static size_t total_written = 0;
        total_written += lb->used;
        if (total_written >= lb->sync_threshold) {
            fdatasync(lb->fd);  /* Batch sync */
            total_written = 0;
        }

        lb->used = 0;
    }

    memcpy(lb->buffer + lb->used, msg, len);
    lb->used += len;
    return 0;
}
```

fsync() on the file alone is NOT sufficient for many durability scenarios. Filesystem metadata (directory entries, inode tables) may be cached separately. For complete durability after creating or renaming files, you must fsync() the containing directory as well. Many applications have shipped with data loss bugs because of this subtle requirement.
Modern storage stacks contain multiple levels of caching and reordering—disk controller caches, RAID controller caches, SAN caches, and the drives' own NCQ (Native Command Queuing) or write caches. For durability guarantees to be meaningful, the kernel must ensure that data actually reaches stable storage in the required order.
Consider a journaling filesystem writing a transaction: first the journal entries describing the update, then a commit record marking the transaction complete, and finally the in-place writes to the main filesystem structures.
If these writes are reordered (by disk controller, NCQ, etc.), disaster can result. If the commit record reaches disk before the journal entry, and power fails, recovery sees a committed transaction with corrupt data.
To ensure ordering across storage caches, the kernel uses cache flush commands:
SATA/SCSI: the FLUSH CACHE command (ATA) and SYNCHRONIZE CACHE command (SCSI) force all cached writes to stable media
NVMe: Flush command with similar semantics
The kernel issues flushes at critical points to enforce ordering:
```c
/*
 * Write barriers in the Linux block layer
 *
 * Barriers ensure that writes before the barrier
 * are on stable media before writes after the barrier
 * can begin.
 */

/*
 * Issue a flush to the block device
 * This ensures all previous writes are on stable media
 */
int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask)
{
    struct bio *bio;
    int ret;

    bio = bio_alloc(bdev, 0, REQ_OP_FLUSH | REQ_PREFLUSH, gfp_mask);
    if (!bio)
        return -ENOMEM;

    /* Submit and wait for completion */
    ret = submit_bio_wait(bio);
    bio_put(bio);

    return ret;
}

/*
 * Submit a write with flush flags
 *
 * REQ_PREFLUSH: Flush cache before this write
 * REQ_FUA: This write goes directly to media (not cached)
 */
void submit_critical_write(struct block_device *bdev, sector_t sector,
                           void *data, size_t len)
{
    struct bio *bio;

    bio = bio_alloc(bdev, 1, REQ_OP_WRITE | REQ_PREFLUSH | REQ_FUA,
                    GFP_KERNEL);
    bio->bi_iter.bi_sector = sector;
    bio_add_page(bio, virt_to_page(data), len, offset_in_page(data));

    /*
     * This write will:
     * 1. Flush all previous cached writes (REQ_PREFLUSH)
     * 2. Write this data
     * 3. Force this data to stable media (REQ_FUA)
     *
     * After completion, all data written before AND including
     * this write is guaranteed on stable storage.
     */
    submit_bio_wait(bio);
    bio_put(bio);
}

/*
 * Journal commit sequence with proper barriers
 * (Conceptual ext4-style ordered mode)
 */
int commit_transaction(struct journal *j, struct transaction *t)
{
    int ret;

    /* Phase 1: Write journal descriptor block */
    ret = write_journal_descriptor(j, t);
    if (ret)
        return ret;

    /* Phase 2: Write journal data blocks */
    ret = write_journal_data(j, t);
    if (ret)
        return ret;

    /*
     * CRITICAL BARRIER: Ensure descriptor and data are
     * on disk before writing commit block.
     *
     * Without this barrier, the commit block could reach
     * disk first, and a crash would cause recovery to
     * replay a partial/corrupt transaction.
     */
    blkdev_issue_flush(j->bdev, GFP_KERNEL);

    /*
     * Phase 3: Write commit block with FUA
     *
     * The commit block uses FUA to bypass device cache,
     * ensuring it's on stable media after this returns.
     */
    ret = write_commit_block_fua(j, t);
    if (ret)
        return ret;

    /* Transaction is now committed and durable */
    return 0;
}
```

FUA (Force Unit Access) is a write flag indicating that the data should bypass the drive's write cache and go directly to stable media. This is critical for commit records and other data that must be durable immediately.
Without FUA:
Data → OS Cache → Device Cache → Media
↑
Power loss here loses data!
With FUA:
Data → OS Cache → Media (bypasses device cache)
FUA is more efficient than a full cache flush when you need one specific write to be durable but don't care about previous writes (they may have their own FUA or you'll flush later).
Not all hardware supports these features honestly: some consumer drives acknowledge flush commands before data actually reaches stable media, and some RAID controllers silently discard flush requests, relying on battery-backed caches that may or may not be present and healthy.
The only way to know if your storage stack correctly implements barriers is to test it. Tools like diskchecker.pl simulate power failures and verify data integrity. Production systems with durability requirements should be tested with actual power pulls during heavy write loads.
Applications have significant control over write caching behavior through open flags and runtime calls. Understanding these options enables building systems with precisely the right tradeoff between performance and durability.
| Flag | Effect | Performance Impact | Use Case |
|---|---|---|---|
| O_SYNC | Each write waits for data+metadata to reach storage | Very significant | Critical data requiring guaranteed write ordering |
| O_DSYNC | Each write waits for data to reach storage | Significant | Data-only durability (faster than O_SYNC) |
| O_DIRECT | Bypasses page cache entirely | Complex (reduces caching, enables DMA) | Databases with own caching layer |
```c
/*
 * Write caching control patterns
 * Different strategies for different requirements
 */

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Pattern 1: Synchronous writes for critical single-threaded data
 *
 * O_SYNC ensures every write is on stable storage before returning.
 * Simple but slow - use only when durability beats all else.
 */
int open_synchronous_critical(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
}

/*
 * Pattern 2: Direct I/O for databases
 *
 * O_DIRECT bypasses the page cache, giving the application
 * full control over buffering and caching. The database
 * implements its own buffer pool with specialized eviction,
 * prefetching, and WAL sync strategies.
 *
 * Note: O_DIRECT has alignment requirements!
 */
int open_direct_database(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd == -1)
        return -1;

    /*
     * Direct I/O usually still requires explicit sync for
     * durability - data bypasses OS cache but may still
     * be in device cache.
     */
    return fd;
}

/*
 * Aligned buffer for O_DIRECT
 */
void *alloc_direct_buffer(size_t size)
{
    void *buf = NULL;

    /* Align to filesystem block size (typically 4KB) */
    if (posix_memalign(&buf, 4096, size) != 0)
        return NULL;
    return buf;
}

/*
 * Pattern 3: Write-ahead log (WAL) with group commit
 *
 * Balance throughput and durability by batching syncs.
 * Many transactions write to log, but one sync covers all.
 * (Simplified: production group commit tracks per-batch
 * generations so late arrivals don't wait on the wrong sync.)
 */
struct wal_group_commit {
    int log_fd;
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    int pending_count;     /* Transactions waiting for sync */
    int sync_in_progress;  /* Sync currently happening */
    int next_group_size;   /* Target batch size */
};

int wal_write_and_sync(struct wal_group_commit *wal,
                       const void *entry, size_t len)
{
    pthread_mutex_lock(&wal->mutex);

    /* Write our entry (protected by mutex) */
    write(wal->log_fd, entry, len);
    wal->pending_count++;

    /* Check if we should trigger a group commit */
    if (!wal->sync_in_progress &&
        wal->pending_count >= wal->next_group_size) {
        /* We'll lead this group commit */
        wal->sync_in_progress = 1;
        int group_size = wal->pending_count;
        pthread_mutex_unlock(&wal->mutex);

        /* Do the actual sync (may take several ms) */
        fdatasync(wal->log_fd);

        pthread_mutex_lock(&wal->mutex);
        /* Wake all waiters from this batch */
        wal->sync_in_progress = 0;
        wal->pending_count = 0;
        pthread_cond_broadcast(&wal->cond);

        /* Adaptive batch sizing based on throughput */
        if (group_size > 10)
            wal->next_group_size = group_size;
    } else {
        /* Wait for the leader to sync */
        while (wal->pending_count > 0)
            pthread_cond_wait(&wal->cond, &wal->mutex);
    }

    pthread_mutex_unlock(&wal->mutex);
    return 0;
}

/*
 * Pattern 4: Tiered durability based on data importance
 */
enum durability_level {
    DURABILITY_NONE,      /* Fire and forget */
    DURABILITY_EVENTUAL,  /* Will sync eventually */
    DURABILITY_IMMEDIATE, /* Sync before returning */
};

int write_with_durability(int fd, const void *data, size_t len,
                          enum durability_level level)
{
    ssize_t ret = write(fd, data, len);
    if (ret != (ssize_t)len)
        return -1;

    switch (level) {
    case DURABILITY_NONE:
        /* Rely on background writeback */
        break;

    case DURABILITY_EVENTUAL:
        /* Hint to sync soon but don't wait */
        sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
        break;

    case DURABILITY_IMMEDIATE:
        /* Full sync - guaranteed durable */
        if (fdatasync(fd) == -1)
            return -1;
        break;
    }

    return 0;
}
```

Applications can provide hints about write patterns using posix_fadvise().
POSIX_FADV_DONTNEED after writing can hint that the data doesn't need to stay cached. POSIX_FADV_NOREUSE hints that data will only be accessed once. These don't guarantee behavior but can improve cache efficiency for specific access patterns.
Write caching sits at the critical intersection of performance and reliability. We've explored the full stack from application-level control down to hardware synchronization primitives. The key insights: write-back caching delivers the performance modern systems depend on; dirty page limits bound the window of potential loss; the fsync() family is the application's durability contract; and flush commands plus FUA extend that contract through device caches.
What's Next:
The next page examines cache policies—the algorithms that determine which data to keep in cache when space is limited. Understanding replacement policies like LRU, LFU, ARC, and their variants is essential for predicting and optimizing cache behavior under real workloads.
You now understand write caching from application intent to hardware synchronization. You can reason about durability guarantees, choose appropriate synchronization primitives, and understand the performance implications of different choices. This knowledge is essential for building systems that are both fast and reliable.