When a process modifies data, the change initially exists only in volatile RAM within the buffer cache. If power fails before the modification reaches stable storage, the data is lost forever. This reality creates one of the most fundamental tradeoffs in system design: performance versus durability.
The write-back policy determines when and how the operating system transfers modified ("dirty") data from the buffer cache to persistent storage. Get this wrong, and you face either:
Unacceptable performance: If every write immediately goes to disk, applications slow to a crawl. A simple save operation might take seconds instead of milliseconds.
Unacceptable data loss: If writes are cached too aggressively, a crash or power failure destroys hours of work, corrupts databases, or leaves file systems in inconsistent states.
Mastering write-back policies is essential for anyone building reliable systems, tuning database performance, or diagnosing mysterious data loss.
By the end of this page, you will understand: (1) The fundamental tradeoff between write performance and durability, (2) Write-through versus write-back caching strategies, (3) Delayed write (write-behind) implementation details, (4) How dirty pages are tracked and flushed, (5) The kernel's pdflush/flush threads and writeback mechanisms, and (6) Configuration parameters for tuning write-back behavior.
There are fundamentally three approaches to handling writes in a cached system. Each makes different tradeoffs between performance, durability, and complexity.
1. Write-Through (Synchronous Write)
Every write immediately propagates to stable storage. The write call doesn't return until the data is safely on disk.
2. Write-Back (Write-Behind, Delayed Write)
Writes are held in cache and written to storage later, either periodically or when memory pressure demands. The write call returns immediately after updating the cache.
3. Write-Around
Writes bypass the cache entirely and go directly to storage. The cache is not updated (or is invalidated). Useful when written data won't be read soon.
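These policies are not only kernel-wide choices; on Linux an application can opt into write-through-like or write-around-like behavior per file descriptor. The sketch below is a minimal illustration (file names and sizes are arbitrary, and O_DIRECT requires filesystem support): O_SYNC makes each write() wait until the data reaches stable storage, while O_DIRECT bypasses the page cache entirely.

```c
#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Write-through style: write() returns only after the data reaches storage */
    int sync_fd = open("synced.dat", O_WRONLY | O_CREAT | O_SYNC, 0644);
    if (sync_fd >= 0) {
        write(sync_fd, "durable before return\n", 22);
        close(sync_fd);
    }

    /* Write-around style: O_DIRECT bypasses the page cache.
     * Buffers must be aligned (typically to 512 bytes or the block size). */
    int direct_fd = open("streamed.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (direct_fd >= 0) {
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) == 0) {
            memset(buf, 'A', 4096);
            write(direct_fd, buf, 4096);   /* skips the page cache entirely */
            free(buf);
        }
        close(direct_fd);
    }
    return 0;
}
```

O_SYNC trades latency for per-write durability, whereas O_DIRECT mainly avoids cache pollution for data that will not be re-read soon.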
| Characteristic | Write-Through | Write-Back | Write-Around |
|---|---|---|---|
| Write latency | High (disk speed) | Low (memory speed) | High (disk speed) |
| Read after write | Fast (cached) | Fast (cached) | Slow (must read from disk) |
| Durability | Excellent | Risk of data loss | Excellent |
| Write bandwidth | Limited by disk | Aggregated, efficient | Limited by disk |
| Cache pollution | May pollute | May pollute | Avoids pollution |
| Implementation | Simple | Complex | Simple |
| Use cases | Critical data | General purpose | Streaming writes |
Why Write-Back Dominates:
Modern operating systems primarily use write-back caching because it lets write() calls complete at memory speed, coalesces repeated writes to the same blocks into a single disk operation, and gives the I/O scheduler freedom to batch and order writes efficiently.
Example: Write Coalescing Benefit
```
# Application writes to a file rapidly:
write(fd, "Hello", 5);   # Modify bytes 0-4
write(fd, "World", 5);   # Modify bytes 5-9
write(fd, "!", 1);       # Modify byte 10
write(fd, "...", 3);     # Modify bytes 11-13

# With WRITE-THROUGH (4 disk operations):
Time 0ms:  Write bytes 0-4 to disk    [5ms disk latency]
Time 5ms:  Write bytes 5-9 to disk    [5ms disk latency]
Time 10ms: Write byte 10 to disk      [5ms disk latency]
Time 15ms: Write bytes 11-13 to disk  [5ms disk latency]
Total: 20ms for 4 write syscalls

# With WRITE-BACK (1 disk operation, deferred):
Time 0.001ms: Update bytes 0-4 in cache    [~1 µs]
Time 0.002ms: Update bytes 5-9 in cache    [~1 µs]
Time 0.003ms: Update byte 10 in cache      [~1 µs]
Time 0.004ms: Update bytes 11-13 in cache  [~1 µs]
Total: ~0.004ms for 4 write syscalls (5000x faster!)

# Later, the writeback thread combines all into ONE disk write:
Time 5000ms: Write bytes 0-13 to disk [5ms disk latency]

# Even better: if bytes 0-13 are modified 100 more times
# before writeback, still only ONE disk write occurs!
```

Write-back's performance comes at a cost: if the system crashes before dirty data is written to disk, that data is lost. A 30-second writeback delay means potentially 30 seconds of recent writes vanish on power failure. Critical applications (databases, financial systems) must explicitly sync data or use write-through for durability.
For write-back caching to work, the operating system must track which cached pages contain modifications that haven't been written to storage. This tracking is essential for knowing which pages must eventually be flushed, for skipping clean pages during writeback, and for honoring sync requests such as fsync() that must wait until specific dirty pages are durable.
The Dirty Bit:
Each cached page has a dirty bit (or flag) indicating whether it has been modified since being read from or last written to storage.
Dirty Tracking Mechanisms:
```c
/*
 * Linux page flags related to dirty state
 * Defined in include/linux/page-flags.h
 */

/* Page flag bits */
#define PG_dirty      4   /* Page has been modified */
#define PG_writeback 15   /* Page is being written back now */
#define PG_reclaim   18   /* Page marked for reclaim */

/*
 * Dirty page lifecycle:
 *
 * 1. Page read from disk   -> Clean (PG_dirty clear)
 * 2. Process modifies page -> Mark dirty (PG_dirty set)
 * 3. Writeback initiated   -> PG_dirty cleared, PG_writeback set
 * 4. Writeback completes   -> PG_writeback cleared
 *
 * If the page is modified again during writeback:
 * 3b. PG_dirty gets re-set while PG_writeback is still set
 * 4b. After writeback, PG_writeback is cleared but PG_dirty remains
 *     -> Page needs another writeback
 */

/* Check and set dirty state */
static inline int PageDirty(struct page *page)
{
    return test_bit(PG_dirty, &page->flags);
}

static inline void SetPageDirty(struct page *page)
{
    set_bit(PG_dirty, &page->flags);
}

static inline void ClearPageDirty(struct page *page)
{
    clear_bit(PG_dirty, &page->flags);
}

/*
 * Mark a page as dirty (called when the page is modified).
 * This is a simplified version - the real implementation also handles:
 *  - Inode dirty list management
 *  - Memory cgroup accounting
 *  - Block device backing
 */
int set_page_dirty(struct page *page)
{
    struct address_space *mapping = page_mapping(page);

    if (!PageDirty(page)) {
        /* First time dirtying this page */
        SetPageDirty(page);

        if (mapping) {
            /* Add to the inode's dirty page tracking */
            spin_lock(&mapping->private_lock);

            /* Tag the page as dirty in the radix tree */
            radix_tree_tag_set(&mapping->page_tree,
                               page_index(page),
                               PAGECACHE_TAG_DIRTY);

            /* If this is the first dirty page for the inode, add the
             * inode to the backing device's dirty list */
            if (mapping->nrpages_dirty++ == 0)
                inode_mark_dirty(mapping->host);

            spin_unlock(&mapping->private_lock);
        }
        return 1;   /* Page was newly dirtied */
    }
    return 0;       /* Page was already dirty */
}
```

Inode and Backing Device Dirty Lists:
Linux organizes dirty pages hierarchically:
```
Backing Device (e.g., /dev/sda)
└── Dirty Inode List
    ├── Inode A (dirty)
    │   └── Dirty Pages: [100, 142, 205, ...]
    ├── Inode B (dirty)
    │   └── Dirty Pages: [0, 1, 2, 50, ...]
    └── Inode C (dirty)
        └── Dirty Pages: [1024, 1025, ...]
```
This organization enables per-device writeback threads to work in parallel, lets the kernel quickly find every dirty page belonging to a given inode, and keeps writes to the same file grouped so they can be issued as larger, more sequential I/Os.
CPUs provide hardware support for dirty tracking. Page table entries (PTEs) include a 'dirty' bit that the MMU sets automatically whenever a write occurs through that mapping. The OS periodically harvests these hardware dirty bits, enabling efficient tracking without software overhead on every write.
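As a concrete illustration, on x86-64 the hardware dirty flag lives in bit 6 of each page table entry. The tiny sketch below shows the kind of check a harvesting pass performs; the helper name is hypothetical, and only the bit positions reflect the real x86-64 PTE layout (a real kernel must also flush the TLB after clearing the flag).

```c
#include <stdbool.h>
#include <stdint.h>

/* x86-64 page table entry flag bits (bit 6 is the hardware dirty flag) */
#define X86_PTE_PRESENT (1ULL << 0)
#define X86_PTE_DIRTY   (1ULL << 6)

/* Hypothetical harvesting helper: returns true if the MMU marked the page
 * dirty since the last harvest, and clears the flag so the next write
 * through this mapping will set it again. */
static bool harvest_hw_dirty(uint64_t *pte)
{
    if ((*pte & X86_PTE_PRESENT) && (*pte & X86_PTE_DIRTY)) {
        *pte &= ~X86_PTE_DIRTY;   /* clear so future writes are detected */
        return true;              /* caller marks the cached page dirty */
    }
    return false;
}
```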
The operating system uses dedicated kernel threads to handle the actual writing of dirty pages to storage. This background writeback is essential for performance—it allows the system to choose optimal moments for I/O rather than blocking applications.
Linux Writeback Architecture:
Linux has evolved through several writeback implementations: the early bdflush/kupdate daemons, then a pool of generic pdflush threads in the 2.6 series, and finally dedicated per-backing-device (per-BDI) flusher threads, which remain the basis of the current design.
Current Architecture (Per-BDI Writeback):
```c
/*
 * Per-Backing Device Info (BDI) writeback architecture
 *
 * Each block device has its own writeback thread:
 *   flush-8:0    <- /dev/sda  (major 8, minor 0)
 *   flush-8:16   <- /dev/sdb  (major 8, minor 16)
 *   flush-253:0  <- /dev/dm-0 (device mapper)
 *
 * This allows parallel writeback to different devices.
 */

struct bdi_writeback {
    struct backing_dev_info *bdi;   /* Parent BDI */

    /* Lists of dirty inodes */
    struct list_head b_dirty;       /* Dirty inodes */
    struct list_head b_io;          /* Inodes being written */
    struct list_head b_more_io;     /* Inodes needing more I/O */
    struct list_head b_dirty_time;  /* Time-expired dirty inodes */

    /* Writeback control */
    unsigned long last_old_flush;   /* When the last periodic flush ran */
    struct delayed_work dwork;      /* The actual work item */
};

/*
 * Main writeback function - called by the flusher thread
 */
void wb_workfn(struct work_struct *work)
{
    struct bdi_writeback *wb = container_of(to_delayed_work(work),
                                            struct bdi_writeback, dwork);
    long pages_written;

    /* Set up writeback control */
    struct writeback_control wbc = {
        .sync_mode   = WB_SYNC_NONE,   /* Non-blocking */
        .nr_to_write = LONG_MAX,       /* No page limit */
        .range_start = 0,
        .range_end   = LLONG_MAX,
    };

    /* Write dirty pages until we've caught up */
    while (wb_has_dirty_io(wb)) {
        pages_written = writeback_inodes_wb(wb, &wbc);
        if (pages_written <= 0)
            break;

        /* Yield the CPU to avoid monopolizing it */
        cond_resched();
    }

    /* Schedule the next run if there's more work or for the periodic flush */
    if (wb_has_dirty_io(wb) || time_for_periodic_flush(wb)) {
        queue_delayed_work(bdi_wq, &wb->dwork,
                           msecs_to_jiffies(dirty_writeback_interval * 10));
                           /* interval is in centisecs -> convert to ms */
    }
}

/*
 * Write back dirty inodes for this device
 */
long writeback_inodes_wb(struct bdi_writeback *wb,
                         struct writeback_control *wbc)
{
    struct inode *inode;
    long pages_written = 0;

    spin_lock(&wb->list_lock);

    while (!list_empty(&wb->b_io)) {
        inode = list_first_entry(&wb->b_io, struct inode, i_io_list);
        spin_unlock(&wb->list_lock);

        /* Write this inode's dirty pages */
        pages_written += writeback_single_inode(inode, wbc);

        spin_lock(&wb->list_lock);

        /* Move the inode based on the result */
        if (inode_has_dirty_pages(inode)) {
            /* More pages to write - put on b_more_io */
            list_move(&inode->i_io_list, &wb->b_more_io);
        } else {
            /* All clean - remove from the dirty lists */
            list_del_init(&inode->i_io_list);
        }
    }

    spin_unlock(&wb->list_lock);
    return pages_written;
}
```

Writeback Triggers:
Writeback occurs in several scenarios:
| Trigger | Description | Urgency |
|---|---|---|
| Periodic timer | Every dirty_writeback_centisecs (default 5s) | Low |
| Age threshold | Pages dirty longer than dirty_expire_centisecs (30s) | Medium |
| Memory pressure | System running low on free pages | High |
| Explicit sync | sync(), fsync(), or fdatasync() call | Immediate |
| Dirty ratio exceeded | Dirty pages exceed dirty_ratio of RAM | Immediate (blocking) |
| Background ratio | Dirty pages exceed dirty_background_ratio | Background (non-blocking) |
When dirty pages grow too numerous, the kernel 'throttles' writing processes—making them sleep in the write path. This backpressure prevents dirty pages from growing unboundedly and ensures the system can recover. The throttling is proportional: the more over-limit, the longer the sleep.
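A minimal sketch of the proportional idea follows. This is not the actual kernel algorithm (Linux's balance_dirty_pages() is far more sophisticated, with per-task and per-device rate estimation); it only illustrates "the more over-limit, the longer the sleep."

```c
#include <unistd.h>

/* Illustrative only: sleep longer the further dirty pages are over the
 * background threshold, up to a cap, so writers are slowed in proportion
 * to how badly they are outrunning the storage device. */
static void throttle_writer(unsigned long nr_dirty,
                            unsigned long background_thresh,
                            unsigned long hard_limit)
{
    if (nr_dirty <= background_thresh)
        return;                          /* plenty of headroom: no delay */

    /* How much of the "danger zone" is already consumed, in percent */
    unsigned long over = nr_dirty - background_thresh;
    unsigned long zone = hard_limit - background_thresh;
    unsigned long pct  = (100 * over) / (zone ? zone : 1);

    /* Sleep between 0 and ~100 ms depending on how far over we are */
    usleep(pct * 1000);
}
```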
Linux provides tunable parameters that control writeback behavior. Understanding these is essential for optimizing system performance and durability for specific workloads.
Key Parameters (in /proc/sys/vm/):
| Parameter | Default | Description |
|---|---|---|
| dirty_background_ratio | 10 | Percent of RAM at which background writeback starts |
| dirty_background_bytes | 0 | Absolute byte threshold (overrides ratio if set) |
| dirty_ratio | 20 | Percent of RAM at which processes block on writes |
| dirty_bytes | 0 | Absolute byte threshold for blocking |
| dirty_writeback_centisecs | 500 | Interval between writeback thread wakeups (5 seconds) |
| dirty_expire_centisecs | 3000 | Age at which dirty data is 'old' and must be written (30 seconds) |
```bash
#!/bin/bash
# Writeback tuning examples

# View current values
cat /proc/sys/vm/dirty_ratio
cat /proc/sys/vm/dirty_background_ratio
cat /proc/sys/vm/dirty_expire_centisecs
cat /proc/sys/vm/dirty_writeback_centisecs

# SCENARIO 1: Database Server (prioritize durability)
# Low dirty thresholds = less data at risk
# Frequent writeback = faster flush
echo 5 > /proc/sys/vm/dirty_ratio
echo 2 > /proc/sys/vm/dirty_background_ratio
echo 500 > /proc/sys/vm/dirty_expire_centisecs      # 5 seconds
echo 100 > /proc/sys/vm/dirty_writeback_centisecs   # 1 second

# SCENARIO 2: File Server (prioritize throughput)
# Higher dirty limits = more write coalescing
# Slower writeback = fewer interruptions
echo 40 > /proc/sys/vm/dirty_ratio
echo 20 > /proc/sys/vm/dirty_background_ratio
echo 6000 > /proc/sys/vm/dirty_expire_centisecs     # 60 seconds
echo 1000 > /proc/sys/vm/dirty_writeback_centisecs  # 10 seconds

# SCENARIO 3: Machine with limited RAM (128MB)
# Use absolute bytes instead of percentages
# Prevents small RAM from having tiny dirty limits
echo 0 > /proc/sys/vm/dirty_ratio                    # Disable ratio
echo 50000000 > /proc/sys/vm/dirty_bytes             # 50MB hard limit
echo 0 > /proc/sys/vm/dirty_background_ratio
echo 25000000 > /proc/sys/vm/dirty_background_bytes  # 25MB background

# SCENARIO 4: Laptop (prioritize battery life)
# Less frequent writeback = more disk idle time
# Allow the disk to spin down longer
echo 90 > /proc/sys/vm/laptop_mode                   # Enable laptop mode (seconds)
echo 60 > /proc/sys/vm/dirty_ratio
echo 40 > /proc/sys/vm/dirty_background_ratio
echo 60000 > /proc/sys/vm/dirty_expire_centisecs     # 10 minutes
echo 60000 > /proc/sys/vm/dirty_writeback_centisecs  # 10 minutes

# Monitor dirty page status
watch -n1 'grep -E "Dirty|Writeback" /proc/meminfo'
```

Understanding the Two Thresholds:
The dirty_background_ratio and dirty_ratio work together:
```
                   Memory
                     │
                     │
dirty_ratio (20%)    ├─────── BLOCKING THRESHOLD ─────────┐
                     │        Processes sleep until       │
                     │        dirty pages drop below      │
                     │                                    │
                     │   ← Dirty pages in danger zone →   │
                     │                                    │
dirty_background     ├─────── WRITEBACK THRESHOLD ────────┘
ratio (10%)          │        Flusher threads actively
                     │        writing dirty pages
                     │
                     │   ← Normal operation →
                     │
                   0 └─────────────────────────────────────
```
On systems with large RAM (256GB+), default percentages become problematic. 20% of 256GB = 51GB of potentially dirty data! If power fails, you could lose 51GB of writes. For such systems, use absolute byte limits (dirty_bytes, dirty_background_bytes) to cap exposure regardless of RAM size.
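To see how close a system actually gets to these thresholds, you can read the Dirty and Writeback counters from /proc/meminfo (the same counters the monitoring commands elsewhere on this page grep for). A minimal sketch:

```c
#include <stdio.h>
#include <string.h>

/* Print the current amount of dirty and under-writeback page cache,
 * as reported by /proc/meminfo (values are in kB). */
int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) {
        perror("/proc/meminfo");
        return 1;
    }

    char line[128];
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "Dirty:", 6) == 0 ||
            strncmp(line, "Writeback:", 10) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}
```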
Modern storage systems include multiple layers of caching—not just the OS buffer cache, but also disk controller caches and drive firmware caches. Write barriers ensure that writes reach stable storage in the correct order, which is critical for file system consistency.
The Ordering Problem:
Consider a file system appending to a file. It must write the new data block (call it D) and update the metadata block (call it M) that records the file's new size and points to D. For consistency, D must reach disk before M. If M reaches disk before D (due to reordering in any cache layer) and the system crashes, the file system will have metadata pointing to garbage—a corruption scenario.
Write Barrier Operation:
```c
/*
 * Write barriers ensure ordering across cache layers.
 *
 * A write barrier has these semantics:
 *   All writes BEFORE the barrier complete to stable storage
 *   BEFORE any writes AFTER the barrier begin.
 */

/*
 * File system journal write with barriers
 * Ensures data integrity after a crash
 */
void journal_commit_transaction(journal_t *journal,
                                transaction_t *transaction)
{
    /* Step 1: Write the journal descriptor block */
    /* Describes what data is in this transaction */
    write_journal_descriptor(journal, transaction);

    /* Step 2: Write all data/metadata blocks to the journal */
    /* These are copies that can be replayed after a crash */
    list_for_each_entry(block, &transaction->blocks, list) {
        write_to_journal(journal, block);
    }

    /* BARRIER: Ensure all journal data is on stable storage */
    /* before we write the commit record */
    blkdev_issue_flush(journal->j_dev, GFP_KERNEL, NULL);

    /* Step 3: Write the commit record */
    /* This marks the transaction as complete */
    write_journal_commit(journal, transaction);

    /* BARRIER: Ensure the commit record is on stable storage */
    /* before we write to the final locations */
    blkdev_issue_flush(journal->j_dev, GFP_KERNEL, NULL);

    /* Step 4: Now it is safe to write to the actual file system locations */
    /* (called "checkpointing") */
    checkpoint_transaction(journal, transaction);
}

/*
 * The blkdev_issue_flush function
 * Sends a cache flush command to the storage device
 */
int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
                       sector_t *error_sector)
{
    struct bio *bio;
    int ret;

    /* Allocate a bio for the flush command */
    bio = bio_alloc(GFP_KERNEL, 0);
    bio_set_dev(bio, bdev);
    bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;   /* Flush operation */

    /* Submit and wait */
    ret = submit_bio_wait(bio);
    bio_put(bio);
    return ret;
}

/*
 * Storage device perspective:
 *
 * When the device receives a FLUSH command:
 *   1. Complete all pending writes in the device cache
 *   2. Force data to stable media (platters for HDD, cells for SSD)
 *   3. Return success only after the data is truly persisted
 *
 * Commands:
 *   SATA: FLUSH CACHE / FLUSH CACHE EXT
 *   SCSI: SYNCHRONIZE CACHE
 *   NVMe: Flush command
 */
```

| Cache Layer | Size | Volatile? | Respects Barriers? |
|---|---|---|---|
| OS Buffer Cache | GB | Yes | N/A (software) |
| RAID Controller Cache | MB-GB | Maybe (BBU) | Must be configured |
| Disk Drive Cache | MB | Usually yes | With FUA/flush commands |
| SSD Controller Cache | MB | Newer have capacitors | With flush commands |
FUA (Force Unit Access) is an alternative to barriers for individual writes. A write with FUA set bypasses the disk's volatile cache and goes directly to stable media. It's more efficient than flush when only specific writes need immediate durability, as it doesn't drain the entire cache.
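Mirroring the style of the flush example above, a kernel-side write that needs per-request durability could set REQ_FUA on the bio instead of issuing a separate flush. This is a sketch under the same illustrative assumptions, not production code:

```c
/* Sketch: write one page with Force Unit Access so this specific write
 * bypasses the drive's volatile cache, without draining the whole cache. */
int write_page_fua(struct block_device *bdev, struct page *page,
                   sector_t sector)
{
    struct bio *bio = bio_alloc(GFP_KERNEL, 1);
    int ret;

    bio_set_dev(bio, bdev);
    bio->bi_iter.bi_sector = sector;
    bio_add_page(bio, page, PAGE_SIZE, 0);

    /* REQ_FUA: the data must be on stable media before completion.
     * (REQ_PREFLUSH could be OR'd in if earlier writes must be flushed too.) */
    bio->bi_opf = REQ_OP_WRITE | REQ_FUA;

    ret = submit_bio_wait(bio);
    bio_put(bio);
    return ret;
}
```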
Applications that need durability guarantees cannot rely on the OS's delayed writeback—they must explicitly request synchronization. The fsync() family of system calls provides this capability.
The Sync Family:
| Call | Scope | What It Syncs | Use Case |
|---|---|---|---|
| sync() | Entire system | All dirty buffers for all filesystems | Before shutdown |
| syncfs(fd) | Single filesystem | All dirty buffers for one filesystem | Before unmount |
| fsync(fd) | Single file | File data + metadata (size, timestamps) | Transactional writes |
| fdatasync(fd) | Single file data | File data only (skip some metadata) | Performance-critical |
| sync_file_range() | Byte range | Specific portion of file | Large file streaming |
```c
/*
 * Demonstrating proper durable write patterns
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libgen.h>
#include <unistd.h>
#include <fcntl.h>

/*
 * WRONG: Data can be lost on a crash.
 * close() does not guarantee durability!
 */
void write_file_wrong(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, data, len);
    close(fd);   /* Data may still be in the buffer cache! */
}

/*
 * CORRECT: Data is durable after fsync returns.
 */
void write_file_correct(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, data, len);

    /* fsync ensures data + metadata reach stable storage */
    if (fsync(fd) != 0) {
        perror("fsync failed");
        /* Handle the error - data may not be durable! */
    }
    close(fd);
}

/*
 * Atomic file update using rename.
 * This is the GOLD STANDARD for durable writes.
 */
void atomic_write_file(const char *path, const char *data, size_t len)
{
    char tmp_path[256];
    snprintf(tmp_path, sizeof(tmp_path), "%s.tmp.XXXXXX", path);

    /* Create a temporary file */
    int fd = mkstemp(tmp_path);
    if (fd < 0) {
        perror("mkstemp");
        return;
    }

    /* Write the data */
    ssize_t written = write(fd, data, len);
    if (written != (ssize_t)len) {
        close(fd);
        unlink(tmp_path);
        return;
    }

    /* CRITICAL: fsync the file data */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return;
    }
    close(fd);

    /* Rename is atomic on POSIX systems */
    if (rename(tmp_path, path) != 0) {
        unlink(tmp_path);
        return;
    }

    /* CRITICAL: fsync the directory to make the rename durable.
     * dirname() may modify its argument, so work on a copy. */
    char dir_buf[256];
    snprintf(dir_buf, sizeof(dir_buf), "%s", path);
    int dir_fd = open(dirname(dir_buf), O_RDONLY);
    if (dir_fd >= 0) {
        fsync(dir_fd);
        close(dir_fd);
    }

    /* Now the update is fully durable and atomic:
     *  - If crash before rename: old file unchanged
     *  - If crash after rename:  new file complete
     * Never a partial or corrupted state! */
}

/*
 * fdatasync vs fsync: when to use which
 */
void fdatasync_example(int fd)
{
    /*
     * fdatasync() is like fsync() but:
     *  - Only syncs data, not all metadata
     *  - Skips metadata that doesn't affect data retrieval
     *    (e.g., access time, modification time)
     *  - Still syncs the size if the file was extended
     *
     * Faster when only data integrity matters.
     */

    /* Use fdatasync for performance-critical paths */
    fdatasync(fd);   /* ~30% faster than fsync on some systems */

    /* Use fsync when metadata matters */
    fsync(fd);       /* Guarantees timestamps, permissions, etc. */
}
```

Many developers forget that file creation/rename requires syncing the parent directory, not just the file. Without the directory fsync, a crash might leave the new file invisible (it existed but the directory entry didn't persist). This has caused data loss in many applications, including older versions of SQLite and git.
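The sync family table above also lists sync_file_range(), which is useful when streaming a large file: it keeps the amount of dirty data bounded without paying for a full fsync() on every chunk. A sketch follows (Linux-specific; the chunk size and flags are illustrative, and note that sync_file_range() alone does not make metadata durable):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Stream a large buffer to a file, asking the kernel to start writeback
 * of each chunk as soon as it is written so dirty pages never pile up. */
void stream_write(int fd, const char *data, size_t len)
{
    const size_t chunk = 1 << 20;            /* 1 MiB per round, illustrative */
    size_t off = 0;

    while (off < len) {
        size_t n = (len - off < chunk) ? (len - off) : chunk;
        if (write(fd, data + off, n) != (ssize_t)n)
            break;                           /* real code would handle errors */

        /* Kick off (and wait for) writeback of just this byte range */
        sync_file_range(fd, off, n,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER);
        off += n;
    }

    fsync(fd);   /* still needed once at the end for metadata durability */
}
```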
Let's examine how write-back behaves in real-world scenarios and common patterns that arise in production systems.
Scenario 1: The Write Burst
An application writes 100MB of data as fast as it can:
```
# System has 16GB RAM
# dirty_background_ratio = 10% (1.6GB)
# dirty_ratio = 20% (3.2GB)
# dirty_writeback_centisecs = 500 (5 seconds)

Phase 1: Fast writes (seconds 0-1)
├── Application writes 100MB burst
├── All writes go to page cache (memory speed: ~10GB/s)
├── Takes ~0.01 seconds to complete all writes
├── Dirty pages: ~100MB
├── Below background threshold: NO writeback triggered
└── Application continues immediately

Phase 2: Background writeback (seconds 5-6)
├── Timer fires after 5 seconds
├── Writeback thread wakes up
├── Writes 100MB to disk (disk speed: ~100MB/s)
├── Takes ~1 second
└── Dirty pages: 0MB

# What if the application writes a 5GB burst?

Phase 1: Fast writes (seconds 0-0.5)
├── Application writes 1.6GB (reaches background threshold)
├── Writeback starts in the background
├── Application continues writing...
├── Application writes 3.2GB (reaches dirty_ratio!)
├── APPLICATION BLOCKS - cannot write more
└── Waits until dirty pages drop below 3.2GB

Phase 2: Throttled writes (seconds 0.5-50)
├── Writeback thread runs at disk speed (~100MB/s)
├── Application writes blocked intermittently
├── Each write that pushes over the limit causes a ~10ms sleep
├── Effective write speed: ~100MB/s (disk limited)
└── 5GB takes ~50 seconds instead of 0.5 seconds
```
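You can observe this caching effect directly. The sketch below (file name and sizes are arbitrary) times how long the write() calls take versus how long the subsequent fsync() takes; on a typical system the writes finish almost instantly while the fsync absorbs the real disk time.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    const size_t chunk = 1 << 20;                 /* 1 MiB */
    const int chunks = 100;                       /* 100 MiB total */
    char *buf = malloc(chunk);
    memset(buf, 'x', chunk);

    int fd = open("burst.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    double t0 = now_sec();
    for (int i = 0; i < chunks; i++)
        write(fd, buf, chunk);                    /* lands in the page cache */
    double t1 = now_sec();

    fsync(fd);                                    /* forces it out to disk */
    double t2 = now_sec();

    printf("write() calls: %.3f s, fsync(): %.3f s\n", t1 - t0, t2 - t1);
    close(fd);
    free(buf);
    return 0;
}
```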
Scenario 2: Database Durability
Databases face the tension between write-back efficiency and durability acutely: a committed transaction must survive a crash, yet synchronously writing every modified page would limit throughput to disk speed. The standard answer is write-ahead logging: each transaction's changes are appended to a sequential log, and the log is fsync()'d or fdatasync()'d at commit time, while the modified data pages are left dirty in the cache and written back lazily. Some engines additionally bypass the page cache with O_DIRECT and manage their own buffer pool. A minimal sketch of the commit path is shown below.
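The sketch assumes a single append-only log file descriptor; the record format and error handling are simplified.

```c
#include <unistd.h>

/* Commit one transaction record: it is only acknowledged after the log
 * append has reached stable storage. The table pages the transaction
 * modified can stay dirty in the cache and be written back lazily,
 * because the log is enough to replay them after a crash. */
int wal_commit(int log_fd, const void *record, size_t len)
{
    if (write(log_fd, record, len) != (ssize_t)len)
        return -1;

    /* fdatasync is enough here: we only appended, and the size change is
     * the only metadata that matters for recovery. */
    if (fdatasync(log_fd) != 0)
        return -1;

    return 0;   /* safe to report "committed" */
}
```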
Use these commands to observe writeback behavior: watch -n1 'grep -E "Dirty|Writeback" /proc/meminfo' shows dirty page counts. iostat -x 1 shows actual I/O to devices. iotop identifies which processes are doing I/O. echo 1 > /proc/sys/vm/block_dump logs all block I/O to kernel log (noisy!).
We've explored the critical world of write-back policies—the strategies that determine when modified data leaves the buffer cache and reaches stable storage. Let's consolidate the key insights:
- Write-back caching delivers memory-speed writes and write coalescing, at the cost of a window in which unsynced data can be lost.
- The kernel tracks modified pages with dirty bits and per-inode/per-device dirty lists, and per-BDI flusher threads write them back based on timers, age, memory pressure, and explicit syncs.
- Applications that need durability must request it explicitly with fsync()/fdatasync() (and sync the directory after a rename); write barriers and FUA preserve ordering across device caches.
- dirty_ratio, dirty_background_ratio, and expiration times let you tune the performance/durability tradeoff.

What's Next:
With write-back policies understood, we'll next examine cache consistency—the mechanisms that ensure the buffer cache remains synchronized with storage contents, especially across multiple processes, network file systems, and after crashes.
You now understand write-back policies, dirty page tracking, writeback mechanisms, and durability guarantees. This knowledge is essential for building reliable systems, debugging data loss issues, and tuning storage I/O performance.