When a process modifies data, the change initially exists only in volatile RAM within the buffer cache. If power fails before the modification reaches stable storage, the data is lost forever. This reality creates one of the most fundamental tradeoffs in system design: performance versus durability.
The write-back policy determines when and how the operating system transfers modified ("dirty") data from the buffer cache to persistent storage. Get this wrong, and you face either:
Unacceptable performance: If every write immediately goes to disk, applications slow to a crawl. A simple save operation might take seconds instead of milliseconds.
Unacceptable data loss: If writes are cached too aggressively, a crash or power failure destroys hours of work, corrupts databases, or leaves file systems in inconsistent states.
Mastering write-back policies is essential for anyone building reliable systems, tuning database performance, or diagnosing mysterious data loss.
By the end of this page, you will understand: (1) The fundamental tradeoff between write performance and durability, (2) Write-through versus write-back caching strategies, (3) Delayed write (write-behind) implementation details, (4) How dirty pages are tracked and flushed, (5) The kernel's pdflush/flush threads and writeback mechanisms, and (6) Configuration parameters for tuning write-back behavior.
There are fundamentally three approaches to handling writes in a cached system. Each makes different tradeoffs between performance, durability, and complexity.
1. Write-Through (Synchronous Write)
Every write immediately propagates to stable storage. The write call doesn't return until the data is safely on disk.
2. Write-Back (Write-Behind, Delayed Write)
Writes are held in cache and written to storage later, either periodically or when memory pressure demands. The write call returns immediately after updating the cache.
3. Write-Around
Writes bypass the cache entirely and go directly to storage. The cache is not updated (or is invalidated). Useful when written data won't be read soon.
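These policies are not only kernel-wide choices; on Linux an application can opt into write-through-like or write-around-like behavior per file descriptor. The sketch below is a minimal illustration (file names and sizes are arbitrary, and O_DIRECT requires filesystem support): O_SYNC makes each write() wait until the data reaches stable storage, while O_DIRECT bypasses the page cache entirely.

```c
#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Write-through style: write() returns only after the data reaches storage */
    int sync_fd = open("synced.dat", O_WRONLY | O_CREAT | O_SYNC, 0644);
    if (sync_fd >= 0) {
        write(sync_fd, "durable before return\n", 22);
        close(sync_fd);
    }

    /* Write-around style: O_DIRECT bypasses the page cache.
     * Buffers must be aligned (typically to 512 bytes or the block size). */
    int direct_fd = open("streamed.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (direct_fd >= 0) {
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) == 0) {
            memset(buf, 'A', 4096);
            write(direct_fd, buf, 4096);   /* skips the page cache entirely */
            free(buf);
        }
        close(direct_fd);
    }
    return 0;
}
```

O_SYNC trades latency for per-write durability, whereas O_DIRECT mainly avoids cache pollution for data that will not be re-read soon.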
| Characteristic | Write-Through | Write-Back | Write-Around |
|---|---|---|---|
| Write latency | High (disk speed) | Low (memory speed) | High (disk speed) |
| Read after write | Fast (cached) | Fast (cached) | Slow (must read from disk) |
| Durability | Excellent | Risk of data loss | Excellent |
| Write bandwidth | Limited by disk | Aggregated, efficient | Limited by disk |
| Cache pollution | May pollute | May pollute | Avoids pollution |
| Implementation | Simple | Complex | Simple |
| Use cases | Critical data | General purpose | Streaming writes |
Why Write-Back Dominates:
Modern operating systems primarily use write-back caching because it lets write() calls complete at memory speed, coalesces repeated writes to the same blocks into a single disk operation, and gives the I/O scheduler freedom to batch and order writes efficiently.
Example: Write Coalescing Benefit
```
# Application writes to a file rapidly:
write(fd, "Hello", 5);   # Modify bytes 0-4
write(fd, "World", 5);   # Modify bytes 5-9
write(fd, "!", 1);       # Modify byte 10
write(fd, "...", 3);     # Modify bytes 11-13

# With WRITE-THROUGH (4 disk operations):
Time 0ms:  Write bytes 0-4 to disk    [5ms disk latency]
Time 5ms:  Write bytes 5-9 to disk    [5ms disk latency]
Time 10ms: Write byte 10 to disk      [5ms disk latency]
Time 15ms: Write bytes 11-13 to disk  [5ms disk latency]
Total: 20ms for 4 write syscalls

# With WRITE-BACK (1 disk operation, deferred):
Time 0.001ms: Update bytes 0-4 in cache    [~1 µs]
Time 0.002ms: Update bytes 5-9 in cache    [~1 µs]
Time 0.003ms: Update byte 10 in cache      [~1 µs]
Time 0.004ms: Update bytes 11-13 in cache  [~1 µs]
Total: ~0.004ms for 4 write syscalls (5000x faster!)

# Later, the writeback thread combines all into ONE disk write:
Time 5000ms: Write bytes 0-13 to disk [5ms disk latency]

# Even better: if bytes 0-13 are modified 100 more times
# before writeback, still only ONE disk write occurs!
```

Write-back's performance comes at a cost: if the system crashes before dirty data is written to disk, that data is lost. A 30-second writeback delay means potentially 30 seconds of recent writes vanish on power failure. Critical applications (databases, financial systems) must explicitly sync data or use write-through for durability.
For write-back caching to work, the operating system must track which cached pages contain modifications that haven't been written to storage. This tracking is essential for knowing which pages must eventually be flushed, for skipping clean pages during writeback, and for honoring sync requests such as fsync() that must wait until specific dirty pages are durable.
The Dirty Bit:
Each cached page has a dirty bit (or flag) indicating whether it has been modified since being read from or last written to storage.
Dirty Tracking Mechanisms:
```c
/*
 * Linux page flags related to dirty state
 * Defined in include/linux/page-flags.h
 */

/* Page flag bits */
#define PG_dirty      4   /* Page has been modified */
#define PG_writeback 15   /* Page is being written back now */
#define PG_reclaim   18   /* Page marked for reclaim */

/*
 * Dirty page lifecycle:
 *
 * 1. Page read from disk   -> Clean (PG_dirty clear)
 * 2. Process modifies page -> Mark dirty (PG_dirty set)
 * 3. Writeback initiated   -> PG_dirty cleared, PG_writeback set
 * 4. Writeback completes   -> PG_writeback cleared
 *
 * If the page is modified again during writeback:
 * 3b. PG_dirty gets re-set while PG_writeback is still set
 * 4b. After writeback, PG_writeback is cleared but PG_dirty remains
 *     -> Page needs another writeback
 */

/* Check and set dirty state */
static inline int PageDirty(struct page *page)
{
    return test_bit(PG_dirty, &page->flags);
}

static inline void SetPageDirty(struct page *page)
{
    set_bit(PG_dirty, &page->flags);
}

static inline void ClearPageDirty(struct page *page)
{
    clear_bit(PG_dirty, &page->flags);
}

/*
 * Mark a page as dirty (called when the page is modified).
 * This is a simplified version - the real implementation also handles:
 *  - Inode dirty list management
 *  - Memory cgroup accounting
 *  - Block device backing
 */
int set_page_dirty(struct page *page)
{
    struct address_space *mapping = page_mapping(page);

    if (!PageDirty(page)) {
        /* First time dirtying this page */
        SetPageDirty(page);

        if (mapping) {
            /* Add to the inode's dirty page tracking */
            spin_lock(&mapping->private_lock);

            /* Tag the page as dirty in the radix tree */
            radix_tree_tag_set(&mapping->page_tree,
                               page_index(page),
                               PAGECACHE_TAG_DIRTY);

            /* If this is the first dirty page for the inode, add the
             * inode to the backing device's dirty list */
            if (mapping->nrpages_dirty++ == 0)
                inode_mark_dirty(mapping->host);

            spin_unlock(&mapping->private_lock);
        }
        return 1;   /* Page was newly dirtied */
    }
    return 0;       /* Page was already dirty */
}
```

Inode and Backing Device Dirty Lists:
Linux organizes dirty pages hierarchically:
```
Backing Device (e.g., /dev/sda)
└── Dirty Inode List
    ├── Inode A (dirty)
    │   └── Dirty Pages: [100, 142, 205, ...]
    ├── Inode B (dirty)
    │   └── Dirty Pages: [0, 1, 2, 50, ...]
    └── Inode C (dirty)
        └── Dirty Pages: [1024, 1025, ...]
```
This organization enables per-device writeback threads to work in parallel, lets the kernel quickly find every dirty page belonging to a given inode, and keeps writes to the same file grouped so they can be issued as larger, more sequential I/Os.
CPUs provide hardware support for dirty tracking. Page table entries (PTEs) include a 'dirty' bit that the MMU sets automatically whenever a write occurs through that mapping. The OS periodically harvests these hardware dirty bits, enabling efficient tracking without software overhead on every write.
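As a concrete illustration, on x86-64 the hardware dirty flag lives in bit 6 of each page table entry. The tiny sketch below shows the kind of check a harvesting pass performs; the helper name is hypothetical, and only the bit positions reflect the real x86-64 PTE layout (a real kernel must also flush the TLB after clearing the flag).

```c
#include <stdbool.h>
#include <stdint.h>

/* x86-64 page table entry flag bits (bit 6 is the hardware dirty flag) */
#define X86_PTE_PRESENT (1ULL << 0)
#define X86_PTE_DIRTY   (1ULL << 6)

/* Hypothetical harvesting helper: returns true if the MMU marked the page
 * dirty since the last harvest, and clears the flag so the next write
 * through this mapping will set it again. */
static bool harvest_hw_dirty(uint64_t *pte)
{
    if ((*pte & X86_PTE_PRESENT) && (*pte & X86_PTE_DIRTY)) {
        *pte &= ~X86_PTE_DIRTY;   /* clear so future writes are detected */
        return true;              /* caller marks the cached page dirty */
    }
    return false;
}
```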
The operating system uses dedicated kernel threads to handle the actual writing of dirty pages to storage. This background writeback is essential for performance—it allows the system to choose optimal moments for I/O rather than blocking applications.
Linux Writeback Architecture:
Linux has evolved through several writeback implementations: the early bdflush/kupdate daemons, then a pool of generic pdflush threads in the 2.6 series, and finally dedicated per-backing-device (per-BDI) flusher threads, which remain the basis of the current design.
Current Architecture (Per-BDI Writeback):
```c
/*
 * Per-Backing Device Info (BDI) writeback architecture
 *
 * Each block device has its own writeback thread:
 *   flush-8:0    <- /dev/sda  (major 8, minor 0)
 *   flush-8:16   <- /dev/sdb  (major 8, minor 16)
 *   flush-253:0  <- /dev/dm-0 (device mapper)
 *
 * This allows parallel writeback to different devices.
 */

struct bdi_writeback {
    struct backing_dev_info *bdi;   /* Parent BDI */

    /* Lists of dirty inodes */
    struct list_head b_dirty;       /* Dirty inodes */
    struct list_head b_io;          /* Inodes being written */
    struct list_head b_more_io;     /* Inodes needing more I/O */
    struct list_head b_dirty_time;  /* Time-expired dirty inodes */

    /* Writeback control */
    unsigned long last_old_flush;   /* When the last periodic flush ran */
    struct delayed_work dwork;      /* The actual work item */
};

/*
 * Main writeback function - called by the flusher thread
 */
void wb_workfn(struct work_struct *work)
{
    struct bdi_writeback *wb = container_of(to_delayed_work(work),
                                            struct bdi_writeback, dwork);
    long pages_written;

    /* Set up writeback control */
    struct writeback_control wbc = {
        .sync_mode   = WB_SYNC_NONE,   /* Non-blocking */
        .nr_to_write = LONG_MAX,       /* No page limit */
        .range_start = 0,
        .range_end   = LLONG_MAX,
    };

    /* Write dirty pages until we've caught up */
    while (wb_has_dirty_io(wb)) {
        pages_written = writeback_inodes_wb(wb, &wbc);
        if (pages_written <= 0)
            break;

        /* Yield the CPU to avoid monopolizing it */
        cond_resched();
    }

    /* Schedule the next run if there's more work or for the periodic flush */
    if (wb_has_dirty_io(wb) || time_for_periodic_flush(wb)) {
        queue_delayed_work(bdi_wq, &wb->dwork,
                           msecs_to_jiffies(dirty_writeback_interval * 10));
                           /* interval is in centisecs -> convert to ms */
    }
}

/*
 * Write back dirty inodes for this device
 */
long writeback_inodes_wb(struct bdi_writeback *wb,
                         struct writeback_control *wbc)
{
    struct inode *inode;
    long pages_written = 0;

    spin_lock(&wb->list_lock);

    while (!list_empty(&wb->b_io)) {
        inode = list_first_entry(&wb->b_io, struct inode, i_io_list);
        spin_unlock(&wb->list_lock);

        /* Write this inode's dirty pages */
        pages_written += writeback_single_inode(inode, wbc);

        spin_lock(&wb->list_lock);

        /* Move the inode based on the result */
        if (inode_has_dirty_pages(inode)) {
            /* More pages to write - put on b_more_io */
            list_move(&inode->i_io_list, &wb->b_more_io);
        } else {
            /* All clean - remove from the dirty lists */
            list_del_init(&inode->i_io_list);
        }
    }

    spin_unlock(&wb->list_lock);
    return pages_written;
}
```

Writeback Triggers:
Writeback occurs in several scenarios:
| Trigger | Description | Urgency |
|---|---|---|
| Periodic timer | Every dirty_writeback_centisecs (default 5s) | Low |
| Age threshold | Pages dirty longer than dirty_expire_centisecs (30s) | Medium |
| Memory pressure | System running low on free pages | High |
| Explicit sync | sync(), fsync(), or fdatasync() call | Immediate |
| Dirty ratio exceeded | Dirty pages exceed dirty_ratio of RAM | Immediate (blocking) |
| Background ratio | Dirty pages exceed dirty_background_ratio | Background (non-blocking) |
When dirty pages grow too numerous, the kernel 'throttles' writing processes—making them sleep in the write path. This backpressure prevents dirty pages from growing unboundedly and ensures the system can recover. The throttling is proportional: the more over-limit, the longer the sleep.
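A minimal sketch of the proportional idea follows. This is not the actual kernel algorithm (Linux's balance_dirty_pages() is far more sophisticated, with per-task and per-device rate estimation); it only illustrates "the more over-limit, the longer the sleep."

```c
#include <unistd.h>

/* Illustrative only: sleep longer the further dirty pages are over the
 * background threshold, up to a cap, so writers are slowed in proportion
 * to how badly they are outrunning the storage device. */
static void throttle_writer(unsigned long nr_dirty,
                            unsigned long background_thresh,
                            unsigned long hard_limit)
{
    if (nr_dirty <= background_thresh)
        return;                          /* plenty of headroom: no delay */

    /* How much of the "danger zone" is already consumed, in percent */
    unsigned long over = nr_dirty - background_thresh;
    unsigned long zone = hard_limit - background_thresh;
    unsigned long pct  = (100 * over) / (zone ? zone : 1);

    /* Sleep between 0 and ~100 ms depending on how far over we are */
    usleep(pct * 1000);
}
```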
Linux provides tunable parameters that control writeback behavior. Understanding these is essential for optimizing system performance and durability for specific workloads.
Key Parameters (in /proc/sys/vm/):
| Parameter | Default | Description |
|---|---|---|
| dirty_background_ratio | 10 | Percent of RAM at which background writeback starts |
| dirty_background_bytes | 0 | Absolute byte threshold (overrides ratio if set) |
| dirty_ratio | 20 | Percent of RAM at which processes block on writes |
| dirty_bytes | 0 | Absolute byte threshold for blocking |
| dirty_writeback_centisecs | 500 | Interval between writeback thread wakeups (5 seconds) |
| dirty_expire_centisecs | 3000 | Age at which dirty data is 'old' and must be written (30 seconds) |
```bash
#!/bin/bash
# Writeback tuning examples

# View current values
cat /proc/sys/vm/dirty_ratio
cat /proc/sys/vm/dirty_background_ratio
cat /proc/sys/vm/dirty_expire_centisecs
cat /proc/sys/vm/dirty_writeback_centisecs

# SCENARIO 1: Database Server (prioritize durability)
# Low dirty thresholds = less data at risk
# Frequent writeback = faster flush
echo 5 > /proc/sys/vm/dirty_ratio
echo 2 > /proc/sys/vm/dirty_background_ratio
echo 500 > /proc/sys/vm/dirty_expire_centisecs      # 5 seconds
echo 100 > /proc/sys/vm/dirty_writeback_centisecs   # 1 second

# SCENARIO 2: File Server (prioritize throughput)
# Higher dirty limits = more write coalescing
# Slower writeback = fewer interruptions
echo 40 > /proc/sys/vm/dirty_ratio
echo 20 > /proc/sys/vm/dirty_background_ratio
echo 6000 > /proc/sys/vm/dirty_expire_centisecs     # 60 seconds
echo 1000 > /proc/sys/vm/dirty_writeback_centisecs  # 10 seconds

# SCENARIO 3: Machine with limited RAM (128MB)
# Use absolute bytes instead of percentages
# Prevents small RAM from having tiny dirty limits
echo 0 > /proc/sys/vm/dirty_ratio                    # Disable ratio
echo 50000000 > /proc/sys/vm/dirty_bytes             # 50MB hard limit
echo 0 > /proc/sys/vm/dirty_background_ratio
echo 25000000 > /proc/sys/vm/dirty_background_bytes  # 25MB background

# SCENARIO 4: Laptop (prioritize battery life)
# Less frequent writeback = more disk idle time
# Allow the disk to spin down longer
echo 90 > /proc/sys/vm/laptop_mode                   # Enable laptop mode (seconds)
echo 60 > /proc/sys/vm/dirty_ratio
echo 40 > /proc/sys/vm/dirty_background_ratio
echo 60000 > /proc/sys/vm/dirty_expire_centisecs     # 10 minutes
echo 60000 > /proc/sys/vm/dirty_writeback_centisecs  # 10 minutes

# Monitor dirty page status
watch -n1 'grep -E "Dirty|Writeback" /proc/meminfo'
```

Understanding the Two Thresholds:
The dirty_background_ratio and dirty_ratio work together:
```
                   Memory
                     │
                     │
dirty_ratio (20%)    ├─────── BLOCKING THRESHOLD ─────────┐
                     │        Processes sleep until       │
                     │        dirty pages drop below      │
                     │                                    │
                     │   ← Dirty pages in danger zone →   │
                     │                                    │
dirty_background     ├─────── WRITEBACK THRESHOLD ────────┘
ratio (10%)          │        Flusher threads actively
                     │        writing dirty pages
                     │
                     │   ← Normal operation →
                     │
                   0 └─────────────────────────────────────
```
On systems with large RAM (256GB+), default percentages become problematic. 20% of 256GB = 51GB of potentially dirty data! If power fails, you could lose 51GB of writes. For such systems, use absolute byte limits (dirty_bytes, dirty_background_bytes) to cap exposure regardless of RAM size.
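To see how close a system actually gets to these thresholds, you can read the Dirty and Writeback counters from /proc/meminfo (the same counters the monitoring commands elsewhere on this page grep for). A minimal sketch:

```c
#include <stdio.h>
#include <string.h>

/* Print the current amount of dirty and under-writeback page cache,
 * as reported by /proc/meminfo (values are in kB). */
int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) {
        perror("/proc/meminfo");
        return 1;
    }

    char line[128];
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "Dirty:", 6) == 0 ||
            strncmp(line, "Writeback:", 10) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}
```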
Modern storage systems include multiple layers of caching—not just the OS buffer cache, but also disk controller caches and drive firmware caches. Write barriers ensure that writes reach stable storage in the correct order, which is critical for file system consistency.
The Ordering Problem:
Consider a file system appending to a file. It must write the new data block (call it D) and update the metadata block (call it M) that records the file's new size and points to D. For consistency, D must reach disk before M. If M reaches disk before D (due to reordering in any cache layer) and the system crashes, the file system will have metadata pointing to garbage—a corruption scenario.
Write Barrier Operation:
```c
/*
 * Write barriers ensure ordering across cache layers.
 *
 * A write barrier has these semantics:
 *   All writes BEFORE the barrier complete to stable storage
 *   BEFORE any writes AFTER the barrier begin.
 */

/*
 * File system journal write with barriers
 * Ensures data integrity after a crash
 */
void journal_commit_transaction(journal_t *journal,
                                transaction_t *transaction)
{
    /* Step 1: Write the journal descriptor block */
    /* Describes what data is in this transaction */
    write_journal_descriptor(journal, transaction);

    /* Step 2: Write all data/metadata blocks to the journal */
    /* These are copies that can be replayed after a crash */
    list_for_each_entry(block, &transaction->blocks, list) {
        write_to_journal(journal, block);
    }

    /* BARRIER: Ensure all journal data is on stable storage */
    /* before we write the commit record */
    blkdev_issue_flush(journal->j_dev, GFP_KERNEL, NULL);

    /* Step 3: Write the commit record */
    /* This marks the transaction as complete */
    write_journal_commit(journal, transaction);

    /* BARRIER: Ensure the commit record is on stable storage */
    /* before we write to the final locations */
    blkdev_issue_flush(journal->j_dev, GFP_KERNEL, NULL);

    /* Step 4: Now it is safe to write to the actual file system locations */
    /* (called "checkpointing") */
    checkpoint_transaction(journal, transaction);
}

/*
 * The blkdev_issue_flush function
 * Sends a cache flush command to the storage device
 */
int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
                       sector_t *error_sector)
{
    struct bio *bio;
    int ret;

    /* Allocate a bio for the flush command */
    bio = bio_alloc(GFP_KERNEL, 0);
    bio_set_dev(bio, bdev);
    bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;   /* Flush operation */

    /* Submit and wait */
    ret = submit_bio_wait(bio);
    bio_put(bio);
    return ret;
}

/*
 * Storage device perspective:
 *
 * When the device receives a FLUSH command:
 *   1. Complete all pending writes in the device cache
 *   2. Force data to stable media (platters for HDD, cells for SSD)
 *   3. Return success only after the data is truly persisted
 *
 * Commands:
 *   SATA: FLUSH CACHE / FLUSH CACHE EXT
 *   SCSI: SYNCHRONIZE CACHE
 *   NVMe: Flush command
 */
```

| Cache Layer | Size | Volatile? | Respects Barriers? |
|---|---|---|---|
| OS Buffer Cache | GB | Yes | N/A (software) |
| RAID Controller Cache | MB-GB | Maybe (BBU) | Must be configured |
| Disk Drive Cache | MB | Usually yes | With FUA/flush commands |
| SSD Controller Cache | MB | Newer have capacitors | With flush commands |
FUA (Force Unit Access) is an alternative to barriers for individual writes. A write with FUA set bypasses the disk's volatile cache and goes directly to stable media. It's more efficient than flush when only specific writes need immediate durability, as it doesn't drain the entire cache.
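Mirroring the style of the flush example above, a kernel-side write that needs per-request durability could set REQ_FUA on the bio instead of issuing a separate flush. This is a sketch under the same illustrative assumptions, not production code:

```c
/* Sketch: write one page with Force Unit Access so this specific write
 * bypasses the drive's volatile cache, without draining the whole cache. */
int write_page_fua(struct block_device *bdev, struct page *page,
                   sector_t sector)
{
    struct bio *bio = bio_alloc(GFP_KERNEL, 1);
    int ret;

    bio_set_dev(bio, bdev);
    bio->bi_iter.bi_sector = sector;
    bio_add_page(bio, page, PAGE_SIZE, 0);

    /* REQ_FUA: the data must be on stable media before completion.
     * (REQ_PREFLUSH could be OR'd in if earlier writes must be flushed too.) */
    bio->bi_opf = REQ_OP_WRITE | REQ_FUA;

    ret = submit_bio_wait(bio);
    bio_put(bio);
    return ret;
}
```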
Applications that need durability guarantees cannot rely on the OS's delayed writeback—they must explicitly request synchronization. The fsync() family of system calls provides this capability.
The Sync Family:
| Call | Scope | What It Syncs | Use Case |
|---|---|---|---|
| sync() | Entire system | All dirty buffers for all filesystems | Before shutdown |
| syncfs(fd) | Single filesystem | All dirty buffers for one filesystem | Before unmount |
| fsync(fd) | Single file | File data + metadata (size, timestamps) | Transactional writes |
| fdatasync(fd) | Single file data | File data only (skip some metadata) | Performance-critical |
| sync_file_range() | Byte range | Specific portion of file | Large file streaming |
```c
/*
 * Demonstrating proper durable write patterns
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libgen.h>
#include <unistd.h>
#include <fcntl.h>

/*
 * WRONG: Data can be lost on a crash.
 * close() does not guarantee durability!
 */
void write_file_wrong(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, data, len);
    close(fd);   /* Data may still be in the buffer cache! */
}

/*
 * CORRECT: Data is durable after fsync returns.
 */
void write_file_correct(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, data, len);

    /* fsync ensures data + metadata reach stable storage */
    if (fsync(fd) != 0) {
        perror("fsync failed");
        /* Handle the error - data may not be durable! */
    }
    close(fd);
}

/*
 * Atomic file update using rename.
 * This is the GOLD STANDARD for durable writes.
 */
void atomic_write_file(const char *path, const char *data, size_t len)
{
    char tmp_path[256];
    snprintf(tmp_path, sizeof(tmp_path), "%s.tmp.XXXXXX", path);

    /* Create a temporary file */
    int fd = mkstemp(tmp_path);
    if (fd < 0) {
        perror("mkstemp");
        return;
    }

    /* Write the data */
    ssize_t written = write(fd, data, len);
    if (written != (ssize_t)len) {
        close(fd);
        unlink(tmp_path);
        return;
    }

    /* CRITICAL: fsync the file data */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return;
    }
    close(fd);

    /* Rename is atomic on POSIX systems */
    if (rename(tmp_path, path) != 0) {
        unlink(tmp_path);
        return;
    }

    /* CRITICAL: fsync the directory to make the rename durable.
     * dirname() may modify its argument, so work on a copy. */
    char dir_buf[256];
    snprintf(dir_buf, sizeof(dir_buf), "%s", path);
    int dir_fd = open(dirname(dir_buf), O_RDONLY);
    if (dir_fd >= 0) {
        fsync(dir_fd);
        close(dir_fd);
    }

    /* Now the update is fully durable and atomic:
     *  - If crash before rename: old file unchanged
     *  - If crash after rename:  new file complete
     * Never a partial or corrupted state! */
}

/*
 * fdatasync vs fsync: when to use which
 */
void fdatasync_example(int fd)
{
    /*
     * fdatasync() is like fsync() but:
     *  - Only syncs data, not all metadata
     *  - Skips metadata that doesn't affect data retrieval
     *    (e.g., access time, modification time)
     *  - Still syncs the size if the file was extended
     *
     * Faster when only data integrity matters.
     */

    /* Use fdatasync for performance-critical paths */
    fdatasync(fd);   /* ~30% faster than fsync on some systems */

    /* Use fsync when metadata matters */
    fsync(fd);       /* Guarantees timestamps, permissions, etc. */
}
```

Many developers forget that file creation/rename requires syncing the parent directory, not just the file. Without the directory fsync, a crash might leave the new file invisible (it existed but the directory entry didn't persist). This has caused data loss in many applications, including older versions of SQLite and git.
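The sync family table above also lists sync_file_range(), which is useful when streaming a large file: it keeps the amount of dirty data bounded without paying for a full fsync() on every chunk. A sketch follows (Linux-specific; the chunk size and flags are illustrative, and note that sync_file_range() alone does not make metadata durable):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Stream a large buffer to a file, asking the kernel to start writeback
 * of each chunk as soon as it is written so dirty pages never pile up. */
void stream_write(int fd, const char *data, size_t len)
{
    const size_t chunk = 1 << 20;            /* 1 MiB per round, illustrative */
    size_t off = 0;

    while (off < len) {
        size_t n = (len - off < chunk) ? (len - off) : chunk;
        if (write(fd, data + off, n) != (ssize_t)n)
            break;                           /* real code would handle errors */

        /* Kick off (and wait for) writeback of just this byte range */
        sync_file_range(fd, off, n,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER);
        off += n;
    }

    fsync(fd);   /* still needed once at the end for metadata durability */
}
```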
Let's examine how write-back behaves in real-world scenarios and common patterns that arise in production systems.
Scenario 1: The Write Burst
An application writes 100MB of data as fast as it can:
```
# System has 16GB RAM
# dirty_background_ratio = 10% (1.6GB)
# dirty_ratio = 20% (3.2GB)
# dirty_writeback_centisecs = 500 (5 seconds)

Phase 1: Fast writes (seconds 0-1)
├── Application writes 100MB burst
├── All writes go to page cache (memory speed: ~10GB/s)
├── Takes ~0.01 seconds to complete all writes
├── Dirty pages: ~100MB
├── Below background threshold: NO writeback triggered
└── Application continues immediately

Phase 2: Background writeback (seconds 5-6)
├── Timer fires after 5 seconds
├── Writeback thread wakes up
├── Writes 100MB to disk (disk speed: ~100MB/s)
├── Takes ~1 second
└── Dirty pages: 0MB

# What if the application writes a 5GB burst?

Phase 1: Fast writes (seconds 0-0.5)
├── Application writes 1.6GB (reaches background threshold)
├── Writeback starts in the background
├── Application continues writing...
├── Application writes 3.2GB (reaches dirty_ratio!)
├── APPLICATION BLOCKS - cannot write more
└── Waits until dirty pages drop below 3.2GB

Phase 2: Throttled writes (seconds 0.5-50)
├── Writeback thread runs at disk speed (~100MB/s)
├── Application writes blocked intermittently
├── Each write that pushes over the limit causes a ~10ms sleep
├── Effective write speed: ~100MB/s (disk limited)
└── 5GB takes ~50 seconds instead of 0.5 seconds
```
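You can observe this caching effect directly. The sketch below (file name and sizes are arbitrary) times how long the write() calls take versus how long the subsequent fsync() takes; on a typical system the writes finish almost instantly while the fsync absorbs the real disk time.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    const size_t chunk = 1 << 20;                 /* 1 MiB */
    const int chunks = 100;                       /* 100 MiB total */
    char *buf = malloc(chunk);
    memset(buf, 'x', chunk);

    int fd = open("burst.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    double t0 = now_sec();
    for (int i = 0; i < chunks; i++)
        write(fd, buf, chunk);                    /* lands in the page cache */
    double t1 = now_sec();

    fsync(fd);                                    /* forces it out to disk */
    double t2 = now_sec();

    printf("write() calls: %.3f s, fsync(): %.3f s\n", t1 - t0, t2 - t1);
    close(fd);
    free(buf);
    return 0;
}
```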
Scenario 2: Database Durability
Databases face the tension between write-back efficiency and durability acutely: a committed transaction must survive a crash, yet synchronously writing every modified page would limit throughput to disk speed. The standard answer is write-ahead logging: each transaction's changes are appended to a sequential log, and the log is fsync()'d or fdatasync()'d at commit time, while the modified data pages are left dirty in the cache and written back lazily. Some engines additionally bypass the page cache with O_DIRECT and manage their own buffer pool. A minimal sketch of the commit path is shown below.
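The sketch assumes a single append-only log file descriptor; the record format and error handling are simplified.

```c
#include <unistd.h>

/* Commit one transaction record: it is only acknowledged after the log
 * append has reached stable storage. The table pages the transaction
 * modified can stay dirty in the cache and be written back lazily,
 * because the log is enough to replay them after a crash. */
int wal_commit(int log_fd, const void *record, size_t len)
{
    if (write(log_fd, record, len) != (ssize_t)len)
        return -1;

    /* fdatasync is enough here: we only appended, and the size change is
     * the only metadata that matters for recovery. */
    if (fdatasync(log_fd) != 0)
        return -1;

    return 0;   /* safe to report "committed" */
}
```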
Use these commands to observe writeback behavior: watch -n1 'grep -E "Dirty|Writeback" /proc/meminfo' shows dirty page counts. iostat -x 1 shows actual I/O to devices. iotop identifies which processes are doing I/O. echo 1 > /proc/sys/vm/block_dump logs all block I/O to kernel log (noisy!).
We've explored the critical world of write-back policies—the strategies that determine when modified data leaves the buffer cache and reaches stable storage. Let's consolidate the key insights:
- Write-back caching delivers memory-speed writes and write coalescing, at the cost of a window in which unsynced data can be lost.
- The kernel tracks modified pages with dirty bits and per-inode/per-device dirty lists, and per-BDI flusher threads write them back based on timers, age, memory pressure, and explicit syncs.
- Applications that need durability must request it explicitly with fsync()/fdatasync() (and sync the directory after a rename); write barriers and FUA preserve ordering across device caches.
- dirty_ratio, dirty_background_ratio, and expiration times let you tune the performance/durability tradeoff.

What's Next:
With write-back policies understood, we'll next examine cache consistency—the mechanisms that ensure the buffer cache remains synchronized with storage contents, especially across multiple processes, network file systems, and after crashes.
You now understand write-back policies, dirty page tracking, writeback mechanisms, and durability guarantees. This knowledge is essential for building reliable systems, debugging data loss issues, and tuning storage I/O performance.