Writing data presents a fundamentally different challenge than reading it. When you read data, the worst case is performance degradation—waiting for data to load. When you write data, the worst case is data loss—believing data is saved when it isn't. This asymmetry places write caching at the intersection of the two most critical concerns in computing: performance and reliability.
Write caching dramatically accelerates applications by absorbing writes into fast memory instead of waiting for slow storage devices. But this creates a window of vulnerability: data in the cache but not yet on disk can be lost to power failure, system crashes, or hardware faults. Every operating system must navigate this tradeoff, and understanding how they do so is essential for building systems that are both fast and reliable.
By the end of this page, you will understand how operating systems handle write caching, the various write-back strategies and their tradeoffs, the mechanisms that balance performance against durability, and how applications can control write behavior to meet their specific reliability requirements.
To appreciate write caching, consider what happens without it. Every write operation would block until the storage device physically completed the I/O: for a hard disk, the seek, the rotational delay, and the data transfer itself.
For a hard disk drive with 5-10ms latency, this means at most 100-200 synchronous writes per second. For an application like a database logging transactions, a mail server receiving messages, or even a text editor saving files, this would create unbearable delays.
The situation is actually worse than simple latency suggests. Consider writing a single byte to a file: the kernel must read the containing block from disk, modify the byte in memory, write the block back, and update the file's metadata (timestamps, and possibly the inode and allocation structures).
Without caching, writing a single byte could require 4+ disk operations, each taking 5-10ms—a total of 20-40ms for one byte. That's 25-50 bytes per second maximum throughput!
Write caching transforms this pathological case into something manageable:
The application perceives write latency of microseconds instead of milliseconds. Multiple writes to the same block are coalesced—only the final state needs to reach disk. Sequential writes to adjacent blocks are batched into single large I/O operations.
The result: throughput improvements of 100x to 10,000x compared to synchronous writes, with latency improvements of similar magnitude.
This performance comes with a critical caveat: until data is written to stable storage, it exists only in volatile memory. Power loss erases it completely. Applications that can't tolerate any data loss (databases, financial systems) must use additional mechanisms to ensure durability, which we'll explore throughout this page.
Operating systems implement several strategies for handling write operations, each occupying a different point on the performance-durability spectrum:
In write-through mode, every write goes immediately to both the cache and the backing storage:
Application Write → Cache → Backing Store (synchronously)
↓
Return when both complete
Characteristics: every write pays full storage latency, but the cache and backing store never diverge. There is no window of data loss, and subsequent reads of written data hit the cache.
In write-back (or write-behind) mode, writes go only to the cache initially:
Application Write → Cache (mark dirty)
↓
Return immediately
↓
Later: Writeback to Backing Store
Characteristics: writes complete at memory speed, and repeated writes to the same block coalesce into a single device write. The cost is a window during which dirty data can be lost to a crash or power failure.
In write-around mode, writes bypass the cache entirely:
Application Write → Backing Store (directly)
↓
Return when complete
↓
Cache NOT populated (or invalidated)
Characteristics: write latency is storage-bound, but the cache is not polluted by data that may never be read again. A read soon after the write misses the cache and must go back to storage.
| Strategy | Write Latency | Data Safety | Read-after-Write | Use Case |
|---|---|---|---|---|
| Write-Through | High (disk bound) | Maximum | Cache hit | Critical data, simple systems |
| Write-Back | Low (memory bound) | Deferred | Cache hit | General purpose, performance focus |
| Write-Around | High (disk bound) | Maximum | Cache miss | Bulk writes, cold data |
Modern operating systems predominantly use write-back caching for general file I/O, with mechanisms to provide stronger guarantees when applications request them. The performance benefit is simply too substantial to abandon for most workloads.
When using write-back caching, the operating system must carefully track which cache pages have been modified but not yet written to storage. These are called dirty pages, and managing them correctly is essential for both performance and reliability.
Every page in the page cache carries a dirty flag. When an application writes to a cached page, the kernel sets this flag. The page remains dirty until its contents are written to backing storage, at which point the flag is cleared.
The kernel maintains data structures to efficiently enumerate dirty pages:
```c
/*
 * Dirty page tracking in the Linux kernel
 *
 * Pages are organized in the page cache by file (address_space)
 * and tracked for writeback via tags in the radix tree
 */

/* Page flags for dirty state */
#define PG_dirty     4   /* Page has been modified */
#define PG_writeback 15  /* Page is being written out */

/* Address space tags for efficient dirty page lookup */
#define PAGECACHE_TAG_DIRTY     0  /* Pages needing writeback */
#define PAGECACHE_TAG_WRITEBACK 1  /* Pages currently writing */
#define PAGECACHE_TAG_TOWRITE   2  /* Pages marked for this writeback cycle */

/*
 * Mark a page dirty
 * Called when a write modifies a cached page
 */
void set_page_dirty(struct page *page)
{
    struct address_space *mapping = page->mapping;

    if (!TestSetPageDirty(page)) {
        /* Page was clean - now dirty */

        /* Add dirty tag to radix tree for efficient lookup */
        xa_lock_irq(&mapping->i_pages);
        __xa_set_mark(&mapping->i_pages, page_index(page),
                      PAGECACHE_TAG_DIRTY);
        xa_unlock_irq(&mapping->i_pages);

        /* Account for dirty pages */
        account_page_dirtied(page, mapping);

        /* Wake writeback thread if needed */
        __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
    }
}

/*
 * Clear dirty state after successful writeback
 */
void clear_page_dirty_for_io(struct page *page)
{
    struct address_space *mapping = page->mapping;

    if (TestClearPageDirty(page)) {
        /* Page was dirty - now clean */
        dec_zone_page_state(page, NR_FILE_DIRTY);

        /* Clear tag in radix tree */
        xa_lock_irq(&mapping->i_pages);
        __xa_clear_mark(&mapping->i_pages, page_index(page),
                        PAGECACHE_TAG_DIRTY);
        xa_unlock_irq(&mapping->i_pages);
    }
}

/*
 * Find all dirty pages for a file
 * Used to determine what needs writeback
 */
unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *start,
                            xa_mark_t tag, unsigned int nr_pages,
                            struct page **pages)
{
    XA_STATE(xas, &mapping->i_pages, *start);
    struct page *page;
    unsigned int ret = 0;

    rcu_read_lock();
    xas_for_each_marked(&xas, page, ULONG_MAX, tag) {
        if (xas_retry(&xas, page))
            continue;
        if (!page_cache_get_speculative(page))
            continue;

        pages[ret++] = page;
        if (ret >= nr_pages) {
            *start = page->index + 1;
            break;
        }
    }
    rcu_read_unlock();

    return ret;
}
```

The kernel doesn't allow dirty pages to accumulate indefinitely. Excessive dirty pages create multiple problems:
Memory Pressure: Dirty pages cannot be freed without writing them first, reducing memory available for other uses. Under memory pressure, the system must wait for writeback before reclaiming pages.
Write Cliffs: If too much dirty data accumulates, a flush (from sync, file close, or memory pressure) suddenly needs to write gigabytes, creating long pauses.
Data Loss Risk: More dirty data means more potential data loss on failure.
Linux implements two dirty page thresholds, complemented by two time-based writeback parameters:

| Tunable | Default Value | Effect |
|---|---|---|
| dirty_background_ratio | 10% of memory | Background writeback begins |
| dirty_ratio | 20% of memory | Writing process blocks until dirty pages decrease |
| dirty_writeback_centisecs | 500 (5 seconds) | Writeback thread wakes periodically |
| dirty_expire_centisecs | 3000 (30 seconds) | Pages older than this are written back |
```bash
#!/bin/bash
# View and configure dirty page limits on Linux

echo "=== Current Dirty Page Settings ==="
echo "Background ratio: $(cat /proc/sys/vm/dirty_background_ratio)%"
echo "Foreground ratio: $(cat /proc/sys/vm/dirty_ratio)%"
echo "Writeback interval: $(cat /proc/sys/vm/dirty_writeback_centisecs) centiseconds"
echo "Expire time: $(cat /proc/sys/vm/dirty_expire_centisecs) centiseconds"

echo -e "\n=== Current Dirty Page State ==="
grep -E "^(Dirty|Writeback|MemTotal):" /proc/meminfo

echo -e "\n=== Per-Device Writeback Status ==="
for bdi in /sys/class/bdi/*/; do
    if [ -d "$bdi" ]; then
        name=$(basename "$bdi")
        dirty=$(cat "$bdi/dirty_inflight_bytes" 2>/dev/null || echo "N/A")
        echo "$name: $dirty bytes in flight"
    fi
done

# Example: Reduce dirty limits for faster writeback (SSD-optimized)
# sudo sysctl -w vm.dirty_background_ratio=5
# sudo sysctl -w vm.dirty_ratio=10
# sudo sysctl -w vm.dirty_expire_centisecs=1500

# Example: Increase for high-throughput batch workloads
# sudo sysctl -w vm.dirty_background_ratio=20
# sudo sysctl -w vm.dirty_ratio=40
```

The kernel uses several mechanisms to move dirty data from the cache to stable storage. Understanding these mechanisms is crucial for predicting system behavior and tuning performance.
Linux runs kernel threads (one per block device backing store) responsible for continuous background writeback. These threads wake periodically (every 5 seconds by default) to write dirty pages that have been in cache too long (30 seconds by default):
```c
/*
 * Simplified writeback logic (conceptual)
 *
 * The actual Linux implementation is more complex,
 * handling multiple inodes, congestion, and priorities
 */

/* Main writeback work structure */
struct bdi_writeback {
    struct backing_dev_info *bdi;  /* Device info */
    struct task_struct *task;      /* Writeback thread */
    struct list_head b_dirty;      /* Dirty inodes */
    struct list_head b_io;         /* Inodes ready for I/O */
    struct list_head b_more_io;    /* More work after current batch */
    unsigned long last_old_flush;  /* When we last flushed old data */
};

/*
 * Writeback thread main loop
 */
int wb_workfn(void *data)
{
    struct bdi_writeback *wb = data;

    while (!kthread_should_stop()) {
        /* Wait for work or timeout */
        wait_event_interruptible_timeout(
            wb->wait,
            has_work(wb) || kthread_should_stop(),
            dirty_writeback_interval);

        if (kthread_should_stop())
            break;

        /* Process queued work items */
        while (!list_empty(&wb->work_list)) {
            struct wb_writeback_work *work;

            work = list_first_entry(&wb->work_list,
                                    struct wb_writeback_work, list);
            list_del(&work->list);

            wb_do_writeback(wb, work);

            if (work->single)
                wake_up_process(work->waiter);
            else
                kfree(work);
        }

        /* Check for old dirty pages */
        if (time_after(jiffies,
                       wb->last_old_flush + dirty_expire_interval)) {
            wb_flush_old_pages(wb);
            wb->last_old_flush = jiffies;
        }
    }

    return 0;
}

/*
 * Flush old dirty pages
 */
void wb_flush_old_pages(struct bdi_writeback *wb)
{
    struct inode *inode;
    unsigned long expire_time = jiffies - dirty_expire_interval;

    list_for_each_entry(inode, &wb->b_dirty, i_wb_list) {
        /* Skip if dirtied too recently */
        if (time_after(inode->dirtied_when, expire_time))
            continue;

        /* Write back this inode's dirty pages */
        writeback_single_inode(inode, WB_SYNC_NONE, LONG_MAX);
    }
}

/*
 * Perform writeback for a single inode
 */
int writeback_single_inode(struct inode *inode,
                           enum writeback_sync_modes sync,
                           long nr_to_write)
{
    struct address_space *mapping = inode->i_mapping;
    struct writeback_control wbc = {
        .sync_mode   = sync,
        .nr_to_write = nr_to_write,
        .range_start = 0,
        .range_end   = LLONG_MAX,
    };

    /* Write dirty data pages */
    do_writepages(mapping, &wbc);

    /* Write inode metadata if dirty */
    if (inode->i_state & I_DIRTY)
        write_inode(inode, &wbc);

    return wbc.nr_to_write;
}
```

Beyond periodic background writeback, several events trigger immediate or prioritized writeback:
Memory Pressure: When free memory drops below thresholds, the kernel must reclaim memory. Dirty pages must be written before they can be reclaimed, so memory pressure triggers writeback.
Sync Operations: System calls like sync(), fsync(), and fdatasync() explicitly request writeback. We'll examine these in detail.
File Close: When the last reference to a file is closed, some systems trigger writeback (though not required by POSIX).
Unmount: Unmounting a filesystem requires all dirty data to be written first.
Threshold Exceeded: When dirty data exceeds dirty_ratio, the writing process blocks until dirty pages decrease.
The kernel uses a writeback_control structure to specify writeback parameters:
```c
/*
 * Writeback control structure - parameters for write operations
 */
struct writeback_control {
    /* How many pages to write (decremented as pages written) */
    long nr_to_write;

    /* Writeback mode */
    enum writeback_sync_modes sync_mode;
#define WB_SYNC_NONE 0  /* Don't wait for I/O completion */
#define WB_SYNC_ALL  1  /* Wait for all I/O to complete */

    /* Range to write (for partial file writeback) */
    loff_t range_start;
    loff_t range_end;

    /* Output: pages skipped/written */
    unsigned long pages_skipped;
    unsigned long nr_written;

    /* Flags */
    unsigned for_kupdate:1;       /* Periodic writeback */
    unsigned for_background:1;    /* Background threshold writeback */
    unsigned range_cyclic:1;      /* Cyclic writeback (wrap around) */
    unsigned for_sync:1;          /* sync() or fsync() */
    unsigned for_reclaim:1;       /* Memory reclaim triggered */
    unsigned tagged_writepages:1; /* Use tagged writepages */
};

/*
 * Example: fsync implementation
 */
int fsync_file(struct file *file)
{
    struct address_space *mapping = file->f_mapping;
    struct inode *inode = mapping->host;
    int ret;

    /*
     * Write and wait on all dirty pages for this file.
     * Internally this performs a WB_SYNC_ALL writeback over
     * the whole range [0, LLONG_MAX], waiting for completion.
     */
    ret = filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
    if (ret)
        return ret;

    /* Write inode metadata */
    ret = sync_inode_metadata(inode, 1);

    /* Flush device write cache */
    if (!ret && inode->i_sb->s_bdev)
        blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL);

    return ret;
}
```

Applications requiring durability guarantees must explicitly synchronize data to storage. POSIX defines several primitives with different guarantees and performance characteristics:
The sync() system call schedules all buffered modifications for writing to all filesystems:
void sync(void);
Semantics: Schedules writeback of all dirty data and metadata across all filesystems. Traditionally returns immediately after scheduling; modern Linux waits for completion.
Use Case: Emergency shutdown, system maintenance. Rarely used by applications due to global scope.
int syncfs(int fd);
Semantics: Writes all dirty data and metadata for the filesystem containing the file referred to by fd.
Use Case: Ensuring a specific filesystem is synchronized without affecting others.
int fsync(int fd);
Semantics: Transfers all modified data and metadata for the file to storage and flushes device write caches. Blocks until complete.
Guarantees: After successful return, file data and all metadata necessary to retrieve the file ARE on stable storage.
int fdatasync(int fd);
Semantics: Like fsync(), but only writes data and metadata necessary to access the data (e.g., file size if extended, but not modification time).
Use Case: When you need data durability but don't care about preserving exact timestamps. Can be significantly faster than fsync() on some filesystems.
| Primitive | Scope | Waits for Completion | Flushes Device Cache | Relative Speed |
|---|---|---|---|---|
| sync() | All filesystems | Yes (modern Linux) | Usually | Slowest |
| syncfs() | Single filesystem | Yes | Yes | Slow |
| fsync() | Single file + metadata | Yes | Yes | Medium |
| fdatasync() | Single file data | Yes | Yes | Fastest |
| O_SYNC writes | Each write operation | Yes | Usually | Very slow |
```c
/*
 * Synchronization primitive usage examples
 * Demonstrating different durability guarantees
 */

#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

/*
 * Case 1: Transaction log with strong durability
 *
 * For a database transaction log, we need guaranteed
 * durability before acknowledging a transaction commit.
 */
int write_transaction_log(int log_fd, const char *entry, size_t len)
{
    ssize_t written;

    /* Write the log entry */
    written = write(log_fd, entry, len);
    if (written != (ssize_t)len)
        return -1;

    /*
     * Use fdatasync() for durability.
     * We need the data on disk, but don't care about
     * modification time - fdatasync() suffices and
     * is faster than fsync().
     */
    if (fdatasync(log_fd) == -1)
        return -1;

    return 0;  /* Entry is now durably stored */
}

/*
 * Case 2: Configuration file atomic update
 *
 * For configuration files, we need the file to be
 * either fully old or fully new - never partial.
 * We also need metadata (for the rename).
 */
int atomic_config_update(const char *config_path,
                         const char *new_content, size_t content_len)
{
    char temp_path[256];
    int fd;

    /* Create temporary file in same directory */
    snprintf(temp_path, sizeof(temp_path), "%s.tmp", config_path);

    fd = open(temp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    /* Write new content */
    if (write(fd, new_content, content_len) != (ssize_t)content_len) {
        close(fd);
        unlink(temp_path);
        return -1;
    }

    /*
     * fsync() the file - we need both data AND metadata
     * (specifically, the file's directory entry) to be
     * stable before the rename.
     */
    if (fsync(fd) == -1) {
        close(fd);
        unlink(temp_path);
        return -1;
    }
    close(fd);

    /* Atomically replace old config with new */
    if (rename(temp_path, config_path) == -1) {
        unlink(temp_path);
        return -1;
    }

    /*
     * fsync() the directory to ensure the rename is durable.
     * This is often overlooked but necessary for full durability.
     * (In general, open the directory containing config_path.)
     */
    int dir_fd = open(".", O_RDONLY | O_DIRECTORY);
    if (dir_fd != -1) {
        fsync(dir_fd);
        close(dir_fd);
    }

    return 0;
}

/*
 * Case 3: High-throughput logging with batched sync
 *
 * For applications needing high write throughput with
 * eventual durability (e.g., application logs), batch
 * multiple writes before syncing.
 */
struct log_buffer {
    int fd;
    char buffer[65536];
    size_t used;
    size_t sync_threshold;
};

int buffered_log_write(struct log_buffer *lb, const char *msg, size_t len)
{
    if (len > sizeof(lb->buffer))
        return -1;  /* Message larger than the whole buffer */

    /* Buffer the write */
    if (lb->used + len > sizeof(lb->buffer)) {
        /* Buffer full - flush first */
        if (write(lb->fd, lb->buffer, lb->used) != (ssize_t)lb->used)
            return -1;

        /* Sync periodically based on threshold
         * (static counter: illustrative, not thread-safe) */
        static size_t total_written = 0;
        total_written += lb->used;
        if (total_written >= lb->sync_threshold) {
            fdatasync(lb->fd);  /* Batch sync */
            total_written = 0;
        }

        lb->used = 0;
    }

    memcpy(lb->buffer + lb->used, msg, len);
    lb->used += len;
    return 0;
}
```

fsync() on the file alone is NOT sufficient for many durability scenarios. Filesystem metadata (directory entries, inode tables) may be cached separately. For complete durability after creating or renaming files, you must fsync() the containing directory as well. Many applications have shipped with data loss bugs because of this subtle requirement.
Modern storage stacks contain multiple levels of caching and reordering—disk controller caches, RAID controller caches, SAN caches, and the drives' own NCQ (Native Command Queuing) or write caches. For durability guarantees to be meaningful, the kernel must ensure that data actually reaches stable storage in the required order.
Consider a journaling filesystem writing a transaction: first the journal entries describing the update, then a commit record marking the transaction complete, and finally the in-place writes to the main filesystem structures.
If these writes are reordered (by disk controller, NCQ, etc.), disaster can result. If the commit record reaches disk before the journal entry, and power fails, recovery sees a committed transaction with corrupt data.
To ensure ordering across storage caches, the kernel uses cache flush commands:
SATA/SCSI: the FLUSH CACHE command (ATA) and SYNCHRONIZE CACHE command (SCSI) force all cached writes to stable media
NVMe: Flush command with similar semantics
The kernel issues flushes at critical points to enforce ordering:
```c
/*
 * Write barriers in the Linux block layer
 *
 * Barriers ensure that writes before the barrier
 * are on stable media before writes after the barrier
 * can begin.
 */

/*
 * Issue a flush to the block device
 * This ensures all previous writes are on stable media
 */
int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask)
{
    struct bio *bio;
    int ret;

    bio = bio_alloc(bdev, 0, REQ_OP_FLUSH | REQ_PREFLUSH, gfp_mask);
    if (!bio)
        return -ENOMEM;

    /* Submit and wait for completion */
    ret = submit_bio_wait(bio);
    bio_put(bio);

    return ret;
}

/*
 * Submit a write with flush flags
 *
 * REQ_PREFLUSH: Flush cache before this write
 * REQ_FUA: This write goes directly to media (not cached)
 */
void submit_critical_write(struct block_device *bdev, sector_t sector,
                           void *data, size_t len)
{
    struct bio *bio;

    bio = bio_alloc(bdev, 1, REQ_OP_WRITE | REQ_PREFLUSH | REQ_FUA,
                    GFP_KERNEL);
    bio->bi_iter.bi_sector = sector;
    bio_add_page(bio, virt_to_page(data), len, offset_in_page(data));

    /*
     * This write will:
     * 1. Flush all previous cached writes (REQ_PREFLUSH)
     * 2. Write this data
     * 3. Force this data to stable media (REQ_FUA)
     *
     * After completion, all data written before AND including
     * this write is guaranteed on stable storage.
     */
    submit_bio_wait(bio);
    bio_put(bio);
}

/*
 * Journal commit sequence with proper barriers
 * (Conceptual ext4-style ordered mode)
 */
int commit_transaction(struct journal *j, struct transaction *t)
{
    int ret;

    /* Phase 1: Write journal descriptor block */
    ret = write_journal_descriptor(j, t);
    if (ret)
        return ret;

    /* Phase 2: Write journal data blocks */
    ret = write_journal_data(j, t);
    if (ret)
        return ret;

    /*
     * CRITICAL BARRIER: Ensure descriptor and data are
     * on disk before writing commit block.
     *
     * Without this barrier, the commit block could reach
     * disk first, and a crash would cause recovery to
     * replay a partial/corrupt transaction.
     */
    blkdev_issue_flush(j->bdev, GFP_KERNEL);

    /*
     * Phase 3: Write commit block with FUA
     *
     * The commit block uses FUA to bypass device cache,
     * ensuring it's on stable media after this returns.
     */
    ret = write_commit_block_fua(j, t);
    if (ret)
        return ret;

    /* Transaction is now committed and durable */
    return 0;
}
```

FUA (Force Unit Access) is a write flag indicating that the data should bypass the drive's write cache and go directly to stable media. This is critical for commit records and other data that must be durable immediately.
Without FUA:
Data → OS Cache → Device Cache → Media
↑
Power loss here loses data!
With FUA:
Data → OS Cache → Media (bypasses device cache)
FUA is more efficient than a full cache flush when you need one specific write to be durable but don't care about previous writes (they may have their own FUA or you'll flush later).
Not all hardware supports these features honestly: some consumer drives acknowledge flush commands before data actually reaches stable media, and some RAID controllers silently discard flush requests, relying on battery-backed caches that may or may not be present and healthy.
The only way to know if your storage stack correctly implements barriers is to test it. Tools like diskchecker.pl simulate power failures and verify data integrity. Production systems with durability requirements should be tested with actual power pulls during heavy write loads.
Applications have significant control over write caching behavior through open flags and runtime calls. Understanding these options enables building systems with precisely the right tradeoff between performance and durability.
| Flag | Effect | Performance Impact | Use Case |
|---|---|---|---|
| O_SYNC | Each write waits for data+metadata to reach storage | Very significant | Critical data requiring guaranteed write ordering |
| O_DSYNC | Each write waits for data to reach storage | Significant | Data-only durability (faster than O_SYNC) |
| O_DIRECT | Bypasses page cache entirely | Complex (reduces caching, enables DMA) | Databases with own caching layer |
```c
/*
 * Write caching control patterns
 * Different strategies for different requirements
 */

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Pattern 1: Synchronous writes for critical single-threaded data
 *
 * O_SYNC ensures every write is on stable storage before returning.
 * Simple but slow - use only when durability beats all else.
 */
int open_synchronous_critical(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
}

/*
 * Pattern 2: Direct I/O for databases
 *
 * O_DIRECT bypasses the page cache, giving the application
 * full control over buffering and caching. The database
 * implements its own buffer pool with specialized eviction,
 * prefetching, and WAL sync strategies.
 *
 * Note: O_DIRECT has alignment requirements!
 */
int open_direct_database(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd == -1)
        return -1;

    /*
     * Direct I/O usually still requires explicit sync for
     * durability - data bypasses OS cache but may still
     * be in device cache.
     */
    return fd;
}

/*
 * Aligned buffer for O_DIRECT
 */
void *alloc_direct_buffer(size_t size)
{
    void *buf = NULL;

    /* Align to filesystem block size (typically 4KB) */
    if (posix_memalign(&buf, 4096, size) != 0)
        return NULL;
    return buf;
}

/*
 * Pattern 3: Write-ahead log (WAL) with group commit
 *
 * Balance throughput and durability by batching syncs.
 * Many transactions write to log, but one sync covers all.
 * (Simplified: production group commit tracks per-batch
 * generations so late arrivals don't wait on the wrong sync.)
 */
struct wal_group_commit {
    int log_fd;
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    int pending_count;     /* Transactions waiting for sync */
    int sync_in_progress;  /* Sync currently happening */
    int next_group_size;   /* Target batch size */
};

int wal_write_and_sync(struct wal_group_commit *wal,
                       const void *entry, size_t len)
{
    pthread_mutex_lock(&wal->mutex);

    /* Write our entry (protected by mutex) */
    write(wal->log_fd, entry, len);
    wal->pending_count++;

    /* Check if we should trigger a group commit */
    if (!wal->sync_in_progress &&
        wal->pending_count >= wal->next_group_size) {
        /* We'll lead this group commit */
        wal->sync_in_progress = 1;
        int group_size = wal->pending_count;
        pthread_mutex_unlock(&wal->mutex);

        /* Do the actual sync (may take several ms) */
        fdatasync(wal->log_fd);

        pthread_mutex_lock(&wal->mutex);
        /* Wake all waiters from this batch */
        wal->sync_in_progress = 0;
        wal->pending_count = 0;
        pthread_cond_broadcast(&wal->cond);

        /* Adaptive batch sizing based on throughput */
        if (group_size > 10)
            wal->next_group_size = group_size;
    } else {
        /* Wait for the leader to sync */
        while (wal->pending_count > 0)
            pthread_cond_wait(&wal->cond, &wal->mutex);
    }

    pthread_mutex_unlock(&wal->mutex);
    return 0;
}

/*
 * Pattern 4: Tiered durability based on data importance
 */
enum durability_level {
    DURABILITY_NONE,      /* Fire and forget */
    DURABILITY_EVENTUAL,  /* Will sync eventually */
    DURABILITY_IMMEDIATE, /* Sync before returning */
};

int write_with_durability(int fd, const void *data, size_t len,
                          enum durability_level level)
{
    ssize_t ret = write(fd, data, len);
    if (ret != (ssize_t)len)
        return -1;

    switch (level) {
    case DURABILITY_NONE:
        /* Rely on background writeback */
        break;

    case DURABILITY_EVENTUAL:
        /* Hint to sync soon but don't wait */
        sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
        break;

    case DURABILITY_IMMEDIATE:
        /* Full sync - guaranteed durable */
        if (fdatasync(fd) == -1)
            return -1;
        break;
    }

    return 0;
}
```

Applications can provide hints about write patterns using posix_fadvise().
POSIX_FADV_DONTNEED after writing can hint that the data doesn't need to stay cached. POSIX_FADV_NOREUSE hints that data will only be accessed once. These don't guarantee behavior but can improve cache efficiency for specific access patterns.
Write caching sits at the critical intersection of performance and reliability. We've explored the full stack from application-level control down to hardware synchronization primitives. The key insights: write-back caching delivers the performance modern systems depend on; dirty page limits bound the window of potential loss; the fsync() family is the application's durability contract; and flush commands plus FUA extend that contract through device caches.
What's Next:
The next page examines cache policies—the algorithms that determine which data to keep in cache when space is limited. Understanding replacement policies like LRU, LFU, ARC, and their variants is essential for predicting and optimizing cache behavior under real workloads.
You now understand write caching from application intent to hardware synchronization. You can reason about durability guarantees, choose appropriate synchronization primitives, and understand the performance implications of different choices. This knowledge is essential for building systems that are both fast and reliable.