In the previous page, we explored write-through caching—the conservative approach that guarantees durability at the cost of performance. Every write waited for disk acknowledgment, creating a direct relationship between I/O latency and application throughput.
But consider this scenario: your application writes 100 small records per second to the same file region. With write-through, that's 100 disk I/O operations. If each takes 15µs (NVMe), you've consumed 1.5ms of disk time. Acceptable. But on an HDD at 10ms per write, you've saturated the disk entirely with just 100 writes/second.
Now imagine those 100 writes all modify the same 4KB block—perhaps updating a counter, appending to a log, or modifying an in-memory data structure that gets serialized. With write-through, you write the same block 100 times. All but the final write are pure waste: only the last version of the block matters.
Write-back caching eliminates this waste. Instead of writing immediately, data accumulates in cache, and only the final state is eventually written to disk. The same 100 writes become 1 I/O operation. Performance improves by 100×—but at a cost.
This page explores that cost, and more importantly, how to manage it correctly.
By the end of this page, you will understand the complete lifecycle of dirty data in a write-back system, the mechanisms that eventually force data to disk, the failure modes that can cause data loss, and the design patterns that allow systems to achieve write-back performance with acceptable durability guarantees.
Write-back (also called write-behind) is a caching strategy where writes update only the cache initially. The write operation returns success immediately after the cache is updated. The actual write to the backing store happens later, asynchronously, controlled by various policies.
The key behavioral characteristic: the write() call returns as soon as data is copied to the kernel's page cache—no waiting for disk. The actual flush to the backing store happens later, asynchronously.
The Write-Back Data Flow
Application: write(fd, buffer, 4096)
│
▼
┌─────────────────────────────────────────────────────────┐
│ User Space → Kernel Space Transition (syscall) │
│ [Time: ~1-5µs] │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ VFS Layer: Validate fd, check permissions │
│ [Time: ~0.5µs] │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Page Cache: │
│ 1. Find or allocate page for this file offset │
│ 2. Copy data from user buffer to page │
│ 3. Mark page as DIRTY │
│ 4. Update file metadata (size, mtime) │
│ [Time: ~0.1-1µs per KB] │
└─────────────────────────────────────────────────────────┘
│
▼ ← RETURN IMMEDIATELY HERE (write-back)
┌─────────────────────────────────────────────────────────┐
│ Application: write() returns 4096 (success) │
│ [Total time: ~2-10µs] │
└─────────────────────────────────────────────────────────┘
... later ...
┌─────────────────────────────────────────────────────────┐
│ Background Flusher Thread (pdflush/flush-X): │
│ 1. Scan for dirty pages │
│ 2. Batch into I/O requests │
│ 3. Submit to block layer │
│ 4. Clear dirty flags on completion │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Storage Device: Write to persistent medium │
│ [Data now durable] │
└─────────────────────────────────────────────────────────┘
Notice the fundamental difference: the application sees a 2-10µs operation regardless of whether the disk would have taken 10ms. From the application's perspective, every write is essentially a memory copy.
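You can observe this asymmetry directly with a small experiment (illustrative only, not a benchmark—absolute numbers depend heavily on hardware and filesystem): time a plain buffered write() against the same write followed by fsync().

```python
import os
import tempfile
import time

def timed_write(fd: int, data: bytes, sync: bool) -> float:
    """Time one write(); optionally force durability with fsync()."""
    start = time.perf_counter()
    os.write(fd, data)
    if sync:
        os.fsync(fd)  # block until the data reaches the storage device
    return time.perf_counter() - start

fd, path = tempfile.mkstemp()
buf = b"x" * 4096

buffered = timed_write(fd, buf, sync=False)  # returns after the cache copy
durable = timed_write(fd, buf, sync=True)    # waits for the storage stack

os.close(fd)
os.unlink(path)
print(f"buffered: {buffered * 1e6:.1f}us  fsync: {durable * 1e6:.1f}us")
```

On a typical system the buffered write completes in single-digit microseconds while the fsync'd write reflects the full device round trip.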
Between the write() call returning and the background flush completing, data exists ONLY in volatile RAM. A power failure, kernel panic, or forced system reset during this window loses all uncommitted writes. This window can range from milliseconds to 30+ seconds depending on configuration.
The Linux page cache is the central data structure enabling write-back behavior. Understanding its mechanics is essential for understanding write-back performance and failure modes.
Page Cache Organization
Every file's content is represented in memory as a collection of 4KB pages (matching the typical filesystem block size and CPU page size). These pages are organized in a radix tree indexed by file offset:
struct address_space {
struct inode *host; /* Owner inode */
struct xarray i_pages; /* Radix tree of pages */
unsigned long nrpages; /* Number of cached pages */
pgoff_t writeback_index; /* Writeback starts here */
const struct address_space_operations *a_ops;
unsigned long flags; /* Various flags */
/* ... more fields ... */
};
Each page has state flags indicating its current status:
| Flag | Meaning | Implications |
|---|---|---|
| PG_uptodate | Page contains valid data | Safe to read from cache |
| PG_dirty | Page differs from disk | Must be written eventually |
| PG_writeback | Write to disk in progress | Cannot modify until complete |
| PG_locked | Page is locked for I/O | Other accessors must wait |
| PG_referenced | Recently accessed | Used for eviction decisions |
The Dirty Page Lifecycle
Application write()
│
▼
┌──────────────────┐
│ CLEAN PAGE │ Page matches disk content
│ (or no page) │ (or page doesn't exist)
└────────┬─────────┘
│ write() copies data
▼
┌──────────────────┐
│ DIRTY PAGE │ Page differs from disk
│ PG_dirty = 1 │ Needs eventual flush
└────────┬─────────┘
│ Background flusher or fsync()
▼
┌──────────────────┐
│ WRITEBACK PAGE │ I/O in progress
│ PG_writeback = 1 │ Temporary state
└────────┬─────────┘
│ I/O completion
▼
┌──────────────────┐
│ CLEAN PAGE │ Page matches disk again
│ PG_dirty = 0 │ Can be evicted if needed
└──────────────────┘
Memory Pressure and Dirty Pages
The page cache competes for memory with applications and other kernel subsystems. When memory becomes scarce, the kernel must evict pages. But dirty pages cannot simply be discarded—they must be written first.
This creates an important interaction: under memory pressure, reclaim cannot simply discard a dirty page—it must write the page back first. Heavy write workloads can therefore turn memory allocation stalls into disk I/O stalls.
The /proc/meminfo file shows current dirty page status:
Dirty: 1234 kB # Currently dirty, not being written
Writeback: 567 kB # Currently being written to disk
AnonPages: 4567890 kB # Anonymous memory (not file-backed)
Mapped: 123456 kB # Memory-mapped files
PageTables: 12345 kB # Page table overhead
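A small helper can turn that text into bytes for monitoring (a sketch; it parses the "Key: value kB" format shown above):

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style 'Key:  value kB' lines into bytes."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0]) * 1024  # values are in kB
    return fields

sample = "Dirty:     1234 kB\nWriteback:   567 kB\nAnonPages: 4567890 kB\n"
info = parse_meminfo(sample)
```

On a live Linux system you would feed it `open("/proc/meminfo").read()` and alert when Dirty grows toward the configured limits.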
Dirty Limits and Throttling
Linux prevents dirty pages from consuming all memory through configurable limits:
# View current dirty page limits
cat /proc/sys/vm/dirty_background_ratio
# Default: 10 - Start background writeback at 10% RAM dirty

cat /proc/sys/vm/dirty_ratio
# Default: 20 - Block writers at 20% RAM dirty (throttling)

cat /proc/sys/vm/dirty_writeback_centisecs
# Default: 500 - Flush daemon wakes every 5 seconds

cat /proc/sys/vm/dirty_expire_centisecs
# Default: 3000 - Pages older than 30 seconds get written

# For absolute byte limits (useful on high-memory systems):
cat /proc/sys/vm/dirty_background_bytes
cat /proc/sys/vm/dirty_bytes

# Example: System with 64GB RAM, default settings
# dirty_background_ratio = 10%
#   → Background flush starts at 6.4GB dirty
# dirty_ratio = 20%
#   → Application writes block at 12.8GB dirty

# Tuning for low-latency workloads (reduce dirty window):
echo 5 > /proc/sys/vm/dirty_background_ratio
echo 10 > /proc/sys/vm/dirty_ratio
echo 100 > /proc/sys/vm/dirty_writeback_centisecs
echo 500 > /proc/sys/vm/dirty_expire_centisecs

# Tuning for throughput (allow more buffering):
echo 20 > /proc/sys/vm/dirty_background_ratio
echo 40 > /proc/sys/vm/dirty_ratio
echo 1000 > /proc/sys/vm/dirty_writeback_centisecs
echo 6000 > /proc/sys/vm/dirty_expire_centisecs

Lower dirty ratios reduce data-at-risk during crash but increase I/O pressure and may cause write stalls. Higher ratios improve throughput but increase potential data loss and memory pressure. Production systems typically reduce defaults for safety, especially on systems with large RAM where default percentages represent enormous data loss windows.
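The arithmetic in those comments can be captured in a small helper (a sketch; the ratios are the integer percentages the kernel exposes):

```python
def dirty_limits(ram_bytes: int, background_ratio: int = 10,
                 ratio: int = 20) -> tuple[int, int]:
    """Translate percentage-based vm.dirty_* settings into absolute bytes."""
    background = ram_bytes * background_ratio // 100  # flusher threads start here
    blocking = ram_bytes * ratio // 100               # write() callers throttle here
    return background, blocking

GiB = 1024 ** 3
bg, block = dirty_limits(64 * GiB)
print(f"background: {bg / GiB:.1f} GiB, blocking: {block / GiB:.1f} GiB")
```

For a 64GB machine with defaults this reproduces the 6.4GB / 12.8GB figures above, and makes it obvious why percentage limits scale badly with RAM size.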
The Linux kernel employs dedicated kernel threads to write dirty pages to disk. Understanding these mechanisms is crucial for predicting writeback behavior.
Writeback Threads
Linux performs writeback with per-backing-device threads. On older kernels these appear as dedicated flush-X:Y threads (on recent kernels the same work runs on generic kworker threads):
$ ps aux | grep flush
root 1234 0.0 0.0 0 0 ? S Jan01 0:23 [flush-8:0]
root 1235 0.0 0.0 0 0 ? S Jan01 0:15 [flush-8:16]
root 1236 0.0 0.0 0 0 ? S Jan01 0:08 [flush-253:0]
Each flush-X:Y thread handles writeback for a specific block device (major:minor numbers). This allows parallel writeback across multiple devices.
Writeback Triggers
Writeback occurs in several circumstances:
- Periodic flushing (dirty_writeback_centisecs): the flusher wakes at a fixed interval and writes pages older than dirty_expire_centisecs
- Background threshold (dirty_background_ratio): asynchronous flushing starts once dirty pages exceed this fraction of RAM
- Blocking threshold (dirty_ratio): above this fraction, writing processes are throttled and forced to wait for writeback
- Explicit flush (fsync, syncfs, sync): the application demands durability now
- Memory pressure: reclaim must clean dirty pages before it can evict them
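The first two triggers can be sketched as a toy, single-threaded model (purely illustrative—the kernel's real flusher is far more sophisticated; the class name and thresholds here are invented for the example):

```python
class ToyWriteBackCache:
    """Toy model of two writeback triggers: a background dirty-count
    threshold (like dirty_background_ratio) and page-age expiry
    (like dirty_expire_centisecs)."""

    def __init__(self, backing: dict, background_threshold: int = 4,
                 expire_s: float = 3.0):
        self.backing = backing              # stands in for the disk
        self.dirty = {}                     # page -> (data, dirtied_at)
        self.background_threshold = background_threshold
        self.expire_s = expire_s

    def write(self, page: int, data: bytes, now: float):
        """Cache-only update; 'returns' immediately like a buffered write()."""
        self.dirty[page] = (data, now)
        if len(self.dirty) >= self.background_threshold:
            self.flush()                    # background-threshold trigger

    def tick(self, now: float):
        """Periodic flusher wakeup: write pages older than expire_s."""
        for page in [p for p, (_, t) in self.dirty.items()
                     if now - t >= self.expire_s]:
            self.backing[page] = self.dirty.pop(page)[0]

    def flush(self):
        """Write everything dirty (like sync())."""
        for page, (data, _) in self.dirty.items():
            self.backing[page] = data
        self.dirty.clear()
```

A page written once sits dirty until a tick expires it; a burst of writes crosses the threshold and triggers an immediate flush—the same two behaviors the kernel parameters control.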
Writeback Request Flow
┌────────────────────────────────────────────────────────────┐
│ Flusher Thread: "Time to write dirty pages for /dev/sda" │
└────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ 1. Walk dirty inode list for this block device │
│ 2. For each inode, walk its dirty page tree │
│ 3. Batch contiguous pages into single I/O requests │
│ 4. Apply I/O scheduling (CFQ, BFQ, mq-deadline, etc.) │
└────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ Block Layer: │
│ - Merge adjacent requests │
│ - Reorder for seek optimization (HDDs) │
│ - Split requests exceeding device limits │
│ - Track request completion for accounting │
└────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ Device Driver: Submit to hardware command queue │
│ [For NVMe: may use multiple hardware queues] │
└────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ Storage Device: Writes data to medium │
│ [Returns completion interrupt/poll response] │
└────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ Completion Path: │
│ - Clear PG_writeback flag on pages │
│ - Clear PG_dirty flag (data is clean now) │
│ - Update writeback accounting │
│ - Wake any waiters (fsync callers) │
└────────────────────────────────────────────────────────────┘
The block layer's merging capability is a crucial write-back advantage. A thousand 4KB writes to sequential addresses become a single 4MB I/O operation. This dramatically improves HDD performance (one seek instead of 1000) and SSD efficiency (optimal write unit alignment).
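The merging idea is easy to model: given dirty-page indices, fold adjacent pages into (start, length) extents. This is a sketch—the real block layer merges request structures, not page lists:

```python
def coalesce(pages: list[int]) -> list[tuple[int, int]]:
    """Merge page indices into (start, length) extents, the way the
    block layer merges adjacent requests into one larger I/O."""
    extents: list[tuple[int, int]] = []
    for p in sorted(pages):
        if extents and extents[-1][0] + extents[-1][1] == p:
            # Page is contiguous with the current extent: extend it
            extents[-1] = (extents[-1][0], extents[-1][1] + 1)
        else:
            extents.append((p, 1))
    return extents

print(coalesce([0, 1, 2, 5, 6, 9]))  # [(0, 3), (5, 2), (9, 1)]
```

Six page writes collapse into three device I/Os; a thousand sequential pages would collapse into one.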
Let's develop quantitative models for write-back performance to understand when and why it provides such dramatic improvements over write-through.
Write Latency Model
For an individual write operation:
Write-Through: T_write = T_syscall + T_cache + T_io_complete
≈ 5µs + 1µs + T_device (10ms HDD, 20µs SSD)
Write-Back: T_write = T_syscall + T_cache + T_return
≈ 5µs + 1µs + 0.1µs ≈ 6µs (regardless of device)
For an HDD, write-back is 1600× faster per operation (10ms vs 6µs). For NVMe SSD, write-back is 4× faster (26µs vs 6µs).
Write Absorption Model
When multiple writes hit the same cache page, only one I/O operation eventually occurs:
N writes to same page:
Write-Through: N × T_io = N × 10ms (HDD)
Write-Back: N × T_cache + 1 × T_io = N × 1µs + 10ms
For N=100 on HDD:
Write-Through: 1000ms (1 second!)
Write-Back: ~10.1ms (100 × 1µs + 10ms)
Effective speedup: ~99× from absorption alone
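The absorption model can be captured in a few lines (timing constants are the illustrative figures used throughout this page):

```python
def absorption_times(n: int, t_cache: float, t_io: float) -> tuple[float, float]:
    """Total time in seconds for n overlapping writes under each strategy."""
    write_through = n * t_io            # every write pays a full device I/O
    write_back = n * t_cache + t_io     # n cache copies, one final flush
    return write_through, write_back

# 100 writes to the same block on an HDD: 1µs cache copy, 10ms device I/O
wt, wb = absorption_times(100, t_cache=1e-6, t_io=10e-3)
print(f"write-through: {wt*1e3:.0f}ms  write-back: {wb*1e3:.1f}ms  "
      f"speedup: {wt/wb:.0f}x")
```

With these constants the model gives 1000ms versus ~10.1ms, a ~99× speedup from absorption alone.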
Sequential Write Optimization
Consider writing a 1GB file as 262,144 sequential 4KB writes:
| Storage | Strategy | Operations | Total Time | Throughput |
|---|---|---|---|---|
| HDD 7200 | Write-through | 262,144 × 10ms | 43.7 min | 0.39 MB/s |
| HDD 7200 | Write-back (merged) | ~256 × 10ms | 2.56 sec | 400 MB/s |
| NVMe SSD | Write-through | 262,144 × 20µs | 5.24 sec | 195 MB/s |
| NVMe SSD | Write-back (merged) | ~1024 × 10µs | 10.2 ms | 100+ GB/s* |
*Limited by memory bandwidth, not storage.
The Dirty Window Cost
The performance benefit has a durability cost. At any moment, the "data at risk" is:
Data_At_Risk = Total_Dirty_Pages × Page_Size
With default settings (10% dirty background, 20% dirty max):
32GB RAM system:
Normal operation: up to 3.2GB dirty (10% threshold active)
Heavy write load: up to 6.4GB dirty (blocking at 20%)
256GB RAM system:
Normal operation: up to 25.6GB dirty!
Heavy write load: up to 51.2GB dirty!
The default percentage-based limits were designed when 512MB was a lot of RAM. On modern high-memory systems, using absolute byte limits (dirty_background_bytes, dirty_bytes) is strongly recommended.
Latency Variability
Write-back introduces latency variance that write-through lacks. Consider:
Single write, no memory pressure: ~6µs (P50, P99, P99.9 all similar)
Single write, dirty ratio exceeded: ~50ms-1s (blocked on writeback)
Single write, memory pressure: ~10ms-10s (direct reclaim)
Applications sensitive to tail latency (real-time systems, financial trading) often prefer write-through or hybrid approaches precisely because of this variance.
Systems crossing the dirty_ratio threshold experience sudden, severe latency degradation. A system happily doing 100,000 writes/second can drop to 100 writes/second when writers start blocking. Monitor dirty page levels proactively and tune thresholds before you hit them in production.
Write-back's Achilles heel is crash behavior. Let's analyze exactly what happens when a system crashes with dirty pages in memory.
Crash Scenario Analysis
Timeline during normal operation:
T=0.000s: Application writes block A (goes to cache, dirty)
T=0.001s: Application writes block B (goes to cache, dirty)
T=0.002s: Application writes block C (goes to cache, dirty)
...
T=5.000s: Background flush starts, writes A to disk
T=5.100s: Background flush writes B to disk
---CRASH--- (power loss at T=5.15s)
T=5.200s: Block C would have been written... but system is dead
Post-crash state:
Block A: On disk (correct, latest version)
Block B: On disk (correct, latest version)
Block C: Lost forever (only existed in RAM)
For individual files, this might mean a lost document save, a truncated log file, or a record that was acknowledged to a client but never persisted.
Filesystem Metadata Corruption
More insidiously, write-back can corrupt filesystem metadata:
Application creates new file:
1. Allocate inode (metadata write, goes to cache)
2. Allocate data blocks (metadata write, goes to cache)
3. Update directory entry (metadata write, goes to cache)
4. Write file content (data write, goes to cache)
If crash occurs after step 2 but before all metadata is flushed:
- Inode exists but directory doesn't reference it (orphan inode)
- Blocks allocated but file doesn't exist (space leak)
- Directory references file but inode missing (dangling reference)
Result: Filesystem inconsistency requiring fsck
Write Barriers: Ordering Without Full Sync
To address metadata ordering issues, filesystems use write barriers:
Barrier semantics:
All writes before the barrier must complete
before any write after the barrier begins
Without barriers (write-back, free reordering):
write A ─┐
write B ─┼─→ Disk sees: B, C, A (or any order)
write C ─┘
With barriers:
write A
write B
---BARRIER---
write C
write D
Disk sees: (A,B in some order), then (C,D in some order)
A and B definitely before C and D
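A barrier's effect on ordering can be modeled as partitioning the write stream into epochs (a toy model; real barriers are implemented with flush/FUA requests in the block layer):

```python
def barrier_epochs(ops: list[str]) -> list[list[str]]:
    """Split a write stream at 'BARRIER' markers into epochs. Writes within
    an epoch may be freely reordered, but every write in one epoch must
    complete before any write in the next epoch begins."""
    epochs: list[list[str]] = []
    current: list[str] = []
    for op in ops:
        if op == "BARRIER":
            if current:
                epochs.append(current)
                current = []
        else:
            current.append(op)
    if current:
        epochs.append(current)
    return epochs

print(barrier_epochs(["A", "B", "BARRIER", "C", "D"]))  # [['A', 'B'], ['C', 'D']]
```

The device is free to reorder within ['A', 'B'] and within ['C', 'D'], but never across the boundary—exactly the guarantee the diagram above describes.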
Barriers are cheaper than full sync because they enforce only ordering: the kernel does not wait for every write to become durable before returning—it only prevents the device from reordering writes across the barrier.
Journaling Filesystems
Modern filesystems (ext4, XFS, NTFS, APFS) use journaling to handle crash recovery without requiring write-through for all data: intended metadata changes are first appended to a sequential journal and made durable, and only afterward applied (via write-back) to their final on-disk locations.
If a crash occurs, recovery replays completed journal transactions and discards incomplete ones, restoring metadata consistency without a full fsck.
This achieves the safety of write-through for metadata with the performance of write-back for data.
Applications requiring durability should not rely on filesystem writeback behavior. They should explicitly call fsync() or use O_SYNC/O_DSYNC for critical writes. Assume write-back is the default and code defensively.
Operating system write-back is only part of the story. Storage devices have their own write caches, creating a multi-level caching hierarchy.
Device Write Cache Architecture
┌─────────────────────────────────────────────────────────────┐
│ Application │
└─────────────────────────────────────────────────────────────┘
│ write()
▼
┌─────────────────────────────────────────────────────────────┐
│ OS Page Cache (Write-Back) │
│ [Volatile RAM, 100GB+ possible] │
└─────────────────────────────────────────────────────────────┘
│ Background flush
▼
┌─────────────────────────────────────────────────────────────┐
│ RAID Controller Cache (If present) │
│ [Volatile or Battery-Backed, 256MB-8GB typical] │
└─────────────────────────────────────────────────────────────┘
│ RAID write
▼
┌─────────────────────────────────────────────────────────────┐
│ Drive Write Cache │
│ • HDD: volatile DRAM, 8-256MB │
│ • SATA SSD: volatile DRAM, 256MB-2GB │
│ • NVMe SSD: volatile DRAM + sometimes capacitor-backed │
└─────────────────────────────────────────────────────────────┘
│ Internal staging
▼
┌─────────────────────────────────────────────────────────────┐
│ Persistent Medium │
│ • HDD: magnetic platters │
│ • SSD: NAND flash cells │
└─────────────────────────────────────────────────────────────┘
Each layer can independently use write-back, and all layers must be considered for durability analysis.
SSD Internal Write-Back
SSDs have complex internal caching for performance and wear management:
When the OS "writes" and the SSD "acknowledges", the data may still be sitting in the drive's volatile DRAM buffer or SLC cache; the controller migrates it to its final NAND location later, as part of garbage collection and wear leveling. Until that happens, the acknowledgment does not imply durability.
Power Loss Protection (PLP)
Enterprise SSDs often include capacitors that provide enough power to flush DRAM to NAND during power loss:
| SSD Class | PLP | Typical Flush Time | Data Safety |
|---|---|---|---|
| Consumer SATA | None | N/A | NOT safe, data loss possible |
| Prosumer NVMe | Partial | ~10ms | Metadata safe, data may be lost |
| Enterprise NVMe | Full | ~50-100ms | All in-flight data flushed |
| Intel Optane | N/A | Inherently safe | Persistent memory, no flush needed |
Controlling Device Write Cache
# Check device write caching status
hdparm -W /dev/sda
# /dev/sda:
#  write-caching = 1 (on)   ← Device is using write-back

# Disable device write cache (for maximum safety)
hdparm -W0 /dev/sda
# Warning: Significant performance impact!

# For NVMe devices:
nvme get-feature -f 0x06 /dev/nvme0      # Check volatile write cache
nvme set-feature -f 0x06 -v 0 /dev/nvme0 # Disable VWC

# Verify effect on write performance
# With cache enabled (typical NVMe):
fio --name=test --rw=randwrite --bs=4k --numjobs=1 \
    --size=1G --runtime=10 --direct=1 --sync=1
# Results: ~50,000-100,000 IOPS

# With cache disabled (same drive):
# Results: ~10,000-30,000 IOPS (FTL overhead visible)

Rather than globally disabling device write cache, use Forced Unit Access (FUA) for specific critical writes. FUA bypasses the volatile cache for that operation only. Linux exposes this via O_DSYNC or REQ_FUA at the block layer. This preserves performance for non-critical writes while ensuring durability for critical ones.
Pure write-through sacrifices too much performance. Pure write-back sacrifices too much durability. Production systems use hybrid patterns that provide controlled durability with minimal performance impact.
Pattern 1: Group Commit
Batch multiple operations, sync once:
import os
import threading
from concurrent.futures import Future

class GroupCommitLog:
    """
    Batches multiple log entries, commits together.
    Amortizes fsync cost across many operations.
    """
    def __init__(self, path, commit_interval_ms=10, max_batch_size=100):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.pending = []
        self.pending_lock = threading.Lock()
        self.commit_interval = commit_interval_ms / 1000.0
        self.max_batch = max_batch_size
        self.commit_event = threading.Event()
        self.committer = threading.Thread(target=self._commit_loop, daemon=True)
        self.committer.start()

    def append(self, record):
        """Non-blocking: adds record to pending batch, returns a Future
        that resolves once the record is durable."""
        with self.pending_lock:
            future = Future()
            self.pending.append((record, future))
            if len(self.pending) >= self.max_batch:
                self.commit_event.set()  # wake the committer early
        return future

    def _commit_loop(self):
        while True:
            self.commit_event.wait(timeout=self.commit_interval)
            self.commit_event.clear()
            with self.pending_lock:
                batch = self.pending
                self.pending = []
            if not batch:
                continue
            # Write all records (goes to OS cache)
            for record, future in batch:
                os.write(self.fd, record.encode() + b'\n')
            # Single fsync for entire batch
            os.fsync(self.fd)
            # Signal all waiters
            for _, future in batch:
                future.set_result(True)
This pattern achieves bounded commit latency (at most one commit interval plus one fsync), a single fsync amortized over up to max_batch operations, and a clear durability contract: a record is durable once its future resolves.
Pattern 2: Write-Ahead Logging (WAL)
The canonical pattern for database durability:
Transaction execution:
1. Write change descriptions to log (sequential, fsync at commit)
2. Apply changes to data pages (write-back, no fsync)
3. Eventually checkpoint: flush dirty pages, truncate log
Crash recovery:
1. Read log from last checkpoint
2. Replay logged operations against possibly-stale data pages
3. System is consistent
Key insight: Random writes to data pages become sequential writes to log.
Sequential + fsync is much faster than random + fsync.
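A minimal sketch of the pattern (the MiniWAL class is invented for illustration; real databases add checksums, LSNs, torn-write detection, and group commit):

```python
import json
import os
import tempfile

class MiniWAL:
    """Minimal write-ahead log: each mutation is appended and fsynced to the
    log before being applied to the (write-back) in-memory table; replay()
    rebuilds the table from the log alone after a crash."""

    def __init__(self, log_path: str):
        self.table = {}
        self.fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)

    def put(self, key: str, value: str):
        record = json.dumps({"k": key, "v": value}).encode() + b"\n"
        os.write(self.fd, record)
        os.fsync(self.fd)           # durable before we acknowledge the commit
        self.table[key] = value     # the data-page update can stay write-back

    @staticmethod
    def replay(log_path: str) -> dict:
        table = {}
        with open(log_path, "rb") as f:
            for line in f:
                rec = json.loads(line)
                table[rec["k"]] = rec["v"]
        return table

# Demo: commit three updates, then "recover" from the log alone.
_fd, _path = tempfile.mkstemp()
os.close(_fd)
wal = MiniWAL(_path)
wal.put("a", "1")
wal.put("a", "2")
wal.put("b", "3")
recovered = MiniWAL.replay(_path)
```

Note where the fsync lives: on the sequential log append, never on the randomly-updated table—the key insight from the flow above.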
Pattern 3: Periodic Checkpoints
Allow write-back most of the time, force sync at intervals:
import os
import threading
import time

class CheckpointedCache:
    def __init__(self, checkpoint_interval_sec=60):
        self.dirty_files = set()
        self.lock = threading.Lock()
        self.interval = checkpoint_interval_sec
        threading.Thread(target=self._checkpoint_loop, daemon=True).start()

    def write(self, path, data):
        # Normal write-back write
        with open(path, 'wb') as f:
            f.write(data)
        with self.lock:
            self.dirty_files.add(path)

    def _checkpoint_loop(self):
        while True:
            time.sleep(self.interval)
            # Atomically take ownership of the dirty set
            with self.lock:
                to_sync = self.dirty_files
                self.dirty_files = set()
            # Sync all dirty files
            for path in to_sync:
                fd = os.open(path, os.O_RDONLY)
                os.fsync(fd)
                os.close(fd)
            # At this point, system can survive crash with
            # at most 'interval' seconds of data loss
| Strategy | Max Data Loss | Latency (P50) | Latency (P99) | Use Case |
|---|---|---|---|---|
| Pure write-through | None | 10-20ms (HDD) | 100ms+ | Compliance-critical |
| Pure write-back | 30+ seconds | ~6µs | ~100ms (threshold) | Temporary data |
| Group commit (10ms) | 10ms | 10ms | 11ms | Transaction logs |
| Checkpoints (60s) | 60 seconds | ~6µs | ~60s (checkpoint) | Application data |
| WAL + write-back | 0 (committed) | ~10ms commit | ~15ms | Databases |
These hybrid patterns demonstrate that the write-through vs write-back choice isn't binary. By understanding both strategies deeply, you can design systems that achieve required durability guarantees with minimal performance overhead. The key is identifying exactly which data requires immediate persistence.
Write-back caching is the default behavior in modern operating systems, and for good reason—its performance benefits are dramatic. But those benefits come with complexity and risk that systems engineers must understand and manage.
What's Next
We've now explored both ends of the spectrum: write-through (immediate durability, poor performance) and write-back (excellent performance, deferred durability). Next, we'll examine delayed writes—a nuanced strategy that provides explicit control over when write-back data is flushed, enabling applications to make fine-grained durability/performance trade-offs.
You now understand write-back caching comprehensively—its mechanisms, performance characteristics, crash implications, and how production systems combine it with synchronization to achieve both performance and reliability.