Consider a simple question: when evicting a page from memory, how does the operating system know whether it needs to write the page to disk?
Without this knowledge, the OS would have two choices—both terrible:
Always write every evicted page — Wasteful. Most evicted pages haven't changed. Writing unmodified pages doubles I/O without benefit.
Never write any page — Catastrophic. Modified data would be lost. This isn't even viable.
The solution is elegant: a single bit of hardware support called the dirty bit (also called the modified bit). This humble bit, maintained automatically by the processor for every page, answers the critical question: "Has this page been written to since it was loaded?"
This page explores the dirty bit in comprehensive detail—its mechanism, implementation, role in page replacement, and the optimization opportunities it enables.
By the end of this page, you will understand: (1) The precise mechanism by which dirty bits are set and cleared, (2) How dirty bits influence page replacement decisions, (3) The substantial performance impact of preferring clean page eviction, (4) Background page cleaning strategies that reduce eviction latency, and (5) The relationship between dirty bits and system reliability.
The dirty bit (also called the modify bit or M bit) is a single bit in each page table entry (PTE) that indicates whether the corresponding page has been modified (written to) since it was loaded into memory.
Formal Definition:
Dirty Bit = 1 : Page has been written to since last load/clear
Dirty Bit = 0 : Page has NOT been written to (clean)
Location in Page Table Entry:
Typical 64-bit Page Table Entry Layout (x86-64, simplified):
 Bit: 63    62…52     51…12                  11…9   8   7   6   5   4    3    2    1   0
┌────┬────────┬─────────────────────────┬───────┬───┬───┬───┬───┬────┬────┬────┬────┬───┐
│ NX │ avail  │ Physical Frame Number   │ avail │ G │PAT│ D │ A │PCD │PWT │U/S │R/W │ P │
│ 1  │   11   │        40 bits          │   3   │ 1 │ 1 │ 1 │ 1 │ 1  │ 1  │ 1  │ 1  │ 1 │
└────┴────────┴─────────────────────────┴───────┴───┴───┴───┴───┴────┴────┴────┴────┴───┘
                                                          ↑
                                                          └── Dirty Bit (D, bit 6)
Where:
D = Dirty bit (set by hardware on write)
A = Accessed bit (set by hardware on any access)
The Invariant:
If Dirty = 0: Page contents in memory == Page contents on disk/backing store
If Dirty = 1: Page contents in memory != Page contents on disk/backing store
(memory has newer data)
This invariant is critical: a clean page can be discarded without loss; a dirty page must be written first.
The dirty bit is set automatically by the processor's Memory Management Unit (MMU) during write operations—there's no software involvement in setting it. However, the operating system is responsible for clearing it after writing the page to disk. This hardware/software partnership is essential for efficiency.
Page States Based on Dirty Bit:
| Dirty Bit | Page State | Eviction Behavior |
|---|---|---|
| 0 | Clean | Can discard immediately; reload from backing store if needed |
| 1 | Dirty | Must write to backing store before frame can be reused |
Types of Backing Store:
Depending on the page type, a dirty page is written to different locations:
| Page Type | Backing Store | Write Destination |
|---|---|---|
| Anonymous (heap, stack) | Swap partition/file | Swap space |
| Private file mapping | File until first write (then copy-on-write) | Swap space |
| Shared file mapping | File system | Original file |
| Memory-mapped file | File system | File at mapped offset |
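To make the table concrete, here is a minimal user-space sketch using standard POSIX mmap/msync (the file name is illustrative; error handling omitted for brevity). A store through a MAP_SHARED mapping dirties a page whose backing store is the original file, while the same store through MAP_PRIVATE would copy-on-write to an anonymous page backed by swap:

```c
/* Minimal demonstration with standard POSIX calls
 * (file name illustrative; error handling omitted). */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDWR);   /* existing file, >= 4 KiB */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);   /* shared file mapping */

    p[0] = 'X';               /* MMU sets the dirty bit on this page */
    msync(p, 4096, MS_SYNC);  /* dirty page written back to the file */

    /* With MAP_PRIVATE instead, the same store would trigger
     * copy-on-write: the private copy is anonymous and would be
     * evicted to swap, never back to the file. */
    munmap(p, 4096);
    close(fd);
    return 0;
}
```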
Understanding exactly when and how the dirty bit is set and cleared is essential for appreciating its role in memory management.
When the Dirty Bit Is Set:
The dirty bit is set automatically by the MMU during the address translation process for a write operation (a software sketch of this logic follows the sequence):
Write Operation Sequence:
1. CPU executes store instruction (e.g., MOV [addr], value)
2. MMU receives virtual address and write intent
3. MMU walks page table to find PTE
4. MMU checks PTE permissions:
- If not writable → Protection fault
- If not present → Page fault
- If writable and present → Proceed
5. MMU atomically:
a. Sets Dirty bit = 1 in PTE
b. Sets Accessed bit = 1 in PTE
c. Performs translation
6. Write completes to physical memory
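The sketch below shows this logic as a hypothetical software-managed MMU or simulator might implement it. All names here (pte_t, the PTE_* flags, raise_fault) are illustrative, not a real kernel API; on x86 the hardware performs the equivalent PTE update itself during translation:

```c
/* Illustrative write-path logic; real x86 hardware does the PTE
 * update atomically as part of translation. */
typedef unsigned long pte_t;

enum {
    PTE_PRESENT  = 1 << 0,
    PTE_WRITABLE = 1 << 1,
    PTE_ACCESSED = 1 << 5,   /* A bit (x86 bit 5) */
    PTE_DIRTY    = 1 << 6,   /* D bit (x86 bit 6) */
};

extern unsigned long raise_fault(int is_protection, unsigned long vaddr);

unsigned long translate_write(pte_t *pte, unsigned long vaddr)
{
    if (!(*pte & PTE_PRESENT))       /* step 4: page fault */
        return raise_fault(0, vaddr);
    if (!(*pte & PTE_WRITABLE))      /* step 4: protection fault */
        return raise_fault(1, vaddr);

    /* Step 5: Dirty and Accessed are set together with the
     * translation; real hardware makes this one atomic operation. */
    *pte |= PTE_DIRTY | PTE_ACCESSED;

    /* Step 6: produce the physical address for the store */
    return ((*pte >> 12) << 12) | (vaddr & 0xfff);
}
```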
Atomic Update Requirement:
The dirty bit update must be atomic with respect to the write:
❌ Incorrect (non-atomic):
1. Translate address
2. Perform write
3. Set dirty bit
Problem: If interrupted after step 2, dirty bit never set.
Page appears clean but has modified data.
Eviction would lose the write!
✓ Correct (atomic):
Single hardware operation sets bit AND performs write.
Cannot be interrupted between.
When the Dirty Bit Is Cleared:
Unlike setting (which is automatic), clearing the dirty bit requires explicit OS action:
Clearing Sequence:
1. OS selects a dirty page for cleaning
2. OS ensures page is not being written (lock or COW)
3. OS initiates write to backing store
4. OS waits for I/O completion
5. OS atomically clears Dirty bit = 0
6. Page is now "clean" and can be evicted without write
Critical: Must ensure no concurrent write during steps 3-5!
Handling Concurrent Writes:
If a process writes to a page while it's being cleaned:
Scenario:
T1: OS starts writing page to disk (dirty=1)
T2: Process writes to page
T3: OS finishes disk write
T4: OS clears dirty bit (WRONG!)
Problem: Write at T2 is not on disk, but dirty=0 claims it is.
Eviction would lose T2's write.
Solutions:
1. Write-protect page during cleaning
- Any write triggers fault
- Fault handler re-marks dirty
2. Check dirty bit after write completes
- If set, page was modified during write
- Either restart write or leave dirty
3. Copy-on-write during cleaning
- Create copy for cleaning
- Original can accept writes
The TLB may cache PTEs, which complicates dirty-bit bookkeeping. On x86, the processor updates the dirty bit in the in-memory PTE (with an atomic read-modify-write) the first time a cached translation is used for a write; on architectures with software-managed TLBs, the bit may live only in the TLB entry until software writes it back. When scanning for clean pages, the OS must account for this: most architectures write dirty bits through to memory, but verification is platform-specific.
```c
/* Dirty Bit Manipulation in OS Kernel */

/* Check if page is dirty */
static inline bool page_is_dirty(struct page *page)
{
    pte_t *pte = get_pte_for_page(page);
    return pte_dirty(*pte);
}

/* Mark page as dirty (software-initiated) */
static inline void set_page_dirty(struct page *page)
{
    pte_t *pte = get_pte_for_page(page);
    set_pte_dirty(pte);
    page->flags |= PG_dirty;   /* Also set in page struct */
}

/* Clear dirty bit after successful write-back */
static inline void clear_page_dirty(struct page *page)
{
    pte_t *pte = get_pte_for_page(page);

    /* Must ensure no concurrent writes */
    ASSERT(page_locked(page));

    /* Clear in PTE */
    clear_pte_dirty(pte);

    /* Clear in page struct */
    page->flags &= ~PG_dirty;

    /* Flush TLB entry if architecture requires */
    flush_tlb_page(page_to_vaddr(page));
}

/* Safe write-back with race handling */
int writeback_page_safe(struct page *page)
{
    int ret;

    /* Lock the page to prevent modifications */
    lock_page(page);

    /* Write-protect to catch concurrent writes */
    protect_page_during_writeback(page);

    /* Perform the actual I/O */
    ret = write_page_to_backing_store(page);

    if (ret == 0) {
        /* Check if page was dirtied during write */
        if (!page_is_dirty(page)) {
            /* Safe to mark clean */
            clear_page_dirty(page);
        } else {
            /* Page was modified during write; leave it dirty,
             * it will need another write later */
        }
    }

    /* Restore normal protections */
    unprotect_page_after_writeback(page);
    unlock_page(page);

    return ret;
}
```

The dirty bit fundamentally affects page replacement strategy. Its impact is so significant that most algorithms explicitly factor it into victim selection.
The Cost Differential:
Clean Page Eviction:
1. Invalidate PTE (valid = 0)
2. Add frame to free list
Total time: ~1 microsecond
Dirty Page Eviction:
1. Initiate disk write
2. Wait for write completion
3. Invalidate PTE
4. Add frame to free list
Total time: ~10 milliseconds (HDD) or ~100 microseconds (SSD)
The 10,000× Penalty:
On a traditional hard drive, evicting a dirty page is roughly 10,000 times slower than evicting a clean one; even on SSDs the penalty is 20-100×. This massive asymmetry means that preferring clean pages has enormous impact, as the table and the back-of-the-envelope calculation below show.
| Storage Type | Clean Eviction | Dirty Eviction | Slowdown Factor |
|---|---|---|---|
| HDD (7200 RPM) | ~1 μs | ~10 ms | 10,000× |
| SATA SSD | ~1 μs | ~100 μs | 100× |
| NVMe SSD | ~1 μs | ~20 μs | 20× |
| RAM Disk / tmpfs | ~1 μs | ~1 μs | 1× (no backing store) |
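The average eviction cost depends on what fraction of victims are dirty. A quick sketch in C using the HDD figures from the table above (t_clean and t_dirty are those figures, not measured values):

```c
/* Expected eviction latency: E[t] = p * t_dirty + (1 - p) * t_clean */
#include <stdio.h>

int main(void) {
    const double t_clean = 1e-6;   /* ~1 us clean eviction (HDD row) */
    const double t_dirty = 10e-3;  /* ~10 ms dirty eviction (HDD row) */
    for (double p = 0.0; p <= 1.0; p += 0.25)
        printf("dirty fraction %.2f -> expected eviction %.4f ms\n",
               p, (p * t_dirty + (1.0 - p) * t_clean) * 1e3);
    return 0;
}
/* Even a 25% dirty victim rate pushes the average to ~2.5 ms on an
 * HDD, which is why replacement algorithms prefer clean victims. */
```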
Incorporating Dirty Status in Victim Selection:
Algorithms handle dirty pages in several ways:
Approach 1: Prefer Clean, Accept Dirty (common; sketched in code after this list)
Victim Selection:
1. Scan for clean, unreferenced pages
2. If found, select as victim
3. If none found, scan for dirty, unreferenced pages
4. Select dirty page as victim (takes longer but necessary)
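A minimal sketch of this two-pass scan (page_t and the helper functions are illustrative names, not a real kernel API):

```c
/* Two-pass victim selection: clean pages first, dirty as fallback. */
typedef struct page page_t;
extern int referenced(page_t *p);  /* Referenced bit */
extern int dirty(page_t *p);       /* Dirty bit */

page_t *select_victim(page_t *frames[], int n) {
    /* Pass 1: clean, unreferenced pages evict for free */
    for (int i = 0; i < n; i++)
        if (!referenced(frames[i]) && !dirty(frames[i]))
            return frames[i];
    /* Pass 2: accept a dirty, unreferenced page (costs one write) */
    for (int i = 0; i < n; i++)
        if (!referenced(frames[i]))
            return frames[i];
    /* All pages referenced: fall back to any frame */
    return frames[0];
}
```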
Approach 2: Enhanced Clock (NRU Classes; a short sketch follows the class list)
Classify pages into four categories:
Class 0: Referenced=0, Dirty=0 → Best victim
Not recently used, clean
Eviction cost: minimal
Class 1: Referenced=0, Dirty=1 → Good victim
Not recently used, but dirty
Eviction cost: 1 write
Class 2: Referenced=1, Dirty=0 → Poor victim
Recently used, but clean
May fault back soon
Class 3: Referenced=1, Dirty=1 → Worst victim
Recently used AND dirty
High cost and likely to re-fault
Scan order: Class 0 → Class 1 → Class 2 → Class 3
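Since the class number is simply (Referenced << 1) | Dirty, lower classes are better victims, and selection reduces to a minimum scan. A sketch (same illustrative helpers as above):

```c
/* NRU class-based selection (illustrative names, not a kernel API). */
typedef struct page page_t;
extern int referenced(page_t *p);  /* Referenced bit */
extern int dirty(page_t *p);       /* Dirty bit */

/* Class = (Referenced << 1) | Dirty: 0 is best victim, 3 is worst */
static int nru_class(page_t *p) {
    return (referenced(p) << 1) | dirty(p);
}

page_t *select_victim_nru(page_t *frames[], int n) {
    page_t *best = frames[0];
    for (int i = 1; i < n; i++)
        if (nru_class(frames[i]) < nru_class(best))
            best = frames[i];
    return best;
}
```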
Approach 3: Write-Back First, Then Evict
Some systems write dirty pages asynchronously, then treat them as clean:
1. Background daemon identifies dirty pages
2. Writes them to backing store
3. Clears dirty bit
4. Page now "clean" from eviction perspective
5. When eviction needed, more clean pages available
Modern systems use background page cleaning (pdflush/flush workers in Linux) to convert dirty pages to clean pages before they're needed for eviction. This "pre-cleaning" means that when memory pressure hits, many pages are already clean—reducing the critical path latency for page faults that require replacement.
Operating systems employ sophisticated strategies to keep a pool of clean pages available, minimizing eviction latency during memory pressure.
Strategy 1: Synchronous Write-Back
The simplest approach—wait inline during eviction:
When victim is dirty:
1. Start write to disk
2. Block until write completes
3. Clear dirty bit
4. Frame now available
Pros: Simple, predictable
Cons: Maximum latency on page fault critical path
Strategy 2: Asynchronous Pre-Cleaning (Background Flush)
Proactively clean pages before they're needed:
Linux flush workers (formerly pdflush/bdflush):
┌────────────────────────────────────────────────────────┐
│ Memory Pressure │
│ HIGH ────────── MEDIUM ────────── LOW │
│ ↓ ↓ ↓ │
│ Aggressive Moderate Lazy │
│ cleaning cleaning cleaning │
└────────────────────────────────────────────────────────┘
Behavior:
- Low pressure: Clean pages older than dirty_expire_centisecs
- Medium pressure: More aggressive cleaning
- High pressure: Synchronous cleaning in allocation path
Strategy 3: Clustered Writes
Group nearby dirty pages for efficient I/O (a sorting sketch follows these examples):
Scattered Dirty Pages (Inefficient):
Page at disk block 1000 → seek + write
Page at disk block 5000 → seek + write
Page at disk block 1500 → seek + write
Page at disk block 8000 → seek + write
Total: 4 seeks (expensive)
Clustered Writes (Efficient):
Sort pages by disk location
Page at disk block 1000 → seek + write
Page at disk block 1500 → small seek + write
Page at disk block 5000 → seek + write
Page at disk block 8000 → seek + write
Total: Fewer seeks, better throughput
Many systems coalesce: adjacent pages → single larger I/O
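A sketch of the sorting step (struct dirty_page and write_block are illustrative, not a real kernel API):

```c
/* Sort dirty pages by backing-store location before issuing writes,
 * so the disk services them in one sweep. */
#include <stdlib.h>

struct dirty_page {
    unsigned long block;  /* location in backing store */
    void *data;           /* page contents */
};

extern void write_block(unsigned long block, void *data);

static int by_block(const void *a, const void *b) {
    const struct dirty_page *x = a, *y = b;
    return (x->block > y->block) - (x->block < y->block);
}

void flush_clustered(struct dirty_page *pages, size_t n) {
    qsort(pages, n, sizeof pages[0], by_block);
    for (size_t i = 0; i < n; i++) {
        /* Runs of adjacent blocks could be coalesced into one larger I/O */
        write_block(pages[i].block, pages[i].data);
    }
}
```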
Strategy 4: Write-Ahead for Eviction Candidates
Preemptively clean pages likely to be evicted (sketched in code below):
Observation: Pages at the tail of LRU are eviction candidates
Strategy:
1. Monitor inactive list (eviction candidates)
2. For dirty pages on inactive list:
a. Initiate background write
b. Move to "laundering" state
3. When write completes:
a. Clear dirty bit
b. Page now clean for future eviction
Result: When eviction occurs, many candidates are already clean.
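A minimal sketch of such a laundering pass; all types and helpers here are illustrative names, not the real Linux API:

```c
/* Background laundering over the inactive list (illustrative API). */
typedef struct page page_t;
extern int  dirty(page_t *p);
extern void start_async_writeback(page_t *p);
extern void mark_laundering(page_t *p);
extern int  modified_during_writeback(page_t *p);
extern void clear_dirty(page_t *p);

void launder_inactive_pages(page_t *inactive[], int n) {
    for (int i = 0; i < n; i++) {
        if (!dirty(inactive[i]))
            continue;                        /* already clean */
        start_async_writeback(inactive[i]);  /* begin background write */
        mark_laundering(inactive[i]);        /* mark write in flight */
    }
}

/* I/O completion callback */
void on_writeback_done(page_t *p) {
    /* Re-check the dirty bit to handle writes that raced the I/O */
    if (!modified_during_writeback(p))
        clear_dirty(p);  /* page is now a cheap eviction candidate */
}
```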
```bash
# Check and tune dirty page parameters on Linux

# View current settings
cat /proc/sys/vm/dirty_background_ratio     # Background writeback threshold
cat /proc/sys/vm/dirty_ratio                # Foreground writeback threshold
cat /proc/sys/vm/dirty_expire_centisecs     # Age before forced write
cat /proc/sys/vm/dirty_writeback_centisecs  # Worker wakeup interval

# View current state
cat /proc/meminfo | grep -i dirty
# Dirty:     512 kB   (currently dirty pages)
# Writeback:  64 kB   (currently being written)

# Example: Reduce dirty ratio for latency-sensitive workloads
# (fewer dirty pages = faster evictions when needed)
echo 5 > /proc/sys/vm/dirty_background_ratio
echo 10 > /proc/sys/vm/dirty_ratio

# Example: Increase for throughput-oriented workloads
# (more coalescing, fewer total writes)
echo 20 > /proc/sys/vm/dirty_background_ratio
echo 40 > /proc/sys/vm/dirty_ratio

# Persistent configuration in /etc/sysctl.conf:
# vm.dirty_background_ratio = 5
# vm.dirty_ratio = 10
```

How and when dirty pages are written back involves fundamental tradeoffs between performance, reliability, and resource usage.
Policy 1: Write-Back (Lazy Write)
Defer writes until necessary:
Characteristics:
- Page modified in memory only
- Dirty bit set, but no I/O
- Written only when:
a. Page is being evicted, OR
b. File is synced (fsync), OR
c. Dirty time limit exceeded, OR
d. System is shutting down
Advantages:
+ Fewer total writes (coalescing)
+ Better performance for repeated modifications
+ Reduced disk wear (especially SSDs)
Disadvantages:
- Data loss window during crashes
- More dirty pages in memory
- Higher eviction latency possible
Policy 2: Write-Through (Immediate Write)
Write immediately on every modification:
Characteristics:
- Every store operation writes to disk
- Or at least queues for immediate write
- Dirty bit cleared promptly
Advantages:
+ Minimal data loss window
+ Pages almost always clean
+ Instant eviction capability
Disadvantages:
- Very high I/O volume
- Poor performance for write-heavy workloads
- Excessive disk wear
Usage: Rare in general-purpose OS; used for specific
reliability-critical file systems/databases.
| Aspect | Write-Back | Write-Through |
|---|---|---|
| Write Latency | Memory-speed (~100ns) | Disk-speed (~10ms) |
| Consistency | Eventually consistent | Immediately consistent |
| Data Loss Risk | Up to dirty timeout | Minimal |
| Eviction Latency | May need write first | Always instant |
| I/O Volume | Low (coalesced) | High (every write) |
| Use Case | General computing | Critical databases |
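From user space, the closest analogue to these two policies is the choice of open() flags: a plain write() lands in the page cache (write-back behavior), while O_SYNC forces each write to stable storage before returning (write-through-like behavior). A minimal sketch with standard POSIX calls (file names illustrative; error handling omitted):

```c
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    /* Write-back: write() returns at memory speed; the page sits
     * dirty in the page cache until background flush or fsync(). */
    int wb = open("wb.log", O_WRONLY | O_CREAT, 0644);
    write(wb, "fast\n", 5);

    /* Write-through-like: O_SYNC makes each write() block until the
     * data reaches stable storage, so pages are promptly clean. */
    int wt = open("wt.log", O_WRONLY | O_CREAT | O_SYNC, 0644);
    write(wt, "durable\n", 8);

    close(wb);
    close(wt);
    return 0;
}
```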
Policy 3: Hybrid Approaches
Real systems use combinations:
Linux Default Behavior:
1. Writes to page cache (memory-speed)
2. Page marked dirty
3. After 30 seconds, background write
4. On memory pressure, accelerated write
5. fsync() forces immediate write
Database Systems (e.g., PostgreSQL; simplified sketch after this list):
1. Write to page cache (fast)
2. WAL record to disk (synchronous)
3. Dirty data pages written by checkpoint
4. Crash recovery replays WAL
Combines: fast writes + durability
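A simplified sketch of the WAL idea above (not PostgreSQL's actual code; wal_fd and the buffer arguments are illustrative). Only the small log record is forced to disk at commit time, while the large data page stays dirty in cache for a later checkpoint:

```c
#include <string.h>
#include <unistd.h>

int commit_transaction(int wal_fd, const char *record, size_t len,
                       char *cached_page, const char *new_contents) {
    /* 1. Append the small WAL record and force it to stable storage */
    if (write(wal_fd, record, len) != (ssize_t)len)
        return -1;
    if (fsync(wal_fd) != 0)  /* the durability point */
        return -1;

    /* 2. Update the data page in memory only; it becomes dirty and is
     * written later by the checkpointer. If we crash first, recovery
     * replays the WAL to reconstruct this change. */
    memcpy(cached_page, new_contents, strlen(new_contents) + 1);
    return 0;
}
```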
Ordered Write-Back:
For file system consistency, writes must be ordered:
Scenario: Create new file
1. Allocate inode
2. Initialize inode data
3. Add directory entry pointing to inode
Correct order: 1 → 2 → 3
If crash after 2: orphan inode (recoverable)
Wrong order: 3 → 1 → 2
If crash after 3: directory points to garbage
File systems (ext4, XFS) enforce ordering constraints
on which dirty pages can be written before others.
Applications requiring durability must call fsync() (or fdatasync()). Without it, data may exist only in dirty pages and be lost on crash. Databases, log systems, and any reliability-critical application must use fsync—but it's expensive. Strategic fsync placement (e.g., after transaction commit, not after every write) balances durability and performance.
The dirty bit mechanism has profound implications for system reliability and crash recovery. Understanding these is essential for building robust systems.
The Data Loss Window:
Time: ─────────────────────────────────────────────────────────►
│ │ │
write() page cleaned crash
(dirty=1) (dirty=0) (data lost?)
│ │ │
│←── Data at risk ──►│←── Data safe ────►│
Data written to dirty pages but not yet on disk is lost on crash.
Linux defaults:
- dirty_expire_centisecs = 3000 (30 seconds)
- Maximum data loss window without fsync: ~30 seconds
What Survives a Crash:
After unexpected power loss:
✓ Data on disk (including recently fsynced data)
✓ Data in battery-backed disk cache
✓ Data in persistent memory (NVDIMM)
✗ Data in volatile RAM (dirty pages)
✗ Data in a non-battery-backed disk write cache (if write caching is enabled!)
Modern Persistent Memory:
Intel Optane DC Persistent Memory and similar technologies change the game:
Traditional:
CPU ─► DRAM (volatile) ─► Disk (persistent)
Data loss window: dirty_expire time
With Persistent Memory:
CPU ─► PMEM (persistent) ─► Disk (backup)
Data loss window: ~0 (direct persistence)
PMEM access:
- Load/store instructions persist directly
- Must use proper ordering (CLWB, SFENCE)
- Dirty bits may be N/A (always persistent)
Kernel Panic and Dirty Pages:
On kernel panic:
1. All file systems are NOT unmounted cleanly
2. Dirty pages in memory are lost
3. Disk state may be inconsistent
Recovery process:
1. fsck / file system check on boot
2. Replay journal if journaled FS
3. Report lost files/data if unrecoverable
Mitigation:
- Frequent syncs reduce data at risk
- Reliable power (UPS) allows clean shutdown
- Hardware watchdog can catch hangs
A common pattern for atomic file updates: write new content to temp file, fsync temp file, rename temp to target, fsync containing directory. This ensures the rename either sees all-old or all-new content. Without the fsyncs, crash could leave a zero-length file (rename visible, data not flushed) or stale directory entries.
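Spelled out in standard POSIX calls, the pattern looks like this (a sketch: paths are illustrative, error handling is abbreviated, and fsync on a directory descriptor is Linux behavior):

```c
#include <fcntl.h>
#include <unistd.h>

int atomic_replace(const char *target, const char *buf, size_t len) {
    /* 1. Write new content to a temp file and flush its data */
    int fd = open("target.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    write(fd, buf, len);
    fsync(fd);
    close(fd);

    /* 2. Atomically switch the name: readers see all-old or all-new */
    if (rename("target.tmp", target) != 0)
        return -1;

    /* 3. Flush the directory so the rename itself survives a crash */
    int dir = open(".", O_RDONLY | O_DIRECTORY);
    fsync(dir);
    close(dir);
    return 0;
}
```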
Visibility into dirty page state is essential for performance tuning and debugging. Here's how to monitor and analyze dirty page behavior.
System-Wide Statistics:
$ cat /proc/meminfo | grep -i dirty
Dirty: 12584 kB # Currently dirty pages
Writeback: 128 kB # Currently being written
WritebackTmp: 0 kB # Writeback using temp pages
$ cat /proc/vmstat | grep -i dirty
nr_dirty 3146 # Number of dirty pages
nr_writeback 32 # Pages under writeback
nr_dirtied 1847362 # Total pages ever dirtied
nr_written 1840216 # Total pages written back
Per-Process Statistics:
# From /proc/[pid]/smaps
$ cat /proc/1234/smaps | grep -i dirty
Private_Dirty: 128 kB
Shared_Dirty: 0 kB
# Aggregate per process
$ grep -E "Private_Dirty|Shared_Dirty" /proc/1234/smaps | \
awk '{sum += $2} END {print sum " kB dirty"}'
```bash
#!/bin/bash
# Monitor dirty page statistics over time

echo "Time,Dirty_KB,Writeback_KB,Written_Pages"

while true; do
    # Get current stats
    dirty=$(grep "^Dirty:" /proc/meminfo | awk '{print $2}')
    writeback=$(grep "^Writeback:" /proc/meminfo | awk '{print $2}')
    written=$(grep "nr_written" /proc/vmstat | awk '{print $2}')

    echo "$(date +%H:%M:%S),$dirty,$writeback,$written"
    sleep 1
done

# Example output analysis:
# Steady low Dirty, low Writeback → System is keeping up
# Rising Dirty                    → Writes outpacing background flush
# High Writeback                  → Active flushing in progress
# Spiky Written                   → Batch writes completing

# To trigger writeback for testing:
# sync                               # Force all dirty pages to disk
# echo 3 > /proc/sys/vm/drop_caches  # Free cached pages (sync first)
```

Tracing Writeback Activity:
# Using ftrace to trace writeback events
$ echo 1 > /sys/kernel/debug/tracing/events/writeback/enable
$ cat /sys/kernel/debug/tracing/trace_pipe
# Sample output:
flush-8:0-1234 [001] .... 1234.567: writeback_start: ...
flush-8:0-1234 [001] .... 1234.890: writeback_written: nr=256
flush-8:0-1234 [001] .... 1235.123: writeback_wait: ...
Common Issues and Symptoms:
| Symptom | Likely Cause | Investigation |
|---|---|---|
| Very high Dirty count | Not flushing fast enough | Check dirty_ratio, I/O capacity |
| Spiky application latency | Hitting dirty_ratio, synchronous flush | Lower dirty_ratio |
| High Writeback, low throughput | I/O subsystem saturated | Check iostat, consider faster storage |
| Dirty count never drops | Constant writes exceeding flush rate | Reduce write rate or increase I/O capacity |
| Data loss after crash | Dirty pages lost | Use fsync for important data |
For latency-sensitive workloads: lower dirty_ratio (5-10%) ensures evictions rarely need synchronous writes. For throughput-oriented workloads: higher dirty_ratio (30-40%) allows more coalescing but risks latency spikes. Profile your specific workload to find the optimal balance.
We've thoroughly explored the dirty bit: a single piece of hardware support with far-reaching implications.
What's Next:
The dirty bit is one of two crucial hardware-supported bits for page replacement. The next page explores the reference bit (also known as the accessed bit), which serves a different but complementary purpose: tracking whether pages have been recently accessed to guide victim selection based on recency.
You now have a comprehensive understanding of the dirty bit—from its hardware mechanism to its role in page replacement, write-back policies, reliability implications, and system monitoring. This knowledge is essential for understanding how operating systems efficiently manage memory while balancing performance and data integrity.