When a transaction modifies data—inserting a row, updating a value, or deleting a record—the change happens in the buffer pool, not directly on disk. The modified page becomes dirty: its in-memory contents differ from the persistent copy on disk.
Dirty pages represent a fundamental tension in database design. They enable high-performance modifications (a memory write is orders of magnitude faster than a random disk write), but they also represent uncommitted or unpersisted data that could be lost in a crash. Managing dirty pages correctly is essential for achieving both performance and durability.
By the end of this page, you will understand how databases track dirty pages, the various flushing strategies and their trade-offs, the relationship between dirty pages and Write-Ahead Logging, and how production systems balance performance against recovery time objectives.
A page becomes dirty when its in-memory contents are modified after being read from disk. Any operation that changes data on the page sets the dirty bit in the frame descriptor:
Operations that dirty a page include inserting a row, updating a value in place, deleting a record, and the index maintenance those changes trigger.

```cpp
// Dirty bit management in the buffer manager
class BufferPool {
public:
    // Called by the execution engine after modifying a page
    void markDirty(PageId page_id) {
        lock_guard<mutex> guard(pool_latch);
        auto it = page_table.find(page_id);
        if (it == page_table.end()) {
            throw PageNotFoundException("Page not in buffer pool");
        }
        FrameId frame_id = it->second;
        FrameDescriptor& desc = descriptors[frame_id];

        // Set the dirty bit
        desc.is_dirty = true;

        // Record the LSN of the modification for WAL ordering
        // This comes from the log record written before the modification
        desc.page_lsn = current_lsn;

        // Add to dirty page tracking (for checkpoint/background writer)
        dirty_page_list.insert(page_id);
    }

    // Check if a page is dirty
    bool isDirty(PageId page_id) {
        lock_guard<mutex> guard(pool_latch);
        auto it = page_table.find(page_id);
        if (it == page_table.end()) return false;
        return descriptors[it->second].is_dirty;
    }

    // Clear dirty bit after successful flush to disk
    void markClean(PageId page_id) {
        lock_guard<mutex> guard(pool_latch);
        auto it = page_table.find(page_id);
        if (it == page_table.end()) return;
        FrameId frame_id = it->second;
        descriptors[frame_id].is_dirty = false;
        dirty_page_list.erase(page_id);
    }
};
```

The dirty bit lifecycle:
clean (dirty bit false) → modified (true) → flushed (false) → modified again (true)

A page's dirty bit may toggle multiple times between disk reads. The key invariant: if the dirty bit is true, the disk copy is stale.
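The lifecycle above can be sketched with a minimal frame structure (a hypothetical `Frame` with `modify`/`flush` helpers, not a real buffer manager):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical minimal frame: just the dirty bit and the page LSN.
struct Frame {
    uint64_t page_lsn = 0;   // LSN of the most recent modification
    bool is_dirty = false;   // true => the disk copy is stale
};

// Any modification sets the dirty bit and records the modification's LSN.
void modify(Frame& f, uint64_t lsn) {
    f.is_dirty = true;
    f.page_lsn = lsn;
}

// A successful flush (actual disk write elided) is the only thing that clears it.
void flush(Frame& f) {
    f.is_dirty = false;
}
```

Note that `flush` clears the bit without touching `page_lsn`: the LSN keeps advancing across flush cycles, which is what the WAL rule described later checks against.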
The dirty bit indicates that a page differs from disk. It does NOT indicate whether the changes are committed or durable. Durability comes from Write-Ahead Logging: changes are durable once their log records are on stable storage, regardless of whether the data pages are flushed.
Beyond per-frame dirty bits, databases maintain a Dirty Page Table (DPT)—a centralized data structure tracking all dirty pages in the buffer pool. The DPT serves multiple purposes:
```cpp
// Dirty Page Table structure
struct DirtyPageEntry {
    PageId page_id;               // The dirty page's identifier
    LSN recovery_lsn;             // LSN of the first modification since last
                                  // flush; called "recLSN" in ARIES terminology
    Timestamp first_dirtied;      // When the page first became dirty
    size_t modification_count;    // Number of modifications (optional metric)
};

class DirtyPageTable {
private:
    // Map from page_id to dirty page information
    unordered_map<PageId, DirtyPageEntry> entries;
    mutex dpt_latch;

public:
    // Add or update a dirty page entry
    void addDirtyPage(PageId page_id, LSN lsn) {
        lock_guard<mutex> guard(dpt_latch);
        auto it = entries.find(page_id);
        if (it == entries.end()) {
            // First modification since last flush
            DirtyPageEntry entry;
            entry.page_id = page_id;
            entry.recovery_lsn = lsn;   // RecLSN is the first modifying LSN
            entry.first_dirtied = now();
            entry.modification_count = 1;
            entries[page_id] = entry;
        } else {
            // Page already dirty - only update modification count
            // RecLSN stays as the FIRST modification's LSN
            it->second.modification_count++;
        }
    }

    // Remove page from DPT after successful flush
    void removeDirtyPage(PageId page_id) {
        lock_guard<mutex> guard(dpt_latch);
        entries.erase(page_id);
    }

    // Get the minimum recLSN across all dirty pages
    // This is the oldest log position needed for recovery
    LSN getMinRecoveryLSN() {
        lock_guard<mutex> guard(dpt_latch);
        LSN min_lsn = MAX_LSN;
        for (const auto& [page_id, entry] : entries) {
            if (entry.recovery_lsn < min_lsn) {
                min_lsn = entry.recovery_lsn;
            }
        }
        return min_lsn;
    }

    // Get pages to flush, sorted by oldest first
    vector<PageId> getPagesToFlush(size_t limit) {
        lock_guard<mutex> guard(dpt_latch);
        vector<pair<Timestamp, PageId>> candidates;
        for (const auto& [page_id, entry] : entries) {
            candidates.push_back({entry.first_dirtied, page_id});
        }
        // Sort by first_dirtied time (oldest first)
        sort(candidates.begin(), candidates.end());

        vector<PageId> result;
        for (size_t i = 0; i < min(limit, candidates.size()); i++) {
            result.push_back(candidates[i].second);
        }
        return result;
    }
};
```

The recLSN concept:
Each dirty page entry includes a recLSN (recovery LSN)—the LSN of the first log record that dirtied the page since its last flush. This is critical for recovery: the redo phase must begin its log scan at the minimum recLSN across all dirty pages, because no log record older than a page's recLSN can describe a change missing from that page's disk copy.
This is why frequent dirty page flushing reduces recovery time: it advances the minimum recLSN, shortening the log scan required during crash recovery.
During a checkpoint, the database writes the current Dirty Page Table to the log. This snapshot tells recovery exactly which pages were dirty at checkpoint time, eliminating the need to scan the entire log to reconstruct this information.
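The effect of flushing on the recovery point can be seen in a small sketch (a hypothetical map-based table; a real DPT also tracks timestamps and modification counts, as above):

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <unordered_map>

using PageId = std::uint64_t;
using LSN    = std::uint64_t;

// Minimum recLSN across all dirty pages: the oldest log position that crash
// recovery's redo scan must start from. An empty table yields MAX_LSN,
// meaning no redo work is pending.
LSN minRecoveryLSN(const std::unordered_map<PageId, LSN>& dirty_pages) {
    LSN min_lsn = std::numeric_limits<LSN>::max();
    for (const auto& [page_id, rec_lsn] : dirty_pages) {
        if (rec_lsn < min_lsn) min_lsn = rec_lsn;
    }
    return min_lsn;
}
```

Flushing the page that holds the minimum recLSN (and erasing its entry) immediately advances the recovery start point and allows the log before it to be truncated.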
Dirty pages must eventually be written to disk. The timing and manner of these writes significantly impact performance, recovery time, and system stability. Databases employ various flushing strategies to balance these concerns.
| Strategy | Description | Trade-offs |
|---|---|---|
| On-demand (eviction) | Flush dirty page only when evicting | Good throughput, unpredictable latency |
| Periodic background | Background thread flushes pages at regular intervals | Smooth I/O, may not keep up with write rate |
| Threshold-based | Flush when dirty page count exceeds threshold | Prevents accumulation, burst I/O possible |
| Oldest-first (age-based) | Prioritize flushing pages dirty longest | Minimizes recovery time, advances log truncation |
| Checkpoint-driven | Flush all dirty pages during checkpoint | Simplest, causes I/O spikes |
Modern databases use a combination of strategies:
```cpp
// Background writer (page cleaner) implementation
class BackgroundWriter {
private:
    BufferPool* buffer_pool;
    DirtyPageTable* dirty_page_table;
    DiskManager* disk_manager;

    // Configuration
    Duration flush_interval = 100ms;
    size_t pages_per_flush = 64;
    double dirty_ratio_high = 0.75;  // Trigger aggressive flushing
    double dirty_ratio_low = 0.50;   // Normal flushing rate

    atomic<bool> running;

public:
    void writerLoop() {
        while (running) {
            // Calculate current dirty ratio
            double dirty_ratio = getDirtyPageRatio();

            // Determine how many pages to flush this cycle
            size_t pages_to_flush;
            if (dirty_ratio > dirty_ratio_high) {
                // Aggressive flushing: 2x normal rate
                pages_to_flush = pages_per_flush * 2;
            } else if (dirty_ratio > dirty_ratio_low) {
                // Normal flushing rate
                pages_to_flush = pages_per_flush;
            } else {
                // Low dirty ratio: minimal flushing
                pages_to_flush = pages_per_flush / 4;
            }

            // Get oldest dirty pages to flush
            vector<PageId> to_flush =
                dirty_page_table->getPagesToFlush(pages_to_flush);

            // Flush each page
            for (PageId page_id : to_flush) {
                flushPage(page_id);
            }

            // Sleep until next cycle
            this_thread::sleep_for(flush_interval);
        }
    }

    void flushPage(PageId page_id) {
        // Get the page from buffer pool
        Page* page = buffer_pool->getPageForFlush(page_id);
        if (page == nullptr) return;  // Page was evicted

        // Ensure WAL records are flushed first
        LSN page_lsn = buffer_pool->getPageLSN(page_id);
        log_manager->flushToLSN(page_lsn);

        // Write to disk
        disk_manager->writePage(page_id, page);

        // Mark page as clean
        buffer_pool->markClean(page_id);
        dirty_page_table->removeDirtyPage(page_id);
    }
};
```

Dirty pages cannot be flushed to disk arbitrarily. The Write-Ahead Logging (WAL) protocol imposes a critical constraint:
Before a dirty page can be written to disk, all log records that describe modifications to that page must first be written to stable storage.
This constraint ensures that if a crash occurs immediately after the page write, recovery can redo the modifications by reading the log. Without WAL, a partial page write could leave the database in an unrecoverable state.
```cpp
// Enforcing the WAL constraint during page flush
void BufferPool::flushDirtyPage(FrameId frame_id) {
    FrameDescriptor& desc = descriptors[frame_id];

    if (!desc.is_dirty) {
        return;  // Nothing to flush
    }

    // CRITICAL: WAL constraint enforcement
    // The page's LSN records the LSN of the most recent modification
    LSN page_lsn = desc.page_lsn;

    // Ensure all log records up to and including page_lsn are on disk
    // This may trigger a log flush if records are still in the log buffer
    log_manager->flushToLSN(page_lsn);

    // Now safe to write the page to disk
    // Even if a crash occurs, recovery can redo from the flushed log
    disk_manager->writePage(desc.page_id, &frames[frame_id]);

    // Update state
    desc.is_dirty = false;
    dirty_page_table->removeDirtyPage(desc.page_id);
}

// Log manager's flushToLSN implementation
void LogManager::flushToLSN(LSN target_lsn) {
    lock_guard<mutex> guard(log_latch);

    // If already flushed to this LSN, nothing to do
    if (target_lsn <= flushed_lsn) {
        return;
    }

    // Flush log buffer to disk up to target_lsn
    size_t bytes_to_flush =
        lsnToBufferOffset(target_lsn) - lsnToBufferOffset(flushed_lsn);

    // Write to log file with fsync for durability
    logFile->write(log_buffer + lsnToBufferOffset(flushed_lsn), bytes_to_flush);
    logFile->fsync();  // Ensure data reaches disk

    flushed_lsn = target_lsn;
}
```

Why WAL is essential:
Consider what happens without WAL: a page holding committed changes might never reach disk before a crash, leaving no record of those changes anywhere; conversely, a page holding uncommitted changes might reach disk with no way to roll them back. Either failure leaves the database in an unrecoverable state.

With WAL: every modification's log record reaches stable storage before the page does, so recovery can always redo committed changes that were lost and undo uncommitted changes that were flushed.
The WAL constraint means that dirty page writes depend on log writes. If the log is slow (e.g., writing to a slow disk), page flushing is also slow. This is why production databases place logs on the fastest available storage, often separate from data files.
Database systems make two fundamental policy decisions about dirty page management:
STEAL Policy: Can the buffer manager flush ("steal") a dirty page modified by an uncommitted transaction?
FORCE Policy: Must all dirty pages modified by a transaction be forced to disk before the transaction commits?
| Policy | Meaning | Requirements |
|---|---|---|
| STEAL | Can flush uncommitted transaction's pages | Requires UNDO capability in recovery |
| NO-STEAL | Cannot flush uncommitted transaction's pages | No UNDO needed; may run out of buffer space |
| FORCE | Must flush all transaction's pages before commit | No REDO needed; high commit latency |
| NO-FORCE | Need not flush pages before commit | Requires REDO capability in recovery |
The four combinations:
STEAL + FORCE: Allows stealing, requires forcing. Needs UNDO (for stolen pages of aborted transactions). No REDO needed (all committed changes are on disk at commit).
NO-STEAL + FORCE: No stealing, requires forcing. No UNDO, no REDO needed. Simple recovery but poor performance.
STEAL + NO-FORCE: Most common in production systems. Allows stealing, doesn't require forcing. Needs both UNDO and REDO. Best performance.
NO-STEAL + NO-FORCE: No stealing, doesn't require forcing. Needs REDO but not UNDO. Limited by buffer space.
STEAL allows the buffer manager to evict any unpinned page, preventing buffer pool exhaustion. NO-FORCE allows commits to complete without waiting for I/O. This combination maximizes flexibility and performance. The cost is recovery complexity: the recovery algorithm (like ARIES) must handle both UNDO and REDO.
```cpp
// STEAL + NO-FORCE implementation

// STEAL: When evicting, we can flush uncommitted pages
void BufferPool::evictPageWithSteal(FrameId frame_id) {
    FrameDescriptor& desc = descriptors[frame_id];
    if (desc.is_dirty) {
        // STEAL policy: we CAN flush uncommitted changes
        // WAL ensures we can undo if the transaction aborts
        flushDirtyPage(frame_id);
    }
    // Continue with eviction...
}

// NO-FORCE: Commit does NOT wait for page flushes
void TransactionManager::commitTransaction(TxnId txn_id) {
    // Step 1: Write COMMIT log record
    LSN commit_lsn = log_manager->writeCommitRecord(txn_id);

    // Step 2: Flush log up to commit record
    // This is the durability guarantee - log is on stable storage
    log_manager->flushToLSN(commit_lsn);

    // Step 3: NO-FORCE - we do NOT flush dirty data pages
    // The log flush guarantees durability; data pages can flush later

    // Step 4: Release locks, cleanup transaction state
    lock_manager->releaseAllLocks(txn_id);
    active_transactions.erase(txn_id);

    // Dirty pages will be flushed by the background writer eventually
    // Recovery will REDO any committed changes not yet on disk
}
```

The dirty page ratio—the proportion of buffer pool pages that are dirty—is a critical metric for database performance and stability. Both extremes cause problems: when the ratio is too high, evictions stall waiting for flushes, checkpoints produce I/O spikes, and recovery time grows; when it is too low, pages are flushed after almost every modification, forfeiting write coalescing and wasting I/O bandwidth.
Optimal dirty page management:
The ideal approach balances these concerns:
Target ratio: Maintain a dirty page ratio that provides write coalescing benefits without risking eviction stalls. Common targets are 25-50%.
Flush prioritization: Prioritize flushing the oldest dirty pages to advance the recovery point and allow log truncation.
Adaptive flushing: Increase flush rate when dirty ratio exceeds threshold; decrease when it's low.
Write coalescing: Allow multiple modifications to the same page before flushing, reducing total I/O.
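The benefit of coalescing can be quantified with a toy model (hypothetical counters, assuming each page flush costs one disk write):

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

using PageId = unsigned;

// Write-through: every modification is written to disk immediately.
std::size_t writesWithoutCoalescing(const std::vector<PageId>& mods) {
    return mods.size();
}

// Deferred flushing: a page absorbs any number of modifications in memory
// and costs a single disk write when it is finally flushed.
std::size_t writesWithCoalescing(const std::vector<PageId>& mods) {
    std::unordered_set<PageId> dirty(mods.begin(), mods.end());
    return dirty.size();
}
```

A hot page touched a thousand times between flushes costs one write instead of a thousand, which is exactly why a very low dirty ratio wastes I/O bandwidth.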
In PostgreSQL, monitor pg_stat_bgwriter.buffers_backend (pages flushed by backends, indicating background writer isn't keeping up) and pg_stat_bgwriter.buffers_clean (pages flushed by background writer). In MySQL, check Innodb_buffer_pool_pages_dirty and Innodb_buffer_pool_wait_free (waits for free pages).
| Database | Parameter | Purpose |
|---|---|---|
| PostgreSQL | bgwriter_lru_maxpages | Max pages written per bgwriter round |
| PostgreSQL | bgwriter_delay | Milliseconds between bgwriter rounds |
| MySQL | innodb_max_dirty_pages_pct | Target max dirty page percentage |
| MySQL | innodb_io_capacity | I/O operations per second for flushing |
| Oracle | DB_WRITER_PROCESSES | Number of database writer processes |
Dirty page management directly impacts crash recovery performance. The relationship is straightforward: more unflushed dirty pages means more work during recovery.
How dirty pages affect recovery:
Analysis phase: Reconstructs the Dirty Page Table to identify pages that need redo
Redo phase: Must replay log records for all dirty pages since the oldest recLSN
Undo phase: Uses the log to undo uncommitted transactions whose pages were stolen
The key insight: flushing dirty pages and checkpointing both advance the recovery start point. After a checkpoint, recovery only needs to scan log records written since the checkpoint.
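As a back-of-envelope illustration (hypothetical throughput assumptions: roughly 100 MB/s log scan and 10,000 pages/s redo):

```cpp
#include <cassert>
#include <cstdint>

// Rough recovery-time estimate: time to scan the log from the minimum
// recLSN plus time to redo the dirty pages. Both rates are assumptions
// to be replaced with measured numbers for real hardware.
double estimateRecoverySeconds(std::uint64_t log_bytes_since_min_reclsn,
                               std::uint64_t dirty_pages) {
    const double log_scan_bytes_per_sec = 100.0 * 1024 * 1024;
    const double redo_pages_per_sec     = 10000.0;
    return log_bytes_since_min_reclsn / log_scan_bytes_per_sec
         + dirty_pages / redo_pages_per_sec;
}
```

A gigabyte of log since the oldest recLSN plus 50,000 dirty pages already puts the estimate near 15 seconds; keeping the oldest recLSN fresh is what keeps this number inside an RTO.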
```cpp
// How dirty pages affect recovery time
class RecoveryManager {
    // Analyze dirty page impact on recovery
    RecoveryEstimate estimateRecoveryTime(Checkpoint checkpoint) {
        RecoveryEstimate estimate;

        // Get the dirty page table from the checkpoint
        DirtyPageTable dpt = checkpoint.dirty_page_table;

        // Find the oldest recLSN - this is where redo starts
        LSN oldest_rec_lsn = dpt.getMinRecoveryLSN();

        // Calculate log bytes to scan
        LSN current_lsn = log_manager->getCurrentLSN();
        size_t log_bytes_to_scan = current_lsn - oldest_rec_lsn;

        // Estimate number of pages to redo
        size_t pages_to_redo = dpt.count();

        // Estimate recovery time
        // Log scan rate: ~100 MB/s
        // Page redo rate: ~10,000 pages/s
        // (use floating-point divisors to avoid integer truncation)
        estimate.log_scan_time = log_bytes_to_scan / (100.0 * 1024 * 1024);
        estimate.redo_time = pages_to_redo / 10000.0;
        estimate.total_time = estimate.log_scan_time + estimate.redo_time;

        return estimate;
    }

    // Recommendation: Keep dirty page age bounded
    Duration getOldestDirtyPageAge() {
        vector<DirtyPageEntry> entries = dirty_page_table.getAllEntries();
        Timestamp oldest = now();
        for (const auto& entry : entries) {
            if (entry.first_dirtied < oldest) {
                oldest = entry.first_dirtied;
            }
        }
        return now() - oldest;
    }

    // Alert if dirty pages are too old
    void checkDirtyPageAge() {
        Duration max_age = 5min;  // Target: no page dirty for > 5 minutes
        Duration oldest = getOldestDirtyPageAge();
        if (oldest > max_age) {
            log_warning("Oldest dirty page is {} old; recovery time is elevated",
                        oldest);
        }
    }
};
```

If your RTO is 5 minutes, your dirty page management must ensure recovery can complete in that time. This often means aggressive background flushing and frequent checkpoints. There's a direct trade-off between runtime performance (less I/O) and recovery speed (more I/O).
Dirty page management is central to database performance and reliability. Proper handling ensures efficient use of the buffer pool while maintaining durability and enabling fast recovery.
What's next:
We've examined pages and their states. The next page explores the Buffer Manager: the component that orchestrates all these operations, providing the interface between query execution and the buffer pool.
You now understand dirty page management: tracking, flushing, WAL constraints, and performance implications. This knowledge is essential for database tuning and understanding recovery behavior.