Database Management SystemsAnalysis Phase

ARIES Analysis Phase: Reconstructing System State

LevelAdvanced

Duration75 mins

TopicAnalysis Phase

1 / 5

Dirty Page Table: Tracking Modified Pages for Recovery

The Foundation of Intelligent Recovery

When a database system crashes, it faces a daunting challenge: millions of log records may exist, but which operations actually need to be redone? Blindly replaying every logged operation from the beginning of time would be catastrophically inefficient—potentially taking hours or days on systems with years of operational history.

The Dirty Page Table (DPT) solves this problem elegantly. It maintains runtime knowledge of which database pages have been modified but not yet written to stable storage, enabling the recovery system to determine precisely where redo processing must begin. This page explores the DPT in exhaustive technical detail—its structure, maintenance protocol, checkpointing integration, and critical role in the ARIES recovery algorithm.

What You Will Learn

By the end of this page, you will understand the complete architecture of the Dirty Page Table, how it's maintained during normal operations, how it survives checkpoints, and how the analysis phase uses it to establish the recovery starting point. You'll be able to trace DPT operations through concrete scenarios and explain why it's fundamental to efficient crash recovery.

Understanding Dirty Pages

Before examining the Dirty Page Table itself, we must establish a precise understanding of what constitutes a dirty page in database systems. This concept is fundamental to buffer management, write-ahead logging, and crash recovery.

Definition: A dirty page is a database page that resides in the buffer pool (main memory) and has been modified since it was last read from disk (or written to disk). The term 'dirty' signifies that the in-memory version differs from the on-disk version.

This divergence is intentional and beneficial for performance. If the buffer manager wrote every modification immediately to disk, the system would be I/O-bound and painfully slow. Instead, modifications accumulate in memory, and pages are written to disk:

When the buffer manager needs space for new pages (page replacement)
During checkpoint operations
When transactions commit (in force-at-commit policies)
During periodic background flushing

Page States in the Buffer Pool
State	In Memory	In-Memory = On-Disk	Needs Write-Back
Clean	Yes	Yes (identical)	No
Dirty	Yes	No (modified)	Yes
Not Resident	No	N/A	N/A
Pinned & Clean	Yes (locked)	Yes	No
Pinned & Dirty	Yes (locked)	No	Yes (after unpin)

The Durability Challenge

Dirty pages represent uncommitted or committed-but-not-yet-persisted data. If the system crashes while pages are dirty, their modifications could be lost. The Write-Ahead Logging (WAL) protocol ensures that log records describing these modifications are written to stable storage before the transaction commits, enabling recovery to reconstruct any lost modifications.

Why Dirty Pages Matter for Recovery:

During crash recovery, the system must:

Redo operations whose effects may not have reached disk (dirty pages that weren't flushed)
Undo operations from transactions that didn't commit

The challenge is efficiency. A database might have:

1 billion log records spanning 5 years of operation
10,000 dirty pages at crash time
Only 100,000 log records relevant to those dirty pages

Without tracking dirty pages, recovery would need to scan the entire log. With the Dirty Page Table, recovery starts from the earliest relevant log record—the one that first dirtied a page still dirty at crash time.

Dirty Page Table Architecture

The Dirty Page Table is an in-memory data structure maintained by the buffer manager. Its design balances fast lookup and update operations with the information needed for crash recovery.

Core Structure:

The DPT is typically implemented as a hash table mapping page identifiers to their recovery-relevant metadata. Each entry contains critical information about when and how the page became dirty.

dirty_page_table.h
C++
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
// Conceptual structure of a Dirty Page Table entry
struct DirtyPageEntry {
    PageID    page_id;        // Unique identifier: (table_space_id, page_number)
    LSN       recovery_lsn;   // LSN of the FIRST log record that dirtied this page
                              // since it was last written to disk
    Timestamp first_dirty_time; // Wall-clock time when page became dirty
    bool      is_flushing;    // True if write is currently in progress
};
 
// The Dirty Page Table itself
class DirtyPageTable {
private:
    HashMap<PageID, DirtyPageEntry> entries;
    Mutex                           table_lock;
    
public:
    // Record that a page has been modified
    void markDirty(PageID pid, LSN log_lsn);
    
    // Record that a page has been written to disk
    void markClean(PageID pid);
    
    // Get the minimum Recovery LSN across all dirty pages
    LSN getMinRecoveryLSN();
    
    // Create checkpoint snapshot
    vector<DirtyPageEntry> snapshot();
    
    // Check if a page is dirty
    bool isDirty(PageID pid);
};

Key Fields Explained:

Page ID: The unique identifier of the database page. In most systems, this is a tuple of (tablespace or database file ID, page number within that file). Page IDs must be globally unique across the entire database.

Recovery LSN (recLSN): This is the most critical field for crash recovery. It records the Log Sequence Number of the first log record that modified this page since it was last flushed to disk. This LSN tells the recovery system: "If you want to reconstruct this page's current state, you must start redoing operations from this log position."

First Dirty Time: A timestamp recording when the page became dirty. This is useful for monitoring (identifying pages that have been dirty too long) and for certain flushing strategies.

Is Flushing: A flag indicating that a disk write is currently in progress. This prevents the entry from being removed prematurely and handles concurrent operations correctly.

Why Track the First Dirtying LSN?

Consider a page modified by log records at LSN 100, 150, and 200. If only LSN 200 were tracked, recovery might skip LSN 100 and 150, leaving the page in an inconsistent state. By tracking the first modification (LSN 100), ARIES guarantees that all subsequent modifications to this page will also be replayed, since they have higher LSNs.

DPT Operations During Normal Processing

The Dirty Page Table is continuously updated during normal database operations. Every modification to a buffer pool page and every page write-back affects the DPT. Understanding these operations is essential for grasping how the DPT stays accurate.

Operation 1: Marking a Page Dirty

When a transaction modifies data on a page, the buffer manager must update the DPT. The critical invariant is that the recLSN only captures the first modification—subsequent modifications to an already-dirty page do not update the recLSN.

mark_dirty_operation.cpp
C++
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
void DirtyPageTable::markDirty(PageID pid, LSN log_lsn) {
    lock_guard<Mutex> lock(table_lock);
    
    auto it = entries.find(pid);
    if (it == entries.end()) {
        // Page is not currently dirty - create new entry
        DirtyPageEntry entry;
        entry.page_id = pid;
        entry.recovery_lsn = log_lsn;  // This is the first dirtying LSN
        entry.first_dirty_time = getCurrentTime();
        entry.is_flushing = false;
        
        entries[pid] = entry;
        
        log_debug("Page %d marked dirty with recLSN=%d", pid, log_lsn);
    }
    // If already dirty, do NOT update recLSN - we need the earliest one
    // This is a critical invariant of ARIES
}

Operation 2: Marking a Page Clean

When the buffer manager successfully writes a dirty page to disk, that page becomes clean. The DPT entry must be removed because the on-disk version is now current.

However, there's a subtle concurrency issue: what if a transaction modifies the page while the write is in progress? ARIES handles this by checking whether new modifications occurred during the flush.

mark_clean_operation.cpp
C++
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
void DirtyPageTable::markClean(PageID pid, LSN written_page_lsn) {
    lock_guard<Mutex> lock(table_lock);
    
    auto it = entries.find(pid);
    if (it == entries.end()) {
        return;  // Page wasn't tracked as dirty
    }
    
    // Critical check: Was the page modified DURING the flush?
    // The page header contains a pageLSN indicating the last modification
    // If pageLSN > written_page_lsn, new modifications occurred during write
    
    BufferPoolPage* page = buffer_pool.getPage(pid);
    if (page->getPageLSN() > written_page_lsn) {
        // Page was modified during flush - update recLSN to reflect
        // that only modifications up to written_page_lsn are on disk
        // Actually, the page is still dirty with modifications after flush
        it->second.recovery_lsn = written_page_lsn + 1; // Simplification
        it->second.is_flushing = false;
        log_debug("Page %d modified during flush, remains dirty", pid);
    } else {
        // Page is truly clean - remove from DPT
        entries.erase(it);
        log_debug("Page %d marked clean, removed from DPT", pid);
    }
}

Concurrency Considerations

In high-concurrency systems, pages may be modified by one transaction while being flushed to disk. The page's LSN (stored in the page header) provides the synchronization mechanism—it definitively answers whether the flushed version is current. Some implementations use latches or more sophisticated synchronization to handle this.

Operation 3: Finding the Minimum Recovery LSN

One of the most important operations on the DPT is finding the minimum recLSN across all dirty pages. This value determines where redo processing must begin—it's the earliest log position that might need to be replayed.

min_recovery_lsn.cpp
C++
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
LSN DirtyPageTable::getMinRecoveryLSN() {
    lock_guard<Mutex> lock(table_lock);
    
    if (entries.empty()) {
        return INVALID_LSN;  // No dirty pages - no redo needed from DPT
    }
    
    LSN min_lsn = MAX_LSN;
    for (const auto& [pid, entry] : entries) {
        if (entry.recovery_lsn < min_lsn) {
            min_lsn = entry.recovery_lsn;
        }
    }
    
    return min_lsn;
}
 
// Example: Determining the redo start point
// At checkpoint time:
//   DPT contains: Page 10 (recLSN=500), Page 25 (recLSN=350), Page 42 (recLSN=600)
//   Min recLSN = 350
//   Recovery will start redo from LSN 350 (or earlier if transaction table indicates)

Checkpoint Integration

The Dirty Page Table plays a crucial role in checkpointing—the periodic operation that bounds recovery time by recording system state to the log. During a checkpoint, the DPT is captured to stable storage, enabling recovery to reconstruct the DPT state at checkpoint time.

Checkpoint Types and DPT:

Modern database systems typically use fuzzy checkpoints (also called non-quiescent checkpoints), which don't halt transaction processing. The DPT snapshot taken during a fuzzy checkpoint reflects a moment-in-time view of which pages were dirty.

Checkpoint Types and Their DPT Handling
Checkpoint Type	Transaction Blocking	DPT Accuracy	Recovery Complexity
Quiescent	Full blocking	Perfect (no concurrent mods)	Simplest
Transaction-Consistent	No new transactions start	Consistent snapshot	Moderate
Fuzzy (ARIES)	None	Point-in-time snapshot	Requires analysis phase
Non-Blocking Fuzzy	None	Approximate	Most complex

Writing DPT to Checkpoint:

During a fuzzy checkpoint, ARIES writes two key data structures to the log:

Transaction Table - Active transactions and their states
Dirty Page Table - All dirty pages and their recLSNs

The checkpoint record structure includes the complete DPT snapshot:

checkpoint_dpt.cpp
C++
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
struct CheckpointRecord {
    LSN                    checkpoint_lsn;
    vector<TransactionEntry> transaction_table;
    vector<DirtyPageEntry>   dirty_page_table;  // Complete DPT snapshot
    LSN                    master_record_ptr;   // Points to this checkpoint
};
 
void RecoveryManager::performFuzzyCheckpoint() {
    // Step 1: Log BEGIN_CHECKPOINT
    LSN begin_ckpt_lsn = log_manager.appendLog(BEGIN_CHECKPOINT, {});
    
    // Step 2: Take consistent snapshot of DPT
    // This should be atomic or use latching to ensure consistency
    vector<DirtyPageEntry> dpt_snapshot = dirty_page_table.snapshot();
    
    // Step 3: Take snapshot of transaction table
    vector<TransactionEntry> txn_snapshot = txn_manager.snapshot();
    
    // Step 4: Log END_CHECKPOINT with both tables
    LSN end_ckpt_lsn = log_manager.appendLog(END_CHECKPOINT, {
        .transaction_table = txn_snapshot,
        .dirty_page_table = dpt_snapshot
    });
    
    // Step 5: Update master record to point to this checkpoint
    // The master record is stored in a known location on disk
    updateMasterRecord(begin_ckpt_lsn);
    
    log_info("Checkpoint complete: begin=%d, end=%d, DPT entries=%d",
             begin_ckpt_lsn, end_ckpt_lsn, dpt_snapshot.size());
}

Master Record Purpose

The master record is a fixed location on disk that always points to the most recent successful checkpoint. During recovery, the system reads the master record first to locate the checkpoint, then reads the checkpoint to find the DPT and transaction table snapshots. This avoids scanning the entire log to find the latest checkpoint.

Why the DPT Snapshot Matters:

Without the DPT snapshot in the checkpoint, recovery would need to scan the log from the beginning to determine which pages might be dirty. With the snapshot:

Recovery starts with the DPT state at checkpoint time
Analysis phase updates this DPT by scanning log records after the checkpoint
Redo processing knows exactly where to start (minimum recLSN)

This drastically reduces recovery time—from potentially hours to minutes or seconds.

DPT Reconstruction During Analysis

During crash recovery, the Analysis Phase reconstructs the DPT to reflect its state at crash time. This process involves:

Loading the DPT snapshot from the last checkpoint
Scanning log records from the checkpoint forward
Updating the DPT based on logged operations

The reconstructed DPT tells the Redo Phase exactly where to begin processing and which pages need attention.

analysis_phase_dpt.cpp
C++
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
void AnalysisPhase::reconstructDPT() {
    // Step 1: Initialize DPT from checkpoint
    // Read the end-checkpoint record to get the DPT snapshot
    CheckpointRecord ckpt = readLastCheckpoint();
    
    for (const auto& entry : ckpt.dirty_page_table) {
        dirty_page_table.insert(entry.page_id, entry);
    }
    
    log_info("Loaded DPT from checkpoint with %d entries", 
             ckpt.dirty_page_table.size());
    
    // Step 2: Scan log from checkpoint forward to end
    LSN current_lsn = ckpt.checkpoint_lsn;
    while (current_lsn != END_OF_LOG) {
        LogRecord record = log_manager.readLog(current_lsn);
        
        if (record.isRedoable()) {
            // This log record modified a page
            PageID pid = record.getPageID();
            
            if (!dirty_page_table.contains(pid)) {
                // Page not in DPT - add it with this LSN as recLSN
                DirtyPageEntry entry;
                entry.page_id = pid;
                entry.recovery_lsn = current_lsn;
                dirty_page_table.insert(pid, entry);
                
                log_debug("Analysis: Added page %d to DPT, recLSN=%d", 
                         pid, current_lsn);
            }
            // If already in DPT, don't update recLSN (keep earliest)
        }
        
        if (record.type == END_LOG && record.isCommit()) {
            // Transaction committed - but don't remove pages from DPT
            // Pages might still be dirty even if transaction committed
            // (No-Force policy means commit doesn't force page writes)
        }
        
        current_lsn = record.next_lsn;
    }
    
    // After analysis, DPT reflects crash-time state
    log_info("Analysis complete: DPT has %d entries, min recLSN=%d",
             dirty_page_table.size(), dirty_page_table.getMinRecoveryLSN());
}

Critical: Don't Update Existing recLSN

When a page is already in the DPT and we encounter another log record modifying it, we must NOT update the recLSN. The existing recLSN points to the first modification, and redo must start from there to correctly replay all changes. Updating recLSN would skip earlier modifications, leaving the page inconsistent.

Handling End-of-Log Write Records:

If the log contains records indicating that a page was successfully written to disk (sometimes logged as 'page write' or incorporated into the checkpoint), those pages can be removed from the DPT during analysis. However, in many ARIES implementations, page writes are not logged, and the DPT is conservative—it may contain pages that are actually clean.

This is acceptable because the Redo Phase will check each page's LSN before actually redoing an operation. If the page's on-disk LSN is greater than or equal to the log record's LSN, redo is skipped for that record.

Determining the RedoLSN

The ultimate purpose of the DPT in the Analysis Phase is to determine the RedoLSN—the log position where the Redo Phase must begin processing. This is a critical optimization that can save enormous amounts of recovery time.

RedoLSN Definition:

The RedoLSN is the minimum of:

The smallest recLSN in the Dirty Page Table
The smallest firstLSN of any uncommitted transaction (from the Transaction Table)

Both values must be considered because:

DPT entries indicate pages that might not have been flushed
Transaction Table entries indicate operations that might not have been applied

determine_redo_lsn.cpp
C++
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
LSN AnalysisPhase::determineRedoLSN() {
    // Get minimum recLSN from Dirty Page Table
    LSN min_dpt_lsn = dirty_page_table.getMinRecoveryLSN();
    
    // Get minimum firstLSN from Transaction Table
    // (firstLSN is the LSN of the first log record for each active transaction)
    LSN min_txn_lsn = transaction_table.getMinFirstLSN();
    
    // RedoLSN is the minimum of both
    LSN redo_lsn;
    
    if (min_dpt_lsn == INVALID_LSN) {
        redo_lsn = min_txn_lsn;  // No dirty pages (unlikely but possible)
    } else if (min_txn_lsn == INVALID_LSN) {
        redo_lsn = min_dpt_lsn;  // No active transactions
    } else {
        redo_lsn = min(min_dpt_lsn, min_txn_lsn);
    }
    
    log_info("RedoLSN determined: %d (DPT min: %d, TxnTable min: %d)",
             redo_lsn, min_dpt_lsn, min_txn_lsn);
    
    return redo_lsn;
}
 
/*
 * Example Scenario:
 * -----------------
 * Checkpoint at LSN 1000
 * 
 * DPT at checkpoint:
 *   Page 10: recLSN = 950
 *   Page 20: recLSN = 1020
 *   Page 30: recLSN = 980
 * 
 * Transaction Table at checkpoint:
 *   T1: firstLSN = 900, lastLSN = 1050
 *   T2: firstLSN = 1030, lastLSN = 1040
 * 
 * After analysis (scanning to LSN 1200):
 *   DPT adds Page 40: recLSN = 1100
 *   T1 committed at LSN 1080
 *   T2 still active (lastLSN = 1150)
 * 
 * Min DPT recLSN = 950 (Page 10)
 * Min TxnTable firstLSN = 1030 (T2)
 * 
 * RedoLSN = min(950, 1030) = 950
 * 
 * Redo will start from LSN 950 and proceed to LSN 1200
 */

Recovery Optimization Impact

Consider a database with 10 years of logs totaling 100 million records. If the oldest dirty page was dirtied 5 minutes before the crash, RedoLSN might be 10,000 records back. Instead of replaying 100 million records, we replay only 10,000—a 10,000x improvement in recovery time.

Real-World Implementations

Production database systems implement variations of the Dirty Page Table concept, often with optimizations for their specific architectures. Let's examine how major systems handle dirty page tracking.

PostgreSQL's Approach:

PostgreSQL doesn't maintain a traditional Dirty Page Table. Instead, it uses a different recovery architecture:

Shared Buffers: Each buffer has a 'dirty' flag in its descriptor
Background Writer: Periodically scans and writes dirty buffers
Checkpoint Process: Writes ALL dirty buffers and then writes a checkpoint record
Full-Page Writes: After a checkpoint, the first modification to a page writes the entire page to WAL

This means PostgreSQL checkpoints are more aggressive about flushing dirty pages, but it avoids maintaining a separate DPT structure.

SQL
1
2
3
4
5
6
7
8
9
10
11
12
-- PostgreSQL tracks dirty buffers in shared memory
-- View dirty buffer statistics:
SELECT 
    buffers_checkpoint,   -- Buffers written during checkpoints
    buffers_clean,        -- Buffers written by background writer
    buffers_backend       -- Buffers written by backends
FROM pg_stat_bgwriter;
 
-- Each buffer descriptor contains:
-- - tag (identifies the page)
-- - state flags (including BM_DIRTY)
-- - usage_count (for clock sweep replacement)

Summary: Dirty Page Table Mastery

The Dirty Page Table is a deceptively simple data structure with profound implications for database reliability and performance. Let's consolidate the essential concepts:

Key Takeaways

•Definition: The DPT tracks pages modified in memory but not yet persisted to disk, with each entry recording the first-dirtying LSN (recLSN).
•Core Invariant: The recLSN always points to the earliest modification since the page was last flushed—never updated for subsequent modifications.
•Checkpoint Integration: DPT snapshots are written to the log during checkpoints, enabling recovery to start from checkpoint state.
•Analysis Phase Role: Reconstructs the crash-time DPT by loading checkpoint snapshot and scanning subsequent log records.
•RedoLSN: The minimum recLSN in the DPT determines where redo processing can safely begin.
•Efficiency Impact: DPT enables recovery to skip irrelevant historical log records, reducing recovery time by orders of magnitude.

What's Next:

With the Dirty Page Table understood, we'll examine its companion data structure—the Transaction Table. Together, these two structures form the complete state information needed to perform correct and efficient crash recovery. The Transaction Table tracks active transactions and their undo requirements, complementing the DPT's page-level recovery information.

Page Complete

You now have a comprehensive understanding of the Dirty Page Table—its structure, operations, checkpoint integration, and role in the ARIES Analysis Phase. This knowledge is fundamental to understanding how modern databases achieve fast, reliable crash recovery.

1 / 5

Loading learning content...

Database Management SystemsAnalysis Phase

ARIES Analysis Phase: Reconstructing System State

LevelAdvanced

Duration75 mins

TopicAnalysis Phase

1 / 5

Dirty Page Table: Tracking Modified Pages for Recovery

The Foundation of Intelligent Recovery

What You Will Learn

Understanding Dirty Pages

When the buffer manager needs space for new pages (page replacement)
During checkpoint operations
When transactions commit (in force-at-commit policies)
During periodic background flushing

Page States in the Buffer Pool
State	In Memory	In-Memory = On-Disk	Needs Write-Back
Clean	Yes	Yes (identical)	No
Dirty	Yes	No (modified)	Yes
Not Resident	No	N/A	N/A
Pinned & Clean	Yes (locked)	Yes	No
Pinned & Dirty	Yes (locked)	No	Yes (after unpin)

The Durability Challenge

Why Dirty Pages Matter for Recovery:

During crash recovery, the system must:

Redo operations whose effects may not have reached disk (dirty pages that weren't flushed)
Undo operations from transactions that didn't commit

The challenge is efficiency. A database might have:

1 billion log records spanning 5 years of operation
10,000 dirty pages at crash time
Only 100,000 log records relevant to those dirty pages

Dirty Page Table Architecture

The Dirty Page Table is an in-memory data structure maintained by the buffer manager. Its design balances fast lookup and update operations with the information needed for crash recovery.

Core Structure:

The DPT is typically implemented as a hash table mapping page identifiers to their recovery-relevant metadata. Each entry contains critical information about when and how the page became dirty.

dirty_page_table.h
C++
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
// Conceptual structure of a Dirty Page Table entry
struct DirtyPageEntry {
    PageID    page_id;        // Unique identifier: (table_space_id, page_number)
    LSN       recovery_lsn;   // LSN of the FIRST log record that dirtied this page
                              // since it was last written to disk
    Timestamp first_dirty_time; // Wall-clock time when page became dirty
    bool      is_flushing;    // True if write is currently in progress
};
 
// The Dirty Page Table itself
class DirtyPageTable {
private:
    HashMap<PageID, DirtyPageEntry> entries;
    Mutex                           table_lock;
    
public:
    // Record that a page has been modified
    void markDirty(PageID pid, LSN log_lsn);
    
    // Record that a page has been written to disk
    void markClean(PageID pid);
    
    // Get the minimum Recovery LSN across all dirty pages
    LSN getMinRecoveryLSN();
    
    // Create checkpoint snapshot
    vector<DirtyPageEntry> snapshot();
    
    // Check if a page is dirty
    bool isDirty(PageID pid);
};

Key Fields Explained:

First Dirty Time: A timestamp recording when the page became dirty. This is useful for monitoring (identifying pages that have been dirty too long) and for certain flushing strategies.

Is Flushing: A flag indicating that a disk write is currently in progress. This prevents the entry from being removed prematurely and handles concurrent operations correctly.

Why Track the First Dirtying LSN?

DPT Operations During Normal Processing

Operation 1: Marking a Page Dirty

mark_dirty_operation.cpp
C++
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
void DirtyPageTable::markDirty(PageID pid, LSN log_lsn) {
    lock_guard<Mutex> lock(table_lock);
    
    auto it = entries.find(pid);
    if (it == entries.end()) {
        // Page is not currently dirty - create new entry
        DirtyPageEntry entry;
        entry.page_id = pid;
        entry.recovery_lsn = log_lsn;  // This is the first dirtying LSN
        entry.first_dirty_time = getCurrentTime();
        entry.is_flushing = false;
        
        entries[pid] = entry;
        
        log_debug("Page %d marked dirty with recLSN=%d", pid, log_lsn);
    }
    // If already dirty, do NOT update recLSN - we need the earliest one
    // This is a critical invariant of ARIES
}

Operation 2: Marking a Page Clean

When the buffer manager successfully writes a dirty page to disk, that page becomes clean. The DPT entry must be removed because the on-disk version is now current.

mark_clean_operation.cpp
C++
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
void DirtyPageTable::markClean(PageID pid, LSN written_page_lsn) {
    lock_guard<Mutex> lock(table_lock);
    
    auto it = entries.find(pid);
    if (it == entries.end()) {
        return;  // Page wasn't tracked as dirty
    }
    
    // Critical check: Was the page modified DURING the flush?
    // The page header contains a pageLSN indicating the last modification
    // If pageLSN > written_page_lsn, new modifications occurred during write
    
    BufferPoolPage* page = buffer_pool.getPage(pid);
    if (page->getPageLSN() > written_page_lsn) {
        // Page was modified during flush - update recLSN to reflect
        // that only modifications up to written_page_lsn are on disk
        // Actually, the page is still dirty with modifications after flush
        it->second.recovery_lsn = written_page_lsn + 1; // Simplification
        it->second.is_flushing = false;
        log_debug("Page %d modified during flush, remains dirty", pid);
    } else {
        // Page is truly clean - remove from DPT
        entries.erase(it);
        log_debug("Page %d marked clean, removed from DPT", pid);
    }
}

Concurrency Considerations

Operation 3: Finding the Minimum Recovery LSN

min_recovery_lsn.cpp
C++
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
LSN DirtyPageTable::getMinRecoveryLSN() {
    lock_guard<Mutex> lock(table_lock);
    
    if (entries.empty()) {
        return INVALID_LSN;  // No dirty pages - no redo needed from DPT
    }
    
    LSN min_lsn = MAX_LSN;
    for (const auto& [pid, entry] : entries) {
        if (entry.recovery_lsn < min_lsn) {
            min_lsn = entry.recovery_lsn;
        }
    }
    
    return min_lsn;
}
 
// Example: Determining the redo start point
// At checkpoint time:
//   DPT contains: Page 10 (recLSN=500), Page 25 (recLSN=350), Page 42 (recLSN=600)
//   Min recLSN = 350
//   Recovery will start redo from LSN 350 (or earlier if transaction table indicates)

Checkpoint Integration

Checkpoint Types and DPT:

Checkpoint Types and Their DPT Handling
Checkpoint Type	Transaction Blocking	DPT Accuracy	Recovery Complexity
Quiescent	Full blocking	Perfect (no concurrent mods)	Simplest
Transaction-Consistent	No new transactions start	Consistent snapshot	Moderate
Fuzzy (ARIES)	None	Point-in-time snapshot	Requires analysis phase
Non-Blocking Fuzzy	None	Approximate	Most complex

Writing DPT to Checkpoint:

During a fuzzy checkpoint, ARIES writes two key data structures to the log:

Transaction Table - Active transactions and their states
Dirty Page Table - All dirty pages and their recLSNs

The checkpoint record structure includes the complete DPT snapshot:

checkpoint_dpt.cpp
C++
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
struct CheckpointRecord {
    LSN                    checkpoint_lsn;
    vector<TransactionEntry> transaction_table;
    vector<DirtyPageEntry>   dirty_page_table;  // Complete DPT snapshot
    LSN                    master_record_ptr;   // Points to this checkpoint
};
 
void RecoveryManager::performFuzzyCheckpoint() {
    // Step 1: Log BEGIN_CHECKPOINT
    LSN begin_ckpt_lsn = log_manager.appendLog(BEGIN_CHECKPOINT, {});
    
    // Step 2: Take consistent snapshot of DPT
    // This should be atomic or use latching to ensure consistency
    vector<DirtyPageEntry> dpt_snapshot = dirty_page_table.snapshot();
    
    // Step 3: Take snapshot of transaction table
    vector<TransactionEntry> txn_snapshot = txn_manager.snapshot();
    
    // Step 4: Log END_CHECKPOINT with both tables
    LSN end_ckpt_lsn = log_manager.appendLog(END_CHECKPOINT, {
        .transaction_table = txn_snapshot,
        .dirty_page_table = dpt_snapshot
    });
    
    // Step 5: Update master record to point to this checkpoint
    // The master record is stored in a known location on disk
    updateMasterRecord(begin_ckpt_lsn);
    
    log_info("Checkpoint complete: begin=%d, end=%d, DPT entries=%d",
             begin_ckpt_lsn, end_ckpt_lsn, dpt_snapshot.size());
}

Master Record Purpose

Why the DPT Snapshot Matters:

Without the DPT snapshot in the checkpoint, recovery would need to scan the log from the beginning to determine which pages might be dirty. With the snapshot:

Recovery starts with the DPT state at checkpoint time
Analysis phase updates this DPT by scanning log records after the checkpoint
Redo processing knows exactly where to start (minimum recLSN)

This drastically reduces recovery time—from potentially hours to minutes or seconds.

DPT Reconstruction During Analysis

During crash recovery, the Analysis Phase reconstructs the DPT to reflect its state at crash time. This process involves:

Loading the DPT snapshot from the last checkpoint
Scanning log records from the checkpoint forward
Updating the DPT based on logged operations

The reconstructed DPT tells the Redo Phase exactly where to begin processing and which pages need attention.

analysis_phase_dpt.cpp
C++
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
void AnalysisPhase::reconstructDPT() {
    // Step 1: Initialize DPT from checkpoint
    // Read the end-checkpoint record to get the DPT snapshot
    CheckpointRecord ckpt = readLastCheckpoint();
    
    for (const auto& entry : ckpt.dirty_page_table) {
        dirty_page_table.insert(entry.page_id, entry);
    }
    
    log_info("Loaded DPT from checkpoint with %d entries", 
             ckpt.dirty_page_table.size());
    
    // Step 2: Scan log from checkpoint forward to end
    LSN current_lsn = ckpt.checkpoint_lsn;
    while (current_lsn != END_OF_LOG) {
        LogRecord record = log_manager.readLog(current_lsn);
        
        if (record.isRedoable()) {
            // This log record modified a page
            PageID pid = record.getPageID();
            
            if (!dirty_page_table.contains(pid)) {
                // Page not in DPT - add it with this LSN as recLSN
                DirtyPageEntry entry;
                entry.page_id = pid;
                entry.recovery_lsn = current_lsn;
                dirty_page_table.insert(pid, entry);
                
                log_debug("Analysis: Added page %d to DPT, recLSN=%d", 
                         pid, current_lsn);
            }
            // If already in DPT, don't update recLSN (keep earliest)
        }
        
        if (record.type == END_LOG && record.isCommit()) {
            // Transaction committed - but don't remove pages from DPT
            // Pages might still be dirty even if transaction committed
            // (No-Force policy means commit doesn't force page writes)
        }
        
        current_lsn = record.next_lsn;
    }
    
    // After analysis, DPT reflects crash-time state
    log_info("Analysis complete: DPT has %d entries, min recLSN=%d",
             dirty_page_table.size(), dirty_page_table.getMinRecoveryLSN());
}

Critical: Don't Update Existing recLSN

Handling End-of-Log Write Records:

Determining the RedoLSN

RedoLSN Definition:

The RedoLSN is the minimum of:

The smallest recLSN in the Dirty Page Table
The smallest firstLSN of any uncommitted transaction (from the Transaction Table)

Both values must be considered because:

DPT entries indicate pages that might not have been flushed
Transaction Table entries indicate operations that might not have been applied

determine_redo_lsn.cpp
C++
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
LSN AnalysisPhase::determineRedoLSN() {
    // Get minimum recLSN from Dirty Page Table
    LSN min_dpt_lsn = dirty_page_table.getMinRecoveryLSN();
    
    // Get minimum firstLSN from Transaction Table
    // (firstLSN is the LSN of the first log record for each active transaction)
    LSN min_txn_lsn = transaction_table.getMinFirstLSN();
    
    // RedoLSN is the minimum of both
    LSN redo_lsn;
    
    if (min_dpt_lsn == INVALID_LSN) {
        redo_lsn = min_txn_lsn;  // No dirty pages (unlikely but possible)
    } else if (min_txn_lsn == INVALID_LSN) {
        redo_lsn = min_dpt_lsn;  // No active transactions
    } else {
        redo_lsn = min(min_dpt_lsn, min_txn_lsn);
    }
    
    log_info("RedoLSN determined: %d (DPT min: %d, TxnTable min: %d)",
             redo_lsn, min_dpt_lsn, min_txn_lsn);
    
    return redo_lsn;
}
 
/*
 * Example Scenario:
 * -----------------
 * Checkpoint at LSN 1000
 * 
 * DPT at checkpoint:
 *   Page 10: recLSN = 950
 *   Page 20: recLSN = 1020
 *   Page 30: recLSN = 980
 * 
 * Transaction Table at checkpoint:
 *   T1: firstLSN = 900, lastLSN = 1050
 *   T2: firstLSN = 1030, lastLSN = 1040
 * 
 * After analysis (scanning to LSN 1200):
 *   DPT adds Page 40: recLSN = 1100
 *   T1 committed at LSN 1080
 *   T2 still active (lastLSN = 1150)
 * 
 * Min DPT recLSN = 950 (Page 10)
 * Min TxnTable firstLSN = 1030 (T2)
 * 
 * RedoLSN = min(950, 1030) = 950
 * 
 * Redo will start from LSN 950 and proceed to LSN 1200
 */

Recovery Optimization Impact

Real-World Implementations

PostgreSQL's Approach:

PostgreSQL doesn't maintain a traditional Dirty Page Table. Instead, it uses a different recovery architecture:

Shared Buffers: Each buffer has a 'dirty' flag in its descriptor
Background Writer: Periodically scans and writes dirty buffers
Checkpoint Process: Writes ALL dirty buffers and then writes a checkpoint record
Full-Page Writes: After a checkpoint, the first modification to a page writes the entire page to WAL

This means PostgreSQL checkpoints are more aggressive about flushing dirty pages, but it avoids maintaining a separate DPT structure.

SQL
1
2
3
4
5
6
7
8
9
10
11
12
-- PostgreSQL tracks dirty buffers in shared memory
-- View dirty buffer statistics:
SELECT 
    buffers_checkpoint,   -- Buffers written during checkpoints
    buffers_clean,        -- Buffers written by background writer
    buffers_backend       -- Buffers written by backends
FROM pg_stat_bgwriter;
 
-- Each buffer descriptor contains:
-- - tag (identifies the page)
-- - state flags (including BM_DIRTY)
-- - usage_count (for clock sweep replacement)

Summary: Dirty Page Table Mastery

The Dirty Page Table is a deceptively simple data structure with profound implications for database reliability and performance. Let's consolidate the essential concepts:

Key Takeaways

•Definition: The DPT tracks pages modified in memory but not yet persisted to disk, with each entry recording the first-dirtying LSN (recLSN).
•Core Invariant: The recLSN always points to the earliest modification since the page was last flushed—never updated for subsequent modifications.
•Checkpoint Integration: DPT snapshots are written to the log during checkpoints, enabling recovery to start from checkpoint state.
•Analysis Phase Role: Reconstructs the crash-time DPT by loading checkpoint snapshot and scanning subsequent log records.
•RedoLSN: The minimum recLSN in the DPT determines where redo processing can safely begin.
•Efficiency Impact: DPT enables recovery to skip irrelevant historical log records, reducing recovery time by orders of magnitude.

What's Next:

Page Complete

1 / 5