In production database systems, availability is paramount. Taking the database offline—even for a few seconds—during checkpoint is unacceptable. Users experience timeouts, transactions fail, and business operations halt.
Fuzzy checkpoints solve this problem by fundamentally changing how we think about checkpoint consistency. Instead of demanding that the on-disk state perfectly match the checkpoint record (as consistent checkpoints do), fuzzy checkpoints accept that the on-disk state may be "fuzzy"—slightly ahead of or behind the checkpoint's view.
The magic is in the recovery process. By recording sufficient information in the checkpoint (the ATT and DPT), recovery can correctly interpret the fuzzy state and restore full consistency.
This page examines fuzzy checkpoints in depth: the theoretical foundation, the implementation mechanics, the recovery implications, and the practical benefits that make them the universal choice for modern databases.
By the end of this page, you will understand the theory behind fuzzy consistency, how the ATT and DPT enable correct recovery from fuzzy state, the specific mechanisms that allow transactions to proceed during checkpoint, and how modern databases implement and optimize fuzzy checkpoints.
The term "fuzzy" captures the essential characteristic of these checkpoints: the relationship between the checkpoint record and the actual database state is imprecise.
What makes a checkpoint "fuzzy"?
In a consistent checkpoint, when the checkpoint record is written:

- Every dirty page has been flushed, so the on-disk state matches the checkpoint record exactly
- No transaction is mid-update, because updates were quiesced during the checkpoint

In a fuzzy checkpoint, when the checkpoint record is written:

- Some pages recorded as dirty in the DPT may already have been flushed; others remain dirty in memory
- Transactions are still running and modifying pages throughout the checkpoint

The fuzziness manifests in several ways:

- The DPT may over-approximate: a listed page may have been flushed moments after the snapshot
- Pages on disk may carry changes made after the checkpoint began, putting them "ahead" of the checkpoint's view
- The ATT may list transactions that committed immediately after the snapshot was taken
Fuzzy checkpoints don't prevent recovery from working—they just mean recovery has to do more work. The DPT provides an upper bound on what might need redo. The ATT provides an upper bound on what might need undo. Recovery starts with these bounds and refines them by reading the log.
The brilliance of fuzzy checkpoints lies in how recovery uses the imprecise information to restore a precise, consistent state.
Recovery's approach to fuzziness:
Recovery doesn't try to figure out the exact state at checkpoint time. Instead, it:
1. Assumes the worst case: every page in the DPT is treated as potentially needing redo, and every transaction in the ATT as potentially needing undo.
2. Replays the log to discover reality: by processing log records after the checkpoint, recovery learns what actually happened: which pages were actually modified, which transactions actually committed.
3. Applies corrections as needed: redo operations restore missing changes; undo operations reverse incomplete transactions.
The DPT and recovery:
```
-- How recovery uses the Dirty Page Table from fuzzy checkpoint

FUNCTION redo_phase(checkpoint):
    -- Find the earliest point that might need redo
    redo_lsn = MIN(page.recovery_lsn FOR page IN checkpoint.dirty_page_table)

    -- Scan log from this point forward
    FOR each log_record FROM redo_lsn TO end_of_log:
        page_id = log_record.page_id

        -- Check if this page might need the operation
        IF page_id IN checkpoint.dirty_page_table:
            page_recovery_lsn = checkpoint.dirty_page_table[page_id].recovery_lsn
            IF log_record.lsn >= page_recovery_lsn:
                -- This record might affect the page
                -- Read current page LSN from disk
                disk_page = read_page(page_id)
                IF log_record.lsn > disk_page.page_lsn:
                    -- Page on disk is older than this record
                    -- Must redo this operation
                    apply_redo(log_record, disk_page)
                    write_page(disk_page)
                ELSE:
                    -- Page on disk already has this change
                    -- (It was flushed after checkpoint)
                    -- Skip redo - already applied
                    SKIP
        ELSE:
            -- Page not in DPT at checkpoint
            -- But it might have been modified after checkpoint
            -- Still need to check and potentially redo
            -- (Page was clean at checkpoint but dirtied later)
            disk_page = read_page(page_id)
            IF log_record.lsn > disk_page.page_lsn:
                apply_redo(log_record, disk_page)
                write_page(disk_page)
```

Key concept: Page LSN comparison
Every database page stores the LSN of the last log record that modified it (the page_lsn). During redo:

- If the log record's LSN is greater than the page's page_lsn, the page is missing that change and the operation must be redone
- If the log record's LSN is less than or equal to the page_lsn, the change is already on the page and redo is skipped
This comparison is how recovery detects whether a page listed in the DPT actually needs redo. The "fuzziness" of whether the page was flushed is resolved by checking the page itself.
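The page-LSN test is small enough to state precisely. The sketch below is illustrative Python, not any engine's actual code; the function name and the LSN values are invented for the example:

```python
def needs_redo(record_lsn: int, page_lsn: int, recovery_lsn: int) -> bool:
    """Decide whether a redo-phase log record must be applied to a page.

    record_lsn   -- LSN of the log record being considered
    page_lsn     -- LSN stored in the on-disk page (last change applied to it)
    recovery_lsn -- first LSN that might need redo for this page (from the DPT)
    """
    if record_lsn < recovery_lsn:
        # Record predates the page's first unflushed modification,
        # so its effect was already on disk at checkpoint time.
        return False
    # Apply only if the page has not yet seen this change.
    return record_lsn > page_lsn

# A record newer than both bounds must be redone:
assert needs_redo(record_lsn=120, page_lsn=100, recovery_lsn=110)
# A page flushed after the checkpoint already carries the change:
assert not needs_redo(record_lsn=120, page_lsn=130, recovery_lsn=110)
```

Note that the "fuzziness" costs only a page read and a comparison per candidate record, never a wrong result.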
The ATT and recovery:
Similarly, the Active Transaction Table from the checkpoint provides a starting list of transactions that might need undo. As recovery processes the log:

- A commit record removes that transaction from the undo candidates
- A log record from a transaction not yet in the table adds it as a new candidate (it started after the checkpoint)
The ARIES recovery algorithm (used by most databases) follows a 'redo all, undo selectively' approach. Redo replays all operations to reconstruct the crash-time state. Then undo rolls back only uncommitted transactions. This works perfectly with fuzzy checkpoints because the checkpoint provides starting points, not final answers.
The defining feature of fuzzy checkpoints is that transactions continue running throughout the checkpoint process. Let's examine exactly what happens when transactions execute concurrently with checkpoint operations.
Timeline of a fuzzy checkpoint with concurrent transactions:
What happens to each transaction during the checkpoint:
| Transaction | Checkpoint Capture | Actual Outcome | Recovery Requirement |
|---|---|---|---|
| Txn A | Active in ATT (captured at t=1) | Committed at t=7 (after ATT capture) | Initially in undo list; removed when commit found in log |
| Txn B | Not in ATT (started at t=4, after capture) | Aborted at crash (t=18) | Discovered during log scan; added to undo list; rolled back |
| Txn C | Active in ATT (captured at t=1) | Committed at t=9 (after capture) | Initially in undo list; removed when commit found in log |
Critical observation:
The ATT snapshot is taken at a specific moment during the checkpoint (when the ATT lock is held). Transactions that:

- Are active at that moment appear in the snapshot (Transactions A and C)
- Start after that moment do not appear (Transaction B)
But the checkpoint doesn't know what happens afterward. Transaction A might commit; Transaction C might commit; new transactions might start. All of this is discovered during log replay, not from the checkpoint.
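The discovery process described above can be sketched in Python. This is a simplified illustration, not a real recovery implementation: record types are reduced to "update" and "commit", and the transaction ids are the A/B/C from the timeline:

```python
def discover_losers(checkpoint_att, log_after_checkpoint):
    """Refine the undo ('loser') list by scanning the log after the checkpoint.

    checkpoint_att       -- set of txn ids active when the ATT was captured
    log_after_checkpoint -- list of (txn_id, record_type) in log order
    """
    losers = set(checkpoint_att)      # worst case: all might need undo
    for txn_id, record_type in log_after_checkpoint:
        if record_type == "update":
            losers.add(txn_id)        # seen in log: a loser until proven committed
        elif record_type == "commit":
            losers.discard(txn_id)    # committed: no undo needed
    return losers

# Mirroring the timeline above: A and C were in the ATT, B started later.
log = [("B", "update"), ("A", "commit"), ("B", "update"), ("C", "commit")]
assert discover_losers({"A", "C"}, log) == {"B"}
```

The checkpoint supplies only the starting set; every later event is learned from the log, exactly as the table shows.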
The brief lock window:
The only blocking that occurs is during the ATT and DPT snapshot captures. These locks are held only long enough to copy the in-memory tables, typically microseconds to a few milliseconds.
During these brief windows, new transaction starts and new page modifications may be blocked. But the window is so short that it's imperceptible to most workloads.
You might wonder: why lock at all? The answer is consistency. Without a brief lock, the snapshots could be mutually inconsistent: the ATT copy might miss transaction T (whose entry was being added mid-copy) while the DPT copy includes pages T has already dirtied, leaving recovery with dirty pages it cannot attribute to any known transaction. The momentary lock ensures the snapshot is point-in-time consistent, which is essential for correct recovery.
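A minimal sketch of the snapshot-under-lock idea, in Python with illustrative names. For brevity the sketch takes the lock only in `snapshot()`; a real engine would also hold it on every ATT/DPT mutation:

```python
import threading

class CheckpointTables:
    """Runtime ATT/DPT with a brief lock for point-in-time snapshots."""

    def __init__(self):
        self._lock = threading.Lock()
        self.att = {}   # txn_id -> state
        self.dpt = {}   # page_id -> recovery_lsn

    def snapshot(self):
        # Hold the lock only long enough to copy both tables,
        # so the pair is mutually consistent.
        with self._lock:
            return dict(self.att), dict(self.dpt)

tables = CheckpointTables()
tables.att[1001] = "ACTIVE"
tables.dpt[42] = 12480000
att_snap, dpt_snap = tables.snapshot()

# Later modifications do not leak into the captured snapshot.
tables.att[1005] = "ACTIVE"
assert 1005 not in att_snap
assert att_snap == {1001: "ACTIVE"}
```

The copies are what get written into the checkpoint record; the runtime tables keep evolving the instant the lock is released.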
The Dirty Page Table (DPT) in a fuzzy checkpoint requires careful management. It must accurately reflect which pages might need redo, accounting for both the flushing lag and the checkpoint capture timing.
DPT lifecycle:
```
-- DPT management during fuzzy checkpoint

-- Runtime DPT structure (maintained continuously)
RUNTIME_DPT = {
    entries: Map<PageID, DPTEntry>
}

-- When a page is first modified in buffer pool
FUNCTION on_page_modified(page_id, modifying_lsn):
    IF page_id NOT IN runtime_dpt.entries:
        -- First modification; record recovery_lsn
        runtime_dpt.entries[page_id] = DPTEntry {
            page_id: page_id,
            recovery_lsn: modifying_lsn,
            -- recovery_lsn is the first LSN that might need redo
        }
    -- If already in DPT, don't change recovery_lsn
    -- (first modification is what matters)

-- When a page is flushed to disk
FUNCTION on_page_flushed(page_id):
    -- Page is now clean; remove from DPT
    DELETE runtime_dpt.entries[page_id]
    -- Note: The page remembers its page_lsn on disk
    -- This allows recovery to skip already-applied redo

-- Checkpoint DPT snapshot (brief lock held)
FUNCTION capture_dpt_snapshot():
    ACQUIRE_LOCK(runtime_dpt)
    checkpoint_dpt = COPY(runtime_dpt.entries)
    RELEASE_LOCK(runtime_dpt)
    RETURN checkpoint_dpt

-- Compute redo_lsn from DPT
FUNCTION compute_redo_lsn(checkpoint_dpt):
    IF checkpoint_dpt IS EMPTY:
        -- No dirty pages; redo starts from checkpoint
        RETURN checkpoint_lsn
    ELSE:
        -- Redo starts from oldest recovery_lsn
        RETURN MIN(entry.recovery_lsn FOR entry IN checkpoint_dpt)
```

The recovery_lsn optimization:
The recovery_lsn stored for each dirty page is critical for recovery efficiency. Consider a page modified by 1000 log records. If we stored all 1000 LSNs, the DPT would be enormous. Instead, we store only the first—that's where redo starts for this page. Redo will naturally replay all 1000 modifications in order.
Redo_lsn computation:
The checkpoint's redo_lsn is the minimum recovery_lsn across all dirty pages. This is where redo begins after a crash.

If background flushing is aggressive:

- Dirty pages are flushed soon after modification, so the DPT stays small and the oldest recovery_lsn stays recent; redo scans only a short stretch of log

If background flushing is slow:

- Dirty pages accumulate, the oldest recovery_lsn falls far behind the checkpoint, and redo must scan a much longer stretch of log
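The DPT lifecycle and redo_lsn computation above can be restated as runnable Python. This is a teaching sketch with invented names and LSNs, not a production structure:

```python
class DirtyPageTable:
    """Minimal DPT sketch: first-modification LSN per dirty page."""

    def __init__(self):
        self.entries = {}  # page_id -> recovery_lsn

    def on_page_modified(self, page_id, lsn):
        # Keep only the FIRST modifying LSN; later updates don't move it.
        self.entries.setdefault(page_id, lsn)

    def on_page_flushed(self, page_id):
        # Page is clean again; drop it from the table.
        self.entries.pop(page_id, None)

    def redo_lsn(self, checkpoint_lsn):
        # Redo starts at the oldest recovery_lsn, or at the
        # checkpoint itself if nothing is dirty.
        return min(self.entries.values(), default=checkpoint_lsn)

dpt = DirtyPageTable()
dpt.on_page_modified(7, 100)
dpt.on_page_modified(7, 250)   # ignored: recovery_lsn stays 100
dpt.on_page_modified(9, 180)
assert dpt.redo_lsn(checkpoint_lsn=300) == 100
dpt.on_page_flushed(7)          # aggressive flushing advances redo_lsn
assert dpt.redo_lsn(checkpoint_lsn=300) == 180
```

The final two assertions demonstrate the flushing effect described above: writing out the oldest dirty page immediately moves redo_lsn forward.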
In a large database with many dirty pages, the DPT can be substantial—potentially containing millions of entries. The checkpoint must write all of this to the log. Systems optimize by keeping DPT compact (using page IDs and LSNs only) and by aggressively flushing pages to reduce DPT size.
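As a rough illustration of that size concern, here is a back-of-the-envelope estimate in Python. The 16-bytes-per-entry figure (an 8-byte page id plus an 8-byte LSN) is an assumption for the sketch; real on-log formats add headers and padding:

```python
def dpt_checkpoint_bytes(n_entries, bytes_per_entry=16):
    """Rough size of the DPT payload written with a checkpoint record.

    Assumes a compact entry: 8-byte page id + 8-byte recovery LSN.
    """
    return n_entries * bytes_per_entry

# Two million dirty pages -> roughly 32 MB of checkpoint payload.
assert dpt_checkpoint_bytes(2_000_000) == 32_000_000
```

Even at this compact encoding, millions of entries mean tens of megabytes per checkpoint, which is why aggressive flushing to shrink the DPT pays off twice: smaller checkpoint writes and less log to redo.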
The Active Transaction Table (ATT) in a fuzzy checkpoint captures the state of all running transactions at the snapshot moment. This information guides the undo phase of recovery.
ATT entry contents:
```
-- ATT entry for each active transaction
ATT_ENTRY = {
    txn_id: TransactionID,   -- Unique transaction identifier
    state: ENUM {            -- Transaction's current state
        ACTIVE,              -- Currently executing
        PREPARING,           -- In 2PC prepare phase
        COMMITTED,           -- Commit record written (but may not be hardened)
        ABORTED              -- Abort decided (but rollback may be in progress)
    },
    last_lsn: LSN,           -- Most recent log record written by this txn
    first_lsn: LSN,          -- First log record written by this txn
                             -- (Useful for log truncation boundary)
    undo_nxt_lsn: LSN        -- Next log record to undo (for partial rollback)
                             -- Used when txn was rolling back at checkpoint time
}

-- Example ATT at checkpoint time
CHECKPOINT_ATT = [
    { txn_id: 1001, state: ACTIVE,    last_lsn: 12500400, first_lsn: 12480000, undo_nxt_lsn: NULL },
    { txn_id: 1005, state: ACTIVE,    last_lsn: 12500450, first_lsn: 12495000, undo_nxt_lsn: NULL },
    { txn_id: 1007, state: PREPARING, last_lsn: 12500480, first_lsn: 12500100, undo_nxt_lsn: NULL },
    { txn_id: 1009, state: ABORTED,   last_lsn: 12500300, first_lsn: 12490000, undo_nxt_lsn: 12500250 }
    -- Txn 1009 was rolling back at checkpoint time; undo continues from LSN 12500250
]
```

How recovery uses the ATT:
Initialize undo candidates: At the start of recovery, the ATT from the checkpoint becomes the initial "loser" list—transactions that might need to be undone.
Update during redo: As redo scans the log, it discovers:

- Commit records, which remove the corresponding transactions from the loser list
- Log records from transactions not in the checkpoint ATT, which add new loser candidates (like Txn 2000 in the table below)
Final loser list: After redo completes, the remaining transactions in the loser list are those that were running at crash time and never committed. These need undo.
Perform undo: Each loser transaction is rolled back by scanning backward through its log records.
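The backward scan in step 4 follows each transaction's chain of prev_lsn pointers, the standard per-transaction backlink kept in write-ahead log records. A hedged sketch with an invented log layout:

```python
def undo_order(log, last_lsn):
    """Walk a loser transaction's log records backward via prev_lsn links.

    log      -- dict: lsn -> {"prev_lsn": lsn or None, ...}
    last_lsn -- the transaction's most recent LSN (from the ATT or log scan)
    Returns the LSNs in the order undo would process them.
    """
    order = []
    lsn = last_lsn
    while lsn is not None:
        order.append(lsn)
        lsn = log[lsn]["prev_lsn"]
    return order

# A transaction that wrote records at LSNs 100, 200, 300:
log = {
    100: {"prev_lsn": None},
    200: {"prev_lsn": 100},
    300: {"prev_lsn": 200},
}
assert undo_order(log, 300) == [300, 200, 100]
```

This is why the ATT stores last_lsn (where the backward walk begins) and undo_nxt_lsn (where it resumes for a transaction that was already mid-rollback).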
| Phase | Txn 1001 | Txn 1005 | Txn 1007 | Txn 1009 | Txn 2000 (new) |
|---|---|---|---|---|---|
| Checkpoint ATT | Active | Active | Preparing | Aborting | — |
| After redo (found commit T1001) | Committed ✓ | Active | Preparing | Aborted | Active |
| After redo (found commit T1007) | Committed ✓ | Active | Committed ✓ | Aborted | Active |
| End of redo | Not in loser list | In loser list | Not in loser list | In loser list | In loser list |
| After undo | — | Undone | — | Undone | Undone |
The PREPARING state handles distributed transactions (2-phase commit). A transaction in PREPARING has passed the point of no return—it cannot unilaterally abort. Recovery must ask the coordinator whether to commit or abort. This is a special case handled after normal undo completes.
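Assuming checkpoint ATT entries carry the states shown earlier, the split between ordinary losers and in-doubt transactions can be sketched as follows (the txn ids reuse the example ATT above):

```python
def classify_for_undo(att):
    """Split checkpoint ATT entries by how recovery must treat them.

    att -- dict: txn_id -> state string ("ACTIVE", "PREPARING", "ABORTED", ...)
    """
    # ACTIVE and ABORTED (rollback in progress) need undo unless a
    # later commit record clears them during redo.
    undo_candidates = [t for t, s in att.items() if s in ("ACTIVE", "ABORTED")]
    # PREPARING has passed the point of no return; only the 2PC
    # coordinator can decide its fate.
    in_doubt = [t for t, s in att.items() if s == "PREPARING"]
    return sorted(undo_candidates), sorted(in_doubt)

att = {1001: "ACTIVE", 1005: "ACTIVE", 1007: "PREPARING", 1009: "ABORTED"}
undo, in_doubt = classify_for_undo(att)
assert undo == [1001, 1005, 1009]
assert in_doubt == [1007]   # resolved by asking the coordinator
```

In-doubt transactions hold their locks until the coordinator answers, which is why 2PC outages can block other work even after local recovery completes.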
Fuzzy checkpoints work hand-in-hand with background page flushing. The checkpoint establishes recovery points, while background flushing keeps dirty page accumulation under control. Together, they bound recovery time without impacting transactions.
The relationship:
Background flushing keeps pages recent: By continuously flushing old dirty pages, the oldest recovery_lsn in the DPT stays recent.
Checkpoints capture the current state: When a checkpoint runs, it captures the DPT as it currently stands—shaped by background flushing.
Recovery time depends on both: Recovery time ≈ (checkpoint interval) + (oldest dirty page age). Both checkpoint frequency and flushing aggressiveness matter.
Flushing policy and recovery time:
| Flushing Policy | Typical DPT Size | Oldest recovery_lsn | Recovery Time Impact |
|---|---|---|---|
| No background flushing | Very large (all modified pages) | Potentially hours old | Recovery scans hours of log |
| Lazy flushing (LRU-based) | Medium (recently modified only) | Minutes to hours old | Recovery scans minutes to hours |
| Aggressive flushing (age target) | Small (only very recent changes) | Seconds to minutes old | Recovery completes in seconds |
| Write-through (no caching) | Zero (nothing dirty) | N/A | Minimal redo needed (but slow runtime) |
Checkpoint-triggered flushing acceleration:
Some systems temporarily increase background flushing intensity when a checkpoint is approaching or in progress. The goal is to:

- Shrink the DPT, keeping the checkpoint record compact
- Advance the oldest recovery_lsn, so post-crash redo has less log to scan
This acceleration is carefully limited to avoid overwhelming I/O capacity:
Flushing rate = base_rate + (checkpoint_proximity × boost_factor)
Where checkpoint_proximity is 0 far from checkpoints and approaches 1 as the checkpoint begins.
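The formula above can be written directly as a function; the specific rates below are invented for illustration:

```python
def flushing_rate(base_rate, checkpoint_proximity, boost_factor):
    """Flushing target (e.g., pages/second) near a checkpoint.

    checkpoint_proximity is 0.0 far from a checkpoint and approaches
    1.0 as the checkpoint begins.
    """
    assert 0.0 <= checkpoint_proximity <= 1.0
    return base_rate + checkpoint_proximity * boost_factor

assert flushing_rate(1000, 0.0, 800) == 1000   # steady state
assert flushing_rate(1000, 0.5, 800) == 1400   # checkpoint approaching
assert flushing_rate(1000, 1.0, 800) == 1800   # checkpoint starting
```

Capping boost_factor relative to the device's provisioned IOPS is what keeps the acceleration from overwhelming I/O capacity.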
Your checkpoint frequency and flushing intensity should be tuned to meet your Recovery Time Objective (RTO). If your RTO is 60 seconds, you need checkpoints and flushing frequent enough to ensure redo_lsn is never more than ~50 seconds behind (leaving margin for redo execution time).
Modern database systems all implement fuzzy checkpoints, though terminology and details vary. Let's examine how major systems approach checkpointing.
PostgreSQL:
PostgreSQL uses a checkpoint process that:

- Triggers on a timer (checkpoint_timeout) or on accumulated WAL volume (max_wal_size)
- Flushes all dirty shared buffers, spreading the writes over most of the interval (checkpoint_completion_target)
- Writes a checkpoint record to the WAL once flushing completes
PostgreSQL's checkpoint is slightly less "fuzzy" than ARIES—it does flush all dirty buffers, but does so without stopping transactions.
MySQL/InnoDB:
InnoDB uses a fuzzy checkpoint mechanism with:
- Adaptive flushing that paces page writes against redo log production (innodb_adaptive_flushing)
- A dirty-page percentage target for background flushing (innodb_max_dirty_pages_pct)
- innodb_log_checkpoint_now for forced checkpoints (debug builds)

SQL Server:
SQL Server implements:
- Indirect checkpoints driven by a declared recovery-time target (TARGET_RECOVERY_TIME)
- CHECKPOINT command for manual triggering
```
-- PostgreSQL checkpoint configuration
-- postgresql.conf settings

checkpoint_timeout = 5min            -- Max time between checkpoints
checkpoint_completion_target = 0.9   -- Spread I/O over 90% of interval
max_wal_size = 1GB                   -- Trigger checkpoint at this WAL size
checkpoint_warning = 30s             -- Warn if checkpoints too frequent

-- View checkpoint statistics
SELECT * FROM pg_stat_bgwriter;

-- Force checkpoint
CHECKPOINT;

--------------------------------------------------------------

-- MySQL/InnoDB checkpoint configuration
-- my.cnf settings

innodb_log_file_size = 1G            -- WAL file size
innodb_max_dirty_pages_pct = 75      -- Background flush below this
innodb_adaptive_flushing = ON        -- Adaptive checkpoint-aware flushing
innodb_io_capacity = 2000            -- IOPS budget for background writes

-- View checkpoint information
SHOW ENGINE INNODB STATUS;

--------------------------------------------------------------

-- SQL Server indirect checkpoint configuration

-- Set target recovery time (controls checkpoint frequency)
ALTER DATABASE [MyDatabase] SET TARGET_RECOVERY_TIME = 60 SECONDS;

-- View checkpoint statistics
SELECT * FROM sys.dm_db_log_stats(DB_ID());

-- Force checkpoint
CHECKPOINT;
```

Common tuning considerations across systems:
Recovery time vs. checkpoint overhead: More frequent checkpoints mean faster recovery but more I/O overhead during normal operation.
I/O smoothing: Spreading checkpoint writes over time prevents I/O spikes. PostgreSQL's checkpoint_completion_target and InnoDB's adaptive flushing address this.
Container/cloud environments: In virtualized environments, I/O may be less predictable. Systems may need more conservative settings to avoid exceeding provisioned IOPS.
SSD vs. HDD: SSDs tolerate random I/O better, allowing more aggressive flushing and less concern about checkpoint I/O patterns.
SQL Server's 'indirect checkpoints' are an advanced approach that specifies a target recovery time. The system automatically adjusts checkpoint frequency and flushing intensity to meet this target. This shifts checkpoint tuning from manual parameter adjustment to declaring an SLA.
Fuzzy checkpoints are the enabling mechanism for high-availability database systems. They provide the recovery benefits of checkpoints without the performance costs of blocking transactions. Let's consolidate the key concepts:

- Fuzzy checkpoints record the ATT and DPT without requiring the on-disk state to match the checkpoint record
- The DPT bounds what might need redo; the ATT bounds what might need undo; recovery refines both by scanning the log
- Page LSN comparison resolves whether each page actually needs a given redo operation
- Only the brief snapshot locks block transactions; everything else proceeds concurrently
- Background flushing works with checkpoints to keep redo_lsn recent and recovery time bounded
What's next:
Now that we understand how fuzzy checkpoints work, we'll examine checkpoint frequency—how to determine the optimal interval between checkpoints, balancing recovery time requirements against operational overhead.
You now have a deep understanding of fuzzy checkpoints—the mechanism that makes high-availability database systems possible. You understand the theory of fuzziness, the mechanics of ATT and DPT management, the integration with background flushing, and how major databases implement checkpointing. Next, we'll explore how to choose the right checkpoint frequency.