In production database systems, availability is paramount. Taking the database offline—even for a few seconds—during checkpoint is unacceptable. Users experience timeouts, transactions fail, and business operations halt.
Fuzzy checkpoints solve this problem by fundamentally changing how we think about checkpoint consistency. Instead of demanding that the on-disk state perfectly match the checkpoint record (as consistent checkpoints do), fuzzy checkpoints accept that the on-disk state may be "fuzzy"—slightly ahead of or behind the checkpoint's view.
The magic is in the recovery process. By recording sufficient information in the checkpoint (the ATT and DPT), recovery can correctly interpret the fuzzy state and restore full consistency.
This page examines fuzzy checkpoints in depth: the theoretical foundation, the implementation mechanics, the recovery implications, and the practical benefits that make them the universal choice for modern databases.
By the end of this page, you will understand the theory behind fuzzy consistency, how the ATT and DPT enable correct recovery from fuzzy state, the specific mechanisms that allow transactions to proceed during checkpoint, and how modern databases implement and optimize fuzzy checkpoints.
The term "fuzzy" captures the essential characteristic of these checkpoints: the relationship between the checkpoint record and the actual database state is imprecise.
What makes a checkpoint "fuzzy"?
In a consistent checkpoint, when the checkpoint record is written:

- Every dirty page has been flushed, so the on-disk state matches the checkpoint record exactly
- No transaction is mid-update, because updates were quiesced during the checkpoint

In a fuzzy checkpoint, when the checkpoint record is written:

- Some pages recorded as dirty in the DPT may already have been flushed; others remain dirty in memory
- Transactions are still running and modifying pages throughout the checkpoint

The fuzziness manifests in several ways:

- The DPT may over-approximate: a listed page may have been flushed moments after the snapshot
- Pages on disk may carry changes made after the checkpoint began, putting them "ahead" of the checkpoint's view
- The ATT may list transactions that committed immediately after the snapshot was taken
Fuzzy checkpoints don't prevent recovery from working—they just mean recovery has to do more work. The DPT provides an upper bound on what might need redo. The ATT provides an upper bound on what might need undo. Recovery starts with these bounds and refines them by reading the log.
The brilliance of fuzzy checkpoints lies in how recovery uses the imprecise information to restore a precise, consistent state.
Recovery's approach to fuzziness:
Recovery doesn't try to figure out the exact state at checkpoint time. Instead, it:
1. Assumes the worst case: every page in the DPT is treated as potentially needing redo, and every transaction in the ATT as potentially needing undo.
2. Replays the log to discover reality: by processing log records after the checkpoint, recovery learns what actually happened: which pages were actually modified, which transactions actually committed.
3. Applies corrections as needed: redo operations restore missing changes; undo operations reverse incomplete transactions.
The DPT and recovery:
```
-- How recovery uses the Dirty Page Table from fuzzy checkpoint

FUNCTION redo_phase(checkpoint):
    -- Find the earliest point that might need redo
    redo_lsn = MIN(page.recovery_lsn FOR page IN checkpoint.dirty_page_table)

    -- Scan log from this point forward
    FOR each log_record FROM redo_lsn TO end_of_log:
        page_id = log_record.page_id

        -- Check if this page might need the operation
        IF page_id IN checkpoint.dirty_page_table:
            page_recovery_lsn = checkpoint.dirty_page_table[page_id].recovery_lsn
            IF log_record.lsn >= page_recovery_lsn:
                -- This record might affect the page
                -- Read current page LSN from disk
                disk_page = read_page(page_id)
                IF log_record.lsn > disk_page.page_lsn:
                    -- Page on disk is older than this record
                    -- Must redo this operation
                    apply_redo(log_record, disk_page)
                    write_page(disk_page)
                ELSE:
                    -- Page on disk already has this change
                    -- (It was flushed after checkpoint)
                    -- Skip redo - already applied
                    SKIP
        ELSE:
            -- Page not in DPT at checkpoint
            -- But it might have been modified after checkpoint
            -- Still need to check and potentially redo
            -- (Page was clean at checkpoint but dirtied later)
            disk_page = read_page(page_id)
            IF log_record.lsn > disk_page.page_lsn:
                apply_redo(log_record, disk_page)
                write_page(disk_page)
```

Key concept: Page LSN comparison
Every database page stores the LSN of the last log record that modified it (the page_lsn). During redo:

- If the log record's LSN is greater than the page's page_lsn, the page is missing that change and the operation must be redone
- If the log record's LSN is less than or equal to the page_lsn, the change is already on the page and redo is skipped
This comparison is how recovery detects whether a page listed in the DPT actually needs redo. The "fuzziness" of whether the page was flushed is resolved by checking the page itself.
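The page-LSN test is small enough to state precisely. The sketch below is illustrative Python, not any engine's actual code; the function name and the LSN values are invented for the example:

```python
def needs_redo(record_lsn: int, page_lsn: int, recovery_lsn: int) -> bool:
    """Decide whether a redo-phase log record must be applied to a page.

    record_lsn   -- LSN of the log record being considered
    page_lsn     -- LSN stored in the on-disk page (last change applied to it)
    recovery_lsn -- first LSN that might need redo for this page (from the DPT)
    """
    if record_lsn < recovery_lsn:
        # Record predates the page's first unflushed modification,
        # so its effect was already on disk at checkpoint time.
        return False
    # Apply only if the page has not yet seen this change.
    return record_lsn > page_lsn

# A record newer than both bounds must be redone:
assert needs_redo(record_lsn=120, page_lsn=100, recovery_lsn=110)
# A page flushed after the checkpoint already carries the change:
assert not needs_redo(record_lsn=120, page_lsn=130, recovery_lsn=110)
```

Note that the "fuzziness" costs only a page read and a comparison per candidate record, never a wrong result.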
The ATT and recovery:
Similarly, the Active Transaction Table from the checkpoint provides a starting list of transactions that might need undo. As recovery processes the log:

- A commit record removes that transaction from the undo candidates
- A log record from a transaction not yet in the table adds it as a new candidate (it started after the checkpoint)
The ARIES recovery algorithm (used by most databases) follows a 'redo all, undo selectively' approach. Redo replays all operations to reconstruct the crash-time state. Then undo rolls back only uncommitted transactions. This works perfectly with fuzzy checkpoints because the checkpoint provides starting points, not final answers.
The defining feature of fuzzy checkpoints is that transactions continue running throughout the checkpoint process. Let's examine exactly what happens when transactions execute concurrently with checkpoint operations.
Timeline of a fuzzy checkpoint with concurrent transactions:
What happens to each transaction during the checkpoint:
| Transaction | Checkpoint Capture | Actual Outcome | Recovery Requirement |
|---|---|---|---|
| Txn A | Active in ATT (captured at t=1) | Committed at t=7 (after ATT capture) | Initially in undo list; removed when commit found in log |
| Txn B | Not in ATT (started at t=4, after capture) | Aborted at crash (t=18) | Discovered during log scan; added to undo list; rolled back |
| Txn C | Active in ATT (captured at t=1) | Committed at t=9 (after capture) | Initially in undo list; removed when commit found in log |
Critical observation:
The ATT snapshot is taken at a specific moment during the checkpoint (when the ATT lock is held). Transactions that:

- Are active at that moment appear in the snapshot (Transactions A and C)
- Start after that moment do not appear (Transaction B)
But the checkpoint doesn't know what happens afterward. Transaction A might commit; Transaction C might commit; new transactions might start. All of this is discovered during log replay, not from the checkpoint.
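The discovery process described above can be sketched in Python. This is a simplified illustration, not a real recovery implementation: record types are reduced to "update" and "commit", and the transaction ids are the A/B/C from the timeline:

```python
def discover_losers(checkpoint_att, log_after_checkpoint):
    """Refine the undo ('loser') list by scanning the log after the checkpoint.

    checkpoint_att       -- set of txn ids active when the ATT was captured
    log_after_checkpoint -- list of (txn_id, record_type) in log order
    """
    losers = set(checkpoint_att)      # worst case: all might need undo
    for txn_id, record_type in log_after_checkpoint:
        if record_type == "update":
            losers.add(txn_id)        # seen in log: a loser until proven committed
        elif record_type == "commit":
            losers.discard(txn_id)    # committed: no undo needed
    return losers

# Mirroring the timeline above: A and C were in the ATT, B started later.
log = [("B", "update"), ("A", "commit"), ("B", "update"), ("C", "commit")]
assert discover_losers({"A", "C"}, log) == {"B"}
```

The checkpoint supplies only the starting set; every later event is learned from the log, exactly as the table shows.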
The brief lock window:
The only blocking that occurs is during the ATT and DPT snapshot captures. These locks are held only long enough to copy the in-memory tables, typically microseconds to a few milliseconds.
During these brief windows, new transaction starts and new page modifications may be blocked. But the window is so short that it's imperceptible to most workloads.
You might wonder: why lock at all? The answer is consistency. Without a brief lock, the snapshots could be mutually inconsistent: the ATT copy might miss transaction T (whose entry was being added mid-copy) while the DPT copy includes pages T has already dirtied, leaving recovery with dirty pages it cannot attribute to any known transaction. The momentary lock ensures the snapshot is point-in-time consistent, which is essential for correct recovery.
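A minimal sketch of the snapshot-under-lock idea, in Python with illustrative names. For brevity the sketch takes the lock only in `snapshot()`; a real engine would also hold it on every ATT/DPT mutation:

```python
import threading

class CheckpointTables:
    """Runtime ATT/DPT with a brief lock for point-in-time snapshots."""

    def __init__(self):
        self._lock = threading.Lock()
        self.att = {}   # txn_id -> state
        self.dpt = {}   # page_id -> recovery_lsn

    def snapshot(self):
        # Hold the lock only long enough to copy both tables,
        # so the pair is mutually consistent.
        with self._lock:
            return dict(self.att), dict(self.dpt)

tables = CheckpointTables()
tables.att[1001] = "ACTIVE"
tables.dpt[42] = 12480000
att_snap, dpt_snap = tables.snapshot()

# Later modifications do not leak into the captured snapshot.
tables.att[1005] = "ACTIVE"
assert 1005 not in att_snap
assert att_snap == {1001: "ACTIVE"}
```

The copies are what get written into the checkpoint record; the runtime tables keep evolving the instant the lock is released.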
The Dirty Page Table (DPT) in a fuzzy checkpoint requires careful management. It must accurately reflect which pages might need redo, accounting for both the flushing lag and the checkpoint capture timing.
DPT lifecycle:
```
-- DPT management during fuzzy checkpoint

-- Runtime DPT structure (maintained continuously)
RUNTIME_DPT = {
    entries: Map<PageID, DPTEntry>
}

-- When a page is first modified in buffer pool
FUNCTION on_page_modified(page_id, modifying_lsn):
    IF page_id NOT IN runtime_dpt.entries:
        -- First modification; record recovery_lsn
        runtime_dpt.entries[page_id] = DPTEntry {
            page_id: page_id,
            recovery_lsn: modifying_lsn,
            -- recovery_lsn is the first LSN that might need redo
        }
    -- If already in DPT, don't change recovery_lsn
    -- (first modification is what matters)

-- When a page is flushed to disk
FUNCTION on_page_flushed(page_id):
    -- Page is now clean; remove from DPT
    DELETE runtime_dpt.entries[page_id]
    -- Note: The page remembers its page_lsn on disk
    -- This allows recovery to skip already-applied redo

-- Checkpoint DPT snapshot (brief lock held)
FUNCTION capture_dpt_snapshot():
    ACQUIRE_LOCK(runtime_dpt)
    checkpoint_dpt = COPY(runtime_dpt.entries)
    RELEASE_LOCK(runtime_dpt)
    RETURN checkpoint_dpt

-- Compute redo_lsn from DPT
FUNCTION compute_redo_lsn(checkpoint_dpt):
    IF checkpoint_dpt IS EMPTY:
        -- No dirty pages; redo starts from checkpoint
        RETURN checkpoint_lsn
    ELSE:
        -- Redo starts from oldest recovery_lsn
        RETURN MIN(entry.recovery_lsn FOR entry IN checkpoint_dpt)
```

The recovery_lsn optimization:
The recovery_lsn stored for each dirty page is critical for recovery efficiency. Consider a page modified by 1000 log records. If we stored all 1000 LSNs, the DPT would be enormous. Instead, we store only the first—that's where redo starts for this page. Redo will naturally replay all 1000 modifications in order.
Redo_lsn computation:
The checkpoint's redo_lsn is the minimum recovery_lsn across all dirty pages. This is where redo begins after a crash.

If background flushing is aggressive:

- Dirty pages are flushed soon after modification, so the DPT stays small and the oldest recovery_lsn stays recent; redo scans only a short stretch of log

If background flushing is slow:

- Dirty pages accumulate, the oldest recovery_lsn falls far behind the checkpoint, and redo must scan a much longer stretch of log
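The DPT lifecycle and redo_lsn computation above can be restated as runnable Python. This is a teaching sketch with invented names and LSNs, not a production structure:

```python
class DirtyPageTable:
    """Minimal DPT sketch: first-modification LSN per dirty page."""

    def __init__(self):
        self.entries = {}  # page_id -> recovery_lsn

    def on_page_modified(self, page_id, lsn):
        # Keep only the FIRST modifying LSN; later updates don't move it.
        self.entries.setdefault(page_id, lsn)

    def on_page_flushed(self, page_id):
        # Page is clean again; drop it from the table.
        self.entries.pop(page_id, None)

    def redo_lsn(self, checkpoint_lsn):
        # Redo starts at the oldest recovery_lsn, or at the
        # checkpoint itself if nothing is dirty.
        return min(self.entries.values(), default=checkpoint_lsn)

dpt = DirtyPageTable()
dpt.on_page_modified(7, 100)
dpt.on_page_modified(7, 250)   # ignored: recovery_lsn stays 100
dpt.on_page_modified(9, 180)
assert dpt.redo_lsn(checkpoint_lsn=300) == 100
dpt.on_page_flushed(7)          # aggressive flushing advances redo_lsn
assert dpt.redo_lsn(checkpoint_lsn=300) == 180
```

The final two assertions demonstrate the flushing effect described above: writing out the oldest dirty page immediately moves redo_lsn forward.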
In a large database with many dirty pages, the DPT can be substantial—potentially containing millions of entries. The checkpoint must write all of this to the log. Systems optimize by keeping DPT compact (using page IDs and LSNs only) and by aggressively flushing pages to reduce DPT size.
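As a rough illustration of that size concern, here is a back-of-the-envelope estimate in Python. The 16-bytes-per-entry figure (an 8-byte page id plus an 8-byte LSN) is an assumption for the sketch; real on-log formats add headers and padding:

```python
def dpt_checkpoint_bytes(n_entries, bytes_per_entry=16):
    """Rough size of the DPT payload written with a checkpoint record.

    Assumes a compact entry: 8-byte page id + 8-byte recovery LSN.
    """
    return n_entries * bytes_per_entry

# Two million dirty pages -> roughly 32 MB of checkpoint payload.
assert dpt_checkpoint_bytes(2_000_000) == 32_000_000
```

Even at this compact encoding, millions of entries mean tens of megabytes per checkpoint, which is why aggressive flushing to shrink the DPT pays off twice: smaller checkpoint writes and less log to redo.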
The Active Transaction Table (ATT) in a fuzzy checkpoint captures the state of all running transactions at the snapshot moment. This information guides the undo phase of recovery.
ATT entry contents:
```
-- ATT entry for each active transaction
ATT_ENTRY = {
    txn_id: TransactionID,   -- Unique transaction identifier
    state: ENUM {            -- Transaction's current state
        ACTIVE,              -- Currently executing
        PREPARING,           -- In 2PC prepare phase
        COMMITTED,           -- Commit record written (but may not be hardened)
        ABORTED              -- Abort decided (but rollback may be in progress)
    },
    last_lsn: LSN,           -- Most recent log record written by this txn
    first_lsn: LSN,          -- First log record written by this txn
                             -- (Useful for log truncation boundary)
    undo_nxt_lsn: LSN        -- Next log record to undo (for partial rollback)
                             -- Used when txn was rolling back at checkpoint time
}

-- Example ATT at checkpoint time
CHECKPOINT_ATT = [
    { txn_id: 1001, state: ACTIVE,    last_lsn: 12500400, first_lsn: 12480000, undo_nxt_lsn: NULL },
    { txn_id: 1005, state: ACTIVE,    last_lsn: 12500450, first_lsn: 12495000, undo_nxt_lsn: NULL },
    { txn_id: 1007, state: PREPARING, last_lsn: 12500480, first_lsn: 12500100, undo_nxt_lsn: NULL },
    { txn_id: 1009, state: ABORTED,   last_lsn: 12500300, first_lsn: 12490000, undo_nxt_lsn: 12500250 }
    -- Txn 1009 was rolling back at checkpoint time; undo continues from LSN 12500250
]
```

How recovery uses the ATT:
Initialize undo candidates: At the start of recovery, the ATT from the checkpoint becomes the initial "loser" list—transactions that might need to be undone.
Update during redo: As redo scans the log, it discovers:

- Commit records, which remove the corresponding transactions from the loser list
- Log records from transactions not in the checkpoint ATT, which add new loser candidates (like Txn 2000 in the table below)
Final loser list: After redo completes, the remaining transactions in the loser list are those that were running at crash time and never committed. These need undo.
Perform undo: Each loser transaction is rolled back by scanning backward through its log records.
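The backward scan in step 4 follows each transaction's chain of prev_lsn pointers, the standard per-transaction backlink kept in write-ahead log records. A hedged sketch with an invented log layout:

```python
def undo_order(log, last_lsn):
    """Walk a loser transaction's log records backward via prev_lsn links.

    log      -- dict: lsn -> {"prev_lsn": lsn or None, ...}
    last_lsn -- the transaction's most recent LSN (from the ATT or log scan)
    Returns the LSNs in the order undo would process them.
    """
    order = []
    lsn = last_lsn
    while lsn is not None:
        order.append(lsn)
        lsn = log[lsn]["prev_lsn"]
    return order

# A transaction that wrote records at LSNs 100, 200, 300:
log = {
    100: {"prev_lsn": None},
    200: {"prev_lsn": 100},
    300: {"prev_lsn": 200},
}
assert undo_order(log, 300) == [300, 200, 100]
```

This is why the ATT stores last_lsn (where the backward walk begins) and undo_nxt_lsn (where it resumes for a transaction that was already mid-rollback).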
| Phase | Txn 1001 | Txn 1005 | Txn 1007 | Txn 1009 | Txn 2000 (new) |
|---|---|---|---|---|---|
| Checkpoint ATT | Active | Active | Preparing | Aborting | — |
| After redo (found commit T1001) | Committed ✓ | Active | Preparing | Aborted | Active |
| After redo (found commit T1007) | Committed ✓ | Active | Committed ✓ | Aborted | Active |
| End of redo | Not in loser list | In loser list | Not in loser list | In loser list | In loser list |
| After undo | — | Undone | — | Undone | Undone |
The PREPARING state handles distributed transactions (2-phase commit). A transaction in PREPARING has passed the point of no return—it cannot unilaterally abort. Recovery must ask the coordinator whether to commit or abort. This is a special case handled after normal undo completes.
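Assuming checkpoint ATT entries carry the states shown earlier, the split between ordinary losers and in-doubt transactions can be sketched as follows (the txn ids reuse the example ATT above):

```python
def classify_for_undo(att):
    """Split checkpoint ATT entries by how recovery must treat them.

    att -- dict: txn_id -> state string ("ACTIVE", "PREPARING", "ABORTED", ...)
    """
    # ACTIVE and ABORTED (rollback in progress) need undo unless a
    # later commit record clears them during redo.
    undo_candidates = [t for t, s in att.items() if s in ("ACTIVE", "ABORTED")]
    # PREPARING has passed the point of no return; only the 2PC
    # coordinator can decide its fate.
    in_doubt = [t for t, s in att.items() if s == "PREPARING"]
    return sorted(undo_candidates), sorted(in_doubt)

att = {1001: "ACTIVE", 1005: "ACTIVE", 1007: "PREPARING", 1009: "ABORTED"}
undo, in_doubt = classify_for_undo(att)
assert undo == [1001, 1005, 1009]
assert in_doubt == [1007]   # resolved by asking the coordinator
```

In-doubt transactions hold their locks until the coordinator answers, which is why 2PC outages can block other work even after local recovery completes.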
Fuzzy checkpoints work hand-in-hand with background page flushing. The checkpoint establishes recovery points, while background flushing keeps dirty page accumulation under control. Together, they bound recovery time without impacting transactions.
The relationship:
Background flushing keeps pages recent: By continuously flushing old dirty pages, the oldest recovery_lsn in the DPT stays recent.
Checkpoints capture the current state: When a checkpoint runs, it captures the DPT as it currently stands—shaped by background flushing.
Recovery time depends on both: Recovery time ≈ (checkpoint interval) + (oldest dirty page age). Both checkpoint frequency and flushing aggressiveness matter.
Flushing policy and recovery time:
| Flushing Policy | Typical DPT Size | Oldest recovery_lsn | Recovery Time Impact |
|---|---|---|---|
| No background flushing | Very large (all modified pages) | Potentially hours old | Recovery scans hours of log |
| Lazy flushing (LRU-based) | Medium (recently modified only) | Minutes to hours old | Recovery scans minutes to hours |
| Aggressive flushing (age target) | Small (only very recent changes) | Seconds to minutes old | Recovery completes in seconds |
| Write-through (no caching) | Zero (nothing dirty) | N/A | Minimal redo needed (but slow runtime) |
Checkpoint-triggered flushing acceleration:
Some systems temporarily increase background flushing intensity when a checkpoint is approaching or in progress. The goal is to:

- Shrink the DPT, keeping the checkpoint record compact
- Advance the oldest recovery_lsn, so post-crash redo has less log to scan
This acceleration is carefully limited to avoid overwhelming I/O capacity:
Flushing rate = base_rate + (checkpoint_proximity × boost_factor)
Where checkpoint_proximity is 0 far from checkpoints and approaches 1 as the checkpoint begins.
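The formula above can be written directly as a function; the specific rates below are invented for illustration:

```python
def flushing_rate(base_rate, checkpoint_proximity, boost_factor):
    """Flushing target (e.g., pages/second) near a checkpoint.

    checkpoint_proximity is 0.0 far from a checkpoint and approaches
    1.0 as the checkpoint begins.
    """
    assert 0.0 <= checkpoint_proximity <= 1.0
    return base_rate + checkpoint_proximity * boost_factor

assert flushing_rate(1000, 0.0, 800) == 1000   # steady state
assert flushing_rate(1000, 0.5, 800) == 1400   # checkpoint approaching
assert flushing_rate(1000, 1.0, 800) == 1800   # checkpoint starting
```

Capping boost_factor relative to the device's provisioned IOPS is what keeps the acceleration from overwhelming I/O capacity.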
Your checkpoint frequency and flushing intensity should be tuned to meet your Recovery Time Objective (RTO). If your RTO is 60 seconds, you need checkpoints and flushing frequent enough to ensure redo_lsn is never more than ~50 seconds behind (leaving margin for redo execution time).
Modern database systems all implement fuzzy checkpoints, though terminology and details vary. Let's examine how major systems approach checkpointing.
PostgreSQL:
PostgreSQL uses a checkpoint process that:

- Triggers on a timer (checkpoint_timeout) or on accumulated WAL volume (max_wal_size)
- Flushes all dirty shared buffers, spreading the writes over most of the interval (checkpoint_completion_target)
- Writes a checkpoint record to the WAL once flushing completes
PostgreSQL's checkpoint is slightly less "fuzzy" than ARIES—it does flush all dirty buffers, but does so without stopping transactions.
MySQL/InnoDB:
InnoDB uses a fuzzy checkpoint mechanism with:
- Adaptive flushing that paces page writes against redo log production (innodb_adaptive_flushing)
- A dirty-page percentage target for background flushing (innodb_max_dirty_pages_pct)
- innodb_log_checkpoint_now for forced checkpoints (debug builds)

SQL Server:
SQL Server implements:
- Indirect checkpoints driven by a declared recovery-time target (TARGET_RECOVERY_TIME)
- CHECKPOINT command for manual triggering
```
-- PostgreSQL checkpoint configuration
-- postgresql.conf settings

checkpoint_timeout = 5min            -- Max time between checkpoints
checkpoint_completion_target = 0.9   -- Spread I/O over 90% of interval
max_wal_size = 1GB                   -- Trigger checkpoint at this WAL size
checkpoint_warning = 30s             -- Warn if checkpoints too frequent

-- View checkpoint statistics
SELECT * FROM pg_stat_bgwriter;

-- Force checkpoint
CHECKPOINT;

--------------------------------------------------------------

-- MySQL/InnoDB checkpoint configuration
-- my.cnf settings

innodb_log_file_size = 1G            -- WAL file size
innodb_max_dirty_pages_pct = 75      -- Background flush below this
innodb_adaptive_flushing = ON        -- Adaptive checkpoint-aware flushing
innodb_io_capacity = 2000            -- IOPS budget for background writes

-- View checkpoint information
SHOW ENGINE INNODB STATUS;

--------------------------------------------------------------

-- SQL Server indirect checkpoint configuration

-- Set target recovery time (controls checkpoint frequency)
ALTER DATABASE [MyDatabase] SET TARGET_RECOVERY_TIME = 60 SECONDS;

-- View checkpoint statistics
SELECT * FROM sys.dm_db_log_stats(DB_ID());

-- Force checkpoint
CHECKPOINT;
```

Common tuning considerations across systems:
Recovery time vs. checkpoint overhead: More frequent checkpoints mean faster recovery but more I/O overhead during normal operation.
I/O smoothing: Spreading checkpoint writes over time prevents I/O spikes. PostgreSQL's checkpoint_completion_target and InnoDB's adaptive flushing address this.
Container/cloud environments: In virtualized environments, I/O may be less predictable. Systems may need more conservative settings to avoid exceeding provisioned IOPS.
SSD vs. HDD: SSDs tolerate random I/O better, allowing more aggressive flushing and less concern about checkpoint I/O patterns.
SQL Server's 'indirect checkpoints' are an advanced approach that specifies a target recovery time. The system automatically adjusts checkpoint frequency and flushing intensity to meet this target. This shifts checkpoint tuning from manual parameter adjustment to declaring an SLA.
Fuzzy checkpoints are the enabling mechanism for high-availability database systems. They provide the recovery benefits of checkpoints without the performance costs of blocking transactions. Let's consolidate the key concepts:

- Fuzzy checkpoints record the ATT and DPT without requiring the on-disk state to match the checkpoint record
- The DPT bounds what might need redo; the ATT bounds what might need undo; recovery refines both by scanning the log
- Page LSN comparison resolves whether each page actually needs a given redo operation
- Only the brief snapshot locks block transactions; everything else proceeds concurrently
- Background flushing works with checkpoints to keep redo_lsn recent and recovery time bounded
What's next:
Now that we understand how fuzzy checkpoints work, we'll examine checkpoint frequency—how to determine the optimal interval between checkpoints, balancing recovery time requirements against operational overhead.
You now have a deep understanding of fuzzy checkpoints—the mechanism that makes high-availability database systems possible. You understand the theory of fuzziness, the mechanics of ATT and DPT management, the integration with background flushing, and how major databases implement checkpointing. Next, we'll explore how to choose the right checkpoint frequency.