Loading content...
Checkpoints serve a critical purpose in database recovery: they establish known-good points from which recovery can begin, bounding the amount of log that must be scanned and the work that must be redone. However, naive checkpointing requires stopping all transaction processing while dirty pages are flushed—an unacceptable interruption for high-throughput systems.
ARIES resolves this paradox with fuzzy checkpoints: checkpoints that record the current state of the system's bookkeeping without requiring the actual data to be in any particular state. This allows checkpoints to complete in milliseconds rather than seconds, with transaction processing continuing uninterrupted.
By the end of this page, you will understand why checkpoints are necessary for bounded recovery, the problems with traditional 'sharp' checkpoints, how fuzzy checkpoints work in ARIES, what information is recorded during a checkpoint, and how recovery uses checkpoint data to minimize work.
Without checkpoints, crash recovery would need to scan the entire log from the beginning of time and redo all operations that might not be on disk. For a database that has been running for years, this could mean processing millions or billions of log records—taking hours or days to recover.
The Fundamental Recovery Problem:
Recovery needs to answer: "Which log records' effects might not be on disk?" Without checkpoints, the conservative answer is "potentially any of them, going back to database creation."
What Checkpoints Provide:
Recovery Starting Point: A checkpoint records which pages were dirty at checkpoint time and the earliest LSN that might need to be redone. Recovery can start from this point, not from the beginning.
Bounded Recovery Time: With regular checkpoints, recovery time is proportional to work since the last checkpoint, not total database lifetime.
Log Truncation: Log records before the checkpoint's starting point can be safely discarded (archived). This bounds log storage requirements.
Active Transaction State: The checkpoint records which transactions were active, enabling efficient construction of the undo list.
Recovery Without Checkpoints: Database operational for: 5 yearsLog accumulated: 10 TBLog records: 50 billion CRASH occurs! Recovery must potentially:1. Read 10 TB of log data2. Analyze 50 billion log records 3. Determine which are already on disk4. Redo/Undo as necessary Estimated recovery time: 48+ hoursSystem unavailable: 2 days ═══════════════════════════════════════════════════════════════ Recovery WITH Checkpoints (every 5 minutes): Checkpoint occurred: 3 minutes before crashLog since checkpoint: ~200 MBLog records since checkpoint: ~1 million Recovery must:1. Read checkpoint record (which pages were dirty, active txns)2. Read 200 MB of log since checkpoint3. Analyze ~1 million log records4. Redo/Undo from checkpoint point Estimated recovery time: 30 secondsSystem unavailable: < 1 minute ═══════════════════════════════════════════════════════════════ Improvement: 5,000x faster recoveryModern systems often have Recovery Time Objectives (RTOs) measured in seconds or minutes. Meeting a 5-minute RTO is impossible without checkpoints. Checkpoint frequency directly controls the maximum recovery time.
The simplest form of checkpoint is a sharp (or consistent) checkpoint. It creates a point in time where all committed data is guaranteed to be on disk. While conceptually simple, it's impractical for production systems.
Sharp Checkpoint Procedure:
The problem is obvious: steps 1-3 can take seconds or minutes, during which the database is unavailable.
Why Sharp Checkpoints Are Impractical:
Buffer Pool Size: Modern buffer pools hold hundreds of thousands of dirty pages. Flushing all of them takes 30-60 seconds even on fast storage.
Long-Running Transactions: If a checkpoint must wait for active transactions to complete, a single long-running query blocks the entire checkpoint.
Transaction Throughput: Any pause in transaction processing cascades through connection pools, application queues, and user experience.
SLA Violations: A 60-second checkpoint pause every 5 minutes means 20% unavailability—completely unacceptable.
Fuzzy Checkpoints solve this by changing what a checkpoint means. Instead of guaranteeing that all data is on disk, a fuzzy checkpoint records where recovery should start looking and what it will find.
The term 'fuzzy' reflects that the checkpoint doesn't represent a clean, consistent point in time. The database state at checkpoint time is 'fuzzy'—some pages might be written, others not—but recovery has enough information to handle this fuzziness correctly.
A fuzzy checkpoint captures a snapshot of the system's bookkeeping structures—information that tells recovery what state the database might be in. The checkpoint does NOT guarantee that any particular pages are on disk.
Critical Information Recorded:
| Component | What It Contains | How Recovery Uses It |
|---|---|---|
| Transaction Table | Active transaction IDs, their states, and lastLSNs | Identifies transactions needing UNDO (didn't commit before crash) |
| Dirty Page Table | Page IDs of dirty pages and their recLSNs | Determines the earliest LSN from which REDO must start |
| Checkpoint Begin LSN | LSN when checkpoint started | May need to scan log before this for some info |
| Checkpoint End LSN | LSN when checkpoint completed | Recovery knows last reliable checkpoint data |
The Transaction Table:
For each active transaction at checkpoint time:
This allows recovery to quickly identify which transactions were in-flight and might need to be rolled back.
The Dirty Page Table:
For each dirty page in the buffer pool at checkpoint time:
The recLSN is crucial: it tells recovery the earliest point from which this page might need redo. The minimum recLSN across all dirty pages determines where the REDO phase starts.
Fuzzy Checkpoint Record Structure: ┌──────────────────────────────────────────────────────────────────────┐│ CHECKPOINT BEGIN RECORD ││ LSN: 5000 │└──────────────────────────────────────────────────────────────────────┘ ┌──────────────────────────────────────────────────────────────────────┐│ TRANSACTION TABLE │├──────────────────────────────────────────────────────────────────────┤│ TransactionID │ State │ LastLSN │ UndoNextLSN │├────────────────┼──────────┼─────────┼────────────────────────────────┤│ T101 │ Active │ 4950 │ 4950 ││ T102 │ Active │ 4980 │ 4980 ││ T103 │ Prepared │ 4920 │ 4900 │└──────────────────────────────────────────────────────────────────────┘ ┌──────────────────────────────────────────────────────────────────────┐│ DIRTY PAGE TABLE │├──────────────────────────────────────────────────────────────────────┤│ PageID │ RecLSN │ Notes │├───────────┼──────────┼──────────────────────────────────────────────┤│ Page 42 │ 4500 │ Dirtied at LSN 4500, not yet flushed ││ Page 157 │ 4800 │ Dirtied at LSN 4800, not yet flushed ││ Page 301 │ 4200 │ Dirtied at LSN 4200, not yet flushed ││ Page 509 │ 4900 │ Dirtied at LSN 4900, not yet flushed │└──────────────────────────────────────────────────────────────────────┘ ┌──────────────────────────────────────────────────────────────────────┐│ CHECKPOINT END RECORD ││ LSN: 5010 │└──────────────────────────────────────────────────────────────────────┘ MINIMUM RecLSN = 4200 (from Page 301)═══════════════════════════════════════ Recovery REDO phase will start scanning from LSN 4200 (not 5000!)This is because Page 301 might have modifications at LSN 4200that haven't been flushed to disk yet.The recLSN (recovery LSN) for a page is the LSN when the page first became dirty after its last flush to disk. This is the earliest point where modifications to this page might not be on disk. Recovery must redo from this point for this page.
The beauty of fuzzy checkpoints is their simplicity and non-intrusiveness. The entire procedure involves writing log records and copying in-memory data structures—no waiting for page flushes.
Step-by-Step Procedure:
Critical Observation:
Notice what's NOT in this procedure:
The entire checkpoint can complete in milliseconds because it's just writing a few log records. Meanwhile, new transactions start, existing transactions continue, and pages are modified—none of this affects the checkpoint.
The Trade-off:
Because we don't flush pages, some dirty pages from before the checkpoint might not be on disk. Recovery handles this by checking each page's pageLSN during redo. If pageLSN < log record's LSN, the log record must be redone.
1234567891011121314151617181920212223242526272829303132333435363738
function performFuzzyCheckpoint() { // Step 1: Write begin record beginLSN = logManager.append(CHECKPOINT_BEGIN); // Step 2: Snapshot transaction table // Brief lock to get consistent snapshot txnTable.readLock(); txnTableCopy = txnTable.deepCopy(); txnTable.readUnlock(); // Step 3: Snapshot dirty page table dirtyPageTable.readLock(); dptCopy = dirtyPageTable.deepCopy(); dirtyPageTable.readUnlock(); // Transactions and page modifications continue uninterrupted // during steps 2-3 (we just take snapshots) // Step 4: Write end record with snapshot data endRecord = CHECKPOINT_END { transactionTable: txnTableCopy, dirtyPageTable: dptCopy }; endLSN = logManager.append(endRecord); // Step 5: Flush log to stable storage logManager.flushUpToLSN(endLSN); // Step 6: Update master record so recovery can find this checkpoint masterRecord.update(lastCheckpointLSN: beginLSN); masterRecord.flush(); printf("Checkpoint complete: LSN %d to %d, %d active txns, %d dirty pages", beginLSN, endLSN, txnTableCopy.size, dptCopy.size);} // Typical execution time: 5-50 milliseconds// No impact on transaction throughputBecause fuzzy checkpoints are so cheap, they can run frequently—every few minutes or even more often. More frequent checkpoints mean faster recovery. The trade-off is slightly more log space for checkpoint records, which is negligible.
The genius of fuzzy checkpoints becomes clear when we examine how recovery uses the checkpoint data. Despite the checkpoint not guaranteeing any particular disk state, recovery can efficiently rebuild the necessary information.
Recovery Initialization:
Determining the REDO Start Point:
The checkpoint provides a Dirty Page Table with recLSNs. The minimum recLSN across all entries is the earliest point where any page modification might not be on disk. REDO must start from this LSN.
This is typically before the checkpoint itself, because some pages were dirty before the checkpoint was taken. But the checkpoint bounds how far back we need to look.
Example:
Without the checkpoint, we might need to start from LSN 0 (beginning of the log).
Recovery Process with Fuzzy Checkpoint: Timeline:═════════════════════════════════════════════════════════════════════Log: │...│4200│4201│...│4800│...│4900│...│5000│5010│...│5500│CRASH! │ │ │ │ │ │ │ │ │ CP │ CP │ │ │ │ │ │ │ │ │ │ │ │BEGIN│END │ │ │═════════════════════════════════════════════════════════════════════ Checkpoint at LSN 5000-5010 recorded: - Dirty Page Table: Page 301 (recLSN=4200), Page 42 (recLSN=4800), Page 509 (recLSN=4900) - Transaction Table: T101 (active), T102 (active) ═══════════════════════════════════════════════════════════════════════ ANALYSIS PHASE:1. Load checkpoint data (Transaction Table, Dirty Page Table)2. Scan log from LSN 5010 to 5500 (crash point)3. Update Transaction Table and Dirty Page Table with post-checkpoint changes4. Final state: know exactly which transactions need UNDO, which pages might need REDO REDO PHASE:1. Start at min(recLSN) = 4200 (from checkpoint's DPT)2. Scan forward to LSN 55003. For each log record: - Fetch page - If pageLSN < log record LSN: apply the redo - If pageLSN >= log record LSN: skip (already on disk) UNDO PHASE:1. Transaction Table shows T101 and T102 were active2. Analysis phase found no commit records for them3. Roll back their changes using log records TOTAL LOG SCANNED: 4200 to 5500 (1300 records)WITHOUT CHECKPOINT: 0 to 5500 (5500 records) - 4x more work!The checkpoint provides initial Transaction Table and Dirty Page Table values, but the Analysis phase updates these while scanning the log from checkpoint to crash. Pages dirtied after the checkpoint are added; committed transactions are removed from the undo list.
While fuzzy checkpoints don't require page flushing, practical systems typically run background page flushing to ensure dirty pages eventually reach disk. This is separate from checkpointing but interacts with it.
Why Background Flushing Matters:
Bounding Recovery REDO: If a page has recLSN from 3 hours ago, all log records since then might need to be redone for that page. Background flushing keeps recLSNs relatively recent.
Log Truncation: We can only truncate log records older than the minimum recLSN. Background flushing advances this minimum.
Crash Safety: The more pages flushed, the less REDO work after a crash.
Buffer Pool Efficiency: Pre-emptively flushing dirty pages means they can be evicted immediately when needed, rather than requiring an unexpected flush.
Checkpoint-Triggered Flushing:
Some systems trigger background flushing after checkpoints to "chase" checkpoint LSN:
This is NOT part of the checkpoint itself—it happens asynchronously after the checkpoint completes.
12345678910111213141516171819202122232425262728293031323334353637
// Background page flushing process (runs continuously)function backgroundPageFlusher() { while (systemRunning) { // Find oldest dirty pages (lowest recLSNs) candidates = dirtyPageTable.getOldestEntries(count: 100); for (page in candidates) { // Acquire page latch page.latchShared(); if (page.isDirty) { // Ensure WAL protocol: log flushed before page logManager.flushUpToLSN(page.pageLSN); // Write page to disk diskManager.writePageAsync(page); // When write completes, remove from dirty page table onWriteComplete(() => { dirtyPageTable.remove(page.id); page.clearDirty(); }); } page.unlatchShared(); } // Rate limit to avoid overwhelming I/O sleep(backgroundFlushInterval); }} // Benefits:// 1. Keeps recovery window bounded// 2. Enables log truncation// 3. Reduces steal pressure (eviction of dirty pages)// 4. Smooth I/O load vs. bursty checkpoint-time flushingARIES cleanly separates checkpoint (recording state) from page flushing (getting data to disk). The checkpoint doesn't flush pages; background processes do. This separation allows each to be optimized independently.
Implementing fuzzy checkpoints in production systems involves several practical considerations beyond the basic algorithm.
Dirty Page Table Compression:
With buffer pools of 100GB+ and page sizes of 8KB, you might have 13 million pages. If 10% are dirty, the Dirty Page Table has 1.3 million entries. At ~16 bytes per entry, that's 20MB just for the checkpoint record.
Optimizations include:
Log Space Management:
Checkpoints enable log truncation: log records older than the recovery starting point can be discarded (or archived). The recovery starting point is the minimum of:
Aggressive background flushing keeps #1 recent. Short transactions keep #2 recent. Together, they bound log size.
A single long-running transaction can prevent log truncation because its begin LSN is old. Systems must either enforce transaction duration limits or accept the log space cost of long transactions.
Fuzzy checkpoints exemplify ARIES's philosophy of decoupling logical requirements from physical constraints. The logical requirement is bounded recovery time; the physical constraint was that traditional checkpoints required halting the system. ARIES simply changed what checkpoints record, achieving the goal without the cost.
What's Next:
ARIES supports not just flat transactions but also nested transactions (savepoints). The next page examines how ARIES handles partial rollback, nested abort, and the subtleties of multi-level transaction recovery.
You now understand fuzzy checkpoints—the technique that allows ARIES to bound recovery time without disrupting normal transaction processing. This is essential for meeting recovery time objectives in production systems. Next, we'll explore ARIES's support for nested transactions.