Aries Features - Learning Module

Fuzzy Checkpoints

The Checkpoint Paradox

Checkpoints serve a critical purpose in database recovery: they establish known-good points from which recovery can begin, bounding the amount of log that must be scanned and the work that must be redone. However, naive checkpointing requires stopping all transaction processing while dirty pages are flushed—an unacceptable interruption for high-throughput systems.

ARIES resolves this paradox with fuzzy checkpoints: checkpoints that record the current state of the system's bookkeeping without requiring the actual data to be in any particular state. This allows checkpoints to complete in milliseconds rather than seconds, with transaction processing continuing uninterrupted.

What You Will Learn

By the end of this page, you will understand why checkpoints are necessary for bounded recovery, the problems with traditional 'sharp' checkpoints, how fuzzy checkpoints work in ARIES, what information is recorded during a checkpoint, and how recovery uses checkpoint data to minimize work.

Why Checkpoints Are Essential

Without checkpoints, crash recovery would need to scan the entire log from the beginning of time and redo all operations that might not be on disk. For a database that has been running for years, this could mean processing millions or billions of log records—taking hours or days to recover.

The Fundamental Recovery Problem:

Recovery needs to answer: "Which log records' effects might not be on disk?" Without checkpoints, the conservative answer is "potentially any of them, going back to database creation."

What Checkpoints Provide:

Recovery Starting Point: A checkpoint records which pages were dirty at checkpoint time and the earliest LSN that might need to be redone. Recovery can start from this point, not from the beginning.
Bounded Recovery Time: With regular checkpoints, recovery time is proportional to work since the last checkpoint, not total database lifetime.
Log Truncation: Log records older than the recovery starting point implied by the checkpoint can be safely truncated from the active log (or moved to an archive). This bounds log storage requirements.
Active Transaction State: The checkpoint records which transactions were active, enabling efficient construction of the undo list.

recovery_without_checkpoints.md
Recovery Without Checkpoints:
 
Database operational for: 5 years
Log accumulated: 10 TB
Log records: 50 billion
 
CRASH occurs!
 
Recovery must potentially:
1. Read 10 TB of log data
2. Analyze 50 billion log records  
3. Determine which are already on disk
4. Redo/Undo as necessary
 
Estimated recovery time: 48+ hours
System unavailable: 2 days
 
═══════════════════════════════════════════════════════════════
 
Recovery WITH Checkpoints (every 5 minutes):
 
Checkpoint occurred: 3 minutes before crash
Log since checkpoint: ~200 MB
Log records since checkpoint: ~1 million
 
Recovery must:
1. Read checkpoint record (which pages were dirty, active txns)
2. Read 200 MB of log since checkpoint
3. Analyze ~1 million log records
4. Redo/Undo from checkpoint point
 
Estimated recovery time: 30 seconds
System unavailable: < 1 minute
 
═══════════════════════════════════════════════════════════════
 
Improvement: roughly 5,000x faster recovery (48 hours vs. 30 seconds)

Recovery Time Objectives

Modern systems often have Recovery Time Objectives (RTOs) measured in seconds or minutes. In the scenario above, a 5-minute RTO cannot be met without checkpoints. Checkpoint frequency, together with how promptly dirty pages are flushed, controls the maximum recovery time.

Sharp vs. Fuzzy Checkpoints

The simplest form of checkpoint is a sharp (or consistent) checkpoint. It creates a point in time where all committed data is guaranteed to be on disk. While conceptually simple, it's impractical for production systems.

Sharp Checkpoint Procedure:

1. Stop accepting new transactions
2. Wait for all active transactions to complete
3. Flush all dirty pages to disk
4. Write a checkpoint record to the log
5. Resume transaction processing

The problem is obvious: steps 1-3 can take seconds or minutes, during which the database is unavailable.

Sharp Checkpoints

•Must stop transaction processing
•Wait for all dirty pages to flush
•Guaranteed consistent on-disk state
•Downtime: seconds to minutes
•Recovery starts after checkpoint
•Simple recovery model
•Impractical for OLTP systems

Fuzzy Checkpoints

•Transactions continue uninterrupted
•No pages flushed during checkpoint
•Records current system state only
•Downtime: zero
•Recovery uses recorded metadata
•More complex recovery model
•Essential for production systems

Why Sharp Checkpoints Are Impractical:

Buffer Pool Size: Modern buffer pools can hold hundreds of thousands of dirty pages. Flushing all of them can take tens of seconds, even on fast storage.
Long-Running Transactions: If a checkpoint must wait for active transactions to complete, a single long-running query blocks the entire checkpoint.
Transaction Throughput: Any pause in transaction processing cascades through connection pools, application queues, and user experience.
SLA Violations: A 60-second checkpoint pause every 5 minutes means 20% unavailability—completely unacceptable.
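The 20% figure follows directly from the pause length and the checkpoint interval. A quick check using the numbers from the list above (the function name is illustrative, not taken from any real system):

```python
def unavailability_fraction(pause_seconds: float, interval_seconds: float) -> float:
    """Fraction of wall-clock time lost if every checkpoint interval contains one pause."""
    return pause_seconds / interval_seconds


# A 60-second pause in every 5-minute (300-second) interval.
print(f"{unavailability_fraction(60, 300):.0%}")  # prints 20%
```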

Fuzzy Checkpoints solve this by changing what a checkpoint means. Instead of guaranteeing that all data is on disk, a fuzzy checkpoint records where recovery should start looking and what it will find.

The Name 'Fuzzy'

The term 'fuzzy' reflects that the checkpoint doesn't represent a clean, consistent point in time. The database state at checkpoint time is 'fuzzy'—some pages might be written, others not—but recovery has enough information to handle this fuzziness correctly.

What Fuzzy Checkpoints Record

A fuzzy checkpoint captures a snapshot of the system's bookkeeping structures—information that tells recovery what state the database might be in. The checkpoint does NOT guarantee that any particular pages are on disk.

Critical Information Recorded:

Fuzzy Checkpoint Contents
Component	What It Contains	How Recovery Uses It
Transaction Table	Active transaction IDs, their states, and lastLSNs	Identifies transactions needing UNDO (didn't commit before crash)
Dirty Page Table	Page IDs of dirty pages and their recLSNs	Determines the earliest LSN from which REDO must start
Checkpoint Begin LSN	LSN when checkpoint started	Analysis scan starts here, so log records written while the checkpoint was in progress are not missed
Checkpoint End LSN	LSN when checkpoint completed	Confirms the checkpoint finished; this record carries the table snapshots

The Transaction Table:

For each active transaction at checkpoint time:

Transaction ID
Transaction state (active, preparing, committed, aborted)
lastLSN: The LSN of the transaction's most recent log record
undoNextLSN: The LSN of the next log record to process if the transaction is rolled back

This allows recovery to quickly identify which transactions were in-flight and might need to be rolled back.

The Dirty Page Table:

For each dirty page in the buffer pool at checkpoint time:

Page ID
recLSN: The LSN of the first log record that dirtied this page since it was last written

The recLSN is crucial: it tells recovery the earliest point from which this page might need redo. The minimum recLSN across all dirty pages determines where the REDO phase starts.

checkpoint_record_structure.md
Fuzzy Checkpoint Record Structure:
 
┌──────────────────────────────────────────────────────────────────────┐
│                    CHECKPOINT BEGIN RECORD                           │
│                       LSN: 5000                                      │
└──────────────────────────────────────────────────────────────────────┘
 
┌──────────────────────────────────────────────────────────────────────┐
│                      TRANSACTION TABLE                               │
├──────────────────────────────────────────────────────────────────────┤
│  TransactionID │ State    │ LastLSN │ UndoNextLSN                   │
├────────────────┼──────────┼─────────┼────────────────────────────────┤
│  T101          │ Active   │ 4950    │ 4950                          │
│  T102          │ Active   │ 4980    │ 4980                          │
│  T103          │ Prepared │ 4920    │ 4900                          │
└──────────────────────────────────────────────────────────────────────┘
 
┌──────────────────────────────────────────────────────────────────────┐
│                       DIRTY PAGE TABLE                               │
├──────────────────────────────────────────────────────────────────────┤
│  PageID   │ RecLSN   │ Notes                                        │
├───────────┼──────────┼──────────────────────────────────────────────┤
│  Page 42  │ 4500     │ Dirtied at LSN 4500, not yet flushed        │
│  Page 157 │ 4800     │ Dirtied at LSN 4800, not yet flushed        │
│  Page 301 │ 4200     │ Dirtied at LSN 4200, not yet flushed        │
│  Page 509 │ 4900     │ Dirtied at LSN 4900, not yet flushed        │
└──────────────────────────────────────────────────────────────────────┘
 
┌──────────────────────────────────────────────────────────────────────┐
│                      CHECKPOINT END RECORD                           │
│                        LSN: 5010                                     │
└──────────────────────────────────────────────────────────────────────┘
 
MINIMUM RecLSN = 4200 (from Page 301)
═══════════════════════════════════════
 
Recovery REDO phase will start scanning from LSN 4200 (not 5000!)
This is because Page 301 might have modifications at LSN 4200
that haven't been flushed to disk yet.

The recLSN Concept

The recLSN (recovery LSN) for a page is the LSN when the page first became dirty after its last flush to disk. This is the earliest point where modifications to this page might not be on disk. Recovery must redo from this point for this page.
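As a minimal sketch, the REDO start point for the example checkpoint can be computed by taking the minimum recLSN in the Dirty Page Table. The dictionary layout and the function name are assumptions made for illustration only.

```python
from typing import Dict, Optional

# Dirty Page Table from the example checkpoint: page ID -> recLSN.
dirty_page_table: Dict[int, int] = {42: 4500, 157: 4800, 301: 4200, 509: 4900}


def redo_start_lsn(dpt: Dict[int, int]) -> Optional[int]:
    """Return the smallest recLSN, or None when no page is dirty (nothing to redo)."""
    return min(dpt.values(), default=None)


print(redo_start_lsn(dirty_page_table))  # 4200
```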

The Fuzzy Checkpoint Procedure

The beauty of fuzzy checkpoints is their simplicity and non-intrusiveness. The entire procedure involves writing log records and copying in-memory data structures—no waiting for page flushes.

Step-by-Step Procedure:

Fuzzy Checkpoint Steps

1. Write CHECKPOINT_BEGIN log record: This marks the start of the checkpoint. The LSN is recorded.
2. Copy the Transaction Table: Take a snapshot of currently active transactions and their states. This is fast because it's an in-memory copy.
3. Copy the Dirty Page Table: Take a snapshot of which pages are dirty and their recLSNs. Also fast, since it is just a copy of an in-memory hash table.
4. Write CHECKPOINT_END log record: This contains the snapshot data from steps 2-3.
5. Flush the log: Ensure the checkpoint records are on stable storage.
6. Update master record: Record the location of this checkpoint for recovery to find. Only now is the checkpoint complete and usable by recovery.

Critical Observation:

Notice what's NOT in this procedure:

No waiting for transactions to complete
No flushing of dirty data pages
No stopping transaction processing
No coordination with running queries

The entire checkpoint can complete in milliseconds because it's just writing a few log records. Meanwhile, new transactions start, existing transactions continue, and pages are modified—none of this affects the checkpoint.

The Trade-off:

Because we don't flush pages, some dirty pages from before the checkpoint might not be on disk. Recovery handles this by checking each page's pageLSN during redo. If pageLSN < log record's LSN, the log record must be redone.
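The pageLSN comparison can be written as a one-line predicate. This is a sketch of the rule stated above, with a hypothetical function name:

```python
def needs_redo(page_lsn: int, record_lsn: int) -> bool:
    """A logged update must be reapplied only if the page does not yet reflect it."""
    return page_lsn < record_lsn


print(needs_redo(page_lsn=4100, record_lsn=4200))  # True: page is older than the record
print(needs_redo(page_lsn=4200, record_lsn=4200))  # False: update already on the page
```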

fuzzy_checkpoint_implementation.pseudo
function performFuzzyCheckpoint() {
    // Step 1: Write begin record
    beginLSN = logManager.append(CHECKPOINT_BEGIN);
    
    // Step 2: Snapshot transaction table
    // Brief lock to get consistent snapshot
    txnTable.readLock();
    txnTableCopy = txnTable.deepCopy();
    txnTable.readUnlock();
    
    // Step 3: Snapshot dirty page table
    dirtyPageTable.readLock();
    dptCopy = dirtyPageTable.deepCopy();
    dirtyPageTable.readUnlock();
    
    // Transactions and page modifications continue uninterrupted
    // during steps 2-3 (we just take snapshots)
    
    // Step 4: Write end record with snapshot data
    endRecord = CHECKPOINT_END {
        transactionTable: txnTableCopy,
        dirtyPageTable: dptCopy
    };
    endLSN = logManager.append(endRecord);
    
    // Step 5: Flush log to stable storage
    logManager.flushUpToLSN(endLSN);
    
    // Step 6: Update master record so recovery can find this checkpoint
    masterRecord.update(lastCheckpointLSN: beginLSN);
    masterRecord.flush();
    
    printf("Checkpoint complete: LSN %d to %d, %d active txns, %d dirty pages",
           beginLSN, endLSN, txnTableCopy.size, dptCopy.size);
}
 
// Typical execution time: milliseconds (depends on table sizes and log flush latency)
// Throughput impact is limited to the brief read locks and one log flush
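The same procedure can be sketched as runnable Python. Everything here is an assumption made for illustration: the `InMemoryLog` class, the dictionary-based tables, and the single lock guarding both tables are toy stand-ins, and the LSN of a record is simply its position in a list.

```python
import copy
import threading
from dataclasses import dataclass, field


@dataclass
class InMemoryLog:
    """Toy log: a record's LSN is its index in the list (an assumption of this sketch)."""
    records: list = field(default_factory=list)
    flushed_lsn: int = -1

    def append(self, record: dict) -> int:
        self.records.append(record)
        return len(self.records) - 1

    def flush_up_to(self, lsn: int) -> None:
        # A real log manager would force the records to stable storage here.
        self.flushed_lsn = max(self.flushed_lsn, lsn)


def fuzzy_checkpoint(log, txn_table, dirty_page_table, table_lock, master_record):
    begin_lsn = log.append({"type": "CHECKPOINT_BEGIN"})
    with table_lock:  # held only long enough to copy the two tables
        txn_copy = copy.deepcopy(txn_table)
        dpt_copy = copy.deepcopy(dirty_page_table)
    end_lsn = log.append({"type": "CHECKPOINT_END",
                          "transaction_table": txn_copy,
                          "dirty_page_table": dpt_copy})
    log.flush_up_to(end_lsn)
    # Point recovery at the begin record only after the end record is durable.
    master_record["last_checkpoint_lsn"] = begin_lsn
    return begin_lsn, end_lsn


log = InMemoryLog()
txns = {"T101": {"state": "active"}}
dpt = {301: 4200}  # table contents are opaque payload in this sketch
master = {}
begin, end = fuzzy_checkpoint(log, txns, dpt, threading.Lock(), master)
dpt[42] = 4500  # later changes do not alter the snapshot already logged
print(log.records[end]["dirty_page_table"])  # {301: 4200}
```

Note that no data page is written anywhere in `fuzzy_checkpoint`; the only I/O is the log flush and the master record update.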

Checkpoint Frequency

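Because fuzzy checkpoints are so cheap, they can run frequently: every few minutes or even more often. More frequent checkpoints shorten the Analysis scan; how far back REDO must start also depends on how promptly dirty pages are flushed. The trade-off is extra log space for checkpoint records, which is usually small but grows with the size of the Dirty Page Table.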

Recovery Using Fuzzy Checkpoints

The genius of fuzzy checkpoints becomes clear when we examine how recovery uses the checkpoint data. Despite the checkpoint not guaranteeing any particular disk state, recovery can efficiently rebuild the necessary information.

Recovery Initialization:

Find the most recent checkpoint (from master record)
Read the CHECKPOINT_BEGIN and CHECKPOINT_END records
Initialize the Transaction Table and Dirty Page Table from checkpoint data
Begin the Analysis phase from the checkpoint's begin LSN, so that records logged while the checkpoint was being taken are also processed

Determining the REDO Start Point:

The checkpoint provides a Dirty Page Table with recLSNs. The minimum recLSN across all entries is the earliest point where any page modification might not be on disk. REDO must start from this LSN.

This is typically before the checkpoint itself, because some pages were dirty before the checkpoint was taken. But the checkpoint bounds how far back we need to look.

Example:

Checkpoint taken at LSN 5000
Dirty Page Table entries: recLSN values of 4200, 4500, 4800, 4900
Minimum recLSN = 4200
REDO starts at LSN 4200 (800 LSNs before the checkpoint)

Without the checkpoint, we might need to start from LSN 0 (beginning of the log).

recovery_with_fuzzy_checkpoint.md
Recovery Process with Fuzzy Checkpoint:
 
Timeline:
═════════════════════════════════════════════════════════════════════
Log:  │...│4200│4201│...│4800│...│4900│...│5000│5010│...│5500│CRASH!
      │   │    │    │   │    │   │    │   │ CP │ CP │   │    │
      │   │    │    │   │    │   │    │   │BEGIN│END │   │    │
═════════════════════════════════════════════════════════════════════
 
Checkpoint at LSN 5000-5010 recorded:
  - Dirty Page Table: Page 301 (recLSN=4200), Page 42 (recLSN=4500),
                      Page 157 (recLSN=4800), Page 509 (recLSN=4900)
  - Transaction Table: T101 (active), T102 (active), T103 (prepared)
 
═══════════════════════════════════════════════════════════════════════
 
ANALYSIS PHASE:
1. Load checkpoint data (Transaction Table, Dirty Page Table)
2. Scan log from LSN 5000 (CHECKPOINT_BEGIN) to 5500 (crash point)
3. Update Transaction Table and Dirty Page Table with post-checkpoint changes
4. Final state: know exactly which transactions need UNDO, 
   which pages might need REDO
 
REDO PHASE:
1. Start at min(recLSN) = 4200 (from checkpoint's DPT)
2. Scan forward to LSN 5500
3. For each log record:
   - Fetch page
   - If pageLSN < log record LSN: apply the redo
   - If pageLSN >= log record LSN: skip (already on disk)
 
UNDO PHASE:
1. Transaction Table shows T101 and T102 were active
2. Analysis phase found no commit records for them
3. Roll back their changes using log records
 
TOTAL LOG SCANNED: 4200 to 5500 (1300 records)
WITHOUT CHECKPOINT: 0 to 5500 (5500 records) - 4x more work!

The Analysis Phase Updates Checkpoint Data

The checkpoint provides initial Transaction Table and Dirty Page Table values, but the Analysis phase updates these while scanning the log from checkpoint to crash. Pages dirtied after the checkpoint are added; committed transactions are removed from the undo list.
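A simplified sketch of that update loop follows. It assumes a toy record format (dictionaries with `type`, `txn`, and `page` keys) and treats a COMMIT record as removing the transaction from the undo candidates, as described above; the post-checkpoint records, including transaction T104 and page 700, are invented for the example.

```python
def analysis(checkpoint_txns, checkpoint_dpt, log_records):
    """Replay (lsn, record) pairs logged since the checkpoint; return the updated tables."""
    txns = dict(checkpoint_txns)  # txn ID -> lastLSN
    dpt = dict(checkpoint_dpt)    # page ID -> recLSN
    for lsn, rec in log_records:
        if rec["type"] == "UPDATE":
            txns[rec["txn"]] = lsn
            # Only the first record that dirties a page sets its recLSN.
            dpt.setdefault(rec["page"], lsn)
        elif rec["type"] == "COMMIT":
            txns.pop(rec["txn"], None)  # committed: no longer an undo candidate
    return txns, dpt


checkpoint_txns = {"T101": 4950, "T102": 4980}
checkpoint_dpt = {42: 4500, 157: 4800, 301: 4200, 509: 4900}
post_checkpoint_log = [
    (5100, {"type": "UPDATE", "txn": "T102", "page": 42}),
    (5200, {"type": "UPDATE", "txn": "T104", "page": 700}),
    (5300, {"type": "COMMIT", "txn": "T102"}),
    (5400, {"type": "UPDATE", "txn": "T101", "page": 301}),
]

undo_candidates, dpt = analysis(checkpoint_txns, checkpoint_dpt, post_checkpoint_log)
print(sorted(undo_candidates))  # ['T101', 'T104']
print(dpt[700], dpt[42])        # 5200 4500
```

Page 42 keeps its checkpoint recLSN of 4500 even though it was modified again at LSN 5100, because the earlier change may still be missing from disk.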

Background Page Flushing and Checkpoints

While fuzzy checkpoints don't require page flushing, practical systems typically run background page flushing to ensure dirty pages eventually reach disk. This is separate from checkpointing but interacts with it.

Why Background Flushing Matters:

Bounding Recovery REDO: If a page has recLSN from 3 hours ago, all log records since then might need to be redone for that page. Background flushing keeps recLSNs relatively recent.
Log Truncation: We can only truncate log records older than the minimum recLSN. Background flushing advances this minimum.
Crash Safety: The more pages flushed, the less REDO work after a crash.
Buffer Pool Efficiency: Pre-emptively flushing dirty pages means they can be evicted immediately when needed, rather than requiring an unexpected flush.

Checkpoint-Triggered Flushing:

Some systems trigger background flushing after checkpoints to "chase" the checkpoint LSN:

Checkpoint records current Dirty Page Table
Background flusher prioritizes pages with oldest recLSNs
As pages are flushed, their entries are removed from Dirty Page Table
Next checkpoint has fewer (or more recent) dirty pages
Recovery window shrinks progressively

This is NOT part of the checkpoint itself—it happens asynchronously after the checkpoint completes.

background_flusher.pseudo
// Background page flushing process (runs continuously)
function backgroundPageFlusher() {
    while (systemRunning) {
        // Find oldest dirty pages (lowest recLSNs)
        candidates = dirtyPageTable.getOldestEntries(count: 100);
        
        for (page in candidates) {
            // Shared latch: blocks writers so the page image stays stable during the write
            page.latchShared();

            if (page.isDirty) {
                // Ensure WAL protocol: log flushed before page
                logManager.flushUpToLSN(page.pageLSN);

                // Write page to disk and wait, so the latch covers the whole write
                diskManager.writePage(page);

                // Disk now matches memory: update bookkeeping before releasing the latch,
                // so a concurrent re-dirtying of the page cannot be lost
                dirtyPageTable.remove(page.id);
                page.clearDirty();
            }

            page.unlatchShared();
        }
        
        // Rate limit to avoid overwhelming I/O
        sleep(backgroundFlushInterval);
    }
}
 
// Benefits:
// 1. Keeps recovery window bounded
// 2. Enables log truncation
// 3. Reduces steal pressure (eviction of dirty pages)
// 4. Smooth I/O load vs. bursty checkpoint-time flushing
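To see why flushing the oldest pages first matters, here is a small simulation, assuming a Dirty Page Table modeled as a plain dictionary. It deliberately omits latching, the WAL check, and actual disk writes, and only shows how the minimum recLSN (the REDO start point) advances.

```python
import heapq


def flush_oldest(dpt: dict, count: int) -> list:
    """Remove the `count` entries with the lowest recLSN, as if those pages were written out."""
    oldest = heapq.nsmallest(count, dpt.items(), key=lambda item: item[1])
    for page_id, _ in oldest:
        del dpt[page_id]  # a flushed page is no longer dirty
    return [page_id for page_id, _ in oldest]


dpt = {42: 4500, 157: 4800, 301: 4200, 509: 4900}
print(min(dpt.values()))     # 4200
print(flush_oldest(dpt, 2))  # [301, 42]
print(min(dpt.values()))     # 4800
```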

Decoupled Responsibilities

ARIES cleanly separates checkpoint (recording state) from page flushing (getting data to disk). The checkpoint doesn't flush pages; background processes do. This separation allows each to be optimized independently.

Practical Implementation Considerations

Implementing fuzzy checkpoints in production systems involves several practical considerations beyond the basic algorithm.

Implementation Considerations

•Checkpoint Frequency Tuning: More frequent checkpoints mean faster recovery but more log writes. Typical values: every 1-10 minutes, or after a fixed volume of log has been written (e.g., every 100MB).
•Dirty Page Table Size: With large buffer pools, the Dirty Page Table can be huge. Some systems use approximations (sampling) or tiered tracking.
•Long-Running Transactions: If a transaction runs for hours, its entry stays in the Transaction Table across many checkpoints. This is fine—just takes space in checkpoint records.
•Checkpoint I/O Impact: While fuzzy checkpoints themselves are cheap, the background flushing they trigger can create I/O bursts. Throttling is important.
•Concurrent Checkpoint Writing: Only one checkpoint should be active at a time. Use a mutex or similar coordination.
•Master Record Reliability: The master record pointing to the last checkpoint is critical. It must be written atomically and may use redundant copies.

Dirty Page Table Compression:

With buffer pools of 100GB+ and page sizes of 8KB, you might have 13 million pages. If 10% are dirty, the Dirty Page Table has 1.3 million entries. At ~16 bytes per entry, that's 20MB just for the checkpoint record.
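The arithmetic behind these figures, worked through with binary units (the 16-byte entry size is the rough estimate from the text, not a measured value):

```python
GIB = 1024 ** 3
KIB = 1024

buffer_pool_bytes = 100 * GIB
page_size_bytes = 8 * KIB
dirty_percent = 10
bytes_per_entry = 16  # rough figure from the text: page ID plus recLSN

total_pages = buffer_pool_bytes // page_size_bytes
dirty_entries = total_pages * dirty_percent // 100
dpt_bytes = dirty_entries * bytes_per_entry

print(total_pages)          # 13107200
print(dirty_entries)        # 1310720
print(dpt_bytes // 2 ** 20) # 20 (MiB)
```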

Optimizations include:

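Bitmap representation: Track dirty pages as a bitmap (1 bit per page); a bitmap cannot carry per-page recLSNs, so a single conservative recLSN would have to be kept alongside it
Range compression: Consecutive dirty pages recorded as ranges
Incremental checkpoints: Only record changes since last checkpoint

Separately from size reduction, checksums on the checkpoint record let recovery validate its integrity.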

Log Space Management:

Checkpoints enable log truncation: log records older than the recovery starting point can be discarded (or archived). The recovery starting point is the minimum of:

1. Oldest recLSN in the Dirty Page Table
2. Oldest begin LSN of any active transaction
3. Any other system components that reference log LSNs

Aggressive background flushing keeps #1 recent. Short transactions keep #2 recent. Together, they bound log size.
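A minimal sketch of that minimum, assuming both tables are dictionaries and that any other component pinning the log reports a plain LSN. The names, including the long-running transaction `T_long`, are hypothetical.

```python
def log_truncation_point(dpt, txn_begin_lsns, other_pins=()):
    """Oldest LSN still needed; everything before it may be truncated or archived.

    Returns None when nothing pins the log.
    """
    candidates = [*dpt.values(), *txn_begin_lsns.values(), *other_pins]
    return min(candidates, default=None)


dpt = {42: 4500, 301: 4200}
txn_begin = {"T101": 4700, "T_long": 1200}

print(log_truncation_point(dpt, txn_begin))  # 1200: the long transaction pins the log
del txn_begin["T_long"]
print(log_truncation_point(dpt, txn_begin))  # 4200: now bounded by the oldest recLSN
```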

Log Retention for Long Transactions

A single long-running transaction can prevent log truncation because its begin LSN is old. Systems must either enforce transaction duration limits or accept the log space cost of long transactions.

Summary: Checkpoints Without Tears

Fuzzy checkpoints exemplify ARIES's philosophy of decoupling logical requirements from physical constraints. The logical requirement is bounded recovery time; the physical constraint was that traditional checkpoints required halting the system. ARIES simply changed what checkpoints record, achieving the goal without the cost.

Key Takeaways

•Checkpoints bound recovery time by establishing a starting point closer to the crash than the beginning of the log.
•Sharp checkpoints require flushing all pages and stopping transactions—unacceptable for production systems.
•Fuzzy checkpoints record the Transaction Table and Dirty Page Table without flushing pages—completing in milliseconds.
•The Dirty Page Table's recLSN values determine where REDO must start—typically before the checkpoint but bounded by it.
•Background page flushing keeps recLSN values recent, further bounding recovery work and enabling log truncation.
•Recovery uses checkpoint data as initial state, updating it during the Analysis phase as needed.
•Negligible runtime impact makes fuzzy checkpoints practical to run frequently, keeping recovery times short.

What's Next:

ARIES supports not just flat transactions but also nested transactions (savepoints). The next page examines how ARIES handles partial rollback, nested abort, and the subtleties of multi-level transaction recovery.

Page Complete

You now understand fuzzy checkpoints—the technique that allows ARIES to bound recovery time without disrupting normal transaction processing. This is essential for meeting recovery time objectives in production systems. Next, we'll explore ARIES's support for nested transactions.

Fuzzy Checkpoints

The Checkpoint Paradox

What You Will Learn

Why Checkpoints Are Essential

The Fundamental Recovery Problem:

Recovery needs to answer: "Which log records' effects might not be on disk?" Without checkpoints, the conservative answer is "potentially any of them, going back to database creation."

What Checkpoints Provide:

Recovery Starting Point: A checkpoint records which pages were dirty at checkpoint time and the earliest LSN that might need to be redone. Recovery can start from this point, not from the beginning.
Bounded Recovery Time: With regular checkpoints, recovery time is proportional to work since the last checkpoint, not total database lifetime.
Log Truncation: Log records before the checkpoint's starting point can be safely discarded (archived). This bounds log storage requirements.
Active Transaction State: The checkpoint records which transactions were active, enabling efficient construction of the undo list.

recovery_without_checkpoints.md
Recovery Without Checkpoints:
 
Database operational for: 5 years
Log accumulated: 10 TB
Log records: 50 billion
 
CRASH occurs!
 
Recovery must potentially:
1. Read 10 TB of log data
2. Analyze 50 billion log records  
3. Determine which are already on disk
4. Redo/Undo as necessary
 
Estimated recovery time: 48+ hours
System unavailable: 2 days
 
═══════════════════════════════════════════════════════════════
 
Recovery WITH Checkpoints (every 5 minutes):
 
Checkpoint occurred: 3 minutes before crash
Log since checkpoint: ~200 MB
Log records since checkpoint: ~1 million
 
Recovery must:
1. Read checkpoint record (which pages were dirty, active txns)
2. Read 200 MB of log since checkpoint
3. Analyze ~1 million log records
4. Redo/Undo from checkpoint point
 
Estimated recovery time: 30 seconds
System unavailable: < 1 minute
 
═══════════════════════════════════════════════════════════════
 
Improvement: 5,000x faster recovery

Recovery Time Objectives

Sharp vs. Fuzzy Checkpoints

Sharp Checkpoint Procedure:

Stop accepting new transactions
Wait for all active transactions to complete
Flush all dirty pages to disk
Write a checkpoint record to the log
Resume transaction processing

The problem is obvious: steps 1-3 can take seconds or minutes, during which the database is unavailable.

Sharp Checkpoints

•Must stop transaction processing
•Wait for all dirty pages to flush
•Guaranteed consistent on-disk state
•Downtime: seconds to minutes
•Recovery starts after checkpoint
•Simple recovery model
•Impractical for OLTP systems

Fuzzy Checkpoints

•Transactions continue uninterrupted
•No pages flushed during checkpoint
•Records current system state only
•Downtime: zero
•Recovery uses recorded metadata
•More complex recovery model
•Essential for production systems

Why Sharp Checkpoints Are Impractical:

Buffer Pool Size: Modern buffer pools hold hundreds of thousands of dirty pages. Flushing all of them takes 30-60 seconds even on fast storage.
Long-Running Transactions: If a checkpoint must wait for active transactions to complete, a single long-running query blocks the entire checkpoint.
Transaction Throughput: Any pause in transaction processing cascades through connection pools, application queues, and user experience.
SLA Violations: A 60-second checkpoint pause every 5 minutes means 20% unavailability—completely unacceptable.

The Name 'Fuzzy'

What Fuzzy Checkpoints Record

Critical Information Recorded:

Fuzzy Checkpoint Contents
Component	What It Contains	How Recovery Uses It
Transaction Table	Active transaction IDs, their states, and lastLSNs	Identifies transactions needing UNDO (didn't commit before crash)
Dirty Page Table	Page IDs of dirty pages and their recLSNs	Determines the earliest LSN from which REDO must start
Checkpoint Begin LSN	LSN when checkpoint started	May need to scan log before this for some info
Checkpoint End LSN	LSN when checkpoint completed	Recovery knows last reliable checkpoint data

The Transaction Table:

For each active transaction at checkpoint time:

Transaction ID
Transaction state (active, preparing, committed, aborted)
lastLSN: The LSN of the transaction's most recent log record

This allows recovery to quickly identify which transactions were in-flight and might need to be rolled back.

The Dirty Page Table:

For each dirty page in the buffer pool at checkpoint time:

Page ID
recLSN: The LSN of the first log record that dirtied this page since it was last written

The recLSN is crucial: it tells recovery the earliest point from which this page might need redo. The minimum recLSN across all dirty pages determines where the REDO phase starts.

checkpoint_record_structure.md
Fuzzy Checkpoint Record Structure:
 
┌──────────────────────────────────────────────────────────────────────┐
│                    CHECKPOINT BEGIN RECORD                           │
│                       LSN: 5000                                      │
└──────────────────────────────────────────────────────────────────────┘
 
┌──────────────────────────────────────────────────────────────────────┐
│                      TRANSACTION TABLE                               │
├──────────────────────────────────────────────────────────────────────┤
│  TransactionID │ State    │ LastLSN │ UndoNextLSN                   │
├────────────────┼──────────┼─────────┼────────────────────────────────┤
│  T101          │ Active   │ 4950    │ 4950                          │
│  T102          │ Active   │ 4980    │ 4980                          │
│  T103          │ Prepared │ 4920    │ 4900                          │
└──────────────────────────────────────────────────────────────────────┘
 
┌──────────────────────────────────────────────────────────────────────┐
│                       DIRTY PAGE TABLE                               │
├──────────────────────────────────────────────────────────────────────┤
│  PageID   │ RecLSN   │ Notes                                        │
├───────────┼──────────┼──────────────────────────────────────────────┤
│  Page 42  │ 4500     │ Dirtied at LSN 4500, not yet flushed        │
│  Page 157 │ 4800     │ Dirtied at LSN 4800, not yet flushed        │
│  Page 301 │ 4200     │ Dirtied at LSN 4200, not yet flushed        │
│  Page 509 │ 4900     │ Dirtied at LSN 4900, not yet flushed        │
└──────────────────────────────────────────────────────────────────────┘
 
┌──────────────────────────────────────────────────────────────────────┐
│                      CHECKPOINT END RECORD                           │
│                        LSN: 5010                                     │
└──────────────────────────────────────────────────────────────────────┘
 
MINIMUM RecLSN = 4200 (from Page 301)
═══════════════════════════════════════
 
Recovery REDO phase will start scanning from LSN 4200 (not 5000!)
This is because Page 301 might have modifications at LSN 4200
that haven't been flushed to disk yet.

The recLSN Concept

The Fuzzy Checkpoint Procedure

The beauty of fuzzy checkpoints is their simplicity and non-intrusiveness. The entire procedure involves writing log records and copying in-memory data structures—no waiting for page flushes.

Step-by-Step Procedure:

Fuzzy Checkpoint Steps

•Write CHECKPOINT_BEGIN log record — This marks the start of checkpoint. The LSN is recorded.
•Copy the Transaction Table — Take a snapshot of currently active transactions and their states. This is fast because it's an in-memory copy.
•Copy the Dirty Page Table — Take a snapshot of which pages are dirty and their recLSNs. Also fast—just copying an in-memory hash table.
•Write CHECKPOINT_END log record — This contains the snapshot data from steps 2-3. The checkpoint is now complete.
•Flush the log — Ensure the checkpoint records are on stable storage.
•Update master record — Record the location of this checkpoint for recovery to find.

Critical Observation:

Notice what's NOT in this procedure:

No waiting for transactions to complete
No flushing of dirty data pages
No stopping transaction processing
No coordination with running queries

The Trade-off:

fuzzy_checkpoint_implementation.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
function performFuzzyCheckpoint() {
    // Step 1: Write begin record
    beginLSN = logManager.append(CHECKPOINT_BEGIN);
    
    // Step 2: Snapshot transaction table
    // Brief lock to get consistent snapshot
    txnTable.readLock();
    txnTableCopy = txnTable.deepCopy();
    txnTable.readUnlock();
    
    // Step 3: Snapshot dirty page table
    dirtyPageTable.readLock();
    dptCopy = dirtyPageTable.deepCopy();
    dirtyPageTable.readUnlock();
    
    // Transactions and page modifications continue uninterrupted
    // during steps 2-3 (we just take snapshots)
    
    // Step 4: Write end record with snapshot data
    endRecord = CHECKPOINT_END {
        transactionTable: txnTableCopy,
        dirtyPageTable: dptCopy
    };
    endLSN = logManager.append(endRecord);
    
    // Step 5: Flush log to stable storage
    logManager.flushUpToLSN(endLSN);
    
    // Step 6: Update master record so recovery can find this checkpoint
    masterRecord.update(lastCheckpointLSN: beginLSN);
    masterRecord.flush();
    
    printf("Checkpoint complete: LSN %d to %d, %d active txns, %d dirty pages",
           beginLSN, endLSN, txnTableCopy.size, dptCopy.size);
}
 
// Typical execution time: 5-50 milliseconds
// Minimal impact on transaction throughput (only the brief table locks above)

Checkpoint Frequency

Recovery Using Fuzzy Checkpoints

Recovery Initialization:

Find the most recent checkpoint (from master record)
Read the CHECKPOINT_BEGIN and CHECKPOINT_END records
Initialize the Transaction Table and Dirty Page Table from checkpoint data
Begin the Analysis phase at the CHECKPOINT_BEGIN record, because updates logged between the begin and end records may be missing from the snapshot

Determining the REDO Start Point:

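REDO begins at the smallest recLSN in the Dirty Page Table, as it stands at the end of Analysis. This is typically before the checkpoint itself, because some pages were already dirty when the checkpoint was taken. The checkpoint still bounds how far back recovery must look.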

Example:

Checkpoint taken at LSN 5000
Dirty Page Table entries: recLSN values of 4200, 4800, 4900
Minimum recLSN = 4200
REDO starts at LSN 4200 (800 records before checkpoint)

Without the checkpoint, we might need to start from LSN 0 (beginning of the log).
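The calculation in the example can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `redo_start_lsn` and the dictionary layout (page id mapped to recLSN) are assumptions made for this example.

```python
def redo_start_lsn(dirty_page_table, end_of_log_lsn):
    """Return the LSN at which the REDO scan should begin.

    dirty_page_table maps page id -> recLSN, the LSN of the first log
    record that dirtied the page since it was last written to disk.
    (Hypothetical layout, chosen for illustration.)
    """
    if not dirty_page_table:
        # No dirty pages were recorded, so there is nothing to redo.
        return end_of_log_lsn
    return min(dirty_page_table.values())


# recLSN values from the example above; the page ids are placeholders.
dpt = {"A": 4200, "B": 4800, "C": 4900}
print(redo_start_lsn(dpt, end_of_log_lsn=5500))  # 4200
```

An empty Dirty Page Table is treated as "start at the end of the log", which makes the REDO scan a no-op.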

recovery_with_fuzzy_checkpoint.md
Recovery Process with Fuzzy Checkpoint:
 
Timeline:
═════════════════════════════════════════════════════════════════════
Log:  │...│4200│4201│...│4800│...│4900│...│5000│5010│...│5500│CRASH!
      │   │    │    │   │    │   │    │   │ CP │ CP │   │    │
      │   │    │    │   │    │   │    │   │BEGIN│END │   │    │
═════════════════════════════════════════════════════════════════════
 
Checkpoint at LSN 5000-5010 recorded:
  - Dirty Page Table: Page 301 (recLSN=4200), Page 42 (recLSN=4800), 
                      Page 509 (recLSN=4900)
  - Transaction Table: T101 (active), T102 (active)
 
═══════════════════════════════════════════════════════════════════════
 
ANALYSIS PHASE:
1. Load checkpoint data (Transaction Table, Dirty Page Table)
2. Scan log from LSN 5000 (CHECKPOINT_BEGIN) to 5500 (crash point)
3. Update Transaction Table and Dirty Page Table with post-checkpoint changes
4. Final state: know exactly which transactions need UNDO, 
   which pages might need REDO
 
REDO PHASE:
1. Start at min(recLSN) = 4200 (from the DPT as updated by Analysis)
2. Scan forward to LSN 5500
3. For each log record:
   - Fetch page
   - If pageLSN < log record LSN: apply the redo
   - If pageLSN >= log record LSN: skip (already on disk)
 
UNDO PHASE:
1. Transaction Table shows T101 and T102 were active
2. Analysis phase found no commit records for them
3. Roll back their changes using log records
 
TOTAL LOG SCANNED: 4200 to 5500 (1300 records)
WITHOUT CHECKPOINT: 0 to 5500 (5500 records), roughly 4x more work
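The REDO test in the walkthrough can be sketched as follows. This is a simplified Python illustration under stated assumptions: log records are `(lsn, page_id, new_value)` tuples, pages are plain dictionaries, and the function name `redo` is hypothetical. The extra Dirty Page Table filter before the pageLSN comparison is a common ARIES refinement that avoids fetching pages known to be clean; the walkthrough above lists only the pageLSN comparison.

```python
def redo(log, dpt, pages):
    """Reapply logged updates that may be missing from disk.

    log:   list of (lsn, page_id, new_value) tuples in LSN order
    dpt:   page_id -> recLSN, as produced by the Analysis phase
    pages: page_id -> {"page_lsn": int, "value": object}
    Returns the LSNs that were actually reapplied.
    """
    applied = []
    if not dpt:
        return applied
    start = min(dpt.values())
    for lsn, page_id, new_value in log:
        if lsn < start:
            continue  # before the REDO start point
        # Skip pages that were never dirty, or records older than the
        # page's recLSN: their effects are already on disk.
        if page_id not in dpt or lsn < dpt[page_id]:
            continue
        page = pages[page_id]  # "fetch" the page
        if page["page_lsn"] >= lsn:
            continue  # the on-disk page already reflects this update
        page["value"] = new_value
        page["page_lsn"] = lsn
        applied.append(lsn)
    return applied


dpt = {301: 4200, 42: 4800}
pages = {
    301: {"page_lsn": 4100, "value": "a0"},
    42: {"page_lsn": 4850, "value": "b2"},
}
log = [(4200, 301, "a1"), (4800, 42, "b1"), (4850, 42, "b2"),
       (4900, 42, "b3"), (5000, 7, "x1")]
print(redo(log, dpt, pages))  # [4200, 4900]
```

Page 42 was flushed after LSN 4850, so only the later update at LSN 4900 is reapplied; page 7 is not in the Dirty Page Table and is never fetched.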

The Analysis Phase Updates Checkpoint Data
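A minimal Python sketch of how Analysis might roll the checkpoint snapshot forward over later log records. The record kinds, the `LogRecord` class, and the `analysis` function are simplified assumptions for illustration; a real log has more record types (compensation records, for instance) that are omitted here. Transaction T103 and page 77 are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogRecord:
    lsn: int
    kind: str                      # "UPDATE", "COMMIT", or "END"
    txn_id: str
    page_id: Optional[int] = None  # only set for UPDATE records


def analysis(checkpoint_txns, checkpoint_dpt, records):
    """Update checkpoint snapshots with records logged after the checkpoint."""
    txn_table = dict(checkpoint_txns)  # txn id -> status
    dpt = dict(checkpoint_dpt)         # page id -> recLSN
    for rec in records:
        if rec.kind == "UPDATE":
            txn_table[rec.txn_id] = "active"
            # Only the first record that dirties a page sets its recLSN.
            dpt.setdefault(rec.page_id, rec.lsn)
        elif rec.kind == "COMMIT":
            txn_table[rec.txn_id] = "committed"
        elif rec.kind == "END":
            txn_table.pop(rec.txn_id, None)
    # Transactions still active at the crash must be rolled back.
    losers = sorted(t for t, s in txn_table.items() if s == "active")
    return txn_table, dpt, losers


txns = {"T101": "active", "T102": "active"}
dpt = {301: 4200, 42: 4800, 509: 4900}
records = [
    LogRecord(5100, "UPDATE", "T103", page_id=77),
    LogRecord(5200, "COMMIT", "T103"),
    LogRecord(5210, "END", "T103"),
    LogRecord(5300, "UPDATE", "T101", page_id=42),
]
_, new_dpt, losers = analysis(txns, dpt, records)
print(new_dpt[77], new_dpt[42], losers)  # 5100 4800 ['T101', 'T102']
```

Page 42 keeps its original recLSN of 4800 even though it is updated again at LSN 5300, because the recLSN records the earliest change that might not be on disk.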

Background Page Flushing and Checkpoints

Why Background Flushing Matters:

Bounding Recovery REDO: If a page has recLSN from 3 hours ago, all log records since then might need to be redone for that page. Background flushing keeps recLSNs relatively recent.
Log Truncation: We can only truncate log records older than the minimum recLSN. Background flushing advances this minimum.
Crash Safety: The more pages flushed, the less REDO work after a crash.
Buffer Pool Efficiency: Pre-emptively flushing dirty pages means they can be evicted immediately when needed, rather than requiring an unexpected flush.

Checkpoint-Triggered Flushing:

Some systems trigger background flushing after checkpoints to "chase" the checkpoint LSN:

Checkpoint records current Dirty Page Table
Background flusher prioritizes pages with oldest recLSNs
As pages are flushed, their entries are removed from Dirty Page Table
Next checkpoint has fewer (or more recent) dirty pages
Recovery window shrinks progressively

This is NOT part of the checkpoint itself—it happens asynchronously after the checkpoint completes.

background_flusher.pseudo
// Background page flushing process (runs continuously)
function backgroundPageFlusher() {
    while (systemRunning) {
        // Find oldest dirty pages (lowest recLSNs)
        candidates = dirtyPageTable.getOldestEntries(count: 100);
        
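        for (page in candidates) {
            // Shared latch blocks writers while we capture the page image
            page.latchShared();
            
            if (page.isDirty) {
                // Ensure WAL protocol: log flushed before page
                logManager.flushUpToLSN(page.pageLSN);
                
                // Copy the image and remember its LSN so the latch can be
                // released before the asynchronous write finishes
                image = page.copyImage();
                writtenLSN = page.pageLSN;
                
                // Write the copied image to disk
                diskManager.writePageAsync(image);
                
                // When the write completes, clear the dirty state only if
                // the page was not modified again in the meantime
                onWriteComplete(() => {
                    if (page.pageLSN == writtenLSN) {
                        dirtyPageTable.remove(page.id);
                        page.clearDirty();
                    }
                });
            }
            
            page.unlatchShared();
        }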
        
        // Rate limit to avoid overwhelming I/O
        sleep(backgroundFlushInterval);
    }
}
 
// Benefits:
// 1. Keeps recovery window bounded
// 2. Enables log truncation
// 3. Reduces steal pressure (eviction of dirty pages)
// 4. Smooth I/O load vs. bursty checkpoint-time flushing

Decoupled Responsibilities

Practical Implementation Considerations

Implementing fuzzy checkpoints in production systems involves several practical considerations beyond the basic algorithm.

Implementation Considerations

•Checkpoint Frequency Tuning: More frequent checkpoints mean faster recovery but more log writes. Typical values: every 1-10 minutes, or after a fixed volume of log has been written (e.g., every 100MB).
•Dirty Page Table Size: With large buffer pools, the Dirty Page Table can be huge. Some systems use approximations (sampling) or tiered tracking.
•Long-Running Transactions: If a transaction runs for hours, its entry stays in the Transaction Table across many checkpoints. This is harmless for the checkpoint itself (it only takes space in checkpoint records), but it does hold back log truncation.
•Checkpoint I/O Impact: While fuzzy checkpoints themselves are cheap, the background flushing they trigger can create I/O bursts. Throttling is important.
•Concurrent Checkpoint Writing: Only one checkpoint should be active at a time. Use a mutex or similar coordination.
•Master Record Reliability: The master record pointing to the last checkpoint is critical. It must be written atomically and may use redundant copies.
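
One possible way to meet the master record requirement is sketched below in Python: write a checksummed record to a temporary file, fsync it, and rename it over the old copy, keeping two copies so that a damaged one can be ignored. The file format and the function names `write_master_record` and `read_master_record` are assumptions for illustration only, and a production system would typically also fsync the containing directory.

```python
import json
import os
import tempfile
import zlib


def write_master_record(path, checkpoint_lsn):
    """Atomically replace the master record stored at `path`."""
    payload = json.dumps({"last_checkpoint_lsn": checkpoint_lsn}).encode()
    blob = payload + b"\n" + str(zlib.crc32(payload)).encode()
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(blob)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the bytes to stable storage
        # Readers see either the old record or the new one, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_master_record(paths):
    """Return the newest checkpoint LSN among copies with a valid checksum."""
    best = None
    for path in paths:
        try:
            with open(path, "rb") as f:
                payload, _, crc = f.read().rpartition(b"\n")
            if int(crc) != zlib.crc32(payload):
                continue  # checksum mismatch: ignore this copy
            lsn = json.loads(payload)["last_checkpoint_lsn"]
        except (OSError, ValueError, KeyError):
            continue  # missing or unparsable copy
        best = lsn if best is None else max(best, lsn)
    return best
```

Writing both copies after each checkpoint and reading with `read_master_record([copy_a, copy_b])` lets recovery proceed even if one copy is missing or corrupted.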

Dirty Page Table Compression:

Optimizations include:

Bitmap representation: Track dirty pages as a bitmap (1 bit per page)
Range compression: Consecutive dirty pages recorded as ranges
Checksum verification: Record checksums to validate checkpoint integrity
Incremental checkpoints: Only record changes since last checkpoint

Log Space Management:

Checkpoints enable log truncation: log records older than the recovery starting point can be discarded (or archived). The recovery starting point is the minimum of:

Oldest recLSN in the Dirty Page Table
Oldest begin LSN of any active transaction
Any other system components that reference log LSNs

Aggressive background flushing keeps the oldest recLSN recent. Short transactions keep the oldest begin LSN recent. Together, they bound log size.

Log Retention for Long Transactions

A single long-running transaction can prevent log truncation because its begin LSN is old. Systems must either enforce transaction duration limits or accept the log space cost of long transactions.

Summary: Checkpoints Without Tears

Key Takeaways

•Checkpoints bound recovery time by establishing a starting point closer to the crash than the beginning of the log.
•Sharp checkpoints require flushing all pages and stopping transactions—unacceptable for production systems.
•Fuzzy checkpoints record the Transaction Table and Dirty Page Table without flushing pages—completing in milliseconds.
•The Dirty Page Table's recLSN values determine where REDO must start—typically before the checkpoint but bounded by it.
•Background page flushing keeps recLSN values recent, further bounding recovery work and enabling log truncation.
•Recovery uses checkpoint data as initial state, updating it during the Analysis phase as needed.
•Near-zero runtime impact makes fuzzy checkpoints practical to run frequently, keeping recovery times short.
