Understanding what a checkpoint is is only half the story. Understanding how a checkpoint happens reveals the intricate coordination required to capture a consistent database snapshot while the system continues processing transactions.
The checkpoint process must capture a recoverable snapshot of the system's state without halting the transactions that are actively changing it.
This page examines the checkpoint process in detail, covering both the classic consistent checkpoint and the modern fuzzy checkpoint approaches. By the end, you'll understand exactly what happens when a database system performs a checkpoint.
This page covers the step-by-step checkpoint execution process, the role of each database subsystem during checkpointing, the BEGIN_CHECKPOINT and END_CHECKPOINT log records, dirty page flushing strategies, and how to handle crashes that occur during checkpoint execution.
We begin with the simplest form: the consistent checkpoint, also called the quiescent checkpoint or sharp checkpoint. While rarely used in production due to its performance impact, understanding it provides the conceptual foundation for more sophisticated approaches.
The consistent checkpoint goal: create a point where no transactions are active and no dirty pages remain in the buffer pool, so the database on disk is complete and correct on its own.

The process unfolds in distinct phases: stop admitting new transactions, wait for all active transactions to commit or abort, flush every dirty page in the buffer pool to disk, write a CHECKPOINT record to the log, and only then resume normal processing.
During a consistent checkpoint, all transaction processing stops. For a database with 10GB of dirty pages and a 100 MB/s disk, flushing alone takes 100 seconds. Add wait time for long transactions, and checkpoints can block work for minutes. This is why consistent checkpoints are rarely used in production.
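To make that cost concrete, here is a quick back-of-the-envelope calculation using the same figures; the numbers are illustrative, not measurements from any particular system.

```python
# Back-of-the-envelope estimate of the stall caused by the flush phase of a
# consistent checkpoint. Figures are illustrative, not measurements.

def flush_stall_seconds(dirty_bytes: float, disk_throughput_bytes_per_s: float) -> float:
    """Time to write every dirty page, assuming the disk is fully dedicated to the flush."""
    return dirty_bytes / disk_throughput_bytes_per_s

GB = 1024 ** 3
MB = 1024 ** 2

# 10 GB of dirty pages over a 100 MB/s disk: roughly 100 seconds of blocked transactions.
print(f"{flush_stall_seconds(10 * GB, 100 * MB):.0f} seconds")
```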
Checkpoints are recorded in the transaction log using special log record types. These records serve as markers that the recovery system uses to determine its starting point.
For consistent checkpoints:
A single CHECKPOINT record suffices because the checkpoint represents an instantaneous, consistent state. No additional information is needed—the database on disk is complete and correct.
For fuzzy checkpoints (more common):
Two records bracket the checkpoint process:
```
-- Consistent Checkpoint Log Record (simplified)
CHECKPOINT_RECORD {
    type: 'CHECKPOINT',
    lsn: 12500000,
    timestamp: '2024-01-15 14:30:00.000',
    -- For consistent checkpoints, state is implicit:
    -- - No active transactions (all completed before checkpoint)
    -- - No dirty pages (all flushed before checkpoint)
    -- - Recovery can start from exactly this point
}

-- Fuzzy Checkpoint Log Records (more common)

BEGIN_CHECKPOINT_RECORD {
    type: 'BEGIN_CHECKPOINT',
    lsn: 12500000,
    timestamp: '2024-01-15 14:30:00.000',
    -- Signals: "I'm starting a checkpoint now"
    -- Used to bound the checkpoint duration
}

END_CHECKPOINT_RECORD {
    type: 'END_CHECKPOINT',
    lsn: 12500500,
    timestamp: '2024-01-15 14:30:05.847',
    begin_lsn: 12500000,  -- Links back to BEGIN

    -- Active Transaction Table
    active_transactions: [
        { txn_id: 1001, state: 'ACTIVE',    last_lsn: 12500400, first_lsn: 12480000 },
        { txn_id: 1005, state: 'ACTIVE',    last_lsn: 12500450, first_lsn: 12495000 },
        { txn_id: 1007, state: 'PREPARING', last_lsn: 12500480, first_lsn: 12500100 }
    ],

    -- Dirty Page Table
    dirty_pages: [
        { page_id: 'T1:P42',  recovery_lsn: 12490000 },
        { page_id: 'T1:P43',  recovery_lsn: 12495000 },
        { page_id: 'T2:P17',  recovery_lsn: 12492000 },
        { page_id: 'IDX1:P5', recovery_lsn: 12498000 }
    ],

    -- Recovery start point (minimum recovery_lsn)
    redo_lsn: 12490000
}
```

Why two records for fuzzy checkpoints?
The BEGIN and END brackets serve important purposes:
Crash during checkpoint: If the system crashes after BEGIN but before END, recovery knows a checkpoint was in progress but didn't complete. It uses the previous checkpoint instead.
Bounding checkpoint duration: The time between BEGIN and END shows how long the checkpoint took. This is useful for performance monitoring and tuning.
State capture timing: The END record captures the ATT and DPT as they exist at checkpoint completion, while the BEGIN record marks when the checkpoint's "view" of the world was established.
The master record:
A special, fixed-location record on disk always points to the most recent valid checkpoint. During recovery, the system reads the master record first, follows its LSN to the checkpoint record in the log, and begins its analysis from there.
The master record is stored at a well-known, fixed location—often the first block of the log file or a dedicated control file. This location never changes, allowing recovery to always find it without searching. It's typically only a few bytes: just the LSN of the latest checkpoint.
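As a rough sketch (the file name, offset, and record layout here are invented for illustration, not taken from any particular system), a master record can be little more than a fixed-size slot at a known offset holding the checkpoint LSN:

```python
import os
import struct

# Hypothetical on-disk layout: one 8-byte LSN at offset 0 of a dedicated control file.
MASTER_FILE = "master.ctl"      # invented name for illustration
MASTER_FORMAT = "<Q"            # a single unsigned 64-bit LSN

def write_master(checkpoint_lsn: int) -> None:
    """Point the master record at the latest complete checkpoint."""
    mode = "r+b" if os.path.exists(MASTER_FILE) else "w+b"
    with open(MASTER_FILE, mode) as f:
        f.seek(0)
        f.write(struct.pack(MASTER_FORMAT, checkpoint_lsn))
        f.flush()
        os.fsync(f.fileno())    # must be durable before recovery may rely on it

def read_master() -> int:
    """Recovery reads this first; no log scan is needed to find the checkpoint."""
    with open(MASTER_FILE, "rb") as f:
        (lsn,) = struct.unpack(MASTER_FORMAT, f.read(struct.calcsize(MASTER_FORMAT)))
        return lsn
```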
Fuzzy checkpoints (also called non-quiescent checkpoints) are the standard in modern database systems. They allow transactions to continue running during the checkpoint, dramatically reducing the performance impact.
The key insight:
We don't need a perfectly consistent on-disk state at checkpoint time. We just need enough information to reconstruct consistency during recovery. The ATT and DPT provide this information.
The fuzzy checkpoint process: write a BEGIN_CHECKPOINT record to the log, snapshot the Active Transaction Table and Dirty Page Table while transactions continue running, write an END_CHECKPOINT record containing both snapshots, force the log to disk, and finally update the master record to point at the new checkpoint. The sketch below outlines this sequence.
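Here is a minimal sketch of that sequence in Python. The classes and method names are invented for illustration; a real system would interleave these steps with far more synchronization and durable I/O.

```python
from dataclasses import dataclass
from typing import Optional

class InMemoryLog:
    """Toy append-only log; a real log manager writes to disk and tracks durability."""
    def __init__(self) -> None:
        self.records: list = []

    def append(self, record: dict) -> int:
        self.records.append(record)
        return len(self.records)          # toy LSN: position in the log

    def force(self, up_to_lsn: int) -> None:
        """Pretend everything up to up_to_lsn is now durable."""
        pass

@dataclass
class MasterRecord:
    checkpoint_lsn: Optional[int] = None

    def update(self, checkpoint_lsn: int) -> None:
        self.checkpoint_lsn = checkpoint_lsn      # the checkpoint's commit point

def fuzzy_checkpoint(log: InMemoryLog, att: dict, dpt: dict, master: MasterRecord) -> None:
    """Hypothetical checkpoint driver; transactions keep running while this executes."""
    # 1. Mark the start of the checkpoint in the log.
    begin_lsn = log.append({"type": "BEGIN_CHECKPOINT"})

    # 2. Snapshot the ATT and DPT (brief latches in a real system); no page flushing here.
    att_snapshot, dpt_snapshot = dict(att), dict(dpt)

    # 3. Record the snapshots so recovery can reconstruct them.
    end_lsn = log.append({"type": "END_CHECKPOINT", "begin_lsn": begin_lsn,
                          "att": att_snapshot, "dpt": dpt_snapshot})

    # 4. The END record must be durable before the checkpoint is advertised.
    log.force(up_to_lsn=end_lsn)

    # 5. Updating the master record is the checkpoint's commit point.
    master.update(checkpoint_lsn=end_lsn)
```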
Why this works:
The fuzzy checkpoint captures a snapshot of the system state that is not perfectly consistent with the on-disk database. Some pages captured in the DPT may be flushed after the checkpoint. Some transactions captured in the ATT may commit.
But recovery doesn't need perfect consistency. It needs to know which transactions might have been active (the ATT), which pages might have been dirty (the DPT), and where to begin replaying the log (the redo_lsn).
Recovery will then replay the log, discovering what actually happened after the checkpoint. The checkpoint provides the starting point, not the ending state.
The checkpoint is "fuzzy" because the on-disk state doesn't match the checkpoint record precisely. It's not a crisp, consistent snapshot but rather an approximation that recovery must reconcile. The fuzziness is resolved during redo, which replays operations to bring the database to a consistent state.
One of the most critical aspects of checkpoint design is how to flush dirty pages. The approach chosen significantly impacts both runtime performance and recovery time.
The flushing dilemma: flush everything at checkpoint time and transactions suffer an I/O storm while they wait; flush nothing and dirty pages accumulate, pushing the redo start point further back and lengthening recovery.
Modern databases use continuous background flushing, decoupled from the checkpoint process itself.
| Strategy | Checkpoint Impact | Normal Operations | Recovery Time | Complexity |
|---|---|---|---|---|
| Synchronous Flush All | Very High (blocks) | Clean (no background I/O) | Shortest possible | Low |
| No Flush at Checkpoint | Very Low | Clean (no background I/O) | Longest possible | Low |
| Async Flush at Checkpoint | Low to Medium | I/O spike during checkpoint | Good | Medium |
| Continuous Background Flush | Very Low | Steady background I/O | Excellent | High |
| Hybrid/Adaptive | Varies | Adaptive based on load | Optimized | Very High |
Continuous background flushing (modern approach):
The buffer manager maintains a background writer or page cleaner process that continuously scans the buffer pool for dirty pages, flushes them to disk (oldest recovery_lsn first), and throttles its own I/O so user queries are not starved.

When a checkpoint occurs, it does not trigger a burst of synchronous writes; it simply records the current Dirty Page Table, whose recovery_lsn values are already recent thanks to the ongoing background flushing.
This approach spreads I/O over time, avoiding spikes, while keeping recovery time bounded by the background flushing rate.
```
-- Conceptual background page writer policy

BACKGROUND_WRITER_CONFIG {
    -- How often to wake up and check for dirty pages
    wake_interval_ms: 100,

    -- Maximum pages to flush per wake cycle
    max_pages_per_cycle: 100,

    -- Target dirty page ratio in buffer pool
    target_dirty_ratio: 0.25,  -- 25% dirty pages max

    -- Priority: flush pages with older recovery_lsn first
    -- This minimizes redo_lsn in next checkpoint, reducing recovery time
    flush_priority: 'OLDEST_RECOVERY_LSN_FIRST',

    -- I/O throttling to avoid saturating disk
    max_io_bandwidth_mbps: 50,  -- Leave room for user queries

    -- Increase flush rate when checkpoint is imminent
    checkpoint_boost_factor: 2.0  -- Double the rate near checkpoint
}

-- The result:
-- - Dirty pages are continuously flushed in background
-- - Most pages have recent recovery_lsn values
-- - Checkpoint's redo_lsn is usually recent
-- - Recovery replays only a small amount of log
```

The redo_lsn (minimum recovery_lsn across all dirty pages) determines where redo starts. If background flushing is aggressive, most pages will have recent recovery_lsn values, and redo_lsn stays close to the checkpoint. If flushing lags, old pages accumulate, pushing redo_lsn further back and extending recovery time.
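As a small illustration, using the hypothetical DPT entries from the END_CHECKPOINT example above, redo_lsn is simply the minimum recovery_lsn in the table:

```python
# Dirty Page Table snapshot: page_id -> recovery_lsn (first LSN that dirtied the page)
dirty_page_table = {
    "T1:P42": 12490000,
    "T1:P43": 12495000,
    "T2:P17": 12492000,
    "IDX1:P5": 12498000,
}

# Redo must start at the oldest change that might not yet be on disk.
redo_lsn = min(dirty_page_table.values())
print(redo_lsn)  # 12490000 -- matches the redo_lsn in the END_CHECKPOINT record
```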
The checkpoint process interacts with multiple database subsystems. Each interaction requires careful synchronization to ensure correctness without introducing unnecessary blocking.
Subsystem interactions during checkpoint: the transaction manager supplies the ATT snapshot, the buffer manager supplies the DPT snapshot and carries on flushing pages, the log manager appends and forces the checkpoint records, and the storage layer durably updates the master record.
Critical synchronization points:
ATT snapshot: Must be point-in-time consistent. If transaction T commits after we've captured part of the ATT, we must include T in the ATT (as it was active when we started) OR exclude it entirely (if it committed before). Partial inclusion leads to recovery errors.
DPT snapshot: Must be consistent with the log. If page P is dirty and in DPT, its recovery_lsn must point to a valid log record that will reconstruct P. If we miss a dirty page, recovery may not redo its changes.
Log force before master update: If we update the master record before the checkpoint log record is on disk, and then crash, recovery will seek a checkpoint that doesn't exist. The order must be: write checkpoint record → force log → update master.
Minimizing lock duration:
The brief locks required for ATT and DPT snapshots are typically held for microseconds to milliseconds, and modern systems optimize these paths aggressively so that taking the snapshots never becomes a noticeable stall. The sketch below illustrates the pattern.
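A minimal sketch of that pattern, with invented structures: hold a short lock only while copying the in-memory tables, and do all log writes and I/O after releasing it.

```python
import copy
import threading

class CheckpointSnapshotter:
    """Hypothetical helper that copies the ATT and DPT under a brief lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.active_transactions = {}   # txn_id -> metadata (the ATT)
        self.dirty_pages = {}           # page_id -> recovery_lsn (the DPT)

    def snapshot(self):
        # Hold the lock only long enough to copy the in-memory tables.
        # No disk I/O or log writes happen while it is held.
        with self._lock:
            att_copy = copy.deepcopy(self.active_transactions)
            dpt_copy = dict(self.dirty_pages)
        # Writing the END_CHECKPOINT record and forcing the log happen after
        # the lock is released, so transactions stall only briefly.
        return att_copy, dpt_copy
```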
The master record update is the final step and acts as the 'commit' of the checkpoint. Until the master record is updated, the previous checkpoint remains valid. If the system crashes during checkpoint, recovery uses the previous checkpoint—completely ignoring the partial current checkpoint.
What happens if the system crashes while a checkpoint is in progress? This is a critical correctness concern that checkpoint protocols must address.
The recovery approach:
Checkpoints are designed so that an incomplete checkpoint has no effect. Recovery always uses the most recent complete checkpoint, identified by the master record.
Failure scenarios:
| Crash Point | State on Disk | Recovery Action |
|---|---|---|
| Before BEGIN_CHECKPOINT written | No evidence of checkpoint attempt | Use previous checkpoint (from master record) |
| After BEGIN but before END | BEGIN record in log, no END | Detect incomplete checkpoint; use previous checkpoint |
| After END but before master update | Complete checkpoint in log | Master points to old checkpoint; use old checkpoint |
| During master record write | Master may be corrupted | Use backup master or log scan to find valid checkpoint |
| After master update (success) | New checkpoint is valid | Use new checkpoint (normal recovery) |
Why incomplete checkpoints are safe:
BEGIN without END is ignored: The recovery algorithm looks for END_CHECKPOINT records. A BEGIN without matching END is treated as if the checkpoint never started.
END without master update is unreachable: Recovery reads the master record first. If the master still points to the old checkpoint, the new checkpoint (even if complete in the log) is never used.
Master record durability: The master record is typically written with double-write or other techniques to survive partial writes. Some systems keep multiple copies.
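One possible safeguard along those lines, sketched here with invented details rather than any specific system's format: keep two copies of the master record, each with a checksum, and have recovery pick the newest copy that validates.

```python
import struct
import zlib

SLOT_FORMAT = "<QQI"           # checkpoint LSN, sequence number, CRC32 of the first 16 bytes
SLOT_SIZE = struct.calcsize(SLOT_FORMAT)

def encode_slot(checkpoint_lsn: int, sequence: int) -> bytes:
    body = struct.pack("<QQ", checkpoint_lsn, sequence)
    return body + struct.pack("<I", zlib.crc32(body))

def decode_slot(raw: bytes):
    lsn, sequence, crc = struct.unpack(SLOT_FORMAT, raw)
    if zlib.crc32(raw[:16]) != crc:
        return None                        # torn or corrupted write; ignore this copy
    return lsn, sequence

def pick_master(slot_a: bytes, slot_b: bytes):
    """Return the checkpoint LSN from the newest valid copy, or None if both are bad."""
    candidates = [s for s in (decode_slot(slot_a), decode_slot(slot_b)) if s]
    if not candidates:
        return None                        # fall back to scanning the log
    return max(candidates, key=lambda s: s[1])[0]

# Writers alternate slots, bumping the sequence each time, so a torn write
# can damage at most the copy being replaced.
slot_a = encode_slot(checkpoint_lsn=12500500, sequence=42)
slot_b = encode_slot(checkpoint_lsn=12350000, sequence=41)
print(pick_master(slot_a, slot_b))   # 12500500 -- newest valid copy wins
```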
The commit point:
The checkpoint is not durably established until the master record update completes. Everything before that point can be safely abandoned.
```
-- Recovery logic for finding valid checkpoint

FUNCTION find_valid_checkpoint():
    -- Step 1: Read master record
    master = read_master_record()

    IF master.checkpoint_lsn IS NULL:
        -- No checkpoint ever completed; scan from log start
        RETURN NULL

    -- Step 2: Read the checkpoint record
    checkpoint = read_log_record(master.checkpoint_lsn)

    -- Step 3: Verify it's a complete checkpoint
    IF checkpoint.type == 'END_CHECKPOINT':
        -- Found valid fuzzy checkpoint
        RETURN checkpoint
    ELSE IF checkpoint.type == 'CHECKPOINT':
        -- Found valid consistent checkpoint
        RETURN checkpoint
    ELSE:
        -- Master record is corrupted; fall back to log scan
        RETURN scan_log_for_latest_checkpoint()

FUNCTION scan_log_for_latest_checkpoint():
    -- Emergency fallback: scan entire log for last END_CHECKPOINT
    -- This is slow but provides recovery when master is damaged
    latest = NULL
    FOR each record IN log (forward scan):
        IF record.type == 'END_CHECKPOINT':
            latest = record
    RETURN latest
```

The checkpoint protocol is designed with crashes in mind. Any crash at any point during the checkpoint process results in a safe fallback to the previous checkpoint. This is an example of atomicity in system design—the checkpoint either fully succeeds or has no effect.
Understanding checkpoint performance helps database administrators tune systems for their specific requirements. Key metrics include checkpoint duration, I/O impact, and latency effects on transactions.
Performance metrics to monitor include checkpoint duration (BEGIN to END), the volume of I/O generated around each checkpoint, transaction latency during checkpoints, and the estimated recovery time implied by the current redo_lsn. The main tuning parameters and their trade-offs are summarized below:
| Parameter | Effect of Increasing | Effect of Decreasing | Trade-off |
|---|---|---|---|
| Checkpoint Interval | More log accumulates; longer recovery | More frequent I/O; shorter recovery | Recovery time vs runtime overhead |
| Background Writer Intensity | Fewer dirty pages; shorter recovery | More dirty pages; I/O spikes at checkpoint time | Recovery time vs steady-state I/O |
| ATT/DPT Lock Duration | Not adjustable (minimize always) | Not adjustable (minimize always) | Correctness requirement |
| Log Force Policy | Safer but slower checkpoints | Faster but risks corruption | Durability vs performance |
Performance anti-patterns:
Checkpoint I/O storms: If checkpoints trigger synchronous flushing of many pages, transaction latency spikes during checkpoints. Solution: Use continuous background flushing.
Very infrequent checkpoints: Saves runtime overhead but leads to long recovery times. If checkpoint interval exceeds RTO, this is a critical misconfiguration.
Insufficient dirty page headroom: If buffer pool is too small, many pages are dirty and must be flushed frequently, causing I/O pressure unrelated to checkpoints.
Long-running transactions: A transaction running for hours blocks log truncation at its start point, even if checkpoints occur. Monitor and address long transactions separately.
Production databases should alert on checkpoint duration exceeding thresholds and on estimated recovery time exceeding RTO. These are leading indicators of future problems—address them before a crash makes them urgent.
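A toy version of such an alerting check is sketched below; the thresholds, replay rate, and inputs are all invented and would come from the system's own statistics in practice.

```python
def checkpoint_alerts(checkpoint_duration_s: float,
                      log_bytes_since_redo_lsn: int,
                      replay_rate_bytes_per_s: float,
                      max_checkpoint_duration_s: float = 60.0,
                      rto_s: float = 300.0) -> list:
    """Return alert messages if checkpoint duration or estimated recovery time look risky."""
    alerts = []
    if checkpoint_duration_s > max_checkpoint_duration_s:
        alerts.append(f"checkpoint took {checkpoint_duration_s:.0f}s "
                      f"(threshold {max_checkpoint_duration_s:.0f}s)")
    estimated_recovery_s = log_bytes_since_redo_lsn / replay_rate_bytes_per_s
    if estimated_recovery_s > rto_s:
        alerts.append(f"estimated recovery time {estimated_recovery_s:.0f}s "
                      f"exceeds RTO {rto_s:.0f}s")
    return alerts

# Example: 2 GB of log since redo_lsn, replayed at ~50 MB/s -> ~41s, within a 300s RTO.
print(checkpoint_alerts(12.0, 2 * 1024**3, 50 * 1024**2))
```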
The checkpoint process is a carefully orchestrated sequence of operations that captures recoverable state while minimizing impact on concurrent transactions. The key concepts: fuzzy checkpoints bracket ATT and DPT snapshots between BEGIN_CHECKPOINT and END_CHECKPOINT records, continuous background flushing keeps the redo_lsn recent, the master record update is the atomic commit point of the checkpoint, and an incomplete checkpoint is simply ignored in favor of the previous one.
What's next:
Having understood the standard checkpoint process, we'll examine fuzzy checkpoints in greater depth—the specific techniques that allow checkpoints to proceed without blocking transactions, and how the "fuzziness" is reconciled during recovery.
You now understand the checkpoint process from start to finish—the sequence of operations, the subsystem coordination, the handling of failures, and the performance considerations.