Understanding what a checkpoint is is only half the story. Understanding how a checkpoint happens reveals the intricate coordination required to capture a consistent database snapshot while the system continues processing transactions.
The checkpoint process must capture a recoverable snapshot of the system's state without halting the transactions that are actively changing it.
This page examines the checkpoint process in detail, covering both the classic consistent checkpoint and the modern fuzzy checkpoint approaches. By the end, you'll understand exactly what happens when a database system performs a checkpoint.
This page covers the step-by-step checkpoint execution process, the role of each database subsystem during checkpointing, the BEGIN_CHECKPOINT and END_CHECKPOINT log records, dirty page flushing strategies, and how to handle crashes that occur during checkpoint execution.
We begin with the simplest form: the consistent checkpoint, also called the quiescent checkpoint or sharp checkpoint. While rarely used in production due to its performance impact, understanding it provides the conceptual foundation for more sophisticated approaches.
The consistent checkpoint goal: create a point where no transactions are active and no dirty pages remain in the buffer pool, so the database on disk is complete and correct on its own.

The process unfolds in distinct phases: stop admitting new transactions, wait for all active transactions to commit or abort, flush every dirty page in the buffer pool to disk, write a CHECKPOINT record to the log, and only then resume normal processing.
During a consistent checkpoint, all transaction processing stops. For a database with 10GB of dirty pages and a 100 MB/s disk, flushing alone takes 100 seconds. Add wait time for long transactions, and checkpoints can block work for minutes. This is why consistent checkpoints are rarely used in production.
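To make that cost concrete, here is a quick back-of-the-envelope calculation using the same figures; the numbers are illustrative, not measurements from any particular system.

```python
# Back-of-the-envelope estimate of the stall caused by the flush phase of a
# consistent checkpoint. Figures are illustrative, not measurements.

def flush_stall_seconds(dirty_bytes: float, disk_throughput_bytes_per_s: float) -> float:
    """Time to write every dirty page, assuming the disk is fully dedicated to the flush."""
    return dirty_bytes / disk_throughput_bytes_per_s

GB = 1024 ** 3
MB = 1024 ** 2

# 10 GB of dirty pages over a 100 MB/s disk: roughly 100 seconds of blocked transactions.
print(f"{flush_stall_seconds(10 * GB, 100 * MB):.0f} seconds")
```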
Checkpoints are recorded in the transaction log using special log record types. These records serve as markers that the recovery system uses to determine its starting point.
For consistent checkpoints:
A single CHECKPOINT record suffices because the checkpoint represents an instantaneous, consistent state. No additional information is needed—the database on disk is complete and correct.
For fuzzy checkpoints (more common):
Two records bracket the checkpoint process:
```
-- Consistent Checkpoint Log Record (simplified)
CHECKPOINT_RECORD {
    type: 'CHECKPOINT',
    lsn: 12500000,
    timestamp: '2024-01-15 14:30:00.000',
    -- For consistent checkpoints, state is implicit:
    -- - No active transactions (all completed before checkpoint)
    -- - No dirty pages (all flushed before checkpoint)
    -- - Recovery can start from exactly this point
}

-- Fuzzy Checkpoint Log Records (more common)

BEGIN_CHECKPOINT_RECORD {
    type: 'BEGIN_CHECKPOINT',
    lsn: 12500000,
    timestamp: '2024-01-15 14:30:00.000',
    -- Signals: "I'm starting a checkpoint now"
    -- Used to bound the checkpoint duration
}

END_CHECKPOINT_RECORD {
    type: 'END_CHECKPOINT',
    lsn: 12500500,
    timestamp: '2024-01-15 14:30:05.847',
    begin_lsn: 12500000,  -- Links back to BEGIN

    -- Active Transaction Table
    active_transactions: [
        { txn_id: 1001, state: 'ACTIVE',    last_lsn: 12500400, first_lsn: 12480000 },
        { txn_id: 1005, state: 'ACTIVE',    last_lsn: 12500450, first_lsn: 12495000 },
        { txn_id: 1007, state: 'PREPARING', last_lsn: 12500480, first_lsn: 12500100 }
    ],

    -- Dirty Page Table
    dirty_pages: [
        { page_id: 'T1:P42',  recovery_lsn: 12490000 },
        { page_id: 'T1:P43',  recovery_lsn: 12495000 },
        { page_id: 'T2:P17',  recovery_lsn: 12492000 },
        { page_id: 'IDX1:P5', recovery_lsn: 12498000 }
    ],

    -- Recovery start point (minimum recovery_lsn)
    redo_lsn: 12490000
}
```

Why two records for fuzzy checkpoints?
The BEGIN and END brackets serve important purposes:
Crash during checkpoint: If the system crashes after BEGIN but before END, recovery knows a checkpoint was in progress but didn't complete. It uses the previous checkpoint instead.
Bounding checkpoint duration: The time between BEGIN and END shows how long the checkpoint took. This is useful for performance monitoring and tuning.
State capture timing: The END record captures the ATT and DPT as they exist at checkpoint completion, while the BEGIN record marks when the checkpoint's "view" of the world was established.
The master record:
A special, fixed-location record on disk always points to the most recent valid checkpoint. During recovery, the system reads the master record first, follows its LSN to the checkpoint record in the log, and begins its analysis from there.
The master record is stored at a well-known, fixed location—often the first block of the log file or a dedicated control file. This location never changes, allowing recovery to always find it without searching. It's typically only a few bytes: just the LSN of the latest checkpoint.
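As a rough sketch (the file name, offset, and record layout here are invented for illustration, not taken from any particular system), a master record can be little more than a fixed-size slot at a known offset holding the checkpoint LSN:

```python
import os
import struct

# Hypothetical on-disk layout: one 8-byte LSN at offset 0 of a dedicated control file.
MASTER_FILE = "master.ctl"      # invented name for illustration
MASTER_FORMAT = "<Q"            # a single unsigned 64-bit LSN

def write_master(checkpoint_lsn: int) -> None:
    """Point the master record at the latest complete checkpoint."""
    mode = "r+b" if os.path.exists(MASTER_FILE) else "w+b"
    with open(MASTER_FILE, mode) as f:
        f.seek(0)
        f.write(struct.pack(MASTER_FORMAT, checkpoint_lsn))
        f.flush()
        os.fsync(f.fileno())    # must be durable before recovery may rely on it

def read_master() -> int:
    """Recovery reads this first; no log scan is needed to find the checkpoint."""
    with open(MASTER_FILE, "rb") as f:
        (lsn,) = struct.unpack(MASTER_FORMAT, f.read(struct.calcsize(MASTER_FORMAT)))
        return lsn
```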
Fuzzy checkpoints (also called non-quiescent checkpoints) are the standard in modern database systems. They allow transactions to continue running during the checkpoint, dramatically reducing the performance impact.
The key insight:
We don't need a perfectly consistent on-disk state at checkpoint time. We just need enough information to reconstruct consistency during recovery. The ATT and DPT provide this information.
The fuzzy checkpoint process: write a BEGIN_CHECKPOINT record to the log, snapshot the Active Transaction Table and Dirty Page Table while transactions continue running, write an END_CHECKPOINT record containing both snapshots, force the log to disk, and finally update the master record to point at the new checkpoint. The sketch below outlines this sequence.
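Here is a minimal sketch of that sequence in Python. The classes and method names are invented for illustration; a real system would interleave these steps with far more synchronization and durable I/O.

```python
from dataclasses import dataclass
from typing import Optional

class InMemoryLog:
    """Toy append-only log; a real log manager writes to disk and tracks durability."""
    def __init__(self) -> None:
        self.records: list = []

    def append(self, record: dict) -> int:
        self.records.append(record)
        return len(self.records)          # toy LSN: position in the log

    def force(self, up_to_lsn: int) -> None:
        """Pretend everything up to up_to_lsn is now durable."""
        pass

@dataclass
class MasterRecord:
    checkpoint_lsn: Optional[int] = None

    def update(self, checkpoint_lsn: int) -> None:
        self.checkpoint_lsn = checkpoint_lsn      # the checkpoint's commit point

def fuzzy_checkpoint(log: InMemoryLog, att: dict, dpt: dict, master: MasterRecord) -> None:
    """Hypothetical checkpoint driver; transactions keep running while this executes."""
    # 1. Mark the start of the checkpoint in the log.
    begin_lsn = log.append({"type": "BEGIN_CHECKPOINT"})

    # 2. Snapshot the ATT and DPT (brief latches in a real system); no page flushing here.
    att_snapshot, dpt_snapshot = dict(att), dict(dpt)

    # 3. Record the snapshots so recovery can reconstruct them.
    end_lsn = log.append({"type": "END_CHECKPOINT", "begin_lsn": begin_lsn,
                          "att": att_snapshot, "dpt": dpt_snapshot})

    # 4. The END record must be durable before the checkpoint is advertised.
    log.force(up_to_lsn=end_lsn)

    # 5. Updating the master record is the checkpoint's commit point.
    master.update(checkpoint_lsn=end_lsn)
```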
Why this works:
The fuzzy checkpoint captures a snapshot of the system state that is not perfectly consistent with the on-disk database. Some pages captured in the DPT may be flushed after the checkpoint. Some transactions captured in the ATT may commit.
But recovery doesn't need perfect consistency. It needs to know which transactions might have been active (the ATT), which pages might have been dirty (the DPT), and where to begin replaying the log (the redo_lsn).
Recovery will then replay the log, discovering what actually happened after the checkpoint. The checkpoint provides the starting point, not the ending state.
The checkpoint is "fuzzy" because the on-disk state doesn't match the checkpoint record precisely. It's not a crisp, consistent snapshot but rather an approximation that recovery must reconcile. The fuzziness is resolved during redo, which replays operations to bring the database to a consistent state.
One of the most critical aspects of checkpoint design is how to flush dirty pages. The approach chosen significantly impacts both runtime performance and recovery time.
The flushing dilemma: flush everything at checkpoint time and transactions suffer an I/O storm while they wait; flush nothing and dirty pages accumulate, pushing the redo start point further back and lengthening recovery.
Modern databases use continuous background flushing, decoupled from the checkpoint process itself.
| Strategy | Checkpoint Impact | Normal Operations | Recovery Time | Complexity |
|---|---|---|---|---|
| Synchronous Flush All | Very High (blocks) | Clean (no background I/O) | Shortest possible | Low |
| No Flush at Checkpoint | Very Low | Clean (no background I/O) | Longest possible | Low |
| Async Flush at Checkpoint | Low to Medium | I/O spike during checkpoint | Good | Medium |
| Continuous Background Flush | Very Low | Steady background I/O | Excellent | High |
| Hybrid/Adaptive | Varies | Adaptive based on load | Optimized | Very High |
Continuous background flushing (modern approach):
The buffer manager maintains a background writer or page cleaner process that continuously scans the buffer pool for dirty pages, flushes them to disk (oldest recovery_lsn first), and throttles its own I/O so user queries are not starved.

When a checkpoint occurs, it does not trigger a burst of synchronous writes; it simply records the current Dirty Page Table, whose recovery_lsn values are already recent thanks to the ongoing background flushing.
This approach spreads I/O over time, avoiding spikes, while keeping recovery time bounded by the background flushing rate.
```
-- Conceptual background page writer policy

BACKGROUND_WRITER_CONFIG {
    -- How often to wake up and check for dirty pages
    wake_interval_ms: 100,

    -- Maximum pages to flush per wake cycle
    max_pages_per_cycle: 100,

    -- Target dirty page ratio in buffer pool
    target_dirty_ratio: 0.25,  -- 25% dirty pages max

    -- Priority: flush pages with older recovery_lsn first
    -- This minimizes redo_lsn in next checkpoint, reducing recovery time
    flush_priority: 'OLDEST_RECOVERY_LSN_FIRST',

    -- I/O throttling to avoid saturating disk
    max_io_bandwidth_mbps: 50,  -- Leave room for user queries

    -- Increase flush rate when checkpoint is imminent
    checkpoint_boost_factor: 2.0  -- Double the rate near checkpoint
}

-- The result:
-- - Dirty pages are continuously flushed in background
-- - Most pages have recent recovery_lsn values
-- - Checkpoint's redo_lsn is usually recent
-- - Recovery replays only a small amount of log
```

The redo_lsn (minimum recovery_lsn across all dirty pages) determines where redo starts. If background flushing is aggressive, most pages will have recent recovery_lsn values, and redo_lsn stays close to the checkpoint. If flushing lags, old pages accumulate, pushing redo_lsn further back and extending recovery time.
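As a small illustration, using the hypothetical DPT entries from the END_CHECKPOINT example above, redo_lsn is simply the minimum recovery_lsn in the table:

```python
# Dirty Page Table snapshot: page_id -> recovery_lsn (first LSN that dirtied the page)
dirty_page_table = {
    "T1:P42": 12490000,
    "T1:P43": 12495000,
    "T2:P17": 12492000,
    "IDX1:P5": 12498000,
}

# Redo must start at the oldest change that might not yet be on disk.
redo_lsn = min(dirty_page_table.values())
print(redo_lsn)  # 12490000 -- matches the redo_lsn in the END_CHECKPOINT record
```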
The checkpoint process interacts with multiple database subsystems. Each interaction requires careful synchronization to ensure correctness without introducing unnecessary blocking.
Subsystem interactions during checkpoint: the transaction manager supplies the ATT snapshot, the buffer manager supplies the DPT snapshot and carries on flushing pages, the log manager appends and forces the checkpoint records, and the storage layer durably updates the master record.
Critical synchronization points:
ATT snapshot: Must be point-in-time consistent. If transaction T commits after we've captured part of the ATT, we must include T in the ATT (as it was active when we started) OR exclude it entirely (if it committed before). Partial inclusion leads to recovery errors.
DPT snapshot: Must be consistent with the log. If page P is dirty and in DPT, its recovery_lsn must point to a valid log record that will reconstruct P. If we miss a dirty page, recovery may not redo its changes.
Log force before master update: If we update the master record before the checkpoint log record is on disk, and then crash, recovery will seek a checkpoint that doesn't exist. The order must be: write checkpoint record → force log → update master.
Minimizing lock duration:
The brief locks required for ATT and DPT snapshots are typically held for microseconds to milliseconds, and modern systems optimize these paths aggressively so that taking the snapshots never becomes a noticeable stall. The sketch below illustrates the pattern.
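A minimal sketch of that pattern, with invented structures: hold a short lock only while copying the in-memory tables, and do all log writes and I/O after releasing it.

```python
import copy
import threading

class CheckpointSnapshotter:
    """Hypothetical helper that copies the ATT and DPT under a brief lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.active_transactions = {}   # txn_id -> metadata (the ATT)
        self.dirty_pages = {}           # page_id -> recovery_lsn (the DPT)

    def snapshot(self):
        # Hold the lock only long enough to copy the in-memory tables.
        # No disk I/O or log writes happen while it is held.
        with self._lock:
            att_copy = copy.deepcopy(self.active_transactions)
            dpt_copy = dict(self.dirty_pages)
        # Writing the END_CHECKPOINT record and forcing the log happen after
        # the lock is released, so transactions stall only briefly.
        return att_copy, dpt_copy
```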
The master record update is the final step and acts as the 'commit' of the checkpoint. Until the master record is updated, the previous checkpoint remains valid. If the system crashes during checkpoint, recovery uses the previous checkpoint—completely ignoring the partial current checkpoint.
What happens if the system crashes while a checkpoint is in progress? This is a critical correctness concern that checkpoint protocols must address.
The recovery approach:
Checkpoints are designed so that an incomplete checkpoint has no effect. Recovery always uses the most recent complete checkpoint, identified by the master record.
Failure scenarios:
| Crash Point | State on Disk | Recovery Action |
|---|---|---|
| Before BEGIN_CHECKPOINT written | No evidence of checkpoint attempt | Use previous checkpoint (from master record) |
| After BEGIN but before END | BEGIN record in log, no END | Detect incomplete checkpoint; use previous checkpoint |
| After END but before master update | Complete checkpoint in log | Master points to old checkpoint; use old checkpoint |
| During master record write | Master may be corrupted | Use backup master or log scan to find valid checkpoint |
| After master update (success) | New checkpoint is valid | Use new checkpoint (normal recovery) |
Why incomplete checkpoints are safe:
BEGIN without END is ignored: The recovery algorithm looks for END_CHECKPOINT records. A BEGIN without matching END is treated as if the checkpoint never started.
END without master update is unreachable: Recovery reads the master record first. If the master still points to the old checkpoint, the new checkpoint (even if complete in the log) is never used.
Master record durability: The master record is typically written with double-write or other techniques to survive partial writes. Some systems keep multiple copies.
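One possible safeguard along those lines, sketched here with invented details rather than any specific system's format: keep two copies of the master record, each with a checksum, and have recovery pick the newest copy that validates.

```python
import struct
import zlib

SLOT_FORMAT = "<QQI"           # checkpoint LSN, sequence number, CRC32 of the first 16 bytes
SLOT_SIZE = struct.calcsize(SLOT_FORMAT)

def encode_slot(checkpoint_lsn: int, sequence: int) -> bytes:
    body = struct.pack("<QQ", checkpoint_lsn, sequence)
    return body + struct.pack("<I", zlib.crc32(body))

def decode_slot(raw: bytes):
    lsn, sequence, crc = struct.unpack(SLOT_FORMAT, raw)
    if zlib.crc32(raw[:16]) != crc:
        return None                        # torn or corrupted write; ignore this copy
    return lsn, sequence

def pick_master(slot_a: bytes, slot_b: bytes):
    """Return the checkpoint LSN from the newest valid copy, or None if both are bad."""
    candidates = [s for s in (decode_slot(slot_a), decode_slot(slot_b)) if s]
    if not candidates:
        return None                        # fall back to scanning the log
    return max(candidates, key=lambda s: s[1])[0]

# Writers alternate slots, bumping the sequence each time, so a torn write
# can damage at most the copy being replaced.
slot_a = encode_slot(checkpoint_lsn=12500500, sequence=42)
slot_b = encode_slot(checkpoint_lsn=12350000, sequence=41)
print(pick_master(slot_a, slot_b))   # 12500500 -- newest valid copy wins
```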
The commit point:
The checkpoint is not durably established until the master record update completes. Everything before that point can be safely abandoned.
```
-- Recovery logic for finding valid checkpoint

FUNCTION find_valid_checkpoint():
    -- Step 1: Read master record
    master = read_master_record()

    IF master.checkpoint_lsn IS NULL:
        -- No checkpoint ever completed; scan from log start
        RETURN NULL

    -- Step 2: Read the checkpoint record
    checkpoint = read_log_record(master.checkpoint_lsn)

    -- Step 3: Verify it's a complete checkpoint
    IF checkpoint.type == 'END_CHECKPOINT':
        -- Found valid fuzzy checkpoint
        RETURN checkpoint
    ELSE IF checkpoint.type == 'CHECKPOINT':
        -- Found valid consistent checkpoint
        RETURN checkpoint
    ELSE:
        -- Master record is corrupted; fall back to log scan
        RETURN scan_log_for_latest_checkpoint()

FUNCTION scan_log_for_latest_checkpoint():
    -- Emergency fallback: scan entire log for last END_CHECKPOINT
    -- This is slow but provides recovery when master is damaged
    latest = NULL
    FOR each record IN log (forward scan):
        IF record.type == 'END_CHECKPOINT':
            latest = record
    RETURN latest
```

The checkpoint protocol is designed with crashes in mind. Any crash at any point during the checkpoint process results in a safe fallback to the previous checkpoint. This is an example of atomicity in system design—the checkpoint either fully succeeds or has no effect.
Understanding checkpoint performance helps database administrators tune systems for their specific requirements. Key metrics include checkpoint duration, I/O impact, and latency effects on transactions.
Performance metrics to monitor include checkpoint duration (BEGIN to END), the volume of I/O generated around each checkpoint, transaction latency during checkpoints, and the estimated recovery time implied by the current redo_lsn. The main tuning parameters and their trade-offs are summarized below:
| Parameter | Effect of Increasing | Effect of Decreasing | Trade-off |
|---|---|---|---|
| Checkpoint Interval | More log accumulates; longer recovery | More frequent I/O; shorter recovery | Recovery time vs runtime overhead |
| Background Writer Intensity | Fewer dirty pages; shorter recovery | More dirty pages; I/O spikes at checkpoint time | Recovery time vs steady-state I/O |
| ATT/DPT Lock Duration | Not adjustable (minimize always) | Not adjustable (minimize always) | Correctness requirement |
| Log Force Policy | Safer but slower checkpoints | Faster but risks corruption | Durability vs performance |
Performance anti-patterns:
Checkpoint I/O storms: If checkpoints trigger synchronous flushing of many pages, transaction latency spikes during checkpoints. Solution: Use continuous background flushing.
Very infrequent checkpoints: Saves runtime overhead but leads to long recovery times. If checkpoint interval exceeds RTO, this is a critical misconfiguration.
Insufficient dirty page headroom: If buffer pool is too small, many pages are dirty and must be flushed frequently, causing I/O pressure unrelated to checkpoints.
Long-running transactions: A transaction running for hours blocks log truncation at its start point, even if checkpoints occur. Monitor and address long transactions separately.
Production databases should alert on checkpoint duration exceeding thresholds and on estimated recovery time exceeding RTO. These are leading indicators of future problems—address them before a crash makes them urgent.
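A toy version of such an alerting check is sketched below; the thresholds, replay rate, and inputs are all invented and would come from the system's own statistics in practice.

```python
def checkpoint_alerts(checkpoint_duration_s: float,
                      log_bytes_since_redo_lsn: int,
                      replay_rate_bytes_per_s: float,
                      max_checkpoint_duration_s: float = 60.0,
                      rto_s: float = 300.0) -> list:
    """Return alert messages if checkpoint duration or estimated recovery time look risky."""
    alerts = []
    if checkpoint_duration_s > max_checkpoint_duration_s:
        alerts.append(f"checkpoint took {checkpoint_duration_s:.0f}s "
                      f"(threshold {max_checkpoint_duration_s:.0f}s)")
    estimated_recovery_s = log_bytes_since_redo_lsn / replay_rate_bytes_per_s
    if estimated_recovery_s > rto_s:
        alerts.append(f"estimated recovery time {estimated_recovery_s:.0f}s "
                      f"exceeds RTO {rto_s:.0f}s")
    return alerts

# Example: 2 GB of log since redo_lsn, replayed at ~50 MB/s -> ~41s, within a 300s RTO.
print(checkpoint_alerts(12.0, 2 * 1024**3, 50 * 1024**2))
```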
The checkpoint process is a carefully orchestrated sequence of operations that captures recoverable state while minimizing impact on concurrent transactions. The key concepts: fuzzy checkpoints bracket ATT and DPT snapshots between BEGIN_CHECKPOINT and END_CHECKPOINT records, continuous background flushing keeps the redo_lsn recent, the master record update is the atomic commit point of the checkpoint, and an incomplete checkpoint is simply ignored in favor of the previous one.
What's next:
Having understood the standard checkpoint process, we'll examine fuzzy checkpoints in greater depth—the specific techniques that allow checkpoints to proceed without blocking transactions, and how the "fuzziness" is reconciled during recovery.
You now understand the checkpoint process from start to finish—the sequence of operations, the subsystem coordination, the handling of failures, and the performance considerations.