The moment of truth for any checkpoint system arrives when a crash occurs. Everything we've discussed—checkpoint concepts, processes, fuzziness, and frequency—exists to serve this moment: recovery.
When a database restarts after a crash, the recovery system must determine what was happening at the moment of failure, reapply logged changes that may never have reached the data files, and roll back the work of transactions that never committed.
This page traces the complete recovery process, showing exactly how checkpoint information guides each phase and how the fuzziness of checkpoints is resolved into a consistent database state.
By the end of this page, you will understand the complete recovery flow from crash to operational database, how the checkpoint's ATT and DPT are used in each phase, how recovery resolves the fuzziness of checkpoints, and practical considerations for ensuring fast, correct recovery.
When a database system starts after an unclean shutdown, the recovery manager takes control before any user activity is allowed.
The first critical task: finding the checkpoint.
Recovery must locate the most recent valid checkpoint. This is done through the master record—a fixed-location pointer that always indicates where to find the current checkpoint.
Recovery startup sequence:
```
-- Recovery startup: locating and loading checkpoint
FUNCTION recovery_startup():
    -- Step 1: Read control file
    control = read_control_file()

    -- Step 2: Check if recovery is needed
    IF control.shutdown_flag == CLEAN:
        LOG("Clean shutdown detected; no recovery needed")
        RETURN SUCCESS

    LOG("Unclean shutdown detected; starting recovery")

    -- Step 3: Read master record
    master = read_master_record()

    IF master.checkpoint_lsn IS NULL:
        LOG("No checkpoint found; recovering from log start")
        checkpoint_lsn = LOG_START
        att = EMPTY_TABLE
        dpt = EMPTY_TABLE
        redo_lsn = LOG_START
    ELSE:
        -- Step 4: Read checkpoint record
        checkpoint = read_log_record(master.checkpoint_lsn)

        -- Step 5: Validate checkpoint
        IF checkpoint.type != 'END_CHECKPOINT':
            LOG("Invalid checkpoint; searching for previous")
            checkpoint = scan_for_valid_checkpoint(before = master.checkpoint_lsn)

        -- Step 6: Extract tables
        att = checkpoint.active_transactions
        dpt = checkpoint.dirty_pages
        redo_lsn = checkpoint.redo_lsn

    LOG("Checkpoint loaded: " + master.checkpoint_lsn)
    LOG("Active transactions: " + COUNT(att))
    LOG("Dirty pages: " + COUNT(dpt))
    LOG("Redo starting point: " + redo_lsn)

    -- Step 7: Begin analysis phase
    RETURN begin_analysis_phase(redo_lsn, att, dpt)
```

Handling checkpoint corruption:
If the checkpoint record at the master's location is corrupted or incomplete (perhaps the crash occurred during checkpoint writing), recovery falls back to the previous checkpoint.
In extreme cases (no valid checkpoint anywhere), recovery must scan from the beginning of the log. This is why systems keep the previous checkpoint until a new one completes successfully, and why log segments are not recycled while a surviving checkpoint still needs them.
The redo_lsn from checkpoint:
The redo_lsn is the minimum recovery_lsn from the Dirty Page Table—the earliest point where any dirty page might need redo. This is where the analysis and redo phases begin their log scan.
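As a small illustration (the LSN values and page names below are made up), the redo point is simply the smallest recovery_lsn in the checkpoint's DPT, and it can sit well before the checkpoint record itself:

```python
# Hypothetical checkpoint DPT: page_id -> recovery_lsn (first LSN that might need redo)
checkpoint_dpt = {
    "page_17": 980000,   # dirtied long ago and never flushed since
    "page_42": 995000,
    "page_93": 999500,
}
checkpoint_lsn = 1000000

# Empty DPT would mean nothing older than the checkpoint needs redo
redo_lsn = min(checkpoint_dpt.values()) if checkpoint_dpt else checkpoint_lsn
print(redo_lsn)  # 980000 -- the log scan starts here, well before the checkpoint record
```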
During recovery, the database is not available for user connections. Only after recovery completes successfully does the system open for normal operation. This is why recovery time directly impacts availability—every second of recovery is a second of downtime.
The analysis phase scans the log from the checkpoint's redo_lsn to the end, updating the ATT and DPT to reflect what happened after the checkpoint.
Purpose of analysis:
The checkpoint's ATT and DPT are snapshots from checkpoint time. Since then, new transactions have started, previously active transactions have committed or aborted, additional pages have been dirtied, and some dirty pages have been flushed to disk.
Analysis reconstructs the true ATT and DPT as of crash time, which redo and undo will then use.
Analysis phase operations:
```
-- Analysis phase: updating ATT and DPT from checkpoint to crash
FUNCTION analysis_phase(redo_lsn, att, dpt):
    -- Scan log from redo_lsn to end
    FOR each log_record FROM redo_lsn TO end_of_log:
        txn_id = log_record.txn_id

        -- Handle different record types
        CASE log_record.type:

            WHEN 'BEGIN':
                -- New transaction started after checkpoint
                att[txn_id] = ATTEntry {
                    state: ACTIVE,
                    first_lsn: log_record.lsn,
                    last_lsn: log_record.lsn,
                    undo_nxt_lsn: NULL
                }

            WHEN 'UPDATE', 'INSERT', 'DELETE':
                -- Data modification
                page_id = log_record.page_id

                -- Update ATT: track last LSN
                att[txn_id].last_lsn = log_record.lsn

                -- Update DPT: add page if not already present
                IF page_id NOT IN dpt:
                    dpt[page_id] = DPTEntry {
                        recovery_lsn: log_record.lsn  -- First log record that might need redo for this page
                    }
                -- If already in DPT, keep original recovery_lsn
                -- (we want the FIRST modification, not the latest)

            WHEN 'COMMIT':
                -- Transaction committed; remove from ATT
                -- (won't need undo)
                DELETE att[txn_id]

            WHEN 'ABORT':
                -- Transaction aborted; mark for undo
                att[txn_id].state = ABORTED
                att[txn_id].undo_nxt_lsn = log_record.prev_lsn

            WHEN 'CLR':
                -- Compensation Log Record
                -- Part of ongoing rollback; update undo position
                att[txn_id].undo_nxt_lsn = log_record.undo_nxt_lsn

            WHEN 'END':
                -- Transaction fully complete; remove from ATT
                DELETE att[txn_id]

    -- Compute redo start point
    IF dpt IS EMPTY:
        final_redo_lsn = end_of_log  -- Nothing to redo
    ELSE:
        final_redo_lsn = MIN(entry.recovery_lsn FOR entry IN dpt)

    RETURN att, dpt, final_redo_lsn
```

Key analysis outcomes:
Final ATT contains "loser" transactions: All transactions still active at crash time. These need undo.
Final DPT contains all potentially dirty pages: Both from checkpoint and from post-checkpoint modifications.
Redo LSN is the earliest point requiring replay: Minimum recovery_lsn across all dirty pages.
Why not just use checkpoint values?
Because of checkpoint fuzziness: transactions listed in the checkpoint ATT may have committed or aborted afterward, new transactions may have started, additional pages may have been dirtied after the checkpoint, and pages listed in the DPT may have since been flushed.
Analysis resolves the fuzziness for ATT. DPT remains pessimistic—we include everything that might need redo; redo phase determines what actually needs redo.
Analysis only reads log records—it doesn't touch data pages. This makes it very fast: typically limited only by log read speed. Even a gigabyte of log can be analyzed in seconds with modern storage.
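For a rough sense of scale, the sketch below works through that claim; the throughput figure is an assumption, not a measurement from any particular system:

```python
log_size_gb = 1.0
sequential_read_mb_s = 500.0  # assumed sequential log read throughput on modern SSDs

analysis_seconds = (log_size_gb * 1024) / sequential_read_mb_s
print(f"~{analysis_seconds:.1f} s to scan {log_size_gb} GB of log")  # ~2.0 s plus CPU overhead
```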
The redo phase replays log records to bring the database to its crash-time state. This includes all modifications—even those by uncommitted transactions—because undo needs the complete crash-time state to work backwards.
Redo philosophy: "Repeat History"
The redo phase operates on the principle of repeating history. It reconstructs exactly what the database looked like at the moment of the crash, including changes by committed transactions whose pages never reached disk and changes by in-flight transactions that the undo phase will later roll back.
Redo phase operations:
```
-- Redo phase: replaying history from redo_lsn to crash
FUNCTION redo_phase(redo_lsn, dpt):
    LOG("Beginning redo phase from LSN: " + redo_lsn)

    records_redone = 0
    records_skipped = 0

    -- Scan log forward from redo_lsn
    FOR each log_record FROM redo_lsn TO end_of_log:

        -- Only redo data-modifying records
        IF log_record.type NOT IN ('UPDATE', 'INSERT', 'DELETE', 'CLR'):
            CONTINUE  -- Skip non-data records

        page_id = log_record.page_id

        -- Optimization: check if page is in DPT
        -- If not in DPT, page was clean at checkpoint and no newer records affect it
        -- (This check is optional but speeds up recovery)
        IF page_id NOT IN dpt:
            -- Page wasn't dirty at checkpoint and wasn't modified after
            records_skipped += 1
            CONTINUE

        -- Check if redo is needed for this specific record
        IF log_record.lsn < dpt[page_id].recovery_lsn:
            -- Record is before this page's recovery_lsn
            -- Page was flushed after this record; no redo needed
            records_skipped += 1
            CONTINUE

        -- Read the actual page from disk
        page = read_page(page_id)

        -- Critical check: compare LSNs
        IF log_record.lsn <= page.page_lsn:
            -- Page's LSN is at or after this record
            -- The modification already reached disk; skip redo
            records_skipped += 1
            CONTINUE

        -- Redo is needed: apply the log record
        page = apply_log_record(log_record, page)
        page.page_lsn = log_record.lsn  -- Update page LSN
        write_page(page)                -- Write to disk (or buffer pool)

        records_redone += 1

    LOG("Redo complete: " + records_redone + " redone, " + records_skipped + " skipped")
    RETURN SUCCESS
```

The page_lsn check is crucial:
The page_lsn (stored on each page) tells us the LSN of the last log record that modified this page and was flushed to disk. During redo:
This is how the "fuzziness" is resolved: we don't know from the checkpoint which pages were actually flushed, but we can tell by reading each page and checking its LSN.
| Page in checkpoint DPT? | log_record.lsn vs page.page_lsn | Action | Reason |
|---|---|---|---|
| Yes | log_record.lsn > page_lsn | REDO | Page on disk is older; apply change |
| Yes | log_record.lsn ≤ page_lsn | SKIP | Page was flushed after this log record |
| No, but dirtied after checkpoint (added to DPT by analysis) | log_record.lsn > page_lsn | REDO | Page modified after checkpoint; apply change |
| No, and never dirtied after checkpoint | — | SKIP (optimized) | Page was clean at checkpoint, stayed clean, and is absent from the final DPT |
Redo operations are idempotent—applying them twice produces the same result as applying once. This is essential because recovery might crash and restart. The page_lsn check ensures we don't re-apply operations that already succeeded in a previous recovery attempt.
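A minimal sketch of how the page_lsn guard makes redo idempotent; this is an illustration, not any engine's actual code, and the Page/LogRecord types and the balance field are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Page:
    page_lsn: int   # LSN of the last change this page reflects
    balance: int    # example payload

@dataclass
class LogRecord:
    lsn: int
    delta: int      # example: amount to add to balance

def redo_record(page: Page, rec: LogRecord) -> None:
    """Apply rec to page only if the page has not already seen it."""
    if rec.lsn <= page.page_lsn:
        return                    # change already reflected; skipping keeps redo idempotent
    page.balance += rec.delta     # replay the change
    page.page_lsn = rec.lsn       # record that the page now reflects rec

page = Page(page_lsn=90, balance=100)
rec = LogRecord(lsn=120, delta=50)

redo_record(page, rec)   # applies: balance -> 150, page_lsn -> 120
redo_record(page, rec)   # no-op: rec.lsn <= page_lsn, so nothing changes
assert page.balance == 150 and page.page_lsn == 120
```

Running the same record twice (as a restarted recovery might) leaves the page unchanged the second time, which is exactly the property the paragraph above describes.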
The undo phase rolls back all transactions that were active at crash time and never committed—the "loser" transactions identified by the final ATT from analysis.
Why undo is necessary:
After redo, the database reflects the crash-time state, including partial work by uncommitted transactions. This violates atomicity—uncommitted transactions should leave no trace. Undo removes these partial modifications.
Undo phase approach:
Undo works backward through the log, undoing each modification by loser transactions. As each modification is undone, a Compensation Log Record (CLR) is written, recording the undo operation.
```
-- Undo phase: rolling back loser transactions
FUNCTION undo_phase(att):
    -- Build list of loser transactions (still in ATT after analysis)
    losers = [txn for txn in att WHERE txn.state IN (ACTIVE, ABORTED)]

    LOG("Beginning undo phase for " + COUNT(losers) + " loser transactions")

    -- Build priority queue of log records to undo
    -- Priority: highest LSN first (undo in reverse order)
    undo_queue = PriorityQueue()  -- Max-heap by LSN

    FOR txn IN losers:
        IF txn.undo_nxt_lsn IS NOT NULL:
            undo_queue.insert(txn.undo_nxt_lsn, txn.txn_id)
        ELSE:
            undo_queue.insert(txn.last_lsn, txn.txn_id)

    -- Process undo in reverse LSN order
    WHILE undo_queue IS NOT EMPTY:
        lsn, txn_id = undo_queue.pop()
        log_record = read_log_record(lsn)

        IF log_record.type == 'CLR':
            -- This is a compensation log record from previous rollback
            -- Skip to the undo_nxt_lsn recorded in the CLR
            IF log_record.undo_nxt_lsn IS NOT NULL:
                undo_queue.insert(log_record.undo_nxt_lsn, txn_id)
            -- If undo_nxt_lsn is NULL, this txn is fully undone
            CONTINUE

        IF log_record.type IN ('UPDATE', 'INSERT', 'DELETE'):
            -- Undo this modification
            page = read_page(log_record.page_id)
            page = apply_undo(log_record, page)

            -- Write CLR to log
            clr = write_clr(
                txn_id: txn_id,
                undo_of: lsn,
                undo_nxt_lsn: log_record.prev_lsn,  -- Next to undo
                page_id: log_record.page_id
            )

            page.page_lsn = clr.lsn
            write_page(page)

            -- Continue with previous log record of this txn
            IF log_record.prev_lsn IS NOT NULL:
                undo_queue.insert(log_record.prev_lsn, txn_id)
            ELSE:
                -- Reached beginning of txn; write END record
                write_end_record(txn_id)
                LOG("Transaction " + txn_id + " fully undone")

    LOG("Undo phase complete")
    RETURN SUCCESS
```

Why write CLRs during undo?
Compensation Log Records serve two purposes:
Durability of undo operations: CLRs are log records, so they're durable. If the system crashes during undo, recovery can continue undo from where it left off by reading the CLRs.
Prevent repeating undone work: each CLR records that a specific modification has already been undone. If recovery itself crashes and restarts, redo replays the CLR (repeating history, including the undo), and the undo phase follows the CLR's undo_nxt_lsn so the original operation is never undone a second time.
The undo_nxt_lsn chain:
Each CLR contains an undo_nxt_lsn pointing to the next log record that needs undoing. This chain enables recovery to resume a partially completed rollback exactly where it left off: after a crash during undo, the undo phase follows the undo_nxt_lsn in the most recent CLR instead of re-undoing finished work, which also bounds the amount of log written across repeated recoveries. The sketch after this paragraph walks through a concrete chain.
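As a concrete (hypothetical) illustration: suppose loser transaction T1 wrote updates at LSNs 10, 20, and 30, and the system crashed after undo had already reversed LSN 30 and logged a CLR at LSN 40 with undo_nxt_lsn = 20. Following the chain resumes rollback at LSN 20 without touching LSN 30 again:

```python
# Hypothetical log for loser transaction T1 (prev_lsn links records of the same txn).
log = {
    10: {"type": "UPDATE", "prev_lsn": None, "undo_nxt_lsn": None},
    20: {"type": "UPDATE", "prev_lsn": 10,   "undo_nxt_lsn": None},
    30: {"type": "UPDATE", "prev_lsn": 20,   "undo_nxt_lsn": None},
    40: {"type": "CLR",    "prev_lsn": 30,   "undo_nxt_lsn": 20},  # undo of LSN 30
}

def records_to_undo(start_lsn):
    """Walk the undo chain from the transaction's last log record."""
    lsn, remaining = start_lsn, []
    while lsn is not None:
        rec = log[lsn]
        if rec["type"] == "CLR":
            lsn = rec["undo_nxt_lsn"]   # skip past work that was already undone
        else:
            remaining.append(lsn)       # this update still needs undoing
            lsn = rec["prev_lsn"]
    return remaining

# Restart undo from T1's last log record (the CLR at 40):
print(records_to_undo(40))  # [20, 10] -- LSN 30 is never undone a second time
```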
In typical systems, most transactions commit quickly, so there are few losers. Even if a transaction ran for an hour, undo only needs to reverse its actual modifications—often seconds of work. Undo time rarely dominates recovery.
After undo completes, the database is in a transaction-consistent state. But before opening to users, a few finishing steps are required.
Finalization steps:
```
-- Recovery completion and database opening
FUNCTION complete_recovery():
    -- Ensure all recovery work is durable
    force_log_to_stable_storage()

    LOG("Recovery complete. Summary:")
    LOG("  Redo: " + stats.redo_records + " records processed")
    LOG("  Undo: " + stats.losers_count + " transactions rolled back")
    LOG("  Duration: " + stats.duration_seconds + " seconds")

    -- Optional: take immediate checkpoint to protect recovery work
    IF config.checkpoint_after_recovery:
        perform_checkpoint(reason = "POST_RECOVERY")

    -- Update control file
    update_control_file(
        state: CLEAN,
        last_recovery_time: NOW(),
        recovery_lsn: end_of_log
    )

    -- Open database for connections
    set_database_state(OPEN)
    LOG("Database open for user connections")

    -- Start background processes
    start_background_writer()
    start_checkpoint_scheduler()
    start_archiver()

    -- Some systems continue background work
    IF deferred_work IS NOT EMPTY:
        schedule_background_cleanup(deferred_work)
        LOG("Background cleanup scheduled for " + COUNT(deferred_work) + " items")

-- Timeline of recovery phases
RECOVERY_TIMELINE = {
    startup:  "~1-5 seconds (read control files)",
    analysis: "~10-30 seconds per GB of log (log scan only)",
    redo:     "~1-5 minutes per GB of data (I/O bound)",
    undo:     "~seconds to minutes (depends on loser work)",
    finalize: "~1-5 seconds (log force, open)"
}
```

Recovery time breakdown:
In most recoveries, startup and finalization are negligible, analysis takes seconds, redo dominates the total time, and undo adds only a small tail of work for the few loser transactions.
Why checkpoint after recovery?
If the system crashes again soon after recovery, it would need to redo the same work. A post-recovery checkpoint records the just-recovered state, so a second crash starts recovery from this point instead of replaying the same log again.
Some systems make this optional (immediate availability is valued more than crash protection).
Transactions in PREPARING state (2-phase commit) cannot be committed or aborted unilaterally. After basic recovery, these transactions remain 'in-doubt' until the coordinator is contacted. Some systems open for normal operation while waiting for in-doubt resolution; others block until resolved.
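A minimal sketch of that distinction, using hypothetical names and a simplified ATT shape; real handling depends on the system's 2-phase-commit implementation:

```python
def classify_crash_time_transactions(att):
    """Split crash-time transactions into losers to undo and in-doubt ones to hold."""
    losers, in_doubt = [], []
    for txn_id, entry in att.items():
        if entry["state"] == "PREPARING":
            in_doubt.append(txn_id)   # keep its locks; wait for the coordinator's verdict
        else:
            losers.append(txn_id)     # never prepared or committed; roll back during undo
    return losers, in_doubt

att = {
    "T1": {"state": "ACTIVE"},     # loser: undone during recovery
    "T2": {"state": "PREPARING"},  # in-doubt: resolved later by the coordinator
}
print(classify_crash_time_transactions(att))  # (['T1'], ['T2'])
```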
Not all checkpoints are equal. The quality of a checkpoint—how recent its redo_lsn is, how accurate its tables are—directly affects recovery speed.
Factors that improve checkpoint quality:
| Quality Factor | Good State | Poor State | Recovery Impact |
|---|---|---|---|
| Dirty page age | Most pages recently flushed | Many old dirty pages | Redo_lsn close to checkpoint vs. far behind |
| Checkpoint frequency | Checkpoints every 5 min | Checkpoints every hour | 5 min of log replay vs. 1 hour |
| Active transaction count | Few long-running txns | Many long-running txns | Small ATT vs. large ATT in undo |
| Log generation rate | Steady, moderate rate | High burst rate | Predictable redo time vs. variable |
| Background flushing | Aggressive, continuous | Minimal, lazy | Short redo vs. long redo |
The redo_lsn is the key metric:
The most important checkpoint quality indicator is the distance from redo_lsn to the checkpoint LSN:
When redo_lsn is old, it means dirty pages haven't been flushed. Even though the checkpoint is recent, recovery must go back further into the log.
```sql
-- Monitoring checkpoint quality for recovery prediction

-- PostgreSQL: Estimate recovery time from current state
SELECT
    pg_current_wal_lsn() AS current_lsn,
    (SELECT redo_lsn FROM pg_control_checkpoint()) AS checkpoint_redo_lsn,
    pg_wal_lsn_diff(
        pg_current_wal_lsn(),
        (SELECT redo_lsn FROM pg_control_checkpoint())
    ) / 1024 / 1024 AS log_distance_mb,
    -- Rough estimate: assume 100 MB/s redo rate
    pg_wal_lsn_diff(
        pg_current_wal_lsn(),
        (SELECT redo_lsn FROM pg_control_checkpoint())
    ) / 1024 / 1024 / 100 AS estimated_redo_seconds;

-- MySQL/InnoDB: Check checkpoint age
SELECT
    (@@innodb_log_file_size * @@innodb_log_files_in_group) AS log_space_bytes,
    (SELECT VARIABLE_VALUE FROM performance_schema.global_status
     WHERE VARIABLE_NAME = 'Innodb_checkpoint_age') AS checkpoint_age,
    -- checkpoint_age / log_space = how full the log is
    -- Higher = potentially longer recovery
    ROUND(
        (SELECT VARIABLE_VALUE FROM performance_schema.global_status
         WHERE VARIABLE_NAME = 'Innodb_checkpoint_age')
        / (@@innodb_log_file_size * @@innodb_log_files_in_group) * 100,
        2
    ) AS log_fill_percentage;

-- Alert if recovery estimate exceeds threshold
-- Example alert condition:
--   IF estimated_redo_seconds > target_recovery_time * 0.8:
--       ALERT("Recovery time approaching SLA threshold")
```

Predictions are estimates. The only way to know true recovery time is to crash a test system and measure. Do this quarterly: create a replica, crash it, time recovery, compare to SLA. This validates your checkpoint and flushing configuration.
Let's examine recovery in concrete scenarios, tracing through the phases with example data.
Scenario: E-commerce Database Crash
An e-commerce database crashes during peak traffic. At crash time, the last checkpoint (completed at 14:32:00, about three minutes earlier) recorded 150 active transactions and 50,000 dirty pages, and roughly 550 MB of log separated the checkpoint's redo point from the end of the log.
Recovery trace:
```
-- Recovery scenario walkthrough

-- STARTUP (0-2 seconds)
[14:35:00] Database startup initiated
[14:35:00] Reading control file...
[14:35:00] Detected unclean shutdown at 14:34:55
[14:35:01] Master record points to checkpoint at LSN 1000000
[14:35:01] Reading checkpoint record...
[14:35:01] Checkpoint loaded:
           - ATT entries: 150 transactions
           - DPT entries: 50,000 pages
           - Redo_lsn: 980000 (2.5 min before checkpoint)
           - Checkpoint taken at: 14:32:00

-- ANALYSIS PHASE (2-8 seconds)
[14:35:02] Starting analysis from LSN 980000
[14:35:02] Scanning 500 MB of log records...
[14:35:05] Analysis progress: 250 MB scanned (50%)
[14:35:08] Analysis complete:
           - Final ATT: 35 transactions (losers)
           - 115 transactions committed after checkpoint
           - Final DPT: 52,000 pages
           - 2,000 pages dirtied after checkpoint
           - Redo_lsn confirmed: 980000

-- REDO PHASE (8-60 seconds)
[14:35:08] Starting redo from LSN 980000
[14:35:08] Processing 650,000 log records...
[14:35:15] Redo progress: 100 MB / 550 MB total
[14:35:25] Redo progress: 250 MB / 550 MB total
[14:35:40] Redo progress: 400 MB / 550 MB total
[14:35:55] Redo progress: 550 MB / 550 MB total
[14:36:00] Redo complete:
           - Records processed: 650,000
           - Pages actually redone: 48,500
           - Pages skipped (already flushed): 3,500

-- UNDO PHASE (60-68 seconds)
[14:36:00] Starting undo for 35 loser transactions
[14:36:02] Undoing transaction 1001 (1,200 modifications)
[14:36:04] Undoing transaction 1005 (450 modifications)
           ... (33 more transactions)
[14:36:08] Undo complete:
           - Transactions undone: 35
           - Total modifications reversed: 12,500
           - CLRs written: 12,500

-- FINALIZATION (68-70 seconds)
[14:36:08] Forcing log to stable storage
[14:36:09] Performing post-recovery checkpoint
[14:36:10] Updating control file
[14:36:10] Database open for connections

-- SUMMARY
[14:36:10] Recovery complete
           Total time: 70 seconds
           - Startup: 2 seconds
           - Analysis: 6 seconds
           - Redo: 52 seconds
           - Undo: 8 seconds
           - Finalization: 2 seconds
```

Observations from this scenario:
Redo dominated: 52 of 70 seconds (74%) was redo. This is typical.
DPT fuzziness resolved: 52,000 pages in the final DPT, but only 48,500 actually needed redo. The other 3,500 had been flushed between checkpoint and crash.
ATT fuzziness resolved: 150 transactions in checkpoint ATT, but only 35 were losers. The rest committed after checkpoint.
3,500 pages skipped: Background flushing between checkpoint and crash saved redo work.
35 losers were small: Even 35 active transactions only needed 12,500 undo operations—8 seconds of work.
This 70-second recovery meets a 2-minute RTO easily. If the RTO were 30 seconds, adjustments would be needed: more frequent checkpoints, more aggressive flushing, or faster storage.
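A rough way to turn an RTO target into a log-distance budget, using the redo rate observed in this scenario (about 550 MB in 52 seconds); the helper function and overhead figure are illustrative assumptions, not a standard formula:

```python
def max_log_distance_mb(rto_seconds: float,
                        redo_rate_mb_per_s: float,
                        fixed_overhead_s: float = 10.0) -> float:
    """Largest redo_lsn-to-crash log distance (MB) that still fits the RTO.

    fixed_overhead_s approximates startup + analysis + undo + finalization.
    """
    return max(0.0, (rto_seconds - fixed_overhead_s) * redo_rate_mb_per_s)

observed_rate = 550 / 52                         # ~10.6 MB/s of redo in the trace above
print(max_log_distance_mb(120, observed_rate))   # ~1160 MB: a 2-minute RTO has plenty of room
print(max_log_distance_mb(30, observed_rate))    # ~210 MB: a 30-second RTO needs tighter checkpoints
```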
Recovery is where checkpoint theory meets reality. The checkpoint provides the starting point; the recovery algorithm uses that information to restore a consistent database state. To consolidate the key concepts: the master record, ATT, DPT, and redo_lsn tell recovery where to start; analysis rebuilds the crash-time ATT and DPT from the log; redo repeats history, using the page_lsn check to stay idempotent; and undo rolls back the losers while writing CLRs so its own work is durable and never repeated.
Module conclusion:
With this page, we've completed our deep dive into checkpoints. You now understand what checkpoints record and why they exist, how the checkpoint process runs alongside normal work, why checkpoints are fuzzy and how recovery resolves that fuzziness, how checkpoint frequency trades runtime overhead against recovery time, and how recovery turns checkpoint information into a consistent, open database.
Checkpoints are the bridge between the theoretical guarantees of transaction durability and the practical reality of systems that must recover quickly after failures. Every production database depends on well-tuned checkpointing to meet availability requirements.
Congratulations! You have completed the Checkpoints module. You now possess deep, practical knowledge of checkpoint mechanisms—understanding that will serve you whether you're tuning production databases, debugging recovery issues, or designing new database systems. You understand not just how checkpoints work, but why they work that way.