A production database might have a log spanning years of operation—billions of records representing countless transactions. When the system crashes, every one of these historical transactions has already completed; their effects are safely on disk. Only a tiny fraction of log records are relevant: those describing changes that might not have reached stable storage.
The RedoLSN—also called the recovery start point—is the earliest log position that could possibly require redo operations. Everything before this point is guaranteed to be safely on disk. Finding this boundary efficiently is crucial: starting redo too early wastes time replaying already-durable changes; starting too late risks missing updates, leaving the database inconsistent.
This page explores the algorithm for determining the RedoLSN, its correctness proof, the interaction between the Dirty Page Table and Transaction Table in this calculation, and practical considerations in production systems.
By the end of this page, you will understand exactly how the RedoLSN is calculated, why it's correct and complete, how it bounds recovery work, and how different system configurations affect this calculation. You'll be able to compute the RedoLSN for any given system state.
The RedoLSN marks the boundary between "definitely on disk" and "might need redo." Understanding what this means requires understanding how the database maintains durability.
The Durability Problem:
Modern databases use a no-force policy: when a transaction commits, its modified pages are not forced to disk immediately. Instead, the commit log record is flushed to the durable log, and the buffer manager writes the modified pages back lazily, whenever convenient.
This creates a window where committed data exists only in the log, not in the data files. If the system crashes during this window, redo must reconstruct these changes from log records.
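To make the no-force window concrete, here is a minimal sketch (all names — `Log`, `BufferPool`, `commit` — are illustrative, not from a real system): commit durably flushes the log, but deliberately leaves the transaction's pages dirty in memory.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using LSN = uint64_t;

// Sketch of a write-ahead log: appending assigns the next LSN, and
// flush() marks everything appended so far as durable.
struct Log {
    std::vector<std::string> records;
    LSN flushed_up_to = 0;
    LSN append(const std::string& rec) { records.push_back(rec); return records.size(); }
    void flush() { flushed_up_to = records.size(); }  // WAL: the log hits disk first
};

// Sketch of a buffer pool that only tracks which pages are dirty in memory.
struct BufferPool {
    std::map<int, bool> dirty;  // page_id -> still dirty in memory?
    void markDirty(int page) { dirty[page] = true; }
    bool anyDirty() const {
        for (const auto& [p, d] : dirty) if (d) return true;
        return false;
    }
};

// Commit under no-force: durability comes from the flushed log alone.
LSN commit(Log& log, BufferPool& pool, int txn_id) {
    LSN commit_lsn = log.append("COMMIT T" + std::to_string(txn_id));
    log.flush();   // the commit record must be durable...
    // ...but we deliberately do NOT flush pool's dirty pages here.
    return commit_lsn;
}
```

After `commit` returns, the dirty pages still exist only in memory — exactly the window that redo must be able to close from the log.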
Formal Definition:
RedoLSN = The smallest LSN such that all database pages are guaranteed to reflect all log records with LSN < RedoLSN.
Equivalently:
Starting redo from RedoLSN guarantees that we don't miss any updates that might not be on disk.
The key insight is that we need to consider two sources of information: the Dirty Page Table, whose recLSN entries bound the earliest page update that might not be on disk, and the Transaction Table, whose firstLSN entries bound the earliest activity of in-flight transactions.
The RedoLSN might be earlier than strictly necessary (some records processed during redo may already be on disk). This is acceptable—redo checks each page's LSN before applying operations. The critical property is completeness: we must never skip a record that needs redo.
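The "redo checks each page's LSN" safeguard can be sketched as follows (the `Page`/`LogRecord` shapes here are hypothetical simplifications): each page carries the LSN of the last update it reflects, and a record is replayed only if the page has not already seen it, which makes redo idempotent even when started earlier than necessary.

```cpp
#include <cstdint>

using LSN = uint64_t;

// A page stores the LSN of the last update reflected in its contents.
struct Page {
    LSN page_lsn = 0;
    int value = 0;
};

// A redo log record: which LSN it carries and what it writes.
struct LogRecord {
    LSN lsn;
    int new_value;
};

// Apply the record only if the page does not already reflect it.
// Starting redo "too early" is therefore harmless: stale records are skipped.
bool redoIfNeeded(Page& page, const LogRecord& rec) {
    if (page.page_lsn >= rec.lsn) {
        return false;            // effect already on the page: skip
    }
    page.value = rec.new_value;  // replay the change
    page.page_lsn = rec.lsn;     // the page now reflects this LSN
    return true;
}
```

This is why an overly conservative RedoLSN costs only wasted comparisons, while an overly aggressive one silently loses updates.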
After the Analysis Phase completes its log scan, calculating the RedoLSN is straightforward. It requires finding the minimum across two sets of LSNs.
Algorithm:
Implementation:
```cpp
LSN AnalysisPhase::calculateRedoLSN() {
    // Component 1: minimum recLSN from the Dirty Page Table.
    // This is the LSN of the earliest operation that might not be on disk.
    // If the DPT is empty, min_rec_lsn stays MAX_LSN (no redo needed from
    // the DPT's perspective).
    LSN min_rec_lsn = MAX_LSN;
    for (const auto& [page_id, entry] : dirty_page_table) {
        if (entry.recovery_lsn < min_rec_lsn) {
            min_rec_lsn = entry.recovery_lsn;
        }
    }

    // Component 2: minimum firstLSN from the Transaction Table (optional but
    // safer). In standard ARIES the DPT alone suffices, but some
    // implementations use this as an additional safety check.
    LSN min_txn_lsn = MAX_LSN;
    for (const auto& [txn_id, entry] : transaction_table) {
        if (entry.first_lsn < min_txn_lsn) {
            min_txn_lsn = entry.first_lsn;
        }
    }

    // The RedoLSN is the minimum of both components.
    LSN redo_lsn;
    if (min_rec_lsn == MAX_LSN && min_txn_lsn == MAX_LSN) {
        // No dirty pages and no active transactions:
        // nothing needs redo, so start from the end of the log.
        redo_lsn = log_manager.getLastLSN();
        log_info("RedoLSN: No redo needed (no dirty pages or active txns)");
    } else {
        redo_lsn = min(min_rec_lsn, min_txn_lsn);
        log_info("RedoLSN: %d (min_rec: %d, min_txn: %d)",
                 redo_lsn, min_rec_lsn, min_txn_lsn);
    }
    return redo_lsn;
}
```

| Scenario | Min DPT recLSN | Min Txn firstLSN | RedoLSN |
|---|---|---|---|
| Normal operation | 5000 | 5500 | 5000 |
| Long-running transaction | 6000 | 4000 | 4000 |
| No dirty pages | ∞ (empty) | 3000 | 3000 |
| No active transactions | 2500 | ∞ (empty) | 2500 |
| Fresh checkpoint, nothing dirty | ∞ | ∞ | End of log (no redo) |
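The scenarios in the table above reduce to a single minimum once empty tables are modeled as a sentinel. A small sketch (treating ∞ as `MAX_LSN`, as the implementation does):

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

using LSN = uint64_t;
const LSN MAX_LSN = std::numeric_limits<LSN>::max();

// The RedoLSN as a pure function of the two components: an empty DPT or
// Transaction Table contributes MAX_LSN (the "infinity" rows in the table).
// A result of MAX_LSN means "nothing to redo: start at the end of the log".
LSN redoLSN(LSN min_dpt_reclsn, LSN min_txn_firstlsn) {
    return std::min(min_dpt_reclsn, min_txn_firstlsn);
}
```

Each table row is one call: `redoLSN(5000, 5500)` is 5000 (normal operation), `redoLSN(6000, 4000)` is 4000 (a long-running transaction drags the RedoLSN back), and `redoLSN(MAX_LSN, MAX_LSN)` signals that redo can be skipped.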
Understanding why the minimum of recLSNs is the correct starting point requires tracing through what the recLSN represents and how pages reach disk.
Theorem: All log records with LSN < RedoLSN have their effects reflected on disk.
Proof:
Consider any log record R with LSN(R) < RedoLSN. We need to show that R's effects are on the database pages on disk.
Case 1: The page modified by R is not in the DPT
If a page is not in the DPT, one of two things is true:
In either case, the page on disk is current. Since the page isn't dirty, record R's effects (if R modified this page) must already be on disk.
Case 2: The page modified by R is in the DPT with recLSN > LSN(R)
The recLSN is the LSN of the first modification since the page was last flushed. Since LSN(R) < recLSN, R's modification happened before that first unflushed modification — that is, before the flush that preceded it. That flush wrote the page including R's change, so R's effects are on disk.
Case 3: The page modified by R is in the DPT with recLSN = LSN(R)
This means R was the first unflushed modification to the page. But RedoLSN ≤ recLSN for every DPT entry (it is the minimum), so LSN(R) = recLSN ≥ RedoLSN, contradicting LSN(R) < RedoLSN. This case cannot arise.
Case 4: The page modified by R is in the DPT with recLSN < LSN(R)
This would mean the page's first unflushed modification precedes R, so R's effects might not be on disk — the dangerous case. But it cannot arise: RedoLSN is the minimum recLSN over all DPT entries, so RedoLSN ≤ recLSN, and combining this with the assumption LSN(R) < RedoLSN gives LSN(R) < RedoLSN ≤ recLSN < LSN(R), a contradiction. ∎
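The theorem can also be checked executably. Below is a toy simulation (all names are invented, and it simplifies reality: one page per update, and a flush writes the whole page): recLSN is fixed when a clean page is first dirtied, cleared on flush, and the invariant asserts that no unflushed update ever sits below the RedoLSN.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <map>
#include <vector>

using LSN = uint64_t;
const LSN MAX_LSN = std::numeric_limits<LSN>::max();

// Per-page state: every update LSN ever applied, and how many of them
// the on-disk copy reflects (a flush writes the whole in-memory page).
struct SimPage {
    std::vector<LSN> updates;
    size_t flushed_count = 0;
    bool dirty() const { return flushed_count < updates.size(); }
    LSN recLSN() const { return updates[flushed_count]; }  // first unflushed update
};

struct Sim {
    std::map<int, SimPage> pages;
    LSN next_lsn = 1;

    void modify(int id) { pages[id].updates.push_back(next_lsn++); }
    void flush(int id)  { pages[id].flushed_count = pages[id].updates.size(); }

    // RedoLSN = minimum recLSN over all dirty pages.
    LSN redoLSN() const {
        LSN m = MAX_LSN;
        for (const auto& [id, p] : pages)
            if (p.dirty()) m = std::min(m, p.recLSN());
        return m;
    }

    // The theorem: every update with LSN < RedoLSN is reflected on disk,
    // i.e. no page has an unflushed update below the RedoLSN.
    bool theoremHolds() const {
        LSN r = redoLSN();
        for (const auto& [id, p] : pages)
            for (size_t i = p.flushed_count; i < p.updates.size(); ++i)
                if (p.updates[i] < r) return false;
        return true;
    }
};
```

Any interleaving of `modify` and `flush` calls preserves `theoremHolds()`, because each page's recLSN is by construction its smallest unflushed LSN and the RedoLSN is the minimum of those.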
Think of recLSN as a 'high water mark' for each page—everything below it has been flushed. The minimum recLSN across all pages is the global high water mark. Below this level, all pages are on disk. Above it, some pages might be dirty.
```
Log Timeline (LSN increasing rightward):
═══════════════════════════════════════════════════════════════
0      1000    2000    3000    4000    5000    6000    7000
|       |       |       |       |       |       |       |

Page 100:      [recLSN=2000]───────────────────────────────────
                ↑ First dirty since last flush

Page 200:                      [recLSN=3500]───────────────────
                                ↑ First dirty since last flush

Page 300:                              [recLSN=4500]───────────
                                        ↑ First dirty

RedoLSN = min(2000, 3500, 4500) = 2000

Everything at LSN < 2000 is GUARANTEED on disk:
- Page 100's recLSN is 2000: its last flush covered everything before 2000
- Page 200's recLSN is 3500 > 2000: it was flushed even more recently
- Page 300's recLSN is 4500 > 2000: same logic

Therefore, redo safely starts at LSN 2000.
```

Checkpoints indirectly affect the RedoLSN by influencing the Dirty Page Table state. Understanding this relationship reveals why checkpoint frequency directly impacts recovery time.
How Checkpoints Affect RedoLSN: a checkpoint never sets the RedoLSN directly. Its effect is indirect: a checkpoint that flushes dirty pages empties the DPT and pulls the RedoLSN forward to recent LSNs, while a fuzzy checkpoint merely snapshots the DPT, so the RedoLSN can lie well before the checkpoint record itself.
Checkpoint Types and Their Impact:
| Checkpoint Type | Dirty Page Flushing | RedoLSN Impact | Recovery Time |
|---|---|---|---|
| Quiescent | All pages flushed | RedoLSN = checkpoint LSN | Minimal redo |
| Transaction-Consistent | All pages flushed | RedoLSN = checkpoint LSN | Minimal redo |
| Fuzzy (standard ARIES) | No flushing required | RedoLSN may be before checkpoint | More redo possible |
| Aggressive Fuzzy | Background flushing encouraged | RedoLSN closer to checkpoint | Moderate redo |
```cpp
/*
 * Scenario: Fuzzy Checkpoint Without Aggressive Flushing
 *
 * Time 0: Page 100 dirtied at LSN 1000
 * Time 1: Page 200 dirtied at LSN 1500
 * Time 2: Checkpoint at LSN 2000
 *         DPT snapshot: {P100: recLSN=1000, P200: recLSN=1500}
 * Time 3: Page 300 dirtied at LSN 2500
 * Time 4: CRASH at LSN 3000
 *
 * Analysis Result:
 *   DPT: {P100: recLSN=1000, P200: recLSN=1500, P300: recLSN=2500}
 *   RedoLSN = 1000 (older than the checkpoint!)
 *
 * Redo must process 2000 log records (1000 to 3000)
 */

/*
 * Scenario: Checkpoint With Aggressive Flushing
 *
 * Time 0: Page 100 dirtied at LSN 1000
 * Time 1: Page 200 dirtied at LSN 1500
 * Time 2: Checkpoint at LSN 2000
 *         - The checkpoint also flushes all dirty pages!
 *         - DPT snapshot after flush: {} (empty)
 * Time 3: Page 300 dirtied at LSN 2500
 * Time 4: CRASH at LSN 3000
 *
 * Analysis Result:
 *   DPT: {P300: recLSN=2500}
 *   RedoLSN = 2500 (after the checkpoint!)
 *
 * Redo must process only 500 log records (2500 to 3000)
 */

void RecoveryManager::performAggressiveCheckpoint() {
    // Standard checkpoint steps plus page flushing
    LSN begin_lsn = log_manager.appendLog(BEGIN_CHECKPOINT);

    // Snapshot both tables first (for consistency)
    auto dpt_snapshot = dirty_page_table.snapshot();
    auto txn_snapshot = transaction_table.snapshot();

    // Flush ALL dirty pages to disk
    buffer_manager.flushAllDirtyPages();

    // The DPT can now be cleared (the pages are clean). The snapshot was
    // taken before the flush, so the checkpoint records the pre-flush state.
    log_manager.appendLog(END_CHECKPOINT, dpt_snapshot, txn_snapshot);
    log_manager.flush();

    // Update the master record
    updateMasterRecord(begin_lsn);
}
```

Aggressive flushing during checkpoints advances the RedoLSN (faster recovery) but increases checkpoint I/O cost and may impact online transaction performance. Systems must balance recovery time requirements against operational impact. Many production databases offer tunable checkpoint aggressiveness.
In standard ARIES, the Dirty Page Table is the primary source for RedoLSN calculation. However, considering the Transaction Table provides an additional safety margin and can be necessary in certain edge cases.
When Transaction Table Matters:
Incomplete DPT Information: If a log record modified a page but the page was evicted from the buffer pool before the DPT was fully updated (race condition in some implementations)
Distributed Transactions: In 2PC scenarios, a transaction may have prepared but not committed, affecting pages on remote nodes that aren't in the local DPT
Logical Logging: If the system uses logical logging (operations described abstractly rather than by exact page changes), the DPT may not capture all affected pages
```cpp
/*
 * Edge Case: Transaction Table Required
 *
 * Consider this scenario with physiological logging:
 *
 * LSN 1000: T1 begins
 * LSN 1100: T1 INSERT into table (affects B-tree pages 10, 11, 12)
 *           - Log record: INSERT key=42 into table
 *           - This is a logical operation, not physical page updates
 *           - The DPT might only track the leaf page, not internal pages
 * LSN 1200: Buffer manager evicts page 11 (writes it to disk)
 *           - Page 11 is removed from the DPT
 * LSN 1300: CRASH
 *
 * DPT at crash might only have:
 *   Page 10: recLSN = 1100
 *   Page 12: recLSN = 1100
 *   (Page 11 was evicted clean)
 *
 * Transaction Table:
 *   T1: firstLSN = 1000, lastLSN = 1100, state = ACTIVE
 *
 * If we use only the DPT:         RedoLSN = 1100
 * If we consider the Txn Table:   RedoLSN = min(1100, 1000) = 1000
 *
 * The safer choice is 1000, ensuring we don't miss any operations
 * that might have affected pages we don't know about.
 */

LSN AnalysisPhase::calculateRedoLSNSafe() {
    LSN from_dpt = getMinRecLSNFromDPT();
    LSN from_txn = getMinFirstLSNFromTransactionTable();

    // Take the minimum for safety
    LSN redo_lsn = min(from_dpt, from_txn);

    // Log if there's a significant difference (might indicate a problem)
    if (from_dpt != MAX_LSN && from_txn != MAX_LSN) {
        LSN diff = (from_dpt > from_txn) ? from_dpt - from_txn
                                         : from_txn - from_dpt;
        if (diff > 10000) {  // threshold for concern
            log_warn("RedoLSN sources differ significantly: DPT=%d, TxnTable=%d",
                     from_dpt, from_txn);
        }
    }
    return redo_lsn;
}
```

Using both the DPT and the Transaction Table for the RedoLSN provides defense in depth. The additional overhead (processing a few extra log records during redo) is negligible compared to the cost of missing a record that needs redo. Production systems often include such safety margins.
Several special cases affect the RedoLSN calculation. Understanding these helps debug recovery issues and optimize checkpoint strategies.
Case 1: Empty DPT and No Active Transactions
If both tables are empty after analysis, no redo is needed. This happens when the crash struck at a quiet moment: the last checkpoint (or a subsequent flush) had written every dirty page to disk, and no transactions were active between that point and the crash.
```cpp
LSN AnalysisPhase::calculateRedoLSN() {
    if (dirty_page_table.empty() && transaction_table.empty()) {
        // No redo needed - the database is consistent
        log_info("Database clean: no redo required");
        return END_OF_LOG;  // Special marker meaning "skip redo"
    }
    // ... normal calculation
}

bool RedoPhase::shouldRun(LSN redo_lsn) {
    if (redo_lsn == END_OF_LOG) {
        log_info("Redo phase skipped: nothing to redo");
        return false;
    }
    return true;
}
```

Case 2: Very Old Dirty Pages
If a page has been dirty for a long time (perhaps due to a stuck transaction or locking issue), RedoLSN might be very old, causing excessive redo work.
```cpp
void AnalysisPhase::detectProblematicDirtyPages() {
    LSN current_lsn = log_manager.getLastLSN();
    LSN checkpoint_lsn = getLastCheckpointLSN();

    for (const auto& [page_id, entry] : dirty_page_table) {
        LSN age = current_lsn - entry.recovery_lsn;

        // Flag pages dirty since before the last checkpoint
        if (entry.recovery_lsn < checkpoint_lsn) {
            log_warn("Page %d dirty since before checkpoint! recLSN=%d, ckpt=%d",
                     page_id, entry.recovery_lsn, checkpoint_lsn);
            // This might indicate:
            // 1. A long-running transaction holding the page
            // 2. Buffer manager thrashing (pages not being flushed)
            // 3. A checkpoint that doesn't flush aggressively enough
        }

        // Flag excessively old dirty pages
        if (age > CONFIG_MAX_DIRTY_PAGE_AGE) {
            log_error("Page %d has been dirty for %d LSN units - investigate!",
                      page_id, age);
        }
    }
}

/*
 * Optimization: Proactive Flushing
 *
 * Some systems periodically flush pages whose recLSN is getting too old,
 * preventing the RedoLSN from drifting too far into the past.
 */
void BufferManager::proactiveFlush() {
    LSN threshold = log_manager.getLastLSN() - CONFIG_FLUSH_AGE_THRESHOLD;

    for (Page& page : buffer_pool) {
        if (page.isDirty() && page.getRecLSN() < threshold) {
            flushPage(page);
            log_debug("Proactively flushed page %d (old recLSN)", page.id);
        }
    }
}
```

Case 3: Distributed Databases
In distributed systems, the RedoLSN concept extends across nodes. Each node has its own DPT and log, but global consistency requires coordinating recovery.
In distributed databases, each node determines its local RedoLSN independently. Global recovery may require waiting for all nodes to complete redo and undo before distributed transactions can be resolved. The 2PC protocol's PREPARE records help coordinate this process.
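As a hypothetical sketch of that coordination rule (the `NodeStatus` shape and `globalRedoComplete` are invented for illustration, not a real protocol): each node carries its own local RedoLSN, and the coordinator treats global recovery as redo-complete only when every node has finished replaying from its own start point.

```cpp
#include <vector>
#include <cstdint>

using LSN = uint64_t;

// Per-node recovery status reported to a hypothetical coordinator:
// the node's locally computed RedoLSN and whether its redo pass is done.
struct NodeStatus {
    LSN local_redo_lsn;
    bool redo_done;
};

// Distributed transactions (e.g. 2PC participants) cannot be resolved
// until every node has completed redo from its own local RedoLSN.
bool globalRedoComplete(const std::vector<NodeStatus>& nodes) {
    for (const auto& n : nodes) {
        if (!n.redo_done) return false;  // at least one node is still replaying
    }
    return true;
}
```

Note there is no single global RedoLSN: each node's start point is determined by its own DPT and log, and only the completion signal is coordinated.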
Let's work through several realistic examples to solidify understanding of RedoLSN calculation.
Scenario: An e-commerce database crashes during a busy shopping period.
State at Crash:
| Page | Type | recLSN |
|---|---|---|
| Customers_42 | Data page | 8,450,000 |
| Orders_idx_7 | Index page | 8,490,000 |
| Inventory_15 | Data page | 8,460,000 |
| Cart_temp_99 | Temp page | 8,500,000 |
| Products_2 | Data page | 8,400,000 |
Transaction Table:
Calculation: the minimum recLSN in the Dirty Page Table is 8,400,000, belonging to Products_2. Assuming no active transaction has a firstLSN earlier than that, RedoLSN = 8,400,000.
Redo will start from LSN 8,400,000 and process approximately 100,000 log records.
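The worked example above reduces to a minimum over the Dirty Page Table entries, computable directly:

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <map>
#include <string>

using LSN = uint64_t;

// Minimum recLSN across the Dirty Page Table; an empty table yields
// MAX_LSN, meaning "no redo needed from the DPT's perspective".
LSN minRecLSN(const std::map<std::string, LSN>& dpt) {
    LSN m = std::numeric_limits<LSN>::max();
    for (const auto& [page, rec_lsn] : dpt) m = std::min(m, rec_lsn);
    return m;
}
```

Feeding in the five pages from the crash-state table, Products_2 (recLSN 8,400,000) anchors the RedoLSN.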
The RedoLSN is the crucial boundary that separates "definitely on disk" from "might need redo." To consolidate: it is computed as the minimum recLSN in the Dirty Page Table (optionally also the minimum firstLSN in the Transaction Table); everything before it is guaranteed durable, so redo can safely begin there; and checkpoint aggressiveness determines how far back it reaches.
What's Next:
With the RedoLSN determined, we've completed the Analysis Phase's primary outputs: the reconstructed DPT, the Transaction Table with loser transactions identified, and the starting point for redo. The final page examines how all this information comes together to reconstruct the crash-time database state, preparing for the Redo and Undo phases.
You now understand how the Analysis Phase determines the RedoLSN—the algorithm, its correctness, the role of checkpoints, and practical implications. This knowledge is essential for understanding recovery performance and tuning checkpoint strategies.