When a committed transaction's changes haven't reached disk before a crash, those changes are lost from the data files. But they're not lost from the database—because they were logged first. The system can redo those changes, replaying them from the log to restore the committed state.
Redo information is the complement to undo. Where undo provides atomicity (abort leaves no trace), redo provides durability (commit survives any failure). Together, they form the complete recovery capability of WAL.
This page explores redo in depth: what it contains, how it's structured, and how the redo phase of recovery reconstructs committed transactions that the crash interrupted.
You will understand redo information (after-images), the mechanics of the redo phase during recovery, idempotence requirements for redo operations, and how databases achieve durability through log-based replay. This completes your understanding of WAL's recovery guarantees.
Redo information, also called after-images or new values, represents the values that a modification introduces. It provides the ability to recreate any committed change.
Formal Definition:
For any modification M that changes data item X from value V₁ to value V₂, the redo information is the after-image V₂ (together with X's location). Given the log record, the system can set X to V₂ regardless of X's current state.
Contrast with Undo:
| Aspect | Undo Information | Redo Information |
|---|---|---|
| Contains | Before-image (old value) | After-image (new value) |
| Purpose | Reverse uncommitted changes | Replay committed changes |
| Guarantees | Atomicity | Durability |
| When Applied | Transaction abort / Crash recovery undo phase | Crash recovery redo phase |
| Direction | Moves backward in time | Moves forward in time |
What Redo Information Contains:
| Component | Description | Example |
|---|---|---|
| LSN | Log Sequence Number for ordering | 12847 |
| Transaction ID | Which transaction made this change | T1024 |
| Page ID | Which page was modified | Page#4521 |
| Offset | Position within the page | Byte 312 |
| Length | Size of the modified region | 8 bytes |
| After-Image | The new value to apply | 500 |
Notice that redo doesn't strictly need the before-image—only the new value matters for replay. However, most systems store both in a single log record, serving dual purposes.
Even if the data page was completely reinitialized (lost entirely), redo can reconstruct it by replaying all modifications from the log. The log is the source of truth—data pages are merely a cache that can be rebuilt.
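To make the "pages are a cache" idea concrete, here is a minimal TypeScript sketch of rebuilding a lost page purely from its log records. The `LogRecord` and `Page` shapes and the 8 KB page size are illustrative assumptions, not any particular engine's format:

```typescript
interface LogRecord {
  lsn: number;
  pageID: number;
  offset: number;
  afterImage: Uint8Array; // redo needs only the new bytes
}

class Page {
  pageLSN = 0;
  data = new Uint8Array(8192); // assume 8 KB pages

  apply(rec: LogRecord): void {
    this.data.set(rec.afterImage, rec.offset);
    this.pageLSN = rec.lsn; // stamp the last applied LSN
  }
}

// Rebuild a page that was lost entirely: start from a zeroed page
// and replay every logged modification for it, in LSN order.
function rebuildPage(pageID: number, records: LogRecord[]): Page {
  const page = new Page();
  const sorted = [...records].sort((a, b) => a.lsn - b.lsn);
  for (const rec of sorted) {
    page.apply(rec); // idempotent: writing the same bytes twice is harmless
  }
  return page;
}
```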
The redo component of WAL has a specific timing constraint:
The Redo Rule: Before a transaction commits (returns success to the client), all log records for that transaction—including the commit record—must be on stable storage.
This rule enables the NO-FORCE policy:
NO-FORCE Explanation:
When a transaction commits, we don't force its dirty pages to disk. We only force its log records. This is dramatically faster:
A transaction with 100 modified pages might require 100 random I/Os to force data. But its log records are sequential and might fit in a single I/O.
```
function commitTransaction(transaction: Transaction):
    // STEP 1: Create commit record
    commitRecord = new CommitLogRecord(
        LSN: log.nextLSN(),
        TransactionID: transaction.id,
        PrevLSN: transaction.lastLSN,
        CommitTime: now()
    )

    // STEP 2: Append commit record to log buffer
    log.append(commitRecord)

    // STEP 3: Force log to stable storage
    // This is the ONLY I/O barrier at commit time
    log.flushUpTo(commitRecord.LSN)

    // STEP 4: Transaction is now durable!
    // Even if crash happens right now, recovery can redo all changes

    // STEP 5: Release locks (strict 2PL releases at commit)
    lockManager.releaseAll(transaction)

    // STEP 6: Return success to client
    // Data pages may still be dirty in buffer - that's fine
    return COMMIT_SUCCESS

// Note what we did NOT do:
// - We did NOT force dirty pages to disk
// - We did NOT wait for data page writes
// - We ONLY waited for log flush
// This is NO-FORCE: data pages written lazily, log written eagerly
```

Why is NO-FORCE Safe?
Consider what happens if a crash occurs immediately after commit returns: the dirty data pages in the buffer pool are lost, but the log—including the commit record—is already on stable storage. The redo phase of recovery replays those log records and restores every committed change.
The key insight: The log IS the transaction. Data pages are just a performance optimization—a cached materialization of what the log already contains.
Flushing the log for every commit would be expensive. Systems use 'group commit': batch multiple transactions' commit records and flush once. Each transaction waits until the batch flush completes. This dramatically reduces I/O while maintaining full durability.
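As a rough illustration of group commit, here is a minimal TypeScript sketch; the 1 ms batching window and the `appendToBuffer`/`flushToDisk` helpers are assumptions, not a real engine's API. Many transactions commit concurrently, but only one log flush is paid per batch:

```typescript
class GroupCommitLog {
  private waiters: Array<() => void> = []; // commits waiting on the next flush
  private flushScheduled = false;

  // Each committing transaction appends its records, then awaits the flush.
  commit(commitRecord: string): Promise<void> {
    this.appendToBuffer(commitRecord);
    return new Promise<void>((resolve) => {
      this.waiters.push(() => resolve());
      if (!this.flushScheduled) {
        this.flushScheduled = true;
        // Wait briefly so concurrent commits can join this batch.
        setTimeout(() => this.flushBatch(), 1);
      }
    });
  }

  private flushBatch(): void {
    const batch = this.waiters;
    this.waiters = [];
    this.flushScheduled = false;
    this.flushToDisk(); // ONE log flush makes the whole batch durable
    for (const wake of batch) wake(); // every transaction in the batch returns
  }

  private appendToBuffer(_record: string): void { /* append to in-memory log */ }
  private flushToDisk(): void { /* fsync the log file (assumed) */ }
}
```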
After a crash, the redo phase replays logged operations to ensure all committed changes are present in the data pages. This is the second phase of ARIES-style recovery (after Analysis):
Redo Phase Goals:
- Ensure every committed change is present in the data pages
- Reconstruct the exact state at the moment of the crash—including changes made by uncommitted transactions
Wait—why redo uncommitted transactions? Because we need to restore the exact crash-time state before undo can work correctly. We 'repeat history' then undo the losers.
```
function recoveryRedoPhase(redoStartLSN: LSN, dirtyPageTable: Map<PageID, RecoveryLSN>):
    print("Starting REDO phase from LSN: " + redoStartLSN)

    // Scan log forward from the redo start point
    currentLSN = redoStartLSN

    while currentLSN < log.endOfLog():
        logRecord = log.read(currentLSN)

        // Only consider records that affect data pages
        if logRecord.type in [UPDATE, CLR, FULL_PAGE_IMAGE]:
            pageID = logRecord.pageID

            // Check if redo is needed using dirty page table
            if pageID in dirtyPageTable:
                recoveryLSN = dirtyPageTable[pageID]

                // Redo only if log LSN >= recovery LSN
                // (modification might not have reached disk)
                if currentLSN >= recoveryLSN:
                    // Fetch the page
                    page = bufferPool.getPage(pageID)

                    // Compare page LSN with log record LSN
                    if page.pageLSN < currentLSN:
                        // THIS is the key redo check
                        // Page LSN < Log LSN means this update was NOT applied
                        // Must apply the redo
                        applyRedo(page, logRecord)
                        page.pageLSN = currentLSN
                        page.markDirty()
                        print("  REDO: Applied LSN " + currentLSN + " to Page " + pageID)
                    else:
                        // Page already contains this or later update
                        // No redo needed
                        print("  SKIP: Page " + pageID + " already has LSN " + page.pageLSN)

        currentLSN = log.nextRecord(currentLSN)

    print("REDO phase complete")

function applyRedo(page: Page, logRecord: LogRecord):
    // Apply the after-image to the page
    page.setData(
        logRecord.offset,
        logRecord.afterImage
    )
```

The Page LSN Check:
The critical optimization in redo is comparing page.pageLSN with log.LSN:
- `page.pageLSN >= log.LSN`: this update already reached disk → skip
- `page.pageLSN < log.LSN`: the update was lost → apply redo

This avoids re-applying modifications that survived the crash, making redo efficient even when most data was already persistent.
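A tiny worked example of the check, with made-up LSNs:

```typescript
// Hypothetical values, for illustration only.
const pageLSN = 12840;   // LSN stamped on the on-disk page
const recordLSN = 12847; // LSN of the log record being replayed

if (pageLSN < recordLSN) {
  // The page predates this record: its update was lost in the crash.
  console.log(`apply redo, then set pageLSN = ${recordLSN}`);
} else {
  // The page already reflects this record (or a later one): skip.
  console.log("skip redo");
}
```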
Why Redo Everything (Including Uncommitted)?
The redo phase doesn't distinguish committed from uncommitted transactions. It replays ALL logged modifications. This 'repeating history' approach has a crucial benefit: undo then operates on the exact state that existed at the moment of the crash, so rollback is always applied to pages in a known condition.
The order is significant: redo first, then undo. Redo reconstructs the exact crash state. Undo then removes effects of losers. If we undid before redoing, we might undo things that were never applied, or miss undoing things that were partially applied.
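To make the ordering explicit, here is a minimal TypeScript sketch of an ARIES-style restart driver; the phase functions are simplified stand-ins for the routines described on this page, not a real implementation:

```typescript
type LSN = number;

interface AnalysisResult {
  redoStartLSN: LSN;              // min RecoveryLSN from the dirty page table
  loserTransactions: Set<string>; // transactions with no commit record
}

// Stub phases standing in for the routines described above.
function analysisPhase(): AnalysisResult {
  // Scan forward from the last checkpoint to rebuild the dirty page
  // table and transaction table (details omitted in this sketch).
  return { redoStartLSN: 0, loserTransactions: new Set<string>() };
}

function redoPhase(_startLSN: LSN): void {
  // Repeat history: replay ALL updates from startLSN, committed or not.
}

function undoPhase(_losers: Set<string>): void {
  // Roll back only the loser transactions, writing CLRs as we go.
}

function restartRecovery(): void {
  const result = analysisPhase();      // 1. where redo starts, who the losers are
  redoPhase(result.redoStartLSN);      // 2. reconstruct the exact crash state
  undoPhase(result.loserTransactions); // 3. only then remove the losers
}
```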
For recovery to be reliable, redo operations must be idempotent—applying the same redo multiple times produces the same result as applying it once.
Why Idempotence Matters:
Consider this scenario: recovery is partway through replaying the log when the system crashes again. On the next restart, recovery begins redo from the start and replays some of the same log records a second time.
Without idempotence, we might apply the same operation twice:
`balance = balance + 100` → applied twice = wrong result!

With idempotence:

`balance = 600` → applied twice = same correct result

```
// NON-IDEMPOTENT: Dangerous for recovery!
// Applying twice gives wrong result

operation_increment:
    balance = balance + 100
    // First apply: balance goes from 500 to 600
    // Second apply: balance goes from 600 to 700 (WRONG!)

// IDEMPOTENT: Safe for recovery
// Applying once or many times gives same result

operation_set:
    balance = 600
    // First apply: balance becomes 600
    // Second apply: balance is still 600 (CORRECT!)

// WAL logging approach ensuring idempotence:
// Instead of logging: "increment balance by 100"
// Log: "set balance to 600" (the final value)

LogRecord {
    LSN: 12345,
    PageID: accounts_page_42,
    Offset: byte_312,
    BeforeImage: 500,  // What it was
    AfterImage: 600    // What it should be (not "add 100")
}

// Physical redo is naturally idempotent:
// "Write bytes 0x00000258 at offset 312"
// Doing this once or multiple times = same final state
```

Achieving Idempotence:
1. Physical Logging (Byte-Level)
Store the actual bytes to write, not the operation:
`{offset: 312, bytes: 0x00000258}` (the value 600)

2. Physiological Logging
Physical to the page, logical within the page:
`{pageID: 42, operation: 'insert row with key 100'}`

3. LSN-Based Skip
Use page LSN to detect already-applied operations:
If `page.pageLSN >= log.LSN`, skip the redo.

Any operation that depends on current state ('add 5', 'append to list', 'increment counter') is non-idempotent and dangerous for recovery. Always log the RESULT, not the DELTA. The log record should contain what the value IS, not what was done to it.
Without checkpoints, redo would need to start from the beginning of the log—potentially processing days or weeks of operations. Checkpoints limit the redo scope:
How Checkpoints Limit Redo: the checkpoint's dirty page table records, for each page, its RecoveryLSN—the first LSN that dirtied it since it was last flushed. Everything older than the minimum RecoveryLSN is guaranteed to be on disk, so redo never needs to look before it.
The Redo Start Point (ARIES RedoLSN):
```
// During checkpoint, we record the dirty page table state
function performCheckpoint():
    // Find all dirty pages in buffer pool
    dirtyPages = bufferPool.getDirtyPages()

    // For each dirty page, record its RecoveryLSN
    // (the first LSN that dirtied this page since last flush)
    dirtyPageTable = {}
    for page in dirtyPages:
        dirtyPageTable[page.id] = page.recoveryLSN

    // Also record active transactions
    activeTransactions = transactionManager.getActive()

    // Write checkpoint record
    checkpointRecord = new CheckpointRecord(
        LSN: log.nextLSN(),
        DirtyPageTable: dirtyPageTable,
        ActiveTransactions: activeTransactions
    )
    log.append(checkpointRecord)
    log.flush()

    // Update master record to point to this checkpoint
    writeMasterRecord(checkpointRecord.LSN)

// During recovery, determine where to start redo
function determineRedoStartLSN(checkpoint: CheckpointRecord) -> LSN:
    // The redo start point is the MINIMUM of:
    // 1. All RecoveryLSNs in the dirty page table
    //    This is the oldest modification that might not be on disk

    if checkpoint.dirtyPageTable.isEmpty():
        // No dirty pages at checkpoint time
        // Start redo from the checkpoint
        return checkpoint.LSN

    minRecoveryLSN = MIN(checkpoint.dirtyPageTable.values())

    // Could also need to consider active transactions' first LSNs
    // in case a transaction started before any current dirty page
    for txn in checkpoint.activeTransactions:
        if txn.firstLSN < minRecoveryLSN:
            minRecoveryLSN = txn.firstLSN

    return minRecoveryLSN
```

Checkpoint Trade-offs:
| Checkpoint Frequency | Checkpoint Cost | Recovery Time | Log Size |
|---|---|---|---|
| Very frequent | High (flush overhead) | Very short | Small |
| Moderate | Moderate | Moderate | Moderate |
| Infrequent | Low | Long redo phase | Large |
Fuzzy Checkpoints:
Modern systems use fuzzy checkpoints that don't require stopping all activity: the checkpoint records the dirty page table and the list of active transactions without flushing the dirty pages themselves, while normal transaction processing continues alongside.
This reduces checkpoint impact on normal operations.
More checkpoints = faster recovery but higher runtime cost. Fewer checkpoints = slower recovery but better throughput. Tune based on your recovery time objective (RTO). If you need recovery under 1 minute, checkpoint frequently. If 10 minutes is acceptable, checkpoint less often.
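As a back-of-envelope sketch of RTO tuning—all numbers are assumptions you would measure on your own system—worst-case redo time is roughly the log written since the last checkpoint divided by the replay rate:

```typescript
// Rough worst-case redo time estimate. All inputs are assumed values.
function estimateRedoSeconds(
  logWriteRateMBps: number,      // how fast the workload generates log
  checkpointIntervalSec: number, // time between checkpoints
  redoReplayRateMBps: number     // how fast recovery can replay log
): number {
  // Log that may need replaying: everything since the last checkpoint.
  const logToReplayMB = logWriteRateMBps * checkpointIntervalSec;
  return logToReplayMB / redoReplayRateMBps;
}

// Example: 20 MB/s of log, checkpoint every 10 minutes, replay at 100 MB/s
// => 20 * 600 / 100 = 120 seconds of redo, worst case.
console.log(estimateRedoSeconds(20, 600, 100)); // 120
```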
Database systems choose between different levels of abstraction for redo logging:
Physical Redo (Byte-Level)
Log the exact bytes modified at their exact location:
`{page: 42, offset: 312, oldBytes: 0x000001F4, newBytes: 0x00000258}`

Pros:
- Redo is trivial: copy the new bytes into place
- Naturally idempotent—writing the same bytes twice leaves the same state
Cons:
- Poor space efficiency: the log duplicates every modified byte (full data duplication)
| Type | Log Content | Redo Complexity | Space Efficiency |
|---|---|---|---|
| Physical | Byte changes at byte offsets | Trivial (copy bytes) | Poor (full data duplication) |
| Logical | High-level SQL operation | Complex (re-execute query) | Excellent (compact) |
| Physiological | Physical to page, logical within page | Moderate (interpret operation) | Good (compact but localized) |
Logical Redo (Operation-Level)
Log the SQL operation that was executed:
`{INSERT INTO accounts VALUES (42, 'Alice', 1000)}`

Pros:
- Excellent space efficiency: one compact record describes the whole operation
Cons:
- Complex redo: the operation must be re-executed, which assumes the database is already in a consistent state
- Not naturally idempotent—replaying an INSERT twice could insert the row twice
Physiological Redo (ARIES approach)
Physical to the page level, logical within the page:
`{page: 42, operation: INSERT_ROW, rowData: ...}`

This is the sweet spot:
- Redo is confined to a single page—no query re-execution
- Records stay compact compared to full byte images
- The page LSN check keeps replay idempotent
```
// Physical logging: log exact bytes
physicalLog = {
    LSN: 12345,
    PageID: 42,
    Offset: 512,
    Length: 64,
    OldImage: [0x00, 0x01, 0x02, ...],  // 64 bytes
    NewImage: [0x10, 0x11, 0x12, ...]   // 64 bytes
}

// Logical logging: log the SQL
logicalLog = {
    LSN: 12345,
    Query: "INSERT INTO users (id, name, balance) VALUES (100, 'Alice', 500)",
    TransactionID: T1
}

// Physiological logging: physical page, logical operation
physiologicalLog = {
    LSN: 12345,
    PageID: 42,               // Physical: which page
    Operation: "INSERT_ROW",  // Logical: what operation
    RowData: {                // Logical: operation data
        id: 100,
        name: "Alice",
        balance: 500
    },
    SlotHint: 7               // Optimization: where to insert
}

// Redo for physiological:
function redoPhysiological(page: Page, log: PhysiologicalLog):
    if log.operation == "INSERT_ROW":
        if not page.hasRowWithKey(log.rowData.id):
            page.insertRow(log.rowData, slotHint: log.slotHint)
    else if log.operation == "UPDATE_ROW":
        page.updateRow(log.rowData.id, log.rowData.changes)
    else if log.operation == "DELETE_ROW":
        if page.hasRowWithKey(log.rowData.id):
            page.deleteRow(log.rowData.id)
```

Long redo phases can extend database downtime during recovery. Several optimizations reduce redo time:
1. Parallel Redo
Redo operations on different pages are independent—they can execute in parallel:
```
function parallelRedoPhase(logRecords: List<LogRecord>, numWorkers: int):
    // Group log records by page
    recordsByPage = groupBy(logRecords, r -> r.pageID)

    // Create per-page redo chains
    // Each chain must execute in LSN order, but chains are independent
    redoChains = []
    for pageID, records in recordsByPage:
        chain = new RedoChain(pageID, sortByLSN(records))
        redoChains.append(chain)

    // Distribute chains to workers
    // Worker i handles chains where hash(pageID) % numWorkers == i
    workers = []
    for i in range(numWorkers):
        workerChains = redoChains.filter(c -> hash(c.pageID) % numWorkers == i)
        workers.append(startWorker(workerChains))

    // Wait for all workers to complete
    for worker in workers:
        worker.join()

class RedoWorker:
    function run(chains: List<RedoChain>):
        for chain in chains:
            page = bufferPool.getPage(chain.pageID)
            for record in chain.records:
                if page.pageLSN < record.LSN:
                    applyRedo(page, record)
                    page.pageLSN = record.LSN
                    page.markDirty()
```

2. Prefetching
During redo, we know which pages we'll need—the log tells us in advance. Prefetch them asynchronously ahead of the redo cursor so page reads overlap with replay instead of stalling it.
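A minimal sketch of the idea, with the prefetch and apply callbacks left as injected assumptions:

```typescript
// Keep a window of asynchronous page reads running ahead of the redo
// cursor, so pages are memory-resident by the time redo reaches them.
async function redoWithPrefetch(
  records: Array<{ lsn: number; pageID: number }>,
  prefetch: (pageID: number) => Promise<void>,     // assumed async page read
  applyRecord: (rec: { lsn: number; pageID: number }) => void,
  windowSize = 64
): Promise<void> {
  const inFlight = new Map<number, Promise<void>>();

  // Issue reads for the first window up front.
  for (const rec of records.slice(0, windowSize)) {
    if (!inFlight.has(rec.pageID)) inFlight.set(rec.pageID, prefetch(rec.pageID));
  }

  for (let i = 0; i < records.length; i++) {
    const rec = records[i];
    await inFlight.get(rec.pageID); // usually already resolved
    applyRecord(rec);

    // Keep the prefetch window ahead of the redo cursor.
    const ahead = records[i + windowSize];
    if (ahead && !inFlight.has(ahead.pageID)) {
      inFlight.set(ahead.pageID, prefetch(ahead.pageID));
    }
  }
}
```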
3. Batched Redo
Instead of applying one log record at a time, group consecutive records that target the same page and apply them in a single pass, paying the page fetch and latch cost once per batch.
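A sketch of the grouping step, with assumed record and page shapes:

```typescript
interface RedoRecord {
  lsn: number;
  pageID: number;
  apply: (page: object) => void; // applies this record's after-image
}

// Group consecutive same-page records so each page is fetched once per run.
function batchRedo(records: RedoRecord[], getPage: (id: number) => object): void {
  let i = 0;
  while (i < records.length) {
    const pageID = records[i].pageID;
    const page = getPage(pageID); // fetch + latch once for the whole run
    while (i < records.length && records[i].pageID === pageID) {
      records[i].apply(page);     // still applied in LSN order
      i++;
    }
  }
}
```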
4. Recovery Checkpoints
During very long redo, write periodic progress markers (recovery checkpoints) so that if the system crashes again mid-recovery, the next restart resumes redo from the last marker instead of starting over.
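A sketch of that idea, assuming a durable `saveProgress` write exists:

```typescript
// Persist redo progress every N records so a crash during recovery
// resumes from the marker instead of the original redo start point.
function redoWithProgressMarkers(
  records: Array<{ lsn: number }>,
  applyRecord: (rec: { lsn: number }) => void,
  saveProgress: (lsn: number) => void, // assumed durable write
  every = 10_000
): void {
  let sinceMarker = 0;
  for (const rec of records) {
    applyRecord(rec);
    if (++sinceMarker >= every) {
      saveProgress(rec.lsn); // next restart can resume after this LSN
      sinceMarker = 0;
    }
  }
}
```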
Production systems define a Recovery Time Objective (RTO)—maximum acceptable recovery time. Design checkpoint frequency, log size limits, and redo parallelism to meet your RTO. A 5-minute RTO requires different tuning than a 60-minute RTO.
We've explored redo information—the after-images that guarantee durability. Let's consolidate the key concepts:

- Redo information is the after-image: the new value a modification introduces
- The Redo Rule: all of a transaction's log records, including the commit record, must reach stable storage before commit returns
- NO-FORCE: commit flushes the log, not the data pages—the log is the source of truth
- The redo phase 'repeats history', replaying all logged modifications before undo removes the losers
- Redo must be idempotent; the page LSN check makes replay safe and efficient
- Checkpoints bound how far back the redo scan must start
What's Next:
Now that we understand both undo and redo information, the final page explores why WAL matters in practice—examining the critical importance of this protocol for database reliability, performance, and replication in production systems.
You now understand redo information—the after-images that guarantee durability. With undo and redo together, WAL provides complete recovery capability: atomicity (undo) and durability (redo). Next, we'll explore why WAL is critical for real-world database systems.