When a database system crashes, the traditional instinct might be to find the most recent consistent state and move forward from there. However, the designers of ARIES—C. Mohan and colleagues at IBM Research—made a counterintuitive but profound choice: repeat history exactly as it happened before the crash.
This principle, known as Repeating History, is the philosophical foundation of the ARIES redo phase. Rather than trying to shortcut to a consistent state, ARIES painstakingly reconstructs the exact database state that existed at the moment of failure. Only after this reconstruction does it undo incomplete transactions.
This approach might seem wasteful at first glance—why redo work that will later be undone? But as we'll discover, this seemingly roundabout path is actually the most elegant and correct solution to an incredibly complex problem.
By the end of this page, you will understand: (1) Why repeating history is essential for correct recovery, (2) How this approach handles complex scenarios like nested transactions and partial writes, (3) The mathematical invariants that repeating history maintains, and (4) Why shortcuts lead to incorrect recovery in edge cases.
Repeating history is the ARIES principle that during recovery, the database should be restored to the exact state it was in at the moment of crash—including the effects of uncommitted transactions.
This might seem strange. After all, uncommitted transactions are, by definition, incomplete. Why would we want to restore their effects?
The answer lies in separation of concerns:

- The redo phase answers "what happened?": it physically reconstructs the exact database state at the moment of crash, paying no attention to transaction outcomes.
- The undo phase answers "what should remain?": working from that fully reconstructed state, it logically rolls back the transactions that never committed.
By separating these concerns, ARIES achieves several critical properties that simpler algorithms cannot.
| Aspect | Repeating History (ARIES) | Selective Redo |
|---|---|---|
| Redo scope | All logged operations from the oldest recLSN in the DirtyPageTable onward | Only operations of committed transactions |
| State after redo | Exact pre-crash state (committed + uncommitted) | Only committed modifications |
| Undo requirement | Required for uncommitted transactions | Not needed (uncommitted already excluded) |
| Complexity | Simpler redo logic, clean separation | Complex filtering during redo |
| Correctness | Mathematically proven correct | Edge cases in complex scenarios |
| CLR handling | Natural—CLRs are part of history | Requires special handling |
The key insight: By not trying to be clever during redo—by simply replaying history as it occurred—ARIES avoids an entire class of bugs and edge cases that plague selective recovery approaches.
Consider a scenario where Transaction T1 modifies page P, then Transaction T2 reads that modified value and makes its own changes to page Q based on T1's write. If we selectively skip T1's changes during redo (because T1 didn't commit), T2's changes now make no sense—they were computed based on a value we're pretending doesn't exist.
Repeating history sidesteps this entirely. We replay T1's changes, then T2's changes. Later, in the undo phase, we roll back T1 (and potentially T2 if it also didn't commit, using proper cascading logic).
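To make the contrast concrete, here is a minimal sketch of the two strategies applied to this scenario. The page names, values, and record layout are invented for illustration—this is not real ARIES code:

```python
# Minimal sketch (illustrative): why selective redo breaks when T2's
# write physically depends on T1's uncommitted write.

# Log: T1 sets P=5; T2 reads P and sets Q=P*2; only T2 committed.
log = [
    {'lsn': 1, 'txn': 'T1', 'page': 'P', 'after': 5},
    {'lsn': 2, 'txn': 'T2', 'page': 'Q', 'after': 10},  # computed as P * 2
]
committed = {'T2'}

def selective_redo(pages):
    """Skip uncommitted transactions during redo."""
    for rec in log:
        if rec['txn'] in committed:
            pages[rec['page']] = rec['after']
    return pages

def repeat_history(pages):
    """ARIES-style: replay everything, defer filtering to undo."""
    for rec in log:
        pages[rec['page']] = rec['after']
    return pages

# Selective redo: Q=10 but P=0 — the invariant Q == P*2 is broken.
print(selective_redo({'P': 0, 'Q': 0}))   # {'P': 0, 'Q': 10}

# Repeating history: P=5, Q=10 — the exact pre-crash state.
# The undo phase then rolls back T1 (cascading to T2 if required).
print(repeat_history({'P': 0, 'Q': 0}))   # {'P': 5, 'Q': 10}
```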
To understand why repeating history works, we need to formalize what recovery must achieve. Let's define the key concepts:
Definition 1: Database State
Let S(t) represent the database state at time t. S(t) is a mapping from page identifiers to page contents, modified by the sequence of operations applied up to time t.
Definition 2: Operation History
Let H = (o₁, o₂, ..., oₙ) be the sequence of operations (in Log Sequence Number order) applied to the database. Each operation oᵢ transforms state S(tᵢ₋₁) into S(tᵢ).
Definition 3: Crash Point
Let T_crash be the moment of system failure. S(T_crash) represents the database state at crash time. Due to buffer pool management, this state may not be fully persisted to disk.
Definition 4: Recovery Goal
After recovery, the database should contain exactly the effects of all committed transactions and none of the effects of uncommitted transactions.
The challenge is that we don't know exactly which pages were flushed before the crash. Some modifications might be on disk, others might not. The repeating history approach, combined with LSN comparison, handles all possible combinations of what-was-flushed-and-what-wasn't correctly.
The Repeating History Invariant:
After the redo phase completes, for every page P in the database:
S_recovered(P) = S(T_crash)(P)
In other words, the redo phase reconstructs the exact pre-crash state for every page.
Why This Invariant Matters:
Undo Correctness: The undo operations were logged with the assumption that the database was in state S(T_crash). If we reconstruct that exact state, the undo records will correctly reverse any changes.
CLR Handling: Compensation Log Records (CLRs) written during live rollbacks are just part of history. Repeating history naturally incorporates them, as the sketch after this list shows.
Partial Page Writes: If a page's latest changes never reached disk before the crash, LSN comparison brings the page forward to the correct state. (Torn writes within a single page—where a write was interrupted midway—are detected by separate mechanisms such as checksums or full-page images.)
Nested Transactions: Complex transaction hierarchies with savepoints and partial rollbacks all get correctly restored—they're just history.
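To illustrate the CLR point, here is a tiny sketch (with a hypothetical record layout) showing that the ordinary redo loop handles a compensation record exactly like any other update:

```python
# Minimal sketch (hypothetical record layout): CLRs are redone by the
# same loop as ordinary updates — no special casing required.

history = [
    {'lsn': 10, 'type': 'UPDATE', 'page': 'P', 'after': 'B'},  # T1 writes B
    {'lsn': 11, 'type': 'CLR',    'page': 'P', 'after': 'A'},  # T1 rolled back before crash
]

def redo(pages, page_lsns):
    """Replay every data modification whose LSN the page hasn't seen yet."""
    for rec in history:
        if rec['type'] in ('UPDATE', 'CLR') and page_lsns.get(rec['page'], -1) < rec['lsn']:
            pages[rec['page']] = rec['after']
            page_lsns[rec['page']] = rec['lsn']
    return pages

# Nothing was flushed: redo replays the update AND the compensation,
# leaving P = 'A' — exactly the pre-crash (already rolled-back) state.
print(redo({'P': 'A'}, {}))  # {'P': 'A'}
```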
```python
# Formal verification of the repeating history principle
# This demonstrates why the approach is mathematically correct

class DatabaseState:
    """Represents database state as page -> (value, LSN) mapping"""
    def __init__(self):
        self.pages = {}  # page_id -> (value, page_lsn)

    def apply_operation(self, op):
        """Apply an operation, updating page value and LSN"""
        if op.type == 'UPDATE':
            self.pages[op.page_id] = (op.after_value, op.lsn)
        return self

class Operation:
    def __init__(self, lsn, page_id, op_type, before_value, after_value):
        self.lsn = lsn
        self.page_id = page_id
        self.type = op_type
        self.before_value = before_value
        self.after_value = after_value

def verify_repeating_history():
    """
    Prove: Repeating history produces S(T_crash) regardless of
    which subset of pages were actually flushed before crash
    """
    # Create history of operations
    history = [
        Operation(lsn=100, page_id='P1', op_type='UPDATE', before_value='A', after_value='B'),
        Operation(lsn=101, page_id='P2', op_type='UPDATE', before_value='X', after_value='Y'),
        Operation(lsn=102, page_id='P1', op_type='UPDATE', before_value='B', after_value='C'),
        Operation(lsn=103, page_id='P3', op_type='UPDATE', before_value='M', after_value='N'),
    ]

    # Scenario: Pages P1 and P3 flushed before crash, P2 was not
    # Disk state at crash:
    # - P1: value='C', LSN=102 (flushed after both updates)
    # - P2: value='X', LSN=0 (never flushed, still has old value)
    # - P3: value='N', LSN=103 (flushed)
    disk_state = DatabaseState()
    disk_state.pages = {
        'P1': ('C', 102),  # Flushed - has latest
        'P2': ('X', 0),    # Not flushed - has stale value
        'P3': ('N', 103),  # Flushed - has latest
    }

    # Apply repeating history from oldest unflushed LSN
    # In redo, we check: if page_lsn < operation_lsn, apply operation
    def redo_with_lsn_check(state, history, start_lsn=0):
        """Redo operations using LSN comparison"""
        for op in history:
            if op.lsn < start_lsn:
                continue  # Skip operations before redo start point
            current_page_lsn = state.pages.get(op.page_id, ('', -1))[1]
            if current_page_lsn < op.lsn:
                # Page is behind log - apply the operation
                state.apply_operation(op)
                print(f"Applied LSN {op.lsn} to {op.page_id}")
            else:
                # Page already has this or later - skip
                print(f"Skipped LSN {op.lsn} for {op.page_id} (page LSN={current_page_lsn})")
        return state

    # Run redo phase
    recovered_state = redo_with_lsn_check(disk_state, history)

    # Verify: All pages should now have crash-time values
    expected = {
        'P1': ('C', 102),  # Was correct, not re-applied
        'P2': ('Y', 101),  # Was stale, now corrected
        'P3': ('N', 103),  # Was correct, not re-applied
    }
    for page_id, (expected_val, expected_lsn) in expected.items():
        actual = recovered_state.pages[page_id]
        assert actual == (expected_val, expected_lsn), \
            f"Page {page_id}: expected {(expected_val, expected_lsn)}, got {actual}"

    print("\n✓ Repeating history with LSN comparison correctly")
    print("  reconstructs crash-time state regardless of flush pattern!")

verify_repeating_history()
```

A natural question arises: if uncommitted transactions will be undone anyway, why redo them in the first place? Why not simply skip them during redo?
The answer involves several subtle but critical issues:

- Physical dependencies: later operations may depend on the physical effects of earlier, uncommitted ones—as with T2 building on T1's write above.
- Undo assumptions: undo records describe how to reverse changes starting from the crash-time state; applied to any other state, they may not make sense.
- CLR consistency: compensation records from rollbacks that were in progress at the crash only apply cleanly if the history they compensate has been replayed first.
A Concrete Example: The B-Tree Split Scenario
Consider this timeline:

1. T1 inserts a key into a full B-tree leaf, forcing a page split: a new leaf page is allocated and half the entries move into it. T1 has not yet committed.
2. T2 inserts a key into the newly created leaf page and commits.
3. The system crashes.
If we "skip" T1's work during redo:
By repeating history, both operations succeed. Then, during undo, T1's key insertion is rolled back logically (the key is removed), while the page split itself—a structure modification—is left in place, preserving T2's committed insert. The sketch below makes the failure mode explicit.
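Here is a deliberately simplified sketch of this scenario—pages are plain dicts and the "split" just allocates a page, so this is illustrative only:

```python
# Heavily simplified sketch: why skipping an uncommitted split during
# redo corrupts the tree. No real B-tree logic, just page allocation.

log = [
    {'lsn': 1, 'txn': 'T1', 'op': 'SPLIT',  'new_page': 'L2'},         # uncommitted
    {'lsn': 2, 'txn': 'T2', 'op': 'INSERT', 'page': 'L2', 'key': 42},  # committed
]
committed = {'T2'}

def redo(records, pages):
    for rec in records:
        if rec['op'] == 'SPLIT':
            pages[rec['new_page']] = []             # allocate the new leaf
        elif rec['op'] == 'INSERT':
            pages[rec['page']].append(rec['key'])   # KeyError if page missing!
    return pages

# Repeating history: both operations replay cleanly.
print(redo(log, {'L1': []}))  # {'L1': [], 'L2': [42]}

# Selective redo: skipping T1's split makes T2's committed insert impossible.
try:
    redo([r for r in log if r['txn'] in committed], {'L1': []})
except KeyError as missing:
    print(f"Selective redo fails: page {missing} was never created")
```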
Repeating history keeps the redo phase simple: apply everything. All the complexity of determining what should actually remain gets deferred to the undo phase, which has access to the fully reconstructed state. This separation of concerns is a hallmark of good system design.
Before ARIES, database systems used simpler recovery algorithms that tried to avoid repeating history. Understanding why they struggled illuminates the elegance of the ARIES approach.
Pre-ARIES Recovery Approaches:

Earlier systems kept recovery simple by restricting the buffer manager—using NO-STEAL and/or FORCE policies (compared in the table below)—or by avoiding in-place updates altogether, as in System R's shadow paging. Each choice traded performance or flexibility for a simpler recovery algorithm.
The STEAL/NO-FORCE Revolution
ARIES embraced the STEAL/NO-FORCE buffer management policy:

- STEAL: the buffer manager may write ("steal") a dirty page to disk before the transaction that modified it commits. Uncommitted changes can therefore reach disk—this is why undo is needed.
- NO-FORCE: a committing transaction's dirty pages need not be forced to disk; only its log records must be durable. Committed changes may therefore exist only in the log—this is why redo is needed.
This policy maximizes flexibility and performance but requires sophisticated recovery. Repeating history is the key technique that makes STEAL/NO-FORCE work correctly.
| Policy | STEAL? | FORCE? | Redo Needed? | Undo Needed? | Performance |
|---|---|---|---|---|---|
| NO-STEAL/FORCE | No | Yes | No | No | Poor—disk writes block commits, limited memory |
| NO-STEAL/NO-FORCE | No | No | Yes | No | Better—but memory limited |
| STEAL/FORCE | Yes | Yes | No | Yes | Better—but commits still slow |
| STEAL/NO-FORCE (ARIES) | Yes | No | Yes | Yes | Best—full flexibility, fast commits |
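As a rough illustration of how these two policy bits shape buffer-manager behavior, here is a hypothetical sketch (not any real system's code) of the eviction and commit paths:

```python
# Hypothetical sketch of how STEAL/NO-FORCE shapes buffer management.
# WAL rule: a dirty page may be written only after its log records are durable.

class BufferManager:
    def __init__(self, steal=True, force=False):
        self.steal = steal
        self.force = force
        self.flushed_log_lsn = 0  # highest LSN durably on disk

    def can_evict(self, page_dirty, txn_active, page_lsn):
        if not page_dirty:
            return True
        if txn_active and not self.steal:
            return False  # NO-STEAL: pin dirty pages of active transactions
        # STEAL: eviction allowed once the WAL covers the page's changes
        return page_lsn <= self.flushed_log_lsn

    def on_commit(self, txn_commit_lsn, txn_dirty_pages, flush_page):
        # Always make the log durable through the commit record
        self.flushed_log_lsn = max(self.flushed_log_lsn, txn_commit_lsn)
        if self.force:
            for page in txn_dirty_pages:   # FORCE: flush data pages at commit
                flush_page(page)
        # NO-FORCE: return immediately; redo recovers these pages after a crash

bm = BufferManager(steal=True, force=False)  # the ARIES configuration
print(bm.can_evict(page_dirty=True, txn_active=True, page_lsn=0))  # True under STEAL
```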
The ARIES approach with STEAL/NO-FORCE is now standard in virtually all major RDBMS implementations: PostgreSQL, MySQL/InnoDB, Oracle, SQL Server, and DB2 all use variants of this approach. The repeating history principle has been validated by decades of production use across these systems.
A critical question for repeating history is: where do we start?
We could theoretically start from the beginning of the log—but that would mean redoing potentially years of operations. Instead, ARIES uses checkpoints to bound the redo work.
The RedoLSN Calculation:
During the analysis phase, ARIES computes the RedoLSN—the log sequence number from which redo should begin. This is calculated as:
RedoLSN = MIN(recLSN of all pages in DirtyPageTable)
Where:

- recLSN is, for each entry in the DirtyPageTable, the LSN of the first log record that dirtied the page since its last flush—the earliest change that might be missing from disk.
- The DirtyPageTable (DPT) is reconstructed during the analysis phase, starting from the copy recorded in the most recent checkpoint and updated by scanning the log forward.
```python
# Calculating the RedoLSN from the Dirty Page Table

class DirtyPageTableEntry:
    """Entry in the Dirty Page Table"""
    def __init__(self, page_id, rec_lsn):
        self.page_id = page_id
        self.rec_lsn = rec_lsn  # First LSN that dirtied this page

class AnalysisPhaseResult:
    """Output from the analysis phase"""
    def __init__(self):
        self.dirty_page_table = {}   # page_id -> DirtyPageTableEntry
        self.transaction_table = {}  # txn_id -> TransactionTableEntry

def calculate_redo_lsn(analysis_result):
    """
    Calculate the LSN from which redo should begin.
    This is the minimum recLSN across all dirty pages.

    Intuition: We must start redoing from the point where the
    oldest potentially unflushed modification occurred.
    """
    if not analysis_result.dirty_page_table:
        # No dirty pages means nothing to redo
        return float('inf')  # No redo needed

    redo_lsn = min(
        entry.rec_lsn
        for entry in analysis_result.dirty_page_table.values()
    )
    return redo_lsn

# Example: Typical recovery scenario
def demonstrate_redo_lsn_calculation():
    """
    Scenario: Three pages were dirty at crash time
    - Page 10: first dirtied at LSN 500, modified again at LSN 700
    - Page 20: first dirtied at LSN 600
    - Page 30: first dirtied at LSN 400, already flushed (not in DPT)
    """
    result = AnalysisPhaseResult()

    # Only pages that were dirty at crash are in DPT
    result.dirty_page_table = {
        10: DirtyPageTableEntry(page_id=10, rec_lsn=500),
        20: DirtyPageTableEntry(page_id=20, rec_lsn=600),
        # Page 30 was flushed, so it's not in DPT
    }

    redo_lsn = calculate_redo_lsn(result)

    print("Dirty Page Table:")
    for page_id, entry in result.dirty_page_table.items():
        print(f"  Page {page_id}: recLSN = {entry.rec_lsn}")

    print(f"\nCalculated RedoLSN: {redo_lsn}")
    print(f"Redo will start from LSN {redo_lsn}")
    print("\nThis ensures we don't miss any operations that might")
    print("not have been flushed to disk before the crash.")

demonstrate_redo_lsn_calculation()
```

Why This Starting Point is Sufficient:
Pages flushed before crash need no redo: a page that was flushed and not re-dirtied isn't in the DPT at all, so its on-disk state is current. Any page that was re-dirtied has a recLSN marking the first change that might be missing—and RedoLSN is the minimum over all of these.
Checkpoint bounding: The analysis phase starts from the most recent checkpoint. The DPT at checkpoint time is our starting point, updated by scanning the log forward.
Conservative correctness: Even if we redo an operation that was already flushed to disk, the LSN comparison (which we'll cover in detail later) ensures we don't corrupt data.
The formula is elegant in its simplicity: find the oldest operation that might not be on disk, and start there. Everything before that point is guaranteed to be stable.
Let's trace through a complete example to see repeating history in action. We'll simulate a crash scenario and walk through exactly how the redo phase reconstructs the pre-crash state.
```python
# Complete example of repeating history during recovery

class LogRecord:
    """Represents a log record in the WAL"""
    def __init__(self, lsn, txn_id, record_type, page_id=None,
                 before_value=None, after_value=None, prev_lsn=None):
        self.lsn = lsn
        self.txn_id = txn_id
        self.record_type = record_type  # UPDATE, COMMIT, ABORT, CLR, CHECKPOINT
        self.page_id = page_id
        self.before_value = before_value
        self.after_value = after_value
        self.prev_lsn = prev_lsn  # Previous LSN for this transaction

class SimulatedCrash:
    """Simulates a database system with crash and recovery"""
    def __init__(self):
        self.log = []
        self.disk_pages = {}   # What's actually on disk
        self.buffer_pool = {}  # In-memory pages (lost on crash)
        self.next_lsn = 100
        self.checkpoint_lsn = None

    def write_log(self, **kwargs):
        """Write a log record and return its LSN"""
        record = LogRecord(lsn=self.next_lsn, **kwargs)
        self.log.append(record)
        self.next_lsn += 1
        return record.lsn

    def update_page(self, txn_id, page_id, before_val, after_val, prev_lsn=None):
        """Perform an update operation"""
        # Write log record FIRST (WAL protocol)
        lsn = self.write_log(
            txn_id=txn_id,
            record_type='UPDATE',
            page_id=page_id,
            before_value=before_val,
            after_value=after_val,
            prev_lsn=prev_lsn
        )
        # Update page in buffer pool
        self.buffer_pool[page_id] = {
            'value': after_val,
            'page_lsn': lsn
        }
        return lsn

    def flush_page(self, page_id):
        """Simulate flushing a page to disk"""
        if page_id in self.buffer_pool:
            self.disk_pages[page_id] = self.buffer_pool[page_id].copy()
            print(f"  Flushed page {page_id} to disk "
                  f"(LSN: {self.disk_pages[page_id]['page_lsn']})")

    def commit_txn(self, txn_id, prev_lsn):
        """Commit a transaction"""
        return self.write_log(txn_id=txn_id, record_type='COMMIT', prev_lsn=prev_lsn)

    def crash(self):
        """Simulate a system crash - buffer pool is lost!"""
        print("\n💥 SYSTEM CRASH! Buffer pool contents lost.")
        self.buffer_pool = {}  # All in-memory state is gone

    def recover_redo_phase(self, redo_lsn):
        """
        Redo phase: Repeat history from redo_lsn forward
        This demonstrates the core principle!
        """
        print(f"\n🔄 REDO PHASE: Starting from LSN {redo_lsn}")
        print("=" * 50)

        for record in self.log:
            if record.lsn < redo_lsn:
                continue  # Skip log records before redo point
            if record.record_type not in ['UPDATE', 'CLR']:
                continue  # Only redo data modifications

            page_id = record.page_id

            # Get current page LSN from disk
            if page_id in self.disk_pages:
                current_page_lsn = self.disk_pages[page_id]['page_lsn']
            else:
                current_page_lsn = -1  # Page never written to disk

            # The key decision: does this operation need to be redone?
            if current_page_lsn < record.lsn:
                # Page is behind - operation wasn't flushed, must redo
                self.disk_pages[page_id] = {
                    'value': record.after_value,
                    'page_lsn': record.lsn
                }
                print(f"  REDO LSN {record.lsn}: Page {page_id} "
                      f"'{record.before_value}' → '{record.after_value}' "
                      f"(page was at LSN {current_page_lsn})")
            else:
                # Page already has this update - skip
                print(f"  SKIP LSN {record.lsn}: Page {page_id} "
                      f"already at LSN {current_page_lsn}")

# Run the demonstration
def demonstrate_repeating_history():
    """
    Scenario:
    - T1: Updates page 1 and page 2, does NOT commit
    - T2: Updates page 1 and page 3, DOES commit
    - Only page 1 is flushed before crash
    - Recovery must restore the exact pre-crash state
    """
    print("=" * 60)
    print("DEMONSTRATING REPEATING HISTORY PRINCIPLE")
    print("=" * 60)

    db = SimulatedCrash()

    # Initial state: pages start with values A, X, M
    print("\n📝 Operations before crash:")

    # T1: Uncommitted transaction
    t1_lsn1 = db.update_page('T1', 'page1', 'A', 'B')
    print(f"  T1 updates page1: A → B (LSN {t1_lsn1})")
    t1_lsn2 = db.update_page('T1', 'page2', 'X', 'Y', t1_lsn1)
    print(f"  T1 updates page2: X → Y (LSN {t1_lsn2})")

    # T2: Committed transaction
    t2_lsn1 = db.update_page('T2', 'page1', 'B', 'C')  # Note: reads T1's uncommitted value!
    print(f"  T2 updates page1: B → C (LSN {t2_lsn1})")
    t2_lsn2 = db.update_page('T2', 'page3', 'M', 'N', t2_lsn1)
    print(f"  T2 updates page3: M → N (LSN {t2_lsn2})")
    db.commit_txn('T2', t2_lsn2)
    print("  T2 commits")

    # Only page1 gets flushed before crash
    print("\n📀 Flushing pages to disk:")
    db.flush_page('page1')  # page1 is now on disk with value 'C'
    # page2 and page3 are still only in buffer pool!

    print("\n📊 State before crash:")
    print(f"  Disk:   page1 = {db.disk_pages.get('page1', 'not on disk')}")
    print(f"  Disk:   page2 = {db.disk_pages.get('page2', 'not on disk')}")
    print(f"  Disk:   page3 = {db.disk_pages.get('page3', 'not on disk')}")
    print(f"  Buffer: page1 = {db.buffer_pool.get('page1')}")
    print(f"  Buffer: page2 = {db.buffer_pool.get('page2')}")
    print(f"  Buffer: page3 = {db.buffer_pool.get('page3')}")

    # CRASH!
    db.crash()

    print("\n📊 State after crash (disk only):")
    print(f"  Disk: page1 = {db.disk_pages.get('page1', 'NOT PRESENT')}")
    print(f"  Disk: page2 = {db.disk_pages.get('page2', 'NOT PRESENT')}")
    print(f"  Disk: page3 = {db.disk_pages.get('page3', 'NOT PRESENT')}")

    # Recovery: Redo phase
    # In reality, the analysis phase would determine redo_lsn.
    # Here, page2 and page3 were never flushed, so we start from the earliest unflushed LSN.
    redo_lsn = t1_lsn1  # In practice, computed from the DPT
    db.recover_redo_phase(redo_lsn)

    print("\n📊 State after REDO phase:")
    print(f"  page1 = {db.disk_pages.get('page1')}")
    print(f"  page2 = {db.disk_pages.get('page2')}")
    print(f"  page3 = {db.disk_pages.get('page3')}")

    print("\n✅ RESULT: Database restored to exact pre-crash state!")
    print("   - page1: 'C' (last update by T2)")
    print("   - page2: 'Y' (update by T1, even though uncommitted)")
    print("   - page3: 'N' (update by T2)")
    print("\n   The UNDO phase would then roll back T1's changes.")
    print("   But that's a separate step - redo just repeats history!")

demonstrate_repeating_history()
```

Notice how the redo phase doesn't care about commit status. It redoes T1's changes even though T1 never committed. The undo phase (not shown) would later roll back T1's effects. This clean separation—redo everything first, then undo uncommitted—is the essence of repeating history.
We've explored the foundational principle that makes ARIES recovery correct and elegant. Let's consolidate the key insights:

- Redo repeats history exactly, restoring the crash-time state—committed and uncommitted effects alike.
- All filtering of uncommitted work is deferred to the undo phase, which operates on a fully reconstructed state.
- LSN comparison makes redo correct regardless of which pages happened to be flushed before the crash.
- The RedoLSN—the minimum recLSN in the Dirty Page Table—bounds how far back redo must scan.
- Repeating history is what makes the high-performance STEAL/NO-FORCE buffer policy safe.
What's Next:
Now that we understand why we repeat history, the next page explores how we do it efficiently. We'll examine the process of redoing all operations, the mechanics of applying log records to pages, and how ARIES handles the various types of log records during redo.
You now understand the philosophical foundation of ARIES redo: repeating history exactly as it occurred, regardless of transaction commit status. This counterintuitive approach, combined with clean separation from the undo phase, enables correct recovery in all scenarios while supporting high-performance buffer management policies.