When a database system crashes, the traditional instinct might be to find the most recent consistent state and move forward from there. However, the designers of ARIES—C. Mohan and colleagues at IBM Research—made a counterintuitive but profound choice: repeat history exactly as it happened before the crash.
This principle, known as Repeating History, is the philosophical foundation of the ARIES redo phase. Rather than trying to shortcut to a consistent state, ARIES painstakingly reconstructs the exact database state that existed at the moment of failure. Only after this reconstruction does it undo incomplete transactions.
This approach might seem wasteful at first glance—why redo work that will later be undone? But as we'll discover, this seemingly roundabout path is actually the most elegant and correct solution to an incredibly complex problem.
By the end of this page, you will understand: (1) Why repeating history is essential for correct recovery, (2) How this approach handles complex scenarios like nested transactions and partial writes, (3) The mathematical invariants that repeating history maintains, and (4) Why shortcuts lead to incorrect recovery in edge cases.
Repeating history is the ARIES principle that during recovery, the database should be restored to the exact state it was in at the moment of crash—including the effects of uncommitted transactions.
This might seem strange. After all, uncommitted transactions are, by definition, incomplete. Why would we want to restore their effects?
The answer lies in separation of concerns:

- The redo phase answers "what happened?": it physically reconstructs the exact database state at the moment of crash, paying no attention to transaction outcomes.
- The undo phase answers "what should remain?": working from that fully reconstructed state, it logically rolls back the transactions that never committed.
By separating these concerns, ARIES achieves several critical properties that simpler algorithms cannot.
| Aspect | Repeating History (ARIES) | Selective Redo |
|---|---|---|
| Redo scope | All logged operations from the oldest recLSN in the DirtyPageTable onward | Only operations of committed transactions |
| State after redo | Exact pre-crash state (committed + uncommitted) | Only committed modifications |
| Undo requirement | Required for uncommitted transactions | Not needed (uncommitted already excluded) |
| Complexity | Simpler redo logic, clean separation | Complex filtering during redo |
| Correctness | Mathematically proven correct | Edge cases in complex scenarios |
| CLR handling | Natural—CLRs are part of history | Requires special handling |
The key insight: By not trying to be clever during redo—by simply replaying history as it occurred—ARIES avoids an entire class of bugs and edge cases that plague selective recovery approaches.
Consider a scenario where Transaction T1 modifies page P, then Transaction T2 reads that modified value and makes its own changes to page Q based on T1's write. If we selectively skip T1's changes during redo (because T1 didn't commit), T2's changes now make no sense—they were computed based on a value we're pretending doesn't exist.
Repeating history sidesteps this entirely. We replay T1's changes, then T2's changes. Later, in the undo phase, we roll back T1 (and potentially T2 if it also didn't commit, using proper cascading logic).
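To make the contrast concrete, here is a minimal sketch of the two strategies applied to this scenario. The page names, values, and record layout are invented for illustration—this is not real ARIES code:

```python
# Minimal sketch (illustrative): why selective redo breaks when T2's
# write physically depends on T1's uncommitted write.

# Log: T1 sets P=5; T2 reads P and sets Q=P*2; only T2 committed.
log = [
    {'lsn': 1, 'txn': 'T1', 'page': 'P', 'after': 5},
    {'lsn': 2, 'txn': 'T2', 'page': 'Q', 'after': 10},  # computed as P * 2
]
committed = {'T2'}

def selective_redo(pages):
    """Skip uncommitted transactions during redo."""
    for rec in log:
        if rec['txn'] in committed:
            pages[rec['page']] = rec['after']
    return pages

def repeat_history(pages):
    """ARIES-style: replay everything, defer filtering to undo."""
    for rec in log:
        pages[rec['page']] = rec['after']
    return pages

# Selective redo: Q=10 but P=0 — the invariant Q == P*2 is broken.
print(selective_redo({'P': 0, 'Q': 0}))   # {'P': 0, 'Q': 10}

# Repeating history: P=5, Q=10 — the exact pre-crash state.
# The undo phase then rolls back T1 (cascading to T2 if required).
print(repeat_history({'P': 0, 'Q': 0}))   # {'P': 5, 'Q': 10}
```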
To understand why repeating history works, we need to formalize what recovery must achieve. Let's define the key concepts:
Definition 1: Database State
Let S(t) represent the database state at time t. S(t) is a mapping from page identifiers to page contents, modified by the sequence of operations applied up to time t.
Definition 2: Operation History
Let H = (o₁, o₂, ..., oₙ) be the sequence of operations (in Log Sequence Number order) applied to the database. Each operation oᵢ transforms state S(tᵢ₋₁) into S(tᵢ).
Definition 3: Crash Point
Let T_crash be the moment of system failure. S(T_crash) represents the database state at crash time. Due to buffer pool management, this state may not be fully persisted to disk.
Definition 4: Recovery Goal
After recovery, the database should contain exactly the effects of all committed transactions and none of the effects of uncommitted transactions.
The challenge is that we don't know exactly which pages were flushed before the crash. Some modifications might be on disk, others might not. The repeating history approach, combined with LSN comparison, handles all possible combinations of what-was-flushed-and-what-wasn't correctly.
The Repeating History Invariant:
After the redo phase completes, for every page P in the database:
S_recovered(P) = S(T_crash)(P)
In other words, the redo phase reconstructs the exact pre-crash state for every page.
Why This Invariant Matters:
Undo Correctness: The undo operations were logged with the assumption that the database was in state S(T_crash). If we reconstruct that exact state, the undo records will correctly reverse any changes.
CLR Handling: Compensation Log Records (CLRs) written during live rollbacks are just part of history. Repeating history naturally incorporates them, as the sketch after this list shows.
Partial Page Writes: If a page's latest changes never reached disk before the crash, LSN comparison brings the page forward to the correct state. (Torn writes within a single page—where a write was interrupted midway—are detected by separate mechanisms such as checksums or full-page images.)
Nested Transactions: Complex transaction hierarchies with savepoints and partial rollbacks all get correctly restored—they're just history.
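To illustrate the CLR point, here is a tiny sketch (with a hypothetical record layout) showing that the ordinary redo loop handles a compensation record exactly like any other update:

```python
# Minimal sketch (hypothetical record layout): CLRs are redone by the
# same loop as ordinary updates — no special casing required.

history = [
    {'lsn': 10, 'type': 'UPDATE', 'page': 'P', 'after': 'B'},  # T1 writes B
    {'lsn': 11, 'type': 'CLR',    'page': 'P', 'after': 'A'},  # T1 rolled back before crash
]

def redo(pages, page_lsns):
    """Replay every data modification whose LSN the page hasn't seen yet."""
    for rec in history:
        if rec['type'] in ('UPDATE', 'CLR') and page_lsns.get(rec['page'], -1) < rec['lsn']:
            pages[rec['page']] = rec['after']
            page_lsns[rec['page']] = rec['lsn']
    return pages

# Nothing was flushed: redo replays the update AND the compensation,
# leaving P = 'A' — exactly the pre-crash (already rolled-back) state.
print(redo({'P': 'A'}, {}))  # {'P': 'A'}
```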
```python
# Formal verification of the repeating history principle
# This demonstrates why the approach is mathematically correct

class DatabaseState:
    """Represents database state as page -> (value, LSN) mapping"""
    def __init__(self):
        self.pages = {}  # page_id -> (value, page_lsn)

    def apply_operation(self, op):
        """Apply an operation, updating page value and LSN"""
        if op.type == 'UPDATE':
            self.pages[op.page_id] = (op.after_value, op.lsn)
        return self

class Operation:
    def __init__(self, lsn, page_id, op_type, before_value, after_value):
        self.lsn = lsn
        self.page_id = page_id
        self.type = op_type
        self.before_value = before_value
        self.after_value = after_value

def verify_repeating_history():
    """
    Prove: Repeating history produces S(T_crash) regardless of
    which subset of pages were actually flushed before crash
    """
    # Create history of operations
    history = [
        Operation(lsn=100, page_id='P1', op_type='UPDATE', before_value='A', after_value='B'),
        Operation(lsn=101, page_id='P2', op_type='UPDATE', before_value='X', after_value='Y'),
        Operation(lsn=102, page_id='P1', op_type='UPDATE', before_value='B', after_value='C'),
        Operation(lsn=103, page_id='P3', op_type='UPDATE', before_value='M', after_value='N'),
    ]

    # Scenario: Pages P1 and P3 flushed before crash, P2 was not
    # Disk state at crash:
    # - P1: value='C', LSN=102 (flushed after both updates)
    # - P2: value='X', LSN=0 (never flushed, still has old value)
    # - P3: value='N', LSN=103 (flushed)
    disk_state = DatabaseState()
    disk_state.pages = {
        'P1': ('C', 102),  # Flushed - has latest
        'P2': ('X', 0),    # Not flushed - has stale value
        'P3': ('N', 103),  # Flushed - has latest
    }

    # Apply repeating history from oldest unflushed LSN
    # In redo, we check: if page_lsn < operation_lsn, apply operation
    def redo_with_lsn_check(state, history, start_lsn=0):
        """Redo operations using LSN comparison"""
        for op in history:
            if op.lsn < start_lsn:
                continue  # Skip operations before redo start point
            current_page_lsn = state.pages.get(op.page_id, ('', -1))[1]
            if current_page_lsn < op.lsn:
                # Page is behind log - apply the operation
                state.apply_operation(op)
                print(f"Applied LSN {op.lsn} to {op.page_id}")
            else:
                # Page already has this or later - skip
                print(f"Skipped LSN {op.lsn} for {op.page_id} (page LSN={current_page_lsn})")
        return state

    # Run redo phase
    recovered_state = redo_with_lsn_check(disk_state, history)

    # Verify: All pages should now have crash-time values
    expected = {
        'P1': ('C', 102),  # Was correct, not re-applied
        'P2': ('Y', 101),  # Was stale, now corrected
        'P3': ('N', 103),  # Was correct, not re-applied
    }
    for page_id, (expected_val, expected_lsn) in expected.items():
        actual = recovered_state.pages[page_id]
        assert actual == (expected_val, expected_lsn), \
            f"Page {page_id}: expected {(expected_val, expected_lsn)}, got {actual}"

    print("\n✓ Repeating history with LSN comparison correctly")
    print("  reconstructs crash-time state regardless of flush pattern!")

verify_repeating_history()
```

A natural question arises: if uncommitted transactions will be undone anyway, why redo them in the first place? Why not simply skip them during redo?
The answer involves several subtle but critical issues:

- Physical dependencies: later operations may depend on the physical effects of earlier, uncommitted ones—as with T2 building on T1's write above.
- Undo assumptions: undo records describe how to reverse changes starting from the crash-time state; applied to any other state, they may not make sense.
- CLR consistency: compensation records from rollbacks that were in progress at the crash only apply cleanly if the history they compensate has been replayed first.
A Concrete Example: The B-Tree Split Scenario
Consider this timeline:

1. T1 inserts a key into a full B-tree leaf, forcing a page split: a new leaf page is allocated and half the entries move into it. T1 has not yet committed.
2. T2 inserts a key into the newly created leaf page and commits.
3. The system crashes.
If we "skip" T1's work during redo:
By repeating history, both operations succeed. Then, during undo, T1's key insertion is rolled back logically (the key is removed), while the page split itself—a structure modification—is left in place, preserving T2's committed insert. The sketch below makes the failure mode explicit.
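Here is a deliberately simplified sketch of this scenario—pages are plain dicts and the "split" just allocates a page, so this is illustrative only:

```python
# Heavily simplified sketch: why skipping an uncommitted split during
# redo corrupts the tree. No real B-tree logic, just page allocation.

log = [
    {'lsn': 1, 'txn': 'T1', 'op': 'SPLIT',  'new_page': 'L2'},         # uncommitted
    {'lsn': 2, 'txn': 'T2', 'op': 'INSERT', 'page': 'L2', 'key': 42},  # committed
]
committed = {'T2'}

def redo(records, pages):
    for rec in records:
        if rec['op'] == 'SPLIT':
            pages[rec['new_page']] = []             # allocate the new leaf
        elif rec['op'] == 'INSERT':
            pages[rec['page']].append(rec['key'])   # KeyError if page missing!
    return pages

# Repeating history: both operations replay cleanly.
print(redo(log, {'L1': []}))  # {'L1': [], 'L2': [42]}

# Selective redo: skipping T1's split makes T2's committed insert impossible.
try:
    redo([r for r in log if r['txn'] in committed], {'L1': []})
except KeyError as missing:
    print(f"Selective redo fails: page {missing} was never created")
```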
Repeating history keeps the redo phase simple: apply everything. All the complexity of determining what should actually remain gets deferred to the undo phase, which has access to the fully reconstructed state. This separation of concerns is a hallmark of good system design.
Before ARIES, database systems used simpler recovery algorithms that tried to avoid repeating history. Understanding why they struggled illuminates the elegance of the ARIES approach.
Pre-ARIES Recovery Approaches:

Earlier systems kept recovery simple by restricting the buffer manager—using NO-STEAL and/or FORCE policies (compared in the table below)—or by avoiding in-place updates altogether, as in System R's shadow paging. Each choice traded performance or flexibility for a simpler recovery algorithm.
The STEAL/NO-FORCE Revolution
ARIES embraced the STEAL/NO-FORCE buffer management policy:

- STEAL: the buffer manager may write ("steal") a dirty page to disk before the transaction that modified it commits. Uncommitted changes can therefore reach disk—this is why undo is needed.
- NO-FORCE: a committing transaction's dirty pages need not be forced to disk; only its log records must be durable. Committed changes may therefore exist only in the log—this is why redo is needed.
This policy maximizes flexibility and performance but requires sophisticated recovery. Repeating history is the key technique that makes STEAL/NO-FORCE work correctly.
| Policy | STEAL? | FORCE? | Redo Needed? | Undo Needed? | Performance |
|---|---|---|---|---|---|
| NO-STEAL/FORCE | No | Yes | No | No | Poor—disk writes block commits, limited memory |
| NO-STEAL/NO-FORCE | No | No | Yes | No | Better—but memory limited |
| STEAL/FORCE | Yes | Yes | No | Yes | Better—but commits still slow |
| STEAL/NO-FORCE (ARIES) | Yes | No | Yes | Yes | Best—full flexibility, fast commits |
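As a rough illustration of how these two policy bits shape buffer-manager behavior, here is a hypothetical sketch (not any real system's code) of the eviction and commit paths:

```python
# Hypothetical sketch of how STEAL/NO-FORCE shapes buffer management.
# WAL rule: a dirty page may be written only after its log records are durable.

class BufferManager:
    def __init__(self, steal=True, force=False):
        self.steal = steal
        self.force = force
        self.flushed_log_lsn = 0  # highest LSN durably on disk

    def can_evict(self, page_dirty, txn_active, page_lsn):
        if not page_dirty:
            return True
        if txn_active and not self.steal:
            return False  # NO-STEAL: pin dirty pages of active transactions
        # STEAL: eviction allowed once the WAL covers the page's changes
        return page_lsn <= self.flushed_log_lsn

    def on_commit(self, txn_commit_lsn, txn_dirty_pages, flush_page):
        # Always make the log durable through the commit record
        self.flushed_log_lsn = max(self.flushed_log_lsn, txn_commit_lsn)
        if self.force:
            for page in txn_dirty_pages:   # FORCE: flush data pages at commit
                flush_page(page)
        # NO-FORCE: return immediately; redo recovers these pages after a crash

bm = BufferManager(steal=True, force=False)  # the ARIES configuration
print(bm.can_evict(page_dirty=True, txn_active=True, page_lsn=0))  # True under STEAL
```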
The ARIES approach with STEAL/NO-FORCE is now standard in virtually all major RDBMS implementations: PostgreSQL, MySQL/InnoDB, Oracle, SQL Server, and DB2 all use variants of this approach. The repeating history principle has been validated by decades of production use across these systems.
A critical question for repeating history is: where do we start?
We could theoretically start from the beginning of the log—but that would mean redoing potentially years of operations. Instead, ARIES uses checkpoints to bound the redo work.
The RedoLSN Calculation:
During the analysis phase, ARIES computes the RedoLSN—the log sequence number from which redo should begin. This is calculated as:
RedoLSN = MIN(recLSN of all pages in DirtyPageTable)
Where:

- recLSN is, for each entry in the DirtyPageTable, the LSN of the first log record that dirtied the page since its last flush—the earliest change that might be missing from disk.
- The DirtyPageTable (DPT) is reconstructed during the analysis phase, starting from the copy recorded in the most recent checkpoint and updated by scanning the log forward.
```python
# Calculating the RedoLSN from the Dirty Page Table

class DirtyPageTableEntry:
    """Entry in the Dirty Page Table"""
    def __init__(self, page_id, rec_lsn):
        self.page_id = page_id
        self.rec_lsn = rec_lsn  # First LSN that dirtied this page

class AnalysisPhaseResult:
    """Output from the analysis phase"""
    def __init__(self):
        self.dirty_page_table = {}   # page_id -> DirtyPageTableEntry
        self.transaction_table = {}  # txn_id -> TransactionTableEntry

def calculate_redo_lsn(analysis_result):
    """
    Calculate the LSN from which redo should begin.
    This is the minimum recLSN across all dirty pages.

    Intuition: We must start redoing from the point where the
    oldest potentially unflushed modification occurred.
    """
    if not analysis_result.dirty_page_table:
        # No dirty pages means nothing to redo
        return float('inf')  # No redo needed

    redo_lsn = min(
        entry.rec_lsn
        for entry in analysis_result.dirty_page_table.values()
    )
    return redo_lsn

# Example: Typical recovery scenario
def demonstrate_redo_lsn_calculation():
    """
    Scenario: Three pages were dirty at crash time
    - Page 10: first dirtied at LSN 500, modified again at LSN 700
    - Page 20: first dirtied at LSN 600
    - Page 30: first dirtied at LSN 400, already flushed (not in DPT)
    """
    result = AnalysisPhaseResult()

    # Only pages that were dirty at crash are in DPT
    result.dirty_page_table = {
        10: DirtyPageTableEntry(page_id=10, rec_lsn=500),
        20: DirtyPageTableEntry(page_id=20, rec_lsn=600),
        # Page 30 was flushed, so it's not in DPT
    }

    redo_lsn = calculate_redo_lsn(result)

    print("Dirty Page Table:")
    for page_id, entry in result.dirty_page_table.items():
        print(f"  Page {page_id}: recLSN = {entry.rec_lsn}")

    print(f"\nCalculated RedoLSN: {redo_lsn}")
    print(f"Redo will start from LSN {redo_lsn}")
    print("\nThis ensures we don't miss any operations that might")
    print("not have been flushed to disk before the crash.")

demonstrate_redo_lsn_calculation()
```

Why This Starting Point is Sufficient:
Pages flushed before crash need no redo: a page that was flushed and not re-dirtied isn't in the DPT at all, so its on-disk state is current. Any page that was re-dirtied has a recLSN marking the first change that might be missing—and RedoLSN is the minimum over all of these.
Checkpoint bounding: The analysis phase starts from the most recent checkpoint. The DPT at checkpoint time is our starting point, updated by scanning the log forward.
Conservative correctness: Even if we redo an operation that was already flushed to disk, the LSN comparison (which we'll cover in detail later) ensures we don't corrupt data.
The formula is elegant in its simplicity: find the oldest operation that might not be on disk, and start there. Everything before that point is guaranteed to be stable.
Let's trace through a complete example to see repeating history in action. We'll simulate a crash scenario and walk through exactly how the redo phase reconstructs the pre-crash state.
```python
# Complete example of repeating history during recovery

class LogRecord:
    """Represents a log record in the WAL"""
    def __init__(self, lsn, txn_id, record_type, page_id=None,
                 before_value=None, after_value=None, prev_lsn=None):
        self.lsn = lsn
        self.txn_id = txn_id
        self.record_type = record_type  # UPDATE, COMMIT, ABORT, CLR, CHECKPOINT
        self.page_id = page_id
        self.before_value = before_value
        self.after_value = after_value
        self.prev_lsn = prev_lsn  # Previous LSN for this transaction

class SimulatedCrash:
    """Simulates a database system with crash and recovery"""
    def __init__(self):
        self.log = []
        self.disk_pages = {}   # What's actually on disk
        self.buffer_pool = {}  # In-memory pages (lost on crash)
        self.next_lsn = 100
        self.checkpoint_lsn = None

    def write_log(self, **kwargs):
        """Write a log record and return its LSN"""
        record = LogRecord(lsn=self.next_lsn, **kwargs)
        self.log.append(record)
        self.next_lsn += 1
        return record.lsn

    def update_page(self, txn_id, page_id, before_val, after_val, prev_lsn=None):
        """Perform an update operation"""
        # Write log record FIRST (WAL protocol)
        lsn = self.write_log(
            txn_id=txn_id,
            record_type='UPDATE',
            page_id=page_id,
            before_value=before_val,
            after_value=after_val,
            prev_lsn=prev_lsn
        )
        # Update page in buffer pool
        self.buffer_pool[page_id] = {
            'value': after_val,
            'page_lsn': lsn
        }
        return lsn

    def flush_page(self, page_id):
        """Simulate flushing a page to disk"""
        if page_id in self.buffer_pool:
            self.disk_pages[page_id] = self.buffer_pool[page_id].copy()
            print(f"  Flushed page {page_id} to disk "
                  f"(LSN: {self.disk_pages[page_id]['page_lsn']})")

    def commit_txn(self, txn_id, prev_lsn):
        """Commit a transaction"""
        return self.write_log(txn_id=txn_id, record_type='COMMIT', prev_lsn=prev_lsn)

    def crash(self):
        """Simulate a system crash - buffer pool is lost!"""
        print("\n💥 SYSTEM CRASH! Buffer pool contents lost.")
        self.buffer_pool = {}  # All in-memory state is gone

    def recover_redo_phase(self, redo_lsn):
        """
        Redo phase: Repeat history from redo_lsn forward
        This demonstrates the core principle!
        """
        print(f"\n🔄 REDO PHASE: Starting from LSN {redo_lsn}")
        print("=" * 50)

        for record in self.log:
            if record.lsn < redo_lsn:
                continue  # Skip log records before redo point
            if record.record_type not in ['UPDATE', 'CLR']:
                continue  # Only redo data modifications

            page_id = record.page_id

            # Get current page LSN from disk
            if page_id in self.disk_pages:
                current_page_lsn = self.disk_pages[page_id]['page_lsn']
            else:
                current_page_lsn = -1  # Page never written to disk

            # The key decision: does this operation need to be redone?
            if current_page_lsn < record.lsn:
                # Page is behind - operation wasn't flushed, must redo
                self.disk_pages[page_id] = {
                    'value': record.after_value,
                    'page_lsn': record.lsn
                }
                print(f"  REDO LSN {record.lsn}: Page {page_id} "
                      f"'{record.before_value}' → '{record.after_value}' "
                      f"(page was at LSN {current_page_lsn})")
            else:
                # Page already has this update - skip
                print(f"  SKIP LSN {record.lsn}: Page {page_id} "
                      f"already at LSN {current_page_lsn}")

# Run the demonstration
def demonstrate_repeating_history():
    """
    Scenario:
    - T1: Updates page 1 and page 2, does NOT commit
    - T2: Updates page 1 and page 3, DOES commit
    - Only page 1 is flushed before crash
    - Recovery must restore the exact pre-crash state
    """
    print("=" * 60)
    print("DEMONSTRATING REPEATING HISTORY PRINCIPLE")
    print("=" * 60)

    db = SimulatedCrash()

    # Initial state: pages start with values A, X, M
    print("\n📝 Operations before crash:")

    # T1: Uncommitted transaction
    t1_lsn1 = db.update_page('T1', 'page1', 'A', 'B')
    print(f"  T1 updates page1: A → B (LSN {t1_lsn1})")
    t1_lsn2 = db.update_page('T1', 'page2', 'X', 'Y', t1_lsn1)
    print(f"  T1 updates page2: X → Y (LSN {t1_lsn2})")

    # T2: Committed transaction
    t2_lsn1 = db.update_page('T2', 'page1', 'B', 'C')  # Note: reads T1's uncommitted value!
    print(f"  T2 updates page1: B → C (LSN {t2_lsn1})")
    t2_lsn2 = db.update_page('T2', 'page3', 'M', 'N', t2_lsn1)
    print(f"  T2 updates page3: M → N (LSN {t2_lsn2})")
    db.commit_txn('T2', t2_lsn2)
    print("  T2 commits")

    # Only page1 gets flushed before crash
    print("\n📀 Flushing pages to disk:")
    db.flush_page('page1')  # page1 is now on disk with value 'C'
    # page2 and page3 are still only in buffer pool!

    print("\n📊 State before crash:")
    print(f"  Disk:   page1 = {db.disk_pages.get('page1', 'not on disk')}")
    print(f"  Disk:   page2 = {db.disk_pages.get('page2', 'not on disk')}")
    print(f"  Disk:   page3 = {db.disk_pages.get('page3', 'not on disk')}")
    print(f"  Buffer: page1 = {db.buffer_pool.get('page1')}")
    print(f"  Buffer: page2 = {db.buffer_pool.get('page2')}")
    print(f"  Buffer: page3 = {db.buffer_pool.get('page3')}")

    # CRASH!
    db.crash()

    print("\n📊 State after crash (disk only):")
    print(f"  Disk: page1 = {db.disk_pages.get('page1', 'NOT PRESENT')}")
    print(f"  Disk: page2 = {db.disk_pages.get('page2', 'NOT PRESENT')}")
    print(f"  Disk: page3 = {db.disk_pages.get('page3', 'NOT PRESENT')}")

    # Recovery: Redo phase
    # In reality, the analysis phase would determine redo_lsn.
    # Here, page2 and page3 were never flushed, so we start from the earliest unflushed LSN.
    redo_lsn = t1_lsn1  # In practice, computed from the DPT
    db.recover_redo_phase(redo_lsn)

    print("\n📊 State after REDO phase:")
    print(f"  page1 = {db.disk_pages.get('page1')}")
    print(f"  page2 = {db.disk_pages.get('page2')}")
    print(f"  page3 = {db.disk_pages.get('page3')}")

    print("\n✅ RESULT: Database restored to exact pre-crash state!")
    print("   - page1: 'C' (last update by T2)")
    print("   - page2: 'Y' (update by T1, even though uncommitted)")
    print("   - page3: 'N' (update by T2)")
    print("\n   The UNDO phase would then roll back T1's changes.")
    print("   But that's a separate step - redo just repeats history!")

demonstrate_repeating_history()
```

Notice how the redo phase doesn't care about commit status. It redoes T1's changes even though T1 never committed. The undo phase (not shown) would later roll back T1's effects. This clean separation—redo everything first, then undo uncommitted—is the essence of repeating history.
We've explored the foundational principle that makes ARIES recovery correct and elegant. Let's consolidate the key insights:

- Redo repeats history exactly, restoring the crash-time state—committed and uncommitted effects alike.
- All filtering of uncommitted work is deferred to the undo phase, which operates on a fully reconstructed state.
- LSN comparison makes redo correct regardless of which pages happened to be flushed before the crash.
- The RedoLSN—the minimum recLSN in the Dirty Page Table—bounds how far back redo must scan.
- Repeating history is what makes the high-performance STEAL/NO-FORCE buffer policy safe.
What's Next:
Now that we understand why we repeat history, the next page explores how we do it efficiently. We'll examine the process of redoing all operations, the mechanics of applying log records to pages, and how ARIES handles the various types of log records during redo.
You now understand the philosophical foundation of ARIES redo: repeating history exactly as it occurred, regardless of transaction commit status. This counterintuitive approach, combined with clean separation from the undo phase, enables correct recovery in all scenarios while supporting high-performance buffer management policies.