At the heart of ARIES redo lies a remarkably simple concept: a Log Sequence Number (LSN). This monotonically increasing value, attached to every log record and every data page, enables the entire recovery mechanism to work correctly.
The LSN comparison during redo answers a critical question: Has this operation already been applied to this page? If the page's LSN indicates the operation is already reflected in its current state, we skip it. If not, we apply it.
This simple comparison achieves something profound: it makes redo idempotent—safe to repeat any number of times with the same result. This property is essential because recovery itself can crash and restart, potentially re-processing the same log records.
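The idempotence argument can be seen in a few lines. The sketch below (hypothetical `Page` and `LogRecord` types, not from any particular engine) applies the same log record twice; the LSN guard makes the second application a no-op:

```python
from dataclasses import dataclass

@dataclass
class Page:
    page_lsn: int
    value: str

@dataclass
class LogRecord:
    lsn: int
    new_value: str

def redo(page: Page, rec: LogRecord) -> None:
    # The LSN guard: apply only if the page is behind the log.
    if page.page_lsn < rec.lsn:
        page.value = rec.new_value
        page.page_lsn = rec.lsn

page = Page(page_lsn=99, value="A")
rec = LogRecord(lsn=100, new_value="B")

redo(page, rec)  # applied: the page was behind the log
redo(page, rec)  # no-op: page_lsn is now 100, no longer < 100
assert (page.value, page.page_lsn) == ("B", 100)
```

Running redo once or a hundred times leaves the page in the same state, which is exactly what a recovery process that may itself crash and restart requires.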
By the end of this page, you will understand: (1) The complete semantics of Log Sequence Numbers, (2) How page LSNs are maintained during normal operation, (3) The precise comparison algorithm during redo, (4) Why LSN comparison guarantees correct recovery, and (5) Edge cases and potential pitfalls.
An LSN is a unique identifier for each log record, but its implementation carries significant design decisions. Let's examine the LSN in depth.
LSN Properties:
```python
# LSN implementation details. Common implementations use one of:
#   1. Simple counter (requires persistent counter storage)
#   2. Physical log offset (byte position in the log file)
#   3. Hybrid (file number + offset within file)

class CounterLSN:
    """
    Approach 1: Simple counter LSN.
    Uses a persistent counter incremented for each log record.
    Simple, but requires careful persistence of counter state:
    on recovery, the highest LSN must be read from the log to
    restore the counter.
    """
    def __init__(self, initial_value=0):
        self.current = initial_value

    def next(self):
        self.current += 1
        return self.current


class PhysicalOffsetLSN:
    """
    Approach 2: Physical offset LSN (PostgreSQL style).
    LSN = byte offset in the Write-Ahead Log.

    Advantages:
    - Self-recovering: highest LSN = log file size
    - Provides the log position for direct seeks
    - No separate counter to persist

    Structure: typically a 64-bit value
    - Upper bits: log segment file number
    - Lower bits: byte offset within the segment
    """
    def __init__(self, segment_number=0, offset=0):
        self.segment_number = segment_number  # which log file
        self.offset = offset                  # byte position within it

    def to_uint64(self):
        """Convert to a comparable 64-bit value (32 bits segment, 32 bits offset)."""
        return (self.segment_number << 32) | self.offset

    @classmethod
    def from_uint64(cls, value):
        """Reconstruct from a 64-bit value."""
        return cls(value >> 32, value & 0xFFFFFFFF)

    def __lt__(self, other):
        return self.to_uint64() < other.to_uint64()

    def __eq__(self, other):
        return self.to_uint64() == other.to_uint64()


def postgresql_lsn_example():
    """
    PostgreSQL represents an LSN as segment/offset.

    Example value: 0/164E9C0
    - 0       = segment number
    - 164E9C0 = byte offset (23,390,656 in decimal)
    """
    lsn_str = "0/164E9C0"
    parts = lsn_str.split('/')
    segment = int(parts[0], 16)
    offset = int(parts[1], 16)

    print(f"PostgreSQL LSN: {lsn_str}")
    print(f"  Segment: {segment}")
    print(f"  Offset: {offset} bytes ({offset / (1024 * 1024):.2f} MB into segment)")

    # Can seek directly to this position in the WAL (filename layout simplified)
    wal_filename = f"pg_wal/{segment:08X}{offset >> 24:08X}"
    file_offset = offset & 0xFFFFFF
    print(f"  File: {wal_filename}, Position: {file_offset}")

postgresql_lsn_example()
```

The Physical Offset Advantage:
Using physical log offset as the LSN (as PostgreSQL does) provides an elegant property: given an LSN, you can immediately locate the log record on disk. This eliminates the need for an LSN-to-location index and speeds up log access during recovery.
LSN Overflow Considerations:
With a 64-bit LSN used as a byte offset and typical log record sizes (100-500 bytes), a database can address roughly 1.8 × 10^19 bytes of log, on the order of 10^17 log records. At 10,000 transactions per second averaging 10 log records each (100,000 records per second), that amounts to about 10^12 seconds, or tens of thousands of years of operation: effectively infinite for practical purposes.
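The arithmetic behind that claim is a quick back-of-the-envelope calculation (assuming LSN = byte offset and an optimistically small 100-byte record):

```python
lsn_space = 2 ** 64               # addressable log bytes with a 64-bit offset LSN
record_size = 100                 # small log record, in bytes
max_records = lsn_space // record_size

records_per_second = 10_000 * 10  # 10,000 txn/s x 10 records each
seconds = max_records // records_per_second
years = seconds / (365 * 24 * 3600)
print(f"{max_records:.3e} records, ~{years:,.0f} years until overflow")
```

Larger records shrink the figure proportionally, but even at 500 bytes per record the log space lasts for millennia.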
Every data page in the database contains a Page LSN—the LSN of the most recent log record that modified this page. This value is critical for redo correctness.
When is Page LSN Updated?
```python
class DataPage:
    """
    Database page with LSN tracking.
    The page LSN is stored in the page header.
    """
    PAGE_SIZE = 8192   # 8 KB typical page size
    HEADER_SIZE = 24   # space for page metadata

    def __init__(self, page_id):
        self.page_id = page_id
        self.page_lsn = 0            # initially 0 (never modified)
        self.data = bytearray(self.PAGE_SIZE)
        self.is_dirty = False

    def apply_modification(self, modification, log_record_lsn):
        """
        Apply a modification and update the page LSN.

        CRITICAL: The page LSN must be updated atomically with the
        modification. They must reach disk together.
        """
        # Step 1: Apply the actual modification
        if modification['type'] == 'INSERT':
            self._insert_record(modification['slot'], modification['data'])
        elif modification['type'] == 'UPDATE':
            self._update_record(modification['slot'], modification['data'])
        elif modification['type'] == 'DELETE':
            self._delete_record(modification['slot'])

        # Step 2: Update the page LSN (MUST happen with the modification)
        self.page_lsn = log_record_lsn

        # Step 3: Mark dirty
        self.is_dirty = True

    def get_page_lsn(self):
        """Read the page's current LSN."""
        return self.page_lsn

    def serialize(self):
        """
        Serialize the page for a disk write. The page LSN occupies the
        first 8 bytes of the header, so LSN and data are written in the
        same I/O operation.
        """
        self.data[:8] = self.page_lsn.to_bytes(8, 'little')
        return bytes(self.data)

    @classmethod
    def deserialize(cls, page_id, raw_bytes):
        """Read a page from disk, extracting the LSN from the header."""
        page = cls(page_id)
        page.page_lsn = int.from_bytes(raw_bytes[:8], 'little')
        page.data = bytearray(raw_bytes)
        return page

    # Slot-level record operations elided for brevity
    def _insert_record(self, slot, data): ...
    def _update_record(self, slot, data): ...
    def _delete_record(self, slot): ...


class TransactionManager:
    """
    Demonstrates the complete flow of a page modification with LSN handling.
    """
    def __init__(self, log_manager, buffer_manager):
        self.log = log_manager
        self.buffer = buffer_manager

    def update_row(self, txn_id, table_id, row_id, new_value, old_value):
        """Complete flow of a row update with proper LSN handling."""
        # Step 1: Locate the page containing this row
        page_id = self._get_page_for_row(table_id, row_id)
        page = self.buffer.fetch_page(page_id)

        # Step 2: Write the log record FIRST (WAL protocol).
        # This is critical: log before modifying the page.
        log_record = {
            'type': 'UPDATE',
            'txn_id': txn_id,
            'page_id': page_id,
            'redo_info': {'type': 'UPDATE', 'slot': row_id, 'data': new_value},
            'undo_info': {'type': 'UPDATE', 'slot': row_id, 'data': old_value},
        }
        lsn = self.log.write(log_record)

        # Step 3: Apply the modification and update the page LSN.
        # These happen together in memory.
        page.apply_modification(log_record['redo_info'], lsn)

        # Step 4: Return. The page is dirty but not necessarily flushed;
        # the buffer manager will flush it eventually (STEAL/NO-FORCE).
        return lsn

    def _get_page_for_row(self, table_id, row_id):
        # Implementation detail: map a row to its page
        raise NotImplementedError
```

The page modification and its LSN update MUST be atomic from the perspective of what's written to disk. If the page is written with the new data but old LSN, recovery will re-apply the operation (safe due to idempotence). If written with the new LSN but old data, the operation will be skipped—CORRUPTION! This is why the page LSN lives in the page header, written in the same I/O operation as the data.
The Page LSN Invariant:
At any point in time, for any page P on disk:
Page P contains the effects of all log records with LSN ≤ P.page_lsn
Page P does NOT contain effects of any log records with LSN > P.page_lsn
This invariant is preserved by two rules working together: the WAL protocol writes the log record (assigning its LSN) before the page is modified, and the page LSN is updated in the same in-memory step as the modification, so both reach disk in a single page write.
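The invariant lends itself to a direct check in a test harness. The sketch below uses hypothetical inputs (a page's on-disk LSN, the set of LSNs actually reflected in the page, and the full list of LSNs that ever touched it):

```python
def check_page_lsn_invariant(page_lsn: int, applied_lsns: set,
                             all_lsns_for_page: list) -> bool:
    """The on-disk page must reflect exactly the records with LSN <= page_lsn."""
    for lsn in all_lsns_for_page:
        if lsn <= page_lsn and lsn not in applied_lsns:
            return False  # a record at or below page_lsn is missing
        if lsn > page_lsn and lsn in applied_lsns:
            return False  # a record beyond page_lsn leaked onto disk
    return True

# A page flushed after LSN 102: records 100 and 102 applied, 104 not yet.
assert check_page_lsn_invariant(102, {100, 102}, [100, 102, 104])
# Violation: record 104 applied but the page LSN still says 102
# (new data, old LSN): recovery would re-apply 104, which is safe.
# The reverse (new LSN, old data) is the dangerous case.
assert not check_page_lsn_invariant(102, {100, 102, 104}, [100, 102, 104])
```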
During redo, for each log record we encounter, we must decide: should this operation be re-applied to the page? The LSN comparison provides the answer.
The Decision Rule:
IF page.page_lsn < log_record.lsn THEN
Apply the log record's redo information
Update page.page_lsn = log_record.lsn
ELSE
Skip this log record for this page
Why This Works:
```python
class RedoLSNComparison:
    """
    Detailed implementation of LSN comparison during redo.
    Shows all the cases and edge conditions.
    """
    def should_redo(self, page, log_record, dirty_page_table):
        """
        Determine whether a log record should be applied during redo.
        Returns a (should_apply, reason) tuple for clarity.
        """
        page_id = log_record.page_id
        record_lsn = log_record.lsn

        # Check 1: Is this page in the Dirty Page Table?
        # If not, the page was flushed after ALL its modifications.
        if page_id not in dirty_page_table:
            return False, "Page not in DPT - fully flushed before crash"

        # Check 2: Is this record's LSN < the page's recLSN?
        # If so, this modification was flushed before the page became dirty again.
        rec_lsn = dirty_page_table[page_id].rec_lsn
        if record_lsn < rec_lsn:
            return False, f"LSN {record_lsn} < recLSN {rec_lsn} - already flushed"

        # Check 3: Read the page and compare LSNs.
        # This is the definitive check.
        page_lsn = page.get_page_lsn()
        if page_lsn >= record_lsn:
            return False, f"pageLSN {page_lsn} >= {record_lsn} - operation already on disk"

        # All checks passed - must redo this operation
        return True, f"pageLSN {page_lsn} < {record_lsn} - must redo"

    def apply_redo(self, page, log_record):
        """Apply redo and update the page LSN atomically."""
        # Apply the modification
        self._apply_modification(page, log_record.redo_info)
        # Critical: update the page LSN
        page.page_lsn = log_record.lsn
        # Mark dirty for eventual flush
        page.is_dirty = True


def demonstrate_lsn_comparison():
    """Trace through a complete example of LSN comparison."""
    print("=" * 60)
    print("LSN COMPARISON DURING REDO - COMPLETE EXAMPLE")
    print("=" * 60)

    scenario = """
    Timeline:
    - LSN 100: T1 updates Page 10, value A → B
    - LSN 101: T2 updates Page 20, value X → Y
    - LSN 102: T1 updates Page 10, value B → C
    - LSN 103: T2 updates Page 30, value M → N
    - Page 10 flushed to disk (page_lsn = 102)
    - LSN 104: T1 updates Page 10, value C → D
    - CRASH!

    After the crash, disk state:
    - Page 10: page_lsn = 102, value = C
    - Page 20: page_lsn = 0, value = X (never flushed)
    - Page 30: page_lsn = 0, value = M (never flushed)
    """
    print(scenario)

    # Simulate redo
    disk_pages = {
        10: {'page_lsn': 102, 'value': 'C'},
        20: {'page_lsn': 0, 'value': 'X'},
        30: {'page_lsn': 0, 'value': 'M'},
    }
    log_records = [
        (100, 10, 'B'),  # LSN, page_id, new_value
        (101, 20, 'Y'),
        (102, 10, 'C'),
        (103, 30, 'N'),
        (104, 10, 'D'),
    ]

    print("\nREDO PHASE (starting from RedoLSN = 100):")
    print("-" * 40)
    for lsn, page_id, new_value in log_records:
        page = disk_pages[page_id]
        page_lsn = page['page_lsn']
        if page_lsn < lsn:
            print(f"LSN {lsn}, Page {page_id}: "
                  f"page_lsn={page_lsn} < {lsn} → REDO, set value={new_value}")
            page['value'] = new_value
            page['page_lsn'] = lsn
        else:
            print(f"LSN {lsn}, Page {page_id}: "
                  f"page_lsn={page_lsn} >= {lsn} → SKIP")

    print("\nState after redo:")
    for page_id, page in disk_pages.items():
        print(f"  Page {page_id}: value='{page['value']}', page_lsn={page['page_lsn']}")
    print("\n✓ All pages restored to pre-crash state!")

demonstrate_lsn_comparison()
```

Reading a page from disk is expensive—potentially 10ms on spinning disk or 0.1ms on SSD. ARIES uses a three-level filtering approach to minimize unnecessary page reads during redo.
Level 1: Dirty Page Table Presence
Before reading the page, check whether the page_id exists in the Dirty Page Table (DPT). If it is absent, every modification to the page had already been flushed before the crash, so no record for that page needs redo.
Level 2: RecLSN Comparison
If the page is in the DPT, compare record.lsn with the page's recLSN (the LSN at which the page first became dirty). If record.lsn < recLSN, the modification was already on disk before the page was dirtied again, so the record can be skipped without any I/O.
Level 3: Page LSN Comparison
Only if both checks above pass do we read the page and compare its page LSN with the record's LSN; the record is redone only when page_lsn < record.lsn.
```python
class ThreeLevelRedoFilter:
    """
    Implements the three-level filtering optimization.
    Tracks statistics to demonstrate the optimization's effectiveness.
    """
    def __init__(self, dirty_page_table, buffer_manager):
        self.dpt = dirty_page_table
        self.buffer = buffer_manager
        # Statistics
        self.stats = {
            'total_records': 0,
            'filtered_level_1_not_in_dpt': 0,
            'filtered_level_2_rec_lsn': 0,
            'filtered_level_3_page_lsn': 0,
            'redone': 0,
            'pages_read': 0,
        }

    def process_redo_record(self, log_record):
        """Process a single redo record through all three filter levels."""
        self.stats['total_records'] += 1
        page_id = log_record.page_id
        record_lsn = log_record.lsn

        # ========================================
        # LEVEL 1: DPT presence check
        # Cost: hash table lookup (O(1), no I/O)
        # ========================================
        if page_id not in self.dpt:
            self.stats['filtered_level_1_not_in_dpt'] += 1
            return 'SKIP_L1', "Page not in DPT"

        # ========================================
        # LEVEL 2: recLSN comparison
        # Cost: simple comparison (O(1), no I/O)
        # ========================================
        rec_lsn = self.dpt[page_id].rec_lsn
        if record_lsn < rec_lsn:
            self.stats['filtered_level_2_rec_lsn'] += 1
            return 'SKIP_L2', f"LSN {record_lsn} < recLSN {rec_lsn}"

        # ========================================
        # LEVEL 3: page LSN comparison
        # Cost: page read (expensive!) + comparison
        # ========================================
        page = self.buffer.fetch_page(page_id)
        self.stats['pages_read'] += 1
        page_lsn = page.get_page_lsn()
        if page_lsn >= record_lsn:
            self.stats['filtered_level_3_page_lsn'] += 1
            return 'SKIP_L3', f"pageLSN {page_lsn} >= {record_lsn}"

        # ========================================
        # REDO REQUIRED
        # ========================================
        self.apply_redo(page, log_record)
        self.stats['redone'] += 1
        return 'REDONE', f"Applied LSN {record_lsn} to page {page_id}"

    def apply_redo(self, page, log_record):
        """Apply the redo operation and update the page LSN."""
        page.apply(log_record.redo_info)  # apply modification
        page.page_lsn = log_record.lsn    # update LSN
        self.buffer.mark_dirty(page)      # mark dirty

    def print_statistics(self):
        """Show the filtering effectiveness."""
        s = self.stats
        total = s['total_records']
        print("\n" + "=" * 50)
        print("THREE-LEVEL FILTERING STATISTICS")
        print("=" * 50)
        print(f"\nTotal log records processed: {total}")
        print("\nFiltered (no disk I/O):")
        print(f"  Level 1 (not in DPT): {s['filtered_level_1_not_in_dpt']:5d} "
              f"({100*s['filtered_level_1_not_in_dpt']/max(1, total):.1f}%)")
        print(f"  Level 2 (recLSN):     {s['filtered_level_2_rec_lsn']:5d} "
              f"({100*s['filtered_level_2_rec_lsn']/max(1, total):.1f}%)")
        print("\nRequired page read:")
        print(f"  Level 3 (page LSN):   {s['filtered_level_3_page_lsn']:5d} "
              f"({100*s['filtered_level_3_page_lsn']/max(1, total):.1f}%)")
        print(f"  Redone:               {s['redone']:5d} "
              f"({100*s['redone']/max(1, total):.1f}%)")
        print(f"\nTotal pages read: {s['pages_read']}")
        print(f"I/O savings: {100*(1 - s['pages_read']/max(1, total)):.1f}%")


def demonstrate_filtering_effectiveness():
    """
    Simulate redo with realistic filter hit rates.

    Typical scenario:
    - Many pages flushed normally before the crash → Level 1 filters
    - Some recently dirtied pages → Levels 2/3 filter most
    - A small fraction actually need redo
    """
    import random

    # Simulated DPT (10% of pages were dirty at crash time)
    total_pages = 10000
    dirty_pages = set(random.sample(range(total_pages), k=1000))

    class Entry:
        pass

    dpt = {}
    for page_id in dirty_pages:
        # recLSN is somewhere in the last 1000 LSNs
        e = Entry()
        e.rec_lsn = random.randint(9000, 10000)
        dpt[page_id] = e

    # Simulate processing 5000 log records (LSN 8000-13000)
    class MockLogRecord:
        def __init__(self, lsn, page_id):
            self.lsn = lsn
            self.page_id = page_id
            self.redo_info = None  # payload not needed for the simulation

    class MockPage:
        def __init__(self, lsn):
            self.page_lsn = lsn

        def get_page_lsn(self):
            return self.page_lsn

        def apply(self, redo_info):
            pass

    class MockBuffer:
        def fetch_page(self, page_id):
            # Simulate: the on-disk page has some LSN up to crash time
            return MockPage(random.randint(0, 10000))

        def mark_dirty(self, page):
            pass

    filter_obj = ThreeLevelRedoFilter(dpt, MockBuffer())
    for lsn in range(8000, 13000):
        page_id = random.randint(0, total_pages - 1)
        filter_obj.process_redo_record(MockLogRecord(lsn, page_id))

    filter_obj.print_statistics()

# Run the demonstration
demonstrate_filtering_effectiveness()
```

In production systems, Level 1 and Level 2 filters together eliminate 80-95% of log records before any page I/O. This dramatically speeds up recovery. A database that would take hours to recover by reading every page might recover in minutes thanks to these optimizations.
The LSN comparison mechanism handles several tricky edge cases correctly. Let's examine each:
Edge Case 1: Partial Page Write
What if a page was being written when the crash occurred, resulting in a partially written page?
```python
def partial_write_scenario():
    """
    Scenario: crash during a page write.

    A page write is an 8 KB operation. If power fails midway:
    - Part of the new data is written
    - Part remains as old data
    - The page is CORRUPTED

    How does ARIES handle this? Two approaches are used in practice.
    """

    # Approach 1: Page checksums + double-write buffer (InnoDB)
    class DoubleWriteProtection:
        """
        Before writing to the actual location:
        1. Write the page to the doublewrite buffer (a sequential area)
        2. Sync the doublewrite buffer
        3. Write the page to its actual location

        On recovery:
        - Checksum each page
        - If corrupt, restore it from the doublewrite buffer
        - Then apply redo normally
        """
        def write_page(self, page):
            # Step 1: Write to the doublewrite area
            self.doublewrite_buffer.write(page)
            self.doublewrite_buffer.sync()
            # Step 2: Write to the actual location
            self.data_file.write_at(page.page_id, page.data)

        def recover_page(self, page_id):
            page = self.data_file.read(page_id)
            if not page.verify_checksum():
                # Corrupted! Restore from the doublewrite buffer
                page = self.doublewrite_buffer.get(page_id)
            return page

    # Approach 2: Full page writes after a checkpoint (PostgreSQL)
    class FullPageWrites:
        """
        After a checkpoint, the first modification to any page logs
        the ENTIRE page content.

        On recovery:
        - If the page is corrupt or stale, the full page image restores it
        - Subsequent redo records apply normally
        """
        def log_modification(self, page, modification, is_first_since_checkpoint):
            if is_first_since_checkpoint:
                # Log the full page image
                log_record = {
                    'type': 'FULL_PAGE_WRITE',
                    'page_id': page.page_id,
                    'full_page_data': page.data,  # complete page!
                    'modification': modification,
                }
            else:
                # Normal incremental log record
                log_record = {
                    'type': 'UPDATE',
                    'page_id': page.page_id,
                    'modification': modification,
                }
            return self.log.write(log_record)

    print("Both approaches ensure that even with partial writes,")
    print("redo can restore pages to a correct state.")
    print("LSN comparison still works correctly afterwards.")

partial_write_scenario()
```

Edge Case 2: Page LSN Equals Record LSN
What if page_lsn == record.lsn exactly? This means the page on disk reflects exactly this log record's modification. The comparison page_lsn >= record.lsn correctly skips the redo.
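The boundary can be made explicit with the strict comparison (this is the generic guard, not any particular engine's code):

```python
def should_redo(page_lsn: int, record_lsn: int) -> bool:
    # Strict less-than: equality means "already applied", so skip.
    return page_lsn < record_lsn

assert should_redo(99, 100)       # page behind the log: redo
assert not should_redo(100, 100)  # exactly this record on disk: skip
assert not should_redo(101, 100)  # page ahead of this record: skip
```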
Edge Case 3: Multiple Log Records with Same LSN
This should never happen—LSNs are unique by definition. If a bug allows duplicate LSNs, recovery becomes undefined.
Edge Case 4: Page Flushed Between Checkpoint and Crash
The DPT might contain a page that was actually flushed after checkpoint:
```python
def flush_after_checkpoint_scenario():
    """
    Timeline:
    1. Checkpoint taken; Page P in DPT with recLSN=100
    2. Page P flushed (page_lsn=100 now on disk)
    3. Page P dirtied again (LSN=150)
    4. Crash before Page P is flushed again

    The DPT still has recLSN=100 (from the checkpoint),
    but the page on disk has page_lsn=100.
    How does redo handle this?
    """
    # During redo for LSN 100:
    page_on_disk_lsn = 100
    record_lsn = 100
    rec_lsn_in_dpt = 100
    # Level 1: Page in DPT? Yes
    # Level 2: record.lsn >= recLSN? 100 >= 100, yes, continue
    # Level 3: page_lsn >= record.lsn? 100 >= 100, yes, SKIP
    print("LSN 100: Correctly skipped (already on disk)")

    # During redo for LSN 150:
    record_lsn = 150
    # Level 1: Page in DPT? Yes
    # Level 2: record.lsn >= recLSN? 150 >= 100, yes, continue
    # Level 3: page_lsn >= record.lsn? 100 >= 150? NO, must REDO
    print("LSN 150: Correctly redone (modification was lost)")

    print("\nThe three-level filter handles this correctly!")
    print("A conservative DPT entry doesn't cause incorrect redo,")
    print("just potentially one extra page read.")

flush_after_checkpoint_scenario()
```

Levels 1 and 2 are conservative filters—they might let through records that don't need redo, but they never filter out records that DO need redo. Level 3 (the page LSN check) is the definitive filter. This design prioritizes correctness over perfect efficiency.
While we've focused on redo, LSNs serve several additional purposes in database systems:
1. Recovery Ordering
During undo phase, LSNs ensure operations are undone in reverse order. The prev_lsn chain in log records enables this.
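The prev_lsn chain can be sketched with a minimal in-memory log (hypothetical dict-based records; a real system reads them from the log file by LSN):

```python
# Each record points at the SAME transaction's previous record.
log = {
    100: {'txn_id': 'T1', 'prev_lsn': None, 'op': 'update A'},
    102: {'txn_id': 'T1', 'prev_lsn': 100,  'op': 'update B'},
    104: {'txn_id': 'T1', 'prev_lsn': 102,  'op': 'update C'},
}

def undo_chain(log, last_lsn):
    """Yield a transaction's operations in reverse order via prev_lsn."""
    lsn = last_lsn
    while lsn is not None:
        record = log[lsn]
        yield record['op']
        lsn = record['prev_lsn']

assert list(undo_chain(log, 104)) == ['update C', 'update B', 'update A']
```

Starting from the transaction's last LSN, the chain walks backward without scanning the log, which is why each log record carries its predecessor's LSN.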
2. Checkpoint Optimization
The checkpoint record's LSN gives recovery its starting point: analysis begins at the checkpoint, and redo begins at the smallest recLSN recorded there rather than at the beginning of the log.
3. Replication
LSNs identify positions in the log for replication: a replica records the last LSN it has applied, and the primary streams every log record with a higher LSN, much as redo would.
4. Point-in-Time Recovery
To recover to a specific moment, replay the log only up to a target LSN, then undo every transaction that had not committed by that point:
```python
class LSNApplications:
    """Demonstrates various uses of LSNs beyond redo."""

    def point_in_time_recovery(self, target_lsn):
        """Recover the database to its state at target_lsn."""
        # Phase 1: Analysis (same as normal recovery)
        analysis_result = self.run_analysis_phase()

        # Phase 2: Modified redo - stop at the target
        for record in self.log.scan_forward(analysis_result.redo_lsn):
            if record.lsn > target_lsn:
                break  # don't redo beyond the target
            if self.should_redo(record):
                self.apply_redo(record)

        # Phase 3: Undo transactions uncommitted at the target time,
        # i.e. any transaction without a COMMIT record before target_lsn.
        for txn_id in analysis_result.active_transactions:
            commit_lsn = self.find_commit_lsn(txn_id)
            if commit_lsn is None or commit_lsn > target_lsn:
                self.undo_transaction(txn_id)

    def streaming_replication_apply(self, replica_state):
        """Apply log records to a replica."""
        # The replica tracks last_applied_lsn
        last_lsn = replica_state.last_applied_lsn
        for record in self.log.scan_forward(last_lsn + 1):
            # Apply to the replica (similar to redo)
            self.apply_to_replica(record, replica_state)
            # Update tracking
            replica_state.last_applied_lsn = record.lsn
            # Handle commit: make changes visible
            if record.type == 'COMMIT':
                replica_state.make_visible(record.txn_id)

    def log_truncation_decision(self):
        """
        Determine which log records can be safely deleted.

        The log can be truncated up to:
        min(checkpoint_lsn, oldest_active_txn_start_lsn, replica_positions)
        """
        # Latest checkpoint
        checkpoint_lsn = self.get_latest_checkpoint_lsn()

        # Oldest active transaction (might still need to be undone)
        oldest_active = min(
            (txn.start_lsn for txn in self.get_active_transactions()),
            default=float('inf'),
        )

        # Farthest-behind replica
        replica_min = min(
            (r.last_applied_lsn for r in self.get_replicas()),
            default=float('inf'),
        )

        # Safe to truncate up to this point
        return min(checkpoint_lsn, oldest_active, replica_min)


# Demonstrate point-in-time recovery
def pitr_example():
    """Example: recover to just before an accidental DELETE."""
    print("Scenario: a user accidentally ran DELETE FROM users")
    print("The DELETE was at LSN 5000")
    print("")
    print("Point-in-time recovery to LSN 4999:")
    print("  1. Redo all operations up to LSN 4999")
    print("  2. Skip the DELETE at LSN 5000")
    print("  3. Undo any transactions that hadn't committed by LSN 4999")
    print("")
    print("Result: database restored to the moment before the DELETE!")

pitr_example()
```

The Log Sequence Number is a deceptively simple concept that enables the entire ARIES recovery mechanism. To consolidate: the LSN is a monotonically increasing identifier attached to every log record; each page's header stores the LSN of the last record applied to it; and the strict comparison page_lsn < record.lsn decides redo, making recovery both correct and idempotent.
What's Next:
The LSN comparison mechanism makes redo idempotent—safe to repeat. The next page explores idempotence in depth: what it means, why it's essential for crash recovery, and how ARIES ensures all redo operations are idempotent.
You now understand the complete LSN comparison mechanism that makes ARIES redo correct. This simple numeric comparison—is the page behind the log?—combined with careful LSN maintenance during normal operation, ensures that recovery always produces the exact pre-crash database state.