At the heart of ARIES redo lies a remarkably simple concept: a Log Sequence Number (LSN). This monotonically increasing value, attached to every log record and every data page, enables the entire recovery mechanism to work correctly.
The LSN comparison during redo answers a critical question: Has this operation already been applied to this page? If the page's LSN indicates the operation is already reflected in its current state, we skip it. If not, we apply it.
This simple comparison achieves something profound: it makes redo idempotent—safe to repeat any number of times with the same result. This property is essential because recovery itself can crash and restart, potentially re-processing the same log records.
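The idempotence argument can be seen in a few lines. The sketch below (hypothetical `Page` and `LogRecord` types, not from any particular engine) applies the same log record twice; the LSN guard makes the second application a no-op:

```python
from dataclasses import dataclass

@dataclass
class Page:
    page_lsn: int
    value: str

@dataclass
class LogRecord:
    lsn: int
    new_value: str

def redo(page: Page, rec: LogRecord) -> None:
    # The LSN guard: apply only if the page is behind the log.
    if page.page_lsn < rec.lsn:
        page.value = rec.new_value
        page.page_lsn = rec.lsn

page = Page(page_lsn=99, value="A")
rec = LogRecord(lsn=100, new_value="B")

redo(page, rec)  # applied: the page was behind the log
redo(page, rec)  # no-op: page_lsn is now 100, no longer < 100
assert (page.value, page.page_lsn) == ("B", 100)
```

Running redo once or a hundred times leaves the page in the same state, which is exactly what a recovery process that may itself crash and restart requires.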
By the end of this page, you will understand: (1) The complete semantics of Log Sequence Numbers, (2) How page LSNs are maintained during normal operation, (3) The precise comparison algorithm during redo, (4) Why LSN comparison guarantees correct recovery, and (5) Edge cases and potential pitfalls.
An LSN is a unique identifier for each log record, but its implementation carries significant design decisions. Let's examine the LSN in depth.
LSN Properties:
```python
# LSN implementation details. Common implementations use one of:
#   1. Simple counter (requires persistent counter storage)
#   2. Physical log offset (byte position in the log file)
#   3. Hybrid (file number + offset within file)

class CounterLSN:
    """
    Approach 1: Simple counter LSN.
    Uses a persistent counter incremented for each log record.
    Simple, but requires careful persistence of counter state:
    on recovery, the highest LSN must be read from the log to
    restore the counter.
    """
    def __init__(self, initial_value=0):
        self.current = initial_value

    def next(self):
        self.current += 1
        return self.current


class PhysicalOffsetLSN:
    """
    Approach 2: Physical offset LSN (PostgreSQL style).
    LSN = byte offset in the Write-Ahead Log.

    Advantages:
    - Self-recovering: highest LSN = log file size
    - Provides the log position for direct seeks
    - No separate counter to persist

    Structure: typically a 64-bit value
    - Upper bits: log segment file number
    - Lower bits: byte offset within the segment
    """
    def __init__(self, segment_number=0, offset=0):
        self.segment_number = segment_number  # which log file
        self.offset = offset                  # byte position within it

    def to_uint64(self):
        """Convert to a comparable 64-bit value (32 bits segment, 32 bits offset)."""
        return (self.segment_number << 32) | self.offset

    @classmethod
    def from_uint64(cls, value):
        """Reconstruct from a 64-bit value."""
        return cls(value >> 32, value & 0xFFFFFFFF)

    def __lt__(self, other):
        return self.to_uint64() < other.to_uint64()

    def __eq__(self, other):
        return self.to_uint64() == other.to_uint64()


def postgresql_lsn_example():
    """
    PostgreSQL represents an LSN as segment/offset.

    Example value: 0/164E9C0
    - 0       = segment number
    - 164E9C0 = byte offset (23,390,656 in decimal)
    """
    lsn_str = "0/164E9C0"
    parts = lsn_str.split('/')
    segment = int(parts[0], 16)
    offset = int(parts[1], 16)

    print(f"PostgreSQL LSN: {lsn_str}")
    print(f"  Segment: {segment}")
    print(f"  Offset: {offset} bytes ({offset / (1024 * 1024):.2f} MB into segment)")

    # Can seek directly to this position in the WAL (filename layout simplified)
    wal_filename = f"pg_wal/{segment:08X}{offset >> 24:08X}"
    file_offset = offset & 0xFFFFFF
    print(f"  File: {wal_filename}, Position: {file_offset}")

postgresql_lsn_example()
```

The Physical Offset Advantage:
Using physical log offset as the LSN (as PostgreSQL does) provides an elegant property: given an LSN, you can immediately locate the log record on disk. This eliminates the need for an LSN-to-location index and speeds up log access during recovery.
LSN Overflow Considerations:
With a 64-bit LSN used as a byte offset and typical log record sizes (100-500 bytes), a database can address roughly 1.8 × 10^19 bytes of log, on the order of 10^17 log records. At 10,000 transactions per second averaging 10 log records each (100,000 records per second), that amounts to about 10^12 seconds, or tens of thousands of years of operation: effectively infinite for practical purposes.
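The arithmetic behind that claim is a quick back-of-the-envelope calculation (assuming LSN = byte offset and an optimistically small 100-byte record):

```python
lsn_space = 2 ** 64               # addressable log bytes with a 64-bit offset LSN
record_size = 100                 # small log record, in bytes
max_records = lsn_space // record_size

records_per_second = 10_000 * 10  # 10,000 txn/s x 10 records each
seconds = max_records // records_per_second
years = seconds / (365 * 24 * 3600)
print(f"{max_records:.3e} records, ~{years:,.0f} years until overflow")
```

Larger records shrink the figure proportionally, but even at 500 bytes per record the log space lasts for millennia.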
Every data page in the database contains a Page LSN—the LSN of the most recent log record that modified this page. This value is critical for redo correctness.
When is Page LSN Updated?
```python
class DataPage:
    """
    Database page with LSN tracking.
    The page LSN is stored in the page header.
    """
    PAGE_SIZE = 8192   # 8 KB typical page size
    HEADER_SIZE = 24   # space for page metadata

    def __init__(self, page_id):
        self.page_id = page_id
        self.page_lsn = 0            # initially 0 (never modified)
        self.data = bytearray(self.PAGE_SIZE)
        self.is_dirty = False

    def apply_modification(self, modification, log_record_lsn):
        """
        Apply a modification and update the page LSN.

        CRITICAL: The page LSN must be updated atomically with the
        modification. They must reach disk together.
        """
        # Step 1: Apply the actual modification
        if modification['type'] == 'INSERT':
            self._insert_record(modification['slot'], modification['data'])
        elif modification['type'] == 'UPDATE':
            self._update_record(modification['slot'], modification['data'])
        elif modification['type'] == 'DELETE':
            self._delete_record(modification['slot'])

        # Step 2: Update the page LSN (MUST happen with the modification)
        self.page_lsn = log_record_lsn

        # Step 3: Mark dirty
        self.is_dirty = True

    def get_page_lsn(self):
        """Read the page's current LSN."""
        return self.page_lsn

    def serialize(self):
        """
        Serialize the page for a disk write. The page LSN occupies the
        first 8 bytes of the header, so LSN and data are written in the
        same I/O operation.
        """
        self.data[:8] = self.page_lsn.to_bytes(8, 'little')
        return bytes(self.data)

    @classmethod
    def deserialize(cls, page_id, raw_bytes):
        """Read a page from disk, extracting the LSN from the header."""
        page = cls(page_id)
        page.page_lsn = int.from_bytes(raw_bytes[:8], 'little')
        page.data = bytearray(raw_bytes)
        return page

    # Slot-level record operations elided for brevity
    def _insert_record(self, slot, data): ...
    def _update_record(self, slot, data): ...
    def _delete_record(self, slot): ...


class TransactionManager:
    """
    Demonstrates the complete flow of a page modification with LSN handling.
    """
    def __init__(self, log_manager, buffer_manager):
        self.log = log_manager
        self.buffer = buffer_manager

    def update_row(self, txn_id, table_id, row_id, new_value, old_value):
        """Complete flow of a row update with proper LSN handling."""
        # Step 1: Locate the page containing this row
        page_id = self._get_page_for_row(table_id, row_id)
        page = self.buffer.fetch_page(page_id)

        # Step 2: Write the log record FIRST (WAL protocol).
        # This is critical: log before modifying the page.
        log_record = {
            'type': 'UPDATE',
            'txn_id': txn_id,
            'page_id': page_id,
            'redo_info': {'type': 'UPDATE', 'slot': row_id, 'data': new_value},
            'undo_info': {'type': 'UPDATE', 'slot': row_id, 'data': old_value},
        }
        lsn = self.log.write(log_record)

        # Step 3: Apply the modification and update the page LSN.
        # These happen together in memory.
        page.apply_modification(log_record['redo_info'], lsn)

        # Step 4: Return. The page is dirty but not necessarily flushed;
        # the buffer manager will flush it eventually (STEAL/NO-FORCE).
        return lsn

    def _get_page_for_row(self, table_id, row_id):
        # Implementation detail: map a row to its page
        raise NotImplementedError
```

The page modification and its LSN update MUST be atomic from the perspective of what's written to disk. If the page is written with the new data but old LSN, recovery will re-apply the operation (safe due to idempotence). If written with the new LSN but old data, the operation will be skipped—CORRUPTION! This is why the page LSN lives in the page header, written in the same I/O operation as the data.
The Page LSN Invariant:
At any point in time, for any page P on disk:
Page P contains the effects of all log records with LSN ≤ P.page_lsn
Page P does NOT contain effects of any log records with LSN > P.page_lsn
This invariant is preserved by two rules working together: the WAL protocol writes the log record (assigning its LSN) before the page is modified, and the page LSN is updated in the same in-memory step as the modification, so both reach disk in a single page write.
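The invariant lends itself to a direct check in a test harness. The sketch below uses hypothetical inputs (a page's on-disk LSN, the set of LSNs actually reflected in the page, and the full list of LSNs that ever touched it):

```python
def check_page_lsn_invariant(page_lsn: int, applied_lsns: set,
                             all_lsns_for_page: list) -> bool:
    """The on-disk page must reflect exactly the records with LSN <= page_lsn."""
    for lsn in all_lsns_for_page:
        if lsn <= page_lsn and lsn not in applied_lsns:
            return False  # a record at or below page_lsn is missing
        if lsn > page_lsn and lsn in applied_lsns:
            return False  # a record beyond page_lsn leaked onto disk
    return True

# A page flushed after LSN 102: records 100 and 102 applied, 104 not yet.
assert check_page_lsn_invariant(102, {100, 102}, [100, 102, 104])
# Violation: record 104 applied but the page LSN still says 102
# (new data, old LSN): recovery would re-apply 104, which is safe.
# The reverse (new LSN, old data) is the dangerous case.
assert not check_page_lsn_invariant(102, {100, 102, 104}, [100, 102, 104])
```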
During redo, for each log record we encounter, we must decide: should this operation be re-applied to the page? The LSN comparison provides the answer.
The Decision Rule:
IF page.page_lsn < log_record.lsn THEN
Apply the log record's redo information
Update page.page_lsn = log_record.lsn
ELSE
Skip this log record for this page
Why This Works:
```python
class RedoLSNComparison:
    """
    Detailed implementation of LSN comparison during redo.
    Shows all the cases and edge conditions.
    """
    def should_redo(self, page, log_record, dirty_page_table):
        """
        Determine whether a log record should be applied during redo.
        Returns a (should_apply, reason) tuple for clarity.
        """
        page_id = log_record.page_id
        record_lsn = log_record.lsn

        # Check 1: Is this page in the Dirty Page Table?
        # If not, the page was flushed after ALL its modifications.
        if page_id not in dirty_page_table:
            return False, "Page not in DPT - fully flushed before crash"

        # Check 2: Is this record's LSN < the page's recLSN?
        # If so, this modification was flushed before the page became dirty again.
        rec_lsn = dirty_page_table[page_id].rec_lsn
        if record_lsn < rec_lsn:
            return False, f"LSN {record_lsn} < recLSN {rec_lsn} - already flushed"

        # Check 3: Read the page and compare LSNs.
        # This is the definitive check.
        page_lsn = page.get_page_lsn()
        if page_lsn >= record_lsn:
            return False, f"pageLSN {page_lsn} >= {record_lsn} - operation already on disk"

        # All checks passed - must redo this operation
        return True, f"pageLSN {page_lsn} < {record_lsn} - must redo"

    def apply_redo(self, page, log_record):
        """Apply redo and update the page LSN atomically."""
        # Apply the modification
        self._apply_modification(page, log_record.redo_info)
        # Critical: update the page LSN
        page.page_lsn = log_record.lsn
        # Mark dirty for eventual flush
        page.is_dirty = True


def demonstrate_lsn_comparison():
    """Trace through a complete example of LSN comparison."""
    print("=" * 60)
    print("LSN COMPARISON DURING REDO - COMPLETE EXAMPLE")
    print("=" * 60)

    scenario = """
    Timeline:
    - LSN 100: T1 updates Page 10, value A → B
    - LSN 101: T2 updates Page 20, value X → Y
    - LSN 102: T1 updates Page 10, value B → C
    - LSN 103: T2 updates Page 30, value M → N
    - Page 10 flushed to disk (page_lsn = 102)
    - LSN 104: T1 updates Page 10, value C → D
    - CRASH!

    After the crash, disk state:
    - Page 10: page_lsn = 102, value = C
    - Page 20: page_lsn = 0, value = X (never flushed)
    - Page 30: page_lsn = 0, value = M (never flushed)
    """
    print(scenario)

    # Simulate redo
    disk_pages = {
        10: {'page_lsn': 102, 'value': 'C'},
        20: {'page_lsn': 0, 'value': 'X'},
        30: {'page_lsn': 0, 'value': 'M'},
    }
    log_records = [
        (100, 10, 'B'),  # LSN, page_id, new_value
        (101, 20, 'Y'),
        (102, 10, 'C'),
        (103, 30, 'N'),
        (104, 10, 'D'),
    ]

    print("\nREDO PHASE (starting from RedoLSN = 100):")
    print("-" * 40)
    for lsn, page_id, new_value in log_records:
        page = disk_pages[page_id]
        page_lsn = page['page_lsn']
        if page_lsn < lsn:
            print(f"LSN {lsn}, Page {page_id}: "
                  f"page_lsn={page_lsn} < {lsn} → REDO, set value={new_value}")
            page['value'] = new_value
            page['page_lsn'] = lsn
        else:
            print(f"LSN {lsn}, Page {page_id}: "
                  f"page_lsn={page_lsn} >= {lsn} → SKIP")

    print("\nState after redo:")
    for page_id, page in disk_pages.items():
        print(f"  Page {page_id}: value='{page['value']}', page_lsn={page['page_lsn']}")
    print("\n✓ All pages restored to pre-crash state!")

demonstrate_lsn_comparison()
```

Reading a page from disk is expensive—potentially 10ms on spinning disk or 0.1ms on SSD. ARIES uses a three-level filtering approach to minimize unnecessary page reads during redo.
Level 1: Dirty Page Table Presence
Before reading the page, check whether the page_id exists in the Dirty Page Table (DPT). If it is absent, every modification to the page had already been flushed before the crash, so no record for that page needs redo.
Level 2: RecLSN Comparison
If the page is in the DPT, compare record.lsn with the page's recLSN (the LSN at which the page first became dirty). If record.lsn < recLSN, the modification was already on disk before the page was dirtied again, so the record can be skipped without any I/O.
Level 3: Page LSN Comparison
Only if both checks above pass do we read the page and compare its page LSN with the record's LSN; the record is redone only when page_lsn < record.lsn.
```python
class ThreeLevelRedoFilter:
    """
    Implements the three-level filtering optimization.
    Tracks statistics to demonstrate the optimization's effectiveness.
    """
    def __init__(self, dirty_page_table, buffer_manager):
        self.dpt = dirty_page_table
        self.buffer = buffer_manager
        # Statistics
        self.stats = {
            'total_records': 0,
            'filtered_level_1_not_in_dpt': 0,
            'filtered_level_2_rec_lsn': 0,
            'filtered_level_3_page_lsn': 0,
            'redone': 0,
            'pages_read': 0,
        }

    def process_redo_record(self, log_record):
        """Process a single redo record through all three filter levels."""
        self.stats['total_records'] += 1
        page_id = log_record.page_id
        record_lsn = log_record.lsn

        # ========================================
        # LEVEL 1: DPT presence check
        # Cost: hash table lookup (O(1), no I/O)
        # ========================================
        if page_id not in self.dpt:
            self.stats['filtered_level_1_not_in_dpt'] += 1
            return 'SKIP_L1', "Page not in DPT"

        # ========================================
        # LEVEL 2: recLSN comparison
        # Cost: simple comparison (O(1), no I/O)
        # ========================================
        rec_lsn = self.dpt[page_id].rec_lsn
        if record_lsn < rec_lsn:
            self.stats['filtered_level_2_rec_lsn'] += 1
            return 'SKIP_L2', f"LSN {record_lsn} < recLSN {rec_lsn}"

        # ========================================
        # LEVEL 3: page LSN comparison
        # Cost: page read (expensive!) + comparison
        # ========================================
        page = self.buffer.fetch_page(page_id)
        self.stats['pages_read'] += 1
        page_lsn = page.get_page_lsn()
        if page_lsn >= record_lsn:
            self.stats['filtered_level_3_page_lsn'] += 1
            return 'SKIP_L3', f"pageLSN {page_lsn} >= {record_lsn}"

        # ========================================
        # REDO REQUIRED
        # ========================================
        self.apply_redo(page, log_record)
        self.stats['redone'] += 1
        return 'REDONE', f"Applied LSN {record_lsn} to page {page_id}"

    def apply_redo(self, page, log_record):
        """Apply the redo operation and update the page LSN."""
        page.apply(log_record.redo_info)  # apply modification
        page.page_lsn = log_record.lsn    # update LSN
        self.buffer.mark_dirty(page)      # mark dirty

    def print_statistics(self):
        """Show the filtering effectiveness."""
        s = self.stats
        total = s['total_records']
        print("\n" + "=" * 50)
        print("THREE-LEVEL FILTERING STATISTICS")
        print("=" * 50)
        print(f"\nTotal log records processed: {total}")
        print("\nFiltered (no disk I/O):")
        print(f"  Level 1 (not in DPT): {s['filtered_level_1_not_in_dpt']:5d} "
              f"({100*s['filtered_level_1_not_in_dpt']/max(1, total):.1f}%)")
        print(f"  Level 2 (recLSN):     {s['filtered_level_2_rec_lsn']:5d} "
              f"({100*s['filtered_level_2_rec_lsn']/max(1, total):.1f}%)")
        print("\nRequired page read:")
        print(f"  Level 3 (page LSN):   {s['filtered_level_3_page_lsn']:5d} "
              f"({100*s['filtered_level_3_page_lsn']/max(1, total):.1f}%)")
        print(f"  Redone:               {s['redone']:5d} "
              f"({100*s['redone']/max(1, total):.1f}%)")
        print(f"\nTotal pages read: {s['pages_read']}")
        print(f"I/O savings: {100*(1 - s['pages_read']/max(1, total)):.1f}%")


def demonstrate_filtering_effectiveness():
    """
    Simulate redo with realistic filter hit rates.

    Typical scenario:
    - Many pages flushed normally before the crash → Level 1 filters
    - Some recently dirtied pages → Levels 2/3 filter most
    - A small fraction actually need redo
    """
    import random

    # Simulated DPT (10% of pages were dirty at crash time)
    total_pages = 10000
    dirty_pages = set(random.sample(range(total_pages), k=1000))

    class Entry:
        pass

    dpt = {}
    for page_id in dirty_pages:
        # recLSN is somewhere in the last 1000 LSNs
        e = Entry()
        e.rec_lsn = random.randint(9000, 10000)
        dpt[page_id] = e

    # Simulate processing 5000 log records (LSN 8000-13000)
    class MockLogRecord:
        def __init__(self, lsn, page_id):
            self.lsn = lsn
            self.page_id = page_id
            self.redo_info = None  # payload not needed for the simulation

    class MockPage:
        def __init__(self, lsn):
            self.page_lsn = lsn

        def get_page_lsn(self):
            return self.page_lsn

        def apply(self, redo_info):
            pass

    class MockBuffer:
        def fetch_page(self, page_id):
            # Simulate: the on-disk page has some LSN up to crash time
            return MockPage(random.randint(0, 10000))

        def mark_dirty(self, page):
            pass

    filter_obj = ThreeLevelRedoFilter(dpt, MockBuffer())
    for lsn in range(8000, 13000):
        page_id = random.randint(0, total_pages - 1)
        filter_obj.process_redo_record(MockLogRecord(lsn, page_id))

    filter_obj.print_statistics()

# Run the demonstration
demonstrate_filtering_effectiveness()
```

In production systems, Level 1 and Level 2 filters together eliminate 80-95% of log records before any page I/O. This dramatically speeds up recovery. A database that would take hours to recover by reading every page might recover in minutes thanks to these optimizations.
The LSN comparison mechanism handles several tricky edge cases correctly. Let's examine each:
Edge Case 1: Partial Page Write
What if a page was being written when the crash occurred, resulting in a partially written page?
```python
def partial_write_scenario():
    """
    Scenario: crash during a page write.

    A page write is an 8 KB operation. If power fails midway:
    - Part of the new data is written
    - Part remains as old data
    - The page is CORRUPTED

    How does ARIES handle this? Two approaches are used in practice.
    """

    # Approach 1: Page checksums + double-write buffer (InnoDB)
    class DoubleWriteProtection:
        """
        Before writing to the actual location:
        1. Write the page to the doublewrite buffer (a sequential area)
        2. Sync the doublewrite buffer
        3. Write the page to its actual location

        On recovery:
        - Checksum each page
        - If corrupt, restore it from the doublewrite buffer
        - Then apply redo normally
        """
        def write_page(self, page):
            # Step 1: Write to the doublewrite area
            self.doublewrite_buffer.write(page)
            self.doublewrite_buffer.sync()
            # Step 2: Write to the actual location
            self.data_file.write_at(page.page_id, page.data)

        def recover_page(self, page_id):
            page = self.data_file.read(page_id)
            if not page.verify_checksum():
                # Corrupted! Restore from the doublewrite buffer
                page = self.doublewrite_buffer.get(page_id)
            return page

    # Approach 2: Full page writes after a checkpoint (PostgreSQL)
    class FullPageWrites:
        """
        After a checkpoint, the first modification to any page logs
        the ENTIRE page content.

        On recovery:
        - If the page is corrupt or stale, the full page image restores it
        - Subsequent redo records apply normally
        """
        def log_modification(self, page, modification, is_first_since_checkpoint):
            if is_first_since_checkpoint:
                # Log the full page image
                log_record = {
                    'type': 'FULL_PAGE_WRITE',
                    'page_id': page.page_id,
                    'full_page_data': page.data,  # complete page!
                    'modification': modification,
                }
            else:
                # Normal incremental log record
                log_record = {
                    'type': 'UPDATE',
                    'page_id': page.page_id,
                    'modification': modification,
                }
            return self.log.write(log_record)

    print("Both approaches ensure that even with partial writes,")
    print("redo can restore pages to a correct state.")
    print("LSN comparison still works correctly afterwards.")

partial_write_scenario()
```

Edge Case 2: Page LSN Equals Record LSN
What if page_lsn == record.lsn exactly? This means the page on disk reflects exactly this log record's modification. The comparison page_lsn >= record.lsn correctly skips the redo.
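The boundary can be made explicit with the strict comparison (this is the generic guard, not any particular engine's code):

```python
def should_redo(page_lsn: int, record_lsn: int) -> bool:
    # Strict less-than: equality means "already applied", so skip.
    return page_lsn < record_lsn

assert should_redo(99, 100)       # page behind the log: redo
assert not should_redo(100, 100)  # exactly this record on disk: skip
assert not should_redo(101, 100)  # page ahead of this record: skip
```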
Edge Case 3: Multiple Log Records with Same LSN
This should never happen—LSNs are unique by definition. If a bug allows duplicate LSNs, recovery becomes undefined.
Edge Case 4: Page Flushed Between Checkpoint and Crash
The DPT might contain a page that was actually flushed after checkpoint:
```python
def flush_after_checkpoint_scenario():
    """
    Timeline:
    1. Checkpoint taken; Page P in DPT with recLSN=100
    2. Page P flushed (page_lsn=100 now on disk)
    3. Page P dirtied again (LSN=150)
    4. Crash before Page P is flushed again

    The DPT still has recLSN=100 (from the checkpoint),
    but the page on disk has page_lsn=100.
    How does redo handle this?
    """
    # During redo for LSN 100:
    page_on_disk_lsn = 100
    record_lsn = 100
    rec_lsn_in_dpt = 100
    # Level 1: Page in DPT? Yes
    # Level 2: record.lsn >= recLSN? 100 >= 100, yes, continue
    # Level 3: page_lsn >= record.lsn? 100 >= 100, yes, SKIP
    print("LSN 100: Correctly skipped (already on disk)")

    # During redo for LSN 150:
    record_lsn = 150
    # Level 1: Page in DPT? Yes
    # Level 2: record.lsn >= recLSN? 150 >= 100, yes, continue
    # Level 3: page_lsn >= record.lsn? 100 >= 150? NO, must REDO
    print("LSN 150: Correctly redone (modification was lost)")

    print("\nThe three-level filter handles this correctly!")
    print("A conservative DPT entry doesn't cause incorrect redo,")
    print("just potentially one extra page read.")

flush_after_checkpoint_scenario()
```

Levels 1 and 2 are conservative filters—they might let through records that don't need redo, but they never filter out records that DO need redo. Level 3 (the page LSN check) is the definitive filter. This design prioritizes correctness over perfect efficiency.
While we've focused on redo, LSNs serve several additional purposes in database systems:
1. Recovery Ordering
During undo phase, LSNs ensure operations are undone in reverse order. The prev_lsn chain in log records enables this.
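The prev_lsn chain can be sketched with a minimal in-memory log (hypothetical dict-based records; a real system reads them from the log file by LSN):

```python
# Each record points at the SAME transaction's previous record.
log = {
    100: {'txn_id': 'T1', 'prev_lsn': None, 'op': 'update A'},
    102: {'txn_id': 'T1', 'prev_lsn': 100,  'op': 'update B'},
    104: {'txn_id': 'T1', 'prev_lsn': 102,  'op': 'update C'},
}

def undo_chain(log, last_lsn):
    """Yield a transaction's operations in reverse order via prev_lsn."""
    lsn = last_lsn
    while lsn is not None:
        record = log[lsn]
        yield record['op']
        lsn = record['prev_lsn']

assert list(undo_chain(log, 104)) == ['update C', 'update B', 'update A']
```

Starting from the transaction's last LSN, the chain walks backward without scanning the log, which is why each log record carries its predecessor's LSN.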
2. Checkpoint Optimization
The checkpoint record's LSN gives recovery its starting point: analysis begins at the checkpoint, and redo begins at the smallest recLSN recorded there rather than at the beginning of the log.
3. Replication
LSNs identify positions in the log for replication: a replica records the last LSN it has applied, and the primary streams every log record with a higher LSN, much as redo would.
4. Point-in-Time Recovery
To recover to a specific moment, replay the log only up to a target LSN, then undo every transaction that had not committed by that point:
```python
class LSNApplications:
    """Demonstrates various uses of LSNs beyond redo."""

    def point_in_time_recovery(self, target_lsn):
        """Recover the database to its state at target_lsn."""
        # Phase 1: Analysis (same as normal recovery)
        analysis_result = self.run_analysis_phase()

        # Phase 2: Modified redo - stop at the target
        for record in self.log.scan_forward(analysis_result.redo_lsn):
            if record.lsn > target_lsn:
                break  # don't redo beyond the target
            if self.should_redo(record):
                self.apply_redo(record)

        # Phase 3: Undo transactions uncommitted at the target time,
        # i.e. any transaction without a COMMIT record before target_lsn.
        for txn_id in analysis_result.active_transactions:
            commit_lsn = self.find_commit_lsn(txn_id)
            if commit_lsn is None or commit_lsn > target_lsn:
                self.undo_transaction(txn_id)

    def streaming_replication_apply(self, replica_state):
        """Apply log records to a replica."""
        # The replica tracks last_applied_lsn
        last_lsn = replica_state.last_applied_lsn
        for record in self.log.scan_forward(last_lsn + 1):
            # Apply to the replica (similar to redo)
            self.apply_to_replica(record, replica_state)
            # Update tracking
            replica_state.last_applied_lsn = record.lsn
            # Handle commit: make changes visible
            if record.type == 'COMMIT':
                replica_state.make_visible(record.txn_id)

    def log_truncation_decision(self):
        """
        Determine which log records can be safely deleted.

        The log can be truncated up to:
        min(checkpoint_lsn, oldest_active_txn_start_lsn, replica_positions)
        """
        # Latest checkpoint
        checkpoint_lsn = self.get_latest_checkpoint_lsn()

        # Oldest active transaction (might still need to be undone)
        oldest_active = min(
            (txn.start_lsn for txn in self.get_active_transactions()),
            default=float('inf'),
        )

        # Farthest-behind replica
        replica_min = min(
            (r.last_applied_lsn for r in self.get_replicas()),
            default=float('inf'),
        )

        # Safe to truncate up to this point
        return min(checkpoint_lsn, oldest_active, replica_min)


# Demonstrate point-in-time recovery
def pitr_example():
    """Example: recover to just before an accidental DELETE."""
    print("Scenario: a user accidentally ran DELETE FROM users")
    print("The DELETE was at LSN 5000")
    print("")
    print("Point-in-time recovery to LSN 4999:")
    print("  1. Redo all operations up to LSN 4999")
    print("  2. Skip the DELETE at LSN 5000")
    print("  3. Undo any transactions that hadn't committed by LSN 4999")
    print("")
    print("Result: database restored to the moment before the DELETE!")

pitr_example()
```

The Log Sequence Number is a deceptively simple concept that enables the entire ARIES recovery mechanism. To consolidate: the LSN is a monotonically increasing identifier attached to every log record; each page's header stores the LSN of the last record applied to it; and the strict comparison page_lsn < record.lsn decides redo, making recovery both correct and idempotent.
What's Next:
The LSN comparison mechanism makes redo idempotent—safe to repeat. The next page explores idempotence in depth: what it means, why it's essential for crash recovery, and how ARIES ensures all redo operations are idempotent.
You now understand the complete LSN comparison mechanism that makes ARIES redo correct. This simple numeric comparison—is the page behind the log?—combined with careful LSN maintenance during normal operation, ensures that recovery always produces the exact pre-crash database state.