Loading learning content...
When your bank processes a transfer, when your favorite app saves your data, when an airline books your seat—chances are extremely high that ARIES is the recovery algorithm ensuring those transactions survive power failures, crashes, and hardware errors.
ARIES—Algorithms for Recovery and Isolation Exploiting Semantics—was developed at IBM Research in the late 1980s and early 1990s by C. Mohan and colleagues. It has become the de facto standard for database recovery, implemented with variations in Oracle, DB2, SQL Server, PostgreSQL, MySQL InnoDB, and virtually every production-grade DBMS.
By the end of this page, you will understand ARIES's design philosophy, its three recovery phases (Analysis, Redo, Undo), the key data structures it maintains, and why its 'repeating history' approach provides both flexibility and correctness.
ARIES is designed around three fundamental principles that distinguish it from simpler recovery algorithms:
1. Write-Ahead Logging (WAL):
Before any change to a database page is written to disk, the corresponding log record must first be written to stable storage.
This ensures recoverability regardless of when pages are flushed.
2. Repeating History During Redo:
During recovery, ARIES replays ALL logged actions (not just committed ones) to restore the database to its exact pre-crash state before undoing incomplete transactions.
This counterintuitive approach simplifies recovery logic and supports complex features like fine-grained locking.
3. Logging Changes During Undo:
When ARIES undoes an incomplete transaction's actions, it writes Compensation Log Records (CLRs) that record the undo operations.
This makes undo operations idempotent and ensures crash-safe recovery of recovery itself.
| Principle | Policy | Benefit |
|---|---|---|
| STEAL | Uncommitted changes may be written to disk | No transaction size limit; flexible buffer management |
| NO-FORCE | Committed pages needn't be forced to disk | Fast commits (force only log) |
| Repeat History | Redo all actions, then undo losers | Consistent recovery; supports physiological logging |
| CLRs for Undo | Log undo actions as they happen | Crash-safe undo; prevents repeated undo work |
| Fuzzy Checkpoints | Checkpoint without quiescing system | Continuous operation; bounded recovery |
Why STEAL/NO-FORCE?
This seemingly strange combination is actually optimal for general-purpose databases:
STEAL allows the buffer manager to evict any page when memory is needed, even if it contains uncommitted changes. This removes transaction size limits.
NO-FORCE means committed transactions don't wait for all their pages to be written—only the log records are forced. This dramatically reduces commit latency.
The cost is requiring both UNDO (for uncommitted changes that reached disk) and REDO (for committed changes that didn't). ARIES handles both efficiently.
ARIES prioritizes flexibility and performance during normal operation, accepting additional complexity during recovery. Since systems spend >99.9% of time in normal operation, this trade-off is overwhelmingly positive.
ARIES maintains several data structures that enable efficient recovery. Understanding these is essential to understanding how ARIES works.
1. Log Sequence Number (LSN):
Every log record receives a unique, monotonically increasing LSN. LSNs serve as the universal timeline for the entire system:
pageLSN)2. Transaction Table:
Maintains the state of all active transactions:
12345678910111213141516171819202122232425262728
// Transaction Table EntrySTRUCT TransactionTableEntry: txn_id : TransactionID status : {RUNNING, COMMITTED, ABORTED} lastLSN : LSN // Most recent log record for this txn undoNextLSN : LSN // Next record to undo (during recovery) // Dirty Page Table Entry STRUCT DirtyPageTableEntry: page_id : PageID recLSN : LSN // LSN of first log record that dirtied page // (since last checkpoint or flush) // Log Record StructureSTRUCT LogRecord: LSN : LSN // This record's unique sequence number prevLSN : LSN // Previous LSN for same transaction txn_id : TransactionID type : {BEGIN, UPDATE, COMMIT, ABORT, CLR, END, CHECKPOINT} page_id : PageID // For UPDATE and CLR undo_next_LSN : LSN // For CLR: next record to undo before_image : BYTES // For UPDATE: value before change after_image : BYTES // For UPDATE: value after change // Page Header (maintained on each data page)STRUCT PageHeader: page_id : PageID pageLSN : LSN // LSN of last update to this page3. Dirty Page Table (DPT):
Tracks pages that have been modified in the buffer pool but may not have been flushed to disk:
The recLSN is critical: during REDO, we know we can skip log records with LSN < recLSN for a given page (the page was clean at that point).
4. prevLSN Chain:
Each log record contains a prevLSN pointing to the previous log record for the same transaction. This creates per-transaction chains enabling efficient traversal:
T1: BEGIN(100) <-- UPDATE(105) <-- UPDATE(112) <-- COMMIT(118)
^ ^ ^ ^
| | | |
prevLSN=NULL prevLSN=100 prevLSN=105 prevLSN=112
During UNDO, we follow these chains backwards to undo transaction operations in reverse order.
LSNs unify time across the entire system. By comparing a page's pageLSN with a log record's LSN, we instantly know if that record's update is already reflected in the page. This comparison is the foundation of ARIES REDO efficiency.
ARIES recovery proceeds through three distinct phases, each with a specific purpose:
Phase 1: ANALYSIS
Scans the log from the last checkpoint to the end, reconstructing:
Phase 2: REDO
Scans forward through the log from the minimum recLSN in the dirty page table, replaying ALL actions to restore the database to its exact pre-crash state—including uncommitted changes.
Phase 3: UNDO
Scans backward, undoing the actions of transactions that were active at crash time and never committed.
| Phase | Direction | Purpose | Output |
|---|---|---|---|
| Analysis | Forward from checkpoint | Reconstruct system state at crash | Transaction table, Dirty page table, Redo start point |
| Redo | Forward from min(recLSN) | Repeat history to restore pre-crash state | Database restored to crash-time state |
| Undo | Backward (per-transaction) | Remove effects of incomplete transactions | Consistent database with only committed data |
Why Three Phases?
The separation serves important purposes:
Analysis avoids redundant work—we learn exactly what needs REDO and UNDO before doing any I/O.
REDO restoring exact pre-crash state enables:
UNDO as a separate phase allows:
ARIES redoes ALL operations—even those from transactions that will be undone. This seems wasteful but is essential for correctness with physiological logging. If we skipped uncommitted changes during REDO, the page state wouldn't match what the UNDO log records expect.
The Analysis phase reconstructs two critical tables by scanning the log from the last checkpoint to the end. This phase does no I/O to data pages—it only reads the log.
Starting Point:
Processing Each Log Record:
As we scan forward through the log, we update our tables based on each record type:
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758
PROCEDURE Analysis_Phase(): // Start from last checkpoint checkpoint := find_last_checkpoint_record() // Initialize tables from checkpoint transaction_table := checkpoint.active_transactions dirty_page_table := checkpoint.dirty_pages // Scan log from checkpoint to end lsn := checkpoint.LSN WHILE more_log_records(): record := read_log_record(lsn) CASE record.type OF: BEGIN: // New transaction started transaction_table.add({ txn_id: record.txn_id, status: RUNNING, lastLSN: record.LSN, undoNextLSN: NULL }) UPDATE, CLR: // Transaction modified a page transaction_table[record.txn_id].lastLSN := record.LSN // Track dirty page IF record.page_id NOT IN dirty_page_table THEN dirty_page_table.add({ page_id: record.page_id, recLSN: record.LSN // First modification since flush }) END IF COMMIT: // Transaction committed transaction_table[record.txn_id].status := COMMITTED ABORT: // Transaction aborted (may require undo) transaction_table[record.txn_id].status := ABORTED END: // Transaction fully complete (committed or aborted + undone) transaction_table.remove(record.txn_id) END CASE lsn := next_LSN() END WHILE // Analysis complete. Returns: // - transaction_table: All transactions requiring UNDO // - dirty_page_table: Defines REDO starting point RETURN (transaction_table, dirty_page_table)After Analysis:
The transaction table now contains:
The dirty page table tells us:
Analysis Example:
Checkpoint saved: T1 active (lastLSN=100), pages P1(recLSN=95), P2(recLSN=98)
Log after checkpoint:
LSN=110: [UPDATE, T1, P3, ...] // T1 updates P3
LSN=112: [BEGIN, T2] // T2 starts
LSN=115: [UPDATE, T2, P1, ...] // T2 updates P1
LSN=118: [COMMIT, T1] // T1 commits
LSN=120: [UPDATE, T2, P2, ...] // T2 updates P2
--- CRASH ---
After Analysis:
The REDO phase restores the database to its exact pre-crash state by replaying all logged operations. This is ARIES's 'repeat history' paradigm.
Starting Point: The minimum recLSN from the dirty page table (the earliest point where a page might need updates)
Key Principle: REDO only if necessary. For each UPDATE or CLR record, check if the update is already reflected in the page.
123456789101112131415161718192021222324252627282930313233343536373839
PROCEDURE Redo_Phase(dirty_page_table): // Find starting point for REDO redo_lsn := MIN(recLSN for each page in dirty_page_table) // Scan forward from redo_lsn to end of log FOR EACH log_record WHERE record.LSN >= redo_lsn: // Only process UPDATE and CLR records (they modify pages) IF record.type NOT IN {UPDATE, CLR} THEN CONTINUE END IF // Optimization 1: Skip if page not in dirty page table IF record.page_id NOT IN dirty_page_table THEN CONTINUE // Page wasn't dirty, no redo needed END IF // Optimization 2: Skip if record is before page became dirty IF record.LSN < dirty_page_table[record.page_id].recLSN THEN CONTINUE // Record preceded the dirty state END IF // Now we must examine the actual page page := fetch_page(record.page_id) // Key check: Is this update already on the page? IF page.pageLSN >= record.LSN THEN CONTINUE // Already applied (page was written after this log record) END IF // Apply the redo operation apply_redo(record, page) // Update page's LSN to reflect this operation page.pageLSN := record.LSN END FOR // Database now reflects pre-crash state (including uncommitted changes)The pageLSN Check:
The comparison page.pageLSN >= record.LSN is the heart of REDO efficiency:
This check allows ARIES to handle:
Idempotency:
Each REDO operation is idempotent—applying it multiple times has the same effect as applying it once. Combined with the pageLSN check, this means:
In practice, REDO skips most log records: pages flushed during checkpointing already have updates; only pages dirty at crash need work. Hot pages (frequently updated) may need more redo, but they're also likely in the buffer pool from recent activity.
The UNDO phase removes the effects of transactions that were active at crash time. After REDO, the database contains both committed and uncommitted changes—UNDO restores consistency by reversing the uncommitted ones.
Key Concept: Compensation Log Records (CLRs)
When ARIES undoes an operation, it writes a CLR that:
undoNextLSN pointing to the next record to undoCLRs ensure that if we crash during UNDO, we don't redo the same undo work.
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253
PROCEDURE Undo_Phase(transaction_table): // Collect all LSNs that need to be undone // For each active transaction, we'll undo from its lastLSN backward undo_list := new MaxHeap() // Process highest LSN first FOR EACH txn IN transaction_table WHERE txn.status != COMMITTED: undo_list.add(txn.lastLSN) END FOR // Process undo list in descending LSN order WHILE undo_list.not_empty(): lsn := undo_list.pop_max() record := read_log_record(lsn) IF record.type = CLR THEN // CLR's undoNextLSN tells us what to undo next IF record.undoNextLSN != NULL THEN undo_list.add(record.undoNextLSN) END IF // CLRs are never undone themselves ELSE IF record.type = UPDATE THEN // Undo this update page := fetch_page(record.page_id) // Apply the undo using before-image apply_undo(record, page) // Write CLR to record the undo clr := { type: CLR, txn_id: record.txn_id, page_id: record.page_id, LSN: next_LSN(), undoNextLSN: record.prevLSN, // Points to next record to undo redo_info: record.before_image // CLR redoes the undo } write_log(clr) page.pageLSN := clr.LSN // Queue the previous operation for undo IF record.prevLSN != NULL THEN undo_list.add(record.prevLSN) END IF END IF END WHILE // Write END records for all undone transactions FOR EACH txn IN transaction_table WHERE txn.status != COMMITTED: write_log({type: END, txn_id: txn.txn_id}) END FORThe undoNextLSN Chain:
CLRs contain an undoNextLSN field that creates a 'skip chain' for efficient traversal:
Original: UPDATE(100) <-- UPDATE(110) <-- UPDATE(120)
^ | |
| v |
+---- CLR(125) ----> prevLSN=100
| undoNextLSN=100
v
CLR(130) ----> undoNextLSN=100
|
v
Undo of 100: CLR(135) ----> undoNextLSN=NULL
If we crash after writing CLR(125), recovery's UNDO phase:
Why Process Highest LSN First?
Using a max-heap ensures we undo operations in reverse chronological order:
CLRs make UNDO restartable. If we crash during UNDO, recovery's REDO phase replays the CLRs (applying the undo operations), then UNDO continues from where it left off using undoNextLSN pointers.
Checkpoints bound recovery time by saving system state periodically. ARIES uses fuzzy checkpoints that don't require stopping the system.
The Checkpointing Process:
BEGIN_CHECKPOINT recordEND_CHECKPOINT record with pointer to BEGINNo quiescing is required. Transactions continue throughout.
Why 'Fuzzy'?
Between BEGIN_CHECKPOINT and END_CHECKPOINT:
The checkpoint captures a 'fuzzy' snapshot—not a precise moment, but a valid starting point for recovery. Analysis will correct any discrepancies by scanning beyond the checkpoint.
1234567891011121314151617181920212223242526
PROCEDURE Fuzzy_Checkpoint(): // Step 1: Write begin marker begin_lsn := write_log({type: BEGIN_CHECKPOINT}) // Step 2: Capture current state (snapshot, may be slightly stale) txn_table_copy := copy(transaction_table) dpt_copy := copy(dirty_page_table) // Step 3: Write state to log write_log({ type: END_CHECKPOINT, begin_lsn: begin_lsn, transaction_table: txn_table_copy, dirty_page_table: dpt_copy }) // Step 4: Update master record to point to new checkpoint // This is the last action - if we crash before this, // the previous checkpoint is still valid update_master_record(begin_lsn) // No transactions were blocked during this process // The master record stores the LSN of the last valid checkpoint// On restart, recovery reads the master record firstCheckpoint Impact on Recovery:
| Without Checkpoints | With Checkpoints |
|---|---|
| Analysis scans entire log | Analysis starts at checkpoint |
| REDO starts at log beginning | REDO starts at min(recLSN) |
| UNDO processes all uncommitted | UNDO unchanged |
| Recovery O(log size) | Recovery O(checkpoint interval) |
Checkpoint Frequency Trade-off:
Some systems combine checkpointing with background dirty page writeback. Writing dirty pages to disk allows removing them from the DPT, which advances the REDO starting point and reduces recovery work. This is separate from the checkpoint record itself.
ARIES represents the pinnacle of database recovery algorithm design. Let's consolidate the key concepts:
What's Next:
In the final page of this module, we'll directly compare all the recovery algorithms we've studied—deferred update, immediate update, shadow paging, and ARIES. We'll create a decision framework for choosing among them and discuss how these concepts apply in modern distributed systems.
You now understand ARIES—the industry-standard recovery algorithm. You can explain its three phases, the role of LSNs and data structures, the 'repeat history' philosophy, and how fuzzy checkpoints bound recovery time. This knowledge applies directly to understanding any production DBMS's recovery behavior.