Throughout this module, we've explored the hierarchy of schedule properties: recoverable, cascadeless, and strict schedules. Each property adds constraints on how transactions can interact, with profound implications for database recovery.
Now we connect the dots. When a database system crashes in the middle of executing transactions, when a transaction encounters an error and must abort, when the system must restore itself to a consistent state—these are the moments when recoverability properties determine success or failure.
This page examines how real database recovery systems leverage these schedule properties, the algorithms they use, and the trade-offs they make. Understanding this connection transforms abstract schedule properties into concrete engineering decisions.
By the end of this page, you will understand:
- how schedule properties affect recovery algorithms
- the role of Write-Ahead Logging (WAL) in recovery
- undo and redo recovery operations
- how ARIES recovery handles different schedule types
- practical guidelines for choosing schedule properties in system design
Database recovery addresses one fundamental question: After a failure, how do we restore the database to a consistent state where all committed transaction effects are preserved and all uncommitted transaction effects are removed?
This involves two complementary operations:
UNDO (Rollback): remove the effects of every transaction that had not committed at the time of failure, typically by restoring before-images recorded in the log.
REDO (Roll-forward): reapply the effects of committed transactions whose updates had not yet reached disk, using after-images recorded in the log.
Schedule properties directly impact how these operations work:
| Schedule Property | UNDO Complexity | REDO Complexity | Recovery Algorithm Impact |
|---|---|---|---|
| Non-recoverable | Impossible | Impossible | Cannot recover—committed transactions depend on aborted data |
| Recoverable only | Complex (cascades) | Standard | Must identify and abort all cascade victims |
| Cascadeless | Moderate | Standard | No cascade tracking, but write-write dependencies may exist |
| Strict | Simple | Standard | Before-images are committed; straightforward undo |
Unlike other database features that can be traded off for performance, recovery correctness is absolute. A database that cannot correctly recover from failures is fundamentally unusable for any application requiring data durability.
Write-Ahead Logging (WAL) is the foundation of recovery in virtually all modern databases. The WAL protocol ensures that recovery information is always available:
The WAL Rule:
Before any data modification is written to the database, the corresponding log record MUST be written to stable storage (log file on disk).
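The rule amounts to a check in the buffer manager: before a dirty page is flushed, the log must be durable at least up to that page's last LSN. A minimal sketch (the `LogManager`/`BufferManager` classes and the `flushed_lsn` field are illustrative names, not any specific system's API):

```python
class LogManager:
    """Tracks which log records have reached stable storage."""
    def __init__(self):
        self.records = []      # in-memory log tail (LSNs)
        self.flushed_lsn = 0   # highest LSN durably on disk

    def append(self, lsn: int) -> None:
        self.records.append(lsn)

    def flush_to(self, lsn: int) -> None:
        # Force the log to disk up to and including `lsn`
        self.flushed_lsn = max(self.flushed_lsn, lsn)


class BufferManager:
    """Enforces the WAL rule when writing pages back to disk."""
    def __init__(self, log_mgr: LogManager):
        self.log_mgr = log_mgr

    def flush_page(self, page_id: str, page_lsn: int) -> None:
        # WAL rule: the log record describing the page's latest
        # update must be durable BEFORE the page itself is written.
        if self.log_mgr.flushed_lsn < page_lsn:
            self.log_mgr.flush_to(page_lsn)
        assert self.log_mgr.flushed_lsn >= page_lsn
        print(f"Page {page_id} flushed (log durable through LSN "
              f"{self.log_mgr.flushed_lsn})")


log_mgr = LogManager()
log_mgr.append(101)
log_mgr.append(102)
buf = BufferManager(log_mgr)
buf.flush_page("P1", page_lsn=102)  # forces a log flush to LSN 102 first
```

Because the check runs on every page write, no data modification can ever reach disk ahead of its log record.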
This ensures that if a failure occurs, the log contains all the information needed to recover:
Log Record Structure:
```python
from dataclasses import dataclass
from typing import Any, Optional
from enum import Enum

class LogRecordType(Enum):
    UPDATE = "UPDATE"          # Data modification
    COMMIT = "COMMIT"          # Transaction commit
    ABORT = "ABORT"            # Transaction abort
    CHECKPOINT = "CHECKPOINT"  # Consistency point
    BEGIN = "BEGIN"            # Transaction start
    CLR = "CLR"                # Compensation Log Record (for undo)

@dataclass
class WALLogRecord:
    """
    Standard Write-Ahead Log record structure.

    The log contains all information needed for recovery:
    - Before-images for UNDO
    - After-images for REDO
    - Transaction state for determining what to do
    """
    lsn: int                    # Log Sequence Number (unique, increasing)
    txn_id: str                 # Transaction ID
    record_type: LogRecordType  # Type of log record
    # For UPDATE records:
    page_id: Optional[str] = None       # Which page was modified
    offset: Optional[int] = None        # Position within page
    before_image: Optional[Any] = None  # Value before update (for UNDO)
    after_image: Optional[Any] = None   # Value after update (for REDO)
    # For linking:
    prev_lsn: Optional[int] = None       # Previous log record for this transaction
    undo_next_lsn: Optional[int] = None  # For CLRs: next record to undo

    def can_undo(self) -> bool:
        """Check if this record can be undone."""
        return self.record_type == LogRecordType.UPDATE and self.before_image is not None

    def can_redo(self) -> bool:
        """Check if this record can be redone."""
        return self.record_type == LogRecordType.UPDATE and self.after_image is not None

# Example log sequence demonstrating WAL
print("=" * 70)
print("WRITE-AHEAD LOG EXAMPLE")
print("=" * 70)

log = [
    WALLogRecord(100, "T1", LogRecordType.BEGIN),
    WALLogRecord(101, "T1", LogRecordType.UPDATE, "P1", 0,
                 before_image="A=50", after_image="A=100"),
    WALLogRecord(102, "T2", LogRecordType.BEGIN),
    WALLogRecord(103, "T2", LogRecordType.UPDATE, "P2", 0,
                 before_image="B=200", after_image="B=150"),
    WALLogRecord(104, "T1", LogRecordType.UPDATE, "P1", 4,
                 before_image="C=30", after_image="C=80"),
    WALLogRecord(105, "T1", LogRecordType.COMMIT),  # T1 committed
    # ---- CRASH OCCURS HERE ----
]

print("\nLog records at crash time:")
for record in log:
    if record.record_type == LogRecordType.UPDATE:
        print(f"  LSN {record.lsn}: {record.txn_id} {record.record_type.value} "
              f"{record.page_id}:{record.before_image} → {record.after_image}")
    else:
        print(f"  LSN {record.lsn}: {record.txn_id} {record.record_type.value}")

print("\nRecovery analysis:")
print("  T1: COMMITTED (LSN 105) - REDO its updates if needed")
print("  T2: No COMMIT record - UNDO its updates (restore before-images)")
```

Why schedule properties matter for WAL:
| Property | Before-Image Reliability | Recovery Requirement |
|---|---|---|
| Recoverable | May contain uncommitted data | Must track dependencies |
| Cascadeless | May contain uncommitted writes | Simpler but still need care |
| Strict | Always committed data | Simple, independent undo |
With strict schedules, before-images in the log are always committed values. This means UNDO can simply restore the before-image without checking if it came from an uncommitted transaction.
UNDO recovery reverses the effects of uncommitted transactions. The complexity of UNDO directly depends on schedule properties:
UNDO in Non-Recoverable Schedules:
UNDO in Recoverable (not cascadeless) Schedules:
```python
from dataclasses import dataclass, field
from typing import List, Dict, Set, Any, Optional
from enum import Enum

class TxnState(Enum):
    ACTIVE = "ACTIVE"
    COMMITTED = "COMMITTED"
    ABORTED = "ABORTED"

@dataclass
class UndoRecoveryComplex:
    """
    UNDO recovery for recoverable (but not cascadeless) schedules.
    Must handle cascading aborts during recovery.
    """
    committed_before_crash: Set[str] = field(default_factory=set)
    active_at_crash: Set[str] = field(default_factory=set)
    read_from: Dict[str, Set[str]] = field(default_factory=dict)  # txn -> set of txns it read from

    def identify_cascade_victims(self) -> Set[str]:
        """
        In a recoverable (not cascadeless) schedule, we must find all
        transactions that read from transactions we're going to undo.

        This is complex because:
        1. Start with active (uncommitted) transactions
        2. Find all transactions that read from them
        3. If any of those read from another active one, add them
        4. Repeat until no new victims found
        """
        to_abort = self.active_at_crash.copy()
        changed = True
        iterations = 0

        while changed:
            changed = False
            iterations += 1
            new_victims = set()

            for txn, read_sources in self.read_from.items():
                if txn not in to_abort and txn not in self.committed_before_crash:
                    # Check if this transaction read from any transaction we're aborting
                    if read_sources & to_abort:
                        new_victims.add(txn)
                        changed = True

            to_abort.update(new_victims)
            print(f"  Iteration {iterations}: Found {len(new_victims)} new cascade victims")

        return to_abort

    def perform_undo(self, to_abort: Set[str]) -> None:
        """
        Undo all transactions in the abort set.
        Order matters here for correctness!
        """
        print(f"\n  Transactions to UNDO: {to_abort}")
        print("  Must undo in reverse chronological order of their writes")
        print("  Must also consider that some transactions may have read from each other")
        print("  COMPLEX: Recovery manager needs full dependency information")

@dataclass
class UndoRecoverySimple:
    """
    UNDO recovery for strict schedules.
    Much simpler because before-images are reliable.
    """
    active_at_crash: Set[str] = field(default_factory=set)

    def perform_undo(self, log_records: List[Any]) -> None:
        """
        Simple UNDO for strict schedules:
        1. Find all active transactions
        2. For each, restore before-images
        3. No need to track dependencies—before-images are committed values
        """
        print(f"\n  Transactions to UNDO: {self.active_at_crash}")
        print("  SIMPLE: Just restore before-images for each")
        print("  No cascade analysis needed")
        print("  Order doesn't matter (before-images are independent)")

# Comparison demonstration
print("=" * 70)
print("UNDO RECOVERY COMPLEXITY COMPARISON")
print("=" * 70)

# Complex case (recoverable but not cascadeless)
print("\n--- Recoverable (with dirty reads) Schedule ---")
complex_recovery = UndoRecoveryComplex(
    committed_before_crash={"T4", "T5"},
    active_at_crash={"T1", "T3"},
    read_from={
        "T2": {"T1"},  # T2 read from T1 (dirty read)
        "T3": {"T2"},  # T3 read from T2
        "T4": set(),   # T4 read from committed data only
        "T5": {"T4"},  # T5 read from T4 (committed)
    })
cascade_victims = complex_recovery.identify_cascade_victims()
complex_recovery.perform_undo(cascade_victims)

# Simple case (strict schedule)
print("\n--- Strict Schedule ---")
simple_recovery = UndoRecoverySimple(active_at_crash={"T1", "T3"})
simple_recovery.perform_undo([])

print("\n" + "=" * 70)
print("KEY INSIGHT: Strict schedules make UNDO trivially simple.")
print("=" * 70)
```

For non-cascadeless schedules, the recovery system must identify cascading aborts during recovery. This requires either maintaining dependency graphs during execution or re-analyzing the log to determine which transactions read uncommitted data. Both approaches add significant complexity.
REDO recovery ensures that committed transaction effects are durable—even if they weren't written to disk before the crash. Interestingly, REDO is largely independent of schedule properties.
Why REDO is simpler:
- Only committed transactions are redone, and recoverability guarantees that committed transactions never depend on uncommitted data.
- REDO is a single forward pass through the log, applying after-images in order.
- Page LSNs tell the recovery manager whether an update already reached disk, so no dependency analysis is needed.
The REDO Process:
```python
from dataclasses import dataclass, field
from typing import List, Dict, Any, Set
from enum import Enum

class RecordType(Enum):
    UPDATE = "UPDATE"
    COMMIT = "COMMIT"
    BEGIN = "BEGIN"

@dataclass
class LogRecord:
    lsn: int
    txn_id: str
    record_type: RecordType
    page_id: str = ""
    after_image: Any = None
    page_lsn: int = 0  # LSN of last update applied to page

@dataclass
class Page:
    page_id: str
    data: Dict[str, Any]
    page_lsn: int = 0  # Last LSN applied to this page

class RedoRecovery:
    """
    REDO recovery is relatively simple regardless of schedule type:

    For each log record of a committed transaction:
        If the page's LSN < log record's LSN:
            Apply the after-image (redo the update)

    The page_lsn comparison ensures we don't redo updates
    that were already written to disk.
    """

    def __init__(self, log: List[LogRecord], pages: Dict[str, Page]):
        self.log = log
        self.pages = pages

    def identify_committed_transactions(self) -> Set[str]:
        """Find all transactions that committed before crash."""
        committed = set()
        for record in self.log:
            if record.record_type == RecordType.COMMIT:
                committed.add(record.txn_id)
        return committed

    def redo_pass(self) -> None:
        """
        Redo all committed transaction updates.

        This is a single forward pass through the log.
        For each UPDATE by a committed transaction,
        redo if the page doesn't have this update yet.
        """
        committed = self.identify_committed_transactions()
        print(f"Committed transactions: {committed}")
        print("\nStarting REDO pass (forward through log)...")

        redo_count = 0
        skip_count = 0

        for record in self.log:
            if record.record_type != RecordType.UPDATE:
                continue
            if record.txn_id not in committed:
                # Skip uncommitted transaction updates
                continue

            page = self.pages.get(record.page_id)
            if page is None:
                # Page might not be loaded; would need to fetch from disk
                print(f"  LSN {record.lsn}: Page {record.page_id} not in buffer")
                continue

            if page.page_lsn < record.lsn:
                # This update wasn't applied to disk; redo it
                print(f"  LSN {record.lsn}: REDO {record.txn_id}'s update to {record.page_id}")
                print(f"    Applying: {record.after_image}")
                page.data = record.after_image
                page.page_lsn = record.lsn
                redo_count += 1
            else:
                # Update already on disk; skip
                print(f"  LSN {record.lsn}: SKIP (page_lsn {page.page_lsn} >= {record.lsn})")
                skip_count += 1

        print(f"\nREDO complete: {redo_count} redone, {skip_count} skipped")

# Demonstration
print("=" * 70)
print("REDO RECOVERY DEMONSTRATION")
print("=" * 70)

log = [
    LogRecord(100, "T1", RecordType.BEGIN),
    LogRecord(101, "T1", RecordType.UPDATE, "P1", after_image={"A": 100}),
    LogRecord(102, "T2", RecordType.BEGIN),
    LogRecord(103, "T1", RecordType.UPDATE, "P2", after_image={"B": 200}),
    LogRecord(104, "T2", RecordType.UPDATE, "P3", after_image={"C": 300}),
    LogRecord(105, "T1", RecordType.COMMIT),  # T1 committed
    LogRecord(106, "T2", RecordType.UPDATE, "P1", after_image={"A": 150}),
    # CRASH - T2 never committed
]

# Simulate that P1 has LSN 101 on disk, P2 has LSN 0 (update wasn't flushed)
pages = {
    "P1": Page("P1", {"A": 100}, page_lsn=101),  # T1's first update was flushed
    "P2": Page("P2", {"B": 0}, page_lsn=0),      # T1's second update wasn't flushed
    "P3": Page("P3", {"C": 0}, page_lsn=0),      # T2's update wasn't flushed
}

print("\nInitial page states:")
for pid, page in pages.items():
    print(f"  {pid}: {page.data}, page_lsn={page.page_lsn}")

print()
recovery = RedoRecovery(log, pages)
recovery.redo_pass()

print("\nFinal page states:")
for pid, page in pages.items():
    print(f"  {pid}: {page.data}, page_lsn={page.page_lsn}")

print("\nNote: T2's update to P3 was NOT redone (T2 didn't commit)")
```

Unlike UNDO, REDO recovery works the same way regardless of schedule type. The key insight is that we only REDO committed transactions, and recoverability guarantees that committed transactions don't depend on uncommitted data. This makes REDO straightforward.
ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) is the most widely used recovery algorithm in commercial databases. It is used by IBM DB2 and Microsoft SQL Server, and it influenced the recovery designs of PostgreSQL and MySQL's InnoDB.
ARIES Key Principles:
- Write-Ahead Logging: every change is logged to stable storage before the corresponding data page is written.
- Repeating history during REDO: recovery first replays all logged updates to restore the exact state at the time of the crash.
- Logging changes during UNDO: rollback actions are themselves logged as Compensation Log Records (CLRs), so recovery can make progress even if it is interrupted by another crash.
Three Phases of ARIES Recovery:
1. Analysis: scan the log from the last checkpoint to determine which transactions were active and which pages were dirty at the crash.
2. REDO: replay history forward from the earliest required LSN to restore the pre-crash state.
3. UNDO: roll back the transactions that were still active at the crash, writing CLRs as it goes.
How Schedule Properties Affect ARIES:
| Phase | Recoverable | Cascadeless | Strict |
|---|---|---|---|
| Analysis | Must track dependencies | Standard | Standard |
| REDO | Same (repeat history) | Same | Same |
| UNDO | Complex (cascades) | Moderate | Simple |
ARIES assumes strict schedules in its standard form. For non-strict schedules, the UNDO phase would need modification to handle dependencies.
Why databases use strict schedules with ARIES:
- Before-images in the log are always committed values, so the UNDO phase can restore them independently.
- No cascade analysis is needed: only the transactions active at the crash are rolled back.
- Strict two-phase locking, which most systems already use for serializability, yields strict schedules as a by-product.
When your PostgreSQL or SQL Server database recovers after a crash, it's running a variant of ARIES. The database reads the WAL, determines what transactions were in progress, redoes committed work, and undoes uncommitted work—all leveraging the guarantees provided by strict schedules.
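The three-pass structure can be sketched at a toy granularity (a simplified outline in the spirit of ARIES, not the full algorithm: checkpoints, the dirty page table, and CLRs are omitted, and the record layout is an assumption for this sketch):

```python
# Each log record is (lsn, txn, op, key, before, after);
# op is "BEGIN", "UPDATE", or "COMMIT".

def recover(log, db):
    # --- Phase 1: Analysis ---
    # Determine which transactions committed and which were active.
    committed, active = set(), set()
    for lsn, txn, op, *rest in log:
        if op == "BEGIN":
            active.add(txn)
        elif op == "COMMIT":
            active.discard(txn)
            committed.add(txn)

    # --- Phase 2: REDO (repeat history, forward pass) ---
    # Reapply every logged update to restore the pre-crash state.
    for lsn, txn, op, *rest in log:
        if op == "UPDATE":
            key, before, after = rest
            db[key] = after

    # --- Phase 3: UNDO (backward pass) ---
    # Roll back transactions still active at the crash by restoring
    # before-images. With strict schedules these before-images are
    # committed values, so restoring them is always safe.
    for lsn, txn, op, *rest in reversed(log):
        if op == "UPDATE" and txn in active:
            key, before, after = rest
            db[key] = before

    return db

log = [
    (1, "T1", "BEGIN"),
    (2, "T1", "UPDATE", "A", 50, 100),
    (3, "T2", "BEGIN"),
    (4, "T2", "UPDATE", "B", 200, 150),
    (5, "T1", "COMMIT"),
    # crash: T2 never committed
]
print(recover(log, {}))  # T1's update is kept; T2's is rolled back
```

Running the sketch leaves A at T1's committed value (100) and restores B to its before-image (200), mirroring the "redo committed, undo uncommitted" behavior described above.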
Checkpoints are periodic snapshots that reduce recovery time. Without checkpoints, recovery would need to process the entire log from the beginning of time. Checkpoints provide a "safe starting point."
Types of Checkpoints:
1. Consistent Checkpoint (Quiesce): stop admitting new transactions, wait for all active transactions to finish, flush every dirty page to disk, then write the checkpoint record.
Simple but causes a pause in processing.
2. Fuzzy Checkpoint: record the set of active transactions and the dirty page table without halting transaction processing; dirty pages are flushed to disk in the background.
Used by most production systems for minimal interruption.
| Checkpoint Type | Processing Impact | Recovery Start Point | Log Needed |
|---|---|---|---|
| None | None | Beginning of log | Entire log |
| Consistent | Pause during checkpoint | Last checkpoint | Log since checkpoint |
| Fuzzy | Minimal pause | Oldest dirty page or active txn | Log since oldest active |
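A fuzzy checkpoint can be sketched as a snapshot of recovery metadata rather than of the data itself (simplified; the class and field names are assumptions for this sketch):

```python
from dataclasses import dataclass
from typing import Dict, Set

@dataclass
class FuzzyCheckpoint:
    """
    A fuzzy checkpoint records recovery *metadata*, not data pages:
    the active transactions and the dirty page table (each dirty
    page's earliest unflushed update, its recovery LSN). Normal
    processing continues while the record is written.
    """
    checkpoint_lsn: int
    active_txns: Set[str]
    dirty_pages: Dict[str, int]  # page_id -> recovery LSN (rec_lsn)

    def redo_start_lsn(self) -> int:
        # REDO must start at the oldest update not yet on disk;
        # everything before that point is already durable.
        if not self.dirty_pages:
            return self.checkpoint_lsn
        return min(self.dirty_pages.values())

ckpt = FuzzyCheckpoint(
    checkpoint_lsn=500,
    active_txns={"T7", "T9"},
    dirty_pages={"P1": 420, "P4": 455, "P9": 480},
)
print(f"REDO can begin at LSN {ckpt.redo_start_lsn()}")  # 420, not the log start
```

This is why the table above lists the fuzzy checkpoint's recovery start point as "oldest dirty page or active txn" rather than the checkpoint record itself.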
Recovery time formula:
Recovery Time ≈ Analysis Time + REDO Time + UNDO Time
Analysis Time ∝ Log records since checkpoint
REDO Time ∝ Dirty pages × Log records
UNDO Time ∝ Active transactions × Their log records
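As a rough back-of-the-envelope illustration of the formula, one can plug in per-record costs (all counts and the per-record cost below are made-up numbers for the example, and REDO is simplified to a record count):

```python
def estimate_recovery_time(log_records_since_ckpt: int,
                           redo_records: int,
                           undo_records: int,
                           ms_per_record: float = 0.01) -> float:
    """Rough recovery-time estimate (milliseconds) per the formula above."""
    analysis = log_records_since_ckpt * ms_per_record  # scan since checkpoint
    redo = redo_records * ms_per_record                # updates to reapply
    undo = undo_records * ms_per_record                # active txns' records
    return analysis + redo + undo

# 100k records since checkpoint, 40k needing redo, 5k needing undo
total_ms = estimate_recovery_time(100_000, 40_000, 5_000)
print(f"Estimated recovery time: {total_ms / 1000:.2f} s")
```

Halving the checkpoint interval roughly halves the first two terms, which is the lever tuned against an RTO in the next paragraphs.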
For strict schedules, UNDO is simpler: only the transactions active at the crash are rolled back, their before-images are committed values, and no cascade analysis is required, so UNDO time depends only on those transactions' own log records.
Checkpoint frequency trade-off: frequent checkpoints shorten recovery but add I/O overhead during normal processing; infrequent checkpoints minimize that overhead but lengthen recovery.
Production systems often have Recovery Time Objectives (RTOs). Checkpoint frequency is tuned to meet these SLAs. For a 1-minute RTO, you might checkpoint every 30 seconds. Strict schedules help because UNDO phase is faster and more predictable.
Understanding recoverability properties helps make informed decisions about database configuration and application design:
Guideline 1: Default to Strict Schedules
Unless you have specific requirements that demand otherwise:
- Use your database's default isolation level (typically READ COMMITTED or stronger), which holds write locks until commit and yields strict behavior for writes.
- Avoid READ UNCOMMITTED unless you can tolerate both dirty reads and the complex recovery that comes with them.
Guideline 2: Understand Your Isolation Level
Know what your isolation level provides:
| Use Case | Recommended Level | Recovery Implications |
|---|---|---|
| Financial transactions | SERIALIZABLE | Strictest guarantees, simple recovery |
| General OLTP | READ COMMITTED | Cascadeless + Strict for writes, fast recovery |
| Reporting queries | READ COMMITTED SNAPSHOT | Snapshot isolation, no blocking, simple |
| Approximate analytics | READ UNCOMMITTED (rare) | Not cascadeless, complex recovery if crashes occur |
It's tempting to weaken isolation for performance. Remember: you're trading simple, predictable recovery for marginal throughput gains. When the crash happens—and it will—you'll want the simple recovery path.
We've completed our exploration of recoverability in transaction processing. This final summary brings together all the concepts from the module:
The Big Picture:
Recoverability is not an abstract theoretical concept—it's the foundation that makes databases reliable. When you use a database with confidence that your committed data will survive crashes, you're relying on:
- schedule properties (recoverable, cascadeless, strict) that constrain how transactions interact;
- Write-Ahead Logging, which records every change before it reaches the data files;
- recovery algorithms such as ARIES that redo committed work and undo uncommitted work.
Understanding these concepts helps you choose appropriate isolation levels, design robust applications, and debug issues when things go wrong.
Congratulations! You've mastered the hierarchy of recoverability properties—from basic recoverable schedules through cascadeless and strict schedules. You understand how these properties enable database recovery, how they're implemented in practice, and how to make informed decisions about isolation levels. This knowledge forms the foundation for understanding transaction management and database reliability.