If the transaction log is the diary of all database activity, then log records are the individual entries in that diary. Each log record captures a precise moment of action—a row inserted, a value changed, a transaction completed. Together, thousands or millions of these records form the complete narrative of your database's history.
But not all diary entries are alike. A log record for updating an account balance looks very different from one marking the end of a transaction. Each type of log record serves a specific purpose in the recovery process, and understanding these distinctions is crucial for understanding how databases actually recover from failures.
In this page, we'll dissect the anatomy of log records, examining the different types, the information they carry, and why each piece of data matters for successful recovery.
By the end of this page, you will understand the detailed structure of log records, the different categories of log records (data modification, transaction control, system operations), what information each record type must contain, and how records are linked together to form recoverable transaction histories.
Every log record, regardless of its type, shares a common structural foundation. This foundation ensures that the recovery system can process any record correctly and efficiently. Let's examine each component in detail:
1. Log Sequence Number (LSN)
The LSN is the unique identifier for each log record. It serves as the "address" of the record in the log and provides two critical properties: LSNs increase monotonically, giving a total order over every operation ever logged, and each data page carries the LSN of the last record applied to it, so recovery can compare a page's LSN with a log record's LSN to decide whether that change has already reached the page.
In most implementations, the LSN is composed of the log file number and the byte offset within that file, making it unique and giving a direct path to the record's location on disk.
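As a rough sketch of that composition, a 64-bit LSN could pack the file number into the high 32 bits and the byte offset into the low 32 bits; the exact split shown here is an assumption, not any particular engine's format. Under this assumption, the example LSN 0x0001234500005678 used later on this page would decode to file 0x12345, offset 0x5678.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: assume the high 32 bits hold the log file number
 * and the low 32 bits hold the byte offset within that file. */
static uint64_t make_lsn(uint32_t file_no, uint32_t offset) {
    return ((uint64_t)file_no << 32) | offset;
}

static uint32_t lsn_file(uint64_t lsn)   { return (uint32_t)(lsn >> 32); }
static uint32_t lsn_offset(uint64_t lsn) { return (uint32_t)(lsn & 0xFFFFFFFFu); }

int main(void) {
    uint64_t lsn = make_lsn(0x12345, 0x5678);
    printf("file=0x%x offset=0x%x lsn=0x%016llx\n",
           (unsigned)lsn_file(lsn), (unsigned)lsn_offset(lsn),
           (unsigned long long)lsn);   /* prints lsn=0x0001234500005678 */
    return 0;
}
```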
| Component | Size (Typical) | Purpose | Recovery Use |
|---|---|---|---|
| LSN | 8 bytes | Unique record identifier | Ordering, page-LSN comparison |
| Transaction ID | 4-8 bytes | Which transaction generated this record | Undo/redo grouping by transaction |
| Previous LSN | 8 bytes | Links to transaction's prior record | Backward traversal for undo |
| Record Type | 1-2 bytes | Identifies the record category | Determines processing logic |
| Record Length | 2-4 bytes | Total bytes in this record | Navigation, parsing |
| Timestamp | 8 bytes | When operation occurred | Point-in-time recovery |
| Checksum | 4 bytes | Integrity verification | Detecting log corruption |
2. Previous LSN and Transaction Chaining
The Previous LSN (PrevLSN) field is a clever optimization that enables efficient transaction rollback. Each log record points backward to the previous record created by the same transaction. This creates a linked list within the log for each transaction:
```
T1 Log Chain: BEGIN(001) ← UPDATE(004) ← UPDATE(009) ← COMMIT(012)
                  ↑                                        ↑
                start                                     end
```
During rollback (either explicit or during recovery), the system can efficiently walk backward through just this transaction's records without scanning the entire log. If transaction T1 needs to abort after LSN 009, the system starts at LSN 009, follows the PrevLSN pointer to LSN 004, then to the BEGIN record at LSN 001, undoing each change along the way.
Rollback processes operations in reverse order—the last modification made must be undone first. The PrevLSN chain is designed precisely for this reverse traversal, making rollback efficient even when thousands of other transactions' records are interleaved in the log.
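The backward walk can be summarized in a minimal sketch; fetch_record and undo_change are hypothetical helpers, and the BEGIN type code is assumed for illustration:

```c
#include <stdint.h>

enum { LOG_BEGIN = 0x10 };   /* assumed record-type code */

typedef struct LogRecord {
    uint64_t lsn;
    uint64_t prev_lsn;      /* 0 for the BEGIN record (start of the chain) */
    uint8_t  record_type;
    /* ... location info, before/after images ... */
} LogRecord;

extern LogRecord *fetch_record(uint64_t lsn);    /* hypothetical: read a record by LSN    */
extern void       undo_change(LogRecord *rec);   /* hypothetical: reverse one data change */

/* Walk backward from the transaction's last LSN, undoing each change,
 * without ever scanning other transactions' interleaved records. */
void rollback_transaction(uint64_t last_lsn) {
    uint64_t lsn = last_lsn;
    while (lsn != 0) {
        LogRecord *rec = fetch_record(lsn);
        if (rec->record_type != LOG_BEGIN) {
            undo_change(rec);
        }
        lsn = rec->prev_lsn;                     /* follow the chain toward BEGIN */
    }
}
```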
Data modification records are the most common type of log record. They document every change to the actual database content—insertions, updates, and deletions. These records must contain enough information to either redo the operation (apply the change forward) or undo it (reverse the change).
The Undo/Redo Information Principle:
For any data modification record to support recovery, it must contain the location of the affected data (table, page, and slot), a before image (the old values, needed to undo the change), and an after image (the new values, needed to redo it). Depending on the recovery algorithm, some systems store only one image type, but the most robust systems store both.
| Record Type | Before Image | After Image | Description |
|---|---|---|---|
| INSERT | Empty/NULL | Complete new row | A new row was added to a table |
| DELETE | Complete deleted row | Empty/NULL | A row was removed from a table |
| UPDATE | Old column value(s) | New column value(s) | Column(s) in a row were modified |
Detailed Structure of an UPDATE Log Record:
UPDATE records are the most complex because they must capture what changed, not just what exists. Here's a comprehensive breakdown:
```
// Detailed structure of an UPDATE log record
struct UpdateLogRecord {
    // === Common Header ===
    uint64_t lsn;                  // e.g., 0x0001234500009ABC
    uint32_t transaction_id;       // e.g., 4782
    uint64_t prev_lsn;             // e.g., 0x0001234500009000
    uint8_t  record_type;          // UPDATE = 0x02
    uint16_t record_length;        // Total bytes

    // === Location Information ===
    uint32_t table_id;             // Internal table identifier
    uint32_t page_id;              // Which data page was modified
    uint16_t slot_number;          // Which slot/row within the page

    // === Column Change Information ===
    uint16_t num_columns_changed;  // How many columns modified
    uint16_t column_ids[];         // Array of column IDs that changed

    // === Before Image (for Undo) ===
    uint16_t before_length;        // Length of before image data
    byte[]   before_values;        // Old values of changed columns

    // === After Image (for Redo) ===
    uint16_t after_length;         // Length of after image data
    byte[]   after_values;         // New values of changed columns

    // === Integrity ===
    uint32_t checksum;             // CRC32 or similar
}

// Example: UPDATE accounts SET balance = 500.00 WHERE id = 'A123'
//
// UpdateLogRecord {
//   lsn:                 0x0001234500009ABC,
//   transaction_id:      4782,
//   prev_lsn:            0x0001234500009000,
//   record_type:         UPDATE,
//   table_id:            15,    // 'accounts' table
//   page_id:             847,   // Page containing this row
//   slot_number:         12,    // Row's position in page
//   num_columns_changed: 1,
//   column_ids:          [3],   // 'balance' is column 3
//   before_length:       8,
//   before_values:       [0x40 0xC3 0x88 0x00 0x00 0x00 0x00 0x00],  // 10000.00
//   after_length:        8,
//   after_values:        [0x40 0x7F 0x40 0x00 0x00 0x00 0x00 0x00],  // 500.00
// }
```

INSERT and DELETE Records:
INSERT records don't need a before image (there was nothing before), but they need the complete new row as the after image. The undo of an INSERT is simply deleting the row.
DELETE records are the mirror image: they need the complete row as the before image (for undo) but no after image. The redo of a DELETE just removes the row again.
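This symmetry can be captured in a small dispatch sketch. The ModRecord layout, the page-level helpers, and the INSERT/DELETE type codes are all assumptions for illustration (only UPDATE = 0x02 appears on this page):

```c
#include <stdint.h>
#include <stddef.h>

enum { REC_INSERT = 0x01, REC_UPDATE = 0x02, REC_DELETE = 0x03 };  /* INSERT/DELETE codes assumed */

typedef struct {
    uint8_t        record_type;   /* REC_INSERT, REC_UPDATE, or REC_DELETE */
    uint32_t       page_id;
    uint16_t       slot_number;
    const uint8_t *before_image;  /* NULL for INSERT */
    const uint8_t *after_image;   /* NULL for DELETE */
    size_t         image_length;
} ModRecord;

/* Hypothetical page-level helpers. */
extern void page_insert(uint32_t page, uint16_t slot, const uint8_t *row, size_t len);
extern void page_write (uint32_t page, uint16_t slot, const uint8_t *val, size_t len);
extern void page_delete(uint32_t page, uint16_t slot);

void undo(const ModRecord *r) {
    switch (r->record_type) {
    case REC_INSERT: page_delete(r->page_id, r->slot_number);                                   break; /* remove the new row  */
    case REC_DELETE: page_insert(r->page_id, r->slot_number, r->before_image, r->image_length); break; /* restore the old row */
    case REC_UPDATE: page_write (r->page_id, r->slot_number, r->before_image, r->image_length); break; /* put back old values */
    }
}

void redo(const ModRecord *r) {
    switch (r->record_type) {
    case REC_INSERT: page_insert(r->page_id, r->slot_number, r->after_image, r->image_length);  break;
    case REC_DELETE: page_delete(r->page_id, r->slot_number);                                   break;
    case REC_UPDATE: page_write (r->page_id, r->slot_number, r->after_image, r->image_length);  break;
    }
}
```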
Optimization: Logical vs. Physical Records
Some systems use logical log records that describe the operation conceptually ("SET balance = balance - 500") rather than physically ("old value = 1000, new value = 500"). Logical records can be more compact but are harder to apply correctly if the database state isn't exactly as expected. Most production systems use physical logging for data modifications.
Transaction control records mark the boundaries and state changes of transactions. Unlike data modification records, they don't describe changes to table data. Instead, they provide the structural information needed to understand which operations belong to which transaction and what the final outcome of each transaction was.
These records are crucial for recovery because they define each transaction's boundaries and its final outcome:
| Record Type | Purpose | Recovery Implications |
|---|---|---|
| BEGIN | Marks the start of a new transaction | Establishes transaction identity; starts the undo chain |
| COMMIT | Marks successful completion of transaction | All transaction's changes become permanent and must survive |
| ABORT | Marks unsuccessful completion of transaction | All transaction's changes must be undone and discarded |
| PREPARE | Two-phase commit: ready to commit | Transaction is durable but awaiting coordinator decision |
| END | All cleanup for transaction is complete | Transaction's resources can be released; log can be truncated past this point |
The Critical Nature of COMMIT Records:
The COMMIT record is arguably the most important record in the entire log. It serves as the definitive marker that separates "must persist" from "must be undone." The rule is absolute:
A transaction's changes are durable if and only if its COMMIT record has been written to stable storage.
Before the COMMIT record is written to disk: the transaction is still in doubt; if the system crashes, recovery treats it as incomplete and undoes all of its changes.
After the COMMIT record is written to disk: the transaction is durable; recovery will redo any of its changes that had not yet reached the data files, no matter when the crash occurs.
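A hedged sketch of how recovery applies this rule when classifying transactions; scan_log_for_commit is a hypothetical helper that checks whether a COMMIT record for the given transaction appears in the durable log:

```c
#include <stdbool.h>
#include <stdint.h>

extern bool scan_log_for_commit(uint32_t txn_id);   /* hypothetical */

typedef enum { ACTION_REDO, ACTION_UNDO } RecoveryAction;

RecoveryAction classify_transaction(uint32_t txn_id) {
    /* COMMIT on disk  -> changes must survive: redo them.
       No COMMIT found -> transaction is incomplete: undo its changes. */
    return scan_log_for_commit(txn_id) ? ACTION_REDO : ACTION_UNDO;
}
```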
```
// BEGIN record - minimal structure
struct BeginLogRecord {
    uint64_t lsn;
    uint32_t transaction_id;
    uint64_t prev_lsn;           // NULL/0 for BEGIN (first record)
    uint8_t  record_type;        // BEGIN = 0x10
    uint64_t start_timestamp;    // When transaction started
    uint8_t  isolation_level;    // READ_COMMITTED, SERIALIZABLE, etc.
}

// COMMIT record - marks successful completion
struct CommitLogRecord {
    uint64_t lsn;
    uint32_t transaction_id;
    uint64_t prev_lsn;           // Points to last operation
    uint8_t  record_type;        // COMMIT = 0x11
    uint64_t commit_timestamp;   // Official commit time
}

// ABORT record - marks rollback
struct AbortLogRecord {
    uint64_t lsn;
    uint32_t transaction_id;
    uint64_t prev_lsn;           // Points to last operation
    uint8_t  record_type;        // ABORT = 0x12
    uint64_t abort_timestamp;
    uint16_t abort_reason;       // Error code, if applicable
}

// END record - cleanup complete
struct EndLogRecord {
    uint64_t lsn;
    uint32_t transaction_id;
    uint64_t prev_lsn;           // Points to COMMIT or ABORT
    uint8_t  record_type;        // END = 0x13
}
```

The END record indicates that all cleanup for a transaction is complete—rollback is finished (if aborted) or all effects are stable (if committed). Until END is written, the transaction may still require recovery attention. END records enable log truncation, as no recovery work remains before this point for this transaction.
One of the most sophisticated log record types is the Compensation Log Record (CLR). CLRs are generated during rollback operations—either explicit rollback (ROLLBACK statement) or recovery rollback (undoing incomplete transactions after a crash).
Why Do We Need CLRs?
Consider this scenario: transaction T1 updates X from 100 to 200 (logged as an UPDATE record). T1 then issues a ROLLBACK, and the undo restores X to 100, but the system crashes before the rollback completes.
Without CLRs, when the system recovers, it sees an incomplete transaction T1 whose UPDATE record still appears to need undoing.
But X is already 100! If we undo blindly, we might corrupt data. CLRs solve this by recording "I already undid this operation."
CLRs make undo operations idempotent. No matter how many times recovery runs, applying CLRs correctly ensures the undo work is done exactly once. CLRs are redo-only records—they're never themselves undone.
CLR Structure:
CLRs have a special field called UndoNextLSN that points to the next log record that still needs to be undone. This allows recovery to skip operations that have already been compensated:
```
// Compensation Log Record (CLR)
struct CLR {
    uint64_t lsn;
    uint32_t transaction_id;
    uint64_t prev_lsn;          // Links in transaction chain
    uint8_t  record_type;       // CLR = 0x20

    // === Key CLR Fields ===
    uint64_t undo_next_lsn;     // Next record to undo (skip completed)
    uint64_t compensating_lsn;  // Which operation this undoes

    // === Redo Information ===
    // CLRs contain the REDO of the undo operation
    // (So if we crash again, we can redo the undo)
    uint32_t table_id;
    uint32_t page_id;
    uint16_t slot_number;
    byte[]   restored_value;    // The "before image" from original
}

// Example Scenario:
// LSN 001: T1 BEGIN
// LSN 002: T1 UPDATE X: 100 → 200
// LSN 003: T1 UPDATE Y: 50 → 75
// LSN 004: T1 UPDATE Z: 300 → 400
//
// T1 issues ROLLBACK:
// LSN 005: CLR for LSN 004 (undoing Z: 400 → 300), UndoNextLSN = 003
// LSN 006: CLR for LSN 003 (undoing Y: 75 → 50),   UndoNextLSN = 002
// LSN 007: CLR for LSN 002 (undoing X: 200 → 100), UndoNextLSN = 001
// LSN 008: T1 ABORT
//
// If crash after LSN 006:
// Recovery sees T1's last record is the CLR at LSN 006
// UndoNextLSN = 002, so only LSN 002 still needs undo
// Skip LSN 003 and 004 (already compensated)
```

CLR Properties:
CLRs are redo-only: During recovery, CLRs are redone (to restore the undo work) but never themselves undone. This prevents infinite loops.
UndoNextLSN enables skipping: The system doesn't re-undo operations that have already been undone. This is essential for correctness.
CLRs contain redo data only: Since an undo operation's "before image" is the after image of the original operation (which we're reverting to), CLRs only need redo information.
CLRs link to original: Each CLR references the original operation it compensates, providing a complete audit trail.
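Putting these properties together, the undo pass might look roughly like the following sketch; read_log_record, undo_one_change, and append_clr are hypothetical helpers, and the record layout is simplified:

```c
#include <stdint.h>

typedef struct UndoRecord {
    uint64_t lsn;
    uint64_t prev_lsn;       /* transaction chain                  */
    uint64_t undo_next_lsn;  /* meaningful only for CLRs           */
    int      is_clr;
    int      is_begin;
} UndoRecord;

extern UndoRecord *read_log_record(uint64_t lsn);                     /* hypothetical */
extern void        undo_one_change(UndoRecord *rec);                  /* hypothetical */
extern uint64_t    append_clr(UndoRecord *undone, uint64_t undo_next);/* hypothetical */

void undo_transaction(uint64_t last_lsn) {
    uint64_t lsn = last_lsn;
    while (lsn != 0) {
        UndoRecord *rec = read_log_record(lsn);
        if (rec->is_clr) {
            lsn = rec->undo_next_lsn;          /* skip work already compensated  */
        } else if (rec->is_begin) {
            break;                             /* reached BEGIN: nothing left    */
        } else {
            undo_one_change(rec);              /* reverse the modification       */
            append_clr(rec, rec->prev_lsn);    /* CLR's UndoNextLSN = PrevLSN    */
            lsn = rec->prev_lsn;               /* continue toward BEGIN          */
        }
    }
}
```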
Checkpoint records are system-level records that periodically capture the state of the database system. Unlike transaction records that describe individual operations, checkpoint records create efficient recovery starting points that dramatically reduce the amount of log that must be processed during crash recovery.
Why Checkpoints?
Without checkpoints, recovery would need to read the log from its very beginning, examining and potentially replaying every record written since the database was created.
This would make recovery take hours or days. Checkpoints solve this by recording:
| Field | Description | Recovery Use |
|---|---|---|
| Active Transaction Table | List of all transactions in progress at checkpoint time | Determines which transactions may need undo |
| Dirty Page Table | List of modified pages in buffer pool with their recovery LSN | Determines starting point for redo |
| Master Record Pointer | Points to the most recent successful checkpoint | Allows recovery to find starting point |
| Checkpoint LSN | The LSN of this checkpoint record | Establishes checkpoint position in log |
```
// Checkpoint Record Structure (ARIES-style)
struct CheckpointLogRecord {
    uint64_t lsn;
    uint8_t  record_type;        // CHECKPOINT_BEGIN or CHECKPOINT_END

    // === Active Transaction Table ===
    uint32_t num_active_transactions;
    struct ActiveTxEntry {
        uint32_t transaction_id;
        uint8_t  state;          // ACTIVE, PREPARED
        uint64_t last_lsn;       // Last LSN for this transaction
        uint64_t undo_next_lsn;  // Next LSN to undo if rollback needed
    } active_transactions[];

    // === Dirty Page Table ===
    uint32_t num_dirty_pages;
    struct DirtyPageEntry {
        uint32_t page_id;
        uint64_t recovery_lsn;   // First LSN that dirtied this page
    } dirty_pages[];
}

// Example at checkpoint time:
//
// Active Transactions:
//   T1: Active,   LastLSN=500, UndoNextLSN=500
//   T3: Active,   LastLSN=495, UndoNextLSN=480
//   T4: Prepared, LastLSN=498, UndoNextLSN=498
//
// Dirty Pages:
//   Page 42: RecLSN=450 (first dirty at LSN 450)
//   Page 58: RecLSN=480 (first dirty at LSN 480)
//   Page 99: RecLSN=495 (first dirty at LSN 495)
//
// Recovery will:
//   1. Start redo from min(RecLSN) = 450
//   2. Track T1, T3, T4 as potentially needing undo
//   3. Skip log records before LSN 450 for redo
```

Modern databases use 'fuzzy' checkpoints that write CHECKPOINT_BEGIN, continue normal operations, then write CHECKPOINT_END. This avoids blocking transactions during checkpointing. Recovery uses the CHECKPOINT_END record and handles any transactions that committed or aborted between BEGIN and END.
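As a small illustration of step 1 in the example above, the redo starting point is simply the minimum RecLSN in the checkpoint's dirty page table. This is a sketch, not any particular engine's code:

```c
#include <stdint.h>

typedef struct {
    uint32_t page_id;
    uint64_t recovery_lsn;   /* first LSN that dirtied this page */
} DirtyPageEntry;

/* Returns the LSN at which the redo pass should begin.
 * With an empty dirty page table, redo can start at the checkpoint itself. */
uint64_t redo_start_lsn(const DirtyPageEntry *dpt, uint32_t n, uint64_t checkpoint_lsn) {
    uint64_t start = checkpoint_lsn;
    for (uint32_t i = 0; i < n; i++) {
        if (dpt[i].recovery_lsn < start) {
            start = dpt[i].recovery_lsn;       /* min(RecLSN), e.g. 450 in the example */
        }
    }
    return start;
}
```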
Beyond data modifications and transaction control, databases generate various system and utility log records. These records support operations that aren't direct data changes but still need to be recoverable or auditable.
| Record Type | Purpose | Examples |
|---|---|---|
| DDL Records | Schema definition changes | CREATE TABLE, ALTER TABLE, DROP INDEX |
| Space Management | Storage allocation changes | Page allocation, extent allocation, file growth |
| Index Operations | Index structure modifications | Page splits, merges, rebalancing |
| LOB Operations | Large object handling | BLOB/CLOB storage, segmentation |
| Savepoint Records | Transaction partial rollback points | SAVEPOINT, ROLLBACK TO SAVEPOINT |
| Nested Transaction | Sub-transaction boundaries | Nested BEGIN/COMMIT/ROLLBACK |
Schema (DDL) Log Records:
Schema changes present unique challenges because they affect the metadata used to interpret data pages. For example:
DDL Log Record for ALTER TABLE:
```
{
  operation: ADD_COLUMN,
  table_id: 15,
  new_column: {
    name: 'email',
    type: VARCHAR(255),
    nullable: true,
    position: 5
  },
  catalog_page_modified: 1842,
  before_catalog_entry: [...],
  after_catalog_entry: [...]
}
```
DDL operations typically acquire exclusive locks and often force checkpoints to ensure schema consistency during recovery.
Space Management Records:
These records track physical storage changes: page and extent allocation and deallocation, free-space map updates, and data file growth.
These must be logged because recovery needs to restore the database's physical structure, not just its logical content.
Savepoint records allow applications to roll back to a specific point within a transaction without aborting the entire transaction. The log records mark these points so that partial undo can stop at the savepoint rather than undoing all operations.
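A minimal sketch of that partial undo, reusing the same backward PrevLSN walk but stopping at the savepoint's LSN; the helpers are hypothetical:

```c
#include <stdint.h>

typedef struct SpRecord {
    uint64_t lsn;
    uint64_t prev_lsn;
} SpRecord;

extern SpRecord *fetch_sp_record(uint64_t lsn);     /* hypothetical */
extern void      undo_sp_change(SpRecord *rec);     /* hypothetical */

/* ROLLBACK TO SAVEPOINT: undo changes newer than the savepoint record,
 * then stop; everything logged before the savepoint stays intact and
 * the transaction itself remains open. */
void rollback_to_savepoint(uint64_t last_lsn, uint64_t savepoint_lsn) {
    uint64_t lsn = last_lsn;
    while (lsn != 0 && lsn > savepoint_lsn) {       /* LSNs increase monotonically */
        SpRecord *rec = fetch_sp_record(lsn);
        undo_sp_change(rec);
        lsn = rec->prev_lsn;
    }
}
```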
Log record design involves careful trade-offs between information completeness and storage efficiency. Larger records provide more information but consume more disk space and I/O bandwidth. Production systems employ several optimization techniques: logging only the columns that actually changed, buffering records and writing them in batches, and combining multiple transactions' commit records into a single write (group commit).
Record Size Statistics:
In typical OLTP workloads:
| Record Type | Typical Size | Frequency |
|---|---|---|
| UPDATE (1-2 columns) | 50-150 bytes | Very High |
| INSERT (typical row) | 100-500 bytes | High |
| DELETE | 100-400 bytes | Medium |
| COMMIT | 30-50 bytes | High |
| BEGIN | 30-40 bytes | High |
| Checkpoint | 1KB-1MB | Low |
The log's sequential nature means that even small records benefit from batching and asynchronous writes. Modern systems buffer log records and write them in groups, reducing the number of disk I/O operations.
Despite optimizations, the COMMIT record remains a synchronization point—it must be durably written before acknowledging the commit. This 'log force' at commit is often the primary bottleneck in high-throughput systems. Group commit, where multiple transactions' commit records are written together, addresses this by amortizing the I/O cost.
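A rough sketch of the group-commit idea, with hypothetical append_to_log_buffer and flush_log_to_disk helpers standing in for the log manager:

```c
#include <stdint.h>

extern void append_to_log_buffer(uint32_t txn_id);   /* hypothetical: buffers a COMMIT record */
extern void flush_log_to_disk(void);                 /* hypothetical: one durable write (fsync) */

/* Accumulate several transactions' COMMIT records in the log buffer,
 * then make them all durable with a single flush. Only after the flush
 * returns may any of the n commits be acknowledged to clients. */
void commit_batch(const uint32_t *txn_ids, uint32_t n) {
    for (uint32_t i = 0; i < n; i++) {
        append_to_log_buffer(txn_ids[i]);    /* no I/O yet */
    }
    flush_log_to_disk();                     /* one fsync amortized over n commits */
}
```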
We've examined the building blocks that make up transaction logs, and each record type plays a specific role in enabling database recovery: data modification records carry the before and after images needed to undo and redo changes, transaction control records mark where transactions begin and how they end, compensation log records make rollback safe to repeat, and checkpoint records bound how much of the log recovery must examine.
What's Next:
Now that we understand the individual log record types, we'll examine how they're organized into different log types and structures. We'll explore the differences between undo-only, redo-only, and undo-redo logging, and understand how the choice of logging type affects recovery strategy.
You now understand the detailed structure and types of log records—the vocabulary that databases use to document their history. Next, we'll see how these records are organized into different logging approaches with varying trade-offs.