If the transaction log is the diary of all database activity, then log records are the individual entries in that diary. Each log record captures a precise moment of action—a row inserted, a value changed, a transaction completed. Together, thousands or millions of these records form the complete narrative of your database's history.
But not all diary entries are alike. A log record for updating an account balance looks very different from one marking the end of a transaction. Each type of log record serves a specific purpose in the recovery process, and understanding these distinctions is crucial for understanding how databases actually recover from failures.
In this page, we'll dissect the anatomy of log records, examining the different types, the information they carry, and why each piece of data matters for successful recovery.
By the end of this page, you will understand the detailed structure of log records, the different categories of log records (data modification, transaction control, system operations), what information each record type must contain, and how records are linked together to form recoverable transaction histories.
Every log record, regardless of its type, shares a common structural foundation. This foundation ensures that the recovery system can process any record correctly and efficiently. Let's examine each component in detail:
1. Log Sequence Number (LSN)
The LSN is the unique identifier for each log record. It serves as the "address" of the record in the log and provides two critical properties: LSNs increase monotonically, giving a total order over every operation ever logged, and each data page carries the LSN of the last record applied to it, so recovery can compare a page's LSN with a log record's LSN to decide whether that change has already reached the page.
In most implementations, the LSN is composed of the log file number and the byte offset within that file, making it unique and giving a direct path to the record's location on disk.
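As a rough sketch of that composition, a 64-bit LSN could pack the file number into the high 32 bits and the byte offset into the low 32 bits; the exact split shown here is an assumption, not any particular engine's format. Under this assumption, the example LSN 0x0001234500005678 used later on this page would decode to file 0x12345, offset 0x5678.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: assume the high 32 bits hold the log file number
 * and the low 32 bits hold the byte offset within that file. */
static uint64_t make_lsn(uint32_t file_no, uint32_t offset) {
    return ((uint64_t)file_no << 32) | offset;
}

static uint32_t lsn_file(uint64_t lsn)   { return (uint32_t)(lsn >> 32); }
static uint32_t lsn_offset(uint64_t lsn) { return (uint32_t)(lsn & 0xFFFFFFFFu); }

int main(void) {
    uint64_t lsn = make_lsn(0x12345, 0x5678);
    printf("file=0x%x offset=0x%x lsn=0x%016llx\n",
           (unsigned)lsn_file(lsn), (unsigned)lsn_offset(lsn),
           (unsigned long long)lsn);   /* prints lsn=0x0001234500005678 */
    return 0;
}
```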
| Component | Size (Typical) | Purpose | Recovery Use |
|---|---|---|---|
| LSN | 8 bytes | Unique record identifier | Ordering, page-LSN comparison |
| Transaction ID | 4-8 bytes | Which transaction generated this record | Undo/redo grouping by transaction |
| Previous LSN | 8 bytes | Links to transaction's prior record | Backward traversal for undo |
| Record Type | 1-2 bytes | Identifies the record category | Determines processing logic |
| Record Length | 2-4 bytes | Total bytes in this record | Navigation, parsing |
| Timestamp | 8 bytes | When operation occurred | Point-in-time recovery |
| Checksum | 4 bytes | Integrity verification | Detecting log corruption |
2. Previous LSN and Transaction Chaining
The Previous LSN (PrevLSN) field is a clever optimization that enables efficient transaction rollback. Each log record points backward to the previous record created by the same transaction. This creates a linked list within the log for each transaction:
```
T1 Log Chain: BEGIN(001) ← UPDATE(004) ← UPDATE(009) ← COMMIT(012)
                  ↑                                        ↑
                start                                     end
```
During rollback (either explicit or during recovery), the system can efficiently walk backward through just this transaction's records without scanning the entire log. If transaction T1 needs to abort after LSN 009, the system starts at LSN 009, follows the PrevLSN pointer to LSN 004, then to the BEGIN record at LSN 001, undoing each change along the way.
Rollback processes operations in reverse order—the last modification made must be undone first. The PrevLSN chain is designed precisely for this reverse traversal, making rollback efficient even when thousands of other transactions' records are interleaved in the log.
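The backward walk can be summarized in a minimal sketch; fetch_record and undo_change are hypothetical helpers, and the BEGIN type code is assumed for illustration:

```c
#include <stdint.h>

enum { LOG_BEGIN = 0x10 };   /* assumed record-type code */

typedef struct LogRecord {
    uint64_t lsn;
    uint64_t prev_lsn;      /* 0 for the BEGIN record (start of the chain) */
    uint8_t  record_type;
    /* ... location info, before/after images ... */
} LogRecord;

extern LogRecord *fetch_record(uint64_t lsn);    /* hypothetical: read a record by LSN    */
extern void       undo_change(LogRecord *rec);   /* hypothetical: reverse one data change */

/* Walk backward from the transaction's last LSN, undoing each change,
 * without ever scanning other transactions' interleaved records. */
void rollback_transaction(uint64_t last_lsn) {
    uint64_t lsn = last_lsn;
    while (lsn != 0) {
        LogRecord *rec = fetch_record(lsn);
        if (rec->record_type != LOG_BEGIN) {
            undo_change(rec);
        }
        lsn = rec->prev_lsn;                     /* follow the chain toward BEGIN */
    }
}
```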
Data modification records are the most common type of log record. They document every change to the actual database content—insertions, updates, and deletions. These records must contain enough information to either redo the operation (apply the change forward) or undo it (reverse the change).
The Undo/Redo Information Principle:
For any data modification record to support recovery, it must contain the location of the affected data (table, page, and slot), a before image (the old values, needed to undo the change), and an after image (the new values, needed to redo it). Depending on the recovery algorithm, some systems store only one image type, but the most robust systems store both.
| Record Type | Before Image | After Image | Description |
|---|---|---|---|
| INSERT | Empty/NULL | Complete new row | A new row was added to a table |
| DELETE | Complete deleted row | Empty/NULL | A row was removed from a table |
| UPDATE | Old column value(s) | New column value(s) | Column(s) in a row were modified |
Detailed Structure of an UPDATE Log Record:
UPDATE records are the most complex because they must capture what changed, not just what exists. Here's a comprehensive breakdown:
```
// Detailed structure of an UPDATE log record
struct UpdateLogRecord {
    // === Common Header ===
    uint64_t lsn;                  // e.g., 0x0001234500009ABC
    uint32_t transaction_id;       // e.g., 4782
    uint64_t prev_lsn;             // e.g., 0x0001234500009000
    uint8_t  record_type;          // UPDATE = 0x02
    uint16_t record_length;        // Total bytes

    // === Location Information ===
    uint32_t table_id;             // Internal table identifier
    uint32_t page_id;              // Which data page was modified
    uint16_t slot_number;          // Which slot/row within the page

    // === Column Change Information ===
    uint16_t num_columns_changed;  // How many columns modified
    uint16_t column_ids[];         // Array of column IDs that changed

    // === Before Image (for Undo) ===
    uint16_t before_length;        // Length of before image data
    byte[]   before_values;        // Old values of changed columns

    // === After Image (for Redo) ===
    uint16_t after_length;         // Length of after image data
    byte[]   after_values;         // New values of changed columns

    // === Integrity ===
    uint32_t checksum;             // CRC32 or similar
}

// Example: UPDATE accounts SET balance = 500.00 WHERE id = 'A123'
//
// UpdateLogRecord {
//   lsn:                 0x0001234500009ABC,
//   transaction_id:      4782,
//   prev_lsn:            0x0001234500009000,
//   record_type:         UPDATE,
//   table_id:            15,    // 'accounts' table
//   page_id:             847,   // Page containing this row
//   slot_number:         12,    // Row's position in page
//   num_columns_changed: 1,
//   column_ids:          [3],   // 'balance' is column 3
//   before_length:       8,
//   before_values:       [0x40 0xC3 0x88 0x00 0x00 0x00 0x00 0x00],  // 10000.00
//   after_length:        8,
//   after_values:        [0x40 0x7F 0x40 0x00 0x00 0x00 0x00 0x00],  // 500.00
// }
```

INSERT and DELETE Records:
INSERT records don't need a before image (there was nothing before), but they need the complete new row as the after image. The undo of an INSERT is simply deleting the row.
DELETE records are the mirror image: they need the complete row as the before image (for undo) but no after image. The redo of a DELETE just removes the row again.
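This symmetry can be captured in a small dispatch sketch. The ModRecord layout, the page-level helpers, and the INSERT/DELETE type codes are all assumptions for illustration (only UPDATE = 0x02 appears on this page):

```c
#include <stdint.h>
#include <stddef.h>

enum { REC_INSERT = 0x01, REC_UPDATE = 0x02, REC_DELETE = 0x03 };  /* INSERT/DELETE codes assumed */

typedef struct {
    uint8_t        record_type;   /* REC_INSERT, REC_UPDATE, or REC_DELETE */
    uint32_t       page_id;
    uint16_t       slot_number;
    const uint8_t *before_image;  /* NULL for INSERT */
    const uint8_t *after_image;   /* NULL for DELETE */
    size_t         image_length;
} ModRecord;

/* Hypothetical page-level helpers. */
extern void page_insert(uint32_t page, uint16_t slot, const uint8_t *row, size_t len);
extern void page_write (uint32_t page, uint16_t slot, const uint8_t *val, size_t len);
extern void page_delete(uint32_t page, uint16_t slot);

void undo(const ModRecord *r) {
    switch (r->record_type) {
    case REC_INSERT: page_delete(r->page_id, r->slot_number);                                   break; /* remove the new row  */
    case REC_DELETE: page_insert(r->page_id, r->slot_number, r->before_image, r->image_length); break; /* restore the old row */
    case REC_UPDATE: page_write (r->page_id, r->slot_number, r->before_image, r->image_length); break; /* put back old values */
    }
}

void redo(const ModRecord *r) {
    switch (r->record_type) {
    case REC_INSERT: page_insert(r->page_id, r->slot_number, r->after_image, r->image_length);  break;
    case REC_DELETE: page_delete(r->page_id, r->slot_number);                                   break;
    case REC_UPDATE: page_write (r->page_id, r->slot_number, r->after_image, r->image_length);  break;
    }
}
```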
Optimization: Logical vs. Physical Records
Some systems use logical log records that describe the operation conceptually ("SET balance = balance - 500") rather than physically ("old value = 1000, new value = 500"). Logical records can be more compact but are harder to apply correctly if the database state isn't exactly as expected. Most production systems use physical logging for data modifications.
Transaction control records mark the boundaries and state changes of transactions. Unlike data modification records, they don't describe changes to table data. Instead, they provide the structural information needed to understand which operations belong to which transaction and what the final outcome of each transaction was.
These records are crucial for recovery because they define each transaction's boundaries and its final outcome:
| Record Type | Purpose | Recovery Implications |
|---|---|---|
| BEGIN | Marks the start of a new transaction | Establishes transaction identity; starts the undo chain |
| COMMIT | Marks successful completion of transaction | All transaction's changes become permanent and must survive |
| ABORT | Marks unsuccessful completion of transaction | All transaction's changes must be undone and discarded |
| PREPARE | Two-phase commit: ready to commit | Transaction is durable but awaiting coordinator decision |
| END | All cleanup for transaction is complete | Transaction's resources can be released; log can be truncated past this point |
The Critical Nature of COMMIT Records:
The COMMIT record is arguably the most important record in the entire log. It serves as the definitive marker that separates "must persist" from "must be undone." The rule is absolute:
A transaction's changes are durable if and only if its COMMIT record has been written to stable storage.
Before the COMMIT record is written to disk: the transaction is still in doubt; if the system crashes, recovery treats it as incomplete and undoes all of its changes.
After the COMMIT record is written to disk: the transaction is durable; recovery will redo any of its changes that had not yet reached the data files, no matter when the crash occurs.
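A hedged sketch of how recovery applies this rule when classifying transactions; scan_log_for_commit is a hypothetical helper that checks whether a COMMIT record for the given transaction appears in the durable log:

```c
#include <stdbool.h>
#include <stdint.h>

extern bool scan_log_for_commit(uint32_t txn_id);   /* hypothetical */

typedef enum { ACTION_REDO, ACTION_UNDO } RecoveryAction;

RecoveryAction classify_transaction(uint32_t txn_id) {
    /* COMMIT on disk  -> changes must survive: redo them.
       No COMMIT found -> transaction is incomplete: undo its changes. */
    return scan_log_for_commit(txn_id) ? ACTION_REDO : ACTION_UNDO;
}
```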
```
// BEGIN record - minimal structure
struct BeginLogRecord {
    uint64_t lsn;
    uint32_t transaction_id;
    uint64_t prev_lsn;           // NULL/0 for BEGIN (first record)
    uint8_t  record_type;        // BEGIN = 0x10
    uint64_t start_timestamp;    // When transaction started
    uint8_t  isolation_level;    // READ_COMMITTED, SERIALIZABLE, etc.
}

// COMMIT record - marks successful completion
struct CommitLogRecord {
    uint64_t lsn;
    uint32_t transaction_id;
    uint64_t prev_lsn;           // Points to last operation
    uint8_t  record_type;        // COMMIT = 0x11
    uint64_t commit_timestamp;   // Official commit time
}

// ABORT record - marks rollback
struct AbortLogRecord {
    uint64_t lsn;
    uint32_t transaction_id;
    uint64_t prev_lsn;           // Points to last operation
    uint8_t  record_type;        // ABORT = 0x12
    uint64_t abort_timestamp;
    uint16_t abort_reason;       // Error code, if applicable
}

// END record - cleanup complete
struct EndLogRecord {
    uint64_t lsn;
    uint32_t transaction_id;
    uint64_t prev_lsn;           // Points to COMMIT or ABORT
    uint8_t  record_type;        // END = 0x13
}
```

The END record indicates that all cleanup for a transaction is complete—rollback is finished (if aborted) or all effects are stable (if committed). Until END is written, the transaction may still require recovery attention. END records enable log truncation, as no recovery work remains before this point for this transaction.
One of the most sophisticated log record types is the Compensation Log Record (CLR). CLRs are generated during rollback operations—either explicit rollback (ROLLBACK statement) or recovery rollback (undoing incomplete transactions after a crash).
Why Do We Need CLRs?
Consider this scenario: transaction T1 updates X from 100 to 200 (logged as an UPDATE record). T1 then issues a ROLLBACK, and the undo restores X to 100, but the system crashes before the rollback completes.
Without CLRs, when the system recovers, it sees an incomplete transaction T1 whose UPDATE record still appears to need undoing.
But X is already 100! If we undo blindly, we might corrupt data. CLRs solve this by recording "I already undid this operation."
CLRs make undo operations idempotent. No matter how many times recovery runs, applying CLRs correctly ensures the undo work is done exactly once. CLRs are redo-only records—they're never themselves undone.
CLR Structure:
CLRs have a special field called UndoNextLSN that points to the next log record that still needs to be undone. This allows recovery to skip operations that have already been compensated:
```
// Compensation Log Record (CLR)
struct CLR {
    uint64_t lsn;
    uint32_t transaction_id;
    uint64_t prev_lsn;          // Links in transaction chain
    uint8_t  record_type;       // CLR = 0x20

    // === Key CLR Fields ===
    uint64_t undo_next_lsn;     // Next record to undo (skip completed)
    uint64_t compensating_lsn;  // Which operation this undoes

    // === Redo Information ===
    // CLRs contain the REDO of the undo operation
    // (So if we crash again, we can redo the undo)
    uint32_t table_id;
    uint32_t page_id;
    uint16_t slot_number;
    byte[]   restored_value;    // The "before image" from original
}

// Example Scenario:
// LSN 001: T1 BEGIN
// LSN 002: T1 UPDATE X: 100 → 200
// LSN 003: T1 UPDATE Y: 50 → 75
// LSN 004: T1 UPDATE Z: 300 → 400
//
// T1 issues ROLLBACK:
// LSN 005: CLR for LSN 004 (undoing Z: 400 → 300), UndoNextLSN = 003
// LSN 006: CLR for LSN 003 (undoing Y: 75 → 50),   UndoNextLSN = 002
// LSN 007: CLR for LSN 002 (undoing X: 200 → 100), UndoNextLSN = 001
// LSN 008: T1 ABORT
//
// If crash after LSN 006:
// Recovery sees T1's last record is the CLR at LSN 006
// UndoNextLSN = 002, so only LSN 002 still needs undo
// Skip LSN 003 and 004 (already compensated)
```

CLR Properties:
CLRs are redo-only: During recovery, CLRs are redone (to restore the undo work) but never themselves undone. This prevents infinite loops.
UndoNextLSN enables skipping: The system doesn't re-undo operations that have already been undone. This is essential for correctness.
CLRs contain redo data only: Since an undo operation's "before image" is the after image of the original operation (which we're reverting to), CLRs only need redo information.
CLRs link to original: Each CLR references the original operation it compensates, providing a complete audit trail.
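Putting these properties together, the undo pass might look roughly like the following sketch; read_log_record, undo_one_change, and append_clr are hypothetical helpers, and the record layout is simplified:

```c
#include <stdint.h>

typedef struct UndoRecord {
    uint64_t lsn;
    uint64_t prev_lsn;       /* transaction chain                  */
    uint64_t undo_next_lsn;  /* meaningful only for CLRs           */
    int      is_clr;
    int      is_begin;
} UndoRecord;

extern UndoRecord *read_log_record(uint64_t lsn);                     /* hypothetical */
extern void        undo_one_change(UndoRecord *rec);                  /* hypothetical */
extern uint64_t    append_clr(UndoRecord *undone, uint64_t undo_next);/* hypothetical */

void undo_transaction(uint64_t last_lsn) {
    uint64_t lsn = last_lsn;
    while (lsn != 0) {
        UndoRecord *rec = read_log_record(lsn);
        if (rec->is_clr) {
            lsn = rec->undo_next_lsn;          /* skip work already compensated  */
        } else if (rec->is_begin) {
            break;                             /* reached BEGIN: nothing left    */
        } else {
            undo_one_change(rec);              /* reverse the modification       */
            append_clr(rec, rec->prev_lsn);    /* CLR's UndoNextLSN = PrevLSN    */
            lsn = rec->prev_lsn;               /* continue toward BEGIN          */
        }
    }
}
```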
Checkpoint records are system-level records that periodically capture the state of the database system. Unlike transaction records that describe individual operations, checkpoint records create efficient recovery starting points that dramatically reduce the amount of log that must be processed during crash recovery.
Why Checkpoints?
Without checkpoints, recovery would need to read the log from its very beginning, examining and potentially replaying every record written since the database was created.
This would make recovery take hours or days. Checkpoints solve this by recording:
| Field | Description | Recovery Use |
|---|---|---|
| Active Transaction Table | List of all transactions in progress at checkpoint time | Determines which transactions may need undo |
| Dirty Page Table | List of modified pages in buffer pool with their recovery LSN | Determines starting point for redo |
| Master Record Pointer | Points to the most recent successful checkpoint | Allows recovery to find starting point |
| Checkpoint LSN | The LSN of this checkpoint record | Establishes checkpoint position in log |
```
// Checkpoint Record Structure (ARIES-style)
struct CheckpointLogRecord {
    uint64_t lsn;
    uint8_t  record_type;        // CHECKPOINT_BEGIN or CHECKPOINT_END

    // === Active Transaction Table ===
    uint32_t num_active_transactions;
    struct ActiveTxEntry {
        uint32_t transaction_id;
        uint8_t  state;          // ACTIVE, PREPARED
        uint64_t last_lsn;       // Last LSN for this transaction
        uint64_t undo_next_lsn;  // Next LSN to undo if rollback needed
    } active_transactions[];

    // === Dirty Page Table ===
    uint32_t num_dirty_pages;
    struct DirtyPageEntry {
        uint32_t page_id;
        uint64_t recovery_lsn;   // First LSN that dirtied this page
    } dirty_pages[];
}

// Example at checkpoint time:
//
// Active Transactions:
//   T1: Active,   LastLSN=500, UndoNextLSN=500
//   T3: Active,   LastLSN=495, UndoNextLSN=480
//   T4: Prepared, LastLSN=498, UndoNextLSN=498
//
// Dirty Pages:
//   Page 42: RecLSN=450 (first dirty at LSN 450)
//   Page 58: RecLSN=480 (first dirty at LSN 480)
//   Page 99: RecLSN=495 (first dirty at LSN 495)
//
// Recovery will:
//   1. Start redo from min(RecLSN) = 450
//   2. Track T1, T3, T4 as potentially needing undo
//   3. Skip log records before LSN 450 for redo
```

Modern databases use 'fuzzy' checkpoints that write CHECKPOINT_BEGIN, continue normal operations, then write CHECKPOINT_END. This avoids blocking transactions during checkpointing. Recovery uses the CHECKPOINT_END record and handles any transactions that committed or aborted between BEGIN and END.
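As a small illustration of step 1 in the example above, the redo starting point is simply the minimum RecLSN in the checkpoint's dirty page table. This is a sketch, not any particular engine's code:

```c
#include <stdint.h>

typedef struct {
    uint32_t page_id;
    uint64_t recovery_lsn;   /* first LSN that dirtied this page */
} DirtyPageEntry;

/* Returns the LSN at which the redo pass should begin.
 * With an empty dirty page table, redo can start at the checkpoint itself. */
uint64_t redo_start_lsn(const DirtyPageEntry *dpt, uint32_t n, uint64_t checkpoint_lsn) {
    uint64_t start = checkpoint_lsn;
    for (uint32_t i = 0; i < n; i++) {
        if (dpt[i].recovery_lsn < start) {
            start = dpt[i].recovery_lsn;       /* min(RecLSN), e.g. 450 in the example */
        }
    }
    return start;
}
```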
Beyond data modifications and transaction control, databases generate various system and utility log records. These records support operations that aren't direct data changes but still need to be recoverable or auditable.
| Record Type | Purpose | Examples |
|---|---|---|
| DDL Records | Schema definition changes | CREATE TABLE, ALTER TABLE, DROP INDEX |
| Space Management | Storage allocation changes | Page allocation, extent allocation, file growth |
| Index Operations | Index structure modifications | Page splits, merges, rebalancing |
| LOB Operations | Large object handling | BLOB/CLOB storage, segmentation |
| Savepoint Records | Transaction partial rollback points | SAVEPOINT, ROLLBACK TO SAVEPOINT |
| Nested Transaction | Sub-transaction boundaries | Nested BEGIN/COMMIT/ROLLBACK |
Schema (DDL) Log Records:
Schema changes present unique challenges because they affect the metadata used to interpret data pages. For example:
DDL Log Record for ALTER TABLE:
```
{
  operation: ADD_COLUMN,
  table_id: 15,
  new_column: {
    name: 'email',
    type: VARCHAR(255),
    nullable: true,
    position: 5
  },
  catalog_page_modified: 1842,
  before_catalog_entry: [...],
  after_catalog_entry: [...]
}
```
DDL operations typically acquire exclusive locks and often force checkpoints to ensure schema consistency during recovery.
Space Management Records:
These records track physical storage changes: page and extent allocation and deallocation, free-space map updates, and data file growth.
These must be logged because recovery needs to restore the database's physical structure, not just its logical content.
Savepoint records allow applications to roll back to a specific point within a transaction without aborting the entire transaction. The log records mark these points so that partial undo can stop at the savepoint rather than undoing all operations.
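A minimal sketch of that partial undo, reusing the same backward PrevLSN walk but stopping at the savepoint's LSN; the helpers are hypothetical:

```c
#include <stdint.h>

typedef struct SpRecord {
    uint64_t lsn;
    uint64_t prev_lsn;
} SpRecord;

extern SpRecord *fetch_sp_record(uint64_t lsn);     /* hypothetical */
extern void      undo_sp_change(SpRecord *rec);     /* hypothetical */

/* ROLLBACK TO SAVEPOINT: undo changes newer than the savepoint record,
 * then stop; everything logged before the savepoint stays intact and
 * the transaction itself remains open. */
void rollback_to_savepoint(uint64_t last_lsn, uint64_t savepoint_lsn) {
    uint64_t lsn = last_lsn;
    while (lsn != 0 && lsn > savepoint_lsn) {       /* LSNs increase monotonically */
        SpRecord *rec = fetch_sp_record(lsn);
        undo_sp_change(rec);
        lsn = rec->prev_lsn;
    }
}
```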
Log record design involves careful trade-offs between information completeness and storage efficiency. Larger records provide more information but consume more disk space and I/O bandwidth. Production systems employ several optimization techniques: logging only the columns that actually changed, buffering records and writing them in batches, and combining multiple transactions' commit records into a single write (group commit).
Record Size Statistics:
In typical OLTP workloads:
| Record Type | Typical Size | Frequency |
|---|---|---|
| UPDATE (1-2 columns) | 50-150 bytes | Very High |
| INSERT (typical row) | 100-500 bytes | High |
| DELETE | 100-400 bytes | Medium |
| COMMIT | 30-50 bytes | High |
| BEGIN | 30-40 bytes | High |
| Checkpoint | 1KB-1MB | Low |
The log's sequential nature means that even small records benefit from batching and asynchronous writes. Modern systems buffer log records and write them in groups, reducing the number of disk I/O operations.
Despite optimizations, the COMMIT record remains a synchronization point—it must be durably written before acknowledging the commit. This 'log force' at commit is often the primary bottleneck in high-throughput systems. Group commit, where multiple transactions' commit records are written together, addresses this by amortizing the I/O cost.
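A rough sketch of the group-commit idea, with hypothetical append_to_log_buffer and flush_log_to_disk helpers standing in for the log manager:

```c
#include <stdint.h>

extern void append_to_log_buffer(uint32_t txn_id);   /* hypothetical: buffers a COMMIT record */
extern void flush_log_to_disk(void);                 /* hypothetical: one durable write (fsync) */

/* Accumulate several transactions' COMMIT records in the log buffer,
 * then make them all durable with a single flush. Only after the flush
 * returns may any of the n commits be acknowledged to clients. */
void commit_batch(const uint32_t *txn_ids, uint32_t n) {
    for (uint32_t i = 0; i < n; i++) {
        append_to_log_buffer(txn_ids[i]);    /* no I/O yet */
    }
    flush_log_to_disk();                     /* one fsync amortized over n commits */
}
```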
We've examined the building blocks that make up transaction logs, and each record type plays a specific role in enabling database recovery: data modification records carry the before and after images needed to undo and redo changes, transaction control records mark where transactions begin and how they end, compensation log records make rollback safe to repeat, and checkpoint records bound how much of the log recovery must examine.
What's Next:
Now that we understand the individual log record types, we'll examine how they're organized into different log types and structures. We'll explore the differences between undo-only, redo-only, and undo-redo logging, and understand how the choice of logging type affects recovery strategy.
You now understand the detailed structure and types of log records—the vocabulary that databases use to document their history. Next, we'll see how these records are organized into different logging approaches with varying trade-offs.