Loading learning content...
Imagine a database processing thousands of financial transactions per second. Power fails mid-transaction. When the system restarts, how does it know which transactions completed, which partially executed, and which never started? How does it restore consistency without losing committed data or keeping uncommitted changes?
The answer lies in a deceptively simple principle that underpins every production database system: Write-Ahead Logging (WAL). This protocol—sometimes called journaling in file systems—is the fundamental mechanism that transforms databases from fragile data stores into resilient, recoverable systems.
The WAL rule is not merely an optimization or a convenience. It is an absolute requirement for crash recovery. Without it, durability guarantees collapse, and databases cannot reliably recover from failures.
By the end of this page, you will understand the Write-Ahead Logging rule in precise terms, why it is necessary for recovery, and the catastrophic consequences of violating it. You'll see how this single protocol enables both undo and redo operations—the two building blocks of crash recovery.
Before we can appreciate the WAL rule, we must understand the fundamental problem it solves. Consider what happens during normal database operation:
Two storage locations exist:
Databases cache frequently accessed data pages in main memory (the buffer pool) for performance. When transactions modify data, they first modify the in-memory copies. Eventually, these modified pages must be written back to disk—but exactly when this happens creates a critical problem.
This creates an apparently impossible situation. We want:
These requirements seem contradictory. If we don't immediately write committed changes to disk, how do we guarantee they survive crashes? If we do write them immediately, how do we maintain performance?
The WAL protocol resolves this tension through an elegant insight: What if we don't need to write the data immediately—only a record of what changed?
Writing a small, sequential log record is orders of magnitude faster than writing a large, random data page. If we log the change before allowing any modification to reach disk, we can reconstruct the change later—even if the data page never made it to disk.
The Write-Ahead Logging rule is stated with precision:
The WAL Rule: Before any in-place update to a data page is written to stable storage (disk), the corresponding log record must first be written to stable storage.
In simpler terms: Log first, then data.
This rule is sometimes expressed as two separate guarantees:
| Component | Formal Statement | Purpose |
|---|---|---|
| Undo Rule | Before any uncommitted modification is written to disk, the log record with the old value (undo information) must reach stable storage | Enables rolling back uncommitted transactions after crash |
| Redo Rule | Before a transaction commits, all log records describing its modifications (containing new values) must reach stable storage | Enables replaying committed transactions after crash |
The combination of these rules provides a complete recovery guarantee:
If a transaction commits: All its changes are logged to stable storage. Even if the data pages were never written to disk, recovery can reconstruct them.
If a transaction fails to commit: The undo information is on stable storage. Recovery can roll back any changes that may have reached disk prematurely.
Notice the elegance: We never need to force data pages to disk at any particular time. We only need to ensure the log maintains a complete history of all changes. Data pages can be written lazily, in whatever order is convenient for performance—because the log can always reconstruct them.
Log records are written sequentially to the end of a log file. This is drastically faster than random writes to data pages scattered across the disk. A single transaction modifying 100 random pages might require 100 random disk seeks. But logging those 100 modifications requires appending 100 small records sequentially—typically completing in the time of 1-2 random writes.
The term "write-ahead" is not arbitrary—it describes a strict ordering requirement. Let's understand why this order is critical by examining what happens if we violate it.
Scenario: Violating the WAL Rule
Consider a transaction T1 that updates a bank account balance from $1000 to $500:
T1: UPDATE accounts SET balance = 500 WHERE id = 'A001';
Without WAL, the sequence might be:
After restart, the database sees balance = $500 on disk. But there's no record of T1—no way to know if T1 committed or not. If T1 actually never committed before the crash, we now have:
Violating the WAL rule doesn't cause occasional errors—it makes recovery impossible. Once a modification reaches disk without its log record, that change cannot be undone. The database is permanently in an inconsistent state that no amount of sophisticated recovery logic can fix.
Now consider the correct WAL-compliant sequence:
<T1, accounts.A001.balance, 1000, 500> (transaction, item, old value, new value)After restart:
In both cases, the database returns to a consistent state. The key is that the log record—containing both old and new values—reached stable storage before any data modification could reach disk.
For the WAL rule to support complete recovery, log records must contain sufficient information for both undo and redo operations. A typical log record structure includes:
Essential Fields:
| Field | Description | Example |
|---|---|---|
| LSN | Log Sequence Number — unique, monotonically increasing identifier | 12847 |
| TransactionID | Identifier of the transaction that made this change | T1024 |
| PageID | Database page that was modified | Page#4521 |
| Offset | Location within the page | Byte 312 |
| Length | Size of the modified region | 8 bytes |
| Before-Image | Old value (for undo) | 1000 |
| After-Image | New value (for redo) | 500 |
| PrevLSN | Previous log record for this transaction (enables efficient scanning) | 12830 |
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647
// Different types of log records // UPDATE record - captures a data modificationstruct UpdateLogRecord { LSN: uint64 // Unique log sequence number Type: "UPDATE" // Record type identifier TransactionID: uint32 // Which transaction made this change PrevLSN: uint64 // Previous record for this transaction PageID: uint32 // Which page was modified Offset: uint16 // Position within the page BeforeImage: bytes // Old value (for UNDO) AfterImage: bytes // New value (for REDO)} // COMMIT record - marks transaction completionstruct CommitLogRecord { LSN: uint64 Type: "COMMIT" TransactionID: uint32 PrevLSN: uint64} // ABORT record - marks transaction rollbackstruct AbortLogRecord { LSN: uint64 Type: "ABORT" TransactionID: uint32 PrevLSN: uint64} // BEGIN record - marks transaction startstruct BeginLogRecord { LSN: uint64 Type: "BEGIN" TransactionID: uint32} // CLR (Compensation Log Record) - records an undo actionstruct CLRLogRecord { LSN: uint64 Type: "CLR" TransactionID: uint32 PrevLSN: uint64 UndoNextLSN: uint64 // Next record to undo (skips already undone) PageID: uint32 AfterImage: bytes // The result of the undo}Understanding the Before and After Images:
The before-image (also called undo information) and after-image (also called redo information) are the key to recovery:
Before-Image: The value that existed before this modification. If we need to undo this change (because the transaction didn't commit), we restore this value.
After-Image: The value after this modification. If we need to redo this change (because the transaction committed but the data didn't reach disk), we apply this value.
Some systems use physical logging (recording actual byte changes), while others use logical logging (recording the operation itself, like 'increment counter by 5'). Modern systems often use physiological logging—physical to the page, logical within the page—balancing recovery complexity with logging efficiency.
You might wonder: why store both old and new values? Couldn't we compute one from the other? The answer is recovery performance. During crash recovery, we may need to process millions of log records quickly. Computing the inverse of arbitrary operations would be complex and slow. Storing both values trades disk space for recovery speed—a worthwhile exchange.
The WAL rule tells us that log records must be written before data pages. But it doesn't tell us when data pages should be written. This leads to two policy decisions that significantly impact performance:
Question 1: Force Policy (at commit time)
When a transaction commits, do we force all its modified pages to disk?
Question 2: Steal Policy (before commit)
Can a transaction's modified pages be written to disk before the transaction commits?
| Policy | Performance | Recovery Needs | Used By |
|---|---|---|---|
| STEAL/NO-FORCE | Best — pages written only when buffer needs space | Needs both UNDO and REDO | Most modern systems (PostgreSQL, MySQL/InnoDB) |
| NO-STEAL/FORCE | Worst — locks memory, forces sync I/O | No recovery logging needed | Small embedded systems |
| STEAL/FORCE | Poor — forced writes at commit | Needs UNDO only | Rare |
| NO-STEAL/NO-FORCE | Moderate — memory pressure issues | Needs REDO only | Rare |
Why STEAL/NO-FORCE Dominates:
The STEAL/NO-FORCE combination provides the best runtime performance:
NO-FORCE means commits are fast — A transaction commits by writing only its log records (sequential I/O), not its data pages (random I/O). This can be 100x faster.
STEAL means buffer management is flexible — The buffer manager can evict any page when it needs space. It doesn't have to keep uncommitted pages pinned in memory, avoiding out-of-memory situations.
The cost is more complex recovery—we need both undo and redo capabilities. But since crashes are rare and performance matters constantly, this trade-off is universally preferred in production systems.
Without WAL, STEAL/NO-FORCE would be impossible. The log provides the safety net: STEAL is safe because we have undo information logged. NO-FORCE is safe because we have redo information logged. WAL is the foundation that makes high-performance databases possible.
Implementing the WAL rule correctly requires careful engineering. Several subtle issues can undermine the guarantee:
1. Page Tearing
Most file systems write in units smaller than database pages (often 4KB file system blocks vs 8KB or 16KB database pages). If power fails mid-write, you might have half an old page and half a new page—a torn page.
Solutions:
12345678910111213141516171819202122232425262728
// PostgreSQL-style full page image on first modification function modifyPage(page, modification, transactionLog): if page.needsFullPageImage: // First modification since last checkpoint // Log the entire page content before any change logRecord = new FullPageLogRecord( LSN: nextLSN(), TransactionID: currentTransaction(), PageID: page.id, FullPageContent: page.content.copy(), // Before-image of entire page Modification: modification ) writeToWAL(logRecord) page.needsFullPageImage = false else: // Normal logging - just the delta logRecord = new UpdateLogRecord( LSN: nextLSN(), TransactionID: currentTransaction(), PageID: page.id, BeforeImage: page.getAffectedBytes(modification), AfterImage: modification.newValue ) writeToWAL(logRecord) // Now safe to apply modification to in-memory page page.apply(modification)2. Ensuring Log Durability
Writing to the OS file system doesn't guarantee data reaches disk. The OS may buffer writes in its own cache. To truly satisfy WAL:
fsync() or equivalent to force log to disk3. Log Buffer Management
For performance, databases buffer log records in memory before flushing to disk:
Every major database implements WAL, though with variations in naming and implementation details:
PostgreSQL — Write-Ahead Log (WAL)
PostgreSQL coined the term 'WAL' and implements a sophisticated version:
pg_wal directory (formerly pg_xlog)MySQL/InnoDB — Redo Log
InnoDB uses a circular redo log:
Oracle — Redo Log
Oracle uses groups of redo log files:
| System | Log Name | Torn Page Solution | Replication Integration |
|---|---|---|---|
| PostgreSQL | WAL (Write-Ahead Log) | Full-page images | Streaming replication via WAL shipping |
| MySQL InnoDB | Redo Log | Double-write buffer | Binary log separate from redo log |
| Oracle | Redo Log | Redo log + Data Guard | Redo log shipping to standby |
| SQL Server | Transaction Log | Torn page detection + checksums | Log shipping or Always On |
| SQLite | WAL mode | WAL file + checkpoints | Litestream for replication |
Beyond Databases:
The WAL principle extends far beyond traditional databases:
Once you understand WAL, you see it everywhere—it's the universal pattern for reliable persistent state.
WAL isn't just a database feature—it's an architectural pattern. Any time you need reliable persistence with crash recovery, consider logging changes before applying them. It works for databases, file systems, distributed systems, and even application-level data management.
We've covered the foundational protocol that makes database recovery possible. Let's consolidate the key concepts:
What's Next:
Now that we understand the WAL rule as a protocol, the next page explores why logging must happen before data writes—examining the causality and timing constraints that make 'write-ahead' the only viable ordering for crash recovery.
You now understand the Write-Ahead Logging rule—the cornerstone protocol that guarantees database durability and atomicity. Every commercial and open-source database relies on this principle. Next, we'll explore the deeper reasons why log-before-data is the only correct ordering.