In the previous page, we stated the WAL rule: log records must be written to stable storage before data modifications. But why this order? Why can't we write data first and log second? Why can't we write them simultaneously?
The answer lies in a fundamental analysis of crash scenarios. When we systematically examine what can go wrong during writes, we discover that log-before-data is the only ordering that permits complete recovery in all failure cases. Any other ordering creates windows where recovery becomes impossible.
This page provides the rigorous reasoning behind this critical ordering constraint.
You will understand the formal reasoning behind log-before-data ordering, analyze all possible failure scenarios during writes, and see why alternative orderings lead to unrecoverable states. This deep understanding is essential for implementing correct recovery systems.
To understand why ordering matters, consider what happens during any write operation:
The Write Window:
When a transaction modifies data, two things eventually need to happen:

1. The log record describing the modification must reach stable storage.
2. The modified data page must reach disk.
These operations take time—typically milliseconds. A crash can occur at any moment within this window. We must ensure that no matter when the crash occurs, recovery is possible.
Let's enumerate all possible states when a crash can interrupt the write process:
| Scenario | Log Status | Data Page Status | Recovery Situation |
|---|---|---|---|
| Crash before any write | Not on disk | Not on disk | Transaction never happened — consistent |
| Crash after log, before data | On disk | Not on disk | Can REDO if committed, or ignore if not |
| Crash after log, after data | On disk | On disk | Nothing to do — already durable |
| Crash after data, before log | Not on disk | On disk | UNRECOVERABLE — can't undo or identify transaction |
The fourth scenario—data on disk without log—is catastrophic. Let's understand why:
Why Data-Before-Log Fails:
When data reaches disk without its log record, the database loses the ability to reason about that data. Was it from a committed transaction? An aborted one? We can never know. This isn't a recoverable error—it's permanent data corruption.
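The table above can be checked mechanically. Here is a minimal sketch (function and variable names are invented for illustration) that enumerates the four crash states and asks what recovery can do in each:

```python
# Enumerate the four crash states from the table above. The two booleans
# describe what survived the crash: the log record, and the data page.

def recovery_outcome(log_on_disk: bool, data_on_disk: bool) -> str:
    if not log_on_disk and not data_on_disk:
        return "consistent: transaction never happened"
    if log_on_disk and not data_on_disk:
        return "recoverable: REDO if committed, ignore otherwise"
    if log_on_disk and data_on_disk:
        return "consistent: already durable"
    return "UNRECOVERABLE: modified data on disk with no log record"

for log_ok in (False, True):
    for data_ok in (False, True):
        print(f"log={log_ok!s:5}  data={data_ok!s:5}  ->  "
              f"{recovery_outcome(log_ok, data_ok)}")
```

The only combination with no recovery path is data on disk without its log record, which is exactly the ordering WAL forbids.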
Let's formalize this analysis. Define:

- L(r): the event that log record r reaches stable storage
- D(p): the event that data page p reaches stable storage
- "precedes": event X precedes event Y if X is guaranteed to be durable before Y can occur

For any modification M that creates log record r and modifies data page p, we need to ensure that after any crash, we can return to a consistent state.
Theorem: For complete recoverability, L(r) must precede D(p) for all modifications.
Proof by case analysis. Assume L(r) precedes D(p), and consider every point at which a crash can fall:

- Crash before L(r): neither the log record nor the data is on disk. The modification simply never happened; the state is consistent.
- Crash after L(r) but before D(p): the log record is durable, the page is not. If the transaction committed, REDO applies the after-image stored in r; if not, nothing reached disk, so there is nothing to undo.
- Crash after D(p): both are durable. If the transaction committed, nothing remains to do; if not, UNDO applies the before-image stored in r.

In every case, the stable log contains exactly the information recovery needs.
Now consider the forbidden case with Data-Before-Log: suppose D(p) precedes L(r), and the crash falls between the two events. The modified page is on disk, but no log record describes it. Recovery cannot determine which transaction wrote the page, whether that transaction committed, or what value the page held before. There is no before-image to undo with and no after-image to validate against. The state is permanently ambiguous.
This analysis shows that log-before-data isn't a design preference—it's a mathematical necessity. Any system that writes data before logging creates a failure window where complete recovery is impossible. There is no workaround.
The theory is compelling, but let's examine real scenarios to understand the practical impact:
Scenario 1: Bank Transfer Failure
Consider a transfer of $500 from Account A to Account B:
```sql
-- Transaction T1: Transfer $500 from A to B
BEGIN TRANSACTION T1;

-- Step 1: Debit from Account A
UPDATE accounts SET balance = balance - 500
WHERE account_id = 'A';   -- A: 1000 -> 500

-- Step 2: Credit to Account B
UPDATE accounts SET balance = balance + 500
WHERE account_id = 'B';   -- B: 2000 -> 2500

COMMIT TRANSACTION T1;
```

With Proper Log-Before-Data:
```
<T1, START>
<T1, A.balance, 1000, 500>    ← Reaches stable storage
<T1, B.balance, 2000, 2500>   ← Reaches stable storage
<T1, COMMIT>                  ← Reaches stable storage
```

Crash at any point:

- Before `<T1, COMMIT>` is stable: recovery finds T1 without a commit record and UNDOes it using the before-images (A back to 1000, B back to 2000).
- After `<T1, COMMIT>` is stable: recovery REDOes T1 using the after-images (A = 500, B = 2500).

Either way, the $500 is fully accounted for.
With Data-Before-Log:

Imagine the system wrote the debit to Account A directly to disk BEFORE logging. A crash occurs. On restart: Account A shows $500 (debit applied), Account B shows $2000 (credit not applied), and no log record exists. $500 has vanished from the bank. This is not hypothetical—it happens in systems that violate WAL.
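The two orderings are easy to contrast in a toy simulation. This is a sketch, not a real engine: the dictionary stands in for durable pages, the list for the stable log, and all names are hypothetical:

```python
disk = {"A": 1000, "B": 2000}    # stand-in for durable data pages
log = []                         # stand-in for the stable log

def crash_and_recover():
    """Undo any update whose transaction has no COMMIT record in the log."""
    committed = {r[1] for r in log if r[0] == "COMMIT"}
    for rec in reversed(log):
        if rec[0] == "UPDATE" and rec[1] not in committed:
            _, _txn, key, before, _after = rec
            disk[key] = before   # restore the before-image

# Log-before-data: the debit's log record is durable before the page.
log.append(("UPDATE", "T1", "A", 1000, 500))   # (type, txn, key, before, after)
disk["A"] = 500                                 # debit reaches disk, then crash
crash_and_recover()
assert disk == {"A": 1000, "B": 2000}           # money fully restored

# Data-before-log: the debit reaches disk, crash occurs before logging.
log.clear()                                     # the log never saw the debit
disk["A"] = 500
crash_and_recover()                             # recovery has nothing to work with
assert disk["A"] == 500                         # the $500 has vanished
```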
Scenario 2: E-Commerce Order Corruption
Consider an order placement that:

1. Decrements the inventory count for the purchased item
2. Inserts the order record
3. Records the customer's payment
If #1 (inventory decrement) reaches disk without log, and the system crashes before #2 and #3: the stock count shows an item as sold, but no order or payment exists. There is no log record to undo the decrement, so the inventory is permanently wrong, and the discrepancy can never be traced back to any transaction.
Scenario 3: Financial Report Generation
Imagine end-of-day processing that:

1. Aggregates the day's transactions
2. Updates the general ledger balances
3. Generates the financial report
If step 2 (ledger update) reaches disk without log: a crash leaves the ledger holding partial totals that no log record describes. Reports generated after restart silently reflect corrupted balances, and there is no way to detect or reverse the damage.
We can express the WAL ordering constraint using the happens-before relationship (a concept from concurrent systems): For any operation O that modifies data page P and creates log record L:
```
// WAL Ordering Constraints as Happens-Before Relations

// Fundamental WAL constraint:
// the log record for operation O must reach stable storage
// before the data page modified by O can reach stable storage
Log(O) → Stable(Log(O)) → CanFlush(Page(O))

// For a transaction T with operations O1, O2, ..., On:
// each operation's log must be stable before its page can flush
∀ Oi ∈ T: Stable(Log(Oi)) → CanFlush(Page(Oi))

// The commit record must be stable before commit returns to the client
Stable(CommitRecord(T)) → CommitAck(T)

// For UNDO capability:
// must log the before-image before the page can reach disk
Log(BeforeImage(O)) → Stable(BeforeImage(O)) → CanFlush(Page(O))

// For REDO capability:
// must log the after-image before the commit can be durable
Log(AfterImage(O)) → Stable(AfterImage(O)) → Stable(CommitRecord(T))

// Combined constraint for complete recoverability:
∀ O ∈ T: Stable(Log(O)) → CanFlush(Page(O))
AND Stable(AllLogs(T)) → Stable(CommitRecord(T))
```

Understanding the Constraints:
Operation → Log → Disk(Log) → Can Disk(Data)
This chain ensures that for any modification:

- The change is described in the durable log before it can exist on disk.
- Recovery always finds an authoritative log record (with before- and after-images) for any change it encounters on disk.
Log Ordering Preserves Transaction Order
Within a transaction, log records must be written in operation order. If operation O₁ happens before O₂, then Log(O₁) must be stable before Log(O₂). This ensures recovery replays operations in the correct sequence.
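A small sketch of how this is typically achieved (names are hypothetical): LSNs are handed out under a lock in append order, so the durable log sequence is exactly the sequence recovery will replay:

```python
import threading

class LogManager:
    """Assigns strictly increasing LSNs, so log order equals operation order."""
    def __init__(self):
        self._lock = threading.Lock()
        self._next_lsn = 1
        self.records = []            # in-memory log buffer, kept in LSN order

    def append(self, record: dict) -> int:
        with self._lock:             # one appender at a time
            record["lsn"] = self._next_lsn
            self._next_lsn += 1
            self.records.append(record)
            return record["lsn"]

log = LogManager()
lsn1 = log.append({"txn": "T1", "op": "debit A"})
lsn2 = log.append({"txn": "T1", "op": "credit B"})
assert lsn1 < lsn2   # O1 before O2 implies Log(O1) before Log(O2)
```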
Commit is the Durability Point
The commit record is special: once it's stable, the transaction is guaranteed durable. The client can be told 'success'. But this means all operation logs must already be stable—otherwise, we'd have a committed transaction we can't fully redo.
In concurrent systems terms, the commit record reaching stable storage is the 'linearization point' of the transaction. Before: the transaction might or might not have happened. After: the transaction definitely happened and will survive any crash. This single-point transition is what makes ACID atomicity possible.
Understanding why log-before-data is necessary is one thing. Implementing it correctly is another. Here's how database engines enforce this constraint:
Page LSN Mechanism:
Every data page contains a Page LSN field—the LSN of the most recent log record that modified this page. When the buffer manager wants to flush a dirty page to disk, it checks:
```
class BufferManager:
    // Called when the buffer pool needs to evict a dirty page
    function flushPage(page: Page) -> bool:
        // CRITICAL: Enforce the WAL constraint.
        // The log must be flushed at least up to this page's LSN.
        if page.pageLSN > log.flushedLSN:
            // Cannot flush this data page yet!
            // The log records for its modifications haven't reached disk.
            log.flushUpTo(page.pageLSN)
            // Wait for the log flush to complete
            awaitLogFlush(page.pageLSN)

        // Now safe to write the data page:
        // all log records describing this page's changes are stable.
        disk.writePage(page.pageNumber, page.data)

        // Mark the page as clean in the buffer pool
        page.isDirty = false
        return true

class LogManager:
    // Track the highest LSN that has been flushed to disk
    flushedLSN: uint64 = 0

    // Buffer for log records not yet flushed
    logBuffer: List<LogRecord> = []

    function flushUpTo(targetLSN: uint64):
        if targetLSN <= self.flushedLSN:
            return  // Already flushed

        // Write all buffered log records up to targetLSN
        recordsToFlush = self.logBuffer.filter(r => r.LSN <= targetLSN)
        for record in recordsToFlush:
            disk.appendToLog(record.serialize())

        // CRITICAL: force to stable storage
        disk.fsync(logFile)

        // Update the flushed marker
        self.flushedLSN = targetLSN

        // Remove flushed records from the buffer
        self.logBuffer = self.logBuffer.filter(r => r.LSN > targetLSN)
```

Key Enforcement Points:
Page Modification Sets Page LSN
Whenever a transaction modifies a page, it first logs the change and gets an LSN. That LSN is recorded in the page header. The page is marked dirty.
Buffer Manager Checks Before Flush
Before writing any dirty page to disk, the buffer manager compares page.pageLSN with log.flushedLSN. If the page's LSN is higher, the log must be flushed first.
Forced Log Flush on Commit
When a transaction commits, its commit record must reach stable storage. This forces a log flush that includes all the transaction's log records.
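A sketch of that commit path in Python, reusing the hypothetical LogManager above; for simplicity it rewrites every buffered record rather than tracking a flushedLSN the way the earlier pseudocode does:

```python
import os

def commit(txn_id: str, log_manager, log_file) -> None:
    """Append the commit record, force the log, and only then acknowledge."""
    # The commit record receives the transaction's highest LSN, so flushing
    # up to it necessarily flushes every earlier record of the transaction.
    commit_lsn = log_manager.append({"txn": txn_id, "type": "COMMIT"})
    for rec in log_manager.records:
        if rec["lsn"] <= commit_lsn:
            log_file.write((repr(rec) + "\n").encode())
    log_file.flush()                 # drain userspace buffers
    os.fsync(log_file.fileno())      # force the OS cache to stable storage
    # Only now is it safe to report success to the client.
    print(f"{txn_id} committed (durable through LSN {commit_lsn})")

# usage sketch: commit("T1", log, open("wal.log", "ab"))
```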
The beauty of Page LSN is that enforcement is a single integer comparison per page flush. There's no expensive lookup or coordination. The buffer manager simply checks one number before writing. This minimal overhead is why WAL can achieve excellent performance while maintaining complete safety.
The log-before-data principle seems straightforward, but several edge cases require careful handling:
1. Log Record Spanning Disk Blocks
Log records vary in size. A large log record might span two disk blocks. What if power fails after writing the first block but before the second?
Solution: Each log record includes:

- A length field in its header, declaring the record's full size
- A checksum (e.g., CRC32) over the record's contents

A record that is shorter than its declared length, or whose checksum does not match, is treated as the end of the valid log:
```
function validateLogRecord(recordBytes: bytes) -> LogRecord | null:
    // Minimum header size check
    if len(recordBytes) < HEADER_SIZE:
        return null  // Incomplete header

    // Read the length from the header
    declaredLength = readUint32(recordBytes, offset=0)

    // Check that we have the complete record
    if len(recordBytes) < declaredLength:
        return null  // Incomplete record (crash during write)

    // Validate the checksum
    declaredChecksum = readUint32(recordBytes, offset=declaredLength-4)
    actualChecksum = computeCRC32(recordBytes[0:declaredLength-4])
    if declaredChecksum != actualChecksum:
        return null  // Corrupted record (torn write)

    // Parse and return the valid record
    return parseLogRecord(recordBytes[0:declaredLength])

function recoverFromLog():
    position = LOG_START
    lastValidLSN = 0

    while position < logFileSize:
        recordBytes = readBytes(position, MAX_RECORD_SIZE)
        record = validateLogRecord(recordBytes)

        if record == null:
            break  // End of valid log found

        lastValidLSN = record.LSN
        processRecoveryRecord(record)
        position += record.length

    // Truncate any garbage after the last valid record
    truncateLog(position)
```

2. Operating System Buffering
Writing to a file doesn't guarantee it reaches disk—the OS may buffer it. A plain write() can violate WAL because the record may still be sitting in the OS cache when we believe it is safely stored.
Solution: Use fsync() or equivalent after critical log writes:
- fsync(fd) — Forces the file to stable storage
- fdatasync(fd) — Forces data but not metadata (slightly faster)
- O_DIRECT — Bypasses the OS cache entirely
- O_DSYNC / O_SYNC — Synchronous writes
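In Python, for instance, a durable log append looks like this minimal sketch; the write() alone would leave the record in the page cache, and the fsync() is what actually discharges the WAL obligation:

```python
import os

def append_log_record(path: str, record: bytes) -> None:
    """Append one record and force it to stable storage before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, record)    # may land only in the OS page cache
        os.fsync(fd)            # blocks until the device reports durability
    finally:
        os.close(fd)

append_log_record("wal.log", b"<T1, A.balance, 1000, 500>\n")
# Only after fsync returns may the data page for this change be flushed.
```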
3. Disk Write Reordering

Modern disks have write caches and may reorder writes for efficiency. You write A then B, but B might reach the platters first.
Solution:

- Issue explicit cache-flush commands between ordered writes (fsync typically triggers one)
- Use write barriers, or set the FUA (Force Unit Access) flag so individual writes bypass the volatile cache
4. The First Write After Checkpoint

After a checkpoint, the next modification to a page is special. If that page is later torn by a crash, the log may hold no record of its 'before' state, because the checkpoint allowed earlier log records to be discarded.
Solution: PostgreSQL logs a full-page image on first modification after checkpoint. The entire page content serves as the before-image, guaranteeing recovery from any tear.
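A sketch of that rule (field names are invented; this is not PostgreSQL's actual code): on the first touch of a page after a checkpoint, log the whole page image instead of just the change:

```python
from dataclasses import dataclass

@dataclass
class Page:
    page_no: int
    data: bytearray
    page_lsn: int = 0    # LSN of the last log record that touched this page

def log_page_update(page: Page, delta: bytes, log_manager,
                    last_checkpoint_lsn: int) -> None:
    """Log a full-page image on the first modification after a checkpoint."""
    if page.page_lsn <= last_checkpoint_lsn:
        # First touch since the checkpoint: older log records may be gone,
        # so capture the entire page. A torn page can later be rebuilt from
        # this image plus any subsequent deltas.
        rec = {"type": "FULL_PAGE_IMAGE", "page_no": page.page_no,
               "image": bytes(page.data)}
    else:
        rec = {"type": "DELTA", "page_no": page.page_no, "delta": delta}
    page.page_lsn = log_manager.append(rec)  # assumes append() returns the LSN
```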
Some disks claim write completion before the data actually reaches the platters (so-called 'lying' drives). Databases assume fsync guarantees durability, but cheap hardware may violate this assumption. Enterprise SSDs and battery-backed controllers provide real guarantees. When durability matters, verify your hardware.
Let's prove that log-before-data, combined with proper commit handling, guarantees complete recoverability. We need to show:
Theorem: If a database system follows the WAL rule, then after any crash, the database can be restored to a consistent state reflecting all committed transactions and none of the uncommitted ones.
Definitions:

- D: the set of transactions whose COMMIT record is in the stable log (committed)
- U: the set of transactions with log records in the stable log but no COMMIT record (uncommitted)
- S: the on-disk state at the moment of the crash
- C: a consistent state reflecting every transaction in D and none in U
To Prove: A recovery algorithm exists that transforms S into C.
Proof:
Step 1: All committed transactions can be redone
For any committed transaction T ∈ D: the commit rule guarantees that all of T's log records reached stable storage before its COMMIT record did. The stable log therefore contains the after-image of every modification T made, so each one can be replayed, regardless of whether the corresponding data page reached disk before the crash.
Step 2: All uncommitted transactions can be undone
For any uncommitted transaction T ∈ U: the log-before-data rule guarantees that any page T modified could reach disk only after the log record describing that modification was stable. Every change of T's that is on disk therefore has a before-image in the stable log and can be reversed; changes that never reached disk require no action at all.
Step 3: Recovery algorithm
1. Scan log to determine D (committed) and U (uncommitted)
2. REDO: For each T ∈ D, replay all modifications (apply after-images)
3. UNDO: For each T ∈ U, rollback all modifications (apply before-images)
4. Result: State C where only committed transactions are reflected
QED
This proof shows that WAL isn't a 'best practice'—it's a mathematical guarantee. Given proper WAL implementation, recovery to a consistent state is always possible. No committed data is lost. No uncommitted data remains. This is the foundation of database reliability.
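Step 3 of the proof translates almost line for line into code. A minimal sketch, assuming a toy record format of (type, txn, key, before, after) tuples:

```python
def recover(log, disk):
    """Transform crash state S into consistent state C."""
    # Step 1: scan the log to partition transactions into D and U.
    committed = {r[1] for r in log if r[0] == "COMMIT"}          # the set D
    # Step 2: REDO committed transactions in log order (after-images).
    for rec in log:
        if rec[0] == "UPDATE" and rec[1] in committed:
            _, _txn, key, _before, after = rec
            disk[key] = after
    # Step 3: UNDO uncommitted transactions in reverse order (before-images).
    for rec in reversed(log):
        if rec[0] == "UPDATE" and rec[1] not in committed:
            _, _txn, key, before, _after = rec
            disk[key] = before
    return disk

log = [("UPDATE", "T1", "A", 1000, 500), ("COMMIT", "T1"),
       ("UPDATE", "T2", "B", 2000, 2500)]       # T2 never committed
print(recover(log, {"A": 1000, "B": 2500}))     # {'A': 500, 'B': 2000}
```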
We've explored the deep reasoning behind the log-before-data ordering. Let's consolidate the key insights:

- A crash between a data write and its log write leaves a state recovery cannot interpret; log-before-data is the only ordering that is safe at every instant.
- The Page LSN check (flush a page only when page.pageLSN ≤ log.flushedLSN) enforces the constraint with a single integer comparison.
- fsync, checksummed log records, and full-page images close the gaps left by OS buffering, torn writes, and disk reordering.
- The stable commit record is the single point at which a transaction becomes durable.
What's Next:
Now that we understand why log must precede data, the next page explores the undo information—the before-images that enable rolling back uncommitted transactions. We'll examine how undo records are structured, used, and optimized.
You now understand the formal reasoning behind log-before-data ordering. This isn't arbitrary—it's the only ordering that permits complete recovery from all failure scenarios. Next, we'll dive into undo information and how databases restore previous values.