In the previous page, we stated the WAL rule: log records must be written to stable storage before data modifications. But why this order? Why can't we write data first and log second? Why can't we write them simultaneously?
The answer lies in a fundamental analysis of crash scenarios. When we systematically examine what can go wrong during writes, we discover that log-before-data is the only ordering that permits complete recovery in all failure cases. Any other ordering creates windows where recovery becomes impossible.
This page provides the rigorous reasoning behind this critical ordering constraint.
You will understand the formal reasoning behind log-before-data ordering, analyze all possible failure scenarios during writes, and see why alternative orderings lead to unrecoverable states. This deep understanding is essential for implementing correct recovery systems.
To understand why ordering matters, consider what happens during any write operation:
The Write Window:
When a transaction modifies data, two things eventually need to happen:

1. The log record describing the modification must reach stable storage.
2. The modified data page must reach disk.
These operations take time—typically milliseconds. A crash can occur at any moment within this window. We must ensure that no matter when the crash occurs, recovery is possible.
Let's enumerate all possible states when a crash can interrupt the write process:
| Scenario | Log Status | Data Page Status | Recovery Situation |
|---|---|---|---|
| Crash before any write | Not on disk | Not on disk | Transaction never happened — consistent |
| Crash after log, before data | On disk | Not on disk | Can REDO if committed, or ignore if not |
| Crash after log, after data | On disk | On disk | Nothing to do — already durable |
| Crash after data, before log | Not on disk | On disk | UNRECOVERABLE — can't undo or identify transaction |
The fourth scenario—data on disk without log—is catastrophic. Let's understand why:
Why Data-Before-Log Fails:
When data reaches disk without its log record, the database loses the ability to reason about that data. Was it from a committed transaction? An aborted one? We can never know. This isn't a recoverable error—it's permanent data corruption.
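The table above can be checked mechanically. Here is a minimal sketch (function and variable names are invented for illustration) that enumerates the four crash states and asks what recovery can do in each:

```python
# Enumerate the four crash states from the table above. The two booleans
# describe what survived the crash: the log record, and the data page.

def recovery_outcome(log_on_disk: bool, data_on_disk: bool) -> str:
    if not log_on_disk and not data_on_disk:
        return "consistent: transaction never happened"
    if log_on_disk and not data_on_disk:
        return "recoverable: REDO if committed, ignore otherwise"
    if log_on_disk and data_on_disk:
        return "consistent: already durable"
    return "UNRECOVERABLE: modified data on disk with no log record"

for log_ok in (False, True):
    for data_ok in (False, True):
        print(f"log={log_ok!s:5}  data={data_ok!s:5}  ->  "
              f"{recovery_outcome(log_ok, data_ok)}")
```

The only combination with no recovery path is data on disk without its log record, which is exactly the ordering WAL forbids.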
Let's formalize this analysis. Define:

- L(r): the event that log record r reaches stable storage
- D(p): the event that data page p reaches stable storage
- "precedes": event X precedes event Y if X is guaranteed to be durable before Y can occur

For any modification M that creates log record r and modifies data page p, we need to ensure that after any crash, we can return to a consistent state.
Theorem: For complete recoverability, L(r) must precede D(p) for all modifications.
Proof by case analysis. Assume L(r) precedes D(p), and consider every point at which a crash can fall:

- Crash before L(r): neither the log record nor the data is on disk. The modification simply never happened; the state is consistent.
- Crash after L(r) but before D(p): the log record is durable, the page is not. If the transaction committed, REDO applies the after-image stored in r; if not, nothing reached disk, so there is nothing to undo.
- Crash after D(p): both are durable. If the transaction committed, nothing remains to do; if not, UNDO applies the before-image stored in r.

In every case, the stable log contains exactly the information recovery needs.
Now consider the forbidden case with Data-Before-Log: suppose D(p) precedes L(r), and the crash falls between the two events. The modified page is on disk, but no log record describes it. Recovery cannot determine which transaction wrote the page, whether that transaction committed, or what value the page held before. There is no before-image to undo with and no after-image to validate against. The state is permanently ambiguous.
This analysis shows that log-before-data isn't a design preference—it's a mathematical necessity. Any system that writes data before logging creates a failure window where complete recovery is impossible. There is no workaround.
The theory is compelling, but let's examine real scenarios to understand the practical impact:
Scenario 1: Bank Transfer Failure
Consider a transfer of $500 from Account A to Account B:
```sql
-- Transaction T1: Transfer $500 from A to B
BEGIN TRANSACTION T1;

-- Step 1: Debit from Account A
UPDATE accounts SET balance = balance - 500
WHERE account_id = 'A';   -- A: 1000 -> 500

-- Step 2: Credit to Account B
UPDATE accounts SET balance = balance + 500
WHERE account_id = 'B';   -- B: 2000 -> 2500

COMMIT TRANSACTION T1;
```

With Proper Log-Before-Data:
```
<T1, START>
<T1, A.balance, 1000, 500>    ← Reaches stable storage
<T1, B.balance, 2000, 2500>   ← Reaches stable storage
<T1, COMMIT>                  ← Reaches stable storage
```

Crash at any point:

- Before `<T1, COMMIT>` is stable: recovery finds T1 without a commit record and UNDOes it using the before-images (A back to 1000, B back to 2000).
- After `<T1, COMMIT>` is stable: recovery REDOes T1 using the after-images (A = 500, B = 2500).

Either way, the $500 is fully accounted for.
With Data-Before-Log:

Imagine the system wrote the debit to Account A directly to disk BEFORE logging. A crash occurs. On restart: Account A shows $500 (debit applied), Account B shows $2000 (credit not applied), and no log record exists. $500 has vanished from the bank. This is not hypothetical—it happens in systems that violate WAL.
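The two orderings are easy to contrast in a toy simulation. This is a sketch, not a real engine: the dictionary stands in for durable pages, the list for the stable log, and all names are hypothetical:

```python
disk = {"A": 1000, "B": 2000}    # stand-in for durable data pages
log = []                         # stand-in for the stable log

def crash_and_recover():
    """Undo any update whose transaction has no COMMIT record in the log."""
    committed = {r[1] for r in log if r[0] == "COMMIT"}
    for rec in reversed(log):
        if rec[0] == "UPDATE" and rec[1] not in committed:
            _, _txn, key, before, _after = rec
            disk[key] = before   # restore the before-image

# Log-before-data: the debit's log record is durable before the page.
log.append(("UPDATE", "T1", "A", 1000, 500))   # (type, txn, key, before, after)
disk["A"] = 500                                 # debit reaches disk, then crash
crash_and_recover()
assert disk == {"A": 1000, "B": 2000}           # money fully restored

# Data-before-log: the debit reaches disk, crash occurs before logging.
log.clear()                                     # the log never saw the debit
disk["A"] = 500
crash_and_recover()                             # recovery has nothing to work with
assert disk["A"] == 500                         # the $500 has vanished
```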
Scenario 2: E-Commerce Order Corruption
Consider an order placement that:

1. Decrements the inventory count for the purchased item
2. Inserts the order record
3. Records the customer's payment
If #1 (inventory decrement) reaches disk without log, and the system crashes before #2 and #3: the stock count shows an item as sold, but no order or payment exists. There is no log record to undo the decrement, so the inventory is permanently wrong, and the discrepancy can never be traced back to any transaction.
Scenario 3: Financial Report Generation
Imagine end-of-day processing that:

1. Aggregates the day's transactions
2. Updates the general ledger balances
3. Generates the financial report
If step 2 (ledger update) reaches disk without log: a crash leaves the ledger holding partial totals that no log record describes. Reports generated after restart silently reflect corrupted balances, and there is no way to detect or reverse the damage.
We can express the WAL ordering constraint using the happens-before relationship (a concept from concurrent systems): For any operation O that modifies data page P and creates log record L:
```
// WAL Ordering Constraints as Happens-Before Relations

// Fundamental WAL constraint:
// the log record for operation O must reach stable storage
// before the data page modified by O can reach stable storage
Log(O) → Stable(Log(O)) → CanFlush(Page(O))

// For a transaction T with operations O1, O2, ..., On:
// each operation's log must be stable before its page can flush
∀ Oi ∈ T: Stable(Log(Oi)) → CanFlush(Page(Oi))

// The commit record must be stable before commit returns to the client
Stable(CommitRecord(T)) → CommitAck(T)

// For UNDO capability:
// must log the before-image before the page can reach disk
Log(BeforeImage(O)) → Stable(BeforeImage(O)) → CanFlush(Page(O))

// For REDO capability:
// must log the after-image before the commit can be durable
Log(AfterImage(O)) → Stable(AfterImage(O)) → Stable(CommitRecord(T))

// Combined constraint for complete recoverability:
∀ O ∈ T: Stable(Log(O)) → CanFlush(Page(O))
AND Stable(AllLogs(T)) → Stable(CommitRecord(T))
```

Understanding the Constraints:
Operation → Log → Disk(Log) → Can Disk(Data)
This chain ensures that for any modification:

- The change is described in the durable log before it can exist on disk.
- Recovery always finds an authoritative log record (with before- and after-images) for any change it encounters on disk.
Log Ordering Preserves Transaction Order
Within a transaction, log records must be written in operation order. If operation O₁ happens before O₂, then Log(O₁) must be stable before Log(O₂). This ensures recovery replays operations in the correct sequence.
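A small sketch of how this is typically achieved (names are hypothetical): LSNs are handed out under a lock in append order, so the durable log sequence is exactly the sequence recovery will replay:

```python
import threading

class LogManager:
    """Assigns strictly increasing LSNs, so log order equals operation order."""
    def __init__(self):
        self._lock = threading.Lock()
        self._next_lsn = 1
        self.records = []            # in-memory log buffer, kept in LSN order

    def append(self, record: dict) -> int:
        with self._lock:             # one appender at a time
            record["lsn"] = self._next_lsn
            self._next_lsn += 1
            self.records.append(record)
            return record["lsn"]

log = LogManager()
lsn1 = log.append({"txn": "T1", "op": "debit A"})
lsn2 = log.append({"txn": "T1", "op": "credit B"})
assert lsn1 < lsn2   # O1 before O2 implies Log(O1) before Log(O2)
```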
Commit is the Durability Point
The commit record is special: once it's stable, the transaction is guaranteed durable. The client can be told 'success'. But this means all operation logs must already be stable—otherwise, we'd have a committed transaction we can't fully redo.
In concurrent systems terms, the commit record reaching stable storage is the 'linearization point' of the transaction. Before: the transaction might or might not have happened. After: the transaction definitely happened and will survive any crash. This single-point transition is what makes ACID atomicity possible.
Understanding why log-before-data is necessary is one thing. Implementing it correctly is another. Here's how database engines enforce this constraint:
Page LSN Mechanism:
Every data page contains a Page LSN field—the LSN of the most recent log record that modified this page. When the buffer manager wants to flush a dirty page to disk, it checks:
```
class BufferManager:
    // Called when the buffer pool needs to evict a dirty page
    function flushPage(page: Page) -> bool:
        // CRITICAL: Enforce the WAL constraint.
        // The log must be flushed at least up to this page's LSN.
        if page.pageLSN > log.flushedLSN:
            // Cannot flush this data page yet!
            // The log records for its modifications haven't reached disk.
            log.flushUpTo(page.pageLSN)
            // Wait for the log flush to complete
            awaitLogFlush(page.pageLSN)

        // Now safe to write the data page:
        // all log records describing this page's changes are stable.
        disk.writePage(page.pageNumber, page.data)

        // Mark the page as clean in the buffer pool
        page.isDirty = false
        return true

class LogManager:
    // Track the highest LSN that has been flushed to disk
    flushedLSN: uint64 = 0

    // Buffer for log records not yet flushed
    logBuffer: List<LogRecord> = []

    function flushUpTo(targetLSN: uint64):
        if targetLSN <= self.flushedLSN:
            return  // Already flushed

        // Write all buffered log records up to targetLSN
        recordsToFlush = self.logBuffer.filter(r => r.LSN <= targetLSN)
        for record in recordsToFlush:
            disk.appendToLog(record.serialize())

        // CRITICAL: force to stable storage
        disk.fsync(logFile)

        // Update the flushed marker
        self.flushedLSN = targetLSN

        // Remove flushed records from the buffer
        self.logBuffer = self.logBuffer.filter(r => r.LSN > targetLSN)
```

Key Enforcement Points:
Page Modification Sets Page LSN
Whenever a transaction modifies a page, it first logs the change and gets an LSN. That LSN is recorded in the page header. The page is marked dirty.
Buffer Manager Checks Before Flush
Before writing any dirty page to disk, the buffer manager compares page.pageLSN with log.flushedLSN. If the page's LSN is higher, the log must be flushed first.
Forced Log Flush on Commit
When a transaction commits, its commit record must reach stable storage. This forces a log flush that includes all the transaction's log records.
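A sketch of that commit path in Python, reusing the hypothetical LogManager above; for simplicity it rewrites every buffered record rather than tracking a flushedLSN the way the earlier pseudocode does:

```python
import os

def commit(txn_id: str, log_manager, log_file) -> None:
    """Append the commit record, force the log, and only then acknowledge."""
    # The commit record receives the transaction's highest LSN, so flushing
    # up to it necessarily flushes every earlier record of the transaction.
    commit_lsn = log_manager.append({"txn": txn_id, "type": "COMMIT"})
    for rec in log_manager.records:
        if rec["lsn"] <= commit_lsn:
            log_file.write((repr(rec) + "\n").encode())
    log_file.flush()                 # drain userspace buffers
    os.fsync(log_file.fileno())      # force the OS cache to stable storage
    # Only now is it safe to report success to the client.
    print(f"{txn_id} committed (durable through LSN {commit_lsn})")

# usage sketch: commit("T1", log, open("wal.log", "ab"))
```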
The beauty of Page LSN is that enforcement is a single integer comparison per page flush. There's no expensive lookup or coordination. The buffer manager simply checks one number before writing. This minimal overhead is why WAL can achieve excellent performance while maintaining complete safety.
The log-before-data principle seems straightforward, but several edge cases require careful handling:
1. Log Record Spanning Disk Blocks
Log records vary in size. A large log record might span two disk blocks. What if power fails after writing the first block but before the second?
Solution: Each log record includes:

- A length field in its header, declaring the record's full size
- A checksum (e.g., CRC32) over the record's contents

A record that is shorter than its declared length, or whose checksum does not match, is treated as the end of the valid log:
```
function validateLogRecord(recordBytes: bytes) -> LogRecord | null:
    // Minimum header size check
    if len(recordBytes) < HEADER_SIZE:
        return null  // Incomplete header

    // Read the length from the header
    declaredLength = readUint32(recordBytes, offset=0)

    // Check that we have the complete record
    if len(recordBytes) < declaredLength:
        return null  // Incomplete record (crash during write)

    // Validate the checksum
    declaredChecksum = readUint32(recordBytes, offset=declaredLength-4)
    actualChecksum = computeCRC32(recordBytes[0:declaredLength-4])
    if declaredChecksum != actualChecksum:
        return null  // Corrupted record (torn write)

    // Parse and return the valid record
    return parseLogRecord(recordBytes[0:declaredLength])

function recoverFromLog():
    position = LOG_START
    lastValidLSN = 0

    while position < logFileSize:
        recordBytes = readBytes(position, MAX_RECORD_SIZE)
        record = validateLogRecord(recordBytes)

        if record == null:
            break  // End of valid log found

        lastValidLSN = record.LSN
        processRecoveryRecord(record)
        position += record.length

    // Truncate any garbage after the last valid record
    truncateLog(position)
```

2. Operating System Buffering
Writing to a file doesn't guarantee it reaches disk—the OS may buffer it. A plain write() can violate WAL because the record may still be sitting in the OS cache when we believe it is safely stored.
Solution: Use fsync() or equivalent after critical log writes:
- fsync(fd) — Forces the file to stable storage
- fdatasync(fd) — Forces data but not metadata (slightly faster)
- O_DIRECT — Bypasses the OS cache entirely
- O_DSYNC / O_SYNC — Synchronous writes
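In Python, for instance, a durable log append looks like this minimal sketch; the write() alone would leave the record in the page cache, and the fsync() is what actually discharges the WAL obligation:

```python
import os

def append_log_record(path: str, record: bytes) -> None:
    """Append one record and force it to stable storage before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, record)    # may land only in the OS page cache
        os.fsync(fd)            # blocks until the device reports durability
    finally:
        os.close(fd)

append_log_record("wal.log", b"<T1, A.balance, 1000, 500>\n")
# Only after fsync returns may the data page for this change be flushed.
```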
3. Disk Write Reordering

Modern disks have write caches and may reorder writes for efficiency. You write A then B, but B might reach the platters first.
Solution:

- Issue explicit cache-flush commands between ordered writes (fsync typically triggers one)
- Use write barriers, or set the FUA (Force Unit Access) flag so individual writes bypass the volatile cache
4. The First Write After Checkpoint

After a checkpoint, the next modification to a page is special. If that page is later torn by a crash, the log may hold no record of its 'before' state, because the checkpoint allowed earlier log records to be discarded.
Solution: PostgreSQL logs a full-page image on first modification after checkpoint. The entire page content serves as the before-image, guaranteeing recovery from any tear.
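A sketch of that rule (field names are invented; this is not PostgreSQL's actual code): on the first touch of a page after a checkpoint, log the whole page image instead of just the change:

```python
from dataclasses import dataclass

@dataclass
class Page:
    page_no: int
    data: bytearray
    page_lsn: int = 0    # LSN of the last log record that touched this page

def log_page_update(page: Page, delta: bytes, log_manager,
                    last_checkpoint_lsn: int) -> None:
    """Log a full-page image on the first modification after a checkpoint."""
    if page.page_lsn <= last_checkpoint_lsn:
        # First touch since the checkpoint: older log records may be gone,
        # so capture the entire page. A torn page can later be rebuilt from
        # this image plus any subsequent deltas.
        rec = {"type": "FULL_PAGE_IMAGE", "page_no": page.page_no,
               "image": bytes(page.data)}
    else:
        rec = {"type": "DELTA", "page_no": page.page_no, "delta": delta}
    page.page_lsn = log_manager.append(rec)  # assumes append() returns the LSN
```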
Some disks claim write completion before the data actually reaches the platters (so-called 'lying' drives). Databases assume fsync guarantees durability, but cheap hardware may violate this assumption. Enterprise SSDs and battery-backed controllers provide real guarantees. When durability matters, verify your hardware.
Let's prove that log-before-data, combined with proper commit handling, guarantees complete recoverability. We need to show:
Theorem: If a database system follows the WAL rule, then after any crash, the database can be restored to a consistent state reflecting all committed transactions and none of the uncommitted ones.
Definitions:

- D: the set of transactions whose COMMIT record is in the stable log (committed)
- U: the set of transactions with log records in the stable log but no COMMIT record (uncommitted)
- S: the on-disk state at the moment of the crash
- C: a consistent state reflecting every transaction in D and none in U
To Prove: A recovery algorithm exists that transforms S into C.
Proof:
Step 1: All committed transactions can be redone
For any committed transaction T ∈ D: the commit rule guarantees that all of T's log records reached stable storage before its COMMIT record did. The stable log therefore contains the after-image of every modification T made, so each one can be replayed, regardless of whether the corresponding data page reached disk before the crash.
Step 2: All uncommitted transactions can be undone
For any uncommitted transaction T ∈ U: the log-before-data rule guarantees that any page T modified could reach disk only after the log record describing that modification was stable. Every change of T's that is on disk therefore has a before-image in the stable log and can be reversed; changes that never reached disk require no action at all.
Step 3: Recovery algorithm
1. Scan log to determine D (committed) and U (uncommitted)
2. REDO: For each T ∈ D, replay all modifications (apply after-images)
3. UNDO: For each T ∈ U, rollback all modifications (apply before-images)
4. Result: State C where only committed transactions are reflected
QED
This proof shows that WAL isn't a 'best practice'—it's a mathematical guarantee. Given proper WAL implementation, recovery to a consistent state is always possible. No committed data is lost. No uncommitted data remains. This is the foundation of database reliability.
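Step 3 of the proof translates almost line for line into code. A minimal sketch, assuming a toy record format of (type, txn, key, before, after) tuples:

```python
def recover(log, disk):
    """Transform crash state S into consistent state C."""
    # Step 1: scan the log to partition transactions into D and U.
    committed = {r[1] for r in log if r[0] == "COMMIT"}          # the set D
    # Step 2: REDO committed transactions in log order (after-images).
    for rec in log:
        if rec[0] == "UPDATE" and rec[1] in committed:
            _, _txn, key, _before, after = rec
            disk[key] = after
    # Step 3: UNDO uncommitted transactions in reverse order (before-images).
    for rec in reversed(log):
        if rec[0] == "UPDATE" and rec[1] not in committed:
            _, _txn, key, before, _after = rec
            disk[key] = before
    return disk

log = [("UPDATE", "T1", "A", 1000, 500), ("COMMIT", "T1"),
       ("UPDATE", "T2", "B", 2000, 2500)]       # T2 never committed
print(recover(log, {"A": 1000, "B": 2500}))     # {'A': 500, 'B': 2000}
```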
We've explored the deep reasoning behind the log-before-data ordering. Let's consolidate the key insights:

- A crash between a data write and its log write leaves a state recovery cannot interpret; log-before-data is the only ordering that is safe at every instant.
- The Page LSN check (flush a page only when page.pageLSN ≤ log.flushedLSN) enforces the constraint with a single integer comparison.
- fsync, checksummed log records, and full-page images close the gaps left by OS buffering, torn writes, and disk reordering.
- The stable commit record is the single point at which a transaction becomes durable.
What's Next:
Now that we understand why log must precede data, the next page explores the undo information—the before-images that enable rolling back uncommitted transactions. We'll examine how undo records are structured, used, and optimized.
You now understand the formal reasoning behind log-before-data ordering. This isn't arbitrary—it's the only ordering that permits complete recovery from all failure scenarios. Next, we'll dive into undo information and how databases restore previous values.