Write-Ahead Logging (WAL)

Level: Intermediate

Duration: 60 mins

WAL Rule

The Protocol That Guarantees Recovery

Imagine a database processing thousands of financial transactions per second. Power fails mid-transaction. When the system restarts, how does it know which transactions completed, which partially executed, and which never started? How does it restore consistency without losing committed data or keeping uncommitted changes?

The answer lies in a deceptively simple principle that underpins most production database systems: Write-Ahead Logging (WAL). This protocol, closely related to journaling in file systems, is the fundamental mechanism that turns databases from fragile data stores into resilient, recoverable systems.

The WAL rule is not merely an optimization or a convenience. For any system that updates data pages in place, it is the requirement that makes crash recovery possible. Without it, atomicity and durability guarantees collapse, and the database cannot reliably recover from failures.

What You Will Learn

By the end of this page, you will understand the Write-Ahead Logging rule in precise terms, why it is necessary for recovery, and the catastrophic consequences of violating it. You'll see how this single protocol enables both undo and redo operations—the two building blocks of crash recovery.

The Recovery Problem

Before we can appreciate the WAL rule, we must understand the fundamental problem it solves. Consider what happens during normal database operation:

Two storage locations exist:

Main Memory (Volatile) — Fast access, but contents are lost on power failure
Disk Storage (Non-Volatile) — Slow access, but contents survive power failures

Databases cache frequently accessed data pages in main memory (the buffer pool) for performance. When transactions modify data, they first modify the in-memory copies. Eventually, these modified pages must be written back to disk—but exactly when this happens creates a critical problem.

The Core Dilemma

• If we write modified pages to disk immediately: Performance suffers catastrophically. Every small update requires disk I/O, destroying throughput.
• If we delay writing to disk: We risk losing committed transaction effects if the system crashes before the write completes.
• If uncommitted changes reach disk: A crash may leave the database in an inconsistent state with partial transactions persisted.

This creates an apparently impossible situation. We want:

High performance — Minimize disk I/O by batching and deferring writes
Durability — Committed transactions must survive crashes
Atomicity — Uncommitted transactions must be completely undone after crashes

These requirements seem contradictory. If we don't immediately write committed changes to disk, how do we guarantee they survive crashes? If we do write them immediately, how do we maintain performance?

The WAL protocol resolves this tension through an elegant insight: What if we don't need to write the data immediately—only a record of what changed?

The Key Insight

Writing a small, sequential log record is orders of magnitude faster than writing a large, random data page. If we log the change before allowing any modification to reach disk, we can reconstruct the change later—even if the data page never made it to disk.

The WAL Rule Defined

The Write-Ahead Logging rule is stated with precision:

The WAL Rule: Before any in-place update to a data page is written to stable storage (disk), the corresponding log record must first be written to stable storage.

In simpler terms: Log first, then data.

This rule is sometimes expressed as two separate guarantees:

The Two Components of the WAL Rule
Component	Formal Statement	Purpose
Undo Rule	Before any uncommitted modification is written to disk, the log record with the old value (undo information) must reach stable storage	Enables rolling back uncommitted transactions after crash
Redo Rule	Before a transaction commits, all log records describing its modifications (containing new values) must reach stable storage	Enables replaying committed transactions after crash

The combination of these rules provides a complete recovery guarantee:

If a transaction commits: All its changes are logged to stable storage. Even if the data pages were never written to disk, recovery can reconstruct them.
If a transaction fails to commit: The undo information is on stable storage. Recovery can roll back any changes that may have reached disk prematurely.

Notice the elegance: We never need to force data pages to disk at any particular time. We only need to ensure the log maintains a complete history of all changes. Data pages can be written lazily, in whatever order is convenient for performance—because the log can always reconstruct them.
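The rule can be enforced mechanically at the point where a page leaves the buffer pool. The following is a minimal Python sketch under stated assumptions, not any particular engine's implementation: `Log`, `BufferPool`, `page_lsn`, and `flushed_lsn` are hypothetical names introduced for illustration, and "disk" is just a dictionary. Each page remembers the LSN of the last log record that modified it, and the buffer pool flushes the log up to that LSN before it writes the page.

```python
class Log:
    """In-memory stand-in for a log; 'stable' holds records that reached disk."""

    def __init__(self):
        self.buffer = []       # records appended but not yet durable
        self.stable = []       # records considered to be on stable storage
        self.next_lsn = 1
        self.flushed_lsn = 0   # highest LSN known to be durable

    def append(self, record):
        lsn = self.next_lsn
        self.next_lsn += 1
        self.buffer.append((lsn, record))
        return lsn

    def flush(self, up_to_lsn):
        """Move every buffered record with LSN <= up_to_lsn to stable storage."""
        still_buffered = []
        for lsn, record in self.buffer:
            if lsn <= up_to_lsn:
                self.stable.append((lsn, record))
                self.flushed_lsn = lsn
            else:
                still_buffered.append((lsn, record))
        self.buffer = still_buffered


class BufferPool:
    def __init__(self, log):
        self.log = log
        self.pages = {}   # page_id -> (value, page_lsn) held in memory
        self.disk = {}    # page_id -> value on the simulated disk

    def update(self, txn, page_id, new_value):
        old_value, _ = self.pages.get(page_id, (self.disk.get(page_id), 0))
        lsn = self.log.append((txn, page_id, old_value, new_value))
        self.pages[page_id] = (new_value, lsn)

    def write_page(self, page_id):
        value, page_lsn = self.pages[page_id]
        # WAL rule: the log must be durable up to page_lsn before the page is.
        if self.log.flushed_lsn < page_lsn:
            self.log.flush(page_lsn)
        self.disk[page_id] = value


log = Log()
pool = BufferPool(log)
pool.disk["A001"] = 1000
pool.update("T1", "A001", 500)
pool.write_page("A001")   # forces the log record out first, then the page
```

The design choice worth noting is that the check lives in `write_page`, not in `update`: in-memory modifications are unrestricted, and only the write to stable storage is ordered behind the log.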

Why Sequential Logging is Fast

Log records are appended sequentially to the end of a log file. This is much faster than random writes to data pages scattered across the disk, especially on spinning disks, where every random write pays for a seek. A single transaction modifying 100 random pages might require 100 random writes. Logging those 100 modifications means appending 100 small records sequentially, which typically completes in about the time of one or two random writes.

Why 'Write-Ahead' Matters

The term "write-ahead" is not arbitrary—it describes a strict ordering requirement. Let's understand why this order is critical by examining what happens if we violate it.

Scenario: Violating the WAL Rule

Consider a transaction T1 that updates a bank account balance from $1000 to $500:

T1: UPDATE accounts SET balance = 500 WHERE id = 'A001';

Without WAL, the sequence might be:

1. T1 modifies the data page in memory (balance → $500)
2. The buffer manager writes the modified page to disk
3. System crashes
4. T1's log record is never written

After restart, the database sees balance = $500 on disk. But there's no record of T1—no way to know if T1 committed or not. If T1 actually never committed before the crash, we now have:

$500 deducted without a corresponding credit anywhere
No way to recover the original $1000 balance
Database is permanently corrupted

Catastrophic Failure

Violating the WAL rule doesn't cause occasional errors—it makes recovery impossible. Once a modification reaches disk without its log record, that change cannot be undone. The database is permanently in an inconsistent state that no amount of sophisticated recovery logic can fix.

Now consider the correct WAL-compliant sequence:

1. T1 modifies the data page in memory (balance → $500)
2. Before the data page is written to disk, the log record is written:
   - <T1, accounts.A001.balance, 1000, 500> (transaction, item, old value, new value)
3. Log record reaches stable storage ✓
4. Data page can now be written to disk (whenever convenient)
5. System crashes (at any point after step 3)

After restart:

If T1 committed before crash: The commit record is in the log. Recovery can redo the update if the data page didn't make it to disk.
If T1 did not commit: The log shows an incomplete transaction. Recovery can undo the update if the data page did make it to disk.

In both cases, the database returns to a consistent state. The key is that the log record—containing both old and new values—reached stable storage before any data modification could reach disk.
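The two restart cases can be traced with a toy recovery routine. This is a minimal sketch, assuming a log that is a list of tuples and a "disk" that is a dictionary; the `recover` function and the record layout are illustrative only, and real recovery algorithms are considerably more involved.

```python
def recover(log, disk):
    """Redo all logged updates, then undo those of uncommitted transactions.

    log  -- list of ("UPDATE", txn, item, old, new) or ("COMMIT", txn) tuples
    disk -- dict of item -> value as found after the crash (mutated in place)
    """
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}

    # Redo pass: reapply every logged update in log order.
    for rec in log:
        if rec[0] == "UPDATE":
            _, txn, item, old, new = rec
            disk[item] = new

    # Undo pass: walk backwards, restoring old values of uncommitted work.
    for rec in reversed(log):
        if rec[0] == "UPDATE" and rec[1] not in committed:
            _, txn, item, old, new = rec
            disk[item] = old
    return disk


update = ("UPDATE", "T1", "accounts.A001.balance", 1000, 500)

# Case 1: T1 committed, but the data page never reached disk.
case1 = recover([update, ("COMMIT", "T1")], {"accounts.A001.balance": 1000})

# Case 2: T1 did not commit, but the data page did reach disk.
case2 = recover([update], {"accounts.A001.balance": 500})

print(case1, case2)
```

In case 1 the balance ends at 500 and in case 2 it returns to 1000, regardless of which version of the page happened to be on disk at the time of the crash.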

The Ordering Guarantee

• Log write always precedes data write: This is the invariant that must never be violated
• Order within log is preserved: Log records for a transaction appear in operation order
• Commit record is the point of no return: Once the commit record reaches disk, the transaction is durable
• Data page writes are unordered: They can happen in any order, at any time, because the log can handle any scenario

Log Record Structure

For the WAL rule to support complete recovery, log records must contain sufficient information for both undo and redo operations. A typical log record structure includes:

Essential Fields:

Field	Description	Example
LSN	Log Sequence Number — unique, monotonically increasing identifier	`12847`
TransactionID	Identifier of the transaction that made this change	`T1024`
PageID	Database page that was modified	`Page#4521`
Offset	Location within the page	`Byte 312`
Length	Size of the modified region	`8 bytes`
Before-Image	Old value (for undo)	`1000`
After-Image	New value (for redo)	`500`
PrevLSN	Previous log record for this transaction (lets undo walk a transaction's records backward without scanning the whole log)	`12830`

Log Record Types

Pseudocode

// Different types of log records
 
// UPDATE record - captures a data modification
struct UpdateLogRecord {
    LSN:          uint64        // Unique log sequence number
    Type:         "UPDATE"      // Record type identifier
    TransactionID: uint32       // Which transaction made this change
    PrevLSN:      uint64        // Previous record for this transaction
    PageID:       uint32        // Which page was modified
    Offset:       uint16        // Position within the page
    BeforeImage:  bytes         // Old value (for UNDO)
    AfterImage:   bytes         // New value (for REDO)
}
 
// COMMIT record - marks transaction completion
struct CommitLogRecord {
    LSN:          uint64
    Type:         "COMMIT"
    TransactionID: uint32
    PrevLSN:      uint64
}
 
// ABORT record - marks transaction rollback
struct AbortLogRecord {
    LSN:          uint64
    Type:         "ABORT"  
    TransactionID: uint32
    PrevLSN:      uint64
}
 
// BEGIN record - marks transaction start
struct BeginLogRecord {
    LSN:          uint64
    Type:         "BEGIN"
    TransactionID: uint32
}
 
// CLR (Compensation Log Record) - records an undo action
struct CLRLogRecord {
    LSN:          uint64
    Type:         "CLR"
    TransactionID: uint32
    PrevLSN:      uint64
    UndoNextLSN:  uint64        // Next record to undo (skips already undone)
    PageID:       uint32        
    AfterImage:   bytes         // The result of the undo
}

Understanding the Before and After Images:

The before-image (also called undo information) and after-image (also called redo information) are the key to recovery:

Before-Image: The value that existed before this modification. If we need to undo this change (because the transaction didn't commit), we restore this value.
After-Image: The value after this modification. If we need to redo this change (because the transaction committed but the data didn't reach disk), we apply this value.
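How the two images are used can be sketched in a few lines. The Python below is an illustration under stated assumptions: `UpdateRecord` is a hypothetical, simplified version of the UPDATE record above, the page is a plain `bytearray`, and both images are assumed to have the same length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateRecord:
    lsn: int
    txn_id: int
    page_id: int
    offset: int
    before_image: bytes   # old bytes, used by undo
    after_image: bytes    # new bytes, used by redo

    def redo(self, page: bytearray) -> None:
        """Reapply the change by overwriting the region with the after-image."""
        end = self.offset + len(self.after_image)
        page[self.offset:end] = self.after_image

    def undo(self, page: bytearray) -> None:
        """Revert the change by overwriting the region with the before-image."""
        end = self.offset + len(self.before_image)
        page[self.offset:end] = self.before_image


# A 16-byte "page" holding the balance 1000 as a 4-byte integer at offset 4.
page = bytearray(16)
page[4:8] = (1000).to_bytes(4, "big")

rec = UpdateRecord(
    lsn=12847, txn_id=1024, page_id=4521, offset=4,
    before_image=(1000).to_bytes(4, "big"),
    after_image=(500).to_bytes(4, "big"),
)

rec.redo(page)
after_redo = int.from_bytes(page[4:8], "big")   # balance after redo
rec.undo(page)
after_undo = int.from_bytes(page[4:8], "big")   # balance after undo
```

Both operations are idempotent here: applying `redo` or `undo` twice leaves the page in the same state as applying it once, which is what makes it safe to repeat them if recovery itself is interrupted.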

Some systems use physical logging (recording actual byte changes), while others use logical logging (recording the operation itself, like 'increment counter by 5'). Modern systems often use physiological logging—physical to the page, logical within the page—balancing recovery complexity with logging efficiency.

Why Both Images?

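You might wonder: why store both old and new values? Couldn't we compute one from the other? With physical logging, generally not: overwriting bytes destroys the old value, so neither image can be derived from the other. Under a policy where the on-disk page may hold either version at crash time, recovery needs the before-image to undo and the after-image to redo. With logical logging the inverse of an operation can sometimes be computed, but during crash recovery we may need to process millions of log records quickly, and inverting arbitrary operations would be complex and slow. Storing both values trades log space for simple, fast recovery, which is usually a worthwhile exchange.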

Force vs. No-Force Policies

The WAL rule tells us that log records must be written before data pages. But it doesn't tell us when data pages should be written. This leads to two policy decisions that significantly impact performance:

Question 1: Force Policy (at commit time)

When a transaction commits, do we force all its modified pages to disk?

FORCE: All dirty pages modified by committing transaction are written to disk immediately
NO-FORCE: Dirty pages can remain in memory after commit; they'll be written later

Question 2: Steal Policy (before commit)

Can a transaction's modified pages be written to disk before the transaction commits?

STEAL: Buffer manager can write uncommitted pages to disk (to free buffer space)
NO-STEAL: Uncommitted pages must remain in memory until commit

Policy Combinations and Their Requirements
Policy	Performance	Recovery Needs	Used By
STEAL/NO-FORCE	Best — pages written only when buffer needs space	Needs both UNDO and REDO	Most modern systems (PostgreSQL, MySQL/InnoDB)
NO-STEAL/FORCE	Worst — locks memory, forces sync I/O	No recovery logging needed	Small embedded systems
STEAL/FORCE	Poor — forced writes at commit	Needs UNDO only	Rare
NO-STEAL/NO-FORCE	Moderate — memory pressure issues	Needs REDO only	Rare
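The "Recovery Needs" column follows from two independent observations: STEAL allows uncommitted changes to reach disk (so undo is needed), and NO-FORCE allows committed changes to be missing from disk (so redo is needed). A small sketch, with `recovery_needs` as a hypothetical helper name:

```python
def recovery_needs(steal: bool, force: bool) -> set:
    """Return which recovery capabilities a buffer policy requires."""
    needs = set()
    if steal:
        # Uncommitted pages may reach disk, so they may have to be rolled back.
        needs.add("UNDO")
    if not force:
        # Committed pages may still be only in memory, so they may need replay.
        needs.add("REDO")
    return needs


for steal in (True, False):
    for force in (True, False):
        name = f"{'STEAL' if steal else 'NO-STEAL'}/{'FORCE' if force else 'NO-FORCE'}"
        print(name, sorted(recovery_needs(steal, force)))
```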

Why STEAL/NO-FORCE Dominates:

The STEAL/NO-FORCE combination provides the best runtime performance:

NO-FORCE means commits are fast: A transaction commits by writing only its log records (sequential I/O), not its data pages (random I/O). The difference can be dramatic, particularly on storage where random writes are expensive.
STEAL means buffer management is flexible — The buffer manager can evict any page when it needs space. It doesn't have to keep uncommitted pages pinned in memory, avoiding out-of-memory situations.

The cost is more complex recovery, because we need both undo and redo capabilities. But since crashes are rare and performance matters constantly, this trade-off is widely preferred in production systems.

WAL Enables STEAL/NO-FORCE

Without WAL, STEAL/NO-FORCE would be impossible. The log provides the safety net: STEAL is safe because we have undo information logged. NO-FORCE is safe because we have redo information logged. WAL is the foundation that makes high-performance databases possible.

Implementation Considerations

Implementing the WAL rule correctly requires careful engineering. Several subtle issues can undermine the guarantee:

1. Page Tearing

Storage devices and file systems typically guarantee atomic writes only in units smaller than a database page (for example 512-byte sectors or 4KB file system blocks versus 8KB or 16KB database pages). If power fails mid-write, you might have half an old page and half a new page: a torn page.

Solutions:

Checksums on every page to detect torn pages
Double-write buffer (InnoDB) — write pages to a safe buffer first
Full-page images in log (PostgreSQL) — log entire page on first modification after checkpoint
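The first option, a per-page checksum, can be illustrated briefly. This is a minimal sketch, assuming an 8KB page whose first four bytes hold a CRC32 of the rest; the layout and the names `seal_page` and `is_intact` are assumptions for illustration and do not match any specific engine's page format. A checksum only detects a torn page; repairing it still requires one of the other two mechanisms.

```python
import zlib

PAGE_SIZE = 8192
CHECKSUM_SIZE = 4


def seal_page(payload: bytes) -> bytes:
    """Return a full page: CRC32 of the payload followed by the payload."""
    if len(payload) != PAGE_SIZE - CHECKSUM_SIZE:
        raise ValueError("payload must fill the page exactly")
    return zlib.crc32(payload).to_bytes(CHECKSUM_SIZE, "big") + payload


def is_intact(page: bytes) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    stored = int.from_bytes(page[:CHECKSUM_SIZE], "big")
    return zlib.crc32(page[CHECKSUM_SIZE:]) == stored


old_page = seal_page(b"a" * (PAGE_SIZE - CHECKSUM_SIZE))
new_page = seal_page(b"b" * (PAGE_SIZE - CHECKSUM_SIZE))

# Simulate a torn write: the first 4KB block is new, the second is still old.
torn_page = new_page[:4096] + old_page[4096:]

print(is_intact(old_page), is_intact(new_page), is_intact(torn_page))
```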

Torn Page Protection with Full-Page Images

Pseudocode

// PostgreSQL-style full page image on first modification
 
function modifyPage(page, modification, transactionLog):
    if page.needsFullPageImage:
        // First modification since last checkpoint
        // Log the entire page content before any change
        logRecord = new FullPageLogRecord(
            LSN: nextLSN(),
            TransactionID: currentTransaction(),
            PageID: page.id,
            FullPageContent: page.content.copy(),  // Before-image of entire page
            Modification: modification
        )
        writeToWAL(logRecord)
        page.needsFullPageImage = false
    else:
        // Normal logging - just the delta
        logRecord = new UpdateLogRecord(
            LSN: nextLSN(),
            TransactionID: currentTransaction(),
            PageID: page.id,
            BeforeImage: page.getAffectedBytes(modification),
            AfterImage: modification.newValue
        )
        writeToWAL(logRecord)
    
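    // Apply the modification to the in-memory page. The WAL rule bites later:
    // the log must be flushed up to this record before the page is written to disk.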
    page.apply(modification)

2. Ensuring Log Durability

Writing to the OS file system doesn't guarantee data reaches disk. The OS may buffer writes in its own cache. To truly satisfy WAL:

Use fsync() or equivalent to force log to disk
Or use battery-backed write caches that guarantee durability
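Or use direct I/O combined with synchronous-write flags (for example O_DIRECT together with O_DSYNC on Linux) to bypass OS buffering; direct I/O alone does not flush the device's own write cache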
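The first option can be sketched with Python's standard `os` module. The file name `wal.log` and the helper `append_durably` are hypothetical; `os.write` and `os.fsync` are the real calls. This is a sketch of the idea only: a production system would also need to make the file's directory entry durable when the log file is first created, and would batch flushes instead of syncing after every record.

```python
import os
import tempfile


def append_durably(fd: int, record: bytes) -> None:
    """Append a record and block until the OS reports it has been synced."""
    view = memoryview(record)
    while view:                      # os.write may write fewer bytes than asked
        written = os.write(fd, view)
        view = view[written:]
    os.fsync(fd)                     # ask the OS to flush this file to the device


log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "wal.log")   # hypothetical log file name
fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
try:
    append_durably(fd, b"<T1, accounts.A001.balance, 1000, 500>\n")
    append_durably(fd, b"<T1, COMMIT>\n")   # commit is acknowledged only after this returns
finally:
    os.close(fd)

with open(log_path, "rb") as f:
    contents = f.read()
```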

3. Log Buffer Management

For performance, databases buffer log records in memory before flushing to disk:

Log records accumulate in a log buffer
Flush the buffer when: transaction commits, buffer fills, periodic timer fires
Commit request blocks until the flush including its commit record completes
Multiple transactions can share a single log flush (group commit)
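Group commit can be illustrated with a single-threaded toy model. The class `GroupCommitLog` and its fields are hypothetical, and the "fsync" is only a counter; the point is that one flush makes every buffered commit record durable, so several waiting transactions are acknowledged for the cost of a single sync.

```python
class GroupCommitLog:
    """Toy log buffer that counts how many simulated fsync calls are issued."""

    def __init__(self):
        self.buffer = []       # records not yet flushed
        self.durable = []      # records considered to be on disk
        self.waiting = []      # transactions blocked on their commit record
        self.fsync_calls = 0

    def append(self, record):
        self.buffer.append(record)

    def request_commit(self, txn):
        self.append(("COMMIT", txn))
        self.waiting.append(txn)   # a real system would block the caller here

    def flush(self):
        """One flush makes every buffered record durable and wakes all waiters."""
        if not self.buffer:
            return []
        self.durable.extend(self.buffer)
        self.buffer.clear()
        self.fsync_calls += 1
        acknowledged, self.waiting = self.waiting, []
        return acknowledged


log = GroupCommitLog()
for txn in ("T1", "T2", "T3"):
    log.append(("UPDATE", txn))
    log.request_commit(txn)

acked = log.flush()   # three commits acknowledged by one simulated fsync
```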

Common Implementation Pitfalls

• Forgetting fsync on commit: OS buffering means 'write' doesn't guarantee durability; committed data can be lost
• Logging after data page write: Violates WAL; creates unrecoverable corruption risk
• Not handling torn log records: Log records can also be torn; need checksums and length fields
• Circular log wraparound: Must ensure old log segments are checkpointed before reuse
• Blocking on every commit: Should implement group commit to amortize fsync cost across transactions
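The torn-log-record pitfall is usually handled by framing each record with a length and a checksum, so that recovery can stop at the first record that is incomplete or corrupt. A minimal sketch follows; the 8-byte header layout and the names `encode` and `read_valid_records` are assumptions for illustration, not any real system's log format.

```python
import struct
import zlib

HEADER = struct.Struct(">II")   # payload length, CRC32 of payload


def encode(payload: bytes) -> bytes:
    """Frame a record as header (length, checksum) followed by the payload."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def read_valid_records(log_bytes: bytes) -> list:
    """Return records up to the first incomplete or corrupt one."""
    records, pos = [], 0
    while pos + HEADER.size <= len(log_bytes):
        length, crc = HEADER.unpack_from(log_bytes, pos)
        start = pos + HEADER.size
        payload = log_bytes[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break                 # torn tail: stop replaying here
        records.append(payload)
        pos = start + length
    return records


log_bytes = encode(b"T1 update") + encode(b"T1 commit") + encode(b"T2 update")
torn = log_bytes[:-3]             # a crash cut off the last record mid-write

print(read_valid_records(torn))
```

Recovery treats everything after the last valid record as if it had never been written, which is safe because no commit could have been acknowledged for a record that did not fully reach disk.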

WAL in Production Systems

Every major database implements WAL, though with variations in naming and implementation details:

PostgreSQL — Write-Ahead Log (WAL)

PostgreSQL uses the name 'WAL' directly for its log (the technique itself predates PostgreSQL) and implements a sophisticated version:

Logs stored in the pg_wal directory (formerly pg_xlog)
16MB segment files by default
Full-page writes after checkpoint to handle torn pages
WAL used for both crash recovery and streaming replication

MySQL/InnoDB — Redo Log

InnoDB uses a circular redo log:

Fixed-size redo log files (configurable)
Checkpointing moves 'oldest LSN' forward to allow log reuse
Double-write buffer to handle torn pages

Oracle — Redo Log

Oracle uses groups of redo log files:

Multiple redo log groups for availability
Log switches cycle between groups
Archived logs enable point-in-time recovery

WAL Implementation Comparison
System	Log Name	Torn Page Solution	Replication Integration
PostgreSQL	WAL (Write-Ahead Log)	Full-page images	Streaming replication via WAL shipping
MySQL InnoDB	Redo Log	Double-write buffer	Binary log separate from redo log
Oracle	Redo Log	Redo log + Data Guard	Redo log shipping to standby
SQL Server	Transaction Log	Torn page detection + checksums	Log shipping or Always On
SQLite	WAL mode	WAL file + checkpoints	No built-in replication (third-party tools such as Litestream ship the WAL)

Beyond Databases:

The WAL principle extends far beyond traditional databases:

File systems (ext4, NTFS) use journaling—WAL for metadata changes
Key-value stores (RocksDB, LevelDB) write to a WAL before memtables
Message queues (Kafka) are essentially a distributed commit log
Version control (Git) keeps an append-only history of commits, a related log-structured idea, though it is not a write-ahead log in the crash-recovery sense

Once you understand WAL, you see it in many places. It is a recurring pattern for reliable persistent state.

WAL as Architecture Pattern

WAL isn't just a database feature—it's an architectural pattern. Any time you need reliable persistence with crash recovery, consider logging changes before applying them. It works for databases, file systems, distributed systems, and even application-level data management.

Summary: The WAL Rule

We've covered the foundational protocol that makes database recovery possible. Let's consolidate the key concepts:

Key Takeaways

• The WAL Rule is absolute: A log record MUST reach stable storage before the data page it describes is written to stable storage. This is not optional or negotiable.
• WAL enables high-performance recovery: By logging changes sequentially, we avoid expensive random I/O while maintaining complete recovery capability.
• Two components, Undo and Redo: Before-images enable undo of uncommitted transactions; after-images enable redo of committed transactions.
• STEAL/NO-FORCE is optimal: WAL enables the highest-performance buffer management policy used by modern databases.
• Log structure matters: Log records must contain LSN, transaction ID, and both before and after images for complete recovery.
• Implementation requires care: Torn pages, log durability, and group commit are critical engineering concerns.

What's Next:

Now that we understand the WAL rule as a protocol, the next page explores why logging must happen before data writes—examining the causality and timing constraints that make 'write-ahead' the only viable ordering for crash recovery.

Page Complete

You now understand the Write-Ahead Logging rule, the cornerstone protocol that guarantees database durability and atomicity. Most commercial and open-source databases rely on this principle. Next, we'll explore the deeper reasons why log-before-data is the only correct ordering.
