Write-Ahead Logging (WAL) is not merely a technique: it is the foundational protocol behind crash-consistent storage systems. From file systems to databases to distributed consensus algorithms, WAL provides the theoretical and practical machinery for durability and atomicity.

The principle appears deceptively simple: write a description of what you intend to do before you do it. But beneath this simplicity lies a carefully engineered protocol with precise ordering requirements, performance optimizations, and subtle correctness conditions. Understanding WAL deeply means understanding how computers can make guarantees about the future despite the uncertainty of crashes, power failures, and hardware errors.
This page takes you deep into the mechanics of write-ahead logging. You'll understand the precise ordering constraints that ensure correctness, the protocol for transaction commit, the checkpointing mechanism that bounds recovery time, and the performance optimizations that make WAL practical for high-throughput workloads.
The write-ahead logging protocol defines a precise sequence of operations that must be followed to ensure crash consistency. Let's examine each step in detail, understanding not just what is done but why each step is necessary.
Phase 1: Transaction Accumulation

Before any disk I/O occurs, the file system accumulates modification requests in memory. For a file append operation, this includes:

- The new data blocks (in the page cache)
- The modified inode (in the inode cache)
- Modified bitmap blocks (in the buffer cache)

These modifications exist only in volatile memory. A crash at this point simply loses the operation, which is perfectly safe from a consistency perspective.

Phase 2: Log Record Construction

Once a transaction is ready to commit (triggered by fsync, a timer, or memory pressure), the file system constructs log records describing all modifications:
```
Transaction 12847 Log Structure:
┌────────────────────────────────────────────────────────────────┐
│ TRANSACTION START                                              │
│   Transaction ID: 12847                                        │
│   Timestamp: 2024-01-15T14:32:16.847Z                          │
│   Sequence: 58291                                              │
├────────────────────────────────────────────────────────────────┤
│ DESCRIPTOR BLOCK                                               │
│   Describes 3 data blocks to follow                            │
│   Block 1: Target 847201, Flags: CHECKSUM                      │
│   Block 2: Target 847202, Flags: CHECKSUM                      │
│   Block 3: Target 102847, Flags: CHECKSUM | METADATA           │
├────────────────────────────────────────────────────────────────┤
│ DATA BLOCK 1 (New file content, first 4KB)                     │
│   [4096 bytes of file data]                                    │
├────────────────────────────────────────────────────────────────┤
│ DATA BLOCK 2 (New file content, second 4KB)                    │
│   [4096 bytes of file data]                                    │
├────────────────────────────────────────────────────────────────┤
│ DATA BLOCK 3 (Modified inode block)                            │
│   Inode 5847: size 8192, blocks [847201, 847202], mtime=...    │
│   Other inodes in same block: [unchanged]                      │
├────────────────────────────────────────────────────────────────┤
│ COMMIT RECORD                                                  │
│   Transaction ID: 12847                                        │
│   Checksum: 0x8A7F2C1E (covers entire transaction)             │
│   Commit Time: 2024-01-15T14:32:16.852Z                        │
└────────────────────────────────────────────────────────────────┘
```

Phase 3: Log Write with Ordering

The log records are written to the journal with strict ordering constraints:

1. Descriptor block first: written and persisted before the data blocks (some systems write them together but flush the descriptor first)
2. Data blocks: written to the journal area; may be issued in parallel
3. Barrier: a write barrier ensures all previous writes complete before the commit
4. Commit record: written only after all other log blocks are persisted

The critical insight: the commit record must hit disk after everything it commits. If the commit record were written first, a crash could find a valid commit but missing or partially written data.
Write barriers (also called flushes or cache flushes) force all previous writes to complete before any subsequent writes can begin. Without barriers, aggressive disk caching and write reordering could write the commit record before the data, violating correctness. File systems that disable barriers for performance sacrifice crash consistency.
Phase 4: Acknowledgment

Once the commit record is persisted (barrier completed, commit block on media), the transaction is officially committed. For operations that requested durability (via fsync), the file system can now return success to the application. At this point:

- The transaction is guaranteed to be recovered after a crash
- The actual modifications may or may not be on disk yet
- The application can rely on the data being durable

Phase 5: Writeback (Background)

After commit, the file system writes the actual modifications to their final on-disk locations. This can happen immediately or be deferred:

- Data blocks are written to their final locations
- Metadata blocks (inodes, bitmaps) are written to their final locations
- These writes can be reordered and optimized for efficiency

Importantly, these writeback operations need not complete for the transaction to be safe: the modifications are already in the journal.
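The five phases map naturally onto ordinary file I/O. Below is a minimal user-space sketch of the commit protocol in C, with `fdatasync` standing in for the block-layer barriers a real file system would issue in the kernel. All names here (`log_record_t`, `COMMIT_MAGIC`, the record layouts) are hypothetical, chosen only to illustrate the ordering:

```c
#include <stdint.h>
#include <unistd.h>

#define COMMIT_MAGIC 0xC0334Du  /* hypothetical marker value */

typedef struct {
    uint32_t txn_id;
    uint32_t nblocks;
    char     data[4096];        /* one logged block, simplified */
} log_record_t;

typedef struct {
    uint32_t magic;             /* identifies a commit record */
    uint32_t txn_id;
    uint32_t checksum;          /* covers the whole transaction */
} commit_record_t;

/* Phases 2-5 of the protocol: log records, barrier, commit, barrier. */
int commit_transaction(int journal_fd, log_record_t *recs, int n,
                       uint32_t txn_id, uint32_t checksum) {
    /* Phase 3a: descriptor + data blocks go into the journal.
       These writes may still sit in the drive cache afterwards. */
    for (int i = 0; i < n; i++)
        if (write(journal_fd, &recs[i], sizeof recs[i]) < 0)
            return -1;

    /* Phase 3b: barrier - everything above must reach media
       before the commit record is written. */
    if (fdatasync(journal_fd) < 0)
        return -1;

    /* Phase 3c: the commit record (a single-sector write in
       real systems, so it cannot be torn). */
    commit_record_t c = { COMMIT_MAGIC, txn_id, checksum };
    if (write(journal_fd, &c, sizeof c) < 0)
        return -1;

    /* Phase 3d: second barrier - the commit record is now durable. */
    if (fdatasync(journal_fd) < 0)
        return -1;

    /* Phase 4: safe to acknowledge; Phase 5 (writeback to final
       locations) can now happen lazily in the background. */
    return 0;
}
```

The two `fdatasync` calls correspond to the "double-barrier" commit sequence described in the enforcement section below.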
The correctness of WAL depends on precise ordering of writes. Modern storage systems aggressively reorder writes for performance, so file systems must explicitly enforce ordering. Let's examine the required constraints and how they're enforced.
| Constraint | Requirement | Why It Matters |
|---|---|---|
| Log before data | Log records must be durable before final location writes | Recovery relies on the log to complete operations; if final-location writes land first, a crash leaves changes no log record describes |
| Data before commit | All transaction data must be durable before commit record | Commit indicates completion; partial data with commit = corruption |
| Commit atomic | Commit record must be written atomically (single sector) | Torn commit record = indeterminate state |
| Sequential within transaction | Log records for a transaction written in order | Enables simple sequential recovery scan |
| Checkpoint ordering | Checkpoint only after all included transactions written back | Checkpoint indicates "safe to forget"; premature checkpoint = data loss |
Enforcement Mechanisms:

1. Write Barriers (FUA and Flush)

Write barriers force visibility ordering:
```
Barrier Types in Linux Block Layer:

REQ_PREFLUSH (before write)
├── Forces all previous writes in drive cache to media
├── Ensures ordering: everything before barrier is durable
└── Expensive on HDDs (often triggers full rotation wait)

REQ_FUA (Force Unit Access)
├── This specific write bypasses drive cache, goes to media
├── Guarantees this write is durable when acknowledged
├── Cheaper than full flush (only affects one write)
└── Used for commit records, metadata

Typical Journal Commit Sequence:
1. Write descriptor block    [normal write]
2. Write data blocks         [normal writes, may be parallel]
3. Issue FLUSH               [wait for all above to hit media]
4. Write commit block + FUA  [commit goes directly to media]

Some systems use a double-barrier:
1. Write data blocks
2. FLUSH
3. Write commit block + FUA
4. FLUSH (ensures commit is durable before acknowledging)
```

2. Transaction Dependencies

Some file systems allow concurrent transactions but track dependencies:

- Transaction B modifies data written by Transaction A
- B cannot commit until A commits
- The dependency graph limits parallelism but ensures ordering

3. Epoch-Based Ordering

Some systems group transactions into epochs:

- All transactions in epoch N must complete before epoch N+1 begins
- Simplifies ordering: barriers are needed only between epochs
- Allows parallelism within an epoch

4. Single-Threaded Journal Access

The simplest approach is to serialize all journal access:

- One thread handles all commits
- Natural ordering through serialization
- Can be a bottleneck for write-heavy workloads
Disk drives have volatile write caches that improve performance by acknowledging writes before they reach the persistent media. If the drive lies about completion (reports success while data is in cache), a crash loses that data. File systems assume barriers work; drives that ignore barriers break consistency. This is why enterprise drives have battery-backed caches or properly implement FUA.
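User-space code cannot issue REQ_FUA directly, but the same two primitives are visible through POSIX: `fdatasync` drains the drive cache (playing the role of REQ_PREFLUSH), while a write through an `O_DSYNC` descriptor is durable when it returns, and on Linux the block layer can satisfy it with an FUA write on capable devices rather than a full flush. A sketch, with the file name and offsets purely illustrative:

```c
#include <fcntl.h>
#include <unistd.h>

/* Approximate the FLUSH + FUA commit sequence from user space. */
int commit_with_fua(const void *data, size_t dlen,
                    const void *commit, size_t clen) {
    int dfd = open("journal", O_WRONLY);           /* cached writes  */
    int cfd = open("journal", O_WRONLY | O_DSYNC); /* durable writes */
    if (dfd < 0 || cfd < 0)
        return -1;

    if (pwrite(dfd, data, dlen, 0) < 0)   /* data blocks: fast path  */
        return -1;
    if (fdatasync(dfd) < 0)               /* barrier: drain cache    */
        return -1;

    /* Commit record is on media by the time this pwrite returns. */
    ssize_t r = pwrite(cfd, commit, clen, (off_t)dlen);
    close(dfd);
    close(cfd);
    return r < 0 ? -1 : 0;
}
```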
The journal is a finite resource—typically a few hundred megabytes to a few gigabytes. Without a mechanism to reclaim space, it would quickly fill up and halt all file system operations. Checkpointing is the process that reclaims journal space by tracking which transactions have been fully written to their final locations.
The Checkpoint Invariant:

> A transaction can be removed from the journal only after all its modifications have been written to their final on-disk locations.

Once this condition is met, the journal entries are redundant: the actual file system structures contain the data, and the journal space can be reused.

Checkpoint Record:

The checkpoint is itself recorded durably, typically in the journal superblock:
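As a concrete sketch, the checkpoint record can be pictured as a small fixed structure. The field names below are chosen to match the recovery code later on this page; the real JBD2 superblock carries many more fields:

```c
#include <stdint.h>

/* Journal superblock: the durable checkpoint record (sketch). */
typedef struct {
    uint32_t s_magic;      /* identifies a journal superblock      */
    uint32_t s_sequence;   /* first transaction expected at replay */
    uint32_t s_start;      /* journal block where that txn begins; */
                           /* 0 = journal clean, nothing to replay */
    uint32_t s_maxlen;     /* total journal size in blocks         */
} journal_superblock_t;
```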
```
Journal State Diagram:

Time ───────────────────────────────────────────────────────────►

Journal (circular buffer, 128MB):
┌─────────────────────────────────────────────────────────────────┐
│ [Free] [Free] │ T100 │ T101 │ T102 │ T103 │ T104 │ [Free] [Free]│
└─────────────────────────────────────────────────────────────────┘
                ↑                                  ↑
                │                                  │
         Oldest active                      Current write
          transaction                         position
       (checkpoint tail)                      (write head)

Checkpoint Progress:
─────────────────────────────────────────────────────────────────
Transaction │ Commit Status │ Writeback Status  │ Checkpointable?
─────────────────────────────────────────────────────────────────
T100        │ Committed     │ Complete          │ ✓ Yes
T101        │ Committed     │ Complete          │ ✓ Yes
T102        │ Committed     │ In Progress (30%) │ ✗ No
T103        │ Committed     │ Not Started       │ ✗ No
T104        │ In Progress   │ N/A               │ ✗ No
─────────────────────────────────────────────────────────────────

After checkpointing T100 and T101:
┌─────────────────────────────────────────────────────────────────┐
│ [Free] [Free] [Free] [Free] │ T102 │ T103 │ T104 │ [Free] [Free]│
└─────────────────────────────────────────────────────────────────┘
                              ↑
                       New checkpoint tail
```

Checkpoint Triggers:

File systems checkpoint under several conditions, typically including:

- Journal space pressure: free journal blocks fall below a threshold
- Periodic timers: bound the age of the oldest journal entry
- Explicit sync or unmount: the journal must be drained cleanly
Checkpoint Algorithm:

```
CHECKPOINT_PROCEDURE():
    oldest_active = journal.tail_transaction

    for each txn from oldest_active forward:
        if txn.writeback_complete:
            // All blocks written to final locations
            journal.advance_tail(txn)
            free_journal_blocks(txn.log_blocks)
        else:
            // Cannot checkpoint past this transaction.
            // May need to force writeback if journal is full.
            if journal.space_remaining < threshold:
                force_writeback(txn)
            break

    update_journal_superblock(new_tail)
    issue_barrier()  // Ensure superblock update is durable
```

Journal Full Handling:

If the journal fills before transactions can be checkpointed, the file system must stall:

1. Block all new transactions from starting
2. Force writeback of the oldest transactions' data
3. Checkpoint the completed transactions
4. Resume normal operation

This stall is disruptive but necessary: there is no safe way to proceed without journal space. Proper journal sizing avoids this scenario under normal workloads.
A checkpoint must not advance past a transaction whose writeback is incomplete. If the checkpoint marker advances and then a crash occurs, recovery will not replay the transaction, but the final data isn't on disk either—data loss. The file system must carefully track writeback completion before checkpointing.
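The tail-advancement rule is easy to express in code. Here is a self-contained C sketch over a circular journal, assuming a per-transaction `writeback_complete` flag maintained by the writeback path; all types are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct txn {
    uint32_t     seq;                /* transaction sequence number */
    uint32_t     log_blocks;         /* journal blocks it occupies  */
    bool         writeback_complete; /* all blocks at final homes?  */
    struct txn  *next;               /* next-oldest transaction     */
} txn_t;

typedef struct {
    txn_t   *oldest;      /* checkpoint tail: oldest live txn */
    uint32_t tail_block;  /* first in-use block (circular)    */
    uint32_t size;        /* journal size in blocks           */
} journal_t;

/* Advance the tail past every fully written-back transaction.
   Stops at the first transaction still awaiting writeback: the
   tail must never pass it (see the warning above). Returns the
   number of journal blocks reclaimed. */
uint32_t checkpoint(journal_t *j) {
    uint32_t freed = 0;
    while (j->oldest && j->oldest->writeback_complete) {
        j->tail_block = (j->tail_block + j->oldest->log_blocks) % j->size;
        freed += j->oldest->log_blocks;
        j->oldest = j->oldest->next;
    }
    /* Caller must now durably record the new tail in the journal
       superblock (write + barrier) before reusing freed blocks. */
    return freed;
}
```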
When the system boots after a crash, the file system must recover to a consistent state before it can be mounted. The recovery algorithm scans the journal, identifies committed transactions that may not have been fully written back, and replays them.

This process is deterministic: given the same journal contents, recovery will always produce the same result. This determinism is essential for reliability.
Recovery Phases:
```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

// Simplified journal recovery algorithm.
// (journal_t, list_t, and the read/write/parse helpers are assumed.)
typedef struct {
    uint32_t sequence;          // Transaction sequence number
    uint32_t start_block;       // First log block
    uint32_t length;            // Number of log blocks
    bool valid;                 // Commit found and checksum valid
    uint32_t mappings_count;    // Number of replay mappings
    block_mapping_t *mappings;  // Target locations for replay
} transaction_record_t;

int journal_recover(journal_t *journal) {
    // Phase 1: Read journal superblock
    journal_superblock_t *sb = read_journal_superblock(journal);
    uint32_t checkpoint_seq = sb->s_sequence;
    uint32_t scan_start = sb->s_start;

    // Phase 2 & 3: Scan and validate transactions
    list_t *transactions = list_create();
    uint32_t current_block = scan_start;
    uint32_t expected_seq = checkpoint_seq;

    while (true) {
        journal_header_t *header = read_journal_block(journal, current_block);

        if (header->h_magic != JOURNAL_MAGIC) {
            break;  // End of valid journal records
        }

        if (header->h_blocktype == JBD2_DESCRIPTOR_BLOCK) {
            // Found start of a transaction
            transaction_record_t *txn = parse_transaction(
                journal, current_block, expected_seq);

            if (txn->valid) {
                // Transaction has valid commit with matching checksum
                list_append(transactions, txn);
                expected_seq++;
                current_block = txn->start_block + txn->length;
            } else {
                // Invalid or incomplete transaction - stop scanning.
                // All subsequent records are potentially corrupt.
                break;
            }
        } else {
            current_block = next_journal_block(journal, current_block);
        }
    }

    // Phase 4 & 5: Replay valid transactions
    printf("Recovery: %d transactions to replay\n", list_size(transactions));

    for (int t = 0; t < list_size(transactions); t++) {
        transaction_record_t *txn = list_get(transactions, t);

        for (uint32_t i = 0; i < txn->mappings_count; i++) {
            block_mapping_t *map = &txn->mappings[i];

            // Read data from journal
            void *data = read_journal_block(journal, map->journal_block);

            // Write to final location
            write_filesystem_block(map->target_block, data);
        }
    }

    // Phase 6: Flush all writes and update checkpoint
    issue_write_barrier();
    sb->s_sequence = expected_seq;
    sb->s_start = current_block;
    write_journal_superblock(journal, sb);
    issue_write_barrier();

    // Phase 7: Recovery complete
    printf("Recovery complete. File system consistent.\n");
    return 0;
}
```

Key Recovery Properties:

Idempotency: Replaying a transaction multiple times produces the same result as replaying it once. This is essential because we don't know whether a transaction was partially written back before the crash. Replaying is always safe.

Forward-only scanning: Recovery only scans forward from the checkpoint. We never need to undo work or scan backward: committed transactions are replayed, uncommitted transactions are ignored.

Bounded time: Recovery time depends on the number of transactions since the last checkpoint, not on file system size. For a 5-second commit interval and typical workloads, this means tens to hundreds of transactions, i.e. seconds of recovery time regardless of file system size.
Database systems often use both REDO and UNDO logging, but file systems typically use only REDO. This is because file systems don't support user-controlled transaction rollback—transactions are implicit and always run to completion or are never started. The simpler REDO-only approach is sufficient and easier to implement correctly.
Committing each file system operation as a separate transaction would be catastrophically slow—each commit requires a flush and FUA write. Instead, modern journaling file systems batch multiple operations into compound transactions that commit together.
The Batching Window:

The file system maintains an "open" transaction that accumulates modifications over a time window (typically 5 seconds in ext4). All operations during this window become part of the same transaction:
```
Time: T0──────T1──────T2──────T3──────T4──────T5 (commit)──T6──────►

Operations arriving:

  T0+0.1s: write(file_A) ──┐
  T0+0.3s: mkdir(dir_B)  ──┤
  T0+1.2s: write(file_C) ──┼──► Transaction 100
  T0+2.8s: unlink(file_D)──┤    (accumulating)
  T0+4.1s: write(file_E) ──┤
  T0+4.9s: rename(F, G)  ──┘

  T5+0.0s: COMMIT ──► Transaction 100 committed
                      All 6 operations are atomic

  T5+0.1s: write(file_H) ──┐
  T5+0.5s: create(file_I)──┼──► Transaction 101
  ...                      │    (new batch)

Commit interval: 5 seconds

All operations in the window share a single:
  - Descriptor block write
  - Commit record write
  - Barrier/flush overhead
```

Performance Impact:

Batching dramatically improves throughput:
| Metric | Per-Operation Commit | 5-Second Batching | Improvement |
|---|---|---|---|
| Operations per commit | 1 | ~1000 (typical) | 1000x fewer commits |
| Flushes per operation | 1 | 0.001 | 1000x fewer flushes |
| Latency (HDD) | ~10ms | ~0.01ms amortized | ~1000x |
| IOPS limited operations | 100/sec | 100,000+/sec | 1000x |
| Data loss window | 0 | Up to 5 seconds | Trade-off |
Early Commit Triggers:

While batching defers commits, certain conditions force an early commit, typically:

- An application calls fsync or fdatasync (durability demanded now)
- The journal runs low on space for the accumulating transaction
- Memory pressure forces dirty buffers to be flushed
- The file system is being unmounted or frozen

A sketch of this decision logic follows below.
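Here is a minimal C sketch of that decision, assuming a per-transaction state record; the constants and field names are illustrative, not taken from any particular file system:

```c
#include <stdbool.h>
#include <time.h>

#define COMMIT_INTERVAL_SEC 5     /* ext4-style default batching window  */
#define JOURNAL_LOW_WATER   1024  /* free blocks; illustrative threshold */

typedef struct {
    time_t opened;           /* when this transaction started accumulating */
    bool   fsync_requested;  /* an application demanded durability         */
    long   journal_free;     /* journal blocks still available             */
} open_txn_t;

/* Decide whether the open transaction must commit now rather than
   waiting for the batching window to expire. */
bool should_commit_now(const open_txn_t *t) {
    if (t->fsync_requested)                  /* fsync: commit immediately */
        return true;
    if (t->journal_free < JOURNAL_LOW_WATER) /* journal nearly full       */
        return true;
    if (time(NULL) - t->opened >= COMMIT_INTERVAL_SEC)
        return true;                         /* window expired            */
    return false;
}
```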
Batching means a crash can lose the last 5 seconds of work (unless fsync was called). For most workloads this is acceptable—unsaved documents would be lost anyway. For databases and critical applications, explicit fsync ensures durability at the cost of performance. Understand your application's requirements.
Handling Dependencies Within a Batch:

Operations within a batch may have dependencies:

1. Create file F
2. Write to file F
3. Rename file F to G

All three are in the same transaction. On recovery, either all three happen or none. This is correct: the batch is atomic. But the file system must ensure that the ordering of operations within the batch is preserved in the log, so replay produces the intended result.
Database systems distinguish between REDO logging and UNDO logging, and many use a combination called REDO-UNDO or ARIES (Algorithms for Recovery and Isolation Exploiting Semantics). File systems typically use simpler approaches. Understanding the distinction illuminates design choices.
Why File Systems Choose REDO:

File systems predominantly use REDO-only logging for several reasons:

1. No application rollback: Unlike databases, file systems don't support "I changed my mind, undo this transaction." Transactions always complete.

2. Simpler implementation: REDO-only avoids the complexity of reconstructing old values and handling partial undo.

3. Metadata focus: Most file systems journal only metadata, not data. The metadata modifications are already in memory; logging them is straightforward.

4. Batch commit model: The 5-second batching window means transactions are relatively short-lived. The overhead of buffering is acceptable.

REDO-Only Constraint:

REDO-only logging requires a no-steal/no-force buffer policy:

- No-steal: Uncommitted data must not be written to its final location (that would require undo on crash)
- No-force: Committed data need not be immediately written to its final location (the journal provides durability)

This is the standard approach in journaling file systems. Uncommitted modifications stay in memory; committed modifications are safe in the journal and can be lazily written back. The no-steal rule reduces to a one-line check, as shown in the sketch below.
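A sketch of that check in the buffer writeback path, with hypothetical types:

```c
#include <stdbool.h>

typedef struct { bool committed; } txn_state_t;

typedef struct {
    txn_state_t *owner;   /* transaction that dirtied this buffer */
    bool         dirty;
} buffer_t;

/* No-steal: a dirty buffer may reach its final on-disk location
   only once the transaction that dirtied it has committed.
   (No-force is the absence of the opposite rule: nothing obliges
   the buffer to be written at commit time; writeback is lazy.) */
bool may_write_back(const buffer_t *b) {
    return b->dirty && b->owner->committed;
}
```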
| Property | REDO Only | UNDO Only | REDO-UNDO (ARIES) |
|---|---|---|---|
| Log content | New values | Old values | Both |
| Dirty pages to disk | After commit | Before commit | Any time |
| Recovery action | Redo committed | Undo uncommitted | Redo then undo |
| Memory requirement | Buffer until commit | Log before modify | Moderate |
| Implementation complexity | Low | Medium | High |
| Typical use | File systems | Rare | Databases |
File system journals typically use physical logging—recording exact block contents. Databases often use logical logging—recording operations like "insert row X." Physical logging is simpler and idempotent (can replay safely even if already applied), but uses more space. File systems' focus on metadata makes physical logging practical.
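The difference is visible in the shape of a log record. Two hypothetical layouts, purely for contrast:

```c
#include <stdint.h>

/* Physical logging (file system journals): the record carries the
   complete new block image. Replay is blind and idempotent:
   copy the bytes to block `target_block`, as many times as needed. */
typedef struct {
    uint64_t target_block;   /* e.g. 847201        */
    uint8_t  image[4096];    /* exact new contents */
} physical_record_t;

/* Logical logging (databases): the record names an operation.
   Replay must re-execute it and must know whether it already ran.
   Smaller records, harder recovery. */
typedef struct {
    uint32_t opcode;         /* e.g. OP_INSERT_ROW */
    uint64_t table_id;
    uint8_t  row[128];       /* serialized row data */
} logical_record_t;
```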
A crash can corrupt not just data but the journal itself. If the system was writing to the journal at crash time, the written blocks may be partially complete (torn writes). Modern journals use checksums to detect such corruption and ensure only valid transactions are replayed.
The Torn Write Problem:

Consider a 4KB journal block being written when power fails:

- The first 2KB reaches the disk
- The last 2KB does not
- Result: the block contains half new data, half old (or garbage)

If this is a commit block, we might incorrectly conclude a transaction is complete. If it is a data block, we might replay corrupted data. Checksums provide the defense.
```c
#include <stdbool.h>
#include <stddef.h>   /* offsetof */
#include <stdint.h>
#include <stdio.h>

// Journal checksum types (ext4/JBD2 example).
// (journal_t, transaction_t, crc32c, crc16, etc. are assumed helpers.)
#define JBD2_CHECKSUM_CRC32   1
#define JBD2_CHECKSUM_CRC32C  2
#define JBD2_CHECKSUM_XXHASH  3  // Newer, faster

// Commit block with checksum
struct journal_commit_block {
    struct journal_header header;  // type = COMMIT, sequence
    uint8_t  chksum_type;          // Algorithm used
    uint8_t  chksum_size;          // Size of checksum
    uint8_t  padding[2];
    uint32_t chksum[4];            // The checksum value(s)
    uint64_t commit_sec;           // Commit timestamp
    uint32_t commit_nsec;
};

// Validate a complete transaction
bool validate_transaction(journal_t *j, transaction_t *txn) {
    // Step 1: Check commit block checksum covers descriptor + data
    uint32_t running_crc = 0;

    // Include descriptor block
    running_crc = crc32c(running_crc, txn->descriptor_block,
                         JOURNAL_BLOCK_SIZE);

    // Include all data blocks
    for (int i = 0; i < txn->num_data_blocks; i++) {
        running_crc = crc32c(running_crc, txn->data_blocks[i],
                             JOURNAL_BLOCK_SIZE);
    }

    // Include commit block (excluding checksum field)
    char *commit = (char *)txn->commit_block;
    running_crc = crc32c(running_crc, commit,
                         offsetof(struct journal_commit_block, chksum));

    // Compare with stored checksum
    if (running_crc != txn->commit_block->chksum[0]) {
        printf("Transaction %u failed checksum: computed %08x, stored %08x\n",
               txn->sequence, running_crc, txn->commit_block->chksum[0]);
        return false;
    }

    // Step 2: Check individual block tags if present
    for (int i = 0; i < txn->num_blocks; i++) {
        block_tag_t *tag = &txn->descriptor->tags[i];
        if (tag->flags & TAG_FLAG_CHECKSUM) {
            uint16_t block_crc = crc16(txn->data_blocks[i],
                                       JOURNAL_BLOCK_SIZE);
            if (block_crc != tag->checksum) {
                printf("Block %d checksum mismatch\n", i);
                return false;
            }
        }
    }

    return true;
}
```

Checksum Coverage:

Modern journals compute checksums over the entire transaction:

1. Descriptor block: includes all block tags and target addresses
2. All data blocks: every logged block is included
3. Commit block: the commit itself (with the checksum field zeroed during computation)

This creates an end-to-end integrity check. Any corruption, in any block of the transaction, will cause validation to fail.

Checksum Algorithm Selection:
| Algorithm | Collision Resistance | Performance | Notes |
|---|---|---|---|
| CRC32 | Moderate | Good | Hardware acceleration on many CPUs |
| CRC32C | Moderate | Excellent | Intel CRC32 instruction, preferred |
| xxHash | High | Excellent | Modern, very fast, gaining adoption |
| SHA-256 | Very High | Poor | Overkill for corruption detection |
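For reference, here is a bit-at-a-time software CRC32C (Castagnoli polynomial, reflected form 0x82F63B78), matching the `crc32c(crc, buf, len)` signature used in the validation sketch above. Production implementations use lookup tables or the SSE4.2 `crc32` instruction instead:

```c
#include <stddef.h>
#include <stdint.h>

/* CRC32C, bit-at-a-time reference implementation. Chaining works:
   crc32c(crc32c(0, a, la), b, lb) equals the CRC of a followed by b. */
uint32_t crc32c(uint32_t crc, const void *buf, size_t len) {
    const uint8_t *p = buf;
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1));
    }
    return ~crc;
}
```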
Journal checksums detect corruption from crashes and hardware errors. They do not protect against malicious modification—an attacker can modify both data and checksum. For security against tampering, you need cryptographic MACs or authenticated encryption. File systems like ZFS provide this; most journaling filesystems do not.
We've explored the detailed mechanics of write-ahead logging. Let's consolidate the key insights:

- Ordering is everything: log records before final-location writes, data before the commit record, and the commit record written atomically
- Barriers (flush and FUA) enforce that ordering against drive caches that would otherwise reorder writes
- Checkpointing reclaims journal space and bounds recovery time, but must never advance past an incomplete writeback
- Batching amortizes commit overhead across many operations, trading a bounded data-loss window for throughput
- Checksums detect torn writes and corruption, so recovery replays only fully valid transactions
What's Next:

With the WAL mechanism understood, we'll next explore metadata journaling: the most common journaling mode, which protects file system structure while allowing data to be written directly. This mode offers the best balance of performance and consistency for most workloads.
You now understand the core mechanism that makes journaling work. This knowledge applies not just to file systems but to databases, message queues, distributed systems, and any system that needs durability guarantees. WAL is a fundamental technique in systems engineering.