Write-Ahead Logging (WAL) is not merely a technique: it is the foundational protocol behind crash-consistent storage systems. From file systems to databases to distributed consensus algorithms, WAL provides the theoretical and practical machinery for durability and atomicity.

The principle appears deceptively simple: write a description of what you intend to do before you do it. But beneath this simplicity lies a carefully engineered protocol with precise ordering requirements, performance optimizations, and subtle correctness conditions. Understanding WAL deeply means understanding how computers can make guarantees about the future despite the uncertainty of crashes, power failures, and hardware errors.
This page takes you deep into the mechanics of write-ahead logging. You'll understand the precise ordering constraints that ensure correctness, the protocol for transaction commit, the checkpointing mechanism that bounds recovery time, and the performance optimizations that make WAL practical for high-throughput workloads.
The write-ahead logging protocol defines a precise sequence of operations that must be followed to ensure crash consistency. Let's examine each step in detail, understanding not just what is done but why each step is necessary.
Phase 1: Transaction Accumulation

Before any disk I/O occurs, the file system accumulates modification requests in memory. For a file append operation, this includes:

- The new data blocks (in the page cache)
- The modified inode (in the inode cache)
- Modified bitmap blocks (in the buffer cache)

These modifications exist only in volatile memory. A crash at this point simply loses the operation, which is perfectly safe from a consistency perspective.

Phase 2: Log Record Construction

Once a transaction is ready to commit (triggered by fsync, a timer, or memory pressure), the file system constructs log records describing all modifications:
```
Transaction 12847 Log Structure:
┌────────────────────────────────────────────────────────────────┐
│ TRANSACTION START                                              │
│   Transaction ID: 12847                                        │
│   Timestamp: 2024-01-15T14:32:16.847Z                          │
│   Sequence: 58291                                              │
├────────────────────────────────────────────────────────────────┤
│ DESCRIPTOR BLOCK                                               │
│   Describes 3 data blocks to follow                            │
│   Block 1: Target 847201, Flags: CHECKSUM                      │
│   Block 2: Target 847202, Flags: CHECKSUM                      │
│   Block 3: Target 102847, Flags: CHECKSUM | METADATA           │
├────────────────────────────────────────────────────────────────┤
│ DATA BLOCK 1 (New file content, first 4KB)                     │
│   [4096 bytes of file data]                                    │
├────────────────────────────────────────────────────────────────┤
│ DATA BLOCK 2 (New file content, second 4KB)                    │
│   [4096 bytes of file data]                                    │
├────────────────────────────────────────────────────────────────┤
│ DATA BLOCK 3 (Modified inode block)                            │
│   Inode 5847: size 8192, blocks [847201, 847202], mtime=...    │
│   Other inodes in same block: [unchanged]                      │
├────────────────────────────────────────────────────────────────┤
│ COMMIT RECORD                                                  │
│   Transaction ID: 12847                                        │
│   Checksum: 0x8A7F2C1E (covers entire transaction)             │
│   Commit Time: 2024-01-15T14:32:16.852Z                        │
└────────────────────────────────────────────────────────────────┘
```

Phase 3: Log Write with Ordering

The log records are written to the journal with strict ordering constraints:

1. Descriptor block first: written and persisted before the data blocks (some systems write them together but flush the descriptor first)
2. Data blocks: written to the journal area; may be issued in parallel
3. Barrier: a write barrier ensures all previous writes complete before the commit
4. Commit record: written only after all other log blocks are persisted

The critical insight: the commit record must hit disk after everything it commits. If the commit record were written first, a crash could find a valid commit but missing or partially written data.
Write barriers (also called flushes or cache flushes) force all previous writes to complete before any subsequent writes can begin. Without barriers, aggressive disk caching and write reordering could write the commit record before the data, violating correctness. File systems that disable barriers for performance sacrifice crash consistency.
Phase 4: Acknowledgment

Once the commit record is persisted (barrier completed, commit block on media), the transaction is officially committed. For operations that requested durability (via fsync), the file system can now return success to the application. At this point:

- The transaction is guaranteed to be recovered after a crash
- The actual modifications may or may not be on disk yet
- The application can rely on the data being durable

Phase 5: Writeback (Background)

After commit, the file system writes the actual modifications to their final on-disk locations. This can happen immediately or be deferred:

- Data blocks are written to their final locations
- Metadata blocks (inodes, bitmaps) are written to their final locations
- These writes can be reordered and optimized for efficiency

Importantly, these writeback operations need not complete for the transaction to be safe: the modifications are already in the journal.
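The five phases map naturally onto ordinary file I/O. Below is a minimal user-space sketch of the commit protocol in C, with `fdatasync` standing in for the block-layer barriers a real file system would issue in the kernel. All names here (`log_record_t`, `COMMIT_MAGIC`, the record layouts) are hypothetical, chosen only to illustrate the ordering:

```c
#include <stdint.h>
#include <unistd.h>

#define COMMIT_MAGIC 0xC0334Du  /* hypothetical marker value */

typedef struct {
    uint32_t txn_id;
    uint32_t nblocks;
    char     data[4096];        /* one logged block, simplified */
} log_record_t;

typedef struct {
    uint32_t magic;             /* identifies a commit record */
    uint32_t txn_id;
    uint32_t checksum;          /* covers the whole transaction */
} commit_record_t;

/* Phases 2-5 of the protocol: log records, barrier, commit, barrier. */
int commit_transaction(int journal_fd, log_record_t *recs, int n,
                       uint32_t txn_id, uint32_t checksum) {
    /* Phase 3a: descriptor + data blocks go into the journal.
       These writes may still sit in the drive cache afterwards. */
    for (int i = 0; i < n; i++)
        if (write(journal_fd, &recs[i], sizeof recs[i]) < 0)
            return -1;

    /* Phase 3b: barrier - everything above must reach media
       before the commit record is written. */
    if (fdatasync(journal_fd) < 0)
        return -1;

    /* Phase 3c: the commit record (a single-sector write in
       real systems, so it cannot be torn). */
    commit_record_t c = { COMMIT_MAGIC, txn_id, checksum };
    if (write(journal_fd, &c, sizeof c) < 0)
        return -1;

    /* Phase 3d: second barrier - the commit record is now durable. */
    if (fdatasync(journal_fd) < 0)
        return -1;

    /* Phase 4: safe to acknowledge; Phase 5 (writeback to final
       locations) can now happen lazily in the background. */
    return 0;
}
```

The two `fdatasync` calls correspond to the "double-barrier" commit sequence described in the enforcement section below.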
The correctness of WAL depends on precise ordering of writes. Modern storage systems aggressively reorder writes for performance, so file systems must explicitly enforce ordering. Let's examine the required constraints and how they're enforced.
| Constraint | Requirement | Why It Matters |
|---|---|---|
| Log before data | Log records must be durable before final location writes | Recovery relies on the log to complete operations; if final-location writes land first, a crash leaves changes no log record describes |
| Data before commit | All transaction data must be durable before commit record | Commit indicates completion; partial data with commit = corruption |
| Commit atomic | Commit record must be written atomically (single sector) | Torn commit record = indeterminate state |
| Sequential within transaction | Log records for a transaction written in order | Enables simple sequential recovery scan |
| Checkpoint ordering | Checkpoint only after all included transactions written back | Checkpoint indicates "safe to forget"; premature checkpoint = data loss |
Enforcement Mechanisms:

1. Write Barriers (FUA and Flush)

Write barriers force visibility ordering:
```
Barrier Types in Linux Block Layer:

REQ_PREFLUSH (before write)
├── Forces all previous writes in drive cache to media
├── Ensures ordering: everything before barrier is durable
└── Expensive on HDDs (often triggers full rotation wait)

REQ_FUA (Force Unit Access)
├── This specific write bypasses drive cache, goes to media
├── Guarantees this write is durable when acknowledged
├── Cheaper than full flush (only affects one write)
└── Used for commit records, metadata

Typical Journal Commit Sequence:
1. Write descriptor block    [normal write]
2. Write data blocks         [normal writes, may be parallel]
3. Issue FLUSH               [wait for all above to hit media]
4. Write commit block + FUA  [commit goes directly to media]

Some systems use a double-barrier:
1. Write data blocks
2. FLUSH
3. Write commit block + FUA
4. FLUSH (ensures commit is durable before acknowledging)
```

2. Transaction Dependencies

Some file systems allow concurrent transactions but track dependencies:

- Transaction B modifies data written by Transaction A
- B cannot commit until A commits
- The dependency graph limits parallelism but ensures ordering

3. Epoch-Based Ordering

Some systems group transactions into epochs:

- All transactions in epoch N must complete before epoch N+1 begins
- Simplifies ordering: barriers are needed only between epochs
- Allows parallelism within an epoch

4. Single-Threaded Journal Access

The simplest approach is to serialize all journal access:

- One thread handles all commits
- Natural ordering through serialization
- Can be a bottleneck for write-heavy workloads
Disk drives have volatile write caches that improve performance by acknowledging writes before they reach the persistent media. If the drive lies about completion (reports success while data is in cache), a crash loses that data. File systems assume barriers work; drives that ignore barriers break consistency. This is why enterprise drives have battery-backed caches or properly implement FUA.
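User-space code cannot issue REQ_FUA directly, but the same two primitives are visible through POSIX: `fdatasync` drains the drive cache (playing the role of REQ_PREFLUSH), while a write through an `O_DSYNC` descriptor is durable when it returns, and on Linux the block layer can satisfy it with an FUA write on capable devices rather than a full flush. A sketch, with the file name and offsets purely illustrative:

```c
#include <fcntl.h>
#include <unistd.h>

/* Approximate the FLUSH + FUA commit sequence from user space. */
int commit_with_fua(const void *data, size_t dlen,
                    const void *commit, size_t clen) {
    int dfd = open("journal", O_WRONLY);           /* cached writes  */
    int cfd = open("journal", O_WRONLY | O_DSYNC); /* durable writes */
    if (dfd < 0 || cfd < 0)
        return -1;

    if (pwrite(dfd, data, dlen, 0) < 0)   /* data blocks: fast path  */
        return -1;
    if (fdatasync(dfd) < 0)               /* barrier: drain cache    */
        return -1;

    /* Commit record is on media by the time this pwrite returns. */
    ssize_t r = pwrite(cfd, commit, clen, (off_t)dlen);
    close(dfd);
    close(cfd);
    return r < 0 ? -1 : 0;
}
```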
The journal is a finite resource—typically a few hundred megabytes to a few gigabytes. Without a mechanism to reclaim space, it would quickly fill up and halt all file system operations. Checkpointing is the process that reclaims journal space by tracking which transactions have been fully written to their final locations.
The Checkpoint Invariant:

> A transaction can be removed from the journal only after all its modifications have been written to their final on-disk locations.

Once this condition is met, the journal entries are redundant: the actual file system structures contain the data, and the journal space can be reused.

Checkpoint Record:

The checkpoint is itself recorded durably, typically in the journal superblock:
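As a concrete sketch, the checkpoint record can be pictured as a small fixed structure. The field names below are chosen to match the recovery code later on this page; the real JBD2 superblock carries many more fields:

```c
#include <stdint.h>

/* Journal superblock: the durable checkpoint record (sketch). */
typedef struct {
    uint32_t s_magic;      /* identifies a journal superblock      */
    uint32_t s_sequence;   /* first transaction expected at replay */
    uint32_t s_start;      /* journal block where that txn begins; */
                           /* 0 = journal clean, nothing to replay */
    uint32_t s_maxlen;     /* total journal size in blocks         */
} journal_superblock_t;
```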
```
Journal State Diagram:

Time ───────────────────────────────────────────────────────────►

Journal (circular buffer, 128MB):
┌─────────────────────────────────────────────────────────────────┐
│ [Free] [Free] │ T100 │ T101 │ T102 │ T103 │ T104 │ [Free] [Free]│
└─────────────────────────────────────────────────────────────────┘
                ↑                                  ↑
                │                                  │
         Oldest active                      Current write
          transaction                         position
       (checkpoint tail)                      (write head)

Checkpoint Progress:
─────────────────────────────────────────────────────────────────
Transaction │ Commit Status │ Writeback Status  │ Checkpointable?
─────────────────────────────────────────────────────────────────
T100        │ Committed     │ Complete          │ ✓ Yes
T101        │ Committed     │ Complete          │ ✓ Yes
T102        │ Committed     │ In Progress (30%) │ ✗ No
T103        │ Committed     │ Not Started       │ ✗ No
T104        │ In Progress   │ N/A               │ ✗ No
─────────────────────────────────────────────────────────────────

After checkpointing T100 and T101:
┌─────────────────────────────────────────────────────────────────┐
│ [Free] [Free] [Free] [Free] │ T102 │ T103 │ T104 │ [Free] [Free]│
└─────────────────────────────────────────────────────────────────┘
                              ↑
                       New checkpoint tail
```

Checkpoint Triggers:

File systems checkpoint under several conditions, typically including:

- Journal space pressure: free journal blocks fall below a threshold
- Periodic timers: bound the age of the oldest journal entry
- Explicit sync or unmount: the journal must be drained cleanly
Checkpoint Algorithm:

```
CHECKPOINT_PROCEDURE():
    oldest_active = journal.tail_transaction

    for each txn from oldest_active forward:
        if txn.writeback_complete:
            // All blocks written to final locations
            journal.advance_tail(txn)
            free_journal_blocks(txn.log_blocks)
        else:
            // Cannot checkpoint past this transaction.
            // May need to force writeback if journal is full.
            if journal.space_remaining < threshold:
                force_writeback(txn)
            break

    update_journal_superblock(new_tail)
    issue_barrier()  // Ensure superblock update is durable
```

Journal Full Handling:

If the journal fills before transactions can be checkpointed, the file system must stall:

1. Block all new transactions from starting
2. Force writeback of the oldest transactions' data
3. Checkpoint the completed transactions
4. Resume normal operation

This stall is disruptive but necessary: there is no safe way to proceed without journal space. Proper journal sizing avoids this scenario under normal workloads.
A checkpoint must not advance past a transaction whose writeback is incomplete. If the checkpoint marker advances and then a crash occurs, recovery will not replay the transaction, but the final data isn't on disk either—data loss. The file system must carefully track writeback completion before checkpointing.
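The tail-advancement rule is easy to express in code. Here is a self-contained C sketch over a circular journal, assuming a per-transaction `writeback_complete` flag maintained by the writeback path; all types are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct txn {
    uint32_t     seq;                /* transaction sequence number */
    uint32_t     log_blocks;         /* journal blocks it occupies  */
    bool         writeback_complete; /* all blocks at final homes?  */
    struct txn  *next;               /* next-oldest transaction     */
} txn_t;

typedef struct {
    txn_t   *oldest;      /* checkpoint tail: oldest live txn */
    uint32_t tail_block;  /* first in-use block (circular)    */
    uint32_t size;        /* journal size in blocks           */
} journal_t;

/* Advance the tail past every fully written-back transaction.
   Stops at the first transaction still awaiting writeback: the
   tail must never pass it (see the warning above). Returns the
   number of journal blocks reclaimed. */
uint32_t checkpoint(journal_t *j) {
    uint32_t freed = 0;
    while (j->oldest && j->oldest->writeback_complete) {
        j->tail_block = (j->tail_block + j->oldest->log_blocks) % j->size;
        freed += j->oldest->log_blocks;
        j->oldest = j->oldest->next;
    }
    /* Caller must now durably record the new tail in the journal
       superblock (write + barrier) before reusing freed blocks. */
    return freed;
}
```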
When the system boots after a crash, the file system must recover to a consistent state before it can be mounted. The recovery algorithm scans the journal, identifies committed transactions that may not have been fully written back, and replays them.

This process is deterministic: given the same journal contents, recovery will always produce the same result. This determinism is essential for reliability.
Recovery Phases:
```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

// Simplified journal recovery algorithm.
// (journal_t, list_t, and the read/write/parse helpers are assumed.)
typedef struct {
    uint32_t sequence;          // Transaction sequence number
    uint32_t start_block;       // First log block
    uint32_t length;            // Number of log blocks
    bool valid;                 // Commit found and checksum valid
    uint32_t mappings_count;    // Number of replay mappings
    block_mapping_t *mappings;  // Target locations for replay
} transaction_record_t;

int journal_recover(journal_t *journal) {
    // Phase 1: Read journal superblock
    journal_superblock_t *sb = read_journal_superblock(journal);
    uint32_t checkpoint_seq = sb->s_sequence;
    uint32_t scan_start = sb->s_start;

    // Phase 2 & 3: Scan and validate transactions
    list_t *transactions = list_create();
    uint32_t current_block = scan_start;
    uint32_t expected_seq = checkpoint_seq;

    while (true) {
        journal_header_t *header = read_journal_block(journal, current_block);

        if (header->h_magic != JOURNAL_MAGIC) {
            break;  // End of valid journal records
        }

        if (header->h_blocktype == JBD2_DESCRIPTOR_BLOCK) {
            // Found start of a transaction
            transaction_record_t *txn = parse_transaction(
                journal, current_block, expected_seq);

            if (txn->valid) {
                // Transaction has valid commit with matching checksum
                list_append(transactions, txn);
                expected_seq++;
                current_block = txn->start_block + txn->length;
            } else {
                // Invalid or incomplete transaction - stop scanning.
                // All subsequent records are potentially corrupt.
                break;
            }
        } else {
            current_block = next_journal_block(journal, current_block);
        }
    }

    // Phase 4 & 5: Replay valid transactions
    printf("Recovery: %d transactions to replay\n", list_size(transactions));

    for (int t = 0; t < list_size(transactions); t++) {
        transaction_record_t *txn = list_get(transactions, t);

        for (uint32_t i = 0; i < txn->mappings_count; i++) {
            block_mapping_t *map = &txn->mappings[i];

            // Read data from journal
            void *data = read_journal_block(journal, map->journal_block);

            // Write to final location
            write_filesystem_block(map->target_block, data);
        }
    }

    // Phase 6: Flush all writes and update checkpoint
    issue_write_barrier();
    sb->s_sequence = expected_seq;
    sb->s_start = current_block;
    write_journal_superblock(journal, sb);
    issue_write_barrier();

    // Phase 7: Recovery complete
    printf("Recovery complete. File system consistent.\n");
    return 0;
}
```

Key Recovery Properties:

Idempotency: Replaying a transaction multiple times produces the same result as replaying it once. This is essential because we don't know whether a transaction was partially written back before the crash. Replaying is always safe.

Forward-only scanning: Recovery only scans forward from the checkpoint. We never need to undo work or scan backward: committed transactions are replayed, uncommitted transactions are ignored.

Bounded time: Recovery time depends on the number of transactions since the last checkpoint, not on file system size. For a 5-second commit interval and typical workloads, this means tens to hundreds of transactions, i.e. seconds of recovery time regardless of file system size.
Database systems often use both REDO and UNDO logging, but file systems typically use only REDO. This is because file systems don't support user-controlled transaction rollback—transactions are implicit and always run to completion or are never started. The simpler REDO-only approach is sufficient and easier to implement correctly.
Committing each file system operation as a separate transaction would be catastrophically slow—each commit requires a flush and FUA write. Instead, modern journaling file systems batch multiple operations into compound transactions that commit together.
The Batching Window:

The file system maintains an "open" transaction that accumulates modifications over a time window (typically 5 seconds in ext4). All operations during this window become part of the same transaction:
```
Time: T0──────T1──────T2──────T3──────T4──────T5 (commit)──T6──────►

Operations arriving:

  T0+0.1s: write(file_A) ──┐
  T0+0.3s: mkdir(dir_B)  ──┤
  T0+1.2s: write(file_C) ──┼──► Transaction 100
  T0+2.8s: unlink(file_D)──┤    (accumulating)
  T0+4.1s: write(file_E) ──┤
  T0+4.9s: rename(F, G)  ──┘

  T5+0.0s: COMMIT ──► Transaction 100 committed
                      All 6 operations are atomic

  T5+0.1s: write(file_H) ──┐
  T5+0.5s: create(file_I)──┼──► Transaction 101
  ...                      │    (new batch)

Commit interval: 5 seconds

All operations in the window share a single:
  - Descriptor block write
  - Commit record write
  - Barrier/flush overhead
```

Performance Impact:

Batching dramatically improves throughput:
| Metric | Per-Operation Commit | 5-Second Batching | Improvement |
|---|---|---|---|
| Operations per commit | 1 | ~1000 (typical) | 1000x fewer commits |
| Flushes per operation | 1 | 0.001 | 1000x fewer flushes |
| Latency (HDD) | ~10ms | ~0.01ms amortized | ~1000x |
| IOPS limited operations | 100/sec | 100,000+/sec | 1000x |
| Data loss window | 0 | Up to 5 seconds | Trade-off |
Early Commit Triggers:

While batching defers commits, certain conditions force an early commit, typically:

- An application calls fsync or fdatasync (durability demanded now)
- The journal runs low on space for the accumulating transaction
- Memory pressure forces dirty buffers to be flushed
- The file system is being unmounted or frozen

A sketch of this decision logic follows below.
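Here is a minimal C sketch of that decision, assuming a per-transaction state record; the constants and field names are illustrative, not taken from any particular file system:

```c
#include <stdbool.h>
#include <time.h>

#define COMMIT_INTERVAL_SEC 5     /* ext4-style default batching window  */
#define JOURNAL_LOW_WATER   1024  /* free blocks; illustrative threshold */

typedef struct {
    time_t opened;           /* when this transaction started accumulating */
    bool   fsync_requested;  /* an application demanded durability         */
    long   journal_free;     /* journal blocks still available             */
} open_txn_t;

/* Decide whether the open transaction must commit now rather than
   waiting for the batching window to expire. */
bool should_commit_now(const open_txn_t *t) {
    if (t->fsync_requested)                  /* fsync: commit immediately */
        return true;
    if (t->journal_free < JOURNAL_LOW_WATER) /* journal nearly full       */
        return true;
    if (time(NULL) - t->opened >= COMMIT_INTERVAL_SEC)
        return true;                         /* window expired            */
    return false;
}
```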
Batching means a crash can lose the last 5 seconds of work (unless fsync was called). For most workloads this is acceptable—unsaved documents would be lost anyway. For databases and critical applications, explicit fsync ensures durability at the cost of performance. Understand your application's requirements.
Handling Dependencies Within a Batch:

Operations within a batch may have dependencies:

1. Create file F
2. Write to file F
3. Rename file F to G

All three are in the same transaction. On recovery, either all three happen or none. This is correct: the batch is atomic. But the file system must ensure that the ordering of operations within the batch is preserved in the log, so replay produces the intended result.
Database systems distinguish between REDO logging and UNDO logging, and many use a combination called REDO-UNDO or ARIES (Algorithms for Recovery and Isolation Exploiting Semantics). File systems typically use simpler approaches. Understanding the distinction illuminates design choices.
Why File Systems Choose REDO:

File systems predominantly use REDO-only logging for several reasons:

1. No application rollback: Unlike databases, file systems don't support "I changed my mind, undo this transaction." Transactions always complete.

2. Simpler implementation: REDO-only avoids the complexity of reconstructing old values and handling partial undo.

3. Metadata focus: Most file systems journal only metadata, not data. The metadata modifications are already in memory; logging them is straightforward.

4. Batch commit model: The 5-second batching window means transactions are relatively short-lived. The overhead of buffering is acceptable.

REDO-Only Constraint:

REDO-only logging requires a no-steal/no-force buffer policy:

- No-steal: Uncommitted data must not be written to its final location (that would require undo on crash)
- No-force: Committed data need not be immediately written to its final location (the journal provides durability)

This is the standard approach in journaling file systems. Uncommitted modifications stay in memory; committed modifications are safe in the journal and can be lazily written back. The no-steal rule reduces to a one-line check, as shown in the sketch below.
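A sketch of that check in the buffer writeback path, with hypothetical types:

```c
#include <stdbool.h>

typedef struct { bool committed; } txn_state_t;

typedef struct {
    txn_state_t *owner;   /* transaction that dirtied this buffer */
    bool         dirty;
} buffer_t;

/* No-steal: a dirty buffer may reach its final on-disk location
   only once the transaction that dirtied it has committed.
   (No-force is the absence of the opposite rule: nothing obliges
   the buffer to be written at commit time; writeback is lazy.) */
bool may_write_back(const buffer_t *b) {
    return b->dirty && b->owner->committed;
}
```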
| Property | REDO Only | UNDO Only | REDO-UNDO (ARIES) |
|---|---|---|---|
| Log content | New values | Old values | Both |
| Dirty pages to disk | After commit | Before commit | Any time |
| Recovery action | Redo committed | Undo uncommitted | Redo then undo |
| Memory requirement | Buffer until commit | Log before modify | Moderate |
| Implementation complexity | Low | Medium | High |
| Typical use | File systems | Rare | Databases |
File system journals typically use physical logging—recording exact block contents. Databases often use logical logging—recording operations like "insert row X." Physical logging is simpler and idempotent (can replay safely even if already applied), but uses more space. File systems' focus on metadata makes physical logging practical.
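The difference is visible in the shape of a log record. Two hypothetical layouts, purely for contrast:

```c
#include <stdint.h>

/* Physical logging (file system journals): the record carries the
   complete new block image. Replay is blind and idempotent:
   copy the bytes to block `target_block`, as many times as needed. */
typedef struct {
    uint64_t target_block;   /* e.g. 847201        */
    uint8_t  image[4096];    /* exact new contents */
} physical_record_t;

/* Logical logging (databases): the record names an operation.
   Replay must re-execute it and must know whether it already ran.
   Smaller records, harder recovery. */
typedef struct {
    uint32_t opcode;         /* e.g. OP_INSERT_ROW */
    uint64_t table_id;
    uint8_t  row[128];       /* serialized row data */
} logical_record_t;
```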
A crash can corrupt not just data but the journal itself. If the system was writing to the journal at crash time, the written blocks may be partially complete (torn writes). Modern journals use checksums to detect such corruption and ensure only valid transactions are replayed.
The Torn Write Problem:

Consider a 4KB journal block being written when power fails:

- The first 2KB reaches the disk
- The last 2KB does not
- Result: the block contains half new data, half old (or garbage)

If this is a commit block, we might incorrectly conclude a transaction is complete. If it is a data block, we might replay corrupted data. Checksums provide the defense.
```c
#include <stdbool.h>
#include <stddef.h>   /* offsetof */
#include <stdint.h>
#include <stdio.h>

// Journal checksum types (ext4/JBD2 example).
// (journal_t, transaction_t, crc32c, crc16, etc. are assumed helpers.)
#define JBD2_CHECKSUM_CRC32   1
#define JBD2_CHECKSUM_CRC32C  2
#define JBD2_CHECKSUM_XXHASH  3  // Newer, faster

// Commit block with checksum
struct journal_commit_block {
    struct journal_header header;  // type = COMMIT, sequence
    uint8_t  chksum_type;          // Algorithm used
    uint8_t  chksum_size;          // Size of checksum
    uint8_t  padding[2];
    uint32_t chksum[4];            // The checksum value(s)
    uint64_t commit_sec;           // Commit timestamp
    uint32_t commit_nsec;
};

// Validate a complete transaction
bool validate_transaction(journal_t *j, transaction_t *txn) {
    // Step 1: Check commit block checksum covers descriptor + data
    uint32_t running_crc = 0;

    // Include descriptor block
    running_crc = crc32c(running_crc, txn->descriptor_block,
                         JOURNAL_BLOCK_SIZE);

    // Include all data blocks
    for (int i = 0; i < txn->num_data_blocks; i++) {
        running_crc = crc32c(running_crc, txn->data_blocks[i],
                             JOURNAL_BLOCK_SIZE);
    }

    // Include commit block (excluding checksum field)
    char *commit = (char *)txn->commit_block;
    running_crc = crc32c(running_crc, commit,
                         offsetof(struct journal_commit_block, chksum));

    // Compare with stored checksum
    if (running_crc != txn->commit_block->chksum[0]) {
        printf("Transaction %u failed checksum: computed %08x, stored %08x\n",
               txn->sequence, running_crc, txn->commit_block->chksum[0]);
        return false;
    }

    // Step 2: Check individual block tags if present
    for (int i = 0; i < txn->num_blocks; i++) {
        block_tag_t *tag = &txn->descriptor->tags[i];
        if (tag->flags & TAG_FLAG_CHECKSUM) {
            uint16_t block_crc = crc16(txn->data_blocks[i],
                                       JOURNAL_BLOCK_SIZE);
            if (block_crc != tag->checksum) {
                printf("Block %d checksum mismatch\n", i);
                return false;
            }
        }
    }

    return true;
}
```

Checksum Coverage:

Modern journals compute checksums over the entire transaction:

1. Descriptor block: includes all block tags and target addresses
2. All data blocks: every logged block is included
3. Commit block: the commit itself (with the checksum field zeroed during computation)

This creates an end-to-end integrity check. Any corruption, in any block of the transaction, will cause validation to fail.

Checksum Algorithm Selection:
| Algorithm | Collision Resistance | Performance | Notes |
|---|---|---|---|
| CRC32 | Moderate | Good | Hardware acceleration on many CPUs |
| CRC32C | Moderate | Excellent | Intel CRC32 instruction, preferred |
| xxHash | High | Excellent | Modern, very fast, gaining adoption |
| SHA-256 | Very High | Poor | Overkill for corruption detection |
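For reference, here is a bit-at-a-time software CRC32C (Castagnoli polynomial, reflected form 0x82F63B78), matching the `crc32c(crc, buf, len)` signature used in the validation sketch above. Production implementations use lookup tables or the SSE4.2 `crc32` instruction instead:

```c
#include <stddef.h>
#include <stdint.h>

/* CRC32C, bit-at-a-time reference implementation. Chaining works:
   crc32c(crc32c(0, a, la), b, lb) equals the CRC of a followed by b. */
uint32_t crc32c(uint32_t crc, const void *buf, size_t len) {
    const uint8_t *p = buf;
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1));
    }
    return ~crc;
}
```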
Journal checksums detect corruption from crashes and hardware errors. They do not protect against malicious modification—an attacker can modify both data and checksum. For security against tampering, you need cryptographic MACs or authenticated encryption. File systems like ZFS provide this; most journaling filesystems do not.
We've explored the detailed mechanics of write-ahead logging. Let's consolidate the key insights:

- Ordering is everything: log records before final-location writes, data before the commit record, and the commit record written atomically
- Barriers (flush and FUA) enforce that ordering against drive caches that would otherwise reorder writes
- Checkpointing reclaims journal space and bounds recovery time, but must never advance past an incomplete writeback
- Batching amortizes commit overhead across many operations, trading a bounded data-loss window for throughput
- Checksums detect torn writes and corruption, so recovery replays only fully valid transactions
What's Next:

With the WAL mechanism understood, we'll next explore metadata journaling: the most common journaling mode, which protects file system structure while allowing data to be written directly. This mode offers the best balance of performance and consistency for most workloads.
You now understand the core mechanism that makes journaling work. This knowledge applies not just to file systems but to databases, message queues, distributed systems, and any system that needs durability guarantees. WAL is a fundamental technique in systems engineering.