Everything we've learned about journaling leads to this moment: the system has crashed, power has been restored, and the file system must recover. The journal contains a record of recent activity—some transactions completed, others were interrupted mid-flight. How does the file system determine what happened and restore consistency?

Journal replay is the recovery mechanism that transforms an inconsistent post-crash state into a consistent, usable file system. It's the payoff for all the careful logging and ordering constraints we've studied. A well-implemented replay algorithm is fast, deterministic, and robust—it can handle any crash scenario and restore the file system in seconds, regardless of disk size.
This page provides a comprehensive examination of journal replay. You'll understand how the recovery scan identifies transactions, how validity is determined, the replay execution process, and the subtle correctness considerations that ensure replay always produces a consistent result. This knowledge completes your understanding of journaling and helps you predict file system behavior after crashes.
Before examining the replay algorithm, let's understand what state the file system is in when recovery begins. This context is essential for understanding why replay works the way it does.
Post-Crash State Possibilities:

When a system crashes, the file system may be left in any of a wide range of intermediate states:
| Component | Possible States | Recovery Need |
|---|---|---|
| Journal | Partially written transactions, complete transactions, corrupted blocks | Identify valid transactions |
| Metadata on disk | Mix of old and new versions, depending on writeback progress | Ensure consistency with journal |
| Data on disk | May or may not reflect recent writes | Depends on journaling mode |
| Checkpoint pointer | May be stale (pointing to already-applied transactions) | Scan may replay already-applied work |
| Page cache | Lost completely (volatile) | All dirty data lost |
The Core Insight:

The journal is the source of truth for recent operations. We don't trust the on-disk metadata—it might be partially updated or stale. We don't know what was in the page cache—it's gone. We only trust the journal, and within the journal, only transactions with valid commit records.

Recovery Philosophy:

1. Pessimistic: Assume the worst. Every transaction that might be incomplete is treated as if it didn't happen.
2. Idempotent: Replay every valid transaction, even if it might have already been applied. Replaying a transaction twice produces the same result as replaying once.
3. Sequential: Process transactions in order. Don't skip ahead or parallelize in ways that could violate dependencies.
When a file system is cleanly unmounted, all transactions are checkpointed, and a 'clean' flag is written. On next mount, recovery is skipped—the file system is known-consistent. Recovery only runs after unclean shutdown (crash, power failure). This is why clean shutdowns are fast and crashes require recovery time.
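As a minimal sketch of that mount-time decision—the `superblock_t` type and the `sb_is_clean`, `sb_clear_clean_flag`, and `journal_recover` helpers are hypothetical names used only for illustration:

```c
// Mount-time decision: skip recovery if the previous unmount was clean.
int mount_filesystem(superblock_t *sb, journal_t *journal)
{
    if (sb_is_clean(sb)) {
        // Clean unmount: every transaction was checkpointed and the
        // clean flag was written. Nothing to replay.
        printk("Filesystem clean - skipping journal recovery\n");
    } else {
        // Unclean shutdown: the journal is the source of truth.
        int err = journal_recover(journal);
        if (err)
            return err;   // Refuse to mount if recovery fails
    }

    // Clear the clean flag while mounted; it is rewritten on
    // clean unmount.
    sb_clear_clean_flag(sb);
    return 0;
}
```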
The first phase of recovery is scanning the journal to identify valid transactions. This process must be thorough—missing a valid transaction loses changes; accepting an invalid transaction corrupts the filesystem.
```
Journal Recovery Scan Process:

Step 1: Read Journal Superblock
─────────────────────────────────────────────────────────────────
Journal Superblock Contents:
├── Magic number: 0xC03B3998 (validates journal format)
├── Version: 2 (JBD2)
├── Block size: 4096
├── First block: 1 (first log block number)
├── Max blocks: 32768 (128MB journal)
├── First sequence: 45892 (expected next sequence)
├── Start block: 847 (where to start scanning)
└── Features: CHECKSUM_V3, CSUM_V2

Key information: Scan starts at block 847, sequence 45892

Step 2: Scan Forward for Transactions
─────────────────────────────────────────────────────────────────
Block  Type        Sequence  Status
─────────────────────────────────────────────────────────────────
847    Descriptor  45892     Found - start of transaction
848    Data block  -         Part of txn 45892
849    Data block  -         Part of txn 45892
850    Commit      45892     ✓ Valid checksum - txn COMPLETE
851    Descriptor  45893     Found - start of new txn
852    Data block  -         Part of txn 45893
853    Data block  -         Part of txn 45893
854    Data block  -         Part of txn 45893
855    Commit      45893     ✓ Valid checksum - txn COMPLETE
856    Descriptor  45894     Found - start of new txn
857    Data block  -         Part of txn 45894
858    (garbage)   -         ✗ Invalid magic - END OF LOG
─────────────────────────────────────────────────────────────────

Result: 2 complete transactions (45892, 45893)
        1 incomplete transaction (45894) - IGNORED
```

Scan Algorithm Details:
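A compact sketch of the forward-scan loop, in the same pseudocode style as the replay listing later on this page; the `read_journal_block`, `block_magic_ok`, and `parse_transaction` helpers are hypothetical, named here only for illustration:

```c
// Forward scan: walk the log from the superblock's start position,
// collecting transactions until the chain of valid, sequential,
// committed transactions is broken.
list_t *journal_scan_transactions(journal_t *journal)
{
    list_t *valid = list_create();
    uint32_t next_seq = journal->j_sb->s_sequence;  // e.g., 45892
    uint32_t block = journal->j_sb->s_start;        // e.g., 847

    for (;;) {
        header_t *hdr = read_journal_block(journal, block);

        // A bad magic number or unexpected sequence ends the scan
        // (block 858 in the walkthrough above)
        if (!block_magic_ok(hdr) ||
            hdr->type != BLOCK_TYPE_DESCRIPTOR ||
            hdr->sequence != next_seq)
            break;

        // parse_transaction() reads the descriptor's tags, the logged
        // blocks, and the commit record; it advances 'block' past the
        // transaction and returns NULL if no valid commit follows
        transaction_t *txn = parse_transaction(journal, &block);
        if (txn == NULL)
            break;          // Incomplete transaction - end of log

        list_append(valid, txn);
        next_seq++;         // Transactions must be strictly sequential
    }
    return valid;
}
```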
Transaction Validation:

A transaction is valid only if ALL of these conditions are met:

1. Its descriptor block carries the journal's magic number and the expected sequence number
2. Every block the descriptor lists is present in the journal
3. A commit record for the transaction follows the logged blocks
4. The commit record's checksum matches the transaction's contents
If a sequence number is missing, all subsequent transactions are suspect. For example, if we find transactions 100 and 102 but not 101, we cannot trust 102—the gap indicates potential corruption or incomplete scan. Conservative implementations stop at the first gap or invalid transaction.
Once valid transactions are identified, they must be replayed—their logged blocks written to their final on-disk locations. This phase restores the file system to a consistent state reflecting all committed work.
```c
// Replay all valid transactions
int journal_replay(journal_t *journal)
{
    list_t *valid_transactions;
    transaction_t *txn;
    int replayed = 0;
    int blocks_written = 0;

    // Phase 1: Scan and collect valid transactions
    valid_transactions = journal_scan_transactions(journal);
    if (list_empty(valid_transactions)) {
        printk("Journal clean - no replay needed\n");
        return 0;
    }

    printk("Journal replay: %d transactions to replay\n",
           list_size(valid_transactions));

    // Phase 2: Replay each transaction in order
    list_foreach(valid_transactions, txn) {
        printk("Replaying transaction %u (%d blocks)\n",
               txn->sequence, txn->num_blocks);

        // Replay each block in the transaction
        for (int i = 0; i < txn->num_blocks; i++) {
            block_tag_t *tag = &txn->tags[i];
            void *data = txn->data_blocks[i];

            // Write block to its final location
            buffer_head_t *bh = sb_bread(journal->j_fs_dev,
                                         tag->target_block);
            memcpy(bh->b_data, data, journal->j_blocksize);
            mark_buffer_dirty(bh);

            // Write immediately - don't defer
            sync_dirty_buffer(bh);
            brelse(bh);
            blocks_written++;
        }
        replayed++;
    }

    // Phase 3: Ensure all replayed data is durable
    // This barrier ensures replay is complete before we proceed
    blkdev_issue_flush(journal->j_fs_dev);

    // Phase 4: Update journal to indicate replay complete
    // Advance superblock past replayed transactions
    journal->j_sb->s_start = journal->j_head;
    journal->j_sb->s_sequence = journal->j_tail_sequence + 1;
    write_journal_superblock(journal);
    blkdev_issue_flush(journal->j_dev);

    printk("Journal replay complete: %d transactions, %d blocks\n",
           replayed, blocks_written);
    return 0;
}
```

Replay Ordering:

Transactions must be replayed in sequence order. A later transaction may depend on changes from an earlier one:

1. Transaction 100: Create file, allocate inode 5847
2. Transaction 101: Write to file, update inode 5847's block pointers

If we replayed 101 before 100, we might write to an inode that hasn't been initialized yet. Sequential replay preserves dependencies.

Parallel Replay Opportunities:

While transactions must be ordered, independent blocks within a transaction can be written in parallel:

- Blocks don't overlap—each goes to a different disk location
- I/O schedulers can optimize write order for disk efficiency
- Modern NVMe drives can process many parallel writes

Implementations often batch many block writes and issue them together, letting the I/O stack optimize.
Some implementations sync after each transaction rather than batching all transactions together. This is safer: if the system crashes again during recovery, already-replayed transactions don't need re-replaying. More aggressive implementations batch for speed, accepting that a crash during recovery might extend recovery time.
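A sketch of the safer per-transaction variant, reusing the structures from the replay listing above; `replay_transaction_blocks` and `block_after` are hypothetical helpers standing in for the Phase 2 loop of that listing:

```c
// Per-transaction durability: sync and record progress after every
// transaction instead of once at the end of replay.
list_foreach(valid_transactions, txn) {
    replay_transaction_blocks(journal, txn);  // Phase 2 loop from above

    // Make this transaction's blocks durable before recording progress
    blkdev_issue_flush(journal->j_fs_dev);

    // Record progress in the journal superblock: a crash now means
    // the next recovery starts AFTER this transaction instead of
    // replaying it again
    journal->j_sb->s_start = block_after(journal, txn);
    journal->j_sb->s_sequence = txn->sequence + 1;
    write_journal_superblock(journal);
    blkdev_issue_flush(journal->j_dev);
}
```

The trade-off: two flushes per transaction instead of two per recovery. On a journal holding hundreds of transactions, batching is noticeably faster.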
Not all journal records describe blocks to be replayed. Revoke records tell recovery to NOT replay certain blocks from earlier transactions. Understanding revokes is essential for correct recovery.
The Revoke Problem:

Consider this sequence:

1. Transaction 100: Allocate block 5000 for file A, write inode
2. User deletes file A
3. Transaction 101: Deallocate block 5000, update bitmap
4. Transaction 102: Allocate block 5000 for file B, write NEW data
5. CRASH before transaction 102 completes

During recovery:

- Transaction 100 is valid (has commit record)
- Transaction 102 is invalid (no commit record, discarded)

If we replay transaction 100, we write file A's inode pointing to block 5000. But block 5000 might now contain newer data from file B (step 4's data block might have reached disk even though the transaction didn't commit).

Replay would overwrite newer data with stale transaction 100 data!
```
Revoke Record Mechanism:

When a block is freed (deleted) after being journaled:
1. Record a REVOKE entry in the current transaction
2. Revoke specifies: "Block 5000 should not be replayed from txn < 101"

Journal Contents:
───────────────────────────────────────────────────────────────────
Txn 100: [Descriptor][Block 5000: file A data][Commit]
         This would normally be replayed...

Txn 101: [Descriptor][Bitmap update][REVOKE: Block 5000][Commit]
         Revoke says: Don't replay block 5000 from earlier txns

Txn 102: [Descriptor][Block 5000: file B data][NO COMMIT - incomplete]
         Invalid transaction, ignored
───────────────────────────────────────────────────────────────────

Recovery Process with Revokes:
1. Pass 1 (SCAN): Find all committed transactions
2. Pass 2 (REVOKE): Scan for revoke records, build revoke hash table
   Revoke table: {block 5000 → revoked before txn 101}
3. Pass 3 (REPLAY): For each block in each transaction:
   - Check if block is revoked for this transaction
   - If revoked, skip this block
   - If not revoked, replay it

Result: Block 5000 from Txn 100 is NOT replayed
        Block 5000 contains whatever was on disk (possibly file B
        data, or old unrelated data) - but not stale file A data
```

Revoke Semantics:

- A revoke for block B in transaction T means: "Don't replay block B from any transaction before T"
- Revokes are recorded when blocks are freed/deallocated
- The revoke hash table is built before any replays occur
- Revoke checking adds overhead but ensures correctness

Implementation Efficiency:

Building the revoke table means walking the journal more than once: at minimum, once to find revokes and once to replay (a two-pass algorithm). JBD2-style implementations use three passes:

1. Pass 1: Scan forward, identify all valid transactions
2. Pass 2: Scan the committed transactions again, build the revoke table
3. Pass 3: Replay, checking the revoke table for each block
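Tying the passes together, a minimal recovery driver might look like the sketch below; `journal_scan_transactions` and `build_revoke_table` are sketched elsewhere on this page, and `replay_transactions` is a hypothetical wrapper around the replay loop shown earlier:

```c
// Three-pass recovery driver (sketch)
int journal_recover(journal_t *journal)
{
    // Pass 1 (SCAN): find all committed transactions, in order
    list_t *txns = journal_scan_transactions(journal);

    // Pass 2 (REVOKE): build the complete revoke table BEFORE any
    // replay - the revoke in txn 101 must suppress block 5000
    // from txn 100
    build_revoke_table(journal, txns);

    // Pass 3 (REPLAY): write every non-revoked block to its final
    // location, transaction by transaction
    return replay_transactions(journal, txns);
}
```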
Revokes are most important with metadata journaling. With full data journaling, data blocks are also in the journal, so the revoke scenario is less concerning—we'd replay the correct data. However, revokes still matter for blocks that are freed and then reused between checkpoints.
One of journaling's key benefits is bounded recovery time. Let's analyze what determines recovery duration and how to minimize it.
Recovery Time Components:
| Phase | Duration | Depends On |
|---|---|---|
| Journal superblock read | ~1ms | Disk latency |
| Scan for transactions | O(journal size) | Journal size, disk throughput |
| Build revoke table | O(transaction count) | Number of revoke records |
| Replay blocks | O(blocks to replay) | Number of blocks since checkpoint |
| Sync replayed data | O(blocks × disk latency) | Disk sync speed |
| Update journal superblock | ~2ms | Single write + sync |
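To turn these components into a single number, here is a back-of-the-envelope model; the function name and all constants are illustrative assumptions, not measurements:

```c
// Rough recovery-time model mirroring the table above. Superblock
// reads/updates (~ms) and revoke-table construction are folded into
// sync_seconds, since they are negligible next to scan and replay.
double estimate_recovery_seconds(double journal_bytes,
                                 double blocks_to_replay,
                                 double read_bw,      /* bytes/s */
                                 double write_bw,     /* bytes/s */
                                 double block_size,   /* bytes */
                                 double sync_seconds)
{
    double scan   = journal_bytes / read_bw;
    double replay = (blocks_to_replay * block_size) / write_bw;
    return scan + replay + sync_seconds;
}

// Example matching Scenario 2 below:
//   estimate_recovery_seconds(1e9, 50000, 500e6, 150e6, 4096, 0.3)
//   ≈ 2.0 + 1.4 + 0.3 ≈ 3.7 seconds, i.e. "~4 seconds"
```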
Typical Recovery Times:
```
Example Recovery Scenarios:

Scenario 1: Light workload, recent checkpoint
─────────────────────────────────────────────────────────────────
Journal size: 128MB
Transactions since checkpoint: 5
Blocks to replay: 200
Device: SSD

Scan time: ~1ms (few transactions to scan)
Build revoke: negligible
Replay: 200 blocks × 0.1ms = ~20ms
Sync: ~1ms (SSD sync fast)
──────────────────────────────────────────────────────
Total: ~25ms

Scenario 2: Heavy workload, checkpoint behind
─────────────────────────────────────────────────────────────────
Journal size: 1GB
Transactions since checkpoint: 100
Blocks to replay: 50,000
Device: HDD

Scan time: ~2 seconds (1GB journal, 500MB/s read)
Build revoke: ~100ms
Replay: 50,000 blocks × 4KB = 200MB
        200MB ÷ 150MB/s write = ~1.3 seconds
Sync: ~200ms (HDD depends on location)
──────────────────────────────────────────────────────
Total: ~4 seconds

Scenario 3: Very old checkpoint (rare/problematic)
─────────────────────────────────────────────────────────────────
Journal size: 4GB (full)
Transactions since checkpoint: 500
Blocks to replay: 500,000
Device: HDD

Scan time: ~8 seconds
Build revoke: ~500ms
Replay: 500,000 × 4KB = 2GB
        2GB ÷ 150MB/s = ~15 seconds
Sync: ~5 seconds
──────────────────────────────────────────────────────
Total: ~30 seconds

Note: 30 seconds is still much better than fsck on a
multi-terabyte filesystem (hours).
```

Minimizing Recovery Time:

- Checkpoint frequently: the fewer blocks outstanding since the last checkpoint, the less there is to replay (compare Scenarios 1 and 3)
- Keep the journal appropriately sized: scan time grows with journal size
- Use faster storage: scan, replay, and sync all scale with device throughput and latency
For perspective: full fsck on a 4TB filesystem with 100 million inodes can take 4-6 hours. Journal replay on the same system takes seconds to tens of seconds. This is the fundamental win of journaling—recovery time proportional to recent activity, not total filesystem size.
Journal replay must be correct in all cases—an incorrect recovery is worse than no recovery. Let's examine the subtle correctness properties that replay algorithms must maintain.
Property 1: Idempotence

Replaying a transaction must be safe even if it was already applied. We don't know the exact crash point—the writeback might have completed for some blocks but not others.

How it's achieved: Transactions write complete block contents, not incremental changes. Writing the same block contents twice is harmless—the second write just overwrites with identical data.

Property 2: Atomicity

A transaction either fully replays or doesn't replay at all. We never partially replay.

How it's achieved: The commit record is the atomicity gate. No commit = no replay, even if all data blocks are present. Commit present = replay everything in the transaction.
Property 3: Ordering

Transactions must be replayed in sequence order. Transaction N might depend on the result of transaction N-1.

How it's achieved: The scan produces a transaction list in sequence order. Replay iterates this list sequentially.

Property 4: Revoke Correctness

Revoked blocks must never be replayed from older transactions, even if those transactions are valid.

How it's achieved: Build the complete revoke table before any replays. Check every block against the revoke table. Revokes are never ignored or forgotten.
```c
// Key correctness invariants for journal replay

/*
 * Invariant 1: Checksum Validation
 * A transaction is only considered valid if its checksum passes.
 * This guards against torn writes corrupting the journal.
 */
bool validate_transaction(transaction_t *txn)
{
    uint32_t computed = compute_checksum(txn);
    uint32_t stored = txn->commit->checksum;

    if (computed != stored) {
        // Checksum mismatch - transaction invalid
        // This is the expected case for incomplete commits
        return false;
    }
    return true;
}

/*
 * Invariant 2: Sequence Continuity
 * Transactions must have continuous sequence numbers.
 * A gap indicates potential corruption or missed entries.
 */
bool check_sequence_continuity(list_t *transactions)
{
    transaction_t *prev = NULL;

    list_foreach(transactions, txn) {
        if (prev && txn->sequence != prev->sequence + 1) {
            // Sequence gap detected
            // Depending on implementation: error or stop here
            return false;
        }
        prev = txn;
    }
    return true;
}

/*
 * Invariant 3: Complete Revoke Application
 * All revokes from all committed transactions must be
 * applied before any blocks are replayed.
 */
void build_revoke_table(journal_t *j, list_t *transactions)
{
    revoke_table_t *table = revoke_table_create();

    // Process ALL transactions for revokes FIRST
    list_foreach(transactions, txn) {
        for_each_revoke_record(txn, revoke) {
            // Record: "block B revoked before txn T"
            revoke_table_add(table, revoke->block, txn->sequence);
        }
    }
    j->j_revoke_table = table;
}

// During replay, check revokes
bool should_replay_block(journal_t *j, block_tag_t *tag,
                         uint32_t txn_sequence)
{
    uint32_t revoked_before = revoke_table_lookup(
        j->j_revoke_table, tag->target_block);

    if (revoked_before && txn_sequence < revoked_before) {
        // This block was revoked after this transaction
        // Do NOT replay it
        return false;
    }
    return true;
}
```

If any unexpected condition is detected during recovery (wrong magic numbers, sequence gaps, suspicious checksums), most implementations abort or require manual intervention. Attempting to recover from a corrupted journal risks making things worse. When in doubt, report the error and let administrators decide.
Beyond basic replay, several advanced scenarios and techniques are important for production systems.
Crash During Recovery:

What if the system crashes DURING journal replay?

Scenarios:

- Some transactions replayed, journal superblock updated, then crash
- Some transactions replayed, crash before the superblock update

Solution: Recovery is idempotent! On the next boot:

- If the superblock was updated: the scan starts after the already-replayed transactions
- If the superblock wasn't updated: everything is replayed again (harmless due to idempotence)

Either way, we reach a consistent state. This is why idempotence is so important.
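Concretely, using the block and sequence numbers from the scan walkthrough earlier (transactions 45892-45894 starting at block 847):

```
Crash during recovery - two restart cases:
─────────────────────────────────────────────────────────────
Case A: Superblock updated before the crash
  s_start = 856, s_sequence = 45894
  → Next boot scans from 45894, finds no valid commit,
    replays nothing further

Case B: Crash before the superblock update
  s_start = 847, s_sequence = 45892
  → Next boot finds 45892 and 45893 again and replays them
    a second time; full-block writes make this harmless
─────────────────────────────────────────────────────────────
Either path ends in the same consistent on-disk state.
```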
Multi-Device Journals:

For LVM, RAID, and other multi-device configurations:

- Each filesystem has its own journal
- Journals must be replayed independently
- Device-mapper ordering must be respected

Special care is needed when the journal device and data device are different:

- The journal device must be available before the data device
- Failure of the journal device prevents mount entirely
- An external journal on a failed device is catastrophic
Online/Lazy Recovery:

Some systems offer expedited mount with background recovery:

1. A quick scan identifies immediate needs
2. The filesystem mounts read-only or with limited writes
3. Full recovery proceeds in the background
4. Full access is enabled after recovery completes

This reduces apparent downtime but adds complexity. Data accessed before recovery completes might show inconsistencies. XFS supports this model.
```bash
# Debugging journal recovery issues

# Check if journal replay occurred on last mount
dmesg | grep -i "EXT4-fs.*recovery"
# Output: "EXT4-fs (sda1): recovery complete"
# Or:     "EXT4-fs (sda1): no recovery required"

# Force replay (unmount, mark dirty, mount)
umount /mnt/data
debugfs -w -R "ssv state 0" /dev/sda1  # Clear the clean-state flag
mount /dev/sda1 /mnt/data              # Will trigger replay

# View journal contents (requires unmount)
debugfs -R "logdump -a" /dev/sda1 2>/dev/null | head -100

# Check journal superblock
dumpe2fs -h /dev/sda1 2>/dev/null | grep -A10 "^Journal"

# Check filesystem state
tune2fs -l /dev/sda1 | grep "Filesystem state"
# "clean" or "not clean"

# XFS journal status
xfs_logprint -c /dev/sdb1  # Count log entries
xfs_logprint -t /dev/sdb1  # Transaction dump

# Recover corrupted journal (emergency, may lose data)
# ext4: rebuild journal
tune2fs -O ^has_journal /dev/sda1  # Remove journal
tune2fs -j /dev/sda1               # Recreate journal

# XFS: clear log and repair
xfs_repair -L /dev/sdb1  # -L zeros log (DATA LOSS POSSIBLE)
```

True journal corruption (not just incomplete transactions) is rare and usually indicates hardware problems—failing disk, bad memory, controller issues. If recovery consistently fails, investigate hardware before trying emergency repairs. Checksums detect most corruption; persistent checksum failures need hardware attention.
We've completed our deep examination of journal replay—the mechanism that makes journaling work. Let's consolidate what we've learned.
Module Complete:

With journal replay understood, you now have a complete picture of file system journaling. From the crash consistency problem through write-ahead logging, metadata and full journaling modes, to recovery and replay—you understand how modern file systems maintain consistency across failures.

This knowledge applies directly to:

- Choosing appropriate journaling modes for your workloads
- Understanding and predicting recovery behavior
- Debugging file system issues after crashes
- Making informed decisions about application durability
- Designing systems that correctly handle crashes
Congratulations! You've mastered file system journaling—one of the most important mechanisms in modern operating systems. This knowledge is foundational for anyone working with storage systems, databases, or any application requiring crash consistency.