Everything we've learned about journaling leads to this moment: the system has crashed, power has been restored, and the file system must recover. The journal contains a record of recent activity—some transactions completed, others were interrupted mid-flight. How does the file system determine what happened and restore consistency?

Journal replay is the recovery mechanism that transforms an inconsistent post-crash state into a consistent, usable file system. It's the payoff for all the careful logging and ordering constraints we've studied. A well-implemented replay algorithm is fast, deterministic, and robust—it can handle any crash scenario and restore the file system in seconds, regardless of disk size.
This page provides a comprehensive examination of journal replay. You'll understand how the recovery scan identifies transactions, how validity is determined, the replay execution process, and the subtle correctness considerations that ensure replay always produces a consistent result. This knowledge completes your understanding of journaling and helps you predict file system behavior after crashes.
Before examining the replay algorithm, let's understand what state the file system is in when recovery begins. This context is essential for understanding why replay works the way it does.
Post-Crash State Possibilities:

When a system crashes, the file system may be left in any of a wide range of intermediate states:
| Component | Possible States | Recovery Need |
|---|---|---|
| Journal | Partially written transactions, complete transactions, corrupted blocks | Identify valid transactions |
| Metadata on disk | Mix of old and new versions, depending on writeback progress | Ensure consistency with journal |
| Data on disk | May or may not reflect recent writes | Depends on journaling mode |
| Checkpoint pointer | May be stale (pointing to already-applied transactions) | Scan may replay already-applied work |
| Page cache | Lost completely (volatile) | All dirty data lost |
The Core Insight:

The journal is the source of truth for recent operations. We don't trust the on-disk metadata—it might be partially updated or stale. We don't know what was in the page cache—it's gone. We only trust the journal, and within the journal, only transactions with valid commit records.

Recovery Philosophy:

1. Pessimistic: Assume the worst. Every transaction that might be incomplete is treated as if it didn't happen.
2. Idempotent: Replay every valid transaction, even if it might have already been applied. Replaying a transaction twice produces the same result as replaying once.
3. Sequential: Process transactions in order. Don't skip ahead or parallelize in ways that could violate dependencies.
When a file system is cleanly unmounted, all transactions are checkpointed, and a 'clean' flag is written. On next mount, recovery is skipped—the file system is known-consistent. Recovery only runs after unclean shutdown (crash, power failure). This is why clean shutdowns are fast and crashes require recovery time.
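As a minimal sketch of that mount-time decision—the `superblock_t` type and the `sb_is_clean`, `sb_clear_clean_flag`, and `journal_recover` helpers are hypothetical names used only for illustration:

```c
// Mount-time decision: skip recovery if the previous unmount was clean.
int mount_filesystem(superblock_t *sb, journal_t *journal)
{
    if (sb_is_clean(sb)) {
        // Clean unmount: every transaction was checkpointed and the
        // clean flag was written. Nothing to replay.
        printk("Filesystem clean - skipping journal recovery\n");
    } else {
        // Unclean shutdown: the journal is the source of truth.
        int err = journal_recover(journal);
        if (err)
            return err;   // Refuse to mount if recovery fails
    }

    // Clear the clean flag while mounted; it is rewritten on
    // clean unmount.
    sb_clear_clean_flag(sb);
    return 0;
}
```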
The first phase of recovery is scanning the journal to identify valid transactions. This process must be thorough—missing a valid transaction loses changes; accepting an invalid transaction corrupts the filesystem.
```
Journal Recovery Scan Process:

Step 1: Read Journal Superblock
─────────────────────────────────────────────────────────────────
Journal Superblock Contents:
├── Magic number: 0xC03B3998 (validates journal format)
├── Version: 2 (JBD2)
├── Block size: 4096
├── First block: 1 (first log block number)
├── Max blocks: 32768 (128MB journal)
├── First sequence: 45892 (expected next sequence)
├── Start block: 847 (where to start scanning)
└── Features: CHECKSUM_V3, CSUM_V2

Key information: Scan starts at block 847, sequence 45892

Step 2: Scan Forward for Transactions
─────────────────────────────────────────────────────────────────
Block  Type        Sequence  Status
─────────────────────────────────────────────────────────────────
847    Descriptor  45892     Found - start of transaction
848    Data block  -         Part of txn 45892
849    Data block  -         Part of txn 45892
850    Commit      45892     ✓ Valid checksum - txn COMPLETE
851    Descriptor  45893     Found - start of new txn
852    Data block  -         Part of txn 45893
853    Data block  -         Part of txn 45893
854    Data block  -         Part of txn 45893
855    Commit      45893     ✓ Valid checksum - txn COMPLETE
856    Descriptor  45894     Found - start of new txn
857    Data block  -         Part of txn 45894
858    (garbage)   -         ✗ Invalid magic - END OF LOG
─────────────────────────────────────────────────────────────────

Result: 2 complete transactions (45892, 45893)
        1 incomplete transaction (45894) - IGNORED
```

Scan Algorithm Details:
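A compact sketch of the forward-scan loop, in the same pseudocode style as the replay listing later on this page; the `read_journal_block`, `block_magic_ok`, and `parse_transaction` helpers are hypothetical, named here only for illustration:

```c
// Forward scan: walk the log from the superblock's start position,
// collecting transactions until the chain of valid, sequential,
// committed transactions is broken.
list_t *journal_scan_transactions(journal_t *journal)
{
    list_t *valid = list_create();
    uint32_t next_seq = journal->j_sb->s_sequence;  // e.g., 45892
    uint32_t block = journal->j_sb->s_start;        // e.g., 847

    for (;;) {
        header_t *hdr = read_journal_block(journal, block);

        // A bad magic number or unexpected sequence ends the scan
        // (block 858 in the walkthrough above)
        if (!block_magic_ok(hdr) ||
            hdr->type != BLOCK_TYPE_DESCRIPTOR ||
            hdr->sequence != next_seq)
            break;

        // parse_transaction() reads the descriptor's tags, the logged
        // blocks, and the commit record; it advances 'block' past the
        // transaction and returns NULL if no valid commit follows
        transaction_t *txn = parse_transaction(journal, &block);
        if (txn == NULL)
            break;          // Incomplete transaction - end of log

        list_append(valid, txn);
        next_seq++;         // Transactions must be strictly sequential
    }
    return valid;
}
```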
Transaction Validation:

A transaction is valid only if ALL of these conditions are met:

1. Its descriptor block carries the journal's magic number and the expected sequence number
2. Every block the descriptor lists is present in the journal
3. A commit record for the transaction follows the logged blocks
4. The commit record's checksum matches the transaction's contents
If a sequence number is missing, all subsequent transactions are suspect. For example, if we find transactions 100 and 102 but not 101, we cannot trust 102—the gap indicates potential corruption or incomplete scan. Conservative implementations stop at the first gap or invalid transaction.
Once valid transactions are identified, they must be replayed—their logged blocks written to their final on-disk locations. This phase restores the file system to a consistent state reflecting all committed work.
```c
// Replay all valid transactions
int journal_replay(journal_t *journal)
{
    list_t *valid_transactions;
    transaction_t *txn;
    int replayed = 0;
    int blocks_written = 0;

    // Phase 1: Scan and collect valid transactions
    valid_transactions = journal_scan_transactions(journal);
    if (list_empty(valid_transactions)) {
        printk("Journal clean - no replay needed\n");
        return 0;
    }

    printk("Journal replay: %d transactions to replay\n",
           list_size(valid_transactions));

    // Phase 2: Replay each transaction in order
    list_foreach(valid_transactions, txn) {
        printk("Replaying transaction %u (%d blocks)\n",
               txn->sequence, txn->num_blocks);

        // Replay each block in the transaction
        for (int i = 0; i < txn->num_blocks; i++) {
            block_tag_t *tag = &txn->tags[i];
            void *data = txn->data_blocks[i];

            // Write block to its final location
            buffer_head_t *bh = sb_bread(journal->j_fs_dev,
                                         tag->target_block);
            memcpy(bh->b_data, data, journal->j_blocksize);
            mark_buffer_dirty(bh);

            // Write immediately - don't defer
            sync_dirty_buffer(bh);
            brelse(bh);
            blocks_written++;
        }
        replayed++;
    }

    // Phase 3: Ensure all replayed data is durable
    // This barrier ensures replay is complete before we proceed
    blkdev_issue_flush(journal->j_fs_dev);

    // Phase 4: Update journal to indicate replay complete
    // Advance superblock past replayed transactions
    journal->j_sb->s_start = journal->j_head;
    journal->j_sb->s_sequence = journal->j_tail_sequence + 1;
    write_journal_superblock(journal);
    blkdev_issue_flush(journal->j_dev);

    printk("Journal replay complete: %d transactions, %d blocks\n",
           replayed, blocks_written);
    return 0;
}
```

Replay Ordering:

Transactions must be replayed in sequence order. A later transaction may depend on changes from an earlier one:

1. Transaction 100: Create file, allocate inode 5847
2. Transaction 101: Write to file, update inode 5847's block pointers

If we replayed 101 before 100, we might write to an inode that hasn't been initialized yet. Sequential replay preserves dependencies.

Parallel Replay Opportunities:

While transactions must be ordered, independent blocks within a transaction can be written in parallel:

- Blocks don't overlap—each goes to a different disk location
- I/O schedulers can optimize write order for disk efficiency
- Modern NVMe drives can process many parallel writes

Implementations often batch many block writes and issue them together, letting the I/O stack optimize.
Some implementations sync after each transaction rather than batching all transactions together. This is safer: if the system crashes again during recovery, already-replayed transactions don't need re-replaying. More aggressive implementations batch for speed, accepting that a crash during recovery might extend recovery time.
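A sketch of the safer per-transaction variant, reusing the structures from the replay listing above; `replay_transaction_blocks` and `block_after` are hypothetical helpers standing in for the Phase 2 loop of that listing:

```c
// Per-transaction durability: sync and record progress after every
// transaction instead of once at the end of replay.
list_foreach(valid_transactions, txn) {
    replay_transaction_blocks(journal, txn);  // Phase 2 loop from above

    // Make this transaction's blocks durable before recording progress
    blkdev_issue_flush(journal->j_fs_dev);

    // Record progress in the journal superblock: a crash now means
    // the next recovery starts AFTER this transaction instead of
    // replaying it again
    journal->j_sb->s_start = block_after(journal, txn);
    journal->j_sb->s_sequence = txn->sequence + 1;
    write_journal_superblock(journal);
    blkdev_issue_flush(journal->j_dev);
}
```

The trade-off: two flushes per transaction instead of two per recovery. On a journal holding hundreds of transactions, batching is noticeably faster.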
Not all journal records describe blocks to be replayed. Revoke records tell recovery to NOT replay certain blocks from earlier transactions. Understanding revokes is essential for correct recovery.
The Revoke Problem:

Consider this sequence:

1. Transaction 100: Allocate block 5000 for file A, write inode
2. User deletes file A
3. Transaction 101: Deallocate block 5000, update bitmap
4. Transaction 102: Allocate block 5000 for file B, write NEW data
5. CRASH before transaction 102 completes

During recovery:

- Transaction 100 is valid (has commit record)
- Transaction 102 is invalid (no commit record, discarded)

If we replay transaction 100, we write file A's inode pointing to block 5000. But block 5000 might now contain newer data from file B (step 4's data block might have reached disk even though the transaction didn't commit).

Replay would overwrite newer data with stale transaction 100 data!
```
Revoke Record Mechanism:

When a block is freed (deleted) after being journaled:
1. Record a REVOKE entry in the current transaction
2. Revoke specifies: "Block 5000 should not be replayed from txn < 101"

Journal Contents:
───────────────────────────────────────────────────────────────────
Txn 100: [Descriptor][Block 5000: file A data][Commit]
         This would normally be replayed...

Txn 101: [Descriptor][Bitmap update][REVOKE: Block 5000][Commit]
         Revoke says: Don't replay block 5000 from earlier txns

Txn 102: [Descriptor][Block 5000: file B data][NO COMMIT - incomplete]
         Invalid transaction, ignored
───────────────────────────────────────────────────────────────────

Recovery Process with Revokes:
1. Pass 1 (SCAN): Find all committed transactions
2. Pass 2 (REVOKE): Scan for revoke records, build revoke hash table
   Revoke table: {block 5000 → revoked before txn 101}
3. Pass 3 (REPLAY): For each block in each transaction:
   - Check if block is revoked for this transaction
   - If revoked, skip this block
   - If not revoked, replay it

Result: Block 5000 from Txn 100 is NOT replayed
        Block 5000 contains whatever was on disk (possibly file B
        data, or old unrelated data) - but not stale file A data
```

Revoke Semantics:

- A revoke for block B in transaction T means: "Don't replay block B from any transaction before T"
- Revokes are recorded when blocks are freed/deallocated
- The revoke hash table is built before any replays occur
- Revoke checking adds overhead but ensures correctness

Implementation Efficiency:

Building the revoke table means walking the journal more than once: at minimum, once to find revokes and once to replay (a two-pass algorithm). JBD2-style implementations use three passes:

1. Pass 1: Scan forward, identify all valid transactions
2. Pass 2: Scan the committed transactions again, build the revoke table
3. Pass 3: Replay, checking the revoke table for each block
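Tying the passes together, a minimal recovery driver might look like the sketch below; `journal_scan_transactions` and `build_revoke_table` are sketched elsewhere on this page, and `replay_transactions` is a hypothetical wrapper around the replay loop shown earlier:

```c
// Three-pass recovery driver (sketch)
int journal_recover(journal_t *journal)
{
    // Pass 1 (SCAN): find all committed transactions, in order
    list_t *txns = journal_scan_transactions(journal);

    // Pass 2 (REVOKE): build the complete revoke table BEFORE any
    // replay - the revoke in txn 101 must suppress block 5000
    // from txn 100
    build_revoke_table(journal, txns);

    // Pass 3 (REPLAY): write every non-revoked block to its final
    // location, transaction by transaction
    return replay_transactions(journal, txns);
}
```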
Revokes are most important with metadata journaling. With full data journaling, data blocks are also in the journal, so the revoke scenario is less concerning—we'd replay the correct data. However, revokes still matter for blocks that are freed and then reused between checkpoints.
One of journaling's key benefits is bounded recovery time. Let's analyze what determines recovery duration and how to minimize it.
Recovery Time Components:
| Phase | Duration | Depends On |
|---|---|---|
| Journal superblock read | ~1ms | Disk latency |
| Scan for transactions | O(journal size) | Journal size, disk throughput |
| Build revoke table | O(transaction count) | Number of revoke records |
| Replay blocks | O(blocks to replay) | Number of blocks since checkpoint |
| Sync replayed data | O(blocks × disk latency) | Disk sync speed |
| Update journal superblock | ~2ms | Single write + sync |
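To turn these components into a single number, here is a back-of-the-envelope model; the function name and all constants are illustrative assumptions, not measurements:

```c
// Rough recovery-time model mirroring the table above. Superblock
// reads/updates (~ms) and revoke-table construction are folded into
// sync_seconds, since they are negligible next to scan and replay.
double estimate_recovery_seconds(double journal_bytes,
                                 double blocks_to_replay,
                                 double read_bw,      /* bytes/s */
                                 double write_bw,     /* bytes/s */
                                 double block_size,   /* bytes */
                                 double sync_seconds)
{
    double scan   = journal_bytes / read_bw;
    double replay = (blocks_to_replay * block_size) / write_bw;
    return scan + replay + sync_seconds;
}

// Example matching Scenario 2 below:
//   estimate_recovery_seconds(1e9, 50000, 500e6, 150e6, 4096, 0.3)
//   ≈ 2.0 + 1.4 + 0.3 ≈ 3.7 seconds, i.e. "~4 seconds"
```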
Typical Recovery Times:
```
Example Recovery Scenarios:

Scenario 1: Light workload, recent checkpoint
─────────────────────────────────────────────────────────────────
Journal size: 128MB
Transactions since checkpoint: 5
Blocks to replay: 200
Device: SSD

Scan time: ~1ms (few transactions to scan)
Build revoke: negligible
Replay: 200 blocks × 0.1ms = ~20ms
Sync: ~1ms (SSD sync fast)
──────────────────────────────────────────────────────
Total: ~25ms

Scenario 2: Heavy workload, checkpoint behind
─────────────────────────────────────────────────────────────────
Journal size: 1GB
Transactions since checkpoint: 100
Blocks to replay: 50,000
Device: HDD

Scan time: ~2 seconds (1GB journal, 500MB/s read)
Build revoke: ~100ms
Replay: 50,000 blocks × 4KB = 200MB
        200MB ÷ 150MB/s write = ~1.3 seconds
Sync: ~200ms (HDD depends on location)
──────────────────────────────────────────────────────
Total: ~4 seconds

Scenario 3: Very old checkpoint (rare/problematic)
─────────────────────────────────────────────────────────────────
Journal size: 4GB (full)
Transactions since checkpoint: 500
Blocks to replay: 500,000
Device: HDD

Scan time: ~8 seconds
Build revoke: ~500ms
Replay: 500,000 × 4KB = 2GB
        2GB ÷ 150MB/s = ~15 seconds
Sync: ~5 seconds
──────────────────────────────────────────────────────
Total: ~30 seconds

Note: 30 seconds is still much better than fsck on a
multi-terabyte filesystem (hours).
```

Minimizing Recovery Time:

- Checkpoint frequently: the fewer blocks outstanding since the last checkpoint, the less there is to replay (compare Scenarios 1 and 3)
- Keep the journal appropriately sized: scan time grows with journal size
- Use faster storage: scan, replay, and sync all scale with device throughput and latency
For perspective: full fsck on a 4TB filesystem with 100 million inodes can take 4-6 hours. Journal replay on the same system takes seconds to tens of seconds. This is the fundamental win of journaling—recovery time proportional to recent activity, not total filesystem size.
Journal replay must be correct in all cases—an incorrect recovery is worse than no recovery. Let's examine the subtle correctness properties that replay algorithms must maintain.
Property 1: Idempotence

Replaying a transaction must be safe even if it was already applied. We don't know the exact crash point—the writeback might have completed for some blocks but not others.

How it's achieved: Transactions write complete block contents, not incremental changes. Writing the same block contents twice is harmless—the second write just overwrites with identical data.

Property 2: Atomicity

A transaction either fully replays or doesn't replay at all. We never partially replay.

How it's achieved: The commit record is the atomicity gate. No commit = no replay, even if all data blocks are present. Commit present = replay everything in the transaction.
Property 3: Ordering

Transactions must be replayed in sequence order. Transaction N might depend on the result of transaction N-1.

How it's achieved: The scan produces a transaction list in sequence order. Replay iterates this list sequentially.

Property 4: Revoke Correctness

Revoked blocks must never be replayed from older transactions, even if those transactions are valid.

How it's achieved: Build the complete revoke table before any replays. Check every block against the revoke table. Revokes are never ignored or forgotten.
```c
// Key correctness invariants for journal replay

/*
 * Invariant 1: Checksum Validation
 * A transaction is only considered valid if its checksum passes.
 * This guards against torn writes corrupting the journal.
 */
bool validate_transaction(transaction_t *txn)
{
    uint32_t computed = compute_checksum(txn);
    uint32_t stored = txn->commit->checksum;

    if (computed != stored) {
        // Checksum mismatch - transaction invalid
        // This is the expected case for incomplete commits
        return false;
    }
    return true;
}

/*
 * Invariant 2: Sequence Continuity
 * Transactions must have continuous sequence numbers.
 * A gap indicates potential corruption or missed entries.
 */
bool check_sequence_continuity(list_t *transactions)
{
    transaction_t *prev = NULL;

    list_foreach(transactions, txn) {
        if (prev && txn->sequence != prev->sequence + 1) {
            // Sequence gap detected
            // Depending on implementation: error or stop here
            return false;
        }
        prev = txn;
    }
    return true;
}

/*
 * Invariant 3: Complete Revoke Application
 * All revokes from all committed transactions must be
 * applied before any blocks are replayed.
 */
void build_revoke_table(journal_t *j, list_t *transactions)
{
    revoke_table_t *table = revoke_table_create();

    // Process ALL transactions for revokes FIRST
    list_foreach(transactions, txn) {
        for_each_revoke_record(txn, revoke) {
            // Record: "block B revoked before txn T"
            revoke_table_add(table, revoke->block, txn->sequence);
        }
    }
    j->j_revoke_table = table;
}

// During replay, check revokes
bool should_replay_block(journal_t *j, block_tag_t *tag,
                         uint32_t txn_sequence)
{
    uint32_t revoked_before = revoke_table_lookup(
        j->j_revoke_table, tag->target_block);

    if (revoked_before && txn_sequence < revoked_before) {
        // This block was revoked after this transaction
        // Do NOT replay it
        return false;
    }
    return true;
}
```

If any unexpected condition is detected during recovery (wrong magic numbers, sequence gaps, suspicious checksums), most implementations abort or require manual intervention. Attempting to recover from a corrupted journal risks making things worse. When in doubt, report the error and let administrators decide.
Beyond basic replay, several advanced scenarios and techniques are important for production systems.
Crash During Recovery:

What if the system crashes DURING journal replay?

Scenarios:

- Some transactions replayed, journal superblock updated, then crash
- Some transactions replayed, crash before the superblock update

Solution: Recovery is idempotent! On the next boot:

- If the superblock was updated: the scan starts after the already-replayed transactions
- If the superblock wasn't updated: everything is replayed again (harmless due to idempotence)

Either way, we reach a consistent state. This is why idempotence is so important.
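Concretely, using the block and sequence numbers from the scan walkthrough earlier (transactions 45892-45894 starting at block 847):

```
Crash during recovery - two restart cases:
─────────────────────────────────────────────────────────────
Case A: Superblock updated before the crash
  s_start = 856, s_sequence = 45894
  → Next boot scans from 45894, finds no valid commit,
    replays nothing further

Case B: Crash before the superblock update
  s_start = 847, s_sequence = 45892
  → Next boot finds 45892 and 45893 again and replays them
    a second time; full-block writes make this harmless
─────────────────────────────────────────────────────────────
Either path ends in the same consistent on-disk state.
```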
Multi-Device Journals:

For LVM, RAID, and other multi-device configurations:

- Each filesystem has its own journal
- Journals must be replayed independently
- Device-mapper ordering must be respected

Special care is needed when the journal device and data device are different:

- The journal device must be available before the data device
- Failure of the journal device prevents mount entirely
- An external journal on a failed device is catastrophic
Online/Lazy Recovery:

Some systems offer expedited mount with background recovery:

1. A quick scan identifies immediate needs
2. The filesystem mounts read-only or with limited writes
3. Full recovery proceeds in the background
4. Full access is enabled after recovery completes

This reduces apparent downtime but adds complexity. Data accessed before recovery completes might show inconsistencies. XFS supports this model.
```bash
# Debugging journal recovery issues

# Check if journal replay occurred on last mount
dmesg | grep -i "EXT4-fs.*recovery"
# Output: "EXT4-fs (sda1): recovery complete"
# Or:     "EXT4-fs (sda1): no recovery required"

# Force replay (unmount, mark dirty, mount)
umount /mnt/data
debugfs -w -R "ssv state 0" /dev/sda1  # Clear the clean-state flag
mount /dev/sda1 /mnt/data              # Will trigger replay

# View journal contents (requires unmount)
debugfs -R "logdump -a" /dev/sda1 2>/dev/null | head -100

# Check journal superblock
dumpe2fs -h /dev/sda1 2>/dev/null | grep -A10 "^Journal"

# Check filesystem state
tune2fs -l /dev/sda1 | grep "Filesystem state"
# "clean" or "not clean"

# XFS journal status
xfs_logprint -c /dev/sdb1  # Count log entries
xfs_logprint -t /dev/sdb1  # Transaction dump

# Recover corrupted journal (emergency, may lose data)
# ext4: rebuild journal
tune2fs -O ^has_journal /dev/sda1  # Remove journal
tune2fs -j /dev/sda1               # Recreate journal

# XFS: clear log and repair
xfs_repair -L /dev/sdb1  # -L zeros log (DATA LOSS POSSIBLE)
```

True journal corruption (not just incomplete transactions) is rare and usually indicates hardware problems—failing disk, bad memory, controller issues. If recovery consistently fails, investigate hardware before trying emergency repairs. Checksums detect most corruption; persistent checksum failures need hardware attention.
We've completed our deep examination of journal replay—the mechanism that makes journaling work. Let's consolidate what we've learned.
Module Complete:

With journal replay understood, you now have a complete picture of file system journaling. From the crash consistency problem through write-ahead logging, metadata and full journaling modes, to recovery and replay—you understand how modern file systems maintain consistency across failures.

This knowledge applies directly to:

- Choosing appropriate journaling modes for your workloads
- Understanding and predicting recovery behavior
- Debugging file system issues after crashes
- Making informed decisions about application durability
- Designing systems that correctly handle crashes
Congratulations! You've mastered file system journaling—one of the most important mechanisms in modern operating systems. This knowledge is foundational for anyone working with storage systems, databases, or any application requiring crash consistency.