Ext2 Ext3 Ext4 - Learning Module


Ext3 Journaling

The End of fsck Nightmares

Before 2001, system administrators lived in fear of the unexpected reboot. When an ext2 system crashed—whether from power failure, kernel panic, or hardware fault—the recovery process was predictable and painful: run fsck, watch it scan every inode and block on the disk, and wait. For large filesystems, this could take hours. A 500 GB disk might require 30 minutes or more of scanning before the system could come back online.

The root problem was fundamental to ext2's design. Without a record of in-progress operations, the only way to verify filesystem consistency was to examine everything. Every inode's link count must match directory entries. Every allocated block must be reachable. Every bitmap must accurately reflect allocation state. With no shortcuts available, recovery time scaled linearly with filesystem size.

Journaling changed everything. By recording intended changes before making them permanent, ext3 could recover from crashes in seconds rather than hours. The technique—borrowed from database systems—transformed Linux filesystems from fragile to production-ready.

This page explores how ext3's journaling works, from the fundamental write-ahead logging concept through implementation details that made it both reliable and performant.

What You Will Learn

By the end of this page, you will understand write-ahead logging theory, the three ext3/ext4 journaling modes (writeback, ordered, journal), the journal's on-disk structure, transaction commit sequences, and crash recovery mechanisms. You'll gain practical knowledge for tuning journal performance and troubleshooting recovery issues.

The Problem Journaling Solves

To understand journaling, we must first understand why file systems become inconsistent after crashes.

The Multi-Write Problem:

Consider creating a new file /home/user/document.txt. This seemingly simple operation requires modifying multiple disk structures:

Inode bitmap: Mark a new inode as allocated
Inode table: Initialize the new inode with metadata
Block bitmap: Mark data blocks as allocated (if file has content)
Data blocks: Write actual file content
Directory inode: Update modification time
Directory data block: Add new directory entry

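These six updates cannot complete atomically. If power fails midway, the structures already written disagree with those that were not: for example, an inode may be marked allocated in the bitmap but never initialized, or a directory entry may point to an inode whose bitmap bit is still clear.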

ext2's Approach: Full Scan (fsck)

Without journaling, ext2's recovery strategy was exhaustive verification:

Pass 1: Check inodes, blocks, and sizes
Pass 2: Check directory structure
Pass 3: Check directory connectivity
Pass 4: Check reference counts (link count)
Pass 5: Check group summary information (bitmaps)

Time complexity: O(n) where n = total inodes and blocks

Filesystem Size	Typical fsck Time
10 GB	1-2 minutes
100 GB	10-20 minutes
1 TB	1-2 hours
10 TB	10+ hours

For a server that must be available 24/7, hours of downtime after each unexpected reboot was unacceptable.

The Core Insight: Write-Ahead Logging

Database systems solved this problem decades ago with write-ahead logging (WAL):

Before modifying data, write the intended change to a log
After the log entry is durable, modify the actual data
After successful modification, mark the log entry as complete
On crash: replay log entries that were fully logged but not marked complete, and discard entries that were only partially logged

This reduces recovery from an O(n) scan of the whole filesystem to an O(j) replay, where j is the size of the journal: typically seconds regardless of filesystem size.
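
The four steps can be sketched as a tiny in-memory model. This is an illustration only, not ext3 code: the wal_* names, the fixed-size arrays, and the crash_after_log flag are assumptions invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DISK_BLOCKS 8
#define LOG_SLOTS   4

/* One redo record: "block `target` should contain `value`". */
struct wal_entry {
    int  target;
    int  value;
    bool logged;    /* record is durable in the log */
    bool applied;   /* change reached its final location */
};

struct wal {
    int disk[DISK_BLOCKS];            /* stands in for the real data */
    struct wal_entry log[LOG_SLOTS];  /* stands in for the journal */
    size_t used;
};

/* Steps 1-3. If crash_after_log is true, stop after the log write
 * to imitate power loss before the data itself is updated. */
bool wal_write(struct wal *w, int target, int value, bool crash_after_log)
{
    assert(target >= 0 && target < DISK_BLOCKS);
    if (w->used == LOG_SLOTS)
        return false;                     /* log full: checkpoint needed */
    struct wal_entry *e = &w->log[w->used++];
    *e = (struct wal_entry){ target, value, true, false };  /* step 1 */
    if (crash_after_log)
        return true;
    w->disk[target] = value;              /* step 2 */
    e->applied = true;                    /* step 3 */
    return true;
}

/* Step 4: redo every logged record that was never marked complete.
 * Returns the number of records replayed. */
size_t wal_recover(struct wal *w)
{
    size_t replayed = 0;
    for (size_t i = 0; i < w->used; i++) {
        struct wal_entry *e = &w->log[i];
        if (e->logged && !e->applied) {
            w->disk[e->target] = e->value;
            e->applied = true;
            replayed++;
        }
    }
    return replayed;
}
```

Recovery cost here depends only on the number of log slots, never on DISK_BLOCKS, which is the property that makes journal replay fast on large filesystems.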

The ARIES Protocol

ext3's journaling builds on the write-ahead logging tradition that ARIES (Algorithms for Recovery and Isolation Exploiting Semantics), developed at IBM and published in the early 1990s, formalized for databases. ARIES combines redo and undo logging; ext3 uses a simpler redo-only scheme that logs whole blocks, but the core rule is the same: a log record must be durable before the data it describes is overwritten.

The Journal's On-Disk Structure

The ext3/ext4 journal is stored as a special file managed by the Journaling Block Device layer (JBD in ext3, JBD2 in ext4). By default, it uses the reserved inode number 8 and occupies a fixed region of disk space.

Journal Location:

# View journal inode and size
dumpe2fs /dev/sda1 | grep -i journal
# Output:
Journal inode:            8
Journal size:             128M
Journal blocks:           32768

Journal Layout:

journal_structure.c
#include <linux/types.h>           // __be32, __be16, __u8

// Common 12-byte header at the start of every journal metadata block.
// All multi-byte journal fields are stored big-endian on disk.
typedef struct journal_header_s {
    __be32  h_magic;               // JBD2_MAGIC_NUMBER (0xC03B3998)
    __be32  h_blocktype;           // DESCRIPTOR, COMMIT, SUPERBLOCK, or REVOKE
    __be32  h_sequence;            // Transaction sequence number
} journal_header_t;

// Journal superblock structure (first block of journal)
typedef struct journal_superblock_s {
    journal_header_t s_header;     // Magic plus superblock type identifier

    // Static information
    __be32  s_blocksize;           // Journal block size
    __be32  s_maxlen;              // Total journal blocks
    __be32  s_first;               // First usable block (after SB)

    // Dynamic information
    __be32  s_sequence;            // Sequence of first transaction
    __be32  s_start;               // Block of first transaction (0 = empty)
    __be32  s_errno;               // Error value if any

    // Version 2 fields
    __be32  s_feature_compat;      // Compatible features
    __be32  s_feature_incompat;    // Incompatible features
    __be32  s_feature_ro_compat;   // Read-only compat features
    __u8    s_uuid[16];            // Journal UUID
    __be32  s_nr_users;            // Number of filesystems using
    __be32  s_dynsuper;            // Location of dynamic superblock

    // Transaction limits
    __be32  s_max_transaction;     // Max blocks per transaction
    __be32  s_max_trans_data;      // Max data blocks per transaction

    __u8    s_checksum_type;       // Checksum algorithm
    __u8    s_padding2[3];
    __be32  s_num_fc_blks;         // Fast commit blocks
    __be32  s_padding[41];
    __be32  s_checksum;            // Superblock checksum
    __u8    s_users[16*48];        // UUIDs of filesystems using
} journal_superblock_t;

// Block tag in descriptor block (ext4 format): one tag per journaled block
typedef struct journal_block_tag_s {
    __be32  t_blocknr;             // Destination block (low 32)
    __be16  t_checksum;            // Block checksum
    __be16  t_flags;               // Flags (ESCAPE, SAME_UUID, etc.)
    __be32  t_blocknr_high;        // Destination block (high 32)
} journal_block_tag_t;

Journal Block Types:

Block Type	Purpose	Header ID
Superblock	Journal configuration and state	3 (v1) or 4 (v2)
Descriptor	Lists following data blocks and their destinations	1
Data Blocks	Copies of filesystem blocks being modified	N/A
Commit	Marks transaction as complete	2
Revoke	Lists blocks that should NOT be replayed	5
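
To make the header IDs concrete, here is a small user-space sketch that decodes the 12-byte header from a raw journal block and names its type. The function and struct names (read_be32, decode_journal_header, block_type_name, decoded_header) are invented for this example and are not part of the kernel or e2fsprogs; only the magic number and the type IDs come from the material above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JBD2_MAGIC_NUMBER 0xC03B3998u

/* Block type IDs as listed in the table above. */
enum { BT_DESCRIPTOR = 1, BT_COMMIT = 2, BT_REVOKE = 5 };

/* Read a 32-bit big-endian value regardless of host byte order. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Host-order copy of the fields that follow the magic number. */
struct decoded_header {
    uint32_t blocktype;
    uint32_t sequence;
};

/* Returns 1 and fills *out if buf starts with a valid journal header,
 * 0 otherwise (for example, a journaled copy of a filesystem block). */
int decode_journal_header(const unsigned char *buf, size_t len,
                          struct decoded_header *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 12 || read_be32(buf) != JBD2_MAGIC_NUMBER)
        return 0;
    out->blocktype = read_be32(buf + 4);
    out->sequence  = read_be32(buf + 8);
    return 1;
}

const char *block_type_name(uint32_t blocktype)
{
    switch (blocktype) {
    case BT_DESCRIPTOR: return "descriptor";
    case BT_COMMIT:     return "commit";
    case BT_REVOKE:     return "revoke";
    default:            return "other";
    }
}
```

Decoding byte by byte avoids any dependence on the host's endianness, which matters because the journal is big-endian while most machines running ext4 are little-endian.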

Circular Buffer:

The journal operates as a circular buffer:

Journal Space:
+--------+------+------+------+------+------+------+------+
| Super  | T1   | T1   | T1   | T2   | T2   | T2   | FREE |
| Block  | desc | data | cmit | desc | data | cmit |      |
+--------+------+------+------+------+------+------+------+
         ↑                                         ↑
      s_start                                  log head
 (oldest transaction                      (next block written;
  not yet checkpointed)                    wraps to s_first)

When the log wraps around, space occupied by old transactions is reused. This is safe because:

Space is reclaimed only from transactions that have been checkpointed, meaning their blocks have been written to their final locations
Committed transactions that are not yet checkpointed are preserved until they are
Checkpointing ensures committed data reaches permanent storage
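
The wrap-around arithmetic can be sketched as follows. The log_state struct and its field names are hypothetical, chosen for this example; the reserve of one block is a common ring-buffer convention assumed here so that head == tail unambiguously means an empty log.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory view of the log; field names are illustrative. */
struct log_state {
    uint32_t first;  /* first usable block (after the superblock) */
    uint32_t last;   /* one past the final usable block */
    uint32_t tail;   /* oldest block still needed (not yet checkpointed) */
    uint32_t head;   /* next block to be written */
};

/* Advance a block number by n, wrapping from `last` back to `first`. */
uint32_t log_advance(const struct log_state *l, uint32_t block, uint32_t n)
{
    uint32_t size = l->last - l->first;
    assert(block >= l->first && block < l->last);
    return l->first + (block - l->first + n) % size;
}

/* Blocks that can be written before the head would run into the tail.
 * One block is kept in reserve so head == tail always means "empty". */
uint32_t log_free_blocks(const struct log_state *l)
{
    uint32_t size = l->last - l->first;
    uint32_t used = (l->head + size - l->tail) % size;
    return size - used - 1;
}
```

For a seven-block log occupying blocks 1 through 7, advancing from block 7 lands on block 1, and free space grows only when the tail moves forward, which is what checkpointing accomplishes.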

Journal Size Matters

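A journal that is too small causes frequent checkpointing pauses; one that is too large wastes space. The 128 MB journal shown above is adequate for most workloads. Database servers with heavy transaction loads may benefit from 256-512 MB. Note that tune2fs -J size=256 only creates a journal on a filesystem that lacks one, so to resize an existing internal journal, unmount the filesystem, remove the journal with tune2fs -O ^has_journal, and then recreate it with tune2fs -J size=256.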
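
As a quick check of the numbers, the relationship between journal size and block count is plain division. The helper name journal_blocks is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Number of journal blocks for a journal of size_mib mebibytes
 * on a filesystem with the given block size in bytes. */
uint64_t journal_blocks(uint64_t size_mib, uint32_t block_size)
{
    assert(block_size != 0);
    return size_mib * 1024u * 1024u / block_size;
}
```

With 4 KiB blocks, journal_blocks(128, 4096) gives 32768, matching the dumpe2fs output shown earlier; doubling the journal to 256 MB doubles the block count.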

The Three Journaling Modes

ext3/ext4 offers three journaling modes, each trading off between safety and performance. Understanding these modes is crucial for optimizing different workloads.

Mode 1: Journal (data=journal)

The most conservative mode journals both metadata AND data blocks:

Write file data:
1. Write data blocks to journal
2. Write metadata blocks to journal  
3. Write commit record
4. Write data blocks to final location
5. Write metadata blocks to final location
6. Checkpoint: free journal space

Characteristics:

✅ Full crash protection for data and metadata
✅ Can recover file contents after crash
❌ Writes every block twice (journal + final)
❌ Write throughput cut roughly in half
📊 Use case: Critical data where no loss is acceptable

data=journal Pros

•Maximum data protection
•Atomic file content updates
•No partial writes visible
•Simplest recovery model
•Good for small files

data=journal Cons

•~50% write performance penalty
•High journal space usage
•More journal wrap-around
•More checkpointing pauses
•Not suitable for streaming writes

Mode 2: Ordered (data=ordered)

The default mode—journals metadata only, but ensures data is written before metadata:

Write file data:
1. Write data blocks to FINAL location
2. Wait for data write completion
3. Write metadata blocks to journal
4. Write commit record
5. Write metadata to final location

Characteristics:

✅ Data reaches disk before metadata commits
✅ No stale/garbage data visible after crash
✅ Much better performance than full journaling
⚠️ Recently written data may be lost if the crash occurs before the commit record (step 4) is durable
📊 Use case: General purpose (default for good reason)

Mode 3: Writeback (data=writeback)

Fastest mode—journals metadata only with no ordering guarantees:

Write file data:
1. Write data blocks to final location (async)
2. Write metadata blocks to journal
3. Write commit record
4. Write metadata to final location

Characteristics:

✅ Maximum write performance
✅ Metadata always consistent after crash
❌ Files may contain stale/garbage data after crash
❌ Security risk: old file content exposed
📊 Use case: Scratch/temp filesystems, benchmarks

Journaling Mode Comparison
Aspect	data=journal	data=ordered	data=writeback
What's journaled	Metadata + Data	Metadata only	Metadata only
Data ordering	Before metadata	Before metadata	No ordering
Write amplification	2x for all data	1x (no extra writes)	1x (no extra writes)
Recovery guarantee	Full content	No garbage data	Metadata only
Performance impact	~50% slower	~5-15% slower	Baseline
Default	No	Yes	No
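
The write amplification row can be worked through with a rough model. This sketch assumes the only extra writes are the journal copies themselves and ignores descriptor and commit block overhead; the enum and function names are invented for the example.

```c
#include <stdint.h>

enum data_mode { DATA_JOURNAL, DATA_ORDERED, DATA_WRITEBACK };

/* Approximate bytes sent to the device for one update that touches
 * `data` bytes of file content and `meta` bytes of metadata. */
uint64_t device_bytes_written(enum data_mode mode, uint64_t data, uint64_t meta)
{
    /* Metadata is always written twice: journal copy plus final copy. */
    uint64_t total = data + 2 * meta;

    /* Only data=journal also sends file content through the journal. */
    if (mode == DATA_JOURNAL)
        total += data;
    return total;
}
```

For a 1000-byte data write with 10 bytes of metadata, the model gives 2020 bytes in data=journal and 1020 bytes in the other two modes, which is why the penalty of full journaling grows with the size of the data being written.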

mount_options.sh
# Set journaling mode via mount options
mount -o data=journal /dev/sda1 /mnt/critical  # Maximum safety
mount -o data=ordered /dev/sda1 /mnt/general   # Default, balanced
mount -o data=writeback /dev/sda1 /mnt/temp    # Maximum performance
 
# Set default mode in fstab
# /dev/sda1  /data  ext4  defaults,data=ordered  0  2
 
# Check current mode
mount | grep sda1
# /dev/sda1 on /data type ext4 (rw,relatime,data=ordered)
 
# Store a default mode in the superblock (used when no data= mount option is given)
tune2fs -o journal_data /dev/sda1           # data=journal
tune2fs -o journal_data_ordered /dev/sda1   # data=ordered
tune2fs -o journal_data_writeback /dev/sda1 # data=writeback

Choosing the Right Mode

Use data=ordered (default) for almost everything. Consider data=journal for critical databases where atomicity matters. Use data=writeback only for temporary data, build directories, or when benchmarking—never for user data on multi-user systems (security risk from exposing old file contents).

Transaction Lifecycle

Understanding the transaction lifecycle reveals how journaling achieves atomicity while maintaining performance.

Transaction States:

T_RUNNING → T_LOCKED → T_FLUSH → T_FINISHED

Detailed Transaction Flow:

1. T_RUNNING (Active Transaction)

// Filesystem operations join the current transaction
handle_t *handle = jbd2_journal_start(journal, nblocks);

// Modify metadata blocks within the transaction
jbd2_journal_get_write_access(handle, bh);
modify_block(bh);
jbd2_journal_dirty_metadata(handle, bh);

// Complete this operation
jbd2_journal_stop(handle);

2. Commit Trigger

A transaction commits when:

Commit timer expires (default: 5 seconds)
Transaction exceeds size limit
Explicit sync request (fsync, sync)
Journal space runs low
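
The four triggers can be expressed as a single predicate. This is a sketch under assumptions: the tx_snapshot struct, its fields, and the order in which the conditions are checked are invented for illustration and do not mirror the kernel's data structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the running transaction. */
struct tx_snapshot {
    uint32_t age_seconds;        /* time since the transaction started */
    uint32_t commit_interval;    /* commit= mount option, default 5 */
    uint32_t blocks_logged;      /* blocks the transaction will journal */
    uint32_t max_blocks;         /* per-transaction size limit */
    uint32_t journal_free;       /* free blocks left in the journal */
    uint32_t journal_low_water;  /* threshold treated as "space runs low" */
    bool     sync_requested;     /* fsync() or sync() is waiting */
};

enum commit_reason {
    COMMIT_NONE, COMMIT_TIMER, COMMIT_SIZE, COMMIT_SYNC, COMMIT_SPACE
};

/* Returns why the running transaction should commit now, if at all. */
enum commit_reason commit_trigger(const struct tx_snapshot *t)
{
    if (t->sync_requested)
        return COMMIT_SYNC;
    if (t->blocks_logged >= t->max_blocks)
        return COMMIT_SIZE;
    if (t->journal_free < t->journal_low_water)
        return COMMIT_SPACE;
    if (t->age_seconds >= t->commit_interval)
        return COMMIT_TIMER;
    return COMMIT_NONE;
}
```

On a lightly loaded system the timer is usually the reason a transaction commits; the other three conditions cut the interval short.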

3. T_LOCKED → T_FLUSH

// Lock transaction: no new handles can join
transaction->t_state = T_LOCKED;

// Wait for all existing handles to complete
wait_for_handles_to_complete();

// Begin flushing to journal
transaction->t_state = T_FLUSH;

4. Journal Write Sequence

1. Write descriptor block (lists all blocks in transaction)
2. Write data/metadata blocks to journal space
3. Issue flush command (barrier) to disk
4. Write commit block with transaction checksum
5. Issue another flush command

The two flush commands ensure proper ordering: blocks must reach disk before commit, commit must reach disk before checkpoint.

journal_commit.c
// Simplified commit sequence, loosely modeled on jbd2_journal_commit_transaction().
// Helpers such as allocate_journal_block() and for_each_transaction_buffer()
// are illustrative names, not kernel functions; error handling is omitted.
void commit_transaction(journal_t *journal, transaction_t *tx) {
    struct buffer_head *descriptor, *bh, *journal_block, *commit_bh;
    struct commit_header *commit;
    
    // Phase 1: Write descriptor block (tags map journal copies to home blocks)
    descriptor = allocate_journal_block(journal);
    build_descriptor(descriptor, tx);
    submit_bh(WRITE, descriptor);
    
    // Phase 2: Copy every dirty metadata buffer into the journal
    for_each_transaction_buffer(bh, tx) {
        journal_block = allocate_journal_block(journal);
        memcpy(journal_block->b_data, bh->b_data, bh->b_size);
        submit_bh(WRITE, journal_block);
    }
    
    // Phase 3: Issue barrier so descriptor and block copies are durable first
    blkdev_issue_flush(journal->j_dev);
    
    // Phase 4: Write commit block (journal fields are big-endian on disk)
    commit_bh = allocate_journal_block(journal);
    commit = (struct commit_header *)commit_bh->b_data;
    commit->h_magic     = cpu_to_be32(JBD2_MAGIC_NUMBER);
    commit->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
    commit->h_sequence  = cpu_to_be32(tx->t_tid);
    commit->h_chksum[0] = cpu_to_be32(compute_transaction_checksum(tx));
    submit_bh(WRITE, commit_bh);
    
    // Phase 5: Second barrier makes the commit block itself durable
    blkdev_issue_flush(journal->j_dev);
    
    // Transaction is now committed: recovery will replay it after a crash
    tx->t_state = T_FINISHED;
}

The 5-Second Commit Interval

By default, ext3/ext4 commits transactions every 5 seconds. This batches many operations into single commits, improving throughput. The tradeoff: up to 5 seconds of operations may be lost on crash. Use commit=N mount option to adjust (lower = safer, higher = faster).

Crash Recovery Process

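When a system crashes, the journal may hold transactions that were committed but not yet checkpointed. ext3/ext4's recovery process replays those transactions to restore consistency and discards any transaction that never reached its commit block.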

Recovery Detection:

During mount, the kernel checks if recovery is needed:

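// es points to the on-disk ext4 superblock (little-endian fields).
// The needs_recovery flag stays set while the journal may hold
// transactions that have not been checkpointed.
if (le32_to_cpu(es->s_feature_incompat) & EXT4_FEATURE_INCOMPAT_RECOVER) {
    // Filesystem was not cleanly unmounted: replay the journal
    jbd2_journal_recover(journal);
}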

Recovery Phases:

PASS_SCAN → PASS_REVOKE → PASS_REPLAY

Phase 1: PASS_SCAN

Scan the journal to identify valid transactions:

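// Simplified scan pass; helper names are illustrative, not kernel functions
void recovery_pass_scan(journal_t *journal) {
    journal_header_t *header;
    // Journal fields are stored big-endian on disk
    block_t block = be32_to_cpu(journal->j_sb->s_start);
    tid_t expected_seq = be32_to_cpu(journal->j_sb->s_sequence);
    
    while (block_is_valid(block)) {
        header = read_journal_block(block);
        
        if (be32_to_cpu(header->h_magic) != JBD2_MAGIC_NUMBER)
            break;  // End of journal or corruption
            
        if (be32_to_cpu(header->h_sequence) != expected_seq)
            break;  // Stale block left over from an earlier pass through the log
        
        switch (be32_to_cpu(header->h_blocktype)) {
            case JBD2_DESCRIPTOR_BLOCK:
                record_descriptor(block);
                // Skip the journaled copies that follow the descriptor
                block = next_log_block(journal, block, count_data_blocks(header));
                break;
            case JBD2_COMMIT_BLOCK:
                if (!verify_checksum(header))
                    return;  // Torn commit: stop, nothing from here on is replayed
                mark_transaction_valid(expected_seq);
                expected_seq++;
                break;
            case JBD2_REVOKE_BLOCK:
                record_revokes(block);
                break;
        }
        // Advance one block, wrapping from the end of the log to s_first
        block = next_log_block(journal, block, 1);
    }
}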

Phase 2: PASS_REVOKE

Process revoke records to avoid replaying deleted blocks:

// Revoke blocks mark filesystem locations that should NOT
// be updated during replay (typically from deleted files)

for_each_revoke_record(record) {
    tid_t revoke_tid = record->r_transaction;
    block_t block = record->r_block;
    
    // If a later transaction revoked this block,
    // don't replay older data to it
    add_to_revoke_table(block, revoke_tid);
}

Phase 3: PASS_REPLAY

Replay committed transactions to restore consistency:

for_each_committed_transaction(tx) {
    for_each_block_in_transaction(tx, block) {
        // Check revoke table
        if (is_revoked(block->destination, tx->tid))
            continue;  // Skip revoked block
        
        // Replay: copy from journal to final location
        write_block(block->destination, block->data);
    }
}
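
The interaction between the revoke table and replay can be tried out in user space. The sketch below is a toy model under stated assumptions: the logged_block and revoke_rec structs, the flat arrays, and the last_committed_tid parameter are invented for the example, and transaction ID wrap-around is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FS_BLOCKS 16

struct logged_block { uint32_t tid; uint32_t dest; int data; };
struct revoke_rec   { uint32_t tid; uint32_t block; };

/* A journaled copy is skipped if the same block was revoked in the
 * same or a later transaction than the one being replayed. */
static int is_revoked(const struct revoke_rec *rv, size_t nrv,
                      uint32_t block, uint32_t tid)
{
    for (size_t i = 0; i < nrv; i++)
        if (rv[i].block == block && rv[i].tid >= tid)
            return 1;
    return 0;
}

/* Replays committed blocks in log order and returns how many blocks
 * were written to the filesystem array. */
size_t replay(int *fs, const struct logged_block *log, size_t nlog,
              const struct revoke_rec *rv, size_t nrv,
              uint32_t last_committed_tid)
{
    size_t written = 0;
    for (size_t i = 0; i < nlog; i++) {
        if (log[i].tid > last_committed_tid)
            break;                 /* uncommitted tail of the log: discard */
        if (is_revoked(rv, nrv, log[i].dest, log[i].tid))
            continue;              /* block was freed later: do not resurrect */
        assert(log[i].dest < FS_BLOCKS);
        fs[log[i].dest] = log[i].data;
        written++;
    }
    return written;
}
```

If transaction 1 logs blocks 3 and 4, transaction 2 revokes block 4 and logs block 5, and transaction 3 never commits, replay writes only blocks 3 and 5. Without the revoke check, the stale copy of block 4 would overwrite whatever now lives at that location.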

Recovery Time Comparison
Filesystem Size	ext2 fsck Time	ext3/ext4 Recovery
10 GB	1-2 minutes	< 1 second
100 GB	10-20 minutes	1-2 seconds
1 TB	1-2 hours	2-5 seconds
10 TB	10+ hours	5-10 seconds

Recovery Is Automatic

Journal recovery happens automatically during mount—no user intervention required. After a crash, simply boot the system normally. The kernel detects the unclean shutdown and replays the journal before completing the mount.

Checkpointing and Space Management

The journal has finite space. Checkpointing is the process of writing committed transactions to their final locations, freeing journal space for reuse.

Why Checkpointing Matters:

Journal at 80% capacity:
+--------+------+------+------+------+------+------+------+
| Super  | T1✓  | T2✓  | T3✓  | T4✓  | T5•  | FREE | FREE |
+--------+------+------+------+------+------+------+------+
                                       ↑
                              Current transaction

T1-T4 are committed but not yet checkpointed.
Journal can only use remaining 20% until T1-T4 checkpoint.

Checkpoint Trigger Conditions:

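Free journal space drops too low to start a new transaction (on the order of a quarter of the journal)
Background writeback flushes the dirty metadata buffers, letting the transaction be dropped from the checkpoint list
An explicit journal flush (for example, when the filesystem is frozen) forces a full checkpoint; fsync and sync force a commit, not a checkpoint
Unmount flushes all transactions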

checkpointing.c
// Simplified checkpoint process
void jbd2_log_do_checkpoint(journal_t *journal) {
    transaction_t *tx;
    struct buffer_head *bh;
    
    while ((tx = get_oldest_checkpointable_transaction(journal))) {
        // Write all metadata blocks to their final destinations
        list_for_each_entry(bh, &tx->t_checkpoint_list, b_cpnext) {
            if (!buffer_dirty(bh))
                continue;
            
            // Write block to its permanent location on disk
            lock_buffer(bh);
            bh->b_end_io = end_buffer_write_sync;
            submit_bh(WRITE, bh);
        }
        
        // Wait for all writes to complete
        list_for_each_entry(bh, &tx->t_checkpoint_list, b_cpnext) {
            wait_on_buffer(bh);
            if (buffer_uptodate(bh)) {
                release_buffer_from_checkpoint(bh);
            }
        }
        
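        // Transaction is fully checkpointed:
        // its journal space can be reclaimed by advancing the log tail
        __jbd2_journal_drop_transaction(journal, tx);
        journal->j_tail = advance_to_next_transaction(tx);
    }
    
    // Make the checkpointed blocks durable on the filesystem device, then
    // record the new tail (s_start, s_sequence) in the journal superblock
    blkdev_issue_flush(journal->j_fs_dev);
    jbd2_update_superblock(journal);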
}

Checkpoint vs. Commit:

Aspect	Commit	Checkpoint
What happens	Transaction written to journal	Journal data written to filesystem
When	Every 5 seconds (default)	When journal space needed
Crash safety	Provides crash safety	Frees journal space
Blocks involved	Metadata (and data in journal mode)	Same blocks, different location
Required order	Must complete before checkpoint	Must wait for commit

Checkpoint Ordering Constraints:

Transactions must checkpoint in order:

T1 commits → T2 commits → T3 commits
      ↓           ↓           ↓
T1 checkpoints → T2 checkpoints → T3 checkpoints

This ordering ensures that if a crash occurs during checkpointing, recovery replays transactions in the correct sequence.
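
One consequence of in-order checkpointing is that the log tail can only move past a contiguous run of finished transactions. A minimal sketch, with the function name and the boolean array chosen purely for illustration:

```c
#include <stdbool.h>
#include <stddef.h>

/* checkpointed[i] is true once transaction i (in commit order) has had
 * all of its blocks written to their final locations. Returns the index
 * of the oldest transaction that still pins journal space. */
size_t advance_tail(const bool *checkpointed, size_t count, size_t tail)
{
    while (tail < count && checkpointed[tail])
        tail++;
    return tail;
}
```

If T2 and T3 are checkpointed but T1 is not, the function returns T1's index and no space is reclaimed; once T1 finishes, the tail jumps past all three at once.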

Monitoring Checkpoint Activity:

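# View per-journal statistics (the directory name is <device>-<journal inode>)
cat /proc/fs/jbd2/sda1-8/info
# Reports the number of transactions so far and per-transaction averages:
# time spent running, locked, flushing, and logging, plus handles and
# blocks per transaction

# List the per-filesystem ext4 attributes exposed through sysfs
ls /sys/fs/ext4/sda1/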

Checkpoint Stalls

If the journal fills completely, new operations stall waiting for checkpoint to complete. This manifests as I/O hangs. Solutions: increase journal size, reduce commit interval, investigate disk performance. Monitor /proc/fs/jbd2/*/info for warning signs.

Journal Performance Tuning

Optimizing journal performance requires understanding the trade-offs between safety, latency, and throughput.

Key Tuning Parameters:

journal_tuning.sh
# Commit interval: time between automatic commits
# Lower = more frequent commits = less data at risk = lower throughput
mount -o commit=5 /dev/sda1 /mnt      # Default: 5 seconds
mount -o commit=1 /dev/sda1 /mnt      # Aggressive: 1 second
mount -o commit=60 /dev/sda1 /mnt     # Relaxed: 60 seconds
 
# Barrier behavior: ensures ordering for crash safety
mount -o barrier=1 /dev/sda1 /mnt     # Default: barriers enabled
mount -o barrier=0 /dev/sda1 /mnt     # Disable (DANGEROUS unless UPS)
mount -o nobarrier /dev/sda1 /mnt     # Alias for barrier=0
 
# Journal size: affects maximum transaction size and checkpoint frequency
mkfs.ext4 -J size=256 /dev/sda1       # 256 MB journal (at creation)
# Resize an existing journal (filesystem must be unmounted and clean):
tune2fs -O ^has_journal /dev/sda1     # Remove the old journal
tune2fs -J size=256 /dev/sda1         # Create a new 256 MB journal
 
# External journal: place journal on separate fast device
# (the journal device must first be created with -O journal_dev)
mkfs.ext4 -J device=/dev/sdb1 /dev/sda1  # Journal on sdb1
 
# Journal checksum (ext4): detect journal corruption
mount -o journal_checksum /dev/sda1 /mnt
 
# Priority of journal I/O
# Can be raised with ionice on the jbd2 kernel threads
# (only honored by I/O schedulers that support priorities):
ionice -c 1 -n 0 -p $(pgrep jbd2)

Performance Recommendations by Workload:

Workload	Journal Mode	Commit Interval	Journal Size	Barriers
General purpose	ordered	5s (default)	128 MB	Enabled
Database (safety)	journal	1s	256 MB	Enabled
Database (performance)	ordered	5s	256 MB	Enabled
Streaming/logs	writeback	30s	64 MB	Enabled
Scratch/temp	writeback	60s	32 MB	Optional
Build server	writeback	30s	128 MB	Optional

External Journal Optimization:

For high-performance systems, placing the journal on a separate device eliminates seek contention:

# Create journal device (small SSD or NVMe)
mkfs.ext4 -O journal_dev /dev/nvme0n1p1

# Create main filesystem with external journal
mkfs.ext4 -J device=/dev/nvme0n1p1 /dev/sda1

# Journal operations go to fast NVMe
# Bulk data goes to high-capacity HDD

Impact of Barrier Disable
Scenario	With Barriers	Without Barriers
Sequential writes	~10-20% overhead	Maximum throughput
Random writes	~20-30% overhead	Maximum throughput
Power failure	Full recovery	Potential corruption
Use case	Production (default)	UPS + write cache battery

Barrier Disable Warning

Disabling barriers allows disk write caches to reorder writes, breaking journal guarantees. Only disable if: (1) system has UPS, AND (2) disk has battery-backed write cache, AND (3) you've verified that the cache contents survive power loss and are written out afterward. Most consumer disks do NOT guarantee this.

Summary: Ext3 Journaling

Journaling transformed Linux filesystems from fragile to production-ready, enabling reliable crash recovery in seconds rather than hours.

Key Takeaways

•Journaling solves the multi-write consistency problem — By writing intended changes to a log before applying them, the filesystem can recover to a consistent state after any crash.
•Three journaling modes trade safety for speed — data=journal (safest), data=ordered (default, balanced), data=writeback (fastest, least safe).
•Transactions group related operations — Operations are batched into transactions that commit atomically every 5 seconds (default).
•Recovery replays committed transactions — The SCAN→REVOKE→REPLAY process takes seconds regardless of filesystem size.
•Checkpointing frees journal space — Committed transactions are written to final locations, allowing journal reuse.
•Barriers ensure write ordering — Critical for crash safety; only disable with battery-backed caches.

Next Up: Ext4 Extents

With journaling ensuring reliability, ext4's next major innovation tackled efficiency: extents. Rather than tracking files as lists of individual blocks, extents describe contiguous ranges, dramatically reducing metadata overhead for large files and improving both performance and scalability.

Page Complete

You now understand how ext3/ext4 journaling works—from write-ahead logging theory through implementation details and recovery mechanics. This knowledge is essential for understanding filesystem reliability, troubleshooting recovery issues, and optimizing journal performance for different workloads.

Ext3 Journaling

The End of fsck Nightmares

This page explores how ext3's journaling works, from the fundamental write-ahead logging concept through implementation details that made it both reliable and performant.

What You Will Learn

The Problem Journaling Solves

To understand journaling, we must first understand why file systems become inconsistent after crashes.

The Multi-Write Problem:

Consider creating a new file /home/user/document.txt. This seemingly simple operation requires modifying multiple disk structures:

Inode bitmap: Mark a new inode as allocated
Inode table: Initialize the new inode with metadata
Block bitmap: Mark data blocks as allocated (if file has content)
Data blocks: Write actual file content
Directory inode: Update modification time
Directory data block: Add new directory entry

These six updates cannot complete atomically. If power fails midway:

Converting Mermaid diagram...

ext2's Approach: Full Scan (fsck)

Without journaling, ext2's recovery strategy was exhaustive verification:

Pass 1: Check inodes, blocks, and sizes
Pass 2: Check directory structure
Pass 3: Check directory connectivity
Pass 4: Check reference counts (link count)
Pass 5: Check group summary information (bitmaps)

Time complexity: O(n) where n = total inodes and blocks

Filesystem Size	Typical fsck Time
10 GB	1-2 minutes
100 GB	10-20 minutes
1 TB	1-2 hours
10 TB	10+ hours

For a server that must be available 24/7, hours of downtime after each unexpected reboot was unacceptable.

The Core Insight: Write-Ahead Logging

Database systems solved this problem decades ago with write-ahead logging (WAL):

Before modifying data, write the intended change to a log
After writing the log entry, modify the actual data
After successful modification, mark the log entry as complete
On crash: replay incomplete log entries

This reduces recovery from O(n) scanning to O(log size) replay—typically seconds regardless of filesystem size.

The ARIES Protocol

The Journal's On-Disk Structure

The ext3/ext4 journal is stored as a special file managed by the Journaling Block Device (JBD2) layer. By default, it uses inode number 8 and occupies a fixed region of disk space.

Journal Location:

# View journal inode and size
dumpe2fs /dev/sda1 | grep -i journal
# Output:
Journal inode:            8
Journal size:             128M
Journal blocks:           32768

Journal Layout:

journal_structure.c
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
// Journal superblock structure (first block of journal)
typedef struct journal_superblock_s {
    // Static information
    __be32  s_header.h_magic;      // JBD2 magic: 0xC03B3998
    __be32  s_header.h_blocktype;  // Superblock type identifier
    __be32  s_blocksize;           // Journal block size
    __be32  s_maxlen;              // Total journal blocks
    __be32  s_first;               // First usable block (after SB)
    
    // Dynamic information
    __be32  s_sequence;            // Sequence of first transaction
    __be32  s_start;               // Block of first transaction
    __be32  s_errno;               // Error value if any
    
    // Version 2 fields
    __be32  s_feature_compat;      // Compatible features
    __be32  s_feature_incompat;    // Incompatible features
    __be32  s_feature_ro_compat;   // Read-only compat features
    __u8    s_uuid[16];            // Journal UUID
    __be32  s_nr_users;            // Number of filesystems using
    __be32  s_dynsuper;            // Location of dynamic superblock
    
    // Recovery info
    __be32  s_max_transaction;     // Max blocks per transaction
    __be32  s_max_trans_data;      // Max data blocks per transaction
    
    __u8    s_checksum_type;       // Checksum algorithm
    __u8    s_padding2[3];
    __be32  s_num_fc_blks;         // Fast commit blocks
    __be32  s_padding[41];
    __be32  s_checksum;            // Superblock checksum
    __u8    s_users[16*48];        // UUIDs of filesystems using
} journal_superblock_t;
 
// Descriptor block (describes following data blocks)
typedef struct journal_header_s {
    __be32  h_magic;               // JBD2_MAGIC_NUMBER
    __be32  h_blocktype;           // DESCRIPTOR, COMMIT, or REVOKE
    __be32  h_sequence;            // Transaction sequence number
} journal_header_t;
 
// Block tag in descriptor block (ext4 format)
typedef struct journal_block_tag_s {
    __be32  t_blocknr;             // Destination block (low 32)
    __be16  t_checksum;            // Block checksum
    __be16  t_flags;               // Flags (ESCAPE, SAME_UUID, etc.)
    __be32  t_blocknr_high;        // Destination block (high 32)
} journal_block_tag_t;

Journal Block Types:

Block Type	Purpose	Header ID
Superblock	Journal configuration and state	N/A
Descriptor	Lists following data blocks and their destinations	1
Data Blocks	Copies of filesystem blocks being modified	N/A
Commit	Marks transaction as complete	2
Revoke	Lists blocks that should NOT be replayed	5

Circular Buffer:

The journal operates as a circular buffer:

Journal Space:
+--------+------+------+------+------+------+------+------+
| Super  | T1   | T1   | T2   | T2   | T2   | FREE | FREE |
| Block  | desc | data | desc | data | cmit |      |      |
+--------+------+------+------+------+------+------+------+
         ↑                           ↑             ↑
      s_start                     s_end         wrap point
   (oldest active)             (newest)      (reuses space)

When the journal fills, old completed transactions are overwritten. This is safe because:

Committed transactions have been written to their final locations
Only uncommitted or recently committed transactions need preservation
Checkpointing ensures committed data reaches permanent storage

Journal Size Matters

The Three Journaling Modes

ext3/ext4 offers three journaling modes, each trading off between safety and performance. Understanding these modes is crucial for optimizing different workloads.

Mode 1: Journal (data=journal)

The most conservative mode journals both metadata AND data blocks:

Write file data:
1. Write data blocks to journal
2. Write metadata blocks to journal  
3. Write commit record
4. Write data blocks to final location
5. Write metadata blocks to final location
6. Checkpoint: free journal space

Characteristics:

✅ Full crash protection for data and metadata
✅ Can recover file contents after crash
❌ Writes every block twice (journal + final)
❌ Write throughput cut roughly in half
📊 Use case: Critical data where no loss is acceptable

data=journal Pros

•Maximum data protection
•Atomic file content updates
•No partial writes visible
•Simplest recovery model
•Good for small files

data=journal Cons

•~50% write performance penalty
•High journal space usage
•More journal wrap-around
•More checkpointing pauses
•Not suitable for streaming writes

Mode 2: Ordered (data=ordered)

The default mode—journals metadata only, but ensures data is written before metadata:

Write file data:
1. Write data blocks to FINAL location
2. Wait for data write completion
3. Write metadata blocks to journal
4. Write commit record
5. Write metadata to final location

Characteristics:

✅ Data reaches disk before metadata commits
✅ No stale/garbage data visible after crash
✅ Much better performance than full journaling
⚠️ Recently written data may be lost if the crash occurs before the commit record (step 4) is durable
📊 Use case: General purpose (default for good reason)

Mode 3: Writeback (data=writeback)

Fastest mode—journals metadata only with no ordering guarantees:

Write file data:
1. Write data blocks to final location (async)
2. Write metadata blocks to journal
3. Write commit record
4. Write metadata to final location

Characteristics:

✅ Maximum write performance
✅ Metadata always consistent after crash
❌ Files may contain stale/garbage data after crash
❌ Security risk: old file content exposed
📊 Use case: Scratch/temp filesystems, benchmarks

Journaling Mode Comparison
Aspect	data=journal	data=ordered	data=writeback
What's journaled	Metadata + Data	Metadata only	Metadata only
Data ordering	Before metadata	Before metadata	No ordering
Write amplification	2x for all data	1x (no extra writes)	1x (no extra writes)
Recovery guarantee	Full content	No garbage data	Metadata only
Performance impact	~50% slower	~5-15% slower	Baseline
Default	No	Yes	No
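
The write-amplification row can be checked with simple arithmetic. The sketch below counts the blocks physically written for one transaction in each mode; it deliberately ignores descriptor and commit blocks, and the enum and function names are hypothetical.

```c
#include <stdint.h>

enum toy_data_mode { TOY_DATA_JOURNAL, TOY_DATA_ORDERED, TOY_DATA_WRITEBACK };

/* Blocks physically written for one transaction, ignoring the small
 * descriptor and commit block overhead. */
static uint64_t toy_blocks_written(enum toy_data_mode mode,
                                   uint64_t data_blocks,
                                   uint64_t meta_blocks)
{
    /* Metadata is journaled in every mode. */
    uint64_t journal_writes = meta_blocks;

    /* Only data=journal also copies file data into the journal. */
    if (mode == TOY_DATA_JOURNAL)
        journal_writes += data_blocks;

    /* Every block is eventually written to its final location. */
    uint64_t final_writes = data_blocks + meta_blocks;

    return journal_writes + final_writes;
}
```

For 1000 data blocks and 10 metadata blocks this gives 2020 writes under data=journal and 1020 under the other two modes. Ordered and writeback differ in when the data is written, not in how much.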

mount_options.sh
Bash
# Set journaling mode via mount options
mount -o data=journal /dev/sda1 /mnt/critical  # Maximum safety
mount -o data=ordered /dev/sda1 /mnt/general   # Default, balanced
mount -o data=writeback /dev/sda1 /mnt/temp    # Maximum performance
 
# Set default mode in fstab
# /dev/sda1  /data  ext4  defaults,data=ordered  0  2
 
# Check current mode
mount | grep sda1
# /dev/sda1 on /data type ext4 (rw,relatime,data=ordered)
 
# Set default mode in the superblock (used when no data= option is given at mount time)
tune2fs -o journal_data /dev/sda1           # data=journal
tune2fs -o journal_data_ordered /dev/sda1   # data=ordered
tune2fs -o journal_data_writeback /dev/sda1 # data=writeback

Choosing the Right Mode

Transaction Lifecycle

Understanding the transaction lifecycle reveals how journaling achieves atomicity while maintaining performance.

Transaction States:


Detailed Transaction Flow:

1. T_RUNNING (Active Transaction)

// Filesystem operations join the current transaction
handle_t *handle = jbd2_journal_start(journal, nblocks);

// Modify metadata blocks within the transaction
jbd2_journal_get_write_access(handle, bh);
modify_block(bh);
jbd2_journal_dirty_metadata(handle, bh);

// Complete this operation
jbd2_journal_stop(handle);

2. Commit Trigger

A transaction commits when:

Commit timer expires (default: 5 seconds)
Transaction exceeds size limit
Explicit sync request (fsync, sync)
Journal space runs low
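
Read together, the four triggers form a simple "any of these" predicate. The following is a minimal sketch under that reading; the struct and its fields are invented for illustration and do not correspond to jbd2 data structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Snapshot of the state the commit decision depends on.
 * All field names are hypothetical. */
struct toy_tx_status {
    uint32_t age_seconds;          /* time since the transaction started */
    uint32_t commit_interval;      /* commit= mount option, 5 by default */
    uint32_t blocks_used;          /* blocks already in the transaction */
    uint32_t max_blocks;           /* transaction size limit */
    bool     sync_requested;       /* fsync() or sync() is waiting */
    uint32_t journal_free_blocks;  /* free space left in the journal */
    uint32_t low_space_threshold;  /* commit early below this level */
};

/* True when any one of the four commit triggers has fired. */
static bool toy_should_commit(const struct toy_tx_status *s)
{
    return s->age_seconds >= s->commit_interval
        || s->blocks_used >= s->max_blocks
        || s->sync_requested
        || s->journal_free_blocks < s->low_space_threshold;
}
```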

3. T_LOCKED → T_FLUSH

// Lock transaction: no new handles can join
transaction->t_state = T_LOCKED;

// Wait for all existing handles to complete
wait_for_handles_to_complete();

// Begin flushing to journal
transaction->t_state = T_FLUSH;

4. Journal Write Sequence

1. Write descriptor block (lists all blocks in transaction)
2. Write data/metadata blocks to journal space
3. Issue flush command (barrier) to disk
4. Write commit block with transaction checksum
5. Issue another flush command

The two flush commands ensure proper ordering: blocks must reach disk before commit, commit must reach disk before checkpoint.
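
To see why the first flush matters, consider a toy disk with a volatile write cache. Everything here is an assumption made for the demonstration: the `toy_disk` model, the block layout, and the worst-case crash in which only the commit record happened to leave the cache.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_BLOCKS 4   /* blocks 0..2 hold logged data */
#define TOY_COMMIT 3   /* block 3 holds the commit record */

struct toy_disk {
    int  media[TOY_BLOCKS];  /* contents that survive power loss */
    int  cache[TOY_BLOCKS];  /* volatile write cache */
    bool dirty[TOY_BLOCKS];  /* cache[] holds an unflushed write */
};

static void toy_write(struct toy_disk *d, int blk, int val)
{
    d->cache[blk] = val;
    d->dirty[blk] = true;
}

/* Flush: everything in the cache becomes durable. */
static void toy_flush(struct toy_disk *d)
{
    for (int i = 0; i < TOY_BLOCKS; i++) {
        if (d->dirty[i]) {
            d->media[i] = d->cache[i];
            d->dirty[i] = false;
        }
    }
}

/* Power loss where the drive had persisted only block `lucky`. */
static void toy_crash(struct toy_disk *d, int lucky)
{
    if (d->dirty[lucky])
        d->media[lucky] = d->cache[lucky];
    memset(d->dirty, 0, sizeof d->dirty);
}

/* Log three blocks and a commit record, then crash. Returns true if
 * recovery would see a consistent journal afterwards. */
static bool toy_commit_then_crash(bool use_flush)
{
    struct toy_disk d = {0};

    for (int i = 0; i < TOY_COMMIT; i++)
        toy_write(&d, i, 100 + i);
    if (use_flush)
        toy_flush(&d);              /* barrier before the commit record */
    toy_write(&d, TOY_COMMIT, 1);
    toy_crash(&d, TOY_COMMIT);      /* worst case: only the commit persisted */

    if (d.media[TOY_COMMIT] == 0)
        return true;                /* no commit record: transaction ignored */
    for (int i = 0; i < TOY_COMMIT; i++)
        if (d.media[i] != 100 + i)
            return false;           /* commit present but data missing */
    return true;
}
```

With the flush, the commit record can only become durable after the blocks it describes. Without it, recovery may find a valid-looking commit record pointing at blocks that never reached the media.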

journal_commit.c
C
// Simplified commit sequence (modeled on jbd2_journal_commit_transaction).
// Helpers such as allocate_journal_block() are illustrative, not real jbd2 APIs.
void commit_transaction(journal_t *journal, transaction_t *tx) {
    struct buffer_head *descriptor, *journal_block, *commit_bh;
    struct buffer_head *bh;
    struct commit_header *commit;

    // Phase 1: Write descriptor block (lists the blocks in this transaction)
    descriptor = allocate_journal_block(journal);
    build_descriptor(descriptor, tx);
    submit_bh(WRITE, descriptor);

    // Phase 2: Copy every dirty metadata block into the journal
    list_for_each_entry(bh, &tx->t_buffers, b_tnext) {
        journal_block = allocate_journal_block(journal);
        memcpy(journal_block->b_data, bh->b_data, bh->b_size);
        submit_bh(WRITE, journal_block);
    }

    // Phase 3: Wait for the journal writes, then flush the device cache
    wait_for_journal_writes(tx);
    blkdev_issue_flush(journal->j_dev);

    // Phase 4: Write commit block (on-disk journal fields are big-endian)
    commit_bh = allocate_journal_block(journal);
    commit = (struct commit_header *)commit_bh->b_data;
    commit->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
    commit->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
    commit->h_sequence = cpu_to_be32(tx->t_tid);
    commit->h_chksum[0] = cpu_to_be32(compute_transaction_checksum(tx));
    submit_bh(WRITE, commit_bh);

    // Phase 5: Wait for the commit block, then flush again
    wait_on_buffer(commit_bh);
    blkdev_issue_flush(journal->j_dev);

    // Transaction is now committed - safe from crash
    tx->t_state = T_FINISHED;
}

The 5-Second Commit Interval

Crash Recovery Process

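After a crash, ext3/ext4 recovery replays every transaction that was committed to the journal but not yet checkpointed. Transactions without a valid commit block are discarded, so the filesystem returns to its last consistent state.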

Recovery Detection:

During mount, the kernel checks if recovery is needed:

if (!(sb->s_state & EXT4_VALID_FS) ||
    (sb->s_feature_incompat & EXT4_FEATURE_INCOMPAT_RECOVER)) {
    // Filesystem was not cleanly unmounted
    // Journal recovery required
    jbd2_journal_recover(journal);
}

Recovery Phases:


Phase 1: PASS_SCAN

Scan the journal to identify valid transactions:

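// Simplified scan pass; helper names are illustrative
void recovery_pass_scan(journal_t *journal) {
    block_t block = journal->j_sb->s_start;
    tid_t expected_seq = journal->j_sb->s_sequence;
    journal_header_t *header;

    while (block_is_valid(block)) {
        header = read_journal_block(block);

        if (header->h_magic != JBD2_MAGIC_NUMBER)
            break;  // End of journal or corruption

        if (header->h_sequence != expected_seq)
            break;  // Stale block left over from an older transaction

        switch (header->h_blocktype) {
            case JBD2_DESCRIPTOR_BLOCK:
                record_descriptor(block);
                // Skip the logged blocks that follow the descriptor
                block = advance_log_block(journal, block, count_data_blocks(header));
                break;
            case JBD2_COMMIT_BLOCK:
                if (!verify_checksum(header))
                    return;  // Incomplete commit: stop and discard this transaction
                mark_transaction_valid(expected_seq);
                expected_seq++;
                break;
            case JBD2_REVOKE_BLOCK:
                record_revokes(block);
                break;
        }
        // Move to the next block, wrapping at the end of the journal
        block = advance_log_block(journal, block, 1);
    }
}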
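
One detail the scan pass glosses over is that transaction IDs are 32-bit counters that eventually wrap, so a plain `>` comparison would misorder them near the wrap point. The kernel's jbd2 header provides `tid_gt()` and `tid_geq()` helpers based on a signed difference; the sketch below reproduces that idea under `toy_` names, so treat it as an illustration, not the kernel source.

```c
#include <stdint.h>

typedef uint32_t toy_tid_t;

/* True if x is "later" than y, even across counter wrap-around.
 * The unsigned subtraction wraps; reinterpreting the result as signed
 * gives the shortest distance between the two IDs. */
static int toy_tid_gt(toy_tid_t x, toy_tid_t y)
{
    int32_t difference = (int32_t)(x - y);
    return difference > 0;
}

/* True if x is later than or equal to y. */
static int toy_tid_geq(toy_tid_t x, toy_tid_t y)
{
    int32_t difference = (int32_t)(x - y);
    return difference >= 0;
}
```

The comparison is only meaningful when the two IDs are less than 2^31 apart, which holds for the transactions present in a single journal.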

Phase 2: PASS_REVOKE

Process revoke records to avoid replaying deleted blocks:

// Revoke blocks mark filesystem locations that should NOT
// be updated during replay (typically from deleted files)

for_each_revoke_record(record) {
    tid_t revoke_tid = record->r_transaction;
    block_t block = record->r_block;
    
    // If a later transaction revoked this block,
    // don't replay older data to it
    add_to_revoke_table(block, revoke_tid);
}

Phase 3: PASS_REPLAY

Replay committed transactions to restore consistency:

for_each_committed_transaction(tx) {
    for_each_block_in_transaction(tx, block) {
        // Check revoke table
        if (is_revoked(block->destination, tx->tid))
            continue;  // Skip revoked block
        
        // Replay: copy from journal to final location
        write_block(block->destination, block->data);
    }
}
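
The three passes can be exercised end to end on an in-memory log. This is a toy under stated assumptions: the record layout, the `toy_` names, and the flat `disk[]` array are invented, checksums are omitted, and transaction IDs are compared without wrap-around handling.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_DISK_BLOCKS 16

enum toy_rec_type { TOY_DATA, TOY_REVOKE, TOY_COMMIT };

struct toy_record {
    enum toy_rec_type type;
    uint32_t tid;     /* transaction the record belongs to */
    uint32_t target;  /* DATA: destination block; REVOKE: revoked block */
    int      value;   /* DATA: block contents */
};

/* Replays a log into disk[]; returns the number of transactions replayed. */
static uint32_t toy_recover(const struct toy_record *log, size_t n,
                            uint32_t first_tid, int disk[TOY_DISK_BLOCKS])
{
    uint32_t revoked_by[TOY_DISK_BLOCKS] = {0};
    int has_revoke[TOY_DISK_BLOCKS] = {0};
    uint32_t next_tid = first_tid;
    size_t end = 0;  /* records [0, end) belong to committed transactions */

    /* PASS_SCAN: find the end of the last fully committed transaction. */
    for (size_t i = 0; i < n; i++) {
        if (log[i].tid != next_tid)
            break;
        if (log[i].type == TOY_COMMIT) {
            end = i + 1;
            next_tid++;
        }
    }

    /* PASS_REVOKE: remember the latest transaction that revoked each block. */
    for (size_t i = 0; i < end; i++) {
        if (log[i].type == TOY_REVOKE && log[i].target < TOY_DISK_BLOCKS) {
            revoked_by[log[i].target] = log[i].tid;
            has_revoke[log[i].target] = 1;
        }
    }

    /* PASS_REPLAY: copy logged blocks home unless a revoke from the same
     * or a later transaction covers them. */
    for (size_t i = 0; i < end; i++) {
        uint32_t t = log[i].target;
        if (log[i].type != TOY_DATA || t >= TOY_DISK_BLOCKS)
            continue;
        if (has_revoke[t] && revoked_by[t] >= log[i].tid)
            continue;
        disk[t] = log[i].value;
    }
    return next_tid - first_tid;
}
```

A log holding two committed transactions followed by a third with no commit record replays only the first two, and a block logged in the first but revoked in the second is left untouched.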

Recovery Time Comparison
Filesystem Size	ext2 fsck Time	ext3/ext4 Recovery
10 GB	1-2 minutes	< 1 second
100 GB	10-20 minutes	1-2 seconds
1 TB	1-2 hours	2-5 seconds
10 TB	10+ hours	5-10 seconds

Recovery Is Automatic

Checkpointing and Space Management

The journal has finite space. Checkpointing is the process of writing committed transactions to their final locations, freeing journal space for reuse.

Why Checkpointing Matters:

Journal at 80% capacity:
+--------+------+------+------+------+------+------+------+
| Super  | T1✓  | T2✓  | T3✓  | T4✓  | T5•  | FREE | FREE |
+--------+------+------+------+------+------+------+------+
                                       ↑
                              Current transaction

T1-T4 are committed but not yet checkpointed.
Journal can only use remaining 20% until T1-T4 checkpoint.

Checkpoint Trigger Conditions:

Journal space falls below threshold (typically 25%)
Background checkpoint timer expires
Sync or an explicit journal flush forces a checkpoint (fsync on its own only forces a commit)
Unmount flushes all transactions

checkpointing.c
C
// Simplified checkpoint process; helper names are illustrative
void jbd2_log_do_checkpoint(journal_t *journal) {
    transaction_t *tx;
    struct buffer_head *bh, *next;
    
    while ((tx = get_oldest_checkpointable_transaction(journal))) {
        // Write all metadata blocks to their final destinations
        list_for_each_entry(bh, &tx->t_checkpoint_list, b_cpnext) {
            if (!buffer_dirty(bh))
                continue;
            
            // Write block to its permanent location on disk
            lock_buffer(bh);
            bh->b_end_io = end_buffer_write_sync;
            submit_bh(WRITE, bh);
        }
        
        // Wait for all writes to complete. The _safe iterator is needed
        // because entries are removed from the list while walking it.
        list_for_each_entry_safe(bh, next, &tx->t_checkpoint_list, b_cpnext) {
            wait_on_buffer(bh);
            if (buffer_uptodate(bh)) {
                release_buffer_from_checkpoint(bh);
            }
        }
        
        // Transaction is fully checkpointed: advance the journal tail
        // past it (before dropping it) so its space can be reclaimed
        journal->j_tail = advance_to_next_transaction(tx);
        __jbd2_journal_drop_transaction(journal, tx);
    }
    
    // Record the new tail in the journal superblock (s_start)
    jbd2_update_superblock(journal);
}

Checkpoint vs. Commit:

Aspect	Commit	Checkpoint
What happens	Transaction written to journal	Journal data written to filesystem
When	Every 5 seconds (default)	When journal space needed
Purpose	Provides crash safety	Frees journal space
Blocks involved	Metadata (and data in journal mode)	Same blocks, different location
Required order	Must complete before checkpoint	Must wait for commit

Checkpoint Ordering Constraints:

Transactions must checkpoint in order:

T1 commits → T2 commits → T3 commits
      ↓           ↓           ↓
T1 checkpoints → T2 checkpoints → T3 checkpoints

This ordering ensures that if a crash occurs during checkpointing, recovery replays transactions in the correct sequence.
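
One consequence of in-order checkpointing is that journal space is reclaimed only from the oldest end: a later transaction whose blocks are already home cannot release its space while an older one is still pending. A minimal sketch of that rule, using a hypothetical `toy_cp_tx` record per transaction:

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_cp_tx {
    unsigned tid;           /* transaction ID */
    unsigned log_blocks;    /* journal blocks this transaction occupies */
    bool     checkpointed;  /* all of its blocks reached their final location */
};

/* txs[] is ordered oldest first. Returns the number of journal blocks that
 * can be reclaimed and stores the index of the new oldest live transaction.
 * Only the contiguous run of checkpointed transactions at the old end counts. */
static unsigned toy_reclaimable(const struct toy_cp_tx *txs, size_t n,
                                size_t *new_tail_index)
{
    unsigned blocks = 0;
    size_t i = 0;

    while (i < n && txs[i].checkpointed) {
        blocks += txs[i].log_blocks;
        i++;
    }
    *new_tail_index = i;
    return blocks;
}
```

With T1 checkpointed, T2 pending, and T3 checkpointed, only T1's blocks are reclaimable; once T2 finishes, the space of all three becomes available at once.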

Monitoring Checkpoint Activity:

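# View journal statistics (the directory name is <device>-<journal inode>)
cat /proc/fs/jbd2/sda1-8/info
# Reports the number of transactions and per-transaction averages:
# time spent waiting, running, locked, flushing, and logging,
# plus handles and blocks per transaction

# Journal size and features recorded in the superblock
dumpe2fs -h /dev/sda1 | grep -i journal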

Checkpoint Stalls

Journal Performance Tuning

Optimizing journal performance requires understanding the trade-offs between safety, latency, and throughput.

Key Tuning Parameters:

journal_tuning.sh
Bash
# Commit interval: time between automatic commits
# Lower = more frequent commits = less data at risk = lower throughput
mount -o commit=5 /dev/sda1 /mnt      # Default: 5 seconds
mount -o commit=1 /dev/sda1 /mnt      # Aggressive: 1 second
mount -o commit=60 /dev/sda1 /mnt     # Relaxed: 60 seconds
 
# Barrier behavior: ensures ordering for crash safety
mount -o barrier=1 /dev/sda1 /mnt     # Default: barriers enabled
mount -o barrier=0 /dev/sda1 /mnt     # Disable (DANGEROUS without a battery-backed write cache)
mount -o nobarrier /dev/sda1 /mnt     # Alias for barrier=0
 
# Journal size: affects maximum transaction size and checkpoint frequency
mkfs.ext4 -J size=256 /dev/sda1       # 256 MB journal (at creation)
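# Resizing an existing journal means removing and recreating it (filesystem unmounted and clean)
tune2fs -O ^has_journal /dev/sda1     # Remove the current journal
tune2fs -J size=256 /dev/sda1         # Create a new 256 MB journal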
 
# External journal: place journal on separate fast device
mkfs.ext4 -J device=/dev/sdb1 /dev/sda1  # Journal on sdb1
 
# Journal checksum (ext4): detect journal corruption (enabled as a mount option)
mount -o journal_checksum /dev/sda1 /mnt
 
# Priority of journal I/O
# Typically set via ionice for jbd2 kernel threads:
ionice -c 1 -n 0 -p $(pgrep jbd2)

Performance Recommendations by Workload:

Workload	Journal Mode	Commit Interval	Journal Size	Barriers
General purpose	ordered	5s (default)	128 MB	Enabled
Database (safety)	journal	1s	256 MB	Enabled
Database (performance)	ordered	5s	256 MB	Enabled
Streaming/logs	writeback	30s	64 MB	Enabled
Scratch/temp	writeback	60s	32 MB	Optional
Build server	writeback	30s	128 MB	Optional

External Journal Optimization:

For high-performance systems, placing the journal on a separate device eliminates seek contention:

# Create journal device (small SSD or NVMe)
mkfs.ext4 -O journal_dev /dev/nvme0n1p1

# Create main filesystem with external journal
mkfs.ext4 -J device=/dev/nvme0n1p1 /dev/sda1

# Journal operations go to fast NVMe
# Bulk data goes to high-capacity HDD

Impact of Barrier Disable
Scenario	With Barriers	Without Barriers
Sequential writes	~10-20% overhead	Maximum throughput
Random writes	~20-30% overhead	Maximum throughput
Power failure	Full recovery	Potential corruption
Use case	Production (default)	UPS + write cache battery

Barrier Disable Warning

Summary: Ext3 Journaling

Journaling transformed Linux filesystems from fragile to production-ready, enabling reliable crash recovery in seconds rather than hours.

Key Takeaways

•Journaling solves the multi-write consistency problem — By writing intended changes to a log before applying them, the filesystem can recover to a consistent state after any crash.
•Three journaling modes trade safety for speed — data=journal (safest), data=ordered (default, balanced), data=writeback (fastest, least safe).
•Transactions group related operations — Operations are batched into transactions that commit atomically every 5 seconds (default).
•Recovery replays committed transactions — The SCAN→REVOKE→REPLAY process takes seconds regardless of filesystem size.
•Checkpointing frees journal space — Committed transactions are written to final locations, allowing journal reuse.
•Barriers ensure write ordering — Critical for crash safety; only disable with battery-backed caches.

Next Up: Ext4 Extents
