Before 2001, system administrators lived in fear of the unexpected reboot. When an ext2 system crashed—whether from power failure, kernel panic, or hardware fault—the recovery process was predictable and painful: run fsck, watch it scan every inode and block on the disk, and wait. For large filesystems, this could take hours. A 500 GB disk might require 30 minutes or more of scanning before the system could come back online.
The root problem was fundamental to ext2's design. Without a record of in-progress operations, the only way to verify filesystem consistency was to examine everything. Every inode's link count must match directory entries. Every allocated block must be reachable. Every bitmap must accurately reflect allocation state. With no shortcuts available, recovery time scaled linearly with filesystem size.
Journaling changed everything. By recording intended changes before making them permanent, ext3 could recover from crashes in seconds rather than hours. The technique—borrowed from database systems—transformed Linux filesystems from fragile to production-ready.
This page explores how ext3's journaling works, from the fundamental write-ahead logging concept through implementation details that made it both reliable and performant.
By the end of this page, you will understand write-ahead logging theory, the three ext3/ext4 journaling modes (writeback, ordered, journal), the journal's on-disk structure, transaction commit sequences, and crash recovery mechanisms. You'll gain practical knowledge for tuning journal performance and troubleshooting recovery issues.
To understand journaling, we must first understand why file systems become inconsistent after crashes.
The Multi-Write Problem:
Consider creating a new file /home/user/document.txt. This seemingly simple operation requires modifying multiple disk structures:
These six updates cannot complete atomically. If power fails midway:
ext2's Approach: Full Scan (fsck)
Without journaling, ext2's recovery strategy was exhaustive verification:
Pass 1: Check inodes, blocks, and sizes
Pass 2: Check directory structure
Pass 3: Check directory connectivity
Pass 4: Check reference counts (link count)
Pass 5: Check group summary information (bitmaps)
Time complexity: O(n) where n = total inodes and blocks
| Filesystem Size | Typical fsck Time |
|---|---|
| 10 GB | 1-2 minutes |
| 100 GB | 10-20 minutes |
| 1 TB | 1-2 hours |
| 10 TB | 10+ hours |
For a server that must be available 24/7, hours of downtime after each unexpected reboot was unacceptable.
The Core Insight: Write-Ahead Logging
Database systems solved this problem decades ago with write-ahead logging (WAL):
This reduces recovery from O(n) scanning of the whole filesystem to O(j) replay, where j is the size of the journal: typically seconds, regardless of filesystem size.
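The write-ahead idea can be illustrated with a toy in-memory log. This is only a sketch under simplifying assumptions: the "disk" is an array of integers, and every name here (`struct wal`, `wal_log_update`, `wal_commit`, `wal_recover`) is hypothetical rather than part of any ext3 or JBD interface.

```c
#include <assert.h>
#include <stddef.h>

#define DISK_BLOCKS 8
#define LOG_CAPACITY 16

/* One log record: either "block `dest` should contain `value`"
 * or a commit marker for transaction `tid`. */
struct log_record {
    int tid;        /* transaction id */
    int dest;       /* destination block index on the toy disk */
    int value;      /* new contents of that block */
    int is_commit;  /* 1 = commit marker; dest and value are unused */
};

struct wal {
    struct log_record rec[LOG_CAPACITY];
    size_t len;
};

/* Record an intended update in the log. Returns 0, or -1 if it does not fit. */
int wal_log_update(struct wal *w, int tid, int dest, int value)
{
    assert(w != NULL);
    if (w->len >= LOG_CAPACITY || dest < 0 || dest >= DISK_BLOCKS)
        return -1;
    w->rec[w->len++] = (struct log_record){ tid, dest, value, 0 };
    return 0;
}

/* Append the commit marker that makes transaction `tid` durable. */
int wal_commit(struct wal *w, int tid)
{
    assert(w != NULL);
    if (w->len >= LOG_CAPACITY)
        return -1;
    w->rec[w->len++] = (struct log_record){ tid, 0, 0, 1 };
    return 0;
}

/* Return 1 if the log holds a commit marker for `tid`. */
int wal_is_committed(const struct wal *w, int tid)
{
    for (size_t i = 0; i < w->len; i++)
        if (w->rec[i].is_commit && w->rec[i].tid == tid)
            return 1;
    return 0;
}

/* Crash recovery: replay the updates of committed transactions in log
 * order and ignore updates whose transaction never committed.
 * Returns the number of blocks written. */
int wal_recover(const struct wal *w, int disk[DISK_BLOCKS])
{
    int replayed = 0;
    for (size_t i = 0; i < w->len; i++) {
        const struct log_record *r = &w->rec[i];
        if (r->is_commit || !wal_is_committed(w, r->tid))
            continue;
        disk[r->dest] = r->value;
        replayed++;
    }
    return replayed;
}
```

Recovery cost here depends only on the number of log records, not on `DISK_BLOCKS`, which is the property that makes journal replay fast on large filesystems.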
ext3's journaling applies the write-ahead logging principle that database recovery research put on a firm theoretical footing, most notably ARIES (Algorithms for Recovery and Isolation Exploiting Semantics), published by IBM researchers in 1992. ext3 itself uses a much simpler redo-only log of whole blocks rather than the full ARIES protocol, but the core rule is the same: a log record must reach stable storage before the change it describes.
The ext3/ext4 journal is stored as a special file managed by the Journaling Block Device layer (JBD in ext3, JBD2 in ext4). By default, it uses inode number 8 and occupies a fixed region of disk space.
Journal Location:
# View journal inode and size
dumpe2fs /dev/sda1 | grep -i journal
# Output:
Journal inode: 8
Journal size: 128M
Journal blocks: 32768
Journal Layout:
// Common header at the start of every journal control block
// (superblock, descriptor, commit, revoke)
typedef struct journal_header_s {
    __be32 h_magic;      // JBD2 magic: 0xC03B3998
    __be32 h_blocktype;  // Superblock, DESCRIPTOR, COMMIT, or REVOKE
    __be32 h_sequence;   // Transaction sequence number
} journal_header_t;

// Journal superblock structure (first block of journal)
typedef struct journal_superblock_s {
    // Static information
    journal_header_t s_header;   // Magic number and superblock type identifier
    __be32 s_blocksize;          // Journal block size
    __be32 s_maxlen;             // Total journal blocks
    __be32 s_first;              // First usable block (after SB)

    // Dynamic information
    __be32 s_sequence;           // Sequence of first transaction
    __be32 s_start;              // Block of first transaction
    __be32 s_errno;              // Error value if any

    // Version 2 fields
    __be32 s_feature_compat;     // Compatible features
    __be32 s_feature_incompat;   // Incompatible features
    __be32 s_feature_ro_compat;  // Read-only compat features
    __u8   s_uuid[16];           // Journal UUID
    __be32 s_nr_users;           // Number of filesystems using
    __be32 s_dynsuper;           // Location of dynamic superblock

    // Recovery info
    __be32 s_max_transaction;    // Max blocks per transaction
    __be32 s_max_trans_data;     // Max data blocks per transaction
    __u8   s_checksum_type;      // Checksum algorithm
    __u8   s_padding2[3];
    __be32 s_num_fc_blks;        // Fast commit blocks
    __be32 s_padding[41];
    __be32 s_checksum;           // Superblock checksum
    __u8   s_users[16*48];       // UUIDs of filesystems using
} journal_superblock_t;

// Block tag in descriptor block (ext4 format)
typedef struct journal_block_tag_s {
    __be32 t_blocknr;       // Destination block (low 32)
    __be16 t_checksum;      // Block checksum
    __be16 t_flags;         // Flags (ESCAPE, SAME_UUID, etc.)
    __be32 t_blocknr_high;  // Destination block (high 32)
} journal_block_tag_t;
Journal Block Types:
| Block Type | Purpose | Header ID |
|---|---|---|
| Superblock | Journal configuration and state | 3 (v1) or 4 (v2) |
| Descriptor | Lists following data blocks and their destinations | 1 |
| Data Blocks | Copies of filesystem blocks being modified | N/A |
| Commit | Marks transaction as complete | 2 |
| Revoke | Lists blocks that should NOT be replayed | 5 |
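As a rough illustration of how the 12-byte header and the header IDs fit together, the sketch below decodes a header from raw bytes in user space. The names `struct jhdr`, `be32_at`, `jhdr_decode`, and `jhdr_type_name` are hypothetical helpers for this example; only the magic number, the big-endian field layout, and the block type values come from the JBD2 format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JBD2_MAGIC 0xC03B3998u

/* Host-order copy of the 12-byte on-disk journal header. */
struct jhdr {
    uint32_t magic;
    uint32_t blocktype;
    uint32_t sequence;
};

/* Read a big-endian 32-bit value from a byte buffer. */
uint32_t be32_at(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decode the header at the start of a journal block.
 * Returns 0 on success, or -1 if the buffer is too short or the magic
 * number does not match (for example, a journaled data block). */
int jhdr_decode(const unsigned char *buf, size_t len, struct jhdr *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 12 || be32_at(buf) != JBD2_MAGIC)
        return -1;
    out->magic = be32_at(buf);
    out->blocktype = be32_at(buf + 4);
    out->sequence = be32_at(buf + 8);
    return 0;
}

/* Map a header ID to a readable name. */
const char *jhdr_type_name(uint32_t blocktype)
{
    switch (blocktype) {
    case 1:  return "descriptor";
    case 2:  return "commit";
    case 3:  return "superblock v1";
    case 4:  return "superblock v2";
    case 5:  return "revoke";
    default: return "unknown";
    }
}
```

Journaled data blocks carry no header of their own, which is why the descriptor block has to list their destinations.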
Circular Buffer:
The journal operates as a circular buffer:
Journal Space:
+--------+------+------+------+------+------+------+------+
| Super  | T1   | T1   | T2   | T2   | T2   | FREE | FREE |
| Block  | desc | data | desc | data | cmit |      |      |
+--------+------+------+------+------+------+------+------+
            ↑                               ↑             ↑
         s_start                          s_end      wrap point
     (oldest active)                    (newest)   (reuses space)
When the journal fills, old completed transactions are overwritten. This is safe because:
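A too-small journal causes frequent checkpointing pauses. A too-large journal wastes space. The default chosen by mke2fs depends on filesystem size; 128 MB is typical and adequate for most workloads. Database servers with heavy transaction loads may benefit from 256-512 MB. An existing journal cannot be resized in place: on an unmounted, cleanly shut down filesystem, remove it with tune2fs -O ^has_journal and then recreate it with tune2fs -J size=256.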
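The circular-buffer arithmetic can be sketched in a few lines. This is an illustrative model, not JBD2 code: `struct log_area` and the three functions are hypothetical, and the convention of keeping one block unused (so that head == tail always means "empty") is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory view of the log area. Block 0 holds the journal
 * superblock; blocks first..last-1 hold transactions. */
struct log_area {
    uint32_t first;  /* first usable block (compare s_first) */
    uint32_t last;   /* one past the final usable block (compare s_maxlen) */
    uint32_t tail;   /* oldest block still needed (compare s_start) */
    uint32_t head;   /* next block to be written */
};

/* Advance a block number by n, wrapping from the end back to `first`. */
uint32_t log_advance(const struct log_area *l, uint32_t blk, uint32_t n)
{
    uint32_t span = l->last - l->first;
    assert(l->last > l->first);
    return l->first + (blk - l->first + n) % span;
}

/* Blocks holding committed data that has not been checkpointed yet. */
uint32_t log_used(const struct log_area *l)
{
    uint32_t span = l->last - l->first;
    assert(l->last > l->first);
    return (l->head - l->tail + span) % span;
}

/* Blocks that can be written before the head would catch up with the tail. */
uint32_t log_free(const struct log_area *l)
{
    return (l->last - l->first) - log_used(l) - 1;
}
```

When `log_free` approaches zero, new transactions have to wait until checkpointing moves the tail forward, which is the stall a too-small journal produces.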
ext3/ext4 offers three journaling modes, each trading off between safety and performance. Understanding these modes is crucial for optimizing different workloads.
Mode 1: Journal (data=journal)
The most conservative mode journals both metadata AND data blocks:
Write file data:
1. Write data blocks to journal
2. Write metadata blocks to journal
3. Write commit record
4. Write data blocks to final location
5. Write metadata blocks to final location
6. Checkpoint: free journal space
Characteristics:
Mode 2: Ordered (data=ordered)
The default mode—journals metadata only, but ensures data is written before metadata:
Write file data:
1. Write data blocks to FINAL location
2. Wait for data write completion
3. Write metadata blocks to journal
4. Write commit record
5. Write metadata to final location
Characteristics:
Mode 3: Writeback (data=writeback)
Fastest mode—journals metadata only with no ordering guarantees:
Write file data:
1. Write data blocks to final location (async)
2. Write metadata blocks to journal
3. Write commit record
4. Write metadata to final location
Characteristics:
| Aspect | data=journal | data=ordered | data=writeback |
|---|---|---|---|
| What's journaled | Metadata + Data | Metadata only | Metadata only |
| Data ordering | Before metadata | Before metadata | No ordering |
| Write amplification | 2x for all data | 1x (no extra writes) | 1x (no extra writes) |
| Recovery guarantee | Full content | No garbage data | Metadata only |
| Performance impact | ~50% slower | ~5-15% slower | Baseline |
| Default | No | Yes | No |
# Set journaling mode via mount options
mount -o data=journal /dev/sda1 /mnt/critical    # Maximum safety
mount -o data=ordered /dev/sda1 /mnt/general     # Default, balanced
mount -o data=writeback /dev/sda1 /mnt/temp      # Maximum performance

# Set default mode in fstab
# /dev/sda1 /data ext4 defaults,data=ordered 0 2

# Check current mode
mount | grep sda1
# /dev/sda1 on /data type ext4 (rw,relatime,data=ordered)

# Set default mode in filesystem (affects all mounts)
tune2fs -o journal_data /dev/sda1            # data=journal
tune2fs -o journal_data_ordered /dev/sda1    # data=ordered
tune2fs -o journal_data_writeback /dev/sda1  # data=writeback
Use data=ordered (default) for almost everything. Consider data=journal for critical databases where atomicity matters. Use data=writeback only for temporary data, build directories, or benchmarking; never use it for user data on multi-user systems (security risk from exposing old file contents).
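The write amplification row of the comparison table can be made concrete with a small calculation. The sketch below counts block writes for a single update under each mode; it deliberately ignores descriptor and commit blocks, and the names `enum data_mode`, `blocks_written`, and `data_before_commit` are hypothetical.

```c
#include <assert.h>

enum data_mode { DATA_JOURNAL, DATA_ORDERED, DATA_WRITEBACK };

/* Blocks written for one update touching `data` file data blocks and
 * `meta` metadata blocks. Simplifying assumption: descriptor and commit
 * blocks are not counted. */
unsigned long blocks_written(enum data_mode mode,
                             unsigned long data, unsigned long meta)
{
    unsigned long to_journal = meta;       /* metadata is journaled in every mode */
    unsigned long in_place = data + meta;  /* every block also reaches its final location */

    assert(mode == DATA_JOURNAL || mode == DATA_ORDERED || mode == DATA_WRITEBACK);
    if (mode == DATA_JOURNAL)
        to_journal += data;                /* data=journal logs file data too */
    return to_journal + in_place;
}

/* 1 if file data is guaranteed to be on disk (in the journal or in place)
 * before the commit record is written. */
int data_before_commit(enum data_mode mode)
{
    return mode != DATA_WRITEBACK;
}
```

For an update of 100 data blocks and 4 metadata blocks, this model gives 208 block writes under data=journal and 108 under the other two modes. The difference between ordered and writeback is not the write volume but the ordering guarantee.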
Understanding the transaction lifecycle reveals how journaling achieves atomicity while maintaining performance.
Transaction States:
Detailed Transaction Flow:
1. T_RUNNING (Active Transaction)
// Filesystem operations join the current transaction
handle_t *handle = jbd2_journal_start(journal, nblocks);
// Modify metadata blocks within the transaction
jbd2_journal_get_write_access(handle, bh);
modify_block(bh);
jbd2_journal_dirty_metadata(handle, bh);
// Complete this operation
jbd2_journal_stop(handle);
2. Commit Trigger
A transaction commits when:
3. T_LOCKED → T_FLUSH
// Lock transaction: no new handles can join
transaction->t_state = T_LOCKED;
// Wait for all existing handles to complete
wait_for_handles_to_complete();
// Begin flushing to journal
transaction->t_state = T_FLUSH;
4. Journal Write Sequence
1. Write descriptor block (lists all blocks in transaction)
2. Write data/metadata blocks to journal space
3. Issue flush command (barrier) to disk
4. Write commit block with transaction checksum
5. Issue another flush command
The two flush commands ensure proper ordering: blocks must reach disk before commit, commit must reach disk before checkpoint.
// Simplified commit sequence (modeled on jbd2_journal_commit_transaction)
void commit_transaction(journal_t *journal, transaction_t *tx) {
    struct buffer_head *descriptor, *journal_block, *commit_block, *bh;
    struct commit_header *commit;

    // Phase 1: Write descriptor block
    descriptor = allocate_journal_block(journal);
    build_descriptor(descriptor, tx);
    submit_bh(WRITE, descriptor);

    // Phase 2: Write all dirty metadata blocks to journal
    list_for_each_entry(bh, &tx->t_buffers, b_tnext) {
        journal_block = allocate_journal_block(journal);
        memcpy(journal_block->b_data, bh->b_data, bh->b_size);
        submit_bh(WRITE, journal_block);
    }

    // Phase 3: Issue barrier (ensure writes reached disk)
    blkdev_issue_flush(journal->j_dev);

    // Phase 4: Write commit block (header fields are big-endian on disk)
    commit_block = allocate_journal_block(journal);
    commit = (struct commit_header *)commit_block->b_data;
    commit->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
    commit->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
    commit->h_sequence = cpu_to_be32(tx->t_tid);
    commit->h_chksum[0] = cpu_to_be32(compute_transaction_checksum(tx));
    submit_bh(WRITE, commit_block);

    // Phase 5: Second barrier
    blkdev_issue_flush(journal->j_dev);

    // Transaction is now committed - safe from crash
    tx->t_state = T_FINISHED;
}
By default, ext3/ext4 commits transactions every 5 seconds. This batches many operations into a single commit, improving throughput. The trade-off: up to 5 seconds of operations may be lost in a crash. Use the commit=N mount option to adjust the interval (lower = safer, higher = faster).
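When a system crashes, the journal may hold transactions that were committed but not yet checkpointed, along with a partially written transaction that never committed. ext3/ext4's recovery process replays the committed transactions to restore consistency and discards the incomplete one.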
Recovery Detection:
During mount, the kernel checks if recovery is needed:
if (!(sb->s_state & EXT4_VALID_FS) ||
    (sb->s_feature_incompat & EXT4_FEATURE_INCOMPAT_RECOVER)) {
    // Filesystem was not cleanly unmounted
    // Journal recovery required
    jbd2_journal_recover(journal);
}
Recovery Phases:
Phase 1: PASS_SCAN
Scan the journal to identify valid transactions:
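// Simplified: byte-order conversions omitted; helper functions are illustrative
void recovery_pass_scan(journal_t *journal) {
    block_t block = journal->j_sb->s_start;
    tid_t expected_seq = journal->j_sb->s_sequence;
    journal_header_t *header;

    if (block == 0)
        return;  // s_start == 0 means the journal is empty
    while (block_is_valid(block)) {
        header = read_journal_block(block);
        if (header->h_magic != JBD2_MAGIC_NUMBER)
            break;  // End of journal or corruption
        if (header->h_sequence != expected_seq)
            break;  // Sequence inconsistency
        switch (header->h_blocktype) {
        case JBD2_DESCRIPTOR_BLOCK:
            record_descriptor(block);
            // Skip the data blocks this descriptor describes
            block = next_log_block(journal, block, count_data_blocks(header));
            break;
        case JBD2_COMMIT_BLOCK:
            if (!verify_checksum(header))
                return;  // Torn commit: discard this transaction and stop
            mark_transaction_valid(expected_seq);
            expected_seq++;
            break;
        case JBD2_REVOKE_BLOCK:
            record_revokes(block);
            break;
        }
        // Advance, wrapping from the end of the log to the first usable block
        block = next_log_block(journal, block, 1);
    }
}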
Phase 2: PASS_REVOKE
Process revoke records to avoid replaying deleted blocks:
// Revoke blocks mark filesystem locations that should NOT
// be updated during replay (typically from deleted files)
for_each_revoke_record(record) {
    tid_t revoke_tid = record->r_transaction;
    block_t block = record->r_block;
    // If a later transaction revoked this block,
    // don't replay older data to it
    add_to_revoke_table(block, revoke_tid);
}
Phase 3: PASS_REPLAY
Replay committed transactions to restore consistency:
for_each_committed_transaction(tx) {
    for_each_block_in_transaction(tx, block) {
        // Check revoke table
        if (is_revoked(block->destination, tx->tid))
            continue;  // Skip revoked block
        // Replay: copy from journal to final location
        write_block(block->destination, block->data);
    }
}
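The interaction between revoke records and replay is easier to see in a self-contained model. The sketch below assumes the scan pass has already determined the last committed transaction and that the revoke list contains only records from committed transactions; `struct jblock`, `struct jrevoke`, `is_revoked`, and `replay` are hypothetical names for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NBLOCKS 8

struct jblock  { unsigned tid; int dest; int data; };  /* logged block copy */
struct jrevoke { unsigned tid; int dest; };            /* revoke record */

/* A logged copy is skipped when a revoke record for the same destination
 * was issued by the same or a later transaction. */
int is_revoked(const struct jrevoke *rv, size_t nrv, int dest, unsigned tid)
{
    for (size_t i = 0; i < nrv; i++)
        if (rv[i].dest == dest && rv[i].tid >= tid)
            return 1;
    return 0;
}

/* Replay committed transactions (tid <= last_committed) in log order.
 * Returns the number of blocks written to the toy disk. */
int replay(const struct jblock *log, size_t nlog,
           const struct jrevoke *rv, size_t nrv,
           unsigned last_committed, int disk[NBLOCKS])
{
    int written = 0;
    for (size_t i = 0; i < nlog; i++) {
        assert(log[i].dest >= 0 && log[i].dest < NBLOCKS);
        if (log[i].tid > last_committed)
            continue;  /* transaction never committed */
        if (is_revoked(rv, nrv, log[i].dest, log[i].tid))
            continue;  /* stale copy: block was freed afterwards */
        disk[log[i].dest] = log[i].data;
        written++;
    }
    return written;
}
```

Suppose transaction 1 logged block 5 and transaction 2 revoked it. Transaction 1's copy of block 5 is skipped during replay, while a copy logged later by transaction 3 is still written, because the revoke only covers transactions up to and including the one that issued it.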
| Filesystem Size | ext2 fsck Time | ext3/ext4 Recovery |
|---|---|---|
| 10 GB | 1-2 minutes | < 1 second |
| 100 GB | 10-20 minutes | 1-2 seconds |
| 1 TB | 1-2 hours | 2-5 seconds |
| 10 TB | 10+ hours | 5-10 seconds |
Journal recovery happens automatically during mount—no user intervention required. After a crash, simply boot the system normally. The kernel detects the unclean shutdown and replays the journal before completing the mount.
The journal has finite space. Checkpointing is the process of writing committed transactions to their final locations, freeing journal space for reuse.
Why Checkpointing Matters:
Journal at 80% capacity:
+--------+------+------+------+------+------+------+------+
| Super  | T1✓  | T2✓  | T3✓  | T4✓  | T5•  | FREE | FREE |
+--------+------+------+------+------+------+------+------+
                                        ↑
                               Current transaction
T1-T4 are committed but not yet checkpointed.
Journal can only use remaining 20% until T1-T4 checkpoint.
Checkpoint Trigger Conditions:
// Simplified checkpoint process
void jbd2_log_do_checkpoint(journal_t *journal) {
    transaction_t *tx;
    struct buffer_head *bh, *next;

    while ((tx = get_oldest_checkpointable_transaction(journal))) {
        // Write all metadata blocks to their final destinations
        list_for_each_entry(bh, &tx->t_checkpoint_list, b_cpnext) {
            if (!buffer_dirty(bh))
                continue;
            // Write block to its permanent location on disk
            lock_buffer(bh);
            bh->b_end_io = end_buffer_write_sync;
            submit_bh(WRITE, bh);
        }

        // Wait for all writes to complete; use the _safe iterator because
        // entries are removed from the list while walking it
        list_for_each_entry_safe(bh, next, &tx->t_checkpoint_list, b_cpnext) {
            wait_on_buffer(bh);
            if (buffer_uptodate(bh)) {
                release_buffer_from_checkpoint(bh);
            }
        }

        // Transaction is fully checkpointed: drop it from the checkpoint list
        __jbd2_journal_drop_transaction(journal, tx);
    }

    // Advance the log tail past the checkpointed transactions and record it
    // in the journal superblock (s_start, s_sequence) so the space can be reused
    jbd2_cleanup_journal_tail(journal);
}
Checkpoint vs. Commit:
| Aspect | Commit | Checkpoint |
|---|---|---|
| What happens | Transaction written to journal | Journal data written to filesystem |
| When | Every 5 seconds (default) | When journal space needed |
| Crash safety | Provides crash safety | Frees journal space |
| Blocks involved | Metadata (and data in journal mode) | Same blocks, different location |
| Required order | Must complete before checkpoint | Must wait for commit |
Checkpoint Ordering Constraints:
Transactions must checkpoint in order:
T1 commits     → T2 commits     → T3 commits
     ↓                ↓                ↓
T1 checkpoints → T2 checkpoints → T3 checkpoints
This ordering ensures that if a crash occurs during checkpointing, recovery replays transactions in the correct sequence.
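One consequence of in-order checkpointing is that journal space is reclaimed only from the oldest end: a checkpointed transaction that sits behind an unfinished one still pins its space. A minimal sketch of that rule, using a hypothetical `struct txn` and `reclaimable_prefix` function:

```c
#include <assert.h>
#include <stddef.h>

struct txn {
    unsigned tid;
    int committed;     /* commit block is on disk */
    int checkpointed;  /* all blocks written to their final locations */
};

/* Transactions are listed oldest first. Returns how many transactions at
 * the old end of the journal can have their space reclaimed: the tail may
 * only advance over a contiguous run of checkpointed transactions. */
size_t reclaimable_prefix(const struct txn *t, size_t n)
{
    size_t i = 0;
    assert(t != NULL || n == 0);
    while (i < n && t[i].committed && t[i].checkpointed)
        i++;
    return i;
}
```

With T1 checkpointed, T2 still pending, and T3 checkpointed, only T1's space can be reused; T3's space becomes available once T2 finishes.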
Monitoring Checkpoint Activity:
# View journal status
cat /proc/fs/jbd2/sda1-8/info
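# The report starts with the number of transactions and the maximum
# blocks per transaction, followed by per-transaction averages (time
# spent waiting, running, locked, flushing, and logging, plus handles
# and blocks per transaction). The exact layout varies by kernel version.
# Per-filesystem ext4 attributes (available files vary by kernel version)
ls /sys/fs/ext4/sda1/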
If the journal fills completely, new operations stall waiting for checkpoint to complete. This manifests as I/O hangs. Solutions: increase journal size, reduce commit interval, investigate disk performance. Monitor /proc/fs/jbd2/*/info for warning signs.
Optimizing journal performance requires understanding the trade-offs between safety, latency, and throughput.
Key Tuning Parameters:
# Commit interval: time between automatic commits
# Lower = more frequent commits = less data at risk = lower throughput
mount -o commit=5 /dev/sda1 /mnt    # Default: 5 seconds
mount -o commit=1 /dev/sda1 /mnt    # Aggressive: 1 second
mount -o commit=60 /dev/sda1 /mnt   # Relaxed: 60 seconds

# Barrier behavior: ensures ordering for crash safety
mount -o barrier=1 /dev/sda1 /mnt   # Default: barriers enabled
mount -o barrier=0 /dev/sda1 /mnt   # Disable (DANGEROUS unless UPS)
mount -o nobarrier /dev/sda1 /mnt   # Alias for barrier=0

# Journal size: affects maximum transaction size and checkpoint frequency
mkfs.ext4 -J size=256 /dev/sda1     # 256 MB journal (at creation)
# Resize existing journal: remove it, then recreate it
# (filesystem must be unmounted and cleanly shut down)
tune2fs -O ^has_journal /dev/sda1
tune2fs -J size=256 /dev/sda1

# External journal: place journal on separate fast device
mkfs.ext4 -J device=/dev/sdb1 /dev/sda1   # Journal on sdb1

# Journal checksum (ext4): detect journal corruption
mount -o journal_checksum /dev/sda1 /mnt

# Priority of journal I/O
# Typically set via ionice for jbd2 kernel threads:
ionice -c 1 -n 0 -p $(pgrep jbd2)
Performance Recommendations by Workload:
| Workload | Journal Mode | Commit Interval | Journal Size | Barriers |
|---|---|---|---|---|
| General purpose | ordered | 5s (default) | 128 MB | Enabled |
| Database (safety) | journal | 1s | 256 MB | Enabled |
| Database (performance) | ordered | 5s | 256 MB | Enabled |
| Streaming/logs | writeback | 30s | 64 MB | Enabled |
| Scratch/temp | writeback | 60s | 32 MB | Optional |
| Build server | writeback | 30s | 128 MB | Optional |
External Journal Optimization:
For high-performance systems, placing the journal on a separate device eliminates seek contention:
# Create journal device (small SSD or NVMe)
mkfs.ext4 -O journal_dev /dev/nvme0n1p1
# Create main filesystem with external journal
mkfs.ext4 -J device=/dev/nvme0n1p1 /dev/sda1
# Journal operations go to fast NVMe
# Bulk data goes to high-capacity HDD
| Scenario | With Barriers | Without Barriers |
|---|---|---|
| Sequential writes | ~10-20% overhead | Maximum throughput |
| Random writes | ~20-30% overhead | Maximum throughput |
| Power failure | Full recovery | Potential corruption |
| Use case | Production (default) | UPS + write cache battery |
Disabling barriers allows disk write caches to reorder writes, breaking journal guarantees. Only disable if: (1) system has UPS, AND (2) disk has battery-backed write cache, AND (3) you've verified the cache is in write-through mode on power loss. Most consumer disks do NOT guarantee this.
Journaling transformed Linux filesystems from fragile to production-ready, enabling reliable crash recovery in seconds rather than hours.
data=journal (safest), data=ordered (default, balanced), data=writeback (fastest, least safe).
Next Up: Ext4 Extents
With journaling ensuring reliability, ext4's next major innovation tackled efficiency: extents. Rather than tracking files as lists of individual blocks, extents describe contiguous ranges, dramatically reducing metadata overhead for large files and improving both performance and scalability.
You now understand how ext3/ext4 journaling works—from write-ahead logging theory through implementation details and recovery mechanics. This knowledge is essential for understanding filesystem reliability, troubleshooting recovery issues, and optimizing journal performance for different workloads.