When file systems first adopted journaling, designers faced a fundamental question: What should we journal? Journaling everything—both data and metadata—provides the strongest consistency guarantees but doubles all writes. Journaling nothing returns us to the crash vulnerability of the past.

Metadata journaling emerged as the pragmatic middle ground. By journaling only the file system's structural information—inodes, directories, bitmaps, and block pointers—while writing data directly to its final location, file systems achieve rapid recovery and structural consistency with minimal performance overhead. This mode has become the default for most production file systems.
This page examines metadata journaling in depth. You'll understand what qualifies as metadata, why protecting it is sufficient for file system consistency, the data exposure window that results, and the ordering constraints that prevent corruption. You'll learn when metadata journaling is appropriate and when stronger modes are needed.
To understand metadata journaling, we must first precisely define what constitutes metadata versus data. The distinction is fundamental to understanding what protections are—and are not—provided.
Data is the actual contents of files—the bytes that applications write through write() system calls. Your document text, image pixels, database records, application binaries: these are all data.

Metadata is everything the file system needs to organize, locate, and manage files and directories. Metadata answers questions like:

- Where is this file's data stored on disk?
- How large is this file?
- Who owns it and what permissions apply?
- What files exist in this directory?
- Which blocks are free for allocation?
| Category | Examples | Changes When | Journaled in Metadata Mode |
|---|---|---|---|
| File Data | File contents, application bytes | write(), truncate() | No |
| Inode Metadata | File size, timestamps, block pointers, permissions | Any file operation | Yes |
| Directory Entries | File names, inode numbers, dirent structures | create, unlink, rename | Yes |
| Block Allocation | Bitmap, extent tree, block groups | Allocation/deallocation | Yes |
| Superblock | File system state, mount count, free counts | Volume state changes | Yes |
| Extended Attributes | xattrs, ACLs, SELinux labels | setxattr, setfacl | Yes |
The Critical Insight:

Metadata is what the file system uses to interpret the disk. Corrupt metadata means the file system cannot understand its own organization:

- A corrupted inode might point to wrong blocks → reading garbage data
- A corrupted directory means files cannot be found → data effectively lost
- A corrupted bitmap might double-allocate blocks → catastrophic data intermixing

Corrupt data, while certainly bad, has a bounded impact—one file's contents are wrong. The file system itself remains navigable; other files remain accessible.

This asymmetry justifies metadata journaling: protecting metadata ensures the file system structure survives crashes. Individual files may have incomplete updates, but the file system itself remains consistent and recoverable.
A common point of confusion: the mapping from file offset to disk block is metadata, not data. When you append to a file, the new data blocks are data, but the updated inode block pointers and extent entries are metadata. Metadata journaling protects these pointers, ensuring files don't point to garbage after a crash.
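One quick way to internalize the distinction: everything the stat() system call reports is metadata; a file's actual bytes are reachable only through read() or mmap(). A minimal sketch (pass any file path on the command line):

```c
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[]) {
    struct stat st;
    if (argc < 2 || stat(argv[1], &st) != 0) {
        perror("stat");
        return 1;
    }
    // Every field below lives in the inode: all metadata, no file contents
    printf("size:   %lld bytes\n", (long long)st.st_size);   // inode metadata
    printf("blocks: %lld\n",       (long long)st.st_blocks); // allocation metadata
    printf("mode:   %o\n",         st.st_mode & 0777);       // permission metadata
    printf("mtime:  %lld\n",       (long long)st.st_mtime);  // timestamp metadata
    return 0;
}
```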
Metadata journaling modifies the basic WAL protocol to handle data specially. The key insight is that data can be written directly to its final location—it doesn't need the journal's intermediate storage—but the timing of data writes relative to metadata writes must be carefully controlled.
The Protocol Steps:
```
Metadata Journaling Write Flow:

Application writes 8KB to file at offset 0:

Step 1: Buffer Modifications
  Memory:
  ├── Page cache: 2 new data blocks (8KB)
  ├── Inode cache: updated inode (size=8192, new block ptrs)
  └── Bitmap cache: 2 blocks marked allocated

Step 2: Write Data Blocks (direct to final location)
  Disk Request: Write blocks 50001-50002 (data)
  [Data now on disk at final location]

Step 3: Barrier (ensure data durable)
  Disk Request: FLUSH
  [All pending writes to media]

Step 4: Write Journal Transaction
  Journal:
  ├── Descriptor: "This transaction modifies blocks 8472, 8901"
  ├── Block 8472 contents: [Modified inode for file]
  └── Block 8901 contents: [Modified bitmap region]

Step 5: Commit
  Disk Request: Write commit + FUA
  [Transaction is now committed]

Step 6: Application gets fsync() success

Recovery After Crash at Any Point:
├── Crash before Step 3: Data gone, but inode unchanged (consistent)
├── Crash before Step 5: Data exists, metadata uncommitted (consistent)
├── Crash after Step 5: Data exists, metadata committed (consistent)
└── All cases: File either has old contents or new contents, never garbage
```

The Critical Ordering:

The barrier in Step 3 is essential. It ensures:

> Data reaches disk before metadata that references it.

Without this ordering, consider what could happen:

1. Metadata commits (points to block 50001)
2. Crash occurs before data write
3. On recovery: file now points to block 50001, which contains old/garbage data from a previous file

This is not just inconsistent—it's potentially catastrophic. The file appears valid but contains completely wrong data, possibly sensitive data from another file. The data-before-metadata ordering prevents this.
ext4 distinguishes between 'ordered' mode (data before metadata, described here) and 'writeback' mode (no ordering, metadata only journaled). Writeback is faster but can expose stale data after crash. Most production systems use ordered mode. We'll examine writeback mode later.
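To make the ordering concrete, here is an illustrative C sketch of the ordered-mode commit sequence from the flow above. The disk_* helpers are hypothetical stand-ins for block-layer requests, not a real kernel API; the only point is the order of operations:

```c
#include <stdio.h>

// Hypothetical stand-ins for block-layer operations; a real
// implementation would issue bios with flush/FUA flags set.
static void disk_write(const char *what)     { printf("WRITE     %s\n", what); }
static void disk_flush(void)                 { printf("FLUSH     (barrier)\n"); }
static void disk_write_fua(const char *what) { printf("WRITE+FUA %s\n", what); }

// Ordered-mode commit: data first, barrier, then journaled metadata, then commit.
static void ordered_mode_commit(void) {
    disk_write("data blocks 50001-50002");       // Step 2: data to final location
    disk_flush();                                 // Step 3: data durable first
    disk_write("journal: descriptor block");      // Step 4: which blocks follow
    disk_write("journal: inode + bitmap copies");
    disk_write_fua("journal: commit record");     // Step 5: transaction committed
    // Only now may fsync() return success to the application.
}

int main(void) { ordered_mode_commit(); return 0; }
```

Moving the flush after the journal writes would reintroduce exactly the metadata-points-to-garbage failure described above.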
Even with proper ordering, metadata journaling has a subtle security and consistency issue: the stale data exposure problem. Understanding this issue is crucial for applications that handle sensitive data.
The Scenario:

Consider allocating a new block for a file:

1. Block 50001 was previously used by another file (say, containing passwords)
2. Previous file was deleted, block marked free
3. Your new file allocates block 50001
4. You write new data to block 50001
5. Crash before the data write completes

Ordered mode guarantees that any data you actually wrote reaches disk before the metadata commit. But a block can be allocated (a metadata change) without the application ever having written to it, for example when extending a file or creating a sparse file. If the metadata commits while such a block still holds its previous contents, the old data becomes readable through the new file:
| Scenario | What Happens | Risk |
|---|---|---|
| Expand file, crash before write | Inode shows larger size, blocks contain old data | Privacy leak: old file contents exposed |
| Create sparse file | Blocks allocated but not written | Reading allocated holes returns stale data |
| Truncate file down | Blocks deallocated, may be reallocated | Next allocation might expose old data |
Mitigation Mechanisms:

1. Zero-on-allocate (Delayed Allocation)

Modern file systems often delay actual block allocation until data is written. This means:

- Blocks aren't allocated until data is ready
- New data overwrites old contents
- No window for stale exposure

ext4's delayed allocation provides this protection for most cases.

2. Zero fill

Some file systems zero newly allocated blocks before making them accessible:

- Overhead of writing zeros
- Guarantees no stale data exposure
- May be configurable or automatic

3. Application-level protection

Applications handling sensitive data should:

- Explicitly zero buffers before writing
- Use fallocate with FALLOC_FL_ZERO_RANGE
- Consider encrypted file systems (stale blocks are encrypted)
```c
// Safe file extension that prevents stale data exposure
#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

// Method 1: Use fallocate with zero range
void safe_extend_file(int fd, off_t new_size) {
    struct stat st;
    fstat(fd, &st);

    if (new_size > st.st_size) {
        // Explicitly zero the new region
        fallocate(fd, FALLOC_FL_ZERO_RANGE,
                  st.st_size, new_size - st.st_size);
    }
}

// Method 2: Explicit zeroing for sensitive data
void write_sensitive_data(int fd, const char *data, size_t len, off_t offset) {
    // First, zero the region we'll write to.
    // This ensures even partial writes don't expose stale data.
    char *zeros = calloc(1, len);
    pwrite(fd, zeros, len, offset);
    fsync(fd);  // Ensure zeros are durable
    free(zeros);

    // Now write actual data
    pwrite(fd, data, len, offset);
    fsync(fd);
}

// Method 3: Use O_TMPFILE for atomic creation.
// File not linked until complete, so there is no intermediate exposure.
int create_file_atomically(const char *dir, const char *filename,
                           const char *data, size_t len) {
    char linkpath[PATH_MAX];
    snprintf(linkpath, sizeof(linkpath), "%s/%s", dir, filename);

    // Create anonymous temp file
    int fd = open(dir, O_TMPFILE | O_RDWR, 0644);

    // Write all data
    write(fd, data, len);
    fsync(fd);

    // Now atomically link into directory
    char procpath[PATH_MAX];
    snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
    linkat(AT_FDCWD, procpath, AT_FDCWD, linkpath, AT_SYMLINK_FOLLOW);

    // fsync the directory so the new link itself is durable
    int dirfd = open(dir, O_RDONLY);
    fsync(dirfd);
    close(dirfd);

    return fd;
}
```

The stale data problem is why sensitive systems use multiple layers: encrypted file systems (stale data is encrypted), secure deletion (overwrite before deallocation), and memory encryption. Don't rely solely on file system behavior for sensitive data protection.
ext4 provides three journaling modes that illustrate the design space between consistency and performance. Understanding these modes deeply reveals the trade-offs inherent in journaling design.
| Mode | What's Journaled | Data Ordering | Safety | Performance |
|---|---|---|---|---|
| journal | Data + Metadata | All in journal | Highest | Lowest (2x writes) |
| ordered (default) | Metadata only | Data before metadata | High | Good |
| writeback | Metadata only | No ordering | Moderate | Highest |
Mode: journal (Full Data Journaling)

Both data and metadata are written to the journal before being written to final locations.

Advantages:
- Both structure and content are atomic
- After crash: files contain either old or new complete contents
- Strongest guarantees

Disadvantages:
- Every byte written twice (to journal, then final location)
- Journal must be large enough for data workload
- Significant performance overhead

Use cases:
- Databases that need atomic file updates
- Financial/audit logs requiring bullet-proof consistency
- Systems where correctness far outweighs performance
Mode: ordered (Metadata Journaling with Ordering)

Only metadata is journaled. Data is written directly to final locations but must reach disk before the metadata transaction commits.

Advantages:
- File system structure always consistent
- Data writes not doubled
- Good balance of safety and performance

Disadvantages:
- After crash: file may have partial new data (but valid old/new for each block)
- Ordering requirement adds some overhead
- Still vulnerable to application-level inconsistencies

Use cases:
- General-purpose workloads
- Systems where file structural integrity is primary concern
- Default for most production systems
Mode: writeback (Metadata Only, No Ordering)

Only metadata is journaled. Data can be written in any order relative to metadata—before, after, or interleaved.

Advantages:
- Maximum flexibility for write scheduling
- Best performance (no ordering barriers)
- File system structure still protected

Disadvantages:
- After crash: file may contain stale data from previous file
- Security risk: sensitive data exposure
- Application data consistency not guaranteed

Use cases:
- Scratch/temp file systems
- Cases where applications manage their own consistency
- Performance-critical systems with data redundancy elsewhere
```bash
# Check current journaling mode
cat /proc/mounts | grep ext4
# Output includes data=ordered or data=writeback or data=journal

# Mount with specific mode
mount -o data=ordered /dev/sda1 /mnt/data

# Change mode (requires remount)
mount -o remount,data=journal /mnt/data

# Set default mode in fstab
# /dev/sda1  /mnt/data  ext4  defaults,data=ordered  0 2

# Tune journal size (at filesystem creation)
mkfs.ext4 -J size=256 /dev/sda1   # 256MB journal

# Check journal status
dumpe2fs -h /dev/sda1 | grep -i journal
# Journal size:      128M
# Journal length:    32768
# Journal sequence:  0x00000547

# View journal contents (advanced debugging)
debugfs /dev/sda1
debugfs: logdump   # Dump journal transactions

# Alternative: use tune2fs to check features
tune2fs -l /dev/sda1 | grep -i has_journal
```

Never use writeback mode for file systems that may contain sensitive data. After a crash, newly created files may expose contents from previously deleted files. This has security implications in multi-user and multi-tenant environments.
Modern file systems like ext4 use delayed allocation (also called allocate-on-flush), which interacts importantly with journaling. Understanding this interaction is crucial for predicting file system behavior.
Traditional Allocation:

With traditional allocation, blocks are assigned immediately when data is written:

1. write(fd, data, 4096) called
2. Immediately: allocate block 50001
3. Immediately: update inode to point to block 50001
4. Eventually: write data to block 50001

This approach has fragmentation issues—if you write a file in pieces, blocks are allocated in the order of writes, not optimally for sequential reading.

Delayed Allocation:

With delayed allocation, block assignment is postponed:

1. write(fd, data, 4096) called
2. Data goes to page cache, marked dirty
3. No blocks allocated yet
4. Later (writeback time): examine total dirty data
5. Allocate contiguous blocks for all pending writes
6. Write data to allocated blocks
7. Update metadata
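You can often observe this postponement from user space. The probe below is a sketch; exact behavior varies by file system, kernel version, and writeback timing, but on a file system with delayed allocation st_blocks may remain 0 immediately after write() and only become nonzero once writeback forces allocation:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("delalloc_test", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096] = {1};
    if (write(fd, buf, sizeof(buf)) < 0) { perror("write"); return 1; }

    struct stat st;
    fstat(fd, &st);  // data is dirty in page cache; blocks may be unassigned
    printf("after write(): size=%lld blocks=%lld\n",
           (long long)st.st_size, (long long)st.st_blocks);

    fsync(fd);       // forces writeback, which triggers block allocation
    fstat(fd, &st);
    printf("after fsync(): size=%lld blocks=%lld\n",
           (long long)st.st_size, (long long)st.st_blocks);

    close(fd);
    return 0;
}
```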
The Data Loss Controversy:

Delayed allocation interacted problematically with some application patterns, leading to a significant compatibility issue in ext4's early days.

The problematic pattern:

```
1. Rewrite config file:
   fd = open("config", O_TRUNC | O_WRONLY)  // Truncate existing
   write(fd, new_config)                    // Write new content
   close(fd)                                // No explicit fsync

2. Crash occurs

3. Expected: Either old config or new config
   Actual: Zero-length file (old truncated, new not allocated yet)
```

With delayed allocation, the write's data might still be in page cache with no blocks allocated. A crash loses everything. Traditional allocation at least had blocks allocated, so something would survive.

The Fix:

ext4 now detects patterns like truncate+write and forces earlier allocation for such files. Additionally, applications that care about durability should always use fsync()—the pattern above was always subtly incorrect, just masked by traditional allocation.
Applications should never assume close() provides durability. It doesn't. If you need data to survive a crash, explicitly call fsync() before closing. Delayed allocation just made this long-standing requirement more visible.
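The standard fix for the truncate-and-rewrite pattern is to write a complete new file, fsync it, and rename it over the old one; rename is atomic, so a crash leaves either the old file or the fully written new one. A sketch, assuming the temp file can live in the same directory (replace_file and the dot-prefixed temp name are illustrative choices, not a standard API):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int replace_file(const char *dir, const char *name,
                 const char *data, size_t len) {
    char tmp[4096], final[4096];
    snprintf(tmp, sizeof(tmp), "%s/.%s.tmp", dir, name);
    snprintf(final, sizeof(final), "%s/%s", dir, name);

    // Write the complete new contents to a temp file and make them durable
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    // Atomically replace the old file with the new one
    if (rename(tmp, final) != 0) { unlink(tmp); return -1; }

    // fsync the directory so the rename itself is durable
    int dirfd = open(dir, O_RDONLY);
    if (dirfd >= 0) { fsync(dirfd); close(dirfd); }
    return 0;
}
```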
While we've focused on ext4, metadata journaling is used across many file systems. Each implementation has unique characteristics that reflect design priorities and legacy constraints.
| File System | Journal Name | Default Mode | Notable Features |
|---|---|---|---|
| ext4 (Linux) | JBD2 | Ordered | Three modes, checksums, external journal option |
| XFS (Linux) | Intent Log | Metadata only | Parallel logging, always ordered semantically |
| NTFS (Windows) | Transaction Log | Metadata only | Integrated, $LogFile special file |
| HFS+ (macOS legacy) | Journal | Metadata only | Volume Header Journal |
| APFS (macOS) | Copy-on-Write | N/A (CoW) | Uses CoW instead of traditional journal |
| JFS (IBM) | Aggregate Log | Metadata only | Group commit, log offloading |
| ReiserFS | Journal | Metadata only | Wandering logs, tail packing |
XFS Journaling:

XFS uses an "intent-based" journaling approach:

- Operations are logged as intents (e.g., "allocate extent from A to B")
- Actual work happens after intent is logged
- Recovery replays intents, not raw blocks

Advantages:
- Log entries are smaller (intents vs. full blocks)
- Operations are idempotent by design
- Better parallel logging

NTFS Journaling:

NTFS ($LogFile) uses more sophisticated logging:

- REDO + UNDO information
- True transaction support with rollback capability
- Transactional NTFS (TxF) for application use (deprecated)

NTFS can undo incomplete operations, not just redo them. This provides additional recovery options but adds complexity.
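To see why intent records are so compact, here is a simplified C rendering of what an extent-free intent might carry. The struct is illustrative only: field names and layout are simplified, not XFS's actual on-disk format (xfs_efi_log_format):

```c
#include <stdint.h>

// Illustrative intent record, loosely modeled on an XFS EFI.
// A few dozen bytes describe the whole operation, versus full
// 4KB copies of every modified metadata block in ext4-style logging.
struct intent_extent_free {
    uint16_t type;         // record type: "extent free intent"
    uint64_t intent_id;    // matched by a later "done" (EFD) record
    uint32_t ag_number;    // allocation group containing the extent
    uint64_t start_block;  // first block of the extent
    uint32_t block_count;  // length of the extent in blocks
};
```

Recovery scans for intents with no matching done record and re-executes them; because "free blocks X through Y" is idempotent, replaying it twice is harmless.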
```
XFS Intent Logging Example:

Operation: Allocate 100 blocks starting at block 50000

Traditional (ext4 style):
  Log: [Full inode block], [Full bitmap block], [Full extent tree block]
  Size: 3 × 4KB = 12KB

XFS Intent-Based:
  Log: "EFI: extent free intent, ag=5, block=50000, len=100"
  Size: ~100 bytes

Recovery Process:
1. Find incomplete EFI (Extent Free Intent)
2. Execute: allocate blocks 50000-50099
3. Done - operation is idempotent

The intent is a high-level description of what to do,
not a low-level copy of modified blocks.

XFS Log Structure:
┌─────────────────────────────────────────────────────────────┐
│ Log Sequence Number (LSN): 1234567                          │
├─────────────────────────────────────────────────────────────┤
│ Transaction Header (TID: 8847)                              │
├─────────────────────────────────────────────────────────────┤
│ Intent Item: EFI (extent free intent)                       │
│   AG number: 5                                              │
│   Block offset: 50000                                       │
│   Block count: 100                                          │
├─────────────────────────────────────────────────────────────┤
│ Done Item: EFD (extent free done)                           │
│   Links to EFI above                                        │
├─────────────────────────────────────────────────────────────┤
│ Commit Record                                               │
└─────────────────────────────────────────────────────────────┘
```

APFS (Apple), ZFS (Oracle/OpenZFS), and Btrfs (Linux) use copy-on-write instead of journaling. They never modify data in place, so there's no need for journaling's redo capability. The file system is always consistent because old data is preserved until the new data is complete. This is a different paradigm we'll explore elsewhere.
Metadata journaling's performance impact varies significantly by workload. Understanding these patterns helps you choose the right mode and optimize system configuration.
Write Amplification Analysis:

Write amplification differs by journaling mode:
| Workload | No Journal | Ordered Mode | Full Journal |
|---|---|---|---|
| Large file write (1GB) | 1.0x | 1.0x (+ tiny metadata) | 2.0x |
| Many small files | 1.0x | ~1.1x (metadata overhead) | ~2.0x |
| Metadata-heavy (mkdir/create) | 1.0x | 2.0x (for metadata) | 2.0x |
| Random small writes | 1.0x | ~1.05x | 2.0x |
| Database transaction | 1.0x | ~1.1x | 2.0x (expected) |
Fsync Performance:

Fsync behavior is where journaling mode matters most:
```
fsync() Latency Breakdown:

Scenario: Application writes 4KB, calls fsync()

========= Ordered Mode (typical) =========
1. Write data to final location:    ~0.5ms (HDD) / 0.1ms (SSD)
2. Barrier (ensure data durable):   ~8ms (HDD)   / 0.1ms (SSD)
3. Write journal metadata (~8KB):   ~0.5ms (HDD) / 0.1ms (SSD)
4. Write commit + barrier:          ~8ms (HDD)   / 0.1ms (SSD)
──────────────────────────────────────────────────────────────
Total:                              ~17ms (HDD)  / ~0.4ms (SSD)

========= Full Journal Mode =========
1. Write data to journal:           ~0.5ms (HDD) / 0.1ms (SSD)
2. Write metadata to journal:       ~0.5ms (HDD) / 0.1ms (SSD)
3. Write commit + barrier:          ~8ms (HDD)   / 0.1ms (SSD)
4. (Later) Write data to final:     async, doesn't affect fsync
5. (Later) Write metadata to final: async
──────────────────────────────────────────────────────────────
Total for fsync:                    ~9ms (HDD)   / ~0.3ms (SSD)

========= Writeback Mode =========
1. Write journal metadata (~8KB):   ~0.5ms (HDD) / 0.1ms (SSD)
2. Write commit + barrier:          ~8ms (HDD)   / 0.1ms (SSD)
3. Data written without ordering:   async, may not be durable!
──────────────────────────────────────────────────────────────
Total for metadata safety:          ~8.5ms (HDD) / ~0.2ms (SSD)

WARNING: Data may not be durable after fsync in this mode!
```

Key Performance Observations:

1. HDD barrier dominance: On HDDs, the barrier cost (~8ms for disk rotation) dominates. Reducing barrier count matters more than reducing write count.

2. SSD barrier efficiency: SSDs handle barriers much faster (~0.1ms), making journaling overhead proportionally smaller.

3. Batching critical: Combining multiple operations into one transaction dramatically improves throughput by sharing barrier cost.

4. Sequential log writes: Journal writes are sequential, which is efficient on both HDDs (no seeks) and SSDs (better wear leveling).

5. Full journal paradox: For pure fsync latency, full journaling can be faster because data write doesn't need a separate barrier—everything goes to journal. But total write volume increases.
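To see these numbers on your own hardware, a minimal fsync-latency probe is easy to write. The sketch below creates a testfile in the current directory (an illustrative choice); absolute numbers will vary with the device and the mounted journal mode, so compare runs across data= settings rather than trusting any single figure:

```c
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096] = {0};
    for (int i = 0; i < 10; i++) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        pwrite(fd, buf, sizeof(buf), 0);
        fsync(fd);  // forces the journal commit and its barrier(s)
        clock_gettime(CLOCK_MONOTONIC, &b);

        double ms = (b.tv_sec - a.tv_sec) * 1e3 +
                    (b.tv_nsec - a.tv_nsec) / 1e6;
        printf("fsync %d: %.3f ms\n", i, ms);
    }
    close(fd);
    return 0;
}
```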
Real-world performance depends heavily on workload patterns. Synthetic benchmarks often don't reflect production behavior. Test with realistic workloads before choosing journal modes or sizing. Tools like fio can simulate various access patterns.
We've thoroughly explored metadata journaling, the dominant approach in production file systems.
What's Next:

We've seen metadata journaling; now we'll examine full data journaling mode, where both data and metadata are journaled. This mode provides the strongest guarantees at significant performance cost, and is essential for certain specialized workloads.
You now understand the most common journaling mode used in production systems. This knowledge helps you configure file systems appropriately, understand recovery behavior, and design applications that correctly interact with file system durability guarantees.