When file systems first adopted journaling, designers faced a fundamental question: What should we journal? Journaling everything—both data and metadata—provides the strongest consistency guarantees but doubles all writes. Journaling nothing returns us to the crash vulnerability of the past.

Metadata journaling emerged as the pragmatic middle ground. By journaling only the file system's structural information—inodes, directories, bitmaps, and block pointers—while writing data directly to its final location, file systems achieve rapid recovery and structural consistency with minimal performance overhead. This mode has become the default for most production file systems.
This page examines metadata journaling in depth. You'll understand what qualifies as metadata, why protecting it is sufficient for file system consistency, the data exposure window that results, and the ordering constraints that prevent corruption. You'll learn when metadata journaling is appropriate and when stronger modes are needed.
To understand metadata journaling, we must first precisely define what constitutes metadata versus data. The distinction is fundamental to understanding what protections are—and are not—provided.
Data is the actual contents of files—the bytes that applications write through write() system calls. Your document text, image pixels, database records, application binaries: these are all data.

Metadata is everything the file system needs to organize, locate, and manage files and directories. Metadata answers questions like:

- Where is this file's data stored on disk?
- How large is this file?
- Who owns it and what permissions apply?
- What files exist in this directory?
- Which blocks are free for allocation?
| Category | Examples | Changes When | Journaled in Metadata Mode |
|---|---|---|---|
| File Data | File contents, application bytes | write(), truncate() | No |
| Inode Metadata | File size, timestamps, block pointers, permissions | Any file operation | Yes |
| Directory Entries | File names, inode numbers, dirent structures | create, unlink, rename | Yes |
| Block Allocation | Bitmap, extent tree, block groups | Allocation/deallocation | Yes |
| Superblock | File system state, mount count, free counts | Volume state changes | Yes |
| Extended Attributes | xattrs, ACLs, SELinux labels | setxattr, setfacl | Yes |
The Critical Insight:

Metadata is what the file system uses to interpret the disk. Corrupt metadata means the file system cannot understand its own organization:

- A corrupted inode might point to wrong blocks → reading garbage data
- A corrupted directory means files cannot be found → data effectively lost
- A corrupted bitmap might double-allocate blocks → catastrophic data intermixing

Corrupt data, while certainly bad, has a bounded impact—one file's contents are wrong. The file system itself remains navigable; other files remain accessible.

This asymmetry justifies metadata journaling: protecting metadata ensures the file system structure survives crashes. Individual files may have incomplete updates, but the file system itself remains consistent and recoverable.
A common point of confusion: the mapping from file offset to disk block is metadata, not data. When you append to a file, the new data blocks are data, but the updated inode block pointers and extent entries are metadata. Metadata journaling protects these pointers, ensuring files don't point to garbage after a crash.
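One quick way to internalize the distinction: everything the stat() system call reports is metadata; a file's actual bytes are reachable only through read() or mmap(). A minimal sketch (pass any file path on the command line):

```c
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[]) {
    struct stat st;
    if (argc < 2 || stat(argv[1], &st) != 0) {
        perror("stat");
        return 1;
    }
    // Every field below lives in the inode: all metadata, no file contents
    printf("size:   %lld bytes\n", (long long)st.st_size);   // inode metadata
    printf("blocks: %lld\n",       (long long)st.st_blocks); // allocation metadata
    printf("mode:   %o\n",         st.st_mode & 0777);       // permission metadata
    printf("mtime:  %lld\n",       (long long)st.st_mtime);  // timestamp metadata
    return 0;
}
```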
Metadata journaling modifies the basic WAL protocol to handle data specially. The key insight is that data can be written directly to its final location—it doesn't need the journal's intermediate storage—but the timing of data writes relative to metadata writes must be carefully controlled.
The Protocol Steps:
```
Metadata Journaling Write Flow:

Application writes 8KB to file at offset 0:

Step 1: Buffer Modifications
  Memory:
  ├── Page cache: 2 new data blocks (8KB)
  ├── Inode cache: updated inode (size=8192, new block ptrs)
  └── Bitmap cache: 2 blocks marked allocated

Step 2: Write Data Blocks (direct to final location)
  Disk Request: Write blocks 50001-50002 (data)
  [Data now on disk at final location]

Step 3: Barrier (ensure data durable)
  Disk Request: FLUSH
  [All pending writes to media]

Step 4: Write Journal Transaction
  Journal:
  ├── Descriptor: "This transaction modifies blocks 8472, 8901"
  ├── Block 8472 contents: [Modified inode for file]
  └── Block 8901 contents: [Modified bitmap region]

Step 5: Commit
  Disk Request: Write commit + FUA
  [Transaction is now committed]

Step 6: Application gets fsync() success

Recovery After Crash at Any Point:
├── Crash before Step 3: Data gone, but inode unchanged (consistent)
├── Crash before Step 5: Data exists, metadata uncommitted (consistent)
├── Crash after Step 5: Data exists, metadata committed (consistent)
└── All cases: File either has old contents or new contents, never garbage
```

The Critical Ordering:

The barrier in Step 3 is essential. It ensures:

> Data reaches disk before metadata that references it.

Without this ordering, consider what could happen:

1. Metadata commits (points to block 50001)
2. Crash occurs before data write
3. On recovery: file now points to block 50001, which contains old/garbage data from a previous file

This is not just inconsistent—it's potentially catastrophic. The file appears valid but contains completely wrong data, possibly sensitive data from another file. The data-before-metadata ordering prevents this.
ext4 distinguishes between 'ordered' mode (data before metadata, described here) and 'writeback' mode (no ordering, metadata only journaled). Writeback is faster but can expose stale data after crash. Most production systems use ordered mode. We'll examine writeback mode later.
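To make the ordering concrete, here is an illustrative C sketch of the ordered-mode commit sequence from the flow above. The disk_* helpers are hypothetical stand-ins for block-layer requests, not a real kernel API; the only point is the order of operations:

```c
#include <stdio.h>

// Hypothetical stand-ins for block-layer operations; a real
// implementation would issue bios with flush/FUA flags set.
static void disk_write(const char *what)     { printf("WRITE     %s\n", what); }
static void disk_flush(void)                 { printf("FLUSH     (barrier)\n"); }
static void disk_write_fua(const char *what) { printf("WRITE+FUA %s\n", what); }

// Ordered-mode commit: data first, barrier, then journaled metadata, then commit.
static void ordered_mode_commit(void) {
    disk_write("data blocks 50001-50002");       // Step 2: data to final location
    disk_flush();                                 // Step 3: data durable first
    disk_write("journal: descriptor block");      // Step 4: which blocks follow
    disk_write("journal: inode + bitmap copies");
    disk_write_fua("journal: commit record");     // Step 5: transaction committed
    // Only now may fsync() return success to the application.
}

int main(void) { ordered_mode_commit(); return 0; }
```

Moving the flush after the journal writes would reintroduce exactly the metadata-points-to-garbage failure described above.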
Even with proper ordering, metadata journaling has a subtle security and consistency issue: the stale data exposure problem. Understanding this issue is crucial for applications that handle sensitive data.
The Scenario:

Consider allocating a new block for a file:

1. Block 50001 was previously used by another file (say, containing passwords)
2. Previous file was deleted, block marked free
3. Your new file allocates block 50001
4. You write new data to block 50001
5. Crash before the data write completes

Ordered mode guarantees that any data you actually wrote reaches disk before the metadata commit. But a block can be allocated (a metadata change) without the application ever having written to it, for example when extending a file or creating a sparse file. If the metadata commits while such a block still holds its previous contents, the old data becomes readable through the new file:
| Scenario | What Happens | Risk |
|---|---|---|
| Expand file, crash before write | Inode shows larger size, blocks contain old data | Privacy leak: old file contents exposed |
| Create sparse file | Blocks allocated but not written | Reading allocated holes returns stale data |
| Truncate file down | Blocks deallocated, may be reallocated | Next allocation might expose old data |
Mitigation Mechanisms:

1. Zero-on-allocate (Delayed Allocation)

Modern file systems often delay actual block allocation until data is written. This means:

- Blocks aren't allocated until data is ready
- New data overwrites old contents
- No window for stale exposure

ext4's delayed allocation provides this protection for most cases.

2. Zero fill

Some file systems zero newly allocated blocks before making them accessible:

- Overhead of writing zeros
- Guarantees no stale data exposure
- May be configurable or automatic

3. Application-level protection

Applications handling sensitive data should:

- Explicitly zero buffers before writing
- Use fallocate with FALLOC_FL_ZERO_RANGE
- Consider encrypted file systems (stale blocks are encrypted)
```c
// Safe file extension that prevents stale data exposure
#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

// Method 1: Use fallocate with zero range
void safe_extend_file(int fd, off_t new_size) {
    struct stat st;
    fstat(fd, &st);

    if (new_size > st.st_size) {
        // Explicitly zero the new region
        fallocate(fd, FALLOC_FL_ZERO_RANGE,
                  st.st_size, new_size - st.st_size);
    }
}

// Method 2: Explicit zeroing for sensitive data
void write_sensitive_data(int fd, const char *data, size_t len, off_t offset) {
    // First, zero the region we'll write to.
    // This ensures even partial writes don't expose stale data.
    char *zeros = calloc(1, len);
    pwrite(fd, zeros, len, offset);
    fsync(fd);  // Ensure zeros are durable
    free(zeros);

    // Now write actual data
    pwrite(fd, data, len, offset);
    fsync(fd);
}

// Method 3: Use O_TMPFILE for atomic creation.
// File not linked until complete, so there is no intermediate exposure.
int create_file_atomically(const char *dir, const char *filename,
                           const char *data, size_t len) {
    char linkpath[PATH_MAX];
    snprintf(linkpath, sizeof(linkpath), "%s/%s", dir, filename);

    // Create anonymous temp file
    int fd = open(dir, O_TMPFILE | O_RDWR, 0644);

    // Write all data
    write(fd, data, len);
    fsync(fd);

    // Now atomically link into directory
    char procpath[PATH_MAX];
    snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
    linkat(AT_FDCWD, procpath, AT_FDCWD, linkpath, AT_SYMLINK_FOLLOW);

    // fsync the directory so the new link itself is durable
    int dirfd = open(dir, O_RDONLY);
    fsync(dirfd);
    close(dirfd);

    return fd;
}
```

The stale data problem is why sensitive systems use multiple layers: encrypted file systems (stale data is encrypted), secure deletion (overwrite before deallocation), and memory encryption. Don't rely solely on file system behavior for sensitive data protection.
ext4 provides three journaling modes that illustrate the design space between consistency and performance. Understanding these modes deeply reveals the trade-offs inherent in journaling design.
| Mode | What's Journaled | Data Ordering | Safety | Performance |
|---|---|---|---|---|
| journal | Data + Metadata | All in journal | Highest | Lowest (2x writes) |
| ordered (default) | Metadata only | Data before metadata | High | Good |
| writeback | Metadata only | No ordering | Moderate | Highest |
Mode: journal (Full Data Journaling)

Both data and metadata are written to the journal before being written to final locations.

Advantages:
- Both structure and content are atomic
- After crash: files contain either old or new complete contents
- Strongest guarantees

Disadvantages:
- Every byte written twice (to journal, then final location)
- Journal must be large enough for data workload
- Significant performance overhead

Use cases:
- Databases that need atomic file updates
- Financial/audit logs requiring bullet-proof consistency
- Systems where correctness far outweighs performance
Mode: ordered (Metadata Journaling with Ordering)

Only metadata is journaled. Data is written directly to final locations but must reach disk before the metadata transaction commits.

Advantages:
- File system structure always consistent
- Data writes not doubled
- Good balance of safety and performance

Disadvantages:
- After crash: file may have partial new data (but valid old/new for each block)
- Ordering requirement adds some overhead
- Still vulnerable to application-level inconsistencies

Use cases:
- General-purpose workloads
- Systems where file structural integrity is primary concern
- Default for most production systems
Mode: writeback (Metadata Only, No Ordering)

Only metadata is journaled. Data can be written in any order relative to metadata—before, after, or interleaved.

Advantages:
- Maximum flexibility for write scheduling
- Best performance (no ordering barriers)
- File system structure still protected

Disadvantages:
- After crash: file may contain stale data from previous file
- Security risk: sensitive data exposure
- Application data consistency not guaranteed

Use cases:
- Scratch/temp file systems
- Cases where applications manage their own consistency
- Performance-critical systems with data redundancy elsewhere
```bash
# Check current journaling mode
cat /proc/mounts | grep ext4
# Output includes data=ordered or data=writeback or data=journal

# Mount with specific mode
mount -o data=ordered /dev/sda1 /mnt/data

# Change mode (requires remount)
mount -o remount,data=journal /mnt/data

# Set default mode in fstab
# /dev/sda1  /mnt/data  ext4  defaults,data=ordered  0 2

# Tune journal size (at filesystem creation)
mkfs.ext4 -J size=256 /dev/sda1   # 256MB journal

# Check journal status
dumpe2fs -h /dev/sda1 | grep -i journal
# Journal size:      128M
# Journal length:    32768
# Journal sequence:  0x00000547

# View journal contents (advanced debugging)
debugfs /dev/sda1
debugfs: logdump   # Dump journal transactions

# Alternative: use tune2fs to check features
tune2fs -l /dev/sda1 | grep -i has_journal
```

Never use writeback mode for file systems that may contain sensitive data. After a crash, newly created files may expose contents from previously deleted files. This has security implications in multi-user and multi-tenant environments.
Modern file systems like ext4 use delayed allocation (also called allocate-on-flush), which interacts importantly with journaling. Understanding this interaction is crucial for predicting file system behavior.
Traditional Allocation:

With traditional allocation, blocks are assigned immediately when data is written:

1. write(fd, data, 4096) called
2. Immediately: allocate block 50001
3. Immediately: update inode to point to block 50001
4. Eventually: write data to block 50001

This approach has fragmentation issues—if you write a file in pieces, blocks are allocated in the order of writes, not optimally for sequential reading.

Delayed Allocation:

With delayed allocation, block assignment is postponed:

1. write(fd, data, 4096) called
2. Data goes to page cache, marked dirty
3. No blocks allocated yet
4. Later (writeback time): examine total dirty data
5. Allocate contiguous blocks for all pending writes
6. Write data to allocated blocks
7. Update metadata
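You can often observe this postponement from user space. The probe below is a sketch; exact behavior varies by file system, kernel version, and writeback timing, but on a file system with delayed allocation st_blocks may remain 0 immediately after write() and only become nonzero once writeback forces allocation:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("delalloc_test", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096] = {1};
    if (write(fd, buf, sizeof(buf)) < 0) { perror("write"); return 1; }

    struct stat st;
    fstat(fd, &st);  // data is dirty in page cache; blocks may be unassigned
    printf("after write(): size=%lld blocks=%lld\n",
           (long long)st.st_size, (long long)st.st_blocks);

    fsync(fd);       // forces writeback, which triggers block allocation
    fstat(fd, &st);
    printf("after fsync(): size=%lld blocks=%lld\n",
           (long long)st.st_size, (long long)st.st_blocks);

    close(fd);
    return 0;
}
```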
The Data Loss Controversy:

Delayed allocation interacted problematically with some application patterns, leading to a significant compatibility issue in ext4's early days.

The problematic pattern:

```
1. Rewrite config file:
   fd = open("config", O_TRUNC | O_WRONLY)  // Truncate existing
   write(fd, new_config)                    // Write new content
   close(fd)                                // No explicit fsync

2. Crash occurs

3. Expected: Either old config or new config
   Actual: Zero-length file (old truncated, new not allocated yet)
```

With delayed allocation, the write's data might still be in page cache with no blocks allocated. A crash loses everything. Traditional allocation at least had blocks allocated, so something would survive.

The Fix:

ext4 now detects patterns like truncate+write and forces earlier allocation for such files. Additionally, applications that care about durability should always use fsync()—the pattern above was always subtly incorrect, just masked by traditional allocation.
Applications should never assume close() provides durability. It doesn't. If you need data to survive a crash, explicitly call fsync() before closing. Delayed allocation just made this long-standing requirement more visible.
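The standard fix for the truncate-and-rewrite pattern is to write a complete new file, fsync it, and rename it over the old one; rename is atomic, so a crash leaves either the old file or the fully written new one. A sketch, assuming the temp file can live in the same directory (replace_file and the dot-prefixed temp name are illustrative choices, not a standard API):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int replace_file(const char *dir, const char *name,
                 const char *data, size_t len) {
    char tmp[4096], final[4096];
    snprintf(tmp, sizeof(tmp), "%s/.%s.tmp", dir, name);
    snprintf(final, sizeof(final), "%s/%s", dir, name);

    // Write the complete new contents to a temp file and make them durable
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    // Atomically replace the old file with the new one
    if (rename(tmp, final) != 0) { unlink(tmp); return -1; }

    // fsync the directory so the rename itself is durable
    int dirfd = open(dir, O_RDONLY);
    if (dirfd >= 0) { fsync(dirfd); close(dirfd); }
    return 0;
}
```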
While we've focused on ext4, metadata journaling is used across many file systems. Each implementation has unique characteristics that reflect design priorities and legacy constraints.
| File System | Journal Name | Default Mode | Notable Features |
|---|---|---|---|
| ext4 (Linux) | JBD2 | Ordered | Three modes, checksums, external journal option |
| XFS (Linux) | Intent Log | Metadata only | Parallel logging, always ordered semantically |
| NTFS (Windows) | Transaction Log | Metadata only | Integrated, $LogFile special file |
| HFS+ (macOS legacy) | Journal | Metadata only | Volume Header Journal |
| APFS (macOS) | Copy-on-Write | N/A (CoW) | Uses CoW instead of traditional journal |
| JFS (IBM) | Aggregate Log | Metadata only | Group commit, log offloading |
| ReiserFS | Journal | Metadata only | Wandering logs, tail packing |
XFS Journaling:

XFS uses an "intent-based" journaling approach:

- Operations are logged as intents (e.g., "allocate extent from A to B")
- Actual work happens after intent is logged
- Recovery replays intents, not raw blocks

Advantages:
- Log entries are smaller (intents vs. full blocks)
- Operations are idempotent by design
- Better parallel logging

NTFS Journaling:

NTFS ($LogFile) uses more sophisticated logging:

- REDO + UNDO information
- True transaction support with rollback capability
- Transactional NTFS (TxF) for application use (deprecated)

NTFS can undo incomplete operations, not just redo them. This provides additional recovery options but adds complexity.
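To see why intent records are so compact, here is a simplified C rendering of what an extent-free intent might carry. The struct is illustrative only: field names and layout are simplified, not XFS's actual on-disk format (xfs_efi_log_format):

```c
#include <stdint.h>

// Illustrative intent record, loosely modeled on an XFS EFI.
// A few dozen bytes describe the whole operation, versus full
// 4KB copies of every modified metadata block in ext4-style logging.
struct intent_extent_free {
    uint16_t type;         // record type: "extent free intent"
    uint64_t intent_id;    // matched by a later "done" (EFD) record
    uint32_t ag_number;    // allocation group containing the extent
    uint64_t start_block;  // first block of the extent
    uint32_t block_count;  // length of the extent in blocks
};
```

Recovery scans for intents with no matching done record and re-executes them; because "free blocks X through Y" is idempotent, replaying it twice is harmless.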
```
XFS Intent Logging Example:

Operation: Allocate 100 blocks starting at block 50000

Traditional (ext4 style):
  Log: [Full inode block], [Full bitmap block], [Full extent tree block]
  Size: 3 × 4KB = 12KB

XFS Intent-Based:
  Log: "EFI: extent free intent, ag=5, block=50000, len=100"
  Size: ~100 bytes

Recovery Process:
1. Find incomplete EFI (Extent Free Intent)
2. Execute: allocate blocks 50000-50099
3. Done - operation is idempotent

The intent is a high-level description of what to do,
not a low-level copy of modified blocks.

XFS Log Structure:
┌─────────────────────────────────────────────────────────────┐
│ Log Sequence Number (LSN): 1234567                          │
├─────────────────────────────────────────────────────────────┤
│ Transaction Header (TID: 8847)                              │
├─────────────────────────────────────────────────────────────┤
│ Intent Item: EFI (extent free intent)                       │
│   AG number: 5                                              │
│   Block offset: 50000                                       │
│   Block count: 100                                          │
├─────────────────────────────────────────────────────────────┤
│ Done Item: EFD (extent free done)                           │
│   Links to EFI above                                        │
├─────────────────────────────────────────────────────────────┤
│ Commit Record                                               │
└─────────────────────────────────────────────────────────────┘
```

APFS (Apple), ZFS (Oracle/OpenZFS), and Btrfs (Linux) use copy-on-write instead of journaling. They never modify data in place, so there's no need for journaling's redo capability. The file system is always consistent because old data is preserved until the new data is complete. This is a different paradigm we'll explore elsewhere.
Metadata journaling's performance impact varies significantly by workload. Understanding these patterns helps you choose the right mode and optimize system configuration.
Write Amplification Analysis:

Write amplification differs by journaling mode:
| Workload | No Journal | Ordered Mode | Full Journal |
|---|---|---|---|
| Large file write (1GB) | 1.0x | 1.0x (+ tiny metadata) | 2.0x |
| Many small files | 1.0x | ~1.1x (metadata overhead) | ~2.0x |
| Metadata-heavy (mkdir/create) | 1.0x | 2.0x (for metadata) | 2.0x |
| Random small writes | 1.0x | ~1.05x | 2.0x |
| Database transaction | 1.0x | ~1.1x | 2.0x (expected) |
Fsync Performance:

Fsync behavior is where journaling mode matters most:
```
fsync() Latency Breakdown:

Scenario: Application writes 4KB, calls fsync()

========= Ordered Mode (typical) =========
1. Write data to final location:    ~0.5ms (HDD) / 0.1ms (SSD)
2. Barrier (ensure data durable):   ~8ms (HDD)   / 0.1ms (SSD)
3. Write journal metadata (~8KB):   ~0.5ms (HDD) / 0.1ms (SSD)
4. Write commit + barrier:          ~8ms (HDD)   / 0.1ms (SSD)
──────────────────────────────────────────────────────────────
Total:                              ~17ms (HDD)  / ~0.4ms (SSD)

========= Full Journal Mode =========
1. Write data to journal:           ~0.5ms (HDD) / 0.1ms (SSD)
2. Write metadata to journal:       ~0.5ms (HDD) / 0.1ms (SSD)
3. Write commit + barrier:          ~8ms (HDD)   / 0.1ms (SSD)
4. (Later) Write data to final:     async, doesn't affect fsync
5. (Later) Write metadata to final: async
──────────────────────────────────────────────────────────────
Total for fsync:                    ~9ms (HDD)   / ~0.3ms (SSD)

========= Writeback Mode =========
1. Write journal metadata (~8KB):   ~0.5ms (HDD) / 0.1ms (SSD)
2. Write commit + barrier:          ~8ms (HDD)   / 0.1ms (SSD)
3. Data written without ordering:   async, may not be durable!
──────────────────────────────────────────────────────────────
Total for metadata safety:          ~8.5ms (HDD) / ~0.2ms (SSD)

WARNING: Data may not be durable after fsync in this mode!
```

Key Performance Observations:

1. HDD barrier dominance: On HDDs, the barrier cost (~8ms for disk rotation) dominates. Reducing barrier count matters more than reducing write count.

2. SSD barrier efficiency: SSDs handle barriers much faster (~0.1ms), making journaling overhead proportionally smaller.

3. Batching critical: Combining multiple operations into one transaction dramatically improves throughput by sharing barrier cost.

4. Sequential log writes: Journal writes are sequential, which is efficient on both HDDs (no seeks) and SSDs (better wear leveling).

5. Full journal paradox: For pure fsync latency, full journaling can be faster because data write doesn't need a separate barrier—everything goes to journal. But total write volume increases.
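To see these numbers on your own hardware, a minimal fsync-latency probe is easy to write. The sketch below creates a testfile in the current directory (an illustrative choice); absolute numbers will vary with the device and the mounted journal mode, so compare runs across data= settings rather than trusting any single figure:

```c
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096] = {0};
    for (int i = 0; i < 10; i++) {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        pwrite(fd, buf, sizeof(buf), 0);
        fsync(fd);  // forces the journal commit and its barrier(s)
        clock_gettime(CLOCK_MONOTONIC, &b);

        double ms = (b.tv_sec - a.tv_sec) * 1e3 +
                    (b.tv_nsec - a.tv_nsec) / 1e6;
        printf("fsync %d: %.3f ms\n", i, ms);
    }
    close(fd);
    return 0;
}
```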
Real-world performance depends heavily on workload patterns. Synthetic benchmarks often don't reflect production behavior. Test with realistic workloads before choosing journal modes or sizing. Tools like fio can simulate various access patterns.
We've thoroughly explored metadata journaling, the dominant approach in production file systems.
What's Next:

We've seen metadata journaling; now we'll examine full data journaling mode, where both data and metadata are journaled. This mode provides the strongest guarantees at significant performance cost, and is essential for certain specialized workloads.
You now understand the most common journaling mode used in production systems. This knowledge helps you configure file systems appropriately, understand recovery behavior, and design applications that correctly interact with file system durability guarantees.