Metadata journaling protects the file system's structure, but what about the data itself? When a crash occurs mid-write, metadata journaling ensures the file system remains navigable—but the file's contents may be inconsistent: partially new, partially old, perhaps containing garbage from a failed write.

For some workloads—databases, financial systems, critical logs—this is unacceptable. These systems require full data journaling, where both data and metadata are written to the journal before being applied to their final locations. This mode guarantees that a file contains either its complete old contents or its complete new contents, never a corrupted intermediate state.
This page examines full data journaling in depth. You'll understand when it's necessary, how it achieves atomic file updates, the significant performance costs involved, and the engineering techniques that make it practical for demanding workloads. You'll learn when full journaling is the right choice and when alternatives might be better.
To appreciate full journaling, let's first understand the failures that metadata journaling cannot prevent. These scenarios illustrate why some applications need stronger guarantees.
Scenario 1: Database Page Corruption

A database stores structured data in fixed-size pages (typically 4KB, 8KB, or 16KB). When updating a record:

1. Read the page from disk
2. Modify the record within the page
3. Write the entire page back

With metadata journaling, if a crash occurs during step 3:

- The first 2KB of the page has new data
- The last 2KB has old data
- The page checksum fails → the page is corrupted
- The database cannot use this page

The file system preserved its structure correctly, but the application's data structure is destroyed.
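The torn-page failure above can be sketched in C. This is a toy model, not real database code: the `db_page` layout, `page_seal`, and the byte-sum checksum are invented here for illustration (real engines use CRC32C or similar). A crash mid-write leaves half-new, half-old bytes, and the stored checksum no longer matches:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Hypothetical page layout: payload plus a trailing checksum. */
struct db_page {
    uint8_t  payload[PAGE_SIZE - sizeof(uint32_t)];
    uint32_t checksum;
};

/* Byte-sum checksum for illustration only - real databases use CRC32C. */
uint32_t page_checksum(const struct db_page *p)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < sizeof(p->payload); i++)
        sum += p->payload[i];
    return sum;
}

void page_seal(struct db_page *p)           { p->checksum = page_checksum(p); }
int  page_is_valid(const struct db_page *p) { return p->checksum == page_checksum(p); }
```

Sealed pages verify; a page whose write was torn in the middle (first half new, second half old) fails verification, which is exactly the state metadata journaling can leave behind.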
| Scenario | Metadata Journaling Result | Full Journaling Result |
|---|---|---|
| Crash during 8KB write | Partial write (4KB new, 4KB old) | Either all old or all new |
| Crash while updating two related records | One updated, one not | Both updated or neither |
| Config file rewrite | Partially new content | Complete old or complete new |
| Log append | Truncated entry | Missing entry or complete entry |
| Database checkpoint | Inconsistent state possible | Consistent checkpoint or previous state |
Scenario 2: Multi-File Atomic Updates

Some applications update multiple files that must be consistent together:

1. Write new data to file A
2. Write new index to file B
3. Write new checksum to file C

A crash between steps 2 and 3 leaves:

- File A: new data
- File B: new index
- File C: old checksum
- The checksum doesn't match → corruption detected

Metadata journaling can't help here—each file's metadata is consistent, but the application's cross-file invariant is violated.

Note: Full journaling also cannot atomically update multiple files unless they're written in the same transaction. For true multi-file atomicity, applications typically need database-style transaction support or custom protocols.
Full journaling provides atomic updates at the file system level, but each write() call or fsync() boundary is a separate transaction. Applications still need careful design for cross-file consistency. Full journaling makes individual writes atomic; it doesn't make your entire application atomic.
Full journaling extends the metadata journaling protocol to include data blocks. Every disk block that will be modified—whether data or metadata—is first written to the journal. Only after the journal commit is the modification applied to its final location.
```
Full Journaling Write Flow (application writes 8KB to a file):

Phase 1: Buffer Modifications (same as metadata journaling)
  Memory:
  ├── Page cache: 2 data blocks (8KB total)
  ├── Inode cache: modified inode
  └── Bitmap cache: 2 blocks marked allocated

Phase 2: Write Journal Transaction
  Journal:
  ├── Descriptor block
  │   └── Lists all 4 blocks: 2 data + inode + bitmap
  ├── Data block 1: [4KB file content]   ← DATA in journal
  ├── Data block 2: [4KB file content]   ← DATA in journal
  ├── Metadata block 1: [inode block]
  ├── Metadata block 2: [bitmap block]
  └── Commit record (with checksum)

Phase 3: Barrier + Commit
  Ensure all of the above is durable, then write the commit record

Phase 4: Application gets fsync() success
  ★ DATA IS NOW SAFE ★
  (even though not at its final location yet)

Phase 5: Writeback (background)
  Write data blocks to final locations: blocks 50001-50002
  Write metadata to final locations: inode, bitmap blocks

Phase 6: Checkpoint
  After writeback completes, update the checkpoint marker

Key Difference from Metadata Journaling:
─────────────────────────────────────────
Metadata: Data → Barrier → Metadata-in-Journal → Commit
Full:     Everything-in-Journal → Barrier → Commit → Writeback
```

The Atomicity Mechanism:

Full journaling achieves atomicity through the commit record:

1. Before the commit record is on disk: the transaction doesn't exist. A crash will ignore all logged blocks.
2. After the commit record is on disk: the transaction is complete. Recovery will replay all logged blocks.

There's no intermediate state where some blocks are applied and others aren't. The entire write—data and metadata—either happens completely or not at all.

Recovery Behavior:

On recovery with full journaling:

1. Scan the journal for committed transactions
2. For each committed transaction, replay ALL blocks (data + metadata)
3. Blocks are written to their final locations
4. The file now contains its complete new contents

The key insight: we're replaying from the journal, not from the final locations. The journal contains complete, consistent data.
Replaying a transaction is idempotent—doing it twice produces the same result as once. This is crucial because we don't know if the original writeback completed. By always replaying, we ensure correctness regardless of crash timing during writeback phase.
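The idempotence property is easy to see when the journal stores full block images, as described above: replay is a plain overwrite of the final location, so applying a transaction twice leaves the disk identical to applying it once. A minimal sketch, with the disk modeled as an in-memory block array and `logged_block` invented for illustration:

```c
#include <assert.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define NBLOCKS    16

/* One journal entry: a complete block image plus its destination. */
struct logged_block {
    int  final_blocknr;       /* where the block belongs on disk */
    char image[BLOCK_SIZE];   /* complete block contents */
};

/* Replay = overwrite each destination with the logged image.
   Overwriting is idempotent, so replaying twice is harmless. */
void replay_transaction(char disk[NBLOCKS][BLOCK_SIZE],
                        const struct logged_block *log, int count)
{
    for (int i = 0; i < count; i++)
        memcpy(disk[log[i].final_blocknr], log[i].image, BLOCK_SIZE);
}
```

This is why recovery can always replay every committed transaction without checking whether the original writeback completed.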
Full journaling's strong guarantees come at a significant performance cost. Every byte of application data is written twice—first to the journal, then to its final location. Let's quantify this overhead and understand when it's acceptable.
Write Amplification:

For an 8KB file write:

Metadata Journaling:
- 8KB data → final location
- ~12KB metadata → journal (descriptor, inode, bitmap, commit)
- ~12KB metadata → final location
- Total: ~32KB written

Full Journaling:
- 8KB data → journal
- ~12KB metadata → journal
- 8KB data → final location
- ~12KB metadata → final location
- Total: ~40KB written

For data-heavy workloads, full journaling approximately doubles I/O bandwidth consumption: every data byte is written twice, and since the metadata overhead is roughly fixed, the larger the write, the closer the amplification gets to 2x.
| Workload | Metadata Mode Throughput | Full Journal Throughput | Impact |
|---|---|---|---|
| Large sequential writes | ~500 MB/s (SSD) | ~250 MB/s | -50% |
| Small random writes | ~10K IOPS | ~5K IOPS | -50% |
| Mostly reads | Minimal impact | Minimal impact | ~0% |
| Metadata-heavy (many small files) | ~20K ops/s | ~18K ops/s | -10% |
| Mixed read/write | Varies | ~70% of metadata mode | -30% |
Journal Size Requirements:

Full journaling requires substantially larger journals:

- Metadata journaling: the journal holds only metadata, typically a few MB per second of activity. A 128MB journal can hold many seconds of metadata.
- Full journaling: the journal holds all data, potentially hundreds of MB per second. A 128MB journal fills in less than a second of heavy writing.

Sizing formula:

```
Minimum journal size = (write rate MB/s) × (commit interval seconds) × 2

Example: 100 MB/s × 5s × 2 = 1GB journal minimum
```

The ×2 factor provides headroom for writeback delay. Running with an undersized journal leads to stalls as the journal fills before checkpointing completes.
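The sizing formula above is trivial to encode; a sketch, with `journal_min_size_mb` being a name invented here:

```c
#include <assert.h>

/* Minimum journal size per the formula:
   size = write_rate * commit_interval * 2 (x2 headroom for writeback). */
unsigned long journal_min_size_mb(unsigned long write_rate_mb_s,
                                  unsigned long commit_interval_s)
{
    return write_rate_mb_s * commit_interval_s * 2;
}
```

For the worked example in the text, a 100 MB/s write rate with a 5-second commit interval yields a 1000MB (≈1GB) minimum.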
```shell
# Create filesystem with large journal for full data journaling
mkfs.ext4 -J size=2048 /dev/sda1    # 2GB journal

# Check current journal size
tune2fs -l /dev/sda1 | grep "Journal size"

# Mount with full journaling
mount -o data=journal /dev/sda1 /mnt/fulljournal

# Monitor journal usage (approximate via dirty pages)
cat /proc/meminfo | grep Dirty

# Check for journal stalls in kernel log
dmesg | grep -i "journal"
dmesg | grep "JBD2"

# Recommended: use an external high-speed journal device
# Create external journal on a fast NVMe SSD
mkfs.ext4 -O journal_dev /dev/nvme0n1p1

# Create the main filesystem referencing the external journal
mkfs.ext4 -J device=/dev/nvme0n1p1 /dev/sda1

# An external journal reduces impact because:
# 1. Journal writes don't compete with data writes for bandwidth
# 2. NVMe is much faster than HDD, reducing commit latency
# 3. Parallel I/O: journal and data proceed simultaneously
```

If you need full journaling, consider an external journal on a fast device. A small NVMe SSD (even 32GB) can serve as the journal for multiple spindle-based filesystems, dramatically improving performance. The journal doesn't need to be large—just fast.
Despite the performance cost, full journaling remains essential for certain workloads. Let's examine the specific scenarios where it's the right choice.
Use Case 1: Filesystem-Based Databases

Some databases store data directly in files without implementing their own journaling:

- SQLite (in certain modes)
- Embedded databases
- Simple key-value stores

For these, the filesystem is the only crash-consistency mechanism. Full journaling is mandatory for durability.

However: most production databases (PostgreSQL, MySQL, MongoDB) implement their own write-ahead log, making filesystem full journaling redundant. Double journaling (database + filesystem) is wasteful—use metadata journaling for the filesystem and let the database manage data durability.
| Application Type | Recommended Mode | Rationale |
|---|---|---|
| General purpose workstation | Ordered | Good balance, most common case |
| Web server (static content) | Ordered or Writeback | Reads dominate, content replaceable |
| Database with own WAL | Ordered | DB handles data durability |
| Database without WAL (SQLite) | Full Journal | Only protection for data |
| Build/compile server | Writeback | Fast, work can be rebuilt |
| Email/mail server | Full Journal | Message integrity critical |
| Financial/audit logs | Full Journal | Regulatory requirements |
| Temporary/scratch space | Writeback | Data is transient |
Use Case 2: Audit and Compliance

Regulatory requirements sometimes mandate that records be tamper-evident and complete. A partially written audit entry can be indistinguishable from tampering; full journaling guarantees that each record is either entirely present or entirely absent.
Use Case 3: Application-Consistent Checkpoints

Some applications checkpoint their entire state to disk periodically:

- Virtual machine state
- Scientific simulation snapshots
- Game save states

The checkpoint must be fully consistent—a partial checkpoint is worse than no checkpoint (corrupted state that appears valid). Full journaling can provide this atomicity for single-file checkpoints.
Full journaling is not "extra safe mode." For most workloads, ordered mode provides excellent consistency with much better performance. Use full journaling only when your specific requirements demand it—typically when files must contain exactly old or new content with no intermediate states.
Full journaling implementation requires solving several technical challenges beyond metadata journaling. Let's examine the engineering decisions that make it practical.
Challenge 1: Journal Space Management

With data in the journal, space management becomes critical:
```c
// Journal space reservation before logging
int journal_start_transaction(journal_t *j, transaction_t *txn,
                              size_t data_blocks, size_t metadata_blocks)
{
    size_t total_blocks = data_blocks + metadata_blocks;

    // Add overhead: descriptor, commit, padding
    size_t overhead = 2 + (total_blocks / 16);  // Tags per descriptor
    size_t required = total_blocks + overhead;

    spin_lock(&j->j_lock);

    while (journal_free_space(j) < required) {
        // Not enough space - must wait for checkpoint
        if (journal_checkpoint_count(j) == 0) {
            // No checkpointable transactions - force writeback
            spin_unlock(&j->j_lock);
            force_writeback_oldest_transaction(j);
            spin_lock(&j->j_lock);
            continue;
        }

        // Wait for checkpoint to complete
        spin_unlock(&j->j_lock);
        wait_for_checkpoint_space(j, required);
        spin_lock(&j->j_lock);
    }

    // Reserve space
    j->j_reserved_space += required;
    txn->t_reserved_blocks = required;

    spin_unlock(&j->j_lock);
    return 0;
}

// After the transaction commits, release the reservation
void journal_complete_transaction(journal_t *j, transaction_t *txn)
{
    size_t actual_used = txn->t_actual_blocks;

    spin_lock(&j->j_lock);
    j->j_reserved_space -= txn->t_reserved_blocks;
    j->j_used_space += actual_used;
    spin_unlock(&j->j_lock);

    // Potentially wake waiters
    wake_up_checkpoint_waiters(j);
}
```

Challenge 2: Write Ordering for fsync()

When an application calls fsync(), all data for that file must be in the journal and committed. The file system must track which pages belong to which transactions.
Challenge 3: Page Cache Integration

The page cache holds file data in memory. With full journaling, pages go through a complex lifecycle:
```
Page Lifecycle in Full Journaling:

State 1: CLEAN
  Page matches on-disk content
    ↓ (write() called)
State 2: DIRTY_UNCOMMITTED
  Page modified but not yet in a journal transaction
    ↓ (transaction started, page added)
State 3: JOURNALED_RUNNING
  Page is part of an open transaction
  Page data copied to journal buffers
    ↓ (transaction commits)
State 4: JOURNALED_COMMITTED
  Page is in a committed transaction
  Still dirty in page cache (not at final location)
    ↓ (writeback to final location)
State 5: WRITTEN_CHECKPOINT_PENDING
  Page at final location
  But transaction not yet checkpointed
    ↓ (checkpoint advances past this transaction)
State 6: CLEAN
  Page fully synchronized
  Journal space reclaimed

Key Invariant: A page cannot be reclaimed from memory while it's in
JOURNALED_RUNNING or JOURNALED_COMMITTED state - we need it for
potential recovery.
```

Full journaling increases memory pressure because dirty pages must be held until their transaction commits and writes back. Under memory pressure, the system must force commits and writebacks to free pages. This can cause I/O storms when memory is tight—another reason full journaling impacts more than just I/O bandwidth.
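The lifecycle above can be captured as a small state machine. This is an illustrative sketch, not jbd2's actual implementation: the enum names mirror the diagram, and `page_reclaimable` encodes the key invariant that pages in an open or committed-but-unwritten transaction must stay in memory.

```c
#include <assert.h>

/* States from the page lifecycle diagram (names invented to match it). */
enum page_state {
    PG_CLEAN,                       /* states 1 and 6 */
    PG_DIRTY_UNCOMMITTED,           /* state 2 */
    PG_JOURNALED_RUNNING,           /* state 3: in an open transaction */
    PG_JOURNALED_COMMITTED,         /* state 4: committed, not written back */
    PG_WRITTEN_CHECKPOINT_PENDING,  /* state 5: at final location */
};

/* Key invariant: pages needed for potential recovery cannot be
   reclaimed from memory. */
int page_reclaimable(enum page_state s)
{
    return s != PG_JOURNALED_RUNNING && s != PG_JOURNALED_COMMITTED;
}
```

Encoding the invariant as a predicate makes the memory-pressure point concrete: the more pages sitting in states 3 and 4, the less memory the kernel can reclaim without forcing a commit.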
File system full journaling and database write-ahead logging solve similar problems but differ in important ways. Understanding these differences helps architects make better system design decisions.
| Aspect | File System (ext4 data=journal) | Database (PostgreSQL WAL) |
|---|---|---|
| Scope | Single file system | Single database |
| Atomicity unit | Implicit (timer-based batches) | Explicit (user transactions) |
| Rollback support | No rollback | Full rollback capability |
| Isolation | None | ACID isolation levels |
| Log content | Physical (block images) | Often logical (operations) |
| Cross-file atomicity | Within same batch only | Explicit transaction support |
| Application awareness | None (transparent) | Application controls transactions |
The Layering Decision:

Should you rely on file system journaling or application-level journaling?
Avoiding Double Journaling:

A common anti-pattern: running a database with an application-level WAL on a file system with full journaling. This results in:

- 4x write amplification (2x from each layer)
- Wasted I/O bandwidth
- Wasted CPU for checksum computation
- No additional safety (both journals protect the same data)

Recommendation: if your application implements proper journaling (like PostgreSQL, MySQL InnoDB, or MongoDB), use ordered or writeback mode for the file system. Let each layer do what it does best.
Databases often use O_DIRECT to bypass the page cache and write directly to disk. This avoids the file system's journal entirely for data writes. Combined with the database's own WAL, this eliminates redundancy while maintaining strong consistency. PostgreSQL, Oracle, and others use this pattern.
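A minimal Linux-specific sketch of the O_DIRECT pattern. The helper names (`open_direct`, `alloc_aligned`) are invented here; the key requirement is real: direct I/O demands that buffers, offsets, and lengths be aligned (typically to the device's logical block size), which is why the buffer comes from posix_memalign rather than malloc.

```c
#define _GNU_SOURCE        /* exposes O_DIRECT on Linux */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Open a file for direct (page-cache-bypassing) writes. */
int open_direct(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
}

/* Allocate a buffer suitable for direct I/O: 4KB-aligned. */
void *alloc_aligned(size_t len)
{
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, len) != 0)
        return NULL;
    return buf;
}
```

With O_DIRECT, data writes bypass both the page cache and the filesystem journal's data path; the database's own WAL then provides the only (and sufficient) durability mechanism for data.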
If you need stronger consistency than metadata journaling but full journaling's overhead is unacceptable, several alternative approaches exist.
Alternative 1: Copy-on-Write File Systems

ZFS, Btrfs, and APFS use copy-on-write (CoW) instead of journaling:

- Data is never overwritten in place
- The new version is written to a new location
- A pointer swap makes the new version active
- No journal needed—the filesystem is always consistent

CoW provides atomicity without write amplification for updates (new data is written once). However, CoW has its own overheads: more metadata updates, potential fragmentation, and more complex space management.

Best for: systems needing snapshots, deduplication, or integrated checksumming along with strong consistency.
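The CoW pointer-swap idea can be shown in miniature. This is a toy in-memory model (`cow_root` and `cow_update` are invented names, and a real filesystem swaps on-disk block pointers, not C pointers), but it captures why readers never see a torn update: the old version is never touched, and a single pointer assignment publishes the new one.

```c
#include <assert.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* The root points at whichever version is currently live. */
struct cow_root {
    char *active;
};

/* Write the new version to fresh storage, then swap the pointer.
   The old block is never modified, so a "crash" before the swap
   leaves the old version fully intact. */
void cow_update(struct cow_root *root, char *newblock,
                const char *data, size_t len)
{
    memcpy(newblock, data, len);   /* copy-on-write: new location only */
    root->active = newblock;       /* the swap makes the update visible */
}
```

The atomic "commit point" here is the pointer assignment, playing the same role as the journal's commit record in full journaling.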
Alternative 2: Application-Level Atomicity

The rename trick provides atomic file replacement without full journaling:
```c
#include <fcntl.h>
#include <libgen.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// Atomic file update using rename()
int atomic_update_file(const char *path, const void *data, size_t len)
{
    // Step 1: Create a temporary file in the same directory
    char tmppath[PATH_MAX];
    snprintf(tmppath, sizeof(tmppath), "%s.tmp.XXXXXX", path);

    int fd = mkstemp(tmppath);
    if (fd < 0)
        return -1;

    // Step 2: Write the new content
    ssize_t written = write(fd, data, len);
    if (written < 0 || (size_t)written != len) {
        close(fd);
        unlink(tmppath);
        return -1;
    }

    // Step 3: Ensure the data is durable
    if (fsync(fd) < 0) {
        close(fd);
        unlink(tmppath);
        return -1;
    }
    close(fd);

    // Step 4: Atomic rename
    // rename() is atomic in POSIX - either the old or the new file exists
    if (rename(tmppath, path) < 0) {
        unlink(tmppath);
        return -1;
    }

    // Step 5: Ensure the directory update is durable
    // (dirname() may modify its argument, so work on a copy)
    char dirbuf[PATH_MAX];
    strncpy(dirbuf, path, sizeof(dirbuf) - 1);
    dirbuf[sizeof(dirbuf) - 1] = '\0';
    int dirfd = open(dirname(dirbuf), O_RDONLY);
    if (dirfd >= 0) {
        fsync(dirfd);
        close(dirfd);
    }

    return 0;
}

/*
 * Guarantees:
 * - File always contains complete old or complete new content
 * - Never partially written or corrupted
 * - Works with metadata journaling (ordered mode)
 *
 * Limitations:
 * - Doubles space usage temporarily
 * - Only works for whole-file replacement
 * - Cannot update multiple files atomically
 */
```

Alternative 3: Soft Updates + Background fsck

BSD's soft updates carefully order writes so that the filesystem is always consistent, though some blocks may be leaked. A background fsck reclaims leaked space without blocking mount.

This avoids both journaling overhead and crash-time recovery delay. However, the implementation is extremely complex (ordering dependencies), and leaked space temporarily reduces capacity.

Alternative 4: Log-Structured File Systems

LFS and F2FS write all data to a continuous log, never overwriting. This is similar to full journaling but without the writeback phase—the log IS the final location.

Excellent for SSDs and write-heavy workloads, but requires garbage collection for space reclamation and has complex interactions with random read patterns.
Every approach trades off between write amplification, implementation complexity, recovery time, and consistency strength. Full journaling's 2x write amplification is often acceptable for its simplicity and reliability. Evaluate alternatives only if that overhead is demonstrably problematic for your workload.
We've examined full data journaling in depth: what it guarantees, what it costs, when it's the right choice, and what the alternatives are.
What's Next:

We've covered the three main journaling modes. Now we'll examine journal replay—the recovery mechanism that makes all this work. Understanding replay completes our picture of how journaling maintains file system consistency across crashes.
You now understand when and why to use full data journaling, and importantly, when NOT to use it. This knowledge helps you make informed decisions about file system configuration based on your specific workload requirements and consistency needs.