Imagine you're saving a document. The application writes your data to disk, the progress bar completes, and you close your laptop. But what if—in that precise moment between "write started" and "write finished"—the power fails? What happens to your file? What happens to the file system itself?

This is the crash consistency problem, and it has haunted file system designers since the earliest days of computing. A file system must update multiple on-disk structures atomically—writing file data, updating metadata, allocating blocks, modifying directory entries—but disk operations are not atomic. A crash at any point can leave the file system in an inconsistent state, with orphaned blocks, corrupted directories, or lost data.
By the end of this page, you will understand the crash consistency problem in depth, why naive approaches fail, and how journaling provides a principled solution. You'll learn the fundamental theory behind write-ahead logging and see why journaling has become the standard approach for maintaining file system integrity.
To understand why crash consistency is so challenging, we must first understand what happens when you modify a file. Consider a seemingly simple operation: appending a single block of data to an existing file.

This "simple" operation requires the file system to perform multiple distinct writes to different locations on disk:

- The new data block itself, containing the appended contents
- The file's inode, updated to point at the new block and record the new file size
- The block allocation bitmap, marking the new block as in use
The fundamental challenge:

Disks (whether HDDs or SSDs) can only atomically write at sector granularity—typically 512 bytes or 4KB. But the updates above involve multiple sectors at different disk locations. There is no way to atomically write all of them together. The file system must issue separate I/O requests, and between any two writes, a crash can occur.

This creates what we call the atomicity gap: operations that should appear atomic to users are implemented as multiple non-atomic disk writes. Bridging this gap is the central challenge of crash-consistent file system design.
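To make the atomicity gap concrete, here is a minimal sketch of that append as three separate, individually atomic block writes, with comments marking what a crash between writes would leave behind. The `disk_write` helper and the struct layouts are hypothetical simplifications, not any real file system's code.

```c
#include <stddef.h>
#include <stdint.h>

struct inode      { uint64_t size; uint32_t block_ptrs[12]; };
struct bitmap_blk { uint8_t  bits[4096]; };

/* Stand-in for a block-device write; assume each call is atomic on its own. */
static void disk_write(uint64_t block_nr, const void *buf, size_t len)
{
    (void)block_nr; (void)buf; (void)len;
}

void append_block(struct inode *ino, uint64_t ino_blk,
                  struct bitmap_blk *bmap, uint64_t bmap_blk,
                  const void *data, uint32_t new_blk)
{
    /* Write 1: the new data block itself. */
    disk_write(new_blk, data, 4096);           /* crash here: leaked data block */

    /* Write 2: the inode, now pointing at the new block. */
    ino->block_ptrs[ino->size / 4096] = new_blk;
    ino->size += 4096;
    disk_write(ino_blk, ino, sizeof *ino);     /* crash here: bitmap is stale   */

    /* Write 3: the allocation bitmap, marking the block as in use. */
    bmap->bits[new_blk / 8] |= (uint8_t)(1u << (new_blk % 8));
    disk_write(bmap_blk, bmap, sizeof *bmap);  /* only now is everything consistent */
}
```

Every crash state in the table below corresponds to the power failing between two of these calls.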
| Crash Timing | What's Written | File System State |
|---|---|---|
| Before any write | Nothing | Original state preserved (safe) |
| After data block only | Data block | Data exists but unreferenced (leaked blocks) |
| After inode only | Inode update | Inode points to unallocated/garbage block (corruption) |
| After bitmap only | Bitmap update | Block marked used but unreferenced (leaked blocks) |
| After data + inode | Data + inode | Bitmap inconsistency, potential double allocation |
| After data + bitmap | Data + bitmap | Inode doesn't reference data (orphaned block) |
| After inode + bitmap | Inode + bitmap | Inode points to uninitialized/garbage data |
| After all three | Complete update | Consistent state (safe) |
For a simple three-write operation, there are 8 possible crash states (2³). For complex operations like renaming a file (which may involve 6+ writes), there are 64+ possible inconsistent states. Every combination must either be impossible or recoverable.
Before journaling became widespread, file systems employed simpler—but more costly—approaches to maintain consistency. Understanding these approaches illuminates why journaling represents such a significant advancement.
The fsck Approach: Full File System Check

The original Unix approach was straightforward: after an unclean shutdown, run a comprehensive consistency checker (fsck, or "file system check") that scans the entire file system, identifies inconsistencies, and repairs them.

fsck performs a complete traversal of all file system structures:
- Scans every inode to build a reference count for each block
- Traverses every directory to verify entries point to valid inodes
- Compares computed block allocation against the free block bitmap
- Identifies and resolves orphaned inodes and blocks
- Verifies link counts match actual directory references
The Critical Problem: Scale

When disks were measured in megabytes, fsck completed in seconds. But disk capacity has grown exponentially while seek times and throughput have improved only linearly. A full fsck on a modern multi-terabyte file system can take hours to complete—during which the system is completely unavailable.

For a server with 8TB of storage, fsck might require:
- Reading every inode: ~500 million inodes × seek time
- Traversing every directory: millions of directories
- Validating every block reference: billions of block pointers

Estimated time: 2-6 hours of pure scanning, during which the file system cannot be mounted.

This is the fundamental limitation that drove the development of journaling: recovery time must be proportional to the amount of in-flight work at crash time, not to the total file system size.
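The cost structure is easier to see in code. Below is a rough sketch of the cross-check fsck performs between inode block pointers and the allocation bitmap; the sizes, structs, and arrays are invented for illustration, but the two full passes over every inode and every block are the point—the work grows with the size of the file system, not with how much was in flight at crash time.

```c
#include <stdint.h>
#include <stdio.h>

#define NBLOCKS (1u << 20)   /* pretend file system with ~1M blocks (assumption) */
#define NINODES (1u << 16)

struct inode { uint32_t nblocks; uint32_t block_ptrs[12]; };

static struct inode inode_table[NINODES];       /* assume: loaded from disk */
static uint8_t      block_bitmap[NBLOCKS / 8];  /* assume: loaded from disk */

void fsck_check_blocks(void)
{
    static uint8_t referenced[NBLOCKS / 8];

    /* Pass 1: touch every inode and every block pointer. */
    for (uint32_t i = 0; i < NINODES; i++)
        for (uint32_t j = 0; j < inode_table[i].nblocks && j < 12; j++) {
            uint32_t b = inode_table[i].block_ptrs[j];
            referenced[b / 8] |= (uint8_t)(1u << (b % 8));
        }

    /* Pass 2: compare what the inodes reference with what the bitmap claims. */
    for (uint32_t b = 0; b < NBLOCKS; b++) {
        int allocated = (block_bitmap[b / 8] >> (b % 8)) & 1;
        int in_use    = (referenced[b / 8]   >> (b % 8)) & 1;
        if (allocated && !in_use)
            printf("block %u: marked used but unreferenced (leak)\n", b);
        if (!allocated && in_use)
            printf("block %u: referenced but marked free (corruption)\n", b);
    }
}
```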
BSD systems developed "soft updates," which carefully order writes so that the file system is always recoverable, though it may leak some allocated blocks. While elegant, soft updates proved complex to implement correctly and still require a background fsck to reclaim leaked space. Journaling emerged as the more practical solution.
Journaling file systems borrow a fundamental technique from database systems: Write-Ahead Logging (WAL). The principle is elegantly simple yet profoundly powerful:

> Before modifying any data structure on disk, first write a description of what you intend to do to a separate log. Only after the log record is safely on disk should you proceed with the actual modification.

This transforms crash recovery from a scan-everything operation into a replay-the-log operation. The log contains a complete record of recent activity, allowing the system to:

1. Redo operations that completed in the log but may not have reached disk
2. Undo operations that started but didn't complete
3. Ignore operations that were never logged (they never started)
The WAL Protocol:

The write-ahead logging protocol enforces a strict ordering of operations, sketched in code after this list:

1. Append the intended modifications (and their final disk locations) to the journal
2. Wait until those journal records are durably on disk
3. Write a commit record to the journal and wait for it to reach the disk
4. Only then write the modifications to their final, in-place locations (checkpointing)
5. Once checkpointed, the transaction's journal space can be reclaimed
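Here is a minimal sketch of that ordering. `journal_append()`, `disk_flush()`, `disk_write()`, and the record layout are hypothetical stand-ins; real journals (e.g., ext4's JBD2) batch many operations into each transaction.

```c
#include <stddef.h>
#include <stdint.h>

enum rec_type { REC_BEGIN, REC_BLOCK, REC_COMMIT };

struct logged_block { uint64_t final_blocknr; const void *data; };

/* Stand-ins for appending to the sequential log and for a disk cache flush. */
static void journal_append(enum rec_type t, const void *rec, size_t len)
{ (void)t; (void)rec; (void)len; }
static void disk_flush(void) { }
static void disk_write(uint64_t blocknr, const void *d) { (void)blocknr; (void)d; }

void wal_transaction(uint64_t txn_id, const struct logged_block *blocks, size_t n)
{
    /* 1. Log the intent: the modified blocks and their final locations. */
    journal_append(REC_BEGIN, &txn_id, sizeof txn_id);
    for (size_t i = 0; i < n; i++)
        journal_append(REC_BLOCK, &blocks[i], sizeof blocks[i]);

    /* 2. Make sure those log records are durably on disk before committing. */
    disk_flush();

    /* 3. Write the single, atomic commit record; the transaction now "exists". */
    journal_append(REC_COMMIT, &txn_id, sizeof txn_id);
    disk_flush();

    /* 4. Checkpoint: apply the changes to their real, in-place locations.
     *    A crash anywhere after step 3 is repaired by replaying the log. */
    for (size_t i = 0; i < n; i++)
        disk_write(blocks[i].final_blocknr, blocks[i].data);
}
```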
Why This Works:

The magic of WAL lies in the commit record. If recovery finds a complete transaction (begin → modifications → commit), the transaction is valid and should be replayed. If recovery finds an incomplete transaction (begin → modifications → no commit), the transaction never officially happened and can be discarded (or rolled back, if undo information exists).

Crucially, the commit record is a single, atomic write. Either the commit is on disk, or it isn't. There's no intermediate state. This converts the many-write atomicity problem into a single-write atomicity problem, which disks naturally provide.
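A sketch of that recovery-time decision, using a hypothetical record format: a transaction is replayed only if its commit record reached the journal.

```c
#include <stddef.h>
#include <stdint.h>

enum rec_type { REC_BEGIN, REC_BLOCK, REC_COMMIT };
struct log_record { enum rec_type type; uint64_t txn_id; /* ...payload... */ };

int txn_should_replay(const struct log_record *log, size_t nrecords, uint64_t txn_id)
{
    for (size_t i = 0; i < nrecords; i++)
        if (log[i].type == REC_COMMIT && log[i].txn_id == txn_id)
            return 1;   /* commit found: redo this transaction's logged blocks */
    return 0;           /* no commit: the transaction never happened; skip it  */
}
```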
The journal is written sequentially—new records are always appended to the end. Sequential writes are dramatically faster than random writes on HDDs (avoiding seeks) and significantly faster on SSDs (enabling efficient block allocation). So although journaling writes logged data twice (once to the journal, once to its final location), the sequential journal write costs far less than a second random write would.
The Journal Structure:

A file system journal is typically a fixed-size circular buffer on disk. It contains:

```
+------------------+------------------+------------------+-----+
| Transaction 101  | Transaction 102  | Transaction 103  | ... |
| [Begin]          | [Begin]          | [Begin]          |     |
| [Inode 42: ...]  | [Bitmap: ...]    | [Dir block: ...] |     |
| [Bitmap: ...]    | [Inode 99: ...]  | [Commit]         |     |
| [Commit]         | [Commit]         |                  |     |
+------------------+------------------+------------------+-----+
                                                  ^ write head
```

The circular nature means old, fully-applied transactions are eventually overwritten. The file system tracks which transactions have been fully written to their final locations (checkpointed) and ensures those log entries aren't needed before reclaiming the space.
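The bookkeeping behind the circular buffer can be sketched with two pointers: the head advances as new records are appended, and the tail advances only after a transaction has been checkpointed. The names and layout below are hypothetical simplifications.

```c
#include <stdint.h>

struct journal {
    uint32_t nblocks;   /* total journal size, in blocks                 */
    uint32_t head;      /* next block to write (newest record)           */
    uint32_t tail;      /* oldest block still needed for crash recovery  */
};

uint32_t journal_free_blocks(const struct journal *j)
{
    /* Space between head and tail, accounting for wrap-around. */
    return (j->tail + j->nblocks - j->head - 1) % j->nblocks;
}

/* If a new transaction doesn't fit, writers must stall until checkpointing
 * frees enough log space. */
int journal_has_room(const struct journal *j, uint32_t txn_blocks)
{
    return journal_free_blocks(j) >= txn_blocks;
}

/* Called once a transaction's blocks have reached their final locations:
 * its log records are no longer needed, so the tail can move past them. */
void journal_checkpoint_done(struct journal *j, uint32_t txn_blocks)
{
    j->tail = (j->tail + txn_blocks) % j->nblocks;
}
```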
Journaling provides strong guarantees that make file systems robust against crashes, but these guarantees come with specific trade-offs that system designers must understand.
| Guarantee | Description | Implication |
|---|---|---|
| Atomicity | Transactions either fully apply or don't apply at all | No partial updates visible after recovery |
| Durability | Committed transactions survive crashes | Once commit returns, data is safe |
| Consistency | File system structures remain internally consistent | No corrupted metadata, valid pointers |
| Bounded Recovery | Recovery time proportional to log size, not FS size | Seconds/minutes, not hours |
What Journaling Does NOT Guarantee:

It's equally important to understand what journaling cannot provide:

- Durability of un-synced writes: data still sitting in the page cache at crash time is lost; only committed (fsync'd) data is guaranteed to survive
- Protection of file contents in metadata-only modes: the common default modes protect file system structures, not necessarily the data blocks themselves
- Protection against hardware faults: journaling assumes the disk faithfully stores what it acknowledges; it does not defend against media corruption
Performance Trade-offs:

Journaling imposes performance costs that vary by workload:

- Logged data is written twice (once to the journal, once to its final location), so full data journaling roughly doubles write traffic
- Every commit requires a disk flush, so fsync-heavy workloads pay the journal's commit latency on each call
- Metadata-intensive workloads (many small files, frequent renames) generate the most journal traffic and feel the overhead most
Modern journaling file systems batch multiple operations into single transactions. Instead of committing each write individually, the file system collects operations over a short window (e.g., 5 seconds in ext4) and commits them together. This amortizes sync overhead across many operations, dramatically improving throughput at the cost of slightly increased data loss on crash.
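The batching idea can be sketched as follows. The 5-second interval mirrors ext4's default commit interval mentioned above; everything else (function names, the dirty flag) is a hypothetical simplification.

```c
#include <time.h>

#define COMMIT_INTERVAL_SEC 5

static time_t txn_opened;
static int    txn_dirty;

static void journal_commit_current(void)
{
    /* Flush the log and write the commit record for the open transaction. */
    txn_dirty = 0;
}

void fs_log_operation(void)
{
    time_t now = time(NULL);

    if (!txn_dirty) {                    /* first op opens a new transaction  */
        txn_opened = now;
        txn_dirty  = 1;
    }
    /* ... append this operation's blocks to the open transaction ... */

    if (now - txn_opened >= COMMIT_INTERVAL_SEC)
        journal_commit_current();        /* one sync covers every batched op  */
}
```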
File system transactions differ somewhat from database transactions. Understanding this model is essential for predicting file system behavior and for application developers who need to reason about crash safety.
Implicit vs. Explicit Transactions:

Database transactions are explicit—applications begin transactions, perform operations, and commit. File system transactions are typically implicit—the file system automatically groups operations into transactions without application involvement.

This distinction has important implications:
- Applications cannot control transaction boundaries
- Related operations may end up in different transactions
- The file system, not the application, determines atomicity boundaries
| Aspect | Database Transactions | File System Transactions |
|---|---|---|
| Boundaries | Explicit (BEGIN/COMMIT) | Implicit (system-determined) |
| Granularity | Arbitrary | Single operation or batched window |
| Rollback | Full undo on abort | Often no explicit undo mechanism |
| Isolation | ACID isolation levels | Limited or no isolation |
| Application Control | Full control | No direct control |
| Commit Notification | Explicit commit returns | fsync/sync provides commit semantics |
The fsync Contract:

Applications achieve durability guarantees through the fsync() system call. When fsync returns successfully:

1. All data written to the file has been persisted to disk
2. All metadata changes (size, timestamps) have been persisted
3. The file is recoverable even after an immediate crash

However, fsync says nothing about other files. If you're updating multiple files atomically (like a database with multiple data files), you need additional mechanisms.

Important: fsync is expensive because it forces a journal commit and waits for the disk to confirm the write. Applications that call fsync after every small write will have terrible performance.
```c
// Error handling omitted for brevity; production code must check every return value.
#include <fcntl.h>
#include <libgen.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

// Pattern 1: Simple durable write
// Good for occasional important updates
void durable_write(int fd, const void *buf, size_t count) {
    write(fd, buf, count);      // Write to page cache
    fsync(fd);                  // Force to disk
}

// Pattern 2: Write-then-rename (atomic file replacement)
// Provides atomicity at whole-file level
void atomic_file_update(const char *filename, const void *data, size_t size) {
    char tmpname[PATH_MAX];
    snprintf(tmpname, sizeof(tmpname), "%s.tmp", filename);

    int fd = open(tmpname, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, data, size);
    fsync(fd);                  // Ensure temp file is durable
    close(fd);

    rename(tmpname, filename);  // Atomic replacement

    // For full safety, also sync the containing directory
    // (dirname may modify its argument, so work on a copy)
    char dirpath[PATH_MAX];
    snprintf(dirpath, sizeof(dirpath), "%s", filename);
    int dir_fd = open(dirname(dirpath), O_RDONLY);
    fsync(dir_fd);              // Ensure the rename is durable
    close(dir_fd);
}

// Pattern 3: Batched writes with periodic sync
// Good for high throughput with an acceptable loss window
void batched_writer(int fd, const void *buf, size_t count) {
    static int write_count = 0;
    write(fd, buf, count);
    if (++write_count >= 1000) {   // Every 1000 writes
        fsync(fd);
        write_count = 0;
    }
}
```

Many applications incorrectly assume that data is safe once write() returns. It isn't—write() only copies data to the kernel's page cache. A crash can lose this data. Always use fsync() for data that must survive crashes, but use it judiciously to avoid performance collapse.
The physical design of the journal significantly impacts both performance and reliability. File system designers must make careful decisions about journal placement, sizing, and internal structure.
Internal vs. External Journals:

Journals can be stored in two locations:

- Internal journal: a reserved region inside the file system itself. This is the common default (e.g., in ext4 and XFS): simple to set up, but journal writes compete with regular I/O on the same device.
- External journal: a dedicated separate device, often a small, fast SSD. This keeps the purely sequential journal traffic off the main data disks, at the cost of managing an extra device.
Journal Sizing Considerations:

The journal must be large enough to hold all in-flight transactions. If the journal fills up before transactions can be checkpointed to their final locations, the file system must stall all new operations until space is available.

Factors affecting journal size:
- Transaction commit interval (longer = more pending data)
- Workload intensity (more writes = faster journal consumption)
- Checkpoint frequency (faster checkpointing = less journal pressure)
- Journaling mode (full data journaling requires much more space)

Typical sizes:
- ext4 default: 64MB to 128MB for metadata journaling
- ext4 with full journaling: Often 1GB or more
- XFS: Variable, typically 32MB to 2GB
- Enterprise systems: May use multi-GB journals
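Combining the factors above gives a rough back-of-envelope sizing rule: the journal must absorb the logging rate multiplied by how long a transaction can sit before it is checkpointed. The numbers in this sketch are illustrative assumptions, not measurements or recommendations.

```c
#include <stdio.h>

int main(void)
{
    double log_rate_mb_s    = 10.0;  /* assumed rate of data entering the journal */
    double checkpoint_lag_s = 10.0;  /* commit interval + checkpoint delay        */
    double safety_factor    = 2.0;   /* headroom so writers rarely stall          */

    double journal_mb = log_rate_mb_s * checkpoint_lag_s * safety_factor;
    printf("suggested journal size: ~%.0f MB\n", journal_mb);   /* ~200 MB */
    return 0;
}
```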
Journal Record Structure:

Journal records contain the information needed for replay:
```c
#include <stdint.h>

// Simplified journal block header (inspired by ext4 JBD2)
struct journal_header {
    uint32_t h_magic;       // Journal magic number (validation)
    uint32_t h_blocktype;   // Block type (descriptor, commit, etc.)
    uint32_t h_sequence;    // Transaction sequence number
};

// Block types
#define JBD2_DESCRIPTOR_BLOCK  1   // Describes following data blocks
#define JBD2_COMMIT_BLOCK      2   // Marks transaction as complete
#define JBD2_SUPERBLOCK_V1     3   // Journal superblock version 1
#define JBD2_SUPERBLOCK_V2     4   // Journal superblock version 2
#define JBD2_REVOKE_BLOCK      5   // Revocation records (cancel previous)

// Descriptor block: describes what's being logged
struct journal_descriptor {
    struct journal_header header;
    // Followed by tag entries describing each logged block
};

// Block tag: describes one logged block
struct journal_block_tag {
    uint32_t t_blocknr;     // Block number on disk (final location)
    uint16_t t_checksum;    // Checksum of the block data
    uint16_t t_flags;       // Tag flags (last tag, same UUID, etc.)
};

// Commit block: marks transaction completion
struct journal_commit {
    struct journal_header header;
    uint8_t  commit_checksum[16];   // Checksum over entire transaction
    uint64_t commit_sec;            // Commit timestamp (seconds)
    uint32_t commit_nsec;           // Commit timestamp (nanoseconds)
};

// Journal superblock: journal metadata
struct journal_superblock {
    struct journal_header header;
    uint32_t s_blocksize;   // Journal block size
    uint32_t s_maxlen;      // Total journal blocks
    uint32_t s_first;       // First block of log
    uint32_t s_sequence;    // First sequence expected
    uint32_t s_start;       // Block number of first active block
    uint32_t s_errno;       // Error number (if journal corrupt)
    // ... additional fields for features, checksum type, etc.
};
```

Modern journals include checksums in commit records. During recovery, the checksum validates that the logged data is intact. If a crash occurred mid-write to the journal (torn write), the checksum will fail, and the transaction will be correctly identified as incomplete.
Journaling is the dominant approach to crash consistency in modern file systems, but it exists within a broader ecosystem of related techniques. Understanding this context helps you appreciate when journaling is appropriate and when alternatives might be preferable.
| Approach | Used By | Recovery Time | Write Overhead | Complexity |
|---|---|---|---|---|
| Full fsck | Legacy Unix | O(disk size) | None | Low |
| Journaling | ext4, NTFS, XFS | O(log size) | Moderate (2x for logged data) | Medium |
| Soft Updates | BSD UFS | Instant mount + background fsck | None (ordered writes) | High |
| Copy-on-Write | ZFS, Btrfs | Instant (always consistent) | Variable (CoW overhead) | Medium-High |
| Log-Structured | F2FS, LFS | O(checkpoint interval) | Low (sequential only) | High |
Copy-on-Write vs. Journaling:

Modern file systems like ZFS and Btrfs use copy-on-write (CoW) instead of journaling. In CoW file systems:

- Data is never modified in place
- Updates write to new locations, then atomically update pointers (sketched in the code below)
- The file system is always consistent—there's no crash window

CoW eliminates the need for a journal but introduces its own trade-offs:
- More complex space management (need to handle snapshots, clones)
- Potential fragmentation over time
- Write amplification for small updates to large files
- Different performance characteristics

When Journaling Excels:
- General-purpose workloads with mixed operations
- Environments where simplicity is valued
- Systems with limited memory (smaller metadata footprint)
- When in-place updates are preferred (less fragmentation)

When Alternatives Excel:
- CoW: When snapshots, checksums, and RAID integration are needed
- Log-structured: When write performance on SSDs is paramount
- Soft updates: When memory-mapped I/O is heavily used
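As a contrast with the journal-based protocol, here is a minimal sketch of the copy-on-write update rule described above: write the new version somewhere else, then publish it with a single pointer switch. The names are hypothetical, and real CoW file systems such as ZFS and Btrfs propagate this pointer swap up an entire tree of blocks.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct blk { uint8_t data[4096]; };

/* The "root pointer" that a single atomic update switches over. */
static struct blk *current_version;

void cow_update(const void *new_data, size_t len, struct blk *free_block)
{
    /* 1. Write the new contents to an unused block, never in place. */
    memcpy(free_block->data, new_data,
           len < sizeof free_block->data ? len : sizeof free_block->data);
    /* (On disk: flush the new block here before publishing it.) */

    /* 2. Publish it with one atomic pointer update; readers see either the
     *    old version or the new one, never a half-written mix. */
    current_version = free_block;
}
```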
Despite the elegance of alternatives, journaling remains the most widely deployed crash consistency mechanism. ext4 (journaling) dominates Linux server deployments, NTFS (journaling) dominates Windows, and Apple relied on journaling in HFS+ before moving to the copy-on-write APFS. Journaling's combination of good performance, understood behavior, and proven reliability makes it the conservative choice for most systems.
We've established the foundational concepts of file system journaling. Let's consolidate the key takeaways:

- A single file operation requires multiple non-atomic disk writes, and a crash between any two of them can leave the file system inconsistent
- fsck can repair inconsistencies, but its recovery time grows with file system size, making it impractical for modern multi-terabyte disks
- Write-ahead logging records intended changes in a journal before applying them in place, reducing recovery to a bounded replay of the log
- The single, atomic commit record is what converts a many-write atomicity problem into a one-write atomicity problem
- Journaling guarantees atomicity, durability of committed transactions, and bounded recovery, but applications still need fsync to control when their data becomes durable
What's Next:

With the theoretical foundation in place, we'll dive into the mechanics of write-ahead logging in the next page. We'll explore exactly how the journal is written, the ordering constraints that ensure correctness, and the recovery algorithm that restores consistency after a crash.
You now understand why journaling exists and the fundamental principles that make it work. This foundation will help you understand the detailed mechanics in the following pages, and ultimately help you make informed decisions about file system configuration and application durability requirements.