Linux File Systems

Level: Advanced

Duration: 120 mins

Topic: Linux File Systems

ext4 Internals

The Workhorse of Linux Storage

ext4 (the fourth extended filesystem) is the default file system for many Linux distributions, including Debian and Ubuntu. It runs on billions of devices, from servers and desktops to embedded systems and Android phones. Alternatives such as Btrfs and XFS have become the default on some distributions and suit specialized workloads, yet ext4 remains a trusted choice, a testament to its balance of performance, reliability, and maturity.

But what makes ext4 work? How does it translate VFS operations into on-disk structures? How does it recover from crashes without data corruption? And what are the architectural decisions behind its performance characteristics?

This page answers these questions, providing the deep understanding that distinguishes file system engineers from file system users.

What You Will Learn

By the end of this page, you will understand ext4's complete architecture: the on-disk layout and data structures, the extent-based allocation system, the journaling mechanism that ensures crash consistency, and the performance optimizations that make ext4 suitable for both SSDs and spinning disks. You'll gain the knowledge needed to debug file system issues, understand mkfs/fsck operations, and make informed decisions about file system tuning.

Evolution from ext2/ext3

ext4 didn't emerge in isolation—it evolved from its predecessors, inheriting their on-disk layout while adding crucial features. Understanding this evolution illuminates ext4's design decisions.

ext Family Evolution
Feature	ext2	ext3	ext4
Journaling	No	Yes	Yes (enhanced)
Max file size (4 KiB blocks)	2 TB	2 TB	16 TB
Max volume size (4 KiB blocks)	16 TB	16 TB	1 EB
Block allocation	Block bitmap	Block bitmap	Extents + delayed
Timestamp granularity	1 second	1 second	Nanoseconds
Max subdirectories	32,000	32,000	Unlimited (with dir_nlink)
Checksum protection	No	No	Metadata + optional journal
Online defragmentation	No	No	Yes

Key ext4 Innovations

1. Extent-based allocation: Instead of indirect blocks pointing to individual data blocks, ext4 uses extents—contiguous ranges of blocks described by (start, length) pairs. This dramatically reduces metadata overhead for large files and improves sequential I/O performance.

2. Delayed allocation: ext4 delays allocating disk blocks until data is actually written to disk (not just to the page cache). This allows the allocator to make better decisions about block placement, improving locality.

3. Journal checksumming: The journal now includes checksums to detect corruption, improving reliability beyond ext3's journaling.

4. Metadata checksums: ext4 can compute and verify checksums on all metadata structures, detecting silent corruption before it propagates.

5. 48-bit block addressing: Increases maximum volume size from 16 TB (32-bit) to 1 EB (exabyte), future-proofing for large storage systems.
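The limits in item 5 follow from simple arithmetic: the number of addressable blocks multiplied by the block size. The sketch below assumes 4 KiB blocks in its usage; the helper name `max_volume_bytes` is illustrative and does not come from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Largest volume, in bytes, addressable with addr_bits-bit block numbers.
 * Illustrative helper for this page; not a kernel function. */
static uint64_t max_volume_bytes(unsigned addr_bits, uint64_t block_size)
{
    assert(addr_bits < 64);
    /* Make sure the product fits in 64 bits. */
    assert(block_size <= (UINT64_MAX >> addr_bits));
    return ((uint64_t)1 << addr_bits) * block_size;
}
```

With 4096-byte blocks, 32-bit block numbers give 2^44 bytes (16 TiB) and 48-bit block numbers give 2^60 bytes (1 EiB).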

Backward Compatibility

ext4 was designed with backward compatibility in mind. The ext4 driver can mount an ext3 file system read-write, and most ext4 features are optional: they can be enabled individually (for example with tune2fs), and a feature such as extents then applies to newly created files while existing files keep their old block maps. This "upgrade path" approach reflects the pragmatic philosophy of Linux development.

On-Disk Layout

ext4 divides a partition into fixed-size block groups. This localization strategy keeps related data and metadata close together on disk, reducing seek times on rotational storage and improving cache efficiency.

Overall Structure

Block Groups

Each block group contains:

Superblock (optional backup copy)
Group descriptor table (describes all block groups)
Reserved GDT blocks (for future expansion)
Data block bitmap (one bit per data block: 0=free, 1=used)
Inode bitmap (one bit per inode: 0=free, 1=used)
Inode table (fixed-size array of inodes)
Data blocks (actual file content)

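The superblock and group descriptor table are replicated in multiple block groups for redundancy. With the sparse_super feature (standard on modern file systems), backups are kept only in groups 0 and 1 and in groups whose number is a power of 3, 5, or 7, rather than in every group. If the primary copy is corrupted, recovery tools such as e2fsck can use a backup.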
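The group geometry can be worked out from the block size alone. The following sketch assumes the default layout, where each group's block bitmap occupies exactly one block (so a group covers 8 × block size blocks) and the sparse_super feature is enabled; the function names are hypothetical, and features such as bigalloc change these numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bitmap block holds block_size * 8 bits, one bit per block in the group. */
static uint32_t blocks_per_group(uint32_t block_size)
{
    return block_size * 8;
}

/* Number of groups needed to cover total_blocks (the last group may be short). */
static uint64_t group_count(uint64_t total_blocks, uint32_t block_size)
{
    uint64_t per_group = blocks_per_group(block_size);
    assert(per_group > 0);
    return (total_blocks + per_group - 1) / per_group;
}

/* True if n is base^k for some k >= 0. */
static bool is_power_of(uint64_t n, uint64_t base)
{
    while (n > 1 && n % base == 0)
        n /= base;
    return n == 1;
}

/* With sparse_super, backup superblocks live in groups 0 and 1 and in
 * groups whose number is a power of 3, 5, or 7. */
static bool group_has_sb_backup(uint64_t group)
{
    if (group <= 1)
        return true;
    return is_power_of(group, 3) || is_power_of(group, 5) || is_power_of(group, 7);
}
```

For a 1 TiB volume with 4 KiB blocks this gives 32,768 blocks (128 MiB) per group and 8,192 groups, of which only a few dozen carry backup copies.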

Flex Block Groups

Modern ext4 uses 'flex_bg' (flexible block groups), which aggregates metadata from multiple block groups into a contiguous region. This reduces fragmentation of metadata and improves performance by allowing larger contiguous allocations for both metadata and data.

The Superblock

The superblock contains global file system information:

struct ext4_super_block (key fields)
C
struct ext4_super_block {
    __le32  s_inodes_count;           /* Total inode count */
    __le32  s_blocks_count_lo;        /* Total block count (low 32 bits) */
    __le32  s_r_blocks_count_lo;      /* Reserved block count */
    __le32  s_free_blocks_count_lo;   /* Free blocks count */
    __le32  s_free_inodes_count;      /* Free inodes count */
    __le32  s_first_data_block;       /* First data block */
    __le32  s_log_block_size;         /* Block size = 1024 << s_log_block_size */
    __le32  s_log_cluster_size;       /* Cluster size (for bigalloc) */
    __le32  s_blocks_per_group;       /* Blocks per block group */
    __le32  s_clusters_per_group;     /* Clusters per block group */
    __le32  s_inodes_per_group;       /* Inodes per block group */
    __le32  s_mtime;                  /* Last mount time */
    __le32  s_wtime;                  /* Last write time */
    __le16  s_mnt_count;              /* Mount count since fsck */
    __le16  s_max_mnt_count;          /* Max mounts before fsck */
    __le16  s_magic;                  /* Magic number (0xEF53) */
    __le16  s_state;                  /* File system state */
    __le16  s_errors;                 /* Error handling behavior */
    __le16  s_minor_rev_level;        /* Minor revision level */
    __le32  s_lastcheck;              /* Time of last fsck */
    __le32  s_checkinterval;          /* Max interval between checks */
    __le32  s_creator_os;             /* OS that created fs */
    __le32  s_rev_level;              /* Revision level */
    __le16  s_def_resuid;             /* Default uid for reserved blocks */
    __le16  s_def_resgid;             /* Default gid for reserved blocks */
    
    /* Fields below are valid only for dynamic-revision (EXT4_DYNAMIC_REV) superblocks */
    __le32  s_first_ino;              /* First non-reserved inode */
    __le16  s_inode_size;             /* Inode structure size */
    __le16  s_block_group_nr;         /* Block group of this superblock */
    __le32  s_feature_compat;         /* Compatible feature flags */
    __le32  s_feature_incompat;       /* Incompatible feature flags */
    __le32  s_feature_ro_compat;      /* Read-only compatible features */
    __u8    s_uuid[16];               /* 128-bit filesystem UUID */
    char    s_volume_name[16];        /* Volume name */
    char    s_last_mounted[64];       /* Last mount path */
    __le32  s_algorithm_usage_bitmap; /* For compression */
    
    /* Performance hints */
    __u8    s_prealloc_blocks;        /* Blocks to preallocate */
    __u8    s_prealloc_dir_blocks;    /* For directories */
    __le16  s_reserved_gdt_blocks;    /* Reserved GDT blocks */
    
    /* Journaling support */
    __u8    s_journal_uuid[16];       /* Journal UUID */
    __le32  s_journal_inum;           /* Journal inode number */
    __le32  s_journal_dev;            /* Journal device (if external) */
    __le32  s_last_orphan;            /* Orphan inode list head */
    __le32  s_hash_seed[4];           /* htree hash seed */
    __u8    s_def_hash_version;       /* Default hash version */
    __u8    s_jnl_backup_type;
    __le16  s_desc_size;              /* Group descriptor size */
    __le32  s_default_mount_opts;
    __le32  s_first_meta_bg;          /* First metablock group */
    __le32  s_mkfs_time;              /* Creation time */
    __le32  s_jnl_blocks[17];         /* Journal inode backup */
    
    /* 64-bit support */
    __le32  s_blocks_count_hi;        /* Block count (high 32 bits) */
    /* ... */
};
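To see how a few of these fields are used, the sketch below decodes the block size and the 64-bit block count from a raw 1024-byte superblock buffer (the primary superblock starts 1024 bytes into the device). The byte offsets (0x04, 0x18, 0x38, 0x60, 0x150) follow the commonly documented on-disk layout and should be checked against your kernel's fs/ext4/ext4.h; `sb_summary` and `sb_summarize` are hypothetical names for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EXT4_SUPER_MAGIC 0xEF53
#define INCOMPAT_64BIT   0x0080   /* EXT4_FEATURE_INCOMPAT_64BIT */

/* Read little-endian integers byte by byte, so this works on any host. */
static uint16_t rd_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

struct sb_summary {            /* hypothetical summary type for this sketch */
    uint32_t block_size;
    uint64_t blocks_count;
};

/* Returns false if the buffer does not look like an ext2/3/4 superblock. */
static bool sb_summarize(const uint8_t *sb, struct sb_summary *out)
{
    assert(sb != NULL && out != NULL);
    if (rd_le16(sb + 0x38) != EXT4_SUPER_MAGIC)       /* s_magic */
        return false;
    uint32_t log_bs = rd_le32(sb + 0x18);             /* s_log_block_size */
    if (log_bs > 6)                                   /* 64 KiB is the maximum */
        return false;
    out->block_size = 1024u << log_bs;
    out->blocks_count = rd_le32(sb + 0x04);           /* s_blocks_count_lo */
    if (rd_le32(sb + 0x60) & INCOMPAT_64BIT)          /* s_feature_incompat */
        out->blocks_count |= (uint64_t)rd_le32(sb + 0x150) << 32; /* s_blocks_count_hi */
    return true;
}
```

The high half of the block count is only meaningful when the 64bit feature flag is set, which is why the sketch checks the flag before combining the two halves.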

Feature Flags

The superblock contains three sets of feature flags that control backward compatibility:

Compatible features (feature_compat): Unknown flags can be ignored; file system remains fully usable
Incompatible features (feature_incompat): Unknown flags prevent mounting entirely
Read-only compatible (feature_ro_compat): Unknown flags allow read-only mounting but prevent writes

Important ext4 Feature Flags
Flag	Type	Description
`EXT4_FEATURE_INCOMPAT_EXTENTS`	Incompatible	Extent-based allocation
`EXT4_FEATURE_INCOMPAT_64BIT`	Incompatible	64-bit block numbers
`EXT4_FEATURE_INCOMPAT_FLEX_BG`	Incompatible	Flexible block groups
`EXT4_FEATURE_RO_COMPAT_METADATA_CSUM`	RO-compat	Metadata checksums
`EXT4_FEATURE_COMPAT_HAS_JOURNAL`	Compatible	Has journal
`EXT4_FEATURE_INCOMPAT_FILETYPE`	Incompatible	Directory entries store file type
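The three-way rule above can be sketched as a small decision function. The flag values match the kernel's definitions; `allowed_mount_mode` and the `mount_mode` enum are hypothetical names, and a real driver performs many additional checks.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as defined in the kernel's fs/ext4/ext4.h. */
#define EXT4_FEATURE_INCOMPAT_FILETYPE        0x0002
#define EXT4_FEATURE_INCOMPAT_EXTENTS         0x0040
#define EXT4_FEATURE_INCOMPAT_64BIT           0x0080
#define EXT4_FEATURE_INCOMPAT_FLEX_BG         0x0200
#define EXT4_FEATURE_RO_COMPAT_METADATA_CSUM  0x0400

enum mount_mode { MOUNT_REFUSED, MOUNT_READ_ONLY, MOUNT_READ_WRITE };

/* Decide how a driver that understands only the "known" flags may mount a
 * file system carrying the given flags. Unknown compat flags never matter,
 * so they are not even passed in. */
static enum mount_mode allowed_mount_mode(uint32_t incompat, uint32_t ro_compat,
                                          uint32_t known_incompat,
                                          uint32_t known_ro_compat)
{
    if (incompat & ~known_incompat)
        return MOUNT_REFUSED;       /* unknown incompatible feature */
    if (ro_compat & ~known_ro_compat)
        return MOUNT_READ_ONLY;     /* safe to read, unsafe to write */
    return MOUNT_READ_WRITE;
}
```

For example, a driver that knows only FILETYPE must refuse a file system that uses EXTENTS, because it could not interpret the i_block contents at all.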

The ext4 Inode Structure

The inode is the fundamental metadata structure in ext4. Unlike VFS's generic inode, the on-disk ext4 inode has a fixed format that stores all persistent file metadata.

struct ext4_inode (key fields)
C
struct ext4_inode {
    __le16  i_mode;           /* File mode (type + permissions) */
    __le16  i_uid;            /* Lower 16 bits of owner UID */
    __le32  i_size_lo;        /* Size in bytes (low 32 bits) */
    __le32  i_atime;          /* Access time */
    __le32  i_ctime;          /* Inode change time */
    __le32  i_mtime;          /* Modification time */
    __le32  i_dtime;          /* Deletion time */
    __le16  i_gid;            /* Lower 16 bits of group GID */
    __le16  i_links_count;    /* Hard link count */
    __le32  i_blocks_lo;      /* Block count (512-byte units) */
    __le32  i_flags;          /* Inode flags */
    
    union {
        struct {
            __le32  l_i_version;
        } linux1;
        /* Other OS variants... */
    } osd1;
    
    /* This is where the magic happens: */
    __le32  i_block[EXT4_N_BLOCKS];  /* Block pointers OR extent tree */
    
    __le32  i_generation;     /* NFS generation number */
    __le32  i_file_acl_lo;    /* File ACL (low 32 bits) */
    __le32  i_size_high;      /* Size in bytes (high 32 bits) */
    __le32  i_obso_faddr;     /* Obsolete fragment address */
    
    union {
        struct {
            __le16  l_i_blocks_high;   /* Block count (high 16 bits) */
            __le16  l_i_file_acl_high; /* File ACL (high 16 bits) */
            __le16  l_i_uid_high;      /* UID (high 16 bits) */
            __le16  l_i_gid_high;      /* GID (high 16 bits) */
            __le16  l_i_checksum_lo;   /* CRC32C checksum (low 16 bits) */
            __le16  l_i_reserved;
        } linux2;
        /* Other OS variants... */
    } osd2;
    
    __le16  i_extra_isize;    /* Extra inode space used */
    __le16  i_checksum_hi;    /* CRC32C checksum (high 16 bits) */
    __le32  i_ctime_extra;    /* Extra ctime (nanoseconds + epoch) */
    __le32  i_mtime_extra;    /* Extra mtime */
    __le32  i_atime_extra;    /* Extra atime */
    __le32  i_crtime;         /* Creation time */
    __le32  i_crtime_extra;   /* Extra creation time */
    __le32  i_version_hi;     /* High 32 bits of version */
    __le32  i_projid;         /* Project ID */
};
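The `*_extra` timestamp fields pack two things into 32 bits: the low 2 bits extend the seconds counter beyond 2038, and the upper 30 bits hold nanoseconds. A minimal sketch of the decoding, assuming the fields have already been converted to host byte order (`ts64` and `decode_time` are hypothetical names):

```c
#include <assert.h>
#include <stdint.h>

struct ts64 {                 /* hypothetical result type for this sketch */
    int64_t  sec;
    uint32_t nsec;
};

/* base:  the 32-bit seconds field (e.g. i_mtime), interpreted as signed.
 * extra: the matching *_extra field. Its low 2 bits extend the epoch and
 *        its upper 30 bits hold nanoseconds. */
static struct ts64 decode_time(uint32_t base, uint32_t extra)
{
    struct ts64 t;
    t.sec  = (int64_t)(int32_t)base + ((int64_t)(extra & 0x3u) << 32);
    t.nsec = extra >> 2;
    assert(t.nsec < 1000000000u || extra != 0);  /* all-zero extra is always valid */
    return t;
}
```

With epoch bits set to 1, a base value of 0x80000000 decodes to second 2,147,483,648, the first second after the 32-bit signed range ends in January 2038.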

The i_block Array: Direct Blocks vs Extents

The i_block array is 60 bytes (15 × 4-byte entries). In legacy ext2/ext3 mode, it holds:

12 direct block pointers
1 indirect block pointer
1 double-indirect block pointer
1 triple-indirect block pointer

In modern ext4 with extents enabled, this same 60-byte space holds an extent tree—a far more efficient structure for large files.

Extent Efficiency

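A single extent describes up to 32,768 contiguous blocks (128 MiB with 4 KiB blocks) in just 12 bytes of metadata, because its length field is 16 bits wide. A 100 GB file that is allocated contiguously therefore needs only about 800 extents, roughly 10 KB of mapping metadata. The legacy scheme would need about 26 million 4-byte block pointers spread across more than 25,000 indirect blocks, roughly 100 MB, just to map the same file.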
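The comparison can be checked with a few lines of arithmetic. This sketch assumes a perfectly contiguous file, written (not preallocated) extents capped at 32,768 blocks by the 16-bit length field, and 4-byte legacy block pointers; it counts only the lowest level of indirect blocks, so the true legacy overhead is slightly higher. The function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WRITTEN_EXTENT_BLOCKS 32768u  /* limit imposed by 16-bit ee_len */

/* Extents needed to map a perfectly contiguous file of `blocks` blocks. */
static uint64_t extents_needed(uint64_t blocks)
{
    return (blocks + MAX_WRITTEN_EXTENT_BLOCKS - 1) / MAX_WRITTEN_EXTENT_BLOCKS;
}

/* Pointer blocks the legacy scheme needs at the lowest indirect level.
 * The first 12 blocks are covered by direct pointers in the inode. */
static uint64_t legacy_pointer_blocks(uint64_t blocks, uint32_t block_size)
{
    uint64_t ptrs_per_block = block_size / 4;
    assert(ptrs_per_block > 0);
    uint64_t beyond_direct = blocks > 12 ? blocks - 12 : 0;
    return (beyond_direct + ptrs_per_block - 1) / ptrs_per_block;
}
```

For a 100 GiB file with 4 KiB blocks (26,214,400 blocks), this yields 800 extents versus 25,600 pointer blocks.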

Extent Trees in Detail

Extents are ext4's most important innovation. An extent is a contiguous range of physical blocks mapped to a contiguous range of logical blocks (file offsets). Extents are organized in a B-tree structure.

Extent Structure

Extent data structures
C
/* Header at the start of extent tree/leaf */
struct ext4_extent_header {
    __le16  eh_magic;      /* Magic number (0xF30A) */
    __le16  eh_entries;    /* Number of valid entries */
    __le16  eh_max;        /* Maximum capacity */
    __le16  eh_depth;      /* Tree depth (0 = leaf level) */
    __le32  eh_generation; /* Tree generation */
};
 
/* Internal node: points to lower level */
struct ext4_extent_idx {
    __le32  ei_block;      /* First logical block covered */
    __le32  ei_leaf_lo;    /* Physical block of child node (low) */
    __le16  ei_leaf_hi;    /* Physical block (high 16 bits) */
    __le16  ei_unused;
};
 
/* Leaf node: actual extent mapping */
struct ext4_extent {
    __le32  ee_block;      /* First logical block (file offset) */
    __le16  ee_len;        /* Number of blocks in extent */
    __le16  ee_start_hi;   /* Physical block (high 16 bits) */
    __le32  ee_start_lo;   /* Physical block (low 32 bits) */
};

Extent Tree Layout

For small files, the extent header and up to 4 extents fit directly in the inode's i_block array (60 bytes):

Extent header: 12 bytes
Each extent: 12 bytes
Maximum inline extents: 4 (fitting in 12 + 4×12 = 60 bytes)

For larger files with more than 4 extent regions, ext4 creates an extent tree:

The inode holds index entries pointing to external blocks
Each external block contains more index entries (internal nodes) or extent entries (leaf nodes)
The tree grows in depth as needed
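Resolving a file offset to a disk block ends with a search inside one leaf. The sketch below shows that final step only: a binary search over leaf entries sorted by first logical block. It uses an in-memory `struct extent` with host-byte-order fields and assumes written extents (no unwritten-length encoding); `leaf_lookup` is a hypothetical name, not the kernel's function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* In-memory copy of one leaf entry, already converted from little-endian. */
struct extent {
    uint32_t ee_block;     /* first logical block covered */
    uint16_t ee_len;       /* length in blocks (written extents only here) */
    uint16_t ee_start_hi;  /* physical start, high 16 bits */
    uint32_t ee_start_lo;  /* physical start, low 32 bits */
};

/* Combine the split 48-bit physical block number. */
static uint64_t extent_pstart(const struct extent *e)
{
    return ((uint64_t)e->ee_start_hi << 32) | e->ee_start_lo;
}

/* Binary-search a leaf sorted by ee_block. Returns true and sets *pblk if
 * lblk is mapped; returns false if lblk falls in a hole. */
static bool leaf_lookup(const struct extent *ex, size_t n,
                        uint32_t lblk, uint64_t *pblk)
{
    assert(pblk != NULL);
    size_t lo = 0, hi = n;                 /* search window [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ex[mid].ee_block <= lblk)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return false;                      /* lblk precedes the first extent */
    const struct extent *e = &ex[lo - 1];  /* last extent starting at or before lblk */
    if (lblk - e->ee_block >= e->ee_len)
        return false;                      /* past the end of that extent */
    *pblk = extent_pstart(e) + (lblk - e->ee_block);
    return true;
}
```

Index nodes are searched the same way, except that the match yields the physical block of a child node to read next rather than a data block.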

Unwritten (Preallocated) Extents

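The ee_len field also encodes whether an extent is unwritten, meaning its blocks are allocated but not yet written. Values up to 32,768 give the length of a written extent; larger values mark an unwritten extent whose length is ee_len minus 32,768. This supports fallocate() for preallocation: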

Blocks are reserved on disk (preventing ENOSPC later)
Reading returns zeros (without actual disk I/O)
Writing converts the extent to written

This is crucial for applications like databases that preallocate files to ensure contiguous allocation.
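The length encoding can be sketched as two small helpers. The threshold of 32,768 (1 << 15) follows the kernel's convention that a written extent may be exactly 32,768 blocks long while an unwritten one is limited to 32,767; the helper names below are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EXT_INIT_MAX_LEN 32768u   /* 1 << 15 */

/* ee_len <= 32768: written extent of that length.
 * ee_len >  32768: unwritten extent of length ee_len - 32768. */
static bool extent_is_unwritten(uint16_t ee_len)
{
    return ee_len > EXT_INIT_MAX_LEN;
}

static uint16_t extent_length(uint16_t ee_len)
{
    uint16_t len = extent_is_unwritten(ee_len)
                       ? (uint16_t)(ee_len - EXT_INIT_MAX_LEN)
                       : ee_len;
    assert(len <= EXT_INIT_MAX_LEN);
    return len;
}
```

This is why the value 32,768 itself is not treated as "high bit set, length zero": it is the largest written extent.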

Hole Punching

ext4 also supports 'hole punching'—deallocating blocks in the middle of a file without changing its size. This is the opposite of preallocation and enables space reclamation in virtual disk images and databases. Holes appear as sparse regions that read as zeros.

Directory Structure: Linear and HTree

Directories in ext4 are stored as files containing directory entries—records that map filenames to inode numbers. ext4 supports two directory formats:

Linear Directories (Legacy)

Small directories use a simple linear format where entries are packed sequentially:

struct ext4_dir_entry_2
C
struct ext4_dir_entry_2 {
    __le32  inode;        /* Inode number */
    __le16  rec_len;      /* Directory entry length */
    __u8    name_len;     /* Name length */
    __u8    file_type;    /* File type (regular, dir, symlink, etc.) */
    char    name[EXT4_NAME_LEN];  /* File name (up to 255 bytes) */
};
 
/* File types (stored in file_type field) */
#define EXT4_FT_UNKNOWN   0
#define EXT4_FT_REG_FILE  1
#define EXT4_FT_DIR       2
#define EXT4_FT_CHRDEV    3
#define EXT4_FT_BLKDEV    4
#define EXT4_FT_FIFO      5
#define EXT4_FT_SOCK      6
#define EXT4_FT_SYMLINK   7

The rec_len field allows variable-length entries and handles deletions by expanding the previous entry's rec_len to span the deleted space.
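A linear directory block is walked by hopping from entry to entry using rec_len. The sketch below counts live entries in one raw block; it assumes block sizes below 64 KiB (where rec_len is stored directly) and treats an inode number of 0 as an unused entry. The names `dirent_min_rec_len` and `count_live_entries` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint16_t rd_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Smallest rec_len that can hold a name: 8-byte header plus the name,
 * rounded up to a multiple of 4. */
static uint16_t dirent_min_rec_len(uint8_t name_len)
{
    return (uint16_t)((8u + name_len + 3u) & ~3u);
}

/* Count live entries (inode != 0) in one directory block.
 * Returns -1 if an entry's rec_len is malformed. */
static int count_live_entries(const uint8_t *block, size_t block_size)
{
    assert(block != NULL);
    size_t off = 0;
    int live = 0;
    while (off + 8 <= block_size) {
        uint32_t inode    = rd_le32(block + off);
        uint16_t rec_len  = rd_le16(block + off + 4);
        uint8_t  name_len = block[off + 6];
        if (rec_len < dirent_min_rec_len(name_len) || rec_len % 4 != 0 ||
            off + rec_len > block_size)
            return -1;
        if (inode != 0)
            live++;
        off += rec_len;   /* a large rec_len silently skips deleted space */
    }
    return live;
}
```

Because the walk trusts rec_len to find the next entry, a deleted entry costs nothing to skip: its space is simply part of the previous entry's record.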

HTree Directories (Indexed)

For directories with many entries, an O(n) linear search becomes too slow. ext4 therefore uses HTree (hashed tree) indexing, a B-tree variant keyed by filename hash.

HTree characteristics:

Uses a shallow hash-based tree: at most two index levels by default (three with the largedir feature)
Root and internal nodes contain hash ranges and block pointers
Leaf blocks contain the actual directory entries
Hash collisions handled by linear search within leaf blocks
Backward compatible: appears as linear directory to older kernels
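The lookup inside one index node can be sketched as a binary search for the last entry whose hash is less than or equal to the target hash. This is a simplified model: entries are held in host byte order, entry 0 is given hash 0 to act as the catch-all lower bound, and the collision-continuation handling of the real implementation is omitted. `dx_find_block` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified in-memory index entry: a hash lower bound and the directory
 * block that holds names hashing at or above it. */
struct dx_entry {
    uint32_t hash;
    uint32_t block;
};

/* entries[] is sorted by hash, and entries[0] covers every hash below
 * entries[1].hash. Returns the block of the last entry with hash <= target. */
static uint32_t dx_find_block(const struct dx_entry *entries, size_t count,
                              uint32_t target)
{
    assert(entries != NULL && count > 0);
    size_t lo = 1, hi = count;        /* entry 0 always qualifies */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (entries[mid].hash <= target)
            lo = mid + 1;
        else
            hi = mid;
    }
    return entries[lo - 1].block;
}
```

Once the leaf block is identified, the name is found by the ordinary linear scan of that single block, which is what keeps lookups fast even in directories with millions of entries.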

Hash Collision Attacks

The directory hash can be targeted by adversarial inputs—creating many files with the same hash forces O(n) lookup within leaf blocks. ext4 uses a keyed hash (half_md4 or tea with a per-filesystem seed) to mitigate this, but be aware when allowing untrusted users to create arbitrary filenames.

Journaling: Crash Consistency

File systems face a fundamental problem: operations that modify multiple blocks (writing a file, creating a directory) cannot be atomic at the hardware level. A crash mid-operation leaves the file system in an inconsistent state.

Journaling solves this by writing a record of intended changes to a dedicated journal area before applying them to the main file system. After a crash, recovery replays every transaction that was fully committed to the journal and discards any that were not, so each operation takes effect completely or not at all.
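The recovery rule can be illustrated with a toy write-ahead log. This is a deliberately simplified model, not the jbd2 on-disk format: each record is either a logged block or a commit marker, a "block" is a single byte, and the names (`journal_rec`, `journal_replay`) are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum rec_type { REC_BLOCK, REC_COMMIT };

struct journal_rec {          /* toy record; real jbd2 uses descriptor blocks */
    enum rec_type type;
    uint32_t tid;             /* transaction ID */
    uint32_t blocknr;         /* target block (REC_BLOCK only) */
    uint8_t  value;           /* stand-in for the block's contents */
};

/* A transaction counts only if its commit record reached the log. */
static bool tid_committed(const struct journal_rec *log, size_t n, uint32_t tid)
{
    for (size_t i = 0; i < n; i++)
        if (log[i].type == REC_COMMIT && log[i].tid == tid)
            return true;
    return false;
}

/* Replay committed transactions in log order onto `disk`.
 * Returns the number of blocks written. */
static size_t journal_replay(const struct journal_rec *log, size_t n,
                             uint8_t *disk, size_t disk_blocks)
{
    assert(disk != NULL);
    size_t applied = 0;
    for (size_t i = 0; i < n; i++) {
        if (log[i].type != REC_BLOCK || log[i].blocknr >= disk_blocks)
            continue;
        if (!tid_committed(log, n, log[i].tid))
            continue;         /* crash hit before commit: discard */
        disk[log[i].blocknr] = log[i].value;
        applied++;
    }
    return applied;
}
```

If the log holds two blocks and a commit record for transaction 1, followed by a lone block for transaction 2, replay applies only the two blocks of transaction 1; the half-finished transaction 2 leaves no trace.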

Journal Modes

ext4 supports three journaling modes:

ext4 Journal Modes
Mode	What's Journaled	Performance	Safety
journal	Metadata + data	Slowest (all data written twice)	Highest (complete crash protection)
ordered (default)	Metadata only, data forced before commit	Moderate	Good (prevents stale data exposure)
writeback	Metadata only, no data ordering	Fastest	Lowest (may expose stale data after crash)

Transaction Lifecycle

A transaction groups multiple file system operations:

Journal transaction flow
C
/* Simplified transaction lifecycle */
 
/* 1. Start a new transaction (or join running one) */
handle_t *handle = ext4_journal_start(inode, EXT4_HT_INODE, credits);
 
/* 2. Get buffer heads for modified metadata */
bh = sb_getblk(sb, block);
err = ext4_journal_get_write_access(handle, sb, bh, EXT4_JTR_NONE);
 
/* 3. Modify the buffer */
modify_metadata(bh);
 
/* 4. Mark buffer dirty in journal */
err = ext4_handle_dirty_metadata(handle, NULL, bh);
 
/* 5. Stop (complete our portion of transaction) */
ext4_journal_stop(handle);
 
/* 6. Eventually, jbd2 commits the transaction:
 *    a. Write the logged blocks to the journal area
 *    b. Flush them to stable storage
 *    c. Write the commit record and flush it
 *    d. Mark the transaction committed
 *    e. Later: checkpoint (write blocks to their final locations,
 *       then release the journal space)
 */

The jbd2 Layer

ext4 uses jbd2 (journaling block device, version 2) as its journaling layer. jbd2 is a general-purpose block journaling library, separate from ext4 itself. This separation allows:

Shared code with other journaling file systems
Independent development and testing
Clean abstraction between file system logic and journaling mechanics

The journal is typically stored in a hidden inode (inode 8 by default) called the journal inode.

Journal Checksumming

Modern ext4 enables journal checksumming (journal_checksum or journal_checksum_v3). Each transaction includes CRC32C checksums of all journal blocks. During recovery, corrupted transactions are discarded rather than replayed, preventing corruption propagation.
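For reference, CRC32C (the Castagnoli polynomial) can be computed with a short bitwise loop. This sketch implements the standard variant with an initial value and final inversion of 0xFFFFFFFF; the kernel uses table-driven or hardware-accelerated code and applies its own seeding, so values produced here are not expected to match on-disk checksums directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
 * Slow but compact; for illustration only. */
static uint32_t crc32c(const void *data, size_t len)
{
    assert(data != NULL || len == 0);
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++) {
            /* If the low bit is set, shift and XOR in the polynomial. */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0x82F63B78u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

The standard check value for the nine ASCII bytes "123456789" is 0xE3069283, which makes a convenient self-test. Any single flipped bit in a journal block changes the checksum, which is how recovery recognizes a damaged transaction.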

Block Allocation Strategies

ext4's block allocator determines where data is placed on disk. Good allocation decisions are critical for performance—contiguous allocation enables efficient sequential I/O, while fragmented allocation causes excessive seeking.

Multi-Block Allocator (mballoc)

ext4 uses the multi-block allocator, which can allocate many blocks in a single operation (unlike ext3's single-block allocator). Key features:

1. Buddy Bitmap Allocator

Maintains per-group buddy bitmaps for fast free-extent lookup
Can quickly find contiguous extents of various sizes
Fast lookup of free extents at power-of-two sizes

2. Preallocation

Per-inode preallocation: reserves blocks beyond current file size
Locality group preallocation: reserves blocks for multiple small files in same directory
Reduces fragmentation by anticipating future growth

3. Block Group Goals

Attempts to allocate new blocks near existing file data
Related files (same directory) allocated in same block group
Improves locality for common access patterns

Delayed Allocation

Delayed allocation (delalloc) is one of ext4's most important optimizations. Instead of allocating blocks when write() is called, ext4 delays allocation until the data is actually written to disk (usually during writeback).

Benefits:

Better block selection: allocator sees full write pattern before choosing blocks
Reduced fragmentation: small writes to same file can be allocated together
Fewer allocations: if file is deleted before writeback, no blocks are ever allocated
Better extent packing: adjacent logical blocks more likely to get contiguous physical blocks

Risks:

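Dirty data not yet written back is lost if the system crashes; applications that need durability must call fsync(), and ext4's auto_da_alloc heuristic reduces the risk for the common replace-via-rename and truncate patterns
Allocation errors can surface late: ext4 reserves space at write() time, but failures that still occur during writeback are reported afterwards (for example by fsync()), not by the write() call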
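The fragmentation benefit comes from seeing many dirty blocks at once. As a toy model, the sketch below groups a sorted list of dirty logical block numbers into contiguous runs, each of which could be satisfied by a single allocation request instead of one request per block. The types and names (`lrange`, `coalesce_dirty`) are invented for this example and do not mirror kernel structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct lrange {               /* a run of consecutive logical blocks */
    uint32_t start;
    uint32_t len;
};

/* dirty[] holds sorted, de-duplicated logical block numbers awaiting
 * writeback. Groups them into contiguous runs in out[].
 * Returns the number of runs produced (at most out_cap). */
static size_t coalesce_dirty(const uint32_t *dirty, size_t n,
                             struct lrange *out, size_t out_cap)
{
    assert(n == 0 || (dirty != NULL && out != NULL));
    size_t runs = 0;
    for (size_t i = 0; i < n; i++) {
        if (runs > 0 && out[runs - 1].start + out[runs - 1].len == dirty[i]) {
            out[runs - 1].len++;          /* extends the current run */
            continue;
        }
        if (runs == out_cap)
            break;                        /* caller's buffer is full */
        out[runs].start = dirty[i];
        out[runs].len = 1;
        runs++;
    }
    return runs;
}
```

Seven dirty blocks {0, 1, 2, 3, 10, 11, 40} collapse into three runs, so the allocator is asked for extents of 4, 2, and 1 blocks rather than seven single blocks.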

Block allocation path (simplified)
C
/* During write(): mark pages dirty, no allocation yet.
 * With delalloc enabled, the address-space operations are the
 * ext4_da_* variants. */
generic_perform_write(file, iov_iter, pos) {
    /* ... */
    ext4_da_write_begin(file, mapping, pos, len, &page, &fsdata);
    /* Just reserves space in delayed allocation state */
    /* No actual block allocation happens */
    ext4_da_write_end(file, mapping, pos, len, copied, page, fsdata);
}
 
/* During writeback: actual allocation */
ext4_writepages(mapping, wbc) {
    /* Find dirty pages that need allocation */
    mpage_prepare_extent_to_map(&mpd);
    
    /* Allocate blocks for the extent */
    ext4_map_blocks(handle, inode, &map, EXT4_GET_BLOCKS_CREATE);
    
    /* Submit I/O */
    mpage_map_and_submit_extent(handle, &mpd, &give_up_on_write);
}
 
/* ext4_map_blocks: the core allocation function */
ext4_map_blocks(handle, inode, map, flags) {
    /* Check extent cache first */
    if (ext4_es_lookup_extent(inode, map->m_lblk, &es)) {
        /* Found in extent status tree cache */
        return cached_result;
    }
    
    /* Look up in extent tree on disk */
    ext4_ext_map_blocks(handle, inode, map, flags);
    
    if (flags & EXT4_GET_BLOCKS_CREATE) {
        /* Allocate new blocks via mballoc */
        ext4_mb_new_blocks(handle, &ar, &err);
    }
}

Performance Tuning and Mount Options

ext4 provides numerous mount options for tuning performance and reliability tradeoffs. Understanding these options is crucial for optimizing specific workloads.

Important ext4 Mount Options
Option	Default	Description
`data=ordered`	Yes	Write data before committing metadata journal
`data=writeback`		No data ordering (faster, less safe)
`data=journal`		Journal all data (safest, slowest)
`noatime`		Don't update access times (avoids a metadata write on every read)
`relatime`	Yes	Update atime only if it is older than mtime/ctime or more than a day old (compromise)
`nodelalloc`		Disable delayed allocation
`barrier=1`	Yes	Use write barriers (required for safety)
`discard`		Enable TRIM for SSDs
`commit=N`	5s	Journal commit interval in seconds
`max_batch_time=N`	15000μs	Max time to batch journal commits
`journal_checksum`		Enable journal checksumming
`grpid`/`nogrpid`	nogrpid	Inherit directory group ID

SSD Optimization

For SSDs, consider:

mount -o discard,noatime /dev/sda1 /mnt

discard: Sends TRIM commands to the SSD when blocks are freed, improving garbage collection and wear leveling
noatime: Prevents write amplification from access time updates

Database Workloads

mount -o data=ordered,barrier=1,noatime,commit=60 /dev/sda1 /data

Longer commit interval reduces journal overhead, but up to 60 seconds of changes that were not fsynced can be lost in a crash
Barriers ensure durability guarantees
Databases typically manage their own fsync calls

Never Disable Barriers

The barrier=0 option is almost never appropriate. Storage with a volatile write cache may reorder writes, which can corrupt the journal after a crash or power loss. Unless your storage has a battery-backed write cache or guarantees write ordering, keep barriers enabled.

Summary: ext4's Design Excellence

We've explored ext4's complete internal architecture. Let's consolidate the key concepts:

Key Takeaways

• ext4 evolved from ext2/ext3 while maintaining backward compatibility, adding extents, delayed allocation, larger limits, and metadata checksums.
• Block groups organize the disk into localized units containing metadata (bitmaps, inode tables) and data blocks, improving locality.
• Extents replace indirect blocks for mapping files to disk blocks, drastically reducing metadata overhead and improving sequential performance.
• HTree directories use hash-based B-trees to provide O(log n) lookup in large directories while remaining backward-compatible.
• jbd2 journaling ensures crash consistency by logging metadata changes before applying them, with configurable data ordering modes.
• Delayed allocation and the multi-block allocator optimize block placement by deferring allocation decisions until writeback time.
• Mount options allow fine-tuning for different workloads, from SSDs to databases to general-purpose systems.

What's next:

With ext4's file system logic understood, we'll descend to the next layer: the Block I/O subsystem. The next page explores how the kernel translates file system requests into disk operations, covering the bio structure, I/O scheduling, request merging, and the multi-queue block layer.

Page Complete

You now understand ext4's complete architecture—from on-disk layout to journaling mechanics to block allocation strategies. This knowledge enables you to make informed decisions about file system configuration, debug performance issues, and understand what 'fsck' is actually doing when it repairs a file system.

2 / 5

Loading learning content...

Operating SystemsLinux File Systems

Linux File Systems

LevelAdvanced

Duration120 mins

TopicLinux File Systems

2 / 5

ext4 Internals

The Workhorse of Linux Storage

This page answers these questions, providing the deep understanding that distinguishes file system engineers from file system users.

What You Will Learn

Evolution from ext2/ext3

ext4 didn't emerge in isolation—it evolved from its predecessors, inheriting their on-disk layout while adding crucial features. Understanding this evolution illuminates ext4's design decisions.

ext Family Evolution
Feature	ext2	ext3	ext4
Journaling	No	Yes	Yes (enhanced)
Max file size	2 TB	2 TB	16 TB
Max volume size	32 TB	32 TB	1 EB
Block allocation	Block bitmap	Block bitmap	Extents + delayed
Timestamp granularity	1 second	1 second	Nanoseconds
Max subdirectories	32,000	32,000	Unlimited
Checksum protection	No	No	Metadata + optional journal
Online defragmentation	No	No	Yes

Key ext4 Innovations

3. Journal checksumming: The journal now includes checksums to detect corruption, improving reliability beyond ext3's journaling.

4. Metadata checksums: ext4 can compute and verify checksums on all metadata structures, detecting silent corruption before it propagates.

5. 48-bit block addressing: Increases maximum volume size from 16 TB (32-bit) to 1 EB (exabyte), future-proofing for large storage systems.

Backward Compatibility

On-Disk Layout

Overall Structure

Converting Mermaid diagram...

Block Groups

Each block group contains:

Superblock (optional backup copy)
Group descriptor table (describes all block groups)
Reserved GDT blocks (for future expansion)
Data block bitmap (one bit per data block: 0=free, 1=used)
Inode bitmap (one bit per inode: 0=free, 1=used)
Inode table (fixed-size array of inodes)
Data blocks (actual file content)

The superblock and group descriptor table are replicated across block groups for redundancy. If the primary copy is corrupted, recovery tools can use backups.

Flex Block Groups

The Superblock

The superblock contains global file system information:

struct ext4_super_block (key fields)
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
struct ext4_super_block {
    __le32  s_inodes_count;           /* Total inode count */
    __le32  s_blocks_count_lo;        /* Total block count (low 32 bits) */
    __le32  s_r_blocks_count_lo;      /* Reserved block count */
    __le32  s_free_blocks_count_lo;   /* Free blocks count */
    __le32  s_free_inodes_count;      /* Free inodes count */
    __le32  s_first_data_block;       /* First data block */
    __le32  s_log_block_size;         /* Block size = 1024 << s_log_block_size */
    __le32  s_log_cluster_size;       /* Cluster size (for bigalloc) */
    __le32  s_blocks_per_group;       /* Blocks per block group */
    __le32  s_clusters_per_group;     /* Clusters per block group */
    __le32  s_inodes_per_group;       /* Inodes per block group */
    __le32  s_mtime;                  /* Last mount time */
    __le32  s_wtime;                  /* Last write time */
    __le16  s_mnt_count;              /* Mount count since fsck */
    __le16  s_max_mnt_count;          /* Max mounts before fsck */
    __le16  s_magic;                  /* Magic number (0xEF53) */
    __le16  s_state;                  /* File system state */
    __le16  s_errors;                 /* Error handling behavior */
    __le16  s_minor_rev_level;        /* Minor revision level */
    __le32  s_lastcheck;              /* Time of last fsck */
    __le32  s_checkinterval;          /* Max interval between checks */
    __le32  s_creator_os;             /* OS that created fs */
    __le32  s_rev_level;              /* Revision level */
    __le16  s_def_resuid;             /* Default uid for reserved blocks */
    __le16  s_def_resgid;             /* Default gid for reserved blocks */
    
    /* ext4-specific fields follow... */
    __le32  s_first_ino;              /* First non-reserved inode */
    __le16  s_inode_size;             /* Inode structure size */
    __le16  s_block_group_nr;         /* Block group of this superblock */
    __le32  s_feature_compat;         /* Compatible feature flags */
    __le32  s_feature_incompat;       /* Incompatible feature flags */
    __le32  s_feature_ro_compat;      /* Read-only compatible features */
    __u8    s_uuid[16];               /* 128-bit filesystem UUID */
    char    s_volume_name[16];        /* Volume name */
    char    s_last_mounted[64];       /* Last mount path */
    __le32  s_algorithm_usage_bitmap; /* For compression */
    
    /* Performance hints */
    __u8    s_prealloc_blocks;        /* Blocks to preallocate */
    __u8    s_prealloc_dir_blocks;    /* For directories */
    __le16  s_reserved_gdt_blocks;    /* Reserved GDT blocks */
    
    /* Journaling support */
    __u8    s_journal_uuid[16];       /* Journal UUID */
    __le32  s_journal_inum;           /* Journal inode number */
    __le32  s_journal_dev;            /* Journal device (if external) */
    __le32  s_last_orphan;            /* Orphan inode list head */
    __le32  s_hash_seed[4];           /* htree hash seed */
    __u8    s_def_hash_version;       /* Default hash version */
    __u8    s_jnl_backup_type;
    __le16  s_desc_size;              /* Group descriptor size */
    __le32  s_default_mount_opts;
    __le32  s_first_meta_bg;          /* First metablock group */
    __le32  s_mkfs_time;              /* Creation time */
    __le32  s_jnl_blocks[17];         /* Journal inode backup */
    
    /* 64-bit support */
    __le32  s_blocks_count_hi;        /* Block count (high 32 bits) */
    /* ... */
};

Feature Flags

The superblock contains three sets of feature flags that control backward compatibility:

Compatible features (feature_compat): Unknown flags can be ignored; file system remains fully usable
Incompatible features (feature_incompat): Unknown flags prevent mounting entirely
Read-only compatible (feature_ro_compat): Unknown flags allow read-only mounting but prevent writes

Important ext4 Feature Flags
Flag	Type	Description
`EXT4_FEATURE_INCOMPAT_EXTENTS`	Incompatible	Extent-based allocation
`EXT4_FEATURE_INCOMPAT_64BIT`	Incompatible	64-bit block numbers
`EXT4_FEATURE_INCOMPAT_FLEX_BG`	Incompatible	Flexible block groups
`EXT4_FEATURE_RO_COMPAT_METADATA_CSUM`	RO-compat	Metadata checksums
`EXT4_FEATURE_COMPAT_HAS_JOURNAL`	Compatible	Has journal
`EXT4_FEATURE_INCOMPAT_FILETYPE`	Incompatible	Directory entries store file type
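
The three-way rule above can be sketched as a small decision function. This is a minimal illustration, not the kernel's mount path: the flag values are the ones I understand the kernel headers to define, while `SUPPORTED_INCOMPAT`, `SUPPORTED_RO_COMPAT`, `enum mount_mode`, and `decide_mount` are hypothetical names describing a toy driver that understands only the flags in the table.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as defined in the kernel's ext4 headers (assumed current). */
#define EXT4_FEATURE_COMPAT_HAS_JOURNAL       0x0004u
#define EXT4_FEATURE_INCOMPAT_FILETYPE        0x0002u
#define EXT4_FEATURE_INCOMPAT_EXTENTS         0x0040u
#define EXT4_FEATURE_INCOMPAT_64BIT           0x0080u
#define EXT4_FEATURE_INCOMPAT_FLEX_BG         0x0200u
#define EXT4_FEATURE_RO_COMPAT_METADATA_CSUM  0x0400u

/* Hypothetical: the feature sets a toy driver knows how to handle. */
#define SUPPORTED_INCOMPAT  (EXT4_FEATURE_INCOMPAT_FILETYPE | \
                             EXT4_FEATURE_INCOMPAT_EXTENTS  | \
                             EXT4_FEATURE_INCOMPAT_64BIT    | \
                             EXT4_FEATURE_INCOMPAT_FLEX_BG)
#define SUPPORTED_RO_COMPAT (EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)

enum mount_mode { MOUNT_REFUSED, MOUNT_READ_ONLY, MOUNT_READ_WRITE };

/* Decide how far a mount may proceed given the superblock's flag words. */
enum mount_mode decide_mount(uint32_t compat, uint32_t incompat,
                             uint32_t ro_compat)
{
    (void)compat;                         /* unknown compat flags never block */
    if (incompat & ~SUPPORTED_INCOMPAT)
        return MOUNT_REFUSED;             /* unknown incompat flag: no mount */
    if (ro_compat & ~SUPPORTED_RO_COMPAT)
        return MOUNT_READ_ONLY;           /* unknown ro_compat flag: reads only */
    return MOUNT_READ_WRITE;
}
```

Calling `decide_mount(0, 0x8000u, 0)` yields `MOUNT_REFUSED` because bit 0x8000 is outside the toy driver's supported incompatible set, whereas any value in `compat` leaves the result unchanged.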

The ext4 Inode Structure

The inode is the fundamental metadata structure in ext4. Unlike VFS's generic inode, the on-disk ext4 inode has a fixed format that stores all persistent file metadata.

struct ext4_inode (key fields)
C
struct ext4_inode {
    __le16  i_mode;           /* File mode (type + permissions) */
    __le16  i_uid;            /* Lower 16 bits of owner UID */
    __le32  i_size_lo;        /* Size in bytes (low 32 bits) */
    __le32  i_atime;          /* Access time */
    __le32  i_ctime;          /* Inode change time */
    __le32  i_mtime;          /* Modification time */
    __le32  i_dtime;          /* Deletion time */
    __le16  i_gid;            /* Lower 16 bits of group GID */
    __le16  i_links_count;    /* Hard link count */
    __le32  i_blocks_lo;      /* Block count (512-byte units) */
    __le32  i_flags;          /* Inode flags */
    
    union {
        struct {
            __le32  l_i_version;       /* Inode version (low 32 bits) */
        } linux1;
        /* Other OS variants... */
    } osd1;
    
    /* This is where the magic happens: */
    __le32  i_block[EXT4_N_BLOCKS];  /* Block pointers OR extent tree */
    
    __le32  i_generation;     /* NFS generation number */
    __le32  i_file_acl_lo;    /* File ACL (low 32 bits) */
    __le32  i_size_high;      /* Size in bytes (high 32 bits) */
    __le32  i_obso_faddr;     /* Obsolete fragment address */
    
    union {
        struct {
            __le16  l_i_blocks_high;   /* Block count (high 16 bits) */
            __le16  l_i_file_acl_high; /* File ACL (high 16 bits) */
            __le16  l_i_uid_high;      /* UID (high 16 bits) */
            __le16  l_i_gid_high;      /* GID (high 16 bits) */
            __le16  l_i_checksum_lo;   /* CRC32C checksum (low 16 bits) */
            __le16  l_i_reserved;
        } linux2;
        /* Other OS variants... */
    } osd2;
    
    __le16  i_extra_isize;    /* Extra inode space used */
    __le16  i_checksum_hi;    /* CRC32C checksum (high 16 bits) */
    __le32  i_ctime_extra;    /* Extra ctime (nanoseconds + epoch) */
    __le32  i_mtime_extra;    /* Extra mtime */
    __le32  i_atime_extra;    /* Extra atime */
    __le32  i_crtime;         /* Creation time */
    __le32  i_crtime_extra;   /* Extra creation time */
    __le32  i_version_hi;     /* High 32 bits of version */
    __le32  i_projid;         /* Project ID */
};
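
Several inode fields are split into low and high halves, and the `*_extra` timestamp fields pack two things into one word. The sketch below shows one way to reassemble them, assuming the values have already been converted from little-endian to host order. The helper names (`inode_size`, `decode_time`, `struct decoded_time`) are hypothetical; the bit layout (low 2 bits extend the seconds, upper 30 bits hold nanoseconds) follows my reading of the kernel's on-disk format documentation.

```c
#include <assert.h>
#include <stdint.h>

#define EXT4_EPOCH_BITS 2
#define EXT4_EPOCH_MASK ((1u << EXT4_EPOCH_BITS) - 1)

/* Hypothetical result type for a decoded inode timestamp. */
struct decoded_time {
    int64_t  sec;   /* seconds since the Unix epoch */
    uint32_t nsec;  /* nanoseconds within the second */
};

/* Combine i_size_lo and i_size_high into the 64-bit file size. */
uint64_t inode_size(uint32_t size_lo, uint32_t size_high)
{
    return ((uint64_t)size_high << 32) | size_lo;
}

/* Combine a base timestamp (e.g. i_mtime) with its *_extra word. */
struct decoded_time decode_time(uint32_t base, uint32_t extra)
{
    struct decoded_time t;
    /* The base field is a signed 32-bit count; the two epoch bits add
     * multiples of 2^32 seconds on top of it. */
    t.sec  = (int64_t)(int32_t)base +
             ((int64_t)(extra & EXT4_EPOCH_MASK) << 32);
    t.nsec = extra >> EXT4_EPOCH_BITS;
    return t;
}
```

The same low/high pattern applies to `i_blocks_lo` with `l_i_blocks_high` and to the UID and GID halves, only with different widths.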

The i_block Array: Direct Blocks vs Extents

The i_block array is 60 bytes (15 × 4-byte entries). In legacy ext2/ext3 mode, it holds:

12 direct block pointers
1 indirect block pointer
1 double-indirect block pointer
1 triple-indirect block pointer

In modern ext4 with extents enabled, this same 60-byte space holds an extent tree—a far more efficient structure for large files.
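
To make the cost of the legacy scheme concrete, the following sketch counts how many blocks the 15 pointers can address and how many indirect (metadata) blocks a dense file needs. It is a back-of-the-envelope model, assuming 4-byte block pointers, a file without holes, and a size within the addressable range; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Data blocks addressable through 12 direct + 3 indirect pointers. */
uint64_t legacy_max_blocks(uint32_t block_size)
{
    assert(block_size >= 1024 && block_size % 4 == 0);
    uint64_t p = block_size / 4;          /* pointers per indirect block */
    return 12 + p + p * p + p * p * p;
}

/* Indirect blocks needed to map n contiguous data blocks (no holes). */
uint64_t legacy_indirect_blocks(uint64_t n, uint32_t block_size)
{
    uint64_t p = block_size / 4, meta = 0;

    if (n <= 12)
        return 0;                         /* direct pointers suffice */
    n -= 12;

    meta += 1;                            /* the single-indirect block */
    if (n <= p)
        return meta;
    n -= p;

    uint64_t in_double = n < p * p ? n : p * p;
    meta += 1 + (in_double + p - 1) / p;  /* double-indirect top + leaves */
    n -= in_double;
    if (n == 0)
        return meta;

    uint64_t leaves = (n + p - 1) / p;    /* triple-indirect leaf blocks */
    meta += 1 + (leaves + p - 1) / p + leaves;
    return meta;
}
```

With 4 KiB blocks, a dense 1 GiB file (262,144 data blocks) needs 257 indirect blocks under this model, while the same file stored contiguously can be described by a handful of extents held entirely inside the inode.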


Extent Efficiency

Extent Trees in Detail

Extent Structure

Extent data structures
C
/* Header at the start of extent tree/leaf */
struct ext4_extent_header {
    __le16  eh_magic;      /* Magic number (0xF30A) */
    __le16  eh_entries;    /* Number of valid entries */
    __le16  eh_max;        /* Maximum capacity */
    __le16  eh_depth;      /* Tree depth (0 = leaf level) */
    __le32  eh_generation; /* Tree generation */
};
 
/* Internal node: points to lower level */
struct ext4_extent_idx {
    __le32  ei_block;      /* First logical block covered */
    __le32  ei_leaf_lo;    /* Physical block of child node (low) */
    __le16  ei_leaf_hi;    /* Physical block (high 16 bits) */
    __le16  ei_unused;
};
 
/* Leaf node: actual extent mapping */
struct ext4_extent {
    __le32  ee_block;      /* First logical block (file offset) */
    __le16  ee_len;        /* Number of blocks in extent */
    __le16  ee_start_hi;   /* Physical block (high 16 bits) */
    __le32  ee_start_lo;   /* Physical block (low 32 bits) */
};

Extent Tree Layout

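When a file needs four or fewer extents, the extent header and the extents themselves fit directly in the inode's i_block array (60 bytes). What matters is the number of contiguous regions, not the file size: since one extent covers at most 32,768 blocks (128 MiB with 4 KiB blocks), an unfragmented file of up to 512 MiB can be mapped entirely from the inode.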

Extent header: 12 bytes
Each extent: 12 bytes
Maximum inline extents: 4 (fitting in 12 + 4×12 = 60 bytes)
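
The arithmetic above can be checked mechanically. The sketch below mirrors the three on-disk structures with fixed-width host types (names shortened; endianness conversion omitted as an assumption) and computes how many entries fit in a node of a given size. `extent_node_capacity` and `extent_start` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Host-order mirrors of the on-disk structures shown earlier. */
struct extent_header { uint16_t magic, entries, max, depth; uint32_t generation; };
struct extent        { uint32_t block; uint16_t len, start_hi; uint32_t start_lo; };
struct extent_idx    { uint32_t block, leaf_lo; uint16_t leaf_hi, unused; };

/* All three structures are exactly 12 bytes, so any node is a header
 * followed by a flat array of 12-byte entries. */
_Static_assert(sizeof(struct extent_header) == 12, "header must be 12 bytes");
_Static_assert(sizeof(struct extent) == 12, "extent must be 12 bytes");
_Static_assert(sizeof(struct extent_idx) == 12, "index must be 12 bytes");

/* Number of entries that fit after the header in a node of node_bytes. */
uint32_t extent_node_capacity(uint32_t node_bytes)
{
    assert(node_bytes >= sizeof(struct extent_header));
    return (uint32_t)((node_bytes - sizeof(struct extent_header)) /
                      sizeof(struct extent));
}

/* Reassemble the 48-bit physical start block from its two halves. */
uint64_t extent_start(const struct extent *e)
{
    return ((uint64_t)e->start_hi << 32) | e->start_lo;
}
```

`extent_node_capacity(60)` gives the 4 inline entries, and `extent_node_capacity(4096)` gives 340 entries for a tree node stored in a 4 KiB block.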

For larger files with more than 4 extent regions, ext4 creates an extent tree:

The inode holds index entries pointing to external blocks
Each external block contains more index entries (internal nodes) or extent entries (leaf nodes)
The tree grows in depth as needed
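
At every level of the tree, the lookup step is the same: find the last entry whose first logical block is at or before the block being sought. A minimal sketch of that step for a leaf node follows, assuming entries are sorted by logical block, all extents are written (length at most 32,768), and physical block 0 can serve as a "hole" sentinel. `leaf_lookup` is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-order mirror of struct ext4_extent. */
struct extent { uint32_t block; uint16_t len, start_hi; uint32_t start_lo; };

/* Map logical block lblk to a physical block using one sorted leaf.
 * Returns 0 when lblk falls in a hole (assumed sentinel). */
uint64_t leaf_lookup(const struct extent *leaf, size_t n, uint32_t lblk)
{
    size_t lo = 0, hi = n;                /* search window [lo, hi) */

    /* Binary search for the first extent starting after lblk. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (leaf[mid].block <= lblk)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return 0;                         /* lblk precedes every extent */

    const struct extent *e = &leaf[lo - 1];
    if (lblk - e->block >= e->len)
        return 0;                         /* past the end of e: a hole */

    uint64_t start = ((uint64_t)e->start_hi << 32) | e->start_lo;
    return start + (lblk - e->block);
}
```

In an index node the same search would select the child block to descend into instead of producing a physical data block.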


Unwritten (Preallocated) Extents

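The ee_len field also encodes whether an extent is unwritten, meaning its blocks are allocated but not yet written. Values up to 32768 describe a normal (initialized) extent of that many blocks; values above 32768 mark an unwritten extent whose real length is ee_len - 32768. This supports fallocate() for preallocation: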

Blocks are reserved on disk (preventing ENOSPC later)
Reading returns zeros (without actual disk I/O)
Writing converts the extent to written

This is crucial for applications like databases that preallocate files to ensure contiguous allocation.
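
Decoding the length field takes two small helpers. This sketch follows the kernel's convention as I understand it, where the test is a comparison against 32768 rather than a pure bit test, so a value of exactly 32768 still means a written extent of 32,768 blocks. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define EXT_INIT_MAX_LEN (1u << 15)       /* 32768 */

/* Nonzero if the raw ee_len value marks an unwritten extent. */
int extent_is_unwritten(uint16_t ee_len)
{
    return ee_len > EXT_INIT_MAX_LEN;
}

/* Number of blocks the extent really covers. */
uint16_t extent_actual_len(uint16_t ee_len)
{
    return ee_len <= EXT_INIT_MAX_LEN
               ? ee_len
               : (uint16_t)(ee_len - EXT_INIT_MAX_LEN);
}
```

One consequence is that an unwritten extent can cover at most 32,767 blocks, one fewer than a written extent.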

Hole Punching

Directory Structure: Linear and HTree

Directories in ext4 are stored as files containing directory entries—records that map filenames to inode numbers. ext4 supports two directory formats:

Linear Directories (Legacy)

Small directories use a simple linear format where entries are packed sequentially:

struct ext4_dir_entry_2
C
struct ext4_dir_entry_2 {
    __le32  inode;        /* Inode number */
    __le16  rec_len;      /* Directory entry length */
    __u8    name_len;     /* Name length */
    __u8    file_type;    /* File type (regular, dir, symlink, etc.) */
    char    name[EXT4_NAME_LEN];  /* File name (up to 255 bytes) */
};
 
/* File types (stored in file_type field) */
#define EXT4_FT_UNKNOWN   0
#define EXT4_FT_REG_FILE  1
#define EXT4_FT_DIR       2
#define EXT4_FT_CHRDEV    3
#define EXT4_FT_BLKDEV    4
#define EXT4_FT_FIFO      5
#define EXT4_FT_SOCK      6
#define EXT4_FT_SYMLINK   7

The rec_len field allows variable-length entries and handles deletions by expanding the previous entry's rec_len to span the deleted space.
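
The rec_len chaining can be sketched on an in-memory directory block. The code below is a simplified model, assuming a little-endian host so the 8-byte header can be copied directly, no directory checksum tail, and no HTree index. All function names, `struct dirent_hdr`, and `NOT_FOUND` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed 8-byte prefix of a directory entry; the name bytes follow it. */
struct dirent_hdr {
    uint32_t inode;
    uint16_t rec_len;
    uint8_t  name_len;
    uint8_t  file_type;
};

#define NOT_FOUND ((size_t)-1)

/* Write one entry at byte offset off (helper for building sample blocks). */
void put_entry(uint8_t *blk, size_t off, uint32_t inode,
               uint16_t rec_len, const char *name)
{
    struct dirent_hdr h = { inode, rec_len, (uint8_t)strlen(name),
                            1 /* EXT4_FT_REG_FILE */ };
    memcpy(blk + off, &h, sizeof h);
    memcpy(blk + off + sizeof h, name, h.name_len);
}

/* Walk the block by rec_len. Returns the offset of the entry called name
 * (or NOT_FOUND) and stores the offset of the entry before it in *prev. */
size_t dir_find(const uint8_t *blk, size_t blk_size, const char *name,
                size_t *prev)
{
    size_t want = strlen(name), off = 0;

    *prev = NOT_FOUND;
    while (off + sizeof(struct dirent_hdr) <= blk_size) {
        struct dirent_hdr h;
        memcpy(&h, blk + off, sizeof h);
        if (h.rec_len < sizeof h + h.name_len || off + h.rec_len > blk_size)
            break;                        /* corrupt entry: stop walking */
        if (h.inode != 0 && h.name_len == want &&
            memcmp(blk + off + sizeof h, name, want) == 0)
            return off;
        *prev = off;
        off += h.rec_len;                 /* jump to the next entry */
    }
    return NOT_FOUND;
}

/* Return the inode number for name, or 0 if it is not in this block. */
uint32_t dir_lookup(const uint8_t *blk, size_t blk_size, const char *name)
{
    size_t prev, off = dir_find(blk, blk_size, name, &prev);
    struct dirent_hdr h;

    if (off == NOT_FOUND)
        return 0;
    memcpy(&h, blk + off, sizeof h);
    return h.inode;
}

/* Delete name by letting the previous entry's rec_len swallow it. */
int dir_delete(uint8_t *blk, size_t blk_size, const char *name)
{
    size_t prev, off = dir_find(blk, blk_size, name, &prev);
    struct dirent_hdr h, p;

    if (off == NOT_FOUND)
        return -1;
    memcpy(&h, blk + off, sizeof h);
    if (prev == NOT_FOUND) {              /* first entry: mark it unused */
        h.inode = 0;
        memcpy(blk + off, &h, sizeof h);
    } else {
        memcpy(&p, blk + prev, sizeof p);
        p.rec_len = (uint16_t)(p.rec_len + h.rec_len);
        memcpy(blk + prev, &p, sizeof p);
    }
    return 0;
}
```

After a deletion the old name bytes are still present in the block, but the walk can no longer reach them because the preceding entry's rec_len now skips over that space.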

HTree Directories (Indexed)

A linear directory requires an O(n) scan for every lookup, which becomes expensive once a directory holds many entries. For such directories ext4 uses HTree (hashed tree) indexing, a B-tree variant keyed by filename hash.

HTree characteristics:

Uses a shallow hash-based tree: at most two index levels by default (three with the largedir feature)
Root and internal nodes contain hash ranges and block pointers
Leaf blocks contain the actual directory entries
Hash collisions handled by linear search within leaf blocks
Backward compatible: appears as linear directory to older kernels
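
The index search inside one HTree node can be sketched as a binary search over (hash, block) pairs. This is an illustrative model only: it assumes the entries are sorted by hash, that the first entry acts as the lower bound for all hashes, and that the filename hash has already been computed (the real hash functions are omitted). `struct dx_entry` mirrors the pair layout; `dx_find_leaf` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One index entry: hashes >= hash (and below the next entry) live in block. */
struct dx_entry {
    uint32_t hash;
    uint32_t block;
};

/* Pick the block to descend into for a given filename hash. */
uint32_t dx_find_leaf(const struct dx_entry *e, size_t count, uint32_t hash)
{
    assert(count > 0);
    size_t lo = 1, hi = count;            /* entry 0 is the implicit floor */

    /* Find the first entry whose hash is greater than the target. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (e[mid].hash <= hash)
            lo = mid + 1;
        else
            hi = mid;
    }
    return e[lo - 1].block;               /* last entry with hash <= target */
}
```

Once the leaf block is chosen, the entries inside it are scanned linearly, which is also how names with colliding hashes are told apart.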


Hash Collision Attacks

Journaling: Crash Consistency

Journal Modes

ext4 supports three journaling modes:

ext4 Journal Modes
Mode	What's Journaled	Performance	Safety
journal	Metadata + data	Slowest (all data written twice)	Highest (complete crash protection)
ordered (default)	Metadata only, data forced before commit	Moderate	Good (prevents stale data exposure)
writeback	Metadata only, no data ordering	Fastest	Lowest (may expose stale data after crash)

Transaction Lifecycle

A transaction groups multiple file system operations:

Journal transaction flow
C
/* Simplified transaction lifecycle */
 
/* 1. Start a new transaction (or join running one) */
handle_t *handle = ext4_journal_start(inode, EXT4_HT_INODE, credits);
 
/* 2. Get buffer heads for modified metadata */
bh = sb_getblk(sb, block);
err = ext4_journal_get_write_access(handle, sb, bh, EXT4_JTR_NONE);
 
/* 3. Modify the buffer */
modify_metadata(bh);
 
/* 4. Mark buffer dirty in journal */
err = ext4_handle_dirty_metadata(handle, NULL, bh);
 
/* 5. Stop (complete our portion of transaction) */
ext4_journal_stop(handle);
 
/* 6. Eventually, the transaction commits:
 *    a. Write all journal blocks to journal area
 *    b. Write commit record
 *    c. Flush the journal writes to stable storage
 *    d. Mark transaction committed
 *    e. Later: checkpoint (write actual blocks, release journal space)
 */
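
Why the commit record matters is easiest to see from the recovery side. The toy model below is not jbd2: it represents the journal as an in-memory array of records and the disk as an array of integers, and replays only those block records that are followed by a commit record for the same transaction. Every type and function name here is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum rec_type { REC_BLOCK, REC_COMMIT };

/* Toy journal record: a logged block image or a commit marker. */
struct jrec {
    enum rec_type type;
    uint32_t tid;       /* transaction ID */
    uint32_t blocknr;   /* target block on the toy disk (REC_BLOCK only) */
    uint32_t data;      /* new block contents (REC_BLOCK only) */
};

/* Replay committed transactions onto disk; returns how many were applied.
 * Records after the last commit belong to an incomplete transaction and
 * are ignored, which is what keeps the file system consistent. */
size_t journal_replay(const struct jrec *recs, size_t n,
                      uint32_t *disk, size_t disk_blocks)
{
    size_t replayed = 0, start = 0;

    for (size_t i = 0; i < n; i++) {
        if (recs[i].type != REC_COMMIT)
            continue;
        /* Commit found: apply this transaction's logged blocks in order. */
        for (size_t j = start; j < i; j++) {
            if (recs[j].type == REC_BLOCK &&
                recs[j].tid == recs[i].tid &&
                recs[j].blocknr < disk_blocks)
                disk[recs[j].blocknr] = recs[j].data;
        }
        start = i + 1;
        replayed++;
    }
    return replayed;
}
```

If a crash happens after the block records are written but before the commit record reaches the journal, replay applies nothing from that transaction, so the on-disk metadata never reflects a half-finished update.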


The jbd2 Layer

ext4 uses jbd2 (journaling block device, version 2) as its journaling layer. jbd2 is a general-purpose block journaling library, separate from ext4 itself. This separation allows:

Shared code with other journaling file systems
Independent development and testing
Clean abstraction between file system logic and journaling mechanics

The journal is typically stored in a hidden inode (inode 8 by default) called the journal inode.

Journal Checksumming

Block Allocation Strategies

Multi-Block Allocator (mballoc)

ext4 uses the multi-block allocator, which can allocate many blocks in a single operation (unlike ext3's single-block allocator). Key features:

1. Buddy Bitmap Allocator

Maintains per-group buddy bitmaps for fast free-extent lookup
Can quickly find contiguous extents of various sizes
O(1) allocation for common sizes

2. Preallocation

Per-inode preallocation: reserves blocks beyond current file size
Locality group preallocation: reserves blocks for multiple small files in same directory
Reduces fragmentation by anticipating future growth

3. Block Group Goals

Attempts to allocate new blocks near existing file data
Related files (same directory) allocated in same block group
Improves locality for common access patterns
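
The buddy idea can be illustrated with a function that finds a free, naturally aligned chunk of 2^order blocks in a block bitmap. This is only a sketch of the concept: mballoc caches buddy information per group so it does not need to rescan, whereas this version scans the bitmap on every call. It assumes the ext-style bit order (block n is bit n % 8 of byte n / 8, 1 = in use); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if block blk is marked in use in the bitmap. */
int block_in_use(const uint8_t *bitmap, uint32_t blk)
{
    return (bitmap[blk / 8] >> (blk % 8)) & 1;
}

/* Find the first free chunk of 2^order blocks aligned to its own size.
 * Returns the starting block number, or -1 if no such chunk exists. */
int64_t find_free_chunk(const uint8_t *bitmap, uint32_t nblocks,
                        unsigned order)
{
    assert(order < 31);
    uint32_t size = 1u << order;

    for (uint32_t start = 0; start + size <= nblocks; start += size) {
        uint32_t i = 0;
        while (i < size && !block_in_use(bitmap, start + i))
            i++;
        if (i == size)
            return start;                 /* every block in the chunk is free */
    }
    return -1;
}
```

A cached buddy structure answers the same question per order without touching the individual bits, which is where the fast path for common request sizes comes from.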

Delayed Allocation

With delayed allocation, write() only reserves space and marks pages dirty; the physical blocks are chosen later, at writeback time.

Benefits:

Better block selection: allocator sees full write pattern before choosing blocks
Reduced fragmentation: small writes to same file can be allocated together
Fewer allocations: if file is deleted before writeback, no blocks are ever allocated
Better extent packing: adjacent logical blocks more likely to get contiguous physical blocks

Risks:

Data can be lost if system crashes before writeback (mitigated in ordered mode)
ENOSPC can occur during writeback, not at write() time

Block allocation path (simplified)
C
/* During write(): mark pages dirty, no allocation yet */
generic_perform_write(file, iov_iter, pos) {
    /* ... */
    ext4_write_begin(file, mapping, pos, len, &page, &fsdata);
    /* Just reserves space in delayed allocation state */
    /* No actual block allocation happens */
    ext4_write_end(file, mapping, pos, len, copied, page, fsdata);
}
 
/* During writeback: actual allocation */
ext4_writepages(mapping, wbc) {
    /* Find dirty pages that need allocation */
    mpage_prepare_extent_to_map(&mpd);
    
    /* Allocate blocks for the extent */
    ext4_map_blocks(handle, inode, &map, EXT4_GET_BLOCKS_CREATE);
    
    /* Submit I/O */
    mpage_map_and_submit_extent(handle, &mpd, &give_up_on_write);
}
 
/* ext4_map_blocks: the core allocation function */
ext4_map_blocks(handle, inode, map, flags) {
    /* Check extent cache first */
    if (ext4_es_lookup_extent(inode, map->m_lblk, &es)) {
        /* Found in extent status tree cache */
        return cached_result;
    }
    
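    /* Not cached: look up the mapping in the on-disk extent tree */
    ext4_ext_map_blocks(handle, inode, map, flags);
    
    if (!(map->m_flags & EXT4_MAP_MAPPED) && (flags & EXT4_GET_BLOCKS_CREATE)) {
        /* Still unmapped and the caller asked for creation: allocate via mballoc */
        ext4_mb_new_blocks(handle, &ar, &err);
    }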
}

Performance Tuning and Mount Options

ext4 provides numerous mount options for tuning performance and reliability tradeoffs. Understanding these options is crucial for optimizing specific workloads.

Important ext4 Mount Options
Option	Default	Description
`data=ordered`	Yes	Write data before committing metadata journal
`data=writeback`		No data ordering (faster, less safe)
`data=journal`		Journal all data (safest, slowest)
`noatime`		Don't update access times (avoids an inode write on reads)
`relatime`	Yes	Update atime only if it is not newer than mtime/ctime, or is more than 24 hours old (compromise)
`nodelalloc`		Disable delayed allocation
`barrier=1`	Yes	Use write barriers (required for safety)
`discard`		Enable TRIM for SSDs
`commit=N`	5s	Journal commit interval in seconds
`max_batch_time=N`	15000μs	Max time to batch journal commits
`journal_checksum`		Enable journal checksumming
`grpid`/`nogrpid`	nogrpid	Inherit directory group ID

SSD Optimization

For SSDs, consider:

mount -o discard,noatime /dev/sda1 /mnt

discard: Sends TRIM commands to the SSD when blocks are freed, improving garbage collection and wear leveling
noatime: Prevents write amplification from access time updates

Database Workloads

mount -o data=ordered,barrier=1,noatime,commit=60 /dev/sda1 /data

A longer commit interval reduces journal overhead, but up to 60 seconds of changes that were not explicitly fsynced can be lost in a crash
Barriers ensure durability guarantees
Databases typically manage their own fsync calls
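
At the system-call level, the `-o` string in the commands above is split in two: generic options such as noatime become `MS_*` flag bits, and everything else is passed to ext4 as a data string. The sketch below shows that split for a few options only. `split_mount_opts` is a hypothetical helper, not part of any library, and it assumes a Linux system where `<sys/mount.h>` provides the `MS_*` constants.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mount.h>

/* Split a comma-separated option string into generic MS_* flags (return
 * value) and file-system-specific options (written to data). Only three
 * generic options are recognized in this sketch. */
unsigned long split_mount_opts(const char *opts, char *data, size_t data_len)
{
    unsigned long flags = 0;
    char buf[256];

    assert(strlen(opts) < sizeof buf && data_len > 0);
    strcpy(buf, opts);
    data[0] = '\0';

    for (char *tok = strtok(buf, ","); tok != NULL; tok = strtok(NULL, ",")) {
        if (strcmp(tok, "noatime") == 0) {
            flags |= MS_NOATIME;
        } else if (strcmp(tok, "relatime") == 0) {
            flags |= MS_RELATIME;
        } else if (strcmp(tok, "ro") == 0) {
            flags |= MS_RDONLY;
        } else {
            /* Anything else is left for ext4 to parse. */
            if (data[0] != '\0')
                strncat(data, ",", data_len - strlen(data) - 1);
            strncat(data, tok, data_len - strlen(data) - 1);
        }
    }
    return flags;
}
```

The result could then be handed to `mount("/dev/sda1", "/data", "ext4", flags, data)`, which requires root privileges and is therefore left out of the sketch.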

Never Disable Barriers

Summary: ext4's Design Excellence

We've explored ext4's complete internal architecture. Let's consolidate the key concepts:

Key Takeaways

• ext4 evolved from ext2/ext3 while maintaining backward compatibility, adding extents, delayed allocation, larger limits, and metadata checksums.
• Block groups organize the disk into localized units containing metadata (bitmaps, inode tables) and data blocks, improving locality.
• Extents replace indirect blocks for mapping files to disk blocks, drastically reducing metadata overhead and improving sequential performance.
• HTree directories use hash-based B-trees to provide O(log n) lookup in large directories while remaining backward-compatible.
• jbd2 journaling ensures crash consistency by logging metadata changes before applying them, with configurable data ordering modes.
• Delayed allocation and the multi-block allocator optimize block placement by deferring allocation decisions until writeback time.
• Mount options allow fine-tuning for different workloads, from SSDs to databases to general-purpose systems.
