ext4 (fourth extended filesystem) is the default file system for most Linux distributions. It runs on billions of servers, desktops, embedded devices, and Android phones. Despite alternatives such as Btrfs and XFS that target specialized workloads, ext4 remains the trusted default, a testament to its balance of performance, reliability, and maturity.
But what makes ext4 work? How does it translate VFS operations into on-disk structures? How does it recover from crashes without data corruption? And what are the architectural decisions behind its performance characteristics?
This page answers these questions, providing the deep understanding that distinguishes file system engineers from file system users.
By the end of this page, you will understand ext4's complete architecture: the on-disk layout and data structures, the extent-based allocation system, the journaling mechanism that ensures crash consistency, and the performance optimizations that make ext4 suitable for both SSDs and spinning disks. You'll gain the knowledge needed to debug file system issues, understand mkfs/fsck operations, and make informed decisions about file system tuning.
ext4 didn't emerge in isolation—it evolved from its predecessors, inheriting their on-disk layout while adding crucial features. Understanding this evolution illuminates ext4's design decisions.
| Feature | ext2 | ext3 | ext4 |
|---|---|---|---|
| Journaling | No | Yes | Yes (enhanced) |
| Max file size | 2 TB | 2 TB | 16 TB |
| Max volume size (4 KiB blocks) | 16 TB | 16 TB | 1 EB |
| Block mapping | Indirect blocks | Indirect blocks | Extents + delayed allocation |
| Timestamp granularity | 1 second | 1 second | Nanoseconds |
| Max subdirectories | 32,000 | 32,000 | Unlimited |
| Checksum protection | No | No | Metadata + optional journal |
| Online defragmentation | No | No | Yes |
1. Extent-based allocation: Instead of indirect blocks pointing to individual data blocks, ext4 uses extents—contiguous ranges of blocks described by (start, length) pairs. This dramatically reduces metadata overhead for large files and improves sequential I/O performance.
2. Delayed allocation: ext4 delays allocating disk blocks until data is actually written to disk (not just to the page cache). This allows the allocator to make better decisions about block placement, improving locality.
3. Journal checksumming: The journal now includes checksums to detect corruption, improving reliability beyond ext3's journaling.
4. Metadata checksums: ext4 can compute and verify checksums on all metadata structures, detecting silent corruption before it propagates.
5. 48-bit block addressing: Increases maximum volume size from 16 TB (32-bit) to 1 EB (exabyte), future-proofing for large storage systems.
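The volume-size limits in the last item follow directly from the width of a block number. A minimal sketch of the arithmetic, assuming 4 KiB blocks; `max_volume_bytes` is a hypothetical helper written for illustration, not part of any ext4 API:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest volume (in bytes) addressable with block numbers that are
 * addr_bits wide. Illustrative helper only; the caller must make sure
 * the result fits in 64 bits.
 */
uint64_t max_volume_bytes(unsigned addr_bits, uint64_t block_size)
{
    assert(addr_bits < 64);
    /* 2^addr_bits addressable blocks, each block_size bytes long */
    return ((uint64_t)1 << addr_bits) * block_size;
}
```

With 4096-byte blocks, `max_volume_bytes(32, 4096)` is 2^44 bytes (16 TiB) and `max_volume_bytes(48, 4096)` is 2^60 bytes (1 EiB), which matches the figures quoted above.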
ext4 was designed with backward compatibility in mind. An ext3 file system can be mounted read-write by the ext4 driver, and most ext4 features are optional: they can be enabled individually (for example with tune2fs), after which newly written files use them while existing files keep their old format. This "upgrade path" approach reflects the pragmatic philosophy of Linux development.
ext4 divides a partition into fixed-size block groups. This localization strategy keeps related data and metadata close together on disk, reducing seek times on rotational storage and improving cache efficiency.
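Each block group contains:
- a copy of the superblock and the group descriptor table (only in some groups when sparse_super is enabled)
- a block bitmap, which records which blocks in the group are in use
- an inode bitmap, which records which inodes in the group are in use
- the inode table, which holds the group's on-disk inodes
- data blocks for file and directory contents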
The superblock and group descriptor table are replicated across block groups for redundancy. If the primary copy is corrupted, recovery tools can use backups.
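Which groups carry a backup depends on the sparse_super feature, which mke2fs normally enables: backups then live only in groups 0 and 1 and in groups whose number is a power of 3, 5, or 7. The sketch below encodes that rule; the function names are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if n equals base^k for some k >= 1. */
bool is_power_of(uint32_t n, uint32_t base)
{
    assert(base > 1);
    if (n < base)
        return false;
    while (n % base == 0)
        n /= base;
    return n == 1;
}

/* True if block group `group` holds a superblock backup under sparse_super. */
bool group_has_super_backup(uint32_t group)
{
    if (group <= 1)
        return true;
    return is_power_of(group, 3) || is_power_of(group, 5) ||
           is_power_of(group, 7);
}
```

Recovery tools rely on the same rule when they look for an alternate superblock to use.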
Modern ext4 uses 'flex_bg' (flexible block groups), which aggregates metadata from multiple block groups into a contiguous region. This reduces fragmentation of metadata and improves performance by allowing larger contiguous allocations for both metadata and data.
The superblock contains global file system information:
```c
struct ext4_super_block {
    __le32 s_inodes_count;           /* Total inode count */
    __le32 s_blocks_count_lo;        /* Total block count (low 32 bits) */
    __le32 s_r_blocks_count_lo;      /* Reserved block count */
    __le32 s_free_blocks_count_lo;   /* Free blocks count */
    __le32 s_free_inodes_count;      /* Free inodes count */
    __le32 s_first_data_block;       /* First data block */
    __le32 s_log_block_size;         /* Block size = 1024 << s_log_block_size */
    __le32 s_log_cluster_size;       /* Cluster size (for bigalloc) */
    __le32 s_blocks_per_group;       /* Blocks per block group */
    __le32 s_clusters_per_group;     /* Clusters per block group */
    __le32 s_inodes_per_group;       /* Inodes per block group */
    __le32 s_mtime;                  /* Last mount time */
    __le32 s_wtime;                  /* Last write time */
    __le16 s_mnt_count;              /* Mount count since fsck */
    __le16 s_max_mnt_count;          /* Max mounts before fsck */
    __le16 s_magic;                  /* Magic number (0xEF53) */
    __le16 s_state;                  /* File system state */
    __le16 s_errors;                 /* Error handling behavior */
    __le16 s_minor_rev_level;        /* Minor revision level */
    __le32 s_lastcheck;              /* Time of last fsck */
    __le32 s_checkinterval;          /* Max interval between checks */
    __le32 s_creator_os;             /* OS that created fs */
    __le32 s_rev_level;              /* Revision level */
    __le16 s_def_resuid;             /* Default uid for reserved blocks */
    __le16 s_def_resgid;             /* Default gid for reserved blocks */

    /* Fields valid only for dynamic-revision superblocks follow... */
    __le32 s_first_ino;              /* First non-reserved inode */
    __le16 s_inode_size;             /* Inode structure size */
    __le16 s_block_group_nr;         /* Block group of this superblock */
    __le32 s_feature_compat;         /* Compatible feature flags */
    __le32 s_feature_incompat;       /* Incompatible feature flags */
    __le32 s_feature_ro_compat;      /* Read-only compatible features */
    __u8   s_uuid[16];               /* 128-bit filesystem UUID */
    char   s_volume_name[16];        /* Volume name */
    char   s_last_mounted[64];       /* Last mount path */
    __le32 s_algorithm_usage_bitmap; /* For compression */

    /* Performance hints */
    __u8   s_prealloc_blocks;        /* Blocks to preallocate */
    __u8   s_prealloc_dir_blocks;    /* For directories */
    __le16 s_reserved_gdt_blocks;    /* Reserved GDT blocks */

    /* Journaling support */
    __u8   s_journal_uuid[16];       /* Journal UUID */
    __le32 s_journal_inum;           /* Journal inode number */
    __le32 s_journal_dev;            /* Journal device (if external) */
    __le32 s_last_orphan;            /* Orphan inode list head */
    __le32 s_hash_seed[4];           /* htree hash seed */
    __u8   s_def_hash_version;       /* Default hash version */
    __u8   s_jnl_backup_type;
    __le16 s_desc_size;              /* Group descriptor size */
    __le32 s_default_mount_opts;
    __le32 s_first_meta_bg;          /* First metablock group */
    __le32 s_mkfs_time;              /* Creation time */
    __le32 s_jnl_blocks[17];         /* Journal inode backup */

    /* 64-bit support */
    __le32 s_blocks_count_hi;        /* Block count (high 32 bits) */
    /* ... */
};
```

The superblock contains three sets of feature flags that control backward compatibility:
- Compatible (feature_compat): Unknown flags can be ignored; file system remains fully usable
- Incompatible (feature_incompat): Unknown flags prevent mounting entirely
- Read-only compatible (feature_ro_compat): Unknown flags allow read-only mounting but prevent writes

| Flag | Type | Description |
|---|---|---|
| EXT4_FEATURE_INCOMPAT_EXTENTS | Incompatible | Extent-based allocation |
| EXT4_FEATURE_INCOMPAT_64BIT | Incompatible | 64-bit block numbers |
| EXT4_FEATURE_INCOMPAT_FLEX_BG | Incompatible | Flexible block groups |
| EXT4_FEATURE_RO_COMPAT_METADATA_CSUM | RO-compat | Metadata checksums |
| EXT4_FEATURE_COMPAT_HAS_JOURNAL | Compatible | Has journal |
| EXT4_FEATURE_INCOMPAT_FILETYPE | Incompatible | Directory entries store file type |
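The three flag sets translate into a simple mount-time decision. The following is a minimal sketch of that logic, not the kernel's actual mount code: `struct fs_features`, `enum mount_decision`, and `check_features` are hypothetical names, while the flag values are the ones defined in the kernel's ext4 headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as defined in the kernel's ext4 headers */
#define EXT4_FEATURE_COMPAT_HAS_JOURNAL       0x0004
#define EXT4_FEATURE_INCOMPAT_FILETYPE        0x0002
#define EXT4_FEATURE_INCOMPAT_EXTENTS         0x0040
#define EXT4_FEATURE_INCOMPAT_64BIT           0x0080
#define EXT4_FEATURE_INCOMPAT_FLEX_BG         0x0200
#define EXT4_FEATURE_RO_COMPAT_METADATA_CSUM  0x0400

/* Hypothetical container for the three feature words */
struct fs_features {
    uint32_t compat;
    uint32_t incompat;
    uint32_t ro_compat;
};

enum mount_decision { MOUNT_READ_WRITE, MOUNT_READ_ONLY, MOUNT_REFUSED };

/* Compare the flags found on disk with the flags this driver understands. */
enum mount_decision check_features(const struct fs_features *on_disk,
                                   const struct fs_features *supported)
{
    assert(on_disk != NULL && supported != NULL);

    /* Any unknown incompatible flag: the on-disk format cannot be parsed safely. */
    if (on_disk->incompat & ~supported->incompat)
        return MOUNT_REFUSED;

    /* Unknown RO-compat flag: reading is safe, writing could corrupt data. */
    if (on_disk->ro_compat & ~supported->ro_compat)
        return MOUNT_READ_ONLY;

    /* Unknown compat flags are deliberately ignored. */
    return MOUNT_READ_WRITE;
}
```

The compat word is never consulted, which is exactly what makes a flag "compatible".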
The inode is the fundamental metadata structure in ext4. Unlike VFS's generic inode, the on-disk ext4 inode has a fixed format that stores all persistent file metadata.
```c
struct ext4_inode {
    __le16 i_mode;          /* File mode (type + permissions) */
    __le16 i_uid;           /* Lower 16 bits of owner UID */
    __le32 i_size_lo;       /* Size in bytes (low 32 bits) */
    __le32 i_atime;         /* Access time */
    __le32 i_ctime;         /* Inode change time */
    __le32 i_mtime;         /* Modification time */
    __le32 i_dtime;         /* Deletion time */
    __le16 i_gid;           /* Lower 16 bits of group GID */
    __le16 i_links_count;   /* Hard link count */
    __le32 i_blocks_lo;     /* Block count (512-byte units) */
    __le32 i_flags;         /* Inode flags */
    union {
        struct {
            __le32 l_i_version;
        } linux1;
        /* Other OS variants... */
    } osd1;

    /* This is where the magic happens: */
    __le32 i_block[EXT4_N_BLOCKS];  /* Block pointers OR extent tree */

    __le32 i_generation;    /* NFS generation number */
    __le32 i_file_acl_lo;   /* File ACL (low 32 bits) */
    __le32 i_size_high;     /* Size in bytes (high 32 bits) */
    __le32 i_obso_faddr;    /* Obsolete fragment address */
    union {
        struct {
            __le16 l_i_blocks_high;    /* Block count (high 16 bits) */
            __le16 l_i_file_acl_high;  /* File ACL (high 16 bits) */
            __le16 l_i_uid_high;       /* UID (high 16 bits) */
            __le16 l_i_gid_high;       /* GID (high 16 bits) */
            __le16 l_i_checksum_lo;    /* CRC32C checksum (low 16 bits) */
            __le16 l_i_reserved;
        } linux2;
        /* Other OS variants... */
    } osd2;
    __le16 i_extra_isize;   /* Extra inode space used */
    __le16 i_checksum_hi;   /* CRC32C checksum (high 16 bits) */
    __le32 i_ctime_extra;   /* Extra ctime (nanoseconds + epoch) */
    __le32 i_mtime_extra;   /* Extra mtime */
    __le32 i_atime_extra;   /* Extra atime */
    __le32 i_crtime;        /* Creation time */
    __le32 i_crtime_extra;  /* Extra creation time */
    __le32 i_version_hi;    /* High 32 bits of version */
    __le32 i_projid;        /* Project ID */
};
```

The i_block array is 60 bytes (15 × 4-byte entries). In legacy ext2/ext3 mode, it holds:
- 12 direct block pointers
- 1 single-indirect block pointer
- 1 double-indirect block pointer
- 1 triple-indirect block pointer
In modern ext4 with extents enabled, this same 60-byte space holds an extent tree—a far more efficient structure for large files.
A single extent can describe up to 32,768 contiguous blocks (128 MiB with 4 KiB blocks) with just 12 bytes of metadata. A 100 GB file that is allocated contiguously therefore needs only about 800 extents, under 10 KB of mapping metadata. Compare this to the legacy scheme, which needs one pointer per block: roughly 26 million pointers, stored in about 25,600 indirect blocks (100 MB) just to map the file.
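The comparison can be checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: 4-byte legacy block pointers, 12 direct pointers in the inode, a 32,768-block cap per extent, and only the lowest level of indirect blocks counted (the double- and triple-indirect levels add a few dozen more). All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BLOCKS_PER_EXTENT 32768u  /* 2^15, cap for an initialized extent */
#define LEGACY_DIRECT_BLOCKS  12u     /* direct pointers held in i_block */

/* Number of blocks needed to store file_bytes (rounded up). */
uint64_t file_blocks(uint64_t file_bytes, uint32_t block_size)
{
    assert(block_size != 0);
    return (file_bytes + block_size - 1) / block_size;
}

/* Fewest extents that can map nblocks of perfectly contiguous data. */
uint64_t min_extents(uint64_t nblocks)
{
    return (nblocks + MAX_BLOCKS_PER_EXTENT - 1) / MAX_BLOCKS_PER_EXTENT;
}

/* Lowest-level indirect blocks the legacy scheme needs for nblocks. */
uint64_t leaf_indirect_blocks(uint64_t nblocks, uint32_t block_size)
{
    uint64_t ptrs_per_block = block_size / 4;  /* 4-byte block pointers */

    assert(ptrs_per_block != 0);
    if (nblocks <= LEGACY_DIRECT_BLOCKS)
        return 0;
    return (nblocks - LEGACY_DIRECT_BLOCKS + ptrs_per_block - 1) / ptrs_per_block;
}
```

For a 100 GiB file with 4 KiB blocks, `file_blocks` gives 26,214,400 blocks, `min_extents` gives 800, and `leaf_indirect_blocks` gives 25,600.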
Extents are ext4's most important innovation. An extent is a contiguous range of physical blocks mapped to a contiguous range of logical blocks (file offsets). Extents are organized in a B-tree structure.
```c
/* Header at the start of extent tree/leaf */
struct ext4_extent_header {
    __le16 eh_magic;       /* Magic number (0xF30A) */
    __le16 eh_entries;     /* Number of valid entries */
    __le16 eh_max;         /* Maximum capacity */
    __le16 eh_depth;       /* Tree depth (0 = leaf level) */
    __le32 eh_generation;  /* Tree generation */
};

/* Internal node: points to lower level */
struct ext4_extent_idx {
    __le32 ei_block;       /* First logical block covered */
    __le32 ei_leaf_lo;     /* Physical block of child node (low) */
    __le16 ei_leaf_hi;     /* Physical block (high 16 bits) */
    __le16 ei_unused;
};

/* Leaf node: actual extent mapping */
struct ext4_extent {
    __le32 ee_block;       /* First logical block (file offset) */
    __le16 ee_len;         /* Number of blocks in extent */
    __le16 ee_start_hi;    /* Physical block (high 16 bits) */
    __le32 ee_start_lo;    /* Physical block (low 32 bits) */
};
```

For small files, the extent header and up to 4 extents fit directly in the inode's i_block array (a 12-byte header plus 4 × 12-byte extents = 60 bytes).
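For larger files with more than 4 extent regions, ext4 creates an extent tree: the root node in i_block holds ext4_extent_idx entries that point to separate blocks, each containing either further index entries or the leaf ext4_extent records.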
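Within any one node the entries are sorted by logical block, so a lookup is a binary search. The sketch below shows that search over a leaf, assuming the extents have already been converted into a host-endian array; `struct extent_map` and `extent_lookup` are hypothetical names, and the kernel's own search routines differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Host-endian, simplified view of one leaf extent (illustrative only) */
struct extent_map {
    uint32_t lblk;  /* first logical block covered */
    uint16_t len;   /* number of blocks */
    uint64_t pblk;  /* first physical block */
};

/*
 * Map logical block `lblk` to a physical block using a sorted,
 * non-overlapping extent array. Returns false if lblk falls in a hole.
 */
bool extent_lookup(const struct extent_map *ext, size_t n,
                   uint32_t lblk, uint64_t *pblk)
{
    size_t lo = 0, hi = n;

    assert(pblk != NULL);

    /* Find the number of extents whose start is <= lblk. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ext[mid].lblk <= lblk)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return false;  /* lblk lies before the first extent */

    const struct extent_map *e = &ext[lo - 1];
    if (lblk - e->lblk >= e->len)
        return false;  /* past the end of the candidate: a hole */

    *pblk = e->pblk + (lblk - e->lblk);
    return true;
}
```

In a multi-level tree the same search runs at each index node to pick the child block, then once more in the leaf.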
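The ee_len field also encodes whether an extent is unwritten, meaning its blocks are allocated but not yet written: values above 32,768 mark the extent as unwritten, which limits an unwritten extent to 32,767 blocks. Reads from an unwritten extent return zeros, and the extent is converted to a normal one when data is written to it. This is how fallocate() can preallocate space without zeroing it on disk.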
This is crucial for applications like databases that preallocate files to ensure contiguous allocation.
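Decoding an extent record takes only a few operations. A minimal sketch, assuming the on-disk little-endian fields have already been converted to host byte order; `struct raw_extent` and the helper names are hypothetical, while the 32,768 threshold corresponds to the kernel's EXT_INIT_MAX_LEN:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EXT_INIT_MAX_LEN 32768u  /* 1 << 15 */

/* Host-endian copy of an on-disk extent record (illustrative only) */
struct raw_extent {
    uint32_t ee_block;
    uint16_t ee_len;
    uint16_t ee_start_hi;
    uint32_t ee_start_lo;
};

/* Combine the 16 high bits and 32 low bits into a 48-bit block number. */
uint64_t extent_pblock(const struct raw_extent *e)
{
    assert(e != NULL);
    return ((uint64_t)e->ee_start_hi << 32) | e->ee_start_lo;
}

/* Lengths above EXT_INIT_MAX_LEN mark an unwritten (preallocated) extent. */
bool extent_is_unwritten(const struct raw_extent *e)
{
    assert(e != NULL);
    return e->ee_len > EXT_INIT_MAX_LEN;
}

/* Number of blocks actually covered, with the unwritten marker removed. */
uint32_t extent_length(const struct raw_extent *e)
{
    assert(e != NULL);
    return e->ee_len <= EXT_INIT_MAX_LEN
               ? e->ee_len
               : (uint32_t)e->ee_len - EXT_INIT_MAX_LEN;
}
```

A length of exactly 32,768 is a normal, initialized extent of maximum size, which is why the comparison is strictly greater than.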
ext4 also supports 'hole punching'—deallocating blocks in the middle of a file without changing its size. This is the opposite of preallocation and enables space reclamation in virtual disk images and databases. Holes appear as sparse regions that read as zeros.
Directories in ext4 are stored as files containing directory entries—records that map filenames to inode numbers. ext4 supports two directory formats:
Small directories use a simple linear format where entries are packed sequentially:
```c
struct ext4_dir_entry_2 {
    __le32 inode;      /* Inode number */
    __le16 rec_len;    /* Directory entry length */
    __u8   name_len;   /* Name length */
    __u8   file_type;  /* File type (regular, dir, symlink, etc.) */
    char   name[EXT4_NAME_LEN];  /* File name (up to 255 bytes) */
};

/* File types (stored in file_type field) */
#define EXT4_FT_UNKNOWN   0
#define EXT4_FT_REG_FILE  1
#define EXT4_FT_DIR       2
#define EXT4_FT_CHRDEV    3
#define EXT4_FT_BLKDEV    4
#define EXT4_FT_FIFO      5
#define EXT4_FT_SOCK      6
#define EXT4_FT_SYMLINK   7
```

The rec_len field allows variable-length entries and handles deletions by expanding the previous entry's rec_len to span the deleted space.
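Because each entry carries its own length, scanning a linear directory block amounts to hopping from one rec_len to the next. The sketch below walks a raw block held in memory. It is an illustration under assumptions: the helper names are hypothetical, entries with inode 0 are treated as unused, and the special rec_len encoding used with 64 KiB blocks is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Smallest rec_len for a name: 8-byte header + name, rounded up to 4 bytes. */
size_t dirent_rec_len(size_t name_len)
{
    return (8 + name_len + 3) & ~(size_t)3;
}

static uint16_t get_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Return the inode number for `name`, or 0 if absent or the block is corrupt. */
uint32_t dir_block_lookup(const uint8_t *block, size_t block_size,
                          const char *name)
{
    size_t want = strlen(name);
    size_t off = 0;

    assert(block != NULL);
    while (off + 8 <= block_size) {
        uint32_t ino = get_le32(block + off);
        uint16_t rec_len = get_le16(block + off + 4);
        uint8_t name_len = block[off + 6];

        /* Basic sanity checks, so a bad rec_len cannot loop or overrun. */
        if (rec_len < 8 || rec_len % 4 != 0 ||
            off + rec_len > block_size || 8u + name_len > rec_len)
            return 0;

        if (ino != 0 && name_len == want &&
            memcmp(block + off + 8, name, want) == 0)
            return ino;

        off += rec_len;  /* jump to the next entry */
    }
    return 0;
}
```

A deleted entry never appears in this walk because its space has been absorbed into the preceding entry's rec_len.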
For directories with many entries, an O(n) linear search becomes too slow. ext4 uses HTree (hashed tree) indexing, a B-tree variant indexed by filename hash.
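HTree characteristics:
- A lookup hashes the filename and descends a shallow index (normally one or two index levels) to a single leaf block
- Leaf blocks are ordinary linear directory blocks, so entries within a leaf are still scanned sequentially
- Index blocks are disguised as empty directory entries, so code that only understands the linear format can still read the directory
- Indexing is controlled by the dir_index feature flag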
The directory hash can be targeted by adversarial inputs—creating many files with the same hash forces O(n) lookup within leaf blocks. ext4 uses a keyed hash (half_md4 or tea with a per-filesystem seed) to mitigate this, but be aware when allowing untrusted users to create arbitrary filenames.
File systems face a fundamental problem: operations that modify multiple blocks (writing a file, creating a directory) cannot be atomic at the hardware level. A crash mid-operation leaves the file system in an inconsistent state.
Journaling solves this by writing a record of intended changes to a special journal area before applying them to the main file system. After a crash, recovery replays the journal to complete or abort in-flight operations.
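The recovery rule, apply only what was committed, can be shown with a toy model. This is a deliberately simplified sketch and not jbd2's on-disk format: one 32-bit word stands in for a whole block, and `struct journal_rec` and `journal_replay` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum rec_type { REC_BLOCK, REC_COMMIT };

/* One journal record in the toy model */
struct journal_rec {
    enum rec_type type;
    uint32_t tid;      /* transaction ID */
    uint32_t blocknr;  /* REC_BLOCK: target block on the "disk" */
    uint32_t value;    /* REC_BLOCK: new contents of that block */
};

/*
 * Replay the journal onto `disk`. Block records are applied only once the
 * commit record of their transaction has been seen; records that follow the
 * last commit belong to an unfinished transaction and are discarded.
 * Returns the number of transactions replayed.
 */
size_t journal_replay(const struct journal_rec *log, size_t n,
                      uint32_t *disk, size_t disk_blocks)
{
    size_t replayed = 0;
    size_t start = 0;  /* first record of the transaction being scanned */

    assert(disk != NULL);
    for (size_t i = 0; i < n; i++) {
        if (log[i].type != REC_COMMIT)
            continue;
        for (size_t j = start; j < i; j++) {
            if (log[j].type == REC_BLOCK && log[j].tid == log[i].tid &&
                log[j].blocknr < disk_blocks)
                disk[log[j].blocknr] = log[j].value;
        }
        replayed++;
        start = i + 1;
    }
    return replayed;
}
```

If the crash happened before a transaction's commit record reached the journal, none of its blocks are applied, so the file system returns to the state before that operation began.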
ext4 supports three journaling modes:
| Mode | What's Journaled | Performance | Safety |
|---|---|---|---|
| journal | Metadata + data | Slowest (all data written twice) | Highest (complete crash protection) |
| ordered (default) | Metadata only, data forced before commit | Moderate | Good (prevents stale data exposure) |
| writeback | Metadata only, no data ordering | Fastest | Lowest (may expose stale data after crash) |
A transaction groups multiple file system operations:
```c
/* Simplified transaction lifecycle */

/* 1. Start a new transaction (or join running one) */
handle_t *handle = ext4_journal_start(inode, EXT4_HT_INODE, credits);

/* 2. Get buffer heads for modified metadata */
bh = sb_getblk(sb, block);
err = ext4_journal_get_write_access(handle, sb, bh, EXT4_JTR_NONE);

/* 3. Modify the buffer */
modify_metadata(bh);

/* 4. Mark buffer dirty in journal */
err = ext4_handle_dirty_metadata(handle, NULL, bh);

/* 5. Stop (complete our portion of transaction) */
ext4_journal_stop(handle);

/* 6. Eventually, the transaction commits:
 *    a. Write all journal blocks to journal area
 *    b. Write commit record
 *    c. Flush journal to disk
 *    d. Mark transaction committed
 *    e. Later: checkpoint (write actual blocks, release journal space)
 */
```

ext4 uses jbd2 (journaling block device, version 2) as its journaling layer. jbd2 is a general-purpose block journaling library, separate from ext4 itself. This separation allows other file systems to reuse the same journaling code (OCFS2 also uses jbd2) and keeps transaction and recovery logic independent of ext4's on-disk format.
The journal is typically stored in a hidden inode (inode 8 by default) called the journal inode.
Modern ext4 enables journal checksumming (journal_checksum or journal_checksum_v3). Each transaction includes CRC32C checksums of all journal blocks. During recovery, corrupted transactions are discarded rather than replayed, preventing corruption propagation.
ext4's block allocator determines where data is placed on disk. Good allocation decisions are critical for performance—contiguous allocation enables efficient sequential I/O, while fragmented allocation causes excessive seeking.
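ext4 uses the multi-block allocator (mballoc), which can allocate many blocks in a single operation (unlike ext3's single-block allocator). Key features:
1. Buddy Bitmap Allocator: for each block group, mballoc builds an in-memory buddy representation of the on-disk block bitmap so it can quickly find free chunks of 2^n blocks.
2. Preallocation: blocks beyond the immediate request are reserved per inode (for growing files) and per locality group (for small files), so later writes land next to earlier ones.
3. Block Group Goals: each request carries a goal block, typically near the file's existing blocks or its inode's block group, and the allocator searches outward from there.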
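The question a buddy structure answers is "where is a free, aligned chunk of 2^order blocks?". The brute-force sketch below answers it by scanning a block bitmap directly, assuming the ext4 convention that a set bit means the block is in use. Real mballoc keeps precomputed per-order buddy data instead of rescanning, and `find_free_chunk` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit i of the bitmap (least significant bit first) set = block i in use. */
static bool block_in_use(const uint8_t *bitmap, uint32_t blk)
{
    return (bitmap[blk / 8] & (1u << (blk % 8))) != 0;
}

/*
 * Return the first block of the first free chunk of 2^order blocks whose
 * start is aligned to 2^order, or -1 if the group has no such chunk.
 */
int64_t find_free_chunk(const uint8_t *bitmap, uint32_t nblocks, unsigned order)
{
    assert(bitmap != NULL && order < 16);
    uint32_t chunk = 1u << order;

    for (uint32_t start = 0; start + chunk <= nblocks; start += chunk) {
        bool all_free = true;

        for (uint32_t b = start; b < start + chunk; b++) {
            if (block_in_use(bitmap, b)) {
                all_free = false;
                break;
            }
        }
        if (all_free)
            return (int64_t)start;
    }
    return -1;
}
```

Keeping chunks aligned to their own size is what lets two free neighbours ("buddies") merge into a chunk of the next order when blocks are freed.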
Delayed allocation (delalloc) is one of ext4's most important optimizations. Instead of allocating blocks when write() is called, ext4 delays allocation until the data is actually written to disk (usually during writeback).
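Benefits:
- The allocator sees the full size of a burst of writes and can place it in one contiguous extent
- Short-lived files that are deleted before writeback never consume disk blocks
- Fewer, larger allocation requests reduce CPU and metadata overhead

Risks:
- Delaying allocation widens the window in which a crash loses recently written data; applications that need durability must call fsync()
- Space is only reserved at write() time, so allocation problems can surface later, during writeback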
```c
/* Simplified pseudocode of the kernel code paths (not compilable as-is) */

/* During write(): mark pages dirty, no allocation yet */
generic_perform_write(file, iov_iter, pos) {
    /* ... */
    /* With delalloc, the address-space operations are the ext4_da_* variants */
    ext4_da_write_begin(file, mapping, pos, len, &page, &fsdata);
    /* Just reserves space in delayed allocation state */
    /* No actual block allocation happens */
    ext4_da_write_end(file, mapping, pos, len, copied, page, fsdata);
}

/* During writeback: actual allocation */
ext4_writepages(mapping, wbc) {
    /* Find dirty pages that need allocation */
    mpage_prepare_extent_to_map(&mpd);

    /* Allocate blocks for the extent */
    ext4_map_blocks(handle, inode, &map, EXT4_GET_BLOCKS_CREATE);

    /* Submit I/O */
    mpage_map_and_submit_extent(handle, &mpd, &give_up_on_write);
}

/* ext4_map_blocks: the core allocation function */
ext4_map_blocks(handle, inode, map, flags) {
    /* Check extent cache first */
    if (ext4_es_lookup_extent(inode, map->m_lblk, &es)) {
        /* Found in extent status tree cache */
        return cached_result;
    }

    /* Look up in extent tree on disk */
    ext4_ext_map_blocks(handle, inode, map, flags);

    if (flags & EXT4_GET_BLOCKS_CREATE) {
        /* Allocate new blocks via mballoc */
        ext4_mb_new_blocks(handle, &ar, &err);
    }
}
```

ext4 provides numerous mount options for tuning performance and reliability tradeoffs. Understanding these options is crucial for optimizing specific workloads.
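One way to see why delaying helps: at writeback time the dirty pages of a file can be grouped into contiguous logical runs, and each run can be handed to the allocator as a single request. The sketch below only counts those runs from a sorted list of dirty logical block numbers. It is a toy model of the grouping step, not the kernel's implementation, and `count_contiguous_runs` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count maximal runs of consecutive logical block numbers in a sorted
 * array. Each run could be mapped with one multi-block allocation.
 */
size_t count_contiguous_runs(const uint32_t *dirty, size_t n)
{
    size_t runs = 0;

    assert(dirty != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* A new run starts at the first block and after every gap. */
        if (i == 0 || dirty[i] != dirty[i - 1] + 1)
            runs++;
    }
    return runs;
}
```

Seven dirty blocks {0, 1, 2, 3, 10, 11, 50} form three runs, so three allocation requests suffice where allocating at write() time might have issued seven.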
| Option | Default | Description |
|---|---|---|
| data=ordered | Yes | Write data before committing metadata journal |
| data=writeback | No | No data ordering (faster, less safe) |
| data=journal | No | Journal all data (safest, slowest) |
| noatime | No | Don't update access times (avoids an inode write on every read) |
| relatime | Yes | Update atime only if it is not newer than mtime/ctime or is more than 24 hours old (compromise) |
| nodelalloc | No | Disable delayed allocation |
| barrier=1 | Yes | Use write barriers (required for safety) |
| discard | No | Enable TRIM for SSDs |
| commit=N | 5s | Journal commit interval in seconds |
| max_batch_time=N | 15000 μs | Max time to batch journal commits |
| journal_checksum | No | Enable journal checksumming |
| grpid/nogrpid | nogrpid | With grpid, new files inherit the parent directory's group ID |
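The relatime rule is compact enough to write as a predicate. A minimal sketch, assuming the behavior of mainline Linux (atime is refreshed if it is not newer than mtime or ctime, or if it is at least 24 hours old); `struct inode_times` and `relatime_needs_update` are hypothetical names, and timestamps are whole seconds for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Timestamps in seconds since the epoch (illustrative container) */
struct inode_times {
    int64_t atime;
    int64_t mtime;
    int64_t ctime;
};

/* Decide whether a read at time `now` should update atime under relatime. */
bool relatime_needs_update(const struct inode_times *t, int64_t now)
{
    assert(t != NULL);

    /* File was modified or changed since it was last read. */
    if (t->mtime >= t->atime || t->ctime >= t->atime)
        return true;

    /* atime is at least a day stale. */
    if (now - t->atime >= 24 * 60 * 60)
        return true;

    return false;
}
```

Repeated reads of an unchanged file therefore cost at most one atime write per day, which is why relatime is a reasonable default.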
For SSDs, consider:
mount -o discard,noatime /dev/sda1 /mnt
mount -o data=ordered,barrier=1,noatime,commit=60 /dev/sda1 /data
The barrier=0 option is almost never appropriate. Storage with a volatile write cache may reorder writes, which can corrupt the journal after a crash or power loss. Unless your storage has a battery-backed write cache or guaranteed write ordering, keep barriers enabled.
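We've explored ext4's complete internal architecture. Let's consolidate the key concepts:
- Block groups keep related data and metadata close together; flex_bg clusters the metadata of several groups.
- Extents replace indirect block pointers with (start, length) ranges organized in a tree rooted in the inode.
- Directories use a linear format when small and HTree hash indexing when large.
- jbd2 journaling provides crash consistency, with the journal, ordered, and writeback modes trading safety for speed.
- The multi-block allocator and delayed allocation work together to produce contiguous files.
- Mount options such as noatime, commit=N, and barrier tune the balance between performance and reliability.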
What's next:
With ext4's file system logic understood, we'll descend to the next layer: the Block I/O subsystem. The next page explores how the kernel translates file system requests into disk operations, covering the bio structure, I/O scheduling, request merging, and the multi-queue block layer.
You now understand ext4's complete architecture—from on-disk layout to journaling mechanics to block allocation strategies. This knowledge enables you to make informed decisions about file system configuration, debug performance issues, and understand what 'fsck' is actually doing when it repairs a file system.