Consider a 4 GB video file stored on an ext2/ext3 filesystem with 4 KB blocks. To describe where all that data lives on disk, the filesystem must maintain approximately one million block pointers—each pointing to a single 4 KB block. These pointers consume over 4 MB of space just for metadata, and accessing the file requires traversing multiple levels of indirect blocks.
Now imagine describing that same file with a single statement: "Logical blocks 0 through 1,048,575 are stored contiguously starting at physical block 50,000,000." One structure, one lookup, done.
This is the idea behind extents, ext4's approach to block mapping. Rather than maintaining exhaustive lists of individual blocks, extents describe contiguous ranges. On disk, a single ext4 extent covers at most 32,768 blocks (128 MB with 4 KB blocks), so the contiguous 4 GB file above needs 32 extents rather than one, which is still tiny next to a million block pointers. Even fragmented files typically need just a handful of extents.
The result: dramatic reductions in metadata overhead, faster file access, and more efficient use of disk space. Extents are the single most impactful feature distinguishing ext4 from its predecessors.
By the end of this page, you will understand the extent structure, how the extent tree organizes file mappings, how ext4 allocates and manages extents, the performance benefits compared to indirect blocks, and practical considerations for extent-based files.
Before appreciating extents, let's understand what they replace: the indirect block scheme used by ext2/ext3.
ext2/ext3 Block Pointer Structure:
The traditional inode contains 15 block pointers:
i_block[0-11]: 12 direct pointers (point directly to data blocks)
i_block[12]: 1 indirect pointer (points to block of pointers)
i_block[13]: 1 double indirect pointer (pointer → pointers → pointers)
i_block[14]: 1 triple indirect pointer (three levels of indirection)
Capacity Calculations (4 KB blocks, 4-byte pointers):
| Level | Pointers per Block | Blocks Addressable | Data Capacity |
|---|---|---|---|
| Direct | N/A | 12 | 48 KB |
| Single Indirect | 1,024 | 1,024 | 4 MB |
| Double Indirect | 1,024² | 1,048,576 | 4 GB |
| Triple Indirect | 1,024³ | 1,073,741,824 | 4 TB |
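Total addressable through the pointer tree: ~4 TB (with 4 KB blocks). In practice ext2/ext3 cap files at 2 TB with 4 KB blocks, because the inode's 32-bit i_blocks field counts 512-byte sectors.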
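The capacity figures in the table can be reproduced with a few lines of C. This is a minimal sketch, not filesystem code: the function names `indirect_level_blocks` and `indirect_total_blocks` are hypothetical, and 4-byte block pointers are assumed as in the table.

```c
#include <assert.h>
#include <stdint.h>

/* Data blocks reachable through ONE pointer at the given indirection
 * level (0 = direct, 1 = single, 2 = double, 3 = triple indirect). */
static uint64_t indirect_level_blocks(uint64_t ptrs_per_block, int level)
{
    assert(level >= 0 && level <= 3);
    uint64_t blocks = 1;
    for (int i = 0; i < level; i++)
        blocks *= ptrs_per_block;
    return blocks;
}

/* Blocks addressable by the classic 15-pointer layout: 12 direct
 * pointers plus one single, one double, and one triple indirect. */
static uint64_t indirect_total_blocks(uint64_t block_size)
{
    uint64_t ppb = block_size / 4;  /* 4-byte pointers per block */
    return 12 * indirect_level_blocks(ppb, 0)
         + indirect_level_blocks(ppb, 1)
         + indirect_level_blocks(ppb, 2)
         + indirect_level_blocks(ppb, 3);
}
```

Multiplying the result of `indirect_total_blocks(4096)` by the block size gives the roughly 4 TB ceiling of the pointer tree; the triple indirect level dominates the total.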
Problems with Indirect Blocks:
Excessive Metadata for Large Files
Multiple Disk Seeks for Access
Fragmented Pointer Storage
Inefficient for Contiguous Data
Indirect blocks made perfect sense in 1993 when most files were small (average < 10 KB) and disks were measured in megabytes. But by 2008, multi-gigabyte files were common, and the indirect scheme's inefficiency became a significant bottleneck.
An extent is a compact descriptor for a contiguous range of blocks. Instead of "block 100, block 101, block 102, ..., block 199," an extent says "100 blocks starting at block 100."
ext4 Extent Structure:
// Extent structure (12 bytes)
struct ext4_extent {
    __le32 ee_block;     // First logical block (within file)
    __le16 ee_len;       // Length in blocks (1-32768 when initialized)
                         // Values above 32768 = unwritten extent
    __le16 ee_start_hi;  // Physical block (high 16 bits)
    __le32 ee_start_lo;  // Physical block (low 32 bits)
};

// Extent header (12 bytes)
struct ext4_extent_header {
    __le16 eh_magic;      // Magic number: 0xF30A
    __le16 eh_entries;    // Number of valid entries
    __le16 eh_max;        // Maximum capacity for entries
    __le16 eh_depth;      // Tree depth (0 = leaf, >0 = index)
    __le32 eh_generation; // Generation of the tree (not used by standard ext4)
};

// Extent index (12 bytes) - internal tree nodes
struct ext4_extent_idx {
    __le32 ei_block;    // First logical block covered by this subtree
    __le32 ei_leaf_lo;  // Physical block of child node (low 32 bits)
    __le16 ei_leaf_hi;  // Physical block of child node (high 16 bits)
    __u16  ei_unused;   // Reserved
};

// All three structures are exactly 12 bytes for efficient packing
Extent Field Details:
| Field | Size | Purpose | Range |
|---|---|---|---|
| ee_block | 4 bytes | Logical block (within file) | 0 to 2³² - 1 |
| ee_len | 2 bytes | Length in blocks | 1 to 32768 |
| ee_start | 6 bytes | Physical block on disk | 0 to 2⁴⁸ - 1 |
Maximum Single Extent:
32,768 blocks × 4 KB = 128 MB for an initialized extent (32,767 blocks for an unwritten one).
Physical Block Addressing: The 48-bit physical block address enables addressing:
2^48 blocks × 4 KB = 1 Exabyte (1,152,921,504,606,846,976 bytes)
Uninitialized Extents:
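Stored ee_len values above 32768 mark uninitialized (preallocated but unwritten) extents; the real length is the stored value minus 32768. A stored value of exactly 32768 (0x8000) is an initialized extent of maximum length, so testing the high bit alone is not enough:
#define EXT_INIT_MAX_LEN 32768 // Max initialized extent
#define EXT_UNWRITTEN_MAX_LEN (EXT_INIT_MAX_LEN - 1) // Max unwritten extent (32767)
bool is_unwritten(struct ext4_extent *ext) {
    // 0x8000 itself is an initialized extent of 32768 blocks
    return le16_to_cpu(ext->ee_len) > EXT_INIT_MAX_LEN;
}
uint16_t extent_length(struct ext4_extent *ext) {
    uint16_t len = le16_to_cpu(ext->ee_len);
    return len <= EXT_INIT_MAX_LEN ? len : len - EXT_INIT_MAX_LEN;
}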
Uninitialized extents represent space reserved via fallocate() but not yet written. Reading them returns zeros without accessing disk.
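The field encoding can be exercised outside the kernel with a small host-endian model. The sketch below is an illustration under stated assumptions: `struct extent_fields` and the helper names are hypothetical, byte-order conversion is left out, and the length rule follows the kernel convention that stored values above 32768 denote unwritten extents.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INIT_MAX_LEN 32768u  /* largest length of an initialized extent */

/* Host-endian copy of the on-disk extent fields (illustrative type). */
struct extent_fields {
    uint32_t ee_block;
    uint16_t ee_len;
    uint16_t ee_start_hi;
    uint32_t ee_start_lo;
};

/* Combine the 16 high and 32 low bits into the 48-bit physical block. */
static uint64_t extent_pblock(const struct extent_fields *ex)
{
    return ((uint64_t)ex->ee_start_hi << 32) | ex->ee_start_lo;
}

/* Split a 48-bit physical block number into the two stored halves. */
static void extent_store_pblock(struct extent_fields *ex, uint64_t pblk)
{
    assert(pblk < (1ULL << 48));
    ex->ee_start_lo = (uint32_t)(pblk & 0xffffffffu);
    ex->ee_start_hi = (uint16_t)(pblk >> 32);
}

/* Stored lengths above 32768 mark an unwritten extent. */
static bool extent_is_unwritten(const struct extent_fields *ex)
{
    return ex->ee_len > INIT_MAX_LEN;
}

/* Real length in blocks, with the unwritten marker removed. */
static uint32_t extent_actual_len(const struct extent_fields *ex)
{
    return ex->ee_len <= INIT_MAX_LEN ? ex->ee_len
                                      : ex->ee_len - INIT_MAX_LEN;
}
```

Note the edge case: a stored length of exactly 32768 decodes as a written extent of 32,768 blocks, while 32768 + 100 decodes as an unwritten extent of 100 blocks.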
A perfectly contiguous 4 TB file requires just 32,768 extents (one per 128 MB). Compare to indirect blocks: the same file would need ~1 billion block pointers. That's roughly a 32,000× reduction in the number of mapping entries, or about 11,000× in bytes (12-byte extents versus 4-byte pointers).
While many files can be described by a handful of extents fitting directly in the inode, fragmented files may need hundreds or thousands of extents. The extent tree organizes these efficiently.
Inode Extent Storage:
The inode's i_block field (60 bytes) is reused for extent storage:
+-------------------+
| Extent Header | 12 bytes
+-------------------+
| Extent/Index 0 | 12 bytes
+-------------------+
| Extent/Index 1 | 12 bytes
+-------------------+
| Extent/Index 2 | 12 bytes
+-------------------+
| Extent/Index 3 | 12 bytes
+-------------------+
Total: 60 bytes = 1 header + 4 entries
With 4 extents in the inode, a file with < 5 fragments needs no additional blocks for metadata.
Tree Depth and Capacity:
Each tree block (4 KB) holds a 12-byte header plus (4,096 - 12) / 12 = 340 twelve-byte entries:
| Depth | Structure | Extents Capacity | Max File Size* |
|---|---|---|---|
| 0 | 4 extents in inode | 4 | 512 MB |
| 1 | 4 index → 340 extents each | 1,360 | 170 GB |
| 2 | 4×340 index → 340 each | 462,400 | 57 TB |
| 3 | 4×340×340 → 340 each | 157,216,000 | 19 PB |
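*Assuming each extent covers 128 MB (32,768 blocks). These figures are mapping capacity: because ee_block is a 32-bit logical block number, a single file cannot exceed 16 TB with 4 KB blocks, whatever the tree depth.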
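The extent counts in the table follow directly from the 12-byte entry size. A minimal sketch of the arithmetic, with hypothetical function names `entries_per_block` and `max_extents`:

```c
#include <assert.h>
#include <stdint.h>

#define ENTRY_SIZE     12u  /* header, extent, and index are all 12 bytes */
#define INODE_ENTRIES   4u  /* entries that fit in the inode's 60 bytes */

/* Entries that fit in one tree block after its 12-byte header. */
static uint64_t entries_per_block(uint32_t block_size)
{
    assert(block_size >= 2 * ENTRY_SIZE);
    return (block_size - ENTRY_SIZE) / ENTRY_SIZE;
}

/* Maximum number of leaf extents for a tree of the given depth:
 * 4 entries at the root, multiplied by the fan-out once per level. */
static uint64_t max_extents(uint32_t block_size, unsigned depth)
{
    uint64_t n = INODE_ENTRIES;
    for (unsigned i = 0; i < depth; i++)
        n *= entries_per_block(block_size);
    return n;
}
```

Multiplying `max_extents(4096, depth)` by 128 MB gives the mapping capacity column, under the same assumption that every extent is at its maximum length.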
Binary Search for Lookup:
Extent tree lookup is O(depth × log(entries)):
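// Simplified sketch: the real ext4_find_extent() returns a path of tree nodes.
// search_idx(), search_extent(), and read_block() are placeholder helpers.
struct ext4_extent *ext4_find_extent(struct inode *inode, ext4_lblk_t logical_block) {
    struct ext4_extent_header *eh =
        (struct ext4_extent_header *)EXT4_I(inode)->i_data;
    while (le16_to_cpu(eh->eh_depth) > 0) {
        // Binary search for the last index with ei_block <= logical_block
        struct ext4_extent_idx *idx = search_idx(
            (struct ext4_extent_idx *)(eh + 1),
            le16_to_cpu(eh->eh_entries),
            logical_block
        );
        // Follow index to child block (48-bit address stored in two fields)
        ext4_fsblk_t pblock =
            ((ext4_fsblk_t)le16_to_cpu(idx->ei_leaf_hi) << 32) |
            le32_to_cpu(idx->ei_leaf_lo);
        eh = read_block(pblock);
    }
    // At leaf level: binary search for the last extent with ee_block <= logical_block
    return search_extent(
        (struct ext4_extent *)(eh + 1),
        le16_to_cpu(eh->eh_entries),
        logical_block
    );
}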
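The leaf-level search can be tried in user space on a plain sorted array. This is a minimal sketch, not the kernel's implementation: `struct leaf_extent`, `find_extent`, and `map_block` are hypothetical names, and the array stands in for the entries of one leaf node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct leaf_extent {
    uint32_t lblk;  /* first logical block */
    uint32_t len;   /* length in blocks */
    uint64_t pblk;  /* first physical block */
};

/* Return the extent containing lblk, or NULL if lblk falls in a hole.
 * exts must be sorted by lblk and must not overlap. */
static const struct leaf_extent *
find_extent(const struct leaf_extent *exts, size_t n, uint32_t lblk)
{
    assert(exts != NULL || n == 0);
    size_t lo = 0, hi = n;          /* search window is [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (exts[mid].lblk <= lblk)
            lo = mid + 1;           /* answer is at mid or to its right */
        else
            hi = mid;
    }
    if (lo == 0)
        return NULL;                /* lblk precedes the first extent */
    const struct leaf_extent *ex = &exts[lo - 1];
    if (lblk - ex->lblk >= ex->len)
        return NULL;                /* past the end of ex: a hole */
    return ex;
}

/* Translate a logical block to a physical block; 0 means "hole". */
static uint64_t map_block(const struct leaf_extent *exts, size_t n,
                          uint32_t lblk)
{
    const struct leaf_extent *ex = find_extent(exts, n, lblk);
    return ex ? ex->pblk + (lblk - ex->lblk) : 0;
}
```

The search finds the last entry whose starting block is not greater than the target, then checks that the target actually lies inside that entry; the second check is what distinguishes a mapped block from a hole.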
| File Size | Indirect Block Lookups | Extent Tree Lookups |
|---|---|---|
| 48 KB (12 blocks) | 1 (direct) | 1 (in inode) |
| 4 MB | 2 (indirect) | 1 (in inode) |
| 100 MB | 3 (double) | 1-2 (depends on fragmentation) |
| 1 GB | 3 (double) | 1-2 |
| 100 GB | 4 (triple) | 2-3 |
| 1 TB | 4 (triple) | 2-3 |
The extent tree's compactness means the entire mapping for most files fits in cache. A contiguous 10 GB file needs as few as 80 extents, under 1 KB of mapping data. The same file with indirect blocks requires about 2.6 million pointers, roughly 10 MB spread across some 2,560 indirect blocks that must each be read and cached.
ext4's multiblock allocator (mballoc) is designed to create large, contiguous extents whenever possible.
Delayed Allocation:
The key to effective extent allocation is delayed allocation (delalloc):
// Simplified extent allocation flow
// extent_covers(), extent_pblock(), extent_len(), and set_extent_pblock()
// are simplified stand-ins for the kernel's helpers.
int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
                        struct ext4_map_blocks *map, int flags)
{
    struct ext4_ext_path *path;
    struct ext4_extent *ex, new_extent;
    struct ext4_allocation_request ar;
    ext4_lblk_t logical = map->m_lblk;
    unsigned int len = map->m_len;
    ext4_fsblk_t newblock;
    int err = 0;

    // Find existing extent or insertion point
    path = ext4_find_extent(inode, logical, NULL, 0);
    ex = path[path->p_depth].p_ext;
    if (ex && extent_covers(ex, logical, len)) {
        // Existing extent covers request - calculate physical block
        ext4_lblk_t off = logical - le32_to_cpu(ex->ee_block);
        map->m_pblk = extent_pblock(ex) + off;
        map->m_len = min(len, extent_len(ex) - off);
        return map->m_len;
    }

    // Need new allocation
    if (!(flags & EXT4_GET_BLOCKS_CREATE))
        return 0; // No allocation requested

    // Allocate contiguous blocks; aim just past the nearest preceding extent
    ar = (struct ext4_allocation_request) {
        .inode = inode,
        .logical = logical,
        .len = len,
        .goal = ex ? extent_pblock(ex) + extent_len(ex) : 0,
        .flags = 0,
    };
    // Returns the first physical block; ar.len holds the blocks granted
    newblock = ext4_mb_new_blocks(handle, &ar, &err);
    if (!newblock)
        return err;

    // Create new extent and insert it into the extent tree
    new_extent.ee_block = cpu_to_le32(logical);
    new_extent.ee_len = cpu_to_le16(ar.len);
    set_extent_pblock(&new_extent, newblock);
    err = ext4_ext_insert_extent(handle, inode, &path, &new_extent, 0);
    if (err)
        return err;
    map->m_pblk = newblock;
    map->m_len = ar.len;
    return ar.len;
}
Allocation Goals:
The allocator tries to place new extents:
Extent Merging:
When a new extent is adjacent to an existing one, they merge:
before: [extent: blocks 0-99 @ 1000-1099]
[extent: blocks 100-199 @ 1100-1199] // Adjacent!
after: [extent: blocks 0-199 @ 1000-1199] // Merged!
Merging reduces extent count and improves access patterns.
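The merge condition can be written down compactly. The sketch below is an assumption-laden illustration: `struct simple_extent`, `can_merge`, and `try_merge` are hypothetical names, and the check that both extents share the same written/unwritten state is omitted for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_EXTENT_LEN 32768u  /* cap for a single initialized extent */

struct simple_extent {
    uint32_t lblk;  /* first logical block */
    uint32_t len;   /* length in blocks */
    uint64_t pblk;  /* first physical block */
};

/* b can be appended to a only if it continues a both logically and
 * physically, and the combined length still fits in one extent. */
static bool can_merge(const struct simple_extent *a,
                      const struct simple_extent *b)
{
    return (uint64_t)a->lblk + a->len == b->lblk &&
           a->pblk + a->len == b->pblk &&
           a->len + b->len <= MAX_EXTENT_LEN;
}

/* Extend a to cover b when possible; returns whether a merge happened. */
static bool try_merge(struct simple_extent *a, const struct simple_extent *b)
{
    assert(a->len > 0 && b->len > 0);
    if (!can_merge(a, b))
        return false;
    a->len += b->len;
    return true;
}
```

With the example above, merging blocks 0-99 @ 1000 with blocks 100-199 @ 1100 succeeds, while a neighbour that is logically adjacent but physically elsewhere is left as a separate extent.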
Preallocation with fallocate():
# Preallocate 1 GB for a database file
fallocate -l 1G /var/lib/db/data.db
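// Kernel creates uninitialized extents, each at most 32,767 blocks long
extent: ee_block=0, ee_len=32767 (unwritten), ee_start=1000000
extent: ee_block=32767, ee_len=32767 (unwritten), ee_start=1032767
...
// 262,144 blocks (1 GB) reserved in total, marked unwritten
// Reads return zeros, writes convert to written
Preallocation reserves the space up front and lets the allocator place it in large contiguous runs, which matters for databases and VMs.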
| Scenario | Without delalloc | With delalloc |
|---|---|---|
| 100 × 4KB writes | 100 separate extents possible | 1 extent likely |
| 1 GB sequential write | Many extents | 8-10 extents (128 MB each) |
| Random 4KB writes | Many extents | Many extents (no change) |
| fallocate() 10 GB | 1 call = contiguous | 1 call = contiguous |
With delayed allocation, data exists only in page cache until writeback. A crash before writeback loses data even for 'successful' writes. This is the expected POSIX behavior (use fsync for durability), but ext4's implementation exposed some edge cases that caused issues with applications expecting write-then-rename atomicity.
The extent tree supports efficient insert, delete, and split operations to maintain a balanced, searchable structure.
Inserting a New Extent:
1. Find insertion point (binary search)
2. If adjacent to existing extent: merge
3. If room in leaf block: insert directly
4. If leaf full: split leaf, redistribute entries
5. If split propagates up: may need new root
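Steps 1 and 3 above (find the position, then shift and insert) can be sketched on a fixed-size leaf. This is a simplified model, not kernel code: `struct leaf`, `struct ext`, and `leaf_insert` are hypothetical names, merging is left out, and a full leaf simply reports that a split would be needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LEAF_CAPACITY 4  /* matches the four entries held in the inode */

struct ext {
    uint32_t lblk;
    uint32_t len;
    uint64_t pblk;
};

struct leaf {
    unsigned entries;               /* number of valid entries */
    struct ext ext[LEAF_CAPACITY];  /* kept sorted by lblk */
};

/* Insert newext, keeping the array sorted by logical block.
 * Returns false when the leaf is full and would have to be split. */
static bool leaf_insert(struct leaf *leaf, const struct ext *newext)
{
    assert(leaf->entries <= LEAF_CAPACITY);
    if (leaf->entries == LEAF_CAPACITY)
        return false;

    /* Linear scan for clarity; a real leaf would use binary search. */
    unsigned pos = 0;
    while (pos < leaf->entries && leaf->ext[pos].lblk < newext->lblk)
        pos++;

    /* Shift the tail up by one slot to open a gap at pos. */
    memmove(&leaf->ext[pos + 1], &leaf->ext[pos],
            (leaf->entries - pos) * sizeof(struct ext));
    leaf->ext[pos] = *newext;
    leaf->entries++;
    return true;
}
```

Because the entries stay sorted, later lookups can keep using binary search; the cost of an insert is one `memmove` of at most a block's worth of 12-byte entries.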
// Extent tree insertion (simplified)
int ext4_ext_insert_extent(handle_t *handle, struct inode *inode,
                           struct ext4_ext_path **ppath,
                           struct ext4_extent *newext, int flags)
{
    struct ext4_ext_path *path = *ppath;
    struct ext4_extent_header *eh;
    struct ext4_extent *nearex;
    int depth = ext_depth(inode);
    int used;

    // Try to merge with surrounding extents
    if (ext4_ext_try_to_merge(handle, inode, path, newext)) {
        // Successfully merged - no new extent needed
        return 0;
    }

    eh = path[depth].p_hdr;

    // Check if there's room in the leaf
    if (le16_to_cpu(eh->eh_entries) < le16_to_cpu(eh->eh_max)) {
        // Room available - shift later entries up one slot, then insert
        nearex = path[depth].p_ext;
        used = le16_to_cpu(eh->eh_entries);
        memmove(nearex + 1, nearex,
                (used - (nearex - EXT_FIRST_EXTENT(eh))) *
                sizeof(struct ext4_extent));
        *nearex = *newext;
        le16_add_cpu(&eh->eh_entries, 1); // eh_entries is little-endian on disk
        return 0;
    }

    // No room - need to split
    return ext4_ext_split(handle, inode, flags, path, newext);
}

// Splitting a full extent leaf
int ext4_ext_split(handle_t *handle, struct inode *inode,
                   unsigned int flags, struct ext4_ext_path *path,
                   struct ext4_extent *newext)
{
    int depth = ext_depth(inode);

    // 1. Allocate new blocks for split
    // 2. Distribute extents between old and new blocks
    // 3. Update parent index (may need recursive split)
    // 4. If root needs split, grow tree depth
    if (depth == 0 &&
        path[0].p_hdr->eh_entries == path[0].p_hdr->eh_max) {
        // Root (in inode) is full - must grow tree
        return ext4_ext_grow_indepth(handle, inode, flags);
    }
    // ... split logic ...
}
Truncating (Removing Extents):
When a file is truncated or deleted, extents must be removed:
1. Find extent(s) covering truncation point
2. For partial first extent: shrink its length
3. For fully covered extents: remove entirely
4. Free physical blocks to block allocator
5. If tree becomes sparse: may reduce depth
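Steps 2 through 4 can be modelled on a flat, sorted extent array. The following is a hedged sketch rather than the kernel's removal path: `struct file_extent` and `truncate_extents` are hypothetical names, and "freeing" blocks is reduced to counting them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct file_extent {
    uint32_t lblk;  /* first logical block */
    uint32_t len;   /* length in blocks */
    uint64_t pblk;  /* first physical block */
};

/* Truncate the mapping so that only logical blocks below new_blocks
 * remain. exts must be sorted by lblk. Updates *count to the number of
 * surviving extents and returns how many blocks were released. */
static uint64_t truncate_extents(struct file_extent *exts, size_t *count,
                                 uint32_t new_blocks)
{
    assert(exts != NULL && count != NULL);
    uint64_t freed = 0;
    size_t keep = 0;

    for (size_t i = 0; i < *count; i++) {
        uint64_t end = (uint64_t)exts[i].lblk + exts[i].len; /* exclusive */
        if (end <= new_blocks) {
            keep = i + 1;                      /* entirely below the cut */
        } else if (exts[i].lblk < new_blocks) {
            freed += end - new_blocks;         /* straddles the cut: shrink */
            exts[i].len = new_blocks - exts[i].lblk;
            keep = i + 1;
        } else {
            freed += exts[i].len;              /* entirely above: remove */
        }
    }
    *count = keep;
    return freed;
}
```

At most one extent is shrunk; everything after it is dropped whole, which is why truncating a well-laid-out file touches very little metadata.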
Punch Hole (Sparse Files):
ext4 supports punching holes in existing files:
fallocate -p -o 1G -l 100M largefile.dat
# Deallocates 100 MB starting at offset 1 GB
The result is a sparse file where reading the hole returns zeros without consuming disk space:
before: [extent: 0-500000]
hole: [remove blocks 250000-274999]
after: [extent: 0-249999] [gap] [extent: 275000-500000]
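The before/after picture corresponds to splitting one extent around the punched range. A minimal sketch, assuming a simplified in-memory extent (`struct ext` and `punch_hole` are hypothetical names) whose length is not limited to 32,768 blocks, so that it can mirror the example's single large extent:

```c
#include <assert.h>
#include <stdint.h>

struct ext {
    uint32_t lblk;  /* first logical block */
    uint32_t len;   /* length in blocks */
    uint64_t pblk;  /* first physical block */
};

/* Remove logical blocks [hole_start, hole_end) from ex. The surviving
 * pieces (zero, one, or two) are written to out; returns their count. */
static int punch_hole(const struct ext *ex, uint32_t hole_start,
                      uint32_t hole_end, struct ext out[2])
{
    assert(hole_start < hole_end);
    uint64_t ex_end = (uint64_t)ex->lblk + ex->len;  /* exclusive */
    int n = 0;

    if (hole_end <= ex->lblk || hole_start >= ex_end) {
        out[n++] = *ex;                 /* no overlap: extent unchanged */
        return n;
    }
    if (hole_start > ex->lblk) {        /* piece before the hole */
        out[n].lblk = ex->lblk;
        out[n].len = hole_start - ex->lblk;
        out[n].pblk = ex->pblk;
        n++;
    }
    if (hole_end < ex_end) {            /* piece after the hole */
        out[n].lblk = hole_end;
        out[n].len = (uint32_t)(ex_end - hole_end);
        out[n].pblk = ex->pblk + (hole_end - ex->lblk);
        n++;
    }
    return n;
}
```

The tail piece keeps its original physical location: only its starting logical block and physical offset move forward by the same amount, so no data is copied.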
Extent Status Tree:
ext4 maintains an in-memory cache of extent status:
struct extent_status {
struct rb_node rb_node; // Red-black tree node
ext4_lblk_t es_lblk; // Logical block
ext4_lblk_t es_len; // Length
ext4_fsblk_t es_pblk; // Physical block (or 0 for hole)
unsigned int es_status; // Written, unwritten, delayed, hole
};
This cache enables O(log n) extent lookups without disk I/O, crucial for performance with large files.
Use filefrag -v filename to see a file's extent layout. It shows logical blocks, physical blocks, and extent lengths—invaluable for understanding fragmentation.
Extents provide multiple layers of performance improvement over indirect blocks.
1. Reduced Metadata Overhead (1 TB contiguous file, 4 KB blocks):
| Metric | Indirect Blocks | Extents |
|---|---|---|
| Mapping entries needed | 268,435,456 block pointers | 8,192 extents (one per 128 MB) |
| Metadata blocks | ~262,000 | ~25 |
| Metadata size | ~1 GB | ~100 KB |
| Reduction | — | ~10,000× |
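The metadata block counts can be derived rather than estimated. The sketch below assumes the classic layout of 12 direct pointers followed by single, double, and triple indirect trees, and 340 extents per 4 KB leaf block; all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t div_round_up(uint64_t a, uint64_t b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Indirect blocks needed to map data_blocks, with ppb pointers per block.
 * Level 1 is the single indirect tree, level 2 double, level 3 triple. */
static uint64_t indirect_metadata_blocks(uint64_t data_blocks, uint64_t ppb)
{
    uint64_t meta = 0;
    uint64_t left = data_blocks > 12 ? data_blocks - 12 : 0;
    uint64_t span = 1;  /* data blocks one tree of this level can map */

    for (int level = 1; level <= 3 && left > 0; level++) {
        span *= ppb;
        uint64_t covered = left < span ? left : span;
        uint64_t unit = ppb;  /* data blocks mapped per block at this tier */
        for (int tier = 0; tier < level; tier++) {
            meta += div_round_up(covered, unit);
            unit *= ppb;
        }
        left -= covered;
    }
    return meta;
}

/* Leaf blocks needed when every extent has the maximum length. */
static uint64_t extent_leaf_blocks(uint64_t data_blocks,
                                   uint64_t max_extent_len,
                                   uint64_t extents_per_leaf)
{
    uint64_t extents = div_round_up(data_blocks, max_extent_len);
    return div_round_up(extents, extents_per_leaf);
}
```

For a 1 TB file (2^28 blocks of 4 KB) this yields 262,401 indirect blocks against 25 extent leaf blocks, which is where the roughly 10,000× figure comes from. The extent side ignores the handful of index blocks above the leaves.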
2. Faster Sequential Access:
With extents, sequential read/write knows the entire contiguous range:
// With indirect blocks
for (block = 0; block < 1000000; block++) {
pblock = lookup_indirect(block); // Up to 4 lookups each
read_block(pblock);
}
// With extents
extent = lookup_extent(0); // One lookup
for (block = 0; block < extent.len; block++) {
read_block(extent.start + block); // No more lookups!
}
3. Better Disk Scheduling:
Knowing contiguous ranges enables:
4. Efficient Sparse Files:
Extents naturally represent holes (gaps in logical-to-physical mapping):
File with hole:
[extent: blocks 0-99 @ physical 1000-1099]
(no extent for blocks 100-199 = hole)
[extent: blocks 200-299 @ physical 2000-2099]
Reading blocks 100-199 returns zeros without disk access.
Benchmarks typically show 2-3× improvement in large file throughput when comparing ext4 extents to ext3 indirect blocks. For database workloads with preallocated files, the improvement can be even more dramatic due to reduced metadata I/O.
Understanding how to inspect, optimize, and troubleshoot extent-based files is essential for system administration.
Viewing Extent Information:
# filefrag: Show file's extent layout
filefrag -v largefile.dat

# Example output:
# Filesystem type is: ef53 (ext4)
# File size of largefile.dat is 1073741824 (262144 blocks of 4096 bytes)
#  ext:  logical_offset:   physical_offset:   length:  expected:  flags:
#    0:       0..  32767:    98304.. 131071:   32768:
#    1:   32768..  65535:   163840.. 196607:   32768:   131072:
#    2:   65536..  98303:   229376.. 262143:   32768:   196608:
#    ...
# largefile.dat: 8 extents found

# debugfs: Low-level extent examination
debugfs /dev/sda1
debugfs: stat <inode_number>
# Shows extent tree in detail

debugfs: extents <inode_number>
# Walks and displays extent tree

# e2freefrag: Show free space fragmentation
e2freefrag /dev/sda1
# Histogram of free extent sizes

# Check if file uses extents (vs indirect)
lsattr largefile.dat
# 'e' in output indicates extents

# Force extent-format output, even for block-mapped files
filefrag -e largefile.dat
Defragmenting Extent Files:
# Online defragmentation (ext4 only)
e4defrag /path/to/file # Defragment single file
e4defrag /mount/point # Defragment entire filesystem
e4defrag -v /path/to/file # Verbose output
# Check fragmentation score
filefrag largefile.dat
# "8 extents found" - lower is better
# Pre-defrag: check if worth it
e4defrag -c /path/to/file
# Shows current vs optimal extent count
Preallocation Strategies:
# Create contiguous preallocated file
fallocate -l 10G /var/lib/mysql/ibdata1
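# Portable fallback: write zeros (allocates and writes every block, so much slower than fallocate)
# status=progress is a GNU dd option; drop it on other systems
dd if=/dev/zero of=datafile bs=1M count=10240 status=progress
# For databases: ensure files stay contiguous
# MySQL: innodb_file_per_table = OFF (single large file)
# PostgreSQL: WAL segment size can be set at initdb (--wal-segsize); data segment size is a build-time option (--with-segsize)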
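From a program, the portable way to request preallocation is `posix_fallocate()`. The sketch below wraps it in a hypothetical helper `preallocate_fd`; on ext4 the call is expected to be served by unwritten extents, while on filesystems without native support the C library may fall back to writing zeros, so treat the performance benefit as filesystem-dependent.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Reserve `size` bytes for the file behind fd, starting at offset 0.
 * Returns 0 on success or an errno value: posix_fallocate() reports
 * errors through its return value and does not set errno. */
static int preallocate_fd(int fd, off_t size)
{
    assert(size > 0);
    int err = posix_fallocate(fd, 0, size);
    if (err != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
    return err;
}

/* Apparent file size in bytes, or -1 if fstat() fails. */
static off_t file_size(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 ? st.st_size : (off_t)-1;
}
```

Unlike the `dd` approach, a successful call extends the file to the requested size without the application writing any data itself; `filefrag -v` on the resulting file shows how contiguously the space was placed.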
Converting Indirect → Extent:
Files created before ext4 or with extent disabled use indirect blocks:
# Enable extent feature on filesystem
tune2fs -O extent /dev/sda1
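# Existing files remain indirect
# To convert one file: rewrite it
cp oldfile newfile # newfile uses extents
mv newfile oldfile
# Or migrate it in place by setting the extent attribute
chattr +e oldfile
# For a full (unmounted) filesystem conversion: e2fsck -fE bmap2extent /dev/sda1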
| Extent Count | Fragmentation Level | Action |
|---|---|---|
| 1-10 | Excellent | None needed |
| 10-100 | Good | Monitor |
| 100-1000 | Moderate | Consider defrag |
| > 1000 | High | Defragment recommended |
| > 10000 | Severe | Defragment + investigate cause |
The best defragmentation is prevention: use fallocate for large files, keep filesystems <80% full, enable delayed allocation (default), and avoid many small appends to the same file. Journaling databases should preallocate transaction logs.
Extents represent the most significant architectural advancement from ext3 to ext4, transforming how file systems handle large files and modern workloads.
fallocate() reserves contiguous space as unwritten extents, ideal for databases and VMs.
Module Complete: Ext2/Ext3/Ext4
You've now explored the complete ext file system family:
Together, these concepts explain how the most widely deployed Linux file system operates—from the high-level organization down to individual data structures. This knowledge is essential for performance tuning, troubleshooting, recovery operations, and understanding modern file system design.
Congratulations! You've mastered the ext2/ext3/ext4 file system family, understanding everything from historical context through modern extent-based allocation. This knowledge positions you to work effectively with Linux storage, troubleshoot filesystem issues, and appreciate the engineering decisions that make ext4 the workhorse of Linux deployments worldwide.