Imagine organizing a massive library where you must store millions of books. One approach would be a single catalog listing every book's location. Simple, but when the catalog spans thousands of pages, every lookup requires searching the entire index. The alternative: divide the library into sections, each with its own local catalog.
This is precisely the strategy the ext file system employs through block groups. Rather than managing an entire disk as a monolithic unit, ext2/ext3/ext4 partition the storage into manageable chunks—block groups—each containing its own metadata and data blocks. This seemingly simple organizational choice has profound implications for performance, reliability, and scalability.
Block groups are the architectural decision that allowed ext file systems to scale from the 2 GB partitions of 1993 to the exabyte volumes theoretically supported by ext4 today. Understanding block groups means understanding how Linux manages disk space at its most fundamental level.
By the end of this page, you will understand the complete architecture of block groups, including their internal structure, the rationale behind their design, how they improve locality and reduce fragmentation, and the specific algorithms used for inode and block allocation within and across groups.
A block group is a contiguous region of the disk partition that contains a complete, self-sufficient set of metadata alongside data blocks. Think of each block group as a miniature file system within the larger file system—capable of tracking its own inode allocation, block allocation, and local data storage.
Why Block Groups Exist:
The block group design solves several problems that plagued simpler file system designs:
Locality: Files stored together tend to be accessed together. Keeping a file's inode near its data blocks reduces seek time.
Metadata distribution: Spreading inodes and bitmaps across the disk prevents any single region from becoming a bottleneck.
Parallel access: Multiple block groups can be accessed concurrently on modern storage systems.
Fault isolation: Corruption in one block group is less likely to affect others.
Seek reduction: On rotational disks, keeping related data together minimizes head movement.
Block Group Size Calculation:
The size of each block group is determined by the block size and the size of the block bitmap:
blocks_per_group = block_size × 8 (bits per byte)

| Block Size | Blocks per Group | Group Size | Inodes per Group (default) |
|---|---|---|---|
| 1 KB | 8,192 | 8 MB | 1,832 |
| 2 KB | 16,384 | 32 MB | 3,664 |
| 4 KB | 32,768 | 128 MB | 8,192 |
Example: 1 TB Partition with 4 KB Blocks:
Total blocks: 1 TB / 4 KB = 268,435,456 blocks
Blocks per group: 32,768
Number of groups: 268,435,456 / 32,768 = 8,192 block groups
Each of these 8,192 groups manages 128 MB of storage with its own local metadata.
With 4 KB blocks, each group is 128 MB—large enough to hold substantial files locally, small enough that metadata structures (bitmaps, inode tables) fit in a few blocks. This balance was carefully chosen to optimize for the typical file sizes and access patterns of Unix systems.
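To make the arithmetic concrete, here is a minimal user-space sketch (not ext4 code) that derives the geometry above from a block size and partition size:

```c
#include <stdio.h>
#include <stdint.h>

/* Minimal sketch (not ext4 code): derive block-group geometry from
 * block size and partition size, per blocks_per_group = block_size * 8. */
int main(void) {
    uint64_t block_size = 4096;                 /* bytes */
    uint64_t partition  = 1ULL << 40;           /* 1 TB */

    uint64_t blocks_per_group = block_size * 8; /* one bitmap block tracks 8 bits per byte */
    uint64_t total_blocks = partition / block_size;
    uint64_t groups = (total_blocks + blocks_per_group - 1) / blocks_per_group;

    printf("blocks/group: %llu, groups: %llu, group size: %llu MB\n",
           (unsigned long long)blocks_per_group,
           (unsigned long long)groups,
           (unsigned long long)(blocks_per_group * block_size >> 20));
    return 0;
}
```

Running this reproduces the worked example: 32,768 blocks per group, 8,192 groups, 128 MB per group.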
Each block group contains several distinct components, laid out in a specific order. Understanding this layout is essential for file system analysis, recovery, and optimization.
Component Layout:
```
Block Group N Layout (4 KB blocks, typical configuration)
==========================================================

Offset (blocks)   Component             Size
---------------   ---------             ----
0                 Superblock*           1 block (4 KB)
1                 Group Descriptors*    N blocks (varies with FS size)
N+1               Reserved GDT blocks*  For online growth
M                 Data Block Bitmap     1 block (4 KB = 32K bits)
M+1               Inode Bitmap          1 block (4 KB = 32K bits)
M+2               Inode Table           K blocks (inodes_per_group × 256 / 4096)
M+2+K             Data Blocks           Remaining blocks in group

* Present only in groups with superblock backup (sparse_super feature)
  Groups 0, 1, and powers of 3, 5, 7: 0, 1, 3, 5, 7, 9, 25, 27, 49, 81, ...

Example for 1 TB filesystem, Block Group 0:
- Superblock:        Block 0
- Group Descriptors: Blocks 1-64 (8,192 groups × 32 bytes / 4096 = 64 blocks)
- Reserved GDT:      Blocks 65-1088 (1,024 blocks, for growing to 16 TB)
- Block Bitmap:      Block 1089
- Inode Bitmap:      Block 1090
- Inode Table:       Blocks 1091-1602 (8,192 inodes × 256 bytes / 4096 = 512 blocks)
- Data Blocks:       Blocks 1603-32767
```

Component Details:
1. Superblock (1 block)
2. Group Descriptor Table (variable)
3. Reserved GDT Blocks (optional)
4. Data Block Bitmap (1 block)
5. Inode Bitmap (1 block)
6. Inode Table (variable)
| Component | Formula | Example (4 KB blocks, 8192 groups) |
|---|---|---|
| Superblock | Fixed: 1 block | 4 KB |
| Group Desc Table | ceil(groups × desc_size / block_size) | 256 KB (32 bytes × 8192 / 4096 = 64 blocks) |
| Reserved GDT | Calculated for max growth | Variable (often 1024 blocks) |
| Block Bitmap | Fixed: 1 block | 4 KB (tracks 32,768 blocks) |
| Inode Bitmap | Fixed: 1 block | 4 KB (tracks up to 32,768 inodes) |
| Inode Table | inodes_per_group × inode_size / block_size | 2 MB (8192 × 256 / 4096 = 512 blocks) |
While block group metadata enables efficient allocation, it consumes disk space. For a 1 TB filesystem: ~2% overhead for metadata (superblocks, group descriptors, bitmaps, inode tables). This is the price of organization—a worthwhile trade-off for performance and reliability.
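A back-of-the-envelope check of that figure, assuming an ordinary group with no superblock backup:

```c
#include <stdio.h>

/* Rough per-group overhead: 1 block bitmap + 1 inode bitmap +
 * 512 inode-table blocks, out of 32,768 blocks per group. */
int main(void) {
    double meta  = 1 + 1 + 512;
    double total = 32768;
    printf("per-group metadata: %.2f%%\n", 100.0 * meta / total); /* ~1.57% */
    return 0;
}
```

Superblock and GDT backups in the sparse_super groups push the total toward the ~2% figure.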
The Group Descriptor Table (GDT) is a critical structure containing metadata about every block group in the file system. It's the "table of contents" that tells the kernel where to find each group's bitmaps and inode table.
Group Descriptor Structure:
```c
// ext4 group descriptor structure
struct ext4_group_desc {
    __le32 bg_block_bitmap_lo;        // Block bitmap block (low 32 bits)
    __le32 bg_inode_bitmap_lo;        // Inode bitmap block (low 32 bits)
    __le32 bg_inode_table_lo;         // Inode table block (low 32 bits)
    __le16 bg_free_blocks_count_lo;   // Free blocks count (low 16 bits)
    __le16 bg_free_inodes_count_lo;   // Free inodes count (low 16 bits)
    __le16 bg_used_dirs_count_lo;     // Directories count (low 16 bits)
    __le16 bg_flags;                  // EXT4_BG_* flags
    __le32 bg_exclude_bitmap_lo;      // Exclude bitmap (snapshots)
    __le16 bg_block_bitmap_csum_lo;   // Block bitmap checksum (low 16 bits)
    __le16 bg_inode_bitmap_csum_lo;   // Inode bitmap checksum (low 16 bits)
    __le16 bg_itable_unused_lo;       // Unused inodes count (low 16 bits)
    __le16 bg_checksum;               // Group descriptor checksum

    // --- Extended descriptor for 64-bit mode (additional 32 bytes) ---
    __le32 bg_block_bitmap_hi;        // Block bitmap block (high 32 bits)
    __le32 bg_inode_bitmap_hi;        // Inode bitmap block (high 32 bits)
    __le32 bg_inode_table_hi;         // Inode table block (high 32 bits)
    __le16 bg_free_blocks_count_hi;   // Free blocks count (high 16 bits)
    __le16 bg_free_inodes_count_hi;   // Free inodes count (high 16 bits)
    __le16 bg_used_dirs_count_hi;     // Directories count (high 16 bits)
    __le16 bg_itable_unused_hi;       // Unused inodes count (high 16 bits)
    __le32 bg_exclude_bitmap_hi;      // Exclude bitmap (high 32 bits)
    __le16 bg_block_bitmap_csum_hi;   // Block bitmap checksum (high 16 bits)
    __le16 bg_inode_bitmap_csum_hi;   // Inode bitmap checksum (high 16 bits)
    __u32  bg_reserved;               // Reserved for future use
};
// Total size: 32 bytes (32-bit) or 64 bytes (64-bit)
```

Key Fields Explained:
| Field | Purpose | Usage |
|---|---|---|
bg_block_bitmap | Points to block bitmap | Loaded when allocating blocks |
bg_inode_bitmap | Points to inode bitmap | Loaded when allocating inodes |
bg_inode_table | Points to inode table start | Essential for inode lookup |
bg_free_blocks_count | Cached free block count | Quick capacity check |
bg_free_inodes_count | Cached free inode count | Quick allocation decision |
bg_used_dirs_count | Directory counter | Helps distribute new directories |
bg_flags | INODE_UNINIT, BLOCK_UNINIT | Lazy initialization flags |
bg_checksum | CRC checksum | Corruption detection |
Lazy Initialization (bg_flags):
ext4 introduced lazy initialization to speed up mkfs and first-time writes:
- INODE_UNINIT: Inode table not yet zeroed
- BLOCK_UNINIT: Block bitmap not yet initialized

When a group is first used, the kernel initializes it on demand. This dramatically speeds up formatting large volumes: a 16 TB file system can be formatted in seconds rather than minutes.
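As a sketch, code reading a descriptor might test these flags like this (the flag values follow the kernel's EXT4_BG_* definitions; the helper itself is illustrative):

```c
#include <stdint.h>

#define EXT4_BG_INODE_UNINIT 0x0001 /* inode table/bitmap not initialized */
#define EXT4_BG_BLOCK_UNINIT 0x0002 /* block bitmap not initialized */

/* Sketch: does this group still need on-demand initialization? */
static int group_needs_init(uint16_t bg_flags)
{
    return (bg_flags & (EXT4_BG_INODE_UNINIT | EXT4_BG_BLOCK_UNINIT)) != 0;
}
```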
GDT Redundancy:
The GDT is replicated in backup groups alongside the superblock. This redundancy allows recovery if the primary copies are corrupted:
# Recover using backup superblock from group 1
e2fsck -b 32768 -y /dev/sda1
# List backup superblock locations
dumpe2fs /dev/sda1 | grep -i 'backup'
When the 64-bit feature is enabled (default for volumes over 16 TB), group descriptors expand to 64 bytes. The *_hi fields provide additional bits for addressing blocks beyond the 2^32 limit, enabling support for volumes up to 1 exabyte.
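A minimal sketch of how the halves combine (field names follow struct ext4_group_desc above; the helper is illustrative, not the kernel's):

```c
#include <stdint.h>

/* Sketch: reassemble a 64-bit block number from a descriptor's
 * _lo/_hi halves (values already converted from little-endian). */
static uint64_t combine_lo_hi(uint32_t lo, uint32_t hi, int has_64bit_feature)
{
    return has_64bit_feature ? ((uint64_t)hi << 32) | lo : (uint64_t)lo;
}
```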
Bitmaps provide the fundamental mechanism for tracking allocation status. Each block group maintains two bitmaps: one for data blocks, one for inodes.
Bitmap Structure:
Block Bitmap (4096 bytes = 32,768 bits)
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | ...
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| | | | | | | | | | | | | | | |
V V V V V V V V V V V V V V V V
B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 B15
used used used free free used...
Each bit position directly corresponds to a block (or inode) number within the group.
Bitmap Operations:
```c
// Bitmap manipulation primitives
#define BITS_PER_BYTE 8

// Check if a block is allocated
static inline int is_block_allocated(const unsigned char *bitmap, unsigned long block)
{
    unsigned long byte_offset = block / BITS_PER_BYTE;
    unsigned int bit_offset = block % BITS_PER_BYTE;
    return (bitmap[byte_offset] >> bit_offset) & 1;
}

// Allocate a block (set bit to 1)
static inline void allocate_block(unsigned char *bitmap, unsigned long block)
{
    unsigned long byte_offset = block / BITS_PER_BYTE;
    unsigned int bit_offset = block % BITS_PER_BYTE;
    bitmap[byte_offset] |= (1 << bit_offset);
}

// Free a block (set bit to 0)
static inline void free_block(unsigned char *bitmap, unsigned long block)
{
    unsigned long byte_offset = block / BITS_PER_BYTE;
    unsigned int bit_offset = block % BITS_PER_BYTE;
    bitmap[byte_offset] &= ~(1 << bit_offset);
}

// Find first free block using optimized search
unsigned long find_first_free_block(const unsigned char *bitmap,
                                    unsigned long start, unsigned long count)
{
    unsigned long i;

    // Search byte-by-byte for efficiency
    for (i = start / BITS_PER_BYTE; i < count / BITS_PER_BYTE; i++) {
        if (bitmap[i] != 0xFF) {
            // Found a byte with at least one free bit
            unsigned int bit;
            for (bit = 0; bit < BITS_PER_BYTE; bit++) {
                if (!(bitmap[i] & (1 << bit))) {
                    unsigned long block = i * BITS_PER_BYTE + bit;
                    if (block >= start && block < count)
                        return block;
                }
            }
        }
    }
    return count; // No free block found
}

// Count free blocks (used for bg_free_blocks_count validation)
unsigned long count_free_blocks(const unsigned char *bitmap, unsigned long count)
{
    unsigned long free = 0;
    unsigned long i;
    for (i = 0; i < count; i++) {
        if (!is_block_allocated(bitmap, i))
            free++;
    }
    return free;
}
```

Optimization Techniques:
Modern ext4 uses several optimizations for bitmap operations:
Byte-level scanning: Skip full bytes (value 0xFF) when searching for free blocks
Hardware bit operations: Use CPU instructions like __builtin_ffs() (find first set) for efficient searching (see the sketch after this list)
Goal-oriented allocation: Start search near a "goal" block to improve locality
Preallocation window: Reserve contiguous ranges to reduce bitmap updates
Delayed allocation: Defer bitmap updates until actual write, enabling larger contiguous allocations
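A sketch of the hardware-assisted search mentioned above, scanning 64 bits at a time with GCC/Clang's __builtin_ffsll (illustrative; ext4's real scan lives in mballoc):

```c
#include <stdint.h>

/* Sketch: skip fully allocated 64-bit words, then use find-first-set
 * on the inverted word to locate the first free bit. Assumes the
 * bitmap length is a multiple of 64 bits. */
static long find_free_bit_fast(const uint64_t *bitmap, unsigned long nwords)
{
    for (unsigned long i = 0; i < nwords; i++) {
        uint64_t inv = ~bitmap[i];            /* free bits become 1s */
        if (inv)                              /* at least one free bit here */
            return (long)(i * 64) + __builtin_ffsll((long long)inv) - 1;
    }
    return -1;                                /* group is full */
}
```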
Bitmap Checksums (ext4):
ext4 optionally computes and verifies checksums for bitmaps:
// Verify block bitmap checksum (simplified sketch)
uint32_t calculated = ext4_chksum(sbi, sbi->s_csum_seed,
                                  bitmap, bitmap_size);
uint32_t stored = group_desc->bg_block_bitmap_csum_lo; // low 16 bits
if ((calculated & 0xFFFF) != stored) {
    ext4_error(sb, "Bitmap checksum mismatch for group %u", group);
    // Mark filesystem as having errors
}
Checksum validation catches bit flips from hardware errors, detecting silent corruption before it propagates.
| Operation | Time Complexity | Notes |
|---|---|---|
| Check single bit | O(1) | Direct indexed access |
| Set/clear bit | O(1) | Direct indexed access |
| Find first free | O(n) worst case | Optimized with byte-scan |
| Find contiguous range | O(n) | May span multiple bytes |
| Count all free | O(n) | Cached in group descriptor |
Bitmaps offer an excellent balance: O(1) allocation status lookup, compact representation (1 bit per block), and fast bulk operations. A 4 KB bitmap tracks 32,768 blocks (128 MB with 4 KB blocks)—enough for efficient local allocation without excessive bitmap scanning.
Inode allocation in ext file systems follows sophisticated strategies designed to maximize locality while distributing load across block groups.
The Orlov Allocator:
ext2/ext3/ext4 use the "Orlov allocator" for directory inode placement, named after Grigory Orlov, who devised the algorithm. Its goals are to spread unrelated top-level directories across block groups (balancing load) while keeping subdirectories near their parents (preserving locality).
Directory Allocation Algorithm:
```c
// Simplified Orlov directory inode allocation
ext4_group_t find_group_for_directory(struct super_block *sb, struct inode *parent)
{
    struct ext4_sb_info *sbi = EXT4_SB(sb);
    ext4_group_t ngroups = sbi->s_groups_count;
    ext4_group_t group, parent_group;

    parent_group = EXT4_I(parent)->i_block_group;

    // For root directory children: spread across groups
    if (parent == d_inode(sb->s_root)) {
        // Find group with good balance of free inodes/blocks
        ext4_group_t best_group = 0;
        unsigned int best_score = 0;

        for (group = 0; group < ngroups; group++) {
            struct ext4_group_desc *desc = ext4_get_group_desc(sb, group);
            unsigned int free_inodes = ext4_free_inodes_count(sb, desc);
            unsigned int free_blocks = ext4_free_group_clusters(sb, desc);
            unsigned int dirs = ext4_used_dirs_count(sb, desc);

            // Score: prefer groups with free space and few directories
            // (DIRS_PENALTY is a tunable weight)
            unsigned int score = free_inodes + free_blocks / 16 - dirs * DIRS_PENALTY;
            if (score > best_score) {
                best_score = score;
                best_group = group;
            }
        }
        return best_group;
    }

    // For subdirectories: stay near parent
    // Check parent group first, then nearby groups
    for (int offset = 0; offset < 8; offset++) {
        group = (parent_group + offset) % ngroups;
        struct ext4_group_desc *desc = ext4_get_group_desc(sb, group);
        if (ext4_free_inodes_count(sb, desc) >= sbi->s_inodes_per_group / 4)
            return group;
    }

    // Fallback: use Orlov heuristic with linear search
    return ext4_find_group_orlov(sb, parent);
}

// Regular file inode allocation: near directory
ext4_group_t find_group_for_file(struct super_block *sb, struct inode *dir)
{
    ext4_group_t dir_group = EXT4_I(dir)->i_block_group;

    // First choice: same group as directory
    struct ext4_group_desc *desc = ext4_get_group_desc(sb, dir_group);
    if (ext4_free_inodes_count(sb, desc) > 0)
        return dir_group;

    // Second: nearby groups (flex_bg aware)
    // Third: linear search for any group with free inodes
    return ext4_find_group_other(sb, dir);
}
```

The Strategy in Action:
Consider creating a directory structure /home/user/documents/:
- /home (child of root) → placed in the group with the best free-space/directory balance
- /home/user → placed near or in the same group as /home
- /home/user/documents → placed near or in the same group as /home/user
- Files inside documents/ → same group as the documents directory inode

This keeps related files physically close, minimizing seek time for directory traversals.
Regular File Inodes:
For regular files, the allocation is simpler:

1. First choice: the same group as the containing directory
2. Second choice: nearby groups (flex_bg aware)
3. Fallback: linear search for any group with free inodes
Flex Block Groups (ext4):
ext4 introduces the concept of "flex block groups" (configurable, typically 16 groups):
Flex Group 0: Block Groups 0-15
Flex Group 1: Block Groups 16-31
...
Bitmaps and inode tables from multiple groups are concatenated, enabling:

- Large data regions free of per-group metadata interruptions
- Sequential reads of metadata for several groups at once
- Better placement for files larger than a single 128 MB group
Use stat to see which block group contains a file's inode. The formula: block_group = (inode_number - 1) / inodes_per_group. Tools like debugfs can display detailed allocation information for analysis.
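A worked example of that formula (the inode number is illustrative):

```c
#include <stdio.h>

/* Which block group holds a given inode? Formula from above:
 * block_group = (inode_number - 1) / inodes_per_group. */
int main(void) {
    unsigned long ino = 1234567, inodes_per_group = 8192;
    unsigned long group = (ino - 1) / inodes_per_group;
    unsigned long index = (ino - 1) % inodes_per_group; /* slot in that group's inode table */
    printf("inode %lu -> group %lu, index %lu\n", ino, group, index);
    return 0;
}
```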
Block allocation in ext4 employs the multiblock allocator (mballoc), a sophisticated system designed to minimize fragmentation while maximizing allocation speed.
Goals of Block Allocation:

- Contiguity: allocate physically adjacent blocks so files can be read sequentially
- Locality: place a file's blocks near its inode and related files
- Speed: make allocation decisions without scanning every bitmap on every write
Multiblock Allocator (mballoc):
Unlike ext2's block-at-a-time allocation, mballoc finds and claims an entire range of blocks in a single operation.
Key mballoc Concepts:
1. Buddy System Cache
mballoc maintains an in-memory cache of free extents organized like a buddy allocator:
Buddy Cache for Group 5:
Order 0 (1 block): blocks 100, 102, 105, 110...
Order 1 (2 blocks): blocks 200-201, 204-205...
Order 2 (4 blocks): blocks 300-303, 308-311...
...
Order 10 (1024 blocks): blocks 5000-6023
This enables O(log n) search for free regions of specific sizes.
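As a sketch of how a request maps onto those orders (an illustrative helper, not mballoc's data structures):

```c
/* Sketch of buddy-style lookup: given per-order free-extent counts,
 * find the smallest order that can satisfy a request of 'blocks'. */
static int find_order(const int free_count_per_order[], int max_order,
                      unsigned int blocks)
{
    int order = 0;
    while ((1u << order) < blocks)        /* smallest covering power of two */
        order++;
    for (; order <= max_order; order++)   /* climb until a free extent exists */
        if (free_count_per_order[order] > 0)
            return order;
    return -1;                            /* nothing large enough in this group */
}
```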
2. Preallocation (PA)
Preallocation reserves contiguous blocks before actual data is written:
// Per-inode preallocation
struct ext4_prealloc_space {
struct list_head pa_inode_list; // Link to inode's PA list
ext4_fsblk_t pa_pstart; // Physical start block
ext4_lblk_t pa_lstart; // Logical start block
ext4_grpblk_t pa_len; // Length in blocks
ext4_grpblk_t pa_free; // Free blocks in this PA
};
3. Normalization
Small requests are "normalized" to power-of-two sizes for buddy efficiency: a 5-block request is rounded up to 8 blocks, a 13-block request to 16.
Excess blocks become preallocated space for future writes.
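A sketch of that rounding, including the surplus that becomes preallocated space (illustrative, not ext4's actual normalization code):

```c
/* Sketch: round a small request up to a power of two; the surplus
 * blocks are kept as preallocation for future appends. */
static unsigned int normalize_request(unsigned int req, unsigned int *surplus)
{
    unsigned int size = 1;
    while (size < req)
        size <<= 1;
    *surplus = size - req; /* e.g. req=5 -> size=8, surplus=3 */
    return size;
}
```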
| Scenario | Strategy | Goal Region |
|---|---|---|
| New file, first write | Near directory inode | Same block group as dir |
| Extending file | Contiguous to existing | Following last block |
| Small file (<64 KB) | Locality group PA | Near other small files |
| Large sequential write | Delayed + normalized | Large contiguous region |
| Direct I/O | Immediate allocation | Goal or first available |
ext4's delayed allocation defers block allocation until writeback. This enables the allocator to see the full write size before choosing blocks, dramatically improving contiguity for streaming writes. A 100 MB sequential write becomes a single large extent rather than scattered 4 KB blocks.
ext4 introduced flex block groups as a meta-organization layer that aggregates multiple traditional block groups into larger allocation units.
The Problem with Small Groups:
With 4 KB blocks, each block group is 128 MB. For large files this creates problems:

- A file larger than 128 MB cannot fit in one group and must span several
- Per-group metadata (bitmaps, inode tables) interrupts otherwise contiguous data
- The allocator must repeatedly cross group boundaries, fragmenting extents
Flex Block Group Solution:
A flex_bg bundles multiple block groups together:
Flex Group Size = 2^flex_bg_size traditional groups
Example: flex_bg_size = 4 → 16 groups per flex_bg
16 groups × 128 MB = 2 GB allocation unit
Benefits of Flex Block Groups:

- Contiguous Metadata: bitmaps and inode tables of all member groups are packed together at the start of the flex group
- Larger Contiguous Data Regions: the rest of the flex group is pure data, uninterrupted by metadata
- Improved mballoc Efficiency: larger free extents give the buddy cache bigger regions to track and allocate from
Configuration:
# Create filesystem with specific flex_bg size
mkfs.ext4 -G 16 /dev/sda1 # 16 groups per flex_bg (2 GB units)
# Check current flex_bg size
dumpe2fs /dev/sda1 | grep 'Flex block group'
# Typical output:
# Flex block group size: 16
Inode Table Packing:
With flex_bg, inode tables from constituent groups are placed contiguously:
Traditional: IT0...Data0...IT1...Data1...IT2...Data2
Flex_bg: IT0|IT1|IT2|IT3...Data0|Data1|Data2|Data3
\--- metadata ---/ \----- data --------/
This layout improves directory traversal performance where multiple inodes are read sequentially.
Default flex_bg size is 16 (2 GB units with 4 KB blocks). Larger values benefit workloads with very large files; smaller values may help workloads with many small files. For most uses, the default is optimal. Extremely large values (64+) can concentrate metadata, creating potential hotspots.
Block groups form the architectural foundation of ext file systems, enabling efficient organization of metadata and data across storage volumes of all sizes.
Next Up: The Superblock
With block groups understood, we'll examine the superblock—the critical data structure that defines file system parameters and state. The superblock contains the essential metadata needed to mount and use an ext file system, and its redundancy strategy protects against catastrophic corruption.
You now understand how block groups organize ext file systems, including their internal structure, allocation strategies, and the flex_bg optimization. This knowledge is essential for understanding file system performance characteristics and troubleshooting allocation issues.