Imagine organizing a massive library where you must store millions of books. One approach would be a single catalog listing every book's location. Simple, but when the catalog spans thousands of pages, every lookup requires searching the entire index. The alternative: divide the library into sections, each with its own local catalog.
This is precisely the strategy the ext file system employs through block groups. Rather than managing an entire disk as a monolithic unit, ext2/ext3/ext4 partition the storage into manageable chunks—block groups—each containing its own metadata and data blocks. This seemingly simple organizational choice has profound implications for performance, reliability, and scalability.
Block groups are the architectural decision that allowed ext file systems to scale from the 2 GB partitions of 1993 to the exabyte volumes theoretically supported by ext4 today. Understanding block groups means understanding how Linux manages disk space at its most fundamental level.
By the end of this page, you will understand the complete architecture of block groups, including their internal structure, the rationale behind their design, how they improve locality and reduce fragmentation, and the specific algorithms used for inode and block allocation within and across groups.
A block group is a contiguous region of the disk partition that contains a complete, self-sufficient set of metadata alongside data blocks. Think of each block group as a miniature file system within the larger file system—capable of tracking its own inode allocation, block allocation, and local data storage.
Why Block Groups Exist:
The block group design solves several problems that plagued simpler file system designs:
Locality: Files stored together tend to be accessed together. Keeping a file's inode near its data blocks reduces seek time.
Metadata distribution: Spreading inodes and bitmaps across the disk prevents any single region from becoming a bottleneck.
Parallel access: Multiple block groups can be accessed concurrently on modern storage systems.
Fault isolation: Corruption in one block group is less likely to affect others.
Seek reduction: On rotational disks, keeping related data together minimizes head movement.
Block Group Size Calculation:
The size of each block group is determined by the block size and the size of the block bitmap:
blocks_per_group = block_size × 8 (bits per byte)

| Block Size | Blocks per Group | Group Size | Inodes per Group (default) |
|---|---|---|---|
| 1 KB | 8,192 | 8 MB | 1,832 |
| 2 KB | 16,384 | 32 MB | 3,664 |
| 4 KB | 32,768 | 128 MB | 8,192 |
Example: 1 TB Partition with 4 KB Blocks:
Total blocks: 1 TB / 4 KB = 268,435,456 blocks
Blocks per group: 32,768
Number of groups: 268,435,456 / 32,768 = 8,192 block groups
Each of these 8,192 groups manages 128 MB of storage with its own local metadata.
With 4 KB blocks, each group is 128 MB—large enough to hold substantial files locally, small enough that metadata structures (bitmaps, inode tables) fit in a few blocks. This balance was carefully chosen to optimize for the typical file sizes and access patterns of Unix systems.
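To make the arithmetic concrete, here is a minimal user-space sketch (not ext4 code) that derives the geometry above from a block size and partition size:

```c
#include <stdio.h>
#include <stdint.h>

/* Minimal sketch (not ext4 code): derive block-group geometry from
 * block size and partition size, per blocks_per_group = block_size * 8. */
int main(void) {
    uint64_t block_size = 4096;                 /* bytes */
    uint64_t partition  = 1ULL << 40;           /* 1 TB */

    uint64_t blocks_per_group = block_size * 8; /* one bitmap block tracks 8 bits per byte */
    uint64_t total_blocks = partition / block_size;
    uint64_t groups = (total_blocks + blocks_per_group - 1) / blocks_per_group;

    printf("blocks/group: %llu, groups: %llu, group size: %llu MB\n",
           (unsigned long long)blocks_per_group,
           (unsigned long long)groups,
           (unsigned long long)(blocks_per_group * block_size >> 20));
    return 0;
}
```

Running this reproduces the worked example: 32,768 blocks per group, 8,192 groups, 128 MB per group.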
Each block group contains several distinct components, laid out in a specific order. Understanding this layout is essential for file system analysis, recovery, and optimization.
Component Layout:
```
Block Group N Layout (4 KB blocks, typical configuration)
==========================================================

Offset (blocks)   Component             Size
---------------   ---------             ----
0                 Superblock*           1 block (4 KB)
1                 Group Descriptors*    N blocks (varies with FS size)
N+1               Reserved GDT blocks*  For online growth
M                 Data Block Bitmap     1 block (4 KB = 32K bits)
M+1               Inode Bitmap          1 block (4 KB = 32K bits)
M+2               Inode Table           K blocks (inodes_per_group × 256 / 4096)
M+2+K             Data Blocks           Remaining blocks in group

* Present only in groups with superblock backup (sparse_super feature)
  Groups 0, 1, and powers of 3, 5, 7: 0, 1, 3, 5, 7, 9, 25, 27, 49, 81, ...

Example for 1 TB filesystem, Block Group 0:
- Superblock:        Block 0
- Group Descriptors: Blocks 1-64 (8,192 groups × 32 bytes / 4096 = 64 blocks)
- Reserved GDT:      Blocks 65-1088 (1,024 blocks, for growing to 16 TB)
- Block Bitmap:      Block 1089
- Inode Bitmap:      Block 1090
- Inode Table:       Blocks 1091-1602 (8,192 inodes × 256 bytes / 4096 = 512 blocks)
- Data Blocks:       Blocks 1603-32767
```

Component Details:
1. Superblock (1 block)
2. Group Descriptor Table (variable)
3. Reserved GDT Blocks (optional)
4. Data Block Bitmap (1 block)
5. Inode Bitmap (1 block)
6. Inode Table (variable)
| Component | Formula | Example (4 KB blocks, 8192 groups) |
|---|---|---|
| Superblock | Fixed: 1 block | 4 KB |
| Group Desc Table | ceil(groups × desc_size / block_size) | 256 KB (32 bytes × 8192 / 4096 = 64 blocks) |
| Reserved GDT | Calculated for max growth | Variable (often 1024 blocks) |
| Block Bitmap | Fixed: 1 block | 4 KB (tracks 32,768 blocks) |
| Inode Bitmap | Fixed: 1 block | 4 KB (tracks up to 32,768 inodes) |
| Inode Table | inodes_per_group × inode_size / block_size | 2 MB (8192 × 256 / 4096 = 512 blocks) |
While block group metadata enables efficient allocation, it consumes disk space. For a 1 TB filesystem: ~2% overhead for metadata (superblocks, group descriptors, bitmaps, inode tables). This is the price of organization—a worthwhile trade-off for performance and reliability.
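A back-of-the-envelope check of that figure, assuming an ordinary group with no superblock backup:

```c
#include <stdio.h>

/* Rough per-group overhead: 1 block bitmap + 1 inode bitmap +
 * 512 inode-table blocks, out of 32,768 blocks per group. */
int main(void) {
    double meta  = 1 + 1 + 512;
    double total = 32768;
    printf("per-group metadata: %.2f%%\n", 100.0 * meta / total); /* ~1.57% */
    return 0;
}
```

Superblock and GDT backups in the sparse_super groups push the total toward the ~2% figure.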
The Group Descriptor Table (GDT) is a critical structure containing metadata about every block group in the file system. It's the "table of contents" that tells the kernel where to find each group's bitmaps and inode table.
Group Descriptor Structure:
```c
// ext4 group descriptor structure
struct ext4_group_desc {
    __le32 bg_block_bitmap_lo;        // Block bitmap block (low 32 bits)
    __le32 bg_inode_bitmap_lo;        // Inode bitmap block (low 32 bits)
    __le32 bg_inode_table_lo;         // Inode table block (low 32 bits)
    __le16 bg_free_blocks_count_lo;   // Free blocks count (low 16 bits)
    __le16 bg_free_inodes_count_lo;   // Free inodes count (low 16 bits)
    __le16 bg_used_dirs_count_lo;     // Directories count (low 16 bits)
    __le16 bg_flags;                  // EXT4_BG_* flags
    __le32 bg_exclude_bitmap_lo;      // Exclude bitmap (snapshots)
    __le16 bg_block_bitmap_csum_lo;   // Block bitmap checksum (low 16 bits)
    __le16 bg_inode_bitmap_csum_lo;   // Inode bitmap checksum (low 16 bits)
    __le16 bg_itable_unused_lo;       // Unused inodes count (low 16 bits)
    __le16 bg_checksum;               // Group descriptor checksum

    // --- Extended descriptor for 64-bit mode (additional 32 bytes) ---
    __le32 bg_block_bitmap_hi;        // Block bitmap block (high 32 bits)
    __le32 bg_inode_bitmap_hi;        // Inode bitmap block (high 32 bits)
    __le32 bg_inode_table_hi;         // Inode table block (high 32 bits)
    __le16 bg_free_blocks_count_hi;   // Free blocks count (high 16 bits)
    __le16 bg_free_inodes_count_hi;   // Free inodes count (high 16 bits)
    __le16 bg_used_dirs_count_hi;     // Directories count (high 16 bits)
    __le16 bg_itable_unused_hi;       // Unused inodes count (high 16 bits)
    __le32 bg_exclude_bitmap_hi;      // Exclude bitmap (high 32 bits)
    __le16 bg_block_bitmap_csum_hi;   // Block bitmap checksum (high 16 bits)
    __le16 bg_inode_bitmap_csum_hi;   // Inode bitmap checksum (high 16 bits)
    __u32  bg_reserved;               // Reserved for future use
};
// Total size: 32 bytes (32-bit) or 64 bytes (64-bit)
```

Key Fields Explained:
| Field | Purpose | Usage |
|---|---|---|
bg_block_bitmap | Points to block bitmap | Loaded when allocating blocks |
bg_inode_bitmap | Points to inode bitmap | Loaded when allocating inodes |
bg_inode_table | Points to inode table start | Essential for inode lookup |
bg_free_blocks_count | Cached free block count | Quick capacity check |
bg_free_inodes_count | Cached free inode count | Quick allocation decision |
bg_used_dirs_count | Directory counter | Helps distribute new directories |
bg_flags | INODE_UNINIT, BLOCK_UNINIT | Lazy initialization flags |
bg_checksum | CRC checksum | Corruption detection |
Lazy Initialization (bg_flags):
ext4 introduced lazy initialization to speed up mkfs and first-time writes:
- INODE_UNINIT: Inode table not yet zeroed
- BLOCK_UNINIT: Block bitmap not yet initialized

When a group is first used, the kernel initializes it on demand. This dramatically speeds up formatting large volumes: a 16 TB file system can be formatted in seconds rather than minutes.
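As a sketch, code reading a descriptor might test these flags like this (the flag values follow the kernel's EXT4_BG_* definitions; the helper itself is illustrative):

```c
#include <stdint.h>

#define EXT4_BG_INODE_UNINIT 0x0001 /* inode table/bitmap not initialized */
#define EXT4_BG_BLOCK_UNINIT 0x0002 /* block bitmap not initialized */

/* Sketch: does this group still need on-demand initialization? */
static int group_needs_init(uint16_t bg_flags)
{
    return (bg_flags & (EXT4_BG_INODE_UNINIT | EXT4_BG_BLOCK_UNINIT)) != 0;
}
```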
GDT Redundancy:
The GDT is replicated in backup groups alongside the superblock. This redundancy allows recovery if the primary copies are corrupted:
# Recover using backup superblock from group 1
e2fsck -b 32768 -y /dev/sda1
# List backup superblock locations
dumpe2fs /dev/sda1 | grep -i 'backup'
When the 64-bit feature is enabled (default for volumes over 16 TB), group descriptors expand to 64 bytes. The *_hi fields provide additional bits for addressing blocks beyond the 2^32 limit, enabling support for volumes up to 1 exabyte.
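A minimal sketch of how the halves combine (field names follow struct ext4_group_desc above; the helper is illustrative, not the kernel's):

```c
#include <stdint.h>

/* Sketch: reassemble a 64-bit block number from a descriptor's
 * _lo/_hi halves (values already converted from little-endian). */
static uint64_t combine_lo_hi(uint32_t lo, uint32_t hi, int has_64bit_feature)
{
    return has_64bit_feature ? ((uint64_t)hi << 32) | lo : (uint64_t)lo;
}
```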
Bitmaps provide the fundamental mechanism for tracking allocation status. Each block group maintains two bitmaps: one for data blocks, one for inodes.
Bitmap Structure:
Block Bitmap (4096 bytes = 32,768 bits)
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | ...
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| | | | | | | | | | | | | | | |
V V V V V V V V V V V V V V V V
B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 B15
used used used free free used...
Each bit position directly corresponds to a block (or inode) number within the group.
Bitmap Operations:
```c
// Bitmap manipulation primitives
#define BITS_PER_BYTE 8

// Check if a block is allocated
static inline int is_block_allocated(const unsigned char *bitmap, unsigned long block)
{
    unsigned long byte_offset = block / BITS_PER_BYTE;
    unsigned int bit_offset = block % BITS_PER_BYTE;
    return (bitmap[byte_offset] >> bit_offset) & 1;
}

// Allocate a block (set bit to 1)
static inline void allocate_block(unsigned char *bitmap, unsigned long block)
{
    unsigned long byte_offset = block / BITS_PER_BYTE;
    unsigned int bit_offset = block % BITS_PER_BYTE;
    bitmap[byte_offset] |= (1 << bit_offset);
}

// Free a block (set bit to 0)
static inline void free_block(unsigned char *bitmap, unsigned long block)
{
    unsigned long byte_offset = block / BITS_PER_BYTE;
    unsigned int bit_offset = block % BITS_PER_BYTE;
    bitmap[byte_offset] &= ~(1 << bit_offset);
}

// Find first free block using optimized search
unsigned long find_first_free_block(const unsigned char *bitmap,
                                    unsigned long start, unsigned long count)
{
    unsigned long i;

    // Search byte-by-byte for efficiency
    for (i = start / BITS_PER_BYTE; i < count / BITS_PER_BYTE; i++) {
        if (bitmap[i] != 0xFF) {
            // Found a byte with at least one free bit
            unsigned int bit;
            for (bit = 0; bit < BITS_PER_BYTE; bit++) {
                if (!(bitmap[i] & (1 << bit))) {
                    unsigned long block = i * BITS_PER_BYTE + bit;
                    if (block >= start && block < count)
                        return block;
                }
            }
        }
    }
    return count; // No free block found
}

// Count free blocks (used for bg_free_blocks_count validation)
unsigned long count_free_blocks(const unsigned char *bitmap, unsigned long count)
{
    unsigned long free = 0;
    unsigned long i;
    for (i = 0; i < count; i++) {
        if (!is_block_allocated(bitmap, i))
            free++;
    }
    return free;
}
```

Optimization Techniques:
Modern ext4 uses several optimizations for bitmap operations:
Byte-level scanning: Skip full bytes (value 0xFF) when searching for free blocks
Hardware bit operations: Use CPU instructions like __builtin_ffs() (find first set) for efficient searching (see the sketch after this list)
Goal-oriented allocation: Start search near a "goal" block to improve locality
Preallocation window: Reserve contiguous ranges to reduce bitmap updates
Delayed allocation: Defer bitmap updates until actual write, enabling larger contiguous allocations
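A sketch of the hardware-assisted search mentioned above, scanning 64 bits at a time with GCC/Clang's __builtin_ffsll (illustrative; ext4's real scan lives in mballoc):

```c
#include <stdint.h>

/* Sketch: skip fully allocated 64-bit words, then use find-first-set
 * on the inverted word to locate the first free bit. Assumes the
 * bitmap length is a multiple of 64 bits. */
static long find_free_bit_fast(const uint64_t *bitmap, unsigned long nwords)
{
    for (unsigned long i = 0; i < nwords; i++) {
        uint64_t inv = ~bitmap[i];            /* free bits become 1s */
        if (inv)                              /* at least one free bit here */
            return (long)(i * 64) + __builtin_ffsll((long long)inv) - 1;
    }
    return -1;                                /* group is full */
}
```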
Bitmap Checksums (ext4):
ext4 optionally computes and verifies checksums for bitmaps:
// Verify block bitmap checksum (simplified sketch)
uint32_t calculated = ext4_chksum(sbi, sbi->s_csum_seed,
                                  bitmap, bitmap_size);
uint32_t stored = group_desc->bg_block_bitmap_csum_lo; // low 16 bits
if ((calculated & 0xFFFF) != stored) {
    ext4_error(sb, "Bitmap checksum mismatch for group %u", group);
    // Mark filesystem as having errors
}
Checksum validation catches bit flips from hardware errors, detecting silent corruption before it propagates.
| Operation | Time Complexity | Notes |
|---|---|---|
| Check single bit | O(1) | Direct indexed access |
| Set/clear bit | O(1) | Direct indexed access |
| Find first free | O(n) worst case | Optimized with byte-scan |
| Find contiguous range | O(n) | May span multiple bytes |
| Count all free | O(n) | Cached in group descriptor |
Bitmaps offer an excellent balance: O(1) allocation status lookup, compact representation (1 bit per block), and fast bulk operations. A 4 KB bitmap tracks 32,768 blocks (128 MB with 4 KB blocks)—enough for efficient local allocation without excessive bitmap scanning.
Inode allocation in ext file systems follows sophisticated strategies designed to maximize locality while distributing load across block groups.
The Orlov Allocator:
ext2/ext3/ext4 use the "Orlov allocator" for directory inode placement, named after Grigory Orlov, who devised the algorithm. Its goals are to spread unrelated top-level directories across block groups (balancing load) while keeping subdirectories near their parents (preserving locality).
Directory Allocation Algorithm:
```c
// Simplified Orlov directory inode allocation
ext4_group_t find_group_for_directory(struct super_block *sb, struct inode *parent)
{
    struct ext4_sb_info *sbi = EXT4_SB(sb);
    ext4_group_t ngroups = sbi->s_groups_count;
    ext4_group_t group, parent_group;

    parent_group = EXT4_I(parent)->i_block_group;

    // For root directory children: spread across groups
    if (parent == d_inode(sb->s_root)) {
        // Find group with good balance of free inodes/blocks
        ext4_group_t best_group = 0;
        unsigned int best_score = 0;

        for (group = 0; group < ngroups; group++) {
            struct ext4_group_desc *desc = ext4_get_group_desc(sb, group);
            unsigned int free_inodes = ext4_free_inodes_count(sb, desc);
            unsigned int free_blocks = ext4_free_group_clusters(sb, desc);
            unsigned int dirs = ext4_used_dirs_count(sb, desc);

            // Score: prefer groups with free space and few directories
            // (DIRS_PENALTY is a tunable weight)
            unsigned int score = free_inodes + free_blocks / 16 - dirs * DIRS_PENALTY;
            if (score > best_score) {
                best_score = score;
                best_group = group;
            }
        }
        return best_group;
    }

    // For subdirectories: stay near parent
    // Check parent group first, then nearby groups
    for (int offset = 0; offset < 8; offset++) {
        group = (parent_group + offset) % ngroups;
        struct ext4_group_desc *desc = ext4_get_group_desc(sb, group);
        if (ext4_free_inodes_count(sb, desc) >= sbi->s_inodes_per_group / 4)
            return group;
    }

    // Fallback: use Orlov heuristic with linear search
    return ext4_find_group_orlov(sb, parent);
}

// Regular file inode allocation: near directory
ext4_group_t find_group_for_file(struct super_block *sb, struct inode *dir)
{
    ext4_group_t dir_group = EXT4_I(dir)->i_block_group;

    // First choice: same group as directory
    struct ext4_group_desc *desc = ext4_get_group_desc(sb, dir_group);
    if (ext4_free_inodes_count(sb, desc) > 0)
        return dir_group;

    // Second: nearby groups (flex_bg aware)
    // Third: linear search for any group with free inodes
    return ext4_find_group_other(sb, dir);
}
```

The Strategy in Action:
Consider creating a directory structure /home/user/documents/:
- /home (child of root) → placed in the group with the best free-space/directory balance
- /home/user → placed near or in the same group as /home
- /home/user/documents → placed near or in the same group as /home/user
- Files inside documents/ → same group as the documents directory inode

This keeps related files physically close, minimizing seek time for directory traversals.
Regular File Inodes:
For regular files, the allocation is simpler:

1. First choice: the same group as the containing directory
2. Second choice: nearby groups (flex_bg aware)
3. Fallback: linear search for any group with free inodes
Flex Block Groups (ext4):
ext4 introduces the concept of "flex block groups" (configurable, typically 16 groups):
Flex Group 0: Block Groups 0-15
Flex Group 1: Block Groups 16-31
...
Bitmaps and inode tables from multiple groups are concatenated, enabling:

- Large data regions free of per-group metadata interruptions
- Sequential reads of metadata for several groups at once
- Better placement for files larger than a single 128 MB group
Use stat to see which block group contains a file's inode. The formula: block_group = (inode_number - 1) / inodes_per_group. Tools like debugfs can display detailed allocation information for analysis.
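A worked example of that formula (the inode number is illustrative):

```c
#include <stdio.h>

/* Which block group holds a given inode? Formula from above:
 * block_group = (inode_number - 1) / inodes_per_group. */
int main(void) {
    unsigned long ino = 1234567, inodes_per_group = 8192;
    unsigned long group = (ino - 1) / inodes_per_group;
    unsigned long index = (ino - 1) % inodes_per_group; /* slot in that group's inode table */
    printf("inode %lu -> group %lu, index %lu\n", ino, group, index);
    return 0;
}
```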
Block allocation in ext4 employs the multiblock allocator (mballoc), a sophisticated system designed to minimize fragmentation while maximizing allocation speed.
Goals of Block Allocation:

- Contiguity: allocate physically adjacent blocks so files can be read sequentially
- Locality: place a file's blocks near its inode and related files
- Speed: make allocation decisions without scanning every bitmap on every write
Multiblock Allocator (mballoc):
Unlike ext2's block-at-a-time allocation, mballoc finds and claims an entire range of blocks in a single operation.
Key mballoc Concepts:
1. Buddy System Cache
mballoc maintains an in-memory cache of free extents organized like a buddy allocator:
Buddy Cache for Group 5:
Order 0 (1 block): blocks 100, 102, 105, 110...
Order 1 (2 blocks): blocks 200-201, 204-205...
Order 2 (4 blocks): blocks 300-303, 308-311...
...
Order 10 (1024 blocks): blocks 5000-6023
This enables O(log n) search for free regions of specific sizes.
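As a sketch of how a request maps onto those orders (an illustrative helper, not mballoc's data structures):

```c
/* Sketch of buddy-style lookup: given per-order free-extent counts,
 * find the smallest order that can satisfy a request of 'blocks'. */
static int find_order(const int free_count_per_order[], int max_order,
                      unsigned int blocks)
{
    int order = 0;
    while ((1u << order) < blocks)        /* smallest covering power of two */
        order++;
    for (; order <= max_order; order++)   /* climb until a free extent exists */
        if (free_count_per_order[order] > 0)
            return order;
    return -1;                            /* nothing large enough in this group */
}
```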
2. Preallocation (PA)
Preallocation reserves contiguous blocks before actual data is written:
// Per-inode preallocation
struct ext4_prealloc_space {
struct list_head pa_inode_list; // Link to inode's PA list
ext4_fsblk_t pa_pstart; // Physical start block
ext4_lblk_t pa_lstart; // Logical start block
ext4_grpblk_t pa_len; // Length in blocks
ext4_grpblk_t pa_free; // Free blocks in this PA
};
3. Normalization
Small requests are "normalized" to power-of-two sizes for buddy efficiency: a 5-block request is rounded up to 8 blocks, a 13-block request to 16.
Excess blocks become preallocated space for future writes.
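A sketch of that rounding, including the surplus that becomes preallocated space (illustrative, not ext4's actual normalization code):

```c
/* Sketch: round a small request up to a power of two; the surplus
 * blocks are kept as preallocation for future appends. */
static unsigned int normalize_request(unsigned int req, unsigned int *surplus)
{
    unsigned int size = 1;
    while (size < req)
        size <<= 1;
    *surplus = size - req; /* e.g. req=5 -> size=8, surplus=3 */
    return size;
}
```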
| Scenario | Strategy | Goal Region |
|---|---|---|
| New file, first write | Near directory inode | Same block group as dir |
| Extending file | Contiguous to existing | Following last block |
| Small file (<64 KB) | Locality group PA | Near other small files |
| Large sequential write | Delayed + normalized | Large contiguous region |
| Direct I/O | Immediate allocation | Goal or first available |
ext4's delayed allocation defers block allocation until writeback. This enables the allocator to see the full write size before choosing blocks, dramatically improving contiguity for streaming writes. A 100 MB sequential write becomes a single large extent rather than scattered 4 KB blocks.
ext4 introduced flex block groups as a meta-organization layer that aggregates multiple traditional block groups into larger allocation units.
The Problem with Small Groups:
With 4 KB blocks, each block group is 128 MB. For large files this creates problems:

- A file larger than 128 MB cannot fit in one group and must span several
- Per-group metadata (bitmaps, inode tables) interrupts otherwise contiguous data
- The allocator must repeatedly cross group boundaries, fragmenting extents
Flex Block Group Solution:
A flex_bg bundles multiple block groups together:
Flex Group Size = 2^flex_bg_size traditional groups
Example: flex_bg_size = 4 → 16 groups per flex_bg
16 groups × 128 MB = 2 GB allocation unit
Benefits of Flex Block Groups:

- Contiguous Metadata: bitmaps and inode tables of all member groups are packed together at the start of the flex group
- Larger Contiguous Data Regions: the rest of the flex group is pure data, uninterrupted by metadata
- Improved mballoc Efficiency: larger free extents give the buddy cache bigger regions to track and allocate from
Configuration:
# Create filesystem with specific flex_bg size
mkfs.ext4 -G 16 /dev/sda1 # 16 groups per flex_bg (2 GB units)
# Check current flex_bg size
dumpe2fs /dev/sda1 | grep 'Flex block group'
# Typical output:
# Flex block group size: 16
Inode Table Packing:
With flex_bg, inode tables from constituent groups are placed contiguously:
Traditional: IT0...Data0...IT1...Data1...IT2...Data2
Flex_bg: IT0|IT1|IT2|IT3...Data0|Data1|Data2|Data3
\--- metadata ---/ \----- data --------/
This layout improves directory traversal performance where multiple inodes are read sequentially.
Default flex_bg size is 16 (2 GB units with 4 KB blocks). Larger values benefit workloads with very large files; smaller values may help workloads with many small files. For most uses, the default is optimal. Extremely large values (64+) can concentrate metadata, creating potential hotspots.
Block groups form the architectural foundation of ext file systems, enabling efficient organization of metadata and data across storage volumes of all sizes.
Next Up: The Superblock
With block groups understood, we'll examine the superblock—the critical data structure that defines file system parameters and state. The superblock contains the essential metadata needed to mount and use an ext file system, and its redundancy strategy protects against catastrophic corruption.
You now understand how block groups organize ext file systems, including their internal structure, allocation strategies, and the flex_bg optimization. This knowledge is essential for understanding file system performance characteristics and troubleshooting allocation issues.