Loading content...
Hash tables gave us O(1) lookups but sacrificed order. Linear lists preserved creation order but couldn't scale. What if we could have fast lookups and sorted order? What if directories could support both instant file access and efficient ls -la | head without scanning everything?
Enter the B-tree—the data structure that powers databases, file systems, and virtually every system that needs ordered access to large datasets. Originally invented by Rudolf Bayer and Edward McCreight at Boeing Research Labs in 1970, B-trees have become the dominant directory implementation in modern file systems.
This page provides comprehensive coverage of B-tree directory implementation: the tree structure that keeps data ordered and balanced, the algorithms that maintain these properties during insertions and deletions, and how XFS, Btrfs, and NTFS leverage B-trees for world-class directory performance.
By the end of this page, you will understand: (1) B-tree structure and properties for file system use, (2) Search, insertion, and deletion algorithms in detail, (3) Why B-trees are ideal for disk-based storage, (4) B+ tree variations used in practice, (5) Real implementations in XFS, Btrfs, and NTFS, and (6) Performance comparison with hash tables and linear lists.
A B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. Unlike binary trees where each node has at most 2 children, B-tree nodes can have many children—making them ideal for systems that read/write large blocks of data.
B-tree Definition:
A B-tree of order m (also called a B-tree of minimum degree t) has these properties:
12345678910111213141516171819202122232425
B-tree of Order 5 (max 5 children, max 4 keys per node):═══════════════════════════════════════════════════════════════════ ┌─────────────────┐ │ G │ M │ T │ Root node └──┬───┴──┬──┴──┬──┴──┬──┘ ┌──────────┘ │ │ └──────────┐ ▼ ▼ ▼ ▼ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ A │ C │ F│ │ H │ K │ L│ │ N │ P │ S│ │ U │ W │ Z│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ Leaf Leaf Leaf Leaf Properties of this tree:- Order m = 5 (max 5 children per node)- Minimum keys per node = ⌈5/2⌉ - 1 = 2- Maximum keys per node = 5 - 1 = 4- Height = 2 (root at depth 0, leaves at depth 1)- All leaves at same level ✓ Key ordering:- Everything left of G: {A, C, F} ✓ (all < G)- Between G and M: {H, K, L} ✓ (G < all < M)- Between M and T: {N, P, S} ✓ (M < all < T)- Right of T: {U, W, Z} ✓ (all > T)Why B-trees for File Systems?
B-trees are ideally suited for disk-based storage:
| Property | Benefit for File Systems |
|---|---|
| High branching factor | Shallow trees → fewer disk reads per lookup |
| Node size = block size | Each node read is exactly one disk I/O |
| Sorted order maintained | Range queries and ordered iteration efficient |
| Guaranteed balance | Consistent worst-case performance |
| Efficient updates | Splits/merges are localized, not global |
A B-tree with node size of 4KB and average entry size of 32 bytes can hold ~100 entries per node. With branching factor of 100, a tree of height 3 can hold 100³ = 1 million entries. Finding any entry requires at most 3 disk reads—regardless of which entry you're looking for.
Understanding B-tree operations is essential for appreciating how directories maintain their structure. Each operation must preserve all B-tree properties.
Search Operation:
Searching a B-tree extends binary search to multi-way trees:
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647
// B-tree node structure for directorystruct btree_node { int num_keys; // Current number of keys bool is_leaf; // True if leaf node char keys[MAX_KEYS][256]; // Filenames (sorted) uint32_t inodes[MAX_KEYS]; // Corresponding inode numbers struct btree_node *children[MAX_KEYS + 1]; // Child pointers uint64_t disk_block; // Where this node is stored on disk}; // Search for a filename in B-tree directoryuint32_t btree_search(struct btree_node *node, const char *name) { // Binary search within this node int low = 0, high = node->num_keys - 1; int pos = 0; while (low <= high) { int mid = (low + high) / 2; int cmp = strcmp(name, node->keys[mid]); if (cmp == 0) { return node->inodes[mid]; // Found! } else if (cmp < 0) { high = mid - 1; pos = mid; } else { low = mid + 1; pos = mid + 1; } } // Not found in this node if (node->is_leaf) { return 0; // Not in tree } // Recurse into appropriate child // May require disk read if child not in memory struct btree_node *child = read_node(node->children[pos]); return btree_search(child, name);} // Complexity:// - Each node: O(log(keys_per_node)) for binary search// - Tree traversal: O(height) nodes visited// - Total: O(log(keys_per_node) × height) = O(log n)// - Disk I/O: O(height) readsInsertion Operation:
Insertion is more complex because we must handle full nodes through splitting:
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103
// Insert a new entry into B-tree directory// Uses proactive splitting: split full nodes on the way down int btree_insert(struct btree_node **root, const char *name, uint32_t inode) { struct btree_node *r = *root; // Special case: root is full if (r->num_keys == MAX_KEYS) { // Create new root struct btree_node *new_root = allocate_node(); new_root->is_leaf = false; new_root->num_keys = 0; new_root->children[0] = r; // Split old root split_child(new_root, 0, r); *root = new_root; return insert_non_full(new_root, name, inode); } return insert_non_full(r, name, inode);} // Insert into a node that is guaranteed not fullint insert_non_full(struct btree_node *node, const char *name, uint32_t inode) { int i = node->num_keys - 1; if (node->is_leaf) { // Shift keys to make room and insert while (i >= 0 && strcmp(name, node->keys[i]) < 0) { strcpy(node->keys[i + 1], node->keys[i]); node->inodes[i + 1] = node->inodes[i]; i--; } strcpy(node->keys[i + 1], name); node->inodes[i + 1] = inode; node->num_keys++; write_node(node); // Persist to disk return 0; } // Find child to descend into while (i >= 0 && strcmp(name, node->keys[i]) < 0) { i--; } i++; struct btree_node *child = read_node(node->children[i]); // If child is full, split it first if (child->num_keys == MAX_KEYS) { split_child(node, i, child); // Determine which of the two children to descend into if (strcmp(name, node->keys[i]) > 0) { i++; child = read_node(node->children[i]); } } return insert_non_full(child, name, inode);} // Split a full child nodevoid split_child(struct btree_node *parent, int child_index, struct btree_node *full_child) { // Create new node to hold right half struct btree_node *new_sibling = allocate_node(); new_sibling->is_leaf = full_child->is_leaf; new_sibling->num_keys = MIN_KEYS; // Half the keys // Copy right half of keys to new sibling for (int j = 0; j < MIN_KEYS; j++) { strcpy(new_sibling->keys[j], full_child->keys[j + MIN_KEYS + 1]); new_sibling->inodes[j] = full_child->inodes[j + MIN_KEYS + 1]; } // Copy right half of children if not leaf if (!full_child->is_leaf) { for (int j = 0; j <= MIN_KEYS; j++) { new_sibling->children[j] = full_child->children[j + MIN_KEYS + 1]; } } full_child->num_keys = MIN_KEYS; // Insert middle key into parent for (int j = parent->num_keys; j > child_index; j--) { parent->children[j + 1] = parent->children[j]; strcpy(parent->keys[j], parent->keys[j - 1]); parent->inodes[j] = parent->inodes[j - 1]; } parent->children[child_index + 1] = new_sibling; strcpy(parent->keys[child_index], full_child->keys[MIN_KEYS]); parent->inodes[child_index] = full_child->inodes[MIN_KEYS]; parent->num_keys++; // Persist all modified nodes write_node(full_child); write_node(new_sibling); write_node(parent);}Deletion Operation:
Deletion must handle underflow when a node drops below minimum keys:
12345678910111213141516171819202122232425262728293031323334
B-tree Deletion Cases:══════════════════════════════════════════════════════════════════ Case 1: Key in leaf node with sufficient keys → Simply remove the key Case 2: Key in leaf node at minimum keys → Borrow from sibling if sibling has spare keys → Otherwise, merge with sibling Case 3: Key in internal node → Replace with predecessor (max key in left subtree) → Or replace with successor (min key in right subtree) → Then delete the predecessor/successor from leaf Merge Operation: Parent: [... | X | ...] ↙ ↘ [A, B] [C, D] (both at minimum) After merge (removing X from parent): Parent: [... | ...] ↓ [A, B, X, C, D] (combined) Borrow Operation: Parent: [... | X | ...] ↙ ↘ [A] [C, D, E] (left needs keys, right has spare) After borrow (rotate through parent): Parent: [... | C | ...] ↙ ↘ [A, X] [D, E] (balanced)All B-tree operations—search, insert, delete—run in O(log n) time. More precisely, they require O(h) disk I/Os where h is the height, and h = O(log_m n) for a tree of order m. With typical file system parameters (m ≈ 100), a directory with 1 million entries has height 3.
Most file systems use B+ trees rather than pure B-trees. B+ trees are a variant optimized for systems that need both random access and sequential scanning.
Key Differences from B-trees:
12345678910111213141516171819202122232425262728
B-tree (data in all nodes):══════════════════════════════════════════════════════════════════ ┌────────────────────────┐ │ (G,87) │ (M,42) │ (T,91) │ Internal keys+data └──┬─────┴──┬─────┴──┬─────┴──┐ ┌──────────┘ │ │ └──────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │(A,12)│(C,34)│ │(H,56)│(K,78)│ │(U,23)│(W,45)│ Leaf keys+data └──────────────┘ └──────────────┘ └──────────────┘ B+ tree (data only in leaves):══════════════════════════════════════════════════════════════════ ┌──────────────────┐ │ G │ M │ T │ Internal keys only └──┬────┴──┬────┴─┬─┴──┐ (navigation) ┌──────────┘ │ │ └──────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │(A,12)│(C,34)│ → │(G,87)│(H,56)│ → │(M,42)│(T,91)│ → Leaves with └──────────────┘ └──────────────┘ └──────────────┘ data+link Benefits:1. Sequential scan: Follow leaf links, no tree traversal needed2. Higher fan-out: Internal nodes have 2x more keys (no data space)3. Consistent lookup time: Always traverse to leaf (predictable)4. Better for range queries: Leaves are ordered and linkedB+ Tree Operations for Directories:
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182
// B+ tree node for directorystruct bplus_node { uint16_t num_keys; uint16_t is_leaf; union { // Internal node: keys and child pointers struct { char keys[INTERNAL_MAX_KEYS][256]; uint64_t children[INTERNAL_MAX_KEYS + 1]; // Block numbers } internal; // Leaf node: keys, data, and sibling pointer struct { char keys[LEAF_MAX_KEYS][256]; uint32_t inodes[LEAF_MAX_KEYS]; uint64_t next_leaf; // For sequential scan } leaf; };}; // Search: Always goes to leafuint32_t bplus_search(uint64_t root_block, const char *name) { struct bplus_node *node = read_block(root_block); // Navigate to leaf while (!node->is_leaf) { int i = 0; while (i < node->num_keys && strcmp(name, node->internal.keys[i]) >= 0) { i++; } uint64_t child_block = node->internal.children[i]; free_block(node); node = read_block(child_block); } // Search within leaf for (int i = 0; i < node->num_keys; i++) { if (strcmp(name, node->leaf.keys[i]) == 0) { uint32_t result = node->leaf.inodes[i]; free_block(node); return result; } } free_block(node); return 0; // Not found} // Range query: Find all files starting with prefixvoid bplus_range_query(uint64_t root_block, const char *prefix, void (*callback)(const char *, uint32_t)) { // Navigate to first matching leaf struct bplus_node *node = read_block(root_block); while (!node->is_leaf) { int i = 0; while (i < node->num_keys && strcmp(prefix, node->internal.keys[i]) >= 0) { i++; } uint64_t child_block = node->internal.children[i]; free_block(node); node = read_block(child_block); } // Scan leaves sequentially size_t prefix_len = strlen(prefix); while (node != NULL) { for (int i = 0; i < node->num_keys; i++) { if (strncmp(node->leaf.keys[i], prefix, prefix_len) == 0) { callback(node->leaf.keys[i], node->leaf.inodes[i]); } else if (strcmp(node->leaf.keys[i], prefix) > 0 && strncmp(node->leaf.keys[i], prefix, prefix_len) != 0) { // Past the range, done free_block(node); return; } } uint64_t next = node->leaf.next_leaf; free_block(node); node = (next != 0) ? read_block(next) : NULL; }}The linked leaves in B+ trees make 'ls' incredibly efficient. Instead of traversing the tree for each entry, the kernel finds the first leaf then follows the chain. For a directory with 10,000 files, this might be 30 sequential block reads instead of 10,000 tree traversals.
Let's examine how major file systems implement B-tree directories.
XFS: The B+ Tree Pioneer
XFS, developed by SGI in 1993, was one of the first file systems to use B+ trees extensively:
12345678910111213141516171819202122232425262728293031
XFS Directory Implementation:══════════════════════════════════════════════════════════════════ Small directories (< 1 block): - Stored directly in inode "local format" - No tree structure needed Medium directories (1 block): - Single block with linear entries - Transitions to B+ tree when full Large directories: - Full B+ tree structure - Leaves contain (hash, entry) pairs - Entries sorted by hash of filename XFS B+ tree block layout:┌─────────────────────────────────────────────────────────────┐│ xfs_btree_block header: ││ bb_magic: 0x58444233 ("XD3B") ││ bb_level: tree level (0 = leaf) ││ bb_numrecs: number of records ││ bb_leftsib, bb_rightsib: sibling pointers │├─────────────────────────────────────────────────────────────┤│ Keys/Pointers (internal) OR Records (leaf) ││ Internal: [key₁, key₂, ...] [ptr₁, ptr₂, ptr₃, ...] ││ Leaf: [entry₁, entry₂, entry₃, ...] │└─────────────────────────────────────────────────────────────┘ Note: XFS uses hash values as keys for better distribution,but stores actual filenames in leaf entries.Btrfs: Copy-on-Write B-trees
Btrfs uses B-trees for everything, with a unique copy-on-write twist:
1234567891011121314151617181920212223242526272829303132
// Btrfs uses a unified B-tree for all metadata// Directory entries are items in this tree struct btrfs_key { __le64 objectid; // Parent directory inode __u8 type; // BTRFS_DIR_ITEM_KEY or BTRFS_DIR_INDEX_KEY __le64 offset; // Hash of name (DIR_ITEM) or sequence (DIR_INDEX)}; struct btrfs_dir_item { struct btrfs_disk_key location; // Where the file inode is __le64 transid; // Transaction that created this __le16 data_len; // Length of data after name __le16 name_len; // Filename length __u8 type; // File type (regular, directory, etc.) char name[]; // Filename follows}; // Btrfs stores TWO entries per file:// 1. DIR_ITEM: keyed by hash(name) - for name lookup// 2. DIR_INDEX: keyed by sequence number - for readdir order // Example tree structure:// Key: (parent_ino=100, DIR_ITEM, hash("file.txt"))// → btrfs_dir_item { inode=200, name="file.txt" }// Key: (parent_ino=100, DIR_INDEX, seq=5)// → btrfs_dir_item { inode=200, name="file.txt" } // COW semantics: // On directory modification, entire path from root to leaf is copied// Old version remains valid until garbage collected// Enables snapshots, atomic updates, and crash consistencyNTFS: B+ Trees for Everything
NTFS uses B+ trees for directory indices and more:
12345678910111213141516171819202122232425262728293031
NTFS Directory Implementation:══════════════════════════════════════════════════════════════════ Small directories: - Index entries stored in INDEX_ROOT attribute (in MFT record) - Fits within ~600 bytes of MFT record Large directories: - INDEX_ROOT contains root of B+ tree - INDEX_ALLOCATION attribute contains tree nodes - BITMAP attribute tracks allocated nodes Index Entry Structure:┌─────────────────────────────────────────────────────────────┐│ INDEX_ENTRY_HEADER: ││ MFT Reference (8 bytes) - points to file's MFT record ││ Entry Length (2 bytes) ││ Key Length (2 bytes) ││ Flags (4 bytes) - has subnodes, last entry │├─────────────────────────────────────────────────────────────┤│ FILE_NAME attribute copy: ││ Parent directory reference ││ Timestamps (creation, modification, access) ││ File size ││ Filename (Unicode, up to 255 chars) │├─────────────────────────────────────────────────────────────┤│ VCN of subnode (if has_subnodes flag set) │└─────────────────────────────────────────────────────────────┘ NTFS sorts by collation: case-insensitive Unicode by defaultThis enables efficient case-insensitive lookups| Aspect | XFS | Btrfs | NTFS |
|---|---|---|---|
| Tree type | B+ tree | B-tree (COW) | B+ tree |
| Key type | Filename hash | Composite key | Filename (Unicode) |
| Small dir optimization | Inline in inode | Same tree | Inline in MFT |
| Ordering | Hash order (random) | Hash order | Unicode collation |
| Special feature | Hash collision handling | Snapshots, COW | Case-insensitive |
Let's compare B-tree directories against hash tables and linear lists with concrete metrics.
Theoretical Complexity:
| Operation | Linear List | Hash Table | B-tree |
|---|---|---|---|
| Lookup (by name) | O(n) | O(1) average, O(n) worst | O(log n) |
| Insert | O(n) or O(1)* | O(1) average, O(n) rehash | O(log n) |
| Delete | O(n) | O(1) average | O(log n) |
| Enumerate all | O(n) | O(n) | O(n) |
| Range query (prefix) | O(n) | O(n) | O(log n + k)** |
| First k entries sorted | O(n log n) or O(n) | O(n log n) | O(log n + k) |
*Linear insert is O(1) if appending, O(n) if checking for duplicates **k is the number of matching results
Benchmark Results (SSD Storage):
| Directory Size | Linear (ext2) | Hash (ext4 htree) | B-tree (XFS) |
|---|---|---|---|
| 100 files | 50 | 45 | 48 |
| 1,000 files | 400 | 48 | 72 |
| 10,000 files | 3,800 | 52 | 95 |
| 100,000 files | 38,000 | 60 | 120 |
| 1,000,000 files | 380,000 | 85 | 150 |
Analysis:
Where B-trees Excel:
ls returns files in sorted order without additional sorting step.Modern file systems increasingly favor B-trees. XFS uses them exclusively. Btrfs builds everything on a single B-tree. ext4's dir_index (hash-based) works well but lacks range query efficiency. For new file system designs, B+ trees are the default choice.
B-trees represent the culmination of directory data structure evolution—providing the performance of hash tables with the ordering of sorted lists.
What's Next:
We've covered the three main directory data structures. The next page examines directory entries—the individual records that these structures contain. We'll explore what information entries store, how variable-length names are handled, and format differences across file systems.
You now understand B-tree directory implementation comprehensively: the tree structure, search/insert/delete algorithms, B+ tree variants, and real implementations in XFS, Btrfs, and NTFS. This knowledge explains why modern file systems handle million-file directories efficiently.