The year is 2001. A Linux server hosting a mail system with 500,000 messages in a single directory takes 12 seconds to open any individual email. System administrators worldwide share the same frustration—file systems that work brilliantly for normal use become unusable when directories grow large.
The solution? Hash tables. Instead of searching sequentially through every entry, we compute a hash of the filename and jump directly to where the entry should be. What took 12 seconds now takes milliseconds. Directory sizes that crashed systems become routine.
This page provides comprehensive coverage of hash table directory implementation: the hash functions that convert filenames to locations, strategies for handling collisions when multiple names hash to the same value, real implementations in production file systems, and the trade-offs that make hash tables excellent for lookups but problematic for ordered operations.
By the end of this page, you will understand: (1) Hash function design for directory names, (2) Collision resolution strategies including chaining and open addressing, (3) On-disk hash table layouts for persistent storage, (4) ext4's htree/dir_index implementation in detail, (5) Dynamic resizing and rehashing considerations, and (6) Performance characteristics and limitations of hash-based directories.
A hash table provides average O(1) lookup by using a hash function to compute an index directly from the search key. For directories, the key is the filename, and the value is typically an inode number or directory entry.
The Core Idea:
Given a filename, compute a hash value, then use that hash to determine where in the directory structure to look for the entry.
Let's formalize this:
```c
// Conceptual hash table directory lookup
uint32_t hash_directory_lookup(
    struct hash_directory *dir,
    const char *filename)
{
    // Step 1: Compute hash of filename
    uint32_t hash = hash_function(filename);

    // Step 2: Convert hash to bucket index
    size_t bucket_index = hash % dir->num_buckets;

    // Step 3: Look in that bucket (may contain multiple entries)
    struct dir_entry *entry = dir->buckets[bucket_index];
    while (entry != NULL) {
        if (strcmp(entry->name, filename) == 0) {
            return entry->inode;  // Found!
        }
        entry = entry->next;      // Handle collision via chaining
    }

    return 0;  // Not found
}

// Complexity Analysis:
// - Hash computation: O(k) where k = filename length
// - Bucket access: O(1)
// - Chain traversal: O(c) where c = chain length
//
// With good hash function and appropriate bucket count:
// - Average chain length c ≈ n/m (load factor)
// - For n files in m buckets, lookup is O(1 + n/m)
// - With m proportional to n, this is O(1) average
```

Hash Function Requirements for Filenames:
Directory hash functions must satisfy specific criteria: the hash must be deterministic (the same name always produces the same value, since the result effectively lives on disk and is consulted across mounts), fast to compute even for long names, and well distributed for the structured, similar-looking names real directories hold (msg-0001, msg-0002, ...). The examples below progress from simple educational hashes to a simplified view of ext4's actual hash:
```c
// ext4 uses a half MD4 hash for directory indexing
// Simplified version - actual implementation more complex

// Simple polynomial rolling hash (educational)
uint32_t simple_filename_hash(const char *name)
{
    uint32_t hash = 0;
    while (*name) {
        hash = hash * 31 + (unsigned char)*name;
        name++;
    }
    return hash;
}

// DJB2 hash (widely used, fast)
uint32_t djb2_hash(const char *name)
{
    uint32_t hash = 5381;
    int c;
    while ((c = *name++)) {
        hash = ((hash << 5) + hash) + c;  // hash * 33 + c
    }
    return hash;
}

// ext4's actual hash (half MD4, simplified)
// Uses tea_transform() from Tiny Encryption Algorithm
void ext4_hash_name(
    struct dx_hash_info *hinfo,  // Output
    const char *name,            // Filename
    int len)                     // Name length
{
    __u32 hash;
    __u32 minor_hash = 0;
    const char *p;
    int i;
    __u32 in[8], buf[4];

    // Initialize with seed
    buf[0] = 0x12a3fe2d;
    buf[1] = 0x37abe8f9;
    buf[2] = 0x71dea16c;
    buf[3] = 0x01234567;

    // Process name in 16-byte chunks
    p = name;
    while (len > 0) {
        // Pack characters into in[] array
        // Apply TEA transform to buf using in as key
        // ...
    }

    hash = buf[0];
    minor_hash = buf[1];
    hinfo->hash = hash;
    hinfo->minor_hash = minor_hash;
}
```

ext4 uses half of the MD4 cryptographic hash. While MD4 is broken for cryptographic purposes, it has excellent distribution properties and reasonable speed. Using half the computation (fewer rounds) provides sufficient distribution for directory hashing without the full cryptographic overhead.
When two different filenames hash to the same bucket, we have a collision. This is inevitable—the number of possible filenames far exceeds the number of hash buckets. How we handle collisions determines both correctness and performance.
Strategy 1: Separate Chaining
Each bucket contains a linked list of all entries that hash to that bucket:
```c
// Hash table with separate chaining
struct chain_entry {
    char name[256];
    uint32_t inode;
    struct chain_entry *next;      // Collision chain
};

struct hash_directory {
    struct chain_entry **buckets;  // Array of chain heads
    size_t num_buckets;
    size_t num_entries;
};

// Lookup with chaining
uint32_t chained_lookup(struct hash_directory *dir, const char *name)
{
    uint32_t hash = djb2_hash(name);
    size_t bucket = hash % dir->num_buckets;

    struct chain_entry *entry = dir->buckets[bucket];
    while (entry != NULL) {
        if (strcmp(entry->name, name) == 0) {
            return entry->inode;
        }
        entry = entry->next;
    }
    return 0;  // Not found
}

// Insertion with chaining (prepend to chain)
int chained_insert(struct hash_directory *dir, const char *name, uint32_t inode)
{
    uint32_t hash = djb2_hash(name);
    size_t bucket = hash % dir->num_buckets;

    // Check for duplicate
    struct chain_entry *current = dir->buckets[bucket];
    while (current != NULL) {
        if (strcmp(current->name, name) == 0) {
            return -1;  // Already exists
        }
        current = current->next;
    }

    // Create new entry and prepend
    struct chain_entry *new_entry = malloc(sizeof(*new_entry));
    strcpy(new_entry->name, name);
    new_entry->inode = inode;
    new_entry->next = dir->buckets[bucket];
    dir->buckets[bucket] = new_entry;
    dir->num_entries++;
    return 0;
}

// Advantages:
// - Simple to implement
// - Can store unlimited entries per bucket
// - Handles high load factors gracefully
//
// Disadvantages:
// - Extra memory for pointers
// - Poor cache locality (linked lists scattered in memory)
// - Each collision requires memory allocation
```

Strategy 2: Open Addressing (Linear Probing)
All entries live directly in the bucket array. Collisions probe for the next available slot:
```c
// Hash table with linear probing
struct open_entry {
    char name[256];   // Empty string = unused slot
    uint32_t inode;
    uint8_t status;   // EMPTY, OCCUPIED, DELETED
};

#define STATUS_EMPTY    0
#define STATUS_OCCUPIED 1
#define STATUS_DELETED  2   // Tombstone for lazy deletion

struct open_hash_directory {
    struct open_entry *entries;
    size_t capacity;
    size_t num_entries;
};

// Lookup with linear probing
uint32_t open_lookup(struct open_hash_directory *dir, const char *name)
{
    uint32_t hash = djb2_hash(name);
    size_t index = hash % dir->capacity;
    size_t start = index;

    do {
        struct open_entry *entry = &dir->entries[index];

        if (entry->status == STATUS_EMPTY) {
            return 0;  // Not found - would have been here
        }
        if (entry->status == STATUS_OCCUPIED &&
            strcmp(entry->name, name) == 0) {
            return entry->inode;  // Found!
        }

        // Continue probing (skip over deleted entries)
        index = (index + 1) % dir->capacity;
    } while (index != start);  // Wrapped around

    return 0;  // Table full, not found
}

// Insertion with linear probing
int open_insert(struct open_hash_directory *dir, const char *name, uint32_t inode)
{
    if (dir->num_entries >= dir->capacity * 0.7) {
        resize_and_rehash(dir);  // Maintain load factor < 0.7
    }

    uint32_t hash = djb2_hash(name);
    size_t index = hash % dir->capacity;

    while (dir->entries[index].status == STATUS_OCCUPIED) {
        if (strcmp(dir->entries[index].name, name) == 0) {
            return -1;  // Duplicate
        }
        index = (index + 1) % dir->capacity;
    }

    strcpy(dir->entries[index].name, name);
    dir->entries[index].inode = inode;
    dir->entries[index].status = STATUS_OCCUPIED;
    dir->num_entries++;
    return 0;
}

// Advantages:
// - Better cache locality (entries contiguous)
// - No extra pointer overhead
// - Simpler memory management
//
// Disadvantages:
// - Clustering degrades performance
// - Must maintain low load factor (wastes space)
// - Deletion requires tombstones (complicates logic)
```

| Aspect | Separate Chaining | Open Addressing |
|---|---|---|
| Memory layout | Scattered (linked list nodes) | Contiguous (array slots) |
| Cache performance | Poor (pointer chasing) | Good (linear memory access) |
| Load factor tolerance | High (chains grow) | Low (must keep < 0.7) |
| Deletion complexity | Simple (unlink node) | Complex (tombstones) |
| On-disk suitability | Harder (variable chains) | Easier (fixed layout) |
| Worst case | O(n) if all in one bucket | O(n) if severely clustered |
For persistent directory storage on disk, neither pure chaining nor pure open addressing works ideally. Chaining requires allocating overflow blocks. Open addressing requires fixed-size tables. Real file systems use hybrid approaches—ext4's htree uses a tree of hash buckets, each containing a linear list of entries.
ext4's HTree (Hash Tree, also called dir_index) is the most widely deployed hash-based directory implementation. It combines hashing with a tree structure to handle directories of any size efficiently.
HTree Architecture:
The HTree is not a pure hash table—it's a tree where each level uses hashing to direct the search. Think of it as a B-tree where the comparison key is a hash value rather than a raw filename.
```
ext4 HTree Directory Layout
═══════════════════════════

Block 0: Root block (dx_root structure)
├── Fake dir entries (for backward compatibility)
│   ├── "." entry  (inode, rec_len points to "..")
│   └── ".." entry (inode, rec_len covers rest of header)
├── dx_root_info:
│   ├── hash_version: 1 (half MD4)
│   ├── info_length: 8
│   ├── indirect_levels: 0, 1, or 2
│   └── unused_flags
├── dx_countlimit:
│   ├── limit: max entries in this block
│   └── count: actual entries
└── dx_entry array:
    ├── {hash: 0x00000000, block: 1}   // Entries with hash < X
    ├── {hash: 0x3FFFFFFF, block: 2}   // Entries with hash in range
    ├── {hash: 0x7FFFFFFF, block: 3}   // ...
    └── ...

Leaf blocks (1, 2, 3, ...): Standard directory blocks
└── Linear list of ext4_dir_entry_2 structures;
    all entries have hash values in the range for this leaf block

Lookup: hash("readme.txt") → find dx_entry with matching range
        → read leaf block → linear search within block
```

HTree Lookup Algorithm:
```c
// Simplified ext4 HTree lookup
uint32_t htree_lookup(
    struct inode *dir,
    const char *name,
    int namelen)
{
    struct buffer_head *bh;
    struct dx_root *root;
    struct dx_entry *entries;
    struct dx_hash_info hinfo;
    unsigned int block;

    // Step 1: Compute hash of filename
    ext4_hash(dir, name, namelen, &hinfo);

    // Step 2: Read root block
    bh = ext4_read_dirblock(dir, 0);
    root = (struct dx_root *)bh->b_data;

    // Step 3: Binary search dx_entry array for hash range
    entries = root->entries;
    int count = le16_to_cpu(root->countlimit.count);

    // Find entry where entry[i].hash <= hinfo.hash < entry[i+1].hash
    int low = 0, high = count - 1;
    while (low < high) {
        int mid = (low + high + 1) / 2;
        if (le32_to_cpu(entries[mid].hash) <= hinfo.hash) {
            low = mid;
        } else {
            high = mid - 1;
        }
    }

    // Step 4: Follow pointer to leaf block
    block = le32_to_cpu(entries[low].block);
    brelse(bh);

    // Step 5: Handle indirect levels if present
    if (root->info.indirect_levels > 0) {
        // Read intermediate node, binary search again
        // (Up to 2 levels of indirection)
    }

    // Step 6: Read leaf block and linear search
    bh = ext4_read_dirblock(dir, block);

    // Linear search for exact name match within this block
    return linear_search_block(bh, name, namelen);
}

// Complexity:
// - Hash computation: O(name_length)
// - Root block read: 1 disk I/O
// - Binary search in root: O(log(entries_per_block))
// - Leaf block read: 1 disk I/O
// - Linear search in leaf: O(entries_per_block)
//
// Total: O(1) disk I/Os for small directories
//        O(height) disk I/Os for large (tree grows)
//        But height ≤ 3 always, so O(1) bounded
```

Why HTree Works Well:

- Bounded depth: at most three index levels (root plus up to two indirect levels), so any lookup needs only a few block reads.
- Backward compatible: the root block still begins with valid "." and ".." entries, so implementations without dir_index support can fall back to a linear scan.
- Good distribution: the half-MD4 hash spreads realistic filenames evenly across leaf blocks.
- Familiar leaves: leaf blocks are ordinary directory blocks, so the final step is the same linear search used by unindexed directories.
When many files hash to values in the same range, one leaf block becomes overloaded. ext4 handles this by keeping a minor_hash (second hash) and using linear search within blocks. In extreme cases (many similar names), performance degrades toward O(n). The hash algorithm was chosen to minimize this in practice.
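To make that concrete, here is a minimal user-space sketch of a leaf-block search. The structures (`leaf_block`, `leaf_entry`) are hypothetical and only loosely mirror ext4's real on-disk `ext4_dir_entry_2` records. The point it illustrates: within a leaf, the final check is always a full name comparison, so hash or minor_hash collisions can never return the wrong entry; they can only lengthen the linear scan.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory view of one leaf block (illustrative only;
 * the real ext4 format packs variable-length records). */
struct leaf_entry {
    uint32_t inode;
    uint8_t  name_len;
    char     name[255];
};

struct leaf_block {
    int count;
    struct leaf_entry entries[64];
};

/* Within a leaf, correctness never depends on the hash alone: every
 * candidate is confirmed with a full name comparison, so two names
 * that collide on (hash, minor_hash) are still distinguished. */
uint32_t leaf_lookup(const struct leaf_block *leaf,
                     const char *name, size_t name_len)
{
    for (int i = 0; i < leaf->count; i++) {
        const struct leaf_entry *e = &leaf->entries[i];
        if (e->name_len == name_len &&
            memcmp(e->name, name, name_len) == 0)
            return e->inode;   /* exact match */
    }
    return 0;                  /* not found in this leaf */
}
```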
Hash tables have a fundamental challenge: the number of buckets must match the number of entries for optimal performance. Too few buckets cause collisions; too many waste space. Resizing adjusts the table as entries are added or removed.
In-Memory Resizing:
```c
// Resizing an in-memory hash table
void resize_hash_table(struct hash_directory *dir, size_t new_capacity)
{
    // Step 1: Allocate new bucket array
    struct chain_entry **new_buckets =
        calloc(new_capacity, sizeof(*new_buckets));

    // Step 2: Rehash all existing entries
    for (size_t i = 0; i < dir->num_buckets; i++) {
        struct chain_entry *entry = dir->buckets[i];
        while (entry != NULL) {
            struct chain_entry *next = entry->next;

            // Compute new bucket for this entry
            uint32_t hash = djb2_hash(entry->name);
            size_t new_bucket = hash % new_capacity;

            // Insert into new table
            entry->next = new_buckets[new_bucket];
            new_buckets[new_bucket] = entry;

            entry = next;
        }
    }

    // Step 3: Replace old array
    free(dir->buckets);
    dir->buckets = new_buckets;
    dir->num_buckets = new_capacity;
}

// When to resize:
// - Load factor > 0.75: Grow (typically double)
// - Load factor < 0.25: Shrink (typically halve)

// Complexity: O(n) to rehash all entries
// Amortized: O(1) per operation if doubling strategy used
```

On-Disk Resizing Challenge:
Resizing is trivial in memory but complex on disk. Persistent hash structures take several approaches:
| Strategy | Mechanism | Pros | Cons |
|---|---|---|---|
| Full Rehash | Lock directory, rewrite all blocks | Simple, clean result | Slow, blocks all operations |
| Incremental Rehash | Migrate entries gradually during normal operations | Non-blocking | Complex state management |
| Extendible Hashing | Use high bits of hash to select bucket; double by splitting one bucket at a time | Efficient growth | Complex implementation |
| Linear Hashing | Grow one bucket at a time in round-robin | Gradual, predictable | Temporarily uneven distribution |
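As a sketch of the two decisions extendible hashing has to make (which bucket a hash maps to, and whether an overflowing bucket can split on its own or forces the directory to double), here is a small self-contained example that mirrors the diagram below. The structure and function names are hypothetical, and the actual entry redistribution is omitted; this illustrates the policy, not a full implementation.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical in-memory model of extendible hashing (illustrative;
 * not taken from any real file system or database). */
struct eh_bucket {
    int local_depth;   /* how many hash bits this bucket actually uses */
    int count;         /* entries currently stored */
    int capacity;      /* fixed bucket size, e.g. one disk block */
};

struct eh_table {
    int global_depth;              /* directory has 2^global_depth slots */
    struct eh_bucket **dir;        /* slots may alias the same bucket */
};

/* Bucket selection: the top global_depth bits of the hash index
 * the directory, exactly as in the diagram that follows. */
uint32_t eh_dir_index(const struct eh_table *t, uint32_t hash)
{
    if (t->global_depth == 0)
        return 0;                  /* single bucket, nothing to select */
    return hash >> (32 - t->global_depth);
}

/* Growth policy when an insert lands in a full bucket:
 * - local depth <  global depth: split just that bucket
 * - local depth == global depth: double the directory, then split */
enum eh_action { EH_FITS, EH_SPLIT_BUCKET, EH_DOUBLE_THEN_SPLIT };

enum eh_action eh_insert_action(const struct eh_table *t,
                                const struct eh_bucket *b)
{
    if (b->count < b->capacity)
        return EH_FITS;
    return (b->local_depth < t->global_depth) ? EH_SPLIT_BUCKET
                                              : EH_DOUBLE_THEN_SPLIT;
}

int main(void)
{
    /* Mirror the diagram: global depth 2, bucket C has local depth 1. */
    struct eh_bucket c = { .local_depth = 1, .count = 4, .capacity = 4 };
    struct eh_table  t = { .global_depth = 2, .dir = NULL };

    uint32_t hash = 0xB0000000u;   /* top bits 10... -> slot 2, bucket C */
    printf("directory slot: %u\n", eh_dir_index(&t, hash));
    printf("action on full insert: %d (1 = split bucket only)\n",
           eh_insert_action(&t, &c));
    return 0;
}
```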
```
Extendible Hashing (used in some databases)
═══════════════════════════════════════════

Global depth: 2 (using 2 bits of hash)
Directory size: 2^2 = 4 entries

Directory:
  00 → Bucket A (local depth 2)
  01 → Bucket B (local depth 2)
  10 → Bucket C (local depth 1) ─┐
  11 → Bucket C (local depth 1) ─┘ (shared)

To insert entry with hash 10110...:
  - Use top 2 bits: 10
  - Look in directory[10] → Bucket C
  - If Bucket C full:
      - If local depth < global depth: split bucket
      - If local depth = global depth: double directory

After splitting Bucket C (local depth 2):
  10 → Bucket C'  (entries where bit 2 = 0)
  11 → Bucket C'' (entries where bit 2 = 1)

Key insight: Directory doubles, but only one bucket splits.
Most data stays in place.
```

ext4 HTree Splitting:
ext4 uses a simpler approach—when a leaf block fills, it splits into two blocks and the parent index is updated:
```c
// Simplified HTree split on insertion
int htree_add_entry(struct inode *dir, struct dentry *dentry)
{
    struct dx_hash_info hinfo;
    struct buffer_head *bh;
    int block, err;

    // Compute hash
    ext4_hash(dir, dentry->d_name.name, dentry->d_name.len, &hinfo);

    // Find target leaf block
    block = htree_find_leaf(dir, &hinfo);
    bh = ext4_read_dirblock(dir, block);

    // Try to add to existing block
    err = add_dirent_to_buf(dentry, bh);
    if (err != -ENOSPC) {
        brelse(bh);
        return err;  // Success or error (not space)
    }

    // Block full - need to split
    return htree_split_and_insert(dir, dentry, &hinfo, bh);
}

int htree_split_and_insert(
    struct inode *dir,
    struct dentry *dentry,
    struct dx_hash_info *hinfo,
    struct buffer_head *old_bh)
{
    // Allocate new block
    int new_block = ext4_alloc_block(dir);
    struct buffer_head *new_bh = sb_getblk(dir->i_sb, new_block);

    // Partition entries by hash value
    // Entries with hash < median go to old block
    // Entries with hash >= median go to new block
    uint32_t split_hash = find_median_hash(old_bh);
    move_entries_above_hash(old_bh, new_bh, split_hash);

    // Update parent index to include new block
    // Add dx_entry for new_block with hash = split_hash
    update_parent_index(dir, new_block, split_hash);

    // Insert new entry into appropriate block
    if (hinfo->hash < split_hash) {
        return add_dirent_to_buf(dentry, old_bh);
    } else {
        return add_dirent_to_buf(dentry, new_bh);
    }
}
```

ext4's HTree allows at most 3 levels (root + 2 indirect). This bounds lookup to 3 block reads but limits directory size. With 4KB blocks and ~400 entries per index block, max entries ≈ 400 × 400 × 400 = 64 million files per directory—more than enough for any practical directory.
Hash table directories deliver dramatic improvements over linear lists, but understanding when and why helps make informed file system choices.
Benchmark: Linear List vs HTree:
| Directory Size | Linear List | ext4 HTree | Speedup |
|---|---|---|---|
| 100 files | 0.8ms | 0.9ms | 0.9x (slower due to hash overhead) |
| 1,000 files | 4ms | 0.9ms | 4.4x |
| 10,000 files | 35ms | 1.0ms | 35x |
| 100,000 files | 340ms | 1.2ms | 283x |
| 1,000,000 files | 3,400ms | 1.5ms | 2,267x |
Key Observations:
Crossover Point — For very small directories (< 100 files), hash overhead may exceed linear search savings. Most file systems start with a linear layout and convert to a hashed index automatically (see the sketch after these observations).
Scaling — Linear search time grows as O(n); HTree lookup cost grows only with tree depth, which increases logarithmically and is capped at three levels. The gap widens dramatically.
I/O Bound — For large directories on HDD, the difference is even more dramatic because HTree bounds disk reads.
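As a rough sketch of the "start linear, convert automatically" policy mentioned above: ext4 keeps a brand-new directory as a single linear block and only builds the HTree root once that block overflows. The structure and function below are hypothetical simplifications of that one-block threshold, not kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical directory state (illustrative, not a kernel structure). */
struct dir_state {
    uint64_t size_bytes;   /* bytes currently used by directory entries */
    uint32_t block_size;   /* file system block size, e.g. 4096 */
    bool     indexed;      /* has a hash index already been built? */
};

/* Small directories stay as one linear block: cheap to scan, no hash
 * overhead. Once an insert would spill into a second block, build the
 * index (ext4 applies a similar single-block threshold when the
 * dir_index feature is enabled). */
bool should_build_index(const struct dir_state *d, uint32_t entry_size)
{
    return !d->indexed && d->size_bytes + entry_size > d->block_size;
}
```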
Real-World Workload: Mail Server Comparison:
```
Scenario:  Email server, 500,000 messages in one directory
Operation: Delivering 100 new messages

Linear List (ext2 without dir_index):
  - Each delivery: lookup + insert
  - Lookup: average 250,000 comparisons × 100ns = 25ms
  - Insert: scan for empty slot = 25ms
  - Total per message: ~50ms
  - Total for 100 messages: 5 seconds

HTree (ext4 with dir_index):
  - Each delivery: hash + tree traversal + insert
  - Hash: ~1µs
  - Tree lookup: 2-3 block reads × 0.1ms = 0.3ms
  - Insert: find space in leaf + split if needed = 0.3ms
  - Total per message: ~0.6ms
  - Total for 100 messages: 60 milliseconds

Result: 83x speedup for message delivery
Real impact: Usable mail server vs. unusable mail server
```

Limitations of Hash Tables:
Hash directories excel at point lookups but have weaknesses:
- No sorted order: ls still requires reading all entries and sorting them in memory (a user-space sketch of this cost follows below).
- No range queries: finding all files starting with 'A' requires a full scan.

B-tree directories (covered next) provide O(log n) lookup while maintaining sorted order. XFS, Btrfs, and modern file systems increasingly favor B-trees over hash tables for this reason.
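As promised above, a short user-space program makes the ordering limitation concrete: listing a directory in name order with plain POSIX calls has no choice but to read every entry into memory and sort it, no matter how the file system indexes lookups (error handling trimmed for brevity).

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* The on-disk order of a hashed directory is hash order, so a sorted
 * listing means: read everything, then sort in memory. */
static int cmp_names(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : ".";
    DIR *d = opendir(path);
    if (!d) { perror("opendir"); return 1; }

    char **names = NULL;
    size_t n = 0;
    struct dirent *ent;
    while ((ent = readdir(d)) != NULL) {          /* read ALL entries */
        names = realloc(names, (n + 1) * sizeof(*names));
        names[n++] = strdup(ent->d_name);
    }
    closedir(d);

    qsort(names, n, sizeof(*names), cmp_names);   /* sort in memory */
    for (size_t i = 0; i < n; i++) {
        puts(names[i]);
        free(names[i]);
    }
    free(names);
    return 0;
}
```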
Hash table directories transformed file system performance for large directories, enabling workloads that were previously impractical.
What's Next:
Hash tables sacrifice ordering for speed. B-trees achieve both O(log n) lookups AND sorted order. The next page examines B-tree directory implementation—how this balanced tree structure provides the best general-purpose directory performance.
You now understand hash table directory implementation comprehensively: hash functions, collision resolution, ext4's HTree, dynamic resizing, and performance characteristics. This knowledge explains why ext4 with dir_index handles large directories so much better than older file systems.