Performance isn't the only consideration when choosing between index types. Storage and memory consumption directly impact infrastructure costs, cache efficiency, and system scalability. This page examines how hash and tree indexes differ in their space utilization patterns, including structural overhead, fill factors, fragmentation behavior, and growth characteristics.
For large-scale systems, index space consumption can dominate total storage costs—indexes often exceed the size of the data they reference. Understanding space trade-offs is essential for informed index design decisions.
By the end of this page, you will understand: the structural overhead of hash vs tree indexes, how fill factors affect space utilization, fragmentation patterns and their implications, growth behavior during bulk and incremental loading, and strategies for optimizing index space consumption.
Both index types impose structural overhead beyond the raw key-value storage. Understanding this overhead is crucial for capacity planning.
Hash Index Overhead
Static hash indexes have relatively simple structures:
Bucket directory: Fixed-size array of bucket pointers
Bucket headers: Metadata per bucket
Entry overhead: Per-entry metadata
Dynamic hash indexes (extendible, linear) add:
| Component | Static Hashing | Extendible Hashing | Linear Hashing |
|---|---|---|---|
| Directory | B × 8 bytes (fixed) | 2^d × 8 bytes (grows) | None (implicit) |
| Bucket header | ~24 bytes each | ~32 bytes each | ~24 bytes each |
| Per-entry overhead | ~0-8 bytes | ~0-8 bytes | ~0-8 bytes |
| Overflow management | Chained pages | Local depth tracking | Overflow chains as needed |
| Typical overhead ratio | 5-10% | 8-15% | 5-12% |
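The directory row above grows geometrically for extendible hashing. A minimal sketch (assuming 8-byte bucket pointers, as in the table; the depth values are illustrative) shows why directory doubling produces jumps in space usage:

```python
# Sketch: directory overhead of extendible hashing as global depth d grows.
# Assumes 8-byte bucket pointers, matching the table above.

POINTER_SIZE = 8  # bytes per directory slot

def directory_bytes(global_depth: int) -> int:
    """Directory size: one pointer per slot, 2^d slots."""
    return (2 ** global_depth) * POINTER_SIZE

for d in (10, 16, 20, 24):
    print(f"depth {d:2d}: {directory_bytes(d):>13,} bytes")

# Each time the directory doubles (d -> d+1), its size doubles too,
# which is why extendible hashing's space grows in jumps.
```

At depth 24 the directory alone is 128 MB, even before any bucket pages, which is why directory growth matters for capacity planning.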
B+Tree Index Overhead
B+trees have more complex structural requirements:
Internal nodes: Non-leaf nodes contain only separator keys and child pointers
Node headers: Per-node metadata
Separator keys in internal nodes: Keys duplicated for navigation
Sibling pointers: Leaf linking for range scans
| Component | Size Formula | Typical Contribution |
|---|---|---|
| Leaf nodes (data) | N × (key + value + overhead) | 85-95% of index |
| Internal nodes | (levels - 1) × nodes_per_level × node_size | 1-5% of index |
| Node headers | total_nodes × header_size | 1-3% of index |
| Separator key duplication | Varies with key length | 1-5% of index |
| Sibling pointers | leaf_count × 16 bytes | <1% of index |
| Typical total overhead | sum of components | 10-20% of index |
Hash indexes generally have lower structural overhead (5-15%) compared to B+trees (10-20%). However, the difference narrows when considering fill factor efficiency and is often outweighed by functional requirements like range query support.
Neither index type achieves 100% space utilization. Fill factor—the percentage of allocated space actually containing data—significantly impacts total storage requirements.
Hash Index Fill Factors
Hash bucket utilization depends on data distribution and bucket sizing:
Ideal case (uniform distribution):
Practical considerations:
```
// Hash index space calculation example
// Table: 10 million records, 100-byte entries

numRecords = 10,000,000
entrySize  = 100 bytes
pageSize   = 8,192 bytes   // 8 KB pages

// Approach 1: Conservative static hashing
numBuckets = 100,000                        // 100 records per bucket on average
entriesPerPage = pageSize / entrySize = 81

// Best case: perfect distribution
pagesNeeded = numRecords / entriesPerPage = 123,457 pages
actualPages = numBuckets + overflowPages    // May vary

// With 70% fill factor:
effectiveCapacity = numBuckets * entriesPerPage * 0.70
//                = 100,000 * 81 * 0.70 = 5,670,000 entries across the bucket array
// Need ~1.76x over-provisioning for 10M records

// Overflow chain impact:
// If 20% of buckets have overflow:
// Additional pages = 20,000 overflow pages
// Total: 100,000 + 20,000 = 120,000 pages
```
B+Tree Fill Factors
B+tree fill factors are configurable and have different implications:
After bulk loading:
After random insertions:
Database defaults:
| Scenario | Hash Index | B+Tree Index |
|---|---|---|
| Bulk load (uniform keys) | 70-85% | 90-100% |
| Bulk load (skewed keys) | 40-60% | 90-100% |
| After random inserts | 50-70% | 50-70% |
| After many deletes | Variable (holes in buckets) | 50-70% (merging possible) |
| Configurable? | Limited (bucket sizing) | Yes (fillfactor parameter) |
For B+trees, a lower fill factor (70-80%) reduces future split operations but increases initial size. For hash indexes, appropriate bucket count relative to expected data volume is the primary tuning lever. Both require understanding your workload's insert patterns.
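The fill-factor effect can be quantified with a short sketch. It reuses the running example's figures (10 million 100-byte entries, 8 KB pages); the formula is simply entries divided by effective entries per page:

```python
# Sketch: how fill factor inflates on-disk size. Constants mirror the
# running example (10M records, 100-byte entries, 8KB pages).

import math

PAGE_SIZE = 8192
ENTRY_SIZE = 100
ENTRIES_PER_PAGE = PAGE_SIZE // ENTRY_SIZE  # 81

def pages_needed(num_records: int, fill_factor: float) -> int:
    """Pages required when each page is only fill_factor full on average."""
    effective_per_page = ENTRIES_PER_PAGE * fill_factor
    return math.ceil(num_records / effective_per_page)

ideal      = pages_needed(10_000_000, 1.00)  # perfectly packed
hash_like  = pages_needed(10_000_000, 0.70)  # typical hash bucket utilization
btree_bulk = pages_needed(10_000_000, 0.90)  # bulk-loaded B+tree

print(ideal, hash_like, btree_bulk)
# The 70% case needs ~1.43x the pages of the ideal packing: fill factor
# is a direct multiplier on storage cost.
```

This is why bulk-loaded B+trees (90%+ fill) often end up smaller on disk than conservatively sized hash indexes despite their extra structural overhead.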
Over time, indexes develop internal fragmentation that wastes space and can degrade performance. Hash and tree indexes fragment differently.
Hash Index Fragmentation
Hash indexes fragment in specific patterns:
Overflow chain growth: As buckets fill, overflow pages are added
Bucket under-utilization: After deletes, buckets may be partially empty
Static sizing mismatch: If data volume changes significantly
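As a rough illustration of these patterns, the sketch below models overflow growth in a static hash index under an idealized assumption of perfectly uniform hashing, so every bucket overflows at the same moment; real key skew makes chains appear earlier and unevenly:

```python
# Sketch: overflow growth in a static hash index once buckets fill.
# Idealized model assuming perfectly uniform hashing; skewed keys would
# push some buckets into overflow while others sit half empty.

import math

ENTRIES_PER_PAGE = 81   # 8 KB pages, 100-byte entries
NUM_BUCKETS = 100_000

def total_pages(num_records: int) -> int:
    per_bucket = num_records / NUM_BUCKETS
    # Every bucket keeps at least its primary page, plus overflow pages
    # once it holds more entries than one page can fit.
    pages_per_bucket = max(1, math.ceil(per_bucket / ENTRIES_PER_PAGE))
    return NUM_BUCKETS * pages_per_bucket

print(total_pages(5_000_000))    # under capacity: primary pages only
print(total_pages(10_000_000))   # every bucket chains one overflow page
```

Note the step change: once the average bucket exceeds one page, every lookup pays for an extra page, and space jumps by a full page per bucket.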
B+Tree Fragmentation
B+trees experience different fragmentation patterns:
Internal fragmentation: Partially-filled nodes
External (physical) fragmentation: Logical order differs from physical
Ghost records: Deleted entries not immediately reclaimed
Both index types benefit from periodic rebuilding: B+trees can be rebuilt online in many databases (REINDEX CONCURRENTLY in PostgreSQL, ALTER INDEX REBUILD ONLINE in Oracle). Hash indexes typically require offline reorganization. The ease of B+tree maintenance is another practical advantage.
Understanding how indexes grow as data volume increases is essential for capacity planning. Hash and tree indexes scale differently.
Hash Index Growth Patterns
Static Hashing:
Extendible Hashing:
Linear Hashing:
```
// Space growth comparison for 1 million to 100 million records

// B+Tree (fanout 200, 100-byte entries, 8KB pages)
function btreeSize(numRecords):
    leafPages     = numRecords / entriesPerLeaf   // ~123 pages per 10K records
    internalPages = leafPages / fanout            // ~0.5% of leaf pages
    return (leafPages + internalPages) * pageSize

// 1M records:   ~100 MB
// 10M records:  ~1 GB
// 100M records: ~10 GB (linear growth)

// Static Hash (100K buckets, 8KB pages)
function staticHashSize(numRecords):
    bucketPages   = numBuckets                    // Fixed at 100K
    avgPerBucket  = numRecords / numBuckets
    overflowPages = max(0, numBuckets * (avgPerBucket / entriesPerPage - 1))
    return (bucketPages + overflowPages) * pageSize

// 1M records:   ~800 MB (mostly empty buckets)
// 10M records:  ~1 GB (good utilization)
// 100M records: ~10 GB, dominated by long overflow chains

// Extendible Hash (no single formula - depends on splits)
// Growth comes in jumps as the directory doubles
```
B+Tree Growth Patterns
B+trees exhibit highly predictable growth:
Height growth milestones (fanout ≈ 200):
For most practical databases (under a billion rows), B+tree height stays at 3-4 levels.
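These milestones follow directly from repeated multiplication by the fanout. A small sketch, assuming a uniform fanout of 200 at every level:

```python
# Sketch: B+tree height vs. record count, assuming every node (leaf and
# internal alike) holds ~200 entries. Height counts levels from root to
# leaf inclusive.

def btree_height(num_records: int, fanout: int = 200) -> int:
    height = 1
    capacity = fanout              # records a tree of this height can hold
    while capacity < num_records:
        capacity *= fanout         # one more level multiplies capacity by fanout
        height += 1
    return height

for n in (10_000, 1_000_000, 100_000_000, 8_000_000_000):
    print(f"{n:>13,} records -> height {btree_height(n)}")
```

Because capacity grows by a factor of 200 per level, even a 10x or 100x data increase adds at most one level, which is the source of B+trees' predictable scaling.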
| Data Volume | B+Tree Behavior | Hash Behavior |
|---|---|---|
| 10x growth | Linear size increase, same height | May need reorganization |
| 100x growth | Height +1, linear size | Definitely needs resize |
| Sudden spike | Splits absorb gracefully | Overflow chains degrade |
| Growth then shrinkage | Merges reclaim some space | Empty buckets remain |
| Predictability | Highly predictable | Depends on distribution |
B+trees provide predictable, linear space scaling that simplifies capacity planning. Hash indexes require more careful sizing—too small causes overflow chains, too large wastes space on empty buckets. Dynamic hashing schemes help but add their own complexity.
For database buffer pools and in-memory databases, how much memory an index requires directly impacts performance and costs.
Hash Index Memory Patterns
Hash index memory usage is characterized by:
Flat structure: No hierarchy to cache
Uniform access distribution:
Overflow chain locality:
B+Tree Memory Patterns
B+tree memory patterns are more favorable for caching:
Hierarchical hot spots:
Concentrated working set:
Range query locality:
| Aspect | Hash Index | B+Tree Index |
|---|---|---|
| Minimum useful cache | ~100% (all buckets needed) | ~1-5% (upper levels) |
| Cache hit pattern | Uniform random | Concentrated on hot paths |
| Benefit of 10% cache | ~10% hit rate | ~40-60% hit rate |
| Benefit of 50% cache | ~50% hit rate | ~80-90% hit rate |
| Prefetching value | Low (random access) | High (sequential leaves) |
B+trees provide leverage—a small cache investment yields disproportionate performance benefits because of hierarchical access patterns. Hash indexes require proportionally more memory to achieve similar cache hit rates. This is a significant hidden cost of hash indexing.
Practical Memory Sizing
For a 10GB index on a system with 2GB available for buffer pool:
Hash Index:
B+Tree Index:
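The scenario above can be approximated with a deliberately simplified model: hash lookups hit cache in proportion to the cached fraction of the index, while the B+tree keeps its internal nodes (roughly 1% of the index) resident and spends the rest of the pool on leaves. Real buffer managers use LRU-like replacement rather than pinning, so treat these numbers as illustrative:

```python
# Simplified model of the 10 GB index / 2 GB buffer pool scenario.
# Assumes uniform random lookups and a height-3 B+tree whose internal
# nodes (~1% of the index) stay cached. Illustrative only.

INDEX_GB = 10.0
CACHE_GB = 2.0

# Hash: flat structure, uniform access -> hit rate ~ cached fraction.
hash_hit_rate = CACHE_GB / INDEX_GB

# B+tree: a height-3 lookup reads 2 internal pages (always cached here)
# and 1 leaf page, which hits only if that leaf is in the pool.
internal_gb = 0.01 * INDEX_GB
leaf_cache_fraction = (CACHE_GB - internal_gb) / (INDEX_GB - internal_gb)
HEIGHT = 3
btree_hit_rate = ((HEIGHT - 1) + leaf_cache_fraction) / HEIGHT

print(f"hash:   {hash_hit_rate:.0%} of page reads served from cache")
print(f"b+tree: {btree_hit_rate:.0%} of page reads served from cache")
```

Even this crude model shows the leverage effect: the same 2 GB pool serves roughly 20% of hash page reads but well over half of B+tree page reads.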
Both index types offer techniques for reducing space consumption. Understanding these options enables more efficient index design.
Hash Index Optimizations
B+Tree Optimizations
```sql
-- PostgreSQL: High fill factor for stable data
CREATE INDEX idx_archive ON archived_orders (order_date)
    WITH (fillfactor = 95);

-- Partial index: Only index active users
CREATE INDEX idx_active_users ON users (last_login)
    WHERE status = 'active';
-- Indexes only ~10% of the table if most users are inactive

-- Covering index: Avoid table lookups
CREATE INDEX idx_orders_covering ON orders (customer_id)
    INCLUDE (order_date, total_amount);
-- Query can be answered from the index alone

-- Key compression (database-specific)
-- Oracle: ALTER INDEX idx_name REBUILD COMPRESS;
-- Some systems compress automatically

-- Deduplication (PostgreSQL 13+)
CREATE INDEX idx_status ON orders (status)
    WITH (deduplicate_items = on);
-- 'pending', 'shipped', 'delivered' repeated millions of times
```
Many space optimizations trade other resources: higher fill factors reduce future insert performance, compression adds CPU overhead, and partial indexes limit query coverage. Evaluate based on your specific workload characteristics.
Let's examine concrete space usage scenarios to ground the theoretical discussion.
Scenario: 100 Million User Table
Table structure:
| Metric | B+Tree Index | Static Hash | Extendible Hash |
|---|---|---|---|
| Key size | 8 bytes | 8 bytes | 8 bytes |
| Pointer size | 6 bytes (row pointer) | 6 bytes | 6 bytes |
| Entry size | ~14 bytes | ~14 bytes | ~14 bytes |
| Entries per page | ~580 | ~580 | ~580 |
| Leaf/bucket pages | ~172,500 | ~200,000 | ~180,000 |
| Internal/directory | ~900 pages | N/A | ~2,000 pages |
| Overhead estimate | 15% | 10% | 12% |
| Total size | ~1.5 GB | ~1.6 GB | ~1.5 GB |
Analysis: For this scenario, space usage is remarkably similar. The B+tree's internal nodes are offset by hash indexes' bucket overhead and potential overflow. The slight differences are within measurement error for practical purposes.
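The table's page counts can be roughly reproduced from its own constants (14-byte entries, 8 KB pages, fanout around 200). The idealized calculation below lands a little under the table's figures, which include fill-factor and overhead allowances:

```python
# Rough reconstruction of the 100M-row sizing table. Constants come from
# the table (8-byte key + 6-byte pointer, 8 KB pages); the result is an
# ideal lower bound before fill-factor and header overhead.

import math

NUM_ROWS = 100_000_000
ENTRY_SIZE = 14          # 8-byte key + 6-byte row pointer
PAGE_SIZE = 8192

entries_per_page = PAGE_SIZE // ENTRY_SIZE        # ~585
leaf_pages = math.ceil(NUM_ROWS / entries_per_page)

# Internal levels: each level is ~1/fanout the size of the one below it.
fanout = 200
internal_pages = 0
level = math.ceil(leaf_pages / fanout)
while level > 1:
    internal_pages += level
    level = math.ceil(level / fanout)
internal_pages += 1      # the root

total_bytes = (leaf_pages + internal_pages) * PAGE_SIZE
print(entries_per_page, leaf_pages, internal_pages)
print(f"~{total_bytes / 2**30:.2f} GiB before overhead")
```

Multiplying the ideal total by the table's ~15% overhead estimate brings it to roughly 1.5 GB, matching the B+tree column.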
Scenario: 100 Million Order Table with Date Range Access
Now consider a more complex scenario where range queries on order_date are important:
| Strategy | Indexes Required | Total Space | Query Support |
|---|---|---|---|
| Hash on order_id | 1 hash index | ~1.5 GB | Point lookup only |
| B+Tree on order_id | 1 B+tree index | ~1.5 GB | Point + range on ID |
| Hash + separate date B+tree | 2 indexes | ~3.5 GB | Point ID + range date |
| B+Tree on order_id + covering | 1 covering B+tree | ~2.5 GB | Full flexibility |
While hash indexes may seem space-efficient in isolation, real workloads often require range capabilities that hash cannot provide. The 'cost' of hash indexing includes the additional indexes needed for range queries—often doubling total index space compared to a well-designed B+tree strategy.
Space usage differences between hash and tree indexes are more nuanced than often presented. The structural overhead savings of hash indexes are frequently offset by other factors.
For most workloads, space usage is not a compelling differentiator between hash and tree indexes. B+trees' memory efficiency and avoidance of supplemental indexes often result in better overall space utilization despite slightly higher structural overhead.
What's Next
With performance and space characteristics thoroughly examined, the final page synthesizes everything into a practical selection framework. You'll learn to evaluate workloads systematically and make confident index type decisions for real-world scenarios.
You now understand the space utilization characteristics of hash and tree indexes, including overhead, fill factors, fragmentation, growth, and memory efficiency. These insights complete the technical comparison—next, we'll develop a practical decision framework.