Performance isn't the only consideration when choosing between index types. Storage and memory consumption directly impact infrastructure costs, cache efficiency, and system scalability. This page examines how hash and tree indexes differ in their space utilization patterns, including structural overhead, fill factors, fragmentation behavior, and growth characteristics.
For large-scale systems, index space consumption can dominate total storage costs—indexes often exceed the size of the data they reference. Understanding space trade-offs is essential for informed index design decisions.
By the end of this page, you will understand: the structural overhead of hash vs tree indexes, how fill factors affect space utilization, fragmentation patterns and their implications, growth behavior during bulk and incremental loading, and strategies for optimizing index space consumption.
Both index types impose structural overhead beyond the raw key-value storage. Understanding this overhead is crucial for capacity planning.
Hash Index Overhead
Static hash indexes have relatively simple structures:
Bucket directory: Fixed-size array of bucket pointers
Bucket headers: Metadata per bucket
Entry overhead: Per-entry metadata
Dynamic hash indexes (extendible, linear) add:
| Component | Static Hashing | Extendible Hashing | Linear Hashing |
|---|---|---|---|
| Directory | B × 8 bytes (fixed) | 2^d × 8 bytes (grows) | None (implicit) |
| Bucket header | ~24 bytes each | ~32 bytes each | ~24 bytes each |
| Per-entry overhead | ~0-8 bytes | ~0-8 bytes | ~0-8 bytes |
| Overflow management | Chained pages | Local depth tracking | Overflow chains as needed |
| Typical overhead ratio | 5-10% | 8-15% | 5-12% |
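The directory row above grows geometrically for extendible hashing. A minimal sketch (assuming 8-byte bucket pointers, as in the table; the depth values are illustrative) shows why directory doubling produces jumps in space usage:

```python
# Sketch: directory overhead of extendible hashing as global depth d grows.
# Assumes 8-byte bucket pointers, matching the table above.

POINTER_SIZE = 8  # bytes per directory slot

def directory_bytes(global_depth: int) -> int:
    """Directory size: one pointer per slot, 2^d slots."""
    return (2 ** global_depth) * POINTER_SIZE

for d in (10, 16, 20, 24):
    print(f"depth {d:2d}: {directory_bytes(d):>13,} bytes")

# Each time the directory doubles (d -> d+1), its size doubles too,
# which is why extendible hashing's space grows in jumps.
```

At depth 24 the directory alone is 128 MB, even before any bucket pages, which is why directory growth matters for capacity planning.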
B+Tree Index Overhead
B+trees have more complex structural requirements:
Internal nodes: Non-leaf nodes contain only separator keys and child pointers
Node headers: Per-node metadata
Separator keys in internal nodes: Keys duplicated for navigation
Sibling pointers: Leaf linking for range scans
| Component | Size Formula | Typical Contribution |
|---|---|---|
| Leaf nodes (data) | N × (key + value + overhead) | 85-95% of index |
| Internal nodes | (levels - 1) × nodes_per_level × node_size | 1-5% of index |
| Node headers | total_nodes × header_size | 1-3% of index |
| Separator key duplication | Varies with key length | 1-5% of index |
| Sibling pointers | leaf_count × 16 bytes | <1% of index |
| Typical total overhead | sum of components | 10-20% of index |
Hash indexes generally have lower structural overhead (5-15%) compared to B+trees (10-20%). However, the difference narrows when considering fill factor efficiency and is often outweighed by functional requirements like range query support.
Neither index type achieves 100% space utilization. Fill factor—the percentage of allocated space actually containing data—significantly impacts total storage requirements.
Hash Index Fill Factors
Hash bucket utilization depends on data distribution and bucket sizing:
Ideal case (uniform distribution):
Practical considerations:
```
// Hash index space calculation example
// Table: 10 million records, 100-byte entries

numRecords = 10,000,000
entrySize  = 100 bytes
pageSize   = 8,192 bytes   // 8 KB pages

// Approach 1: Conservative static hashing
numBuckets = 100,000                        // 100 records per bucket on average
entriesPerPage = pageSize / entrySize = 81

// Best case: perfect distribution
pagesNeeded = numRecords / entriesPerPage = 123,457 pages
actualPages = numBuckets + overflowPages    // May vary

// With 70% fill factor:
effectiveCapacity = numBuckets * entriesPerPage * 0.70
//                = 100,000 * 81 * 0.70 = 5,670,000 entries across the bucket array
// Need ~1.76x over-provisioning for 10M records

// Overflow chain impact:
// If 20% of buckets have overflow:
// Additional pages = 20,000 overflow pages
// Total: 100,000 + 20,000 = 120,000 pages
```
B+Tree Fill Factors
B+tree fill factors are configurable and have different implications:
After bulk loading:
After random insertions:
Database defaults:
| Scenario | Hash Index | B+Tree Index |
|---|---|---|
| Bulk load (uniform keys) | 70-85% | 90-100% |
| Bulk load (skewed keys) | 40-60% | 90-100% |
| After random inserts | 50-70% | 50-70% |
| After many deletes | Variable (holes in buckets) | 50-70% (merging possible) |
| Configurable? | Limited (bucket sizing) | Yes (fillfactor parameter) |
For B+trees, a lower fill factor (70-80%) reduces future split operations but increases initial size. For hash indexes, appropriate bucket count relative to expected data volume is the primary tuning lever. Both require understanding your workload's insert patterns.
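The fill-factor effect can be quantified with a short sketch. It reuses the running example's figures (10 million 100-byte entries, 8 KB pages); the formula is simply entries divided by effective entries per page:

```python
# Sketch: how fill factor inflates on-disk size. Constants mirror the
# running example (10M records, 100-byte entries, 8KB pages).

import math

PAGE_SIZE = 8192
ENTRY_SIZE = 100
ENTRIES_PER_PAGE = PAGE_SIZE // ENTRY_SIZE  # 81

def pages_needed(num_records: int, fill_factor: float) -> int:
    """Pages required when each page is only fill_factor full on average."""
    effective_per_page = ENTRIES_PER_PAGE * fill_factor
    return math.ceil(num_records / effective_per_page)

ideal      = pages_needed(10_000_000, 1.00)  # perfectly packed
hash_like  = pages_needed(10_000_000, 0.70)  # typical hash bucket utilization
btree_bulk = pages_needed(10_000_000, 0.90)  # bulk-loaded B+tree

print(ideal, hash_like, btree_bulk)
# The 70% case needs ~1.43x the pages of the ideal packing: fill factor
# is a direct multiplier on storage cost.
```

This is why bulk-loaded B+trees (90%+ fill) often end up smaller on disk than conservatively sized hash indexes despite their extra structural overhead.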
Over time, indexes develop internal fragmentation that wastes space and can degrade performance. Hash and tree indexes fragment differently.
Hash Index Fragmentation
Hash indexes fragment in specific patterns:
Overflow chain growth: As buckets fill, overflow pages are added
Bucket under-utilization: After deletes, buckets may be partially empty
Static sizing mismatch: If data volume changes significantly
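As a rough illustration of these patterns, the sketch below models overflow growth in a static hash index under an idealized assumption of perfectly uniform hashing, so every bucket overflows at the same moment; real key skew makes chains appear earlier and unevenly:

```python
# Sketch: overflow growth in a static hash index once buckets fill.
# Idealized model assuming perfectly uniform hashing; skewed keys would
# push some buckets into overflow while others sit half empty.

import math

ENTRIES_PER_PAGE = 81   # 8 KB pages, 100-byte entries
NUM_BUCKETS = 100_000

def total_pages(num_records: int) -> int:
    per_bucket = num_records / NUM_BUCKETS
    # Every bucket keeps at least its primary page, plus overflow pages
    # once it holds more entries than one page can fit.
    pages_per_bucket = max(1, math.ceil(per_bucket / ENTRIES_PER_PAGE))
    return NUM_BUCKETS * pages_per_bucket

print(total_pages(5_000_000))    # under capacity: primary pages only
print(total_pages(10_000_000))   # every bucket chains one overflow page
```

Note the step change: once the average bucket exceeds one page, every lookup pays for an extra page, and space jumps by a full page per bucket.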
B+Tree Fragmentation
B+trees experience different fragmentation patterns:
Internal fragmentation: Partially-filled nodes
External (physical) fragmentation: Logical order differs from physical
Ghost records: Deleted entries not immediately reclaimed
Both index types benefit from periodic rebuilding: B+trees can be rebuilt online in many databases (REINDEX CONCURRENTLY in PostgreSQL, ALTER INDEX REBUILD ONLINE in Oracle). Hash indexes typically require offline reorganization. The ease of B+tree maintenance is another practical advantage.
Understanding how indexes grow as data volume increases is essential for capacity planning. Hash and tree indexes scale differently.
Hash Index Growth Patterns
Static Hashing:
Extendible Hashing:
Linear Hashing:
```
// Space growth comparison for 1 million to 100 million records

// B+Tree (fanout 200, 100-byte entries, 8KB pages)
function btreeSize(numRecords):
    leafPages     = numRecords / entriesPerLeaf   // ~123 pages per 10K records
    internalPages = leafPages / fanout            // ~0.5% of leaf pages
    return (leafPages + internalPages) * pageSize

// 1M records:   ~100 MB
// 10M records:  ~1 GB
// 100M records: ~10 GB (linear growth)

// Static Hash (100K buckets, 8KB pages)
function staticHashSize(numRecords):
    bucketPages   = numBuckets                    // Fixed at 100K
    avgPerBucket  = numRecords / numBuckets
    overflowPages = max(0, numBuckets * (avgPerBucket / entriesPerPage - 1))
    return (bucketPages + overflowPages) * pageSize

// 1M records:   ~800 MB (mostly empty buckets)
// 10M records:  ~1 GB (good utilization)
// 100M records: ~10 GB, dominated by long overflow chains

// Extendible Hash (no single formula - depends on splits)
// Growth comes in jumps as the directory doubles
```
B+Tree Growth Patterns
B+trees exhibit highly predictable growth:
Height growth milestones (fanout ≈ 200):
For most practical databases (under a billion rows), B+tree height stays at 3-4 levels.
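These milestones follow directly from repeated multiplication by the fanout. A small sketch, assuming a uniform fanout of 200 at every level:

```python
# Sketch: B+tree height vs. record count, assuming every node (leaf and
# internal alike) holds ~200 entries. Height counts levels from root to
# leaf inclusive.

def btree_height(num_records: int, fanout: int = 200) -> int:
    height = 1
    capacity = fanout              # records a tree of this height can hold
    while capacity < num_records:
        capacity *= fanout         # one more level multiplies capacity by fanout
        height += 1
    return height

for n in (10_000, 1_000_000, 100_000_000, 8_000_000_000):
    print(f"{n:>13,} records -> height {btree_height(n)}")
```

Because capacity grows by a factor of 200 per level, even a 10x or 100x data increase adds at most one level, which is the source of B+trees' predictable scaling.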
| Data Volume | B+Tree Behavior | Hash Behavior |
|---|---|---|
| 10x growth | Linear size increase, same height | May need reorganization |
| 100x growth | Height +1, linear size | Definitely needs resize |
| Sudden spike | Splits absorb gracefully | Overflow chains degrade |
| Growth then shrinkage | Merges reclaim some space | Empty buckets remain |
| Predictability | Highly predictable | Depends on distribution |
B+trees provide predictable, linear space scaling that simplifies capacity planning. Hash indexes require more careful sizing—too small causes overflow chains, too large wastes space on empty buckets. Dynamic hashing schemes help but add their own complexity.
For database buffer pools and in-memory databases, how much memory an index requires directly impacts performance and costs.
Hash Index Memory Patterns
Hash index memory usage is characterized by:
Flat structure: No hierarchy to cache
Uniform access distribution:
Overflow chain locality:
B+Tree Memory Patterns
B+tree memory patterns are more favorable for caching:
Hierarchical hot spots:
Concentrated working set:
Range query locality:
| Aspect | Hash Index | B+Tree Index |
|---|---|---|
| Minimum useful cache | ~100% (all buckets needed) | ~1-5% (upper levels) |
| Cache hit pattern | Uniform random | Concentrated on hot paths |
| Benefit of 10% cache | ~10% hit rate | ~40-60% hit rate |
| Benefit of 50% cache | ~50% hit rate | ~80-90% hit rate |
| Prefetching value | Low (random access) | High (sequential leaves) |
B+trees provide leverage—a small cache investment yields disproportionate performance benefits because of hierarchical access patterns. Hash indexes require proportionally more memory to achieve similar cache hit rates. This is a significant hidden cost of hash indexing.
Practical Memory Sizing
For a 10GB index on a system with 2GB available for buffer pool:
Hash Index:
B+Tree Index:
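The scenario above can be approximated with a deliberately simplified model: hash lookups hit cache in proportion to the cached fraction of the index, while the B+tree keeps its internal nodes (roughly 1% of the index) resident and spends the rest of the pool on leaves. Real buffer managers use LRU-like replacement rather than pinning, so treat these numbers as illustrative:

```python
# Simplified model of the 10 GB index / 2 GB buffer pool scenario.
# Assumes uniform random lookups and a height-3 B+tree whose internal
# nodes (~1% of the index) stay cached. Illustrative only.

INDEX_GB = 10.0
CACHE_GB = 2.0

# Hash: flat structure, uniform access -> hit rate ~ cached fraction.
hash_hit_rate = CACHE_GB / INDEX_GB

# B+tree: a height-3 lookup reads 2 internal pages (always cached here)
# and 1 leaf page, which hits only if that leaf is in the pool.
internal_gb = 0.01 * INDEX_GB
leaf_cache_fraction = (CACHE_GB - internal_gb) / (INDEX_GB - internal_gb)
HEIGHT = 3
btree_hit_rate = ((HEIGHT - 1) + leaf_cache_fraction) / HEIGHT

print(f"hash:   {hash_hit_rate:.0%} of page reads served from cache")
print(f"b+tree: {btree_hit_rate:.0%} of page reads served from cache")
```

Even this crude model shows the leverage effect: the same 2 GB pool serves roughly 20% of hash page reads but well over half of B+tree page reads.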
Both index types offer techniques for reducing space consumption. Understanding these options enables more efficient index design.
Hash Index Optimizations
B+Tree Optimizations
```sql
-- PostgreSQL: High fill factor for stable data
CREATE INDEX idx_archive ON archived_orders (order_date)
    WITH (fillfactor = 95);

-- Partial index: Only index active users
CREATE INDEX idx_active_users ON users (last_login)
    WHERE status = 'active';
-- Indexes only ~10% of the table if most users are inactive

-- Covering index: Avoid table lookups
CREATE INDEX idx_orders_covering ON orders (customer_id)
    INCLUDE (order_date, total_amount);
-- Query can be answered from the index alone

-- Key compression (database-specific)
-- Oracle: ALTER INDEX idx_name REBUILD COMPRESS;
-- Some systems compress automatically

-- Deduplication (PostgreSQL 13+)
CREATE INDEX idx_status ON orders (status)
    WITH (deduplicate_items = on);
-- 'pending', 'shipped', 'delivered' repeated millions of times
```
Many space optimizations trade other resources: higher fill factors reduce future insert performance, compression adds CPU overhead, and partial indexes limit query coverage. Evaluate based on your specific workload characteristics.
Let's examine concrete space usage scenarios to ground the theoretical discussion.
Scenario: 100 Million User Table
Table structure:
| Metric | B+Tree Index | Static Hash | Extendible Hash |
|---|---|---|---|
| Key size | 8 bytes | 8 bytes | 8 bytes |
| Pointer size | 6 bytes (row pointer) | 6 bytes | 6 bytes |
| Entry size | ~14 bytes | ~14 bytes | ~14 bytes |
| Entries per page | ~580 | ~580 | ~580 |
| Leaf/bucket pages | ~172,500 | ~200,000 | ~180,000 |
| Internal/directory | ~900 pages | N/A | ~2,000 pages |
| Overhead estimate | 15% | 10% | 12% |
| Total size | ~1.5 GB | ~1.6 GB | ~1.5 GB |
Analysis: For this scenario, space usage is remarkably similar. The B+tree's internal nodes are offset by hash indexes' bucket overhead and potential overflow. The slight differences are within measurement error for practical purposes.
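The table's page counts can be roughly reproduced from its own constants (14-byte entries, 8 KB pages, fanout around 200). The idealized calculation below lands a little under the table's figures, which include fill-factor and overhead allowances:

```python
# Rough reconstruction of the 100M-row sizing table. Constants come from
# the table (8-byte key + 6-byte pointer, 8 KB pages); the result is an
# ideal lower bound before fill-factor and header overhead.

import math

NUM_ROWS = 100_000_000
ENTRY_SIZE = 14          # 8-byte key + 6-byte row pointer
PAGE_SIZE = 8192

entries_per_page = PAGE_SIZE // ENTRY_SIZE        # ~585
leaf_pages = math.ceil(NUM_ROWS / entries_per_page)

# Internal levels: each level is ~1/fanout the size of the one below it.
fanout = 200
internal_pages = 0
level = math.ceil(leaf_pages / fanout)
while level > 1:
    internal_pages += level
    level = math.ceil(level / fanout)
internal_pages += 1      # the root

total_bytes = (leaf_pages + internal_pages) * PAGE_SIZE
print(entries_per_page, leaf_pages, internal_pages)
print(f"~{total_bytes / 2**30:.2f} GiB before overhead")
```

Multiplying the ideal total by the table's ~15% overhead estimate brings it to roughly 1.5 GB, matching the B+tree column.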
Scenario: 100 Million Order Table with Date Range Access
Now consider a more complex scenario where range queries on order_date are important:
| Strategy | Indexes Required | Total Space | Query Support |
|---|---|---|---|
| Hash on order_id | 1 hash index | ~1.5 GB | Point lookup only |
| B+Tree on order_id | 1 B+tree index | ~1.5 GB | Point + range on ID |
| Hash + separate date B+tree | 2 indexes | ~3.5 GB | Point ID + range date |
| B+Tree on order_id + covering | 1 covering B+tree | ~2.5 GB | Full flexibility |
While hash indexes may seem space-efficient in isolation, real workloads often require range capabilities that hash cannot provide. The 'cost' of hash indexing includes the additional indexes needed for range queries—often doubling total index space compared to a well-designed B+tree strategy.
Space usage differences between hash and tree indexes are more nuanced than often presented. The structural overhead savings of hash indexes are frequently offset by other factors.
For most workloads, space usage is not a compelling differentiator between hash and tree indexes. B+trees' memory efficiency and avoidance of supplemental indexes often result in better overall space utilization despite slightly higher structural overhead.
What's Next
With performance and space characteristics thoroughly examined, the final page synthesizes everything into a practical selection framework. You'll learn to evaluate workloads systematically and make confident index type decisions for real-world scenarios.
You now understand the space utilization characteristics of hash and tree indexes, including overhead, fill factors, fragmentation, growth, and memory efficiency. These insights complete the technical comparison—next, we'll develop a practical decision framework.