Point queries, lookups for a single specific key value, are the core operation where hash indexes have their theoretical advantage. When you execute a query like `SELECT * FROM users WHERE user_id = 12345`, you're performing a point query. The question is: how much does the O(1) advantage of hashing actually matter in practice?
This page dissects point query performance across both index types, moving beyond asymptotic complexity to examine real-world factors including bucket overflow, cache effects, disk I/O patterns, and concurrent access scenarios.
By the end of this page, you will understand: the mechanics of point queries in both hash and tree indexes, why theoretical O(1) doesn't always translate to practical dominance, conditions under which hash indexes genuinely outperform trees, and how to reason about point query performance in your specific context.
Understanding the precise mechanics of hash-based point queries reveals both their elegance and their limitations.
The Ideal Case: Direct Bucket Access
In the ideal scenario, a hash point query proceeds as follows:

1. Compute `hash(key) mod numBuckets` to locate the target bucket (CPU only, no I/O).
2. Read the bucket page from disk (1 I/O).
3. Scan the bucket's entries for the matching key, then fetch the row it points to (possibly 1 more I/O).

Total cost: 1-2 I/Os in the ideal case, independent of table size.
The beauty of this approach is its simplicity. No matter whether your table has 1,000 rows or 1 billion rows, the number of I/O operations remains constant. This is the O(1) property in action.
```
function hashPointQuery(key, hashTable):
    // Step 1: Compute bucket address
    bucketNumber = hash(key) mod hashTable.numBuckets

    // Step 2: Access primary bucket (1 disk I/O)
    bucket = readBucket(bucketNumber)

    // Step 3: Search within bucket
    for entry in bucket.entries:
        if entry.key == key:
            return entry.value  // Found!

    // Step 4: Handle overflow chains (if present)
    overflowPage = bucket.overflowPointer
    while overflowPage != null:
        // Additional disk I/O for each overflow page
        overflow = readPage(overflowPage)
        for entry in overflow.entries:
            if entry.key == key:
                return entry.value  // Found in overflow
        overflowPage = overflow.nextOverflow

    return NOT_FOUND  // Key doesn't exist
```

The pseudocode above reveals a critical issue: overflow chains break the O(1) guarantee. If a bucket has k overflow pages, a lookup requires 1 + k I/Os. Poor hash function design or unexpected data skew can cause chains to grow arbitrarily long, degrading performance toward O(n/B), where B is the bucket count.
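To make the degradation concrete, here is a minimal Python sketch, with the bucket count and page capacity as assumed values, that simulates uniform hashing into a fixed bucket array and counts how many overflow pages accumulate as the table outgrows it:

```python
import random
from collections import Counter

BUCKETS = 1_000    # assumed bucket count
CAPACITY = 64      # assumed entries per bucket page

def overflow_pages(num_keys):
    """Hash keys uniformly into buckets; return overflow pages per bucket."""
    load = Counter(random.randrange(BUCKETS) for _ in range(num_keys))
    # A bucket holding e entries needs ceil(e / CAPACITY) - 1 overflow pages
    return [max(0, -(-e // CAPACITY) - 1) for e in load.values()]

for n in (50_000, 100_000, 200_000):
    pages = overflow_pages(n)
    print(f"{n:>7} keys: avg overflow pages/bucket = "
          f"{sum(pages) / BUCKETS:.2f}, worst chain = {max(pages)}")
```

Even with a perfectly uniform hash, a static bucket array degrades as load grows, which is why production hash indexes use dynamic schemes (linear or extendible hashing) that split buckets as the table grows.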
B+tree point queries follow a well-defined traversal pattern from root to leaf. While asymptotically O(log n), the constants are remarkably favorable.
The Traversal Process

1. Start at the root node, which is almost always resident in memory.
2. At each internal node, binary search the keys to find the correct child pointer and descend one level; each descent may cost one disk I/O on a cold cache.
3. At the leaf node, binary search for the key and return the matching entry (possibly one more I/O to fetch the row).

Total cost: typically 1-2 I/Os for point queries on warm caches, up to h+1 I/Os on a cold cache (where h is the tree height).
| Table Size | Approximate Height | Maximum I/Os (Cold) | Typical I/Os (Warm) |
|---|---|---|---|
| 10,000 rows | 2 levels | 2 | 1 |
| 100,000 rows | 2-3 levels | 3 | 1 |
| 1,000,000 rows | 3 levels | 3 | 1 |
| 100,000,000 rows | 4 levels | 4 | 1-2 |
| 1,000,000,000 rows | 4 levels | 4 | 1-2 |
| 100,000,000,000 rows | 5 levels | 5 | 2-3 |
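The heights in this table follow directly from node fanout. A quick sanity check in Python, assuming roughly 200 child pointers per node (actual fanout varies with key and page size):

```python
import math

def btree_height(num_rows: int, fanout: int = 200) -> int:
    """Smallest height h such that fanout**h >= num_rows."""
    return max(1, math.ceil(math.log(num_rows, fanout)))

for rows in (10_000, 1_000_000, 100_000_000,
             1_000_000_000, 100_000_000_000):
    print(f"{rows:>15,} rows -> {btree_height(rows)} levels")
```

Because height grows with the logarithm base 200 rather than base 2, each extra level multiplies capacity by about 200, so even a hundredfold growth in rows adds at most one level.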
Why Tree Height Matters Less Than You Think
The table above reveals a crucial insight: even for billion-row tables, B+tree height is only 4-5 levels. Moreover, the root and upper internal levels are accessed by every query, making them prime candidates for caching.
In practice, a well-tuned database keeps the root and the top 1-2 internal levels permanently in memory. This means the effective I/O cost of B+tree point queries is remarkably close to that of hash indexes for practical table sizes.
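To see why keeping the upper levels resident is cheap, consider the memory footprint per level. A rough calculation, assuming 8 KB pages and a fanout of 200 (typical but assumed values):

```python
PAGE_SIZE = 8 * 1024   # assumed 8 KB pages
FANOUT = 200           # assumed pointers per internal node

# Pages at each upper level of the tree (root = level 0)
for level in range(3):
    pages = FANOUT ** level
    mib = pages * PAGE_SIZE / 2**20
    print(f"level {level}: {pages:>6,} pages = {mib:>7.1f} MiB")
# level 0:      1 pages =     0.0 MiB  (root)
# level 1:    200 pages =     1.6 MiB
# level 2: 40,000 pages =   312.5 MiB
```

Pinning the root and level 1 costs under 2 MiB; even level 2 fits comfortably in a modern buffer pool, leaving at most the leaf read (plus possibly a heap fetch) as actual I/O.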
```
function btreePointQuery(key, btree):
    // Start at root (almost always in memory)
    currentNode = btree.root

    // Traverse from root to leaf
    while not currentNode.isLeaf:
        // Binary search for correct child pointer
        childIndex = binarySearch(currentNode.keys, key)
        // Follow pointer to next level (may hit disk)
        currentNode = readNode(currentNode.children[childIndex])

    // At leaf node: binary search for key
    entryIndex = binarySearch(currentNode.keys, key)
    if currentNode.keys[entryIndex] == key:
        // Key found: return value/pointer
        return currentNode.values[entryIndex]
    else:
        return NOT_FOUND

// Memory access pattern analysis:
// - Root node: Always in buffer pool (hot)
// - Level 1: Usually in buffer pool (~200 pages)
// - Level 2: Likely in buffer pool for hot data (~40K pages)
// - Leaf level: Read from disk (cold)
```

The advertised O(1) vs O(log n) complexity difference often misleads developers. Let's examine the gap between theory and practice.
Theoretical Expectations
| Metric | Hash Index | B+Tree Index | Apparent Winner |
|---|---|---|---|
| Asymptotic Complexity | O(1) | O(log n) | Hash |
| Hash Computation | Yes | No | Tree |
| Key Comparisons | ~1 (bucket size-dependent) | log₂(entries/node) × height | Hash |
| Worst Case | O(n/B) with overflow | O(log n) guaranteed | Tree |
Practical Realities
Several factors conspire to narrow the theoretical gap:
1. CPU Costs Are Often Negligible
Modern CPUs perform billions of comparisons per second. Whether a lookup does one comparison or a few dozen, the difference is well under a microsecond, invisible next to disk or even memory latency. The comparison cost of B+tree search is nearly irrelevant for disk-based systems.
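A quick worked count, using the assumed fanout of 200 from earlier and a 4-level tree (roughly a billion rows):

```python
import math

ENTRIES_PER_NODE = 200   # assumed fanout
HEIGHT = 4               # ~1 billion rows, per the height table above

comparisons = math.ceil(math.log2(ENTRIES_PER_NODE)) * HEIGHT
print(comparisons)  # 8 comparisons per node * 4 levels = 32 in total
# At a few nanoseconds each, that is well under a microsecond of CPU
# work, dwarfed by even one ~100 microsecond random SSD read.
```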
2. Real I/O Costs Are Similar
As demonstrated, hash indexes need 1 + k I/Os, where k is the overflow chain length; B+trees need roughly (height - cached levels) I/Os. For billion-row tables with proper caching, both typically require 1-2 I/Os.
3. Hash Overhead Is Non-Zero
Computing a good hash function (especially for long string keys) can be computationally expensive. Cryptographic or high-quality hash functions may require more CPU time than tree traversal comparisons.
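A small illustrative microbenchmark, using SHA-256 as a stand-in for any high-quality hash (fast non-cryptographic hashes such as xxHash narrow the gap but still scale linearly with key length):

```python
import hashlib
import timeit

# Hashing cost grows with key length -- a cost that a B+tree's
# short-circuiting key comparisons largely avoid.
for size in (16, 256, 4_096, 65_536):
    key = b"k" * size
    t = timeit.timeit(lambda: hashlib.sha256(key).digest(), number=50_000)
    print(f"{size:>6}-byte key: {t / 50_000 * 1e9:>9,.0f} ns per hash")
```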
4. Cache Behavior Favors Trees
B+tree hot paths (root, upper internals) are accessed by all queries and stay cached. Hash buckets are accessed uniformly at random—cache hit rates are lower for diverse key access patterns.
It's tempting to assume O(1) < O(log n) means hash indexes are always faster for equality lookups. In reality, the constants hidden by big-O notation (hash calculation, bucket overflow, cache behavior) often make B+trees competitive or superior for real workloads.
The relative performance of hash vs tree indexes for point queries depends heavily on workload characteristics. Let's examine the factors that influence which approach wins.
Factors Favoring Hash Indexes
- Queries of the form `WHERE key = value` with no range or ordering requirements

Key Distribution Analysis
The performance of hash indexes depends critically on how keys distribute across buckets. Consider three scenarios:
Scenario A: Ideal Uniform Distribution. Keys spread evenly across buckets, every chain fits in a single page, and every lookup costs the ideal 1-2 I/Os.

Scenario B: Moderate Skew. Some buckets accumulate short overflow chains; average lookups pay an extra I/O or two, but worst cases stay bounded.

Scenario C: Severe Skew (Hot Spot). A few buckets absorb most of the keys, overflow chains grow long, and lookups on hot keys degrade toward linear scans of the chain.
Real-world data often has skewed distributions that only reveal themselves in production. User IDs might be sequential, email domains cluster, timestamps have patterns. Hash indexes that perform well in testing may degrade mysteriously when exposed to production data distributions.
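The following Python sketch illustrates the failure mode with assumed values: a weak hash (Python's hash() is the identity for small integers, standing in for a poorly chosen hash function) combined with clustered keys funnels everything into one bucket, while the same bucket count handles random keys gracefully:

```python
import random
from collections import Counter

NUM_BUCKETS = 1024

def max_chain(keys):
    """Longest bucket chain after hashing keys into NUM_BUCKETS buckets."""
    counts = Counter(hash(k) % NUM_BUCKETS for k in keys)
    return max(counts.values())

random_keys = [random.getrandbits(64) for _ in range(100_000)]
# IDs allocated in strides of 1024 (e.g., sharded sequences): with an
# identity-style hash, every key lands in bucket 0.
clustered_keys = [i * NUM_BUCKETS for i in range(100_000)]

print("uniform keys :", max_chain(random_keys))     # ~130, short chains
print("clustered IDs:", max_chain(clustered_keys))  # 100,000, one chain
```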
Theoretical analysis provides intuition, but empirical benchmarking reveals ground truth. Here we examine methodologies for comparing hash and tree index point query performance.
Critical Benchmarking Considerations
- Cache State Control: results vary dramatically between cold and warm caches, so test both.
- Data Volume Variation: test across multiple table sizes (10K, 100K, 1M, 100M rows).
- Key Distribution: test with uniform, skewed, and real-world key distributions.
- Concurrent Access: single-threaded benchmarks don't predict multi-user performance.
- Include Write Operations: read-only benchmarks ignore index maintenance overhead.
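As a starting point, here is a minimal single-threaded harness in Python against the `point_query_test` table from the PostgreSQL example later in this section (psycopg2 and the connection string are assumptions; extend it with threads and mixed reads/writes per the checklist above):

```python
import random
import time

import psycopg2  # assumed driver; any DB-API connection works the same way

conn = psycopg2.connect("dbname=test")
cur = conn.cursor()

def bench_point_queries(n=10_000, max_id=10_000_000):
    """Run n random point queries; return mean latency in microseconds."""
    ids = [random.randint(1, max_id) for _ in range(n)]
    start = time.perf_counter()
    for i in ids:
        cur.execute("SELECT data FROM point_query_test WHERE id = %s", (i,))
        cur.fetchone()
    return (time.perf_counter() - start) / n * 1e6

# Run once cold (right after a server restart), then again warm
print(f"mean point query latency: {bench_point_queries():.1f} us")
```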
Representative point query latencies, in microseconds:

| Table Size (rows) | Hash (Cold) | Hash (Warm) | B+Tree (Cold) | B+Tree (Warm) |
|---|---|---|---|---|
| 10,000 | 120 | 2 | 180 | 8 |
| 100,000 | 125 | 2 | 240 | 8 |
| 1,000,000 | 135 | 2 | 310 | 10 |
| 10,000,000 | 280 | 3 | 380 | 12 |
| 100,000,000 | 450 | 4 | 440 | 15 |
| 1,000,000,000 | 3200 | 8 | 520 | 25 |
These representative numbers illustrate several patterns: (1) Warm cache performance is remarkably similar for both; (2) Hash cold performance degrades at very large scales due to overflow; (3) B+tree cold performance is more predictable; (4) Both achieve microsecond latencies—differences rarely matter at the application level.
PostgreSQL Benchmark Example
PostgreSQL provides both B-tree and hash indexes, allowing direct comparison. Here's a methodology for your own testing:
```sql
-- Create test table with significant data volume
CREATE TABLE point_query_test (
    id BIGINT PRIMARY KEY,
    data TEXT,
    created_at TIMESTAMP
);

-- Insert 10 million rows
INSERT INTO point_query_test
SELECT generate_series,
       md5(random()::text),
       NOW() - (random() * INTERVAL '365 days')
FROM generate_series(1, 10000000);

-- Create B-tree index (for comparison; the primary key already has one)
CREATE INDEX btree_idx ON point_query_test USING btree (id);

-- Create hash index on the same column
CREATE INDEX hash_idx ON point_query_test USING hash (id);

-- Warm the cache (requires the pg_prewarm extension); for cold-cache
-- runs, restart the server and drop the OS page cache instead
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('point_query_test');

-- Benchmark with EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS, TIMING)
SELECT * FROM point_query_test WHERE id = 5000000;

-- Force index usage if the planner chooses a sequential scan
SET enable_seqscan = off;
```

Real applications often query on multiple keys simultaneously. The behavior of hash and tree indexes differs significantly for these cases.
Single-Column Indexes on Multiple Predicates
Consider the query `WHERE column_a = 10 AND column_b = 20`.

With separate indexes on `column_a` and `column_b`, the optimizer typically uses the more selective index (or combines both via a bitmap AND) and applies the remaining predicate as a filter. Both hash and B+tree indexes can serve these individual equality lookups, so neither type has a decisive edge here.
Composite Indexes
For queries on multiple columns, composite indexes provide significant advantages.
Composite Hash Index on (column_a, column_b):
- `WHERE a = 10 AND b = 20` → O(1) lookup
- `WHERE a = 10` → Cannot use this index!

Composite B+Tree Index on (column_a, column_b):
- `WHERE a = 10 AND b = 20` → O(log n) lookup
- `WHERE a = 10` → O(log n + k) range scan
- `WHERE a = 10 AND b > 15` → O(log n + k) bounded range

| Query Pattern | Composite Hash | Composite B+Tree |
|---|---|---|
| `WHERE a = 10 AND b = 20` | ✓ O(1) | ✓ O(log n) |
| `WHERE a = 10` | ✗ Full scan | ✓ O(log n + k) |
| `WHERE b = 20` | ✗ Full scan | ✗ Full scan |
| `WHERE a = 10 AND b > 15` | ✗ Full scan | ✓ O(log n + k) |
| `WHERE a > 5 AND a < 15` | ✗ Full scan | ✓ O(log n + k) |
| `WHERE a = 10 OR b = 20` | ✗ Full scan | ✗ Needs 2 indexes |
B+tree composite indexes excel because they support leftmost prefix queries. A composite index on (a, b, c) effectively gives you three indexes: (a), (a, b), and (a, b, c). Hash composite indexes only work when ALL columns are specified with equality—far less flexible.
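The leftmost-prefix behavior falls out of sorted ordering, which a short Python sketch can model: treat a composite B+tree index as a sorted list of (a, b) tuples searched with binary search. This is an illustrative model, not a real index implementation:

```python
from bisect import bisect_left, bisect_right

# Model a composite B+tree index as a sorted list of (a, b) keys
index = sorted((a, b) for a in range(100) for b in range(100))

# WHERE a = 10 AND b = 20 -> point lookup, O(log n)
i = bisect_left(index, (10, 20))
assert index[i] == (10, 20)

# WHERE a = 10 -> leftmost-prefix scan: a contiguous block of a = 10 keys
lo, hi = bisect_left(index, (10,)), bisect_left(index, (11,))
assert len(index[lo:hi]) == 100            # O(log n + k)

# WHERE a = 10 AND b > 15 -> bounded range inside the prefix
lo = bisect_right(index, (10, 15))
assert len(index[lo:hi]) == 84             # O(log n + k)

# WHERE b = 20 -> b values are scattered; only a full scan finds them
matches = [key for key in index if key[1] == 20]  # O(n)
```

A hashed composite key has no such ordering: without both columns, there is no bucket to compute, which is exactly why the hash column of the table above shows full scans.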
The relative advantage of hash indexes for point queries changes substantially between in-memory and disk-based scenarios.
Disk-Based Database Analysis
In traditional disk-based systems, I/O latency dominates all other costs: a CPU key comparison takes on the order of nanoseconds, a main-memory access roughly 100 ns, a random SSD read roughly 100 µs, and an HDD seek roughly 10 ms (order-of-magnitude figures).

The performance ratio is roughly:
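A back-of-envelope calculation, with all latencies as assumed order-of-magnitude figures rather than measurements:

```python
CPU_COMPARE = 10e-9   # ~10 ns per key comparison, including cache misses
RAM_ACCESS  = 100e-9  # ~100 ns main-memory access
SSD_READ    = 100e-6  # ~100 microsecond random 8 KB read
HDD_SEEK    = 10e-3   # ~10 ms random seek

print(f"1 RAM access ~= {RAM_ACCESS / CPU_COMPARE:,.0f} comparisons")  # 10
print(f"1 SSD read   ~= {SSD_READ / CPU_COMPARE:,.0f} comparisons")    # 10,000
print(f"1 HDD seek   ~= {HDD_SEEK / CPU_COMPARE:,.0f} comparisons")    # 1,000,000
```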
Implication: whether a lookup does one or a few dozen CPU comparisons is irrelevant when a single disk I/O costs roughly 10,000x more. For disk-based systems, the only meaningful metric is I/O count, and as we've shown, hash and B+tree indexes are comparable there.
NUMA and Modern Memory Hierarchies
Modern servers with Non-Uniform Memory Access (NUMA) architectures add another dimension. Memory access latency varies based on which CPU socket accesses which memory region: local-node accesses are fastest, while remote (cross-socket) accesses are commonly on the order of 1.5-2x slower.
Hash indexes, with their random access patterns, are more likely to incur remote memory accesses. B+tree traversal, which follows predictable paths, can be optimized for local memory access. This NUMA effect can partially offset hash lookup advantages in large-memory systems.
In-memory databases like Redis, SAP HANA, and MemSQL explicitly offer hash indexes because the performance difference matters without disk I/O masking it. Even so, most choose B-trees as their default due to flexibility. SAP HANA, for instance, automatically selects index type based on detected query patterns.
We've conducted a thorough examination of point query performance for hash and tree indexes. The results may surprise those who expected hash indexes to dominate.
For point-query-only workloads on in-memory data with uniform key distribution and no need for composite prefix queries, hash indexes can provide meaningful performance benefits. For disk-based systems with varied workloads, B+trees are usually the better choice despite theoretical O(log n) complexity.
What's Next
Point queries are only part of the story. The next page examines range queries—the operation where hash indexes fall completely short and B+trees truly shine. Understanding this asymmetry is essential for making informed index selection decisions.
You now understand the nuanced reality of point query performance. Hash indexes have genuine advantages in specific scenarios, but their theoretical O(1) superiority rarely translates to dramatic practical improvements over well-tuned B+trees. Next, we'll see why range queries decisively favor tree-based indexing.