Point queries, lookups for a single specific key value, are the core operation where hash indexes have their theoretical advantage. When you execute a query like `SELECT * FROM users WHERE user_id = 12345`, you're performing a point query. The question is: how much does the O(1) advantage of hashing actually matter in practice?
This page dissects point query performance across both index types, moving beyond asymptotic complexity to examine real-world factors including bucket overflow, cache effects, disk I/O patterns, and concurrent access scenarios.
By the end of this page, you will understand: the mechanics of point queries in both hash and tree indexes, why theoretical O(1) doesn't always translate to practical dominance, conditions under which hash indexes genuinely outperform trees, and how to reason about point query performance in your specific context.
Understanding the precise mechanics of hash-based point queries reveals both their elegance and their limitations.
The Ideal Case: Direct Bucket Access
In the ideal scenario, a hash point query proceeds as follows:

1. Compute `hash(key) mod numBuckets` to locate the target bucket (CPU only, no I/O).
2. Read the bucket page from disk (1 I/O).
3. Scan the bucket's entries for the matching key, then fetch the row it points to (possibly 1 more I/O).

Total cost: 1-2 I/Os in the ideal case, independent of table size.
The beauty of this approach is its simplicity. No matter whether your table has 1,000 rows or 1 billion rows, the number of I/O operations remains constant. This is the O(1) property in action.
```
function hashPointQuery(key, hashTable):
    // Step 1: Compute bucket address
    bucketNumber = hash(key) mod hashTable.numBuckets

    // Step 2: Access primary bucket (1 disk I/O)
    bucket = readBucket(bucketNumber)

    // Step 3: Search within bucket
    for entry in bucket.entries:
        if entry.key == key:
            return entry.value  // Found!

    // Step 4: Handle overflow chains (if present)
    overflowPage = bucket.overflowPointer
    while overflowPage != null:
        // Additional disk I/O for each overflow page
        overflow = readPage(overflowPage)
        for entry in overflow.entries:
            if entry.key == key:
                return entry.value  // Found in overflow
        overflowPage = overflow.nextOverflow

    return NOT_FOUND  // Key doesn't exist
```

The pseudocode above reveals a critical issue: overflow chains break the O(1) guarantee. If a bucket has k overflow pages, a lookup requires 1 + k I/Os. Poor hash function design or unexpected data skew can cause chains to grow arbitrarily long, degrading performance toward O(n/B), where B is the bucket count.
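To make the degradation concrete, here is a minimal Python sketch, with the bucket count and page capacity as assumed values, that simulates uniform hashing into a fixed bucket array and counts how many overflow pages accumulate as the table outgrows it:

```python
import random
from collections import Counter

BUCKETS = 1_000    # assumed bucket count
CAPACITY = 64      # assumed entries per bucket page

def overflow_pages(num_keys):
    """Hash keys uniformly into buckets; return overflow pages per bucket."""
    load = Counter(random.randrange(BUCKETS) for _ in range(num_keys))
    # A bucket holding e entries needs ceil(e / CAPACITY) - 1 overflow pages
    return [max(0, -(-e // CAPACITY) - 1) for e in load.values()]

for n in (50_000, 100_000, 200_000):
    pages = overflow_pages(n)
    print(f"{n:>7} keys: avg overflow pages/bucket = "
          f"{sum(pages) / BUCKETS:.2f}, worst chain = {max(pages)}")
```

Even with a perfectly uniform hash, a static bucket array degrades as load grows, which is why production hash indexes use dynamic schemes (linear or extendible hashing) that split buckets as the table grows.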
B+tree point queries follow a well-defined traversal pattern from root to leaf. While asymptotically O(log n), the constants are remarkably favorable.
The Traversal Process

1. Start at the root node, which is almost always resident in memory.
2. At each internal node, binary search the keys to find the correct child pointer and descend one level; each descent may cost one disk I/O on a cold cache.
3. At the leaf node, binary search for the key and return the matching entry (possibly one more I/O to fetch the row).

Total cost: typically 1-2 I/Os for point queries on warm caches, up to h+1 I/Os on a cold cache (where h is the tree height).
| Table Size | Approximate Height | Maximum I/Os (Cold) | Typical I/Os (Warm) |
|---|---|---|---|
| 10,000 rows | 2 levels | 2 | 1 |
| 100,000 rows | 2-3 levels | 3 | 1 |
| 1,000,000 rows | 3 levels | 3 | 1 |
| 100,000,000 rows | 4 levels | 4 | 1-2 |
| 1,000,000,000 rows | 4 levels | 4 | 1-2 |
| 100,000,000,000 rows | 5 levels | 5 | 2-3 |
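The heights in this table follow directly from node fanout. A quick sanity check in Python, assuming roughly 200 child pointers per node (actual fanout varies with key and page size):

```python
import math

def btree_height(num_rows: int, fanout: int = 200) -> int:
    """Smallest height h such that fanout**h >= num_rows."""
    return max(1, math.ceil(math.log(num_rows, fanout)))

for rows in (10_000, 1_000_000, 100_000_000,
             1_000_000_000, 100_000_000_000):
    print(f"{rows:>15,} rows -> {btree_height(rows)} levels")
```

Because height grows with the logarithm base 200 rather than base 2, each extra level multiplies capacity by about 200, so even a hundredfold growth in rows adds at most one level.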
Why Tree Height Matters Less Than You Think
The table above reveals a crucial insight: even for billion-row tables, B+tree height is only 4-5 levels. Moreover, the root and upper internal levels are accessed by every query, making them prime candidates for caching.
In practice, a well-tuned database keeps the root and the top 1-2 internal levels permanently in memory. This means the effective I/O cost of B+tree point queries is remarkably close to that of hash indexes for practical table sizes.
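To see why keeping the upper levels resident is cheap, consider the memory footprint per level. A rough calculation, assuming 8 KB pages and a fanout of 200 (typical but assumed values):

```python
PAGE_SIZE = 8 * 1024   # assumed 8 KB pages
FANOUT = 200           # assumed pointers per internal node

# Pages at each upper level of the tree (root = level 0)
for level in range(3):
    pages = FANOUT ** level
    mib = pages * PAGE_SIZE / 2**20
    print(f"level {level}: {pages:>6,} pages = {mib:>7.1f} MiB")
# level 0:      1 pages =     0.0 MiB  (root)
# level 1:    200 pages =     1.6 MiB
# level 2: 40,000 pages =   312.5 MiB
```

Pinning the root and level 1 costs under 2 MiB; even level 2 fits comfortably in a modern buffer pool, leaving at most the leaf read (plus possibly a heap fetch) as actual I/O.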
```
function btreePointQuery(key, btree):
    // Start at root (almost always in memory)
    currentNode = btree.root

    // Traverse from root to leaf
    while not currentNode.isLeaf:
        // Binary search for correct child pointer
        childIndex = binarySearch(currentNode.keys, key)
        // Follow pointer to next level (may hit disk)
        currentNode = readNode(currentNode.children[childIndex])

    // At leaf node: binary search for key
    entryIndex = binarySearch(currentNode.keys, key)
    if currentNode.keys[entryIndex] == key:
        // Key found: return value/pointer
        return currentNode.values[entryIndex]
    else:
        return NOT_FOUND

// Memory access pattern analysis:
// - Root node: Always in buffer pool (hot)
// - Level 1: Usually in buffer pool (~200 pages)
// - Level 2: Likely in buffer pool for hot data (~40K pages)
// - Leaf level: Read from disk (cold)
```

The advertised O(1) vs O(log n) complexity difference often misleads developers. Let's examine the gap between theory and practice.
Theoretical Expectations
| Metric | Hash Index | B+Tree Index | Apparent Winner |
|---|---|---|---|
| Asymptotic Complexity | O(1) | O(log n) | Hash |
| Hash Computation | Yes | No | Tree |
| Key Comparisons | ~1 (bucket size-dependent) | log₂(entries/node) × height | Hash |
| Worst Case | O(n/B) with overflow | O(log n) guaranteed | Tree |
Practical Realities
Several factors conspire to narrow the theoretical gap:
1. CPU Costs Are Often Negligible
Modern CPUs perform billions of comparisons per second. Whether a lookup does one comparison or a few dozen, the difference is well under a microsecond, invisible next to disk or even memory latency. The comparison cost of B+tree search is nearly irrelevant for disk-based systems.
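A quick worked count, using the assumed fanout of 200 from earlier and a 4-level tree (roughly a billion rows):

```python
import math

ENTRIES_PER_NODE = 200   # assumed fanout
HEIGHT = 4               # ~1 billion rows, per the height table above

comparisons = math.ceil(math.log2(ENTRIES_PER_NODE)) * HEIGHT
print(comparisons)  # 8 comparisons per node * 4 levels = 32 in total
# At a few nanoseconds each, that is well under a microsecond of CPU
# work, dwarfed by even one ~100 microsecond random SSD read.
```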
2. Real I/O Costs Are Similar
As demonstrated, hash indexes need 1 + k I/Os, where k is the overflow chain length; B+trees need roughly (height - cached levels) I/Os. For billion-row tables with proper caching, both typically require 1-2 I/Os.
3. Hash Overhead Is Non-Zero
Computing a good hash function (especially for long string keys) can be computationally expensive. Cryptographic or high-quality hash functions may require more CPU time than tree traversal comparisons.
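A small illustrative microbenchmark, using SHA-256 as a stand-in for any high-quality hash (fast non-cryptographic hashes such as xxHash narrow the gap but still scale linearly with key length):

```python
import hashlib
import timeit

# Hashing cost grows with key length -- a cost that a B+tree's
# short-circuiting key comparisons largely avoid.
for size in (16, 256, 4_096, 65_536):
    key = b"k" * size
    t = timeit.timeit(lambda: hashlib.sha256(key).digest(), number=50_000)
    print(f"{size:>6}-byte key: {t / 50_000 * 1e9:>9,.0f} ns per hash")
```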
4. Cache Behavior Favors Trees
B+tree hot paths (root, upper internals) are accessed by all queries and stay cached. Hash buckets are accessed uniformly at random—cache hit rates are lower for diverse key access patterns.
It's tempting to assume O(1) < O(log n) means hash indexes are always faster for equality lookups. In reality, the constants hidden by big-O notation (hash calculation, bucket overflow, cache behavior) often make B+trees competitive or superior for real workloads.
The relative performance of hash vs tree indexes for point queries depends heavily on workload characteristics. Let's examine the factors that influence which approach wins.
Factors Favoring Hash Indexes
- Queries of the form `WHERE key = value` with no range or ordering requirements

Key Distribution Analysis
The performance of hash indexes depends critically on how keys distribute across buckets. Consider three scenarios:
Scenario A: Ideal Uniform Distribution. Keys spread evenly across buckets, every chain fits in a single page, and every lookup costs the ideal 1-2 I/Os.

Scenario B: Moderate Skew. Some buckets accumulate short overflow chains; average lookups pay an extra I/O or two, but worst cases stay bounded.

Scenario C: Severe Skew (Hot Spot). A few buckets absorb most of the keys, overflow chains grow long, and lookups on hot keys degrade toward linear scans of the chain.
Real-world data often has skewed distributions that only reveal themselves in production. User IDs might be sequential, email domains cluster, timestamps have patterns. Hash indexes that perform well in testing may degrade mysteriously when exposed to production data distributions.
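The following Python sketch illustrates the failure mode with assumed values: a weak hash (Python's hash() is the identity for small integers, standing in for a poorly chosen hash function) combined with clustered keys funnels everything into one bucket, while the same bucket count handles random keys gracefully:

```python
import random
from collections import Counter

NUM_BUCKETS = 1024

def max_chain(keys):
    """Longest bucket chain after hashing keys into NUM_BUCKETS buckets."""
    counts = Counter(hash(k) % NUM_BUCKETS for k in keys)
    return max(counts.values())

random_keys = [random.getrandbits(64) for _ in range(100_000)]
# IDs allocated in strides of 1024 (e.g., sharded sequences): with an
# identity-style hash, every key lands in bucket 0.
clustered_keys = [i * NUM_BUCKETS for i in range(100_000)]

print("uniform keys :", max_chain(random_keys))     # ~130, short chains
print("clustered IDs:", max_chain(clustered_keys))  # 100,000, one chain
```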
Theoretical analysis provides intuition, but empirical benchmarking reveals ground truth. Here we examine methodologies for comparing hash and tree index point query performance.
Critical Benchmarking Considerations
- Cache State Control: results vary dramatically between cold and warm caches, so test both.
- Data Volume Variation: test across multiple table sizes (10K, 100K, 1M, 100M rows).
- Key Distribution: test with uniform, skewed, and real-world key distributions.
- Concurrent Access: single-threaded benchmarks don't predict multi-user performance.
- Include Write Operations: read-only benchmarks ignore index maintenance overhead.
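As a starting point, here is a minimal single-threaded harness in Python against the `point_query_test` table from the PostgreSQL example later in this section (psycopg2 and the connection string are assumptions; extend it with threads and mixed reads/writes per the checklist above):

```python
import random
import time

import psycopg2  # assumed driver; any DB-API connection works the same way

conn = psycopg2.connect("dbname=test")
cur = conn.cursor()

def bench_point_queries(n=10_000, max_id=10_000_000):
    """Run n random point queries; return mean latency in microseconds."""
    ids = [random.randint(1, max_id) for _ in range(n)]
    start = time.perf_counter()
    for i in ids:
        cur.execute("SELECT data FROM point_query_test WHERE id = %s", (i,))
        cur.fetchone()
    return (time.perf_counter() - start) / n * 1e6

# Run once cold (right after a server restart), then again warm
print(f"mean point query latency: {bench_point_queries():.1f} us")
```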
Representative point query latencies, in microseconds:

| Table Size (rows) | Hash (Cold) | Hash (Warm) | B+Tree (Cold) | B+Tree (Warm) |
|---|---|---|---|---|
| 10,000 | 120 | 2 | 180 | 8 |
| 100,000 | 125 | 2 | 240 | 8 |
| 1,000,000 | 135 | 2 | 310 | 10 |
| 10,000,000 | 280 | 3 | 380 | 12 |
| 100,000,000 | 450 | 4 | 440 | 15 |
| 1,000,000,000 | 3200 | 8 | 520 | 25 |
These representative numbers illustrate several patterns: (1) Warm cache performance is remarkably similar for both; (2) Hash cold performance degrades at very large scales due to overflow; (3) B+tree cold performance is more predictable; (4) Both achieve microsecond latencies—differences rarely matter at the application level.
PostgreSQL Benchmark Example
PostgreSQL provides both B-tree and hash indexes, allowing direct comparison. Here's a methodology for your own testing:
```sql
-- Create test table with significant data volume
CREATE TABLE point_query_test (
    id BIGINT PRIMARY KEY,
    data TEXT,
    created_at TIMESTAMP
);

-- Insert 10 million rows
INSERT INTO point_query_test
SELECT generate_series,
       md5(random()::text),
       NOW() - (random() * INTERVAL '365 days')
FROM generate_series(1, 10000000);

-- Create B-tree index (for comparison; the primary key already has one)
CREATE INDEX btree_idx ON point_query_test USING btree (id);

-- Create hash index on the same column
CREATE INDEX hash_idx ON point_query_test USING hash (id);

-- Warm the cache (requires the pg_prewarm extension); for cold-cache
-- runs, restart the server and drop the OS page cache instead
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('point_query_test');

-- Benchmark with EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS, TIMING)
SELECT * FROM point_query_test WHERE id = 5000000;

-- Force index usage if the planner chooses a sequential scan
SET enable_seqscan = off;
```

Real applications often query on multiple keys simultaneously. The behavior of hash and tree indexes differs significantly for these cases.
Single-Column Indexes on Multiple Predicates
Consider the query `WHERE column_a = 10 AND column_b = 20`.

With separate indexes on `column_a` and `column_b`, the optimizer typically uses the more selective index (or combines both via a bitmap AND) and applies the remaining predicate as a filter. Both hash and B+tree indexes can serve these individual equality lookups, so neither type has a decisive edge here.
Composite Indexes
For queries on multiple columns, composite indexes provide significant advantages.
Composite Hash Index on (column_a, column_b):
- `WHERE a = 10 AND b = 20` → O(1) lookup
- `WHERE a = 10` → Cannot use this index!

Composite B+Tree Index on (column_a, column_b):
- `WHERE a = 10 AND b = 20` → O(log n) lookup
- `WHERE a = 10` → O(log n + k) range scan
- `WHERE a = 10 AND b > 15` → O(log n + k) bounded range

| Query Pattern | Composite Hash | Composite B+Tree |
|---|---|---|
| `WHERE a = 10 AND b = 20` | ✓ O(1) | ✓ O(log n) |
| `WHERE a = 10` | ✗ Full scan | ✓ O(log n + k) |
| `WHERE b = 20` | ✗ Full scan | ✗ Full scan |
| `WHERE a = 10 AND b > 15` | ✗ Full scan | ✓ O(log n + k) |
| `WHERE a > 5 AND a < 15` | ✗ Full scan | ✓ O(log n + k) |
| `WHERE a = 10 OR b = 20` | ✗ Full scan | ✗ Needs 2 indexes |
B+tree composite indexes excel because they support leftmost prefix queries. A composite index on (a, b, c) effectively gives you three indexes: (a), (a, b), and (a, b, c). Hash composite indexes only work when ALL columns are specified with equality—far less flexible.
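The leftmost-prefix behavior falls out of sorted ordering, which a short Python sketch can model: treat a composite B+tree index as a sorted list of (a, b) tuples searched with binary search. This is an illustrative model, not a real index implementation:

```python
from bisect import bisect_left, bisect_right

# Model a composite B+tree index as a sorted list of (a, b) keys
index = sorted((a, b) for a in range(100) for b in range(100))

# WHERE a = 10 AND b = 20 -> point lookup, O(log n)
i = bisect_left(index, (10, 20))
assert index[i] == (10, 20)

# WHERE a = 10 -> leftmost-prefix scan: a contiguous block of a = 10 keys
lo, hi = bisect_left(index, (10,)), bisect_left(index, (11,))
assert len(index[lo:hi]) == 100            # O(log n + k)

# WHERE a = 10 AND b > 15 -> bounded range inside the prefix
lo = bisect_right(index, (10, 15))
assert len(index[lo:hi]) == 84             # O(log n + k)

# WHERE b = 20 -> b values are scattered; only a full scan finds them
matches = [key for key in index if key[1] == 20]  # O(n)
```

A hashed composite key has no such ordering: without both columns, there is no bucket to compute, which is exactly why the hash column of the table above shows full scans.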
The relative advantage of hash indexes for point queries changes substantially between in-memory and disk-based scenarios.
Disk-Based Database Analysis
In traditional disk-based systems, I/O latency dominates all other costs: a CPU key comparison takes on the order of nanoseconds, a main-memory access roughly 100 ns, a random SSD read roughly 100 µs, and an HDD seek roughly 10 ms (order-of-magnitude figures).

The performance ratio is roughly:
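A back-of-envelope calculation, with all latencies as assumed order-of-magnitude figures rather than measurements:

```python
CPU_COMPARE = 10e-9   # ~10 ns per key comparison, including cache misses
RAM_ACCESS  = 100e-9  # ~100 ns main-memory access
SSD_READ    = 100e-6  # ~100 microsecond random 8 KB read
HDD_SEEK    = 10e-3   # ~10 ms random seek

print(f"1 RAM access ~= {RAM_ACCESS / CPU_COMPARE:,.0f} comparisons")  # 10
print(f"1 SSD read   ~= {SSD_READ / CPU_COMPARE:,.0f} comparisons")    # 10,000
print(f"1 HDD seek   ~= {HDD_SEEK / CPU_COMPARE:,.0f} comparisons")    # 1,000,000
```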
Implication: whether a lookup does one or a few dozen CPU comparisons is irrelevant when a single disk I/O costs roughly 10,000x more. For disk-based systems, the only meaningful metric is I/O count, and as we've shown, hash and B+tree indexes are comparable there.
NUMA and Modern Memory Hierarchies
Modern servers with Non-Uniform Memory Access (NUMA) architectures add another dimension. Memory access latency varies based on which CPU socket accesses which memory region: local-node accesses are fastest, while remote (cross-socket) accesses are commonly on the order of 1.5-2x slower.
Hash indexes, with their random access patterns, are more likely to incur remote memory accesses. B+tree traversal, which follows predictable paths, can be optimized for local memory access. This NUMA effect can partially offset hash lookup advantages in large-memory systems.
In-memory databases like Redis, SAP HANA, and MemSQL explicitly offer hash indexes because the performance difference matters without disk I/O masking it. Even so, most choose B-trees as their default due to flexibility. SAP HANA, for instance, automatically selects index type based on detected query patterns.
We've conducted a thorough examination of point query performance for hash and tree indexes. The results may surprise those who expected hash indexes to dominate.
For point-query-only workloads on in-memory data with uniform key distribution and no need for composite prefix queries, hash indexes can provide meaningful performance benefits. For disk-based systems with varied workloads, B+trees are usually the better choice despite theoretical O(log n) complexity.
What's Next
Point queries are only part of the story. The next page examines range queries—the operation where hash indexes fall completely short and B+trees truly shine. Understanding this asymmetry is essential for making informed index selection decisions.
You now understand the nuanced reality of point query performance. Hash indexes have genuine advantages in specific scenarios, but their theoretical O(1) superiority rarely translates to dramatic practical improvements over well-tuned B+trees. Next, we'll see why range queries decisively favor tree-based indexing.