Every index structure has its raison d'être—the query pattern it was born to answer. For B+-trees, it's ordered traversal. For R-trees, it's spatial containment. For hash indexes, it's the point query: finding a specific record by exact key match.
```sql
SELECT * FROM employees WHERE employee_id = 12345;
```
This single query encapsulates why hash indexes exist. A B+-tree answers it by navigating through 3-4 tree levels. A table scan reads potentially millions of rows. A hash index? It computes the answer location in one step, reads one page, and returns the result.
O(1) — not O(log n), not O(n). Constant time, regardless of data size.
By the end of this page, you will understand exactly how hash indexes process point queries, when query optimizers choose hash indexes, practical scenarios where this matters most, and how to benchmark hash vs. tree performance for your specific workloads.
A point query (also called an equality query or exact-match query) searches for records matching a specific value. The defining characteristic: the search condition uses equality (=) on the indexed column(s).
Point query patterns:
```sql
-- Single-column equality
SELECT * FROM users WHERE user_id = 42;
SELECT * FROM products WHERE sku = 'PRD-12345';
SELECT * FROM sessions WHERE token = 'abc123xyz';

-- IN clause (multiple point queries)
SELECT * FROM orders WHERE order_id IN (100, 101, 102);

-- Composite key equality
SELECT * FROM inventory WHERE warehouse_id = 5 AND product_id = 999;
```
What is NOT a point query:
```sql
-- Range condition (NOT a point query)
SELECT * FROM orders WHERE order_date > '2024-01-01';

-- Pattern matching (NOT a point query)
SELECT * FROM users WHERE email LIKE '%@gmail.com';

-- Inequality (NOT a point query)
SELECT * FROM products WHERE price != 0;

-- Partial composite match (NOT a point query for hash)
SELECT * FROM inventory WHERE warehouse_id = 5;  -- Missing product_id
```
An IN clause with discrete values is semantically multiple point queries. Hash indexes handle each value separately, potentially outperforming B+-trees, which must navigate the tree for each value. For IN (v1, v2, v3), a hash index does three independent O(1) lookups, which is still O(1) overall.
Why equality is special for hashing:
The hash function produces a single bucket number for each key value. This mapping is:
- Deterministic: the same key always hashes to the same bucket, so one computation finds the one place the key can live.
- Order-destroying: adjacent key values scatter to unrelated buckets.
- Isolated: each lookup is independent of every other key in the index.
This isolation property is both a strength (enables O(1) lookup) and a limitation (prevents range queries). For point queries, it's pure advantage.
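A minimal sketch of this mapping (the bucket count `M` and the SHA-256-based bucket function are illustrative choices, not how any particular engine computes it):

```python
import hashlib

M = 64  # number of buckets (illustrative)

def bucket_for(key: int) -> int:
    """Deterministic: the same key always lands in the same bucket."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % M

# Adjacent keys scatter to unrelated buckets: the order-destroying
# property that rules out range queries but isolates each lookup.
for key in range(12340, 12346):
    print(key, "->", bucket_for(key))
```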
Let's trace a point query through the complete execution path, from SQL to disk to result.
Query: SELECT * FROM employees WHERE employee_id = 12345
Execution steps:
| Step | Component | Action | I/O |
|---|---|---|---|
| 1 | Parser | Validate SQL syntax, build parse tree | None |
| 2 | Optimizer | Identify hash index on employee_id, generate plan | None |
| 3 | Executor | Begin index scan operation | None |
| 4 | Hash Index | Compute: bucket = h(12345) mod M = 47 | None |
| 5 | Buffer Manager | Check if bucket directory is cached | None (hit) |
| 6 | Buffer Manager | directory[47] → page 2089; check buffer cache | 0 or 1 |
| 7 | Hash Index | Scan bucket page for key 12345, find RID | None |
| 8 | Buffer Manager | Fetch data page for RID | 0 or 1 |
| 9 | Executor | Project columns, return result row | None |
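The steps above can be mirrored in a few lines of Python. This is a toy in-memory model of steps 4-8; the `directory`, `bucket_pages`, and `heap` dictionaries are illustrative stand-ins for the on-disk directory, index pages, and table heap:

```python
M = 100

directory = {b: 2000 + b for b in range(M)}             # bucket -> index page id
bucket_pages = {pid: {} for pid in directory.values()}  # index page -> {key: RID}
heap = {}                                               # data page -> {slot: row}

def h(key: int) -> int:
    return key % M          # stand-in hash; real systems hash the key bytes

def insert(key: int, row: dict):
    page = f"page_{500 + len(heap)}"                    # fake data page id
    heap.setdefault(page, {})[0] = row
    bucket_pages[directory[h(key)]][key] = (page, 0)    # RID = (page, slot)

def point_query(key: int):
    page_id = directory[h(key)]             # steps 4-6: compute bucket, one hop
    rid = bucket_pages[page_id].get(key)    # step 7: scan bucket page for key
    if rid is None:
        return None                         # absent keys are also O(1)
    page, slot = rid
    return heap[page][slot]                 # step 8: fetch data page via RID

insert(12345, {"employee_id": 12345, "name": "Ada"})
print(point_query(12345))   # {'employee_id': 12345, 'name': 'Ada'}
```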
I/O analysis: in the worst case the lookup touches two pages, the bucket page and the data page, so the cost is 0-2 random reads depending on what the buffer cache already holds (steps 6 and 8 above).

Compare to a B+-tree: the same query traverses 3-4 index pages (root, internal nodes, leaf) before fetching the data page, for 4-5 page accesses in the worst case.

The hash index advantage is most pronounced when:
- the table is large, since deeper trees mean more index pages per lookup;
- the cache is cold or access is random, so B+-tree internal nodes are not resident;
- the workload is dominated by single-key lookups rather than ranges.
```sql
-- Create test table
-- (employee_id deliberately has no PRIMARY KEY: a PK would add an
--  implicit B-tree index that competes with the ones we create below)
CREATE TABLE employees (
    employee_id SERIAL,
    name VARCHAR(100),
    department VARCHAR(50),
    salary NUMERIC(10,2)
);

-- Generate test data (1 million rows)
INSERT INTO employees (name, department, salary)
SELECT
    'Employee_' || generate_series,
    CASE (generate_series % 5)
        WHEN 0 THEN 'Engineering'
        WHEN 1 THEN 'Sales'
        WHEN 2 THEN 'Marketing'
        WHEN 3 THEN 'HR'
        ELSE 'Finance'
    END,
    30000 + (random() * 70000)::numeric(10,2)
FROM generate_series(1, 1000000);

-- Create hash index on employee_id
CREATE INDEX idx_emp_hash ON employees USING HASH (employee_id);

-- PostgreSQL has no per-index hints, so to measure the hash index
-- we simply don't create the B-tree index yet
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM employees WHERE employee_id = 500000;

-- Sample output:
-- Index Scan using idx_emp_hash on employees
--   Index Cond: (employee_id = 500000)
--   Buffers: shared hit=2            <-- Just 2 buffer accesses!
-- Planning Time: 0.082 ms
-- Execution Time: 0.031 ms

-- Compare with B-tree
CREATE INDEX idx_emp_btree ON employees USING BTREE (employee_id);
DROP INDEX idx_emp_hash;

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM employees WHERE employee_id = 500000;

-- Sample output:
-- Index Scan using idx_emp_btree on employees
--   Index Cond: (employee_id = 500000)
--   Buffers: shared hit=4            <-- 4 buffer accesses (tree depth)
-- Planning Time: 0.075 ms
-- Execution Time: 0.033 ms
```

In practice, frequently accessed B+-tree root and internal nodes stay in the buffer cache. For hot data, the I/O difference shrinks. Hash indexes shine brightest with cold or random access patterns, where B+-tree internal nodes may be evicted.
The query optimizer decides whether to use a hash index for a given query. This decision involves cost comparison against alternatives.
Optimizer considerations:
- Is the predicate pure equality on every indexed column? Hash indexes support nothing else.
- What is the estimated cost of the hash probe versus a B+-tree descent or a scan?
- How trustworthy are the statistics behind each estimate?

Cost model for hash index lookup:
```
cost_hash = C_index_read + E[overflow_pages] × C_page_read + C_data_fetch

where:
  C_index_read      = 1.0 (one bucket page)
  E[overflow_pages] = average overflow chain length (often < 0.1)
  C_page_read       = sequential/random read cost
  C_data_fetch      = cost to fetch actual data row(s)
```
Cost model for B+-tree lookup:
```
cost_btree = tree_height × C_page_read + C_data_fetch

where:
  tree_height = ⌈log_fanout(n)⌉, typically 3-4
  fanout      = entries_per_page (typically 100-500)
```
Example comparison:
| Factor | Hash Index | B+-Tree |
|---|---|---|
| Table size | 10M rows | 10M rows |
| Tree fanout: 200 | - | height = log₂₀₀(10M) ≈ 3 |
| Avg overflow: 0.1 pages | 1.1 pages | - |
| Index pages read | 1.1 | 3 |
| Advantage | ~3x fewer reads | Supports ranges |
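A quick sketch reproducing the table's arithmetic from the two cost formulas (page-read and data-fetch costs are normalized to 1.0 here; real optimizers use tuned cost constants):

```python
import math

n = 10_000_000          # table rows
fanout = 200            # B+-tree entries per page
E_overflow = 0.1        # expected overflow pages per hash bucket
C_page_read = 1.0
C_data_fetch = 1.0

# cost_hash = C_index_read + E[overflow_pages] * C_page_read + C_data_fetch
cost_hash = 1.0 + E_overflow * C_page_read + C_data_fetch

# log_200(10M) ~ 3.04, so height ~ 3 as in the table (a strict ceiling
# would give 4; real heights depend on actual page fill)
tree_height = round(math.log(n, fanout))
cost_btree = tree_height * C_page_read + C_data_fetch

print(f"hash index: {cost_hash:.1f}")   # 2.1 (1.1 index pages + 1 data page)
print(f"b+tree:     {cost_btree:.1f}")  # 4.0 (3 index pages + 1 data page)
```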
Despite hash index advantages for point queries, many optimizers default to B+-tree. Reasons: B+-tree cost models are well-understood, B+-trees handle edge cases (NULL, range fallback), hash index statistics may be less reliable, and the optimizer is conservative about newer/less-common index types.
```sql
-- PostgreSQL: Examine optimizer choice
EXPLAIN (VERBOSE)
SELECT * FROM employees WHERE employee_id = 12345;

-- If the optimizer ignores the hash index, you can investigate:

-- 1. Check that the hash index exists
\d employees

-- 2. Check cost settings
SHOW random_page_cost;
SHOW seq_page_cost;

-- 3. Disable competing scan methods temporarily
SET enable_seqscan = off;
SET enable_bitmapscan = off;

-- 4. Re-examine the plan
EXPLAIN SELECT * FROM employees WHERE employee_id = 12345;
-- Should now use an index scan; note that enable_indexscan governs
-- hash and B-tree scans alike, so to isolate the hash index you must
-- drop the competing B-tree index rather than toggle a setting

-- PostgreSQL doesn't have direct index hints, but you can influence
-- the choice with CTEs, subqueries, or by disabling other scan types
```

Hash indexes deliver maximum value in specific workload patterns. Understanding these patterns helps identify opportunities for optimization.
Scenario analysis:
| Scenario | Query Pattern | Why Hash Excels | Real Example |
|---|---|---|---|
| User authentication | SELECT ... WHERE email = ? | Single lookup at login; latency critical | Login page: 0 tolerance for slow auth |
| Session management | SELECT ... WHERE session_token = ? | Random token lookups; no ordering needed | Every API request validates session |
| Primary key lookup | SELECT ... WHERE id = ? | Classic point query; often hotspot | User profile, order details |
| Foreign key validation | SELECT 1 WHERE fk_col = ? | Existence check; result doesn't matter | Insert validation: does parent exist? |
| Deduplication check | SELECT COUNT(*) WHERE hash = ? | Before insert: is this a duplicate? | Content-addressable storage |
| Cache population | SELECT ... WHERE key = ? | Cache miss triggers DB lookup | Redis/Memcached miss handling |
Case study: session store optimization

Consider a web application in which every API request validates a session token with an exact-match lookup, the pattern from the table above. Session tokens are random strings, so there is no meaningful ordering to exploit.

B+-tree performance: each validation descends 3-4 index levels, costing 3-4 index page reads before the data page.

Hash index performance: each validation reads roughly one bucket page (about 1.1 including average overflow) before the data page.

This translates directly to roughly 3x less index I/O per request, lower per-request latency, and less buffer cache pressure at high request volume.
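A back-of-envelope calculation of what that saves (the daily request volume is an assumed figure for illustration only; the per-lookup page counts come from the comparison above):

```python
requests_per_day = 50_000_000   # assumed session-validation volume
btree_index_pages = 3           # index pages per lookup (tree depth 3)
hash_index_pages = 1.1          # bucket page plus average overflow

saved = requests_per_day * (btree_index_pages - hash_index_pages)
print(f"{saved:,.0f} fewer index page reads per day")   # 95,000,000
```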
Examine your slow query log for frequently executed point queries. Look for patterns: WHERE id = ?, WHERE token = ?, WHERE email = ?. If these queries dominate your workload and use B+-tree indexes, hash indexes may dramatically improve performance.
The SQL IN clause deserves special attention. It represents multiple point queries bundled into one SQL statement, and hash indexes can process these exceptionally efficiently.
IN clause query:
```sql
SELECT * FROM orders
WHERE order_id IN (1001, 1002, 1003, 1004, 1005);
```
Execution strategies:
Strategy 1: Index nested loop (default). For each value in the IN list, probe the index once and fetch the matching row.

With a hash index: 5 values × 1 I/O each = 5 I/Os. With a B+-tree: 5 values × 3 I/Os each = 15 I/Os (a full tree traversal for each).
Strategy 2: Index range scan (a B+-tree advantage). If the IN values are consecutive or near-consecutive, a B+-tree can descend once to the smallest value and then walk the linked leaf pages to collect the rest.

This is where a B+-tree can win, but only for sorted, consecutive values.
Strategy 3: Bitmap index scan. Build a bitmap of matching row locations from the index, then fetch the rows in physical page order. This reduces random I/O for many-value IN clauses.
| IN Clause Size | Hash Index I/Os | B+-Tree I/Os | Winner |
|---|---|---|---|
| 5 values | 5 | 15 | Hash (3x better) |
| 10 values | 10 | 30 | Hash (3x better) |
| 100 values | 100 | 300 | Hash (3x better) |
| 1000 values (random) | 1000 | 3000 | Hash (3x better) |
| 1000 values (sequential) | 1000 | ~10 (range scan) | B+-Tree (wins) |
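The table's I/O counts fall out of a simple per-probe model, sketched here (the 200-entries-per-leaf figure reuses the fanout assumption from the cost model above):

```python
def in_clause_ios(k: int, sequential: bool, depth: int = 3,
                  leaf_entries: int = 200) -> dict:
    """Estimate index-page I/Os for an IN clause of k values."""
    hash_ios = k                          # one bucket page per value
    if sequential:
        # B+-tree range scan: one descent, then consecutive leaf pages
        btree_ios = depth + max(1, k // leaf_entries)
    else:
        btree_ios = k * depth             # full descent per random value
    return {"hash": hash_ios, "b+tree": btree_ios}

print(in_clause_ios(5, sequential=False))    # {'hash': 5, 'b+tree': 15}
print(in_clause_ios(1000, sequential=True))  # b+tree ~8, the table's "~10"
```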
```sql
-- Test IN clause with hash vs B-tree

-- Setup: create table
-- (order_id deliberately has no PRIMARY KEY: a PK would add an
--  implicit B-tree index alongside the ones we create below)
CREATE TABLE orders (
    order_id SERIAL,
    customer_id INT,
    order_date DATE,
    total NUMERIC(10,2)
);

INSERT INTO orders (customer_id, order_date, total)
SELECT
    (random() * 10000)::int,
    '2024-01-01'::date + (random() * 365)::int,
    (random() * 1000)::numeric(10,2)
FROM generate_series(1, 1000000);

-- Create indexes
CREATE INDEX idx_orders_hash ON orders USING HASH (order_id);
CREATE INDEX idx_orders_btree ON orders USING BTREE (order_id);

-- Random IN clause (hash index should win)
-- Generate 100 random order IDs
EXPLAIN (ANALYZE, BUFFERS)
WITH random_ids AS (
    SELECT (random() * 1000000)::int AS id
    FROM generate_series(1, 100)
)
SELECT * FROM orders
WHERE order_id IN (SELECT id FROM random_ids);

-- Sequential IN clause (B-tree might win with a range scan)
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE order_id IN (
    SELECT generate_series(50000, 50099)   -- 100 consecutive IDs
);

-- To attribute the cost to one index type, drop the other and re-run
DROP INDEX idx_orders_btree;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE order_id IN (SELECT generate_series(50000, 50099));

-- Note the buffer counts in each plan:
-- Random values: hash should show fewer buffers
-- Sequential values: a B-tree range scan may be more efficient
```

Modern optimizers may rewrite IN clauses as JOINs, EXISTS, or other forms. The rewritten form may change index selection. Use EXPLAIN to see the actual execution plan, not just the logical query.
For many applications, latency—not just throughput—is the critical metric. Hash indexes offer predictable, low latency for point queries.
Latency components:
```
total_latency = parse_time + plan_time + index_time + data_fetch_time
```
For point queries:
```
parse_time      ≈ constant (simple query)
plan_time       ≈ constant (obvious plan)
index_time      = f(index_type)
data_fetch_time ≈ constant (single row)
```
Index time comparison (typical SSDs):
| Component | Hash Index | B+-Tree (depth 4) | Table Scan (1M rows) |
|---|---|---|---|
| Hash/navigation computation | ~10 μs | N/A | N/A |
| Index I/O (random read) | ~100 μs × 1 = 100 μs | ~100 μs × 4 = 400 μs | N/A |
| Data page I/O | ~100 μs | ~100 μs | ~100 μs × many |
| Total (cold cache) | ~200 μs | ~500 μs | seconds |
| Total (warm cache) | ~20 μs | ~50 μs | milliseconds |
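A small calculator that reproduces the table's totals (the 100 μs random-read and 10 μs hash-computation figures are the rough SSD assumptions above, not measurements):

```python
RANDOM_READ_US = 100.0   # rough SSD random read, from the table above
HASH_COMPUTE_US = 10.0

def point_latency_us(index_pages: int, compute_us: float = 0.0,
                     cache_hit_rate: float = 0.0) -> float:
    """Expected time for index pages plus one data page, given a hit rate."""
    miss = 1.0 - cache_hit_rate
    return compute_us + (index_pages + 1) * miss * RANDOM_READ_US

print(point_latency_us(1, HASH_COMPUTE_US))        # hash, cold:   ~210 us
print(point_latency_us(4))                         # b+tree, cold: ~500 us
print(point_latency_us(1, HASH_COMPUTE_US, 0.9))   # hash, 90% warm: ~30 us
```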
Latency distribution: hash index latency is not only lower on average but also more predictable than B+-tree latency.

Hash index: nearly every lookup touches exactly one bucket page; the occasional outlier comes from walking an overflow chain.

B+-tree: latency depends on how many of the 3-4 levels happen to be cached; a lookup that misses at every level pays several random reads, widening the tail of the distribution.
Why predictability matters: for latency-sensitive applications (real-time systems, SLA-bound services), you provision for the tail of the distribution, not the mean. A tighter distribution means less over-provisioning and fewer violations of percentile-based SLAs.
Example SLA analysis: suppose your SLA requires 99th-percentile latency under 50 ms. Using the cold-cache figures above, a hash lookup at roughly 0.2 ms and a B+-tree lookup at roughly 0.5 ms are both far inside the budget. Both indexes meet the SLA, but hash provides more headroom.
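A tiny sanity check of that headroom, using the cold-cache totals from the latency table (illustrative numbers, not measurements):

```python
SLA_P99_MS = 50.0
cold_p99_ms = {"hash": 0.2, "b+tree": 0.5}   # cold-cache totals from above

for index, p99 in cold_p99_ms.items():
    print(f"{index}: P99 {p99} ms -> {SLA_P99_MS / p99:.0f}x headroom")
# hash: 250x headroom, b+tree: 100x; both pass, hash leaves 2.5x more
```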
Don't rely on execution plan cost estimates for latency. Use EXPLAIN ANALYZE with multiple runs, measure at different cache states, and capture percentile distributions. Tools like pg_stat_statements (PostgreSQL) or Query Store (SQL Server) help track query latency over time.
Theoretical analysis only goes so far. Proper benchmarking reveals the real-world performance difference for your specific workload.
Benchmarking methodology (the script below follows these steps):
- Build the table at realistic scale (1 million rows here).
- Create only the index under test, so the planner cannot silently pick another.
- Warm up the cache before timing, so you measure steady state.
- Issue thousands of randomized point queries and record per-query latency.
- Report percentile latencies (P95, P99) and throughput, not just the mean.
"""Database Index Benchmark ScriptCompares hash vs B-tree index performance for point queries.""" import psycopg2import timeimport randomimport statistics # ConfigurationDB_CONFIG = { 'host': 'localhost', 'database': 'benchmark_db', 'user': 'postgres', 'password': 'postgres'}TABLE_SIZE = 1_000_000NUM_QUERIES = 10_000WARMUP_QUERIES = 1_000 def setup_database(conn): """Create test table with both index types.""" with conn.cursor() as cur: # Drop and recreate cur.execute("DROP TABLE IF EXISTS benchmark_table") cur.execute(""" CREATE TABLE benchmark_table ( id SERIAL PRIMARY KEY, data VARCHAR(100), value NUMERIC(10,2) ) """) # Insert test data cur.execute(f""" INSERT INTO benchmark_table (data, value) SELECT 'data_' || generate_series, random() * 1000 FROM generate_series(1, {TABLE_SIZE}) """) # Create indexes cur.execute("CREATE INDEX idx_hash ON benchmark_table USING HASH (id)") cur.execute("CREATE INDEX idx_btree ON benchmark_table USING BTREE (id)") conn.commit() print(f"Created table with {TABLE_SIZE:,} rows") def run_benchmark(conn, index_type: str, num_queries: int) -> dict: """Run point query benchmark with specified index.""" # Configure PostgreSQL to use specific index with conn.cursor() as cur: if index_type == 'hash': cur.execute("SET enable_indexscan = off") # Disable btree cur.execute("SET enable_bitmapscan = off") else: cur.execute("SET enable_hashagg = off") # Not directly applicable # Default prefers btree cur.execute("SET enable_seqscan = off") # Force index usage # Generate random query keys query_keys = [random.randint(1, TABLE_SIZE) for _ in range(num_queries)] # Warmup with conn.cursor() as cur: for key in query_keys[:WARMUP_QUERIES]: cur.execute("SELECT * FROM benchmark_table WHERE id = %s", (key,)) cur.fetchone() # Timed benchmark latencies = [] with conn.cursor() as cur: for key in query_keys: start = time.perf_counter() cur.execute("SELECT * FROM benchmark_table WHERE id = %s", (key,)) cur.fetchone() end = time.perf_counter() latencies.append((end - start) * 1000) # Convert to ms # Reset settings with conn.cursor() as cur: cur.execute("RESET ALL") # Calculate statistics return { 'index_type': index_type, 'num_queries': num_queries, 'mean_ms': statistics.mean(latencies), 'median_ms': statistics.median(latencies), 'p95_ms': sorted(latencies)[int(num_queries * 0.95)], 'p99_ms': sorted(latencies)[int(num_queries * 0.99)], 'std_dev_ms': statistics.stdev(latencies), 'min_ms': min(latencies), 'max_ms': max(latencies), 'qps': num_queries / sum(latencies) * 1000, # Queries per second } def main(): conn = psycopg2.connect(**DB_CONFIG) print("Setting up database...") setup_database(conn) print("\nRunning hash index benchmark...") hash_results = run_benchmark(conn, 'hash', NUM_QUERIES) print("Running B-tree index benchmark...") btree_results = run_benchmark(conn, 'btree', NUM_QUERIES) # Report results print("\n" + "=" * 60) print("BENCHMARK RESULTS") print("=" * 60) for results in [hash_results, btree_results]: print(f"\n{results['index_type'].upper()} INDEX:") print(f" Queries executed: {results['num_queries']:,}") print(f" Mean latency: {results['mean_ms']:.3f} ms") print(f" Median latency: {results['median_ms']:.3f} ms") print(f" P95 latency: {results['p95_ms']:.3f} ms") print(f" P99 latency: {results['p99_ms']:.3f} ms") print(f" Throughput: {results['qps']:.0f} queries/sec") # Comparison speedup = btree_results['mean_ms'] / hash_results['mean_ms'] print(f"\nHash index speedup: {speedup:.2f}x faster mean latency") conn.close() if __name__ == "__main__": main()Benchmarks are 
system-specific. Your results will vary based on: disk type (SSD vs HDD), buffer pool size, concurrent load, data distribution, and key size. Always benchmark on hardware similar to production.
Point queries are where hash indexes prove their worth. The decision to use a hash index should be based on workload analysis and measured performance.
| Condition | Recommendation |
|---|---|
| All queries are equality lookups | Hash index is excellent choice |
| 99% equality, 1% range | Evaluate: hash + separate btree, or just btree |
| High query volume (>10K/sec) | Hash advantage compounds; strongly consider |
| Latency-critical (sub-ms) | Hash helps; but cache is more important |
| Mixed read/write workload | Hash still good; monitor overflow chains |
| Data grows unpredictably | Consider dynamic hashing or stick with btree |
What's next:
Having explored point queries—the strength of hash indexing—we must now confront its fundamental weakness. The next page examines the limitations of hash indexes: the query patterns they cannot support, the operational challenges they introduce, and the scenarios where B+-trees remain the better choice despite their higher I/O cost for point queries.
You now understand point queries in depth: their definition, execution mechanics, optimizer behavior, and performance characteristics. You can identify workloads that benefit from hash indexes, benchmark alternatives, and make data-driven decisions about index selection.