Every index structure has its raison d'être—the query pattern it was born to answer. For B+-trees, it's ordered traversal. For R-trees, it's spatial containment. For hash indexes, it's the point query: finding a specific record by exact key match.
```sql
SELECT * FROM employees WHERE employee_id = 12345;
```
This single query encapsulates why hash indexes exist. A B+-tree answers it by navigating through 3-4 tree levels. A table scan reads potentially millions of rows. A hash index? It computes the answer location in one step, reads one page, and returns the result.
O(1) — not O(log n), not O(n). Constant time, regardless of data size.
By the end of this page, you will understand exactly how hash indexes process point queries, when query optimizers choose hash indexes, practical scenarios where this matters most, and how to benchmark hash vs. tree performance for your specific workloads.
A point query (also called an equality query or exact-match query) searches for records matching a specific value. The defining characteristic: the search condition uses equality (=) on the indexed column(s).
Point query patterns:
```sql
-- Single-column equality
SELECT * FROM users WHERE user_id = 42;
SELECT * FROM products WHERE sku = 'PRD-12345';
SELECT * FROM sessions WHERE token = 'abc123xyz';

-- IN clause (multiple point queries)
SELECT * FROM orders WHERE order_id IN (100, 101, 102);

-- Composite key equality
SELECT * FROM inventory WHERE warehouse_id = 5 AND product_id = 999;
```
What is NOT a point query:
```sql
-- Range condition (NOT a point query)
SELECT * FROM orders WHERE order_date > '2024-01-01';

-- Pattern matching (NOT a point query)
SELECT * FROM users WHERE email LIKE '%@gmail.com';

-- Inequality (NOT a point query)
SELECT * FROM products WHERE price != 0;

-- Partial composite match (NOT a point query for hash)
SELECT * FROM inventory WHERE warehouse_id = 5;  -- Missing product_id
```
An IN clause with discrete values is semantically multiple point queries. Hash indexes handle each value separately, potentially outperforming B+-trees, which must navigate the tree for each value. For IN (v1, v2, v3), a hash index does three independent O(1) lookups, which is still O(1) overall.
Why equality is special for hashing:
The hash function produces a single bucket number for each key value. This mapping is:
- Deterministic: the same key always hashes to the same bucket, so one computation finds the one place the key can live.
- Order-destroying: adjacent key values scatter to unrelated buckets.
- Isolated: each lookup is independent of every other key in the index.
This isolation property is both a strength (enables O(1) lookup) and a limitation (prevents range queries). For point queries, it's pure advantage.
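A minimal sketch of this mapping (the bucket count `M` and the SHA-256-based bucket function are illustrative choices, not how any particular engine computes it):

```python
import hashlib

M = 64  # number of buckets (illustrative)

def bucket_for(key: int) -> int:
    """Deterministic: the same key always lands in the same bucket."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % M

# Adjacent keys scatter to unrelated buckets: the order-destroying
# property that rules out range queries but isolates each lookup.
for key in range(12340, 12346):
    print(key, "->", bucket_for(key))
```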
Let's trace a point query through the complete execution path, from SQL to disk to result.
Query: SELECT * FROM employees WHERE employee_id = 12345
Execution steps:
| Step | Component | Action | I/O |
|---|---|---|---|
| 1 | Parser | Validate SQL syntax, build parse tree | None |
| 2 | Optimizer | Identify hash index on employee_id, generate plan | None |
| 3 | Executor | Begin index scan operation | None |
| 4 | Hash Index | Compute: bucket = h(12345) mod M = 47 | None |
| 5 | Buffer Manager | Check if bucket directory is cached | None (hit) |
| 6 | Buffer Manager | directory[47] → page 2089; check buffer cache | 0 or 1 |
| 7 | Hash Index | Scan bucket page for key 12345, find RID | None |
| 8 | Buffer Manager | Fetch data page for RID | 0 or 1 |
| 9 | Executor | Project columns, return result row | None |
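The steps above can be mirrored in a few lines of Python. This is a toy in-memory model of steps 4-8; the `directory`, `bucket_pages`, and `heap` dictionaries are illustrative stand-ins for the on-disk directory, index pages, and table heap:

```python
M = 100

directory = {b: 2000 + b for b in range(M)}             # bucket -> index page id
bucket_pages = {pid: {} for pid in directory.values()}  # index page -> {key: RID}
heap = {}                                               # data page -> {slot: row}

def h(key: int) -> int:
    return key % M          # stand-in hash; real systems hash the key bytes

def insert(key: int, row: dict):
    page = f"page_{500 + len(heap)}"                    # fake data page id
    heap.setdefault(page, {})[0] = row
    bucket_pages[directory[h(key)]][key] = (page, 0)    # RID = (page, slot)

def point_query(key: int):
    page_id = directory[h(key)]             # steps 4-6: compute bucket, one hop
    rid = bucket_pages[page_id].get(key)    # step 7: scan bucket page for key
    if rid is None:
        return None                         # absent keys are also O(1)
    page, slot = rid
    return heap[page][slot]                 # step 8: fetch data page via RID

insert(12345, {"employee_id": 12345, "name": "Ada"})
print(point_query(12345))   # {'employee_id': 12345, 'name': 'Ada'}
```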
I/O analysis: in the worst case the lookup touches two pages, the bucket page and the data page, so the cost is 0-2 random reads depending on what the buffer cache already holds (steps 6 and 8 above).

Compare to a B+-tree: the same query traverses 3-4 index pages (root, internal nodes, leaf) before fetching the data page, for 4-5 page accesses in the worst case.

The hash index advantage is most pronounced when:
- the table is large, since deeper trees mean more index pages per lookup;
- the cache is cold or access is random, so B+-tree internal nodes are not resident;
- the workload is dominated by single-key lookups rather than ranges.
```sql
-- Create test table
-- (employee_id deliberately has no PRIMARY KEY: a PK would add an
--  implicit B-tree index that competes with the ones we create below)
CREATE TABLE employees (
    employee_id SERIAL,
    name VARCHAR(100),
    department VARCHAR(50),
    salary NUMERIC(10,2)
);

-- Generate test data (1 million rows)
INSERT INTO employees (name, department, salary)
SELECT
    'Employee_' || generate_series,
    CASE (generate_series % 5)
        WHEN 0 THEN 'Engineering'
        WHEN 1 THEN 'Sales'
        WHEN 2 THEN 'Marketing'
        WHEN 3 THEN 'HR'
        ELSE 'Finance'
    END,
    30000 + (random() * 70000)::numeric(10,2)
FROM generate_series(1, 1000000);

-- Create hash index on employee_id
CREATE INDEX idx_emp_hash ON employees USING HASH (employee_id);

-- PostgreSQL has no per-index hints, so to measure the hash index
-- we simply don't create the B-tree index yet
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM employees WHERE employee_id = 500000;

-- Sample output:
-- Index Scan using idx_emp_hash on employees
--   Index Cond: (employee_id = 500000)
--   Buffers: shared hit=2            <-- Just 2 buffer accesses!
-- Planning Time: 0.082 ms
-- Execution Time: 0.031 ms

-- Compare with B-tree
CREATE INDEX idx_emp_btree ON employees USING BTREE (employee_id);
DROP INDEX idx_emp_hash;

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM employees WHERE employee_id = 500000;

-- Sample output:
-- Index Scan using idx_emp_btree on employees
--   Index Cond: (employee_id = 500000)
--   Buffers: shared hit=4            <-- 4 buffer accesses (tree depth)
-- Planning Time: 0.075 ms
-- Execution Time: 0.033 ms
```

In practice, frequently accessed B+-tree root and internal nodes stay in the buffer cache. For hot data, the I/O difference shrinks. Hash indexes shine brightest with cold or random access patterns, where B+-tree internal nodes may be evicted.
The query optimizer decides whether to use a hash index for a given query. This decision involves cost comparison against alternatives.
Optimizer considerations:
- Is the predicate pure equality on every indexed column? Hash indexes support nothing else.
- What is the estimated cost of the hash probe versus a B+-tree descent or a scan?
- How trustworthy are the statistics behind each estimate?

Cost model for hash index lookup:
```
cost_hash = C_index_read + E[overflow_pages] × C_page_read + C_data_fetch

where:
  C_index_read      = 1.0 (one bucket page)
  E[overflow_pages] = average overflow chain length (often < 0.1)
  C_page_read       = sequential/random read cost
  C_data_fetch      = cost to fetch actual data row(s)
```
Cost model for B+-tree lookup:
```
cost_btree = tree_height × C_page_read + C_data_fetch

where:
  tree_height = ⌈log_fanout(n)⌉, typically 3-4
  fanout      = entries_per_page (typically 100-500)
```
Example comparison:
| Factor | Hash Index | B+-Tree |
|---|---|---|
| Table size | 10M rows | 10M rows |
| Tree fanout: 200 | - | height = log₂₀₀(10M) ≈ 3 |
| Avg overflow: 0.1 pages | 1.1 pages | - |
| Index pages read | 1.1 | 3 |
| Advantage | ~3x fewer reads | Supports ranges |
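A quick sketch reproducing the table's arithmetic from the two cost formulas (page-read and data-fetch costs are normalized to 1.0 here; real optimizers use tuned cost constants):

```python
import math

n = 10_000_000          # table rows
fanout = 200            # B+-tree entries per page
E_overflow = 0.1        # expected overflow pages per hash bucket
C_page_read = 1.0
C_data_fetch = 1.0

# cost_hash = C_index_read + E[overflow_pages] * C_page_read + C_data_fetch
cost_hash = 1.0 + E_overflow * C_page_read + C_data_fetch

# log_200(10M) ~ 3.04, so height ~ 3 as in the table (a strict ceiling
# would give 4; real heights depend on actual page fill)
tree_height = round(math.log(n, fanout))
cost_btree = tree_height * C_page_read + C_data_fetch

print(f"hash index: {cost_hash:.1f}")   # 2.1 (1.1 index pages + 1 data page)
print(f"b+tree:     {cost_btree:.1f}")  # 4.0 (3 index pages + 1 data page)
```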
Despite hash index advantages for point queries, many optimizers default to B+-tree. Reasons: B+-tree cost models are well-understood, B+-trees handle edge cases (NULL, range fallback), hash index statistics may be less reliable, and the optimizer is conservative about newer/less-common index types.
```sql
-- PostgreSQL: Examine optimizer choice
EXPLAIN (VERBOSE)
SELECT * FROM employees WHERE employee_id = 12345;

-- If the optimizer ignores the hash index, you can investigate:

-- 1. Check that the hash index exists
\d employees

-- 2. Check cost settings
SHOW random_page_cost;
SHOW seq_page_cost;

-- 3. Disable competing scan methods temporarily
SET enable_seqscan = off;
SET enable_bitmapscan = off;

-- 4. Re-examine the plan
EXPLAIN SELECT * FROM employees WHERE employee_id = 12345;
-- Should now use an index scan; note that enable_indexscan governs
-- hash and B-tree scans alike, so to isolate the hash index you must
-- drop the competing B-tree index rather than toggle a setting

-- PostgreSQL doesn't have direct index hints, but you can influence
-- the choice with CTEs, subqueries, or by disabling other scan types
```

Hash indexes deliver maximum value in specific workload patterns. Understanding these patterns helps identify opportunities for optimization.
Scenario analysis:
| Scenario | Query Pattern | Why Hash Excels | Real Example |
|---|---|---|---|
| User authentication | SELECT ... WHERE email = ? | Single lookup at login; latency critical | Login page: 0 tolerance for slow auth |
| Session management | SELECT ... WHERE session_token = ? | Random token lookups; no ordering needed | Every API request validates session |
| Primary key lookup | SELECT ... WHERE id = ? | Classic point query; often hotspot | User profile, order details |
| Foreign key validation | SELECT 1 WHERE fk_col = ? | Existence check; result doesn't matter | Insert validation: does parent exist? |
| Deduplication check | SELECT COUNT(*) WHERE hash = ? | Before insert: is this a duplicate? | Content-addressable storage |
| Cache population | SELECT ... WHERE key = ? | Cache miss triggers DB lookup | Redis/Memcached miss handling |
Case study: session store optimization

Consider a web application in which every API request validates a session token with an exact-match lookup, the pattern from the table above. Session tokens are random strings, so there is no meaningful ordering to exploit.

B+-tree performance: each validation descends 3-4 index levels, costing 3-4 index page reads before the data page.

Hash index performance: each validation reads roughly one bucket page (about 1.1 including average overflow) before the data page.

This translates directly to roughly 3x less index I/O per request, lower per-request latency, and less buffer cache pressure at high request volume.
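A back-of-envelope calculation of what that saves (the daily request volume is an assumed figure for illustration only; the per-lookup page counts come from the comparison above):

```python
requests_per_day = 50_000_000   # assumed session-validation volume
btree_index_pages = 3           # index pages per lookup (tree depth 3)
hash_index_pages = 1.1          # bucket page plus average overflow

saved = requests_per_day * (btree_index_pages - hash_index_pages)
print(f"{saved:,.0f} fewer index page reads per day")   # 95,000,000
```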
Examine your slow query log for frequently executed point queries. Look for patterns: WHERE id = ?, WHERE token = ?, WHERE email = ?. If these queries dominate your workload and use B+-tree indexes, hash indexes may dramatically improve performance.
The SQL IN clause deserves special attention. It represents multiple point queries bundled into one SQL statement, and hash indexes can process these exceptionally efficiently.
IN clause query:
```sql
SELECT * FROM orders
WHERE order_id IN (1001, 1002, 1003, 1004, 1005);
```
Execution strategies:
Strategy 1: Index nested loop (default). For each value in the IN list, probe the index once and fetch the matching row.

With a hash index: 5 values × 1 I/O each = 5 I/Os. With a B+-tree: 5 values × 3 I/Os each = 15 I/Os (a full tree traversal for each).
Strategy 2: Index range scan (a B+-tree advantage). If the IN values are consecutive or near-consecutive, a B+-tree can descend once to the smallest value and then walk the linked leaf pages to collect the rest.

This is where a B+-tree can win, but only for sorted, consecutive values.
Strategy 3: Bitmap index scan. Build a bitmap of matching row locations from the index, then fetch the rows in physical page order. This reduces random I/O for many-value IN clauses.
| IN Clause Size | Hash Index I/Os | B+-Tree I/Os | Winner |
|---|---|---|---|
| 5 values | 5 | 15 | Hash (3x better) |
| 10 values | 10 | 30 | Hash (3x better) |
| 100 values | 100 | 300 | Hash (3x better) |
| 1000 values (random) | 1000 | 3000 | Hash (3x better) |
| 1000 values (sequential) | 1000 | ~10 (range scan) | B+-Tree (wins) |
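The table's I/O counts fall out of a simple per-probe model, sketched here (the 200-entries-per-leaf figure reuses the fanout assumption from the cost model above):

```python
def in_clause_ios(k: int, sequential: bool, depth: int = 3,
                  leaf_entries: int = 200) -> dict:
    """Estimate index-page I/Os for an IN clause of k values."""
    hash_ios = k                          # one bucket page per value
    if sequential:
        # B+-tree range scan: one descent, then consecutive leaf pages
        btree_ios = depth + max(1, k // leaf_entries)
    else:
        btree_ios = k * depth             # full descent per random value
    return {"hash": hash_ios, "b+tree": btree_ios}

print(in_clause_ios(5, sequential=False))    # {'hash': 5, 'b+tree': 15}
print(in_clause_ios(1000, sequential=True))  # b+tree ~8, the table's "~10"
```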
```sql
-- Test IN clause with hash vs B-tree

-- Setup: create table
-- (order_id deliberately has no PRIMARY KEY: a PK would add an
--  implicit B-tree index alongside the ones we create below)
CREATE TABLE orders (
    order_id SERIAL,
    customer_id INT,
    order_date DATE,
    total NUMERIC(10,2)
);

INSERT INTO orders (customer_id, order_date, total)
SELECT
    (random() * 10000)::int,
    '2024-01-01'::date + (random() * 365)::int,
    (random() * 1000)::numeric(10,2)
FROM generate_series(1, 1000000);

-- Create indexes
CREATE INDEX idx_orders_hash ON orders USING HASH (order_id);
CREATE INDEX idx_orders_btree ON orders USING BTREE (order_id);

-- Random IN clause (hash index should win)
-- Generate 100 random order IDs
EXPLAIN (ANALYZE, BUFFERS)
WITH random_ids AS (
    SELECT (random() * 1000000)::int AS id
    FROM generate_series(1, 100)
)
SELECT * FROM orders
WHERE order_id IN (SELECT id FROM random_ids);

-- Sequential IN clause (B-tree might win with a range scan)
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE order_id IN (
    SELECT generate_series(50000, 50099)   -- 100 consecutive IDs
);

-- To attribute the cost to one index type, drop the other and re-run
DROP INDEX idx_orders_btree;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE order_id IN (SELECT generate_series(50000, 50099));

-- Note the buffer counts in each plan:
-- Random values: hash should show fewer buffers
-- Sequential values: a B-tree range scan may be more efficient
```

Modern optimizers may rewrite IN clauses as JOINs, EXISTS, or other forms. The rewritten form may change index selection. Use EXPLAIN to see the actual execution plan, not just the logical query.
For many applications, latency—not just throughput—is the critical metric. Hash indexes offer predictable, low latency for point queries.
Latency components:
```
total_latency = parse_time + plan_time + index_time + data_fetch_time
```
For point queries:
```
parse_time      ≈ constant (simple query)
plan_time       ≈ constant (obvious plan)
index_time      = f(index_type)
data_fetch_time ≈ constant (single row)
```
Index time comparison (typical SSDs):
| Component | Hash Index | B+-Tree (depth 4) | Table Scan (1M rows) |
|---|---|---|---|
| Hash/navigation computation | ~10 μs | N/A | N/A |
| Index I/O (random read) | ~100 μs × 1 = 100 μs | ~100 μs × 4 = 400 μs | N/A |
| Data page I/O | ~100 μs | ~100 μs | ~100 μs × many |
| Total (cold cache) | ~200 μs | ~500 μs | seconds |
| Total (warm cache) | ~20 μs | ~50 μs | milliseconds |
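A small calculator that reproduces the table's totals (the 100 μs random-read and 10 μs hash-computation figures are the rough SSD assumptions above, not measurements):

```python
RANDOM_READ_US = 100.0   # rough SSD random read, from the table above
HASH_COMPUTE_US = 10.0

def point_latency_us(index_pages: int, compute_us: float = 0.0,
                     cache_hit_rate: float = 0.0) -> float:
    """Expected time for index pages plus one data page, given a hit rate."""
    miss = 1.0 - cache_hit_rate
    return compute_us + (index_pages + 1) * miss * RANDOM_READ_US

print(point_latency_us(1, HASH_COMPUTE_US))        # hash, cold:   ~210 us
print(point_latency_us(4))                         # b+tree, cold: ~500 us
print(point_latency_us(1, HASH_COMPUTE_US, 0.9))   # hash, 90% warm: ~30 us
```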
Latency distribution: hash index latency is not only lower on average but also more predictable than B+-tree latency.

Hash index: nearly every lookup touches exactly one bucket page; the occasional outlier comes from walking an overflow chain.

B+-tree: latency depends on how many of the 3-4 levels happen to be cached; a lookup that misses at every level pays several random reads, widening the tail of the distribution.
Why predictability matters: for latency-sensitive applications (real-time systems, SLA-bound services), you provision for the tail of the distribution, not the mean. A tighter distribution means less over-provisioning and fewer violations of percentile-based SLAs.
Example SLA analysis: suppose your SLA requires 99th-percentile latency under 50 ms. Using the cold-cache figures above, a hash lookup at roughly 0.2 ms and a B+-tree lookup at roughly 0.5 ms are both far inside the budget. Both indexes meet the SLA, but hash provides more headroom.
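A tiny sanity check of that headroom, using the cold-cache totals from the latency table (illustrative numbers, not measurements):

```python
SLA_P99_MS = 50.0
cold_p99_ms = {"hash": 0.2, "b+tree": 0.5}   # cold-cache totals from above

for index, p99 in cold_p99_ms.items():
    print(f"{index}: P99 {p99} ms -> {SLA_P99_MS / p99:.0f}x headroom")
# hash: 250x headroom, b+tree: 100x; both pass, hash leaves 2.5x more
```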
Don't rely on execution plan cost estimates for latency. Use EXPLAIN ANALYZE with multiple runs, measure at different cache states, and capture percentile distributions. Tools like pg_stat_statements (PostgreSQL) or Query Store (SQL Server) help track query latency over time.
Theoretical analysis only goes so far. Proper benchmarking reveals the real-world performance difference for your specific workload.
Benchmarking methodology (the script below follows these steps):
- Build the table at realistic scale (1 million rows here).
- Create only the index under test, so the planner cannot silently pick another.
- Warm up the cache before timing, so you measure steady state.
- Issue thousands of randomized point queries and record per-query latency.
- Report percentile latencies (P95, P99) and throughput, not just the mean.
"""Database Index Benchmark ScriptCompares hash vs B-tree index performance for point queries.""" import psycopg2import timeimport randomimport statistics # ConfigurationDB_CONFIG = { 'host': 'localhost', 'database': 'benchmark_db', 'user': 'postgres', 'password': 'postgres'}TABLE_SIZE = 1_000_000NUM_QUERIES = 10_000WARMUP_QUERIES = 1_000 def setup_database(conn): """Create test table with both index types.""" with conn.cursor() as cur: # Drop and recreate cur.execute("DROP TABLE IF EXISTS benchmark_table") cur.execute(""" CREATE TABLE benchmark_table ( id SERIAL PRIMARY KEY, data VARCHAR(100), value NUMERIC(10,2) ) """) # Insert test data cur.execute(f""" INSERT INTO benchmark_table (data, value) SELECT 'data_' || generate_series, random() * 1000 FROM generate_series(1, {TABLE_SIZE}) """) # Create indexes cur.execute("CREATE INDEX idx_hash ON benchmark_table USING HASH (id)") cur.execute("CREATE INDEX idx_btree ON benchmark_table USING BTREE (id)") conn.commit() print(f"Created table with {TABLE_SIZE:,} rows") def run_benchmark(conn, index_type: str, num_queries: int) -> dict: """Run point query benchmark with specified index.""" # Configure PostgreSQL to use specific index with conn.cursor() as cur: if index_type == 'hash': cur.execute("SET enable_indexscan = off") # Disable btree cur.execute("SET enable_bitmapscan = off") else: cur.execute("SET enable_hashagg = off") # Not directly applicable # Default prefers btree cur.execute("SET enable_seqscan = off") # Force index usage # Generate random query keys query_keys = [random.randint(1, TABLE_SIZE) for _ in range(num_queries)] # Warmup with conn.cursor() as cur: for key in query_keys[:WARMUP_QUERIES]: cur.execute("SELECT * FROM benchmark_table WHERE id = %s", (key,)) cur.fetchone() # Timed benchmark latencies = [] with conn.cursor() as cur: for key in query_keys: start = time.perf_counter() cur.execute("SELECT * FROM benchmark_table WHERE id = %s", (key,)) cur.fetchone() end = time.perf_counter() latencies.append((end - start) * 1000) # Convert to ms # Reset settings with conn.cursor() as cur: cur.execute("RESET ALL") # Calculate statistics return { 'index_type': index_type, 'num_queries': num_queries, 'mean_ms': statistics.mean(latencies), 'median_ms': statistics.median(latencies), 'p95_ms': sorted(latencies)[int(num_queries * 0.95)], 'p99_ms': sorted(latencies)[int(num_queries * 0.99)], 'std_dev_ms': statistics.stdev(latencies), 'min_ms': min(latencies), 'max_ms': max(latencies), 'qps': num_queries / sum(latencies) * 1000, # Queries per second } def main(): conn = psycopg2.connect(**DB_CONFIG) print("Setting up database...") setup_database(conn) print("\nRunning hash index benchmark...") hash_results = run_benchmark(conn, 'hash', NUM_QUERIES) print("Running B-tree index benchmark...") btree_results = run_benchmark(conn, 'btree', NUM_QUERIES) # Report results print("\n" + "=" * 60) print("BENCHMARK RESULTS") print("=" * 60) for results in [hash_results, btree_results]: print(f"\n{results['index_type'].upper()} INDEX:") print(f" Queries executed: {results['num_queries']:,}") print(f" Mean latency: {results['mean_ms']:.3f} ms") print(f" Median latency: {results['median_ms']:.3f} ms") print(f" P95 latency: {results['p95_ms']:.3f} ms") print(f" P99 latency: {results['p99_ms']:.3f} ms") print(f" Throughput: {results['qps']:.0f} queries/sec") # Comparison speedup = btree_results['mean_ms'] / hash_results['mean_ms'] print(f"\nHash index speedup: {speedup:.2f}x faster mean latency") conn.close() if __name__ == "__main__": main()Benchmarks are 
system-specific. Your results will vary based on: disk type (SSD vs HDD), buffer pool size, concurrent load, data distribution, and key size. Always benchmark on hardware similar to production.
Point queries are where hash indexes prove their worth. The decision to use a hash index should be based on workload analysis and measured performance.
| Condition | Recommendation |
|---|---|
| All queries are equality lookups | Hash index is excellent choice |
| 99% equality, 1% range | Evaluate: hash + separate btree, or just btree |
| High query volume (>10K/sec) | Hash advantage compounds; strongly consider |
| Latency-critical (sub-ms) | Hash helps; but cache is more important |
| Mixed read/write workload | Hash still good; monitor overflow chains |
| Data grows unpredictably | Consider dynamic hashing or stick with btree |
What's next:
Having explored point queries—the strength of hash indexing—we must now confront its fundamental weakness. The next page examines the limitations of hash indexes: the query patterns they cannot support, the operational challenges they introduce, and the scenarios where B+-trees remain the better choice despite their higher I/O cost for point queries.
You now understand point queries in depth: their definition, execution mechanics, optimizer behavior, and performance characteristics. You can identify workloads that benefit from hash indexes, benchmark alternatives, and make data-driven decisions about index selection.