Every technology choice involves trade-offs. Hash indexes achieve O(1) point query performance by sacrificing capabilities that B+-trees provide effortlessly. Understanding these limitations isn't pessimism—it's engineering maturity. The worst outcome isn't choosing B+-tree over hash and losing some performance. The worst outcome is choosing hash and discovering, after deployment, that your workload requires capabilities hash indexes fundamentally cannot provide.
This page examines hash index limitations with the same rigor we applied to their strengths. By the end, you'll know exactly when to reach for hash indexes—and when to step back.
By the end of this page, you will understand the fundamental query patterns hash indexes cannot support, operational challenges unique to hash structures, disaster scenarios to avoid, and a comprehensive decision framework for index selection.
The most fundamental limitation of hash indexes is their complete inability to support range queries. This isn't a performance issue—it's a structural impossibility.
Why range queries fail:
Hash functions deliberately destroy key ordering. Consider keys 100, 101, and 102:
h(100) = 47
h(101) = 892
h(102) = 3
Consecutive keys hash to completely unrelated buckets. There is no path from bucket 47 to bucket 892 to bucket 3. The structure that would enable traversal simply doesn't exist.
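You can observe this order-destroying behavior directly in PostgreSQL; here is a minimal sketch using the built-in hashtext function, applied to text literals purely for illustration:
-- Illustration: consecutive keys produce unrelated hash values
SELECT hashtext('100') AS h_100,
       hashtext('101') AS h_101,
       hashtext('102') AS h_102;
-- The three returned integers bear no ordering relationship to one
-- another, so there is no bucket-to-bucket path a range scan could follow.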
Contrast with B+-tree:
In a B+-tree, keys 100, 101, 102 are stored sequentially in leaf nodes. A range query WHERE key BETWEEN 100 AND 200 starts at 100 and follows leaf pointers until reaching 200. The structure inherently supports ordered traversal.
| Query Pattern | Hash Index | B+-Tree |
|---|---|---|
| WHERE key > 100 | Full table scan required | Index range scan, O(log n + k) |
| WHERE key BETWEEN 100 AND 200 | Full table scan required | Index range scan |
| WHERE key < 50 | Full table scan required | Index range scan |
| ORDER BY key | Cannot contribute | Provides sorted output directly |
| MIN(key) / MAX(key) | Full table scan required | First/last leaf, O(log n) |
| GROUP BY key (sorted) | Cannot contribute to sort | Provides sorted input for grouping |
The cost of range queries with hash indexes:
When a query requires range access on a hash-indexed column, the database must either fall back to a full table scan or use a different index on that column (a B+-tree, if one exists); the hash index contributes nothing.
Example scenario:
-- You create a hash index for point queries
CREATE INDEX idx_order_date_hash ON orders USING HASH (order_date);
-- This query cannot use the hash index
SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
-- The optimizer will use:
-- 1. Seq Scan on orders (full table scan)
-- 2. Any B+-tree index on order_date (if exists)
-- The hash index is useless for this query
If your workload includes ANY range queries on the indexed column, a hash index is the wrong choice. Even 5% range queries mean the hash index provides zero value 5% of the time while consuming space. Consider: is the 95% point query speedup worth maintaining two indexes?
Beyond range queries, hash indexes cannot contribute to any operation requiring sorted data.
ORDER BY clause:
SELECT * FROM employees ORDER BY employee_id;
With a B+-tree on employee_id, the optimizer can retrieve rows in sorted order directly—no sorting step needed. With a hash index, the database must first fetch the rows in whatever order the scan produces them, then perform an explicit sort before returning results.
For large result sets, this sorting step can be expensive (O(n log n)) and may require disk-based sorting if results don't fit in memory.
GROUP BY optimization:
SELECT department_id, COUNT(*)
FROM employees
GROUP BY department_id;
If data arrives presorted by department_id, aggregation is efficient: scan once, accumulate counts as groups change. B+-trees enable this. Hash indexes deliver data in random order, forcing the optimizer to either build an in-memory hash table of groups (hash aggregation) or sort the rows before grouping.
Both require O(n) extra space or O(n log n) extra time.
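A quick way to see this trade-off is to compare plans; a sketch for the employees query above (actual plans vary with PostgreSQL version, data volume, and statistics):
EXPLAIN
SELECT department_id, COUNT(*)
FROM employees
GROUP BY department_id;
-- Without sorted input the planner typically picks HashAggregate
-- (extra memory for the group table) or Sort + GroupAggregate (extra
-- time for the sort). A B+-tree on department_id can instead feed a
-- pre-sorted Index Scan into GroupAggregate, avoiding both costs.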
-- Demo: Hash index cannot optimize ORDER BY
CREATE TABLE products (
    product_id SERIAL,  -- no PRIMARY KEY, so no implicit B+-tree index
    name VARCHAR(100),
    price NUMERIC(10,2)
);

INSERT INTO products (name, price)
SELECT 'Product_' || i, random() * 100
FROM generate_series(1, 1000000) i;

-- Create hash index
CREATE INDEX idx_product_hash ON products USING HASH (product_id);

-- Query with ORDER BY
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM products ORDER BY product_id LIMIT 100;

-- Output shows:
-- Sort (cost=... rows=1000000)
--   Sort Key: product_id
--   Sort Method: top-N heapsort  Memory: ...
--   ->  Seq Scan on products
--         Buffers: shared hit=...
--
-- Notice: The hash index is NOT used!
-- PostgreSQL performs a table scan + sort

-- Now create B+-tree index
CREATE INDEX idx_product_btree ON products USING BTREE (product_id);

-- Same query
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM products ORDER BY product_id LIMIT 100;

-- Output shows:
-- Limit (cost=0.42..3.14 rows=100)
--   ->  Index Scan using idx_product_btree on products
--         Buffers: shared hit=4
--
-- B+-tree provides sorted access directly!
-- Only 4 buffer accesses for 100 rows

-- The difference:
-- Hash: Full scan + sort = O(n log n), many I/Os
-- B+-tree: Index scan = O(log n + 100), ~4 I/Os

Some queries have implicit ordering needs. UNION (without ALL) requires sorted or hashed deduplication. Certain subquery optimizations rely on sorted access. The optimizer may find creative uses for B+-tree ordering that you didn't explicitly request.
For composite (multi-column) hash indexes, an additional limitation emerges: partial key lookups are impossible.
The problem:
Consider a hash index on (department_id, employee_id):
CREATE INDEX idx_dept_emp_hash ON employees
USING HASH (department_id, employee_id);
This index hashes the combination of both columns into a single bucket address:
bucket = hash(department_id || employee_id)
What works:
-- Full composite key provided
SELECT * FROM employees
WHERE department_id = 10 AND employee_id = 500;
-- Hash is computed: h(10, 500) = bucket 47 ✓
What doesn't work:
-- Only first column provided
SELECT * FROM employees WHERE department_id = 10;
-- Cannot compute hash: need employee_id too!
-- Falls back to table scan ✗
Contrast with B+-tree:
A B+-tree on (department_id, employee_id) stores entries sorted by department_id first, then by employee_id within each department. This organization enables:
-- Full key: Navigate to exact position
WHERE department_id = 10 AND employee_id = 500; ✓
-- Leftmost prefix: Navigate to department, scan
WHERE department_id = 10; ✓
-- Range on first column: Navigate and scan
WHERE department_id BETWEEN 10 AND 20; ✓
-- Range on second column (within first)
WHERE department_id = 10 AND employee_id > 400; ✓
The B+-tree leftmost prefix rule:
For a B+-tree index on (A, B, C), these queries can use the index:
- A = 1 AND B = 2 AND C = 3 (full key)
- A = 1 AND B = 2 (leftmost two columns)
- A = 1 (leftmost column alone)
- A = 1 AND B = 2 AND C > 5 (equality on a prefix, range on the next column)
Hash equivalent:
For a hash index on (A, B, C), only this query uses the index:
- A = 1 AND B = 2 AND C = 3 (equality on all three columns)
Every other combination requires table scan.
| Query Pattern | B+-Tree on (A,B,C) | Hash on (A,B,C) |
|---|---|---|
| A = 1 AND B = 2 AND C = 3 | Full index lookup ✓ | Hash lookup ✓ |
| A = 1 AND B = 2 | Prefix scan ✓ | Table scan ✗ |
| A = 1 | Prefix scan ✓ | Table scan ✗ |
| A = 1 AND C = 3 (skip B) | Partial scan, filter ✓ | Table scan ✗ |
| A BETWEEN 1 AND 5 | Range scan ✓ | Table scan ✗ |
| B = 2 (skip A) | Table scan ✗ | Table scan ✗ |
If you need lookups by department_id alone AND by (department_id, employee_id), a single hash index cannot serve both. You'd need either: (1) a B+-tree that serves both, or (2) separate hash indexes on each pattern (expensive). This is a major reason B+-trees dominate for composite keys.
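For comparison, a single composite B+-tree covers both access patterns; a minimal sketch using the employees table from the examples above (index name illustrative):
-- One B+-tree serves both the prefix lookup and the full-key lookup
CREATE INDEX idx_dept_emp_btree ON employees (department_id, employee_id);

-- Leftmost-prefix lookup: usable
SELECT * FROM employees WHERE department_id = 10;

-- Full composite-key lookup: usable
SELECT * FROM employees
WHERE department_id = 10 AND employee_id = 500;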
Hash indexes cannot support pattern matching queries (LIKE, regular expressions). This limitation follows from the same principle as the range-query limitation: hashing destroys the character-by-character structure that makes pattern matching work.
Why LIKE queries fail:
-- Hash index on email column
CREATE INDEX idx_email_hash ON users USING HASH (email);
-- This query CANNOT use the hash index
SELECT * FROM users WHERE email LIKE 'john%';
The hash of 'john@example.com' (say, 847) has no relationship to the hash of 'john.doe@example.com' (say, 2931). The shared 'john' prefix provides no advantage.
B+-tree advantage:
B+-trees on strings store entries in lexicographic order:
- jane@example.com
- john.doe@example.com
- john@example.com
- jordan@example.com
All keys sharing the 'john' prefix end up adjacent in the leaves.
Prefix queries (LIKE 'john%') can find the first 'john...' entry and scan until the prefix no longer matches—efficient range scan.
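A sketch of what this looks like in PostgreSQL, using the users table from the earlier example (note: with a non-"C" collation, the text_pattern_ops operator class is needed for a B+-tree to serve prefix LIKE queries; the index name is illustrative):
-- B+-tree suitable for prefix matching regardless of collation
CREATE INDEX idx_email_prefix ON users (email text_pattern_ops);

EXPLAIN SELECT * FROM users WHERE email LIKE 'john%';
-- Expect an Index Scan / Bitmap Index Scan over just the 'john...' range
-- rather than a sequential scan (exact plan depends on table statistics).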
| Pattern Type | Example | B+-Tree | Hash |
|---|---|---|---|
| Prefix match | LIKE 'john%' | Range scan ✓ | Table scan ✗ |
| Suffix match | LIKE '%@gmail.com' | Table scan ✗ | Table scan ✗ |
| Contains | LIKE '%john%' | Table scan ✗ | Table scan ✗ |
| Single wildcard | LIKE 'jo_n' | Range + filter ✓ | Table scan ✗ |
| Regex | ~ '^john[0-9]+' | Table scan ✗ | Table scan ✗ |
Note on suffix and contains patterns:
Neither B+-tree nor hash index helps with suffix or contains patterns on standard indexed columns. For these patterns, specialized indexes are needed:
- Trigram indexes (e.g., PostgreSQL's pg_trgm) for contains and suffix patterns
- Full-text search indexes for word-level matching
- A B+-tree on a reversed copy of the column for suffix patterns
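As one example of the trigram option, a sketch assuming the pg_trgm extension is available (index name illustrative):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_email_trgm ON users USING GIN (email gin_trgm_ops);

-- These patterns can now be served by the trigram index
-- (planner-dependent; very short patterns may still scan the table):
SELECT * FROM users WHERE email LIKE '%@gmail.com';
SELECT * FROM users WHERE email LIKE '%john%';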
But for prefix matching—common in autocomplete, search suggestions, and filtering—B+-trees work and hash indexes don't.
You might wonder: could we hash the prefix and search? No—the hash of 'john' differs from the hash of 'john@example.com'. Hash functions process the entire input. Partial input produces unrelated output. This is by design (good for distribution) but prevents pattern matching.
Traditional (static) hash indexes are created with a fixed number of buckets. When data grows beyond this capacity, performance degrades—and recovering requires costly reorganization.
The static hashing problem:
Initial state:
- 10,000 buckets
- 50,000 records
- Load factor: 5 entries/bucket ✓
After growth:
- 10,000 buckets (unchanged)
- 500,000 records
- Load factor: 50 entries/bucket ✗
With 50 entries per bucket, overflow chains grow long. What was O(1) becomes O(50)—still constant, but a very different constant. If chains grow to disk pages, each lookup requires multiple I/Os.
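To put rough numbers on it (assuming, purely for illustration, that a bucket page holds about 20 entries):
- 5 entries/bucket: every bucket fits in one page, so a lookup costs about 1 I/O
- 50 entries/bucket: each bucket chains across roughly 3 pages, so a lookup costs 2-3 I/Os plus the CPU cost of walking the chain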
B+-tree contrast:
B+-trees grow organically: pages split as they fill, and a new root level is added only when the tree outgrows its current height. No global reorganization is ever required.
A B+-tree with 10 million records has height ~4. A B+-tree with 100 million records has height ~5. The performance delta is one additional I/O—graceful degradation, not collapse.
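A rough calculation behind those heights (assuming a fanout of about 100 entries per internal node; real fanouts depend on key and page size):
- height 3 reaches roughly 100^3 = 1 million keys
- height 4 reaches roughly 100^4 = 100 million keys
So growing from 10 million to 100 million records adds at most one level, which is one extra I/O per lookup.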
Reorganization (rehashing):
When a static hash index becomes overloaded, the only solution is reorganization: allocate a larger bucket array, rehash every existing entry into the new buckets, and cut over to the new structure.
Cost of reorganization:
| Cost Factor | Impact | Mitigation |
|---|---|---|
| Time | O(n) to rehash all entries | Schedule during maintenance window |
| Space | 2x space during reorganization | Ensure disk headroom |
| Lock contention | May block concurrent operations | Use online reindex if available |
| Invalidation | Cached directory becomes stale | Application may see stale reads |
| Recovery | If interrupted, index is corrupted | Ensure transaction safety |
Dynamic hashing schemes (extendible hashing, linear hashing) address this limitation by growing incrementally. We'll cover these in depth in later modules. Static hashing remains relevant for stable-size datasets or when the overhead of dynamic schemes isn't justified.
-- PostgreSQL: Monitor hash index health

-- Check index size and usage (pg_stat_user_indexes lacks the index
-- definition, so join pg_indexes to filter for hash indexes)
SELECT s.schemaname,
       s.relname AS tablename,
       s.indexrelname AS indexname,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size,
       s.idx_scan AS index_scans,
       s.idx_tup_read AS tuples_read,
       s.idx_tup_fetch AS tuples_fetched
FROM pg_stat_user_indexes s
JOIN pg_indexes i
  ON i.schemaname = s.schemaname AND i.indexname = s.indexrelname
WHERE i.indexdef LIKE '%USING hash%';

-- Examine index bloat (extension needed)
-- pgstattuple extension provides detailed stats
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT * FROM pgstathashindex('idx_your_hash_index');
-- Returns: version, bucket_pages, overflow_pages, bitmap_pages,
--          unused_pages, live_items, dead_items, free_percent

-- Overflow pages / bucket_pages ratio indicates health
-- Ratio > 0.2 suggests reorganization may help

-- Reindex when needed (blocks writes!)
REINDEX INDEX idx_your_hash_index;

-- PostgreSQL 12+ concurrent reindex (less disruptive)
REINDEX INDEX CONCURRENTLY idx_your_hash_index;

Beyond query limitations, hash indexes present operational challenges that database administrators must consider.
Database-specific limitations:
| Database | Hash Index Support | Notable Limitations |
|---|---|---|
| PostgreSQL | Full support (v10+) | Historically not WAL-logged; fixed in v10 |
| MySQL/InnoDB | No explicit hash indexes | Only adaptive hash (internal, automatic) |
| Oracle | Hash clusters, not indexes | Tables must be designed for hash access |
| SQL Server | No native hash indexes | In-memory OLTP has hash indexes |
| SQLite | No hash indexes | Only B-tree supported |
The InnoDB adaptive hash perspective:
MySQL's InnoDB doesn't expose hash indexes to users. Instead, it maintains an adaptive hash index (AHI) internally: it monitors B+-tree lookups and, for pages searched frequently with the same key pattern, builds in-memory hash entries that short-circuit future lookups, discarding them automatically when access patterns change.
This approach avoids forcing users to choose between hash and tree—the engine makes the decision dynamically based on actual workload.
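If you want to see the AHI at work, MySQL exposes its setting and activity; a sketch (output format varies across MySQL versions):
-- Is the adaptive hash index enabled?
SHOW VARIABLES LIKE 'innodb_adaptive_hash_index';

-- The "INSERT BUFFER AND ADAPTIVE HASH INDEX" section of the status
-- output reports hash searches/s versus non-hash searches/s:
SHOW ENGINE INNODB STATUS;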
When evaluating hash indexes, consider who will operate the system. If your team has deep PostgreSQL expertise and can monitor hash index health, the benefits may outweigh complexity. If the team is less specialized, B+-tree's simpler operational model may be worth the performance trade-off.
Understanding failure modes prevents costly mistakes. Here are scenarios where hash indexes produce disastrous results.
Scenario 1: Wrong workload assumption
Initial assumption: "We only do point queries on user_id"
Created: Hash index on users(user_id)
Reality 6 months later:
- New feature: "Show users who signed up this week"
SELECT * FROM users WHERE created_at > '2024-01-01';
-- user_id hash index useless, full scan
- New feature: "List users alphabetically"
SELECT * FROM users ORDER BY username;
-- user_id hash index useless, full scan + sort
- Business request: "Users with IDs between 1000-2000"
SELECT * FROM users WHERE user_id BETWEEN 1000 AND 2000;
-- Even on indexed column, hash can't help!
Lesson: Requirements evolve. B+-tree's versatility provides insurance against changing needs.
Scenario 2: Pathological hash distribution
Data characteristic: All user emails from same domain
- john@company.com
- jane@company.com
- bob@company.com
- ...
Hash function behavior:
If the hash function is dominated by the early characters of the key:
- h('john@company.com') = 47
- h('jane@company.com') = 48
- h('jill@company.com') = 48
- h('jim@company.com') = 48
Result: Massive clustering in buckets 47-50
All 'j' names in same few buckets → overflow chains
Expected: O(1) lookup
Actual: O(n/few_buckets) lookup ≈ O(n)
Lesson: Hash function quality depends on data distribution. Test with real data, not synthetic.
Scenario 3: Unbounded data growth
Initial sizing:
- Expected: 1 million users, ever
- Bucket count: 100,000
- Load factor: 10 entries/bucket (acceptable)
Actual growth:
- Year 1: 1 million users → works well
- Year 2: 5 million users → performance degrades
- Year 3: 20 million users → overflow chains of 200+ entries
- Year 4: Query latency spikes during peak hours
- Year 5: Desperate reindex during business hours → outage
Post-mortem: "We should have used B+-tree or extendible hashing"
Lesson: Static hash indexes require accurate long-term sizing. Uncertainty favors B+-tree.
If attackers know your hash function, they can craft inputs that all hash to the same bucket. This is a denial-of-service attack: O(1) becomes O(n). Applications accepting untrusted keys must use randomized or keyed hash functions to prevent this. Database hash indexes are typically not exposed to external attackers, but internal data patterns can create similar effects.
We've now covered both sides of hash indexing: the O(1) promise and the constraints that come with it. Let's consolidate this into an actionable decision framework.
| Criterion | Choose Hash If... | Choose B+-Tree If... |
|---|---|---|
| Query patterns | 100% equality lookups | Any range, ordering, or pattern needs |
| Data growth | Stable, predictable size | Unpredictable or rapid growth |
| Composite keys | Always query all columns | Need leftmost prefix access |
| Operational maturity | Strong DBA team, monitoring | Standard ops, less specialization |
| Access patterns | Random key access | Sequential or range access |
| Performance criticality | Sub-millisecond latency required | Standard latency acceptable |
| Default choice | Never (must justify) | Yes (safe default) |
The 'When in Doubt' Rule:
If you're unsure whether to use hash or B+-tree, use B+-tree.
B+-tree is never catastrophically wrong. It may be 2-3x slower for point queries than hash, but it handles all query types, grows gracefully, and is universally understood. Hash indexes are optimizations for specific, well-understood workloads—not general-purpose solutions.
What's next:
Having completed our exploration of the Hash Index Concept module, you now possess deep understanding of both the power and limitations of hash-based indexing. The subsequent modules will build on this foundation, exploring static hashing mechanisms, dynamic hashing schemes (extendible and linear hashing), and comparative analysis with tree-based indexes.
You now understand hash indexing comprehensively: the fundamental concept, hash functions, buckets, point query excellence, and the limitations that constrain its applicability. This balanced knowledge enables you to make informed decisions about when hash indexes are the right tool—and when they're not.