Every technology choice involves trade-offs. Hash indexes achieve O(1) point query performance by sacrificing capabilities that B+-trees provide effortlessly. Understanding these limitations isn't pessimism—it's engineering maturity. The worst outcome isn't choosing B+-tree over hash and losing some performance. The worst outcome is choosing hash and discovering, after deployment, that your workload requires capabilities hash indexes fundamentally cannot provide.
This page examines hash index limitations with the same rigor we applied to their strengths. By the end, you'll know exactly when to reach for hash indexes—and when to step back.
By the end of this page, you will understand the fundamental query patterns hash indexes cannot support, operational challenges unique to hash structures, disaster scenarios to avoid, and a comprehensive decision framework for index selection.
The most fundamental limitation of hash indexes is their complete inability to support range queries. This isn't a performance issue—it's a structural impossibility.
Why range queries fail:
Hash functions deliberately destroy key ordering. Consider keys 100, 101, and 102:
h(100) = 47
h(101) = 892
h(102) = 3
Consecutive keys hash to completely unrelated buckets. There is no path from bucket 47 to bucket 892 to bucket 3. The structure that would enable traversal simply doesn't exist.
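You can observe this order-destroying behavior directly in PostgreSQL; here is a minimal sketch using the built-in hashtext function, applied to text literals purely for illustration:
-- Illustration: consecutive keys produce unrelated hash values
SELECT hashtext('100') AS h_100,
       hashtext('101') AS h_101,
       hashtext('102') AS h_102;
-- The three returned integers bear no ordering relationship to one
-- another, so there is no bucket-to-bucket path a range scan could follow.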
Contrast with B+-tree:
In a B+-tree, keys 100, 101, 102 are stored sequentially in leaf nodes. A range query WHERE key BETWEEN 100 AND 200 starts at 100 and follows leaf pointers until reaching 200. The structure inherently supports ordered traversal.
| Query Pattern | Hash Index | B+-Tree |
|---|---|---|
| WHERE key > 100 | Full table scan required | Index range scan, O(log n + k) |
| WHERE key BETWEEN 100 AND 200 | Full table scan required | Index range scan |
| WHERE key < 50 | Full table scan required | Index range scan |
| ORDER BY key | Cannot contribute | Provides sorted output directly |
| MIN(key) / MAX(key) | Full table scan required | First/last leaf, O(log n) |
| GROUP BY key (sorted) | Cannot contribute to sort | Provides sorted input for grouping |
The cost of range queries with hash indexes:
When a query requires range access on a hash-indexed column, the database must either fall back to a full table scan or use a different index on that column (a B+-tree, if one exists); the hash index contributes nothing.
Example scenario:
-- You create a hash index for point queries
CREATE INDEX idx_order_date_hash ON orders USING HASH (order_date);
-- This query cannot use the hash index
SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
-- The optimizer will use:
-- 1. Seq Scan on orders (full table scan)
-- 2. Any B+-tree index on order_date (if exists)
-- The hash index is useless for this query
If your workload includes ANY range queries on the indexed column, a hash index is the wrong choice. Even 5% range queries mean the hash index provides zero value 5% of the time while consuming space. Consider: is the 95% point query speedup worth maintaining two indexes?
Beyond range queries, hash indexes cannot contribute to any operation requiring sorted data.
ORDER BY clause:
SELECT * FROM employees ORDER BY employee_id;
With a B+-tree on employee_id, the optimizer can retrieve rows in sorted order directly—no sorting step needed. With a hash index, the database must first fetch the rows in whatever order the scan produces them, then perform an explicit sort before returning results.
For large result sets, this sorting step can be expensive (O(n log n)) and may require disk-based sorting if results don't fit in memory.
GROUP BY optimization:
SELECT department_id, COUNT(*)
FROM employees
GROUP BY department_id;
If data arrives presorted by department_id, aggregation is efficient: scan once, accumulate counts as groups change. B+-trees enable this. Hash indexes deliver data in random order, forcing the optimizer to either build an in-memory hash table of groups (hash aggregation) or sort the rows before grouping.
Both require O(n) extra space or O(n log n) extra time.
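A quick way to see this trade-off is to compare plans; a sketch for the employees query above (actual plans vary with PostgreSQL version, data volume, and statistics):
EXPLAIN
SELECT department_id, COUNT(*)
FROM employees
GROUP BY department_id;
-- Without sorted input the planner typically picks HashAggregate
-- (extra memory for the group table) or Sort + GroupAggregate (extra
-- time for the sort). A B+-tree on department_id can instead feed a
-- pre-sorted Index Scan into GroupAggregate, avoiding both costs.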
-- Demo: Hash index cannot optimize ORDER BY
CREATE TABLE products (
    product_id SERIAL,  -- no PRIMARY KEY, so no implicit B+-tree index
    name VARCHAR(100),
    price NUMERIC(10,2)
);

INSERT INTO products (name, price)
SELECT 'Product_' || i, random() * 100
FROM generate_series(1, 1000000) i;

-- Create hash index
CREATE INDEX idx_product_hash ON products USING HASH (product_id);

-- Query with ORDER BY
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM products ORDER BY product_id LIMIT 100;

-- Output shows:
-- Sort (cost=... rows=1000000)
--   Sort Key: product_id
--   Sort Method: top-N heapsort  Memory: ...
--   ->  Seq Scan on products
--         Buffers: shared hit=...
--
-- Notice: The hash index is NOT used!
-- PostgreSQL performs a table scan + sort

-- Now create B+-tree index
CREATE INDEX idx_product_btree ON products USING BTREE (product_id);

-- Same query
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM products ORDER BY product_id LIMIT 100;

-- Output shows:
-- Limit (cost=0.42..3.14 rows=100)
--   ->  Index Scan using idx_product_btree on products
--         Buffers: shared hit=4
--
-- B+-tree provides sorted access directly!
-- Only 4 buffer accesses for 100 rows

-- The difference:
-- Hash: Full scan + sort = O(n log n), many I/Os
-- B+-tree: Index scan = O(log n + 100), ~4 I/Os

Some queries have implicit ordering needs. UNION (without ALL) requires sorted or hashed deduplication. Certain subquery optimizations rely on sorted access. The optimizer may find creative uses for B+-tree ordering that you didn't explicitly request.
For composite (multi-column) hash indexes, an additional limitation emerges: partial key lookups are impossible.
The problem:
Consider a hash index on (department_id, employee_id):
CREATE INDEX idx_dept_emp_hash ON employees
USING HASH (department_id, employee_id);
This index hashes the combination of both columns into a single bucket address:
bucket = hash(department_id || employee_id)
What works:
-- Full composite key provided
SELECT * FROM employees
WHERE department_id = 10 AND employee_id = 500;
-- Hash is computed: h(10, 500) = bucket 47 ✓
What doesn't work:
-- Only first column provided
SELECT * FROM employees WHERE department_id = 10;
-- Cannot compute hash: need employee_id too!
-- Falls back to table scan ✗
Contrast with B+-tree:
A B+-tree on (department_id, employee_id) stores entries sorted by department_id first, then by employee_id within each department. This organization enables:
-- Full key: Navigate to exact position
WHERE department_id = 10 AND employee_id = 500; ✓
-- Leftmost prefix: Navigate to department, scan
WHERE department_id = 10; ✓
-- Range on first column: Navigate and scan
WHERE department_id BETWEEN 10 AND 20; ✓
-- Range on second column (within first)
WHERE department_id = 10 AND employee_id > 400; ✓
The B+-tree leftmost prefix rule:
For a B+-tree index on (A, B, C), these queries can use the index:
- A = 1 AND B = 2 AND C = 3 (full key)
- A = 1 AND B = 2 (leftmost two columns)
- A = 1 (leftmost column alone)
- A = 1 AND B = 2 AND C > 5 (equality on a prefix, range on the next column)
Hash equivalent:
For a hash index on (A, B, C), only this query uses the index:
- A = 1 AND B = 2 AND C = 3 (equality on all three columns)
Every other combination requires table scan.
| Query Pattern | B+-Tree on (A,B,C) | Hash on (A,B,C) |
|---|---|---|
| A = 1 AND B = 2 AND C = 3 | Full index lookup ✓ | Hash lookup ✓ |
| A = 1 AND B = 2 | Prefix scan ✓ | Table scan ✗ |
| A = 1 | Prefix scan ✓ | Table scan ✗ |
| A = 1 AND C = 3 (skip B) | Partial scan, filter ✓ | Table scan ✗ |
| A BETWEEN 1 AND 5 | Range scan ✓ | Table scan ✗ |
| B = 2 (skip A) | Table scan ✗ | Table scan ✗ |
If you need lookups by department_id alone AND by (department_id, employee_id), a single hash index cannot serve both. You'd need either: (1) a B+-tree that serves both, or (2) separate hash indexes on each pattern (expensive). This is a major reason B+-trees dominate for composite keys.
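For comparison, a single composite B+-tree covers both access patterns; a minimal sketch using the employees table from the examples above (index name illustrative):
-- One B+-tree serves both the prefix lookup and the full-key lookup
CREATE INDEX idx_dept_emp_btree ON employees (department_id, employee_id);

-- Leftmost-prefix lookup: usable
SELECT * FROM employees WHERE department_id = 10;

-- Full composite-key lookup: usable
SELECT * FROM employees
WHERE department_id = 10 AND employee_id = 500;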
Hash indexes cannot support pattern matching queries (LIKE, regular expressions). This limitation follows from the same principle as the range-query limitation: hashing destroys the character-by-character structure that makes pattern matching work.
Why LIKE queries fail:
-- Hash index on email column
CREATE INDEX idx_email_hash ON users USING HASH (email);
-- This query CANNOT use the hash index
SELECT * FROM users WHERE email LIKE 'john%';
The hash of 'john@example.com' (say, 847) has no relationship to the hash of 'john.doe@example.com' (say, 2931). The shared 'john' prefix provides no advantage.
B+-tree advantage:
B+-trees on strings store entries in lexicographic order:
- jane@example.com
- john.doe@example.com
- john@example.com
- jordan@example.com
All keys sharing the 'john' prefix end up adjacent in the leaves.
Prefix queries (LIKE 'john%') can find the first 'john...' entry and scan until the prefix no longer matches—efficient range scan.
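A sketch of what this looks like in PostgreSQL, using the users table from the earlier example (note: with a non-"C" collation, the text_pattern_ops operator class is needed for a B+-tree to serve prefix LIKE queries; the index name is illustrative):
-- B+-tree suitable for prefix matching regardless of collation
CREATE INDEX idx_email_prefix ON users (email text_pattern_ops);

EXPLAIN SELECT * FROM users WHERE email LIKE 'john%';
-- Expect an Index Scan / Bitmap Index Scan over just the 'john...' range
-- rather than a sequential scan (exact plan depends on table statistics).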
| Pattern Type | Example | B+-Tree | Hash |
|---|---|---|---|
| Prefix match | LIKE 'john%' | Range scan ✓ | Table scan ✗ |
| Suffix match | LIKE '%@gmail.com' | Table scan ✗ | Table scan ✗ |
| Contains | LIKE '%john%' | Table scan ✗ | Table scan ✗ |
| Single wildcard | LIKE 'jo_n' | Range + filter ✓ | Table scan ✗ |
| Regex | ~ '^john[0-9]+' | Table scan ✗ | Table scan ✗ |
Note on suffix and contains patterns:
Neither B+-tree nor hash index helps with suffix or contains patterns on standard indexed columns. For these patterns, specialized indexes are needed:
- Trigram indexes (e.g., PostgreSQL's pg_trgm) for contains and suffix patterns
- Full-text search indexes for word-level matching
- A B+-tree on a reversed copy of the column for suffix patterns
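As one example of the trigram option, a sketch assuming the pg_trgm extension is available (index name illustrative):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_email_trgm ON users USING GIN (email gin_trgm_ops);

-- These patterns can now be served by the trigram index
-- (planner-dependent; very short patterns may still scan the table):
SELECT * FROM users WHERE email LIKE '%@gmail.com';
SELECT * FROM users WHERE email LIKE '%john%';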
But for prefix matching—common in autocomplete, search suggestions, and filtering—B+-trees work and hash indexes don't.
You might wonder: could we hash the prefix and search? No—the hash of 'john' differs from the hash of 'john@example.com'. Hash functions process the entire input. Partial input produces unrelated output. This is by design (good for distribution) but prevents pattern matching.
Traditional (static) hash indexes are created with a fixed number of buckets. When data grows beyond this capacity, performance degrades—and recovering requires costly reorganization.
The static hashing problem:
Initial state:
- 10,000 buckets
- 50,000 records
- Load factor: 5 entries/bucket ✓
After growth:
- 10,000 buckets (unchanged)
- 500,000 records
- Load factor: 50 entries/bucket ✗
With 50 entries per bucket, overflow chains grow long. What was O(1) becomes O(50)—still constant, but a very different constant. If chains grow to disk pages, each lookup requires multiple I/Os.
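To put rough numbers on it (assuming, purely for illustration, that a bucket page holds about 20 entries):
- 5 entries/bucket: every bucket fits in one page, so a lookup costs about 1 I/O
- 50 entries/bucket: each bucket chains across roughly 3 pages, so a lookup costs 2-3 I/Os plus the CPU cost of walking the chain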
B+-tree contrast:
B+-trees grow organically: pages split as they fill, and a new root level is added only when the tree outgrows its current height. No global reorganization is ever required.
A B+-tree with 10 million records has height ~4. A B+-tree with 100 million records has height ~5. The performance delta is one additional I/O—graceful degradation, not collapse.
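A rough calculation behind those heights (assuming a fanout of about 100 entries per internal node; real fanouts depend on key and page size):
- height 3 reaches roughly 100^3 = 1 million keys
- height 4 reaches roughly 100^4 = 100 million keys
So growing from 10 million to 100 million records adds at most one level, which is one extra I/O per lookup.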
Reorganization (rehashing):
When a static hash index becomes overloaded, the only solution is reorganization: allocate a larger bucket array, rehash every existing entry into the new buckets, and cut over to the new structure.
Cost of reorganization:
| Cost Factor | Impact | Mitigation |
|---|---|---|
| Time | O(n) to rehash all entries | Schedule during maintenance window |
| Space | 2x space during reorganization | Ensure disk headroom |
| Lock contention | May block concurrent operations | Use online reindex if available |
| Invalidation | Cached directory becomes stale | Application may see stale reads |
| Recovery | If interrupted, index is corrupted | Ensure transaction safety |
Dynamic hashing schemes (extendible hashing, linear hashing) address this limitation by growing incrementally. We'll cover these in depth in later modules. Static hashing remains relevant for stable-size datasets or when the overhead of dynamic schemes isn't justified.
-- PostgreSQL: Monitor hash index health

-- Check index size and usage (pg_stat_user_indexes lacks the index
-- definition, so join pg_indexes to filter for hash indexes)
SELECT s.schemaname,
       s.relname AS tablename,
       s.indexrelname AS indexname,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size,
       s.idx_scan AS index_scans,
       s.idx_tup_read AS tuples_read,
       s.idx_tup_fetch AS tuples_fetched
FROM pg_stat_user_indexes s
JOIN pg_indexes i
  ON i.schemaname = s.schemaname AND i.indexname = s.indexrelname
WHERE i.indexdef LIKE '%USING hash%';

-- Examine index bloat (extension needed)
-- pgstattuple extension provides detailed stats
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT * FROM pgstathashindex('idx_your_hash_index');
-- Returns: version, bucket_pages, overflow_pages, bitmap_pages,
--          unused_pages, live_items, dead_items, free_percent

-- Overflow pages / bucket_pages ratio indicates health
-- Ratio > 0.2 suggests reorganization may help

-- Reindex when needed (blocks writes!)
REINDEX INDEX idx_your_hash_index;

-- PostgreSQL 12+ concurrent reindex (less disruptive)
REINDEX INDEX CONCURRENTLY idx_your_hash_index;

Beyond query limitations, hash indexes present operational challenges that database administrators must consider.
Database-specific limitations:
| Database | Hash Index Support | Notable Limitations |
|---|---|---|
| PostgreSQL | Full support (v10+) | Historically not WAL-logged; fixed in v10 |
| MySQL/InnoDB | No explicit hash indexes | Only adaptive hash (internal, automatic) |
| Oracle | Hash clusters, not indexes | Tables must be designed for hash access |
| SQL Server | No native hash indexes | In-memory OLTP has hash indexes |
| SQLite | No hash indexes | Only B-tree supported |
The InnoDB adaptive hash perspective:
MySQL's InnoDB doesn't expose hash indexes to users. Instead, it maintains an adaptive hash index (AHI) internally: it monitors B+-tree lookups and, for pages searched frequently with the same key pattern, builds in-memory hash entries that short-circuit future lookups, discarding them automatically when access patterns change.
This approach avoids forcing users to choose between hash and tree—the engine makes the decision dynamically based on actual workload.
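If you want to see the AHI at work, MySQL exposes its setting and activity; a sketch (output format varies across MySQL versions):
-- Is the adaptive hash index enabled?
SHOW VARIABLES LIKE 'innodb_adaptive_hash_index';

-- The "INSERT BUFFER AND ADAPTIVE HASH INDEX" section of the status
-- output reports hash searches/s versus non-hash searches/s:
SHOW ENGINE INNODB STATUS;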
When evaluating hash indexes, consider who will operate the system. If your team has deep PostgreSQL expertise and can monitor hash index health, the benefits may outweigh complexity. If the team is less specialized, B+-tree's simpler operational model may be worth the performance trade-off.
Understanding failure modes prevents costly mistakes. Here are scenarios where hash indexes produce disastrous results.
Scenario 1: Wrong workload assumption
Initial assumption: "We only do point queries on user_id"
Created: Hash index on users(user_id)
Reality 6 months later:
- New feature: "Show users who signed up this week"
SELECT * FROM users WHERE created_at > '2024-01-01';
-- user_id hash index useless, full scan
- New feature: "List users alphabetically"
SELECT * FROM users ORDER BY username;
-- user_id hash index useless, full scan + sort
- Business request: "Users with IDs between 1000-2000"
SELECT * FROM users WHERE user_id BETWEEN 1000 AND 2000;
-- Even on indexed column, hash can't help!
Lesson: Requirements evolve. B+-tree's versatility provides insurance against changing needs.
Scenario 2: Pathological hash distribution
Data characteristic: All user emails from same domain
- john@company.com
- jane@company.com
- bob@company.com
- ...
Hash function behavior:
If the hash function is dominated by the early characters of the key:
- h('john@company.com') = 47
- h('jane@company.com') = 48
- h('jill@company.com') = 48
- h('jim@company.com') = 48
Result: Massive clustering in buckets 47-50
All 'j' names in same few buckets → overflow chains
Expected: O(1) lookup
Actual: O(n/few_buckets) lookup ≈ O(n)
Lesson: Hash function quality depends on data distribution. Test with real data, not synthetic.
Scenario 3: Unbounded data growth
Initial sizing:
- Expected: 1 million users, ever
- Bucket count: 100,000
- Load factor: 10 entries/bucket (acceptable)
Actual growth:
- Year 1: 1 million users → works well
- Year 2: 5 million users → performance degrades
- Year 3: 20 million users → overflow chains of 200+ entries
- Year 4: Query latency spikes during peak hours
- Year 5: Desperate reindex during business hours → outage
Post-mortem: "We should have used B+-tree or extendible hashing"
Lesson: Static hash indexes require accurate long-term sizing. Uncertainty favors B+-tree.
If attackers know your hash function, they can craft inputs that all hash to the same bucket. This is a denial-of-service attack: O(1) becomes O(n). Applications accepting untrusted keys must use randomized or keyed hash functions to prevent this. Database hash indexes are typically not exposed to external attackers, but internal data patterns can create similar effects.
We've now covered both sides of hash indexing: the O(1) promise and the constraints that come with it. Let's consolidate this into an actionable decision framework.
| Criterion | Choose Hash If... | Choose B+-Tree If... |
|---|---|---|
| Query patterns | 100% equality lookups | Any range, ordering, or pattern needs |
| Data growth | Stable, predictable size | Unpredictable or rapid growth |
| Composite keys | Always query all columns | Need leftmost prefix access |
| Operational maturity | Strong DBA team, monitoring | Standard ops, less specialization |
| Access patterns | Random key access | Sequential or range access |
| Performance criticality | Sub-millisecond latency required | Standard latency acceptable |
| Default choice | Never (must justify) | Yes (safe default) |
The 'When in Doubt' Rule:
If you're unsure whether to use hash or B+-tree, use B+-tree.
B+-tree is never catastrophically wrong. It may be 2-3x slower for point queries than hash, but it handles all query types, grows gracefully, and is universally understood. Hash indexes are optimizations for specific, well-understood workloads—not general-purpose solutions.
What's next:
Having completed our exploration of the Hash Index Concept module, you now possess deep understanding of both the power and limitations of hash-based indexing. The subsequent modules will build on this foundation, exploring static hashing mechanisms, dynamic hashing schemes (extendible and linear hashing), and comparative analysis with tree-based indexes.
You now understand hash indexing comprehensively: the fundamental concept, hash functions, buckets, point query excellence, and the limitations that constrain its applicability. This balanced knowledge enables you to make informed decisions about when hash indexes are the right tool—and when they're not.