Hash Vs Tree Indexes - Learning Module

Loading content...

0/241

Selection Criteria: The Practical Decision Framework

Applying Knowledge to Decisions

Throughout this module, we've examined hash and tree indexes from multiple angles: fundamental architecture, point query performance, range query capabilities, and space utilization. Now we synthesize this knowledge into a practical decision framework that you can apply to real-world index selection challenges.

The goal isn't to memorize rules, but to develop the analytical capability to evaluate any indexing scenario systematically. By the end of this page, you'll possess a repeatable methodology for making confident, well-reasoned index type decisions.

What You Will Master

By the end of this page, you will possess: a systematic workload analysis methodology, clear decision criteria for hash vs tree selection, awareness of common anti-patterns and their consequences, real-world case studies illustrating decision processes, and confidence to make and defend index type choices.

The Decision Framework Overview

The index selection decision can be approached as a structured evaluation process. Here's the framework we'll develop:

Step 1: Workload Characterization

What query types will use this index?
What's the read/write ratio?
What's the expected data volume and growth?

Step 2: Query Pattern Analysis

Pure equality vs mixed operations?
Range query requirements?
Ordering and aggregation needs?

Step 3: Performance Priority Assessment

Is sub-millisecond latency critical?
Is consistent performance more important than peak performance?
What are the memory/storage constraints?

Step 4: Operational Considerations

Maintenance complexity tolerance?
Rebuild/reorganization capability?
Monitoring and troubleshooting expertise?

Step 5: Decision and Documentation

Make the choice based on evidence
Document reasoning for future reference
Plan for validation and potential revision

The Default Answer

When in doubt, choose B+tree. It's the safer default that handles the widest range of workloads. Only choose hash indexes when you have strong evidence that your workload falls into the narrow category where hash excels AND will continue to do so.

Workload Characterization

Rigorous workload characterization is the foundation of sound index selection. Begin by gathering quantitative data about your access patterns.

Query Type Distribution

Classify queries that will use the proposed index:

Query Classification for Index Selection
Query Type	Example	Hash Suitability	B+Tree Suitability
Point lookup (=)	`WHERE id = 123`	✓ Excellent	✓ Very Good
Bounded range	`WHERE x BETWEEN a AND b`	✗ Cannot use	✓ Excellent
Unbounded range	`WHERE x > a`	✗ Cannot use	✓ Excellent
Prefix match	`WHERE name LIKE 'Smi%'`	✗ Cannot use	✓ Excellent
MIN/MAX	`SELECT MIN(x)`	✗ Full scan	✓ Excellent
ORDER BY	`ORDER BY created_at`	✗ External sort	✓ Excellent
IN list	`WHERE id IN (1,2,3)`	✓ Multiple lookups	✓ Multiple lookups
NOT EQUAL	`WHERE status != 'done'`	✗ Full scan	✗ Usually full scan

Critical Question: What percentage of queries on this column are pure equality lookups?

100% equality: Hash is viable (consider other factors)
95%+ equality with rare ranges: B+tree is safer; range queries would fail with hash
Mixed workload: B+tree is the only viable option

query_analysis_example.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
-- PostgreSQL: Analyze query patterns from pg_stat_statements
SELECT 
    query,
    calls,
    total_time / calls as avg_time_ms,
    CASE 
        WHEN query ~ 'WHERE.*user_id\s*=' AND query !~ 'user_id\s*[<>]' 
        THEN 'equality'
        WHEN query ~ 'WHERE.*user_id\s*BETWEEN' 
        OR query ~ 'WHERE.*user_id\s*[<>]'
        THEN 'range'
        ELSE 'other'
    END as query_type
FROM pg_stat_statements
WHERE query LIKE '%user_id%'
ORDER BY calls DESC
LIMIT 20;
 
-- Result interpretation:
-- If 100% 'equality' → Hash may be viable
-- If ANY 'range' queries → B+tree required

Future Workload Consideration

Don't just analyze current queries—anticipate future needs. A hash index that works today becomes a liability when new features require range queries. Consult with product and business stakeholders about upcoming requirements.

The Hash-Viable Checklist

Hash indexes are appropriate only when all of the following conditions are met. A single 'No' answer means B+tree is the better choice.

Hash Index Viability Checklist

•Query Pattern: Are ALL queries on this column pure equality (= value)? No BETWEEN, <, >, <=, >=, LIKE 'prefix%'?
•Ordering: Will you NEVER need ORDER BY, MIN, or MAX operations on this column?
•Composite Needs: If part of a composite index, will ALL columns always be specified with equality?
•Future Requirements: Are you confident future features won't require range operations?
•Key Distribution: Do keys distribute uniformly, or at least without severe hot spots?
•Data Volume: Is the data volume stable and predictable (for static hashing)?
•Database Support: Does your database fully support hash indexes with crash recovery?
•Performance Criticality: Is the marginal equality-lookup speedup genuinely necessary?

One 'No' = Use B+Tree

This isn't a scoring system—it's a checklist. If ANY answer is 'No' or 'Maybe', the conservative and usually optimal choice is B+tree. Hash indexes are a specialized tool for specialized situations.

Checklist Application Examples

Example 1: User Session Lookup Table

Queries: WHERE session_id = ? (100% equality) ✓
Ordering: Never ordered ✓
Composite: Single column ✓
Future: Sessions don't need ranges ✓
Distribution: UUIDs distribute well ✓
Volume: Sessions are transient, bounded ✓
Support: Check your database ?
Criticality: Session lookup is latency-sensitive ✓

Verdict: Hash is potentially viable—verify database support.

Example 2: Order ID Lookup

Queries: WHERE order_id = ? (currently 100%) ✓
Ordering: Might need ORDER BY order_id for reports ✗

Verdict: B+tree—even the possibility of ordering needs disqualifies hash.

The Visual Decision Tree

The following decision tree encapsulates the selection logic in a visual, step-by-step format. Follow from top to bottom.

                    ┌─────────────────────────────────────┐
                    │   Do ANY queries need range/order?  │
                    │  (BETWEEN, <, >, ORDER BY, MIN/MAX) │
                    └─────────────────┬───────────────────┘
                                      │
              ┌───────────Yes─────────┼─────────No────────┐
              ▼                       │                   ▼
    ┌─────────────────┐               │     ┌─────────────────────┐
    │   USE B+TREE    │               │     │ Is key distribution │
    │  (Only option)  │               │     │   uniform/random?   │
    └─────────────────┘               │     └──────────┬──────────┘
                                      │                │
                                      │     ┌───Yes────┼───No─────┐
                                      │     ▼          │          ▼
                                      │ ┌──────────┐   │  ┌─────────────┐
                                      │ │ Is data  │   │  │  USE B+TREE │
                                      │ │ in-memory│   │  │ (Skew risks)│
                                      │ │ or SSD?  │   │  └─────────────┘
                                      │ └────┬─────┘   │
                                      │      │         │
                                      │  Yes │ No      │
                                      │   ▼  │  ▼      │
                                      │ ┌────┴─────┐   │
                                      │ │ Database │   │
                                      │ │ supports │   │
                                      │ │  hash?   │   │
                                      │ └────┬─────┘   │
                                      │      │         │
                                      │  Yes │ No      │
                                      │   ▼  │  ▼      │
                                      │ ┌────┴────────────┐
                                      │ │  Hash viable!   │
                                      │ │ But B+tree also │
                                      │ │ works well here │
                                      │ └─────────────────┘

Key Insight from the Tree

Notice that every path eventually leads to B+tree being at least acceptable. The decision tree doesn't identify scenarios where hash is required—only scenarios where it's a viable alternative. This reflects the reality that B+tree is the universal solvent of indexing.

Decision Summary Matrix
Scenario	Recommended Index	Confidence
Any range query requirements	B+tree	High (100%)
Pure equality on skewed data	B+tree	High (95%)
Pure equality, uniform, disk-based	B+tree	High (90%)
Pure equality, uniform, in-memory	Hash or B+tree	Medium (both viable)
Unknown or uncertain requirements	B+tree	High (safe default)

Common Anti-Patterns to Avoid

Understanding common mistakes helps avoid costly missteps. Here are anti-patterns we've observed in production systems.

Anti-Pattern 1: Premature Hash Optimization

•What happens: Developer chooses hash index for theoretical O(1) performance without validating workload
•The reality: Six months later, new reporting feature needs range queries → full table scans → performance crisis
•The fix: Add B+tree index alongside hash, or replace hash entirely — double the index maintenance cost
•Prevention: Start with B+tree; switch to hash only with proven, stable equality-only workload

Anti-Pattern 2: Ignoring Key Distribution

•What happens: Hash index deployed on column with skewed distribution (e.g., 'status' with 80% 'active')
•The reality: Hot bucket overwhelmed with entries; overflow chains grow; performance degrades to O(n/B)
•The fix: Expensive reorganization or switch to B+tree
•Prevention: Analyze key distribution before choosing hash; test with production-like data

Anti-Pattern 3: Database-Incompatible Hash Usage

•What happens: Hash index created in PostgreSQL < 10 without WAL logging
•The reality: Database crash → hash index corrupted → must rebuild from scratch
•The fix: Upgrade PostgreSQL or switch to B+tree
•Prevention: Verify database support for hash indexes including crash recovery before deployment

Anti-Pattern 4: Composite Hash Key Misunderstanding

•What happens: Composite hash index on (a, b) created; queries use only column 'a'
•The reality: Hash of (a, b) cannot find entries when only 'a' specified → full scan
•The fix: Create separate index on 'a' alone OR use B+tree composite which supports prefix
•Prevention: Understand that hash composite requires ALL columns; B+tree supports leftmost prefix

The Cost of Anti-Patterns

These anti-patterns don't just cause performance issues—they create operational emergencies. A hash index that worked fine for months suddenly becoming a bottleneck is harder to debug and more disruptive to fix than starting with the right index type.

Real-World Case Studies

Let's walk through detailed case studies showing the decision framework in action.

Case Study 1: Redis-Like Key-Value Cache Layer

Context: Building an in-memory caching layer as a PostgreSQL extension for session data.

Analysis:

Query pattern: 100% equality (GET session_123)
No ordering: Sessions retrieved by exact ID only
Key distribution: UUIDs — excellent uniformity
Data volume: Bounded (active sessions only)
Storage: In-memory exclusively
Database: Custom in-memory structure, not standard indexes

Decision: Hash table — This is exactly the scenario hash excels at. In-memory operation eliminates I/O concerns, pure equality workload, uniform keys. This is what Redis does internally.

Outcome: O(1) lookups achieved, sub-microsecond latency, perfect for caching use case.

Case Study 2: E-Commerce Order Lookup

Context: Primary index on order_id column in orders table.

Analysis:

Query pattern: 95% equality (WHERE order_id = ?)
BUT: Customer service needs order history (WHERE customer_id = ? ORDER BY order_date)
BUT: Analytics needs date ranges (WHERE order_date BETWEEN ...)
Key distribution: Sequential integers — would hash poorly
Data volume: Growing continuously
Storage: Disk-based (100M+ orders)

Decision: B+tree — Even though most lookups are equality, the ordering and range requirements are non-negotiable. Sequential IDs would create hash hot spots anyway.

Outcome: Slightly slower point lookups (still <1ms), but all query patterns supported efficiently.

Case Study 3: IOT Sensor Data Ingestion

Context: High-velocity sensor readings with sensor_id + timestamp keys.

Analysis:

Query pattern: Point lookup by sensor_id + exact timestamp
BUT: Also need time-range queries (sensor_id = X AND timestamp BETWEEN ...)
Key distribution: sensor_id limited (10K sensors), timestamp sequential
Data volume: Billions of rows, continuously growing
Write load: Extremely high (thousands/second)

Decision: B+tree on (sensor_id, timestamp) — Time-range queries are essential for analysis. Composite B+tree handles both patterns.

Alternative considered: Partitioned by time with hash on sensor_id within partition — overly complex for marginal gain.

Outcome: Point and range queries efficient; range queries on time windows critical for dashboards.

Pattern Recognition

Notice how even in Case Study 1, which favored hash, it was for an in-memory, specialized structure—not a standard database index. For general-purpose database indexing, B+tree wins the vast majority of cases.

Database-Specific Guidance

Different databases handle hash indexes differently. Here's practical guidance for major systems.

PostgreSQL

postgres_hash_guidance.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
-- PostgreSQL hash index support
-- Pre-v10: NOT recommended (no WAL logging - crash unsafe)
-- v10+: WAL-logged, but still limited optimization
 
-- Creating a hash index
CREATE INDEX idx_session_id_hash ON sessions USING hash (session_id);
 
-- When hash might help (rare):
-- - Very high equality lookup rate
-- - In-memory working set (fits in shared_buffers)
-- - PostgreSQL 10+
 
-- Usually better: Just use B-tree (default)
CREATE INDEX idx_session_id ON sessions (session_id);
 
-- Check if hash index is being used
EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM sessions WHERE session_id = 'abc123';
 
-- PostgreSQL optimizer may still choose seq scan over hash index!

MySQL/InnoDB

mysql_hash_guidance.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- MySQL/InnoDB: B+tree only for standard tables
-- Hash indexes only available for MEMORY storage engine
 
-- Standard InnoDB table: B+tree automatic
CREATE TABLE orders (
    order_id BIGINT PRIMARY KEY,  -- Clustered B+tree
    customer_id BIGINT,
    INDEX idx_customer (customer_id)  -- Secondary B+tree
);
 
-- MEMORY table with hash (not persistent!)
CREATE TABLE sessions (
    session_id VARCHAR(64) PRIMARY KEY
) ENGINE = MEMORY;
-- Default for MEMORY is HASH, but data lost on restart
 
-- Practical MySQL recommendation: 
-- Use InnoDB with B+tree - it's highly optimized and well-tested

Oracle

oracle_hash_guidance.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
-- Oracle: Hash clusters and hash-partitioned indexes available
 
-- Hash cluster (tables organized by hash)
CREATE CLUSTER sessions_cluster (session_id VARCHAR2(64))
    HASHKEYS 10000
    SIZE 200;
 
CREATE TABLE sessions (
    session_id VARCHAR2(64) PRIMARY KEY,
    data CLOB
) CLUSTER sessions_cluster (session_id);
 
-- Hash-partitioned index (for partitioning, not lookups)
CREATE INDEX idx_orders_hash_part ON orders (order_id)
    GLOBAL PARTITION BY HASH (order_id)
    PARTITIONS 16;
 
-- Oracle recommendation:
-- B-tree indexes for general use
-- Hash clusters only for specific, validated use cases
-- Hash partitioning for partition management, not performance

Database-Agnostic Wisdom

Every major database has invested heavily in B-tree optimization—it's their default for good reason. Hash index support varies from partial (PostgreSQL) to nonexistent for general use (MySQL/InnoDB). When in doubt, trust the database's default: B-tree.

Summary: The Index Selection Mastery

This module has provided a comprehensive comparison of hash and tree indexes. Let's consolidate the core insights into actionable wisdom.

Module Key Takeaways

•Hash indexes sacrifice flexibility for speed—O(1) lookups but no ordering, no ranges, no prefix matching
•B+trees are the universal choice—O(log n) is nearly as fast in practice, with full functionality
•Point query performance is comparable—caching and I/O patterns narrow the theoretical gap
•Range queries decide the choice—any range requirement eliminates hash as an option
•Space usage is similar—B+tree memory efficiency often compensates for structural overhead
•Use the checklist—all conditions must be met for hash to be viable
•Default to B+tree—it's the safe, flexible, well-supported choice for nearly all scenarios

Final Comparison Summary
Dimension	Hash Index	B+Tree Index	Winner
Point queries	O(1)	O(log n)	Hash (slightly)
Range queries	Cannot support	O(log n + k)	B+Tree (decisively)
Ordering	No support	Natural	B+Tree
Space overhead	5-15%	10-20%	Hash (slightly)
Memory efficiency	Poor (uniform)	Excellent (hierarchical)	B+Tree
Flexibility	Very limited	Highly flexible	B+Tree
Database support	Variable/limited	Universal	B+Tree
Overall verdict			B+Tree for 95%+ of cases

Decision Framework Mastered

You now possess the analytical framework to make informed hash vs tree index decisions. Remember: B+tree is the safe default; hash is a specialized optimization for narrow use cases. When you do choose hash, document your reasoning—future maintainers will thank you.

Final Thoughts

The hash versus tree index decision exemplifies a broader engineering principle: flexibility usually beats raw performance for specialized operations. The theoretical O(1) advantage of hashing rarely justifies losing range query capability, ordered access, and prefix matching in real-world systems.

B+trees aren't just 'good enough'—they're excellent general-purpose structures refined over decades. Trust that excellence unless you have compelling, validated evidence for a specialized alternative.

Module Complete

Congratulations on completing the Hash vs Tree Indexes module! You've mastered: fundamental architectural differences, point and range query performance analysis, space utilization trade-offs, and a practical decision framework. Apply this knowledge to make confident, well-reasoned index selection decisions in your database systems.

Selection Criteria: The Practical Decision Framework

Applying Knowledge to Decisions

What You Will Master

The Decision Framework Overview

The index selection decision can be approached as a structured evaluation process. Here's the framework we'll develop:

Step 1: Workload Characterization

What query types will use this index?
What's the read/write ratio?
What's the expected data volume and growth?

Step 2: Query Pattern Analysis

Pure equality vs mixed operations?
Range query requirements?
Ordering and aggregation needs?

Step 3: Performance Priority Assessment

Is sub-millisecond latency critical?
Is consistent performance more important than peak performance?
What are the memory/storage constraints?

Step 4: Operational Considerations

Maintenance complexity tolerance?
Rebuild/reorganization capability?
Monitoring and troubleshooting expertise?

Step 5: Decision and Documentation

Make the choice based on evidence
Document reasoning for future reference
Plan for validation and potential revision

The Default Answer

Workload Characterization

Rigorous workload characterization is the foundation of sound index selection. Begin by gathering quantitative data about your access patterns.

Query Type Distribution

Classify queries that will use the proposed index:

Query Classification for Index Selection
Query Type	Example	Hash Suitability	B+Tree Suitability
Point lookup (=)	`WHERE id = 123`	✓ Excellent	✓ Very Good
Bounded range	`WHERE x BETWEEN a AND b`	✗ Cannot use	✓ Excellent
Unbounded range	`WHERE x > a`	✗ Cannot use	✓ Excellent
Prefix match	`WHERE name LIKE 'Smi%'`	✗ Cannot use	✓ Excellent
MIN/MAX	`SELECT MIN(x)`	✗ Full scan	✓ Excellent
ORDER BY	`ORDER BY created_at`	✗ External sort	✓ Excellent
IN list	`WHERE id IN (1,2,3)`	✓ Multiple lookups	✓ Multiple lookups
NOT EQUAL	`WHERE status != 'done'`	✗ Full scan	✗ Usually full scan

Critical Question: What percentage of queries on this column are pure equality lookups?

100% equality: Hash is viable (consider other factors)
95%+ equality with rare ranges: B+tree is safer; range queries would fail with hash
Mixed workload: B+tree is the only viable option

query_analysis_example.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
-- PostgreSQL: Analyze query patterns from pg_stat_statements
SELECT 
    query,
    calls,
    total_time / calls as avg_time_ms,
    CASE 
        WHEN query ~ 'WHERE.*user_id\s*=' AND query !~ 'user_id\s*[<>]' 
        THEN 'equality'
        WHEN query ~ 'WHERE.*user_id\s*BETWEEN' 
        OR query ~ 'WHERE.*user_id\s*[<>]'
        THEN 'range'
        ELSE 'other'
    END as query_type
FROM pg_stat_statements
WHERE query LIKE '%user_id%'
ORDER BY calls DESC
LIMIT 20;
 
-- Result interpretation:
-- If 100% 'equality' → Hash may be viable
-- If ANY 'range' queries → B+tree required

Future Workload Consideration

The Hash-Viable Checklist

Hash indexes are appropriate only when all of the following conditions are met. A single 'No' answer means B+tree is the better choice.

Hash Index Viability Checklist

•Query Pattern: Are ALL queries on this column pure equality (= value)? No BETWEEN, <, >, <=, >=, LIKE 'prefix%'?
•Ordering: Will you NEVER need ORDER BY, MIN, or MAX operations on this column?
•Composite Needs: If part of a composite index, will ALL columns always be specified with equality?
•Future Requirements: Are you confident future features won't require range operations?
•Key Distribution: Do keys distribute uniformly, or at least without severe hot spots?
•Data Volume: Is the data volume stable and predictable (for static hashing)?
•Database Support: Does your database fully support hash indexes with crash recovery?
•Performance Criticality: Is the marginal equality-lookup speedup genuinely necessary?

One 'No' = Use B+Tree

Checklist Application Examples

Example 1: User Session Lookup Table

Queries: WHERE session_id = ? (100% equality) ✓
Ordering: Never ordered ✓
Composite: Single column ✓
Future: Sessions don't need ranges ✓
Distribution: UUIDs distribute well ✓
Volume: Sessions are transient, bounded ✓
Support: Check your database ?
Criticality: Session lookup is latency-sensitive ✓

Verdict: Hash is potentially viable—verify database support.

Example 2: Order ID Lookup

Queries: WHERE order_id = ? (currently 100%) ✓
Ordering: Might need ORDER BY order_id for reports ✗

Verdict: B+tree—even the possibility of ordering needs disqualifies hash.

The Visual Decision Tree

The following decision tree encapsulates the selection logic in a visual, step-by-step format. Follow from top to bottom.

                    ┌─────────────────────────────────────┐
                    │   Do ANY queries need range/order?  │
                    │  (BETWEEN, <, >, ORDER BY, MIN/MAX) │
                    └─────────────────┬───────────────────┘
                                      │
              ┌───────────Yes─────────┼─────────No────────┐
              ▼                       │                   ▼
    ┌─────────────────┐               │     ┌─────────────────────┐
    │   USE B+TREE    │               │     │ Is key distribution │
    │  (Only option)  │               │     │   uniform/random?   │
    └─────────────────┘               │     └──────────┬──────────┘
                                      │                │
                                      │     ┌───Yes────┼───No─────┐
                                      │     ▼          │          ▼
                                      │ ┌──────────┐   │  ┌─────────────┐
                                      │ │ Is data  │   │  │  USE B+TREE │
                                      │ │ in-memory│   │  │ (Skew risks)│
                                      │ │ or SSD?  │   │  └─────────────┘
                                      │ └────┬─────┘   │
                                      │      │         │
                                      │  Yes │ No      │
                                      │   ▼  │  ▼      │
                                      │ ┌────┴─────┐   │
                                      │ │ Database │   │
                                      │ │ supports │   │
                                      │ │  hash?   │   │
                                      │ └────┬─────┘   │
                                      │      │         │
                                      │  Yes │ No      │
                                      │   ▼  │  ▼      │
                                      │ ┌────┴────────────┐
                                      │ │  Hash viable!   │
                                      │ │ But B+tree also │
                                      │ │ works well here │
                                      │ └─────────────────┘

Key Insight from the Tree

Decision Summary Matrix
Scenario	Recommended Index	Confidence
Any range query requirements	B+tree	High (100%)
Pure equality on skewed data	B+tree	High (95%)
Pure equality, uniform, disk-based	B+tree	High (90%)
Pure equality, uniform, in-memory	Hash or B+tree	Medium (both viable)
Unknown or uncertain requirements	B+tree	High (safe default)

Common Anti-Patterns to Avoid

Understanding common mistakes helps avoid costly missteps. Here are anti-patterns we've observed in production systems.

Anti-Pattern 1: Premature Hash Optimization

•What happens: Developer chooses hash index for theoretical O(1) performance without validating workload
•The reality: Six months later, new reporting feature needs range queries → full table scans → performance crisis
•The fix: Add B+tree index alongside hash, or replace hash entirely — double the index maintenance cost
•Prevention: Start with B+tree; switch to hash only with proven, stable equality-only workload

Anti-Pattern 2: Ignoring Key Distribution

•What happens: Hash index deployed on column with skewed distribution (e.g., 'status' with 80% 'active')
•The reality: Hot bucket overwhelmed with entries; overflow chains grow; performance degrades to O(n/B)
•The fix: Expensive reorganization or switch to B+tree
•Prevention: Analyze key distribution before choosing hash; test with production-like data

Anti-Pattern 3: Database-Incompatible Hash Usage

•What happens: Hash index created in PostgreSQL < 10 without WAL logging
•The reality: Database crash → hash index corrupted → must rebuild from scratch
•The fix: Upgrade PostgreSQL or switch to B+tree
•Prevention: Verify database support for hash indexes including crash recovery before deployment

Anti-Pattern 4: Composite Hash Key Misunderstanding

•What happens: Composite hash index on (a, b) created; queries use only column 'a'
•The reality: Hash of (a, b) cannot find entries when only 'a' specified → full scan
•The fix: Create separate index on 'a' alone OR use B+tree composite which supports prefix
•Prevention: Understand that hash composite requires ALL columns; B+tree supports leftmost prefix

The Cost of Anti-Patterns

Real-World Case Studies

Let's walk through detailed case studies showing the decision framework in action.

Case Study 1: Redis-Like Key-Value Cache Layer

Context: Building an in-memory caching layer as a PostgreSQL extension for session data.

Analysis:

Query pattern: 100% equality (GET session_123)
No ordering: Sessions retrieved by exact ID only
Key distribution: UUIDs — excellent uniformity
Data volume: Bounded (active sessions only)
Storage: In-memory exclusively
Database: Custom in-memory structure, not standard indexes

Decision: Hash table — This is exactly the scenario hash excels at. In-memory operation eliminates I/O concerns, pure equality workload, uniform keys. This is what Redis does internally.

Outcome: O(1) lookups achieved, sub-microsecond latency, perfect for caching use case.

Case Study 2: E-Commerce Order Lookup

Context: Primary index on order_id column in orders table.

Analysis:

Query pattern: 95% equality (WHERE order_id = ?)
BUT: Customer service needs order history (WHERE customer_id = ? ORDER BY order_date)
BUT: Analytics needs date ranges (WHERE order_date BETWEEN ...)
Key distribution: Sequential integers — would hash poorly
Data volume: Growing continuously
Storage: Disk-based (100M+ orders)

Decision: B+tree — Even though most lookups are equality, the ordering and range requirements are non-negotiable. Sequential IDs would create hash hot spots anyway.

Outcome: Slightly slower point lookups (still <1ms), but all query patterns supported efficiently.

Case Study 3: IOT Sensor Data Ingestion

Context: High-velocity sensor readings with sensor_id + timestamp keys.

Analysis:

Query pattern: Point lookup by sensor_id + exact timestamp
BUT: Also need time-range queries (sensor_id = X AND timestamp BETWEEN ...)
Key distribution: sensor_id limited (10K sensors), timestamp sequential
Data volume: Billions of rows, continuously growing
Write load: Extremely high (thousands/second)

Decision: B+tree on (sensor_id, timestamp) — Time-range queries are essential for analysis. Composite B+tree handles both patterns.

Alternative considered: Partitioned by time with hash on sensor_id within partition — overly complex for marginal gain.

Outcome: Point and range queries efficient; range queries on time windows critical for dashboards.

Pattern Recognition

Database-Specific Guidance

Different databases handle hash indexes differently. Here's practical guidance for major systems.

PostgreSQL

postgres_hash_guidance.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
-- PostgreSQL hash index support
-- Pre-v10: NOT recommended (no WAL logging - crash unsafe)
-- v10+: WAL-logged, but still limited optimization
 
-- Creating a hash index
CREATE INDEX idx_session_id_hash ON sessions USING hash (session_id);
 
-- When hash might help (rare):
-- - Very high equality lookup rate
-- - In-memory working set (fits in shared_buffers)
-- - PostgreSQL 10+
 
-- Usually better: Just use B-tree (default)
CREATE INDEX idx_session_id ON sessions (session_id);
 
-- Check if hash index is being used
EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM sessions WHERE session_id = 'abc123';
 
-- PostgreSQL optimizer may still choose seq scan over hash index!

MySQL/InnoDB

mysql_hash_guidance.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- MySQL/InnoDB: B+tree only for standard tables
-- Hash indexes only available for MEMORY storage engine
 
-- Standard InnoDB table: B+tree automatic
CREATE TABLE orders (
    order_id BIGINT PRIMARY KEY,  -- Clustered B+tree
    customer_id BIGINT,
    INDEX idx_customer (customer_id)  -- Secondary B+tree
);
 
-- MEMORY table with hash (not persistent!)
CREATE TABLE sessions (
    session_id VARCHAR(64) PRIMARY KEY
) ENGINE = MEMORY;
-- Default for MEMORY is HASH, but data lost on restart
 
-- Practical MySQL recommendation: 
-- Use InnoDB with B+tree - it's highly optimized and well-tested

Oracle

oracle_hash_guidance.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
-- Oracle: Hash clusters and hash-partitioned indexes available
 
-- Hash cluster (tables organized by hash)
CREATE CLUSTER sessions_cluster (session_id VARCHAR2(64))
    HASHKEYS 10000
    SIZE 200;
 
CREATE TABLE sessions (
    session_id VARCHAR2(64) PRIMARY KEY,
    data CLOB
) CLUSTER sessions_cluster (session_id);
 
-- Hash-partitioned index (for partitioning, not lookups)
CREATE INDEX idx_orders_hash_part ON orders (order_id)
    GLOBAL PARTITION BY HASH (order_id)
    PARTITIONS 16;
 
-- Oracle recommendation:
-- B-tree indexes for general use
-- Hash clusters only for specific, validated use cases
-- Hash partitioning for partition management, not performance

Database-Agnostic Wisdom

Summary: The Index Selection Mastery

This module has provided a comprehensive comparison of hash and tree indexes. Let's consolidate the core insights into actionable wisdom.

Module Key Takeaways

•Hash indexes sacrifice flexibility for speed—O(1) lookups but no ordering, no ranges, no prefix matching
•B+trees are the universal choice—O(log n) is nearly as fast in practice, with full functionality
•Point query performance is comparable—caching and I/O patterns narrow the theoretical gap
•Range queries decide the choice—any range requirement eliminates hash as an option
•Space usage is similar—B+tree memory efficiency often compensates for structural overhead
•Use the checklist—all conditions must be met for hash to be viable
•Default to B+tree—it's the safe, flexible, well-supported choice for nearly all scenarios

Final Comparison Summary
Dimension	Hash Index	B+Tree Index	Winner
Point queries	O(1)	O(log n)	Hash (slightly)
Range queries	Cannot support	O(log n + k)	B+Tree (decisively)
Ordering	No support	Natural	B+Tree
Space overhead	5-15%	10-20%	Hash (slightly)
Memory efficiency	Poor (uniform)	Excellent (hierarchical)	B+Tree
Flexibility	Very limited	Highly flexible	B+Tree
Database support	Variable/limited	Universal	B+Tree
Overall verdict			B+Tree for 95%+ of cases

Decision Framework Mastered

Final Thoughts

Module Complete