Database Management SystemsHash vs Tree Indexes

Hash vs Tree Indexes: Comparative Analysis

LevelIntermediate

Duration75 mins

TopicHash vs Tree Indexes

3 / 5

Range Query Analysis: The Tree Index Advantage

The Fundamental Limitation of Hashing

If point queries are the battleground where hash and tree indexes compete closely, range queries are the terrain where the battle is decided before it begins. Hash indexes simply cannot perform range queries efficiently—not because of poor implementation, but because of fundamental mathematical properties of hash functions.

This limitation isn't a minor inconvenience. Range queries are ubiquitous in real-world database applications: finding all orders placed between two dates, all employees with salaries in a specific band, all products priced below a threshold. Understanding why hash indexes fail here—and how B+trees excel—is essential for practical index design.

The Critical Distinction

Point queries ask 'Where is the record with key = X?' Range queries ask 'Where are all records with A ≤ key ≤ B?' Hash indexes can only answer the first question. Tree indexes answer both. This asymmetry often makes hash indexes impractical despite their point-query performance.

Why Hashing Destroys Order

To understand why hash indexes cannot support range queries, we must examine what hash functions do to key ordering.

The Purpose of Hashing

Hash functions are designed to uniformly distribute keys across buckets. A good hash function achieves this by deliberately scrambling the relationship between similar keys. If consecutive keys mapped to consecutive buckets, the distribution would be poor—all hot data would cluster in adjacent buckets.

Consider what happens to a sequence of integer keys:

Key     Hash Value (mod 8 buckets)
100  →  3
101  →  7
102  →  1
103  →  4
104  →  0
105  →  6
106  →  2
107  →  5

Notice that consecutive keys (100, 101, 102, ...) map to seemingly random buckets (3, 7, 1, ...). This is by design—it's exactly what makes hash indexes efficient for point lookups.

The Ordering Problem

The same property that makes hash indexes fast for equality lookups—destroying key ordering—makes them useless for range queries. To find all keys between 100 and 107, you'd need to examine every bucket because those keys are scattered randomly across the entire structure.

Mathematical Formalization

A hash function h: K → B maps keys from key space K to bucket space B. For range queries on [a, b], we need all keys k where a ≤ k ≤ b.

Given keys k₁ < k₂:

Tree index guarantee: k₁ and k₂ are stored in relative order; everything between them is located between them
Hash index non-guarantee: h(k₁) and h(k₂) have no predictable relationship; h(k₁) might be greater than, less than, or equal to h(k₂)

To support range queries, we would need a hash function where:

If k₁ < k₂, then h(k₁) ≤ h(k₂)

But this is precisely what makes a hash function poor—keys would cluster based on their natural ordering, defeating the purpose of hashing for uniform distribution.

Order Preservation: Hash vs Tree
Property	Hash Function	B+Tree Organization
Order preservation	Deliberately destroyed	Carefully maintained
Key k < key k'	h(k) has no relation to h(k')	k appears before k' in traversal
Adjacent keys	May be in distant buckets	Are in same or adjacent leaves
Range [a,b] location	Scattered across all buckets	Contiguous in leaf sequence

Range Query Cost Analysis

Let's quantify the performance difference for range queries between hash and tree indexes.

Hash Index Range Query 'Algorithm'

Since hash indexes provide no ordering, a range query on a hash-indexed column requires:

Full table/index scan: Examine every bucket
Filter application: Check each entry against range predicate
Result collection: Gather matching entries

Cost: O(n) where n is total number of entries—independent of how selective the range is.

hash_range_query.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
function hashRangeQuery(lowerBound, upperBound, hashTable):
    // TERRIBLE: Must scan all buckets - O(n) always
    results = []
    
    for bucketNum in range(0, hashTable.numBuckets):
        bucket = readBucket(bucketNum)  // Every single bucket!
        
        for entry in bucket.entries:
            if lowerBound <= entry.key <= upperBound:
                results.append(entry)
        
        // Don't forget overflow chains
        overflow = bucket.overflowPointer
        while overflow != null:
            for entry in overflow.entries:
                if lowerBound <= entry.key <= upperBound:
                    results.append(entry)
            overflow = overflow.nextOverflow
    
    return results
 
// Analysis:
// - Table has 1 million rows
// - Range query matches 1,000 rows (0.1% selectivity)
// - Hash: STILL reads all 1 million entries
// - Cost: O(n) regardless of selectivity

B+Tree Range Query Algorithm

B+trees exploit their ordered structure to efficiently locate and scan range results:

Locate start: Traverse from root to leaf containing lower bound (O(log n))
Sequential scan: Follow leaf sibling pointers, collecting entries in range
Terminate: Stop when upper bound exceeded or end of index reached

Cost: O(log n + k) where k is the number of matching entries

btree_range_query.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
function btreeRangeQuery(lowerBound, upperBound, btree):
    // EFFICIENT: O(log n + k) where k = result count
    results = []
    
    // Step 1: Find starting leaf - O(log n)
    currentNode = btree.root
    while not currentNode.isLeaf:
        childIndex = findChildIndex(currentNode, lowerBound)
        currentNode = readNode(currentNode.children[childIndex])
    
    // Step 2: Locate first matching entry within leaf
    startIndex = binarySearch(currentNode.keys, lowerBound)
    
    // Step 3: Sequential scan through leaves - O(k)
    while currentNode != null:
        for i in range(startIndex, len(currentNode.keys)):
            if currentNode.keys[i] > upperBound:
                return results  // Done! Upper bound exceeded
            results.append(currentNode.entries[i])
        
        // Follow sibling pointer to next leaf
        currentNode = currentNode.nextSibling
        startIndex = 0  // Start from beginning of next leaf
    
    return results
 
// Analysis:
// - Table has 1 million rows
// - Range query matches 1,000 rows (0.1% selectivity)
// - B+tree: log₂₀₀(1M) ≈ 3 levels + 1,000 entries ≈ 1,003 accesses
// - Cost: O(log n + k) - proportional to result size

Range Query Cost Comparison (1 Million Rows)
Selectivity	Result Size	Hash Index Cost	B+Tree Cost	Speedup Factor
0.01%	100 rows	~1,000,000 accesses	~103 accesses	~10,000x
0.1%	1,000 rows	~1,000,000 accesses	~1,003 accesses	~1,000x
1%	10,000 rows	~1,000,000 accesses	~10,003 accesses	~100x
10%	100,000 rows	~1,000,000 accesses	~100,003 accesses	~10x
100%	1,000,000 rows	~1,000,000 accesses	~1,000,003 accesses	~1x

The Selectivity Insight

For highly selective range queries (small result sets), B+trees outperform hash by orders of magnitude. Only when the range covers nearly the entire table does hash performance approach B+tree. In practice, useful range queries are selective—if you're fetching everything, you'd use a table scan anyway.

Types of Range Queries

Range queries come in several forms, all of which B+trees handle efficiently and hash indexes cannot support:

Bounded Range Queries

SELECT * FROM orders WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31'
SELECT * FROM products WHERE price >= 10.00 AND price <= 50.00
SELECT * FROM employees WHERE hire_date >= '2020-01-01'

Unbounded Range Queries (One-Sided)

SELECT * FROM transactions WHERE amount > 10000  -- Find high-value transactions
SELECT * FROM inventory WHERE quantity < 10      -- Find low-stock items
SELECT * FROM users WHERE last_login < NOW() - INTERVAL '90 days'  -- Inactive users

Prefix Queries (String Ranges)

Prefix matching is a special case of range query:

SELECT * FROM customers WHERE name LIKE 'Smith%'
SELECT * FROM products WHERE sku LIKE 'ELEC-%'

These translate to range queries:

'Smith%' → BETWEEN 'Smith' AND 'Smithzzz...' (conceptually)

B+trees support this naturally through sorted string ordering. Hash indexes cannot—hashing 'Smith' and 'Smithson' produces unrelated bucket addresses.

Min/Max Queries

SELECT MIN(price), MAX(price) FROM products WHERE category = 'Electronics'
SELECT * FROM orders ORDER BY order_date DESC LIMIT 10

These leverage the ordering of B+trees:

MIN: Navigate to leftmost leaf (O(log n))
MAX: Navigate to rightmost leaf (O(log n))
ORDER BY: Follow leaf chain in order (no sorting needed)

Hash indexes require full scans for all of these.

Ordering-Dependent Query Support
Query Type	Example	B+Tree	Hash Index
Bounded range	`BETWEEN a AND b`	✓ O(log n + k)	✗ O(n)
Unbounded range	`> a` or `< b`	✓ O(log n + k)	✗ O(n)
Prefix match	`LIKE 'abc%'`	✓ O(log n + k)	✗ O(n)
MIN value	`MIN(column)`	✓ O(log n)	✗ O(n)
MAX value	`MAX(column)`	✓ O(log n)	✗ O(n)
ORDER BY	`ORDER BY column`	✓ O(n) no sort	✗ O(n log n)
Top-K / LIMIT	`LIMIT k`	✓ O(log n + k)	✗ O(n + k log k)

The Sequential I/O Advantage

Beyond algorithmic complexity, B+trees provide a crucial I/O efficiency advantage for range queries: sequential access at the leaf level.

Leaf Node Linking

B+tree leaf nodes are linked via sibling pointers, forming a doubly-linked list:

[Leaf₀] ↔ [Leaf₁] ↔ [Leaf₂] ↔ [Leaf₃] ↔ [Leaf₄] ↔ ...

This structure enables sequential scanning after locating the range start. On traditional spinning disks, sequential reads are dramatically faster than random reads:

Random disk seek: 5-10 ms
Sequential read (per page after initial seek): 0.1 ms

A range query reading 100 consecutive pages:

Random access: 100 × 5ms = 500ms
Sequential access: 5ms + 99 × 0.1ms ≈ 15ms

The difference can be 30x or more for disk-based systems.

Physical Clustering

B+trees often exhibit good physical clustering on disk:

Leaf nodes created around the same time are often physically adjacent
Sequential inserts produce sequential disk layout
Range scans benefit from read-ahead/prefetching

Modern storage systems exploit this:

OS read-ahead: Pre-fetches pages anticipated to be needed
SSD internal parallelism: Sequential reads maximize channel utilization
Database prefetching: Explicitly requests upcoming pages during scans

SSD Considerations

SSDs have similar random and sequential latency for small reads, but sequential reads still win due to: (1) reduced CPU/driver overhead per I/O, (2) better internal parallelism, and (3) larger effective read sizes through prefetching. The advantage is smaller than with HDDs (perhaps 3-5x) but still significant.

I/O Pattern Comparison for Range Queries
Aspect	Hash Index Scan	B+Tree Range Scan
Access pattern	Random (all buckets)	Sequential (leaf chain)
Prefetch effectiveness	None (unpredictable)	High (predictable path)
Disk seek count	One per bucket	One initial + sequential
Cache utilization	Poor (no locality)	Excellent (adjacent pages)
CPU efficiency	Low (scattered data)	High (cache-friendly)

Real-World Range Query Scenarios

Range queries are pervasive in production databases. Let's examine common scenarios where the hash vs tree index decision has direct performance implications.

E-Commerce: Order Date Ranges

ecommerce_range_queries.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- Daily sales report: All orders from yesterday
SELECT * FROM orders 
WHERE order_date >= CURRENT_DATE - INTERVAL '1 day'
  AND order_date < CURRENT_DATE;
 
-- Monthly revenue calculation
SELECT SUM(total_amount) FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2024-02-01';
 
-- Customer order history (customer_id equality + date range)
SELECT * FROM orders
WHERE customer_id = 12345
  AND order_date >= '2023-01-01'
ORDER BY order_date DESC;
 
-- Index recommendation:
-- B+tree on (customer_id, order_date) - composite supports both!
-- Hash on customer_id - point lookup only, loses date ordering

Financial Systems: Transaction Ranges

financial_range_queries.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- Fraud detection: High-value transactions
SELECT * FROM transactions WHERE amount > 10000;
 
-- Account statement generation
SELECT * FROM transactions
WHERE account_id = 'ABC123'
  AND transaction_time BETWEEN '2024-01-01' AND '2024-01-31'
ORDER BY transaction_time;
 
-- Interest calculation: Balance over time
SELECT date, balance FROM daily_balances
WHERE account_id = 'ABC123'
  AND date >= '2023-01-01'
ORDER BY date;
 
-- Compliance reporting: Transactions in amount ranges
SELECT COUNT(*), SUM(amount) FROM transactions
WHERE amount >= 1000 AND amount < 10000;

Analytics and Logging: Time-Series Data

analytics_range_queries.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
-- Dashboards: Recent metrics
SELECT metric_name, AVG(value) FROM metrics
WHERE timestamp >= NOW() - INTERVAL '1 hour'
GROUP BY metric_name;
 
-- Log analysis: Errors in time window
SELECT * FROM logs
WHERE log_level = 'ERROR'
  AND created_at >= NOW() - INTERVAL '24 hours'
ORDER BY created_at DESC;
 
-- User activity analysis
SELECT date_trunc('day', event_time), COUNT(*) 
FROM user_events
WHERE user_id = 12345
  AND event_time >= '2024-01-01'
GROUP BY 1 ORDER BY 1;
 
-- Time-series data is fundamentally about ranges!

The Range Query Reality

Virtually every business domain involves time-based queries (date ranges), value-based filtering (price/amount ranges), or sorted output (ORDER BY). Hash indexes, despite their point-query speed, cannot participate in any of these operations efficiently. This is why B+trees dominate in practice.

Index Selection Implications

The range query limitation of hash indexes has profound implications for index selection strategy.

The Mix Problem

Most real workloads combine point and range queries on the same columns:

-- Point query: Specific customer lookup
SELECT * FROM customers WHERE customer_id = 12345;

-- Range query: Customers in region
SELECT * FROM customers WHERE region = 'WEST' AND created_at > '2024-01-01';

With a hash index on customer_id:

First query: O(1) excellent
Second query: Full scan (cannot use hash index at all)

With a B+tree index on customer_id:

First query: O(log n) very good
Other indexes can be created for range queries

The hash index provides marginal point-query benefit but blocks any range capability on that column.

Hash Index Suitability Criteria (All Must Apply)

•Pure equality workload: 100% of queries on this column are = value comparisons
•No ordering requirements: Never need ORDER BY, MIN, MAX on this column
•No range filters: Never need >, <, >=, <=, BETWEEN on this column
•No prefix matching: Never need LIKE 'prefix%' patterns
•Separate from composite needs: This column won't be combined with range columns in a composite index
•Uniform key distribution: Data distributes evenly across hash buckets

The Practical Test

Ask yourself: 'Will I EVER need to do a range query, ordering, or prefix match on this column?' If the answer is anything other than 'definitely not,' choose a B+tree. The modest point-query performance difference rarely justifies losing all range query capability.

Covering Index Considerations

Range queries often benefit from covering indexes—indexes that contain all columns needed to satisfy a query without accessing the base table. This optimization interacts differently with hash and tree structures.

B+Tree Covering Indexes

B+tree indexes can include additional columns beyond the search key:

CREATE INDEX idx_orders_covering ON orders (order_date, customer_id, total_amount);

-- This query can be answered entirely from the index:
SELECT customer_id, total_amount FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';

Benefits:

Index-only scans avoid table lookups entirely
Leaves contain all needed data in sorted order
Sequential scan through index is highly efficient

Hash Index Covering Limitations

Hash indexes can theoretically store additional columns in bucket entries, but the utility is limited:

Random bucket access means no sequential scan benefit
Range queries still require full index scan
The scenario where hash + covering helps is extremely narrow (pure equality + need included columns)

Practical Example

Consider an analytics query:

SELECT product_category, SUM(quantity), SUM(revenue)
FROM order_items
WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY product_category;

With B+tree covering index on (order_date, product_category, quantity, revenue):

Seek to order_date = '2024-01-01' (O(log n))
Sequential scan through year's data in index
Aggregate as scanning—no table access
All data in sorted date order

With hash index: Cannot use index for range; falls back to table scan.

The Coverage Asymmetry

B+tree covering indexes transform expensive range queries into efficient index-only scans. Hash indexes cannot provide this benefit for range operations—they can only cover equality lookups, which are already fast anyway.

Summary: The Insurmountable Ordering Gap

Range queries represent the clearest differentiation between hash and tree indexes. This isn't a matter of performance tuning or clever implementation—it's a fundamental mathematical limitation.

Key Takeaways

•Hash functions destroy ordering by design—uniform distribution requires scrambling key relationships, making range queries impossible
•Hash index range queries require full scans—O(n) cost regardless of selectivity, often making them slower than table scans
•B+trees support ranges efficiently—O(log n + k) cost proportional to result size, with sequential I/O benefits
•Many query types require ordering—BETWEEN, <, >, LIKE 'prefix%', ORDER BY, MIN, MAX all depend on key ordering
•Real workloads mix query types—the rare workload with 100% equality queries doesn't justify hash indexes when ranges might be needed later
•Sequential I/O compounds the advantage—B+tree leaf linking enables efficient range scans that hash structures cannot match

The Bottom Line

If your workload includes even occasional range queries, hash indexes are not viable for those columns. The O(1) point-query advantage is irrelevant when range queries become full table scans. This single limitation explains why B+trees dominate production databases.

What's Next

We've examined point queries (where hash indexes compete) and range queries (where they fail). The next page explores space usage—how hash and tree indexes differ in memory and storage footprint, including overhead, fill factors, and growth characteristics.

Range Query Analysis Complete

You now understand why hash indexes fundamentally cannot support range queries and the dramatic performance implications. This limitation is the primary reason B+trees are the default choice in nearly every production database system. Next, we examine space utilization differences.

3 / 5

Loading learning content...

Database Management SystemsHash vs Tree Indexes

Hash vs Tree Indexes: Comparative Analysis

LevelIntermediate

Duration75 mins

TopicHash vs Tree Indexes

3 / 5

Range Query Analysis: The Tree Index Advantage

The Fundamental Limitation of Hashing

The Critical Distinction

Why Hashing Destroys Order

To understand why hash indexes cannot support range queries, we must examine what hash functions do to key ordering.

The Purpose of Hashing

Consider what happens to a sequence of integer keys:

Key     Hash Value (mod 8 buckets)
100  →  3
101  →  7
102  →  1
103  →  4
104  →  0
105  →  6
106  →  2
107  →  5

Notice that consecutive keys (100, 101, 102, ...) map to seemingly random buckets (3, 7, 1, ...). This is by design—it's exactly what makes hash indexes efficient for point lookups.

The Ordering Problem

Mathematical Formalization

A hash function h: K → B maps keys from key space K to bucket space B. For range queries on [a, b], we need all keys k where a ≤ k ≤ b.

Given keys k₁ < k₂:

Tree index guarantee: k₁ and k₂ are stored in relative order; everything between them is located between them
Hash index non-guarantee: h(k₁) and h(k₂) have no predictable relationship; h(k₁) might be greater than, less than, or equal to h(k₂)

To support range queries, we would need a hash function where:

If k₁ < k₂, then h(k₁) ≤ h(k₂)

But this is precisely what makes a hash function poor—keys would cluster based on their natural ordering, defeating the purpose of hashing for uniform distribution.

Order Preservation: Hash vs Tree
Property	Hash Function	B+Tree Organization
Order preservation	Deliberately destroyed	Carefully maintained
Key k < key k'	h(k) has no relation to h(k')	k appears before k' in traversal
Adjacent keys	May be in distant buckets	Are in same or adjacent leaves
Range [a,b] location	Scattered across all buckets	Contiguous in leaf sequence

Range Query Cost Analysis

Let's quantify the performance difference for range queries between hash and tree indexes.

Hash Index Range Query 'Algorithm'

Since hash indexes provide no ordering, a range query on a hash-indexed column requires:

Full table/index scan: Examine every bucket
Filter application: Check each entry against range predicate
Result collection: Gather matching entries

Cost: O(n) where n is total number of entries—independent of how selective the range is.

hash_range_query.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
function hashRangeQuery(lowerBound, upperBound, hashTable):
    // TERRIBLE: Must scan all buckets - O(n) always
    results = []
    
    for bucketNum in range(0, hashTable.numBuckets):
        bucket = readBucket(bucketNum)  // Every single bucket!
        
        for entry in bucket.entries:
            if lowerBound <= entry.key <= upperBound:
                results.append(entry)
        
        // Don't forget overflow chains
        overflow = bucket.overflowPointer
        while overflow != null:
            for entry in overflow.entries:
                if lowerBound <= entry.key <= upperBound:
                    results.append(entry)
            overflow = overflow.nextOverflow
    
    return results
 
// Analysis:
// - Table has 1 million rows
// - Range query matches 1,000 rows (0.1% selectivity)
// - Hash: STILL reads all 1 million entries
// - Cost: O(n) regardless of selectivity

B+Tree Range Query Algorithm

B+trees exploit their ordered structure to efficiently locate and scan range results:

Locate start: Traverse from root to leaf containing lower bound (O(log n))
Sequential scan: Follow leaf sibling pointers, collecting entries in range
Terminate: Stop when upper bound exceeded or end of index reached

Cost: O(log n + k) where k is the number of matching entries

btree_range_query.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
function btreeRangeQuery(lowerBound, upperBound, btree):
    // EFFICIENT: O(log n + k) where k = result count
    results = []
    
    // Step 1: Find starting leaf - O(log n)
    currentNode = btree.root
    while not currentNode.isLeaf:
        childIndex = findChildIndex(currentNode, lowerBound)
        currentNode = readNode(currentNode.children[childIndex])
    
    // Step 2: Locate first matching entry within leaf
    startIndex = binarySearch(currentNode.keys, lowerBound)
    
    // Step 3: Sequential scan through leaves - O(k)
    while currentNode != null:
        for i in range(startIndex, len(currentNode.keys)):
            if currentNode.keys[i] > upperBound:
                return results  // Done! Upper bound exceeded
            results.append(currentNode.entries[i])
        
        // Follow sibling pointer to next leaf
        currentNode = currentNode.nextSibling
        startIndex = 0  // Start from beginning of next leaf
    
    return results
 
// Analysis:
// - Table has 1 million rows
// - Range query matches 1,000 rows (0.1% selectivity)
// - B+tree: log₂₀₀(1M) ≈ 3 levels + 1,000 entries ≈ 1,003 accesses
// - Cost: O(log n + k) - proportional to result size

Range Query Cost Comparison (1 Million Rows)
Selectivity	Result Size	Hash Index Cost	B+Tree Cost	Speedup Factor
0.01%	100 rows	~1,000,000 accesses	~103 accesses	~10,000x
0.1%	1,000 rows	~1,000,000 accesses	~1,003 accesses	~1,000x
1%	10,000 rows	~1,000,000 accesses	~10,003 accesses	~100x
10%	100,000 rows	~1,000,000 accesses	~100,003 accesses	~10x
100%	1,000,000 rows	~1,000,000 accesses	~1,000,003 accesses	~1x

The Selectivity Insight

Types of Range Queries

Range queries come in several forms, all of which B+trees handle efficiently and hash indexes cannot support:

Bounded Range Queries

SELECT * FROM orders WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31'
SELECT * FROM products WHERE price >= 10.00 AND price <= 50.00
SELECT * FROM employees WHERE hire_date >= '2020-01-01'

Unbounded Range Queries (One-Sided)

SELECT * FROM transactions WHERE amount > 10000  -- Find high-value transactions
SELECT * FROM inventory WHERE quantity < 10      -- Find low-stock items
SELECT * FROM users WHERE last_login < NOW() - INTERVAL '90 days'  -- Inactive users

Prefix Queries (String Ranges)

Prefix matching is a special case of range query:

SELECT * FROM customers WHERE name LIKE 'Smith%'
SELECT * FROM products WHERE sku LIKE 'ELEC-%'

These translate to range queries:

'Smith%' → BETWEEN 'Smith' AND 'Smithzzz...' (conceptually)

B+trees support this naturally through sorted string ordering. Hash indexes cannot—hashing 'Smith' and 'Smithson' produces unrelated bucket addresses.

Min/Max Queries

SELECT MIN(price), MAX(price) FROM products WHERE category = 'Electronics'
SELECT * FROM orders ORDER BY order_date DESC LIMIT 10

These leverage the ordering of B+trees:

MIN: Navigate to leftmost leaf (O(log n))
MAX: Navigate to rightmost leaf (O(log n))
ORDER BY: Follow leaf chain in order (no sorting needed)

Hash indexes require full scans for all of these.

Ordering-Dependent Query Support
Query Type	Example	B+Tree	Hash Index
Bounded range	`BETWEEN a AND b`	✓ O(log n + k)	✗ O(n)
Unbounded range	`> a` or `< b`	✓ O(log n + k)	✗ O(n)
Prefix match	`LIKE 'abc%'`	✓ O(log n + k)	✗ O(n)
MIN value	`MIN(column)`	✓ O(log n)	✗ O(n)
MAX value	`MAX(column)`	✓ O(log n)	✗ O(n)
ORDER BY	`ORDER BY column`	✓ O(n) no sort	✗ O(n log n)
Top-K / LIMIT	`LIMIT k`	✓ O(log n + k)	✗ O(n + k log k)

The Sequential I/O Advantage

Beyond algorithmic complexity, B+trees provide a crucial I/O efficiency advantage for range queries: sequential access at the leaf level.

Leaf Node Linking

B+tree leaf nodes are linked via sibling pointers, forming a doubly-linked list:

[Leaf₀] ↔ [Leaf₁] ↔ [Leaf₂] ↔ [Leaf₃] ↔ [Leaf₄] ↔ ...

This structure enables sequential scanning after locating the range start. On traditional spinning disks, sequential reads are dramatically faster than random reads:

Random disk seek: 5-10 ms
Sequential read (per page after initial seek): 0.1 ms

A range query reading 100 consecutive pages:

Random access: 100 × 5ms = 500ms
Sequential access: 5ms + 99 × 0.1ms ≈ 15ms

The difference can be 30x or more for disk-based systems.

Physical Clustering

B+trees often exhibit good physical clustering on disk:

Leaf nodes created around the same time are often physically adjacent
Sequential inserts produce sequential disk layout
Range scans benefit from read-ahead/prefetching

Modern storage systems exploit this:

OS read-ahead: Pre-fetches pages anticipated to be needed
SSD internal parallelism: Sequential reads maximize channel utilization
Database prefetching: Explicitly requests upcoming pages during scans

SSD Considerations

I/O Pattern Comparison for Range Queries
Aspect	Hash Index Scan	B+Tree Range Scan
Access pattern	Random (all buckets)	Sequential (leaf chain)
Prefetch effectiveness	None (unpredictable)	High (predictable path)
Disk seek count	One per bucket	One initial + sequential
Cache utilization	Poor (no locality)	Excellent (adjacent pages)
CPU efficiency	Low (scattered data)	High (cache-friendly)

Real-World Range Query Scenarios

Range queries are pervasive in production databases. Let's examine common scenarios where the hash vs tree index decision has direct performance implications.

E-Commerce: Order Date Ranges

ecommerce_range_queries.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- Daily sales report: All orders from yesterday
SELECT * FROM orders 
WHERE order_date >= CURRENT_DATE - INTERVAL '1 day'
  AND order_date < CURRENT_DATE;
 
-- Monthly revenue calculation
SELECT SUM(total_amount) FROM orders
WHERE order_date >= '2024-01-01' AND order_date < '2024-02-01';
 
-- Customer order history (customer_id equality + date range)
SELECT * FROM orders
WHERE customer_id = 12345
  AND order_date >= '2023-01-01'
ORDER BY order_date DESC;
 
-- Index recommendation:
-- B+tree on (customer_id, order_date) - composite supports both!
-- Hash on customer_id - point lookup only, loses date ordering

Financial Systems: Transaction Ranges

financial_range_queries.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- Fraud detection: High-value transactions
SELECT * FROM transactions WHERE amount > 10000;
 
-- Account statement generation
SELECT * FROM transactions
WHERE account_id = 'ABC123'
  AND transaction_time BETWEEN '2024-01-01' AND '2024-01-31'
ORDER BY transaction_time;
 
-- Interest calculation: Balance over time
SELECT date, balance FROM daily_balances
WHERE account_id = 'ABC123'
  AND date >= '2023-01-01'
ORDER BY date;
 
-- Compliance reporting: Transactions in amount ranges
SELECT COUNT(*), SUM(amount) FROM transactions
WHERE amount >= 1000 AND amount < 10000;

Analytics and Logging: Time-Series Data

analytics_range_queries.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
-- Dashboards: Recent metrics
SELECT metric_name, AVG(value) FROM metrics
WHERE timestamp >= NOW() - INTERVAL '1 hour'
GROUP BY metric_name;
 
-- Log analysis: Errors in time window
SELECT * FROM logs
WHERE log_level = 'ERROR'
  AND created_at >= NOW() - INTERVAL '24 hours'
ORDER BY created_at DESC;
 
-- User activity analysis
SELECT date_trunc('day', event_time), COUNT(*) 
FROM user_events
WHERE user_id = 12345
  AND event_time >= '2024-01-01'
GROUP BY 1 ORDER BY 1;
 
-- Time-series data is fundamentally about ranges!

The Range Query Reality

Index Selection Implications

The range query limitation of hash indexes has profound implications for index selection strategy.

The Mix Problem

Most real workloads combine point and range queries on the same columns:

-- Point query: Specific customer lookup
SELECT * FROM customers WHERE customer_id = 12345;

-- Range query: Customers in region
SELECT * FROM customers WHERE region = 'WEST' AND created_at > '2024-01-01';

With a hash index on customer_id:

First query: O(1) excellent
Second query: Full scan (cannot use hash index at all)

With a B+tree index on customer_id:

First query: O(log n) very good
Other indexes can be created for range queries

The hash index provides marginal point-query benefit but blocks any range capability on that column.

Hash Index Suitability Criteria (All Must Apply)

•Pure equality workload: 100% of queries on this column are = value comparisons
•No ordering requirements: Never need ORDER BY, MIN, MAX on this column
•No range filters: Never need >, <, >=, <=, BETWEEN on this column
•No prefix matching: Never need LIKE 'prefix%' patterns
•Separate from composite needs: This column won't be combined with range columns in a composite index
•Uniform key distribution: Data distributes evenly across hash buckets

The Practical Test

Covering Index Considerations

B+Tree Covering Indexes

B+tree indexes can include additional columns beyond the search key:

CREATE INDEX idx_orders_covering ON orders (order_date, customer_id, total_amount);

-- This query can be answered entirely from the index:
SELECT customer_id, total_amount FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';

Benefits:

Index-only scans avoid table lookups entirely
Leaves contain all needed data in sorted order
Sequential scan through index is highly efficient

Hash Index Covering Limitations

Hash indexes can theoretically store additional columns in bucket entries, but the utility is limited:

Random bucket access means no sequential scan benefit
Range queries still require full index scan
The scenario where hash + covering helps is extremely narrow (pure equality + need included columns)

Practical Example

Consider an analytics query:

SELECT product_category, SUM(quantity), SUM(revenue)
FROM order_items
WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY product_category;

With B+tree covering index on (order_date, product_category, quantity, revenue):

Seek to order_date = '2024-01-01' (O(log n))
Sequential scan through year's data in index
Aggregate as scanning—no table access
All data in sorted date order

With hash index: Cannot use index for range; falls back to table scan.

The Coverage Asymmetry

Summary: The Insurmountable Ordering Gap

Range queries represent the clearest differentiation between hash and tree indexes. This isn't a matter of performance tuning or clever implementation—it's a fundamental mathematical limitation.

Key Takeaways

•Hash functions destroy ordering by design—uniform distribution requires scrambling key relationships, making range queries impossible
•Hash index range queries require full scans—O(n) cost regardless of selectivity, often making them slower than table scans
•B+trees support ranges efficiently—O(log n + k) cost proportional to result size, with sequential I/O benefits
•Many query types require ordering—BETWEEN, <, >, LIKE 'prefix%', ORDER BY, MIN, MAX all depend on key ordering
•Real workloads mix query types—the rare workload with 100% equality queries doesn't justify hash indexes when ranges might be needed later
•Sequential I/O compounds the advantage—B+tree leaf linking enables efficient range scans that hash structures cannot match

The Bottom Line

What's Next

Range Query Analysis Complete

3 / 5