Every database query ultimately reduces to a fundamental question: Where is the data I need? Tree-based indexes like B+-trees answer this question by traversing a hierarchical structure—a journey of O(log n) steps through increasingly specific regions of the keyspace. But there exists a more direct approach, one that promises to answer this question in essentially constant time, regardless of how much data you have.
This approach is hash-based indexing, and it represents one of the most elegant applications of mathematics to data management. Instead of searching through a tree structure, hash indexing computes the exact storage location from the search key itself. The data doesn't need to be found—its location is calculated.
By the end of this page, you will understand how hash functions transform the indexing problem from a search operation into a computation. You'll grasp the fundamental insight that makes hash indexing possible, appreciate its position within the broader landscape of index structures, and recognize the architectural constraints that make this technique both powerful and specialized.
To appreciate hash indexing, we must first understand the problem it solves. Consider a database table with millions of records. Without an index, answering the query SELECT * FROM employees WHERE employee_id = 12345 requires a full table scan—reading potentially millions of records to find the one that matches.
Tree-based indexes like B+-trees solve this by organizing data hierarchically. Starting from the root, we follow the appropriate child pointers based on key comparisons until we reach the leaf containing our target. For a well-balanced tree with high fanout, this requires only 3-4 disk I/O operations even for tables with billions of rows.
But here's the philosophical question: Why search at all?
If we could compute the storage location directly from the key value—without any searching—we could answer point queries with a single I/O operation. This is the fundamental insight behind hash indexing: transform the search problem into a computation problem.
Consider an analogy from the physical world. A library organizes books on shelves using the Dewey Decimal System—a hierarchical numbering scheme that requires understanding the classification to locate a book. You navigate from broad categories to specific subcategories to individual shelves.
Now imagine a different system: each book has a unique identifier, and a formula tells you exactly which shelf and slot that book occupies. You don't navigate; you compute and retrieve. This is the essence of hashing.
The mathematical formulation:
address = h(key)
Where:
- key is the search value (e.g., employee_id = 12345)
- h() is a hash function
- address is the storage location (bucket) where records with this key are stored

This simple equation encapsulates the power of hash indexing: O(1) average-case lookup time for point queries.
A hash function is a mathematical function that maps arbitrary-sized input (the search key) to a fixed-size output (the bucket address). In the context of database indexing, the hash function must satisfy specific properties that differ somewhat from hash functions used in cryptography or general-purpose programming.
Key properties for database hash functions:

- Deterministic: the same key must always hash to the same bucket
- Uniform: keys should spread evenly across buckets to avoid hot spots
- Fast to compute: the function runs on every lookup and insert, so its cost must be negligible compared to an I/O
Common hash function approaches for databases:
1. Modular Hashing (Division Method)
The simplest and most common approach:
h(key) = key mod M
Where M is the number of buckets. For best distribution, M should be a prime number not close to a power of 2. For example, with M = 101 buckets, h(12345) = 12345 mod 101 = 23.
2. Multiplicative Hashing
Uses the fractional part of a multiplication:
h(key) = floor(M × (key × A mod 1))
Where A is a constant (often the golden ratio ≈ 0.618034). This approach is less sensitive to the choice of M.
3. Folding
Partitions the key into segments and combines them:
key = 123456789
Fold: 123 + 456 + 789 = 1368
h(key) = 1368 mod M
4. Mid-Square Method
Squares the key and extracts middle digits:
key = 4567
key² = 20857489
Middle digits: 8574
h(key) = 8574 mod M
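Neither folding nor the mid-square method appears in the code sample later on this page, so here is a minimal sketch of both; the function names and the segment/width defaults are illustrative choices, not standard APIs:

```python
def folding_hash(key: int, num_buckets: int, segment_size: int = 3) -> int:
    """Folding: split the key's digits into fixed-size segments and sum them."""
    digits = str(key)
    total = sum(int(digits[i:i + segment_size])
                for i in range(0, len(digits), segment_size))
    return total % num_buckets

def mid_square_hash(key: int, num_buckets: int, width: int = 4) -> int:
    """Mid-square: square the key and extract the middle `width` digits."""
    squared = str(key * key)
    start = len(squared) // 2 - width // 2
    middle = int(squared[start:start + width])
    return middle % num_buckets

print(folding_hash(123456789, 101))   # folds 123 + 456 + 789 = 1368
print(mid_square_hash(4567, 101))     # 4567² = 20857489 → middle digits 8574
```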
For non-numeric keys (names, email addresses, product codes), the string must first be converted to a numeric value. Common approaches include summing ASCII/Unicode values, polynomial accumulation (treating the string as a base-128 number), or using established string hash functions like djb2, FNV, or MurmurHash.
```python
import math

def modular_hash(key: int, num_buckets: int) -> int:
    """
    Simple modular hash function.
    Best when num_buckets is a prime number.
    """
    return key % num_buckets

def string_hash(key: str, num_buckets: int) -> int:
    """
    Polynomial accumulation hash for strings.
    Uses Horner's rule for efficient computation.
    """
    hash_value = 0
    a = 31  # A prime multiplier
    for char in key:
        # Multiply previous hash by 'a' and add current char value
        hash_value = hash_value * a + ord(char)
    return hash_value % num_buckets

def multiplicative_hash(key: int, num_buckets: int) -> int:
    """
    Multiplicative hash using the golden ratio.
    Less sensitive to bucket count choice.
    """
    A = (math.sqrt(5) - 1) / 2  # Golden ratio conjugate ≈ 0.618034
    fractional_part = (key * A) % 1
    return int(num_buckets * fractional_part)

# Example usage
print(f"modular_hash(12345, 101) = {modular_hash(12345, 101)}")
print(f"string_hash('Employee', 101) = {string_hash('Employee', 101)}")
print(f"multiplicative_hash(12345, 100) = {multiplicative_hash(12345, 100)}")
```

Understanding how the hash function connects to physical storage is crucial. The output of the hash function isn't a raw disk address—it's a bucket number. This bucket number is then translated into an actual storage location through the database's storage manager.
The translation chain:
Search Key → Hash Function → Bucket Number → Bucket Address → Disk Block
Let's trace through a concrete example:
| Step | Operation | Result |
|---|---|---|
| 1 | Apply hash function h(12345) | Hash value = 21 |
| 2 | Map to bucket number | Bucket #21 |
| 3 | Look up bucket directory | Bucket #21 → Disk block 1847 |
| 4 | Read disk block | Load block 1847 into buffer |
| 5 | Scan bucket for exact key | Find record with employee_id = 12345 |
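A minimal sketch of that chain in code; the bucket count and directory contents are hypothetical, picked so the numbers match the trace above:

```python
NUM_BUCKETS = 79  # Prime; chosen so h(12345) = 21, matching the trace

# Hypothetical bucket directory: bucket number -> disk block number.
bucket_directory = {21: 1847}

def locate(key: int) -> str:
    bucket_num = key % NUM_BUCKETS        # Steps 1-2: hash key to a bucket number
    block = bucket_directory[bucket_num]  # Step 3: directory lookup
    # Steps 4-5 (reading the block and scanning its entries)
    # would go through the buffer manager in a real DBMS.
    return f"key {key} -> bucket {bucket_num} -> disk block {block}"

print(locate(12345))  # key 12345 -> bucket 21 -> disk block 1847
```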
The anatomy of a hash index structure:
A hash index in a DBMS typically consists of:
1. Bucket Directory (or Header Page): maps each bucket number to the disk address of its primary page.

2. Bucket Pages (or Primary Pages): fixed-size pages holding the (key, record_pointer) entries for one bucket.

3. Overflow Pages: extra pages chained to a primary page when it fills up.
Space allocation:
With M buckets and a page size of 4KB that can hold 100 entries each, the primary pages can hold up to 100 × M entries. Past that point—and in practice well before it, because distribution is never perfectly even—inserts spill into overflow pages.
Like B+-tree indexes, hash indexes typically store (key, record_pointer) pairs, not the full records themselves. The record pointer (RID) identifies the actual data page and slot where the complete record resides. Some systems support 'clustered' hash indexes where the full record is stored in buckets, but this is less common.
Hash indexing and tree-based indexing (particularly B+-trees) represent fundamentally different approaches to the same problem. Understanding their contrasts clarifies when each is appropriate.
The philosophical difference:
B+-tree philosophy: Organize keys by their relative ordering. Navigate through levels of increasingly refined key ranges to locate targets.
Hash index philosophy: Compute location directly from key value. Ordering is irrelevant; only equality matters.
This philosophical difference manifests in dramatic operational characteristics:
| Operation | Hash Index | B+-Tree |
|---|---|---|
| Equality lookup (key = value) | O(1) average case, one I/O | O(log n), typically 3-4 I/Os |
| Range query (key > a AND key < b) | Not supported, requires full scan | O(log n + k), efficient traversal |
| Ordered retrieval (ORDER BY key) | Not supported, random order | Natural ordering, no extra cost |
| Prefix matching (LIKE 'abc%') | Not supported | Supported efficiently |
| MIN/MAX queries | Requires full scan | O(log n), leftmost/rightmost leaf |
| Space overhead | Lower (no internal nodes) | Higher (internal node pages) |
| Insert complexity | O(1) average, may trigger rehash | O(log n), may trigger split |
| Delete complexity | O(1) average | O(log n), may trigger merge |
Hash functions deliberately destroy key ordering—that's how they achieve uniform distribution. Keys 100, 101, and 102 may hash to buckets 47, 3, and 89 respectively. This is a feature for distribution, but a fatal limitation for range queries. You cannot 'traverse' from key 100 to key 200 in a hash index.
The elegance of hash indexing faces a mathematical reality: the pigeonhole principle. If you have more keys than buckets (which is almost always the case), multiple keys must share the same bucket. This is called a collision.
Why collisions are inevitable:
Suppose we have:

- 100,000 employee records
- 10,000 buckets

On average each bucket will contain 10 records. When we look up employee_id = 12345, we'll find its bucket, but that bucket contains roughly 10 different records. We must scan through them to find the exact match.
This is still O(1) under reasonable assumptions:

If the hash function distributes keys uniformly, each bucket holds about n/m entries (the load factor), and scanning those few entries within a single already-loaded page adds no extra I/O. The lookup cost stays constant no matter how large the table grows, as long as the bucket count grows with it.

The O(1) guarantee assumes:

- The hash function spreads keys uniformly across buckets
- The load factor is kept bounded (buckets don't grow without limit)
- Overflow chains stay short, so a lookup rarely touches more than one page

The short simulation below illustrates how uniform hashing behaves in practice.
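As a rough illustration, a random assignment stands in for a uniform hash here; the exact counts will vary from run to run:

```python
import random
from collections import Counter

NUM_KEYS = 100_000
NUM_BUCKETS = 10_000

# Stand-in for a uniform hash: assign each key to a random bucket.
occupancy = Counter(random.randrange(NUM_BUCKETS) for _ in range(NUM_KEYS))

print(f"Average entries per bucket: {NUM_KEYS / NUM_BUCKETS:.1f}")
print(f"Fullest bucket holds {max(occupancy.values())} entries")

# Bucket-size distribution (buckets with zero entries don't appear here).
sizes = Counter(occupancy.values())
for size in sorted(sizes):
    print(f"  {sizes[size]:5d} buckets hold {size} entries")
```

The average lands at 10 entries per bucket, but variance means some buckets hold noticeably more—exactly the effect the load-factor discussion below addresses.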
Collision resolution strategies:
Databases handle collisions through several mechanisms:
1. Chaining (Most Common in Databases): each bucket keeps a linked chain of overflow pages; colliding entries are appended to the chain. The fuller example below demonstrates this approach.

2. Open Addressing (Less Common in DBMS): on collision, probe other slots in the table (e.g., linear probing) until a free one is found; see the sketch after this list.

3. Cuckoo Hashing: use two hash functions, giving each key two candidate buckets; an insert may displace an existing entry to its alternate bucket.
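For contrast with the chaining example that follows, here is a minimal in-memory sketch of open addressing with linear probing; the class and method names are illustrative, not from any particular system:

```python
class LinearProbingTable:
    """Open addressing: colliding keys probe forward to the next free slot."""
    EMPTY = object()

    def __init__(self, num_slots: int = 11):
        self.slots = [LinearProbingTable.EMPTY] * num_slots

    def insert(self, key: int, value) -> None:
        for probe in range(len(self.slots)):
            i = (key + probe) % len(self.slots)  # Next slot on collision
            if self.slots[i] is LinearProbingTable.EMPTY or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table full")

    def search(self, key: int):
        for probe in range(len(self.slots)):
            i = (key + probe) % len(self.slots)
            if self.slots[i] is LinearProbingTable.EMPTY:
                return None                      # Empty slot ends the probe: key absent
            if self.slots[i][0] == key:
                return self.slots[i][1]
        return None

t = LinearProbingTable()
t.insert(12345, "RID:A")
t.insert(12356, "RID:B")   # 12345 % 11 == 12356 % 11 == 3, forces a probe
print(t.search(12356))     # RID:B
```

Deletions under open addressing require tombstone markers so probe sequences stay intact—one reason disk-based systems generally prefer chaining.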
```python
class HashBucket:
    """Represents a single bucket that can hold multiple entries."""

    def __init__(self, capacity: int = 10):
        self.entries = []       # List of (key, record_pointer) pairs
        self.capacity = capacity
        self.overflow = None    # Pointer to overflow bucket

    def insert(self, key, record_pointer):
        """Insert entry, creating overflow if necessary."""
        if len(self.entries) < self.capacity:
            self.entries.append((key, record_pointer))
        else:
            # Bucket full, use overflow chain
            if self.overflow is None:
                self.overflow = HashBucket(self.capacity)
            self.overflow.insert(key, record_pointer)

    def search(self, key):
        """Search for key, following overflow chain if needed."""
        for k, rp in self.entries:
            if k == key:
                return rp  # Found, return record pointer
        # Check overflow chain
        if self.overflow:
            return self.overflow.search(key)
        return None  # Not found


class SimpleHashIndex:
    """Simplified hash index with chaining for collision resolution."""

    def __init__(self, num_buckets: int = 1000):
        self.num_buckets = num_buckets
        self.buckets = [HashBucket() for _ in range(num_buckets)]
        self.num_entries = 0

    def _hash(self, key) -> int:
        """Hash function: handles both int and string keys."""
        if isinstance(key, int):
            return key % self.num_buckets
        # String hash using polynomial accumulation
        h = 0
        for char in str(key):
            h = (h * 31 + ord(char)) % self.num_buckets
        return h

    def insert(self, key, record_pointer):
        bucket_num = self._hash(key)
        self.buckets[bucket_num].insert(key, record_pointer)
        self.num_entries += 1

    def search(self, key):
        bucket_num = self._hash(key)
        return self.buckets[bucket_num].search(key)

    @property
    def load_factor(self) -> float:
        return self.num_entries / self.num_buckets


# Demonstration
index = SimpleHashIndex(num_buckets=100)

# Insert some employee records
employees = [
    (12345, "RID:Page100,Slot5"),
    (67890, "RID:Page200,Slot12"),
    (12445, "RID:Page150,Slot3"),  # Collides with 12345 (both hash to bucket 45)
]

for emp_id, rid in employees:
    bucket = index._hash(emp_id)
    print(f"Employee {emp_id} → Bucket {bucket}")
    index.insert(emp_id, rid)

print(f"\nLoad factor: {index.load_factor:.2f}")
print(f"Search 12345: {index.search(12345)}")
```

The load factor (λ = n/m, where n is entries and m is buckets) is the critical metric for hash index performance. At λ = 0.5, over half the buckets sit empty in expectation. At λ = 1.0, every bucket has one entry on average, but variance means some will have several. Most systems aim for λ between 0.5 and 0.8, trading space for performance.
Real database systems implement hash indexes with careful attention to disk-based storage constraints. Unlike in-memory hash tables, disk-based hash indexes must minimize I/O operations and handle persistence requirements.
Structurally, a production hash index mirrors the anatomy described earlier—a bucket directory, primary bucket pages, and overflow pages—with each page laid out in the slotted format shown below.
Page layout for hash bucket:
+--------------------------------------------------+
| Page Header (32 bytes) |
| - Page ID, LSN, Checksum |
| - Entry count, Free space offset |
| - Next overflow page pointer |
+--------------------------------------------------+
| Entry 1: | Key (variable) | RID (8 bytes) | |
| Entry 2: | Key (variable) | RID (8 bytes) | |
| Entry 3: | Key (variable) | RID (8 bytes) | |
| ... |
+--------------------------------------------------+
| Free Space |
+--------------------------------------------------+
| Slot Directory (grows backward) |
| [offset_n][offset_n-1]...[offset_1] |
+--------------------------------------------------+
I/O cost for hash index lookup (ideal case):

- Compute hash function: CPU only, no I/O
- Bucket directory lookup: typically cached in memory, no I/O
- Read bucket page: 1 I/O
- Follow overflow chain: 0 additional I/Os when the chain is empty

Total: 1 I/O for a point query (compared to 3-4 for a B+-tree)
This single-I/O performance is the fundamental appeal of hash indexing for point-query workloads.
PostgreSQL supports hash indexes (CREATE INDEX ... USING HASH) but historically recommended B+-tree for most cases due to hash index limitations with WAL logging (fixed in PostgreSQL 10+). Oracle uses hash clusters rather than hash indexes. MySQL/InnoDB doesn't support traditional hash indexes but uses adaptive hash indexes internally for frequently accessed pages.
Given the strengths and limitations we've discussed, hash indexing is the right choice in specific scenarios. Understanding these scenarios helps database architects make informed decisions.
Ideal hash index scenarios:
| Scenario | Why Hash Index Works | Example |
|---|---|---|
| Primary key lookup | Equality-only access pattern, unique keys | SELECT * FROM orders WHERE order_id = 'ORD-12345' |
| User authentication | Username/email exact match at login | SELECT password_hash FROM users WHERE email = ? |
| Session management | Session token exact lookup | SELECT user_id FROM sessions WHERE token = ? |
| Foreign key validation | Check existence by exact value | Does department_id = 42 exist in departments? |
| Deduplication | Check if exact value already exists | Is this transaction_id already processed? |
| Hash join build phase | Internal use by query optimizer | Building in-memory hash for join execution |
Decision framework for index selection:
Does the workload require:
├── Range queries (BETWEEN, <, >, LIKE 'prefix%')?
│ └── YES → Use B+-Tree
├── Sorted output (ORDER BY indexed column)?
│ └── YES → Use B+-Tree
├── MIN/MAX queries on indexed column?
│ └── YES → Use B+-Tree
├── Only equality lookups (=, IN)?
│ ├── AND high query volume?
│ │ └── YES → Consider Hash Index
│ └── AND static or slowly-changing data?
│ └── YES → Hash Index is excellent
└── Mixed workload?
└── Default to B+-Tree for versatility
Practical reality:
In most production systems, B+-trees dominate because:

- One structure handles equality, range, prefix, and ordering queries
- Performance is predictable, with no rehashing pauses or long overflow chains
- Implementations are extremely mature after decades of engineering
Hash indexes shine in specialized scenarios: high-volume exact-match workloads, in-memory databases, key-value stores, and as internal structures within the database engine itself.
Some systems use both: a B+-tree as the primary index (for flexibility) and a hash index as a secondary structure for hot equality lookups. InnoDB's adaptive hash index automatically builds hash structures for frequently accessed B+-tree pages, combining the best of both worlds transparently.
We've established the theoretical and practical foundation for hash-based indexing. Let's consolidate the key insights before diving deeper into hash functions, bucket mechanics, and advanced hash index variants in subsequent pages:

- Hash indexing replaces tree traversal with direct computation: address = h(key)
- Point lookups cost O(1) on average—often a single I/O—but range queries, ordering, and prefix matching are unsupported
- Collisions are unavoidable; chaining with overflow pages and a bounded load factor keep lookups constant-time in practice
What's next:
With the conceptual foundation established, the next page explores hash functions in depth. We'll examine the mathematical properties that make certain functions suitable for database indexing, analyze real hash function implementations used in production systems, and understand how to evaluate hash function quality through distribution analysis.
You now understand the fundamental concept of using hashing for database indexing. You can articulate why hash indexes achieve O(1) lookups, when they're appropriate versus B+-trees, and how they're structured at both logical and physical levels. Next, we'll deep-dive into the hash functions that make this possible.