Every database query ultimately reduces to a fundamental question: Where is the data I need? Tree-based indexes like B+-trees answer this question by traversing a hierarchical structure—a journey of O(log n) steps through increasingly specific regions of the keyspace. But there exists a more direct approach, one that promises to answer this question in essentially constant time, regardless of how much data you have.
This approach is hash-based indexing, and it represents one of the most elegant applications of mathematics to data management. Instead of searching through a tree structure, hash indexing computes the exact storage location from the search key itself. The data doesn't need to be found—its location is calculated.
By the end of this page, you will understand how hash functions transform the indexing problem from a search operation into a computation. You'll grasp the fundamental insight that makes hash indexing possible, appreciate its position within the broader landscape of index structures, and recognize the architectural constraints that make this technique both powerful and specialized.
To appreciate hash indexing, we must first understand the problem it solves. Consider a database table with millions of records. Without an index, answering the query SELECT * FROM employees WHERE employee_id = 12345 requires a full table scan—reading potentially millions of records to find the one that matches.
Tree-based indexes like B+-trees solve this by organizing data hierarchically. Starting from the root, we follow the appropriate child pointers based on key comparisons until we reach the leaf containing our target. For a well-balanced tree with high fanout, this requires only 3-4 disk I/O operations even for tables with billions of rows.
But here's the philosophical question: Why search at all?
If we could compute the storage location directly from the key value—without any searching—we could answer point queries with a single I/O operation. This is the fundamental insight behind hash indexing: transform the search problem into a computation problem.
Consider an analogy from the physical world. A library organizes books on shelves using the Dewey Decimal System—a hierarchical numbering scheme that requires understanding the classification to locate a book. You navigate from broad categories to specific subcategories to individual shelves.
Now imagine a different system: each book has a unique identifier, and a formula tells you exactly which shelf and slot that book occupies. You don't navigate; you compute and retrieve. This is the essence of hashing.
The mathematical formulation:
address = h(key)
Where:
- key is the search value (e.g., employee_id = 12345)
- h() is a hash function
- address is the storage location (bucket) where records with this key are stored

This simple equation encapsulates the power of hash indexing: O(1) average-case lookup time for point queries.
A hash function is a mathematical function that maps arbitrary-sized input (the search key) to a fixed-size output (the bucket address). In the context of database indexing, the hash function must satisfy specific properties that differ somewhat from hash functions used in cryptography or general-purpose programming.
Key properties for database hash functions:

- Deterministic: the same key must always hash to the same bucket
- Uniform: keys should spread evenly across buckets to avoid hot spots
- Fast to compute: the function runs on every lookup and insert, so its cost must be negligible compared to an I/O
Common hash function approaches for databases:
1. Modular Hashing (Division Method)
The simplest and most common approach:
h(key) = key mod M
Where M is the number of buckets. For best distribution, M should be a prime number not close to a power of 2. For example, with M = 101 buckets, h(12345) = 12345 mod 101 = 23.
2. Multiplicative Hashing
Uses the fractional part of a multiplication:
h(key) = floor(M × (key × A mod 1))
Where A is a constant (often the golden ratio ≈ 0.618034). This approach is less sensitive to the choice of M.
3. Folding
Partitions the key into segments and combines them:
key = 123456789
Fold: 123 + 456 + 789 = 1368
h(key) = 1368 mod M
4. Mid-Square Method
Squares the key and extracts middle digits:
key = 4567
key² = 20857489
Middle digits: 8574
h(key) = 8574 mod M
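Neither folding nor the mid-square method appears in the code sample later on this page, so here is a minimal sketch of both; the function names and the segment/width defaults are illustrative choices, not standard APIs:

```python
def folding_hash(key: int, num_buckets: int, segment_size: int = 3) -> int:
    """Folding: split the key's digits into fixed-size segments and sum them."""
    digits = str(key)
    total = sum(int(digits[i:i + segment_size])
                for i in range(0, len(digits), segment_size))
    return total % num_buckets

def mid_square_hash(key: int, num_buckets: int, width: int = 4) -> int:
    """Mid-square: square the key and extract the middle `width` digits."""
    squared = str(key * key)
    start = len(squared) // 2 - width // 2
    middle = int(squared[start:start + width])
    return middle % num_buckets

print(folding_hash(123456789, 101))   # folds 123 + 456 + 789 = 1368
print(mid_square_hash(4567, 101))     # 4567² = 20857489 → middle digits 8574
```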
For non-numeric keys (names, email addresses, product codes), the string must first be converted to a numeric value. Common approaches include summing ASCII/Unicode values, polynomial accumulation (treating the string as a base-128 number), or using established string hash functions like djb2, FNV, or MurmurHash.
```python
import math

def modular_hash(key: int, num_buckets: int) -> int:
    """
    Simple modular hash function.
    Best when num_buckets is a prime number.
    """
    return key % num_buckets

def string_hash(key: str, num_buckets: int) -> int:
    """
    Polynomial accumulation hash for strings.
    Uses Horner's rule for efficient computation.
    """
    hash_value = 0
    a = 31  # A prime multiplier
    for char in key:
        # Multiply previous hash by 'a' and add current char value
        hash_value = hash_value * a + ord(char)
    return hash_value % num_buckets

def multiplicative_hash(key: int, num_buckets: int) -> int:
    """
    Multiplicative hash using the golden ratio.
    Less sensitive to bucket count choice.
    """
    A = (math.sqrt(5) - 1) / 2  # Golden ratio conjugate ≈ 0.618034
    fractional_part = (key * A) % 1
    return int(num_buckets * fractional_part)

# Example usage
print(f"modular_hash(12345, 101) = {modular_hash(12345, 101)}")
print(f"string_hash('Employee', 101) = {string_hash('Employee', 101)}")
print(f"multiplicative_hash(12345, 100) = {multiplicative_hash(12345, 100)}")
```

Understanding how the hash function connects to physical storage is crucial. The output of the hash function isn't a raw disk address—it's a bucket number. This bucket number is then translated into an actual storage location through the database's storage manager.
The translation chain:
Search Key → Hash Function → Bucket Number → Bucket Address → Disk Block
Let's trace through a concrete example:
| Step | Operation | Result |
|---|---|---|
| 1 | Apply hash function h(12345) | Hash value = 21 |
| 2 | Map to bucket number | Bucket #21 |
| 3 | Look up bucket directory | Bucket #21 → Disk block 1847 |
| 4 | Read disk block | Load block 1847 into buffer |
| 5 | Scan bucket for exact key | Find record with employee_id = 12345 |
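A minimal sketch of that chain in code; the bucket count and directory contents are hypothetical, picked so the numbers match the trace above:

```python
NUM_BUCKETS = 79  # Prime; chosen so h(12345) = 21, matching the trace

# Hypothetical bucket directory: bucket number -> disk block number.
bucket_directory = {21: 1847}

def locate(key: int) -> str:
    bucket_num = key % NUM_BUCKETS        # Steps 1-2: hash key to a bucket number
    block = bucket_directory[bucket_num]  # Step 3: directory lookup
    # Steps 4-5 (reading the block and scanning its entries)
    # would go through the buffer manager in a real DBMS.
    return f"key {key} -> bucket {bucket_num} -> disk block {block}"

print(locate(12345))  # key 12345 -> bucket 21 -> disk block 1847
```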
The anatomy of a hash index structure:
A hash index in a DBMS typically consists of:
1. Bucket Directory (or Header Page): maps each bucket number to the disk address of its primary page.

2. Bucket Pages (or Primary Pages): fixed-size pages holding the (key, record_pointer) entries for one bucket.

3. Overflow Pages: extra pages chained to a primary page when it fills up.
Space allocation:
With M buckets and a page size of 4KB that can hold 100 entries each, the primary pages can hold up to 100 × M entries. Past that point—and in practice well before it, because distribution is never perfectly even—inserts spill into overflow pages.
Like B+-tree indexes, hash indexes typically store (key, record_pointer) pairs, not the full records themselves. The record pointer (RID) identifies the actual data page and slot where the complete record resides. Some systems support 'clustered' hash indexes where the full record is stored in buckets, but this is less common.
Hash indexing and tree-based indexing (particularly B+-trees) represent fundamentally different approaches to the same problem. Understanding their contrasts clarifies when each is appropriate.
The philosophical difference:
B+-tree philosophy: Organize keys by their relative ordering. Navigate through levels of increasingly refined key ranges to locate targets.
Hash index philosophy: Compute location directly from key value. Ordering is irrelevant; only equality matters.
This philosophical difference manifests in dramatic operational characteristics:
| Operation | Hash Index | B+-Tree |
|---|---|---|
| Equality lookup (key = value) | O(1) average case, one I/O | O(log n), typically 3-4 I/Os |
| Range query (key > a AND key < b) | Not supported, requires full scan | O(log n + k), efficient traversal |
| Ordered retrieval (ORDER BY key) | Not supported, random order | Natural ordering, no extra cost |
| Prefix matching (LIKE 'abc%') | Not supported | Supported efficiently |
| MIN/MAX queries | Requires full scan | O(log n), leftmost/rightmost leaf |
| Space overhead | Lower (no internal nodes) | Higher (internal node pages) |
| Insert complexity | O(1) average, may trigger rehash | O(log n), may trigger split |
| Delete complexity | O(1) average | O(log n), may trigger merge |
Hash functions deliberately destroy key ordering—that's how they achieve uniform distribution. Keys 100, 101, and 102 may hash to buckets 47, 3, and 89 respectively. This is a feature for distribution, but a fatal limitation for range queries. You cannot 'traverse' from key 100 to key 200 in a hash index.
The elegance of hash indexing faces a mathematical reality: the pigeonhole principle. If you have more keys than buckets (which is almost always the case), multiple keys must share the same bucket. This is called a collision.
Why collisions are inevitable:
Suppose we have:

- 100,000 employee records
- 10,000 buckets

On average each bucket will contain 10 records. When we look up employee_id = 12345, we'll find its bucket, but that bucket contains roughly 10 different records. We must scan through them to find the exact match.
This is still O(1) under reasonable assumptions:

If the hash function distributes keys uniformly, each bucket holds about n/m entries (the load factor), and scanning those few entries within a single already-loaded page adds no extra I/O. The lookup cost stays constant no matter how large the table grows, as long as the bucket count grows with it.

The O(1) guarantee assumes:

- The hash function spreads keys uniformly across buckets
- The load factor is kept bounded (buckets don't grow without limit)
- Overflow chains stay short, so a lookup rarely touches more than one page

The short simulation below illustrates how uniform hashing behaves in practice.
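As a rough illustration, a random assignment stands in for a uniform hash here; the exact counts will vary from run to run:

```python
import random
from collections import Counter

NUM_KEYS = 100_000
NUM_BUCKETS = 10_000

# Stand-in for a uniform hash: assign each key to a random bucket.
occupancy = Counter(random.randrange(NUM_BUCKETS) for _ in range(NUM_KEYS))

print(f"Average entries per bucket: {NUM_KEYS / NUM_BUCKETS:.1f}")
print(f"Fullest bucket holds {max(occupancy.values())} entries")

# Bucket-size distribution (buckets with zero entries don't appear here).
sizes = Counter(occupancy.values())
for size in sorted(sizes):
    print(f"  {sizes[size]:5d} buckets hold {size} entries")
```

The average lands at 10 entries per bucket, but variance means some buckets hold noticeably more—exactly the effect the load-factor discussion below addresses.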
Collision resolution strategies:
Databases handle collisions through several mechanisms:
1. Chaining (Most Common in Databases): each bucket keeps a linked chain of overflow pages; colliding entries are appended to the chain. The fuller example below demonstrates this approach.

2. Open Addressing (Less Common in DBMS): on collision, probe other slots in the table (e.g., linear probing) until a free one is found; see the sketch after this list.

3. Cuckoo Hashing: use two hash functions, giving each key two candidate buckets; an insert may displace an existing entry to its alternate bucket.
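For contrast with the chaining example that follows, here is a minimal in-memory sketch of open addressing with linear probing; the class and method names are illustrative, not from any particular system:

```python
class LinearProbingTable:
    """Open addressing: colliding keys probe forward to the next free slot."""
    EMPTY = object()

    def __init__(self, num_slots: int = 11):
        self.slots = [LinearProbingTable.EMPTY] * num_slots

    def insert(self, key: int, value) -> None:
        for probe in range(len(self.slots)):
            i = (key + probe) % len(self.slots)  # Next slot on collision
            if self.slots[i] is LinearProbingTable.EMPTY or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table full")

    def search(self, key: int):
        for probe in range(len(self.slots)):
            i = (key + probe) % len(self.slots)
            if self.slots[i] is LinearProbingTable.EMPTY:
                return None                      # Empty slot ends the probe: key absent
            if self.slots[i][0] == key:
                return self.slots[i][1]
        return None

t = LinearProbingTable()
t.insert(12345, "RID:A")
t.insert(12356, "RID:B")   # 12345 % 11 == 12356 % 11 == 3, forces a probe
print(t.search(12356))     # RID:B
```

Deletions under open addressing require tombstone markers so probe sequences stay intact—one reason disk-based systems generally prefer chaining.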
```python
class HashBucket:
    """Represents a single bucket that can hold multiple entries."""

    def __init__(self, capacity: int = 10):
        self.entries = []       # List of (key, record_pointer) pairs
        self.capacity = capacity
        self.overflow = None    # Pointer to overflow bucket

    def insert(self, key, record_pointer):
        """Insert entry, creating overflow if necessary."""
        if len(self.entries) < self.capacity:
            self.entries.append((key, record_pointer))
        else:
            # Bucket full, use overflow chain
            if self.overflow is None:
                self.overflow = HashBucket(self.capacity)
            self.overflow.insert(key, record_pointer)

    def search(self, key):
        """Search for key, following overflow chain if needed."""
        for k, rp in self.entries:
            if k == key:
                return rp  # Found, return record pointer
        # Check overflow chain
        if self.overflow:
            return self.overflow.search(key)
        return None  # Not found


class SimpleHashIndex:
    """Simplified hash index with chaining for collision resolution."""

    def __init__(self, num_buckets: int = 1000):
        self.num_buckets = num_buckets
        self.buckets = [HashBucket() for _ in range(num_buckets)]
        self.num_entries = 0

    def _hash(self, key) -> int:
        """Hash function: handles both int and string keys."""
        if isinstance(key, int):
            return key % self.num_buckets
        # String hash using polynomial accumulation
        h = 0
        for char in str(key):
            h = (h * 31 + ord(char)) % self.num_buckets
        return h

    def insert(self, key, record_pointer):
        bucket_num = self._hash(key)
        self.buckets[bucket_num].insert(key, record_pointer)
        self.num_entries += 1

    def search(self, key):
        bucket_num = self._hash(key)
        return self.buckets[bucket_num].search(key)

    @property
    def load_factor(self) -> float:
        return self.num_entries / self.num_buckets


# Demonstration
index = SimpleHashIndex(num_buckets=100)

# Insert some employee records
employees = [
    (12345, "RID:Page100,Slot5"),
    (67890, "RID:Page200,Slot12"),
    (12445, "RID:Page150,Slot3"),  # Collides with 12345 (both hash to bucket 45)
]

for emp_id, rid in employees:
    bucket = index._hash(emp_id)
    print(f"Employee {emp_id} → Bucket {bucket}")
    index.insert(emp_id, rid)

print(f"\nLoad factor: {index.load_factor:.2f}")
print(f"Search 12345: {index.search(12345)}")
```

The load factor (λ = n/m, where n is entries and m is buckets) is the critical metric for hash index performance. At λ = 0.5, over half the buckets sit empty in expectation. At λ = 1.0, every bucket has one entry on average, but variance means some will have several. Most systems aim for λ between 0.5 and 0.8, trading space for performance.
Real database systems implement hash indexes with careful attention to disk-based storage constraints. Unlike in-memory hash tables, disk-based hash indexes must minimize I/O operations and handle persistence requirements.
Structurally, a production hash index mirrors the anatomy described earlier—a bucket directory, primary bucket pages, and overflow pages—with each page laid out in the slotted format shown below.
Page layout for hash bucket:
+--------------------------------------------------+
| Page Header (32 bytes) |
| - Page ID, LSN, Checksum |
| - Entry count, Free space offset |
| - Next overflow page pointer |
+--------------------------------------------------+
| Entry 1: | Key (variable) | RID (8 bytes) | |
| Entry 2: | Key (variable) | RID (8 bytes) | |
| Entry 3: | Key (variable) | RID (8 bytes) | |
| ... |
+--------------------------------------------------+
| Free Space |
+--------------------------------------------------+
| Slot Directory (grows backward) |
| [offset_n][offset_n-1]...[offset_1] |
+--------------------------------------------------+
I/O cost for hash index lookup (ideal case):

- Compute hash function: CPU only, no I/O
- Bucket directory lookup: typically cached in memory, no I/O
- Read bucket page: 1 I/O
- Follow overflow chain: 0 additional I/Os when the chain is empty

Total: 1 I/O for a point query (compared to 3-4 for a B+-tree)
This single-I/O performance is the fundamental appeal of hash indexing for point-query workloads.
PostgreSQL supports hash indexes (CREATE INDEX ... USING HASH) but historically recommended B+-tree for most cases due to hash index limitations with WAL logging (fixed in PostgreSQL 10+). Oracle uses hash clusters rather than hash indexes. MySQL/InnoDB doesn't support traditional hash indexes but uses adaptive hash indexes internally for frequently accessed pages.
Given the strengths and limitations we've discussed, hash indexing is the right choice in specific scenarios. Understanding these scenarios helps database architects make informed decisions.
Ideal hash index scenarios:
| Scenario | Why Hash Index Works | Example |
|---|---|---|
| Primary key lookup | Equality-only access pattern, unique keys | SELECT * FROM orders WHERE order_id = 'ORD-12345' |
| User authentication | Username/email exact match at login | SELECT password_hash FROM users WHERE email = ? |
| Session management | Session token exact lookup | SELECT user_id FROM sessions WHERE token = ? |
| Foreign key validation | Check existence by exact value | Does department_id = 42 exist in departments? |
| Deduplication | Check if exact value already exists | Is this transaction_id already processed? |
| Hash join build phase | Internal use by query optimizer | Building in-memory hash for join execution |
Decision framework for index selection:
Does the workload require:
├── Range queries (BETWEEN, <, >, LIKE 'prefix%')?
│ └── YES → Use B+-Tree
├── Sorted output (ORDER BY indexed column)?
│ └── YES → Use B+-Tree
├── MIN/MAX queries on indexed column?
│ └── YES → Use B+-Tree
├── Only equality lookups (=, IN)?
│ ├── AND high query volume?
│ │ └── YES → Consider Hash Index
│ └── AND static or slowly-changing data?
│ └── YES → Hash Index is excellent
└── Mixed workload?
└── Default to B+-Tree for versatility
Practical reality:
In most production systems, B+-trees dominate because:

- One structure handles equality, range, prefix, and ordering queries
- Performance is predictable, with no rehashing pauses or long overflow chains
- Implementations are extremely mature after decades of engineering
Hash indexes shine in specialized scenarios: high-volume exact-match workloads, in-memory databases, key-value stores, and as internal structures within the database engine itself.
Some systems use both: a B+-tree as the primary index (for flexibility) and a hash index as a secondary structure for hot equality lookups. InnoDB's adaptive hash index automatically builds hash structures for frequently accessed B+-tree pages, combining the best of both worlds transparently.
We've established the theoretical and practical foundation for hash-based indexing. Let's consolidate the key insights before diving deeper into hash functions, bucket mechanics, and advanced hash index variants in subsequent pages:

- Hash indexing replaces tree traversal with direct computation: address = h(key)
- Point lookups cost O(1) on average—often a single I/O—but range queries, ordering, and prefix matching are unsupported
- Collisions are unavoidable; chaining with overflow pages and a bounded load factor keep lookups constant-time in practice
What's next:
With the conceptual foundation established, the next page explores hash functions in depth. We'll examine the mathematical properties that make certain functions suitable for database indexing, analyze real hash function implementations used in production systems, and understand how to evaluate hash function quality through distribution analysis.
You now understand the fundamental concept of using hashing for database indexing. You can articulate why hash indexes achieve O(1) lookups, when they're appropriate versus B+-trees, and how they're structured at both logical and physical levels. Next, we'll deep-dive into the hash functions that make this possible.