A hash function produces a hash value—but that value alone isn't useful until we translate it into an array index. This translation is the bridge between the abstract mathematical concept of hashing and the concrete implementation of a hash table.
Consider this: a hash function might produce the value 2,147,483,647 for some key, but your array only has 1,000 slots. How do we map that large number to a valid array position? How do we ensure this mapping preserves the uniform distribution we worked so hard to achieve in the hash function?
This page answers these questions. We'll explore the techniques for converting arbitrary hash values into usable array indices, understand the mathematical considerations involved, and learn why certain approaches work better than others.
By the end of this page, you will understand:
• How to convert hash values to valid array indices
• The division method and its mathematical foundations
• Why table size selection critically affects performance
• The multiplication method and its advantages
• Universal hashing techniques for robust key-to-index mapping
• Common pitfalls in index computation and how to avoid them
Let's formalize what we need to accomplish. We have:
• A hash function h(k) that maps keys to (typically large) integers
• An array of size m, with valid slots 0 through m-1

We need to compute index(k) ∈ {0, 1, ..., m-1} as a function of h(k).
This transformation must satisfy several requirements:
Requirement 1: Valid Range
The output must always be a valid array index. For an array of size m, this means the range [0, m-1].

Requirement 2: Preservation of Distribution
If h(k) distributes keys uniformly across its output range, index(k) should distribute keys uniformly across [0, m-1]. A poor index computation can destroy a good hash function's distribution.

Requirement 3: Determinism
The same key must always map to the same index: index(k) must be a pure function of k and m.

Requirement 4: Efficiency
The computation should be fast—ideally O(1) using only basic arithmetic.
Many programmers assume that the hash function's output IS the array index. This is only true when the hash function is specifically designed for a fixed table size. In practice, hash functions typically output large integers (32 or 64 bits), and we need a separate step to map these to array indices.
Example: The Problem in Action
Suppose we use Java's String.hashCode() method:
"hello".hashCode() = 99162322
"world".hashCode() = 113318802
"apple".hashCode() = 93029210
These values are 32-bit integers. If our hash table has only 16 slots, we need to map these large values to the range [0, 15]. How?
The naive approach—just use the hash value directly—fails immediately: 99162322 is not a valid index for a 16-slot array.
We need a systematic method to compress these values into our target range while preserving as much of the original distribution as possible.
The division method is the most common and intuitive approach to index computation:
index(k) = h(k) mod m
where m is the table size.
This method:
• Always produces a value in [0, m-1] for non-negative h(k)
• Costs a single arithmetic operation, so it's O(1)
• Preserves uniformity well, provided m is chosen carefully
Example:
m = 16 (table size)
index("hello") = 99162322 mod 16 = 2
index("world") = 113318802 mod 16 = 2 // Collision!
index("apple") = 93029210 mod 16 = 10
But wait—"hello" and "world" both map to index 2. Is this the division method's fault, or were these hash values already going to collide? Here it's the mapping: 99162322 and 113318802 are distinct values that happen to share their last four bits, so mod 16 sends both to slot 2. A different table size would separate them—m = 17, for instance, sends them to slots 13 and 15.
The Choice of m Matters Enormously
The division method's quality depends critically on choosing m well. Poor choices of m can introduce patterns that destroy uniformity.
Why Powers of 2 Can Be Dangerous:
If m = 2^k, then h mod m is equivalent to taking the last k bits of h. This is fast but ignores all higher-order bits.
m = 16 = 2^4 → index = last 4 bits of hash
If hash function has patterns in low bits:
hash₁ = ...00010000 → index = 0
hash₂ = ...01110000 → index = 0
hash₃ = ...10100000 → index = 0
All three hash to the same index despite being different!
If the hash function has any weakness in its low-order bits, using a power of 2 amplifies that weakness.
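The effect is easy to reproduce. The sketch below uses the hypothetical bit patterns from the example above, comparing a power-of-2 table size against a prime one:

```python
# Hash values that share their four low-order bits (hypothetical patterns
# from the example): under m = 16 only those bits survive the mod.
hashes = [0b00010000, 0b01110000, 0b10100000]  # all end in four zero bits

indices_pow2 = [h % 16 for h in hashes]   # m = 16 = 2^4: keeps only low 4 bits
indices_prime = [h % 17 for h in hashes]  # m = 17 (prime): all bits influence the result

print(indices_pow2)   # every key lands in slot 0
print(indices_prime)  # keys spread across distinct slots
```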
Why Primes Are Preferred:
Prime numbers for m tend to scatter keys more uniformly because they have no common factors with typical integer patterns.
m = 17 (prime)
hash₁ = 272 → 272 mod 17 = 0
hash₂ = 273 → 273 mod 17 = 1
hash₃ = 274 → 274 mod 17 = 2
...
With a prime m, consecutive hash values map to consecutive indices, and because m shares no factors with common stride patterns in the data (keys that are all even, say, or all multiples of 2^k), every slot remains reachable and no subset of slots is systematically favored.
For the division method, choose m as a prime number not close to a power of 2. Example: for a table of roughly 1000 entries, use m = 1009 or 1013 rather than m = 1024. This simple choice often dramatically improves distribution.
| Table Size Target | Poor Choice | Better Choice | Reason |
|---|---|---|---|
| ~100 | 100, 128 | 97, 101, 103 | 97/101/103 are prime |
| ~1000 | 1000, 1024 | 997, 1009, 1013 | 997/1009/1013 are prime |
| ~10000 | 10000, 16384 | 10007, 10009 | 10007/10009 are prime |
| ~100000 | 100000, 131072 | 100003, 100019 | Primes not near 2^k |
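The "better choices" in the table can be found mechanically. Below is a minimal trial-division sketch (next_prime is a helper name used here for illustration; trial division is plenty fast for one-off table sizing):

```python
def next_prime(n):
    """Return the smallest prime >= n (trial division; fine for table sizing)."""
    def is_prime(x):
        if x < 2:
            return False
        if x % 2 == 0:
            return x == 2
        f = 3
        while f * f <= x:
            if x % f == 0:
                return False
            f += 2
        return True
    while not is_prime(n):
        n += 1
    return n

print(next_prime(100))    # 101
print(next_prime(1000))   # 1009
print(next_prime(10000))  # 10007
```

Note that this only finds the nearest prime; per the guidance above, you may still want to skip primes that sit right next to a power of 2.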
Handling Negative Hash Values:
In many languages, hash values can be negative (e.g., Java's String.hashCode() can return negative integers). The modulo operation with negative numbers can produce negative results:
// Java example
int hash = "some string".hashCode(); // Might be negative!
int index = hash % m; // Could be negative!
Solutions:
// Approach 1: Bit masking (most efficient)
int index = (hash & 0x7FFFFFFF) % m; // Clear sign bit first
// Approach 2: Math.abs (beware of Integer.MIN_VALUE!)
int index = Math.abs(hash) % m; // Edge case: abs(MIN_VALUE) overflows!
// Approach 3: Mathematical correction
int index = ((hash % m) + m) % m; // Always positive
The first approach (bit masking) is typically preferred for its simplicity and efficiency.
The multiplication method offers an alternative that's less sensitive to the choice of m:
index(k) = ⌊m × (h(k) × A mod 1)⌋
where:
• m is the table size (any value works, even a power of 2)
• A is a constant with 0 < A < 1
• (h(k) × A mod 1) denotes the fractional part of h(k) × A
Step-by-step example:
Let m = 16, A = 0.6180339887 (related to the golden ratio), h(k) = 99162322
1. h(k) × A = 99162322 × 0.6180339887 ≈ 61285685.394
2. Take the fractional part: 0.394
3. Multiply by m: 16 × 0.394 ≈ 6.31
4. Take the floor: index = 6
Why the Multiplication Method Works:
The multiplication method uses the fractional part of h(k) × A. When A is an irrational number (or a good approximation), this operation "scrambles" the hash value in a way that distributes across [0, 1) with minimal patterns.
The Golden Ratio Connection:
Knuth suggests using A ≈ (√5 - 1) / 2 ≈ 0.6180339887 (the golden ratio's inverse). This value has special properties that minimize clustering: successive multiples of it, taken mod 1, spread as evenly as possible over [0, 1) (a consequence of the three-distance theorem), so even highly regular sequences of hash values get dispersed.
Advantages of the Multiplication Method:
• The choice of m is not critical—powers of 2 work fine
• It can be implemented with integer multiplication and bit shifts, avoiding division entirely
• A single well-chosen constant A serves any table size
// Fast implementation when m = 2^p
// Assuming 32-bit hash values
uint32_t hash = compute_hash(key);
uint32_t product = hash * 2654435769U; // A ≈ 0.618 as fixed-point
uint32_t index = product >> (32 - p); // Extract top p bits
The magic constant 2654435769 is ⌊(√5 - 1) / 2 × 2^32⌋.
In practice, the multiplication method is often implemented using integer arithmetic and bit shifts to avoid floating-point operations. The key insight is that we're essentially using the high-order bits of a fixed-point multiplication, which mixes information from all parts of the hash value.
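The same fixed-point trick translates directly to Python (mult_index is a name chosen here for illustration; the sketch assumes 32-bit hash values and a power-of-2 table size):

```python
KNUTH_A = 2654435769  # floor(((sqrt(5) - 1) / 2) * 2**32), as in the C snippet above

def mult_index(hash_value, p):
    """Multiplication-method index for a table of size m = 2**p."""
    product = (hash_value * KNUTH_A) & 0xFFFFFFFF  # low 32 bits ~= frac(h * A) * 2**32
    return product >> (32 - p)                     # top p bits = floor(m * frac(h * A))

# The "hello".hashCode() value from earlier, mapped into a 16-slot table (p = 4):
print(mult_index(99162322, 4))  # 6
```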
Universal hashing provides provable guarantees against worst-case behavior by randomly selecting the hash function from a carefully designed family.
The Carter-Wegman Universal Hash Family:
For integer keys in range [0, p-1] (where p is a large prime) and table size m:
h_{a,b}(k) = ((a × k + b) mod p) mod m
where:
• p is a prime larger than the largest possible key value
• a is chosen uniformly at random from {1, 2, ..., p-1}
• b is chosen uniformly at random from {0, 1, ..., p-1}
Why is this "universal"?
For any two distinct keys k₁ ≠ k₂:
P[h_{a,b}(k₁) = h_{a,b}(k₂)] ≤ 1/m
This probability is over the random choice of a and b. No adversary can construct a bad input because they don't know which a and b we'll choose.
Practical Implementation:
import random

class UniversalHashTable:
    def __init__(self, size, prime=10000019):
        self.m = size
        self.p = prime  # Prime larger than key range
        self.a = random.randint(1, prime - 1)
        self.b = random.randint(0, prime - 1)
        self.table = [None] * size

    def _hash(self, key):
        # Convert key to integer if necessary
        if isinstance(key, str):
            key = sum(ord(c) * (31 ** i) for i, c in enumerate(key))
        return ((self.a * key + self.b) % self.p) % self.m

    def insert(self, key, value):
        index = self._hash(key)
        # ... collision handling ...
When to Use Universal Hashing:
• When inputs may be adversarial (e.g., keys supplied by untrusted users)
• When you need provable expected-case guarantees rather than empirical ones
• When a fixed hash function has shown pathological clustering on your data
In 2011, researchers demonstrated "hash flooding" attacks against web frameworks by crafting POST requests with keys that all collided in the language's hash table. Universal hashing (or randomized hash seeds) prevents such attacks because attackers can't predict which hash function will be used.
So far we've assumed keys are integers. In practice, keys are often strings, objects, or composite types. The index computation pipeline becomes:
Key → Integer Hash → Array Index
Let's examine common approaches for the first step: converting arbitrary keys to integers.
Strings:
Strings are the most common non-integer key type. Several methods exist:
# Method 1: Sum of characters (POOR - many collisions)
def naive_hash(s):
    return sum(ord(c) for c in s)
# "abc" and "cba" and "bac" all hash to the same value!

# Method 2: Polynomial rolling hash (GOOD)
def polynomial_hash(s, base=31):
    h = 0
    for c in s:
        h = h * base + ord(c)
    return h
# "abc" = 97*31² + 98*31 + 99 = 93217 + 3038 + 99 = 96354
# "cba" = 99*31² + 98*31 + 97 = 95139 + 3038 + 97 = 98274
# Now they're different!

# Method 3: djb2 (GOOD - widely used)
def djb2_hash(s):
    h = 5381
    for c in s:
        h = ((h << 5) + h) + ord(c)  # h * 33 + c
    return h
Composite Objects:
For objects with multiple fields, combine the hash values of individual fields:
// Java example
public class Person {
    private String name;
    private int age;
    private String city;

    @Override
    public int hashCode() {
        int result = 17; // Non-zero initial value
        result = 31 * result + (name != null ? name.hashCode() : 0);
        result = 31 * result + age;
        result = 31 * result + (city != null ? city.hashCode() : 0);
        return result;
    }
}
Why 31?
• It's an odd prime, so multiplying by it mixes bits without systematically discarding information
• 31 × i can be computed as (i << 5) - i, a cheap shift-and-subtract on any hardware
• Empirically it distributes typical string and field data well—Java's own String.hashCode() uses it
Floating-Point Numbers:
Floating-point keys require special care:
// Java approach: use bit representation
double d = 3.14159;
long bits = Double.doubleToLongBits(d);
int hash = (int)(bits ^ (bits >>> 32));
Caution: Floating-point comparison is tricky due to precision issues. Consider whether you really want exact floating-point keys, or if you should quantize them first.
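Both options can be sketched in Python. Here float_bits_hash mirrors the Java bit-pattern approach, while quantized_hash and its rounding precision are illustrative choices for the quantize-first strategy:

```python
import struct

def float_bits_hash(d):
    """Hash a float via its IEEE-754 bit pattern (mirrors the Java approach)."""
    bits = struct.unpack('<Q', struct.pack('<d', d))[0]  # 64-bit representation
    return (bits ^ (bits >> 32)) & 0xFFFFFFFF            # fold high bits into low

def quantized_hash(d, decimals=6):
    """Hash a float after rounding, so nearly-equal values share a bucket."""
    return hash(round(d, decimals))

# 0.1 + 0.2 != 0.3 exactly, so the raw bit patterns hash differently...
print(float_bits_hash(0.1 + 0.2) == float_bits_hash(0.3))  # False
# ...but quantizing first makes the two keys agree:
print(quantized_hash(0.1 + 0.2) == quantized_hash(0.3))    # True
```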
Arrays and Collections:
For ordered collections, use a rolling polynomial hash:
def array_hash(arr):
    h = 0
    for elem in arr:
        h = h * 31 + hash(elem)
    return h
For unordered collections (sets), use XOR or sum (order-independent operations):
def set_hash(s):
    return sum(hash(elem) for elem in s)  # Order doesn't matter
In languages like Java: if two objects are equal (equals() returns true), they MUST have the same hashCode(). The reverse isn't required—unequal objects may have the same hash (collisions). Violating this contract causes hash tables to malfunction silently.
Index computation seems simple, but subtle bugs are common. Let's examine the most frequent pitfalls:
Pitfall 1: Negative hash values. Use (hash & 0x7FFFFFFF) % m or ((hash % m) + m) % m. Avoid Math.abs(hash)—it fails for Integer.MIN_VALUE in Java, where abs overflows back to a negative number.
Pitfall 2: Integer overflow during hash computation. In languages with fixed-width integers, reduce as you go: h = (h * 31 + c) % LARGE_PRIME prevents overflow.
Testing Your Hash Function:
def test_hash_distribution(hash_func, keys, table_size):
    """Test if a hash function distributes keys uniformly."""
    buckets = [0] * table_size
    for key in keys:
        index = hash_func(key) % table_size
        buckets[index] += 1
    expected = len(keys) / table_size
    variance = sum((b - expected) ** 2 for b in buckets) / table_size
    std_dev = variance ** 0.5
    print(f"Expected per bucket: {expected:.2f}")
    print(f"Standard deviation: {std_dev:.2f}")
    print(f"Min: {min(buckets)}, Max: {max(buckets)}")
    # For good distribution, std_dev should be close to sqrt(expected)
    return std_dev < 2 * (expected ** 0.5)
Run this test with representative data to verify your hash function behaves well.
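As a concrete comparison, the snippet below runs the same distribution check on the naive character sum versus the polynomial hash from earlier (the functions are re-defined in compact form so the snippet stands alone; the key set and table size are arbitrary choices):

```python
# Compact re-definitions of the earlier hash functions and test logic.
def naive_hash(s):
    return sum(ord(c) for c in s)

def polynomial_hash(s, base=31):
    h = 0
    for c in s:
        h = h * base + ord(c)
    return h

def distribution_std(hash_func, keys, table_size):
    """Standard deviation of bucket counts (lower = more uniform)."""
    buckets = [0] * table_size
    for key in keys:
        buckets[hash_func(key) % table_size] += 1
    expected = len(keys) / table_size
    return (sum((b - expected) ** 2 for b in buckets) / table_size) ** 0.5

keys = [f"user_{i}" for i in range(10000)]
naive_std = distribution_std(naive_hash, keys, 1013)
poly_std = distribution_std(polynomial_hash, keys, 1013)
print(f"naive: {naive_std:.2f}, polynomial: {poly_std:.2f}")
```

The character sum crams all 10,000 keys into a narrow band of sums (roughly 590 to 770 for these strings), so only a small fraction of the 1013 buckets are ever used and its deviation is far larger than the polynomial hash's.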
We've explored the critical bridge between hash values and array positions. This mapping determines whether your hash table achieves O(1) performance or degrades to O(n). Let's consolidate the key insights:
• index = hash % m is simple and fast, but requires careful choice of m (prefer primes not near powers of 2).
• The multiplication method tolerates any table size, including powers of 2, and runs on multiplies and shifts instead of division.
• Universal hashing randomizes the choice of hash function, giving provable collision bounds even against adversarial keys.
• Watch the edge cases: negative hash values, integer overflow, and the equals/hashCode contract.

What's Next:
We've now covered how to compute hash values and map them to array indices. The next page explores the full journey from arbitrary data to array positions—seeing the complete picture of how inputs like strings, objects, and composite types become usable storage locations.
You now understand the mechanics of translating hash values to array indices—the critical step that enables O(1) hash table operations. You can choose between division and multiplication methods, handle edge cases like negative values, and apply universal hashing when stronger guarantees are needed.