We've explored hash functions conceptually and learned how to map hash values to indices. Now it's time to see the complete picture—the end-to-end transformation that takes any piece of data and converts it into a usable array position.
This transformation is the heart of hash table functionality. Understanding it completely allows you to debug collisions and clustering, design hash support for custom key types, and reason about real-world performance.
In this page, we'll trace the complete journey from arbitrary data to array positions, examining each stage in detail and understanding how they work together as a unified system.
By the end of this page, you will understand:
• The complete multi-stage hashing pipeline
• How each data type flows through the transformation
• Memory layout considerations for hash tables
• Cache efficiency and modern hardware considerations
• Complete worked examples tracing data from input to storage location
The transformation from arbitrary data to array position follows a three-stage pipeline:
Stage 1 (Normalization): Convert the key to a canonical byte representation.
Stage 2 (Hashing): Transform the bytes into a fixed-size integer hash code.
Stage 3 (Index Computation): Map the hash code to a valid array index.
Let's visualize this pipeline:
┌─────────────┐ ┌──────────────┐ ┌───────────────┐ ┌─────────┐
│ Key │───▶│ Normalization │───▶│ Hash Function │───▶│ Index │
│ (any type) │ │ (bytes) │ │ (integer) │ │ (0..m-1)│
└─────────────┘ └──────────────┘ └───────────────┘ └─────────┘
"hello" ───▶ [104,101,108, ───▶ 99162322 ───▶ 7
108,111]
Why Three Stages?
This separation provides several advantages: each stage can be tested and optimized independently, the same hash function serves every key type, and the table size can change without touching the normalization or hashing logic.
Stage Dependencies:
The stages form a directed pipeline—each depends only on the previous:
Normalization ─depends on─▶ Data type
Hash Function ─depends on─▶ Byte sequence
Index Computation ─depends on─▶ Hash value and table size
This separation of concerns is why you can use the same hash table implementation with strings, integers, objects, or any other type—only the normalization stage changes.
In optimized implementations, these stages are often fused for efficiency. For example, string hash functions typically process characters directly rather than first converting to a byte array. But conceptually, the three stages always exist and understanding them aids debugging and optimization.
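To make the three stages concrete before examining each one, here is a minimal sketch in Python; the function names and the choice of FNV-1a for stage 2 are illustrative, not from any particular library:

def normalize(key):
    """Stage 1: convert a key to a canonical byte sequence."""
    if isinstance(key, str):
        return key.encode('utf-8')
    if isinstance(key, int):
        return key.to_bytes(8, 'little', signed=True)
    raise TypeError(f"no normalization rule for {type(key).__name__}")

def hash_bytes(data):
    """Stage 2: FNV-1a over the bytes -> 32-bit integer."""
    h = 2166136261
    for b in data:
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF
    return h

def index_for(hash_code, table_size):
    """Stage 3: map the hash code to a slot in [0, table_size)."""
    return hash_code % table_size

print(index_for(hash_bytes(normalize("hello")), 16))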
Normalization converts varied data types into a form the hash function can process. Let's examine strategies for different data types:
Primitive Types:
| Type | Normalization Strategy | Example |
|---|---|---|
| int32 | Direct use | 42 → 42 |
| int64 | Fold high/low bits | 0x12345678ABCDEF → XOR/combine |
| float | IEEE bit representation | 3.14 → 0x4048F5C3 |
| double | IEEE bits, fold if needed | 2.718 → bits → fold |
| bool | 0 or 1 | true → 1 |
| char | Unicode code point | 'A' → 65 |
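A sketch of these strategies in Python, using struct to read IEEE bit patterns; the helper names are illustrative:

import struct

def normalize_float(f):
    # Reinterpret the 32-bit IEEE-754 representation as an unsigned int
    return struct.unpack('<I', struct.pack('<f', f))[0]

def normalize_double(d):
    # 64-bit IEEE bits, folded to 32 bits by XORing the halves
    bits = struct.unpack('<Q', struct.pack('<d', d))[0]
    return (bits >> 32) ^ (bits & 0xFFFFFFFF)

def normalize_int64(n):
    # Fold the high and low 32-bit halves so both influence the result
    u = n & 0xFFFFFFFFFFFFFFFF   # view as unsigned 64-bit
    return (u >> 32) ^ (u & 0xFFFFFFFF)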
Strings:
Strings are sequences of characters. The normalization must ensure that equal strings always produce identical byte sequences and that character order is preserved, so "ab" and "ba" normalize differently:
# Direct character processing (most common)
def normalize_string(s):
return [ord(c) for c in s]
# With case-insensitive handling
def normalize_string_ci(s):
return [ord(c.lower()) for c in s]
# With Unicode normalization
import unicodedata
def normalize_string_unicode(s):
normalized = unicodedata.normalize('NFC', s)
return normalized.encode('utf-8')
Important: The same logical string can have different byte representations depending on encoding (UTF-8 vs UTF-16 vs ASCII). Your normalization must be consistent.
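A quick Python demonstration of why this matters; the byte sequences differ even though the logical string is identical:

s = "héllo"
print(s.encode('utf-8'))      # b'h\xc3\xa9llo' (6 bytes)
print(s.encode('utf-16-le'))  # b'h\x00\xe9\x00l\x00l\x00o\x00' (10 bytes)
# Hashing the UTF-8 bytes and the UTF-16 bytes of the same string
# will almost certainly produce different hash codes.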
Compound Objects:
Objects with multiple fields require combining field values:
import struct

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def normalized(self):
        """Convert to a hashable representation."""
        # Option 1: Tuple (relies on tuple hashing)
        return (self.x, self.y)
        # Option 2: Explicit bytes
        # return struct.pack('ii', self.x, self.y)
        # Option 3: Field combination
        # return self.x * 31 + self.y
Critical Consideration: Field Order
The order fields are combined matters:
# Point(1, 2) should differ from Point(2, 1)
class BadHash:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def normalized(self):
        return self.x + self.y       # BAD: (1,2) == (2,1) !!

class GoodHash:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def normalized(self):
        return self.x * 31 + self.y  # GOOD: (1,2) ≠ (2,1)
Collections:
For ordered collections (lists, arrays), preserve order:
def normalize_list(lst):
result = 1
for item in lst:
result = result * 31 + hash(item)
return result
# [1, 2, 3] ≠ [3, 2, 1]
For unordered collections (sets), use order-independent operations:
def normalize_set(s):
return sum(hash(item) for item in s) # Order doesn't matter
# or: XOR all hashes together
# {1, 2, 3} == {3, 1, 2} (same set)
Null/None Values:
Special care is needed for null values:
// Java approach
public int hashCode() {
int result = 17;
result = 31 * result + (name != null ? name.hashCode() : 0);
result = 31 * result + age;
return result;
}
Using 0 for null is conventional but means nullable fields and fields with value 0 might collide more often.
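A small Python illustration of that collision risk; it relies on the fact that CPython defines hash(0) as 0, so a zero-valued field and a missing field contribute identically:

def person_hash(name, age):
    # Mirrors the Java pattern above: None contributes 0
    h = 17
    h = 31 * h + (hash(name) if name is not None else 0)
    h = 31 * h + age
    return h

print(person_hash(None, 30) == person_hash(0, 30))  # True: the hash
# cannot tell a missing field from a field whose hash is 0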
Once data is normalized to bytes or integers, we apply the hash function. Let's examine popular hash function designs:
The djb2 Hash (Dan Bernstein)
One of the most widely used string hash functions:
unsigned long djb2(const char *str) {
unsigned long hash = 5381;
int c;
while ((c = *str++)) {
hash = ((hash << 5) + hash) + c; // hash * 33 + c
}
return hash;
}
Why it works:
• The multiplier 33 reduces to a shift and an add ((hash << 5) + hash), which is cheap on any hardware
• Each multiply-and-add step folds the new character into all the accumulated state, so small input differences spread through the hash
• The constants 5381 and 33 have no deep theory behind them; they were found empirically to distribute real-world strings well
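A direct Python port is handy for experimenting; masking to 32 bits at each step matches the worked example at the end of this page:

def djb2(s):
    h = 5381
    for c in s:
        h = (h * 33 + ord(c)) & 0xFFFFFFFF  # hash * 33 + c, mod 2^32
    return h

print(djb2("hello"))  # 261238937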
The FNV-1a Hash (Fowler-Noll-Vo)
Another excellent general-purpose hash:
uint32_t fnv1a_32(const char *data, size_t len) {
uint32_t hash = 2166136261; // FNV offset basis
for (size_t i = 0; i < len; i++) {
hash ^= (uint8_t)data[i];
hash *= 16777619; // FNV prime
}
return hash;
}
Key properties:
• XOR-then-multiply (the "1a" order) mixes each byte into the state before the multiply diffuses it, giving better avalanche behavior than the original multiply-then-XOR FNV-1
• 2166136261 and 16777619 are the published 32-bit FNV offset basis and FNV prime
• One byte per step, so it works on any byte sequence with no alignment or length requirements
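The same function ports directly to Python; the mask keeps the multiplication within 32 bits:

def fnv1a_32(data: bytes) -> int:
    h = 2166136261                        # FNV offset basis
    for b in data:
        h ^= b                            # XOR the byte in first...
        h = (h * 16777619) & 0xFFFFFFFF   # ...then multiply by the FNV prime
    return h

print(fnv1a_32("hello".encode('utf-8')))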
The MurmurHash Family
Designed for high performance with excellent distribution:
uint32_t murmur3_32(const uint8_t* key, size_t len, uint32_t seed) {
uint32_t h = seed;
const uint32_t c1 = 0xcc9e2d51;
const uint32_t c2 = 0x1b873593;
// Body: process 4-byte chunks
const uint32_t* blocks = (const uint32_t*)key;
for (size_t i = 0; i < len / 4; i++) {
uint32_t k = blocks[i];
k *= c1;
k = (k << 15) | (k >> 17); // ROTL
k *= c2;
h ^= k;
h = (h << 13) | (h >> 19); // ROTL
h = h * 5 + 0xe6546b64;
}
// Tail: process remaining bytes
// ... (handle 1-3 remaining bytes)
// Finalization: mix final bits
h ^= len;
h ^= h >> 16; h *= 0x85ebca6b;
h ^= h >> 13; h *= 0xc2b2ae35;
h ^= h >> 16;
return h;
}
Why MurmurHash is popular:
• Processes four bytes per iteration instead of one, making it much faster on long keys
• The finalization steps give strong avalanche behavior: flipping one input bit flips roughly half the output bits
• The seed parameter lets applications randomize hashing per table or per process
| Hash Function | Speed | Distribution | Complexity | Best Use Case |
|---|---|---|---|---|
| djb2 | Fast | Good | Very Simple | General strings |
| FNV-1a | Fast | Good | Simple | General purpose |
| MurmurHash3 | Very Fast | Excellent | Moderate | High-performance apps |
| xxHash | Extremely Fast | Excellent | Moderate | Big data, streaming |
| SipHash | Moderate | Excellent | Moderate | Security-critical |
We've covered index computation methods (division, multiplication, universal). Now let's see how production implementations handle this:
Power-of-2 Table Sizes with Good Hash Functions:
Modern hash tables often use power-of-2 sizes despite earlier warnings about patterned low bits. Why? Division is one of the slowest integer instructions, and a power-of-2 size replaces it with a single AND: hash & (size - 1) equals hash % size exactly when size is a power of 2. The trade-off is that only the low bits of the hash matter, so the hash function (or an extra mixing step, as below) must distribute them well.

// Power-of-2 optimization
size_t table_size = 1024; // 2^10
size_t mask = table_size - 1; // 1023 = 0b1111111111
size_t index = hash & mask; // Faster than hash % table_size
Java's HashMap Approach:
Java combines the user's hashCode() with an additional "spreading" step:
// Java 8+ HashMap internal hashing
static final int hash(Object key) {
int h;
// XOR high bits into low bits
return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
// Index computation (table.length is always power of 2)
int index = hash(key) & (table.length - 1);
Why h ^ (h >>> 16)?
This mixes the high 16 bits into the low 16 bits. When table size is small (e.g., 16 slots), only the lowest bits are used. This mixing ensures high-order bits influence the index.
Original: 1010 0011 0101 1100 | 0001 1010 0111 1000
Shifted: 0000 0000 0000 0000 | 1010 0011 0101 1100
XOR'd: 1010 0011 0101 1100 | 1011 1001 0010 0100
└── These bits now contain
info from high bits!
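A quick Python check of this behavior, using keys that differ only in their upper bits, exactly the pattern that defeats plain masking:

keys = [i << 16 for i in range(8)]  # differ only above bit 15
size = 16
print([k & (size - 1) for k in keys])                # [0, 0, 0, ...]: high bits ignored
print([(k ^ (k >> 16)) & (size - 1) for k in keys])  # [0, 1, ..., 7]: high bits now count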
Python's dict Approach:
Python uses a more sophisticated probing scheme with perturbation:
# Simplified Python dict indexing logic
def find_slot(table, hash_value):
    mask = len(table) - 1
index = hash_value & mask
perturb = hash_value
while table[index] is not None:
# Probe sequence uses perturbation
perturb >>= 5
index = (5 * index + perturb + 1) & mask
return index
Why perturbation?
Linear probing (index + 1, index + 2, ...) causes clustering. Perturbation creates a pseudo-random probe sequence that depends on the full hash value, not just the index, breaking up clusters.
Rust's HashBrown:
Rust's standard HashMap uses SwissTable/HashBrown with SIMD:
// Simplified concept
fn find_slot(hash: u64, table: &Table) -> usize {
let h1 = (hash >> 7) as usize; // Primary hash
let h2 = (hash & 0x7F) as u8; // Control byte
// Use SIMD to check 16 slots at once
// h2 is stored separately for fast comparison
}
This design allows checking 16 potential slots in a single CPU instruction using SIMD.
Understanding how hash tables interact with memory is crucial for performance. Modern CPUs don't access memory byte-by-byte—they load cache lines (typically 64 bytes).
The Memory Hierarchy:
┌─────────────┐ Access Time Size
│ CPU Regs │ ~0.5 ns ~1 KB
├─────────────┤
│ L1 Cache │ ~1 ns 32-64 KB
├─────────────┤
│ L2 Cache │ ~4 ns 256 KB - 1 MB
├─────────────┤
│ L3 Cache │ ~16 ns 2 - 32 MB
├─────────────┤
│ Main RAM │ ~100 ns 4 - 128 GB
└─────────────┘
A cache miss to main memory is ~100x slower than an L1 cache hit!
Chaining vs. Open Addressing: Cache Perspective
Chaining with Linked Lists:
Table: [ptr] [ptr] [ptr] [ptr] ... ← Contiguous
│ │
▼ ▼
[node] [node] ← Scattered in memory!
│
▼
[node] ← Cache miss each hop
Each pointer dereference is likely a cache miss because nodes are allocated separately and may be anywhere in memory.
Open Addressing:
Table: [entry] [entry] [entry] [entry] ... ← All contiguous!
Entries are stored directly in the array. Sequential probing accesses consecutive memory, likely hitting the same cache line.
Cache Line Utilization:
With 64-byte cache lines and 8-byte entries, a single cache line holds 8 consecutive entries, so after the first access a linear probe can examine up to 7 neighboring slots without another trip to memory.
On modern hardware, memory access patterns often dominate algorithm complexity. An O(1) operation with a cache miss can be slower than O(log n) operations that hit cache. This is why open addressing hash tables often outperform chaining in practice, despite similar theoretical complexity.
Swiss Table Optimization:
Google's Swiss Table (used in Abseil) achieves remarkable cache efficiency:
Control bytes: [h2|h2|h2|h2|h2|h2|h2|h2|...] ← Dense, SIMD-friendly
│ │ │
▼ ▼ ▼
Slots: [entry | entry | entry |...] ← Only accessed on match
By checking metadata first with SIMD, most non-matching slots are eliminated without loading their full entries.
Let's trace complete examples through all three stages:
Example 1: String Key "hello"
┌─── Stage 1: Normalization ───┐
│ │
│ "hello" → [104, 101, 108, 108, 111] (ASCII values)
│ │
└──────────────────────────────┘
│
▼
┌─── Stage 2: Hash Function (djb2) ───┐
│                                      │
│ hash = 5381                          │
│ hash = 5381*33 + 104 = 177677        │
│ hash = 177677*33 + 101 = 5863442     │
│ hash = 5863442*33 + 108 = 193493694  │
│ hash = 193493694*33 + 108 = 6385292010    │
│ hash = 6385292010*33 + 111 = 210714636441 │
│ (intermediates exceed 32 bits)       │
│ Final: 210714636441 mod 2^32 = 261238937  │
│                                      │
└──────────────────────────────────────┘
│
▼
┌─── Stage 3: Index (m=16) ───┐
│ │
│ index = 261238937 mod 16 │
│ index = 9 │
│ │
└──────────────────────────────┘
"hello" → Slot 9
Example 2: Composite Object Person("Alice", 30)
┌─── Stage 1: Normalization ───┐
│ │
│ name = "Alice" │
│ age = 30 │
│ │
│ Combine, seeding with 17:    │
│ Step 1: 17*31 + hash("Alice") = 527 + 63494359 = 63494886 │
│ Step 2: 63494886*31 + 30 = 1968341466 + 30 = 1968341496   │
│ │
└──────────────────────────────┘
│
▼
┌─── Stage 2: Hash Already Computed ───┐
│ │
│ hash = 1968341496                     │
│ │
└───────────────────────────────────────┘
│
▼
┌─── Stage 3: Index (m=32) ───┐
│ │
│ Java-style spread:           │
│ h = 1968341496               │
│ h >>> 16 = 30034             │
│ h ^ (h >>> 16) = 1968341496 ^ 30034 = 1968370858 │
│ index = 1968370858 & 31 = 10 │
│ │
└──────────────────────────────┘
Person("Alice", 30) → Slot 7
Example 3: Integer Key 42 (Simple Case)
┌─── Stage 1: Normalization ───┐
│ │
│ 42 is already an integer │
│ Normalized value: 42 │
│ │
└──────────────────────────────┘
│
▼
┌─── Stage 2: Hash Function ───┐
│ │
│ Many implementations use │
│ identity for small ints: │
│ hash(42) = 42 │
│ │
│ Better: apply mixing │
│ hash = 42 * 2654435769 │
│ hash = 111486302298 mod 2^32│
│ hash = 4112119898           │
│ │
└──────────────────────────────┘
│
▼
┌─── Stage 3: Index (m=8) ───┐
│ │
│ With identity hash: │
│ index = 42 mod 8 = 2 │
│ │
│ With mixing: │
│ index = 4112119898 & 7 = 2 │
│ │
└─────────────────────────────┘
42 → Slot 2
Important: Using identity for integer keys can cause problems with patterned data. Better implementations always apply some mixing.
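A short Python experiment makes this concrete, assuming a hypothetical 8-slot table. Identity hashing with low-bit masking collapses patterned keys onto a single slot, while multiplicative (Fibonacci) hashing, which keeps the top bits of hash * 2654435769, scatters them:

def identity_index(k, bits=3):
    return k & ((1 << bits) - 1)           # low bits of the raw key

def fib_index(k, bits=3):
    h = (k * 2654435769) & 0xFFFFFFFF      # multiply by ~2^32/phi
    return h >> (32 - bits)                # keep the TOP bits

keys = [8 * i for i in range(8)]           # patterned: multiples of 8
print([identity_index(k) for k in keys])   # [0, 0, 0, 0, 0, 0, 0, 0]
print([fib_index(k) for k in keys])        # spread across several slots

Taking the top bits matters here: multiplication alone cannot repair zeros in the low bits, since a product of multiples of 8 is still a multiple of 8.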
We've traced the complete journey from arbitrary data to array positions, the transformation that makes hash tables work. To consolidate: normalization turns any key type into a canonical byte form, the hash function condenses those bytes into a well-mixed integer, and index computation maps that integer onto the table, with memory layout deciding how fast the final access really is.
What's Next:
Having understood the mechanical transformation from data to indices, we're ready to demystify the perceived "magic" of hashing. The next page reveals why hashing works—the mathematical and intuitive foundations that make this elegant transformation reliable and powerful.
You now understand the end-to-end transformation from arbitrary data to array positions. You can trace any key through normalization, hashing, and index computation. This knowledge enables you to debug performance issues, design custom hash functions, and optimize for modern hardware.