A hash function produces a hash value—but that value alone isn't useful until we translate it into an array index. This translation is the bridge between the abstract mathematical concept of hashing and the concrete implementation of a hash table.
Consider this: a hash function might produce the value 2,147,483,647 for some key, but your array only has 1,000 slots. How do we map that large number to a valid array position? How do we ensure this mapping preserves the uniform distribution we worked so hard to achieve in the hash function?
This page answers these questions. We'll explore the techniques for converting arbitrary hash values into usable array indices, understand the mathematical considerations involved, and learn why certain approaches work better than others.
By the end of this page, you will understand:
• How to convert hash values to valid array indices
• The division method and its mathematical foundations
• Why table size selection critically affects performance
• The multiplication method and its advantages
• Universal hashing techniques for robust key-to-index mapping
• Common pitfalls in index computation and how to avoid them
Let's formalize what we need to accomplish. We have:
• A hash function h(k) that maps keys to (typically large) integers
• An array of size m, with valid slots 0 through m-1

We need to compute index(k) ∈ {0, 1, ..., m-1} as a function of h(k).
This transformation must satisfy several requirements:
Requirement 1: Valid Range
The output must always be a valid array index. For an array of size m, this means the range [0, m-1].

Requirement 2: Preservation of Distribution
If h(k) distributes keys uniformly across its output range, index(k) should distribute keys uniformly across [0, m-1]. A poor index computation can destroy a good hash function's distribution.

Requirement 3: Determinism
The same key must always map to the same index: index(k) must be a pure function of k and m.

Requirement 4: Efficiency
The computation should be fast—ideally O(1) using only basic arithmetic.
Many programmers assume that the hash function's output IS the array index. This is only true when the hash function is specifically designed for a fixed table size. In practice, hash functions typically output large integers (32 or 64 bits), and we need a separate step to map these to array indices.
Example: The Problem in Action
Suppose we use Java's String.hashCode() method:
"hello".hashCode() = 99162322
"world".hashCode() = 113318802
"apple".hashCode() = 93029210
These values are 32-bit integers. If our hash table has only 16 slots, we need to map these large values to the range [0, 15]. How?
The naive approach—just use the hash value directly—fails immediately: 99162322 is not a valid index for a 16-slot array.
We need a systematic method to compress these values into our target range while preserving as much of the original distribution as possible.
The division method is the most common and intuitive approach to index computation:
index(k) = h(k) mod m
where m is the table size.
This method:
• Always produces a value in [0, m-1] for non-negative h(k)
• Costs a single arithmetic operation, so it's O(1)
• Preserves uniformity well, provided m is chosen carefully
Example:
m = 16 (table size)
index("hello") = 99162322 mod 16 = 2
index("world") = 113318802 mod 16 = 2 // Collision!
index("apple") = 93029210 mod 16 = 10
But wait—"hello" and "world" both map to index 2. Is this the division method's fault, or were these hash values already going to collide? Here it's the mapping: 99162322 and 113318802 are distinct values that happen to share their last four bits, so mod 16 sends both to slot 2. A different table size would separate them—m = 17, for instance, sends them to slots 13 and 15.
The Choice of m Matters Enormously
The division method's quality depends critically on choosing m well. Poor choices of m can introduce patterns that destroy uniformity.
Why Powers of 2 Can Be Dangerous:
If m = 2^k, then h mod m is equivalent to taking the last k bits of h. This is fast but ignores all higher-order bits.
m = 16 = 2^4 → index = last 4 bits of hash
If hash function has patterns in low bits:
hash₁ = ...00010000 → index = 0
hash₂ = ...01110000 → index = 0
hash₃ = ...10100000 → index = 0
All three hash to the same index despite being different!
If the hash function has any weakness in its low-order bits, using a power of 2 amplifies that weakness.
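The effect is easy to reproduce. The sketch below uses the hypothetical bit patterns from the example above, comparing a power-of-2 table size against a prime one:

```python
# Hash values that share their four low-order bits (hypothetical patterns
# from the example): under m = 16 only those bits survive the mod.
hashes = [0b00010000, 0b01110000, 0b10100000]  # all end in four zero bits

indices_pow2 = [h % 16 for h in hashes]   # m = 16 = 2^4: keeps only low 4 bits
indices_prime = [h % 17 for h in hashes]  # m = 17 (prime): all bits influence the result

print(indices_pow2)   # every key lands in slot 0
print(indices_prime)  # keys spread across distinct slots
```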
Why Primes Are Preferred:
Prime numbers for m tend to scatter keys more uniformly because they have no common factors with typical integer patterns.
m = 17 (prime)
hash₁ = 272 → 272 mod 17 = 0
hash₂ = 273 → 273 mod 17 = 1
hash₃ = 274 → 274 mod 17 = 2
...
With a prime m, consecutive hash values map to consecutive indices, and because m shares no factors with common stride patterns in the data (keys that are all even, say, or all multiples of 2^k), every slot remains reachable and no subset of slots is systematically favored.
For the division method, choose m as a prime number not close to a power of 2. Example: for a table of roughly 1000 entries, use m = 1009 or 1013 rather than m = 1024. This simple choice often dramatically improves distribution.
| Table Size Target | Poor Choice | Better Choice | Reason |
|---|---|---|---|
| ~100 | 100, 128 | 97, 101, 103 | 97/101/103 are prime |
| ~1000 | 1000, 1024 | 997, 1009, 1013 | 997/1009/1013 are prime |
| ~10000 | 10000, 16384 | 10007, 10009 | 10007/10009 are prime |
| ~100000 | 100000, 131072 | 100003, 100019 | Primes not near 2^k |
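The "better choices" in the table can be found mechanically. Below is a minimal trial-division sketch (next_prime is a helper name used here for illustration; trial division is plenty fast for one-off table sizing):

```python
def next_prime(n):
    """Return the smallest prime >= n (trial division; fine for table sizing)."""
    def is_prime(x):
        if x < 2:
            return False
        if x % 2 == 0:
            return x == 2
        f = 3
        while f * f <= x:
            if x % f == 0:
                return False
            f += 2
        return True
    while not is_prime(n):
        n += 1
    return n

print(next_prime(100))    # 101
print(next_prime(1000))   # 1009
print(next_prime(10000))  # 10007
```

Note that this only finds the nearest prime; per the guidance above, you may still want to skip primes that sit right next to a power of 2.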
Handling Negative Hash Values:
In many languages, hash values can be negative (e.g., Java's String.hashCode() can return negative integers). The modulo operation with negative numbers can produce negative results:
// Java example
int hash = "some string".hashCode(); // Might be negative!
int index = hash % m; // Could be negative!
Solutions:
// Approach 1: Bit masking (most efficient)
int index = (hash & 0x7FFFFFFF) % m; // Clear sign bit first
// Approach 2: Math.abs (beware of Integer.MIN_VALUE!)
int index = Math.abs(hash) % m; // Edge case: abs(MIN_VALUE) overflows!
// Approach 3: Mathematical correction
int index = ((hash % m) + m) % m; // Always positive
The first approach (bit masking) is typically preferred for its simplicity and efficiency.
The multiplication method offers an alternative that's less sensitive to the choice of m:
index(k) = ⌊m × (h(k) × A mod 1)⌋
where:
• m is the table size (any value works, even a power of 2)
• A is a constant with 0 < A < 1
• (h(k) × A mod 1) denotes the fractional part of h(k) × A
Step-by-step example:
Let m = 16, A = 0.6180339887 (related to the golden ratio), h(k) = 99162322
1. h(k) × A = 99162322 × 0.6180339887 ≈ 61285685.394
2. Take the fractional part: 0.394
3. Multiply by m: 16 × 0.394 ≈ 6.31
4. Take the floor: index = 6
Why the Multiplication Method Works:
The multiplication method uses the fractional part of h(k) × A. When A is an irrational number (or a good approximation), this operation "scrambles" the hash value in a way that distributes across [0, 1) with minimal patterns.
The Golden Ratio Connection:
Knuth suggests using A ≈ (√5 - 1) / 2 ≈ 0.6180339887 (the golden ratio's inverse). This value has special properties that minimize clustering: successive multiples of it, taken mod 1, spread as evenly as possible over [0, 1) (a consequence of the three-distance theorem), so even highly regular sequences of hash values get dispersed.
Advantages of the Multiplication Method:
• The choice of m is not critical—powers of 2 work fine
• It can be implemented with integer multiplication and bit shifts, avoiding division entirely
• A single well-chosen constant A serves any table size
// Fast implementation when m = 2^p
// Assuming 32-bit hash values
uint32_t hash = compute_hash(key);
uint32_t product = hash * 2654435769U; // A ≈ 0.618 as fixed-point
uint32_t index = product >> (32 - p); // Extract top p bits
The magic constant 2654435769 is ⌊(√5 - 1) / 2 × 2^32⌋.
In practice, the multiplication method is often implemented using integer arithmetic and bit shifts to avoid floating-point operations. The key insight is that we're essentially using the high-order bits of a fixed-point multiplication, which mixes information from all parts of the hash value.
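The same fixed-point trick translates directly to Python (mult_index is a name chosen here for illustration; the sketch assumes 32-bit hash values and a power-of-2 table size):

```python
KNUTH_A = 2654435769  # floor(((sqrt(5) - 1) / 2) * 2**32), as in the C snippet above

def mult_index(hash_value, p):
    """Multiplication-method index for a table of size m = 2**p."""
    product = (hash_value * KNUTH_A) & 0xFFFFFFFF  # low 32 bits ~= frac(h * A) * 2**32
    return product >> (32 - p)                     # top p bits = floor(m * frac(h * A))

# The "hello".hashCode() value from earlier, mapped into a 16-slot table (p = 4):
print(mult_index(99162322, 4))  # 6
```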
Universal hashing provides provable guarantees against worst-case behavior by randomly selecting the hash function from a carefully designed family.
The Carter-Wegman Universal Hash Family:
For integer keys in range [0, p-1] (where p is a large prime) and table size m:
h_{a,b}(k) = ((a × k + b) mod p) mod m
where:
• p is a prime larger than the largest possible key value
• a is chosen uniformly at random from {1, 2, ..., p-1}
• b is chosen uniformly at random from {0, 1, ..., p-1}
Why is this "universal"?
For any two distinct keys k₁ ≠ k₂:
P[h_{a,b}(k₁) = h_{a,b}(k₂)] ≤ 1/m
This probability is over the random choice of a and b. No adversary can construct a bad input because they don't know which a and b we'll choose.
Practical Implementation:
import random

class UniversalHashTable:
    def __init__(self, size, prime=10000019):
        self.m = size
        self.p = prime  # Prime larger than key range
        self.a = random.randint(1, prime - 1)
        self.b = random.randint(0, prime - 1)
        self.table = [None] * size

    def _hash(self, key):
        # Convert key to integer if necessary
        if isinstance(key, str):
            key = sum(ord(c) * (31 ** i) for i, c in enumerate(key))
        return ((self.a * key + self.b) % self.p) % self.m

    def insert(self, key, value):
        index = self._hash(key)
        # ... collision handling ...
When to Use Universal Hashing:
• When inputs may be adversarial (e.g., keys supplied by untrusted users)
• When you need provable expected-case guarantees rather than empirical ones
• When a fixed hash function has shown pathological clustering on your data
In 2011, researchers demonstrated "hash flooding" attacks against web frameworks by crafting POST requests with keys that all collided in the language's hash table. Universal hashing (or randomized hash seeds) prevents such attacks because attackers can't predict which hash function will be used.
So far we've assumed keys are integers. In practice, keys are often strings, objects, or composite types. The index computation pipeline becomes:
Key → Integer Hash → Array Index
Let's examine common approaches for the first step: converting arbitrary keys to integers.
Strings:
Strings are the most common non-integer key type. Several methods exist:
# Method 1: Sum of characters (POOR - many collisions)
def naive_hash(s):
    return sum(ord(c) for c in s)
# "abc" and "cba" and "bac" all hash to the same value!

# Method 2: Polynomial rolling hash (GOOD)
def polynomial_hash(s, base=31):
    h = 0
    for c in s:
        h = h * base + ord(c)
    return h
# "abc" = 97*31² + 98*31 + 99 = 93217 + 3038 + 99 = 96354
# "cba" = 99*31² + 98*31 + 97 = 95139 + 3038 + 97 = 98274
# Now they're different!

# Method 3: djb2 (GOOD - widely used)
def djb2_hash(s):
    h = 5381
    for c in s:
        h = ((h << 5) + h) + ord(c)  # h * 33 + c
    return h
Composite Objects:
For objects with multiple fields, combine the hash values of individual fields:
// Java example
public class Person {
    private String name;
    private int age;
    private String city;

    @Override
    public int hashCode() {
        int result = 17; // Non-zero initial value
        result = 31 * result + (name != null ? name.hashCode() : 0);
        result = 31 * result + age;
        result = 31 * result + (city != null ? city.hashCode() : 0);
        return result;
    }
}
Why 31?
• It's an odd prime, so multiplying by it mixes bits without systematically discarding information
• 31 × i can be computed as (i << 5) - i, a cheap shift-and-subtract on any hardware
• Empirically it distributes typical string and field data well—Java's own String.hashCode() uses it
Floating-Point Numbers:
Floating-point keys require special care:
// Java approach: use bit representation
double d = 3.14159;
long bits = Double.doubleToLongBits(d);
int hash = (int)(bits ^ (bits >>> 32));
Caution: Floating-point comparison is tricky due to precision issues. Consider whether you really want exact floating-point keys, or if you should quantize them first.
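Both options can be sketched in Python. Here float_bits_hash mirrors the Java bit-pattern approach, while quantized_hash and its rounding precision are illustrative choices for the quantize-first strategy:

```python
import struct

def float_bits_hash(d):
    """Hash a float via its IEEE-754 bit pattern (mirrors the Java approach)."""
    bits = struct.unpack('<Q', struct.pack('<d', d))[0]  # 64-bit representation
    return (bits ^ (bits >> 32)) & 0xFFFFFFFF            # fold high bits into low

def quantized_hash(d, decimals=6):
    """Hash a float after rounding, so nearly-equal values share a bucket."""
    return hash(round(d, decimals))

# 0.1 + 0.2 != 0.3 exactly, so the raw bit patterns hash differently...
print(float_bits_hash(0.1 + 0.2) == float_bits_hash(0.3))  # False
# ...but quantizing first makes the two keys agree:
print(quantized_hash(0.1 + 0.2) == quantized_hash(0.3))    # True
```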
Arrays and Collections:
For ordered collections, use a rolling polynomial hash:
def array_hash(arr):
    h = 0
    for elem in arr:
        h = h * 31 + hash(elem)
    return h
For unordered collections (sets), use XOR or sum (order-independent operations):
def set_hash(s):
    return sum(hash(elem) for elem in s)  # Order doesn't matter
In languages like Java: if two objects are equal (equals() returns true), they MUST have the same hashCode(). The reverse isn't required—unequal objects may have the same hash (collisions). Violating this contract causes hash tables to malfunction silently.
Index computation seems simple, but subtle bugs are common. Let's examine the most frequent pitfalls:
Pitfall 1: Negative hash values. Use (hash & 0x7FFFFFFF) % m or ((hash % m) + m) % m. Avoid Math.abs(hash)—it fails for Integer.MIN_VALUE in Java, where abs overflows back to a negative number.
Pitfall 2: Integer overflow during hash computation. In languages with fixed-width integers, reduce as you go: h = (h * 31 + c) % LARGE_PRIME prevents overflow.
Testing Your Hash Function:
def test_hash_distribution(hash_func, keys, table_size):
    """Test if a hash function distributes keys uniformly."""
    buckets = [0] * table_size
    for key in keys:
        index = hash_func(key) % table_size
        buckets[index] += 1
    expected = len(keys) / table_size
    variance = sum((b - expected) ** 2 for b in buckets) / table_size
    std_dev = variance ** 0.5
    print(f"Expected per bucket: {expected:.2f}")
    print(f"Standard deviation: {std_dev:.2f}")
    print(f"Min: {min(buckets)}, Max: {max(buckets)}")
    # For good distribution, std_dev should be close to sqrt(expected)
    return std_dev < 2 * (expected ** 0.5)
Run this test with representative data to verify your hash function behaves well.
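As a concrete comparison, the snippet below runs the same distribution check on the naive character sum versus the polynomial hash from earlier (the functions are re-defined in compact form so the snippet stands alone; the key set and table size are arbitrary choices):

```python
# Compact re-definitions of the earlier hash functions and test logic.
def naive_hash(s):
    return sum(ord(c) for c in s)

def polynomial_hash(s, base=31):
    h = 0
    for c in s:
        h = h * base + ord(c)
    return h

def distribution_std(hash_func, keys, table_size):
    """Standard deviation of bucket counts (lower = more uniform)."""
    buckets = [0] * table_size
    for key in keys:
        buckets[hash_func(key) % table_size] += 1
    expected = len(keys) / table_size
    return (sum((b - expected) ** 2 for b in buckets) / table_size) ** 0.5

keys = [f"user_{i}" for i in range(10000)]
naive_std = distribution_std(naive_hash, keys, 1013)
poly_std = distribution_std(polynomial_hash, keys, 1013)
print(f"naive: {naive_std:.2f}, polynomial: {poly_std:.2f}")
```

The character sum crams all 10,000 keys into a narrow band of sums (roughly 590 to 770 for these strings), so only a small fraction of the 1013 buckets are ever used and its deviation is far larger than the polynomial hash's.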
We've explored the critical bridge between hash values and array positions. This mapping determines whether your hash table achieves O(1) performance or degrades to O(n). Let's consolidate the key insights:
• index = hash % m is simple and fast, but requires careful choice of m (prefer primes not near powers of 2).
• The multiplication method tolerates any table size, including powers of 2, and runs on multiplies and shifts instead of division.
• Universal hashing randomizes the choice of hash function, giving provable collision bounds even against adversarial keys.
• Watch the edge cases: negative hash values, integer overflow, and the equals/hashCode contract.

What's Next:
We've now covered how to compute hash values and map them to array indices. The next page explores the full journey from arbitrary data to array positions—seeing the complete picture of how inputs like strings, objects, and composite types become usable storage locations.
You now understand the mechanics of translating hash values to array indices—the critical step that enables O(1) hash table operations. You can choose between division and multiplication methods, handle edge cases like negative values, and apply universal hashing when stronger guarantees are needed.