Imagine you're building a music recommendation system with 100 million songs, each represented as a 512-dimensional feature vector capturing tempo, key, instrumentation, and audio characteristics. A user requests songs similar to their current favorite. The naive approach—computing the distance to every song in the database—requires 100 million distance calculations, each involving 512 dimensions. Even at 100 nanoseconds per calculation, that's 10 seconds per query. For a service handling thousands of requests per second, this is catastrophic.
This isn't a theoretical concern. Every major tech company—Spotify, Netflix, Google, Amazon—faces this exact challenge: how do you find the nearest neighbors to a query point among billions of items without computing all distances?
The solution lies in a counterintuitive idea: hashing. But not the hashing you know from hash tables. A special kind of hashing that does exactly what traditional hashing tries to avoid.
By the end of this page, you will understand the fundamental paradigm shift from traditional hashing to locality-sensitive hashing. You'll see why traditional hashing's core design principle must be inverted for similarity search, and how LSH achieves sub-linear query time by deliberately creating "structured collisions" based on similarity.
Before we can appreciate locality-sensitive hashing, we must deeply understand traditional hashing and its fundamental design philosophy.
What is a hash function?
A hash function h maps inputs from a (typically large) domain to a (typically smaller) codomain:
$$h: U \rightarrow \{0, 1, 2, \ldots, m-1\}$$
where $U$ is the universe of possible inputs and $m$ is the number of buckets. The hash value $h(x)$ is often called the hash code, hash digest, or simply the hash of $x$.
The Core Design Principle: Collision Minimization
Traditional hash functions are designed with one overarching goal: minimize collisions. A collision occurs when two distinct inputs produce the same hash value:
$$h(x) = h(y) \quad \text{where } x \neq y$$
This principle drives everything in traditional hashing.
Why collision minimization matters for hash tables:
In a hash table, we use the hash value as an index into an array. When a collision occurs, we must handle it through chaining (linked lists at each bucket) or open addressing (probing for empty slots). Collisions degrade performance: expected O(1) lookups degrade toward the length of the longest chain, and heavily loaded buckets also hurt cache locality.
The entire hash table paradigm rests on minimizing collisions. A hash function that produced many collisions would be considered broken for hash table applications.
```python
import hashlib

def traditional_hash(text: str, num_buckets: int = 1000) -> int:
    """
    Traditional hash function using SHA-256.
    Note how SIMILAR inputs produce COMPLETELY DIFFERENT outputs.
    """
    hash_bytes = hashlib.sha256(text.encode()).digest()
    # Convert first 8 bytes to integer and mod by bucket count
    hash_int = int.from_bytes(hash_bytes[:8], 'big')
    return hash_int % num_buckets

# Demonstrate the avalanche effect
strings = [
    "The quick brown fox",
    "The quick brown foX",  # One character changed
    "The quick brown foy",  # One character different
    "The quick brown foz",  # One character different
]

print("Traditional Hashing - Avalanche Effect Demo:")
print("-" * 50)
for s in strings:
    h = traditional_hash(s)
    print(f"'{s}' -> bucket {h}")

# Output (example):
# 'The quick brown fox' -> bucket 342
# 'The quick brown foX' -> bucket 891  # Completely different!
# 'The quick brown foy' -> bucket 127  # Completely different!
# 'The quick brown foz' -> bucket 658  # Completely different!

# The avalanche effect ensures that similar strings
# hash to DIFFERENT buckets - this is BY DESIGN
# for traditional hash functions.
```

The avalanche effect—where similar inputs hash to vastly different outputs—is a deliberate design goal of traditional hash functions. It ensures uniform distribution and is essential for security applications. But this very property makes traditional hashing useless for similarity search.
Now let's formally define the problem that traditional hashing cannot solve.
The Nearest Neighbor Problem:
Given:

- A set $P = \{p_1, p_2, \ldots, p_n\}$ of $n$ points in a space equipped with a distance function $d(\cdot, \cdot)$
- A query point $q$
Find: $$\text{NN}(q) = \arg\min_{p \in P} d(q, p)$$
The exact solution requires computing $d(q, p_i)$ for all $n$ points, yielding $O(n)$ query time.
The Approximate Nearest Neighbor (ANN) Problem:
For many applications, we can relax the requirement. Given an approximation factor $c > 1$:
Find a point $p^*$ such that: $$d(q, p^*) \leq c \cdot d(q, \text{NN}(q))$$
That is, we accept a point whose distance is at most $c$ times the true nearest neighbor distance. This relaxation is crucial—it allows sub-linear query time.
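As a baseline, the exact linear scan is only a few lines. A minimal sketch with toy data (the function name and points here are illustrative, not from a library):

```python
import numpy as np

def nearest_neighbor(q: np.ndarray, P: np.ndarray) -> tuple:
    """Exact nearest neighbor by linear scan: O(n*d) per query."""
    dists = np.linalg.norm(P - q, axis=1)  # n Euclidean distances
    idx = int(np.argmin(dists))
    return idx, float(dists[idx])

# Tiny example: 5 points in 3 dimensions
P = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [3.0, 3.0, 3.0],
              [0.1, 0.1, 0.0]])
q = np.array([0.05, 0.1, 0.0])
idx, dist = nearest_neighbor(q, P)
print(idx, round(dist, 3))  # point 4 is closest
```

Every method that follows is an attempt to beat this scan without giving up too much accuracy.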
| Method | Query Time | Space | Guarantees | Scalability |
|---|---|---|---|---|
| Brute Force | O(nd) | O(nd) | Exact | Poor (linear scan) |
| KD-Trees | O(log n) avg, O(n) worst | O(nd) | Exact | Poor in high dimensions |
| Ball Trees | O(log n) avg, O(n) worst | O(nd) | Exact | Degrades with dimension |
| LSH | O(n^ρ), ρ < 1 | O(n^(1+ρ)) | Approximate (1+ε) | Excellent in high dimensions |
| Approximate KD-Trees | O(log n) | O(nd) | Approximate | Moderate |
Why tree-based methods fail in high dimensions:
KD-trees and ball trees work brilliantly in low dimensions (d ≤ 20 or so). They partition space into regions, allowing the search to prune large portions of the dataset.
But in high dimensions, the curse of dimensionality destroys this efficiency: distances between points concentrate around a common value, the bounding regions of tree nodes overlap almost everywhere, and the search can prune almost nothing, so it ends up visiting most of the tree.
Empirical studies show that KD-trees become slower than brute force when $d > 10$-$20$ for typical datasets.
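The concentration of distances is easy to observe directly. A small illustrative experiment (not a benchmark; the function name is my own) measuring the ratio of the farthest to the nearest distance from a random query:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(d: int, n: int = 2000) -> float:
    """Ratio of farthest to nearest distance from a random query
    to n uniform random points; a ratio near 1 means distances are
    nearly indistinguishable, so spatial pruning has nothing to cut."""
    points = rng.random((n, d))
    q = rng.random(d)
    dists = np.linalg.norm(points - q, axis=1)
    return float(dists.max() / dists.min())

for d in [2, 10, 100, 1000]:
    print(d, round(distance_contrast(d), 2))
# The contrast shrinks steadily as d grows: in high dimensions the
# nearest and farthest points are almost equidistant from the query.
```

This is precisely the regime where tree-based pruning fails and a different idea is needed.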
The insight that leads to LSH:
If we think about the problem differently, we realize:
We don't need to find the exact structure of the point cloud. We just need a way to quickly find points that are likely to be similar.
What if we could create a hash function where similar points are likely to hash to the same bucket? Then we could:

- Hash every database point once, as a preprocessing step
- Hash the query point at search time
- Compute exact distances only to the points that share the query's bucket
Traditional hashing tries to scatter similar items. For similarity search, we need the opposite: a hash function that clusters similar items. This paradigm inversion is the essence of locality-sensitive hashing.
Locality-Sensitive Hashing (LSH) is a family of hashing techniques where the probability of collision is directly related to similarity. Unlike traditional hashing, LSH is designed so that:
Similar items are more likely to hash to the same bucket than dissimilar items.
Formal Definition:
A family $\mathcal{H}$ of hash functions is called $(d_1, d_2, p_1, p_2)$-sensitive for a distance measure $d$ if for any $h \in \mathcal{H}$ chosen uniformly at random:

- If $d(x, y) \leq d_1$, then $\Pr[h(x) = h(y)] \geq p_1$
- If $d(x, y) \geq d_2$, then $\Pr[h(x) = h(y)] \leq p_2$

where $d_1 < d_2$ and $p_1 > p_2$.
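To make the definition concrete, here is a sketch (function names are my own) of the simplest LSH family, bit sampling on binary vectors: picking a random coordinate $i$ and returning $x[i]$ gives $\Pr[h(x) = h(y)] = 1 - d_H(x, y)/d$, so for any thresholds $d_1 < d_2$ the family is $(d_1, d_2, 1 - d_1/d, 1 - d_2/d)$-sensitive:

```python
import random

def bit_sample_family(d: int):
    """Return a random bit-sampling hash: h(x) = x[i] for a random i.
    For bit vectors of length d, Pr[h(x) = h(y)] = 1 - hamming(x, y)/d."""
    i = random.randrange(d)
    return lambda x: x[i]

random.seed(0)
d = 64
x = [random.randint(0, 1) for _ in range(d)]
y = x[:]                        # start as a copy of x
for i in random.sample(range(d), 8):
    y[i] ^= 1                   # flip 8 distinct bits -> hamming(x, y) = 8

# Empirical collision rate over many sampled hash functions
trials = 20000
hits = sum(1 for _ in range(trials)
           if (h := bit_sample_family(d))(x) == h(y))
print(hits / trials)   # close to 1 - 8/64 = 0.875
```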
Interpreting the definition:

Points within distance $d_1$ of each other ("near" points) collide with probability at least $p_1$; points farther apart than $d_2$ ("far" points) collide with probability at most $p_2$. For distances between $d_1$ and $d_2$, the definition makes no guarantee.
The LSH Quality Parameter:
The quality of an LSH family is often characterized by the parameter $\rho$:
$$\rho = \frac{\ln(1/p_1)}{\ln(1/p_2)}$$
For a useful LSH family, we need $\rho < 1$. The smaller $\rho$ is, the better the hash family discriminates between near and far points. The query time for $(1 + \epsilon)$-approximate nearest neighbor search is $O(n^\rho)$, so smaller $\rho$ means faster queries.
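As a quick worked example (the probabilities are illustrative; they reuse the $p_1 = 0.8$, $p_2 = 0.6$ pair that appears in the amplification discussion), $\rho$ is a one-liner:

```python
import math

def rho(p1: float, p2: float) -> float:
    """LSH quality exponent: query time scales as O(n^rho)."""
    return math.log(1 / p1) / math.log(1 / p2)

# Illustrative collision probabilities for near vs far points
p1, p2 = 0.8, 0.6
r = rho(p1, p2)
n = 100_000_000
print(round(r, 3))       # ~0.437 for this pair
print(round(n ** r))     # rough candidate count instead of n (constants ignored)
```

A sharper family (larger gap between $p_1$ and $p_2$) drives $\rho$ down and the candidate count with it; the asymptotic bound hides constants, so treat $n^\rho$ as a scaling law, not an exact count.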
Why does this work?
The magic happens when we use multiple hash functions and multiple hash tables. By combining $k$ hash functions into a single hash key and using $L$ independent hash tables, we can drive the collision probability for near points close to 1 while keeping it small for far points.
```python
"""
Conceptual illustration of LSH vs Traditional Hashing.
This demonstrates the PARADIGM INVERSION at the heart of LSH.
"""

import numpy as np

class TraditionalHashDemo:
    """Traditional hashing: similar items -> different buckets"""

    def hash_vector(self, v: np.ndarray, num_buckets: int = 100) -> int:
        # Convert vector to bytes and hash
        v_bytes = v.tobytes()
        hash_val = hash(v_bytes)
        return hash_val % num_buckets

class LSHDemo:
    """LSH: similar items -> likely same bucket"""

    def __init__(self, dim: int, num_hyperplanes: int = 8):
        """
        Simple random projection LSH for cosine similarity.
        Each hyperplane divides space in two; we use k hyperplanes
        to create 2^k buckets.
        """
        # Random hyperplanes through origin
        self.hyperplanes = np.random.randn(num_hyperplanes, dim)

    def hash_vector(self, v: np.ndarray) -> str:
        """
        Hash by checking which side of each hyperplane the vector lies.
        Similar vectors are likely on the same side of most hyperplanes.
        """
        # Compute dot product with each hyperplane
        projections = np.dot(self.hyperplanes, v)
        # Convert to binary: 1 if positive, 0 if negative
        bits = (projections >= 0).astype(int)
        # Return as binary string (the hash)
        return ''.join(map(str, bits))

# Demonstration
np.random.seed(42)
dim = 100

# Create two sets of vectors
base_vector = np.random.randn(dim)
base_vector = base_vector / np.linalg.norm(base_vector)

# Similar vectors (small perturbations)
similar_vectors = [base_vector + 0.1 * np.random.randn(dim) for _ in range(5)]
similar_vectors = [v / np.linalg.norm(v) for v in similar_vectors]

# Dissimilar vectors (random directions)
dissimilar_vectors = [np.random.randn(dim) for _ in range(5)]
dissimilar_vectors = [v / np.linalg.norm(v) for v in dissimilar_vectors]

# Compare hashing behaviors
traditional = TraditionalHashDemo()
lsh = LSHDemo(dim, num_hyperplanes=8)

print("=" * 60)
print("TRADITIONAL HASHING (designed to scatter similar items)")
print("=" * 60)
print(f"Base vector hash: {traditional.hash_vector(base_vector)}")
print("Similar vectors hashes:",
      [traditional.hash_vector(v) for v in similar_vectors])
# Output: All different buckets! (avalanche effect)

print()
print("=" * 60)
print("LOCALITY-SENSITIVE HASHING (designed to cluster similar items)")
print("=" * 60)
print(f"Base vector hash: {lsh.hash_vector(base_vector)}")
print("Similar vectors hashes:",
      [lsh.hash_vector(v) for v in similar_vectors])
# Output: Many identical or similar hash values!

print("Dissimilar vectors hashes:",
      [lsh.hash_vector(v) for v in dissimilar_vectors])
# Output: Different hash values from base
```

Let's crystallize the fundamental difference between traditional hashing and LSH. This conceptual clarity is essential for understanding everything that follows.
The mathematical intuition:
Traditional hashing can be thought of as a function that "scrambles" the input space. Points close together in the original space are mapped essentially at random in the hash space. The correlation between input similarity and output similarity is intentionally zero (or negative for cryptographic hashes).
LSH inverts this. An LSH function is constructed so that:
$$\text{Pr}[h(x) = h(y)] = f(d(x, y))$$
where $f$ is a monotonically decreasing function of the distance. The closer two points are, the more likely they are to receive the same hash. The hash function is a similarity-preserving projection from high-dimensional space to a discrete set of buckets.
Why not just check all colliding points?
The power of LSH comes from amplification. A single LSH function might have:

- $p_1 = 0.8$ collision probability for near points
- $p_2 = 0.6$ collision probability for far points

This gap seems small. But by combining $k$ functions (requiring ALL to match for a collision), the probabilities become $p_1^k$ and $p_2^k$.

For $k = 10$:

- Near points: $0.8^{10} \approx 0.107$
- Far points: $0.6^{10} \approx 0.006$
The ratio has grown from $0.8/0.6 = 1.33$ to $0.107/0.006 \approx 17.8$!
We then use $L$ independent hash tables to boost the probability of finding near points. This AND-then-OR amplification is the secret sauce of LSH.
By requiring matches in k hash functions (AND), we reduce BOTH collision probabilities, but the far probability drops faster due to the power law. By using L tables (OR), we boost the probability of finding at least one collision for near points while keeping false positives manageable. This gives us knobs k and L to tune the precision-recall tradeoff.
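The AND-then-OR schedule is easy to check numerically. A small sketch, using the illustrative probabilities $p_1 = 0.8$, $p_2 = 0.6$ from above and hypothetical settings $k = 10$, $L = 20$:

```python
def amplified(p: float, k: int, L: int) -> float:
    """Probability that a point whose per-function collision probability
    is p shares a bucket with the query in at least one of L tables,
    where each table's key concatenates k hash functions (AND then OR)."""
    return 1 - (1 - p ** k) ** L

p_near, p_far = 0.8, 0.6
k, L = 10, 20
print(round(amplified(p_near, k, L), 3))  # near points: found in some table with high probability
print(round(amplified(p_far, k, L), 3))   # far points: collide far less often
```

Raising $k$ suppresses false positives (both curves drop, the far one faster); raising $L$ recovers recall for near points. Tuning the pair traces out the precision-recall tradeoff described above.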
Let's outline the complete LSH algorithm for approximate nearest neighbor search. This provides the big picture before we dive into specific hash families in subsequent pages.
```python
"""
Complete LSH Algorithm for Approximate Nearest Neighbor Search.

High-level structure that applies to all LSH variants.
Specific hash functions vary by distance metric.
"""

import numpy as np
from typing import List, Tuple, Callable
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class LSHIndex:
    """
    LSH Index Structure.

    Parameters:
    -----------
    k : int
        Number of hash functions per table (higher = fewer candidates,
        but might miss true neighbors)
    L : int
        Number of hash tables (higher = more likely to find true
        neighbors, but more candidates to check)
    """
    k: int                                # Hash functions per table
    L: int                                # Number of hash tables
    # Each table maps hash key -> list of point indices
    tables: List[dict]                    # L tables, each {hash_key: [indices]}
    # Hash functions for each table
    hash_functions: List[List[Callable]]  # L lists of k functions each

def build_lsh_index(
    points: np.ndarray,
    k: int,
    L: int,
    hash_family: Callable,
    dim: int
) -> LSHIndex:
    """
    Build LSH index from dataset.

    Parameters:
    -----------
    points : np.ndarray of shape (n, d)
        Dataset of n points in d dimensions
    k : int
        Number of hash functions per table
    L : int
        Number of hash tables
    hash_family : callable
        Function that returns a new random hash function
    dim : int
        Dimensionality of points

    Returns:
    --------
    LSHIndex : The built index
    """
    n = len(points)

    # Create L tables, each with k random hash functions
    tables = [defaultdict(list) for _ in range(L)]
    hash_functions = [
        [hash_family(dim) for _ in range(k)]
        for _ in range(L)
    ]

    # Insert all points into all tables
    for point_idx in range(n):
        point = points[point_idx]
        for table_idx in range(L):
            # Compute k hash values and concatenate
            hash_values = tuple(
                h(point) for h in hash_functions[table_idx]
            )
            # Insert into this table
            tables[table_idx][hash_values].append(point_idx)

    return LSHIndex(k=k, L=L, tables=tables, hash_functions=hash_functions)

def query_lsh(
    query: np.ndarray,
    index: LSHIndex,
    points: np.ndarray,
    distance_fn: Callable,
    num_neighbors: int = 1,
    max_candidates: int = None
) -> List[Tuple[int, float]]:
    """
    Query LSH index for approximate nearest neighbors.

    Parameters:
    -----------
    query : np.ndarray of shape (d,)
        Query point
    index : LSHIndex
        Pre-built LSH index
    points : np.ndarray
        Original dataset (needed for distance computation)
    distance_fn : callable
        Distance function d(x, y) -> float
    num_neighbors : int
        Number of neighbors to return
    max_candidates : int
        Maximum candidates to check (for efficiency)

    Returns:
    --------
    List of (point_index, distance) tuples, sorted by distance
    """
    candidates = set()

    # Collect candidates from all L tables
    for table_idx in range(index.L):
        # Compute hash key for query using this table's functions
        hash_values = tuple(
            h(query) for h in index.hash_functions[table_idx]
        )
        # Get all points in the same bucket
        bucket = index.tables[table_idx].get(hash_values, [])
        candidates.update(bucket)

        # Early termination if we have enough candidates
        if max_candidates and len(candidates) >= max_candidates:
            break

    # Compute exact distances to all candidates
    results = []
    for idx in candidates:
        dist = distance_fn(query, points[idx])
        results.append((idx, dist))

    # Sort by distance and return top-k
    results.sort(key=lambda x: x[1])
    return results[:num_neighbors]

# Example usage with random projections (cosine similarity)
def random_hyperplane_hash(dim: int):
    """Factory for random hyperplane hash functions (cosine LSH)."""
    hyperplane = np.random.randn(dim)

    def hash_fn(point):
        return 1 if np.dot(hyperplane, point) >= 0 else 0

    return hash_fn

# Demo
if __name__ == "__main__":
    np.random.seed(42)

    # Create dataset
    n_points = 10000
    dim = 128
    points = np.random.randn(n_points, dim)
    points = points / np.linalg.norm(points, axis=1, keepdims=True)

    # Build index
    k = 10   # 10 hash functions per table
    L = 20   # 20 tables
    print(f"Building LSH index: n={n_points}, d={dim}, k={k}, L={L}")
    index = build_lsh_index(points, k, L, random_hyperplane_hash, dim)

    # Query
    query = np.random.randn(dim)
    query = query / np.linalg.norm(query)

    def cosine_distance(a, b):
        return 1 - np.dot(a, b)

    neighbors = query_lsh(query, index, points, cosine_distance,
                          num_neighbors=5)

    print("Top 5 approximate nearest neighbors:")
    for idx, dist in neighbors:
        print(f"  Point {idx}: cosine distance = {dist:.4f}")

    # For comparison, brute force
    print("Brute force verification:")
    all_dists = [(i, cosine_distance(query, points[i]))
                 for i in range(n_points)]
    all_dists.sort(key=lambda x: x[1])
    for idx, dist in all_dists[:5]:
        print(f"  Point {idx}: cosine distance = {dist:.4f}")
```

Complexity Analysis:
Preprocessing (build index): each of the $n$ points is hashed by all $k \cdot L$ functions, so building the index costs $O(nkL)$ hash evaluations, and storing one entry per point per table takes $O(nL)$ space on top of the data itself.

Query: hashing the query costs $k \cdot L$ hash evaluations, after which we compute exact distances to the candidate set $C$, for $O(kL + |C| \cdot d)$ total work.
The key insight is that $|C|$ (the number of candidates) scales as $O(n^\rho)$ where $\rho < 1$. For a well-tuned LSH, we might have $\rho \approx 0.5$, meaning we only need to check $O(\sqrt{n})$ candidates instead of all $n$ points.
For 100 million points, this is the difference between checking 100 million distances and checking ~10,000 distances—a 10,000x speedup.
LSH has become a foundational technique across many domains where similarity search at scale is required. Understanding these applications helps motivate the theoretical depth we'll explore in subsequent pages.
| Domain | Application | Similarity Metric | Scale |
|---|---|---|---|
| Information Retrieval | Near-duplicate document detection | Jaccard (MinHash) | Billions of web pages |
| Computer Vision | Image similarity search | Euclidean/Cosine | Hundreds of millions of images |
| Music/Audio | Audio fingerprinting, song recognition | Euclidean | Millions of audio tracks |
| Recommendation Systems | Similar user/item finding | Cosine similarity | Billions of user-item pairs |
| Biology | Similar genome/protein sequence search | Edit distance variants | Millions of sequences |
| Advertising | Ad targeting, lookalike audiences | Cosine similarity | Billions of user profiles |
| Social Networks | Friend suggestions, content dedup | Jaccard/Cosine | Billions of users |
| Natural Language | Semantic search, plagiarism detection | Cosine (embeddings) | Millions of documents |
Case Study: Google's Web Duplicate Detection
Google processes billions of web pages and must identify near-duplicates efficiently. Two pages might be duplicates if they share most content but differ in headers, ads, or timestamps.
Approach using LSH:

- Reduce each page to a set of shingles (overlapping runs of words)
- Compute a MinHash signature for each page, so that signature agreement estimates Jaccard similarity
- Band the signatures into LSH buckets so that highly similar pages are likely to share a bucket
- Compare only the candidate pairs that share a bucket
Without LSH, comparing all pairs of $n$ pages requires $O(n^2)$ comparisons—infeasible for billions of pages. With LSH, the expected comparisons drop to $O(n)$ while still catching most true duplicates.
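A toy sketch of the shingle-and-MinHash idea (this is illustrative, not Google's production pipeline; all names and parameters are my own):

```python
import hashlib

def shingles(text: str, w: int = 4) -> set:
    """Word w-shingles of a document."""
    words = text.split()
    return {' '.join(words[i:i + w]) for i in range(len(words) - w + 1)}

def minhash_signature(s: set, num_hashes: int = 64) -> list:
    """MinHash signature: for each of num_hashes seeded hash functions,
    keep the minimum hash value over the set's elements."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.sha256(f"{seed}:{x}".encode()).digest()[:8], 'big')
            for x in s))
    return sig

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river bend"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))

# Fraction of matching signature positions estimates Jaccard similarity;
# banding the signatures into LSH buckets then makes the overall scan O(n).
agreement = sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
print(agreement)  # high, since the documents differ in one word
```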
Case Study: Shazam's Audio Recognition
Shazam identifies songs from short audio clips recorded in noisy environments. The system creates "fingerprints" from audio spectrograms—high-dimensional vectors representing local audio characteristics.
For each query:

- Extract fingerprint vectors from the recorded clip
- Hash each fingerprint to retrieve candidate fingerprints from the song database
- Score candidate songs by how many of their fingerprints match the clip consistently
- Return the top-scoring song
LSH enables searching millions of songs in real-time on mobile devices with limited resources.
All these applications share a pattern: high-dimensional data, a well-defined similarity metric, queries that can tolerate some approximation, and datasets too large for brute-force search. LSH transforms these intractable problems into efficient, practical systems.
Not all hash functions can be locality-sensitive. LSH families must satisfy specific mathematical properties tied to the underlying distance metric. Understanding these properties is crucial for selecting and tuning LSH for your application.
Different LSH Families for Different Metrics:
Each distance metric requires its own LSH family:
| Distance Metric | LSH Family | Collision Probability |
|---|---|---|
| Hamming distance | Bit sampling | $1 - d_H(x,y)/d$ |
| Jaccard similarity | MinHash | $J(A, B)$ |
| Cosine similarity | Random hyperplanes | $1 - \theta/\pi$ |
| Euclidean distance | p-stable distributions | Depends on distance |
We will explore each of these in detail in the following pages. The key insight is that there's no "universal" LSH—you must match the LSH family to your distance metric.
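The random-hyperplanes row of the table can be checked empirically. A sketch (function and variable names are my own) comparing the predicted collision probability $1 - \theta/\pi$ against a Monte Carlo estimate over random hyperplanes:

```python
import numpy as np

rng = np.random.default_rng(1)

def collision_rate(x: np.ndarray, y: np.ndarray, trials: int = 20000) -> float:
    """Empirical Pr[sign(w.x) == sign(w.y)] over random Gaussian hyperplanes w."""
    d = len(x)
    W = rng.standard_normal((trials, d))
    return float(np.mean((W @ x >= 0) == (W @ y >= 0)))

d = 50
x = rng.standard_normal(d)
y = rng.standard_normal(d)
theta = np.arccos(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
print(round(1 - theta / np.pi, 3), round(collision_rate(x, y), 3))
# The theoretical and empirical values agree to within sampling error.
```

Running the same check with, say, MinHash on these Euclidean vectors would produce garbage, which is the warning's point.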
Using an LSH family designed for one metric with a different metric produces meaningless results. For example, MinHash (designed for Jaccard similarity) will not work for Euclidean distance. Always verify that your LSH family matches your application's similarity measure.
We've established the foundational understanding of locality-sensitive hashing. Let's consolidate the key insights:

- Traditional hashing deliberately scatters similar inputs (the avalanche effect); this is exactly wrong for similarity search
- LSH inverts the paradigm: collision probability is a decreasing function of distance
- The $(d_1, d_2, p_1, p_2)$-sensitivity framework formalizes this, and the quality parameter $\rho = \ln(1/p_1)/\ln(1/p_2)$ governs the $O(n^\rho)$ query time
- AND-then-OR amplification with $k$ hash functions per table and $L$ tables turns a small probability gap into a sharp separation between near and far points
What's Next:
In the following pages, we'll dive deep into specific LSH families:

- Bit sampling for Hamming distance
- MinHash for Jaccard similarity on sets
- Random hyperplane projections for cosine similarity
- p-stable distribution projections for Euclidean distance
Each page will provide the mathematical rigor, algorithmic details, and practical insights needed to apply LSH in production systems.
You now understand the fundamental distinction between traditional hashing and locality-sensitive hashing. This paradigm shift—from collision avoidance to similarity-preserving collisions—is the foundation for all LSH techniques we'll explore next.