Imagine you're building a music recommendation system with 100 million songs, each represented as a 512-dimensional feature vector capturing tempo, key, instrumentation, and audio characteristics. A user requests songs similar to their current favorite. The naive approach—computing the distance to every song in the database—requires 100 million distance calculations, each involving 512 dimensions. Even at 100 nanoseconds per calculation, that's 10 seconds per query. For a service handling thousands of requests per second, this is catastrophic.
This isn't a theoretical concern. Every major tech company—Spotify, Netflix, Google, Amazon—faces this exact challenge: how do you find the nearest neighbors to a query point among billions of items without computing all distances?
The solution lies in a counterintuitive idea: hashing. But not the hashing you know from hash tables. A special kind of hashing that does exactly what traditional hashing tries to avoid.
By the end of this page, you will understand the fundamental paradigm shift from traditional hashing to locality-sensitive hashing. You'll see why traditional hashing's core design principle must be inverted for similarity search, and how LSH achieves sub-linear query time by deliberately creating "structured collisions" based on similarity.
Before we can appreciate locality-sensitive hashing, we must deeply understand traditional hashing and its fundamental design philosophy.
What is a hash function?
A hash function h maps inputs from a (typically large) domain to a (typically smaller) codomain:
$$h: U \rightarrow \{0, 1, 2, \ldots, m-1\}$$
where $U$ is the universe of possible inputs and $m$ is the number of buckets. The hash value $h(x)$ is often called the hash code, hash digest, or simply the hash of $x$.
The Core Design Principle: Collision Minimization
Traditional hash functions are designed with one overarching goal: minimize collisions. A collision occurs when two distinct inputs produce the same hash value:
$$h(x) = h(y) \quad \text{where } x \neq y$$
This principle drives everything in traditional hashing.
Why collision minimization matters for hash tables:
In a hash table, we use the hash value as an index into an array. When a collision occurs, we must handle it through chaining (linked lists at each bucket) or open addressing (probing for empty slots). Collisions degrade performance: expected O(1) lookups degrade toward the length of the longest chain, and heavily loaded buckets also hurt cache locality.
The entire hash table paradigm rests on minimizing collisions. A hash function that produced many collisions would be considered broken for hash table applications.
```python
import hashlib

def traditional_hash(text: str, num_buckets: int = 1000) -> int:
    """
    Traditional hash function using SHA-256.
    Note how SIMILAR inputs produce COMPLETELY DIFFERENT outputs.
    """
    hash_bytes = hashlib.sha256(text.encode()).digest()
    # Convert first 8 bytes to integer and mod by bucket count
    hash_int = int.from_bytes(hash_bytes[:8], 'big')
    return hash_int % num_buckets

# Demonstrate the avalanche effect
strings = [
    "The quick brown fox",
    "The quick brown foX",  # One character changed
    "The quick brown foy",  # One character different
    "The quick brown foz",  # One character different
]

print("Traditional Hashing - Avalanche Effect Demo:")
print("-" * 50)
for s in strings:
    h = traditional_hash(s)
    print(f"'{s}' -> bucket {h}")

# Output (example):
# 'The quick brown fox' -> bucket 342
# 'The quick brown foX' -> bucket 891  # Completely different!
# 'The quick brown foy' -> bucket 127  # Completely different!
# 'The quick brown foz' -> bucket 658  # Completely different!

# The avalanche effect ensures that similar strings
# hash to DIFFERENT buckets - this is BY DESIGN
# for traditional hash functions.
```

The avalanche effect—where similar inputs hash to vastly different outputs—is a deliberate design goal of traditional hash functions. It ensures uniform distribution and is essential for security applications. But this very property makes traditional hashing useless for similarity search.
Now let's formally define the problem that traditional hashing cannot solve.
The Nearest Neighbor Problem:
Given:

- A set $P = \{p_1, p_2, \ldots, p_n\}$ of $n$ points in a space equipped with a distance function $d(\cdot, \cdot)$
- A query point $q$
Find: $$\text{NN}(q) = \arg\min_{p \in P} d(q, p)$$
The exact solution requires computing $d(q, p_i)$ for all $n$ points, yielding $O(n)$ query time.
The Approximate Nearest Neighbor (ANN) Problem:
For many applications, we can relax the requirement. Given an approximation factor $c > 1$:
Find a point $p^*$ such that: $$d(q, p^*) \leq c \cdot d(q, \text{NN}(q))$$
That is, we accept a point whose distance is at most $c$ times the true nearest neighbor distance. This relaxation is crucial—it allows sub-linear query time.
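As a baseline, the exact linear scan is only a few lines. A minimal sketch with toy data (the function name and points here are illustrative, not from a library):

```python
import numpy as np

def nearest_neighbor(q: np.ndarray, P: np.ndarray) -> tuple:
    """Exact nearest neighbor by linear scan: O(n*d) per query."""
    dists = np.linalg.norm(P - q, axis=1)  # n Euclidean distances
    idx = int(np.argmin(dists))
    return idx, float(dists[idx])

# Tiny example: 5 points in 3 dimensions
P = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [3.0, 3.0, 3.0],
              [0.1, 0.1, 0.0]])
q = np.array([0.05, 0.1, 0.0])
idx, dist = nearest_neighbor(q, P)
print(idx, round(dist, 3))  # point 4 is closest
```

Every method that follows is an attempt to beat this scan without giving up too much accuracy.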
| Method | Query Time | Space | Guarantees | Scalability |
|---|---|---|---|---|
| Brute Force | O(nd) | O(nd) | Exact | Poor (linear scan) |
| KD-Trees | O(log n) avg, O(n) worst | O(nd) | Exact | Poor in high dimensions |
| Ball Trees | O(log n) avg, O(n) worst | O(nd) | Exact | Degrades with dimension |
| LSH | O(n^ρ), ρ < 1 | O(n^(1+ρ)) | Approximate (1+ε) | Excellent in high dimensions |
| Approximate KD-Trees | O(log n) | O(nd) | Approximate | Moderate |
Why tree-based methods fail in high dimensions:
KD-trees and ball trees work brilliantly in low dimensions (d ≤ 20 or so). They partition space into regions, allowing the search to prune large portions of the dataset.
But in high dimensions, the curse of dimensionality destroys this efficiency: distances between points concentrate around a common value, the bounding regions of tree nodes overlap almost everywhere, and the search can prune almost nothing, so it ends up visiting most of the tree.
Empirical studies show that KD-trees become slower than brute force when $d > 10$-$20$ for typical datasets.
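The concentration of distances is easy to observe directly. A small illustrative experiment (not a benchmark; the function name is my own) measuring the ratio of the farthest to the nearest distance from a random query:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(d: int, n: int = 2000) -> float:
    """Ratio of farthest to nearest distance from a random query
    to n uniform random points; a ratio near 1 means distances are
    nearly indistinguishable, so spatial pruning has nothing to cut."""
    points = rng.random((n, d))
    q = rng.random(d)
    dists = np.linalg.norm(points - q, axis=1)
    return float(dists.max() / dists.min())

for d in [2, 10, 100, 1000]:
    print(d, round(distance_contrast(d), 2))
# The contrast shrinks steadily as d grows: in high dimensions the
# nearest and farthest points are almost equidistant from the query.
```

This is precisely the regime where tree-based pruning fails and a different idea is needed.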
The insight that leads to LSH:
If we think about the problem differently, we realize:
We don't need to find the exact structure of the point cloud. We just need a way to quickly find points that are likely to be similar.
What if we could create a hash function where similar points are likely to hash to the same bucket? Then we could:

- Hash every database point once, as a preprocessing step
- Hash the query point at search time
- Compute exact distances only to the points that share the query's bucket
Traditional hashing tries to scatter similar items. For similarity search, we need the opposite: a hash function that clusters similar items. This paradigm inversion is the essence of locality-sensitive hashing.
Locality-Sensitive Hashing (LSH) is a family of hashing techniques where the probability of collision is directly related to similarity. Unlike traditional hashing, LSH is designed so that:
Similar items are more likely to hash to the same bucket than dissimilar items.
Formal Definition:
A family $\mathcal{H}$ of hash functions is called $(d_1, d_2, p_1, p_2)$-sensitive for a distance measure $d$ if for any $h \in \mathcal{H}$ chosen uniformly at random:

- If $d(x, y) \leq d_1$, then $\Pr[h(x) = h(y)] \geq p_1$
- If $d(x, y) \geq d_2$, then $\Pr[h(x) = h(y)] \leq p_2$

where $d_1 < d_2$ and $p_1 > p_2$.
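To make the definition concrete, here is a sketch (function names are my own) of the simplest LSH family, bit sampling on binary vectors: picking a random coordinate $i$ and returning $x[i]$ gives $\Pr[h(x) = h(y)] = 1 - d_H(x, y)/d$, so for any thresholds $d_1 < d_2$ the family is $(d_1, d_2, 1 - d_1/d, 1 - d_2/d)$-sensitive:

```python
import random

def bit_sample_family(d: int):
    """Return a random bit-sampling hash: h(x) = x[i] for a random i.
    For bit vectors of length d, Pr[h(x) = h(y)] = 1 - hamming(x, y)/d."""
    i = random.randrange(d)
    return lambda x: x[i]

random.seed(0)
d = 64
x = [random.randint(0, 1) for _ in range(d)]
y = x[:]                        # start as a copy of x
for i in random.sample(range(d), 8):
    y[i] ^= 1                   # flip 8 distinct bits -> hamming(x, y) = 8

# Empirical collision rate over many sampled hash functions
trials = 20000
hits = sum(1 for _ in range(trials)
           if (h := bit_sample_family(d))(x) == h(y))
print(hits / trials)   # close to 1 - 8/64 = 0.875
```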
Interpreting the definition:

Points within distance $d_1$ of each other ("near" points) collide with probability at least $p_1$; points farther apart than $d_2$ ("far" points) collide with probability at most $p_2$. For distances between $d_1$ and $d_2$, the definition makes no guarantee.
The LSH Quality Parameter:
The quality of an LSH family is often characterized by the parameter $\rho$:
$$\rho = \frac{\ln(1/p_1)}{\ln(1/p_2)}$$
For a useful LSH family, we need $\rho < 1$. The smaller $\rho$ is, the better the hash family discriminates between near and far points. The query time for $(1 + \epsilon)$-approximate nearest neighbor search is $O(n^\rho)$, so smaller $\rho$ means faster queries.
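As a quick worked example (the probabilities are illustrative; they reuse the $p_1 = 0.8$, $p_2 = 0.6$ pair that appears in the amplification discussion), $\rho$ is a one-liner:

```python
import math

def rho(p1: float, p2: float) -> float:
    """LSH quality exponent: query time scales as O(n^rho)."""
    return math.log(1 / p1) / math.log(1 / p2)

# Illustrative collision probabilities for near vs far points
p1, p2 = 0.8, 0.6
r = rho(p1, p2)
n = 100_000_000
print(round(r, 3))       # ~0.437 for this pair
print(round(n ** r))     # rough candidate count instead of n (constants ignored)
```

A sharper family (larger gap between $p_1$ and $p_2$) drives $\rho$ down and the candidate count with it; the asymptotic bound hides constants, so treat $n^\rho$ as a scaling law, not an exact count.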
Why does this work?
The magic happens when we use multiple hash functions and multiple hash tables. By combining $k$ hash functions into a single hash key and using $L$ independent hash tables, we can drive the collision probability for near points close to 1 while keeping it small for far points.
```python
"""
Conceptual illustration of LSH vs Traditional Hashing.
This demonstrates the PARADIGM INVERSION at the heart of LSH.
"""

import numpy as np

class TraditionalHashDemo:
    """Traditional hashing: similar items -> different buckets"""

    def hash_vector(self, v: np.ndarray, num_buckets: int = 100) -> int:
        # Convert vector to bytes and hash
        v_bytes = v.tobytes()
        hash_val = hash(v_bytes)
        return hash_val % num_buckets

class LSHDemo:
    """LSH: similar items -> likely same bucket"""

    def __init__(self, dim: int, num_hyperplanes: int = 8):
        """
        Simple random projection LSH for cosine similarity.
        Each hyperplane divides space in two; we use k hyperplanes
        to create 2^k buckets.
        """
        # Random hyperplanes through origin
        self.hyperplanes = np.random.randn(num_hyperplanes, dim)

    def hash_vector(self, v: np.ndarray) -> str:
        """
        Hash by checking which side of each hyperplane the vector lies.
        Similar vectors are likely on the same side of most hyperplanes.
        """
        # Compute dot product with each hyperplane
        projections = np.dot(self.hyperplanes, v)
        # Convert to binary: 1 if positive, 0 if negative
        bits = (projections >= 0).astype(int)
        # Return as binary string (the hash)
        return ''.join(map(str, bits))

# Demonstration
np.random.seed(42)
dim = 100

# Create two sets of vectors
base_vector = np.random.randn(dim)
base_vector = base_vector / np.linalg.norm(base_vector)

# Similar vectors (small perturbations)
similar_vectors = [base_vector + 0.1 * np.random.randn(dim) for _ in range(5)]
similar_vectors = [v / np.linalg.norm(v) for v in similar_vectors]

# Dissimilar vectors (random directions)
dissimilar_vectors = [np.random.randn(dim) for _ in range(5)]
dissimilar_vectors = [v / np.linalg.norm(v) for v in dissimilar_vectors]

# Compare hashing behaviors
traditional = TraditionalHashDemo()
lsh = LSHDemo(dim, num_hyperplanes=8)

print("=" * 60)
print("TRADITIONAL HASHING (designed to scatter similar items)")
print("=" * 60)
print(f"Base vector hash: {traditional.hash_vector(base_vector)}")
print("Similar vectors hashes:",
      [traditional.hash_vector(v) for v in similar_vectors])
# Output: All different buckets! (avalanche effect)

print()
print("=" * 60)
print("LOCALITY-SENSITIVE HASHING (designed to cluster similar items)")
print("=" * 60)
print(f"Base vector hash: {lsh.hash_vector(base_vector)}")
print("Similar vectors hashes:",
      [lsh.hash_vector(v) for v in similar_vectors])
# Output: Many identical or similar hash values!

print("Dissimilar vectors hashes:",
      [lsh.hash_vector(v) for v in dissimilar_vectors])
# Output: Different hash values from base
```

Let's crystallize the fundamental difference between traditional hashing and LSH. This conceptual clarity is essential for understanding everything that follows.
The mathematical intuition:
Traditional hashing can be thought of as a function that "scrambles" the input space. Points close together in the original space are mapped essentially at random in the hash space. The correlation between input similarity and output similarity is intentionally zero (or negative for cryptographic hashes).
LSH inverts this. An LSH function is constructed so that:
$$\text{Pr}[h(x) = h(y)] = f(d(x, y))$$
where $f$ is a monotonically decreasing function of the distance. The closer two points are, the more likely they are to receive the same hash. The hash function is a similarity-preserving projection from high-dimensional space to a discrete set of buckets.
Why not just check all colliding points?
The power of LSH comes from amplification. A single LSH function might have:

- $p_1 = 0.8$ collision probability for near points
- $p_2 = 0.6$ collision probability for far points

This gap seems small. But by combining $k$ functions (requiring ALL to match for a collision), the probabilities become $p_1^k$ and $p_2^k$.

For $k = 10$:

- Near points: $0.8^{10} \approx 0.107$
- Far points: $0.6^{10} \approx 0.006$
The ratio has grown from $0.8/0.6 = 1.33$ to $0.107/0.006 \approx 17.8$!
We then use $L$ independent hash tables to boost the probability of finding near points. This AND-then-OR amplification is the secret sauce of LSH.
By requiring matches in k hash functions (AND), we reduce BOTH collision probabilities, but the far probability drops faster due to the power law. By using L tables (OR), we boost the probability of finding at least one collision for near points while keeping false positives manageable. This gives us knobs k and L to tune the precision-recall tradeoff.
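The AND-then-OR schedule is easy to check numerically. A small sketch, using the illustrative probabilities $p_1 = 0.8$, $p_2 = 0.6$ from above and hypothetical settings $k = 10$, $L = 20$:

```python
def amplified(p: float, k: int, L: int) -> float:
    """Probability that a point whose per-function collision probability
    is p shares a bucket with the query in at least one of L tables,
    where each table's key concatenates k hash functions (AND then OR)."""
    return 1 - (1 - p ** k) ** L

p_near, p_far = 0.8, 0.6
k, L = 10, 20
print(round(amplified(p_near, k, L), 3))  # near points: found in some table with high probability
print(round(amplified(p_far, k, L), 3))   # far points: collide far less often
```

Raising $k$ suppresses false positives (both curves drop, the far one faster); raising $L$ recovers recall for near points. Tuning the pair traces out the precision-recall tradeoff described above.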
Let's outline the complete LSH algorithm for approximate nearest neighbor search. This provides the big picture before we dive into specific hash families in subsequent pages.
```python
"""
Complete LSH Algorithm for Approximate Nearest Neighbor Search.

High-level structure that applies to all LSH variants.
Specific hash functions vary by distance metric.
"""

import numpy as np
from typing import List, Tuple, Callable
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class LSHIndex:
    """
    LSH Index Structure.

    Parameters:
    -----------
    k : int
        Number of hash functions per table (higher = fewer candidates,
        but might miss true neighbors)
    L : int
        Number of hash tables (higher = more likely to find true
        neighbors, but more candidates to check)
    """
    k: int                                # Hash functions per table
    L: int                                # Number of hash tables
    # Each table maps hash key -> list of point indices
    tables: List[dict]                    # L tables, each {hash_key: [indices]}
    # Hash functions for each table
    hash_functions: List[List[Callable]]  # L lists of k functions each

def build_lsh_index(
    points: np.ndarray,
    k: int,
    L: int,
    hash_family: Callable,
    dim: int
) -> LSHIndex:
    """
    Build LSH index from dataset.

    Parameters:
    -----------
    points : np.ndarray of shape (n, d)
        Dataset of n points in d dimensions
    k : int
        Number of hash functions per table
    L : int
        Number of hash tables
    hash_family : callable
        Function that returns a new random hash function
    dim : int
        Dimensionality of points

    Returns:
    --------
    LSHIndex : The built index
    """
    n = len(points)

    # Create L tables, each with k random hash functions
    tables = [defaultdict(list) for _ in range(L)]
    hash_functions = [
        [hash_family(dim) for _ in range(k)]
        for _ in range(L)
    ]

    # Insert all points into all tables
    for point_idx in range(n):
        point = points[point_idx]
        for table_idx in range(L):
            # Compute k hash values and concatenate
            hash_values = tuple(
                h(point) for h in hash_functions[table_idx]
            )
            # Insert into this table
            tables[table_idx][hash_values].append(point_idx)

    return LSHIndex(k=k, L=L, tables=tables, hash_functions=hash_functions)

def query_lsh(
    query: np.ndarray,
    index: LSHIndex,
    points: np.ndarray,
    distance_fn: Callable,
    num_neighbors: int = 1,
    max_candidates: int = None
) -> List[Tuple[int, float]]:
    """
    Query LSH index for approximate nearest neighbors.

    Parameters:
    -----------
    query : np.ndarray of shape (d,)
        Query point
    index : LSHIndex
        Pre-built LSH index
    points : np.ndarray
        Original dataset (needed for distance computation)
    distance_fn : callable
        Distance function d(x, y) -> float
    num_neighbors : int
        Number of neighbors to return
    max_candidates : int
        Maximum candidates to check (for efficiency)

    Returns:
    --------
    List of (point_index, distance) tuples, sorted by distance
    """
    candidates = set()

    # Collect candidates from all L tables
    for table_idx in range(index.L):
        # Compute hash key for query using this table's functions
        hash_values = tuple(
            h(query) for h in index.hash_functions[table_idx]
        )
        # Get all points in the same bucket
        bucket = index.tables[table_idx].get(hash_values, [])
        candidates.update(bucket)

        # Early termination if we have enough candidates
        if max_candidates and len(candidates) >= max_candidates:
            break

    # Compute exact distances to all candidates
    results = []
    for idx in candidates:
        dist = distance_fn(query, points[idx])
        results.append((idx, dist))

    # Sort by distance and return top-k
    results.sort(key=lambda x: x[1])
    return results[:num_neighbors]

# Example usage with random projections (cosine similarity)
def random_hyperplane_hash(dim: int):
    """Factory for random hyperplane hash functions (cosine LSH)."""
    hyperplane = np.random.randn(dim)

    def hash_fn(point):
        return 1 if np.dot(hyperplane, point) >= 0 else 0

    return hash_fn

# Demo
if __name__ == "__main__":
    np.random.seed(42)

    # Create dataset
    n_points = 10000
    dim = 128
    points = np.random.randn(n_points, dim)
    points = points / np.linalg.norm(points, axis=1, keepdims=True)

    # Build index
    k = 10   # 10 hash functions per table
    L = 20   # 20 tables
    print(f"Building LSH index: n={n_points}, d={dim}, k={k}, L={L}")
    index = build_lsh_index(points, k, L, random_hyperplane_hash, dim)

    # Query
    query = np.random.randn(dim)
    query = query / np.linalg.norm(query)

    def cosine_distance(a, b):
        return 1 - np.dot(a, b)

    neighbors = query_lsh(query, index, points, cosine_distance,
                          num_neighbors=5)

    print("Top 5 approximate nearest neighbors:")
    for idx, dist in neighbors:
        print(f"  Point {idx}: cosine distance = {dist:.4f}")

    # For comparison, brute force
    print("Brute force verification:")
    all_dists = [(i, cosine_distance(query, points[i]))
                 for i in range(n_points)]
    all_dists.sort(key=lambda x: x[1])
    for idx, dist in all_dists[:5]:
        print(f"  Point {idx}: cosine distance = {dist:.4f}")
```

Complexity Analysis:
Preprocessing (build index): each of the $n$ points is hashed by all $k \cdot L$ functions, so building the index costs $O(nkL)$ hash evaluations, and storing one entry per point per table takes $O(nL)$ space on top of the data itself.

Query: hashing the query costs $k \cdot L$ hash evaluations, after which we compute exact distances to the candidate set $C$, for $O(kL + |C| \cdot d)$ total work.
The key insight is that $|C|$ (the number of candidates) scales as $O(n^\rho)$ where $\rho < 1$. For a well-tuned LSH, we might have $\rho \approx 0.5$, meaning we only need to check $O(\sqrt{n})$ candidates instead of all $n$ points.
For 100 million points, this is the difference between checking 100 million distances and checking ~10,000 distances—a 10,000x speedup.
LSH has become a foundational technique across many domains where similarity search at scale is required. Understanding these applications helps motivate the theoretical depth we'll explore in subsequent pages.
| Domain | Application | Similarity Metric | Scale |
|---|---|---|---|
| Information Retrieval | Near-duplicate document detection | Jaccard (MinHash) | Billions of web pages |
| Computer Vision | Image similarity search | Euclidean/Cosine | Hundreds of millions of images |
| Music/Audio | Audio fingerprinting, song recognition | Euclidean | Millions of audio tracks |
| Recommendation Systems | Similar user/item finding | Cosine similarity | Billions of user-item pairs |
| Biology | Similar genome/protein sequence search | Edit distance variants | Millions of sequences |
| Advertising | Ad targeting, lookalike audiences | Cosine similarity | Billions of user profiles |
| Social Networks | Friend suggestions, content dedup | Jaccard/Cosine | Billions of users |
| Natural Language | Semantic search, plagiarism detection | Cosine (embeddings) | Millions of documents |
Case Study: Google's Web Duplicate Detection
Google processes billions of web pages and must identify near-duplicates efficiently. Two pages might be duplicates if they share most content but differ in headers, ads, or timestamps.
Approach using LSH:

- Reduce each page to a set of shingles (overlapping runs of words)
- Compute a MinHash signature for each page, so that signature agreement estimates Jaccard similarity
- Band the signatures into LSH buckets so that highly similar pages are likely to share a bucket
- Compare only the candidate pairs that share a bucket
Without LSH, comparing all pairs of $n$ pages requires $O(n^2)$ comparisons—infeasible for billions of pages. With LSH, the expected comparisons drop to $O(n)$ while still catching most true duplicates.
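A toy sketch of the shingle-and-MinHash idea (this is illustrative, not Google's production pipeline; all names and parameters are my own):

```python
import hashlib

def shingles(text: str, w: int = 4) -> set:
    """Word w-shingles of a document."""
    words = text.split()
    return {' '.join(words[i:i + w]) for i in range(len(words) - w + 1)}

def minhash_signature(s: set, num_hashes: int = 64) -> list:
    """MinHash signature: for each of num_hashes seeded hash functions,
    keep the minimum hash value over the set's elements."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.sha256(f"{seed}:{x}".encode()).digest()[:8], 'big')
            for x in s))
    return sig

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river bend"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))

# Fraction of matching signature positions estimates Jaccard similarity;
# banding the signatures into LSH buckets then makes the overall scan O(n).
agreement = sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
print(agreement)  # high, since the documents differ in one word
```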
Case Study: Shazam's Audio Recognition
Shazam identifies songs from short audio clips recorded in noisy environments. The system creates "fingerprints" from audio spectrograms—high-dimensional vectors representing local audio characteristics.
For each query:

- Extract fingerprint vectors from the recorded clip
- Hash each fingerprint to retrieve candidate fingerprints from the song database
- Score candidate songs by how many of their fingerprints match the clip consistently
- Return the top-scoring song
LSH enables searching millions of songs in real-time on mobile devices with limited resources.
All these applications share a pattern: high-dimensional data, a well-defined similarity metric, queries that can tolerate some approximation, and datasets too large for brute-force search. LSH transforms these intractable problems into efficient, practical systems.
Not all hash functions can be locality-sensitive. LSH families must satisfy specific mathematical properties tied to the underlying distance metric. Understanding these properties is crucial for selecting and tuning LSH for your application.
Different LSH Families for Different Metrics:
Each distance metric requires its own LSH family:
| Distance Metric | LSH Family | Collision Probability |
|---|---|---|
| Hamming distance | Bit sampling | $1 - d_H(x,y)/d$ |
| Jaccard similarity | MinHash | $J(A, B)$ |
| Cosine similarity | Random hyperplanes | $1 - \theta/\pi$ |
| Euclidean distance | p-stable distributions | Depends on distance |
We will explore each of these in detail in the following pages. The key insight is that there's no "universal" LSH—you must match the LSH family to your distance metric.
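The random-hyperplanes row of the table can be checked empirically. A sketch (function and variable names are my own) comparing the predicted collision probability $1 - \theta/\pi$ against a Monte Carlo estimate over random hyperplanes:

```python
import numpy as np

rng = np.random.default_rng(1)

def collision_rate(x: np.ndarray, y: np.ndarray, trials: int = 20000) -> float:
    """Empirical Pr[sign(w.x) == sign(w.y)] over random Gaussian hyperplanes w."""
    d = len(x)
    W = rng.standard_normal((trials, d))
    return float(np.mean((W @ x >= 0) == (W @ y >= 0)))

d = 50
x = rng.standard_normal(d)
y = rng.standard_normal(d)
theta = np.arccos(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
print(round(1 - theta / np.pi, 3), round(collision_rate(x, y), 3))
# The theoretical and empirical values agree to within sampling error.
```

Running the same check with, say, MinHash on these Euclidean vectors would produce garbage, which is the warning's point.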
Using an LSH family designed for one metric with a different metric produces meaningless results. For example, MinHash (designed for Jaccard similarity) will not work for Euclidean distance. Always verify that your LSH family matches your application's similarity measure.
We've established the foundational understanding of locality-sensitive hashing. Let's consolidate the key insights:

- Traditional hashing deliberately scatters similar inputs (the avalanche effect); this is exactly wrong for similarity search
- LSH inverts the paradigm: collision probability is a decreasing function of distance
- The $(d_1, d_2, p_1, p_2)$-sensitivity framework formalizes this, and the quality parameter $\rho = \ln(1/p_1)/\ln(1/p_2)$ governs the $O(n^\rho)$ query time
- AND-then-OR amplification with $k$ hash functions per table and $L$ tables turns a small probability gap into a sharp separation between near and far points
What's Next:
In the following pages, we'll dive deep into specific LSH families:

- Bit sampling for Hamming distance
- MinHash for Jaccard similarity on sets
- Random hyperplane projections for cosine similarity
- p-stable distribution projections for Euclidean distance
Each page will provide the mathematical rigor, algorithmic details, and practical insights needed to apply LSH in production systems.
You now understand the fundamental distinction between traditional hashing and locality-sensitive hashing. This paradigm shift—from collision avoidance to similarity-preserving collisions—is the foundation for all LSH techniques we'll explore next.