Traditional hash functions are designed to scatter — they map similar inputs to completely different outputs. Change one bit, and the hash changes unpredictably. This is perfect for hash tables but useless for similarity search.
Locality-Sensitive Hashing (LSH) inverts this principle. LSH hash functions are designed to cluster — similar items hash to the same bucket with high probability, while dissimilar items hash to different buckets. This seemingly paradoxical property enables sub-linear approximate nearest neighbor search.
The Key Insight:
Instead of comparing a query to all $n$ points, we:

1. Hash every indexed point into buckets using a locality-sensitive hash function.
2. Hash the query with the same function and look up only its bucket(s).
3. Compare the query exactly against the candidates found there.

This transforms $O(n)$ comparisons into $O(1)$ lookups plus $O(b)$ bucket comparisons, where $b \ll n$ is the bucket size. The tradeoff is that we might miss some true neighbors (false negatives) or return non-neighbors (false positives).
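A minimal sketch of this hash-then-probe pattern is shown below; `lsh_hash` and `distance` are placeholders for the concrete hash families and metrics developed later on this page:

```python
from collections import defaultdict

def build_index(points, lsh_hash):
    """Group point indices by their LSH bucket (one-time O(n) pass)."""
    buckets = defaultdict(list)
    for idx, p in enumerate(points):
        buckets[lsh_hash(p)].append(idx)
    return buckets

def query(q, points, buckets, lsh_hash, distance):
    """O(1) bucket lookup, then exact comparisons within the bucket only."""
    candidates = buckets.get(lsh_hash(q), [])
    return min(candidates, key=lambda i: distance(q, points[i]), default=None)
```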
By the end of this page, you will understand the mathematical definition of locality-sensitive hash families, implement LSH for different distance metrics (cosine, Euclidean, Jaccard), analyze collision probabilities and parameter selection, understand amplification techniques for tuning precision and recall, and recognize the tradeoffs between query time, space, and accuracy.
LSH is defined formally in terms of hash families with specific collision probability properties.
Definition (Locality-Sensitive Hash Family):
A family $\mathcal{H}$ of hash functions is $(r_1, r_2, p_1, p_2)$-sensitive for distance $d(\cdot, \cdot)$ if for any $h \in \mathcal{H}$ chosen uniformly at random:
If $d(\mathbf{x}, \mathbf{y}) \leq r_1$, then $\Pr[h(\mathbf{x}) = h(\mathbf{y})] \geq p_1$ (similar items collide often)
If $d(\mathbf{x}, \mathbf{y}) \geq r_2$, then $\Pr[h(\mathbf{x}) = h(\mathbf{y})] \leq p_2$ (dissimilar items collide rarely)
where $r_1 < r_2$ and $p_1 > p_2$.
The gap between similar ($\leq r_1$) and dissimilar ($\geq r_2$) is crucial. We typically define:
$$c = \frac{r_2}{r_1} > 1$$
as the approximation factor. An LSH scheme finds a $(1 + \epsilon)$-approximate nearest neighbor with $c = 1 + \epsilon$.
The $\rho$ parameter:
A key quantity is:
$$\rho = \frac{\ln(1/p_1)}{\ln(1/p_2)}$$
This controls the query time complexity. Smaller $\rho$ means more efficient separation between similar and dissimilar pairs. For the best LSH families, $\rho \approx 1/c$.
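As a quick worked example, the snippet below computes $\rho$ for a hypothetical family with $p_1 = 0.8$ and $p_2 = 0.4$ (values chosen purely for illustration):

```python
import math

p1, p2 = 0.8, 0.4                      # hypothetical collision probabilities
rho = math.log(1 / p1) / math.log(1 / p2)
print(f"rho = {rho:.3f}")              # ~0.244: query cost scales roughly as n^0.244
```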
| Parameter | Symbol | Meaning | Desirable Value |
|---|---|---|---|
| Near distance | $r_1$ | Distance threshold for 'similar' | Application-specific |
| Far distance | $r_2$ | Distance threshold for 'dissimilar' | $r_2 > r_1$ |
| Near collision prob | $p_1$ | Pr(collision \| similar) | High (close to 1) |
| Far collision prob | $p_2$ | Pr(collision \| dissimilar) | Low (close to 0) |
| Approximation factor | $c = r_2/r_1$ | Quality of approximation | $1 + \epsilon$ for small $\epsilon$ |
| Rho parameter | $\rho$ | Efficiency measure | Small (ideally $1/c$) |
LSH solves the (c, r)-near neighbor problem: given a query with a true neighbor within distance r, return a neighbor within distance cr (c-approximation). This relaxation from exact to approximate nearest neighbor is what enables sub-linear search in high dimensions.
One of the most elegant LSH schemes is for cosine similarity, using random hyperplane partitions.
Cosine Similarity:
$$\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|}$$
The angle $\theta$ between vectors determines similarity. Vectors pointing in similar directions have $\theta \approx 0$ and $\cos(\theta) \approx 1$.
Random Hyperplane Hash:
For a random unit vector $\mathbf{r}$ sampled uniformly from the unit sphere:
$$h_{\mathbf{r}}(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{r} \cdot \mathbf{x} \geq 0 \\ 0 & \text{if } \mathbf{r} \cdot \mathbf{x} < 0 \end{cases}$$
This hash assigns $\mathbf{x}$ based on which side of the hyperplane $\mathbf{r} \cdot \mathbf{x} = 0$ it falls on.
Collision Probability:
The probability that two vectors have the same hash is:
$$\Pr[h_{\mathbf{r}}(\mathbf{x}) = h_{\mathbf{r}}(\mathbf{y})] = 1 - \frac{\theta}{\pi}$$
where $\theta = \arccos\left(\frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|}\right)$ is the angle between the vectors.
Intuition: The hyperplane separates the vectors only if it 'cuts between' them. The probability of this equals the fraction of random directions that do so, which is $\theta / \pi$.
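This collision probability is easy to confirm numerically; the sketch below estimates it by Monte Carlo over random hyperplanes for an arbitrary pair of vectors:

```python
import numpy as np

def empirical_collision_rate(x, y, n_trials=100_000, seed=0):
    """Estimate Pr[sign(r·x) == sign(r·y)] over random Gaussian hyperplanes r."""
    rng = np.random.default_rng(seed)
    r = rng.standard_normal((n_trials, x.shape[0]))
    return np.mean((r @ x >= 0) == (r @ y >= 0))

rng = np.random.default_rng(1)
x, y = rng.standard_normal(64), rng.standard_normal(64)
theta = np.arccos(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
print(f"theoretical: {1 - theta / np.pi:.4f}")
print(f"empirical:   {empirical_collision_rate(x, y):.4f}")
```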
```python
import numpy as np
from typing import List, Tuple, Set
from collections import defaultdict


class CosineLSH:
    """
    Locality-Sensitive Hashing for Cosine Similarity.

    Uses random hyperplanes to hash vectors such that similar vectors
    (small angle) are likely to share hash values.

    Parameters:
    -----------
    n_hyperplanes : int
        Number of random hyperplanes (bits per hash)
    n_tables : int
        Number of hash tables (for recall boosting)
    dim : int
        Dimensionality of input vectors
    seed : int
        Random seed for reproducibility
    """

    def __init__(
        self,
        n_hyperplanes: int = 8,
        n_tables: int = 4,
        dim: int = 128,
        seed: int = 42
    ):
        self.n_hyperplanes = n_hyperplanes
        self.n_tables = n_tables
        self.dim = dim

        np.random.seed(seed)

        # Generate random hyperplanes for each table
        # Shape: (n_tables, n_hyperplanes, dim)
        self.hyperplanes = np.random.randn(n_tables, n_hyperplanes, dim)

        # Normalize each hyperplane to unit length
        norms = np.linalg.norm(self.hyperplanes, axis=2, keepdims=True)
        self.hyperplanes /= norms

        # Storage: table_id -> hash_value -> set of indices
        self.tables: List[defaultdict] = [
            defaultdict(set) for _ in range(n_tables)
        ]
        self.data = []  # Store original vectors

    def _hash_single(self, x: np.ndarray, table_idx: int) -> int:
        """
        Compute hash for vector x using table table_idx.

        Returns an integer hash (binary encoding of hyperplane signs).
        """
        # Dot product with each hyperplane: (n_hyperplanes,)
        projections = self.hyperplanes[table_idx] @ x

        # Convert to binary hash (1 if positive, 0 if negative)
        bits = (projections >= 0).astype(int)

        # Convert binary array to integer
        hash_val = int(''.join(map(str, bits)), 2)
        return hash_val

    def _hash_all_tables(self, x: np.ndarray) -> List[int]:
        """Compute hash in all tables."""
        return [self._hash_single(x, i) for i in range(self.n_tables)]

    def index(self, vectors: np.ndarray):
        """
        Index a collection of vectors.

        Parameters:
        -----------
        vectors : np.ndarray, shape (n, dim)
            Vectors to index
        """
        self.data = vectors
        for idx, vec in enumerate(vectors):
            hashes = self._hash_all_tables(vec)
            for table_idx, h in enumerate(hashes):
                self.tables[table_idx][h].add(idx)

    def query(self, q: np.ndarray, k: int = 10) -> List[Tuple[int, float]]:
        """
        Find approximate k nearest neighbors.

        Parameters:
        -----------
        q : np.ndarray, shape (dim,)
            Query vector
        k : int
            Number of neighbors to return

        Returns:
        --------
        List of (index, cosine_similarity) pairs, sorted by similarity
        """
        # Get candidate indices from all tables
        candidates: Set[int] = set()
        hashes = self._hash_all_tables(q)
        for table_idx, h in enumerate(hashes):
            candidates.update(self.tables[table_idx][h])

        if not candidates:
            return []

        # Compute actual cosine similarity for candidates
        q_norm = np.linalg.norm(q)
        results = []
        for idx in candidates:
            vec = self.data[idx]
            sim = np.dot(q, vec) / (q_norm * np.linalg.norm(vec))
            results.append((idx, sim))

        # Sort by similarity (descending) and return top k
        results.sort(key=lambda x: -x[1])
        return results[:k]

    def collision_probability(self, theta: float) -> float:
        """
        Theoretical collision probability for angle theta.

        For a single hyperplane: Pr[collision] = 1 - theta/pi
        For n hyperplanes (AND): Pr[all match] = (1 - theta/pi)^n
        For m tables (OR): Pr[at least one match] = 1 - (1 - p^n)^m
        """
        single_prob = 1 - theta / np.pi
        band_prob = single_prob ** self.n_hyperplanes
        total_prob = 1 - (1 - band_prob) ** self.n_tables
        return total_prob


# Example usage
def demo_cosine_lsh():
    np.random.seed(0)

    # Generate random high-dimensional vectors
    n, dim = 10000, 128
    vectors = np.random.randn(n, dim)

    # Create and index LSH
    lsh = CosineLSH(n_hyperplanes=12, n_tables=8, dim=dim)
    lsh.index(vectors)

    # Query
    query = np.random.randn(dim)
    neighbors = lsh.query(query, k=10)

    print(f"Query found {len(neighbors)} candidates")
    print("Top 5 neighbors (idx, cosine_sim):")
    for idx, sim in neighbors[:5]:
        print(f"  {idx}: {sim:.4f}")

    # Verify with brute force
    true_sims = [np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v))
                 for v in vectors]
    true_top = sorted(enumerate(true_sims), key=lambda x: -x[1])[:10]
    print("\nTrue top 5:")
    for idx, sim in true_top[:5]:
        print(f"  {idx}: {sim:.4f}")
```

Charikar's SimHash (2002) applies random hyperplane LSH to document fingerprinting. Documents are represented as high-dimensional vectors (e.g., TF-IDF), and SimHash produces compact binary fingerprints. Similar documents have similar fingerprints, enabling near-duplicate detection at web scale.
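As an illustration of the SimHash idea (a sketch in the spirit of, not identical to, Charikar's construction), the code below sums per-token random directions weighted by, e.g., TF-IDF weights and keeps only the signs; the helper names `simhash` and `hamming` are my own:

```python
import hashlib
import numpy as np

def simhash(token_weights: dict, n_bits: int = 64) -> int:
    """Compact fingerprint: sign pattern of a weighted sum of per-token random directions."""
    acc = np.zeros(n_bits)
    for token, weight in token_weights.items():
        # Stable per-token seed so the same token always maps to the same direction
        seed = int.from_bytes(hashlib.sha1(token.encode()).digest()[:4], "big")
        direction = np.random.default_rng(seed).standard_normal(n_bits)
        acc += weight * direction
    bits = (acc >= 0).astype(int)
    return int("".join(map(str, bits)), 2)

def hamming(a: int, b: int) -> int:
    """Similar documents yield fingerprints with small Hamming distance."""
    return bin(a ^ b).count("1")

# Example: two near-duplicate token-weight maps share most fingerprint bits
d1 = {"quick": 1.0, "brown": 1.0, "fox": 2.0, "lazy": 0.5, "dog": 1.0}
d2 = {"quick": 1.0, "brown": 1.0, "fox": 2.0, "sleepy": 0.5, "dog": 1.0}
print(hamming(simhash(d1), simhash(d2)), "differing bits out of 64")
```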
For Euclidean ($L_2$) distance, we use a different hash family based on random projections onto lines and discretization.
Random Projection Hash:
For a random unit vector $\mathbf{a}$ and a random offset $b$ uniform in $[0, w]$:
$$h_{\mathbf{a}, b}(\mathbf{x}) = \left\lfloor \frac{\mathbf{a} \cdot \mathbf{x} + b}{w} \right\rfloor$$
This projects $\mathbf{x}$ onto a random line, then discretizes into buckets of width $w$.
Collision Probability:
For two vectors at distance $d = \|\mathbf{x} - \mathbf{y}\|$:
$$p(d) = \Pr[h(\mathbf{x}) = h(\mathbf{y})] = \int_0^w \frac{1}{d} f\left(\frac{t}{d}\right) \left(1 - \frac{t}{w}\right) dt$$
where $f$ is the density of the absolute value of the underlying projection distribution (half-normal for Gaussian projections), so that the projected gap $|\mathbf{a} \cdot \mathbf{x} - \mathbf{a} \cdot \mathbf{y}|$ has density $\frac{1}{d} f(t/d)$.
For standard Gaussian projections:
$$p(d) \approx 1 - 2 \Phi\left(-\frac{w}{d}\right) - \frac{2}{\sqrt{2\pi}(w/d)}\left(1 - e^{-(w/d)^2/2}\right)$$
where $\Phi$ is the standard normal CDF.
Practical Approximation:
$$p(d) \approx 1 - \frac{d}{w} \quad \text{for } d \ll w$$
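As a sanity check on these expressions, the following sketch compares the closed-form Gaussian collision probability with a Monte Carlo simulation of the hash itself (the distances and bucket width are arbitrary test values):

```python
import numpy as np
from scipy.stats import norm

def p_collision_closed_form(d: float, w: float) -> float:
    """Collision probability for Gaussian projections with bucket width w."""
    r = w / d
    return 1 - 2 * norm.cdf(-r) - (2 / (np.sqrt(2 * np.pi) * r)) * (1 - np.exp(-r**2 / 2))

def p_collision_monte_carlo(d: float, w: float, dim: int = 32,
                            n_trials: int = 200_000, seed: int = 0) -> float:
    """Simulate h(x) = floor((a·x + b)/w) for two points at distance d."""
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    y = np.zeros(dim)
    y[0] = d                                   # two points exactly d apart
    a = rng.standard_normal((n_trials, dim))   # random Gaussian projections
    b = rng.uniform(0, w, n_trials)            # random offsets
    hx = np.floor((a @ x + b) / w)
    hy = np.floor((a @ y + b) / w)
    return np.mean(hx == hy)

for d in (1.0, 4.0, 8.0):
    print(d, round(p_collision_closed_form(d, w=4.0), 3),
          round(p_collision_monte_carlo(d, w=4.0), 3))
```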
```python
import numpy as np
from typing import List, Tuple, Set
from collections import defaultdict


class EuclideanLSH:
    """
    LSH for Euclidean (L2) distance using random projections.

    Projects vectors onto random lines and discretizes into buckets.
    Similar vectors (close in L2) land in the same bucket with
    high probability.

    Parameters:
    -----------
    n_projections : int
        Number of random projections per hash table
    n_tables : int
        Number of hash tables
    bucket_width : float
        Width of discretization buckets (controls resolution)
    dim : int
        Dimensionality of input vectors
    """

    def __init__(
        self,
        n_projections: int = 4,
        n_tables: int = 8,
        bucket_width: float = 4.0,
        dim: int = 128,
        seed: int = 42
    ):
        self.n_projections = n_projections
        self.n_tables = n_tables
        self.bucket_width = bucket_width
        self.dim = dim

        np.random.seed(seed)

        # Random projection vectors: (n_tables, n_projections, dim)
        # Gaussian for theoretical guarantees
        self.projections = np.random.randn(n_tables, n_projections, dim)

        # Random offsets: (n_tables, n_projections)
        self.offsets = np.random.uniform(0, bucket_width, (n_tables, n_projections))

        # Hash tables
        self.tables: List[defaultdict] = [
            defaultdict(set) for _ in range(n_tables)
        ]
        self.data = []

    def _hash_single(self, x: np.ndarray, table_idx: int) -> Tuple[int, ...]:
        """
        Compute hash for vector x in table table_idx.

        Returns a tuple of bucket indices (one per projection).
        """
        # Project: (n_projections,)
        projections = self.projections[table_idx] @ x

        # Add offset and discretize
        bucket_indices = np.floor(
            (projections + self.offsets[table_idx]) / self.bucket_width
        ).astype(int)

        return tuple(bucket_indices)

    def index(self, vectors: np.ndarray):
        """Index vectors into all hash tables."""
        self.data = vectors
        for idx, vec in enumerate(vectors):
            for table_idx in range(self.n_tables):
                h = self._hash_single(vec, table_idx)
                self.tables[table_idx][h].add(idx)

    def query(
        self,
        q: np.ndarray,
        k: int = 10,
        radius: float = None
    ) -> List[Tuple[int, float]]:
        """
        Find approximate nearest neighbors.

        Parameters:
        -----------
        q : np.ndarray
            Query vector
        k : int
            Number of neighbors to return
        radius : float, optional
            Return only neighbors within this distance

        Returns:
        --------
        List of (index, distance) pairs, sorted by distance
        """
        candidates: Set[int] = set()
        for table_idx in range(self.n_tables):
            h = self._hash_single(q, table_idx)
            candidates.update(self.tables[table_idx][h])

        if not candidates:
            return []

        # Compute actual distances for candidates
        results = []
        for idx in candidates:
            dist = np.linalg.norm(q - self.data[idx])
            if radius is None or dist <= radius:
                results.append((idx, dist))

        results.sort(key=lambda x: x[1])
        return results[:k]

    def collision_probability(self, d: float) -> float:
        """
        Collision probability for two points at distance d.

        Single projection (Gaussian projections, bucket width w):
            p1(d) = 1 - 2*Phi(-w/d) - 2/(sqrt(2*pi)*(w/d)) * (1 - exp(-(w/d)^2 / 2))
        For multiple projections (AND): p1^n
        For multiple tables (OR): 1 - (1 - p1^n)^m
        """
        from scipy.stats import norm

        if d == 0:
            return 1.0

        # Single projection collision prob (closed form from the text)
        r = self.bucket_width / d
        p1 = 1 - 2 * norm.cdf(-r) - 2 / (np.sqrt(2 * np.pi) * r) * (1 - np.exp(-r**2 / 2))

        # AND over projections
        band_prob = p1 ** self.n_projections
        # OR over tables
        total_prob = 1 - (1 - band_prob) ** self.n_tables
        return total_prob


# Choosing bucket width w
def analyze_bucket_width():
    """
    Demonstrate effect of bucket width on collision probability.
    """
    import matplotlib.pyplot as plt

    widths = [1, 2, 4, 8, 16]
    distances = np.linspace(0.1, 20, 100)

    fig, ax = plt.subplots(figsize=(10, 6))
    for w in widths:
        probs = [max(0, 1 - d/w) for d in distances]
        ax.plot(distances, probs, label=f'w = {w}')

    ax.set_xlabel('Distance d')
    ax.set_ylabel('Collision probability')
    ax.set_title('LSH Collision Probability vs Distance')
    ax.legend()
    ax.grid(True, alpha=0.3)

    return fig
```

The bucket width w is a critical parameter. Small w gives fine-grained buckets (good precision but low recall). Large w gives coarse buckets (good recall but more false positives). A good heuristic is to set w to the expected distance to the nearest neighbor, or tune based on desired precision-recall tradeoff.
For set similarity, MinHash provides an elegant LSH scheme for the Jaccard similarity.
Jaccard Similarity:
For sets $A$ and $B$:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
MinHash Definition:
Let $h$ be a random permutation of all possible elements. Define:
$$\text{MinHash}_h(A) = \min_{a \in A} h(a)$$
The minimum element under permutation $h$.
Remarkable Property:
$$\Pr[\text{MinHash}_h(A) = \text{MinHash}_h(B)] = J(A, B)$$
Proof Sketch: The minimum of $A \cup B$ under $h$ is equally likely to be any element. It's the same for both sets iff it's in $A \cap B$, which happens with probability $|A \cap B| / |A \cup B| = J(A, B)$.
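This collision property is easy to verify empirically; the following sketch (with two arbitrary small sets and a brute-force loop over random permutations) compares the empirical collision rate with the exact Jaccard similarity:

```python
import numpy as np

A = {0, 1, 2, 3, 4, 5}
B = {4, 5, 6, 7}
universe = sorted(A | B | {8, 9})
exact_j = len(A & B) / len(A | B)          # 2 / 8 = 0.25

rng = np.random.default_rng(0)
n_trials = 50_000
matches = 0
for _ in range(n_trials):
    # Random permutation of the universe -> rank of each element
    ranks = {elem: rank for rank, elem in enumerate(rng.permutation(universe))}
    if min(ranks[a] for a in A) == min(ranks[b] for b in B):
        matches += 1

print(f"exact Jaccard:       {exact_j:.3f}")
print(f"empirical collision: {matches / n_trials:.3f}")
```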
MinHash Signature:
Using $k$ independent hash functions, we create a signature:
$$\text{Sig}(A) = [\text{MinHash}_{h_1}(A), \text{MinHash}_{h_2}(A), \ldots, \text{MinHash}_{h_k}(A)]$$
This compresses a set to $k$ integers while preserving similarity information.
```python
import numpy as np
from typing import Set, List, Tuple
from collections import defaultdict
import hashlib


class MinHash:
    """
    MinHash signatures for estimating Jaccard similarity.

    Compresses sets into fixed-size signatures such that:
        Pr[signature match] = Jaccard similarity

    Parameters:
    -----------
    n_hashes : int
        Number of hash functions (signature length)
    seed : int
        Random seed
    """

    def __init__(self, n_hashes: int = 128, seed: int = 42):
        self.n_hashes = n_hashes
        np.random.seed(seed)

        # Generate random hash function parameters
        # We use h(x) = (a*x + b) mod p for a large prime p
        self.p = 2**31 - 1  # Mersenne prime
        self.a = np.random.randint(1, self.p, size=n_hashes)
        self.b = np.random.randint(0, self.p, size=n_hashes)

    def _hash_value(self, val: int, hash_idx: int) -> int:
        """Apply hash function hash_idx to value val."""
        return (self.a[hash_idx] * val + self.b[hash_idx]) % self.p

    def signature(self, s: Set) -> np.ndarray:
        """
        Compute MinHash signature for a set.

        Parameters:
        -----------
        s : Set
            Set of hashable elements

        Returns:
        --------
        np.ndarray of shape (n_hashes,)
            MinHash signature
        """
        sig = np.full(self.n_hashes, np.inf)

        for elem in s:
            # Convert element to integer hash
            elem_hash = hash(elem) % self.p

            # Update signature with minimum for each hash function
            for i in range(self.n_hashes):
                h_val = self._hash_value(elem_hash, i)
                if h_val < sig[i]:
                    sig[i] = h_val

        return sig.astype(np.int64)

    @staticmethod
    def estimate_jaccard(sig1: np.ndarray, sig2: np.ndarray) -> float:
        """
        Estimate Jaccard similarity from signatures.

        Just counts matching positions!
        """
        return np.mean(sig1 == sig2)


class MinHashLSH:
    """
    LSH index using MinHash for Jaccard similarity.

    Uses banding technique: divide signature into b bands of r rows each.
    Two sets are candidates if any band matches exactly.

    Parameters:
    -----------
    n_hashes : int
        Signature length (must equal bands * rows)
    bands : int
        Number of bands
    """

    def __init__(self, n_hashes: int = 128, bands: int = 32, seed: int = 42):
        self.bands = bands
        self.rows = n_hashes // bands
        self.n_hashes = bands * self.rows  # Ensure divisibility

        self.minhash = MinHash(n_hashes=self.n_hashes, seed=seed)

        # One hash table per band
        self.tables: List[defaultdict] = [
            defaultdict(set) for _ in range(bands)
        ]
        self.signatures = []
        self.data = []

    def index(self, sets: List[Set]):
        """Index a collection of sets."""
        self.data = sets
        self.signatures = [self.minhash.signature(s) for s in sets]

        for idx, sig in enumerate(self.signatures):
            for band_idx in range(self.bands):
                # Extract band from signature
                start = band_idx * self.rows
                end = start + self.rows
                band = tuple(sig[start:end])
                self.tables[band_idx][band].add(idx)

    def query(self, q: Set, k: int = 10) -> List[Tuple[int, float]]:
        """
        Find similar sets.

        Returns:
        --------
        List of (index, estimated_jaccard) pairs
        """
        q_sig = self.minhash.signature(q)

        # Find candidates from any matching band
        candidates: Set[int] = set()
        for band_idx in range(self.bands):
            start = band_idx * self.rows
            end = start + self.rows
            band = tuple(q_sig[start:end])
            candidates.update(self.tables[band_idx][band])

        # Compute estimated Jaccard for candidates
        results = []
        for idx in candidates:
            jaccard_est = MinHash.estimate_jaccard(q_sig, self.signatures[idx])
            results.append((idx, jaccard_est))

        results.sort(key=lambda x: -x[1])
        return results[:k]

    def s_curve_probability(self, jaccard: float) -> float:
        """
        Probability of becoming a candidate given Jaccard similarity.

        Follows an S-curve: Pr = 1 - (1 - J^r)^b
        where r = rows per band, b = number of bands
        """
        return 1 - (1 - jaccard ** self.rows) ** self.bands


# Example: Near-duplicate document detection
def demo_document_lsh():
    # Represent documents as sets of shingles (n-grams)
    def shingle(text: str, k: int = 5) -> Set[str]:
        """Create k-shingles from text."""
        return {text[i:i+k] for i in range(len(text) - k + 1)}

    docs = [
        "The quick brown fox jumps over the lazy dog",
        "The quick brown fox leaps over the lazy dog",   # Similar to doc 0
        "A completely different document about cats",
        "The fast brown fox jumps over the sleepy dog",  # Similar to doc 0
        "Another document about something else entirely",
    ]

    # Create shingle sets
    shingle_sets = [shingle(doc) for doc in docs]

    # Build LSH index
    lsh = MinHashLSH(n_hashes=128, bands=16)
    lsh.index(shingle_sets)

    # Query with each document
    for i, doc in enumerate(docs):
        similar = lsh.query(shingle_sets[i], k=3)
        print(f"Doc {i}: '{doc[:40]}...'")
        for idx, jaccard in similar:
            if idx != i:
                print(f"  Similar to Doc {idx} (J≈{jaccard:.3f})")
```

The banding technique creates an S-curve probability: pairs above a threshold similarity almost always become candidates, while pairs below almost never do. The threshold is approximately (1/b)^(1/r) where b=bands and r=rows. Choose b and r to position the threshold at your desired Jaccard cutoff.
A single LSH hash function provides only weak separation between similar and dissimilar items. Amplification combines multiple hashes to achieve arbitrarily strong separation.
Two Amplification Strategies:
AND-Amplification (Increases Precision):
Require all $r$ hash functions to match: $$g(\mathbf{x}) = (h_1(\mathbf{x}), h_2(\mathbf{x}), \ldots, h_r(\mathbf{x}))$$
Collision probability: $$p_r = p^r$$
This dramatically reduces false positives but also reduces true positives.
OR-Amplification (Increases Recall):
Use $b$ independent hash tables; an item is a candidate if it matches in any table.
Collision probability: $$p_b = 1 - (1 - p)^b$$
This recovers true pairs that any single hash might miss.
Combined (b, r)-Banding:
Use $b$ bands of $r$ rows each. Match if any band matches completely:
$$P(r, b) = 1 - (1 - p^r)^b$$
This creates the characteristic S-curve: sharp transition from low to high probability at a threshold similarity.
```python
import numpy as np
import matplotlib.pyplot as plt


def plot_s_curves():
    """
    Visualize how banding parameters affect the S-curve.
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Basic collision probability (for illustration)
    s = np.linspace(0, 1, 100)  # Similarity

    # Left: Effect of changing bands (fixed rows)
    ax = axes[0]
    ax.set_title('Effect of Number of Bands (r=5 fixed)')
    r = 5
    for b in [1, 4, 16, 64, 256]:
        prob = 1 - (1 - s**r)**b
        ax.plot(s, prob, label=f'b={b}')
    ax.set_xlabel('Similarity')
    ax.set_ylabel('Probability of becoming candidate')
    ax.legend()
    ax.grid(True, alpha=0.3)

    # Right: Effect of changing rows (fixed bands)
    ax = axes[1]
    ax.set_title('Effect of Rows per Band (b=16 fixed)')
    b = 16
    for r in [1, 2, 4, 8, 16]:
        prob = 1 - (1 - s**r)**b
        ax.plot(s, prob, label=f'r={r}')
    ax.set_xlabel('Similarity')
    ax.set_ylabel('Probability of becoming candidate')
    ax.legend()
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    return fig


def find_params_for_threshold(target_sim: float,
                              n_hashes: int = 128,
                              target_prob: float = 0.5) -> tuple:
    """
    Find (b, r) parameters to achieve target threshold.

    The threshold t ≈ (1/b)^(1/r) where Pr(candidate | J=t) = 0.5

    Parameters:
    -----------
    target_sim : float
        Desired similarity threshold (e.g., 0.8 for 80% similarity)
    n_hashes : int
        Total signature length
    target_prob : float
        Probability at threshold (usually 0.5)

    Returns:
    --------
    (bands, rows, threshold) tuple with bands * rows <= n_hashes
    """
    best_params = None
    best_distance = float('inf')

    for b in range(1, n_hashes + 1):
        for r in range(1, n_hashes // b + 1):
            if b * r > n_hashes:
                continue

            # Compute threshold t where Pr(candidate) = target_prob:
            #   1 - (1 - t^r)^b = target_prob
            #   (1 - t^r)^b = 1 - target_prob
            #   t = (1 - (1 - target_prob)^(1/b))^(1/r)
            threshold = (1 - (1 - target_prob) ** (1 / b)) ** (1 / r)

            distance = abs(threshold - target_sim)
            if distance < best_distance:
                best_distance = distance
                best_params = (b, r, threshold)

    return best_params


def analyze_amplification():
    """
    Show how amplification separates similar from dissimilar.
    """
    # Without amplification
    single_p_similar = 0.9      # Similar pairs collide 90%
    single_p_dissimilar = 0.3   # Dissimilar still collide 30%!

    print("Without amplification:")
    print(f"  Pr(collision | similar): {single_p_similar:.2f}")
    print(f"  Pr(collision | dissimilar): {single_p_dissimilar:.2f}")
    print(f"  Ratio: {single_p_similar/single_p_dissimilar:.2f}x")

    # With (b=4, r=5) amplification (20 hash functions)
    b, r = 4, 5
    amp_p_similar = 1 - (1 - single_p_similar**r)**b
    amp_p_dissimilar = 1 - (1 - single_p_dissimilar**r)**b

    print(f"\nWith (b={b}, r={r}) amplification:")
    print(f"  Pr(collision | similar): {amp_p_similar:.4f}")
    print(f"  Pr(collision | dissimilar): {amp_p_dissimilar:.6f}")
    print(f"  Ratio: {amp_p_similar/amp_p_dissimilar:.1f}x")


# Example output (approximate):
# Without amplification:
#   Pr(collision | similar): 0.90
#   Pr(collision | dissimilar): 0.30
#   Ratio: 3.00x
#
# With (b=4, r=5) amplification:
#   Pr(collision | similar): 0.9719
#   Pr(collision | dissimilar): 0.009685
#   Ratio: ~100x
```

| Bands (b) | Rows (r) | Total Hashes | Threshold (J≈) | Effect |
|---|---|---|---|---|
| 1 | 16 | 16 | 0.98 | Very high precision, low recall |
| 4 | 4 | 16 | 0.84 | Balanced |
| 8 | 2 | 16 | 0.68 | Higher recall |
| 16 | 1 | 16 | 0.50 | Very high recall, lower precision |
| 32 | 4 | 128 | 0.78 | Production setting (balanced) |
| 16 | 8 | 128 | 0.90 | High sensitivity detection |
Let's analyze the complexity of LSH-based nearest neighbor search.
Preprocessing:

- Hash each of the $n$ points into every one of the $L$ tables ($k$ hash functions of cost $O(d)$ each): $O(nLkd)$ time.
- Store the $L$ hash tables plus the original vectors: $O(nL + nd)$ space.

Query:

- Compute the query's hashes in every table: $O(Lkd)$.
- Take the union of the matching buckets and compute exact distances to those candidates: $O(|C| \cdot d)$ for $|C|$ candidates.
Key Quantity: Expected Candidate Size
For a $(r_1, r_2, p_1, p_2)$-sensitive family with $L$ tables, each using $k$ concatenated hash functions: a far point ($d \geq r_2$) collides with the query in any one table with probability at most $p_2^k$, so the expected number of far candidates is at most $L \cdot n p_2^k$; a near point ($d \leq r_1$) is retrieved with probability at least $1 - (1 - p_1^k)^L$.
Optimal Parameters (Indyk-Motwani):
Using $k = \log_{1/p_2} n$ hash functions per table and $L = n^\rho$ tables, where $\rho = \log(1/p_1) / \log(1/p_2)$: the expected number of far candidates per table is $O(1)$, a near neighbor is found with constant probability, and the query time becomes $O(n^\rho d)$ with $O(n^{1+\rho} + nd)$ space.
For the best LSH families, $\rho \approx 1/c$ for $c$-approximation, giving $O(n^{1/c})$ query time.
| Operation | Complexity | Notes |
|---|---|---|
| Preprocessing | $O(nLkd)$ | One-time; parallelize easily |
| Space | $O(nL + nd)$ | L hash tables + original data |
| Query (compute hashes) | $O(Lkd)$ | L tables, k functions, d dims |
| Query (candidates) | $O(n^\rho d)$ expected | $\rho < 1$ for sublinear |
| Total Query | $O(n^\rho d + Lkd)$ | Sublinear for good families |
The parameter ρ = log(1/p₁)/log(1/p₂) determines query efficiency. For c-approximate NN, the best known LSH families achieve ρ ≈ 1/c. This means 2-approximate NN can be done in O(√n) time, and (1+ε)-approximate in O(n^(1/(1+ε))) time. Lower bounds suggest this is nearly optimal.
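To make the parameter recipe concrete, the sketch below plugs hypothetical values of $n$, $p_1$, and $p_2$ (chosen only for illustration) into the Indyk-Motwani formulas:

```python
import math

n = 1_000_000
p1, p2 = 0.8, 0.4                                # hypothetical collision probabilities

rho = math.log(1 / p1) / math.log(1 / p2)
k = math.ceil(math.log(n) / math.log(1 / p2))    # k = log_{1/p2} n hash functions per table
L = math.ceil(n ** rho)                          # L = n^rho tables

print(f"rho = {rho:.3f}")
print(f"k   = {k} hash functions per table")
print(f"L   = {L} tables (vs. n = {n} points)")
print(f"expected candidate work ~ n^rho = {n**rho:.0f}")
```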
Deploying LSH in production requires attention to several practical issues.
Parameter Tuning:
- Bucket width (for Euclidean): start with $w$ equal to the average nearest-neighbor distance, then tune based on the precision-recall curve (see the sketch after this list).
- Number of tables $L$: more tables means higher recall but more memory and query cost. Typical: 4-50 tables.
- Hash length $k$: more hash functions per table means tighter buckets. Balance against the probability of empty buckets.
- Banding parameters: choose them to place the S-curve threshold at your desired similarity cutoff.
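A rough sketch of the bucket-width heuristic mentioned above, estimating $w$ from nearest-neighbor distances on a random sample (the sample size and brute-force distance loop are illustrative choices):

```python
import numpy as np

def estimate_bucket_width(vectors: np.ndarray, n_samples: int = 200, seed: int = 0) -> float:
    """Estimate w as the average nearest-neighbor distance over a random sample."""
    rng = np.random.default_rng(seed)
    sample_idx = rng.choice(len(vectors), size=min(n_samples, len(vectors)), replace=False)
    nn_dists = []
    for i in sample_idx:
        d = np.linalg.norm(vectors - vectors[i], axis=1)
        d[i] = np.inf                      # exclude the point itself
        nn_dists.append(d.min())
    return float(np.mean(nn_dists))

# Usage (with the EuclideanLSH class defined earlier):
# w = estimate_bucket_width(vectors)
# lsh = EuclideanLSH(bucket_width=w, dim=vectors.shape[1])
```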
Multi-Probe LSH:
Instead of checking only the exact bucket, also check nearby buckets that a query might have hashed to with slightly different random choices. This dramatically reduces the required number of tables while maintaining recall.
```python
import numpy as np
from typing import List, Set, Tuple
from collections import defaultdict


class MultiProbeLSH:
    """
    Multi-probe LSH for improved efficiency.

    Instead of using many tables, probe multiple buckets per table.
    Buckets to probe are those where the query was 'close' to the boundary.

    This can reduce table count by 10-100x while maintaining recall.
    """

    def __init__(
        self,
        n_projections: int = 16,
        n_tables: int = 4,
        n_probes: int = 10,        # Number of additional buckets to probe
        bucket_width: float = 4.0,
        dim: int = 128,
        seed: int = 42
    ):
        self.n_projections = n_projections
        self.n_tables = n_tables
        self.n_probes = n_probes
        self.bucket_width = bucket_width
        self.dim = dim

        np.random.seed(seed)
        self.projections = np.random.randn(n_tables, n_projections, dim)
        self.offsets = np.random.uniform(0, bucket_width, (n_tables, n_projections))

        self.tables = [defaultdict(set) for _ in range(n_tables)]
        self.data = []

    def _hash_with_residuals(
        self,
        x: np.ndarray,
        table_idx: int
    ) -> Tuple[Tuple[int, ...], List[Tuple[int, float, int]]]:
        """
        Hash x and return residuals to bucket boundaries.

        Residual = how close was x to falling into an adjacent bucket.
        """
        projections = self.projections[table_idx] @ x
        shifted = projections + self.offsets[table_idx]

        bucket_indices = np.floor(shifted / self.bucket_width).astype(int)

        # Compute distance to each bucket boundary
        residuals = shifted / self.bucket_width - bucket_indices
        # residual in [0, 1): 0 means at lower boundary, 1 means at upper

        # For each dimension, compute delta to nearest boundary
        deltas = []
        for i, r in enumerate(residuals):
            if r < 0.5:
                deltas.append((i, r, -1))      # Close to lower boundary, try -1
            else:
                deltas.append((i, 1 - r, +1))  # Close to upper boundary, try +1

        # Sort by distance (smallest residual = closest to boundary)
        deltas.sort(key=lambda t: t[1])

        return tuple(bucket_indices), deltas

    def _generate_probe_buckets(
        self,
        base_bucket: Tuple[int, ...],
        deltas: List[Tuple[int, float, int]],
        n_probes: int
    ) -> List[Tuple[int, ...]]:
        """
        Generate additional buckets to probe.

        Perturbations are applied in order of proximity to boundary.
        """
        buckets = [base_bucket]

        # Single perturbations
        for i, (dim, dist, direction) in enumerate(deltas[:n_probes]):
            new_bucket = list(base_bucket)
            new_bucket[dim] += direction
            buckets.append(tuple(new_bucket))
            if len(buckets) >= n_probes + 1:
                break

        # Could add double perturbations for more probes
        if len(buckets) < n_probes + 1:
            for i in range(min(len(deltas), n_probes)):
                for j in range(i + 1, min(len(deltas), n_probes)):
                    dim1, _, dir1 = deltas[i]
                    dim2, _, dir2 = deltas[j]
                    new_bucket = list(base_bucket)
                    new_bucket[dim1] += dir1
                    new_bucket[dim2] += dir2
                    buckets.append(tuple(new_bucket))
                    if len(buckets) >= n_probes + 1:
                        break
                if len(buckets) >= n_probes + 1:
                    break

        return buckets[:n_probes + 1]

    def index(self, vectors: np.ndarray):
        """Index vectors into the exact bucket only (probing happens at query time)."""
        self.data = vectors
        for idx, vec in enumerate(vectors):
            for table_idx in range(self.n_tables):
                bucket, _ = self._hash_with_residuals(vec, table_idx)
                self.tables[table_idx][bucket].add(idx)

    def query(self, q: np.ndarray, k: int = 10) -> List[Tuple[int, float]]:
        """
        Query with multi-probing.
        """
        candidates: Set[int] = set()

        for table_idx in range(self.n_tables):
            base_bucket, deltas = self._hash_with_residuals(q, table_idx)
            probe_buckets = self._generate_probe_buckets(
                base_bucket, deltas, self.n_probes
            )
            for bucket in probe_buckets:
                candidates.update(self.tables[table_idx][bucket])

        # Score candidates
        results = []
        for idx in candidates:
            dist = np.linalg.norm(q - self.data[idx])
            results.append((idx, dist))

        results.sort(key=lambda x: x[1])
        return results[:k]
```

What's Next:
LSH provides theoretical guarantees for approximate NN, but modern practice has moved toward graph-based methods like HNSW and quantization methods like FAISS's IVF and PQ. The final page surveys these and other Approximate Nearest Neighbor (ANN) techniques that power production similarity search at companies like Spotify, Google, and Meta.
You now understand Locality-Sensitive Hashing: its theoretical foundations, hash families for different metrics, amplification techniques, and practical considerations. LSH was a breakthrough in showing that approximate NN is fundamentally easier than exact NN. Next, we survey the broader landscape of approximate methods for production-scale similarity search.