When working with high-dimensional vector representations—document embeddings, user preference vectors, neural network features—cosine similarity is often the metric of choice. Unlike Euclidean distance, cosine similarity captures the directional relationship between vectors, ignoring their magnitude. Two documents about machine learning should be similar whether they're 500 words or 5,000 words.
Cosine similarity between vectors $\mathbf{u}$ and $\mathbf{v}$ is defined as:
$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|}$$
This is simply the cosine of the angle $\theta$ between the vectors. Parallel vectors have similarity 1, orthogonal vectors have similarity 0, and anti-parallel vectors have similarity -1.
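As a quick concrete check of this scale-invariance, here is a minimal NumPy sketch (the vectors are made up for illustration):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between u and v -- scale-invariant by construction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(u, 2 * u))                       # ~1.0: same direction, magnitude ignored
print(cosine_similarity(u, np.array([3.0, 0.0, -1.0])))  # 0.0: orthogonal vectors
```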
But how do we build a locality-sensitive hash function for the angle between vectors? The answer is beautifully geometric: random hyperplanes.
By the end of this page, you will understand how random hyperplanes create locality-sensitive hashes for cosine similarity. You'll derive the collision probability formula, understand the geometric intuition, and learn to implement efficient random projection LSH for high-dimensional vector search.
A hyperplane in $d$-dimensional space is a $(d-1)$-dimensional subspace that divides the space into two half-spaces. In 2D, a hyperplane is a line. In 3D, it's a plane. In higher dimensions, we can't visualize it, but the math works identically.
A hyperplane through the origin can be defined by its normal vector $\mathbf{r}$. Every point $\mathbf{x}$ in space can be classified based on which side of the hyperplane it lies:
$$h_{\mathbf{r}}(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{r} \cdot \mathbf{x} \geq 0 \\ 0 & \text{if } \mathbf{r} \cdot \mathbf{x} < 0 \end{cases}$$
This is our hash function. It outputs a single bit based on which half-space the point occupies.
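A minimal sketch of this one-bit hash in NumPy (the 128-dimensional vectors here are arbitrary illustrations):

```python
import numpy as np

def hyperplane_hash(r: np.ndarray, x: np.ndarray) -> int:
    """One bit: which side of the hyperplane with normal r does x fall on?"""
    return 1 if np.dot(r, x) >= 0 else 0

rng = np.random.default_rng(0)
r = rng.standard_normal(128)   # random hyperplane normal in 128 dimensions
x = rng.standard_normal(128)   # an arbitrary data vector
print(hyperplane_hash(r, x))   # prints 0 or 1
```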
The Key Insight:
Consider two vectors $\mathbf{u}$ and $\mathbf{v}$ with angle $\theta$ between them. If we draw a random hyperplane through the origin, what's the probability that $\mathbf{u}$ and $\mathbf{v}$ end up on different sides?
The hyperplane separates $\mathbf{u}$ and $\mathbf{v}$ if and only if it passes through the angle $\theta$ between them. The probability of this is exactly:
$$\Pr[h_{\mathbf{r}}(\mathbf{u}) \neq h_{\mathbf{r}}(\mathbf{v})] = \frac{\theta}{\pi}$$
And therefore, the collision probability is:
$$\Pr[h_{\mathbf{r}}(\mathbf{u}) = h_{\mathbf{r}}(\mathbf{v})] = 1 - \frac{\theta}{\pi}$$
Imagine vectors u and v in 2D forming an angle θ. Now spin a random line (hyperplane in 2D) through the origin. The line separates u and v only when it passes through the "wedge" between them. That wedge spans θ out of the full π radians (180°). So the separation probability is θ/π.
Why does this work for any dimension?
The beautiful fact is that this probability depends only on the angle $\theta$, not on the dimensionality $d$. Here's why: the two vectors span a 2-dimensional plane, and which side of the hyperplane each vector lands on is determined entirely by the component of the random normal $\mathbf{r}$ that lies in that plane. Because a Gaussian random vector is spherically symmetric, that in-plane component points in a uniformly random direction, so the problem reduces to spinning a random line in 2D no matter how large $d$ is.
This dimensional independence is what makes random hyperplane LSH so powerful—it works efficiently regardless of whether your vectors are 10-dimensional or 10,000-dimensional.
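A quick Monte Carlo check of both claims, using a pair of made-up unit vectors at a fixed angle; the empirical collision rate should match $1 - \theta/\pi$ in low and high dimensions alike:

```python
import numpy as np

def empirical_collision_rate(dim: int, theta: float, trials: int = 20_000, seed: int = 0) -> float:
    """Fraction of random hyperplanes putting two vectors at angle theta on the same side."""
    rng = np.random.default_rng(seed)
    u = np.zeros(dim)
    u[0] = 1.0                                  # unit vector along the first axis
    v = np.zeros(dim)
    v[0], v[1] = np.cos(theta), np.sin(theta)   # unit vector at angle theta from u
    r = rng.standard_normal((trials, dim))      # one random hyperplane normal per row
    same_side = (r @ u >= 0) == (r @ v >= 0)
    return float(same_side.mean())

theta = np.pi / 3                               # 60 degrees
print(f"theory: {1 - theta / np.pi:.4f}")                       # 0.6667
print(f"d=10:   {empirical_collision_rate(10, theta):.4f}")     # ~0.667
print(f"d=1000: {empirical_collision_rate(1000, theta):.4f}")   # ~0.667
```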
Let's rigorously derive the collision probability formula. This derivation gives us deep insight into why random hyperplanes work.
Setup: Let $\mathbf{u}$ and $\mathbf{v}$ be vectors with angle $\theta$ between them, and let $\mathbf{r}$ be a random vector whose components are drawn i.i.d. from $\mathcal{N}(0, 1)$. Each vector is hashed to the sign of its projection onto $\mathbf{r}$.
Claim: $\Pr[\text{sign}(\mathbf{r} \cdot \mathbf{u}) = \text{sign}(\mathbf{r} \cdot \mathbf{v})] = 1 - \frac{\theta}{\pi}$
Proof:
Without loss of generality, assume $\mathbf{u}$ and $\mathbf{v}$ are unit vectors (normalization doesn't change signs).
Define the projections $X = \mathbf{r} \cdot \mathbf{u}$ and $Y = \mathbf{r} \cdot \mathbf{v}$.
Since $\mathbf{r}$ has i.i.d. $\mathcal{N}(0, 1)$ components, $X$ and $Y$ are jointly Gaussian with $\mathbb{E}[X] = \mathbb{E}[Y] = 0$, $\mathrm{Var}(X) = \|\mathbf{u}\|^2 = 1$, $\mathrm{Var}(Y) = \|\mathbf{v}\|^2 = 1$, and $\mathrm{Cov}(X, Y) = \mathbf{u} \cdot \mathbf{v} = \cos(\theta)$.
So $(X, Y)$ follows a bivariate normal distribution with correlation $\rho = \cos(\theta)$.
Key Lemma: For bivariate normal $(X, Y)$ with zero means, unit variances, and correlation $\rho$:
$$\Pr[\text{sign}(X) = \text{sign}(Y)] = 1 - \frac{1}{\pi} \cos^{-1}(\rho)$$
Proof of Lemma:
We need to compute $\Pr[XY > 0] = \Pr[X > 0, Y > 0] + \Pr[X < 0, Y < 0]$.
By symmetry, $$\Pr[X > 0, Y > 0] = \Pr[X < 0, Y < 0]$$
So $\Pr[XY > 0] = 2 \Pr[X > 0, Y > 0]$.
For bivariate normal with correlation $\rho$, this is a classical result:
$$\Pr[X > 0, Y > 0] = \frac{1}{4} + \frac{1}{2\pi} \sin^{-1}(\rho)$$
This can be derived by transforming to polar coordinates and integrating over the quadrant.
Therefore: $$\Pr[\text{sign}(X) = \text{sign}(Y)] = 2 \cdot \left(\frac{1}{4} + \frac{1}{2\pi} \sin^{-1}(\rho)\right) = \frac{1}{2} + \frac{1}{\pi} \sin^{-1}(\rho)$$
Using the identity $\sin^{-1}(\rho) = \frac{\pi}{2} - \cos^{-1}(\rho)$:
$$= \frac{1}{2} + \frac{1}{\pi}\left(\frac{\pi}{2} - \cos^{-1}(\rho)\right) = 1 - \frac{1}{\pi} \cos^{-1}(\rho)$$
Completing the Main Proof:
Since $\rho = \cos(\theta)$, we have $\cos^{-1}(\rho) = \theta$, giving us:
$$\Pr[h_{\mathbf{r}}(\mathbf{u}) = h_{\mathbf{r}}(\mathbf{v})] = 1 - \frac{\theta}{\pi}$$
$$\blacksquare$$
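The lemma itself is easy to sanity-check numerically by sampling from a bivariate normal with a chosen correlation; this is a sketch with an arbitrary $\rho$:

```python
import numpy as np

rho = 0.6                                    # arbitrary correlation for the check
rng = np.random.default_rng(1)
cov = np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal([0.0, 0.0], cov, size=500_000)
x, y = xy[:, 0], xy[:, 1]

empirical = np.mean(np.sign(x) == np.sign(y))
theoretical = 1 - np.arccos(rho) / np.pi
print(f"empirical:   {empirical:.4f}")       # ~0.705
print(f"theoretical: {theoretical:.4f}")     # 0.7048
```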
This random hyperplane method is also known as SimHash, introduced by Moses Charikar in 2002. It's used extensively in practice for near-duplicate detection, plagiarism checking, and similarity search in high-dimensional spaces.
Now let's connect the collision probability to the LSH framework. Recall that an LSH family is $(d_1, d_2, p_1, p_2)$-sensitive if near points (distance $\leq d_1$) collide with probability $\geq p_1$ and far points (distance $\geq d_2$) collide with probability $\leq p_2$.
For cosine similarity, we typically work with angular distance:
$$d_{\text{angular}}(\mathbf{u}, \mathbf{v}) = \frac{\theta}{\pi} = \frac{\cos^{-1}(\text{sim}(\mathbf{u}, \mathbf{v}))}{\pi}$$
This normalizes the angle to $[0, 1]$, where 0 means identical direction and 1 means opposite direction.
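A small helper for this conversion (a sketch; the clip guards against floating-point values just outside $[-1, 1]$):

```python
import numpy as np

def angular_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Normalized angular distance in [0, 1]: 0 = same direction, 1 = opposite."""
    sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(sim, -1.0, 1.0)) / np.pi)

u = np.array([1.0, 0.0])
print(angular_distance(u, np.array([1.0, 1.0])))   # 0.25 (45 degrees)
print(angular_distance(u, np.array([-1.0, 0.0])))  # 1.0  (opposite direction)
```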
LSH Sensitivity Parameters:
For angular distance thresholds $d_1$ and $d_2$ with $d_1 < d_2$, the family is $(d_1, d_2, p_1, p_2)$-sensitive with $p_1 = 1 - d_1$ and $p_2 = 1 - d_2$, since a point at angular distance $d$ collides with probability exactly $1 - d$.
The ρ Parameter:
$$\rho = \frac{\ln(1/p_1)}{\ln(1/p_2)} = \frac{\ln(1/(1-d_1))}{\ln(1/(1-d_2))}$$
For approximate near-neighbor with ratio $c$ (i.e., we accept points at distance $cd_1$ instead of $d_1$), setting $d_2 = cd_1$ gives:
$$\rho = \frac{\ln(1/(1-d_1))}{\ln(1/(1-cd_1))}$$
For small $d_1$ (high similarity), using $-\ln(1-x) \approx x$:
$$\rho \approx \frac{d_1}{cd_1} = \frac{1}{c}$$
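A quick numeric check of this approximation, with illustrative (made-up) values of $d_1$ and $c$:

```python
import numpy as np

def rho_exact(d1: float, c: float) -> float:
    """Exact rho for angular-distance thresholds d1 and d2 = c * d1."""
    return float(np.log(1 / (1 - d1)) / np.log(1 / (1 - c * d1)))

print(rho_exact(0.05, 2.0))  # ~0.487, close to the 1/c = 0.5 approximation
print(rho_exact(0.20, 2.0))  # ~0.437: the approximation degrades for larger d1
```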
| Cosine Similarity | Angle θ (radians) | Angular Distance θ/π | Collision Probability |
|---|---|---|---|
| 1.0 (identical) | 0 | 0 | 1.0 |
| 0.95 | 0.318 | 0.101 | 0.899 |
| 0.9 | 0.451 | 0.144 | 0.856 |
| 0.8 | 0.644 | 0.205 | 0.795 |
| 0.5 | 1.047 | 0.333 | 0.667 |
| 0.0 (orthogonal) | 1.571 (π/2) | 0.5 | 0.5 |
| -0.5 | 2.094 | 0.667 | 0.333 |
| -1.0 (opposite) | 3.142 (π) | 1.0 | 0.0 |
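These rows can be reproduced directly from the formula; a short sketch:

```python
import numpy as np

for sim in [1.0, 0.95, 0.9, 0.8, 0.5, 0.0, -0.5, -1.0]:
    theta = np.arccos(np.clip(sim, -1.0, 1.0))
    print(f"sim={sim:+.2f}  theta={theta:.3f} rad  "
          f"angular_dist={theta / np.pi:.3f}  collision_prob={1 - theta / np.pi:.3f}")
```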
Observation on the Probability Curve:
Notice that the collision probability is linear in the angular distance:
$$p(\theta) = 1 - \frac{\theta}{\pi}$$
This linear relationship is special. Many LSH families have S-shaped or concave collision probability curves. The linear curve of random hyperplane LSH means collision probability degrades smoothly and predictably as the angle grows, which makes the effect of the amplification parameters easy to reason about analytically.
However, the lack of a sharp transition also means we may need more hash functions compared to LSH families with steeper probability curves.
Let's build a complete, production-quality implementation of random hyperplane LSH for cosine similarity.
"""Random Hyperplane LSH for Cosine Similarity. This implementation is optimized for efficiency while remaining readable.Key optimizations:1. Matrix operations instead of loops2. Pre-computed random hyperplanes3. Efficient bit packing for hash keys""" import numpy as npfrom typing import List, Tuple, Optional, Dict, Setfrom collections import defaultdictfrom dataclasses import dataclass, fieldimport time @dataclassclass RandomHyperplaneLSH: """ LSH index using random hyperplanes for cosine similarity. Parameters: ----------- dim : int Dimensionality of vectors num_hyperplanes : int (k) Number of hyperplanes per hash table. Higher k means: - Fewer candidates per bucket (more precise) - Higher risk of missing true neighbors num_tables : int (L) Number of hash tables. Higher L means: - More likely to find true neighbors - More candidates to check - More memory and query time seed : int Random seed for reproducibility """ dim: int num_hyperplanes: int = 10 # k num_tables: int = 20 # L seed: int = 42 # Initialized in __post_init__ hyperplanes: np.ndarray = field(init=False) # Shape: (L, k, dim) tables: List[Dict] = field(init=False) vectors: Optional[np.ndarray] = field(default=None, init=False) def __post_init__(self): np.random.seed(self.seed) # Generate random hyperplanes for all tables at once # Each hyperplane is a random normal vector self.hyperplanes = np.random.randn( self.num_tables, self.num_hyperplanes, self.dim ) # Normalize hyperplanes (optional, but good practice) norms = np.linalg.norm(self.hyperplanes, axis=2, keepdims=True) self.hyperplanes = self.hyperplanes / norms # Initialize empty hash tables self.tables = [defaultdict(list) for _ in range(self.num_tables)] def _hash_vector( self, vector: np.ndarray, table_idx: int ) -> Tuple[int, ...]: """ Compute hash key for a vector in a specific table. The hash is a tuple of 0s and 1s indicating which side of each hyperplane the vector lies on. """ # Dot product with all hyperplanes in this table # Shape: (num_hyperplanes,) projections = np.dot(self.hyperplanes[table_idx], vector) # Convert to binary: 1 if positive, 0 if negative bits = (projections >= 0).astype(int) # Return as tuple (hashable) return tuple(bits.tolist()) def _hash_batch( self, vectors: np.ndarray ) -> List[List[Tuple[int, ...]]]: """ Compute hash keys for multiple vectors across all tables. Returns: List of L lists, each containing n hash keys """ n = len(vectors) all_hashes = [] for table_idx in range(self.num_tables): # Compute all projections at once # Shape: (n, num_hyperplanes) projections = np.dot(vectors, self.hyperplanes[table_idx].T) # Convert to binary bits = (projections >= 0).astype(int) # Convert each row to tuple hashes = [tuple(row.tolist()) for row in bits] all_hashes.append(hashes) return all_hashes def fit(self, vectors: np.ndarray) -> 'RandomHyperplaneLSH': """ Build the LSH index from a dataset of vectors. 
Parameters: ----------- vectors : np.ndarray of shape (n, dim) Dataset of n vectors Returns: -------- self : for method chaining """ self.vectors = vectors.copy() n = len(vectors) # Normalize vectors (cosine similarity is scale-invariant) norms = np.linalg.norm(vectors, axis=1, keepdims=True) norms = np.where(norms == 0, 1, norms) # Avoid division by zero normalized = vectors / norms # Compute all hashes all_hashes = self._hash_batch(normalized) # Insert into tables for table_idx in range(self.num_tables): for vec_idx in range(n): hash_key = all_hashes[table_idx][vec_idx] self.tables[table_idx][hash_key].append(vec_idx) return self def query( self, query: np.ndarray, num_neighbors: int = 5, return_similarity: bool = True ) -> List[Tuple[int, float]]: """ Find approximate nearest neighbors for a query vector. Parameters: ----------- query : np.ndarray of shape (dim,) Query vector num_neighbors : int Number of neighbors to return return_similarity : bool If True, return cosine similarity; else return angular distance Returns: -------- List of (index, similarity/distance) tuples, sorted by similarity """ if self.vectors is None: raise ValueError("Index not built. Call fit() first.") # Normalize query query_norm = np.linalg.norm(query) if query_norm == 0: raise ValueError("Query vector has zero norm") query_normalized = query / query_norm # Collect candidates from all tables candidates: Set[int] = set() for table_idx in range(self.num_tables): hash_key = self._hash_vector(query_normalized, table_idx) bucket = self.tables[table_idx].get(hash_key, []) candidates.update(bucket) if not candidates: # No candidates found; fall back to checking a sample candidates = set( np.random.choice(len(self.vectors), min(100, len(self.vectors)), replace=False) ) # Compute exact cosine similarities for candidates candidate_indices = list(candidates) candidate_vectors = self.vectors[candidate_indices] # Normalize candidate vectors norms = np.linalg.norm(candidate_vectors, axis=1, keepdims=True) norms = np.where(norms == 0, 1, norms) candidate_normalized = candidate_vectors / norms # Compute similarities similarities = np.dot(candidate_normalized, query_normalized) # Sort by similarity (descending) sorted_indices = np.argsort(-similarities) # Return top-k results = [] for i in sorted_indices[:num_neighbors]: idx = candidate_indices[i] sim = similarities[i] if return_similarity: results.append((idx, float(sim))) else: # Angular distance results.append((idx, float(np.arccos(np.clip(sim, -1, 1)) / np.pi))) return results def get_stats(self) -> Dict: """Get statistics about the index.""" bucket_sizes = [] for table in self.tables: bucket_sizes.extend(len(v) for v in table.values()) return { "num_tables": self.num_tables, "num_hyperplanes": self.num_hyperplanes, "num_vectors": len(self.vectors) if self.vectors is not None else 0, "total_buckets": sum(len(t) for t in self.tables), "avg_bucket_size": np.mean(bucket_sizes) if bucket_sizes else 0, "max_bucket_size": max(bucket_sizes) if bucket_sizes else 0, "empty_buckets": sum(1 for t in self.tables for v in t.values() if len(v) == 0), } def demo(): """Demonstrate Random Hyperplane LSH.""" np.random.seed(42) # Create dataset n = 50000 dim = 256 print(f"Creating dataset: {n} vectors, {dim} dimensions") vectors = np.random.randn(n, dim) # Create query (similar to vector 0) query = vectors[0] + 0.1 * np.random.randn(dim) # Build index print("\nBuilding LSH index...") start = time.time() lsh = RandomHyperplaneLSH(dim=dim, num_hyperplanes=12, num_tables=25) lsh.fit(vectors) 
build_time = time.time() - start print(f"Build time: {build_time:.3f}s") print(f"Index stats: {lsh.get_stats()}") # Query print("\nQuerying...") start = time.time() neighbors = lsh.query(query, num_neighbors=10) query_time = time.time() - start print(f"Query time: {query_time:.6f}s") print("\nTop 10 neighbors (LSH):") for idx, sim in neighbors: print(f" Vector {idx}: similarity = {sim:.4f}") # Brute force comparison print("\nBrute force comparison...") start = time.time() query_norm = query / np.linalg.norm(query) norms = np.linalg.norm(vectors, axis=1, keepdims=True) normalized = vectors / norms all_sims = np.dot(normalized, query_norm) bf_time = time.time() - start top_10_bf = np.argsort(-all_sims)[:10] print(f"Brute force time: {bf_time:.6f}s") print("\nTop 10 neighbors (brute force):") for idx in top_10_bf: print(f" Vector {idx}: similarity = {all_sims[idx]:.4f}") # Calculate speedup print(f"\nSpeedup: {bf_time / query_time:.1f}x") # Check recall lsh_set = {idx for idx, _ in neighbors} bf_set = set(top_10_bf) recall = len(lsh_set & bf_set) / len(bf_set) print(f"Recall@10: {recall:.1%}") if __name__ == "__main__": demo()The raw collision probability from a single hyperplane may not discriminate well between near and far points. Amplification is the technique that transforms a weak LSH into a powerful one.
AND Amplification (k hyperplanes per table):
We require ALL $k$ hyperplanes to agree for a collision: $$p^{(k)} = p^k$$
For near points with $p_1 = 0.9$ and far points with $p_2 = 0.7$, taking $k = 10$ gives $p_1^{10} \approx 0.349$ and $p_2^{10} \approx 0.028$.
AND amplification reduces BOTH probabilities, but the far probability drops faster.
OR Amplification (L tables):
We succeed if we find a match in AT LEAST ONE table: $$P_1 = 1 - (1 - p_1^k)^L \quad \text{(near points)}$$ $$P_2 = 1 - (1 - p_2^k)^L \quad \text{(far points)}$$
With $k = 10$ and $L = 20$: $P_1 = 1 - (1 - 0.349)^{20} \approx 0.9998$ for near points, while $P_2 = 1 - (1 - 0.028)^{20} \approx 0.436$ for far points.
We now find near points with 99.98% probability while far points appear in only 43% of queries—a massive improvement over the base probabilities!
"""Analysis of AND-OR amplification for random hyperplane LSH.""" import numpy as npimport matplotlib.pyplot as plt def collision_prob(cos_sim: float) -> float: """ Collision probability for a single hyperplane. p = 1 - θ/π where θ = arccos(cos_sim) """ theta = np.arccos(np.clip(cos_sim, -1, 1)) return 1 - theta / np.pi def amplified_prob(p: float, k: int, L: int) -> float: """ Probability after AND-OR amplification. AND: require all k hyperplanes to agree -> p^k OR: succeed if any of L tables match -> 1 - (1-p^k)^L """ p_and = p ** k p_or = 1 - (1 - p_and) ** L return p_or def analyze_amplification(): """Analyze how k and L affect discrimination.""" # Various similarity levels similarities = [0.95, 0.9, 0.8, 0.7, 0.5, 0.3, 0.1, 0.0] base_probs = [collision_prob(s) for s in similarities] print("Base collision probabilities:") print("-" * 50) for s, p in zip(similarities, base_probs): print(f" cos_sim = {s:.2f}: p = {p:.4f}") # Example: Distinguish 0.9 similarity from 0.5 similarity p_near = collision_prob(0.9) # ~0.856 p_far = collision_prob(0.5) # ~0.667 print(f"\n\nTarget: Find vectors with sim >= 0.9, reject sim <= 0.5") print(f"Base probabilities: p_near = {p_near:.4f}, p_far = {p_far:.4f}") print(f"Base ratio: {p_near/p_far:.2f}") # Try different k and L combinations print("\n\nAmplification analysis (fixed memory budget):") print("-" * 70) print(f"{'k':>4} {'L':>4} {'P_near':>10} {'P_far':>10} {'Ratio':>10} {'FPs per 1000':>12}") print("-" * 70) for k in [5, 8, 10, 12, 15]: for L in [10, 20, 30, 50]: P_near = amplified_prob(p_near, k, L) P_far = amplified_prob(p_far, k, L) ratio = P_near / P_far if P_far > 0 else float('inf') fps = P_far * 1000 # Expected false positives per 1000 far points print(f"{k:>4} {L:>4} {P_near:>10.4f} {P_far:>10.4f} {ratio:>10.2f} {fps:>12.1f}") # Optimal for specific recall target print("\n\nFinding k, L for 99% recall with minimal false positives:") print("-" * 70) best_config = None best_fp_rate = float('inf') for k in range(4, 20): for L in range(5, 100): P_near = amplified_prob(p_near, k, L) P_far = amplified_prob(p_far, k, L) if P_near >= 0.99: # 99% recall if P_far < best_fp_rate: best_fp_rate = P_far best_config = (k, L, P_near, P_far) if best_config: k, L, P_near, P_far = best_config print(f"Best config for 99% recall:") print(f" k = {k}, L = {L}") print(f" Recall (P_near) = {P_near:.4f}") print(f" False positive rate (P_far) = {P_far:.4f}") print(f" Space: {L} tables × {k} hyperplanes = {L*k} total hyperplanes") if __name__ == "__main__": analyze_amplification()Increasing k (hash functions per table) reduces false positives but increases false negatives. Increasing L (number of tables) reduces false negatives but increases query time and memory. The optimal combination depends on your recall/precision requirements and computational budget.
The basic random hyperplane LSH can be enhanced in several ways for practical applications.
Multi-probe LSH Deep Dive:
Multi-probe is perhaps the most impactful optimization. The insight is that if a query slightly misses a bucket containing a true neighbor, the query's hash likely differs by only a few bits.
Given a query hash $h(q) = (b_1, b_2, \ldots, b_k)$, we probe the exact bucket first, then buckets whose keys differ from $h(q)$ in one bit, then in two bits, and so on, prioritizing flips of the bits whose projections fell closest to the hyperplane.
The probability that a near neighbor ends up in a "1-bit different" bucket can be computed exactly, allowing us to prioritize which alternate buckets to check.
Benefit: Multi-probe with $T$ probes per table can achieve recall similar to standard LSH with roughly $T$ times as many tables, while using about $1/T$ of the memory. Query time is comparable, since extra table lookups are replaced with extra probes.
"""Multi-probe extension for Random Hyperplane LSH.""" import numpy as npfrom typing import Tuple, Set, Listfrom itertools import combinations def generate_probes( base_hash: Tuple[int, ...], num_flips: int = 2) -> List[Tuple[int, ...]]: """ Generate all hash keys within Hamming distance of base_hash. Parameters: ----------- base_hash : tuple of ints The original hash key num_flips : int Maximum number of bits to flip Returns: -------- List of hash keys to probe """ probes = [base_hash] k = len(base_hash) # Generate all combinations of positions to flip for flips in range(1, num_flips + 1): for positions in combinations(range(k), flips): # Flip bits at these positions new_hash = list(base_hash) for pos in positions: new_hash[pos] = 1 - new_hash[pos] probes.append(tuple(new_hash)) return probes def multiprobe_query( query: np.ndarray, hyperplanes: np.ndarray, # Shape: (num_tables, k, dim) tables: List[dict], num_probes: int = 5 # Number of probes per table) -> Set[int]: """ Multi-probe query: check nearby buckets in addition to exact match. Returns set of candidate indices. """ candidates = set() num_tables = len(tables) for table_idx in range(num_tables): # Compute exact hash projections = np.dot(hyperplanes[table_idx], query) base_hash = tuple((projections >= 0).astype(int).tolist()) # Generate probes # For efficiency, we generate probes based on projection magnitudes # Bits with small |projection| are more likely to flip abs_proj = np.abs(projections) flip_priority = np.argsort(abs_proj) # Smallest first probes_checked = 0 for flips in range(len(base_hash) + 1): if probes_checked >= num_probes: break for positions in combinations(flip_priority[:flips+3], flips): if probes_checked >= num_probes: break # Create probed hash probed_hash = list(base_hash) for pos in positions: probed_hash[pos] = 1 - probed_hash[pos] probed_hash = tuple(probed_hash) # Get candidates from this bucket bucket = tables[table_idx].get(probed_hash, []) candidates.update(bucket) probes_checked += 1 return candidates # Example comparisondef compare_probe_strategies(): """Compare single-probe vs multi-probe LSH.""" np.random.seed(42) # Setup n, dim, k = 10000, 128, 12 # Generate random hyperplanes hyperplanes = np.random.randn(1, k, dim) # Single table hyperplanes /= np.linalg.norm(hyperplanes, axis=2, keepdims=True) # Generate data data = np.random.randn(n, dim) data /= np.linalg.norm(data, axis=1, keepdims=True) # Build table table = {} for i, vec in enumerate(data): proj = np.dot(hyperplanes[0], vec) h = tuple((proj >= 0).astype(int).tolist()) if h not in table: table[h] = [] table[h].append(i) # Query (find nearest to a specific vector) query = data[0] + 0.2 * np.random.randn(dim) query /= np.linalg.norm(query) # True 100-NN sims = np.dot(data, query) true_100nn = set(np.argsort(-sims)[:100]) # Single-probe proj = np.dot(hyperplanes[0], query) exact_hash = tuple((proj >= 0).astype(int).tolist()) single_candidates = set(table.get(exact_hash, [])) single_recall = len(single_candidates & true_100nn) / 100 # Multi-probe (5 probes) multi_candidates = multiprobe_query(query, hyperplanes, [table], num_probes=5) multi_recall = len(multi_candidates & true_100nn) / 100 # Multi-probe (20 probes) multi_candidates_20 = multiprobe_query(query, hyperplanes, [table], num_probes=20) multi_recall_20 = len(multi_candidates_20 & true_100nn) / 100 print("Recall comparison (single table, k=12):") print(f" Single-probe: {single_recall:.1%} ({len(single_candidates)} candidates)") print(f" Multi-probe (5): {multi_recall:.1%} 
({len(multi_candidates)} candidates)") print(f" Multi-probe (20): {multi_recall_20:.1%} ({len(multi_candidates_20)} candidates)") if __name__ == "__main__": compare_probe_strategies()When deploying random hyperplane LSH in production, several practical factors influence design decisions.
| Use Case | Dataset Size | k (hash bits) | L (tables) | Expected Recall |
|---|---|---|---|---|
| Real-time search | 1M vectors | 8-12 | 10-20 | 80-90% |
| High-recall retrieval | 1M vectors | 6-10 | 30-50 | 95-99% |
| Large-scale (10M+) | 10-100M vectors | 12-16 | 20-30 | 85-95% |
| Deduplication | Any | 10-14 | 5-10 | 90%+ for exact dupes |
| Candidate generation | Any | 6-8 | 50-100 | 99%+ |
Start with k = log₂(n) / 2 and L = 10-20. This gives buckets with O(√n) points on average. Then tune based on measured recall and query time on your actual data. Multi-probe with fewer tables often beats standard LSH with more tables.
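As a rough sketch of that rule of thumb (the helper and its constants are heuristics for illustration, not guarantees):

```python
import numpy as np
from typing import Tuple

def suggest_params(n: int) -> Tuple[int, int]:
    """Heuristic starting point: k ~ log2(n)/2 bits per table, L in the 10-20 range."""
    k = max(1, round(np.log2(n) / 2))
    return k, 15

print(suggest_params(1_000_000))  # (10, 15): ~2^10 buckets/table, ~sqrt(n) points per bucket
```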
We've thoroughly explored random hyperplane LSH for cosine similarity. The key insights: a random hyperplane through the origin separates two vectors with probability $\theta/\pi$, so a single-bit hash collides with probability $1 - \theta/\pi$ regardless of dimension; AND-OR amplification ($k$ hyperplanes per table, $L$ tables) sharpens this mildly selective hash into a strong near/far discriminator; and multi-probe querying recovers recall with fewer tables by also checking buckets that differ in a few low-confidence bits.
What's Next:
In the next page, we'll explore LSH for Euclidean distance, which uses a fundamentally different approach based on p-stable distributions. While random hyperplanes capture angular similarity, Euclidean LSH must capture absolute distance, requiring different mathematical machinery.
You now have a deep understanding of random hyperplane LSH for cosine similarity. This technique is the foundation for many real-world similarity search systems, from Google's semantic search to content recommendation engines.