The power of LSH lies not just in its algorithms but in its mathematical guarantees. Unlike heuristic approaches to similarity search, LSH provides provable bounds on recall and precision. These bounds flow directly from understanding collision probabilities—the mathematical relationship between similarity and the likelihood of hashing to the same bucket.
In this page, we dive deep into the probability theory underlying LSH. We'll derive collision probability formulas, understand amplification mechanics, analyze error bounds, and connect everything to the practical question: how do we set parameters to achieve a target recall?
This analysis is essential for anyone deploying LSH in production, where understanding these tradeoffs is not optional.
By the end of this page, you will derive collision probabilities for major LSH families, understand AND-OR amplification mathematically, analyze the S-curve behavior of amplified probabilities, compute false positive and false negative rates, and determine optimal (k, L) parameters for desired performance targets.
The foundation of LSH analysis is understanding how collision probability relates to similarity.
The Core Concept:
For an LSH hash function $h$ and two points $x, y$:
$$p(x, y) = \Pr[h(x) = h(y)]$$
This probability should be a monotonically increasing function of similarity (or equivalently, decreasing function of distance).
Collision Probability Functions:
Let $s = \text{sim}(x, y)$ be some similarity measure (or $d = \text{dist}(x, y)$ for distance). The collision probability function $p(s)$ characterizes the LSH family:
| LSH Family | Similarity/Distance | Collision Probability $p$ |
|---|---|---|
| MinHash | Jaccard $J$ | $p = J$ |
| Random Hyperplanes | Cosine $\cos(\theta)$ | $p = 1 - \theta/\pi$ |
| E2LSH | Euclidean $d$ | $p = p(d/w)$ (integral form, derived below) |
| Bit Sampling | Hamming $d_H$ | $p = 1 - d_H/d$ |
Interpreting Collision Probabilities:
A useful LSH family must have $p$ significantly higher for similar items than for dissimilar items. The gap between these probabilities drives the discrimination power.
Recall that an LSH family is (d₁, d₂, p₁, p₂)-sensitive if: (1) points at distance ≤ d₁ collide with probability ≥ p₁, and (2) points at distance ≥ d₂ collide with probability ≤ p₂. For the family to be useful, we need d₁ < d₂ and p₁ > p₂.
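As a concrete instance, anticipating the MinHash result derived below (where collision probability equals Jaccard similarity exactly): measuring distance as Jaccard distance $d = 1 - J$, MinHash is a $(0.2,\, 0.6,\, 0.8,\, 0.4)$-sensitive family, because pairs with $d \leq 0.2$ (i.e. $J \geq 0.8$) collide with probability $\geq 0.8$, while pairs with $d \geq 0.6$ collide with probability $\leq 0.4$.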
Let's rigorously derive collision probabilities for the major LSH families.
Random Hyperplane LSH:
For unit vectors $\mathbf{u}, \mathbf{v}$ with angle $\theta$ and random Gaussian vector $\mathbf{r}$:
$$h_{\mathbf{r}}(\mathbf{v}) = \text{sign}(\mathbf{r} \cdot \mathbf{v})$$
Theorem: $\Pr[h_{\mathbf{r}}(\mathbf{u}) = h_{\mathbf{r}}(\mathbf{v})] = 1 - \theta/\pi$
Proof: Let $X = \mathbf{r} \cdot \mathbf{u}$ and $Y = \mathbf{r} \cdot \mathbf{v}$. Since $\mathbf{r}$ has i.i.d. $\mathcal{N}(0,1)$ components, $(X, Y)$ is jointly Gaussian with $\mathrm{Var}(X) = \mathrm{Var}(Y) = 1$ and correlation $\mathbb{E}[XY] = \mathbf{u} \cdot \mathbf{v} = \cos(\theta)$.
For bivariate normal $(X, Y)$ with correlation $\rho$: $$\Pr[\text{sign}(X) = \text{sign}(Y)] = 1 - \frac{\cos^{-1}(\rho)}{\pi}$$
Substituting $\rho = \cos(\theta)$: $\Pr = 1 - \theta/\pi$. $\blacksquare$
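To make the theorem concrete, here is a small Monte Carlo check (an illustrative sketch; the dimension, seed, and trial count are arbitrary choices). It estimates the collision rate of two unit vectors at a 60° angle under random Gaussian hyperplanes and compares it with $1 - \theta/\pi \approx 0.667$.

```python
# Monte Carlo sanity check of Pr[h_r(u) = h_r(v)] = 1 - θ/π (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
dim, trials, theta = 64, 200_000, np.pi / 3   # 60° between u and v

# Two unit vectors with the desired angle.
u = np.zeros(dim); u[0] = 1.0
v = np.zeros(dim); v[0], v[1] = np.cos(theta), np.sin(theta)

# One random Gaussian hyperplane per trial (rows of r).
r = rng.standard_normal((trials, dim))
empirical = np.mean(np.sign(r @ u) == np.sign(r @ v))

print(f"Empirical collision rate: {empirical:.4f}")
print(f"Theoretical 1 - θ/π:      {1 - theta / np.pi:.4f}")   # ≈ 0.6667
```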
MinHash:
For sets $A, B$ and random permutation $\pi$:
$$h_\pi(S) = \min_{x \in S} \pi(x)$$
Theorem: $\Pr[h_\pi(A) = h_\pi(B)] = J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
Proof: Under a random permutation, every element in $A \cup B$ is equally likely to have the minimum $\pi$ value. The minimum is shared by both $A$ and $B$ iff it belongs to $A \cap B$.
$$\Pr[h_\pi(A) = h_\pi(B)] = \frac{|A \cap B|}{|A \cup B|} = J(A, B)$$
$\blacksquare$
This is remarkable: MinHash collision probability equals Jaccard similarity exactly, not just proportionally!
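The exactness is easy to verify empirically. The sketch below (illustrative; the universe size, sets, and trial count are arbitrary) estimates the MinHash collision frequency over random permutations and compares it with the true Jaccard similarity.

```python
# Empirical check that MinHash collision frequency matches Jaccard similarity
# (illustrative sketch with an arbitrary small universe).
import random

random.seed(0)
universe = list(range(100))
A = set(range(0, 60))          # |A ∩ B| = 30, |A ∪ B| = 90  ->  J = 1/3
B = set(range(30, 90))

trials = 50_000
collisions = 0
for _ in range(trials):
    perm = universe[:]
    random.shuffle(perm)                       # fresh random permutation
    rank = {x: i for i, x in enumerate(perm)}  # π(x) = position in permutation
    if min(A, key=rank.get) == min(B, key=rank.get):
        collisions += 1

print(f"Empirical collision rate: {collisions / trials:.4f}")
print(f"Jaccard similarity:       {len(A & B) / len(A | B):.4f}")   # 0.3333
```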
E2LSH (Euclidean LSH):
For vectors $\mathbf{u}, \mathbf{v}$ at Euclidean distance $d$, with random Gaussian $\mathbf{a}$, offset $b \sim \text{Unif}[0, w)$:
$$h(\mathbf{v}) = \left\lfloor \frac{\mathbf{a} \cdot \mathbf{v} + b}{w} \right\rfloor$$
Theorem: $\Pr[h(\mathbf{u}) = h(\mathbf{v})] = p(d/w)$ where:
$$p(c) = \int_0^1 \left[\Phi\left(\frac{\tau}{c}\right) - \Phi\left(\frac{\tau - 1}{c}\right)\right] d\tau$$
and $\Phi$ is the standard normal CDF.
Proof Sketch: By 2-stability, $\mathbf{a} \cdot (\mathbf{u} - \mathbf{v}) \sim d \cdot Z$ where $Z \sim \mathcal{N}(0,1)$. The collision condition requires both projections to land in the same width-$w$ bucket. Integrating over the uniform random offset $b$ gives the formula. $\blacksquare$
Closed form:
$$p(c) = 1 - 2\Phi(-1/c) - \frac{2c}{\sqrt{2\pi}}\left(1 - e^{-1/(2c^2)}\right)$$
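The closed form and the integral are equivalent; the short cross-check below (a sketch, assuming SciPy is available) evaluates both for a few values of $c = d/w$ and should print matching numbers.

```python
# Cross-check: integral definition vs. closed form of the E2LSH collision probability.
import numpy as np
from scipy import stats
from scipy.integrate import quad

def p_integral(c: float) -> float:
    """p(c) via numerical integration over the random bucket offset."""
    return quad(lambda t: stats.norm.cdf(t / c) - stats.norm.cdf((t - 1) / c), 0, 1)[0]

def p_closed_form(c: float) -> float:
    """p(c) = 1 - 2Φ(-1/c) - (2c/√(2π)) (1 - exp(-1/(2c²)))."""
    return (1 - 2 * stats.norm.cdf(-1 / c)
            - (2 * c / np.sqrt(2 * np.pi)) * (1 - np.exp(-1 / (2 * c ** 2))))

for c in [0.25, 0.5, 1.0, 2.0, 4.0]:
    print(f"c = {c:>4}:  integral = {p_integral(c):.6f}   closed form = {p_closed_form(c):.6f}")
```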
"""Collision probability functions for major LSH families.""" import numpy as npfrom scipy import statsfrom scipy.integrate import quad def random_hyperplane_collision_prob(cos_sim: float) -> float: """ Collision probability for random hyperplane LSH. p = 1 - arccos(cos_sim) / π """ theta = np.arccos(np.clip(cos_sim, -1, 1)) return 1 - theta / np.pi def minhash_collision_prob(jaccard: float) -> float: """ Collision probability for MinHash. p = J (Jaccard similarity) """ return jaccard def e2lsh_collision_prob(distance: float, bucket_width: float) -> float: """ Collision probability for E2LSH (Euclidean LSH). p = integral formula based on Gaussian projections """ if distance == 0: return 1.0 c = distance / bucket_width if c > 50: # Asymptotic approximation for large c return 2 / (c * np.sqrt(2 * np.pi)) def integrand(tau): return stats.norm.cdf(tau/c) - stats.norm.cdf((tau-1)/c) prob, _ = quad(integrand, 0, 1) return prob def bit_sampling_collision_prob( hamming_dist: int, dim: int) -> float: """ Collision probability for bit sampling LSH. p = 1 - d_H / d """ return 1 - hamming_dist / dim # Visualizationdef plot_collision_probabilities(): """Compare collision probability curves.""" import matplotlib.pyplot as plt # Cosine similarity range cos_sims = np.linspace(-1, 1, 101) hp_probs = [random_hyperplane_collision_prob(s) for s in cos_sims] # Jaccard similarity range jaccards = np.linspace(0, 1, 101) mh_probs = [minhash_collision_prob(j) for j in jaccards] # Euclidean distances (normalized by bucket width) c_values = np.linspace(0.01, 3, 101) e2_probs = [e2lsh_collision_prob(c, 1.0) for c in c_values] # Hamming distances (as fraction of dimension) hamming_fracs = np.linspace(0, 1, 101) bs_probs = [bit_sampling_collision_prob(int(f * 100), 100) for f in hamming_fracs] fig, axes = plt.subplots(2, 2, figsize=(12, 10)) axes[0, 0].plot(cos_sims, hp_probs, 'b-', linewidth=2) axes[0, 0].set_xlabel('Cosine Similarity') axes[0, 0].set_ylabel('Collision Probability') axes[0, 0].set_title('Random Hyperplane LSH') axes[0, 0].grid(True) axes[0, 1].plot(jaccards, mh_probs, 'g-', linewidth=2) axes[0, 1].set_xlabel('Jaccard Similarity') axes[0, 1].set_ylabel('Collision Probability') axes[0, 1].set_title('MinHash') axes[0, 1].grid(True) axes[1, 0].plot(c_values, e2_probs, 'r-', linewidth=2) axes[1, 0].set_xlabel('c = distance / bucket_width') axes[1, 0].set_ylabel('Collision Probability') axes[1, 0].set_title('E2LSH (Euclidean)') axes[1, 0].grid(True) axes[1, 1].plot(hamming_fracs, bs_probs, 'm-', linewidth=2) axes[1, 1].set_xlabel('Hamming Distance (fraction of dimension)') axes[1, 1].set_ylabel('Collision Probability') axes[1, 1].set_title('Bit Sampling LSH') axes[1, 1].grid(True) plt.tight_layout() plt.savefig('collision_probabilities.png', dpi=150) print("Saved collision_probabilities.png") if __name__ == "__main__": print("Collision Probability Examples") print("=" * 50) print("\nRandom Hyperplane LSH:") for cos_sim in [1.0, 0.9, 0.5, 0.0, -0.5]: p = random_hyperplane_collision_prob(cos_sim) print(f" cos(θ) = {cos_sim:>5.2f}: p = {p:.4f}") print("\nMinHash:") for jaccard in [1.0, 0.8, 0.5, 0.2, 0.0]: p = minhash_collision_prob(jaccard) print(f" J = {jaccard:>4.2f}: p = {p:.4f}") print("\nE2LSH (bucket width w = 4):") for dist in [0.5, 1.0, 2.0, 4.0, 8.0]: p = e2lsh_collision_prob(dist, 4.0) print(f" d = {dist:>4.1f}: p = {p:.4f}") print("\nBit Sampling (dim = 256):") for hamming in [0, 10, 25, 50, 100]: p = bit_sampling_collision_prob(hamming, 256) print(f" d_H = {hamming:>3d}: p = {p:.4f}")A single LSH 
hash function has limited discrimination power. Amplification combines multiple hash functions to create a sharper distinction between similar and dissimilar pairs.
AND Amplification (Combining k Functions):
Require ALL $k$ independent hash functions to agree:
$$\Pr[\text{all } k \text{ agree}] = p^k$$
For near points ($p = p_1$) and far points ($p = p_2$), both probabilities shrink exponentially in $k$, but because $p_1 > p_2$, the far-point probability $p_2^k$ shrinks much faster:
$$\text{Ratio: } \frac{p_1^k}{p_2^k} = \left(\frac{p_1}{p_2}\right)^k$$
The discrimination ratio grows exponentially in $k$.
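For example, with $p_1 = 0.8$, $p_2 = 0.5$, and $k = 10$:
$$\frac{p_1^{10}}{p_2^{10}} = \left(\frac{0.8}{0.5}\right)^{10} = 1.6^{10} \approx 110,$$
even though the individual probabilities have dropped to $p_1^{10} \approx 0.107$ and $p_2^{10} \approx 0.001$.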
The Trade-off:
AND amplification reduces BOTH probabilities. We might make false positives rare, but we also miss true positives. We need OR amplification to recover.
OR Amplification (Using L Tables):
For $L$ independent hash tables (each with $k$-function AND):
$$\Pr[\text{at least one table matches}] = 1 - (1 - p^k)^L$$
For near points: $P_1 = 1 - (1 - p_1^k)^L$; for far points: $P_2 = 1 - (1 - p_2^k)^L$.
The S-Curve Behavior:
The amplified probability function creates an S-curve that sharpens the transition from "similar" to "dissimilar":
$$P(p) = 1 - (1 - p^k)^L$$
This function increases monotonically from $P(0) = 0$ to $P(1) = 1$, and the transition from "rarely a candidate" to "almost always a candidate" becomes sharper as $k$ and $L$ grow, with the steep region centered on a tunable threshold.
Intuition: the AND stage pushes both collision probabilities down, hurting far pairs far more than near pairs; the OR stage lifts them back up, rescuing near pairs first. Together they trace an S-curve whose steep region can be placed between $p_2$ and $p_1$.
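The location of the 50% point follows directly from the formula (the code below also locates it numerically): setting $P(p^*) = 1/2$ gives
$$\left(1 - (p^*)^k\right)^L = \tfrac{1}{2} \quad\Rightarrow\quad p^* = \left(1 - 2^{-1/L}\right)^{1/k} \approx \left(\frac{\ln 2}{L}\right)^{1/k}.$$
Raising $k$ moves the threshold up (stricter matching); raising $L$ moves it back down.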
"""AND-OR Amplification Analysis for LSH.""" import numpy as npimport matplotlib.pyplot as plt def amplified_probability(p: float, k: int, L: int) -> float: """ Compute amplified collision probability. P(p) = 1 - (1 - p^k)^L Parameters: ----------- p : float Base collision probability k : int Number of hash functions per table (AND) L : int Number of tables (OR) Returns: -------- Amplified probability """ p_and = p ** k # After AND over k functions p_or = 1 - (1 - p_and) ** L # After OR over L tables return p_or def analyze_amplification(): """Analyze the effect of k and L on the probability curve.""" # Base probabilities (similarity range) p_range = np.linspace(0, 1, 1001) # Different (k, L) combinations configs = [ (1, 1, "k=1, L=1 (no amplification)"), (4, 4, "k=4, L=4"), (8, 16, "k=8, L=16"), (10, 50, "k=10, L=50"), (16, 100, "k=16, L=100"), ] print("S-Curve Analysis") print("=" * 60) for k, L, label in configs: amplified = [amplified_probability(p, k, L) for p in p_range] # Find the 50% threshold (where P crosses 0.5) threshold_idx = np.argmin(np.abs(np.array(amplified) - 0.5)) threshold_p = p_range[threshold_idx] # Compute slope at threshold (steepness) if threshold_idx > 0 and threshold_idx < len(p_range) - 1: slope = (amplified[threshold_idx + 1] - amplified[threshold_idx - 1]) / \ (p_range[threshold_idx + 1] - p_range[threshold_idx - 1]) else: slope = 0 print(f"{label}") print(f" 50% threshold: p = {threshold_p:.4f}") print(f" Slope at threshold: {slope:.2f}") print() # Specific example: detecting similarity 0.8 vs 0.5 print("\nExample: Distinguishing sim=0.8 from sim=0.5") print("-" * 50) p_near = 0.8 p_far = 0.5 print(f"{'k':>4} {'L':>4} {'P_near':>10} {'P_far':>10} {'Ratio':>10}") print("-" * 42) for k in [2, 4, 6, 8, 10]: for L in [5, 10, 20, 50]: P_near = amplified_probability(p_near, k, L) P_far = amplified_probability(p_far, k, L) ratio = P_near / P_far if P_far > 1e-10 else float('inf') if P_near > 0.9: # High recall print(f"{k:>4} {L:>4} {P_near:>10.4f} {P_far:>10.4f} {ratio:>10.1f}") def compute_optimal_parameters( p1: float, p2: float, target_recall: float = 0.95, max_fp_rate: float = 0.1): """ Find (k, L) that achieves target recall with bounded false positives. Parameters: ----------- p1 : float Base collision probability for near points p2 : float Base collision probability for far points target_recall : float Minimum probability of finding a near neighbor max_fp_rate : float Maximum probability of a far point being a candidate """ print(f"\nFinding optimal (k, L) for:") print(f" p1 (near) = {p1}, p2 (far) = {p2}") print(f" Target recall >= {target_recall}") print(f" Max FP rate <= {max_fp_rate}") print("-" * 50) best = None best_cost = float('inf') for k in range(1, 25): for L in range(1, 200): P1 = amplified_probability(p1, k, L) # Recall P2 = amplified_probability(p2, k, L) # FP rate if P1 >= target_recall and P2 <= max_fp_rate: # Cost = memory (L) + query time (approximately L) cost = L if cost < best_cost: best_cost = cost best = (k, L, P1, P2) if best: k, L, P1, P2 = best print(f"Optimal: k = {k}, L = {L}") print(f" Actual recall: {P1:.4f}") print(f" Actual FP rate: {P2:.4f}") print(f" Cost (num tables): {L}") else: print("No configuration found meeting constraints.") print("Try relaxing constraints or using different base probabilities.") def theoretical_rho(p1: float, p2: float) -> float: """ Compute the theoretical ρ parameter. ρ = ln(1/p1) / ln(1/p2) Query time is O(n^ρ), so smaller ρ is better. 
""" if p1 <= 0 or p1 >= 1 or p2 <= 0 or p2 >= 1: return float('nan') return np.log(1/p1) / np.log(1/p2) if __name__ == "__main__": analyze_amplification() # Example: Random hyperplane LSH # For 30° angle (cos=0.866) vs 60° angle (cos=0.5) p1 = 1 - (np.pi/6) / np.pi # 30° angle p2 = 1 - (np.pi/3) / np.pi # 60° angle print(f"\nRandom Hyperplane LSH Example:") print(f"Near points (30° angle): p1 = {p1:.4f}") print(f"Far points (60° angle): p2 = {p2:.4f}") print(f"Theoretical ρ = {theoretical_rho(p1, p2):.4f}") compute_optimal_parameters(p1, p2, target_recall=0.95, max_fp_rate=0.1)The ρ parameter is the single most important metric for comparing LSH families. It directly determines query complexity.
Definition:
$$\rho = \frac{\ln(1/p_1)}{\ln(1/p_2)}$$
where $p_1$ is the collision probability for near points and $p_2$ for far points.
Key Properties: when $p_1 > p_2$ we have $0 < \rho < 1$; smaller $\rho$ means faster queries; and $\rho$ depends only on the collision probabilities of the family (hence on the chosen distance thresholds), not on the dataset size.
The Query Time Theorem:
For $(d_1, d_2, p_1, p_2)$-sensitive LSH with $\rho = \ln(1/p_1)/\ln(1/p_2)$:
With $k = \log_{1/p_2}(n)$ and $L = n^\rho$ tables, we achieve:
- Query time: $O(n^\rho \cdot d)$ expected
- Success probability: $\geq 1 - 1/e$ per query
This is a remarkable result: for $\rho = 0.5$, query time is $O(\sqrt{n})$, a quadratic improvement over brute force!
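As an illustrative calculation with round numbers (not drawn from any particular dataset): take $n = 10^6$, $p_1 = 5/6$, $p_2 = 2/3$ (the 30°/60° hyperplane example used in the code above). Then
$$k = \log_{1/p_2} n = \frac{\ln 10^6}{\ln(3/2)} \approx 34, \qquad \rho = \frac{\ln(6/5)}{\ln(3/2)} \approx 0.45, \qquad L = n^{\rho} \approx 10^{2.7} \approx 500,$$
so a query probes roughly 500 tables rather than scanning a million points.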
| LSH Family | Approximation c | ρ (theoretical) | Query Time |
|---|---|---|---|
| Random Hyperplanes | c = θ₂/θ₁ | 1/c (for small angles) | O(n^{1/c}) |
| E2LSH | c = d₂/d₁ | 1/c + o(1) | O(n^{1/c}) |
| MinHash | N/A (exact J) | Depends on J gap | Varies |
| Bit Sampling | c = d_H2/d_H1 | 1/c | O(n^{1/c}) |
| Cross-Polytope | c (Euclidean on the unit sphere) | 1/(2c²-1) | O(n^{1/(2c²-1)}) |
Interpreting ρ: values near 0 mean near-constant lookup cost, $\rho = 0.5$ means $O(\sqrt{n})$ queries, and values near 1 mean the index barely improves on a linear scan.
The Space-Time Tradeoff:
Achieving the theoretical exponent requires $L = n^\rho$ tables, so memory grows as a power of the dataset size. For a billion-item database ($n = 10^9$) with $\rho = 0.5$, that means $L = \sqrt{10^9} \approx 31{,}600$ tables, each storing a bucket entry for every point — usually prohibitive.
In practice, we use fewer tables and a smaller $k$, accepting a higher effective query exponent in exchange for reduced memory.
LSH can make two types of errors:
False Negative Rate:
For a near point at distance $d \leq d_1$, the probability of NOT finding it is:
$$\text{FN} = (1 - p_1^k)^L$$
This is the probability that ALL $L$ tables fail to match (each matches with probability $p_1^k$).
Target: FN ≤ δ
For false negative rate at most $\delta$: $$(1 - p_1^k)^L \leq \delta$$ $$L \geq \frac{\ln(1/\delta)}{\ln(1/(1-p_1^k))} \approx \frac{\ln(1/\delta)}{p_1^k}$$
(using $\ln(1/(1-x)) \approx x$ for small $x$)
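For example, with illustrative values $p_1 = 0.8$, $k = 10$, and target miss rate $\delta = 0.01$:
$$L \;\geq\; \frac{\ln(1/0.01)}{0.8^{10}} \approx \frac{4.61}{0.107} \approx 43 \text{ tables}.$$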
False Positive Rate:
For a far point at distance $d \geq d_2$:
$$\text{FP} = 1 - (1 - p_2^k)^L$$
Expected False Positives:
If there are $N_{\text{far}}$ far points in the database:
$$\mathbb{E}[\text{num FPs}] = N_{\text{far}} \cdot \text{FP}$$
These false positives must be filtered by exact distance computation, contributing to query time.
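Continuing the same illustrative numbers ($k = 10$, $L = 43$) with $p_2 = 0.5$:
$$\text{FP} = 1 - (1 - 0.5^{10})^{43} \approx \frac{43}{1024} \approx 0.04,$$
so a database with $N_{\text{far}} = 10^6$ far points yields roughly $40{,}000$ expected false positives per query — a sign that this $k$ is too small (or this $p_2$ too large) for the workload, and that a larger $k$ with a correspondingly larger $L$ is needed.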
"""Error analysis for LSH: False positives and false negatives.""" import numpy as np def false_negative_rate(p1: float, k: int, L: int) -> float: """ Probability of missing a true near neighbor. FN = (1 - p1^k)^L """ p_and = p1 ** k fn = (1 - p_and) ** L return fn def false_positive_rate(p2: float, k: int, L: int) -> float: """ Probability of a far point being a candidate. FP = 1 - (1 - p2^k)^L """ p_and = p2 ** k fp = 1 - (1 - p_and) ** L return fp def expected_candidates( n: int, p2: float, k: int, L: int, n_near: int = 1) -> float: """ Expected number of candidates returned by LSH query. = (True positives) + (False positives) ≈ n_near (if recall is high) + (n - n_near) * FP_rate """ fp_rate = false_positive_rate(p2, k, L) return n_near + (n - n_near) * fp_rate def min_tables_for_recall( p1: float, k: int, target_recall: float) -> int: """ Minimum L for target recall. (1 - p1^k)^L <= 1 - target_recall """ p_and = p1 ** k max_fn = 1 - target_recall if p_and == 0: return float('inf') if max_fn >= 1: return 1 L = np.ceil(np.log(max_fn) / np.log(1 - p_and)) return int(L) def analyze_errors(): """Analyze error rates for different configurations.""" # Example: Random hyperplane LSH # Near: cos(θ) = 0.9 (about 25° angle) -> p1 = 1 - 25°/180° ≈ 0.86 # Far: cos(θ) = 0.5 (about 60° angle) -> p2 = 1 - 60°/180° = 0.67 p1 = 0.86 p2 = 0.67 n = 1_000_000 # 1 million points print("Error Analysis for LSH") print(f"p1 (near) = {p1}, p2 (far) = {p2}, n = {n:,}") print("=" * 70) print(f"\n{'k':>4} {'L':>6} {'FN Rate':>12} {'FP Rate':>12} {'Recall':>10} {'E[Candidates]':>15}") print("-" * 70) for k in [4, 6, 8, 10, 12]: for L in [10, 20, 50, 100]: fn = false_negative_rate(p1, k, L) fp = false_positive_rate(p2, k, L) recall = 1 - fn # Expected candidates (assuming 10 true near neighbors) exp_cand = expected_candidates(n, p2, k, L, n_near=10) print(f"{k:>4} {L:>6} {fn:>12.6f} {fp:>12.6f} {recall:>10.4f} {exp_cand:>15,.0f}") # Find configuration with target performance print("\n" + "=" * 70) print("Finding configurations for 99% recall with <1000 expected candidates") print("-" * 70) for k in range(4, 16): min_L = min_tables_for_recall(p1, k, 0.99) if min_L < 500: # Reasonable number of tables exp_cand = expected_candidates(n, p2, k, min_L, n_near=10) if exp_cand < 10000: # Reasonable number of candidates fn = false_negative_rate(p1, k, min_L) fp = false_positive_rate(p2, k, min_L) print(f"k={k}, L={min_L}: Recall={1-fn:.4f}, E[Candidates]={exp_cand:,.0f}") def query_cost_analysis(): """Analyze total query cost: hashing + candidate verification.""" print("\n\nTotal Query Cost Analysis") print("=" * 70) d = 256 # Dimension n = 1_000_000 p1, p2 = 0.85, 0.65 print(f"Dimension d = {d}, Dataset size n = {n:,}") print(f"p1 = {p1}, p2 = {p2}") print() # Cost model: # - Hash computation: O(k * L * d) for k projections per table # - Candidate verification: O(candidates * d) for k in [6, 8, 10, 12]: L = min_tables_for_recall(p1, k, 0.95) if L > 200: continue exp_cand = expected_candidates(n, p2, k, L, n_near=10) hash_cost = k * L * d verify_cost = exp_cand * d total_cost = hash_cost + verify_cost brute_force_cost = n * d speedup = brute_force_cost / total_cost print(f"k={k}, L={L}:") print(f" Hash cost: {hash_cost:,} ops") print(f" Verify cost: {verify_cost:,.0f} ops (E[candidates]={exp_cand:,.0f})") print(f" Total: {total_cost:,.0f} ops") print(f" Brute force: {brute_force_cost:,} ops") print(f" Speedup: {speedup:.1f}x") print() if __name__ == "__main__": analyze_errors() query_cost_analysis()The analysis so far has 
focused on expected values. To provide high-probability guarantees, we need concentration inequalities.
Chernoff Bounds for Hash Collisions:
Let $X$ be the number of tables in which a pair $(x, y)$ collides. If each table's hash functions are independent with collision probability $p^k$:
$$X \sim \text{Binomial}(L, p^k)$$
For near points ($p = p_1$):
We want $\Pr[X = 0] \leq \delta$ (low false negative probability):
$$\Pr[X = 0] = (1 - p_1^k)^L \leq \delta$$
Using $(1 - p_1^k)^L \leq e^{-L p_1^k}$, with $L = \frac{\ln(1/\delta)}{p_1^k}$ tables: $$\Pr[X = 0] \leq \delta$$
For far points ($p = p_2$):
We want $\Pr[X \geq 1]$ to be small, or at least bound $\mathbb{E}[X]$:
$$\mathbb{E}[X] = L \cdot p_2^k$$
By Markov: $\Pr[X \geq 1] \leq \mathbb{E}[X] = L \cdot p_2^k$
High-Probability Success Guarantee:
With appropriate $(k, L)$, we can guarantee:
With probability at least $1 - \delta$:
- At least one true near neighbor is found (if it exists)
- The number of candidates is at most $O(n^\rho \log(1/\delta))$
Proof Sketch: with the parameters of the query time theorem, $p_2^k = 1/n$, so the expected number of far-point candidates is at most $L \cdot n \cdot p_2^k = L = n^\rho$. Apply the false-negative bound to the near neighbor, Markov's inequality to the candidate count, and a union bound over the two failure events.
Multi-Point Queries:
When finding all $m$ near neighbors (not just one), the analysis extends by a union bound: driving each neighbor's miss probability below $\delta/m$ — i.e. using roughly $L \geq \ln(m/\delta)/p_1^k$ tables — ensures that all $m$ are recovered with probability at least $1 - \delta$.
LSH success is probabilistic. For a given query, there's always a small chance of missing true neighbors or getting too many false positives. In practice, we can boost confidence through query repetition or by using more tables. Systems should be designed with this randomness in mind.
Choosing the right $(k, L)$ is critical for LSH performance. Here's a systematic approach.
Step 1: Define Requirements
Specify the target recall $R$, the candidate budget $C$ (how many candidates per query you can afford to verify exactly), and the memory budget (maximum number of tables).
Step 2: Estimate $p_1$ and $p_2$
From your data or a model of the distance distribution: choose the similarity (or distance) thresholds that define "near" and "far" for your application, then map them through the family's collision probability function (e.g. $p = 1 - \theta/\pi$ for random hyperplanes, $p = J$ for MinHash).
Step 3: Compute Constraints
For recall $R$: $$L \geq \frac{\ln(1/(1-R))}{p_1^k}$$
For candidate budget $C$: $$n \cdot (1 - (1 - p_2^k)^L) \leq C$$
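A minimal sketch of the Step 3 constraints (the probabilities, dataset size, and budgets below are placeholders, not measured values): for each $k$ it computes the smallest $L$ meeting the recall constraint and the largest $L$ meeting the candidate budget, and reports the $k$ values for which a feasible range exists. The grid search in Step 4 then picks the cheapest feasible pair.

```python
# For each k, intersect the recall constraint (lower bound on L)
# with the candidate budget (upper bound on L). Placeholder parameters.
import math

p1, p2 = 0.85, 0.50               # estimated collision probabilities (Step 2)
n, R, C = 1_000_000, 0.95, 5_000  # dataset size, target recall, candidate budget

for k in range(2, 16):
    # Smallest L with recall >= R:   (1 - p1^k)^L <= 1 - R
    L_min = math.ceil(math.log(1 - R) / math.log(1 - p1 ** k))
    # Largest L with n * FP rate <= C:   n * (1 - (1 - p2^k)^L) <= C
    L_max = math.floor(math.log(1 - C / n) / math.log(1 - p2 ** k))
    if L_min <= L_max:
        print(f"k={k:>2}: feasible L in [{L_min}, {L_max}]")
```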
Step 4: Grid Search
Search over $(k, L)$ pairs satisfying constraints and minimizing total cost (memory + query time).
"""Systematic parameter optimization for LSH.""" import numpy as npfrom dataclasses import dataclassfrom typing import Optional, List, Tuple @dataclassclass LSHConfig: """Configuration for LSH with computed metrics.""" k: int L: int recall: float fp_rate: float expected_candidates: int memory_cost: float # Proportional to L query_cost: float # Hashing + verification def find_optimal_lsh_config( p1: float, p2: float, n: int, d: int, target_recall: float = 0.95, max_candidates: int = 10000, max_tables: int = 200, num_near_neighbors: int = 10) -> List[LSHConfig]: """ Find LSH configurations meeting constraints. Parameters: ----------- p1 : float Base collision probability for near points p2 : float Base collision probability for far points n : int Dataset size d : int Dimensionality target_recall : float Minimum recall requirement max_candidates : int Maximum acceptable expected candidates max_tables : int Maximum number of tables (memory constraint) num_near_neighbors : int Expected number of true near neighbors Returns: -------- List of valid configurations sorted by total cost """ valid_configs = [] for k in range(1, 25): for L in range(1, max_tables + 1): # Compute metrics p1_k = p1 ** k p2_k = p2 ** k # Recall (probability of finding a near neighbor) recall = 1 - (1 - p1_k) ** L # False positive rate fp_rate = 1 - (1 - p2_k) ** L # Expected candidates exp_candidates = num_near_neighbors + (n - num_near_neighbors) * fp_rate # Check constraints if recall < target_recall: continue if exp_candidates > max_candidates: continue # Compute costs hash_cost = k * L * d # k projections per table, L tables verify_cost = exp_candidates * d memory_cost = L # Proportional to number of tables total_query_cost = hash_cost + verify_cost config = LSHConfig( k=k, L=L, recall=recall, fp_rate=fp_rate, expected_candidates=int(exp_candidates), memory_cost=memory_cost, query_cost=total_query_cost ) valid_configs.append(config) # Sort by total query cost valid_configs.sort(key=lambda c: c.query_cost) return valid_configs def print_best_configs(configs: List[LSHConfig], top_n: int = 10): """Print the best configurations.""" print(f"\nTop {top_n} LSH Configurations:") print("-" * 80) print(f"{'k':>4} {'L':>6} {'Recall':>10} {'FP Rate':>12} {'E[Cand]':>10} {'Query Cost':>12}") print("-" * 80) for config in configs[:top_n]: print(f"{config.k:>4} {config.L:>6} {config.recall:>10.4f} " f"{config.fp_rate:>12.6f} {config.expected_candidates:>10,} " f"{config.query_cost:>12,.0f}") def visualize_parameter_space( p1: float, p2: float, n: int): """Visualize recall and FP rate across (k, L) space.""" print("\nParameter Space Analysis") print("=" * 60) print(f"\nRecall (1 - FN rate) for p1 = {p1}:") print(" L=", end="") for L in [5, 10, 20, 50, 100]: print(f"{L:>8}", end="") print() for k in [4, 6, 8, 10, 12]: print(f"k={k:>2}", end="") for L in [5, 10, 20, 50, 100]: recall = 1 - (1 - p1 ** k) ** L print(f"{recall:>8.3f}", end="") print() print(f"\nE[Candidates] for n={n:,}, p2 = {p2}:") print(" L=", end="") for L in [5, 10, 20, 50, 100]: print(f"{L:>10}", end="") print() for k in [4, 6, 8, 10, 12]: print(f"k={k:>2}", end="") for L in [5, 10, 20, 50, 100]: fp_rate = 1 - (1 - p2 ** k) ** L exp_cand = n * fp_rate if exp_cand < 1000: print(f"{exp_cand:>10.0f}", end="") elif exp_cand < 1000000: print(f"{exp_cand/1000:>9.0f}K", end="") else: print(f"{exp_cand/1000000:>9.1f}M", end="") print() if __name__ == "__main__": # Example: Random hyperplane LSH for near/far classification # Near points: ~22° angle (cos = 0.93) -> p1 ≈ 0.88 # 
Far points: ~60° angle (cos = 0.5) -> p2 ≈ 0.67 p1 = 0.88 p2 = 0.67 n = 1_000_000 d = 256 print("LSH Parameter Optimization") print(f"Dataset: n={n:,} vectors, d={d}") print(f"Collision probabilities: p1={p1}, p2={p2}") # Visualize parameter space visualize_parameter_space(p1, p2, n) # Find optimal configurations configs = find_optimal_lsh_config( p1=p1, p2=p2, n=n, d=d, target_recall=0.95, max_candidates=5000, max_tables=150 ) if configs: print_best_configs(configs) else: print("\nNo configurations meet all constraints.") print("Try relaxing target_recall or max_candidates.")We've developed a comprehensive mathematical understanding of LSH collision probabilities. Let's consolidate the key insights:
What's Next:
In the final page of this module, we'll cover LSH Performance Tuning—practical techniques for implementing, benchmarking, and optimizing LSH in production systems. We'll discuss multi-probe strategies, data-dependent tuning, and integration with modern vector databases.
You now have deep understanding of the probability theory underlying LSH. This mathematical foundation is essential for deploying LSH with known performance characteristics and choosing the right parameters for your specific use case.