The power of LSH lies not just in its algorithms but in its mathematical guarantees. Unlike heuristic approaches to similarity search, LSH provides provable bounds on recall and precision. These bounds flow directly from understanding collision probabilities—the mathematical relationship between similarity and the likelihood of hashing to the same bucket.
In this page, we dive deep into the probability theory underlying LSH. We'll derive collision probability formulas, understand amplification mechanics, analyze error bounds, and connect everything to the practical question: how do we set parameters to achieve a target recall?
This analysis is essential for anyone deploying LSH in production, where understanding these tradeoffs is not optional.
By the end of this page, you will derive collision probabilities for major LSH families, understand AND-OR amplification mathematically, analyze the S-curve behavior of amplified probabilities, compute false positive and false negative rates, and determine optimal (k, L) parameters for desired performance targets.
The foundation of LSH analysis is understanding how collision probability relates to similarity.
The Core Concept:
For an LSH hash function $h$ and two points $x, y$:
$$p(x, y) = \Pr[h(x) = h(y)]$$
This probability should be a monotonically increasing function of similarity (or equivalently, decreasing function of distance).
Collision Probability Functions:
Let $s = \text{sim}(x, y)$ be some similarity measure (or $d = \text{dist}(x, y)$ for distance). The collision probability function $p(s)$ characterizes the LSH family:
| LSH Family | Similarity/Distance | Collision Probability $p$ |
|---|---|---|
| MinHash | Jaccard $J$ | $p = J$ |
| Random Hyperplanes | Cosine $\cos(\theta)$ | $p = 1 - \theta/\pi$ |
| E2LSH | Euclidean $d$ | $p = p(d/w)$ (integral form, derived below) |
| Bit Sampling | Hamming $d_H$ | $p = 1 - d_H/d$ |
Interpreting Collision Probabilities:
A useful LSH family must have $p$ significantly higher for similar items than for dissimilar items. The gap between these probabilities drives the discrimination power.
Recall that an LSH family is (d₁, d₂, p₁, p₂)-sensitive if: (1) points at distance ≤ d₁ collide with probability ≥ p₁, and (2) points at distance ≥ d₂ collide with probability ≤ p₂. For the family to be useful, we need d₁ < d₂ and p₁ > p₂.
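As a concrete instance, anticipating the MinHash result derived below (where collision probability equals Jaccard similarity exactly): measuring distance as Jaccard distance $d = 1 - J$, MinHash is a $(0.2,\, 0.6,\, 0.8,\, 0.4)$-sensitive family, because pairs with $d \leq 0.2$ (i.e. $J \geq 0.8$) collide with probability $\geq 0.8$, while pairs with $d \geq 0.6$ collide with probability $\leq 0.4$.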
Let's rigorously derive collision probabilities for the major LSH families.
Random Hyperplane LSH:
For unit vectors $\mathbf{u}, \mathbf{v}$ with angle $\theta$ and random Gaussian vector $\mathbf{r}$:
$$h_{\mathbf{r}}(\mathbf{v}) = \text{sign}(\mathbf{r} \cdot \mathbf{v})$$
Theorem: $\Pr[h_{\mathbf{r}}(\mathbf{u}) = h_{\mathbf{r}}(\mathbf{v})] = 1 - \theta/\pi$
Proof: Let $X = \mathbf{r} \cdot \mathbf{u}$ and $Y = \mathbf{r} \cdot \mathbf{v}$. Since $\mathbf{r}$ has i.i.d. $\mathcal{N}(0,1)$ components, $(X, Y)$ is jointly Gaussian with $\mathrm{Var}(X) = \mathrm{Var}(Y) = 1$ and correlation $\mathbb{E}[XY] = \mathbf{u} \cdot \mathbf{v} = \cos(\theta)$.
For bivariate normal $(X, Y)$ with correlation $\rho$: $$\Pr[\text{sign}(X) = \text{sign}(Y)] = 1 - \frac{\cos^{-1}(\rho)}{\pi}$$
Substituting $\rho = \cos(\theta)$: $\Pr = 1 - \theta/\pi$. $\blacksquare$
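To make the theorem concrete, here is a small Monte Carlo check (an illustrative sketch; the dimension, seed, and trial count are arbitrary choices). It estimates the collision rate of two unit vectors at a 60° angle under random Gaussian hyperplanes and compares it with $1 - \theta/\pi \approx 0.667$.

```python
# Monte Carlo sanity check of Pr[h_r(u) = h_r(v)] = 1 - θ/π (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
dim, trials, theta = 64, 200_000, np.pi / 3   # 60° between u and v

# Two unit vectors with the desired angle.
u = np.zeros(dim); u[0] = 1.0
v = np.zeros(dim); v[0], v[1] = np.cos(theta), np.sin(theta)

# One random Gaussian hyperplane per trial (rows of r).
r = rng.standard_normal((trials, dim))
empirical = np.mean(np.sign(r @ u) == np.sign(r @ v))

print(f"Empirical collision rate: {empirical:.4f}")
print(f"Theoretical 1 - θ/π:      {1 - theta / np.pi:.4f}")   # ≈ 0.6667
```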
MinHash:
For sets $A, B$ and random permutation $\pi$:
$$h_\pi(S) = \min_{x \in S} \pi(x)$$
Theorem: $\Pr[h_\pi(A) = h_\pi(B)] = J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
Proof: Under a random permutation, every element in $A \cup B$ is equally likely to have the minimum $\pi$ value. The minimum is shared by both $A$ and $B$ iff it belongs to $A \cap B$.
$$\Pr[h_\pi(A) = h_\pi(B)] = \frac{|A \cap B|}{|A \cup B|} = J(A, B)$$
$\blacksquare$
This is remarkable: MinHash collision probability equals Jaccard similarity exactly, not just proportionally!
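The exactness is easy to verify empirically. The sketch below (illustrative; the universe size, sets, and trial count are arbitrary) estimates the MinHash collision frequency over random permutations and compares it with the true Jaccard similarity.

```python
# Empirical check that MinHash collision frequency matches Jaccard similarity
# (illustrative sketch with an arbitrary small universe).
import random

random.seed(0)
universe = list(range(100))
A = set(range(0, 60))          # |A ∩ B| = 30, |A ∪ B| = 90  ->  J = 1/3
B = set(range(30, 90))

trials = 50_000
collisions = 0
for _ in range(trials):
    perm = universe[:]
    random.shuffle(perm)                       # fresh random permutation
    rank = {x: i for i, x in enumerate(perm)}  # π(x) = position in permutation
    if min(A, key=rank.get) == min(B, key=rank.get):
        collisions += 1

print(f"Empirical collision rate: {collisions / trials:.4f}")
print(f"Jaccard similarity:       {len(A & B) / len(A | B):.4f}")   # 0.3333
```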
E2LSH (Euclidean LSH):
For vectors $\mathbf{u}, \mathbf{v}$ at Euclidean distance $d$, with random Gaussian $\mathbf{a}$, offset $b \sim \text{Unif}[0, w)$:
$$h(\mathbf{v}) = \left\lfloor \frac{\mathbf{a} \cdot \mathbf{v} + b}{w} \right\rfloor$$
Theorem: $\Pr[h(\mathbf{u}) = h(\mathbf{v})] = p(d/w)$ where:
$$p(c) = \int_0^1 \left[\Phi\left(\frac{\tau}{c}\right) - \Phi\left(\frac{\tau - 1}{c}\right)\right] d\tau$$
and $\Phi$ is the standard normal CDF.
Proof Sketch: By 2-stability, $\mathbf{a} \cdot (\mathbf{u} - \mathbf{v}) \sim d \cdot Z$ where $Z \sim \mathcal{N}(0,1)$. The collision condition requires both projections to land in the same width-$w$ bucket. Integrating over the uniform random offset $b$ gives the formula. $\blacksquare$
Closed form:
$$p(c) = 1 - 2\Phi(-1/c) - \frac{2c}{\sqrt{2\pi}}\left(1 - e^{-1/(2c^2)}\right)$$
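The closed form and the integral are equivalent; the short cross-check below (a sketch, assuming SciPy is available) evaluates both for a few values of $c = d/w$ and should print matching numbers.

```python
# Cross-check: integral definition vs. closed form of the E2LSH collision probability.
import numpy as np
from scipy import stats
from scipy.integrate import quad

def p_integral(c: float) -> float:
    """p(c) via numerical integration over the random bucket offset."""
    return quad(lambda t: stats.norm.cdf(t / c) - stats.norm.cdf((t - 1) / c), 0, 1)[0]

def p_closed_form(c: float) -> float:
    """p(c) = 1 - 2Φ(-1/c) - (2c/√(2π)) (1 - exp(-1/(2c²)))."""
    return (1 - 2 * stats.norm.cdf(-1 / c)
            - (2 * c / np.sqrt(2 * np.pi)) * (1 - np.exp(-1 / (2 * c ** 2))))

for c in [0.25, 0.5, 1.0, 2.0, 4.0]:
    print(f"c = {c:>4}:  integral = {p_integral(c):.6f}   closed form = {p_closed_form(c):.6f}")
```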
"""Collision probability functions for major LSH families.""" import numpy as npfrom scipy import statsfrom scipy.integrate import quad def random_hyperplane_collision_prob(cos_sim: float) -> float: """ Collision probability for random hyperplane LSH. p = 1 - arccos(cos_sim) / π """ theta = np.arccos(np.clip(cos_sim, -1, 1)) return 1 - theta / np.pi def minhash_collision_prob(jaccard: float) -> float: """ Collision probability for MinHash. p = J (Jaccard similarity) """ return jaccard def e2lsh_collision_prob(distance: float, bucket_width: float) -> float: """ Collision probability for E2LSH (Euclidean LSH). p = integral formula based on Gaussian projections """ if distance == 0: return 1.0 c = distance / bucket_width if c > 50: # Asymptotic approximation for large c return 2 / (c * np.sqrt(2 * np.pi)) def integrand(tau): return stats.norm.cdf(tau/c) - stats.norm.cdf((tau-1)/c) prob, _ = quad(integrand, 0, 1) return prob def bit_sampling_collision_prob( hamming_dist: int, dim: int) -> float: """ Collision probability for bit sampling LSH. p = 1 - d_H / d """ return 1 - hamming_dist / dim # Visualizationdef plot_collision_probabilities(): """Compare collision probability curves.""" import matplotlib.pyplot as plt # Cosine similarity range cos_sims = np.linspace(-1, 1, 101) hp_probs = [random_hyperplane_collision_prob(s) for s in cos_sims] # Jaccard similarity range jaccards = np.linspace(0, 1, 101) mh_probs = [minhash_collision_prob(j) for j in jaccards] # Euclidean distances (normalized by bucket width) c_values = np.linspace(0.01, 3, 101) e2_probs = [e2lsh_collision_prob(c, 1.0) for c in c_values] # Hamming distances (as fraction of dimension) hamming_fracs = np.linspace(0, 1, 101) bs_probs = [bit_sampling_collision_prob(int(f * 100), 100) for f in hamming_fracs] fig, axes = plt.subplots(2, 2, figsize=(12, 10)) axes[0, 0].plot(cos_sims, hp_probs, 'b-', linewidth=2) axes[0, 0].set_xlabel('Cosine Similarity') axes[0, 0].set_ylabel('Collision Probability') axes[0, 0].set_title('Random Hyperplane LSH') axes[0, 0].grid(True) axes[0, 1].plot(jaccards, mh_probs, 'g-', linewidth=2) axes[0, 1].set_xlabel('Jaccard Similarity') axes[0, 1].set_ylabel('Collision Probability') axes[0, 1].set_title('MinHash') axes[0, 1].grid(True) axes[1, 0].plot(c_values, e2_probs, 'r-', linewidth=2) axes[1, 0].set_xlabel('c = distance / bucket_width') axes[1, 0].set_ylabel('Collision Probability') axes[1, 0].set_title('E2LSH (Euclidean)') axes[1, 0].grid(True) axes[1, 1].plot(hamming_fracs, bs_probs, 'm-', linewidth=2) axes[1, 1].set_xlabel('Hamming Distance (fraction of dimension)') axes[1, 1].set_ylabel('Collision Probability') axes[1, 1].set_title('Bit Sampling LSH') axes[1, 1].grid(True) plt.tight_layout() plt.savefig('collision_probabilities.png', dpi=150) print("Saved collision_probabilities.png") if __name__ == "__main__": print("Collision Probability Examples") print("=" * 50) print("\nRandom Hyperplane LSH:") for cos_sim in [1.0, 0.9, 0.5, 0.0, -0.5]: p = random_hyperplane_collision_prob(cos_sim) print(f" cos(θ) = {cos_sim:>5.2f}: p = {p:.4f}") print("\nMinHash:") for jaccard in [1.0, 0.8, 0.5, 0.2, 0.0]: p = minhash_collision_prob(jaccard) print(f" J = {jaccard:>4.2f}: p = {p:.4f}") print("\nE2LSH (bucket width w = 4):") for dist in [0.5, 1.0, 2.0, 4.0, 8.0]: p = e2lsh_collision_prob(dist, 4.0) print(f" d = {dist:>4.1f}: p = {p:.4f}") print("\nBit Sampling (dim = 256):") for hamming in [0, 10, 25, 50, 100]: p = bit_sampling_collision_prob(hamming, 256) print(f" d_H = {hamming:>3d}: p = {p:.4f}")A single LSH 
hash function has limited discrimination power. Amplification combines multiple hash functions to create a sharper distinction between similar and dissimilar pairs.
AND Amplification (Combining k Functions):
Require ALL $k$ independent hash functions to agree:
$$\Pr[\text{all } k \text{ agree}] = p^k$$
For near points ($p = p_1$) and far points ($p = p_2$), both probabilities shrink exponentially in $k$, but because $p_1 > p_2$, the far-point probability $p_2^k$ shrinks much faster:
$$\text{Ratio: } \frac{p_1^k}{p_2^k} = \left(\frac{p_1}{p_2}\right)^k$$
The discrimination ratio grows exponentially in $k$.
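For example, with $p_1 = 0.8$, $p_2 = 0.5$, and $k = 10$:
$$\frac{p_1^{10}}{p_2^{10}} = \left(\frac{0.8}{0.5}\right)^{10} = 1.6^{10} \approx 110,$$
even though the individual probabilities have dropped to $p_1^{10} \approx 0.107$ and $p_2^{10} \approx 0.001$.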
The Trade-off:
AND amplification reduces BOTH probabilities. We might make false positives rare, but we also miss true positives. We need OR amplification to recover.
OR Amplification (Using L Tables):
For $L$ independent hash tables (each with $k$-function AND):
$$\Pr[\text{at least one table matches}] = 1 - (1 - p^k)^L$$
For near points: $P_1 = 1 - (1 - p_1^k)^L$; for far points: $P_2 = 1 - (1 - p_2^k)^L$.
The S-Curve Behavior:
The amplified probability function creates an S-curve that sharpens the transition from "similar" to "dissimilar":
$$P(p) = 1 - (1 - p^k)^L$$
This function increases monotonically from $P(0) = 0$ to $P(1) = 1$, and the transition from "rarely a candidate" to "almost always a candidate" becomes sharper as $k$ and $L$ grow, with the steep region centered on a tunable threshold.
Intuition: the AND stage pushes both collision probabilities down, hurting far pairs far more than near pairs; the OR stage lifts them back up, rescuing near pairs first. Together they trace an S-curve whose steep region can be placed between $p_2$ and $p_1$.
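The location of the 50% point follows directly from the formula (the code below also locates it numerically): setting $P(p^*) = 1/2$ gives
$$\left(1 - (p^*)^k\right)^L = \tfrac{1}{2} \quad\Rightarrow\quad p^* = \left(1 - 2^{-1/L}\right)^{1/k} \approx \left(\frac{\ln 2}{L}\right)^{1/k}.$$
Raising $k$ moves the threshold up (stricter matching); raising $L$ moves it back down.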
"""AND-OR Amplification Analysis for LSH.""" import numpy as npimport matplotlib.pyplot as plt def amplified_probability(p: float, k: int, L: int) -> float: """ Compute amplified collision probability. P(p) = 1 - (1 - p^k)^L Parameters: ----------- p : float Base collision probability k : int Number of hash functions per table (AND) L : int Number of tables (OR) Returns: -------- Amplified probability """ p_and = p ** k # After AND over k functions p_or = 1 - (1 - p_and) ** L # After OR over L tables return p_or def analyze_amplification(): """Analyze the effect of k and L on the probability curve.""" # Base probabilities (similarity range) p_range = np.linspace(0, 1, 1001) # Different (k, L) combinations configs = [ (1, 1, "k=1, L=1 (no amplification)"), (4, 4, "k=4, L=4"), (8, 16, "k=8, L=16"), (10, 50, "k=10, L=50"), (16, 100, "k=16, L=100"), ] print("S-Curve Analysis") print("=" * 60) for k, L, label in configs: amplified = [amplified_probability(p, k, L) for p in p_range] # Find the 50% threshold (where P crosses 0.5) threshold_idx = np.argmin(np.abs(np.array(amplified) - 0.5)) threshold_p = p_range[threshold_idx] # Compute slope at threshold (steepness) if threshold_idx > 0 and threshold_idx < len(p_range) - 1: slope = (amplified[threshold_idx + 1] - amplified[threshold_idx - 1]) / \ (p_range[threshold_idx + 1] - p_range[threshold_idx - 1]) else: slope = 0 print(f"{label}") print(f" 50% threshold: p = {threshold_p:.4f}") print(f" Slope at threshold: {slope:.2f}") print() # Specific example: detecting similarity 0.8 vs 0.5 print("\nExample: Distinguishing sim=0.8 from sim=0.5") print("-" * 50) p_near = 0.8 p_far = 0.5 print(f"{'k':>4} {'L':>4} {'P_near':>10} {'P_far':>10} {'Ratio':>10}") print("-" * 42) for k in [2, 4, 6, 8, 10]: for L in [5, 10, 20, 50]: P_near = amplified_probability(p_near, k, L) P_far = amplified_probability(p_far, k, L) ratio = P_near / P_far if P_far > 1e-10 else float('inf') if P_near > 0.9: # High recall print(f"{k:>4} {L:>4} {P_near:>10.4f} {P_far:>10.4f} {ratio:>10.1f}") def compute_optimal_parameters( p1: float, p2: float, target_recall: float = 0.95, max_fp_rate: float = 0.1): """ Find (k, L) that achieves target recall with bounded false positives. Parameters: ----------- p1 : float Base collision probability for near points p2 : float Base collision probability for far points target_recall : float Minimum probability of finding a near neighbor max_fp_rate : float Maximum probability of a far point being a candidate """ print(f"\nFinding optimal (k, L) for:") print(f" p1 (near) = {p1}, p2 (far) = {p2}") print(f" Target recall >= {target_recall}") print(f" Max FP rate <= {max_fp_rate}") print("-" * 50) best = None best_cost = float('inf') for k in range(1, 25): for L in range(1, 200): P1 = amplified_probability(p1, k, L) # Recall P2 = amplified_probability(p2, k, L) # FP rate if P1 >= target_recall and P2 <= max_fp_rate: # Cost = memory (L) + query time (approximately L) cost = L if cost < best_cost: best_cost = cost best = (k, L, P1, P2) if best: k, L, P1, P2 = best print(f"Optimal: k = {k}, L = {L}") print(f" Actual recall: {P1:.4f}") print(f" Actual FP rate: {P2:.4f}") print(f" Cost (num tables): {L}") else: print("No configuration found meeting constraints.") print("Try relaxing constraints or using different base probabilities.") def theoretical_rho(p1: float, p2: float) -> float: """ Compute the theoretical ρ parameter. ρ = ln(1/p1) / ln(1/p2) Query time is O(n^ρ), so smaller ρ is better. 
""" if p1 <= 0 or p1 >= 1 or p2 <= 0 or p2 >= 1: return float('nan') return np.log(1/p1) / np.log(1/p2) if __name__ == "__main__": analyze_amplification() # Example: Random hyperplane LSH # For 30° angle (cos=0.866) vs 60° angle (cos=0.5) p1 = 1 - (np.pi/6) / np.pi # 30° angle p2 = 1 - (np.pi/3) / np.pi # 60° angle print(f"\nRandom Hyperplane LSH Example:") print(f"Near points (30° angle): p1 = {p1:.4f}") print(f"Far points (60° angle): p2 = {p2:.4f}") print(f"Theoretical ρ = {theoretical_rho(p1, p2):.4f}") compute_optimal_parameters(p1, p2, target_recall=0.95, max_fp_rate=0.1)The ρ parameter is the single most important metric for comparing LSH families. It directly determines query complexity.
Definition:
$$\rho = \frac{\ln(1/p_1)}{\ln(1/p_2)}$$
where $p_1$ is the collision probability for near points and $p_2$ for far points.
Key Properties: when $p_1 > p_2$ we have $0 < \rho < 1$; smaller $\rho$ means faster queries; and $\rho$ depends only on the collision probabilities of the family (hence on the chosen distance thresholds), not on the dataset size.
The Query Time Theorem:
For $(d_1, d_2, p_1, p_2)$-sensitive LSH with $\rho = \ln(1/p_1)/\ln(1/p_2)$:
With $k = \log_{1/p_2}(n)$ and $L = n^\rho$ tables, we achieve:
- Query time: $O(n^\rho \cdot d)$ expected
- Success probability: $\geq 1 - 1/e$ per query
This is a remarkable result: for $\rho = 0.5$, query time is $O(\sqrt{n})$, a quadratic improvement over brute force!
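As an illustrative calculation with round numbers (not drawn from any particular dataset): take $n = 10^6$, $p_1 = 5/6$, $p_2 = 2/3$ (the 30°/60° hyperplane example used in the code above). Then
$$k = \log_{1/p_2} n = \frac{\ln 10^6}{\ln(3/2)} \approx 34, \qquad \rho = \frac{\ln(6/5)}{\ln(3/2)} \approx 0.45, \qquad L = n^{\rho} \approx 10^{2.7} \approx 500,$$
so a query probes roughly 500 tables rather than scanning a million points.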
| LSH Family | Approximation c | ρ (theoretical) | Query Time |
|---|---|---|---|
| Random Hyperplanes | c = θ₂/θ₁ | 1/c (for small angles) | O(n^{1/c}) |
| E2LSH | c = d₂/d₁ | 1/c + o(1) | O(n^{1/c}) |
| MinHash | N/A (exact J) | Depends on J gap | Varies |
| Bit Sampling | c = d_H2/d_H1 | 1/c | O(n^{1/c}) |
| Cross-Polytope | c (Euclidean on the unit sphere) | 1/(2c²-1) | O(n^{1/(2c²-1)}) |
Interpreting ρ: values near 0 mean near-constant lookup cost, $\rho = 0.5$ means $O(\sqrt{n})$ queries, and values near 1 mean the index barely improves on a linear scan.
The Space-Time Tradeoff:
Achieving the theoretical exponent requires $L = n^\rho$ tables, so memory grows as a power of the dataset size. For a billion-item database ($n = 10^9$) with $\rho = 0.5$, that means $L = \sqrt{10^9} \approx 31{,}600$ tables, each storing a bucket entry for every point — usually prohibitive.
In practice, we use fewer tables and a smaller $k$, accepting a higher effective query exponent in exchange for reduced memory.
LSH can make two types of errors:
False Negative Rate:
For a near point at distance $d \leq d_1$, the probability of NOT finding it is:
$$\text{FN} = (1 - p_1^k)^L$$
This is the probability that ALL $L$ tables fail to match (each matches with probability $p_1^k$).
Target: FN ≤ δ
For false negative rate at most $\delta$: $$(1 - p_1^k)^L \leq \delta$$ $$L \geq \frac{\ln(1/\delta)}{\ln(1/(1-p_1^k))} \approx \frac{\ln(1/\delta)}{p_1^k}$$
(using $\ln(1/(1-x)) \approx x$ for small $x$)
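For example, with illustrative values $p_1 = 0.8$, $k = 10$, and target miss rate $\delta = 0.01$:
$$L \;\geq\; \frac{\ln(1/0.01)}{0.8^{10}} \approx \frac{4.61}{0.107} \approx 43 \text{ tables}.$$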
False Positive Rate:
For a far point at distance $d \geq d_2$:
$$\text{FP} = 1 - (1 - p_2^k)^L$$
Expected False Positives:
If there are $N_{\text{far}}$ far points in the database:
$$\mathbb{E}[\text{num FPs}] = N_{\text{far}} \cdot \text{FP}$$
These false positives must be filtered by exact distance computation, contributing to query time.
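Continuing the same illustrative numbers ($k = 10$, $L = 43$) with $p_2 = 0.5$:
$$\text{FP} = 1 - (1 - 0.5^{10})^{43} \approx \frac{43}{1024} \approx 0.04,$$
so a database with $N_{\text{far}} = 10^6$ far points yields roughly $40{,}000$ expected false positives per query — a sign that this $k$ is too small (or this $p_2$ too large) for the workload, and that a larger $k$ with a correspondingly larger $L$ is needed.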
"""Error analysis for LSH: False positives and false negatives.""" import numpy as np def false_negative_rate(p1: float, k: int, L: int) -> float: """ Probability of missing a true near neighbor. FN = (1 - p1^k)^L """ p_and = p1 ** k fn = (1 - p_and) ** L return fn def false_positive_rate(p2: float, k: int, L: int) -> float: """ Probability of a far point being a candidate. FP = 1 - (1 - p2^k)^L """ p_and = p2 ** k fp = 1 - (1 - p_and) ** L return fp def expected_candidates( n: int, p2: float, k: int, L: int, n_near: int = 1) -> float: """ Expected number of candidates returned by LSH query. = (True positives) + (False positives) ≈ n_near (if recall is high) + (n - n_near) * FP_rate """ fp_rate = false_positive_rate(p2, k, L) return n_near + (n - n_near) * fp_rate def min_tables_for_recall( p1: float, k: int, target_recall: float) -> int: """ Minimum L for target recall. (1 - p1^k)^L <= 1 - target_recall """ p_and = p1 ** k max_fn = 1 - target_recall if p_and == 0: return float('inf') if max_fn >= 1: return 1 L = np.ceil(np.log(max_fn) / np.log(1 - p_and)) return int(L) def analyze_errors(): """Analyze error rates for different configurations.""" # Example: Random hyperplane LSH # Near: cos(θ) = 0.9 (about 25° angle) -> p1 = 1 - 25°/180° ≈ 0.86 # Far: cos(θ) = 0.5 (about 60° angle) -> p2 = 1 - 60°/180° = 0.67 p1 = 0.86 p2 = 0.67 n = 1_000_000 # 1 million points print("Error Analysis for LSH") print(f"p1 (near) = {p1}, p2 (far) = {p2}, n = {n:,}") print("=" * 70) print(f"\n{'k':>4} {'L':>6} {'FN Rate':>12} {'FP Rate':>12} {'Recall':>10} {'E[Candidates]':>15}") print("-" * 70) for k in [4, 6, 8, 10, 12]: for L in [10, 20, 50, 100]: fn = false_negative_rate(p1, k, L) fp = false_positive_rate(p2, k, L) recall = 1 - fn # Expected candidates (assuming 10 true near neighbors) exp_cand = expected_candidates(n, p2, k, L, n_near=10) print(f"{k:>4} {L:>6} {fn:>12.6f} {fp:>12.6f} {recall:>10.4f} {exp_cand:>15,.0f}") # Find configuration with target performance print("\n" + "=" * 70) print("Finding configurations for 99% recall with <1000 expected candidates") print("-" * 70) for k in range(4, 16): min_L = min_tables_for_recall(p1, k, 0.99) if min_L < 500: # Reasonable number of tables exp_cand = expected_candidates(n, p2, k, min_L, n_near=10) if exp_cand < 10000: # Reasonable number of candidates fn = false_negative_rate(p1, k, min_L) fp = false_positive_rate(p2, k, min_L) print(f"k={k}, L={min_L}: Recall={1-fn:.4f}, E[Candidates]={exp_cand:,.0f}") def query_cost_analysis(): """Analyze total query cost: hashing + candidate verification.""" print("\n\nTotal Query Cost Analysis") print("=" * 70) d = 256 # Dimension n = 1_000_000 p1, p2 = 0.85, 0.65 print(f"Dimension d = {d}, Dataset size n = {n:,}") print(f"p1 = {p1}, p2 = {p2}") print() # Cost model: # - Hash computation: O(k * L * d) for k projections per table # - Candidate verification: O(candidates * d) for k in [6, 8, 10, 12]: L = min_tables_for_recall(p1, k, 0.95) if L > 200: continue exp_cand = expected_candidates(n, p2, k, L, n_near=10) hash_cost = k * L * d verify_cost = exp_cand * d total_cost = hash_cost + verify_cost brute_force_cost = n * d speedup = brute_force_cost / total_cost print(f"k={k}, L={L}:") print(f" Hash cost: {hash_cost:,} ops") print(f" Verify cost: {verify_cost:,.0f} ops (E[candidates]={exp_cand:,.0f})") print(f" Total: {total_cost:,.0f} ops") print(f" Brute force: {brute_force_cost:,} ops") print(f" Speedup: {speedup:.1f}x") print() if __name__ == "__main__": analyze_errors() query_cost_analysis()The analysis so far has 
focused on expected values. To provide high-probability guarantees, we need concentration inequalities.
Chernoff Bounds for Hash Collisions:
Let $X$ be the number of tables in which a pair $(x, y)$ collides. If each table's hash functions are independent with collision probability $p^k$:
$$X \sim \text{Binomial}(L, p^k)$$
For near points ($p = p_1$):
We want $\Pr[X = 0] \leq \delta$ (low false negative probability):
$$\Pr[X = 0] = (1 - p_1^k)^L \leq \delta$$
Using $(1 - p_1^k)^L \leq e^{-L p_1^k}$, with $L = \frac{\ln(1/\delta)}{p_1^k}$ tables: $$\Pr[X = 0] \leq \delta$$
For far points ($p = p_2$):
We want $\Pr[X \geq 1]$ to be small, or at least bound $\mathbb{E}[X]$:
$$\mathbb{E}[X] = L \cdot p_2^k$$
By Markov: $\Pr[X \geq 1] \leq \mathbb{E}[X] = L \cdot p_2^k$
High-Probability Success Guarantee:
With appropriate $(k, L)$, we can guarantee:
With probability at least $1 - \delta$:
- At least one true near neighbor is found (if it exists)
- The number of candidates is at most $O(n^\rho \log(1/\delta))$
Proof Sketch: with the parameters of the query time theorem, $p_2^k = 1/n$, so the expected number of far-point candidates is at most $L \cdot n \cdot p_2^k = L = n^\rho$. Apply the false-negative bound to the near neighbor, Markov's inequality to the candidate count, and a union bound over the two failure events.
Multi-Point Queries:
When finding all $m$ near neighbors (not just one), the analysis extends by a union bound: driving each neighbor's miss probability below $\delta/m$ — i.e. using roughly $L \geq \ln(m/\delta)/p_1^k$ tables — ensures that all $m$ are recovered with probability at least $1 - \delta$.
LSH success is probabilistic. For a given query, there's always a small chance of missing true neighbors or getting too many false positives. In practice, we can boost confidence through query repetition or by using more tables. Systems should be designed with this randomness in mind.
Choosing the right $(k, L)$ is critical for LSH performance. Here's a systematic approach.
Step 1: Define Requirements
Specify the target recall $R$, the candidate budget $C$ (how many candidates per query you can afford to verify exactly), and the memory budget (maximum number of tables).
Step 2: Estimate $p_1$ and $p_2$
From your data or a model of the distance distribution: choose the similarity (or distance) thresholds that define "near" and "far" for your application, then map them through the family's collision probability function (e.g. $p = 1 - \theta/\pi$ for random hyperplanes, $p = J$ for MinHash).
Step 3: Compute Constraints
For recall $R$: $$L \geq \frac{\ln(1/(1-R))}{p_1^k}$$
For candidate budget $C$: $$n \cdot (1 - (1 - p_2^k)^L) \leq C$$
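A minimal sketch of the Step 3 constraints (the probabilities, dataset size, and budgets below are placeholders, not measured values): for each $k$ it computes the smallest $L$ meeting the recall constraint and the largest $L$ meeting the candidate budget, and reports the $k$ values for which a feasible range exists. The grid search in Step 4 then picks the cheapest feasible pair.

```python
# For each k, intersect the recall constraint (lower bound on L)
# with the candidate budget (upper bound on L). Placeholder parameters.
import math

p1, p2 = 0.85, 0.50               # estimated collision probabilities (Step 2)
n, R, C = 1_000_000, 0.95, 5_000  # dataset size, target recall, candidate budget

for k in range(2, 16):
    # Smallest L with recall >= R:   (1 - p1^k)^L <= 1 - R
    L_min = math.ceil(math.log(1 - R) / math.log(1 - p1 ** k))
    # Largest L with n * FP rate <= C:   n * (1 - (1 - p2^k)^L) <= C
    L_max = math.floor(math.log(1 - C / n) / math.log(1 - p2 ** k))
    if L_min <= L_max:
        print(f"k={k:>2}: feasible L in [{L_min}, {L_max}]")
```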
Step 4: Grid Search
Search over $(k, L)$ pairs satisfying constraints and minimizing total cost (memory + query time).
"""Systematic parameter optimization for LSH.""" import numpy as npfrom dataclasses import dataclassfrom typing import Optional, List, Tuple @dataclassclass LSHConfig: """Configuration for LSH with computed metrics.""" k: int L: int recall: float fp_rate: float expected_candidates: int memory_cost: float # Proportional to L query_cost: float # Hashing + verification def find_optimal_lsh_config( p1: float, p2: float, n: int, d: int, target_recall: float = 0.95, max_candidates: int = 10000, max_tables: int = 200, num_near_neighbors: int = 10) -> List[LSHConfig]: """ Find LSH configurations meeting constraints. Parameters: ----------- p1 : float Base collision probability for near points p2 : float Base collision probability for far points n : int Dataset size d : int Dimensionality target_recall : float Minimum recall requirement max_candidates : int Maximum acceptable expected candidates max_tables : int Maximum number of tables (memory constraint) num_near_neighbors : int Expected number of true near neighbors Returns: -------- List of valid configurations sorted by total cost """ valid_configs = [] for k in range(1, 25): for L in range(1, max_tables + 1): # Compute metrics p1_k = p1 ** k p2_k = p2 ** k # Recall (probability of finding a near neighbor) recall = 1 - (1 - p1_k) ** L # False positive rate fp_rate = 1 - (1 - p2_k) ** L # Expected candidates exp_candidates = num_near_neighbors + (n - num_near_neighbors) * fp_rate # Check constraints if recall < target_recall: continue if exp_candidates > max_candidates: continue # Compute costs hash_cost = k * L * d # k projections per table, L tables verify_cost = exp_candidates * d memory_cost = L # Proportional to number of tables total_query_cost = hash_cost + verify_cost config = LSHConfig( k=k, L=L, recall=recall, fp_rate=fp_rate, expected_candidates=int(exp_candidates), memory_cost=memory_cost, query_cost=total_query_cost ) valid_configs.append(config) # Sort by total query cost valid_configs.sort(key=lambda c: c.query_cost) return valid_configs def print_best_configs(configs: List[LSHConfig], top_n: int = 10): """Print the best configurations.""" print(f"\nTop {top_n} LSH Configurations:") print("-" * 80) print(f"{'k':>4} {'L':>6} {'Recall':>10} {'FP Rate':>12} {'E[Cand]':>10} {'Query Cost':>12}") print("-" * 80) for config in configs[:top_n]: print(f"{config.k:>4} {config.L:>6} {config.recall:>10.4f} " f"{config.fp_rate:>12.6f} {config.expected_candidates:>10,} " f"{config.query_cost:>12,.0f}") def visualize_parameter_space( p1: float, p2: float, n: int): """Visualize recall and FP rate across (k, L) space.""" print("\nParameter Space Analysis") print("=" * 60) print(f"\nRecall (1 - FN rate) for p1 = {p1}:") print(" L=", end="") for L in [5, 10, 20, 50, 100]: print(f"{L:>8}", end="") print() for k in [4, 6, 8, 10, 12]: print(f"k={k:>2}", end="") for L in [5, 10, 20, 50, 100]: recall = 1 - (1 - p1 ** k) ** L print(f"{recall:>8.3f}", end="") print() print(f"\nE[Candidates] for n={n:,}, p2 = {p2}:") print(" L=", end="") for L in [5, 10, 20, 50, 100]: print(f"{L:>10}", end="") print() for k in [4, 6, 8, 10, 12]: print(f"k={k:>2}", end="") for L in [5, 10, 20, 50, 100]: fp_rate = 1 - (1 - p2 ** k) ** L exp_cand = n * fp_rate if exp_cand < 1000: print(f"{exp_cand:>10.0f}", end="") elif exp_cand < 1000000: print(f"{exp_cand/1000:>9.0f}K", end="") else: print(f"{exp_cand/1000000:>9.1f}M", end="") print() if __name__ == "__main__": # Example: Random hyperplane LSH for near/far classification # Near points: ~22° angle (cos = 0.93) -> p1 ≈ 0.88 # 
Far points: ~60° angle (cos = 0.5) -> p2 ≈ 0.67 p1 = 0.88 p2 = 0.67 n = 1_000_000 d = 256 print("LSH Parameter Optimization") print(f"Dataset: n={n:,} vectors, d={d}") print(f"Collision probabilities: p1={p1}, p2={p2}") # Visualize parameter space visualize_parameter_space(p1, p2, n) # Find optimal configurations configs = find_optimal_lsh_config( p1=p1, p2=p2, n=n, d=d, target_recall=0.95, max_candidates=5000, max_tables=150 ) if configs: print_best_configs(configs) else: print("\nNo configurations meet all constraints.") print("Try relaxing target_recall or max_candidates.")We've developed a comprehensive mathematical understanding of LSH collision probabilities. Let's consolidate the key insights:
What's Next:
In the final page of this module, we'll cover LSH Performance Tuning—practical techniques for implementing, benchmarking, and optimizing LSH in production systems. We'll discuss multi-probe strategies, data-dependent tuning, and integration with modern vector databases.
You now have deep understanding of the probability theory underlying LSH. This mathematical foundation is essential for deploying LSH with known performance characteristics and choosing the right parameters for your specific use case.