When working with high-dimensional vector representations—document embeddings, user preference vectors, neural network features—cosine similarity is often the metric of choice. Unlike Euclidean distance, cosine similarity captures the directional relationship between vectors, ignoring their magnitude. Two documents about machine learning should be similar whether they're 500 words or 5,000 words.
Cosine similarity between vectors $\mathbf{u}$ and $\mathbf{v}$ is defined as:
$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|}$$
This is simply the cosine of the angle $\theta$ between the vectors. Parallel vectors have similarity 1, orthogonal vectors have similarity 0, and anti-parallel vectors have similarity -1.
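As a quick concrete check of this scale-invariance, here is a minimal NumPy sketch (the vectors are made up for illustration):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between u and v -- scale-invariant by construction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(u, 2 * u))                       # ~1.0: same direction, magnitude ignored
print(cosine_similarity(u, np.array([3.0, 0.0, -1.0])))  # 0.0: orthogonal vectors
```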
But how do we build a locality-sensitive hash function for the angle between vectors? The answer is beautifully geometric: random hyperplanes.
By the end of this page, you will understand how random hyperplanes create locality-sensitive hashes for cosine similarity. You'll derive the collision probability formula, understand the geometric intuition, and learn to implement efficient random projection LSH for high-dimensional vector search.
A hyperplane in $d$-dimensional space is a $(d-1)$-dimensional subspace that divides the space into two half-spaces. In 2D, a hyperplane is a line. In 3D, it's a plane. In higher dimensions, we can't visualize it, but the math works identically.
A hyperplane through the origin can be defined by its normal vector $\mathbf{r}$. Every point $\mathbf{x}$ in space can be classified based on which side of the hyperplane it lies:
$$h_{\mathbf{r}}(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{r} \cdot \mathbf{x} \geq 0 \\ 0 & \text{if } \mathbf{r} \cdot \mathbf{x} < 0 \end{cases}$$
This is our hash function. It outputs a single bit based on which half-space the point occupies.
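A minimal sketch of this one-bit hash in NumPy (the 128-dimensional vectors here are arbitrary illustrations):

```python
import numpy as np

def hyperplane_hash(r: np.ndarray, x: np.ndarray) -> int:
    """One bit: which side of the hyperplane with normal r does x fall on?"""
    return 1 if np.dot(r, x) >= 0 else 0

rng = np.random.default_rng(0)
r = rng.standard_normal(128)   # random hyperplane normal in 128 dimensions
x = rng.standard_normal(128)   # an arbitrary data vector
print(hyperplane_hash(r, x))   # prints 0 or 1
```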
The Key Insight:
Consider two vectors $\mathbf{u}$ and $\mathbf{v}$ with angle $\theta$ between them. If we draw a random hyperplane through the origin, what's the probability that $\mathbf{u}$ and $\mathbf{v}$ end up on different sides?
The hyperplane separates $\mathbf{u}$ and $\mathbf{v}$ if and only if it passes through the angle $\theta$ between them. The probability of this is exactly:
$$\Pr[h_{\mathbf{r}}(\mathbf{u}) \neq h_{\mathbf{r}}(\mathbf{v})] = \frac{\theta}{\pi}$$
And therefore, the collision probability is:
$$\Pr[h_{\mathbf{r}}(\mathbf{u}) = h_{\mathbf{r}}(\mathbf{v})] = 1 - \frac{\theta}{\pi}$$
Imagine vectors u and v in 2D forming an angle θ. Now spin a random line (hyperplane in 2D) through the origin. The line separates u and v only when it passes through the "wedge" between them. That wedge spans θ out of the full π radians (180°). So the separation probability is θ/π.
Why does this work for any dimension?
The beautiful fact is that this probability depends only on the angle $\theta$, not on the dimensionality $d$. Here's why: the two vectors span a 2-dimensional plane, and which side of the hyperplane each vector lands on is determined entirely by the component of the random normal $\mathbf{r}$ that lies in that plane. Because a Gaussian random vector is spherically symmetric, that in-plane component points in a uniformly random direction, so the problem reduces to spinning a random line in 2D no matter how large $d$ is.
This dimensional independence is what makes random hyperplane LSH so powerful—it works efficiently regardless of whether your vectors are 10-dimensional or 10,000-dimensional.
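A quick Monte Carlo check of both claims, using a pair of made-up unit vectors at a fixed angle; the empirical collision rate should match $1 - \theta/\pi$ in low and high dimensions alike:

```python
import numpy as np

def empirical_collision_rate(dim: int, theta: float, trials: int = 20_000, seed: int = 0) -> float:
    """Fraction of random hyperplanes putting two vectors at angle theta on the same side."""
    rng = np.random.default_rng(seed)
    u = np.zeros(dim)
    u[0] = 1.0                                  # unit vector along the first axis
    v = np.zeros(dim)
    v[0], v[1] = np.cos(theta), np.sin(theta)   # unit vector at angle theta from u
    r = rng.standard_normal((trials, dim))      # one random hyperplane normal per row
    same_side = (r @ u >= 0) == (r @ v >= 0)
    return float(same_side.mean())

theta = np.pi / 3                               # 60 degrees
print(f"theory: {1 - theta / np.pi:.4f}")                       # 0.6667
print(f"d=10:   {empirical_collision_rate(10, theta):.4f}")     # ~0.667
print(f"d=1000: {empirical_collision_rate(1000, theta):.4f}")   # ~0.667
```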
Let's rigorously derive the collision probability formula. This derivation gives us deep insight into why random hyperplanes work.
Setup: Let $\mathbf{u}$ and $\mathbf{v}$ be vectors with angle $\theta$ between them, and let $\mathbf{r}$ be a random vector whose components are drawn i.i.d. from $\mathcal{N}(0, 1)$. Each vector is hashed to the sign of its projection onto $\mathbf{r}$.
Claim: $\Pr[\text{sign}(\mathbf{r} \cdot \mathbf{u}) = \text{sign}(\mathbf{r} \cdot \mathbf{v})] = 1 - \frac{\theta}{\pi}$
Proof:
Without loss of generality, assume $\mathbf{u}$ and $\mathbf{v}$ are unit vectors (normalization doesn't change signs).
Define the projections $X = \mathbf{r} \cdot \mathbf{u}$ and $Y = \mathbf{r} \cdot \mathbf{v}$.
Since $\mathbf{r}$ has i.i.d. $\mathcal{N}(0, 1)$ components, $X$ and $Y$ are jointly Gaussian with $\mathbb{E}[X] = \mathbb{E}[Y] = 0$, $\mathrm{Var}(X) = \|\mathbf{u}\|^2 = 1$, $\mathrm{Var}(Y) = \|\mathbf{v}\|^2 = 1$, and $\mathrm{Cov}(X, Y) = \mathbf{u} \cdot \mathbf{v} = \cos(\theta)$.
So $(X, Y)$ follows a bivariate normal distribution with correlation $\rho = \cos(\theta)$.
Key Lemma: For bivariate normal $(X, Y)$ with zero means, unit variances, and correlation $\rho$:
$$\Pr[\text{sign}(X) = \text{sign}(Y)] = 1 - \frac{1}{\pi} \cos^{-1}(\rho)$$
Proof of Lemma:
We need to compute $\Pr[XY > 0] = \Pr[X > 0, Y > 0] + \Pr[X < 0, Y < 0]$.
By symmetry, $$\Pr[X > 0, Y > 0] = \Pr[X < 0, Y < 0]$$
So $\Pr[XY > 0] = 2 \Pr[X > 0, Y > 0]$.
For bivariate normal with correlation $\rho$, this is a classical result:
$$\Pr[X > 0, Y > 0] = \frac{1}{4} + \frac{1}{2\pi} \sin^{-1}(\rho)$$
This can be derived by transforming to polar coordinates and integrating over the quadrant.
Therefore: $$\Pr[\text{sign}(X) = \text{sign}(Y)] = 2 \cdot \left(\frac{1}{4} + \frac{1}{2\pi} \sin^{-1}(\rho)\right) = \frac{1}{2} + \frac{1}{\pi} \sin^{-1}(\rho)$$
Using the identity $\sin^{-1}(\rho) = \frac{\pi}{2} - \cos^{-1}(\rho)$:
$$= \frac{1}{2} + \frac{1}{\pi}\left(\frac{\pi}{2} - \cos^{-1}(\rho)\right) = 1 - \frac{1}{\pi} \cos^{-1}(\rho)$$
Completing the Main Proof:
Since $\rho = \cos(\theta)$, we have $\cos^{-1}(\rho) = \theta$, giving us:
$$\Pr[h_{\mathbf{r}}(\mathbf{u}) = h_{\mathbf{r}}(\mathbf{v})] = 1 - \frac{\theta}{\pi}$$
$$\blacksquare$$
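The lemma itself is easy to sanity-check numerically by sampling from a bivariate normal with a chosen correlation; this is a sketch with an arbitrary $\rho$:

```python
import numpy as np

rho = 0.6                                    # arbitrary correlation for the check
rng = np.random.default_rng(1)
cov = np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal([0.0, 0.0], cov, size=500_000)
x, y = xy[:, 0], xy[:, 1]

empirical = np.mean(np.sign(x) == np.sign(y))
theoretical = 1 - np.arccos(rho) / np.pi
print(f"empirical:   {empirical:.4f}")       # ~0.705
print(f"theoretical: {theoretical:.4f}")     # 0.7048
```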
This random hyperplane method is also known as SimHash, introduced by Moses Charikar in 2002. It's used extensively in practice for near-duplicate detection, plagiarism checking, and similarity search in high-dimensional spaces.
Now let's connect the collision probability to the LSH framework. Recall that an LSH family is $(d_1, d_2, p_1, p_2)$-sensitive if near points (distance $\leq d_1$) collide with probability $\geq p_1$ and far points (distance $\geq d_2$) collide with probability $\leq p_2$.
For cosine similarity, we typically work with angular distance:
$$d_{\text{angular}}(\mathbf{u}, \mathbf{v}) = \frac{\theta}{\pi} = \frac{\cos^{-1}(\text{sim}(\mathbf{u}, \mathbf{v}))}{\pi}$$
This normalizes the angle to $[0, 1]$, where 0 means identical direction and 1 means opposite direction.
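A small helper for this conversion (a sketch; the clip guards against floating-point values just outside $[-1, 1]$):

```python
import numpy as np

def angular_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Normalized angular distance in [0, 1]: 0 = same direction, 1 = opposite."""
    sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(sim, -1.0, 1.0)) / np.pi)

u = np.array([1.0, 0.0])
print(angular_distance(u, np.array([1.0, 1.0])))   # 0.25 (45 degrees)
print(angular_distance(u, np.array([-1.0, 0.0])))  # 1.0  (opposite direction)
```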
LSH Sensitivity Parameters:
For angular distance thresholds $d_1$ and $d_2$ with $d_1 < d_2$, the family is $(d_1, d_2, p_1, p_2)$-sensitive with $p_1 = 1 - d_1$ and $p_2 = 1 - d_2$, since a point at angular distance $d$ collides with probability exactly $1 - d$.
The ρ Parameter:
$$\rho = \frac{\ln(1/p_1)}{\ln(1/p_2)} = \frac{\ln(1/(1-d_1))}{\ln(1/(1-d_2))}$$
For approximate near-neighbor with ratio $c$ (i.e., we accept points at distance $cd_1$ instead of $d_1$), setting $d_2 = cd_1$ gives:
$$\rho = \frac{\ln(1/(1-d_1))}{\ln(1/(1-cd_1))}$$
For small $d_1$ (high similarity), using $-\ln(1-x) \approx x$:
$$\rho \approx \frac{d_1}{cd_1} = \frac{1}{c}$$
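A quick numeric check of this approximation, with illustrative (made-up) values of $d_1$ and $c$:

```python
import numpy as np

def rho_exact(d1: float, c: float) -> float:
    """Exact rho for angular-distance thresholds d1 and d2 = c * d1."""
    return float(np.log(1 / (1 - d1)) / np.log(1 / (1 - c * d1)))

print(rho_exact(0.05, 2.0))  # ~0.487, close to the 1/c = 0.5 approximation
print(rho_exact(0.20, 2.0))  # ~0.437: the approximation degrades for larger d1
```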
| Cosine Similarity | Angle θ (radians) | Angular Distance θ/π | Collision Probability |
|---|---|---|---|
| 1.0 (identical) | 0 | 0 | 1.0 |
| 0.95 | 0.318 | 0.101 | 0.899 |
| 0.9 | 0.451 | 0.144 | 0.856 |
| 0.8 | 0.644 | 0.205 | 0.795 |
| 0.5 | 1.047 | 0.333 | 0.667 |
| 0.0 (orthogonal) | 1.571 (π/2) | 0.5 | 0.5 |
| -0.5 | 2.094 | 0.667 | 0.333 |
| -1.0 (opposite) | 3.142 (π) | 1.0 | 0.0 |
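These rows can be reproduced directly from the formula; a short sketch:

```python
import numpy as np

for sim in [1.0, 0.95, 0.9, 0.8, 0.5, 0.0, -0.5, -1.0]:
    theta = np.arccos(np.clip(sim, -1.0, 1.0))
    print(f"sim={sim:+.2f}  theta={theta:.3f} rad  "
          f"angular_dist={theta / np.pi:.3f}  collision_prob={1 - theta / np.pi:.3f}")
```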
Observation on the Probability Curve:
Notice that the collision probability is linear in the angular distance:
$$p(\theta) = 1 - \frac{\theta}{\pi}$$
This linear relationship is special. Many LSH families have S-shaped or concave collision probability curves. The linear curve of random hyperplane LSH means collision probability degrades smoothly and predictably as the angle grows, which makes the effect of the amplification parameters easy to reason about analytically.
However, the lack of a sharp transition also means we may need more hash functions compared to LSH families with steeper probability curves.
Let's build a complete, production-quality implementation of random hyperplane LSH for cosine similarity.
"""Random Hyperplane LSH for Cosine Similarity. This implementation is optimized for efficiency while remaining readable.Key optimizations:1. Matrix operations instead of loops2. Pre-computed random hyperplanes3. Efficient bit packing for hash keys""" import numpy as npfrom typing import List, Tuple, Optional, Dict, Setfrom collections import defaultdictfrom dataclasses import dataclass, fieldimport time @dataclassclass RandomHyperplaneLSH: """ LSH index using random hyperplanes for cosine similarity. Parameters: ----------- dim : int Dimensionality of vectors num_hyperplanes : int (k) Number of hyperplanes per hash table. Higher k means: - Fewer candidates per bucket (more precise) - Higher risk of missing true neighbors num_tables : int (L) Number of hash tables. Higher L means: - More likely to find true neighbors - More candidates to check - More memory and query time seed : int Random seed for reproducibility """ dim: int num_hyperplanes: int = 10 # k num_tables: int = 20 # L seed: int = 42 # Initialized in __post_init__ hyperplanes: np.ndarray = field(init=False) # Shape: (L, k, dim) tables: List[Dict] = field(init=False) vectors: Optional[np.ndarray] = field(default=None, init=False) def __post_init__(self): np.random.seed(self.seed) # Generate random hyperplanes for all tables at once # Each hyperplane is a random normal vector self.hyperplanes = np.random.randn( self.num_tables, self.num_hyperplanes, self.dim ) # Normalize hyperplanes (optional, but good practice) norms = np.linalg.norm(self.hyperplanes, axis=2, keepdims=True) self.hyperplanes = self.hyperplanes / norms # Initialize empty hash tables self.tables = [defaultdict(list) for _ in range(self.num_tables)] def _hash_vector( self, vector: np.ndarray, table_idx: int ) -> Tuple[int, ...]: """ Compute hash key for a vector in a specific table. The hash is a tuple of 0s and 1s indicating which side of each hyperplane the vector lies on. """ # Dot product with all hyperplanes in this table # Shape: (num_hyperplanes,) projections = np.dot(self.hyperplanes[table_idx], vector) # Convert to binary: 1 if positive, 0 if negative bits = (projections >= 0).astype(int) # Return as tuple (hashable) return tuple(bits.tolist()) def _hash_batch( self, vectors: np.ndarray ) -> List[List[Tuple[int, ...]]]: """ Compute hash keys for multiple vectors across all tables. Returns: List of L lists, each containing n hash keys """ n = len(vectors) all_hashes = [] for table_idx in range(self.num_tables): # Compute all projections at once # Shape: (n, num_hyperplanes) projections = np.dot(vectors, self.hyperplanes[table_idx].T) # Convert to binary bits = (projections >= 0).astype(int) # Convert each row to tuple hashes = [tuple(row.tolist()) for row in bits] all_hashes.append(hashes) return all_hashes def fit(self, vectors: np.ndarray) -> 'RandomHyperplaneLSH': """ Build the LSH index from a dataset of vectors. 
Parameters: ----------- vectors : np.ndarray of shape (n, dim) Dataset of n vectors Returns: -------- self : for method chaining """ self.vectors = vectors.copy() n = len(vectors) # Normalize vectors (cosine similarity is scale-invariant) norms = np.linalg.norm(vectors, axis=1, keepdims=True) norms = np.where(norms == 0, 1, norms) # Avoid division by zero normalized = vectors / norms # Compute all hashes all_hashes = self._hash_batch(normalized) # Insert into tables for table_idx in range(self.num_tables): for vec_idx in range(n): hash_key = all_hashes[table_idx][vec_idx] self.tables[table_idx][hash_key].append(vec_idx) return self def query( self, query: np.ndarray, num_neighbors: int = 5, return_similarity: bool = True ) -> List[Tuple[int, float]]: """ Find approximate nearest neighbors for a query vector. Parameters: ----------- query : np.ndarray of shape (dim,) Query vector num_neighbors : int Number of neighbors to return return_similarity : bool If True, return cosine similarity; else return angular distance Returns: -------- List of (index, similarity/distance) tuples, sorted by similarity """ if self.vectors is None: raise ValueError("Index not built. Call fit() first.") # Normalize query query_norm = np.linalg.norm(query) if query_norm == 0: raise ValueError("Query vector has zero norm") query_normalized = query / query_norm # Collect candidates from all tables candidates: Set[int] = set() for table_idx in range(self.num_tables): hash_key = self._hash_vector(query_normalized, table_idx) bucket = self.tables[table_idx].get(hash_key, []) candidates.update(bucket) if not candidates: # No candidates found; fall back to checking a sample candidates = set( np.random.choice(len(self.vectors), min(100, len(self.vectors)), replace=False) ) # Compute exact cosine similarities for candidates candidate_indices = list(candidates) candidate_vectors = self.vectors[candidate_indices] # Normalize candidate vectors norms = np.linalg.norm(candidate_vectors, axis=1, keepdims=True) norms = np.where(norms == 0, 1, norms) candidate_normalized = candidate_vectors / norms # Compute similarities similarities = np.dot(candidate_normalized, query_normalized) # Sort by similarity (descending) sorted_indices = np.argsort(-similarities) # Return top-k results = [] for i in sorted_indices[:num_neighbors]: idx = candidate_indices[i] sim = similarities[i] if return_similarity: results.append((idx, float(sim))) else: # Angular distance results.append((idx, float(np.arccos(np.clip(sim, -1, 1)) / np.pi))) return results def get_stats(self) -> Dict: """Get statistics about the index.""" bucket_sizes = [] for table in self.tables: bucket_sizes.extend(len(v) for v in table.values()) return { "num_tables": self.num_tables, "num_hyperplanes": self.num_hyperplanes, "num_vectors": len(self.vectors) if self.vectors is not None else 0, "total_buckets": sum(len(t) for t in self.tables), "avg_bucket_size": np.mean(bucket_sizes) if bucket_sizes else 0, "max_bucket_size": max(bucket_sizes) if bucket_sizes else 0, "empty_buckets": sum(1 for t in self.tables for v in t.values() if len(v) == 0), } def demo(): """Demonstrate Random Hyperplane LSH.""" np.random.seed(42) # Create dataset n = 50000 dim = 256 print(f"Creating dataset: {n} vectors, {dim} dimensions") vectors = np.random.randn(n, dim) # Create query (similar to vector 0) query = vectors[0] + 0.1 * np.random.randn(dim) # Build index print("\nBuilding LSH index...") start = time.time() lsh = RandomHyperplaneLSH(dim=dim, num_hyperplanes=12, num_tables=25) lsh.fit(vectors) 
build_time = time.time() - start print(f"Build time: {build_time:.3f}s") print(f"Index stats: {lsh.get_stats()}") # Query print("\nQuerying...") start = time.time() neighbors = lsh.query(query, num_neighbors=10) query_time = time.time() - start print(f"Query time: {query_time:.6f}s") print("\nTop 10 neighbors (LSH):") for idx, sim in neighbors: print(f" Vector {idx}: similarity = {sim:.4f}") # Brute force comparison print("\nBrute force comparison...") start = time.time() query_norm = query / np.linalg.norm(query) norms = np.linalg.norm(vectors, axis=1, keepdims=True) normalized = vectors / norms all_sims = np.dot(normalized, query_norm) bf_time = time.time() - start top_10_bf = np.argsort(-all_sims)[:10] print(f"Brute force time: {bf_time:.6f}s") print("\nTop 10 neighbors (brute force):") for idx in top_10_bf: print(f" Vector {idx}: similarity = {all_sims[idx]:.4f}") # Calculate speedup print(f"\nSpeedup: {bf_time / query_time:.1f}x") # Check recall lsh_set = {idx for idx, _ in neighbors} bf_set = set(top_10_bf) recall = len(lsh_set & bf_set) / len(bf_set) print(f"Recall@10: {recall:.1%}") if __name__ == "__main__": demo()The raw collision probability from a single hyperplane may not discriminate well between near and far points. Amplification is the technique that transforms a weak LSH into a powerful one.
AND Amplification (k hyperplanes per table):
We require ALL $k$ hyperplanes to agree for a collision: $$p^{(k)} = p^k$$
For near points with $p_1 = 0.9$ and far points with $p_2 = 0.7$, taking $k = 10$ gives $p_1^{10} \approx 0.349$ and $p_2^{10} \approx 0.028$.
AND amplification reduces BOTH probabilities, but the far probability drops faster.
OR Amplification (L tables):
We succeed if we find a match in AT LEAST ONE table: $$P_1 = 1 - (1 - p_1^k)^L \quad \text{(near points)}$$ $$P_2 = 1 - (1 - p_2^k)^L \quad \text{(far points)}$$
With $k = 10$ and $L = 20$: $P_1 = 1 - (1 - 0.349)^{20} \approx 0.9998$ for near points, while $P_2 = 1 - (1 - 0.028)^{20} \approx 0.436$ for far points.
We now find near points with 99.98% probability while far points appear in only 43% of queries—a massive improvement over the base probabilities!
"""Analysis of AND-OR amplification for random hyperplane LSH.""" import numpy as npimport matplotlib.pyplot as plt def collision_prob(cos_sim: float) -> float: """ Collision probability for a single hyperplane. p = 1 - θ/π where θ = arccos(cos_sim) """ theta = np.arccos(np.clip(cos_sim, -1, 1)) return 1 - theta / np.pi def amplified_prob(p: float, k: int, L: int) -> float: """ Probability after AND-OR amplification. AND: require all k hyperplanes to agree -> p^k OR: succeed if any of L tables match -> 1 - (1-p^k)^L """ p_and = p ** k p_or = 1 - (1 - p_and) ** L return p_or def analyze_amplification(): """Analyze how k and L affect discrimination.""" # Various similarity levels similarities = [0.95, 0.9, 0.8, 0.7, 0.5, 0.3, 0.1, 0.0] base_probs = [collision_prob(s) for s in similarities] print("Base collision probabilities:") print("-" * 50) for s, p in zip(similarities, base_probs): print(f" cos_sim = {s:.2f}: p = {p:.4f}") # Example: Distinguish 0.9 similarity from 0.5 similarity p_near = collision_prob(0.9) # ~0.856 p_far = collision_prob(0.5) # ~0.667 print(f"\n\nTarget: Find vectors with sim >= 0.9, reject sim <= 0.5") print(f"Base probabilities: p_near = {p_near:.4f}, p_far = {p_far:.4f}") print(f"Base ratio: {p_near/p_far:.2f}") # Try different k and L combinations print("\n\nAmplification analysis (fixed memory budget):") print("-" * 70) print(f"{'k':>4} {'L':>4} {'P_near':>10} {'P_far':>10} {'Ratio':>10} {'FPs per 1000':>12}") print("-" * 70) for k in [5, 8, 10, 12, 15]: for L in [10, 20, 30, 50]: P_near = amplified_prob(p_near, k, L) P_far = amplified_prob(p_far, k, L) ratio = P_near / P_far if P_far > 0 else float('inf') fps = P_far * 1000 # Expected false positives per 1000 far points print(f"{k:>4} {L:>4} {P_near:>10.4f} {P_far:>10.4f} {ratio:>10.2f} {fps:>12.1f}") # Optimal for specific recall target print("\n\nFinding k, L for 99% recall with minimal false positives:") print("-" * 70) best_config = None best_fp_rate = float('inf') for k in range(4, 20): for L in range(5, 100): P_near = amplified_prob(p_near, k, L) P_far = amplified_prob(p_far, k, L) if P_near >= 0.99: # 99% recall if P_far < best_fp_rate: best_fp_rate = P_far best_config = (k, L, P_near, P_far) if best_config: k, L, P_near, P_far = best_config print(f"Best config for 99% recall:") print(f" k = {k}, L = {L}") print(f" Recall (P_near) = {P_near:.4f}") print(f" False positive rate (P_far) = {P_far:.4f}") print(f" Space: {L} tables × {k} hyperplanes = {L*k} total hyperplanes") if __name__ == "__main__": analyze_amplification()Increasing k (hash functions per table) reduces false positives but increases false negatives. Increasing L (number of tables) reduces false negatives but increases query time and memory. The optimal combination depends on your recall/precision requirements and computational budget.
The basic random hyperplane LSH can be enhanced in several ways for practical applications.
Multi-probe LSH Deep Dive:
Multi-probe is perhaps the most impactful optimization. The insight is that if a query slightly misses a bucket containing a true neighbor, the query's hash likely differs by only a few bits.
Given a query hash $h(q) = (b_1, b_2, \ldots, b_k)$, we probe the exact bucket first, then buckets whose keys differ from $h(q)$ in one bit, then in two bits, and so on, prioritizing flips of the bits whose projections fell closest to the hyperplane.
The probability that a near neighbor ends up in a "1-bit different" bucket can be computed exactly, allowing us to prioritize which alternate buckets to check.
Benefit: Multi-probe with $T$ probes per table can achieve recall similar to standard LSH with roughly $T$ times as many tables, while using about $1/T$ of the memory. Query time is comparable, since extra table lookups are replaced with extra probes.
"""Multi-probe extension for Random Hyperplane LSH.""" import numpy as npfrom typing import Tuple, Set, Listfrom itertools import combinations def generate_probes( base_hash: Tuple[int, ...], num_flips: int = 2) -> List[Tuple[int, ...]]: """ Generate all hash keys within Hamming distance of base_hash. Parameters: ----------- base_hash : tuple of ints The original hash key num_flips : int Maximum number of bits to flip Returns: -------- List of hash keys to probe """ probes = [base_hash] k = len(base_hash) # Generate all combinations of positions to flip for flips in range(1, num_flips + 1): for positions in combinations(range(k), flips): # Flip bits at these positions new_hash = list(base_hash) for pos in positions: new_hash[pos] = 1 - new_hash[pos] probes.append(tuple(new_hash)) return probes def multiprobe_query( query: np.ndarray, hyperplanes: np.ndarray, # Shape: (num_tables, k, dim) tables: List[dict], num_probes: int = 5 # Number of probes per table) -> Set[int]: """ Multi-probe query: check nearby buckets in addition to exact match. Returns set of candidate indices. """ candidates = set() num_tables = len(tables) for table_idx in range(num_tables): # Compute exact hash projections = np.dot(hyperplanes[table_idx], query) base_hash = tuple((projections >= 0).astype(int).tolist()) # Generate probes # For efficiency, we generate probes based on projection magnitudes # Bits with small |projection| are more likely to flip abs_proj = np.abs(projections) flip_priority = np.argsort(abs_proj) # Smallest first probes_checked = 0 for flips in range(len(base_hash) + 1): if probes_checked >= num_probes: break for positions in combinations(flip_priority[:flips+3], flips): if probes_checked >= num_probes: break # Create probed hash probed_hash = list(base_hash) for pos in positions: probed_hash[pos] = 1 - probed_hash[pos] probed_hash = tuple(probed_hash) # Get candidates from this bucket bucket = tables[table_idx].get(probed_hash, []) candidates.update(bucket) probes_checked += 1 return candidates # Example comparisondef compare_probe_strategies(): """Compare single-probe vs multi-probe LSH.""" np.random.seed(42) # Setup n, dim, k = 10000, 128, 12 # Generate random hyperplanes hyperplanes = np.random.randn(1, k, dim) # Single table hyperplanes /= np.linalg.norm(hyperplanes, axis=2, keepdims=True) # Generate data data = np.random.randn(n, dim) data /= np.linalg.norm(data, axis=1, keepdims=True) # Build table table = {} for i, vec in enumerate(data): proj = np.dot(hyperplanes[0], vec) h = tuple((proj >= 0).astype(int).tolist()) if h not in table: table[h] = [] table[h].append(i) # Query (find nearest to a specific vector) query = data[0] + 0.2 * np.random.randn(dim) query /= np.linalg.norm(query) # True 100-NN sims = np.dot(data, query) true_100nn = set(np.argsort(-sims)[:100]) # Single-probe proj = np.dot(hyperplanes[0], query) exact_hash = tuple((proj >= 0).astype(int).tolist()) single_candidates = set(table.get(exact_hash, [])) single_recall = len(single_candidates & true_100nn) / 100 # Multi-probe (5 probes) multi_candidates = multiprobe_query(query, hyperplanes, [table], num_probes=5) multi_recall = len(multi_candidates & true_100nn) / 100 # Multi-probe (20 probes) multi_candidates_20 = multiprobe_query(query, hyperplanes, [table], num_probes=20) multi_recall_20 = len(multi_candidates_20 & true_100nn) / 100 print("Recall comparison (single table, k=12):") print(f" Single-probe: {single_recall:.1%} ({len(single_candidates)} candidates)") print(f" Multi-probe (5): {multi_recall:.1%} 
({len(multi_candidates)} candidates)") print(f" Multi-probe (20): {multi_recall_20:.1%} ({len(multi_candidates_20)} candidates)") if __name__ == "__main__": compare_probe_strategies()When deploying random hyperplane LSH in production, several practical factors influence design decisions.
| Use Case | Dataset Size | k (hash bits) | L (tables) | Expected Recall |
|---|---|---|---|---|
| Real-time search | 1M vectors | 8-12 | 10-20 | 80-90% |
| High-recall retrieval | 1M vectors | 6-10 | 30-50 | 95-99% |
| Large-scale (10M+) | 10-100M vectors | 12-16 | 20-30 | 85-95% |
| Deduplication | Any | 10-14 | 5-10 | 90%+ for exact dupes |
| Candidate generation | Any | 6-8 | 50-100 | 99%+ |
Start with k = log₂(n) / 2 and L = 10-20. This gives buckets with O(√n) points on average. Then tune based on measured recall and query time on your actual data. Multi-probe with fewer tables often beats standard LSH with more tables.
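As a rough sketch of that rule of thumb (the helper and its constants are heuristics for illustration, not guarantees):

```python
import numpy as np
from typing import Tuple

def suggest_params(n: int) -> Tuple[int, int]:
    """Heuristic starting point: k ~ log2(n)/2 bits per table, L in the 10-20 range."""
    k = max(1, round(np.log2(n) / 2))
    return k, 15

print(suggest_params(1_000_000))  # (10, 15): ~2^10 buckets/table, ~sqrt(n) points per bucket
```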
We've thoroughly explored random hyperplane LSH for cosine similarity. The key insights: a random hyperplane through the origin separates two vectors with probability $\theta/\pi$, so a single-bit hash collides with probability $1 - \theta/\pi$ regardless of dimension; AND-OR amplification ($k$ hyperplanes per table, $L$ tables) sharpens this mildly selective hash into a strong near/far discriminator; and multi-probe querying recovers recall with fewer tables by also checking buckets that differ in a few low-confidence bits.
What's Next:
In the next page, we'll explore LSH for Euclidean distance, which uses a fundamentally different approach based on p-stable distributions. While random hyperplanes capture angular similarity, Euclidean LSH must capture absolute distance, requiring different mathematical machinery.
You now have a deep understanding of random hyperplane LSH for cosine similarity. This technique is the foundation for many real-world similarity search systems, from Google's semantic search to content recommendation engines.