While cosine similarity captures directional relationships between vectors, many applications fundamentally require Euclidean distance—the straight-line distance in feature space. Image retrieval, spatial databases, and many scientific applications deal with absolute distances, not just angles.
Consider a facial recognition system. Two face embeddings might point in similar directions (high cosine similarity), but if one represents a face photographed up close and another from far away, their Euclidean distance might be large despite similar angles. For matching identities, we often care about the actual distance in embedding space.
The challenge: Random hyperplanes don't work for Euclidean distance. A hyperplane's hash depends only on which side a point lies—it captures direction but loses magnitude information. We need a fundamentally different approach.
The solution: p-stable distributions and random projections that preserve distance information.
By the end of this page, you will understand p-stable distribution LSH for Euclidean distance. You'll learn the mathematical theory behind why Gaussian projections work, derive the collision probability formula, and implement E2LSH (Euclidean LSH) for approximate nearest neighbor search.
Let's formally define what we need from a Euclidean LSH family.
Problem Setup:
Given points $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$, the Euclidean distance is:
$$d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{d} (u_i - v_i)^2}$$
We want a hash family $\mathcal{H}$ such that for $h$ drawn uniformly from $\mathcal{H}$:

- If $d(\mathbf{u}, \mathbf{v}) \le r$: $\Pr[h(\mathbf{u}) = h(\mathbf{v})] \ge p_1$
- If $d(\mathbf{u}, \mathbf{v}) \ge cr$: $\Pr[h(\mathbf{u}) = h(\mathbf{v})] \le p_2$

with $p_1 > p_2$.
Why Random Hyperplanes Fail:
Random hyperplane hashing (from the previous page) produces collision probability $1 - \theta/\pi$ where $\theta$ is the angle between vectors. But consider three points in $\mathbb{R}^2$:

$$\mathbf{u} = (1, 0), \quad \mathbf{v} = (10, 0), \quad \mathbf{w} = (0, 1)$$

Here $\mathbf{u}$ and $\mathbf{v}$ point in exactly the same direction ($\theta = 0$), while $\mathbf{u}$ and $\mathbf{w}$ are orthogonal ($\theta = \pi/2$). But $d(\mathbf{u}, \mathbf{v}) = 9$ while $d(\mathbf{u}, \mathbf{w}) = \sqrt{2} \approx 1.41$!
The hyperplane hash thinks $\mathbf{u}$ and $\mathbf{v}$ are identical (same bucket), while it gives a 50% chance of separating $\mathbf{u}$ and $\mathbf{w}$, despite $\mathbf{w}$ being much closer in Euclidean distance. This is catastrophically wrong for Euclidean nearest neighbors.
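A quick simulation (a sketch using the example vectors above) confirms this: $\mathbf{u}$ and $\mathbf{v}$ always land on the same side of a random hyperplane through the origin, while $\mathbf{u}$ and $\mathbf{w}$ are separated half the time.

```python
"""Sketch: random hyperplane hashing misjudges Euclidean closeness."""
import numpy as np

rng = np.random.default_rng(0)
u = np.array([1.0, 0.0])    # example vectors from the text
v = np.array([10.0, 0.0])   # same direction as u, distance 9
w = np.array([0.0, 1.0])    # orthogonal to u, distance sqrt(2)

normals = rng.standard_normal((100_000, 2))  # random hyperplane normals
same_uv = np.mean(np.sign(normals @ u) == np.sign(normals @ v))
same_uw = np.mean(np.sign(normals @ u) == np.sign(normals @ w))

print(f"P[h(u) = h(v)] ~ {same_uv:.3f}  (d = 9.00)")   # ~ 1.0
print(f"P[h(u) = h(w)] ~ {same_uw:.3f}  (d = 1.41)")   # ~ 0.5
```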
Using an LSH family designed for one metric with a different metric produces meaningless results. You cannot use random hyperplane LSH for Euclidean distance, even if the vectors are normalized. The fundamental relationship between collision probability and distance is broken.
The key to Euclidean LSH is a beautiful property of certain probability distributions called p-stability.
Definition: p-Stable Distribution
A distribution $\mathcal{D}$ is called p-stable if for any $n$ real numbers $v_1, v_2, \ldots, v_n$ and i.i.d. random variables $X_1, X_2, \ldots, X_n$ drawn from $\mathcal{D}$:
$$\sum_{i=1}^{n} v_i X_i \sim \left(\sum_{i=1}^{n} |v_i|^p\right)^{1/p} \cdot X$$
where $X$ is another random variable drawn from $\mathcal{D}$.
In plain English: The weighted sum of p-stable random variables is distributed like a p-stable random variable scaled by the $\ell_p$ norm of the weights.
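A minimal numerical check of the $p = 2$ case (a sketch with an arbitrary weight vector): the weighted sum of i.i.d. Gaussians should have standard deviation equal to the $\ell_2$ norm of the weights.

```python
"""Sketch: verify 2-stability of the Gaussian numerically."""
import numpy as np

rng = np.random.default_rng(0)
v = np.array([3.0, -1.0, 2.0, 0.5])   # arbitrary weights v_1..v_n
l2_norm = np.linalg.norm(v)           # (sum |v_i|^2)^(1/2)

# Draw many samples of sum_i v_i * X_i with X_i ~ N(0, 1) i.i.d.
sums = rng.standard_normal((1_000_000, v.size)) @ v

print(f"empirical std of weighted sum:  {sums.std():.4f}")
print(f"l2 norm of weights (predicted): {l2_norm:.4f}")   # both ~ 3.775
```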
Key Examples:
| p | Distribution | Density | Use Case |
|---|---|---|---|
| p = 1 | Cauchy | $\frac{1}{\pi(1+x^2)}$ | Manhattan (L1) distance LSH |
| p = 2 | Gaussian (Normal) | $\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$ | Euclidean (L2) distance LSH |
Why p-Stability Matters for LSH:
Consider two vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$ and a random vector $\mathbf{a}$ with i.i.d. components from a 2-stable (Gaussian) distribution.
The difference in projections is: $$\mathbf{a} \cdot \mathbf{u} - \mathbf{a} \cdot \mathbf{v} = \mathbf{a} \cdot (\mathbf{u} - \mathbf{v}) = \sum_{i=1}^{d} a_i (u_i - v_i)$$
By 2-stability: $$\mathbf{a} \cdot (\mathbf{u} - \mathbf{v}) \sim \|\mathbf{u} - \mathbf{v}\|_2 \cdot Z$$
where $Z \sim \mathcal{N}(0, 1)$.
The crucial insight: The projection difference is distributed as the Euclidean distance times a standard normal! This means small distances produce tightly concentrated projection differences, while large distances produce widely spread ones.
We can exploit this to create a hash function where collision probability decreases with distance!
p-stability transforms a high-dimensional distance into a single random variable whose scale equals that distance. This dimensional reduction preserves exactly the information we need: how far apart two points are.
Now we'll construct the actual hash function for Euclidean distance, known as E2LSH (Euclidean LSH) or p-stable LSH.
Hash Function Definition:
For a point $\mathbf{v} \in \mathbb{R}^d$, define:
$$h_{\mathbf{a},b}(\mathbf{v}) = \left\lfloor \frac{\mathbf{a} \cdot \mathbf{v} + b}{w} \right\rfloor$$
where:

- $\mathbf{a} \in \mathbb{R}^d$ is a random vector with i.i.d. $\mathcal{N}(0, 1)$ components
- $b$ is a random offset drawn uniformly from $[0, w)$
- $w > 0$ is the bucket width, a tunable parameter
Geometric Interpretation:
Think of this hash as:

1. Projecting the point onto a random line with direction $\mathbf{a}$
2. Shifting the projection by the random offset $b$
3. Chopping the line into segments of width $w$ and reporting which segment the point lands in
Points that are close together in $\mathbb{R}^d$ will have similar projections, and hence are likely to fall in the same bucket. Points far apart have different projections and likely fall in different buckets.
Why the Random Offset $b$?
Without $b$, bucket boundaries would be at fixed locations (multiples of $w$ along the projection direction). Two points very close together could be split by a boundary.
The random offset randomizes where boundaries fall, ensuring that the probability of two points sharing a bucket depends only on their distance, not their absolute positions.
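A small sketch of this effect, using two hypothetical projection values $t_u = 3.9$ and $t_v = 4.1$ that straddle a fixed boundary at $4.0$:

```python
"""Sketch: the random offset b makes collisions depend only on distance."""
import numpy as np

rng = np.random.default_rng(0)
w = 4.0
t_u, t_v = 3.9, 4.1   # projections straddling the fixed boundary at 4.0

# Without b, these two points are always split by the boundary
print("same bucket without b:", np.floor(t_u / w) == np.floor(t_v / w))  # False

# With random b, a boundary falls between them with probability |t_u - t_v| / w
b = rng.uniform(0, w, size=1_000_000)
collide = np.floor((t_u + b) / w) == np.floor((t_v + b) / w)
print(f"P[collision] ~ {collide.mean():.3f}  (theory: {1 - 0.2 / w:.3f})")
```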
The Bucket Width $w$:
This is the key tunable parameter:

- Small $w$: fine buckets; only very close points collide, but even true near neighbors are often split
- Large $w$: coarse buckets; near neighbors reliably collide, but so do many distant points

The optimal $w$ depends on the distance threshold $r$ you care about. Typical choices range from $w = r$ to $w = 4r$, balancing the collision probabilities for near and far points.
"""E2LSH (Euclidean LSH) hash function demonstration. This illustrates the core hash function for Euclidean distance LSHusing p-stable (Gaussian) distributions.""" import numpy as npfrom typing import Tuple class E2LSHHashFunction: """ Single E2LSH hash function. h(v) = floor((a·v + b) / w) where: - a is a random Gaussian vector - b is a random offset in [0, w) - w is the bucket width """ def __init__(self, dim: int, w: float, seed: int = None): """ Initialize hash function. Parameters: ----------- dim : int Dimensionality of input vectors w : float Bucket width (key parameter for tuning) seed : int Random seed for reproducibility """ if seed is not None: np.random.seed(seed) # Random Gaussian projection vector self.a = np.random.randn(dim) # Random offset uniformly in [0, w) self.b = np.random.uniform(0, w) self.w = w self.dim = dim def hash(self, v: np.ndarray) -> int: """Compute hash value for a vector.""" projection = np.dot(self.a, v) + self.b return int(np.floor(projection / self.w)) def hash_batch(self, vectors: np.ndarray) -> np.ndarray: """Compute hash values for multiple vectors.""" projections = np.dot(vectors, self.a) + self.b return np.floor(projections / self.w).astype(int) def demonstrate_e2lsh(): """Show how E2LSH works for Euclidean distance.""" np.random.seed(42) dim = 10 w = 4.0 # Bucket width h = E2LSHHashFunction(dim, w) # Create a base point base = np.random.randn(dim) # Create points at various distances from base distances = [0.5, 1.0, 2.0, 4.0, 8.0] print(f"Base point hash: {h.hash(base)}") print(f"Bucket width w = {w}") print() # For each distance, create random points and see hash behavior for dist in distances: # Create 100 random points at this distance collisions = 0 for _ in range(100): # Random direction direction = np.random.randn(dim) direction = direction / np.linalg.norm(direction) # Point at exact distance point = base + dist * direction if h.hash(point) == h.hash(base): collisions += 1 print(f"Distance {dist:.1f}: {collisions}/100 collisions ({collisions}%)") # Theoretical insight print() print("Observation: Collision probability decreases with distance!") print("Points closer than w/2 are likely to collide.") print("Points farther than w have low collision probability.") def demonstrate_distance_preservation(): """ Show that projection differences scale with Euclidean distance. This is the key p-stability property. """ np.random.seed(42) dim = 100 num_trials = 1000 # Random Gaussian projection vector a = np.random.randn(dim) # Create pairs of points at various distances distances = [1.0, 2.0, 5.0, 10.0] print("p-Stability Demonstration:") print("Projection difference = d * Z where Z ~ N(0,1)") print() print(f"{'Distance':<12} {'Mean |proj diff|':<18} {'Theoretical':<12}") print("-" * 45) for d in distances: proj_diffs = [] for _ in range(num_trials): # Random base point u = np.random.randn(dim) # Random direction direction = np.random.randn(dim) direction = direction / np.linalg.norm(direction) # Point at distance d v = u + d * direction # Projection difference diff = np.dot(a, v) - np.dot(a, u) proj_diffs.append(abs(diff)) mean_diff = np.mean(proj_diffs) # Theoretical: |Z| has mean sqrt(2/pi) ≈ 0.798 # So E[|proj diff|] = d * norm(a) * sqrt(2/pi) theoretical = d * np.linalg.norm(a) * np.sqrt(2/np.pi) print(f"{d:<12.1f} {mean_diff:<18.4f} {theoretical:<12.4f}") if __name__ == "__main__": demonstrate_e2lsh() print() print("=" * 50) print() demonstrate_distance_preservation()Let's derive the exact collision probability for E2LSH. 
This analysis reveals why the bucket width $w$ is crucial.
Setup:
Two points $\mathbf{u}$ and $\mathbf{v}$ at Euclidean distance $d = \|\mathbf{u} - \mathbf{v}\|_2$.
Collision Condition:
$h(\mathbf{u}) = h(\mathbf{v})$ iff both projections fall in the same bucket:
$$\left\lfloor \frac{\mathbf{a} \cdot \mathbf{u} + b}{w} \right\rfloor = \left\lfloor \frac{\mathbf{a} \cdot \mathbf{v} + b}{w} \right\rfloor$$
Key Insight:
By 2-stability, $\mathbf{a} \cdot (\mathbf{u} - \mathbf{v}) \sim d \cdot Z$ where $Z \sim \mathcal{N}(0, 1)$.
Let $t = \mathbf{a} \cdot \mathbf{u} + b$. Then $\mathbf{a} \cdot \mathbf{v} + b = t - dZ$ for some standard normal $Z$.
The collision probability depends on:

- where $t$ falls within its bucket, which is uniform thanks to the random offset $b$
- how large the shift $dZ$ is relative to the bucket width $w$
Derivation:
Let $c = d/w$ (distance relative to bucket width). WLOG, assume $t \in [0, w)$ (we're in bucket 0).
Collision occurs iff $t - dZ \in [0, w)$, i.e., $t - w < dZ < t$.
Dividing by $d$: $(t-w)/d < Z < t/d$.
Since $t$ is uniform on $[0, w)$, we average over it (substituting $\tau = t/w$):
$$p(c) = \int_0^1 \left[\Phi\left(\frac{\tau w}{d}\right) - \Phi\left(\frac{(\tau-1)w}{d}\right)\right] d\tau$$
where $\Phi$ is the standard normal CDF.
Substituting $c = d/w$:
$$p(c) = \int_0^1 \left[\Phi\left(\frac{\tau}{c}\right) - \Phi\left(\frac{\tau - 1}{c}\right)\right] d\tau$$
This can be computed as:
$$p(c) = 1 - 2\Phi(-1/c) - \frac{2c}{\sqrt{2\pi}}\left(1 - e^{-1/(2c^2)}\right)$$
Simplification for analysis:
For $c \ll 1$ (distance much smaller than bucket width): $p(c) \approx 1$

For $c \gg 1$ (distance much larger than bucket width): $p(c) \approx \frac{1}{c\sqrt{2\pi}} = \frac{w}{d\sqrt{2\pi}}$ (checked numerically below)
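A quick check of this asymptotic form against the closed-form expression (a sketch assuming SciPy is available):

```python
"""Sketch: check the large-c asymptotic p(c) ~ 1/(c*sqrt(2*pi))."""
import numpy as np
from scipy import stats

def p_closed(c: float) -> float:
    """Closed-form E2LSH collision probability, c = d/w."""
    return (1 - 2 * stats.norm.cdf(-1 / c)
            - (2 * c / np.sqrt(2 * np.pi)) * (1 - np.exp(-1 / (2 * c**2))))

for c in [2.0, 5.0, 10.0, 50.0]:
    print(f"c = {c:>5.1f}: exact = {p_closed(c):.5f}, "
          f"asymptotic = {1 / (c * np.sqrt(2 * np.pi)):.5f}")
```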
"""E2LSH collision probability analysis.""" import numpy as npfrom scipy import statsfrom scipy.integrate import quad def e2lsh_collision_prob_integral(c: float) -> float: """ Exact collision probability for E2LSH. Parameters: ----------- c : float Distance relative to bucket width (c = d/w) Returns: -------- Collision probability p(c) """ if c == 0: return 1.0 def integrand(tau): return stats.norm.cdf(tau/c) - stats.norm.cdf((tau-1)/c) prob, _ = quad(integrand, 0, 1) return prob def e2lsh_collision_prob_closed(c: float) -> float: """ Closed-form collision probability for E2LSH. p(c) = 1 - 2*Φ(-1/c) - (2c/sqrt(2π)) * (1 - exp(-1/(2c²))) """ if c == 0: return 1.0 if c > 100: # Asymptotic form for large c return 2 / (c * np.sqrt(2 * np.pi)) phi = stats.norm.cdf(-1/c) term1 = 1 - 2 * phi term2 = (2 * c / np.sqrt(2 * np.pi)) * (1 - np.exp(-1 / (2 * c**2))) return term1 - term2 def analyze_collision_probability(): """Analyze how collision probability varies with distance.""" print("E2LSH Collision Probability") print("c = d/w (distance / bucket width)") print() print(f"{'c (d/w)':<10} {'p(c) integral':<15} {'p(c) closed':<15}") print("-" * 40) c_values = [0.1, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0, 5.0, 10.0] for c in c_values: p_int = e2lsh_collision_prob_integral(c) p_closed = e2lsh_collision_prob_closed(c) print(f"{c:<10.2f} {p_int:<15.4f} {p_closed:<15.4f}") # Analyze LSH quality for different w print() print("=" * 60) print("LSH Quality Analysis") print("=" * 60) print() # Scenario: near points at d1=1, far points at d2=4000 d1, d2 = 1.0, 4.0 print(f"Near distance d1 = {d1}, Far distance d2 = {d2}") print() print(f"{'w':<8} {'c1=d1/w':<10} {'p1':<10} {'c2=d2/w':<10} {'p2':<10} {'ρ':<10}") print("-" * 58) for w in [0.5, 1.0, 2.0, 4.0, 8.0]: c1 = d1 / w c2 = d2 / w p1 = e2lsh_collision_prob_closed(c1) p2 = e2lsh_collision_prob_closed(c2) # rho = ln(1/p1) / ln(1/p2) if p1 > 0 and p2 > 0 and p1 < 1 and p2 < 1: rho = np.log(1/p1) / np.log(1/p2) else: rho = float('nan') print(f"{w:<8.1f} {c1:<10.2f} {p1:<10.4f} {c2:<10.2f} {p2:<10.4f} {rho:<10.4f}") def simulate_collision_probability(): """Verify collision probability via simulation.""" np.random.seed(42) dim = 100 w = 4.0 num_trials = 10000 print("Simulated vs Theoretical Collision Probabilities") print(f"dim = {dim}, w = {w}") print() print(f"{'Distance d':<12} {'Simulated':<12} {'Theoretical':<12}") print("-" * 38) for d in [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]: collisions = 0 for _ in range(num_trials): # Random direction direction = np.random.randn(dim) direction = direction / np.linalg.norm(direction) # Two points at distance d u = np.random.randn(dim) v = u + d * direction # Random hash function a = np.random.randn(dim) b = np.random.uniform(0, w) h_u = np.floor((np.dot(a, u) + b) / w) h_v = np.floor((np.dot(a, v) + b) / w) if h_u == h_v: collisions += 1 simulated = collisions / num_trials c = d / w theoretical = e2lsh_collision_prob_closed(c) print(f"{d:<12.1f} {simulated:<12.4f} {theoretical:<12.4f}") if __name__ == "__main__": analyze_collision_probability() print() print("=" * 60) print() simulate_collision_probability()The quality parameter ρ = ln(1/p₁)/ln(1/p₂) for E2LSH depends on the bucket width w and the distance ratio c = d₂/d₁. For c-approximate nearest neighbor (accepting points at distance ≤ c·d₁), we achieve ρ ≈ 1/c, giving query time O(n^{1/c}). This is provably optimal for LSH schemes.
Let's build a full E2LSH index with optimizations for production use.
"""Complete E2LSH Implementation for Approximate Nearest Neighbor Search. Optimized for efficiency with:- Batch operations using NumPy- Smart bucket width selection- Multi-probe querying""" import numpy as npfrom typing import List, Tuple, Dict, Set, Optionalfrom collections import defaultdictfrom dataclasses import dataclass, fieldimport time @dataclassclass E2LSHIndex: """ E2LSH Index for Euclidean distance approximate nearest neighbor. Parameters: ----------- dim : int Dimensionality of vectors num_projections : int (k) Number of projections per hash table num_tables : int (L) Number of hash tables bucket_width : float (w) Width of hash buckets (crucial parameter!) seed : int Random seed for reproducibility """ dim: int num_projections: int = 8 # k num_tables: int = 20 # L bucket_width: float = 4.0 # w seed: int = 42 # Initialized in __post_init__ projections: np.ndarray = field(init=False) # Shape: (L, k, dim) offsets: np.ndarray = field(init=False) # Shape: (L, k) tables: List[Dict] = field(init=False) vectors: Optional[np.ndarray] = field(default=None, init=False) def __post_init__(self): np.random.seed(self.seed) # Random Gaussian projection vectors self.projections = np.random.randn( self.num_tables, self.num_projections, self.dim ) # Random offsets uniformly in [0, w) self.offsets = np.random.uniform( 0, self.bucket_width, size=(self.num_tables, self.num_projections) ) # Initialize hash tables self.tables = [defaultdict(list) for _ in range(self.num_tables)] def _compute_hash_values( self, vector: np.ndarray, table_idx: int ) -> np.ndarray: """ Compute hash values (bucket indices) for all projections in a table. Returns array of k integer hash values. """ # Project: shape (k,) projections = np.dot(self.projections[table_idx], vector) # Add offsets and divide by bucket width bucket_indices = np.floor( (projections + self.offsets[table_idx]) / self.bucket_width ) return bucket_indices.astype(int) def _hash_vector(self, vector: np.ndarray, table_idx: int) -> Tuple: """Get the hash key (tuple of bucket indices) for a vector.""" return tuple(self._compute_hash_values(vector, table_idx)) def fit(self, vectors: np.ndarray) -> 'E2LSHIndex': """ Build the E2LSH index from a dataset. Parameters: ----------- vectors : np.ndarray of shape (n, dim) Dataset of n vectors Returns: -------- self : for method chaining """ self.vectors = vectors.copy() n = len(vectors) # Compute hashes for all vectors in all tables for table_idx in range(self.num_tables): # Batch projection: (n, dim) @ (dim, k) -> (n, k) all_projections = np.dot( vectors, self.projections[table_idx].T ) # Add offsets and compute bucket indices all_buckets = np.floor( (all_projections + self.offsets[table_idx]) / self.bucket_width ).astype(int) # Insert into table for vec_idx in range(n): hash_key = tuple(all_buckets[vec_idx]) self.tables[table_idx][hash_key].append(vec_idx) return self def query( self, query: np.ndarray, num_neighbors: int = 5, num_probes: int = 1 ) -> List[Tuple[int, float]]: """ Find approximate nearest neighbors. Parameters: ----------- query : np.ndarray of shape (dim,) Query vector num_neighbors : int Number of neighbors to return num_probes : int Number of probes per table (multi-probe LSH) Returns: -------- List of (index, distance) tuples, sorted by distance """ if self.vectors is None: raise ValueError("Index not built. 
Call fit() first.") candidates: Set[int] = set() for table_idx in range(self.num_tables): # Get exact hash and nearby hashes hash_values = self._compute_hash_values(query, table_idx) if num_probes == 1: # Single probe - just exact match hash_key = tuple(hash_values) bucket = self.tables[table_idx].get(hash_key, []) candidates.update(bucket) else: # Multi-probe - check nearby buckets probes = self._generate_probes(hash_values, num_probes) for probe_key in probes: bucket = self.tables[table_idx].get(probe_key, []) candidates.update(bucket) if not candidates: # Fallback: random sample sample_size = min(1000, len(self.vectors)) candidates = set(np.random.choice( len(self.vectors), sample_size, replace=False )) # Compute exact distances to candidates candidate_list = list(candidates) candidate_vectors = self.vectors[candidate_list] distances = np.linalg.norm( candidate_vectors - query.reshape(1, -1), axis=1 ) # Sort and return top-k sorted_indices = np.argsort(distances)[:num_neighbors] return [ (candidate_list[i], float(distances[i])) for i in sorted_indices ] def _generate_probes( self, hash_values: np.ndarray, num_probes: int ) -> List[Tuple]: """ Generate probe sequences for multi-probe LSH. This implements a simple perturbation strategy: perturb each dimension by ±1 in order of smallest fractional part (most likely to change bucket). """ probes = [tuple(hash_values)] if num_probes <= 1: return probes # Compute fractional parts for perturbation priority k = len(hash_values) # Simple strategy: try flipping each dimension ±1 for i in range(k): if len(probes) >= num_probes: break # Try +1 perturbed = hash_values.copy() perturbed[i] += 1 probes.append(tuple(perturbed)) if len(probes) >= num_probes: break # Try -1 perturbed = hash_values.copy() perturbed[i] -= 1 probes.append(tuple(perturbed)) return probes[:num_probes] def get_stats(self) -> Dict: """Get statistics about the index.""" if self.vectors is None: return {"error": "Index not built"} bucket_sizes = [] for table in self.tables: bucket_sizes.extend(len(v) for v in table.values()) return { "num_vectors": len(self.vectors), "num_tables": self.num_tables, "num_projections": self.num_projections, "bucket_width": self.bucket_width, "total_buckets": sum(len(t) for t in self.tables), "avg_bucket_size": np.mean(bucket_sizes) if bucket_sizes else 0, "max_bucket_size": max(bucket_sizes) if bucket_sizes else 0, } def demo_e2lsh(): """Demonstrate E2LSH for approximate nearest neighbor search.""" np.random.seed(42) # Create dataset n = 50000 dim = 128 print(f"Dataset: {n} vectors, {dim} dimensions") vectors = np.random.randn(n, dim) # Create query (close to vector 0) query = vectors[0] + 0.5 * np.random.randn(dim) true_distance_to_0 = np.linalg.norm(query - vectors[0]) print(f"Query is {true_distance_to_0:.3f} away from vector 0") # Build index print("\nBuilding E2LSH index...") start = time.time() index = E2LSHIndex( dim=dim, num_projections=8, num_tables=25, bucket_width=4.0 ) index.fit(vectors) build_time = time.time() - start print(f"Build time: {build_time:.3f}s") print(f"Stats: {index.get_stats()}") # Query with different probe settings for num_probes in [1, 5, 10]: print(f"\nQuery with num_probes={num_probes}:") start = time.time() results = index.query(query, num_neighbors=10, num_probes=num_probes) query_time = time.time() - start print(f" Query time: {query_time*1000:.2f}ms") print(" Top 5 results:") for idx, dist in results[:5]: print(f" Vector {idx}: distance = {dist:.4f}") # Brute force comparison print("\nBrute force comparison:") 
start = time.time() all_distances = np.linalg.norm(vectors - query.reshape(1, -1), axis=1) bf_time = time.time() - start top_10_bf = np.argsort(all_distances)[:10] print(f" Time: {bf_time*1000:.2f}ms") print(" Top 5 results:") for idx in top_10_bf[:5]: print(f" Vector {idx}: distance = {all_distances[idx]:.4f}") # Recall calculation lsh_set = {idx for idx, _ in results} bf_set = set(top_10_bf) recall = len(lsh_set & bf_set) / len(bf_set) print(f"\nRecall@10: {recall:.1%}") if __name__ == "__main__": demo_e2lsh()The bucket width $w$ is the most critical parameter in E2LSH. It directly controls the tradeoff between precision and recall.
Intuition: $w$ sets the distance scale the hash can resolve. Distances well below $w$ rarely cross a bucket boundary, so such points usually collide; distances well above $w$ almost always cross one or more boundaries.
Optimal $w$ Selection:
Given a target distance threshold $r$ (points within distance $r$ are considered "near"), a good starting point is:
$$w \approx 4r$$
This ensures that points at the near distance $r$ collide with probability $\approx 0.80$ per hash function, while points at $4r$ collide with probability $\approx 0.37$; amplification then widens this gap. The table below (computed from the closed-form $p(c)$, and reproduced in code after the table) shows the tradeoff across choices of $w$:
| w/r ratio | p(near) at d=r | p(far) at d=4r | p₁/p₂ ratio |
|---|---|---|---|
| w = r | 0.37 | 0.10 | 3.7x |
| w = 2r | 0.61 | 0.20 | 3.1x |
| w = 4r | 0.80 | 0.37 | 2.2x |
| w = 8r | 0.90 | 0.61 | 1.5x |
| w = 16r | 0.95 | 0.80 | 1.2x |
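The table values can be reproduced from the closed-form $p(c)$ derived above (a sketch assuming SciPy):

```python
"""Sketch: reproduce the w/r table from the closed-form p(c)."""
import numpy as np
from scipy import stats

def p_closed(c: float) -> float:
    """Closed-form E2LSH collision probability, c = d/w."""
    return (1 - 2 * stats.norm.cdf(-1 / c)
            - (2 * c / np.sqrt(2 * np.pi)) * (1 - np.exp(-1 / (2 * c**2))))

for ratio in [1, 2, 4, 8, 16]:       # w = ratio * r
    p1 = p_closed(1 / ratio)          # near: d = r  => c = 1/ratio
    p2 = p_closed(4 / ratio)          # far:  d = 4r => c = 4/ratio
    print(f"w = {ratio:>2}r: p1 = {p1:.2f}, p2 = {p2:.2f}, ratio = {p1 / p2:.1f}x")
```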
The Tradeoff:
Smaller $w$ gives better discrimination (higher p₁/p₂ ratio) but lower absolute collision probabilities. This means:

- Fewer far points fall into each bucket, so less time is wasted filtering false positives
- But near neighbors collide less often, so you need more hash tables $L$ to reach a target recall (quantified in the sketch below)

Larger $w$ gives poorer discrimination but higher collision probabilities:

- Near neighbors are caught with fewer tables
- But buckets fill with far points, inflating the candidate set that must be distance-checked
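The sketch below quantifies the first effect with the standard amplification formula: with $k$ projections per table and $L$ tables, a near neighbor is retrieved with probability $1 - (1 - p_1^k)^L$, so a lower $p_1$ (smaller $w$) demands many more tables for the same recall. The $p_1$ values are taken from the table above.

```python
"""Sketch: smaller w (lower p1) demands more tables for the same recall."""
import numpy as np

def tables_needed(p1: float, k: int, target_recall: float = 0.9) -> int:
    """Smallest L with 1 - (1 - p1^k)^L >= target_recall."""
    return int(np.ceil(np.log(1 - target_recall) / np.log(1 - p1**k)))

k = 8  # projections per table
for p1 in [0.95, 0.90, 0.80, 0.61, 0.37]:   # p(near) values from the table
    print(f"p1 = {p1:.2f}: need L >= {tables_needed(p1, k):>5} tables for 90% recall")
```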
Practical Guidance:

- Sample pairwise distances from your dataset to estimate the near-neighbor scale $r$
- Start with $w$ in the range $r$ to $4r$ and adjust from there
- Tune on held-out queries, trading recall against candidate-set size
Some implementations use different bucket widths in different tables, covering multiple scales. This is useful when the distribution of relevant distances varies across queries. However, it complicates tuning and increases implementation complexity.
Let's crystallize the differences between these two LSH families.
When to Use Which:
| Application | Recommended LSH | Why |
|---|---|---|
| Text/document similarity | Cosine LSH | TF-IDF and embedding vectors are compared by direction |
| Image deduplication | Either | Depends on feature representation |
| Facial recognition | Euclidean | Distance in embedding space matters |
| Geospatial search | Euclidean | Physical distance is absolute |
| Recommendation systems | Cosine | User preference vectors are directional |
| Audio fingerprinting | Euclidean | Feature magnitudes carry information |
Hybrid Approaches:
Some applications normalize vectors and use cosine LSH as a proxy for Euclidean distance. For unit vectors:
$$\|\mathbf{u} - \mathbf{v}\|_2^2 = 2 - 2\cos(\theta) = 2(1 - \cos\theta)$$
So cosine similarity correlates with Euclidean distance for normalized vectors. However, the collision probability functions are different, so the amplification behavior differs.
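A quick numerical check of this identity (a sketch with random unit vectors):

```python
"""Sketch: for unit vectors, ||u - v||^2 = 2(1 - cos theta)."""
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(64); u /= np.linalg.norm(u)
v = rng.standard_normal(64); v /= np.linalg.norm(v)

lhs = np.linalg.norm(u - v) ** 2
rhs = 2 * (1 - np.dot(u, v))       # cos theta = u·v for unit vectors
print(f"{lhs:.6f} == {rhs:.6f}")   # equal up to float rounding
```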
Even for normalized vectors, cosine LSH and Euclidean LSH have different collision probability curves. The optimal (k, L) settings and expected performance can differ significantly. Always match your LSH family to your actual distance metric.
We've thoroughly explored E2LSH, the locality-sensitive hashing scheme for Euclidean distance. Let's consolidate the key insights:

- p-stability: for a Gaussian vector $\mathbf{a}$, the projection difference $\mathbf{a} \cdot (\mathbf{u} - \mathbf{v})$ is distributed as $\|\mathbf{u} - \mathbf{v}\|_2 \cdot Z$, reducing a high-dimensional distance to the scale of a single random variable
- The hash $h(\mathbf{v}) = \lfloor (\mathbf{a} \cdot \mathbf{v} + b)/w \rfloor$ buckets these projections; the collision probability $p(c)$ depends only on $c = d/w$
- The bucket width $w$ is the critical tuning knob: small $w$ discriminates sharply but demands more tables, while large $w$ collides generously but inflates the candidate set
What's Next:
In the next page, we'll explore hash function families more broadly, including MinHash for Jaccard similarity and other specialized LSH schemes. Understanding the full landscape of LSH families will help you choose the right approach for any similarity search problem.
You now understand E2LSH, the Euclidean distance LSH based on p-stable distributions. This technique is foundational for approximate nearest neighbor search in metric spaces where absolute distance—not just direction—matters.