While cosine similarity captures directional relationships between vectors, many applications fundamentally require Euclidean distance—the straight-line distance in feature space. Image retrieval, spatial databases, and many scientific applications deal with absolute distances, not just angles.
Consider a facial recognition system. Two face embeddings might point in similar directions (high cosine similarity), but if one represents a face photographed up close and another from far away, their Euclidean distance might be large despite similar angles. For matching identities, we often care about the actual distance in embedding space.
The challenge: Random hyperplanes don't work for Euclidean distance. A hyperplane's hash depends only on which side a point lies—it captures direction but loses magnitude information. We need a fundamentally different approach.
The solution: p-stable distributions and random projections that preserve distance information.
By the end of this page, you will understand p-stable distribution LSH for Euclidean distance. You'll learn the mathematical theory behind why Gaussian projections work, derive the collision probability formula, and implement E2LSH (Euclidean LSH) for approximate nearest neighbor search.
Let's formally define what we need from a Euclidean LSH family.
Problem Setup:
Given points $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$, the Euclidean distance is:
$$d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{d} (u_i - v_i)^2}$$
We want a hash family $\mathcal{H}$ such that for $h$ drawn uniformly from $\mathcal{H}$:

- If $d(\mathbf{u}, \mathbf{v}) \le r$: $\Pr[h(\mathbf{u}) = h(\mathbf{v})] \ge p_1$
- If $d(\mathbf{u}, \mathbf{v}) \ge cr$: $\Pr[h(\mathbf{u}) = h(\mathbf{v})] \le p_2$

with $p_1 > p_2$.
Why Random Hyperplanes Fail:
Random hyperplane hashing (from the previous page) produces collision probability $1 - \theta/\pi$ where $\theta$ is the angle between vectors. But consider three points in $\mathbb{R}^2$:

$$\mathbf{u} = (1, 0), \quad \mathbf{v} = (10, 0), \quad \mathbf{w} = (0, 1)$$

Here $\mathbf{u}$ and $\mathbf{v}$ point in exactly the same direction ($\theta = 0$), while $\mathbf{u}$ and $\mathbf{w}$ are orthogonal ($\theta = \pi/2$). But $d(\mathbf{u}, \mathbf{v}) = 9$ while $d(\mathbf{u}, \mathbf{w}) = \sqrt{2} \approx 1.41$!
The hyperplane hash thinks $\mathbf{u}$ and $\mathbf{v}$ are identical (same bucket), while it gives a 50% chance of separating $\mathbf{u}$ and $\mathbf{w}$, despite $\mathbf{w}$ being much closer in Euclidean distance. This is catastrophically wrong for Euclidean nearest neighbors.
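A quick simulation (a sketch using the example vectors above) confirms this: $\mathbf{u}$ and $\mathbf{v}$ always land on the same side of a random hyperplane through the origin, while $\mathbf{u}$ and $\mathbf{w}$ are separated half the time.

```python
"""Sketch: random hyperplane hashing misjudges Euclidean closeness."""
import numpy as np

rng = np.random.default_rng(0)
u = np.array([1.0, 0.0])    # example vectors from the text
v = np.array([10.0, 0.0])   # same direction as u, distance 9
w = np.array([0.0, 1.0])    # orthogonal to u, distance sqrt(2)

normals = rng.standard_normal((100_000, 2))  # random hyperplane normals
same_uv = np.mean(np.sign(normals @ u) == np.sign(normals @ v))
same_uw = np.mean(np.sign(normals @ u) == np.sign(normals @ w))

print(f"P[h(u) = h(v)] ~ {same_uv:.3f}  (d = 9.00)")   # ~ 1.0
print(f"P[h(u) = h(w)] ~ {same_uw:.3f}  (d = 1.41)")   # ~ 0.5
```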
Using an LSH family designed for one metric with a different metric produces meaningless results. You cannot use random hyperplane LSH for Euclidean distance, even if the vectors are normalized. The fundamental relationship between collision probability and distance is broken.
The key to Euclidean LSH is a beautiful property of certain probability distributions called p-stability.
Definition: p-Stable Distribution
A distribution $\mathcal{D}$ is called p-stable if for any $n$ real numbers $v_1, v_2, \ldots, v_n$ and i.i.d. random variables $X_1, X_2, \ldots, X_n$ drawn from $\mathcal{D}$:
$$\sum_{i=1}^{n} v_i X_i \sim \left(\sum_{i=1}^{n} |v_i|^p\right)^{1/p} \cdot X$$
where $X$ is another random variable drawn from $\mathcal{D}$.
In plain English: The weighted sum of p-stable random variables is distributed like a p-stable random variable scaled by the $\ell_p$ norm of the weights.
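A minimal numerical check of the $p = 2$ case (a sketch with an arbitrary weight vector): the weighted sum of i.i.d. Gaussians should have standard deviation equal to the $\ell_2$ norm of the weights.

```python
"""Sketch: verify 2-stability of the Gaussian numerically."""
import numpy as np

rng = np.random.default_rng(0)
v = np.array([3.0, -1.0, 2.0, 0.5])   # arbitrary weights v_1..v_n
l2_norm = np.linalg.norm(v)           # (sum |v_i|^2)^(1/2)

# Draw many samples of sum_i v_i * X_i with X_i ~ N(0, 1) i.i.d.
sums = rng.standard_normal((1_000_000, v.size)) @ v

print(f"empirical std of weighted sum:  {sums.std():.4f}")
print(f"l2 norm of weights (predicted): {l2_norm:.4f}")   # both ~ 3.775
```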
Key Examples:
| p | Distribution | Density | Use Case |
|---|---|---|---|
| p = 1 | Cauchy | $\frac{1}{\pi(1+x^2)}$ | Manhattan (L1) distance LSH |
| p = 2 | Gaussian (Normal) | $\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$ | Euclidean (L2) distance LSH |
Why p-Stability Matters for LSH:
Consider two vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$ and a random vector $\mathbf{a}$ with i.i.d. components from a 2-stable (Gaussian) distribution.
The difference in projections is: $$\mathbf{a} \cdot \mathbf{u} - \mathbf{a} \cdot \mathbf{v} = \mathbf{a} \cdot (\mathbf{u} - \mathbf{v}) = \sum_{i=1}^{d} a_i (u_i - v_i)$$
By 2-stability: $$\mathbf{a} \cdot (\mathbf{u} - \mathbf{v}) \sim \|\mathbf{u} - \mathbf{v}\|_2 \cdot Z$$
where $Z \sim \mathcal{N}(0, 1)$.
The crucial insight: The projection difference is distributed as the Euclidean distance times a standard normal! This means small distances produce tightly concentrated projection differences, while large distances produce widely spread ones.
We can exploit this to create a hash function where collision probability decreases with distance!
p-stability transforms a high-dimensional distance into a single random variable whose scale equals that distance. This dimensional reduction preserves exactly the information we need: how far apart two points are.
Now we'll construct the actual hash function for Euclidean distance, known as E2LSH (Euclidean LSH) or p-stable LSH.
Hash Function Definition:
For a point $\mathbf{v} \in \mathbb{R}^d$, define:
$$h_{\mathbf{a},b}(\mathbf{v}) = \left\lfloor \frac{\mathbf{a} \cdot \mathbf{v} + b}{w} \right\rfloor$$
where:

- $\mathbf{a} \in \mathbb{R}^d$ is a random vector with i.i.d. $\mathcal{N}(0, 1)$ components
- $b$ is a random offset drawn uniformly from $[0, w)$
- $w > 0$ is the bucket width, a tunable parameter
Geometric Interpretation:
Think of this hash as:

1. Projecting the point onto a random line with direction $\mathbf{a}$
2. Shifting the projection by the random offset $b$
3. Chopping the line into segments of width $w$ and reporting which segment the point lands in
Points that are close together in $\mathbb{R}^d$ will have similar projections, and hence are likely to fall in the same bucket. Points far apart have different projections and likely fall in different buckets.
Why the Random Offset $b$?
Without $b$, bucket boundaries would be at fixed locations (multiples of $w$ along the projection direction). Two points very close together could be split by a boundary.
The random offset randomizes where boundaries fall, ensuring that the probability of two points sharing a bucket depends only on their distance, not their absolute positions.
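A small sketch of this effect, using two hypothetical projection values $t_u = 3.9$ and $t_v = 4.1$ that straddle a fixed boundary at $4.0$:

```python
"""Sketch: the random offset b makes collisions depend only on distance."""
import numpy as np

rng = np.random.default_rng(0)
w = 4.0
t_u, t_v = 3.9, 4.1   # projections straddling the fixed boundary at 4.0

# Without b, these two points are always split by the boundary
print("same bucket without b:", np.floor(t_u / w) == np.floor(t_v / w))  # False

# With random b, a boundary falls between them with probability |t_u - t_v| / w
b = rng.uniform(0, w, size=1_000_000)
collide = np.floor((t_u + b) / w) == np.floor((t_v + b) / w)
print(f"P[collision] ~ {collide.mean():.3f}  (theory: {1 - 0.2 / w:.3f})")
```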
The Bucket Width $w$:
This is the key tunable parameter:

- Small $w$: fine buckets; only very close points collide, but even true near neighbors are often split
- Large $w$: coarse buckets; near neighbors reliably collide, but so do many distant points

The optimal $w$ depends on the distance threshold $r$ you care about. Typical choices range from $w = r$ to $w = 4r$, balancing the collision probabilities for near and far points.
"""E2LSH (Euclidean LSH) hash function demonstration. This illustrates the core hash function for Euclidean distance LSHusing p-stable (Gaussian) distributions.""" import numpy as npfrom typing import Tuple class E2LSHHashFunction: """ Single E2LSH hash function. h(v) = floor((a·v + b) / w) where: - a is a random Gaussian vector - b is a random offset in [0, w) - w is the bucket width """ def __init__(self, dim: int, w: float, seed: int = None): """ Initialize hash function. Parameters: ----------- dim : int Dimensionality of input vectors w : float Bucket width (key parameter for tuning) seed : int Random seed for reproducibility """ if seed is not None: np.random.seed(seed) # Random Gaussian projection vector self.a = np.random.randn(dim) # Random offset uniformly in [0, w) self.b = np.random.uniform(0, w) self.w = w self.dim = dim def hash(self, v: np.ndarray) -> int: """Compute hash value for a vector.""" projection = np.dot(self.a, v) + self.b return int(np.floor(projection / self.w)) def hash_batch(self, vectors: np.ndarray) -> np.ndarray: """Compute hash values for multiple vectors.""" projections = np.dot(vectors, self.a) + self.b return np.floor(projections / self.w).astype(int) def demonstrate_e2lsh(): """Show how E2LSH works for Euclidean distance.""" np.random.seed(42) dim = 10 w = 4.0 # Bucket width h = E2LSHHashFunction(dim, w) # Create a base point base = np.random.randn(dim) # Create points at various distances from base distances = [0.5, 1.0, 2.0, 4.0, 8.0] print(f"Base point hash: {h.hash(base)}") print(f"Bucket width w = {w}") print() # For each distance, create random points and see hash behavior for dist in distances: # Create 100 random points at this distance collisions = 0 for _ in range(100): # Random direction direction = np.random.randn(dim) direction = direction / np.linalg.norm(direction) # Point at exact distance point = base + dist * direction if h.hash(point) == h.hash(base): collisions += 1 print(f"Distance {dist:.1f}: {collisions}/100 collisions ({collisions}%)") # Theoretical insight print() print("Observation: Collision probability decreases with distance!") print("Points closer than w/2 are likely to collide.") print("Points farther than w have low collision probability.") def demonstrate_distance_preservation(): """ Show that projection differences scale with Euclidean distance. This is the key p-stability property. """ np.random.seed(42) dim = 100 num_trials = 1000 # Random Gaussian projection vector a = np.random.randn(dim) # Create pairs of points at various distances distances = [1.0, 2.0, 5.0, 10.0] print("p-Stability Demonstration:") print("Projection difference = d * Z where Z ~ N(0,1)") print() print(f"{'Distance':<12} {'Mean |proj diff|':<18} {'Theoretical':<12}") print("-" * 45) for d in distances: proj_diffs = [] for _ in range(num_trials): # Random base point u = np.random.randn(dim) # Random direction direction = np.random.randn(dim) direction = direction / np.linalg.norm(direction) # Point at distance d v = u + d * direction # Projection difference diff = np.dot(a, v) - np.dot(a, u) proj_diffs.append(abs(diff)) mean_diff = np.mean(proj_diffs) # Theoretical: |Z| has mean sqrt(2/pi) ≈ 0.798 # So E[|proj diff|] = d * norm(a) * sqrt(2/pi) theoretical = d * np.linalg.norm(a) * np.sqrt(2/np.pi) print(f"{d:<12.1f} {mean_diff:<18.4f} {theoretical:<12.4f}") if __name__ == "__main__": demonstrate_e2lsh() print() print("=" * 50) print() demonstrate_distance_preservation()Let's derive the exact collision probability for E2LSH. 
This analysis reveals why the bucket width $w$ is crucial.
Setup:
Two points $\mathbf{u}$ and $\mathbf{v}$ at Euclidean distance $d = \|\mathbf{u} - \mathbf{v}\|_2$.
Collision Condition:
$h(\mathbf{u}) = h(\mathbf{v})$ iff both projections fall in the same bucket:
$$\left\lfloor \frac{\mathbf{a} \cdot \mathbf{u} + b}{w} \right\rfloor = \left\lfloor \frac{\mathbf{a} \cdot \mathbf{v} + b}{w} \right\rfloor$$
Key Insight:
By 2-stability, $\mathbf{a} \cdot (\mathbf{u} - \mathbf{v}) \sim d \cdot Z$ where $Z \sim \mathcal{N}(0, 1)$.
Let $t = \mathbf{a} \cdot \mathbf{u} + b$. Then $\mathbf{a} \cdot \mathbf{v} + b = t - dZ$ for some standard normal $Z$.
The collision probability depends on:

- where $t$ falls within its bucket, which is uniform thanks to the random offset $b$
- how large the shift $dZ$ is relative to the bucket width $w$
Derivation:
Let $c = d/w$ (distance relative to bucket width). WLOG, assume $t \in [0, w)$ (we're in bucket 0).
Collision occurs iff $t - dZ \in [0, w)$, i.e., $t - w < dZ < t$.
Dividing by $d$: $(t-w)/d < Z < t/d$.
Since $t$ is uniform on $[0, w)$, we average over it (substituting $\tau = t/w$):
$$p(c) = \int_0^1 \left[\Phi\left(\frac{\tau w}{d}\right) - \Phi\left(\frac{(\tau-1)w}{d}\right)\right] d\tau$$
where $\Phi$ is the standard normal CDF.
Substituting $c = d/w$:
$$p(c) = \int_0^1 \left[\Phi\left(\frac{\tau}{c}\right) - \Phi\left(\frac{\tau - 1}{c}\right)\right] d\tau$$
This can be computed as:
$$p(c) = 1 - 2\Phi(-1/c) - \frac{2c}{\sqrt{2\pi}}\left(1 - e^{-1/(2c^2)}\right)$$
Simplification for analysis:
For $c \ll 1$ (distance much smaller than bucket width): $p(c) \approx 1$

For $c \gg 1$ (distance much larger than bucket width): $p(c) \approx \frac{1}{c\sqrt{2\pi}} = \frac{w}{d\sqrt{2\pi}}$ (checked numerically below)
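A quick check of this asymptotic form against the closed-form expression (a sketch assuming SciPy is available):

```python
"""Sketch: check the large-c asymptotic p(c) ~ 1/(c*sqrt(2*pi))."""
import numpy as np
from scipy import stats

def p_closed(c: float) -> float:
    """Closed-form E2LSH collision probability, c = d/w."""
    return (1 - 2 * stats.norm.cdf(-1 / c)
            - (2 * c / np.sqrt(2 * np.pi)) * (1 - np.exp(-1 / (2 * c**2))))

for c in [2.0, 5.0, 10.0, 50.0]:
    print(f"c = {c:>5.1f}: exact = {p_closed(c):.5f}, "
          f"asymptotic = {1 / (c * np.sqrt(2 * np.pi)):.5f}")
```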
"""E2LSH collision probability analysis.""" import numpy as npfrom scipy import statsfrom scipy.integrate import quad def e2lsh_collision_prob_integral(c: float) -> float: """ Exact collision probability for E2LSH. Parameters: ----------- c : float Distance relative to bucket width (c = d/w) Returns: -------- Collision probability p(c) """ if c == 0: return 1.0 def integrand(tau): return stats.norm.cdf(tau/c) - stats.norm.cdf((tau-1)/c) prob, _ = quad(integrand, 0, 1) return prob def e2lsh_collision_prob_closed(c: float) -> float: """ Closed-form collision probability for E2LSH. p(c) = 1 - 2*Φ(-1/c) - (2c/sqrt(2π)) * (1 - exp(-1/(2c²))) """ if c == 0: return 1.0 if c > 100: # Asymptotic form for large c return 2 / (c * np.sqrt(2 * np.pi)) phi = stats.norm.cdf(-1/c) term1 = 1 - 2 * phi term2 = (2 * c / np.sqrt(2 * np.pi)) * (1 - np.exp(-1 / (2 * c**2))) return term1 - term2 def analyze_collision_probability(): """Analyze how collision probability varies with distance.""" print("E2LSH Collision Probability") print("c = d/w (distance / bucket width)") print() print(f"{'c (d/w)':<10} {'p(c) integral':<15} {'p(c) closed':<15}") print("-" * 40) c_values = [0.1, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0, 5.0, 10.0] for c in c_values: p_int = e2lsh_collision_prob_integral(c) p_closed = e2lsh_collision_prob_closed(c) print(f"{c:<10.2f} {p_int:<15.4f} {p_closed:<15.4f}") # Analyze LSH quality for different w print() print("=" * 60) print("LSH Quality Analysis") print("=" * 60) print() # Scenario: near points at d1=1, far points at d2=4000 d1, d2 = 1.0, 4.0 print(f"Near distance d1 = {d1}, Far distance d2 = {d2}") print() print(f"{'w':<8} {'c1=d1/w':<10} {'p1':<10} {'c2=d2/w':<10} {'p2':<10} {'ρ':<10}") print("-" * 58) for w in [0.5, 1.0, 2.0, 4.0, 8.0]: c1 = d1 / w c2 = d2 / w p1 = e2lsh_collision_prob_closed(c1) p2 = e2lsh_collision_prob_closed(c2) # rho = ln(1/p1) / ln(1/p2) if p1 > 0 and p2 > 0 and p1 < 1 and p2 < 1: rho = np.log(1/p1) / np.log(1/p2) else: rho = float('nan') print(f"{w:<8.1f} {c1:<10.2f} {p1:<10.4f} {c2:<10.2f} {p2:<10.4f} {rho:<10.4f}") def simulate_collision_probability(): """Verify collision probability via simulation.""" np.random.seed(42) dim = 100 w = 4.0 num_trials = 10000 print("Simulated vs Theoretical Collision Probabilities") print(f"dim = {dim}, w = {w}") print() print(f"{'Distance d':<12} {'Simulated':<12} {'Theoretical':<12}") print("-" * 38) for d in [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]: collisions = 0 for _ in range(num_trials): # Random direction direction = np.random.randn(dim) direction = direction / np.linalg.norm(direction) # Two points at distance d u = np.random.randn(dim) v = u + d * direction # Random hash function a = np.random.randn(dim) b = np.random.uniform(0, w) h_u = np.floor((np.dot(a, u) + b) / w) h_v = np.floor((np.dot(a, v) + b) / w) if h_u == h_v: collisions += 1 simulated = collisions / num_trials c = d / w theoretical = e2lsh_collision_prob_closed(c) print(f"{d:<12.1f} {simulated:<12.4f} {theoretical:<12.4f}") if __name__ == "__main__": analyze_collision_probability() print() print("=" * 60) print() simulate_collision_probability()The quality parameter ρ = ln(1/p₁)/ln(1/p₂) for E2LSH depends on the bucket width w and the distance ratio c = d₂/d₁. For c-approximate nearest neighbor (accepting points at distance ≤ c·d₁), we achieve ρ ≈ 1/c, giving query time O(n^{1/c}). This is provably optimal for LSH schemes.
Let's build a full E2LSH index with optimizations for production use.
"""Complete E2LSH Implementation for Approximate Nearest Neighbor Search. Optimized for efficiency with:- Batch operations using NumPy- Smart bucket width selection- Multi-probe querying""" import numpy as npfrom typing import List, Tuple, Dict, Set, Optionalfrom collections import defaultdictfrom dataclasses import dataclass, fieldimport time @dataclassclass E2LSHIndex: """ E2LSH Index for Euclidean distance approximate nearest neighbor. Parameters: ----------- dim : int Dimensionality of vectors num_projections : int (k) Number of projections per hash table num_tables : int (L) Number of hash tables bucket_width : float (w) Width of hash buckets (crucial parameter!) seed : int Random seed for reproducibility """ dim: int num_projections: int = 8 # k num_tables: int = 20 # L bucket_width: float = 4.0 # w seed: int = 42 # Initialized in __post_init__ projections: np.ndarray = field(init=False) # Shape: (L, k, dim) offsets: np.ndarray = field(init=False) # Shape: (L, k) tables: List[Dict] = field(init=False) vectors: Optional[np.ndarray] = field(default=None, init=False) def __post_init__(self): np.random.seed(self.seed) # Random Gaussian projection vectors self.projections = np.random.randn( self.num_tables, self.num_projections, self.dim ) # Random offsets uniformly in [0, w) self.offsets = np.random.uniform( 0, self.bucket_width, size=(self.num_tables, self.num_projections) ) # Initialize hash tables self.tables = [defaultdict(list) for _ in range(self.num_tables)] def _compute_hash_values( self, vector: np.ndarray, table_idx: int ) -> np.ndarray: """ Compute hash values (bucket indices) for all projections in a table. Returns array of k integer hash values. """ # Project: shape (k,) projections = np.dot(self.projections[table_idx], vector) # Add offsets and divide by bucket width bucket_indices = np.floor( (projections + self.offsets[table_idx]) / self.bucket_width ) return bucket_indices.astype(int) def _hash_vector(self, vector: np.ndarray, table_idx: int) -> Tuple: """Get the hash key (tuple of bucket indices) for a vector.""" return tuple(self._compute_hash_values(vector, table_idx)) def fit(self, vectors: np.ndarray) -> 'E2LSHIndex': """ Build the E2LSH index from a dataset. Parameters: ----------- vectors : np.ndarray of shape (n, dim) Dataset of n vectors Returns: -------- self : for method chaining """ self.vectors = vectors.copy() n = len(vectors) # Compute hashes for all vectors in all tables for table_idx in range(self.num_tables): # Batch projection: (n, dim) @ (dim, k) -> (n, k) all_projections = np.dot( vectors, self.projections[table_idx].T ) # Add offsets and compute bucket indices all_buckets = np.floor( (all_projections + self.offsets[table_idx]) / self.bucket_width ).astype(int) # Insert into table for vec_idx in range(n): hash_key = tuple(all_buckets[vec_idx]) self.tables[table_idx][hash_key].append(vec_idx) return self def query( self, query: np.ndarray, num_neighbors: int = 5, num_probes: int = 1 ) -> List[Tuple[int, float]]: """ Find approximate nearest neighbors. Parameters: ----------- query : np.ndarray of shape (dim,) Query vector num_neighbors : int Number of neighbors to return num_probes : int Number of probes per table (multi-probe LSH) Returns: -------- List of (index, distance) tuples, sorted by distance """ if self.vectors is None: raise ValueError("Index not built. 
Call fit() first.") candidates: Set[int] = set() for table_idx in range(self.num_tables): # Get exact hash and nearby hashes hash_values = self._compute_hash_values(query, table_idx) if num_probes == 1: # Single probe - just exact match hash_key = tuple(hash_values) bucket = self.tables[table_idx].get(hash_key, []) candidates.update(bucket) else: # Multi-probe - check nearby buckets probes = self._generate_probes(hash_values, num_probes) for probe_key in probes: bucket = self.tables[table_idx].get(probe_key, []) candidates.update(bucket) if not candidates: # Fallback: random sample sample_size = min(1000, len(self.vectors)) candidates = set(np.random.choice( len(self.vectors), sample_size, replace=False )) # Compute exact distances to candidates candidate_list = list(candidates) candidate_vectors = self.vectors[candidate_list] distances = np.linalg.norm( candidate_vectors - query.reshape(1, -1), axis=1 ) # Sort and return top-k sorted_indices = np.argsort(distances)[:num_neighbors] return [ (candidate_list[i], float(distances[i])) for i in sorted_indices ] def _generate_probes( self, hash_values: np.ndarray, num_probes: int ) -> List[Tuple]: """ Generate probe sequences for multi-probe LSH. This implements a simple perturbation strategy: perturb each dimension by ±1 in order of smallest fractional part (most likely to change bucket). """ probes = [tuple(hash_values)] if num_probes <= 1: return probes # Compute fractional parts for perturbation priority k = len(hash_values) # Simple strategy: try flipping each dimension ±1 for i in range(k): if len(probes) >= num_probes: break # Try +1 perturbed = hash_values.copy() perturbed[i] += 1 probes.append(tuple(perturbed)) if len(probes) >= num_probes: break # Try -1 perturbed = hash_values.copy() perturbed[i] -= 1 probes.append(tuple(perturbed)) return probes[:num_probes] def get_stats(self) -> Dict: """Get statistics about the index.""" if self.vectors is None: return {"error": "Index not built"} bucket_sizes = [] for table in self.tables: bucket_sizes.extend(len(v) for v in table.values()) return { "num_vectors": len(self.vectors), "num_tables": self.num_tables, "num_projections": self.num_projections, "bucket_width": self.bucket_width, "total_buckets": sum(len(t) for t in self.tables), "avg_bucket_size": np.mean(bucket_sizes) if bucket_sizes else 0, "max_bucket_size": max(bucket_sizes) if bucket_sizes else 0, } def demo_e2lsh(): """Demonstrate E2LSH for approximate nearest neighbor search.""" np.random.seed(42) # Create dataset n = 50000 dim = 128 print(f"Dataset: {n} vectors, {dim} dimensions") vectors = np.random.randn(n, dim) # Create query (close to vector 0) query = vectors[0] + 0.5 * np.random.randn(dim) true_distance_to_0 = np.linalg.norm(query - vectors[0]) print(f"Query is {true_distance_to_0:.3f} away from vector 0") # Build index print("\nBuilding E2LSH index...") start = time.time() index = E2LSHIndex( dim=dim, num_projections=8, num_tables=25, bucket_width=4.0 ) index.fit(vectors) build_time = time.time() - start print(f"Build time: {build_time:.3f}s") print(f"Stats: {index.get_stats()}") # Query with different probe settings for num_probes in [1, 5, 10]: print(f"\nQuery with num_probes={num_probes}:") start = time.time() results = index.query(query, num_neighbors=10, num_probes=num_probes) query_time = time.time() - start print(f" Query time: {query_time*1000:.2f}ms") print(" Top 5 results:") for idx, dist in results[:5]: print(f" Vector {idx}: distance = {dist:.4f}") # Brute force comparison print("\nBrute force comparison:") 
start = time.time() all_distances = np.linalg.norm(vectors - query.reshape(1, -1), axis=1) bf_time = time.time() - start top_10_bf = np.argsort(all_distances)[:10] print(f" Time: {bf_time*1000:.2f}ms") print(" Top 5 results:") for idx in top_10_bf[:5]: print(f" Vector {idx}: distance = {all_distances[idx]:.4f}") # Recall calculation lsh_set = {idx for idx, _ in results} bf_set = set(top_10_bf) recall = len(lsh_set & bf_set) / len(bf_set) print(f"\nRecall@10: {recall:.1%}") if __name__ == "__main__": demo_e2lsh()The bucket width $w$ is the most critical parameter in E2LSH. It directly controls the tradeoff between precision and recall.
Intuition: $w$ sets the distance scale the hash can resolve. Distances well below $w$ rarely cross a bucket boundary, so such points usually collide; distances well above $w$ almost always cross one or more boundaries.
Optimal $w$ Selection:
Given a target distance threshold $r$ (points within distance $r$ are considered "near"), a good starting point is:
$$w \approx 4r$$
This ensures that points at the near distance $r$ collide with probability $\approx 0.80$ per hash function, while points at $4r$ collide with probability $\approx 0.37$; amplification then widens this gap. The table below (computed from the closed-form $p(c)$, and reproduced in code after the table) shows the tradeoff across choices of $w$:
| w/r ratio | p(near) at d=r | p(far) at d=4r | p₁/p₂ ratio |
|---|---|---|---|
| w = r | 0.37 | 0.10 | 3.7x |
| w = 2r | 0.61 | 0.20 | 3.1x |
| w = 4r | 0.80 | 0.37 | 2.2x |
| w = 8r | 0.90 | 0.61 | 1.5x |
| w = 16r | 0.95 | 0.80 | 1.2x |
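The table values can be reproduced from the closed-form $p(c)$ derived above (a sketch assuming SciPy):

```python
"""Sketch: reproduce the w/r table from the closed-form p(c)."""
import numpy as np
from scipy import stats

def p_closed(c: float) -> float:
    """Closed-form E2LSH collision probability, c = d/w."""
    return (1 - 2 * stats.norm.cdf(-1 / c)
            - (2 * c / np.sqrt(2 * np.pi)) * (1 - np.exp(-1 / (2 * c**2))))

for ratio in [1, 2, 4, 8, 16]:       # w = ratio * r
    p1 = p_closed(1 / ratio)          # near: d = r  => c = 1/ratio
    p2 = p_closed(4 / ratio)          # far:  d = 4r => c = 4/ratio
    print(f"w = {ratio:>2}r: p1 = {p1:.2f}, p2 = {p2:.2f}, ratio = {p1 / p2:.1f}x")
```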
The Tradeoff:
Smaller $w$ gives better discrimination (higher p₁/p₂ ratio) but lower absolute collision probabilities. This means:

- Fewer far points fall into each bucket, so less time is wasted filtering false positives
- But near neighbors collide less often, so you need more hash tables $L$ to reach a target recall (quantified in the sketch below)

Larger $w$ gives poorer discrimination but higher collision probabilities:

- Near neighbors are caught with fewer tables
- But buckets fill with far points, inflating the candidate set that must be distance-checked
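The sketch below quantifies the first effect with the standard amplification formula: with $k$ projections per table and $L$ tables, a near neighbor is retrieved with probability $1 - (1 - p_1^k)^L$, so a lower $p_1$ (smaller $w$) demands many more tables for the same recall. The $p_1$ values are taken from the table above.

```python
"""Sketch: smaller w (lower p1) demands more tables for the same recall."""
import numpy as np

def tables_needed(p1: float, k: int, target_recall: float = 0.9) -> int:
    """Smallest L with 1 - (1 - p1^k)^L >= target_recall."""
    return int(np.ceil(np.log(1 - target_recall) / np.log(1 - p1**k)))

k = 8  # projections per table
for p1 in [0.95, 0.90, 0.80, 0.61, 0.37]:   # p(near) values from the table
    print(f"p1 = {p1:.2f}: need L >= {tables_needed(p1, k):>5} tables for 90% recall")
```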
Practical Guidance:

- Sample pairwise distances from your dataset to estimate the near-neighbor scale $r$
- Start with $w$ in the range $r$ to $4r$ and adjust from there
- Tune on held-out queries, trading recall against candidate-set size
Some implementations use different bucket widths in different tables, covering multiple scales. This is useful when the distribution of relevant distances varies across queries. However, it complicates tuning and increases implementation complexity.
Let's crystallize the differences between these two LSH families.
When to Use Which:
| Application | Recommended LSH | Why |
|---|---|---|
| Text/document similarity | Cosine LSH | TF-IDF and embedding vectors are compared by direction |
| Image deduplication | Either | Depends on feature representation |
| Facial recognition | Euclidean | Distance in embedding space matters |
| Geospatial search | Euclidean | Physical distance is absolute |
| Recommendation systems | Cosine | User preference vectors are directional |
| Audio fingerprinting | Euclidean | Feature magnitudes carry information |
Hybrid Approaches:
Some applications normalize vectors and use cosine LSH as a proxy for Euclidean distance. For unit vectors:
$$\|\mathbf{u} - \mathbf{v}\|_2^2 = 2 - 2\cos(\theta) = 2(1 - \cos\theta)$$
So cosine similarity correlates with Euclidean distance for normalized vectors. However, the collision probability functions are different, so the amplification behavior differs.
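A quick numerical check of this identity (a sketch with random unit vectors):

```python
"""Sketch: for unit vectors, ||u - v||^2 = 2(1 - cos theta)."""
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(64); u /= np.linalg.norm(u)
v = rng.standard_normal(64); v /= np.linalg.norm(v)

lhs = np.linalg.norm(u - v) ** 2
rhs = 2 * (1 - np.dot(u, v))       # cos theta = u·v for unit vectors
print(f"{lhs:.6f} == {rhs:.6f}")   # equal up to float rounding
```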
Even for normalized vectors, cosine LSH and Euclidean LSH have different collision probability curves. The optimal (k, L) settings and expected performance can differ significantly. Always match your LSH family to your actual distance metric.
We've thoroughly explored E2LSH, the locality-sensitive hashing scheme for Euclidean distance. Let's consolidate the key insights:

- p-stability: for a Gaussian vector $\mathbf{a}$, the projection difference $\mathbf{a} \cdot (\mathbf{u} - \mathbf{v})$ is distributed as $\|\mathbf{u} - \mathbf{v}\|_2 \cdot Z$, reducing a high-dimensional distance to the scale of a single random variable
- The hash $h(\mathbf{v}) = \lfloor (\mathbf{a} \cdot \mathbf{v} + b)/w \rfloor$ buckets these projections; the collision probability $p(c)$ depends only on $c = d/w$
- The bucket width $w$ is the critical tuning knob: small $w$ discriminates sharply but demands more tables, while large $w$ collides generously but inflates the candidate set
What's Next:
In the next page, we'll explore hash function families more broadly, including MinHash for Jaccard similarity and other specialized LSH schemes. Understanding the full landscape of LSH families will help you choose the right approach for any similarity search problem.
You now understand E2LSH, the Euclidean distance LSH based on p-stable distributions. This technique is foundational for approximate nearest neighbor search in metric spaces where absolute distance—not just direction—matters.