So far, we've explored LSH for two similarity measures: cosine similarity (random hyperplanes) and Euclidean distance (p-stable distributions). But the world of similarity is much richer. Documents are compared using Jaccard similarity of their word sets. Binary features use Hamming distance. Strings have edit distance. Probability distributions have KL divergence.
A central fact about LSH is that each similarity or distance metric requires its own hash function family, tailored to that metric's specific properties. Using the wrong family, such as applying MinHash to Euclidean vectors or random hyperplanes to sets, produces meaningless results.
This page surveys the major LSH hash function families, providing the theory, intuition, and implementation for each. By the end, you'll have a comprehensive toolkit for similarity search across diverse data types.
By the end of this page, you will understand MinHash for Jaccard similarity, bit sampling for Hamming distance, SimHash for angular similarity with binary output, and the theoretical framework for constructing new LSH families. You'll be able to match problems to the appropriate hash family.
MinHash is one of the most important LSH families, designed for Jaccard similarity between sets. It has widespread applications in document deduplication, plagiarism detection, and recommendation systems.
Jaccard Similarity:
For two sets $A$ and $B$, the Jaccard similarity is:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
This measures the overlap between sets normalized by their union. Two identical sets have $J = 1$; disjoint sets have $J = 0$.
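As a quick worked example (the sets below are arbitrary illustrations):

```python
A = {"the", "quick", "brown", "fox"}
B = {"the", "lazy", "brown", "dog"}

intersection = A & B               # {"the", "brown"} -> size 2
union = A | B                      # 6 distinct words
jaccard = len(intersection) / len(union)
print(jaccard)                     # 2 / 6 ≈ 0.333
```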
The MinHash Function:
Given a random permutation $\pi$ of the universe of elements, define:
$$h_\pi(S) = \min_{x \in S} \pi(x)$$
That is, apply the permutation to all elements in $S$ and return the minimum value.
The Remarkable Theorem:
$$\Pr[h_\pi(A) = h_\pi(B)] = J(A, B)$$
Proof:
Consider the union $A \cup B$. Under a random permutation, the element with the minimum permuted value is equally likely to be any element in $A \cup B$.
The minimum over $A$ equals the minimum over $B$ precisely when the overall minimum of $A \cup B$ lies in $A \cap B$:
$$\Pr[h_\pi(A) = h_\pi(B)] = \frac{|A \cap B|}{|A \cup B|} = J(A, B)$$
$\blacksquare$
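The theorem is easy to check empirically with genuine random permutations. A minimal sketch (universe, sets, and trial count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
universe_size = 20
A = {0, 1, 2, 3, 4, 5}
B = {4, 5, 6, 7}

true_jaccard = len(A & B) / len(A | B)        # 2 / 8 = 0.25

collisions = 0
trials = 100_000
for _ in range(trials):
    pi = rng.permutation(universe_size)       # random permutation of the universe
    if min(pi[x] for x in A) == min(pi[x] for x in B):
        collisions += 1

print(true_jaccard, collisions / trials)      # both ≈ 0.25
```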
"""MinHash implementation for Jaccard similarity LSH.""" import numpy as npfrom typing import Set, List, Tuple, FrozenSetimport hashlibfrom collections import defaultdict class MinHash: """ MinHash signature generator for sets. Uses hash functions to simulate random permutations. For each hash function, we keep the minimum hash value across all elements in the set. """ def __init__( self, num_hashes: int = 128, seed: int = 42 ): """ Initialize MinHash with num_hashes hash functions. More hashes = better Jaccard similarity estimation, but larger signatures. """ self.num_hashes = num_hashes np.random.seed(seed) # Generate random coefficients for hash functions # h(x) = (a * x + b) mod p # where p is a large prime self.p = 2**31 - 1 # Mersenne prime self.a = np.random.randint(1, self.p, size=num_hashes) self.b = np.random.randint(0, self.p, size=num_hashes) def _element_hash(self, element) -> int: """Hash an element to an integer.""" # Use SHA256 for consistent hashing of any object h = hashlib.sha256(str(element).encode()).digest() return int.from_bytes(h[:8], 'big') def signature(self, s: Set) -> np.ndarray: """ Compute the MinHash signature for a set. Returns array of length num_hashes. """ if not s: # Empty set: return max values return np.full(self.num_hashes, self.p) # Initialize with infinity sig = np.full(self.num_hashes, self.p) for element in s: # Get hash of element elem_hash = self._element_hash(element) # Apply each hash function hash_values = (self.a * elem_hash + self.b) % self.p # Keep minimum sig = np.minimum(sig, hash_values) return sig def estimate_jaccard( self, sig1: np.ndarray, sig2: np.ndarray ) -> float: """ Estimate Jaccard similarity from two MinHash signatures. """ return np.mean(sig1 == sig2) class MinHashLSH: """ LSH index using MinHash signatures. Uses banding technique: divide signature into b bands of r rows. Two sets are candidates if they match in at least one complete band. """ def __init__( self, num_hashes: int = 128, num_bands: int = 16, seed: int = 42 ): """ Initialize MinHash LSH. num_hashes must be divisible by num_bands. rows_per_band = num_hashes // num_bands Higher num_bands = more candidates (higher recall, lower precision) """ if num_hashes % num_bands != 0: raise ValueError("num_hashes must be divisible by num_bands") self.num_hashes = num_hashes self.num_bands = num_bands self.rows_per_band = num_hashes // num_bands self.minhash = MinHash(num_hashes, seed) # One hash table per band self.tables: List[dict] = [ defaultdict(list) for _ in range(num_bands) ] self.signatures: List[np.ndarray] = [] # Store signatures def add(self, set_id: int, s: Set): """Add a set to the index.""" sig = self.minhash.signature(s) self.signatures.append(sig) # Insert into each band's table for band_idx in range(self.num_bands): start = band_idx * self.rows_per_band end = start + self.rows_per_band band_key = tuple(sig[start:end]) self.tables[band_idx][band_key].append(set_id) def query(self, s: Set) -> List[int]: """ Find candidate similar sets. Returns list of set IDs that share at least one band. """ sig = self.minhash.signature(s) candidates = set() for band_idx in range(self.num_bands): start = band_idx * self.rows_per_band end = start + self.rows_per_band band_key = tuple(sig[start:end]) matches = self.tables[band_idx].get(band_key, []) candidates.update(matches) return list(candidates) def query_with_similarity( self, s: Set, threshold: float = 0.0 ) -> List[Tuple[int, float]]: """ Find similar sets with estimated Jaccard similarity. 
""" query_sig = self.minhash.signature(s) candidates = self.query(s) results = [] for cand_id in candidates: cand_sig = self.signatures[cand_id] est_jaccard = self.minhash.estimate_jaccard(query_sig, cand_sig) if est_jaccard >= threshold: results.append((cand_id, est_jaccard)) # Sort by similarity descending results.sort(key=lambda x: -x[1]) return results def demo_minhash(): """Demonstrate MinHash for document similarity.""" # Create document shingles (word n-grams) docs = [ "the quick brown fox jumps over the lazy dog", "the quick brown fox leaps over the lazy dog", "a fast brown fox jumps over a lazy dog", "the lazy cat sleeps on the mat", "machine learning is fascinating and powerful", ] def shingle(doc: str, k: int = 3) -> Set[str]: """Create k-shingles (character n-grams) from document.""" return set(doc[i:i+k] for i in range(len(doc) - k + 1)) # Create shingle sets doc_shingles = [shingle(doc) for doc in docs] print("MinHash Document Similarity Demo") print("=" * 50) # Compute true Jaccard similarities print("\nTrue Jaccard Similarities:") for i in range(len(docs)): for j in range(i+1, len(docs)): jac = len(doc_shingles[i] & doc_shingles[j]) / \ len(doc_shingles[i] | doc_shingles[j]) print(f" Doc {i} vs Doc {j}: {jac:.4f}") # Build MinHash LSH index lsh = MinHashLSH(num_hashes=128, num_bands=16) for i, shingles in enumerate(doc_shingles): lsh.add(i, shingles) # Query for similar documents print("\nMinHash LSH Results (estimated similarity):") for i, shingles in enumerate(doc_shingles): results = lsh.query_with_similarity(shingles, threshold=0.1) # Filter self-match results = [(j, sim) for j, sim in results if j != i] if results: print(f" Doc {i} similar to: {results}") if __name__ == "__main__": demo_minhash()The banding technique divides the MinHash signature into b bands of r rows. Two sets are candidates if ALL r hash values match in at least ONE band. This implements the AND-OR amplification: AND within bands (all r must match), OR across bands (any band matches). The S-curve shape of the resulting probability function can be tuned via b and r.
Bit sampling is the simplest LSH family, designed for Hamming distance between binary vectors.
Hamming Distance:
For binary vectors $\mathbf{u}, \mathbf{v} \in \{0, 1\}^d$:
$$d_H(\mathbf{u}, \mathbf{v}) = |\{i : u_i \neq v_i\}|$$
This counts the number of positions where the vectors differ.
The Hash Function:
For a random index $i \in \{1, 2, \ldots, d\}$:
$$h_i(\mathbf{v}) = v_i$$
Simply return the bit at position $i$.
Collision Probability:
$$\Pr[h_i(\mathbf{u}) = h_i(\mathbf{v})] = 1 - \frac{d_H(\mathbf{u}, \mathbf{v})}{d}$$
The probability that a random position matches equals the fraction of matching positions.
LSH Quality:
For Hamming distance, bit sampling achieves $\rho \leq 1/c$, where $c$ is the approximation ratio; this matches the known lower bound for LSH in Hamming space, so the scheme is essentially optimal despite its simplicity.
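To make the numbers concrete, here is a small sketch of the single-bit collision probability and its AND-amplified version for a hypothetical setting (d = 256 bits and k = 16 sampled bits, matching the index below):

```python
d = 256   # vector length in bits
k = 16    # bits sampled per hash key (AND amplification within one table)

for d_hamming in (10, 40, 100):
    p_single = 1 - d_hamming / d    # one sampled bit matches
    p_all_k = p_single ** k         # all k sampled bits match (one table's key)
    print(f"d_H={d_hamming:>3}: single-bit p={p_single:.3f}, {k}-bit p={p_all_k:.4f}")
```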
"""Bit Sampling LSH for Hamming Distance.""" import numpy as npfrom typing import Tuple, List, Dict, Setfrom collections import defaultdict class BitSamplingLSH: """ LSH for binary vectors using bit sampling. Hash function: sample k random bit positions. Collision probability = (1 - d_H/d)^k where d_H is Hamming distance. """ def __init__( self, dim: int, num_bits: int = 16, # k: bits per hash num_tables: int = 20, # L: number of tables seed: int = 42 ): np.random.seed(seed) self.dim = dim self.num_bits = num_bits self.num_tables = num_tables # Random bit positions for each table # Shape: (L, k) self.bit_indices = np.array([ np.random.choice(dim, num_bits, replace=False) for _ in range(num_tables) ]) # Hash tables self.tables: List[Dict] = [ defaultdict(list) for _ in range(num_tables) ] self.vectors: List[np.ndarray] = [] def _hash( self, vector: np.ndarray, table_idx: int ) -> Tuple[int, ...]: """Extract bits at sampled positions.""" bits = vector[self.bit_indices[table_idx]] return tuple(bits.astype(int)) def add(self, vector_id: int, vector: np.ndarray): """Add a binary vector to the index.""" self.vectors.append(vector.copy()) for table_idx in range(self.num_tables): hash_key = self._hash(vector, table_idx) self.tables[table_idx][hash_key].append(vector_id) def query( self, query: np.ndarray, k: int = 10 ) -> List[Tuple[int, int]]: """ Find approximate nearest neighbors by Hamming distance. Returns list of (vector_id, hamming_distance) tuples. """ candidates: Set[int] = set() for table_idx in range(self.num_tables): hash_key = self._hash(query, table_idx) matches = self.tables[table_idx].get(hash_key, []) candidates.update(matches) if not candidates: # Fallback: random sample candidates = set(range(min(100, len(self.vectors)))) # Compute exact Hamming distances results = [] for cand_id in candidates: hamming_dist = np.sum( query != self.vectors[cand_id] ) results.append((cand_id, int(hamming_dist))) # Sort by distance results.sort(key=lambda x: x[1]) return results[:k] def demo_bit_sampling(): """Demonstrate bit sampling LSH.""" np.random.seed(42) dim = 256 n = 10000 # Generate random binary vectors vectors = np.random.randint(0, 2, size=(n, dim)) # Create a query similar to vector 0 query = vectors[0].copy() # Flip 10 random bits flip_positions = np.random.choice(dim, 10, replace=False) query[flip_positions] = 1 - query[flip_positions] print("Bit Sampling LSH Demo") print(f"Dataset: {n} vectors, {dim} bits each") print(f"Query: Hamming distance {np.sum(query != vectors[0])} from vector 0") print() # Build index lsh = BitSamplingLSH(dim=dim, num_bits=16, num_tables=30) for i, v in enumerate(vectors): lsh.add(i, v) # Query results = lsh.query(query, k=10) print("Top 10 results (LSH):") for vec_id, dist in results: print(f" Vector {vec_id}: Hamming distance = {dist}") # Brute force comparison all_dists = [np.sum(query != v) for v in vectors] bf_top10 = np.argsort(all_dists)[:10] print("\nTop 10 results (brute force):") for vec_id in bf_top10: print(f" Vector {vec_id}: Hamming distance = {all_dists[vec_id]}") # Recall lsh_set = {r[0] for r in results} bf_set = set(bf_top10) recall = len(lsh_set & bf_set) / len(bf_set) print(f"\nRecall@10: {recall:.1%}") if __name__ == "__main__": demo_bit_sampling()SimHash (also known as Sign Random Projection) combines random hyperplane LSH with compact binary codes. It's particularly efficient for large-scale near-duplicate detection where storage is a concern.
The Idea:
Instead of storing $k$ separate single-bit hashes from $k$ hyperplanes, SimHash combines them into a single $k$-bit binary code. Each bit is determined by one hyperplane, and the Hamming distance between codes approximates the angular distance between vectors.
SimHash Definition:
For $k$ random hyperplanes $\mathbf{r}_1, \ldots, \mathbf{r}_k$ and a vector $\mathbf{v}$:
$$\text{SimHash}(\mathbf{v}) = (\text{sign}(\mathbf{r}_1 \cdot \mathbf{v}), \ldots, \text{sign}(\mathbf{r}_k \cdot \mathbf{v}))$$
Key Property:
The expected Hamming distance between SimHash codes equals $k \cdot \theta / \pi$:
$$\mathbb{E}[d_H(\text{SimHash}(\mathbf{u}), \text{SimHash}(\mathbf{v}))] = k \cdot \frac{\theta}{\pi}$$
where $\theta$ is the angle between $\mathbf{u}$ and $\mathbf{v}$.
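As a quick sanity check of this property (the dimensionality, bit count, and angle below are arbitrary): a pair of vectors at 60° should differ in about $k/3$ bits of a $k$-bit code.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, k = 128, 64
theta = np.pi / 3                            # 60 degrees

# Build two unit vectors at angle theta
u = rng.standard_normal(dim); u /= np.linalg.norm(u)
w = rng.standard_normal(dim); w -= (w @ u) * u; w /= np.linalg.norm(w)
v = np.cos(theta) * u + np.sin(theta) * w

planes = rng.standard_normal((k, dim))       # k random hyperplanes
bits_u = (planes @ u) > 0
bits_v = (planes @ v) > 0

print("observed Hamming distance :", int(np.sum(bits_u != bits_v)))
print("expected  k * theta / pi  :", k * theta / np.pi)   # ≈ 21.3
```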
Practical Advantage:
SimHash codes are: compact (a single 64-bit fingerprint per item rather than 64 separate hash values), cheap to compare (one XOR plus a popcount yields the Hamming distance), and easy to store and index at web scale.
"""SimHash implementation for compact similarity fingerprints.""" import numpy as npfrom typing import List, Tuple class SimHash: """ SimHash: Sign Random Projection for compact fingerprints. Projects vectors onto k random hyperplanes and encodes as k-bit binary codes. Hamming distance between codes approximates angular distance between vectors. """ def __init__(self, dim: int, num_bits: int = 64, seed: int = 42): """ Initialize SimHash. Parameters: ----------- dim : int Dimensionality of input vectors num_bits : int Number of bits in the hash (must be <= 64 for uint64) """ np.random.seed(seed) self.dim = dim self.num_bits = num_bits # Random hyperplanes self.hyperplanes = np.random.randn(num_bits, dim) # Normalize (optional but good practice) self.hyperplanes /= np.linalg.norm( self.hyperplanes, axis=1, keepdims=True ) def hash(self, vector: np.ndarray) -> int: """ Compute SimHash fingerprint as integer. """ projections = np.dot(self.hyperplanes, vector) bits = (projections > 0).astype(np.uint64) # Pack bits into integer fingerprint = 0 for i, bit in enumerate(bits): fingerprint |= (bit << i) return int(fingerprint) def hash_batch(self, vectors: np.ndarray) -> np.ndarray: """ Compute SimHash fingerprints for multiple vectors. """ # (n, dim) @ (dim, num_bits) -> (n, num_bits) projections = np.dot(vectors, self.hyperplanes.T) bits = (projections > 0).astype(np.uint64) # Pack bits fingerprints = np.zeros(len(vectors), dtype=np.uint64) for i in range(self.num_bits): fingerprints |= (bits[:, i] << i) return fingerprints @staticmethod def hamming_distance(fp1: int, fp2: int) -> int: """ Compute Hamming distance between two fingerprints. Uses bit counting on XOR result. """ xor = fp1 ^ fp2 return bin(xor).count('1') def estimate_cosine_similarity( self, fp1: int, fp2: int ) -> float: """ Estimate cosine similarity from fingerprints. Hamming distance k -> angle θ ≈ π * k / num_bits Cosine similarity = cos(θ) """ hamming = self.hamming_distance(fp1, fp2) angle = np.pi * hamming / self.num_bits return np.cos(angle) class SimHashLSH: """ LSH index using SimHash fingerprints. Uses multiple hash tables with partial fingerprint matches. """ def __init__( self, dim: int, num_bits: int = 64, num_tables: int = 10, bits_per_table: int = 8, seed: int = 42 ): """ Initialize SimHash LSH. Each table uses bits_per_table bits from the fingerprint. """ self.simhash = SimHash(dim, num_bits, seed) self.num_tables = num_tables self.bits_per_table = bits_per_table # Masks for extracting bits for each table np.random.seed(seed + 1000) self.bit_indices = [ np.sort(np.random.choice( num_bits, bits_per_table, replace=False )) for _ in range(num_tables) ] # Hash tables self.tables: List[dict] = [{} for _ in range(num_tables)] self.fingerprints: List[int] = [] def _extract_bits( self, fingerprint: int, bit_indices: np.ndarray ) -> int: """Extract specific bits from fingerprint.""" result = 0 for i, idx in enumerate(bit_indices): bit = (fingerprint >> idx) & 1 result |= (bit << i) return result def add(self, item_id: int, vector: np.ndarray): """Add a vector to the index.""" fp = self.simhash.hash(vector) self.fingerprints.append(fp) for table_idx in range(self.num_tables): key = self._extract_bits(fp, self.bit_indices[table_idx]) if key not in self.tables[table_idx]: self.tables[table_idx][key] = [] self.tables[table_idx][key].append(item_id) def query( self, vector: np.ndarray, k: int = 10, max_hamming: int = 5 ) -> List[Tuple[int, int]]: """ Find similar items. Returns (item_id, hamming_distance) sorted by distance. 
""" query_fp = self.simhash.hash(vector) candidates = set() for table_idx in range(self.num_tables): key = self._extract_bits( query_fp, self.bit_indices[table_idx] ) matches = self.tables[table_idx].get(key, []) candidates.update(matches) # Compute exact Hamming distances results = [] for item_id in candidates: hamming = self.simhash.hamming_distance( query_fp, self.fingerprints[item_id] ) if hamming <= max_hamming: results.append((item_id, hamming)) results.sort(key=lambda x: x[1]) return results[:k] def demo_simhash(): """Demonstrate SimHash for near-duplicate detection.""" np.random.seed(42) dim = 256 n = 10000 # Generate random vectors vectors = np.random.randn(n, dim) vectors /= np.linalg.norm(vectors, axis=1, keepdims=True) # Create similar vectors (small perturbations) similar_groups = [] for i in range(100): base = vectors[i] # Create 3 similar vectors perturbations = 0.1 * np.random.randn(3, dim) similar = base + perturbations similar /= np.linalg.norm(similar, axis=1, keepdims=True) similar_groups.append((i, similar)) print("SimHash Demo") print(f"Dataset: {n} base vectors + similar variants") print() # Build index lsh = SimHashLSH( dim=dim, num_bits=64, num_tables=10, bits_per_table=10 ) for i, v in enumerate(vectors): lsh.add(i, v) # Also add similar vectors with special IDs for i, (base_id, similar_vecs) in enumerate(similar_groups): for j, sim_vec in enumerate(similar_vecs): item_id = n + i * 3 + j # IDs beyond original dataset lsh.fingerprints.append(lsh.simhash.hash(sim_vec)) # Query with one of the similar vectors test_idx = 0 query_vec = similar_groups[test_idx][1][0] # First similar variant base_id = similar_groups[test_idx][0] print(f"Query: similar variant of vector {base_id}") # Compute true similarity true_sim = np.dot(query_vec, vectors[base_id]) print(f"True cosine similarity to base: {true_sim:.4f}") # Find using LSH results = lsh.query(query_vec, k=5, max_hamming=10) print("\nTop 5 matches (by Hamming distance):") for item_id, hamming_dist in results: est_sim = lsh.simhash.estimate_cosine_similarity( lsh.simhash.hash(query_vec), lsh.fingerprints[item_id] if item_id < len(lsh.fingerprints) else lsh.simhash.hash(vectors[item_id]) ) print(f" Item {item_id}: Hamming={hamming_dist}, Est.Sim={est_sim:.4f}") if __name__ == "__main__": demo_simhash()Let's organize the landscape of LSH hash families by their target similarity/distance metric.
| Metric | Hash Family | Hash Function | Collision Probability |
|---|---|---|---|
| Jaccard similarity | MinHash | min_x∈S π(x) | J(A,B) |
| Cosine similarity | Random Hyperplanes | sign(r·v) | 1 - θ/π |
| Angular distance | SimHash | k-bit code of signs | 1 - θ/π per bit |
| Euclidean (L2) | p-stable (Gaussian) | ⌊(a·v+b)/w⌋ | Integral formula (depends on bucket width w) |
| Manhattan (L1) | p-stable (Cauchy) | ⌊(a·v+b)/w⌋ | Integral formula (depends on bucket width w) |
| Hamming distance | Bit Sampling | v[i] for random i | 1 - d_H/d |
| Chi-squared distance | Spherical LSH | Specialized | Metric-dependent |
Choosing the Right Family: start from the similarity measure your application actually needs, then read off the matching family from the table above. Sets and shingled documents call for MinHash, binary feature vectors for bit sampling, real-valued embeddings compared by angle for random hyperplanes or SimHash, and L1/L2 distances for the p-stable constructions. A small dispatcher in that spirit is sketched below.
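A hypothetical convenience wrapper (the classes referenced are this page's earlier examples, not a standard library API) that makes the metric-to-family mapping explicit:

```python
def make_lsh_index(metric: str, **kwargs):
    """Pick an LSH index class based on the similarity metric."""
    if metric == "jaccard":
        return MinHashLSH(**kwargs)        # sets / shingled documents
    if metric == "hamming":
        return BitSamplingLSH(**kwargs)    # binary feature vectors
    if metric in ("cosine", "angular"):
        return SimHashLSH(**kwargs)        # real-valued embeddings
    raise ValueError(f"No LSH family wired up for metric: {metric}")

# Example usage:
# index = make_lsh_index("jaccard", num_hashes=128, num_bands=16)
```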
Red Flags - When NOT to Use LSH: if the dataset is small enough that brute-force search is already fast, if you need exact rather than approximate nearest neighbors, or if your similarity measure has no known LSH family and cannot be embedded into one that does, LSH adds complexity without a clear payoff.
What if you have a custom similarity measure not covered by existing LSH families? There are general techniques for constructing LSH families:
Method 1: Metric Embedding
If your metric can be approximately embedded into a "standard" space (Euclidean, Hamming, etc.), you can: (1) map each object into that space, (2) apply the standard space's LSH family to the embedded points, and (3) account for the embedding's distortion when choosing thresholds and parameters.
Example: Edit distance on strings can be embedded into Hamming space (with some distortion), then use bit sampling LSH.
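A rough sketch of that idea using a shingle-indicator embedding (this is a simple heuristic in that spirit, not the formal edit-distance embedding, and the helper below is illustrative). It reuses the `BitSamplingLSH` class defined earlier on this page:

```python
import hashlib
import numpy as np

def string_to_binary(s: str, k: int = 3, dim: int = 256) -> np.ndarray:
    """Indicator vector of hashed k-shingles; strings with small edit distance
    share most shingles, so their vectors tend to be close in Hamming distance."""
    v = np.zeros(dim, dtype=int)
    for i in range(max(len(s) - k + 1, 1)):
        shingle = s[i:i + k]
        idx = int(hashlib.md5(shingle.encode()).hexdigest(), 16) % dim
        v[idx] = 1
    return v

lsh = BitSamplingLSH(dim=256, num_bits=16, num_tables=20)
for i, word in enumerate(["locality", "localities", "hashing", "hasher"]):
    lsh.add(i, string_to_binary(word))

print(lsh.query(string_to_binary("localty"), k=2))  # nearest by Hamming distance
```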
Method 2: Kernel-Based LSH
For positive-definite kernels, there exist LSH-like techniques: kernelized LSH draws approximate random hyperplanes in the kernel-induced feature space using only kernel evaluations, and for shift-invariant kernels you can first build an explicit random-feature embedding and then apply standard hyperplane or p-stable LSH to the embedded vectors.
Method 3: Learning to Hash
For complex similarities, learn hash functions from data: train projections or compact binary codes (for example, spectral hashing, learned rotations, or neural encoders) so that pairs labeled as similar receive nearby codes, then index the codes with a Hamming-space lookup.
These learned methods can sometimes outperform random LSH on specific data distributions, but lose the theoretical guarantees.
For an LSH family to work, you need: (1) a hash function whose collision probability decreases with distance, (2) the ability to sample random hash functions efficiently, and (3) an amplification strategy to achieve desired precision/recall. If you can establish these properties for your metric, you have a valid LSH family.
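If you are evaluating a candidate hash family of your own, the first property is straightforward to check empirically. A minimal harness sketch, where the hash-family sampler and pair generator are placeholders you supply for your metric:

```python
import numpy as np
from typing import Callable, Dict, List

def empirical_collision_curve(
    sample_hash: Callable[[], Callable],    # draws a random hash function h(x)
    sample_pair: Callable[[float], tuple],  # draws a pair (x, y) at a given distance
    distances: List[float],
    trials: int = 2000,
) -> Dict[float, float]:
    """Estimate Pr[h(x) = h(y)] at each distance; for a valid LSH family the
    curve should decrease monotonically as distance grows."""
    curve = {}
    for dist in distances:
        hits = 0
        for _ in range(trials):
            h = sample_hash()
            x, y = sample_pair(dist)
            hits += (h(x) == h(y))
        curve[dist] = hits / trials
    return curve
```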
Cross-polytope LSH is a modern alternative to random hyperplane LSH for angular (cosine) similarity that achieves better theoretical bounds and practical performance.
The Idea:
Instead of using hyperplanes to produce one bit per hash, cross-polytope LSH applies a random rotation to the (normalized) input vector and hashes it to whichever signed standard basis vector $\pm\mathbf{e}_i$ is closest, giving $2d$ possible hash values per hash function.
Mathematical Definition:
For a random rotation matrix $R$ and unit vector $\mathbf{v}$:
$$h_R(\mathbf{v}) = \arg\max_{1 \leq i \leq d} |\mathbf{r}_i \cdot \mathbf{v}|$$
where $\mathbf{r}_i$ is the $i$-th row of $R$. The full hash value is this index together with the sign of the winning coordinate, so each hash function has $2d$ possible outputs, one per vertex of the cross-polytope $\{\pm\mathbf{e}_1, \ldots, \pm\mathbf{e}_d\}$.
Why It's Better: its collision probability falls off much faster with angle than the hyperplane family's $1 - \theta/\pi$, which translates into a smaller exponent $\rho$; cross-polytope LSH is asymptotically optimal for angular distance among data-independent LSH schemes, so fewer tables are needed for the same recall.
Practical Consideration:
Random rotations are expensive (O(d²) or O(d log d) with fast methods). For very high-dimensional data, the computational overhead may outweigh the theoretical benefits.
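One common workaround, used in practical cross-polytope implementations, is to replace a dense random rotation with a few "random sign flip + Hadamard transform" blocks. The sketch below uses a dense Hadamard matrix for clarity (so each product is still O(d²)); swapping in a fast Walsh-Hadamard transform brings the cost down to O(d log d). This is an illustrative pseudo-rotation, not an exact uniform rotation:

```python
import numpy as np
from scipy.linalg import hadamard

def pseudo_rotation(v: np.ndarray, num_blocks: int = 3, seed: int = 0) -> np.ndarray:
    """Approximate a random rotation with 'sign flip + Hadamard' blocks.
    Requires len(v) to be a power of two; each block is orthogonal, so norms
    are preserved exactly."""
    d = len(v)
    rng = np.random.default_rng(seed)
    H = hadamard(d).astype(float) / np.sqrt(d)   # orthonormal Hadamard matrix
    out = v.astype(float).copy()
    for _ in range(num_blocks):
        signs = rng.choice([-1.0, 1.0], size=d)  # random diagonal ±1 matrix
        out = H @ (signs * out)
    return out

# v = np.random.randn(64); r = pseudo_rotation(v)   # ||r|| == ||v||
```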
"""Cross-Polytope LSH: A more efficient alternative to random hyperplanes.""" import numpy as npfrom scipy.stats import special_ortho_group class CrossPolytopeLSH: """ Cross-Polytope LSH for cosine similarity. Hash value is the index of the canonical direction (±e_1, ..., ±e_d) closest to the rotated vector. """ def __init__( self, dim: int, num_tables: int = 10, seed: int = 42 ): np.random.seed(seed) self.dim = dim self.num_tables = num_tables # Random rotation matrices for each table # Using Haar-distributed random orthogonal matrices self.rotations = [ special_ortho_group.rvs(dim) for _ in range(num_tables) ] def hash( self, vector: np.ndarray, table_idx: int ) -> int: """ Compute cross-polytope hash. Returns index in [0, 2d-1]: - 0 to d-1: positive direction e_i - d to 2d-1: negative direction -e_i """ # Normalize vector v = vector / np.linalg.norm(vector) # Apply rotation rotated = np.dot(self.rotations[table_idx], v) # Find maximum absolute coordinate abs_coords = np.abs(rotated) max_idx = np.argmax(abs_coords) # Determine sign if rotated[max_idx] >= 0: return max_idx # Positive direction else: return self.dim + max_idx # Negative direction def hash_batch( self, vectors: np.ndarray ) -> np.ndarray: """ Compute hashes for all vectors in all tables. Returns: (num_tables, num_vectors) array """ n = len(vectors) # Normalize norms = np.linalg.norm(vectors, axis=1, keepdims=True) v_normalized = vectors / norms hashes = np.zeros((self.num_tables, n), dtype=int) for table_idx in range(self.num_tables): # Rotate all vectors rotated = np.dot(v_normalized, self.rotations[table_idx].T) # Find max absolute coordinate for each vector abs_coords = np.abs(rotated) max_indices = np.argmax(abs_coords, axis=1) # Determine signs signs = np.array([ rotated[i, max_indices[i]] >= 0 for i in range(n) ]) hashes[table_idx] = np.where( signs, max_indices, self.dim + max_indices ) return hashes def analyze_cross_polytope(): """Compare cross-polytope vs random hyperplane LSH.""" np.random.seed(42) dim = 64 n_tests = 1000 # Test collision probabilities at various angles angles = np.linspace(0, np.pi, 11) hyperplane = np.random.randn(dim) hyperplane /= np.linalg.norm(hyperplane) cross_poly = CrossPolytopeLSH(dim, num_tables=1) print("Collision Probability Comparison") print(f"{'Angle':>10} {'Hyperplane':>12} {'CrossPoly':>12} {'Theoretical HP':>15}") print("-" * 52) for theta in angles: hp_collisions = 0 cp_collisions = 0 for _ in range(n_tests): # Generate random vector pair at angle theta u = np.random.randn(dim) u /= np.linalg.norm(u) # Generate v at angle theta from u v_perp = np.random.randn(dim) v_perp -= np.dot(v_perp, u) * u v_perp /= np.linalg.norm(v_perp) v = np.cos(theta) * u + np.sin(theta) * v_perp # Random hyperplane hp_u = np.sign(np.dot(hyperplane, u)) hp_v = np.sign(np.dot(hyperplane, v)) if hp_u == hp_v: hp_collisions += 1 # Cross-polytope cp_u = cross_poly.hash(u, 0) cp_v = cross_poly.hash(v, 0) if cp_u == cp_v: cp_collisions += 1 hp_prob = hp_collisions / n_tests cp_prob = cp_collisions / n_tests theoretical_hp = 1 - theta / np.pi print(f"{theta:>10.4f} {hp_prob:>12.4f} {cp_prob:>12.4f} {theoretical_hp:>15.4f}") if __name__ == "__main__": analyze_cross_polytope()Standard LSH uses the same hash function for both database points and queries. Asymmetric LSH uses different (but related) hash functions for the two roles, enabling better recall.
The Motivation:
In many applications, we hash database points once during indexing but hash queries many times during search. We have different time budgets: indexing runs offline and can afford extra work per point, while queries run online and must stay cheap, so any extra effort spent on the query side has to buy something concrete, such as higher recall from the same index.
Asymmetric Approach:
For database point $p$ and query $q$: index $p$ with one hash function $g$, but look up $q$ with a related function (or a small set of perturbed keys) $g'$ chosen so that $g'(q)$ still reaches $g(p)$'s bucket whenever $p$ and $q$ are close, even if $q$ alone would have hashed just across a bucket boundary.
Example: Asymmetric E2LSH
In standard E2LSH, both points use the same projection and offset. In asymmetric E2LSH: the database point is stored only in its own bucket $\lfloor (\mathbf{a} \cdot \mathbf{p} + b) / w \rfloor$, while the query additionally probes the adjacent buckets whose boundaries it falls near.
This increases recall without increasing false positives—the query checks more buckets, finding neighbors that just barely crossed a boundary.
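A minimal sketch of this asymmetry for a single E2LSH-style projection (the projection, bucket width, and probing radius below are illustrative choices, not a specific published parameterization): the database point is stored under one key, while the query also probes neighboring buckets.

```python
import numpy as np

class AsymmetricBucketProbe:
    """One-dimensional E2LSH-style hash: symmetric insert, asymmetric query."""

    def __init__(self, dim: int, w: float = 4.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.a = rng.standard_normal(dim)   # random projection direction
        self.b = rng.uniform(0, w)          # random offset
        self.w = w
        self.buckets: dict = {}

    def _bucket(self, v: np.ndarray) -> int:
        return int(np.floor((self.a @ v + self.b) / self.w))

    def add(self, item_id: int, v: np.ndarray):
        # Database side: one bucket per point, computed once at index time.
        self.buckets.setdefault(self._bucket(v), []).append(item_id)

    def query(self, q: np.ndarray, probe_radius: int = 1) -> set:
        # Query side: also probe neighboring buckets to catch points
        # that landed just across a bucket boundary.
        center = self._bucket(q)
        candidates = set()
        for offset in range(-probe_radius, probe_radius + 1):
            candidates.update(self.buckets.get(center + offset, []))
        return candidates
```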
Asymmetric LSH is most beneficial when: (1) query time is more constrained than index time, (2) high recall is critical, (3) the distance distribution has many points near decision boundaries. It's essentially a principled form of multi-probing with theoretical backing.
We've surveyed the major LSH hash function families and their applications. Let's consolidate the key insights: every metric gets its own family (MinHash for Jaccard, bit sampling for Hamming, random hyperplanes and SimHash for cosine/angular similarity, p-stable projections for L1/L2); all of them follow the same recipe of a randomized hash whose collision probability decreases with distance, amplified by AND-OR combinations of hashes and tables; and newer constructions such as cross-polytope and asymmetric LSH spend extra computation to get better collision behavior from the same framework.
What's Next:
In the next page, we'll dive deep into collision probability analysis—the mathematical foundation that lets us predict and optimize LSH performance. Understanding these probability calculations is essential for tuning LSH parameters in practice.
You now have a comprehensive understanding of the major LSH hash function families. You can match problems to appropriate hash families and understand the tradeoffs between different approaches.