So far, we've explored LSH for two similarity measures: cosine similarity (random hyperplanes) and Euclidean distance (p-stable distributions). But the world of similarity is much richer. Documents are compared using Jaccard similarity of their word sets. Binary features use Hamming distance. Strings have edit distance. Probability distributions have KL divergence.
A central fact about LSH is that each similarity or distance metric requires its own hash function family, tailored to that metric's specific properties. Using the wrong family, such as applying MinHash to Euclidean vectors or random hyperplanes to sets, produces meaningless results.
This page surveys the major LSH hash function families, providing the theory, intuition, and implementation for each. By the end, you'll have a comprehensive toolkit for similarity search across diverse data types.
By the end of this page, you will understand MinHash for Jaccard similarity, bit sampling for Hamming distance, SimHash for angular similarity with binary output, and the theoretical framework for constructing new LSH families. You'll be able to match problems to the appropriate hash family.
MinHash is one of the most important LSH families, designed for Jaccard similarity between sets. It has widespread applications in document deduplication, plagiarism detection, and recommendation systems.
Jaccard Similarity:
For two sets $A$ and $B$, the Jaccard similarity is:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
This measures the overlap between sets normalized by their union. Two identical sets have $J = 1$; disjoint sets have $J = 0$.
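As a quick worked example (the sets below are arbitrary illustrations):

```python
A = {"the", "quick", "brown", "fox"}
B = {"the", "lazy", "brown", "dog"}

intersection = A & B               # {"the", "brown"} -> size 2
union = A | B                      # 6 distinct words
jaccard = len(intersection) / len(union)
print(jaccard)                     # 2 / 6 ≈ 0.333
```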
The MinHash Function:
Given a random permutation $\pi$ of the universe of elements, define:
$$h_\pi(S) = \min_{x \in S} \pi(x)$$
That is, apply the permutation to all elements in $S$ and return the minimum value.
The Remarkable Theorem:
$$\Pr[h_\pi(A) = h_\pi(B)] = J(A, B)$$
Proof:
Consider the union $A \cup B$. Under a random permutation, the element with the minimum permuted value is equally likely to be any element in $A \cup B$.
The minimum over $A$ equals the minimum over $B$ precisely when the overall minimum of $A \cup B$ lies in $A \cap B$:
$$\Pr[h_\pi(A) = h_\pi(B)] = \frac{|A \cap B|}{|A \cup B|} = J(A, B)$$
$\blacksquare$
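The theorem is easy to check empirically with genuine random permutations. A minimal sketch (universe, sets, and trial count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
universe_size = 20
A = {0, 1, 2, 3, 4, 5}
B = {4, 5, 6, 7}

true_jaccard = len(A & B) / len(A | B)        # 2 / 8 = 0.25

collisions = 0
trials = 100_000
for _ in range(trials):
    pi = rng.permutation(universe_size)       # random permutation of the universe
    if min(pi[x] for x in A) == min(pi[x] for x in B):
        collisions += 1

print(true_jaccard, collisions / trials)      # both ≈ 0.25
```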
"""MinHash implementation for Jaccard similarity LSH.""" import numpy as npfrom typing import Set, List, Tuple, FrozenSetimport hashlibfrom collections import defaultdict class MinHash: """ MinHash signature generator for sets. Uses hash functions to simulate random permutations. For each hash function, we keep the minimum hash value across all elements in the set. """ def __init__( self, num_hashes: int = 128, seed: int = 42 ): """ Initialize MinHash with num_hashes hash functions. More hashes = better Jaccard similarity estimation, but larger signatures. """ self.num_hashes = num_hashes np.random.seed(seed) # Generate random coefficients for hash functions # h(x) = (a * x + b) mod p # where p is a large prime self.p = 2**31 - 1 # Mersenne prime self.a = np.random.randint(1, self.p, size=num_hashes) self.b = np.random.randint(0, self.p, size=num_hashes) def _element_hash(self, element) -> int: """Hash an element to an integer.""" # Use SHA256 for consistent hashing of any object h = hashlib.sha256(str(element).encode()).digest() return int.from_bytes(h[:8], 'big') def signature(self, s: Set) -> np.ndarray: """ Compute the MinHash signature for a set. Returns array of length num_hashes. """ if not s: # Empty set: return max values return np.full(self.num_hashes, self.p) # Initialize with infinity sig = np.full(self.num_hashes, self.p) for element in s: # Get hash of element elem_hash = self._element_hash(element) # Apply each hash function hash_values = (self.a * elem_hash + self.b) % self.p # Keep minimum sig = np.minimum(sig, hash_values) return sig def estimate_jaccard( self, sig1: np.ndarray, sig2: np.ndarray ) -> float: """ Estimate Jaccard similarity from two MinHash signatures. """ return np.mean(sig1 == sig2) class MinHashLSH: """ LSH index using MinHash signatures. Uses banding technique: divide signature into b bands of r rows. Two sets are candidates if they match in at least one complete band. """ def __init__( self, num_hashes: int = 128, num_bands: int = 16, seed: int = 42 ): """ Initialize MinHash LSH. num_hashes must be divisible by num_bands. rows_per_band = num_hashes // num_bands Higher num_bands = more candidates (higher recall, lower precision) """ if num_hashes % num_bands != 0: raise ValueError("num_hashes must be divisible by num_bands") self.num_hashes = num_hashes self.num_bands = num_bands self.rows_per_band = num_hashes // num_bands self.minhash = MinHash(num_hashes, seed) # One hash table per band self.tables: List[dict] = [ defaultdict(list) for _ in range(num_bands) ] self.signatures: List[np.ndarray] = [] # Store signatures def add(self, set_id: int, s: Set): """Add a set to the index.""" sig = self.minhash.signature(s) self.signatures.append(sig) # Insert into each band's table for band_idx in range(self.num_bands): start = band_idx * self.rows_per_band end = start + self.rows_per_band band_key = tuple(sig[start:end]) self.tables[band_idx][band_key].append(set_id) def query(self, s: Set) -> List[int]: """ Find candidate similar sets. Returns list of set IDs that share at least one band. """ sig = self.minhash.signature(s) candidates = set() for band_idx in range(self.num_bands): start = band_idx * self.rows_per_band end = start + self.rows_per_band band_key = tuple(sig[start:end]) matches = self.tables[band_idx].get(band_key, []) candidates.update(matches) return list(candidates) def query_with_similarity( self, s: Set, threshold: float = 0.0 ) -> List[Tuple[int, float]]: """ Find similar sets with estimated Jaccard similarity. 
""" query_sig = self.minhash.signature(s) candidates = self.query(s) results = [] for cand_id in candidates: cand_sig = self.signatures[cand_id] est_jaccard = self.minhash.estimate_jaccard(query_sig, cand_sig) if est_jaccard >= threshold: results.append((cand_id, est_jaccard)) # Sort by similarity descending results.sort(key=lambda x: -x[1]) return results def demo_minhash(): """Demonstrate MinHash for document similarity.""" # Create document shingles (word n-grams) docs = [ "the quick brown fox jumps over the lazy dog", "the quick brown fox leaps over the lazy dog", "a fast brown fox jumps over a lazy dog", "the lazy cat sleeps on the mat", "machine learning is fascinating and powerful", ] def shingle(doc: str, k: int = 3) -> Set[str]: """Create k-shingles (character n-grams) from document.""" return set(doc[i:i+k] for i in range(len(doc) - k + 1)) # Create shingle sets doc_shingles = [shingle(doc) for doc in docs] print("MinHash Document Similarity Demo") print("=" * 50) # Compute true Jaccard similarities print("\nTrue Jaccard Similarities:") for i in range(len(docs)): for j in range(i+1, len(docs)): jac = len(doc_shingles[i] & doc_shingles[j]) / \ len(doc_shingles[i] | doc_shingles[j]) print(f" Doc {i} vs Doc {j}: {jac:.4f}") # Build MinHash LSH index lsh = MinHashLSH(num_hashes=128, num_bands=16) for i, shingles in enumerate(doc_shingles): lsh.add(i, shingles) # Query for similar documents print("\nMinHash LSH Results (estimated similarity):") for i, shingles in enumerate(doc_shingles): results = lsh.query_with_similarity(shingles, threshold=0.1) # Filter self-match results = [(j, sim) for j, sim in results if j != i] if results: print(f" Doc {i} similar to: {results}") if __name__ == "__main__": demo_minhash()The banding technique divides the MinHash signature into b bands of r rows. Two sets are candidates if ALL r hash values match in at least ONE band. This implements the AND-OR amplification: AND within bands (all r must match), OR across bands (any band matches). The S-curve shape of the resulting probability function can be tuned via b and r.
Bit sampling is the simplest LSH family, designed for Hamming distance between binary vectors.
Hamming Distance:
For binary vectors $\mathbf{u}, \mathbf{v} \in \{0, 1\}^d$:
$$d_H(\mathbf{u}, \mathbf{v}) = |\{i : u_i \neq v_i\}|$$
This counts the number of positions where the vectors differ.
The Hash Function:
For a random index $i \in \{1, 2, \ldots, d\}$:
$$h_i(\mathbf{v}) = v_i$$
Simply return the bit at position $i$.
Collision Probability:
$$\Pr[h_i(\mathbf{u}) = h_i(\mathbf{v})] = 1 - \frac{d_H(\mathbf{u}, \mathbf{v})}{d}$$
The probability that a random position matches equals the fraction of matching positions.
LSH Quality:
For Hamming distance, bit sampling achieves $\rho \leq 1/c$, where $c$ is the approximation ratio; this matches the known lower bound for LSH in Hamming space, so the scheme is essentially optimal despite its simplicity.
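To make the numbers concrete, here is a small sketch of the single-bit collision probability and its AND-amplified version for a hypothetical setting (d = 256 bits and k = 16 sampled bits, matching the index below):

```python
d = 256   # vector length in bits
k = 16    # bits sampled per hash key (AND amplification within one table)

for d_hamming in (10, 40, 100):
    p_single = 1 - d_hamming / d    # one sampled bit matches
    p_all_k = p_single ** k         # all k sampled bits match (one table's key)
    print(f"d_H={d_hamming:>3}: single-bit p={p_single:.3f}, {k}-bit p={p_all_k:.4f}")
```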
"""Bit Sampling LSH for Hamming Distance.""" import numpy as npfrom typing import Tuple, List, Dict, Setfrom collections import defaultdict class BitSamplingLSH: """ LSH for binary vectors using bit sampling. Hash function: sample k random bit positions. Collision probability = (1 - d_H/d)^k where d_H is Hamming distance. """ def __init__( self, dim: int, num_bits: int = 16, # k: bits per hash num_tables: int = 20, # L: number of tables seed: int = 42 ): np.random.seed(seed) self.dim = dim self.num_bits = num_bits self.num_tables = num_tables # Random bit positions for each table # Shape: (L, k) self.bit_indices = np.array([ np.random.choice(dim, num_bits, replace=False) for _ in range(num_tables) ]) # Hash tables self.tables: List[Dict] = [ defaultdict(list) for _ in range(num_tables) ] self.vectors: List[np.ndarray] = [] def _hash( self, vector: np.ndarray, table_idx: int ) -> Tuple[int, ...]: """Extract bits at sampled positions.""" bits = vector[self.bit_indices[table_idx]] return tuple(bits.astype(int)) def add(self, vector_id: int, vector: np.ndarray): """Add a binary vector to the index.""" self.vectors.append(vector.copy()) for table_idx in range(self.num_tables): hash_key = self._hash(vector, table_idx) self.tables[table_idx][hash_key].append(vector_id) def query( self, query: np.ndarray, k: int = 10 ) -> List[Tuple[int, int]]: """ Find approximate nearest neighbors by Hamming distance. Returns list of (vector_id, hamming_distance) tuples. """ candidates: Set[int] = set() for table_idx in range(self.num_tables): hash_key = self._hash(query, table_idx) matches = self.tables[table_idx].get(hash_key, []) candidates.update(matches) if not candidates: # Fallback: random sample candidates = set(range(min(100, len(self.vectors)))) # Compute exact Hamming distances results = [] for cand_id in candidates: hamming_dist = np.sum( query != self.vectors[cand_id] ) results.append((cand_id, int(hamming_dist))) # Sort by distance results.sort(key=lambda x: x[1]) return results[:k] def demo_bit_sampling(): """Demonstrate bit sampling LSH.""" np.random.seed(42) dim = 256 n = 10000 # Generate random binary vectors vectors = np.random.randint(0, 2, size=(n, dim)) # Create a query similar to vector 0 query = vectors[0].copy() # Flip 10 random bits flip_positions = np.random.choice(dim, 10, replace=False) query[flip_positions] = 1 - query[flip_positions] print("Bit Sampling LSH Demo") print(f"Dataset: {n} vectors, {dim} bits each") print(f"Query: Hamming distance {np.sum(query != vectors[0])} from vector 0") print() # Build index lsh = BitSamplingLSH(dim=dim, num_bits=16, num_tables=30) for i, v in enumerate(vectors): lsh.add(i, v) # Query results = lsh.query(query, k=10) print("Top 10 results (LSH):") for vec_id, dist in results: print(f" Vector {vec_id}: Hamming distance = {dist}") # Brute force comparison all_dists = [np.sum(query != v) for v in vectors] bf_top10 = np.argsort(all_dists)[:10] print("\nTop 10 results (brute force):") for vec_id in bf_top10: print(f" Vector {vec_id}: Hamming distance = {all_dists[vec_id]}") # Recall lsh_set = {r[0] for r in results} bf_set = set(bf_top10) recall = len(lsh_set & bf_set) / len(bf_set) print(f"\nRecall@10: {recall:.1%}") if __name__ == "__main__": demo_bit_sampling()SimHash (also known as Sign Random Projection) combines random hyperplane LSH with compact binary codes. It's particularly efficient for large-scale near-duplicate detection where storage is a concern.
The Idea:
Instead of storing $k$ separate single-bit hashes from $k$ hyperplanes, SimHash combines them into a single $k$-bit binary code. Each bit is determined by one hyperplane, and the Hamming distance between codes approximates the angular distance between vectors.
SimHash Definition:
For $k$ random hyperplanes $\mathbf{r}_1, \ldots, \mathbf{r}_k$ and a vector $\mathbf{v}$:
$$\text{SimHash}(\mathbf{v}) = (\text{sign}(\mathbf{r}_1 \cdot \mathbf{v}), \ldots, \text{sign}(\mathbf{r}_k \cdot \mathbf{v}))$$
Key Property:
The expected Hamming distance between SimHash codes equals $k \cdot \theta / \pi$:
$$\mathbb{E}[d_H(\text{SimHash}(\mathbf{u}), \text{SimHash}(\mathbf{v}))] = k \cdot \frac{\theta}{\pi}$$
where $\theta$ is the angle between $\mathbf{u}$ and $\mathbf{v}$.
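As a quick sanity check of this property (the dimensionality, bit count, and angle below are arbitrary): a pair of vectors at 60° should differ in about $k/3$ bits of a $k$-bit code.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, k = 128, 64
theta = np.pi / 3                            # 60 degrees

# Build two unit vectors at angle theta
u = rng.standard_normal(dim); u /= np.linalg.norm(u)
w = rng.standard_normal(dim); w -= (w @ u) * u; w /= np.linalg.norm(w)
v = np.cos(theta) * u + np.sin(theta) * w

planes = rng.standard_normal((k, dim))       # k random hyperplanes
bits_u = (planes @ u) > 0
bits_v = (planes @ v) > 0

print("observed Hamming distance :", int(np.sum(bits_u != bits_v)))
print("expected  k * theta / pi  :", k * theta / np.pi)   # ≈ 21.3
```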
Practical Advantage:
SimHash codes are: compact (a single 64-bit fingerprint per item rather than 64 separate hash values), cheap to compare (one XOR plus a popcount yields the Hamming distance), and easy to store and index at web scale.
"""SimHash implementation for compact similarity fingerprints.""" import numpy as npfrom typing import List, Tuple class SimHash: """ SimHash: Sign Random Projection for compact fingerprints. Projects vectors onto k random hyperplanes and encodes as k-bit binary codes. Hamming distance between codes approximates angular distance between vectors. """ def __init__(self, dim: int, num_bits: int = 64, seed: int = 42): """ Initialize SimHash. Parameters: ----------- dim : int Dimensionality of input vectors num_bits : int Number of bits in the hash (must be <= 64 for uint64) """ np.random.seed(seed) self.dim = dim self.num_bits = num_bits # Random hyperplanes self.hyperplanes = np.random.randn(num_bits, dim) # Normalize (optional but good practice) self.hyperplanes /= np.linalg.norm( self.hyperplanes, axis=1, keepdims=True ) def hash(self, vector: np.ndarray) -> int: """ Compute SimHash fingerprint as integer. """ projections = np.dot(self.hyperplanes, vector) bits = (projections > 0).astype(np.uint64) # Pack bits into integer fingerprint = 0 for i, bit in enumerate(bits): fingerprint |= (bit << i) return int(fingerprint) def hash_batch(self, vectors: np.ndarray) -> np.ndarray: """ Compute SimHash fingerprints for multiple vectors. """ # (n, dim) @ (dim, num_bits) -> (n, num_bits) projections = np.dot(vectors, self.hyperplanes.T) bits = (projections > 0).astype(np.uint64) # Pack bits fingerprints = np.zeros(len(vectors), dtype=np.uint64) for i in range(self.num_bits): fingerprints |= (bits[:, i] << i) return fingerprints @staticmethod def hamming_distance(fp1: int, fp2: int) -> int: """ Compute Hamming distance between two fingerprints. Uses bit counting on XOR result. """ xor = fp1 ^ fp2 return bin(xor).count('1') def estimate_cosine_similarity( self, fp1: int, fp2: int ) -> float: """ Estimate cosine similarity from fingerprints. Hamming distance k -> angle θ ≈ π * k / num_bits Cosine similarity = cos(θ) """ hamming = self.hamming_distance(fp1, fp2) angle = np.pi * hamming / self.num_bits return np.cos(angle) class SimHashLSH: """ LSH index using SimHash fingerprints. Uses multiple hash tables with partial fingerprint matches. """ def __init__( self, dim: int, num_bits: int = 64, num_tables: int = 10, bits_per_table: int = 8, seed: int = 42 ): """ Initialize SimHash LSH. Each table uses bits_per_table bits from the fingerprint. """ self.simhash = SimHash(dim, num_bits, seed) self.num_tables = num_tables self.bits_per_table = bits_per_table # Masks for extracting bits for each table np.random.seed(seed + 1000) self.bit_indices = [ np.sort(np.random.choice( num_bits, bits_per_table, replace=False )) for _ in range(num_tables) ] # Hash tables self.tables: List[dict] = [{} for _ in range(num_tables)] self.fingerprints: List[int] = [] def _extract_bits( self, fingerprint: int, bit_indices: np.ndarray ) -> int: """Extract specific bits from fingerprint.""" result = 0 for i, idx in enumerate(bit_indices): bit = (fingerprint >> idx) & 1 result |= (bit << i) return result def add(self, item_id: int, vector: np.ndarray): """Add a vector to the index.""" fp = self.simhash.hash(vector) self.fingerprints.append(fp) for table_idx in range(self.num_tables): key = self._extract_bits(fp, self.bit_indices[table_idx]) if key not in self.tables[table_idx]: self.tables[table_idx][key] = [] self.tables[table_idx][key].append(item_id) def query( self, vector: np.ndarray, k: int = 10, max_hamming: int = 5 ) -> List[Tuple[int, int]]: """ Find similar items. Returns (item_id, hamming_distance) sorted by distance. 
""" query_fp = self.simhash.hash(vector) candidates = set() for table_idx in range(self.num_tables): key = self._extract_bits( query_fp, self.bit_indices[table_idx] ) matches = self.tables[table_idx].get(key, []) candidates.update(matches) # Compute exact Hamming distances results = [] for item_id in candidates: hamming = self.simhash.hamming_distance( query_fp, self.fingerprints[item_id] ) if hamming <= max_hamming: results.append((item_id, hamming)) results.sort(key=lambda x: x[1]) return results[:k] def demo_simhash(): """Demonstrate SimHash for near-duplicate detection.""" np.random.seed(42) dim = 256 n = 10000 # Generate random vectors vectors = np.random.randn(n, dim) vectors /= np.linalg.norm(vectors, axis=1, keepdims=True) # Create similar vectors (small perturbations) similar_groups = [] for i in range(100): base = vectors[i] # Create 3 similar vectors perturbations = 0.1 * np.random.randn(3, dim) similar = base + perturbations similar /= np.linalg.norm(similar, axis=1, keepdims=True) similar_groups.append((i, similar)) print("SimHash Demo") print(f"Dataset: {n} base vectors + similar variants") print() # Build index lsh = SimHashLSH( dim=dim, num_bits=64, num_tables=10, bits_per_table=10 ) for i, v in enumerate(vectors): lsh.add(i, v) # Also add similar vectors with special IDs for i, (base_id, similar_vecs) in enumerate(similar_groups): for j, sim_vec in enumerate(similar_vecs): item_id = n + i * 3 + j # IDs beyond original dataset lsh.fingerprints.append(lsh.simhash.hash(sim_vec)) # Query with one of the similar vectors test_idx = 0 query_vec = similar_groups[test_idx][1][0] # First similar variant base_id = similar_groups[test_idx][0] print(f"Query: similar variant of vector {base_id}") # Compute true similarity true_sim = np.dot(query_vec, vectors[base_id]) print(f"True cosine similarity to base: {true_sim:.4f}") # Find using LSH results = lsh.query(query_vec, k=5, max_hamming=10) print("\nTop 5 matches (by Hamming distance):") for item_id, hamming_dist in results: est_sim = lsh.simhash.estimate_cosine_similarity( lsh.simhash.hash(query_vec), lsh.fingerprints[item_id] if item_id < len(lsh.fingerprints) else lsh.simhash.hash(vectors[item_id]) ) print(f" Item {item_id}: Hamming={hamming_dist}, Est.Sim={est_sim:.4f}") if __name__ == "__main__": demo_simhash()Let's organize the landscape of LSH hash families by their target similarity/distance metric.
| Metric | Hash Family | Hash Function | Collision Probability |
|---|---|---|---|
| Jaccard similarity | MinHash | min_x∈S π(x) | J(A,B) |
| Cosine similarity | Random Hyperplanes | sign(r·v) | 1 - θ/π |
| Angular distance | SimHash | k-bit code of signs | 1 - θ/π per bit |
| Euclidean (L2) | p-stable (Gaussian) | ⌊(a·v+b)/w⌋ | Integral formula (depends on bucket width w) |
| Manhattan (L1) | p-stable (Cauchy) | ⌊(a·v+b)/w⌋ | Integral formula (depends on bucket width w) |
| Hamming distance | Bit Sampling | v[i] for random i | 1 - d_H/d |
| Chi-squared distance | Spherical LSH | Specialized | Metric-dependent |
Choosing the Right Family: start from the similarity measure your application actually needs, then read off the matching family from the table above. Sets and shingled documents call for MinHash, binary feature vectors for bit sampling, real-valued embeddings compared by angle for random hyperplanes or SimHash, and L1/L2 distances for the p-stable constructions. A small dispatcher in that spirit is sketched below.
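A hypothetical convenience wrapper (the classes referenced are this page's earlier examples, not a standard library API) that makes the metric-to-family mapping explicit:

```python
def make_lsh_index(metric: str, **kwargs):
    """Pick an LSH index class based on the similarity metric."""
    if metric == "jaccard":
        return MinHashLSH(**kwargs)        # sets / shingled documents
    if metric == "hamming":
        return BitSamplingLSH(**kwargs)    # binary feature vectors
    if metric in ("cosine", "angular"):
        return SimHashLSH(**kwargs)        # real-valued embeddings
    raise ValueError(f"No LSH family wired up for metric: {metric}")

# Example usage:
# index = make_lsh_index("jaccard", num_hashes=128, num_bands=16)
```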
Red Flags - When NOT to Use LSH: if the dataset is small enough that brute-force search is already fast, if you need exact rather than approximate nearest neighbors, or if your similarity measure has no known LSH family and cannot be embedded into one that does, LSH adds complexity without a clear payoff.
What if you have a custom similarity measure not covered by existing LSH families? There are general techniques for constructing LSH families:
Method 1: Metric Embedding
If your metric can be approximately embedded into a "standard" space (Euclidean, Hamming, etc.), you can: (1) map each object into that space, (2) apply the standard space's LSH family to the embedded points, and (3) account for the embedding's distortion when choosing thresholds and parameters.
Example: Edit distance on strings can be embedded into Hamming space (with some distortion), then use bit sampling LSH.
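A rough sketch of that idea using a shingle-indicator embedding (this is a simple heuristic in that spirit, not the formal edit-distance embedding, and the helper below is illustrative). It reuses the `BitSamplingLSH` class defined earlier on this page:

```python
import hashlib
import numpy as np

def string_to_binary(s: str, k: int = 3, dim: int = 256) -> np.ndarray:
    """Indicator vector of hashed k-shingles; strings with small edit distance
    share most shingles, so their vectors tend to be close in Hamming distance."""
    v = np.zeros(dim, dtype=int)
    for i in range(max(len(s) - k + 1, 1)):
        shingle = s[i:i + k]
        idx = int(hashlib.md5(shingle.encode()).hexdigest(), 16) % dim
        v[idx] = 1
    return v

lsh = BitSamplingLSH(dim=256, num_bits=16, num_tables=20)
for i, word in enumerate(["locality", "localities", "hashing", "hasher"]):
    lsh.add(i, string_to_binary(word))

print(lsh.query(string_to_binary("localty"), k=2))  # nearest by Hamming distance
```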
Method 2: Kernel-Based LSH
For positive-definite kernels, there exist LSH-like techniques: kernelized LSH draws approximate random hyperplanes in the kernel-induced feature space using only kernel evaluations, and for shift-invariant kernels you can first build an explicit random-feature embedding and then apply standard hyperplane or p-stable LSH to the embedded vectors.
Method 3: Learning to Hash
For complex similarities, learn hash functions from data: train projections or compact binary codes (for example, spectral hashing, learned rotations, or neural encoders) so that pairs labeled as similar receive nearby codes, then index the codes with a Hamming-space lookup.
These learned methods can sometimes outperform random LSH on specific data distributions, but lose the theoretical guarantees.
For an LSH family to work, you need: (1) a hash function whose collision probability decreases with distance, (2) the ability to sample random hash functions efficiently, and (3) an amplification strategy to achieve desired precision/recall. If you can establish these properties for your metric, you have a valid LSH family.
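If you are evaluating a candidate hash family of your own, the first property is straightforward to check empirically. A minimal harness sketch, where the hash-family sampler and pair generator are placeholders you supply for your metric:

```python
import numpy as np
from typing import Callable, Dict, List

def empirical_collision_curve(
    sample_hash: Callable[[], Callable],    # draws a random hash function h(x)
    sample_pair: Callable[[float], tuple],  # draws a pair (x, y) at a given distance
    distances: List[float],
    trials: int = 2000,
) -> Dict[float, float]:
    """Estimate Pr[h(x) = h(y)] at each distance; for a valid LSH family the
    curve should decrease monotonically as distance grows."""
    curve = {}
    for dist in distances:
        hits = 0
        for _ in range(trials):
            h = sample_hash()
            x, y = sample_pair(dist)
            hits += (h(x) == h(y))
        curve[dist] = hits / trials
    return curve
```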
Cross-polytope LSH is a modern alternative to random hyperplane LSH for angular (cosine) similarity that achieves better theoretical bounds and practical performance.
The Idea:
Instead of using hyperplanes to produce one bit per hash, cross-polytope LSH applies a random rotation to the (normalized) input vector and hashes it to whichever signed standard basis vector $\pm\mathbf{e}_i$ is closest, giving $2d$ possible hash values per hash function.
Mathematical Definition:
For a random rotation matrix $R$ and unit vector $\mathbf{v}$:
$$h_R(\mathbf{v}) = \arg\max_{1 \leq i \leq d} |\mathbf{r}_i \cdot \mathbf{v}|$$
where $\mathbf{r}_i$ is the $i$-th row of $R$. The full hash value is this index together with the sign of the winning coordinate, so each hash function has $2d$ possible outputs, one per vertex of the cross-polytope $\{\pm\mathbf{e}_1, \ldots, \pm\mathbf{e}_d\}$.
Why It's Better: its collision probability falls off much faster with angle than the hyperplane family's $1 - \theta/\pi$, which translates into a smaller exponent $\rho$; cross-polytope LSH is asymptotically optimal for angular distance among data-independent LSH schemes, so fewer tables are needed for the same recall.
Practical Consideration:
Random rotations are expensive (O(d²) or O(d log d) with fast methods). For very high-dimensional data, the computational overhead may outweigh the theoretical benefits.
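One common workaround, used in practical cross-polytope implementations, is to replace a dense random rotation with a few "random sign flip + Hadamard transform" blocks. The sketch below uses a dense Hadamard matrix for clarity (so each product is still O(d²)); swapping in a fast Walsh-Hadamard transform brings the cost down to O(d log d). This is an illustrative pseudo-rotation, not an exact uniform rotation:

```python
import numpy as np
from scipy.linalg import hadamard

def pseudo_rotation(v: np.ndarray, num_blocks: int = 3, seed: int = 0) -> np.ndarray:
    """Approximate a random rotation with 'sign flip + Hadamard' blocks.
    Requires len(v) to be a power of two; each block is orthogonal, so norms
    are preserved exactly."""
    d = len(v)
    rng = np.random.default_rng(seed)
    H = hadamard(d).astype(float) / np.sqrt(d)   # orthonormal Hadamard matrix
    out = v.astype(float).copy()
    for _ in range(num_blocks):
        signs = rng.choice([-1.0, 1.0], size=d)  # random diagonal ±1 matrix
        out = H @ (signs * out)
    return out

# v = np.random.randn(64); r = pseudo_rotation(v)   # ||r|| == ||v||
```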
"""Cross-Polytope LSH: A more efficient alternative to random hyperplanes.""" import numpy as npfrom scipy.stats import special_ortho_group class CrossPolytopeLSH: """ Cross-Polytope LSH for cosine similarity. Hash value is the index of the canonical direction (±e_1, ..., ±e_d) closest to the rotated vector. """ def __init__( self, dim: int, num_tables: int = 10, seed: int = 42 ): np.random.seed(seed) self.dim = dim self.num_tables = num_tables # Random rotation matrices for each table # Using Haar-distributed random orthogonal matrices self.rotations = [ special_ortho_group.rvs(dim) for _ in range(num_tables) ] def hash( self, vector: np.ndarray, table_idx: int ) -> int: """ Compute cross-polytope hash. Returns index in [0, 2d-1]: - 0 to d-1: positive direction e_i - d to 2d-1: negative direction -e_i """ # Normalize vector v = vector / np.linalg.norm(vector) # Apply rotation rotated = np.dot(self.rotations[table_idx], v) # Find maximum absolute coordinate abs_coords = np.abs(rotated) max_idx = np.argmax(abs_coords) # Determine sign if rotated[max_idx] >= 0: return max_idx # Positive direction else: return self.dim + max_idx # Negative direction def hash_batch( self, vectors: np.ndarray ) -> np.ndarray: """ Compute hashes for all vectors in all tables. Returns: (num_tables, num_vectors) array """ n = len(vectors) # Normalize norms = np.linalg.norm(vectors, axis=1, keepdims=True) v_normalized = vectors / norms hashes = np.zeros((self.num_tables, n), dtype=int) for table_idx in range(self.num_tables): # Rotate all vectors rotated = np.dot(v_normalized, self.rotations[table_idx].T) # Find max absolute coordinate for each vector abs_coords = np.abs(rotated) max_indices = np.argmax(abs_coords, axis=1) # Determine signs signs = np.array([ rotated[i, max_indices[i]] >= 0 for i in range(n) ]) hashes[table_idx] = np.where( signs, max_indices, self.dim + max_indices ) return hashes def analyze_cross_polytope(): """Compare cross-polytope vs random hyperplane LSH.""" np.random.seed(42) dim = 64 n_tests = 1000 # Test collision probabilities at various angles angles = np.linspace(0, np.pi, 11) hyperplane = np.random.randn(dim) hyperplane /= np.linalg.norm(hyperplane) cross_poly = CrossPolytopeLSH(dim, num_tables=1) print("Collision Probability Comparison") print(f"{'Angle':>10} {'Hyperplane':>12} {'CrossPoly':>12} {'Theoretical HP':>15}") print("-" * 52) for theta in angles: hp_collisions = 0 cp_collisions = 0 for _ in range(n_tests): # Generate random vector pair at angle theta u = np.random.randn(dim) u /= np.linalg.norm(u) # Generate v at angle theta from u v_perp = np.random.randn(dim) v_perp -= np.dot(v_perp, u) * u v_perp /= np.linalg.norm(v_perp) v = np.cos(theta) * u + np.sin(theta) * v_perp # Random hyperplane hp_u = np.sign(np.dot(hyperplane, u)) hp_v = np.sign(np.dot(hyperplane, v)) if hp_u == hp_v: hp_collisions += 1 # Cross-polytope cp_u = cross_poly.hash(u, 0) cp_v = cross_poly.hash(v, 0) if cp_u == cp_v: cp_collisions += 1 hp_prob = hp_collisions / n_tests cp_prob = cp_collisions / n_tests theoretical_hp = 1 - theta / np.pi print(f"{theta:>10.4f} {hp_prob:>12.4f} {cp_prob:>12.4f} {theoretical_hp:>15.4f}") if __name__ == "__main__": analyze_cross_polytope()Standard LSH uses the same hash function for both database points and queries. Asymmetric LSH uses different (but related) hash functions for the two roles, enabling better recall.
The Motivation:
In many applications, we hash database points once during indexing but hash queries many times during search. We have different time budgets: indexing runs offline and can afford extra work per point, while queries run online and must stay cheap, so any extra effort spent on the query side has to buy something concrete, such as higher recall from the same index.
Asymmetric Approach:
For database point $p$ and query $q$: index $p$ with one hash function $g$, but look up $q$ with a related function (or a small set of perturbed keys) $g'$ chosen so that $g'(q)$ still reaches $g(p)$'s bucket whenever $p$ and $q$ are close, even if $q$ alone would have hashed just across a bucket boundary.
Example: Asymmetric E2LSH
In standard E2LSH, both points use the same projection and offset. In asymmetric E2LSH: the database point is stored only in its own bucket $\lfloor (\mathbf{a} \cdot \mathbf{p} + b) / w \rfloor$, while the query additionally probes the adjacent buckets whose boundaries it falls near.
This increases recall without increasing false positives—the query checks more buckets, finding neighbors that just barely crossed a boundary.
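A minimal sketch of this asymmetry for a single E2LSH-style projection (the projection, bucket width, and probing radius below are illustrative choices, not a specific published parameterization): the database point is stored under one key, while the query also probes neighboring buckets.

```python
import numpy as np

class AsymmetricBucketProbe:
    """One-dimensional E2LSH-style hash: symmetric insert, asymmetric query."""

    def __init__(self, dim: int, w: float = 4.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.a = rng.standard_normal(dim)   # random projection direction
        self.b = rng.uniform(0, w)          # random offset
        self.w = w
        self.buckets: dict = {}

    def _bucket(self, v: np.ndarray) -> int:
        return int(np.floor((self.a @ v + self.b) / self.w))

    def add(self, item_id: int, v: np.ndarray):
        # Database side: one bucket per point, computed once at index time.
        self.buckets.setdefault(self._bucket(v), []).append(item_id)

    def query(self, q: np.ndarray, probe_radius: int = 1) -> set:
        # Query side: also probe neighboring buckets to catch points
        # that landed just across a bucket boundary.
        center = self._bucket(q)
        candidates = set()
        for offset in range(-probe_radius, probe_radius + 1):
            candidates.update(self.buckets.get(center + offset, []))
        return candidates
```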
Asymmetric LSH is most beneficial when: (1) query time is more constrained than index time, (2) high recall is critical, (3) the distance distribution has many points near decision boundaries. It's essentially a principled form of multi-probing with theoretical backing.
We've surveyed the major LSH hash function families and their applications. Let's consolidate the key insights: every metric gets its own family (MinHash for Jaccard, bit sampling for Hamming, random hyperplanes and SimHash for cosine/angular similarity, p-stable projections for L1/L2); all of them follow the same recipe of a randomized hash whose collision probability decreases with distance, amplified by AND-OR combinations of hashes and tables; and newer constructions such as cross-polytope and asymmetric LSH spend extra computation to get better collision behavior from the same framework.
What's Next:
In the next page, we'll dive deep into collision probability analysis—the mathematical foundation that lets us predict and optimize LSH performance. Understanding these probability calculations is essential for tuning LSH parameters in practice.
You now have a comprehensive understanding of the major LSH hash function families. You can match problems to appropriate hash families and understand the tradeoffs between different approaches.