Traditional hash functions are designed to scatter — they map similar inputs to completely different outputs. Change one bit, and the hash changes unpredictably. This is perfect for hash tables but useless for similarity search.
Locality-Sensitive Hashing (LSH) inverts this principle. LSH hash functions are designed to cluster — similar items hash to the same bucket with high probability, while dissimilar items hash to different buckets. This seemingly paradoxical property enables sub-linear approximate nearest neighbor search.
The Key Insight:
Instead of comparing a query to all $n$ points, we:

1. Hash every indexed point into buckets using a locality-sensitive hash function.
2. Hash the query with the same function and look up only its bucket(s).
3. Compare the query exactly against the candidates found there.

This transforms $O(n)$ comparisons into $O(1)$ lookups plus $O(b)$ bucket comparisons, where $b \ll n$ is the bucket size. The tradeoff is that we might miss some true neighbors (false negatives) or return non-neighbors (false positives).
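A minimal sketch of this hash-then-probe pattern is shown below; `lsh_hash` and `distance` are placeholders for the concrete hash families and metrics developed later on this page:

```python
from collections import defaultdict

def build_index(points, lsh_hash):
    """Group point indices by their LSH bucket (one-time O(n) pass)."""
    buckets = defaultdict(list)
    for idx, p in enumerate(points):
        buckets[lsh_hash(p)].append(idx)
    return buckets

def query(q, points, buckets, lsh_hash, distance):
    """O(1) bucket lookup, then exact comparisons within the bucket only."""
    candidates = buckets.get(lsh_hash(q), [])
    return min(candidates, key=lambda i: distance(q, points[i]), default=None)
```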
By the end of this page, you will understand the mathematical definition of locality-sensitive hash families, implement LSH for different distance metrics (cosine, Euclidean, Jaccard), analyze collision probabilities and parameter selection, understand amplification techniques for tuning precision and recall, and recognize the tradeoffs between query time, space, and accuracy.
LSH is defined formally in terms of hash families with specific collision probability properties.
Definition (Locality-Sensitive Hash Family):
A family $\mathcal{H}$ of hash functions is $(r_1, r_2, p_1, p_2)$-sensitive for distance $d(\cdot, \cdot)$ if for any $h \in \mathcal{H}$ chosen uniformly at random:
If $d(\mathbf{x}, \mathbf{y}) \leq r_1$, then $\Pr[h(\mathbf{x}) = h(\mathbf{y})] \geq p_1$ (similar items collide often)
If $d(\mathbf{x}, \mathbf{y}) \geq r_2$, then $\Pr[h(\mathbf{x}) = h(\mathbf{y})] \leq p_2$ (dissimilar items collide rarely)
where $r_1 < r_2$ and $p_1 > p_2$.
The gap between similar ($\leq r_1$) and dissimilar ($\geq r_2$) is crucial. We typically define:
$$c = \frac{r_2}{r_1} > 1$$
as the approximation factor. An LSH scheme finds a $(1 + \epsilon)$-approximate nearest neighbor with $c = 1 + \epsilon$.
The $\rho$ parameter:
A key quantity is:
$$\rho = \frac{\ln(1/p_1)}{\ln(1/p_2)}$$
This controls the query time complexity. Smaller $\rho$ means more efficient separation between similar and dissimilar pairs. For the best LSH families, $\rho \approx 1/c$.
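As a quick worked example, the snippet below computes $\rho$ for a hypothetical family with $p_1 = 0.8$ and $p_2 = 0.4$ (values chosen purely for illustration):

```python
import math

p1, p2 = 0.8, 0.4                      # hypothetical collision probabilities
rho = math.log(1 / p1) / math.log(1 / p2)
print(f"rho = {rho:.3f}")              # ~0.244: query cost scales roughly as n^0.244
```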
| Parameter | Symbol | Meaning | Desirable Value |
|---|---|---|---|
| Near distance | $r_1$ | Distance threshold for 'similar' | Application-specific |
| Far distance | $r_2$ | Distance threshold for 'dissimilar' | $r_2 > r_1$ |
| Near collision prob | $p_1$ | Pr(collision \| similar) | High (close to 1) |
| Far collision prob | $p_2$ | Pr(collision \| dissimilar) | Low (close to 0) |
| Approximation factor | $c = r_2/r_1$ | Quality of approximation | $1 + \epsilon$ for small $\epsilon$ |
| Rho parameter | $\rho$ | Efficiency measure | Small (ideally $1/c$) |
LSH solves the (c, r)-near neighbor problem: given a query with a true neighbor within distance r, return a neighbor within distance cr (c-approximation). This relaxation from exact to approximate nearest neighbor is what enables sub-linear search in high dimensions.
One of the most elegant LSH schemes is for cosine similarity, using random hyperplane partitions.
Cosine Similarity:
$$\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|}$$
The angle $\theta$ between vectors determines similarity. Vectors pointing in similar directions have $\theta \approx 0$ and $\cos(\theta) \approx 1$.
Random Hyperplane Hash:
For a random unit vector $\mathbf{r}$ sampled uniformly from the unit sphere:
$$h_{\mathbf{r}}(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{r} \cdot \mathbf{x} \geq 0 \\ 0 & \text{if } \mathbf{r} \cdot \mathbf{x} < 0 \end{cases}$$
This hash assigns $\mathbf{x}$ based on which side of the hyperplane $\mathbf{r} \cdot \mathbf{x} = 0$ it falls on.
Collision Probability:
The probability that two vectors have the same hash is:
$$\Pr[h_{\mathbf{r}}(\mathbf{x}) = h_{\mathbf{r}}(\mathbf{y})] = 1 - \frac{\theta}{\pi}$$
where $\theta = \arccos\left(\frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|}\right)$ is the angle between the vectors.
Intuition: The hyperplane separates the vectors only if it 'cuts between' them. The probability of this equals the fraction of random directions that do so, which is $\theta / \pi$.
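This collision probability is easy to confirm numerically; the sketch below estimates it by Monte Carlo over random hyperplanes for an arbitrary pair of vectors:

```python
import numpy as np

def empirical_collision_rate(x, y, n_trials=100_000, seed=0):
    """Estimate Pr[sign(r·x) == sign(r·y)] over random Gaussian hyperplanes r."""
    rng = np.random.default_rng(seed)
    r = rng.standard_normal((n_trials, x.shape[0]))
    return np.mean((r @ x >= 0) == (r @ y >= 0))

rng = np.random.default_rng(1)
x, y = rng.standard_normal(64), rng.standard_normal(64)
theta = np.arccos(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
print(f"theoretical: {1 - theta / np.pi:.4f}")
print(f"empirical:   {empirical_collision_rate(x, y):.4f}")
```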
```python
import numpy as np
from typing import List, Tuple, Set
from collections import defaultdict


class CosineLSH:
    """
    Locality-Sensitive Hashing for Cosine Similarity.

    Uses random hyperplanes to hash vectors such that similar vectors
    (small angle) are likely to share hash values.

    Parameters:
    -----------
    n_hyperplanes : int
        Number of random hyperplanes (bits per hash)
    n_tables : int
        Number of hash tables (for recall boosting)
    dim : int
        Dimensionality of input vectors
    seed : int
        Random seed for reproducibility
    """

    def __init__(
        self,
        n_hyperplanes: int = 8,
        n_tables: int = 4,
        dim: int = 128,
        seed: int = 42
    ):
        self.n_hyperplanes = n_hyperplanes
        self.n_tables = n_tables
        self.dim = dim

        np.random.seed(seed)

        # Generate random hyperplanes for each table
        # Shape: (n_tables, n_hyperplanes, dim)
        self.hyperplanes = np.random.randn(n_tables, n_hyperplanes, dim)

        # Normalize each hyperplane to unit length
        norms = np.linalg.norm(self.hyperplanes, axis=2, keepdims=True)
        self.hyperplanes /= norms

        # Storage: table_id -> hash_value -> set of indices
        self.tables: List[defaultdict] = [
            defaultdict(set) for _ in range(n_tables)
        ]
        self.data = []  # Store original vectors

    def _hash_single(self, x: np.ndarray, table_idx: int) -> int:
        """
        Compute hash for vector x using table table_idx.

        Returns an integer hash (binary encoding of hyperplane signs).
        """
        # Dot product with each hyperplane: (n_hyperplanes,)
        projections = self.hyperplanes[table_idx] @ x

        # Convert to binary hash (1 if positive, 0 if negative)
        bits = (projections >= 0).astype(int)

        # Convert binary array to integer
        hash_val = int(''.join(map(str, bits)), 2)
        return hash_val

    def _hash_all_tables(self, x: np.ndarray) -> List[int]:
        """Compute hash in all tables."""
        return [self._hash_single(x, i) for i in range(self.n_tables)]

    def index(self, vectors: np.ndarray):
        """
        Index a collection of vectors.

        Parameters:
        -----------
        vectors : np.ndarray, shape (n, dim)
            Vectors to index
        """
        self.data = vectors
        for idx, vec in enumerate(vectors):
            hashes = self._hash_all_tables(vec)
            for table_idx, h in enumerate(hashes):
                self.tables[table_idx][h].add(idx)

    def query(self, q: np.ndarray, k: int = 10) -> List[Tuple[int, float]]:
        """
        Find approximate k nearest neighbors.

        Parameters:
        -----------
        q : np.ndarray, shape (dim,)
            Query vector
        k : int
            Number of neighbors to return

        Returns:
        --------
        List of (index, cosine_similarity) pairs, sorted by similarity
        """
        # Get candidate indices from all tables
        candidates: Set[int] = set()
        hashes = self._hash_all_tables(q)
        for table_idx, h in enumerate(hashes):
            candidates.update(self.tables[table_idx][h])

        if not candidates:
            return []

        # Compute actual cosine similarity for candidates
        q_norm = np.linalg.norm(q)
        results = []
        for idx in candidates:
            vec = self.data[idx]
            sim = np.dot(q, vec) / (q_norm * np.linalg.norm(vec))
            results.append((idx, sim))

        # Sort by similarity (descending) and return top k
        results.sort(key=lambda x: -x[1])
        return results[:k]

    def collision_probability(self, theta: float) -> float:
        """
        Theoretical collision probability for angle theta.

        For a single hyperplane: Pr[collision] = 1 - theta/pi
        For n hyperplanes (AND): Pr[all match] = (1 - theta/pi)^n
        For m tables (OR): Pr[at least one match] = 1 - (1 - p^n)^m
        """
        single_prob = 1 - theta / np.pi
        band_prob = single_prob ** self.n_hyperplanes
        total_prob = 1 - (1 - band_prob) ** self.n_tables
        return total_prob


# Example usage
def demo_cosine_lsh():
    np.random.seed(0)

    # Generate random high-dimensional vectors
    n, dim = 10000, 128
    vectors = np.random.randn(n, dim)

    # Create and index LSH
    lsh = CosineLSH(n_hyperplanes=12, n_tables=8, dim=dim)
    lsh.index(vectors)

    # Query
    query = np.random.randn(dim)
    neighbors = lsh.query(query, k=10)

    print(f"Query found {len(neighbors)} candidates")
    print("Top 5 neighbors (idx, cosine_sim):")
    for idx, sim in neighbors[:5]:
        print(f"  {idx}: {sim:.4f}")

    # Verify with brute force
    true_sims = [np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v))
                 for v in vectors]
    true_top = sorted(enumerate(true_sims), key=lambda x: -x[1])[:10]
    print("\nTrue top 5:")
    for idx, sim in true_top[:5]:
        print(f"  {idx}: {sim:.4f}")
```

Charikar's SimHash (2002) applies random hyperplane LSH to document fingerprinting. Documents are represented as high-dimensional vectors (e.g., TF-IDF), and SimHash produces compact binary fingerprints. Similar documents have similar fingerprints, enabling near-duplicate detection at web scale.
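As an illustration of the SimHash idea (a sketch in the spirit of, not identical to, Charikar's construction), the code below sums per-token random directions weighted by, e.g., TF-IDF weights and keeps only the signs; the helper names `simhash` and `hamming` are my own:

```python
import hashlib
import numpy as np

def simhash(token_weights: dict, n_bits: int = 64) -> int:
    """Compact fingerprint: sign pattern of a weighted sum of per-token random directions."""
    acc = np.zeros(n_bits)
    for token, weight in token_weights.items():
        # Stable per-token seed so the same token always maps to the same direction
        seed = int.from_bytes(hashlib.sha1(token.encode()).digest()[:4], "big")
        direction = np.random.default_rng(seed).standard_normal(n_bits)
        acc += weight * direction
    bits = (acc >= 0).astype(int)
    return int("".join(map(str, bits)), 2)

def hamming(a: int, b: int) -> int:
    """Similar documents yield fingerprints with small Hamming distance."""
    return bin(a ^ b).count("1")

# Example: two near-duplicate token-weight maps share most fingerprint bits
d1 = {"quick": 1.0, "brown": 1.0, "fox": 2.0, "lazy": 0.5, "dog": 1.0}
d2 = {"quick": 1.0, "brown": 1.0, "fox": 2.0, "sleepy": 0.5, "dog": 1.0}
print(hamming(simhash(d1), simhash(d2)), "differing bits out of 64")
```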
For Euclidean ($L_2$) distance, we use a different hash family based on random projections onto lines and discretization.
Random Projection Hash:
For a random unit vector $\mathbf{a}$ and a random offset $b$ uniform in $[0, w]$:
$$h_{\mathbf{a}, b}(\mathbf{x}) = \left\lfloor \frac{\mathbf{a} \cdot \mathbf{x} + b}{w} \right\rfloor$$
This projects $\mathbf{x}$ onto a random line, then discretizes into buckets of width $w$.
Collision Probability:
For two vectors at distance $d = \|\mathbf{x} - \mathbf{y}\|$:
$$p(d) = \Pr[h(\mathbf{x}) = h(\mathbf{y})] = \int_0^w \frac{1}{d} f\left(\frac{t}{d}\right) \left(1 - \frac{t}{w}\right) dt$$
where $f$ is the density of the absolute value of the underlying projection distribution (half-normal for Gaussian projections), so that the projected gap $|\mathbf{a} \cdot \mathbf{x} - \mathbf{a} \cdot \mathbf{y}|$ has density $\frac{1}{d} f(t/d)$.
For standard Gaussian projections:
$$p(d) \approx 1 - 2 \Phi\left(-\frac{w}{d}\right) - \frac{2}{\sqrt{2\pi}(w/d)}\left(1 - e^{-(w/d)^2/2}\right)$$
where $\Phi$ is the standard normal CDF.
Practical Approximation:
$$p(d) \approx 1 - \frac{d}{w} \quad \text{for } d \ll w$$
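As a sanity check on these expressions, the following sketch compares the closed-form Gaussian collision probability with a Monte Carlo simulation of the hash itself (the distances and bucket width are arbitrary test values):

```python
import numpy as np
from scipy.stats import norm

def p_collision_closed_form(d: float, w: float) -> float:
    """Collision probability for Gaussian projections with bucket width w."""
    r = w / d
    return 1 - 2 * norm.cdf(-r) - (2 / (np.sqrt(2 * np.pi) * r)) * (1 - np.exp(-r**2 / 2))

def p_collision_monte_carlo(d: float, w: float, dim: int = 32,
                            n_trials: int = 200_000, seed: int = 0) -> float:
    """Simulate h(x) = floor((a·x + b)/w) for two points at distance d."""
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    y = np.zeros(dim)
    y[0] = d                                   # two points exactly d apart
    a = rng.standard_normal((n_trials, dim))   # random Gaussian projections
    b = rng.uniform(0, w, n_trials)            # random offsets
    hx = np.floor((a @ x + b) / w)
    hy = np.floor((a @ y + b) / w)
    return np.mean(hx == hy)

for d in (1.0, 4.0, 8.0):
    print(d, round(p_collision_closed_form(d, w=4.0), 3),
          round(p_collision_monte_carlo(d, w=4.0), 3))
```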
```python
import numpy as np
from typing import List, Tuple, Set
from collections import defaultdict


class EuclideanLSH:
    """
    LSH for Euclidean (L2) distance using random projections.

    Projects vectors onto random lines and discretizes into buckets.
    Similar vectors (close in L2) land in the same bucket with
    high probability.

    Parameters:
    -----------
    n_projections : int
        Number of random projections per hash table
    n_tables : int
        Number of hash tables
    bucket_width : float
        Width of discretization buckets (controls resolution)
    dim : int
        Dimensionality of input vectors
    """

    def __init__(
        self,
        n_projections: int = 4,
        n_tables: int = 8,
        bucket_width: float = 4.0,
        dim: int = 128,
        seed: int = 42
    ):
        self.n_projections = n_projections
        self.n_tables = n_tables
        self.bucket_width = bucket_width
        self.dim = dim

        np.random.seed(seed)

        # Random projection vectors: (n_tables, n_projections, dim)
        # Gaussian for theoretical guarantees
        self.projections = np.random.randn(n_tables, n_projections, dim)

        # Random offsets: (n_tables, n_projections)
        self.offsets = np.random.uniform(0, bucket_width, (n_tables, n_projections))

        # Hash tables
        self.tables: List[defaultdict] = [
            defaultdict(set) for _ in range(n_tables)
        ]
        self.data = []

    def _hash_single(self, x: np.ndarray, table_idx: int) -> Tuple[int, ...]:
        """
        Compute hash for vector x in table table_idx.

        Returns a tuple of bucket indices (one per projection).
        """
        # Project: (n_projections,)
        projections = self.projections[table_idx] @ x

        # Add offset and discretize
        bucket_indices = np.floor(
            (projections + self.offsets[table_idx]) / self.bucket_width
        ).astype(int)

        return tuple(bucket_indices)

    def index(self, vectors: np.ndarray):
        """Index vectors into all hash tables."""
        self.data = vectors
        for idx, vec in enumerate(vectors):
            for table_idx in range(self.n_tables):
                h = self._hash_single(vec, table_idx)
                self.tables[table_idx][h].add(idx)

    def query(
        self,
        q: np.ndarray,
        k: int = 10,
        radius: float = None
    ) -> List[Tuple[int, float]]:
        """
        Find approximate nearest neighbors.

        Parameters:
        -----------
        q : np.ndarray
            Query vector
        k : int
            Number of neighbors to return
        radius : float, optional
            Return only neighbors within this distance

        Returns:
        --------
        List of (index, distance) pairs, sorted by distance
        """
        candidates: Set[int] = set()
        for table_idx in range(self.n_tables):
            h = self._hash_single(q, table_idx)
            candidates.update(self.tables[table_idx][h])

        if not candidates:
            return []

        # Compute actual distances for candidates
        results = []
        for idx in candidates:
            dist = np.linalg.norm(q - self.data[idx])
            if radius is None or dist <= radius:
                results.append((idx, dist))

        results.sort(key=lambda x: x[1])
        return results[:k]

    def collision_probability(self, d: float) -> float:
        """
        Collision probability for two points at distance d.

        Single projection (Gaussian projections, bucket width w):
            p1(d) = 1 - 2*Phi(-w/d) - 2/(sqrt(2*pi)*(w/d)) * (1 - exp(-(w/d)^2 / 2))
        For multiple projections (AND): p1^n
        For multiple tables (OR): 1 - (1 - p1^n)^m
        """
        from scipy.stats import norm

        if d == 0:
            return 1.0

        # Single projection collision prob (closed form from the text)
        r = self.bucket_width / d
        p1 = 1 - 2 * norm.cdf(-r) - 2 / (np.sqrt(2 * np.pi) * r) * (1 - np.exp(-r**2 / 2))

        # AND over projections
        band_prob = p1 ** self.n_projections
        # OR over tables
        total_prob = 1 - (1 - band_prob) ** self.n_tables
        return total_prob


# Choosing bucket width w
def analyze_bucket_width():
    """
    Demonstrate effect of bucket width on collision probability.
    """
    import matplotlib.pyplot as plt

    widths = [1, 2, 4, 8, 16]
    distances = np.linspace(0.1, 20, 100)

    fig, ax = plt.subplots(figsize=(10, 6))
    for w in widths:
        probs = [max(0, 1 - d/w) for d in distances]
        ax.plot(distances, probs, label=f'w = {w}')

    ax.set_xlabel('Distance d')
    ax.set_ylabel('Collision probability')
    ax.set_title('LSH Collision Probability vs Distance')
    ax.legend()
    ax.grid(True, alpha=0.3)

    return fig
```

The bucket width w is a critical parameter. Small w gives fine-grained buckets (good precision but low recall). Large w gives coarse buckets (good recall but more false positives). A good heuristic is to set w to the expected distance to the nearest neighbor, or tune based on desired precision-recall tradeoff.
For set similarity, MinHash provides an elegant LSH scheme for the Jaccard similarity.
Jaccard Similarity:
For sets $A$ and $B$:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
MinHash Definition:
Let $h$ be a random permutation of all possible elements. Define:
$$\text{MinHash}_h(A) = \min_{a \in A} h(a)$$
The minimum element under permutation $h$.
Remarkable Property:
$$\Pr[\text{MinHash}_h(A) = \text{MinHash}_h(B)] = J(A, B)$$
Proof Sketch: The minimum of $A \cup B$ under $h$ is equally likely to be any element. It's the same for both sets iff it's in $A \cap B$, which happens with probability $|A \cap B| / |A \cup B| = J(A, B)$.
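This collision property is easy to verify empirically; the following sketch (with two arbitrary small sets and a brute-force loop over random permutations) compares the empirical collision rate with the exact Jaccard similarity:

```python
import numpy as np

A = {0, 1, 2, 3, 4, 5}
B = {4, 5, 6, 7}
universe = sorted(A | B | {8, 9})
exact_j = len(A & B) / len(A | B)          # 2 / 8 = 0.25

rng = np.random.default_rng(0)
n_trials = 50_000
matches = 0
for _ in range(n_trials):
    # Random permutation of the universe -> rank of each element
    ranks = {elem: rank for rank, elem in enumerate(rng.permutation(universe))}
    if min(ranks[a] for a in A) == min(ranks[b] for b in B):
        matches += 1

print(f"exact Jaccard:       {exact_j:.3f}")
print(f"empirical collision: {matches / n_trials:.3f}")
```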
MinHash Signature:
Using $k$ independent hash functions, we create a signature:
$$\text{Sig}(A) = [\text{MinHash}_{h_1}(A), \text{MinHash}_{h_2}(A), \ldots, \text{MinHash}_{h_k}(A)]$$
This compresses a set to $k$ integers while preserving similarity information.
```python
import numpy as np
from typing import Set, List, Tuple
from collections import defaultdict
import hashlib


class MinHash:
    """
    MinHash signatures for estimating Jaccard similarity.

    Compresses sets into fixed-size signatures such that:
        Pr[signature match] = Jaccard similarity

    Parameters:
    -----------
    n_hashes : int
        Number of hash functions (signature length)
    seed : int
        Random seed
    """

    def __init__(self, n_hashes: int = 128, seed: int = 42):
        self.n_hashes = n_hashes
        np.random.seed(seed)

        # Generate random hash function parameters
        # We use h(x) = (a*x + b) mod p for a large prime p
        self.p = 2**31 - 1  # Mersenne prime
        self.a = np.random.randint(1, self.p, size=n_hashes)
        self.b = np.random.randint(0, self.p, size=n_hashes)

    def _hash_value(self, val: int, hash_idx: int) -> int:
        """Apply hash function hash_idx to value val."""
        return (self.a[hash_idx] * val + self.b[hash_idx]) % self.p

    def signature(self, s: Set) -> np.ndarray:
        """
        Compute MinHash signature for a set.

        Parameters:
        -----------
        s : Set
            Set of hashable elements

        Returns:
        --------
        np.ndarray of shape (n_hashes,)
            MinHash signature
        """
        sig = np.full(self.n_hashes, np.inf)

        for elem in s:
            # Convert element to integer hash
            elem_hash = hash(elem) % self.p

            # Update signature with minimum for each hash function
            for i in range(self.n_hashes):
                h_val = self._hash_value(elem_hash, i)
                if h_val < sig[i]:
                    sig[i] = h_val

        return sig.astype(np.int64)

    @staticmethod
    def estimate_jaccard(sig1: np.ndarray, sig2: np.ndarray) -> float:
        """
        Estimate Jaccard similarity from signatures.

        Just counts matching positions!
        """
        return np.mean(sig1 == sig2)


class MinHashLSH:
    """
    LSH index using MinHash for Jaccard similarity.

    Uses banding technique: divide signature into b bands of r rows each.
    Two sets are candidates if any band matches exactly.

    Parameters:
    -----------
    n_hashes : int
        Signature length (must equal bands * rows)
    bands : int
        Number of bands
    """

    def __init__(self, n_hashes: int = 128, bands: int = 32, seed: int = 42):
        self.bands = bands
        self.rows = n_hashes // bands
        self.n_hashes = bands * self.rows  # Ensure divisibility

        self.minhash = MinHash(n_hashes=self.n_hashes, seed=seed)

        # One hash table per band
        self.tables: List[defaultdict] = [
            defaultdict(set) for _ in range(bands)
        ]
        self.signatures = []
        self.data = []

    def index(self, sets: List[Set]):
        """Index a collection of sets."""
        self.data = sets
        self.signatures = [self.minhash.signature(s) for s in sets]

        for idx, sig in enumerate(self.signatures):
            for band_idx in range(self.bands):
                # Extract band from signature
                start = band_idx * self.rows
                end = start + self.rows
                band = tuple(sig[start:end])
                self.tables[band_idx][band].add(idx)

    def query(self, q: Set, k: int = 10) -> List[Tuple[int, float]]:
        """
        Find similar sets.

        Returns:
        --------
        List of (index, estimated_jaccard) pairs
        """
        q_sig = self.minhash.signature(q)

        # Find candidates from any matching band
        candidates: Set[int] = set()
        for band_idx in range(self.bands):
            start = band_idx * self.rows
            end = start + self.rows
            band = tuple(q_sig[start:end])
            candidates.update(self.tables[band_idx][band])

        # Compute estimated Jaccard for candidates
        results = []
        for idx in candidates:
            jaccard_est = MinHash.estimate_jaccard(q_sig, self.signatures[idx])
            results.append((idx, jaccard_est))

        results.sort(key=lambda x: -x[1])
        return results[:k]

    def s_curve_probability(self, jaccard: float) -> float:
        """
        Probability of becoming a candidate given Jaccard similarity.

        Follows an S-curve: Pr = 1 - (1 - J^r)^b
        where r = rows per band, b = number of bands
        """
        return 1 - (1 - jaccard ** self.rows) ** self.bands


# Example: Near-duplicate document detection
def demo_document_lsh():
    # Represent documents as sets of shingles (n-grams)
    def shingle(text: str, k: int = 5) -> Set[str]:
        """Create k-shingles from text."""
        return {text[i:i+k] for i in range(len(text) - k + 1)}

    docs = [
        "The quick brown fox jumps over the lazy dog",
        "The quick brown fox leaps over the lazy dog",   # Similar to doc 0
        "A completely different document about cats",
        "The fast brown fox jumps over the sleepy dog",  # Similar to doc 0
        "Another document about something else entirely",
    ]

    # Create shingle sets
    shingle_sets = [shingle(doc) for doc in docs]

    # Build LSH index
    lsh = MinHashLSH(n_hashes=128, bands=16)
    lsh.index(shingle_sets)

    # Query with each document
    for i, doc in enumerate(docs):
        similar = lsh.query(shingle_sets[i], k=3)
        print(f"Doc {i}: '{doc[:40]}...'")
        for idx, jaccard in similar:
            if idx != i:
                print(f"  Similar to Doc {idx} (J≈{jaccard:.3f})")
```

The banding technique creates an S-curve probability: pairs above a threshold similarity almost always become candidates, while pairs below almost never do. The threshold is approximately (1/b)^(1/r) where b=bands and r=rows. Choose b and r to position the threshold at your desired Jaccard cutoff.
A single LSH hash function provides only weak separation between similar and dissimilar items. Amplification combines multiple hashes to achieve arbitrarily strong separation.
Two Amplification Strategies:
AND-Amplification (Increases Precision):
Require all $r$ hash functions to match: $$g(\mathbf{x}) = (h_1(\mathbf{x}), h_2(\mathbf{x}), \ldots, h_r(\mathbf{x}))$$
Collision probability: $$p_r = p^r$$
This dramatically reduces false positives but also reduces true positives.
OR-Amplification (Increases Recall):
Use $b$ independent hash tables; an item is a candidate if it matches in any table.
Collision probability: $$p_b = 1 - (1 - p)^b$$
This recovers true pairs that any single hash might miss.
Combined (b, r)-Banding:
Use $b$ bands of $r$ rows each. Match if any band matches completely:
$$P(r, b) = 1 - (1 - p^r)^b$$
This creates the characteristic S-curve: sharp transition from low to high probability at a threshold similarity.
```python
import numpy as np
import matplotlib.pyplot as plt


def plot_s_curves():
    """
    Visualize how banding parameters affect the S-curve.
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Basic collision probability (for illustration)
    s = np.linspace(0, 1, 100)  # Similarity

    # Left: Effect of changing bands (fixed rows)
    ax = axes[0]
    ax.set_title('Effect of Number of Bands (r=5 fixed)')
    r = 5
    for b in [1, 4, 16, 64, 256]:
        prob = 1 - (1 - s**r)**b
        ax.plot(s, prob, label=f'b={b}')
    ax.set_xlabel('Similarity')
    ax.set_ylabel('Probability of becoming candidate')
    ax.legend()
    ax.grid(True, alpha=0.3)

    # Right: Effect of changing rows (fixed bands)
    ax = axes[1]
    ax.set_title('Effect of Rows per Band (b=16 fixed)')
    b = 16
    for r in [1, 2, 4, 8, 16]:
        prob = 1 - (1 - s**r)**b
        ax.plot(s, prob, label=f'r={r}')
    ax.set_xlabel('Similarity')
    ax.set_ylabel('Probability of becoming candidate')
    ax.legend()
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    return fig


def find_params_for_threshold(target_sim: float,
                              n_hashes: int = 128,
                              target_prob: float = 0.5) -> tuple:
    """
    Find (b, r) parameters to achieve target threshold.

    The threshold t ≈ (1/b)^(1/r) where Pr(candidate | J=t) = 0.5

    Parameters:
    -----------
    target_sim : float
        Desired similarity threshold (e.g., 0.8 for 80% similarity)
    n_hashes : int
        Total signature length
    target_prob : float
        Probability at threshold (usually 0.5)

    Returns:
    --------
    (bands, rows, threshold) tuple with bands * rows <= n_hashes
    """
    best_params = None
    best_distance = float('inf')

    for b in range(1, n_hashes + 1):
        for r in range(1, n_hashes // b + 1):
            if b * r > n_hashes:
                continue

            # Compute threshold t where Pr(candidate) = target_prob:
            #   1 - (1 - t^r)^b = target_prob
            #   (1 - t^r)^b = 1 - target_prob
            #   t = (1 - (1 - target_prob)^(1/b))^(1/r)
            threshold = (1 - (1 - target_prob) ** (1 / b)) ** (1 / r)

            distance = abs(threshold - target_sim)
            if distance < best_distance:
                best_distance = distance
                best_params = (b, r, threshold)

    return best_params


def analyze_amplification():
    """
    Show how amplification separates similar from dissimilar.
    """
    # Without amplification
    single_p_similar = 0.9      # Similar pairs collide 90%
    single_p_dissimilar = 0.3   # Dissimilar still collide 30%!

    print("Without amplification:")
    print(f"  Pr(collision | similar): {single_p_similar:.2f}")
    print(f"  Pr(collision | dissimilar): {single_p_dissimilar:.2f}")
    print(f"  Ratio: {single_p_similar/single_p_dissimilar:.2f}x")

    # With (b=4, r=5) amplification (20 hash functions)
    b, r = 4, 5
    amp_p_similar = 1 - (1 - single_p_similar**r)**b
    amp_p_dissimilar = 1 - (1 - single_p_dissimilar**r)**b

    print(f"\nWith (b={b}, r={r}) amplification:")
    print(f"  Pr(collision | similar): {amp_p_similar:.4f}")
    print(f"  Pr(collision | dissimilar): {amp_p_dissimilar:.6f}")
    print(f"  Ratio: {amp_p_similar/amp_p_dissimilar:.1f}x")


# Example output (approximate):
# Without amplification:
#   Pr(collision | similar): 0.90
#   Pr(collision | dissimilar): 0.30
#   Ratio: 3.00x
#
# With (b=4, r=5) amplification:
#   Pr(collision | similar): 0.9719
#   Pr(collision | dissimilar): 0.009685
#   Ratio: ~100x
```

| Bands (b) | Rows (r) | Total Hashes | Threshold (J≈) | Effect |
|---|---|---|---|---|
| 1 | 16 | 16 | 0.98 | Very high precision, low recall |
| 4 | 4 | 16 | 0.84 | Balanced |
| 8 | 2 | 16 | 0.68 | Higher recall |
| 16 | 1 | 16 | 0.50 | Very high recall, lower precision |
| 32 | 4 | 128 | 0.78 | Production setting (balanced) |
| 16 | 8 | 128 | 0.90 | High sensitivity detection |
Let's analyze the complexity of LSH-based nearest neighbor search.
Preprocessing:

- Hash each of the $n$ points into every one of the $L$ tables ($k$ hash functions of cost $O(d)$ each): $O(nLkd)$ time.
- Store the $L$ hash tables plus the original vectors: $O(nL + nd)$ space.

Query:

- Compute the query's hashes in every table: $O(Lkd)$.
- Take the union of the matching buckets and compute exact distances to those candidates: $O(|C| \cdot d)$ for $|C|$ candidates.
Key Quantity: Expected Candidate Size
For a $(r_1, r_2, p_1, p_2)$-sensitive family with $L$ tables, each using $k$ concatenated hash functions: a far point ($d \geq r_2$) collides with the query in any one table with probability at most $p_2^k$, so the expected number of far candidates is at most $L \cdot n p_2^k$; a near point ($d \leq r_1$) is retrieved with probability at least $1 - (1 - p_1^k)^L$.
Optimal Parameters (Indyk-Motwani):
Using $k = \log_{1/p_2} n$ hash functions per table and $L = n^\rho$ tables, where $\rho = \log(1/p_1) / \log(1/p_2)$: the expected number of far candidates per table is $O(1)$, a near neighbor is found with constant probability, and the query time becomes $O(n^\rho d)$ with $O(n^{1+\rho} + nd)$ space.
For the best LSH families, $\rho \approx 1/c$ for $c$-approximation, giving $O(n^{1/c})$ query time.
| Operation | Complexity | Notes |
|---|---|---|
| Preprocessing | $O(nLkd)$ | One-time; parallelize easily |
| Space | $O(nL + nd)$ | L hash tables + original data |
| Query (compute hashes) | $O(Lkd)$ | L tables, k functions, d dims |
| Query (candidates) | $O(n^\rho d)$ expected | $\rho < 1$ for sublinear |
| Total Query | $O(n^\rho d + Lkd)$ | Sublinear for good families |
The parameter ρ = log(1/p₁)/log(1/p₂) determines query efficiency. For c-approximate NN, the best known LSH families achieve ρ ≈ 1/c. This means 2-approximate NN can be done in O(√n) time, and (1+ε)-approximate in O(n^(1/(1+ε))) time. Lower bounds suggest this is nearly optimal.
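To make the parameter recipe concrete, the sketch below plugs hypothetical values of $n$, $p_1$, and $p_2$ (chosen only for illustration) into the Indyk-Motwani formulas:

```python
import math

n = 1_000_000
p1, p2 = 0.8, 0.4                                # hypothetical collision probabilities

rho = math.log(1 / p1) / math.log(1 / p2)
k = math.ceil(math.log(n) / math.log(1 / p2))    # k = log_{1/p2} n hash functions per table
L = math.ceil(n ** rho)                          # L = n^rho tables

print(f"rho = {rho:.3f}")
print(f"k   = {k} hash functions per table")
print(f"L   = {L} tables (vs. n = {n} points)")
print(f"expected candidate work ~ n^rho = {n**rho:.0f}")
```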
Deploying LSH in production requires attention to several practical issues.
Parameter Tuning:
- Bucket width (for Euclidean): start with $w$ equal to the average nearest-neighbor distance, then tune based on the precision-recall curve (see the sketch after this list).
- Number of tables $L$: more tables means higher recall but more memory and query cost. Typical: 4-50 tables.
- Hash length $k$: more hash functions per table means tighter buckets. Balance against the probability of empty buckets.
- Banding parameters: choose them to place the S-curve threshold at your desired similarity cutoff.
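A rough sketch of the bucket-width heuristic mentioned above, estimating $w$ from nearest-neighbor distances on a random sample (the sample size and brute-force distance loop are illustrative choices):

```python
import numpy as np

def estimate_bucket_width(vectors: np.ndarray, n_samples: int = 200, seed: int = 0) -> float:
    """Estimate w as the average nearest-neighbor distance over a random sample."""
    rng = np.random.default_rng(seed)
    sample_idx = rng.choice(len(vectors), size=min(n_samples, len(vectors)), replace=False)
    nn_dists = []
    for i in sample_idx:
        d = np.linalg.norm(vectors - vectors[i], axis=1)
        d[i] = np.inf                      # exclude the point itself
        nn_dists.append(d.min())
    return float(np.mean(nn_dists))

# Usage (with the EuclideanLSH class defined earlier):
# w = estimate_bucket_width(vectors)
# lsh = EuclideanLSH(bucket_width=w, dim=vectors.shape[1])
```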
Multi-Probe LSH:
Instead of checking only the exact bucket, also check nearby buckets that a query might have hashed to with slightly different random choices. This dramatically reduces the required number of tables while maintaining recall.
```python
import numpy as np
from typing import List, Set, Tuple
from collections import defaultdict


class MultiProbeLSH:
    """
    Multi-probe LSH for improved efficiency.

    Instead of using many tables, probe multiple buckets per table.
    Buckets to probe are those where the query was 'close' to the boundary.

    This can reduce table count by 10-100x while maintaining recall.
    """

    def __init__(
        self,
        n_projections: int = 16,
        n_tables: int = 4,
        n_probes: int = 10,        # Number of additional buckets to probe
        bucket_width: float = 4.0,
        dim: int = 128,
        seed: int = 42
    ):
        self.n_projections = n_projections
        self.n_tables = n_tables
        self.n_probes = n_probes
        self.bucket_width = bucket_width
        self.dim = dim

        np.random.seed(seed)
        self.projections = np.random.randn(n_tables, n_projections, dim)
        self.offsets = np.random.uniform(0, bucket_width, (n_tables, n_projections))

        self.tables = [defaultdict(set) for _ in range(n_tables)]
        self.data = []

    def _hash_with_residuals(
        self,
        x: np.ndarray,
        table_idx: int
    ) -> Tuple[Tuple[int, ...], List[Tuple[int, float, int]]]:
        """
        Hash x and return residuals to bucket boundaries.

        Residual = how close was x to falling into an adjacent bucket.
        """
        projections = self.projections[table_idx] @ x
        shifted = projections + self.offsets[table_idx]

        bucket_indices = np.floor(shifted / self.bucket_width).astype(int)

        # Compute distance to each bucket boundary
        residuals = shifted / self.bucket_width - bucket_indices
        # residual in [0, 1): 0 means at lower boundary, 1 means at upper

        # For each dimension, compute delta to nearest boundary
        deltas = []
        for i, r in enumerate(residuals):
            if r < 0.5:
                deltas.append((i, r, -1))      # Close to lower boundary, try -1
            else:
                deltas.append((i, 1 - r, +1))  # Close to upper boundary, try +1

        # Sort by distance (smallest residual = closest to boundary)
        deltas.sort(key=lambda t: t[1])

        return tuple(bucket_indices), deltas

    def _generate_probe_buckets(
        self,
        base_bucket: Tuple[int, ...],
        deltas: List[Tuple[int, float, int]],
        n_probes: int
    ) -> List[Tuple[int, ...]]:
        """
        Generate additional buckets to probe.

        Perturbations are applied in order of proximity to boundary.
        """
        buckets = [base_bucket]

        # Single perturbations
        for i, (dim, dist, direction) in enumerate(deltas[:n_probes]):
            new_bucket = list(base_bucket)
            new_bucket[dim] += direction
            buckets.append(tuple(new_bucket))
            if len(buckets) >= n_probes + 1:
                break

        # Could add double perturbations for more probes
        if len(buckets) < n_probes + 1:
            for i in range(min(len(deltas), n_probes)):
                for j in range(i + 1, min(len(deltas), n_probes)):
                    dim1, _, dir1 = deltas[i]
                    dim2, _, dir2 = deltas[j]
                    new_bucket = list(base_bucket)
                    new_bucket[dim1] += dir1
                    new_bucket[dim2] += dir2
                    buckets.append(tuple(new_bucket))
                    if len(buckets) >= n_probes + 1:
                        break
                if len(buckets) >= n_probes + 1:
                    break

        return buckets[:n_probes + 1]

    def index(self, vectors: np.ndarray):
        """Index vectors into the exact bucket only (probing happens at query time)."""
        self.data = vectors
        for idx, vec in enumerate(vectors):
            for table_idx in range(self.n_tables):
                bucket, _ = self._hash_with_residuals(vec, table_idx)
                self.tables[table_idx][bucket].add(idx)

    def query(self, q: np.ndarray, k: int = 10) -> List[Tuple[int, float]]:
        """
        Query with multi-probing.
        """
        candidates: Set[int] = set()

        for table_idx in range(self.n_tables):
            base_bucket, deltas = self._hash_with_residuals(q, table_idx)
            probe_buckets = self._generate_probe_buckets(
                base_bucket, deltas, self.n_probes
            )
            for bucket in probe_buckets:
                candidates.update(self.tables[table_idx][bucket])

        # Score candidates
        results = []
        for idx in candidates:
            dist = np.linalg.norm(q - self.data[idx])
            results.append((idx, dist))

        results.sort(key=lambda x: x[1])
        return results[:k]
```

What's Next:
LSH provides theoretical guarantees for approximate NN, but modern practice has moved toward graph-based methods like HNSW and quantization methods like FAISS's IVF and PQ. The final page surveys these and other Approximate Nearest Neighbor (ANN) techniques that power production similarity search at companies like Spotify, Google, and Meta.
You now understand Locality-Sensitive Hashing: its theoretical foundations, hash families for different metrics, amplification techniques, and practical considerations. LSH was a breakthrough in showing that approximate NN is fundamentally easier than exact NN. Next, we survey the broader landscape of approximate methods for production-scale similarity search.