Machine LearningK-Nearest Neighbors

Locality-Sensitive Hashing (LSH)

LevelAdvanced

Duration90 mins

TopicK-Nearest Neighbors

6 / 6

LSH Performance Tuning

From Theory to Production

Everything we've covered so far has been theoretical: collision probabilities, amplification math, error bounds. But deploying LSH in production requires bridging theory and practice. Real systems have memory limits, latency budgets, data distribution quirks, and dynamic updates.

This page addresses the practical challenges of making LSH work in the real world. We'll cover multi-probe techniques that dramatically reduce memory requirements, data-dependent parameter tuning, integration with modern vector databases, and best practices from large-scale deployments at companies like Google, Facebook, and Spotify.

By the end, you'll be equipped to deploy high-performance LSH systems that meet production requirements.

What You Will Learn

By the end of this page, you will understand multi-probe LSH for memory efficiency, data-dependent parameter selection, benchmarking methodology, production deployment patterns, integration with vector databases, and troubleshooting common issues.

Multi-Probe LSH: Trading Computation for Memory

Standard LSH requires many hash tables ($L$) to achieve high recall. For billion-scale datasets, this memory requirement becomes prohibitive. Multi-probe LSH offers a solution: use fewer tables but probe multiple nearby buckets per table.

The Key Insight:

When a query misses a near neighbor, it's often because the neighbor barely crossed a bucket boundary. By also checking adjacent buckets, we can recover these near-misses.

Multi-Probe Strategy:

Instead of checking only the exact hash bucket, check $T$ buckets per table:

The exact bucket $h(q)$
The $T-1$ most likely nearby buckets

Probe Sequence Generation:

For random hyperplane LSH, nearby buckets differ by a few bit flips. The probability of each flip depends on how close the query projection was to the decision boundary (sign change).

For E2LSH, nearby buckets have adjacent integer values in one or more projections. Projections with fractional parts close to 0 or 1 are most likely to have crossed a boundary.

multiprobe_lsh_production.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
"""
Production-grade Multi-Probe LSH Implementation.
"""
 
import numpy as np
from typing import List, Tuple, Set, Dict
from dataclasses import dataclass
from collections import defaultdict
import heapq
 
@dataclass
class ProbeSequence:
    """
    Generates optimal probe sequences for E2LSH multi-probe.
    
    Based on paper: "Multi-Probe LSH: Efficient Indexing for 
    High-Dimensional Similarity Search" (Lv et al., 2007)
    """
    
    @staticmethod
    def generate_perturbation_vectors(
        hash_values: np.ndarray,
        fractional_parts: np.ndarray,
        num_probes: int
    ) -> List[Tuple[float, np.ndarray]]:
        """
        Generate perturbation vectors ordered by probability.
        
        Parameters:
        -----------
        hash_values : np.ndarray
            Base hash values (integers)
        fractional_parts : np.ndarray
            (projection + offset) / w - floor(...), in [0, 1)
        num_probes : int
            Number of probes to generate
            
        Returns:
        --------
        List of (score, perturbation_vector) tuples
        Score is related to probability (lower = more likely)
        """
        k = len(hash_values)
        
        # For each dimension, compute cost of perturbing +1 or -1
        # If fractional part f is close to 0, perturbing -1 is likely
        # If f is close to 1, perturbing +1 is likely
        
        # Cost = squared distance from boundary
        cost_minus = fractional_parts ** 2  # Cost of -1 perturbation
        cost_plus = (1 - fractional_parts) ** 2  # Cost of +1 perturbation
        
        # Generate all perturbation vectors up to Hamming distance 2
        # Use priority queue to get lowest-cost probes
        
        probes = []
        
        # Zero perturbation (exact match) - cost 0
        heapq.heappush(probes, (0.0, tuple([0] * k)))
        
        # Single perturbations
        for i in range(k):
            # Perturb dimension i by +1
            perturb = [0] * k
            perturb[i] = 1
            heapq.heappush(probes, (cost_plus[i], tuple(perturb)))
            
            # Perturb dimension i by -1
            perturb = [0] * k
            perturb[i] = -1
            heapq.heappush(probes, (cost_minus[i], tuple(perturb)))
        
        # Double perturbations (if needed for more probes)
        if num_probes > 2 * k + 1:
            for i in range(k):
                for j in range(i + 1, k):
                    for di in [-1, 1]:
                        for dj in [-1, 1]:
                            perturb = [0] * k
                            perturb[i] = di
                            perturb[j] = dj
                            cost = (cost_minus[i] if di == -1 else cost_plus[i]) +                                    (cost_minus[j] if dj == -1 else cost_plus[j])
                            heapq.heappush(probes, (cost, tuple(perturb)))
        
        # Extract top probes
        result = []
        seen = set()
        while len(result) < num_probes and probes:
            cost, perturb = heapq.heappop(probes)
            if perturb not in seen:
                seen.add(perturb)
                result.append((cost, np.array(perturb)))
        
        return result
 
 
class MultiProbeE2LSH:
    """
    Multi-probe E2LSH for production use.
    
    Uses optimal probe sequences to achieve high recall
    with fewer hash tables.
    """
    
    def __init__(
        self,
        dim: int,
        num_projections: int = 10,
        num_tables: int = 5,  # Much fewer needed with multi-probe
        bucket_width: float = 4.0,
        num_probes: int = 20,  # Probes per table
        seed: int = 42
    ):
        np.random.seed(seed)
        
        self.dim = dim
        self.k = num_projections
        self.L = num_tables
        self.w = bucket_width
        self.num_probes = num_probes
        
        # Random projections and offsets for each table
        self.projections = np.random.randn(self.L, self.k, dim)
        self.offsets = np.random.uniform(0, self.w, size=(self.L, self.k))
        
        # Hash tables
        self.tables: List[Dict] = [defaultdict(list) for _ in range(self.L)]
        self.vectors: List[np.ndarray] = []
    
    def _compute_hash_and_fractional(
        self, 
        vector: np.ndarray, 
        table_idx: int
    ) -> Tuple[np.ndarray, np.ndarray]:
        """
        Compute hash values and fractional parts for multi-probe.
        """
        proj = np.dot(self.projections[table_idx], vector)
        shifted = (proj + self.offsets[table_idx]) / self.w
        hash_values = np.floor(shifted).astype(int)
        fractional_parts = shifted - hash_values
        return hash_values, fractional_parts
    
    def add(self, vector_id: int, vector: np.ndarray):
        """Add a vector to the index."""
        self.vectors.append(vector.copy())
        
        for table_idx in range(self.L):
            hash_values, _ = self._compute_hash_and_fractional(
                vector, table_idx
            )
            hash_key = tuple(hash_values)
            self.tables[table_idx][hash_key].append(vector_id)
    
    def query(
        self, 
        query: np.ndarray, 
        k_neighbors: int = 10
    ) -> List[Tuple[int, float]]:
        """
        Multi-probe query for approximate nearest neighbors.
        """
        candidates: Set[int] = set()
        
        for table_idx in range(self.L):
            hash_values, fractional_parts = self._compute_hash_and_fractional(
                query, table_idx
            )
            
            # Generate probe sequence
            probes = ProbeSequence.generate_perturbation_vectors(
                hash_values, fractional_parts, self.num_probes
            )
            
            # Probe each bucket
            for _, perturbation in probes:
                probed_hash = tuple((hash_values + perturbation).astype(int))
                bucket = self.tables[table_idx].get(probed_hash, [])
                candidates.update(bucket)
        
        # Compute exact distances
        results = []
        for cand_id in candidates:
            dist = np.linalg.norm(query - self.vectors[cand_id])
            results.append((cand_id, float(dist)))
        
        results.sort(key=lambda x: x[1])
        return results[:k_neighbors]
 
 
def benchmark_multiprobe():
    """Compare standard LSH vs multi-probe LSH."""
    np.random.seed(42)
    
    n = 50000
    dim = 128
    
    print("Multi-Probe LSH Benchmark")
    print(f"Dataset: {n} vectors, {dim} dimensions")
    print("=" * 60)
    
    # Generate data
    data = np.random.randn(n, dim)
    query = data[0] + 0.3 * np.random.randn(dim)
    
    # Ground truth
    distances = np.linalg.norm(data - query.reshape(1, -1), axis=1)
    true_10nn = set(np.argsort(distances)[:10])
    
    # Standard LSH (many tables, single probe)
    print("
Standard LSH (L=50, T=1):")
    standard_lsh = MultiProbeE2LSH(
        dim=dim, num_projections=8, num_tables=50, 
        bucket_width=4.0, num_probes=1
    )
    for i, v in enumerate(data):
        standard_lsh.add(i, v)
    
    standard_results = standard_lsh.query(query, k_neighbors=10)
    standard_found = {r[0] for r in standard_results}
    standard_recall = len(standard_found & true_10nn) / 10
    
    print(f"  Tables: 50")
    print(f"  Recall@10: {standard_recall:.1%}")
    print(f"  Memory: 50 tables")
    
    # Multi-probe LSH (fewer tables, more probes)
    print("
Multi-Probe LSH (L=5, T=20):")
    multiprobe_lsh = MultiProbeE2LSH(
        dim=dim, num_projections=8, num_tables=5,
        bucket_width=4.0, num_probes=20
    )
    for i, v in enumerate(data):
        multiprobe_lsh.add(i, v)
    
    multiprobe_results = multiprobe_lsh.query(query, k_neighbors=10)
    multiprobe_found = {r[0] for r in multiprobe_results}
    multiprobe_recall = len(multiprobe_found & true_10nn) / 10
    
    print(f"  Tables: 5")
    print(f"  Recall@10: {multiprobe_recall:.1%}")
    print(f"  Memory: 5 tables (10x reduction)")
    print(f"  Probes per table: 20")
 
 
if __name__ == "__main__":
    benchmark_multiprobe()

Memory Reduction with Multi-Probe

Multi-probe can achieve the same recall as standard LSH with 5-10x fewer tables. This translates directly to 5-10x memory savings. The trade-off is slightly increased query computation (generating and checking probe sequences), but this is usually negligible compared to the memory benefit.

Data-Dependent Parameter Tuning

Optimal LSH parameters depend heavily on your data distribution. Theoretical analysis assumes uniform random data, but real datasets have structure that affects performance.

Empirical Parameter Selection:

The best approach is empirical validation:

Sample your data: Use a representative sample (e.g., 10,000 vectors)
Estimate distance distributions: Compute pairwise distances to understand typical near/far thresholds
Test recall on ground truth: For the sample, compute exact k-NN and measure LSH recall
Grid search (k, L, w): Find parameters that meet your recall target with minimal resources

data_dependent_tuning.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
"""
Data-dependent LSH parameter tuning.
"""
 
import numpy as np
from typing import Tuple, List
from scipy import stats
 
def analyze_distance_distribution(
    data: np.ndarray,
    sample_size: int = 1000,
    num_pairs: int = 10000
) -> Tuple[float, float, float]:
    """
    Analyze distance distribution in data to guide parameter selection.
    
    Returns:
    --------
    (median_dist, p10_dist, p90_dist) - distance percentiles
    """
    n = len(data)
    
    # Sample random pairs
    np.random.seed(42)
    idx1 = np.random.randint(0, n, num_pairs)
    idx2 = np.random.randint(0, n, num_pairs)
    
    distances = np.linalg.norm(data[idx1] - data[idx2], axis=1)
    
    # Remove self-pairs
    distances = distances[idx1 != idx2]
    
    p10 = np.percentile(distances, 10)
    p50 = np.percentile(distances, 50)
    p90 = np.percentile(distances, 90)
    
    return p50, p10, p90
 
def analyze_knn_distances(
    data: np.ndarray,
    k: int = 10,
    sample_size: int = 100
) -> Tuple[float, float]:
    """
    Analyze k-NN distances to determine near threshold.
    
    Returns:
    --------
    (mean_knn_dist, max_knn_dist)
    """
    n = len(data)
    sample_indices = np.random.choice(n, min(sample_size, n), replace=False)
    
    knn_dists = []
    for query_idx in sample_indices:
        query = data[query_idx]
        distances = np.linalg.norm(data - query.reshape(1, -1), axis=1)
        distances[query_idx] = float('inf')  # Exclude self
        kth_dist = np.partition(distances, k)[k]
        knn_dists.append(kth_dist)
    
    return np.mean(knn_dists), np.max(knn_dists)
 
def tune_bucket_width(
    data: np.ndarray,
    target_k: int = 10,
    target_recall: float = 0.9
) -> float:
    """
    Suggest bucket width (w) for E2LSH based on data distribution.
    
    Heuristic: w = 2-4x the typical k-NN distance
    """
    mean_knn_dist, max_knn_dist = analyze_knn_distances(data, k=target_k)
    
    # Start with 2x mean k-NN distance
    suggested_w = 2.5 * mean_knn_dist
    
    return suggested_w
 
def tune_lsh_parameters(
    data: np.ndarray,
    target_k: int = 10,
    target_recall: float = 0.95,
    max_memory_ratio: float = 2.0,  # Max memory as multiple of data
    sample_queries: int = 100
) -> dict:
    """
    Comprehensive parameter tuning for LSH.
    
    Returns:
    --------
    Dict with recommended parameters and expected performance
    """
    n, d = data.shape
    
    print("LSH Parameter Tuning")
    print("=" * 60)
    print(f"Dataset: n={n:,}, d={d}")
    print(f"Target: {target_k}-NN with {target_recall:.0%} recall")
    print()
    
    # Analyze data distribution
    print("1. Analyzing distance distribution...")
    median_dist, p10_dist, p90_dist = analyze_distance_distribution(data)
    print(f"   Distance percentiles: p10={p10_dist:.3f}, p50={median_dist:.3f}, p90={p90_dist:.3f}")
    
    print("
2. Analyzing k-NN distances...")
    mean_knn_dist, max_knn_dist = analyze_knn_distances(data, k=target_k)
    print(f"   Mean {target_k}-NN distance: {mean_knn_dist:.3f}")
    print(f"   Max {target_k}-NN distance: {max_knn_dist:.3f}")
    
    print("
3. Suggesting parameters...")
    
    # Bucket width
    w = tune_bucket_width(data, target_k)
    print(f"   Bucket width (w): {w:.3f}")
    
    # Number of projections (k) - heuristic
    # More projections for larger datasets
    k_suggested = max(4, min(16, int(np.log2(n) / 2)))
    print(f"   Projections per table (k): {k_suggested}")
    
    # Number of tables (L) - depends on target recall
    # Start with conservative estimate
    L_suggested = max(10, int(50 * (1 - target_recall + 0.1)))
    
    # With multi-probe, can reduce L significantly
    num_probes = 15
    L_multiprobe = max(3, L_suggested // 5)
    
    print(f"   Tables without multi-probe (L): {L_suggested}")
    print(f"   Tables with multi-probe (L): {L_multiprobe}, probes={num_probes}")
    
    return {
        "bucket_width": w,
        "num_projections": k_suggested,
        "num_tables_standard": L_suggested,
        "num_tables_multiprobe": L_multiprobe,
        "num_probes": num_probes,
        "distance_stats": {
            "p10": p10_dist,
            "median": median_dist,
            "p90": p90_dist,
            "mean_knn": mean_knn_dist,
        }
    }
 
 
if __name__ == "__main__":
    np.random.seed(42)
    
    # Generate synthetic data with clusters
    n_clusters = 50
    points_per_cluster = 1000
    dim = 128
    
    # Cluster centers
    centers = np.random.randn(n_clusters, dim) * 10
    
    # Generate points around centers
    data = []
    for center in centers:
        cluster_points = center + np.random.randn(points_per_cluster, dim)
        data.append(cluster_points)
    
    data = np.vstack(data)
    
    # Tune parameters
    params = tune_lsh_parameters(
        data, 
        target_k=10, 
        target_recall=0.95
    )
    
    print("
Recommended Parameters:")
    print("-" * 40)
    for key, value in params.items():
        if not isinstance(value, dict):
            print(f"  {key}: {value}")

Don't Trust Theoretical Defaults

Default parameters from papers assume specific distance distributions. Real data often has clusters, outliers, or other structure that changes collision probabilities. Always validate on a sample of your actual data before production deployment.

Benchmarking Methodology

Proper benchmarking is essential for evaluating LSH performance. Here's a rigorous methodology.

Key Metrics to Measure

•Recall@k — What fraction of true k-NN are returned? This is the primary quality metric.
•Precision — What fraction of returned candidates are true neighbors? Important when verification is expensive.
•Query latency — Wall-clock time per query. Measure p50, p90, p99 for SLA compliance.
•Throughput — Queries per second under concurrent load. Important for multi-tenant systems.
•Index build time — Time to construct the index. Critical for dynamic datasets.
•Memory usage — Peak and steady-state memory. Often the binding constraint.
•Candidates per query — How many candidates require distance verification? Affects post-filtering cost.

Benchmarking Protocol:

Establish ground truth: For a test set of queries, compute exact k-NN using brute force. Cache this for reuse.
Use realistic query distribution: Don't use random queries if your production queries have structure (e.g., similar to indexed data).
Warm up the system: Run several queries before measuring to populate caches.
Measure under load: Single-query latency differs from latency under concurrent load.
Control for variance: Run multiple trials; report mean and standard deviation.
Test edge cases: Include queries with many near neighbors, few near neighbors, and outliers.

benchmark_lsh.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
"""
Comprehensive LSH benchmarking framework.
"""
 
import numpy as np
import time
from typing import Dict, List, Set, Tuple
from dataclasses import dataclass, field
 
@dataclass
class BenchmarkResult:
    """Results from a single benchmark run."""
    recall_at_k: float
    precision: float
    avg_latency_ms: float
    p50_latency_ms: float
    p90_latency_ms: float
    p99_latency_ms: float
    avg_candidates: float
    qps: float  # Queries per second
    build_time_sec: float
    memory_mb: float
 
@dataclass
class LSHBenchmark:
    """
    Benchmark framework for LSH implementations.
    """
    
    data: np.ndarray
    queries: np.ndarray
    k: int = 10
    ground_truth: List[Set[int]] = field(default_factory=list)
    
    def __post_init__(self):
        if not self.ground_truth:
            self._compute_ground_truth()
    
    def _compute_ground_truth(self):
        """Compute exact k-NN for all queries."""
        print("Computing ground truth k-NN...")
        self.ground_truth = []
        
        for query in self.queries:
            distances = np.linalg.norm(
                self.data - query.reshape(1, -1), axis=1
            )
            knn = set(np.argsort(distances)[:self.k])
            self.ground_truth.append(knn)
        
        print(f"Ground truth computed for {len(self.queries)} queries")
    
    def benchmark(
        self,
        lsh_index,  # Any object with query(vector, k) method
        name: str = "LSH"
    ) -> BenchmarkResult:
        """
        Run comprehensive benchmark.
        
        Parameters:
        -----------
        lsh_index : object
            LSH index with query(vector, k) -> List[(id, dist)] method
        name : str
            Name for reporting
        """
        print(f"
Benchmarking {name}...")
        
        latencies = []
        recalls = []
        precisions = []
        num_candidates = []
        
        # Warmup
        for query in self.queries[:min(10, len(self.queries))]:
            lsh_index.query(query, self.k)
        
        # Measure
        for i, query in enumerate(self.queries):
            # Time the query
            start = time.perf_counter()
            results = lsh_index.query(query, self.k)
            elapsed = (time.perf_counter() - start) * 1000  # ms
            
            latencies.append(elapsed)
            
            # Compute recall
            returned_ids = {r[0] for r in results}
            true_nn = self.ground_truth[i]
            
            recall = len(returned_ids & true_nn) / len(true_nn) if true_nn else 1.0
            recalls.append(recall)
            
            # Precision (assuming results are top-k from candidates)
            precision = len(returned_ids & true_nn) / len(returned_ids) if returned_ids else 0.0
            precisions.append(precision)
            
            # Track candidate count if available
            if hasattr(lsh_index, 'last_candidate_count'):
                num_candidates.append(lsh_index.last_candidate_count)
        
        # Aggregate metrics
        latencies = np.array(latencies)
        
        return BenchmarkResult(
            recall_at_k=np.mean(recalls),
            precision=np.mean(precisions),
            avg_latency_ms=np.mean(latencies),
            p50_latency_ms=np.percentile(latencies, 50),
            p90_latency_ms=np.percentile(latencies, 90),
            p99_latency_ms=np.percentile(latencies, 99),
            avg_candidates=np.mean(num_candidates) if num_candidates else 0,
            qps=1000 / np.mean(latencies),
            build_time_sec=0,  # Set externally
            memory_mb=0        # Set externally
        )
    
    @staticmethod
    def print_result(name: str, result: BenchmarkResult):
        """Pretty-print benchmark results."""
        print(f"
{'='*50}")
        print(f"Results: {name}")
        print(f"{'='*50}")
        print(f"  Recall@{10}:    {result.recall_at_k:.2%}")
        print(f"  Precision:     {result.precision:.2%}")
        print(f"  Avg Latency:   {result.avg_latency_ms:.2f}ms")
        print(f"  p50 Latency:   {result.p50_latency_ms:.2f}ms")
        print(f"  p90 Latency:   {result.p90_latency_ms:.2f}ms")
        print(f"  p99 Latency:   {result.p99_latency_ms:.2f}ms")
        print(f"  Throughput:    {result.qps:.1f} QPS")
        if result.avg_candidates > 0:
            print(f"  Avg Candidates: {result.avg_candidates:.0f}")
        if result.build_time_sec > 0:
            print(f"  Build Time:    {result.build_time_sec:.1f}s")
        if result.memory_mb > 0:
            print(f"  Memory:        {result.memory_mb:.1f}MB")
 
def run_comparison_benchmark():
    """
    Run benchmarks comparing different LSH configurations.
    """
    np.random.seed(42)
    
    n = 100000
    dim = 128
    n_queries = 500
    
    print("LSH Benchmark Suite")
    print(f"Dataset: {n:,} vectors, {dim} dimensions")
    print(f"Queries: {n_queries}")
    
    # Generate data
    data = np.random.randn(n, dim)
    
    # Generate queries (mix of in-distribution and perturbed)
    query_indices = np.random.choice(n, n_queries, replace=False)
    queries = data[query_indices] + 0.3 * np.random.randn(n_queries, dim)
    
    benchmark = LSHBenchmark(data, queries, k=10)
    
    # Test different configurations
    # (In practice, you'd import your actual LSH implementations)
    
    print("
Benchmark complete!")
    print("
Summary: Choose configuration based on your priority:")
    print("  - High recall needed? Increase L or use multi-probe")
    print("  - Low latency needed? Reduce L, increase k")
    print("  - Memory constrained? Use multi-probe with fewer tables")
 
 
if __name__ == "__main__":
    run_comparison_benchmark()

Production Deployment Patterns

Deploying LSH in production involves more than just the algorithm. Here are patterns from large-scale deployments.

Production Deployment Patterns
Pattern	Description	When to Use
Sharded Index	Partition data across machines, query all shards	Dataset exceeds single machine memory
Hierarchical LSH	Coarse index filters, fine index refines	Very large datasets with variable precision needs
Cached Candidates	Cache frequent query results	High query locality, cold-start latency tolerance
Async Reindexing	Background rebuild while serving old index	Frequent data updates
Hybrid Exact/Approx	LSH for initial filter, exact search on candidates	High precision requirements
Multi-Vector Queries	Query with multiple vectors, aggregate results	Complex similarity (multi-modal, ensembles)

Sharding Strategy:

For billion-scale datasets:

Random sharding: Data randomly distributed. Query all shards, merge results.
- Pros: Simple, even load distribution
- Cons: All shards receive every query
Locality-aware sharding: Data partitioned by region in feature space.
- Pros: Can route queries to relevant shards only
- Cons: Complex routing, potential hotspots
Replicated sharding: Each shard holds full index replica.
- Pros: Scale read throughput linearly
- Cons: Memory multiplied by replica count

Index Updates:

LSH indexes are expensive to update incrementally:

Append-only: Add new points to buckets. Fast, but buckets grow unbounded.
Periodic rebuild: Rebuild index on schedule. Simple, but stale during rebuild.
Double-buffered: Build new index while old serves. Seamless switchover.
LSM-tree inspired: Layer of small fresh indexes merged periodically.

Start Simple, Optimize Later

Many production systems start with a single-machine, periodically-rebuilt index. Only add sharding, caching, and complex patterns when demonstrated necessary. Premature optimization in distributed systems creates operational burden.

Integration with Vector Databases

Modern vector databases like Pinecone, Weaviate, Milvus, Qdrant, and FAISS use LSH or related techniques under the hood. Understanding LSH helps you configure these systems effectively.

FAISS (Facebook AI Similarity Search):

FAISS is the most widely-used library for vector similarity search. It offers multiple indexing methods:

IndexLSH: Pure LSH with bit signatures
IndexIVF*: Inverted file indexes (cluster-based, not LSH)
IndexHNSW: Graph-based (not LSH, but often faster)

LSH in FAISS is typically combined with product quantization for memory efficiency.

Pinecone:

Pinecone abstracts away index implementation details, but you configure:

metric: cosine, euclidean, dot product (affects internal hashing)
pods: Compute resources (affects latency/throughput tradeoff)
Metadata filtering (integrated with similarity search)

Milvus:

Milvus exposes more knobs:

Index types: IVF_FLAT, IVF_SQ8, IVF_PQ, HNSW, ANNOY
Each has parameters like nlist, nprobe (analogous to L and T)

vector_db_integration.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
"""
Example integration with FAISS vector database.
"""
 
import numpy as np
 
# Note: Requires faiss-cpu or faiss-gpu package
# pip install faiss-cpu
 
def faiss_lsh_example():
    """
    Demonstrate LSH with FAISS.
    """
    try:
        import faiss
    except ImportError:
        print("FAISS not installed. Run: pip install faiss-cpu")
        return
    
    np.random.seed(42)
    
    # Parameters
    d = 128  # Dimension
    n = 100000  # Database size
    n_queries = 1000
    n_bits = 256  # Number of hash bits
    
    print("FAISS LSH Example")
    print(f"Dataset: {n:,} vectors, {d} dimensions")
    print(f"LSH bits: {n_bits}")
    
    # Generate data
    data = np.random.randn(n, d).astype('float32')
    queries = np.random.randn(n_queries, d).astype('float32')
    
    # Build LSH index
    print("
Building LSH index...")
    index = faiss.IndexLSH(d, n_bits)
    index.train(data)  # LSH doesn't require training, but API is uniform
    index.add(data)
    
    print(f"Index size: {index.ntotal} vectors")
    
    # Search
    print("
Searching...")
    k = 10
    distances, indices = index.search(queries, k)
    
    # Compare with brute force
    print("
Computing ground truth (brute force)...")
    brute_index = faiss.IndexFlatL2(d)
    brute_index.add(data)
    true_distances, true_indices = brute_index.search(queries, k)
    
    # Compute recall
    recalls = []
    for i in range(n_queries):
        lsh_set = set(indices[i])
        true_set = set(true_indices[i])
        recall = len(lsh_set & true_set) / k
        recalls.append(recall)
    
    print(f"
Recall@{k}: {np.mean(recalls):.2%}")
    print("(Pure LSH in FAISS has moderate recall; use IVF or HNSW for better results)")
 
 
def faiss_ivf_example():
    """
    Demonstrate IVF (Inverted File) index - a practical alternative to LSH.
    """
    try:
        import faiss
    except ImportError:
        print("FAISS not installed.")
        return
    
    np.random.seed(42)
    
    d = 128
    n = 100000
    n_queries = 100
    
    print("
" + "="*50)
    print("FAISS IVF Example (Cluster-based, not LSH)")
    print("="*50)
    
    data = np.random.randn(n, d).astype('float32')
    queries = np.random.randn(n_queries, d).astype('float32')
    
    # IVF index: data partitioned into nlist clusters
    nlist = 100  # Number of clusters (like L in LSH)
    
    # Create quantizer and IVF index
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, nlist)
    
    print("
Training index (learning cluster centroids)...")
    index.train(data)
    index.add(data)
    
    # nprobe controls recall/speed tradeoff (like T in multi-probe LSH)
    for nprobe in [1, 5, 10, 20, 50]:
        index.nprobe = nprobe
        
        distances, indices = index.search(queries, 10)
        
        # Ground truth
        brute_index = faiss.IndexFlatL2(d)
        brute_index.add(data)
        true_distances, true_indices = brute_index.search(queries, 10)
        
        recalls = []
        for i in range(n_queries):
            recall = len(set(indices[i]) & set(true_indices[i])) / 10
            recalls.append(recall)
        
        print(f"nprobe={nprobe:>3}: Recall@10 = {np.mean(recalls):.2%}")
 
 
if __name__ == "__main__":
    faiss_lsh_example()
    faiss_ivf_example()

Troubleshooting Common Issues

Even well-configured LSH systems can exhibit unexpected behavior. Here's how to diagnose and fix common issues.

Problem: Low Recall Despite Theory Prediction

•Check data normalization — For cosine LSH, vectors must be normalized. Varying magnitudes break the angle-based collision probability.
•Verify distance distribution — If true k-NN are farther than expected, you need more tables or wider buckets.
•Look for data skew — Clustered data can cause uneven bucket sizes, reducing effective coverage.
•Validate hash function correctness — Off-by-one errors in bucket computation are surprisingly common.
•Increase tables (L) or probes (T) — The simplest fix for low recall is more brute-force coverage.

Problem: Slow Query Time

•Check bucket sizes — Large buckets (many candidates per bucket) slow verification. Increase k or decrease w.
•Profile hash computation — If hashing is slow, optimize projections (vectorize, use sparse random projections).
•Reduce number of tables (L) — Each table adds hash computation and bucket lookup overhead.
•Use approximate distance early — For candidate filtering, use lower-precision distance (e.g., 32-bit float) before exact verification.
•Check memory bandwidth — For large indexes, query time may be memory-bound. Ensure data locality.

Problem: Memory Exceeds Budget

•Switch to multi-probe — Reduce tables by 5-10x with similar recall.
•Use compressed hash keys — Pack bits into integers instead of tuples.
•Store indices only, not vectors — Keep vectors in separate storage, access on-demand.
•Shard the index — Distribute across machines if single-machine memory is insufficient.
•Consider alternative indexes — HNSW and IVF often have better memory efficiency than pure LSH.

When in Doubt, Profile

Most production issues stem from incorrect assumptions about data distributions or implementation bugs. Before tuning parameters, add instrumentation: log bucket sizes, candidate counts, and timing breakdowns. The data usually reveals the problem.

Best Practices Summary

Let's consolidate the practical wisdom from this module into actionable guidelines.

LSH Best Practices Checklist
Category	Best Practice	Common Mistake
Data Prep	Normalize vectors for cosine similarity	Using raw vectors with varying scales
Parameter Selection	Tune on representative sample of production data	Using paper defaults without validation
Recall Target	Define explicit recall requirement upfront	Optimizing for vague "good enough"
Memory Planning	Estimate memory before deployment	Discovering OOM in production
Multi-Probe	Use multi-probe for memory-constrained scenarios	Always using single-probe with many tables
Benchmarking	Measure p99 latency, not just average	Reporting only mean performance
Ground Truth	Compute exact k-NN for validation set	Trusting LSH results without verification
Production	Monitor bucket sizes and candidate counts	Deploy and forget
Fallback	Have brute-force fallback for failed queries	No handling for edge cases
Updates	Plan for index refresh strategy	Static index with stale data

Quick Reference: Parameter Selection

For Random Hyperplane LSH (cosine):
  k = ceil(log₂(n) / 2)    # Start here, adjust for precision
  L = 10-30                 # Increase for higher recall
  With multi-probe: L = 3-5, T = 15-25

For E2LSH (Euclidean):
  w = 2-4 × (typical k-NN distance)
  k = 6-12
  L = 10-50
  With multi-probe: L = 3-10, T = 10-25

For MinHash (Jaccard):
  n_hashes = 128-256
  bands = 16-32
  rows = n_hashes / bands

Module Summary: Locality-Sensitive Hashing

We've completed a comprehensive exploration of Locality-Sensitive Hashing. Let's review the full journey:

Module Takeaways

•LSH inverts traditional hashing — Instead of minimizing collisions, LSH maximizes meaningful collisions based on similarity.
•Different metrics need different hash families — Cosine uses random hyperplanes, Euclidean uses p-stable distributions, Jaccard uses MinHash.
•Collision probability analysis enables tuning — Understanding p(similarity) → collision probability allows principled parameter selection.
•AND-OR amplification is the secret sauce — Combining k functions per table and L tables creates sharp S-curve discrimination.
•The ρ parameter determines query complexity — Query time is O(n^ρ), with ρ < 1 for useful LSH families.
•Multi-probe trades computation for memory — Check nearby buckets instead of building more tables.
•Production deployment requires empirical tuning — Theory guides, but real data distributions determine actual performance.
•LSH is one tool in the similarity search toolkit — Modern systems often use HNSW, IVF, or hybrid approaches.

When to Use LSH:

✅ Use LSH when:

Dataset is too large for brute-force exact search
Approximate answers are acceptable
You need sub-linear query time guarantees
Memory budget allows multiple hash tables
Your similarity metric has an established LSH family

❌ Consider alternatives when:

Exact nearest neighbors are required (use exact methods)
Data is low-dimensional (d < 10-20, use KD-trees)
Dataset is small (< 10,000 points, brute force is fine)
You need adaptive similarity (use learned hashing)
Graph structure can be exploited (use HNSW)

Module Complete

Congratulations! You now have comprehensive understanding of Locality-Sensitive Hashing, from theoretical foundations to production deployment. This knowledge enables you to design, implement, and optimize efficient similarity search systems for large-scale machine learning applications.

6 / 6

Loading learning content...

Machine LearningK-Nearest Neighbors

Locality-Sensitive Hashing (LSH)

LevelAdvanced

Duration90 mins

TopicK-Nearest Neighbors

6 / 6

LSH Performance Tuning

From Theory to Production

By the end, you'll be equipped to deploy high-performance LSH systems that meet production requirements.

What You Will Learn

Multi-Probe LSH: Trading Computation for Memory

The Key Insight:

When a query misses a near neighbor, it's often because the neighbor barely crossed a bucket boundary. By also checking adjacent buckets, we can recover these near-misses.

Multi-Probe Strategy:

Instead of checking only the exact hash bucket, check $T$ buckets per table:

The exact bucket $h(q)$
The $T-1$ most likely nearby buckets

Probe Sequence Generation:

For random hyperplane LSH, nearby buckets differ by a few bit flips. The probability of each flip depends on how close the query projection was to the decision boundary (sign change).

For E2LSH, nearby buckets have adjacent integer values in one or more projections. Projections with fractional parts close to 0 or 1 are most likely to have crossed a boundary.

multiprobe_lsh_production.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
"""
Production-grade Multi-Probe LSH Implementation.
"""
 
import numpy as np
from typing import List, Tuple, Set, Dict
from dataclasses import dataclass
from collections import defaultdict
import heapq
 
@dataclass
class ProbeSequence:
    """
    Generates optimal probe sequences for E2LSH multi-probe.
    
    Based on paper: "Multi-Probe LSH: Efficient Indexing for 
    High-Dimensional Similarity Search" (Lv et al., 2007)
    """
    
    @staticmethod
    def generate_perturbation_vectors(
        hash_values: np.ndarray,
        fractional_parts: np.ndarray,
        num_probes: int
    ) -> List[Tuple[float, np.ndarray]]:
        """
        Generate perturbation vectors ordered by probability.
        
        Parameters:
        -----------
        hash_values : np.ndarray
            Base hash values (integers)
        fractional_parts : np.ndarray
            (projection + offset) / w - floor(...), in [0, 1)
        num_probes : int
            Number of probes to generate
            
        Returns:
        --------
        List of (score, perturbation_vector) tuples
        Score is related to probability (lower = more likely)
        """
        k = len(hash_values)
        
        # For each dimension, compute cost of perturbing +1 or -1
        # If fractional part f is close to 0, perturbing -1 is likely
        # If f is close to 1, perturbing +1 is likely
        
        # Cost = squared distance from boundary
        cost_minus = fractional_parts ** 2  # Cost of -1 perturbation
        cost_plus = (1 - fractional_parts) ** 2  # Cost of +1 perturbation
        
        # Generate all perturbation vectors up to Hamming distance 2
        # Use priority queue to get lowest-cost probes
        
        probes = []
        
        # Zero perturbation (exact match) - cost 0
        heapq.heappush(probes, (0.0, tuple([0] * k)))
        
        # Single perturbations
        for i in range(k):
            # Perturb dimension i by +1
            perturb = [0] * k
            perturb[i] = 1
            heapq.heappush(probes, (cost_plus[i], tuple(perturb)))
            
            # Perturb dimension i by -1
            perturb = [0] * k
            perturb[i] = -1
            heapq.heappush(probes, (cost_minus[i], tuple(perturb)))
        
        # Double perturbations (if needed for more probes)
        if num_probes > 2 * k + 1:
            for i in range(k):
                for j in range(i + 1, k):
                    for di in [-1, 1]:
                        for dj in [-1, 1]:
                            perturb = [0] * k
                            perturb[i] = di
                            perturb[j] = dj
                            cost = (cost_minus[i] if di == -1 else cost_plus[i]) +                                    (cost_minus[j] if dj == -1 else cost_plus[j])
                            heapq.heappush(probes, (cost, tuple(perturb)))
        
        # Extract top probes
        result = []
        seen = set()
        while len(result) < num_probes and probes:
            cost, perturb = heapq.heappop(probes)
            if perturb not in seen:
                seen.add(perturb)
                result.append((cost, np.array(perturb)))
        
        return result
 
 
class MultiProbeE2LSH:
    """
    Multi-probe E2LSH for production use.
    
    Uses optimal probe sequences to achieve high recall
    with fewer hash tables.
    """
    
    def __init__(
        self,
        dim: int,
        num_projections: int = 10,
        num_tables: int = 5,  # Much fewer needed with multi-probe
        bucket_width: float = 4.0,
        num_probes: int = 20,  # Probes per table
        seed: int = 42
    ):
        np.random.seed(seed)
        
        self.dim = dim
        self.k = num_projections
        self.L = num_tables
        self.w = bucket_width
        self.num_probes = num_probes
        
        # Random projections and offsets for each table
        self.projections = np.random.randn(self.L, self.k, dim)
        self.offsets = np.random.uniform(0, self.w, size=(self.L, self.k))
        
        # Hash tables
        self.tables: List[Dict] = [defaultdict(list) for _ in range(self.L)]
        self.vectors: List[np.ndarray] = []
    
    def _compute_hash_and_fractional(
        self, 
        vector: np.ndarray, 
        table_idx: int
    ) -> Tuple[np.ndarray, np.ndarray]:
        """
        Compute hash values and fractional parts for multi-probe.
        """
        proj = np.dot(self.projections[table_idx], vector)
        shifted = (proj + self.offsets[table_idx]) / self.w
        hash_values = np.floor(shifted).astype(int)
        fractional_parts = shifted - hash_values
        return hash_values, fractional_parts
    
    def add(self, vector_id: int, vector: np.ndarray):
        """Add a vector to the index."""
        self.vectors.append(vector.copy())
        
        for table_idx in range(self.L):
            hash_values, _ = self._compute_hash_and_fractional(
                vector, table_idx
            )
            hash_key = tuple(hash_values)
            self.tables[table_idx][hash_key].append(vector_id)
    
    def query(
        self, 
        query: np.ndarray, 
        k_neighbors: int = 10
    ) -> List[Tuple[int, float]]:
        """
        Multi-probe query for approximate nearest neighbors.
        """
        candidates: Set[int] = set()
        
        for table_idx in range(self.L):
            hash_values, fractional_parts = self._compute_hash_and_fractional(
                query, table_idx
            )
            
            # Generate probe sequence
            probes = ProbeSequence.generate_perturbation_vectors(
                hash_values, fractional_parts, self.num_probes
            )
            
            # Probe each bucket
            for _, perturbation in probes:
                probed_hash = tuple((hash_values + perturbation).astype(int))
                bucket = self.tables[table_idx].get(probed_hash, [])
                candidates.update(bucket)
        
        # Compute exact distances
        results = []
        for cand_id in candidates:
            dist = np.linalg.norm(query - self.vectors[cand_id])
            results.append((cand_id, float(dist)))
        
        results.sort(key=lambda x: x[1])
        return results[:k_neighbors]
 
 
def benchmark_multiprobe():
    """Compare standard LSH vs multi-probe LSH."""
    np.random.seed(42)
    
    n = 50000
    dim = 128
    
    print("Multi-Probe LSH Benchmark")
    print(f"Dataset: {n} vectors, {dim} dimensions")
    print("=" * 60)
    
    # Generate data
    data = np.random.randn(n, dim)
    query = data[0] + 0.3 * np.random.randn(dim)
    
    # Ground truth
    distances = np.linalg.norm(data - query.reshape(1, -1), axis=1)
    true_10nn = set(np.argsort(distances)[:10])
    
    # Standard LSH (many tables, single probe)
    print("
Standard LSH (L=50, T=1):")
    standard_lsh = MultiProbeE2LSH(
        dim=dim, num_projections=8, num_tables=50, 
        bucket_width=4.0, num_probes=1
    )
    for i, v in enumerate(data):
        standard_lsh.add(i, v)
    
    standard_results = standard_lsh.query(query, k_neighbors=10)
    standard_found = {r[0] for r in standard_results}
    standard_recall = len(standard_found & true_10nn) / 10
    
    print(f"  Tables: 50")
    print(f"  Recall@10: {standard_recall:.1%}")
    print(f"  Memory: 50 tables")
    
    # Multi-probe LSH (fewer tables, more probes)
    print("
Multi-Probe LSH (L=5, T=20):")
    multiprobe_lsh = MultiProbeE2LSH(
        dim=dim, num_projections=8, num_tables=5,
        bucket_width=4.0, num_probes=20
    )
    for i, v in enumerate(data):
        multiprobe_lsh.add(i, v)
    
    multiprobe_results = multiprobe_lsh.query(query, k_neighbors=10)
    multiprobe_found = {r[0] for r in multiprobe_results}
    multiprobe_recall = len(multiprobe_found & true_10nn) / 10
    
    print(f"  Tables: 5")
    print(f"  Recall@10: {multiprobe_recall:.1%}")
    print(f"  Memory: 5 tables (10x reduction)")
    print(f"  Probes per table: 20")
 
 
if __name__ == "__main__":
    benchmark_multiprobe()

Memory Reduction with Multi-Probe

Data-Dependent Parameter Tuning

Optimal LSH parameters depend heavily on your data distribution. Theoretical analysis assumes uniform random data, but real datasets have structure that affects performance.

Empirical Parameter Selection:

The best approach is empirical validation:

Sample your data: Use a representative sample (e.g., 10,000 vectors)
Estimate distance distributions: Compute pairwise distances to understand typical near/far thresholds
Test recall on ground truth: For the sample, compute exact k-NN and measure LSH recall
Grid search (k, L, w): Find parameters that meet your recall target with minimal resources

data_dependent_tuning.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
"""
Data-dependent LSH parameter tuning.
"""
 
import numpy as np
from typing import Tuple, List
from scipy import stats
 
def analyze_distance_distribution(
    data: np.ndarray,
    sample_size: int = 1000,
    num_pairs: int = 10000
) -> Tuple[float, float, float]:
    """
    Analyze distance distribution in data to guide parameter selection.
    
    Returns:
    --------
    (median_dist, p10_dist, p90_dist) - distance percentiles
    """
    n = len(data)
    
    # Sample random pairs
    np.random.seed(42)
    idx1 = np.random.randint(0, n, num_pairs)
    idx2 = np.random.randint(0, n, num_pairs)
    
    distances = np.linalg.norm(data[idx1] - data[idx2], axis=1)
    
    # Remove self-pairs
    distances = distances[idx1 != idx2]
    
    p10 = np.percentile(distances, 10)
    p50 = np.percentile(distances, 50)
    p90 = np.percentile(distances, 90)
    
    return p50, p10, p90
 
def analyze_knn_distances(
    data: np.ndarray,
    k: int = 10,
    sample_size: int = 100
) -> Tuple[float, float]:
    """
    Analyze k-NN distances to determine near threshold.
    
    Returns:
    --------
    (mean_knn_dist, max_knn_dist)
    """
    n = len(data)
    sample_indices = np.random.choice(n, min(sample_size, n), replace=False)
    
    knn_dists = []
    for query_idx in sample_indices:
        query = data[query_idx]
        distances = np.linalg.norm(data - query.reshape(1, -1), axis=1)
        distances[query_idx] = float('inf')  # Exclude self
        kth_dist = np.partition(distances, k)[k]
        knn_dists.append(kth_dist)
    
    return np.mean(knn_dists), np.max(knn_dists)
 
def tune_bucket_width(
    data: np.ndarray,
    target_k: int = 10,
    target_recall: float = 0.9
) -> float:
    """
    Suggest bucket width (w) for E2LSH based on data distribution.
    
    Heuristic: w = 2-4x the typical k-NN distance
    """
    mean_knn_dist, max_knn_dist = analyze_knn_distances(data, k=target_k)
    
    # Start with 2x mean k-NN distance
    suggested_w = 2.5 * mean_knn_dist
    
    return suggested_w
 
def tune_lsh_parameters(
    data: np.ndarray,
    target_k: int = 10,
    target_recall: float = 0.95,
    max_memory_ratio: float = 2.0,  # Max memory as multiple of data
    sample_queries: int = 100
) -> dict:
    """
    Comprehensive parameter tuning for LSH.
    
    Returns:
    --------
    Dict with recommended parameters and expected performance
    """
    n, d = data.shape
    
    print("LSH Parameter Tuning")
    print("=" * 60)
    print(f"Dataset: n={n:,}, d={d}")
    print(f"Target: {target_k}-NN with {target_recall:.0%} recall")
    print()
    
    # Analyze data distribution
    print("1. Analyzing distance distribution...")
    median_dist, p10_dist, p90_dist = analyze_distance_distribution(data)
    print(f"   Distance percentiles: p10={p10_dist:.3f}, p50={median_dist:.3f}, p90={p90_dist:.3f}")
    
    print("
2. Analyzing k-NN distances...")
    mean_knn_dist, max_knn_dist = analyze_knn_distances(data, k=target_k)
    print(f"   Mean {target_k}-NN distance: {mean_knn_dist:.3f}")
    print(f"   Max {target_k}-NN distance: {max_knn_dist:.3f}")
    
    print("
3. Suggesting parameters...")
    
    # Bucket width
    w = tune_bucket_width(data, target_k)
    print(f"   Bucket width (w): {w:.3f}")
    
    # Number of projections (k) - heuristic
    # More projections for larger datasets
    k_suggested = max(4, min(16, int(np.log2(n) / 2)))
    print(f"   Projections per table (k): {k_suggested}")
    
    # Number of tables (L) - depends on target recall
    # Start with conservative estimate
    L_suggested = max(10, int(50 * (1 - target_recall + 0.1)))
    
    # With multi-probe, can reduce L significantly
    num_probes = 15
    L_multiprobe = max(3, L_suggested // 5)
    
    print(f"   Tables without multi-probe (L): {L_suggested}")
    print(f"   Tables with multi-probe (L): {L_multiprobe}, probes={num_probes}")
    
    return {
        "bucket_width": w,
        "num_projections": k_suggested,
        "num_tables_standard": L_suggested,
        "num_tables_multiprobe": L_multiprobe,
        "num_probes": num_probes,
        "distance_stats": {
            "p10": p10_dist,
            "median": median_dist,
            "p90": p90_dist,
            "mean_knn": mean_knn_dist,
        }
    }
 
 
if __name__ == "__main__":
    np.random.seed(42)
    
    # Generate synthetic data with clusters
    n_clusters = 50
    points_per_cluster = 1000
    dim = 128
    
    # Cluster centers
    centers = np.random.randn(n_clusters, dim) * 10
    
    # Generate points around centers
    data = []
    for center in centers:
        cluster_points = center + np.random.randn(points_per_cluster, dim)
        data.append(cluster_points)
    
    data = np.vstack(data)
    
    # Tune parameters
    params = tune_lsh_parameters(
        data, 
        target_k=10, 
        target_recall=0.95
    )
    
    print("
Recommended Parameters:")
    print("-" * 40)
    for key, value in params.items():
        if not isinstance(value, dict):
            print(f"  {key}: {value}")

Don't Trust Theoretical Defaults

Benchmarking Methodology

Proper benchmarking is essential for evaluating LSH performance. Here's a rigorous methodology.

Key Metrics to Measure

•Recall@k — What fraction of true k-NN are returned? This is the primary quality metric.
•Precision — What fraction of returned candidates are true neighbors? Important when verification is expensive.
•Query latency — Wall-clock time per query. Measure p50, p90, p99 for SLA compliance.
•Throughput — Queries per second under concurrent load. Important for multi-tenant systems.
•Index build time — Time to construct the index. Critical for dynamic datasets.
•Memory usage — Peak and steady-state memory. Often the binding constraint.
•Candidates per query — How many candidates require distance verification? Affects post-filtering cost.

Benchmarking Protocol:

Establish ground truth: For a test set of queries, compute exact k-NN using brute force. Cache this for reuse.
Use realistic query distribution: Don't use random queries if your production queries have structure (e.g., similar to indexed data).
Warm up the system: Run several queries before measuring to populate caches.
Measure under load: Single-query latency differs from latency under concurrent load.
Control for variance: Run multiple trials; report mean and standard deviation.
Test edge cases: Include queries with many near neighbors, few near neighbors, and outliers.

benchmark_lsh.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
"""
Comprehensive LSH benchmarking framework.
"""
 
import numpy as np
import time
from typing import Dict, List, Set, Tuple
from dataclasses import dataclass, field
 
@dataclass
class BenchmarkResult:
    """Results from a single benchmark run."""
    recall_at_k: float
    precision: float
    avg_latency_ms: float
    p50_latency_ms: float
    p90_latency_ms: float
    p99_latency_ms: float
    avg_candidates: float
    qps: float  # Queries per second
    build_time_sec: float
    memory_mb: float
 
@dataclass
class LSHBenchmark:
    """
    Benchmark framework for LSH implementations.
    """
    
    data: np.ndarray
    queries: np.ndarray
    k: int = 10
    ground_truth: List[Set[int]] = field(default_factory=list)
    
    def __post_init__(self):
        if not self.ground_truth:
            self._compute_ground_truth()
    
    def _compute_ground_truth(self):
        """Compute exact k-NN for all queries."""
        print("Computing ground truth k-NN...")
        self.ground_truth = []
        
        for query in self.queries:
            distances = np.linalg.norm(
                self.data - query.reshape(1, -1), axis=1
            )
            knn = set(np.argsort(distances)[:self.k])
            self.ground_truth.append(knn)
        
        print(f"Ground truth computed for {len(self.queries)} queries")
    
    def benchmark(
        self,
        lsh_index,  # Any object with query(vector, k) method
        name: str = "LSH"
    ) -> BenchmarkResult:
        """
        Run comprehensive benchmark.
        
        Parameters:
        -----------
        lsh_index : object
            LSH index with query(vector, k) -> List[(id, dist)] method
        name : str
            Name for reporting
        """
        print(f"
Benchmarking {name}...")
        
        latencies = []
        recalls = []
        precisions = []
        num_candidates = []
        
        # Warmup
        for query in self.queries[:min(10, len(self.queries))]:
            lsh_index.query(query, self.k)
        
        # Measure
        for i, query in enumerate(self.queries):
            # Time the query
            start = time.perf_counter()
            results = lsh_index.query(query, self.k)
            elapsed = (time.perf_counter() - start) * 1000  # ms
            
            latencies.append(elapsed)
            
            # Compute recall
            returned_ids = {r[0] for r in results}
            true_nn = self.ground_truth[i]
            
            recall = len(returned_ids & true_nn) / len(true_nn) if true_nn else 1.0
            recalls.append(recall)
            
            # Precision (assuming results are top-k from candidates)
            precision = len(returned_ids & true_nn) / len(returned_ids) if returned_ids else 0.0
            precisions.append(precision)
            
            # Track candidate count if available
            if hasattr(lsh_index, 'last_candidate_count'):
                num_candidates.append(lsh_index.last_candidate_count)
        
        # Aggregate metrics
        latencies = np.array(latencies)
        
        return BenchmarkResult(
            recall_at_k=np.mean(recalls),
            precision=np.mean(precisions),
            avg_latency_ms=np.mean(latencies),
            p50_latency_ms=np.percentile(latencies, 50),
            p90_latency_ms=np.percentile(latencies, 90),
            p99_latency_ms=np.percentile(latencies, 99),
            avg_candidates=np.mean(num_candidates) if num_candidates else 0,
            qps=1000 / np.mean(latencies),
            build_time_sec=0,  # Set externally
            memory_mb=0        # Set externally
        )
    
    @staticmethod
    def print_result(name: str, result: BenchmarkResult):
        """Pretty-print benchmark results."""
        print(f"
{'='*50}")
        print(f"Results: {name}")
        print(f"{'='*50}")
        print(f"  Recall@{10}:    {result.recall_at_k:.2%}")
        print(f"  Precision:     {result.precision:.2%}")
        print(f"  Avg Latency:   {result.avg_latency_ms:.2f}ms")
        print(f"  p50 Latency:   {result.p50_latency_ms:.2f}ms")
        print(f"  p90 Latency:   {result.p90_latency_ms:.2f}ms")
        print(f"  p99 Latency:   {result.p99_latency_ms:.2f}ms")
        print(f"  Throughput:    {result.qps:.1f} QPS")
        if result.avg_candidates > 0:
            print(f"  Avg Candidates: {result.avg_candidates:.0f}")
        if result.build_time_sec > 0:
            print(f"  Build Time:    {result.build_time_sec:.1f}s")
        if result.memory_mb > 0:
            print(f"  Memory:        {result.memory_mb:.1f}MB")
 
def run_comparison_benchmark():
    """
    Run benchmarks comparing different LSH configurations.
    """
    np.random.seed(42)
    
    n = 100000
    dim = 128
    n_queries = 500
    
    print("LSH Benchmark Suite")
    print(f"Dataset: {n:,} vectors, {dim} dimensions")
    print(f"Queries: {n_queries}")
    
    # Generate data
    data = np.random.randn(n, dim)
    
    # Generate queries (mix of in-distribution and perturbed)
    query_indices = np.random.choice(n, n_queries, replace=False)
    queries = data[query_indices] + 0.3 * np.random.randn(n_queries, dim)
    
    benchmark = LSHBenchmark(data, queries, k=10)
    
    # Test different configurations
    # (In practice, you'd import your actual LSH implementations)
    
    print("
Benchmark complete!")
    print("
Summary: Choose configuration based on your priority:")
    print("  - High recall needed? Increase L or use multi-probe")
    print("  - Low latency needed? Reduce L, increase k")
    print("  - Memory constrained? Use multi-probe with fewer tables")
 
 
if __name__ == "__main__":
    run_comparison_benchmark()

Production Deployment Patterns

Deploying LSH in production involves more than just the algorithm. Here are patterns from large-scale deployments.

Production Deployment Patterns
Pattern	Description	When to Use
Sharded Index	Partition data across machines, query all shards	Dataset exceeds single machine memory
Hierarchical LSH	Coarse index filters, fine index refines	Very large datasets with variable precision needs
Cached Candidates	Cache frequent query results	High query locality, cold-start latency tolerance
Async Reindexing	Background rebuild while serving old index	Frequent data updates
Hybrid Exact/Approx	LSH for initial filter, exact search on candidates	High precision requirements
Multi-Vector Queries	Query with multiple vectors, aggregate results	Complex similarity (multi-modal, ensembles)

Sharding Strategy:

For billion-scale datasets:

Random sharding: Data randomly distributed. Query all shards, merge results.
- Pros: Simple, even load distribution
- Cons: All shards receive every query
Locality-aware sharding: Data partitioned by region in feature space.
- Pros: Can route queries to relevant shards only
- Cons: Complex routing, potential hotspots
Replicated sharding: Each shard holds full index replica.
- Pros: Scale read throughput linearly
- Cons: Memory multiplied by replica count

Index Updates:

LSH indexes are expensive to update incrementally:

Append-only: Add new points to buckets. Fast, but buckets grow unbounded.
Periodic rebuild: Rebuild index on schedule. Simple, but stale during rebuild.
Double-buffered: Build new index while old serves. Seamless switchover.
LSM-tree inspired: Layer of small fresh indexes merged periodically.

Start Simple, Optimize Later

Integration with Vector Databases

FAISS (Facebook AI Similarity Search):

FAISS is the most widely-used library for vector similarity search. It offers multiple indexing methods:

IndexLSH: Pure LSH with bit signatures
IndexIVF*: Inverted file indexes (cluster-based, not LSH)
IndexHNSW: Graph-based (not LSH, but often faster)

LSH in FAISS is typically combined with product quantization for memory efficiency.

Pinecone:

Pinecone abstracts away index implementation details, but you configure:

metric: cosine, euclidean, dot product (affects internal hashing)
pods: Compute resources (affects latency/throughput tradeoff)
Metadata filtering (integrated with similarity search)

Milvus:

Milvus exposes more knobs:

Index types: IVF_FLAT, IVF_SQ8, IVF_PQ, HNSW, ANNOY
Each has parameters like nlist, nprobe (analogous to L and T)

vector_db_integration.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
"""
Example integration with FAISS vector database.
"""
 
import numpy as np
 
# Note: Requires faiss-cpu or faiss-gpu package
# pip install faiss-cpu
 
def faiss_lsh_example():
    """
    Demonstrate LSH with FAISS.
    """
    try:
        import faiss
    except ImportError:
        print("FAISS not installed. Run: pip install faiss-cpu")
        return
    
    np.random.seed(42)
    
    # Parameters
    d = 128  # Dimension
    n = 100000  # Database size
    n_queries = 1000
    n_bits = 256  # Number of hash bits
    
    print("FAISS LSH Example")
    print(f"Dataset: {n:,} vectors, {d} dimensions")
    print(f"LSH bits: {n_bits}")
    
    # Generate data
    data = np.random.randn(n, d).astype('float32')
    queries = np.random.randn(n_queries, d).astype('float32')
    
    # Build LSH index
    print("
Building LSH index...")
    index = faiss.IndexLSH(d, n_bits)
    index.train(data)  # LSH doesn't require training, but API is uniform
    index.add(data)
    
    print(f"Index size: {index.ntotal} vectors")
    
    # Search
    print("
Searching...")
    k = 10
    distances, indices = index.search(queries, k)
    
    # Compare with brute force
    print("
Computing ground truth (brute force)...")
    brute_index = faiss.IndexFlatL2(d)
    brute_index.add(data)
    true_distances, true_indices = brute_index.search(queries, k)
    
    # Compute recall
    recalls = []
    for i in range(n_queries):
        lsh_set = set(indices[i])
        true_set = set(true_indices[i])
        recall = len(lsh_set & true_set) / k
        recalls.append(recall)
    
    print(f"
Recall@{k}: {np.mean(recalls):.2%}")
    print("(Pure LSH in FAISS has moderate recall; use IVF or HNSW for better results)")
 
 
def faiss_ivf_example():
    """
    Demonstrate IVF (Inverted File) index - a practical alternative to LSH.
    """
    try:
        import faiss
    except ImportError:
        print("FAISS not installed.")
        return
    
    np.random.seed(42)
    
    d = 128
    n = 100000
    n_queries = 100
    
    print("
" + "="*50)
    print("FAISS IVF Example (Cluster-based, not LSH)")
    print("="*50)
    
    data = np.random.randn(n, d).astype('float32')
    queries = np.random.randn(n_queries, d).astype('float32')
    
    # IVF index: data partitioned into nlist clusters
    nlist = 100  # Number of clusters (like L in LSH)
    
    # Create quantizer and IVF index
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, nlist)
    
    print("
Training index (learning cluster centroids)...")
    index.train(data)
    index.add(data)
    
    # nprobe controls recall/speed tradeoff (like T in multi-probe LSH)
    for nprobe in [1, 5, 10, 20, 50]:
        index.nprobe = nprobe
        
        distances, indices = index.search(queries, 10)
        
        # Ground truth
        brute_index = faiss.IndexFlatL2(d)
        brute_index.add(data)
        true_distances, true_indices = brute_index.search(queries, 10)
        
        recalls = []
        for i in range(n_queries):
            recall = len(set(indices[i]) & set(true_indices[i])) / 10
            recalls.append(recall)
        
        print(f"nprobe={nprobe:>3}: Recall@10 = {np.mean(recalls):.2%}")
 
 
if __name__ == "__main__":
    faiss_lsh_example()
    faiss_ivf_example()

Troubleshooting Common Issues

Even well-configured LSH systems can exhibit unexpected behavior. Here's how to diagnose and fix common issues.

Problem: Low Recall Despite Theory Prediction

•Check data normalization — For cosine LSH, vectors must be normalized. Varying magnitudes break the angle-based collision probability.
•Verify distance distribution — If true k-NN are farther than expected, you need more tables or wider buckets.
•Look for data skew — Clustered data can cause uneven bucket sizes, reducing effective coverage.
•Validate hash function correctness — Off-by-one errors in bucket computation are surprisingly common.
•Increase tables (L) or probes (T) — The simplest fix for low recall is more brute-force coverage.

Problem: Slow Query Time

•Check bucket sizes — Large buckets (many candidates per bucket) slow verification. Increase k or decrease w.
•Profile hash computation — If hashing is slow, optimize projections (vectorize, use sparse random projections).
•Reduce number of tables (L) — Each table adds hash computation and bucket lookup overhead.
•Use approximate distance early — For candidate filtering, use lower-precision distance (e.g., 32-bit float) before exact verification.
•Check memory bandwidth — For large indexes, query time may be memory-bound. Ensure data locality.

Problem: Memory Exceeds Budget

•Switch to multi-probe — Reduce tables by 5-10x with similar recall.
•Use compressed hash keys — Pack bits into integers instead of tuples.
•Store indices only, not vectors — Keep vectors in separate storage, access on-demand.
•Shard the index — Distribute across machines if single-machine memory is insufficient.
•Consider alternative indexes — HNSW and IVF often have better memory efficiency than pure LSH.

When in Doubt, Profile

Best Practices Summary

Let's consolidate the practical wisdom from this module into actionable guidelines.

LSH Best Practices Checklist
Category	Best Practice	Common Mistake
Data Prep	Normalize vectors for cosine similarity	Using raw vectors with varying scales
Parameter Selection	Tune on representative sample of production data	Using paper defaults without validation
Recall Target	Define explicit recall requirement upfront	Optimizing for vague "good enough"
Memory Planning	Estimate memory before deployment	Discovering OOM in production
Multi-Probe	Use multi-probe for memory-constrained scenarios	Always using single-probe with many tables
Benchmarking	Measure p99 latency, not just average	Reporting only mean performance
Ground Truth	Compute exact k-NN for validation set	Trusting LSH results without verification
Production	Monitor bucket sizes and candidate counts	Deploy and forget
Fallback	Have brute-force fallback for failed queries	No handling for edge cases
Updates	Plan for index refresh strategy	Static index with stale data

Quick Reference: Parameter Selection

For Random Hyperplane LSH (cosine):
  k = ceil(log₂(n) / 2)    # Start here, adjust for precision
  L = 10-30                 # Increase for higher recall
  With multi-probe: L = 3-5, T = 15-25

For E2LSH (Euclidean):
  w = 2-4 × (typical k-NN distance)
  k = 6-12
  L = 10-50
  With multi-probe: L = 3-10, T = 10-25

For MinHash (Jaccard):
  n_hashes = 128-256
  bands = 16-32
  rows = n_hashes / bands

Module Summary: Locality-Sensitive Hashing

We've completed a comprehensive exploration of Locality-Sensitive Hashing. Let's review the full journey:

Module Takeaways

•LSH inverts traditional hashing — Instead of minimizing collisions, LSH maximizes meaningful collisions based on similarity.
•Different metrics need different hash families — Cosine uses random hyperplanes, Euclidean uses p-stable distributions, Jaccard uses MinHash.
•Collision probability analysis enables tuning — Understanding p(similarity) → collision probability allows principled parameter selection.
•AND-OR amplification is the secret sauce — Combining k functions per table and L tables creates sharp S-curve discrimination.
•The ρ parameter determines query complexity — Query time is O(n^ρ), with ρ < 1 for useful LSH families.
•Multi-probe trades computation for memory — Check nearby buckets instead of building more tables.
•Production deployment requires empirical tuning — Theory guides, but real data distributions determine actual performance.
•LSH is one tool in the similarity search toolkit — Modern systems often use HNSW, IVF, or hybrid approaches.

When to Use LSH:

✅ Use LSH when:

Dataset is too large for brute-force exact search
Approximate answers are acceptable
You need sub-linear query time guarantees
Memory budget allows multiple hash tables
Your similarity metric has an established LSH family

❌ Consider alternatives when:

Exact nearest neighbors are required (use exact methods)
Data is low-dimensional (d < 10-20, use KD-trees)
Dataset is small (< 10,000 points, brute force is fine)
You need adaptive similarity (use learned hashing)
Graph structure can be exploited (use HNSW)

Module Complete

6 / 6