We've seen that exact nearest neighbor search in high dimensions inevitably degrades to brute force, and that LSH provides theoretical guarantees for approximate search. But what techniques actually power the billion-scale similarity search systems used in production today?
The Scale Challenge:
Modern recommendation, retrieval, and semantic search systems index millions to billions of high-dimensional embeddings. These systems demand:
- Millisecond-level query latency
- High recall of the true nearest neighbors
- Memory footprints that fit on realistic hardware
- Reasonable index build and update times
This page surveys the techniques that make this possible: graph-based methods (HNSW), quantization (PQ, OPQ, ScaNN), and production systems (FAISS, Annoy, Milvus).
By the end of this page, you will understand how graph-based methods like HNSW achieve state-of-the-art recall, how product quantization compresses embeddings for memory efficiency, how hybrid approaches combine multiple techniques, how to choose between different ANN libraries, and how to benchmark and tune ANN systems for production.
Before diving into specific methods, let's survey the landscape of approximate nearest neighbor techniques.
The Fundamental Tradeoffs:
All ANN methods navigate three competing objectives: recall (how often the true nearest neighbors are returned), query speed (latency and queries per second), and memory footprint (index size).
No method wins on all three. Understanding the tradeoff space is key to choosing the right approach.
Major Method Families:
| Family | Examples | Principle | Strengths | Weaknesses |
|---|---|---|---|---|
| Hashing | LSH, E2LSH, Multi-probe | Hash similar items together | Theoretical guarantees | High memory, many tables needed |
| Tree-based | KD-tree, Ball tree, Annoy | Hierarchical space partition | Exact mode available | Degrades in high-d |
| Graph-based | HNSW, NSG, DiskANN | Navigate neighbor graph | Best recall-speed tradeoff | High memory, slow build |
| Quantization | PQ, OPQ, ScaNN | Compress vectors | Low memory, fast search | Approximation error |
| Hybrid | IVF-PQ, FAISS | Combine multiple methods | Flexible tradeoffs | More complex tuning |
State of the Art (2024):
Benchmarks like ann-benchmarks.com compare methods across datasets. HNSW consistently achieves the best recall-vs-QPS (queries per second) tradeoff, while IVF-PQ wins on memory efficiency.
HNSW (Malkov & Yashunin, 2016) is the most successful graph-based ANN method, achieving near-optimal recall with excellent query speed.
Core Ideas:
Navigable Small World (NSW): A graph where each node connects to its approximate nearest neighbors. Navigation works by greedily moving to the neighbor closest to the query.
Hierarchical Structure: Multiple layers of NSW graphs with exponentially decreasing density. Higher layers contain fewer nodes but longer-range connections.
Search Strategy: Start at the top (sparse) layer, greedily descend to find an entry point, then search more thoroughly at lower layers.
Why It Works:
The hierarchical structure provides:
- Long-range links in the sparse upper layers, so search can cross the dataset in a few hops
- Short-range links in the dense bottom layer, so search can refine to the true local neighborhood
- Search paths whose length grows roughly logarithmically with the number of points
This is analogous to skip lists for 1D data, but in high-dimensional metric space.
Theoretical Properties: Malkov & Yashunin argue that, under reasonable assumptions on the data, search cost scales roughly logarithmically with dataset size, while memory grows linearly (about $M$ edges per node plus the vectors themselves). Unlike LSH, HNSW comes with no worst-case guarantees; its dominance is empirical.
```python
import numpy as np
from typing import List, Set, Dict, Tuple
import heapq
from collections import defaultdict


class SimpleHNSW:
    """
    Simplified HNSW implementation for educational purposes.
    Production systems (like hnswlib, FAISS) are much more optimized.

    Parameters:
    -----------
    M : int
        Maximum connections per node (affects recall and memory)
    ef_construction : int
        Size of dynamic candidate list during construction
    m_L : float
        Level generation factor (typically 1/ln(M))
    """

    def __init__(
        self,
        dim: int,
        M: int = 16,
        ef_construction: int = 200,
        m_L: float = None,
        seed: int = 42
    ):
        self.dim = dim
        self.M = M
        self.M0 = 2 * M  # Max connections at level 0 (base layer)
        self.ef_construction = ef_construction
        self.m_L = m_L or 1.0 / np.log(M)

        np.random.seed(seed)

        # Data storage
        self.data: List[np.ndarray] = []

        # Graph structure: level -> node_id -> set of neighbor_ids
        self.graphs: Dict[int, Dict[int, Set[int]]] = defaultdict(
            lambda: defaultdict(set)
        )

        # Entry point and max level
        self.entry_point: int = -1
        self.max_level: int = -1

        # Level of each node
        self.node_levels: Dict[int, int] = {}

    def _distance(self, v1: np.ndarray, v2: np.ndarray) -> float:
        """Euclidean distance."""
        return np.sqrt(np.sum((v1 - v2) ** 2))

    def _random_level(self) -> int:
        """Generate random level using exponential distribution."""
        return int(-np.log(np.random.random()) * self.m_L)

    def _search_layer(
        self,
        query: np.ndarray,
        entry_points: Set[int],
        ef: int,
        layer: int
    ) -> List[Tuple[float, int]]:
        """
        Search a single layer using greedy best-first search.
        Returns ef nearest neighbors found at this layer.
        """
        visited = set(entry_points)

        # Min-heap of (distance, node_id) for candidates
        candidates = []
        # Max-heap of (neg_distance, node_id) for results
        results = []

        for ep in entry_points:
            dist = self._distance(query, self.data[ep])
            heapq.heappush(candidates, (dist, ep))
            heapq.heappush(results, (-dist, ep))

        while candidates:
            dist_c, c = heapq.heappop(candidates)

            # Farthest result distance
            dist_f = -results[0][0] if results else float('inf')

            if dist_c > dist_f:
                break  # All remaining candidates are farther than farthest result

            # Explore neighbors of c in this layer
            for neighbor in self.graphs[layer].get(c, set()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    dist_n = self._distance(query, self.data[neighbor])

                    if len(results) < ef or dist_n < -results[0][0]:
                        heapq.heappush(candidates, (dist_n, neighbor))
                        heapq.heappush(results, (-dist_n, neighbor))

                        if len(results) > ef:
                            heapq.heappop(results)

        # Return results sorted by distance (ascending)
        return sorted([(-d, n) for d, n in results])

    def insert(self, vector: np.ndarray) -> int:
        """
        Insert a vector into the HNSW index.
        Returns the node ID.
        """
        node_id = len(self.data)
        self.data.append(vector)

        # Assign random level
        node_level = self._random_level()
        self.node_levels[node_id] = node_level

        # First node
        if self.entry_point < 0:
            self.entry_point = node_id
            self.max_level = node_level
            return node_id

        # Search for entry points at each layer
        current_entry = {self.entry_point}

        # Descend from top to node_level + 1
        for layer in range(self.max_level, node_level, -1):
            results = self._search_layer(vector, current_entry, 1, layer)
            current_entry = {results[0][1]} if results else current_entry

        # Insert at layers node_level down to 0
        for layer in range(min(node_level, self.max_level), -1, -1):
            # Find neighbors
            results = self._search_layer(
                vector, current_entry, self.ef_construction, layer
            )

            # Select best neighbors
            max_M = self.M0 if layer == 0 else self.M
            neighbors = [n for _, n in results[:max_M]]

            # Add bidirectional edges
            for neighbor in neighbors:
                self.graphs[layer][node_id].add(neighbor)
                self.graphs[layer][neighbor].add(node_id)

                # Prune if neighbor has too many connections
                if len(self.graphs[layer][neighbor]) > max_M:
                    # Keep closest max_M neighbors
                    neighbor_neighbors = list(self.graphs[layer][neighbor])
                    dists = [
                        (self._distance(self.data[neighbor], self.data[nn]), nn)
                        for nn in neighbor_neighbors
                    ]
                    dists.sort()
                    self.graphs[layer][neighbor] = set(
                        nn for _, nn in dists[:max_M]
                    )

            current_entry = set(n for _, n in results[:1])

        # Update entry point if new node is highest
        if node_level > self.max_level:
            self.entry_point = node_id
            self.max_level = node_level

        return node_id

    def search(self, query: np.ndarray, k: int, ef: int = None) -> List[Tuple[float, int]]:
        """
        Search for k nearest neighbors.

        Parameters:
        -----------
        query : np.ndarray
            Query vector
        k : int
            Number of neighbors to return
        ef : int
            Size of dynamic candidate list (≥ k, higher = better recall)

        Returns:
        --------
        List of (distance, node_id) pairs
        """
        if ef is None:
            ef = max(k, 100)

        if self.entry_point < 0:
            return []

        current_entry = {self.entry_point}

        # Descend from top to layer 1
        for layer in range(self.max_level, 0, -1):
            results = self._search_layer(query, current_entry, 1, layer)
            current_entry = {results[0][1]} if results else current_entry

        # Search layer 0 with ef candidates
        results = self._search_layer(query, current_entry, ef, 0)

        return results[:k]
```

Key parameters:
- M (connections per node): higher M = better recall but more memory. Typical: 16-64.
- ef_construction: higher = better graph quality but slower build. Typical: 100-500.
- ef_search: higher = better recall but slower query. Tune based on recall requirements.
While HNSW achieves excellent recall, it requires storing full vectors, consuming significant memory. Product Quantization (PQ) provides dramatic memory reduction by compressing vectors to compact codes.
The Idea: split each $d$-dimensional vector into $m$ subvectors, run k-means independently in each subspace to learn a codebook of $k$ centroids, and store each vector as its $m$ centroid indices (one byte per subspace when $k = 256$).
Memory Savings:
Original: $d \times 4$ bytes (float32). Compressed: $m$ bytes (if $k = 256$).
Example: 512-d vector × 4 bytes = 2048 bytes → 64 subspaces × 1 byte = 64 bytes
32× compression!
Distance Computation:
For query $q$ and an encoded vector $x$ with codes $c_1, \ldots, c_m$, the (asymmetric) squared distance is approximated as

$$d(q, x)^2 \approx \sum_{i=1}^{m} \left\| q^{(i)} - C_i[c_i] \right\|^2$$

where $q^{(i)}$ is the $i$-th subvector of $q$ and $C_i[c_i]$ is the centroid that $x$ selected in subspace $i$. Precomputing a table of all $m \times k$ subvector-to-centroid distances turns each database comparison into $O(m)$ table lookups and additions instead of $O(d)$ multiply-adds!
```python
import numpy as np
from sklearn.cluster import KMeans
from typing import List, Tuple


class ProductQuantizer:
    """
    Product Quantization for vector compression and fast distance computation.

    Parameters:
    -----------
    dim : int
        Dimensionality of input vectors
    n_subspaces : int
        Number of subspaces (m). dim must be divisible by m.
    n_centroids : int
        Centroids per subspace (k). Typically 256 for 8-bit codes.
    """

    def __init__(
        self,
        dim: int = 512,
        n_subspaces: int = 64,
        n_centroids: int = 256,
        seed: int = 42
    ):
        assert dim % n_subspaces == 0, "dim must be divisible by n_subspaces"

        self.dim = dim
        self.m = n_subspaces
        self.k = n_centroids
        self.subspace_dim = dim // n_subspaces

        np.random.seed(seed)

        # Codebooks: (m, k, subspace_dim)
        self.codebooks: np.ndarray = None

    def train(self, X: np.ndarray, n_iter: int = 25):
        """
        Train codebooks on training data.
        Uses k-means on each subspace independently.
        """
        n, d = X.shape
        assert d == self.dim

        self.codebooks = np.zeros((self.m, self.k, self.subspace_dim))

        for i in range(self.m):
            # Extract subspace i
            start = i * self.subspace_dim
            end = start + self.subspace_dim
            X_sub = X[:, start:end]

            # Run k-means
            kmeans = KMeans(n_clusters=self.k, max_iter=n_iter, n_init=1)
            kmeans.fit(X_sub)

            self.codebooks[i] = kmeans.cluster_centers_

    def encode(self, X: np.ndarray) -> np.ndarray:
        """
        Encode vectors to PQ codes.

        Returns:
        --------
        codes : np.ndarray of shape (n, m) with dtype uint8
        """
        n = X.shape[0]
        codes = np.zeros((n, self.m), dtype=np.uint8)

        for i in range(self.m):
            start = i * self.subspace_dim
            end = start + self.subspace_dim
            X_sub = X[:, start:end]

            # Find nearest centroid for each vector
            # Shape: (n, k)
            distances = np.sum(
                (X_sub[:, np.newaxis, :] - self.codebooks[i]) ** 2,
                axis=2
            )
            codes[:, i] = np.argmin(distances, axis=1)

        return codes

    def decode(self, codes: np.ndarray) -> np.ndarray:
        """
        Decode PQ codes back to reconstructed vectors.
        Returns approximate vectors, not exact originals.
        """
        n = codes.shape[0]
        X_reconstructed = np.zeros((n, self.dim))

        for i in range(self.m):
            start = i * self.subspace_dim
            end = start + self.subspace_dim
            X_reconstructed[:, start:end] = self.codebooks[i, codes[:, i]]

        return X_reconstructed

    def compute_distance_table(self, query: np.ndarray) -> np.ndarray:
        """
        Precompute distance table for efficient search.

        Returns:
        --------
        table : np.ndarray of shape (m, k)
            table[i, j] = squared distance from query's i-th subvector
            to j-th centroid of i-th subspace
        """
        table = np.zeros((self.m, self.k))

        for i in range(self.m):
            start = i * self.subspace_dim
            end = start + self.subspace_dim
            q_sub = query[start:end]

            # Distance from q_sub to each centroid
            table[i] = np.sum((self.codebooks[i] - q_sub) ** 2, axis=1)

        return table

    def asymmetric_distance(
        self,
        query: np.ndarray,
        codes: np.ndarray,
        table: np.ndarray = None
    ) -> np.ndarray:
        """
        Compute distances from query to encoded vectors.

        Asymmetric: query is not encoded, database vectors are.
        This is more accurate than symmetric (both encoded).

        Returns:
        --------
        distances : np.ndarray of shape (n,)
        """
        if table is None:
            table = self.compute_distance_table(query)

        n = codes.shape[0]
        distances_sq = np.zeros(n)

        for i in range(self.m):
            # Lookup distances for subspace i
            distances_sq += table[i, codes[:, i]]

        return np.sqrt(distances_sq)

    def search(
        self,
        query: np.ndarray,
        codes: np.ndarray,
        k: int = 10
    ) -> Tuple[np.ndarray, np.ndarray]:
        """
        Search for k nearest neighbors in encoded database.
        """
        table = self.compute_distance_table(query)
        distances = self.asymmetric_distance(query, codes, table)

        # Find k smallest
        indices = np.argpartition(distances, k)[:k]
        indices = indices[np.argsort(distances[indices])]

        return indices, distances[indices]


def demo_pq_compression():
    """Demonstrate PQ compression ratio and accuracy."""
    np.random.seed(42)

    # Generate test data
    n_train, n_test, dim = 10000, 1000, 512
    X_train = np.random.randn(n_train, dim).astype(np.float32)
    X_test = np.random.randn(n_test, dim).astype(np.float32)

    # Train PQ
    pq = ProductQuantizer(dim=dim, n_subspaces=64, n_centroids=256)
    pq.train(X_train)

    # Encode test data
    codes = pq.encode(X_test)

    # Memory comparison
    original_bytes = X_test.nbytes
    compressed_bytes = codes.nbytes
    ratio = original_bytes / compressed_bytes

    print(f"Original: {original_bytes:,} bytes")
    print(f"Compressed: {compressed_bytes:,} bytes")
    print(f"Compression ratio: {ratio:.1f}x")

    # Reconstruction error
    X_reconstructed = pq.decode(codes)
    mse = np.mean((X_test - X_reconstructed) ** 2)
    print(f"Reconstruction MSE: {mse:.4f}")

    # Search accuracy
    query = np.random.randn(dim)

    # True NN (brute force on originals)
    true_dists = np.sqrt(np.sum((X_test - query) ** 2, axis=1))
    true_nn = np.argmin(true_dists)

    # PQ NN
    pq_indices, pq_dists = pq.search(query, codes, k=10)

    print(f"True NN: idx={true_nn}, dist={true_dists[true_nn]:.4f}")
    print(f"PQ top-10: {pq_indices[:5]}")
```

Variants of Product Quantization:
| Variant | Description | Benefit |
|---|---|---|
| OPQ (Optimized PQ) | Rotate vectors before PQ to align with subspaces | 10-30% lower distortion |
| LOPQ (Locally OPQ) | Different rotations per cluster | Better for clustered data |
| RQ (Residual Quantization) | Iteratively quantize residuals | Lower distortion, slower |
| AQ (Additive Quantization) | Sum of full-dimensional codebooks instead of disjoint subspaces | Flexible, complex |
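To make the OPQ row concrete, here is a minimal sketch of the non-parametric OPQ idea, reusing the ProductQuantizer class defined above. The function name train_opq_rotation and the iteration count are illustrative choices, not a reference implementation:

```python
import numpy as np

def train_opq_rotation(X: np.ndarray, pq: "ProductQuantizer", n_iter: int = 10) -> np.ndarray:
    """Alternate between (re)fitting PQ codebooks and solving for an orthogonal
    rotation R that aligns the data with the PQ subspaces (orthogonal Procrustes).
    Slow as written (k-means is retrained every iteration); for illustration only."""
    d = X.shape[1]
    R = np.eye(d)
    for _ in range(n_iter):
        Y = X @ R                          # rotate data into the PQ coordinate frame
        pq.train(Y)                        # refit codebooks on rotated data
        Y_hat = pq.decode(pq.encode(Y))    # PQ reconstruction of the rotated data
        # Solve min_R ||X R - Y_hat||_F subject to R^T R = I
        U, _, Vt = np.linalg.svd(X.T @ Y_hat)
        R = U @ Vt
    return R  # at search time, encode X @ R and rotate queries the same way
```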
PQ provides memory efficiency but still requires scanning all codes. Inverted File Index (IVF) adds coarse quantization for faster candidate selection.
The IVF Approach: cluster the dataset into $L$ coarse cells with k-means, store each vector in the inverted list of its nearest centroid, and at query time scan only the $n_{probe}$ lists whose centroids are closest to the query.
Speedup: scanning $n_{probe}$ of $L$ lists touches roughly a fraction $n_{probe}/L$ of the database, so the expected speedup over an exhaustive scan is about $L / n_{probe}$ (ignoring the small cost of comparing the query to the $L$ centroids).
For $L = 1000$ and $n_{probe} = 10$: 100× speedup!
IVF + PQ (IVF-PQ):
Combine inverted files with product quantization: the coarse quantizer routes each vector to a cell, PQ encodes the residual (vector minus cell centroid), and queries probe a few cells and rank candidates using precomputed PQ lookup tables.
This is the workhorse of billion-scale search systems like FAISS.
```python
import numpy as np
from sklearn.cluster import KMeans
from typing import List, Tuple


class IVFPQ:
    """
    Inverted File Index with Product Quantization.
    The standard recipe for billion-scale similarity search.

    Parameters:
    -----------
    dim : int
        Vector dimensionality
    n_lists : int
        Number of IVF coarse clusters (L)
    n_subspaces : int
        PQ subspaces (m)
    n_centroids : int
        PQ centroids per subspace (typically 256)
    """

    def __init__(
        self,
        dim: int = 512,
        n_lists: int = 1000,
        n_subspaces: int = 64,
        n_centroids: int = 256,
        seed: int = 42
    ):
        self.dim = dim
        self.n_lists = n_lists

        np.random.seed(seed)

        # Coarse quantizer
        self.coarse_centroids: np.ndarray = None

        # PQ for fine quantization (on residuals)
        self.pq = ProductQuantizer(dim, n_subspaces, n_centroids, seed)

        # Inverted lists: list_id -> (original_ids, pq_codes)
        self.invlists: List[Tuple[np.ndarray, np.ndarray]] = []

    def train(self, X: np.ndarray, n_iter: int = 25):
        """
        Train coarse quantizer and PQ.
        """
        n = X.shape[0]

        # Step 1: Train coarse quantizer
        print(f"Training coarse quantizer with {self.n_lists} clusters...")
        kmeans = KMeans(n_clusters=self.n_lists, max_iter=n_iter, n_init=1)
        kmeans.fit(X)
        self.coarse_centroids = kmeans.cluster_centers_

        # Step 2: Compute residuals
        assignments = kmeans.labels_
        residuals = X - self.coarse_centroids[assignments]

        # Step 3: Train PQ on residuals
        print("Training product quantizer on residuals...")
        self.pq.train(residuals, n_iter)

    def index(self, X: np.ndarray, ids: np.ndarray = None):
        """
        Add vectors to the index.
        """
        n = X.shape[0]
        if ids is None:
            ids = np.arange(n)

        # Assign to coarse clusters
        distances = np.sum(
            (X[:, np.newaxis, :] - self.coarse_centroids) ** 2,
            axis=2
        )
        assignments = np.argmin(distances, axis=1)

        # Compute residuals
        residuals = X - self.coarse_centroids[assignments]

        # Encode residuals with PQ
        codes = self.pq.encode(residuals)

        # Build inverted lists
        self.invlists = [None] * self.n_lists
        for list_id in range(self.n_lists):
            mask = assignments == list_id
            if np.any(mask):
                self.invlists[list_id] = (
                    ids[mask],
                    codes[mask]
                )

    def search(
        self,
        query: np.ndarray,
        k: int = 10,
        n_probe: int = 10
    ) -> Tuple[np.ndarray, np.ndarray]:
        """
        Search for k nearest neighbors.

        Parameters:
        -----------
        query : np.ndarray
            Query vector
        k : int
            Number of neighbors
        n_probe : int
            Number of IVF lists to search

        Returns:
        --------
        ids, distances
        """
        # Find n_probe nearest coarse centroids
        coarse_distances = np.sum(
            (self.coarse_centroids - query) ** 2,
            axis=1
        )
        probe_list_ids = np.argsort(coarse_distances)[:n_probe]

        # Search each probed list
        all_ids = []
        all_distances = []

        for list_id in probe_list_ids:
            if self.invlists[list_id] is None:
                continue

            ids, codes = self.invlists[list_id]

            # Compute residual query
            residual_query = query - self.coarse_centroids[list_id]

            # PQ distance computation
            table = self.pq.compute_distance_table(residual_query)
            distances = self.pq.asymmetric_distance(residual_query, codes, table)

            all_ids.extend(ids)
            all_distances.extend(distances)

        if not all_ids:
            return np.array([]), np.array([])

        all_ids = np.array(all_ids)
        all_distances = np.array(all_distances)

        # Return top k
        top_k_idx = np.argsort(all_distances)[:k]
        return all_ids[top_k_idx], all_distances[top_k_idx]


# Comparison of methods
def compare_methods_memory():
    """
    Compare memory usage of different index types.
    """
    n = 1_000_000  # 1 million vectors
    dim = 512

    methods = {
        'Brute Force (float32)': n * dim * 4,
        'HNSW (M=32, float32)': n * dim * 4 + n * 32 * 8,  # vectors + edges
        'IVF only (1000 lists)': n * dim * 4 + 1000 * dim * 4,
        'PQ (m=64)': n * 64 + 64 * 256 * (dim // 64) * 4,  # codes + codebooks
        'IVF-PQ': n * 64 + 1000 * dim * 4 + 64 * 256 * (dim // 64) * 4,
    }

    print("Memory Usage for 1M 512-d vectors:")
    print("-" * 50)
    for method, bytes_used in methods.items():
        mb = bytes_used / (1024 ** 2)
        gb = bytes_used / (1024 ** 3)
        print(f"{method:30s}: {mb:8.1f} MB ({gb:.2f} GB)")
```

The n_probe parameter controls the recall-speed tradeoff in IVF. With n_probe=1 you search only one cluster (fast but low recall); with n_probe=n_lists you search everything (high recall but slow). Typical values: 1-10% of n_lists.
Several well-maintained libraries provide production-ready ANN implementations. Here's a comparison of the major options.
| Library | Origin | Best For | Key Features |
|---|---|---|---|
| FAISS | Meta AI | Billion-scale, GPU | IVF-PQ, HNSW, GPU support, OPQ |
| Annoy | Spotify | Read-heavy, memory-mapped | Tree-based, memory-mapped files |
| hnswlib | Academic | Pure HNSW, easy API | Header-only C++, Python bindings |
| ScaNN | Google | High-dim, anisotropic | Anisotropic quantization, fast |
| Milvus | Zilliz | Managed service, filtering | Distributed, attribute filtering |
| Weaviate | SeMI | Semantic search, ML ops | Vector + keyword, GraphQL API |
| Pinecone | Pinecone | Serverless, no-ops | Managed cloud, simple API |
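For contrast with the FAISS walkthrough below, here is a minimal hnswlib sketch (assuming the hnswlib package is installed); it illustrates the "easy API" noted in the table:

```python
import hnswlib
import numpy as np

dim = 128
data = np.random.randn(10_000, dim).astype(np.float32)

index = hnswlib.Index(space='l2', dim=dim)
index.init_index(max_elements=len(data), M=16, ef_construction=200)
index.add_items(data, np.arange(len(data)))

index.set_ef(64)  # ef_search: higher = better recall, slower queries
labels, distances = index.knn_query(data[:5], k=10)
```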
```python
import faiss
import numpy as np
import time


def faiss_demo(n_vectors: int = 1_000_000, dim: int = 128):
    """
    Demonstrate FAISS for different index types and scales.
    """
    np.random.seed(42)

    # Generate data
    print(f"Generating {n_vectors:,} vectors of dimension {dim}...")
    X = np.random.randn(n_vectors, dim).astype('float32')
    queries = np.random.randn(1000, dim).astype('float32')

    indices = {}

    # 1. Brute Force (Flat)
    print("\n1. Flat Index (Brute Force):")
    index_flat = faiss.IndexFlatL2(dim)
    index_flat.add(X)

    start = time.time()
    D, I = index_flat.search(queries[:10], k=10)
    flat_time = (time.time() - start) / 10 * 1000
    print(f"   Query time: {flat_time:.2f} ms")
    indices['flat'] = (index_flat, I)

    # 2. IVF (Inverted File)
    print("\n2. IVF Index:")
    n_list = int(np.sqrt(n_vectors))  # Rule of thumb
    quantizer = faiss.IndexFlatL2(dim)  # Coarse quantizer
    index_ivf = faiss.IndexIVFFlat(quantizer, dim, n_list)
    index_ivf.train(X)
    index_ivf.add(X)
    index_ivf.nprobe = 10  # Number of cells to search

    start = time.time()
    D, I = index_ivf.search(queries[:10], k=10)
    ivf_time = (time.time() - start) / 10 * 1000
    print(f"   nlist={n_list}, nprobe=10, Query time: {ivf_time:.2f} ms")
    indices['ivf'] = (index_ivf, I)

    # 3. IVF-PQ (Compressed)
    print("\n3. IVF-PQ Index:")
    m = 8  # Number of subspaces
    quantizer_pq = faiss.IndexFlatL2(dim)
    index_ivfpq = faiss.IndexIVFPQ(quantizer_pq, dim, n_list, m, 8)  # 8 bits per subspace
    index_ivfpq.train(X)
    index_ivfpq.add(X)
    index_ivfpq.nprobe = 10

    start = time.time()
    D, I = index_ivfpq.search(queries[:10], k=10)
    ivfpq_time = (time.time() - start) / 10 * 1000
    print(f"   m={m}, Query time: {ivfpq_time:.2f} ms")
    indices['ivfpq'] = (index_ivfpq, I)

    # 4. HNSW
    print("\n4. HNSW Index:")
    index_hnsw = faiss.IndexHNSWFlat(dim, 32)  # M=32
    index_hnsw.hnsw.efConstruction = 200
    index_hnsw.add(X)
    index_hnsw.hnsw.efSearch = 128

    start = time.time()
    D, I = index_hnsw.search(queries[:10], k=10)
    hnsw_time = (time.time() - start) / 10 * 1000
    print(f"   M=32, efSearch=128, Query time: {hnsw_time:.2f} ms")
    indices['hnsw'] = (index_hnsw, I)

    # Compare recall
    print("\n=== Recall Comparison (vs Flat ground truth) ===")
    ground_truth = indices['flat'][1]

    for name, (idx, results) in indices.items():
        if name == 'flat':
            continue
        recall = compute_recall(ground_truth, results)
        print(f"{name:10s}: Recall@10 = {recall:.3f}")

    return indices


def compute_recall(ground_truth: np.ndarray, predictions: np.ndarray) -> float:
    """Compute recall: fraction of ground truth neighbors found."""
    recalls = []
    for gt, pred in zip(ground_truth, predictions):
        gt_set = set(gt)
        pred_set = set(pred)
        recalls.append(len(gt_set & pred_set) / len(gt_set))
    return np.mean(recalls)


# Usage example output (illustrative):
# 1. Flat Index (Brute Force):
#    Query time: 45.23 ms
#
# 2. IVF Index:
#    nlist=1000, nprobe=10, Query time: 1.82 ms
#
# 3. IVF-PQ Index:
#    m=8, Query time: 0.34 ms
#
# 4. HNSW Index:
#    M=32, efSearch=128, Query time: 0.52 ms
#
# === Recall Comparison (vs Flat ground truth) ===
# ivf       : Recall@10 = 0.912
# ivfpq     : Recall@10 = 0.847
# hnsw      : Recall@10 = 0.998
```

Getting the best performance from ANN systems requires systematic benchmarking and tuning.
Key Metrics:
- Recall@k: fraction of the true k nearest neighbors returned
- QPS and latency: throughput and per-query latency, including tail percentiles
- Memory: index size, including vectors, codes, and graph edges
- Build time: how long it takes to construct (and retrain) the index
The Recall-Speed Frontier:
For any ANN system, there's a tradeoff curve between recall and speed. Plot recall vs QPS for different parameter settings to find the Pareto frontier.
Tuning Workflow:
1. Compute exact ground-truth neighbors for a sample of queries with brute force.
2. Fix the index type, then sweep one parameter at a time while measuring recall and QPS (see the sweep sketch after the table below).
3. Pick the cheapest configuration that meets your recall target, and re-validate on production-like data.
| Method | Parameter | Typical Range | Effect |
|---|---|---|---|
| HNSW | M | 8-64 | ↑ = better recall, more memory |
| HNSW | ef_construction | 100-500 | ↑ = better graph quality |
| HNSW | ef_search | 16-512 | ↑ = better recall, slower |
| IVF | n_list | √n to 4√n | More lists = finer partitions |
| IVF | n_probe | 1-10% of n_list | ↑ = better recall, slower |
| PQ | m (subspaces) | dim/8 to dim/2 | More = higher quality, more memory |
| PQ | nbits | 4-8 | 8 is standard (256 centroids) |
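As a concrete version of the sweep step, here is a hypothetical helper that traces the recall-speed frontier of a FAISS HNSW index by varying efSearch; the names index_hnsw, queries, and ground_truth are assumed to come from the FAISS demo above:

```python
import time
import numpy as np

def sweep_ef_search(index_hnsw, queries, ground_truth, ef_values=(16, 32, 64, 128, 256)):
    """Measure recall@10 and QPS at several efSearch settings (illustrative)."""
    frontier = []
    for ef in ef_values:
        index_hnsw.hnsw.efSearch = ef
        start = time.time()
        _, I = index_hnsw.search(queries, 10)
        qps = len(queries) / (time.time() - start)
        recall = np.mean([
            len(set(gt) & set(pred)) / len(gt)
            for gt, pred in zip(ground_truth, I)
        ])
        frontier.append((ef, recall, qps))
        print(f"efSearch={ef:4d}  recall@10={recall:.3f}  QPS={qps:,.0f}")
    return frontier
```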
The ann-benchmarks project provides standardized ANN benchmarks across datasets and libraries. Check the Pareto frontiers for your target dataset size and dimensionality before reinventing the wheel. The benchmark results are reproducible and regularly updated.
The ANN field continues to evolve rapidly. Here are some advanced topics shaping the future.
GPU Acceleration:
FAISS provides GPU implementations that achieve 10-100× speedup for:
- Brute-force (Flat) search over millions of vectors
- IVF-Flat and IVF-PQ search
- k-means training used to build coarse quantizers
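A minimal sketch of moving an existing CPU index onto the GPU with FAISS (requires a faiss-gpu build; index_flat and queries are assumed from the demo above):

```python
import faiss

res = faiss.StandardGpuResources()                       # allocate GPU scratch memory
gpu_index = faiss.index_cpu_to_gpu(res, 0, index_flat)   # copy index to GPU 0
D, I = gpu_index.search(queries[:10], 10)                # same search API as on CPU
```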
Hybrid Search (Vector + Filters):
Production systems often need: "Find nearest neighbors WHERE category='electronics' AND price < 100"
Approaches:
- Pre-filtering: apply the attribute filter first, then search only the surviving vectors (exact, but slow when the filter is loose)
- Post-filtering: over-fetch from the ANN index, then discard results that fail the filter (simple, but recall suffers for selective filters); see the sketch below
- Filter-aware indices: push the predicate into the index traversal itself, as Milvus and Weaviate do
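A sketch of the post-filtering approach, assuming a FAISS-style index plus a metadata list and a user-supplied predicate (all hypothetical names):

```python
import numpy as np

def post_filtered_search(index, query, metadata, predicate, k=10, overfetch=10):
    """Over-fetch k * overfetch candidates from the ANN index, then keep only
    those whose metadata passes the predicate. Recall drops when the filter is
    very selective, because too few survivors remain."""
    D, I = index.search(query.reshape(1, -1).astype(np.float32), k * overfetch)
    hits = [(d, i) for d, i in zip(D[0], I[0]) if i != -1 and predicate(metadata[i])]
    return hits[:k]
```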
Learned Indices:
Neural networks can learn to partition data better than k-means:
Streaming / Online Updates:
Most indices are built offline. For real-time insertions:
- HNSW supports incremental inserts natively (deletions are harder and often handled by marking nodes as deleted)
- IVF-PQ can append to its lists without retraining, but the coarse centroids and codebooks drift out of date if the data distribution shifts
- A common pattern is a small exact "fresh" index for recent vectors that is periodically merged into a rebuilt main index (see the sketch below)
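The delta-index pattern can be sketched as follows; the class and method names are illustrative, and the main index is assumed to be any trained FAISS-compatible index:

```python
import numpy as np
import faiss

class FreshnessLayer:
    """Search a large prebuilt main index together with a small exact index
    holding recently inserted vectors (a common delta-index pattern)."""

    def __init__(self, main_index, dim):
        self.main = main_index                 # e.g. a trained IVF-PQ or HNSW index
        self.fresh = faiss.IndexFlatL2(dim)    # exact index for recent vectors

    def add(self, x: np.ndarray):
        self.fresh.add(x)

    def search(self, queries: np.ndarray, k: int):
        D_main, I_main = self.main.search(queries, k)
        D_new, I_new = self.fresh.search(queries, k)
        # Offset fresh ids so they don't collide with main-index ids (illustrative)
        I_new = np.where(I_new >= 0, I_new + self.main.ntotal, I_new)
        D = np.concatenate([D_main, D_new], axis=1)
        I = np.concatenate([I_main, I_new], axis=1)
        order = np.argsort(D, axis=1)[:, :k]   # merge and keep the overall top k
        return np.take_along_axis(D, order, axis=1), np.take_along_axis(I, order, axis=1)
```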
We've journeyed from brute-force's $O(nd)$ complexity to production systems handling billions of vectors in milliseconds. Let's consolidate the key insights from this entire module.
| Scenario | Recommended Method | Key Parameters |
|---|---|---|
| Small dataset (< 10K) | Brute force | — |
| Low-dim, exact needed | KD-tree / Ball tree | leaf_size |
| High-dim, recall critical | HNSW | M, ef |
| Billion-scale, memory-limited | IVF-PQ | n_list, m, n_probe |
| Jaccard similarity (sets) | MinHash LSH | bands, rows |
| Managed, no-ops | Pinecone / Milvus | — |
Congratulations! You've mastered the data structures and algorithms that make K-Nearest Neighbors scalable. From the theoretical foundations of Voronoi diagrams to the practical engineering of FAISS, you now have the knowledge to build and deploy efficient similarity search systems at any scale. This completes Module 4: Efficient KNN.