We've seen that exact nearest neighbor search in high dimensions inevitably degrades to brute force, and that LSH provides theoretical guarantees for approximate search. But what techniques actually power the billion-scale similarity search systems used in production today?
The Scale Challenge:
Modern recommendation, retrieval, and semantic search systems index millions to billions of high-dimensional embeddings. These systems demand:
- Millisecond-level query latency
- High recall of the true nearest neighbors
- Memory footprints that fit on realistic hardware
- Reasonable index build and update times
This page surveys the techniques that make this possible: graph-based methods (HNSW), quantization (PQ, OPQ, ScaNN), and production systems (FAISS, Annoy, Milvus).
By the end of this page, you will understand how graph-based methods like HNSW achieve state-of-the-art recall, how product quantization compresses embeddings for memory efficiency, how hybrid approaches combine multiple techniques, how to choose between different ANN libraries, and how to benchmark and tune ANN systems for production.
Before diving into specific methods, let's survey the landscape of approximate nearest neighbor techniques.
The Fundamental Tradeoffs:
All ANN methods navigate three competing objectives: recall (how often the true nearest neighbors are returned), query speed (latency and queries per second), and memory footprint (index size).
No method wins on all three. Understanding the tradeoff space is key to choosing the right approach.
Major Method Families:
| Family | Examples | Principle | Strengths | Weaknesses |
|---|---|---|---|---|
| Hashing | LSH, E2LSH, Multi-probe | Hash similar items together | Theoretical guarantees | High memory, many tables needed |
| Tree-based | KD-tree, Ball tree, Annoy | Hierarchical space partition | Exact mode available | Degrades in high-d |
| Graph-based | HNSW, NSG, DiskANN | Navigate neighbor graph | Best recall-speed tradeoff | High memory, slow build |
| Quantization | PQ, OPQ, ScaNN | Compress vectors | Low memory, fast search | Approximation error |
| Hybrid | IVF-PQ, FAISS | Combine multiple methods | Flexible tradeoffs | More complex tuning |
State of the Art (2024):
Benchmarks like ann-benchmarks.com compare methods across datasets. HNSW consistently achieves the best recall-vs-QPS (queries per second) tradeoff, while IVF-PQ wins on memory efficiency.
HNSW (Malkov & Yashunin, 2016) is the most successful graph-based ANN method, achieving near-optimal recall with excellent query speed.
Core Ideas:
Navigable Small World (NSW): A graph where each node connects to its approximate nearest neighbors. Navigation works by greedily moving to the neighbor closest to the query.
Hierarchical Structure: Multiple layers of NSW graphs with exponentially decreasing density. Higher layers contain fewer nodes but longer-range connections.
Search Strategy: Start at the top (sparse) layer, greedily descend to find an entry point, then search more thoroughly at lower layers.
Why It Works:
The hierarchical structure provides:
- Long-range links in the sparse upper layers, so search can cross the dataset in a few hops
- Short-range links in the dense bottom layer, so search can refine to the true local neighborhood
- Search paths whose length grows roughly logarithmically with the number of points
This is analogous to skip lists for 1D data, but in high-dimensional metric space.
Theoretical Properties: Malkov & Yashunin argue that, under reasonable assumptions on the data, search cost scales roughly logarithmically with dataset size, while memory grows linearly (about $M$ edges per node plus the vectors themselves). Unlike LSH, HNSW comes with no worst-case guarantees; its dominance is empirical.
```python
import numpy as np
from typing import List, Set, Dict, Tuple
import heapq
from collections import defaultdict


class SimpleHNSW:
    """
    Simplified HNSW implementation for educational purposes.
    Production systems (like hnswlib, FAISS) are much more optimized.

    Parameters:
    -----------
    M : int
        Maximum connections per node (affects recall and memory)
    ef_construction : int
        Size of dynamic candidate list during construction
    m_L : float
        Level generation factor (typically 1/ln(M))
    """

    def __init__(
        self,
        dim: int,
        M: int = 16,
        ef_construction: int = 200,
        m_L: float = None,
        seed: int = 42
    ):
        self.dim = dim
        self.M = M
        self.M0 = 2 * M  # Max connections at level 0 (base layer)
        self.ef_construction = ef_construction
        self.m_L = m_L or 1.0 / np.log(M)

        np.random.seed(seed)

        # Data storage
        self.data: List[np.ndarray] = []

        # Graph structure: level -> node_id -> set of neighbor_ids
        self.graphs: Dict[int, Dict[int, Set[int]]] = defaultdict(
            lambda: defaultdict(set)
        )

        # Entry point and max level
        self.entry_point: int = -1
        self.max_level: int = -1

        # Level of each node
        self.node_levels: Dict[int, int] = {}

    def _distance(self, v1: np.ndarray, v2: np.ndarray) -> float:
        """Euclidean distance."""
        return np.sqrt(np.sum((v1 - v2) ** 2))

    def _random_level(self) -> int:
        """Generate random level using exponential distribution."""
        return int(-np.log(np.random.random()) * self.m_L)

    def _search_layer(
        self,
        query: np.ndarray,
        entry_points: Set[int],
        ef: int,
        layer: int
    ) -> List[Tuple[float, int]]:
        """
        Search a single layer using greedy best-first search.
        Returns ef nearest neighbors found at this layer.
        """
        visited = set(entry_points)

        # Min-heap of (distance, node_id) for candidates
        candidates = []
        # Max-heap of (neg_distance, node_id) for results
        results = []

        for ep in entry_points:
            dist = self._distance(query, self.data[ep])
            heapq.heappush(candidates, (dist, ep))
            heapq.heappush(results, (-dist, ep))

        while candidates:
            dist_c, c = heapq.heappop(candidates)

            # Farthest result distance
            dist_f = -results[0][0] if results else float('inf')

            if dist_c > dist_f:
                break  # All remaining candidates are farther than farthest result

            # Explore neighbors of c in this layer
            for neighbor in self.graphs[layer].get(c, set()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    dist_n = self._distance(query, self.data[neighbor])

                    if len(results) < ef or dist_n < -results[0][0]:
                        heapq.heappush(candidates, (dist_n, neighbor))
                        heapq.heappush(results, (-dist_n, neighbor))

                        if len(results) > ef:
                            heapq.heappop(results)

        # Return results sorted by distance (ascending)
        return sorted([(-d, n) for d, n in results])

    def insert(self, vector: np.ndarray) -> int:
        """
        Insert a vector into the HNSW index.
        Returns the node ID.
        """
        node_id = len(self.data)
        self.data.append(vector)

        # Assign random level
        node_level = self._random_level()
        self.node_levels[node_id] = node_level

        # First node
        if self.entry_point < 0:
            self.entry_point = node_id
            self.max_level = node_level
            return node_id

        # Search for entry points at each layer
        current_entry = {self.entry_point}

        # Descend from top to node_level + 1
        for layer in range(self.max_level, node_level, -1):
            results = self._search_layer(vector, current_entry, 1, layer)
            current_entry = {results[0][1]} if results else current_entry

        # Insert at layers node_level down to 0
        for layer in range(min(node_level, self.max_level), -1, -1):
            # Find neighbors
            results = self._search_layer(
                vector, current_entry, self.ef_construction, layer
            )

            # Select best neighbors
            max_M = self.M0 if layer == 0 else self.M
            neighbors = [n for _, n in results[:max_M]]

            # Add bidirectional edges
            for neighbor in neighbors:
                self.graphs[layer][node_id].add(neighbor)
                self.graphs[layer][neighbor].add(node_id)

                # Prune if neighbor has too many connections
                if len(self.graphs[layer][neighbor]) > max_M:
                    # Keep closest max_M neighbors
                    neighbor_neighbors = list(self.graphs[layer][neighbor])
                    dists = [
                        (self._distance(self.data[neighbor], self.data[nn]), nn)
                        for nn in neighbor_neighbors
                    ]
                    dists.sort()
                    self.graphs[layer][neighbor] = set(
                        nn for _, nn in dists[:max_M]
                    )

            current_entry = set(n for _, n in results[:1])

        # Update entry point if new node is highest
        if node_level > self.max_level:
            self.entry_point = node_id
            self.max_level = node_level

        return node_id

    def search(self, query: np.ndarray, k: int, ef: int = None) -> List[Tuple[float, int]]:
        """
        Search for k nearest neighbors.

        Parameters:
        -----------
        query : np.ndarray
            Query vector
        k : int
            Number of neighbors to return
        ef : int
            Size of dynamic candidate list (≥ k, higher = better recall)

        Returns:
        --------
        List of (distance, node_id) pairs
        """
        if ef is None:
            ef = max(k, 100)

        if self.entry_point < 0:
            return []

        current_entry = {self.entry_point}

        # Descend from top to layer 1
        for layer in range(self.max_level, 0, -1):
            results = self._search_layer(query, current_entry, 1, layer)
            current_entry = {results[0][1]} if results else current_entry

        # Search layer 0 with ef candidates
        results = self._search_layer(query, current_entry, ef, 0)

        return results[:k]
```

Key parameters:
- M (connections per node): higher M = better recall but more memory. Typical: 16-64.
- ef_construction: higher = better graph quality but slower build. Typical: 100-500.
- ef_search: higher = better recall but slower query. Tune based on recall requirements.
While HNSW achieves excellent recall, it requires storing full vectors, consuming significant memory. Product Quantization (PQ) provides dramatic memory reduction by compressing vectors to compact codes.
The Idea: split each $d$-dimensional vector into $m$ subvectors, run k-means independently in each subspace to learn a codebook of $k$ centroids, and store each vector as its $m$ centroid indices (one byte per subspace when $k = 256$).
Memory Savings:
Original: $d \times 4$ bytes (float32). Compressed: $m$ bytes (if $k = 256$).
Example: 512-d vector × 4 bytes = 2048 bytes → 64 subspaces × 1 byte = 64 bytes
32× compression!
Distance Computation:
For query $q$ and an encoded vector $x$ with codes $c_1, \ldots, c_m$, the (asymmetric) squared distance is approximated as

$$d(q, x)^2 \approx \sum_{i=1}^{m} \left\| q^{(i)} - C_i[c_i] \right\|^2$$

where $q^{(i)}$ is the $i$-th subvector of $q$ and $C_i[c_i]$ is the centroid that $x$ selected in subspace $i$. Precomputing a table of all $m \times k$ subvector-to-centroid distances turns each database comparison into $O(m)$ table lookups and additions instead of $O(d)$ multiply-adds!
```python
import numpy as np
from sklearn.cluster import KMeans
from typing import List, Tuple


class ProductQuantizer:
    """
    Product Quantization for vector compression and fast distance computation.

    Parameters:
    -----------
    dim : int
        Dimensionality of input vectors
    n_subspaces : int
        Number of subspaces (m). dim must be divisible by m.
    n_centroids : int
        Centroids per subspace (k). Typically 256 for 8-bit codes.
    """

    def __init__(
        self,
        dim: int = 512,
        n_subspaces: int = 64,
        n_centroids: int = 256,
        seed: int = 42
    ):
        assert dim % n_subspaces == 0, "dim must be divisible by n_subspaces"

        self.dim = dim
        self.m = n_subspaces
        self.k = n_centroids
        self.subspace_dim = dim // n_subspaces

        np.random.seed(seed)

        # Codebooks: (m, k, subspace_dim)
        self.codebooks: np.ndarray = None

    def train(self, X: np.ndarray, n_iter: int = 25):
        """
        Train codebooks on training data.
        Uses k-means on each subspace independently.
        """
        n, d = X.shape
        assert d == self.dim

        self.codebooks = np.zeros((self.m, self.k, self.subspace_dim))

        for i in range(self.m):
            # Extract subspace i
            start = i * self.subspace_dim
            end = start + self.subspace_dim
            X_sub = X[:, start:end]

            # Run k-means
            kmeans = KMeans(n_clusters=self.k, max_iter=n_iter, n_init=1)
            kmeans.fit(X_sub)

            self.codebooks[i] = kmeans.cluster_centers_

    def encode(self, X: np.ndarray) -> np.ndarray:
        """
        Encode vectors to PQ codes.

        Returns:
        --------
        codes : np.ndarray of shape (n, m) with dtype uint8
        """
        n = X.shape[0]
        codes = np.zeros((n, self.m), dtype=np.uint8)

        for i in range(self.m):
            start = i * self.subspace_dim
            end = start + self.subspace_dim
            X_sub = X[:, start:end]

            # Find nearest centroid for each vector
            # Shape: (n, k)
            distances = np.sum(
                (X_sub[:, np.newaxis, :] - self.codebooks[i]) ** 2,
                axis=2
            )
            codes[:, i] = np.argmin(distances, axis=1)

        return codes

    def decode(self, codes: np.ndarray) -> np.ndarray:
        """
        Decode PQ codes back to reconstructed vectors.
        Returns approximate vectors, not exact originals.
        """
        n = codes.shape[0]
        X_reconstructed = np.zeros((n, self.dim))

        for i in range(self.m):
            start = i * self.subspace_dim
            end = start + self.subspace_dim
            X_reconstructed[:, start:end] = self.codebooks[i, codes[:, i]]

        return X_reconstructed

    def compute_distance_table(self, query: np.ndarray) -> np.ndarray:
        """
        Precompute distance table for efficient search.

        Returns:
        --------
        table : np.ndarray of shape (m, k)
            table[i, j] = squared distance from query's i-th subvector
            to j-th centroid of i-th subspace
        """
        table = np.zeros((self.m, self.k))

        for i in range(self.m):
            start = i * self.subspace_dim
            end = start + self.subspace_dim
            q_sub = query[start:end]

            # Distance from q_sub to each centroid
            table[i] = np.sum((self.codebooks[i] - q_sub) ** 2, axis=1)

        return table

    def asymmetric_distance(
        self,
        query: np.ndarray,
        codes: np.ndarray,
        table: np.ndarray = None
    ) -> np.ndarray:
        """
        Compute distances from query to encoded vectors.

        Asymmetric: query is not encoded, database vectors are.
        This is more accurate than symmetric (both encoded).

        Returns:
        --------
        distances : np.ndarray of shape (n,)
        """
        if table is None:
            table = self.compute_distance_table(query)

        n = codes.shape[0]
        distances_sq = np.zeros(n)

        for i in range(self.m):
            # Lookup distances for subspace i
            distances_sq += table[i, codes[:, i]]

        return np.sqrt(distances_sq)

    def search(
        self,
        query: np.ndarray,
        codes: np.ndarray,
        k: int = 10
    ) -> Tuple[np.ndarray, np.ndarray]:
        """
        Search for k nearest neighbors in encoded database.
        """
        table = self.compute_distance_table(query)
        distances = self.asymmetric_distance(query, codes, table)

        # Find k smallest
        indices = np.argpartition(distances, k)[:k]
        indices = indices[np.argsort(distances[indices])]

        return indices, distances[indices]


def demo_pq_compression():
    """Demonstrate PQ compression ratio and accuracy."""
    np.random.seed(42)

    # Generate test data
    n_train, n_test, dim = 10000, 1000, 512
    X_train = np.random.randn(n_train, dim).astype(np.float32)
    X_test = np.random.randn(n_test, dim).astype(np.float32)

    # Train PQ
    pq = ProductQuantizer(dim=dim, n_subspaces=64, n_centroids=256)
    pq.train(X_train)

    # Encode test data
    codes = pq.encode(X_test)

    # Memory comparison
    original_bytes = X_test.nbytes
    compressed_bytes = codes.nbytes
    ratio = original_bytes / compressed_bytes

    print(f"Original: {original_bytes:,} bytes")
    print(f"Compressed: {compressed_bytes:,} bytes")
    print(f"Compression ratio: {ratio:.1f}x")

    # Reconstruction error
    X_reconstructed = pq.decode(codes)
    mse = np.mean((X_test - X_reconstructed) ** 2)
    print(f"Reconstruction MSE: {mse:.4f}")

    # Search accuracy
    query = np.random.randn(dim)

    # True NN (brute force on originals)
    true_dists = np.sqrt(np.sum((X_test - query) ** 2, axis=1))
    true_nn = np.argmin(true_dists)

    # PQ NN
    pq_indices, pq_dists = pq.search(query, codes, k=10)

    print(f"True NN: idx={true_nn}, dist={true_dists[true_nn]:.4f}")
    print(f"PQ top-10: {pq_indices[:5]}")
```

Variants of Product Quantization:
| Variant | Description | Benefit |
|---|---|---|
| OPQ (Optimized PQ) | Rotate vectors before PQ to align with subspaces | 10-30% lower distortion |
| LOPQ (Locally OPQ) | Different rotations per cluster | Better for clustered data |
| RQ (Residual Quantization) | Iteratively quantize residuals | Lower distortion, slower |
| AQ (Additive Quantization) | Sum of full-dimensional codebooks instead of disjoint subspaces | Flexible, complex |
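To make the OPQ row concrete, here is a minimal sketch of the non-parametric OPQ idea, reusing the ProductQuantizer class defined above. The function name train_opq_rotation and the iteration count are illustrative choices, not a reference implementation:

```python
import numpy as np

def train_opq_rotation(X: np.ndarray, pq: "ProductQuantizer", n_iter: int = 10) -> np.ndarray:
    """Alternate between (re)fitting PQ codebooks and solving for an orthogonal
    rotation R that aligns the data with the PQ subspaces (orthogonal Procrustes).
    Slow as written (k-means is retrained every iteration); for illustration only."""
    d = X.shape[1]
    R = np.eye(d)
    for _ in range(n_iter):
        Y = X @ R                          # rotate data into the PQ coordinate frame
        pq.train(Y)                        # refit codebooks on rotated data
        Y_hat = pq.decode(pq.encode(Y))    # PQ reconstruction of the rotated data
        # Solve min_R ||X R - Y_hat||_F subject to R^T R = I
        U, _, Vt = np.linalg.svd(X.T @ Y_hat)
        R = U @ Vt
    return R  # at search time, encode X @ R and rotate queries the same way
```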
PQ provides memory efficiency but still requires scanning all codes. Inverted File Index (IVF) adds coarse quantization for faster candidate selection.
The IVF Approach: cluster the dataset into $L$ coarse cells with k-means, store each vector in the inverted list of its nearest centroid, and at query time scan only the $n_{probe}$ lists whose centroids are closest to the query.
Speedup: scanning $n_{probe}$ of $L$ lists touches roughly a fraction $n_{probe}/L$ of the database, so the expected speedup over an exhaustive scan is about $L / n_{probe}$ (ignoring the small cost of comparing the query to the $L$ centroids).
For $L = 1000$ and $n_{probe} = 10$: 100× speedup!
IVF + PQ (IVF-PQ):
Combine inverted files with product quantization: the coarse quantizer routes each vector to a cell, PQ encodes the residual (vector minus cell centroid), and queries probe a few cells and rank candidates using precomputed PQ lookup tables.
This is the workhorse of billion-scale search systems like FAISS.
```python
import numpy as np
from sklearn.cluster import KMeans
from typing import List, Tuple


class IVFPQ:
    """
    Inverted File Index with Product Quantization.
    The standard recipe for billion-scale similarity search.

    Parameters:
    -----------
    dim : int
        Vector dimensionality
    n_lists : int
        Number of IVF coarse clusters (L)
    n_subspaces : int
        PQ subspaces (m)
    n_centroids : int
        PQ centroids per subspace (typically 256)
    """

    def __init__(
        self,
        dim: int = 512,
        n_lists: int = 1000,
        n_subspaces: int = 64,
        n_centroids: int = 256,
        seed: int = 42
    ):
        self.dim = dim
        self.n_lists = n_lists

        np.random.seed(seed)

        # Coarse quantizer
        self.coarse_centroids: np.ndarray = None

        # PQ for fine quantization (on residuals)
        self.pq = ProductQuantizer(dim, n_subspaces, n_centroids, seed)

        # Inverted lists: list_id -> (original_ids, pq_codes)
        self.invlists: List[Tuple[np.ndarray, np.ndarray]] = []

    def train(self, X: np.ndarray, n_iter: int = 25):
        """
        Train coarse quantizer and PQ.
        """
        n = X.shape[0]

        # Step 1: Train coarse quantizer
        print(f"Training coarse quantizer with {self.n_lists} clusters...")
        kmeans = KMeans(n_clusters=self.n_lists, max_iter=n_iter, n_init=1)
        kmeans.fit(X)
        self.coarse_centroids = kmeans.cluster_centers_

        # Step 2: Compute residuals
        assignments = kmeans.labels_
        residuals = X - self.coarse_centroids[assignments]

        # Step 3: Train PQ on residuals
        print("Training product quantizer on residuals...")
        self.pq.train(residuals, n_iter)

    def index(self, X: np.ndarray, ids: np.ndarray = None):
        """
        Add vectors to the index.
        """
        n = X.shape[0]
        if ids is None:
            ids = np.arange(n)

        # Assign to coarse clusters
        distances = np.sum(
            (X[:, np.newaxis, :] - self.coarse_centroids) ** 2,
            axis=2
        )
        assignments = np.argmin(distances, axis=1)

        # Compute residuals
        residuals = X - self.coarse_centroids[assignments]

        # Encode residuals with PQ
        codes = self.pq.encode(residuals)

        # Build inverted lists
        self.invlists = [None] * self.n_lists
        for list_id in range(self.n_lists):
            mask = assignments == list_id
            if np.any(mask):
                self.invlists[list_id] = (
                    ids[mask],
                    codes[mask]
                )

    def search(
        self,
        query: np.ndarray,
        k: int = 10,
        n_probe: int = 10
    ) -> Tuple[np.ndarray, np.ndarray]:
        """
        Search for k nearest neighbors.

        Parameters:
        -----------
        query : np.ndarray
            Query vector
        k : int
            Number of neighbors
        n_probe : int
            Number of IVF lists to search

        Returns:
        --------
        ids, distances
        """
        # Find n_probe nearest coarse centroids
        coarse_distances = np.sum(
            (self.coarse_centroids - query) ** 2,
            axis=1
        )
        probe_list_ids = np.argsort(coarse_distances)[:n_probe]

        # Search each probed list
        all_ids = []
        all_distances = []

        for list_id in probe_list_ids:
            if self.invlists[list_id] is None:
                continue

            ids, codes = self.invlists[list_id]

            # Compute residual query
            residual_query = query - self.coarse_centroids[list_id]

            # PQ distance computation
            table = self.pq.compute_distance_table(residual_query)
            distances = self.pq.asymmetric_distance(residual_query, codes, table)

            all_ids.extend(ids)
            all_distances.extend(distances)

        if not all_ids:
            return np.array([]), np.array([])

        all_ids = np.array(all_ids)
        all_distances = np.array(all_distances)

        # Return top k
        top_k_idx = np.argsort(all_distances)[:k]
        return all_ids[top_k_idx], all_distances[top_k_idx]


# Comparison of methods
def compare_methods_memory():
    """
    Compare memory usage of different index types.
    """
    n = 1_000_000  # 1 million vectors
    dim = 512

    methods = {
        'Brute Force (float32)': n * dim * 4,
        'HNSW (M=32, float32)': n * dim * 4 + n * 32 * 8,  # vectors + edges
        'IVF only (1000 lists)': n * dim * 4 + 1000 * dim * 4,
        'PQ (m=64)': n * 64 + 64 * 256 * (dim // 64) * 4,  # codes + codebooks
        'IVF-PQ': n * 64 + 1000 * dim * 4 + 64 * 256 * (dim // 64) * 4,
    }

    print("Memory Usage for 1M 512-d vectors:")
    print("-" * 50)
    for method, bytes_used in methods.items():
        mb = bytes_used / (1024 ** 2)
        gb = bytes_used / (1024 ** 3)
        print(f"{method:30s}: {mb:8.1f} MB ({gb:.2f} GB)")
```

The n_probe parameter controls the recall-speed tradeoff in IVF. With n_probe=1 you search only one cluster (fast but low recall); with n_probe=n_lists you search everything (high recall but slow). Typical values: 1-10% of n_lists.
Several well-maintained libraries provide production-ready ANN implementations. Here's a comparison of the major options.
| Library | Origin | Best For | Key Features |
|---|---|---|---|
| FAISS | Meta AI | Billion-scale, GPU | IVF-PQ, HNSW, GPU support, OPQ |
| Annoy | Spotify | Read-heavy, memory-mapped | Tree-based, memory-mapped files |
| hnswlib | Academic | Pure HNSW, easy API | Header-only C++, Python bindings |
| ScaNN | Google | High-dim, anisotropic | Anisotropic quantization, fast |
| Milvus | Zilliz | Managed service, filtering | Distributed, attribute filtering |
| Weaviate | SeMI | Semantic search, ML ops | Vector + keyword, GraphQL API |
| Pinecone | Pinecone | Serverless, no-ops | Managed cloud, simple API |
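For contrast with the FAISS walkthrough below, here is a minimal hnswlib sketch (assuming the hnswlib package is installed); it illustrates the "easy API" noted in the table:

```python
import hnswlib
import numpy as np

dim = 128
data = np.random.randn(10_000, dim).astype(np.float32)

index = hnswlib.Index(space='l2', dim=dim)
index.init_index(max_elements=len(data), M=16, ef_construction=200)
index.add_items(data, np.arange(len(data)))

index.set_ef(64)  # ef_search: higher = better recall, slower queries
labels, distances = index.knn_query(data[:5], k=10)
```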
```python
import faiss
import numpy as np
import time


def faiss_demo(n_vectors: int = 1_000_000, dim: int = 128):
    """
    Demonstrate FAISS for different index types and scales.
    """
    np.random.seed(42)

    # Generate data
    print(f"Generating {n_vectors:,} vectors of dimension {dim}...")
    X = np.random.randn(n_vectors, dim).astype('float32')
    queries = np.random.randn(1000, dim).astype('float32')

    indices = {}

    # 1. Brute Force (Flat)
    print("\n1. Flat Index (Brute Force):")
    index_flat = faiss.IndexFlatL2(dim)
    index_flat.add(X)

    start = time.time()
    D, I = index_flat.search(queries[:10], k=10)
    flat_time = (time.time() - start) / 10 * 1000
    print(f"   Query time: {flat_time:.2f} ms")
    indices['flat'] = (index_flat, I)

    # 2. IVF (Inverted File)
    print("\n2. IVF Index:")
    n_list = int(np.sqrt(n_vectors))  # Rule of thumb
    quantizer = faiss.IndexFlatL2(dim)  # Coarse quantizer
    index_ivf = faiss.IndexIVFFlat(quantizer, dim, n_list)
    index_ivf.train(X)
    index_ivf.add(X)
    index_ivf.nprobe = 10  # Number of cells to search

    start = time.time()
    D, I = index_ivf.search(queries[:10], k=10)
    ivf_time = (time.time() - start) / 10 * 1000
    print(f"   nlist={n_list}, nprobe=10, Query time: {ivf_time:.2f} ms")
    indices['ivf'] = (index_ivf, I)

    # 3. IVF-PQ (Compressed)
    print("\n3. IVF-PQ Index:")
    m = 8  # Number of subspaces
    quantizer_pq = faiss.IndexFlatL2(dim)
    index_ivfpq = faiss.IndexIVFPQ(quantizer_pq, dim, n_list, m, 8)  # 8 bits per subspace
    index_ivfpq.train(X)
    index_ivfpq.add(X)
    index_ivfpq.nprobe = 10

    start = time.time()
    D, I = index_ivfpq.search(queries[:10], k=10)
    ivfpq_time = (time.time() - start) / 10 * 1000
    print(f"   m={m}, Query time: {ivfpq_time:.2f} ms")
    indices['ivfpq'] = (index_ivfpq, I)

    # 4. HNSW
    print("\n4. HNSW Index:")
    index_hnsw = faiss.IndexHNSWFlat(dim, 32)  # M=32
    index_hnsw.hnsw.efConstruction = 200
    index_hnsw.add(X)
    index_hnsw.hnsw.efSearch = 128

    start = time.time()
    D, I = index_hnsw.search(queries[:10], k=10)
    hnsw_time = (time.time() - start) / 10 * 1000
    print(f"   M=32, efSearch=128, Query time: {hnsw_time:.2f} ms")
    indices['hnsw'] = (index_hnsw, I)

    # Compare recall
    print("\n=== Recall Comparison (vs Flat ground truth) ===")
    ground_truth = indices['flat'][1]

    for name, (idx, results) in indices.items():
        if name == 'flat':
            continue
        recall = compute_recall(ground_truth, results)
        print(f"{name:10s}: Recall@10 = {recall:.3f}")

    return indices


def compute_recall(ground_truth: np.ndarray, predictions: np.ndarray) -> float:
    """Compute recall: fraction of ground truth neighbors found."""
    recalls = []
    for gt, pred in zip(ground_truth, predictions):
        gt_set = set(gt)
        pred_set = set(pred)
        recalls.append(len(gt_set & pred_set) / len(gt_set))
    return np.mean(recalls)


# Usage example output (illustrative):
# 1. Flat Index (Brute Force):
#    Query time: 45.23 ms
#
# 2. IVF Index:
#    nlist=1000, nprobe=10, Query time: 1.82 ms
#
# 3. IVF-PQ Index:
#    m=8, Query time: 0.34 ms
#
# 4. HNSW Index:
#    M=32, efSearch=128, Query time: 0.52 ms
#
# === Recall Comparison (vs Flat ground truth) ===
# ivf       : Recall@10 = 0.912
# ivfpq     : Recall@10 = 0.847
# hnsw      : Recall@10 = 0.998
```

Getting the best performance from ANN systems requires systematic benchmarking and tuning.
Key Metrics:
- Recall@k: fraction of the true k nearest neighbors returned
- QPS and latency: throughput and per-query latency, including tail percentiles
- Memory: index size, including vectors, codes, and graph edges
- Build time: how long it takes to construct (and retrain) the index
The Recall-Speed Frontier:
For any ANN system, there's a tradeoff curve between recall and speed. Plot recall vs QPS for different parameter settings to find the Pareto frontier.
Tuning Workflow:
1. Compute exact ground-truth neighbors for a sample of queries with brute force.
2. Fix the index type, then sweep one parameter at a time while measuring recall and QPS (see the sweep sketch after the table below).
3. Pick the cheapest configuration that meets your recall target, and re-validate on production-like data.
| Method | Parameter | Typical Range | Effect |
|---|---|---|---|
| HNSW | M | 8-64 | ↑ = better recall, more memory |
| HNSW | ef_construction | 100-500 | ↑ = better graph quality |
| HNSW | ef_search | 16-512 | ↑ = better recall, slower |
| IVF | n_list | √n to 4√n | More lists = finer partitions |
| IVF | n_probe | 1-10% of n_list | ↑ = better recall, slower |
| PQ | m (subspaces) | dim/8 to dim/2 | More = higher quality, more memory |
| PQ | nbits | 4-8 | 8 is standard (256 centroids) |
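As a concrete version of the sweep step, here is a hypothetical helper that traces the recall-speed frontier of a FAISS HNSW index by varying efSearch; the names index_hnsw, queries, and ground_truth are assumed to come from the FAISS demo above:

```python
import time
import numpy as np

def sweep_ef_search(index_hnsw, queries, ground_truth, ef_values=(16, 32, 64, 128, 256)):
    """Measure recall@10 and QPS at several efSearch settings (illustrative)."""
    frontier = []
    for ef in ef_values:
        index_hnsw.hnsw.efSearch = ef
        start = time.time()
        _, I = index_hnsw.search(queries, 10)
        qps = len(queries) / (time.time() - start)
        recall = np.mean([
            len(set(gt) & set(pred)) / len(gt)
            for gt, pred in zip(ground_truth, I)
        ])
        frontier.append((ef, recall, qps))
        print(f"efSearch={ef:4d}  recall@10={recall:.3f}  QPS={qps:,.0f}")
    return frontier
```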
The ann-benchmarks project provides standardized ANN benchmarks across datasets and libraries. Check the Pareto frontiers for your target dataset size and dimensionality before reinventing the wheel. The benchmark results are reproducible and regularly updated.
The ANN field continues to evolve rapidly. Here are some advanced topics shaping the future.
GPU Acceleration:
FAISS provides GPU implementations that achieve 10-100× speedup for:
- Brute-force (Flat) search over millions of vectors
- IVF-Flat and IVF-PQ search
- k-means training used to build coarse quantizers
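A minimal sketch of moving an existing CPU index onto the GPU with FAISS (requires a faiss-gpu build; index_flat and queries are assumed from the demo above):

```python
import faiss

res = faiss.StandardGpuResources()                       # allocate GPU scratch memory
gpu_index = faiss.index_cpu_to_gpu(res, 0, index_flat)   # copy index to GPU 0
D, I = gpu_index.search(queries[:10], 10)                # same search API as on CPU
```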
Hybrid Search (Vector + Filters):
Production systems often need: "Find nearest neighbors WHERE category='electronics' AND price < 100"
Approaches:
- Pre-filtering: apply the attribute filter first, then search only the surviving vectors (exact, but slow when the filter is loose)
- Post-filtering: over-fetch from the ANN index, then discard results that fail the filter (simple, but recall suffers for selective filters); see the sketch below
- Filter-aware indices: push the predicate into the index traversal itself, as Milvus and Weaviate do
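A sketch of the post-filtering approach, assuming a FAISS-style index plus a metadata list and a user-supplied predicate (all hypothetical names):

```python
import numpy as np

def post_filtered_search(index, query, metadata, predicate, k=10, overfetch=10):
    """Over-fetch k * overfetch candidates from the ANN index, then keep only
    those whose metadata passes the predicate. Recall drops when the filter is
    very selective, because too few survivors remain."""
    D, I = index.search(query.reshape(1, -1).astype(np.float32), k * overfetch)
    hits = [(d, i) for d, i in zip(D[0], I[0]) if i != -1 and predicate(metadata[i])]
    return hits[:k]
```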
Learned Indices:
Neural networks can learn to partition data better than k-means:
Streaming / Online Updates:
Most indices are built offline. For real-time insertions:
- HNSW supports incremental inserts natively (deletions are harder and often handled by marking nodes as deleted)
- IVF-PQ can append to its lists without retraining, but the coarse centroids and codebooks drift out of date if the data distribution shifts
- A common pattern is a small exact "fresh" index for recent vectors that is periodically merged into a rebuilt main index (see the sketch below)
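The delta-index pattern can be sketched as follows; the class and method names are illustrative, and the main index is assumed to be any trained FAISS-compatible index:

```python
import numpy as np
import faiss

class FreshnessLayer:
    """Search a large prebuilt main index together with a small exact index
    holding recently inserted vectors (a common delta-index pattern)."""

    def __init__(self, main_index, dim):
        self.main = main_index                 # e.g. a trained IVF-PQ or HNSW index
        self.fresh = faiss.IndexFlatL2(dim)    # exact index for recent vectors

    def add(self, x: np.ndarray):
        self.fresh.add(x)

    def search(self, queries: np.ndarray, k: int):
        D_main, I_main = self.main.search(queries, k)
        D_new, I_new = self.fresh.search(queries, k)
        # Offset fresh ids so they don't collide with main-index ids (illustrative)
        I_new = np.where(I_new >= 0, I_new + self.main.ntotal, I_new)
        D = np.concatenate([D_main, D_new], axis=1)
        I = np.concatenate([I_main, I_new], axis=1)
        order = np.argsort(D, axis=1)[:, :k]   # merge and keep the overall top k
        return np.take_along_axis(D, order, axis=1), np.take_along_axis(I, order, axis=1)
```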
We've journeyed from brute-force's $O(nd)$ complexity to production systems handling billions of vectors in milliseconds. Let's consolidate the key insights from this entire module.
| Scenario | Recommended Method | Key Parameters |
|---|---|---|
| Small dataset (< 10K) | Brute force | — |
| Low-dim, exact needed | KD-tree / Ball tree | leaf_size |
| High-dim, recall critical | HNSW | M, ef |
| Billion-scale, memory-limited | IVF-PQ | n_list, m, n_probe |
| Jaccard similarity (sets) | MinHash LSH | bands, rows |
| Managed, no-ops | Pinecone / Milvus | — |
Congratulations! You've mastered the data structures and algorithms that make K-Nearest Neighbors scalable. From the theoretical foundations of Voronoi diagrams to the practical engineering of FAISS, you now have the knowledge to build and deploy efficient similarity search systems at any scale. This completes Module 4: Efficient KNN.