K-Nearest Neighbors belongs to the family of instance-based learning algorithms—methods that store the entire training dataset and use it directly at prediction time. While this approach offers remarkable simplicity and the ability to model arbitrarily complex decision boundaries, it carries a fundamental computational burden: every prediction requires distance computations to all stored training instances.
Consider the implications at scale. A training set of 1 million instances with 100 features means every prediction involves 1 million distance calculations in 100-dimensional space. Even with efficient data structures like KD-trees or ball trees (which degrade to brute-force performance in high dimensions), this computational overhead becomes prohibitive for real-time applications.
But here's the critical insight: not all training instances contribute equally to the decision boundary. Points deep within the interior of a class region provide no discriminative value—removing them would leave the decision boundary unchanged. It's only the points near class boundaries that truly matter for classification.
Condensed Nearest Neighbors (CNN) exploits this observation to dramatically reduce the training set while preserving classification accuracy. It answers a deceptively simple question: What is the smallest subset of training data that correctly classifies all original training instances using 1-NN?
By completing this page, you will understand the theoretical foundation of prototype selection, master the CNN algorithm with its variants, analyze its computational complexity and convergence properties, recognize its limitations, and know when CNN is the right choice versus alternative data reduction techniques.
The mathematical foundation of CNN rests on the concept of a consistent subset. Before diving into the algorithm, we must precisely define what we're seeking.
Let $S = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$ be our original training set, where $\mathbf{x}_i \in \mathbb{R}^d$ are feature vectors and $y_i \in \{1, 2, \ldots, K\}$ are class labels.
Definition (Consistent Subset): A subset $T \subseteq S$ is said to be consistent with respect to $S$ if every point in $S$ is correctly classified by the 1-nearest-neighbor rule using only the points in $T$.
Formally, $T$ is consistent if:
$$\forall (\mathbf{x}_i, y_i) \in S: \text{1-NN}(\mathbf{x}_i, T) = y_i$$
where $\text{1-NN}(\mathbf{x}, T)$ returns the label of the nearest neighbor to $\mathbf{x}$ within $T$.
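To make the definition concrete, here is a minimal sketch of a consistency check. It assumes scikit-learn's `KNeighborsClassifier` is available; the function name `is_consistent` is introduced here for illustration only:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def is_consistent(X, y, subset_indices):
    """Return True if the subset classifies every point of (X, y)
    correctly under the 1-NN rule."""
    subset_indices = np.asarray(subset_indices)
    nn = KNeighborsClassifier(n_neighbors=1)
    nn.fit(X[subset_indices], y[subset_indices])
    # Points that are themselves in the subset are their own nearest
    # neighbor (distance 0), so they are trivially classified correctly.
    return bool(np.all(nn.predict(X) == y))
```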
The ideal goal would be to find the minimum consistent subset—the smallest $T$ that correctly classifies all of $S$. Unfortunately, the minimum consistent subset problem is NP-hard: hardness can be shown by a reduction from minimum set cover, whose decision version is a classic NP-complete problem.
Finding the absolute minimum consistent subset is computationally intractable for large datasets. The Condensed Nearest Neighbors algorithm provides a greedy heuristic that produces a consistent subset, but not necessarily the smallest one. The quality of the reduction depends heavily on the order of processing and the geometry of the data.
A consistent subset guarantees correct classification on the training data, but says nothing about generalization. In fact, there's a subtle but important relationship between subset size and generalization: because the condensed set reproduces the training-set decision boundary exactly, generalization is largely preserved when the data is clean, but aggressive condensation keeps precisely the points closest to the apparent boundary, which is where label noise concentrates, so on noisy data a small consistent subset can encode that noise rather than smooth it away.
The key insight is that points near class boundaries are geometrically necessary for maintaining the decision surface, while interior points are redundant because their contribution is dominated by closer boundary points.
Consider a two-class problem in 2D. The training data forms two clusters. The decision boundary of 1-NN is the Voronoi tessellation where each cell is assigned the label of its generating point.
Now observe: the 1-NN decision boundary consists only of the Voronoi edges that separate cells with different class labels. Removing an interior point, whose cell borders only same-class cells, simply redistributes its cell among same-class neighbors without moving that inter-class boundary; removing a border point does change it.
This geometric understanding motivates the CNN algorithm: iteratively identify and retain only those points that are necessary for correct classification.
| Point Type | Definition | Contribution to Decision Boundary | CNN Treatment |
|---|---|---|---|
| Border Point | Nearest to a point of different class | Directly shapes the decision boundary | Always retained |
| Interior Point | Surrounded only by same-class points | No contribution to inter-class boundary | Removed (redundant) |
| Outlier/Noise | Isolated point far from class bulk | Creates spurious local boundary | May be retained (problematic) |
| Bridge Point | Connects two clusters of same class | Maintains cluster connectivity | Situationally retained |
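To build intuition for these categories, a rough heuristic (a sketch for exploration, not part of the CNN algorithm itself) is to compare each point's distance to its nearest same-class neighbor against its distance to its nearest enemy, the closest point of any other class. Points whose nearest enemy is about as close as, or closer than, their nearest same-class neighbor tend to be border points or noise; the rest are interior. The helper name `nearest_enemy_ratio` is an assumption of this example:

```python
import numpy as np
from scipy.spatial.distance import cdist

def nearest_enemy_ratio(X, y):
    """For each point, return d(nearest same-class neighbor) / d(nearest enemy).
    Small ratios suggest interior points; ratios near or above 1 suggest
    border points or potential noise."""
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)          # ignore self-distances
    same = y[:, None] == y[None, :]      # same-class mask
    friend = np.where(same, D, np.inf).min(axis=1)   # nearest same-class point
    enemy = np.where(~same, D, np.inf).min(axis=1)   # nearest other-class point
    return friend / enemy
```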
The original CNN algorithm was introduced by Peter Hart in 1968. It operates through an incremental, greedy approach: start with a minimal subset and iteratively add points that are misclassified by the current subset.
The algorithm maintains two sets: STORE, the growing condensed subset used to classify, and GRABBAG, the pool of remaining training points that have not (yet) been found necessary.
Algorithm (CNN - Hart, 1968):
```python
def condensed_nearest_neighbors(X, y, distance_metric='euclidean'):
    """
    Condensed Nearest Neighbors Algorithm (Hart, 1968)

    Computes a consistent subset of training data that correctly
    classifies all original training instances using 1-NN.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        Training feature vectors
    y : ndarray of shape (n_samples,)
        Training class labels
    distance_metric : str or callable
        Distance metric for nearest neighbor computation

    Returns:
    --------
    store_indices : list
        Indices of points retained in the condensed set
    """
    import numpy as np
    from scipy.spatial.distance import cdist

    n_samples = len(X)
    classes = np.unique(y)

    # Step 1: Initialize STORE with one point from each class
    # This ensures STORE can represent every class
    store_indices = []
    for c in classes:
        class_indices = np.where(y == c)[0]
        store_indices.append(class_indices[0])

    store_indices = set(store_indices)
    grabbag_indices = set(range(n_samples)) - store_indices

    # Step 2: Iterate until no changes occur (convergence)
    changed = True
    iteration = 0

    while changed:
        changed = False
        iteration += 1

        # Convert to lists for indexing
        store_list = list(store_indices)
        grabbag_list = list(grabbag_indices)

        if len(grabbag_list) == 0:
            break

        # Compute distances from grabbag points to store points
        X_store = X[store_list]
        X_grabbag = X[grabbag_list]
        distances = cdist(X_grabbag, X_store, metric=distance_metric)

        # For each grabbag point, find nearest store neighbor
        nearest_store_idx = np.argmin(distances, axis=1)
        predicted_labels = y[np.array(store_list)[nearest_store_idx]]
        true_labels = y[grabbag_list]

        # Find misclassified points
        misclassified_mask = predicted_labels != true_labels

        if np.any(misclassified_mask):
            changed = True
            # Add misclassified points to STORE
            misclassified_grabbag = np.array(grabbag_list)[misclassified_mask]
            for idx in misclassified_grabbag:
                store_indices.add(idx)
                grabbag_indices.remove(idx)

    print(f"CNN completed in {iteration} iterations")
    print(f"Reduced from {n_samples} to {len(store_indices)} points")
    print(f"Reduction ratio: {100 * (1 - len(store_indices)/n_samples):.1f}%")

    return list(store_indices)
```

Let's trace through a simple example to build intuition. Consider 8 training points in 2D from two classes (+ and -):
| Point | Coordinates | Class |
|---|---|---|
| A | (1, 1) | + |
| B | (2, 1) | + |
| C | (1.5, 2) | + |
| D | (5, 5) | + |
| E | (7, 1) | - |
| F | (8, 2) | - |
| G | (7.5, 1.5) | - |
| H | (9, 3) | - |
Iteration 0 (Initialization): STORE = {A, E} (the first point of each class), GRABBAG = {B, C, D, F, G, H}.
Iteration 1: Each GRABBAG point is classified by 1-NN against STORE. B and C are closest to A, and F, G, H are closest to E, so all five are correctly classified. D = (5, 5), however, is closer to E (distance ≈ 4.47) than to A (distance ≈ 5.66), so it is misclassified as - and moved to STORE.
Iteration 2: With STORE = {A, D, E}, every remaining GRABBAG point is correctly classified, so the algorithm terminates with STORE = {A, D, E}.
This example illustrates why CNN achieves dramatic reduction: the two tight, well-separated clusters each need only one representative, and only the stray point D between them must also be kept; the training set shrinks from 8 points to 3.
The CNN algorithm is highly sensitive to initialization and processing order. Different initializations can lead to different final subsets, all of which are consistent but have different sizes. Randomizing the order and running multiple trials can help find smaller consistent subsets.
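One practical way to exploit this is to shuffle the data before each run and keep the smallest consistent subset found. A minimal sketch, assuming the `condensed_nearest_neighbors` function defined above; the helper name `cnn_best_of_k` is an assumption of this example:

```python
import numpy as np

def cnn_best_of_k(X, y, n_trials=10, random_state=0):
    """Run CNN on several random orderings and keep the smallest result."""
    rng = np.random.default_rng(random_state)
    best = None
    for _ in range(n_trials):
        perm = rng.permutation(len(X))
        # Indices returned refer to the permuted arrays,
        # so map them back to the original indexing.
        kept = condensed_nearest_neighbors(X[perm], y[perm])
        kept = perm[np.asarray(kept, dtype=int)]
        if best is None or len(kept) < len(best):
            best = kept
    return list(best)
```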
Understanding CNN's computational behavior is critical for determining its applicability to large-scale problems. Let's analyze both time and space complexity rigorously.
Let $n$ be the number of training instances, $d$ the dimensionality, and $k$ the number of iterations until convergence.
Per-Iteration Cost: In each iteration, we compute distances between all GRABBAG points and all STORE points. If GRABBAG has $g$ points and STORE has $s$ points:
$$\text{Distance computation: } O(g \cdot s \cdot d)$$
Number of Iterations: The algorithm terminates when no point moves from GRABBAG to STORE. In the worst case, exactly one point is added per iteration, requiring $O(n)$ iterations.
Worst-Case Total Complexity:
$$O(n^3 \cdot d)$$
This occurs when the classes are heavily interleaved or the labels are noisy, so that only one point is added per iteration and STORE eventually grows to nearly all of $S$; each of the $O(n)$ iterations then costs up to $O(n^2 \cdot d)$ in distance computations.
Average-Case Behavior: In practice, many points are often added in early iterations, and the algorithm converges much faster. Empirically, for well-separated clusters, the number of iterations is typically $O(\log n)$ to $O(\sqrt{n})$.
| Operation | Best Case | Average Case | Worst Case |
|---|---|---|---|
| Iterations | O(1) | O(log n) to O(√n) | O(n) |
| Per-iteration distance computation | O(n · d) | O(n · \|STORE\| · d) | O(n² · d) |
| Total time complexity | O(n · d) | O(n · √n · d) | O(n³ · d) |
| Space complexity | O(n · d) | O(n · d) | O(n · d) |
CNN requires: $O(n \cdot d)$ to hold the training data, $O(n)$ for the STORE and GRABBAG index sets, and $O(g \cdot s)$ for the per-iteration distance matrix between the $g$ GRABBAG points and $s$ STORE points, which approaches $O(n^2)$ if fully materialized.
The distance matrix is the dominant factor. A memory-efficient implementation computes distances row-by-row rather than materializing the full matrix.
The reduction ratio is the key practical metric:
$$\text{Reduction Ratio} = 1 - \frac{|\text{STORE}|}{n}$$
Factors Affecting Reduction: class separability (well-separated clusters condense dramatically), label noise (noisy points masquerade as boundary points and are retained), dimensionality (in high dimensions proportionally more points lie near the boundary), and the intrinsic complexity of the decision boundary.
The real benefit of CNN comes at prediction time, not training time. If CNN reduces the training set by 90% (from 100,000 to 10,000 points), every prediction becomes 10× faster. For applications requiring millions of predictions, this speedup is transformative—even if the initial CNN computation takes hours.
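A quick way to verify this payoff on your own data is to time 1-NN prediction against the full set versus the condensed set. The sketch below uses scikit-learn's `KNeighborsClassifier` with brute-force search so the timing reflects the number of stored instances; the variables `X`, `y`, `X_test`, and `store_idx` in the usage comment are assumptions:

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def time_1nn(X_train, y_train, X_query):
    """Time brute-force 1-NN prediction over X_query."""
    clf = KNeighborsClassifier(n_neighbors=1, algorithm='brute')
    clf.fit(X_train, y_train)
    start = time.perf_counter()
    clf.predict(X_query)
    return time.perf_counter() - start

# Illustrative usage, assuming X, y, X_test, and CNN output store_idx exist:
# t_full = time_1nn(X, y, X_test)
# t_cnn  = time_1nn(X[store_idx], y[store_idx], X_test)
# print(f"Speedup: {t_full / t_cnn:.1f}x")
```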
The CNN algorithm enjoys important theoretical guarantees, though with notable limitations. Understanding these properties is essential for principled application.
Theorem (CNN Convergence): The CNN algorithm terminates in finite time.
Proof Sketch: STORE only grows and is bounded above by the full training set of size $n$. Each iteration either moves at least one point from GRABBAG to STORE or makes no change, and the algorithm stops at the first iteration with no change. Hence there can be at most $n$ point-adding iterations, so the algorithm terminates. ∎
Theorem (CNN Consistency): Upon termination, STORE is a consistent subset of the original training set.
Proof: By the termination condition, the algorithm stops only when every point in GRABBAG (and trivially, every point in STORE) is correctly classified by 1-NN using STORE.
Combined with the fact that STORE ∪ GRABBAG always equals the original training set, this means every original training point is correctly classified by the final STORE. Thus STORE is consistent. ∎
CNN can be understood through covering theory. A covering of a set is a collection of subsets whose union contains the original set. CNN finds a covering such that each training point is "covered" by a consistent neighbor in STORE.
The fundamental combinatorial object is the consistent covering: a collection of STORE points such that every training point's nearest neighbor within STORE carries its own label. In other words, each STORE point "covers" the training points it correctly claims under the 1-NN rule.
This perspective connects CNN to other prototype selection methods in computational geometry, particularly the minimum dominating set problem on the mutual k-nearest-neighbor graph.
While worst-case convergence is $O(n)$ iterations, the actual convergence rate depends on data geometry:
Fast Convergence (Few Iterations): well-separated classes, clean labels, and simple, smooth boundaries; most of the necessary points are discovered in the first pass or two.
Slow Convergence (Many Iterations): overlapping classes, label noise, or high-dimensional data in which nearly every point lies near the apparent boundary; points trickle into STORE a few at a time.
The original CNN algorithm has inspired numerous improvements addressing its limitations. These variants offer different tradeoffs between reduction rate, accuracy, and computational cost.
RNN is a post-processing step applied after CNN. It attempts to further shrink the condensed set by removing points that are no longer necessary for consistency.
RNN Algorithm:
```python
def reduced_nearest_neighbors(X, y, store_indices, distance_metric='euclidean'):
    """
    Reduced Nearest Neighbors - Post-processing to further shrink CNN output

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        Original training feature vectors
    y : ndarray of shape (n_samples,)
        Training class labels
    store_indices : list
        Indices from CNN output

    Returns:
    --------
    reduced_indices : list
        Further reduced set of indices
    """
    import numpy as np
    from scipy.spatial.distance import cdist

    store_indices = list(store_indices)
    initial_size = len(store_indices)

    # Process in reverse order (points added later are more likely dispensable)
    for idx in reversed(store_indices.copy()):
        # Temporarily remove this point
        temp_store = [i for i in store_indices if i != idx]

        if len(temp_store) == 0:
            continue

        # Check if all original points are still correctly classified
        X_store = X[temp_store]
        distances = cdist(X, X_store, metric=distance_metric)
        nearest_indices = np.argmin(distances, axis=1)
        predicted = y[np.array(temp_store)[nearest_indices]]

        if np.all(predicted == y):
            # This point is dispensable
            store_indices.remove(idx)

    print(f"RNN reduced from {initial_size} to {len(store_indices)} points")
    return store_indices
```

GCNN extends CNN to work with k-NN for k > 1 rather than just 1-NN. The consistency criterion becomes: all original points must be correctly classified by k-NN using STORE.
This is more challenging because a single correct nearest neighbor no longer suffices: a majority of the k nearest STORE neighbors must carry the correct label, so each training point needs several same-class representatives nearby and the consistent subsets are correspondingly larger. A sketch of the stricter consistency test follows.
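The sketch below illustrates only the consistency criterion under k-NN majority voting, not a full GCNN implementation; the helper name `is_knn_consistent` is an assumption of this example:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def is_knn_consistent(X, y, subset_indices, k=3):
    """True if k-NN majority voting over the subset classifies
    every original training point correctly."""
    subset_indices = np.asarray(subset_indices)
    if len(subset_indices) < k:
        return False   # not enough prototypes to take k votes
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X[subset_indices], y[subset_indices])
    return bool(np.all(clf.predict(X) == y))
```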
MCNN modifies the iteration strategy to process points in an order based on their estimated distance to the decision boundary, so that boundary points enter STORE early and the subset stabilizes sooner. The variants are compared below:
| Variant | Key Modification | Advantage | Disadvantage |
|---|---|---|---|
| CNN (Original) | Greedy addition of misclassified points | Simple, guaranteed consistent | Not minimal, order-sensitive |
| RNN (Reduced) | Post-processing removal step | Smaller subsets than CNN alone | Additional O(n²) overhead |
| GCNN (Generalized) | Works with k-NN for k > 1 | Consistent for k-NN classification | Larger subsets required |
| MCNN (Modified) | Boundary-prioritized ordering | Often finds smaller subsets | Requires boundary estimation |
| Fast CNN | Random sampling + incremental | Scales to larger datasets | Approximate consistency |
For most applications, apply CNN followed by RNN. The combination typically achieves 80-95% reduction on datasets with well-defined clusters, while the computational overhead of RNN is modest compared to the improvement in final subset size.
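A minimal sketch of that recommendation, assuming the `condensed_nearest_neighbors` and `reduced_nearest_neighbors` functions defined earlier on this page; the wrapper name `cnn_rnn_pipeline` is an assumption of this example:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def cnn_rnn_pipeline(X_train, y_train):
    """Condense with CNN, prune the result with RNN, and return a
    1-NN classifier fit on the final prototypes."""
    cnn_idx = condensed_nearest_neighbors(X_train, y_train)
    final_idx = reduced_nearest_neighbors(X_train, y_train, cnn_idx)
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(X_train[final_idx], y_train[final_idx])
    return clf, final_idx
```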
Implementing CNN effectively in production systems requires attention to several practical details that aren't evident from the theoretical algorithm.
The choice of which points to initially add to STORE significantly affects the final result. Several strategies exist:
Random per Class: Select one random point from each class. Simple but may require multiple runs.
Centroid Seeding: Select the point closest to each class centroid. Tends to start with "typical" representatives.
Boundary Seeding: Use a quick heuristic to estimate boundary points and start with those. Often produces smaller final sets.
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def initialize_store_centroid(X, y):
    """Initialize STORE with points nearest to class centroids"""
    classes = np.unique(y)
    initial_indices = []

    for c in classes:
        class_mask = y == c
        class_points = X[class_mask]
        class_indices = np.where(class_mask)[0]

        # Compute class centroid
        centroid = class_points.mean(axis=0)

        # Find point nearest to centroid
        distances = np.linalg.norm(class_points - centroid, axis=1)
        nearest_idx = class_indices[np.argmin(distances)]
        initial_indices.append(nearest_idx)

    return set(initial_indices)


def initialize_store_boundary(X, y, n_neighbors=5):
    """Initialize STORE with estimated boundary points"""
    n_samples = len(X)
    initial_indices = set()

    # Fit nearest neighbors on full dataset
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1)
    nn.fit(X)

    # For each point, check if neighbors include different classes
    for i in range(n_samples):
        _, indices = nn.kneighbors(X[i].reshape(1, -1))
        neighbor_labels = y[indices[0][1:]]  # Exclude self

        # If neighbors contain different classes, this is a boundary point
        if len(np.unique(neighbor_labels)) > 1 or y[i] not in neighbor_labels:
            initial_indices.add(i)

    # Ensure at least one point per class
    classes = np.unique(y)
    for c in classes:
        if not any(y[i] == c for i in initial_indices):
            class_indices = np.where(y == c)[0]
            initial_indices.add(class_indices[0])

    return initial_indices
```

The original CNN processes the entire GRABBAG in each iteration. For very large datasets, this becomes inefficient. Incremental CNN processes points one at a time:
```python
def incremental_cnn(X, y, distance_metric='euclidean'):
    """
    Incremental CNN - Processes one point at a time
    More memory efficient for large datasets
    """
    import numpy as np
    from scipy.spatial.distance import cdist

    n_samples = len(X)
    classes = np.unique(y)

    # Initialize with one point per class
    store_indices = []
    for c in classes:
        class_idx = np.where(y == c)[0][0]
        store_indices.append(class_idx)

    store_set = set(store_indices)
    remaining = set(range(n_samples)) - store_set

    # Single pass through remaining points
    for idx in list(remaining):
        store_list = list(store_set)
        X_store = X[store_list]
        x_query = X[idx].reshape(1, -1)

        # Find nearest neighbor in current STORE
        distances = cdist(x_query, X_store, metric=distance_metric)
        nearest_store_idx = store_list[np.argmin(distances)]

        # If misclassified, add to STORE
        if y[nearest_store_idx] != y[idx]:
            store_set.add(idx)

    return list(store_set)
```

Incremental CNN is faster (single pass) but may produce larger subsets than iterative CNN. Points that would be correctly classified after later additions to STORE are added prematurely because they were processed before those later additions. Use iterative CNN when subset size is critical; use incremental CNN when training time is the bottleneck.
Like all distance-based methods, CNN is sensitive to feature scales. Features with larger numeric ranges dominate the distance computation.
Best Practice: Standardize or normalize features before applying CNN, for example z-score standardization (zero mean, unit variance per feature) or min-max scaling to [0, 1]. The choice should match whatever preprocessing you'll use at prediction time; the same fitted transformation must be applied to test data.
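A minimal sketch of that workflow with scikit-learn's `StandardScaler`, assuming `X_train`, `y_train`, and `X_test` are already loaded and `condensed_nearest_neighbors` refers to the implementation above:

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Condense in the scaled feature space
store_idx = condensed_nearest_neighbors(X_scaled, y_train)

# At prediction time, apply the SAME fitted transformation
X_test_scaled = scaler.transform(X_test)
```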
For large datasets, materializing an $n \times n$ distance matrix is impractical. Use chunked distance computation:
```python
def chunked_nearest_neighbor(X_query, X_store, y_store, chunk_size=1000):
    """
    Memory-efficient nearest neighbor search
    Processes query points in chunks to avoid large distance matrices
    """
    import numpy as np
    from scipy.spatial.distance import cdist

    n_query = len(X_query)
    predictions = np.zeros(n_query, dtype=y_store.dtype)

    for start in range(0, n_query, chunk_size):
        end = min(start + chunk_size, n_query)
        chunk = X_query[start:end]

        # Only compute distances for this chunk
        distances = cdist(chunk, X_store)
        nearest_indices = np.argmin(distances, axis=1)
        predictions[start:end] = y_store[nearest_indices]

    return predictions
```

CNN is a powerful tool, but it's not universally applicable. Understanding its ideal use cases—and its limitations—helps you make informed decisions.
Spam detection for email client: Training set of 500K emails (90% ham, 10% spam). After CNN, reduced to 50K emails. Client-side classification is now 10× faster, works offline, and fits in device memory.
Medical diagnosis with noisy labels: Each misdiagnosed training case becomes a 'critical boundary point.' CNN retains most of the noisy data while providing false confidence in the reduction. Better to use noise-tolerant methods first.
Use this framework to decide whether CNN fits your problem:
Estimate class separability: Plot or compute inter-class vs. intra-class distance ratios. High separation → CNN effective.
Assess label quality: Quantify noise rate if possible. > 5% noise → Consider Edited Nearest Neighbors first (covered in next page).
Measure dimensionality impact: For $d > 20$, test CNN on a sample first. If reduction < 50%, alternative methods may be better.
Compute ROI: Estimate $\text{(Prediction volume)} \times \text{(Speedup per prediction)} - \text{(CNN training time)}$. Positive ROI = use CNN (a worked example follows this list).
Validate rigorously: Always evaluate CNN performance on held-out test data, not just training set consistency.
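As a worked example with purely illustrative numbers: suppose the one-time CNN run takes 2 hours, each prediction becomes 9 ms faster, and the system serves 10 million predictions per month. The monthly saving then dwarfs the one-time cost:

$$10^7 \times 0.009\,\text{s} = 90{,}000\,\text{s} \approx 25\,\text{hours} \gg 2\,\text{hours of one-time CNN training}$$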
We've conducted a deep exploration of Condensed Nearest Neighbors—from its theoretical foundations to practical implementation. Let's consolidate the essential knowledge:
- A consistent subset correctly classifies every original training point under the 1-NN rule; finding the minimum consistent subset is NP-hard.
- Hart's CNN is a greedy STORE/GRABBAG procedure that always converges to a consistent, though not necessarily minimal, subset, and its result depends on initialization and processing order.
- Worst-case cost is $O(n^3 \cdot d)$, but convergence is far faster on well-separated data, and the payoff is a permanently cheaper prediction step.
- Variants (RNN, GCNN, MCNN, incremental and fast CNN) trade off subset size, consistency guarantees, and runtime; CNN followed by RNN is a strong default.
- CNN works best with well-separated classes, clean labels, and moderate dimensionality; noisy or heavily overlapping data calls for editing methods first.
CNN focuses on retaining necessary instances—but what about removing harmful instances? Noisy labels and outliers can degrade KNN performance even after CNN condensation.
The next page explores Edited Nearest Neighbors (ENN), which takes the opposite approach: instead of selecting points to keep, it identifies points to remove based on neighborhood consistency. Together, CNN and ENN form a powerful data cleaning pipeline for instance-based learning.
You now possess a thorough understanding of Condensed Nearest Neighbors—its theoretical basis, algorithm, complexity, variants, and practical applications. You can identify when CNN is appropriate and implement it effectively. Next, we examine Edited Nearest Neighbors for noise reduction and outlier removal.