The previous page examined Condensed Nearest Neighbors, which reduces storage by retaining only boundary-defining instances. But condensation has a blind spot: it treats all training labels as ground truth. When labels are noisy—as they invariably are in real-world data—CNN faithfully preserves those errors.
Consider the implications: a single mislabeled point in the interior of Class A, incorrectly labeled as Class B, becomes a "critical boundary point" from CNN's perspective. It creates an artificial decision boundary that shouldn't exist. Not only does CNN retain this point, but it may retain additional points to "defend" the correct Class A label nearby.
Noisy labels are the enemy of instance-based learning.
K-Nearest Neighbors is particularly susceptible to label noise because it makes predictions based on local neighborhoods. A mislabeled point in the wrong region doesn't just affect its own Voronoi cell—it corrupts the voting for all nearby test points. The effect is pronounced for small k, where a single noisy neighbor can flip the prediction, and troublesome for large k, where noisy points in transitional regions create systematic bias.
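A quick way to see this sensitivity is a toy experiment (the coordinates and the single flipped label below are illustrative, not from the text): a query deep inside Class 0 territory is flipped by one mislabeled neighbor when k=1, but recovered when k=5.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two tight clusters; one interior point of class 0 carries a flipped label
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2],   # class 0 region
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1], [5.1, 5.2]])  # class 1 region
y = np.array([0, 0, 0, 1,   # the point at (1.1, 1.2) is mislabeled as class 1
              1, 1, 1, 1])

query = np.array([[1.05, 1.15]])  # test point deep inside class 0 territory

for k in (1, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(f"k={k}: predicted class {knn.predict(query)[0]}")

# With k=1 the mislabeled neighbor flips the prediction to class 1;
# with k=5 the correct neighbors outvote it and class 0 is recovered.
```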
Edited Nearest Neighbors (ENN) addresses this vulnerability by identifying and removing instances that appear inconsistent with their neighbors. Rather than asking "which points do we need for classification?" (the CNN question), ENN asks "which points harm classification?"
By completing this page, you will understand the theoretical motivation for neighborhood-based editing, master the ENN algorithm and its variants, learn to combine ENN with CNN for comprehensive data reduction, analyze the trade-offs between aggressive and conservative editing, and know when noise removal is appropriate versus risky.
The foundation of ENN rests on a simple but powerful principle: if a point is misclassified by its neighbors, something is wrong—either the point is noisy, or it's a genuine outlier that harms generalization. Either way, removing it may improve performance.
Let $(\mathbf{x}_i, y_i)$ be a training instance. Consider its k-nearest neighbors in the training set (excluding itself):
$$N_k(\mathbf{x}_i) = \{\mathbf{x}_j : \mathbf{x}_j \in \text{kNN}(\mathbf{x}_i)\}$$
Definition (Wilson's Editing Rule): An instance $(\mathbf{x}_i, y_i)$ is consistent if the majority of its k neighbors share its label:
$$\text{Consistent}(\mathbf{x}_i) = \begin{cases} \text{True} & \text{if } |\{j : \mathbf{x}_j \in N_k(\mathbf{x}_i),\ y_j = y_i\}| > \frac{k}{2} \\ \text{False} & \text{otherwise} \end{cases}$$
An instance is inconsistent if the majority of its neighbors disagree with its label. ENN removes all inconsistent instances.
The majority criterion isn't arbitrary—it connects directly to the asymptotic behavior of k-NN:
Theorem (Cover & Hart, 1967): As the training set size $n \to \infty$ and $k \to \infty$ with $k/n \to 0$, the error rate of k-NN converges to at most twice the Bayes error rate.
For k-NN to achieve this optimal behavior, training points should reflect the true conditional class probabilities $P(Y|X)$. A point with label $y$ in a region where $P(Y=y|X) < 0.5$ violates this assumption—it's inconsistent with the local class probability distribution.
By removing points that are minority-labeled in their neighborhoods, ENN nudges the training set toward better alignment with the underlying conditional distributions.
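One way to read the majority rule, sketched below under the assumption that `X` and `y` are NumPy arrays, is as a local estimate of $P(Y=y_i|\mathbf{x}_i)$: the fraction of a point's k neighbors that share its label. Wilson's rule, as defined above, keeps exactly the points whose estimate exceeds 0.5.

```python
import numpy as np
from scipy.spatial.distance import cdist

def local_agreement(X, y, k=3):
    """For each point, estimate P(Y = y_i | x_i) as the fraction of its
    k nearest neighbors (self excluded) that share its label."""
    distances = cdist(X, X)
    np.fill_diagonal(distances, np.inf)          # never count a point as its own neighbor
    knn_indices = np.argsort(distances, axis=1)[:, :k]
    return (y[knn_indices] == y[:, None]).mean(axis=1)

# Wilson's rule keeps a point exactly when this local estimate exceeds 1/2:
# keep_mask = local_agreement(X, y, k=3) > 0.5
```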
| Type | Description | True Label | Should Remove? |
|---|---|---|---|
| Label Noise | Correctly positioned but mislabeled during data collection | Probably wrong | Yes - harmful to KNN |
| Class Overlap Boundary | Near true decision boundary; neighbors naturally mixed | Correct | Maybe - depends on k |
| Outlier | Far from class bulk; genuinely belongs to unusual region | Correct | Often - unusual for generalization |
| Rare Subclass | Part of legitimate small cluster within a class | Correct | No - represents real pattern |
ENN cannot distinguish between label noise and legitimate rare instances. Points near true decision boundaries or belonging to small valid clusters may also appear 'inconsistent.' Removing them erases genuine patterns. The choice of k and editing aggressiveness must balance noise removal against information loss.
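To make the rare-subclass risk concrete, here is a small hypothetical example (the cluster positions and sizes are invented for illustration): a tight but legitimate three-point cluster of class B is entirely edited away once k exceeds its size.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Class A: a broad cluster; class B: a tiny but genuine sub-cluster of 3 points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(60, 2)),
               rng.normal([0.5, 0.5], 0.05, size=(3, 2))])   # rare subclass
y = np.array(["A"] * 60 + ["B"] * 3)

# Wilson's rule with k=5: each B point has at most 2 B neighbors,
# so the majority always votes A and the whole subclass is edited away.
k = 5
d = cdist(X, X)
np.fill_diagonal(d, np.inf)
nn = np.argsort(d, axis=1)[:, :k]
keep = np.array([(y[nn[i]] == y[i]).sum() > k / 2 for i in range(len(X))])
print("Class B points kept:", int(np.sum(y[keep] == "B")))   # expect 0
```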
The original Edited Nearest Neighbors algorithm was introduced by Dennis Wilson in 1972. It's remarkably simple yet effective.
Wilson's editing operates in a single pass through the data:
```python
def edited_nearest_neighbors(X, y, k=3, distance_metric='euclidean'):
    """
    Wilson's Edited Nearest Neighbors Algorithm (1972)

    Removes instances that are misclassified by their k neighbors,
    effectively cleaning noise and smoothing decision boundaries.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        Training feature vectors
    y : ndarray of shape (n_samples,)
        Training class labels
    k : int
        Number of neighbors for consistency check (typically 3-5)
    distance_metric : str or callable
        Distance metric for neighbor computation

    Returns:
    --------
    retained_indices : list
        Indices of points retained after editing
    removed_indices : list
        Indices of points removed as inconsistent
    """
    import numpy as np
    from scipy.spatial.distance import cdist

    n_samples = len(X)

    # Compute full pairwise distance matrix
    # For large datasets, use chunked or approximate methods
    distances = cdist(X, X, metric=distance_metric)

    # Set diagonal to infinity (don't count self as neighbor)
    np.fill_diagonal(distances, np.inf)

    # Find k nearest neighbors for each point
    knn_indices = np.argsort(distances, axis=1)[:, :k]

    # Check consistency for each point
    retained_indices = []
    removed_indices = []

    for i in range(n_samples):
        neighbor_labels = y[knn_indices[i]]

        # Majority vote among neighbors
        unique_labels, counts = np.unique(neighbor_labels, return_counts=True)
        majority_label = unique_labels[np.argmax(counts)]

        if majority_label == y[i]:
            retained_indices.append(i)
        else:
            removed_indices.append(i)

    print(f"ENN (k={k}): Retained {len(retained_indices)}, Removed {len(removed_indices)}")
    print(f"Removal rate: {100 * len(removed_indices) / n_samples:.2f}%")

    return retained_indices, removed_indices
```

Consider a small dataset with 8 points (labels A and B) and k=3:
| Point | Coordinates | Label | 3 Nearest Neighbors | Neighbor Labels | Majority | Consistent? |
|---|---|---|---|---|---|---|
| P1 | (1, 1) | A | P2, P3, P4 | A, A, B | A | ✓ Yes |
| P2 | (2, 1) | A | P1, P3, P4 | A, A, B | A | ✓ Yes |
| P3 | (1.5, 2) | A | P4, P1, P2 | B, A, A | A | ✓ Yes |
| P4 | (2, 2.5) | B | P3, P2, P1 | A, A, A | A | ✗ No |
| P5 | (7, 1) | B | P7, P6, P8 | B, B, B | B | ✓ Yes |
| P6 | (8, 2) | B | P5, P7, P8 | B, B, B | B | ✓ Yes |
| P7 | (7.5, 1.5) | B | P5, P6, P8 | B, B, B | B | ✓ Yes |
| P8 | (9, 3) | B | P6, P7, P5 | B, B, B | B | ✓ Yes |
Result: Point P4 is removed. It was labeled B but sits in the middle of A's territory—likely a mislabeled point.
Notice that ENN doesn't just remove noise; it effectively creates a smoother decision boundary by eliminating spurious intrusions of one class into another's region.
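For reference, the worked example can be reproduced with the `edited_nearest_neighbors` function above; this usage sketch just feeds in the coordinates from the table.

```python
import numpy as np

X = np.array([[1, 1], [2, 1], [1.5, 2], [2, 2.5],     # P1-P4
              [7, 1], [8, 2], [7.5, 1.5], [9, 3]])    # P5-P8
y = np.array(["A", "A", "A", "B", "B", "B", "B", "B"])

retained, removed = edited_nearest_neighbors(X, y, k=3)
print("Removed points:", [f"P{i + 1}" for i in removed])   # expect ['P4']
```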
Wilson's original algorithm runs once. But removing inconsistent points changes the neighborhoods of remaining points! Some points that were consistent may become inconsistent (or vice versa). Repeated ENN (covered later) iterates until convergence, often achieving more thorough cleaning at the cost of more data loss.
The parameter k in ENN isn't just an algorithmic detail—it fundamentally controls the trade-off between noise removal and information preservation. Understanding this trade-off is essential for effective application.
With small k, a point needs very few disagreeing neighbors to be marked inconsistent:
k=1: Remove if the single nearest neighbor has a different label.
k=3: Remove if 2+ of 3 neighbors disagree.
| k Value | Removal Criterion | Editing Aggressiveness | Best For |
|---|---|---|---|
| 1 | Nearest neighbor disagrees | Very aggressive | Obvious noise, well-separated classes |
| 3 | ≥2 of 3 neighbors disagree | Moderate | General-purpose noise reduction |
| 5 | ≥3 of 5 neighbors disagree | Conservative | Preserving boundary points |
| 7+ | ≥4+ neighbors disagree | Very conservative | High overlap, valuable rare instances |
With larger k, the consistency criterion becomes more forgiving:
k=5: A point must have 3+ neighbors disagreeing to be removed.
k≥7: More than half of the neighbors (4+ for k=7) must disagree; editing becomes increasingly conservative. The sketch below shows how the removal rate typically falls as k grows.
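To observe this effect, a quick sweep over a noisy synthetic dataset works well. The sketch below assumes the `edited_nearest_neighbors` function defined earlier; the blob parameters and the 10% flip rate are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Two well-separated blobs with moderate spread
X, y = make_blobs(n_samples=500, centers=2, cluster_std=2.0, random_state=42)

# Inject ~10% label noise by flipping 50 randomly chosen labels
rng = np.random.default_rng(42)
noisy = rng.choice(len(y), size=50, replace=False)
y = y.copy()
y[noisy] = 1 - y[noisy]

# Each call prints its retained/removed counts and removal rate;
# the removal rate typically shrinks as k grows.
for k in (1, 3, 5, 7):
    edited_nearest_neighbors(X, y, k=k)
```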
The optimal k depends on several factors: the amount of label noise, the degree of class overlap, and the size of the smallest class. In practice, it can be selected empirically with cross-validation:
```python
def select_k_for_enn(X, y, k_candidates=[1, 3, 5, 7], cv_folds=5):
    """
    Select optimal k for ENN using cross-validation.

    For each candidate k:
    1. Apply ENN with that k
    2. Evaluate k-NN accuracy on held-out data
    3. Choose k that maximizes validation accuracy
    """
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    best_k = None
    best_score = -1
    results = {}

    for k in k_candidates:
        # Apply ENN
        retained_idx, _ = edited_nearest_neighbors(X, y, k=k)
        X_edited = X[retained_idx]
        y_edited = y[retained_idx]

        if len(np.unique(y_edited)) < len(np.unique(y)):
            # Lost entire classes - skip this k
            results[k] = {'score': 0, 'retained': len(retained_idx), 'note': 'lost classes'}
            continue

        # Evaluate with cross-validation
        knn = KNeighborsClassifier(n_neighbors=min(5, len(X_edited) // 2))
        try:
            scores = cross_val_score(knn, X_edited, y_edited, cv=cv_folds)
            mean_score = np.mean(scores)
        except Exception:
            mean_score = 0

        results[k] = {
            'score': mean_score,
            'retained': len(retained_idx),
            'removed_pct': 100 * (len(y) - len(retained_idx)) / len(y)
        }

        if mean_score > best_score:
            best_score = mean_score
            best_k = k

    print("ENN k selection results:")
    for k, res in results.items():
        print(f"  k={k}: accuracy={res['score']:.3f}, retained={res['retained']}")
    print(f"Best k: {best_k} (accuracy={best_score:.3f})")

    return best_k, results
```

In the absence of domain knowledge, k=3 is a widely-used default for ENN. It's aggressive enough to remove clear noise while conservative enough to preserve most boundary points. If subsequent analysis shows too much or too little removal, adjust k accordingly.
Wilson's original ENN has spawned important variants that address its limitations or extend its capabilities. Understanding these variants enables you to select the right tool for each situation.
Standard ENN runs once and stops. But removing points changes the neighborhoods of remaining points. Some previously consistent points may become inconsistent when their noisy neighbors are removed.
Repeated ENN iterates until no more points are removed:
```python
def repeated_enn(X, y, k=3, max_iterations=100, distance_metric='euclidean'):
    """
    Repeated ENN - Iterates Wilson's editing until convergence.

    Removes points that become inconsistent after earlier removals,
    achieving more thorough noise elimination at cost of more data loss.
    """
    import numpy as np

    current_X = X.copy()
    current_y = y.copy()
    original_indices = np.arange(len(X))

    iteration = 0
    total_removed = 0

    while iteration < max_iterations:
        iteration += 1
        n_before = len(current_y)

        # Apply one round of ENN
        retained_mask, _ = edited_nearest_neighbors_mask(
            current_X, current_y, k=k, distance_metric=distance_metric
        )

        current_X = current_X[retained_mask]
        current_y = current_y[retained_mask]
        original_indices = original_indices[retained_mask]

        n_removed_this_round = n_before - len(current_y)
        total_removed += n_removed_this_round

        print(f"  Iteration {iteration}: removed {n_removed_this_round}, remaining {len(current_y)}")

        if n_removed_this_round == 0:
            print(f"RENN converged after {iteration} iterations")
            break

    print(f"RENN total: {len(y)} → {len(current_y)} ({total_removed} removed)")
    return original_indices.tolist()


def edited_nearest_neighbors_mask(X, y, k=3, distance_metric='euclidean'):
    """Returns boolean mask of points to retain"""
    import numpy as np
    from scipy.spatial.distance import cdist

    distances = cdist(X, X, metric=distance_metric)
    np.fill_diagonal(distances, np.inf)
    knn_indices = np.argsort(distances, axis=1)[:, :k]

    retained_mask = np.ones(len(X), dtype=bool)

    for i in range(len(X)):
        neighbor_labels = y[knn_indices[i]]
        unique, counts = np.unique(neighbor_labels, return_counts=True)
        majority = unique[np.argmax(counts)]
        if majority != y[i]:
            retained_mask[i] = False

    return retained_mask, ~retained_mask
```

Tomek proposed a stricter editing rule: remove a point if it is misclassified by any k-NN classifier for k = 1, 2, ..., K_max.
This is equivalent to requiring that a point's nearest neighbor must share its label, AND its 2 nearest neighbors must form a majority with its label, AND its 3 nearest neighbors must form a majority, etc.
All-k-NN is substantially more aggressive than standard ENN and is typically used when noise is severe:
```python
def all_knn_editing(X, y, k_max=5, distance_metric='euclidean'):
    """
    Tomek's All-k-NN Editing Rule

    Removes a point if it's misclassified by k-NN for ANY k from 1 to k_max.
    Very aggressive - removes points that fail at any neighborhood scale.
    """
    import numpy as np
    from scipy.spatial.distance import cdist

    n_samples = len(X)
    distances = cdist(X, X, metric=distance_metric)
    np.fill_diagonal(distances, np.inf)

    # Get k_max nearest neighbors for each point
    knn_indices = np.argsort(distances, axis=1)[:, :k_max]

    retained_indices = []

    for i in range(n_samples):
        is_consistent_all_k = True

        # Check consistency for each k from 1 to k_max
        for k in range(1, k_max + 1):
            neighbor_labels = y[knn_indices[i, :k]]
            unique, counts = np.unique(neighbor_labels, return_counts=True)
            majority = unique[np.argmax(counts)]
            if majority != y[i]:
                is_consistent_all_k = False
                break

        if is_consistent_all_k:
            retained_indices.append(i)

    print(f"All-k-NN (k_max={k_max}): {len(retained_indices)}/{n_samples} retained")
    return retained_indices
```

| Variant | Removal Criterion | Aggressiveness | Use Case |
|---|---|---|---|
| ENN (Wilson) | Majority of k neighbors disagree | Moderate | Standard noise reduction |
| RENN (Repeated) | ENN, iterated to convergence | High | Deep noise cleaning |
| All-k-NN (Tomek) | Inconsistent for ANY k ≤ k_max | Very high | Severe noise, wide boundaries |
| RENN + All-k-NN | All-k-NN, repeated | Extreme | Maximum cleaning (use cautiously) |
Aggressive editing variants (RENN, All-k-NN) can disproportionately remove points from minority classes. A small cluster of the minority class surrounded by the majority will have its points voted out. Always check class distribution after editing and consider class-aware variants if imbalance is present.
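A minimal sketch of that post-editing check (the 50% threshold is an arbitrary illustrative cut-off):

```python
import numpy as np

def report_class_balance(y, retained_indices):
    """Compare class counts before and after editing; flag any class
    that loses a disproportionate share of its points."""
    before = dict(zip(*np.unique(y, return_counts=True)))
    after = dict(zip(*np.unique(y[retained_indices], return_counts=True)))
    for cls, n_before in before.items():
        n_after = after.get(cls, 0)
        kept = 100 * n_after / n_before
        flag = "  <-- check!" if kept < 50 else ""
        print(f"class {cls}: {n_before} -> {n_after} ({kept:.0f}% kept){flag}")

# Usage with any of the editing functions above:
# retained, _ = edited_nearest_neighbors(X, y, k=3)
# report_class_balance(y, retained)
```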
CNN and ENN address different problems: CNN reduces storage by removing redundant interior points, while ENN improves quality by removing noisy points. Combining them yields a powerful data preprocessing pipeline.
The recommended order is ENN first, then CNN.

Why this order? Cleaning before condensing means noisy points never get the chance to pose as boundary-defining instances, so CNN condenses a faithful picture of the true class regions:
```python
def enn_cnn_pipeline(X, y, enn_k=3, distance_metric='euclidean'):
    """
    Complete ENN → CNN data reduction pipeline.

    Step 1: Clean noise with ENN
    Step 2: Condense clean data with CNN
    Step 3: Optionally, apply RNN for further reduction

    Returns the indices of the final selected subset.
    """
    import numpy as np

    print("=" * 50)
    print("ENN-CNN Pipeline")
    print("=" * 50)

    n_original = len(X)

    # Step 1: ENN - Remove noise
    print("Step 1: Edited Nearest Neighbors (noise removal)")
    enn_retained, enn_removed = edited_nearest_neighbors(
        X, y, k=enn_k, distance_metric=distance_metric
    )
    X_clean = X[enn_retained]
    y_clean = y[enn_retained]
    indices_after_enn = np.array(enn_retained)

    print(f"After ENN: {len(X_clean)}/{n_original} points retained")

    # Step 2: CNN - Condense to consistent subset
    print("Step 2: Condensed Nearest Neighbors (data reduction)")
    cnn_retained = condensed_nearest_neighbors(
        X_clean, y_clean, distance_metric=distance_metric
    )
    X_condensed = X_clean[cnn_retained]
    y_condensed = y_clean[cnn_retained]
    indices_after_cnn = indices_after_enn[cnn_retained]

    print(f"After CNN: {len(X_condensed)}/{len(X_clean)} points retained")

    # Summary
    print("\n" + "=" * 50)
    print("Pipeline Summary:")
    print(f"  Original:  {n_original} points")
    print(f"  After ENN: {len(X_clean)} points ({len(enn_removed)} removed as noise)")
    print(f"  After CNN: {len(X_condensed)} points (final)")
    print(f"  Total reduction: {100 * (1 - len(X_condensed)/n_original):.1f}%")
    print("=" * 50)

    return indices_after_cnn.tolist()


# Helper: Include the CNN function
def condensed_nearest_neighbors(X, y, distance_metric='euclidean'):
    import numpy as np
    from scipy.spatial.distance import cdist

    n_samples = len(X)
    classes = np.unique(y)

    store_indices = []
    for c in classes:
        class_indices = np.where(y == c)[0]
        store_indices.append(class_indices[0])

    store_indices = set(store_indices)
    grabbag_indices = set(range(n_samples)) - store_indices

    changed = True
    while changed:
        changed = False
        store_list = list(store_indices)
        grabbag_list = list(grabbag_indices)

        if len(grabbag_list) == 0:
            break

        X_store = X[store_list]
        X_grabbag = X[grabbag_list]

        distances = cdist(X_grabbag, X_store, metric=distance_metric)
        nearest_store_idx = np.argmin(distances, axis=1)
        predicted = y[np.array(store_list)[nearest_store_idx]]
        true_labels = y[grabbag_list]

        misclassified = predicted != true_labels
        if np.any(misclassified):
            changed = True
            for idx in np.array(grabbag_list)[misclassified]:
                store_indices.add(idx)
                grabbag_indices.remove(idx)

    return list(store_indices)
```

What happens if we reverse the order?
This ordering is problematic: noisy points become 'critical' border points in the condensed set, and removing them afterward with ENN destroys the consistency property CNN worked to establish, often leaving a subset that performs worse than doing nothing.

The only scenario where CNN-first makes sense is when you're certain the data is already clean and you want to apply ENN as a post-hoc safeguard against any remaining edge cases.

ENN → CNN: Clean first, then condense. Noise doesn't contaminate the condensed set, CNN works on high-quality data and finds the true boundaries, and the final set is both small and clean.

CNN → ENN: Condense first, noise included. Noise becomes critical border points, ENN removal destroys consistency, and the final set may perform worse than doing nothing.
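One rough way to compare the two orderings empirically is sketched below; it assumes the `edited_nearest_neighbors` and `condensed_nearest_neighbors` functions defined earlier, and the synthetic data and noise level are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data; inject ~10% label noise into the training split only
X, y = make_blobs(n_samples=600, centers=2, cluster_std=2.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rng = np.random.default_rng(0)
flip = rng.choice(len(y_tr), size=len(y_tr) // 10, replace=False)
y_tr = y_tr.copy()
y_tr[flip] = 1 - y_tr[flip]

def evaluate(indices):
    """1-NN accuracy on the clean test split using only the selected prototypes."""
    knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr[indices], y_tr[indices])
    return knn.score(X_te, y_te), len(indices)

# ENN -> CNN: clean, then condense
enn_idx, _ = edited_nearest_neighbors(X_tr, y_tr, k=3)
cnn_of_clean = condensed_nearest_neighbors(X_tr[enn_idx], y_tr[enn_idx])
order_a = np.array(enn_idx)[cnn_of_clean]

# CNN -> ENN: condense (noise included), then edit
cnn_idx = condensed_nearest_neighbors(X_tr, y_tr)
enn_of_condensed, _ = edited_nearest_neighbors(X_tr[cnn_idx], y_tr[cnn_idx], k=3)
order_b = np.array(cnn_idx)[enn_of_condensed]

for name, idx in [("ENN -> CNN", order_a), ("CNN -> ENN", order_b)]:
    acc, size = evaluate(idx)
    print(f"{name}: {size} prototypes, test accuracy {acc:.3f}")
```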
Understanding ENN's computational profile is essential for applying it to real-world datasets.
Standard ENN (Wilson):
The algorithm has three phases: computing the pairwise distance matrix ($O(n^2 \cdot d)$), sorting each row to find the k nearest neighbors ($O(n^2 \log n)$), and checking label consistency for every point ($O(n \cdot k)$).

Dominant term: $O(n^2 \cdot d)$ for the distance matrix; the sorting term $O(n^2 \log n)$ takes over only when $d < \log n$.
Repeated ENN: Multiplies by number of iterations. In worst case (one point removed per iteration), this becomes $O(n^3 \cdot d)$.
With Spatial Data Structures:
For low-to-moderate dimensionality ($d < 20$), k-d trees or ball trees can accelerate neighbor search: building the index costs roughly $O(n \log n)$, and each k-neighbor query costs roughly $O(k \log n)$, giving about $O(n \cdot k \log n)$ overall.

This is vastly better than $O(n^2)$, but only works in low dimensions due to the curse of dimensionality.
| Method | Time (naive) | Time (w/ spatial index) | Space |
|---|---|---|---|
| ENN (single pass) | O(n² · d) | O(n · k log n) | O(n²) or O(n · d) |
| RENN (repeated) | O(iter · n² · d) | O(iter · n · k log n) | O(n²) or O(n · d) |
| All-k-NN | O(k_max · n² · d) | O(n · k_max log n) | O(n²) or O(n · d) |
For datasets up to 50K points: Use the naive distance matrix with NumPy/SciPy. It fits in memory on most servers (50K × 50K × 4 bytes = 10 GB in float32; roughly double that for the float64 matrices cdist returns by default).
For datasets 50K-500K points: Use spatial indices (k-d tree, ball tree) if dimensionality permits. Use chunked distance computation otherwise.
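A minimal sketch of the chunked option (the chunk size is an illustrative choice; the consistency test follows the strict-majority rule from the definition above):

```python
import numpy as np
from scipy.spatial.distance import cdist

def enn_chunked(X, y, k=3, chunk_size=2000, distance_metric='euclidean'):
    """Wilson's editing without materializing the full n x n distance matrix:
    process query points in chunks, keeping only each chunk's k neighbors."""
    n = len(X)
    retained = []
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        d = cdist(X[start:stop], X, metric=distance_metric)
        # Exclude self-distances for the rows in this chunk
        d[np.arange(stop - start), np.arange(start, stop)] = np.inf
        # k smallest per row without a full sort
        nn = np.argpartition(d, k, axis=1)[:, :k]
        for row, i in enumerate(range(start, stop)):
            # Keep the point only if a strict majority of neighbors agree
            if np.sum(y[nn[row]] == y[i]) > k / 2:
                retained.append(i)
    return retained
```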
```python
def scalable_enn(X, y, k=3, method='auto'):
    """
    Scalable ENN implementation using appropriate method for data size.

    Parameters:
    -----------
    X : array of shape (n_samples, n_features)
    y : array of shape (n_samples,)
    k : int, number of neighbors
    method : 'brute', 'ball_tree', 'kd_tree', or 'auto'
    """
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    n_samples, n_features = X.shape

    # Choose method based on dataset characteristics
    if method == 'auto':
        if n_features > 20:
            method = 'brute'        # Spatial structures fail in high dims
        elif n_samples < 10000:
            method = 'brute'        # Small data, overhead not worth it
        else:
            method = 'ball_tree'    # Larger data, moderate dims

    print(f"Using {method} for ENN with n={n_samples}, d={n_features}")

    # Build neighbor index (k+1 because we exclude self)
    nn = NearestNeighbors(n_neighbors=k+1, algorithm=method)
    nn.fit(X)

    # Query all neighbors at once
    distances, indices = nn.kneighbors(X)

    # Remove self from neighbors (first column)
    neighbor_indices = indices[:, 1:k+1]  # Exclude self

    # Check consistency
    retained = []
    for i in range(n_samples):
        neighbor_labels = y[neighbor_indices[i]]
        unique, counts = np.unique(neighbor_labels, return_counts=True)
        majority = unique[np.argmax(counts)]
        if majority == y[i]:
            retained.append(i)

    return retained
```

For datasets exceeding 1 million points, consider sampling-based approaches: (1) Apply ENN to a stratified sample, (2) Train a noise classifier on the sample, (3) Apply the classifier to the full dataset. This approximate approach scales linearly but requires careful validation.
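A hedged sketch of that sampling-based idea follows; the sample size, the use of a random forest as the noise model, the appended numeric label feature, and the 0.5 threshold are all illustrative assumptions, and it presumes the dataset is much larger than the sample.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def approximate_enn_large(X, y, k=3, sample_size=50_000, random_state=0):
    """Approximate ENN for very large datasets:
    1. Run exact ENN on a stratified sample.
    2. Train a classifier to recognize 'noise' vs 'keep' points.
    3. Score the full dataset and drop likely-noisy points.
    Assumes len(X) >> sample_size and numeric class labels."""
    X_s, _, y_s, _ = train_test_split(
        X, y, train_size=sample_size, stratify=y, random_state=random_state
    )

    # Exact ENN on the sample (function defined earlier on this page)
    retained, removed = edited_nearest_neighbors(X_s, y_s, k=k)
    if len(removed) == 0:
        return np.arange(len(X))  # sample looks clean; keep everything

    is_noise = np.zeros(len(X_s), dtype=int)
    is_noise[removed] = 1

    # The class label is appended as a feature so the model can condition on it
    noise_model = RandomForestClassifier(n_estimators=100, random_state=random_state)
    noise_model.fit(np.column_stack([X_s, y_s]), is_noise)
    noise_prob = noise_model.predict_proba(np.column_stack([X, y]))[:, 1]

    keep_mask = noise_prob < 0.5
    print(f"Flagged {int(np.sum(~keep_mask))} of {len(X)} points as likely noise")
    return np.where(keep_mask)[0]
```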
ENN is a powerful tool, but it's not universally beneficial. Understanding when it helps—and when it harms—is critical for effective application.
Use this structured approach to decide whether ENN is appropriate:
Step 1: Assess Label Quality. ENN pays off when labels are suspected to be noisy; if labels have been carefully verified, there is little noise to remove and editing mostly risks discarding legitimate points.

Step 2: Examine Class Geometry. Heavily overlapping classes naturally produce mixed neighborhoods, so many 'inconsistent' points are genuine; prefer a larger k or consider skipping editing.

Step 3: Check Class Balance. Editing can disproportionately remove minority-class points; compare class distributions before and after, and use conservative settings when imbalance is present.

Step 4: Evaluate Dataset Size. The naive $O(n^2 \cdot d)$ distance computation limits ENN on large datasets; plan for spatial indices, chunking, or sampling-based approximations.

Step 5: Validate Empirically. Compare k-NN accuracy with and without editing using cross-validation before committing to the edited training set.
Although ENN was designed for KNN, cleaned data often benefits other algorithms too. Neural networks, SVMs, and ensemble methods can all learn better from noise-reduced training sets. However, tree-based methods like Random Forests are inherently noise-resistant and may not need ENN preprocessing.
We've explored Edited Nearest Neighbors from theoretical foundations through practical application: the motivation for neighborhood-based editing, Wilson's algorithm, the RENN and All-k-NN variants, the ENN-CNN pipeline, computational scaling strategies, and a decision framework for when editing is appropriate.
CNN and ENN address instance selection—deciding which training points to keep. But what about improving how we measure similarity?
The next page explores Large Margin K-Nearest Neighbors (LMNN), which takes a fundamentally different approach: instead of selecting instances, it learns a distance metric that pulls same-class points closer and pushes different-class points apart. This metric learning approach can dramatically improve KNN performance without discarding any training data.
You now understand Edited Nearest Neighbors comprehensively—its theoretical basis, algorithm, variants, and practical considerations. You can apply ENN appropriately, select the right k value, combine it with CNN, and avoid common pitfalls. Next, we examine Large Margin KNN for learning optimal distance metrics.