The previous page examined Condensed Nearest Neighbors, which reduces storage by retaining only boundary-defining instances. But condensation has a blind spot: it treats all training labels as ground truth. When labels are noisy—as they invariably are in real-world data—CNN faithfully preserves those errors.
Consider the implications: a single mislabeled point in the interior of Class A, incorrectly labeled as Class B, becomes a "critical boundary point" from CNN's perspective. It creates an artificial decision boundary that shouldn't exist. Not only does CNN retain this point, but it may retain additional points to "defend" the correct Class A label nearby.
Noisy labels are the enemy of instance-based learning.
K-Nearest Neighbors is particularly susceptible to label noise because it makes predictions based on local neighborhoods. A mislabeled point in the wrong region doesn't just affect its own Voronoi cell—it corrupts the voting for all nearby test points. The effect is pronounced for small k, where a single noisy neighbor can flip the prediction, and troublesome for large k, where noisy points in transitional regions create systematic bias.
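A quick way to see this sensitivity is a toy experiment (the coordinates and the single flipped label below are illustrative, not from the text): a query deep inside Class 0 territory is flipped by one mislabeled neighbor when k=1, but recovered when k=5.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two tight clusters; one interior point of class 0 carries a flipped label
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2],   # class 0 region
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1], [5.1, 5.2]])  # class 1 region
y = np.array([0, 0, 0, 1,   # the point at (1.1, 1.2) is mislabeled as class 1
              1, 1, 1, 1])

query = np.array([[1.05, 1.15]])  # test point deep inside class 0 territory

for k in (1, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(f"k={k}: predicted class {knn.predict(query)[0]}")

# With k=1 the mislabeled neighbor flips the prediction to class 1;
# with k=5 the correct neighbors outvote it and class 0 is recovered.
```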
Edited Nearest Neighbors (ENN) addresses this vulnerability by identifying and removing instances that appear inconsistent with their neighbors. Rather than asking "which points do we need for classification?" (the CNN question), ENN asks "which points harm classification?"
By completing this page, you will understand the theoretical motivation for neighborhood-based editing, master the ENN algorithm and its variants, learn to combine ENN with CNN for comprehensive data reduction, analyze the trade-offs between aggressive and conservative editing, and know when noise removal is appropriate versus risky.
The foundation of ENN rests on a simple but powerful principle: if a point is misclassified by its neighbors, something is wrong—either the point is noisy, or it's a genuine outlier that harms generalization. Either way, removing it may improve performance.
Let $(\mathbf{x}_i, y_i)$ be a training instance. Consider its k-nearest neighbors in the training set (excluding itself):
$$N_k(\mathbf{x}_i) = \{\mathbf{x}_j : \mathbf{x}_j \in \text{kNN}(\mathbf{x}_i)\}$$
Definition (Wilson's Editing Rule): An instance $(\mathbf{x}_i, y_i)$ is consistent if the majority of its k neighbors share its label:
$$\text{Consistent}(\mathbf{x}_i) = \begin{cases} \text{True} & \text{if } |\{j : \mathbf{x}_j \in N_k(\mathbf{x}_i),\ y_j = y_i\}| > \frac{k}{2} \\ \text{False} & \text{otherwise} \end{cases}$$
An instance is inconsistent if the majority of its neighbors disagree with its label. ENN removes all inconsistent instances.
The majority criterion isn't arbitrary—it connects directly to the asymptotic behavior of k-NN:
Theorem (Cover & Hart, 1967): As the training set size $n \to \infty$ and $k \to \infty$ with $k/n \to 0$, the error rate of k-NN converges to at most twice the Bayes error rate.
For k-NN to achieve this optimal behavior, training points should reflect the true conditional class probabilities $P(Y|X)$. A point with label $y$ in a region where $P(Y=y|X) < 0.5$ violates this assumption—it's inconsistent with the local class probability distribution.
By removing points that are minority-labeled in their neighborhoods, ENN nudges the training set toward better alignment with the underlying conditional distributions.
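One way to read the majority rule, sketched below under the assumption that `X` and `y` are NumPy arrays, is as a local estimate of $P(Y=y_i|\mathbf{x}_i)$: the fraction of a point's k neighbors that share its label. Wilson's rule, as defined above, keeps exactly the points whose estimate exceeds 0.5.

```python
import numpy as np
from scipy.spatial.distance import cdist

def local_agreement(X, y, k=3):
    """For each point, estimate P(Y = y_i | x_i) as the fraction of its
    k nearest neighbors (self excluded) that share its label."""
    distances = cdist(X, X)
    np.fill_diagonal(distances, np.inf)          # never count a point as its own neighbor
    knn_indices = np.argsort(distances, axis=1)[:, :k]
    return (y[knn_indices] == y[:, None]).mean(axis=1)

# Wilson's rule keeps a point exactly when this local estimate exceeds 1/2:
# keep_mask = local_agreement(X, y, k=3) > 0.5
```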
| Type | Description | True Label | Should Remove? |
|---|---|---|---|
| Label Noise | Correctly positioned but mislabeled during data collection | Probably wrong | Yes - harmful to KNN |
| Class Overlap Boundary | Near true decision boundary; neighbors naturally mixed | Correct | Maybe - depends on k |
| Outlier | Far from class bulk; genuinely belongs to unusual region | Correct | Often - unusual for generalization |
| Rare Subclass | Part of legitimate small cluster within a class | Correct | No - represents real pattern |
ENN cannot distinguish between label noise and legitimate rare instances. Points near true decision boundaries or belonging to small valid clusters may also appear 'inconsistent.' Removing them erases genuine patterns. The choice of k and editing aggressiveness must balance noise removal against information loss.
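To make the rare-subclass risk concrete, here is a small hypothetical example (the cluster positions and sizes are invented for illustration): a tight but legitimate three-point cluster of class B is entirely edited away once k exceeds its size.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Class A: a broad cluster; class B: a tiny but genuine sub-cluster of 3 points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(60, 2)),
               rng.normal([0.5, 0.5], 0.05, size=(3, 2))])   # rare subclass
y = np.array(["A"] * 60 + ["B"] * 3)

# Wilson's rule with k=5: each B point has at most 2 B neighbors,
# so the majority always votes A and the whole subclass is edited away.
k = 5
d = cdist(X, X)
np.fill_diagonal(d, np.inf)
nn = np.argsort(d, axis=1)[:, :k]
keep = np.array([(y[nn[i]] == y[i]).sum() > k / 2 for i in range(len(X))])
print("Class B points kept:", int(np.sum(y[keep] == "B")))   # expect 0
```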
The original Edited Nearest Neighbors algorithm was introduced by Dennis Wilson in 1972. It's remarkably simple yet effective.
Wilson's editing operates in a single pass through the data:
```python
def edited_nearest_neighbors(X, y, k=3, distance_metric='euclidean'):
    """
    Wilson's Edited Nearest Neighbors Algorithm (1972)

    Removes instances that are misclassified by their k neighbors,
    effectively cleaning noise and smoothing decision boundaries.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        Training feature vectors
    y : ndarray of shape (n_samples,)
        Training class labels
    k : int
        Number of neighbors for consistency check (typically 3-5)
    distance_metric : str or callable
        Distance metric for neighbor computation

    Returns:
    --------
    retained_indices : list
        Indices of points retained after editing
    removed_indices : list
        Indices of points removed as inconsistent
    """
    import numpy as np
    from scipy.spatial.distance import cdist

    n_samples = len(X)

    # Compute full pairwise distance matrix
    # For large datasets, use chunked or approximate methods
    distances = cdist(X, X, metric=distance_metric)

    # Set diagonal to infinity (don't count self as neighbor)
    np.fill_diagonal(distances, np.inf)

    # Find k nearest neighbors for each point
    knn_indices = np.argsort(distances, axis=1)[:, :k]

    # Check consistency for each point
    retained_indices = []
    removed_indices = []

    for i in range(n_samples):
        neighbor_labels = y[knn_indices[i]]

        # Majority vote among neighbors
        unique_labels, counts = np.unique(neighbor_labels, return_counts=True)
        majority_label = unique_labels[np.argmax(counts)]

        if majority_label == y[i]:
            retained_indices.append(i)
        else:
            removed_indices.append(i)

    print(f"ENN (k={k}): Retained {len(retained_indices)}, Removed {len(removed_indices)}")
    print(f"Removal rate: {100 * len(removed_indices) / n_samples:.2f}%")

    return retained_indices, removed_indices
```

Consider a small dataset with 8 points (labels A and B) and k=3:
| Point | Coordinates | Label | 3 Nearest Neighbors | Neighbor Labels | Majority | Consistent? |
|---|---|---|---|---|---|---|
| P1 | (1, 1) | A | P2, P3, P4 | A, A, B | A | ✓ Yes |
| P2 | (2, 1) | A | P1, P3, P4 | A, A, B | A | ✓ Yes |
| P3 | (1.5, 2) | A | P4, P1, P2 | B, A, A | A | ✓ Yes |
| P4 | (2, 2.5) | B | P3, P2, P1 | A, A, A | A | ✗ No |
| P5 | (7, 1) | B | P7, P6, P8 | B, B, B | B | ✓ Yes |
| P6 | (8, 2) | B | P5, P7, P8 | B, B, B | B | ✓ Yes |
| P7 | (7.5, 1.5) | B | P5, P6, P8 | B, B, B | B | ✓ Yes |
| P8 | (9, 3) | B | P6, P7, P5 | B, B, B | B | ✓ Yes |
Result: Point P4 is removed. It was labeled B but sits in the middle of A's territory—likely a mislabeled point.
Notice that ENN doesn't just remove noise; it effectively creates a smoother decision boundary by eliminating spurious intrusions of one class into another's region.
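For reference, the worked example can be reproduced with the `edited_nearest_neighbors` function above; this usage sketch just feeds in the coordinates from the table.

```python
import numpy as np

X = np.array([[1, 1], [2, 1], [1.5, 2], [2, 2.5],     # P1-P4
              [7, 1], [8, 2], [7.5, 1.5], [9, 3]])    # P5-P8
y = np.array(["A", "A", "A", "B", "B", "B", "B", "B"])

retained, removed = edited_nearest_neighbors(X, y, k=3)
print("Removed points:", [f"P{i + 1}" for i in removed])   # expect ['P4']
```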
Wilson's original algorithm runs once. But removing inconsistent points changes the neighborhoods of remaining points! Some points that were consistent may become inconsistent (or vice versa). Repeated ENN (covered later) iterates until convergence, often achieving more thorough cleaning at the cost of more data loss.
The parameter k in ENN isn't just an algorithmic detail—it fundamentally controls the trade-off between noise removal and information preservation. Understanding this trade-off is essential for effective application.
With small k, a point needs very few disagreeing neighbors to be marked inconsistent:
k=1: Remove if the single nearest neighbor has a different label.
k=3: Remove if 2+ of 3 neighbors disagree.
| k Value | Removal Criterion | Editing Aggressiveness | Best For |
|---|---|---|---|
| 1 | Nearest neighbor disagrees | Very aggressive | Obvious noise, well-separated classes |
| 3 | ≥2 of 3 neighbors disagree | Moderate | General-purpose noise reduction |
| 5 | ≥3 of 5 neighbors disagree | Conservative | Preserving boundary points |
| 7+ | ≥4+ neighbors disagree | Very conservative | High overlap, valuable rare instances |
With larger k, the consistency criterion becomes more forgiving:
k=5: A point must have 3+ neighbors disagreeing to be removed.
k≥7: More than half of the neighbors (4+ for k=7) must disagree; editing becomes increasingly conservative. The sketch below shows how the removal rate typically falls as k grows.
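To observe this effect, a quick sweep over a noisy synthetic dataset works well. The sketch below assumes the `edited_nearest_neighbors` function defined earlier; the blob parameters and the 10% flip rate are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Two well-separated blobs with moderate spread
X, y = make_blobs(n_samples=500, centers=2, cluster_std=2.0, random_state=42)

# Inject ~10% label noise by flipping 50 randomly chosen labels
rng = np.random.default_rng(42)
noisy = rng.choice(len(y), size=50, replace=False)
y = y.copy()
y[noisy] = 1 - y[noisy]

# Each call prints its retained/removed counts and removal rate;
# the removal rate typically shrinks as k grows.
for k in (1, 3, 5, 7):
    edited_nearest_neighbors(X, y, k=k)
```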
The optimal k depends on several factors: the amount of label noise, the degree of class overlap, and the size of the smallest class. In practice, it can be selected empirically with cross-validation:
```python
def select_k_for_enn(X, y, k_candidates=[1, 3, 5, 7], cv_folds=5):
    """
    Select optimal k for ENN using cross-validation.

    For each candidate k:
    1. Apply ENN with that k
    2. Evaluate k-NN accuracy on held-out data
    3. Choose k that maximizes validation accuracy
    """
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    best_k = None
    best_score = -1
    results = {}

    for k in k_candidates:
        # Apply ENN
        retained_idx, _ = edited_nearest_neighbors(X, y, k=k)
        X_edited = X[retained_idx]
        y_edited = y[retained_idx]

        if len(np.unique(y_edited)) < len(np.unique(y)):
            # Lost entire classes - skip this k
            results[k] = {'score': 0, 'retained': len(retained_idx), 'note': 'lost classes'}
            continue

        # Evaluate with cross-validation
        knn = KNeighborsClassifier(n_neighbors=min(5, len(X_edited) // 2))
        try:
            scores = cross_val_score(knn, X_edited, y_edited, cv=cv_folds)
            mean_score = np.mean(scores)
        except Exception:
            mean_score = 0

        results[k] = {
            'score': mean_score,
            'retained': len(retained_idx),
            'removed_pct': 100 * (len(y) - len(retained_idx)) / len(y)
        }

        if mean_score > best_score:
            best_score = mean_score
            best_k = k

    print("ENN k selection results:")
    for k, res in results.items():
        print(f"  k={k}: accuracy={res['score']:.3f}, retained={res['retained']}")
    print(f"Best k: {best_k} (accuracy={best_score:.3f})")

    return best_k, results
```

In the absence of domain knowledge, k=3 is a widely-used default for ENN. It's aggressive enough to remove clear noise while conservative enough to preserve most boundary points. If subsequent analysis shows too much or too little removal, adjust k accordingly.
Wilson's original ENN has spawned important variants that address its limitations or extend its capabilities. Understanding these variants enables you to select the right tool for each situation.
Standard ENN runs once and stops. But removing points changes the neighborhoods of remaining points. Some previously consistent points may become inconsistent when their noisy neighbors are removed.
Repeated ENN iterates until no more points are removed:
```python
def repeated_enn(X, y, k=3, max_iterations=100, distance_metric='euclidean'):
    """
    Repeated ENN - Iterates Wilson's editing until convergence.

    Removes points that become inconsistent after earlier removals,
    achieving more thorough noise elimination at cost of more data loss.
    """
    import numpy as np

    current_X = X.copy()
    current_y = y.copy()
    original_indices = np.arange(len(X))

    iteration = 0
    total_removed = 0

    while iteration < max_iterations:
        iteration += 1
        n_before = len(current_y)

        # Apply one round of ENN
        retained_mask, _ = edited_nearest_neighbors_mask(
            current_X, current_y, k=k, distance_metric=distance_metric
        )

        current_X = current_X[retained_mask]
        current_y = current_y[retained_mask]
        original_indices = original_indices[retained_mask]

        n_removed_this_round = n_before - len(current_y)
        total_removed += n_removed_this_round

        print(f"  Iteration {iteration}: removed {n_removed_this_round}, remaining {len(current_y)}")

        if n_removed_this_round == 0:
            print(f"RENN converged after {iteration} iterations")
            break

    print(f"RENN total: {len(y)} → {len(current_y)} ({total_removed} removed)")
    return original_indices.tolist()


def edited_nearest_neighbors_mask(X, y, k=3, distance_metric='euclidean'):
    """Returns boolean mask of points to retain"""
    import numpy as np
    from scipy.spatial.distance import cdist

    distances = cdist(X, X, metric=distance_metric)
    np.fill_diagonal(distances, np.inf)
    knn_indices = np.argsort(distances, axis=1)[:, :k]

    retained_mask = np.ones(len(X), dtype=bool)

    for i in range(len(X)):
        neighbor_labels = y[knn_indices[i]]
        unique, counts = np.unique(neighbor_labels, return_counts=True)
        majority = unique[np.argmax(counts)]
        if majority != y[i]:
            retained_mask[i] = False

    return retained_mask, ~retained_mask
```

Tomek proposed a stricter editing rule: remove a point if it is misclassified by any k-NN classifier for k = 1, 2, ..., K_max.
This is equivalent to requiring that a point's nearest neighbor must share its label, AND its 2 nearest neighbors must form a majority with its label, AND its 3 nearest neighbors must form a majority, etc.
All-k-NN is substantially more aggressive than standard ENN and is typically used when noise is severe:
```python
def all_knn_editing(X, y, k_max=5, distance_metric='euclidean'):
    """
    Tomek's All-k-NN Editing Rule

    Removes a point if it's misclassified by k-NN for ANY k from 1 to k_max.
    Very aggressive - removes points that fail at any neighborhood scale.
    """
    import numpy as np
    from scipy.spatial.distance import cdist

    n_samples = len(X)
    distances = cdist(X, X, metric=distance_metric)
    np.fill_diagonal(distances, np.inf)

    # Get k_max nearest neighbors for each point
    knn_indices = np.argsort(distances, axis=1)[:, :k_max]

    retained_indices = []

    for i in range(n_samples):
        is_consistent_all_k = True

        # Check consistency for each k from 1 to k_max
        for k in range(1, k_max + 1):
            neighbor_labels = y[knn_indices[i, :k]]
            unique, counts = np.unique(neighbor_labels, return_counts=True)
            majority = unique[np.argmax(counts)]
            if majority != y[i]:
                is_consistent_all_k = False
                break

        if is_consistent_all_k:
            retained_indices.append(i)

    print(f"All-k-NN (k_max={k_max}): {len(retained_indices)}/{n_samples} retained")
    return retained_indices
```

| Variant | Removal Criterion | Aggressiveness | Use Case |
|---|---|---|---|
| ENN (Wilson) | Majority of k neighbors disagree | Moderate | Standard noise reduction |
| RENN (Repeated) | ENN, iterated to convergence | High | Deep noise cleaning |
| All-k-NN (Tomek) | Inconsistent for ANY k ≤ k_max | Very high | Severe noise, wide boundaries |
| RENN + All-k-NN | All-k-NN, repeated | Extreme | Maximum cleaning (use cautiously) |
Aggressive editing variants (RENN, All-k-NN) can disproportionately remove points from minority classes. A small cluster of the minority class surrounded by the majority will have its points voted out. Always check class distribution after editing and consider class-aware variants if imbalance is present.
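A minimal sketch of that post-editing check (the 50% threshold is an arbitrary illustrative cut-off):

```python
import numpy as np

def report_class_balance(y, retained_indices):
    """Compare class counts before and after editing; flag any class
    that loses a disproportionate share of its points."""
    before = dict(zip(*np.unique(y, return_counts=True)))
    after = dict(zip(*np.unique(y[retained_indices], return_counts=True)))
    for cls, n_before in before.items():
        n_after = after.get(cls, 0)
        kept = 100 * n_after / n_before
        flag = "  <-- check!" if kept < 50 else ""
        print(f"class {cls}: {n_before} -> {n_after} ({kept:.0f}% kept){flag}")

# Usage with any of the editing functions above:
# retained, _ = edited_nearest_neighbors(X, y, k=3)
# report_class_balance(y, retained)
```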
CNN and ENN address different problems: CNN reduces storage by removing redundant interior points, while ENN improves quality by removing noisy points. Combining them yields a powerful data preprocessing pipeline.
The recommended order is ENN first, then CNN.

Why this order? Cleaning before condensing means noisy points never get the chance to pose as boundary-defining instances, so CNN condenses a faithful picture of the true class regions:
```python
def enn_cnn_pipeline(X, y, enn_k=3, distance_metric='euclidean'):
    """
    Complete ENN → CNN data reduction pipeline.

    Step 1: Clean noise with ENN
    Step 2: Condense clean data with CNN
    Step 3: Optionally, apply RNN for further reduction

    Returns the indices of the final selected subset.
    """
    import numpy as np

    print("=" * 50)
    print("ENN-CNN Pipeline")
    print("=" * 50)

    n_original = len(X)

    # Step 1: ENN - Remove noise
    print("Step 1: Edited Nearest Neighbors (noise removal)")
    enn_retained, enn_removed = edited_nearest_neighbors(
        X, y, k=enn_k, distance_metric=distance_metric
    )
    X_clean = X[enn_retained]
    y_clean = y[enn_retained]
    indices_after_enn = np.array(enn_retained)

    print(f"After ENN: {len(X_clean)}/{n_original} points retained")

    # Step 2: CNN - Condense to consistent subset
    print("Step 2: Condensed Nearest Neighbors (data reduction)")
    cnn_retained = condensed_nearest_neighbors(
        X_clean, y_clean, distance_metric=distance_metric
    )
    X_condensed = X_clean[cnn_retained]
    y_condensed = y_clean[cnn_retained]
    indices_after_cnn = indices_after_enn[cnn_retained]

    print(f"After CNN: {len(X_condensed)}/{len(X_clean)} points retained")

    # Summary
    print("\n" + "=" * 50)
    print("Pipeline Summary:")
    print(f"  Original:  {n_original} points")
    print(f"  After ENN: {len(X_clean)} points ({len(enn_removed)} removed as noise)")
    print(f"  After CNN: {len(X_condensed)} points (final)")
    print(f"  Total reduction: {100 * (1 - len(X_condensed)/n_original):.1f}%")
    print("=" * 50)

    return indices_after_cnn.tolist()


# Helper: Include the CNN function
def condensed_nearest_neighbors(X, y, distance_metric='euclidean'):
    import numpy as np
    from scipy.spatial.distance import cdist

    n_samples = len(X)
    classes = np.unique(y)

    store_indices = []
    for c in classes:
        class_indices = np.where(y == c)[0]
        store_indices.append(class_indices[0])

    store_indices = set(store_indices)
    grabbag_indices = set(range(n_samples)) - store_indices

    changed = True
    while changed:
        changed = False
        store_list = list(store_indices)
        grabbag_list = list(grabbag_indices)

        if len(grabbag_list) == 0:
            break

        X_store = X[store_list]
        X_grabbag = X[grabbag_list]

        distances = cdist(X_grabbag, X_store, metric=distance_metric)
        nearest_store_idx = np.argmin(distances, axis=1)
        predicted = y[np.array(store_list)[nearest_store_idx]]
        true_labels = y[grabbag_list]

        misclassified = predicted != true_labels
        if np.any(misclassified):
            changed = True
            for idx in np.array(grabbag_list)[misclassified]:
                store_indices.add(idx)
                grabbag_indices.remove(idx)

    return list(store_indices)
```

What happens if we reverse the order?
This ordering is problematic: noisy points become 'critical' border points in the condensed set, and removing them afterward with ENN destroys the consistency property CNN worked to establish, often leaving a subset that performs worse than doing nothing.

The only scenario where CNN-first makes sense is when you're certain the data is already clean and you want to apply ENN as a post-hoc safeguard against any remaining edge cases.

ENN → CNN: Clean first, then condense. Noise doesn't contaminate the condensed set, CNN works on high-quality data and finds the true boundaries, and the final set is both small and clean.

CNN → ENN: Condense first, noise included. Noise becomes critical border points, ENN removal destroys consistency, and the final set may perform worse than doing nothing.
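One rough way to compare the two orderings empirically is sketched below; it assumes the `edited_nearest_neighbors` and `condensed_nearest_neighbors` functions defined earlier, and the synthetic data and noise level are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data; inject ~10% label noise into the training split only
X, y = make_blobs(n_samples=600, centers=2, cluster_std=2.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rng = np.random.default_rng(0)
flip = rng.choice(len(y_tr), size=len(y_tr) // 10, replace=False)
y_tr = y_tr.copy()
y_tr[flip] = 1 - y_tr[flip]

def evaluate(indices):
    """1-NN accuracy on the clean test split using only the selected prototypes."""
    knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr[indices], y_tr[indices])
    return knn.score(X_te, y_te), len(indices)

# ENN -> CNN: clean, then condense
enn_idx, _ = edited_nearest_neighbors(X_tr, y_tr, k=3)
cnn_of_clean = condensed_nearest_neighbors(X_tr[enn_idx], y_tr[enn_idx])
order_a = np.array(enn_idx)[cnn_of_clean]

# CNN -> ENN: condense (noise included), then edit
cnn_idx = condensed_nearest_neighbors(X_tr, y_tr)
enn_of_condensed, _ = edited_nearest_neighbors(X_tr[cnn_idx], y_tr[cnn_idx], k=3)
order_b = np.array(cnn_idx)[enn_of_condensed]

for name, idx in [("ENN -> CNN", order_a), ("CNN -> ENN", order_b)]:
    acc, size = evaluate(idx)
    print(f"{name}: {size} prototypes, test accuracy {acc:.3f}")
```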
Understanding ENN's computational profile is essential for applying it to real-world datasets.
Standard ENN (Wilson):
The algorithm has three phases: computing the pairwise distance matrix ($O(n^2 \cdot d)$), sorting each row to find the k nearest neighbors ($O(n^2 \log n)$), and checking label consistency for every point ($O(n \cdot k)$).

Dominant term: $O(n^2 \cdot d)$ for the distance matrix; the sorting term $O(n^2 \log n)$ takes over only when $d < \log n$.
Repeated ENN: Multiplies by number of iterations. In worst case (one point removed per iteration), this becomes $O(n^3 \cdot d)$.
With Spatial Data Structures:
For low-to-moderate dimensionality ($d < 20$), k-d trees or ball trees can accelerate neighbor search: building the index costs roughly $O(n \log n)$, and each k-neighbor query costs roughly $O(k \log n)$, giving about $O(n \cdot k \log n)$ overall.

This is vastly better than $O(n^2)$, but only works in low dimensions due to the curse of dimensionality.
| Method | Time (naive) | Time (w/ spatial index) | Space |
|---|---|---|---|
| ENN (single pass) | O(n² · d) | O(n · k log n) | O(n²) or O(n · d) |
| RENN (repeated) | O(iter · n² · d) | O(iter · n · k log n) | O(n²) or O(n · d) |
| All-k-NN | O(k_max · n² · d) | O(n · k_max log n) | O(n²) or O(n · d) |
For datasets up to 50K points: Use the naive distance matrix with NumPy/SciPy. It fits in memory on most servers (50K × 50K × 4 bytes = 10 GB in float32; roughly double that for the float64 matrices cdist returns by default).
For datasets 50K-500K points: Use spatial indices (k-d tree, ball tree) if dimensionality permits. Use chunked distance computation otherwise.
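A minimal sketch of the chunked option (the chunk size is an illustrative choice; the consistency test follows the strict-majority rule from the definition above):

```python
import numpy as np
from scipy.spatial.distance import cdist

def enn_chunked(X, y, k=3, chunk_size=2000, distance_metric='euclidean'):
    """Wilson's editing without materializing the full n x n distance matrix:
    process query points in chunks, keeping only each chunk's k neighbors."""
    n = len(X)
    retained = []
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        d = cdist(X[start:stop], X, metric=distance_metric)
        # Exclude self-distances for the rows in this chunk
        d[np.arange(stop - start), np.arange(start, stop)] = np.inf
        # k smallest per row without a full sort
        nn = np.argpartition(d, k, axis=1)[:, :k]
        for row, i in enumerate(range(start, stop)):
            # Keep the point only if a strict majority of neighbors agree
            if np.sum(y[nn[row]] == y[i]) > k / 2:
                retained.append(i)
    return retained
```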
```python
def scalable_enn(X, y, k=3, method='auto'):
    """
    Scalable ENN implementation using appropriate method for data size.

    Parameters:
    -----------
    X : array of shape (n_samples, n_features)
    y : array of shape (n_samples,)
    k : int, number of neighbors
    method : 'brute', 'ball_tree', 'kd_tree', or 'auto'
    """
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    n_samples, n_features = X.shape

    # Choose method based on dataset characteristics
    if method == 'auto':
        if n_features > 20:
            method = 'brute'        # Spatial structures fail in high dims
        elif n_samples < 10000:
            method = 'brute'        # Small data, overhead not worth it
        else:
            method = 'ball_tree'    # Larger data, moderate dims

    print(f"Using {method} for ENN with n={n_samples}, d={n_features}")

    # Build neighbor index (k+1 because we exclude self)
    nn = NearestNeighbors(n_neighbors=k+1, algorithm=method)
    nn.fit(X)

    # Query all neighbors at once
    distances, indices = nn.kneighbors(X)

    # Remove self from neighbors (first column)
    neighbor_indices = indices[:, 1:k+1]  # Exclude self

    # Check consistency
    retained = []
    for i in range(n_samples):
        neighbor_labels = y[neighbor_indices[i]]
        unique, counts = np.unique(neighbor_labels, return_counts=True)
        majority = unique[np.argmax(counts)]
        if majority == y[i]:
            retained.append(i)

    return retained
```

For datasets exceeding 1 million points, consider sampling-based approaches: (1) Apply ENN to a stratified sample, (2) Train a noise classifier on the sample, (3) Apply the classifier to the full dataset. This approximate approach scales linearly but requires careful validation.
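A hedged sketch of that sampling-based idea follows; the sample size, the use of a random forest as the noise model, the appended numeric label feature, and the 0.5 threshold are all illustrative assumptions, and it presumes the dataset is much larger than the sample.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def approximate_enn_large(X, y, k=3, sample_size=50_000, random_state=0):
    """Approximate ENN for very large datasets:
    1. Run exact ENN on a stratified sample.
    2. Train a classifier to recognize 'noise' vs 'keep' points.
    3. Score the full dataset and drop likely-noisy points.
    Assumes len(X) >> sample_size and numeric class labels."""
    X_s, _, y_s, _ = train_test_split(
        X, y, train_size=sample_size, stratify=y, random_state=random_state
    )

    # Exact ENN on the sample (function defined earlier on this page)
    retained, removed = edited_nearest_neighbors(X_s, y_s, k=k)
    if len(removed) == 0:
        return np.arange(len(X))  # sample looks clean; keep everything

    is_noise = np.zeros(len(X_s), dtype=int)
    is_noise[removed] = 1

    # The class label is appended as a feature so the model can condition on it
    noise_model = RandomForestClassifier(n_estimators=100, random_state=random_state)
    noise_model.fit(np.column_stack([X_s, y_s]), is_noise)
    noise_prob = noise_model.predict_proba(np.column_stack([X, y]))[:, 1]

    keep_mask = noise_prob < 0.5
    print(f"Flagged {int(np.sum(~keep_mask))} of {len(X)} points as likely noise")
    return np.where(keep_mask)[0]
```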
ENN is a powerful tool, but it's not universally beneficial. Understanding when it helps—and when it harms—is critical for effective application.
Use this structured approach to decide whether ENN is appropriate:
Step 1: Assess Label Quality. ENN pays off when labels are suspected to be noisy; if labels have been carefully verified, there is little noise to remove and editing mostly risks discarding legitimate points.

Step 2: Examine Class Geometry. Heavily overlapping classes naturally produce mixed neighborhoods, so many 'inconsistent' points are genuine; prefer a larger k or consider skipping editing.

Step 3: Check Class Balance. Editing can disproportionately remove minority-class points; compare class distributions before and after, and use conservative settings when imbalance is present.

Step 4: Evaluate Dataset Size. The naive $O(n^2 \cdot d)$ distance computation limits ENN on large datasets; plan for spatial indices, chunking, or sampling-based approximations.

Step 5: Validate Empirically. Compare k-NN accuracy with and without editing using cross-validation before committing to the edited training set.
Although ENN was designed for KNN, cleaned data often benefits other algorithms too. Neural networks, SVMs, and ensemble methods can all learn better from noise-reduced training sets. However, tree-based methods like Random Forests are inherently noise-resistant and may not need ENN preprocessing.
We've explored Edited Nearest Neighbors from theoretical foundations through practical application: the motivation for neighborhood-based editing, Wilson's algorithm, the RENN and All-k-NN variants, the ENN-CNN pipeline, computational scaling strategies, and a decision framework for when editing is appropriate.
CNN and ENN address instance selection—deciding which training points to keep. But what about improving how we measure similarity?
The next page explores Large Margin K-Nearest Neighbors (LMNN), which takes a fundamentally different approach: instead of selecting instances, it learns a distance metric that pulls same-class points closer and pushes different-class points apart. This metric learning approach can dramatically improve KNN performance without discarding any training data.
You now understand Edited Nearest Neighbors comprehensively—its theoretical basis, algorithm, variants, and practical considerations. You can apply ENN appropriately, select the right k value, combine it with CNN, and avoid common pitfalls. Next, we examine Large Margin KNN for learning optimal distance metrics.