Throughout this module, we've explored individual optimizations for k-NN: selecting instances (CNN, ENN), learning metrics (LMNN, deep metric learning), and adapting neighborhoods. Each technique improves a specific aspect of k-NN performance.
But what if we could combine multiple k-NN classifiers—each capturing different aspects of the data—to achieve robustness and accuracy that no single classifier can match?
Ensemble methods have revolutionized machine learning. Random Forests combine thousands of decision trees to achieve state-of-the-art performance on many tasks. Gradient boosting combines weak learners into powerful predictors. The same principles apply to k-NN.
Why ensemble k-NN?
Reduce variance: A single k-NN is sensitive to training data, choice of k, and distance metric. Averaging over multiple classifiers smooths these sensitivities.
Improve robustness: Different classifiers make different errors. Combining them through voting reduces the impact of any single classifier's mistakes.
Handle high-dimensionality: In high dimensions, using all features often hurts. Combining k-NNs on different feature subsets can outperform the full-space classifier.
Leverage diversity: Different distance metrics capture different notions of similarity. Combining them yields a richer similarity measure.
This page explores the principles and techniques of k-NN ensembles—from simple bagging to sophisticated adaptive combinations.
By completing this page, you will understand why ensembles improve k-NN, master random subspace methods for high-dimensional data, learn to ensemble different metrics and k values, implement weighted voting strategies, and design hybrid KNN-based systems combining instance-based and model-based approaches.
Before diving into k-NN-specific techniques, let's review the fundamental principles that make ensembles work.
Consider $M$ independent classifiers, each with accuracy $p > 0.5$. The ensemble (majority vote) is correct when more than $M/2$ classifiers are correct. By the Chernoff bound:
$$P(\text{ensemble wrong}) \leq \exp\left(-2M\left(p - \frac{1}{2}\right)^2\right)$$
As $M \to \infty$, ensemble error → 0 exponentially fast!
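The bound is easy to evaluate numerically. The short sketch below is a standalone illustration (the base accuracy p = 0.6 is chosen arbitrarily) showing how quickly it shrinks as M grows.

```python
import numpy as np

def ensemble_error_bound(M, p):
    """Chernoff/Hoeffding-style bound on majority-vote error for M
    independent classifiers, each correct with probability p > 0.5."""
    return np.exp(-2 * M * (p - 0.5) ** 2)

# Individually weak voters (p = 0.6) still drive the bound down fast.
for M in (11, 25, 51, 101):
    print(f"M = {M:3d}  ->  P(ensemble wrong) <= {ensemble_error_bound(M, 0.6):.4f}")
```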
Key requirement: Classifiers must be diverse (make different errors). If all classifiers make the same mistakes, the ensemble doesn't help.
k-NN ensembles create diversity through random feature subsets, subsampled training instances, different values of k, and different distance metrics.
The expected prediction error decomposes as:
$$\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}$$
k-NN's characteristics: small k gives low bias but high variance (the prediction can flip on a single noisy neighbor), large k lowers variance but raises bias by smoothing over class boundaries, and the noise term is irreducible.
Ensembles primarily reduce variance. Averaging multiple k-NN classifiers with small k can achieve low bias (from small k) AND low variance (from averaging).
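A quick experiment makes this concrete. The sketch below (synthetic data and arbitrary constants) compares a single 1-NN with an average of fifty 1-NNs, each trained on a random half of the training set; whether and by how much the averaged version wins depends on the data, but on noisy problems it usually does.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data; any (X, y) classification set works the same way.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# One 1-NN: low bias, high variance.
single = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
print("single 1-NN:", single.score(X_te, y_te))

# Average 50 1-NNs, each fit on a random half of the training set.
rng = np.random.default_rng(0)
votes = np.zeros((len(X_te), len(np.unique(y))))
for _ in range(50):
    idx = rng.choice(len(X_tr), size=len(X_tr) // 2, replace=False)
    member = KNeighborsClassifier(n_neighbors=1).fit(X_tr[idx], y_tr[idx])
    votes += member.predict_proba(X_te)   # class order is [0, 1] for every member
print("averaged 1-NNs:", (votes.argmax(axis=1) == y_te).mean())
```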
For regression, if each of the $M$ base estimators has expected squared error $\epsilon$ and the errors have pairwise correlation $\rho$, then the ensemble average satisfies:
$$\text{Error}_{\text{ensemble}} \leq \rho \cdot \epsilon + \frac{(1-\rho) \cdot \epsilon}{M}$$
As $M \to \infty$, error approaches $\rho \cdot \epsilon$. Lower correlation → lower asymptotic error → more benefit from diversity.
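Plugging numbers into this expression shows why correlation, not ensemble size, becomes the limiting factor. The following is a standalone worked example with $\epsilon$ normalized to 1.

```python
import numpy as np

def ensemble_error(rho, eps, M):
    """Evaluate rho*eps + (1 - rho)*eps/M from the decomposition above."""
    return rho * eps + (1 - rho) * eps / M

eps = 1.0  # base-estimator error; only ratios matter here
for rho in (0.0, 0.3, 0.7):
    bounds = [ensemble_error(rho, eps, M) for M in (1, 10, 100, 1000)]
    print(f"rho = {rho}:", " -> ".join(f"{b:.3f}" for b in bounds))
```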
The most common mistake in building KNN ensembles is creating classifiers that are too similar. If all classifiers use the same features, same k, and same metric with slightly different data, they're highly correlated and the ensemble provides little benefit. Maximize diversity across dimensions: features, metrics, AND parameters.
The Random Subspace Method (RSM), introduced by Ho (1998), is particularly effective for k-NN in high-dimensional spaces. Each classifier operates on a random subset of features, avoiding the curse of dimensionality while capturing different aspects of the data.
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.base import BaseEstimator, ClassifierMixin
from collections import Counter


class RandomSubspaceKNN(BaseEstimator, ClassifierMixin):
    """
    Random Subspace Method for K-Nearest Neighbors

    Creates an ensemble of KNN classifiers, each operating on
    a random subset of features.

    Parameters:
    -----------
    n_estimators : int
        Number of base KNN classifiers
    n_features : int or float
        Number (int) or fraction (float) of features to use per classifier
    k : int
        Number of neighbors for each base classifier
    voting : str
        'hard' for majority voting, 'soft' for probability averaging
    random_state : int
        Random seed for reproducibility
    """

    def __init__(self, n_estimators=50, n_features=0.5, k=5,
                 voting='hard', random_state=None):
        self.n_estimators = n_estimators
        self.n_features = n_features
        self.k = k
        self.voting = voting
        self.random_state = random_state

    def fit(self, X, y):
        """Fit ensemble of KNN classifiers on random feature subsets"""
        np.random.seed(self.random_state)
        self.n_samples_, self.n_features_total_ = X.shape
        self.classes_ = np.unique(y)

        # Determine number of features per classifier
        if isinstance(self.n_features, float):
            self.n_features_per_ = max(1, int(self.n_features * self.n_features_total_))
        else:
            self.n_features_per_ = min(self.n_features, self.n_features_total_)

        # Create base classifiers
        self.estimators_ = []
        self.feature_subsets_ = []

        for i in range(self.n_estimators):
            # Random feature subset
            feature_idx = np.random.choice(
                self.n_features_total_,
                size=self.n_features_per_,
                replace=False
            )
            self.feature_subsets_.append(feature_idx)

            # Train KNN on subset
            X_subset = X[:, feature_idx]
            knn = KNeighborsClassifier(n_neighbors=self.k)
            knn.fit(X_subset, y)
            self.estimators_.append(knn)

        return self

    def predict(self, X):
        """Predict using majority voting across ensemble"""
        if self.voting == 'soft':
            probas = self.predict_proba(X)
            return self.classes_[np.argmax(probas, axis=1)]
        else:
            # Hard voting
            predictions = np.array([
                est.predict(X[:, self.feature_subsets_[i]])
                for i, est in enumerate(self.estimators_)
            ])  # Shape: (n_estimators, n_samples)

            # Majority vote per sample
            return np.array([
                Counter(predictions[:, j]).most_common(1)[0][0]
                for j in range(X.shape[0])
            ])

    def predict_proba(self, X):
        """Predict class probabilities by averaging base classifier probabilities"""
        probas = np.zeros((X.shape[0], len(self.classes_)))

        for i, est in enumerate(self.estimators_):
            X_subset = X[:, self.feature_subsets_[i]]
            probas += est.predict_proba(X_subset)

        probas /= self.n_estimators
        return probas

    def get_feature_importance(self):
        """
        Estimate feature importance as the fraction of base classifiers
        whose random subset included each feature (a simple occurrence-count proxy).
        """
        importance = np.zeros(self.n_features_total_)

        # Count occurrences of each feature
        for feature_idx in self.feature_subsets_:
            importance[feature_idx] += 1

        # Normalize
        importance /= self.n_estimators
        return importance
```

The subspace dimension $r$ controls the bias-variance trade-off.
Empirical guideline: $r \approx \sqrt{d}$ or $r \approx d/2$ often works well.
For d = 100: try r = 10, 20, 50 and cross-validate.
| Subspace Size (r) | Per-Classifier Variance | Classifier Correlation | Ensemble Benefit |
|---|---|---|---|
| Small (r << d) | High | Low | Large (diverse classifiers) |
| Medium (r ≈ √d) | Moderate | Moderate | Balanced (typical sweet spot) |
| Large (r → d) | Low | High | Small (similar classifiers) |
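The cross-validation loop suggested above takes only a few lines. This sketch assumes the RandomSubspaceKNN class from the earlier listing and a synthetic 100-feature dataset standing in for real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a d = 100 problem; RandomSubspaceKNN is the class
# defined in the listing above.
X, y = make_classification(n_samples=1000, n_features=100, n_informative=20,
                           random_state=0)

for r in (10, 20, 50):
    model = RandomSubspaceKNN(n_estimators=50, n_features=r, k=5, random_state=0)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"r = {r:3d}: CV accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```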
Random Subspace KNN is analogous to Random Forests with trees replaced by k-NN. Both use feature randomization for diversity. The key difference: trees partition space globally, while k-NN partitions locally. RSM-KNN often outperforms Random Forests when local patterns matter more than global structure.
Bagging (Bootstrap Aggregating) creates diversity by training each base classifier on a bootstrap sample of the training data. For k-NN, this creates classifiers with different instance sets.
Challenge: For large k, bagged k-NNs are highly correlated because most neighbors are the same across bootstrap samples.
Subsampling (subbagging) offers an alternative: instead of drawing n samples with replacement, draw m < n samples without replacement, so each classifier sees a genuinely different portion of the data:
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.base import BaseEstimator, ClassifierMixin
from collections import Counter


class BaggingKNN(BaseEstimator, ClassifierMixin):
    """
    Bagging for K-Nearest Neighbors with optional subsampling.

    Parameters:
    -----------
    n_estimators : int
        Number of base classifiers
    sample_fraction : float
        Fraction of samples to use per classifier (subbagging)
    k : int
        Number of neighbors
    bootstrap : bool
        If True, sample with replacement; if False, without replacement
    combine_with_rsm : bool
        If True, also apply random subspace method (feature sampling)
    rsm_fraction : float
        Fraction of features if combine_with_rsm=True
    """

    def __init__(self, n_estimators=50, sample_fraction=0.7, k=5,
                 bootstrap=True, combine_with_rsm=False,
                 rsm_fraction=0.5, random_state=None):
        self.n_estimators = n_estimators
        self.sample_fraction = sample_fraction
        self.k = k
        self.bootstrap = bootstrap
        self.combine_with_rsm = combine_with_rsm
        self.rsm_fraction = rsm_fraction
        self.random_state = random_state

    def fit(self, X, y):
        """Fit bagged KNN ensemble"""
        np.random.seed(self.random_state)
        self.n_samples_, self.n_features_ = X.shape
        self.classes_ = np.unique(y)

        n_subsample = int(self.sample_fraction * self.n_samples_)

        # Effective k (can't exceed subsample size)
        self.k_effective_ = min(self.k, n_subsample - 1)

        self.estimators_ = []
        self.sample_indices_ = []
        self.feature_indices_ = []

        for i in range(self.n_estimators):
            # Sample indices
            if self.bootstrap:
                indices = np.random.choice(self.n_samples_, size=n_subsample, replace=True)
            else:
                indices = np.random.choice(self.n_samples_, size=n_subsample, replace=False)
            self.sample_indices_.append(indices)

            # Feature indices (if RSM enabled)
            if self.combine_with_rsm:
                n_features_use = max(1, int(self.rsm_fraction * self.n_features_))
                feat_indices = np.random.choice(self.n_features_, size=n_features_use, replace=False)
            else:
                feat_indices = np.arange(self.n_features_)
            self.feature_indices_.append(feat_indices)

            # Train KNN
            X_subset = X[np.ix_(indices, feat_indices)]
            y_subset = y[indices]

            knn = KNeighborsClassifier(n_neighbors=self.k_effective_)
            knn.fit(X_subset, y_subset)
            self.estimators_.append(knn)

        return self

    def predict(self, X):
        """Predict via majority voting"""
        predictions = []
        for i, est in enumerate(self.estimators_):
            X_feat = X[:, self.feature_indices_[i]]
            pred = est.predict(X_feat)
            predictions.append(pred)

        predictions = np.array(predictions)  # (n_estimators, n_samples)

        final_pred = []
        for j in range(X.shape[0]):
            votes = predictions[:, j]
            final_pred.append(Counter(votes).most_common(1)[0][0])

        return np.array(final_pred)

    def predict_proba(self, X):
        """Predict probabilities by averaging"""
        probas = np.zeros((X.shape[0], len(self.classes_)))
        for i, est in enumerate(self.estimators_):
            X_feat = X[:, self.feature_indices_[i]]
            probas += est.predict_proba(X_feat)
        return probas / self.n_estimators

    def oob_score(self, X, y):
        """
        Compute out-of-bag score.

        For each sample, aggregate predictions only from classifiers
        that did not include that sample in their training set.
        """
        n_samples = len(y)
        oob_predictions = [[] for _ in range(n_samples)]

        for i, (est, sample_idx, feat_idx) in enumerate(
            zip(self.estimators_, self.sample_indices_, self.feature_indices_)
        ):
            # Find out-of-bag samples
            in_bag = set(sample_idx)
            for j in range(n_samples):
                if j not in in_bag:
                    pred = est.predict(X[j:j+1, feat_idx])[0]
                    oob_predictions[j].append(pred)

        # Compute accuracy on samples with OOB predictions
        correct = 0
        counted = 0
        for j in range(n_samples):
            if len(oob_predictions[j]) > 0:
                majority_pred = Counter(oob_predictions[j]).most_common(1)[0][0]
                if majority_pred == y[j]:
                    correct += 1
                counted += 1

        return correct / counted if counted > 0 else 0.0
```

Combining sample and feature subsampling in this way creates Random Patches (Louppe & Geurts, 2012); the combine_with_rsm option above implements exactly this.
Bagging is most beneficial when the base classifier has high variance (small k), when the training data are noisy, and when sample subsampling is combined with feature subsampling so that neighbor sets genuinely differ across members.
Unlike decision trees, k-NN doesn't benefit as much from pure sample bagging. The reason: k-NN predictions depend on nearest neighbors, which overlap heavily between bootstrap samples. Always combine with feature subsampling (Random Patches) for meaningful diversity.
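A short usage sketch of the BaggingKNN class above with both kinds of subsampling enabled follows; the dataset is synthetic and the fractions shown are illustrative rather than recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data; BaggingKNN is the class defined in the listing above.
X, y = make_classification(n_samples=1500, n_features=40, n_informative=15,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Random Patches: subsample rows (70%) and columns (50%) for every member.
patches = BaggingKNN(n_estimators=50, sample_fraction=0.7, k=5, bootstrap=True,
                     combine_with_rsm=True, rsm_fraction=0.5, random_state=0)
patches.fit(X_tr, y_tr)

print("test accuracy :", patches.score(X_te, y_te))      # via ClassifierMixin
print("OOB estimate  :", patches.oob_score(X_tr, y_tr))  # no separate holdout needed
```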
Different distance metrics capture different notions of similarity. Combining k-NN classifiers with different metrics creates ensemble diversity through semantic diversity rather than sampling diversity.
Consider a d-dimensional feature space. Standard metrics include:
| Metric | Formula | Best For |
|---|---|---|
| Euclidean | √(Σ(xᵢ - yᵢ)²) | General purpose, standardized features |
| Manhattan (L1) | Σ\|xᵢ - yᵢ\| | Sparse features, outlier robustness |
| Chebyshev (L∞) | max\|xᵢ - yᵢ\| | When worst-case difference matters |
| Cosine | 1 - (x·y)/(‖x‖ ‖y‖) | Text, high-dimensional sparse vectors |
| Mahalanobis | √((x-y)ᵀM(x-y)) | Correlated features (requires learned M) |
| Minkowski (Lp) | (Σ\|xᵢ - yᵢ\|ᵖ)^(1/p) | Tunable between L1 and L∞ |
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.preprocessing import StandardScaler, normalize
from collections import Counter


class MultiMetricKNN(BaseEstimator, ClassifierMixin):
    """
    Ensemble of KNN classifiers using different distance metrics.

    Each base classifier uses a different metric, capturing different
    notions of similarity. Predictions are combined via weighted voting.

    Parameters:
    -----------
    metrics : list
        List of metrics to use. Options: 'euclidean', 'manhattan',
        'chebyshev', 'cosine', 'minkowski_p3', 'minkowski_p0.5'
    k : int
        Number of neighbors for all classifiers
    weights : array-like or None
        Weights for each metric's vote. If None, use uniform weights.
    learn_weights : bool
        If True, learn optimal weights from validation performance
    """

    def __init__(self, metrics=None, k=5, weights=None, learn_weights=False):
        self.metrics = metrics or ['euclidean', 'manhattan', 'chebyshev', 'cosine']
        self.k = k
        self.weights = weights
        self.learn_weights = learn_weights

    def fit(self, X, y, X_val=None, y_val=None):
        """
        Fit ensemble of KNN classifiers.

        If learn_weights=True and validation data provided,
        learn optimal weights.
        """
        self.classes_ = np.unique(y)
        self.scaler_ = StandardScaler()
        X_scaled = self.scaler_.fit_transform(X)

        self.estimators_ = []
        self.metric_configs_ = []

        for metric in self.metrics:
            # Parse metric configuration
            if metric.startswith('minkowski_p'):
                p = float(metric.split('_p')[1])
                config = {'metric': 'minkowski', 'p': p}
            elif metric == 'cosine':
                # For cosine, normalize data and use euclidean
                config = {'metric': 'euclidean', 'normalize': True}
            else:
                config = {'metric': metric}

            self.metric_configs_.append(config)

            # Prepare data for this metric
            if config.get('normalize', False):
                X_metric = normalize(X_scaled)
            else:
                X_metric = X_scaled

            # Build classifier
            knn = KNeighborsClassifier(
                n_neighbors=self.k,
                metric=config['metric'],
                p=config.get('p', 2)
            )
            knn.fit(X_metric, y)
            self.estimators_.append(knn)

        # Learn weights if requested
        if self.learn_weights and X_val is not None:
            self.weights_ = self._learn_weights(X_val, y_val)
        elif self.weights is not None:
            self.weights_ = np.array(self.weights)
        else:
            self.weights_ = np.ones(len(self.metrics)) / len(self.metrics)

        # Store scaled training data for later use
        self.X_train_scaled_ = X_scaled
        self.y_train_ = y

        return self

    def _learn_weights(self, X_val, y_val):
        """Learn optimal weights based on validation accuracy"""
        X_val_scaled = self.scaler_.transform(X_val)
        accuracies = []

        for i, (est, config) in enumerate(zip(self.estimators_, self.metric_configs_)):
            if config.get('normalize', False):
                X_val_metric = normalize(X_val_scaled)
            else:
                X_val_metric = X_val_scaled

            acc = est.score(X_val_metric, y_val)
            accuracies.append(acc)

        # Weight proportional to accuracy
        accuracies = np.array(accuracies)
        weights = accuracies / accuracies.sum()

        print("Learned metric weights:")
        for metric, weight, acc in zip(self.metrics, weights, accuracies):
            print(f"  {metric}: weight={weight:.3f} (acc={acc:.3f})")

        return weights

    def _prepare_X(self, X, config):
        """Prepare input for a specific metric configuration"""
        X_scaled = self.scaler_.transform(X)
        if config.get('normalize', False):
            return normalize(X_scaled)
        return X_scaled

    def predict(self, X):
        """Predict via weighted voting"""
        # Collect weighted votes
        vote_matrix = np.zeros((len(X), len(self.classes_)))

        for i, (est, config) in enumerate(zip(self.estimators_, self.metric_configs_)):
            X_metric = self._prepare_X(X, config)
            probas = est.predict_proba(X_metric)
            vote_matrix += self.weights_[i] * probas

        return self.classes_[np.argmax(vote_matrix, axis=1)]

    def predict_proba(self, X):
        """Predict weighted probability averages"""
        probas = np.zeros((len(X), len(self.classes_)))
        for i, (est, config) in enumerate(zip(self.estimators_, self.metric_configs_)):
            X_metric = self._prepare_X(X, config)
            probas += self.weights_[i] * est.predict_proba(X_metric)
        return probas
```

Not all metric combinations are equally useful. Guidelines:
High Diversity (Recommended): combine metrics with genuinely different geometry, such as Euclidean + Manhattan + Chebyshev + cosine, or any fixed metric alongside a learned Mahalanobis metric.
Low Diversity (Less Useful): combine near-duplicates, such as Euclidean together with Minkowski at p ≈ 2, which rank neighbors almost identically and add computation without new information.
For maximum power, include a learned Mahalanobis metric (from LMNN) in the ensemble: it contributes a supervised notion of similarity that complements the fixed metrics above.
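One way to wire this in is to treat the LMNN-transformed space as an extra "metric view" whose probabilities are averaged with the others. The sketch below assumes the metric-learn package (an external dependency not used elsewhere on this page), whose LMNN learns a linear map so that Euclidean distance after the map corresponds to a Mahalanobis distance in the original space.

```python
# Sketch: an extra ensemble member built from an LMNN-learned Mahalanobis metric.
from metric_learn import LMNN
from sklearn.neighbors import KNeighborsClassifier

def fit_lmnn_member(X_train, y_train, k=5):
    """Fit LMNN, then a k-NN in the learned space; returns both objects."""
    lmnn = LMNN()                          # default hyperparameters
    X_emb = lmnn.fit_transform(X_train, y_train)
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_emb, y_train)
    return lmnn, knn

def lmnn_member_proba(lmnn, knn, X):
    """Probabilities to average (or weight) with the other metrics' outputs."""
    return knn.predict_proba(lmnn.transform(X))
```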
For advanced applications, consider region-specific metric weights. In different parts of feature space, different metrics may be optimal. A meta-classifier can learn which metric to trust based on the test point's location.
Stacking (stacked generalization) uses a meta-learner to combine base classifier predictions, learning optimal combination weights from data rather than using fixed voting.
Level 0 (Base Classifiers): multiple k-NN variants, e.g. different values of k and different distance metrics.
Level 1 (Meta-Classifier): a learner that combines the Level 0 predictions, typically trained on their out-of-fold class probabilities.
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.base import BaseEstimator, ClassifierMixin


class StackingKNN(BaseEstimator, ClassifierMixin):
    """
    Stacking ensemble with multiple KNN base classifiers
    and a learned meta-classifier.

    Parameters:
    -----------
    base_configs : list of dicts
        Configuration for each base KNN classifier. Each dict holds
        KNeighborsClassifier keyword arguments such as 'n_neighbors',
        'metric', or 'weights'.
    meta_classifier : estimator
        The Level-1 classifier to combine base predictions.
    cv_folds : int
        Number of CV folds for generating meta-features.
    """

    def __init__(self, base_configs=None, meta_classifier=None, cv_folds=5):
        self.base_configs = base_configs or [
            {'n_neighbors': 1, 'metric': 'euclidean'},
            {'n_neighbors': 3, 'metric': 'euclidean'},
            {'n_neighbors': 5, 'metric': 'euclidean'},
            {'n_neighbors': 5, 'metric': 'manhattan'},
            {'n_neighbors': 5, 'metric': 'chebyshev'},
        ]
        self.meta_classifier = meta_classifier or LogisticRegression(max_iter=1000)
        self.cv_folds = cv_folds

    def fit(self, X, y):
        """
        Fit stacking ensemble.

        Uses cross-validation to generate meta-features for
        training the meta-classifier.
        """
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)
        n_samples = len(X)

        # Create base classifiers
        self.base_classifiers_ = []
        for config in self.base_configs:
            knn = KNeighborsClassifier(**config)
            self.base_classifiers_.append(knn)

        # Generate meta-features via cross-validation
        # Shape: (n_samples, n_base_classifiers * n_classes)
        meta_features = np.zeros((n_samples, len(self.base_classifiers_) * n_classes))

        for i, knn in enumerate(self.base_classifiers_):
            # Cross-val predictions (probabilities)
            cv_probas = cross_val_predict(
                knn, X, y, cv=self.cv_folds, method='predict_proba'
            )
            meta_features[:, i*n_classes:(i+1)*n_classes] = cv_probas

        # Fit meta-classifier on meta-features
        self.meta_classifier.fit(meta_features, y)

        # Refit all base classifiers on full training data
        for knn in self.base_classifiers_:
            knn.fit(X, y)

        return self

    def predict(self, X):
        """Predict using stacked ensemble"""
        meta_features = self._get_meta_features(X)
        return self.meta_classifier.predict(meta_features)

    def predict_proba(self, X):
        """Predict probabilities"""
        meta_features = self._get_meta_features(X)
        if hasattr(self.meta_classifier, 'predict_proba'):
            return self.meta_classifier.predict_proba(meta_features)
        else:
            # Fall back to hard predictions
            preds = self.meta_classifier.predict(meta_features)
            probas = np.zeros((len(X), len(self.classes_)))
            for i, pred in enumerate(preds):
                probas[i, np.where(self.classes_ == pred)[0]] = 1.0
            return probas

    def _get_meta_features(self, X):
        """Generate meta-features from base classifier predictions"""
        n_classes = len(self.classes_)
        meta_features = np.zeros((len(X), len(self.base_classifiers_) * n_classes))

        for i, knn in enumerate(self.base_classifiers_):
            probas = knn.predict_proba(X)
            meta_features[:, i*n_classes:(i+1)*n_classes] = probas

        return meta_features
```

k-NN (instance-based) and model-based classifiers (trees, SVMs, neural nets) have complementary strengths:
| k-NN | Model-Based |
|---|---|
| No training phase | Requires training |
| Adapts to local structure | Captures global patterns |
| Memory-intensive | Compact models |
| Slow prediction | Fast prediction |
Hybrid Approach 1: KNN as Meta-Features
Add k-NN predictions as extra features for a model-based classifier, as sketched below.
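The following sketch uses out-of-fold k-NN probabilities as appended feature columns for a gradient boosting model; the dataset is synthetic and the hyperparameters are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Out-of-fold k-NN probabilities become extra feature columns; cross-validation
# keeps each training point's own label from leaking into its meta-feature.
knn = KNeighborsClassifier(n_neighbors=5)
meta_tr = cross_val_predict(knn, X_tr, y_tr, cv=5, method='predict_proba')
knn.fit(X_tr, y_tr)
meta_te = knn.predict_proba(X_te)

gbm = GradientBoostingClassifier(random_state=0)
gbm.fit(np.hstack([X_tr, meta_tr]), y_tr)
print("hybrid accuracy:", gbm.score(np.hstack([X_te, meta_te]), y_te))
```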
Hybrid Approach 2: Model-Based Embeddings + KNN
Use a trained model to create embeddings, then apply k-NN in the embedding space, as sketched below.
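As a minimal stand-in for a learned embedding, the sketch below represents each sample by the leaf indices of a random forest and runs k-NN with Hamming distance on that representation; with deep models the analogous recipe is k-NN (Euclidean or cosine) on penultimate-layer activations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "Embedding": represent each sample by the leaf it lands in within each tree.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
emb_tr = forest.apply(X_tr)   # shape (n_samples, n_trees), integer leaf ids
emb_te = forest.apply(X_te)

# k-NN with Hamming distance: neighbors are points that share many leaves.
knn = KNeighborsClassifier(n_neighbors=5, metric='hamming').fit(emb_tr, y_tr)
print("forest-embedding k-NN accuracy:", knn.score(emb_te, y_te))
```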
Hybrid Approach 3: Gating Network
Learn when to trust k-NN versus the model-based classifier, routing each query to whichever is more likely to be correct; see the sketch below.
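This is a deliberately simple gate trained on held-out data to predict when k-NN (and not the boosted model) is correct; real gating networks often use richer confidence features such as neighbor agreement. Dataset and split sizes are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=3000, n_features=25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
X_fit, X_gate, y_fit, y_gate = train_test_split(X_tr, y_tr, test_size=0.3,
                                                random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_fit, y_fit)
gbm = GradientBoostingClassifier(random_state=0).fit(X_fit, y_fit)

# Gate target: 1 where only k-NN is correct on held-out data, 0 otherwise.
knn_ok = knn.predict(X_gate) == y_gate
gbm_ok = gbm.predict(X_gate) == y_gate
gate = LogisticRegression(max_iter=1000).fit(X_gate, (knn_ok & ~gbm_ok).astype(int))

# Route each test point to whichever classifier the gate prefers.
use_knn = gate.predict(X_te).astype(bool)
pred = np.where(use_knn, knn.predict(X_te), gbm.predict(X_te))
print("gated hybrid accuracy:", (pred == y_te).mean())
```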
The meta-classifier should be simpler than base classifiers to avoid overfitting. Logistic Regression is a common choice. If using another k-NN as meta-classifier, use large k to average over base classifier disagreements.
Building effective KNN ensembles requires attention to several practical considerations. Here are battle-tested best practices.
More classifiers → more computation but diminishing returns on accuracy.
| Method | Minimum | Typical | Maximum Useful |
|---|---|---|---|
| Random Subspace | 25 | 50-100 | 200 |
| Bagging/Random Patches | 25 | 50-100 | 500 |
| Multi-Metric | 3-4 | 5-10 | 15 (diversity-limited) |
| Multi-k | 3-5 | 5-10 | 10 (k-space limited) |
k-NN ensembles can be parallelized effectively:
Parallel Training: each base classifier fits independently, so training distributes naturally across cores with joblib or multiprocessing.
Parallel Prediction: base classifiers also predict independently; compute their probability outputs in parallel, then average them.
Memory Optimization: store only the per-classifier feature (and sample) indices and share one copy of the training matrix across workers rather than duplicating subsets.
The sketch below implements the parallel training and prediction paths with joblib:
```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.neighbors import KNeighborsClassifier


def train_single_knn(X, y, feature_idx, k):
    """Train a single KNN on a feature subset - for parallel execution"""
    X_subset = X[:, feature_idx]
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_subset, y)
    return knn, feature_idx


def predict_single_knn(knn, X, feature_idx):
    """Get predictions from a single KNN - for parallel execution"""
    X_subset = X[:, feature_idx]
    return knn.predict_proba(X_subset)


class ParallelRandomSubspaceKNN:
    """
    Random Subspace KNN with parallel training and prediction.
    """

    def __init__(self, n_estimators=50, n_features=0.5, k=5, n_jobs=-1):
        self.n_estimators = n_estimators
        self.n_features = n_features
        self.k = k
        self.n_jobs = n_jobs

    def fit(self, X, y):
        np.random.seed(42)
        self.n_features_total_ = X.shape[1]
        self.classes_ = np.unique(y)

        n_feat = int(self.n_features * self.n_features_total_)

        # Generate feature subsets
        self.feature_subsets_ = [
            np.random.choice(self.n_features_total_, size=n_feat, replace=False)
            for _ in range(self.n_estimators)
        ]

        # Parallel training
        results = Parallel(n_jobs=self.n_jobs)(
            delayed(train_single_knn)(X, y, feat_idx, self.k)
            for feat_idx in self.feature_subsets_
        )
        self.estimators_ = [r[0] for r in results]

        return self

    def predict(self, X):
        # Parallel prediction
        probas_list = Parallel(n_jobs=self.n_jobs)(
            delayed(predict_single_knn)(est, X, feat_idx)
            for est, feat_idx in zip(self.estimators_, self.feature_subsets_)
        )

        # Average probabilities
        avg_probas = np.mean(probas_list, axis=0)
        return self.classes_[np.argmax(avg_probas, axis=1)]
```

Ensembles are prone to subtle overfitting of the combination strategy. Proper validation is essential: evaluate learned voting weights, stacking meta-classifiers, and any other tuned combination on data that was not used to fit them (held-out sets, nested cross-validation, or the out-of-bag estimate shown earlier).
Ensembles add complexity. Skip them when a single well-tuned k-NN already meets accuracy requirements, when prediction latency or memory budgets are tight, or when the dataset is too small to support diverse subsamples or subspaces.
Begin with Random Subspace Method (50 classifiers, √d features each). It's easy to implement, parallelizes well, and often captures most of the ensemble benefit. Only add complexity (stacking, learned weights, multiple metrics) if RSM doesn't meet performance targets.
We've explored the rich landscape of KNN ensemble methods—from fundamental principles through specific techniques to practical implementation. Let's consolidate the essential knowledge:
- Ensembles help only when base classifiers are individually accurate and make different errors; diversity is the central requirement.
- Diversity comes from random feature subspaces, sample subsampling (bagging, subbagging, Random Patches), different distance metrics, and different values of k.
- Combine predictions with voting, weighted voting, or a learned meta-classifier (stacking), and validate any learned combination on data not used to fit it.
- Start with the Random Subspace Method and add complexity only when it fails to meet performance targets.
This page completes our exploration of KNN Variants. We've covered instance selection (CNN, ENN), metric learning (LMNN and deep metric learning), adaptive neighborhood techniques, and now ensemble methods.
Together, these techniques transform k-NN from a simple baseline into a powerful, production-ready classification system. You can now reduce memory and prediction cost with instance selection, improve accuracy with learned metrics, and gain robustness with ensembles.
K-Nearest Neighbors, properly engineered, remains a competitive algorithm for many real-world problems.
Congratulations! You've mastered the full spectrum of KNN variants—from instance selection through metric learning to ensemble methods. You now possess the knowledge to apply k-NN effectively in diverse production scenarios, selecting and combining the right techniques for each problem's unique characteristics.