Throughout this module, we've explored individual optimizations for k-NN: selecting instances (CNN, ENN), learning metrics (LMNN, deep metric learning), and adapting neighborhoods. Each technique improves a specific aspect of k-NN performance.
But what if we could combine multiple k-NN classifiers—each capturing different aspects of the data—to achieve robustness and accuracy that no single classifier can match?
Ensemble methods have revolutionized machine learning. Random Forests combine thousands of decision trees to achieve state-of-the-art performance on many tasks. Gradient boosting combines weak learners into powerful predictors. The same principles apply to k-NN.
Why ensemble k-NN?
Reduce variance: A single k-NN is sensitive to training data, choice of k, and distance metric. Averaging over multiple classifiers smooths these sensitivities.
Improve robustness: Different classifiers make different errors. Combining them through voting reduces the impact of any single classifier's mistakes.
Handle high-dimensionality: In high dimensions, using all features often hurts. Combining k-NNs on different feature subsets can outperform the full-space classifier.
Leverage diversity: Different distance metrics capture different notions of similarity. Combining them yields a richer similarity measure.
This page explores the principles and techniques of k-NN ensembles—from simple bagging to sophisticated adaptive combinations.
By completing this page, you will understand why ensembles improve k-NN, master random subspace methods for high-dimensional data, learn to ensemble different metrics and k values, implement weighted voting strategies, and design hybrid KNN-based systems combining instance-based and model-based approaches.
Before diving into k-NN-specific techniques, let's review the fundamental principles that make ensembles work.
Consider $M$ independent classifiers, each with accuracy $p > 0.5$. The ensemble (majority vote) is correct when more than $M/2$ classifiers are correct. By the Chernoff bound:
$$P(\text{ensemble wrong}) \leq \exp\left(-2M\left(p - \frac{1}{2}\right)^2\right)$$
As $M \to \infty$, ensemble error → 0 exponentially fast!
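The bound is easy to evaluate numerically. The short sketch below is a standalone illustration (the base accuracy p = 0.6 is chosen arbitrarily) showing how quickly it shrinks as M grows.

```python
import numpy as np

def ensemble_error_bound(M, p):
    """Chernoff/Hoeffding-style bound on majority-vote error for M
    independent classifiers, each correct with probability p > 0.5."""
    return np.exp(-2 * M * (p - 0.5) ** 2)

# Individually weak voters (p = 0.6) still drive the bound down fast.
for M in (11, 25, 51, 101):
    print(f"M = {M:3d}  ->  P(ensemble wrong) <= {ensemble_error_bound(M, 0.6):.4f}")
```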
Key requirement: Classifiers must be diverse (make different errors). If all classifiers make the same mistakes, the ensemble doesn't help.
k-NN ensembles create diversity through random feature subsets, subsampled training instances, different values of k, and different distance metrics.
The expected prediction error decomposes as:
$$\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}$$
k-NN's characteristics: small k gives low bias but high variance (the prediction can flip on a single noisy neighbor), large k lowers variance but raises bias by smoothing over class boundaries, and the noise term is irreducible.
Ensembles primarily reduce variance. Averaging multiple k-NN classifiers with small k can achieve low bias (from small k) AND low variance (from averaging).
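A quick experiment makes this concrete. The sketch below (synthetic data and arbitrary constants) compares a single 1-NN with an average of fifty 1-NNs, each trained on a random half of the training set; whether and by how much the averaged version wins depends on the data, but on noisy problems it usually does.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data; any (X, y) classification set works the same way.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# One 1-NN: low bias, high variance.
single = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
print("single 1-NN:", single.score(X_te, y_te))

# Average 50 1-NNs, each fit on a random half of the training set.
rng = np.random.default_rng(0)
votes = np.zeros((len(X_te), len(np.unique(y))))
for _ in range(50):
    idx = rng.choice(len(X_tr), size=len(X_tr) // 2, replace=False)
    member = KNeighborsClassifier(n_neighbors=1).fit(X_tr[idx], y_tr[idx])
    votes += member.predict_proba(X_te)   # class order is [0, 1] for every member
print("averaged 1-NNs:", (votes.argmax(axis=1) == y_te).mean())
```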
For regression, if each of the $M$ base estimators has expected squared error $\epsilon$ and the errors have pairwise correlation $\rho$, then the ensemble average satisfies:
$$\text{Error}_{\text{ensemble}} \leq \rho \cdot \epsilon + \frac{(1-\rho) \cdot \epsilon}{M}$$
As $M \to \infty$, error approaches $\rho \cdot \epsilon$. Lower correlation → lower asymptotic error → more benefit from diversity.
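Plugging numbers into this expression shows why correlation, not ensemble size, becomes the limiting factor. The following is a standalone worked example with $\epsilon$ normalized to 1.

```python
import numpy as np

def ensemble_error(rho, eps, M):
    """Evaluate rho*eps + (1 - rho)*eps/M from the decomposition above."""
    return rho * eps + (1 - rho) * eps / M

eps = 1.0  # base-estimator error; only ratios matter here
for rho in (0.0, 0.3, 0.7):
    bounds = [ensemble_error(rho, eps, M) for M in (1, 10, 100, 1000)]
    print(f"rho = {rho}:", " -> ".join(f"{b:.3f}" for b in bounds))
```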
The most common mistake in building KNN ensembles is creating classifiers that are too similar. If all classifiers use the same features, same k, and same metric with slightly different data, they're highly correlated and the ensemble provides little benefit. Maximize diversity across dimensions: features, metrics, AND parameters.
The Random Subspace Method (RSM), introduced by Ho (1998), is particularly effective for k-NN in high-dimensional spaces. Each classifier operates on a random subset of features, avoiding the curse of dimensionality while capturing different aspects of the data.
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.base import BaseEstimator, ClassifierMixin
from collections import Counter


class RandomSubspaceKNN(BaseEstimator, ClassifierMixin):
    """
    Random Subspace Method for K-Nearest Neighbors

    Creates an ensemble of KNN classifiers, each operating on
    a random subset of features.

    Parameters:
    -----------
    n_estimators : int
        Number of base KNN classifiers
    n_features : int or float
        Number (int) or fraction (float) of features to use per classifier
    k : int
        Number of neighbors for each base classifier
    voting : str
        'hard' for majority voting, 'soft' for probability averaging
    random_state : int
        Random seed for reproducibility
    """

    def __init__(self, n_estimators=50, n_features=0.5, k=5,
                 voting='hard', random_state=None):
        self.n_estimators = n_estimators
        self.n_features = n_features
        self.k = k
        self.voting = voting
        self.random_state = random_state

    def fit(self, X, y):
        """Fit ensemble of KNN classifiers on random feature subsets"""
        np.random.seed(self.random_state)
        self.n_samples_, self.n_features_total_ = X.shape
        self.classes_ = np.unique(y)

        # Determine number of features per classifier
        if isinstance(self.n_features, float):
            self.n_features_per_ = max(1, int(self.n_features * self.n_features_total_))
        else:
            self.n_features_per_ = min(self.n_features, self.n_features_total_)

        # Create base classifiers
        self.estimators_ = []
        self.feature_subsets_ = []

        for i in range(self.n_estimators):
            # Random feature subset
            feature_idx = np.random.choice(
                self.n_features_total_,
                size=self.n_features_per_,
                replace=False
            )
            self.feature_subsets_.append(feature_idx)

            # Train KNN on subset
            X_subset = X[:, feature_idx]
            knn = KNeighborsClassifier(n_neighbors=self.k)
            knn.fit(X_subset, y)
            self.estimators_.append(knn)

        return self

    def predict(self, X):
        """Predict using majority voting across ensemble"""
        if self.voting == 'soft':
            probas = self.predict_proba(X)
            return self.classes_[np.argmax(probas, axis=1)]
        else:
            # Hard voting
            predictions = np.array([
                est.predict(X[:, self.feature_subsets_[i]])
                for i, est in enumerate(self.estimators_)
            ])  # Shape: (n_estimators, n_samples)

            # Majority vote per sample
            return np.array([
                Counter(predictions[:, j]).most_common(1)[0][0]
                for j in range(X.shape[0])
            ])

    def predict_proba(self, X):
        """Predict class probabilities by averaging base classifier probabilities"""
        probas = np.zeros((X.shape[0], len(self.classes_)))

        for i, est in enumerate(self.estimators_):
            X_subset = X[:, self.feature_subsets_[i]]
            probas += est.predict_proba(X_subset)

        probas /= self.n_estimators
        return probas

    def get_feature_importance(self):
        """
        Estimate feature importance as the fraction of base classifiers
        whose random subset included each feature (a simple occurrence-count proxy).
        """
        importance = np.zeros(self.n_features_total_)

        # Count occurrences of each feature
        for feature_idx in self.feature_subsets_:
            importance[feature_idx] += 1

        # Normalize
        importance /= self.n_estimators
        return importance
```

The subspace dimension $r$ controls the bias-variance trade-off.
Empirical guideline: $r \approx \sqrt{d}$ or $r \approx d/2$ often works well.
For d = 100: try r = 10, 20, 50 and cross-validate.
| Subspace Size (r) | Per-Classifier Variance | Classifier Correlation | Ensemble Benefit |
|---|---|---|---|
| Small (r << d) | High | Low | Large (diverse classifiers) |
| Medium (r ≈ √d) | Moderate | Moderate | Balanced (typical sweet spot) |
| Large (r → d) | Low | High | Small (similar classifiers) |
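The cross-validation loop suggested above takes only a few lines. This sketch assumes the RandomSubspaceKNN class from the earlier listing and a synthetic 100-feature dataset standing in for real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a d = 100 problem; RandomSubspaceKNN is the class
# defined in the listing above.
X, y = make_classification(n_samples=1000, n_features=100, n_informative=20,
                           random_state=0)

for r in (10, 20, 50):
    model = RandomSubspaceKNN(n_estimators=50, n_features=r, k=5, random_state=0)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"r = {r:3d}: CV accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```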
Random Subspace KNN is analogous to Random Forests with trees replaced by k-NN. Both use feature randomization for diversity. The key difference: trees partition space globally, while k-NN partitions locally. RSM-KNN often outperforms Random Forests when local patterns matter more than global structure.
Bagging (Bootstrap Aggregating) creates diversity by training each base classifier on a bootstrap sample of the training data. For k-NN, this creates classifiers with different instance sets.
Challenge: For large k, bagged k-NNs are highly correlated because most neighbors are the same across bootstrap samples.
Subsampling (subbagging) offers an alternative: instead of drawing n samples with replacement, draw m < n samples without replacement, so each classifier sees a genuinely different portion of the data:
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.base import BaseEstimator, ClassifierMixin
from collections import Counter


class BaggingKNN(BaseEstimator, ClassifierMixin):
    """
    Bagging for K-Nearest Neighbors with optional subsampling.

    Parameters:
    -----------
    n_estimators : int
        Number of base classifiers
    sample_fraction : float
        Fraction of samples to use per classifier (subbagging)
    k : int
        Number of neighbors
    bootstrap : bool
        If True, sample with replacement; if False, without replacement
    combine_with_rsm : bool
        If True, also apply random subspace method (feature sampling)
    rsm_fraction : float
        Fraction of features if combine_with_rsm=True
    """

    def __init__(self, n_estimators=50, sample_fraction=0.7, k=5,
                 bootstrap=True, combine_with_rsm=False,
                 rsm_fraction=0.5, random_state=None):
        self.n_estimators = n_estimators
        self.sample_fraction = sample_fraction
        self.k = k
        self.bootstrap = bootstrap
        self.combine_with_rsm = combine_with_rsm
        self.rsm_fraction = rsm_fraction
        self.random_state = random_state

    def fit(self, X, y):
        """Fit bagged KNN ensemble"""
        np.random.seed(self.random_state)
        self.n_samples_, self.n_features_ = X.shape
        self.classes_ = np.unique(y)

        n_subsample = int(self.sample_fraction * self.n_samples_)

        # Effective k (can't exceed subsample size)
        self.k_effective_ = min(self.k, n_subsample - 1)

        self.estimators_ = []
        self.sample_indices_ = []
        self.feature_indices_ = []

        for i in range(self.n_estimators):
            # Sample indices
            if self.bootstrap:
                indices = np.random.choice(self.n_samples_, size=n_subsample, replace=True)
            else:
                indices = np.random.choice(self.n_samples_, size=n_subsample, replace=False)
            self.sample_indices_.append(indices)

            # Feature indices (if RSM enabled)
            if self.combine_with_rsm:
                n_features_use = max(1, int(self.rsm_fraction * self.n_features_))
                feat_indices = np.random.choice(self.n_features_, size=n_features_use, replace=False)
            else:
                feat_indices = np.arange(self.n_features_)
            self.feature_indices_.append(feat_indices)

            # Train KNN
            X_subset = X[np.ix_(indices, feat_indices)]
            y_subset = y[indices]

            knn = KNeighborsClassifier(n_neighbors=self.k_effective_)
            knn.fit(X_subset, y_subset)
            self.estimators_.append(knn)

        return self

    def predict(self, X):
        """Predict via majority voting"""
        predictions = []
        for i, est in enumerate(self.estimators_):
            X_feat = X[:, self.feature_indices_[i]]
            pred = est.predict(X_feat)
            predictions.append(pred)

        predictions = np.array(predictions)  # (n_estimators, n_samples)

        final_pred = []
        for j in range(X.shape[0]):
            votes = predictions[:, j]
            final_pred.append(Counter(votes).most_common(1)[0][0])

        return np.array(final_pred)

    def predict_proba(self, X):
        """Predict probabilities by averaging"""
        probas = np.zeros((X.shape[0], len(self.classes_)))
        for i, est in enumerate(self.estimators_):
            X_feat = X[:, self.feature_indices_[i]]
            probas += est.predict_proba(X_feat)
        return probas / self.n_estimators

    def oob_score(self, X, y):
        """
        Compute out-of-bag score.

        For each sample, aggregate predictions only from classifiers
        that did not include that sample in their training set.
        """
        n_samples = len(y)
        oob_predictions = [[] for _ in range(n_samples)]

        for i, (est, sample_idx, feat_idx) in enumerate(
            zip(self.estimators_, self.sample_indices_, self.feature_indices_)
        ):
            # Find out-of-bag samples
            in_bag = set(sample_idx)
            for j in range(n_samples):
                if j not in in_bag:
                    pred = est.predict(X[j:j+1, feat_idx])[0]
                    oob_predictions[j].append(pred)

        # Compute accuracy on samples with OOB predictions
        correct = 0
        counted = 0
        for j in range(n_samples):
            if len(oob_predictions[j]) > 0:
                majority_pred = Counter(oob_predictions[j]).most_common(1)[0][0]
                if majority_pred == y[j]:
                    correct += 1
                counted += 1

        return correct / counted if counted > 0 else 0.0
```

Combining sample and feature subsampling in this way creates Random Patches (Louppe & Geurts, 2012); the combine_with_rsm option above implements exactly this.
Bagging is most beneficial when the base classifier has high variance (small k), when the training data are noisy, and when sample subsampling is combined with feature subsampling so that neighbor sets genuinely differ across members.
Unlike decision trees, k-NN doesn't benefit as much from pure sample bagging. The reason: k-NN predictions depend on nearest neighbors, which overlap heavily between bootstrap samples. Always combine with feature subsampling (Random Patches) for meaningful diversity.
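A short usage sketch of the BaggingKNN class above with both kinds of subsampling enabled follows; the dataset is synthetic and the fractions shown are illustrative rather than recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data; BaggingKNN is the class defined in the listing above.
X, y = make_classification(n_samples=1500, n_features=40, n_informative=15,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Random Patches: subsample rows (70%) and columns (50%) for every member.
patches = BaggingKNN(n_estimators=50, sample_fraction=0.7, k=5, bootstrap=True,
                     combine_with_rsm=True, rsm_fraction=0.5, random_state=0)
patches.fit(X_tr, y_tr)

print("test accuracy :", patches.score(X_te, y_te))      # via ClassifierMixin
print("OOB estimate  :", patches.oob_score(X_tr, y_tr))  # no separate holdout needed
```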
Different distance metrics capture different notions of similarity. Combining k-NN classifiers with different metrics creates ensemble diversity through semantic diversity rather than sampling diversity.
Consider a d-dimensional feature space. Standard metrics include:
| Metric | Formula | Best For |
|---|---|---|
| Euclidean | √(Σ(xᵢ - yᵢ)²) | General purpose, standardized features |
| Manhattan (L1) | Σ\|xᵢ - yᵢ\| | Sparse features, outlier robustness |
| Chebyshev (L∞) | max\|xᵢ - yᵢ\| | When worst-case difference matters |
| Cosine | 1 - (x·y)/(‖x‖ ‖y‖) | Text, high-dimensional sparse vectors |
| Mahalanobis | √((x-y)ᵀM(x-y)) | Correlated features (requires learned M) |
| Minkowski (Lp) | (Σ\|xᵢ - yᵢ\|ᵖ)^(1/p) | Tunable between L1 and L∞ |
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.preprocessing import StandardScaler, normalize
from collections import Counter


class MultiMetricKNN(BaseEstimator, ClassifierMixin):
    """
    Ensemble of KNN classifiers using different distance metrics.

    Each base classifier uses a different metric, capturing different
    notions of similarity. Predictions are combined via weighted voting.

    Parameters:
    -----------
    metrics : list
        List of metrics to use. Options: 'euclidean', 'manhattan',
        'chebyshev', 'cosine', 'minkowski_p3', 'minkowski_p0.5'
    k : int
        Number of neighbors for all classifiers
    weights : array-like or None
        Weights for each metric's vote. If None, use uniform weights.
    learn_weights : bool
        If True, learn optimal weights from validation performance
    """

    def __init__(self, metrics=None, k=5, weights=None, learn_weights=False):
        self.metrics = metrics or ['euclidean', 'manhattan', 'chebyshev', 'cosine']
        self.k = k
        self.weights = weights
        self.learn_weights = learn_weights

    def fit(self, X, y, X_val=None, y_val=None):
        """
        Fit ensemble of KNN classifiers.

        If learn_weights=True and validation data provided,
        learn optimal weights.
        """
        self.classes_ = np.unique(y)
        self.scaler_ = StandardScaler()
        X_scaled = self.scaler_.fit_transform(X)

        self.estimators_ = []
        self.metric_configs_ = []

        for metric in self.metrics:
            # Parse metric configuration
            if metric.startswith('minkowski_p'):
                p = float(metric.split('_p')[1])
                config = {'metric': 'minkowski', 'p': p}
            elif metric == 'cosine':
                # For cosine, normalize data and use euclidean
                config = {'metric': 'euclidean', 'normalize': True}
            else:
                config = {'metric': metric}

            self.metric_configs_.append(config)

            # Prepare data for this metric
            if config.get('normalize', False):
                X_metric = normalize(X_scaled)
            else:
                X_metric = X_scaled

            # Build classifier
            knn = KNeighborsClassifier(
                n_neighbors=self.k,
                metric=config['metric'],
                p=config.get('p', 2)
            )
            knn.fit(X_metric, y)
            self.estimators_.append(knn)

        # Learn weights if requested
        if self.learn_weights and X_val is not None:
            self.weights_ = self._learn_weights(X_val, y_val)
        elif self.weights is not None:
            self.weights_ = np.array(self.weights)
        else:
            self.weights_ = np.ones(len(self.metrics)) / len(self.metrics)

        # Store scaled training data for later use
        self.X_train_scaled_ = X_scaled
        self.y_train_ = y

        return self

    def _learn_weights(self, X_val, y_val):
        """Learn optimal weights based on validation accuracy"""
        X_val_scaled = self.scaler_.transform(X_val)
        accuracies = []

        for i, (est, config) in enumerate(zip(self.estimators_, self.metric_configs_)):
            if config.get('normalize', False):
                X_val_metric = normalize(X_val_scaled)
            else:
                X_val_metric = X_val_scaled

            acc = est.score(X_val_metric, y_val)
            accuracies.append(acc)

        # Weight proportional to accuracy
        accuracies = np.array(accuracies)
        weights = accuracies / accuracies.sum()

        print("Learned metric weights:")
        for metric, weight, acc in zip(self.metrics, weights, accuracies):
            print(f"  {metric}: weight={weight:.3f} (acc={acc:.3f})")

        return weights

    def _prepare_X(self, X, config):
        """Prepare input for a specific metric configuration"""
        X_scaled = self.scaler_.transform(X)
        if config.get('normalize', False):
            return normalize(X_scaled)
        return X_scaled

    def predict(self, X):
        """Predict via weighted voting"""
        # Collect weighted votes
        vote_matrix = np.zeros((len(X), len(self.classes_)))

        for i, (est, config) in enumerate(zip(self.estimators_, self.metric_configs_)):
            X_metric = self._prepare_X(X, config)
            probas = est.predict_proba(X_metric)
            vote_matrix += self.weights_[i] * probas

        return self.classes_[np.argmax(vote_matrix, axis=1)]

    def predict_proba(self, X):
        """Predict weighted probability averages"""
        probas = np.zeros((len(X), len(self.classes_)))
        for i, (est, config) in enumerate(zip(self.estimators_, self.metric_configs_)):
            X_metric = self._prepare_X(X, config)
            probas += self.weights_[i] * est.predict_proba(X_metric)
        return probas
```

Not all metric combinations are equally useful. Guidelines:
High Diversity (Recommended): combine metrics with genuinely different geometry, such as Euclidean + Manhattan + Chebyshev + cosine, or any fixed metric alongside a learned Mahalanobis metric.
Low Diversity (Less Useful): combine near-duplicates, such as Euclidean together with Minkowski at p ≈ 2, which rank neighbors almost identically and add computation without new information.
For maximum power, include a learned Mahalanobis metric (from LMNN) in the ensemble: it contributes a supervised notion of similarity that complements the fixed metrics above.
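One way to wire this in is to treat the LMNN-transformed space as an extra "metric view" whose probabilities are averaged with the others. The sketch below assumes the metric-learn package (an external dependency not used elsewhere on this page), whose LMNN learns a linear map so that Euclidean distance after the map corresponds to a Mahalanobis distance in the original space.

```python
# Sketch: an extra ensemble member built from an LMNN-learned Mahalanobis metric.
from metric_learn import LMNN
from sklearn.neighbors import KNeighborsClassifier

def fit_lmnn_member(X_train, y_train, k=5):
    """Fit LMNN, then a k-NN in the learned space; returns both objects."""
    lmnn = LMNN()                          # default hyperparameters
    X_emb = lmnn.fit_transform(X_train, y_train)
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_emb, y_train)
    return lmnn, knn

def lmnn_member_proba(lmnn, knn, X):
    """Probabilities to average (or weight) with the other metrics' outputs."""
    return knn.predict_proba(lmnn.transform(X))
```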
For advanced applications, consider region-specific metric weights. In different parts of feature space, different metrics may be optimal. A meta-classifier can learn which metric to trust based on the test point's location.
Stacking (stacked generalization) uses a meta-learner to combine base classifier predictions, learning optimal combination weights from data rather than using fixed voting.
Level 0 (Base Classifiers): multiple k-NN variants, e.g. different values of k and different distance metrics.
Level 1 (Meta-Classifier): a learner that combines the Level 0 predictions, typically trained on their out-of-fold class probabilities.
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.base import BaseEstimator, ClassifierMixin


class StackingKNN(BaseEstimator, ClassifierMixin):
    """
    Stacking ensemble with multiple KNN base classifiers
    and a learned meta-classifier.

    Parameters:
    -----------
    base_configs : list of dicts
        Configuration for each base KNN classifier. Each dict holds
        KNeighborsClassifier keyword arguments such as 'n_neighbors',
        'metric', or 'weights'.
    meta_classifier : estimator
        The Level-1 classifier to combine base predictions.
    cv_folds : int
        Number of CV folds for generating meta-features.
    """

    def __init__(self, base_configs=None, meta_classifier=None, cv_folds=5):
        self.base_configs = base_configs or [
            {'n_neighbors': 1, 'metric': 'euclidean'},
            {'n_neighbors': 3, 'metric': 'euclidean'},
            {'n_neighbors': 5, 'metric': 'euclidean'},
            {'n_neighbors': 5, 'metric': 'manhattan'},
            {'n_neighbors': 5, 'metric': 'chebyshev'},
        ]
        self.meta_classifier = meta_classifier or LogisticRegression(max_iter=1000)
        self.cv_folds = cv_folds

    def fit(self, X, y):
        """
        Fit stacking ensemble.

        Uses cross-validation to generate meta-features for
        training the meta-classifier.
        """
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)
        n_samples = len(X)

        # Create base classifiers
        self.base_classifiers_ = []
        for config in self.base_configs:
            knn = KNeighborsClassifier(**config)
            self.base_classifiers_.append(knn)

        # Generate meta-features via cross-validation
        # Shape: (n_samples, n_base_classifiers * n_classes)
        meta_features = np.zeros((n_samples, len(self.base_classifiers_) * n_classes))

        for i, knn in enumerate(self.base_classifiers_):
            # Cross-val predictions (probabilities)
            cv_probas = cross_val_predict(
                knn, X, y, cv=self.cv_folds, method='predict_proba'
            )
            meta_features[:, i*n_classes:(i+1)*n_classes] = cv_probas

        # Fit meta-classifier on meta-features
        self.meta_classifier.fit(meta_features, y)

        # Refit all base classifiers on full training data
        for knn in self.base_classifiers_:
            knn.fit(X, y)

        return self

    def predict(self, X):
        """Predict using stacked ensemble"""
        meta_features = self._get_meta_features(X)
        return self.meta_classifier.predict(meta_features)

    def predict_proba(self, X):
        """Predict probabilities"""
        meta_features = self._get_meta_features(X)
        if hasattr(self.meta_classifier, 'predict_proba'):
            return self.meta_classifier.predict_proba(meta_features)
        else:
            # Fall back to hard predictions
            preds = self.meta_classifier.predict(meta_features)
            probas = np.zeros((len(X), len(self.classes_)))
            for i, pred in enumerate(preds):
                probas[i, np.where(self.classes_ == pred)[0]] = 1.0
            return probas

    def _get_meta_features(self, X):
        """Generate meta-features from base classifier predictions"""
        n_classes = len(self.classes_)
        meta_features = np.zeros((len(X), len(self.base_classifiers_) * n_classes))

        for i, knn in enumerate(self.base_classifiers_):
            probas = knn.predict_proba(X)
            meta_features[:, i*n_classes:(i+1)*n_classes] = probas

        return meta_features
```

k-NN (instance-based) and model-based classifiers (trees, SVMs, neural nets) have complementary strengths:
| k-NN | Model-Based |
|---|---|
| No training phase | Requires training |
| Adapts to local structure | Captures global patterns |
| Memory-intensive | Compact models |
| Slow prediction | Fast prediction |
Hybrid Approach 1: KNN as Meta-Features
Add k-NN predictions as extra features for a model-based classifier, as sketched below.
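The following sketch uses out-of-fold k-NN probabilities as appended feature columns for a gradient boosting model; the dataset is synthetic and the hyperparameters are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Out-of-fold k-NN probabilities become extra feature columns; cross-validation
# keeps each training point's own label from leaking into its meta-feature.
knn = KNeighborsClassifier(n_neighbors=5)
meta_tr = cross_val_predict(knn, X_tr, y_tr, cv=5, method='predict_proba')
knn.fit(X_tr, y_tr)
meta_te = knn.predict_proba(X_te)

gbm = GradientBoostingClassifier(random_state=0)
gbm.fit(np.hstack([X_tr, meta_tr]), y_tr)
print("hybrid accuracy:", gbm.score(np.hstack([X_te, meta_te]), y_te))
```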
Hybrid Approach 2: Model-Based Embeddings + KNN
Use a trained model to create embeddings, then apply k-NN in the embedding space, as sketched below.
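As a minimal stand-in for a learned embedding, the sketch below represents each sample by the leaf indices of a random forest and runs k-NN with Hamming distance on that representation; with deep models the analogous recipe is k-NN (Euclidean or cosine) on penultimate-layer activations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "Embedding": represent each sample by the leaf it lands in within each tree.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
emb_tr = forest.apply(X_tr)   # shape (n_samples, n_trees), integer leaf ids
emb_te = forest.apply(X_te)

# k-NN with Hamming distance: neighbors are points that share many leaves.
knn = KNeighborsClassifier(n_neighbors=5, metric='hamming').fit(emb_tr, y_tr)
print("forest-embedding k-NN accuracy:", knn.score(emb_te, y_te))
```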
Hybrid Approach 3: Gating Network
Learn when to trust k-NN versus the model-based classifier, routing each query to whichever is more likely to be correct; see the sketch below.
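This is a deliberately simple gate trained on held-out data to predict when k-NN (and not the boosted model) is correct; real gating networks often use richer confidence features such as neighbor agreement. Dataset and split sizes are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=3000, n_features=25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
X_fit, X_gate, y_fit, y_gate = train_test_split(X_tr, y_tr, test_size=0.3,
                                                random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_fit, y_fit)
gbm = GradientBoostingClassifier(random_state=0).fit(X_fit, y_fit)

# Gate target: 1 where only k-NN is correct on held-out data, 0 otherwise.
knn_ok = knn.predict(X_gate) == y_gate
gbm_ok = gbm.predict(X_gate) == y_gate
gate = LogisticRegression(max_iter=1000).fit(X_gate, (knn_ok & ~gbm_ok).astype(int))

# Route each test point to whichever classifier the gate prefers.
use_knn = gate.predict(X_te).astype(bool)
pred = np.where(use_knn, knn.predict(X_te), gbm.predict(X_te))
print("gated hybrid accuracy:", (pred == y_te).mean())
```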
The meta-classifier should be simpler than base classifiers to avoid overfitting. Logistic Regression is a common choice. If using another k-NN as meta-classifier, use large k to average over base classifier disagreements.
Building effective KNN ensembles requires attention to several practical considerations. Here are battle-tested best practices.
More classifiers → more computation but diminishing returns on accuracy.
| Method | Minimum | Typical | Maximum Useful |
|---|---|---|---|
| Random Subspace | 25 | 50-100 | 200 |
| Bagging/Random Patches | 25 | 50-100 | 500 |
| Multi-Metric | 3-4 | 5-10 | 15 (diversity-limited) |
| Multi-k | 3-5 | 5-10 | 10 (k-space limited) |
k-NN ensembles can be parallelized effectively:
Parallel Training: each base classifier fits independently, so training distributes naturally across cores with joblib or multiprocessing.
Parallel Prediction: base classifiers also predict independently; compute their probability outputs in parallel, then average them.
Memory Optimization: store only the per-classifier feature (and sample) indices and share one copy of the training matrix across workers rather than duplicating subsets.
The sketch below implements the parallel training and prediction paths with joblib:
```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.neighbors import KNeighborsClassifier


def train_single_knn(X, y, feature_idx, k):
    """Train a single KNN on a feature subset - for parallel execution"""
    X_subset = X[:, feature_idx]
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_subset, y)
    return knn, feature_idx


def predict_single_knn(knn, X, feature_idx):
    """Get predictions from a single KNN - for parallel execution"""
    X_subset = X[:, feature_idx]
    return knn.predict_proba(X_subset)


class ParallelRandomSubspaceKNN:
    """
    Random Subspace KNN with parallel training and prediction.
    """

    def __init__(self, n_estimators=50, n_features=0.5, k=5, n_jobs=-1):
        self.n_estimators = n_estimators
        self.n_features = n_features
        self.k = k
        self.n_jobs = n_jobs

    def fit(self, X, y):
        np.random.seed(42)
        self.n_features_total_ = X.shape[1]
        self.classes_ = np.unique(y)

        n_feat = int(self.n_features * self.n_features_total_)

        # Generate feature subsets
        self.feature_subsets_ = [
            np.random.choice(self.n_features_total_, size=n_feat, replace=False)
            for _ in range(self.n_estimators)
        ]

        # Parallel training
        results = Parallel(n_jobs=self.n_jobs)(
            delayed(train_single_knn)(X, y, feat_idx, self.k)
            for feat_idx in self.feature_subsets_
        )
        self.estimators_ = [r[0] for r in results]

        return self

    def predict(self, X):
        # Parallel prediction
        probas_list = Parallel(n_jobs=self.n_jobs)(
            delayed(predict_single_knn)(est, X, feat_idx)
            for est, feat_idx in zip(self.estimators_, self.feature_subsets_)
        )

        # Average probabilities
        avg_probas = np.mean(probas_list, axis=0)
        return self.classes_[np.argmax(avg_probas, axis=1)]
```

Ensembles are prone to subtle overfitting of the combination strategy. Proper validation is essential: evaluate learned voting weights, stacking meta-classifiers, and any other tuned combination on data that was not used to fit them (held-out sets, nested cross-validation, or the out-of-bag estimate shown earlier).
Ensembles add complexity. Skip them when a single well-tuned k-NN already meets accuracy requirements, when prediction latency or memory budgets are tight, or when the dataset is too small to support diverse subsamples or subspaces.
Begin with Random Subspace Method (50 classifiers, √d features each). It's easy to implement, parallelizes well, and often captures most of the ensemble benefit. Only add complexity (stacking, learned weights, multiple metrics) if RSM doesn't meet performance targets.
We've explored the rich landscape of KNN ensemble methods—from fundamental principles through specific techniques to practical implementation. Let's consolidate the essential knowledge:
- Ensembles help only when base classifiers are individually accurate and make different errors; diversity is the central requirement.
- Diversity comes from random feature subspaces, sample subsampling (bagging, subbagging, Random Patches), different distance metrics, and different values of k.
- Combine predictions with voting, weighted voting, or a learned meta-classifier (stacking), and validate any learned combination on data not used to fit it.
- Start with the Random Subspace Method and add complexity only when it fails to meet performance targets.
This page completes our exploration of KNN Variants. We've covered instance selection (CNN, ENN), metric learning (LMNN and deep metric learning), adaptive neighborhood techniques, and now ensemble methods.
Together, these techniques transform k-NN from a simple baseline into a powerful, production-ready classification system. You can now reduce memory and prediction cost with instance selection, improve accuracy with learned metrics, and gain robustness with ensembles.
K-Nearest Neighbors, properly engineered, remains a competitive algorithm for many real-world problems.
Congratulations! You've mastered the full spectrum of KNN variants—from instance selection through metric learning to ensemble methods. You now possess the knowledge to apply k-NN effectively in diverse production scenarios, selecting and combining the right techniques for each problem's unique characteristics.