KNN regression is one of the most universally applicable nonparametric methods in machine learning. Given any regression dataset—regardless of the true relationship between features and target—KNN can provide reasonable predictions without assuming parametric forms.
This universality comes from a deep theoretical result: Stone's Consistency Theorem (1977) proves that KNN regression is consistent under mild conditions. As the number of training points $n \to \infty$ and the number of neighbors $k \to \infty$ with $k/n \to 0$, the KNN estimate converges to the true regression function.
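In one standard formulation (for the k-nearest-neighbor estimate with uniform $1/k$ weights, ignoring tie-breaking technicalities), the statement reads:

$$k_n \to \infty \quad\text{and}\quad \frac{k_n}{n} \to 0 \;\;\Longrightarrow\;\; \mathbb{E}\!\left[\big(\hat{f}_n(X) - f(X)\big)^2\right] \to 0 \quad \text{for every distribution of } (X, Y) \text{ with } \mathbb{E}[Y^2] < \infty,$$

where $f(x) = \mathbb{E}[Y \mid X = x]$ is the true regression function.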
But consistency doesn't mean optimality. In practice, KNN regression involves critical decisions about weighting, neighborhood size, local model complexity, and feature preparation. This page synthesizes everything we've learned into a complete framework for regression with KNN.
By the end of this page, you will understand the complete KNN regression pipeline from data preparation to deployment, master hyperparameter selection strategies, know when KNN regression excels and when to prefer alternatives, and be able to implement production-grade KNN regression systems.
Let's formalize the complete KNN regression framework, incorporating all the variants we've studied.
The General KNN Regression Estimator:
For a query point $\mathbf{x}$:
$$\hat{f}(\mathbf{x}) = \sum_{i=1}^{n} K^*(\mathbf{x}, \mathbf{x}_i) \cdot y_i$$
where $K^*$ is the effective weight given to training point $\mathbf{x}_i$, normalized so the weights sum to one. Depending on the variant, $K^*$ may incorporate hard truncation to the $k$ nearest neighbors, distance-based or kernel weighting, and an adaptive bandwidth tied to the local neighborhood radius.
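For example, two familiar special cases fit this template (shown here as an illustration; the Gaussian version uses an adaptive bandwidth $h(\mathbf{x})$ equal to the distance to the $k$-th neighbor, matching the implementation later on this page):

$$K^*_{\text{uniform}}(\mathbf{x}, \mathbf{x}_i) = \frac{1}{k}\,\mathbb{1}\{\mathbf{x}_i \in N_k(\mathbf{x})\}, \qquad K^*_{\text{gauss}}(\mathbf{x}, \mathbf{x}_i) = \frac{\exp\!\big(-\|\mathbf{x}-\mathbf{x}_i\|^2 / 2h(\mathbf{x})^2\big)\,\mathbb{1}\{\mathbf{x}_i \in N_k(\mathbf{x})\}}{\sum_{j:\, \mathbf{x}_j \in N_k(\mathbf{x})} \exp\!\big(-\|\mathbf{x}-\mathbf{x}_j\|^2 / 2h(\mathbf{x})^2\big)}$$

where $N_k(\mathbf{x})$ denotes the set of the $k$ training points nearest to $\mathbf{x}$.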
Hierarchy of Methods (Increasing Sophistication):
| Method | Local Model | Weighting | Complexity |
|---|---|---|---|
| k-NN Uniform | Constant | Equal | Lowest |
| k-NN Weighted | Constant | Distance-based | Low |
| Kernel Regression | Constant | Kernel-based | Low-Medium |
| Local Linear | Linear | Kernel-based | Medium |
| LOESS | Linear/Quadratic | Tricube + Robust | Medium-High |
| Local Polynomial | Arbitrary degree | Kernel-based | High |
Theoretical Properties:
1. Bias-Variance Decomposition:
For the local constant fit (weighted KNN / kernel regression), the leading-order pointwise error at an interior point decomposes as
$$\text{MSE}(\hat{f}(\mathbf{x})) \approx \underbrace{\left(\frac{h^{2}\,\mu_2(K)}{2}\, f''(\mathbf{x})\right)^{2}}_{\text{Bias}^2} + \underbrace{\frac{\sigma^{2}\, R(K)}{n\, h^{d}\, p(\mathbf{x})}}_{\text{Variance}}$$
where $h$ is the (local) bandwidth, $\mu_2(K)$ and $R(K)$ are constants of the kernel, $\sigma^2$ is the noise variance, and $p(\mathbf{x})$ is the density of the inputs at $\mathbf{x}$.
2. Optimal Rate:
For interior points with smooth $f$, optimal bandwidth gives: $$\text{MSE} = O(n^{-4/(d+4)})$$
This deteriorates rapidly with dimension $d$ (curse of dimensionality).
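The rate above follows from balancing the two terms of the decomposition: squared bias grows as $h^4$ while variance shrinks as $1/(n h^d)$, so

$$\frac{d}{dh}\left[C_1 h^{4} + \frac{C_2}{n h^{d}}\right] = 0 \;\Rightarrow\; h^{*} \propto n^{-1/(d+4)} \;\Rightarrow\; \text{MSE}(h^{*}) = O\!\left(n^{-4/(d+4)}\right),$$

with $C_1 = \left(\mu_2(K)\, f''(\mathbf{x})/2\right)^2$ and $C_2 = \sigma^2 R(K)/p(\mathbf{x})$ taken from the decomposition above.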
3. Consistency:
KNN regression is consistent if $k \to \infty$ and $k/n \to 0$ as $n \to \infty$. A practical choice: $k \approx n^{4/(d+4)}$.
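To see what this heuristic implies in practice, here is a minimal sketch (the heuristic only fixes the order of growth; the constant in front is arbitrary, and $k$ should ultimately be tuned by cross-validation as shown later on this page):

```python
import numpy as np

def heuristic_k(n: int, d: int) -> int:
    """Order of k suggested by the consistency heuristic: k ~ n^(4/(d+4))."""
    return max(1, int(round(n ** (4 / (d + 4)))))

for n in (100, 1_000, 10_000, 100_000):
    ks = {d: heuristic_k(n, d) for d in (1, 2, 5, 10)}
    print(f"n={n:>7}  sqrt(n)~{int(np.sqrt(n)):>4}  heuristic k by dimension: {ks}")
```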
The optimal rate $O(n^{-4/(d+4)})$ means: in 1D, $\text{MSE} \sim n^{-4/5}$; in 10D, $\text{MSE} \sim n^{-4/14} \approx n^{-0.29}$; in 100D, $\text{MSE} \sim n^{-4/104} \approx n^{-0.04}$. High-dimensional KNN regression requires exponentially more data for the same accuracy.
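To make "exponentially more data" concrete, a quick back-of-the-envelope sketch that inverts $\text{MSE} \sim n^{-4/(d+4)}$ for the sample size needed to reach a fixed error level (constants are ignored, so only the relative growth across dimensions is meaningful):

```python
def samples_needed(target_mse: float, d: int) -> float:
    """Invert MSE ~ n^(-4/(d+4)) (constants ignored) to get the required n."""
    return target_mse ** (-(d + 4) / 4)

for d in (1, 2, 5, 10, 20):
    print(f"d={d:>2}: n for MSE ~ 0.01 is roughly {samples_needed(0.01, d):,.0f}")
```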
Let's build a production-quality KNN regressor incorporating best practices from everything we've learned.
```python
import numpy as np
from scipy.spatial import KDTree
from typing import Tuple, Optional, Literal


class KNNRegressor:
    """
    Production-quality K-Nearest Neighbors Regressor.

    Features:
    - Multiple weighting schemes
    - Adaptive bandwidth
    - Local linear option
    - Efficient KD-tree search
    - Feature scaling
    - Confidence estimates
    """

    def __init__(self,
                 k: int = 5,
                 weights: Literal['uniform', 'distance', 'gaussian'] = 'distance',
                 algorithm: Literal['constant', 'linear'] = 'constant',
                 power: float = 2.0,
                 leaf_size: int = 30):
        """
        Parameters
        ----------
        k : number of neighbors
        weights : 'uniform', 'distance' (inverse distance), 'gaussian'
        algorithm : 'constant' (weighted average) or 'linear' (local linear)
        power : power for inverse distance weighting
        leaf_size : leaf size for KD-tree
        """
        self.k = k
        self.weights = weights
        self.algorithm = algorithm
        self.power = power
        self.leaf_size = leaf_size

        # Fitted attributes
        self.X_train_ = None
        self.y_train_ = None
        self.tree_ = None
        self.feature_means_ = None
        self.feature_stds_ = None
        self.y_mean_ = None
        self.y_std_ = None

    def fit(self, X: np.ndarray, y: np.ndarray,
            scale_features: bool = True,
            scale_target: bool = False):
        """
        Fit KNN regressor.

        Parameters
        ----------
        X : training features, shape (n_samples, n_features)
        y : training targets, shape (n_samples,)
        scale_features : whether to standardize features
        scale_target : whether to standardize target (for stability)
        """
        X = np.atleast_2d(X)
        y = np.asarray(y)

        # Feature scaling
        if scale_features:
            self.feature_means_ = X.mean(axis=0)
            self.feature_stds_ = X.std(axis=0)
            self.feature_stds_[self.feature_stds_ < 1e-10] = 1.0
            X_scaled = (X - self.feature_means_) / self.feature_stds_
        else:
            self.feature_means_ = np.zeros(X.shape[1])
            self.feature_stds_ = np.ones(X.shape[1])
            X_scaled = X

        # Target scaling
        if scale_target:
            self.y_mean_ = y.mean()
            self.y_std_ = y.std()
            if self.y_std_ < 1e-10:
                self.y_std_ = 1.0
            y_scaled = (y - self.y_mean_) / self.y_std_
        else:
            self.y_mean_ = 0.0
            self.y_std_ = 1.0
            y_scaled = y

        self.X_train_ = X_scaled
        self.y_train_ = y_scaled
        self.tree_ = KDTree(X_scaled, leafsize=self.leaf_size)

        return self

    def _compute_weights(self, distances: np.ndarray) -> np.ndarray:
        """Compute weights from distances."""
        if self.weights == 'uniform':
            return np.ones_like(distances)
        elif self.weights == 'distance':
            # Inverse distance with handling for zero
            safe_distances = np.maximum(distances, 1e-10)
            return 1.0 / (safe_distances ** self.power)
        elif self.weights == 'gaussian':
            # Adaptive bandwidth = k-th neighbor distance
            h = distances[-1] + 1e-10
            u = distances / h
            return np.exp(-0.5 * u**2)
        else:
            raise ValueError(f"Unknown weights: {self.weights}")

    def _predict_constant(self, x_query: np.ndarray,
                          indices: np.ndarray,
                          distances: np.ndarray) -> Tuple[float, float]:
        """Predict using local constant (weighted average)."""
        weights = self._compute_weights(distances)
        weights = weights / weights.sum()
        y_neighbors = self.y_train_[indices]

        prediction = np.sum(weights * y_neighbors)

        # Variance estimate (for confidence)
        variance = np.sum(weights * (y_neighbors - prediction)**2)
        std_est = np.sqrt(variance + 1e-10)

        return prediction, std_est

    def _predict_linear(self, x_query: np.ndarray,
                        indices: np.ndarray,
                        distances: np.ndarray) -> Tuple[float, float]:
        """Predict using local linear regression."""
        d = x_query.shape[0]
        weights = self._compute_weights(distances)

        X_local = self.X_train_[indices]
        y_local = self.y_train_[indices]

        # Centered design matrix
        X_centered = X_local - x_query
        design = np.column_stack([np.ones(len(indices)), X_centered])

        # Weighted least squares
        W = np.diag(weights)
        XtWX = design.T @ W @ design
        XtWy = design.T @ W @ y_local

        # Regularization for stability
        XtWX += 1e-6 * np.eye(d + 1)

        try:
            beta = np.linalg.solve(XtWX, XtWy)
            prediction = beta[0]

            # Residual variance estimate
            residuals = y_local - design @ beta
            mse = np.sum(weights * residuals**2) / (weights.sum() + 1e-10)
            std_est = np.sqrt(mse)
        except np.linalg.LinAlgError:
            # Fall back to weighted average
            return self._predict_constant(x_query, indices, distances)

        return prediction, std_est

    def predict(self, X: np.ndarray, return_std: bool = False) -> np.ndarray:
        """
        Predict regression target.

        Parameters
        ----------
        X : query points, shape (n_queries, n_features)
        return_std : if True, also return standard deviation estimates

        Returns
        -------
        predictions : shape (n_queries,)
        std_estimates : shape (n_queries,) if return_std=True
        """
        X = np.atleast_2d(X)

        # Scale features
        X_scaled = (X - self.feature_means_) / self.feature_stds_

        # Query KD-tree
        distances, indices = self.tree_.query(X_scaled, k=self.k)

        predictions = []
        std_estimates = []

        for i, (x_query, dists, idx) in enumerate(zip(X_scaled, distances, indices)):
            if self.algorithm == 'constant':
                pred, std = self._predict_constant(x_query, idx, dists)
            else:  # linear
                pred, std = self._predict_linear(x_query, idx, dists)

            predictions.append(pred)
            std_estimates.append(std)

        # Inverse transform predictions
        predictions = np.array(predictions) * self.y_std_ + self.y_mean_
        std_estimates = np.array(std_estimates) * self.y_std_

        if return_std:
            return predictions, std_estimates
        return predictions

    def score(self, X: np.ndarray, y: np.ndarray) -> float:
        """Compute R² score."""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred)**2)
        ss_tot = np.sum((y - y.mean())**2)
        return 1 - ss_res / (ss_tot + 1e-10)


# Demonstration
np.random.seed(42)

# Complex nonlinear function
def true_function(x):
    return np.sin(2 * x[:, 0]) + 0.5 * x[:, 1]**2 - x[:, 0] * x[:, 1]

# Generate data
n_train, n_test = 500, 100
X_train = np.random.uniform(-2, 2, (n_train, 2))
y_train = true_function(X_train) + 0.3 * np.random.randn(n_train)

X_test = np.random.uniform(-2, 2, (n_test, 2))
y_test = true_function(X_test)

print("KNN Regressor Comparison:")
print("=" * 60)

configs = [
    {'k': 5, 'weights': 'uniform', 'algorithm': 'constant'},
    {'k': 10, 'weights': 'distance', 'algorithm': 'constant'},
    {'k': 20, 'weights': 'gaussian', 'algorithm': 'constant'},
    {'k': 20, 'weights': 'gaussian', 'algorithm': 'linear'},
]

for cfg in configs:
    model = KNNRegressor(**cfg)
    model.fit(X_train, y_train)

    y_pred, y_std = model.predict(X_test, return_std=True)
    r2 = model.score(X_test, y_test)
    rmse = np.sqrt(np.mean((y_pred - y_test)**2))

    print(f"\nk={cfg['k']}, weights={cfg['weights']}, algo={cfg['algorithm']}")
    print(f"  R² = {r2:.4f}")
    print(f"  RMSE = {rmse:.4f}")
    print(f"  Mean uncertainty = {y_std.mean():.4f}")
```

Selecting the right hyperparameters is crucial for KNN regression performance. The main hyperparameters are:
1. Number of Neighbors (k):
The most important hyperparameter. Small $k$ (3-10) tracks fine structure but is noisy; large $k$ smooths more. A common starting point is $k \approx \sqrt{n}$, and the consistency heuristic $k \approx n^{4/(d+4)}$ gives the right order of growth; in practice, choose $k$ by cross-validation over a logarithmic grid, as in the tuning code below.
2. Weighting Scheme: Uniform weights are the simplest; inverse-distance weighting usually helps, especially near neighborhood boundaries; Gaussian weights with an adaptive bandwidth (the distance to the $k$-th neighbor) give the smoothest transitions. Distance weighting is a reasonable default.
3. Algorithm (Constant vs Linear): The local constant (weighted average) is fast and robust; the local linear fit reduces bias at boundaries and in regions with strong trends, at the cost of more computation and a need for enough neighbors to fit the local model (see the weighted least-squares sketch after this list).
4. Distance Metric: Euclidean distance ($p = 2$) on standardized features is the default; Manhattan distance ($p = 1$) is often more robust when there are many features or outlying values. The grid search below includes both via the Minkowski parameter $p$.
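For reference, the local linear option mentioned in item 3 solves a weighted least-squares problem centered at the query point (this is what the `_predict_linear` method in the implementation above computes):

$$(\hat{\beta}_0, \hat{\boldsymbol{\beta}}_1) = \arg\min_{\beta_0,\, \boldsymbol{\beta}_1} \sum_{i \in N_k(\mathbf{x})} w_i \left(y_i - \beta_0 - \boldsymbol{\beta}_1^\top (\mathbf{x}_i - \mathbf{x})\right)^2, \qquad \hat{f}(\mathbf{x}) = \hat{\beta}_0.$$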
```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


def tune_knn_regressor(X: np.ndarray, y: np.ndarray,
                       k_range: tuple = (3, 50),
                       cv: int = 5) -> dict:
    """
    Tune KNN regressor using grid search with cross-validation.

    Returns optimal hyperparameters and cross-validation scores.
    """
    # Create pipeline with scaling
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsRegressor())
    ])

    # Determine k candidates based on data size
    n = len(y)
    k_min = max(k_range[0], 1)
    k_max = min(k_range[1], n // 2)

    # Logarithmically spaced k values
    k_candidates = np.unique(np.logspace(
        np.log10(k_min), np.log10(k_max), num=15
    ).astype(int))

    # Parameter grid
    param_grid = {
        'knn__n_neighbors': k_candidates,
        'knn__weights': ['uniform', 'distance'],
        'knn__p': [1, 2],  # Manhattan (1) vs Euclidean (2)
    }

    # Grid search
    grid_search = GridSearchCV(
        pipeline, param_grid,
        cv=cv,
        scoring='neg_mean_squared_error',
        n_jobs=-1,
        verbose=0
    )
    grid_search.fit(X, y)

    # Extract results
    best_params = grid_search.best_params_
    best_score = -grid_search.best_score_  # Negate to get MSE

    # Get CV scores for best estimator
    cv_results = grid_search.cv_results_
    best_idx = grid_search.best_index_

    results = {
        'best_k': best_params['knn__n_neighbors'],
        'best_weights': best_params['knn__weights'],
        'best_p': best_params['knn__p'],
        'best_mse': best_score,
        'best_rmse': np.sqrt(best_score),
        'cv_std': cv_results['std_test_score'][best_idx],
        'all_k_tested': k_candidates.tolist(),
    }

    return results


def quick_k_selection(X: np.ndarray, y: np.ndarray, cv: int = 5) -> int:
    """
    Quick heuristic k selection using cross-validation on a few candidates.
    """
    n = len(y)

    # Candidates based on data size
    candidates = [
        max(3, int(np.sqrt(n) / 2)),
        max(5, int(np.sqrt(n))),
        max(7, int(np.sqrt(n) * 2)),
        max(10, int(n ** 0.4)),
    ]
    candidates = sorted(list(set(candidates)))

    best_k = candidates[0]
    best_score = -np.inf

    for k in candidates:
        knn = Pipeline([
            ('scaler', StandardScaler()),
            ('knn', KNeighborsRegressor(n_neighbors=k, weights='distance'))
        ])
        scores = cross_val_score(knn, X, y, cv=cv,
                                 scoring='neg_mean_squared_error')
        mean_score = scores.mean()

        if mean_score > best_score:
            best_score = mean_score
            best_k = k

    return best_k


# Demonstration
np.random.seed(42)

# Generate dataset
n = 300
X = np.random.uniform(-3, 3, (n, 3))
y = np.sin(X[:, 0]) + X[:, 1]**2 - X[:, 0]*X[:, 2] + 0.5*np.random.randn(n)

print("KNN Hyperparameter Tuning:")
print("=" * 60)

# Quick selection
quick_k = quick_k_selection(X, y)
print(f"\nQuick k selection: k = {quick_k}")

# Full grid search
results = tune_knn_regressor(X, y)
print(f"\nGrid Search Results:")
for key, val in results.items():
    if key != 'all_k_tested':
        print(f"  {key}: {val}")
```

For honest performance estimation, use nested cross-validation: an outer loop for performance estimation and an inner loop for hyperparameter selection. This prevents optimistic bias from hyperparameter tuning on the same data used for evaluation.
KNN regression is highly sensitive to feature preprocessing. Unlike tree-based methods, KNN uses distances directly, so feature scales and transformations profoundly affect results.
Essential Preprocessing Steps:
1. Standardization (Critical):
Always standardize features to zero mean and unit variance: $$x_j' = \frac{x_j - \mu_j}{\sigma_j}$$
Without standardization, features with larger scales dominate the distance calculation.
2. Handling Categorical Features:
KNN with Euclidean distance doesn't naturally handle categoricals. Options include one-hot encoding (used in the pipeline below), which treats each category as its own dimension; ordinal encoding only when the categories have a genuine order; or a mixed-type distance such as Gower distance. After encoding, the dummy columns also participate in the distance, so scaling choices still matter.
3. Missing Value Handling:
Options include dropping incomplete rows (only viable when few values are missing), simple imputation with the median or a constant (as in the pipeline below), or KNN-based imputation, which fills each missing value from the nearest complete neighbors; a sketch of the latter follows.
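A minimal sketch of KNN-based imputation using scikit-learn's `KNNImputer` (the toy array is illustrative only; in a full pipeline the imputer is just another step ahead of the scaler and regressor):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Toy feature matrix with missing entries (illustrative only)
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 6.0, 9.0],
              [7.0, 8.0, 12.0]])

# Each missing entry is filled from the corresponding feature of the
# 2 nearest rows (distances are computed over the observed features only)
imputer = KNNImputer(n_neighbors=2, weights='distance')
print(imputer.fit_transform(X))

# The imputer can also sit at the front of a KNN regression pipeline
pipeline = make_pipeline(KNNImputer(n_neighbors=5),
                         StandardScaler(),
                         KNeighborsRegressor(n_neighbors=10, weights='distance'))
```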
```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsRegressor


def create_knn_pipeline(numeric_features: list,
                        categorical_features: list = None,
                        use_pca: bool = False,
                        pca_components: int = 10,
                        robust_scaling: bool = False,
                        k: int = 10) -> Pipeline:
    """
    Create a complete preprocessing + KNN pipeline.
    """
    from sklearn.preprocessing import OneHotEncoder

    # Choose scaler
    scaler = RobustScaler() if robust_scaling else StandardScaler()

    # Numeric preprocessing
    numeric_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', scaler),
    ])

    # Categorical preprocessing
    if categorical_features:
        categorical_transformer = Pipeline([
            ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
            ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
        ])
        preprocessor = ColumnTransformer([
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features),
        ])
    else:
        preprocessor = ColumnTransformer([
            ('num', numeric_transformer, numeric_features),
        ])

    # Build pipeline
    steps = [('preprocess', preprocessor)]
    if use_pca:
        steps.append(('pca', PCA(n_components=pca_components)))
    steps.append(('knn', KNeighborsRegressor(n_neighbors=k, weights='distance')))

    return Pipeline(steps)


# Demonstration: Impact of preprocessing
np.random.seed(42)

# Generate data with very different feature scales
X_raw = np.column_stack([
    np.random.uniform(0, 1, 200),          # Feature 1: [0, 1]
    np.random.uniform(0, 1000, 200),       # Feature 2: [0, 1000]
    np.random.uniform(-0.001, 0.001, 200)  # Feature 3: [-0.001, 0.001]
])

# Target depends on all features equally
y = X_raw[:, 0] * 10 + X_raw[:, 1] / 100 + X_raw[:, 2] * 10000 + 0.5*np.random.randn(200)

# Split
X_train, X_test = X_raw[:150], X_raw[150:]
y_train, y_test = y[:150], y[150:]

print("Impact of Feature Scaling on KNN Regression:")
print("=" * 60)

# Without scaling
knn_raw = KNeighborsRegressor(n_neighbors=10, weights='distance')
knn_raw.fit(X_train, y_train)
r2_raw = knn_raw.score(X_test, y_test)
print(f"\nWithout scaling: R² = {r2_raw:.4f}")

# With scaling
from sklearn.pipeline import make_pipeline
knn_scaled = make_pipeline(StandardScaler(),
                           KNeighborsRegressor(n_neighbors=10, weights='distance'))
knn_scaled.fit(X_train, y_train)
r2_scaled = knn_scaled.score(X_test, y_test)
print(f"With StandardScaler: R² = {r2_scaled:.4f}")

# With robust scaling
knn_robust = make_pipeline(RobustScaler(),
                           KNeighborsRegressor(n_neighbors=10, weights='distance'))
knn_robust.fit(X_train, y_train)
r2_robust = knn_robust.score(X_test, y_test)
print(f"With RobustScaler: R² = {r2_robust:.4f}")

print(f"\nImprovement from scaling: {(r2_scaled - r2_raw) / (1 - r2_raw) * 100:.1f}% of remaining error")
```

KNN regression has specific scenarios where it outperforms other methods. Understanding these helps you choose appropriately.
When KNN regression excels:
• Small to medium d (d ≤ 10)
• Large n (n ≥ 1000)
• Complex, unknown relationships
• Spatial or temporal data
• Need for local explanations
• Prototype-based prediction
When to prefer alternatives:
• High d (d > 20)
• Small n
• Simple linear relationships
• Extrapolation required
• Real-time, low-latency needs
• Mixed feature types
Empirical Comparison:
Studies comparing KNN to other regressors (linear models, tree ensembles, neural networks) consistently find that a properly prepared KNN (scaled features, distance weighting, cross-validated k) is a competitive baseline on low-dimensional, densely sampled problems, while tree ensembles tend to win on heterogeneous, higher-dimensional tabular data.
The practical lesson: try KNN as a baseline, but expect tree ensembles to outperform on most tabular benchmarks.
How does KNN regression compare to the main alternatives? Let's analyze systematically.
| Criterion | KNN | Linear Regression | Random Forest | Gradient Boosting | Neural Network |
|---|---|---|---|---|---|
| Dimensionality | Struggles d>10 | Any d | Handles high d | Handles high d | Handles very high d |
| Sample size needed | Moderate to high | Low | Moderate | Moderate | High |
| Nonlinearity | Automatic | Needs features | Automatic | Automatic | Automatic |
| Interpretability | High (examples) | High (coefficients) | Moderate (importance) | Low | Very low |
| Training speed | O(1) or O(n log n) | O(nd²) | O(T·n log n) | O(T·n log n) | O(T·n·d) |
| Prediction speed | O(n) or O(log n) | O(d) | O(T log n) | O(T log n) | O(d) |
| Extrapolation | Poor | Linear extrapolation | Poor | Poor | Variable |
| Missing data | Needs handling | Needs handling | Native support | Native support | Needs handling |
Key Differentiators:
vs. Linear Regression: KNN captures nonlinear relationships automatically, with no feature engineering, but it needs more data, cannot extrapolate beyond the range of the training set, and gives up the compact coefficient-based interpretation.
vs. Random Forests: Both adapt to nonlinearity automatically, but random forests cope better with high dimensionality, irrelevant features, and missing values, and their predictions are cheaper to serve. KNN keeps the advantage of exact neighbor-based explanations and requires essentially no training.
vs. Gradient Boosting (XGBoost, LightGBM): Gradient boosting is usually more accurate on tabular data and is the stronger default; KNN remains valuable as a fast baseline and when predictions must be justified by pointing at similar training examples.
vs. Neural Networks: Neural networks handle very high dimensions and huge datasets and predict in O(d), but demand far more data and tuning. KNN is training-free and easier to reason about, at the cost of prediction time that grows with the training set.
For general tabular regression, gradient boosting (XGBoost, LightGBM, CatBoost) is the current best practice. Use KNN when: (1) you need neighbor-based explanations, (2) dimension is low and you have dense data, (3) you're building a quick baseline, or (4) you're doing spatial/geographic prediction.
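A minimal sketch of the recommended workflow, comparing a scaled KNN baseline against a gradient boosting model by cross-validation (the synthetic data and the fixed k = 15 are placeholders; in practice you would use your own dataset and tune k as described earlier):

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (500, 4))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + 0.3 * rng.standard_normal(500)

models = {
    'KNN baseline': make_pipeline(StandardScaler(),
                                  KNeighborsRegressor(n_neighbors=15, weights='distance')),
    'Gradient boosting': HistGradientBoostingRegressor(random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(f"{name:>18}: R² = {scores.mean():.3f} ± {scores.std():.3f}")
```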
Deploying KNN regression in production involves unique challenges compared to parametric models.
```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
import pickle
import time


class ProductionKNNRegressor:
    """
    Production-ready KNN regressor with serialization,
    monitoring, and fallback capabilities.
    """

    def __init__(self,
                 k: int = 10,
                 min_samples: int = 50,
                 fallback_value: float = None,
                 max_memory_mb: float = 1000):
        self.k = k
        self.min_samples = min_samples
        self.fallback_value = fallback_value
        self.max_memory_mb = max_memory_mb

        self.model = None
        self.scaler = None
        self.n_samples = 0
        self.feature_names = None
        self.training_stats = {}

    def fit(self, X: np.ndarray, y: np.ndarray, feature_names: list = None):
        """Fit with production safeguards."""
        X = np.asarray(X)
        y = np.asarray(y)

        # Check memory constraint
        memory_mb = X.nbytes / 1e6
        if memory_mb > self.max_memory_mb:
            # Subsample to fit memory
            n_keep = int(len(X) * self.max_memory_mb / memory_mb)
            indices = np.random.choice(len(X), n_keep, replace=False)
            X = X[indices]
            y = y[indices]
            print(f"Subsampled to {n_keep} samples to fit memory constraint")

        self.n_samples = len(X)
        self.feature_names = feature_names

        # Store training statistics for monitoring
        self.training_stats = {
            'feature_means': X.mean(axis=0),
            'feature_stds': X.std(axis=0),
            'target_mean': y.mean(),
            'target_std': y.std(),
            'n_samples': len(X),
        }

        # Set fallback if not provided
        if self.fallback_value is None:
            self.fallback_value = y.mean()

        # Fit scaler and model
        self.scaler = StandardScaler()
        X_scaled = self.scaler.fit_transform(X)

        self.model = KNeighborsRegressor(
            n_neighbors=min(self.k, len(X) - 1),
            weights='distance',
            algorithm='kd_tree'
        )
        self.model.fit(X_scaled, y)

        return self

    def predict(self, X: np.ndarray, return_confidence: bool = False):
        """Predict with fallback and confidence."""
        X = np.atleast_2d(X)

        # Check if model is fitted
        if self.model is None or self.n_samples < self.min_samples:
            predictions = np.full(len(X), self.fallback_value)
            confidences = np.zeros(len(X))
            if return_confidence:
                return predictions, confidences
            return predictions

        # Scale features
        try:
            X_scaled = self.scaler.transform(X)
        except Exception as e:
            # Feature mismatch or other error
            predictions = np.full(len(X), self.fallback_value)
            if return_confidence:
                return predictions, np.zeros(len(X))
            return predictions

        # Predict
        predictions = self.model.predict(X_scaled)

        if return_confidence:
            # Estimate confidence from neighbor distance consistency
            distances, indices = self.model.kneighbors(X_scaled)
            # Confidence = inverse of relative distance spread
            mean_dist = distances.mean(axis=1)
            std_dist = distances.std(axis=1)
            confidences = 1 / (1 + std_dist / (mean_dist + 1e-10))
            return predictions, confidences

        return predictions

    def check_feature_drift(self, X: np.ndarray, threshold: float = 2.0) -> dict:
        """Check for feature drift from training distribution."""
        X = np.atleast_2d(X)

        # Compute z-scores relative to training
        z_scores = np.abs(
            (X.mean(axis=0) - self.training_stats['feature_means']) /
            (self.training_stats['feature_stds'] + 1e-10)
        )

        drift_detected = z_scores > threshold

        return {
            'drift_detected': drift_detected.any(),
            'features_with_drift': np.where(drift_detected)[0].tolist(),
            'z_scores': z_scores.tolist(),
        }

    def save(self, filepath: str):
        """Serialize model to disk."""
        state = {
            'model': self.model,
            'scaler': self.scaler,
            'k': self.k,
            'n_samples': self.n_samples,
            'fallback_value': self.fallback_value,
            'training_stats': self.training_stats,
        }
        with open(filepath, 'wb') as f:
            pickle.dump(state, f)

    @classmethod
    def load(cls, filepath: str):
        """Load model from disk."""
        with open(filepath, 'rb') as f:
            state = pickle.load(f)

        instance = cls(k=state['k'])
        instance.model = state['model']
        instance.scaler = state['scaler']
        instance.n_samples = state['n_samples']
        instance.fallback_value = state['fallback_value']
        instance.training_stats = state['training_stats']

        return instance


# Usage demonstration
np.random.seed(42)

# Training data
X_train = np.random.randn(1000, 5)
y_train = X_train.sum(axis=1) + 0.5 * np.random.randn(1000)

# Create and fit
model = ProductionKNNRegressor(k=15, min_samples=50)
model.fit(X_train, y_train)

# Normal prediction with confidence
X_test = np.random.randn(10, 5)
preds, confs = model.predict(X_test, return_confidence=True)
print("Production KNN Regression:")
print("=" * 50)
print(f"Predictions: {preds[:3]}")
print(f"Confidences: {confs[:3]}")

# Check for drift
X_drifted = np.random.randn(100, 5) + 3  # Shifted distribution
drift_report = model.check_feature_drift(X_drifted)
print(f"\nDrift check: {drift_report}")
```

Congratulations! You have completed the Weighted KNN module. You now understand distance weighting, kernel methods, adaptive neighborhoods, local models, and KNN regression at a depth suitable for research and production applications. The next module explores KNN Variants: specialized modifications for specific scenarios like condensed NN, edited NN, and metric learning.