Throughout this module, we've seen that increasing K transforms decision surfaces from jagged Voronoi tessellations to smooth, averaged boundaries. This isn't just an empirical observation—it reflects a deep connection between KNN and kernel-based smoothing methods.
The parameter K functions as an effective bandwidth, controlling how much local structure is averaged into each prediction: small K sees the trees; large K sees the forest. Understanding this smoothing perspective provides both theoretical insight into KNN's bias-variance behavior and practical guidance for choosing K.
By the end of this page, you will understand K as an implicit bandwidth parameter, the connection between KNN and kernel density estimation, how weighted KNN creates smoother transitions, and practical guidelines for K selection based on smoothing considerations.
In kernel smoothing, the bandwidth parameter h controls how wide a neighborhood influences each prediction. KNN has no explicit bandwidth—instead, K implicitly determines a spatially-varying bandwidth.
The K-Neighborhood Radius:
For a query point q, define r_K(q) as the distance from q to its K-th nearest neighbor. This radius varies across space: $$r_K(q) = \lVert q - x_{(K)} \rVert$$
where x₍ₖ₎ is the K-th closest training point to q.
Key Insight: The effective bandwidth is r_K(q), not a fixed constant. In dense regions, r_K is small (tight, local averaging). In sparse regions, r_K is large (broad, more smoothing).
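A quick way to see this adaptivity is to compute r_K(q) directly for queries in dense and sparse parts of a dataset. The sketch below uses an illustrative mixture of a tight cluster and a sparse background (all numbers are arbitrary choices) and needs only NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a tight cluster (dense region) plus scattered background points
X = np.concatenate([
    rng.normal(0.0, 0.5, size=(500, 2)),   # dense cluster around the origin
    rng.uniform(-10, 10, size=(50, 2)),    # sparse background
])

def k_radius(X, q, k):
    """r_K(q): distance from query q to its k-th nearest training point."""
    d = np.linalg.norm(X - q, axis=1)
    return np.sort(d)[k - 1]

K = 10
q_dense = np.array([0.0, 0.0])    # query inside the dense cluster
q_sparse = np.array([8.0, 8.0])   # query out in the sparse region

print(f"r_K at dense query : {k_radius(X, q_dense, K):.3f}")   # small effective bandwidth
print(f"r_K at sparse query: {k_radius(X, q_sparse, K):.3f}")  # much larger effective bandwidth
```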
Comparison to Fixed-Bandwidth Methods:
| Property | Fixed Bandwidth (Kernel) | KNN (Adaptive) |
|---|---|---|
| Neighborhood size | Fixed (same h everywhere) | Variable (adapts to density) |
| Dense regions | Many neighbors, stable | K neighbors, small radius |
| Sparse regions | Few neighbors, unstable | K neighbors, large radius |
| Edge behavior | May have no neighbors | Always K neighbors |
KNN's adaptive bandwidth is a key advantage: it naturally handles varying data density without manual bandwidth selection. In dense regions, it avoids over-smoothing. In sparse regions, it avoids high variance from too few neighbors. This adaptivity is why KNN often outperforms fixed-bandwidth methods on heterogeneous data.
Scaling Analysis:
For uniformly distributed data in d dimensions with n points: $$\mathbb{E}[r_K] \propto \left(\frac{K}{n}\right)^{1/d}$$
Implications:
- For fixed K, the radius shrinks as n grows, at the rate n^(-1/d): more data makes every prediction more local.
- Increasing K at fixed n widens the radius, which is exactly the extra smoothing seen throughout this module.
- The exponent 1/d is the curse of dimensionality at work: in high dimensions the radius shrinks very slowly with n, so even the "nearest" neighbors are far away and the effective bandwidth stays large.
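The scaling law is easy to check empirically. The sketch below estimates E[r_K] by Monte Carlo for uniform data on the unit cube and compares it with (K/n)^(1/d); the proportionality constant depends on the dimension (and boundary effects appear in higher dimensions), so only the within-dimension scaling should be read off. All parameter choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_k_radius(n, d, k, n_queries=200):
    """Monte Carlo estimate of E[r_K] for uniform data on the unit cube [0, 1]^d."""
    X = rng.uniform(size=(n, d))
    Q = rng.uniform(size=(n_queries, d))
    radii = [np.sort(np.linalg.norm(X - q, axis=1))[k - 1] for q in Q]
    return np.mean(radii)

n, k = 5000, 10
for d in (1, 2, 5, 10):
    est = mean_k_radius(n, d, k)
    theory = (k / n) ** (1 / d)   # proportional to E[r_K], not equal to it
    print(f"d={d:2d}   E[r_K] ~ {est:.3f}   (K/n)^(1/d) = {theory:.3f}   ratio = {est / theory:.2f}")
```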
KNN can be viewed as kernel regression with a specific, adaptive kernel. This perspective unifies KNN with the broader family of local smoothing methods.
KNN as Uniform Kernel Regression:
The K-NN prediction for regression: $$\hat{f}(q) = \frac{1}{K}\sum_{i \in N_K(q)} y_i$$
is equivalent to kernel regression with a uniform (box) kernel: $$\hat{f}(q) = \frac{\sum_i K_h(q, x_i) y_i}{\sum_i K_h(q, x_i)}$$
where the kernel is: $$K_h(q, x_i) = \begin{cases} 1 & \text{if } x_i \in N_K(q) \\ 0 & \text{otherwise} \end{cases}$$
Why This Matters:
- It places KNN inside the standard theory of kernel smoothers, so intuition about bandwidth, bias, and variance carries over with the fixed h replaced by the adaptive radius r_K(q).
- The box kernel's hard cutoff explains the blocky, piecewise-constant look of KNN fits, and swapping in a smoother, decaying kernel is exactly the motivation for the distance-weighted variants discussed below.

The demonstration below plots KNN regression fits for two values of K and shades the K-neighborhood (the support of the box kernel) around one query point.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor


def compare_knn_kernel_regression():
    """
    Demonstrate that KNN is kernel regression with an adaptive box kernel.
    """
    np.random.seed(42)

    # Generate noisy data
    n = 100
    X = np.sort(np.random.uniform(0, 10, n))
    y = np.sin(X) + np.random.randn(n) * 0.3

    # Query points
    X_plot = np.linspace(0, 10, 200)

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Different K values
    for ax, k in zip(axes, [5, 25]):
        # KNN prediction
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(X.reshape(-1, 1), y)
        y_knn = knn.predict(X_plot.reshape(-1, 1))

        # Plot
        ax.scatter(X, y, alpha=0.5, label='Data')
        ax.plot(X_plot, np.sin(X_plot), 'g--', lw=2, label='True function')
        ax.plot(X_plot, y_knn, 'r-', lw=2, label=f'{k}-NN')

        # Show neighborhood for one query point
        q = 5.0
        distances = np.abs(X - q)
        k_nearest_idx = np.argsort(distances)[:k]
        r_k = distances[k_nearest_idx[-1]]

        ax.axvspan(q - r_k, q + r_k, alpha=0.2, color='red',
                   label=f'K-neighborhood at q={q}')
        ax.axvline(q, color='red', linestyle=':', lw=1)

        ax.set_title(f'K = {k}: Effective bandwidth = {2*r_k:.2f}', fontweight='bold')
        ax.legend(loc='best')
        ax.set_xlabel('x')
        ax.set_ylabel('y')

    plt.suptitle('KNN as Adaptive Kernel Regression', fontsize=14, y=1.02)
    plt.tight_layout()
    return fig


if __name__ == "__main__":
    fig = compare_knn_kernel_regression()
    plt.savefig("knn_kernel_equivalence.png", dpi=150)
    plt.show()
```

Standard KNN treats all K neighbors equally: the 1st nearest neighbor has the same vote as the K-th. Distance-weighted KNN assigns higher weight to closer neighbors, creating smoother transitions.
Weighting Schemes:
- Inverse distance: $w_i = 1 / d(q, x_i)$, so closer neighbors count more (this is scikit-learn's weights='distance').
- Inverse squared distance: $w_i = 1 / d(q, x_i)^2$, a sharper falloff.
- Kernel weights: $w_i = K_h(d(q, x_i))$ for a smooth decaying kernel such as the Gaussian, which applies Nadaraya-Watson-style smoothing restricted to the K nearest neighbors.
Weighted Prediction:
For classification: $$\hat{y}(q) = \arg\max_c \sum_{i \in N_K(q)} w_i \cdot \mathbb{1}[y_i = c]$$
For regression: $$\hat{f}(q) = \frac{\sum_{i \in N_K(q)} w_i \cdot y_i}{\sum_{i \in N_K(q)} w_i}$$
In scikit-learn, set weights='distance' for inverse-distance weighting: KNeighborsClassifier(n_neighbors=k, weights='distance'). For custom weight functions, pass a callable that takes an array of distances and returns an array of weights of the same shape.
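Here is a minimal sketch of both options; the make_moons dataset, the neighborhood size, and the Gaussian bandwidth are illustrative choices, not prescriptions:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative dataset
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

# Built-in inverse-distance weighting: w_i = 1 / d_i
knn_inverse = KNeighborsClassifier(n_neighbors=15, weights='distance')

# Custom weighting: a Gaussian kernel over the distances (bandwidth is an arbitrary illustrative value)
def gaussian_weights(distances, bandwidth=0.5):
    return np.exp(-(distances / bandwidth) ** 2)

knn_gaussian = KNeighborsClassifier(n_neighbors=15, weights=gaussian_weights)

for name, model in [('uniform', KNeighborsClassifier(n_neighbors=15)),
                    ('inverse distance', knn_inverse),
                    ('gaussian kernel', knn_gaussian)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:>16s}: CV accuracy = {score:.3f}")
```

Distance weighting tends to matter most for larger K, where it softens the influence of the outermost, least relevant neighbors.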
The smoothing perspective provides intuition for the bias-variance tradeoff in KNN.
Variance Reduction Through Averaging:
Each prediction averages K labels. Treating the neighbors' labels as roughly independent, the variance of their average shrinks in proportion to the number of terms: $$\text{Var}[\hat{y}] \propto \frac{1}{K}$$
Larger K → more averaging → lower variance. This is why K = 1 is high-variance (single unpredictable neighbor) while K = 100 is low-variance (average of 100 neighbors).
Bias Through Over-Smoothing:
Larger neighborhoods include points further from q, whose true labels may differ: $$\text{Bias}[\hat{y}] \propto r_K^\beta$$
where β depends on the smoothness of the true decision boundary. Larger K → larger radius → higher bias.
Optimal Smoothing Level:
Balancing these effects, the optimal K typically satisfies: $$K^* \approx n^{\frac{2\beta}{2\beta + d}}$$
For smooth boundaries (high β) in low dimensions, larger K is optimal. For complex boundaries in high dimensions, smaller K is needed.
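These opposing trends show up clearly in a small simulation: regenerate a noisy training set many times, predict at a fixed query point with K-NN regression, and split the error into squared bias and variance. The setup below (a sine target, Gaussian noise, query at x = 5, 300 repetitions) is an illustrative sketch, not a benchmark:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n, noise = 200, 0.3
q = np.array([[5.0]])            # fixed query point
true_value = np.sin(5.0)         # true regression function value at q

for k in (1, 5, 25, 100):
    preds = []
    for _ in range(300):         # repeatedly redraw the training set
        X = rng.uniform(0, 10, size=(n, 1))
        y = np.sin(X.ravel()) + rng.normal(0, noise, size=n)
        model = KNeighborsRegressor(n_neighbors=k).fit(X, y)
        preds.append(model.predict(q)[0])
    preds = np.array(preds)
    bias2 = (preds.mean() - true_value) ** 2   # squared bias at q
    var = preds.var()                          # variance across training sets
    print(f"K={k:3d}   bias^2={bias2:.4f}   variance={var:.4f}   total={bias2 + var:.4f}")
```

Variance falls roughly like 1/K while the squared bias grows as the neighborhood radius expands; the minimum of their sum is the optimal smoothing level described above.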
| Scenario | Recommended K | Rationale |
|---|---|---|
| Low noise, complex boundary | Small (√n/4 to √n/2) | Preserve detail, averaging unnecessary |
| High noise, simple boundary | Large (√n to 2√n) | Heavy averaging to remove noise |
| High dimensions | Larger than low-d | Combat curse of dimensionality |
| Imbalanced classes | Smaller than minority class | Ensure minority can win votes |
| Real-time prediction constraints | As small as accurate | Minimize distance computations |
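In practice these guidelines give a search range rather than a final answer; K is normally chosen by cross-validation. A minimal sketch, assuming a generic classification dataset and a grid centered on √n (the dataset and grid spacing are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Grid of K values spanning roughly sqrt(n)/4 to 2*sqrt(n)
root_n = int(np.sqrt(len(X)))
k_grid = np.unique(np.linspace(max(1, root_n // 4), 2 * root_n, 12).astype(int))

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
search = GridSearchCV(pipe, {'kneighborsclassifier__n_neighbors': list(k_grid)}, cv=5)
search.fit(X, y)

print(f"Best K: {search.best_params_['kneighborsclassifier__n_neighbors']} "
      f"(CV accuracy {search.best_score_:.3f})")
```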
KNN regression fits a constant (average) in each neighborhood. Local regression methods generalize this by fitting more flexible local models.
Hierarchy of Local Methods:
- Local constant (KNN regression, Nadaraya-Watson): predict the (weighted) average of the responses in the neighborhood.
- Local linear (LOESS of degree 1): fit a weighted straight line in each neighborhood and evaluate it at q.
- Local polynomial (degree 2 and higher): fit a weighted quadratic or higher-order model, capturing local curvature at the cost of extra variance.
LOESS/LOWESS:
For each query q:
1. Select the span: the fixed fraction of training points nearest to q.
2. Weight those points by distance, conventionally with the tricube kernel.
3. Fit a weighted linear (or quadratic) regression on the weighted neighborhood.
4. Return the local fit evaluated at q as $\hat{f}(q)$.
Advantage over KNN: The local linear fit reduces boundary bias. At data edges, the linear term extrapolates appropriately rather than just averaging available neighbors.
Use KNN when simplicity matters or for classification. Use LOESS when smooth regression curves are needed and computation is affordable. Local polynomial methods shine for derivative estimation (velocity from position data) where KNN's piecewise constant output fails.
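To make the boundary-bias point concrete, the sketch below implements a simple tricube-weighted local linear fit (a LOESS-style estimate written from scratch, not a reference implementation) and compares it with a plain K-NN average at the right edge of a dataset with a linear trend; all of the specific numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 20
X = np.sort(rng.uniform(0, 10, n))
y = 2.0 * X + rng.normal(0, 0.5, n)      # true function: f(x) = 2x

def knn_predict(q):
    """Local constant fit: plain K-NN average."""
    idx = np.argsort(np.abs(X - q))[:k]
    return y[idx].mean()

def local_linear_predict(q):
    """Tricube-weighted local linear fit, evaluated at q."""
    idx = np.argsort(np.abs(X - q))[:k]
    d = np.abs(X[idx] - q)
    w = (1 - (d / (d.max() + 1e-12)) ** 3) ** 3          # tricube weights
    A = np.column_stack([np.ones(k), X[idx] - q])        # design matrix centered at q
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y[idx] * sw, rcond=None)
    return beta[0]                                       # intercept = fitted value at q

q = 10.0   # query at the right boundary of the data
print(f"True value       : {2 * q:.2f}")
print(f"K-NN average     : {knn_predict(q):.2f}   (pulled toward the interior)")
print(f"Local linear fit : {local_linear_predict(q):.2f}")
```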
Computational Considerations: both KNN and LOESS must find a neighborhood for every query, and that search dominates the cost for large n. LOESS additionally solves a small weighted least-squares problem per query, so it is more expensive than KNN's simple average, though still cheap when the neighborhood size and dimension are modest.
Module Complete:
You have now mastered the geometric foundations of KNN through the lens of Voronoi diagrams and decision surfaces. From Voronoi tessellations to K-smoothing, from the Cover-Hart theorem to practical visualization, this module provides the complete picture of how KNN partitions feature space and why it behaves as it does.
Congratulations! You now understand KNN from a geometric perspective: Voronoi tessellations underlie 1-NN, K transforms boundaries through local averaging, and the neighborhood radius functions as an adaptive bandwidth. This geometric intuition transfers to understanding local methods throughout machine learning.