Throughout this module, we've seen that increasing K transforms decision surfaces from jagged Voronoi tessellations to smooth, averaged boundaries. This isn't just an empirical observation—it reflects a deep connection between KNN and kernel-based smoothing methods.
The parameter K functions as an effective bandwidth, controlling how much local structure is averaged into each prediction: small K sees the trees; large K sees the forest. Understanding this smoothing perspective provides both theoretical insight into KNN's bias-variance behavior and practical guidance for choosing K.
By the end of this page, you will understand K as an implicit bandwidth parameter, the connection between KNN and kernel density estimation, how weighted KNN creates smoother transitions, and practical guidelines for K selection based on smoothing considerations.
In kernel smoothing, the bandwidth parameter h controls how wide a neighborhood influences each prediction. KNN has no explicit bandwidth—instead, K implicitly determines a spatially-varying bandwidth.
The K-Neighborhood Radius:
For a query point q, define r_K(q) as the distance from q to its K-th nearest neighbor. This radius varies across space: $$r_K(q) = \lVert q - x_{(K)} \rVert$$
where x₍ₖ₎ is the K-th closest training point to q.
Key Insight: The effective bandwidth is r_K(q), not a fixed constant. In dense regions, r_K is small (tight, local averaging). In sparse regions, r_K is large (broad, more smoothing).
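A quick way to see this adaptivity is to compute r_K(q) directly for queries in dense and sparse parts of a dataset. The sketch below uses an illustrative mixture of a tight cluster and a sparse background (all numbers are arbitrary choices) and needs only NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a tight cluster (dense region) plus scattered background points
X = np.concatenate([
    rng.normal(0.0, 0.5, size=(500, 2)),   # dense cluster around the origin
    rng.uniform(-10, 10, size=(50, 2)),    # sparse background
])

def k_radius(X, q, k):
    """r_K(q): distance from query q to its k-th nearest training point."""
    d = np.linalg.norm(X - q, axis=1)
    return np.sort(d)[k - 1]

K = 10
q_dense = np.array([0.0, 0.0])    # query inside the dense cluster
q_sparse = np.array([8.0, 8.0])   # query out in the sparse region

print(f"r_K at dense query : {k_radius(X, q_dense, K):.3f}")   # small effective bandwidth
print(f"r_K at sparse query: {k_radius(X, q_sparse, K):.3f}")  # much larger effective bandwidth
```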
Comparison to Fixed-Bandwidth Methods:
| Property | Fixed Bandwidth (Kernel) | KNN (Adaptive) |
|---|---|---|
| Neighborhood size | Fixed (same h everywhere) | Variable (adapts to density) |
| Dense regions | Many neighbors, stable | K neighbors, small radius |
| Sparse regions | Few neighbors, unstable | K neighbors, large radius |
| Edge behavior | May have no neighbors | Always K neighbors |
KNN's adaptive bandwidth is a key advantage: it naturally handles varying data density without manual bandwidth selection. In dense regions, it avoids over-smoothing. In sparse regions, it avoids high variance from too few neighbors. This adaptivity is why KNN often outperforms fixed-bandwidth methods on heterogeneous data.
Scaling Analysis:
For uniformly distributed data in d dimensions with n points: $$\mathbb{E}[r_K] \propto \left(\frac{K}{n}\right)^{1/d}$$
Implications:
- For fixed K, the radius shrinks as n grows, at the rate n^(-1/d): more data makes every prediction more local.
- Increasing K at fixed n widens the radius, which is exactly the extra smoothing seen throughout this module.
- The exponent 1/d is the curse of dimensionality at work: in high dimensions the radius shrinks very slowly with n, so even the "nearest" neighbors are far away and the effective bandwidth stays large.
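The scaling law is easy to check empirically. The sketch below estimates E[r_K] by Monte Carlo for uniform data on the unit cube and compares it with (K/n)^(1/d); the proportionality constant depends on the dimension (and boundary effects appear in higher dimensions), so only the within-dimension scaling should be read off. All parameter choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_k_radius(n, d, k, n_queries=200):
    """Monte Carlo estimate of E[r_K] for uniform data on the unit cube [0, 1]^d."""
    X = rng.uniform(size=(n, d))
    Q = rng.uniform(size=(n_queries, d))
    radii = [np.sort(np.linalg.norm(X - q, axis=1))[k - 1] for q in Q]
    return np.mean(radii)

n, k = 5000, 10
for d in (1, 2, 5, 10):
    est = mean_k_radius(n, d, k)
    theory = (k / n) ** (1 / d)   # proportional to E[r_K], not equal to it
    print(f"d={d:2d}   E[r_K] ~ {est:.3f}   (K/n)^(1/d) = {theory:.3f}   ratio = {est / theory:.2f}")
```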
KNN can be viewed as kernel regression with a specific, adaptive kernel. This perspective unifies KNN with the broader family of local smoothing methods.
KNN as Uniform Kernel Regression:
The K-NN prediction for regression: $$\hat{f}(q) = \frac{1}{K}\sum_{i \in N_K(q)} y_i$$
is equivalent to kernel regression with a uniform (box) kernel: $$\hat{f}(q) = \frac{\sum_i K_h(q, x_i) y_i}{\sum_i K_h(q, x_i)}$$
where the kernel is: $$K_h(q, x_i) = \begin{cases} 1 & \text{if } x_i \in N_K(q) \\ 0 & \text{otherwise} \end{cases}$$
Why This Matters:
- It places KNN inside the standard theory of kernel smoothers, so intuition about bandwidth, bias, and variance carries over with the fixed h replaced by the adaptive radius r_K(q).
- The box kernel's hard cutoff explains the blocky, piecewise-constant look of KNN fits, and swapping in a smoother, decaying kernel is exactly the motivation for the distance-weighted variants discussed below.

The demonstration below plots KNN regression fits for two values of K and shades the K-neighborhood (the support of the box kernel) around one query point.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor


def compare_knn_kernel_regression():
    """
    Demonstrate that KNN is kernel regression with an adaptive box kernel.
    """
    np.random.seed(42)

    # Generate noisy data
    n = 100
    X = np.sort(np.random.uniform(0, 10, n))
    y = np.sin(X) + np.random.randn(n) * 0.3

    # Query points
    X_plot = np.linspace(0, 10, 200)

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Different K values
    for ax, k in zip(axes, [5, 25]):
        # KNN prediction
        knn = KNeighborsRegressor(n_neighbors=k)
        knn.fit(X.reshape(-1, 1), y)
        y_knn = knn.predict(X_plot.reshape(-1, 1))

        # Plot
        ax.scatter(X, y, alpha=0.5, label='Data')
        ax.plot(X_plot, np.sin(X_plot), 'g--', lw=2, label='True function')
        ax.plot(X_plot, y_knn, 'r-', lw=2, label=f'{k}-NN')

        # Show neighborhood for one query point
        q = 5.0
        distances = np.abs(X - q)
        k_nearest_idx = np.argsort(distances)[:k]
        r_k = distances[k_nearest_idx[-1]]

        ax.axvspan(q - r_k, q + r_k, alpha=0.2, color='red',
                   label=f'K-neighborhood at q={q}')
        ax.axvline(q, color='red', linestyle=':', lw=1)

        ax.set_title(f'K = {k}: Effective bandwidth = {2*r_k:.2f}', fontweight='bold')
        ax.legend(loc='best')
        ax.set_xlabel('x')
        ax.set_ylabel('y')

    plt.suptitle('KNN as Adaptive Kernel Regression', fontsize=14, y=1.02)
    plt.tight_layout()
    return fig


if __name__ == "__main__":
    fig = compare_knn_kernel_regression()
    plt.savefig("knn_kernel_equivalence.png", dpi=150)
    plt.show()
```

Standard KNN treats all K neighbors equally: the 1st nearest neighbor has the same vote as the K-th. Distance-weighted KNN assigns higher weight to closer neighbors, creating smoother transitions.
Weighting Schemes:
- Inverse distance: $w_i = 1 / d(q, x_i)$, so closer neighbors count more (this is scikit-learn's weights='distance').
- Inverse squared distance: $w_i = 1 / d(q, x_i)^2$, a sharper falloff.
- Kernel weights: $w_i = K_h(d(q, x_i))$ for a smooth decaying kernel such as the Gaussian, which applies Nadaraya-Watson-style smoothing restricted to the K nearest neighbors.
Weighted Prediction:
For classification: $$\hat{y}(q) = \arg\max_c \sum_{i \in N_K(q)} w_i \cdot \mathbb{1}[y_i = c]$$
For regression: $$\hat{f}(q) = \frac{\sum_{i \in N_K(q)} w_i \cdot y_i}{\sum_{i \in N_K(q)} w_i}$$
In scikit-learn, set weights='distance' for inverse-distance weighting: KNeighborsClassifier(n_neighbors=k, weights='distance'). For custom weight functions, pass a callable that takes an array of distances and returns an array of weights of the same shape.
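Here is a minimal sketch of both options; the make_moons dataset, the neighborhood size, and the Gaussian bandwidth are illustrative choices, not prescriptions:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative dataset
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

# Built-in inverse-distance weighting: w_i = 1 / d_i
knn_inverse = KNeighborsClassifier(n_neighbors=15, weights='distance')

# Custom weighting: a Gaussian kernel over the distances (bandwidth is an arbitrary illustrative value)
def gaussian_weights(distances, bandwidth=0.5):
    return np.exp(-(distances / bandwidth) ** 2)

knn_gaussian = KNeighborsClassifier(n_neighbors=15, weights=gaussian_weights)

for name, model in [('uniform', KNeighborsClassifier(n_neighbors=15)),
                    ('inverse distance', knn_inverse),
                    ('gaussian kernel', knn_gaussian)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:>16s}: CV accuracy = {score:.3f}")
```

Distance weighting tends to matter most for larger K, where it softens the influence of the outermost, least relevant neighbors.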
The smoothing perspective provides intuition for the bias-variance tradeoff in KNN.
Variance Reduction Through Averaging:
Each prediction averages K labels. Treating the neighbors' labels as roughly independent, the variance of their average shrinks in proportion to the number of terms: $$\text{Var}[\hat{y}] \propto \frac{1}{K}$$
Larger K → more averaging → lower variance. This is why K = 1 is high-variance (single unpredictable neighbor) while K = 100 is low-variance (average of 100 neighbors).
Bias Through Over-Smoothing:
Larger neighborhoods include points further from q, whose true labels may differ: $$\text{Bias}[\hat{y}] \propto r_K^\beta$$
where β depends on the smoothness of the true decision boundary. Larger K → larger radius → higher bias.
Optimal Smoothing Level:
Balancing these effects, the optimal K typically satisfies: $$K^* \approx n^{\frac{2\beta}{2\beta + d}}$$
For smooth boundaries (high β) in low dimensions, larger K is optimal. For complex boundaries in high dimensions, smaller K is needed.
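These opposing trends show up clearly in a small simulation: regenerate a noisy training set many times, predict at a fixed query point with K-NN regression, and split the error into squared bias and variance. The setup below (a sine target, Gaussian noise, query at x = 5, 300 repetitions) is an illustrative sketch, not a benchmark:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n, noise = 200, 0.3
q = np.array([[5.0]])            # fixed query point
true_value = np.sin(5.0)         # true regression function value at q

for k in (1, 5, 25, 100):
    preds = []
    for _ in range(300):         # repeatedly redraw the training set
        X = rng.uniform(0, 10, size=(n, 1))
        y = np.sin(X.ravel()) + rng.normal(0, noise, size=n)
        model = KNeighborsRegressor(n_neighbors=k).fit(X, y)
        preds.append(model.predict(q)[0])
    preds = np.array(preds)
    bias2 = (preds.mean() - true_value) ** 2   # squared bias at q
    var = preds.var()                          # variance across training sets
    print(f"K={k:3d}   bias^2={bias2:.4f}   variance={var:.4f}   total={bias2 + var:.4f}")
```

Variance falls roughly like 1/K while the squared bias grows as the neighborhood radius expands; the minimum of their sum is the optimal smoothing level described above.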
| Scenario | Recommended K | Rationale |
|---|---|---|
| Low noise, complex boundary | Small (√n/4 to √n/2) | Preserve detail, averaging unnecessary |
| High noise, simple boundary | Large (√n to 2√n) | Heavy averaging to remove noise |
| High dimensions | Larger than low-d | Combat curse of dimensionality |
| Imbalanced classes | Smaller than minority class | Ensure minority can win votes |
| Real-time prediction constraints | As small as accurate | Minimize distance computations |
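In practice these guidelines give a search range rather than a final answer; K is normally chosen by cross-validation. A minimal sketch, assuming a generic classification dataset and a grid centered on √n (the dataset and grid spacing are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Grid of K values spanning roughly sqrt(n)/4 to 2*sqrt(n)
root_n = int(np.sqrt(len(X)))
k_grid = np.unique(np.linspace(max(1, root_n // 4), 2 * root_n, 12).astype(int))

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
search = GridSearchCV(pipe, {'kneighborsclassifier__n_neighbors': list(k_grid)}, cv=5)
search.fit(X, y)

print(f"Best K: {search.best_params_['kneighborsclassifier__n_neighbors']} "
      f"(CV accuracy {search.best_score_:.3f})")
```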
KNN regression fits a constant (average) in each neighborhood. Local regression methods generalize this by fitting more flexible local models.
Hierarchy of Local Methods:
- Local constant (KNN regression, Nadaraya-Watson): predict the (weighted) average of the responses in the neighborhood.
- Local linear (LOESS of degree 1): fit a weighted straight line in each neighborhood and evaluate it at q.
- Local polynomial (degree 2 and higher): fit a weighted quadratic or higher-order model, capturing local curvature at the cost of extra variance.
LOESS/LOWESS:
For each query q:
1. Select the span: the fixed fraction of training points nearest to q.
2. Weight those points by distance, conventionally with the tricube kernel.
3. Fit a weighted linear (or quadratic) regression on the weighted neighborhood.
4. Return the local fit evaluated at q as $\hat{f}(q)$.
Advantage over KNN: The local linear fit reduces boundary bias. At data edges, the linear term extrapolates appropriately rather than just averaging available neighbors.
Use KNN when simplicity matters or for classification. Use LOESS when smooth regression curves are needed and computation is affordable. Local polynomial methods shine for derivative estimation (velocity from position data) where KNN's piecewise constant output fails.
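To make the boundary-bias point concrete, the sketch below implements a simple tricube-weighted local linear fit (a LOESS-style estimate written from scratch, not a reference implementation) and compares it with a plain K-NN average at the right edge of a dataset with a linear trend; all of the specific numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 20
X = np.sort(rng.uniform(0, 10, n))
y = 2.0 * X + rng.normal(0, 0.5, n)      # true function: f(x) = 2x

def knn_predict(q):
    """Local constant fit: plain K-NN average."""
    idx = np.argsort(np.abs(X - q))[:k]
    return y[idx].mean()

def local_linear_predict(q):
    """Tricube-weighted local linear fit, evaluated at q."""
    idx = np.argsort(np.abs(X - q))[:k]
    d = np.abs(X[idx] - q)
    w = (1 - (d / (d.max() + 1e-12)) ** 3) ** 3          # tricube weights
    A = np.column_stack([np.ones(k), X[idx] - q])        # design matrix centered at q
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y[idx] * sw, rcond=None)
    return beta[0]                                       # intercept = fitted value at q

q = 10.0   # query at the right boundary of the data
print(f"True value       : {2 * q:.2f}")
print(f"K-NN average     : {knn_predict(q):.2f}   (pulled toward the interior)")
print(f"Local linear fit : {local_linear_predict(q):.2f}")
```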
Computational Considerations: both KNN and LOESS must find a neighborhood for every query, and that search dominates the cost for large n. LOESS additionally solves a small weighted least-squares problem per query, so it is more expensive than KNN's simple average, though still cheap when the neighborhood size and dimension are modest.
Module Complete:
You have now mastered the geometric foundations of KNN through the lens of Voronoi diagrams and decision surfaces. From Voronoi tessellations to K-smoothing, from the Cover-Hart theorem to practical visualization, this module provides the complete picture of how KNN partitions feature space and why it behaves as it does.
Congratulations! You now understand KNN from a geometric perspective: Voronoi tessellations underlie 1-NN, K transforms boundaries through local averaging, and the neighborhood radius functions as an adaptive bandwidth. This geometric intuition transfers to understanding local methods throughout machine learning.