In the previous page, we introduced distance weighting as an intuitive improvement over uniform voting. But a deeper question remains: why should we use inverse distance specifically? Why not inverse square root? Why not any other monotonically decreasing function?
The answer lies in kernel theory—a rich area of nonparametric statistics that provides a principled framework for choosing weight functions. Kernel-weighted KNN connects our local prediction method to kernel density estimation, kernel regression (Nadaraya-Watson estimators), and ultimately to a theory that tells us optimal ways to weight observations based on their distance.
This isn't merely academic sophistication. Kernel-based weighting provides a principled way to choose the weight function, a single bandwidth parameter that controls locality, and a direct connection to well-studied estimators such as Nadaraya-Watson.
By the end of this page, you will understand kernel functions as weight generators, master the Epanechnikov, Gaussian, and other standard kernels, connect kernel-weighted KNN to the Nadaraya-Watson estimator, and grasp bandwidth as a hyperparameter that unifies k and weighting strength in a single framework.
A kernel function in nonparametric statistics is a non-negative function $K(u)$ that integrates to 1 and is typically symmetric around zero. The argument $u$ represents a normalized distance (not to be confused with the kernel of SVMs and kernel methods, which computes inner products in feature space; here we mean a smoothing kernel, or window function).
Formal Definition:
A kernel function $K: \mathbb{R} \rightarrow \mathbb{R}^+$ satisfies:

- Non-negativity: $K(u) \geq 0$ for all $u$
- Normalization: $\int_{-\infty}^{\infty} K(u)\, du = 1$
- Symmetry: $K(-u) = K(u)$
For multidimensional data, we typically use a product kernel or a spherical kernel:
$$K(\mathbf{u}) = K(|\mathbf{u}|)$$
where $|\cdot|$ is the Euclidean norm.
The machine learning literature overloads the term 'kernel'. In SVMs and kernel methods, a kernel k(x, y) computes inner products in a feature space. In nonparametric statistics (our context here), a kernel K(u) is a weighting function for local averaging. These are related but distinct concepts. Context usually clarifies which is meant.
From Kernel to Weight Function:
Given a kernel $K$ and a bandwidth $h > 0$, we define the weight of a training point at distance $d$ from the query:
$$w(d) = K\left(\frac{d}{h}\right)$$
The bandwidth $h$ plays a critical role:

- Small $h$: only the closest points receive appreciable weight, giving very local, low-bias but high-variance predictions.
- Large $h$: weight spreads over many points, giving smoother, higher-bias but lower-variance predictions.
The kernel determines the shape of the decay; the bandwidth determines the scale.
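As a tiny illustration of this split (illustrative code with values of our own choosing), the same Epanechnikov kernel applied at the same distances with two different bandwidths:

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero outside."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

distances = np.array([0.2, 0.5, 1.0, 2.0])
for h in (1.0, 2.0):
    weights = epanechnikov(distances / h)      # w(d) = K(d / h)
    print(f"h = {h}: {np.round(weights, 3)}")  # larger h keeps distant points in play
```

With $h = 1$ the points at distance 1.0 and 2.0 get zero weight; doubling the bandwidth brings them back into the average without changing the shape of the decay.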
| Kernel Name | Formula K(u) | Support | Properties |
|---|---|---|---|
| Uniform (Box) | ½ for |u| ≤ 1, else 0 | Bounded [-1, 1] | Simplest kernel; discontinuous; uniform within support |
| Triangular | (1 - |u|) for |u| ≤ 1, else 0 | Bounded [-1, 1] | Linear decay; continuous but not smooth |
| Epanechnikov | ¾(1 - u²) for |u| ≤ 1, else 0 | Bounded [-1, 1] | Optimal for MSE in density estimation; parabolic |
| Biweight | (15/16)(1 - u²)² for |u| ≤ 1 | Bounded [-1, 1] | Smoother than Epanechnikov; twice differentiable |
| Gaussian | (1/√(2π))·exp(-u²/2) | Unbounded (-∞, ∞) | Infinitely smooth; never exactly zero; standard choice |
| Tricube | (70/81)(1 - |u|³)³ for |u| ≤ 1 | Bounded [-1, 1] | Very smooth decay; used in LOESS |
Among all second-order kernels (kernels with finite variance), the Epanechnikov kernel is optimal in the sense of minimizing the asymptotic mean integrated squared error (MISE) for kernel density estimation.
The Epanechnikov Kernel:
$$K_E(u) = \begin{cases} \frac{3}{4}(1 - u^2) & \text{if } |u| \leq 1 \\ 0 & \text{otherwise} \end{cases}$$
Properties:
Compact support: The kernel is exactly zero for $|u| > 1$. Points farther than the bandwidth from the query (normalized distance greater than 1) receive zero weight.
Parabolic shape: The weight decreases quadratically from the center, providing smooth but bounded decay.
Optimal efficiency: For kernel density estimation, Epanechnikov minimizes the asymptotic MISE among second-order kernels; common alternatives lose little (the biweight is about 99% efficient and the Gaussian about 95% efficient relative to it).
Computational advantage: The compact support means we only need to consider training points within distance $h$ of the query.
```python
import numpy as np
from typing import Callable


def uniform_kernel(u: np.ndarray) -> np.ndarray:
    """Box/Uniform kernel: constant within [-1, 1], zero outside."""
    return np.where(np.abs(u) <= 1, 0.5, 0.0)


def triangular_kernel(u: np.ndarray) -> np.ndarray:
    """Triangular kernel: linear decay within [-1, 1]."""
    return np.where(np.abs(u) <= 1, 1 - np.abs(u), 0.0)


def epanechnikov_kernel(u: np.ndarray) -> np.ndarray:
    """Epanechnikov kernel: optimal parabolic kernel."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)


def biweight_kernel(u: np.ndarray) -> np.ndarray:
    """Biweight/Quartic kernel: very smooth, twice differentiable."""
    return np.where(np.abs(u) <= 1, (15/16) * (1 - u**2)**2, 0.0)


def gaussian_kernel(u: np.ndarray) -> np.ndarray:
    """Gaussian kernel: smooth, unbounded support."""
    return (1 / np.sqrt(2 * np.pi)) * np.exp(-0.5 * u**2)


def tricube_kernel(u: np.ndarray) -> np.ndarray:
    """Tricube kernel: used in LOESS, very smooth at boundary."""
    return np.where(np.abs(u) <= 1, (70/81) * (1 - np.abs(u)**3)**3, 0.0)


def apply_kernel_weights(distances: np.ndarray, bandwidth: float,
                         kernel: Callable) -> np.ndarray:
    """
    Convert distances to kernel weights.

    Parameters
    ----------
    distances : array of distances from query to training points
    bandwidth : kernel bandwidth (scale parameter h)
    kernel : kernel function K(u) -> weight

    Returns
    -------
    weights : normalized weights (sum to 1)
    """
    # Normalize distances by bandwidth
    normalized_distances = distances / bandwidth

    # Apply kernel function
    raw_weights = kernel(normalized_distances)

    # Handle case where all weights are zero
    weight_sum = raw_weights.sum()
    if weight_sum < 1e-10:
        # Fall back to uniform weights on k nearest
        return np.ones_like(raw_weights) / len(raw_weights)

    return raw_weights / weight_sum


# Visualization: compare kernel shapes
u_values = np.linspace(-1.5, 1.5, 1000)

kernels = {
    'Uniform': uniform_kernel,
    'Triangular': triangular_kernel,
    'Epanechnikov': epanechnikov_kernel,
    'Biweight': biweight_kernel,
    'Gaussian': gaussian_kernel,
    'Tricube': tricube_kernel,
}

print("Kernel values at u = 0, 0.5, 1.0:")
print("-" * 50)
for name, kern in kernels.items():
    vals = [kern(np.array([u]))[0] for u in [0.0, 0.5, 1.0]]
    print(f"{name:15s}: K(0)={vals[0]:.4f}, K(0.5)={vals[1]:.4f}, K(1.0)={vals[2]:.4f}")
```

In practice, kernel choice matters less than bandwidth selection. The Gaussian kernel is often preferred despite being theoretically suboptimal because it's smooth everywhere, never zero (avoiding division issues), and extremely well-studied. For strict locality (ignoring distant points entirely), Epanechnikov or tricube are better choices.
The Gaussian kernel is the most widely used kernel in practice due to its mathematical elegance and numerical convenience.
The Gaussian Kernel:
$$K_G(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right)$$
With bandwidth $h$:
$$w(d) = K_G\left(\frac{d}{h}\right) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{d^2}{2h^2}\right)$$
Properties:
Infinite support: Every training point contributes some weight, no matter how distant. In practice, points beyond $3h$ contribute negligibly (about 1% of the maximum weight).
Infinitely differentiable: Predictions are smooth functions of the query location, which is important for certain applications.
Connection to RBF: The Gaussian kernel weight is identical in form to the Radial Basis Function (RBF) kernel used in SVMs and Gaussian Processes (made explicit just below).
Natural probabilistic interpretation: If the distance follows a half-normal distribution, Gaussian weighting corresponds to likelihood weighting.
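Making the RBF connection noted above explicit, in the $\gamma$ parameterization common in SVM libraries:

$$K_G\!\left(\frac{d}{h}\right) \propto \exp\!\left(-\frac{d^2}{2h^2}\right) = \exp\!\left(-\gamma d^2\right), \qquad \gamma = \frac{1}{2h^2}$$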
Relationship to inverse distance weighting:
Expanding for small distances:
$$\exp\left(-\frac{d^2}{2h^2}\right) \approx 1 - \frac{d^2}{2h^2} + O(d^4)$$
Near the query, the Gaussian weight therefore decays quadratically in the distance: it shares the strong preference for close points that inverse-distance schemes have, but without the singularity at $d = 0$.
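A quick numerical check of this expansion (a small illustrative snippet, not part of the lesson's code):

```python
import numpy as np

h = 1.0
for d in (0.1, 0.3, 0.6, 1.0):
    exact = np.exp(-d**2 / (2 * h**2))   # Gaussian weight (unnormalized)
    quad = 1 - d**2 / (2 * h**2)         # second-order Taylor approximation
    print(f"d={d:.1f}: exact={exact:.4f}, quadratic approx={quad:.4f}")
```

The approximation is nearly exact for $d \ll h$ and only drifts apart as $d$ approaches the bandwidth, which is exactly the regime where the Gaussian's smooth tail takes over.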
The bandwidth $h$ as the key hyperparameter:
| $h$ relative to data spread | Behavior |
|---|---|
| $h \ll$ nearest neighbor distance | Extremely local; nearly 1-NN |
| $h \approx$ average neighbor distance | Balanced local averaging |
| $h \gg$ data spread | Nearly uniform weighting over all points |
```python
import numpy as np
from typing import Optional


class GaussianKernelKNN:
    """
    KNN with Gaussian kernel weighting.

    Uses bandwidth h instead of k as the locality parameter.
    All training points within reasonable distance contribute,
    weighted by K(d/h).
    """

    def __init__(self, bandwidth: Optional[float] = None):
        """
        Parameters
        ----------
        bandwidth : kernel bandwidth h. If None, estimated from data.
        """
        self.bandwidth = bandwidth
        self.X_train = None
        self.y_train = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Store training data and optionally estimate bandwidth."""
        self.X_train = np.array(X)
        self.y_train = np.array(y)

        if self.bandwidth is None:
            # Silverman's rule of thumb for bandwidth selection
            n, d = self.X_train.shape
            std = self.X_train.std(axis=0).mean()
            self.bandwidth = std * (4 / (n * (d + 2))) ** (1 / (d + 4))
            print(f"Auto-selected bandwidth: {self.bandwidth:.4f}")

        return self

    def _gaussian_weights(self, distances: np.ndarray) -> np.ndarray:
        """Compute Gaussian kernel weights."""
        u = distances / self.bandwidth
        weights = np.exp(-0.5 * u**2)

        # Normalize
        total = weights.sum()
        if total < 1e-10:
            # All weights essentially zero - use 1-NN
            weights = np.zeros_like(weights)
            weights[np.argmin(distances)] = 1.0
        else:
            weights = weights / total

        return weights

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict for regression (weighted average) or classification (mode)."""
        X = np.atleast_2d(X)
        predictions = []

        for x_query in X:
            distances = np.linalg.norm(self.X_train - x_query, axis=1)
            weights = self._gaussian_weights(distances)

            # Check if classification or regression
            if np.issubdtype(self.y_train.dtype, np.floating):
                # Regression: weighted average
                pred = np.sum(weights * self.y_train)
            else:
                # Classification: weighted voting
                classes = np.unique(self.y_train)
                scores = np.array([weights[self.y_train == c].sum() for c in classes])
                pred = classes[np.argmax(scores)]

            predictions.append(pred)

        return np.array(predictions)

    def effective_neighbors(self, x_query: np.ndarray, threshold: float = 0.01) -> int:
        """Count how many neighbors contribute more than threshold weight."""
        distances = np.linalg.norm(self.X_train - x_query, axis=1)
        weights = self._gaussian_weights(distances)
        return np.sum(weights > threshold)


# Demonstration
np.random.seed(42)

# Create 1D regression data with varying density
X_train = np.vstack([
    np.linspace(0, 2, 20)[:, np.newaxis],   # Dense region
    np.linspace(5, 10, 10)[:, np.newaxis],  # Sparse region
])
y_train = np.sin(X_train.flatten()) + 0.1 * np.random.randn(30)

# Compare different bandwidths
bandwidths = [0.2, 0.5, 1.0, 2.0]
x_test = np.array([[1.0], [7.5]])  # One in dense, one in sparse region

print("Gaussian Kernel KNN predictions:")
print("=" * 60)

for h in bandwidths:
    model = GaussianKernelKNN(bandwidth=h)
    model.fit(X_train, y_train)

    preds = model.predict(x_test)
    eff_neighbors = [model.effective_neighbors(x) for x in x_test]

    print(f"\nBandwidth h = {h}")
    for i, (x, pred, eff) in enumerate(zip(x_test, preds, eff_neighbors)):
        print(f"  x={x[0]:.1f}: pred={pred:.4f}, effective neighbors ≈ {eff}")
```

Kernel-weighted KNN for regression is a special case of a classical nonparametric estimator: the Nadaraya-Watson estimator, proposed independently by Nadaraya (1964) and Watson (1964).
The Nadaraya-Watson Estimator:
For a query point $\mathbf{x}$:
$$\hat{f}(\mathbf{x}) = \frac{\sum_{i=1}^{n} K_h(\mathbf{x} - \mathbf{x}_i) \cdot y_i}{\sum_{i=1}^{n} K_h(\mathbf{x} - \mathbf{x}_i)}$$
where $K_h(\mathbf{u}) = \frac{1}{h^d} K\left(\frac{|\mathbf{u}|}{h}\right)$ and $d$ is the dimensionality.
This is precisely our kernel-weighted KNN regression formula! The only difference is that Nadaraya-Watson traditionally uses all training points (no explicit k), while KNN variants may still select k neighbors first and then apply kernel weights.
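For concreteness, here is a minimal Nadaraya-Watson regression sketch with a Gaussian kernel (illustrative code with names of our own choosing, not the lesson's implementation); note that it sums over all training points rather than a pre-selected k:

```python
import numpy as np

def nadaraya_watson(X_train, y_train, x_query, h):
    """Gaussian-kernel Nadaraya-Watson estimate at a single query point."""
    d = np.linalg.norm(X_train - x_query, axis=1)
    w = np.exp(-0.5 * (d / h) ** 2)          # K_h up to a constant that cancels
    return np.sum(w * y_train) / np.sum(w)   # weighted average of all y_i

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, (200, 1))
y = np.sin(X.ravel()) + 0.1 * rng.normal(size=200)
print(nadaraya_watson(X, y, np.array([np.pi / 2]), h=0.3))  # ≈ sin(π/2) = 1
```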
Connection to Kernel Density Estimation:
The Nadaraya-Watson estimator can be interpreted as:
$$\hat{f}(\mathbf{x}) = \frac{\text{weighted average of } y_i \text{ at } \mathbf{x}}{\text{kernel density estimate at } \mathbf{x}}$$
The denominator $\sum_i K_h(\mathbf{x} - \mathbf{x}_i)$ is (up to a constant) a kernel density estimate of the training data at $\mathbf{x}$, while the numerator is the corresponding kernel estimate of $\int y \, p(\mathbf{x}, y)\, dy$; their ratio estimates the conditional mean $E[Y \mid \mathbf{x}]$.
The Nadaraya-Watson estimator predates modern 'machine learning' terminology but is fundamental to it. Understanding this connection reveals that kernel-weighted KNN isn't an ad-hoc modification—it's a principled statistical estimator with decades of theoretical study.
Theoretical Properties:
Consistency: Under mild conditions, $\hat{f}(\mathbf{x}) \to f(\mathbf{x})$ as $n \to \infty$ if $h \to 0$ and $nh^d \to \infty$.
Asymptotic Bias: For a twice-differentiable true function: $$\text{Bias}(\hat{f}(\mathbf{x})) \approx \frac{h^2}{2} \cdot \mu_2(K) \cdot \left(f''(\mathbf{x}) + 2\frac{f'(\mathbf{x}) \cdot p'(\mathbf{x})}{p(\mathbf{x})}\right)$$
where $\mu_2(K) = \int u^2 K(u) du$ is the second moment of the kernel and $p(\mathbf{x})$ is the data density.
Asymptotic Variance: $$\text{Var}(\hat{f}(\mathbf{x})) \approx \frac{\sigma^2 \, R(K)}{n h^d \, p(\mathbf{x})}$$

where $R(K) = \int K(u)^2 du$ is the roughness of the kernel.
Key Insight: Bias is $O(h^2)$ while variance is $O(1/nh^d)$. This leads to the classic bias-variance tradeoff in bandwidth selection.
The bandwidth $h$ is the critical hyperparameter in kernel-weighted KNN. Choosing $h$ is analogous to choosing $k$ in standard KNN, but with more nuanced control over the locality of predictions.
The Bias-Variance Decomposition:
Mean Squared Error = Bias² + Variance
$$\text{MSE}(\hat{f}) \approx \underbrace{\frac{h^4}{4} \mu_2(K)^2 [f''(x)]^2}_{\text{Bias}^2} + \underbrace{\frac{\sigma^2 R(K)}{n h^d p(x)}}_{\text{Variance}}$$
Optimizing over $h$ yields the asymptotically optimal bandwidth:
$$h_{opt} = \left(\frac{R(K) \cdot d \cdot \sigma^2}{n \cdot \mu_2(K)^2 \cdot \int [f''(x)]^2 p(x) dx}\right)^{1/(d+4)}$$
This depends on unknown quantities ($\sigma^2$, $f''$) and so cannot be computed directly. Practical methods estimate or approximate these.
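The scaling is still informative even with the constants unknown. Balancing the $O(h^4)$ bias-squared term against the $O(1/(nh^d))$ variance term (writing generic constants $c_1, c_2$) gives the familiar nonparametric rate:

$$\frac{d}{dh}\left(c_1 h^4 + \frac{c_2}{n h^d}\right) = 0 \;\Rightarrow\; h_{\text{opt}} \propto n^{-1/(d+4)}, \qquad \text{MSE}(h_{\text{opt}}) = O\!\left(n^{-4/(d+4)}\right)$$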
```python
import numpy as np


def silverman_bandwidth(X: np.ndarray) -> float:
    """
    Silverman's rule of thumb for bandwidth selection.

    h = 0.9 * min(std, IQR/1.34) * n^(-1/(d+4))

    This is designed for Gaussian kernels on roughly normal data.
    """
    n, d = X.shape

    # Per-dimension scales
    stds = X.std(axis=0)
    iqrs = np.percentile(X, 75, axis=0) - np.percentile(X, 25, axis=0)

    # Use minimum of std and IQR/1.34 for robustness
    scales = np.minimum(stds, iqrs / 1.34)
    avg_scale = scales.mean()

    # Silverman's formula
    h = 0.9 * avg_scale * n ** (-1 / (d + 4))
    return h


def cross_validate_bandwidth(X: np.ndarray, y: np.ndarray,
                             bandwidths: np.ndarray,
                             kernel_func=None) -> tuple:
    """
    Select bandwidth by leave-one-out cross-validation.

    Returns (best_bandwidth, cv_scores)
    """
    if kernel_func is None:
        kernel_func = lambda u: np.exp(-0.5 * u**2)  # Gaussian

    n = len(y)
    cv_scores = []

    for h in bandwidths:
        squared_errors = []

        for i in range(n):
            # Leave out point i
            X_loo = np.delete(X, i, axis=0)
            y_loo = np.delete(y, i)
            x_query = X[i:i+1]
            y_true = y[i]

            # Predict using remaining points
            distances = np.linalg.norm(X_loo - x_query, axis=1)
            u = distances / h
            weights = kernel_func(u)

            total_weight = weights.sum()
            if total_weight < 1e-10:
                y_pred = y_loo.mean()  # Fall back to mean
            else:
                y_pred = (weights * y_loo).sum() / total_weight

            squared_errors.append((y_pred - y_true) ** 2)

        mse = np.mean(squared_errors)
        cv_scores.append(mse)

    best_idx = np.argmin(cv_scores)
    return bandwidths[best_idx], cv_scores


def adaptive_bandwidth(X: np.ndarray, k: int = 10) -> np.ndarray:
    """
    Compute adaptive (local) bandwidth for each point.

    h_i = distance to k-th nearest neighbor for point i

    This adapts bandwidth to local data density.
    """
    n = X.shape[0]
    local_bandwidths = np.zeros(n)

    for i in range(n):
        distances = np.linalg.norm(X - X[i], axis=1)
        distances_sorted = np.sort(distances)
        # k-th nearest (index k since index 0 is the point itself)
        local_bandwidths[i] = distances_sorted[k]

    return local_bandwidths


# Demonstration
np.random.seed(42)

# Generate regression data with varying curvature
X_train = np.random.uniform(0, 10, (100, 1))
y_train = np.sin(X_train.flatten()) + 0.1 * np.random.randn(100)

print("Bandwidth Selection Comparison:")
print("=" * 50)

# Silverman's rule
h_silverman = silverman_bandwidth(X_train)
print(f"Silverman's rule: h = {h_silverman:.4f}")

# Cross-validation
h_candidates = np.linspace(0.1, 2.0, 20)
h_cv, scores = cross_validate_bandwidth(X_train, y_train, h_candidates)
print(f"Cross-validated:  h = {h_cv:.4f}")

# Adaptive (show statistics)
h_adaptive = adaptive_bandwidth(X_train, k=5)
print(f"Adaptive (median): h = {np.median(h_adaptive):.4f}")
print(f"Adaptive (range):  [{h_adaptive.min():.4f}, {h_adaptive.max():.4f}]")
```

Standard KNN uses a fixed $k$ (number of neighbors), while kernel regression uses a fixed bandwidth $h$. These can be unified into a single framework:
Fixed-k KNN: Locality is defined by the k-th nearest neighbor distance. In regions of high density, the neighborhood is small; in sparse regions, it expands.
Fixed-h Kernel: Locality is defined by a fixed distance. In dense regions, many points contribute; in sparse regions, few do.
Hybrid Approaches:
kNN with kernel weights: Select k neighbors, then apply kernel weights based on distance within this neighborhood.
Adaptive bandwidth: Set $h_i$ proportional to the distance to the k-th neighbor, combining adaptive locality with kernel smoothing.
Variable kernel: Use a global bandwidth but scale by local density estimates.
| Property | Fixed-k KNN | Fixed-h Kernel | Adaptive/Hybrid |
|---|---|---|---|
| Locality control | Number of neighbors | Distance threshold | Varies by local density |
| Dense regions | Small effective radius | Many contributors | Balanced |
| Sparse regions | Large effective radius (few samples) | Few contributors (high variance) | Balanced |
| Boundary behavior | Always k neighbors (may be far) | May have no neighbors | Adapts gracefully |
| Computation | Exact k neighbors needed | All within h (or approximation) | Requires density estimation |
| Hyperparameter | k (integer) | h (continuous) | Both or derived |
For most applications, start with fixed-k KNN with kernel weights (combine the best of both). Use k for robust neighborhood definition (ensuring you always have enough samples) and kernel weights for smooth, principled weighting within that neighborhood. Set k moderately large (20-50) and let the kernel weighting handle the locality.
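As a concrete starting point, here is a minimal sketch of that recipe using scikit-learn's callable `weights` option (the function and data names are our own; the callable receives an array of neighbor distances per query and should return weights of the same shape, so verify this behavior against your scikit-learn version):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def adaptive_gaussian_weights(dist: np.ndarray) -> np.ndarray:
    """dist has shape (n_queries, k); use each query's largest neighbor distance as h."""
    h = dist.max(axis=1, keepdims=True) + 1e-12   # per-query adaptive bandwidth
    return np.exp(-0.5 * (dist / h) ** 2)         # Gaussian kernel weights

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = np.sin(X.ravel()) + 0.1 * rng.normal(size=200)

# Moderately large k for a robust neighborhood; the kernel handles locality within it
model = KNeighborsRegressor(n_neighbors=30, weights=adaptive_gaussian_weights)
model.fit(X, y)
print(model.predict(np.array([[2.5], [7.0]])))
```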
Relating k and h:
The relationship between k and h depends on data dimensionality $d$ and local density $p(\mathbf{x})$:
$$k \approx n \cdot V_d \cdot h^d \cdot p(\mathbf{x})$$
where $V_d$ is the volume of the unit ball in $d$ dimensions.
This shows that, for a fixed bandwidth, the effective number of neighbors grows with the local density $p(\mathbf{x})$ and with $h^d$: fixed-$h$ weighting recruits many points in dense regions and few in sparse ones, whereas fixed-$k$ implicitly adjusts $h$ to the local density.
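A quick simulation makes the relationship concrete (illustrative values of our own choosing, with uniform data on the unit cube so $p(\mathbf{x}) = 1$):

```python
import numpy as np
from math import gamma, pi

rng = np.random.default_rng(0)
n, d, h = 20000, 3, 0.1
X = rng.uniform(0, 1, (n, d))           # uniform density: p(x) = 1
query = np.full(d, 0.5)                 # interior point; the radius-h ball stays inside

V_d = pi ** (d / 2) / gamma(d / 2 + 1)  # volume of the unit ball in d dimensions
expected_k = n * V_d * h ** d           # predicted neighbor count from k ≈ n * V_d * h^d * p(x)
observed_k = np.sum(np.linalg.norm(X - query, axis=1) <= h)

print(f"expected k ≈ {expected_k:.1f}, observed k = {observed_k}")
```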
While Nadaraya-Watson is primarily a regression estimator, kernel weighting extends naturally to classification.
Kernel-Weighted Class Probabilities:
For a query point $\mathbf{x}$:
$$\hat{P}(Y = c \mid \mathbf{x}) = \frac{\sum_{i: y_i = c} K_h(\mathbf{x} - \mathbf{x}_i)}{\sum_{i=1}^{n} K_h(\mathbf{x} - \mathbf{x}_i)}$$
This estimates the posterior probability of class $c$ at $\mathbf{x}$ using kernel-weighted class membership.
Connection to Generative Classification:
If we interpret $\sum_{i: y_i = c} K_h(\mathbf{x} - \mathbf{x}_i)$ as a kernel density estimate of $p(\mathbf{x} | Y = c)$, then:
$$\hat{P}(c | \mathbf{x}) = \frac{\hat{p}(\mathbf{x} | c) \cdot \hat{p}(c)}{\sum_{c'} \hat{p}(\mathbf{x} | c') \cdot \hat{p}(c')}$$
where $\hat{p}(c) = n_c / n$ is the empirical prior. This is Bayes' rule with kernel-estimated class-conditional densities!
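A small numerical check of this equivalence (a sketch with invented data; the names are ours): the posterior computed from per-class kernel density estimates and empirical priors reproduces the kernel-weighted class score exactly, because the shared normalization constants cancel.

```python
import numpy as np

rng = np.random.default_rng(1)
h = 0.7
X0 = rng.normal(loc=[-1, 0], scale=1.0, size=(80, 2))    # class 0 samples
X1 = rng.normal(loc=[+1, 0], scale=1.0, size=(120, 2))   # class 1 samples
x = np.array([0.3, 0.2])                                 # query point

def kernel_sum(Xc):
    """Unnormalized Gaussian kernel sum over one class's training points."""
    d = np.linalg.norm(Xc - x, axis=1)
    return np.exp(-0.5 * (d / h) ** 2).sum()

# Direct kernel-weighted class score
s0, s1 = kernel_sum(X0), kernel_sum(X1)
p1_vote = s1 / (s0 + s1)

# Bayes' rule with KDE class-conditionals and empirical priors n_c / n
n0, n1 = len(X0), len(X1)
dens0, dens1 = s0 / n0, s1 / n1          # class-conditional KDE, up to a shared constant
p1_bayes = (dens1 * n1) / (dens0 * n0 + dens1 * n1)

print(f"kernel-weighted: P(c=1|x) = {p1_vote:.4f}")
print(f"Bayes with KDE:  P(c=1|x) = {p1_bayes:.4f}")   # identical by construction
```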
```python
import numpy as np
from typing import Callable, Tuple


class KernelKNNClassifier:
    """
    KNN classifier with kernel weighting.

    Combines k-nearest neighbor selection with kernel-weighted voting.
    Outputs calibrated probability estimates.
    """

    def __init__(self, k: int = 20, kernel: str = 'gaussian',
                 bandwidth: str = 'adaptive'):
        """
        Parameters
        ----------
        k : number of neighbors to consider
        kernel : 'gaussian', 'epanechnikov', 'tricube'
        bandwidth : 'adaptive' (based on k-th neighbor), or float
        """
        self.k = k
        self.kernel_name = kernel
        self.bandwidth_mode = bandwidth
        self.X_train = None
        self.y_train = None
        self.classes_ = None

    def _kernel_func(self, u: np.ndarray) -> np.ndarray:
        """Evaluate kernel at normalized distance u."""
        if self.kernel_name == 'gaussian':
            return np.exp(-0.5 * u**2)
        elif self.kernel_name == 'epanechnikov':
            return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
        elif self.kernel_name == 'tricube':
            return np.where(np.abs(u) <= 1, (70/81) * (1 - np.abs(u)**3)**3, 0.0)
        else:
            raise ValueError(f"Unknown kernel: {self.kernel_name}")

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Store training data."""
        self.X_train = np.array(X)
        self.y_train = np.array(y)
        self.classes_ = np.unique(y)
        return self

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Predict class probabilities using kernel-weighted voting."""
        X = np.atleast_2d(X)
        probas = np.zeros((len(X), len(self.classes_)))

        for i, x_query in enumerate(X):
            # Compute all distances
            distances = np.linalg.norm(self.X_train - x_query, axis=1)

            # Find k nearest neighbors
            k_indices = np.argsort(distances)[:self.k]
            k_distances = distances[k_indices]
            k_labels = self.y_train[k_indices]

            # Determine bandwidth
            if self.bandwidth_mode == 'adaptive':
                h = k_distances[-1] + 1e-10  # k-th neighbor distance
            else:
                h = float(self.bandwidth_mode)

            # Compute kernel weights
            u = k_distances / h
            weights = self._kernel_func(u)
            weights = weights / (weights.sum() + 1e-10)

            # Aggregate by class
            for c_idx, c in enumerate(self.classes_):
                mask = k_labels == c
                probas[i, c_idx] = weights[mask].sum()

        return probas

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels."""
        probas = self.predict_proba(X)
        return self.classes_[np.argmax(probas, axis=1)]

    def decision_function(self, X: np.ndarray) -> np.ndarray:
        """Return probability of positive class (for binary classification)."""
        probas = self.predict_proba(X)
        if len(self.classes_) == 2:
            return probas[:, 1]
        return probas


# Demonstration: Compare kernels on 2D classification
np.random.seed(42)

# Generate two-class data
n_per_class = 100
X_class0 = np.random.randn(n_per_class, 2) + [-1, 0]
X_class1 = np.random.randn(n_per_class, 2) + [1, 0]
X_train = np.vstack([X_class0, X_class1])
y_train = np.array([0] * n_per_class + [1] * n_per_class)

# Query points along the class boundary
X_query = np.array([
    [0, 0],      # Center
    [-0.5, 0],   # Slightly class 0
    [0.5, 0],    # Slightly class 1
    [0, 2],      # Same x but different y
])

print("Kernel KNN Classification Comparison:")
print("=" * 60)

for kernel in ['gaussian', 'epanechnikov', 'tricube']:
    clf = KernelKNNClassifier(k=30, kernel=kernel)
    clf.fit(X_train, y_train)
    probas = clf.predict_proba(X_query)

    print(f"\n{kernel.capitalize()} kernel (k=30, adaptive bandwidth):")
    for j, (x, p) in enumerate(zip(X_query, probas)):
        print(f"  x={x}: P(class=0)={p[0]:.3f}, P(class=1)={p[1]:.3f}")
```

You now understand kernel-weighted KNN as a principled nonparametric method with solid theoretical foundations.
The next page explores adaptive neighborhoods—techniques that adjust the locality not just through k or h, but dynamically based on local data characteristics.