In the previous page, we introduced distance weighting as an intuitive improvement over uniform voting. But a deeper question remains: why should we use inverse distance specifically? Why not inverse square root? Why not any other monotonically decreasing function?
The answer lies in kernel theory—a rich area of nonparametric statistics that provides a principled framework for choosing weight functions. Kernel-weighted KNN connects our local prediction method to kernel density estimation, kernel regression (Nadaraya-Watson estimators), and ultimately to a theory that tells us optimal ways to weight observations based on their distance.
This isn't merely academic sophistication. Kernel-based weighting provides a principled way to choose the weight function, a single bandwidth parameter that controls locality, and a direct connection to well-studied estimators such as Nadaraya-Watson.
By the end of this page, you will understand kernel functions as weight generators, master the Epanechnikov, Gaussian, and other standard kernels, connect kernel-weighted KNN to the Nadaraya-Watson estimator, and grasp bandwidth as a hyperparameter that unifies k and weighting strength in a single framework.
A kernel function in nonparametric statistics is a non-negative function $K(u)$ that integrates to 1 and is typically symmetric around zero. The argument $u$ represents a normalized distance (not to be confused with the kernel of SVMs and kernel methods, which computes inner products in feature space; here we mean a smoothing kernel, or window function).
Formal Definition:
A kernel function $K: \mathbb{R} \rightarrow \mathbb{R}^+$ satisfies:

- Non-negativity: $K(u) \geq 0$ for all $u$
- Normalization: $\int_{-\infty}^{\infty} K(u)\, du = 1$
- Symmetry: $K(-u) = K(u)$
For multidimensional data, we typically use a product kernel or a spherical kernel:
$$K(\mathbf{u}) = K(|\mathbf{u}|)$$
where $|\cdot|$ is the Euclidean norm.
The machine learning literature overloads the term 'kernel'. In SVMs and kernel methods, a kernel k(x, y) computes inner products in a feature space. In nonparametric statistics (our context here), a kernel K(u) is a weighting function for local averaging. These are related but distinct concepts. Context usually clarifies which is meant.
From Kernel to Weight Function:
Given a kernel $K$ and a bandwidth $h > 0$, we define the weight of a training point at distance $d$ from the query:
$$w(d) = K\left(\frac{d}{h}\right)$$
The bandwidth $h$ plays a critical role:

- Small $h$: only the closest points receive appreciable weight, giving very local, low-bias but high-variance predictions.
- Large $h$: weight spreads over many points, giving smoother, higher-bias but lower-variance predictions.
The kernel determines the shape of the decay; the bandwidth determines the scale.
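As a tiny illustration of this split (illustrative code with values of our own choosing), the same Epanechnikov kernel applied at the same distances with two different bandwidths:

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero outside."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

distances = np.array([0.2, 0.5, 1.0, 2.0])
for h in (1.0, 2.0):
    weights = epanechnikov(distances / h)      # w(d) = K(d / h)
    print(f"h = {h}: {np.round(weights, 3)}")  # larger h keeps distant points in play
```

With $h = 1$ the points at distance 1.0 and 2.0 get zero weight; doubling the bandwidth brings them back into the average without changing the shape of the decay.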
| Kernel Name | Formula K(u) | Support | Properties |
|---|---|---|---|
| Uniform (Box) | ½ for |u| ≤ 1, else 0 | Bounded [-1, 1] | Simplest kernel; discontinuous; uniform within support |
| Triangular | (1 - |u|) for |u| ≤ 1, else 0 | Bounded [-1, 1] | Linear decay; continuous but not smooth |
| Epanechnikov | ¾(1 - u²) for |u| ≤ 1, else 0 | Bounded [-1, 1] | Optimal for MSE in density estimation; parabolic |
| Biweight | (15/16)(1 - u²)² for |u| ≤ 1 | Bounded [-1, 1] | Smoother than Epanechnikov; twice differentiable |
| Gaussian | (1/√(2π))·exp(-u²/2) | Unbounded (-∞, ∞) | Infinitely smooth; never exactly zero; standard choice |
| Tricube | (70/81)(1 - |u|³)³ for |u| ≤ 1 | Bounded [-1, 1] | Very smooth decay; used in LOESS |
Among all second-order kernels (kernels with finite variance), the Epanechnikov kernel is optimal in the sense of minimizing the asymptotic mean integrated squared error (MISE) for kernel density estimation.
The Epanechnikov Kernel:
$$K_E(u) = \begin{cases} \frac{3}{4}(1 - u^2) & \text{if } |u| \leq 1 \\ 0 & \text{otherwise} \end{cases}$$
Properties:
Compact support: The kernel is exactly zero for $|u| > 1$. Points farther than the bandwidth from the query (normalized distance greater than 1) receive zero weight.
Parabolic shape: The weight decreases quadratically from the center, providing smooth but bounded decay.
Optimal efficiency: For kernel density estimation, Epanechnikov minimizes the asymptotic MISE among second-order kernels; common alternatives lose little (the biweight is about 99% efficient and the Gaussian about 95% efficient relative to it).
Computational advantage: The compact support means we only need to consider training points within distance $h$ of the query.
```python
import numpy as np
from typing import Callable


def uniform_kernel(u: np.ndarray) -> np.ndarray:
    """Box/Uniform kernel: constant within [-1, 1], zero outside."""
    return np.where(np.abs(u) <= 1, 0.5, 0.0)


def triangular_kernel(u: np.ndarray) -> np.ndarray:
    """Triangular kernel: linear decay within [-1, 1]."""
    return np.where(np.abs(u) <= 1, 1 - np.abs(u), 0.0)


def epanechnikov_kernel(u: np.ndarray) -> np.ndarray:
    """Epanechnikov kernel: optimal parabolic kernel."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)


def biweight_kernel(u: np.ndarray) -> np.ndarray:
    """Biweight/Quartic kernel: very smooth, twice differentiable."""
    return np.where(np.abs(u) <= 1, (15/16) * (1 - u**2)**2, 0.0)


def gaussian_kernel(u: np.ndarray) -> np.ndarray:
    """Gaussian kernel: smooth, unbounded support."""
    return (1 / np.sqrt(2 * np.pi)) * np.exp(-0.5 * u**2)


def tricube_kernel(u: np.ndarray) -> np.ndarray:
    """Tricube kernel: used in LOESS, very smooth at boundary."""
    return np.where(np.abs(u) <= 1, (70/81) * (1 - np.abs(u)**3)**3, 0.0)


def apply_kernel_weights(distances: np.ndarray, bandwidth: float,
                         kernel: Callable) -> np.ndarray:
    """
    Convert distances to kernel weights.

    Parameters
    ----------
    distances : array of distances from query to training points
    bandwidth : kernel bandwidth (scale parameter h)
    kernel : kernel function K(u) -> weight

    Returns
    -------
    weights : normalized weights (sum to 1)
    """
    # Normalize distances by bandwidth
    normalized_distances = distances / bandwidth

    # Apply kernel function
    raw_weights = kernel(normalized_distances)

    # Handle case where all weights are zero
    weight_sum = raw_weights.sum()
    if weight_sum < 1e-10:
        # Fall back to uniform weights on k nearest
        return np.ones_like(raw_weights) / len(raw_weights)

    return raw_weights / weight_sum


# Visualization: compare kernel shapes
u_values = np.linspace(-1.5, 1.5, 1000)

kernels = {
    'Uniform': uniform_kernel,
    'Triangular': triangular_kernel,
    'Epanechnikov': epanechnikov_kernel,
    'Biweight': biweight_kernel,
    'Gaussian': gaussian_kernel,
    'Tricube': tricube_kernel,
}

print("Kernel values at u = 0, 0.5, 1.0:")
print("-" * 50)
for name, kern in kernels.items():
    vals = [kern(np.array([u]))[0] for u in [0.0, 0.5, 1.0]]
    print(f"{name:15s}: K(0)={vals[0]:.4f}, K(0.5)={vals[1]:.4f}, K(1.0)={vals[2]:.4f}")
```

In practice, kernel choice matters less than bandwidth selection. The Gaussian kernel is often preferred despite being theoretically suboptimal because it's smooth everywhere, never zero (avoiding division issues), and extremely well-studied. For strict locality (ignoring distant points entirely), Epanechnikov or tricube are better choices.
The Gaussian kernel is the most widely used kernel in practice due to its mathematical elegance and numerical convenience.
The Gaussian Kernel:
$$K_G(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right)$$
With bandwidth $h$:
$$w(d) = K_G\left(\frac{d}{h}\right) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{d^2}{2h^2}\right)$$
Properties:
Infinite support: Every training point contributes some weight, no matter how distant. In practice, points beyond $3h$ contribute negligibly (about 1% of the maximum weight).
Infinitely differentiable: Predictions are smooth functions of the query location, which is important for certain applications.
Connection to RBF: The Gaussian kernel weight is identical in form to the Radial Basis Function (RBF) kernel used in SVMs and Gaussian Processes (made explicit just below).
Natural probabilistic interpretation: If the distance follows a half-normal distribution, Gaussian weighting corresponds to likelihood weighting.
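Making the RBF connection noted above explicit, in the $\gamma$ parameterization common in SVM libraries:

$$K_G\!\left(\frac{d}{h}\right) \propto \exp\!\left(-\frac{d^2}{2h^2}\right) = \exp\!\left(-\gamma d^2\right), \qquad \gamma = \frac{1}{2h^2}$$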
Relationship to inverse distance weighting:
Expanding for small distances:
$$\exp\left(-\frac{d^2}{2h^2}\right) \approx 1 - \frac{d^2}{2h^2} + O(d^4)$$
Near the query, the Gaussian weight therefore decays quadratically in the distance: it shares the strong preference for close points that inverse-distance schemes have, but without the singularity at $d = 0$.
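A quick numerical check of this expansion (a small illustrative snippet, not part of the lesson's code):

```python
import numpy as np

h = 1.0
for d in (0.1, 0.3, 0.6, 1.0):
    exact = np.exp(-d**2 / (2 * h**2))   # Gaussian weight (unnormalized)
    quad = 1 - d**2 / (2 * h**2)         # second-order Taylor approximation
    print(f"d={d:.1f}: exact={exact:.4f}, quadratic approx={quad:.4f}")
```

The approximation is nearly exact for $d \ll h$ and only drifts apart as $d$ approaches the bandwidth, which is exactly the regime where the Gaussian's smooth tail takes over.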
The bandwidth $h$ as the key hyperparameter:
| $h$ relative to data spread | Behavior |
|---|---|
| $h \ll$ nearest neighbor distance | Extremely local; nearly 1-NN |
| $h \approx$ average neighbor distance | Balanced local averaging |
| $h \gg$ data spread | Nearly uniform weighting over all points |
```python
import numpy as np
from typing import Optional


class GaussianKernelKNN:
    """
    KNN with Gaussian kernel weighting.

    Uses bandwidth h instead of k as the locality parameter.
    All training points within reasonable distance contribute,
    weighted by K(d/h).
    """

    def __init__(self, bandwidth: Optional[float] = None):
        """
        Parameters
        ----------
        bandwidth : kernel bandwidth h. If None, estimated from data.
        """
        self.bandwidth = bandwidth
        self.X_train = None
        self.y_train = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Store training data and optionally estimate bandwidth."""
        self.X_train = np.array(X)
        self.y_train = np.array(y)

        if self.bandwidth is None:
            # Silverman's rule of thumb for bandwidth selection
            n, d = self.X_train.shape
            std = self.X_train.std(axis=0).mean()
            self.bandwidth = std * (4 / (n * (d + 2))) ** (1 / (d + 4))
            print(f"Auto-selected bandwidth: {self.bandwidth:.4f}")

        return self

    def _gaussian_weights(self, distances: np.ndarray) -> np.ndarray:
        """Compute Gaussian kernel weights."""
        u = distances / self.bandwidth
        weights = np.exp(-0.5 * u**2)

        # Normalize
        total = weights.sum()
        if total < 1e-10:
            # All weights essentially zero - use 1-NN
            weights = np.zeros_like(weights)
            weights[np.argmin(distances)] = 1.0
        else:
            weights = weights / total

        return weights

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict for regression (weighted average) or classification (mode)."""
        X = np.atleast_2d(X)
        predictions = []

        for x_query in X:
            distances = np.linalg.norm(self.X_train - x_query, axis=1)
            weights = self._gaussian_weights(distances)

            # Check if classification or regression
            if np.issubdtype(self.y_train.dtype, np.floating):
                # Regression: weighted average
                pred = np.sum(weights * self.y_train)
            else:
                # Classification: weighted voting
                classes = np.unique(self.y_train)
                scores = np.array([weights[self.y_train == c].sum() for c in classes])
                pred = classes[np.argmax(scores)]

            predictions.append(pred)

        return np.array(predictions)

    def effective_neighbors(self, x_query: np.ndarray, threshold: float = 0.01) -> int:
        """Count how many neighbors contribute more than threshold weight."""
        distances = np.linalg.norm(self.X_train - x_query, axis=1)
        weights = self._gaussian_weights(distances)
        return np.sum(weights > threshold)


# Demonstration
np.random.seed(42)

# Create 1D regression data with varying density
X_train = np.vstack([
    np.linspace(0, 2, 20)[:, np.newaxis],   # Dense region
    np.linspace(5, 10, 10)[:, np.newaxis],  # Sparse region
])
y_train = np.sin(X_train.flatten()) + 0.1 * np.random.randn(30)

# Compare different bandwidths
bandwidths = [0.2, 0.5, 1.0, 2.0]
x_test = np.array([[1.0], [7.5]])  # One in dense, one in sparse region

print("Gaussian Kernel KNN predictions:")
print("=" * 60)

for h in bandwidths:
    model = GaussianKernelKNN(bandwidth=h)
    model.fit(X_train, y_train)

    preds = model.predict(x_test)
    eff_neighbors = [model.effective_neighbors(x) for x in x_test]

    print(f"\nBandwidth h = {h}")
    for i, (x, pred, eff) in enumerate(zip(x_test, preds, eff_neighbors)):
        print(f"  x={x[0]:.1f}: pred={pred:.4f}, effective neighbors ≈ {eff}")
```

Kernel-weighted KNN for regression is a special case of a classical nonparametric estimator: the Nadaraya-Watson estimator, proposed independently by Nadaraya (1964) and Watson (1964).
The Nadaraya-Watson Estimator:
For a query point $\mathbf{x}$:
$$\hat{f}(\mathbf{x}) = \frac{\sum_{i=1}^{n} K_h(\mathbf{x} - \mathbf{x}_i) \cdot y_i}{\sum_{i=1}^{n} K_h(\mathbf{x} - \mathbf{x}_i)}$$
where $K_h(\mathbf{u}) = \frac{1}{h^d} K\left(\frac{|\mathbf{u}|}{h}\right)$ and $d$ is the dimensionality.
This is precisely our kernel-weighted KNN regression formula! The only difference is that Nadaraya-Watson traditionally uses all training points (no explicit k), while KNN variants may still select k neighbors first and then apply kernel weights.
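For concreteness, here is a minimal Nadaraya-Watson regression sketch with a Gaussian kernel (illustrative code with names of our own choosing, not the lesson's implementation); note that it sums over all training points rather than a pre-selected k:

```python
import numpy as np

def nadaraya_watson(X_train, y_train, x_query, h):
    """Gaussian-kernel Nadaraya-Watson estimate at a single query point."""
    d = np.linalg.norm(X_train - x_query, axis=1)
    w = np.exp(-0.5 * (d / h) ** 2)          # K_h up to a constant that cancels
    return np.sum(w * y_train) / np.sum(w)   # weighted average of all y_i

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, (200, 1))
y = np.sin(X.ravel()) + 0.1 * rng.normal(size=200)
print(nadaraya_watson(X, y, np.array([np.pi / 2]), h=0.3))  # ≈ sin(π/2) = 1
```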
Connection to Kernel Density Estimation:
The Nadaraya-Watson estimator can be interpreted as:
$$\hat{f}(\mathbf{x}) = \frac{\text{weighted average of } y_i \text{ at } \mathbf{x}}{\text{kernel density estimate at } \mathbf{x}}$$
The denominator $\sum_i K_h(\mathbf{x} - \mathbf{x}_i)$ is (up to a constant) a kernel density estimate of the training data at $\mathbf{x}$, while the numerator is the corresponding kernel estimate of $\int y \, p(\mathbf{x}, y)\, dy$; their ratio estimates the conditional mean $E[Y \mid \mathbf{x}]$.
The Nadaraya-Watson estimator predates modern 'machine learning' terminology but is fundamental to it. Understanding this connection reveals that kernel-weighted KNN isn't an ad-hoc modification—it's a principled statistical estimator with decades of theoretical study.
Theoretical Properties:
Consistency: Under mild conditions, $\hat{f}(\mathbf{x}) \to f(\mathbf{x})$ as $n \to \infty$ if $h \to 0$ and $nh^d \to \infty$.
Asymptotic Bias: For a twice-differentiable true function: $$\text{Bias}(\hat{f}(\mathbf{x})) \approx \frac{h^2}{2} \cdot \mu_2(K) \cdot \left(f''(\mathbf{x}) + 2\frac{f'(\mathbf{x}) \cdot p'(\mathbf{x})}{p(\mathbf{x})}\right)$$
where $\mu_2(K) = \int u^2 K(u) du$ is the second moment of the kernel and $p(\mathbf{x})$ is the data density.
Asymptotic Variance: $$\text{Var}(\hat{f}(\mathbf{x})) \approx \frac{\sigma^2 \, R(K)}{n h^d \, p(\mathbf{x})}$$

where $R(K) = \int K(u)^2 du$ is the roughness of the kernel.
Key Insight: Bias is $O(h^2)$ while variance is $O(1/nh^d)$. This leads to the classic bias-variance tradeoff in bandwidth selection.
The bandwidth $h$ is the critical hyperparameter in kernel-weighted KNN. Choosing $h$ is analogous to choosing $k$ in standard KNN, but with more nuanced control over the locality of predictions.
The Bias-Variance Decomposition:
Mean Squared Error = Bias² + Variance
$$\text{MSE}(\hat{f}) \approx \underbrace{\frac{h^4}{4} \mu_2(K)^2 [f''(x)]^2}_{\text{Bias}^2} + \underbrace{\frac{\sigma^2 R(K)}{n h^d p(x)}}_{\text{Variance}}$$
Optimizing over $h$ yields the asymptotically optimal bandwidth:
$$h_{opt} = \left(\frac{R(K) \cdot d \cdot \sigma^2}{n \cdot \mu_2(K)^2 \cdot \int [f''(x)]^2 p(x) dx}\right)^{1/(d+4)}$$
This depends on unknown quantities ($\sigma^2$, $f''$) and so cannot be computed directly. Practical methods estimate or approximate these.
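The scaling is still informative even with the constants unknown. Balancing the $O(h^4)$ bias-squared term against the $O(1/(nh^d))$ variance term (writing generic constants $c_1, c_2$) gives the familiar nonparametric rate:

$$\frac{d}{dh}\left(c_1 h^4 + \frac{c_2}{n h^d}\right) = 0 \;\Rightarrow\; h_{\text{opt}} \propto n^{-1/(d+4)}, \qquad \text{MSE}(h_{\text{opt}}) = O\!\left(n^{-4/(d+4)}\right)$$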
```python
import numpy as np


def silverman_bandwidth(X: np.ndarray) -> float:
    """
    Silverman's rule of thumb for bandwidth selection.

    h = 0.9 * min(std, IQR/1.34) * n^(-1/(d+4))

    This is designed for Gaussian kernels on roughly normal data.
    """
    n, d = X.shape

    # Per-dimension scales
    stds = X.std(axis=0)
    iqrs = np.percentile(X, 75, axis=0) - np.percentile(X, 25, axis=0)

    # Use minimum of std and IQR/1.34 for robustness
    scales = np.minimum(stds, iqrs / 1.34)
    avg_scale = scales.mean()

    # Silverman's formula
    h = 0.9 * avg_scale * n ** (-1 / (d + 4))
    return h


def cross_validate_bandwidth(X: np.ndarray, y: np.ndarray,
                             bandwidths: np.ndarray,
                             kernel_func=None) -> tuple:
    """
    Select bandwidth by leave-one-out cross-validation.

    Returns (best_bandwidth, cv_scores)
    """
    if kernel_func is None:
        kernel_func = lambda u: np.exp(-0.5 * u**2)  # Gaussian

    n = len(y)
    cv_scores = []

    for h in bandwidths:
        squared_errors = []

        for i in range(n):
            # Leave out point i
            X_loo = np.delete(X, i, axis=0)
            y_loo = np.delete(y, i)
            x_query = X[i:i+1]
            y_true = y[i]

            # Predict using remaining points
            distances = np.linalg.norm(X_loo - x_query, axis=1)
            u = distances / h
            weights = kernel_func(u)

            total_weight = weights.sum()
            if total_weight < 1e-10:
                y_pred = y_loo.mean()  # Fall back to mean
            else:
                y_pred = (weights * y_loo).sum() / total_weight

            squared_errors.append((y_pred - y_true) ** 2)

        mse = np.mean(squared_errors)
        cv_scores.append(mse)

    best_idx = np.argmin(cv_scores)
    return bandwidths[best_idx], cv_scores


def adaptive_bandwidth(X: np.ndarray, k: int = 10) -> np.ndarray:
    """
    Compute adaptive (local) bandwidth for each point.

    h_i = distance to k-th nearest neighbor for point i

    This adapts bandwidth to local data density.
    """
    n = X.shape[0]
    local_bandwidths = np.zeros(n)

    for i in range(n):
        distances = np.linalg.norm(X - X[i], axis=1)
        distances_sorted = np.sort(distances)
        # k-th nearest (index k since index 0 is the point itself)
        local_bandwidths[i] = distances_sorted[k]

    return local_bandwidths


# Demonstration
np.random.seed(42)

# Generate regression data with varying curvature
X_train = np.random.uniform(0, 10, (100, 1))
y_train = np.sin(X_train.flatten()) + 0.1 * np.random.randn(100)

print("Bandwidth Selection Comparison:")
print("=" * 50)

# Silverman's rule
h_silverman = silverman_bandwidth(X_train)
print(f"Silverman's rule: h = {h_silverman:.4f}")

# Cross-validation
h_candidates = np.linspace(0.1, 2.0, 20)
h_cv, scores = cross_validate_bandwidth(X_train, y_train, h_candidates)
print(f"Cross-validated:  h = {h_cv:.4f}")

# Adaptive (show statistics)
h_adaptive = adaptive_bandwidth(X_train, k=5)
print(f"Adaptive (median): h = {np.median(h_adaptive):.4f}")
print(f"Adaptive (range):  [{h_adaptive.min():.4f}, {h_adaptive.max():.4f}]")
```

Standard KNN uses a fixed $k$ (number of neighbors), while kernel regression uses a fixed bandwidth $h$. These can be unified into a single framework:
Fixed-k KNN: Locality is defined by the k-th nearest neighbor distance. In regions of high density, the neighborhood is small; in sparse regions, it expands.
Fixed-h Kernel: Locality is defined by a fixed distance. In dense regions, many points contribute; in sparse regions, few do.
Hybrid Approaches:
kNN with kernel weights: Select k neighbors, then apply kernel weights based on distance within this neighborhood.
Adaptive bandwidth: Set $h_i$ proportional to the distance to the k-th neighbor, combining adaptive locality with kernel smoothing.
Variable kernel: Use a global bandwidth but scale by local density estimates.
| Property | Fixed-k KNN | Fixed-h Kernel | Adaptive/Hybrid |
|---|---|---|---|
| Locality control | Number of neighbors | Distance threshold | Varies by local density |
| Dense regions | Small effective radius | Many contributors | Balanced |
| Sparse regions | Large effective radius (few samples) | Few contributors (high variance) | Balanced |
| Boundary behavior | Always k neighbors (may be far) | May have no neighbors | Adapts gracefully |
| Computation | Exact k neighbors needed | All within h (or approximation) | Requires density estimation |
| Hyperparameter | k (integer) | h (continuous) | Both or derived |
For most applications, start with fixed-k KNN with kernel weights (combine the best of both). Use k for robust neighborhood definition (ensuring you always have enough samples) and kernel weights for smooth, principled weighting within that neighborhood. Set k moderately large (20-50) and let the kernel weighting handle the locality.
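As a concrete starting point, here is a minimal sketch of that recipe using scikit-learn's callable `weights` option (the function and data names are our own; the callable receives an array of neighbor distances per query and should return weights of the same shape, so verify this behavior against your scikit-learn version):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def adaptive_gaussian_weights(dist: np.ndarray) -> np.ndarray:
    """dist has shape (n_queries, k); use each query's largest neighbor distance as h."""
    h = dist.max(axis=1, keepdims=True) + 1e-12   # per-query adaptive bandwidth
    return np.exp(-0.5 * (dist / h) ** 2)         # Gaussian kernel weights

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = np.sin(X.ravel()) + 0.1 * rng.normal(size=200)

# Moderately large k for a robust neighborhood; the kernel handles locality within it
model = KNeighborsRegressor(n_neighbors=30, weights=adaptive_gaussian_weights)
model.fit(X, y)
print(model.predict(np.array([[2.5], [7.0]])))
```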
Relating k and h:
The relationship between k and h depends on data dimensionality $d$ and local density $p(\mathbf{x})$:
$$k \approx n \cdot V_d \cdot h^d \cdot p(\mathbf{x})$$
where $V_d$ is the volume of the unit ball in $d$ dimensions.
This shows that, for a fixed bandwidth, the effective number of neighbors grows with the local density $p(\mathbf{x})$ and with $h^d$: fixed-$h$ weighting recruits many points in dense regions and few in sparse ones, whereas fixed-$k$ implicitly adjusts $h$ to the local density.
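A quick simulation makes the relationship concrete (illustrative values of our own choosing, with uniform data on the unit cube so $p(\mathbf{x}) = 1$):

```python
import numpy as np
from math import gamma, pi

rng = np.random.default_rng(0)
n, d, h = 20000, 3, 0.1
X = rng.uniform(0, 1, (n, d))           # uniform density: p(x) = 1
query = np.full(d, 0.5)                 # interior point; the radius-h ball stays inside

V_d = pi ** (d / 2) / gamma(d / 2 + 1)  # volume of the unit ball in d dimensions
expected_k = n * V_d * h ** d           # predicted neighbor count from k ≈ n * V_d * h^d * p(x)
observed_k = np.sum(np.linalg.norm(X - query, axis=1) <= h)

print(f"expected k ≈ {expected_k:.1f}, observed k = {observed_k}")
```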
While Nadaraya-Watson is primarily a regression estimator, kernel weighting extends naturally to classification.
Kernel-Weighted Class Probabilities:
For a query point $\mathbf{x}$:
$$\hat{P}(Y = c \mid \mathbf{x}) = \frac{\sum_{i: y_i = c} K_h(\mathbf{x} - \mathbf{x}_i)}{\sum_{i=1}^{n} K_h(\mathbf{x} - \mathbf{x}_i)}$$
This estimates the posterior probability of class $c$ at $\mathbf{x}$ using kernel-weighted class membership.
Connection to Generative Classification:
If we interpret $\sum_{i: y_i = c} K_h(\mathbf{x} - \mathbf{x}_i)$ as a kernel density estimate of $p(\mathbf{x} | Y = c)$, then:
$$\hat{P}(c | \mathbf{x}) = \frac{\hat{p}(\mathbf{x} | c) \cdot \hat{p}(c)}{\sum_{c'} \hat{p}(\mathbf{x} | c') \cdot \hat{p}(c')}$$
where $\hat{p}(c) = n_c / n$ is the empirical prior. This is Bayes' rule with kernel-estimated class-conditional densities!
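A small numerical check of this equivalence (a sketch with invented data; the names are ours): the posterior computed from per-class kernel density estimates and empirical priors reproduces the kernel-weighted class score exactly, because the shared normalization constants cancel.

```python
import numpy as np

rng = np.random.default_rng(1)
h = 0.7
X0 = rng.normal(loc=[-1, 0], scale=1.0, size=(80, 2))    # class 0 samples
X1 = rng.normal(loc=[+1, 0], scale=1.0, size=(120, 2))   # class 1 samples
x = np.array([0.3, 0.2])                                 # query point

def kernel_sum(Xc):
    """Unnormalized Gaussian kernel sum over one class's training points."""
    d = np.linalg.norm(Xc - x, axis=1)
    return np.exp(-0.5 * (d / h) ** 2).sum()

# Direct kernel-weighted class score
s0, s1 = kernel_sum(X0), kernel_sum(X1)
p1_vote = s1 / (s0 + s1)

# Bayes' rule with KDE class-conditionals and empirical priors n_c / n
n0, n1 = len(X0), len(X1)
dens0, dens1 = s0 / n0, s1 / n1          # class-conditional KDE, up to a shared constant
p1_bayes = (dens1 * n1) / (dens0 * n0 + dens1 * n1)

print(f"kernel-weighted: P(c=1|x) = {p1_vote:.4f}")
print(f"Bayes with KDE:  P(c=1|x) = {p1_bayes:.4f}")   # identical by construction
```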
```python
import numpy as np
from typing import Callable, Tuple


class KernelKNNClassifier:
    """
    KNN classifier with kernel weighting.

    Combines k-nearest neighbor selection with kernel-weighted voting.
    Outputs calibrated probability estimates.
    """

    def __init__(self, k: int = 20, kernel: str = 'gaussian',
                 bandwidth: str = 'adaptive'):
        """
        Parameters
        ----------
        k : number of neighbors to consider
        kernel : 'gaussian', 'epanechnikov', 'tricube'
        bandwidth : 'adaptive' (based on k-th neighbor), or float
        """
        self.k = k
        self.kernel_name = kernel
        self.bandwidth_mode = bandwidth
        self.X_train = None
        self.y_train = None
        self.classes_ = None

    def _kernel_func(self, u: np.ndarray) -> np.ndarray:
        """Evaluate kernel at normalized distance u."""
        if self.kernel_name == 'gaussian':
            return np.exp(-0.5 * u**2)
        elif self.kernel_name == 'epanechnikov':
            return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
        elif self.kernel_name == 'tricube':
            return np.where(np.abs(u) <= 1, (70/81) * (1 - np.abs(u)**3)**3, 0.0)
        else:
            raise ValueError(f"Unknown kernel: {self.kernel_name}")

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Store training data."""
        self.X_train = np.array(X)
        self.y_train = np.array(y)
        self.classes_ = np.unique(y)
        return self

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Predict class probabilities using kernel-weighted voting."""
        X = np.atleast_2d(X)
        probas = np.zeros((len(X), len(self.classes_)))

        for i, x_query in enumerate(X):
            # Compute all distances
            distances = np.linalg.norm(self.X_train - x_query, axis=1)

            # Find k nearest neighbors
            k_indices = np.argsort(distances)[:self.k]
            k_distances = distances[k_indices]
            k_labels = self.y_train[k_indices]

            # Determine bandwidth
            if self.bandwidth_mode == 'adaptive':
                h = k_distances[-1] + 1e-10  # k-th neighbor distance
            else:
                h = float(self.bandwidth_mode)

            # Compute kernel weights
            u = k_distances / h
            weights = self._kernel_func(u)
            weights = weights / (weights.sum() + 1e-10)

            # Aggregate by class
            for c_idx, c in enumerate(self.classes_):
                mask = k_labels == c
                probas[i, c_idx] = weights[mask].sum()

        return probas

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels."""
        probas = self.predict_proba(X)
        return self.classes_[np.argmax(probas, axis=1)]

    def decision_function(self, X: np.ndarray) -> np.ndarray:
        """Return probability of positive class (for binary classification)."""
        probas = self.predict_proba(X)
        if len(self.classes_) == 2:
            return probas[:, 1]
        return probas


# Demonstration: Compare kernels on 2D classification
np.random.seed(42)

# Generate two-class data
n_per_class = 100
X_class0 = np.random.randn(n_per_class, 2) + [-1, 0]
X_class1 = np.random.randn(n_per_class, 2) + [1, 0]
X_train = np.vstack([X_class0, X_class1])
y_train = np.array([0] * n_per_class + [1] * n_per_class)

# Query points along the class boundary
X_query = np.array([
    [0, 0],      # Center
    [-0.5, 0],   # Slightly class 0
    [0.5, 0],    # Slightly class 1
    [0, 2],      # Same x but different y
])

print("Kernel KNN Classification Comparison:")
print("=" * 60)

for kernel in ['gaussian', 'epanechnikov', 'tricube']:
    clf = KernelKNNClassifier(k=30, kernel=kernel)
    clf.fit(X_train, y_train)
    probas = clf.predict_proba(X_query)

    print(f"\n{kernel.capitalize()} kernel (k=30, adaptive bandwidth):")
    for j, (x, p) in enumerate(zip(X_query, probas)):
        print(f"  x={x}: P(class=0)={p[0]:.3f}, P(class=1)={p[1]:.3f}")
```

You now understand kernel-weighted KNN as a principled nonparametric method with solid theoretical foundations.
The next page explores adaptive neighborhoods—techniques that adjust the locality not just through k or h, but dynamically based on local data characteristics.