A single global parameter—whether k (number of neighbors) or h (bandwidth)—cannot optimally serve every part of the feature space. Consider a dataset with both dense urban centers and sparse rural regions: a neighborhood size tuned for the dense centers will be far too narrow for the sparse countryside, and vice versa.
This fundamental tension between global parameters and local data structure motivates adaptive neighborhoods—techniques that automatically adjust locality based on where you are in the feature space.
Adaptive methods don't just improve accuracy; they fundamentally change what KNN can handle: datasets with heterogeneous density, nonstationary relationships, and complex multimodal structures.
By the end of this page, you will understand the theory and practice of adaptive neighborhoods: variable bandwidth methods (balloon and sample-point estimators), k-nearest neighbor bandwidth adaptation, local density estimation for adaptation, and the mathematical relationship between local density and optimal neighborhood size.
To understand why adaptive neighborhoods matter, we must first quantify how fixed parameters fail under varying density.
Fixed-k Behavior:
With fixed k, the effective neighborhood radius $r_k(\mathbf{x})$ (distance to k-th neighbor) varies inversely with local density:
$$r_k(\mathbf{x}) \approx \left(\frac{k}{n \cdot p(\mathbf{x}) \cdot V_d}\right)^{1/d}$$
where $p(\mathbf{x})$ is the local data density and $V_d$ is the volume of the unit ball in $d$ dimensions.
Key insight: In low-density regions, $r_k$ becomes large—the algorithm reaches far to find k neighbors. This means:
| Region Density | Fixed-k Behavior | Fixed-h Behavior |
|---|---|---|
| High (dense) | Small radius, many good samples | Many samples (possibly too many, slow) |
| Low (sparse) | Large radius, distant neighbors | Few samples (high variance or none) |
| Mixed | Adapts radius automatically | Inconsistent: too many or too few |
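To make the fixed-k row concrete, here is a minimal sanity check (an illustrative sketch, not from this page) of the radius formula in one dimension, where $V_1 = 2$ and the density is uniform:

```python
import numpy as np

# Sanity check (1D, uniform data) of
#   r_k(x) ~ (k / (n * p(x) * V_d))^(1/d),  with V_1 = 2.
rng = np.random.default_rng(0)
n, k = 10_000, 25
X = rng.uniform(0.0, 10.0, n)  # density p(x) = 0.1 everywhere on [0, 10]

def kth_neighbor_distance(X, x, k):
    """Distance from query x to its k-th nearest sample."""
    return np.sort(np.abs(X - x))[k - 1]

r_k_empirical = kth_neighbor_distance(X, 5.0, k)
r_k_theory = k / (n * 0.1 * 2)  # (k / (n p V_1))^(1/1) = 0.0125

print(f"empirical r_k:   {r_k_empirical:.4f}")
print(f"theoretical r_k: {r_k_theory:.4f}")
```

The empirical k-th neighbor distance lands close to the predicted value; halving the density would double it.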
Fixed-h Behavior:
With fixed bandwidth h, the effective number of neighbors $k_{\text{eff}}(\mathbf{x})$ varies with local density:
$$k_{\text{eff}}(\mathbf{x}) \approx n \cdot p(\mathbf{x}) \cdot V_d \cdot h^d$$
In low-density regions, $k_{\text{eff}}$ can become very small—sometimes zero—leading to extreme variance or undefined predictions.
The ideal: We want the best of both worlds—enough neighbors for stable estimation, yet close enough to maintain locality. This requires adapting the neighborhood based on the query location.
In high dimensions, density variations become extreme. Even modest density differences in the marginals compound exponentially in the joint space. A factor of 2 difference in density per dimension becomes 2^d overall. Adaptive methods become essential, not optional, in high-dimensional settings.
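A small numeric sketch (assumed setup, not from this page) of how fixed-h neighbor counts track local density, following the $k_{\text{eff}}$ formula above with $d = 2$ and $V_2 = \pi$:

```python
import numpy as np

# With a FIXED bandwidth h, the effective neighbor count
#   k_eff ~ n * p(x) * V_d * h^d
# tracks local density: a 16x density gap gives ~16x fewer neighbors.
rng = np.random.default_rng(1)
h = 0.1

# Dense cloud: 4000 points on the unit square -> density 4000 per unit area
X_dense = rng.uniform(0.0, 1.0, (4000, 2))
# Sparse cloud: 1000 points on a 2x2 square -> density 250 per unit area
X_sparse = rng.uniform(3.0, 5.0, (1000, 2))

def count_within(X, center, h):
    """Number of points inside the fixed-h ball around `center`."""
    return int(np.sum(np.linalg.norm(X - center, axis=1) <= h))

k_eff_dense = count_within(X_dense, np.array([0.5, 0.5]), h)
k_eff_sparse = count_within(X_sparse, np.array([4.0, 4.0]), h)

# Theory: k_eff ~ n * p * pi * h^2
print(f"dense  k_eff: {k_eff_dense}  (theory ~ {4000 * np.pi * h**2:.0f})")
print(f"sparse k_eff: {k_eff_sparse}  (theory ~ {250 * np.pi * h**2:.0f})")
```

The sparse region ends up with only a handful of neighbors inside the same ball, exactly the high-variance regime the table above describes.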
Two classical approaches exist for adaptive neighborhood selection in nonparametric estimation:
Balloon Estimator:
The bandwidth adapts to the query point $\mathbf{x}$:
$$\hat{f}(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h(\mathbf{x})^d} K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h(\mathbf{x})}\right)$$
The bandwidth $h(\mathbf{x})$ varies with where we're making the prediction. Common choices include the distance to the k-th nearest neighbor of $\mathbf{x}$, or a value inversely proportional to an estimated local density $\hat{p}(\mathbf{x})$.
Intuition: The neighborhood "balloon" expands or contracts based on where you're standing.
Sample-Point Estimator:
The bandwidth adapts to each training point $\mathbf{x}_i$:
$$\hat{f}(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_i^d} K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_i}\right)$$
Each training point has its own bandwidth $h_i$, typically based on local density around $\mathbf{x}_i$.
Intuition: Each training point broadcasts its influence with a reach proportional to how isolated it is.
In KNN for prediction (classification or regression), the balloon estimator is almost universally used. We adapt based on where we want to predict, not based on where training points live. Sample-point estimators are more common in density estimation contexts.
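The two estimators can be contrasted with a short 1D density-estimation sketch (assumed setup: Gaussian kernel, with both $h(\mathbf{x})$ and $h_i$ taken from k-th neighbor distances; the helper names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, 200)  # standard normal sample
k = 15

def kth_dist(X, x, k):
    """k-th nearest-neighbor distance from an arbitrary query x."""
    return np.sort(np.abs(X - x))[k - 1]

def balloon_density(x, X, k):
    """Balloon: one bandwidth h(x), chosen at the QUERY point."""
    h = kth_dist(X, x, k)
    u = (x - X) / h
    return np.mean(np.exp(-0.5 * u**2) / (h * np.sqrt(2 * np.pi)))

# Sample-point: each TRAINING point carries its own bandwidth h_i
# (index k in the sorted distances skips the point itself at index 0).
h_i = np.array([np.sort(np.abs(X - xi))[k] for xi in X])

def sample_point_density(x, X, h_i):
    u = (x - X) / h_i
    return np.mean(np.exp(-0.5 * u**2) / (h_i * np.sqrt(2 * np.pi)))

for x in [0.0, 2.5]:
    print(f"x = {x}: balloon = {balloon_density(x, X, k):.3f}, "
          f"sample-point = {sample_point_density(x, X, h_i):.3f}")
```

Both recover roughly the right shape; the difference is *where* the bandwidth lives: the balloon version recomputes it per query, the sample-point version fixes $h_i$ once at fit time.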
The simplest and most common form of adaptive bandwidth sets $h(\mathbf{x})$ equal to the distance to the k-th nearest neighbor of $\mathbf{x}$:
$$h(\mathbf{x}) = d_{(k)}(\mathbf{x}) = \|\mathbf{x} - \mathbf{x}_{(k)}\|$$
where $\mathbf{x}_{(k)}$ is the k-th closest training point to $\mathbf{x}$.
Properties:
Density-adaptive: In dense regions, $h$ is small; in sparse regions, $h$ is large.
Guarantees k samples: Unlike fixed-h, we always have at least k points within the bandwidth.
Bounded kernel regime: When combined with a compactly supported kernel (like Epanechnikov), exactly k points contribute.
Smooth transition: Even with a Gaussian kernel, the k-th neighbor distance provides natural scale.
```python
import numpy as np
from typing import Tuple


class AdaptiveBandwidthKNN:
    """
    KNN with k-nearest-neighbor adaptive bandwidth.

    Bandwidth at each query point is set to the distance to the k-th
    neighbor, ensuring density-adaptive smoothing.
    """

    def __init__(self, k: int = 10, kernel: str = 'gaussian'):
        """
        Parameters
        ----------
        k : number of neighbors (also determines adaptive bandwidth)
        kernel : 'gaussian', 'epanechnikov', or 'tricube'
        """
        self.k = k
        self.kernel = kernel
        self.X_train = None
        self.y_train = None

    def _kernel_func(self, u: np.ndarray) -> np.ndarray:
        """Apply kernel function to normalized distances."""
        if self.kernel == 'gaussian':
            return np.exp(-0.5 * u**2)
        elif self.kernel == 'epanechnikov':
            return np.where(u <= 1, 0.75 * (1 - u**2), 0.0)
        elif self.kernel == 'tricube':
            return np.where(u <= 1, (70/81) * (1 - u**3)**3, 0.0)
        else:
            raise ValueError(f"Unknown kernel: {self.kernel}")

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Store training data."""
        self.X_train = np.array(X)
        self.y_train = np.array(y)
        return self

    def _predict_one(self, x_query: np.ndarray) -> Tuple[float, dict]:
        """
        Make prediction for a single query point.

        Returns (prediction, metadata).
        """
        # Compute all distances
        distances = np.linalg.norm(self.X_train - x_query, axis=1)

        # Find k nearest neighbors
        sorted_idx = np.argsort(distances)
        k_indices = sorted_idx[:self.k]
        k_distances = distances[k_indices]
        k_labels = self.y_train[k_indices]

        # Adaptive bandwidth = k-th neighbor distance
        h = k_distances[-1] + 1e-10  # Small epsilon for stability

        # Normalize distances and compute kernel weights
        u = k_distances / h
        weights = self._kernel_func(u)
        weights = weights / (weights.sum() + 1e-10)

        # Weighted average for regression
        prediction = np.sum(weights * k_labels)

        metadata = {
            'bandwidth': h,
            'effective_samples': 1 / np.sum(weights**2),  # Effective sample size
            'min_distance': k_distances[0],
            'max_distance': k_distances[-1],
        }
        return prediction, metadata

    def predict(self, X: np.ndarray, return_metadata: bool = False):
        """Predict for all query points."""
        X = np.atleast_2d(X)
        predictions = []
        all_metadata = []
        for x in X:
            pred, meta = self._predict_one(x)
            predictions.append(pred)
            all_metadata.append(meta)
        if return_metadata:
            return np.array(predictions), all_metadata
        return np.array(predictions)


# Demonstration: Varying density regression
np.random.seed(42)

# Create data with varying density
# Dense region: x in [0, 2]
X_dense = np.random.uniform(0, 2, (80, 1))
# Sparse region: x in [5, 10]
X_sparse = np.random.uniform(5, 10, (20, 1))

X_train = np.vstack([X_dense, X_sparse])
y_train = np.sin(X_train.flatten()) + 0.1 * np.random.randn(100)

# Query points in both regions
X_query = np.array([[1.0], [7.5]])

print("Adaptive Bandwidth KNN Regression:")
print("=" * 60)

model = AdaptiveBandwidthKNN(k=10, kernel='gaussian')
model.fit(X_train, y_train)
predictions, metadata = model.predict(X_query, return_metadata=True)

for x, pred, meta in zip(X_query, predictions, metadata):
    region = "DENSE" if x[0] < 3 else "SPARSE"
    print(f"Query x = {x[0]:.1f} ({region} region):")
    print(f"  Prediction: {pred:.4f}")
    print(f"  Adaptive bandwidth h: {meta['bandwidth']:.4f}")
    print(f"  Effective samples: {meta['effective_samples']:.2f}")
    print(f"  Nearest neighbor: {meta['min_distance']:.4f}")
    print(f"  k-th neighbor: {meta['max_distance']:.4f}")
```

Theoretical Justification:
For k-NN bandwidth with the k-th neighbor distance, the effective bandwidth scales as:
$$h(\mathbf{x}) \propto \left(\frac{k}{n \cdot p(\mathbf{x})}\right)^{1/d}$$
This means the ratio $k/n$ controls the global smoothing scale, while $p(\mathbf{x})$ provides local adaptation: as density falls, the bandwidth automatically grows.
A more sophisticated approach estimates local density $\hat{p}(\mathbf{x})$ and sets bandwidth inversely proportional to it.
The Abramson (1982) Square-Root Law:
For sample-point estimators, Abramson showed that setting:
$$h_i = \frac{h_0}{\sqrt{\hat{p}(\mathbf{x}_i)}}$$
where $h_0$ is a global scaling factor and $\hat{p}(\mathbf{x}_i)$ is a pilot density estimate, achieves optimal bias reduction at distribution tails.
Geometric Mean Reference:
To avoid changing the overall smoothing scale, we normalize by the geometric mean of the pilot densities:
$$h_i = h_0 \cdot \left(\frac{\hat{p}(\mathbf{x}_i)}{g}\right)^{-1/2}$$
where $g = \left(\prod_{i=1}^n \hat{p}(\mathbf{x}_i)\right)^{1/n}$ is the geometric mean.
```python
import numpy as np


class DensityAdaptiveKNN:
    """
    KNN with density-based adaptive bandwidth following Abramson's method.

    Uses a pilot density estimate to set bandwidths inversely proportional
    to the square root of local density.
    """

    def __init__(self, k_pilot: int = 10, k_final: int = 20):
        """
        Parameters
        ----------
        k_pilot : k for pilot density estimation
        k_final : k for final predictions
        """
        self.k_pilot = k_pilot
        self.k_final = k_final
        self.X_train = None
        self.y_train = None
        self.local_densities = None
        self.adaptive_bandwidths = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit model and compute adaptive bandwidths."""
        self.X_train = np.array(X)
        self.y_train = np.array(y)
        n, d = self.X_train.shape

        # Step 1: Compute pilot density estimates using k-NN
        self.local_densities = np.zeros(n)
        k_distances = np.zeros(n)

        for i, x in enumerate(self.X_train):
            distances = np.linalg.norm(self.X_train - x, axis=1)
            distances_sorted = np.sort(distances)
            # k-th neighbor distance (index k since index 0 is the point itself)
            k_dist = distances_sorted[self.k_pilot]
            k_distances[i] = k_dist
            # k-NN density estimate: p ~ k / (n * V_d * r^d)
            # We use simplified proportional form (ignoring V_d constant)
            self.local_densities[i] = self.k_pilot / (k_dist ** d + 1e-10)

        # Step 2: Compute Abramson adaptive bandwidths
        # h_i ∝ 1 / sqrt(density_i), normalized by the geometric mean
        log_densities = np.log(self.local_densities + 1e-10)
        geometric_mean_density = np.exp(log_densities.mean())

        # Adaptive bandwidths using square-root law
        self.adaptive_bandwidths = k_distances * np.sqrt(
            geometric_mean_density / (self.local_densities + 1e-10)
        )
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict using density-adaptive neighborhoods."""
        X = np.atleast_2d(X)
        predictions = []

        for x_query in X:
            # Find k_final nearest neighbors
            distances = np.linalg.norm(self.X_train - x_query, axis=1)
            k_indices = np.argsort(distances)[:self.k_final]

            # Use adaptive bandwidths of the selected neighbors
            k_distances = distances[k_indices]
            k_bandwidths = self.adaptive_bandwidths[k_indices]
            k_labels = self.y_train[k_indices]

            # Each training point contributes with its own bandwidth:
            # weight = K((dist from query) / (training point's bandwidth))
            weights = np.zeros(len(k_indices))
            for j, (dist, h) in enumerate(zip(k_distances, k_bandwidths)):
                u = dist / (h + 1e-10)
                weights[j] = np.exp(-0.5 * u**2)  # Gaussian kernel
            weights = weights / (weights.sum() + 1e-10)

            pred = np.sum(weights * k_labels)
            predictions.append(pred)

        return np.array(predictions)

    def get_adaptation_info(self) -> dict:
        """Return information about the adaptive bandwidths."""
        return {
            'min_density': self.local_densities.min(),
            'max_density': self.local_densities.max(),
            'density_ratio': self.local_densities.max() / self.local_densities.min(),
            'min_bandwidth': self.adaptive_bandwidths.min(),
            'max_bandwidth': self.adaptive_bandwidths.max(),
            'bandwidth_ratio': self.adaptive_bandwidths.max() / self.adaptive_bandwidths.min(),
        }


# Demonstration
np.random.seed(42)

# Create data with very different densities
# Cluster 1: tight cluster
X1 = np.random.randn(70, 2) * 0.3 + [0, 0]
# Cluster 2: spread out
X2 = np.random.randn(30, 2) * 2.0 + [5, 5]

X_train = np.vstack([X1, X2])
y_train = X_train[:, 0] + X_train[:, 1] + 0.5 * np.random.randn(100)

model = DensityAdaptiveKNN(k_pilot=5, k_final=15)
model.fit(X_train, y_train)

print("Density-Adaptive KNN:")
print("=" * 50)
info = model.get_adaptation_info()
for key, val in info.items():
    print(f"  {key}: {val:.4f}")

# Test predictions
X_test = np.array([
    [0, 0],  # Dense cluster center
    [5, 5],  # Sparse cluster center
])
preds = model.predict(X_test)
print("Predictions:")
print(f"  Dense center [0,0]: {preds[0]:.4f}")
print(f"  Sparse center [5,5]: {preds[1]:.4f}")
```

The square-root relationship (h ∝ 1/√p) is derived from minimizing asymptotic bias. If density drops by a factor of 4, bandwidth only doubles (not quadruples). This prevents over-smoothing in sparse regions while still adapting significantly.
There's a deep connection between adaptive neighborhoods and the concept of local learning rate or local complexity control.
Fixed Parameters = Global Complexity:
With fixed k or h, the same degree of smoothing is applied everywhere: a single, global complexity level must serve both simple and intricate regions of the target function.
Adaptive Parameters = Local Complexity:
With adaptive neighborhoods, the effective smoothing varies across the feature space: the model can be flexible (small effective bandwidth) where data are dense and the signal is complex, and conservative (large effective bandwidth) where data are sparse or noisy.
This mirrors the concept of local bandwidth selection in nonparametric statistics and adaptive regularization in modern machine learning.
Connection to Bias-Variance Tradeoff:
Let $h(\mathbf{x})$ be the local bandwidth. The local bias-variance decomposition is:
$$\text{MSE}(\hat{f}(\mathbf{x})) = \text{Bias}^2(h(\mathbf{x})) + \text{Var}(h(\mathbf{x}))$$
Optimizing $h(\mathbf{x})$ point-wise:
$$h_{\text{opt}}(\mathbf{x}) = \arg\min_h \left[ \text{Bias}^2(h) + \text{Var}(h) \right]$$
The optimal local bandwidth depends on the local noise level, the local curvature of the target function (which drives bias), and the local sample density $p(\mathbf{x})$.
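Under standard kernel-regression asymptotics (a hedged sketch for the one-dimensional local-constant case, with kernel-dependent constants absorbed into $C_B(\mathbf{x})$ and $C_V(\mathbf{x})$), the two terms scale as:

$$\text{Bias}(h; \mathbf{x}) \approx C_B(\mathbf{x})\, h^2, \qquad \text{Var}(h; \mathbf{x}) \approx \frac{C_V(\mathbf{x})}{n\, h\, p(\mathbf{x})}$$

Setting the derivative of $C_B^2 h^4 + C_V / (n h p)$ with respect to $h$ to zero gives

$$h_{\text{opt}}(\mathbf{x}) = \left( \frac{C_V(\mathbf{x})}{4\, C_B^2(\mathbf{x})\, n\, p(\mathbf{x})} \right)^{1/5} \propto n^{-1/5}$$

where $C_B$ grows with local curvature and $C_V$ with local noise variance. Higher curvature or density shrinks $h_{\text{opt}}$; more noise widens it, which is exactly the adaptation the previous paragraph calls for.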
A more advanced extension uses different bandwidths for different features at each location. This allows the algorithm to learn that, say, feature 1 matters a lot in one region but feature 2 matters more elsewhere. This leads to local metric learning, covered in KNN variants.
Implementing adaptive neighborhoods involves several practical decisions. Here we outline robust strategies.
Strategy 1: k-NN Radius (Most Common)
Use distance to k-th neighbor as bandwidth:
h(x) = distance(x, kth_nearest_neighbor(x))
Pros: Simple, guaranteed k samples, widely supported.
Cons: Discrete (only considers k, not k+1, k+2, etc.).
Strategy 2: Pilot Density + Abramson
First estimate density, then set bandwidth inversely:
density_i = k / (radius_k_i ^ d)
h_i = h_0 / sqrt(density_i)
Pros: Theoretically optimal for density estimation.
Cons: Two-stage, more complex.
Strategy 3: Cross-Validated Local k
Choose k locally via leave-one-out at each training point:
k_i = argmin_k LOO_error_at_x_i(k)
Pros: Fully data-driven, handles heterogeneous complexity.
Cons: Computationally expensive (O(n²) at minimum).
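Strategy 3 can be sketched as follows (a minimal, assumed implementation: unweighted k-NN regression with squared leave-one-out error; the helper `loo_local_k` and the k grid are illustrative choices, not from this page):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 120
X = rng.uniform(0, 10, (n, 1))
y = np.sin(X.ravel()) + 0.2 * rng.standard_normal(n)

def loo_local_k(X, y, k_grid):
    """For each training point, pick the k minimizing its LOO squared error.

    O(n^2) pairwise distances plus O(n * |k_grid|) evaluations; the single-point
    LOO error is noisy, so in practice one would smooth the chosen k over
    neighboring points.
    """
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    order = np.argsort(D, axis=1)  # row i: neighbors of x_i, nearest first
    local_k = np.zeros(n, dtype=int)
    for i in range(n):
        neighbors = order[i, 1:]   # exclude the point itself
        errs = [(y[neighbors[:k]].mean() - y[i]) ** 2 for k in k_grid]
        local_k[i] = k_grid[int(np.argmin(errs))]
    return local_k

k_grid = [3, 5, 10, 20, 40]
local_k = loo_local_k(X, y, k_grid)
print("chosen k counts:", {k: int(np.sum(local_k == k)) for k in k_grid})
```

Different training points end up with different k, reflecting local noise and curvature, at the O(n²) cost noted above.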
```python
import numpy as np
from typing import List, Tuple


def knn_radius_adaptive(X_train: np.ndarray, y_train: np.ndarray,
                        X_query: np.ndarray, k: int) -> np.ndarray:
    """
    Strategy 1: k-NN radius as adaptive bandwidth.
    Most common and computationally efficient.
    """
    predictions = []
    for x in X_query:
        distances = np.linalg.norm(X_train - x, axis=1)
        k_idx = np.argsort(distances)[:k]
        h = distances[k_idx[-1]] + 1e-10

        # Gaussian kernel with adaptive h
        u = distances[k_idx] / h
        weights = np.exp(-0.5 * u**2)
        weights /= weights.sum()

        predictions.append(np.sum(weights * y_train[k_idx]))
    return np.array(predictions)


def lof_adaptive(X_train: np.ndarray, y_train: np.ndarray,
                 X_query: np.ndarray,
                 k_min: int = 5, k_max: int = 30) -> Tuple[np.ndarray, List[int]]:
    """
    Strategy 3 (simplified): Choose k based on local outlier factor.

    Points in anomalous regions (high LOF) get larger k for stability.
    Points in normal regions get smaller k for precision.
    """
    n = len(X_train)

    # Precompute k-distances for LOF
    k_distances = np.zeros((n, k_max))
    for i, x in enumerate(X_train):
        dists = np.sort(np.linalg.norm(X_train - x, axis=1))
        k_distances[i] = dists[1:k_max+1]  # Exclude self

    # Compute approximate LOF for each training point
    lofs = np.zeros(n)
    k_lof = k_max // 2  # Use intermediate k for LOF
    for i in range(n):
        # Find k neighbors of point i
        dists = np.linalg.norm(X_train - X_train[i], axis=1)
        neighbors = np.argsort(dists)[1:k_lof+1]

        # Local reachability density
        reach_dists = np.maximum(dists[neighbors], k_distances[neighbors, k_lof-1])
        lrd_i = k_lof / (reach_dists.sum() + 1e-10)

        # LOF = average ratio of neighbor LRDs to own LRD
        neighbor_reach_dists = np.array([
            np.maximum(np.linalg.norm(X_train - X_train[j], axis=1),
                       k_distances[j, k_lof-1]).sum() / k_lof
            for j in neighbors
        ])
        lrd_neighbors = k_lof / (neighbor_reach_dists + 1e-10)
        lofs[i] = lrd_neighbors.mean() / (lrd_i + 1e-10)

    predictions = []
    chosen_ks = []
    for x in X_query:
        # Find nearest training point to estimate local LOF
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest_idx = np.argmin(dists)
        local_lof = lofs[nearest_idx]

        # Map LOF to k: higher LOF -> larger k
        # LOF ~ 1 is normal, LOF >> 1 is outlier
        lof_factor = np.clip(local_lof, 0.5, 3.0)
        local_k = int(np.clip(k_min * lof_factor, k_min, k_max))
        chosen_ks.append(local_k)

        # Predict with chosen k
        k_idx = np.argsort(dists)[:local_k]
        h = dists[k_idx[-1]] + 1e-10
        u = dists[k_idx] / h
        weights = np.exp(-0.5 * u**2)
        weights /= weights.sum()
        predictions.append(np.sum(weights * y_train[k_idx]))

    return np.array(predictions), chosen_ks


# Demonstration
np.random.seed(42)

# Data with outlier region
X_normal = np.random.randn(80, 2)
X_outlier = np.random.randn(20, 2) * 0.1 + [5, 5]  # Tight cluster far away
X_train = np.vstack([X_normal, X_outlier])
y_train = X_train[:, 0] + 0.3 * np.random.randn(100)

X_query = np.array([
    [0, 0],  # Normal region
    [5, 5],  # Near outlier cluster
    [3, 3],  # Between regions
])

print("Adaptive Strategy Comparison:")
print("=" * 50)

preds_knn = knn_radius_adaptive(X_train, y_train, X_query, k=10)
preds_lof, ks_lof = lof_adaptive(X_train, y_train, X_query, k_min=5, k_max=25)

for i, x in enumerate(X_query):
    print(f"Query {x}:")
    print(f"  k-NN radius (k=10):  pred = {preds_knn[i]:.4f}")
    print(f"  LOF-adaptive (k={ks_lof[i]}): pred = {preds_lof[i]:.4f}")
```

Adaptive neighborhoods must handle boundary effects—what happens when query points are near the edge of the data distribution.
The Boundary Problem:
Near boundaries:
- Neighbors lie predominantly on one side of the query, so weighted averages are pulled toward the interior of the data.
- The k-th neighbor distance inflates, widening adaptive bandwidths exactly where the data run out.

Manifestations:
- Systematic bias at the edges of the feature space, with predictions flattened toward interior values.
- Unstable or inflated bandwidths for queries near or beyond the data support.

Mitigation Strategies:
- Fit local linear models, which correct first-order boundary bias.
- Detect one-sided neighborhoods and widen reported uncertainty, or abstain from predicting.
When a query point lies outside the convex hull of training data, all neighbors are on one side. This is extrapolation, and KNN (like most nonparametric methods) is fundamentally unreliable here. The best adaptive methods cannot fix extrapolation—they can only recognize it and flag uncertainty.
Detecting Boundary Conditions:
We can detect when a query is near the boundary by examining the angular distribution of neighbors:
$$\text{Angular Dispersion} = \frac{1}{k} \sum_{i=1}^k \cos(\theta_i)$$
where $\theta_i$ is the angle between the vector to neighbor $i$ and the average neighbor direction. Values near 1 indicate that neighbors lie on one side of the query (a boundary or extrapolation situation); values near 0 indicate that neighbors surround the query.
When a boundary condition is detected, we can widen the neighborhood for stability, switch to a local linear fit to correct one-sided bias, or flag the prediction as extrapolation with inflated uncertainty.
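The angular check can be sketched as follows (assumed 2D setup, not from this page; note the statistic equals the mean resultant length of the unit neighbor directions, near 1 when neighbors sit on one side and near 0 when they surround the query):

```python
import numpy as np

rng = np.random.default_rng(4)
X_train = rng.uniform(0, 1, (500, 2))  # uniform square: interior vs edge queries

def angular_dispersion(X_train, x_query, k=20):
    """Mean cosine between each neighbor direction and the average direction."""
    diffs = X_train - x_query
    dists = np.linalg.norm(diffs, axis=1)
    idx = np.argsort(dists)[:k]
    dirs = diffs[idx] / (dists[idx, None] + 1e-12)  # unit vectors to neighbors
    mean_dir = dirs.mean(axis=0)
    mean_dir /= (np.linalg.norm(mean_dir) + 1e-12)
    return float(np.mean(dirs @ mean_dir))

interior = angular_dispersion(X_train, np.array([0.5, 0.5]))
outside = angular_dispersion(X_train, np.array([1.5, 1.5]))  # beyond the hull

print(f"interior query: dispersion = {interior:.3f}")
print(f"outside query:  dispersion = {outside:.3f}")
```

A simple threshold on this statistic (say, flagging queries above 0.8) gives a cheap extrapolation detector to pair with the mitigation strategies above.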
You now understand how to adapt KNN neighborhoods to local data characteristics. The next page explores Local Models—extending adaptive KNN beyond simple weighted averages to fit local polynomial models within each neighborhood.