A single global parameter—whether k (number of neighbors) or h (bandwidth)—cannot optimally serve every part of the feature space. Consider a dataset with both dense urban centers and sparse rural regions: a neighborhood size tuned for the dense centers will be far too narrow for the sparse countryside, and vice versa.
This fundamental tension between global parameters and local data structure motivates adaptive neighborhoods—techniques that automatically adjust locality based on where you are in the feature space.
Adaptive methods don't just improve accuracy; they fundamentally change what KNN can handle: datasets with heterogeneous density, nonstationary relationships, and complex multimodal structures.
By the end of this page, you will understand the theory and practice of adaptive neighborhoods: variable bandwidth methods (balloon and sample-point estimators), k-nearest neighbor bandwidth adaptation, local density estimation for adaptation, and the mathematical relationship between local density and optimal neighborhood size.
To understand why adaptive neighborhoods matter, we must first quantify how fixed parameters fail under varying density.
Fixed-k Behavior:
With fixed k, the effective neighborhood radius $r_k(\mathbf{x})$ (distance to k-th neighbor) varies inversely with local density:
$$r_k(\mathbf{x}) \approx \left(\frac{k}{n \cdot p(\mathbf{x}) \cdot V_d}\right)^{1/d}$$
where $p(\mathbf{x})$ is the local data density and $V_d$ is the volume of the unit ball in $d$ dimensions.
Key insight: In low-density regions, $r_k$ becomes large—the algorithm reaches far to find k neighbors. This means:
| Region Density | Fixed-k Behavior | Fixed-h Behavior |
|---|---|---|
| High (dense) | Small radius, many good samples | Many samples (possibly too many, slow) |
| Low (sparse) | Large radius, distant neighbors | Few samples (high variance or none) |
| Mixed | Adapts radius automatically | Inconsistent: too many or too few |
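To make the fixed-k row concrete, here is a minimal sanity check (an illustrative sketch, not from this page) of the radius formula in one dimension, where $V_1 = 2$ and the density is uniform:

```python
import numpy as np

# Sanity check (1D, uniform data) of
#   r_k(x) ~ (k / (n * p(x) * V_d))^(1/d),  with V_1 = 2.
rng = np.random.default_rng(0)
n, k = 10_000, 25
X = rng.uniform(0.0, 10.0, n)  # density p(x) = 0.1 everywhere on [0, 10]

def kth_neighbor_distance(X, x, k):
    """Distance from query x to its k-th nearest sample."""
    return np.sort(np.abs(X - x))[k - 1]

r_k_empirical = kth_neighbor_distance(X, 5.0, k)
r_k_theory = k / (n * 0.1 * 2)  # (k / (n p V_1))^(1/1) = 0.0125

print(f"empirical r_k:   {r_k_empirical:.4f}")
print(f"theoretical r_k: {r_k_theory:.4f}")
```

The empirical k-th neighbor distance lands close to the predicted value; halving the density would double it.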
Fixed-h Behavior:
With fixed bandwidth h, the effective number of neighbors $k_{\text{eff}}(\mathbf{x})$ varies with local density:
$$k_{\text{eff}}(\mathbf{x}) \approx n \cdot p(\mathbf{x}) \cdot V_d \cdot h^d$$
In low-density regions, $k_{\text{eff}}$ can become very small—sometimes zero—leading to extreme variance or undefined predictions.
The ideal: We want the best of both worlds—enough neighbors for stable estimation, yet close enough to maintain locality. This requires adapting the neighborhood based on the query location.
In high dimensions, density variations become extreme. Even modest density differences in the marginals compound exponentially in the joint space. A factor of 2 difference in density per dimension becomes 2^d overall. Adaptive methods become essential, not optional, in high-dimensional settings.
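A small numeric sketch (assumed setup, not from this page) of how fixed-h neighbor counts track local density, following the $k_{\text{eff}}$ formula above with $d = 2$ and $V_2 = \pi$:

```python
import numpy as np

# With a FIXED bandwidth h, the effective neighbor count
#   k_eff ~ n * p(x) * V_d * h^d
# tracks local density: a 16x density gap gives ~16x fewer neighbors.
rng = np.random.default_rng(1)
h = 0.1

# Dense cloud: 4000 points on the unit square -> density 4000 per unit area
X_dense = rng.uniform(0.0, 1.0, (4000, 2))
# Sparse cloud: 1000 points on a 2x2 square -> density 250 per unit area
X_sparse = rng.uniform(3.0, 5.0, (1000, 2))

def count_within(X, center, h):
    """Number of points inside the fixed-h ball around `center`."""
    return int(np.sum(np.linalg.norm(X - center, axis=1) <= h))

k_eff_dense = count_within(X_dense, np.array([0.5, 0.5]), h)
k_eff_sparse = count_within(X_sparse, np.array([4.0, 4.0]), h)

# Theory: k_eff ~ n * p * pi * h^2
print(f"dense  k_eff: {k_eff_dense}  (theory ~ {4000 * np.pi * h**2:.0f})")
print(f"sparse k_eff: {k_eff_sparse}  (theory ~ {250 * np.pi * h**2:.0f})")
```

The sparse region ends up with only a handful of neighbors inside the same ball, exactly the high-variance regime the table above describes.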
Two classical approaches exist for adaptive neighborhood selection in nonparametric estimation:
Balloon Estimator:
The bandwidth adapts to the query point $\mathbf{x}$:
$$\hat{f}(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h(\mathbf{x})^d} K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h(\mathbf{x})}\right)$$
The bandwidth $h(\mathbf{x})$ varies with where we're making the prediction. Common choices include the distance to the k-th nearest neighbor of $\mathbf{x}$, or a value inversely proportional to an estimated local density $\hat{p}(\mathbf{x})$.
Intuition: The neighborhood "balloon" expands or contracts based on where you're standing.
Sample-Point Estimator:
The bandwidth adapts to each training point $\mathbf{x}_i$:
$$\hat{f}(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_i^d} K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_i}\right)$$
Each training point has its own bandwidth $h_i$, typically based on local density around $\mathbf{x}_i$.
Intuition: Each training point broadcasts its influence with a reach proportional to how isolated it is.
In KNN for prediction (classification or regression), the balloon estimator is almost universally used. We adapt based on where we want to predict, not based on where training points live. Sample-point estimators are more common in density estimation contexts.
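The two estimators can be contrasted with a short 1D density-estimation sketch (assumed setup: Gaussian kernel, with both $h(\mathbf{x})$ and $h_i$ taken from k-th neighbor distances; the helper names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, 200)  # standard normal sample
k = 15

def kth_dist(X, x, k):
    """k-th nearest-neighbor distance from an arbitrary query x."""
    return np.sort(np.abs(X - x))[k - 1]

def balloon_density(x, X, k):
    """Balloon: one bandwidth h(x), chosen at the QUERY point."""
    h = kth_dist(X, x, k)
    u = (x - X) / h
    return np.mean(np.exp(-0.5 * u**2) / (h * np.sqrt(2 * np.pi)))

# Sample-point: each TRAINING point carries its own bandwidth h_i
# (index k in the sorted distances skips the point itself at index 0).
h_i = np.array([np.sort(np.abs(X - xi))[k] for xi in X])

def sample_point_density(x, X, h_i):
    u = (x - X) / h_i
    return np.mean(np.exp(-0.5 * u**2) / (h_i * np.sqrt(2 * np.pi)))

for x in [0.0, 2.5]:
    print(f"x = {x}: balloon = {balloon_density(x, X, k):.3f}, "
          f"sample-point = {sample_point_density(x, X, h_i):.3f}")
```

Both recover roughly the right shape; the difference is *where* the bandwidth lives: the balloon version recomputes it per query, the sample-point version fixes $h_i$ once at fit time.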
The simplest and most common form of adaptive bandwidth sets $h(\mathbf{x})$ equal to the distance to the k-th nearest neighbor of $\mathbf{x}$:
$$h(\mathbf{x}) = d_{(k)}(\mathbf{x}) = \|\mathbf{x} - \mathbf{x}_{(k)}\|$$
where $\mathbf{x}_{(k)}$ is the k-th closest training point to $\mathbf{x}$.
Properties:
Density-adaptive: In dense regions, $h$ is small; in sparse regions, $h$ is large.
Guarantees k samples: Unlike fixed-h, we always have at least k points within the bandwidth.
Bounded kernel regime: When combined with a compactly supported kernel (like Epanechnikov), exactly k points contribute.
Smooth transition: Even with a Gaussian kernel, the k-th neighbor distance provides natural scale.
```python
import numpy as np
from typing import Tuple


class AdaptiveBandwidthKNN:
    """
    KNN with k-nearest-neighbor adaptive bandwidth.

    Bandwidth at each query point is set to the distance to the k-th
    neighbor, ensuring density-adaptive smoothing.
    """

    def __init__(self, k: int = 10, kernel: str = 'gaussian'):
        """
        Parameters
        ----------
        k : number of neighbors (also determines adaptive bandwidth)
        kernel : 'gaussian', 'epanechnikov', or 'tricube'
        """
        self.k = k
        self.kernel = kernel
        self.X_train = None
        self.y_train = None

    def _kernel_func(self, u: np.ndarray) -> np.ndarray:
        """Apply kernel function to normalized distances."""
        if self.kernel == 'gaussian':
            return np.exp(-0.5 * u**2)
        elif self.kernel == 'epanechnikov':
            return np.where(u <= 1, 0.75 * (1 - u**2), 0.0)
        elif self.kernel == 'tricube':
            return np.where(u <= 1, (70/81) * (1 - u**3)**3, 0.0)
        else:
            raise ValueError(f"Unknown kernel: {self.kernel}")

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Store training data."""
        self.X_train = np.array(X)
        self.y_train = np.array(y)
        return self

    def _predict_one(self, x_query: np.ndarray) -> Tuple[float, dict]:
        """
        Make prediction for a single query point.

        Returns (prediction, metadata).
        """
        # Compute all distances
        distances = np.linalg.norm(self.X_train - x_query, axis=1)

        # Find k nearest neighbors
        sorted_idx = np.argsort(distances)
        k_indices = sorted_idx[:self.k]
        k_distances = distances[k_indices]
        k_labels = self.y_train[k_indices]

        # Adaptive bandwidth = k-th neighbor distance
        h = k_distances[-1] + 1e-10  # Small epsilon for stability

        # Normalize distances and compute kernel weights
        u = k_distances / h
        weights = self._kernel_func(u)
        weights = weights / (weights.sum() + 1e-10)

        # Weighted average for regression
        prediction = np.sum(weights * k_labels)

        metadata = {
            'bandwidth': h,
            'effective_samples': 1 / np.sum(weights**2),  # Effective sample size
            'min_distance': k_distances[0],
            'max_distance': k_distances[-1],
        }
        return prediction, metadata

    def predict(self, X: np.ndarray, return_metadata: bool = False):
        """Predict for all query points."""
        X = np.atleast_2d(X)
        predictions = []
        all_metadata = []
        for x in X:
            pred, meta = self._predict_one(x)
            predictions.append(pred)
            all_metadata.append(meta)
        if return_metadata:
            return np.array(predictions), all_metadata
        return np.array(predictions)


# Demonstration: Varying density regression
np.random.seed(42)

# Create data with varying density
# Dense region: x in [0, 2]
X_dense = np.random.uniform(0, 2, (80, 1))
# Sparse region: x in [5, 10]
X_sparse = np.random.uniform(5, 10, (20, 1))

X_train = np.vstack([X_dense, X_sparse])
y_train = np.sin(X_train.flatten()) + 0.1 * np.random.randn(100)

# Query points in both regions
X_query = np.array([[1.0], [7.5]])

print("Adaptive Bandwidth KNN Regression:")
print("=" * 60)

model = AdaptiveBandwidthKNN(k=10, kernel='gaussian')
model.fit(X_train, y_train)
predictions, metadata = model.predict(X_query, return_metadata=True)

for x, pred, meta in zip(X_query, predictions, metadata):
    region = "DENSE" if x[0] < 3 else "SPARSE"
    print(f"Query x = {x[0]:.1f} ({region} region):")
    print(f"  Prediction: {pred:.4f}")
    print(f"  Adaptive bandwidth h: {meta['bandwidth']:.4f}")
    print(f"  Effective samples: {meta['effective_samples']:.2f}")
    print(f"  Nearest neighbor: {meta['min_distance']:.4f}")
    print(f"  k-th neighbor: {meta['max_distance']:.4f}")
```

Theoretical Justification:
For k-NN bandwidth with the k-th neighbor distance, the effective bandwidth scales as:
$$h(\mathbf{x}) \propto \left(\frac{k}{n \cdot p(\mathbf{x})}\right)^{1/d}$$
This means the ratio $k/n$ controls the global smoothing scale, while $p(\mathbf{x})$ provides local adaptation: as density falls, the bandwidth automatically grows.
A more sophisticated approach estimates local density $\hat{p}(\mathbf{x})$ and sets bandwidth inversely proportional to it.
The Abramson (1982) Square-Root Law:
For sample-point estimators, Abramson showed that setting:
$$h_i = \frac{h_0}{\sqrt{\hat{p}(\mathbf{x}_i)}}$$
where $h_0$ is a global scaling factor and $\hat{p}(\mathbf{x}_i)$ is a pilot density estimate, achieves optimal bias reduction at distribution tails.
Geometric Mean Reference:
To avoid changing the overall smoothing scale, we normalize by the geometric mean of the pilot densities:
$$h_i = h_0 \cdot \left(\frac{\hat{p}(\mathbf{x}_i)}{g}\right)^{-1/2}$$
where $g = \left(\prod_{i=1}^n \hat{p}(\mathbf{x}_i)\right)^{1/n}$ is the geometric mean.
```python
import numpy as np


class DensityAdaptiveKNN:
    """
    KNN with density-based adaptive bandwidth following Abramson's method.

    Uses a pilot density estimate to set bandwidths inversely proportional
    to the square root of local density.
    """

    def __init__(self, k_pilot: int = 10, k_final: int = 20):
        """
        Parameters
        ----------
        k_pilot : k for pilot density estimation
        k_final : k for final predictions
        """
        self.k_pilot = k_pilot
        self.k_final = k_final
        self.X_train = None
        self.y_train = None
        self.local_densities = None
        self.adaptive_bandwidths = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit model and compute adaptive bandwidths."""
        self.X_train = np.array(X)
        self.y_train = np.array(y)
        n, d = self.X_train.shape

        # Step 1: Compute pilot density estimates using k-NN
        self.local_densities = np.zeros(n)
        k_distances = np.zeros(n)

        for i, x in enumerate(self.X_train):
            distances = np.linalg.norm(self.X_train - x, axis=1)
            distances_sorted = np.sort(distances)
            # k-th neighbor distance (index k since index 0 is the point itself)
            k_dist = distances_sorted[self.k_pilot]
            k_distances[i] = k_dist
            # k-NN density estimate: p ~ k / (n * V_d * r^d)
            # We use simplified proportional form (ignoring V_d constant)
            self.local_densities[i] = self.k_pilot / (k_dist ** d + 1e-10)

        # Step 2: Compute Abramson adaptive bandwidths
        # h_i ∝ 1 / sqrt(density_i), normalized by the geometric mean
        log_densities = np.log(self.local_densities + 1e-10)
        geometric_mean_density = np.exp(log_densities.mean())

        # Adaptive bandwidths using square-root law
        self.adaptive_bandwidths = k_distances * np.sqrt(
            geometric_mean_density / (self.local_densities + 1e-10)
        )
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict using density-adaptive neighborhoods."""
        X = np.atleast_2d(X)
        predictions = []

        for x_query in X:
            # Find k_final nearest neighbors
            distances = np.linalg.norm(self.X_train - x_query, axis=1)
            k_indices = np.argsort(distances)[:self.k_final]

            # Use adaptive bandwidths of the selected neighbors
            k_distances = distances[k_indices]
            k_bandwidths = self.adaptive_bandwidths[k_indices]
            k_labels = self.y_train[k_indices]

            # Each training point contributes with its own bandwidth:
            # weight = K((dist from query) / (training point's bandwidth))
            weights = np.zeros(len(k_indices))
            for j, (dist, h) in enumerate(zip(k_distances, k_bandwidths)):
                u = dist / (h + 1e-10)
                weights[j] = np.exp(-0.5 * u**2)  # Gaussian kernel
            weights = weights / (weights.sum() + 1e-10)

            pred = np.sum(weights * k_labels)
            predictions.append(pred)

        return np.array(predictions)

    def get_adaptation_info(self) -> dict:
        """Return information about the adaptive bandwidths."""
        return {
            'min_density': self.local_densities.min(),
            'max_density': self.local_densities.max(),
            'density_ratio': self.local_densities.max() / self.local_densities.min(),
            'min_bandwidth': self.adaptive_bandwidths.min(),
            'max_bandwidth': self.adaptive_bandwidths.max(),
            'bandwidth_ratio': self.adaptive_bandwidths.max() / self.adaptive_bandwidths.min(),
        }


# Demonstration
np.random.seed(42)

# Create data with very different densities
# Cluster 1: tight cluster
X1 = np.random.randn(70, 2) * 0.3 + [0, 0]
# Cluster 2: spread out
X2 = np.random.randn(30, 2) * 2.0 + [5, 5]

X_train = np.vstack([X1, X2])
y_train = X_train[:, 0] + X_train[:, 1] + 0.5 * np.random.randn(100)

model = DensityAdaptiveKNN(k_pilot=5, k_final=15)
model.fit(X_train, y_train)

print("Density-Adaptive KNN:")
print("=" * 50)
info = model.get_adaptation_info()
for key, val in info.items():
    print(f"  {key}: {val:.4f}")

# Test predictions
X_test = np.array([
    [0, 0],  # Dense cluster center
    [5, 5],  # Sparse cluster center
])
preds = model.predict(X_test)
print("Predictions:")
print(f"  Dense center [0,0]: {preds[0]:.4f}")
print(f"  Sparse center [5,5]: {preds[1]:.4f}")
```

The square-root relationship (h ∝ 1/√p) is derived from minimizing asymptotic bias. If density drops by a factor of 4, bandwidth only doubles (not quadruples). This prevents over-smoothing in sparse regions while still adapting significantly.
There's a deep connection between adaptive neighborhoods and the concept of local learning rate or local complexity control.
Fixed Parameters = Global Complexity:
With fixed k or h, the same degree of smoothing is applied everywhere: a single, global complexity level must serve both simple and intricate regions of the target function.
Adaptive Parameters = Local Complexity:
With adaptive neighborhoods, the effective smoothing varies across the feature space: the model can be flexible (small effective bandwidth) where data are dense and the signal is complex, and conservative (large effective bandwidth) where data are sparse or noisy.
This mirrors the concept of local bandwidth selection in nonparametric statistics and adaptive regularization in modern machine learning.
Connection to Bias-Variance Tradeoff:
Let $h(\mathbf{x})$ be the local bandwidth. The local bias-variance decomposition is:
$$\text{MSE}(\hat{f}(\mathbf{x})) = \text{Bias}^2(h(\mathbf{x})) + \text{Var}(h(\mathbf{x}))$$
Optimizing $h(\mathbf{x})$ point-wise:
$$h_{\text{opt}}(\mathbf{x}) = \arg\min_h \left[ \text{Bias}^2(h) + \text{Var}(h) \right]$$
The optimal local bandwidth depends on the local noise level, the local curvature of the target function (which drives bias), and the local sample density $p(\mathbf{x})$.
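Under standard kernel-regression asymptotics (a hedged sketch for the one-dimensional local-constant case, with kernel-dependent constants absorbed into $C_B(\mathbf{x})$ and $C_V(\mathbf{x})$), the two terms scale as:

$$\text{Bias}(h; \mathbf{x}) \approx C_B(\mathbf{x})\, h^2, \qquad \text{Var}(h; \mathbf{x}) \approx \frac{C_V(\mathbf{x})}{n\, h\, p(\mathbf{x})}$$

Setting the derivative of $C_B^2 h^4 + C_V / (n h p)$ with respect to $h$ to zero gives

$$h_{\text{opt}}(\mathbf{x}) = \left( \frac{C_V(\mathbf{x})}{4\, C_B^2(\mathbf{x})\, n\, p(\mathbf{x})} \right)^{1/5} \propto n^{-1/5}$$

where $C_B$ grows with local curvature and $C_V$ with local noise variance. Higher curvature or density shrinks $h_{\text{opt}}$; more noise widens it, which is exactly the adaptation the previous paragraph calls for.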
A more advanced extension uses different bandwidths for different features at each location. This allows the algorithm to learn that, say, feature 1 matters a lot in one region but feature 2 matters more elsewhere. This leads to local metric learning, covered in KNN variants.
Implementing adaptive neighborhoods involves several practical decisions. Here we outline robust strategies.
Strategy 1: k-NN Radius (Most Common)
Use distance to k-th neighbor as bandwidth:
h(x) = distance(x, kth_nearest_neighbor(x))
Pros: Simple, guaranteed k samples, widely supported.
Cons: Discrete (only considers k, not k+1, k+2, etc.).
Strategy 2: Pilot Density + Abramson
First estimate density, then set bandwidth inversely:
density_i = k / (radius_k_i ^ d)
h_i = h_0 / sqrt(density_i)
Pros: Theoretically optimal for density estimation.
Cons: Two-stage, more complex.
Strategy 3: Cross-Validated Local k
Choose k locally via leave-one-out at each training point:
k_i = argmin_k LOO_error_at_x_i(k)
Pros: Fully data-driven, handles heterogeneous complexity.
Cons: Computationally expensive (O(n²) at minimum).
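Strategy 3 can be sketched as follows (a minimal, assumed implementation: unweighted k-NN regression with squared leave-one-out error; the helper `loo_local_k` and the k grid are illustrative choices, not from this page):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 120
X = rng.uniform(0, 10, (n, 1))
y = np.sin(X.ravel()) + 0.2 * rng.standard_normal(n)

def loo_local_k(X, y, k_grid):
    """For each training point, pick the k minimizing its LOO squared error.

    O(n^2) pairwise distances plus O(n * |k_grid|) evaluations; the single-point
    LOO error is noisy, so in practice one would smooth the chosen k over
    neighboring points.
    """
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    order = np.argsort(D, axis=1)  # row i: neighbors of x_i, nearest first
    local_k = np.zeros(n, dtype=int)
    for i in range(n):
        neighbors = order[i, 1:]   # exclude the point itself
        errs = [(y[neighbors[:k]].mean() - y[i]) ** 2 for k in k_grid]
        local_k[i] = k_grid[int(np.argmin(errs))]
    return local_k

k_grid = [3, 5, 10, 20, 40]
local_k = loo_local_k(X, y, k_grid)
print("chosen k counts:", {k: int(np.sum(local_k == k)) for k in k_grid})
```

Different training points end up with different k, reflecting local noise and curvature, at the O(n²) cost noted above.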
```python
import numpy as np
from typing import List, Tuple


def knn_radius_adaptive(X_train: np.ndarray, y_train: np.ndarray,
                        X_query: np.ndarray, k: int) -> np.ndarray:
    """
    Strategy 1: k-NN radius as adaptive bandwidth.
    Most common and computationally efficient.
    """
    predictions = []
    for x in X_query:
        distances = np.linalg.norm(X_train - x, axis=1)
        k_idx = np.argsort(distances)[:k]
        h = distances[k_idx[-1]] + 1e-10

        # Gaussian kernel with adaptive h
        u = distances[k_idx] / h
        weights = np.exp(-0.5 * u**2)
        weights /= weights.sum()

        predictions.append(np.sum(weights * y_train[k_idx]))
    return np.array(predictions)


def lof_adaptive(X_train: np.ndarray, y_train: np.ndarray,
                 X_query: np.ndarray,
                 k_min: int = 5, k_max: int = 30) -> Tuple[np.ndarray, List[int]]:
    """
    Strategy 3 (simplified): Choose k based on local outlier factor.

    Points in anomalous regions (high LOF) get larger k for stability.
    Points in normal regions get smaller k for precision.
    """
    n = len(X_train)

    # Precompute k-distances for LOF
    k_distances = np.zeros((n, k_max))
    for i, x in enumerate(X_train):
        dists = np.sort(np.linalg.norm(X_train - x, axis=1))
        k_distances[i] = dists[1:k_max+1]  # Exclude self

    # Compute approximate LOF for each training point
    lofs = np.zeros(n)
    k_lof = k_max // 2  # Use intermediate k for LOF
    for i in range(n):
        # Find k neighbors of point i
        dists = np.linalg.norm(X_train - X_train[i], axis=1)
        neighbors = np.argsort(dists)[1:k_lof+1]

        # Local reachability density
        reach_dists = np.maximum(dists[neighbors], k_distances[neighbors, k_lof-1])
        lrd_i = k_lof / (reach_dists.sum() + 1e-10)

        # LOF = average ratio of neighbor LRDs to own LRD
        neighbor_reach_dists = np.array([
            np.maximum(np.linalg.norm(X_train - X_train[j], axis=1),
                       k_distances[j, k_lof-1]).sum() / k_lof
            for j in neighbors
        ])
        lrd_neighbors = k_lof / (neighbor_reach_dists + 1e-10)
        lofs[i] = lrd_neighbors.mean() / (lrd_i + 1e-10)

    predictions = []
    chosen_ks = []
    for x in X_query:
        # Find nearest training point to estimate local LOF
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest_idx = np.argmin(dists)
        local_lof = lofs[nearest_idx]

        # Map LOF to k: higher LOF -> larger k
        # LOF ~ 1 is normal, LOF >> 1 is outlier
        lof_factor = np.clip(local_lof, 0.5, 3.0)
        local_k = int(np.clip(k_min * lof_factor, k_min, k_max))
        chosen_ks.append(local_k)

        # Predict with chosen k
        k_idx = np.argsort(dists)[:local_k]
        h = dists[k_idx[-1]] + 1e-10
        u = dists[k_idx] / h
        weights = np.exp(-0.5 * u**2)
        weights /= weights.sum()
        predictions.append(np.sum(weights * y_train[k_idx]))

    return np.array(predictions), chosen_ks


# Demonstration
np.random.seed(42)

# Data with outlier region
X_normal = np.random.randn(80, 2)
X_outlier = np.random.randn(20, 2) * 0.1 + [5, 5]  # Tight cluster far away
X_train = np.vstack([X_normal, X_outlier])
y_train = X_train[:, 0] + 0.3 * np.random.randn(100)

X_query = np.array([
    [0, 0],  # Normal region
    [5, 5],  # Near outlier cluster
    [3, 3],  # Between regions
])

print("Adaptive Strategy Comparison:")
print("=" * 50)

preds_knn = knn_radius_adaptive(X_train, y_train, X_query, k=10)
preds_lof, ks_lof = lof_adaptive(X_train, y_train, X_query, k_min=5, k_max=25)

for i, x in enumerate(X_query):
    print(f"Query {x}:")
    print(f"  k-NN radius (k=10):  pred = {preds_knn[i]:.4f}")
    print(f"  LOF-adaptive (k={ks_lof[i]}): pred = {preds_lof[i]:.4f}")
```

Adaptive neighborhoods must handle boundary effects—what happens when query points are near the edge of the data distribution.
The Boundary Problem:
Near boundaries:
- Neighbors lie predominantly on one side of the query, so weighted averages are pulled toward the interior of the data.
- The k-th neighbor distance inflates, widening adaptive bandwidths exactly where the data run out.

Manifestations:
- Systematic bias at the edges of the feature space, with predictions flattened toward interior values.
- Unstable or inflated bandwidths for queries near or beyond the data support.

Mitigation Strategies:
- Fit local linear models, which correct first-order boundary bias.
- Detect one-sided neighborhoods and widen reported uncertainty, or abstain from predicting.
When a query point lies outside the convex hull of training data, all neighbors are on one side. This is extrapolation, and KNN (like most nonparametric methods) is fundamentally unreliable here. The best adaptive methods cannot fix extrapolation—they can only recognize it and flag uncertainty.
Detecting Boundary Conditions:
We can detect when a query is near the boundary by examining the angular distribution of neighbors:
$$\text{Angular Dispersion} = \frac{1}{k} \sum_{i=1}^k \cos(\theta_i)$$
where $\theta_i$ is the angle between the vector to neighbor $i$ and the average neighbor direction. Values near 1 indicate that neighbors lie on one side of the query (a boundary or extrapolation situation); values near 0 indicate that neighbors surround the query.
When a boundary condition is detected, we can widen the neighborhood for stability, switch to a local linear fit to correct one-sided bias, or flag the prediction as extrapolation with inflated uncertainty.
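The angular check can be sketched as follows (assumed 2D setup, not from this page; note the statistic equals the mean resultant length of the unit neighbor directions, near 1 when neighbors sit on one side and near 0 when they surround the query):

```python
import numpy as np

rng = np.random.default_rng(4)
X_train = rng.uniform(0, 1, (500, 2))  # uniform square: interior vs edge queries

def angular_dispersion(X_train, x_query, k=20):
    """Mean cosine between each neighbor direction and the average direction."""
    diffs = X_train - x_query
    dists = np.linalg.norm(diffs, axis=1)
    idx = np.argsort(dists)[:k]
    dirs = diffs[idx] / (dists[idx, None] + 1e-12)  # unit vectors to neighbors
    mean_dir = dirs.mean(axis=0)
    mean_dir /= (np.linalg.norm(mean_dir) + 1e-12)
    return float(np.mean(dirs @ mean_dir))

interior = angular_dispersion(X_train, np.array([0.5, 0.5]))
outside = angular_dispersion(X_train, np.array([1.5, 1.5]))  # beyond the hull

print(f"interior query: dispersion = {interior:.3f}")
print(f"outside query:  dispersion = {outside:.3f}")
```

A simple threshold on this statistic (say, flagging queries above 0.8) gives a cheap extrapolation detector to pair with the mitigation strategies above.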
You now understand how to adapt KNN neighborhoods to local data characteristics. The next page explores Local Models—extending adaptive KNN beyond simple weighted averages to fit local polynomial models within each neighborhood.