In standard K-Nearest Neighbors, every neighbor gets an equal vote. The closest neighbor and the k-th nearest neighbor, despite potentially vast differences in distance to the query point, contribute identically to the final prediction. This egalitarian approach carries a profound flaw: it ignores the very information that defines the algorithm—distance.
Consider a query point with k=5 neighbors. One neighbor sits almost exactly at the query location (distance ≈ 0.001), while the remaining four cluster at a much greater distance (distance ≈ 0.8). In uniform voting, that extremely close neighbor—practically a clone of our query—gets the same weight as neighbors that might live in different regions of feature space entirely. This fundamentally violates our intuition about locality.
By the end of this page, you will deeply understand distance-weighted voting: why it exists, how it's formulated, the mathematics of weighting functions, edge cases and numerical stability concerns, and the theoretical justification for treating closer neighbors as more informative. You'll be equipped to implement, debug, and reason about weighted KNN systems at a production level.
To understand why distance weighting matters, we must first formalize what uniform KNN assumes and where that assumption breaks down.
The Uniform KNN Assumption:
Standard KNN makes an implicit assumption: all points within the k-neighborhood are equally representative of the query point's true class or value. Mathematically, if we denote the query point as $\mathbf{x}_q$ and its k neighbors as $\{\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_k\}$ with distances $\{d_1, d_2, ..., d_k\}$ where $d_1 \leq d_2 \leq ... \leq d_k$, uniform voting estimates:
$$P(y = c \mid \mathbf{x}_q) = \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}[y_i = c]$$
This formula assigns probability based purely on the count of neighbors in each class, completely discarding the distance information $\{d_1, ..., d_k\}$.
After computing k distances (which is computationally expensive), uniform KNN throws away all magnitude information and keeps only the rankings. This is statistically wasteful—we computed continuous values but reduce them to mere ordinal relationships.
Where Uniform Voting Fails:
Consider a binary classification problem with k=7. The query point has two class A neighbors at very small distances (say, d ≈ 0.05 each) and five class B neighbors near the edge of the neighborhood (say, d ≈ 0.9 each).
Uniform voting yields: $$P(A) = \frac{2}{7} \approx 0.286, \quad P(B) = \frac{5}{7} \approx 0.714$$
Prediction: Class B
But examine the geometry! The two class A neighbors are almost superimposed on the query point, while all class B neighbors are nearly at the boundary of the k-neighborhood. Intuitively, the query point is deeply embedded in a region dominated by class A locally, with class B points lying at the periphery.
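As a quick numeric sketch of this scenario (using the illustrative distances assumed above, not data from any real example), inverse-square weighting flips the decision:

```python
import numpy as np

# Assumed distances for the k = 7 example: two very close class A neighbors,
# five class B neighbors near the edge of the neighborhood
distances = np.array([0.05, 0.05, 0.9, 0.9, 0.9, 0.9, 0.9])
labels = np.array(["A", "A", "B", "B", "B", "B", "B"])

# Inverse-square weights, normalized to sum to 1
weights = 1.0 / distances**2
weights /= weights.sum()

for c in ("A", "B"):
    print(c, round(weights[labels == c].sum(), 3))
# Class A collects ~0.99 of the total weight, so the weighted prediction
# is A even though uniform voting picks B.
```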
This situation arises frequently in practice, for example near class boundaries, in regions where classes overlap, and wherever local density differs sharply between classes.
Distance-weighted KNN replaces uniform voting with a weighted voting scheme. We introduce a weight function $w(d)$ that maps distance to contribution weight. The fundamental requirements for this function encode our intuitions about locality:
Axioms of a Weight Function:
Non-negativity: $w(d) \geq 0$ for all $d \geq 0$ (weights cannot be negative)
Monotonicity: $w(d_1) \geq w(d_2)$ whenever $d_1 < d_2$ (closer means higher weight)
Finite at Zero: $w(0)$ is finite and well-defined (or approaches a finite limit)
Decay at Infinity: $\lim_{d \to \infty} w(d) = 0$ (very distant points contribute nothing)
With a weight function, the weighted KNN prediction becomes:
$$P(y = c \mid \mathbf{x}_q) = \frac{\sum_{i=1}^{k} w(d_i) \cdot \mathbb{1}[y_i = c]}{\sum_{i=1}^{k} w(d_i)}$$
For regression:
$$\hat{y}(\mathbf{x}_q) = \frac{\sum_{i=1}^{k} w(d_i) \cdot y_i}{\sum_{i=1}^{k} w(d_i)}$$
Notice the denominator: we always normalize by the sum of weights. This ensures our predictions remain valid probabilities (for classification) or weighted averages (for regression). Without normalization, the scale of distances would arbitrarily affect predictions.
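Here is a minimal sketch of the normalized regression estimate, assuming Euclidean distances and inverse-square weights (the function name and values are illustrative):

```python
import numpy as np

def weighted_knn_regress(distances, y_neighbors, power=2.0, eps=1e-10):
    """Normalized inverse-distance-weighted average of neighbor targets."""
    w = 1.0 / np.maximum(distances, eps) ** power
    return np.sum(w * y_neighbors) / np.sum(w)

d = np.array([0.2, 0.5, 1.0])
y = np.array([3.0, 5.0, 10.0])

print(weighted_knn_regress(d, y))         # weighted average, dominated by y = 3.0
print(weighted_knn_regress(1000 * d, y))  # same result: for IDW, rescaling all
                                          # distances cancels out after normalization
```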
The Inverse Distance Weighting (IDW) Function:
The most common weight function is inverse distance weighting:
$$w(d) = \frac{1}{d^p}$$
where $p > 0$ is a power parameter controlling decay rate.
Common choices are $p = 1$ (plain inverse distance), $p = 2$ (inverse square, the usual default), and $p = 3$ (aggressive decay); an exponential weight $\exp(-d)$ is included in the table below for comparison.
Mathematical Properties:
For $p = 2$ (inverse square), halving the distance quadruples the weight, and doubling it cuts the weight to a quarter. This quadratic relationship strongly emphasizes close neighbors while not completely ignoring distant ones.
| Distance d | w(d) = 1/d | w(d) = 1/d² | w(d) = 1/d³ | w(d) = exp(-d) |
|---|---|---|---|---|
| 0.1 | 10.00 | 100.00 | 1000.00 | 0.905 |
| 0.5 | 2.00 | 4.00 | 8.00 | 0.607 |
| 1.0 | 1.00 | 1.00 | 1.00 | 0.368 |
| 2.0 | 0.50 | 0.25 | 0.125 | 0.135 |
| 5.0 | 0.20 | 0.04 | 0.008 | 0.007 |
| 10.0 | 0.10 | 0.01 | 0.001 | 0.00005 |
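The table values are easy to reproduce directly (a quick sketch, nothing library-specific):

```python
import numpy as np

d = np.array([0.1, 0.5, 1.0, 2.0, 5.0, 10.0])
for label, w in [("1/d", 1 / d), ("1/d^2", 1 / d**2),
                 ("1/d^3", 1 / d**3), ("exp(-d)", np.exp(-d))]:
    print(f"{label:>8}:", np.round(w, 5))
```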
The inverse distance function $w(d) = 1/d^p$ has a critical flaw: it explodes at $d = 0$.
When a training point exactly coincides with the query point (which happens with duplicate data, overfitting scenarios, or numerical precision limits), we face division by zero:
$$\lim_{d \to 0^+} \frac{1}{d^p} = +\infty$$
This isn't merely a numerical inconvenience—it's a conceptual issue. What should happen when the query exactly matches a training point?
The Philosophical Question:
If the query point is identical to a training point, two reasonable philosophies exist:
Interpolation View: The prediction should exactly equal the training label (we know the answer exactly)
Estimation View: The prediction should still consider other neighbors (accounting for noise in the training label)
Most implementations adopt a hybrid approach using smoothed inverse distance:
```python
import numpy as np


def inverse_distance_weights(distances: np.ndarray,
                             power: float = 2.0,
                             epsilon: float = 1e-10) -> np.ndarray:
    """
    Compute inverse distance weights with numerical stability.

    Parameters
    ----------
    distances : array of distances to k neighbors
    power : exponent for distance (higher = more weight to close neighbors)
    epsilon : small constant to prevent division by zero

    Returns
    -------
    weights : normalized weights summing to 1
    """
    # Handle exact matches (distance = 0)
    exact_match = distances < epsilon
    if np.any(exact_match):
        # If any distance is essentially zero, that neighbor gets all weight
        weights = exact_match.astype(float)
    else:
        # Standard inverse distance weighting
        weights = 1.0 / (distances ** power)

    # Normalize to sum to 1
    return weights / weights.sum()


def shepard_weights(distances: np.ndarray, power: float = 2.0) -> np.ndarray:
    """
    Shepard's interpolation weights - alternative stable formulation.

    Uses the formula: w_i = (d_max - d_i) / d_i

    This naturally handles the boundary case more gracefully.
    """
    d_max = distances.max()
    # Avoid division by zero
    safe_distances = np.maximum(distances, 1e-10)
    weights = ((d_max - distances) / safe_distances) ** power
    return weights / weights.sum()


# Demonstration
distances = np.array([0.1, 0.5, 0.8, 1.2, 1.5])
print("Distances:", distances)
print("IDW weights (p=2):", inverse_distance_weights(distances, power=2))
print("Shepard weights:", shepard_weights(distances, power=2))

# Edge case: exact match
distances_with_match = np.array([0.0, 0.5, 0.8, 1.2, 1.5])
print("\nWith exact match (d=0):")
print("IDW weights:", inverse_distance_weights(distances_with_match, power=2))
```

Adding a small epsilon, $w(d) = 1/(d + \epsilon)^p$, is common but changes the weight function's behavior everywhere, especially for small distances. The conditional approach (checking for exact matches explicitly) preserves the true inverse relationship for normal cases while handling the singularity gracefully.
Alternative: Gaussian/Exponential Weighting:
To avoid the zero-distance singularity entirely, we can use weight functions that are naturally bounded:
$$w(d) = \exp\left(-\frac{d^2}{2\sigma^2}\right)$$
This Gaussian weight function is bounded everywhere (with $w(0) = 1$), decays smoothly with distance, and exposes a bandwidth parameter $\sigma$ that controls how quickly influence falls off.
The tradeoff: inverse distance weighting emphasizes very close neighbors more strongly than Gaussian weighting, which can be advantageous or disadvantageous depending on the data distribution.
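A minimal sketch of Gaussian weighting with an adaptive bandwidth (using the median neighbor distance as $\sigma$ is an illustrative assumption, mirroring the classifier implementation later on this page):

```python
import numpy as np

def gaussian_weights(distances, sigma=None):
    """Gaussian (RBF) weights; bounded at d = 0, so no singularity to handle."""
    if sigma is None:
        # Adaptive bandwidth: median neighbor distance (an illustrative choice)
        sigma = np.median(distances) + 1e-10
    w = np.exp(-distances**2 / (2 * sigma**2))
    return w / w.sum()

d = np.array([0.0, 0.5, 0.8, 1.2, 1.5])
print(gaussian_weights(d))  # the d = 0 neighbor gets the largest, but finite, weight
```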
Let's rigorously analyze how distance weighting affects the bias-variance tradeoff in KNN predictions.
Setup:
Assume we have a regression problem with true function $f(\mathbf{x})$ and observations $y_i = f(\mathbf{x}_i) + \epsilon_i$ where $\epsilon_i$ are i.i.d. noise with mean 0 and variance $\sigma^2$.
For a query point $\mathbf{x}_q$, the weighted KNN prediction is:
$$\hat{f}(\mathbf{x}_q) = \sum_{i=1}^{k} \tilde{w}_i y_i$$
where $\tilde{w}_i = w(d_i) / \sum_j w(d_j)$ are normalized weights.
Bias Analysis:
The bias of the weighted estimator is:
$$\text{Bias}(\hat{f}) = \mathbb{E}[\hat{f}(\mathbf{x}_q)] - f(\mathbf{x}_q) = \sum_{i=1}^{k} \tilde{w}_i f(\mathbf{x}_i) - f(\mathbf{x}_q)$$
Using a Taylor expansion of $f$ around $\mathbf{x}_q$:
$$f(\mathbf{x}_i) \approx f(\mathbf{x}_q) + \nabla f(\mathbf{x}_q)^T (\mathbf{x}_i - \mathbf{x}_q) + O(\|\mathbf{x}_i - \mathbf{x}_q\|^2)$$
Substituting:
$$\text{Bias} \approx \nabla f(\mathbf{x}_q)^T \sum_{i=1}^{k} \tilde{w}_i (\mathbf{x}_i - \mathbf{x}_q) + O(\text{second order})$$
The first-order bias depends on the weighted centroid of neighbors relative to the query point. If the weighted centroid equals the query point (perfect symmetry), first-order bias vanishes. Distance weighting shifts the weighted centroid toward closer neighbors, reducing bias when the function varies within the neighborhood.
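To make the first-order term concrete, here is a small sketch (with made-up neighbor coordinates) comparing the weighted-centroid offset $\sum_i \tilde{w}_i (\mathbf{x}_i - \mathbf{x}_q)$ under uniform and inverse-square weights:

```python
import numpy as np

rng = np.random.default_rng(0)
x_q = np.zeros(2)
# Hypothetical neighborhood: a few close neighbors on one side of the query,
# several farther neighbors on the other side
neighbors = np.vstack([rng.normal(0.1, 0.02, (3, 2)),
                       rng.normal(-0.8, 0.1, (4, 2))])
d = np.linalg.norm(neighbors - x_q, axis=1)

for name, w in [("uniform", np.ones_like(d)), ("inverse-square", 1 / d**2)]:
    w = w / w.sum()
    offset = (w[:, None] * (neighbors - x_q)).sum(axis=0)
    print(name, "centroid offset:", np.round(offset, 3))
# The inverse-square weighted centroid sits much closer to the query,
# shrinking the first-order bias bound ||grad f|| * ||offset||.
```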
Variance Analysis:
The variance of the weighted estimator is:
$$\text{Var}(\hat{f}) = \text{Var}\left(\sum_{i=1}^{k} \tilde{w}_i y_i\right) = \sum_{i=1}^{k} \tilde{w}_i^2 \sigma^2 = \sigma^2 \sum_{i=1}^{k} \tilde{w}_i^2$$
Define the effective sample size as:
$$n_{\text{eff}} = \frac{1}{\sum_{i=1}^{k} \tilde{w}_i^2}$$
This tells us how many "equivalent uniform samples" our weighted average represents.
The Tradeoff:
| Weighting Style | Effective Sample Size | Bias | Variance |
|---|---|---|---|
| Uniform ($w = 1$) | $n_{\text{eff}} = k$ | Higher (includes distant points) | Lower (averages over k points) |
| Strong weighting | $n_{\text{eff}} \ll k$ | Lower (focuses on close points) | Higher (fewer effective samples) |
Distance weighting trades increased variance for reduced bias. This is favorable when the target function varies appreciably within the k-neighborhood and the label noise is modest; the snippet below compares effective sample sizes for several weighting schemes.
```python
import numpy as np


def effective_sample_size(weights: np.ndarray) -> float:
    """
    Compute effective sample size for a weighted estimate.

    n_eff = 1 / sum(w_i^2) where weights are normalized to sum to 1.

    For uniform weights: n_eff = k
    For concentrated weights: n_eff < k
    Extreme case (single weight = 1): n_eff = 1
    """
    # Ensure normalization
    normalized_weights = weights / weights.sum()
    return 1.0 / np.sum(normalized_weights ** 2)


# Compare effective sample sizes for different weight schemes
k = 10
uniform_weights = np.ones(k) / k

# Inverse distance with distances increasing linearly
distances = np.linspace(0.1, 1.0, k)
inv_weights = 1.0 / distances ** 2
inv_weights = inv_weights / inv_weights.sum()

# Very concentrated (exponential decay)
exp_weights = np.exp(-3 * distances)
exp_weights = exp_weights / exp_weights.sum()

print(f"k = {k} neighbors")
print(f"Uniform weights:    n_eff = {effective_sample_size(uniform_weights):.2f}")
print(f"Inverse-square:     n_eff = {effective_sample_size(inv_weights):.2f}")
print(f"Strong exponential: n_eff = {effective_sample_size(exp_weights):.2f}")

# Variance multiplier relative to uniform
print(f"\nVariance multiplier (vs uniform k={k}):")
print(f"Inverse-square:     {k * np.sum(inv_weights**2):.2f}x")
print(f"Strong exponential: {k * np.sum(exp_weights**2):.2f}x")
```

In classification, weighted voting aggregates class votes using distance-derived weights.
Weighted Voting Rule:
For a query point $\mathbf{x}_q$ with k neighbors $\{(\mathbf{x}_1, c_1), ..., (\mathbf{x}_k, c_k)\}$ at distances $\{d_1, ..., d_k\}$:
$$\text{score}(c) = \sum_{i: c_i = c} w(d_i)$$
$$\hat{c} = \arg\max_{c} \text{score}(c)$$
The predicted class is the one with the highest weighted vote.
Probabilistic Interpretation:
We can convert scores to probabilities:
$$P(c \mid \mathbf{x}_q) = \frac{\text{score}(c)}{\sum_{c'} \text{score}(c')} = \frac{\sum_{i: c_i = c} w(d_i)}{\sum_{i=1}^{k} w(d_i)}$$
This provides graded probability estimates rather than only hard classifications (though they are not automatically well calibrated).
```python
import numpy as np


class WeightedKNNClassifier:
    """
    K-Nearest Neighbors classifier with distance weighting.

    Implements multiple weighting schemes and provides both
    hard predictions and probability estimates.
    """

    def __init__(self, k: int = 5, weight_type: str = 'inverse_squared',
                 power: float = 2.0):
        """
        Parameters
        ----------
        k : number of neighbors
        weight_type : 'uniform', 'inverse', 'inverse_squared', 'gaussian'
        power : power for inverse weighting (ignored for gaussian)
        """
        self.k = k
        self.weight_type = weight_type
        self.power = power
        self.X_train = None
        self.y_train = None
        self.classes_ = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Store training data and extract classes."""
        self.X_train = np.array(X)
        self.y_train = np.array(y)
        self.classes_ = np.unique(y)
        return self

    def _compute_weights(self, distances: np.ndarray) -> np.ndarray:
        """Compute weights from distances based on weight_type."""
        epsilon = 1e-10

        if self.weight_type == 'uniform':
            return np.ones_like(distances)
        elif self.weight_type == 'inverse':
            # Handle zero distances
            safe_distances = np.maximum(distances, epsilon)
            return 1.0 / safe_distances
        elif self.weight_type == 'inverse_squared':
            safe_distances = np.maximum(distances, epsilon)
            return 1.0 / (safe_distances ** self.power)
        elif self.weight_type == 'gaussian':
            # Bandwidth = median distance (adaptive)
            sigma = np.median(distances) + epsilon
            return np.exp(-distances**2 / (2 * sigma**2))
        else:
            raise ValueError(f"Unknown weight type: {self.weight_type}")

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """
        Predict class probabilities for each sample in X.

        Returns
        -------
        proba : shape (n_samples, n_classes)
            Probability of each class for each sample
        """
        X = np.atleast_2d(X)
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        probas = np.zeros((n_samples, n_classes))

        for i, x_query in enumerate(X):
            # Compute distances to all training points
            distances = np.linalg.norm(self.X_train - x_query, axis=1)

            # Find k nearest neighbors
            k_indices = np.argsort(distances)[:self.k]
            k_distances = distances[k_indices]
            k_labels = self.y_train[k_indices]

            # Compute weights
            weights = self._compute_weights(k_distances)

            # Accumulate weighted votes per class
            class_scores = np.zeros(n_classes)
            for label, weight in zip(k_labels, weights):
                class_idx = np.where(self.classes_ == label)[0][0]
                class_scores[class_idx] += weight

            # Normalize to probabilities
            probas[i] = class_scores / class_scores.sum()

        return probas

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels for samples in X."""
        probas = self.predict_proba(X)
        return self.classes_[np.argmax(probas, axis=1)]


# Demonstration: Uniform vs Weighted KNN near decision boundary
np.random.seed(42)

# Create training data: two classes with overlap
X_train = np.vstack([
    np.random.randn(50, 2) + [0, 0],  # Class 0
    np.random.randn(50, 2) + [2, 2],  # Class 1
])
y_train = np.array([0] * 50 + [1] * 50)

# Query point near decision boundary
X_query = np.array([[1.0, 1.0]])

# Compare uniform and weighted predictions
uniform_knn = WeightedKNNClassifier(k=7, weight_type='uniform')
weighted_knn = WeightedKNNClassifier(k=7, weight_type='inverse_squared')

uniform_knn.fit(X_train, y_train)
weighted_knn.fit(X_train, y_train)

print("Query point:", X_query[0])
print("\nUniform KNN (k=7):")
print(f"  Probabilities: {uniform_knn.predict_proba(X_query)[0]}")
print(f"  Prediction: {uniform_knn.predict(X_query)[0]}")

print("\nWeighted KNN (k=7, inverse-square):")
print(f"  Probabilities: {weighted_knn.predict_proba(X_query)[0]}")
print(f"  Prediction: {weighted_knn.predict(X_query)[0]}")
```

Weighted KNN typically produces better-calibrated probability estimates than uniform KNN. The weights act as a form of soft locality: confidence naturally decreases when the closest neighbors are far away (all weights become similar) and increases when close neighbors agree (close neighbors dominate).
In uniform voting, ties occur when multiple classes have equal vote counts. Weighted voting dramatically reduces but doesn't eliminate ties—exact ties require weighted sums to match precisely, which is rare with real-valued weights.
When Weighted Ties Still Occur: exact weighted ties can still arise in highly symmetric configurations (a query equidistant from neighbors of different classes), with heavily discretized or duplicated feature values, or when all k distances are identical so every weight is equal.
Robust Tie-Breaking Strategies: prefer the class of the single closest neighbor; if that fails, fall back to the class with the larger raw vote count; as a last resort, use a fixed deterministic ordering of classes. The implementation below applies the first two in sequence.
Implementation Consideration:
A subtle issue: what if multiple points share the exact k-th distance (distance ties for the boundary neighbors)? This affects which points enter the k-neighborhood.
Solutions: include every point tied at the k-th distance (slightly enlarging the neighborhood), exclude all of the tied boundary points, or break the tie arbitrarily, e.g. by training-set order.
For weighted KNN, including all ties is often the cleanest approach since the weighting function handles their influence appropriately.
```python
import numpy as np


def robust_knn_prediction(X_train: np.ndarray, y_train: np.ndarray,
                          x_query: np.ndarray, k: int,
                          weight_power: float = 2.0) -> tuple:
    """
    Weighted KNN with robust tie-breaking.

    Returns (prediction, probabilities, debugging_info)
    """
    distances = np.linalg.norm(X_train - x_query, axis=1)

    # Handle distance ties at k-boundary
    sorted_indices = np.argsort(distances)
    kth_distance = distances[sorted_indices[k - 1]]

    # Include all points at or closer than kth distance
    effective_indices = np.where(distances <= kth_distance + 1e-10)[0]
    effective_k = len(effective_indices)

    # Compute weights
    effective_distances = distances[effective_indices]
    safe_distances = np.maximum(effective_distances, 1e-10)
    weights = 1.0 / safe_distances ** weight_power
    normalized_weights = weights / weights.sum()

    # Aggregate by class
    effective_labels = y_train[effective_indices]
    classes = np.unique(y_train)
    class_scores = {}
    class_counts = {}
    for c in classes:
        mask = effective_labels == c
        class_scores[c] = normalized_weights[mask].sum()
        class_counts[c] = mask.sum()

    # Find max score
    max_score = max(class_scores.values())
    top_classes = [c for c, s in class_scores.items()
                   if np.isclose(s, max_score, rtol=1e-9)]

    # Tie-breaking
    if len(top_classes) == 1:
        prediction = top_classes[0]
        tie_broken = False
    else:
        # Tie-break 1: Choose class of closest neighbor
        closest_class = y_train[sorted_indices[0]]
        if closest_class in top_classes:
            prediction = closest_class
            tie_broken = True
        else:
            # Tie-break 2: Higher raw count
            top_counts = {c: class_counts[c] for c in top_classes}
            prediction = max(top_counts, key=top_counts.get)
            tie_broken = True

    # Probability vector
    total_score = sum(class_scores.values())
    probas = {c: s / total_score for c, s in class_scores.items()}

    debug_info = {
        'effective_k': effective_k,
        'requested_k': k,
        'tie_broken': tie_broken,
        'top_classes_before_tiebreak': top_classes,
        'class_scores': class_scores,
    }

    return prediction, probas, debug_info


# Test with potential tie scenario
X_train = np.array([
    [0, 0], [1, 0], [0, 1], [1, 1],  # Corners of unit square
])
y_train = np.array([0, 1, 1, 0])  # Diagonal classes

x_query = np.array([0.5, 0.5])  # Center - equidistant to all

pred, probs, debug = robust_knn_prediction(X_train, y_train, x_query, k=4)
print("Query at center of square (equidistant to all corners)")
print(f"Prediction: {pred}")
print(f"Probabilities: {probs}")
print(f"Debug info: {debug}")
```

The power parameter $p$ in the weight function $w(d) = 1/d^p$ is a hyperparameter that controls how strongly distance influences weights.
Behavior Across p Values:
| Power (p) | Behavior | Extreme Limit |
|---|---|---|
| p → 0 | All weights approach 1 (uniform voting) | Exactly uniform at p = 0 |
| p = 1 | Linear inverse | Gentle decay |
| p = 2 | Inverse square | Standard choice |
| p = 3+ | Aggressive decay | Strongly local |
| p → ∞ | Only closest neighbor matters | Equivalent to 1-NN |
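To see this continuum numerically, here is a short sketch (using made-up, linearly spaced neighbor distances) showing how the normalized weights and the effective sample size shift as $p$ grows:

```python
import numpy as np

d = np.linspace(0.1, 1.0, 5)  # illustrative neighbor distances
for p in [0.0, 1.0, 2.0, 3.0, 8.0]:
    w = 1.0 / d**p
    w = w / w.sum()
    n_eff = 1.0 / np.sum(w**2)
    print(f"p={p:>3}: weights={np.round(w, 3)}, n_eff={n_eff:.2f}")
# p = 0 reproduces uniform voting (n_eff = k); a large p concentrates nearly
# all weight on the single closest neighbor (n_eff -> 1, i.e. 1-NN behavior).
```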
How to Select p:
Cross-validation: Treat p as a hyperparameter and select via cross-validation alongside k.
Domain knowledge: In physics, many phenomena follow inverse-square laws (gravity, electromagnetism), making p=2 a natural default. For other domains, different values may be appropriate.
Signal-to-noise ratio: High noise → prefer smaller p (more averaging). Low noise → prefer larger p (trust close neighbors).
Data density: Sparse data → prefer smaller p (need more effective samples). Dense data → can afford larger p.
Start with p = 2 (inverse squared); it is a common default and works well in most cases. (Note that scikit-learn's built-in `weights='distance'` option uses plain inverse distance, p = 1; other exponents require a custom weight callable.) If you're using a moderately large k (10-30) and want more locality, try p = 3. If you need more smoothing, try p = 1. Rarely is p > 3 beneficial; at that point, consider reducing k instead.
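If you want to tune p by cross-validation as suggested above, one approach, sketched here under the assumption that a custom weight callable is acceptable in your pipeline, is to pass such a callable to scikit-learn's `KNeighborsClassifier` (its `weights` parameter accepts a function mapping a distance array to a weight array) and grid-search over candidate powers; the helper `inverse_power_weights` and the dataset below are illustrative:

```python
import numpy as np
from functools import partial
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


def inverse_power_weights(distances, power):
    """Turn a distance array into 1/d^p weights (epsilon guards d = 0)."""
    return 1.0 / np.maximum(distances, 1e-10) ** power


X, y = make_classification(n_samples=300, n_features=5, random_state=0)

param_grid = {
    "n_neighbors": [5, 15, 31],
    # 'weights' accepts a callable: one candidate weight function per power p
    "weights": [partial(inverse_power_weights, power=p) for p in (1.0, 2.0, 3.0)],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print("Best params:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```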
You now understand distance-weighted voting from first principles through implementation. The next page extends these concepts to kernel-weighted KNN, where the weight function takes on a more sophisticated form derived from kernel density estimation.