In standard K-Nearest Neighbors, every neighbor gets an equal vote. The closest neighbor and the k-th nearest neighbor, despite potentially vast differences in distance to the query point, contribute identically to the final prediction. This egalitarian approach carries a profound flaw: it ignores the very information that defines the algorithm—distance.
Consider a query point with k=5 neighbors. One neighbor sits almost exactly at the query location (distance ≈ 0.001), while the remaining four cluster at a much greater distance (distance ≈ 0.8). In uniform voting, that extremely close neighbor—practically a clone of our query—gets the same weight as neighbors that might live in different regions of feature space entirely. This fundamentally violates our intuition about locality.
By the end of this page, you will deeply understand distance-weighted voting: why it exists, how it's formulated, the mathematics of weighting functions, edge cases and numerical stability concerns, and the theoretical justification for treating closer neighbors as more informative. You'll be equipped to implement, debug, and reason about weighted KNN systems at a production level.
To understand why distance weighting matters, we must first formalize what uniform KNN assumes and where that assumption breaks down.
The Uniform KNN Assumption:
Standard KNN makes an implicit assumption: all points within the k-neighborhood are equally representative of the query point's true class or value. Mathematically, if we denote the query point as $\mathbf{x}_q$ and its k neighbors as $\{\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_k\}$ with distances $\{d_1, d_2, ..., d_k\}$ where $d_1 \leq d_2 \leq ... \leq d_k$, uniform voting estimates:
$$P(y = c \mid \mathbf{x}_q) = \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}[y_i = c]$$
This formula assigns probability based purely on the count of neighbors in each class, completely discarding the distance information $\{d_1, ..., d_k\}$.
After computing k distances (which is computationally expensive), uniform KNN throws away all magnitude information and keeps only the rankings. This is statistically wasteful—we computed continuous values but reduce them to mere ordinal relationships.
Where Uniform Voting Fails:
Consider a binary classification problem with k=7. The query point has two class A neighbors at very small distances (say, d ≈ 0.05 each) and five class B neighbors near the edge of the neighborhood (say, d ≈ 0.9 each).
Uniform voting yields: $$P(A) = \frac{2}{7} \approx 0.286, \quad P(B) = \frac{5}{7} \approx 0.714$$
Prediction: Class B
But examine the geometry! The two class A neighbors are almost superimposed on the query point, while all class B neighbors are nearly at the boundary of the k-neighborhood. Intuitively, the query point is deeply embedded in a region dominated by class A locally, with class B points lying at the periphery.
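As a quick numeric sketch of this scenario (using the illustrative distances assumed above, not data from any real example), inverse-square weighting flips the decision:

```python
import numpy as np

# Assumed distances for the k = 7 example: two very close class A neighbors,
# five class B neighbors near the edge of the neighborhood
distances = np.array([0.05, 0.05, 0.9, 0.9, 0.9, 0.9, 0.9])
labels = np.array(["A", "A", "B", "B", "B", "B", "B"])

# Inverse-square weights, normalized to sum to 1
weights = 1.0 / distances**2
weights /= weights.sum()

for c in ("A", "B"):
    print(c, round(weights[labels == c].sum(), 3))
# Class A collects ~0.99 of the total weight, so the weighted prediction
# is A even though uniform voting picks B.
```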
This situation arises frequently in practice, for example near class boundaries, in regions where classes overlap, and wherever local density differs sharply between classes.
Distance-weighted KNN replaces uniform voting with a weighted voting scheme. We introduce a weight function $w(d)$ that maps distance to contribution weight. The fundamental requirements for this function encode our intuitions about locality:
Axioms of a Weight Function:
Non-negativity: $w(d) \geq 0$ for all $d \geq 0$ (weights cannot be negative)
Monotonicity: $w(d_1) \geq w(d_2)$ whenever $d_1 < d_2$ (closer means higher weight)
Finite at Zero: $w(0)$ is finite and well-defined (or approaches a finite limit)
Decay at Infinity: $\lim_{d \to \infty} w(d) = 0$ (very distant points contribute nothing)
With a weight function, the weighted KNN prediction becomes:
$$P(y = c \mid \mathbf{x}_q) = \frac{\sum_{i=1}^{k} w(d_i) \cdot \mathbb{1}[y_i = c]}{\sum_{i=1}^{k} w(d_i)}$$
For regression:
$$\hat{y}(\mathbf{x}_q) = \frac{\sum_{i=1}^{k} w(d_i) \cdot y_i}{\sum_{i=1}^{k} w(d_i)}$$
Notice the denominator: we always normalize by the sum of weights. This ensures our predictions remain valid probabilities (for classification) or weighted averages (for regression). Without normalization, the scale of distances would arbitrarily affect predictions.
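Here is a minimal sketch of the normalized regression estimate, assuming Euclidean distances and inverse-square weights (the function name and values are illustrative):

```python
import numpy as np

def weighted_knn_regress(distances, y_neighbors, power=2.0, eps=1e-10):
    """Normalized inverse-distance-weighted average of neighbor targets."""
    w = 1.0 / np.maximum(distances, eps) ** power
    return np.sum(w * y_neighbors) / np.sum(w)

d = np.array([0.2, 0.5, 1.0])
y = np.array([3.0, 5.0, 10.0])

print(weighted_knn_regress(d, y))         # weighted average, dominated by y = 3.0
print(weighted_knn_regress(1000 * d, y))  # same result: for IDW, rescaling all
                                          # distances cancels out after normalization
```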
The Inverse Distance Weighting (IDW) Function:
The most common weight function is inverse distance weighting:
$$w(d) = \frac{1}{d^p}$$
where $p > 0$ is a power parameter controlling decay rate.
Common choices are $p = 1$ (plain inverse distance), $p = 2$ (inverse square, the usual default), and $p = 3$ (aggressive decay); an exponential weight $\exp(-d)$ is included in the table below for comparison.
Mathematical Properties:
For $p = 2$ (inverse square), halving the distance quadruples the weight, and doubling it cuts the weight to a quarter. This quadratic relationship strongly emphasizes close neighbors while not completely ignoring distant ones.
| Distance d | w(d) = 1/d | w(d) = 1/d² | w(d) = 1/d³ | w(d) = exp(-d) |
|---|---|---|---|---|
| 0.1 | 10.00 | 100.00 | 1000.00 | 0.905 |
| 0.5 | 2.00 | 4.00 | 8.00 | 0.607 |
| 1.0 | 1.00 | 1.00 | 1.00 | 0.368 |
| 2.0 | 0.50 | 0.25 | 0.125 | 0.135 |
| 5.0 | 0.20 | 0.04 | 0.008 | 0.007 |
| 10.0 | 0.10 | 0.01 | 0.001 | 0.00005 |
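The table values are easy to reproduce directly (a quick sketch, nothing library-specific):

```python
import numpy as np

d = np.array([0.1, 0.5, 1.0, 2.0, 5.0, 10.0])
for label, w in [("1/d", 1 / d), ("1/d^2", 1 / d**2),
                 ("1/d^3", 1 / d**3), ("exp(-d)", np.exp(-d))]:
    print(f"{label:>8}:", np.round(w, 5))
```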
The inverse distance function $w(d) = 1/d^p$ has a critical flaw: it explodes at $d = 0$.
When a training point exactly coincides with the query point (which happens with duplicate data, overfitting scenarios, or numerical precision limits), we face division by zero:
$$\lim_{d \to 0^+} \frac{1}{d^p} = +\infty$$
This isn't merely a numerical inconvenience—it's a conceptual issue. What should happen when the query exactly matches a training point?
The Philosophical Question:
If the query point is identical to a training point, two reasonable philosophies exist:
Interpolation View: The prediction should exactly equal the training label (we know the answer exactly)
Estimation View: The prediction should still consider other neighbors (accounting for noise in the training label)
Most implementations adopt a hybrid approach using smoothed inverse distance:
```python
import numpy as np


def inverse_distance_weights(distances: np.ndarray,
                             power: float = 2.0,
                             epsilon: float = 1e-10) -> np.ndarray:
    """
    Compute inverse distance weights with numerical stability.

    Parameters
    ----------
    distances : array of distances to k neighbors
    power : exponent for distance (higher = more weight to close neighbors)
    epsilon : small constant to prevent division by zero

    Returns
    -------
    weights : normalized weights summing to 1
    """
    # Handle exact matches (distance = 0)
    exact_match = distances < epsilon
    if np.any(exact_match):
        # If any distance is essentially zero, that neighbor gets all weight
        weights = exact_match.astype(float)
    else:
        # Standard inverse distance weighting
        weights = 1.0 / (distances ** power)

    # Normalize to sum to 1
    return weights / weights.sum()


def shepard_weights(distances: np.ndarray, power: float = 2.0) -> np.ndarray:
    """
    Shepard's interpolation weights - alternative stable formulation.

    Uses the formula: w_i = (d_max - d_i) / d_i

    This naturally handles the boundary case more gracefully.
    """
    d_max = distances.max()
    # Avoid division by zero
    safe_distances = np.maximum(distances, 1e-10)
    weights = ((d_max - distances) / safe_distances) ** power
    return weights / weights.sum()


# Demonstration
distances = np.array([0.1, 0.5, 0.8, 1.2, 1.5])
print("Distances:", distances)
print("IDW weights (p=2):", inverse_distance_weights(distances, power=2))
print("Shepard weights:", shepard_weights(distances, power=2))

# Edge case: exact match
distances_with_match = np.array([0.0, 0.5, 0.8, 1.2, 1.5])
print("\nWith exact match (d=0):")
print("IDW weights:", inverse_distance_weights(distances_with_match, power=2))
```

Adding a small epsilon, $w(d) = 1/(d + \epsilon)^p$, is common but changes the weight function's behavior everywhere, especially for small distances. The conditional approach (checking for exact matches explicitly) preserves the true inverse relationship for normal cases while handling the singularity gracefully.
Alternative: Gaussian/Exponential Weighting:
To avoid the zero-distance singularity entirely, we can use weight functions that are naturally bounded:
$$w(d) = \exp\left(-\frac{d^2}{2\sigma^2}\right)$$
This Gaussian weight function is bounded everywhere (with $w(0) = 1$), decays smoothly with distance, and exposes a bandwidth parameter $\sigma$ that controls how quickly influence falls off.
The tradeoff: inverse distance weighting emphasizes very close neighbors more strongly than Gaussian weighting, which can be advantageous or disadvantageous depending on the data distribution.
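A minimal sketch of Gaussian weighting with an adaptive bandwidth (using the median neighbor distance as $\sigma$ is an illustrative assumption, mirroring the classifier implementation later on this page):

```python
import numpy as np

def gaussian_weights(distances, sigma=None):
    """Gaussian (RBF) weights; bounded at d = 0, so no singularity to handle."""
    if sigma is None:
        # Adaptive bandwidth: median neighbor distance (an illustrative choice)
        sigma = np.median(distances) + 1e-10
    w = np.exp(-distances**2 / (2 * sigma**2))
    return w / w.sum()

d = np.array([0.0, 0.5, 0.8, 1.2, 1.5])
print(gaussian_weights(d))  # the d = 0 neighbor gets the largest, but finite, weight
```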
Let's rigorously analyze how distance weighting affects the bias-variance tradeoff in KNN predictions.
Setup:
Assume we have a regression problem with true function $f(\mathbf{x})$ and observations $y_i = f(\mathbf{x}_i) + \epsilon_i$ where $\epsilon_i$ are i.i.d. noise with mean 0 and variance $\sigma^2$.
For a query point $\mathbf{x}_q$, the weighted KNN prediction is:
$$\hat{f}(\mathbf{x}_q) = \sum_{i=1}^{k} \tilde{w}_i y_i$$
where $\tilde{w}_i = w(d_i) / \sum_j w(d_j)$ are normalized weights.
Bias Analysis:
The bias of the weighted estimator is:
$$\text{Bias}(\hat{f}) = \mathbb{E}[\hat{f}(\mathbf{x}_q)] - f(\mathbf{x}_q) = \sum_{i=1}^{k} \tilde{w}_i f(\mathbf{x}_i) - f(\mathbf{x}_q)$$
Using a Taylor expansion of $f$ around $\mathbf{x}_q$:
$$f(\mathbf{x}_i) \approx f(\mathbf{x}_q) + \nabla f(\mathbf{x}_q)^T (\mathbf{x}_i - \mathbf{x}_q) + O(\|\mathbf{x}_i - \mathbf{x}_q\|^2)$$
Substituting:
$$\text{Bias} \approx \nabla f(\mathbf{x}_q)^T \sum_{i=1}^{k} \tilde{w}_i (\mathbf{x}_i - \mathbf{x}_q) + O(\text{second order})$$
The first-order bias depends on the weighted centroid of neighbors relative to the query point. If the weighted centroid equals the query point (perfect symmetry), first-order bias vanishes. Distance weighting shifts the weighted centroid toward closer neighbors, reducing bias when the function varies within the neighborhood.
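To make the first-order term concrete, here is a small sketch (with made-up neighbor coordinates) comparing the weighted-centroid offset $\sum_i \tilde{w}_i (\mathbf{x}_i - \mathbf{x}_q)$ under uniform and inverse-square weights:

```python
import numpy as np

rng = np.random.default_rng(0)
x_q = np.zeros(2)
# Hypothetical neighborhood: a few close neighbors on one side of the query,
# several farther neighbors on the other side
neighbors = np.vstack([rng.normal(0.1, 0.02, (3, 2)),
                       rng.normal(-0.8, 0.1, (4, 2))])
d = np.linalg.norm(neighbors - x_q, axis=1)

for name, w in [("uniform", np.ones_like(d)), ("inverse-square", 1 / d**2)]:
    w = w / w.sum()
    offset = (w[:, None] * (neighbors - x_q)).sum(axis=0)
    print(name, "centroid offset:", np.round(offset, 3))
# The inverse-square weighted centroid sits much closer to the query,
# shrinking the first-order bias bound ||grad f|| * ||offset||.
```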
Variance Analysis:
The variance of the weighted estimator is:
$$\text{Var}(\hat{f}) = \text{Var}\left(\sum_{i=1}^{k} \tilde{w}_i y_i\right) = \sum_{i=1}^{k} \tilde{w}_i^2 \sigma^2 = \sigma^2 \sum_{i=1}^{k} \tilde{w}_i^2$$
Define the effective sample size as:
$$n_{\text{eff}} = \frac{1}{\sum_{i=1}^{k} \tilde{w}_i^2}$$
This tells us how many "equivalent uniform samples" our weighted average represents.
The Tradeoff:
| Weighting Style | Effective Sample Size | Bias | Variance |
|---|---|---|---|
| Uniform ($w = 1$) | $n_{\text{eff}} = k$ | Higher (includes distant points) | Lower (averages over k points) |
| Strong weighting | $n_{\text{eff}} \ll k$ | Lower (focuses on close points) | Higher (fewer effective samples) |
Distance weighting trades increased variance for reduced bias. This is favorable when the target function varies appreciably within the k-neighborhood and the label noise is modest; the snippet below compares effective sample sizes for several weighting schemes.
```python
import numpy as np


def effective_sample_size(weights: np.ndarray) -> float:
    """
    Compute effective sample size for a weighted estimate.

    n_eff = 1 / sum(w_i^2) where weights are normalized to sum to 1.

    For uniform weights: n_eff = k
    For concentrated weights: n_eff < k
    Extreme case (single weight = 1): n_eff = 1
    """
    # Ensure normalization
    normalized_weights = weights / weights.sum()
    return 1.0 / np.sum(normalized_weights ** 2)


# Compare effective sample sizes for different weight schemes
k = 10
uniform_weights = np.ones(k) / k

# Inverse distance with distances increasing linearly
distances = np.linspace(0.1, 1.0, k)
inv_weights = 1.0 / distances ** 2
inv_weights = inv_weights / inv_weights.sum()

# Very concentrated (exponential decay)
exp_weights = np.exp(-3 * distances)
exp_weights = exp_weights / exp_weights.sum()

print(f"k = {k} neighbors")
print(f"Uniform weights:    n_eff = {effective_sample_size(uniform_weights):.2f}")
print(f"Inverse-square:     n_eff = {effective_sample_size(inv_weights):.2f}")
print(f"Strong exponential: n_eff = {effective_sample_size(exp_weights):.2f}")

# Variance multiplier relative to uniform
print(f"\nVariance multiplier (vs uniform k={k}):")
print(f"Inverse-square:     {k * np.sum(inv_weights**2):.2f}x")
print(f"Strong exponential: {k * np.sum(exp_weights**2):.2f}x")
```

In classification, weighted voting aggregates class votes using distance-derived weights.
Weighted Voting Rule:
For a query point $\mathbf{x}_q$ with k neighbors $\{(\mathbf{x}_1, c_1), ..., (\mathbf{x}_k, c_k)\}$ at distances $\{d_1, ..., d_k\}$:
$$\text{score}(c) = \sum_{i: c_i = c} w(d_i)$$
$$\hat{c} = \arg\max_{c} \text{score}(c)$$
The predicted class is the one with the highest weighted vote.
Probabilistic Interpretation:
We can convert scores to probabilities:
$$P(c \mid \mathbf{x}_q) = \frac{\text{score}(c)}{\sum_{c'} \text{score}(c')} = \frac{\sum_{i: c_i = c} w(d_i)}{\sum_{i=1}^{k} w(d_i)}$$
This provides graded probability estimates rather than only hard classifications (though they are not automatically well calibrated).
```python
import numpy as np


class WeightedKNNClassifier:
    """
    K-Nearest Neighbors classifier with distance weighting.

    Implements multiple weighting schemes and provides both
    hard predictions and probability estimates.
    """

    def __init__(self, k: int = 5, weight_type: str = 'inverse_squared',
                 power: float = 2.0):
        """
        Parameters
        ----------
        k : number of neighbors
        weight_type : 'uniform', 'inverse', 'inverse_squared', 'gaussian'
        power : power for inverse weighting (ignored for gaussian)
        """
        self.k = k
        self.weight_type = weight_type
        self.power = power
        self.X_train = None
        self.y_train = None
        self.classes_ = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Store training data and extract classes."""
        self.X_train = np.array(X)
        self.y_train = np.array(y)
        self.classes_ = np.unique(y)
        return self

    def _compute_weights(self, distances: np.ndarray) -> np.ndarray:
        """Compute weights from distances based on weight_type."""
        epsilon = 1e-10

        if self.weight_type == 'uniform':
            return np.ones_like(distances)
        elif self.weight_type == 'inverse':
            # Handle zero distances
            safe_distances = np.maximum(distances, epsilon)
            return 1.0 / safe_distances
        elif self.weight_type == 'inverse_squared':
            safe_distances = np.maximum(distances, epsilon)
            return 1.0 / (safe_distances ** self.power)
        elif self.weight_type == 'gaussian':
            # Bandwidth = median distance (adaptive)
            sigma = np.median(distances) + epsilon
            return np.exp(-distances**2 / (2 * sigma**2))
        else:
            raise ValueError(f"Unknown weight type: {self.weight_type}")

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """
        Predict class probabilities for each sample in X.

        Returns
        -------
        proba : shape (n_samples, n_classes)
            Probability of each class for each sample
        """
        X = np.atleast_2d(X)
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        probas = np.zeros((n_samples, n_classes))

        for i, x_query in enumerate(X):
            # Compute distances to all training points
            distances = np.linalg.norm(self.X_train - x_query, axis=1)

            # Find k nearest neighbors
            k_indices = np.argsort(distances)[:self.k]
            k_distances = distances[k_indices]
            k_labels = self.y_train[k_indices]

            # Compute weights
            weights = self._compute_weights(k_distances)

            # Accumulate weighted votes per class
            class_scores = np.zeros(n_classes)
            for label, weight in zip(k_labels, weights):
                class_idx = np.where(self.classes_ == label)[0][0]
                class_scores[class_idx] += weight

            # Normalize to probabilities
            probas[i] = class_scores / class_scores.sum()

        return probas

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels for samples in X."""
        probas = self.predict_proba(X)
        return self.classes_[np.argmax(probas, axis=1)]


# Demonstration: Uniform vs Weighted KNN near decision boundary
np.random.seed(42)

# Create training data: two classes with overlap
X_train = np.vstack([
    np.random.randn(50, 2) + [0, 0],  # Class 0
    np.random.randn(50, 2) + [2, 2],  # Class 1
])
y_train = np.array([0] * 50 + [1] * 50)

# Query point near decision boundary
X_query = np.array([[1.0, 1.0]])

# Compare uniform and weighted predictions
uniform_knn = WeightedKNNClassifier(k=7, weight_type='uniform')
weighted_knn = WeightedKNNClassifier(k=7, weight_type='inverse_squared')

uniform_knn.fit(X_train, y_train)
weighted_knn.fit(X_train, y_train)

print("Query point:", X_query[0])
print("\nUniform KNN (k=7):")
print(f"  Probabilities: {uniform_knn.predict_proba(X_query)[0]}")
print(f"  Prediction: {uniform_knn.predict(X_query)[0]}")

print("\nWeighted KNN (k=7, inverse-square):")
print(f"  Probabilities: {weighted_knn.predict_proba(X_query)[0]}")
print(f"  Prediction: {weighted_knn.predict(X_query)[0]}")
```

Weighted KNN typically produces better-calibrated probability estimates than uniform KNN. The weights act as a form of soft locality: confidence naturally decreases when the closest neighbors are far away (all weights become similar) and increases when close neighbors agree (close neighbors dominate).
In uniform voting, ties occur when multiple classes have equal vote counts. Weighted voting dramatically reduces but doesn't eliminate ties—exact ties require weighted sums to match precisely, which is rare with real-valued weights.
When Weighted Ties Still Occur: exact weighted ties can still arise in highly symmetric configurations (a query equidistant from neighbors of different classes), with heavily discretized or duplicated feature values, or when all k distances are identical so every weight is equal.
Robust Tie-Breaking Strategies: prefer the class of the single closest neighbor; if that fails, fall back to the class with the larger raw vote count; as a last resort, use a fixed deterministic ordering of classes. The implementation below applies the first two in sequence.
Implementation Consideration:
A subtle issue: what if multiple points share the exact k-th distance (distance ties for the boundary neighbors)? This affects which points enter the k-neighborhood.
Solutions: include every point tied at the k-th distance (slightly enlarging the neighborhood), exclude all of the tied boundary points, or break the tie arbitrarily, e.g. by training-set order.
For weighted KNN, including all ties is often the cleanest approach since the weighting function handles their influence appropriately.
```python
import numpy as np


def robust_knn_prediction(X_train: np.ndarray, y_train: np.ndarray,
                          x_query: np.ndarray, k: int,
                          weight_power: float = 2.0) -> tuple:
    """
    Weighted KNN with robust tie-breaking.

    Returns (prediction, probabilities, debugging_info)
    """
    distances = np.linalg.norm(X_train - x_query, axis=1)

    # Handle distance ties at k-boundary
    sorted_indices = np.argsort(distances)
    kth_distance = distances[sorted_indices[k - 1]]

    # Include all points at or closer than kth distance
    effective_indices = np.where(distances <= kth_distance + 1e-10)[0]
    effective_k = len(effective_indices)

    # Compute weights
    effective_distances = distances[effective_indices]
    safe_distances = np.maximum(effective_distances, 1e-10)
    weights = 1.0 / safe_distances ** weight_power
    normalized_weights = weights / weights.sum()

    # Aggregate by class
    effective_labels = y_train[effective_indices]
    classes = np.unique(y_train)
    class_scores = {}
    class_counts = {}
    for c in classes:
        mask = effective_labels == c
        class_scores[c] = normalized_weights[mask].sum()
        class_counts[c] = mask.sum()

    # Find max score
    max_score = max(class_scores.values())
    top_classes = [c for c, s in class_scores.items()
                   if np.isclose(s, max_score, rtol=1e-9)]

    # Tie-breaking
    if len(top_classes) == 1:
        prediction = top_classes[0]
        tie_broken = False
    else:
        # Tie-break 1: Choose class of closest neighbor
        closest_class = y_train[sorted_indices[0]]
        if closest_class in top_classes:
            prediction = closest_class
            tie_broken = True
        else:
            # Tie-break 2: Higher raw count
            top_counts = {c: class_counts[c] for c in top_classes}
            prediction = max(top_counts, key=top_counts.get)
            tie_broken = True

    # Probability vector
    total_score = sum(class_scores.values())
    probas = {c: s / total_score for c, s in class_scores.items()}

    debug_info = {
        'effective_k': effective_k,
        'requested_k': k,
        'tie_broken': tie_broken,
        'top_classes_before_tiebreak': top_classes,
        'class_scores': class_scores,
    }

    return prediction, probas, debug_info


# Test with potential tie scenario
X_train = np.array([
    [0, 0], [1, 0], [0, 1], [1, 1],  # Corners of unit square
])
y_train = np.array([0, 1, 1, 0])  # Diagonal classes

x_query = np.array([0.5, 0.5])  # Center - equidistant to all

pred, probs, debug = robust_knn_prediction(X_train, y_train, x_query, k=4)
print("Query at center of square (equidistant to all corners)")
print(f"Prediction: {pred}")
print(f"Probabilities: {probs}")
print(f"Debug info: {debug}")
```

The power parameter $p$ in the weight function $w(d) = 1/d^p$ is a hyperparameter that controls how strongly distance influences weights.
Behavior Across p Values:
| Power (p) | Behavior | Extreme Limit |
|---|---|---|
| p → 0 | All weights approach 1 (uniform voting) | Exactly uniform at p = 0 |
| p = 1 | Linear inverse | Gentle decay |
| p = 2 | Inverse square | Standard choice |
| p = 3+ | Aggressive decay | Strongly local |
| p → ∞ | Only closest neighbor matters | Equivalent to 1-NN |
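To see this continuum numerically, here is a short sketch (using made-up, linearly spaced neighbor distances) showing how the normalized weights and the effective sample size shift as $p$ grows:

```python
import numpy as np

d = np.linspace(0.1, 1.0, 5)  # illustrative neighbor distances
for p in [0.0, 1.0, 2.0, 3.0, 8.0]:
    w = 1.0 / d**p
    w = w / w.sum()
    n_eff = 1.0 / np.sum(w**2)
    print(f"p={p:>3}: weights={np.round(w, 3)}, n_eff={n_eff:.2f}")
# p = 0 reproduces uniform voting (n_eff = k); a large p concentrates nearly
# all weight on the single closest neighbor (n_eff -> 1, i.e. 1-NN behavior).
```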
How to Select p:
Cross-validation: Treat p as a hyperparameter and select via cross-validation alongside k.
Domain knowledge: In physics, many phenomena follow inverse-square laws (gravity, electromagnetism), making p=2 a natural default. For other domains, different values may be appropriate.
Signal-to-noise ratio: High noise → prefer smaller p (more averaging). Low noise → prefer larger p (trust close neighbors).
Data density: Sparse data → prefer smaller p (need more effective samples). Dense data → can afford larger p.
Start with p = 2 (inverse squared); it is a common default and works well in most cases. (Note that scikit-learn's built-in `weights='distance'` option uses plain inverse distance, p = 1; other exponents require a custom weight callable.) If you're using a moderately large k (10-30) and want more locality, try p = 3. If you need more smoothing, try p = 1. Rarely is p > 3 beneficial; at that point, consider reducing k instead.
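If you want to tune p by cross-validation as suggested above, one approach, sketched here under the assumption that a custom weight callable is acceptable in your pipeline, is to pass such a callable to scikit-learn's `KNeighborsClassifier` (its `weights` parameter accepts a function mapping a distance array to a weight array) and grid-search over candidate powers; the helper `inverse_power_weights` and the dataset below are illustrative:

```python
import numpy as np
from functools import partial
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


def inverse_power_weights(distances, power):
    """Turn a distance array into 1/d^p weights (epsilon guards d = 0)."""
    return 1.0 / np.maximum(distances, 1e-10) ** power


X, y = make_classification(n_samples=300, n_features=5, random_state=0)

param_grid = {
    "n_neighbors": [5, 15, 31],
    # 'weights' accepts a callable: one candidate weight function per power p
    "weights": [partial(inverse_power_weights, power=p) for p in (1.0, 2.0, 3.0)],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print("Best params:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```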
You now understand distance-weighted voting from first principles through implementation. The next page extends these concepts to kernel-weighted KNN, where the weight function takes on a more sophisticated form derived from kernel density estimation.