Once we've identified the k nearest neighbors of a query point, a crucial question remains: How do we combine their labels into a single prediction?
This is the aggregation step of KNN—the moment when individual neighbor opinions become a collective verdict. The choice of aggregation scheme significantly impacts prediction quality, especially in challenging scenarios: tied votes, neighbors at widely varying distances, or imbalanced classes.
This page explores the full spectrum of voting and aggregation schemes, from the naive to the sophisticated.
By the end of this page, you will master majority voting and its limitations, distance-weighted voting for improved precision, kernel-weighted schemes and their theoretical foundations, handling ties and edge cases, aggregation for regression (mean, median, weighted), and probabilistic predictions with confidence estimation.
The simplest aggregation scheme is majority voting: each of the k neighbors gets one vote, and the class with the most votes wins.
Formal Definition
For a query point $\mathbf{x}_q$ with k nearest neighbors having labels $\{y_1, y_2, \ldots, y_k\}$:
$$\hat{y}(\mathbf{x}_q) = \arg\max_c \sum_{i=1}^{k} \mathbf{1}[y_i = c]$$
where $\mathbf{1}[\cdot]$ is the indicator function, and $c$ ranges over all classes.
Properties:

- Simple and parameter-free: the only choice is k itself.
- Treats all k neighbors as equally informative, regardless of distance.
- Can produce ties (e.g., an even k in binary classification), which must be broken by some rule.
Majority voting implicitly assumes all k neighbors are equally informative. In reality, the 1st nearest neighbor (distance 0.01) is far more informative than the 50th nearest neighbor (distance 2.3). Treating them equally can lead to suboptimal predictions when neighbors span a wide range of distances.
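The plain majority vote can be sketched in a few lines; this is a minimal illustration, not a full KNN implementation (the neighbor search is assumed to have already produced the labels):

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most common label among the k nearest neighbors.

    Counter.most_common breaks exact ties by first-seen order, which is
    one (arbitrary) tie-breaking rule; alternatives are discussed below.
    """
    counts = Counter(neighbor_labels)
    # most_common(1) returns [(label, count)] for the top label
    return counts.most_common(1)[0][0]

print(majority_vote(["cat", "dog", "cat", "cat", "dog"]))  # cat
```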
Distance-weighted voting addresses this equal-influence assumption by giving closer neighbors more influence. The weight of each neighbor is inversely related to its distance from the query.
Formal Definition
For neighbor $i$ at distance $d_i$ from the query, assign weight:
$$w_i = \frac{1}{d_i}$$
or more generally:
$$w_i = \frac{1}{d_i^p}$$
where $p > 0$ controls how quickly influence decays with distance ($p = 1$ is most common, $p = 2$ gives quadratic decay).
The prediction becomes:
$$\hat{y}(\mathbf{x}_q) = \arg\max_c \sum_{i=1}^{k} w_i \cdot \mathbf{1}[y_i = c] = \arg\max_c \sum_{i: y_i = c} w_i$$
| Weighting | Formula | Properties | Use Case |
|---|---|---|---|
| Uniform | $w_i = 1$ | All neighbors equal | Baseline, noisy data |
| Inverse distance | $w_i = 1/d_i$ | Linear decay | Default weighted KNN |
| Inverse squared | $w_i = 1/d_i^2$ | Stronger locality | When nearest matters most |
| Exponential | $w_i = e^{-d_i/\sigma}$ | Smooth, tunable decay | Kernel-like behavior |
| Rank-based | $w_i = (k-r_i+1)/k$ | Order, not distance | When absolute distance is unreliable |
Distance weighting is almost always preferable to uniform voting. It's available in scikit-learn's KNeighborsClassifier via weights='distance' (note that the default is weights='uniform'). The main exception is when you want simpler, more interpretable behavior, or when all neighbors are roughly equidistant (as often happens in high-dimensional data).
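A minimal sketch of the inverse-distance vote defined above; the small `eps` guard against exact matches (distance zero) is an assumption, and other conventions for that edge case are covered later on this page:

```python
from collections import defaultdict

def weighted_vote(neighbor_labels, distances, p=1, eps=1e-12):
    """Distance-weighted vote with w_i = 1 / d_i^p.

    eps keeps the weight finite when a neighbor coincides with the query.
    """
    scores = defaultdict(float)
    for label, d in zip(neighbor_labels, distances):
        scores[label] += 1.0 / (d ** p + eps)
    # Class with the largest total weight wins
    return max(scores, key=scores.get)

# A distant two-vote majority is outvoted by one very close neighbor:
print(weighted_vote(["a", "b", "b"], [0.1, 2.0, 2.5]))  # a
```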
Kernel-weighted voting provides a principled framework for distance-based weighting using kernel functions from non-parametric statistics.
The Kernel Perspective
A kernel function $K: \mathbb{R}^+ \to \mathbb{R}^+$ maps distances to weights. Good kernels are non-negative, peak at $d = 0$, and decrease monotonically with distance; many also have compact support, so sufficiently distant points receive exactly zero weight.
Bandwidth Parameter
Most kernels have a bandwidth (or scale) parameter $h$ that controls the effective range:
$$K_h(d) = \frac{1}{h} K\left(\frac{d}{h}\right)$$
- Larger $h$ → smoother weighting; more neighbors contribute significantly.
- Smaller $h$ → sharper weighting; the nearest neighbors dominate.
| Kernel | Formula $K(u)$ | Support | Properties |
|---|---|---|---|
| Uniform (Box) | $\frac{1}{2}\mathbf{1}_{|u|\leq 1}$ | [-1, 1] | Equal weight within range |
| Triangular | $(1-|u|)\mathbf{1}_{|u|\leq 1}$ | [-1, 1] | Linear decay to boundary |
| Epanechnikov | $\frac{3}{4}(1-u^2)\mathbf{1}_{|u|\leq 1}$ | [-1, 1] | Optimal efficiency, smooth |
| Gaussian | $\frac{1}{\sqrt{2\pi}}e^{-u^2/2}$ | $(-\infty, \infty)$ | Infinitely smooth, all points contribute |
| Tricube | $\frac{70}{81}(1-|u|^3)^3\mathbf{1}_{|u|\leq 1}$ | [-1, 1] | Very smooth at boundary, used in LOESS |
Using the distance to the k-th neighbor as the bandwidth makes the kernel adapt to local density. In dense regions, bandwidth is small (tight weighting); in sparse regions, bandwidth is large (spread weighting). This automatically adjusts to varying data density without requiring manual tuning.
For regression tasks, we predict a continuous value by aggregating neighbor target values. The choice of aggregation function affects robustness and accuracy.
Common Aggregation Functions
Mean (Default): $$\hat{y}(\mathbf{x}_q) = \frac{1}{k} \sum_{i=1}^{k} y_i$$
Weighted Mean: $$\hat{y}(\mathbf{x}_q) = \frac{\sum_{i=1}^{k} w_i \cdot y_i}{\sum_{i=1}^{k} w_i}$$
Median (robust to outliers): $$\hat{y}(\mathbf{x}_q) = \text{median}(y_1, \ldots, y_k)$$
Weighted Median (robust + distance-aware): Find the value $m$ such that the total weight of neighbors with $y_i < m$ and the total weight with $y_i > m$ are each at most half the total weight.
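The four aggregation functions above can be collected in one small sketch; the weighted median is computed by sorting targets and finding where the cumulative weight crosses half the total:

```python
import numpy as np

def knn_regress(y, weights=None, method="mean"):
    """Aggregate neighbor target values y into a single prediction."""
    y = np.asarray(y, dtype=float)
    if method == "mean":
        # np.average computes the plain mean when weights is None,
        # and the weighted mean otherwise
        return float(np.average(y, weights=weights))
    if method == "median":
        return float(np.median(y))
    if method == "weighted_median":
        w = np.asarray(weights, dtype=float)
        order = np.argsort(y)
        cum = np.cumsum(w[order])
        # first sorted value whose cumulative weight reaches half the total
        idx = np.searchsorted(cum, 0.5 * cum[-1])
        return float(y[order][idx])
    raise ValueError(f"unknown method: {method}")
```

With targets `[1, 2, 100]`, the mean is pulled to 34.3 by the outlier while both medians stay at 2, which is the robustness argument for median aggregation.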
Beyond point predictions (a single class or value), KNN can provide probabilistic predictions indicating confidence in the prediction.
Classification: Class Probabilities
Instead of just returning the majority class, return the distribution of neighbor labels:
$$P(y = c \mid \mathbf{x}_q) = \frac{\sum_{i=1}^{k} w_i \cdot \mathbf{1}[y_i = c]}{\sum_{i=1}^{k} w_i}$$
This gives a confidence score for each class, supports ranked predictions, and allows decision thresholds other than simply picking the argmax.
Regression: Prediction Intervals
For regression, confidence can be expressed as prediction intervals:
$$[\hat{y} - z_{\alpha/2} \cdot s, \hat{y} + z_{\alpha/2} \cdot s]$$
where $s$ is the standard deviation of neighbor values, and $z_{\alpha/2}$ is the appropriate z-score for confidence level $1-\alpha$.
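Both probabilistic outputs can be sketched together; the interval uses the normal approximation from the formula above, which is a rough assumption for small k:

```python
import numpy as np

def class_probabilities(neighbor_labels, weights):
    """Weighted empirical class distribution among the neighbors."""
    total = sum(weights)
    probs = {}
    for label, w in zip(neighbor_labels, weights):
        probs[label] = probs.get(label, 0.0) + w / total
    return probs

def prediction_interval(y, z=1.96):
    """Approximate 95% prediction interval (z = 1.96) from the
    spread of neighbor target values, assuming rough normality."""
    y = np.asarray(y, dtype=float)
    m, s = y.mean(), y.std(ddof=1)  # sample standard deviation
    return (m - z * s, m + z * s)
```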
KNN predicted probabilities are often well-calibrated out of the box — if 70% of predictions with P(class=1)=0.7 are indeed class 1, the predictions are calibrated. However, with small k or in sparse regions, probabilities can be unreliable. Consider using calibration methods (Platt scaling, isotonic regression) for mission-critical applications.
Robust KNN implementations must handle various edge cases that arise in practice: tied votes (break by smallest total distance, by the 1-NN label, or at random), zero distances (inverse-distance weights divide by zero on exact matches), and k exceeding the number of training points.
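The zero-distance case deserves an explicit guard when using inverse-distance weights. A minimal sketch, following the common convention of giving all the weight to neighbors that coincide exactly with the query:

```python
def safe_inverse_weights(distances):
    """Inverse-distance weights that survive exact matches (d = 0).

    Convention: if any neighbor coincides with the query, those
    neighbors share all the weight and the rest are ignored.
    """
    if any(d == 0 for d in distances):
        return [1.0 if d == 0 else 0.0 for d in distances]
    return [1.0 / d for d in distances]
```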
KNN naturally extends beyond binary classification:
Multi-class Classification
KNN handles multi-class problems natively—no modification needed. The voting simply considers all classes:
$$\hat{y} = \arg\max_c \sum_{i=1}^k w_i \cdot \mathbf{1}[y_i = c]$$
where $c \in \{1, 2, \ldots, C\}$.
Key considerations: with many classes, ties and thin vote margins become more common, so distance weighting helps more; imbalanced classes can dominate neighborhoods; and probabilistic outputs are often more informative than a single argmax.
Multi-label Classification
When each instance can belong to multiple classes simultaneously (e.g., an image tagged as [outdoor, sunny, beach]):
ML-KNN-style approach (simplified here to per-label thresholding):
$$\hat{y}_l = \mathbf{1}\left[\frac{\sum_{i \in N_k} \mathbf{1}[l \in y_i]}{k} > \tau_l\right]$$
where $N_k$ is the set of k nearest neighbors and $\tau_l$ is a per-label threshold (commonly $0.5$).
Simple per-label thresholding ignores label correlations. More sophisticated approaches consider that certain labels often co-occur (e.g., 'ocean' and 'beach'). ML-KNN and similar methods can model these dependencies by considering not just individual label frequencies but also how label combinations appear in neighbors.
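The per-label thresholding formula can be sketched as follows; the uniform threshold `tau=0.5` is an assumption, and the label-correlation refinements described above are not modeled here:

```python
import numpy as np

def multilabel_predict(neighbor_labelsets, num_labels, tau=0.5):
    """Predict label l if the fraction of neighbors carrying l exceeds tau.

    neighbor_labelsets: one set of label indices per neighbor.
    """
    k = len(neighbor_labelsets)
    freq = np.zeros(num_labels)
    for labels in neighbor_labelsets:
        for l in labels:
            freq[l] += 1
    return [l for l in range(num_labels) if freq[l] / k > tau]

# 3 neighbors tagged with subsets of labels {0, 1, 2}:
print(multilabel_predict([{0, 1}, {0}, {0, 2}], num_labels=3))  # [0]
```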
Multi-output Regression
When predicting multiple continuous targets simultaneously (e.g., predict both temperature and humidity):
$$\hat{\mathbf{y}} = \frac{\sum_{i=1}^k w_i \cdot \mathbf{y}_i}{\sum_{i=1}^k w_i}$$
where $\mathbf{y}_i$ is a vector of target values. Each output dimension is aggregated independently (or jointly if outputs are correlated).
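The weighted mean over target vectors is a one-liner with broadcasting; this sketch assumes targets arrive as a `(k, n_outputs)` array:

```python
import numpy as np

def multioutput_predict(Y, weights):
    """Weighted mean over neighbor target vectors, one column per output."""
    Y = np.asarray(Y, dtype=float)        # shape (k, n_outputs)
    w = np.asarray(weights, dtype=float)  # shape (k,)
    # Broadcast weights across columns, sum over neighbors, normalize
    return (w[:, None] * Y).sum(axis=0) / w.sum()

# Two neighbors, each with (temperature, humidity) targets:
pred = multioutput_predict([[20.0, 0.6], [24.0, 0.8]], [3.0, 1.0])
```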
Handling Output Correlations
If outputs are correlated, using one shared neighbor set for all outputs (as KNN naturally does) already preserves the correlations present in the training targets; beyond that, chained regressors or methods that explicitly model the output covariance can improve results.
The voting/aggregation scheme determines how neighbor information becomes a prediction. While majority voting is simple and often effective, distance-weighted and kernel-based methods usually perform better.
Coming Up Next: In Algorithm Pseudocode, we'll synthesize everything we've learned into complete, implementable KNN algorithms for both classification and regression, with all the details for a production-quality implementation.
You now understand the full landscape of how to combine neighbor opinions into predictions. The choice of voting scheme can significantly impact performance — distance weighting is almost always worth using, and probabilistic outputs enable richer downstream decision-making. The best scheme depends on your data characteristics and application requirements.