In K-Nearest Neighbors, the letter 'K' isn't just a naming convention—it's the algorithm's most crucial hyperparameter. This single integer fundamentally shapes every prediction the model makes.
What is K?
The parameter k specifies how many of the closest training points to consult when making a prediction. A query point's label is determined by aggregating the labels of its k nearest neighbors in feature space.
Choosing k involves navigating one of machine learning's fundamental tensions: the bias-variance tradeoff.
By the end of this page, you will understand:
- how k controls the bias-variance tradeoff;
- why small k leads to overfitting and large k to underfitting;
- rigorous methods for selecting k (cross-validation, theoretical bounds);
- common heuristics and their limitations;
- special considerations for classification vs. regression.
The expected prediction error for any supervised learning algorithm can be decomposed into three components:
$$\mathbb{E}[(y - \hat{y})^2] = \text{Bias}^2(\hat{y}) + \text{Var}(\hat{y}) + \sigma^2$$
where $\sigma^2$ is the irreducible noise in the data. For KNN, the value of k directly controls the first two terms:
Low k (e.g., k = 1): low bias but high variance. Predictions track the training data closely, noise included, so they change sharply when the training sample changes.
High k (e.g., k = n): high bias but low variance. Predictions are heavily averaged and stable, but they can smooth over genuine local structure.
Optimal k minimizes total error by balancing bias and variance. This sweet spot depends on the true underlying function, the noise level, the sample size, and the data distribution. There's no universal 'best k'—it must be determined empirically for each problem.
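To make this concrete, here is a minimal simulation sketch. It assumes a 1-D target function (a sine curve with Gaussian noise, chosen purely for illustration), repeatedly resamples training sets, fits scikit-learn's KNeighborsRegressor for several k, and estimates the bias² and variance terms on a fixed test grid. The function, noise level, and sample sizes are assumptions, not part of the original discussion.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed ground-truth function, used only to measure bias.
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0, 1, 200).reshape(-1, 1)
noise_sd, n_train, n_repeats = 0.3, 100, 200

for k in [1, 5, 15, 50, 100]:
    preds = np.empty((n_repeats, len(x_test)))
    for r in range(n_repeats):
        x_train = rng.uniform(0, 1, (n_train, 1))
        y_train = true_f(x_train).ravel() + rng.normal(0, noise_sd, n_train)
        model = KNeighborsRegressor(n_neighbors=k).fit(x_train, y_train)
        preds[r] = model.predict(x_test)
    # Bias^2: squared gap between the average prediction and the true function.
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_test).ravel()) ** 2)
    # Variance: how much predictions fluctuate across resampled training sets.
    variance = np.mean(preds.var(axis=0))
    print(f"k={k:3d}  bias^2={bias2:.4f}  variance={variance:.4f}")
```

Running a sketch like this typically shows variance collapsing and bias growing as k increases, with the smallest total error somewhere in between.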
| k Value | Bias | Variance | Decision Boundary | Typical Behavior |
|---|---|---|---|---|
| k = 1 | Low | Very High | Highly irregular | Overfits noise, especially near class boundaries |
| k = √n | Moderate | Moderate | Moderately smooth | Common heuristic, often reasonable starting point |
| k = n/10 | Moderate-High | Low-Moderate | Smooth | Works well for noisy data |
| k = n | Maximum | Minimum | Constant (majority class) | Predicts majority class everywhere |
The most intuitive way to understand k's role is to visualize decision boundaries as k changes.
k = 1: The Voronoi Tessellation
With k = 1, the feature space is partitioned into Voronoi cells. Each training point "owns" a region of space—all query points in that region are assigned the owner's label. The decision boundary is a piecewise-linear structure following the perpendicular bisectors between differently-labeled neighbors.
This creates:
- zero training error, since every training point is its own nearest neighbor;
- jagged, highly irregular decision boundaries;
- isolated "islands" of one class around noisy or mislabeled points;
- poor generalization, especially near class boundaries.
k = large: Smooth, Simple Boundaries
As k increases, decision boundaries become smoother. Each point is classified based on the majority of many neighbors, so local irregularities get 'voted out.' In the limit, boundaries approach those of simpler models.
$$\lim_{k \to n} \hat{y}(\mathbf{x}) = \text{majority class in training set}$$
The decision boundary for k = n is trivial: everything is classified as the majority class.
Increasing k is a form of regularization. Just as L2 regularization penalizes complex weight vectors, large k penalizes complex decision boundaries by forcing consensus across more neighbors. This controls overfitting without explicitly adding a regularization term.
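As a rough illustration, the sketch below fits scikit-learn's KNeighborsClassifier on a noisy two-moons toy dataset and compares training accuracy with a crude "roughness" count of label changes along a prediction grid. The dataset, grid extent, and roughness proxy are illustrative assumptions, not a standard diagnostic.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

# Dense grid over the feature space; label changes along the grid give a
# crude proxy for how convoluted the decision boundary is.
xx, yy = np.meshgrid(np.linspace(-2, 3, 200), np.linspace(-1.5, 2, 200))
grid = np.c_[xx.ravel(), yy.ravel()]

for k in [1, 15, 101]:
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    Z = clf.predict(grid).reshape(xx.shape)
    # Count label changes along each row of the grid.
    roughness = int(np.sum(np.abs(np.diff(Z, axis=1))))
    print(f"k={k:3d}  training accuracy={clf.score(X, y):.3f}  "
          f"boundary roughness={roughness}")
```

Typically k = 1 attains perfect training accuracy with a very rough boundary, while large k trades training accuracy for a much smoother one, which is exactly the regularization effect described above.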
The Neighborhood Dynamics
Consider what happens at a query point near the boundary between two classes:
For small k, the prediction depends on which class dominates the immediate vicinity. A slight shift in query position might flip the prediction.
For large k, the prediction depends on the regional class proportions. Class boundaries are determined by where the balance tips, creating smoother transitions.
Mathematically, for a query point $\mathbf{x}_q$ and binary classification:
$$P(y = 1 \mid \mathbf{x}_q) \approx \frac{\sum_{i \in N_k(\mathbf{x}_q)} \mathbf{1}[y_i = 1]}{k}$$
As k grows, this probability estimate becomes more stable (lower variance) but may drift from the true local probability (higher bias if the region is heterogeneous).
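This neighbor fraction is exactly what scikit-learn's unweighted KNeighborsClassifier returns from predict_proba. A small sketch, assuming a synthetic dataset and an arbitrary query point, verifies the correspondence:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
x_q = np.array([[0.5, -0.2]])   # arbitrary query point, for illustration

for k in [3, 15, 75]:
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # With uniform weights, predict_proba is the class fraction among neighbors.
    p_hat = clf.predict_proba(x_q)[0, 1]
    _, idx = clf.kneighbors(x_q, n_neighbors=k)
    frac = y[idx[0]].mean()
    print(f"k={k:3d}  P(y=1|x_q)={p_hat:.3f}  neighbor fraction={frac:.3f}")
```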
The gold standard for selecting k is cross-validation: systematically test different k values and choose the one with best estimated generalization performance.
K-Fold Cross-Validation Procedure
1. Split the training data into K folds (5 or 10 is typical).
2. For each candidate value of k, train on K − 1 folds and evaluate on the held-out fold, rotating through all K folds.
3. Average the validation error across folds for each candidate k.
4. Choose the k with the lowest average validation error, then refit the model on the full training set.
Warning: If you use the same CV score that selected k to report model performance, you're being optimistic. The CV procedure that chose k has implicitly overfit to the validation sets. Use nested cross-validation for unbiased performance estimates, or reserve a separate holdout test set.
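One way to implement both pieces with scikit-learn, assuming a synthetic dataset for illustration: GridSearchCV over n_neighbors performs the selection, and wrapping that search in an outer cross_val_score gives the nested estimate the warning calls for. The scaling step and the candidate grid are assumptions, not prescriptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Distance-based models are sensitive to feature scale, so scale inside the CV loop.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 52, 2))}  # odd k only
grid = GridSearchCV(pipe, param_grid, cv=5)

# Outer loop: estimates the generalization error of the whole selection procedure.
nested_scores = cross_val_score(grid, X, y, cv=5)
print("nested CV accuracy: %.3f +/- %.3f" % (nested_scores.mean(), nested_scores.std()))

# Refit on everything to obtain the k you would actually deploy.
grid.fit(X, y)
print("selected k:", grid.best_params_["kneighborsclassifier__n_neighbors"])
```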
While cross-validation is the rigorous approach, several heuristics provide useful starting points:
| Dataset Characteristic | Suggested k Range | Reasoning |
|---|---|---|
| Small (n < 100) | k = 3 to 7 | Limited data; very small k overfits, very large exhausts data |
| Medium (100 < n < 10K) | k = 5 to √n | Standard range; cross-validate within this |
| Large (n > 10K) | k = 10 to n^(4/5) | Larger k for smoothness; compute cost matters |
| Low noise | k = 1 to 5 | Trust local structure; less averaging needed |
| High noise | k = 10 to 50+ | More averaging to wash out noise |
| Class imbalance | k > minority class size | Ensure minority class can form neighborhoods |
| High dimensionality | Larger k | Distances are less meaningful; more neighbors for stability |
Plot CV error against k. Often the error drops rapidly as k increases from 1, then flattens. The 'elbow' where improvement diminishes is a reasonable choice: it balances performance with simplicity, preferring a smaller k when performance is similar.
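A common way to operationalize the elbow is a one-standard-error-style rule: take the smallest k whose mean CV score is within one standard error of the best score. The sketch below assumes a synthetic dataset and an arbitrary candidate grid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=800, n_features=8, random_state=0)
ks = list(range(1, 62, 2))

means, sems = [], []
for k in ks:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    means.append(scores.mean())
    sems.append(scores.std() / np.sqrt(len(scores)))  # standard error of the mean

best = int(np.argmax(means))
threshold = means[best] - sems[best]
# Smallest k whose mean CV accuracy is within one standard error of the best.
elbow_k = next(k for k, m in zip(ks, means) if m >= threshold)
print("best k by raw CV:", ks[best], " elbow-style k:", elbow_k)
```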
Statistical learning theory provides guidance on optimal k selection under idealized conditions.
Asymptotic Theory
For k-NN classification, the asymptotic behavior is well-understood. As $n \to \infty$:
Theorem (Stone, 1977): If $k \to \infty$ and $k/n \to 0$ as $n \to \infty$, then the k-NN classifier is universally consistent: its error rate converges to the Bayes error rate for any distribution.
This tells us:
- k must grow with the sample size, so that averaging over more neighbors eventually washes out the noise;
- k must grow more slowly than n, so that the neighborhood still shrinks around the query point;
- when both conditions hold, k-NN approaches the best possible (Bayes) error rate without any assumptions about the data distribution.
Optimal Rate: Under smoothness assumptions on the class-conditional densities, the optimal rate is:
$$k^* \propto n^{4/(d+4)}$$
where d is the dimensionality. This gives, for example, $k^* \propto n^{0.8}$ when d = 1, $k^* \propto n^{0.5}$ when d = 4, and only $k^* \propto n^{0.2}$ when d = 16.
Notice how optimal k shrinks relative to n as dimension d increases. In high dimensions, even 'nearby' neighbors are far away, so averaging many of them doesn't help (they're all roughly equidistant). This is another manifestation of the curse of dimensionality—it affects not just whether KNN works, but also how to tune it.
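A quick back-of-the-envelope calculation of this rate for a fixed n (the proportionality constant is unknown, so the absolute numbers are only indicative):

```python
# Evaluate the asymptotic rate k* ~ n^(4/(d+4)) for several dimensions d.
n = 10_000
for d in [1, 2, 5, 10, 50]:
    exponent = 4 / (d + 4)
    print(f"d={d:3d}  exponent={exponent:.2f}  n^exponent ~ {n ** exponent:.0f}")
```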
Finite-Sample Bounds
For finite samples, tighter analysis is possible. One important result:
Theorem (Cover-Hart, 1967): For the 1-NN classifier, as $n \to \infty$:
$$R^* \leq R_{1\text{-NN}} \leq R^* \left(2 - \frac{c}{c-1}R^*\right)$$
where $R^*$ is the Bayes error rate and $c$ is the number of classes. For binary classification:
$$R^* \leq R_{1\text{-NN}} \leq 2R^*(1 - R^*)$$
This bounds 1-NN error in terms of the best possible error. Key insight: 1-NN is at most twice as bad as optimal (asymptotically), which is remarkable for such a simple algorithm.
For general k, tighter bounds exist:
$$R_{k\text{-NN}} \leq R^* + \sqrt{\frac{2R^*}{k}}$$
This shows that larger k can improve the error bound (up to a point), consistent with the variance-reduction intuition.
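Plugging a few Bayes error rates into the bounds quoted above makes them tangible; the choice k = 25 below is arbitrary.

```python
import math

# Binary case: R* <= R_1NN <= 2 R*(1 - R*), and the quoted k-NN bound R* + sqrt(2 R* / k).
k = 25
for bayes in [0.01, 0.05, 0.10, 0.20]:
    one_nn_upper = 2 * bayes * (1 - bayes)
    knn_upper = bayes + math.sqrt(2 * bayes / k)
    print(f"R*={bayes:.2f}  1-NN upper bound={one_nn_upper:.3f}  "
          f"{k}-NN upper bound={knn_upper:.3f}")
```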
The considerations for choosing k differ somewhat between classification and regression tasks:
Classification
- The prediction is a vote, so ties are possible; an odd k (for binary problems) or distance-based tie-breaking avoids deadlocks.
- The fraction of neighbors in each class doubles as a probability estimate, and with small k those estimates are coarse (multiples of 1/k only).
Regression
- The prediction is the mean (or weighted mean) of the neighbors' target values, so k directly sets how smooth the fitted function is.
- Outlying target values pull small-k predictions strongly; larger k, distance weighting, or a median aggregate dampens their influence.
A single fixed k for all query points can be suboptimal, because different regions of feature space have different characteristics: dense clusters next to sparse outskirts, clean regions next to noisy ones, simple boundaries next to convoluted ones.
Adaptive Nearest Neighbors adjust k per query based on local data density or other criteria.
Instead of fixing k, fix a distance threshold r and use all points within that radius. This naturally adapts: dense regions get more neighbors, sparse regions get fewer. However, this requires choosing r, which is also problem-dependent. Hybrid approaches exist that combine both strategies.
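A minimal sketch with scikit-learn, assuming a two-moons toy dataset and an arbitrary radius of 0.4, comparing the fixed-k and fixed-radius classifiers:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

X, y = make_moons(n_samples=600, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=15).fit(X_tr, y_tr)
# Radius-based variant: every training point within radius r votes.
# outlier_label handles queries with no neighbor inside the radius.
rnn = RadiusNeighborsClassifier(radius=0.4,
                                outlier_label="most_frequent").fit(X_tr, y_tr)

print("fixed-k accuracy:      %.3f" % knn.score(X_te, y_te))
print("fixed-radius accuracy: %.3f" % rnn.score(X_te, y_te))
```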
Local Cross-Validation
A principled approach (sketched in code below):
1. For each query point, gather a moderately large pool of nearby training points.
2. Run leave-one-out validation within that local pool for each candidate k.
3. Use the locally best k to make the prediction at that query.
This is computationally expensive but can significantly improve performance in heterogeneous datasets.
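A minimal sketch of this idea, assuming scikit-learn and an arbitrary local pool size and candidate grid (not a reference implementation):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

def local_k_predict(X_train, y_train, x_query,
                    k_candidates=(1, 3, 5, 9, 15), m_local=60):
    """Choose k per query by leave-one-out error inside the query's local pool.

    Assumes len(X_train) >= m_local; m_local and k_candidates are arbitrary.
    """
    nn = NearestNeighbors(n_neighbors=m_local).fit(X_train)
    _, idx = nn.kneighbors(x_query.reshape(1, -1))
    X_loc, y_loc = X_train[idx[0]], y_train[idx[0]]

    best_k, best_err = k_candidates[0], np.inf
    for k in k_candidates:
        errs = []
        for i in range(len(X_loc)):            # leave-one-out within the pool
            mask = np.arange(len(X_loc)) != i
            clf = KNeighborsClassifier(n_neighbors=k).fit(X_loc[mask], y_loc[mask])
            errs.append(clf.predict(X_loc[i:i + 1])[0] != y_loc[i])
        err = float(np.mean(errs))
        if err < best_err:
            best_k, best_err = k, err

    final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
    return final.predict(x_query.reshape(1, -1))[0], best_k

# Example usage (hypothetical data):
# y_hat, k_used = local_k_predict(X_train, y_train, x_query)
```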
Density-Based Adaptation
Simpler approach using local density:
$$k(\mathbf{x}_q) = \max\left(k_{\min}, \min\left(k_{\max}, \frac{c}{\hat{f}(\mathbf{x}_q)}\right)\right)$$
where $\hat{f}(\mathbf{x}_q)$ is the estimated local density. Denser regions use smaller k; sparser regions use larger k.
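A sketch of this rule using a kernel density estimate for $\hat{f}$; the constant c, the bandwidth, and the clipping range are placeholder values that would need tuning for any real dataset:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def adaptive_k(X_train, x_query, k_min=5, k_max=50, c=0.05, bandwidth=0.5):
    """Map estimated local density to a neighborhood size (illustrative constants)."""
    kde = KernelDensity(bandwidth=bandwidth).fit(X_train)
    # score_samples returns log-density; exponentiate to get \hat{f}(x_q).
    density = float(np.exp(kde.score_samples(x_query.reshape(1, -1)))[0])
    return int(np.clip(c / max(density, 1e-12), k_min, k_max))

# Example usage (hypothetical data):
# k_q = adaptive_k(X_train, x_query)
```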
When Adaptive K Helps:
- data density varies sharply across the feature space;
- noise levels differ from region to region;
- the decision boundary is simple in some areas and intricate in others.
The parameter k is KNN's primary control over model complexity. Choosing it well is essential for good performance.
Coming Up Next: In Voting Schemes, we'll explore how to combine neighbor labels once we've found them. Majority voting is just one option—weighted voting, kernel weighting, and probabilistic approaches can significantly improve predictions.
You now understand the central role of k in the KNN algorithm. It's not just a number—it's the knob that controls how much your model trusts local structure versus seeking consensus across broader regions. The right k depends on your data's noise level, density, and dimensionality, and should be determined empirically through cross-validation.