At the heart of anomaly detection lies a deceptively simple insight: anomalies are different from normal data points. But how do we quantify this difference?
Distance-based methods operationalize this intuition by examining the geometric relationships between data points in feature space. The fundamental premise is that normal points tend to cluster together, forming dense regions, while anomalies reside in sparse, isolated regions far from their neighbors.
This geometric perspective transforms the abstract concept of 'abnormality' into a measurable quantity: distance to neighbors. Points that are far from their nearest neighbors are suspicious. Points surrounded by other close points are likely normal.
K-Nearest Neighbor (KNN) based detection represents the most direct implementation of this intuition. It leverages the rich literature of KNN algorithms—originally developed for classification—and repurposes them for the fundamentally different task of identifying outliers.
By the end of this page, you will: (1) Understand the theoretical foundations of KNN-based anomaly detection, (2) Master the algorithmic variations and their trade-offs, (3) Analyze the computational complexity and scalability challenges, (4) Recognize when KNN-based methods are appropriate for specific detection scenarios, and (5) Implement practical solutions with awareness of common pitfalls.
To understand KNN-based anomaly detection rigorously, we must first establish the mathematical framework that governs distance-based outlier identification.
The Nearest Neighbor Paradigm
Given a dataset $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$ where each $x_i \in \mathbb{R}^d$ is a d-dimensional feature vector, we define the k-nearest neighbors of a point $x$ as the set:
$$N_k(x) \subseteq \mathcal{D} \setminus \{x\} \text{ with } |N_k(x)| = k, \text{ such that } d(x, x_i) \leq d(x, x_j) \text{ for all } x_i \in N_k(x) \text{ and all } x_j \in \mathcal{D} \setminus (N_k(x) \cup \{x\})$$
where $d(\cdot, \cdot)$ is a distance function, typically the Euclidean distance:
$$d(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$$
The k-Distance Function
The k-distance of a point $x$, denoted $d_k(x)$, is the distance from $x$ to its k-th nearest neighbor:
$$d_k(x) = \max_{x_i \in N_k(x)} d(x, x_i)$$
This k-distance forms the foundation of all KNN-based anomaly detection methods. Points with large k-distances are far from their neighbors and thus potentially anomalous.
The k-distance of a point x defines a hypersphere centered at x that contains exactly k neighbors. For normal points in dense regions, this hypersphere is small. For anomalies in sparse regions, it must expand significantly to capture k neighbors. The radius of this hypersphere directly correlates with outlierness.
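A small sketch (illustrative synthetic data, not from the text) makes this concrete: the k-distance of a point inside a dense blob is much smaller than that of an isolated point.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Dense cluster of normal points plus one isolated point
cluster = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
outlier = np.array([[6.0, 6.0]])
X = np.vstack([cluster, outlier])

k = 5
# Query k+1 neighbors because each point's nearest neighbor is itself
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)
k_distance = distances[:, -1]  # distance to the k-th true neighbor

print(f"median k-distance in the cluster: {np.median(k_distance[:-1]):.2f}")
print(f"k-distance of the isolated point: {k_distance[-1]:.2f}")
```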
Formal Anomaly Definition
Under the KNN framework, an anomaly can be formally defined using a threshold $\tau$:
$$\text{Anomaly}(x) = \begin{cases} 1 & \text{if } f(d_1(x), d_2(x), \ldots, d_k(x)) > \tau \\ 0 & \text{otherwise} \end{cases}$$
where $f$ is a scoring function that aggregates distance information. The choice of $f$ defines different variants of KNN-based detection:
Variant 1: k-th Nearest Neighbor Distance The simplest approach uses only the k-distance: $$\text{score}(x) = d_k(x)$$
Variant 2: Average Distance to k Neighbors A more robust variant averages all k distances: $$\text{score}(x) = \frac{1}{k} \sum_{i=1}^{k} d_i(x)$$
Variant 3: Sum of Distances to k Neighbors Produces the same ranking as the average (the two differ only by the constant factor $k$), but the score scale changes, which matters when thresholds are set on raw values: $$\text{score}(x) = \sum_{i=1}^{k} d_i(x)$$
Each variant exhibits different sensitivity to local density variations, a nuance that becomes critical in real-world applications.
| Scoring Function | Formula | Sensitivity | Robustness | Use Case |
|---|---|---|---|---|
| k-th Distance | $d_k(x)$ | High to outlying neighbors | Low (single point) | Sharp anomaly boundaries |
| Average Distance | $\frac{1}{k}\sum d_i(x)$ | Moderate, smoothed | Medium (k points average) | General-purpose detection |
| Sum of Distances | $\sum d_i(x)$ | Moderate, scale-dependent | Medium | When thresholds are data-driven |
| Median Distance | $\text{median}(d_1...d_k)$ | Low to extreme values | High (resistant to outliers) | Noisy distance computations |
| Max Distance | $\max(d_1...d_k)$ | Very high | Low | Conservative anomaly detection |
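A quick numeric sketch (made-up neighbor distances, purely illustrative) shows how the variants in the table react when one of the k neighbors happens to be far away:

```python
import numpy as np

# Hypothetical sorted distances from one point to its k = 5 nearest neighbors.
# The fifth neighbor is unusually far away (e.g., it is itself an outlier).
d = np.array([0.9, 1.0, 1.1, 1.2, 6.0])

print("k-th distance :", d[-1])          # 6.0  -> dominated by the far neighbor
print("mean distance :", d.mean())       # 2.04 -> smoothed
print("sum of dists  :", d.sum())        # 10.2 -> same ranking as mean, larger scale
print("median dist   :", np.median(d))   # 1.1  -> ignores the far neighbor
print("max distance  :", d.max())        # 6.0  -> identical to k-th distance here
```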
The fundamental KNN-based anomaly detection algorithm follows a straightforward procedure, though its apparent simplicity conceals subtle implementation details that significantly impact performance and accuracy.
Algorithm: KNN-Based Anomaly Detection
Input: Dataset D = {x₁, x₂, ..., xₙ}, parameter k, threshold τ
Output: Set of anomalies A
1. Initialize anomaly set A = ∅
2. For each point xᵢ in D:
a. Find N_k(xᵢ) = k nearest neighbors of xᵢ in D \ {xᵢ}
b. Compute anomaly score: score(xᵢ) = f(distances to N_k(xᵢ))
3. Rank all points by score in descending order
4. For each point xᵢ:
If score(xᵢ) > τ:
Add xᵢ to A
5. Return A
Critical Implementation Considerations
The algorithm above omits several crucial details that separate naive implementations from production-ready systems:
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from typing import Tuple


def knn_anomaly_scores(X: np.ndarray, k: int = 5, scoring: str = 'mean') -> np.ndarray:
    """
    Compute KNN-based anomaly scores for all points in dataset.

    Parameters
    ----------
    X : np.ndarray of shape (n_samples, n_features)
        The input dataset
    k : int
        Number of nearest neighbors to consider
    scoring : str
        Scoring method: 'kth' (k-th distance), 'mean', 'sum', 'median'

    Returns
    -------
    scores : np.ndarray of shape (n_samples,)
        Anomaly score for each point (higher = more anomalous)
    """
    n_samples = X.shape[0]

    # Validate k
    if k >= n_samples:
        raise ValueError(f"k ({k}) must be less than n_samples ({n_samples})")

    # Build KNN model (k+1 neighbors so the point itself can be excluded)
    nn_model = NearestNeighbors(n_neighbors=k + 1, algorithm='auto', metric='euclidean')
    nn_model.fit(X)

    # Get distances to k nearest neighbors (excluding self)
    distances, indices = nn_model.kneighbors(X)
    distances = distances[:, 1:]  # Exclude self (distance 0)

    # Compute scores based on chosen method
    if scoring == 'kth':
        # Use only the k-th nearest neighbor distance
        scores = distances[:, -1]
    elif scoring == 'mean':
        # Average distance to all k neighbors
        scores = np.mean(distances, axis=1)
    elif scoring == 'sum':
        # Sum of distances to all k neighbors
        scores = np.sum(distances, axis=1)
    elif scoring == 'median':
        # Median distance (robust to outlying neighbors)
        scores = np.median(distances, axis=1)
    else:
        raise ValueError(f"Unknown scoring method: {scoring}")

    return scores


def detect_anomalies(X: np.ndarray, k: int = 5, contamination: float = 0.05,
                     scoring: str = 'mean') -> Tuple[np.ndarray, np.ndarray]:
    """
    Detect anomalies using KNN-based method.

    Parameters
    ----------
    X : np.ndarray
        Input dataset
    k : int
        Number of neighbors
    contamination : float
        Expected proportion of anomalies (0 to 1)
    scoring : str
        Scoring method

    Returns
    -------
    labels : np.ndarray
        Binary labels (1 = anomaly, 0 = normal)
    scores : np.ndarray
        Anomaly scores
    """
    scores = knn_anomaly_scores(X, k, scoring)

    # Determine threshold based on contamination
    threshold = np.percentile(scores, 100 * (1 - contamination))
    labels = (scores > threshold).astype(int)

    return labels, scores
```

Understanding the computational characteristics of KNN-based detection is essential for practical deployment. The naive algorithm has significant complexity that can render it impractical for large datasets.
Naive Algorithm Complexity
The straightforward implementation computes, for each query point, its distance to all other $n - 1$ points at a cost of $O(d)$ per distance, then selects the $k$ smallest. Repeating this for all $n$ points gives a total naive complexity of:
$$O(n^2 \cdot d)$$
For a dataset with 1 million points and 100 features, this translates to approximately 100 trillion operations—clearly intractable.
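The arithmetic behind that figure:
$$n^2 \cdot d = (10^6)^2 \times 100 = 10^{14} \approx 100 \text{ trillion coordinate-level operations}$$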
Time and Space Complexity of KNN Search Strategies
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Naive Brute Force | O(n² · d) | O(n · d) | Only viable for n < 10,000 |
| KD-Tree (low d) | O(n · d · log n) | O(n · d) | Efficient for d < 20 |
| Ball Tree | O(n · d · log n) | O(n · d) | Better for higher d than KD-Tree |
| Locality-Sensitive Hashing | O(n · d · L · K) | O(n · L · K) | Approximate; L hash tables with K hash functions each; sublinear query time |
| Random Projection Trees | O(n · d · log n) | O(n · d) | Good for moderate d |
| FAISS (GPU) | O(n · d + n · log n) | O(n · d) | Massive parallelism, batch processing |
Acceleration Structures for KNN Queries
Production implementations must leverage spatial data structures to achieve subquadratic complexity:
KD-Trees (Low Dimensions) A KD-Tree partitions space recursively along alternating coordinate axes, enabling O(log n) average-case nearest neighbor queries. However, performance degrades exponentially with dimensionality—the so-called "curse of dimensionality" causes nearly all tree nodes to be visited when d > 15-20.
Ball Trees (Moderate Dimensions) Ball Trees use hyperspheres rather than axis-aligned hyperplanes for partitioning. This provides better handling of non-uniform distributions and slightly better high-dimensional performance than KD-Trees, though both suffer from the same fundamental curse.
Approximate Methods For high dimensions, exact KNN becomes intractable. Approximate nearest neighbor (ANN) methods trade accuracy for speed; locality-sensitive hashing, random projection trees, and GPU-accelerated libraries such as FAISS (see the table above) are the common choices, and a sketch of one option follows the warning below.
When dimensionality exceeds 15-20, no spatial index can provide consistent speedup over brute force. At 100+ dimensions, distances concentrate (all pairwise distances become similar), spatial trees degenerate, and even the concept of 'nearest neighbor' becomes less meaningful. For high-dimensional anomaly detection, consider dimensionality reduction (PCA, autoencoders) before applying KNN methods.
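As one concrete option, the sketch below (assuming the faiss package is installed; the parameter values and the mean-distance scoring are illustrative choices, not prescribed by the text) builds an inverted-file index and retrieves neighbors in batch. FAISS returns squared L2 distances, so a square root is taken to match the Euclidean scores used above.

```python
import numpy as np
import faiss  # pip install faiss-cpu (or faiss-gpu)


def approx_knn_scores(X: np.ndarray, k: int = 10, nlist: int = 100, nprobe: int = 10) -> np.ndarray:
    """Approximate mean-distance KNN anomaly scores using a FAISS IVF index."""
    X = np.ascontiguousarray(X, dtype='float32')
    d = X.shape[1]

    # Inverted-file index: partition the data into nlist cells, search nprobe of them
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, nlist)
    index.train(X)
    index.add(X)
    index.nprobe = nprobe  # more probes = more accurate, slower

    # Query k+1 neighbors so the point itself (distance 0, normally the first hit) can be dropped
    sq_dists, _ = index.search(X, k + 1)
    dists = np.sqrt(np.maximum(sq_dists[:, 1:], 0))  # FAISS returns squared L2 distances

    return dists.mean(axis=1)
```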
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import PCA
import time


class EfficientKNNAnomalyDetector:
    """
    Production-ready KNN anomaly detector with automatic algorithm
    selection and optional dimensionality reduction.
    """

    def __init__(self, k: int = 10, algorithm: str = 'auto',
                 leaf_size: int = 30, reduce_dim: int = None,
                 scoring: str = 'mean'):
        """
        Parameters
        ----------
        k : int
            Number of neighbors
        algorithm : str
            'auto', 'ball_tree', 'kd_tree', 'brute'
        leaf_size : int
            Leaf size for tree-based algorithms
        reduce_dim : int or None
            If set, reduce to this many dimensions via PCA
        scoring : str
            Scoring function: 'kth', 'mean', 'sum'
        """
        self.k = k
        self.algorithm = algorithm
        self.leaf_size = leaf_size
        self.reduce_dim = reduce_dim
        self.scoring = scoring
        self.nn_model_ = None
        self.pca_ = None
        self.scores_ = None
        self.threshold_ = None

    def _choose_algorithm(self, n_samples: int, n_features: int) -> str:
        """Select optimal algorithm based on data characteristics."""
        if self.algorithm != 'auto':
            return self.algorithm

        # Heuristics based on scikit-learn guidelines
        if n_features > 20:
            # High dimensionality: trees degrade
            if n_samples > 50000:
                # Very large: use approximate methods externally
                print("Warning: Consider approximate KNN for this scale")
            return 'brute'
        elif n_samples < 1000:
            # Small dataset: brute force is fine
            return 'brute'
        else:
            # Moderate size and dimension: use ball_tree
            return 'ball_tree'

    def fit(self, X: np.ndarray, contamination: float = 0.05):
        """Fit the anomaly detector and compute scores."""
        n_samples, n_features = X.shape

        # Optional dimensionality reduction
        X_processed = X
        if self.reduce_dim is not None and n_features > self.reduce_dim:
            print(f"Reducing dimensionality from {n_features} to {self.reduce_dim}")
            self.pca_ = PCA(n_components=self.reduce_dim)
            X_processed = self.pca_.fit_transform(X)
            n_features = self.reduce_dim

        # Select algorithm
        algo = self._choose_algorithm(n_samples, n_features)
        print(f"Using algorithm: {algo}")

        # Build nearest neighbor model
        start_time = time.time()
        self.nn_model_ = NearestNeighbors(
            n_neighbors=self.k + 1,
            algorithm=algo,
            leaf_size=self.leaf_size,
            metric='euclidean'
        )
        self.nn_model_.fit(X_processed)

        # Query all points
        distances, _ = self.nn_model_.kneighbors(X_processed)
        distances = distances[:, 1:]  # Exclude self

        elapsed = time.time() - start_time
        print(f"KNN computation completed in {elapsed:.2f} seconds")

        # Compute scores
        if self.scoring == 'kth':
            self.scores_ = distances[:, -1]
        elif self.scoring == 'mean':
            self.scores_ = np.mean(distances, axis=1)
        else:
            self.scores_ = np.sum(distances, axis=1)

        # Set threshold based on contamination
        self.threshold_ = np.percentile(
            self.scores_, 100 * (1 - contamination)
        )

        return self

    def predict(self, X: np.ndarray = None) -> np.ndarray:
        """Return anomaly labels for training data or new data."""
        if X is None:
            # Return labels for training data
            return (self.scores_ > self.threshold_).astype(int)

        # Score new data
        X_processed = X
        if self.pca_ is not None:
            X_processed = self.pca_.transform(X)

        distances, _ = self.nn_model_.kneighbors(X_processed)
        distances = distances[:, 1:]

        if self.scoring == 'kth':
            scores = distances[:, -1]
        elif self.scoring == 'mean':
            scores = np.mean(distances, axis=1)
        else:
            scores = np.sum(distances, axis=1)

        return (scores > self.threshold_).astype(int)
```

The choice of distance metric fundamentally shapes what the KNN detector considers 'similar' and 'different.'
Different metrics encode different assumptions about the geometry of the feature space.
The Minkowski Family
The Minkowski distance generalizes several common metrics:
$$d_p(x, y) = \left(\sum_{i=1}^{d} |x_i - y_i|^p\right)^{1/p}$$
Setting $p = 1$ recovers the Manhattan distance, $p = 2$ the Euclidean distance, and $p \to \infty$ the Chebyshev distance. Each choice defines a different 'unit ball' (the set of points at distance 1 from the origin): a diamond for $p = 1$, a sphere for $p = 2$, and an axis-aligned cube for $p \to \infty$.
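A short sketch (toy vectors, illustrative only) showing how the choice of $p$ changes the computed distance, using scipy's Minkowski implementation:

```python
import numpy as np
from scipy.spatial.distance import minkowski

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])

for p in (1, 2, 10):
    print(f"p = {p:>2}: d = {minkowski(x, y, p=p):.3f}")
# p = 1 -> 7.000 (Manhattan), p = 2 -> 5.000 (Euclidean),
# large p -> approaches 4.000, the largest per-coordinate difference (Chebyshev)
```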
Which Metric to Use?
The appropriate metric depends on feature semantics:
| Metric | When to Use | Advantages | Disadvantages |
|---|---|---|---|
| Euclidean (L2) | Continuous features, uniform scales | Intuitive, rotation-invariant | Sensitive to outlying dimensions |
| Manhattan (L1) | Sparse data, grid-like movement | Robust to outliers, interpretable | Not rotation-invariant |
| Chebyshev (L∞) | When any single deviation matters | Catches single extreme features | Ignores other deviations |
| Mahalanobis | Correlated features, elliptical clusters | Accounts for covariance | Requires covariance estimation |
| Cosine Distance (1 − cosine similarity) | Text, normalized embeddings | Scale-invariant, angle-based | Ignores magnitude information |
| Jaccard Distance | Binary features, set similarity | Natural for sets | Only for binary data |
The Mahalanobis Distance
For data with correlated features or non-spherical clusters, the Mahalanobis distance accounts for feature covariance:
$$d_M(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}$$
where $\Sigma$ is the covariance matrix of the data. This effectively 'whitens' the data, transforming elliptical clusters into spherical ones.
Important: The Mahalanobis distance requires estimating $\Sigma$ from data, which can be problematic when the sample size is small relative to the dimensionality (the covariance matrix becomes ill-conditioned or singular, so its inverse is unstable) and when anomalies themselves contaminate the estimate; robust or regularized covariance estimators help with both.
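One practical route, sketched below under the assumption that a lightly regularized covariance estimate is acceptable, is to whiten the data once and then reuse the Euclidean knn_anomaly_scores function defined earlier: Euclidean distance in the whitened space equals Mahalanobis distance in the original space.

```python
import numpy as np


def whiten(X: np.ndarray, ridge: float = 1e-6) -> np.ndarray:
    """Transform X so that Euclidean distance equals Mahalanobis distance.

    A small ridge term keeps the covariance matrix invertible (an assumption
    added here for numerical stability, not part of the original definition).
    """
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False) + ridge * np.eye(X.shape[1])

    # If inv(cov) = L @ L.T (Cholesky), then ||L.T @ (x - y)|| is the Mahalanobis distance
    L = np.linalg.cholesky(np.linalg.inv(cov))
    return X_centered @ L

# Usage sketch:
# X_white = whiten(X)
# scores = knn_anomaly_scores(X_white, k=10, scoring='mean')
```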
Practical Recommendations:
Before applying any distance-based method, normalize your features. With Euclidean distance, a feature ranging [0, 1000] will dominate one ranging [0, 1]. Standard options: (1) Z-score normalization: (x - μ) / σ, (2) Min-max scaling: (x - min) / (max - min), (3) Robust scaling: (x - median) / IQR. For anomaly detection, robust scaling is often preferred since anomalies can skew mean and standard deviation.
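A minimal sketch of that preprocessing step, using scikit-learn's RobustScaler together with the knn_anomaly_scores function defined earlier (the pipeline shape is an assumption for illustration, not prescribed by the text):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler


def scaled_knn_scores(X: np.ndarray, k: int = 10) -> np.ndarray:
    """Robust-scale features (median/IQR) before computing KNN anomaly scores."""
    X_scaled = RobustScaler().fit_transform(X)  # (x - median) / IQR per feature
    return knn_anomaly_scores(X_scaled, k=k, scoring='mean')
```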
The parameter k—the number of nearest neighbors—is the most critical hyperparameter in KNN-based anomaly detection. Its choice fundamentally affects what types of anomalies the detector can find.
Understanding k's Effect
Small k (e.g., k = 1-3): scores respond to very local structure, so isolated single points stand out immediately, but the estimates are noisy and two anomalies that happen to lie close together can validate each other.
Large k (e.g., k = 20-50): scores are averaged over a wider neighborhood, so small clusters of anomalies are still exposed and noise is smoothed out, but genuinely local anomalies get blurred and each query costs more.
The Goldilocks Zone: Typically, k values between 5-20 provide a reasonable balance, but the optimal choice is highly data-dependent.
Data-Driven k Selection Methods
Method 1: Cross-Validation (when labels exist) If you have labeled anomalies (even partial labels), use cross-validation to select k that maximizes detection performance (e.g., F1-score, AUC-PR).
Method 2: Score Stability Analysis For unsupervised scenarios, examine how scores change with k: compute scores over a range of k values and look for a region where the ranking of the top-scoring points stabilizes; if the ranking keeps shifting as k changes, no single value is trustworthy and an ensemble (Method 3) is the safer choice.
Method 3: Ensemble Averaging Instead of choosing a single k, average scores across multiple k values: $$\text{score}_{\text{ensemble}}(x) = \frac{1}{|K|} \sum_{k \in K} \text{score}_k(x)$$
This creates a robust detector that balances local and global perspectives.
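A minimal sketch of the ensemble, reusing the knn_anomaly_scores function defined earlier (the rank-normalization step and the particular k values are assumptions added so that scores from different k values contribute on a comparable scale):

```python
import numpy as np
from scipy.stats import rankdata


def ensemble_knn_scores(X: np.ndarray, k_values=(5, 10, 20, 40)) -> np.ndarray:
    """Average rank-normalized KNN scores over several k values."""
    n = X.shape[0]
    ranks = []
    for k in k_values:
        scores = knn_anomaly_scores(X, k=k, scoring='mean')
        # Convert to ranks in (0, 1] so different k's are on the same scale
        ranks.append(rankdata(scores) / n)
    return np.mean(ranks, axis=0)
```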
Method 4: Contamination-Aware Selection If the expected contamination rate c is known, choose k large enough that any neighborhood of size k is likely to contain mostly normal points; in particular, k should exceed the number of anomalies expected to cluster together, otherwise grouped anomalies mask one another, as the callout below explains.
When the contamination rate is high (> 10%), anomalies become each other's nearest neighbors. Their distances shrink, and they appear less anomalous. This 'anomaly masking' effect requires careful k selection: k should be large enough that normal points appear in anomaly neighborhoods, providing a baseline for comparison.
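The masking effect is easy to reproduce; the sketch below (synthetic data and parameter choices are illustrative) scores a tight clump of five anomalies with a small and a larger k, reusing knn_anomaly_scores from earlier:

```python
import numpy as np

rng = np.random.default_rng(42)

# 300 normal points plus a tight clump of 5 anomalies far from the main cloud
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
clump = rng.normal(loc=8.0, scale=0.05, size=(5, 2))
X = np.vstack([normal, clump])

for k in (3, 10):
    scores = knn_anomaly_scores(X, k=k, scoring='mean')
    ratio = scores[-5:].mean() / scores[:-5].mean()
    print(f"k = {k:>2}: clump scores are {ratio:.1f}x the normal average")
# With k = 3 the clump members are each other's neighbors (masked);
# with k = 10 their neighborhoods must reach the main cloud and scores jump.
```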
A mature understanding of KNN-based anomaly detection requires honest acknowledgment of when the method shines and when it fails. This section provides a balanced assessment.
Core Strengths: conceptual simplicity with an intuitive geometric interpretation, no training phase and no distributional assumptions, naturally ranked anomaly scores, and the flexibility to plug in any distance metric appropriate to the data.
When to Use KNN-Based Detection:
✅ Good fit: low-to-moderate dimensionality, datasets small enough for exact neighbor search (or with an approximate index available), clusters of roughly uniform density, continuous numeric features, and settings where an interpretable baseline is needed.
❌ Poor fit: very high-dimensional data where distances concentrate, very large datasets without approximate-search infrastructure, data whose regions differ sharply in local density, and predominantly categorical features lacking a meaningful distance metric.
The Transition to LOF:
The limitation regarding varying local densities is so significant that it motivated the development of the Local Outlier Factor (LOF) algorithm. LOF explicitly accounts for local density, comparing each point's density to its neighbors' densities rather than using raw distances. This addresses the fundamental weakness of KNN-based detection and will be our focus in subsequent pages.
Despite its limitations, KNN-based detection remains valuable as a baseline method. It's fast to implement, easy to explain, and often performs surprisingly well. Start with KNN-based detection to establish a benchmark, then move to more sophisticated methods (LOF, Isolation Forest, autoencoders) only when KNN's limitations demonstrably affect your use case.
We've established the foundational concepts for distance-based anomaly detection through the lens of k-nearest neighbors. Let's consolidate the essential takeaways: (1) distance to the k nearest neighbors turns 'abnormality' into a measurable quantity; (2) the choice of scoring function (k-th distance, mean, sum, median) trades sensitivity against robustness; (3) the naive algorithm is O(n² · d), so spatial indexes or approximate methods are needed at scale, and all of them struggle in high dimensions; (4) the distance metric and feature scaling determine what 'similar' means, so normalize before measuring; and (5) k is the critical hyperparameter, best chosen via stability analysis, ensembling, or knowledge of the contamination rate.
What's Next:
The next page explores the mean distance to KNN in depth, analyzing its statistical properties, optimal weighting schemes, and practical implementation patterns. We'll see how this seemingly simple variation of KNN-based scoring provides important advantages in specific detection scenarios.
You now possess a comprehensive understanding of KNN-based anomaly detection—its theory, algorithms, complexity, metrics, parameter selection, and honest assessment of capabilities. This foundation prepares you for the more sophisticated distance-based methods that follow: mean distance analysis, Local Outlier Factor (LOF), and LOCI algorithm.