Accurate local density estimation is the linchpin of effective density-based anomaly detection. While we've established that relative density—comparing a point's density to its neighbors—is the key concept, the quality of our anomaly detection hinges on how well we estimate these local densities.
Poor density estimates lead to unstable, noisy anomaly scores: false alarms near cluster boundaries and missed outliers in legitimately sparse regions.
This page dives deep into the techniques for robust local density estimation, from simple k-NN approaches to sophisticated kernel methods, and the mathematical principles that make them effective.
This page covers: (1) k-NN based density estimation variants and their properties; (2) Kernel density estimation adapted for local neighborhoods; (3) The Local Reachability Density used in LOF; (4) Adaptive density estimation techniques; (5) Bias-variance tradeoffs in density estimation for anomaly detection; and (6) Practical implementation considerations.
The k-nearest neighbor (k-NN) approach to density estimation is foundational in anomaly detection. It provides a simple, intuitive, and computationally tractable way to estimate local density.
The k-NN density estimator is based on a key insight: the volume required to contain k neighbors is inversely related to density. In dense regions, k neighbors are found in a small volume; in sparse regions, one must search a larger volume.
Basic k-NN Density Estimator: $$\hat{p}_k(\mathbf{x}) = \frac{k}{n \cdot V_d(r_k(\mathbf{x}))}$$
where $n$ is the number of samples, $r_k(\mathbf{x})$ is the distance from $\mathbf{x}$ to its k-th nearest neighbor, and $V_d(r)$ is the volume of a d-dimensional ball of radius $r$.
This estimator has strong theoretical backing: as $n \to \infty$ with k growing appropriately, $\hat{p}_k(\mathbf{x}) \to p(\mathbf{x})$ (consistency).
Several variants exist, each with different properties:
1. Simple k-Distance Density: $$\rho_k^{(1)}(\mathbf{x}) = \frac{1}{d_k(\mathbf{x})}$$
Simplest form—ignores the volume constant and number of points. Useful when only relative densities matter.
2. Average Distance Density: $$\rho_k^{(2)}(\mathbf{x}) = \frac{k}{\sum_{i=1}^{k} d_i(\mathbf{x})}$$
Uses all k distances, not just the k-th. More stable than using only $d_k$.
3. Inverse Sum of Squared Distances: $$\rho_k^{(3)}(\mathbf{x}) = \frac{1}{\sum_{i=1}^{k} d_i(\mathbf{x})^2}$$
Gives more weight to closer neighbors, which may be desirable if immediate neighbors are more informative.
4. Exponential Decay Weighting: $$\rho_k^{(4)}(\mathbf{x}) = \sum_{i=1}^{k} \exp\left(-\frac{d_i(\mathbf{x})}{\sigma}\right)$$
Soft weighting based on distance, with bandwidth parameter σ controlling decay rate.
| Variant | Formula | Sensitivity | Stability | Use Case |
|---|---|---|---|---|
| k-distance | 1/d_k | High | Low | Quick estimates, sensitive detection |
| Average distance | k/Σd_i | Medium | Medium | Balanced approach |
| Sum of squares | 1/Σd_i² | High to close | Medium | Local focus important |
| Exponential | Σexp(-d_i/σ) | Tunable | High | Smooth transitions needed |
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from typing import Dict
from scipy.special import gamma

def compute_knn_density_variants(X: np.ndarray, k: int = 10) -> Dict[str, np.ndarray]:
    """
    Compute various k-NN density estimation variants.

    Parameters:
    -----------
    X : np.ndarray of shape (n_samples, n_features)
        Input data
    k : int
        Number of neighbors

    Returns:
    --------
    dict : Dictionary of density arrays for each variant
    """
    n_samples, d = X.shape

    # Find k nearest neighbors
    nn = NearestNeighbors(n_neighbors=k + 1)  # +1 for self
    nn.fit(X)
    distances, _ = nn.kneighbors(X)
    distances = distances[:, 1:]  # Exclude self-distance

    # 1. Simple k-distance density
    k_distances = distances[:, -1]
    rho_kdist = 1.0 / (k_distances + 1e-10)

    # 2. Average distance density
    rho_avg = k / (distances.sum(axis=1) + 1e-10)

    # 3. Inverse sum of squared distances
    sum_sq_distances = (distances ** 2).sum(axis=1)
    rho_sumsq = 1.0 / (sum_sq_distances + 1e-10)

    # 4. Exponential decay (with adaptive sigma)
    sigma = np.median(distances)  # Adaptive bandwidth
    exp_weights = np.exp(-distances / sigma)
    rho_exp = exp_weights.sum(axis=1)

    # 5. Proper probabilistic k-NN density
    # Volume of d-dimensional unit ball
    v_unit = (np.pi ** (d / 2)) / gamma(d / 2 + 1)
    volumes = v_unit * (k_distances ** d)
    rho_prob = k / (n_samples * volumes + 1e-10)

    return {
        'k_distance': rho_kdist,
        'average': rho_avg,
        'sum_squares': rho_sumsq,
        'exponential': rho_exp,
        'probabilistic': rho_prob,
        'raw_distances': distances,
        'k_distances_raw': k_distances
    }

def normalize_densities(densities: np.ndarray) -> np.ndarray:
    """Normalize densities to [0, 1] range for comparison."""
    min_val = densities.min()
    max_val = densities.max()
    if max_val - min_val < 1e-10:
        return np.ones_like(densities) * 0.5
    return (densities - min_val) / (max_val - min_val)

# Demonstration
np.random.seed(42)

# Create dataset with variable density clusters
dense_cluster = np.random.randn(200, 2) * 0.3
sparse_cluster = np.random.randn(50, 2) * 1.2 + np.array([5, 5])
outliers = np.array([[3, 3], [-2, 4], [7, 1]])

X = np.vstack([dense_cluster, sparse_cluster, outliers])

# Compute all density variants
k = 10
densities = compute_knn_density_variants(X, k=k)

print("Density Estimation Comparison (normalized to [0,1]):")
print("-" * 60)

for name in ['k_distance', 'average', 'sum_squares', 'exponential', 'probabilistic']:
    normalized = normalize_densities(densities[name])

    # Statistics for different point groups
    dense_mean = normalized[:200].mean()
    sparse_mean = normalized[200:250].mean()
    outlier_mean = normalized[250:].mean()

    print(f"{name.upper()}:")
    print(f"  Dense cluster mean:  {dense_mean:.4f}")
    print(f"  Sparse cluster mean: {sparse_mean:.4f}")
    print(f"  Outliers mean:       {outlier_mean:.4f}")

# Correlation between variants
from scipy.stats import spearmanr
print("\n" + "=" * 60)
print("Spearman correlation between density variants:")
variants = ['k_distance', 'average', 'sum_squares', 'exponential']
for i, v1 in enumerate(variants):
    for v2 in variants[i+1:]:
        corr, _ = spearmanr(densities[v1], densities[v2])
        print(f"  {v1} vs {v2}: {corr:.3f}")
```

The choice of k involves a bias-variance tradeoff: small k gives high-variance (noisy) estimates, while large k gives high-bias (oversmoothed) estimates. Rules of thumb: k = √n is often suggested; for anomaly detection, k = 10–20 is common in practice. Consider using multiple k values and aggregating the results for robustness.
The Local Outlier Factor (LOF) algorithm introduced a clever refinement to simple k-NN density: the Local Reachability Density (LRD). This concept addresses a subtle but important issue with raw distance-based density estimation.
Consider a point $\mathbf{x}$ on the boundary of a dense cluster. Its k-nearest neighbors might include both points deep inside the cluster (very close) and points outside it (considerably farther away).
The average distance to neighbors is dominated by the farther points, potentially underestimating $\mathbf{x}$'s density. This instability at cluster boundaries causes noisy anomaly scores.
LOF introduces reachability distance to address this:
Definition (Reachability Distance): $$\text{reach-dist}_k(\mathbf{x}, \mathbf{y}) = \max(d_k(\mathbf{y}), d(\mathbf{x}, \mathbf{y}))$$
where:
Key Insight: If $\mathbf{y}$ is in a dense region (small $d_k(\mathbf{y})$), the reachability distance from $\mathbf{x}$ to $\mathbf{y}$ is just the actual distance. But if $\mathbf{y}$ is in a sparse region (large $d_k(\mathbf{y})$), the reachability distance is 'smoothed' to at least $d_k(\mathbf{y})$.
This smoothing ensures that points in dense cores have stable reachability distances, reducing statistical fluctuations.
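The smoothing behavior follows directly from the max in the definition; a tiny numeric illustration (the distance values are arbitrary, chosen only for the example):

```python
def reach_dist(d_xy: float, k_dist_y: float) -> float:
    """reach-dist_k(x, y) = max(k-dist(y), d(x, y))"""
    return max(k_dist_y, d_xy)

# y deep in a dense region: its k-distance is small, so the actual
# distance passes through unchanged
print(reach_dist(0.9, 0.2))   # 0.9

# y in a sparse region: its large k-distance floors the result, so small
# fluctuations in d(x, y) no longer move the value
print(reach_dist(0.30, 1.5))  # 1.5
print(reach_dist(0.35, 1.5))  # 1.5
```

Note how the last two calls return the same value: the statistical fluctuation in the raw distance has been absorbed by the floor.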
With reachability distance defined, LRD is:
Definition (Local Reachability Density): $$\text{lrd}_k(\mathbf{x}) = \frac{1}{\frac{1}{k} \sum_{\mathbf{y} \in N_k(\mathbf{x})} \text{reach-dist}_k(\mathbf{x}, \mathbf{y})}$$
Or equivalently: $$\text{lrd}_k(\mathbf{x}) = \frac{k}{\sum_{\mathbf{y} \in N_k(\mathbf{x})} \text{reach-dist}_k(\mathbf{x}, \mathbf{y})}$$
LRD is the inverse of the average reachability distance to neighbors. High LRD means $\mathbf{x}$ can 'reach' its neighbors quickly (dense region); low LRD means neighbors are far (sparse region).
1. Smoothness in Dense Regions: For points deep within a dense cluster, their k-distances are small, so reach-dist ≈ actual distances ≈ small values. LRD is high and stable.
2. Stability at Boundaries: For boundary points whose neighbors include some core points, the reachability distance to those core points is dominated by the core's k-distance, providing stability.
3. Sensitivity to Isolation: For true outliers, both the actual distances and neighbors' k-distances are large, resulting in low LRD.
4. Asymmetry: Note that reach-dist($\mathbf{x}$, $\mathbf{y}$) ≠ reach-dist($\mathbf{y}$, $\mathbf{x}$) in general. This asymmetry captures the directional nature of density variations.
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

class LocalReachabilityDensity:
    """
    Compute Local Reachability Density (LRD) as used in LOF.

    This implementation exposes intermediate computations for
    educational purposes.
    """

    def __init__(self, k: int = 20):
        """
        Parameters:
        -----------
        k : int
            Number of neighbors for k-distance and LRD computation
        """
        self.k = k
        self.k_distances_ = None
        self.neighbors_indices_ = None
        self.distances_ = None
        self.lrd_ = None

    def fit(self, X: np.ndarray):
        """Fit the model by computing k-distances for all points."""
        self.X_ = X
        self.n_samples_ = X.shape[0]

        # Find k-nearest neighbors
        nn = NearestNeighbors(n_neighbors=self.k + 1)
        nn.fit(X)
        distances, indices = nn.kneighbors(X)

        # Exclude self
        self.distances_ = distances[:, 1:]
        self.neighbors_indices_ = indices[:, 1:]

        # k-distance is distance to k-th neighbor
        self.k_distances_ = self.distances_[:, -1]

        return self

    def compute_reach_dist(self, x_idx: int, y_idx: int) -> float:
        """
        Compute reachability distance from x to y.

        reach-dist_k(x, y) = max(k-dist(y), d(x, y))
        """
        # Actual distance from x to y
        actual_dist = np.linalg.norm(self.X_[x_idx] - self.X_[y_idx])

        # k-distance of y
        k_dist_y = self.k_distances_[y_idx]

        return max(k_dist_y, actual_dist)

    def compute_lrd(self) -> np.ndarray:
        """
        Compute Local Reachability Density for all points.

        lrd_k(x) = k / sum(reach-dist(x, y) for y in N_k(x))
        """
        lrd = np.zeros(self.n_samples_)

        for i in range(self.n_samples_):
            # Get neighbors of point i
            neighbors = self.neighbors_indices_[i]

            # Sum of reachability distances to neighbors
            reach_dist_sum = 0.0
            for j in neighbors:
                reach_dist_sum += self.compute_reach_dist(i, j)

            # LRD is inverse of average reachability distance
            lrd[i] = self.k / (reach_dist_sum + 1e-10)

        self.lrd_ = lrd
        return lrd

    def compute_lof(self) -> np.ndarray:
        """
        Compute Local Outlier Factor using LRD.

        LOF_k(x) = (1/k) * sum(lrd(y)/lrd(x) for y in N_k(x))
        """
        if self.lrd_ is None:
            self.compute_lrd()

        lof = np.zeros(self.n_samples_)

        for i in range(self.n_samples_):
            neighbors = self.neighbors_indices_[i]

            # Ratio of neighbors' LRD to point's LRD
            lrd_ratios = self.lrd_[neighbors] / (self.lrd_[i] + 1e-10)
            lof[i] = lrd_ratios.mean()

        return lof

# Demonstration
np.random.seed(42)

# Create dataset with clear density variations
cluster1 = np.random.randn(150, 2) * 0.3                     # Very dense
cluster2 = np.random.randn(100, 2) * 0.8 + np.array([4, 4])  # Moderately dense
outliers = np.array([[2, 2], [6, 1], [-2, 3]])               # Isolated points

X = np.vstack([cluster1, cluster2, outliers])

# Compute LRD
lrd_model = LocalReachabilityDensity(k=15)
lrd_model.fit(X)
lrd_values = lrd_model.compute_lrd()
lof_values = lrd_model.compute_lof()

print("Local Reachability Density Analysis:")
print("=" * 50)
print("Dense cluster (n=150):")
print(f"  LRD: mean={lrd_values[:150].mean():.4f}, std={lrd_values[:150].std():.4f}")
print(f"  LOF: mean={lof_values[:150].mean():.4f}, std={lof_values[:150].std():.4f}")

print("Moderate cluster (n=100):")
print(f"  LRD: mean={lrd_values[150:250].mean():.4f}, std={lrd_values[150:250].std():.4f}")
print(f"  LOF: mean={lof_values[150:250].mean():.4f}, std={lof_values[150:250].std():.4f}")

print("Outliers (n=3):")
print(f"  LRD: {lrd_values[250:]}")
print(f"  LOF: {lof_values[250:]}")

# Key insight: LOF ≈ 1 for normal points, LOF >> 1 for outliers
print("Anomaly detection threshold (LOF > 1.5):")
print(f"  Detected anomalies: {np.sum(lof_values > 1.5)}")
print(f"  True outlier LOFs: {lof_values[250:]} (all should be > 1.5)")
```

The reachability distance creates a 'buffer zone' around dense cores. Points just outside a dense cluster see reachability distances dominated by the cluster's k-distances, not the raw (potentially small) distances. This prevents false positive anomalies at cluster edges while maintaining sensitivity to true outliers.
Kernel Density Estimation (KDE) provides a theoretically principled approach to density estimation that can be adapted for anomaly detection. While k-NN methods are more common in practice, KDE offers some advantages in certain scenarios.
The kernel density estimator: $$\hat{p}(\mathbf{x}) = \frac{1}{n \cdot h^d} \sum_{i=1}^{n} K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)$$
where $K$ is a kernel function (Gaussian, Epanechnikov, etc.) and $h$ is the bandwidth.
KDE-based anomaly detection works by fitting a density estimate to the data, scoring each point by its estimated density, and flagging points whose density falls below a threshold (e.g., a contamination percentile).
Advantages: smooth, continuous density estimates; strong theoretical foundations; scores with a direct probabilistic interpretation.
Disadvantages: bandwidth selection is critical and difficult; naive evaluation costs O(n) per query point; accuracy degrades sharply in high dimensions.
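Before turning to the adaptive variants below, the fixed-bandwidth workflow can be sketched with scikit-learn's KernelDensity; the synthetic data, the bandwidth of 0.3, and the 5% threshold are illustrative assumptions, not prescribed values:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)),  # dense cluster
               [[4.0, 4.0]]])                      # isolated point

# Fit KDE and score every training point (score_samples returns log density)
kde = KernelDensity(kernel='gaussian', bandwidth=0.3).fit(X)
log_density = kde.score_samples(X)

# Flag the lowest-density 5% as anomalies
threshold = np.percentile(log_density, 5)
anomalies = log_density < threshold
print(bool(anomalies[-1]))  # the isolated point falls below the threshold
```

The same `score_samples` call works for previously unseen points, which is what makes this pipeline usable for scoring new observations against a trained model.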
To handle varying densities, locally adaptive KDE uses different bandwidths in different regions:
Variable Bandwidth KDE: $$\hat{p}(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_i^d} K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_i}\right)$$
where $h_i$ varies per data point. Common choices:
1. k-NN Bandwidth: $$h_i = d_k(\mathbf{x}_i)$$ Use distance to k-th neighbor as local bandwidth.
2. Pilot Density Bandwidth: $$h_i = h_0 \cdot \left(\frac{\tilde{p}(\mathbf{x}_i)}{g}\right)^{-\alpha}$$ where $\tilde{p}$ is a pilot density estimate, $g$ is its geometric mean, and $\alpha \in [0, 1]$ controls adaptation strength.
Abramson's Square Root Rule: $$h_i \propto \tilde{p}(\mathbf{x}_i)^{-1/2}$$
This achieves better bias properties by using smaller bandwidths in dense regions and larger bandwidths in sparse regions.
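A short sketch of Abramson's rule under these definitions — the pilot bandwidth of 0.5, the global scale $h_0$, and the synthetic two-regime data are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, size=(200, 1)),  # dense region
               rng.normal(5, 1.5, size=(50, 1))])  # sparse region

# Pilot density from a fixed-bandwidth KDE
pilot = np.exp(KernelDensity(bandwidth=0.5).fit(X).score_samples(X))

# Abramson's square-root rule: h_i ∝ pilot(x_i)^(-1/2),
# normalized by the geometric mean g of the pilot density
g = np.exp(np.mean(np.log(pilot)))
h0 = 0.5  # global bandwidth scale (an assumed value)
h = h0 * (pilot / g) ** -0.5

# Dense-region points get smaller bandwidths than sparse-region points
print(h[:200].mean(), h[200:].mean())
```

Normalizing by the geometric mean keeps the overall amount of smoothing comparable to the fixed-bandwidth case while redistributing it: less in dense regions, more in sparse ones.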
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors, KernelDensity

class LocallyAdaptiveKDE:
    """
    Locally Adaptive Kernel Density Estimation for Anomaly Detection.

    Uses k-NN distances to set local bandwidths, adapting to
    varying density regions.
    """

    def __init__(self, k: int = 10, kernel: str = 'gaussian'):
        """
        Parameters:
        -----------
        k : int
            Number of neighbors for local bandwidth estimation
        kernel : str
            Kernel type ('gaussian' or 'epanechnikov')
        """
        self.k = k
        self.kernel = kernel

    def fit(self, X: np.ndarray):
        """Fit the model: compute local bandwidths."""
        self.X_ = X
        self.n_samples_, self.n_features_ = X.shape

        # Find k-nearest neighbors for bandwidth
        nn = NearestNeighbors(n_neighbors=self.k + 1)
        nn.fit(X)
        distances, _ = nn.kneighbors(X)

        # Local bandwidth = k-distance (distance to k-th neighbor)
        self.bandwidths_ = distances[:, -1] + 1e-10

        return self

    def _kernel_value(self, u: np.ndarray) -> np.ndarray:
        """Evaluate kernel at u."""
        if self.kernel == 'gaussian':
            return np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (self.n_features_ / 2)
        elif self.kernel == 'epanechnikov':
            norms = np.sum(u**2, axis=1)
            return np.where(norms < 1, 0.75 * (1 - norms), 0)
        else:
            raise ValueError(f"Unknown kernel: {self.kernel}")

    def score_samples(self, X_query: np.ndarray) -> np.ndarray:
        """
        Compute density estimates at query points.

        Parameters:
        -----------
        X_query : np.ndarray
            Points to evaluate density at

        Returns:
        --------
        np.ndarray : Density estimates
        """
        n_query = X_query.shape[0]
        densities = np.zeros(n_query)

        for i in range(n_query):
            # Sum the contribution from each data point, each with
            # its own bandwidth
            for j in range(self.n_samples_):
                h = self.bandwidths_[j]
                diff = (X_query[i] - self.X_[j]) / h
                kernel_val = self._kernel_value(diff.reshape(1, -1))[0]
                densities[i] += kernel_val / (h ** self.n_features_)

            densities[i] /= self.n_samples_

        return densities

    def detect_anomalies(self, X_query: np.ndarray = None,
                         contamination: float = 0.05) -> tuple:
        """
        Detect anomalies based on density estimates.

        Parameters:
        -----------
        X_query : np.ndarray, optional
            Points to evaluate; if None, use training data
        contamination : float
            Expected fraction of anomalies

        Returns:
        --------
        tuple : (anomaly_labels, density_scores, threshold)
        """
        if X_query is None:
            X_query = self.X_

        densities = self.score_samples(X_query)

        # Threshold at contamination percentile
        threshold = np.percentile(densities, 100 * contamination)
        anomalies = densities < threshold

        return anomalies, densities, threshold

def compare_kde_approaches(X: np.ndarray, contamination: float = 0.05):
    """Compare fixed bandwidth vs locally adaptive KDE."""
    # Fixed bandwidth KDE (sklearn)
    # Use Scott's rule: h = n^(-1/(d+4)) * sigma
    n, d = X.shape
    bandwidth = n ** (-1 / (d + 4)) * X.std()

    kde_fixed = KernelDensity(bandwidth=bandwidth, kernel='gaussian')
    kde_fixed.fit(X)
    densities_fixed = np.exp(kde_fixed.score_samples(X))

    # Locally adaptive KDE
    kde_adaptive = LocallyAdaptiveKDE(k=10)
    kde_adaptive.fit(X)
    densities_adaptive = kde_adaptive.score_samples(X)

    # Determine anomalies
    thresh_fixed = np.percentile(densities_fixed, 100 * contamination)
    thresh_adaptive = np.percentile(densities_adaptive, 100 * contamination)

    anomalies_fixed = densities_fixed < thresh_fixed
    anomalies_adaptive = densities_adaptive < thresh_adaptive

    return {
        'fixed': {
            'densities': densities_fixed,
            'anomalies': anomalies_fixed,
            'bandwidth': bandwidth
        },
        'adaptive': {
            'densities': densities_adaptive,
            'anomalies': anomalies_adaptive,
            'bandwidths': kde_adaptive.bandwidths_
        }
    }

# Demonstration
np.random.seed(42)

# Multi-density dataset
dense = np.random.randn(150, 2) * 0.3
sparse = np.random.randn(80, 2) * 1.0 + np.array([4, 4])
outliers = np.array([[2, 2], [-1.5, 3], [6, 0], [4, 8]])

X = np.vstack([dense, sparse, outliers])

results = compare_kde_approaches(X, contamination=0.02)

print("KDE Anomaly Detection Comparison:")
print("=" * 50)
print("Fixed Bandwidth KDE:")
print(f"  Bandwidth: {results['fixed']['bandwidth']:.4f}")
print(f"  Anomalies detected: {results['fixed']['anomalies'].sum()}")

print("Adaptive Bandwidth KDE:")
print(f"  Bandwidth range: [{results['adaptive']['bandwidths'].min():.4f}, "
      f"{results['adaptive']['bandwidths'].max():.4f}]")
print(f"  Anomalies detected: {results['adaptive']['anomalies'].sum()}")

# Check if outliers were detected
n_normal = 230
print("Outlier Detection (4 true outliers):")
print(f"  Fixed:    detected {results['fixed']['anomalies'][n_normal:].sum()}/4")
print(f"  Adaptive: detected {results['adaptive']['anomalies'][n_normal:].sum()}/4")
```

Understanding the bias-variance tradeoff is crucial for selecting appropriate density estimation parameters, directly impacting anomaly detection quality.
For any density estimator $\hat{p}(\mathbf{x})$, the mean squared error decomposes as: $$\text{MSE}(\hat{p}) = \text{Bias}^2(\hat{p}) + \text{Var}(\hat{p})$$
Bias: Systematic error—does the expected estimate equal the true density? Variance: Random error—how much does the estimate fluctuate across different samples?
Small k (e.g., 1-5): low bias, high variance — the estimate follows local structure closely but fluctuates strongly with sampling noise.
Large k (e.g., 50+): high bias, low variance — the estimate is stable but smooths over genuine local density variations.
| k Value | Bias | Variance | Anomaly Detection Impact |
|---|---|---|---|
| Very small (1-3) | Very low | Very high | Many false positives; unstable rankings |
| Small (5-10) | Low | High | Sensitive detection; some noise |
| Moderate (15-30) | Moderate | Moderate | Balanced; good default choice |
| Large (50-100) | High | Low | Stable but may miss subtle anomalies |
| Very large (>n/10) | Very high | Very low | Essentially global; not local anymore |
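The variance side of this table is easy to see empirically. The following Monte Carlo sketch (toy data and trial counts are assumptions) estimates the simple k-distance density $1/d_k$ at the origin across many resampled datasets and compares the relative spread for small vs large k:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

def density_at_origin(X: np.ndarray, k: int) -> float:
    # 1/d_k evaluated at the query point [0, 0]
    dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors([[0.0, 0.0]])
    return 1.0 / dists[0, -1]

estimates = {2: [], 50: []}
for _ in range(200):
    X = rng.normal(0, 1, size=(500, 2))  # fresh sample each trial
    for k in estimates:
        estimates[k].append(density_at_origin(X, k))

# Coefficient of variation: std relative to mean, per k
cv = {k: np.std(v) / np.mean(v) for k, v in estimates.items()}
print(cv)  # k=2 fluctuates far more, relative to its mean, than k=50
```

The same experiment with intermediate k values traces out the middle rows of the table: variance falls steadily as k grows, while the estimate increasingly reflects a neighborhood average rather than the density at the point itself.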
Small bandwidth h:
Large bandwidth h:
For KDE with Gaussian kernel in d dimensions: $$\text{Bias}(\hat{p}) = O(h^2)$$ $$\text{Var}(\hat{p}) = O\left(\frac{1}{n \cdot h^d}\right)$$
Optimal bandwidth balances these: $$h_{\text{opt}} = O(n^{-1/(d+4)})$$
This rate underlies the familiar Scott and Silverman bandwidth rules of thumb, widely used in practice.
For k-NN density estimation: $$k_{\text{opt}} = O(n^{4/(d+4)})$$
Note how both depend on dimensionality $d$—as dimension increases, we need more smoothing to combat variance, at the cost of increased bias.
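A quick numeric check of what the $O(n^{-1/(d+4)})$ rate implies (taking the constant and $\sigma$ to be 1 for simplicity):

```python
def opt_bandwidth(n: int, d: int, sigma: float = 1.0) -> float:
    # The O(n^(-1/(d+4))) optimal-rate bandwidth with unit constant
    return sigma * n ** (-1.0 / (d + 4))

# More data permits a smaller bandwidth, but the gain slows sharply with
# dimension: going from 10^3 to 10^6 samples shrinks h by about 4x in 1-D
# but only about 1.6x in 10-D.
for d in (1, 10):
    print(d, opt_bandwidth(10**3, d), opt_bandwidth(10**6, d))
```

In other words, a thousand-fold increase in data buys much less extra resolution in high dimensions — the quantitative face of the curse of dimensionality discussed next.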
In high dimensions, the bias-variance balance becomes increasingly difficult. The volume of a hypersphere grows as r^d, so neighbor distances become more uniform. Local density estimates require exponentially more data to maintain accuracy. Rule of thumb: beyond ~15-20 dimensions, consider dimensionality reduction before density-based anomaly detection.
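The claim that neighbor distances become more uniform can be demonstrated directly; this sketch (sample sizes and dimensions are illustrative assumptions) compares the farthest-to-nearest distance ratio from one query point in low and high dimensions:

```python
import numpy as np

# Distance concentration: as d grows, the nearest and farthest neighbors
# of a point end up at relatively similar distances.
rng = np.random.default_rng(0)
ratios = {}
for d in (2, 100):
    X = rng.uniform(size=(1000, d))
    dist = np.linalg.norm(X - X[0], axis=1)[1:]  # distances from one point
    ratios[d] = dist.max() / dist.min()
print(ratios)  # contrast collapses as d grows
```

When this ratio approaches 1, "k nearest neighbors" carries little information, and density estimates built on those distances lose discriminative power — hence the advice to reduce dimensionality first.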
1. Robustness over Sensitivity: For production systems, slightly higher bias (larger k or h) often beats low bias with high variance. Stable anomaly scores are more actionable than volatile ones.
2. Ensemble Strategies: Combine estimates from multiple k values: $$\bar{\rho}(\mathbf{x}) = \frac{1}{|K|} \sum_{k \in K} \rho_k(\mathbf{x})$$ or use weighted combinations favoring optimal k.
3. Adaptive Selection: Use data-driven methods to select k or h — for example, likelihood cross-validation for KDE bandwidths, or choosing a k at which anomaly rankings stabilize across nearby values.
4. Consider the Use Case: exploratory analysis can tolerate noisier, more sensitive scores (smaller k or h); automated alerting favors stability (larger k or h, or ensembles).
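The ensemble strategy in point 2 can be sketched in a few lines; the k values and toy data here are assumptions for illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def ensemble_knn_density(X: np.ndarray, ks=(5, 10, 20)) -> np.ndarray:
    """Average the simple k-distance density 1/d_k over several k values."""
    nn = NearestNeighbors(n_neighbors=max(ks) + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    distances = distances[:, 1:]  # drop self-distance
    # Column k-1 of the remaining matrix is the distance to the k-th neighbor
    return np.mean([1.0 / (distances[:, k - 1] + 1e-10) for k in ks], axis=0)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)), [[5.0, 5.0]]])
rho = ensemble_knn_density(X)
print(rho.argmin())  # the isolated point (index 100) has the lowest density
```

A single neighbor search at the largest k serves all ensemble members, so the averaging costs essentially nothing beyond the base estimator.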
Beyond basic k-NN and KDE approaches, several advanced techniques improve local density estimation for anomaly detection.
INFLO extends LOF by considering not just k-nearest neighbors but also reverse k-nearest neighbors (points that have the query point as a neighbor):
$$\text{INFLO}_k(\mathbf{x}) = \frac{\frac{1}{|IS_k(\mathbf{x})|} \sum_{\mathbf{y} \in IS_k(\mathbf{x})} \text{den}(\mathbf{y})}{\text{den}(\mathbf{x})}$$
where $IS_k(\mathbf{x}) = N_k(\mathbf{x}) \cup RNN_k(\mathbf{x})$ is the influence set, combining neighbors and reverse neighbors.
Benefit: More symmetric density comparison, better handling of density gradients.
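The influence set itself is straightforward to compute; a minimal sketch (reverse neighbors only, not the full INFLO score, with a tiny 1-D toy dataset as an assumption):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def influence_set(X: np.ndarray, k: int = 3):
    """IS_k(x) = N_k(x) ∪ RNN_k(x): neighbors plus reverse neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbors = [set(row[1:]) for row in idx]    # N_k, excluding self
    reverse = [set() for _ in range(len(X))]
    for i, nbrs in enumerate(neighbors):         # RNN_k: who lists i?
        for j in nbrs:
            reverse[j].add(i)
    return [neighbors[i] | reverse[i] for i in range(len(X))]

X = np.array([[0.0], [0.1], [0.2], [5.0]])
IS = influence_set(X, k=2)
# The isolated point at 5.0 has neighbors but no reverse neighbors,
# while cluster points pick it up in their influence sets:
print(IS[3], IS[1])
```

The asymmetry is visible immediately: the outlier claims cluster points as neighbors, but no cluster point claims it back, which is exactly the signal INFLO exploits.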
LoOP converts density-based outlier scores into calibrated probabilities using statistical reasoning:
$$\text{LoOP}(\mathbf{x}) = \max\left(0, \text{erf}\left(\frac{\text{PLOF}(\mathbf{x})}{\sqrt{2} \cdot \text{nPLOF}}\right)\right)$$
where PLOF is a probabilistic local outlier factor and nPLOF is a normalization factor.
Benefit: Scores are bounded in [0, 1] with probabilistic interpretation.
ABOD uses the variance of angles between a point and pairs of other points:
$$\text{ABOD}(\mathbf{x}) = \text{Var}_{\mathbf{y}, \mathbf{z} \in N_k(\mathbf{x})}\left[\frac{\langle \mathbf{y} - \mathbf{x}, \mathbf{z} - \mathbf{x} \rangle}{\|\mathbf{y} - \mathbf{x}\|^2 \|\mathbf{z} - \mathbf{x}\|^2}\right]$$
Benefit: More robust in high dimensions where distance-based measures degrade.
In high-dimensional data, outliers may only be anomalous in certain feature subspaces:
SOD (Subspace Outlier Degree): Projects data onto subspace spanned by k-nearest neighbors, measures deviation.
HiCS (High Contrast Subspaces): Searches for subspaces where points have unusual distributions.
LSOD (Locally Subspace Outlier Detection): Combines local density with subspace analysis.
Benefit: Can detect anomalies that are masked when viewing all dimensions.
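A toy illustration of that masking effect (this is not SOD or HiCS, just a minimal demonstration of the principle; the data construction is an assumption): a point can have unremarkable values in every single dimension yet be a clear outlier within a 2-D subspace.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
t = rng.uniform(-1, 1, size=200)
X = np.column_stack([t,
                     t + rng.normal(0, 0.05, 200),  # strongly correlated pair
                     rng.normal(0, 1, 200)])        # irrelevant noise dimension
# Normal marginals in every coordinate, but it breaks the correlation:
outlier = np.array([[0.5, -0.5, 0.0]])
X = np.vstack([X, outlier])

def knn_outlier_score(Z: np.ndarray, k: int = 10) -> np.ndarray:
    # Mean k-NN distance as a simple outlier score
    d, _ = NearestNeighbors(n_neighbors=k + 1).fit(Z).kneighbors(Z)
    return d[:, 1:].mean(axis=1)

full = knn_outlier_score(X)        # all 3 dims: noise dim may dilute the signal
sub = knn_outlier_score(X[:, :2])  # correlated subspace: the outlier stands out
print(sub.argmax())  # index 200 — the planted outlier
```

In the subspace restricted to the correlated pair, the planted point has by far the largest neighbor distances; whether it still tops the ranking in the full space depends on how much the noise dimension dilutes the contrast.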
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.special import erf

class LocalOutlierProbability:
    """
    Local Outlier Probabilities (LoOP) implementation.

    Provides calibrated outlier probabilities based on local density.
    """

    def __init__(self, k: int = 20, lambda_factor: float = 3.0):
        """
        Parameters:
        -----------
        k : int
            Number of neighbors
        lambda_factor : float
            Controls extent (typically 2-4)
        """
        self.k = k
        self.lambda_factor = lambda_factor

    def fit_predict(self, X: np.ndarray) -> np.ndarray:
        """
        Compute LoOP scores for all points.

        Returns:
        --------
        np.ndarray : Outlier probabilities in [0, 1]
        """
        n_samples = X.shape[0]

        # Find k-nearest neighbors
        nn = NearestNeighbors(n_neighbors=self.k + 1)
        nn.fit(X)
        distances, indices = nn.kneighbors(X)
        distances = distances[:, 1:]  # Exclude self
        indices = indices[:, 1:]

        # Step 1: Compute probabilistic distance (pdist)
        # pdist = lambda * sqrt(mean(d^2))
        pdist = self.lambda_factor * np.sqrt(np.mean(distances**2, axis=1))

        # Step 2: Compute PLOF (Probabilistic Local Outlier Factor)
        plof = np.zeros(n_samples)
        for i in range(n_samples):
            neighbors = indices[i]
            neighbor_pdist = pdist[neighbors].mean()
            plof[i] = (pdist[i] / (neighbor_pdist + 1e-10)) - 1

        # Step 3: Compute nPLOF (normalization factor)
        nplof = self.lambda_factor * np.sqrt(np.mean(plof**2))
        if nplof < 1e-10:
            nplof = 1e-10

        # Step 4: Compute LoOP
        loop = np.maximum(0, erf(plof / (np.sqrt(2) * nplof)))

        return loop

class AngleBasedOutlierDetector:
    """
    Fast Angle-Based Outlier Detection (FastABOD).

    Uses variance of angles to neighbors for outlier scoring.
    """

    def __init__(self, k: int = 20):
        """
        Parameters:
        -----------
        k : int
            Number of neighbors for angle computation
        """
        self.k = k

    def fit_predict(self, X: np.ndarray) -> np.ndarray:
        """
        Compute ABOD scores for all points.

        Lower scores indicate higher outlierness.
        """
        n_samples = X.shape[0]

        # Find k-nearest neighbors
        nn = NearestNeighbors(n_neighbors=self.k + 1)
        nn.fit(X)
        _, indices = nn.kneighbors(X)
        indices = indices[:, 1:]  # Exclude self

        scores = np.zeros(n_samples)

        for i in range(n_samples):
            neighbors = indices[i]

            # Compute angles between all pairs of neighbors
            angle_vars = []
            for j in range(len(neighbors)):
                for m in range(j + 1, len(neighbors)):
                    v1 = X[neighbors[j]] - X[i]
                    v2 = X[neighbors[m]] - X[i]

                    norm1 = np.linalg.norm(v1)
                    norm2 = np.linalg.norm(v2)

                    if norm1 > 1e-10 and norm2 > 1e-10:
                        # Weighted angle factor (ABOD formula)
                        dot = np.dot(v1, v2)
                        weighted_angle = dot / (norm1**2 * norm2**2)
                        angle_vars.append(weighted_angle)

            if len(angle_vars) > 1:
                scores[i] = np.var(angle_vars)
            else:
                scores[i] = 0  # No variance computable

        return scores

# Demonstration: Compare methods
np.random.seed(42)

# Create challenging dataset
cluster1 = np.random.randn(100, 2) * 0.5
cluster2 = np.random.randn(80, 2) * 0.3 + np.array([3, 3])
# Outlier between clusters (local outlier)
local_outlier = np.array([[1.5, 1.5]])
# Global outlier
global_outlier = np.array([[6, 6]])

X = np.vstack([cluster1, cluster2, local_outlier, global_outlier])

# Compare methods
from sklearn.neighbors import LocalOutlierFactor

# 1. LOF (sklearn)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof_labels = lof.fit_predict(X)
lof_scores = -lof.negative_outlier_factor_  # Convert to positive scores

# 2. LoOP
loop = LocalOutlierProbability(k=20)
loop_scores = loop.fit_predict(X)

# 3. FastABOD
abod = AngleBasedOutlierDetector(k=20)
abod_scores = abod.fit_predict(X)

print("Method Comparison on Test Dataset:")
print("=" * 60)
print(f"{'Method':<15} {'Local Outlier':<15} {'Global Outlier':<15}")
print("-" * 60)
print(f"{'LOF':<15} {lof_scores[-2]:<15.4f} {lof_scores[-1]:<15.4f}")
print(f"{'LoOP':<15} {loop_scores[-2]:<15.4f} {loop_scores[-1]:<15.4f}")
print(f"{'ABOD (inv)':<15} {1/(abod_scores[-2]+1e-6):<15.4f} {1/(abod_scores[-1]+1e-6):<15.4f}")

print("Note: For LOF and LoOP, higher = more anomalous")
print("      For ABOD, lower variance = more anomalous (showing 1/ABOD)")
```

Local density estimation is the engine powering density-based anomaly detection. The choice of estimation technique significantly impacts detection quality.
k-NN Density Variants: From simple k-distance to weighted averages, each variant offers different bias-variance characteristics.
Local Reachability Density (LRD): The LOF innovation that smooths density estimates using reachability distances, providing stable estimates at cluster boundaries.
Kernel Density Estimation: Theoretically principled but requires careful bandwidth selection; locally adaptive KDE handles varying densities.
Bias-Variance Tradeoff: Small k/h means sensitive but noisy estimates; large k/h means stable but potentially biased estimates.
Advanced Methods: LoOP, INFLO, ABOD, and subspace methods extend basic approaches for specific scenarios.
Effective local density estimation balances sensitivity and stability. The reachability distance concept from LOF exemplifies this balance: by capping distances at the neighbor's k-distance, it prevents noise from cluster edges while maintaining sensitivity to true outliers.
With local density estimation techniques mastered, we now extend to multivariate settings:
Next Page (Multivariate Density): How density estimation scales to high-dimensional data, dealing with the curse of dimensionality, and multivariate distribution-based anomaly detection.
Final Page (Scalability): Techniques for applying density-based methods to large-scale datasets efficiently.
The density paradigm continues to provide intuitive yet powerful tools for identifying anomalies across diverse data types and scales.