Accurate local density estimation is the linchpin of effective density-based anomaly detection. While we've established that relative density—comparing a point's density to its neighbors—is the key concept, the quality of our anomaly detection hinges on how well we estimate these local densities.
Poor density estimates lead to unstable, noisy anomaly scores: false alarms near cluster boundaries and missed outliers in legitimately sparse regions.
This page dives deep into the techniques for robust local density estimation, from simple k-NN approaches to sophisticated kernel methods, and the mathematical principles that make them effective.
This page covers: (1) k-NN based density estimation variants and their properties; (2) Kernel density estimation adapted for local neighborhoods; (3) The Local Reachability Density used in LOF; (4) Adaptive density estimation techniques; (5) Bias-variance tradeoffs in density estimation for anomaly detection; and (6) Practical implementation considerations.
The k-nearest neighbor (k-NN) approach to density estimation is foundational in anomaly detection. It provides a simple, intuitive, and computationally tractable way to estimate local density.
The k-NN density estimator is based on a key insight: the volume required to contain k neighbors is inversely related to density. In dense regions, k neighbors are found in a small volume; in sparse regions, one must search a larger volume.
Basic k-NN Density Estimator: $$\hat{p}_k(\mathbf{x}) = \frac{k}{n \cdot V_d(r_k(\mathbf{x}))}$$
where $n$ is the number of samples, $r_k(\mathbf{x})$ is the distance from $\mathbf{x}$ to its k-th nearest neighbor, and $V_d(r)$ is the volume of a d-dimensional ball of radius $r$.
This estimator has strong theoretical backing: as $n \to \infty$ with k growing appropriately, $\hat{p}_k(\mathbf{x}) \to p(\mathbf{x})$ (consistency).
Several variants exist, each with different properties:
1. Simple k-Distance Density: $$\rho_k^{(1)}(\mathbf{x}) = \frac{1}{d_k(\mathbf{x})}$$
Simplest form—ignores the volume constant and number of points. Useful when only relative densities matter.
2. Average Distance Density: $$\rho_k^{(2)}(\mathbf{x}) = \frac{k}{\sum_{i=1}^{k} d_i(\mathbf{x})}$$
Uses all k distances, not just the k-th. More stable than using only $d_k$.
3. Inverse Sum of Squared Distances: $$\rho_k^{(3)}(\mathbf{x}) = \frac{1}{\sum_{i=1}^{k} d_i(\mathbf{x})^2}$$
Gives more weight to closer neighbors, which may be desirable if immediate neighbors are more informative.
4. Exponential Decay Weighting: $$\rho_k^{(4)}(\mathbf{x}) = \sum_{i=1}^{k} \exp\left(-\frac{d_i(\mathbf{x})}{\sigma}\right)$$
Soft weighting based on distance, with bandwidth parameter σ controlling decay rate.
| Variant | Formula | Sensitivity | Stability | Use Case |
|---|---|---|---|---|
| k-distance | 1/d_k | High | Low | Quick estimates, sensitive detection |
| Average distance | k/Σd_i | Medium | Medium | Balanced approach |
| Sum of squares | 1/Σd_i² | High to close | Medium | Local focus important |
| Exponential | Σexp(-d_i/σ) | Tunable | High | Smooth transitions needed |
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from typing import Dict
from scipy.special import gamma

def compute_knn_density_variants(X: np.ndarray, k: int = 10) -> Dict[str, np.ndarray]:
    """
    Compute various k-NN density estimation variants.

    Parameters:
    -----------
    X : np.ndarray of shape (n_samples, n_features)
        Input data
    k : int
        Number of neighbors

    Returns:
    --------
    dict : Dictionary of density arrays for each variant
    """
    n_samples, d = X.shape

    # Find k nearest neighbors
    nn = NearestNeighbors(n_neighbors=k + 1)  # +1 for self
    nn.fit(X)
    distances, _ = nn.kneighbors(X)
    distances = distances[:, 1:]  # Exclude self-distance

    # 1. Simple k-distance density
    k_distances = distances[:, -1]
    rho_kdist = 1.0 / (k_distances + 1e-10)

    # 2. Average distance density
    rho_avg = k / (distances.sum(axis=1) + 1e-10)

    # 3. Inverse sum of squared distances
    sum_sq_distances = (distances ** 2).sum(axis=1)
    rho_sumsq = 1.0 / (sum_sq_distances + 1e-10)

    # 4. Exponential decay (with adaptive sigma)
    sigma = np.median(distances)  # Adaptive bandwidth
    exp_weights = np.exp(-distances / sigma)
    rho_exp = exp_weights.sum(axis=1)

    # 5. Proper probabilistic k-NN density
    # Volume of d-dimensional unit ball
    v_unit = (np.pi ** (d / 2)) / gamma(d / 2 + 1)
    volumes = v_unit * (k_distances ** d)
    rho_prob = k / (n_samples * volumes + 1e-10)

    return {
        'k_distance': rho_kdist,
        'average': rho_avg,
        'sum_squares': rho_sumsq,
        'exponential': rho_exp,
        'probabilistic': rho_prob,
        'raw_distances': distances,
        'k_distances_raw': k_distances
    }

def normalize_densities(densities: np.ndarray) -> np.ndarray:
    """Normalize densities to [0, 1] range for comparison."""
    min_val = densities.min()
    max_val = densities.max()
    if max_val - min_val < 1e-10:
        return np.ones_like(densities) * 0.5
    return (densities - min_val) / (max_val - min_val)

# Demonstration
np.random.seed(42)

# Create dataset with variable density clusters
dense_cluster = np.random.randn(200, 2) * 0.3
sparse_cluster = np.random.randn(50, 2) * 1.2 + np.array([5, 5])
outliers = np.array([[3, 3], [-2, 4], [7, 1]])

X = np.vstack([dense_cluster, sparse_cluster, outliers])

# Compute all density variants
k = 10
densities = compute_knn_density_variants(X, k=k)

print("Density Estimation Comparison (normalized to [0,1]):")
print("-" * 60)

for name in ['k_distance', 'average', 'sum_squares', 'exponential', 'probabilistic']:
    normalized = normalize_densities(densities[name])

    # Statistics for different point groups
    dense_mean = normalized[:200].mean()
    sparse_mean = normalized[200:250].mean()
    outlier_mean = normalized[250:].mean()

    print(f"{name.upper()}:")
    print(f"  Dense cluster mean:  {dense_mean:.4f}")
    print(f"  Sparse cluster mean: {sparse_mean:.4f}")
    print(f"  Outliers mean:       {outlier_mean:.4f}")

# Correlation between variants
from scipy.stats import spearmanr
print("\n" + "=" * 60)
print("Spearman correlation between density variants:")
variants = ['k_distance', 'average', 'sum_squares', 'exponential']
for i, v1 in enumerate(variants):
    for v2 in variants[i+1:]:
        corr, _ = spearmanr(densities[v1], densities[v2])
        print(f"  {v1} vs {v2}: {corr:.3f}")
```

The choice of k involves a bias-variance tradeoff: small k gives high-variance (noisy) estimates, while large k gives high-bias (oversmoothed) estimates. Rules of thumb: k = √n is often suggested; for anomaly detection, k = 10–20 is common in practice. Consider using multiple k values and aggregating the results for robustness.
The Local Outlier Factor (LOF) algorithm introduced a clever refinement to simple k-NN density: the Local Reachability Density (LRD). This concept addresses a subtle but important issue with raw distance-based density estimation.
Consider a point $\mathbf{x}$ on the boundary of a dense cluster. Its k-nearest neighbors might include both points deep inside the cluster (very close) and points outside it (considerably farther away).
The average distance to neighbors is dominated by the farther points, potentially underestimating $\mathbf{x}$'s density. This instability at cluster boundaries causes noisy anomaly scores.
LOF introduces reachability distance to address this:
Definition (Reachability Distance): $$\text{reach-dist}_k(\mathbf{x}, \mathbf{y}) = \max(d_k(\mathbf{y}), d(\mathbf{x}, \mathbf{y}))$$
where:
Key Insight: If $\mathbf{y}$ is in a dense region (small $d_k(\mathbf{y})$), the reachability distance from $\mathbf{x}$ to $\mathbf{y}$ is just the actual distance. But if $\mathbf{y}$ is in a sparse region (large $d_k(\mathbf{y})$), the reachability distance is 'smoothed' to at least $d_k(\mathbf{y})$.
This smoothing ensures that points in dense cores have stable reachability distances, reducing statistical fluctuations.
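The smoothing behavior follows directly from the max in the definition; a tiny numeric illustration (the distance values are arbitrary, chosen only for the example):

```python
def reach_dist(d_xy: float, k_dist_y: float) -> float:
    """reach-dist_k(x, y) = max(k-dist(y), d(x, y))"""
    return max(k_dist_y, d_xy)

# y deep in a dense region: its k-distance is small, so the actual
# distance passes through unchanged
print(reach_dist(0.9, 0.2))   # 0.9

# y in a sparse region: its large k-distance floors the result, so small
# fluctuations in d(x, y) no longer move the value
print(reach_dist(0.30, 1.5))  # 1.5
print(reach_dist(0.35, 1.5))  # 1.5
```

Note how the last two calls return the same value: the statistical fluctuation in the raw distance has been absorbed by the floor.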
With reachability distance defined, LRD is:
Definition (Local Reachability Density): $$\text{lrd}_k(\mathbf{x}) = \frac{1}{\frac{1}{k} \sum_{\mathbf{y} \in N_k(\mathbf{x})} \text{reach-dist}_k(\mathbf{x}, \mathbf{y})}$$
Or equivalently: $$\text{lrd}_k(\mathbf{x}) = \frac{k}{\sum_{\mathbf{y} \in N_k(\mathbf{x})} \text{reach-dist}_k(\mathbf{x}, \mathbf{y})}$$
LRD is the inverse of the average reachability distance to neighbors. High LRD means $\mathbf{x}$ can 'reach' its neighbors quickly (dense region); low LRD means neighbors are far (sparse region).
1. Smoothness in Dense Regions: For points deep within a dense cluster, their k-distances are small, so reach-dist ≈ actual distances ≈ small values. LRD is high and stable.
2. Stability at Boundaries: For boundary points whose neighbors include some core points, the reachability distance to those core points is dominated by the core's k-distance, providing stability.
3. Sensitivity to Isolation: For true outliers, both the actual distances and neighbors' k-distances are large, resulting in low LRD.
4. Asymmetry: Note that reach-dist($\mathbf{x}$, $\mathbf{y}$) ≠ reach-dist($\mathbf{y}$, $\mathbf{x}$) in general. This asymmetry captures the directional nature of density variations.
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

class LocalReachabilityDensity:
    """
    Compute Local Reachability Density (LRD) as used in LOF.

    This implementation exposes intermediate computations for
    educational purposes.
    """

    def __init__(self, k: int = 20):
        """
        Parameters:
        -----------
        k : int
            Number of neighbors for k-distance and LRD computation
        """
        self.k = k
        self.k_distances_ = None
        self.neighbors_indices_ = None
        self.distances_ = None
        self.lrd_ = None

    def fit(self, X: np.ndarray):
        """Fit the model by computing k-distances for all points."""
        self.X_ = X
        self.n_samples_ = X.shape[0]

        # Find k-nearest neighbors
        nn = NearestNeighbors(n_neighbors=self.k + 1)
        nn.fit(X)
        distances, indices = nn.kneighbors(X)

        # Exclude self
        self.distances_ = distances[:, 1:]
        self.neighbors_indices_ = indices[:, 1:]

        # k-distance is distance to k-th neighbor
        self.k_distances_ = self.distances_[:, -1]

        return self

    def compute_reach_dist(self, x_idx: int, y_idx: int) -> float:
        """
        Compute reachability distance from x to y.

        reach-dist_k(x, y) = max(k-dist(y), d(x, y))
        """
        # Actual distance from x to y
        actual_dist = np.linalg.norm(self.X_[x_idx] - self.X_[y_idx])

        # k-distance of y
        k_dist_y = self.k_distances_[y_idx]

        return max(k_dist_y, actual_dist)

    def compute_lrd(self) -> np.ndarray:
        """
        Compute Local Reachability Density for all points.

        lrd_k(x) = k / sum(reach-dist(x, y) for y in N_k(x))
        """
        lrd = np.zeros(self.n_samples_)

        for i in range(self.n_samples_):
            # Get neighbors of point i
            neighbors = self.neighbors_indices_[i]

            # Sum of reachability distances to neighbors
            reach_dist_sum = 0.0
            for j in neighbors:
                reach_dist_sum += self.compute_reach_dist(i, j)

            # LRD is inverse of average reachability distance
            lrd[i] = self.k / (reach_dist_sum + 1e-10)

        self.lrd_ = lrd
        return lrd

    def compute_lof(self) -> np.ndarray:
        """
        Compute Local Outlier Factor using LRD.

        LOF_k(x) = (1/k) * sum(lrd(y)/lrd(x) for y in N_k(x))
        """
        if self.lrd_ is None:
            self.compute_lrd()

        lof = np.zeros(self.n_samples_)

        for i in range(self.n_samples_):
            neighbors = self.neighbors_indices_[i]

            # Ratio of neighbors' LRD to point's LRD
            lrd_ratios = self.lrd_[neighbors] / (self.lrd_[i] + 1e-10)
            lof[i] = lrd_ratios.mean()

        return lof

# Demonstration
np.random.seed(42)

# Create dataset with clear density variations
cluster1 = np.random.randn(150, 2) * 0.3                     # Very dense
cluster2 = np.random.randn(100, 2) * 0.8 + np.array([4, 4])  # Moderately dense
outliers = np.array([[2, 2], [6, 1], [-2, 3]])               # Isolated points

X = np.vstack([cluster1, cluster2, outliers])

# Compute LRD
lrd_model = LocalReachabilityDensity(k=15)
lrd_model.fit(X)
lrd_values = lrd_model.compute_lrd()
lof_values = lrd_model.compute_lof()

print("Local Reachability Density Analysis:")
print("=" * 50)
print("Dense cluster (n=150):")
print(f"  LRD: mean={lrd_values[:150].mean():.4f}, std={lrd_values[:150].std():.4f}")
print(f"  LOF: mean={lof_values[:150].mean():.4f}, std={lof_values[:150].std():.4f}")

print("Moderate cluster (n=100):")
print(f"  LRD: mean={lrd_values[150:250].mean():.4f}, std={lrd_values[150:250].std():.4f}")
print(f"  LOF: mean={lof_values[150:250].mean():.4f}, std={lof_values[150:250].std():.4f}")

print("Outliers (n=3):")
print(f"  LRD: {lrd_values[250:]}")
print(f"  LOF: {lof_values[250:]}")

# Key insight: LOF ≈ 1 for normal points, LOF >> 1 for outliers
print("Anomaly detection threshold (LOF > 1.5):")
print(f"  Detected anomalies: {np.sum(lof_values > 1.5)}")
print(f"  True outlier LOFs: {lof_values[250:]} (all should be > 1.5)")
```

The reachability distance creates a 'buffer zone' around dense cores. Points just outside a dense cluster see reachability distances dominated by the cluster's k-distances, not the raw (potentially small) distances. This prevents false positive anomalies at cluster edges while maintaining sensitivity to true outliers.
Kernel Density Estimation (KDE) provides a theoretically principled approach to density estimation that can be adapted for anomaly detection. While k-NN methods are more common in practice, KDE offers some advantages in certain scenarios.
The kernel density estimator: $$\hat{p}(\mathbf{x}) = \frac{1}{n \cdot h^d} \sum_{i=1}^{n} K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)$$
where $K$ is a kernel function (Gaussian, Epanechnikov, etc.) and $h$ is the bandwidth.
KDE-based anomaly detection works by fitting a density estimate to the data, scoring each point by its estimated density, and flagging points whose density falls below a threshold (e.g., a contamination percentile).
Advantages: smooth, continuous density estimates; strong theoretical foundations; scores with a direct probabilistic interpretation.
Disadvantages: bandwidth selection is critical and difficult; naive evaluation costs O(n) per query point; accuracy degrades sharply in high dimensions.
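Before turning to the adaptive variants below, the fixed-bandwidth workflow can be sketched with scikit-learn's KernelDensity; the synthetic data, the bandwidth of 0.3, and the 5% threshold are illustrative assumptions, not prescribed values:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)),  # dense cluster
               [[4.0, 4.0]]])                      # isolated point

# Fit KDE and score every training point (score_samples returns log density)
kde = KernelDensity(kernel='gaussian', bandwidth=0.3).fit(X)
log_density = kde.score_samples(X)

# Flag the lowest-density 5% as anomalies
threshold = np.percentile(log_density, 5)
anomalies = log_density < threshold
print(bool(anomalies[-1]))  # the isolated point falls below the threshold
```

The same `score_samples` call works for previously unseen points, which is what makes this pipeline usable for scoring new observations against a trained model.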
To handle varying densities, locally adaptive KDE uses different bandwidths in different regions:
Variable Bandwidth KDE: $$\hat{p}(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_i^d} K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h_i}\right)$$
where $h_i$ varies per data point. Common choices:
1. k-NN Bandwidth: $$h_i = d_k(\mathbf{x}_i)$$ Use distance to k-th neighbor as local bandwidth.
2. Pilot Density Bandwidth: $$h_i = h_0 \cdot \left(\frac{\tilde{p}(\mathbf{x}_i)}{g}\right)^{-\alpha}$$ where $\tilde{p}$ is a pilot density estimate, $g$ is its geometric mean, and $\alpha \in [0, 1]$ controls adaptation strength.
Abramson's Square Root Rule: $$h_i \propto \tilde{p}(\mathbf{x}_i)^{-1/2}$$
This achieves better bias properties by using smaller bandwidths in dense regions and larger bandwidths in sparse regions.
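A short sketch of Abramson's rule under these definitions — the pilot bandwidth of 0.5, the global scale $h_0$, and the synthetic two-regime data are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, size=(200, 1)),  # dense region
               rng.normal(5, 1.5, size=(50, 1))])  # sparse region

# Pilot density from a fixed-bandwidth KDE
pilot = np.exp(KernelDensity(bandwidth=0.5).fit(X).score_samples(X))

# Abramson's square-root rule: h_i ∝ pilot(x_i)^(-1/2),
# normalized by the geometric mean g of the pilot density
g = np.exp(np.mean(np.log(pilot)))
h0 = 0.5  # global bandwidth scale (an assumed value)
h = h0 * (pilot / g) ** -0.5

# Dense-region points get smaller bandwidths than sparse-region points
print(h[:200].mean(), h[200:].mean())
```

Normalizing by the geometric mean keeps the overall amount of smoothing comparable to the fixed-bandwidth case while redistributing it: less in dense regions, more in sparse ones.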
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors, KernelDensity

class LocallyAdaptiveKDE:
    """
    Locally Adaptive Kernel Density Estimation for Anomaly Detection.

    Uses k-NN distances to set local bandwidths, adapting to
    varying density regions.
    """

    def __init__(self, k: int = 10, kernel: str = 'gaussian'):
        """
        Parameters:
        -----------
        k : int
            Number of neighbors for local bandwidth estimation
        kernel : str
            Kernel type ('gaussian' or 'epanechnikov')
        """
        self.k = k
        self.kernel = kernel

    def fit(self, X: np.ndarray):
        """Fit the model: compute local bandwidths."""
        self.X_ = X
        self.n_samples_, self.n_features_ = X.shape

        # Find k-nearest neighbors for bandwidth
        nn = NearestNeighbors(n_neighbors=self.k + 1)
        nn.fit(X)
        distances, _ = nn.kneighbors(X)

        # Local bandwidth = k-distance (distance to k-th neighbor)
        self.bandwidths_ = distances[:, -1] + 1e-10

        return self

    def _kernel_value(self, u: np.ndarray) -> np.ndarray:
        """Evaluate kernel at u."""
        if self.kernel == 'gaussian':
            return np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (self.n_features_ / 2)
        elif self.kernel == 'epanechnikov':
            norms = np.sum(u**2, axis=1)
            return np.where(norms < 1, 0.75 * (1 - norms), 0)
        else:
            raise ValueError(f"Unknown kernel: {self.kernel}")

    def score_samples(self, X_query: np.ndarray) -> np.ndarray:
        """
        Compute density estimates at query points.

        Parameters:
        -----------
        X_query : np.ndarray
            Points to evaluate density at

        Returns:
        --------
        np.ndarray : Density estimates
        """
        n_query = X_query.shape[0]
        densities = np.zeros(n_query)

        for i in range(n_query):
            # Sum the contribution from each data point, each with
            # its own bandwidth
            for j in range(self.n_samples_):
                h = self.bandwidths_[j]
                diff = (X_query[i] - self.X_[j]) / h
                kernel_val = self._kernel_value(diff.reshape(1, -1))[0]
                densities[i] += kernel_val / (h ** self.n_features_)

            densities[i] /= self.n_samples_

        return densities

    def detect_anomalies(self, X_query: np.ndarray = None,
                         contamination: float = 0.05) -> tuple:
        """
        Detect anomalies based on density estimates.

        Parameters:
        -----------
        X_query : np.ndarray, optional
            Points to evaluate; if None, use training data
        contamination : float
            Expected fraction of anomalies

        Returns:
        --------
        tuple : (anomaly_labels, density_scores, threshold)
        """
        if X_query is None:
            X_query = self.X_

        densities = self.score_samples(X_query)

        # Threshold at contamination percentile
        threshold = np.percentile(densities, 100 * contamination)
        anomalies = densities < threshold

        return anomalies, densities, threshold

def compare_kde_approaches(X: np.ndarray, contamination: float = 0.05):
    """Compare fixed bandwidth vs locally adaptive KDE."""
    # Fixed bandwidth KDE (sklearn)
    # Use Scott's rule: h = n^(-1/(d+4)) * sigma
    n, d = X.shape
    bandwidth = n ** (-1 / (d + 4)) * X.std()

    kde_fixed = KernelDensity(bandwidth=bandwidth, kernel='gaussian')
    kde_fixed.fit(X)
    densities_fixed = np.exp(kde_fixed.score_samples(X))

    # Locally adaptive KDE
    kde_adaptive = LocallyAdaptiveKDE(k=10)
    kde_adaptive.fit(X)
    densities_adaptive = kde_adaptive.score_samples(X)

    # Determine anomalies
    thresh_fixed = np.percentile(densities_fixed, 100 * contamination)
    thresh_adaptive = np.percentile(densities_adaptive, 100 * contamination)

    anomalies_fixed = densities_fixed < thresh_fixed
    anomalies_adaptive = densities_adaptive < thresh_adaptive

    return {
        'fixed': {
            'densities': densities_fixed,
            'anomalies': anomalies_fixed,
            'bandwidth': bandwidth
        },
        'adaptive': {
            'densities': densities_adaptive,
            'anomalies': anomalies_adaptive,
            'bandwidths': kde_adaptive.bandwidths_
        }
    }

# Demonstration
np.random.seed(42)

# Multi-density dataset
dense = np.random.randn(150, 2) * 0.3
sparse = np.random.randn(80, 2) * 1.0 + np.array([4, 4])
outliers = np.array([[2, 2], [-1.5, 3], [6, 0], [4, 8]])

X = np.vstack([dense, sparse, outliers])

results = compare_kde_approaches(X, contamination=0.02)

print("KDE Anomaly Detection Comparison:")
print("=" * 50)
print("Fixed Bandwidth KDE:")
print(f"  Bandwidth: {results['fixed']['bandwidth']:.4f}")
print(f"  Anomalies detected: {results['fixed']['anomalies'].sum()}")

print("Adaptive Bandwidth KDE:")
print(f"  Bandwidth range: [{results['adaptive']['bandwidths'].min():.4f}, "
      f"{results['adaptive']['bandwidths'].max():.4f}]")
print(f"  Anomalies detected: {results['adaptive']['anomalies'].sum()}")

# Check if outliers were detected
n_normal = 230
print("Outlier Detection (4 true outliers):")
print(f"  Fixed:    detected {results['fixed']['anomalies'][n_normal:].sum()}/4")
print(f"  Adaptive: detected {results['adaptive']['anomalies'][n_normal:].sum()}/4")
```

Understanding the bias-variance tradeoff is crucial for selecting appropriate density estimation parameters, directly impacting anomaly detection quality.
For any density estimator $\hat{p}(\mathbf{x})$, the mean squared error decomposes as: $$\text{MSE}(\hat{p}) = \text{Bias}^2(\hat{p}) + \text{Var}(\hat{p})$$
Bias: Systematic error—does the expected estimate equal the true density? Variance: Random error—how much does the estimate fluctuate across different samples?
Small k (e.g., 1-5): low bias, high variance — the estimate follows local structure closely but fluctuates strongly with sampling noise.
Large k (e.g., 50+): high bias, low variance — the estimate is stable but smooths over genuine local density variations.
| k Value | Bias | Variance | Anomaly Detection Impact |
|---|---|---|---|
| Very small (1-3) | Very low | Very high | Many false positives; unstable rankings |
| Small (5-10) | Low | High | Sensitive detection; some noise |
| Moderate (15-30) | Moderate | Moderate | Balanced; good default choice |
| Large (50-100) | High | Low | Stable but may miss subtle anomalies |
| Very large (>n/10) | Very high | Very low | Essentially global; not local anymore |
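The variance side of this table is easy to see empirically. The following Monte Carlo sketch (toy data and trial counts are assumptions) estimates the simple k-distance density $1/d_k$ at the origin across many resampled datasets and compares the relative spread for small vs large k:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

def density_at_origin(X: np.ndarray, k: int) -> float:
    # 1/d_k evaluated at the query point [0, 0]
    dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors([[0.0, 0.0]])
    return 1.0 / dists[0, -1]

estimates = {2: [], 50: []}
for _ in range(200):
    X = rng.normal(0, 1, size=(500, 2))  # fresh sample each trial
    for k in estimates:
        estimates[k].append(density_at_origin(X, k))

# Coefficient of variation: std relative to mean, per k
cv = {k: np.std(v) / np.mean(v) for k, v in estimates.items()}
print(cv)  # k=2 fluctuates far more, relative to its mean, than k=50
```

The same experiment with intermediate k values traces out the middle rows of the table: variance falls steadily as k grows, while the estimate increasingly reflects a neighborhood average rather than the density at the point itself.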
Small bandwidth h:
Large bandwidth h:
For KDE with Gaussian kernel in d dimensions: $$\text{Bias}(\hat{p}) = O(h^2)$$ $$\text{Var}(\hat{p}) = O\left(\frac{1}{n \cdot h^d}\right)$$
Optimal bandwidth balances these: $$h_{\text{opt}} = O(n^{-1/(d+4)})$$
This rate underlies the familiar Scott and Silverman bandwidth rules of thumb, widely used in practice.
For k-NN density estimation: $$k_{\text{opt}} = O(n^{4/(d+4)})$$
Note how both depend on dimensionality $d$—as dimension increases, we need more smoothing to combat variance, at the cost of increased bias.
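A quick numeric check of what the $O(n^{-1/(d+4)})$ rate implies (taking the constant and $\sigma$ to be 1 for simplicity):

```python
def opt_bandwidth(n: int, d: int, sigma: float = 1.0) -> float:
    # The O(n^(-1/(d+4))) optimal-rate bandwidth with unit constant
    return sigma * n ** (-1.0 / (d + 4))

# More data permits a smaller bandwidth, but the gain slows sharply with
# dimension: going from 10^3 to 10^6 samples shrinks h by about 4x in 1-D
# but only about 1.6x in 10-D.
for d in (1, 10):
    print(d, opt_bandwidth(10**3, d), opt_bandwidth(10**6, d))
```

In other words, a thousand-fold increase in data buys much less extra resolution in high dimensions — the quantitative face of the curse of dimensionality discussed next.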
In high dimensions, the bias-variance balance becomes increasingly difficult. The volume of a hypersphere grows as r^d, so neighbor distances become more uniform. Local density estimates require exponentially more data to maintain accuracy. Rule of thumb: beyond ~15-20 dimensions, consider dimensionality reduction before density-based anomaly detection.
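The claim that neighbor distances become more uniform can be demonstrated directly; this sketch (sample sizes and dimensions are illustrative assumptions) compares the farthest-to-nearest distance ratio from one query point in low and high dimensions:

```python
import numpy as np

# Distance concentration: as d grows, the nearest and farthest neighbors
# of a point end up at relatively similar distances.
rng = np.random.default_rng(0)
ratios = {}
for d in (2, 100):
    X = rng.uniform(size=(1000, d))
    dist = np.linalg.norm(X - X[0], axis=1)[1:]  # distances from one point
    ratios[d] = dist.max() / dist.min()
print(ratios)  # contrast collapses as d grows
```

When this ratio approaches 1, "k nearest neighbors" carries little information, and density estimates built on those distances lose discriminative power — hence the advice to reduce dimensionality first.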
1. Robustness over Sensitivity: For production systems, slightly higher bias (larger k or h) often beats low bias with high variance. Stable anomaly scores are more actionable than volatile ones.
2. Ensemble Strategies: Combine estimates from multiple k values: $$\bar{\rho}(\mathbf{x}) = \frac{1}{|K|} \sum_{k \in K} \rho_k(\mathbf{x})$$ or use weighted combinations favoring optimal k.
3. Adaptive Selection: Use data-driven methods to select k or h — for example, likelihood cross-validation for KDE bandwidths, or choosing a k at which anomaly rankings stabilize across nearby values.
4. Consider the Use Case: exploratory analysis can tolerate noisier, more sensitive scores (smaller k or h); automated alerting favors stability (larger k or h, or ensembles).
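The ensemble strategy in point 2 can be sketched in a few lines; the k values and toy data here are assumptions for illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def ensemble_knn_density(X: np.ndarray, ks=(5, 10, 20)) -> np.ndarray:
    """Average the simple k-distance density 1/d_k over several k values."""
    nn = NearestNeighbors(n_neighbors=max(ks) + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    distances = distances[:, 1:]  # drop self-distance
    # Column k-1 of the remaining matrix is the distance to the k-th neighbor
    return np.mean([1.0 / (distances[:, k - 1] + 1e-10) for k in ks], axis=0)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)), [[5.0, 5.0]]])
rho = ensemble_knn_density(X)
print(rho.argmin())  # the isolated point (index 100) has the lowest density
```

A single neighbor search at the largest k serves all ensemble members, so the averaging costs essentially nothing beyond the base estimator.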
Beyond basic k-NN and KDE approaches, several advanced techniques improve local density estimation for anomaly detection.
INFLO extends LOF by considering not just k-nearest neighbors but also reverse k-nearest neighbors (points that have the query point as a neighbor):
$$\text{INFLO}_k(\mathbf{x}) = \frac{\frac{1}{|IS_k(\mathbf{x})|} \sum_{\mathbf{y} \in IS_k(\mathbf{x})} \text{den}(\mathbf{y})}{\text{den}(\mathbf{x})}$$
where $IS_k(\mathbf{x}) = N_k(\mathbf{x}) \cup RNN_k(\mathbf{x})$ is the influence set, combining neighbors and reverse neighbors.
Benefit: More symmetric density comparison, better handling of density gradients.
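The influence set itself is straightforward to compute; a minimal sketch (reverse neighbors only, not the full INFLO score, with a tiny 1-D toy dataset as an assumption):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def influence_set(X: np.ndarray, k: int = 3):
    """IS_k(x) = N_k(x) ∪ RNN_k(x): neighbors plus reverse neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbors = [set(row[1:]) for row in idx]    # N_k, excluding self
    reverse = [set() for _ in range(len(X))]
    for i, nbrs in enumerate(neighbors):         # RNN_k: who lists i?
        for j in nbrs:
            reverse[j].add(i)
    return [neighbors[i] | reverse[i] for i in range(len(X))]

X = np.array([[0.0], [0.1], [0.2], [5.0]])
IS = influence_set(X, k=2)
# The isolated point at 5.0 has neighbors but no reverse neighbors,
# while cluster points pick it up in their influence sets:
print(IS[3], IS[1])
```

The asymmetry is visible immediately: the outlier claims cluster points as neighbors, but no cluster point claims it back, which is exactly the signal INFLO exploits.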
LoOP converts density-based outlier scores into calibrated probabilities using statistical reasoning:
$$\text{LoOP}(\mathbf{x}) = \max\left(0, \text{erf}\left(\frac{\text{PLOF}(\mathbf{x})}{\sqrt{2} \cdot \text{nPLOF}}\right)\right)$$
where PLOF is a probabilistic local outlier factor and nPLOF is a normalization factor.
Benefit: Scores are bounded in [0, 1] with probabilistic interpretation.
ABOD uses the variance of angles between a point and pairs of other points:
$$\text{ABOD}(\mathbf{x}) = \text{Var}_{\mathbf{y}, \mathbf{z} \in N_k(\mathbf{x})}\left[\frac{\langle \mathbf{y} - \mathbf{x}, \mathbf{z} - \mathbf{x} \rangle}{\|\mathbf{y} - \mathbf{x}\|^2 \|\mathbf{z} - \mathbf{x}\|^2}\right]$$
Benefit: More robust in high dimensions where distance-based measures degrade.
In high-dimensional data, outliers may only be anomalous in certain feature subspaces:
SOD (Subspace Outlier Degree): Projects data onto subspace spanned by k-nearest neighbors, measures deviation.
HiCS (High Contrast Subspaces): Searches for subspaces where points have unusual distributions.
LSOD (Locally Subspace Outlier Detection): Combines local density with subspace analysis.
Benefit: Can detect anomalies that are masked when viewing all dimensions.
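A toy illustration of that masking effect (this is not SOD or HiCS, just a minimal demonstration of the principle; the data construction is an assumption): a point can have unremarkable values in every single dimension yet be a clear outlier within a 2-D subspace.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
t = rng.uniform(-1, 1, size=200)
X = np.column_stack([t,
                     t + rng.normal(0, 0.05, 200),  # strongly correlated pair
                     rng.normal(0, 1, 200)])        # irrelevant noise dimension
# Normal marginals in every coordinate, but it breaks the correlation:
outlier = np.array([[0.5, -0.5, 0.0]])
X = np.vstack([X, outlier])

def knn_outlier_score(Z: np.ndarray, k: int = 10) -> np.ndarray:
    # Mean k-NN distance as a simple outlier score
    d, _ = NearestNeighbors(n_neighbors=k + 1).fit(Z).kneighbors(Z)
    return d[:, 1:].mean(axis=1)

full = knn_outlier_score(X)        # all 3 dims: noise dim may dilute the signal
sub = knn_outlier_score(X[:, :2])  # correlated subspace: the outlier stands out
print(sub.argmax())  # index 200 — the planted outlier
```

In the subspace restricted to the correlated pair, the planted point has by far the largest neighbor distances; whether it still tops the ranking in the full space depends on how much the noise dimension dilutes the contrast.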
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.special import erf

class LocalOutlierProbability:
    """
    Local Outlier Probabilities (LoOP) implementation.

    Provides calibrated outlier probabilities based on local density.
    """

    def __init__(self, k: int = 20, lambda_factor: float = 3.0):
        """
        Parameters:
        -----------
        k : int
            Number of neighbors
        lambda_factor : float
            Controls extent (typically 2-4)
        """
        self.k = k
        self.lambda_factor = lambda_factor

    def fit_predict(self, X: np.ndarray) -> np.ndarray:
        """
        Compute LoOP scores for all points.

        Returns:
        --------
        np.ndarray : Outlier probabilities in [0, 1]
        """
        n_samples = X.shape[0]

        # Find k-nearest neighbors
        nn = NearestNeighbors(n_neighbors=self.k + 1)
        nn.fit(X)
        distances, indices = nn.kneighbors(X)
        distances = distances[:, 1:]  # Exclude self
        indices = indices[:, 1:]

        # Step 1: Compute probabilistic distance (pdist)
        # pdist = lambda * sqrt(mean(d^2))
        pdist = self.lambda_factor * np.sqrt(np.mean(distances**2, axis=1))

        # Step 2: Compute PLOF (Probabilistic Local Outlier Factor)
        plof = np.zeros(n_samples)
        for i in range(n_samples):
            neighbors = indices[i]
            neighbor_pdist = pdist[neighbors].mean()
            plof[i] = (pdist[i] / (neighbor_pdist + 1e-10)) - 1

        # Step 3: Compute nPLOF (normalization factor)
        nplof = self.lambda_factor * np.sqrt(np.mean(plof**2))
        if nplof < 1e-10:
            nplof = 1e-10

        # Step 4: Compute LoOP
        loop = np.maximum(0, erf(plof / (np.sqrt(2) * nplof)))

        return loop

class AngleBasedOutlierDetector:
    """
    Fast Angle-Based Outlier Detection (FastABOD).

    Uses variance of angles to neighbors for outlier scoring.
    """

    def __init__(self, k: int = 20):
        """
        Parameters:
        -----------
        k : int
            Number of neighbors for angle computation
        """
        self.k = k

    def fit_predict(self, X: np.ndarray) -> np.ndarray:
        """
        Compute ABOD scores for all points.

        Lower scores indicate higher outlierness.
        """
        n_samples = X.shape[0]

        # Find k-nearest neighbors
        nn = NearestNeighbors(n_neighbors=self.k + 1)
        nn.fit(X)
        _, indices = nn.kneighbors(X)
        indices = indices[:, 1:]  # Exclude self

        scores = np.zeros(n_samples)

        for i in range(n_samples):
            neighbors = indices[i]

            # Compute angles between all pairs of neighbors
            angle_vars = []
            for j in range(len(neighbors)):
                for m in range(j + 1, len(neighbors)):
                    v1 = X[neighbors[j]] - X[i]
                    v2 = X[neighbors[m]] - X[i]

                    norm1 = np.linalg.norm(v1)
                    norm2 = np.linalg.norm(v2)

                    if norm1 > 1e-10 and norm2 > 1e-10:
                        # Weighted angle factor (ABOD formula)
                        dot = np.dot(v1, v2)
                        weighted_angle = dot / (norm1**2 * norm2**2)
                        angle_vars.append(weighted_angle)

            if len(angle_vars) > 1:
                scores[i] = np.var(angle_vars)
            else:
                scores[i] = 0  # No variance computable

        return scores

# Demonstration: Compare methods
np.random.seed(42)

# Create challenging dataset
cluster1 = np.random.randn(100, 2) * 0.5
cluster2 = np.random.randn(80, 2) * 0.3 + np.array([3, 3])
# Outlier between clusters (local outlier)
local_outlier = np.array([[1.5, 1.5]])
# Global outlier
global_outlier = np.array([[6, 6]])

X = np.vstack([cluster1, cluster2, local_outlier, global_outlier])

# Compare methods
from sklearn.neighbors import LocalOutlierFactor

# 1. LOF (sklearn)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof_labels = lof.fit_predict(X)
lof_scores = -lof.negative_outlier_factor_  # Convert to positive scores

# 2. LoOP
loop = LocalOutlierProbability(k=20)
loop_scores = loop.fit_predict(X)

# 3. FastABOD
abod = AngleBasedOutlierDetector(k=20)
abod_scores = abod.fit_predict(X)

print("Method Comparison on Test Dataset:")
print("=" * 60)
print(f"{'Method':<15} {'Local Outlier':<15} {'Global Outlier':<15}")
print("-" * 60)
print(f"{'LOF':<15} {lof_scores[-2]:<15.4f} {lof_scores[-1]:<15.4f}")
print(f"{'LoOP':<15} {loop_scores[-2]:<15.4f} {loop_scores[-1]:<15.4f}")
print(f"{'ABOD (inv)':<15} {1/(abod_scores[-2]+1e-6):<15.4f} {1/(abod_scores[-1]+1e-6):<15.4f}")

print("Note: For LOF and LoOP, higher = more anomalous")
print("      For ABOD, lower variance = more anomalous (showing 1/ABOD)")
```

Local density estimation is the engine powering density-based anomaly detection. The choice of estimation technique significantly impacts detection quality.
k-NN Density Variants: From simple k-distance to weighted averages, each variant offers different bias-variance characteristics.
Local Reachability Density (LRD): The LOF innovation that smooths density estimates using reachability distances, providing stable estimates at cluster boundaries.
Kernel Density Estimation: Theoretically principled but requires careful bandwidth selection; locally adaptive KDE handles varying densities.
Bias-Variance Tradeoff: Small k/h means sensitive but noisy estimates; large k/h means stable but potentially biased estimates.
Advanced Methods: LoOP, INFLO, ABOD, and subspace methods extend basic approaches for specific scenarios.
Effective local density estimation balances sensitivity and stability. The reachability distance concept from LOF exemplifies this balance: by capping distances at the neighbor's k-distance, it prevents noise from cluster edges while maintaining sensitivity to true outliers.
With local density estimation techniques mastered, we now extend to multivariate settings:
Next Page (Multivariate Density): How density estimation scales to high-dimensional data, dealing with the curse of dimensionality, and multivariate distribution-based anomaly detection.
Final Page (Scalability): Techniques for applying density-based methods to large-scale datasets efficiently.
The density paradigm continues to provide intuitive yet powerful tools for identifying anomalies across diverse data types and scales.