Real-world anomaly detection problems rarely involve just two or three features. Financial fraud detection might consider hundreds of transaction attributes. Network intrusion detection examines dozens of protocol features. Healthcare anomaly detection processes thousands of gene expressions or sensor readings.
When density-based methods move from low-dimensional toy examples to these high-dimensional realities, fundamental challenges emerge. Distances become less meaningful, volumes explode exponentially, and the intuitions that guide us in 2D or 3D spaces fail spectacularly.
This page confronts these challenges directly, exploring both the theoretical foundations of multivariate density estimation and the practical techniques that make density-based anomaly detection viable in high dimensions.
This page covers: (1) The curse of dimensionality and its impact on density estimation; (2) Multivariate Gaussian density and Mahalanobis distance; (3) Covariance estimation challenges and robust alternatives; (4) Subspace and projected density methods; (5) Feature selection and dimensionality reduction for density-based detection; and (6) Practical strategies for high-dimensional anomaly detection.
The curse of dimensionality—coined by Richard Bellman—describes how algorithms that work well in low dimensions often fail catastrophically as dimensionality increases. For density-based anomaly detection, this curse manifests in several devastating ways.
Consider the volume of a d-dimensional hypersphere of radius r: $$V_d(r) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} r^d$$
The ratio of the volume of a hypersphere to its enclosing hypercube: $$\frac{V_{\text{sphere}}}{V_{\text{cube}}} = \frac{\pi^{d/2}}{\Gamma(d/2 + 1) \cdot 2^d}$$
As d increases, the volume of a hypersphere becomes negligibly small compared to its enclosing cube: most of the cube's volume concentrates in its 'corners', regions that distance-based neighborhoods centered on points cannot reach efficiently.
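A few lines of Python make the collapse concrete, evaluating the formula above for a unit-radius ball inside the cube $[-1, 1]^d$:

```python
import math

def sphere_to_cube_ratio(d: int) -> float:
    """Volume of the unit d-ball divided by that of its enclosing cube [-1, 1]^d."""
    return math.pi ** (d / 2) / (math.gamma(d / 2 + 1) * 2 ** d)

for d in [2, 5, 10, 20]:
    print(f"d = {d:2d}: sphere/cube volume ratio = {sphere_to_cube_ratio(d):.2e}")
```

At d = 2 the ratio is π/4 ≈ 0.785; by d = 20 it has dropped below 10⁻⁷.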
In high dimensions, the ratio of distances between points converges to 1. For random points: max(d_ij) / min(d_ij) → 1 as d → ∞. This means all points appear approximately equidistant, making it nearly impossible to distinguish neighbors from non-neighbors based on distance alone.
For n random points uniformly distributed in [0, 1]^d:
$$\lim_{d \to \infty} \mathbb{E}\left[\frac{D_{\max}}{D_{\min}}\right] = 1$$
where $D_{\max}$ and $D_{\min}$ are maximum and minimum pairwise distances.
More precisely, for fixed n and growing d: $$\frac{D_{\max} - D_{\min}}{D_{\min}} = O(1/\sqrt{d})$$
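A quick simulation (a sketch using uniform random points and scipy's `pdist`) shows the relative contrast collapsing as d grows:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

def relative_contrast(n: int, d: int) -> float:
    """(D_max - D_min) / D_min over all pairwise distances of n uniform points in [0, 1]^d."""
    X = rng.uniform(size=(n, d))
    D = pdist(X)  # condensed vector of pairwise Euclidean distances
    return (D.max() - D.min()) / D.min()

for d in [2, 10, 100, 1000]:
    print(f"d = {d:4d}: relative contrast = {relative_contrast(200, d):.3f}")
```

In 2 dimensions the farthest pair is orders of magnitude more distant than the closest pair; in 1000 dimensions all pairs are within a small factor of each other.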
This concentration means that nearest-neighbor structure loses meaning: the k-th nearest neighbor is barely closer than a random point, so k-NN distances, LOF density ratios, and KDE bandwidths all lose their discriminative power.
To maintain in d dimensions the same statistical accuracy that a sample of size n achieves in the original dimensionality, we need: $$n' = n^{d/d_0}$$
where d₀ is the original dimensionality. This exponential growth is called the sample size curse.
For practical density estimation, we need sample density: $$\rho = \frac{n}{V} = \frac{n}{L^d}$$
As d grows, for fixed n, this density drops exponentially, making any local neighborhood estimation statistically unsound.
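A back-of-the-envelope check of the sample-size curse, assuming a baseline of n = 100 points in d₀ = 2 dimensions:

```python
# Baseline: n = 100 samples give adequate density in d0 = 2 dimensions
n, d0 = 100, 2
for d in [2, 5, 10, 20]:
    n_required = n ** (d / d0)  # n' = n^(d/d0)
    print(f"d = {d:2d}: required sample size n' = {n_required:.0e}")
```

Already at d = 10 the required sample size is 10¹⁰; at d = 20 it is 10²⁰, far beyond any realistic dataset.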
| Dimension d | Distance Variability | LOF Effectiveness | Required Sample Size |
|---|---|---|---|
| 2-5 | High (good) | Excellent | Hundreds |
| 6-15 | Moderate | Good | Thousands |
| 16-30 | Low | Degraded | Tens of thousands |
| 31-100 | Very Low | Poor | Millions+ |
| > 100 | Near-zero | Unreliable | Impractical |
The multivariate Gaussian distribution provides a principled parametric approach to density-based anomaly detection, particularly effective when data approximately follows a normal distribution.
For d-dimensional data, the multivariate Gaussian density: $$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\mathbf{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
where $\boldsymbol{\mu}$ is the mean vector, $\mathbf{\Sigma}$ is the $d \times d$ covariance matrix, and $|\mathbf{\Sigma}|$ is its determinant.
The exponent in the Gaussian density defines the Mahalanobis distance: $$D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}$$
Mahalanobis distance accounts for both the scale of each feature (via the variances on the diagonal of $\mathbf{\Sigma}$) and the correlations between features (via the off-diagonal terms).
Euclidean distance treats all features equally and assumes independence. Mahalanobis distance is equivalent to Euclidean distance after whitening the data (transforming to uncorrelated, unit-variance features). For anomaly detection, Mahalanobis is almost always preferable when covariance can be reliably estimated.
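The whitening equivalence is easy to verify numerically. The sketch below (synthetic correlated 2D data, illustrative parameters) whitens via a Cholesky factor of the precision matrix:

```python
import numpy as np

rng = np.random.default_rng(42)

# Correlated 2D data
cov = np.array([[2.0, 1.2], [1.2, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=500)

mean = X.mean(axis=0)
precision = np.linalg.inv(np.cov(X, rowvar=False))

# Mahalanobis distance computed directly from the quadratic form
centered = X - mean
d_maha = np.sqrt(np.einsum('ij,jk,ik->i', centered, precision, centered))

# Whitening: right-multiply by a Cholesky factor L of the precision matrix,
# so that ||c L||^2 = c (L L^T) c^T = c Sigma^{-1} c^T
L = np.linalg.cholesky(precision)
d_eucl = np.linalg.norm(centered @ L, axis=1)

print("Max discrepancy:", np.abs(d_maha - d_eucl).max())  # numerically zero
```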
For Gaussian-distributed data, the squared Mahalanobis distance $D_M^2(\mathbf{x})$ follows a chi-squared distribution with d degrees of freedom: $$D_M^2(\mathbf{x}) \sim \chi^2_d$$
This provides a principled threshold for anomaly detection: $$\text{IsAnomaly}(\mathbf{x}) = \mathbb{1}\left[D_M^2(\mathbf{x}) > \chi^2_{d, 1-\alpha}\right]$$
where $\chi^2_{d, 1-\alpha}$ is the $(1-\alpha)$ quantile of the chi-squared distribution.
For example, with α = 0.01 and d = 10, the threshold is $\chi^2_{10,\, 0.99} \approx 23.21$: any point whose squared Mahalanobis distance exceeds 23.21 is flagged.
This gives a probabilistic interpretation: an anomaly has less than 1% probability under the fitted Gaussian model.
```python
import numpy as np
import warnings
from scipy import stats
from scipy.linalg import inv
from sklearn.covariance import EmpiricalCovariance, MinCovDet


class MahalanobisAnomalyDetector:
    """
    Anomaly detection using Mahalanobis distance.
    Supports both standard and robust covariance estimation.
    """

    def __init__(self, contamination: float = 0.05, robust: bool = True):
        """
        Parameters
        ----------
        contamination : float
            Expected fraction of anomalies (for threshold setting)
        robust : bool
            If True, use Minimum Covariance Determinant (robust to outliers)
        """
        self.contamination = contamination
        self.robust = robust

    def fit(self, X: np.ndarray):
        """Fit the model: estimate mean and covariance."""
        self.n_samples_, self.n_features_ = X.shape

        if self.robust and self.n_samples_ > self.n_features_ + 1:
            # Robust covariance estimation (MCD)
            try:
                self.cov_estimator_ = MinCovDet().fit(X)
                self.mean_ = self.cov_estimator_.location_
                self.cov_ = self.cov_estimator_.covariance_
            except ValueError:
                # Fall back to empirical covariance if MCD fails
                warnings.warn("MCD failed, using empirical covariance")
                self.cov_estimator_ = EmpiricalCovariance().fit(X)
                self.mean_ = self.cov_estimator_.location_
                self.cov_ = self.cov_estimator_.covariance_
        else:
            # Standard covariance estimation
            self.mean_ = np.mean(X, axis=0)
            self.cov_ = np.cov(X, rowvar=False)
            if self.n_features_ == 1:
                self.cov_ = np.array([[self.cov_]])

        # Precision matrix (inverse covariance), with a small
        # regularization term for numerical stability
        self.cov_reg_ = self.cov_ + 1e-6 * np.eye(self.n_features_)
        self.precision_ = inv(self.cov_reg_)

        # Chi-squared threshold
        self.threshold_ = stats.chi2.ppf(1 - self.contamination, df=self.n_features_)
        return self

    def mahalanobis_distance(self, X: np.ndarray) -> np.ndarray:
        """Compute Mahalanobis distance for each sample."""
        centered = X - self.mean_
        # D^2 = (x - mu)^T * Sigma^{-1} * (x - mu)
        left = centered @ self.precision_
        d_squared = np.sum(left * centered, axis=1)
        return np.sqrt(np.maximum(d_squared, 0))  # Ensure non-negative

    def score_samples(self, X: np.ndarray) -> np.ndarray:
        """
        Compute anomaly scores (higher = more anomalous).
        Returns squared Mahalanobis distance.
        """
        d = self.mahalanobis_distance(X)
        return d ** 2

    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Predict anomaly labels.

        Returns
        -------
        np.ndarray : 1 for anomalies, 0 for normal
        """
        scores = self.score_samples(X)
        return (scores > self.threshold_).astype(int)

    def fit_predict(self, X: np.ndarray) -> np.ndarray:
        """Fit and predict."""
        return self.fit(X).predict(X)

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """
        Compute probability of being an anomaly:
        P(anomaly) = P(chi2_d > D^2)
        """
        scores = self.score_samples(X)
        # Survival function of the chi-squared distribution
        return stats.chi2.sf(scores, df=self.n_features_)


def demonstrate_mahalanobis():
    """Demonstrate Mahalanobis-based anomaly detection."""
    np.random.seed(42)

    # Generate correlated normal data
    n, d = 500, 5
    L = np.random.randn(d, d)
    cov_true = L @ L.T + 0.5 * np.eye(d)  # Covariance with correlations
    mean_true = np.zeros(d)
    X_normal = np.random.multivariate_normal(mean_true, cov_true, n)

    # Type 1 anomalies: global outliers (far from center)
    outliers_global = np.random.multivariate_normal(mean_true + 5, 0.5 * np.eye(d), 10)

    # Type 2 anomalies: correlation-violating outliers.
    # Close in Euclidean distance but far in Mahalanobis distance:
    # move along the direction of smallest variance.
    eigvals, eigvecs = np.linalg.eigh(cov_true)
    small_eigvec = eigvecs[:, 0]  # Direction of smallest variance
    outliers_corr = mean_true + 2 * small_eigvec * np.random.randn(10, 1).repeat(d, axis=1)

    X = np.vstack([X_normal, outliers_global, outliers_corr])

    # Fit detector
    detector = MahalanobisAnomalyDetector(contamination=0.05, robust=True)
    detector.fit(X)
    predictions = detector.predict(X)
    scores = detector.score_samples(X)

    print("Mahalanobis Anomaly Detection Results:")
    print("=" * 50)
    print(f"Threshold (chi2, df={d}): {detector.threshold_:.2f}")
    print(f"Normal data (n={n}):")
    print(f"  False positives: {predictions[:n].sum()}")
    print(f"  Score range: [{scores[:n].min():.2f}, {scores[:n].max():.2f}]")
    print("Global outliers (n=10):")
    print(f"  Detected: {predictions[n:n+10].sum()}/10")
    print(f"  Score range: [{scores[n:n+10].min():.2f}, {scores[n:n+10].max():.2f}]")
    print("Correlation-violating outliers (n=10):")
    print(f"  Detected: {predictions[n+10:].sum()}/10")
    print(f"  Score range: [{scores[n+10:].min():.2f}, {scores[n+10:].max():.2f}]")

    # Compare with Euclidean distance
    euclidean_dists = np.linalg.norm(X - X[:n].mean(axis=0), axis=1)
    print("Euclidean vs Mahalanobis for correlation-violating outliers:")
    print(f"  Euclidean ranks:   {np.argsort(euclidean_dists)[::-1][:5]}")
    print(f"  Mahalanobis ranks: {np.argsort(scores)[::-1][:5]}")


demonstrate_mahalanobis()
```

The Achilles' heel of Mahalanobis-based anomaly detection is covariance estimation. The standard empirical covariance is highly sensitive to outliers: the very things we're trying to detect!
When outliers contaminate the training data, they inflate the estimated variances and distort the correlation structure. The fitted ellipsoid stretches toward the outliers, shrinking their Mahalanobis distances and masking the very points that caused the distortion.
This circular problem—needing to know outliers to estimate covariance, but needing covariance to find outliers—requires robust estimation techniques.
MCD finds the subset of h points (typically h = ⌈(n+d+1)/2⌉) whose covariance matrix has the smallest determinant:
$$\hat{\mathbf{\Sigma}}_{\text{MCD}} = \arg\min_{|H| = h} |\mathbf{\Sigma}_H|$$
The intuition: the subset with minimum covariance determinant best represents the 'core' of the data, excluding outliers.
Properties of MCD: a breakdown point of roughly 50% (it tolerates up to about half the data being contaminated), affine equivariance, and a fast approximate algorithm (FAST-MCD) that makes it practical for moderate n and d.
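A minimal sketch of MCD's robustness using sklearn's `MinCovDet` (the contamination level and cluster locations here are illustrative):

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.default_rng(0)

# 450 clean correlated points plus 50 gross outliers (10% contamination)
true_cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X_clean = rng.multivariate_normal([0.0, 0.0], true_cov, size=450)
X_out = rng.multivariate_normal([8.0, -8.0], np.eye(2), size=50)
X = np.vstack([X_clean, X_out])

emp = EmpiricalCovariance().fit(X)
mcd = MinCovDet(random_state=0).fit(X)

# Frobenius-norm error of each estimate against the true covariance
err_emp = np.linalg.norm(emp.covariance_ - true_cov)
err_mcd = np.linalg.norm(mcd.covariance_ - true_cov)
print(f"Empirical covariance error: {err_emp:.2f}")
print(f"MCD covariance error:       {err_mcd:.2f}")
```

The empirical estimate is dragged far from the truth by the outlier cluster, while MCD recovers the clean core's covariance.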
Minimum Volume Ellipsoid (MVE): Finds the ellipsoid of minimum volume containing h points. Similar concept to MCD but with different optimization objective.
S-Estimators: Minimize a robust scale estimate: $$\hat{\Sigma}_S = \arg\min_{\Sigma} \det(\Sigma) \text{ s.t. } \frac{1}{n}\sum_i \rho(d_M(\mathbf{x}_i; \Sigma)) = k$$ where ρ is a robust loss function (e.g., Tukey's biweight).
M-Estimators: Iteratively reweight observations based on their Mahalanobis distance: $$\hat{\Sigma}^{(t+1)} = \frac{\sum_i w(d_M^{(t)}(\mathbf{x}_i)) (\mathbf{x}_i - \hat{\mu})(\mathbf{x}_i - \hat{\mu})^T}{\sum_i w(d_M^{(t)}(\mathbf{x}_i))}$$ where w(d) downweights points with large distances.
Shrinkage Estimators: Regularize towards a simpler covariance structure: $$\hat{\Sigma}_{\text{shrink}} = (1-\alpha)\, \hat{\Sigma}_{\text{sample}} + \alpha \cdot \text{diag}(\hat{\Sigma}_{\text{sample}})$$
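sklearn's `LedoitWolf` implements a closely related shrinkage (towards a scaled identity rather than the diagonal, with the coefficient α chosen analytically). A small sketch in the d ≈ n regime where the sample covariance is nearly singular:

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(1)

# High-dimensional, small-sample regime: d close to n
n, d = 60, 50
X = rng.standard_normal((n, d))

sample_cov = np.cov(X, rowvar=False)
lw = LedoitWolf().fit(X)

# The sample covariance is ill-conditioned here; shrinkage restores conditioning
print(f"Sample covariance condition number: {np.linalg.cond(sample_cov):.2e}")
print(f"Ledoit-Wolf condition number:       {np.linalg.cond(lw.covariance_):.2e}")
print(f"Estimated shrinkage coefficient:    {lw.shrinkage_:.3f}")
```

A well-conditioned covariance matters directly for Mahalanobis distance, since the precision matrix must be inverted.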
| Estimator | Breakdown Point | Efficiency | Computational Cost | Best For |
|---|---|---|---|---|
| Empirical | 0% | 100% | O(nd²) | Clean data only |
| MCD | ~50% | ~80% | O(n²d²) naive | General robust use |
| MVE | ~50% | Lower | O(n²d²) | Theoretical interest |
| M-Estimator | Variable | High | O(nd² × iters) | Known contamination |
| Shrinkage | 0% | Variable | O(nd²) | High-dimensional, small n |
For most anomaly detection applications: (1) If you trust your training data is mostly clean, use empirical covariance with regularization; (2) If contamination is suspected, use MCD (available in sklearn.covariance.MinCovDet); (3) For high-dimensional data (d > n/2), use shrinkage estimators (sklearn.covariance.LedoitWolf or ShrunkCovariance).
When full-dimensional density estimation fails due to the curse of dimensionality, subspace methods offer a powerful alternative: detect anomalies in lower-dimensional projections where density estimation remains tractable.
In high-dimensional data, anomalies may only be anomalous in certain feature subsets. A point could appear normal when considering all features but be clearly anomalous in a specific 2D or 3D projection.
Example: A fraudulent transaction might have normal amounts and normal times, but the combination of <location, merchant_category> is rare.
Inspired by ensemble methods, feature bagging repeatedly samples feature subsets and applies anomaly detection in each: draw a random subset of features (commonly between d/2 and d−1 of them), run a base detector such as LOF in that subspace, and aggregate the resulting scores across all rounds (e.g., by averaging or taking the maximum).
Benefits include lower effective dimensionality per detector, diversity across ensemble members, and robustness: an anomaly visible in any sampled subspace contributes to the aggregate score.
HiCS searches for subspaces where a point deviates most from the local distribution:
Contrast Function: $$\text{contrast}(\mathbf{x}, S) = \frac{p_S(\mathbf{x}|N(\mathbf{x}))}{p_S(\mathbf{x})}$$
where $p_S$ is the marginal distribution in subspace S, and $N(\mathbf{x})$ is the local neighborhood.
Algorithm sketch: (1) generate candidate subspaces, starting from 2D pairs; (2) test each for high contrast by comparing marginal and conditional distributions; (3) extend high-contrast subspaces to higher dimensions, pruning the rest; (4) run an outlier detector (e.g., LOF) in the selected subspaces and aggregate the scores.
The key insight: Instead of searching all 2^d subspaces (exponential), HiCS uses statistical tests to prune uninteresting subspaces.
SOD leverages the observation that outliers often deviate in low-variance subspaces:
$$\text{SOD}(\mathbf{x}) = \frac{1}{|S|} \sum_{j \in S} (x_j - \tilde{x}_j)^2$$
where $\tilde{x}_j$ is the projection onto the neighbor-defined subspace.
Points far from their neighbors' subspace are anomalies—they don't 'fit' the local pattern.
```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors
from sklearn.decomposition import PCA


class FeatureBaggingDetector:
    """
    Feature Bagging for outlier detection in high dimensions.
    Samples random feature subsets and aggregates LOF scores.
    """

    def __init__(self, n_estimators: int = 50, subspace_size: int = None,
                 k: int = 20, contamination: float = 0.05):
        """
        Parameters
        ----------
        n_estimators : int
            Number of subspace samples
        subspace_size : int
            Number of features per subspace (default: d/2)
        k : int
            Number of neighbors for LOF
        contamination : float
            Expected anomaly fraction
        """
        self.n_estimators = n_estimators
        self.subspace_size = subspace_size
        self.k = k
        self.contamination = contamination

    def fit_predict(self, X: np.ndarray) -> tuple:
        """
        Fit and return anomaly labels and scores.

        Returns
        -------
        tuple : (labels, aggregated_scores)
        """
        n_samples, n_features = X.shape

        # Default subspace size
        subspace_dim = self.subspace_size or max(2, n_features // 2)
        subspace_dim = min(subspace_dim, n_features)

        all_scores = np.zeros((self.n_estimators, n_samples))
        for i in range(self.n_estimators):
            # Random feature subset
            features = np.random.choice(n_features, size=subspace_dim, replace=False)
            X_sub = X[:, features]

            # Apply LOF in the subspace
            lof = LocalOutlierFactor(n_neighbors=self.k, contamination=self.contamination)
            lof.fit(X_sub)
            # Convert negative outlier factor to positive scores
            all_scores[i] = -lof.negative_outlier_factor_

        # Aggregate across subspaces
        aggregated = np.mean(all_scores, axis=0)

        # Threshold at the contamination percentile
        threshold = np.percentile(aggregated, 100 * (1 - self.contamination))
        labels = (aggregated > threshold).astype(int)
        return labels, aggregated


class SubspaceOutlierDegree:
    """
    SOD: detects outliers that deviate from their neighbors' subspace.
    """

    def __init__(self, k: int = 20, alpha: float = 0.8):
        """
        Parameters
        ----------
        k : int
            Number of neighbors
        alpha : float
            Fraction of variance for subspace selection
        """
        self.k = k
        self.alpha = alpha

    def fit_predict(self, X: np.ndarray) -> np.ndarray:
        """
        Compute SOD scores for all points.

        Returns
        -------
        np.ndarray : Outlier scores (higher = more anomalous)
        """
        n_samples, n_features = X.shape
        scores = np.zeros(n_samples)

        # Find k-nearest neighbors
        nn = NearestNeighbors(n_neighbors=self.k + 1)
        nn.fit(X)
        _, indices = nn.kneighbors(X)
        neighbor_indices = indices[:, 1:]  # Exclude self

        for i in range(n_samples):
            # Center the neighbors and find their subspace via PCA
            neighbors = X[neighbor_indices[i]]
            neighbor_center = neighbors.mean(axis=0)
            neighbors_centered = neighbors - neighbor_center
            pca = PCA()
            pca.fit(neighbors_centered)

            # Subspace dimension: cumulative variance >= alpha
            cumvar = np.cumsum(pca.explained_variance_ratio_)
            n_components = np.searchsorted(cumvar, self.alpha) + 1
            n_components = min(n_components, n_features)

            # Project the point onto the subspace; measure reconstruction error
            point_centered = X[i] - neighbor_center
            projection = pca.components_[:n_components].T @ (
                pca.components_[:n_components] @ point_centered)
            reconstruction_error = np.linalg.norm(point_centered - projection)

            # SOD score = reconstruction error / subspace dimensionality
            scores[i] = reconstruction_error / max(1, n_components)

        return scores


# Demonstration
np.random.seed(42)

# High-dimensional data with subspace outliers
n_normal = 300
d = 50

# Normal data: correlations in the first 10 dimensions, noise elsewhere
latent = np.random.randn(n_normal, 5)
signal_dims = latent @ np.random.randn(5, 10)  # First 10 dims from latent factors
noise_dims = np.random.randn(n_normal, 40) * 0.5  # Rest is noise
X_normal = np.hstack([signal_dims, noise_dims])

# Subspace outliers: anomalous in the first 10 dims, normal in the rest
anomaly_latent = np.random.randn(10, 5) + 5  # Shifted latent factors
anomaly_signal = anomaly_latent @ np.random.randn(5, 10)
anomaly_noise = np.random.randn(10, 40) * 0.5  # Normal noise
outliers = np.hstack([anomaly_signal, anomaly_noise])

X = np.vstack([X_normal, outliers])

# Compare methods
print("High-Dimensional Subspace Anomaly Detection:")
print("=" * 60)
print(f"Data: {X.shape[0]} samples, {X.shape[1]} dimensions")
print("True anomalies: 10")

# 1. Standard LOF (full dimensions)
lof_full = LocalOutlierFactor(n_neighbors=20)
lof_full.fit(X)
lof_scores = -lof_full.negative_outlier_factor_

# 2. Feature Bagging
fb = FeatureBaggingDetector(n_estimators=50, subspace_size=10)
fb_labels, fb_scores = fb.fit_predict(X)

# 3. SOD
sod = SubspaceOutlierDegree(k=20, alpha=0.8)
sod_scores = sod.fit_predict(X)

# Evaluate
def top_k_precision(scores, n_anomalies, k):
    top_k = np.argsort(scores)[-k:]
    true_labels = np.zeros(len(scores))
    true_labels[-n_anomalies:] = 1
    return true_labels[top_k].sum() / k

print("Top-10 Precision (detecting 10 anomalies):")
print(f"  Full-dim LOF:    {top_k_precision(lof_scores, 10, 10):.1%}")
print(f"  Feature Bagging: {top_k_precision(fb_scores, 10, 10):.1%}")
print(f"  SOD:             {top_k_precision(sod_scores, 10, 10):.1%}")

print("Top-20 Precision:")
print(f"  Full-dim LOF:    {top_k_precision(lof_scores, 10, 20):.1%}")
print(f"  Feature Bagging: {top_k_precision(fb_scores, 10, 20):.1%}")
print(f"  SOD:             {top_k_precision(sod_scores, 10, 20):.1%}")
```

Rather than working in subspaces, another approach is to reduce dimensionality first, then apply density-based detection in the reduced space.
PCA projects data onto the directions of maximum variance; a density-based detector is then run on the top-k component scores, where estimation remains tractable.
Caveat: PCA preserves variance, not necessarily anomaly structure. An anomaly in a low-variance direction may be lost.
PCA reconstruction error itself can indicate anomalies: $$\text{error}(\mathbf{x}) = \|\mathbf{x} - \mathbf{P}_k \mathbf{x}\|^2$$
where $\mathbf{P}_k$ projects onto top k principal components.
Insight: Anomalies often have high reconstruction error because they don't conform to the learned principal subspace. This is the basis for autoencoder-based anomaly detection in deep learning.
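A minimal sketch of reconstruction-error scoring, using synthetic data that lies near a known low-dimensional subspace (the dimensions and noise level are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)

# Normal data lies near a 3-dimensional subspace of a 20-dimensional space
latent = rng.standard_normal((500, 3))
W = rng.standard_normal((3, 20))
X_normal = latent @ W + 0.1 * rng.standard_normal((500, 20))

# Anomalies are generated off that subspace
X_anom = 3.0 * rng.standard_normal((10, 20))
X = np.vstack([X_normal, X_anom])

# Fit the principal subspace on (assumed mostly clean) normal data
pca = PCA(n_components=3).fit(X_normal)
X_recon = pca.inverse_transform(pca.transform(X))
errors = np.linalg.norm(X - X_recon, axis=1) ** 2

print(f"Mean reconstruction error, normal:    {errors[:500].mean():.2f}")
print(f"Mean reconstruction error, anomalous: {errors[500:].mean():.2f}")
```

Normal points reconstruct almost perfectly from 3 components; the off-subspace anomalies leave large residuals.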
PCA is unsupervised—it finds directions of maximum variance regardless of anomaly labels. If labeled anomalies are available (rare but valuable), supervised dimensionality reduction (e.g., LDA, supervised autoencoders) can find projections that better separate normal from anomalous points.
The Johnson-Lindenstrauss lemma states that random projections approximately preserve distances:
For $k = O\!\left(\frac{\log n}{\epsilon^2}\right)$, a random projection to k dimensions preserves all pairwise distances within a factor of $(1 \pm \epsilon)$ with high probability.
For anomaly detection, this means we can project to $k \ll d$ dimensions, run a density-based detector in the projected space, and still trust that large distances remain large.
Benefits: projection costs only O(ndk), requires no training, and different random seeds yield diverse views of the data that combine naturally into an ensemble.
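A sketch of a random-projection ensemble (the sizes and shift are illustrative; sklearn's `GaussianRandomProjection` handles the Johnson-Lindenstrauss scaling):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# 500 normal points and 10 shifted outliers in 200 dimensions
X_normal = rng.standard_normal((500, 200))
X_out = rng.standard_normal((10, 200)) + 2.0
X = np.vstack([X_normal, X_out])

# Average LOF scores over several independent random projections
n_projections, k_dims = 10, 20
scores = np.zeros(len(X))
for seed in range(n_projections):
    proj = GaussianRandomProjection(n_components=k_dims, random_state=seed)
    X_low = proj.fit_transform(X)
    lof = LocalOutlierFactor(n_neighbors=20)
    lof.fit(X_low)
    scores += -lof.negative_outlier_factor_
scores /= n_projections

print(f"Mean score, normal points:  {scores[:500].mean():.2f}")
print(f"Mean score, outlier points: {scores[500:].mean():.2f}")
```

Each 20-dimensional view is cheap to score, and averaging across views smooths out the variance of any single projection.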
A robust pipeline for high-dimensional anomaly detection: (1) standardize features so no single scale dominates; (2) reduce dimensionality via PCA or random projection, or select informative subspaces; (3) apply a robust density-based detector in the reduced space; (4) aggregate scores across ensemble members when several views are used; and (5) validate flagged points against the original features for domain interpretability.
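One such pipeline, sketched as a single function (the component counts and neighbor settings are illustrative defaults, not prescriptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import LocalOutlierFactor

def detect_high_dim_anomalies(X: np.ndarray, var_kept: float = 0.95,
                              k: int = 20) -> np.ndarray:
    """Standardize -> reduce -> score. Returns LOF scores (higher = more anomalous)."""
    X_scaled = StandardScaler().fit_transform(X)
    # Keep enough principal components to retain var_kept of the variance
    X_low = PCA(n_components=var_kept).fit_transform(X_scaled)
    lof = LocalOutlierFactor(n_neighbors=k)
    lof.fit(X_low)
    return -lof.negative_outlier_factor_

rng = np.random.default_rng(3)
X = rng.standard_normal((300, 100))
X[:5] += 4.0  # plant 5 obvious outliers
scores = detect_high_dim_anomalies(X)
print("Highest-scoring points:", sorted(np.argsort(scores)[-5:].tolist()))
```

The final validation step, checking that the flagged points make sense in the original feature space, remains a manual, domain-specific task.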
Multivariate density estimation for anomaly detection presents unique challenges that require thoughtful approaches.
Curse of Dimensionality: In high dimensions, distances concentrate, volumes explode, and density estimation degrades. Beyond ~20 dimensions, standard methods struggle.
Mahalanobis Distance: For approximately Gaussian data, Mahalanobis distance provides principled anomaly detection with proper handling of correlations and scales.
Robust Covariance: Standard covariance is sensitive to outliers. Use MCD or other robust estimators when contamination is suspected.
Subspace Methods: Anomalies may only be visible in certain feature subsets. Feature bagging and SOD detect subspace outliers.
Dimensionality Reduction: Reducing dimensionality via PCA or random projections can make density estimation tractable, though care is needed to preserve anomaly structure.
| Scenario | Recommended Approach | Avoid |
|---|---|---|
| d < 15, n >> d | LOF, Mahalanobis | Subspace (unnecessary) |
| d = 15-50, n > 10d | Robust Mahalanobis, Feature Bagging | Full-dim KDE |
| d = 50-200, n ~ d | PCA + LOF, SOD | Full-dim anything |
| d > 200 | Random Projection ensembles, Deep methods | Standard density methods |
| Subspace anomalies expected | Feature Bagging, HiCS, SOD | Global methods alone |
For high-dimensional anomaly detection: (1) Reduce dimensionality if possible while preserving anomaly structure; (2) Use robust covariance estimation; (3) Consider subspace and ensemble approaches; (4) Validate that detected anomalies make domain sense in original features.
With multivariate density challenges addressed, the final piece of the puzzle is making these methods work at scale.
The density paradigm, when properly adapted for high dimensions and large scales, remains one of the most interpretable and effective approaches to anomaly detection.