Real-world anomaly detection problems rarely involve just two or three features. Financial fraud detection might consider hundreds of transaction attributes. Network intrusion detection examines dozens of protocol features. Healthcare anomaly detection processes thousands of gene expressions or sensor readings.
When density-based methods move from low-dimensional toy examples to these high-dimensional realities, fundamental challenges emerge. Distances become less meaningful, volumes explode exponentially, and the intuitions that guide us in 2D or 3D spaces fail spectacularly.
This page confronts these challenges directly, exploring both the theoretical foundations of multivariate density estimation and the practical techniques that make density-based anomaly detection viable in high dimensions.
This page covers: (1) The curse of dimensionality and its impact on density estimation; (2) Multivariate Gaussian density and Mahalanobis distance; (3) Covariance estimation challenges and robust alternatives; (4) Subspace and projected density methods; (5) Feature selection and dimensionality reduction for density-based detection; and (6) Practical strategies for high-dimensional anomaly detection.
The curse of dimensionality—coined by Richard Bellman—describes how algorithms that work well in low dimensions often fail catastrophically as dimensionality increases. For density-based anomaly detection, this curse manifests in several devastating ways.
Consider the volume of a d-dimensional hypersphere of radius r: $$V_d(r) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} r^d$$
The ratio of the volume of a hypersphere to its enclosing hypercube: $$\frac{V_{\text{sphere}}}{V_{\text{cube}}} = \frac{\pi^{d/2}}{\Gamma(d/2 + 1) \cdot 2^d}$$
As d increases, the volume of a hypersphere becomes negligibly small compared to its enclosing cube: most of the cube's volume concentrates in its 'corners', regions that distance-based neighborhoods centered on points cannot reach efficiently.
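A few lines of Python make the collapse concrete, evaluating the formula above for a unit-radius ball inside the cube $[-1, 1]^d$:

```python
import math

def sphere_to_cube_ratio(d: int) -> float:
    """Volume of the unit d-ball divided by that of its enclosing cube [-1, 1]^d."""
    return math.pi ** (d / 2) / (math.gamma(d / 2 + 1) * 2 ** d)

for d in [2, 5, 10, 20]:
    print(f"d = {d:2d}: sphere/cube volume ratio = {sphere_to_cube_ratio(d):.2e}")
```

At d = 2 the ratio is π/4 ≈ 0.785; by d = 20 it has dropped below 10⁻⁷.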
In high dimensions, the ratio of distances between points converges to 1. For random points: max(d_ij) / min(d_ij) → 1 as d → ∞. This means all points appear approximately equidistant, making it nearly impossible to distinguish neighbors from non-neighbors based on distance alone.
For n random points uniformly distributed in [0, 1]^d:
$$\lim_{d \to \infty} \mathbb{E}\left[\frac{D_{\max}}{D_{\min}}\right] = 1$$
where $D_{\max}$ and $D_{\min}$ are maximum and minimum pairwise distances.
More precisely, for fixed n and growing d: $$\frac{D_{\max} - D_{\min}}{D_{\min}} = O(1/\sqrt{d})$$
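A quick simulation (a sketch using uniform random points and scipy's `pdist`) shows the relative contrast collapsing as d grows:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

def relative_contrast(n: int, d: int) -> float:
    """(D_max - D_min) / D_min over all pairwise distances of n uniform points in [0, 1]^d."""
    X = rng.uniform(size=(n, d))
    D = pdist(X)  # condensed vector of pairwise Euclidean distances
    return (D.max() - D.min()) / D.min()

for d in [2, 10, 100, 1000]:
    print(f"d = {d:4d}: relative contrast = {relative_contrast(200, d):.3f}")
```

In 2 dimensions the farthest pair is orders of magnitude more distant than the closest pair; in 1000 dimensions all pairs are within a small factor of each other.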
This concentration means that nearest-neighbor structure loses meaning: the k-th nearest neighbor is barely closer than a random point, so k-NN distances, LOF density ratios, and KDE bandwidths all lose their discriminative power.
To maintain in d dimensions the same statistical accuracy that a sample of size n achieves in the original dimensionality, we need: $$n' = n^{d/d_0}$$
where d₀ is the original dimensionality. This exponential growth is called the sample size curse.
For practical density estimation, we need sample density: $$\rho = \frac{n}{V} = \frac{n}{L^d}$$
As d grows, for fixed n, this density drops exponentially, making any local neighborhood estimation statistically unsound.
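A back-of-the-envelope check of the sample-size curse, assuming a baseline of n = 100 points in d₀ = 2 dimensions:

```python
# Baseline: n = 100 samples give adequate density in d0 = 2 dimensions
n, d0 = 100, 2
for d in [2, 5, 10, 20]:
    n_required = n ** (d / d0)  # n' = n^(d/d0)
    print(f"d = {d:2d}: required sample size n' = {n_required:.0e}")
```

Already at d = 10 the required sample size is 10¹⁰; at d = 20 it is 10²⁰, far beyond any realistic dataset.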
| Dimension d | Distance Variability | LOF Effectiveness | Required Sample Size |
|---|---|---|---|
| 2-5 | High (good) | Excellent | Hundreds |
| 6-15 | Moderate | Good | Thousands |
| 16-30 | Low | Degraded | Tens of thousands |
| 31-100 | Very Low | Poor | Millions+ |
| > 100 | Near-zero | Unreliable | Impractical |
The multivariate Gaussian distribution provides a principled parametric approach to density-based anomaly detection, particularly effective when data approximately follows a normal distribution.
For d-dimensional data, the multivariate Gaussian density: $$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\mathbf{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
where $\boldsymbol{\mu}$ is the mean vector, $\mathbf{\Sigma}$ is the $d \times d$ covariance matrix, and $|\mathbf{\Sigma}|$ is its determinant.
The exponent in the Gaussian density defines the Mahalanobis distance: $$D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}$$
Mahalanobis distance accounts for both the scale of each feature (via the variances on the diagonal of $\mathbf{\Sigma}$) and the correlations between features (via the off-diagonal terms).
Euclidean distance treats all features equally and assumes independence. Mahalanobis distance is equivalent to Euclidean distance after whitening the data (transforming to uncorrelated, unit-variance features). For anomaly detection, Mahalanobis is almost always preferable when covariance can be reliably estimated.
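The whitening equivalence is easy to verify numerically. The sketch below (synthetic correlated 2D data, illustrative parameters) whitens via a Cholesky factor of the precision matrix:

```python
import numpy as np

rng = np.random.default_rng(42)

# Correlated 2D data
cov = np.array([[2.0, 1.2], [1.2, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=500)

mean = X.mean(axis=0)
precision = np.linalg.inv(np.cov(X, rowvar=False))

# Mahalanobis distance computed directly from the quadratic form
centered = X - mean
d_maha = np.sqrt(np.einsum('ij,jk,ik->i', centered, precision, centered))

# Whitening: right-multiply by a Cholesky factor L of the precision matrix,
# so that ||c L||^2 = c (L L^T) c^T = c Sigma^{-1} c^T
L = np.linalg.cholesky(precision)
d_eucl = np.linalg.norm(centered @ L, axis=1)

print("Max discrepancy:", np.abs(d_maha - d_eucl).max())  # numerically zero
```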
For Gaussian-distributed data, the squared Mahalanobis distance $D_M^2(\mathbf{x})$ follows a chi-squared distribution with d degrees of freedom: $$D_M^2(\mathbf{x}) \sim \chi^2_d$$
This provides a principled threshold for anomaly detection: $$\text{IsAnomaly}(\mathbf{x}) = \mathbb{1}\left[D_M^2(\mathbf{x}) > \chi^2_{d, 1-\alpha}\right]$$
where $\chi^2_{d, 1-\alpha}$ is the $(1-\alpha)$ quantile of the chi-squared distribution.
For example, with α = 0.01 and d = 10, the threshold is $\chi^2_{10,\, 0.99} \approx 23.21$: any point whose squared Mahalanobis distance exceeds 23.21 is flagged.
This gives a probabilistic interpretation: an anomaly has less than 1% probability under the fitted Gaussian model.
```python
import numpy as np
import warnings
from scipy import stats
from scipy.linalg import inv
from sklearn.covariance import EmpiricalCovariance, MinCovDet


class MahalanobisAnomalyDetector:
    """
    Anomaly detection using Mahalanobis distance.
    Supports both standard and robust covariance estimation.
    """

    def __init__(self, contamination: float = 0.05, robust: bool = True):
        """
        Parameters
        ----------
        contamination : float
            Expected fraction of anomalies (for threshold setting)
        robust : bool
            If True, use Minimum Covariance Determinant (robust to outliers)
        """
        self.contamination = contamination
        self.robust = robust

    def fit(self, X: np.ndarray):
        """Fit the model: estimate mean and covariance."""
        self.n_samples_, self.n_features_ = X.shape

        if self.robust and self.n_samples_ > self.n_features_ + 1:
            # Robust covariance estimation (MCD)
            try:
                self.cov_estimator_ = MinCovDet().fit(X)
                self.mean_ = self.cov_estimator_.location_
                self.cov_ = self.cov_estimator_.covariance_
            except ValueError:
                # Fall back to empirical covariance if MCD fails
                warnings.warn("MCD failed, using empirical covariance")
                self.cov_estimator_ = EmpiricalCovariance().fit(X)
                self.mean_ = self.cov_estimator_.location_
                self.cov_ = self.cov_estimator_.covariance_
        else:
            # Standard covariance estimation
            self.mean_ = np.mean(X, axis=0)
            self.cov_ = np.cov(X, rowvar=False)
            if self.n_features_ == 1:
                self.cov_ = np.array([[self.cov_]])

        # Precision matrix (inverse covariance), with a small
        # regularization term for numerical stability
        self.cov_reg_ = self.cov_ + 1e-6 * np.eye(self.n_features_)
        self.precision_ = inv(self.cov_reg_)

        # Chi-squared threshold
        self.threshold_ = stats.chi2.ppf(1 - self.contamination, df=self.n_features_)
        return self

    def mahalanobis_distance(self, X: np.ndarray) -> np.ndarray:
        """Compute Mahalanobis distance for each sample."""
        centered = X - self.mean_
        # D^2 = (x - mu)^T * Sigma^{-1} * (x - mu)
        left = centered @ self.precision_
        d_squared = np.sum(left * centered, axis=1)
        return np.sqrt(np.maximum(d_squared, 0))  # Ensure non-negative

    def score_samples(self, X: np.ndarray) -> np.ndarray:
        """
        Compute anomaly scores (higher = more anomalous).
        Returns squared Mahalanobis distance.
        """
        d = self.mahalanobis_distance(X)
        return d ** 2

    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Predict anomaly labels.

        Returns
        -------
        np.ndarray : 1 for anomalies, 0 for normal
        """
        scores = self.score_samples(X)
        return (scores > self.threshold_).astype(int)

    def fit_predict(self, X: np.ndarray) -> np.ndarray:
        """Fit and predict."""
        return self.fit(X).predict(X)

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """
        Compute probability of being an anomaly:
        P(anomaly) = P(chi2_d > D^2)
        """
        scores = self.score_samples(X)
        # Survival function of the chi-squared distribution
        return stats.chi2.sf(scores, df=self.n_features_)


def demonstrate_mahalanobis():
    """Demonstrate Mahalanobis-based anomaly detection."""
    np.random.seed(42)

    # Generate correlated normal data
    n, d = 500, 5
    L = np.random.randn(d, d)
    cov_true = L @ L.T + 0.5 * np.eye(d)  # Covariance with correlations
    mean_true = np.zeros(d)
    X_normal = np.random.multivariate_normal(mean_true, cov_true, n)

    # Type 1 anomalies: global outliers (far from center)
    outliers_global = np.random.multivariate_normal(mean_true + 5, 0.5 * np.eye(d), 10)

    # Type 2 anomalies: correlation-violating outliers.
    # Close in Euclidean distance but far in Mahalanobis distance:
    # move along the direction of smallest variance.
    eigvals, eigvecs = np.linalg.eigh(cov_true)
    small_eigvec = eigvecs[:, 0]  # Direction of smallest variance
    outliers_corr = mean_true + 2 * small_eigvec * np.random.randn(10, 1).repeat(d, axis=1)

    X = np.vstack([X_normal, outliers_global, outliers_corr])

    # Fit detector
    detector = MahalanobisAnomalyDetector(contamination=0.05, robust=True)
    detector.fit(X)
    predictions = detector.predict(X)
    scores = detector.score_samples(X)

    print("Mahalanobis Anomaly Detection Results:")
    print("=" * 50)
    print(f"Threshold (chi2, df={d}): {detector.threshold_:.2f}")
    print(f"Normal data (n={n}):")
    print(f"  False positives: {predictions[:n].sum()}")
    print(f"  Score range: [{scores[:n].min():.2f}, {scores[:n].max():.2f}]")
    print("Global outliers (n=10):")
    print(f"  Detected: {predictions[n:n+10].sum()}/10")
    print(f"  Score range: [{scores[n:n+10].min():.2f}, {scores[n:n+10].max():.2f}]")
    print("Correlation-violating outliers (n=10):")
    print(f"  Detected: {predictions[n+10:].sum()}/10")
    print(f"  Score range: [{scores[n+10:].min():.2f}, {scores[n+10:].max():.2f}]")

    # Compare with Euclidean distance
    euclidean_dists = np.linalg.norm(X - X[:n].mean(axis=0), axis=1)
    print("Euclidean vs Mahalanobis for correlation-violating outliers:")
    print(f"  Euclidean ranks:   {np.argsort(euclidean_dists)[::-1][:5]}")
    print(f"  Mahalanobis ranks: {np.argsort(scores)[::-1][:5]}")


demonstrate_mahalanobis()
```

The Achilles' heel of Mahalanobis-based anomaly detection is covariance estimation. The standard empirical covariance is highly sensitive to outliers: the very things we're trying to detect!
When outliers contaminate the training data, they inflate the estimated variances and distort the correlation structure. The fitted ellipsoid stretches toward the outliers, shrinking their Mahalanobis distances and masking the very points that caused the distortion.
This circular problem—needing to know outliers to estimate covariance, but needing covariance to find outliers—requires robust estimation techniques.
MCD finds the subset of h points (typically h = ⌈(n+d+1)/2⌉) whose covariance matrix has the smallest determinant:
$$\hat{\mathbf{\Sigma}}_{\text{MCD}} = \arg\min_{|H| = h} |\mathbf{\Sigma}_H|$$
The intuition: the subset with minimum covariance determinant best represents the 'core' of the data, excluding outliers.
Properties of MCD: a breakdown point of roughly 50% (it tolerates up to about half the data being contaminated), affine equivariance, and a fast approximate algorithm (FAST-MCD) that makes it practical for moderate n and d.
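A minimal sketch of MCD's robustness using sklearn's `MinCovDet` (the contamination level and cluster locations here are illustrative):

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.default_rng(0)

# 450 clean correlated points plus 50 gross outliers (10% contamination)
true_cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X_clean = rng.multivariate_normal([0.0, 0.0], true_cov, size=450)
X_out = rng.multivariate_normal([8.0, -8.0], np.eye(2), size=50)
X = np.vstack([X_clean, X_out])

emp = EmpiricalCovariance().fit(X)
mcd = MinCovDet(random_state=0).fit(X)

# Frobenius-norm error of each estimate against the true covariance
err_emp = np.linalg.norm(emp.covariance_ - true_cov)
err_mcd = np.linalg.norm(mcd.covariance_ - true_cov)
print(f"Empirical covariance error: {err_emp:.2f}")
print(f"MCD covariance error:       {err_mcd:.2f}")
```

The empirical estimate is dragged far from the truth by the outlier cluster, while MCD recovers the clean core's covariance.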
Minimum Volume Ellipsoid (MVE): Finds the ellipsoid of minimum volume containing h points. Similar concept to MCD but with different optimization objective.
S-Estimators: Minimize a robust scale estimate: $$\hat{\Sigma}_S = \arg\min_{\Sigma} \det(\Sigma) \text{ s.t. } \frac{1}{n}\sum_i \rho(d_M(\mathbf{x}_i; \Sigma)) = k$$ where ρ is a robust loss function (e.g., Tukey's biweight).
M-Estimators: Iteratively reweight observations based on their Mahalanobis distance: $$\hat{\Sigma}^{(t+1)} = \frac{\sum_i w(d_M^{(t)}(\mathbf{x}_i)) (\mathbf{x}_i - \hat{\mu})(\mathbf{x}_i - \hat{\mu})^T}{\sum_i w(d_M^{(t)}(\mathbf{x}_i))}$$ where w(d) downweights points with large distances.
Shrinkage Estimators: Regularize towards a simpler covariance structure: $$\hat{\Sigma}_{\text{shrink}} = (1-\alpha)\, \hat{\Sigma}_{\text{sample}} + \alpha \cdot \text{diag}(\hat{\Sigma}_{\text{sample}})$$
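sklearn's `LedoitWolf` implements a closely related shrinkage (towards a scaled identity rather than the diagonal, with the coefficient α chosen analytically). A small sketch in the d ≈ n regime where the sample covariance is nearly singular:

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(1)

# High-dimensional, small-sample regime: d close to n
n, d = 60, 50
X = rng.standard_normal((n, d))

sample_cov = np.cov(X, rowvar=False)
lw = LedoitWolf().fit(X)

# The sample covariance is ill-conditioned here; shrinkage restores conditioning
print(f"Sample covariance condition number: {np.linalg.cond(sample_cov):.2e}")
print(f"Ledoit-Wolf condition number:       {np.linalg.cond(lw.covariance_):.2e}")
print(f"Estimated shrinkage coefficient:    {lw.shrinkage_:.3f}")
```

A well-conditioned covariance matters directly for Mahalanobis distance, since the precision matrix must be inverted.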
| Estimator | Breakdown Point | Efficiency | Computational Cost | Best For |
|---|---|---|---|---|
| Empirical | 0% | 100% | O(nd²) | Clean data only |
| MCD | ~50% | ~80% | O(n²d²) naive | General robust use |
| MVE | ~50% | Lower | O(n²d²) | Theoretical interest |
| M-Estimator | Variable | High | O(nd² × iters) | Known contamination |
| Shrinkage | 0% | Variable | O(nd²) | High-dimensional, small n |
For most anomaly detection applications: (1) If you trust your training data is mostly clean, use empirical covariance with regularization; (2) If contamination is suspected, use MCD (available in sklearn.covariance.MinCovDet); (3) For high-dimensional data (d > n/2), use shrinkage estimators (sklearn.covariance.LedoitWolf or ShrunkCovariance).
When full-dimensional density estimation fails due to the curse of dimensionality, subspace methods offer a powerful alternative: detect anomalies in lower-dimensional projections where density estimation remains tractable.
In high-dimensional data, anomalies may only be anomalous in certain feature subsets. A point could appear normal when considering all features but be clearly anomalous in a specific 2D or 3D projection.
Example: A fraudulent transaction might have normal amounts and normal times, but the combination of <location, merchant_category> is rare.
Inspired by ensemble methods, feature bagging repeatedly samples feature subsets and applies anomaly detection in each: draw a random subset of features (commonly between d/2 and d−1 of them), run a base detector such as LOF in that subspace, and aggregate the resulting scores across all rounds (e.g., by averaging or taking the maximum).
Benefits include lower effective dimensionality per detector, diversity across ensemble members, and robustness: an anomaly visible in any sampled subspace contributes to the aggregate score.
HiCS searches for subspaces where a point deviates most from the local distribution:
Contrast Function: $$\text{contrast}(\mathbf{x}, S) = \frac{p_S(\mathbf{x}|N(\mathbf{x}))}{p_S(\mathbf{x})}$$
where $p_S$ is the marginal distribution in subspace S, and $N(\mathbf{x})$ is the local neighborhood.
Algorithm sketch: (1) generate candidate subspaces, starting from 2D pairs; (2) test each for high contrast by comparing marginal and conditional distributions; (3) extend high-contrast subspaces to higher dimensions, pruning the rest; (4) run an outlier detector (e.g., LOF) in the selected subspaces and aggregate the scores.
The key insight: Instead of searching all 2^d subspaces (exponential), HiCS uses statistical tests to prune uninteresting subspaces.
SOD leverages the observation that outliers often deviate in low-variance subspaces:
$$\text{SOD}(\mathbf{x}) = \frac{1}{|S|} \sum_{j \in S} (x_j - \tilde{x}_j)^2$$
where $\tilde{x}_j$ is the projection onto the neighbor-defined subspace.
Points far from their neighbors' subspace are anomalies—they don't 'fit' the local pattern.
```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors
from sklearn.decomposition import PCA


class FeatureBaggingDetector:
    """
    Feature Bagging for outlier detection in high dimensions.
    Samples random feature subsets and aggregates LOF scores.
    """

    def __init__(self, n_estimators: int = 50, subspace_size: int = None,
                 k: int = 20, contamination: float = 0.05):
        """
        Parameters
        ----------
        n_estimators : int
            Number of subspace samples
        subspace_size : int
            Number of features per subspace (default: d/2)
        k : int
            Number of neighbors for LOF
        contamination : float
            Expected anomaly fraction
        """
        self.n_estimators = n_estimators
        self.subspace_size = subspace_size
        self.k = k
        self.contamination = contamination

    def fit_predict(self, X: np.ndarray) -> tuple:
        """
        Fit and return anomaly labels and scores.

        Returns
        -------
        tuple : (labels, aggregated_scores)
        """
        n_samples, n_features = X.shape

        # Default subspace size
        subspace_dim = self.subspace_size or max(2, n_features // 2)
        subspace_dim = min(subspace_dim, n_features)

        all_scores = np.zeros((self.n_estimators, n_samples))
        for i in range(self.n_estimators):
            # Random feature subset
            features = np.random.choice(n_features, size=subspace_dim, replace=False)
            X_sub = X[:, features]

            # Apply LOF in the subspace
            lof = LocalOutlierFactor(n_neighbors=self.k, contamination=self.contamination)
            lof.fit(X_sub)
            # Convert negative outlier factor to positive scores
            all_scores[i] = -lof.negative_outlier_factor_

        # Aggregate across subspaces
        aggregated = np.mean(all_scores, axis=0)

        # Threshold at the contamination percentile
        threshold = np.percentile(aggregated, 100 * (1 - self.contamination))
        labels = (aggregated > threshold).astype(int)
        return labels, aggregated


class SubspaceOutlierDegree:
    """
    SOD: detects outliers that deviate from their neighbors' subspace.
    """

    def __init__(self, k: int = 20, alpha: float = 0.8):
        """
        Parameters
        ----------
        k : int
            Number of neighbors
        alpha : float
            Fraction of variance for subspace selection
        """
        self.k = k
        self.alpha = alpha

    def fit_predict(self, X: np.ndarray) -> np.ndarray:
        """
        Compute SOD scores for all points.

        Returns
        -------
        np.ndarray : Outlier scores (higher = more anomalous)
        """
        n_samples, n_features = X.shape
        scores = np.zeros(n_samples)

        # Find k-nearest neighbors
        nn = NearestNeighbors(n_neighbors=self.k + 1)
        nn.fit(X)
        _, indices = nn.kneighbors(X)
        neighbor_indices = indices[:, 1:]  # Exclude self

        for i in range(n_samples):
            # Center the neighbors and find their subspace via PCA
            neighbors = X[neighbor_indices[i]]
            neighbor_center = neighbors.mean(axis=0)
            neighbors_centered = neighbors - neighbor_center
            pca = PCA()
            pca.fit(neighbors_centered)

            # Subspace dimension: cumulative variance >= alpha
            cumvar = np.cumsum(pca.explained_variance_ratio_)
            n_components = np.searchsorted(cumvar, self.alpha) + 1
            n_components = min(n_components, n_features)

            # Project the point onto the subspace; measure reconstruction error
            point_centered = X[i] - neighbor_center
            projection = pca.components_[:n_components].T @ (
                pca.components_[:n_components] @ point_centered)
            reconstruction_error = np.linalg.norm(point_centered - projection)

            # SOD score = reconstruction error / subspace dimensionality
            scores[i] = reconstruction_error / max(1, n_components)

        return scores


# Demonstration
np.random.seed(42)

# High-dimensional data with subspace outliers
n_normal = 300
d = 50

# Normal data: correlations in the first 10 dimensions, noise elsewhere
latent = np.random.randn(n_normal, 5)
signal_dims = latent @ np.random.randn(5, 10)  # First 10 dims from latent factors
noise_dims = np.random.randn(n_normal, 40) * 0.5  # Rest is noise
X_normal = np.hstack([signal_dims, noise_dims])

# Subspace outliers: anomalous in the first 10 dims, normal in the rest
anomaly_latent = np.random.randn(10, 5) + 5  # Shifted latent factors
anomaly_signal = anomaly_latent @ np.random.randn(5, 10)
anomaly_noise = np.random.randn(10, 40) * 0.5  # Normal noise
outliers = np.hstack([anomaly_signal, anomaly_noise])

X = np.vstack([X_normal, outliers])

# Compare methods
print("High-Dimensional Subspace Anomaly Detection:")
print("=" * 60)
print(f"Data: {X.shape[0]} samples, {X.shape[1]} dimensions")
print("True anomalies: 10")

# 1. Standard LOF (full dimensions)
lof_full = LocalOutlierFactor(n_neighbors=20)
lof_full.fit(X)
lof_scores = -lof_full.negative_outlier_factor_

# 2. Feature Bagging
fb = FeatureBaggingDetector(n_estimators=50, subspace_size=10)
fb_labels, fb_scores = fb.fit_predict(X)

# 3. SOD
sod = SubspaceOutlierDegree(k=20, alpha=0.8)
sod_scores = sod.fit_predict(X)

# Evaluate
def top_k_precision(scores, n_anomalies, k):
    top_k = np.argsort(scores)[-k:]
    true_labels = np.zeros(len(scores))
    true_labels[-n_anomalies:] = 1
    return true_labels[top_k].sum() / k

print("Top-10 Precision (detecting 10 anomalies):")
print(f"  Full-dim LOF:    {top_k_precision(lof_scores, 10, 10):.1%}")
print(f"  Feature Bagging: {top_k_precision(fb_scores, 10, 10):.1%}")
print(f"  SOD:             {top_k_precision(sod_scores, 10, 10):.1%}")

print("Top-20 Precision:")
print(f"  Full-dim LOF:    {top_k_precision(lof_scores, 10, 20):.1%}")
print(f"  Feature Bagging: {top_k_precision(fb_scores, 10, 20):.1%}")
print(f"  SOD:             {top_k_precision(sod_scores, 10, 20):.1%}")
```

Rather than working in subspaces, another approach is to reduce dimensionality first, then apply density-based detection in the reduced space.
PCA projects data onto the directions of maximum variance; a density-based detector is then run on the top-k component scores, where estimation remains tractable.
Caveat: PCA preserves variance, not necessarily anomaly structure. An anomaly in a low-variance direction may be lost.
PCA reconstruction error itself can indicate anomalies: $$\text{error}(\mathbf{x}) = \|\mathbf{x} - \mathbf{P}_k \mathbf{x}\|^2$$
where $\mathbf{P}_k$ projects onto top k principal components.
Insight: Anomalies often have high reconstruction error because they don't conform to the learned principal subspace. This is the basis for autoencoder-based anomaly detection in deep learning.
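A minimal sketch of reconstruction-error scoring, using synthetic data that lies near a known low-dimensional subspace (the dimensions and noise level are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)

# Normal data lies near a 3-dimensional subspace of a 20-dimensional space
latent = rng.standard_normal((500, 3))
W = rng.standard_normal((3, 20))
X_normal = latent @ W + 0.1 * rng.standard_normal((500, 20))

# Anomalies are generated off that subspace
X_anom = 3.0 * rng.standard_normal((10, 20))
X = np.vstack([X_normal, X_anom])

# Fit the principal subspace on (assumed mostly clean) normal data
pca = PCA(n_components=3).fit(X_normal)
X_recon = pca.inverse_transform(pca.transform(X))
errors = np.linalg.norm(X - X_recon, axis=1) ** 2

print(f"Mean reconstruction error, normal:    {errors[:500].mean():.2f}")
print(f"Mean reconstruction error, anomalous: {errors[500:].mean():.2f}")
```

Normal points reconstruct almost perfectly from 3 components; the off-subspace anomalies leave large residuals.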
PCA is unsupervised—it finds directions of maximum variance regardless of anomaly labels. If labeled anomalies are available (rare but valuable), supervised dimensionality reduction (e.g., LDA, supervised autoencoders) can find projections that better separate normal from anomalous points.
The Johnson-Lindenstrauss lemma states that random projections approximately preserve distances:
For $k = O\!\left(\frac{\log n}{\epsilon^2}\right)$, a random projection to k dimensions preserves all pairwise distances within a factor of $(1 \pm \epsilon)$ with high probability.
For anomaly detection, this means we can project to $k \ll d$ dimensions, run a density-based detector in the projected space, and still trust that large distances remain large.
Benefits: projection costs only O(ndk), requires no training, and different random seeds yield diverse views of the data that combine naturally into an ensemble.
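A sketch of a random-projection ensemble (the sizes and shift are illustrative; sklearn's `GaussianRandomProjection` handles the Johnson-Lindenstrauss scaling):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# 500 normal points and 10 shifted outliers in 200 dimensions
X_normal = rng.standard_normal((500, 200))
X_out = rng.standard_normal((10, 200)) + 2.0
X = np.vstack([X_normal, X_out])

# Average LOF scores over several independent random projections
n_projections, k_dims = 10, 20
scores = np.zeros(len(X))
for seed in range(n_projections):
    proj = GaussianRandomProjection(n_components=k_dims, random_state=seed)
    X_low = proj.fit_transform(X)
    lof = LocalOutlierFactor(n_neighbors=20)
    lof.fit(X_low)
    scores += -lof.negative_outlier_factor_
scores /= n_projections

print(f"Mean score, normal points:  {scores[:500].mean():.2f}")
print(f"Mean score, outlier points: {scores[500:].mean():.2f}")
```

Each 20-dimensional view is cheap to score, and averaging across views smooths out the variance of any single projection.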
A robust pipeline for high-dimensional anomaly detection: (1) standardize features so no single scale dominates; (2) reduce dimensionality via PCA or random projection, or select informative subspaces; (3) apply a robust density-based detector in the reduced space; (4) aggregate scores across ensemble members when several views are used; and (5) validate flagged points against the original features for domain interpretability.
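One such pipeline, sketched as a single function (the component counts and neighbor settings are illustrative defaults, not prescriptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import LocalOutlierFactor

def detect_high_dim_anomalies(X: np.ndarray, var_kept: float = 0.95,
                              k: int = 20) -> np.ndarray:
    """Standardize -> reduce -> score. Returns LOF scores (higher = more anomalous)."""
    X_scaled = StandardScaler().fit_transform(X)
    # Keep enough principal components to retain var_kept of the variance
    X_low = PCA(n_components=var_kept).fit_transform(X_scaled)
    lof = LocalOutlierFactor(n_neighbors=k)
    lof.fit(X_low)
    return -lof.negative_outlier_factor_

rng = np.random.default_rng(3)
X = rng.standard_normal((300, 100))
X[:5] += 4.0  # plant 5 obvious outliers
scores = detect_high_dim_anomalies(X)
print("Highest-scoring points:", sorted(np.argsort(scores)[-5:].tolist()))
```

The final validation step, checking that the flagged points make sense in the original feature space, remains a manual, domain-specific task.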
Multivariate density estimation for anomaly detection presents unique challenges that require thoughtful approaches.
Curse of Dimensionality: In high dimensions, distances concentrate, volumes explode, and density estimation degrades. Beyond ~20 dimensions, standard methods struggle.
Mahalanobis Distance: For approximately Gaussian data, Mahalanobis distance provides principled anomaly detection with proper handling of correlations and scales.
Robust Covariance: Standard covariance is sensitive to outliers. Use MCD or other robust estimators when contamination is suspected.
Subspace Methods: Anomalies may only be visible in certain feature subsets. Feature bagging and SOD detect subspace outliers.
Dimensionality Reduction: Reducing dimensionality via PCA or random projections can make density estimation tractable, though care is needed to preserve anomaly structure.
| Scenario | Recommended Approach | Avoid |
|---|---|---|
| d < 15, n >> d | LOF, Mahalanobis | Subspace (unnecessary) |
| d = 15-50, n > 10d | Robust Mahalanobis, Feature Bagging | Full-dim KDE |
| d = 50-200, n ~ d | PCA + LOF, SOD | Full-dim anything |
| d > 200 | Random Projection ensembles, Deep methods | Standard density methods |
| Subspace anomalies expected | Feature Bagging, HiCS, SOD | Global methods alone |
For high-dimensional anomaly detection: (1) Reduce dimensionality if possible while preserving anomaly structure; (2) Use robust covariance estimation; (3) Consider subspace and ensemble approaches; (4) Validate that detected anomalies make domain sense in original features.
With multivariate density challenges addressed, the final piece of the puzzle is making these methods work at scale.
The density paradigm, when properly adapted for high dimensions and large scales, remains one of the most interpretable and effective approaches to anomaly detection.