Consider a person who is 6'8" tall and weighs 130 pounds. Their height is unusual but not extreme—professional basketball players routinely exceed this. Their weight is on the low side but not alarming—some people are naturally thin. Yet the combination of extreme height with low weight is highly anomalous—it suggests a medical condition, data entry error, or fraud.
This illustrates a fundamental limitation of univariate methods: anomalies often hide in the relationships between variables, not in individual variables. The Z-score, IQR, and Grubbs' tests we've studied can miss these multivariate outliers entirely.
Multivariate anomaly detection addresses this by considering the joint distribution of all variables simultaneously, accounting for correlations and dependencies that define what constitutes 'normal' in high-dimensional space.
By the end of this page, you will understand why univariate methods fail for correlated data, the mathematical foundations of the Mahalanobis distance, how to use chi-squared distributions for thresholding, challenges of covariance estimation, robust alternatives like the Minimum Covariance Determinant, and practical implementation strategies for multivariate outlier detection.
Let's formalize why applying univariate tests to each variable independently can fail dramatically.
Consider a 2D dataset where variables $X$ and $Y$ are positively correlated (when $X$ is high, $Y$ tends to be high). The data forms an elliptical cloud in 2D space.
A point with moderately high $X$ and moderately low $Y$ might be unremarkable in each coordinate separately, yet severely anomalous jointly: it sits off the main axis of the elliptical cloud, violating the positive correlation that the rest of the data obeys.
Mathematically: If $X$ and $Y$ are marginally normal, applying Z-scores to each gives:

$$z_X = \frac{x - \mu_X}{\sigma_X}, \qquad z_Y = \frac{y - \mu_Y}{\sigma_Y}$$
Even if $|z_X| < 3$ and $|z_Y| < 3$, the point $(x, y)$ can be highly improbable under the joint distribution.
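As a quick numeric sketch (the unit variances, the correlation of 0.9, and the point itself are illustrative values, not from a real dataset):

```python
import numpy as np
from scipy import stats

# Two standardized variables with correlation 0.9
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])

# A point only 2 sigma out in each coordinate, but against the correlation
x = np.array([2.0, -2.0])

z_scores = (x - mu) / np.sqrt(np.diag(Sigma))
d2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)  # squared Mahalanobis distance

print(z_scores)                      # [ 2. -2.]: passes any |z| < 3 test
print(round(d2, 2))                  # ≈ 80: enormous under the joint distribution
print(stats.chi2.ppf(0.999, df=2))   # ≈ 13.82: the alpha = 0.001 threshold
```

Each coordinate looks mild in isolation, yet the squared Mahalanobis distance is roughly six times the $\alpha = 0.001$ threshold.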
Univariate thresholds define a rectangle (in 2D) or hypercube (in higher dimensions) aligned with the coordinate axes.
Joint probability contours for correlated normal data form an ellipse (in 2D) or ellipsoid (in higher dimensions) that can be rotated and stretched.
The rectangle and ellipse can differ dramatically: points near a corner of the rectangle pass every univariate test yet lie far outside the ellipse (missed outliers), while points just beyond the rectangle along the correlation direction may still sit inside the ellipse (false alarms).
As dimensionality increases, this discrepancy worsens exponentially.
In high dimensions, most of the probability mass in a Gaussian distribution lies in a thin shell at a specific distance from the center—not at the center itself. Univariate reasoning fails to capture this phenomenon. Points that appear 'central' in each dimension can be extremely unlikely in combination.
If you apply univariate outlier detection to each feature and take the union, you will miss anomalies that violate correlation structures. If you take the intersection (flag only if outlier in ALL dimensions), you will miss almost everything in high dimensions. Neither approach captures multivariate structure.
The Mahalanobis distance, introduced by Prasanta Chandra Mahalanobis in 1936, is a distance measure that accounts for correlations between variables. It is the fundamental tool for multivariate anomaly detection.
For a point $\mathbf{x} \in \mathbb{R}^p$ and a distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$, the Mahalanobis distance is:
$$D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}$$
Often, we work with the squared Mahalanobis distance:
$$D_M^2(\mathbf{x}) = (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$$
Let's unpack this expression:
$(\mathbf{x} - \boldsymbol{\mu})$: Centers the point relative to the mean (just like Z-score)
$\boldsymbol{\Sigma}^{-1}$: The inverse covariance matrix (precision matrix) standardizes and decorrelates
The quadratic form: Computes a weighted sum of squared deviations, where weights depend on variances and covariances
Case 1: Uncorrelated Variables
If variables are uncorrelated, $\boldsymbol{\Sigma}$ is diagonal: $$\boldsymbol{\Sigma} = \text{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_p^2)$$
The Mahalanobis distance simplifies to: $$D_M^2(\mathbf{x}) = \sum_{i=1}^{p} \left(\frac{x_i - \mu_i}{\sigma_i}\right)^2 = \sum_{i=1}^{p} z_i^2$$
This is the sum of squared Z-scores—the Euclidean distance in standardized space.
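This identity is easy to verify numerically; a minimal sketch on synthetic uncorrelated data:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = np.array([2.0, 5.0, 0.5])     # per-variable standard deviations
mu = np.array([10.0, -3.0, 0.0])
Sigma = np.diag(sigma ** 2)           # diagonal covariance: uncorrelated case

x = rng.normal(mu, sigma)             # one random observation

d2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
sum_z2 = np.sum(((x - mu) / sigma) ** 2)   # sum of squared Z-scores

print(np.isclose(d2, sum_z2))  # True
```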
Case 2: Identity Covariance
If $\boldsymbol{\Sigma} = \mathbf{I}$ (standardized, uncorrelated data), the Mahalanobis distance equals the standard Euclidean distance from the mean: $$D_M(\mathbf{x}) = \|\mathbf{x} - \boldsymbol{\mu}\|_2$$
The covariance matrix $\boldsymbol{\Sigma}$ can be decomposed as $\boldsymbol{\Sigma} = \mathbf{V} \mathbf{D} \mathbf{V}^T$, where $\mathbf{V}$ holds the eigenvectors (the principal axes of the data ellipsoid) and $\mathbf{D} = \text{diag}(\lambda_1, \ldots, \lambda_p)$ holds the eigenvalues (the variances along those axes).
Mahalanobis distance effectively rotates the data onto the eigenvector axes of $\boldsymbol{\Sigma}$ and rescales each axis by its standard deviation $\sqrt{\lambda_i}$ before measuring ordinary Euclidean distance.
This transforms the elliptical probability contours into circular ones, making simple distance comparison meaningful.
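This "whitening" interpretation can be checked directly: multiplying centered data by $\boldsymbol{\Sigma}^{-1/2}$ (built from the eigendecomposition) turns Mahalanobis distance into plain Euclidean distance. A sketch with an illustrative covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[4.0, 3.0],
                  [3.0, 4.0]])        # rotated, stretched ellipse
mu = np.zeros(2)
X = rng.multivariate_normal(mu, Sigma, size=500)

# Whitening matrix Sigma^{-1/2}: rotate onto eigenvectors, rescale, rotate back
eigvals, V = np.linalg.eigh(Sigma)
W = V @ np.diag(eigvals ** -0.5) @ V.T
X_white = (X - mu) @ W.T

# Euclidean distance in whitened space == Mahalanobis distance in original space
d_euclid = np.linalg.norm(X_white, axis=1)
P = np.linalg.inv(Sigma)
d_mahal = np.sqrt(np.sum((X - mu) @ P * (X - mu), axis=1))
print(np.allclose(d_euclid, d_mahal))  # True
```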
Mahalanobis distance measures how many 'standard deviations' a point is from the center, where 'standard deviation' is measured along the natural axes of the data ellipsoid. It's the multivariate generalization of the Z-score.
Just as Z-scores have probabilistic interpretation under normality, Mahalanobis distances have a precise distributional characterization.
If $\mathbf{x} \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ (multivariate normal), then the squared Mahalanobis distance follows a chi-squared distribution with $p$ degrees of freedom:
$$D_M^2(\mathbf{x}) \sim \chi^2_p$$
This result is fundamental—it allows us to set thresholds with known false positive rates.
To achieve a false positive rate of $\alpha$, set the threshold at the $(1-\alpha)$ quantile of the chi-squared distribution:
$$\tau = \chi^2_{p, 1-\alpha}$$
Points with $D_M^2(\mathbf{x}) > \tau$ are flagged as outliers.
| Dimensions (p) | α = 0.05 | α = 0.01 | α = 0.001 |
|---|---|---|---|
| 2 | 5.99 | 9.21 | 13.82 |
| 3 | 7.81 | 11.34 | 16.27 |
| 5 | 11.07 | 15.09 | 20.52 |
| 10 | 18.31 | 23.21 | 29.59 |
| 20 | 31.41 | 37.57 | 45.31 |
| 50 | 67.50 | 76.15 | 86.66 |
| 100 | 124.34 | 135.81 | 149.45 |
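The table can be reproduced (and extended to other $p$ or $\alpha$) with `scipy.stats.chi2.ppf`:

```python
from scipy import stats

# tau = chi-squared quantile at 1 - alpha with p degrees of freedom
for p in (2, 3, 5, 10, 20, 50, 100):
    row = [stats.chi2.ppf(1 - a, df=p) for a in (0.05, 0.01, 0.001)]
    print(p, [round(tau, 2) for tau in row])
# First row: 2 [5.99, 9.21, 13.82]
```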
Notice how the threshold increases with dimensionality. In 2D, $D_M^2 > 5.99$ is an outlier at α=0.05. In 100D, you need $D_M^2 > 124$ for the same significance.
Why? In high dimensions, even normal points are expected to be 'far' from the center in Mahalanobis distance. The chi-squared distribution with more degrees of freedom has a larger mean and wider spread.
Mean of $\chi^2_p$: $\mathbb{E}[\chi^2_p] = p$
Variance of $\chi^2_p$: $\text{Var}(\chi^2_p) = 2p$
So in 100 dimensions, the average squared Mahalanobis distance is 100, not near zero.
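Both moments are easy to confirm by simulation; for a standard normal in $p$ dimensions ($\boldsymbol{\mu} = \mathbf{0}$, $\boldsymbol{\Sigma} = \mathbf{I}$), $D_M^2$ is simply the squared norm:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 100
X = rng.standard_normal((20000, p))   # N(0, I): Mahalanobis reduces to Euclidean
d2 = np.sum(X ** 2, axis=1)           # squared Mahalanobis distances

print(round(d2.mean(), 1))   # ≈ 100  (mean of chi-squared with p dof is p)
print(round(d2.var(), 1))    # ≈ 200  (variance is 2p)
```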
The chi-squared distribution is exact only when the true mean and covariance are known. When estimated from sample data (as in practice), the distribution is better approximated by a scaled F-distribution or beta distribution. For small samples, this correction is important; for large samples (n >> p), the chi-squared approximation is adequate.
Input: Dataset $\mathbf{X} \in \mathbb{R}^{n \times p}$ (n observations, p variables), significance level $\alpha$
Output: Boolean array indicating outliers
1. Estimate mean vector: μ̂ = (1/n) Σᵢ xᵢ
2. Estimate covariance matrix: Σ̂ = (1/(n-1)) Σᵢ (xᵢ - μ̂)(xᵢ - μ̂)ᵀ
3. Compute precision matrix: Σ̂⁻¹ (with regularization if needed)
4. For each observation i:
D²ᵢ = (xᵢ - μ̂)ᵀ Σ̂⁻¹ (xᵢ - μ̂)
5. Compute threshold: τ = χ²_p,1-α
6. Flag outliers: D²ᵢ > τ
Computational Complexity:
```python
import numpy as np
from scipy import stats
from typing import NamedTuple
import warnings


class MahalanobisResult(NamedTuple):
    """Results from Mahalanobis distance-based outlier detection."""
    outlier_mask: np.ndarray
    distances_squared: np.ndarray
    threshold: float
    mean: np.ndarray
    covariance: np.ndarray


def mahalanobis_outlier_detection(
    X: np.ndarray,
    alpha: float = 0.05,
    regularization: float = 1e-6
) -> MahalanobisResult:
    """
    Detect outliers using Mahalanobis distance.

    Parameters
    ----------
    X : np.ndarray
        Data matrix of shape (n_samples, n_features)
    alpha : float
        Significance level for chi-squared threshold
    regularization : float
        Ridge regularization for covariance matrix (for stability)

    Returns
    -------
    MahalanobisResult : NamedTuple with detection results
    """
    n_samples, n_features = X.shape

    if n_samples <= n_features:
        warnings.warn(
            f"n_samples ({n_samples}) <= n_features ({n_features}). "
            "Covariance matrix will be singular. Consider PCA first."
        )

    # Estimate mean and covariance
    mean = np.mean(X, axis=0)
    covariance = np.cov(X, rowvar=False)

    # Add regularization for numerical stability
    covariance += regularization * np.eye(n_features)

    # Compute precision matrix (inverse covariance)
    try:
        precision = np.linalg.inv(covariance)
    except np.linalg.LinAlgError:
        # Use pseudo-inverse if singular
        precision = np.linalg.pinv(covariance)
        warnings.warn("Covariance matrix is singular, using pseudo-inverse")

    # Compute squared Mahalanobis distances (row-wise quadratic forms)
    centered = X - mean
    distances_squared = np.sum(centered @ precision * centered, axis=1)

    # Chi-squared threshold
    threshold = stats.chi2.ppf(1 - alpha, df=n_features)

    # Flag outliers
    outlier_mask = distances_squared > threshold

    return MahalanobisResult(
        outlier_mask=outlier_mask,
        distances_squared=distances_squared,
        threshold=threshold,
        mean=mean,
        covariance=covariance
    )


def plot_mahalanobis_2d(X: np.ndarray, result: MahalanobisResult, alpha: float = 0.05):
    """Visualize Mahalanobis outlier detection in 2D."""
    import matplotlib.pyplot as plt
    from matplotlib.patches import Ellipse

    fig, ax = plt.subplots(figsize=(10, 8))

    # Plot inliers and outliers
    inliers = X[~result.outlier_mask]
    outliers = X[result.outlier_mask]
    ax.scatter(inliers[:, 0], inliers[:, 1], c='blue', alpha=0.5, label='Inliers')
    ax.scatter(outliers[:, 0], outliers[:, 1], c='red', s=100, marker='x', label='Outliers')

    # Draw threshold ellipse from the eigendecomposition of the covariance
    eigenvalues, eigenvectors = np.linalg.eigh(result.covariance)
    angle = np.degrees(np.arctan2(eigenvectors[1, 0], eigenvectors[0, 0]))

    # Ellipse semi-axes are sqrt(eigenvalue * threshold)
    width = 2 * np.sqrt(eigenvalues[0] * result.threshold)
    height = 2 * np.sqrt(eigenvalues[1] * result.threshold)

    ellipse = Ellipse(
        result.mean, width, height, angle=angle,
        fill=False, edgecolor='green', linewidth=2,
        label=f'Threshold (α={alpha})'
    )
    ax.add_patch(ellipse)

    ax.set_xlabel('X₁')
    ax.set_ylabel('X₂')
    ax.legend()
    ax.set_title('Mahalanobis Distance Outlier Detection')
    plt.axis('equal')
    return fig


# Example usage
np.random.seed(42)

# Generate correlated normal data
mean = [50, 30]
cov = [[100, 80], [80, 100]]  # Strongly correlated
X_normal = np.random.multivariate_normal(mean, cov, 200)

# Add some outliers (violating correlation structure)
outliers = np.array([
    [80, 0],    # High X1, low X2 (violates positive correlation)
    [20, 60],   # Low X1, high X2 (violates positive correlation)
    [100, 90],  # Extreme in both (far from center)
])
X = np.vstack([X_normal, outliers])

# Detect outliers
result = mahalanobis_outlier_detection(X, alpha=0.01)

print(f"Dataset: {X.shape[0]} observations, {X.shape[1]} features")
print(f"Chi-squared threshold (α=0.01, df=2): {result.threshold:.2f}")
print(f"Outliers detected: {np.sum(result.outlier_mask)}")
print(f"Outlier Mahalanobis² distances: {result.distances_squared[result.outlier_mask]}")
```

The masking problem we encountered with univariate methods becomes even more severe in the multivariate setting.
When computing the sample covariance matrix, outliers can shift the estimated mean toward themselves, inflate the estimated variances, and rotate or weaken the estimated correlations.
The combined effect: The estimated ellipsoid expands and distorts to 'accommodate' the outliers, making their Mahalanobis distances appear smaller than they should be.
Consider a clean dataset forming a tight ellipse. Add a single extreme outlier: the estimated mean shifts toward it, the estimated ellipse stretches in its direction, and the outlier's own Mahalanobis distance shrinks, sometimes enough to fall below the detection threshold.
This is why directly applying Mahalanobis distance with sample estimates can miss exactly the outliers we're trying to detect.
The sample mean has breakdown point $1/n$: a single observation, moved far enough, can drag the estimate arbitrarily. The sample covariance matrix is equally fragile, and in $p$ dimensions a single outlier can distort any of its $p(p+1)/2$ unique parameters.
In high dimensions, the covariance matrix has O(p²) parameters estimated from n observations. As p approaches n, estimation becomes extremely unstable even without outliers. Add outliers, and the estimates can become meaningless. This is why robust methods are critical.
To address the masking problem, we need robust estimators of location and scatter that are not influenced by outliers.
The MCD estimator, introduced by Rousseeuw (1984), finds the subset of observations that gives the smallest determinant of the sample covariance matrix.
Definition:
Given $n$ observations in $\mathbb{R}^p$, find the subset $H$ of size $h$ (where $h > (n+p+1)/2$) that minimizes:
$$\det(\hat{\boldsymbol{\Sigma}}_H)$$
Where $\hat{\boldsymbol{\Sigma}}_H$ is the sample covariance of the subset $H$.
Intuition: The subset with smallest covariance determinant is the 'tightest' cluster—the core of non-outlying observations. Outliers, by definition, lie far from this core and would inflate the determinant if included.
High breakdown point: If $h \approx n/2$, the MCD can resist up to ~50% contamination.
Affine equivariance: The estimator behaves correctly under linear transformations of the data.
Statistical efficiency: For normal data, MCD is less efficient than classical estimates, but this is the price of robustness.
```python
import numpy as np
from scipy import stats
from sklearn.covariance import EmpiricalCovariance, MinCovDet
from typing import NamedTuple


class RobustMahalanobisResult(NamedTuple):
    """Results from robust Mahalanobis distance outlier detection."""
    outlier_mask: np.ndarray
    distances_squared: np.ndarray
    threshold: float
    robust_mean: np.ndarray
    robust_covariance: np.ndarray


def robust_mahalanobis_detection(
    X: np.ndarray,
    alpha: float = 0.05,
    support_fraction: float = None
) -> RobustMahalanobisResult:
    """
    Detect outliers using robust Mahalanobis distance based on MCD.

    Parameters
    ----------
    X : np.ndarray
        Data matrix of shape (n_samples, n_features)
    alpha : float
        Significance level for chi-squared threshold
    support_fraction : float, optional
        Proportion of observations used for MCD (default: (n + p + 1) / (2n)).
        Lower values increase robustness but decrease efficiency.

    Returns
    -------
    RobustMahalanobisResult : NamedTuple with detection results
    """
    n_samples, n_features = X.shape

    # Fit the MCD estimator (FAST-MCD under the hood)
    mcd = MinCovDet(support_fraction=support_fraction)
    mcd.fit(X)

    # Robust location and scatter estimates
    robust_mean = mcd.location_
    robust_covariance = mcd.covariance_

    # Squared Mahalanobis distances from the robust estimates
    distances_squared = mcd.mahalanobis(X)

    # Chi-squared threshold
    # Note: the distribution is approximately chi-squared for clean data
    threshold = stats.chi2.ppf(1 - alpha, df=n_features)

    # Flag outliers
    outlier_mask = distances_squared > threshold

    return RobustMahalanobisResult(
        outlier_mask=outlier_mask,
        distances_squared=distances_squared,
        threshold=threshold,
        robust_mean=robust_mean,
        robust_covariance=robust_covariance
    )


def compare_classical_vs_robust(X: np.ndarray, alpha: float = 0.01):
    """Compare classical and robust Mahalanobis detection."""
    n_samples, n_features = X.shape

    # Classical (non-robust) estimation
    emp_cov = EmpiricalCovariance().fit(X)
    classical_distances = emp_cov.mahalanobis(X)

    # Robust MCD estimation
    mcd = MinCovDet().fit(X)
    robust_distances = mcd.mahalanobis(X)

    # Threshold
    threshold = stats.chi2.ppf(1 - alpha, df=n_features)
    classical_outliers = classical_distances > threshold
    robust_outliers = robust_distances > threshold

    print("=== Classical vs. Robust Mahalanobis Comparison ===")
    print(f"Chi-squared threshold (α={alpha}, df={n_features}): {threshold:.2f}")
    print("Classical method:")
    print(f"  Outliers detected: {np.sum(classical_outliers)}")
    print(f"  Max distance²: {np.max(classical_distances):.2f}")
    print("Robust MCD method:")
    print(f"  Outliers detected: {np.sum(robust_outliers)}")
    print(f"  Max distance²: {np.max(robust_distances):.2f}")

    # Points detected by robust but not classical (masking effect)
    masked = robust_outliers & ~classical_outliers
    if np.any(masked):
        print(f"⚠️ Masking detected: {np.sum(masked)} outliers missed by classical method")

    return classical_outliers, robust_outliers


# Example: demonstrating the masking effect
np.random.seed(42)

# Clean correlated data
mean = [0, 0]
cov = [[1, 0.8], [0.8, 1]]
X_clean = np.random.multivariate_normal(mean, cov, 100)

# Add outliers that will cause masking:
# multiple outliers in the same direction inflate the covariance in that direction
outliers = np.array([
    [5, -3],
    [6, -4],
    [7, -3.5],
    [4, 5],
    [5.5, 4],
])
X_contaminated = np.vstack([X_clean, outliers])

# Compare methods
classical_mask, robust_mask = compare_classical_vs_robust(X_contaminated, alpha=0.01)
```

Computing the exact MCD is NP-hard. The FAST-MCD algorithm (Rousseeuw & Van Driessen, 1999) provides an efficient approximation used in scikit-learn. It uses iterative concentration steps starting from multiple initial subsets.
When the number of features $p$ approaches or exceeds the number of observations $n$, classical covariance estimation breaks down entirely.
The sample covariance matrix requires $n > p$ to be non-singular (invertible). When $n \leq p$, the matrix is rank-deficient and cannot be inverted directly.
Even when $n > p$, if $n$ is not substantially larger than $p$, the estimates are highly unstable. A common rule of thumb is $n \geq 5p$ to $10p$ for reliable estimation.
1. Regularization
Add a regularization term to the covariance matrix to ensure invertibility: $$\hat{\boldsymbol{\Sigma}}_{\text{reg}} = \hat{\boldsymbol{\Sigma}} + \lambda \mathbf{I}$$
Where $\lambda > 0$ is a regularization parameter. This shrinks eigenvalues toward a common value.
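A small sketch (with arbitrary synthetic dimensions) showing the failure and the fix:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 10, 20                      # underdetermined: fewer samples than features
X = rng.standard_normal((n, p))

S = np.cov(X, rowvar=False)        # rank at most n - 1 = 9: singular, not invertible
print(np.linalg.matrix_rank(S))    # 9

lam = 1e-2
S_reg = S + lam * np.eye(p)        # ridge term lifts every eigenvalue by lam
print(np.linalg.eigvalsh(S_reg).min() > 0)  # True: now positive definite
precision = np.linalg.inv(S_reg)   # inversion succeeds
```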
2. Shrinkage Estimators
Ledoit-Wolf shrinkage interpolates between the sample covariance and a structured target (e.g., diagonal or scaled identity): $$\hat{\boldsymbol{\Sigma}}_{\text{shrunk}} = (1 - \alpha) \hat{\boldsymbol{\Sigma}} + \alpha \mathbf{T}$$
Where $\alpha$ is optimally chosen to minimize estimation error.
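In scikit-learn this is available as `LedoitWolf`, which picks the shrinkage intensity from the data; a sketch on synthetic data (the sample sizes here are illustrative):

```python
import numpy as np
from scipy import stats
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(4)
n, p = 60, 20                     # n/p = 3: too small for a stable sample covariance
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

lw = LedoitWolf().fit(X)
print(lw.shrinkage_)              # data-driven shrinkage intensity in [0, 1]

# Mahalanobis distances from the shrunk covariance, thresholded as before
d2 = lw.mahalanobis(X)
tau = stats.chi2.ppf(0.99, df=p)
print(np.sum(d2 > tau))
```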
3. Dimensionality Reduction First
Apply PCA to reduce dimensions to $k \ll n$, then perform Mahalanobis-based detection in the reduced space. The first $k$ principal components often capture the main structure, though anomalies can also hide in the minor components that PCA discards.
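A sketch of the PCA-first strategy on made-up data lying near a 5-dimensional subspace of a 200-dimensional space (all sizes are illustrative):

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
n, p, k = 80, 200, 5                 # n << p: raw covariance would be singular
latent = rng.standard_normal((n, k))
A = rng.standard_normal((k, p))
X = latent @ A + 0.05 * rng.standard_normal((n, p))  # data near a k-dim subspace

# Reduce to k dimensions, then run Mahalanobis detection where n >> k
Z = PCA(n_components=k).fit_transform(X)
mu = Z.mean(axis=0)
precision = np.linalg.inv(np.cov(Z, rowvar=False))
d2 = np.sum((Z - mu) @ precision * (Z - mu), axis=1)

outliers = d2 > stats.chi2.ppf(0.999, df=k)
print(np.sum(outliers))              # few or none: this synthetic data is clean
```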
| Regime | n/p Ratio | Recommended Approach |
|---|---|---|
| Classical | n/p > 10 | Standard sample covariance |
| Moderate | 5 < n/p ≤ 10 | Regularization or shrinkage |
| Challenging | 2 < n/p ≤ 5 | Strong regularization or MCD with care |
| High-dimensional | 1 < n/p ≤ 2 | Dimensionality reduction + regularization |
| Underdetermined | n/p ≤ 1 | PCA first, or diagonal covariance only |
Real-world high-dimensional data often lies near a lower-dimensional manifold. PCA can identify this intrinsic dimensionality, allowing effective Mahalanobis-based detection in the reduced space where covariance estimation is stable.
We've covered the core statistical methods for anomaly detection: univariate Z-scores, robust IQR, formal Grubbs' testing, and multivariate Mahalanobis distance. The final page of this module synthesizes everything by examining the Assumptions and Limitations of statistical approaches as a whole—when they work, when they fail, and how to know which method to apply in practice.
You now understand why multivariate structure matters for outlier detection, the mathematical foundations of Mahalanobis distance, chi-squared thresholds, the masking problem, and robust alternatives like MCD. This knowledge equips you to detect anomalies that hide in the relationships between variables.