Statistical methods for anomaly detection are powerful—but they are not magic. Every method we've covered makes assumptions about the data: its distribution, independence, stationarity, and the nature of anomalies themselves. When these assumptions hold, statistical methods provide principled, interpretable detection with controlled error rates. When they fail, the methods can produce arbitrarily wrong results.
This final page synthesizes the assumptions and limitations of statistical anomaly detection, providing you with the diagnostic skills to know when to trust these methods, when to be cautious, and when to abandon them entirely for alternatives.
By the end of this page, you will understand the core assumptions of statistical methods, diagnostic techniques to check assumptions, specific failure modes with examples, guidance for selecting appropriate methods, and when to pivot to non-statistical alternatives.
All statistical anomaly detection methods share certain foundational assumptions. Understanding these is essential for proper application.
What it means: The data follows a known or assumed probability distribution (typically normal/Gaussian).
Why it matters: Threshold calculations depend on this distribution. The "3-sigma rule" assumes normality. Chi-squared thresholds for Mahalanobis distance assume multivariate normality.
Consequences of violation:
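The threshold arithmetic that the distributional assumption buys can be checked numerically. A minimal sketch (simulated data with identity covariance, so the squared Mahalanobis distance reduces to a squared Euclidean norm; `d = 5` is an arbitrary illustrative choice):

```python
import numpy as np
from scipy import stats

# Tail mass the 3-sigma rule assumes: ~0.27% under normality
p_tail = 2 * stats.norm.sf(3)
print(f"P(|Z| > 3) under normality: {p_tail:.4f}")

# For multivariate normal data, the squared Mahalanobis distance follows a
# chi-squared distribution with d degrees of freedom, so the matching
# threshold in d = 5 dimensions is:
d = 5
threshold = stats.chi2.ppf(1 - p_tail, df=d)
print(f"chi-squared threshold (d={d}, same tail mass): {threshold:.2f}")

# Empirical check: with identity covariance, Mahalanobis^2 is just ||x||^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, d))
sq_mahal = np.sum(X**2, axis=1)
print(f"empirical exceedance rate: {np.mean(sq_mahal > threshold):.4f}")
```

When the data really is Gaussian, the empirical exceedance rate matches the nominal tail mass; every violation discussed below breaks exactly this correspondence.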
What it means: Each observation provides independent information. The value of one observation doesn't predict another.
Why it matters: Standard error calculations, effective sample size, and sequential testing depend on independence.
Consequences of violation:
What it means: The underlying data-generating process doesn't change over time. The mean, variance, and correlation structure are constant.
Why it matters: Statistics computed from historical data are used to evaluate new observations. If the process changes, historical benchmarks become irrelevant.
Consequences of violation:
What it means: The dataset used to estimate parameters is predominantly "normal" (non-anomalous). Anomalies are rare.
Why it matters: Statistics (mean, variance, covariance) are computed assuming most data is normal. High contamination corrupts these estimates.
Consequences of violation:
What it means: Anomalies are defined as low-probability events under the assumed distribution. They occur in the tails.
Why it matters: The very definition of "outlier" as "statistically improbable" requires that most observations are probable.
Consequences of violation:
When you apply a statistical anomaly detection method, you implicitly claim: "My data approximately follows [assumed distribution], observations are approximately independent, the process is approximately stationary, contamination is low, and anomalies are rare." If any of these are significantly violated, your results may be worthless.
Before applying statistical methods, check whether your data satisfies the necessary assumptions. Here are practical diagnostic techniques:
Visual Methods:
Statistical Tests:
Caveat: For large samples, formal tests reject normality even for slight, practically irrelevant deviations. Use visual inspection alongside statistical tests.
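To see the caveat in action, a small sketch (the mild-skew construction and the sample sizes are illustrative choices, not a standard recipe):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def slightly_skewed(size):
    # Nearly normal: a small exponential component adds mild skew
    return rng.normal(size=size) + 0.3 * rng.standard_exponential(size)

# Huge sample: the test detects the tiny deviation and rejects normality
stat_large, p_large = stats.normaltest(slightly_skewed(200_000))
print(f"n=200,000: p = {p_large:.2e}")  # effectively zero

# Modest sample from the same distribution: typically passes
stat_small, p_small = stats.normaltest(slightly_skewed(300))
print(f"n=300:     p = {p_small:.3f}")
```

Same distribution, opposite verdicts: the rejection at large n reflects test power, not a practically meaningful departure from normality.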
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


def comprehensive_normality_check(data: np.ndarray, alpha: float = 0.05):
    """
    Perform comprehensive normality diagnostics.

    Parameters
    ----------
    data : np.ndarray
        1D array of observations
    alpha : float
        Significance level for hypothesis tests
    """
    n = len(data)
    print("=" * 60)
    print("NORMALITY DIAGNOSTICS")
    print("=" * 60)

    # Descriptive statistics
    print(f"Sample size: {n}")
    print(f"Mean: {np.mean(data):.4f}")
    print(f"Std: {np.std(data, ddof=1):.4f}")
    print(f"Skewness: {stats.skew(data):.4f} (0 for normal)")
    print(f"Kurtosis: {stats.kurtosis(data):.4f} (0 for normal)")

    # Statistical tests
    print(f"--- Statistical Tests (α = {alpha}) ---")

    # Shapiro-Wilk (best for n < 5000)
    if n <= 5000:
        stat, p = stats.shapiro(data)
        result = "PASS (normal)" if p > alpha else "FAIL (non-normal)"
        print(f"Shapiro-Wilk: statistic={stat:.4f}, p-value={p:.4f} -> {result}")
    else:
        print("Shapiro-Wilk: Skipped (n > 5000)")

    # D'Agostino-Pearson
    if n >= 20:
        stat, p = stats.normaltest(data)
        result = "PASS" if p > alpha else "FAIL"
        print(f"D'Agostino-Pearson: statistic={stat:.4f}, p-value={p:.4f} -> {result}")

    # Anderson-Darling
    result = stats.anderson(data, dist='norm')
    idx = list(result.significance_level).index(5.0)  # 5% significance
    ad_result = "PASS" if result.statistic < result.critical_values[idx] else "FAIL"
    print(f"Anderson-Darling: statistic={result.statistic:.4f}, "
          f"critical_5%={result.critical_values[idx]:.4f} -> {ad_result}")

    # Visual diagnostics
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Histogram with normal overlay
    axes[0].hist(data, bins='auto', density=True, alpha=0.7, edgecolor='black')
    x = np.linspace(data.min(), data.max(), 100)
    axes[0].plot(x, stats.norm.pdf(x, np.mean(data), np.std(data)),
                 'r-', lw=2, label='Normal fit')
    axes[0].set_title('Histogram with Normal Overlay')
    axes[0].legend()

    # Q-Q plot
    stats.probplot(data, dist="norm", plot=axes[1])
    axes[1].set_title('Q-Q Plot')

    # Box plot
    axes[2].boxplot(data, vert=True)
    axes[2].set_title('Box Plot')

    plt.tight_layout()
    return fig


def check_independence(data: np.ndarray, max_lag: int = 10):
    """
    Check for autocorrelation (independence violation).
    """
    from statsmodels.stats.diagnostic import acorr_ljungbox

    print("--- Independence Check (Autocorrelation) ---")

    # Compute autocorrelation at various lags
    acf_values = []
    for lag in range(1, max_lag + 1):
        acf = np.corrcoef(data[:-lag], data[lag:])[0, 1]
        acf_values.append(acf)
        print(f"Lag {lag}: autocorrelation = {acf:.4f}")

    # Ljung-Box test for overall significance
    result = acorr_ljungbox(data, lags=[max_lag], return_df=True)
    p_value = result['lb_pvalue'].values[0]
    print(f"Ljung-Box test (lag={max_lag}): p-value = {p_value:.4f}")

    if p_value < 0.05:
        print("⚠️ Significant autocorrelation detected - independence violated!")
    else:
        print("✓ No significant autocorrelation")

    return acf_values


# Example
np.random.seed(42)

# Generate different types of data for demonstration
normal_data = np.random.normal(0, 1, 500)

print("\n" + "=" * 60)
print("Testing Normal Data")
comprehensive_normality_check(normal_data)

# Heavy-tailed data (Student's t with 3 df)
heavy_tailed = np.random.standard_t(3, 500)
print("\n" + "=" * 60)
print("Testing Heavy-Tailed Data (t-distribution, df=3)")
comprehensive_normality_check(heavy_tailed)
```
Checking contamination is inherently circular—if you knew which points were contamination, you wouldn't need detection. Practical approaches:
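One quick contamination diagnostic is to compare classical estimates against robust ones: a large gap suggests the classical statistics are being corrupted. A minimal sketch (the 10% contamination level and distribution parameters are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

clean = rng.normal(50, 5, 900)
contaminated = np.concatenate([clean, rng.normal(200, 10, 100)])  # 10% contamination

# Classical estimates are dragged toward the contaminating cluster
print(f"mean:   {clean.mean():6.1f} -> {contaminated.mean():6.1f}")
print(f"std:    {clean.std(ddof=1):6.1f} -> {contaminated.std(ddof=1):6.1f}")

# Robust estimates barely move
print(f"median: {np.median(clean):6.1f} -> {np.median(contaminated):6.1f}")
print(f"MAD:    {stats.median_abs_deviation(clean):6.1f} -> "
      f"{stats.median_abs_deviation(contaminated):6.1f}")
```

Here the mean jumps from ~50 to ~65 and the standard deviation roughly nine-fold, while the median and MAD stay close to their clean-data values.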
Heavy tails are the most common and dangerous violation of normality assumptions in real-world data.
A distribution has heavy tails if extreme values are more probable than under the normal distribution. Mathematically, if the tail probability decays slower than exponentially:
$$P(X > x) \sim x^{-\alpha} \text{ (power law)}$$
vs.
$$P(X > x) \sim e^{-x^2/2} \text{ (Gaussian, exponential decay)}$$
Examples of heavy-tailed data:
Under heavy tails, Z-scores of 5, 10, or even 20 can occur regularly in normal operation. Applying a threshold of $|z| > 3$ will flag countless legitimate observations as anomalies.
| Distribution | P(|Z| > 3) | False Positives per 10,000 |
|---|---|---|
| Normal | 0.27% | 27 |
| Student-t (df=10) | 0.87% | 87 |
| Student-t (df=5) | 1.84% | 184 |
| Student-t (df=3) | 4.28% | 428 |
| Laplace (double exponential) | 0.74% | 74 |
| Cauchy | 20.5% | 2,050 |
1. Use Non-Parametric Methods
IQR-based detection and MAD-based detection don't assume normality. They're based on order statistics and remain valid for heavy-tailed data.

2. Transform the Data
Log, square root, or Box-Cox transforms can reduce heavy-tailedness. Check normality after transformation.

3. Use Tail-Specific Methods
Extreme Value Theory (EVT) provides principled methods for heavy-tailed data, modeling only the tail behavior.

4. Use Higher Thresholds
If you must use Z-scores on heavy-tailed data, calibrate thresholds empirically rather than using theoretical normal quantiles.

5. Model the Actual Distribution
Fit a t-distribution, generalized Pareto, or other heavy-tailed family and compute tail probabilities accordingly.
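A minimal sketch of the non-parametric remedies (MAD-based modified Z-scores and Tukey's IQR fences) on data where extreme outliers inflate the standard deviation and mask the moderate ones. The injected values are illustrative; 3.5 is the commonly cited Iglewicz-Hoaglin cutoff for the modified Z-score:

```python
import numpy as np
from scipy import stats

def mad_outliers(x, threshold=3.5):
    """Modified Z-score: median and MAD replace mean and std."""
    med = np.median(x)
    mad = stats.median_abs_deviation(x)       # unscaled MAD
    modified_z = 0.6745 * (x - med) / mad     # ~N(0,1) scale under normality
    return np.abs(modified_z) > threshold

def iqr_outliers(x, k=1.5):
    """Tukey's fences: based purely on order statistics."""
    q1, q3 = np.percentile(x, [25, 75])
    fence = k * (q3 - q1)
    return (x < q1 - fence) | (x > q3 + fence)

rng = np.random.default_rng(0)
data = np.append(rng.normal(0, 1, 200), [8, 9, 35, 40])  # moderate + extreme anomalies

naive = np.abs(stats.zscore(data)) > 3
print("naive z-score flags:", np.where(naive)[0])  # extremes inflate the std
                                                   # and mask the moderate outliers
print("MAD flags:          ", np.where(mad_outliers(data))[0])
print("IQR flags:          ", np.where(iqr_outliers(data))[0])
```

The naive Z-score catches only the two extreme points (the values 35 and 40 drag the global standard deviation to roughly 4, so 8 and 9 score under 3); the robust methods flag all four injected anomalies.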
The Cauchy distribution is so heavy-tailed that its mean is undefined and its variance infinite. The sample mean doesn't converge regardless of sample size—there is no population mean to converge to. Z-scores are mathematically meaningless for Cauchy data, yet naive implementations will compute them anyway without warning.
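Both points are easy to verify with scipy, using the unit-scale distributions (the tail probabilities match the Normal and Cauchy rows of the table above):

```python
import numpy as np
from scipy import stats

# Tail mass beyond 3 scale units: normal vs. Cauchy
print(f"Normal: P(|X| > 3) = {2 * stats.norm.sf(3):.4f}")    # ~0.0027
print(f"Cauchy: P(|X| > 3) = {2 * stats.cauchy.sf(3):.4f}")  # ~0.2048

# The running mean of Cauchy draws never settles down
rng = np.random.default_rng(3)
draws = stats.cauchy.rvs(size=100_000, random_state=rng)
for n in (100, 1_000, 10_000, 100_000):
    print(f"mean of first {n:>7} draws: {draws[:n].mean():10.2f}")
```

The running means jump erratically as single enormous draws dominate the sum, instead of stabilizing the way the law of large numbers would guarantee for a distribution with a finite mean.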
Time series data—measurements over time—often violates the stationarity assumption.
Trend: The mean systematically increases or decreases over time.
Seasonality: Regular periodic patterns.
Level Shifts: Sudden permanent changes.
Variance Changes (Heteroscedasticity): Volatility increases or decreases.
With trends:
With seasonality:
With level shifts:
1. Detrending
Fit and remove the trend component before anomaly detection. Common methods:

2. Deseasonalization
Remove seasonal components using:

3. Sliding Window Approaches
Compute statistics from a recent window rather than all historical data. Only recent data informs what's "normal" now.
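Detrending can be sketched in a few lines. This hypothetical example uses a linear-trend series with one injected spike and a degree-1 polynomial fit (the slope, noise level, and spike size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
t = np.arange(n)
series = 0.05 * t + rng.normal(0, 1, n)  # linear upward trend + noise
series[120] += 8                          # injected anomaly

# Fit and remove a linear trend, then apply ordinary z-scores to residuals
coeffs = np.polyfit(t, series, deg=1)
residuals = series - np.polyval(coeffs, t)

z = (residuals - residuals.mean()) / residuals.std(ddof=1)
print("flagged indices:", np.where(np.abs(z) > 3)[0])  # should include 120
```

Applied to the raw series, the same Z-score rule would be distorted: the trend inflates the global variance and shifts the mean, hiding the spike.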
```python
import numpy as np
from typing import Tuple


def sliding_window_zscore(
    data: np.ndarray,
    window_size: int,
    threshold: float = 3.0
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Compute Z-scores using a sliding window for non-stationary data.

    Each point is evaluated relative to the statistics of the
    preceding window of observations.

    Parameters
    ----------
    data : np.ndarray
        Time series data
    window_size : int
        Number of preceding observations to use for statistics
    threshold : float
        Z-score threshold for anomaly flagging

    Returns
    -------
    anomaly_mask : np.ndarray
        Boolean array indicating anomalies (first window_size entries are False)
    z_scores : np.ndarray
        Z-scores for each point (NaN for first window_size entries)
    """
    n = len(data)
    z_scores = np.full(n, np.nan)
    anomaly_mask = np.zeros(n, dtype=bool)

    for i in range(window_size, n):
        # Get the preceding window
        window = data[i - window_size:i]

        # Compute window statistics
        window_mean = np.mean(window)
        window_std = np.std(window, ddof=1)

        # Avoid division by zero
        if window_std > 0:
            z_scores[i] = (data[i] - window_mean) / window_std
            anomaly_mask[i] = np.abs(z_scores[i]) > threshold

    return anomaly_mask, z_scores


def exponential_moving_average_zscore(
    data: np.ndarray,
    alpha: float = 0.1,
    threshold: float = 3.0
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Compute Z-scores using exponentially weighted moving statistics.

    More recent observations have more influence on the statistics,
    allowing gradual adaptation to changing conditions.

    Parameters
    ----------
    data : np.ndarray
        Time series data
    alpha : float
        Smoothing factor (0 < alpha < 1). Higher = more weight on recent data.
    threshold : float
        Z-score threshold
    """
    n = len(data)
    z_scores = np.full(n, np.nan)
    anomaly_mask = np.zeros(n, dtype=bool)

    # Initialize with first observation
    ewm_mean = data[0]
    ewm_var = 0.0

    for i in range(1, n):
        # Update exponentially weighted mean
        diff = data[i] - ewm_mean
        ewm_mean = ewm_mean + alpha * diff

        # Update exponentially weighted variance
        ewm_var = (1 - alpha) * (ewm_var + alpha * diff**2)
        ewm_std = np.sqrt(ewm_var)

        # Compute Z-score relative to current EWM statistics
        if ewm_std > 0:
            z_scores[i] = diff / ewm_std
            anomaly_mask[i] = np.abs(z_scores[i]) > threshold

    return anomaly_mask, z_scores


# Example: Detecting anomalies in data with trend
np.random.seed(42)

# Generate data with trend and anomalies
n = 200
trend = np.linspace(0, 20, n)  # Linear upward trend
noise = np.random.normal(0, 2, n)
data = trend + noise

# Inject anomalies
anomaly_indices = [50, 100, 150]
data[50] += 15   # Spike
data[100] -= 12  # Drop
data[150] += 18  # Spike

# Compare global vs sliding window detection

# Global Z-score (will fail due to trend)
global_mean = np.mean(data)
global_std = np.std(data, ddof=1)
global_z = (data - global_mean) / global_std
global_anomalies = np.abs(global_z) > 3

# Sliding window Z-score
window_anomalies, window_z = sliding_window_zscore(data, window_size=30, threshold=3.0)

print(f"Injected anomaly indices: {anomaly_indices}")
print(f"Global Z-score detections: {np.where(global_anomalies)[0]}")
print(f"Sliding window detections: {np.where(window_anomalies)[0]}")

# Global method likely misses or has wrong detections due to trend
# Sliding window should find the actual anomalies
```

When data comes from multiple subpopulations with different characteristics, applying global statistics produces nonsensical results.
Multiple User Types:
Multiple Operating Regimes:
Mixture of Sources:
Consider bimodal data with two equal modes at 10 and 100. The global mean is around 55—a value that almost never occurs! Worse, the gap between the modes inflates the global standard deviation to roughly 45, so a point at 10 or 100 (both perfectly normal) sits only about one standard deviation from the mean, and even genuinely anomalous values well beyond either mode can fail to reach a Z-score of 3.
Meanwhile, a value at 55 (which never occurs naturally) would appear perfectly normal by Z-score—it's right at the mean!
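This failure is easy to reproduce. A sketch with two equal modes at 10 and 100 (the within-mode standard deviation of 3 is an illustrative choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Bimodal data: two equally sized modes at 10 and 100
data = np.concatenate([rng.normal(10, 3, 5000), rng.normal(100, 3, 5000)])

mean, std = data.mean(), data.std(ddof=1)
print(f"global mean = {mean:.1f}, global std = {std:.1f}")  # ~55 and ~45

for x in (10, 55, 100):
    print(f"x = {x:>3}: z = {(x - mean) / std:+.2f}")
# The modes sit only ~1 sigma out; 55 scores z ~ 0 despite never occurring

# Scoring against the mixture density gets it right
def mixture_pdf(x):
    return 0.5 * stats.norm.pdf(x, 10, 3) + 0.5 * stats.norm.pdf(x, 100, 3)

print(f"density at 100: {mixture_pdf(100):.4f}")  # high -> genuinely normal
print(f"density at  55: {mixture_pdf(55):.2e}")   # ~0  -> genuinely anomalous
```

Here the mixture components are known by construction; in practice a fitted mixture model (remedy 2 below) plays the role of `mixture_pdf`.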
1. Stratified Analysis
If you can identify subpopulations (e.g., a user_type column), analyze each separately. Compute statistics within each group.

2. Mixture Model Approach
Fit a Gaussian Mixture Model (GMM) to the data. For each point, evaluate its probability under the mixture—not relative to a single Gaussian.

3. Clustering First
Apply clustering (k-means, DBSCAN) to identify natural groups. Then detect anomalies within clusters or as points far from all cluster centers.

4. Density-Based Methods
Methods like Local Outlier Factor (LOF) or Isolation Forest implicitly handle multimodality by evaluating local density rather than global statistics.

5. Domain Knowledge
Often the subpopulations are known. Segment the data before applying statistical methods.
Always plot a histogram before applying statistical outlier detection. If you see multiple peaks, stop immediately—global statistics will be misleading. Either segment the data or use methods designed for multimodal distributions.
Given the assumptions and failure modes, how do you choose the right method for your data? Here's a decision framework:
| Data Characteristic | Recommended Method(s) | Avoid |
|---|---|---|
| Normal, univariate, clean | Z-score, Grubbs | None (ideal case) |
| Non-normal, univariate | IQR, MAD | Z-score, Grubbs |
| Normal, multivariate, clean | Mahalanobis (classical) | Univariate methods |
| Multivariate, unknown distribution | Mahalanobis with MCD | Classical covariance |
| Time series with trend | Sliding window, STL + detection | Global statistics |
| Multimodal | Segment first, or LOF/clustering | All global methods |
| High contamination | MCD, trimmed statistics | Classical mean/variance |
| Very high dimensions | PCA first, then detection | Direct Mahalanobis |
Statistical methods are powerful but not universal. Some situations call for fundamentally different approaches.
1. Complex Feature Interactions
When anomalies are defined by intricate combinations of features that can't be captured by linear covariance structures, consider:

2. Categorical or Mixed Data
Statistical methods assume continuous numeric data. For:

3. Labeled Anomaly Examples Available
If you have examples of known anomalies, use supervised or semi-supervised learning:

4. Very High Dimensions
In hundreds or thousands of dimensions:

5. Sequential/Graph Structure
When data has inherent structure:

6. Domain-Specific Anomaly Definitions
Statistical rarity isn't always the right definition of "anomaly":
Statistical methods detect the statistically unusual. If your definition of anomaly is different, use methods aligned with that definition.
Statistical methods are excellent for well-behaved numeric data where "anomaly" means "statistically improbable." But they're not the only tool. Distance-based, density-based, tree-based, and neural network-based methods each have strengths for different scenarios. Match the method to the data and the problem definition.
This module has provided a comprehensive treatment of statistical methods for anomaly detection. Let's consolidate the key lessons across all pages:
Statistical methods form the foundational layer of anomaly detection. They provide:
However, they are not sufficient for all problems. The subsequent modules in this chapter cover distance-based methods (LOF, k-NN), density-based methods (DBSCAN, kernel density), isolation-based methods (Isolation Forest), and one-class learning (One-Class SVM, autoencoders). Each extends the toolkit for different data characteristics and anomaly definitions.
Master the statistical foundations first—they provide the conceptual grounding for everything that follows.
Congratulations! You have completed the Statistical Methods module for Anomaly Detection. You now possess deep knowledge of Z-score, IQR, Grubbs' test, and Mahalanobis distance methods, along with critical understanding of their assumptions, limitations, and appropriate use cases. This knowledge forms the essential foundation for all anomaly detection work.