Statistical methods for anomaly detection are powerful—but they are not magic. Every method we've covered makes assumptions about the data: its distribution, independence, stationarity, and the nature of anomalies themselves. When these assumptions hold, statistical methods provide principled, interpretable detection with controlled error rates. When they fail, the methods can produce arbitrarily wrong results.
This final page synthesizes the assumptions and limitations of statistical anomaly detection, providing you with the diagnostic skills to know when to trust these methods, when to be cautious, and when to abandon them entirely for alternatives.
By the end of this page, you will understand the core assumptions of statistical methods, diagnostic techniques to check assumptions, specific failure modes with examples, guidance for selecting appropriate methods, and when to pivot to non-statistical alternatives.
All statistical anomaly detection methods share certain foundational assumptions. Understanding these is essential for proper application.
What it means: The data follows a known or assumed probability distribution (typically normal/Gaussian).
Why it matters: Threshold calculations depend on this distribution. The "3-sigma rule" assumes normality. Chi-squared thresholds for Mahalanobis distance assume multivariate normality.
Consequences of violation:
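The threshold arithmetic that the distributional assumption buys can be checked numerically. A minimal sketch (simulated data with identity covariance, so the squared Mahalanobis distance reduces to a squared Euclidean norm; `d = 5` is an arbitrary illustrative choice):

```python
import numpy as np
from scipy import stats

# Tail mass the 3-sigma rule assumes: ~0.27% under normality
p_tail = 2 * stats.norm.sf(3)
print(f"P(|Z| > 3) under normality: {p_tail:.4f}")

# For multivariate normal data, the squared Mahalanobis distance follows a
# chi-squared distribution with d degrees of freedom, so the matching
# threshold in d = 5 dimensions is:
d = 5
threshold = stats.chi2.ppf(1 - p_tail, df=d)
print(f"chi-squared threshold (d={d}, same tail mass): {threshold:.2f}")

# Empirical check: with identity covariance, Mahalanobis^2 is just ||x||^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, d))
sq_mahal = np.sum(X**2, axis=1)
print(f"empirical exceedance rate: {np.mean(sq_mahal > threshold):.4f}")
```

When the data really is Gaussian, the empirical exceedance rate matches the nominal tail mass; every violation discussed below breaks exactly this correspondence.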
What it means: Each observation provides independent information. The value of one observation doesn't predict another.
Why it matters: Standard error calculations, effective sample size, and sequential testing depend on independence.
Consequences of violation:
What it means: The underlying data-generating process doesn't change over time. The mean, variance, and correlation structure are constant.
Why it matters: Statistics computed from historical data are used to evaluate new observations. If the process changes, historical benchmarks become irrelevant.
Consequences of violation:
What it means: The dataset used to estimate parameters is predominantly "normal" (non-anomalous). Anomalies are rare.
Why it matters: Statistics (mean, variance, covariance) are computed assuming most data is normal. High contamination corrupts these estimates.
Consequences of violation:
What it means: Anomalies are defined as low-probability events under the assumed distribution. They occur in the tails.
Why it matters: The very definition of "outlier" as "statistically improbable" requires that most observations are probable.
Consequences of violation:
When you apply a statistical anomaly detection method, you implicitly claim: "My data approximately follows [assumed distribution], observations are approximately independent, the process is approximately stationary, contamination is low, and anomalies are rare." If any of these are significantly violated, your results may be worthless.
Before applying statistical methods, check whether your data satisfies the necessary assumptions. Here are practical diagnostic techniques:
Visual Methods:
Statistical Tests:
Caveat: For large samples, formal tests reject normality even for slight, practically irrelevant deviations. Use visual inspection alongside statistical tests.
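To see the caveat in action, a small sketch (the mild-skew construction and the sample sizes are illustrative choices, not a standard recipe):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def slightly_skewed(size):
    # Nearly normal: a small exponential component adds mild skew
    return rng.normal(size=size) + 0.3 * rng.standard_exponential(size)

# Huge sample: the test detects the tiny deviation and rejects normality
stat_large, p_large = stats.normaltest(slightly_skewed(200_000))
print(f"n=200,000: p = {p_large:.2e}")  # effectively zero

# Modest sample from the same distribution: typically passes
stat_small, p_small = stats.normaltest(slightly_skewed(300))
print(f"n=300:     p = {p_small:.3f}")
```

Same distribution, opposite verdicts: the rejection at large n reflects test power, not a practically meaningful departure from normality.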
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


def comprehensive_normality_check(data: np.ndarray, alpha: float = 0.05):
    """
    Perform comprehensive normality diagnostics.

    Parameters
    ----------
    data : np.ndarray
        1D array of observations
    alpha : float
        Significance level for hypothesis tests
    """
    n = len(data)
    print("=" * 60)
    print("NORMALITY DIAGNOSTICS")
    print("=" * 60)

    # Descriptive statistics
    print(f"Sample size: {n}")
    print(f"Mean: {np.mean(data):.4f}")
    print(f"Std: {np.std(data, ddof=1):.4f}")
    print(f"Skewness: {stats.skew(data):.4f} (0 for normal)")
    print(f"Kurtosis: {stats.kurtosis(data):.4f} (0 for normal)")

    # Statistical tests
    print(f"--- Statistical Tests (α = {alpha}) ---")

    # Shapiro-Wilk (best for n < 5000)
    if n <= 5000:
        stat, p = stats.shapiro(data)
        result = "PASS (normal)" if p > alpha else "FAIL (non-normal)"
        print(f"Shapiro-Wilk: statistic={stat:.4f}, p-value={p:.4f} -> {result}")
    else:
        print("Shapiro-Wilk: Skipped (n > 5000)")

    # D'Agostino-Pearson
    if n >= 20:
        stat, p = stats.normaltest(data)
        result = "PASS" if p > alpha else "FAIL"
        print(f"D'Agostino-Pearson: statistic={stat:.4f}, p-value={p:.4f} -> {result}")

    # Anderson-Darling
    result = stats.anderson(data, dist='norm')
    idx = list(result.significance_level).index(5.0)  # 5% significance
    ad_result = "PASS" if result.statistic < result.critical_values[idx] else "FAIL"
    print(f"Anderson-Darling: statistic={result.statistic:.4f}, "
          f"critical_5%={result.critical_values[idx]:.4f} -> {ad_result}")

    # Visual diagnostics
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Histogram with normal overlay
    axes[0].hist(data, bins='auto', density=True, alpha=0.7, edgecolor='black')
    x = np.linspace(data.min(), data.max(), 100)
    axes[0].plot(x, stats.norm.pdf(x, np.mean(data), np.std(data)),
                 'r-', lw=2, label='Normal fit')
    axes[0].set_title('Histogram with Normal Overlay')
    axes[0].legend()

    # Q-Q plot
    stats.probplot(data, dist="norm", plot=axes[1])
    axes[1].set_title('Q-Q Plot')

    # Box plot
    axes[2].boxplot(data, vert=True)
    axes[2].set_title('Box Plot')

    plt.tight_layout()
    return fig


def check_independence(data: np.ndarray, max_lag: int = 10):
    """
    Check for autocorrelation (independence violation).
    """
    from statsmodels.stats.diagnostic import acorr_ljungbox

    print("--- Independence Check (Autocorrelation) ---")

    # Compute autocorrelation at various lags
    acf_values = []
    for lag in range(1, max_lag + 1):
        acf = np.corrcoef(data[:-lag], data[lag:])[0, 1]
        acf_values.append(acf)
        print(f"Lag {lag}: autocorrelation = {acf:.4f}")

    # Ljung-Box test for overall significance
    result = acorr_ljungbox(data, lags=[max_lag], return_df=True)
    p_value = result['lb_pvalue'].values[0]
    print(f"Ljung-Box test (lag={max_lag}): p-value = {p_value:.4f}")

    if p_value < 0.05:
        print("⚠️ Significant autocorrelation detected - independence violated!")
    else:
        print("✓ No significant autocorrelation")

    return acf_values


# Example
np.random.seed(42)

# Generate different types of data for demonstration
normal_data = np.random.normal(0, 1, 500)

print("\n" + "=" * 60)
print("Testing Normal Data")
comprehensive_normality_check(normal_data)

# Heavy-tailed data (Student's t with 3 df)
heavy_tailed = np.random.standard_t(3, 500)
print("\n" + "=" * 60)
print("Testing Heavy-Tailed Data (t-distribution, df=3)")
comprehensive_normality_check(heavy_tailed)
```
Checking contamination is inherently circular—if you knew which points were contamination, you wouldn't need detection. Practical approaches:
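One quick contamination diagnostic is to compare classical estimates against robust ones: a large gap suggests the classical statistics are being corrupted. A minimal sketch (the 10% contamination level and distribution parameters are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

clean = rng.normal(50, 5, 900)
contaminated = np.concatenate([clean, rng.normal(200, 10, 100)])  # 10% contamination

# Classical estimates are dragged toward the contaminating cluster
print(f"mean:   {clean.mean():6.1f} -> {contaminated.mean():6.1f}")
print(f"std:    {clean.std(ddof=1):6.1f} -> {contaminated.std(ddof=1):6.1f}")

# Robust estimates barely move
print(f"median: {np.median(clean):6.1f} -> {np.median(contaminated):6.1f}")
print(f"MAD:    {stats.median_abs_deviation(clean):6.1f} -> "
      f"{stats.median_abs_deviation(contaminated):6.1f}")
```

Here the mean jumps from ~50 to ~65 and the standard deviation roughly nine-fold, while the median and MAD stay close to their clean-data values.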
Heavy tails are the most common and dangerous violation of normality assumptions in real-world data.
A distribution has heavy tails if extreme values are more probable than under the normal distribution. Mathematically, if the tail probability decays slower than exponentially:
$$P(X > x) \sim x^{-\alpha} \text{ (power law)}$$
vs.
$$P(X > x) \sim e^{-x^2/2} \text{ (Gaussian, exponential decay)}$$
Examples of heavy-tailed data:
Under heavy tails, Z-scores of 5, 10, or even 20 can occur regularly in normal operation. Applying a threshold of $|z| > 3$ will flag countless legitimate observations as anomalies.
| Distribution | P(|Z| > 3) | False Positives per 10,000 |
|---|---|---|
| Normal | 0.27% | 27 |
| Student-t (df=10) | 0.87% | 87 |
| Student-t (df=5) | 1.84% | 184 |
| Student-t (df=3) | 4.28% | 428 |
| Laplace (double exponential) | 0.74% | 74 |
| Cauchy | 20.5% | 2,050 |
1. Use Non-Parametric Methods
IQR-based detection and MAD-based detection don't assume normality. They're based on order statistics and remain valid for heavy-tailed data.

2. Transform the Data
Log, square root, or Box-Cox transforms can reduce heavy-tailedness. Check normality after transformation.

3. Use Tail-Specific Methods
Extreme Value Theory (EVT) provides principled methods for heavy-tailed data, modeling only the tail behavior.

4. Use Higher Thresholds
If you must use Z-scores on heavy-tailed data, calibrate thresholds empirically rather than using theoretical normal quantiles.

5. Model the Actual Distribution
Fit a t-distribution, generalized Pareto, or other heavy-tailed family and compute tail probabilities accordingly.
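A minimal sketch of the non-parametric remedies (MAD-based modified Z-scores and Tukey's IQR fences) on data where extreme outliers inflate the standard deviation and mask the moderate ones. The injected values are illustrative; 3.5 is the commonly cited Iglewicz-Hoaglin cutoff for the modified Z-score:

```python
import numpy as np
from scipy import stats

def mad_outliers(x, threshold=3.5):
    """Modified Z-score: median and MAD replace mean and std."""
    med = np.median(x)
    mad = stats.median_abs_deviation(x)       # unscaled MAD
    modified_z = 0.6745 * (x - med) / mad     # ~N(0,1) scale under normality
    return np.abs(modified_z) > threshold

def iqr_outliers(x, k=1.5):
    """Tukey's fences: based purely on order statistics."""
    q1, q3 = np.percentile(x, [25, 75])
    fence = k * (q3 - q1)
    return (x < q1 - fence) | (x > q3 + fence)

rng = np.random.default_rng(0)
data = np.append(rng.normal(0, 1, 200), [8, 9, 35, 40])  # moderate + extreme anomalies

naive = np.abs(stats.zscore(data)) > 3
print("naive z-score flags:", np.where(naive)[0])  # extremes inflate the std
                                                   # and mask the moderate outliers
print("MAD flags:          ", np.where(mad_outliers(data))[0])
print("IQR flags:          ", np.where(iqr_outliers(data))[0])
```

The naive Z-score catches only the two extreme points (the values 35 and 40 drag the global standard deviation to roughly 4, so 8 and 9 score under 3); the robust methods flag all four injected anomalies.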
The Cauchy distribution is so heavy-tailed that its mean is undefined and its variance infinite. The sample mean doesn't converge regardless of sample size—there is no population mean to converge to. Z-scores are mathematically meaningless for Cauchy data, yet naive implementations will compute them anyway without warning.
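Both points are easy to verify with scipy, using the unit-scale distributions (the tail probabilities match the Normal and Cauchy rows of the table above):

```python
import numpy as np
from scipy import stats

# Tail mass beyond 3 scale units: normal vs. Cauchy
print(f"Normal: P(|X| > 3) = {2 * stats.norm.sf(3):.4f}")    # ~0.0027
print(f"Cauchy: P(|X| > 3) = {2 * stats.cauchy.sf(3):.4f}")  # ~0.2048

# The running mean of Cauchy draws never settles down
rng = np.random.default_rng(3)
draws = stats.cauchy.rvs(size=100_000, random_state=rng)
for n in (100, 1_000, 10_000, 100_000):
    print(f"mean of first {n:>7} draws: {draws[:n].mean():10.2f}")
```

The running means jump erratically as single enormous draws dominate the sum, instead of stabilizing the way the law of large numbers would guarantee for a distribution with a finite mean.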
Time series data—measurements over time—often violates the stationarity assumption.
Trend: The mean systematically increases or decreases over time.
Seasonality: Regular periodic patterns.
Level Shifts: Sudden permanent changes.
Variance Changes (Heteroscedasticity): Volatility increases or decreases.
With trends:
With seasonality:
With level shifts:
1. Detrending
Fit and remove the trend component before anomaly detection. Common methods:

2. Deseasonalization
Remove seasonal components using:

3. Sliding Window Approaches
Compute statistics from a recent window rather than all historical data. Only recent data informs what's "normal" now.
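Detrending can be sketched in a few lines. This hypothetical example uses a linear-trend series with one injected spike and a degree-1 polynomial fit (the slope, noise level, and spike size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
t = np.arange(n)
series = 0.05 * t + rng.normal(0, 1, n)  # linear upward trend + noise
series[120] += 8                          # injected anomaly

# Fit and remove a linear trend, then apply ordinary z-scores to residuals
coeffs = np.polyfit(t, series, deg=1)
residuals = series - np.polyval(coeffs, t)

z = (residuals - residuals.mean()) / residuals.std(ddof=1)
print("flagged indices:", np.where(np.abs(z) > 3)[0])  # should include 120
```

Applied to the raw series, the same Z-score rule would be distorted: the trend inflates the global variance and shifts the mean, hiding the spike.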
```python
import numpy as np
from typing import Tuple


def sliding_window_zscore(
    data: np.ndarray,
    window_size: int,
    threshold: float = 3.0
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Compute Z-scores using a sliding window for non-stationary data.

    Each point is evaluated relative to the statistics of the
    preceding window of observations.

    Parameters
    ----------
    data : np.ndarray
        Time series data
    window_size : int
        Number of preceding observations to use for statistics
    threshold : float
        Z-score threshold for anomaly flagging

    Returns
    -------
    anomaly_mask : np.ndarray
        Boolean array indicating anomalies (first window_size entries are False)
    z_scores : np.ndarray
        Z-scores for each point (NaN for first window_size entries)
    """
    n = len(data)
    z_scores = np.full(n, np.nan)
    anomaly_mask = np.zeros(n, dtype=bool)

    for i in range(window_size, n):
        # Get the preceding window
        window = data[i - window_size:i]

        # Compute window statistics
        window_mean = np.mean(window)
        window_std = np.std(window, ddof=1)

        # Avoid division by zero
        if window_std > 0:
            z_scores[i] = (data[i] - window_mean) / window_std
            anomaly_mask[i] = np.abs(z_scores[i]) > threshold

    return anomaly_mask, z_scores


def exponential_moving_average_zscore(
    data: np.ndarray,
    alpha: float = 0.1,
    threshold: float = 3.0
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Compute Z-scores using exponentially weighted moving statistics.

    More recent observations have more influence on the statistics,
    allowing gradual adaptation to changing conditions.

    Parameters
    ----------
    data : np.ndarray
        Time series data
    alpha : float
        Smoothing factor (0 < alpha < 1). Higher = more weight on recent data.
    threshold : float
        Z-score threshold
    """
    n = len(data)
    z_scores = np.full(n, np.nan)
    anomaly_mask = np.zeros(n, dtype=bool)

    # Initialize with first observation
    ewm_mean = data[0]
    ewm_var = 0.0

    for i in range(1, n):
        # Update exponentially weighted mean
        diff = data[i] - ewm_mean
        ewm_mean = ewm_mean + alpha * diff

        # Update exponentially weighted variance
        ewm_var = (1 - alpha) * (ewm_var + alpha * diff**2)
        ewm_std = np.sqrt(ewm_var)

        # Compute Z-score relative to current EWM statistics
        if ewm_std > 0:
            z_scores[i] = diff / ewm_std
            anomaly_mask[i] = np.abs(z_scores[i]) > threshold

    return anomaly_mask, z_scores


# Example: Detecting anomalies in data with trend
np.random.seed(42)

# Generate data with trend and anomalies
n = 200
trend = np.linspace(0, 20, n)  # Linear upward trend
noise = np.random.normal(0, 2, n)
data = trend + noise

# Inject anomalies
anomaly_indices = [50, 100, 150]
data[50] += 15   # Spike
data[100] -= 12  # Drop
data[150] += 18  # Spike

# Compare global vs sliding window detection

# Global Z-score (will fail due to trend)
global_mean = np.mean(data)
global_std = np.std(data, ddof=1)
global_z = (data - global_mean) / global_std
global_anomalies = np.abs(global_z) > 3

# Sliding window Z-score
window_anomalies, window_z = sliding_window_zscore(data, window_size=30, threshold=3.0)

print(f"Injected anomaly indices: {anomaly_indices}")
print(f"Global Z-score detections: {np.where(global_anomalies)[0]}")
print(f"Sliding window detections: {np.where(window_anomalies)[0]}")

# Global method likely misses or has wrong detections due to trend
# Sliding window should find the actual anomalies
```

When data comes from multiple subpopulations with different characteristics, applying global statistics produces nonsensical results.
Multiple User Types:
Multiple Operating Regimes:
Mixture of Sources:
Consider bimodal data with two equal modes at 10 and 100. The global mean is around 55—a value that almost never occurs! Worse, the gap between the modes inflates the global standard deviation to roughly 45, so a point at 10 or 100 (both perfectly normal) sits only about one standard deviation from the mean, and even genuinely anomalous values well beyond either mode can fail to reach a Z-score of 3.
Meanwhile, a value at 55 (which never occurs naturally) would appear perfectly normal by Z-score—it's right at the mean!
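This failure is easy to reproduce. A sketch with two equal modes at 10 and 100 (the within-mode standard deviation of 3 is an illustrative choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Bimodal data: two equally sized modes at 10 and 100
data = np.concatenate([rng.normal(10, 3, 5000), rng.normal(100, 3, 5000)])

mean, std = data.mean(), data.std(ddof=1)
print(f"global mean = {mean:.1f}, global std = {std:.1f}")  # ~55 and ~45

for x in (10, 55, 100):
    print(f"x = {x:>3}: z = {(x - mean) / std:+.2f}")
# The modes sit only ~1 sigma out; 55 scores z ~ 0 despite never occurring

# Scoring against the mixture density gets it right
def mixture_pdf(x):
    return 0.5 * stats.norm.pdf(x, 10, 3) + 0.5 * stats.norm.pdf(x, 100, 3)

print(f"density at 100: {mixture_pdf(100):.4f}")  # high -> genuinely normal
print(f"density at  55: {mixture_pdf(55):.2e}")   # ~0  -> genuinely anomalous
```

Here the mixture components are known by construction; in practice a fitted mixture model (remedy 2 below) plays the role of `mixture_pdf`.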
1. Stratified Analysis
If you can identify subpopulations (e.g., a user_type column), analyze each separately. Compute statistics within each group.

2. Mixture Model Approach
Fit a Gaussian Mixture Model (GMM) to the data. For each point, evaluate its probability under the mixture—not relative to a single Gaussian.

3. Clustering First
Apply clustering (k-means, DBSCAN) to identify natural groups. Then detect anomalies within clusters or as points far from all cluster centers.

4. Density-Based Methods
Methods like Local Outlier Factor (LOF) or Isolation Forest implicitly handle multimodality by evaluating local density rather than global statistics.

5. Domain Knowledge
Often the subpopulations are known. Segment the data before applying statistical methods.
Always plot a histogram before applying statistical outlier detection. If you see multiple peaks, stop immediately—global statistics will be misleading. Either segment the data or use methods designed for multimodal distributions.
Given the assumptions and failure modes, how do you choose the right method for your data? Here's a decision framework:
| Data Characteristic | Recommended Method(s) | Avoid |
|---|---|---|
| Normal, univariate, clean | Z-score, Grubbs | None (ideal case) |
| Non-normal, univariate | IQR, MAD | Z-score, Grubbs |
| Normal, multivariate, clean | Mahalanobis (classical) | Univariate methods |
| Multivariate, unknown distribution | Mahalanobis with MCD | Classical covariance |
| Time series with trend | Sliding window, STL + detection | Global statistics |
| Multimodal | Segment first, or LOF/clustering | All global methods |
| High contamination | MCD, trimmed statistics | Classical mean/variance |
| Very high dimensions | PCA first, then detection | Direct Mahalanobis |
Statistical methods are powerful but not universal. Some situations call for fundamentally different approaches.
1. Complex Feature Interactions
When anomalies are defined by intricate combinations of features that can't be captured by linear covariance structures, consider:

2. Categorical or Mixed Data
Statistical methods assume continuous numeric data. For:

3. Labeled Anomaly Examples Available
If you have examples of known anomalies, use supervised or semi-supervised learning:

4. Very High Dimensions
In hundreds or thousands of dimensions:

5. Sequential/Graph Structure
When data has inherent structure:

6. Domain-Specific Anomaly Definitions
Statistical rarity isn't always the right definition of "anomaly":
Statistical methods detect the statistically unusual. If your definition of anomaly is different, use methods aligned with that definition.
Statistical methods are excellent for well-behaved numeric data where "anomaly" means "statistically improbable." But they're not the only tool. Distance-based, density-based, tree-based, and neural network-based methods each have strengths for different scenarios. Match the method to the data and the problem definition.
This module has provided a comprehensive treatment of statistical methods for anomaly detection. Let's consolidate the key lessons across all pages:
Statistical methods form the foundational layer of anomaly detection. They provide:
However, they are not sufficient for all problems. The subsequent modules in this chapter cover distance-based methods (LOF, k-NN), density-based methods (DBSCAN, kernel density), isolation-based methods (Isolation Forest), and one-class learning (One-Class SVM, autoencoders). Each extends the toolkit for different data characteristics and anomaly definitions.
Master the statistical foundations first—they provide the conceptual grounding for everything that follows.
Congratulations! You have completed the Statistical Methods module for Anomaly Detection. You now possess deep knowledge of Z-score, IQR, Grubbs' test, and Mahalanobis distance methods, along with critical understanding of their assumptions, limitations, and appropriate use cases. This knowledge forms the essential foundation for all anomaly detection work.