Both the Z-score and IQR methods provide rules of thumb for flagging outliers, but neither offers formal statistical significance testing. What if we need to declare, with statistical confidence, that a value doesn't belong to the normal population? What if regulatory compliance, scientific publication, or legal defensibility demands rigorous evidence?
Enter Grubbs' test (also known as the maximum normed residual test or extreme studentized deviate test). Developed by Frank E. Grubbs in 1950, this procedure provides a formal hypothesis testing framework with controlled Type I error rates and clear accept/reject decisions.
By the end of this page, you will understand the hypothesis testing framework for outlier detection, the mathematical derivation of Grubbs' test statistic, critical value computation, variants for detecting one-sided and multiple outliers, and the crucial assumptions that determine when Grubbs' test is valid.
Grubbs' test frames outlier detection as a formal statistical hypothesis test. This framework provides the mathematical machinery for making principled decisions.
Null Hypothesis ($H_0$): There are no outliers in the dataset. All observations come from the same normally distributed population.
Alternative Hypothesis ($H_1$): There is exactly one outlier in the dataset.
The test examines the most extreme observation (largest deviation from the mean) and determines whether its extremity is consistent with the null hypothesis.
Type I Error (False Positive): Rejecting $H_0$ when no outliers exist—flagging a legitimate value as an outlier.
Type II Error (False Negative): Failing to reject $H_0$ when an outlier exists—missing a true anomaly.
The significance level $\alpha$ controls the Type I error rate. If $\alpha = 0.05$, we accept a 5% chance of falsely flagging an outlier when none exists.
The basic Grubbs' test is designed to detect at most ONE outlier. If multiple outliers are suspected, the test can be applied iteratively (with corrections), or specialized variants for multiple outliers should be used. Testing for one outlier when two exist can lead to masking.
Consider a pharmaceutical company testing drug efficacy. A single anomalous measurement could skew clinical trial results, so the company needs a defensible procedure: a controlled false positive rate, a documented decision rule, and evidence that can withstand regulatory scrutiny.
A Z-score threshold of 3 lacks this formal framework. Grubbs' test provides it.
The Grubbs test statistic measures how extreme the suspected outlier is relative to the sample's variability.
Given a sample $\{x_1, x_2, \ldots, x_n\}$ from a normal population, the Grubbs test statistic for the most extreme observation is:
$$G = \frac{\max_{i} |x_i - \bar{x}|}{s}$$
Where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation (computed with $n-1$ degrees of freedom), and the maximum is taken over all $n$ observations.
This is simply the maximum absolute Z-score in the sample.
One-sided variants:
For testing only the minimum value (suspected low outlier): $$G_{\min} = \frac{\bar{x} - x_{(1)}}{s}$$
For testing only the maximum value (suspected high outlier): $$G_{\max} = \frac{x_{(n)} - \bar{x}}{s}$$
Where $x_{(1)}$ and $x_{(n)}$ are the minimum and maximum order statistics.
Under $H_0$ (all observations from the same normal distribution), the distribution of $G$ depends on the sample size $n$. This distribution is known as the distribution of the maximum of n correlated t-variates.
The exact distribution involves complex integrals, but accurate approximations and tabulated critical values exist.
Key insight: Even if all data truly comes from a normal distribution, we expect some extreme values simply by chance. With $n = 100$ observations, the most extreme is guaranteed to exist and will have some Z-score. Grubbs' test asks: "Is this extreme value too extreme to be explained by chance?"
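This chance effect is easy to see directly. A minimal simulation sketch (the sample sizes and trial count below are arbitrary choices for illustration) showing that the expected maximum absolute Z-score of a perfectly normal sample grows with $n$:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_max_abs_z(n: int, trials: int = 2000) -> float:
    """Average of max |z| over many standard-normal samples of size n."""
    samples = rng.standard_normal((trials, n))
    means = samples.mean(axis=1, keepdims=True)
    stds = samples.std(axis=1, ddof=1, keepdims=True)
    # Within-sample Z-scores, then the most extreme one per sample
    return float(np.abs((samples - means) / stds).max(axis=1).mean())

for n in (10, 100, 1000):
    print(f"n={n:5d}: average max |z| = {mean_max_abs_z(n):.2f}")
```

The average extreme Z-score climbs steadily with sample size even though no outliers exist, which is exactly why Grubbs' critical values must grow with $n$.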
The critical value $G_{\text{crit}}$ is tabulated for various sample sizes and significance levels. For a two-sided test at significance level $\alpha$:
$$G_{\text{crit}} = \frac{n-1}{\sqrt{n}} \sqrt{\frac{t_{\alpha/(2n), n-2}^2}{n - 2 + t_{\alpha/(2n), n-2}^2}}$$
Where $t_{\alpha/(2n), n-2}$ is the critical value of the t-distribution with $n-2$ degrees of freedom at significance level $\alpha/(2n)$.
Decision rule: If $G > G_{\text{crit}}$, reject $H_0$ and conclude the extreme value is an outlier.
| Sample Size n | G_crit (α=0.05) | G_crit (α=0.01) |
|---|---|---|
| 3 | 1.153 | 1.155 |
| 5 | 1.715 | 1.764 |
| 7 | 1.938 | 2.093 |
| 10 | 2.176 | 2.410 |
| 15 | 2.409 | 2.705 |
| 20 | 2.557 | 2.884 |
| 25 | 2.663 | 3.009 |
| 30 | 2.745 | 3.103 |
| 40 | 2.867 | 3.240 |
| 50 | 2.956 | 3.336 |
| 100 | 3.289 | 3.600 |
Notice how the critical value increases with sample size. With n=10, a Z-score of 2.2 is significant. With n=100, you need a Z-score above 3.3. This accounts for the fact that extreme values are more likely to occur by chance in larger samples.
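Rather than relying on a printed table, the critical value can be recomputed from the t-distribution formula above. A sketch (the helper name `g_crit` is ours); note that published tables differ in whether they list one-sided or two-sided values, so recomputing for your own convention is the safest cross-check:

```python
import numpy as np
from scipy import stats

def g_crit(n: int, alpha: float = 0.05, two_sided: bool = True) -> float:
    """Grubbs critical value from the t-distribution formula."""
    a = alpha / (2 * n) if two_sided else alpha / n
    t = stats.t.ppf(1 - a, n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

print(f"n=10, two-sided alpha=0.05: {g_crit(10):.3f}")                    # ~2.290
print(f"n=10, one-sided alpha=0.05: {g_crit(10, two_sided=False):.3f}")   # ~2.176
```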
Input: Dataset $\{x_1, x_2, \ldots, x_n\}$, significance level $\alpha$
Output: Whether an outlier exists, which point is the outlier (if any)
```
1. Compute sample mean x̄ and sample std s
2. Identify the point with maximum |xᵢ - x̄|
3. Compute G = max|xᵢ - x̄| / s
4. Compute critical value G_crit(n, α)
5. If G > G_crit:
       Return: outlier detected, flag the extreme point
   Else:
       Return: no outlier detected
```
Computational Complexity: $O(n)$
```python
import numpy as np
from scipy import stats
from typing import Tuple, Optional, NamedTuple


class GrubbsResult(NamedTuple):
    """Results from Grubbs' test for outliers."""
    has_outlier: bool
    outlier_index: Optional[int]
    outlier_value: Optional[float]
    G_statistic: float
    G_critical: float
    p_value: float


def grubbs_critical_value(n: int, alpha: float = 0.05,
                          two_sided: bool = True) -> float:
    """
    Compute the critical value for Grubbs' test.

    Parameters
    ----------
    n : int
        Sample size
    alpha : float
        Significance level
    two_sided : bool
        Whether to use the two-sided test

    Returns
    -------
    G_crit : float
        Critical value
    """
    if two_sided:
        alpha_adj = alpha / (2 * n)
    else:
        alpha_adj = alpha / n
    t_crit = stats.t.ppf(1 - alpha_adj, n - 2)
    return ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))


def grubbs_test(
    data: np.ndarray,
    alpha: float = 0.05,
    two_sided: bool = True
) -> GrubbsResult:
    """
    Perform Grubbs' test for detecting a single outlier.

    Parameters
    ----------
    data : np.ndarray
        1D array of observations (must be at least 3 points)
    alpha : float
        Significance level (default: 0.05)
    two_sided : bool
        If True, test for both high and low outliers.
        If False, test only for the more extreme outlier.

    Returns
    -------
    GrubbsResult : NamedTuple with test results
    """
    n = len(data)
    if n < 3:
        raise ValueError("Grubbs' test requires at least 3 observations")

    mean = np.mean(data)
    std = np.std(data, ddof=1)

    # Compute deviations from mean and find the most extreme point
    deviations = np.abs(data - mean)
    max_idx = np.argmax(deviations)

    # Compute Grubbs statistic
    G = deviations[max_idx] / std

    # Compute critical value
    G_crit = grubbs_critical_value(n, alpha, two_sided)

    # Compute approximate p-value by inverting the critical-value formula:
    # t^2 = n (n - 2) G^2 / ((n - 1)^2 - n G^2)
    denom = (n - 1)**2 - n * G**2
    if denom <= 0:
        # G is at (or beyond) its theoretical maximum (n-1)/sqrt(n)
        p_value = 0.0
    else:
        t_stat = np.sqrt(n * (n - 2) * G**2 / denom)
        tail = 1 - stats.t.cdf(t_stat, n - 2)
        p_value = (2 * n if two_sided else n) * tail
        p_value = min(p_value, 1.0)  # Cap at 1

    has_outlier = G > G_crit

    return GrubbsResult(
        has_outlier=has_outlier,
        outlier_index=int(max_idx) if has_outlier else None,
        outlier_value=float(data[max_idx]) if has_outlier else None,
        G_statistic=G,
        G_critical=G_crit,
        p_value=p_value
    )


def grubbs_test_iterative(
    data: np.ndarray,
    alpha: float = 0.05,
    max_outliers: int = 10
) -> Tuple[np.ndarray, list]:
    """
    Apply Grubbs' test iteratively to detect multiple outliers.

    Warning: for multiple outliers, this approach can suffer from
    masking. Consider using a dedicated multiple-outlier test.

    Parameters
    ----------
    data : np.ndarray
        1D array of observations
    alpha : float
        Significance level for each individual test
    max_outliers : int
        Maximum number of outliers to detect

    Returns
    -------
    mask : np.ndarray
        Boolean array where True indicates an outlier
    results : list
        List of GrubbsResult for each iteration
    """
    data_copy = data.copy()
    original_indices = np.arange(len(data))
    mask = np.zeros(len(data), dtype=bool)
    results = []

    for _ in range(max_outliers):
        if len(data_copy) < 3:
            break

        result = grubbs_test(data_copy, alpha)
        results.append(result)

        if not result.has_outlier:
            break

        # Mark the outlier in the original array
        mask[original_indices[result.outlier_index]] = True

        # Remove the outlier and continue
        keep_mask = np.ones(len(data_copy), dtype=bool)
        keep_mask[result.outlier_index] = False
        data_copy = data_copy[keep_mask]
        original_indices = original_indices[keep_mask]

    return mask, results


# Example usage
np.random.seed(42)

# Generate data with one outlier
normal_data = np.random.normal(50, 5, 20)
data_with_outlier = np.append(normal_data, 80)  # Add obvious outlier

# Run Grubbs' test
result = grubbs_test(data_with_outlier)

print("=== Grubbs' Test Results ===")
print(f"Sample size: {len(data_with_outlier)}")
print(f"G statistic: {result.G_statistic:.4f}")
print(f"Critical value (α=0.05): {result.G_critical:.4f}")
print(f"p-value: {result.p_value:.6f}")
print(f"Outlier detected: {result.has_outlier}")
if result.has_outlier:
    print(f"Outlier value: {result.outlier_value:.2f}")
    print(f"Outlier index: {result.outlier_index}")
```

The basic Grubbs' test is designed for at most one outlier. When multiple outliers are suspected, several approaches exist:
Problem: This approach suffers from masking. If two outliers are present, they inflate the variance, potentially causing both to appear non-extreme. The test may fail to detect either.
Partial mitigation: Apply a Bonferroni correction to the significance level: use $\alpha' = \alpha / k$ where $k$ is the expected number of outliers.
Rosner (1983) developed a more sophisticated procedure, the generalized extreme studentized deviate (ESD) test, which tests sequentially for up to $r$ outliers with a critical value adjusted at each step.
This approach handles masking by examining subsets of the data.
```python
import numpy as np
from scipy import stats
from typing import Tuple, List


def generalized_esd_test(
    data: np.ndarray,
    max_outliers: int,
    alpha: float = 0.05
) -> Tuple[int, np.ndarray, List[float], List[float]]:
    """
    Generalized Extreme Studentized Deviate (ESD) test for detecting
    multiple outliers (Rosner, 1983).

    Parameters
    ----------
    data : np.ndarray
        1D array of observations
    max_outliers : int
        Maximum number of outliers to test for (upper bound r)
    alpha : float
        Significance level

    Returns
    -------
    num_outliers : int
        Number of outliers detected
    outlier_indices : np.ndarray
        Indices of detected outliers in original data
    R_values : list
        Test statistics for each iteration
    lambda_values : list
        Critical values for each iteration
    """
    n = len(data)
    if max_outliers >= n - 2:
        raise ValueError("max_outliers must be less than n - 2")

    data_copy = data.copy()
    original_indices = np.arange(n)

    R_values = []         # Test statistics
    lambda_values = []    # Critical values
    removed_indices = []  # Indices of removed points (in original order)

    for i in range(max_outliers):
        # Compute mean and std of current subset (n - i points remain)
        mean_i = np.mean(data_copy)
        std_i = np.std(data_copy, ddof=1)

        # Find most extreme residual
        residuals = np.abs(data_copy - mean_i)
        max_idx = np.argmax(residuals)
        R_i = residuals[max_idx] / std_i
        R_values.append(R_i)

        # Compute critical value lambda_i (note: uses the ORIGINAL n,
        # reduced by the number of points removed so far)
        p = 1 - alpha / (2 * (n - i))
        t_crit = stats.t.ppf(p, n - i - 2)
        lambda_i = ((n - i - 1) * t_crit) / np.sqrt(
            (n - i - 2 + t_crit**2) * (n - i)
        )
        lambda_values.append(lambda_i)

        # Store the original index of the removed point
        removed_indices.append(original_indices[max_idx])

        # Remove the point and continue
        keep_mask = np.ones(len(data_copy), dtype=bool)
        keep_mask[max_idx] = False
        data_copy = data_copy[keep_mask]
        original_indices = original_indices[keep_mask]

    # Determine number of outliers:
    # the largest i for which R_i > lambda_i
    num_outliers = 0
    for i in range(max_outliers):
        if R_values[i] > lambda_values[i]:
            num_outliers = i + 1

    outlier_indices = np.array(removed_indices[:num_outliers], dtype=int)
    return num_outliers, outlier_indices, R_values, lambda_values


# Example with multiple outliers
np.random.seed(42)
normal_data = np.random.normal(50, 5, 25)
# Add 3 outliers
data = np.concatenate([normal_data, [85, 90, 15]])

num_outliers, indices, R, lam = generalized_esd_test(data, max_outliers=5)

print("=== Generalized ESD Test Results ===")
print(f"Number of outliers detected: {num_outliers}")
print(f"Outlier values: {data[indices]}")
print("Iteration details:")
for i in range(len(R)):
    status = "OUTLIER" if R[i] > lam[i] else "not outlier"
    print(f"  i={i+1}: R={R[i]:.3f}, λ={lam[i]:.3f} -> {status}")
```

For the generalized ESD test, you must specify an upper bound on the number of outliers. A common rule of thumb is to set max_outliers to about 10-25% of the sample size. The test then determines the actual number of outliers, anywhere from 0 up to max_outliers.
When domain knowledge dictates that outliers can occur in only one direction, one-sided tests provide more statistical power.
Null Hypothesis: The maximum value is not an outlier.
Test Statistic: $$G_{\max} = \frac{x_{(n)} - \bar{x}}{s}$$
Where $x_{(n)}$ is the maximum observation.
Critical value: Use $\alpha/n$ instead of $\alpha/(2n)$ in the formula.
Null Hypothesis: The minimum value is not an outlier.
Test Statistic: $$G_{\min} = \frac{\bar{x} - x_{(1)}}{s}$$
Where $x_{(1)}$ is the minimum observation.
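The minimum-only variant can be sketched by substituting $\alpha/n$ for $\alpha/(2n)$ in the critical-value formula. The function name and the battery-capacity numbers below are illustrative, not from the original text:

```python
import numpy as np
from scipy import stats

def grubbs_one_sided_min(data, alpha=0.05):
    """One-sided Grubbs test: is the sample minimum a low outlier?"""
    n = len(data)
    mean, s = np.mean(data), np.std(data, ddof=1)
    G_min = (mean - np.min(data)) / s
    # One-sided critical value: alpha/n in place of alpha/(2n)
    t = stats.t.ppf(1 - alpha / n, n - 2)
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return G_min > G_crit, G_min, G_crit

# Battery capacities (mAh): only low capacity is a defect
rng = np.random.default_rng(1)
capacities = np.append(rng.normal(3000, 50, 19), 2600)  # one weak cell
is_outlier, G_min, G_crit = grubbs_one_sided_min(capacities)
print(f"G_min = {G_min:.2f}, G_crit = {G_crit:.2f}, outlier: {is_outlier}")
```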
| Application | Test Type | Rationale |
|---|---|---|
| Temperature sensors | Two-sided | Both high and low readings might indicate failure |
| Response times | Maximum only | Only unusually long times are problematic |
| Drug dosage measurements | Two-sided | Both overdose and underdose are concerning |
| Stock returns | Minimum only | Only large losses may require investigation |
| Manufacturing dimensions | Two-sided | Both over and undersize are defects |
| Battery capacity | Minimum only | Only low capacity is a defect |
One-sided tests have more power (lower Type II error rate) than two-sided tests at the same significance level. If you genuinely know the direction, use a one-sided test. But if you're uncertain about the direction, always use the two-sided version to avoid missing outliers on the unexpected side.
Grubbs' test is only valid under specific conditions. Violating these assumptions can lead to incorrect conclusions.
1. Normality
The test is derived assuming the underlying population is normally distributed. For non-normal data, the nominal Type I error rate no longer holds: heavy-tailed distributions produce legitimate extreme values that get flagged as false positives, while skewed distributions bias the test toward flagging values in the long tail.
Rule of thumb: Visual inspection via Q-Q plots and formal tests (Shapiro-Wilk) should precede Grubbs' test.
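This pre-check can be sketched with `scipy.stats.shapiro`; the 0.05 cutoff and the sample parameters below are illustrative choices, not part of the original procedure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
normal_sample = rng.normal(50, 5, 30)
skewed_sample = rng.lognormal(3, 0.8, 30)  # deliberately non-normal

for name, sample in [("normal", normal_sample), ("lognormal", skewed_sample)]:
    stat, p = stats.shapiro(sample)
    verdict = ("plausibly normal: Grubbs OK" if p > 0.05
               else "non-normal: prefer robust methods")
    print(f"{name}: W={stat:.3f}, p={p:.4f} -> {verdict}")
```

A Q-Q plot alongside the formal test guards against relying on a single p-value, since Shapiro-Wilk itself is sensitive to the very outliers under investigation.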
2. Independence
Observations must be independent. Serial correlation (common in time series) invalidates the test because consecutive observations carry redundant information.
3. Known Sample Size
The critical values depend on $n$. For very large samples, the asymptotic behavior differs from tabulated values.
4. Single Outlier (for Basic Test)
The basic test is designed for zero or one outlier. Multiple outliers cause masking, potentially leading to no outliers being detected.
To use Grubbs' test, you must verify normality. But outliers can make normal data appear non-normal, and removing outliers changes the distribution. This circularity has no perfect solution. Best practice: 1) Visual inspection first, 2) Apply robust normality tests, 3) Consider robust alternatives if uncertain.
Grubbs' test is one of several formal tests for outliers. Understanding the alternatives helps you choose appropriately.
Dixon's Q test is an alternative for small samples (n ≤ 25). It uses the ratio of the gap between the suspected outlier and its nearest neighbor to the overall range:
$$Q = \frac{x_{(n)} - x_{(n-1)}}{x_{(n)} - x_{(1)}}$$
Advantages: Simple calculation, doesn't require mean/variance. Disadvantages: Only for very small samples, less powerful than Grubbs.
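A minimal sketch of the computation for a suspected high value. The critical values in `Q_CRIT_05` are the commonly tabulated 95%-confidence values for the $r_{10}$ statistic; verify against an authoritative table before relying on them:

```python
import numpy as np

# Commonly tabulated Dixon r10 critical values at 95% confidence
Q_CRIT_05 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625, 7: 0.568,
             8: 0.526, 9: 0.493, 10: 0.466}

def dixon_q_max(data):
    """Q statistic for the sample maximum: gap to nearest neighbor / range."""
    x = np.sort(np.asarray(data, dtype=float))
    return (x[-1] - x[-2]) / (x[-1] - x[0])

sample = [12.1, 12.4, 12.2, 12.5, 13.9]  # suspected high value: 13.9
Q = dixon_q_max(sample)
print(f"Q = {Q:.3f}, Q_crit(n=5) = {Q_CRIT_05[len(sample)]}")
print("outlier" if Q > Q_CRIT_05[len(sample)] else "not an outlier")
```

Here Q = 1.4/1.8 ≈ 0.778 exceeds the n=5 critical value of 0.710, so the high value is flagged, consistent with Grubbs-style reasoning but using only order statistics.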
The generalized ESD test, described earlier, handles multiple outliers by sequential testing with adjusted critical values.
Advantages: Handles multiple outliers correctly. Disadvantages: Requires specifying maximum number of outliers.
The Tietjen-Moore test evaluates the hypothesis that exactly $k$ outliers exist (you must specify $k$ in advance).
Advantages: Optimal when you know the exact number of outliers. Disadvantages: Rarely know $k$ in practice.
| Test | Sample Size | Number of Outliers | Assumptions |
|---|---|---|---|
| Grubbs | n ≥ 7 (practical) | 0 or 1 | Normality, independence |
| Dixon's Q | 3 ≤ n ≤ 25 | 0 or 1 | Normality, independence |
| Generalized ESD | n ≥ 25 | 0 to r (specified) | Normality, independence |
| Tietjen-Moore | n ≥ 7 | Exactly k (specified) | Normality, independence |
| Z-score (3σ) | Any | Any | Normality (informal) |
| IQR | Any | Any | None (non-parametric) |
The univariate methods we've covered (Z-score, IQR, Grubbs) all treat each variable independently. But in real-world data, anomalies often manifest in combinations of variables—a point might be normal in each dimension individually but anomalous considering all dimensions together. The next page covers Multivariate Methods, including the Mahalanobis distance, which extends statistical outlier detection to multiple dimensions.
You now understand Grubbs' test as a formal hypothesis testing framework for outlier detection, its statistical foundations, extensions for multiple outliers, and critical assumptions. This rigorous approach complements the heuristic methods (Z-score, IQR) when statistical confidence is required.