Both the Z-score and IQR methods provide rules of thumb for flagging outliers, but neither offers formal statistical significance testing. What if we need to declare, with statistical confidence, that a value doesn't belong to the normal population? What if regulatory compliance, scientific publication, or legal defensibility demands rigorous evidence?
Enter Grubbs' test (also known as the maximum normed residual test or extreme studentized deviate test). Developed by Frank E. Grubbs in 1950, this procedure provides a formal hypothesis testing framework with controlled Type I error rates and clear accept/reject decisions.
By the end of this page, you will understand the hypothesis testing framework for outlier detection, the mathematical derivation of Grubbs' test statistic, critical value computation, variants for detecting one-sided and multiple outliers, and the crucial assumptions that determine when Grubbs' test is valid.
Grubbs' test frames outlier detection as a formal statistical hypothesis test. This framework provides the mathematical machinery for making principled decisions.
Null Hypothesis ($H_0$): There are no outliers in the dataset. All observations come from the same normally distributed population.
Alternative Hypothesis ($H_1$): There is exactly one outlier in the dataset.
The test examines the most extreme observation (largest deviation from the mean) and determines whether its extremity is consistent with the null hypothesis.
Type I Error (False Positive): Rejecting $H_0$ when no outliers exist—flagging a legitimate value as an outlier.
Type II Error (False Negative): Failing to reject $H_0$ when an outlier exists—missing a true anomaly.
The significance level $\alpha$ controls the Type I error rate. If $\alpha = 0.05$, we accept a 5% chance of falsely flagging an outlier when none exists.
The basic Grubbs' test is designed to detect at most ONE outlier. If multiple outliers are suspected, the test can be applied iteratively (with corrections), or specialized variants for multiple outliers should be used. Testing for one outlier when two exist can lead to masking.
Consider a pharmaceutical company testing drug efficacy. A single anomalous measurement could skew clinical trial results, so the company needs a defensible procedure: a controlled false positive rate, a documented decision rule, and evidence that can withstand regulatory scrutiny.
A Z-score threshold of 3 lacks this formal framework. Grubbs' test provides it.
The Grubbs test statistic measures how extreme the suspected outlier is relative to the sample's variability.
Given a sample $\{x_1, x_2, \ldots, x_n\}$ from a normal population, the Grubbs test statistic for the most extreme observation is:
$$G = \frac{\max_{i} |x_i - \bar{x}|}{s}$$
Where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation (computed with $n-1$ degrees of freedom), and the maximum is taken over all $n$ observations.
This is simply the maximum absolute Z-score in the sample.
One-sided variants:
For testing only the minimum value (suspected low outlier): $$G_{\min} = \frac{\bar{x} - x_{(1)}}{s}$$
For testing only the maximum value (suspected high outlier): $$G_{\max} = \frac{x_{(n)} - \bar{x}}{s}$$
Where $x_{(1)}$ and $x_{(n)}$ are the minimum and maximum order statistics.
Under $H_0$ (all observations from the same normal distribution), the distribution of $G$ depends on the sample size $n$. This distribution is known as the distribution of the maximum of n correlated t-variates.
The exact distribution involves complex integrals, but accurate approximations and tabulated critical values exist.
Key insight: Even if all data truly comes from a normal distribution, we expect some extreme values simply by chance. With $n = 100$ observations, the most extreme is guaranteed to exist and will have some Z-score. Grubbs' test asks: "Is this extreme value too extreme to be explained by chance?"
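This chance effect is easy to see directly. A minimal simulation sketch (the sample sizes and trial count below are arbitrary choices for illustration) showing that the expected maximum absolute Z-score of a perfectly normal sample grows with $n$:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_max_abs_z(n: int, trials: int = 2000) -> float:
    """Average of max |z| over many standard-normal samples of size n."""
    samples = rng.standard_normal((trials, n))
    means = samples.mean(axis=1, keepdims=True)
    stds = samples.std(axis=1, ddof=1, keepdims=True)
    # Within-sample Z-scores, then the most extreme one per sample
    return float(np.abs((samples - means) / stds).max(axis=1).mean())

for n in (10, 100, 1000):
    print(f"n={n:5d}: average max |z| = {mean_max_abs_z(n):.2f}")
```

The average extreme Z-score climbs steadily with sample size even though no outliers exist, which is exactly why Grubbs' critical values must grow with $n$.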
The critical value $G_{\text{crit}}$ is tabulated for various sample sizes and significance levels. For a two-sided test at significance level $\alpha$:
$$G_{\text{crit}} = \frac{n-1}{\sqrt{n}} \sqrt{\frac{t_{\alpha/(2n), n-2}^2}{n - 2 + t_{\alpha/(2n), n-2}^2}}$$
Where $t_{\alpha/(2n), n-2}$ is the critical value of the t-distribution with $n-2$ degrees of freedom at significance level $\alpha/(2n)$.
Decision rule: If $G > G_{\text{crit}}$, reject $H_0$ and conclude the extreme value is an outlier.
| Sample Size n | G_crit (α=0.05) | G_crit (α=0.01) |
|---|---|---|
| 3 | 1.153 | 1.155 |
| 5 | 1.715 | 1.764 |
| 7 | 1.938 | 2.093 |
| 10 | 2.176 | 2.410 |
| 15 | 2.409 | 2.705 |
| 20 | 2.557 | 2.884 |
| 25 | 2.663 | 3.009 |
| 30 | 2.745 | 3.103 |
| 40 | 2.867 | 3.240 |
| 50 | 2.956 | 3.336 |
| 100 | 3.289 | 3.600 |
Notice how the critical value increases with sample size. With n=10, a Z-score of 2.2 is significant. With n=100, you need a Z-score above 3.3. This accounts for the fact that extreme values are more likely to occur by chance in larger samples.
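Rather than relying on a printed table, the critical value can be recomputed from the t-distribution formula above. A sketch (the helper name `g_crit` is ours); note that published tables differ in whether they list one-sided or two-sided values, so recomputing for your own convention is the safest cross-check:

```python
import numpy as np
from scipy import stats

def g_crit(n: int, alpha: float = 0.05, two_sided: bool = True) -> float:
    """Grubbs critical value from the t-distribution formula."""
    a = alpha / (2 * n) if two_sided else alpha / n
    t = stats.t.ppf(1 - a, n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

print(f"n=10, two-sided alpha=0.05: {g_crit(10):.3f}")                    # ~2.290
print(f"n=10, one-sided alpha=0.05: {g_crit(10, two_sided=False):.3f}")   # ~2.176
```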
Input: Dataset $\{x_1, x_2, \ldots, x_n\}$, significance level $\alpha$
Output: Whether an outlier exists, which point is the outlier (if any)
```
1. Compute sample mean x̄ and sample std s
2. Identify the point with maximum |xᵢ - x̄|
3. Compute G = max|xᵢ - x̄| / s
4. Compute critical value G_crit(n, α)
5. If G > G_crit:
       Return: outlier detected, flag the extreme point
   Else:
       Return: no outlier detected
```
Computational Complexity: $O(n)$
```python
import numpy as np
from scipy import stats
from typing import Tuple, Optional, NamedTuple


class GrubbsResult(NamedTuple):
    """Results from Grubbs' test for outliers."""
    has_outlier: bool
    outlier_index: Optional[int]
    outlier_value: Optional[float]
    G_statistic: float
    G_critical: float
    p_value: float


def grubbs_critical_value(n: int, alpha: float = 0.05,
                          two_sided: bool = True) -> float:
    """
    Compute the critical value for Grubbs' test.

    Parameters
    ----------
    n : int
        Sample size
    alpha : float
        Significance level
    two_sided : bool
        Whether to use the two-sided test

    Returns
    -------
    G_crit : float
        Critical value
    """
    if two_sided:
        alpha_adj = alpha / (2 * n)
    else:
        alpha_adj = alpha / n
    t_crit = stats.t.ppf(1 - alpha_adj, n - 2)
    return ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))


def grubbs_test(
    data: np.ndarray,
    alpha: float = 0.05,
    two_sided: bool = True
) -> GrubbsResult:
    """
    Perform Grubbs' test for detecting a single outlier.

    Parameters
    ----------
    data : np.ndarray
        1D array of observations (must be at least 3 points)
    alpha : float
        Significance level (default: 0.05)
    two_sided : bool
        If True, test for both high and low outliers.
        If False, test only for the more extreme outlier.

    Returns
    -------
    GrubbsResult : NamedTuple with test results
    """
    n = len(data)
    if n < 3:
        raise ValueError("Grubbs' test requires at least 3 observations")

    mean = np.mean(data)
    std = np.std(data, ddof=1)

    # Compute deviations from mean and find the most extreme point
    deviations = np.abs(data - mean)
    max_idx = np.argmax(deviations)

    # Compute Grubbs statistic
    G = deviations[max_idx] / std

    # Compute critical value
    G_crit = grubbs_critical_value(n, alpha, two_sided)

    # Compute approximate p-value by inverting the critical-value formula:
    # t^2 = n (n - 2) G^2 / ((n - 1)^2 - n G^2)
    denom = (n - 1)**2 - n * G**2
    if denom <= 0:
        # G is at (or beyond) its theoretical maximum (n-1)/sqrt(n)
        p_value = 0.0
    else:
        t_stat = np.sqrt(n * (n - 2) * G**2 / denom)
        tail = 1 - stats.t.cdf(t_stat, n - 2)
        p_value = (2 * n if two_sided else n) * tail
        p_value = min(p_value, 1.0)  # Cap at 1

    has_outlier = G > G_crit

    return GrubbsResult(
        has_outlier=has_outlier,
        outlier_index=int(max_idx) if has_outlier else None,
        outlier_value=float(data[max_idx]) if has_outlier else None,
        G_statistic=G,
        G_critical=G_crit,
        p_value=p_value
    )


def grubbs_test_iterative(
    data: np.ndarray,
    alpha: float = 0.05,
    max_outliers: int = 10
) -> Tuple[np.ndarray, list]:
    """
    Apply Grubbs' test iteratively to detect multiple outliers.

    Warning: for multiple outliers, this approach can suffer from
    masking. Consider using a dedicated multiple-outlier test.

    Parameters
    ----------
    data : np.ndarray
        1D array of observations
    alpha : float
        Significance level for each individual test
    max_outliers : int
        Maximum number of outliers to detect

    Returns
    -------
    mask : np.ndarray
        Boolean array where True indicates an outlier
    results : list
        List of GrubbsResult for each iteration
    """
    data_copy = data.copy()
    original_indices = np.arange(len(data))
    mask = np.zeros(len(data), dtype=bool)
    results = []

    for _ in range(max_outliers):
        if len(data_copy) < 3:
            break

        result = grubbs_test(data_copy, alpha)
        results.append(result)

        if not result.has_outlier:
            break

        # Mark the outlier in the original array
        mask[original_indices[result.outlier_index]] = True

        # Remove the outlier and continue
        keep_mask = np.ones(len(data_copy), dtype=bool)
        keep_mask[result.outlier_index] = False
        data_copy = data_copy[keep_mask]
        original_indices = original_indices[keep_mask]

    return mask, results


# Example usage
np.random.seed(42)

# Generate data with one outlier
normal_data = np.random.normal(50, 5, 20)
data_with_outlier = np.append(normal_data, 80)  # Add obvious outlier

# Run Grubbs' test
result = grubbs_test(data_with_outlier)

print("=== Grubbs' Test Results ===")
print(f"Sample size: {len(data_with_outlier)}")
print(f"G statistic: {result.G_statistic:.4f}")
print(f"Critical value (α=0.05): {result.G_critical:.4f}")
print(f"p-value: {result.p_value:.6f}")
print(f"Outlier detected: {result.has_outlier}")
if result.has_outlier:
    print(f"Outlier value: {result.outlier_value:.2f}")
    print(f"Outlier index: {result.outlier_index}")
```

The basic Grubbs' test is designed for at most one outlier. When multiple outliers are suspected, several approaches exist:
Problem: This approach suffers from masking. If two outliers are present, they inflate the variance, potentially causing both to appear non-extreme. The test may fail to detect either.
Partial mitigation: Apply a Bonferroni correction to the significance level: use $\alpha' = \alpha / k$ where $k$ is the expected number of outliers.
Rosner (1983) developed a more sophisticated procedure, the generalized extreme studentized deviate (ESD) test, which tests sequentially for up to $r$ outliers with a critical value adjusted at each step.
This approach handles masking by examining subsets of the data.
```python
import numpy as np
from scipy import stats
from typing import Tuple, List


def generalized_esd_test(
    data: np.ndarray,
    max_outliers: int,
    alpha: float = 0.05
) -> Tuple[int, np.ndarray, List[float], List[float]]:
    """
    Generalized Extreme Studentized Deviate (ESD) test for detecting
    multiple outliers (Rosner, 1983).

    Parameters
    ----------
    data : np.ndarray
        1D array of observations
    max_outliers : int
        Maximum number of outliers to test for (upper bound r)
    alpha : float
        Significance level

    Returns
    -------
    num_outliers : int
        Number of outliers detected
    outlier_indices : np.ndarray
        Indices of detected outliers in original data
    R_values : list
        Test statistics for each iteration
    lambda_values : list
        Critical values for each iteration
    """
    n = len(data)
    if max_outliers >= n - 2:
        raise ValueError("max_outliers must be less than n - 2")

    data_copy = data.copy()
    original_indices = np.arange(n)

    R_values = []         # Test statistics
    lambda_values = []    # Critical values
    removed_indices = []  # Indices of removed points (in original order)

    for i in range(max_outliers):
        # Compute mean and std of current subset (n - i points remain)
        mean_i = np.mean(data_copy)
        std_i = np.std(data_copy, ddof=1)

        # Find most extreme residual
        residuals = np.abs(data_copy - mean_i)
        max_idx = np.argmax(residuals)
        R_i = residuals[max_idx] / std_i
        R_values.append(R_i)

        # Compute critical value lambda_i (note: uses the ORIGINAL n,
        # reduced by the number of points removed so far)
        p = 1 - alpha / (2 * (n - i))
        t_crit = stats.t.ppf(p, n - i - 2)
        lambda_i = ((n - i - 1) * t_crit) / np.sqrt(
            (n - i - 2 + t_crit**2) * (n - i)
        )
        lambda_values.append(lambda_i)

        # Store the original index of the removed point
        removed_indices.append(original_indices[max_idx])

        # Remove the point and continue
        keep_mask = np.ones(len(data_copy), dtype=bool)
        keep_mask[max_idx] = False
        data_copy = data_copy[keep_mask]
        original_indices = original_indices[keep_mask]

    # Determine number of outliers:
    # the largest i for which R_i > lambda_i
    num_outliers = 0
    for i in range(max_outliers):
        if R_values[i] > lambda_values[i]:
            num_outliers = i + 1

    outlier_indices = np.array(removed_indices[:num_outliers], dtype=int)
    return num_outliers, outlier_indices, R_values, lambda_values


# Example with multiple outliers
np.random.seed(42)
normal_data = np.random.normal(50, 5, 25)
# Add 3 outliers
data = np.concatenate([normal_data, [85, 90, 15]])

num_outliers, indices, R, lam = generalized_esd_test(data, max_outliers=5)

print("=== Generalized ESD Test Results ===")
print(f"Number of outliers detected: {num_outliers}")
print(f"Outlier values: {data[indices]}")
print("Iteration details:")
for i in range(len(R)):
    status = "OUTLIER" if R[i] > lam[i] else "not outlier"
    print(f"  i={i+1}: R={R[i]:.3f}, λ={lam[i]:.3f} -> {status}")
```

For the generalized ESD test, you must specify an upper bound on the number of outliers. A common rule of thumb is to set max_outliers to about 10-25% of the sample size. The test then determines the actual number of outliers, anywhere from 0 up to max_outliers.
When domain knowledge dictates that outliers can occur in only one direction, one-sided tests provide more statistical power.
Null Hypothesis: The maximum value is not an outlier.
Test Statistic: $$G_{\max} = \frac{x_{(n)} - \bar{x}}{s}$$
Where $x_{(n)}$ is the maximum observation.
Critical value: Use $\alpha/n$ instead of $\alpha/(2n)$ in the formula.
Null Hypothesis: The minimum value is not an outlier.
Test Statistic: $$G_{\min} = \frac{\bar{x} - x_{(1)}}{s}$$
Where $x_{(1)}$ is the minimum observation.
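The minimum-only variant can be sketched by substituting $\alpha/n$ for $\alpha/(2n)$ in the critical-value formula. The function name and the battery-capacity numbers below are illustrative, not from the original text:

```python
import numpy as np
from scipy import stats

def grubbs_one_sided_min(data, alpha=0.05):
    """One-sided Grubbs test: is the sample minimum a low outlier?"""
    n = len(data)
    mean, s = np.mean(data), np.std(data, ddof=1)
    G_min = (mean - np.min(data)) / s
    # One-sided critical value: alpha/n in place of alpha/(2n)
    t = stats.t.ppf(1 - alpha / n, n - 2)
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return G_min > G_crit, G_min, G_crit

# Battery capacities (mAh): only low capacity is a defect
rng = np.random.default_rng(1)
capacities = np.append(rng.normal(3000, 50, 19), 2600)  # one weak cell
is_outlier, G_min, G_crit = grubbs_one_sided_min(capacities)
print(f"G_min = {G_min:.2f}, G_crit = {G_crit:.2f}, outlier: {is_outlier}")
```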
| Application | Test Type | Rationale |
|---|---|---|
| Temperature sensors | Two-sided | Both high and low readings might indicate failure |
| Response times | Maximum only | Only unusually long times are problematic |
| Drug dosage measurements | Two-sided | Both overdose and underdose are concerning |
| Stock returns | Minimum only | Only large losses may require investigation |
| Manufacturing dimensions | Two-sided | Both over and undersize are defects |
| Battery capacity | Minimum only | Only low capacity is a defect |
One-sided tests have more power (lower Type II error rate) than two-sided tests at the same significance level. If you genuinely know the direction, use a one-sided test. But if you're uncertain about the direction, always use the two-sided version to avoid missing outliers on the unexpected side.
Grubbs' test is only valid under specific conditions. Violating these assumptions can lead to incorrect conclusions.
1. Normality
The test is derived assuming the underlying population is normally distributed. For non-normal data, the nominal Type I error rate no longer holds: heavy-tailed distributions produce legitimate extreme values that get flagged as false positives, while skewed distributions bias the test toward flagging values in the long tail.
Rule of thumb: Visual inspection via Q-Q plots and formal tests (Shapiro-Wilk) should precede Grubbs' test.
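This pre-check can be sketched with `scipy.stats.shapiro`; the 0.05 cutoff and the sample parameters below are illustrative choices, not part of the original procedure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
normal_sample = rng.normal(50, 5, 30)
skewed_sample = rng.lognormal(3, 0.8, 30)  # deliberately non-normal

for name, sample in [("normal", normal_sample), ("lognormal", skewed_sample)]:
    stat, p = stats.shapiro(sample)
    verdict = ("plausibly normal: Grubbs OK" if p > 0.05
               else "non-normal: prefer robust methods")
    print(f"{name}: W={stat:.3f}, p={p:.4f} -> {verdict}")
```

A Q-Q plot alongside the formal test guards against relying on a single p-value, since Shapiro-Wilk itself is sensitive to the very outliers under investigation.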
2. Independence
Observations must be independent. Serial correlation (common in time series) invalidates the test because consecutive observations carry redundant information.
3. Known Sample Size
The critical values depend on $n$. For very large samples, the asymptotic behavior differs from tabulated values.
4. Single Outlier (for Basic Test)
The basic test is designed for zero or one outlier. Multiple outliers cause masking, potentially leading to no outliers being detected.
To use Grubbs' test, you must verify normality. But outliers can make normal data appear non-normal, and removing outliers changes the distribution. This circularity has no perfect solution. Best practice: 1) Visual inspection first, 2) Apply robust normality tests, 3) Consider robust alternatives if uncertain.
Grubbs' test is one of several formal tests for outliers. Understanding the alternatives helps you choose appropriately.
Dixon's Q test is an alternative for small samples (n ≤ 25). It uses the ratio of the gap between the suspected outlier and its nearest neighbor to the overall range:
$$Q = \frac{x_{(n)} - x_{(n-1)}}{x_{(n)} - x_{(1)}}$$
Advantages: Simple calculation, doesn't require mean/variance. Disadvantages: Only for very small samples, less powerful than Grubbs.
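A minimal sketch of the computation for a suspected high value. The critical values in `Q_CRIT_05` are the commonly tabulated 95%-confidence values for the $r_{10}$ statistic; verify against an authoritative table before relying on them:

```python
import numpy as np

# Commonly tabulated Dixon r10 critical values at 95% confidence
Q_CRIT_05 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625, 7: 0.568,
             8: 0.526, 9: 0.493, 10: 0.466}

def dixon_q_max(data):
    """Q statistic for the sample maximum: gap to nearest neighbor / range."""
    x = np.sort(np.asarray(data, dtype=float))
    return (x[-1] - x[-2]) / (x[-1] - x[0])

sample = [12.1, 12.4, 12.2, 12.5, 13.9]  # suspected high value: 13.9
Q = dixon_q_max(sample)
print(f"Q = {Q:.3f}, Q_crit(n=5) = {Q_CRIT_05[len(sample)]}")
print("outlier" if Q > Q_CRIT_05[len(sample)] else "not an outlier")
```

Here Q = 1.4/1.8 ≈ 0.778 exceeds the n=5 critical value of 0.710, so the high value is flagged, consistent with Grubbs-style reasoning but using only order statistics.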
The generalized ESD test, described earlier, handles multiple outliers by sequential testing with adjusted critical values.
Advantages: Handles multiple outliers correctly. Disadvantages: Requires specifying maximum number of outliers.
The Tietjen-Moore test evaluates the hypothesis that exactly $k$ outliers exist (you must specify $k$ in advance).
Advantages: Optimal when you know the exact number of outliers. Disadvantages: Rarely know $k$ in practice.
| Test | Sample Size | Number of Outliers | Assumptions |
|---|---|---|---|
| Grubbs | n ≥ 7 (practical) | 0 or 1 | Normality, independence |
| Dixon's Q | 3 ≤ n ≤ 25 | 0 or 1 | Normality, independence |
| Generalized ESD | n ≥ 25 | 0 to r (specified) | Normality, independence |
| Tietjen-Moore | n ≥ 7 | Exactly k (specified) | Normality, independence |
| Z-score (3σ) | Any | Any | Normality (informal) |
| IQR | Any | Any | None (non-parametric) |
The univariate methods we've covered (Z-score, IQR, Grubbs) all treat each variable independently. But in real-world data, anomalies often manifest in combinations of variables—a point might be normal in each dimension individually but anomalous considering all dimensions together. The next page covers Multivariate Methods, including the Mahalanobis distance, which extends statistical outlier detection to multiple dimensions.
You now understand Grubbs' test as a formal hypothesis testing framework for outlier detection, its statistical foundations, extensions for multiple outliers, and critical assumptions. This rigorous approach complements the heuristic methods (Z-score, IQR) when statistical confidence is required.