Among the assumptions of classical regression, normality of errors occupies a peculiar position. It is simultaneously one of the most frequently tested assumptions and one of the most forgiving when violated. Understanding this paradox is essential for making sound inferential decisions.
In the standard linear regression model: $$y = X\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I)$$
The normality assumption specifically states that the errors $\varepsilon_i$ are drawn from a Gaussian distribution. But why does this matter? The answer depends heavily on what you want to do with your regression.
This page covers: (1) When normality actually matters for regression inference; (2) Visual methods for assessing normality including Q-Q plots; (3) Formal statistical tests—Shapiro-Wilk, Anderson-Darling, Jarque-Bera; (4) Interpretation guidelines that account for sample size; and (5) Remedies when normality is severely violated.
The importance of normality depends on your analytical goals and sample size. Let's be precise about what normality buys you:
Where Normality Is Not Required

1. OLS Coefficient Estimation
The Gauss-Markov theorem guarantees that OLS estimates are BLUE (Best Linear Unbiased Estimators) under the assumptions of linearity, exogeneity (zero-mean errors), homoscedasticity, and uncorrelated errors—with no normality requirement. Your $\hat{\beta}$ values are valid estimates regardless of the error distribution.
2. Asymptotic Inference (Large Samples)
By the Central Limit Theorem, the sampling distribution of $\hat{\beta}$ approaches normality as $n \to \infty$, regardless of the error distribution. For large samples, t-tests and F-tests remain approximately valid even with non-normal errors.
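To make the asymptotic claim concrete, here is a minimal simulation sketch (the centered-exponential error distribution, sample sizes, and replication count are illustrative choices, not from the original text). It estimates the empirical Type I error of the slope t-test when the true slope is zero; the rate should approach the nominal 5% as $n$ grows despite the heavily skewed errors.

```python
import numpy as np
from scipy import stats

def t_test_rejection_rate(n, n_sims=2000, alpha=0.05, seed=0):
    """Empirical Type I error of the slope t-test under heavily
    skewed (centered exponential) errors when the true slope is 0."""
    rng = np.random.default_rng(seed)
    rejections = 0
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    for _ in range(n_sims):
        x = rng.normal(size=n)
        e = rng.exponential(scale=1.0, size=n) - 1.0  # skewed, mean-zero errors
        y = 1.0 + e                                   # true slope on x is zero
        X = np.column_stack([np.ones(n), x])
        XtX_inv = np.linalg.inv(X.T @ X)
        beta = XtX_inv @ X.T @ y
        resid = y - X @ beta
        sigma2 = resid @ resid / (n - 2)              # residual variance estimate
        se_slope = np.sqrt(sigma2 * XtX_inv[1, 1])
        rejections += abs(beta[1] / se_slope) > t_crit
    return rejections / n_sims

for n in (10, 30, 100, 500):
    print(f"n = {n:4d}: empirical Type I error = {t_test_rejection_rate(n):.3f}")
```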
Where Normality Is Required

1. Exact Small-Sample Inference

In small samples ($n < 30$), the exact t and F distributions of the test statistics are derived under normal errors. Non-normality can lead to actual Type I error rates that differ from the nominal level, misleading p-values, and confidence intervals whose true coverage departs from the stated level.
2. Prediction Intervals
Prediction intervals for new observations rely on normality. Without it, the stated coverage probability (e.g., 95%) may be far from accurate.
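For reference, the standard normal-theory prediction interval for a new observation at $x_0$ (with $p$ estimated coefficients) is:

$$\hat{y}_0 \pm t_{n-p,\,1-\alpha/2}\;\hat{\sigma}\sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}$$

Both the $t$ quantile and the form of the interval width come from the normal error model; with heavy-tailed errors, new observations fall outside this interval more often than the nominal rate suggests.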
3. Maximum Likelihood Equivalence
Under normality, OLS equals MLE. This has implications for information criteria (AIC, BIC) and likelihood ratio tests.
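To see the equivalence, write the Gaussian log-likelihood of the linear model:

$$\ell(\beta, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\lVert y - X\beta\rVert^2$$

For any fixed $\sigma^2$, maximizing over $\beta$ means minimizing $\lVert y - X\beta\rVert^2$, which is exactly the OLS criterion. When the errors are not normal, OLS still estimates $\beta$, but AIC, BIC, and likelihood ratio tests built from this Gaussian likelihood become approximations rather than exact likelihood-based quantities.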
| Sample Size | OLS Estimation | Hypothesis Testing | Confidence Intervals | Prediction Intervals |
|---|---|---|---|---|
| n < 15 | Unaffected | Potentially invalid | Potentially invalid | Potentially invalid |
| 15 ≤ n < 30 | Unaffected | Moderate concern | Moderate concern | Moderate concern |
| 30 ≤ n < 100 | Unaffected | Usually robust | Usually robust | May need adjustment |
| n ≥ 100 | Unaffected | Very robust (CLT) | Very robust | May need adjustment |
For moderately large samples (n > 30-50), mild to moderate non-normality is rarely a serious concern for inference. Severe departures—heavy tails, extreme skewness, multimodality—warrant attention regardless of sample size. Focus your diagnostic energy accordingly.
The quantile-quantile (Q-Q) plot is the most powerful visual tool for assessing normality. It directly compares the distribution of your residuals to the theoretical normal distribution.
If residuals are normally distributed, points should fall approximately on a straight line with slope $\sigma$ passing through $(0, 0)$.
Heavy Tails (Leptokurtic): points drift below the reference line at the left end and above it at the right end, so the extremes splay away from the line.

Light Tails (Platykurtic): the reverse pattern; extreme points bend toward the center, sitting above the line on the left and below it on the right.

Right Skewness: a convex (upward-curving) pattern, with both ends of the plot lying above the line.

Left Skewness: a concave (downward-curving) pattern, with both ends lying below the line.

Outliers: the bulk of the points track the line closely, but a few isolated points sit far from it at one or both extremes.
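For a quick look before the comprehensive implementation below, SciPy can draw a basic Q-Q plot in one call (a minimal sketch; `residuals` here is a stand-in for the residuals of your fitted model):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

residuals = np.random.default_rng(0).standard_normal(100)  # stand-in residuals

fig, ax = plt.subplots(figsize=(5, 5))
stats.probplot(residuals, dist="norm", plot=ax)  # points plus least-squares line
ax.set_title("Quick Q-Q Plot")
plt.show()
```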
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def comprehensive_qq_plot(residuals, title="Q-Q Plot", figsize=(10, 8)):
    """
    Create a detailed Q-Q plot with reference band and pattern annotations.

    Parameters
    ----------
    residuals : array-like
        Residuals from regression model
    title : str
        Plot title

    Returns
    -------
    dict with figure and normality statistics
    """
    e = np.array(residuals)
    n = len(e)

    # Sort residuals
    e_sorted = np.sort(e)

    # Compute theoretical quantiles (Blom formula)
    p = (np.arange(1, n + 1) - 0.375) / (n + 0.25)
    theoretical_quantiles = stats.norm.ppf(p)

    # Standardize residuals for plotting
    e_standardized = (e_sorted - np.mean(e)) / np.std(e, ddof=1)

    # Fit line through Q1 and Q3 (robust to outliers in the tails)
    q1_idx, q3_idx = int(0.25 * n), int(0.75 * n)
    slope = ((e_standardized[q3_idx] - e_standardized[q1_idx]) /
             (theoretical_quantiles[q3_idx] - theoretical_quantiles[q1_idx]))
    intercept = e_standardized[q1_idx] - slope * theoretical_quantiles[q1_idx]

    # Create figure
    fig, axes = plt.subplots(2, 2, figsize=figsize)

    # Main Q-Q plot
    ax1 = axes[0, 0]
    ax1.scatter(theoretical_quantiles, e_standardized, alpha=0.6,
                edgecolors='k', linewidth=0.5, s=40)

    # Reference lines: identity (perfect normality) and robust Q1-Q3 fit
    x_line = np.array([theoretical_quantiles.min(), theoretical_quantiles.max()])
    ax1.plot(x_line, x_line, 'r-', linewidth=2, label='Perfect normality')
    ax1.plot(x_line, intercept + slope * x_line, 'b--', linewidth=1.5,
             label='Q1-Q3 robust fit')

    # Confidence band (approximate)
    se = 1 / (stats.norm.pdf(theoretical_quantiles) * np.sqrt(n))
    ax1.fill_between(theoretical_quantiles,
                     theoretical_quantiles - 1.96 * se,
                     theoretical_quantiles + 1.96 * se,
                     alpha=0.2, color='red', label='95% confidence band')

    ax1.set_xlabel('Theoretical Quantiles (Standard Normal)', fontsize=11)
    ax1.set_ylabel('Sample Quantiles (Standardized)', fontsize=11)
    ax1.set_title('Q-Q Plot', fontsize=12, fontweight='bold')
    ax1.legend(loc='upper left')
    ax1.grid(True, alpha=0.3)

    # Histogram with normal overlay
    ax2 = axes[0, 1]
    ax2.hist(e_standardized, bins='auto', density=True, alpha=0.7,
             edgecolor='black', label='Residuals')
    x_norm = np.linspace(e_standardized.min(), e_standardized.max(), 100)
    ax2.plot(x_norm, stats.norm.pdf(x_norm), 'r-', linewidth=2,
             label='Standard Normal')
    ax2.set_xlabel('Standardized Residuals', fontsize=11)
    ax2.set_ylabel('Density', fontsize=11)
    ax2.set_title('Histogram vs Normal PDF', fontsize=12, fontweight='bold')
    ax2.legend()

    # Detrended Q-Q plot (deviations from line)
    ax3 = axes[1, 0]
    deviations = e_standardized - theoretical_quantiles
    ax3.scatter(theoretical_quantiles, deviations, alpha=0.6,
                edgecolors='k', linewidth=0.5, s=40)
    ax3.axhline(0, color='red', linestyle='--', linewidth=1.5)
    ax3.fill_between(theoretical_quantiles, -1.96 * se, 1.96 * se,
                     alpha=0.2, color='red')
    ax3.set_xlabel('Theoretical Quantiles', fontsize=11)
    ax3.set_ylabel('Deviation from Normal', fontsize=11)
    ax3.set_title('Detrended Q-Q Plot', fontsize=12, fontweight='bold')
    ax3.grid(True, alpha=0.3)

    # Summary statistics panel
    ax4 = axes[1, 1]
    ax4.axis('off')

    # Compute tests
    shapiro_stat, shapiro_p = stats.shapiro(e)
    skewness = stats.skew(e)
    kurtosis = stats.kurtosis(e)  # Excess kurtosis

    summary_text = f"""
    NORMALITY ASSESSMENT SUMMARY
    ═══════════════════════════════════
    Sample Size: {n}

    Descriptive Statistics:
    ───────────────────────────────────
    Skewness:        {skewness:8.4f} (0 = symmetric)
    Excess Kurtosis: {kurtosis:8.4f} (0 = normal tails)

    Formal Tests:
    ───────────────────────────────────
    Shapiro-Wilk W:  {shapiro_stat:8.4f}
    Shapiro-Wilk p:  {shapiro_p:8.4f}

    Interpretation:
    ───────────────────────────────────
    {"✓ No evidence against normality (p > 0.05)" if shapiro_p > 0.05 else "⚠ Evidence of non-normality (p ≤ 0.05)"}
    {"✓ Skewness acceptable (|skew| < 0.5)" if abs(skewness) < 0.5 else "⚠ Notable skewness" if abs(skewness) < 1 else "⚠ Severe skewness"}
    {"✓ Kurtosis acceptable (|kurt| < 1)" if abs(kurtosis) < 1 else "⚠ Notable kurtosis" if abs(kurtosis) < 2 else "⚠ Severe kurtosis"}
    """

    ax4.text(0.1, 0.95, summary_text, transform=ax4.transAxes,
             fontsize=10, verticalalignment='top', fontfamily='monospace',
             bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

    plt.suptitle(title, fontsize=14, fontweight='bold', y=1.02)
    plt.tight_layout()

    return {
        'figure': fig,
        'shapiro_stat': shapiro_stat,
        'shapiro_p': shapiro_p,
        'skewness': skewness,
        'kurtosis': kurtosis
    }

# Demonstrate with different distribution types
np.random.seed(42)
n = 100

# Normal residuals
e_normal = np.random.randn(n)
result = comprehensive_qq_plot(e_normal, "Normal Residuals")
plt.savefig('qq_normal.png', dpi=150, bbox_inches='tight')
plt.show()

# Heavy-tailed (t-distribution)
e_heavy = stats.t.rvs(df=3, size=n)
result = comprehensive_qq_plot(e_heavy, "Heavy-Tailed Residuals (t, df=3)")
plt.savefig('qq_heavy_tailed.png', dpi=150, bbox_inches='tight')
plt.show()

# Right-skewed
e_skewed = stats.lognorm.rvs(s=0.5, size=n)
e_skewed = e_skewed - np.mean(e_skewed)
result = comprehensive_qq_plot(e_skewed, "Right-Skewed Residuals")
plt.savefig('qq_skewed.png', dpi=150, bbox_inches='tight')
plt.show()
```

The Shapiro-Wilk test is widely considered the most powerful omnibus normality test for small to moderate sample sizes. It should be your default formal test for normality assessment.
The Shapiro-Wilk statistic is defined as: $$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
where $x_{(i)}$ is the $i$-th order statistic (the $i$-th smallest value), $\bar{x}$ is the sample mean, and the coefficients $a_i$ are derived from the means and covariance matrix of the order statistics of a standard normal sample.

W statistic range: $0 < W \leq 1$, with values near 1 indicating close agreement with normality.

Hypothesis test: $H_0$: the residuals are drawn from a normal distribution; $H_1$: they are not.

Decision rule: Reject normality if p-value < $\alpha$ (typically 0.05)

Sample size limits: the test requires $n \geq 3$; SciPy's implementation notes that for $n > 5000$ the W statistic remains accurate but the p-value may not be, and at such sizes the test flags trivial departures anyway.

Power characteristics: simulation studies consistently place Shapiro-Wilk at or near the top among omnibus normality tests against skewed, heavy-tailed, and contaminated alternatives, especially in small to moderate samples.
For large samples (n > 1000), formal normality tests become problematic. They detect trivial departures that have no practical impact on inference. In these cases, rely more heavily on visual assessment (Q-Q plots) and effect sizes (skewness, kurtosis) rather than p-values.
```python
import numpy as np
from scipy import stats

def shapiro_wilk_analysis(residuals, alpha=0.05):
    """
    Comprehensive Shapiro-Wilk normality test with interpretation.

    Parameters
    ----------
    residuals : array-like
        Residuals from regression model
    alpha : float
        Significance level

    Returns
    -------
    dict with test results and interpretation
    """
    e = np.array(residuals)
    n = len(e)

    # Check sample size constraints
    if n < 3:
        raise ValueError("Shapiro-Wilk requires at least 3 observations")
    if n > 5000:
        print("Warning: n > 5000. Shapiro-Wilk may be overly sensitive.")

    # Perform test
    statistic, p_value = stats.shapiro(e)

    # Effect size measures
    skewness = stats.skew(e)
    kurtosis = stats.kurtosis(e)  # Excess kurtosis

    # Interpretation
    if p_value > alpha:
        decision = "FAIL TO REJECT H₀"
        conclusion = "No significant evidence against normality."
    else:
        decision = "REJECT H₀"
        conclusion = "Significant evidence against normality."

    # Severity assessment based on effect sizes
    if abs(skewness) < 0.5 and abs(kurtosis) < 1:
        severity = "Negligible: Minor departures unlikely to affect inference"
    elif abs(skewness) < 1 and abs(kurtosis) < 2:
        severity = "Moderate: May affect small-sample inference"
    else:
        severity = "Severe: Consider transformations or robust methods"

    # Large-sample warning
    if n > 100 and p_value < alpha:
        large_sample_note = ("⚠ With n > 100, statistical significance may not "
                             "indicate practical significance. Check effect sizes.")
    else:
        large_sample_note = ""

    results = {
        'n': n,
        'statistic': statistic,
        'p_value': p_value,
        'alpha': alpha,
        'decision': decision,
        'conclusion': conclusion,
        'skewness': skewness,
        'kurtosis': kurtosis,
        'severity': severity,
        'large_sample_note': large_sample_note
    }

    # Print report
    print("═" * 50)
    print("SHAPIRO-WILK NORMALITY TEST")
    print("═" * 50)
    print(f"Sample size (n): {n}")
    print(f"W statistic: {statistic:.6f}")
    print(f"P-value: {p_value:.6f}")
    print(f"Significance (α): {alpha}")
    print("─" * 50)
    print(f"Decision: {decision}")
    print(f"Conclusion: {conclusion}")
    print("─" * 50)
    print("Effect Sizes:")
    print(f"  Skewness: {skewness:.4f}")
    print(f"  Excess Kurtosis: {kurtosis:.4f}")
    print(f"  Assessment: {severity}")
    if large_sample_note:
        print("─" * 50)
        print(large_sample_note)
    print("═" * 50)

    return results

# Demonstration
np.random.seed(42)

# Test 1: Normal residuals
print("\n" + "=" * 60)
print("EXAMPLE 1: True Normal Residuals")
print("=" * 60)
e_normal = np.random.randn(50)
results1 = shapiro_wilk_analysis(e_normal)

# Test 2: Mildly skewed (practical concern: usually OK)
print("\n" + "=" * 60)
print("EXAMPLE 2: Mildly Skewed Residuals")
print("=" * 60)
e_mild = stats.skewnorm.rvs(a=2, size=50)  # Mild skewness
e_mild = e_mild - np.mean(e_mild)
results2 = shapiro_wilk_analysis(e_mild)

# Test 3: Heavily skewed (practical concern: may need action)
print("\n" + "=" * 60)
print("EXAMPLE 3: Heavily Skewed Residuals")
print("=" * 60)
e_heavy = stats.lognorm.rvs(s=1, size=50)
e_heavy = e_heavy - np.mean(e_heavy)
results3 = shapiro_wilk_analysis(e_heavy)
```

While Shapiro-Wilk is generally recommended, other normality tests serve specific purposes and provide complementary information.
The Anderson-Darling test is an empirical distribution function (EDF) test that places more weight on the tails than the Kolmogorov-Smirnov test.
Statistic: $$A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)[\ln F(x_{(i)}) + \ln(1-F(x_{(n+1-i)}))]$$
where $F$ is the hypothesized CDF (standard normal with estimated parameters).
Strengths: weights the tails more heavily than Kolmogorov-Smirnov, so it catches heavy-tailed departures earlier; the critical values used by SciPy account for the mean and variance being estimated from the data; works well over a wide range of sample sizes.
When to use: When tail behavior is of particular concern (e.g., financial data, extreme value analysis).
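A minimal usage sketch with SciPy (note that `scipy.stats.anderson` reports critical values at fixed significance levels instead of a p-value; the heavy-tailed sample here is illustrative):

```python
import numpy as np
from scipy import stats

residuals = stats.t.rvs(df=3, size=200, random_state=0)  # illustrative heavy tails

result = stats.anderson(residuals, dist='norm')
print(f"A² = {result.statistic:.4f}")
for level, crit in zip(result.significance_level, result.critical_values):
    verdict = "reject normality" if result.statistic > crit else "consistent with normality"
    print(f"  {level:4.1f}% level: critical value {crit:.3f} -> {verdict}")
```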
The Jarque-Bera test directly tests whether skewness and kurtosis match normal distribution values.
Statistic: $$JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)$$
where $S$ is sample skewness and $K$ is sample kurtosis. Under $H_0$, $JB \sim \chi^2_2$ (approximately, for large n).
Strengths: computationally trivial; directly interpretable, since it is built from the skewness and kurtosis themselves; well suited to large samples and a long-standing default in econometrics.

Weaknesses: the $\chi^2_2$ approximation is unreliable in small samples ($n < 50$), where the test tends to be conservative; it is blind to departures from normality that leave the third and fourth moments approximately normal.
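Because the statistic depends only on sample moments, it is easy to verify by hand. The sketch below (illustrative centered lognormal data) computes JB from the skewness and kurtosis and checks it against SciPy's implementation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
e = rng.lognormal(sigma=0.5, size=500)
e = e - e.mean()

n = len(e)
S = stats.skew(e)            # sample skewness
K = stats.kurtosis(e) + 3    # raw kurtosis (scipy returns excess by default)
jb_manual = n / 6 * (S**2 + (K - 3)**2 / 4)
p_manual = stats.chi2.sf(jb_manual, df=2)

jb_scipy, p_scipy = stats.jarque_bera(e)
print(f"Manual: JB = {jb_manual:.4f}, p = {p_manual:.4g}")
print(f"SciPy:  JB = {jb_scipy:.4f}, p = {p_scipy:.4g}")
```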
The D'Agostino-Pearson $K^2$ test combines standardized skewness and kurtosis statistics: $$K^2 = Z_s^2 + Z_k^2$$
where $Z_s$ and $Z_k$ are transformed skewness and kurtosis with approximately standard normal distributions. Under $H_0$, $K^2 \sim \chi^2_2$.
Strengths: an omnibus test whose components are interpretable, since $Z_s$ and $Z_k$ can be inspected separately to see whether skewness or kurtosis drives a rejection; its normalizing transformations make the $\chi^2_2$ approximation reliable from roughly $n \geq 20$.
| Test | Best Sample Size | Special Sensitivity | Recommended Use |
|---|---|---|---|
| Shapiro-Wilk | n < 50 | Overall departures | Default choice for small samples |
| Anderson-Darling | n < 5000 | Tail behavior | When tails are critical |
| Jarque-Bera | n > 50 | Skewness & kurtosis | Large samples, econometrics |
| D'Agostino-Pearson | n > 20 | Skewness & kurtosis | Medium samples, omnibus test |
| Kolmogorov-Smirnov | Any | Central distribution | Generally less powerful—avoid |
```python
import numpy as np
from scipy import stats

def comprehensive_normality_battery(residuals, alpha=0.05):
    """
    Run multiple normality tests for comprehensive assessment.

    Parameters
    ----------
    residuals : array-like
        Residuals from regression model
    alpha : float
        Significance level

    Returns
    -------
    dict with all test results
    """
    e = np.array(residuals)
    n = len(e)

    results = {'n': n, 'alpha': alpha}

    print("═" * 60)
    print("COMPREHENSIVE NORMALITY TEST BATTERY")
    print("═" * 60)
    print(f"Sample size: {n}")
    print(f"Significance level: {alpha}")
    print("═" * 60)

    # 1. Shapiro-Wilk
    if n <= 5000:
        sw_stat, sw_p = stats.shapiro(e)
        results['shapiro_wilk'] = {'statistic': sw_stat, 'p_value': sw_p}
        verdict = "✓ PASS" if sw_p > alpha else "✗ FAIL"
        print(f"Shapiro-Wilk:       W  = {sw_stat:.4f}, p = {sw_p:.4f}  {verdict}")
    else:
        print("Shapiro-Wilk:       Skipped (n > 5000)")

    # 2. Anderson-Darling
    ad_result = stats.anderson(e, dist='norm')
    # Anderson-Darling returns critical values, not a p-value directly.
    # We compare the statistic to the critical value at the 5% level.
    ad_stat = ad_result.statistic
    ad_critical_5 = ad_result.critical_values[2]  # Index 2 is the 5% level
    ad_pass = ad_stat < ad_critical_5
    results['anderson_darling'] = {
        'statistic': ad_stat,
        'critical_5pct': ad_critical_5
    }
    verdict = "✓ PASS" if ad_pass else "✗ FAIL"
    print(f"Anderson-Darling:   A² = {ad_stat:.4f}, crit(5%) = {ad_critical_5:.4f}  {verdict}")

    # 3. D'Agostino-Pearson (K² test)
    if n >= 20:
        dp_stat, dp_p = stats.normaltest(e)
        results['dagostino_pearson'] = {'statistic': dp_stat, 'p_value': dp_p}
        verdict = "✓ PASS" if dp_p > alpha else "✗ FAIL"
        print(f"D'Agostino-Pearson: K² = {dp_stat:.4f}, p = {dp_p:.4f}  {verdict}")
    else:
        print("D'Agostino-Pearson: Skipped (n < 20)")

    # 4. Jarque-Bera
    jb_stat, jb_p = stats.jarque_bera(e)
    results['jarque_bera'] = {'statistic': jb_stat, 'p_value': jb_p}
    verdict = "✓ PASS" if jb_p > alpha else "✗ FAIL"
    if n < 50:
        print(f"Jarque-Bera:        JB = {jb_stat:.4f}, p = {jb_p:.4f}  {verdict} (⚠ n<50)")
    else:
        print(f"Jarque-Bera:        JB = {jb_stat:.4f}, p = {jb_p:.4f}  {verdict}")

    # Effect sizes
    skew = stats.skew(e)
    kurt = stats.kurtosis(e)
    results['skewness'] = skew
    results['kurtosis'] = kurt
    print("─" * 60)
    print(f"Effect sizes: Skewness = {skew:.4f}, Excess Kurtosis = {kurt:.4f}")

    # Overall assessment: count failures among the tests actually run
    print("═" * 60)
    n_tests = 0
    n_failures = 0
    if 'shapiro_wilk' in results:
        n_tests += 1
        if results['shapiro_wilk']['p_value'] <= alpha:
            n_failures += 1
    n_tests += 1  # Anderson-Darling always runs
    if not ad_pass:
        n_failures += 1
    if 'dagostino_pearson' in results:
        n_tests += 1
        if results['dagostino_pearson']['p_value'] <= alpha:
            n_failures += 1
    if n >= 50:  # Count Jarque-Bera only where its approximation is trustworthy
        n_tests += 1
        if jb_p <= alpha:
            n_failures += 1

    print(f"Tests failed: {n_failures}/{n_tests}")
    if n_failures == 0:
        print("OVERALL: No evidence against normality")
    elif n_failures <= n_tests // 2:
        print("OVERALL: Mixed evidence—inspect Q-Q plot carefully")
    else:
        print("OVERALL: Substantial evidence against normality")
    print("═" * 60)

    return results

# Demonstration
np.random.seed(42)

print("\n" + "▓" * 60)
print("  NORMAL DISTRIBUTION  ")
print("▓" * 60)
e_normal = np.random.randn(100)
r1 = comprehensive_normality_battery(e_normal)

print("\n" + "▓" * 60)
print("  HEAVY-TAILED (t, df=4)  ")
print("▓" * 60)
e_heavy = stats.t.rvs(df=4, size=100)
r2 = comprehensive_normality_battery(e_heavy)

print("\n" + "▓" * 60)
print("  RIGHT-SKEWED (lognormal)  ")
print("▓" * 60)
e_skewed = np.exp(np.random.randn(100) * 0.5)
e_skewed = e_skewed - np.mean(e_skewed)
r3 = comprehensive_normality_battery(e_skewed)
```

When normality is seriously violated and inference is affected, several remedial strategies are available. The choice depends on the nature of the non-normality and your analytical goals.
Applying a monotonic transformation to $y$ can often normalize residuals:
Box-Cox Transformation: $$y^{(\lambda)} = \begin{cases} \dfrac{y^\lambda - 1}{\lambda} & \lambda \neq 0 \\ \ln(y) & \lambda = 0 \end{cases}$$
The optimal $\lambda$ is chosen to maximize the (profiled) log-likelihood. Common special cases: $\lambda = 1$ (no transformation), $\lambda = 0.5$ (square root), $\lambda = 0$ (log), $\lambda = -1$ (inverse).
Limitations: requires $y > 0$ (or a shift before transforming); coefficients, predictions, and intervals now refer to the transformed scale, complicating interpretation; a single $\lambda$ may not fix skewness and heteroscedasticity simultaneously.
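In practice there is no need to hand-roll the $\lambda$ search: `scipy.stats.boxcox` transforms the data and estimates $\lambda$ by maximum likelihood in one call (a minimal sketch on illustrative positive, right-skewed data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
y = rng.lognormal(mean=1.0, sigma=0.6, size=200)  # positive, right-skewed

y_transformed, lmbda = stats.boxcox(y)  # lmbda=None -> MLE for lambda
print(f"Estimated lambda: {lmbda:.3f}")
print(f"Shapiro-Wilk p before: {stats.shapiro(y)[1]:.4g}")
print(f"Shapiro-Wilk p after:  {stats.shapiro(y_transformed)[1]:.4g}")
```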
When outliers or heavy tails drive non-normality, robust regression methods downweight observations with large residuals:
M-estimation: Minimizes $\sum \rho(e_i)$ where $\rho$ is a robust loss function (Huber, bisquare/Tukey)
Iteratively Reweighted Least Squares (IRLS): Implements M-estimation by repeatedly fitting weighted OLS
Advantages: resistant to outliers and heavy-tailed errors; keeps the response on its original scale, so coefficients stay directly interpretable; with the Huber loss, little efficiency is lost when the errors really are normal. A minimal implementation sketch follows below.
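Below is a minimal NumPy sketch of Huber M-estimation via IRLS, assuming a design matrix that already includes an intercept column; the tuning constant c = 1.345 is the conventional choice giving about 95% efficiency under normal errors. (For production work, `statsmodels`' `RLM` provides a full implementation.)

```python
import numpy as np

def huber_irls(X, y, c=1.345, max_iter=50, tol=1e-8):
    """M-estimation with Huber loss via iteratively reweighted least squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from OLS
    for _ in range(max_iter):
        resid = y - X @ beta
        # Robust scale estimate: MAD rescaled to be consistent for sigma
        s = np.median(np.abs(resid - np.median(resid))) / 0.6745
        u = resid / s
        # Huber weights: 1 inside [-c, c], downweighted outside
        w = np.where(np.abs(u) <= c, 1.0, c / np.abs(u))
        # Weighted least squares step using sqrt-weights
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Demo: OLS vs Huber on data with a few gross outliers
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[:5] += 15  # contaminate five observations
X = np.column_stack([np.ones(n), x])

print("OLS:  ", np.linalg.lstsq(X, y, rcond=None)[0])
print("Huber:", huber_irls(X, y))
```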
When distributional assumptions are suspect, the bootstrap provides nonparametric inference:
Residual Bootstrap: (1) fit OLS and save the fitted values $\hat{y}$ and residuals; (2) resample the residuals with replacement to obtain $e^*$; (3) form the bootstrap response $y^* = \hat{y} + e^*$; (4) refit the model to $(X, y^*)$; (5) repeat many times and use the empirical distribution of the bootstrap coefficients for standard errors and percentile confidence intervals.
Advantage: Makes no parametric assumption about the error distribution, though it still treats the errors as independent and identically distributed.
When to use: Small to moderate samples where the CLT cannot be relied upon
Before choosing a remedy, diagnose the source of non-normality. If it's a few outliers, robust regression may suffice. If it's systematic skewness, transformation addresses the root cause. If it's complex non-normality, bootstrap may be the safest approach.
```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar
import matplotlib.pyplot as plt

def box_cox_transform(y, lmbda):
    """Apply Box-Cox transformation."""
    if lmbda == 0:
        return np.log(y)
    else:
        return (y**lmbda - 1) / lmbda

def optimal_box_cox(y, X=None):
    """
    Find optimal Box-Cox lambda using profile likelihood.

    Parameters
    ----------
    y : array-like
        Positive response values
    X : array-like, optional
        Design matrix (for model-based optimization)

    Returns
    -------
    dict with optimal lambda and transformed values
    """
    y = np.array(y)

    if np.any(y <= 0):
        shift = np.abs(y.min()) + 1
        y = y + shift
        print(f"Warning: y shifted by {shift:.2f} to make all values positive")
    else:
        shift = 0

    n = len(y)

    def neg_log_likelihood(lmbda):
        """Negative profile log-likelihood for Box-Cox."""
        y_transformed = box_cox_transform(y, lmbda)
        # Variance of transformed data
        var_t = np.var(y_transformed, ddof=1)
        # Log-likelihood (up to constants)
        ll = -0.5 * n * np.log(var_t) + (lmbda - 1) * np.sum(np.log(y))
        return -ll

    # Find optimal lambda
    result = minimize_scalar(neg_log_likelihood, bounds=(-2, 2), method='bounded')
    optimal_lmbda = result.x

    # Transform with optimal lambda
    y_transformed = box_cox_transform(y, optimal_lmbda)

    # Test normality of transformed values
    sw_stat, sw_p = stats.shapiro(y_transformed)

    # Common named transformations
    if abs(optimal_lmbda) < 0.05:
        transform_name = "Log"
    elif abs(optimal_lmbda - 0.5) < 0.1:
        transform_name = "Square Root"
    elif abs(optimal_lmbda - 1) < 0.1:
        transform_name = "None (linear)"
    elif abs(optimal_lmbda + 1) < 0.1:
        transform_name = "Inverse"
    else:
        transform_name = f"Power {optimal_lmbda:.2f}"

    return {
        'optimal_lambda': optimal_lmbda,
        'transform_name': transform_name,
        'y_transformed': y_transformed,
        'shift_applied': shift,
        'shapiro_w': sw_stat,
        'shapiro_p': sw_p
    }

def residual_bootstrap_ci(X, y, n_bootstrap=1000, alpha=0.05, seed=42):
    """
    Compute bootstrap confidence intervals for regression coefficients.

    Parameters
    ----------
    X : ndarray of shape (n, p)
        Design matrix (without intercept)
    y : ndarray of shape (n,)
        Response vector
    n_bootstrap : int
        Number of bootstrap samples
    alpha : float
        Significance level for CI

    Returns
    -------
    dict with coefficients and confidence intervals
    """
    np.random.seed(seed)
    n = len(y)

    # Add intercept
    X_full = np.column_stack([np.ones(n), X])
    p = X_full.shape[1]

    # Fit original model
    XtX_inv = np.linalg.inv(X_full.T @ X_full)
    beta_hat = XtX_inv @ X_full.T @ y
    y_hat = X_full @ beta_hat
    residuals = y - y_hat

    # Bootstrap
    beta_boot = np.zeros((n_bootstrap, p))
    for b in range(n_bootstrap):
        # Resample residuals with replacement
        e_star = residuals[np.random.choice(n, n, replace=True)]
        # Create bootstrap response
        y_star = y_hat + e_star
        # Fit to bootstrap sample
        beta_boot[b] = XtX_inv @ X_full.T @ y_star

    # Compute percentile CIs
    lower = np.percentile(beta_boot, 100 * alpha / 2, axis=0)
    upper = np.percentile(beta_boot, 100 * (1 - alpha / 2), axis=0)
    se_boot = np.std(beta_boot, axis=0, ddof=1)

    return {
        'coefficients': beta_hat,
        'se_bootstrap': se_boot,
        'ci_lower': lower,
        'ci_upper': upper,
        'beta_boot': beta_boot
    }

# Demonstration
np.random.seed(42)
n = 100

# Create data with skewed response
X = np.random.randn(n, 2)
y_latent = 2 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + np.random.randn(n) * 0.5
y = np.exp(y_latent)  # Response is lognormal

print("=" * 60)
print("BOX-COX TRANSFORMATION ANALYSIS")
print("=" * 60)

# Original residuals
from sklearn.linear_model import LinearRegression
model_orig = LinearRegression()
model_orig.fit(X, y)
resid_orig = y - model_orig.predict(X)
sw_orig, p_orig = stats.shapiro(resid_orig)
print(f"\nOriginal residuals: Shapiro-Wilk p = {p_orig:.4f}")

# Find optimal Box-Cox
bc_result = optimal_box_cox(y)
print(f"\nOptimal λ: {bc_result['optimal_lambda']:.4f}")
print(f"Suggested transformation: {bc_result['transform_name']}")
print(f"Transformed residuals: Shapiro-Wilk p = {bc_result['shapiro_p']:.4f}")

print("\n" + "=" * 60)
print("RESIDUAL BOOTSTRAP INFERENCE")
print("=" * 60)

# Apply log transformation and bootstrap
y_log = np.log(y)
boot_results = residual_bootstrap_ci(X, y_log, n_bootstrap=2000)

print("\nCoefficients (log scale):")
for i, (coef, lower, upper, se) in enumerate(zip(
        boot_results['coefficients'], boot_results['ci_lower'],
        boot_results['ci_upper'], boot_results['se_bootstrap'])):
    name = 'Intercept' if i == 0 else f'β{i}'
    print(f"  {name}: {coef:.4f} (SE: {se:.4f}), 95% CI: [{lower:.4f}, {upper:.4f}]")
```

Normality testing occupies a nuanced position in regression diagnostics. It's important to understand both when to worry and when not to.
You now understand when normality matters, how to assess it visually and formally, and what to do when it fails. The next page tackles heteroscedasticity—the violation of constant variance that can fundamentally undermine both estimation efficiency and inference validity.