Every regression model tells two stories: the signal story (what the model captures about the relationship between predictors and response) and the residual story (what the model fails to explain). A fitted regression equation might look impressive on paper, but without examining residuals—the differences between observed and predicted values—we're flying blind.
Residual analysis is the diagnostic lens through which we scrutinize our models. It reveals whether our assumptions hold, whether our predictions are reliable, and whether hidden patterns lurk in the data that our model missed. In this sense, residual analysis isn't just a validation step—it's detective work.
This page will equip you with the mathematical foundations, visual diagnostic tools, and practical intuition needed to conduct rigorous residual analysis. You'll learn to read residual plots like a radiologist reads an X-ray—identifying subtle anomalies that reveal deeper structural issues.
By the end of this page, you will understand how to define, compute, and interpret residuals; distinguish between different residual types (raw, standardized, studentized); construct and analyze key diagnostic plots; and recognize the specific assumption violations each plot reveals. This knowledge is essential for any production ML system built on regression.
Let's establish the precise mathematical foundation. In a regression setting with response variable $y$ and fitted values $\hat{y}$, the residual for observation $i$ is defined as:
$$e_i = y_i - \hat{y}_i$$
This deceptively simple quantity encapsulates everything the model cannot explain. If our model were perfect and the true relationship were deterministic, all residuals would be exactly zero. In practice, residuals arise from:

- Irreducible random noise in the data-generating process
- Model misspecification (wrong functional form, omitted predictors)
- Measurement error in the response
- Estimation error in the fitted coefficients
The goal of residual analysis is to distinguish between these sources—determining whether observed residual patterns are consistent with random noise or indicative of systematic problems.
Residuals (e_i) are observable estimates of the true errors (ε_i). The true error ε_i = y_i − E[y_i|x_i] involves the unknown population regression function, while the residual e_i = y_i − ŷ_i uses our estimated function. This distinction matters: residuals are correlated with each other and with fitted values in ways that true errors are not. Understanding this distinction is essential for proper inference.
Under ordinary least squares (OLS) estimation, residuals possess important mathematical properties that arise directly from the normal equations. For a model $y = X\beta + \varepsilon$, the OLS residual vector is:
$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - X\hat{\beta} = (I - H)\mathbf{y}$$
where $H = X(X^TX)^{-1}X^T$ is the hat matrix (also called the projection matrix). Several key properties follow:
Property 1: Zero Sum $$\sum_{i=1}^{n} e_i = 0$$ Proof: The first normal equation states $X^T e = 0$. If the model includes an intercept, the first column of $X$ is $\mathbf{1}$ (all ones), so $\mathbf{1}^T e = \sum e_i = 0$.
Property 2: Orthogonality to Predictors $$X^T \mathbf{e} = 0$$ Meaning: Residuals are uncorrelated with each predictor column. This is exactly the condition that defines OLS—minimizing squared residuals yields predictions orthogonal to the residual space.
Property 3: Orthogonality to Fitted Values $$\hat{\mathbf{y}}^T \mathbf{e} = 0$$ Meaning: Residuals are uncorrelated with fitted values—a consequence of the previous property since $\hat{y}$ is a linear combination of predictor columns.
| Property | Mathematical Statement | Interpretation |
|---|---|---|
| Zero sum | $\sum e_i = 0$ | Intercept absorbs mean offset |
| Orthogonal to X | $X^T e = 0$ | No remaining linear signal in residuals |
| Orthogonal to ŷ | $\hat{y}^T e = 0$ | Predictions and errors are uncorrelated |
| Expected value zero | $E[e_i] = 0$ | Unbiased on average |
| Variance relationship | $\text{Var}(e_i) = \sigma^2(1-h_{ii})$ | Variance depends on leverage |
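These properties are easy to confirm numerically. The sketch below (synthetic data; all names are illustrative) fits OLS with `np.linalg.lstsq` and checks the zero-sum and both orthogonality properties:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Design matrix with intercept column plus two predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat  # raw residuals

print(abs(e.sum()))             # Property 1: zero sum (up to rounding)
print(np.abs(X.T @ e).max())    # Property 2: orthogonal to every predictor column
print(abs((X @ beta_hat) @ e))  # Property 3: orthogonal to fitted values
```

All three printed values are numerically zero, which is exactly what the normal equations guarantee.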
Not all residuals are created equal. For diagnostic purposes, we often transform raw residuals to account for the non-constant variance inherent in OLS residual structure. Understanding these transformations is critical for proper interpretation.
The simplest form, raw residuals are just the observed minus predicted: $$e_i = y_i - \hat{y}_i$$
Limitation: Raw residuals have unequal variances. Specifically, $\text{Var}(e_i) = \sigma^2(1 - h_{ii})$ where $h_{ii}$ is the $i$-th diagonal element of the hat matrix. Points with high leverage (large $h_{ii}$) have artificially smaller residuals, making comparison across observations misleading.
To enable comparison, we standardize by the estimated standard deviation: $$r_i = \frac{e_i}{\hat{\sigma}}$$
where $\hat{\sigma} = \sqrt{\frac{1}{n-p}\sum_{i=1}^{n}e_i^2}$ is the residual standard error (the square root of RSS divided by its $n-p$ degrees of freedom). This gives residuals on a consistent scale—approximately unit variance under the model assumptions.
Limitation: This still doesn't account for the fact that different observations have different residual variances due to leverage differences.
The diagonal element h_ii measures the 'leverage' of observation i—how much influence its x-values have on the regression fit. Values range from 1/n (for a point at the predictor centroid) to 1 (for perfectly high-leverage points). We'll explore leverage in depth when we cover influential points.
To properly account for leverage, we divide by the estimated standard deviation of each residual: $$r_i^{\text{int}} = \frac{e_i}{\hat{\sigma}\sqrt{1-h_{ii}}}$$
These are sometimes called standardized residuals in software (causing confusion with our earlier definition). Under normal errors, internally studentized residuals approximately follow a standard normal distribution.
The most sophisticated form uses a leave-one-out approach. For each observation $i$, we compute: $$t_i = \frac{e_i}{\hat{\sigma}_{(i)}\sqrt{1-h_{ii}}}$$
where $\hat{\sigma}_{(i)}$ is the residual standard error computed from a model fitted without observation $i$. This removes the self-influence problem—the residual is being standardized by variance estimated from independent data.
Computational efficiency: Remarkably, we don't need to refit $n$ separate models. The externally studentized residual can be computed as: $$t_i = r_i^{\text{int}} \sqrt{\frac{n-p-1}{n-p-(r_i^{\text{int}})^2}}$$
Under normal errors, $t_i$ follows a $t$-distribution with $n-p-1$ degrees of freedom, enabling formal outlier testing.
| Residual Type | Formula | Distribution (under H₀) | Primary Use |
|---|---|---|---|
| Raw | $e_i = y_i - \hat{y}_i$ | Non-standard | Basic diagnostics, RSS |
| Standardized | $e_i / \hat{\sigma}$ | Approx. N(0,1) | Scale-free comparison |
| Internally Studentized | $e_i / (\hat{\sigma}\sqrt{1-h_{ii}})$ | Approx. N(0,1) | Leverage-adjusted comparison |
| Externally Studentized | $e_i / (\hat{\sigma}_{(i)}\sqrt{1-h_{ii}})$ | t(n-p-1) | Formal outlier testing |
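To see that the efficient formula really matches the brute-force leave-one-out computation, here is a small numerical check (synthetic data with a planted outlier; all names are illustrative). It also sketches the formal outlier test using Bonferroni-corrected $t$ critical values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
y[0] += 4.0  # plant an outlier
p = X.shape[1]

# Full-model quantities: hat matrix diagonal, residuals, internal studentization
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
sigma_hat = np.sqrt(e @ e / (n - p))
r_int = e / (sigma_hat * np.sqrt(1 - h))

# Efficient external studentization (no refitting)
t_short = r_int * np.sqrt((n - p - 1) / (n - p - r_int**2))

# Brute force: refit the model without each observation i
t_brute = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    e_i = y[mask] - X[mask] @ b_i
    sigma_i = np.sqrt(e_i @ e_i / (n - 1 - p))  # sigma_hat_(i)
    t_brute[i] = e[i] / (sigma_i * np.sqrt(1 - h[i]))

print(np.max(np.abs(t_short - t_brute)))  # should be ~0: the two agree exactly

# Bonferroni-corrected outlier test at overall level 0.05
crit = stats.t.ppf(1 - 0.05 / (2 * n), df=n - p - 1)
print(np.where(np.abs(t_short) > crit)[0])  # observations flagged as outliers
```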
```python
import numpy as np

def compute_residual_types(X, y, include_intercept=True):
    """
    Compute all four types of residuals for regression diagnostics.

    Parameters:
    -----------
    X : ndarray of shape (n, p)
        Design matrix (without intercept column if include_intercept=True)
    y : ndarray of shape (n,)
        Response vector
    include_intercept : bool
        Whether to add intercept column to X

    Returns:
    --------
    dict with keys: 'raw', 'standardized', 'internally_studentized',
    'externally_studentized', 'leverage', 'sigma_hat'
    """
    n = len(y)

    # Add intercept if needed
    if include_intercept:
        X = np.column_stack([np.ones(n), X])
    p = X.shape[1]  # Number of parameters including intercept

    # Compute OLS estimates
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    y_hat = X @ beta_hat

    # Raw residuals
    e = y - y_hat

    # Hat matrix diagonal (leverage values)
    H = X @ XtX_inv @ X.T
    h = np.diag(H)

    # Residual standard error
    RSS = np.sum(e**2)
    sigma_hat = np.sqrt(RSS / (n - p))

    # Standardized residuals
    r_std = e / sigma_hat

    # Internally studentized residuals
    r_int = e / (sigma_hat * np.sqrt(1 - h))

    # Externally studentized residuals (efficient formula)
    r_ext = r_int * np.sqrt((n - p - 1) / (n - p - r_int**2))

    return {
        'raw': e,
        'standardized': r_std,
        'internally_studentized': r_int,
        'externally_studentized': r_ext,
        'leverage': h,
        'sigma_hat': sigma_hat,
        'n': n,
        'p': p
    }

# Example usage with synthetic data
np.random.seed(42)
n = 100
X = np.random.randn(n, 2)
true_beta = np.array([2, -1, 0.5])  # intercept, beta1, beta2
y = true_beta[0] + X @ true_beta[1:] + np.random.randn(n) * 0.5

# Add an outlier
y[0] = y[0] + 5

results = compute_residual_types(X, y)
print("Observation 1 (outlier) residuals:")
print(f"  Raw: {results['raw'][0]:.4f}")
print(f"  Standardized: {results['standardized'][0]:.4f}")
print(f"  Internally Studentized: {results['internally_studentized'][0]:.4f}")
print(f"  Externally Studentized: {results['externally_studentized'][0]:.4f}")
print(f"  Leverage (h_11): {results['leverage'][0]:.4f}")
```

The residual vs. fitted values plot is the workhorse of regression diagnostics. It reveals multiple assumption violations simultaneously and should be your first stop in any analysis.
Plot residuals $e_i$ (or standardized/studentized versions) on the y-axis against fitted values $\hat{y}_i$ on the x-axis. Each point represents one observation.
Under correct model specification and valid assumptions, this plot should show:

- Random scatter centered on the horizontal zero line
- Roughly constant vertical spread across the range of fitted values
- No curvature, trends, clusters, or isolated extreme points
The null pattern is a formless cloud—any structure reveals a problem.
The residual vs. fitted plot can reveal: (1) non-linearity (curved patterns), (2) heteroscedasticity (funnel shapes), (3) outliers (isolated points), and (4) missing predictors (systematic patterns). It cannot easily distinguish between these—additional plots are needed—but it alerts you that something is wrong.
Pattern 1: Curvature (U-shape or inverted U)
Diagnosis: The relationship between predictors and response is nonlinear. A linear model is missing quadratic or higher-order terms.
Example: Fitting a linear model to $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$ without the $x^2$ term produces residuals that are negative for extreme $\hat{y}$ values and positive in the middle (or vice versa).
Remedy: Add polynomial terms, apply transformations (log, sqrt), or consider nonlinear models.
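A quick numerical version of this remedy (synthetic data; names are illustrative): fit the linear model to data with a true quadratic term, confirm the curvature by correlating the residuals with $x^2$, then add the quadratic term and watch RSS collapse.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(-3, 3, n)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(size=n)  # true model is quadratic

def ols_rss(X, y):
    """Fit OLS and return (residuals, RSS)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return e, e @ e

X_lin = np.column_stack([np.ones(n), x])          # misspecified: missing x^2
X_quad = np.column_stack([np.ones(n), x, x**2])   # remedied model

e_lin, rss_lin = ols_rss(X_lin, y)
e_quad, rss_quad = ols_rss(X_quad, y)

# Curvature in the residual plot shows up as strong correlation with x^2
print(abs(np.corrcoef(e_lin, x**2)[0, 1]))  # large for the misspecified fit
print(rss_quad < rss_lin)                   # adding the missing term shrinks RSS
```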
Pattern 2: Funnel Shape (Heteroscedasticity)
Diagnosis: Residual variance increases (or decreases) with fitted values. The constant variance assumption is violated.
Example: In income prediction, high-income individuals show more variability than low-income individuals. Residuals fan out as $\hat{y}$ increases.
Remedy: Log-transform the response, use weighted least squares, or fit a model that accounts for heteroscedasticity (e.g., generalized least squares).
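A minimal weighted least squares sketch of the last remedy, under the (strong) assumption that the error standard deviation is proportional to a known quantity—here $1 + 2|x|$, matching the funnel simulation used later on this page:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
sd = 1 + 2 * np.abs(x)                    # assumed-known variance structure
y = 2 + 3 * x + rng.normal(size=n) * sd   # heteroscedastic errors

X = np.column_stack([np.ones(n), x])

# OLS: solve the normal equations X'X b = X'y
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# WLS: weight each observation by the inverse of its error variance
w = 1 / sd**2
b_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

print(b_ols, b_wls)  # both near the true (2, 3)
```

Both estimators are unbiased here; the gain from WLS is efficiency—its standard errors are valid and smaller, because noisy observations no longer get equal say.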
Pattern 3: Horizontal Band with Isolated Points
Diagnosis: General structure is acceptable but individual outliers exist. These may be data errors, special cases, or genuine unusual observations.
Remedy: Investigate outliers individually. Consider robust regression if outliers are genuine but shouldn't dominate the fit.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.nonparametric.smoothers_lowess import lowess

def create_diagnostic_plots(y, y_hat, residuals, leverage=None, figsize=(14, 10)):
    """
    Create a comprehensive 2x2 grid of regression diagnostic plots.

    Parameters:
    -----------
    y : array-like
        Observed response values
    y_hat : array-like
        Fitted/predicted values
    residuals : dict
        Dictionary with residual types from compute_residual_types()
    leverage : array-like, optional
        Leverage values (h_ii)
    """
    fig, axes = plt.subplots(2, 2, figsize=figsize)

    # Plot 1: Residuals vs Fitted
    ax1 = axes[0, 0]
    ax1.scatter(y_hat, residuals['raw'], alpha=0.6, edgecolors='k', linewidth=0.5)
    ax1.axhline(y=0, color='red', linestyle='--', linewidth=2)
    ax1.set_xlabel('Fitted Values', fontsize=12)
    ax1.set_ylabel('Residuals', fontsize=12)
    ax1.set_title('Residuals vs Fitted', fontsize=14, fontweight='bold')

    # Add LOWESS smoother for pattern detection
    smoothed = lowess(residuals['raw'], y_hat, frac=0.5)
    ax1.plot(smoothed[:, 0], smoothed[:, 1], color='blue', linewidth=2,
             label='LOWESS smoother')
    ax1.legend()

    # Plot 2: Q-Q Plot
    ax2 = axes[0, 1]
    stats.probplot(residuals['internally_studentized'], dist="norm", plot=ax2)
    ax2.set_title('Normal Q-Q', fontsize=14, fontweight='bold')
    ax2.get_lines()[0].set_markersize(5)
    ax2.get_lines()[0].set_alpha(0.6)

    # Plot 3: Scale-Location Plot
    ax3 = axes[1, 0]
    sqrt_abs_std_resid = np.sqrt(np.abs(residuals['internally_studentized']))
    ax3.scatter(y_hat, sqrt_abs_std_resid, alpha=0.6, edgecolors='k', linewidth=0.5)
    ax3.set_xlabel('Fitted Values', fontsize=12)
    ax3.set_ylabel('√|Standardized Residuals|', fontsize=12)
    ax3.set_title('Scale-Location', fontsize=14, fontweight='bold')
    smoothed = lowess(sqrt_abs_std_resid, y_hat, frac=0.5)
    ax3.plot(smoothed[:, 0], smoothed[:, 1], color='red', linewidth=2)

    # Plot 4: Residuals vs Leverage
    if leverage is not None:
        ax4 = axes[1, 1]
        ax4.scatter(leverage, residuals['internally_studentized'], alpha=0.6,
                    edgecolors='k', linewidth=0.5)
        ax4.axhline(y=0, color='gray', linestyle='--')
        ax4.set_xlabel('Leverage (h_ii)', fontsize=12)
        ax4.set_ylabel('Standardized Residuals', fontsize=12)
        ax4.set_title('Residuals vs Leverage', fontsize=14, fontweight='bold')

        # Add Cook's distance contours
        p = residuals['p']  # number of model parameters
        for cook_d in [0.5, 1.0]:
            h_range = np.linspace(0.001, max(leverage) * 1.1, 100)
            y_cook = np.sqrt(cook_d * p * (1 - h_range) / h_range)
            ax4.plot(h_range, y_cook, 'r--', alpha=0.4)
            ax4.plot(h_range, -y_cook, 'r--', alpha=0.4)

    plt.tight_layout()
    return fig

# Demonstrate with examples showing different patterns
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Example 1: Good residual pattern
np.random.seed(42)
n = 100
x = np.random.randn(n)
y_good = 2 + 3*x + np.random.randn(n)
y_hat_good = 2 + 3*x
e_good = y_good - y_hat_good

axes[0].scatter(y_hat_good, e_good, alpha=0.6)
axes[0].axhline(0, color='red', linestyle='--')
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Good Pattern\n(Random Scatter)', fontweight='bold')

# Example 2: Nonlinearity (quadratic term missing)
y_quad = 2 + 3*x + 2*x**2 + np.random.randn(n)
y_hat_linear = 2 + 3*x  # Wrong model (missing x²)
e_nonlin = y_quad - y_hat_linear

axes[1].scatter(y_hat_linear, e_nonlin, alpha=0.6)
axes[1].axhline(0, color='red', linestyle='--')
axes[1].set_xlabel('Fitted Values')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Nonlinearity\n(U-shaped Pattern)', fontweight='bold')

# Example 3: Heteroscedasticity (funnel)
y_hetero = 2 + 3*x + np.random.randn(n) * (1 + 2*np.abs(x))
y_hat_hetero = 2 + 3*x
e_hetero = y_hetero - y_hat_hetero

axes[2].scatter(y_hat_hetero, e_hetero, alpha=0.6)
axes[2].axhline(0, color='red', linestyle='--')
axes[2].set_xlabel('Fitted Values')
axes[2].set_ylabel('Residuals')
axes[2].set_title('Heteroscedasticity\n(Funnel Shape)', fontweight='bold')

plt.tight_layout()
plt.savefig('residual_patterns.png', dpi=150, bbox_inches='tight')
plt.show()
```

While residual vs. fitted plots reveal overall model issues, partial residual plots (also called component-plus-residual plots) let us examine the functional form of individual predictors. This is crucial for detecting nonlinear relationships in specific variables.
For predictor $x_j$, the partial residual is: $$e_i^{(j)} = e_i + \hat{\beta}_j x_{ij}$$
We then plot $e_i^{(j)}$ against $x_{ij}$. The resulting scatter should approximate the partial relationship between $y$ and $x_j$ while controlling for other predictors.
If the true relationship between $y$ and $x_j$ is linear, the partial residual plot will show a linear pattern with slope approximately $\hat{\beta}_j$. Curvature in the plot suggests the linear term is inadequate—we might need polynomial terms, splines, or transformations.
Consider the true model: $E[y|x] = \alpha + \beta_j x_j + g(x_{-j})$ where $g(x_{-j})$ captures all other predictor effects. The OLS residual is approximately: $$e \approx \varepsilon + (\text{nonlinearity in } x_j)$$
Adding back $\hat{\beta}_j x_j$ recovers the total contribution of $x_j$: $$e + \hat{\beta}_j x_j \approx \beta_j x_j + (\text{nonlinearity in } x_j) + \varepsilon$$
Plotting this against $x_j$ reveals the true functional form.
Component-plus-Residual plots assume other predictors have correct functional forms. If this is violated, patterns can be misleading. CERES (Combining conditional Expectations and RESiduals) plots address this by using smoother estimates instead of linear terms for other variables. Use car::ceresPlots() in R for this enhanced diagnostic.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from statsmodels.nonparametric.smoothers_lowess import lowess

def partial_residual_plots(X, y, feature_names=None, figsize=(14, 5)):
    """
    Create partial residual (component-plus-residual) plots for each predictor.

    Parameters:
    -----------
    X : ndarray of shape (n, p)
        Design matrix of predictors
    y : ndarray of shape (n,)
        Response variable
    feature_names : list of str, optional
        Names for each predictor column
    """
    n, p = X.shape
    if feature_names is None:
        feature_names = [f'X{i+1}' for i in range(p)]

    # Fit the model
    model = LinearRegression()
    model.fit(X, y)
    y_hat = model.predict(X)
    residuals = y - y_hat
    betas = model.coef_

    # Create subplot grid
    fig, axes = plt.subplots(1, p, figsize=figsize)
    if p == 1:
        axes = [axes]

    for j, (ax, name, beta) in enumerate(zip(axes, feature_names, betas)):
        # Compute partial residuals
        partial_resid = residuals + beta * X[:, j]

        # Plot
        ax.scatter(X[:, j], partial_resid, alpha=0.6, edgecolors='k', linewidth=0.5)

        # Add the fitted component line (slope = beta)
        x_range = np.linspace(X[:, j].min(), X[:, j].max(), 100)
        ax.plot(x_range, beta * x_range, 'r-', linewidth=2,
                label=f'Linear fit (β={beta:.3f})')

        # Add LOWESS smoother to detect nonlinearity
        smoothed = lowess(partial_resid, X[:, j], frac=0.5)
        ax.plot(smoothed[:, 0], smoothed[:, 1], 'b--', linewidth=2,
                label='LOWESS smoother')

        ax.set_xlabel(name, fontsize=12)
        ax.set_ylabel(f'e + β_{name}·{name}', fontsize=10)
        ax.set_title(f'Partial Residual: {name}', fontweight='bold')
        ax.legend(fontsize=8)

    plt.tight_layout()
    return fig

# Demonstration: detecting need for quadratic term
np.random.seed(42)
n = 200

# X1 has a linear effect, X2 has a quadratic effect (which we'll miss initially)
X1 = np.random.uniform(-3, 3, n)
X2 = np.random.uniform(-3, 3, n)
y = 2 + 3*X1 + 1.5*X2 + 0.8*X2**2 + np.random.randn(n)

X = np.column_stack([X1, X2])

# Create partial residual plots
fig = partial_residual_plots(X, y, feature_names=['X1 (linear)', 'X2 (quadratic ignored)'])
plt.suptitle('Partial Residual Plots Reveal Missing Quadratic Term in X2',
             fontsize=14, fontweight='bold', y=1.02)
plt.savefig('partial_residual_demo.png', dpi=150, bbox_inches='tight')
plt.show()

# The plot for X2 will show clear curvature, indicating we need an X2² term
```

Added variable plots (AVPs), also called partial regression plots, serve a different purpose from partial residual plots. They visualize the unique contribution of a predictor after controlling for other predictors, revealing the marginal relationship between $x_j$ and $y$.
For predictor $x_j$:

1. Regress $y$ on all predictors except $x_j$; keep the residuals $\tilde{y}$
2. Regress $x_j$ on all the other predictors; keep the residuals $\tilde{x}_j$
3. Plot $\tilde{y}$ against $\tilde{x}_j$
The slope of the regression line through this scatter equals $\hat{\beta}_j$ from the full model (Frisch-Waugh-Lovell theorem).
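The Frisch-Waugh-Lovell equivalence can be verified in a few lines (synthetic data; names are illustrative): the slope through the residualized scatter matches the full-model coefficient to machine precision.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
X = rng.normal(size=(n, 3))
X[:, 1] += 0.8 * X[:, 0]  # make predictors correlated
y = 1 + 2*X[:, 0] - X[:, 1] + 0.5*X[:, 2] + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])                # add intercept column
beta_full = np.linalg.lstsq(Xd, y, rcond=None)[0]   # full-model coefficients

# Residualize y and one predictor on everything else
j = 2  # column of Xd holding the second original predictor
others = [c for c in range(Xd.shape[1]) if c != j]
Z = Xd[:, others]
y_tilde = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
x_tilde = Xd[:, j] - Z @ np.linalg.lstsq(Z, Xd[:, j], rcond=None)[0]

slope_avp = (x_tilde @ y_tilde) / (x_tilde @ x_tilde)  # simple regression slope
print(abs(slope_avp - beta_full[j]))  # essentially zero
```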
AVPs show the marginal relationship between $x_j$ and $y$—what $x_j$ explains that no other predictor can. This is crucial because:

- With correlated predictors, a simple scatter of $y$ against $x_j$ conflates shared and unique signal; the AVP isolates the unique part
- The slope through the AVP equals the full-model coefficient $\hat{\beta}_j$ exactly, so the plot shows what that coefficient actually measures
- Points extreme in the horizontal direction are precisely those with high influence on $\hat{\beta}_j$
| Feature | Partial Residual Plot | Added Variable Plot |
|---|---|---|
| Purpose | Detect functional form (nonlinearity) | Visualize marginal/unique effect |
| X-axis | Original predictor values | Residualized predictor values |
| Y-axis | Residual + β·x | Residualized response |
| Slope meaning | Approximate marginal effect | Exact coefficient β_j |
| Reveals | Need for transformations | Unique contribution after controlling |
| Outlier detection | Y-direction outliers | High leverage points for x_j |
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

def added_variable_plots(X, y, feature_names=None, figsize=(14, 5)):
    """
    Create added variable (partial regression) plots.

    For each predictor x_j:
    1. Regress y on all X except x_j -> residuals = y_tilde
    2. Regress x_j on all X except x_j -> residuals = x_tilde
    3. Plot y_tilde vs x_tilde

    The slope of this plot equals the coefficient β_j from the full model.
    """
    n, p = X.shape
    if feature_names is None:
        feature_names = [f'X{i+1}' for i in range(p)]

    # Fit full model to get coefficients for comparison
    full_model = LinearRegression()
    full_model.fit(X, y)
    full_betas = full_model.coef_

    # Create subplot grid
    fig, axes = plt.subplots(1, p, figsize=figsize)
    if p == 1:
        axes = [axes]

    for j, (ax, name, full_beta) in enumerate(zip(axes, feature_names, full_betas)):
        # Get indices of other predictors
        other_idx = [i for i in range(p) if i != j]
        X_other = X[:, other_idx]

        # Regress y on other predictors
        model_y = LinearRegression()
        model_y.fit(X_other, y)
        y_tilde = y - model_y.predict(X_other)

        # Regress x_j on other predictors
        model_x = LinearRegression()
        model_x.fit(X_other, X[:, j])
        x_tilde = X[:, j] - model_x.predict(X_other)

        # Plot the added variable plot
        ax.scatter(x_tilde, y_tilde, alpha=0.6, edgecolors='k', linewidth=0.5)

        # Fit line through the AVP (slope should equal full_beta);
        # use ddof=1 so the variance matches np.cov's default
        avp_slope = np.cov(x_tilde, y_tilde)[0, 1] / np.var(x_tilde, ddof=1)
        x_range = np.linspace(x_tilde.min(), x_tilde.max(), 100)
        ax.plot(x_range, avp_slope * x_range, 'r-', linewidth=2,
                label=f'AVP slope: {avp_slope:.4f}')

        ax.axhline(0, color='gray', linestyle='--', alpha=0.5)
        ax.axvline(0, color='gray', linestyle='--', alpha=0.5)
        ax.set_xlabel(f'{name} | Others', fontsize=10)
        ax.set_ylabel('Y | Others', fontsize=10)
        ax.set_title(f'Added Variable: {name}\n(Full model β={full_beta:.4f})',
                     fontweight='bold')
        ax.legend(fontsize=8)

    plt.tight_layout()
    return fig

# Demonstration: multicollinearity impact
np.random.seed(42)
n = 200

# Create correlated predictors
X1 = np.random.randn(n)
X2 = 0.9 * X1 + 0.1 * np.random.randn(n)  # X2 highly correlated with X1
X3 = np.random.randn(n)                   # X3 independent

# True model: y depends on X1 and X3, not directly on X2
y = 2 + 3*X1 + 1.5*X3 + np.random.randn(n)

X = np.column_stack([X1, X2, X3])

# Create added variable plots
fig = added_variable_plots(X, y, feature_names=['X1', 'X2 (corr with X1)', 'X3'])
plt.suptitle('Added Variable Plots: X2 Shows Weak Marginal Effect (As Expected)',
             fontsize=12, fontweight='bold', y=1.05)
plt.savefig('added_variable_demo.png', dpi=150, bbox_inches='tight')
plt.show()
```

In time series or spatially ordered data, residuals may exhibit autocorrelation—correlation between residuals at different time points or locations. This violates the independence assumption and has serious consequences for inference.
When residuals are autocorrelated:

- Coefficient standard errors are underestimated (under positive autocorrelation), so t-statistics are inflated
- Confidence intervals are too narrow and p-values are too small
- $R^2$ and other goodness-of-fit measures overstate predictive ability
This is a severe problem: your model appears to work better than it actually does.
Visual Method: Residual Time Series Plot
Plot residuals in their natural order (time index, spatial index, etc.). Look for:

- Long runs of residuals with the same sign (positive autocorrelation)
- Rapid alternation between positive and negative values (negative autocorrelation)
- Trends, cycles, or seasonal patterns over the ordering
Formal Test: Durbin-Watson Statistic
The Durbin-Watson (DW) test detects first-order autocorrelation: $$d = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n}e_t^2}$$
Interpretation:

- $d \approx 2$: no first-order autocorrelation
- $d \to 0$: strong positive autocorrelation
- $d \to 4$: strong negative autocorrelation

A useful approximation is $d \approx 2(1 - \hat{\rho}_1)$, where $\hat{\rho}_1$ is the sample first-order autocorrelation of the residuals.
The test has complex critical values depending on sample size and number of predictors, but a rule of thumb is: $1.5 < d < 2.5$ suggests acceptable levels.
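The statistic itself is simple to compute directly. A sketch (synthetic residual series; names are illustrative) using an AR(1) simulation in the same style as the demonstration code later on this page:

```python
import numpy as np

def durbin_watson_stat(e):
    """Durbin-Watson: sum of squared successive differences divided by RSS."""
    e = np.asarray(e)
    return np.sum(np.diff(e)**2) / np.sum(e**2)

rng = np.random.default_rng(5)
n = 2000

# Independent residuals: d should sit near 2
white = rng.normal(size=n)

# Positively autocorrelated residuals, AR(1) with rho = 0.8:
# d should sit near 2*(1 - 0.8) = 0.4
ar1 = np.empty(n)
ar1[0] = rng.normal()
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t-1] + np.sqrt(1 - 0.8**2) * rng.normal()

print(durbin_watson_stat(white))  # ~2
print(durbin_watson_stat(ar1))    # ~0.4
```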
The DW test only detects first-order autocorrelation (correlation with immediately adjacent residuals). Higher-order autocorrelation requires other tests like the Breusch-Godfrey test. Also, DW is invalid if the model includes lagged dependent variables as predictors.
```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.stattools import durbin_watson
from statsmodels.graphics.tsaplots import plot_acf

def autocorrelation_diagnostics(residuals, max_lags=20, figsize=(14, 4)):
    """
    Comprehensive autocorrelation diagnostics for residuals.

    Parameters:
    -----------
    residuals : array-like
        Residuals from regression (in time order)
    max_lags : int
        Maximum number of lags for ACF plot
    """
    e = np.array(residuals)
    n = len(e)

    # Compute Durbin-Watson statistic
    dw = durbin_watson(e)

    # Compute first-order autocorrelation coefficient
    rho1 = np.corrcoef(e[:-1], e[1:])[0, 1]

    # Create diagnostic plots
    fig, axes = plt.subplots(1, 3, figsize=figsize)

    # Plot 1: Residuals over time
    ax1 = axes[0]
    ax1.plot(range(n), e, 'b-', alpha=0.7, linewidth=1)
    ax1.scatter(range(n), e, s=20, alpha=0.6)
    ax1.axhline(0, color='red', linestyle='--', linewidth=1.5)
    ax1.fill_between(range(n), e, 0, alpha=0.2)
    ax1.set_xlabel('Observation Order (Time)', fontsize=11)
    ax1.set_ylabel('Residual', fontsize=11)
    ax1.set_title('Residuals vs. Order', fontweight='bold')

    # Plot 2: Lag-1 scatter plot
    ax2 = axes[1]
    ax2.scatter(e[:-1], e[1:], alpha=0.6, edgecolors='k', linewidth=0.5)
    ax2.axhline(0, color='gray', linestyle='--', alpha=0.5)
    ax2.axvline(0, color='gray', linestyle='--', alpha=0.5)

    # Add regression line
    slope = rho1
    x_range = np.linspace(e.min(), e.max(), 100)
    ax2.plot(x_range, slope * x_range, 'r-', linewidth=2,
             label=f'ρ₁ = {rho1:.3f}')
    ax2.set_xlabel('e(t-1)', fontsize=11)
    ax2.set_ylabel('e(t)', fontsize=11)
    ax2.set_title('Lag-1 Scatter Plot', fontweight='bold')
    ax2.legend()

    # Plot 3: Autocorrelation function
    ax3 = axes[2]
    plot_acf(e, ax=ax3, lags=max_lags, alpha=0.05)
    ax3.set_xlabel('Lag', fontsize=11)
    ax3.set_title('Autocorrelation Function (ACF)', fontweight='bold')

    plt.suptitle(f'Durbin-Watson: {dw:.3f} | First-order autocorr: {rho1:.3f}',
                 fontsize=12, fontweight='bold', y=1.02)
    plt.tight_layout()

    return {
        'durbin_watson': dw,
        'rho1': rho1,
        'figure': fig
    }

# Demonstration: compare independent vs autocorrelated residuals
np.random.seed(42)
n = 100

# Case 1: No autocorrelation
e_independent = np.random.randn(n)

# Case 2: Positive autocorrelation (AR(1) process)
rho = 0.8
e_positive = np.zeros(n)
e_positive[0] = np.random.randn()
for i in range(1, n):
    e_positive[i] = rho * e_positive[i-1] + np.sqrt(1 - rho**2) * np.random.randn()

# Case 3: Negative autocorrelation
e_negative = np.zeros(n)
e_negative[0] = np.random.randn()
for i in range(1, n):
    e_negative[i] = -0.6 * e_negative[i-1] + np.sqrt(1 - 0.36) * np.random.randn()

# Run diagnostics on each
print("=== Independent Residuals ===")
results_ind = autocorrelation_diagnostics(e_independent)
print(f"D-W: {results_ind['durbin_watson']:.3f}, ρ₁: {results_ind['rho1']:.3f}")
plt.savefig('autocorr_independent.png', dpi=150, bbox_inches='tight')
plt.show()

print("=== Positive Autocorrelation ===")
results_pos = autocorrelation_diagnostics(e_positive)
print(f"D-W: {results_pos['durbin_watson']:.3f}, ρ₁: {results_pos['rho1']:.3f}")
plt.savefig('autocorr_positive.png', dpi=150, bbox_inches='tight')
plt.show()
```

Residual analysis is the first line of defense against flawed regression models. A model that looks great on summary statistics can harbor deep structural problems that only residual plots reveal.
You now have the foundational toolkit for residual analysis. In the next page, we'll dive deep into normality testing—examining why the normality assumption matters, which tests to use, and how to interpret departures from normality in your regression residuals.