Regression estimates summarize information from all observations, but not all observations contribute equally. Some data points—due to their position in predictor space, their unusual response values, or both—can dominate the regression fit. Understanding, detecting, and appropriately handling these influential observations is critical for robust analysis.
Consider this: removing a single observation from a 1,000-point dataset shouldn't dramatically change your conclusions. If it does, your model is not telling a story about the data—it's telling a story about that one point. This section equips you to recognize and address such situations.
This page covers: (1) The distinction between leverage, outliers, and influential points; (2) The hat matrix and leverage values; (3) Influence measures—Cook's distance, DFBETAS, DFFITS; (4) Visual diagnostics for influence; (5) Strategies for handling influential points without discarding valid data.
Three related but distinct concepts characterize unusual observations. Understanding their interplay is fundamental:
Leverage measures how unusual an observation's predictor values are—how far it lies from the center of the predictor space. Formally, leverage is defined through the hat matrix $H$: $$H = X(X^TX)^{-1}X^T$$
The $i$-th diagonal element $h_{ii}$ is the leverage of observation $i$: $$h_{ii} = x_i^T(X^TX)^{-1}x_i$$
Interpretation:
- $h_{ii}$ lies between $1/n$ and $1$ (for a model with an intercept); values near $1/n$ indicate a point near the center of the predictor cloud, values near $1$ a point far from it.
- $h_{ii} = \partial \hat{y}_i / \partial y_i$: leverage is the weight observation $i$'s own response receives in its own fitted value.
Key insight: Leverage is a function of X only—it doesn't depend on y at all. A high-leverage point can have a perfectly ordinary y-value.
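This fact is easy to verify directly. The sketch below (with arbitrary simulated predictors) builds the hat matrix from $X$ alone; no response enters the computation, and the trace recovers $p$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])  # intercept + one predictor

# The hat matrix -- and hence every leverage value -- is built from X alone
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# No y appears anywhere above; swapping in any response leaves h unchanged
print(f"leverage range: [{h.min():.3f}, {h.max():.3f}]")
print(f"trace(H) = {h.sum():.3f}  (equals p = 2)")
```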
Outliers are observations whose response values are unusual given their predictor values. They appear as large residuals. A point can be:
- an outlier without high leverage (unusual $y$ amid typical $x$ values),
- a high-leverage point without being an outlier (extreme $x$, but a consistent $y$), or
- both at once.
Influence combines leverage and residual magnitude. A point is influential if removing it substantially changes the regression results.
The key relationship: $$\text{Influence} \approx \text{Leverage} \times \text{Outlier severity}$$
A point needs both to be influential:
| Scenario | Leverage | Residual | Influence | Interpretation |
|---|---|---|---|---|
| Typical point | Low | Small | Low | Normal observation |
| Y-outlier among bulk | Low | Large | Moderate | Unusual y, but constrained by neighbors |
| Remote point on line | High | Small | Low | Confirms pattern at extreme |
| Remote outlier | High | Large | High | Pulls regression line—investigate |
High-leverage points tend to have small residuals even when they're outliers—they 'pull' the regression line toward themselves, hiding their own unusual nature. This is why residual plots alone can miss influential points. You must examine leverage explicitly.
The hat matrix is central to understanding leverage and its effects on residuals.
The hat matrix $H = X(X^TX)^{-1}X^T$ has several important properties:
1. Projects y onto the column space of X: $$\hat{y} = Hy$$
Hence the name "hat" matrix—it puts the hat on y.
2. Symmetric: $H^T = H$
3. Idempotent: $H^2 = H$
4. Trace equals the number of parameters: $$\text{tr}(H) = \sum_{i=1}^{n} h_{ii} = p$$
where $p$ is the number of model parameters (predictor columns plus the intercept).
5. Average leverage: $$\bar{h} = \frac{p}{n}$$
Since average leverage is $p/n$, we flag high leverage points as those with: $$h_{ii} > 2 \cdot \frac{p}{n} = \frac{2p}{n}$$
or for stricter screening: $$h_{ii} > 3 \cdot \frac{p}{n} = \frac{3p}{n}$$
Intuition: With $n = 100$ observations and $p = 5$ predictors, average leverage is 0.05. Points with leverage > 0.10 (twice average) warrant inspection.
For simple regression with standardized predictor: $$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}$$
This shows leverage increases with distance from the predictor mean—points at extreme x-values have high leverage.
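The closed form can be checked against the full hat-matrix computation. A minimal sketch with simulated data (any predictor vector works):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
X = np.column_stack([np.ones_like(x), x])

# Hat-matrix diagonal from the general definition
h_full = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Closed form for simple regression: 1/n + (x_i - x_bar)^2 / sum_j (x_j - x_bar)^2
h_formula = 1 / len(x) + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

print(np.allclose(h_full, h_formula))  # → True
```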
```python
import numpy as np
import matplotlib.pyplot as plt


def compute_leverage(X, include_intercept=True):
    """
    Compute leverage values (hat matrix diagonal) for design matrix X.

    Parameters
    ----------
    X : ndarray of shape (n, p)
        Design matrix (without intercept column if include_intercept=True)
    include_intercept : bool
        Whether to add an intercept column

    Returns
    -------
    dict with leverage values and thresholds
    """
    n = X.shape[0] if X.ndim > 1 else len(X)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    if include_intercept:
        X = np.column_stack([np.ones(n), X])
    p = X.shape[1]

    # Compute hat matrix
    XtX_inv = np.linalg.inv(X.T @ X)
    H = X @ XtX_inv @ X.T

    # Extract diagonal (leverage values)
    leverage = np.diag(H)

    # Thresholds
    avg_leverage = p / n
    threshold_2x = 2 * avg_leverage
    threshold_3x = 3 * avg_leverage

    # Identify high-leverage points
    high_lev_idx = np.where(leverage > threshold_2x)[0]
    very_high_lev_idx = np.where(leverage > threshold_3x)[0]

    return {
        'leverage': leverage,
        'n': n,
        'p': p,
        'avg_leverage': avg_leverage,
        'threshold_2x': threshold_2x,
        'threshold_3x': threshold_3x,
        'high_leverage_idx': high_lev_idx,
        'very_high_leverage_idx': very_high_lev_idx
    }


def leverage_diagnostic_plot(X, y, leverage_info=None, figsize=(12, 5)):
    """Create diagnostic plots for leverage analysis."""
    n = len(y)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    if leverage_info is None:
        leverage_info = compute_leverage(X)

    h = leverage_info['leverage']
    threshold = leverage_info['threshold_2x']

    fig, axes = plt.subplots(1, 2, figsize=figsize)

    # Plot 1: Leverage values vs observation index
    ax1 = axes[0]
    colors = ['red' if hi > threshold else 'blue' for hi in h]
    ax1.scatter(range(n), h, c=colors, alpha=0.6, edgecolors='k', linewidth=0.5)
    ax1.axhline(leverage_info['threshold_2x'], color='orange', linestyle='--',
                linewidth=1.5, label=f'2×avg ({leverage_info["threshold_2x"]:.3f})')
    ax1.axhline(leverage_info['threshold_3x'], color='red', linestyle='--',
                linewidth=1.5, label=f'3×avg ({leverage_info["threshold_3x"]:.3f})')
    ax1.axhline(leverage_info['avg_leverage'], color='green', linestyle='-',
                alpha=0.7, label=f'avg ({leverage_info["avg_leverage"]:.3f})')
    ax1.set_xlabel('Observation Index', fontsize=11)
    ax1.set_ylabel('Leverage (h_ii)', fontsize=11)
    ax1.set_title('Leverage Values', fontsize=12, fontweight='bold')
    ax1.legend()

    # Label high-leverage points
    for idx in leverage_info['high_leverage_idx']:
        ax1.annotate(str(idx), (idx, h[idx]), textcoords='offset points',
                     xytext=(0, 5), ha='center', fontsize=8)

    # Plot 2: For simple regression, show leverage vs x
    if X.shape[1] == 1:  # one predictor (intercept added inside compute_leverage)
        x = X[:, 0]
        ax2 = axes[1]
        ax2.scatter(x, h, c=colors, alpha=0.6, edgecolors='k', linewidth=0.5, s=50)

        # Show theoretical curve: h = 1/n + (x - x_bar)^2 / sum (x - x_bar)^2
        x_sorted = np.sort(x)
        x_mean = np.mean(x)
        ss_x = np.sum((x - x_mean) ** 2)
        h_theoretical = 1 / n + (x_sorted - x_mean) ** 2 / ss_x
        ax2.plot(x_sorted, h_theoretical, 'g-', linewidth=2, alpha=0.7,
                 label='Theoretical leverage')
        ax2.axhline(threshold, color='orange', linestyle='--', linewidth=1.5)
        ax2.set_xlabel('Predictor Value (x)', fontsize=11)
        ax2.set_ylabel('Leverage (h_ii)', fontsize=11)
        ax2.set_title('Leverage vs Predictor', fontsize=12, fontweight='bold')
        ax2.legend()

    plt.tight_layout()
    return fig


# Demonstration
np.random.seed(42)
n = 50

# Create data with a high-leverage point
x = np.random.uniform(2, 8, n)
y = 2 + 3 * x + np.random.randn(n) * 2

# Add a high-leverage point (far outside the x-range, but on the line)
x = np.append(x, 15)
y = np.append(y, 2 + 3 * 15 + np.random.randn() * 2)

# Add another high-leverage point that is also an outlier (way off the line)
x = np.append(x, 16)
y = np.append(y, 20)

n = len(y)
print(f"Dataset: n = {n}")

# Compute leverage
lev_info = compute_leverage(x)

print("Leverage Statistics:")
print(f"  p (parameters): {lev_info['p']}")
print(f"  Average leverage: {lev_info['avg_leverage']:.4f}")
print(f"  2× threshold: {lev_info['threshold_2x']:.4f}")
print(f"  3× threshold: {lev_info['threshold_3x']:.4f}")

print("High leverage points (> 2×avg):")
for idx in lev_info['high_leverage_idx']:
    print(f"  Obs {idx}: x={x[idx]:.2f}, y={y[idx]:.2f}, "
          f"leverage={lev_info['leverage'][idx]:.4f}")

# Plot
fig = leverage_diagnostic_plot(x.reshape(-1, 1), y, lev_info)
plt.suptitle('Leverage Analysis: Identifying Unusual Predictor Values',
             fontsize=12, fontweight='bold', y=1.02)
plt.savefig('leverage_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
```

Cook's distance is the most widely used single measure of an observation's influence on the regression. It quantifies how much all fitted values change when observation $i$ is deleted.
Cook's distance for observation $i$ is: $$D_i = \frac{\sum_{j=1}^{n}(\hat{y}_j - \hat{y}_{j(i)})^2}{p \, \hat{\sigma}^2}$$
where:
- $\hat{y}_j$ is the fitted value for observation $j$ from the full data,
- $\hat{y}_{j(i)}$ is the fitted value for observation $j$ when observation $i$ is deleted,
- $p$ is the number of parameters, and
- $\hat{\sigma}^2$ is the residual variance estimate from the full fit.
Remarkably, Cook's distance can be computed efficiently without refitting $n$ separate models: $$D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}$$
where $r_i$ is the internally studentized residual. This shows Cook's distance as a product of:
- outlier severity, via $r_i^2 / p$, and
- a leverage factor, $h_{ii}/(1 - h_{ii})$, which grows without bound as $h_{ii} \to 1$.
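The shortcut can be verified against the deletion definition. A sketch on simulated data: the loop refits the model $n$ times and compares all fitted values, and the closed form reproduces it exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
x = rng.uniform(0, 10, n)
y = 1 + 2 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
h = np.diag(X @ XtX_inv @ X.T)
beta = XtX_inv @ X.T @ y
e = y - X @ beta
sigma2 = e @ e / (n - p)

# Shortcut: D_i = (r_i^2 / p) * h_ii / (1 - h_ii), with r_i internally studentized
r = e / np.sqrt(sigma2 * (1 - h))
D_shortcut = (r ** 2 / p) * h / (1 - h)

# Brute force: refit without each observation, compare all n fitted values
D_brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    D_brute[i] = np.sum((X @ beta - X @ beta_i) ** 2) / (p * sigma2)

print(np.allclose(D_shortcut, D_brute))  # → True
```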
Several rules of thumb exist:
Conservative: $D_i > 1$ (suggested by Cook himself)
Moderate: $D_i > 4/n$ (commonly used)
Relative: $D_i > F_{0.50,p,n-p}$ (50th percentile of F distribution)
The $4/n$ rule is most common in practice. For $n = 100$, this gives a threshold of 0.04.
Cook's distance measures the effect of deletion on all fitted values simultaneously. It answers: 'If I remove this point, how much does the predicted value change, on average, across all observations?' Large D means the entire fitted surface shifts substantially.
```python
import numpy as np
import matplotlib.pyplot as plt


def compute_influence_measures(X, y):
    """
    Compute comprehensive influence measures for regression.

    Parameters
    ----------
    X : ndarray of shape (n,) or (n, p)
        Predictors
    y : ndarray of shape (n,)
        Response

    Returns
    -------
    dict with influence measures
    """
    n = len(y)
    if X.ndim == 1:
        X = X.reshape(-1, 1)

    # Add intercept
    X_full = np.column_stack([np.ones(n), X])
    p = X_full.shape[1]

    # Fit model
    XtX_inv = np.linalg.inv(X_full.T @ X_full)
    beta_hat = XtX_inv @ X_full.T @ y
    y_hat = X_full @ beta_hat
    residuals = y - y_hat

    # Hat matrix diagonal (leverage)
    H = X_full @ XtX_inv @ X_full.T
    leverage = np.diag(H)

    # Residual standard error
    RSS = np.sum(residuals ** 2)
    sigma_hat = np.sqrt(RSS / (n - p))

    # Standardized residuals
    std_resid = residuals / sigma_hat

    # Internally studentized residuals
    int_student = residuals / (sigma_hat * np.sqrt(1 - leverage))

    # Externally studentized residuals
    ext_student = int_student * np.sqrt((n - p - 1) / (n - p - int_student ** 2))

    # Leave-one-out residual standard error sigma_(i)
    sigma_i = np.sqrt((RSS - residuals ** 2 / (1 - leverage)) / (n - p - 1))

    # Cook's distance (uses the internally studentized residual)
    cooks_d = (int_student ** 2 / p) * (leverage / (1 - leverage))

    # DFFITS (uses the externally studentized residual)
    dffits = ext_student * np.sqrt(leverage / (1 - leverage))

    # DFBETAS: standardized change in each coefficient when obs i is deleted.
    # beta_hat - beta_hat_(i) = (X'X)^{-1} x_i e_i / (1 - h_ii)
    delta_beta = (XtX_inv @ X_full.T) * (residuals / (1 - leverage))  # (p, n)
    scale = np.sqrt(np.diag(XtX_inv))[:, None] * sigma_i[None, :]     # (p, n)
    dfbetas = (delta_beta / scale).T                                  # (n, p)

    # Thresholds
    cooks_threshold = 4 / n
    dffits_threshold = 2 * np.sqrt(p / n)
    dfbetas_threshold = 2 / np.sqrt(n)
    leverage_threshold = 2 * p / n

    # Identify influential points
    influential_cooks = np.where(cooks_d > cooks_threshold)[0]
    influential_dffits = np.where(np.abs(dffits) > dffits_threshold)[0]
    high_leverage = np.where(leverage > leverage_threshold)[0]

    return {
        'n': n,
        'p': p,
        'residuals': residuals,
        'leverage': leverage,
        'std_resid': std_resid,
        'int_student': int_student,
        'ext_student': ext_student,
        'cooks_d': cooks_d,
        'dffits': dffits,
        'dfbetas': dfbetas,
        'thresholds': {
            'cooks': cooks_threshold,
            'dffits': dffits_threshold,
            'dfbetas': dfbetas_threshold,
            'leverage': leverage_threshold
        },
        'influential': {
            'cooks': influential_cooks,
            'dffits': influential_dffits,
            'high_leverage': high_leverage
        }
    }


def influence_diagnostic_plots(X, y, results=None, figsize=(14, 10)):
    """Create comprehensive influence diagnostic plots."""
    if results is None:
        results = compute_influence_measures(X, y)

    n = results['n']
    fig, axes = plt.subplots(2, 2, figsize=figsize)

    # Plot 1: Cook's Distance
    ax1 = axes[0, 0]
    ax1.stem(range(n), results['cooks_d'], linefmt='b-', markerfmt='bo', basefmt='k-')
    ax1.axhline(results['thresholds']['cooks'], color='red', linestyle='--',
                linewidth=1.5,
                label=f'Threshold (4/n = {results["thresholds"]["cooks"]:.4f})')
    ax1.set_xlabel('Observation Index', fontsize=11)
    ax1.set_ylabel("Cook's Distance", fontsize=11)
    ax1.set_title("Cook's Distance", fontsize=12, fontweight='bold')
    ax1.legend()

    # Label influential points
    for idx in results['influential']['cooks']:
        ax1.annotate(str(idx), (idx, results['cooks_d'][idx]),
                     textcoords='offset points', xytext=(0, 5),
                     ha='center', fontsize=8)

    # Plot 2: Residuals vs Leverage (with Cook's distance contours)
    ax2 = axes[0, 1]
    ax2.scatter(results['leverage'], results['int_student'], c=results['cooks_d'],
                cmap='YlOrRd', alpha=0.7, edgecolors='k', linewidth=0.5, s=50)
    ax2.axhline(0, color='gray', linestyle='-', alpha=0.5)
    ax2.axvline(results['thresholds']['leverage'], color='blue', linestyle='--',
                alpha=0.7, label='Leverage threshold')

    # Cook's distance contours: D = r^2/p * h/(1-h)  =>  r = ±sqrt(D * p * (1-h)/h)
    h_range = np.linspace(0.001, max(results['leverage']) * 1.1, 100)
    p = results['p']
    for D_val in [0.5, 1.0]:
        r_cook = np.sqrt(D_val * p * (1 - h_range) / h_range)
        valid = r_cook < 5
        ax2.plot(h_range[valid], r_cook[valid], 'r--', alpha=0.5, linewidth=1)
        ax2.plot(h_range[valid], -r_cook[valid], 'r--', alpha=0.5, linewidth=1)
    ax2.set_xlabel('Leverage', fontsize=11)
    ax2.set_ylabel('Studentized Residuals', fontsize=11)
    ax2.set_title('Residuals vs Leverage', fontsize=12, fontweight='bold')
    ax2.legend()

    # Plot 3: DFFITS
    ax3 = axes[1, 0]
    colors = ['red' if abs(d) > results['thresholds']['dffits'] else 'blue'
              for d in results['dffits']]
    ax3.scatter(range(n), results['dffits'], c=colors, alpha=0.6,
                edgecolors='k', linewidth=0.5)
    ax3.axhline(results['thresholds']['dffits'], color='red', linestyle='--',
                linewidth=1.5)
    ax3.axhline(-results['thresholds']['dffits'], color='red', linestyle='--',
                linewidth=1.5)
    ax3.axhline(0, color='gray', linestyle='-', alpha=0.5)
    ax3.set_xlabel('Observation Index', fontsize=11)
    ax3.set_ylabel('DFFITS', fontsize=11)
    ax3.set_title('DFFITS (Influence on Own Fitted Value)', fontsize=12,
                  fontweight='bold')

    # Plot 4: DFBETAS for slope (if simple regression)
    ax4 = axes[1, 1]
    if results['p'] == 2:  # Intercept + 1 slope
        dfb_slope = results['dfbetas'][:, 1]
        colors = ['red' if abs(d) > results['thresholds']['dfbetas'] else 'blue'
                  for d in dfb_slope]
        ax4.scatter(range(n), dfb_slope, c=colors, alpha=0.6,
                    edgecolors='k', linewidth=0.5)
        ax4.axhline(results['thresholds']['dfbetas'], color='red', linestyle='--',
                    linewidth=1.5)
        ax4.axhline(-results['thresholds']['dfbetas'], color='red', linestyle='--',
                    linewidth=1.5)
        ax4.axhline(0, color='gray', linestyle='-', alpha=0.5)
        ax4.set_xlabel('Observation Index', fontsize=11)
        ax4.set_ylabel('DFBETAS (Slope)', fontsize=11)
        ax4.set_title('DFBETAS: Influence on Slope Coefficient', fontsize=12,
                      fontweight='bold')
    else:
        ax4.text(0.5, 0.5, 'DFBETAS plot\n(multiple predictors)',
                 ha='center', va='center', fontsize=12, transform=ax4.transAxes)
        ax4.set_title('DFBETAS', fontsize=12, fontweight='bold')

    plt.tight_layout()
    return fig


# Demonstration
np.random.seed(42)
n = 50

# Create normal data
x = np.random.uniform(2, 8, n)
y = 2 + 3 * x + np.random.randn(n) * 1.5

# Add influential points
# Point 1: High leverage, on the line (not influential)
x = np.append(x, 12)
y = np.append(y, 2 + 3 * 12)

# Point 2: High leverage, off the line (influential)
x = np.append(x, 13)
y = np.append(y, 15)  # Should be ~41, but is 15

# Point 3: Low leverage, large residual (moderate influence)
x = np.append(x, 5)
y = np.append(y, 30)  # Should be ~17, but is 30

print(f"Dataset with {len(y)} observations including 3 added points")

# Compute influence
results = compute_influence_measures(x, y)

print("\n" + "=" * 60)
print("INFLUENTIAL OBSERVATION SUMMARY")
print("=" * 60)
print("Thresholds:")
print(f"  Cook's D: > {results['thresholds']['cooks']:.4f}")
print(f"  DFFITS:   > |{results['thresholds']['dffits']:.4f}|")
print(f"  DFBETAS:  > |{results['thresholds']['dfbetas']:.4f}|")
print(f"  Leverage: > {results['thresholds']['leverage']:.4f}")

print(f"Influential by Cook's D: {list(results['influential']['cooks'])}")
print(f"Influential by DFFITS:   {list(results['influential']['dffits'])}")
print(f"High leverage:           {list(results['influential']['high_leverage'])}")

# Show details for flagged points
print("\n" + "-" * 60)
print("Details for flagged observations:")
all_flagged = (set(results['influential']['cooks'])
               | set(results['influential']['dffits'])
               | set(results['influential']['high_leverage']))
for idx in sorted(all_flagged):
    print(f"Obs {idx}: x={x[idx]:.2f}, y={y[idx]:.2f}")
    print(f"  Leverage:          {results['leverage'][idx]:.4f}")
    print(f"  Studentized Resid: {results['int_student'][idx]:.4f}")
    print(f"  Cook's D:          {results['cooks_d'][idx]:.4f}")
    print(f"  DFFITS:            {results['dffits'][idx]:.4f}")

# Plot
fig = influence_diagnostic_plots(x, y, results)
plt.suptitle('Influence Diagnostics', fontsize=14, fontweight='bold', y=1.02)
plt.savefig('influence_diagnostics.png', dpi=150, bbox_inches='tight')
plt.show()
```

While Cook's distance summarizes overall influence, DFFITS and DFBETAS provide more granular information about how each observation affects specific model outputs.
DFFITS measures how much the fitted value for observation $i$ changes when observation $i$ is deleted: $$\text{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{\hat{\sigma}_{(i)}\sqrt{h_{ii}}}$$
Computational formula: $$\text{DFFITS}_i = t_i \sqrt{\frac{h_{ii}}{1 - h_{ii}}}$$
where $t_i$ is the externally studentized residual.
Threshold: $|\text{DFFITS}_i| > 2\sqrt{p/n}$
Interpretation: DFFITS measures the standardized change in prediction at point $i$ when point $i$ is removed. It focuses on how well the model predicts this particular point.
DFBETAS measures how much each regression coefficient changes when observation $i$ is deleted: $$\text{DFBETAS}_{i,j} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\hat{\sigma}_{(i)}\sqrt{(X^TX)^{-1}_{jj}}}$$
Threshold: $|\text{DFBETAS}_{i,j}| > 2/\sqrt{n}$
Interpretation: DFBETAS tells you which coefficients are most affected by each observation. This is crucial when:
- specific coefficients carry substantive meaning (e.g., a treatment effect you plan to report), or
- hypothesis tests on particular predictors drive your conclusions.
These measures are closely related: $$D_i \approx \frac{\text{DFFITS}_i^2}{p}$$
So Cook's distance is, to close approximation, a scaled version of DFFITS squared. The relation is exact when DFFITS is computed with the internally studentized residual; with the usual externally studentized version, the two differ only by the $\hat{\sigma}$ versus $\hat{\sigma}_{(i)}$ scaling.
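A quick numerical check of this relationship on simulated data: computing both measures from their formulas, the ratio $D_i \big/ (\text{DFFITS}_i^2/p)$ stays near 1 for well-behaved data.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
x = rng.uniform(0, 5, n)
y = 0.5 + 1.5 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)        # hat-matrix diagonal
e = y - X @ (XtX_inv @ X.T @ y)
s2 = e @ e / (n - p)
r = e / np.sqrt(s2 * (1 - h))                      # internally studentized
t = r * np.sqrt((n - p - 1) / (n - p - r ** 2))    # externally studentized

cooks_d = (r ** 2 / p) * h / (1 - h)
dffits = t * np.sqrt(h / (1 - h))

# The two agree up to the sigma vs sigma_(i) scaling
ratio = cooks_d / (dffits ** 2 / p)
print(f"ratio range: [{ratio.min():.3f}, {ratio.max():.3f}]")
```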
All measures capture the interaction of leverage and residual magnitude, but emphasize different aspects:
| Measure | What It Measures | Formula | Threshold |
|---|---|---|---|
| Cook's D | Change in all fitted values | $\frac{r_i^2}{p} \cdot \frac{h_{ii}}{1-h_{ii}}$ | $> 4/n$ |
| DFFITS | Change in own fitted value | $t_i\sqrt{\frac{h_{ii}}{1-h_{ii}}}$ | $> 2\sqrt{p/n}$ |
| DFBETAS_j | Change in coefficient j | $\frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\text{SE}_{(i)}(\hat{\beta}_j)}$ | $> 2/\sqrt{n}$ |
| Leverage | Unusualness in X-space | $h_{ii} = x_i^T(X^TX)^{-1}x_i$ | $> 2p/n$ |
| Studentized Resid | Unusualness in Y given X | $t_i = \frac{e_i}{\hat{\sigma}_{(i)}\sqrt{1-h_{ii}}}$ | $> 2$ or $> 3$ |
Use Cook's D for initial screening—it's a single number per observation. When you find influential points, examine DFBETAS to understand which coefficients are affected. Use DFFITS when prediction accuracy for specific points matters (e.g., when that prediction will be used for decision-making).
Discovering influential points is only the beginning. The question is: what should you do about them? The answer depends on why the point is influential and your analytical goals.
Before any remedial action, understand the source of influence:
Data quality issues:
- transcription or data-entry errors,
- unit or scaling mistakes,
- measurement or sensor faults,
- observations from outside the target population.
Action: Correct if possible; otherwise exclude with documentation.
Legitimate unusual observation:
- a rare but real event, correctly recorded,
- a genuine member of the population at an extreme of the predictor range.
Action: This is the hard case—see strategies below.
Strategy 1: Report Sensitivity Analysis
Run the analysis with and without the influential point(s) and report both sets of estimates. If the conclusions agree, influence is not a practical concern; if they differ, readers deserve to know that the findings hinge on a handful of observations.
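A sensitivity analysis can be as simple as two fits. This sketch uses a hypothetical influential point (extreme $x$, far off the line) to show the comparison:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
x = rng.uniform(2, 8, n)
y = 2 + 3 * x + rng.normal(size=n)

# One hypothetical influential point: extreme x, far below the fitted line
x = np.append(x, 15)
y = np.append(y, 10)


def fit_line(x, y):
    """OLS fit of y = b0 + b1*x; returns (b0, b1)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]


beta_with = fit_line(x, y)
beta_without = fit_line(x[:-1], y[:-1])

print(f"Slope with point:    {beta_with[1]:.3f}")
print(f"Slope without point: {beta_without[1]:.3f}")
# If the two slopes disagree materially, the conclusion hinges on one observation
```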
Strategy 2: Robust Regression
Use regression methods that downweight influential observations automatically:
- M-estimation with Huber's loss (downweights large residuals smoothly),
- Tukey's bisquare (redescending; can give extreme points zero weight),
- least trimmed squares (LTS) when contamination is heavy.
Strategy 3: Bounded Influence Regression
Explicitly limit the maximum influence any point can have. Mallows-type estimators constrain leverage effects.
Strategy 4: Transformation
Sometimes influence is an artifact of scale. Log-transforming skewed responses can reduce the prominence of large values.
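A small sketch of this effect, using simulated data with multiplicative (lognormal) noise: the helper `max_cooks_d` (a name introduced here for illustration) reports the largest Cook's distance before and after log-transforming the response; on the raw scale the largest values typically dominate the fit far more.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
x = rng.uniform(1, 5, n)
y = np.exp(0.5 + 0.8 * x + rng.normal(scale=0.6, size=n))  # multiplicative noise


def max_cooks_d(x, y):
    """Largest Cook's distance for a simple linear fit of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
    e = y - X @ (XtX_inv @ X.T @ y)
    s2 = e @ e / (n - p)
    r = e / np.sqrt(s2 * (1 - h))
    return np.max((r ** 2 / p) * h / (1 - h))


print(f"Max Cook's D, raw y: {max_cooks_d(x, y):.3f}")
print(f"Max Cook's D, log y: {max_cooks_d(x, np.log(y)):.3f}")
```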
Never delete points just because they're influential without understanding why. 'Fishing' for good results by removing inconvenient data is scientific fraud. Deletion should be justified by substantive reasons (data error, wrong population) documented before looking at results.
```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.robust.robust_linear_model import RLM


def compare_ols_robust(X, y):
    """
    Compare OLS with robust regression methods.

    Parameters
    ----------
    X : ndarray
        Predictor (1D or 2D)
    y : ndarray
        Response
    """
    n = len(y)
    if X.ndim == 1:
        X_plot = X.copy()
        X = X.reshape(-1, 1)
    else:
        X_plot = X[:, 0]
    X_const = sm.add_constant(X)

    print("=" * 65)
    print("COMPARISON: OLS vs ROBUST REGRESSION")
    print("=" * 65)

    # OLS
    ols = sm.OLS(y, X_const).fit()
    print("OLS Results:")
    print(f"  Intercept: {ols.params[0]:.4f} (SE: {ols.bse[0]:.4f})")
    print(f"  Slope:     {ols.params[1]:.4f} (SE: {ols.bse[1]:.4f})")

    # Robust: Huber's T
    rlm_huber = RLM(y, X_const, M=sm.robust.norms.HuberT()).fit()
    print("Robust (Huber) Results:")
    print(f"  Intercept: {rlm_huber.params[0]:.4f} (SE: {rlm_huber.bse[0]:.4f})")
    print(f"  Slope:     {rlm_huber.params[1]:.4f} (SE: {rlm_huber.bse[1]:.4f})")

    # Robust: Tukey's bisquare
    rlm_bisquare = RLM(y, X_const, M=sm.robust.norms.TukeyBiweight()).fit()
    print("Robust (Bisquare) Results:")
    print(f"  Intercept: {rlm_bisquare.params[0]:.4f} (SE: {rlm_bisquare.bse[0]:.4f})")
    print(f"  Slope:     {rlm_bisquare.params[1]:.4f} (SE: {rlm_bisquare.bse[1]:.4f})")

    # Weights from bisquare (shows downweighting)
    weights = rlm_bisquare.weights
    low_weight_idx = np.where(weights < 0.5)[0]
    print("Observations with weights < 0.5 (downweighted by bisquare):")
    for idx in low_weight_idx:
        print(f"  Obs {idx}: x={X_plot[idx]:.2f}, y={y[idx]:.2f}, "
              f"weight={weights[idx]:.3f}")

    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Plot 1: Data and fitted lines
    ax1 = axes[0]
    ax1.scatter(X_plot, y, alpha=0.6, edgecolors='k', linewidth=0.5, s=50)
    x_range = np.linspace(X_plot.min(), X_plot.max(), 100)
    ax1.plot(x_range, ols.params[0] + ols.params[1] * x_range, 'b-',
             linewidth=2, label=f'OLS (slope={ols.params[1]:.2f})')
    ax1.plot(x_range, rlm_huber.params[0] + rlm_huber.params[1] * x_range, 'g--',
             linewidth=2, label=f'Huber (slope={rlm_huber.params[1]:.2f})')
    ax1.plot(x_range, rlm_bisquare.params[0] + rlm_bisquare.params[1] * x_range, 'r--',
             linewidth=2, label=f'Bisquare (slope={rlm_bisquare.params[1]:.2f})')
    ax1.set_xlabel('x', fontsize=11)
    ax1.set_ylabel('y', fontsize=11)
    ax1.set_title('OLS vs Robust Regression', fontsize=12, fontweight='bold')
    ax1.legend()

    # Plot 2: Bisquare weights
    ax2 = axes[1]
    colors = ['red' if w < 0.5 else 'blue' for w in weights]
    ax2.scatter(range(n), weights, c=colors, alpha=0.6, edgecolors='k',
                linewidth=0.5, s=50)
    ax2.axhline(0.5, color='orange', linestyle='--', label='Weight = 0.5')
    ax2.axhline(1.0, color='green', linestyle='--', alpha=0.5, label='Full weight')
    ax2.set_xlabel('Observation Index', fontsize=11)
    ax2.set_ylabel('Bisquare Weight', fontsize=11)
    ax2.set_title('Robust Regression Weights', fontsize=12, fontweight='bold')
    ax2.legend()

    # Label downweighted points
    for idx in low_weight_idx:
        ax2.annotate(str(idx), (idx, weights[idx]), textcoords='offset points',
                     xytext=(0, 5), ha='center', fontsize=8)

    plt.tight_layout()
    print("=" * 65)

    return {
        'ols': ols,
        'huber': rlm_huber,
        'bisquare': rlm_bisquare,
        'weights': weights,
        'figure': fig
    }


# Demonstration
np.random.seed(42)
n = 50

# Create data with clear contamination
x = np.random.uniform(2, 8, n)
y = 2 + 3 * x + np.random.randn(n) * 1.5

# Add outliers that will influence OLS
x_outliers = np.array([5, 6, 7])
y_outliers = np.array([35, 38, 40])  # Way above the line

x = np.append(x, x_outliers)
y = np.append(y, y_outliers)

print(f"Dataset: {len(y)} observations with 3 contaminating outliers")
print("True model: y = 2 + 3x + noise")

results = compare_ols_robust(x, y)
plt.savefig('robust_regression.png', dpi=150, bbox_inches='tight')
plt.show()
```

Influential observations are inevitable in real data. The key is systematic detection and principled handling—not mechanical deletion.
You now have comprehensive tools for identifying and handling influential observations. The final page of this module tackles multicollinearity—the problem of correlated predictors that destabilizes coefficient estimates and undermines interpretability.