Regression estimates summarize information from all observations, but not all observations contribute equally. Some data points—due to their position in predictor space, their unusual response values, or both—can dominate the regression fit. Understanding, detecting, and appropriately handling these influential observations is critical for robust analysis.
Consider this: removing a single observation from a 1,000-point dataset shouldn't dramatically change your conclusions. If it does, your model is not telling a story about the data—it's telling a story about that one point. This section equips you to recognize and address such situations.
This page covers: (1) The distinction between leverage, outliers, and influential points; (2) The hat matrix and leverage values; (3) Influence measures—Cook's distance, DFBETAS, DFFITS; (4) Visual diagnostics for influence; (5) Strategies for handling influential points without discarding valid data.
Three related but distinct concepts characterize unusual observations. Understanding their interplay is fundamental:
Leverage measures how unusual an observation's predictor values are—how far it lies from the center of the predictor space. Formally, leverage is defined through the hat matrix $H$: $$H = X(X^TX)^{-1}X^T$$
The $i$-th diagonal element $h_{ii}$ is the leverage of observation $i$: $$h_{ii} = x_i^T(X^TX)^{-1}x_i$$
Interpretation:
- $h_{ii}$ lies between $1/n$ and $1$ (for a model with an intercept); values near $1/n$ indicate a point near the center of the predictor cloud, values near $1$ a point far from it.
- $h_{ii} = \partial \hat{y}_i / \partial y_i$: leverage is the weight observation $i$'s own response receives in its own fitted value.
Key insight: Leverage is a function of X only—it doesn't depend on y at all. A high-leverage point can have a perfectly ordinary y-value.
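This fact is easy to verify directly. The sketch below (with arbitrary simulated predictors) builds the hat matrix from $X$ alone; no response enters the computation, and the trace recovers $p$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])  # intercept + one predictor

# The hat matrix -- and hence every leverage value -- is built from X alone
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# No y appears anywhere above; swapping in any response leaves h unchanged
print(f"leverage range: [{h.min():.3f}, {h.max():.3f}]")
print(f"trace(H) = {h.sum():.3f}  (equals p = 2)")
```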
Outliers are observations whose response values are unusual given their predictor values. They appear as large residuals. A point can be:
- an outlier without high leverage (unusual $y$ amid typical $x$ values),
- a high-leverage point without being an outlier (extreme $x$, but a consistent $y$), or
- both at once.
Influence combines leverage and residual magnitude. A point is influential if removing it substantially changes the regression results.
The key relationship: $$\text{Influence} \approx \text{Leverage} \times \text{Outlier severity}$$
A point needs both to be influential:
| Scenario | Leverage | Residual | Influence | Interpretation |
|---|---|---|---|---|
| Typical point | Low | Small | Low | Normal observation |
| Y-outlier among bulk | Low | Large | Moderate | Unusual y, but constrained by neighbors |
| Remote point on line | High | Small | Low | Confirms pattern at extreme |
| Remote outlier | High | Large | High | Pulls regression line—investigate |
High-leverage points tend to have small residuals even when they're outliers—they 'pull' the regression line toward themselves, hiding their own unusual nature. This is why residual plots alone can miss influential points. You must examine leverage explicitly.
The hat matrix is central to understanding leverage and its effects on residuals.
The hat matrix $H = X(X^TX)^{-1}X^T$ has several important properties:
1. Projects y onto the column space of X: $$\hat{y} = Hy$$
Hence the name "hat" matrix—it puts the hat on y.
2. Symmetric: $H^T = H$
3. Idempotent: $H^2 = H$
4. Trace equals the number of parameters: $$\text{tr}(H) = \sum_{i=1}^{n} h_{ii} = p$$
where $p$ is the number of model parameters (predictor columns plus the intercept).
5. Average leverage: $$\bar{h} = \frac{p}{n}$$
Since average leverage is $p/n$, we flag high leverage points as those with: $$h_{ii} > 2 \cdot \frac{p}{n} = \frac{2p}{n}$$
or for stricter screening: $$h_{ii} > 3 \cdot \frac{p}{n} = \frac{3p}{n}$$
Intuition: With $n = 100$ observations and $p = 5$ predictors, average leverage is 0.05. Points with leverage > 0.10 (twice average) warrant inspection.
For simple regression with standardized predictor: $$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}$$
This shows leverage increases with distance from the predictor mean—points at extreme x-values have high leverage.
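The closed form can be checked against the full hat-matrix computation. A minimal sketch with simulated data (any predictor vector works):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
X = np.column_stack([np.ones_like(x), x])

# Hat-matrix diagonal from the general definition
h_full = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Closed form for simple regression: 1/n + (x_i - x_bar)^2 / sum_j (x_j - x_bar)^2
h_formula = 1 / len(x) + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

print(np.allclose(h_full, h_formula))  # → True
```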
```python
import numpy as np
import matplotlib.pyplot as plt


def compute_leverage(X, include_intercept=True):
    """
    Compute leverage values (hat matrix diagonal) for design matrix X.

    Parameters
    ----------
    X : ndarray of shape (n, p)
        Design matrix (without intercept column if include_intercept=True)
    include_intercept : bool
        Whether to add an intercept column

    Returns
    -------
    dict with leverage values and thresholds
    """
    n = X.shape[0] if X.ndim > 1 else len(X)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    if include_intercept:
        X = np.column_stack([np.ones(n), X])
    p = X.shape[1]

    # Compute hat matrix
    XtX_inv = np.linalg.inv(X.T @ X)
    H = X @ XtX_inv @ X.T

    # Extract diagonal (leverage values)
    leverage = np.diag(H)

    # Thresholds
    avg_leverage = p / n
    threshold_2x = 2 * avg_leverage
    threshold_3x = 3 * avg_leverage

    # Identify high-leverage points
    high_lev_idx = np.where(leverage > threshold_2x)[0]
    very_high_lev_idx = np.where(leverage > threshold_3x)[0]

    return {
        'leverage': leverage,
        'n': n,
        'p': p,
        'avg_leverage': avg_leverage,
        'threshold_2x': threshold_2x,
        'threshold_3x': threshold_3x,
        'high_leverage_idx': high_lev_idx,
        'very_high_leverage_idx': very_high_lev_idx
    }


def leverage_diagnostic_plot(X, y, leverage_info=None, figsize=(12, 5)):
    """Create diagnostic plots for leverage analysis."""
    n = len(y)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    if leverage_info is None:
        leverage_info = compute_leverage(X)

    h = leverage_info['leverage']
    threshold = leverage_info['threshold_2x']

    fig, axes = plt.subplots(1, 2, figsize=figsize)

    # Plot 1: Leverage values vs observation index
    ax1 = axes[0]
    colors = ['red' if hi > threshold else 'blue' for hi in h]
    ax1.scatter(range(n), h, c=colors, alpha=0.6, edgecolors='k', linewidth=0.5)
    ax1.axhline(leverage_info['threshold_2x'], color='orange', linestyle='--',
                linewidth=1.5, label=f'2×avg ({leverage_info["threshold_2x"]:.3f})')
    ax1.axhline(leverage_info['threshold_3x'], color='red', linestyle='--',
                linewidth=1.5, label=f'3×avg ({leverage_info["threshold_3x"]:.3f})')
    ax1.axhline(leverage_info['avg_leverage'], color='green', linestyle='-',
                alpha=0.7, label=f'avg ({leverage_info["avg_leverage"]:.3f})')
    ax1.set_xlabel('Observation Index', fontsize=11)
    ax1.set_ylabel('Leverage (h_ii)', fontsize=11)
    ax1.set_title('Leverage Values', fontsize=12, fontweight='bold')
    ax1.legend()

    # Label high-leverage points
    for idx in leverage_info['high_leverage_idx']:
        ax1.annotate(str(idx), (idx, h[idx]), textcoords='offset points',
                     xytext=(0, 5), ha='center', fontsize=8)

    # Plot 2: For simple regression, show leverage vs x
    if X.shape[1] == 1:  # one predictor (intercept added inside compute_leverage)
        x = X[:, 0]
        ax2 = axes[1]
        ax2.scatter(x, h, c=colors, alpha=0.6, edgecolors='k', linewidth=0.5, s=50)

        # Show theoretical curve: h = 1/n + (x - x_bar)^2 / sum (x - x_bar)^2
        x_sorted = np.sort(x)
        x_mean = np.mean(x)
        ss_x = np.sum((x - x_mean) ** 2)
        h_theoretical = 1 / n + (x_sorted - x_mean) ** 2 / ss_x
        ax2.plot(x_sorted, h_theoretical, 'g-', linewidth=2, alpha=0.7,
                 label='Theoretical leverage')
        ax2.axhline(threshold, color='orange', linestyle='--', linewidth=1.5)
        ax2.set_xlabel('Predictor Value (x)', fontsize=11)
        ax2.set_ylabel('Leverage (h_ii)', fontsize=11)
        ax2.set_title('Leverage vs Predictor', fontsize=12, fontweight='bold')
        ax2.legend()

    plt.tight_layout()
    return fig


# Demonstration
np.random.seed(42)
n = 50

# Create data with a high-leverage point
x = np.random.uniform(2, 8, n)
y = 2 + 3 * x + np.random.randn(n) * 2

# Add a high-leverage point (far outside the x-range, but on the line)
x = np.append(x, 15)
y = np.append(y, 2 + 3 * 15 + np.random.randn() * 2)

# Add another high-leverage point that is also an outlier (way off the line)
x = np.append(x, 16)
y = np.append(y, 20)

n = len(y)
print(f"Dataset: n = {n}")

# Compute leverage
lev_info = compute_leverage(x)

print("Leverage Statistics:")
print(f"  p (parameters): {lev_info['p']}")
print(f"  Average leverage: {lev_info['avg_leverage']:.4f}")
print(f"  2× threshold: {lev_info['threshold_2x']:.4f}")
print(f"  3× threshold: {lev_info['threshold_3x']:.4f}")

print("High leverage points (> 2×avg):")
for idx in lev_info['high_leverage_idx']:
    print(f"  Obs {idx}: x={x[idx]:.2f}, y={y[idx]:.2f}, "
          f"leverage={lev_info['leverage'][idx]:.4f}")

# Plot
fig = leverage_diagnostic_plot(x.reshape(-1, 1), y, lev_info)
plt.suptitle('Leverage Analysis: Identifying Unusual Predictor Values',
             fontsize=12, fontweight='bold', y=1.02)
plt.savefig('leverage_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
```

Cook's distance is the most widely used single measure of an observation's influence on the regression. It quantifies how much all fitted values change when observation $i$ is deleted.
Cook's distance for observation $i$ is: $$D_i = \frac{\sum_{j=1}^{n}(\hat{y}_j - \hat{y}_{j(i)})^2}{p \, \hat{\sigma}^2}$$
where:
- $\hat{y}_j$ is the fitted value for observation $j$ from the full data,
- $\hat{y}_{j(i)}$ is the fitted value for observation $j$ when observation $i$ is deleted,
- $p$ is the number of parameters, and
- $\hat{\sigma}^2$ is the residual variance estimate from the full fit.
Remarkably, Cook's distance can be computed efficiently without refitting $n$ separate models: $$D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}$$
where $r_i$ is the internally studentized residual. This shows Cook's distance as a product of:
- outlier severity, via $r_i^2 / p$, and
- a leverage factor, $h_{ii}/(1 - h_{ii})$, which grows without bound as $h_{ii} \to 1$.
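The shortcut can be verified against the deletion definition. A sketch on simulated data: the loop refits the model $n$ times and compares all fitted values, and the closed form reproduces it exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
x = rng.uniform(0, 10, n)
y = 1 + 2 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
h = np.diag(X @ XtX_inv @ X.T)
beta = XtX_inv @ X.T @ y
e = y - X @ beta
sigma2 = e @ e / (n - p)

# Shortcut: D_i = (r_i^2 / p) * h_ii / (1 - h_ii), with r_i internally studentized
r = e / np.sqrt(sigma2 * (1 - h))
D_shortcut = (r ** 2 / p) * h / (1 - h)

# Brute force: refit without each observation, compare all n fitted values
D_brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    D_brute[i] = np.sum((X @ beta - X @ beta_i) ** 2) / (p * sigma2)

print(np.allclose(D_shortcut, D_brute))  # → True
```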
Several rules of thumb exist:
Conservative: $D_i > 1$ (suggested by Cook himself)
Moderate: $D_i > 4/n$ (commonly used)
Relative: $D_i > F_{0.50,p,n-p}$ (50th percentile of F distribution)
The $4/n$ rule is most common in practice. For $n = 100$, this gives a threshold of 0.04.
Cook's distance measures the effect of deletion on all fitted values simultaneously. It answers: 'If I remove this point, how much does the predicted value change, on average, across all observations?' Large D means the entire fitted surface shifts substantially.
```python
import numpy as np
import matplotlib.pyplot as plt


def compute_influence_measures(X, y):
    """
    Compute comprehensive influence measures for regression.

    Parameters
    ----------
    X : ndarray of shape (n,) or (n, p)
        Predictors
    y : ndarray of shape (n,)
        Response

    Returns
    -------
    dict with influence measures
    """
    n = len(y)
    if X.ndim == 1:
        X = X.reshape(-1, 1)

    # Add intercept
    X_full = np.column_stack([np.ones(n), X])
    p = X_full.shape[1]

    # Fit model
    XtX_inv = np.linalg.inv(X_full.T @ X_full)
    beta_hat = XtX_inv @ X_full.T @ y
    y_hat = X_full @ beta_hat
    residuals = y - y_hat

    # Hat matrix diagonal (leverage)
    H = X_full @ XtX_inv @ X_full.T
    leverage = np.diag(H)

    # Residual standard error
    RSS = np.sum(residuals ** 2)
    sigma_hat = np.sqrt(RSS / (n - p))

    # Standardized residuals
    std_resid = residuals / sigma_hat

    # Internally studentized residuals
    int_student = residuals / (sigma_hat * np.sqrt(1 - leverage))

    # Externally studentized residuals
    ext_student = int_student * np.sqrt((n - p - 1) / (n - p - int_student ** 2))

    # Leave-one-out residual standard error sigma_(i)
    sigma_i = np.sqrt((RSS - residuals ** 2 / (1 - leverage)) / (n - p - 1))

    # Cook's distance (uses the internally studentized residual)
    cooks_d = (int_student ** 2 / p) * (leverage / (1 - leverage))

    # DFFITS (uses the externally studentized residual)
    dffits = ext_student * np.sqrt(leverage / (1 - leverage))

    # DFBETAS: standardized change in each coefficient when obs i is deleted.
    # beta_hat - beta_hat_(i) = (X'X)^{-1} x_i e_i / (1 - h_ii)
    delta_beta = (XtX_inv @ X_full.T) * (residuals / (1 - leverage))  # (p, n)
    scale = np.sqrt(np.diag(XtX_inv))[:, None] * sigma_i[None, :]     # (p, n)
    dfbetas = (delta_beta / scale).T                                  # (n, p)

    # Thresholds
    cooks_threshold = 4 / n
    dffits_threshold = 2 * np.sqrt(p / n)
    dfbetas_threshold = 2 / np.sqrt(n)
    leverage_threshold = 2 * p / n

    # Identify influential points
    influential_cooks = np.where(cooks_d > cooks_threshold)[0]
    influential_dffits = np.where(np.abs(dffits) > dffits_threshold)[0]
    high_leverage = np.where(leverage > leverage_threshold)[0]

    return {
        'n': n,
        'p': p,
        'residuals': residuals,
        'leverage': leverage,
        'std_resid': std_resid,
        'int_student': int_student,
        'ext_student': ext_student,
        'cooks_d': cooks_d,
        'dffits': dffits,
        'dfbetas': dfbetas,
        'thresholds': {
            'cooks': cooks_threshold,
            'dffits': dffits_threshold,
            'dfbetas': dfbetas_threshold,
            'leverage': leverage_threshold
        },
        'influential': {
            'cooks': influential_cooks,
            'dffits': influential_dffits,
            'high_leverage': high_leverage
        }
    }


def influence_diagnostic_plots(X, y, results=None, figsize=(14, 10)):
    """Create comprehensive influence diagnostic plots."""
    if results is None:
        results = compute_influence_measures(X, y)

    n = results['n']
    fig, axes = plt.subplots(2, 2, figsize=figsize)

    # Plot 1: Cook's Distance
    ax1 = axes[0, 0]
    ax1.stem(range(n), results['cooks_d'], linefmt='b-', markerfmt='bo', basefmt='k-')
    ax1.axhline(results['thresholds']['cooks'], color='red', linestyle='--',
                linewidth=1.5,
                label=f'Threshold (4/n = {results["thresholds"]["cooks"]:.4f})')
    ax1.set_xlabel('Observation Index', fontsize=11)
    ax1.set_ylabel("Cook's Distance", fontsize=11)
    ax1.set_title("Cook's Distance", fontsize=12, fontweight='bold')
    ax1.legend()

    # Label influential points
    for idx in results['influential']['cooks']:
        ax1.annotate(str(idx), (idx, results['cooks_d'][idx]),
                     textcoords='offset points', xytext=(0, 5),
                     ha='center', fontsize=8)

    # Plot 2: Residuals vs Leverage (with Cook's distance contours)
    ax2 = axes[0, 1]
    ax2.scatter(results['leverage'], results['int_student'], c=results['cooks_d'],
                cmap='YlOrRd', alpha=0.7, edgecolors='k', linewidth=0.5, s=50)
    ax2.axhline(0, color='gray', linestyle='-', alpha=0.5)
    ax2.axvline(results['thresholds']['leverage'], color='blue', linestyle='--',
                alpha=0.7, label='Leverage threshold')

    # Cook's distance contours: D = r^2/p * h/(1-h)  =>  r = ±sqrt(D * p * (1-h)/h)
    h_range = np.linspace(0.001, max(results['leverage']) * 1.1, 100)
    p = results['p']
    for D_val in [0.5, 1.0]:
        r_cook = np.sqrt(D_val * p * (1 - h_range) / h_range)
        valid = r_cook < 5
        ax2.plot(h_range[valid], r_cook[valid], 'r--', alpha=0.5, linewidth=1)
        ax2.plot(h_range[valid], -r_cook[valid], 'r--', alpha=0.5, linewidth=1)
    ax2.set_xlabel('Leverage', fontsize=11)
    ax2.set_ylabel('Studentized Residuals', fontsize=11)
    ax2.set_title('Residuals vs Leverage', fontsize=12, fontweight='bold')
    ax2.legend()

    # Plot 3: DFFITS
    ax3 = axes[1, 0]
    colors = ['red' if abs(d) > results['thresholds']['dffits'] else 'blue'
              for d in results['dffits']]
    ax3.scatter(range(n), results['dffits'], c=colors, alpha=0.6,
                edgecolors='k', linewidth=0.5)
    ax3.axhline(results['thresholds']['dffits'], color='red', linestyle='--',
                linewidth=1.5)
    ax3.axhline(-results['thresholds']['dffits'], color='red', linestyle='--',
                linewidth=1.5)
    ax3.axhline(0, color='gray', linestyle='-', alpha=0.5)
    ax3.set_xlabel('Observation Index', fontsize=11)
    ax3.set_ylabel('DFFITS', fontsize=11)
    ax3.set_title('DFFITS (Influence on Own Fitted Value)', fontsize=12,
                  fontweight='bold')

    # Plot 4: DFBETAS for slope (if simple regression)
    ax4 = axes[1, 1]
    if results['p'] == 2:  # Intercept + 1 slope
        dfb_slope = results['dfbetas'][:, 1]
        colors = ['red' if abs(d) > results['thresholds']['dfbetas'] else 'blue'
                  for d in dfb_slope]
        ax4.scatter(range(n), dfb_slope, c=colors, alpha=0.6,
                    edgecolors='k', linewidth=0.5)
        ax4.axhline(results['thresholds']['dfbetas'], color='red', linestyle='--',
                    linewidth=1.5)
        ax4.axhline(-results['thresholds']['dfbetas'], color='red', linestyle='--',
                    linewidth=1.5)
        ax4.axhline(0, color='gray', linestyle='-', alpha=0.5)
        ax4.set_xlabel('Observation Index', fontsize=11)
        ax4.set_ylabel('DFBETAS (Slope)', fontsize=11)
        ax4.set_title('DFBETAS: Influence on Slope Coefficient', fontsize=12,
                      fontweight='bold')
    else:
        ax4.text(0.5, 0.5, 'DFBETAS plot\n(multiple predictors)',
                 ha='center', va='center', fontsize=12, transform=ax4.transAxes)
        ax4.set_title('DFBETAS', fontsize=12, fontweight='bold')

    plt.tight_layout()
    return fig


# Demonstration
np.random.seed(42)
n = 50

# Create normal data
x = np.random.uniform(2, 8, n)
y = 2 + 3 * x + np.random.randn(n) * 1.5

# Add influential points
# Point 1: High leverage, on the line (not influential)
x = np.append(x, 12)
y = np.append(y, 2 + 3 * 12)

# Point 2: High leverage, off the line (influential)
x = np.append(x, 13)
y = np.append(y, 15)  # Should be ~41, but is 15

# Point 3: Low leverage, large residual (moderate influence)
x = np.append(x, 5)
y = np.append(y, 30)  # Should be ~17, but is 30

print(f"Dataset with {len(y)} observations including 3 added points")

# Compute influence
results = compute_influence_measures(x, y)

print("\n" + "=" * 60)
print("INFLUENTIAL OBSERVATION SUMMARY")
print("=" * 60)
print("Thresholds:")
print(f"  Cook's D: > {results['thresholds']['cooks']:.4f}")
print(f"  DFFITS:   > |{results['thresholds']['dffits']:.4f}|")
print(f"  DFBETAS:  > |{results['thresholds']['dfbetas']:.4f}|")
print(f"  Leverage: > {results['thresholds']['leverage']:.4f}")

print(f"Influential by Cook's D: {list(results['influential']['cooks'])}")
print(f"Influential by DFFITS:   {list(results['influential']['dffits'])}")
print(f"High leverage:           {list(results['influential']['high_leverage'])}")

# Show details for flagged points
print("\n" + "-" * 60)
print("Details for flagged observations:")
all_flagged = (set(results['influential']['cooks'])
               | set(results['influential']['dffits'])
               | set(results['influential']['high_leverage']))
for idx in sorted(all_flagged):
    print(f"Obs {idx}: x={x[idx]:.2f}, y={y[idx]:.2f}")
    print(f"  Leverage:          {results['leverage'][idx]:.4f}")
    print(f"  Studentized Resid: {results['int_student'][idx]:.4f}")
    print(f"  Cook's D:          {results['cooks_d'][idx]:.4f}")
    print(f"  DFFITS:            {results['dffits'][idx]:.4f}")

# Plot
fig = influence_diagnostic_plots(x, y, results)
plt.suptitle('Influence Diagnostics', fontsize=14, fontweight='bold', y=1.02)
plt.savefig('influence_diagnostics.png', dpi=150, bbox_inches='tight')
plt.show()
```

While Cook's distance summarizes overall influence, DFFITS and DFBETAS provide more granular information about how each observation affects specific model outputs.
DFFITS measures how much the fitted value for observation $i$ changes when observation $i$ is deleted: $$\text{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{\hat{\sigma}_{(i)}\sqrt{h_{ii}}}$$
Computational formula: $$\text{DFFITS}_i = t_i \sqrt{\frac{h_{ii}}{1 - h_{ii}}}$$
where $t_i$ is the externally studentized residual.
Threshold: $|\text{DFFITS}_i| > 2\sqrt{p/n}$
Interpretation: DFFITS measures the standardized change in prediction at point $i$ when point $i$ is removed. It focuses on how well the model predicts this particular point.
DFBETAS measures how much each regression coefficient changes when observation $i$ is deleted: $$\text{DFBETAS}_{i,j} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\hat{\sigma}_{(i)}\sqrt{(X^TX)^{-1}_{jj}}}$$
Threshold: $|\text{DFBETAS}_{i,j}| > 2/\sqrt{n}$
Interpretation: DFBETAS tells you which coefficients are most affected by each observation. This is crucial when:
- specific coefficients carry substantive meaning (e.g., a treatment effect you plan to report), or
- hypothesis tests on particular predictors drive your conclusions.
These measures are closely related: $$D_i \approx \frac{\text{DFFITS}_i^2}{p}$$
So Cook's distance is, to close approximation, a scaled version of DFFITS squared. The relation is exact when DFFITS is computed with the internally studentized residual; with the usual externally studentized version, the two differ only by the $\hat{\sigma}$ versus $\hat{\sigma}_{(i)}$ scaling.
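A quick numerical check of this relationship on simulated data: computing both measures from their formulas, the ratio $D_i \big/ (\text{DFFITS}_i^2/p)$ stays near 1 for well-behaved data.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
x = rng.uniform(0, 5, n)
y = 0.5 + 1.5 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)        # hat-matrix diagonal
e = y - X @ (XtX_inv @ X.T @ y)
s2 = e @ e / (n - p)
r = e / np.sqrt(s2 * (1 - h))                      # internally studentized
t = r * np.sqrt((n - p - 1) / (n - p - r ** 2))    # externally studentized

cooks_d = (r ** 2 / p) * h / (1 - h)
dffits = t * np.sqrt(h / (1 - h))

# The two agree up to the sigma vs sigma_(i) scaling
ratio = cooks_d / (dffits ** 2 / p)
print(f"ratio range: [{ratio.min():.3f}, {ratio.max():.3f}]")
```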
All measures capture the interaction of leverage and residual magnitude, but emphasize different aspects:
| Measure | What It Measures | Formula | Threshold |
|---|---|---|---|
| Cook's D | Change in all fitted values | $\frac{r_i^2}{p} \cdot \frac{h_{ii}}{1-h_{ii}}$ | $> 4/n$ |
| DFFITS | Change in own fitted value | $t_i\sqrt{\frac{h_{ii}}{1-h_{ii}}}$ | $> 2\sqrt{p/n}$ |
| DFBETAS_j | Change in coefficient j | $\frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\text{SE}_{(i)}(\hat{\beta}_j)}$ | $> 2/\sqrt{n}$ |
| Leverage | Unusualness in X-space | $h_{ii} = x_i^T(X^TX)^{-1}x_i$ | $> 2p/n$ |
| Studentized Resid | Unusualness in Y given X | $t_i = \frac{e_i}{\hat{\sigma}_{(i)}\sqrt{1-h_{ii}}}$ | $> 2$ or $> 3$ |
Use Cook's D for initial screening—it's a single number per observation. When you find influential points, examine DFBETAS to understand which coefficients are affected. Use DFFITS when prediction accuracy for specific points matters (e.g., when that prediction will be used for decision-making).
Discovering influential points is only the beginning. The question is: what should you do about them? The answer depends on why the point is influential and your analytical goals.
Before any remedial action, understand the source of influence:
Data quality issues:
- transcription or data-entry errors,
- unit or scaling mistakes,
- measurement or sensor faults,
- observations from outside the target population.
Action: Correct if possible; otherwise exclude with documentation.
Legitimate unusual observation:
- a rare but real event, correctly recorded,
- a genuine member of the population at an extreme of the predictor range.
Action: This is the hard case—see strategies below.
Strategy 1: Report Sensitivity Analysis
Run the analysis with and without the influential point(s) and report both sets of estimates. If the conclusions agree, influence is not a practical concern; if they differ, readers deserve to know that the findings hinge on a handful of observations.
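A sensitivity analysis can be as simple as two fits. This sketch uses a hypothetical influential point (extreme $x$, far off the line) to show the comparison:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
x = rng.uniform(2, 8, n)
y = 2 + 3 * x + rng.normal(size=n)

# One hypothetical influential point: extreme x, far below the fitted line
x = np.append(x, 15)
y = np.append(y, 10)


def fit_line(x, y):
    """OLS fit of y = b0 + b1*x; returns (b0, b1)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]


beta_with = fit_line(x, y)
beta_without = fit_line(x[:-1], y[:-1])

print(f"Slope with point:    {beta_with[1]:.3f}")
print(f"Slope without point: {beta_without[1]:.3f}")
# If the two slopes disagree materially, the conclusion hinges on one observation
```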
Strategy 2: Robust Regression
Use regression methods that downweight influential observations automatically:
- M-estimation with Huber's loss (downweights large residuals smoothly),
- Tukey's bisquare (redescending; can give extreme points zero weight),
- least trimmed squares (LTS) when contamination is heavy.
Strategy 3: Bounded Influence Regression
Explicitly limit the maximum influence any point can have. Mallows-type estimators constrain leverage effects.
Strategy 4: Transformation
Sometimes influence is an artifact of scale. Log-transforming skewed responses can reduce the prominence of large values.
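A small sketch of this effect, using simulated data with multiplicative (lognormal) noise: the helper `max_cooks_d` (a name introduced here for illustration) reports the largest Cook's distance before and after log-transforming the response; on the raw scale the largest values typically dominate the fit far more.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
x = rng.uniform(1, 5, n)
y = np.exp(0.5 + 0.8 * x + rng.normal(scale=0.6, size=n))  # multiplicative noise


def max_cooks_d(x, y):
    """Largest Cook's distance for a simple linear fit of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
    e = y - X @ (XtX_inv @ X.T @ y)
    s2 = e @ e / (n - p)
    r = e / np.sqrt(s2 * (1 - h))
    return np.max((r ** 2 / p) * h / (1 - h))


print(f"Max Cook's D, raw y: {max_cooks_d(x, y):.3f}")
print(f"Max Cook's D, log y: {max_cooks_d(x, np.log(y)):.3f}")
```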
Never delete points just because they're influential without understanding why. 'Fishing' for good results by removing inconvenient data is scientific fraud. Deletion should be justified by substantive reasons (data error, wrong population) documented before looking at results.
```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.robust.robust_linear_model import RLM


def compare_ols_robust(X, y):
    """
    Compare OLS with robust regression methods.

    Parameters
    ----------
    X : ndarray
        Predictor (1D or 2D)
    y : ndarray
        Response
    """
    n = len(y)
    if X.ndim == 1:
        X_plot = X.copy()
        X = X.reshape(-1, 1)
    else:
        X_plot = X[:, 0]
    X_const = sm.add_constant(X)

    print("=" * 65)
    print("COMPARISON: OLS vs ROBUST REGRESSION")
    print("=" * 65)

    # OLS
    ols = sm.OLS(y, X_const).fit()
    print("OLS Results:")
    print(f"  Intercept: {ols.params[0]:.4f} (SE: {ols.bse[0]:.4f})")
    print(f"  Slope:     {ols.params[1]:.4f} (SE: {ols.bse[1]:.4f})")

    # Robust: Huber's T
    rlm_huber = RLM(y, X_const, M=sm.robust.norms.HuberT()).fit()
    print("Robust (Huber) Results:")
    print(f"  Intercept: {rlm_huber.params[0]:.4f} (SE: {rlm_huber.bse[0]:.4f})")
    print(f"  Slope:     {rlm_huber.params[1]:.4f} (SE: {rlm_huber.bse[1]:.4f})")

    # Robust: Tukey's bisquare
    rlm_bisquare = RLM(y, X_const, M=sm.robust.norms.TukeyBiweight()).fit()
    print("Robust (Bisquare) Results:")
    print(f"  Intercept: {rlm_bisquare.params[0]:.4f} (SE: {rlm_bisquare.bse[0]:.4f})")
    print(f"  Slope:     {rlm_bisquare.params[1]:.4f} (SE: {rlm_bisquare.bse[1]:.4f})")

    # Weights from bisquare (shows downweighting)
    weights = rlm_bisquare.weights
    low_weight_idx = np.where(weights < 0.5)[0]
    print("Observations with weights < 0.5 (downweighted by bisquare):")
    for idx in low_weight_idx:
        print(f"  Obs {idx}: x={X_plot[idx]:.2f}, y={y[idx]:.2f}, "
              f"weight={weights[idx]:.3f}")

    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Plot 1: Data and fitted lines
    ax1 = axes[0]
    ax1.scatter(X_plot, y, alpha=0.6, edgecolors='k', linewidth=0.5, s=50)
    x_range = np.linspace(X_plot.min(), X_plot.max(), 100)
    ax1.plot(x_range, ols.params[0] + ols.params[1] * x_range, 'b-',
             linewidth=2, label=f'OLS (slope={ols.params[1]:.2f})')
    ax1.plot(x_range, rlm_huber.params[0] + rlm_huber.params[1] * x_range, 'g--',
             linewidth=2, label=f'Huber (slope={rlm_huber.params[1]:.2f})')
    ax1.plot(x_range, rlm_bisquare.params[0] + rlm_bisquare.params[1] * x_range, 'r--',
             linewidth=2, label=f'Bisquare (slope={rlm_bisquare.params[1]:.2f})')
    ax1.set_xlabel('x', fontsize=11)
    ax1.set_ylabel('y', fontsize=11)
    ax1.set_title('OLS vs Robust Regression', fontsize=12, fontweight='bold')
    ax1.legend()

    # Plot 2: Bisquare weights
    ax2 = axes[1]
    colors = ['red' if w < 0.5 else 'blue' for w in weights]
    ax2.scatter(range(n), weights, c=colors, alpha=0.6, edgecolors='k',
                linewidth=0.5, s=50)
    ax2.axhline(0.5, color='orange', linestyle='--', label='Weight = 0.5')
    ax2.axhline(1.0, color='green', linestyle='--', alpha=0.5, label='Full weight')
    ax2.set_xlabel('Observation Index', fontsize=11)
    ax2.set_ylabel('Bisquare Weight', fontsize=11)
    ax2.set_title('Robust Regression Weights', fontsize=12, fontweight='bold')
    ax2.legend()

    # Label downweighted points
    for idx in low_weight_idx:
        ax2.annotate(str(idx), (idx, weights[idx]), textcoords='offset points',
                     xytext=(0, 5), ha='center', fontsize=8)

    plt.tight_layout()
    print("=" * 65)

    return {
        'ols': ols,
        'huber': rlm_huber,
        'bisquare': rlm_bisquare,
        'weights': weights,
        'figure': fig
    }


# Demonstration
np.random.seed(42)
n = 50

# Create data with clear contamination
x = np.random.uniform(2, 8, n)
y = 2 + 3 * x + np.random.randn(n) * 1.5

# Add outliers that will influence OLS
x_outliers = np.array([5, 6, 7])
y_outliers = np.array([35, 38, 40])  # Way above the line

x = np.append(x, x_outliers)
y = np.append(y, y_outliers)

print(f"Dataset: {len(y)} observations with 3 contaminating outliers")
print("True model: y = 2 + 3x + noise")

results = compare_ols_robust(x, y)
plt.savefig('robust_regression.png', dpi=150, bbox_inches='tight')
plt.show()
```

Influential observations are inevitable in real data. The key is systematic detection and principled handling—not mechanical deletion.
You now have comprehensive tools for identifying and handling influential observations. The final page of this module tackles multicollinearity—the problem of correlated predictors that destabilizes coefficient estimates and undermines interpretability.