In an ideal regression world, predictors would be uncorrelated—each carrying unique, independent information about the response. Reality is messier. In practice, predictors are often correlated, sometimes strongly. When correlations among predictors are high, we face multicollinearity, a condition that doesn't bias estimates but makes them unreliable, unstable, and difficult to interpret.
Multicollinearity is particularly insidious because it leaves overall model fit untouched: R² and in-sample predictions can look excellent while the coefficients beneath them are unstable, imprecisely estimated, and prone to sign flips. Standard goodness-of-fit checks will not catch it.
This page provides the mathematical foundations, diagnostic tools, and remedial strategies needed to detect and address multicollinearity in production regression models.
This page covers: (1) The mathematical nature and consequences of multicollinearity; (2) Detection methods—correlation matrices, VIF, condition numbers, eigenvalue analysis; (3) The distinction between harmful and benign collinearity; (4) Remedial strategies including variable selection, regularization, and PCA; and (5) Practical guidelines for real-world applications.
Multicollinearity exists when one or more predictor variables can be approximately expressed as linear combinations of other predictors: $$X_j \approx \sum_{k \neq j} \alpha_k X_k$$
We distinguish between:
Perfect Multicollinearity: One predictor is an exact linear combination of others. The design matrix $X$ becomes rank-deficient: $$\text{rank}(X) < p$$
This makes $(X^TX)^{-1}$ undefined—OLS cannot be computed. Software will either fail or quietly drop a variable.
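The rank deficiency is easy to see numerically. A minimal sketch (constructed data, variable names illustrative): with an exact linear dependence among columns, NumPy reports a reduced matrix rank and an astronomical condition number for $X^TX$:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.standard_normal(50)
X2 = rng.standard_normal(50)
X3 = X1 + X2                      # exact linear combination of X1 and X2
X = np.column_stack([X1, X2, X3])

rank = np.linalg.matrix_rank(X)
cond = np.linalg.cond(X.T @ X)
print(rank)    # 2, not 3: the design matrix is rank-deficient
print(cond)    # astronomically large: X'X is numerically singular
```

Whether software raises an explicit error or silently drops a column depends on the solver; checking the rank yourself avoids the surprise.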
Near-Perfect (Practical) Multicollinearity: Predictors are highly correlated but not perfectly so. $(X^TX)$ is invertible but ill-conditioned—small perturbations in data cause large changes in $\hat{\beta}$.
Recall the variance of OLS estimators: $$\text{Var}(\hat{\beta}) = \sigma^2(X^TX)^{-1}$$
When predictors are collinear, $(X^TX)$ is nearly singular. Its inverse has very large elements, inflating the variance of coefficient estimates.
Geometrically, collinearity means predictor vectors are nearly parallel in the column space of X. When vectors are nearly aligned, small rotations (from sampling variation) cause large swings in the projection coefficients. The regression 'sees' many nearly equivalent ways to combine the predictors, and the choice among them is unstable.
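The variance formula makes this concrete. For two standardized predictors with correlation $r$, the inverse of the (scaled) $X^TX$ matrix has diagonal entries $1/(1-r^2)$, so coefficient variances blow up as $r \to 1$. A quick check:

```python
import numpy as np

# For two standardized predictors with correlation r, (X'X)/n is [[1, r], [r, 1]];
# its inverse has diagonal 1/(1 - r^2) — exactly the variance inflation factor.
for r in [0.0, 0.5, 0.9, 0.99]:
    inv = np.linalg.inv(np.array([[1.0, r], [r, 1.0]]))
    print(f"r = {r:4.2f}   diagonal of inverse = {inv[0, 0]:8.2f}")
# → 1.00, 1.33, 5.26, 50.25
```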
What Multicollinearity Does NOT Affect: the unbiasedness of OLS estimates, overall model fit (R²), and in-sample predictions.
What Multicollinearity DOES Affect: the variance, stability, signs, and statistical significance of individual coefficients. The table below summarizes both:
| Aspect | Effect | Why It Matters |
|---|---|---|
| Coefficient variance | Inflated (sometimes massively) | Wide confidence intervals, low power |
| Coefficient stability | Highly unstable | Small data changes → large β changes |
| Coefficient signs | May flip unexpectedly | Contradicts domain knowledge |
| Individual t-tests | Nonsignificant despite overall significance | Type II errors on important variables |
| Model fit (R²) | Unaffected | Model predicts well despite coefficient issues |
| Prediction accuracy | Unaffected within sample range | Problems arise for extrapolation |
The simplest detection method examines the correlation matrix of predictors: $$R = \text{corr}(X)$$
High pairwise correlations (|r| > 0.7 or 0.8) suggest potential problems. However, this method has a critical limitation: it only detects bivariate collinearity. A variable can be an exact linear combination of three others while having moderate pairwise correlations with each.
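A small sketch of this failure mode (constructed data): a predictor built as an exact combination of three independent predictors shows only moderate pairwise correlations, yet its auxiliary $R^2$ is essentially 1, so its VIF is effectively infinite:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X1, X2, X3 = rng.standard_normal((3, n))
X4 = (X1 + X2 + X3) / np.sqrt(3)   # exact linear combination of the other three

# Pairwise correlations look unremarkable (≈0.58 in expectation)...
corrs = [np.corrcoef(X4, Z)[0, 1] for Z in (X1, X2, X3)]
print([round(c, 2) for c in corrs])

# ...but X4 is perfectly explained by X1..X3: auxiliary R² ≈ 1, VIF → ∞
A = np.column_stack([np.ones(n), X1, X2, X3])
beta, *_ = np.linalg.lstsq(A, X4, rcond=None)
r2 = 1 - ((X4 - A @ beta)**2).sum() / ((X4 - X4.mean())**2).sum()
print(f"auxiliary R² = {r2:.6f}")
```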
The VIF is the standard tool for detecting multicollinearity, including multivariate patterns invisible to correlation matrices.
Definition: For predictor $X_j$: $$\text{VIF}_j = \frac{1}{1 - R_j^2}$$
where $R_j^2$ is the $R^2$ from regressing $X_j$ on all other predictors.
Interpretation:
- VIF = 1: the predictor is uncorrelated with all others (no inflation)
- VIF = 4: the variance of $\hat{\beta}_j$ is 4× what it would be without collinearity, so its standard error is doubled
- VIF → ∞: the predictor is an exact linear combination of the others
Connection to variance inflation: $$\text{Var}(\hat{\beta}_j) = \frac{\sigma^2}{(n-1)s_j^2} \cdot \text{VIF}_j$$
where $s_j^2$ is the variance of $X_j$. VIF directly measures how much the variance is inflated due to collinearity.
Thresholds (rules of thumb):
- VIF < 5: acceptable
- VIF 5–10: moderate collinearity (investigate)
- VIF > 10: high collinearity (problematic)
- VIF > 20: severe collinearity (action needed)
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression


def compute_vif(X, feature_names=None):
    """Compute Variance Inflation Factors for each predictor.

    Parameters
    ----------
    X : ndarray of shape (n, p)
        Design matrix (predictors only, no intercept).
    feature_names : list of str, optional
        Names for each predictor.

    Returns
    -------
    DataFrame with VIF for each predictor.
    """
    n, p = X.shape
    if feature_names is None:
        feature_names = [f'X{i+1}' for i in range(p)]

    vifs = []
    r_squareds = []
    for j in range(p):
        # Regress X_j on all other predictors
        y_j = X[:, j]
        X_others = np.delete(X, j, axis=1)
        model = LinearRegression()
        model.fit(X_others, y_j)
        # R² from this auxiliary regression
        r2_j = model.score(X_others, y_j)
        vif_j = np.inf if r2_j >= 1 else 1 / (1 - r2_j)
        vifs.append(vif_j)
        r_squareds.append(r2_j)

    results = pd.DataFrame({
        'Feature': feature_names,
        'VIF': vifs,
        'R² (auxiliary)': r_squareds,
        'Tolerance': [1 / v if v != np.inf else 0 for v in vifs],
    })

    def assess_vif(vif):
        if vif < 5:
            return 'OK'
        elif vif < 10:
            return 'Moderate'
        elif vif < 20:
            return 'High'
        else:
            return 'Severe'

    results['Assessment'] = results['VIF'].apply(assess_vif)
    return results


def multicollinearity_diagnostics(X, feature_names=None, figsize=(14, 5)):
    """Comprehensive multicollinearity diagnostics: correlation matrix + VIF."""
    n, p = X.shape
    if feature_names is None:
        feature_names = [f'X{i+1}' for i in range(p)]

    vif_results = compute_vif(X, feature_names)
    corr_matrix = np.corrcoef(X.T)

    print("=" * 70)
    print("MULTICOLLINEARITY DIAGNOSTICS")
    print("=" * 70)
    print("VARIANCE INFLATION FACTORS (VIF)")
    print("-" * 70)
    print(vif_results.to_string(index=False))
    print("\n" + "-" * 70)
    print("VIF Interpretation:")
    print("  VIF = 1:  No collinearity")
    print("  VIF < 5:  Acceptable")
    print("  VIF 5-10: Moderate collinearity (investigate)")
    print("  VIF > 10: High collinearity (problematic)")
    print("  VIF > 20: Severe collinearity (action needed)")

    # Find problematic pairs
    print("HIGH PAIRWISE CORRELATIONS (|r| > 0.7)")
    print("-" * 70)
    high_corr_found = False
    for i in range(p):
        for j in range(i + 1, p):
            if abs(corr_matrix[i, j]) > 0.7:
                high_corr_found = True
                print(f"  {feature_names[i]} ↔ {feature_names[j]}: "
                      f"r = {corr_matrix[i, j]:.4f}")
    if not high_corr_found:
        print("  No pairwise correlations exceed 0.7")
    print("=" * 70)

    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=figsize)

    # Plot 1: Correlation heatmap
    ax1 = axes[0]
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
    sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f',
                cmap='RdBu_r', center=0, vmin=-1, vmax=1, ax=ax1,
                xticklabels=feature_names, yticklabels=feature_names)
    ax1.set_title('Correlation Matrix', fontsize=12, fontweight='bold')

    # Plot 2: VIF bar chart
    ax2 = axes[1]
    colors = ['green' if v < 5 else 'orange' if v < 10 else 'red'
              for v in vif_results['VIF']]
    bars = ax2.bar(feature_names, vif_results['VIF'], color=colors,
                   edgecolor='black', alpha=0.7)
    ax2.axhline(5, color='orange', linestyle='--', linewidth=1.5,
                label='VIF = 5 (moderate)')
    ax2.axhline(10, color='red', linestyle='--', linewidth=1.5,
                label='VIF = 10 (high)')
    ax2.set_xlabel('Predictor', fontsize=11)
    ax2.set_ylabel('VIF', fontsize=11)
    ax2.set_title('Variance Inflation Factors', fontsize=12, fontweight='bold')
    ax2.legend()

    # Add value labels on bars
    for bar, vif in zip(bars, vif_results['VIF']):
        ax2.annotate(f'{vif:.1f}',
                     xy=(bar.get_x() + bar.get_width() / 2, bar.get_height()),
                     xytext=(0, 3), textcoords='offset points',
                     ha='center', fontsize=9)

    plt.tight_layout()
    return {'vif': vif_results, 'correlation_matrix': corr_matrix, 'figure': fig}


# Demonstration
np.random.seed(42)
n = 100

# Create predictors with varying degrees of collinearity
X1 = np.random.randn(n)
X2 = 0.3 * X1 + np.random.randn(n) * 0.9   # Mild correlation with X1
X3 = 0.9 * X1 + np.random.randn(n) * 0.4   # High correlation with X1
X4 = np.random.randn(n)                    # Independent
X5 = 0.7 * X3 + 0.5 * X4 + np.random.randn(n) * 0.3  # Correlated with X3 and X4

X = np.column_stack([X1, X2, X3, X4, X5])
feature_names = ['X1', 'X2', 'X3 (≈X1)', 'X4', 'X5 (≈X3+X4)']

results = multicollinearity_diagnostics(X, feature_names)
plt.savefig('multicollinearity_diagnostics.png', dpi=150, bbox_inches='tight')
plt.show()
```

A more sophisticated approach examines the eigenvalues of $X^TX$ (or the correlation matrix of predictors).
Why eigenvalues? The eigenvalues of $X^TX$ represent the "variance explained" along each principal direction. Near-zero eigenvalues indicate directions along which predictors provide almost no unique information—the hallmark of collinearity.
The condition number of $X^TX$ is: $$\kappa = \sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}}$$
where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues.
Interpretation:
- κ < 10: no serious collinearity
- 10 ≤ κ < 30: moderate collinearity
- 30 ≤ κ < 100: serious collinearity; action may be needed
- κ ≥ 100: severe collinearity; action strongly recommended
Mathematical meaning: The condition number measures how much the solution $\hat{\beta}$ can change relative to perturbations in the input data. A condition number of 100 means a 1% change in data could cause up to 100% change in coefficients.
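Computed directly, the definition above agrees with NumPy's 2-norm condition number of the (standardized) design matrix, since the singular values of $X$ are the square roots of the eigenvalues of $X^TX$. A sketch with one nearly redundant predictor (constructed data):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X1 = rng.standard_normal(n)
X2 = 0.98 * X1 + 0.02 * rng.standard_normal(n)   # nearly parallel to X1
X3 = rng.standard_normal(n)

# Standardize so scale differences don't masquerade as collinearity
Xs = np.column_stack([X1, X2, X3])
Xs = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0)

eigvals = np.linalg.eigvalsh(Xs.T @ Xs)
kappa = np.sqrt(eigvals.max() / eigvals.min())
print(f"condition number ≈ {kappa:.0f}")    # large: X2 is nearly redundant
# Same quantity as NumPy's 2-norm condition number of Xs itself:
print(np.isclose(kappa, np.linalg.cond(Xs)))   # True
```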
Condition indices generalize the condition number by comparing each eigenvalue to the maximum: $$\eta_k = \sqrt{\frac{\lambda_{\max}}{\lambda_k}}$$
Each condition index corresponds to a principal component. Large condition indices (>30) identify specific near-dependencies.
To determine which variables are involved in each near-dependency:
1. Compute the variance decomposition proportions: the share of each coefficient's variance attributable to each eigenvalue/component.
2. Identify components with a high condition index (>30).
3. Within each such component, flag variables whose variance proportion exceeds 0.5; two or more flagged variables indicate a specific near-dependency among them.
This is the most complete diagnostic—it identifies which specific variables are entangled in each collinearity relationship.
For eigenvalue analysis, predictors should be standardized (mean 0, SD 1) or centered. Otherwise, the condition number reflects scaling differences rather than true collinearity. Many implementations use the correlation matrix rather than X'X to handle this automatically.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def eigenvalue_collinearity_diagnostics(X, feature_names=None, standardize=True):
    """Eigenvalue-based collinearity diagnostics: condition number,
    condition indices, and variance decomposition proportions.

    Parameters
    ----------
    X : ndarray of shape (n, p)
        Design matrix (predictors only).
    feature_names : list of str, optional
        Names for each predictor.
    standardize : bool
        Whether to standardize predictors before analysis.
    """
    n, p = X.shape
    if feature_names is None:
        feature_names = [f'X{i+1}' for i in range(p)]

    # Standardize (or, at minimum, center)
    if standardize:
        X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    else:
        X_std = X - X.mean(axis=0)

    # Add intercept column
    X_design = np.column_stack([np.ones(n), X_std])
    full_names = ['Intercept'] + feature_names

    XtX = X_design.T @ X_design

    # Eigendecomposition, sorted in descending order
    eigenvalues, eigenvectors = np.linalg.eigh(XtX)
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # Condition indices and overall condition number
    condition_indices = np.sqrt(eigenvalues[0] / eigenvalues)
    condition_number = condition_indices[-1]

    # Variance decomposition proportions:
    #   Π_kj = (v_jk² / λ_k) / Σ_k (v_jk² / λ_k)
    # where v_jk is the element of the k-th eigenvector for the j-th variable
    var_proportions = np.zeros((p + 1, p + 1))  # including intercept
    for j in range(p + 1):              # for each coefficient
        phi_j = np.zeros(p + 1)
        for k in range(p + 1):          # for each eigenvalue
            phi_j[k] = eigenvectors[j, k]**2 / eigenvalues[k]
        var_proportions[:, j] = phi_j / phi_j.sum()

    print("=" * 80)
    print("EIGENVALUE-BASED COLLINEARITY DIAGNOSTICS")
    print("=" * 80)
    print(f"OVERALL CONDITION NUMBER: {condition_number:.2f}")
    if condition_number < 10:
        print("Assessment: No serious collinearity")
    elif condition_number < 30:
        print("Assessment: Moderate collinearity")
    elif condition_number < 100:
        print("Assessment: Serious collinearity - action may be needed")
    else:
        print("Assessment: SEVERE collinearity - action strongly recommended")

    print("\n" + "-" * 80)
    print("EIGENVALUE ANALYSIS")
    print("-" * 80)
    eigenvalue_df = pd.DataFrame({
        'Component': range(1, p + 2),
        'Eigenvalue': eigenvalues,
        'Condition Index': condition_indices,
        'Assessment': ['OK' if ci < 10 else 'Moderate' if ci < 30 else 'HIGH'
                       for ci in condition_indices],
    })
    print(eigenvalue_df.to_string(index=False))

    print("\n" + "-" * 80)
    print("VARIANCE DECOMPOSITION PROPORTIONS")
    print("-" * 80)
    var_df = pd.DataFrame(var_proportions.T,
                          columns=[f'Comp{i+1}' for i in range(p + 1)],
                          index=full_names)
    print(var_df.round(3).to_string())

    print("\n" + "-" * 80)
    print("COLLINEARITY IDENTIFICATION")
    print("-" * 80)
    # For each high condition index, find the involved variables
    high_ci_idx = np.where(condition_indices > 30)[0]
    if len(high_ci_idx) == 0:
        print("No components with condition index > 30")
    else:
        for k in high_ci_idx:
            print(f"Component {k+1} (Condition Index = {condition_indices[k]:.2f}):")
            high_prop_vars = np.where(var_proportions[k, :] > 0.5)[0]
            for j in high_prop_vars:
                print(f"  • {full_names[j]}: proportion = {var_proportions[k, j]:.3f}")
    print("=" * 80)

    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Plot 1: Condition indices
    ax1 = axes[0]
    colors = ['green' if ci < 10 else 'orange' if ci < 30 else 'red'
              for ci in condition_indices]
    ax1.bar(range(1, p + 2), condition_indices, color=colors,
            edgecolor='black', alpha=0.7)
    ax1.axhline(10, color='orange', linestyle='--', label='CI = 10')
    ax1.axhline(30, color='red', linestyle='--', label='CI = 30')
    ax1.set_xlabel('Component', fontsize=11)
    ax1.set_ylabel('Condition Index', fontsize=11)
    ax1.set_title('Condition Indices', fontsize=12, fontweight='bold')
    ax1.legend()
    ax1.set_yscale('log')

    # Plot 2: Eigenvalue scree plot
    ax2 = axes[1]
    ax2.plot(range(1, p + 2), eigenvalues, 'bo-', linewidth=2, markersize=8)
    ax2.set_xlabel('Component', fontsize=11)
    ax2.set_ylabel('Eigenvalue', fontsize=11)
    ax2.set_title('Eigenvalue Scree Plot', fontsize=12, fontweight='bold')
    ax2.set_yscale('log')

    plt.tight_layout()
    return {
        'eigenvalues': eigenvalues,
        'condition_indices': condition_indices,
        'condition_number': condition_number,
        'variance_proportions': var_proportions,
        'figure': fig,
    }


# Demonstration
np.random.seed(42)
n = 100

# Create predictors with known collinearity structure
X1 = np.random.randn(n)
X2 = np.random.randn(n)
X3 = 0.95 * X1 + 0.05 * np.random.randn(n)           # Near-perfect collinearity with X1
X4 = 0.7 * X1 + 0.7 * X2 + 0.1 * np.random.randn(n)  # Multivariate collinearity

X = np.column_stack([X1, X2, X3, X4])
feature_names = ['X1', 'X2', 'X3 (≈X1)', 'X4 (≈X1+X2)']

results = eigenvalue_collinearity_diagnostics(X, feature_names)
plt.savefig('condition_number_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
```

Theory is one thing; seeing multicollinearity in action is another. Let's demonstrate its practical effects on coefficient estimation.
With collinear predictors, the data don't determine unique coefficients. Many different $(\beta_1, \beta_2)$ pairs might fit equally well. Small data perturbations select different pairs from this near-equivalent set.
A classic symptom: the overall F-test is significant (model explains variation) but no individual t-test reaches significance. Collinearity inflates standard errors so much that even important predictors appear non-significant.
Coefficients may have counterintuitive signs. If X1 and X2 are positively correlated and both positively related to Y, collinearity can produce a negative coefficient for one—the model "over-adjusts" because it can't separate their effects.
```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm


def demonstrate_collinearity_effects(n=100, correlation=0.95, n_simulations=200):
    """Demonstrate the instability of coefficients under multicollinearity.

    Parameters
    ----------
    n : int
        Sample size.
    correlation : float
        Correlation between X1 and X2.
    n_simulations : int
        Number of bootstrap-like replications.
    """
    np.random.seed(42)

    # True coefficients: [intercept, beta1, beta2]
    true_beta = np.array([1, 2, 3])

    # Generate collinear predictors
    X1 = np.random.randn(n)
    X2 = correlation * X1 + np.sqrt(1 - correlation**2) * np.random.randn(n)
    actual_corr = np.corrcoef(X1, X2)[0, 1]

    print("=" * 70)
    print(f"MULTICOLLINEARITY DEMONSTRATION (r = {actual_corr:.3f})")
    print("=" * 70)
    print(f"True model: y = {true_beta[0]} + {true_beta[1]}*X1 + {true_beta[2]}*X2 + ε")

    # Generate response and fit model
    X = np.column_stack([X1, X2])
    y = true_beta[0] + true_beta[1] * X1 + true_beta[2] * X2 + np.random.randn(n)
    X_const = sm.add_constant(X)
    model = sm.OLS(y, X_const).fit()

    print("--- Single Sample Results ---")
    print(model.summary().tables[1])

    # Compute VIF
    vif1 = 1 / (1 - actual_corr**2)
    print(f"VIF for both predictors: {vif1:.2f}")
    print(f"Variance inflation: {vif1:.1f}x the uncorrelated case")

    # Bootstrap to show instability
    beta1_samples = []
    beta2_samples = []
    for _ in range(n_simulations):
        idx = np.random.choice(n, n, replace=True)  # resample with replacement
        model_boot = sm.OLS(y[idx], X_const[idx]).fit()
        beta1_samples.append(model_boot.params[1])
        beta2_samples.append(model_boot.params[2])
    beta1_samples = np.array(beta1_samples)
    beta2_samples = np.array(beta2_samples)

    print(f"--- Bootstrap Analysis ({n_simulations} replications) ---")
    print(f"β1: mean = {np.mean(beta1_samples):.3f}, SD = {np.std(beta1_samples):.3f}")
    print(f"    range: [{np.min(beta1_samples):.3f}, {np.max(beta1_samples):.3f}]")
    print(f"β2: mean = {np.mean(beta2_samples):.3f}, SD = {np.std(beta2_samples):.3f}")
    print(f"    range: [{np.min(beta2_samples):.3f}, {np.max(beta2_samples):.3f}]")

    # Compare with uncorrelated case
    print("--- Comparison with Uncorrelated Predictors ---")
    X1_unc = np.random.randn(n)
    X2_unc = np.random.randn(n)
    y_unc = (true_beta[0] + true_beta[1] * X1_unc + true_beta[2] * X2_unc
             + np.random.randn(n))
    X_unc = np.column_stack([np.ones(n), X1_unc, X2_unc])
    model_unc = sm.OLS(y_unc, X_unc).fit()

    print("Standard errors:")
    print(f"  Collinear case    - β1: {model.bse[1]:.4f}, β2: {model.bse[2]:.4f}")
    print(f"  Uncorrelated case - β1: {model_unc.bse[1]:.4f}, β2: {model_unc.bse[2]:.4f}")
    print(f"  Ratio: {model.bse[1] / model_unc.bse[1]:.2f}x inflation")
    print("=" * 70)

    # Visualization
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Plot 1: Scatter of predictors
    ax1 = axes[0]
    ax1.scatter(X1, X2, alpha=0.5, edgecolors='k', linewidth=0.5)
    ax1.set_xlabel('X1', fontsize=11)
    ax1.set_ylabel('X2', fontsize=11)
    ax1.set_title(f'Collinear Predictors (r = {actual_corr:.3f})', fontweight='bold')

    # Plot 2: Bootstrap coefficient distributions
    ax2 = axes[1]
    ax2.hist(beta1_samples, bins=30, alpha=0.6, label='β1 bootstrap', edgecolor='black')
    ax2.hist(beta2_samples, bins=30, alpha=0.6, label='β2 bootstrap', edgecolor='black')
    ax2.axvline(true_beta[1], color='blue', linestyle='--', linewidth=2,
                label=f'True β1 = {true_beta[1]}')
    ax2.axvline(true_beta[2], color='orange', linestyle='--', linewidth=2,
                label=f'True β2 = {true_beta[2]}')
    ax2.set_xlabel('Coefficient Value', fontsize=11)
    ax2.set_ylabel('Frequency', fontsize=11)
    ax2.set_title('Bootstrap Coefficient Distributions', fontweight='bold')
    ax2.legend(fontsize=8)

    # Plot 3: β1 vs β2 showing compensation
    ax3 = axes[2]
    ax3.scatter(beta1_samples, beta2_samples, alpha=0.5, s=20)
    ax3.scatter([true_beta[1]], [true_beta[2]], color='red', s=100, marker='*',
                label='True values', zorder=5)
    ax3.set_xlabel('β1', fontsize=11)
    ax3.set_ylabel('β2', fontsize=11)
    ax3.set_title('β1 vs β2 Trade-off\n(Negative correlation shows compensation)',
                  fontweight='bold')
    ax3.legend()

    # Add correlation annotation
    beta_corr = np.corrcoef(beta1_samples, beta2_samples)[0, 1]
    ax3.annotate(f'r = {beta_corr:.3f}', xy=(0.05, 0.95), xycoords='axes fraction',
                 fontsize=11, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

    plt.tight_layout()
    return {
        'model': model,
        'beta1_samples': beta1_samples,
        'beta2_samples': beta2_samples,
        'figure': fig,
    }


# Run demonstration with different correlation levels
print("\n" + "█" * 70)
print(" MODERATE COLLINEARITY (r = 0.7) ")
print("█" * 70)
results_moderate = demonstrate_collinearity_effects(correlation=0.7)
plt.savefig('collinearity_moderate.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n" + "█" * 70)
print(" SEVERE COLLINEARITY (r = 0.95) ")
print("█" * 70)
results_severe = demonstrate_collinearity_effects(correlation=0.95)
plt.savefig('collinearity_severe.png', dpi=150, bbox_inches='tight')
plt.show()
```

The appropriate remedy depends on your analytical goals. If you only need predictions, collinearity may not matter. If you need interpretable coefficients, action is required.
The most direct approach: remove one of the collinear variables.
Advantages:
- Simple and transparent; no new methodology required
- Restores interpretability of the remaining coefficients
Disadvantages:
- Discards potentially useful information
- If the dropped variable has a genuine independent effect, omitting it biases the remaining coefficients
Guidance: Drop the predictor that is less theoretically important or less reliably measured.
If predictors measure related concepts, combine them:
- Average standardized versions of the correlated predictors into a single index
- Build a domain-motivated composite (e.g., summing related survey items)
- Use the first principal component of the correlated group
Advantages:
- One stable coefficient replaces several unstable ones; predictive performance is preserved
Disadvantages:
- The coefficient now describes the composite, not the original variables—interpretation changes
- Only defensible when the combined predictors genuinely measure the same latent construct
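A minimal sketch of the composite approach (data simulated from a shared latent construct; variable names illustrative): standardize the correlated measurements, then average them into one index:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
latent = rng.standard_normal(n)               # shared construct (simulated)
X1 = latent + 0.3 * rng.standard_normal(n)    # two noisy measurements of it
X2 = latent + 0.3 * rng.standard_normal(n)

# Standardize each measurement, then average into a single composite index
z1 = (X1 - X1.mean()) / X1.std()
z2 = (X2 - X2.mean()) / X2.std()
composite = (z1 + z2) / 2

c12 = np.corrcoef(X1, X2)[0, 1]
c_comp = np.corrcoef(composite, latent)[0, 1]
print(f"corr(X1, X2) = {c12:.2f}")                   # highly collinear pair
print(f"corr(composite, latent) = {c_comp:.2f}")     # composite tracks the construct
```

Regressing on `composite` instead of the pair yields one stable coefficient; the price is that it describes the index, not either original measurement.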
Ridge regression adds an L2 penalty: $$\hat{\beta}_{\text{ridge}} = \arg\min_\beta \left[ \|y - X\beta\|^2 + \lambda \|\beta\|^2 \right]$$
Solution: $$\hat{\beta}_{\text{ridge}} = (X^TX + \lambda I)^{-1} X^T y$$
Key insight: The penalty term $\lambda I$ adds to the diagonal of $X^TX$, improving conditioning. Even near-singular $X^TX$ becomes well-conditioned after adding $\lambda$.
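A quick numerical check of this conditioning effect (simulated collinear pair; the λ value is illustrative—in practice it is chosen by cross-validation):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X1 = rng.standard_normal(n)
X2 = 0.99 * X1 + 0.05 * rng.standard_normal(n)   # highly collinear pair
X = np.column_stack([X1, X2])
X = (X - X.mean(axis=0)) / X.std(axis=0)

lam = 10.0   # illustrative penalty strength
cond_before = np.linalg.cond(X.T @ X)
cond_after = np.linalg.cond(X.T @ X + lam * np.eye(2))
print(f"cond(X'X)      = {cond_before:8.1f}")    # ill-conditioned
print(f"cond(X'X + λI) = {cond_after:8.1f}")     # dramatically improved
```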
Advantages:
- Stabilizes coefficients even under severe collinearity; keeps all predictors
- Often improves out-of-sample prediction via the bias–variance trade-off
Disadvantages:
- Coefficients are biased toward zero and lose their standard OLS inferential interpretation
- Requires choosing the penalty λ, typically by cross-validation
Replace correlated predictors with their principal components:
1. Standardize the predictors and compute their principal components
2. Retain the first k components (e.g., those explaining 95% of predictor variance)
3. Regress y on these k orthogonal components
Advantages:
- Components are exactly orthogonal—collinearity is eliminated by construction
- Effective in high-dimensional settings with many entangled predictors
Disadvantages:
- Components are linear mixtures of all predictors and rarely have clean interpretations
- Components are chosen by predictor variance, not by relevance to y
| Remedy | Prediction | Interpretation | Complexity | Best When |
|---|---|---|---|---|
| Drop variables | May suffer | Restored | Low | Clear redundancy, domain knowledge available |
| Combine predictors | Good | Changed | Medium | Predictors measure same latent construct |
| Ridge regression | Improved | Lost | Medium | Prediction is primary goal |
| Lasso | Improved | Partially kept | Medium | Want automatic selection |
| PCR | Good | Lost | High | High-dimensional data, many collinearities |
| Do nothing | OK | Unreliable | None | Only overall prediction matters |
If your goal is prediction (not interpretation) and you're predicting within the training data range, collinearity may not require remediation. The model still fits and predicts well—it just can't tell you how much each individual predictor contributes. Report robust predictions but avoid interpreting individual coefficients.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import statsmodels.api as sm


def compare_collinearity_remedies(X, y, feature_names=None, alpha_ridge=1.0):
    """Compare different approaches to handling multicollinearity.

    Parameters
    ----------
    X : ndarray of shape (n, p)
        Design matrix with collinear predictors.
    y : ndarray of shape (n,)
        Response variable.
    feature_names : list of str
        Names for predictors.
    alpha_ridge : float
        Regularization strength for Ridge.
    """
    n, p = X.shape
    if feature_names is None:
        feature_names = [f'X{i+1}' for i in range(p)]

    # Standardize for fair comparison
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    print("=" * 70)
    print("COMPARISON OF MULTICOLLINEARITY REMEDIES")
    print("=" * 70)
    results = {}

    # 1. OLS (baseline)
    model_ols = LinearRegression()
    model_ols.fit(X_scaled, y)
    results['OLS'] = {
        'coef': model_ols.coef_,
        'intercept': model_ols.intercept_,
        'r2': model_ols.score(X_scaled, y),
    }
    # Get standard errors from statsmodels
    X_sm = sm.add_constant(X_scaled)
    model_sm = sm.OLS(y, X_sm).fit()
    results['OLS']['se'] = model_sm.bse[1:]

    print("1. ORDINARY LEAST SQUARES (OLS)")
    print("-" * 50)
    print(f"R² = {results['OLS']['r2']:.4f}")
    for i, name in enumerate(feature_names):
        print(f"  {name}: β = {results['OLS']['coef'][i]:8.4f} "
              f"(SE = {results['OLS']['se'][i]:.4f})")

    # 2. Ridge Regression
    model_ridge = Ridge(alpha=alpha_ridge)
    model_ridge.fit(X_scaled, y)
    results['Ridge'] = {
        'coef': model_ridge.coef_,
        'intercept': model_ridge.intercept_,
        'r2': model_ridge.score(X_scaled, y),
    }
    print(f"2. RIDGE REGRESSION (λ = {alpha_ridge})")
    print("-" * 50)
    print(f"R² = {results['Ridge']['r2']:.4f}")
    for i, name in enumerate(feature_names):
        ols_coef = results['OLS']['coef'][i]
        shrinkage = (1 - abs(results['Ridge']['coef'][i]) / abs(ols_coef)
                     if ols_coef != 0 else 0)
        print(f"  {name}: β = {results['Ridge']['coef'][i]:8.4f} "
              f"(shrinkage: {shrinkage * 100:.1f}%)")

    # 3. Lasso
    model_lasso = Lasso(alpha=0.1)
    model_lasso.fit(X_scaled, y)
    results['Lasso'] = {
        'coef': model_lasso.coef_,
        'intercept': model_lasso.intercept_,
        'r2': model_lasso.score(X_scaled, y),
    }
    print("3. LASSO REGRESSION")
    print("-" * 50)
    print(f"R² = {results['Lasso']['r2']:.4f}")
    n_nonzero = np.sum(model_lasso.coef_ != 0)
    print(f"Non-zero coefficients: {n_nonzero}/{p}")
    for i, name in enumerate(feature_names):
        if model_lasso.coef_[i] != 0:
            print(f"  {name}: β = {results['Lasso']['coef'][i]:8.4f}")
        else:
            print(f"  {name}: (excluded)")

    # 4. Principal Component Regression
    pca = PCA()
    X_pca = pca.fit_transform(X_scaled)
    # Use the first k components explaining 95% of predictor variance
    cumvar = np.cumsum(pca.explained_variance_ratio_)
    n_components = np.argmax(cumvar >= 0.95) + 1
    X_pca_reduced = X_pca[:, :n_components]
    model_pcr = LinearRegression()
    model_pcr.fit(X_pca_reduced, y)
    results['PCR'] = {
        'coef_pc': model_pcr.coef_,
        'n_components': n_components,
        'r2': model_pcr.score(X_pca_reduced, y),
    }
    print("4. PRINCIPAL COMPONENT REGRESSION")
    print("-" * 50)
    print(f"Components used: {n_components}/{p} (95% variance)")
    print(f"R² = {results['PCR']['r2']:.4f}")
    for i in range(n_components):
        print(f"  PC{i+1}: β = {model_pcr.coef_[i]:8.4f} "
              f"(explains {pca.explained_variance_ratio_[i] * 100:.1f}%)")

    print("\n" + "=" * 70)
    print("SUMMARY: COEFFICIENT STABILITY")
    print("=" * 70)
    print(f"{'Predictor':<12} {'OLS':>10} {'Ridge':>10} {'Lasso':>10}")
    print("-" * 44)
    for i, name in enumerate(feature_names):
        print(f"{name:<12} {results['OLS']['coef'][i]:>10.3f} "
              f"{results['Ridge']['coef'][i]:>10.3f} "
              f"{results['Lasso']['coef'][i]:>10.3f}")

    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Plot 1: Coefficient comparison
    ax1 = axes[0]
    x_pos = np.arange(p)
    width = 0.25
    ax1.bar(x_pos - width, results['OLS']['coef'], width, label='OLS', alpha=0.8)
    ax1.bar(x_pos, results['Ridge']['coef'], width, label='Ridge', alpha=0.8)
    ax1.bar(x_pos + width, results['Lasso']['coef'], width, label='Lasso', alpha=0.8)
    ax1.set_xticks(x_pos)
    ax1.set_xticklabels(feature_names)
    ax1.set_ylabel('Coefficient', fontsize=11)
    ax1.set_title('Coefficient Estimates by Method', fontsize=12, fontweight='bold')
    ax1.legend()
    ax1.axhline(0, color='gray', linestyle='-', linewidth=0.5)

    # Plot 2: OLS standard errors, flagging collinearity-inflated ones
    ax2 = axes[1]
    colors = ['red' if se > 2 * np.median(results['OLS']['se']) else 'steelblue'
              for se in results['OLS']['se']]
    ax2.bar(feature_names, results['OLS']['se'], color=colors,
            edgecolor='black', alpha=0.7)
    ax2.set_ylabel('Standard Error', fontsize=11)
    ax2.set_title('OLS Standard Errors\n(Red = inflated by collinearity)',
                  fontsize=12, fontweight='bold')

    plt.tight_layout()
    return results, fig


# Demonstration
np.random.seed(42)
n = 100

# Create collinear predictors
X1 = np.random.randn(n)
X2 = 0.9 * X1 + 0.1 * np.random.randn(n)   # ≈0.9 correlation with X1
X3 = np.random.randn(n)                    # Independent
X4 = 0.8 * X3 + 0.2 * np.random.randn(n)   # ≈0.8 correlation with X3

X = np.column_stack([X1, X2, X3, X4])
y = 2 + 3 * X1 + 0 * X2 + 2 * X3 + 0 * X4 + np.random.randn(n)  # Only X1 and X3 matter

results, fig = compare_collinearity_remedies(
    X, y,
    feature_names=['X1', 'X2 (≈X1)', 'X3', 'X4 (≈X3)'],
    alpha_ridge=1.0,
)
plt.savefig('collinearity_remedies.png', dpi=150, bbox_inches='tight')
plt.show()
```

Multicollinearity is one of the most common diagnostic challenges in regression. Proper detection and thoughtful remediation are essential for reliable inference.
Congratulations! You have completed the Regression Diagnostics module. You now possess a comprehensive toolkit for validating regression models: residual analysis for model adequacy, normality tests for inference validity, heteroscedasticity detection and remediation, influence diagnostics for unusual observations, and multicollinearity assessment for stable coefficients. These skills are essential for building trustworthy regression models in production ML systems.