In an ideal regression world, predictors would be uncorrelated—each carrying unique, independent information about the response. Reality is messier. In practice, predictors are often correlated, sometimes strongly. When correlations among predictors are high, we face multicollinearity, a condition that doesn't bias estimates but makes them unreliable, unstable, and difficult to interpret.
Multicollinearity is particularly insidious because it leaves overall model fit untouched: R² and in-sample predictions can look excellent while the coefficients beneath them are unstable, imprecisely estimated, and prone to sign flips. Standard goodness-of-fit checks will not catch it.
This page provides the mathematical foundations, diagnostic tools, and remedial strategies needed to detect and address multicollinearity in production regression models.
This page covers: (1) The mathematical nature and consequences of multicollinearity; (2) Detection methods—correlation matrices, VIF, condition numbers, eigenvalue analysis; (3) The distinction between harmful and benign collinearity; (4) Remedial strategies including variable selection, regularization, and PCA; and (5) Practical guidelines for real-world applications.
Multicollinearity exists when one or more predictor variables can be approximately expressed as linear combinations of other predictors: $$X_j \approx \sum_{k \neq j} \alpha_k X_k$$
We distinguish between:
Perfect Multicollinearity: One predictor is an exact linear combination of others. The design matrix $X$ becomes rank-deficient: $$\text{rank}(X) < p$$
This makes $(X^TX)^{-1}$ undefined—OLS cannot be computed. Software will either fail or quietly drop a variable.
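The rank deficiency is easy to see numerically. A minimal sketch (constructed data, variable names illustrative): with an exact linear dependence among columns, NumPy reports a reduced matrix rank and an astronomical condition number for $X^TX$:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.standard_normal(50)
X2 = rng.standard_normal(50)
X3 = X1 + X2                      # exact linear combination of X1 and X2
X = np.column_stack([X1, X2, X3])

rank = np.linalg.matrix_rank(X)
cond = np.linalg.cond(X.T @ X)
print(rank)    # 2, not 3: the design matrix is rank-deficient
print(cond)    # astronomically large: X'X is numerically singular
```

Whether software raises an explicit error or silently drops a column depends on the solver; checking the rank yourself avoids the surprise.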
Near-Perfect (Practical) Multicollinearity: Predictors are highly correlated but not perfectly so. $(X^TX)$ is invertible but ill-conditioned—small perturbations in data cause large changes in $\hat{\beta}$.
Recall the variance of OLS estimators: $$\text{Var}(\hat{\beta}) = \sigma^2(X^TX)^{-1}$$
When predictors are collinear, $(X^TX)$ is nearly singular. Its inverse has very large elements, inflating the variance of coefficient estimates.
Geometrically, collinearity means predictor vectors are nearly parallel in the column space of X. When vectors are nearly aligned, small rotations (from sampling variation) cause large swings in the projection coefficients. The regression 'sees' many nearly equivalent ways to combine the predictors, and the choice among them is unstable.
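The variance formula makes this concrete. For two standardized predictors with correlation $r$, the inverse of the (scaled) $X^TX$ matrix has diagonal entries $1/(1-r^2)$, so coefficient variances blow up as $r \to 1$. A quick check:

```python
import numpy as np

# For two standardized predictors with correlation r, (X'X)/n is [[1, r], [r, 1]];
# its inverse has diagonal 1/(1 - r^2) — exactly the variance inflation factor.
for r in [0.0, 0.5, 0.9, 0.99]:
    inv = np.linalg.inv(np.array([[1.0, r], [r, 1.0]]))
    print(f"r = {r:4.2f}   diagonal of inverse = {inv[0, 0]:8.2f}")
# → 1.00, 1.33, 5.26, 50.25
```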
What Multicollinearity Does NOT Affect: the unbiasedness of OLS estimates, overall model fit (R²), and in-sample predictions.
What Multicollinearity DOES Affect: the variance, stability, signs, and statistical significance of individual coefficients. The table below summarizes both:
| Aspect | Effect | Why It Matters |
|---|---|---|
| Coefficient variance | Inflated (sometimes massively) | Wide confidence intervals, low power |
| Coefficient stability | Highly unstable | Small data changes → large β changes |
| Coefficient signs | May flip unexpectedly | Contradicts domain knowledge |
| Individual t-tests | Nonsignificant despite overall significance | Type II errors on important variables |
| Model fit (R²) | Unaffected | Model predicts well despite coefficient issues |
| Prediction accuracy | Unaffected within sample range | Problems arise for extrapolation |
The simplest detection method examines the correlation matrix of predictors: $$R = \text{corr}(X)$$
High pairwise correlations (|r| > 0.7 or 0.8) suggest potential problems. However, this method has a critical limitation: it only detects bivariate collinearity. A variable can be an exact linear combination of three others while having moderate pairwise correlations with each.
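A small sketch of this failure mode (constructed data): a predictor built as an exact combination of three independent predictors shows only moderate pairwise correlations, yet its auxiliary $R^2$ is essentially 1, so its VIF is effectively infinite:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X1, X2, X3 = rng.standard_normal((3, n))
X4 = (X1 + X2 + X3) / np.sqrt(3)   # exact linear combination of the other three

# Pairwise correlations look unremarkable (≈0.58 in expectation)...
corrs = [np.corrcoef(X4, Z)[0, 1] for Z in (X1, X2, X3)]
print([round(c, 2) for c in corrs])

# ...but X4 is perfectly explained by X1..X3: auxiliary R² ≈ 1, VIF → ∞
A = np.column_stack([np.ones(n), X1, X2, X3])
beta, *_ = np.linalg.lstsq(A, X4, rcond=None)
r2 = 1 - ((X4 - A @ beta)**2).sum() / ((X4 - X4.mean())**2).sum()
print(f"auxiliary R² = {r2:.6f}")
```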
The VIF is the standard tool for detecting multicollinearity, including multivariate patterns invisible to correlation matrices.
Definition: For predictor $X_j$: $$\text{VIF}_j = \frac{1}{1 - R_j^2}$$
where $R_j^2$ is the $R^2$ from regressing $X_j$ on all other predictors.
Interpretation:
- VIF = 1: the predictor is uncorrelated with all others (no inflation)
- VIF = 4: the variance of $\hat{\beta}_j$ is 4× what it would be without collinearity, so its standard error is doubled
- VIF → ∞: the predictor is an exact linear combination of the others
Connection to variance inflation: $$\text{Var}(\hat{\beta}_j) = \frac{\sigma^2}{(n-1)s_j^2} \cdot \text{VIF}_j$$
where $s_j^2$ is the variance of $X_j$. VIF directly measures how much the variance is inflated due to collinearity.
Thresholds (rules of thumb):
- VIF < 5: acceptable
- VIF 5–10: moderate collinearity (investigate)
- VIF > 10: high collinearity (problematic)
- VIF > 20: severe collinearity (action needed)
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression


def compute_vif(X, feature_names=None):
    """Compute Variance Inflation Factors for each predictor.

    Parameters
    ----------
    X : ndarray of shape (n, p)
        Design matrix (predictors only, no intercept).
    feature_names : list of str, optional
        Names for each predictor.

    Returns
    -------
    DataFrame with VIF for each predictor.
    """
    n, p = X.shape
    if feature_names is None:
        feature_names = [f'X{i+1}' for i in range(p)]

    vifs = []
    r_squareds = []
    for j in range(p):
        # Regress X_j on all other predictors
        y_j = X[:, j]
        X_others = np.delete(X, j, axis=1)
        model = LinearRegression()
        model.fit(X_others, y_j)
        # R² from this auxiliary regression
        r2_j = model.score(X_others, y_j)
        vif_j = np.inf if r2_j >= 1 else 1 / (1 - r2_j)
        vifs.append(vif_j)
        r_squareds.append(r2_j)

    results = pd.DataFrame({
        'Feature': feature_names,
        'VIF': vifs,
        'R² (auxiliary)': r_squareds,
        'Tolerance': [1 / v if v != np.inf else 0 for v in vifs],
    })

    def assess_vif(vif):
        if vif < 5:
            return 'OK'
        elif vif < 10:
            return 'Moderate'
        elif vif < 20:
            return 'High'
        else:
            return 'Severe'

    results['Assessment'] = results['VIF'].apply(assess_vif)
    return results


def multicollinearity_diagnostics(X, feature_names=None, figsize=(14, 5)):
    """Comprehensive multicollinearity diagnostics: correlation matrix + VIF."""
    n, p = X.shape
    if feature_names is None:
        feature_names = [f'X{i+1}' for i in range(p)]

    vif_results = compute_vif(X, feature_names)
    corr_matrix = np.corrcoef(X.T)

    print("=" * 70)
    print("MULTICOLLINEARITY DIAGNOSTICS")
    print("=" * 70)
    print("VARIANCE INFLATION FACTORS (VIF)")
    print("-" * 70)
    print(vif_results.to_string(index=False))
    print("\n" + "-" * 70)
    print("VIF Interpretation:")
    print("  VIF = 1:  No collinearity")
    print("  VIF < 5:  Acceptable")
    print("  VIF 5-10: Moderate collinearity (investigate)")
    print("  VIF > 10: High collinearity (problematic)")
    print("  VIF > 20: Severe collinearity (action needed)")

    # Find problematic pairs
    print("HIGH PAIRWISE CORRELATIONS (|r| > 0.7)")
    print("-" * 70)
    high_corr_found = False
    for i in range(p):
        for j in range(i + 1, p):
            if abs(corr_matrix[i, j]) > 0.7:
                high_corr_found = True
                print(f"  {feature_names[i]} ↔ {feature_names[j]}: "
                      f"r = {corr_matrix[i, j]:.4f}")
    if not high_corr_found:
        print("  No pairwise correlations exceed 0.7")
    print("=" * 70)

    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=figsize)

    # Plot 1: Correlation heatmap
    ax1 = axes[0]
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
    sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f',
                cmap='RdBu_r', center=0, vmin=-1, vmax=1, ax=ax1,
                xticklabels=feature_names, yticklabels=feature_names)
    ax1.set_title('Correlation Matrix', fontsize=12, fontweight='bold')

    # Plot 2: VIF bar chart
    ax2 = axes[1]
    colors = ['green' if v < 5 else 'orange' if v < 10 else 'red'
              for v in vif_results['VIF']]
    bars = ax2.bar(feature_names, vif_results['VIF'], color=colors,
                   edgecolor='black', alpha=0.7)
    ax2.axhline(5, color='orange', linestyle='--', linewidth=1.5,
                label='VIF = 5 (moderate)')
    ax2.axhline(10, color='red', linestyle='--', linewidth=1.5,
                label='VIF = 10 (high)')
    ax2.set_xlabel('Predictor', fontsize=11)
    ax2.set_ylabel('VIF', fontsize=11)
    ax2.set_title('Variance Inflation Factors', fontsize=12, fontweight='bold')
    ax2.legend()

    # Add value labels on bars
    for bar, vif in zip(bars, vif_results['VIF']):
        ax2.annotate(f'{vif:.1f}',
                     xy=(bar.get_x() + bar.get_width() / 2, bar.get_height()),
                     xytext=(0, 3), textcoords='offset points',
                     ha='center', fontsize=9)

    plt.tight_layout()
    return {'vif': vif_results, 'correlation_matrix': corr_matrix, 'figure': fig}


# Demonstration
np.random.seed(42)
n = 100

# Create predictors with varying degrees of collinearity
X1 = np.random.randn(n)
X2 = 0.3 * X1 + np.random.randn(n) * 0.9   # Mild correlation with X1
X3 = 0.9 * X1 + np.random.randn(n) * 0.4   # High correlation with X1
X4 = np.random.randn(n)                    # Independent
X5 = 0.7 * X3 + 0.5 * X4 + np.random.randn(n) * 0.3  # Correlated with X3 and X4

X = np.column_stack([X1, X2, X3, X4, X5])
feature_names = ['X1', 'X2', 'X3 (≈X1)', 'X4', 'X5 (≈X3+X4)']

results = multicollinearity_diagnostics(X, feature_names)
plt.savefig('multicollinearity_diagnostics.png', dpi=150, bbox_inches='tight')
plt.show()
```

A more sophisticated approach examines the eigenvalues of $X^TX$ (or the correlation matrix of predictors).
Why eigenvalues? The eigenvalues of $X^TX$ represent the "variance explained" along each principal direction. Near-zero eigenvalues indicate directions along which predictors provide almost no unique information—the hallmark of collinearity.
The condition number of $X^TX$ is: $$\kappa = \sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}}$$
where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues.
Interpretation:
- κ < 10: no serious collinearity
- 10 ≤ κ < 30: moderate collinearity
- 30 ≤ κ < 100: serious collinearity; action may be needed
- κ ≥ 100: severe collinearity; action strongly recommended
Mathematical meaning: The condition number measures how much the solution $\hat{\beta}$ can change relative to perturbations in the input data. A condition number of 100 means a 1% change in data could cause up to 100% change in coefficients.
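Computed directly, the definition above agrees with NumPy's 2-norm condition number of the (standardized) design matrix, since the singular values of $X$ are the square roots of the eigenvalues of $X^TX$. A sketch with one nearly redundant predictor (constructed data):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X1 = rng.standard_normal(n)
X2 = 0.98 * X1 + 0.02 * rng.standard_normal(n)   # nearly parallel to X1
X3 = rng.standard_normal(n)

# Standardize so scale differences don't masquerade as collinearity
Xs = np.column_stack([X1, X2, X3])
Xs = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0)

eigvals = np.linalg.eigvalsh(Xs.T @ Xs)
kappa = np.sqrt(eigvals.max() / eigvals.min())
print(f"condition number ≈ {kappa:.0f}")    # large: X2 is nearly redundant
# Same quantity as NumPy's 2-norm condition number of Xs itself:
print(np.isclose(kappa, np.linalg.cond(Xs)))   # True
```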
Condition indices generalize the condition number by comparing each eigenvalue to the maximum: $$\eta_k = \sqrt{\frac{\lambda_{\max}}{\lambda_k}}$$
Each condition index corresponds to a principal component. Large condition indices (>30) identify specific near-dependencies.
To determine which variables are involved in each near-dependency:
1. Compute the variance decomposition proportions: the share of each coefficient's variance attributable to each eigenvalue/component.
2. Identify components with a high condition index (>30).
3. Within each such component, flag variables whose variance proportion exceeds 0.5; two or more flagged variables indicate a specific near-dependency among them.
This is the most complete diagnostic—it identifies which specific variables are entangled in each collinearity relationship.
For eigenvalue analysis, predictors should be standardized (mean 0, SD 1) or centered. Otherwise, the condition number reflects scaling differences rather than true collinearity. Many implementations use the correlation matrix rather than X'X to handle this automatically.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def eigenvalue_collinearity_diagnostics(X, feature_names=None, standardize=True):
    """Eigenvalue-based collinearity diagnostics: condition number,
    condition indices, and variance decomposition proportions.

    Parameters
    ----------
    X : ndarray of shape (n, p)
        Design matrix (predictors only).
    feature_names : list of str, optional
        Names for each predictor.
    standardize : bool
        Whether to standardize predictors before analysis.
    """
    n, p = X.shape
    if feature_names is None:
        feature_names = [f'X{i+1}' for i in range(p)]

    # Standardize (or, at minimum, center)
    if standardize:
        X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    else:
        X_std = X - X.mean(axis=0)

    # Add intercept column
    X_design = np.column_stack([np.ones(n), X_std])
    full_names = ['Intercept'] + feature_names

    XtX = X_design.T @ X_design

    # Eigendecomposition, sorted in descending order
    eigenvalues, eigenvectors = np.linalg.eigh(XtX)
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # Condition indices and overall condition number
    condition_indices = np.sqrt(eigenvalues[0] / eigenvalues)
    condition_number = condition_indices[-1]

    # Variance decomposition proportions:
    #   Π_kj = (v_jk² / λ_k) / Σ_k (v_jk² / λ_k)
    # where v_jk is the element of the k-th eigenvector for the j-th variable
    var_proportions = np.zeros((p + 1, p + 1))  # including intercept
    for j in range(p + 1):              # for each coefficient
        phi_j = np.zeros(p + 1)
        for k in range(p + 1):          # for each eigenvalue
            phi_j[k] = eigenvectors[j, k]**2 / eigenvalues[k]
        var_proportions[:, j] = phi_j / phi_j.sum()

    print("=" * 80)
    print("EIGENVALUE-BASED COLLINEARITY DIAGNOSTICS")
    print("=" * 80)
    print(f"OVERALL CONDITION NUMBER: {condition_number:.2f}")
    if condition_number < 10:
        print("Assessment: No serious collinearity")
    elif condition_number < 30:
        print("Assessment: Moderate collinearity")
    elif condition_number < 100:
        print("Assessment: Serious collinearity - action may be needed")
    else:
        print("Assessment: SEVERE collinearity - action strongly recommended")

    print("\n" + "-" * 80)
    print("EIGENVALUE ANALYSIS")
    print("-" * 80)
    eigenvalue_df = pd.DataFrame({
        'Component': range(1, p + 2),
        'Eigenvalue': eigenvalues,
        'Condition Index': condition_indices,
        'Assessment': ['OK' if ci < 10 else 'Moderate' if ci < 30 else 'HIGH'
                       for ci in condition_indices],
    })
    print(eigenvalue_df.to_string(index=False))

    print("\n" + "-" * 80)
    print("VARIANCE DECOMPOSITION PROPORTIONS")
    print("-" * 80)
    var_df = pd.DataFrame(var_proportions.T,
                          columns=[f'Comp{i+1}' for i in range(p + 1)],
                          index=full_names)
    print(var_df.round(3).to_string())

    print("\n" + "-" * 80)
    print("COLLINEARITY IDENTIFICATION")
    print("-" * 80)
    # For each high condition index, find the involved variables
    high_ci_idx = np.where(condition_indices > 30)[0]
    if len(high_ci_idx) == 0:
        print("No components with condition index > 30")
    else:
        for k in high_ci_idx:
            print(f"Component {k+1} (Condition Index = {condition_indices[k]:.2f}):")
            high_prop_vars = np.where(var_proportions[k, :] > 0.5)[0]
            for j in high_prop_vars:
                print(f"  • {full_names[j]}: proportion = {var_proportions[k, j]:.3f}")
    print("=" * 80)

    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Plot 1: Condition indices
    ax1 = axes[0]
    colors = ['green' if ci < 10 else 'orange' if ci < 30 else 'red'
              for ci in condition_indices]
    ax1.bar(range(1, p + 2), condition_indices, color=colors,
            edgecolor='black', alpha=0.7)
    ax1.axhline(10, color='orange', linestyle='--', label='CI = 10')
    ax1.axhline(30, color='red', linestyle='--', label='CI = 30')
    ax1.set_xlabel('Component', fontsize=11)
    ax1.set_ylabel('Condition Index', fontsize=11)
    ax1.set_title('Condition Indices', fontsize=12, fontweight='bold')
    ax1.legend()
    ax1.set_yscale('log')

    # Plot 2: Eigenvalue scree plot
    ax2 = axes[1]
    ax2.plot(range(1, p + 2), eigenvalues, 'bo-', linewidth=2, markersize=8)
    ax2.set_xlabel('Component', fontsize=11)
    ax2.set_ylabel('Eigenvalue', fontsize=11)
    ax2.set_title('Eigenvalue Scree Plot', fontsize=12, fontweight='bold')
    ax2.set_yscale('log')

    plt.tight_layout()
    return {
        'eigenvalues': eigenvalues,
        'condition_indices': condition_indices,
        'condition_number': condition_number,
        'variance_proportions': var_proportions,
        'figure': fig,
    }


# Demonstration
np.random.seed(42)
n = 100

# Create predictors with known collinearity structure
X1 = np.random.randn(n)
X2 = np.random.randn(n)
X3 = 0.95 * X1 + 0.05 * np.random.randn(n)           # Near-perfect collinearity with X1
X4 = 0.7 * X1 + 0.7 * X2 + 0.1 * np.random.randn(n)  # Multivariate collinearity

X = np.column_stack([X1, X2, X3, X4])
feature_names = ['X1', 'X2', 'X3 (≈X1)', 'X4 (≈X1+X2)']

results = eigenvalue_collinearity_diagnostics(X, feature_names)
plt.savefig('condition_number_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
```

Theory is one thing; seeing multicollinearity in action is another. Let's demonstrate its practical effects on coefficient estimation.
With collinear predictors, the data don't determine unique coefficients. Many different $(\beta_1, \beta_2)$ pairs might fit equally well. Small data perturbations select different pairs from this near-equivalent set.
A classic symptom: the overall F-test is significant (model explains variation) but no individual t-test reaches significance. Collinearity inflates standard errors so much that even important predictors appear non-significant.
Coefficients may have counterintuitive signs. If X1 and X2 are positively correlated and both positively related to Y, collinearity can produce a negative coefficient for one—the model "over-adjusts" because it can't separate their effects.
```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm


def demonstrate_collinearity_effects(n=100, correlation=0.95, n_simulations=200):
    """Demonstrate the instability of coefficients under multicollinearity.

    Parameters
    ----------
    n : int
        Sample size.
    correlation : float
        Correlation between X1 and X2.
    n_simulations : int
        Number of bootstrap-like replications.
    """
    np.random.seed(42)

    # True coefficients: [intercept, beta1, beta2]
    true_beta = np.array([1, 2, 3])

    # Generate collinear predictors
    X1 = np.random.randn(n)
    X2 = correlation * X1 + np.sqrt(1 - correlation**2) * np.random.randn(n)
    actual_corr = np.corrcoef(X1, X2)[0, 1]

    print("=" * 70)
    print(f"MULTICOLLINEARITY DEMONSTRATION (r = {actual_corr:.3f})")
    print("=" * 70)
    print(f"True model: y = {true_beta[0]} + {true_beta[1]}*X1 + {true_beta[2]}*X2 + ε")

    # Generate response and fit model
    X = np.column_stack([X1, X2])
    y = true_beta[0] + true_beta[1] * X1 + true_beta[2] * X2 + np.random.randn(n)
    X_const = sm.add_constant(X)
    model = sm.OLS(y, X_const).fit()

    print("--- Single Sample Results ---")
    print(model.summary().tables[1])

    # Compute VIF
    vif1 = 1 / (1 - actual_corr**2)
    print(f"VIF for both predictors: {vif1:.2f}")
    print(f"Variance inflation: {vif1:.1f}x the uncorrelated case")

    # Bootstrap to show instability
    beta1_samples = []
    beta2_samples = []
    for _ in range(n_simulations):
        idx = np.random.choice(n, n, replace=True)  # resample with replacement
        model_boot = sm.OLS(y[idx], X_const[idx]).fit()
        beta1_samples.append(model_boot.params[1])
        beta2_samples.append(model_boot.params[2])
    beta1_samples = np.array(beta1_samples)
    beta2_samples = np.array(beta2_samples)

    print(f"--- Bootstrap Analysis ({n_simulations} replications) ---")
    print(f"β1: mean = {np.mean(beta1_samples):.3f}, SD = {np.std(beta1_samples):.3f}")
    print(f"    range: [{np.min(beta1_samples):.3f}, {np.max(beta1_samples):.3f}]")
    print(f"β2: mean = {np.mean(beta2_samples):.3f}, SD = {np.std(beta2_samples):.3f}")
    print(f"    range: [{np.min(beta2_samples):.3f}, {np.max(beta2_samples):.3f}]")

    # Compare with uncorrelated case
    print("--- Comparison with Uncorrelated Predictors ---")
    X1_unc = np.random.randn(n)
    X2_unc = np.random.randn(n)
    y_unc = (true_beta[0] + true_beta[1] * X1_unc + true_beta[2] * X2_unc
             + np.random.randn(n))
    X_unc = np.column_stack([np.ones(n), X1_unc, X2_unc])
    model_unc = sm.OLS(y_unc, X_unc).fit()

    print("Standard errors:")
    print(f"  Collinear case    - β1: {model.bse[1]:.4f}, β2: {model.bse[2]:.4f}")
    print(f"  Uncorrelated case - β1: {model_unc.bse[1]:.4f}, β2: {model_unc.bse[2]:.4f}")
    print(f"  Ratio: {model.bse[1] / model_unc.bse[1]:.2f}x inflation")
    print("=" * 70)

    # Visualization
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Plot 1: Scatter of predictors
    ax1 = axes[0]
    ax1.scatter(X1, X2, alpha=0.5, edgecolors='k', linewidth=0.5)
    ax1.set_xlabel('X1', fontsize=11)
    ax1.set_ylabel('X2', fontsize=11)
    ax1.set_title(f'Collinear Predictors (r = {actual_corr:.3f})', fontweight='bold')

    # Plot 2: Bootstrap coefficient distributions
    ax2 = axes[1]
    ax2.hist(beta1_samples, bins=30, alpha=0.6, label='β1 bootstrap', edgecolor='black')
    ax2.hist(beta2_samples, bins=30, alpha=0.6, label='β2 bootstrap', edgecolor='black')
    ax2.axvline(true_beta[1], color='blue', linestyle='--', linewidth=2,
                label=f'True β1 = {true_beta[1]}')
    ax2.axvline(true_beta[2], color='orange', linestyle='--', linewidth=2,
                label=f'True β2 = {true_beta[2]}')
    ax2.set_xlabel('Coefficient Value', fontsize=11)
    ax2.set_ylabel('Frequency', fontsize=11)
    ax2.set_title('Bootstrap Coefficient Distributions', fontweight='bold')
    ax2.legend(fontsize=8)

    # Plot 3: β1 vs β2 showing compensation
    ax3 = axes[2]
    ax3.scatter(beta1_samples, beta2_samples, alpha=0.5, s=20)
    ax3.scatter([true_beta[1]], [true_beta[2]], color='red', s=100, marker='*',
                label='True values', zorder=5)
    ax3.set_xlabel('β1', fontsize=11)
    ax3.set_ylabel('β2', fontsize=11)
    ax3.set_title('β1 vs β2 Trade-off\n(Negative correlation shows compensation)',
                  fontweight='bold')
    ax3.legend()

    # Add correlation annotation
    beta_corr = np.corrcoef(beta1_samples, beta2_samples)[0, 1]
    ax3.annotate(f'r = {beta_corr:.3f}', xy=(0.05, 0.95), xycoords='axes fraction',
                 fontsize=11, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

    plt.tight_layout()
    return {
        'model': model,
        'beta1_samples': beta1_samples,
        'beta2_samples': beta2_samples,
        'figure': fig,
    }


# Run demonstration with different correlation levels
print("\n" + "█" * 70)
print(" MODERATE COLLINEARITY (r = 0.7) ")
print("█" * 70)
results_moderate = demonstrate_collinearity_effects(correlation=0.7)
plt.savefig('collinearity_moderate.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n" + "█" * 70)
print(" SEVERE COLLINEARITY (r = 0.95) ")
print("█" * 70)
results_severe = demonstrate_collinearity_effects(correlation=0.95)
plt.savefig('collinearity_severe.png', dpi=150, bbox_inches='tight')
plt.show()
```

The appropriate remedy depends on your analytical goals. If you only need predictions, collinearity may not matter. If you need interpretable coefficients, action is required.
The most direct approach: remove one of the collinear variables.
Advantages:
- Simple and transparent; no new methodology required
- Restores interpretability of the remaining coefficients
Disadvantages:
- Discards potentially useful information
- If the dropped variable has a genuine independent effect, omitting it biases the remaining coefficients
Guidance: Drop the predictor that is less theoretically important or less reliably measured.
If predictors measure related concepts, combine them:
- Average standardized versions of the correlated predictors into a single index
- Build a domain-motivated composite (e.g., summing related survey items)
- Use the first principal component of the correlated group
Advantages:
- One stable coefficient replaces several unstable ones; predictive performance is preserved
Disadvantages:
- The coefficient now describes the composite, not the original variables—interpretation changes
- Only defensible when the combined predictors genuinely measure the same latent construct
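A minimal sketch of the composite approach (data simulated from a shared latent construct; variable names illustrative): standardize the correlated measurements, then average them into one index:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
latent = rng.standard_normal(n)               # shared construct (simulated)
X1 = latent + 0.3 * rng.standard_normal(n)    # two noisy measurements of it
X2 = latent + 0.3 * rng.standard_normal(n)

# Standardize each measurement, then average into a single composite index
z1 = (X1 - X1.mean()) / X1.std()
z2 = (X2 - X2.mean()) / X2.std()
composite = (z1 + z2) / 2

c12 = np.corrcoef(X1, X2)[0, 1]
c_comp = np.corrcoef(composite, latent)[0, 1]
print(f"corr(X1, X2) = {c12:.2f}")                   # highly collinear pair
print(f"corr(composite, latent) = {c_comp:.2f}")     # composite tracks the construct
```

Regressing on `composite` instead of the pair yields one stable coefficient; the price is that it describes the index, not either original measurement.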
Ridge regression adds an L2 penalty: $$\hat{\beta}_{\text{ridge}} = \arg\min_\beta \left[ \|y - X\beta\|^2 + \lambda \|\beta\|^2 \right]$$
Solution: $$\hat{\beta}_{\text{ridge}} = (X^TX + \lambda I)^{-1} X^T y$$
Key insight: The penalty term $\lambda I$ adds to the diagonal of $X^TX$, improving conditioning. Even near-singular $X^TX$ becomes well-conditioned after adding $\lambda$.
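A quick numerical check of this conditioning effect (simulated collinear pair; the λ value is illustrative—in practice it is chosen by cross-validation):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X1 = rng.standard_normal(n)
X2 = 0.99 * X1 + 0.05 * rng.standard_normal(n)   # highly collinear pair
X = np.column_stack([X1, X2])
X = (X - X.mean(axis=0)) / X.std(axis=0)

lam = 10.0   # illustrative penalty strength
cond_before = np.linalg.cond(X.T @ X)
cond_after = np.linalg.cond(X.T @ X + lam * np.eye(2))
print(f"cond(X'X)      = {cond_before:8.1f}")    # ill-conditioned
print(f"cond(X'X + λI) = {cond_after:8.1f}")     # dramatically improved
```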
Advantages:
- Stabilizes coefficients even under severe collinearity; keeps all predictors
- Often improves out-of-sample prediction via the bias–variance trade-off
Disadvantages:
- Coefficients are biased toward zero and lose their standard OLS inferential interpretation
- Requires choosing the penalty λ, typically by cross-validation
Replace correlated predictors with their principal components:
1. Standardize the predictors and compute their principal components
2. Retain the first k components (e.g., those explaining 95% of predictor variance)
3. Regress y on these k orthogonal components
Advantages:
- Components are exactly orthogonal—collinearity is eliminated by construction
- Effective in high-dimensional settings with many entangled predictors
Disadvantages:
- Components are linear mixtures of all predictors and rarely have clean interpretations
- Components are chosen by predictor variance, not by relevance to y
| Remedy | Prediction | Interpretation | Complexity | Best When |
|---|---|---|---|---|
| Drop variables | May suffer | Restored | Low | Clear redundancy, domain knowledge available |
| Combine predictors | Good | Changed | Medium | Predictors measure same latent construct |
| Ridge regression | Improved | Lost | Medium | Prediction is primary goal |
| Lasso | Improved | Partially kept | Medium | Want automatic selection |
| PCR | Good | Lost | High | High-dimensional data, many collinearities |
| Do nothing | OK | Unreliable | None | Only overall prediction matters |
If your goal is prediction (not interpretation) and you're predicting within the training data range, collinearity may not require remediation. The model still fits and predicts well—it just can't tell you how much each individual predictor contributes. Report robust predictions but avoid interpreting individual coefficients.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import statsmodels.api as sm


def compare_collinearity_remedies(X, y, feature_names=None, alpha_ridge=1.0):
    """Compare different approaches to handling multicollinearity.

    Parameters
    ----------
    X : ndarray of shape (n, p)
        Design matrix with collinear predictors.
    y : ndarray of shape (n,)
        Response variable.
    feature_names : list of str
        Names for predictors.
    alpha_ridge : float
        Regularization strength for Ridge.
    """
    n, p = X.shape
    if feature_names is None:
        feature_names = [f'X{i+1}' for i in range(p)]

    # Standardize for fair comparison
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    print("=" * 70)
    print("COMPARISON OF MULTICOLLINEARITY REMEDIES")
    print("=" * 70)
    results = {}

    # 1. OLS (baseline)
    model_ols = LinearRegression()
    model_ols.fit(X_scaled, y)
    results['OLS'] = {
        'coef': model_ols.coef_,
        'intercept': model_ols.intercept_,
        'r2': model_ols.score(X_scaled, y),
    }
    # Get standard errors from statsmodels
    X_sm = sm.add_constant(X_scaled)
    model_sm = sm.OLS(y, X_sm).fit()
    results['OLS']['se'] = model_sm.bse[1:]

    print("1. ORDINARY LEAST SQUARES (OLS)")
    print("-" * 50)
    print(f"R² = {results['OLS']['r2']:.4f}")
    for i, name in enumerate(feature_names):
        print(f"  {name}: β = {results['OLS']['coef'][i]:8.4f} "
              f"(SE = {results['OLS']['se'][i]:.4f})")

    # 2. Ridge Regression
    model_ridge = Ridge(alpha=alpha_ridge)
    model_ridge.fit(X_scaled, y)
    results['Ridge'] = {
        'coef': model_ridge.coef_,
        'intercept': model_ridge.intercept_,
        'r2': model_ridge.score(X_scaled, y),
    }
    print(f"2. RIDGE REGRESSION (λ = {alpha_ridge})")
    print("-" * 50)
    print(f"R² = {results['Ridge']['r2']:.4f}")
    for i, name in enumerate(feature_names):
        ols_coef = results['OLS']['coef'][i]
        shrinkage = (1 - abs(results['Ridge']['coef'][i]) / abs(ols_coef)
                     if ols_coef != 0 else 0)
        print(f"  {name}: β = {results['Ridge']['coef'][i]:8.4f} "
              f"(shrinkage: {shrinkage * 100:.1f}%)")

    # 3. Lasso
    model_lasso = Lasso(alpha=0.1)
    model_lasso.fit(X_scaled, y)
    results['Lasso'] = {
        'coef': model_lasso.coef_,
        'intercept': model_lasso.intercept_,
        'r2': model_lasso.score(X_scaled, y),
    }
    print("3. LASSO REGRESSION")
    print("-" * 50)
    print(f"R² = {results['Lasso']['r2']:.4f}")
    n_nonzero = np.sum(model_lasso.coef_ != 0)
    print(f"Non-zero coefficients: {n_nonzero}/{p}")
    for i, name in enumerate(feature_names):
        if model_lasso.coef_[i] != 0:
            print(f"  {name}: β = {results['Lasso']['coef'][i]:8.4f}")
        else:
            print(f"  {name}: (excluded)")

    # 4. Principal Component Regression
    pca = PCA()
    X_pca = pca.fit_transform(X_scaled)
    # Use the first k components explaining 95% of predictor variance
    cumvar = np.cumsum(pca.explained_variance_ratio_)
    n_components = np.argmax(cumvar >= 0.95) + 1
    X_pca_reduced = X_pca[:, :n_components]
    model_pcr = LinearRegression()
    model_pcr.fit(X_pca_reduced, y)
    results['PCR'] = {
        'coef_pc': model_pcr.coef_,
        'n_components': n_components,
        'r2': model_pcr.score(X_pca_reduced, y),
    }
    print("4. PRINCIPAL COMPONENT REGRESSION")
    print("-" * 50)
    print(f"Components used: {n_components}/{p} (95% variance)")
    print(f"R² = {results['PCR']['r2']:.4f}")
    for i in range(n_components):
        print(f"  PC{i+1}: β = {model_pcr.coef_[i]:8.4f} "
              f"(explains {pca.explained_variance_ratio_[i] * 100:.1f}%)")

    print("\n" + "=" * 70)
    print("SUMMARY: COEFFICIENT STABILITY")
    print("=" * 70)
    print(f"{'Predictor':<12} {'OLS':>10} {'Ridge':>10} {'Lasso':>10}")
    print("-" * 44)
    for i, name in enumerate(feature_names):
        print(f"{name:<12} {results['OLS']['coef'][i]:>10.3f} "
              f"{results['Ridge']['coef'][i]:>10.3f} "
              f"{results['Lasso']['coef'][i]:>10.3f}")

    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Plot 1: Coefficient comparison
    ax1 = axes[0]
    x_pos = np.arange(p)
    width = 0.25
    ax1.bar(x_pos - width, results['OLS']['coef'], width, label='OLS', alpha=0.8)
    ax1.bar(x_pos, results['Ridge']['coef'], width, label='Ridge', alpha=0.8)
    ax1.bar(x_pos + width, results['Lasso']['coef'], width, label='Lasso', alpha=0.8)
    ax1.set_xticks(x_pos)
    ax1.set_xticklabels(feature_names)
    ax1.set_ylabel('Coefficient', fontsize=11)
    ax1.set_title('Coefficient Estimates by Method', fontsize=12, fontweight='bold')
    ax1.legend()
    ax1.axhline(0, color='gray', linestyle='-', linewidth=0.5)

    # Plot 2: OLS standard errors, flagging collinearity-inflated ones
    ax2 = axes[1]
    colors = ['red' if se > 2 * np.median(results['OLS']['se']) else 'steelblue'
              for se in results['OLS']['se']]
    ax2.bar(feature_names, results['OLS']['se'], color=colors,
            edgecolor='black', alpha=0.7)
    ax2.set_ylabel('Standard Error', fontsize=11)
    ax2.set_title('OLS Standard Errors\n(Red = inflated by collinearity)',
                  fontsize=12, fontweight='bold')

    plt.tight_layout()
    return results, fig


# Demonstration
np.random.seed(42)
n = 100

# Create collinear predictors
X1 = np.random.randn(n)
X2 = 0.9 * X1 + 0.1 * np.random.randn(n)   # ≈0.9 correlation with X1
X3 = np.random.randn(n)                    # Independent
X4 = 0.8 * X3 + 0.2 * np.random.randn(n)   # ≈0.8 correlation with X3

X = np.column_stack([X1, X2, X3, X4])
y = 2 + 3 * X1 + 0 * X2 + 2 * X3 + 0 * X4 + np.random.randn(n)  # Only X1 and X3 matter

results, fig = compare_collinearity_remedies(
    X, y,
    feature_names=['X1', 'X2 (≈X1)', 'X3', 'X4 (≈X3)'],
    alpha_ridge=1.0,
)
plt.savefig('collinearity_remedies.png', dpi=150, bbox_inches='tight')
plt.show()
```

Multicollinearity is one of the most common diagnostic challenges in regression. Proper detection and thoughtful remediation are essential for reliable inference.
Congratulations! You have completed the Regression Diagnostics module. You now possess a comprehensive toolkit for validating regression models: residual analysis for model adequacy, normality tests for inference validity, heteroscedasticity detection and remediation, influence diagnostics for unusual observations, and multicollinearity assessment for stable coefficients. These skills are essential for building trustworthy regression models in production ML systems.