We've established that OLS is orthogonal projection onto the column space of $\mathbf{X}$. This final page deepens that perspective, revealing additional insights that emerge when we fully embrace the projection viewpoint.
We'll explore how coefficients can be understood through sequential projections (Gram-Schmidt), why regularization modifies the projection geometry, and how the projection view illuminates the interpretation of coefficients in the presence of correlated predictors.
These insights are not merely theoretical—they directly inform how we interpret regression output and understand model behavior in practice.
By the end of this page, you will understand Gram-Schmidt in the regression context, see how sequential regression reveals partial effects, grasp the geometric interpretation of regularization, and synthesize the complete picture of multiple regression.
The Gram-Schmidt process transforms a set of linearly independent vectors into an orthonormal set that spans the same subspace. Applied to the columns of $\mathbf{X}$, it reveals the structure of multiple regression.
The Gram-Schmidt algorithm:
Given vectors $\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_p$ (columns of $\mathbf{X}$), produce orthonormal vectors $\mathbf{q}_0, \mathbf{q}_1, \ldots, \mathbf{q}_p$:
$$\begin{aligned} \mathbf{u}_0 &= \mathbf{x}_0, \quad \mathbf{q}_0 = \mathbf{u}_0 / |\mathbf{u}_0| \\ \mathbf{u}_j &= \mathbf{x}_j - \sum_{k=0}^{j-1} (\mathbf{q}_k^\top \mathbf{x}_j) \mathbf{q}_k, \quad \mathbf{q}_j = \mathbf{u}_j / |\mathbf{u}_j| \end{aligned}$$
Each $\mathbf{u}_j$ is what remains of $\mathbf{x}_j$ after removing its projections onto all previous orthogonalized vectors.
Gram-Schmidt is algorithmic QR decomposition. Writing X = QR, where Q has orthonormal columns and R is upper triangular, the normal equations become R⊤Rβ = R⊤Q⊤y, which simplifies to Rβ = Q⊤y—a triangular system solved by back-substitution.
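To make this concrete, here is a minimal sketch (the helper name `gram_schmidt` and the data are ours, not from any library) that orthogonalizes the columns of a small design matrix with classical Gram-Schmidt and then solves the triangular system $\mathbf{R}\boldsymbol{\beta} = \mathbf{Q}^\top\mathbf{y}$, checking the answer against the normal equations:

```python
import numpy as np

def gram_schmidt(X):
    """Classical Gram-Schmidt: return Q (orthonormal columns) and R (upper triangular)."""
    n, p = X.shape
    Q = np.zeros((n, p))
    R = np.zeros((p, p))
    for j in range(p):
        u = X[:, j].copy()
        for k in range(j):
            R[k, j] = Q[:, k] @ X[:, j]   # projection coefficient q_k^T x_j
            u -= R[k, j] * Q[:, k]        # remove the component along q_k
        R[j, j] = np.linalg.norm(u)       # length of what remains
        Q[:, j] = u / R[j, j]
    return Q, R

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -0.5]) + 0.1 * rng.normal(size=n)

Q, R = gram_schmidt(X)
beta_qr = np.linalg.solve(R, Q.T @ y)         # R beta = Q^T y (triangular system)
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations, for comparison
print(np.allclose(beta_qr, beta_ne))          # True
```

In practice one would call a library routine such as `np.linalg.qr`, which uses numerically stabler Householder reflections, but the hand-rolled version makes the sequential-projection structure explicit.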
Interpretation in regression:
When we orthogonalize the design matrix, the $j$-th orthogonalized column $\mathbf{u}_j$ represents what's "new" in predictor $j$ that isn't explained by predictors $0, 1, \ldots, j-1$.
The regression coefficient $\beta_j$ in OLS measures the effect of this "new" part—the partial effect of $x_j$ after controlling for other predictors.
Gram-Schmidt provides a powerful interpretation: the coefficient of $x_j$ in multiple regression is the simple regression coefficient of $y$ on the residualized $x_j$.
The Frisch-Waugh-Lovell theorem:
Consider partitioning $\mathbf{X} = [\mathbf{X}_1 | \mathbf{X}_2]$ and writing $\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\epsilon}$.
The coefficient $\boldsymbol{\beta}_2$ can be obtained by:

$$\boldsymbol{\beta}_2 = (\mathbf{X}_2^\top\mathbf{M}_1\mathbf{X}_2)^{-1}\mathbf{X}_2^\top\mathbf{M}_1\mathbf{y}$$
where $\mathbf{M}_1 = \mathbf{I} - \mathbf{X}_1(\mathbf{X}_1^\top\mathbf{X}_1)^{-1}\mathbf{X}_1^\top$ is the residual maker.
Interpretation:
$$\boldsymbol{\beta}_2 = (\tilde{\mathbf{X}}_2^\top\tilde{\mathbf{X}}_2)^{-1}\tilde{\mathbf{X}}_2^\top\tilde{\mathbf{y}}$$
where $\tilde{\mathbf{X}}_2 = \mathbf{M}_1\mathbf{X}_2$ and $\tilde{\mathbf{y}} = \mathbf{M}_1\mathbf{y}$.
This says: $\boldsymbol{\beta}_2$ measures the relationship between the parts of $\mathbf{y}$ and $\mathbf{X}_2$ that are not explained by $\mathbf{X}_1$.
This is the formal statement of "controlling for" or "adjusting for" other variables. When you include control variables in a regression, you're measuring the effect of your variables of interest after removing the linear influence of the controls.
In multiple regression with correlated predictors, each coefficient represents the unique contribution of that predictor—what it explains beyond what other predictors already explain. This is why coefficients change when you add or remove predictors.
A common source of confusion: why does the coefficient of $x_1$ change when you add $x_2$ to the model?
The geometric answer:
When you have only $x_1$, you project $\mathbf{y}$ onto the two-dimensional subspace spanned by $\mathbf{1}$ and $\mathbf{x}_1$. The coefficient $\beta_1$ scales how much of $\mathbf{x}_1$ you need.
When you add $x_2$, you project onto the three-dimensional subspace spanned by $\mathbf{1}$, $\mathbf{x}_1$, and $\mathbf{x}_2$. The coefficient on $\mathbf{x}_1$ now measures the contribution of the part of $\mathbf{x}_1$ that is orthogonal to the other columns, including $\mathbf{x}_2$.
When do coefficients stay the same?
The coefficient $\beta_1$ remains unchanged when adding $x_2$ if and only if $x_2$ is orthogonal to $x_1$ (after accounting for the other included predictors, such as the intercept) or the estimated coefficient on $x_2$ is zero. The table below summarizes the common scenarios:
| Scenario | What Happens to β₁ | Why |
|---|---|---|
| x₂ ⊥ x₁ and x₂ ⊥ y | Unchanged | New variable is noise orthogonal to everything |
| x₂ ⊥ x₁ but x₂ correlated with y | Unchanged | x₂ explains different variance than x₁ |
| x₂ correlated with x₁ but not y | May change (toward zero) | x₂ 'steals' variance x₁ was explaining |
| x₂ correlated with both x₁ and y | Changes (often substantially) | Part of x₁'s effect was really x₂'s effect |
| x₂ is confounder (causes x₁ and y) | Changes (toward causal effect) | Removing confounding reveals true relationship |
If a relevant variable is omitted from the regression, coefficients of included variables absorb its effect. The bias equals the product of (1) the effect of the omitted variable on y, and (2) the regression of the omitted variable on included variables. This is the core of omitted variable bias.
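The bias formula can be checked numerically. The sketch below (synthetic data and helper names of our own) fits a "long" regression that includes $x_2$ and a "short" regression that omits it, and verifies the in-sample identity: short coefficient = long coefficient + (regression of $x_2$ on $x_1$) × (coefficient of $x_2$).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)              # x2 correlated with x1
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

def ols(X, y):
    """Solve the normal equations for a given design matrix."""
    return np.linalg.solve(X.T @ X, X.T @ y)

ones = np.ones(n)
b_long = ols(np.column_stack([ones, x1, x2]), y)    # includes x2
b_short = ols(np.column_stack([ones, x1]), y)       # omits x2
delta = ols(np.column_stack([ones, x1]), x2)        # auxiliary regression of x2 on x1

# Omitted variable bias identity (holds exactly in-sample):
# short coefficient on x1 = long coefficient on x1 + delta * coefficient on x2
print(b_short[1], b_long[1] + delta[1] * b_long[2])
```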
Regularization modifies the projection geometry in illuminating ways. Let's see how Ridge regression changes the picture.
Ridge regression:
$$\hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$$
The ridge estimator is no longer a pure projection onto $\mathcal{C}(\mathbf{X})$. Instead, it's a shrinkage operator.
SVD perspective:
Using the SVD $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top$, we can write:
$$\hat{\boldsymbol{\beta}}_{\text{OLS}} = \sum_{j=1}^{r} \frac{1}{\sigma_j} (\mathbf{u}_j^\top \mathbf{y}) \mathbf{v}_j$$
$$\hat{\boldsymbol{\beta}}_{\text{ridge}} = \sum_{j=1}^{r} \frac{\sigma_j}{\sigma_j^2 + \lambda} (\mathbf{u}_j^\top \mathbf{y}) \mathbf{v}_j$$
Interpretation: OLS weights the data component $\mathbf{u}_j^\top\mathbf{y}$ along each singular direction by $1/\sigma_j$; Ridge multiplies that weight by the factor $\sigma_j^2/(\sigma_j^2 + \lambda)$, which is always less than one and shrinks most strongly where $\sigma_j$ is small.
Geometric effect:
Directions in parameter space corresponding to small singular values (high-variance directions) are shrunk more aggressively. Ridge "pulls" the OLS solution toward the origin along directions the data doesn't constrain well.
The shrinkage factor:
$$d_j = \frac{\sigma_j^2}{\sigma_j^2 + \lambda}$$
Ridge regression is OLS on an augmented dataset: append √λ·I as extra rows of X and corresponding zeros to y. The ridge solution is not an orthogonal projection in the original space, but it is the orthogonal projection in this augmented space.
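As a quick numerical sanity check (synthetic data of our own; the variable names are illustrative, not from the code later on this page), the ridge coefficients computed from the closed-form formula, from the SVD expression above, and from OLS on the augmented data all agree:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # intercept + 3 predictors
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(size=n)
lam = 5.0
k = X.shape[1]

# (1) Closed-form ridge estimator
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# (2) SVD form: sum_j sigma_j / (sigma_j^2 + lam) * (u_j^T y) * v_j
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

# (3) OLS on augmented data: rows sqrt(lam)*I appended to X, zeros appended to y
#     (this penalizes the intercept too, matching the formula above)
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(k)])
y_aug = np.concatenate([y, np.zeros(k)])
beta_aug = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_aug)

print(np.allclose(beta_direct, beta_svd), np.allclose(beta_direct, beta_aug))  # True True
```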
The projection view provides a natural definition of model complexity through effective degrees of freedom.
For OLS: $$\text{df}_{\text{OLS}} = \text{trace}(\mathbf{H}) = p + 1$$
The degrees of freedom equals the rank of the hat matrix, which equals the number of parameters.
For Ridge: $$\text{df}_{\text{ridge}}(\lambda) = \text{trace}(\mathbf{H}_\lambda) = \sum_{j=1}^{p+1} \frac{\sigma_j^2}{\sigma_j^2 + \lambda}$$
where $\mathbf{H}_\lambda = \mathbf{X}(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top$.
Effective degrees of freedom allows fair comparison between models of different types. A Ridge regression with df = 5 uses complexity equivalent to OLS with 5 parameters, even though Ridge nominally has p+1 parameters. This is essential for information criteria (AIC, BIC) with regularized models.
The projection view extends naturally to constrained least squares problems, where we minimize $|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|^2$ subject to linear constraints $\mathbf{A}\boldsymbol{\beta} = \mathbf{c}$.
Geometric interpretation:
The constraint $\mathbf{A}\boldsymbol{\beta} = \mathbf{c}$ defines an affine subspace of the parameter space. We seek the point in this subspace whose image under $\mathbf{X}$ is closest to $\mathbf{y}$.
Solution via projection:
$$\hat{\boldsymbol{\beta}}_c = \hat{\boldsymbol{\beta}}_{\text{OLS}} - (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{A}^\top[\mathbf{A}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{A}^\top]^{-1}(\mathbf{A}\hat{\boldsymbol{\beta}}_{\text{OLS}} - \mathbf{c})$$
This adjusts the unconstrained solution to satisfy the constraints, moving minimally in the metric defined by $\mathbf{X}^\top\mathbf{X}$.
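Here is a brief sketch, with synthetic data of our own, that applies the adjustment formula to impose a single sum-to-zero constraint and confirms that the constrained solution satisfies it exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ y

# Constraint A beta = c: force the last two slopes to sum to zero
A = np.array([[0.0, 0.0, 1.0, 1.0]])
c = np.array([0.0])

# Adjustment formula: move the OLS solution minimally (in the X^T X metric) onto A beta = c
correction = XtX_inv @ A.T @ np.linalg.solve(A @ XtX_inv @ A.T, A @ beta_ols - c)
beta_constrained = beta_ols - correction

print(A @ beta_constrained)          # ~[0.], constraint holds
print(beta_ols, beta_constrained)    # compare unconstrained vs constrained coefficients
```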
Applications:
Testing linear hypotheses ($\mathbf{c} = \mathbf{0}$): Impose $\mathbf{A}\boldsymbol{\beta} = \mathbf{0}$ and test whether the increase in SSE is significant
Sign and ordering constraints: Require coefficients to be non-negative or ordered (inequality rather than equality constraints, handled with closely related quadratic programming methods)
Sum-to-zero constraints: In ANOVA-style models, constrain coefficients to sum to zero for identifiability
Known relationships: Incorporate prior knowledge that certain coefficient ratios are fixed
Each independent linear constraint removes one dimension from the parameter space. With k constraints, the effective dimensionality drops from p+1 to p+1-k. This is another form of regularization—incorporating prior knowledge to reduce model complexity.
When errors have unequal variances (heteroscedasticity), OLS remains unbiased but is no longer BLUE. Weighted Least Squares (WLS) restores efficiency.
The problem: $$\text{Var}(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{W}^{-1}$$
where $\mathbf{W}$ is a known diagonal matrix of weights (larger weights = smaller variance = more reliable observations).
WLS objective: $$\min_{\boldsymbol{\beta}} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top\mathbf{W}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
Solution: $$\hat{\boldsymbol{\beta}}_{\text{WLS}} = (\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}\mathbf{y}$$
Geometric interpretation:
WLS is orthogonal projection in a modified inner product space. Instead of the standard inner product $\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^\top\mathbf{v}$, we use $\langle \mathbf{u}, \mathbf{v} \rangle_\mathbf{W} = \mathbf{u}^\top\mathbf{W}\mathbf{v}$.
The projection onto $\mathcal{C}(\mathbf{X})$ with this weighted inner product gives WLS. Geometrically, we're stretching the space so that low-variance observations contribute more.
Equivalent transformation:
Let $\mathbf{W} = \mathbf{D}^2$ where $\mathbf{D}$ is diagonal with $d_{ii} = \sqrt{w_{ii}}$. Then: $$\hat{\boldsymbol{\beta}}_{\text{WLS}} = ((\mathbf{D}\mathbf{X})^\top(\mathbf{D}\mathbf{X}))^{-1}(\mathbf{D}\mathbf{X})^\top(\mathbf{D}\mathbf{y})$$
WLS on $(\mathbf{X}, \mathbf{y})$ equals OLS on the transformed data $(\mathbf{D}\mathbf{X}, \mathbf{D}\mathbf{y})$.
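A minimal sketch with synthetic heteroscedastic data (our own setup) confirming this equivalence: the weighted normal equations and OLS on the rescaled data $(\mathbf{D}\mathbf{X}, \mathbf{D}\mathbf{y})$ give identical coefficients.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(0, 5, size=n)
X = np.column_stack([np.ones(n), x])
sigma = 0.2 + 0.3 * x                      # error standard deviation grows with x
y = 1.0 + 2.0 * x + sigma * rng.normal(size=n)

w = 1.0 / sigma**2                         # weights = inverse variances
W = np.diag(w)

# WLS via the weighted normal equations
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Equivalent: OLS on the rescaled data (D X, D y) with D = sqrt(W)
D = np.diag(np.sqrt(w))
Xt, yt = D @ X, D @ y
beta_ols_transformed = np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)

print(np.allclose(beta_wls, beta_ols_transformed))   # True
```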
When errors are both heteroscedastic AND correlated (e.g., time series), the full covariance matrix Σ replaces W⁻¹. The solution is β̂_GLS = (X⊤Σ⁻¹X)⁻¹X⊤Σ⁻¹y. This is orthogonal projection in the inner product defined by Σ⁻¹.
Let's synthesize everything we've learned about multiple linear regression into a unified understanding.
| Perspective | Core Idea | Key Insight |
|---|---|---|
| Algebraic | $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$ | Compact representation enabling closed-form solutions |
| Optimization | Minimize $|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|^2$ | Quadratic objective with unique global minimum |
| Statistical | OLS is BLUE | Best unbiased estimator under Gauss-Markov assumptions |
| Geometric | Project $\mathbf{y}$ onto $\mathcal{C}(\mathbf{X})$ | Residuals are orthogonal to column space |
| Sequential | Frisch-Waugh-Lovell | Coefficients measure partial effects after controlling |
| Regularized | Shrinkage toward origin | Trade bias for reduced variance in high dimensions |
The unified view:
All these perspectives describe the same mathematical object. The power comes from switching perspectives based on what you need: the algebraic form for computation, the optimization view for deriving estimators, the statistical view for inference, the geometric view for intuition about fit and residuals, and the sequential view for interpreting individual coefficients.
Let's implement key concepts from this page: sequential regression and the effective degrees of freedom for Ridge regression.
```python
import numpy as np
from numpy.linalg import svd, solve

def frisch_waugh_lovell(X1, X2, y):
    """
    Demonstrate Frisch-Waugh-Lovell theorem.

    Computes β₂ in y = X₁β₁ + X₂β₂ + ε by regressing
    residualized y on residualized X₂.

    Returns:
    --------
    beta2_direct : coefficients from full regression
    beta2_fwl : coefficients via Frisch-Waugh-Lovell
    """
    n = y.shape[0]

    # Method 1: Direct full regression
    X_full = np.column_stack([X1, X2])
    beta_full = solve(X_full.T @ X_full, X_full.T @ y)
    beta2_direct = beta_full[X1.shape[1]:]

    # Method 2: Frisch-Waugh-Lovell
    # Step 1: Residual maker for X1
    H1 = X1 @ solve(X1.T @ X1, X1.T)
    M1 = np.eye(n) - H1

    # Step 2: Residualize y and X2
    y_resid = M1 @ y
    X2_resid = M1 @ X2

    # Step 3: Regress residualized y on residualized X2
    beta2_fwl = solve(X2_resid.T @ X2_resid, X2_resid.T @ y_resid)

    return beta2_direct, beta2_fwl

def effective_degrees_of_freedom(X, lambdas):
    """
    Compute effective degrees of freedom for Ridge regression.

    df(λ) = Σⱼ σⱼ² / (σⱼ² + λ)
    """
    # Compute SVD
    U, s, Vt = svd(X, full_matrices=False)

    df_values = []
    for lam in lambdas:
        df = np.sum(s**2 / (s**2 + lam))
        df_values.append(df)

    return np.array(df_values)

def ridge_shrinkage_factors(X, lam):
    """
    Compute shrinkage factors for each singular direction.

    d_j = σⱼ² / (σⱼ² + λ)
    """
    U, s, Vt = svd(X, full_matrices=False)
    d = s**2 / (s**2 + lam)
    return s, d

# Example: Frisch-Waugh-Lovell demonstration
np.random.seed(42)
n = 100

# Create correlated predictors
x1 = np.random.randn(n)
x2 = 0.7 * x1 + 0.3 * np.random.randn(n)  # Correlated with x1
x3 = np.random.randn(n)  # Independent

# True model: y = 2 + 1*x1 + 0.5*x2 + 0.3*x3 + noise
y = 2 + 1*x1 + 0.5*x2 + 0.3*x3 + np.random.randn(n) * 0.5

# Split into X1 (controls) and X2 (variables of interest)
X1 = np.column_stack([np.ones(n), x1])  # Intercept + x1
X2 = np.column_stack([x2, x3])          # x2 and x3

beta2_direct, beta2_fwl = frisch_waugh_lovell(X1, X2, y)

print("Frisch-Waugh-Lovell Demonstration")
print("=" * 50)
print(f"β₂ from full regression:    {beta2_direct}")
print(f"β₂ via Frisch-Waugh-Lovell: {beta2_fwl}")
print(f"Difference (should be ~0): {np.max(np.abs(beta2_direct - beta2_fwl)):.2e}")

# Example: Effective degrees of freedom
print("\n" + "=" * 50)
print("Effective Degrees of Freedom for Ridge Regression")
print("=" * 50)

X = np.column_stack([np.ones(n), x1, x2, x3])
p_plus_1 = X.shape[1]

lambdas = [0, 0.1, 1, 10, 100, 1000]
df_values = effective_degrees_of_freedom(X, lambdas)

print(f"Number of parameters: {p_plus_1}")
print(f"\n{'λ':>10} {'df(λ)':>10} {'Shrinkage':>12}")
print("-" * 35)
for lam, df in zip(lambdas, df_values):
    shrinkage = 1 - df/p_plus_1
    print(f"{lam:>10.1f} {df:>10.2f} {shrinkage:>11.1%}")

# Example: Shrinkage factors
print("\n" + "=" * 50)
print("Ridge Shrinkage Factors (λ = 1)")
print("=" * 50)

singvals, shrink_factors = ridge_shrinkage_factors(X, lam=1)
print(f"{'Direction':>10} {'σⱼ':>10} {'dⱼ':>10}")
print("-" * 35)
for j, (s, d) in enumerate(zip(singvals, shrink_factors)):
    print(f"{j+1:>10} {s:>10.4f} {d:>10.4f}")
```

We've completed a comprehensive treatment of multiple linear regression, developing mastery from multiple angles. Let's consolidate the entire module:
Core competencies developed:
You now have a complete, rigorous understanding of multiple linear regression from matrix formulation through projection geometry. This foundation prepares you for the statistical properties of OLS (Chapter 6, Module 3), maximum likelihood perspectives (Module 4), and eventually regularization methods (Chapter 7).