We've established that OLS is orthogonal projection onto the column space of $\mathbf{X}$. This final page deepens that perspective, revealing additional insights that emerge when we fully embrace the projection viewpoint.
We'll explore how coefficients can be understood through sequential projections (Gram-Schmidt), why regularization modifies the projection geometry, and how the projection view illuminates the interpretation of coefficients in the presence of correlated predictors.
These insights are not merely theoretical—they directly inform how we interpret regression output and understand model behavior in practice.
By the end of this page, you will understand Gram-Schmidt in the regression context, see how sequential regression reveals partial effects, grasp the geometric interpretation of regularization, and synthesize the complete picture of multiple regression.
The Gram-Schmidt process transforms a set of linearly independent vectors into an orthonormal set that spans the same subspace. Applied to the columns of $\mathbf{X}$, it reveals the structure of multiple regression.
The Gram-Schmidt algorithm:
Given vectors $\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_p$ (columns of $\mathbf{X}$), produce orthonormal vectors $\mathbf{q}_0, \mathbf{q}_1, \ldots, \mathbf{q}_p$:
$$\begin{aligned} \mathbf{u}_0 &= \mathbf{x}_0, \quad \mathbf{q}_0 = \mathbf{u}_0 / |\mathbf{u}_0| \\ \mathbf{u}_j &= \mathbf{x}_j - \sum_{k=0}^{j-1} (\mathbf{q}_k^\top \mathbf{x}_j) \mathbf{q}_k, \quad \mathbf{q}_j = \mathbf{u}_j / |\mathbf{u}_j| \end{aligned}$$
Each $\mathbf{u}_j$ is what remains of $\mathbf{x}_j$ after removing its projections onto all previous orthogonalized vectors.
Gram-Schmidt is algorithmic QR decomposition. Writing X = QR, where Q has orthonormal columns and R is upper triangular, the normal equations become R⊤Rβ = R⊤Q⊤y, which simplifies to Rβ = Q⊤y—a triangular system solved by back-substitution.
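To make this concrete, here is a minimal sketch (the helper name `gram_schmidt` and the data are ours, not from any library) that orthogonalizes the columns of a small design matrix with classical Gram-Schmidt and then solves the triangular system $\mathbf{R}\boldsymbol{\beta} = \mathbf{Q}^\top\mathbf{y}$, checking the answer against the normal equations:

```python
import numpy as np

def gram_schmidt(X):
    """Classical Gram-Schmidt: return Q (orthonormal columns) and R (upper triangular)."""
    n, p = X.shape
    Q = np.zeros((n, p))
    R = np.zeros((p, p))
    for j in range(p):
        u = X[:, j].copy()
        for k in range(j):
            R[k, j] = Q[:, k] @ X[:, j]   # projection coefficient q_k^T x_j
            u -= R[k, j] * Q[:, k]        # remove the component along q_k
        R[j, j] = np.linalg.norm(u)       # length of what remains
        Q[:, j] = u / R[j, j]
    return Q, R

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -0.5]) + 0.1 * rng.normal(size=n)

Q, R = gram_schmidt(X)
beta_qr = np.linalg.solve(R, Q.T @ y)         # R beta = Q^T y (triangular system)
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations, for comparison
print(np.allclose(beta_qr, beta_ne))          # True
```

In practice one would call a library routine such as `np.linalg.qr`, which uses numerically stabler Householder reflections, but the hand-rolled version makes the sequential-projection structure explicit.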
Interpretation in regression:
When we orthogonalize the design matrix, the $j$-th orthogonalized column $\mathbf{u}_j$ represents what's "new" in predictor $j$ that isn't explained by predictors $0, 1, \ldots, j-1$.
The regression coefficient $\beta_j$ in OLS measures the effect of this "new" part—the partial effect of $x_j$ after controlling for other predictors.
Gram-Schmidt provides a powerful interpretation: the coefficient of $x_j$ in multiple regression is the simple regression coefficient of $y$ on the residualized $x_j$.
The Frisch-Waugh-Lovell theorem:
Consider partitioning $\mathbf{X} = [\mathbf{X}_1 | \mathbf{X}_2]$ and writing $\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\epsilon}$.
The coefficient $\boldsymbol{\beta}_2$ can be obtained by:

$$\boldsymbol{\beta}_2 = (\mathbf{X}_2^\top\mathbf{M}_1\mathbf{X}_2)^{-1}\mathbf{X}_2^\top\mathbf{M}_1\mathbf{y}$$
where $\mathbf{M}_1 = \mathbf{I} - \mathbf{X}_1(\mathbf{X}_1^\top\mathbf{X}_1)^{-1}\mathbf{X}_1^\top$ is the residual maker.
Interpretation:
$$\boldsymbol{\beta}_2 = (\tilde{\mathbf{X}}_2^\top\tilde{\mathbf{X}}_2)^{-1}\tilde{\mathbf{X}}_2^\top\tilde{\mathbf{y}}$$
where $\tilde{\mathbf{X}}_2 = \mathbf{M}_1\mathbf{X}_2$ and $\tilde{\mathbf{y}} = \mathbf{M}_1\mathbf{y}$.
This says: $\boldsymbol{\beta}_2$ measures the relationship between the parts of $\mathbf{y}$ and $\mathbf{X}_2$ that are not explained by $\mathbf{X}_1$.
This is the formal statement of "controlling for" or "adjusting for" other variables. When you include control variables in a regression, you're measuring the effect of your variables of interest after removing the linear influence of the controls.
In multiple regression with correlated predictors, each coefficient represents the unique contribution of that predictor—what it explains beyond what other predictors already explain. This is why coefficients change when you add or remove predictors.
A common source of confusion: why does the coefficient of $x_1$ change when you add $x_2$ to the model?
The geometric answer:
When you have only $x_1$, you project $\mathbf{y}$ onto the two-dimensional subspace spanned by $\mathbf{1}$ and $\mathbf{x}_1$. The coefficient $\beta_1$ scales how much of $\mathbf{x}_1$ you need.
When you add $x_2$, you project onto the three-dimensional subspace spanned by $\mathbf{1}$, $\mathbf{x}_1$, and $\mathbf{x}_2$. The coefficient on $\mathbf{x}_1$ now measures the contribution of the part of $\mathbf{x}_1$ that is orthogonal to the other columns, including $\mathbf{x}_2$.
When do coefficients stay the same?
The coefficient $\beta_1$ remains unchanged when adding $x_2$ if and only if $x_2$ is orthogonal to $x_1$ (after accounting for the other included predictors, such as the intercept) or the estimated coefficient on $x_2$ is zero. The table below summarizes the common scenarios:
| Scenario | What Happens to β₁ | Why |
|---|---|---|
| x₂ ⊥ x₁ and x₂ ⊥ y | Unchanged | New variable is noise orthogonal to everything |
| x₂ ⊥ x₁ but x₂ correlated with y | Unchanged | x₂ explains different variance than x₁ |
| x₂ correlated with x₁ but not y | May change (toward zero) | x₂ 'steals' variance x₁ was explaining |
| x₂ correlated with both x₁ and y | Changes (often substantially) | Part of x₁'s effect was really x₂'s effect |
| x₂ is confounder (causes x₁ and y) | Changes (toward causal effect) | Removing confounding reveals true relationship |
If a relevant variable is omitted from the regression, coefficients of included variables absorb its effect. The bias equals the product of (1) the effect of the omitted variable on y, and (2) the regression of the omitted variable on included variables. This is the core of omitted variable bias.
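The bias formula can be checked numerically. The sketch below (synthetic data and helper names of our own) fits a "long" regression that includes $x_2$ and a "short" regression that omits it, and verifies the in-sample identity: short coefficient = long coefficient + (regression of $x_2$ on $x_1$) × (coefficient of $x_2$).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)              # x2 correlated with x1
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

def ols(X, y):
    """Solve the normal equations for a given design matrix."""
    return np.linalg.solve(X.T @ X, X.T @ y)

ones = np.ones(n)
b_long = ols(np.column_stack([ones, x1, x2]), y)    # includes x2
b_short = ols(np.column_stack([ones, x1]), y)       # omits x2
delta = ols(np.column_stack([ones, x1]), x2)        # auxiliary regression of x2 on x1

# Omitted variable bias identity (holds exactly in-sample):
# short coefficient on x1 = long coefficient on x1 + delta * coefficient on x2
print(b_short[1], b_long[1] + delta[1] * b_long[2])
```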
Regularization modifies the projection geometry in illuminating ways. Let's see how Ridge regression changes the picture.
Ridge regression:
$$\hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$$
The ridge estimator is no longer a pure projection onto $\mathcal{C}(\mathbf{X})$. Instead, it's a shrinkage operator.
SVD perspective:
Using the SVD $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top$, we can write:
$$\hat{\boldsymbol{\beta}}_{\text{OLS}} = \sum_{j=1}^{r} \frac{1}{\sigma_j} (\mathbf{u}_j^\top \mathbf{y}) \mathbf{v}_j$$
$$\hat{\boldsymbol{\beta}}_{\text{ridge}} = \sum_{j=1}^{r} \frac{\sigma_j}{\sigma_j^2 + \lambda} (\mathbf{u}_j^\top \mathbf{y}) \mathbf{v}_j$$
Interpretation: OLS weights the data component $\mathbf{u}_j^\top\mathbf{y}$ along each singular direction by $1/\sigma_j$; Ridge multiplies that weight by the factor $\sigma_j^2/(\sigma_j^2 + \lambda)$, which is always less than one and shrinks most strongly where $\sigma_j$ is small.
Geometric effect:
Directions in parameter space corresponding to small singular values (high-variance directions) are shrunk more aggressively. Ridge "pulls" the OLS solution toward the origin along directions the data doesn't constrain well.
The shrinkage factor:
$$d_j = \frac{\sigma_j^2}{\sigma_j^2 + \lambda}$$
Ridge regression is OLS on an augmented dataset: append √λ·I as extra rows of X and corresponding zeros to y. The ridge solution is not an orthogonal projection in the original space, but it is the orthogonal projection in this augmented space.
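As a quick numerical sanity check (synthetic data of our own; the variable names are illustrative, not from the code later on this page), the ridge coefficients computed from the closed-form formula, from the SVD expression above, and from OLS on the augmented data all agree:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # intercept + 3 predictors
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(size=n)
lam = 5.0
k = X.shape[1]

# (1) Closed-form ridge estimator
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# (2) SVD form: sum_j sigma_j / (sigma_j^2 + lam) * (u_j^T y) * v_j
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

# (3) OLS on augmented data: rows sqrt(lam)*I appended to X, zeros appended to y
#     (this penalizes the intercept too, matching the formula above)
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(k)])
y_aug = np.concatenate([y, np.zeros(k)])
beta_aug = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_aug)

print(np.allclose(beta_direct, beta_svd), np.allclose(beta_direct, beta_aug))  # True True
```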
The projection view provides a natural definition of model complexity through effective degrees of freedom.
For OLS: $$\text{df}_{\text{OLS}} = \text{trace}(\mathbf{H}) = p + 1$$
The degrees of freedom equals the rank of the hat matrix, which equals the number of parameters.
For Ridge: $$\text{df}_{\text{ridge}}(\lambda) = \text{trace}(\mathbf{H}_\lambda) = \sum_{j=1}^{p+1} \frac{\sigma_j^2}{\sigma_j^2 + \lambda}$$
where $\mathbf{H}_\lambda = \mathbf{X}(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top$.
Effective degrees of freedom allows fair comparison between models of different types. A Ridge regression with df = 5 uses complexity equivalent to OLS with 5 parameters, even though Ridge nominally has p+1 parameters. This is essential for information criteria (AIC, BIC) with regularized models.
The projection view extends naturally to constrained least squares problems, where we minimize $|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|^2$ subject to linear constraints $\mathbf{A}\boldsymbol{\beta} = \mathbf{c}$.
Geometric interpretation:
The constraint $\mathbf{A}\boldsymbol{\beta} = \mathbf{c}$ defines an affine subspace of the parameter space. We seek the point in this subspace whose image under $\mathbf{X}$ is closest to $\mathbf{y}$.
Solution via projection:
$$\hat{\boldsymbol{\beta}}_c = \hat{\boldsymbol{\beta}}_{\text{OLS}} - (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{A}^\top[\mathbf{A}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{A}^\top]^{-1}(\mathbf{A}\hat{\boldsymbol{\beta}}_{\text{OLS}} - \mathbf{c})$$
This adjusts the unconstrained solution to satisfy the constraints, moving minimally in the metric defined by $\mathbf{X}^\top\mathbf{X}$.
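Here is a brief sketch, with synthetic data of our own, that applies the adjustment formula to impose a single sum-to-zero constraint and confirms that the constrained solution satisfies it exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ y

# Constraint A beta = c: force the last two slopes to sum to zero
A = np.array([[0.0, 0.0, 1.0, 1.0]])
c = np.array([0.0])

# Adjustment formula: move the OLS solution minimally (in the X^T X metric) onto A beta = c
correction = XtX_inv @ A.T @ np.linalg.solve(A @ XtX_inv @ A.T, A @ beta_ols - c)
beta_constrained = beta_ols - correction

print(A @ beta_constrained)          # ~[0.], constraint holds
print(beta_ols, beta_constrained)    # compare unconstrained vs constrained coefficients
```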
Applications:
Testing linear hypotheses ($\mathbf{c} = \mathbf{0}$): Impose $\mathbf{A}\boldsymbol{\beta} = \mathbf{0}$ and test whether the increase in SSE is significant
Sign and ordering constraints: Require coefficients to be non-negative or ordered (inequality rather than equality constraints, handled with closely related quadratic programming methods)
Sum-to-zero constraints: In ANOVA-style models, constrain coefficients to sum to zero for identifiability
Known relationships: Incorporate prior knowledge that certain coefficient ratios are fixed
Each independent linear constraint removes one dimension from the parameter space. With k constraints, the effective dimensionality drops from p+1 to p+1-k. This is another form of regularization—incorporating prior knowledge to reduce model complexity.
When errors have unequal variances (heteroscedasticity), OLS remains unbiased but is no longer BLUE. Weighted Least Squares (WLS) restores efficiency.
The problem: $$\text{Var}(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{W}^{-1}$$
where $\mathbf{W}$ is a known diagonal matrix of weights (larger weights = smaller variance = more reliable observations).
WLS objective: $$\min_{\boldsymbol{\beta}} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top\mathbf{W}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
Solution: $$\hat{\boldsymbol{\beta}}_{\text{WLS}} = (\mathbf{X}^\top\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{W}\mathbf{y}$$
Geometric interpretation:
WLS is orthogonal projection in a modified inner product space. Instead of the standard inner product $\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^\top\mathbf{v}$, we use $\langle \mathbf{u}, \mathbf{v} \rangle_\mathbf{W} = \mathbf{u}^\top\mathbf{W}\mathbf{v}$.
The projection onto $\mathcal{C}(\mathbf{X})$ with this weighted inner product gives WLS. Geometrically, we're stretching the space so that low-variance observations contribute more.
Equivalent transformation:
Let $\mathbf{W} = \mathbf{D}^2$ where $\mathbf{D}$ is diagonal with $d_{ii} = \sqrt{w_{ii}}$. Then: $$\hat{\boldsymbol{\beta}}_{\text{WLS}} = ((\mathbf{D}\mathbf{X})^\top(\mathbf{D}\mathbf{X}))^{-1}(\mathbf{D}\mathbf{X})^\top(\mathbf{D}\mathbf{y})$$
WLS on $(\mathbf{X}, \mathbf{y})$ equals OLS on the transformed data $(\mathbf{D}\mathbf{X}, \mathbf{D}\mathbf{y})$.
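A minimal sketch with synthetic heteroscedastic data (our own setup) confirming this equivalence: the weighted normal equations and OLS on the rescaled data $(\mathbf{D}\mathbf{X}, \mathbf{D}\mathbf{y})$ give identical coefficients.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(0, 5, size=n)
X = np.column_stack([np.ones(n), x])
sigma = 0.2 + 0.3 * x                      # error standard deviation grows with x
y = 1.0 + 2.0 * x + sigma * rng.normal(size=n)

w = 1.0 / sigma**2                         # weights = inverse variances
W = np.diag(w)

# WLS via the weighted normal equations
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Equivalent: OLS on the rescaled data (D X, D y) with D = sqrt(W)
D = np.diag(np.sqrt(w))
Xt, yt = D @ X, D @ y
beta_ols_transformed = np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)

print(np.allclose(beta_wls, beta_ols_transformed))   # True
```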
When errors are both heteroscedastic AND correlated (e.g., time series), the full covariance matrix Σ replaces W⁻¹. The solution is β̂_GLS = (X⊤Σ⁻¹X)⁻¹X⊤Σ⁻¹y. This is orthogonal projection in the inner product defined by Σ⁻¹.
Let's synthesize everything we've learned about multiple linear regression into a unified understanding.
| Perspective | Core Idea | Key Insight |
|---|---|---|
| Algebraic | $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$ | Compact representation enabling closed-form solutions |
| Optimization | Minimize $|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|^2$ | Quadratic objective with unique global minimum |
| Statistical | OLS is BLUE | Best unbiased estimator under Gauss-Markov assumptions |
| Geometric | Project $\mathbf{y}$ onto $\mathcal{C}(\mathbf{X})$ | Residuals are orthogonal to column space |
| Sequential | Frisch-Waugh-Lovell | Coefficients measure partial effects after controlling |
| Regularized | Shrinkage toward origin | Trade bias for reduced variance in high dimensions |
The unified view:
All these perspectives describe the same mathematical object. The power comes from switching perspectives based on what you need: the algebraic form for computation, the optimization view for deriving estimators, the statistical view for inference, the geometric view for intuition about fit and residuals, and the sequential view for interpreting individual coefficients.
Let's implement key concepts from this page: sequential regression and the effective degrees of freedom for Ridge regression.
```python
import numpy as np
from numpy.linalg import svd, solve

def frisch_waugh_lovell(X1, X2, y):
    """
    Demonstrate Frisch-Waugh-Lovell theorem.

    Computes β₂ in y = X₁β₁ + X₂β₂ + ε by regressing
    residualized y on residualized X₂.

    Returns:
    --------
    beta2_direct : coefficients from full regression
    beta2_fwl : coefficients via Frisch-Waugh-Lovell
    """
    n = y.shape[0]

    # Method 1: Direct full regression
    X_full = np.column_stack([X1, X2])
    beta_full = solve(X_full.T @ X_full, X_full.T @ y)
    beta2_direct = beta_full[X1.shape[1]:]

    # Method 2: Frisch-Waugh-Lovell
    # Step 1: Residual maker for X1
    H1 = X1 @ solve(X1.T @ X1, X1.T)
    M1 = np.eye(n) - H1

    # Step 2: Residualize y and X2
    y_resid = M1 @ y
    X2_resid = M1 @ X2

    # Step 3: Regress residualized y on residualized X2
    beta2_fwl = solve(X2_resid.T @ X2_resid, X2_resid.T @ y_resid)

    return beta2_direct, beta2_fwl

def effective_degrees_of_freedom(X, lambdas):
    """
    Compute effective degrees of freedom for Ridge regression.

    df(λ) = Σⱼ σⱼ² / (σⱼ² + λ)
    """
    # Compute SVD
    U, s, Vt = svd(X, full_matrices=False)

    df_values = []
    for lam in lambdas:
        df = np.sum(s**2 / (s**2 + lam))
        df_values.append(df)

    return np.array(df_values)

def ridge_shrinkage_factors(X, lam):
    """
    Compute shrinkage factors for each singular direction.

    d_j = σⱼ² / (σⱼ² + λ)
    """
    U, s, Vt = svd(X, full_matrices=False)
    d = s**2 / (s**2 + lam)
    return s, d

# Example: Frisch-Waugh-Lovell demonstration
np.random.seed(42)
n = 100

# Create correlated predictors
x1 = np.random.randn(n)
x2 = 0.7 * x1 + 0.3 * np.random.randn(n)  # Correlated with x1
x3 = np.random.randn(n)  # Independent

# True model: y = 2 + 1*x1 + 0.5*x2 + 0.3*x3 + noise
y = 2 + 1*x1 + 0.5*x2 + 0.3*x3 + np.random.randn(n) * 0.5

# Split into X1 (controls) and X2 (variables of interest)
X1 = np.column_stack([np.ones(n), x1])  # Intercept + x1
X2 = np.column_stack([x2, x3])          # x2 and x3

beta2_direct, beta2_fwl = frisch_waugh_lovell(X1, X2, y)

print("Frisch-Waugh-Lovell Demonstration")
print("=" * 50)
print(f"β₂ from full regression:    {beta2_direct}")
print(f"β₂ via Frisch-Waugh-Lovell: {beta2_fwl}")
print(f"Difference (should be ~0): {np.max(np.abs(beta2_direct - beta2_fwl)):.2e}")

# Example: Effective degrees of freedom
print("\n" + "=" * 50)
print("Effective Degrees of Freedom for Ridge Regression")
print("=" * 50)

X = np.column_stack([np.ones(n), x1, x2, x3])
p_plus_1 = X.shape[1]

lambdas = [0, 0.1, 1, 10, 100, 1000]
df_values = effective_degrees_of_freedom(X, lambdas)

print(f"Number of parameters: {p_plus_1}")
print(f"\n{'λ':>10} {'df(λ)':>10} {'Shrinkage':>12}")
print("-" * 35)
for lam, df in zip(lambdas, df_values):
    shrinkage = 1 - df/p_plus_1
    print(f"{lam:>10.1f} {df:>10.2f} {shrinkage:>11.1%}")

# Example: Shrinkage factors
print("\n" + "=" * 50)
print("Ridge Shrinkage Factors (λ = 1)")
print("=" * 50)

singvals, shrink_factors = ridge_shrinkage_factors(X, lam=1)
print(f"{'Direction':>10} {'σⱼ':>10} {'dⱼ':>10}")
print("-" * 35)
for j, (s, d) in enumerate(zip(singvals, shrink_factors)):
    print(f"{j+1:>10} {s:>10.4f} {d:>10.4f}")
```

We've completed a comprehensive treatment of multiple linear regression, developing mastery from multiple angles. Let's consolidate the entire module:
Core competencies developed:
You now have a complete, rigorous understanding of multiple linear regression from matrix formulation through projection geometry. This foundation prepares you for the statistical properties of OLS (Chapter 6, Module 3), maximum likelihood perspectives (Module 4), and eventually regularization methods (Chapter 7).