We've derived the normal equations algebraically and established statistical properties, but there's a deeper way to understand linear regression: geometry.
In this view, the response vector $\mathbf{y}$ lives in $n$-dimensional space, and the columns of the design matrix $\mathbf{X}$ span a subspace. The OLS solution is the point in that subspace closest to $\mathbf{y}$—a geometric projection.
This perspective isn't just elegant—it provides immediate intuition for why OLS works, what residuals represent, and why the normal equations have their particular form. It connects linear regression to fundamental linear algebra concepts that recur throughout machine learning.
By the end of this page, you will visualize linear regression as orthogonal projection onto a subspace, understand the hat matrix as a projection operator, derive the normal equations geometrically, and gain intuition for why minimizing squared error is equivalent to finding the closest point.
To develop geometric intuition, we must shift our perspective from viewing data as $n$ observations to viewing vectors in $\mathbb{R}^n$.
The observation space (row view): each of the $n$ observations is a point whose coordinates are its feature values, so the data set is a cloud of $n$ points in predictor space. This is the familiar scatter-plot picture, where the fitted model is a line or hyperplane through the cloud.
The variable space (column view):
In variable space, we work in ℝⁿ where n is the number of observations. The response y and each feature column x₁, x₂, ..., xₚ are all vectors in this same n-dimensional space. Regression becomes a problem of finding the best linear combination of feature vectors to approximate y.
Example:
With $n = 100$ observations and $p = 3$ predictors:

- The response $\mathbf{y}$ is a single vector in $\mathbb{R}^{100}$.
- The design matrix $\mathbf{X}$ has four columns (the intercept plus three predictors), each a vector in $\mathbb{R}^{100}$.
- The column space spanned by these four vectors is a 4-dimensional subspace of $\mathbb{R}^{100}$.
The variable space perspective treats each observation as a dimension and each variable as a single vector.
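To make the column view concrete, here is a minimal sketch in NumPy (with made-up random data, matching the $n = 100$, $p = 3$ example above) showing that the response and each column of the design matrix are vectors in the same $\mathbb{R}^{100}$:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3
features = rng.normal(size=(n, p))           # three made-up predictor columns
X = np.column_stack([np.ones(n), features])  # prepend the intercept column of ones
y = rng.normal(size=n)                       # made-up response vector

print(X.shape)  # (100, 4): four column vectors living in R^100
print(y.shape)  # (100,):   the response is one more vector in the same R^100
```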
Definition: The column space of $\mathbf{X}$, denoted $\mathcal{C}(\mathbf{X})$, is the set of all linear combinations of the columns of $\mathbf{X}$:
$$\mathcal{C}(\mathbf{X}) = \{\mathbf{X}\boldsymbol{\beta} : \boldsymbol{\beta} \in \mathbb{R}^{p+1}\}$$
This is a $(p+1)$-dimensional subspace of $\mathbb{R}^n$ (assuming $\mathbf{X}$ has full column rank).
What this means:
When X has full column rank, the column space has dimension p+1 (intercept plus p predictors). This is vastly smaller than the ambient space ℝⁿ when n >> p: the column space is a low-dimensional flat subspace passing through the origin of a very high-dimensional space.
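A quick numerical check of this claim, using an assumed random full-rank design: the rank of $\mathbf{X}$ equals $p + 1$, the dimension of its column space, and any vector of the form $\mathbf{X}\boldsymbol{\beta}$ is by construction a point in that subspace.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # assumed full-rank design

# Dimension of the column space = rank of X: p + 1 = 4, far below the ambient n = 100
print(np.linalg.matrix_rank(X))  # 4

# Any vector X @ beta is, by construction, a point of C(X)
beta = np.array([1.0, -2.0, 0.5, 3.0])
v = X @ beta
print(v.shape)  # (100,) -- one point inside the 4-dimensional subspace of R^100
```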
Geometric picture:
The response $\mathbf{y}$ lives in $\mathbb{R}^n$ but generally does not lie in $\mathcal{C}(\mathbf{X})$, since no exact linear fit exists. Regression seeks the point of $\mathcal{C}(\mathbf{X})$ closest to $\mathbf{y}$. This closest point is the fitted value vector $\hat{\mathbf{y}}$, and the difference $\mathbf{y} - \hat{\mathbf{y}}$ is the residual vector $\mathbf{e}$.
Key theorem: The point in $\mathcal{C}(\mathbf{X})$ closest to $\mathbf{y}$ is found by orthogonal projection. The closest point is the unique vector $\hat{\mathbf{y}} \in \mathcal{C}(\mathbf{X})$ such that the error vector $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ is perpendicular to $\mathcal{C}(\mathbf{X})$.
Mathematically: $\hat{\mathbf{y}}$ minimizes $\|\mathbf{y} - \hat{\mathbf{y}}\|$ over all $\hat{\mathbf{y}} \in \mathcal{C}(\mathbf{X})$ if and only if: $$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} \perp \mathcal{C}(\mathbf{X})$$
where $\perp$ means perpendicular (orthogonal) to every vector in the subspace.
Why orthogonality gives the closest point:
Consider any other point $\tilde{\mathbf{y}} \in \mathcal{C}(\mathbf{X})$, and let $\mathbf{d} = \tilde{\mathbf{y}} - \hat{\mathbf{y}}$ (the difference, which lies in $\mathcal{C}(\mathbf{X})$).
Since $\mathbf{y} - \tilde{\mathbf{y}} = (\mathbf{y} - \hat{\mathbf{y}}) + (\hat{\mathbf{y}} - \tilde{\mathbf{y}}) = \mathbf{e} - \mathbf{d}$, the Pythagorean theorem (which applies because $\mathbf{e} \perp \mathbf{d}$) gives: $$\|\mathbf{y} - \tilde{\mathbf{y}}\|^2 = \|\mathbf{e} - \mathbf{d}\|^2 = \|\mathbf{e}\|^2 + \|\mathbf{d}\|^2 \geq \|\mathbf{e}\|^2$$
Equality holds only when $\mathbf{d} = \mathbf{0}$, i.e., $\tilde{\mathbf{y}} = \hat{\mathbf{y}}$.
Conclusion: The orthogonal projection is the unique closest point.
Minimizing the sum of squared errors ||y - ŷ||² is equivalent to finding the orthogonal projection of y onto the column space of X. The geometric and algebraic views are two perspectives on the same optimization.
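This equivalence is easy to verify numerically. The sketch below (on assumed toy data) computes the projection via the normal equations and checks that any other point of the column space, obtained by perturbing the coefficients, is at least as far from $\mathbf{y}$:

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # assumed toy design
y = rng.normal(size=n)                                      # assumed toy response

# Orthogonal projection of y onto C(X), via the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
best = np.linalg.norm(y - y_hat)

# Every other point of C(X) has the form X @ beta for some beta,
# and is at least as far from y as the projection is
for _ in range(5):
    other = X @ (beta_hat + rng.normal(scale=0.1, size=p + 1))
    assert np.linalg.norm(y - other) >= best

print(f"distance from y to its projection: {best:.4f} (no competitor in C(X) beats it)")
```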
We can derive the normal equations purely from the orthogonality condition—no calculus required.
The orthogonality condition:
For $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ to be perpendicular to $\mathcal{C}(\mathbf{X})$, it must be perpendicular to every column of $\mathbf{X}$ (since the columns span $\mathcal{C}(\mathbf{X})$).
Perpendicularity of vectors means zero inner product: $$\mathbf{x}_j^\top \mathbf{e} = 0 \quad \text{for } j = 0, 1, \ldots, p$$
Collecting all these conditions into a single matrix equation: $$\mathbf{X}^\top \mathbf{e} = \mathbf{0}$$
Deriving the normal equations:
Substituting $\mathbf{e} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$: $$\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}$$ $$\mathbf{X}^\top\mathbf{y} - \mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0}$$ $$\boxed{\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y}}$$
These are exactly the normal equations! The name "normal equations" comes from this geometric derivation: the residual is normal (perpendicular) to the column space.
We've now derived the normal equations two ways: (1) by setting the gradient of SSE to zero (calculus), and (2) by requiring the residual to be perpendicular to the column space (geometry). Same equations, different insights.
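As a sanity check on the "same equations, different insights" point, the sketch below (again on assumed synthetic data) solves the normal equations directly and also calls a generic least-squares routine that minimizes the SSE; both give the same coefficients, and the resulting residual is orthogonal to the column space:

```python
import numpy as np

rng = np.random.default_rng(2)

n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

# Geometric route: impose X^T e = 0, i.e. solve the normal equations
beta_geo = np.linalg.solve(X.T @ X, X.T @ y)

# Calculus/optimization route: minimize ||y - X beta||^2 directly
beta_opt, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_geo, beta_opt))           # True: same solution
print(np.allclose(X.T @ (y - X @ beta_geo), 0))  # residual is orthogonal to every column
```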
Recall that the fitted values can be written as: $$\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$$
where $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ is the hat matrix (or projection matrix).
The hat matrix is not just a computational convenience—it's the orthogonal projection operator onto $\mathcal{C}(\mathbf{X})$.
| Property | Mathematical Statement | Interpretation |
|---|---|---|
| Symmetry | $\mathbf{H} = \mathbf{H}^\top$ | Projections are self-adjoint operators |
| Idempotence | $\mathbf{H}^2 = \mathbf{H}$ | Projecting twice = projecting once |
| Range | Range($\mathbf{H}$) = $\mathcal{C}(\mathbf{X})$ | H projects onto the column space of X |
| Null space | Null($\mathbf{H}$) = $\mathcal{C}(\mathbf{X})^\perp$ | Vectors perpendicular to C(X) are mapped to 0 |
| Rank | rank($\mathbf{H}$) = $p + 1$ | Dimension of projection subspace |
| Trace | trace($\mathbf{H}$) = $p + 1$ | Sum of diagonal equals rank for projections |
Verifying idempotence: $$\mathbf{H}^2 = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \cdot \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}(\mathbf{X}^\top\mathbf{X})(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top = \mathbf{H}$$
Geometric interpretation: If $\mathbf{v}$ is already in $\mathcal{C}(\mathbf{X})$, projecting it again leaves it unchanged: $\mathbf{H}\mathbf{v} = \mathbf{v}$. Idempotence captures this mathematically.
The matrix M = I - H projects onto the orthogonal complement of C(X). Residuals are e = My = (I - H)y. Since y = Hy + My = ŷ + e, every response vector decomposes uniquely into a fitted part (in C(X)) and a residual part (perpendicular to C(X)).
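The projection properties in the table, and the complementary projector $\mathbf{M} = \mathbf{I} - \mathbf{H}$, can be verified directly. Here is a minimal sketch on an assumed small random design (forming $\mathbf{H}$ explicitly, which is fine for a demo but not something you would do at scale):

```python
import numpy as np

rng = np.random.default_rng(3)

n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # explicit hat matrix (small-example shortcut)
M = np.eye(n) - H                      # projector onto the orthogonal complement of C(X)

print(np.allclose(H, H.T))              # symmetry
print(np.allclose(H @ H, H))            # idempotence
print(np.isclose(np.trace(H), p + 1))   # trace = rank = p + 1
print(np.allclose(H @ X, X))            # vectors already in C(X) are left unchanged
print(np.allclose(M @ X, 0))            # ...and are sent to zero by M

# Unique decomposition y = Hy + My = fitted part + residual part
y_hat, e = H @ y, M @ y
print(np.allclose(y, y_hat + e), np.isclose(y_hat @ e, 0))
```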
The diagonal elements of the hat matrix have a special interpretation and name: leverage.
Definition: The leverage of observation $i$ is: $$h_{ii} = [\mathbf{H}]_{ii} = \mathbf{x}_i^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_i$$
where $\mathbf{x}_i$ is the $i$-th row of $\mathbf{X}$ (as a column vector).
What leverage measures:
Recall $\hat{y}_i = \sum_{j=1}^n h_{ij} y_j$. Leverage $h_{ii}$ measures how much the fitted value $\hat{y}_i$ depends on the observed value $y_i$ itself.
Leverage measures potential influence from feature location, not actual influence. A high-leverage point with a typical y-value may not distort the regression. Cook's distance combines leverage with residual magnitude to measure actual influence.
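Here is a short sketch, on assumed data, computing leverages both from the hat-matrix diagonal and from the row-wise formula $h_{ii} = \mathbf{x}_i^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_i$; their sum equals $p + 1$, so the average leverage is $(p + 1)/n$:

```python
import numpy as np

rng = np.random.default_rng(4)

n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T

# Leverage from the hat-matrix diagonal...
h_diag = np.diag(H)
# ...and from the row-wise formula h_ii = x_i^T (X^T X)^{-1} x_i
h_rows = np.einsum('ij,jk,ik->i', X, XtX_inv, X)

print(np.allclose(h_diag, h_rows))       # the two computations agree
print(np.isclose(h_diag.sum(), p + 1))   # leverages sum to trace(H) = p + 1
print(h_diag.max(), (p + 1) / n)         # largest leverage vs. the average (p + 1)/n
```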
The residual vector $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H})\mathbf{y}$ has beautiful geometric properties that illuminate regression diagnostics.
Orthogonality to column space: $$\mathbf{X}^\top\mathbf{e} = \mathbf{0}$$
The residual vector is perpendicular to every column of $\mathbf{X}$, including:

- the intercept column of ones, so the residuals sum to zero ($\sum_i e_i = 0$);
- every predictor column, so the residuals are uncorrelated (in sample) with each predictor.
Pythagorean decomposition:
Since $\hat{\mathbf{y}} \perp \mathbf{e}$: $$\|\mathbf{y}\|^2 = \|\hat{\mathbf{y}}\|^2 + \|\mathbf{e}\|^2$$
In sum of squares notation: $$\text{SS}_{\text{Total}} = \text{SS}_{\text{Regression}} + \text{SS}_{\text{Error}}$$ $$\sum(y_i - \bar{y})^2 = \sum(\hat{y}_i - \bar{y})^2 + \sum(y_i - \hat{y}_i)^2$$
(This exact decomposition requires centering; the geometric version uses the origin as reference.)
The coefficient of determination: $$R^2 = \frac{\text{SS}_{\text{Regression}}}{\text{SS}_{\text{Total}}} = 1 - \frac{\text{SS}_{\text{Error}}}{\text{SS}_{\text{Total}}} = 1 - \frac{\|\mathbf{e}\|^2}{\|\mathbf{y} - \bar{y}\mathbf{1}\|^2}$$
Geometrically, R² = cos²(θ) where θ is the angle between the centered response vector and its projection onto the column space. Perfect fit means θ = 0 (R² = 1); no relationship means θ = 90° (R² = 0).
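The angle interpretation can be checked numerically as well. The sketch below (assumed synthetic data, model with an intercept) verifies the sum-of-squares decomposition and confirms that $R^2$ equals the squared cosine of the angle between the centered response and the centered fitted vector:

```python
import numpy as np

rng = np.random.default_rng(5)

n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(scale=1.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
e = y - y_hat

# Sum-of-squares decomposition (holds because the model contains an intercept)
ss_tot = np.sum((y - y.mean()) ** 2)
ss_reg = np.sum((y_hat - y.mean()) ** 2)
ss_err = np.sum(e ** 2)
print(np.isclose(ss_tot, ss_reg + ss_err))  # True

# R^2 two ways: from the sums of squares, and as cos^2 of the angle theta
r2 = 1 - ss_err / ss_tot
y_c, yhat_c = y - y.mean(), y_hat - y_hat.mean()
cos_theta = (y_c @ yhat_c) / (np.linalg.norm(y_c) * np.linalg.norm(yhat_c))
print(np.isclose(r2, cos_theta ** 2))       # True
```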
Let's build intuition with a simple example: $n = 3$ observations and $p = 1$ predictor (plus intercept).
Setup: $\mathbf{y} \in \mathbb{R}^3$ is the response, and $\mathbf{X}$ has two columns (the intercept column of ones and one predictor column), so $\mathcal{C}(\mathbf{X})$ is a two-dimensional plane through the origin in $\mathbb{R}^3$.
Geometric picture:
Imagine $\mathbb{R}^3$ as ordinary 3D space. The column space $\mathcal{C}(\mathbf{X})$ is a plane through the origin. The response $\mathbf{y}$ is a point somewhere in 3D space, possibly not on this plane.
The fitted value $\hat{\mathbf{y}}$ is the point on the plane closest to $\mathbf{y}$—found by dropping a perpendicular from $\mathbf{y}$ to the plane. The residual $\mathbf{e}$ is the perpendicular segment itself.
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Simple example: 3 observations, 1 predictor
np.random.seed(42)

# The response vector in R^3
y = np.array([3, 5, 4])

# Design matrix: intercept + one predictor
X = np.column_stack([
    np.ones(3),          # Intercept column [1, 1, 1]
    np.array([1, 2, 3])  # Predictor column
])

# Compute OLS solution
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat   # Fitted values (projection)
e = y - y_hat          # Residuals

print(f"y (response): {y}")
print(f"ŷ (fitted): {y_hat}")
print(f"e (residual): {e}")
print(f"β̂ (coefficients): {beta_hat}")

# Verify orthogonality
print("\nOrthogonality check:")
print(f"  X⊤e = {X.T @ e}")  # Should be [0, 0]
print(f"  ||e|| = {np.linalg.norm(e):.4f}")

# Verify Pythagorean decomposition
print("\nPythagorean decomposition:")
print(f"  ||y||² = {np.dot(y, y):.4f}")
print(f"  ||ŷ||² + ||e||² = {np.dot(y_hat, y_hat) + np.dot(e, e):.4f}")

# Create 3D visualization
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Plot the column space (a plane through the origin)
# Create a grid of points in the plane spanned by the columns of X
s, t = np.meshgrid(np.linspace(-1, 3, 10), np.linspace(-1, 2, 10))
# Plane: points of the form s*x1 + t*x2, where x1, x2 are the columns of X
plane_x = s * X[0, 0] + t * X[0, 1]
plane_y = s * X[1, 0] + t * X[1, 1]
plane_z = s * X[2, 0] + t * X[2, 1]

ax.plot_surface(plane_x, plane_y, plane_z, alpha=0.3, color='blue')

# Plot vectors
ax.quiver(0, 0, 0, y[0], y[1], y[2], color='green',
          arrow_length_ratio=0.1, linewidth=2, label='y (response)')
ax.quiver(0, 0, 0, y_hat[0], y_hat[1], y_hat[2], color='blue',
          arrow_length_ratio=0.1, linewidth=2, label='ŷ (fitted)')
ax.plot([y_hat[0], y[0]], [y_hat[1], y[1]], [y_hat[2], y[2]],
        'r--', linewidth=2, label='e (residual)')

ax.set_xlabel('Obs 1')
ax.set_ylabel('Obs 2')
ax.set_zlabel('Obs 3')
ax.set_title('OLS as Orthogonal Projection')
ax.legend()

plt.tight_layout()
plt.savefig('ols_projection.png', dpi=150)
print("\nVisualization saved to ols_projection.png")
```

The geometric view connects linear regression to fundamental concepts that recur throughout machine learning.
The concept of orthogonal projection onto a subspace is one of the most fundamental ideas in applied mathematics. Mastering it in the context of linear regression prepares you for advanced topics throughout machine learning and signal processing.
We've developed a complete geometric understanding of linear regression. Let's consolidate:

- The response $\mathbf{y}$ is a vector in $\mathbb{R}^n$, and the columns of $\mathbf{X}$ span a $(p+1)$-dimensional subspace $\mathcal{C}(\mathbf{X})$.
- OLS finds the orthogonal projection $\hat{\mathbf{y}}$ of $\mathbf{y}$ onto $\mathcal{C}(\mathbf{X})$; the residual $\mathbf{e}$ is perpendicular to that subspace.
- The orthogonality condition $\mathbf{X}^\top\mathbf{e} = \mathbf{0}$ is exactly the normal equations.
- The hat matrix $\mathbf{H}$ is the projection operator onto $\mathcal{C}(\mathbf{X})$; its diagonal entries are the leverages.
- The Pythagorean decomposition $\|\mathbf{y}\|^2 = \|\hat{\mathbf{y}}\|^2 + \|\mathbf{e}\|^2$ underlies the sum-of-squares decomposition and $R^2$.
What's next:
We've covered matrix formulation, normal equations, statistical properties, and geometric interpretation. The final page synthesizes these perspectives and explores the projection perspective—deepening our understanding of how OLS relates to the geometry of inner product spaces.
You now understand linear regression geometrically—as orthogonal projection of the response onto the column space of the design matrix. This perspective unifies the algebraic, statistical, and computational aspects of OLS.