We've derived the normal equations algebraically and established statistical properties, but there's a deeper way to understand linear regression: geometry.
In this view, the response vector $\mathbf{y}$ lives in $n$-dimensional space, and the columns of the design matrix $\mathbf{X}$ span a subspace. The OLS solution is the point in that subspace closest to $\mathbf{y}$—a geometric projection.
This perspective isn't just elegant—it provides immediate intuition for why OLS works, what residuals represent, and why the normal equations have their particular form. It connects linear regression to fundamental linear algebra concepts that recur throughout machine learning.
By the end of this page, you will visualize linear regression as orthogonal projection onto a subspace, understand the hat matrix as a projection operator, derive the normal equations geometrically, and gain intuition for why minimizing squared error is equivalent to finding the closest point.
To develop geometric intuition, we must shift our perspective from viewing data as $n$ observations to viewing vectors in $\mathbb{R}^n$.
The observation space (row view): each of the $n$ observations is a point whose coordinates are its feature values, so the data set is a cloud of $n$ points in predictor space. This is the familiar scatter-plot picture, where the fitted model is a line or hyperplane through the cloud.
The variable space (column view):
In variable space, we work in ℝⁿ where n is the number of observations. The response y and each feature column x₁, x₂, ..., xₚ are all vectors in this same n-dimensional space. Regression becomes a problem of finding the best linear combination of feature vectors to approximate y.
Example:
With $n = 100$ observations and $p = 3$ predictors:

- The response $\mathbf{y}$ is a single vector in $\mathbb{R}^{100}$.
- The design matrix $\mathbf{X}$ has four columns (the intercept plus three predictors), each a vector in $\mathbb{R}^{100}$.
- The column space spanned by these four vectors is a 4-dimensional subspace of $\mathbb{R}^{100}$.
The variable space perspective treats each observation as a dimension and each variable as a single vector.
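To make the column view concrete, here is a minimal sketch in NumPy (with made-up random data, matching the $n = 100$, $p = 3$ example above) showing that the response and each column of the design matrix are vectors in the same $\mathbb{R}^{100}$:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3
features = rng.normal(size=(n, p))           # three made-up predictor columns
X = np.column_stack([np.ones(n), features])  # prepend the intercept column of ones
y = rng.normal(size=n)                       # made-up response vector

print(X.shape)  # (100, 4): four column vectors living in R^100
print(y.shape)  # (100,):   the response is one more vector in the same R^100
```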
Definition: The column space of $\mathbf{X}$, denoted $\mathcal{C}(\mathbf{X})$, is the set of all linear combinations of the columns of $\mathbf{X}$:
$$\mathcal{C}(\mathbf{X}) = \{\mathbf{X}\boldsymbol{\beta} : \boldsymbol{\beta} \in \mathbb{R}^{p+1}\}$$
This is a $(p+1)$-dimensional subspace of $\mathbb{R}^n$ (assuming $\mathbf{X}$ has full column rank).
What this means:
When X has full column rank, the column space has dimension p+1 (intercept plus p predictors). This is vastly smaller than the ambient space ℝⁿ when n >> p: the column space is a low-dimensional flat subspace passing through the origin of a very high-dimensional space.
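A quick numerical check of this claim, using an assumed random full-rank design: the rank of $\mathbf{X}$ equals $p + 1$, the dimension of its column space, and any vector of the form $\mathbf{X}\boldsymbol{\beta}$ is by construction a point in that subspace.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # assumed full-rank design

# Dimension of the column space = rank of X: p + 1 = 4, far below the ambient n = 100
print(np.linalg.matrix_rank(X))  # 4

# Any vector X @ beta is, by construction, a point of C(X)
beta = np.array([1.0, -2.0, 0.5, 3.0])
v = X @ beta
print(v.shape)  # (100,) -- one point inside the 4-dimensional subspace of R^100
```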
Geometric picture:
The response $\mathbf{y}$ lives in $\mathbb{R}^n$ but generally does not lie in $\mathcal{C}(\mathbf{X})$, since no exact linear fit exists. Regression seeks the point of $\mathcal{C}(\mathbf{X})$ closest to $\mathbf{y}$. This closest point is the fitted value vector $\hat{\mathbf{y}}$, and the difference $\mathbf{y} - \hat{\mathbf{y}}$ is the residual vector $\mathbf{e}$.
Key theorem: The point in $\mathcal{C}(\mathbf{X})$ closest to $\mathbf{y}$ is found by orthogonal projection. The closest point is the unique vector $\hat{\mathbf{y}} \in \mathcal{C}(\mathbf{X})$ such that the error vector $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ is perpendicular to $\mathcal{C}(\mathbf{X})$.
Mathematically: $\hat{\mathbf{y}}$ minimizes $\|\mathbf{y} - \hat{\mathbf{y}}\|$ over all $\hat{\mathbf{y}} \in \mathcal{C}(\mathbf{X})$ if and only if: $$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} \perp \mathcal{C}(\mathbf{X})$$
where $\perp$ means perpendicular (orthogonal) to every vector in the subspace.
Why orthogonality gives the closest point:
Consider any other point $\tilde{\mathbf{y}} \in \mathcal{C}(\mathbf{X})$, and let $\mathbf{d} = \tilde{\mathbf{y}} - \hat{\mathbf{y}}$ (the difference, which lies in $\mathcal{C}(\mathbf{X})$).
Since $\mathbf{y} - \tilde{\mathbf{y}} = (\mathbf{y} - \hat{\mathbf{y}}) + (\hat{\mathbf{y}} - \tilde{\mathbf{y}}) = \mathbf{e} - \mathbf{d}$, the Pythagorean theorem (which applies because $\mathbf{e} \perp \mathbf{d}$) gives: $$\|\mathbf{y} - \tilde{\mathbf{y}}\|^2 = \|\mathbf{e} - \mathbf{d}\|^2 = \|\mathbf{e}\|^2 + \|\mathbf{d}\|^2 \geq \|\mathbf{e}\|^2$$
Equality holds only when $\mathbf{d} = \mathbf{0}$, i.e., $\tilde{\mathbf{y}} = \hat{\mathbf{y}}$.
Conclusion: The orthogonal projection is the unique closest point.
Minimizing the sum of squared errors ||y - ŷ||² is equivalent to finding the orthogonal projection of y onto the column space of X. The geometric and algebraic views are two perspectives on the same optimization.
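This equivalence is easy to verify numerically. The sketch below (on assumed toy data) computes the projection via the normal equations and checks that any other point of the column space, obtained by perturbing the coefficients, is at least as far from $\mathbf{y}$:

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # assumed toy design
y = rng.normal(size=n)                                      # assumed toy response

# Orthogonal projection of y onto C(X), via the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
best = np.linalg.norm(y - y_hat)

# Every other point of C(X) has the form X @ beta for some beta,
# and is at least as far from y as the projection is
for _ in range(5):
    other = X @ (beta_hat + rng.normal(scale=0.1, size=p + 1))
    assert np.linalg.norm(y - other) >= best

print(f"distance from y to its projection: {best:.4f} (no competitor in C(X) beats it)")
```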
We can derive the normal equations purely from the orthogonality condition—no calculus required.
The orthogonality condition:
For $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ to be perpendicular to $\mathcal{C}(\mathbf{X})$, it must be perpendicular to every column of $\mathbf{X}$ (since the columns span $\mathcal{C}(\mathbf{X})$).
Perpendicularity of vectors means zero inner product: $$\mathbf{x}_j^\top \mathbf{e} = 0 \quad \text{for } j = 0, 1, \ldots, p$$
Collecting all these conditions into a single matrix equation: $$\mathbf{X}^\top \mathbf{e} = \mathbf{0}$$
Deriving the normal equations:
Substituting $\mathbf{e} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$: $$\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}$$ $$\mathbf{X}^\top\mathbf{y} - \mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0}$$ $$\boxed{\mathbf{X}^\top\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^\top\mathbf{y}}$$
These are exactly the normal equations! The name "normal equations" comes from this geometric derivation: the residual is normal (perpendicular) to the column space.
We've now derived the normal equations two ways: (1) by setting the gradient of SSE to zero (calculus), and (2) by requiring the residual to be perpendicular to the column space (geometry). Same equations, different insights.
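As a sanity check on the "same equations, different insights" point, the sketch below (again on assumed synthetic data) solves the normal equations directly and also calls a generic least-squares routine that minimizes the SSE; both give the same coefficients, and the resulting residual is orthogonal to the column space:

```python
import numpy as np

rng = np.random.default_rng(2)

n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

# Geometric route: impose X^T e = 0, i.e. solve the normal equations
beta_geo = np.linalg.solve(X.T @ X, X.T @ y)

# Calculus/optimization route: minimize ||y - X beta||^2 directly
beta_opt, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_geo, beta_opt))           # True: same solution
print(np.allclose(X.T @ (y - X @ beta_geo), 0))  # residual is orthogonal to every column
```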
Recall that the fitted values can be written as: $$\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$$
where $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ is the hat matrix (or projection matrix).
The hat matrix is not just a computational convenience—it's the orthogonal projection operator onto $\mathcal{C}(\mathbf{X})$.
| Property | Mathematical Statement | Interpretation |
|---|---|---|
| Symmetry | $\mathbf{H} = \mathbf{H}^\top$ | Projections are self-adjoint operators |
| Idempotence | $\mathbf{H}^2 = \mathbf{H}$ | Projecting twice = projecting once |
| Range | Range($\mathbf{H}$) = $\mathcal{C}(\mathbf{X})$ | H projects onto the column space of X |
| Null space | Null($\mathbf{H}$) = $\mathcal{C}(\mathbf{X})^\perp$ | Vectors perpendicular to C(X) are mapped to 0 |
| Rank | rank($\mathbf{H}$) = $p + 1$ | Dimension of projection subspace |
| Trace | trace($\mathbf{H}$) = $p + 1$ | Sum of diagonal equals rank for projections |
Verifying idempotence: $$\mathbf{H}^2 = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \cdot \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}(\mathbf{X}^\top\mathbf{X})(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top = \mathbf{H}$$
Geometric interpretation: If $\mathbf{v}$ is already in $\mathcal{C}(\mathbf{X})$, projecting it again leaves it unchanged: $\mathbf{H}\mathbf{v} = \mathbf{v}$. Idempotence captures this mathematically.
The matrix M = I - H projects onto the orthogonal complement of C(X). Residuals are e = My = (I - H)y. Since y = Hy + My = ŷ + e, every response vector decomposes uniquely into a fitted part (in C(X)) and a residual part (perpendicular to C(X)).
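The projection properties in the table, and the complementary projector $\mathbf{M} = \mathbf{I} - \mathbf{H}$, can be verified directly. Here is a minimal sketch on an assumed small random design (forming $\mathbf{H}$ explicitly, which is fine for a demo but not something you would do at scale):

```python
import numpy as np

rng = np.random.default_rng(3)

n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # explicit hat matrix (small-example shortcut)
M = np.eye(n) - H                      # projector onto the orthogonal complement of C(X)

print(np.allclose(H, H.T))              # symmetry
print(np.allclose(H @ H, H))            # idempotence
print(np.isclose(np.trace(H), p + 1))   # trace = rank = p + 1
print(np.allclose(H @ X, X))            # vectors already in C(X) are left unchanged
print(np.allclose(M @ X, 0))            # ...and are sent to zero by M

# Unique decomposition y = Hy + My = fitted part + residual part
y_hat, e = H @ y, M @ y
print(np.allclose(y, y_hat + e), np.isclose(y_hat @ e, 0))
```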
The diagonal elements of the hat matrix have a special interpretation and name: leverage.
Definition: The leverage of observation $i$ is: $$h_{ii} = [\mathbf{H}]_{ii} = \mathbf{x}_i^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_i$$
where $\mathbf{x}_i$ is the $i$-th row of $\mathbf{X}$ (as a column vector).
What leverage measures:
Recall $\hat{y}_i = \sum_{j=1}^n h_{ij} y_j$. Leverage $h_{ii}$ measures how much the fitted value $\hat{y}_i$ depends on the observed value $y_i$ itself.
Leverage measures potential influence from feature location, not actual influence. A high-leverage point with a typical y-value may not distort the regression. Cook's distance combines leverage with residual magnitude to measure actual influence.
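Here is a short sketch, on assumed data, computing leverages both from the hat-matrix diagonal and from the row-wise formula $h_{ii} = \mathbf{x}_i^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_i$; their sum equals $p + 1$, so the average leverage is $(p + 1)/n$:

```python
import numpy as np

rng = np.random.default_rng(4)

n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T

# Leverage from the hat-matrix diagonal...
h_diag = np.diag(H)
# ...and from the row-wise formula h_ii = x_i^T (X^T X)^{-1} x_i
h_rows = np.einsum('ij,jk,ik->i', X, XtX_inv, X)

print(np.allclose(h_diag, h_rows))       # the two computations agree
print(np.isclose(h_diag.sum(), p + 1))   # leverages sum to trace(H) = p + 1
print(h_diag.max(), (p + 1) / n)         # largest leverage vs. the average (p + 1)/n
```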
The residual vector $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H})\mathbf{y}$ has beautiful geometric properties that illuminate regression diagnostics.
Orthogonality to column space: $$\mathbf{X}^\top\mathbf{e} = \mathbf{0}$$
The residual vector is perpendicular to every column of $\mathbf{X}$, including:

- the intercept column of ones, so the residuals sum to zero ($\sum_i e_i = 0$);
- every predictor column, so the residuals are uncorrelated (in sample) with each predictor.
Pythagorean decomposition:
Since $\hat{\mathbf{y}} \perp \mathbf{e}$: $$\|\mathbf{y}\|^2 = \|\hat{\mathbf{y}}\|^2 + \|\mathbf{e}\|^2$$
In sum of squares notation: $$\text{SS}_{\text{Total}} = \text{SS}_{\text{Regression}} + \text{SS}_{\text{Error}}$$ $$\sum(y_i - \bar{y})^2 = \sum(\hat{y}_i - \bar{y})^2 + \sum(y_i - \hat{y}_i)^2$$
(This exact decomposition requires centering; the geometric version uses the origin as reference.)
The coefficient of determination: $$R^2 = \frac{\text{SS}_{\text{Regression}}}{\text{SS}_{\text{Total}}} = 1 - \frac{\text{SS}_{\text{Error}}}{\text{SS}_{\text{Total}}} = 1 - \frac{\|\mathbf{e}\|^2}{\|\mathbf{y} - \bar{y}\mathbf{1}\|^2}$$
Geometrically, R² = cos²(θ) where θ is the angle between the centered response vector and its projection onto the column space. Perfect fit means θ = 0 (R² = 1); no relationship means θ = 90° (R² = 0).
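The angle interpretation can be checked numerically as well. The sketch below (assumed synthetic data, model with an intercept) verifies the sum-of-squares decomposition and confirms that $R^2$ equals the squared cosine of the angle between the centered response and the centered fitted vector:

```python
import numpy as np

rng = np.random.default_rng(5)

n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(scale=1.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
e = y - y_hat

# Sum-of-squares decomposition (holds because the model contains an intercept)
ss_tot = np.sum((y - y.mean()) ** 2)
ss_reg = np.sum((y_hat - y.mean()) ** 2)
ss_err = np.sum(e ** 2)
print(np.isclose(ss_tot, ss_reg + ss_err))  # True

# R^2 two ways: from the sums of squares, and as cos^2 of the angle theta
r2 = 1 - ss_err / ss_tot
y_c, yhat_c = y - y.mean(), y_hat - y_hat.mean()
cos_theta = (y_c @ yhat_c) / (np.linalg.norm(y_c) * np.linalg.norm(yhat_c))
print(np.isclose(r2, cos_theta ** 2))       # True
```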
Let's build intuition with a simple example: $n = 3$ observations and $p = 1$ predictor (plus intercept).
Setup: $\mathbf{y} \in \mathbb{R}^3$ is the response, and $\mathbf{X}$ has two columns (the intercept column of ones and one predictor column), so $\mathcal{C}(\mathbf{X})$ is a two-dimensional plane through the origin in $\mathbb{R}^3$.
Geometric picture:
Imagine $\mathbb{R}^3$ as ordinary 3D space. The column space $\mathcal{C}(\mathbf{X})$ is a plane through the origin. The response $\mathbf{y}$ is a point somewhere in 3D space, possibly not on this plane.
The fitted value $\hat{\mathbf{y}}$ is the point on the plane closest to $\mathbf{y}$—found by dropping a perpendicular from $\mathbf{y}$ to the plane. The residual $\mathbf{e}$ is the perpendicular segment itself.
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Simple example: 3 observations, 1 predictor
np.random.seed(42)

# The response vector in R^3
y = np.array([3, 5, 4])

# Design matrix: intercept + one predictor
X = np.column_stack([
    np.ones(3),          # Intercept column [1, 1, 1]
    np.array([1, 2, 3])  # Predictor column
])

# Compute OLS solution
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat   # Fitted values (projection)
e = y - y_hat          # Residuals

print(f"y (response): {y}")
print(f"ŷ (fitted): {y_hat}")
print(f"e (residual): {e}")
print(f"β̂ (coefficients): {beta_hat}")

# Verify orthogonality
print("\nOrthogonality check:")
print(f"  X⊤e = {X.T @ e}")  # Should be [0, 0]
print(f"  ||e|| = {np.linalg.norm(e):.4f}")

# Verify Pythagorean decomposition
print("\nPythagorean decomposition:")
print(f"  ||y||² = {np.dot(y, y):.4f}")
print(f"  ||ŷ||² + ||e||² = {np.dot(y_hat, y_hat) + np.dot(e, e):.4f}")

# Create 3D visualization
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Plot the column space (a plane through the origin)
# Create a grid of points in the plane spanned by the columns of X
s, t = np.meshgrid(np.linspace(-1, 3, 10), np.linspace(-1, 2, 10))
# Plane: points of the form s*x1 + t*x2, where x1, x2 are the columns of X
plane_x = s * X[0, 0] + t * X[0, 1]
plane_y = s * X[1, 0] + t * X[1, 1]
plane_z = s * X[2, 0] + t * X[2, 1]

ax.plot_surface(plane_x, plane_y, plane_z, alpha=0.3, color='blue')

# Plot vectors
ax.quiver(0, 0, 0, y[0], y[1], y[2], color='green',
          arrow_length_ratio=0.1, linewidth=2, label='y (response)')
ax.quiver(0, 0, 0, y_hat[0], y_hat[1], y_hat[2], color='blue',
          arrow_length_ratio=0.1, linewidth=2, label='ŷ (fitted)')
ax.plot([y_hat[0], y[0]], [y_hat[1], y[1]], [y_hat[2], y[2]],
        'r--', linewidth=2, label='e (residual)')

ax.set_xlabel('Obs 1')
ax.set_ylabel('Obs 2')
ax.set_zlabel('Obs 3')
ax.set_title('OLS as Orthogonal Projection')
ax.legend()

plt.tight_layout()
plt.savefig('ols_projection.png', dpi=150)
print("\nVisualization saved to ols_projection.png")
```

The geometric view connects linear regression to fundamental concepts that recur throughout machine learning.
The concept of orthogonal projection onto a subspace is one of the most fundamental ideas in applied mathematics. Mastering it in the context of linear regression prepares you for advanced topics throughout machine learning and signal processing.
We've developed a complete geometric understanding of linear regression. Let's consolidate:

- The response $\mathbf{y}$ is a vector in $\mathbb{R}^n$, and the columns of $\mathbf{X}$ span a $(p+1)$-dimensional subspace $\mathcal{C}(\mathbf{X})$.
- OLS finds the orthogonal projection $\hat{\mathbf{y}}$ of $\mathbf{y}$ onto $\mathcal{C}(\mathbf{X})$; the residual $\mathbf{e}$ is perpendicular to that subspace.
- The orthogonality condition $\mathbf{X}^\top\mathbf{e} = \mathbf{0}$ is exactly the normal equations.
- The hat matrix $\mathbf{H}$ is the projection operator onto $\mathcal{C}(\mathbf{X})$; its diagonal entries are the leverages.
- The Pythagorean decomposition $\|\mathbf{y}\|^2 = \|\hat{\mathbf{y}}\|^2 + \|\mathbf{e}\|^2$ underlies the sum-of-squares decomposition and $R^2$.
What's next:
We've covered matrix formulation, normal equations, statistical properties, and geometric interpretation. The final page synthesizes these perspectives and explores the projection perspective—deepening our understanding of how OLS relates to the geometry of inner product spaces.
You now understand linear regression geometrically—as orthogonal projection of the response onto the column space of the design matrix. This perspective unifies the algebraic, statistical, and computational aspects of OLS.