The algebraic derivation of OLS—setting derivatives to zero, solving systems of equations—gives us formulas. But it leaves unanswered a deeper question: What does least squares actually do, geometrically?
The answer is beautiful and profound: OLS finds the orthogonal projection of the observed data onto the space of possible predictions. This geometric interpretation transforms linear regression from a collection of formulas into a unified visual concept.
Understanding this geometry isn't just intellectually satisfying—it provides practical intuition for diagnosing regression problems, understanding R-squared, and extending to more complex models.
By the end of this page, you will visualize regression in n-dimensional space, understand why residuals are perpendicular to predictors, interpret R² as the squared length ratio, and see how the hat matrix projects observations onto the model space.
To understand the geometry of regression, we need to think of our data as vectors in high-dimensional space. This may feel abstract at first, but it's remarkably powerful.
Data as Vectors
Consider our $n$ observations. Each variable—$y$, $x$, even the constant intercept term—can be viewed as a vector in $\mathbb{R}^n$:
$$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}, \quad \mathbf{1} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}$$
Each component of the vector corresponds to one observation. The vector $\mathbf{y}$ represents all $n$ observed responses stacked together.
There are two geometric perspectives on regression data. The standard view plots each observation as a point in (x, y) space—a 2D scatterplot. The geometric view treats x and y as vectors in n-dimensional space, where each dimension corresponds to an observation. Both are valid; they reveal different insights.
Inner Products and Orthogonality
In $\mathbb{R}^n$, the inner product of two vectors is:
$$\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^T \mathbf{v} = \sum_{i=1}^n u_i v_i$$
Two vectors are orthogonal (perpendicular) if their inner product is zero:
$$\mathbf{u} \perp \mathbf{v} \iff \mathbf{u}^T \mathbf{v} = 0$$
The length (norm) of a vector is:
$$|\mathbf{u}| = \sqrt{\mathbf{u}^T\mathbf{u}} = \sqrt{\sum_{i=1}^n u_i^2}$$
These concepts from linear algebra are the building blocks of the geometric view.
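These definitions translate directly into NumPy. A quick sketch (the vectors here are arbitrary, chosen only to illustrate orthogonality):

```python
import numpy as np

u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, 1.0, -2.0])

# Inner product: u'v = 1*2 + 2*1 + 2*(-2) = 0, so u and v are orthogonal
inner = u @ v

# Norm: |u| = sqrt(u'u) = sqrt(1 + 4 + 4) = 3
norm_u = np.sqrt(u @ u)

print(inner)   # 0.0
print(norm_u)  # 3.0
```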
The design matrix $\mathbf{X}$ for simple linear regression has two columns:
$$\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} = [\mathbf{1} \mid \mathbf{x}]$$
What Can the Model Predict?
Any prediction from our model takes the form:
$$\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta} = \beta_0 \mathbf{1} + \beta_1 \mathbf{x}$$
As we vary $\beta_0$ and $\beta_1$ over all possible real values, what predictions can we generate?
The Column Space
The set of all possible predictions is the column space (or range) of $\mathbf{X}$:
$$\mathcal{C}(\mathbf{X}) = \{\mathbf{X}\boldsymbol{\beta} : \boldsymbol{\beta} \in \mathbb{R}^2\} = \{\beta_0 \mathbf{1} + \beta_1 \mathbf{x} : \beta_0, \beta_1 \in \mathbb{R}\}$$
This is a 2-dimensional subspace of $\mathbb{R}^n$ (assuming $\mathbf{1}$ and $\mathbf{x}$ are linearly independent). It's the plane spanned by the vectors $\mathbf{1}$ and $\mathbf{x}$.
Key Insight: The model can only produce predictions that lie in this plane. No matter what values of $\beta_0$ and $\beta_1$ we choose, $\hat{\mathbf{y}}$ is always in $\mathcal{C}(\mathbf{X})$. If the observed $\mathbf{y}$ doesn't lie in this plane, we can never predict it exactly.
With n observations and 2 parameters, the column space is a 2D plane sitting inside n-dimensional space. Think of it like a flat piece of paper floating in a room—except the room has n dimensions, not 3. The observed y is a point in this n-dimensional room, and we're finding the closest point on the paper.
Visualization Intuition
Imagine $n = 3$ for visualization purposes (we can draw 3D). In this simplified case, the observed $\mathbf{y}$ is a point in 3D space, and the column space $\mathcal{C}(\mathbf{X})$ is a plane through the origin spanned by $\mathbf{1}$ and $\mathbf{x}$. The question becomes: which point on the plane is closest to $\mathbf{y}$?
Here's the central geometric insight:
$$\text{Minimizing } |\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|^2 \text{ is equivalent to finding the closest point in } \mathcal{C}(\mathbf{X}) \text{ to } \mathbf{y}$$
Why? Because:
$$|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \text{RSS}$$
Minimizing RSS is minimizing the squared Euclidean distance in $\mathbb{R}^n$.
The Closest Point is the Orthogonal Projection
A fundamental theorem in linear algebra states: the closest point in a subspace to an external point is the orthogonal projection onto that subspace.
This means:
$$\hat{\mathbf{y}} = \text{proj}_{\mathcal{C}(\mathbf{X})} \mathbf{y}$$
The fitted values $\hat{\mathbf{y}}$ are the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$!
Drop a perpendicular from y to the plane (column space). Where it hits the plane is the projection. This projection minimizes distance because any other point on the plane would create a triangle where the hypotenuse (distance to y) is longer than the perpendicular leg.
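The triangle argument can be checked numerically: moving the coefficients away from the OLS solution slides $\hat{\mathbf{y}}$ within the plane, and the distance to $\mathbf{y}$ can only grow. A minimal sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])

# OLS coefficients: the coordinates of the projection of y onto C(X)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def rss(beta):
    r = y - X @ beta
    return r @ r

# Any perturbed beta gives another point in the plane; the perpendicular
# from y is shorter than any slanted path, so RSS strictly increases
for delta in [np.array([0.1, 0.0]), np.array([0.0, -0.05]), np.array([-0.2, 0.1])]:
    assert rss(beta_hat + delta) > rss(beta_hat)

print("OLS RSS:", rss(beta_hat))
```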
The Residual is Perpendicular
The residual vector is:
$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$$
Since $\hat{\mathbf{y}}$ is the orthogonal projection, $\mathbf{e}$ must be perpendicular to the column space:
$$\mathbf{e} \perp \mathcal{C}(\mathbf{X})$$
Specifically, $\mathbf{e}$ is orthogonal to every column of $\mathbf{X}$:
$$\mathbf{1}^T \mathbf{e} = 0 \implies \sum_i e_i = 0$$
$$\mathbf{x}^T \mathbf{e} = 0 \implies \sum_i x_i e_i = 0$$
These are exactly the normal equations we derived algebraically! The geometric and algebraic perspectives connect perfectly.
The projection operation can be expressed as matrix multiplication. From the OLS solution:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
The fitted values are:
$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
Define the hat matrix (or projection matrix):
$$\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$$
Then simply:
$$\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$$
The matrix $\mathbf{H}$ "puts a hat on $\mathbf{y}$"—hence the name. Like every orthogonal projection matrix, $\mathbf{H}$ is symmetric ($\mathbf{H}^T = \mathbf{H}$) and idempotent ($\mathbf{H}^2 = \mathbf{H}$): projecting twice is the same as projecting once.
The Residual-Maker Matrix
The residuals are:
$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{H}\mathbf{y} = (\mathbf{I} - \mathbf{H})\mathbf{y}$$
Define $\mathbf{M} = \mathbf{I} - \mathbf{H}$ (the residual-maker matrix):
$$\mathbf{e} = \mathbf{M}\mathbf{y}$$
$\mathbf{M}$ is also symmetric and idempotent. It projects onto the orthogonal complement of $\mathcal{C}(\mathbf{X})$—the space of all vectors perpendicular to predictions.
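These properties of $\mathbf{M}$ are easy to verify numerically. A sketch with a simulated design matrix (the data here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 15
x = rng.uniform(1, 10, n)
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H

# M is symmetric and idempotent, like H
assert np.allclose(M, M.T)
assert np.allclose(M @ M, M)

# M annihilates the column space: MX = 0, so e = My is
# orthogonal to every column of X
assert np.allclose(M @ X, 0)

# H and M project onto complementary subspaces: HM = 0
assert np.allclose(H @ M, 0)

# trace(M) = n - 2: the residual space has n - 2 dimensions
print(np.trace(M))
```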
We have an orthogonal decomposition: y = ŷ + e where ŷ ∈ C(X) and e ⊥ C(X). Every observation vector decomposes uniquely into a model part (explainable by X) and a residual part (unexplainable by X). This is the geometry underlying variance decomposition and R².
Since $\hat{\mathbf{y}}$ and $\mathbf{e}$ are orthogonal, the Pythagorean theorem applies:
$$|\mathbf{y}|^2 = |\hat{\mathbf{y}}|^2 + |\mathbf{e}|^2$$
This identity is true as stated, but it is not yet the variance decomposition we want: variability is measured around the mean, so we must center the vectors first. Let's be more careful.
Centered Vectors
Define centered vectors:
$$\tilde{\mathbf{y}} = \mathbf{y} - \bar{y}\mathbf{1}$$
$$\tilde{\hat{\mathbf{y}}} = \hat{\mathbf{y}} - \bar{y}\mathbf{1}$$
(Note: $\bar{\hat{y}} = \bar{y}$ from the normal equations, so we can use $\bar{y}$ for both.)
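The claim that the fitted values share $\mathbf{y}$'s mean follows from $\mathbf{1}^T\mathbf{e} = 0$ and is easy to check. A small sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 25
x = rng.uniform(0, 5, n)
y = 2 + 3 * x + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

# Because 1'e = 0, the fitted values have the same mean as y,
# so a single y-bar centers both vectors
assert np.isclose(y_hat.mean(), y.mean())
print(y.mean(), y_hat.mean())
```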
Variance Decomposition
Now we have the orthogonal decomposition:
$$\tilde{\mathbf{y}} = \tilde{\hat{\mathbf{y}}} + \mathbf{e}$$
Applying Pythagorean theorem:
$$|\tilde{\mathbf{y}}|^2 = |\tilde{\hat{\mathbf{y}}}|^2 + |\mathbf{e}|^2$$
Translating back to sums of squares:
$$\underbrace{\sum_i (y_i - \bar{y})^2}_{\text{SST}} = \underbrace{\sum_i (\hat{y}_i - \bar{y})^2}_{\text{SSR}} + \underbrace{\sum_i (y_i - \hat{y}_i)^2}_{\text{SSE}}$$
Or symbolically:
$$\text{SST} = \text{SSR} + \text{SSE}$$
Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares
SST (Total): Total variability in y around its mean. SSR (Regression/Model): Variability explained by the regression—how much ŷ varies from ȳ. SSE (Error/Residual): Variability unexplained—how much y deviates from ŷ. The Pythagorean theorem guarantees SST = SSR + SSE.
R-Squared: The Coefficient of Determination
The proportion of total variability explained by the model:
$$R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}$$
Geometric Interpretation: $R^2$ is the squared cosine of the angle between $\tilde{\mathbf{y}}$ and $\tilde{\hat{\mathbf{y}}}$:
$$R^2 = \cos^2(\theta)$$
where $\theta$ is the angle between the centered observation vector and the centered prediction vector.
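The angle interpretation can be confirmed numerically: computing $R^2$ as a ratio of squared lengths and as $\cos^2\theta$ gives the same number. A sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
x = rng.uniform(0, 10, n)
y = 1 + 2 * x + rng.normal(0, 3, n)
X = np.column_stack([np.ones(n), x])
y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]

# Center both vectors (y_hat has the same mean as y)
y_c = y - y.mean()
yhat_c = y_hat - y.mean()

# R^2 two ways: ratio of squared lengths, and squared cosine of the angle
R2_ratio = (yhat_c @ yhat_c) / (y_c @ y_c)
cos_theta = (y_c @ yhat_c) / (np.linalg.norm(y_c) * np.linalg.norm(yhat_c))
R2_angle = cos_theta**2

assert np.isclose(R2_ratio, R2_angle)
print(f"R² = {R2_ratio:.4f}")
```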
| Statistic | Geometric Meaning | Formula |
|---|---|---|
| $|\tilde{\mathbf{y}}|^2$ | Squared length of centered y | SST = $\sum(y_i - \bar{y})^2$ |
| $|\tilde{\hat{\mathbf{y}}}|^2$ | Squared length of projected component | SSR = $\sum(\hat{y}_i - \bar{y})^2$ |
| $|\mathbf{e}|^2$ | Squared length of residual | SSE = $\sum(y_i - \hat{y}_i)^2$ |
| $R^2$ | Cosine squared of angle | SSR/SST = 1 - SSE/SST |
| $R$ | Cosine of angle (= correlation) | $r_{y\hat{y}}$ |
The hat matrix reveals important information about individual observations. Each fitted value is a weighted combination of all observed $y$ values:
$$\hat{y}_i = \sum_{j=1}^n h_{ij} y_j$$
where $h_{ij}$ is the $(i,j)$ element of $\mathbf{H}$.
Leverage Values
The diagonal elements $h_{ii}$ are called leverage values or hat values:
$$h_{ii} = [\mathbf{H}]_{ii}$$
For simple linear regression, leverage has a closed form:
$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}$$
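The closed form can be checked against the diagonal of $\mathbf{H}$ directly. A sketch with simulated $x$ values:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
x = rng.uniform(1, 10, n)
X = np.column_stack([np.ones(n), x])

# Leverage from the hat matrix diagonal
H = X @ np.linalg.inv(X.T @ X) @ X.T
h_diag = np.diag(H)

# Leverage from the closed form for simple regression
Sxx = np.sum((x - x.mean())**2)
h_formula = 1/n + (x - x.mean())**2 / Sxx

assert np.allclose(h_diag, h_formula)

# Leverages sum to the number of parameters: trace(H) = 2
assert np.isclose(h_diag.sum(), 2.0)
print(h_diag.min(), h_diag.max())
```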
High leverage means an observation could strongly influence the fit, but only if its y value is unusual for its x. An observation is truly influential if it has both high leverage AND a large residual. A point with high leverage but low residual actually anchors the fit in a good way.
Residual Variance and Leverage
The variance of residuals is not constant—it depends on leverage:
$$\text{Var}(e_i) = \sigma^2(1 - h_{ii})$$
High-leverage points have smaller residual variance. This makes sense: if $h_{ii}$ is close to 1, the fitted line is forced to pass near $(x_i, y_i)$, leaving little residual.
This is why we use studentized residuals for diagnostics:
$$t_i = \frac{e_i}{s\sqrt{1 - h_{ii}}}$$
where $s = \sqrt{\text{SSE}/(n-2)}$ is the estimated residual standard deviation. Studentization adjusts for the unequal variance, putting every residual on a common scale.
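A short sketch of (internally) studentized residuals, using simulated data:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
x = rng.uniform(0, 10, n)
y = 1 + 0.8 * x + rng.normal(0, 2, n)
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y

# s^2 = SSE / (n - 2): the usual estimate of sigma^2
s = np.sqrt(e @ e / (n - 2))

# Each residual is scaled by its own estimated standard
# deviation s * sqrt(1 - h_ii)
h = np.diag(H)
t = e / (s * np.sqrt(1 - h))

print(np.round(t[:5], 3))
```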
Let's visualize the geometric concepts with code. We'll compute the hat matrix, verify its properties, and examine the projection.
```python
import numpy as np

# Generate sample data
np.random.seed(42)
n = 20
x = np.random.uniform(1, 10, n)
y = 3 + 2 * x + np.random.normal(0, 2, n)

# Design matrix (with intercept)
X = np.column_stack([np.ones(n), x])

# Hat matrix: H = X(X'X)^(-1)X'
XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T

# Fitted values (projection of y)
y_hat = H @ y

# Residuals
e = y - y_hat

# Verify properties
print("Hat Matrix Properties:")
print(f"  H is symmetric: {np.allclose(H, H.T)}")
print(f"  H is idempotent (H²=H): {np.allclose(H @ H, H)}")
print(f"  Trace of H (should be 2): {np.trace(H):.4f}")
print()

# Verify orthogonality of residuals
print("Orthogonality of Residuals:")
print(f"  1'e (should be ~0): {np.ones(n) @ e:.2e}")
print(f"  x'e (should be ~0): {x @ e:.2e}")
print()

# Sums of squares decomposition
y_bar = np.mean(y)
SST = np.sum((y - y_bar)**2)
SSR = np.sum((y_hat - y_bar)**2)
SSE = np.sum(e**2)
R_squared = SSR / SST

print("Pythagorean Decomposition:")
print(f"  SST = {SST:.4f}")
print(f"  SSR = {SSR:.4f}")
print(f"  SSE = {SSE:.4f}")
print(f"  SSR + SSE = {SSR + SSE:.4f}")
print(f"  R² = {R_squared:.4f}")
print()

# Leverage values (diagonal of H)
leverage = np.diag(H)
print("Leverage Values:")
print(f"  Min leverage: {leverage.min():.4f} (at x near mean)")
print(f"  Max leverage: {leverage.max():.4f} (at extreme x)")
print(f"  Mean leverage: {leverage.mean():.4f} (should be 2/n = {2/n:.4f})")
```

Running this code confirms all the geometric properties we discussed: H is symmetric and idempotent, residuals are orthogonal to the column space, SST = SSR + SSE, and leverage values follow the expected pattern. This kind of numerical verification builds intuition and catches implementation errors.
The geometric perspective provides deep insight into what linear regression actually computes: least squares is the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$, the residual is the perpendicular component left over, and $R^2$ measures the angle between the centered data and its fit.
What's Next
Having understood what the OLS estimators are and how they arise geometrically, we turn to coefficient interpretation. How do we meaningfully explain what $\hat{\beta}_0$ and $\hat{\beta}_1$ tell us about the real world? What causal claims can (and can't) we make?
You now possess the geometric intuition for linear regression. This view—regression as projection—is one of the most powerful mental models in statistical learning. It explains why OLS works, what R² measures, and how individual observations influence the fit. Next, we'll interpret the meaning of our fitted coefficients.