The algebraic derivation of OLS—setting derivatives to zero, solving systems of equations—gives us formulas. But it leaves unanswered a deeper question: What does least squares actually do, geometrically?
The answer is beautiful and profound: OLS finds the orthogonal projection of the observed data onto the space of possible predictions. This geometric interpretation transforms linear regression from a collection of formulas into a unified visual concept.
Understanding this geometry isn't just intellectually satisfying—it provides practical intuition for diagnosing regression problems, understanding R-squared, and extending to more complex models.
By the end of this page, you will visualize regression in n-dimensional space, understand why residuals are perpendicular to predictors, interpret R² as the squared length ratio, and see how the hat matrix projects observations onto the model space.
To understand the geometry of regression, we need to think of our data as vectors in high-dimensional space. This may feel abstract at first, but it's remarkably powerful.
Data as Vectors
Consider our $n$ observations. Each variable—$y$, $x$, even the constant intercept term—can be viewed as a vector in $\mathbb{R}^n$:
$$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}, \quad \mathbf{1} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}$$
Each component of the vector corresponds to one observation. The vector $\mathbf{y}$ represents all $n$ observed responses stacked together.
There are two geometric perspectives on regression data. The standard view plots each observation as a point in (x, y) space—a 2D scatterplot. The geometric view treats x and y as vectors in n-dimensional space, where each dimension corresponds to an observation. Both are valid; they reveal different insights.
Inner Products and Orthogonality
In $\mathbb{R}^n$, the inner product of two vectors is:
$$\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^T \mathbf{v} = \sum_{i=1}^n u_i v_i$$
Two vectors are orthogonal (perpendicular) if their inner product is zero:
$$\mathbf{u} \perp \mathbf{v} \iff \mathbf{u}^T \mathbf{v} = 0$$
The length (norm) of a vector is:
$$|\mathbf{u}| = \sqrt{\mathbf{u}^T\mathbf{u}} = \sqrt{\sum_{i=1}^n u_i^2}$$
These concepts from linear algebra are the building blocks of the geometric view.
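These definitions translate directly into NumPy. A quick sketch (the vectors here are arbitrary, chosen only to illustrate orthogonality):

```python
import numpy as np

u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, 1.0, -2.0])

# Inner product: u'v = 1*2 + 2*1 + 2*(-2) = 0, so u and v are orthogonal
inner = u @ v

# Norm: |u| = sqrt(u'u) = sqrt(1 + 4 + 4) = 3
norm_u = np.sqrt(u @ u)

print(inner)   # 0.0
print(norm_u)  # 3.0
```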
The design matrix $\mathbf{X}$ for simple linear regression has two columns:
$$\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} = [\mathbf{1} \mid \mathbf{x}]$$
What Can the Model Predict?
Any prediction from our model takes the form:
$$\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta} = \beta_0 \mathbf{1} + \beta_1 \mathbf{x}$$
As we vary $\beta_0$ and $\beta_1$ over all possible real values, what predictions can we generate?
The Column Space
The set of all possible predictions is the column space (or range) of $\mathbf{X}$:
$$\mathcal{C}(\mathbf{X}) = \{\mathbf{X}\boldsymbol{\beta} : \boldsymbol{\beta} \in \mathbb{R}^2\} = \{\beta_0 \mathbf{1} + \beta_1 \mathbf{x} : \beta_0, \beta_1 \in \mathbb{R}\}$$
This is a 2-dimensional subspace of $\mathbb{R}^n$ (assuming $\mathbf{1}$ and $\mathbf{x}$ are linearly independent). It's the plane spanned by the vectors $\mathbf{1}$ and $\mathbf{x}$.
Key Insight: The model can only produce predictions that lie in this plane. No matter what values of $\beta_0$ and $\beta_1$ we choose, $\hat{\mathbf{y}}$ is always in $\mathcal{C}(\mathbf{X})$. If the observed $\mathbf{y}$ doesn't lie in this plane, we can never predict it exactly.
With n observations and 2 parameters, the column space is a 2D plane sitting inside n-dimensional space. Think of it like a flat piece of paper floating in a room—except the room has n dimensions, not 3. The observed y is a point in this n-dimensional room, and we're finding the closest point on the paper.
Visualization Intuition
Imagine $n = 3$ for visualization purposes (we can draw 3D). In this simplified case, the observed $\mathbf{y}$ is a point in 3D space, and the column space $\mathcal{C}(\mathbf{X})$ is a plane through the origin spanned by $\mathbf{1}$ and $\mathbf{x}$. The question becomes: which point on the plane is closest to $\mathbf{y}$?
Here's the central geometric insight:
$$\text{Minimizing } |\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|^2 \text{ is equivalent to finding the closest point in } \mathcal{C}(\mathbf{X}) \text{ to } \mathbf{y}$$
Why? Because:
$$|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \text{RSS}$$
Minimizing RSS is minimizing the squared Euclidean distance in $\mathbb{R}^n$.
The Closest Point is the Orthogonal Projection
A fundamental theorem in linear algebra states: the closest point in a subspace to an external point is the orthogonal projection onto that subspace.
This means:
$$\hat{\mathbf{y}} = \text{proj}_{\mathcal{C}(\mathbf{X})} \mathbf{y}$$
The fitted values $\hat{\mathbf{y}}$ are the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$!
Drop a perpendicular from y to the plane (column space). Where it hits the plane is the projection. This projection minimizes distance because any other point on the plane would create a triangle where the hypotenuse (distance to y) is longer than the perpendicular leg.
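The triangle argument can be checked numerically: moving the coefficients away from the OLS solution slides $\hat{\mathbf{y}}$ within the plane, and the distance to $\mathbf{y}$ can only grow. A minimal sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])

# OLS coefficients: the coordinates of the projection of y onto C(X)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def rss(beta):
    r = y - X @ beta
    return r @ r

# Any perturbed beta gives another point in the plane; the perpendicular
# from y is shorter than any slanted path, so RSS strictly increases
for delta in [np.array([0.1, 0.0]), np.array([0.0, -0.05]), np.array([-0.2, 0.1])]:
    assert rss(beta_hat + delta) > rss(beta_hat)

print("OLS RSS:", rss(beta_hat))
```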
The Residual is Perpendicular
The residual vector is:
$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$$
Since $\hat{\mathbf{y}}$ is the orthogonal projection, $\mathbf{e}$ must be perpendicular to the column space:
$$\mathbf{e} \perp \mathcal{C}(\mathbf{X})$$
Specifically, $\mathbf{e}$ is orthogonal to every column of $\mathbf{X}$:
$$\mathbf{1}^T \mathbf{e} = 0 \implies \sum_i e_i = 0$$
$$\mathbf{x}^T \mathbf{e} = 0 \implies \sum_i x_i e_i = 0$$
These are exactly the normal equations we derived algebraically! The geometric and algebraic perspectives connect perfectly.
The projection operation can be expressed as matrix multiplication. From the OLS solution:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
The fitted values are:
$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
Define the hat matrix (or projection matrix):
$$\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$$
Then simply:
$$\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$$
The matrix $\mathbf{H}$ "puts a hat on $\mathbf{y}$"—hence the name. Like every orthogonal projection matrix, $\mathbf{H}$ is symmetric ($\mathbf{H}^T = \mathbf{H}$) and idempotent ($\mathbf{H}^2 = \mathbf{H}$): projecting twice is the same as projecting once.
The Residual-Maker Matrix
The residuals are:
$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{H}\mathbf{y} = (\mathbf{I} - \mathbf{H})\mathbf{y}$$
Define $\mathbf{M} = \mathbf{I} - \mathbf{H}$ (the residual-maker matrix):
$$\mathbf{e} = \mathbf{M}\mathbf{y}$$
$\mathbf{M}$ is also symmetric and idempotent. It projects onto the orthogonal complement of $\mathcal{C}(\mathbf{X})$—the space of all vectors perpendicular to predictions.
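These properties of $\mathbf{M}$ are easy to verify numerically. A sketch with a simulated design matrix (the data here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 15
x = rng.uniform(1, 10, n)
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H

# M is symmetric and idempotent, like H
assert np.allclose(M, M.T)
assert np.allclose(M @ M, M)

# M annihilates the column space: MX = 0, so e = My is
# orthogonal to every column of X
assert np.allclose(M @ X, 0)

# H and M project onto complementary subspaces: HM = 0
assert np.allclose(H @ M, 0)

# trace(M) = n - 2: the residual space has n - 2 dimensions
print(np.trace(M))
```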
We have an orthogonal decomposition: y = ŷ + e where ŷ ∈ C(X) and e ⊥ C(X). Every observation vector decomposes uniquely into a model part (explainable by X) and a residual part (unexplainable by X). This is the geometry underlying variance decomposition and R².
Since $\hat{\mathbf{y}}$ and $\mathbf{e}$ are orthogonal, the Pythagorean theorem applies:
$$|\mathbf{y}|^2 = |\hat{\mathbf{y}}|^2 + |\mathbf{e}|^2$$
This identity is true as stated, but it is not yet the variance decomposition we want: variability is measured around the mean, so we must center the vectors first. Let's be more careful.
Centered Vectors
Define centered vectors:
$$\tilde{\mathbf{y}} = \mathbf{y} - \bar{y}\mathbf{1}$$
$$\tilde{\hat{\mathbf{y}}} = \hat{\mathbf{y}} - \bar{y}\mathbf{1}$$
(Note: $\bar{\hat{y}} = \bar{y}$ from the normal equations, so we can use $\bar{y}$ for both.)
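The claim that the fitted values share $\mathbf{y}$'s mean follows from $\mathbf{1}^T\mathbf{e} = 0$ and is easy to check. A small sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 25
x = rng.uniform(0, 5, n)
y = 2 + 3 * x + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

# Because 1'e = 0, the fitted values have the same mean as y,
# so a single y-bar centers both vectors
assert np.isclose(y_hat.mean(), y.mean())
print(y.mean(), y_hat.mean())
```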
Variance Decomposition
Now we have the orthogonal decomposition:
$$\tilde{\mathbf{y}} = \tilde{\hat{\mathbf{y}}} + \mathbf{e}$$
Applying Pythagorean theorem:
$$|\tilde{\mathbf{y}}|^2 = |\tilde{\hat{\mathbf{y}}}|^2 + |\mathbf{e}|^2$$
Translating back to sums of squares:
$$\underbrace{\sum_i (y_i - \bar{y})^2}_{\text{SST}} = \underbrace{\sum_i (\hat{y}_i - \bar{y})^2}_{\text{SSR}} + \underbrace{\sum_i (y_i - \hat{y}_i)^2}_{\text{SSE}}$$
Or symbolically:
$$\text{SST} = \text{SSR} + \text{SSE}$$
Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares
SST (Total): Total variability in y around its mean. SSR (Regression/Model): Variability explained by the regression—how much ŷ varies from ȳ. SSE (Error/Residual): Variability unexplained—how much y deviates from ŷ. The Pythagorean theorem guarantees SST = SSR + SSE.
R-Squared: The Coefficient of Determination
The proportion of total variability explained by the model:
$$R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}$$
Geometric Interpretation: $R^2$ is the squared cosine of the angle between $\tilde{\mathbf{y}}$ and $\tilde{\hat{\mathbf{y}}}$:
$$R^2 = \cos^2(\theta)$$
where $\theta$ is the angle between the centered observation vector and the centered prediction vector.
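The angle interpretation can be confirmed numerically: computing $R^2$ as a ratio of squared lengths and as $\cos^2\theta$ gives the same number. A sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
x = rng.uniform(0, 10, n)
y = 1 + 2 * x + rng.normal(0, 3, n)
X = np.column_stack([np.ones(n), x])
y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]

# Center both vectors (y_hat has the same mean as y)
y_c = y - y.mean()
yhat_c = y_hat - y.mean()

# R^2 two ways: ratio of squared lengths, and squared cosine of the angle
R2_ratio = (yhat_c @ yhat_c) / (y_c @ y_c)
cos_theta = (y_c @ yhat_c) / (np.linalg.norm(y_c) * np.linalg.norm(yhat_c))
R2_angle = cos_theta**2

assert np.isclose(R2_ratio, R2_angle)
print(f"R² = {R2_ratio:.4f}")
```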
| Statistic | Geometric Meaning | Formula |
|---|---|---|
| $|\tilde{\mathbf{y}}|^2$ | Squared length of centered y | SST = $\sum(y_i - \bar{y})^2$ |
| $|\tilde{\hat{\mathbf{y}}}|^2$ | Squared length of projected component | SSR = $\sum(\hat{y}_i - \bar{y})^2$ |
| $|\mathbf{e}|^2$ | Squared length of residual | SSE = $\sum(y_i - \hat{y}_i)^2$ |
| $R^2$ | Cosine squared of angle | SSR/SST = 1 - SSE/SST |
| $R$ | Cosine of angle (= correlation) | $r_{y\hat{y}}$ |
The hat matrix reveals important information about individual observations. Each fitted value is a weighted combination of all observed $y$ values:
$$\hat{y}_i = \sum_{j=1}^n h_{ij} y_j$$
where $h_{ij}$ is the $(i,j)$ element of $\mathbf{H}$.
Leverage Values
The diagonal elements $h_{ii}$ are called leverage values or hat values:
$$h_{ii} = [\mathbf{H}]_{ii}$$
For simple linear regression, leverage has a closed form:
$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}$$
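The closed form can be checked against the diagonal of $\mathbf{H}$ directly. A sketch with simulated $x$ values:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
x = rng.uniform(1, 10, n)
X = np.column_stack([np.ones(n), x])

# Leverage from the hat matrix diagonal
H = X @ np.linalg.inv(X.T @ X) @ X.T
h_diag = np.diag(H)

# Leverage from the closed form for simple regression
Sxx = np.sum((x - x.mean())**2)
h_formula = 1/n + (x - x.mean())**2 / Sxx

assert np.allclose(h_diag, h_formula)

# Leverages sum to the number of parameters: trace(H) = 2
assert np.isclose(h_diag.sum(), 2.0)
print(h_diag.min(), h_diag.max())
```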
High leverage means an observation could strongly influence the fit, but only if its y value is unusual for its x. An observation is truly influential if it has both high leverage AND a large residual. A point with high leverage but low residual actually anchors the fit in a good way.
Residual Variance and Leverage
The variance of residuals is not constant—it depends on leverage:
$$\text{Var}(e_i) = \sigma^2(1 - h_{ii})$$
High-leverage points have smaller residual variance. This makes sense: if $h_{ii}$ is close to 1, the fitted line is forced to pass near $(x_i, y_i)$, leaving little residual.
This is why we use studentized residuals for diagnostics:
$$t_i = \frac{e_i}{s\sqrt{1 - h_{ii}}}$$
where $s = \sqrt{\text{SSE}/(n-2)}$ is the estimated residual standard deviation. Studentization adjusts for the unequal variance, putting every residual on a common scale.
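A short sketch of (internally) studentized residuals, using simulated data:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
x = rng.uniform(0, 10, n)
y = 1 + 0.8 * x + rng.normal(0, 2, n)
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y

# s^2 = SSE / (n - 2): the usual estimate of sigma^2
s = np.sqrt(e @ e / (n - 2))

# Each residual is scaled by its own estimated standard
# deviation s * sqrt(1 - h_ii)
h = np.diag(H)
t = e / (s * np.sqrt(1 - h))

print(np.round(t[:5], 3))
```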
Let's visualize the geometric concepts with code. We'll compute the hat matrix, verify its properties, and examine the projection.
```python
import numpy as np

# Generate sample data
np.random.seed(42)
n = 20
x = np.random.uniform(1, 10, n)
y = 3 + 2 * x + np.random.normal(0, 2, n)

# Design matrix (with intercept)
X = np.column_stack([np.ones(n), x])

# Hat matrix: H = X(X'X)^(-1)X'
XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T

# Fitted values (projection of y)
y_hat = H @ y

# Residuals
e = y - y_hat

# Verify properties
print("Hat Matrix Properties:")
print(f"  H is symmetric: {np.allclose(H, H.T)}")
print(f"  H is idempotent (H²=H): {np.allclose(H @ H, H)}")
print(f"  Trace of H (should be 2): {np.trace(H):.4f}")
print()

# Verify orthogonality of residuals
print("Orthogonality of Residuals:")
print(f"  1'e (should be ~0): {np.ones(n) @ e:.2e}")
print(f"  x'e (should be ~0): {x @ e:.2e}")
print()

# Sums of squares decomposition
y_bar = np.mean(y)
SST = np.sum((y - y_bar)**2)
SSR = np.sum((y_hat - y_bar)**2)
SSE = np.sum(e**2)
R_squared = SSR / SST

print("Pythagorean Decomposition:")
print(f"  SST = {SST:.4f}")
print(f"  SSR = {SSR:.4f}")
print(f"  SSE = {SSE:.4f}")
print(f"  SSR + SSE = {SSR + SSE:.4f}")
print(f"  R² = {R_squared:.4f}")
print()

# Leverage values (diagonal of H)
leverage = np.diag(H)
print("Leverage Values:")
print(f"  Min leverage: {leverage.min():.4f} (at x near mean)")
print(f"  Max leverage: {leverage.max():.4f} (at extreme x)")
print(f"  Mean leverage: {leverage.mean():.4f} (should be 2/n = {2/n:.4f})")
```

Running this code confirms all the geometric properties we discussed: H is symmetric and idempotent, residuals are orthogonal to the column space, SST = SSR + SSE, and leverage values follow the expected pattern. This kind of numerical verification builds intuition and catches implementation errors.
The geometric perspective provides deep insight into what linear regression actually computes: least squares is the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$, the residual is the perpendicular component left over, and $R^2$ measures the angle between the centered data and its fit.
What's Next
Having understood what the OLS estimators are and how they arise geometrically, we turn to coefficient interpretation. How do we meaningfully explain what $\hat{\beta}_0$ and $\hat{\beta}_1$ tell us about the real world? What causal claims can (and can't) we make?
You now possess the geometric intuition for linear regression. This view—regression as projection—is one of the most powerful mental models in statistical learning. It explains why OLS works, what R² measures, and how individual observations influence the fit. Next, we'll interpret the meaning of our fitted coefficients.