Least squares is often presented as minimizing squared errors—a purely algebraic operation. But the true power of the method emerges when you see it geometrically. Every least squares problem is a projection problem in disguise.
In this page, we develop rich geometric intuition for least squares as orthogonal projection.
This geometric view transforms least squares from a formula to memorize into a picture to understand.
By the end of this page, you will visualize least squares as orthogonal projection, understand why squared error is special (Pythagorean theorem), and see the fitted values as the shadow of the target onto the model space.
Consider regression with n observations and p features. The design matrix X is n × p.
Key insight: Instead of viewing each row as a data point, view each column as a vector in n-dimensional space.
The goal becomes: find the point Xβ̂ in Col(X) closest to y.
Why this perspective helps:
In observation space, the fitted values ŷ = Xβ̂ are literally the closest point to y in the subspace of possible predictions. The residual e = y - ŷ is the vector from ŷ to y, perpendicular to Col(X).
| View | Vectors | Ambient dimension | Goal |
|---|---|---|---|
| Feature space | n observations (rows of X) | p (features) | Fit a regression surface through the points |
| Observation space | p feature columns of X | n (observations) | Find the point in Col(X) closest to y |
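To make the contrast concrete, here is a minimal sketch (with a small made-up design matrix) that prints the same X both ways: rows as points in feature space, columns as vectors in observation space.

```python
import numpy as np

# A small, hypothetical design matrix: n = 4 observations, p = 2 features
X = np.array([[1.0, 2.0],
              [1.0, 0.5],
              [1.0, 3.0],
              [1.0, 1.5]])

# Feature-space view: each row is one observation living in R^p
for i, row in enumerate(X):
    print(f"observation {i}: point in R^{X.shape[1]} -> {row}")

# Observation-space view: each column is one feature vector living in R^n
for j in range(X.shape[1]):
    print(f"feature {j}: vector in R^{X.shape[0]} -> {X[:, j]}")
```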
The column space Col(X) is the set of all linear combinations of feature columns:
Col(X) = {Xβ : β ∈ ℝᵖ}
This is the space of all possible predictions your linear model can make. The dimension of Col(X) equals rank(X), typically p when features are linearly independent.
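A quick numerical check of this claim, using two small made-up matrices: np.linalg.matrix_rank reports dim Col(X), and a duplicated column does not enlarge the space of possible predictions.

```python
import numpy as np

# Independent columns: dim Col(X) = p
X_full = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [0.0, 1.0]])
print("rank of X_full:", np.linalg.matrix_rank(X_full))   # 2 = p

# Duplicated column: the extra copy adds no new direction,
# so the space of possible predictions does not grow
X_dup = np.column_stack([X_full, X_full[:, 1]])
print("rank of X_dup: ", np.linalg.matrix_rank(X_dup))    # still 2, even though p = 3
```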
Least squares as projection:
The vector y (actual responses) typically doesn't lie in Col(X)—there's no β making Xβ = y exactly. Least squares finds:
ŷ = proj_{Col(X)}(y) = X(XᵀX)⁻¹Xᵀy
This is the orthogonal projection of y onto Col(X). It's the unique point in Col(X) closest to y in Euclidean distance.
```python
import numpy as np
from numpy.linalg import norm

def visualize_projection_geometry(X, y):
    """
    Analyze the geometric structure of least squares projection.
    """
    n, p = X.shape

    # Projection matrix and fitted values
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    y_hat = H @ y
    e = y - y_hat

    print("=== Geometric View of Least Squares ===")
    print(f"\nObservation space dimension: n = {n}")
    print(f"Column space dimension: p = {p}")
    print(f"\nTarget y: {y}")
    print(f"Projection ŷ (in Col(X)): {np.round(y_hat, 4)}")
    print(f"Residual e (⊥ to Col(X)): {np.round(e, 4)}")

    print(f"\n=== Distances ===")
    print(f"||y|| (distance from origin): {norm(y):.4f}")
    print(f"||ŷ|| (projection length): {norm(y_hat):.4f}")
    print(f"||e|| (error/distance to Col(X)): {norm(e):.4f}")

    print(f"\n=== Pythagorean Theorem ===")
    print(f"||y||² = ||ŷ||² + ||e||²")
    print(f"{norm(y)**2:.4f} = {norm(y_hat)**2:.4f} + {norm(e)**2:.4f}")
    print(f"       = {norm(y_hat)**2 + norm(e)**2:.4f}")

    # Verify orthogonality
    print(f"\n=== Orthogonality ===")
    print(f"ŷ · e = {np.dot(y_hat, e):.2e} (should be ≈ 0)")
    for j in range(p):
        print(f"X[:, {j}] · e = {np.dot(X[:, j], e):.2e}")

# Example: 3D space, 2D column space (a plane)
X = np.array([[1, 0], [1, 1], [0, 1]])
y = np.array([1, 2, 3])
visualize_projection_geometry(X, y)
```

The orthogonal projection creates a right angle between the fitted values and residuals. The Pythagorean theorem then gives:
||y||² = ||ŷ||² + ||e||²
Or in terms of sums of squares:
SS_total = SS_regression + SS_residual
This is the fundamental decomposition underlying R²:
R² = SS_regression / SS_total = ||ŷ - ȳ||² / ||y - ȳ||²
R² measures what fraction of y's variance is captured by the column space. It's literally the squared cosine of the angle between the centered y and the centered fitted values.
Squared error isn't arbitrary—it corresponds to Euclidean geometry. Minimizing ||y - Xβ||² means minimizing Euclidean distance, which connects to orthogonal projection and the Pythagorean theorem. Other error measures (L1, etc.) have different geometric meanings.
Geometric interpretation of R²:
In observation space, working with centered vectors (subtracting ȳ, which including an intercept handles automatically):

R² = cos²(θ) = ||ŷ - ȳ||² / ||y - ȳ||²

where θ is the angle between the centered target y - ȳ and the centered fitted values ŷ - ȳ. When θ = 0 (y lies in Col(X)), R² = 1: a perfect fit. When θ = 90° (the centered y is perpendicular to Col(X)), R² = 0: no linear relationship.
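A minimal numerical check of this identity, assuming an intercept is included so that centering by ȳ is the right comparison; the data below are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat

# Center both vectors (with an intercept, residuals sum to zero,
# so ŷ and y share the same mean ȳ)
yc = y - y.mean()
yhc = y_hat - y_hat.mean()

r_squared = np.sum(yhc**2) / np.sum(yc**2)      # ||ŷ - ȳ||² / ||y - ȳ||²
cos_theta = np.dot(yc, yhc) / (np.linalg.norm(yc) * np.linalg.norm(yhc))

print(f"R²      = {r_squared:.6f}")
print(f"cos²(θ) = {cos_theta**2:.6f}")           # matches R²
```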
The defining property of orthogonal projection:
e ⊥ Col(X) equivalently Xᵀe = 0
Each feature column is orthogonal to the residual vector, so (once an intercept is included) every feature has zero sample correlation with the residuals: nothing left in e can be explained linearly by the features already in the model.

Implications for regression diagnostics: because Xᵀe = 0 holds by construction, a residual-versus-feature plot can never show a linear trend; any structure you do see must be nonlinear. The check below verifies the orthogonality numerically.
```python
import numpy as np

def verify_residual_orthogonality(X, y):
    """
    Verify that residuals are orthogonal to the column space.
    This is the geometric essence of least squares.
    """
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    y_hat = X @ beta_hat
    e = y - y_hat

    print("=== Residual Orthogonality Verification ===")
    print(f"\nCoefficients β̂: {beta_hat}")
    print(f"Fitted ŷ: {y_hat}")
    print(f"Residual e: {e}")

    # Xᵀe should be the zero vector
    XTe = X.T @ e
    print(f"\nXᵀe = {XTe}")
    print(f"||Xᵀe|| = {np.linalg.norm(XTe):.2e}")

    # Each column's inner product with the residual
    print(f"\nColumn-by-column orthogonality:")
    for j in range(X.shape[1]):
        print(f"  Feature {j}: X[:,{j}]·e = {np.dot(X[:, j], e):.2e}")

    # If an intercept is included, the residuals sum to zero
    if np.allclose(X[:, 0], 1):
        print(f"\nSum of residuals (intercept effect): {np.sum(e):.2e}")

X = np.column_stack([np.ones(5), np.arange(5)])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
verify_residual_orthogonality(X, y)
```

What happens geometrically when you add a new feature?
Adding feature xₚ₊₁ expands Col(X) from a p-dimensional subspace to (potentially) a (p+1)-dimensional subspace.
Geometrically, the old column space sits inside the new one: Col(X_old) ⊆ Col(X_new), so every prediction the old model could make is still available to the new model.

Key insight: ||e_new|| ≤ ||e_old|| always!

Adding features can only decrease (or maintain) the residual size because the new projection minimizes distance to y over a larger set that already contains the old fitted vector ŷ_old as one candidate.
This explains why R² never decreases when adding features—the projection can only improve. Adjusted R² penalizes complexity to counter this geometric guarantee.
With n linearly independent columns (p = n), Col(X) = ℝⁿ, the full observation space. Then proj_{Col(X)}(y) = y exactly—perfect fit with zero error! But this is overfitting: we've memorized noise. The geometry shows R² = 1 is easy to achieve with enough features, even random ones, as the sketch below demonstrates.
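Here is a small simulation sketching both geometric facts, using pure-noise features (the data and seed are arbitrary): the residual norm never increases as columns are appended, and R² reaches 1 once the number of linearly independent columns hits n.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 8
y = rng.normal(size=n)

# Start with an intercept column and keep appending random (noise) features
X = np.ones((n, 1))
for p in range(1, n + 1):
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_hat
    ss_res = np.sum(e**2)
    ss_tot = np.sum((y - y.mean())**2)
    r2 = 1 - ss_res / ss_tot
    print(f"p = {p}: ||e|| = {np.linalg.norm(e):.4f}, R² = {r2:.4f}")
    if p < n:
        X = np.column_stack([X, rng.normal(size=n)])  # add a pure-noise feature
```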
The geometric picture is clearest with an orthonormal basis for Col(X). Using QR decomposition X = QR:
ŷ = QQᵀy = Σᵢ (qᵢᵀy)qᵢ
Each qᵢ is a unit vector, and (qᵢᵀy) is the component of y in that direction. The projection is literally the sum of shadows on orthogonal axes.
Coefficients in the orthonormal picture:
The "loadings" (qᵢᵀy) measure how much y aligns with each orthonormal direction. Large loadings mean y has significant component in that direction.
This connects to PCA: principal components are orthonormal directions that maximize these loadings for centered data.
```python
import numpy as np
from numpy.linalg import qr

def orthonormal_projection_decomposition(X, y):
    """
    Show the projection as a sum of orthogonal components.
    """
    Q, R = qr(X)
    p = Q.shape[1]

    print("=== Orthonormal Decomposition ===")
    print(f"\nOrthonormal basis Q for Col(X):")
    for i in range(p):
        print(f"  q_{i}: {np.round(Q[:, i], 4)}")

    print(f"\nProjection as sum of orthogonal components:")
    y_hat = np.zeros_like(y, dtype=float)
    for i in range(p):
        loading = np.dot(Q[:, i], y)
        component = loading * Q[:, i]
        y_hat += component
        print(f"  (q_{i}ᵀy)q_{i} = {loading:.4f} × {np.round(Q[:, i], 3)} = {np.round(component, 4)}")

    print(f"\nSum = ŷ = {np.round(y_hat, 4)}")
    print(f"\nVerification via QQᵀy: {np.round(Q @ Q.T @ y, 4)}")

X = np.array([[1, 0], [1, 1], [0, 1]])
y = np.array([1, 2, 3])
orthonormal_projection_decomposition(X, y)
```

You now see least squares geometrically—as projection onto the column space with perpendicular error. Next, we'll study the pseudoinverse, which generalizes projection to handle rank-deficient cases.