Least squares is often presented as minimizing squared errors—a purely algebraic operation. But the true power of the method emerges when you see it geometrically. Every least squares problem is a projection problem in disguise.
In this page, we develop rich geometric intuition for least squares as orthogonal projection.
This geometric view transforms least squares from a formula to memorize into a picture to understand.
By the end of this page, you will visualize least squares as orthogonal projection, understand why squared error is special (Pythagorean theorem), and see the fitted values as the shadow of the target onto the model space.
Consider regression with n observations and p features. The design matrix X is n × p.
Key insight: Instead of viewing each row as a data point, view each column as a vector in n-dimensional space.
The goal becomes: find the point Xβ̂ in Col(X) closest to y.
Why this perspective helps:
In observation space, the fitted values ŷ = Xβ̂ are literally the closest point to y in the subspace of possible predictions. The residual e = y - ŷ is the vector from ŷ to y, perpendicular to Col(X).
| View | Vectors | Ambient dimension | Goal |
|---|---|---|---|
| Feature space | n observations (rows of X) | p (features) | Fit a regression surface through the points |
| Observation space | p feature columns of X | n (observations) | Find the point in Col(X) closest to y |
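To make the contrast concrete, here is a minimal sketch (with a small made-up design matrix) that prints the same X both ways: rows as points in feature space, columns as vectors in observation space.

```python
import numpy as np

# A small, hypothetical design matrix: n = 4 observations, p = 2 features
X = np.array([[1.0, 2.0],
              [1.0, 0.5],
              [1.0, 3.0],
              [1.0, 1.5]])

# Feature-space view: each row is one observation living in R^p
for i, row in enumerate(X):
    print(f"observation {i}: point in R^{X.shape[1]} -> {row}")

# Observation-space view: each column is one feature vector living in R^n
for j in range(X.shape[1]):
    print(f"feature {j}: vector in R^{X.shape[0]} -> {X[:, j]}")
```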
The column space Col(X) is the set of all linear combinations of feature columns:
Col(X) = {Xβ : β ∈ ℝᵖ}
This is the space of all possible predictions your linear model can make. The dimension of Col(X) equals rank(X), typically p when features are linearly independent.
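A quick numerical check of this claim, using two small made-up matrices: np.linalg.matrix_rank reports dim Col(X), and a duplicated column does not enlarge the space of possible predictions.

```python
import numpy as np

# Independent columns: dim Col(X) = p
X_full = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [0.0, 1.0]])
print("rank of X_full:", np.linalg.matrix_rank(X_full))   # 2 = p

# Duplicated column: the extra copy adds no new direction,
# so the space of possible predictions does not grow
X_dup = np.column_stack([X_full, X_full[:, 1]])
print("rank of X_dup: ", np.linalg.matrix_rank(X_dup))    # still 2, even though p = 3
```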
Least squares as projection:
The vector y (actual responses) typically doesn't lie in Col(X)—there's no β making Xβ = y exactly. Least squares finds:
ŷ = proj_{Col(X)}(y) = X(XᵀX)⁻¹Xᵀy
This is the orthogonal projection of y onto Col(X). It's the unique point in Col(X) closest to y in Euclidean distance.
```python
import numpy as np
from numpy.linalg import norm

def visualize_projection_geometry(X, y):
    """
    Analyze the geometric structure of least squares projection.
    """
    n, p = X.shape

    # Projection matrix and fitted values
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    y_hat = H @ y
    e = y - y_hat

    print("=== Geometric View of Least Squares ===")
    print(f"\nObservation space dimension: n = {n}")
    print(f"Column space dimension: p = {p}")
    print(f"\nTarget y: {y}")
    print(f"Projection ŷ (in Col(X)): {np.round(y_hat, 4)}")
    print(f"Residual e (⊥ to Col(X)): {np.round(e, 4)}")

    print(f"\n=== Distances ===")
    print(f"||y|| (distance from origin): {norm(y):.4f}")
    print(f"||ŷ|| (projection length): {norm(y_hat):.4f}")
    print(f"||e|| (error/distance to Col(X)): {norm(e):.4f}")

    print(f"\n=== Pythagorean Theorem ===")
    print(f"||y||² = ||ŷ||² + ||e||²")
    print(f"{norm(y)**2:.4f} = {norm(y_hat)**2:.4f} + {norm(e)**2:.4f}")
    print(f"       = {norm(y_hat)**2 + norm(e)**2:.4f}")

    # Verify orthogonality
    print(f"\n=== Orthogonality ===")
    print(f"ŷ · e = {np.dot(y_hat, e):.2e} (should be ≈ 0)")
    for j in range(p):
        print(f"X[:, {j}] · e = {np.dot(X[:, j], e):.2e}")

# Example: 3D space, 2D column space (a plane)
X = np.array([[1, 0], [1, 1], [0, 1]])
y = np.array([1, 2, 3])
visualize_projection_geometry(X, y)
```

The orthogonal projection creates a right angle between the fitted values and residuals. The Pythagorean theorem then gives:
||y||² = ||ŷ||² + ||e||²
Or in terms of sums of squares:
SS_total = SS_regression + SS_residual
This is the fundamental decomposition underlying R²:
R² = SS_regression / SS_total = ||ŷ - ȳ||² / ||y - ȳ||²
R² measures what fraction of y's variance is captured by the column space. It's literally the squared cosine of the angle between the centered y and the centered fitted values.
Squared error isn't arbitrary—it corresponds to Euclidean geometry. Minimizing ||y - Xβ||² means minimizing Euclidean distance, which connects to orthogonal projection and the Pythagorean theorem. Other error measures (L1, etc.) have different geometric meanings.
Geometric interpretation of R²:
In observation space, working with centered vectors (subtracting ȳ, which including an intercept handles automatically):

R² = cos²(θ) = ||ŷ - ȳ||² / ||y - ȳ||²

where θ is the angle between the centered target y - ȳ and the centered fitted values ŷ - ȳ. When θ = 0 (y lies in Col(X)), R² = 1: a perfect fit. When θ = 90° (the centered y is perpendicular to Col(X)), R² = 0: no linear relationship.
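A minimal numerical check of this identity, assuming an intercept is included so that centering by ȳ is the right comparison; the data below are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat

# Center both vectors (with an intercept, residuals sum to zero,
# so ŷ and y share the same mean ȳ)
yc = y - y.mean()
yhc = y_hat - y_hat.mean()

r_squared = np.sum(yhc**2) / np.sum(yc**2)      # ||ŷ - ȳ||² / ||y - ȳ||²
cos_theta = np.dot(yc, yhc) / (np.linalg.norm(yc) * np.linalg.norm(yhc))

print(f"R²      = {r_squared:.6f}")
print(f"cos²(θ) = {cos_theta**2:.6f}")           # matches R²
```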
The defining property of orthogonal projection:
e ⊥ Col(X) equivalently Xᵀe = 0
Each feature column is orthogonal to the residual vector, so (once an intercept is included) every feature has zero sample correlation with the residuals: nothing left in e can be explained linearly by the features already in the model.

Implications for regression diagnostics: because Xᵀe = 0 holds by construction, a residual-versus-feature plot can never show a linear trend; any structure you do see must be nonlinear. The check below verifies the orthogonality numerically.
```python
import numpy as np

def verify_residual_orthogonality(X, y):
    """
    Verify that residuals are orthogonal to the column space.
    This is the geometric essence of least squares.
    """
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    y_hat = X @ beta_hat
    e = y - y_hat

    print("=== Residual Orthogonality Verification ===")
    print(f"\nCoefficients β̂: {beta_hat}")
    print(f"Fitted ŷ: {y_hat}")
    print(f"Residual e: {e}")

    # Xᵀe should be the zero vector
    XTe = X.T @ e
    print(f"\nXᵀe = {XTe}")
    print(f"||Xᵀe|| = {np.linalg.norm(XTe):.2e}")

    # Each column's inner product with the residual
    print(f"\nColumn-by-column orthogonality:")
    for j in range(X.shape[1]):
        print(f"  Feature {j}: X[:,{j}]·e = {np.dot(X[:, j], e):.2e}")

    # If an intercept is included, the residuals sum to zero
    if np.allclose(X[:, 0], 1):
        print(f"\nSum of residuals (intercept effect): {np.sum(e):.2e}")

X = np.column_stack([np.ones(5), np.arange(5)])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
verify_residual_orthogonality(X, y)
```

What happens geometrically when you add a new feature?
Adding feature xₚ₊₁ expands Col(X) from a p-dimensional subspace to (potentially) a (p+1)-dimensional subspace.
Geometrically, the old column space sits inside the new one: Col(X_old) ⊆ Col(X_new), so every prediction the old model could make is still available to the new model.

Key insight: ||e_new|| ≤ ||e_old|| always!

Adding features can only decrease (or maintain) the residual size because the new projection minimizes distance to y over a larger set that already contains the old fitted vector ŷ_old as one candidate.
This explains why R² never decreases when adding features—the projection can only improve. Adjusted R² penalizes complexity to counter this geometric guarantee.
With n linearly independent columns (p = n), Col(X) = ℝⁿ, the full observation space. Then proj_{Col(X)}(y) = y exactly—perfect fit with zero error! But this is overfitting: we've memorized noise. The geometry shows R² = 1 is easy to achieve with enough features, even random ones, as the sketch below demonstrates.
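Here is a small simulation sketching both geometric facts, using pure-noise features (the data and seed are arbitrary): the residual norm never increases as columns are appended, and R² reaches 1 once the number of linearly independent columns hits n.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 8
y = rng.normal(size=n)

# Start with an intercept column and keep appending random (noise) features
X = np.ones((n, 1))
for p in range(1, n + 1):
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_hat
    ss_res = np.sum(e**2)
    ss_tot = np.sum((y - y.mean())**2)
    r2 = 1 - ss_res / ss_tot
    print(f"p = {p}: ||e|| = {np.linalg.norm(e):.4f}, R² = {r2:.4f}")
    if p < n:
        X = np.column_stack([X, rng.normal(size=n)])  # add a pure-noise feature
```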
The geometric picture is clearest with an orthonormal basis for Col(X). Using QR decomposition X = QR:
ŷ = QQᵀy = Σᵢ (qᵢᵀy)qᵢ
Each qᵢ is a unit vector, and (qᵢᵀy) is the component of y in that direction. The projection is literally the sum of shadows on orthogonal axes.
Coefficients in the orthonormal picture:
The "loadings" (qᵢᵀy) measure how much y aligns with each orthonormal direction. Large loadings mean y has significant component in that direction.
This connects to PCA: principal components are orthonormal directions that maximize these loadings for centered data.
```python
import numpy as np
from numpy.linalg import qr

def orthonormal_projection_decomposition(X, y):
    """
    Show the projection as a sum of orthogonal components.
    """
    Q, R = qr(X)
    p = Q.shape[1]

    print("=== Orthonormal Decomposition ===")
    print(f"\nOrthonormal basis Q for Col(X):")
    for i in range(p):
        print(f"  q_{i}: {np.round(Q[:, i], 4)}")

    print(f"\nProjection as sum of orthogonal components:")
    y_hat = np.zeros_like(y, dtype=float)
    for i in range(p):
        loading = np.dot(Q[:, i], y)
        component = loading * Q[:, i]
        y_hat += component
        print(f"  (q_{i}ᵀy)q_{i} = {loading:.4f} × {np.round(Q[:, i], 3)} = {np.round(component, 4)}")

    print(f"\nSum = ŷ = {np.round(y_hat, 4)}")
    print(f"\nVerification via QQᵀy: {np.round(Q @ Q.T @ y, 4)}")

X = np.array([[1, 0], [1, 1], [0, 1]])
y = np.array([1, 2, 3])
orthonormal_projection_decomposition(X, y)
```

You now see least squares geometrically—as projection onto the column space with perpendicular error. Next, we'll study the pseudoinverse, which generalizes projection to handle rank-deficient cases.