The normal equations are the system of linear equations whose solution gives the least squares coefficients. For a linear system Ax = b that may have no exact solution (overdetermined), the normal equations are:
AᵀA x = Aᵀb
This compact equation encodes both the optimization condition (setting the gradient to zero) and the geometric condition (error perpendicular to the column space). Understanding the normal equations deeply is essential for knowing when least squares has a unique solution, how regression coefficients should be interpreted, and how they are computed reliably in practice.
By the end of this page, you will derive the normal equations from three perspectives (calculus, geometry, algebra), understand when they have unique solutions, and see how they connect projection theory to practical computation of regression coefficients.
The least squares problem seeks x minimizing the squared error:
f(x) = ||Ax - b||² = (Ax - b)ᵀ(Ax - b)
Expanding:
f(x) = xᵀAᵀAx - 2xᵀAᵀb + bᵀb
To minimize, take the gradient and set to zero:
∇f(x) = 2AᵀAx - 2Aᵀb = 0
This gives the normal equations: AᵀA x = Aᵀb
Since the Hessian ∇²f = 2AᵀA is positive semi-definite (and positive definite when A has full column rank), this critical point is indeed a minimum.
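As a quick numerical sanity check of this claim, here is a minimal sketch (using an arbitrary full-column-rank matrix chosen purely for illustration) confirming that the Hessian 2AᵀA has strictly positive eigenvalues:

```python
import numpy as np

# Illustrative matrix with linearly independent columns (3 equations, 2 unknowns)
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])

# The Hessian of f(x) = ||Ax - b||^2 is 2AᵀA, independent of b
H = 2 * A.T @ A

# eigvalsh returns the real eigenvalues of a symmetric matrix;
# they are all > 0 here because A has full column rank
eigenvalues = np.linalg.eigvalsh(H)
print("Hessian eigenvalues:", eigenvalues)
print("Positive definite:", bool(np.all(eigenvalues > 0)))
```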
```python
import numpy as np
from numpy.linalg import inv, norm

def solve_normal_equations(A: np.ndarray, b: np.ndarray) -> dict:
    """Solve least squares via normal equations: AᵀAx = Aᵀb."""
    ATA = A.T @ A  # Gram matrix (n x n)
    ATb = A.T @ b  # Moment vector (n x 1)

    # Solve normal equations
    x_hat = inv(ATA) @ ATb

    # Compute fitted values and residuals
    b_hat = A @ x_hat
    residual = b - b_hat

    # Verify gradient is zero at solution
    gradient = 2 * (ATA @ x_hat - ATb)

    return {
        'coefficients': x_hat,
        'fitted': b_hat,
        'residual': residual,
        'gradient_norm': norm(gradient),
        'squared_error': norm(residual)**2
    }

# Example: Overdetermined system (3 equations, 2 unknowns)
A = np.array([[1, 1], [1, 2], [1, 3]])
b = np.array([2, 3, 5])

result = solve_normal_equations(A, b)
print(f"Coefficients: {result['coefficients']}")
print(f"Fitted values: {result['fitted']}")
print(f"Residual: {result['residual']}")
print(f"Gradient norm at solution: {result['gradient_norm']:.2e}")
```

From the projection perspective, Ax̂ is the orthogonal projection of b onto Col(A). The error e = b - Ax̂ must be perpendicular to every column of A:
aⱼ · (b - Ax̂) = 0 for all columns aⱼ
In matrix form, this says:
Aᵀ(b - Ax̂) = 0
Rearranging:
AᵀA x̂ = Aᵀb
The normal equations are simply the algebraic encoding of orthogonality. The matrix Aᵀ tests orthogonality against each column of A simultaneously.
These are called 'normal' equations because they ensure the error is 'normal' (perpendicular) to the column space. The term 'normal' here means perpendicular, as in 'normal to a surface'.
Visual interpretation:
Imagine b as a point in ℝᵐ. The column space Col(A) is a k-dimensional plane through the origin (where k = rank(A)). The normal equations find the point Ax̂ in this plane closest to b by requiring the connecting vector (b - Ax̂) to be perpendicular to the plane.
This is exactly what Aᵀ(b - Ax̂) = 0 says: the error has zero inner product with every basis vector of the plane.
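To make the orthogonality condition concrete, here is a minimal sketch (reusing the small 3×2 example from the code above, so the numbers are purely illustrative) that projects b onto Col(A) and verifies the error is perpendicular to every column:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([2.0, 3.0, 5.0])

# Least squares solution and the projection of b onto Col(A)
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
projection = A @ x_hat
error = b - projection

# Aᵀ(b - Ax̂) should be (numerically) zero: the error is
# orthogonal to every column of A
print("Aᵀ(b - Ax̂):", A.T @ error)
```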
When does AᵀA x = Aᵀb have a unique solution?
The matrix AᵀA is n×n (where A is m×n). Key facts: it is always symmetric and positive semi-definite, it has the same rank as A, and it is invertible exactly when A has full column rank.
Full column rank case (most common in ML):
When rank(A) = n (columns are linearly independent), AᵀA is invertible:
x̂ = (AᵀA)⁻¹Aᵀb — unique solution
Rank-deficient case:
When rank(A) < n, AᵀA is singular. There are infinitely many solutions (a subspace of solutions). The pseudoinverse selects the minimum-norm solution.
| Case | Condition | Number of Solutions | Approach |
|---|---|---|---|
| Full column rank | rank(A) = n | Exactly one | (AᵀA)⁻¹Aᵀb |
| Rank deficient | rank(A) < n | Infinitely many | Pseudoinverse A⁺b |
| Underdetermined | m < n | Infinitely many | Minimum norm solution |
```python
import numpy as np
from numpy.linalg import matrix_rank, pinv

def analyze_normal_equations(A: np.ndarray, b: np.ndarray):
    """Analyze solution properties of normal equations."""
    m, n = A.shape
    rank = matrix_rank(A)

    print(f"Matrix A: {m}×{n}, rank = {rank}")
    print(f"Full column rank: {rank == n}")

    ATA = A.T @ A
    print(f"AᵀA is {n}×{n}, rank = {matrix_rank(ATA)}")

    if rank == n:
        x_hat = np.linalg.inv(ATA) @ A.T @ b
        print(f"Unique solution: {x_hat}")
    else:
        x_hat = pinv(A) @ b
        print(f"Minimum norm solution (via pseudoinverse): {x_hat}")
        print("Note: infinitely many solutions exist")

    # Verify it satisfies normal equations
    residual = ATA @ x_hat - A.T @ b
    print(f"Normal equation residual: {np.linalg.norm(residual):.2e}")

# Full rank case
A1 = np.array([[1, 2], [3, 4], [5, 6]])
b1 = np.array([1, 2, 3])
print("=== Full Column Rank ===")
analyze_normal_equations(A1, b1)

# Rank deficient case
print("\n=== Rank Deficient ===")
A2 = np.array([[1, 2, 3], [2, 4, 6], [1, 2, 3]])  # Rank 1
b2 = np.array([1, 2, 1])
analyze_normal_equations(A2, b2)
```

The matrix AᵀA appearing in the normal equations is the Gram matrix of A's columns. Its entries are inner products:
(AᵀA)ᵢⱼ = aᵢ · aⱼ
where aᵢ is the i-th column of A.
Key properties: the Gram matrix is symmetric ((AᵀA)ᵀ = AᵀA), positive semi-definite (xᵀAᵀAx = ||Ax||² ≥ 0), has the same rank as A, and is positive definite exactly when the columns of A are linearly independent.
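These properties are easy to verify numerically; the sketch below (with an arbitrary example matrix) builds the Gram matrix both as AᵀA and entry by entry as column inner products:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

G = A.T @ A  # Gram matrix

# Entry (i, j) is the inner product of columns i and j
n = A.shape[1]
G_entrywise = np.array([[A[:, i] @ A[:, j] for j in range(n)] for i in range(n)])

print("Gram matrix:\n", G)
print("Matches entry-wise inner products:", np.allclose(G, G_entrywise))
print("Symmetric:", np.allclose(G, G.T))
print("Eigenvalues (all ≥ 0):", np.linalg.eigvalsh(G))
```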
In machine learning terms:
For a design matrix X with features as columns, XᵀX captures the pairwise inner products between features: the diagonal entries are the squared norms of the features, and the off-diagonal entries measure how strongly pairs of features overlap (proportional to their covariances when the features are centered).
When features are highly correlated, XᵀX becomes nearly singular (ill-conditioned). Small changes in data cause large changes in coefficients. Regularization (ridge regression) fixes this by adding λI to XᵀX.
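The sketch below illustrates this effect with made-up data (two nearly identical features and an arbitrary λ = 1.0); it is only a rough illustration of the ridge idea, not a full implementation. Adding λI raises the smallest eigenvalue of the Gram matrix, which sharply lowers the condition number and stabilizes the coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 1e-4 * rng.normal(size=50)      # nearly identical (highly correlated) feature
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=50)

XtX, Xty = X.T @ X, X.T @ y
lam = 1.0                                  # illustrative regularization strength

beta_ols = np.linalg.solve(XtX, Xty)                       # plain normal equations
beta_ridge = np.linalg.solve(XtX + lam * np.eye(2), Xty)   # ridge-regularized

print("cond(XᵀX):      ", np.linalg.cond(XtX))
print("cond(XᵀX + λI): ", np.linalg.cond(XtX + lam * np.eye(2)))
print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
```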
Don't explicitly invert AᵀA!
While x̂ = (AᵀA)⁻¹Aᵀb is theoretically correct, computing the inverse explicitly is wasteful and numerically less accurate than solving the linear system directly.
Better approaches:
Solve the linear system directly: Use np.linalg.solve(ATA, ATb) — exploits structure, O(n³)
QR decomposition: A = QR, then solve Rx = Qᵀb — more stable, O(mn²)
SVD: Most stable, handles rank-deficiency — O(mn²)
Cholesky decomposition: AᵀA = LLᵀ, then forward/back substitution — fast for well-conditioned problems
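For completeness, here is a minimal sketch of the Cholesky route (assuming SciPy is available; the small straight-line data set is the same illustrative one used in the comparison below):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]])
b = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

ATA, ATb = A.T @ A, A.T @ b

# Factor AᵀA = LLᵀ, then solve two triangular systems:
#   L y = Aᵀb  (forward substitution)
#   Lᵀ x = y   (back substitution)
L = cholesky(ATA, lower=True)
y = solve_triangular(L, ATb, lower=True)
x_hat = solve_triangular(L.T, y, lower=False)

print("Cholesky solution:", x_hat)
```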
```python
import numpy as np
from numpy.linalg import solve, qr, lstsq, inv, cond

def compare_methods(A, b):
    """Compare different approaches to solving normal equations."""
    ATA, ATb = A.T @ A, A.T @ b
    print(f"Condition number of AᵀA: {cond(ATA):.2e}")

    # Method 1: Explicit inverse (not recommended)
    x1 = inv(ATA) @ ATb

    # Method 2: Solve linear system
    x2 = solve(ATA, ATb)

    # Method 3: QR decomposition
    Q, R = qr(A)
    x3 = solve(R, Q.T @ b)

    # Method 4: NumPy lstsq (uses SVD internally)
    x4, _, _, _ = lstsq(A, b, rcond=None)

    print(f"Inverse: {x1}")
    print(f"Solve:   {x2}")
    print(f"QR:      {x3}")
    print(f"lstsq:   {x4}")
    print(f"Max diff: {max(np.linalg.norm(x1-x2), np.linalg.norm(x1-x3), np.linalg.norm(x1-x4)):.2e}")

A = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]])
b = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
compare_methods(A, b)
```

In linear regression with design matrix X and response y, the normal equations become:
XᵀX β = Xᵀy
Each coefficient βⱼ satisfies:
Σᵢ xᵢⱼ(yᵢ - xᵢ·β) = 0
This says: for each feature j, its correlation with the residual is zero. The model has extracted all linear predictive information from the features.
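A quick check of this zero-correlation property, using made-up data with an intercept column and two random features:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=30), rng.normal(size=30)])
y = 1.0 + 2.0 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.3, size=30)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Every column of X has (numerically) zero inner product with the residuals
print("Xᵀ(y - Xβ̂):", X.T @ residuals)
```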
Interpretation of XᵀX: entry (j, k) is the inner product of features j and k, so the diagonal records each feature's scale and the off-diagonal entries record the overlap between features.
Interpretation of Xᵀy: entry j is the inner product of feature j with the response, measuring how strongly that feature co-varies with y.
When X includes a column of ones (intercept), the normal equations imply Σᵢ(yᵢ - ŷᵢ) = 0, meaning residuals sum to zero. The regression line passes through the point (x̄, ȳ).
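A short illustration of these two facts with synthetic straight-line data (slope, intercept, and noise level chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=40)
y = 3.0 + 1.5 * x + rng.normal(scale=0.5, size=40)

X = np.column_stack([np.ones_like(x), x])    # first column of ones = intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

print("Sum of residuals:", residuals.sum())                # ≈ 0
print("Prediction at x̄: ", beta[0] + beta[1] * x.mean())   # ≈ ȳ
print("Mean of y:       ", y.mean())
```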
You now understand the normal equations from multiple perspectives. Next, we'll explore the geometric view of least squares, visualizing how the projection onto the column space provides the optimal approximation.