With the matrix formulation established, we now face the central optimization problem: find the coefficient vector $\hat{\boldsymbol{\beta}}$ that minimizes the sum of squared errors.
Unlike many optimization problems in machine learning that require iterative algorithms, linear regression admits an exact, closed-form solution. This solution, derived from the normal equations, is one of the most elegant results in statistical learning theory.
The derivation we present here uses matrix calculus—specifically, derivatives of scalar functions with respect to vectors. This approach is not merely an exercise in mathematical sophistication; it's the foundation for understanding optimization throughout machine learning.
By the end of this page, you will derive the normal equations from first principles, understand why X⊤X must be invertible for a unique solution, and appreciate the mathematical elegance underlying ordinary least squares estimation.
Recall from the previous page that we seek to minimize the sum of squared errors:
$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min} \; \text{SSE}(\boldsymbol{\beta}) = \underset{\boldsymbol{\beta}}{\arg\min} \; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$$
Expanding the squared norm:
$$\text{SSE}(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
This is a quadratic function of $\boldsymbol{\beta}$. Because its Hessian is positive semi-definite, it is convex: any local minimum is a global minimum, and the minimizer is unique when the Hessian is positive definite (a whole set of minimizers exists when the Hessian is singular). Our task is to find this minimum analytically.
A closed-form solution means we can directly compute the answer in a finite number of operations—no iterations, no convergence criteria, no hyperparameter tuning. For linear regression, this means exact answers subject only to numerical precision limits.
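To make the objective concrete, here is a minimal NumPy sketch of the SSE function we are about to minimize; the function name `sse` and the synthetic data are illustrative choices, not part of any library:

```python
import numpy as np

def sse(beta, X, y):
    """Sum of squared errors: (y - X beta)^T (y - X beta)."""
    residual = y - X @ beta
    return residual @ residual

# Tiny illustration on synthetic data: intercept column plus two features
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=50)

print(sse(np.zeros(3), X, y))                  # a poor guess: large SSE
print(sse(np.array([1.0, 2.0, -0.5]), X, y))   # near the truth: small SSE
```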
Before deriving the normal equations, we need key results from matrix calculus. Let $\mathbf{x}$ be a column vector and $f(\mathbf{x})$ a scalar function. The gradient $\nabla_{\mathbf{x}} f$ is a vector of partial derivatives:
$$\nabla_{\mathbf{x}} f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$
Here are the essential identities we'll use:
| Function | Gradient | Notes |
|---|---|---|
| $f(\mathbf{x}) = \mathbf{a}^\top\mathbf{x}$ | $\nabla_\mathbf{x} f = \mathbf{a}$ | Linear function |
| $f(\mathbf{x}) = \mathbf{x}^\top\mathbf{a}$ | $\nabla_\mathbf{x} f = \mathbf{a}$ | Same as above (scalar) |
| $f(\mathbf{x}) = \mathbf{x}^\top\mathbf{A}\mathbf{x}$ | $\nabla_\mathbf{x} f = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$ | Quadratic form |
| $f(\mathbf{x}) = \mathbf{x}^\top\mathbf{A}\mathbf{x}$ (A symmetric) | $\nabla_\mathbf{x} f = 2\mathbf{A}\mathbf{x}$ | Symmetric case |
The Gram matrix X⊤X is always symmetric. This means we can use the simpler gradient formula for quadratic forms: ∇(x⊤Ax) = 2Ax when A is symmetric.
Proof sketch for the quadratic form gradient:
For $f(\mathbf{x}) = \mathbf{x}^\top\mathbf{A}\mathbf{x}$, we can expand: $$f(\mathbf{x}) = \sum_{i}\sum_{j} a_{ij} x_i x_j$$
Taking the partial derivative with respect to $x_k$: $$\frac{\partial f}{\partial x_k} = \sum_{j} a_{kj} x_j + \sum_{i} a_{ik} x_i = (\mathbf{A}\mathbf{x})_k + (\mathbf{A}^\top\mathbf{x})_k$$
Collecting terms: $\nabla_\mathbf{x} f = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$. When $\mathbf{A} = \mathbf{A}^\top$, this simplifies to $2\mathbf{A}\mathbf{x}$.
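As a quick sanity check (not part of the derivation), the quadratic-form identity can be verified numerically with central finite differences; the matrix `A`, the step size, and the seed below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))   # a general (non-symmetric) matrix
x = rng.normal(size=n)

f = lambda v: v @ A @ v       # f(x) = x^T A x

# Analytic gradient: (A + A^T) x
grad_analytic = (A + A.T) @ x

# Central finite differences along each coordinate direction
eps = 1e-6
grad_numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

print(np.max(np.abs(grad_analytic - grad_numeric)))  # tiny, limited only by rounding
```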
Let's expand the SSE objective to identify its structure:
$$\begin{aligned} \text{SSE}(\boldsymbol{\beta}) &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\ &= \mathbf{y}^\top\mathbf{y} - \mathbf{y}^\top\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} \end{aligned}$$
Now, $\mathbf{y}^\top\mathbf{X}\boldsymbol{\beta}$ is a scalar. The transpose of a scalar equals itself, so: $$\mathbf{y}^\top\mathbf{X}\boldsymbol{\beta} = (\mathbf{y}^\top\mathbf{X}\boldsymbol{\beta})^\top = \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y}$$
Therefore, we can combine the middle terms: $$\text{SSE}(\boldsymbol{\beta}) = \mathbf{y}^\top\mathbf{y} - 2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}$$
Identifying the components, this is a quadratic function of $\boldsymbol{\beta}$ with:

- a constant term $\mathbf{y}^\top\mathbf{y}$ that does not involve $\boldsymbol{\beta}$,
- a linear term $-2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y}$ with coefficient vector $-2\mathbf{X}^\top\mathbf{y}$,
- a quadratic term $\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}$ governed by the Gram matrix $\mathbf{X}^\top\mathbf{X}$.
The Gram matrix $\mathbf{X}^\top\mathbf{X}$ is symmetric and positive semi-definite, ensuring the objective is convex (bowl-shaped upward).
SSE(β) is a convex function of β because the Hessian matrix (2X⊤X) is positive semi-definite. Any local minimum is a global minimum. If X⊤X is positive definite (full column rank), the minimum is unique.
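The expansion above can be checked numerically: for any $\boldsymbol{\beta}$, the expanded quadratic should equal the compact residual form. A short sketch with arbitrary synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
beta = rng.normal(size=4)

# Compact form: (y - X beta)^T (y - X beta)
sse_compact = (y - X @ beta) @ (y - X @ beta)

# Expanded form: y^T y - 2 beta^T X^T y + beta^T X^T X beta
sse_expanded = y @ y - 2 * beta @ (X.T @ y) + beta @ (X.T @ X) @ beta

print(abs(sse_compact - sse_expanded))  # ~0 up to floating-point noise
```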
To find the minimum, we take the gradient with respect to $\boldsymbol{\beta}$ and set it to zero.
$$\nabla_{\boldsymbol{\beta}} \text{SSE} = \nabla_{\boldsymbol{\beta}} \left[ \mathbf{y}^\top\mathbf{y} - 2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} \right]$$
Applying our matrix calculus identities term by term:
$\nabla_{\boldsymbol{\beta}}(\mathbf{y}^\top\mathbf{y}) = \mathbf{0}$ (constant term)
$\nabla_{\boldsymbol{\beta}}(-2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y}) = -2\mathbf{X}^\top\mathbf{y}$ (linear term)
$\nabla_{\boldsymbol{\beta}}(\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}) = 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}$ (quadratic term, using symmetry of $\mathbf{X}^\top\mathbf{X}$)
Combining: $$\nabla_{\boldsymbol{\beta}} \text{SSE} = -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}$$
∇SSE = 2X⊤Xβ - 2X⊤y. This gradient tells us the direction of steepest ascent at any point β. We seek where this gradient is zero—the bottom of the bowl.
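The gradient expression can likewise be checked against finite differences of the SSE itself; this is a sketch with arbitrary synthetic data and an illustrative step size:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)
beta = rng.normal(size=3)

sse = lambda b: (y - X @ b) @ (y - X @ b)

# Analytic gradient: 2 X^T X beta - 2 X^T y
grad_analytic = 2 * X.T @ X @ beta - 2 * X.T @ y

# Central finite differences of the SSE
eps = 1e-6
grad_numeric = np.array([
    (sse(beta + eps * e) - sse(beta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.max(np.abs(grad_analytic - grad_numeric)))  # small, rounding error only
```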
Setting the gradient to zero gives the first-order optimality condition:
$$\nabla_{\boldsymbol{\beta}} \text{SSE} = \mathbf{0}$$
$$-2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{0}$$
Dividing by 2 and rearranging:
$$\boxed{\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y}}$$
These are the normal equations.
The term 'normal' comes from the fact that these equations ensure the residual vector e = y - Xβ̂ is orthogonal (normal) to the column space of X. We'll explore this geometric interpretation in detail in a later page.
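A minimal numerical illustration of this orthogonality property, using synthetic data and `np.linalg.solve` on the normal equations:

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(60), rng.normal(size=(60, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=60)

# Solve the normal equations (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The residual vector is orthogonal to every column of X
residual = y - X @ beta_hat
print(X.T @ residual)   # ~ zero vector
```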
Solving for $\hat{\boldsymbol{\beta}}$:
If $\mathbf{X}^\top\mathbf{X}$ is invertible (i.e., has full rank), we can multiply both sides by $(\mathbf{X}^\top\mathbf{X})^{-1}$:
$$\boldsymbol{\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$
This gives us the OLS estimator:
$$\boxed{\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}}$$
This is one of the most important formulas in statistics and machine learning. It expresses the optimal coefficients directly in terms of the data matrices.
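In practice the same estimate can also be obtained without forming $\mathbf{X}^\top\mathbf{X}$ at all: NumPy's `np.linalg.lstsq` solves the least-squares problem directly via an SVD-based routine. A sketch on synthetic data, comparing the two routes:

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(80), rng.normal(size=(80, 3))])
y = X @ np.array([2.0, 1.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=80)

# Normal-equations route
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Direct least-squares solve (avoids forming X^T X explicitly)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.max(np.abs(beta_normal - beta_lstsq)))  # agree to floating-point precision
```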
Setting the gradient to zero finds stationary points—which could be minima, maxima, or saddle points. We must verify we have a minimum by examining the Hessian matrix (the matrix of second derivatives).
$$\mathbf{H} = \nabla^2_{\boldsymbol{\beta}} \text{SSE} = \nabla_{\boldsymbol{\beta}} \left( 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} - 2\mathbf{X}^\top\mathbf{y} \right) = 2\mathbf{X}^\top\mathbf{X}$$
The Hessian is constant (independent of $\boldsymbol{\beta}$) and equals $2\mathbf{X}^\top\mathbf{X}$.
Second-order sufficiency conditions:
For a stationary point to be a minimum, the Hessian must be positive semi-definite (for weak minimum) or positive definite (for strict, unique minimum).
Recall that $\mathbf{X}^\top\mathbf{X}$ is always positive semi-definite: $$\mathbf{v}^\top(\mathbf{X}^\top\mathbf{X})\mathbf{v} = (\mathbf{X}\mathbf{v})^\top(\mathbf{X}\mathbf{v}) = \|\mathbf{X}\mathbf{v}\|^2 \geq 0$$
Furthermore, $\mathbf{X}^\top\mathbf{X}$ is positive definite if and only if $\mathbf{X}$ has full column rank (i.e., its columns are linearly independent, so no column is a linear combination of the others).
Conclusion: The stationary point $\hat{\boldsymbol{\beta}}$ is indeed a global minimum. If $\mathbf{X}^\top\mathbf{X}$ is invertible, it's the unique global minimum.
The solution β̂ = (X⊤X)⁻¹X⊤y minimizes the sum of squared errors globally. There is no other set of coefficients that achieves lower SSE.
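A quick numerical check of the second-order condition: the eigenvalues of the Hessian $2\mathbf{X}^\top\mathbf{X}$ are nonnegative, and strictly positive when $\mathbf{X}$ has full column rank. The data below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 4))           # almost surely full column rank

H = 2 * X.T @ X                        # Hessian of the SSE
eigvals = np.linalg.eigvalsh(H)        # eigvalsh: eigenvalues of a symmetric matrix
print(eigvals)                         # all strictly positive: unique minimum

# A rank-deficient design: duplicate one column
X_bad = np.column_stack([X, X[:, 0]])
eigvals_bad = np.linalg.eigvalsh(2 * X_bad.T @ X_bad)
print(eigvals_bad.min())               # ~0: positive semi-definite but not definite
```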
The OLS formula requires $(\mathbf{X}^\top\mathbf{X})^{-1}$ to exist, which happens when $\mathbf{X}^\top\mathbf{X}$ has full rank. This fails in several important scenarios:

- More parameters than observations ($p + 1 > n$), so the columns of $\mathbf{X}$ cannot all be linearly independent.
- Perfect multicollinearity: one predictor is an exact linear combination of others (e.g., a duplicated column, or a full set of dummy variables included alongside the intercept).
- A constant (zero-variance) predictor included together with the intercept column.

Consequences when $\mathbf{X}^\top\mathbf{X}$ is singular:

- The normal equations have infinitely many solutions rather than a unique one.
- Individual coefficients are not identifiable, even though the fitted values $\mathbf{X}\hat{\boldsymbol{\beta}}$ may still be well defined.
- Routines that attempt to compute the inverse either fail or return numerically meaningless estimates.

Solutions:

- Remove or combine redundant predictors.
- Add regularization (e.g., ridge regression, which replaces $\mathbf{X}^\top\mathbf{X}$ with $\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I}$).
- Use the Moore-Penrose pseudoinverse, discussed below.
Modern statistical software (scikit-learn, R, statsmodels) usually detects rank deficiency and either warns or uses automatic regularization. However, near-singularity (high condition number) can go undetected and cause subtle problems. Check the condition number of X⊤X when coefficients seem unreasonable.
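A sketch of how such checks might look in NumPy; the near-duplicate column and the noise scale are illustrative constructions, not a standard diagnostic procedure:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-6, size=n)     # nearly identical to x1
X = np.column_stack([np.ones(n), x1, x2])

gram = X.T @ X
print("rank of X:", np.linalg.matrix_rank(X))               # may still report full rank
print("condition number of X^T X:", np.linalg.cond(gram))   # enormous: near-singular

# Exact duplication drops the rank outright
X_exact = np.column_stack([np.ones(n), x1, x1])
print("rank with duplicated column:", np.linalg.matrix_rank(X_exact))  # 2, not 3
```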
When $\mathbf{X}^\top\mathbf{X}$ is not invertible, the normal equations still have solutions—just infinitely many. The Moore-Penrose pseudoinverse provides a principled way to select one.
Definition: The pseudoinverse of $\mathbf{X}$, denoted $\mathbf{X}^+$, is the unique matrix satisfying the four Penrose conditions:

1. $\mathbf{X}\mathbf{X}^+\mathbf{X} = \mathbf{X}$
2. $\mathbf{X}^+\mathbf{X}\mathbf{X}^+ = \mathbf{X}^+$
3. $(\mathbf{X}\mathbf{X}^+)^\top = \mathbf{X}\mathbf{X}^+$
4. $(\mathbf{X}^+\mathbf{X})^\top = \mathbf{X}^+\mathbf{X}$
When $\mathbf{X}$ has full column rank, $\mathbf{X}^+ = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ (our familiar expression).
Key property: The pseudoinverse solution $\hat{\boldsymbol{\beta}} = \mathbf{X}^+\mathbf{y}$ is the minimum-norm solution:
$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}: \, \mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y}}{\arg\min} \|\boldsymbol{\beta}\|^2$$
Among all coefficient vectors that minimize SSE, the pseudoinverse gives the one with smallest Euclidean norm. This is a form of implicit regularization.
The pseudoinverse is typically computed via SVD: if X = UΣV⊤, then X⁺ = VΣ⁺U⊤, where Σ⁺ inverts non-zero singular values and leaves zeros as zeros. Never compute the pseudoinverse by inverting nearly-singular matrices—always use SVD-based methods like numpy.linalg.pinv().
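The sketch below builds the pseudoinverse from the SVD by hand, compares it with `numpy.linalg.pinv`, and illustrates the minimum-norm property on a deliberately rank-deficient design. The data, the tolerance formula, and the null-space perturbation are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, 2 * x1])      # third column = 2 * second: rank 2
y = 1.0 + 3.0 * x1 + rng.normal(scale=0.1, size=n)

# Pseudoinverse via SVD: X = U S V^T  =>  X^+ = V S^+ U^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)
tol = max(X.shape) * np.finfo(float).eps * s.max()
s_inv = np.zeros_like(s)
s_inv[s > tol] = 1.0 / s[s > tol]                  # invert nonzero singular values only
X_pinv_manual = Vt.T @ np.diag(s_inv) @ U.T

print(np.max(np.abs(X_pinv_manual - np.linalg.pinv(X))))   # essentially zero

# Minimum-norm solution among the infinitely many least-squares solutions
beta_min = X_pinv_manual @ y
v_null = Vt[-1]                                    # direction with (near-)zero singular value
print(np.max(np.abs(X @ v_null)))                  # ~0: moving along v_null leaves Xb unchanged

beta_alt = beta_min + 5.0 * v_null                 # another solution with identical fit
print(np.linalg.norm(beta_min), np.linalg.norm(beta_alt))  # beta_min has the smaller norm
```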
For readers less comfortable with matrix calculus, we can derive the normal equations using standard calculus. This approach is more tedious but illuminating.
Setup: We minimize $$\text{SSE} = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$$
For the intercept $\beta_0$: $$\frac{\partial \text{SSE}}{\partial \beta_0} = -2\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) = 0$$
This gives us: $\sum_{i=1}^n y_i = n\beta_0 + \sum_{j=1}^p \beta_j \sum_{i=1}^n x_{ij}$
For coefficient $\beta_k$ (for $k = 1, \ldots, p$): $$\frac{\partial \text{SSE}}{\partial \beta_k} = -2\sum_{i=1}^{n} x_{ik}\left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) = 0$$
This gives us: $\sum_{i=1}^n x_{ik} y_i = \beta_0 \sum_{i=1}^n x_{ik} + \sum_{j=1}^p \beta_j \sum_{i=1}^n x_{ik} x_{ij}$
Recognition: These $p+1$ equations can be written as $(\mathbf{X}^\top\mathbf{X})\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y}$. The matrix formulation compresses all component-wise derivatives into a single elegant expression.
Both derivations arrive at the same normal equations. The matrix approach is compact and generalizes easily; the component approach builds intuition but becomes unwieldy with many predictors. Comfort with both is valuable.
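The equivalence of the two derivations can itself be checked numerically: assembling the component-wise sums reproduces the corresponding entries of $\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}$ and $\mathbf{X}^\top\mathbf{y}$. A sketch with arbitrary synthetic data and an arbitrary $\boldsymbol{\beta}$:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 40, 3
X_raw = rng.normal(size=(n, p))                    # predictors without intercept
y = rng.normal(size=n)
beta0, beta = 0.7, rng.normal(size=p)              # arbitrary coefficients

X = np.column_stack([np.ones(n), X_raw])           # design matrix with intercept
b = np.concatenate([[beta0], beta])

# Matrix form of the normal-equation left- and right-hand sides
lhs_matrix = X.T @ X @ b
rhs_matrix = X.T @ y

# Component-wise sums: intercept equation
lhs_0 = n * beta0 + sum(beta[j] * X_raw[:, j].sum() for j in range(p))
rhs_0 = y.sum()

# Component-wise sums: equation for coefficient beta_k
k = 1
lhs_k = beta0 * X_raw[:, k].sum() + sum(beta[j] * (X_raw[:, k] * X_raw[:, j]).sum() for j in range(p))
rhs_k = (X_raw[:, k] * y).sum()

print(abs(lhs_0 - lhs_matrix[0]), abs(rhs_0 - rhs_matrix[0]))           # ~0
print(abs(lhs_k - lhs_matrix[k + 1]), abs(rhs_k - rhs_matrix[k + 1]))   # ~0
```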
Let's implement the OLS solution and verify it matches established libraries.
```python
import numpy as np
from sklearn.linear_model import LinearRegression


def ols_closed_form(X, y, add_intercept=True):
    """
    Compute OLS coefficients using the normal equations.

    β̂ = (X⊤X)⁻¹X⊤y

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        Feature matrix (without intercept column)
    y : ndarray of shape (n_samples,)
        Response vector
    add_intercept : bool
        Whether to add intercept column

    Returns:
    --------
    beta : ndarray of shape (n_features + 1,) if add_intercept else (n_features,)
        Estimated coefficients
    """
    n_samples = X.shape[0]

    if add_intercept:
        # Add column of ones for intercept
        X_design = np.column_stack([np.ones(n_samples), X])
    else:
        X_design = X.copy()

    # Compute Gram matrix X⊤X
    gram = X_design.T @ X_design

    # Compute X⊤y
    moment = X_design.T @ y

    # Solve normal equations: (X⊤X)β = X⊤y
    # Using solve() is more stable than computing inverse explicitly
    beta = np.linalg.solve(gram, moment)

    return beta


def ols_via_pseudoinverse(X, y, add_intercept=True):
    """
    Compute OLS using Moore-Penrose pseudoinverse.
    Works even when X⊤X is singular.

    β̂ = X⁺y
    """
    n_samples = X.shape[0]

    if add_intercept:
        X_design = np.column_stack([np.ones(n_samples), X])
    else:
        X_design = X.copy()

    # Compute pseudoinverse via SVD
    beta = np.linalg.pinv(X_design) @ y

    return beta


# Generate sample data
np.random.seed(42)
n_samples, n_features = 100, 3
X = np.random.randn(n_samples, n_features)
true_beta = np.array([5.0, 2.0, -1.5, 0.8])  # intercept + 3 coefficients
X_design = np.column_stack([np.ones(n_samples), X])
y = X_design @ true_beta + np.random.randn(n_samples) * 0.5

# Compute using our implementation
beta_ols = ols_closed_form(X, y)
print("Our OLS solution:")
print(f"  β̂ = {beta_ols}")

# Compare with scikit-learn
lr = LinearRegression()
lr.fit(X, y)
beta_sklearn = np.concatenate([[lr.intercept_], lr.coef_])
print("\nScikit-learn solution:")
print(f"  β̂ = {beta_sklearn}")

# Verify they match
print(f"\nDifference (should be ~0): {np.max(np.abs(beta_ols - beta_sklearn)):.2e}")

# Show how close we are to true values
print(f"\nTrue coefficients: {true_beta}")
print(f"Estimation error: {np.linalg.norm(beta_ols - true_beta):.4f}")

# Compute and display properties
X_design = np.column_stack([np.ones(n_samples), X])
y_pred = X_design @ beta_ols
residuals = y - y_pred
sse = residuals @ residuals
print(f"\nSum of Squared Errors: {sse:.4f}")
print(f"Mean Squared Error: {sse / n_samples:.4f}")
print(f"RMSE: {np.sqrt(sse / n_samples):.4f}")
```

Computing (X⊤X)⁻¹ explicitly then multiplying is numerically unstable and slower. np.linalg.solve() uses LU decomposition to solve the linear system directly, which is both faster and more accurate. Production code should never compute the inverse explicitly.
We've derived the closed-form solution for ordinary least squares. Let's consolidate the key results:

- The normal equations $\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y}$, obtained by setting the gradient of the SSE to zero.
- The OLS estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$, valid when $\mathbf{X}$ has full column rank.
- The Hessian $2\mathbf{X}^\top\mathbf{X}$ is positive semi-definite, so the stationary point is a global minimum, and it is unique when $\mathbf{X}^\top\mathbf{X}$ is invertible.
- When $\mathbf{X}^\top\mathbf{X}$ is singular, the pseudoinverse solution $\hat{\boldsymbol{\beta}} = \mathbf{X}^+\mathbf{y}$ selects the minimum-norm minimizer of the SSE.
What's next:
With the OLS estimator derived, we need to understand its statistical properties. The next page examines whether $\hat{\boldsymbol{\beta}}$ is unbiased, what its variance is, and under what conditions it's the best linear unbiased estimator (BLUE)—the famous Gauss-Markov theorem.
You've derived the normal equations from first principles using matrix calculus. This closed-form solution is the theoretical foundation for all linear regression implementations and the starting point for understanding regularization, Bayesian perspectives, and extensions to generalized linear models.