With the matrix formulation established, we now face the central optimization problem: find the coefficient vector $\hat{\boldsymbol{\beta}}$ that minimizes the sum of squared errors.
Unlike many optimization problems in machine learning that require iterative algorithms, linear regression admits an exact, closed-form solution. This solution, derived from the normal equations, is one of the most elegant results in statistical learning theory.
The derivation we present here uses matrix calculus—specifically, derivatives of scalar functions with respect to vectors. This approach is not merely an exercise in mathematical sophistication; it's the foundation for understanding optimization throughout machine learning.
By the end of this page, you will derive the normal equations from first principles, understand why X⊤X must be invertible for a unique solution, and appreciate the mathematical elegance underlying ordinary least squares estimation.
Recall from the previous page that we seek to minimize the sum of squared errors:
$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min} \; \text{SSE}(\boldsymbol{\beta}) = \underset{\boldsymbol{\beta}}{\arg\min} \; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$$
Expanding the squared norm:
$$\text{SSE}(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
This is a quadratic function of $\boldsymbol{\beta}$. Because its Hessian is positive semi-definite, it is convex: any local minimum is a global minimum, and the minimizer is unique when the Hessian is positive definite (a whole set of minimizers exists when the Hessian is singular). Our task is to find this minimum analytically.
A closed-form solution means we can directly compute the answer in a finite number of operations—no iterations, no convergence criteria, no hyperparameter tuning. For linear regression, this means exact answers subject only to numerical precision limits.
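To make the objective concrete, here is a minimal NumPy sketch of the SSE function we are about to minimize; the function name `sse` and the synthetic data are illustrative choices, not part of any library:

```python
import numpy as np

def sse(beta, X, y):
    """Sum of squared errors: (y - X beta)^T (y - X beta)."""
    residual = y - X @ beta
    return residual @ residual

# Tiny illustration on synthetic data: intercept column plus two features
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=50)

print(sse(np.zeros(3), X, y))                  # a poor guess: large SSE
print(sse(np.array([1.0, 2.0, -0.5]), X, y))   # near the truth: small SSE
```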
Before deriving the normal equations, we need key results from matrix calculus. Let $\mathbf{x}$ be a column vector and $f(\mathbf{x})$ a scalar function. The gradient $\nabla_{\mathbf{x}} f$ is a vector of partial derivatives:
$$\nabla_{\mathbf{x}} f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$
Here are the essential identities we'll use:
| Function | Gradient | Notes |
|---|---|---|
| $f(\mathbf{x}) = \mathbf{a}^\top\mathbf{x}$ | $\nabla_\mathbf{x} f = \mathbf{a}$ | Linear function |
| $f(\mathbf{x}) = \mathbf{x}^\top\mathbf{a}$ | $\nabla_\mathbf{x} f = \mathbf{a}$ | Same as above (scalar) |
| $f(\mathbf{x}) = \mathbf{x}^\top\mathbf{A}\mathbf{x}$ | $\nabla_\mathbf{x} f = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$ | Quadratic form |
| $f(\mathbf{x}) = \mathbf{x}^\top\mathbf{A}\mathbf{x}$ (A symmetric) | $\nabla_\mathbf{x} f = 2\mathbf{A}\mathbf{x}$ | Symmetric case |
The Gram matrix X⊤X is always symmetric. This means we can use the simpler gradient formula for quadratic forms: ∇(x⊤Ax) = 2Ax when A is symmetric.
Proof sketch for the quadratic form gradient:
For $f(\mathbf{x}) = \mathbf{x}^\top\mathbf{A}\mathbf{x}$, we can expand: $$f(\mathbf{x}) = \sum_{i}\sum_{j} a_{ij} x_i x_j$$
Taking the partial derivative with respect to $x_k$: $$\frac{\partial f}{\partial x_k} = \sum_{j} a_{kj} x_j + \sum_{i} a_{ik} x_i = (\mathbf{A}\mathbf{x})_k + (\mathbf{A}^\top\mathbf{x})_k$$
Collecting terms: $\nabla_\mathbf{x} f = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$. When $\mathbf{A} = \mathbf{A}^\top$, this simplifies to $2\mathbf{A}\mathbf{x}$.
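As a quick sanity check (not part of the derivation), the quadratic-form identity can be verified numerically with central finite differences; the matrix `A`, the step size, and the seed below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))   # a general (non-symmetric) matrix
x = rng.normal(size=n)

f = lambda v: v @ A @ v       # f(x) = x^T A x

# Analytic gradient: (A + A^T) x
grad_analytic = (A + A.T) @ x

# Central finite differences along each coordinate direction
eps = 1e-6
grad_numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

print(np.max(np.abs(grad_analytic - grad_numeric)))  # tiny, limited only by rounding
```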
Let's expand the SSE objective to identify its structure:
$$\begin{aligned} \text{SSE}(\boldsymbol{\beta}) &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\ &= \mathbf{y}^\top\mathbf{y} - \mathbf{y}^\top\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} \end{aligned}$$
Now, $\mathbf{y}^\top\mathbf{X}\boldsymbol{\beta}$ is a scalar. The transpose of a scalar equals itself, so: $$\mathbf{y}^\top\mathbf{X}\boldsymbol{\beta} = (\mathbf{y}^\top\mathbf{X}\boldsymbol{\beta})^\top = \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y}$$
Therefore, we can combine the middle terms: $$\text{SSE}(\boldsymbol{\beta}) = \mathbf{y}^\top\mathbf{y} - 2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}$$
Identifying the components, this is a quadratic function of $\boldsymbol{\beta}$ with:

- a constant term $\mathbf{y}^\top\mathbf{y}$ that does not involve $\boldsymbol{\beta}$,
- a linear term $-2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y}$ with coefficient vector $-2\mathbf{X}^\top\mathbf{y}$,
- a quadratic term $\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}$ governed by the Gram matrix $\mathbf{X}^\top\mathbf{X}$.
The Gram matrix $\mathbf{X}^\top\mathbf{X}$ is symmetric and positive semi-definite, ensuring the objective is convex (bowl-shaped upward).
SSE(β) is a convex function of β because the Hessian matrix (2X⊤X) is positive semi-definite. Any local minimum is a global minimum. If X⊤X is positive definite (full column rank), the minimum is unique.
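The expansion above can be checked numerically: for any $\boldsymbol{\beta}$, the expanded quadratic should equal the compact residual form. A short sketch with arbitrary synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
beta = rng.normal(size=4)

# Compact form: (y - X beta)^T (y - X beta)
sse_compact = (y - X @ beta) @ (y - X @ beta)

# Expanded form: y^T y - 2 beta^T X^T y + beta^T X^T X beta
sse_expanded = y @ y - 2 * beta @ (X.T @ y) + beta @ (X.T @ X) @ beta

print(abs(sse_compact - sse_expanded))  # ~0 up to floating-point noise
```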
To find the minimum, we take the gradient with respect to $\boldsymbol{\beta}$ and set it to zero.
$$\nabla_{\boldsymbol{\beta}} \text{SSE} = \nabla_{\boldsymbol{\beta}} \left[ \mathbf{y}^\top\mathbf{y} - 2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} \right]$$
Applying our matrix calculus identities term by term:
$\nabla_{\boldsymbol{\beta}}(\mathbf{y}^\top\mathbf{y}) = \mathbf{0}$ (constant term)
$\nabla_{\boldsymbol{\beta}}(-2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y}) = -2\mathbf{X}^\top\mathbf{y}$ (linear term)
$\nabla_{\boldsymbol{\beta}}(\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}) = 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}$ (quadratic term, using symmetry of $\mathbf{X}^\top\mathbf{X}$)
Combining: $$\nabla_{\boldsymbol{\beta}} \text{SSE} = -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}$$
∇SSE = 2X⊤Xβ - 2X⊤y. This gradient tells us the direction of steepest ascent at any point β. We seek where this gradient is zero—the bottom of the bowl.
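The gradient expression can likewise be checked against finite differences of the SSE itself; this is a sketch with arbitrary synthetic data and an illustrative step size:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)
beta = rng.normal(size=3)

sse = lambda b: (y - X @ b) @ (y - X @ b)

# Analytic gradient: 2 X^T X beta - 2 X^T y
grad_analytic = 2 * X.T @ X @ beta - 2 * X.T @ y

# Central finite differences of the SSE
eps = 1e-6
grad_numeric = np.array([
    (sse(beta + eps * e) - sse(beta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.max(np.abs(grad_analytic - grad_numeric)))  # small, rounding error only
```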
Setting the gradient to zero gives the first-order optimality condition:
$$\nabla_{\boldsymbol{\beta}} \text{SSE} = \mathbf{0}$$
$$-2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{0}$$
Dividing by 2 and rearranging:
$$\boxed{\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y}}$$
These are the normal equations.
The term 'normal' comes from the fact that these equations ensure the residual vector e = y - Xβ̂ is orthogonal (normal) to the column space of X. We'll explore this geometric interpretation in detail in a later page.
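A minimal numerical illustration of this orthogonality property, using synthetic data and `np.linalg.solve` on the normal equations:

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(60), rng.normal(size=(60, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=60)

# Solve the normal equations (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The residual vector is orthogonal to every column of X
residual = y - X @ beta_hat
print(X.T @ residual)   # ~ zero vector
```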
Solving for $\hat{\boldsymbol{\beta}}$:
If $\mathbf{X}^\top\mathbf{X}$ is invertible (i.e., has full rank), we can multiply both sides by $(\mathbf{X}^\top\mathbf{X})^{-1}$:
$$\boldsymbol{\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$
This gives us the OLS estimator:
$$\boxed{\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}}$$
This is one of the most important formulas in statistics and machine learning. It expresses the optimal coefficients directly in terms of the data matrices.
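In practice the same estimate can also be obtained without forming $\mathbf{X}^\top\mathbf{X}$ at all: NumPy's `np.linalg.lstsq` solves the least-squares problem directly via an SVD-based routine. A sketch on synthetic data, comparing the two routes:

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(80), rng.normal(size=(80, 3))])
y = X @ np.array([2.0, 1.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=80)

# Normal-equations route
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Direct least-squares solve (avoids forming X^T X explicitly)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.max(np.abs(beta_normal - beta_lstsq)))  # agree to floating-point precision
```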
Setting the gradient to zero finds stationary points—which could be minima, maxima, or saddle points. We must verify we have a minimum by examining the Hessian matrix (the matrix of second derivatives).
$$\mathbf{H} = \nabla^2_{\boldsymbol{\beta}} \text{SSE} = \nabla_{\boldsymbol{\beta}} \left( 2\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} - 2\mathbf{X}^\top\mathbf{y} \right) = 2\mathbf{X}^\top\mathbf{X}$$
The Hessian is constant (independent of $\boldsymbol{\beta}$) and equals $2\mathbf{X}^\top\mathbf{X}$.
Second-order sufficiency conditions:
For a stationary point to be a minimum, the Hessian must be positive semi-definite (for weak minimum) or positive definite (for strict, unique minimum).
Recall that $\mathbf{X}^\top\mathbf{X}$ is always positive semi-definite: $$\mathbf{v}^\top(\mathbf{X}^\top\mathbf{X})\mathbf{v} = (\mathbf{X}\mathbf{v})^\top(\mathbf{X}\mathbf{v}) = \|\mathbf{X}\mathbf{v}\|^2 \geq 0$$
Furthermore, $\mathbf{X}^\top\mathbf{X}$ is positive definite if and only if $\mathbf{X}$ has full column rank (i.e., its columns are linearly independent, so no column is a linear combination of the others).
Conclusion: The stationary point $\hat{\boldsymbol{\beta}}$ is indeed a global minimum. If $\mathbf{X}^\top\mathbf{X}$ is invertible, it's the unique global minimum.
The solution β̂ = (X⊤X)⁻¹X⊤y minimizes the sum of squared errors globally. There is no other set of coefficients that achieves lower SSE.
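A quick numerical check of the second-order condition: the eigenvalues of the Hessian $2\mathbf{X}^\top\mathbf{X}$ are nonnegative, and strictly positive when $\mathbf{X}$ has full column rank. The data below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 4))           # almost surely full column rank

H = 2 * X.T @ X                        # Hessian of the SSE
eigvals = np.linalg.eigvalsh(H)        # eigvalsh: eigenvalues of a symmetric matrix
print(eigvals)                         # all strictly positive: unique minimum

# A rank-deficient design: duplicate one column
X_bad = np.column_stack([X, X[:, 0]])
eigvals_bad = np.linalg.eigvalsh(2 * X_bad.T @ X_bad)
print(eigvals_bad.min())               # ~0: positive semi-definite but not definite
```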
The OLS formula requires $(\mathbf{X}^\top\mathbf{X})^{-1}$ to exist, which happens when $\mathbf{X}^\top\mathbf{X}$ has full rank. This fails in several important scenarios:

- More parameters than observations ($p + 1 > n$), so the columns of $\mathbf{X}$ cannot all be linearly independent.
- Perfect multicollinearity: one predictor is an exact linear combination of others (e.g., a duplicated column, or a full set of dummy variables included alongside the intercept).
- A constant (zero-variance) predictor included together with the intercept column.

Consequences when $\mathbf{X}^\top\mathbf{X}$ is singular:

- The normal equations have infinitely many solutions rather than a unique one.
- Individual coefficients are not identifiable, even though the fitted values $\mathbf{X}\hat{\boldsymbol{\beta}}$ may still be well defined.
- Routines that attempt to compute the inverse either fail or return numerically meaningless estimates.

Solutions:

- Remove or combine redundant predictors.
- Add regularization (e.g., ridge regression, which replaces $\mathbf{X}^\top\mathbf{X}$ with $\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I}$).
- Use the Moore-Penrose pseudoinverse, discussed below.
Modern statistical software (scikit-learn, R, statsmodels) usually detects rank deficiency and either warns or uses automatic regularization. However, near-singularity (high condition number) can go undetected and cause subtle problems. Check the condition number of X⊤X when coefficients seem unreasonable.
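A sketch of how such checks might look in NumPy; the near-duplicate column and the noise scale are illustrative constructions, not a standard diagnostic procedure:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-6, size=n)     # nearly identical to x1
X = np.column_stack([np.ones(n), x1, x2])

gram = X.T @ X
print("rank of X:", np.linalg.matrix_rank(X))               # may still report full rank
print("condition number of X^T X:", np.linalg.cond(gram))   # enormous: near-singular

# Exact duplication drops the rank outright
X_exact = np.column_stack([np.ones(n), x1, x1])
print("rank with duplicated column:", np.linalg.matrix_rank(X_exact))  # 2, not 3
```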
When $\mathbf{X}^\top\mathbf{X}$ is not invertible, the normal equations still have solutions—just infinitely many. The Moore-Penrose pseudoinverse provides a principled way to select one.
Definition: The pseudoinverse of $\mathbf{X}$, denoted $\mathbf{X}^+$, is the unique matrix satisfying the four Penrose conditions:

1. $\mathbf{X}\mathbf{X}^+\mathbf{X} = \mathbf{X}$
2. $\mathbf{X}^+\mathbf{X}\mathbf{X}^+ = \mathbf{X}^+$
3. $(\mathbf{X}\mathbf{X}^+)^\top = \mathbf{X}\mathbf{X}^+$
4. $(\mathbf{X}^+\mathbf{X})^\top = \mathbf{X}^+\mathbf{X}$
When $\mathbf{X}$ has full column rank, $\mathbf{X}^+ = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ (our familiar expression).
Key property: The pseudoinverse solution $\hat{\boldsymbol{\beta}} = \mathbf{X}^+\mathbf{y}$ is the minimum-norm solution:
$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}: \, \mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y}}{\arg\min} \|\boldsymbol{\beta}\|^2$$
Among all coefficient vectors that minimize SSE, the pseudoinverse gives the one with smallest Euclidean norm. This is a form of implicit regularization.
The pseudoinverse is typically computed via SVD: if X = UΣV⊤, then X⁺ = VΣ⁺U⊤, where Σ⁺ inverts non-zero singular values and leaves zeros as zeros. Never compute the pseudoinverse by inverting nearly-singular matrices—always use SVD-based methods like numpy.linalg.pinv().
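The sketch below builds the pseudoinverse from the SVD by hand, compares it with `numpy.linalg.pinv`, and illustrates the minimum-norm property on a deliberately rank-deficient design. The data, the tolerance formula, and the null-space perturbation are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, 2 * x1])      # third column = 2 * second: rank 2
y = 1.0 + 3.0 * x1 + rng.normal(scale=0.1, size=n)

# Pseudoinverse via SVD: X = U S V^T  =>  X^+ = V S^+ U^T
U, s, Vt = np.linalg.svd(X, full_matrices=False)
tol = max(X.shape) * np.finfo(float).eps * s.max()
s_inv = np.zeros_like(s)
s_inv[s > tol] = 1.0 / s[s > tol]                  # invert nonzero singular values only
X_pinv_manual = Vt.T @ np.diag(s_inv) @ U.T

print(np.max(np.abs(X_pinv_manual - np.linalg.pinv(X))))   # essentially zero

# Minimum-norm solution among the infinitely many least-squares solutions
beta_min = X_pinv_manual @ y
v_null = Vt[-1]                                    # direction with (near-)zero singular value
print(np.max(np.abs(X @ v_null)))                  # ~0: moving along v_null leaves Xb unchanged

beta_alt = beta_min + 5.0 * v_null                 # another solution with identical fit
print(np.linalg.norm(beta_min), np.linalg.norm(beta_alt))  # beta_min has the smaller norm
```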
For readers less comfortable with matrix calculus, we can derive the normal equations using standard calculus. This approach is more tedious but illuminating.
Setup: We minimize $$\text{SSE} = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$$
For the intercept $\beta_0$: $$\frac{\partial \text{SSE}}{\partial \beta_0} = -2\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) = 0$$
This gives us: $\sum_{i=1}^n y_i = n\beta_0 + \sum_{j=1}^p \beta_j \sum_{i=1}^n x_{ij}$
For coefficient $\beta_k$ (for $k = 1, \ldots, p$): $$\frac{\partial \text{SSE}}{\partial \beta_k} = -2\sum_{i=1}^{n} x_{ik}\left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) = 0$$
This gives us: $\sum_{i=1}^n x_{ik} y_i = \beta_0 \sum_{i=1}^n x_{ik} + \sum_{j=1}^p \beta_j \sum_{i=1}^n x_{ik} x_{ij}$
Recognition: These $p+1$ equations can be written as $(\mathbf{X}^\top\mathbf{X})\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y}$. The matrix formulation compresses all component-wise derivatives into a single elegant expression.
Both derivations arrive at the same normal equations. The matrix approach is compact and generalizes easily; the component approach builds intuition but becomes unwieldy with many predictors. Comfort with both is valuable.
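The equivalence of the two derivations can itself be checked numerically: assembling the component-wise sums reproduces the corresponding entries of $\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}$ and $\mathbf{X}^\top\mathbf{y}$. A sketch with arbitrary synthetic data and an arbitrary $\boldsymbol{\beta}$:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 40, 3
X_raw = rng.normal(size=(n, p))                    # predictors without intercept
y = rng.normal(size=n)
beta0, beta = 0.7, rng.normal(size=p)              # arbitrary coefficients

X = np.column_stack([np.ones(n), X_raw])           # design matrix with intercept
b = np.concatenate([[beta0], beta])

# Matrix form of the normal-equation left- and right-hand sides
lhs_matrix = X.T @ X @ b
rhs_matrix = X.T @ y

# Component-wise sums: intercept equation
lhs_0 = n * beta0 + sum(beta[j] * X_raw[:, j].sum() for j in range(p))
rhs_0 = y.sum()

# Component-wise sums: equation for coefficient beta_k
k = 1
lhs_k = beta0 * X_raw[:, k].sum() + sum(beta[j] * (X_raw[:, k] * X_raw[:, j]).sum() for j in range(p))
rhs_k = (X_raw[:, k] * y).sum()

print(abs(lhs_0 - lhs_matrix[0]), abs(rhs_0 - rhs_matrix[0]))           # ~0
print(abs(lhs_k - lhs_matrix[k + 1]), abs(rhs_k - rhs_matrix[k + 1]))   # ~0
```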
Let's implement the OLS solution and verify it matches established libraries.
```python
import numpy as np
from sklearn.linear_model import LinearRegression


def ols_closed_form(X, y, add_intercept=True):
    """
    Compute OLS coefficients using the normal equations.

    β̂ = (X⊤X)⁻¹X⊤y

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        Feature matrix (without intercept column)
    y : ndarray of shape (n_samples,)
        Response vector
    add_intercept : bool
        Whether to add intercept column

    Returns:
    --------
    beta : ndarray of shape (n_features + 1,) if add_intercept else (n_features,)
        Estimated coefficients
    """
    n_samples = X.shape[0]

    if add_intercept:
        # Add column of ones for intercept
        X_design = np.column_stack([np.ones(n_samples), X])
    else:
        X_design = X.copy()

    # Compute Gram matrix X⊤X
    gram = X_design.T @ X_design

    # Compute X⊤y
    moment = X_design.T @ y

    # Solve normal equations: (X⊤X)β = X⊤y
    # Using solve() is more stable than computing inverse explicitly
    beta = np.linalg.solve(gram, moment)

    return beta


def ols_via_pseudoinverse(X, y, add_intercept=True):
    """
    Compute OLS using Moore-Penrose pseudoinverse.
    Works even when X⊤X is singular.

    β̂ = X⁺y
    """
    n_samples = X.shape[0]

    if add_intercept:
        X_design = np.column_stack([np.ones(n_samples), X])
    else:
        X_design = X.copy()

    # Compute pseudoinverse via SVD
    beta = np.linalg.pinv(X_design) @ y

    return beta


# Generate sample data
np.random.seed(42)
n_samples, n_features = 100, 3
X = np.random.randn(n_samples, n_features)
true_beta = np.array([5.0, 2.0, -1.5, 0.8])  # intercept + 3 coefficients
X_design = np.column_stack([np.ones(n_samples), X])
y = X_design @ true_beta + np.random.randn(n_samples) * 0.5

# Compute using our implementation
beta_ols = ols_closed_form(X, y)
print("Our OLS solution:")
print(f"  β̂ = {beta_ols}")

# Compare with scikit-learn
lr = LinearRegression()
lr.fit(X, y)
beta_sklearn = np.concatenate([[lr.intercept_], lr.coef_])
print("\nScikit-learn solution:")
print(f"  β̂ = {beta_sklearn}")

# Verify they match
print(f"\nDifference (should be ~0): {np.max(np.abs(beta_ols - beta_sklearn)):.2e}")

# Show how close we are to true values
print(f"\nTrue coefficients: {true_beta}")
print(f"Estimation error: {np.linalg.norm(beta_ols - true_beta):.4f}")

# Compute and display properties
X_design = np.column_stack([np.ones(n_samples), X])
y_pred = X_design @ beta_ols
residuals = y - y_pred
sse = residuals @ residuals
print(f"\nSum of Squared Errors: {sse:.4f}")
print(f"Mean Squared Error: {sse / n_samples:.4f}")
print(f"RMSE: {np.sqrt(sse / n_samples):.4f}")
```

Computing (X⊤X)⁻¹ explicitly then multiplying is numerically unstable and slower. np.linalg.solve() uses LU decomposition to solve the linear system directly, which is both faster and more accurate. Production code should never compute the inverse explicitly.
We've derived the closed-form solution for ordinary least squares. Let's consolidate the key results:

- The normal equations $\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^\top\mathbf{y}$, obtained by setting the gradient of the SSE to zero.
- The OLS estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$, valid when $\mathbf{X}$ has full column rank.
- The Hessian $2\mathbf{X}^\top\mathbf{X}$ is positive semi-definite, so the stationary point is a global minimum, and it is unique when $\mathbf{X}^\top\mathbf{X}$ is invertible.
- When $\mathbf{X}^\top\mathbf{X}$ is singular, the pseudoinverse solution $\hat{\boldsymbol{\beta}} = \mathbf{X}^+\mathbf{y}$ selects the minimum-norm minimizer of the SSE.
What's next:
With the OLS estimator derived, we need to understand its statistical properties. The next page examines whether $\hat{\boldsymbol{\beta}}$ is unbiased, what its variance is, and under what conditions it's the best linear unbiased estimator (BLUE)—the famous Gauss-Markov theorem.
You've derived the normal equations from first principles using matrix calculus. This closed-form solution is the theoretical foundation for all linear regression implementations and the starting point for understanding regularization, Bayesian perspectives, and extensions to generalized linear models.