The normal equations are the system of linear equations whose solution gives the least squares coefficients. For a linear system Ax = b that may have no exact solution (overdetermined), the normal equations are:
AᵀA x = Aᵀb
This compact equation encodes both the optimization condition (setting the gradient to zero) and the geometric condition (error perpendicular to the column space). Understanding the normal equations deeply is essential for knowing when least squares has a unique solution, how regression coefficients should be interpreted, and how they are computed reliably in practice.
By the end of this page, you will derive the normal equations from three perspectives (calculus, geometry, algebra), understand when they have unique solutions, and see how they connect projection theory to practical computation of regression coefficients.
The least squares problem seeks x minimizing the squared error:
f(x) = ||Ax - b||² = (Ax - b)ᵀ(Ax - b)
Expanding:
f(x) = xᵀAᵀAx - 2xᵀAᵀb + bᵀb
To minimize, take the gradient and set to zero:
∇f(x) = 2AᵀAx - 2Aᵀb = 0
This gives the normal equations: AᵀA x = Aᵀb
Since the Hessian ∇²f = 2AᵀA is positive semi-definite (and positive definite when A has full column rank), this critical point is indeed a minimum.
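As a quick numerical sanity check of this claim, here is a minimal sketch (using an arbitrary full-column-rank matrix chosen purely for illustration) confirming that the Hessian 2AᵀA has strictly positive eigenvalues:

```python
import numpy as np

# Illustrative matrix with linearly independent columns (3 equations, 2 unknowns)
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])

# The Hessian of f(x) = ||Ax - b||^2 is 2AᵀA, independent of b
H = 2 * A.T @ A

# eigvalsh returns the real eigenvalues of a symmetric matrix;
# they are all > 0 here because A has full column rank
eigenvalues = np.linalg.eigvalsh(H)
print("Hessian eigenvalues:", eigenvalues)
print("Positive definite:", bool(np.all(eigenvalues > 0)))
```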
```python
import numpy as np
from numpy.linalg import inv, norm

def solve_normal_equations(A: np.ndarray, b: np.ndarray) -> dict:
    """Solve least squares via normal equations: AᵀAx = Aᵀb."""
    ATA = A.T @ A  # Gram matrix (n x n)
    ATb = A.T @ b  # Moment vector (n x 1)

    # Solve normal equations
    x_hat = inv(ATA) @ ATb

    # Compute fitted values and residuals
    b_hat = A @ x_hat
    residual = b - b_hat

    # Verify gradient is zero at solution
    gradient = 2 * (ATA @ x_hat - ATb)

    return {
        'coefficients': x_hat,
        'fitted': b_hat,
        'residual': residual,
        'gradient_norm': norm(gradient),
        'squared_error': norm(residual)**2
    }

# Example: Overdetermined system (3 equations, 2 unknowns)
A = np.array([[1, 1], [1, 2], [1, 3]])
b = np.array([2, 3, 5])

result = solve_normal_equations(A, b)
print(f"Coefficients: {result['coefficients']}")
print(f"Fitted values: {result['fitted']}")
print(f"Residual: {result['residual']}")
print(f"Gradient norm at solution: {result['gradient_norm']:.2e}")
```

From the projection perspective, Ax̂ is the orthogonal projection of b onto Col(A). The error e = b - Ax̂ must be perpendicular to every column of A:
aⱼ · (b - Ax̂) = 0 for all columns aⱼ
In matrix form, this says:
Aᵀ(b - Ax̂) = 0
Rearranging:
AᵀA x̂ = Aᵀb
The normal equations are simply the algebraic encoding of orthogonality. The matrix Aᵀ tests orthogonality against each column of A simultaneously.
These are called 'normal' equations because they ensure the error is 'normal' (perpendicular) to the column space. The term 'normal' here means perpendicular, as in 'normal to a surface'.
Visual interpretation:
Imagine b as a point in ℝᵐ. The column space Col(A) is a k-dimensional plane through the origin (where k = rank(A)). The normal equations find the point Ax̂ in this plane closest to b by requiring the connecting vector (b - Ax̂) to be perpendicular to the plane.
This is exactly what Aᵀ(b - Ax̂) = 0 says: the error has zero inner product with every basis vector of the plane.
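To make the orthogonality condition concrete, here is a minimal sketch (reusing the small 3×2 example from the code above, so the numbers are purely illustrative) that projects b onto Col(A) and verifies the error is perpendicular to every column:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([2.0, 3.0, 5.0])

# Least squares solution and the projection of b onto Col(A)
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
projection = A @ x_hat
error = b - projection

# Aᵀ(b - Ax̂) should be (numerically) zero: the error is
# orthogonal to every column of A
print("Aᵀ(b - Ax̂):", A.T @ error)
```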
When does AᵀA x = Aᵀb have a unique solution?
The matrix AᵀA is n×n (where A is m×n). Key facts: it is always symmetric and positive semi-definite, it has the same rank as A, and it is invertible exactly when A has full column rank.
Full column rank case (most common in ML):
When rank(A) = n (columns are linearly independent), AᵀA is invertible:
x̂ = (AᵀA)⁻¹Aᵀb — unique solution
Rank-deficient case:
When rank(A) < n, AᵀA is singular. There are infinitely many solutions (a subspace of solutions). The pseudoinverse selects the minimum-norm solution.
| Case | Condition | Number of Solutions | Approach |
|---|---|---|---|
| Full column rank | rank(A) = n | Exactly one | (AᵀA)⁻¹Aᵀb |
| Rank deficient | rank(A) < n | Infinitely many | Pseudoinverse A⁺b |
| Underdetermined | m < n | Infinitely many | Minimum norm solution |
```python
import numpy as np
from numpy.linalg import matrix_rank, pinv

def analyze_normal_equations(A: np.ndarray, b: np.ndarray):
    """Analyze solution properties of normal equations."""
    m, n = A.shape
    rank = matrix_rank(A)

    print(f"Matrix A: {m}×{n}, rank = {rank}")
    print(f"Full column rank: {rank == n}")

    ATA = A.T @ A
    print(f"AᵀA is {n}×{n}, rank = {matrix_rank(ATA)}")

    if rank == n:
        x_hat = np.linalg.inv(ATA) @ A.T @ b
        print(f"Unique solution: {x_hat}")
    else:
        x_hat = pinv(A) @ b
        print(f"Minimum norm solution (via pseudoinverse): {x_hat}")
        print("Note: infinitely many solutions exist")

    # Verify it satisfies normal equations
    residual = ATA @ x_hat - A.T @ b
    print(f"Normal equation residual: {np.linalg.norm(residual):.2e}")

# Full rank case
A1 = np.array([[1, 2], [3, 4], [5, 6]])
b1 = np.array([1, 2, 3])
print("=== Full Column Rank ===")
analyze_normal_equations(A1, b1)

# Rank deficient case
print("\n=== Rank Deficient ===")
A2 = np.array([[1, 2, 3], [2, 4, 6], [1, 2, 3]])  # Rank 1
b2 = np.array([1, 2, 1])
analyze_normal_equations(A2, b2)
```

The matrix AᵀA appearing in the normal equations is the Gram matrix of A's columns. Its entries are inner products:
(AᵀA)ᵢⱼ = aᵢ · aⱼ
where aᵢ is the i-th column of A.
Key properties: the Gram matrix is symmetric ((AᵀA)ᵀ = AᵀA), positive semi-definite (xᵀAᵀAx = ||Ax||² ≥ 0), has the same rank as A, and is positive definite exactly when the columns of A are linearly independent.
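These properties are easy to verify numerically; the sketch below (with an arbitrary example matrix) builds the Gram matrix both as AᵀA and entry by entry as column inner products:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

G = A.T @ A  # Gram matrix

# Entry (i, j) is the inner product of columns i and j
n = A.shape[1]
G_entrywise = np.array([[A[:, i] @ A[:, j] for j in range(n)] for i in range(n)])

print("Gram matrix:\n", G)
print("Matches entry-wise inner products:", np.allclose(G, G_entrywise))
print("Symmetric:", np.allclose(G, G.T))
print("Eigenvalues (all ≥ 0):", np.linalg.eigvalsh(G))
```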
In machine learning terms:
For a design matrix X with features as columns, XᵀX captures the pairwise inner products between features: the diagonal entries are the squared norms of the features, and the off-diagonal entries measure how strongly pairs of features overlap (proportional to their covariances when the features are centered).
When features are highly correlated, XᵀX becomes nearly singular (ill-conditioned). Small changes in data cause large changes in coefficients. Regularization (ridge regression) fixes this by adding λI to XᵀX.
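The sketch below illustrates this effect with made-up data (two nearly identical features and an arbitrary λ = 1.0); it is only a rough illustration of the ridge idea, not a full implementation. Adding λI raises the smallest eigenvalue of the Gram matrix, which sharply lowers the condition number and stabilizes the coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 1e-4 * rng.normal(size=50)      # nearly identical (highly correlated) feature
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=50)

XtX, Xty = X.T @ X, X.T @ y
lam = 1.0                                  # illustrative regularization strength

beta_ols = np.linalg.solve(XtX, Xty)                       # plain normal equations
beta_ridge = np.linalg.solve(XtX + lam * np.eye(2), Xty)   # ridge-regularized

print("cond(XᵀX):      ", np.linalg.cond(XtX))
print("cond(XᵀX + λI): ", np.linalg.cond(XtX + lam * np.eye(2)))
print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
```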
Don't explicitly invert AᵀA!
While x̂ = (AᵀA)⁻¹Aᵀb is theoretically correct, computing the inverse explicitly is wasteful and numerically less accurate than solving the linear system directly.
Better approaches:
Solve the linear system directly: Use np.linalg.solve(ATA, ATb) — exploits structure, O(n³)
QR decomposition: A = QR, then solve Rx = Qᵀb — more stable, O(mn²)
SVD: Most stable, handles rank-deficiency — O(mn²)
Cholesky decomposition: AᵀA = LLᵀ, then forward/back substitution — fast for well-conditioned problems
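For completeness, here is a minimal sketch of the Cholesky route (assuming SciPy is available; the small straight-line data set is the same illustrative one used in the comparison below):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]])
b = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

ATA, ATb = A.T @ A, A.T @ b

# Factor AᵀA = LLᵀ, then solve two triangular systems:
#   L y = Aᵀb  (forward substitution)
#   Lᵀ x = y   (back substitution)
L = cholesky(ATA, lower=True)
y = solve_triangular(L, ATb, lower=True)
x_hat = solve_triangular(L.T, y, lower=False)

print("Cholesky solution:", x_hat)
```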
```python
import numpy as np
from numpy.linalg import solve, qr, lstsq, inv, cond

def compare_methods(A, b):
    """Compare different approaches to solving normal equations."""
    ATA, ATb = A.T @ A, A.T @ b
    print(f"Condition number of AᵀA: {cond(ATA):.2e}")

    # Method 1: Explicit inverse (not recommended)
    x1 = inv(ATA) @ ATb

    # Method 2: Solve linear system
    x2 = solve(ATA, ATb)

    # Method 3: QR decomposition
    Q, R = qr(A)
    x3 = solve(R, Q.T @ b)

    # Method 4: NumPy lstsq (uses SVD internally)
    x4, _, _, _ = lstsq(A, b, rcond=None)

    print(f"Inverse: {x1}")
    print(f"Solve:   {x2}")
    print(f"QR:      {x3}")
    print(f"lstsq:   {x4}")
    print(f"Max diff: {max(np.linalg.norm(x1-x2), np.linalg.norm(x1-x3), np.linalg.norm(x1-x4)):.2e}")

A = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]])
b = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
compare_methods(A, b)
```

In linear regression with design matrix X and response y, the normal equations become:
XᵀX β = Xᵀy
Each coefficient βⱼ satisfies:
Σᵢ xᵢⱼ(yᵢ - xᵢ·β) = 0
This says: for each feature j, its correlation with the residual is zero. The model has extracted all linear predictive information from the features.
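A quick check of this zero-correlation property, using made-up data with an intercept column and two random features:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=30), rng.normal(size=30)])
y = 1.0 + 2.0 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.3, size=30)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Every column of X has (numerically) zero inner product with the residuals
print("Xᵀ(y - Xβ̂):", X.T @ residuals)
```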
Interpretation of XᵀX: entry (j, k) is the inner product of features j and k, so the diagonal records each feature's scale and the off-diagonal entries record the overlap between features.
Interpretation of Xᵀy: entry j is the inner product of feature j with the response, measuring how strongly that feature co-varies with y.
When X includes a column of ones (intercept), the normal equations imply Σᵢ(yᵢ - ŷᵢ) = 0, meaning residuals sum to zero. The regression line passes through the point (x̄, ȳ).
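A short illustration of these two facts with synthetic straight-line data (slope, intercept, and noise level chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=40)
y = 3.0 + 1.5 * x + rng.normal(scale=0.5, size=40)

X = np.column_stack([np.ones_like(x), x])    # first column of ones = intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

print("Sum of residuals:", residuals.sum())                # ≈ 0
print("Prediction at x̄: ", beta[0] + beta[1] * x.mean())   # ≈ ȳ
print("Mean of y:       ", y.mean())
```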
You now understand the normal equations from multiple perspectives. Next, we'll explore the geometric view of least squares, visualizing how the projection onto the column space provides the optimal approximation.