In the previous page, we discovered a sobering truth: the seemingly simple normal equations β = (XᵀX)⁻¹Xᵀy can catastrophically fail due to condition number squaring. A moderately ill-conditioned problem (κ(X) = 10⁴) becomes severely unstable (κ(XᵀX) = 10⁸) when we form XᵀX.
The solution isn't to find a better way to invert XᵀX—it's to never form XᵀX in the first place.
This is where QR decomposition enters the picture. By factoring X into an orthogonal matrix Q and an upper-triangular matrix R, we transform least squares into a simple triangular system that preserves the original conditioning of X.
By the end of this page, you will: (1) understand why orthogonal matrices are numerically safe, (2) derive how QR decomposition solves least squares, (3) learn multiple algorithms for computing QR (Gram-Schmidt, Householder), and (4) implement a numerically stable least squares solver from scratch.
Before diving into QR decomposition, we must understand why orthogonal matrices are the cornerstone of numerical linear algebra. An orthogonal matrix Q satisfies:
$$Q^\top Q = QQ^\top = I \quad \Leftrightarrow \quad Q^{-1} = Q^\top$$
Geometric Interpretation:
Orthogonal matrices represent rotations and reflections—rigid transformations that preserve lengths and angles. When we multiply a vector by Q, its length is unchanged (‖Qx‖ = ‖x‖ for every x), and angles between vectors are preserved as well.
Since orthogonal matrices preserve lengths, all their singular values equal 1. Therefore κ(Q) = σ_max/σ_min = 1/1 = 1. Orthogonal matrices are perfectly conditioned! Multiplying by Q never amplifies errors.
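As a quick sanity check, here is a minimal NumPy sketch (our own illustration, not part of the lesson code) that builds a random orthogonal Q and confirms both claims:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random orthogonal matrix as the Q factor of a random square matrix
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
x = rng.standard_normal(50)

print(np.linalg.cond(Q))                          # ~1.0: perfectly conditioned
print(np.linalg.norm(Q @ x), np.linalg.norm(x))   # identical norms
print(np.allclose(Q.T @ Q, np.eye(50)))           # Q^T Q = I
```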
Why This Matters for Stability:
Suppose we want to solve Ax = b but A is ill-conditioned. If we can factor A = QR where Q is orthogonal, then:
$$Ax = b \implies QRx = b \implies Rx = Q^\top b$$
The transformation from b to Qᵀb is perfectly conditioned (κ = 1). All the ill-conditioning is isolated in R, and since R is upper-triangular, solving Rx = c is straightforward via back-substitution.
Critically, the condition number we face is κ(R) = κ(A), not κ(AᵀA) = κ(A)². We've avoided the squaring!
| Property | Normal Equations | QR Decomposition |
|---|---|---|
| Solves | (XᵀX)β = Xᵀy | Rβ = Qᵀy |
| Condition experienced | κ(XᵀX) = κ(X)² | κ(R) = κ(X) |
| Matrix to invert | XᵀX (p × p) | None—back-substitution |
| Numerical stability | Unstable for ill-conditioned X | Backward stable |
| Extra computation | Form XᵀX | Compute Q, R |
| Recommended when | X is very well-conditioned | Always (default choice) |
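The "condition experienced" row of this table is easy to verify directly. Here is a small check of our own (names and dimensions chosen for illustration) comparing κ(X), κ(R), and κ(XᵀX) for a moderately ill-conditioned X:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 8

# Build X with condition number about 1e4 from prescribed singular values
U, _ = np.linalg.qr(rng.standard_normal((n, p)))
V, _ = np.linalg.qr(rng.standard_normal((p, p)))
X = U @ np.diag(np.logspace(0, -4, p)) @ V

Q, R = np.linalg.qr(X, mode='reduced')

print(f"κ(X)   = {np.linalg.cond(X):.2e}")        # ~1e4
print(f"κ(R)   = {np.linalg.cond(R):.2e}")        # same as κ(X)
print(f"κ(XᵀX) = {np.linalg.cond(X.T @ X):.2e}")  # ~1e8, the square
```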
The QR Factorization:
For any matrix X ∈ ℝⁿˣᵖ with n ≥ p (more rows than columns), we can compute:
$$X = QR$$
where Q ∈ ℝⁿˣⁿ is orthogonal (QᵀQ = I) and R ∈ ℝⁿˣᵖ is upper triangular, with all entries below row p equal to zero.
The reduced/thin QR is most useful for least squares:
$$X = Q_1 R_1 \quad \text{where } Q_1 \in \mathbb{R}^{n \times p}, R_1 \in \mathbb{R}^{p \times p}$$
Derivation: QR Solution to Least Squares
We want to minimize ‖Xβ - y‖². Substituting X = Q₁R₁:
$$\|X\beta - y\|^2 = \|Q_1 R_1 \beta - y\|^2$$
Now, for any orthogonal matrix Q:
$$\|Qz\| = \|z\| \quad \text{(orthogonal matrices preserve norms)}$$
Multiplying by Qᵀ (which is also orthogonal) preserves the norm:
$$\|Q_1 R_1 \beta - y\|^2 = \|Q^\top(Q_1 R_1 \beta - y)\|^2$$
where Q is the full orthogonal matrix. Partitioning Q = [Q₁ | Q₂]:
$$Q^\top y = \begin{bmatrix} Q_1^\top y \\ Q_2^\top y \end{bmatrix} = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}$$
The residual norm becomes:
$$\|R_1\beta - c_1\|^2 + \|c_2\|^2$$
The term ‖c₂‖² = ‖Q₂ᵀy‖² is independent of β! It's the irreducible residual—the part of y orthogonal to the column space of X. We minimize by choosing β such that R₁β = c₁, making the first term zero.
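To make this decomposition concrete, here is a short illustrative check (variable names are our own) that the least squares residual equals ‖Q₂ᵀy‖², the part of y outside the column space of X:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
n, p = 30, 5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Full QR: Q is n x n, split into Q1 (first p columns) and Q2 (the rest)
Q, R = np.linalg.qr(X, mode='complete')
Q1, Q2 = Q[:, :p], Q[:, p:]
R1 = R[:p, :]

beta = solve_triangular(R1, Q1.T @ y)           # R1 @ beta = c1
residual_direct = np.linalg.norm(X @ beta - y)  # ||X beta - y||
residual_c2 = np.linalg.norm(Q2.T @ y)          # ||c2||

print(residual_direct, residual_c2)  # agree to machine precision
```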
The QR Least Squares Algorithm:
1. Compute the thin QR factorization X = Q₁R₁.
2. Form c₁ = Q₁ᵀy.
3. Solve the triangular system R₁β = c₁ by back-substitution.
Why Back-Substitution is Stable:
Solving an upper-triangular system:
$$\begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1p} \\ 0 & r_{22} & \cdots & r_{2p} \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & r_{pp} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_p \end{bmatrix}$$
is trivial: start from the bottom row and work up:
$$\beta_p = c_p / r_{pp}, \quad \beta_{p-1} = (c_{p-1} - r_{p-1,p}\beta_p) / r_{p-1,p-1}, \quad \ldots$$
Each step involves only additions, multiplications, and a single division—no matrix inversion, no condition-number-squaring.
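A minimal back-substitution routine, written as our own illustrative sketch (in practice you would call scipy.linalg.solve_triangular, as the solver below does), looks like this:

```python
import numpy as np

def back_substitution(R, c):
    """Solve R @ beta = c for upper-triangular R, working from the bottom row up."""
    p = R.shape[0]
    beta = np.zeros(p)
    for i in range(p - 1, -1, -1):
        # Subtract contributions of already-computed components, then divide by the pivot
        beta[i] = (c[i] - R[i, i + 1:] @ beta[i + 1:]) / R[i, i]
    return beta

# Quick check against NumPy's general solver
R = np.triu(np.random.rand(4, 4) + np.eye(4))
c = np.random.rand(4)
print(np.allclose(back_substitution(R, c), np.linalg.solve(R, c)))
```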
```python
import numpy as np
from scipy.linalg import solve_triangular


def qr_least_squares(X, y):
    """
    Solve least squares using QR decomposition.

    This is the numerically stable approach:
    1. Compute thin QR: X = Q @ R
    2. Compute c = Q.T @ y
    3. Solve triangular system: R @ beta = c

    Parameters:
    -----------
    X : ndarray of shape (n, p)
        Design matrix
    y : ndarray of shape (n,)
        Target vector

    Returns:
    --------
    beta : ndarray of shape (p,)
        Least squares solution
    residual : float
        Squared residual norm ||X @ beta - y||^2
    """
    n, p = X.shape

    # Step 1: Compute thin QR factorization
    # mode='reduced' gives Q1 (n x p) and R (p x p)
    Q, R = np.linalg.qr(X, mode='reduced')

    # Step 2: Compute c = Q^T @ y
    c = Q.T @ y

    # Step 3: Solve R @ beta = c via back-substitution
    # solve_triangular is more efficient and numerically stable
    beta = solve_triangular(R, c)

    # Compute residual: ||y - X @ beta||^2
    # Equivalently: ||y||^2 - ||c||^2 (the part orthogonal to col(X))
    residual = np.linalg.norm(y)**2 - np.linalg.norm(c)**2

    return beta, residual


def compare_methods(X, y, beta_true):
    """
    Compare normal equations vs QR for the same problem.
    """
    # Method 1: Normal Equations (unstable)
    XtX = X.T @ X
    Xty = X.T @ y
    try:
        beta_normal = np.linalg.solve(XtX, Xty)
        error_normal = np.linalg.norm(beta_normal - beta_true)
    except np.linalg.LinAlgError:
        beta_normal = None
        error_normal = np.inf

    # Method 2: QR Decomposition (stable)
    beta_qr, residual = qr_least_squares(X, y)
    error_qr = np.linalg.norm(beta_qr - beta_true)

    print(f"Condition number κ(X): {np.linalg.cond(X):.2e}")
    print(f"Condition number κ(XᵀX): {np.linalg.cond(XtX):.2e}")
    print(f"Normal Equations Error: {error_normal:.2e}")
    print(f"QR Decomposition Error: {error_qr:.2e}")
    print(f"QR advantage factor: {error_normal / error_qr:.1f}x")

    return beta_normal, beta_qr


# Demonstration with ill-conditioned problem
np.random.seed(42)
n, p = 100, 10

# Create ill-conditioned X using specific singular values
U, _ = np.linalg.qr(np.random.randn(n, n))
V, _ = np.linalg.qr(np.random.randn(p, p))
singular_values = np.logspace(0, -8, p)  # from 1 down to 1e-8 → κ ≈ 10^8
X = U[:, :p] @ np.diag(singular_values) @ V

# True coefficients and generated data
beta_true = np.ones(p)
y = X @ beta_true + 1e-10 * np.random.randn(n)  # Tiny noise

print("=== Ill-Conditioned Least Squares Comparison ===")
compare_methods(X, y, beta_true)
```

The simplest way to compute QR is the Gram-Schmidt process, an iterative algorithm that orthogonalizes columns one at a time.
The Classical Gram-Schmidt Algorithm:
Given matrix X with columns x₁, x₂, ..., xₚ, we construct orthonormal vectors q₁, q₂, ..., qₚ:
q₁ = x₁ / ‖x₁‖ (normalize first column)
For j = 2, 3, ..., p:
- Compute the projection coefficients rᵢⱼ = qᵢᵀxⱼ for each i < j.
- Subtract the projections: vⱼ = xⱼ − r₁ⱼq₁ − ⋯ − rⱼ₋₁,ⱼqⱼ₋₁.
- Normalize: rⱼⱼ = ‖vⱼ‖ and qⱼ = vⱼ / rⱼⱼ.

The R matrix is built as we go: its strictly upper-triangular entries are the projection coefficients rᵢⱼ, and its diagonal entries are the normalization factors rⱼⱼ.
Why Classical Gram-Schmidt is Numerically Problematic:
While mathematically elegant, CGS suffers from loss of orthogonality when columns are nearly linearly dependent. The problem is that we compute projections onto previously computed q vectors, which may have accumulated small errors. These errors compound, and after several iterations, the q vectors may no longer be orthogonal at all.
The Fix: Modified Gram-Schmidt (MGS)
Modified Gram-Schmidt makes a subtle but crucial change: instead of computing all projections at once using the original xⱼ, we update xⱼ incrementally after each projection:
q₁ = x₁ / ‖x₁‖
For j = 2, 3, ..., p, start from vⱼ = xⱼ and, for each i = 1, ..., j−1 in turn:
- rᵢⱼ = qᵢᵀvⱼ (project the current, partially orthogonalized vⱼ, not the original xⱼ)
- vⱼ ← vⱼ − rᵢⱼqᵢ (update immediately, before the next projection)

Finally set rⱼⱼ = ‖vⱼ‖ and qⱼ = vⱼ / rⱼⱼ.
In exact arithmetic this produces the same Q and R as CGS, but in floating-point arithmetic it is much more stable.
```python
import numpy as np


def classical_gram_schmidt(X):
    """
    Classical Gram-Schmidt orthogonalization.

    Mathematically correct but numerically unstable
    for ill-conditioned matrices.
    """
    n, p = X.shape
    Q = np.zeros((n, p))
    R = np.zeros((p, p))

    for j in range(p):
        # Start with the j-th column
        v = X[:, j].copy()

        # Subtract projections onto all previous q vectors
        for i in range(j):
            R[i, j] = np.dot(Q[:, i], X[:, j])  # Using ORIGINAL X[:, j]
            v = v - R[i, j] * Q[:, i]

        # Normalize
        R[j, j] = np.linalg.norm(v)
        if R[j, j] > 1e-14:
            Q[:, j] = v / R[j, j]
        else:
            Q[:, j] = v  # Near-zero column (rank deficiency)

    return Q, R


def modified_gram_schmidt(X):
    """
    Modified Gram-Schmidt orthogonalization.

    Numerically more stable than classical GS because
    projections are computed incrementally.
    """
    n, p = X.shape
    Q = np.zeros((n, p))
    R = np.zeros((p, p))
    V = X.copy().astype(float)  # Work with a copy

    for j in range(p):
        # First, orthogonalize column j against all previous columns
        for i in range(j):
            R[i, j] = np.dot(Q[:, i], V[:, j])     # Using CURRENT V[:, j]
            V[:, j] = V[:, j] - R[i, j] * Q[:, i]  # Immediate update!

        # Normalize
        R[j, j] = np.linalg.norm(V[:, j])
        if R[j, j] > 1e-14:
            Q[:, j] = V[:, j] / R[j, j]
        else:
            Q[:, j] = V[:, j]

    return Q, R


def measure_orthogonality(Q):
    """
    Measure how close Q^T Q is to identity.
    Returns ||Q^T Q - I||_F / p
    """
    p = Q.shape[1]
    return np.linalg.norm(Q.T @ Q - np.eye(p), 'fro') / p


def compare_gram_schmidt_stability():
    """
    Demonstrates the numerical advantage of Modified Gram-Schmidt.
    """
    np.random.seed(42)

    # Create increasingly ill-conditioned matrices
    n, p = 50, 20
    condition_numbers = [1e2, 1e4, 1e6, 1e8, 1e10]

    print("Gram-Schmidt Orthogonality Comparison")
    print("=" * 60)
    print(f"{'κ(X)':<12} {'CGS Orth. Error':<18} {'MGS Orth. Error':<18}")
    print("-" * 60)

    for kappa in condition_numbers:
        # Construct X with specified condition number
        U, _ = np.linalg.qr(np.random.randn(n, p))
        V, _ = np.linalg.qr(np.random.randn(p, p))
        s = np.logspace(0, -np.log10(kappa), p)
        X = U @ np.diag(s) @ V

        # Classical Gram-Schmidt
        Q_cgs, R_cgs = classical_gram_schmidt(X)
        orth_error_cgs = measure_orthogonality(Q_cgs)

        # Modified Gram-Schmidt
        Q_mgs, R_mgs = modified_gram_schmidt(X)
        orth_error_mgs = measure_orthogonality(Q_mgs)

        print(f"{kappa:<12.0e} {orth_error_cgs:<18.2e} {orth_error_mgs:<18.2e}")

    print()
    print("Observation: MGS maintains better orthogonality as κ(X) grows.")
    print("However, even MGS degrades—Householder QR is needed for extreme cases.")


compare_gram_schmidt_stability()
```

While MGS is more stable than CGS, both versions can lose orthogonality for severely ill-conditioned matrices. For production code, use Householder QR (what np.linalg.qr computes), which is fully backward stable. MGS is valuable for understanding and for cases where you need incremental updates.
Householder reflections are the industry-standard method for computing QR decomposition. They achieve full backward stability and are what LAPACK (and hence NumPy, SciPy, MATLAB) actually compute.
The Key Idea:
Instead of building Q column by column (Gram-Schmidt), Householder works by applying a sequence of reflections that zero out elements below the diagonal.
A Householder reflection (or elementary reflector) is a matrix of the form:
$$H = I - 2vv^\top \quad \text{where } \|v\| = 1$$
This matrix reflects vectors across the hyperplane orthogonal to v.
Properties of Householder Reflectors:
- Symmetric: Hᵀ = H
- Orthogonal: HᵀH = I, so κ(H) = 1
- Involutory: H² = I (reflecting twice returns the original vector)
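These properties are easy to verify numerically. The short sketch below (our own illustration) builds a reflector from a random unit vector and checks all three:

```python
import numpy as np

rng = np.random.default_rng(2)
v = rng.standard_normal(6)
v = v / np.linalg.norm(v)             # unit vector defining the hyperplane

H = np.eye(6) - 2.0 * np.outer(v, v)  # Householder reflector

print(np.allclose(H, H.T))              # symmetric
print(np.allclose(H.T @ H, np.eye(6)))  # orthogonal
print(np.allclose(H @ H, np.eye(6)))    # involutory: H^2 = I
print(np.linalg.cond(H))                # condition number ≈ 1
```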
The Core Problem We Solve:
Given a vector x, find H such that Hx has all zeros below the first entry:
$$Hx = \alpha e_1 = \begin{bmatrix} \alpha \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$
Since H preserves lengths: |α| = ‖x‖
The Formula:
$$v = x \pm \|x\| e_1, \quad v = v / \|v\|$$
(Choose the sign that avoids catastrophic cancellation in the first component: take v = x + sign(x₁)‖x‖e₁, so the two terms add rather than nearly cancel.)
```python
import numpy as np


def householder_vector(x):
    """
    Compute Householder vector v such that (I - 2vv^T)x
    is zero below the first entry.

    Uses the numerically stable formula with sign choice.
    """
    n = len(x)

    # Compute norm
    sigma = np.dot(x[1:], x[1:])  # Sum of squares below first element

    v = np.zeros(n)
    v[0] = 1.0
    v[1:] = x[1:]

    if sigma < 1e-30:
        # x is already in e_1 direction
        beta = 0
        return v, beta

    mu = np.sqrt(x[0]**2 + sigma)

    # Choose sign to avoid cancellation
    if x[0] <= 0:
        v[0] = x[0] - mu
    else:
        v[0] = -sigma / (x[0] + mu)

    # beta = 2 / (v^T v), but we normalize v
    beta = 2 * v[0]**2 / (sigma + v[0]**2)
    v = v / v[0]  # Normalize so v[0] = 1

    return v, beta


def householder_qr(A):
    """
    Compute QR decomposition using Householder reflections.

    This is the numerically stable, backward-stable algorithm
    used by LAPACK and numpy.linalg.qr.

    Returns Q explicitly (for demonstration), though efficient
    implementations store the Householder vectors instead.
    """
    m, n = A.shape
    Q = np.eye(m)
    R = A.copy().astype(float)

    for j in range(min(m - 1, n)):
        # Apply Householder to zero out below-diagonal in column j
        x = R[j:, j]

        # Compute Householder vector for this column
        v, beta = householder_vector(x)

        # Apply to R: R[j:, j:] = (I - beta*v*v^T) @ R[j:, j:]
        # Efficient computation: R = R - beta * v * (v^T @ R)
        R[j:, j:] = R[j:, j:] - beta * np.outer(v, v @ R[j:, j:])

        # Accumulate into Q: Q = Q @ H_j^T = Q @ H_j
        # Q[:, j:] = Q[:, j:] @ (I - beta*v*v^T)
        Q[:, j:] = Q[:, j:] - beta * np.outer(Q[:, j:] @ v, v)

    return Q[:, :n], R[:n, :]  # Return thin QR


def verify_householder_qr():
    """
    Verify our implementation against numpy and check stability.
    """
    np.random.seed(42)

    # Test on ill-conditioned matrix
    m, n = 50, 20
    U, _ = np.linalg.qr(np.random.randn(m, m))
    V, _ = np.linalg.qr(np.random.randn(n, n))
    s = np.logspace(0, -8, n)  # κ ≈ 10^8
    A = U[:, :n] @ np.diag(s) @ V

    print("=== Householder QR Verification ===")
    print(f"Matrix condition number: {np.linalg.cond(A):.2e}")

    # Our implementation
    Q_ours, R_ours = householder_qr(A)

    # NumPy implementation
    Q_np, R_np = np.linalg.qr(A, mode='reduced')

    # Check properties
    print(f"\n||Q^T Q - I||_F (ours):  {np.linalg.norm(Q_ours.T @ Q_ours - np.eye(n)):.2e}")
    print(f"||Q^T Q - I||_F (numpy): {np.linalg.norm(Q_np.T @ Q_np - np.eye(n)):.2e}")
    print(f"||QR - A||_F (ours):  {np.linalg.norm(Q_ours @ R_ours - A):.2e}")
    print(f"||QR - A||_F (numpy): {np.linalg.norm(Q_np @ R_np - A):.2e}")

    # Test least squares with ill-conditioned matrix
    beta_true = np.ones(n)
    y = A @ beta_true

    # Solve via our QR
    from scipy.linalg import solve_triangular
    c = Q_ours.T @ y
    beta_qr = solve_triangular(R_ours, c)

    print(f"\nLeast squares ||β - β_true||: {np.linalg.norm(beta_qr - beta_true):.2e}")


verify_householder_qr()
```

Real implementations don't form Q explicitly. They store the Householder vectors compactly in the lower triangle of R (since those values would be zero anyway). When you need Qᵀb, apply each Householder reflection in sequence: Hₚ···H₂H₁b. This is what np.linalg.qr(mode='raw') returns.
Standard QR decomposition assumes X has full column rank. But what if some columns are linearly dependent (or nearly so)? This happens frequently with:
- one-hot encoded categorical features combined with an intercept column (the dummy-variable trap),
- duplicated or highly correlated features,
- features that are exact or near-exact linear combinations of others (e.g., a total alongside its components).
Pivoted QR (also called Column-Pivoted QR or QR with Pivoting) addresses this by reordering columns during factorization:
$$XP = QR$$
where P is a permutation matrix that reorders columns in decreasing importance.
How Pivoting Works:
At each step j, instead of reflecting column j as-is, we first swap in the column with the largest remaining norm. This ensures that the diagonal entries of R appear in decreasing magnitude, so any near-zero pivots (the signature of rank deficiency) are pushed to the bottom-right corner of R rather than contaminating earlier steps.
Rank Determination:
With pivoted QR, we can identify the numerical rank by finding where diagonal elements of R drop sharply:
$$r_{11} \geq r_{22} \geq \cdots \geq r_{kk} \gg r_{k+1,k+1} \approx \cdots \approx r_{pp} \approx 0$$
Columns k+1 through p (after permutation) are numerically in the null space.
```python
import numpy as np
from scipy.linalg import qr, solve_triangular


def analyze_rank_deficiency():
    """
    Demonstrates pivoted QR for detecting and handling rank deficiency.
    """
    np.random.seed(42)
    n, p = 100, 10

    # Create rank-deficient matrix: rank = 7 instead of 10
    true_rank = 7
    A = np.random.randn(n, true_rank) @ np.random.randn(true_rank, p)
    A += 1e-12 * np.random.randn(n, p)  # Tiny noise

    print("=== Rank-Deficient Least Squares ===")
    print(f"Matrix shape: {n} × {p}")
    print(f"True rank: {true_rank}")
    print(f"Numerical rank (SVD): {np.linalg.matrix_rank(A)}")

    # Standard QR (no pivoting)
    Q_std, R_std = np.linalg.qr(A, mode='reduced')

    # Pivoted QR
    Q_piv, R_piv, P = qr(A, pivoting=True, mode='economic')

    print("\nDiagonal of R (standard QR):")
    print(f"  {np.diag(R_std)}")
    print("\nDiagonal of R (pivoted QR, sorted by magnitude):")
    print(f"  {np.diag(R_piv)}")
    print(f"\nColumn permutation P: {P}")

    # Detect numerical rank from pivoted R
    r_diag = np.abs(np.diag(R_piv))
    threshold = r_diag[0] * 1e-10
    detected_rank = np.sum(r_diag > threshold)
    print(f"\nDetected numerical rank from pivoted QR: {detected_rank}")

    # Solve least squares with rank awareness
    y = np.random.randn(n)

    # Full solution (may be ill-behaved for rank-deficient A)
    beta_full = np.linalg.lstsq(A, y, rcond=None)[0]
    print(f"\nFull solution (lstsq) norm: {np.linalg.norm(beta_full):.4f}")

    # Truncated solution using only the reliable components
    k = detected_rank
    R_k = R_piv[:k, :k]
    c = Q_piv.T @ y

    # Solve for the first k (permuted) components, zero out the rest
    beta_trunc = np.zeros(p)
    beta_trunc[P[:k]] = solve_triangular(R_k, c[:k])

    print(f"Truncated solution norm: {np.linalg.norm(beta_trunc):.4f}")
    print(f"Truncated residual: {np.linalg.norm(A @ beta_trunc - y):.4f}")
    print(f"Full residual: {np.linalg.norm(A @ beta_full - y):.4f}")
    print("\nNote: the truncated solution has a similar residual but a much smaller norm!")
    print("It is a well-behaved 'basic' solution; the exact minimum-norm solution comes from the SVD.")


analyze_rank_deficiency()
```

Both pivoted QR and SVD can handle rank deficiency. SVD gives the mathematically optimal minimum-norm solution (the pseudoinverse). Pivoted QR is faster (O(np²) vs O(np² + p³)) and often sufficient. When you need the absolute best numerical properties, use SVD; for practical applications, pivoted QR is usually adequate.
We've covered the essential theory and practice of QR decomposition for solving least squares problems in a numerically stable way:
| Algorithm | Stability | Cost | Use Case |
|---|---|---|---|
| Classical Gram-Schmidt | Poor | O(np²) | Educational only |
| Modified Gram-Schmidt | Good | O(np²) | Incremental updates, understanding |
| Householder QR | Excellent (backward stable) | ≈ 2np² - 2p³/3 flops | Default choice in libraries |
| Givens QR | Excellent | O(np²) | Sparse matrices, parallel |
| Pivoted Householder QR | Excellent + rank-revealing | O(np²) | Rank-deficient problems |
What's Next:
Now that we understand how to solve least squares stably, the next page analyzes the computational complexity of different approaches. We'll compare the operation counts for normal equations, QR, and SVD, and understand when each method is appropriate based on problem dimensions.
You now understand QR decomposition—the backbone of numerically stable least squares. This knowledge is essential for implementing reliable regression systems and understanding what library functions like np.linalg.lstsq actually compute under the hood.