We've established that QR decomposition is numerically stable and normal equations are not. But stability isn't the only consideration—computational cost matters too.
In the era of big data, linear regression might be applied to datasets with millions of observations and hundreds or thousands of features. At that scale, the choice of algorithm can mean the difference between a fit that takes seconds and one that takes hours.
Understanding the operation counts (floating-point operations, or FLOPs) and memory requirements of different algorithms is essential for making informed implementation decisions.
By the end of this page, you will: (1) understand the FLOP counts for matrix operations, (2) compare normal equations, QR, and SVD complexity, (3) analyze memory requirements, and (4) make informed decisions about which algorithm to use based on problem dimensions.
Before analyzing least squares algorithms, we need to establish the costs of basic matrix operations. We count FLOPs (floating-point operations), where typically both additions and multiplications count as 1 FLOP each.
Convention: For asymptotic analysis, we focus on the leading-order term and use ≈ for approximate counts. Exact counts vary with implementation details.
| Operation | Dimensions | FLOP Count | Notes |
|---|---|---|---|
| Vector dot product | xᵀy, x,y ∈ ℝⁿ | 2n - 1 ≈ 2n | n multiplies + (n-1) adds |
| Matrix-vector product | Ax, A ∈ ℝᵐˣⁿ | 2mn | m dot products of length n |
| Outer product | xyᵀ, result ∈ ℝᵐˣⁿ | mn | Just multiplications |
| Matrix-matrix product | AB, A ∈ ℝᵐˣᵏ, B ∈ ℝᵏˣⁿ | 2mnk | mn dot products of length k |
| Symmetric matrix multiply | AᵀA, A ∈ ℝⁿˣᵖ | np(p+1) ≈ np² | Exploits symmetry (only upper triangle computed) |
| Triangular solve | Rx = b, R ∈ ℝⁿˣⁿ | n² | Back-substitution |
| LU factorization | A ∈ ℝⁿˣⁿ | (2/3)n³ | Gaussian elimination |
| Cholesky factorization | A ∈ ℝⁿˣⁿ (SPD) | (1/3)n³ | Half of LU (exploits symmetry) |
| QR factorization (Householder) | A ∈ ℝⁿˣᵖ | 2np² - (2/3)p³ | Standard thin QR |
| Full SVD | A ∈ ℝⁿˣᵖ | O(min(np², n²p)) | More expensive than QR |
When n >> p (many more samples than features), terms with n dominate. For n = 10⁶ and p = 100, the term np² = 10¹⁰ vastly exceeds p³ = 10⁶. Focus on the terms involving the larger dimension.
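The dominance claim is easy to verify with a few lines of arithmetic, using the same example numbers as above:

```python
# Back-of-envelope comparison for n >> p: the n*p^2 term from forming
# X^T X dwarfs the p^3 term from factoring the p x p system.
n, p = 10**6, 100

np2 = n * p**2   # leading term involving the large dimension
p3 = p**3        # leading term of the p x p factorization

print(f"n*p^2 = {np2:.0e}, p^3 = {p3:.0e}, ratio = {np2 // p3}x")
```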
Why FLOPs Matter (and Don't):
FLOP counts give asymptotic scaling but don't tell the whole story:
Memory bandwidth often matters more than compute on modern hardware. A cache-efficient O(n²) algorithm can beat a cache-inefficient O(n log n) algorithm.
Parallelism: Matrix operations are highly parallelizable (BLAS Level 3 operations achieve near-peak efficiency on GPUs).
Constants matter: The leading constant (like 1/3 vs 2/3 for Cholesky vs LU) can halve runtime.
Numerical libraries are highly optimized: LAPACK, Intel MKL, OpenBLAS achieve near-theoretical-peak performance through cache blocking, vectorization, and parallelism.
Despite these caveats, FLOP counts remain the primary tool for comparing algorithm scaling.
The normal equations approach solves β = (XᵀX)⁻¹Xᵀy through three steps:
Step-by-Step FLOP Analysis:
Step 1: Compute XᵀX
X is n × p. We compute the p × p symmetric matrix XᵀX.
Step 2: Compute Xᵀy
Matrix-vector product, X is n × p, y is length n.
Step 3: Solve via Cholesky
Cholesky factor XᵀX = LLᵀ, then solve Lz = Xᵀy and Lᵀβ = z.
| Step | Operation | FLOP Count |
|---|---|---|
| 1 | Form XᵀX | np² |
| 2 | Form Xᵀy | 2np |
| 3 | Cholesky + solve | (1/3)p³ + 2p² |
| Total | | np² + (1/3)p³ (+ lower order) |
When Normal Equations Dominate:
The key insight is that for n >> p (the typical regression scenario), the np² term from forming XᵀX dominates everything. The cubic term (1/3)p³ only matters when p is large relative to n.
Memory Requirements:
The key advantage: we reduce the n × p matrix to a p × p system before the expensive factorization. For n = 10⁶ and p = 100, we go from a million-row matrix to a 100 × 100 system—a huge reduction!
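As a minimal sketch (not a production routine), the three steps above map directly onto NumPy/SciPy calls; `cho_factor`/`cho_solve` are SciPy's packaged Cholesky factorize-then-solve pair:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def normal_equations_solve(X, y):
    """Least squares via normal equations + Cholesky.
    Only safe when X is well-conditioned (see the warning below)."""
    XtX = X.T @ X                    # Step 1: ~n p^2 FLOPs, p x p result
    Xty = X.T @ y                    # Step 2: ~2 n p FLOPs
    c, low = cho_factor(XtX)         # Step 3a: Cholesky, ~(1/3) p^3 FLOPs
    return cho_solve((c, low), Xty)  # Step 3b: two triangular solves, ~2 p^2

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
beta_true = np.arange(1.0, 6.0)
y = X @ beta_true                    # noiseless, well-conditioned problem
beta = normal_equations_solve(X, y)
print(np.allclose(beta, beta_true))
```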
Despite being computationally attractive, normal equations square the condition number. For many practical problems, the time saved is worthless if the answer is wrong. Speed without accuracy is not a feature.
The QR approach computes X = QR using Householder reflections, then solves Rβ = Qᵀy:
Step-by-Step FLOP Analysis:
Step 1: Householder QR Factorization
For each of the p columns, we compute a Householder reflector that zeroes the subdiagonal entries of that column, then apply it to the remaining trailing submatrix. At step j (0-indexed), this touches an (n−j) × (p−j) block at a cost of about 4(n−j)(p−j) FLOPs.
Summing over all columns: $$\sum_{j=0}^{p-1} 4(n-j)(p-j) \approx 2np^2 - \frac{2}{3}p^3$$
Cost: ≈ 2np² - (2/3)p³ FLOPs
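The leading-order approximation can be sanity-checked numerically against the exact sum:

```python
# Compare the exact per-column cost sum against the leading-order formula
# 2 n p^2 - (2/3) p^3 for a representative n >> p case.
n, p = 10_000, 100

exact = sum(4 * (n - j) * (p - j) for j in range(p))
approx = 2 * n * p**2 - (2 * p**3) // 3

rel_err = abs(exact - approx) / exact
print(f"exact = {exact:.3e}, approx = {approx:.3e}, rel. error = {rel_err:.1%}")
```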
Step 2: Apply Qᵀ to y
We don't form Q explicitly! Instead, we apply the p stored Householder reflections to y in sequence; reflection j operates on the trailing n−j entries of the vector at a cost of about 4(n−j) FLOPs.
Cost: ≈ 4np - 2p² FLOPs
Step 3: Back-substitution
Solve the p × p upper triangular system Rβ = c.
| Step | Operation | FLOP Count |
|---|---|---|
| 1 | Householder QR | 2np² - (2/3)p³ |
| 2 | Apply Qᵀ to y | 4np - 2p² |
| 3 | Back-substitution | p² |
| Total | | 2np² - (2/3)p³ (+ lower order) |
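A minimal QR-based solve in NumPy/SciPy. One deviation from the counts above: `np.linalg.qr` materializes the thin Q explicitly rather than applying stored reflectors, so the Qᵀy step is a plain matrix-vector product:

```python
import numpy as np
from scipy.linalg import solve_triangular

def qr_solve(X, y):
    """Least squares via thin QR: X = QR, then R beta = Q^T y."""
    Q, R = np.linalg.qr(X, mode='reduced')  # ~2 n p^2 FLOPs
    c = Q.T @ y                             # ~2 n p FLOPs (explicit thin Q)
    return solve_triangular(R, c)           # p^2 back-substitution

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 4))
beta_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ beta_true
print(np.allclose(qr_solve(X, y), beta_true))
```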
Comparison with Normal Equations:
For the same n and p, QR costs ≈ 2np² while normal equations cost ≈ np².
Ratio (for n >> p): $$\frac{\text{QR}}{\text{Normal}} \approx \frac{2np^2}{np^2} = 2$$
QR is about 2× more expensive than normal equations for large n/p ratios.
This factor of 2 is the price of numerical stability. In most applications, this is an excellent trade-off—paying 2× in compute to get reliable answers instead of garbage is clearly worthwhile.
Unlike normal equations, QR can be computed without forming any dense p×p matrices. The Householder vectors are stored in place (overwriting X), meaning the memory footprint is just np for X plus O(p) workspace—much better than np + p² for normal equations when p is large.
The SVD approach computes X = UΣVᵀ and uses the pseudoinverse:
$$\beta = V \Sigma^{-1} U^\top y = \sum_{i=1}^{r} \frac{u_i^\top y}{\sigma_i} v_i$$
where r is the numerical rank (singular values above threshold).
This is the most numerically robust approach, elegantly handling rank deficiency by simply omitting small singular values. But it comes at a computational premium.
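The pseudoinverse formula translates almost line-for-line into NumPy; the threshold `rtol` below is an illustrative choice for deciding numerical rank:

```python
import numpy as np

def svd_solve(X, y, rtol=1e-12):
    """Least squares via thin SVD, dropping singular values below
    rtol * sigma_max (the numerical-rank threshold from the text)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s > rtol * s[0]                 # numerical rank r
    coeffs = (U[:, keep].T @ y) / s[keep]  # u_i^T y / sigma_i
    return Vt[keep].T @ coeffs             # sum over the kept v_i

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 6))
X[:, 5] = X[:, 0] + X[:, 1]   # force rank deficiency (rank 5)
y = X @ np.ones(6)
beta = svd_solve(X, y)
# Agrees with NumPy's own SVD-based solver on this rank-deficient problem
print(np.allclose(beta, np.linalg.lstsq(X, y, rcond=None)[0]))
```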
SVD Complexity:
Computing the thin SVD of an n × p matrix (n ≥ p) typically proceeds in two phases: bidiagonalization of X (cost comparable to QR), followed by iterative diagonalization of the bidiagonal factor and accumulation of the singular vectors. The total depends heavily on the algorithm and implementation. Using the standard "divide and conquer" approach:
Cost: ≈ 4np² + 8p³ FLOPs (rough estimate, implementation-dependent)
This is significantly more expensive than QR:
| Method | Dominant FLOP Count | Relative Cost | Stability |
|---|---|---|---|
| Normal Equations | np² | 1× | Poor (κ² dependence) |
| QR Decomposition | 2np² | 2× | Excellent (backward stable) |
| SVD | 4np² to 6np² | 4-6× | Excellent + rank-revealing |
When is SVD Worth It?
Despite its cost, SVD is preferred when the matrix may be rank-deficient or nearly so, when conditioning is severe, or when you need the singular values anyway for diagnostics, PCA, or regularization analysis.
When to use QR instead: if X is known to have full column rank and you only need the solution, QR delivers the same backward stability at roughly half (or less) of the cost of SVD.
When n and p are both large, you don't need the full SVD. Randomized SVD algorithms compute the top-k singular values/vectors in O(npk) time, much faster than full SVD when k << p. This is how scikit-learn's TruncatedSVD and many recommendation systems work.
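A bare-bones randomized SVD sketch in the spirit of the Halko–Martinsson–Tropp range finder (production implementations such as scikit-learn's add power iterations and tuned oversampling; the names and parameters here are illustrative):

```python
import numpy as np

def randomized_svd(X, k, oversample=10, seed=None):
    """Approximate top-k SVD in O(n p k) time via a random range finder."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Omega = rng.standard_normal((p, k + oversample))  # random test matrix
    Q, _ = np.linalg.qr(X @ Omega)   # orthonormal basis for range(X @ Omega)
    B = Q.T @ X                      # small (k + oversample) x p matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(3)
X = rng.standard_normal((2000, 200))
X[:, :5] *= 50.0                     # plant 5 dominant directions
U5, s, Vt5 = randomized_svd(X, k=5, seed=0)
s_exact = np.linalg.svd(X, compute_uv=False)[:5]
print(np.max(np.abs(s - s_exact) / s_exact))  # small when the spectral gap is large
```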
Asymptotic complexity tells part of the story. Here are the real-world factors that influence algorithm selection:
1. Memory Constraints
For very large problems, you may not be able to hold X in memory:
| Method | Memory Required | Comment |
|---|---|---|
| Normal Equations | O(p²) after streaming | Can compute XᵀX incrementally |
| QR | O(np) | Need full X in memory for standard QR |
| SVD | O(np) + O(p²) | Full matrices required |
Normal equations have an advantage here: XᵀX can be computed incrementally by streaming through X in chunks, requiring only O(p²) persistent memory. This is critical for out-of-core or distributed computing.
```python
import numpy as np

def streaming_least_squares(data_generator, p):
    """
    Solve least squares when data doesn't fit in memory.
    Uses incremental computation of X^T X and X^T y.

    Parameters:
    -----------
    data_generator : generator
        Yields (X_chunk, y_chunk) tuples
    p : int
        Number of features

    Returns:
    --------
    beta : ndarray of shape (p,)
        Least squares solution
    """
    # Initialize accumulators (only p×p and p needed!)
    XtX = np.zeros((p, p))
    Xty = np.zeros(p)
    n_total = 0

    # Stream through data
    for X_chunk, y_chunk in data_generator:
        n_chunk = X_chunk.shape[0]
        # Accumulate outer products
        XtX += X_chunk.T @ X_chunk
        Xty += X_chunk.T @ y_chunk
        n_total += n_chunk

    print(f"Processed {n_total} samples using only O({p}²) = O({p**2}) memory")

    # Solve the normal equations
    # Use Cholesky for numerical stability among normal equation methods
    try:
        L = np.linalg.cholesky(XtX)
        z = np.linalg.solve(L, Xty)
        beta = np.linalg.solve(L.T, z)
    except np.linalg.LinAlgError:
        # Fall back to lstsq if XtX is singular
        beta = np.linalg.lstsq(XtX, Xty, rcond=None)[0]

    return beta

def demonstrate_streaming():
    """
    Demonstrate streaming computation on synthetic large dataset.
    """
    np.random.seed(42)

    # Simulate a dataset too large to fit in memory
    n_total = 1_000_000  # 1 million samples
    p = 50
    chunk_size = 10_000  # Process 10K at a time

    # True coefficients
    beta_true = np.random.randn(p)

    def data_generator():
        """Simulates streaming data from disk/network."""
        for i in range(0, n_total, chunk_size):
            n_chunk = min(chunk_size, n_total - i)
            X_chunk = np.random.randn(n_chunk, p)
            y_chunk = X_chunk @ beta_true + 0.1 * np.random.randn(n_chunk)
            yield X_chunk, y_chunk

    # Solve via streaming
    beta_hat = streaming_least_squares(data_generator(), p)

    error = np.linalg.norm(beta_hat - beta_true) / np.linalg.norm(beta_true)
    print(f"Relative error: {error:.6f}")

demonstrate_streaming()
```

2. Parallelism and Hardware
| Aspect | Normal Equations | QR | SVD |
|---|---|---|---|
| Parallelism | Excellent (GEMM-based) | Good | Moderate |
| GPU acceleration | Excellent | Good | Moderate |
| Distributed computing | Map-reduce friendly | Complex | Very complex |
| Cache efficiency | Excellent (small p×p) | Good | Moderate |
Normal equations reduce to BLAS Level 3 operations (matrix-matrix products), which are extremely cache-efficient and parallelize nearly perfectly. In practice this can stretch their speed advantage over QR beyond the 2× suggested by FLOP counts alone.
3. Multiple Right-Hand Sides
Sometimes you need to solve Xβ = y for many different y vectors (same X). This arises in multi-output regression (several targets sharing one design matrix), cross-validation and bootstrap procedures that resample the response, and simulation studies that generate many synthetic y vectors.
In this case, factor X once and reuse:
| Method | Factorization Cost | Per-RHS Cost | Best For |
|---|---|---|---|
| Normal Equations | np² + (1/3)p³ | p² | Many RHS |
| QR | 2np² | np + p² | Moderate # RHS |
| SVD | 4-6np² + p³ | np | Few RHS |
With k right-hand sides, the one-time factorization cost is amortized over all k solves, so as k grows the per-RHS column above increasingly determines total cost.
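Factoring once and reusing is straightforward; here is a sketch with QR (the same pattern works with a cached Cholesky factor of XᵀX):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(4)
n, p, k = 5000, 20, 100
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, k))   # k right-hand sides, same X

# Pay the ~2 n p^2 factorization cost once...
Q, R = np.linalg.qr(X, mode='reduced')

# ...then each RHS needs only a matrix-vector product and a p x p
# triangular solve; with Y as a matrix, all k systems solve at once.
Betas = solve_triangular(R, Q.T @ Y)

# Same answer as solving the k problems independently
ref = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.allclose(Betas, ref))
```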
4. Regularization
Ridge regression adds λI to the normal equations, often improving conditioning enough to make them safe:
$$(X^\top X + \lambda I)\beta = X^\top y$$
With appropriate λ, the condition number becomes: $$\kappa(X^\top X + \lambda I) \leq \frac{\sigma_1^2 + \lambda}{\lambda} \approx \frac{\sigma_1^2}{\lambda}$$
A well-chosen λ makes normal equations both fast AND stable—the best of both worlds.
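A short empirical check of the bound above, plus a Cholesky-based ridge solve (the synthetic X below is built with known singular values purely for illustration):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(5)
n, p, lam = 500, 20, 1e-2

# Ill-conditioned X: singular values spanning 8 orders of magnitude
U, _ = np.linalg.qr(rng.standard_normal((n, p)))
V, _ = np.linalg.qr(rng.standard_normal((p, p)))
s = np.logspace(0, -8, p)
X = U @ np.diag(s) @ V.T
y = rng.standard_normal(n)

A = X.T @ X + lam * np.eye(p)      # regularized normal equations matrix
kappa = np.linalg.cond(A)
bound = (s[0]**2 + lam) / lam      # (sigma_1^2 + lambda) / lambda
print(f"kappa(A) = {kappa:.2f}, bound = {bound:.2f}")

# A is symmetric positive definite for any lambda > 0, so Cholesky is safe
beta = cho_solve(cho_factor(A), X.T @ y)
```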
Based on the complexity analysis and practical considerations, here's a decision framework for choosing your least squares algorithm:
| Scenario | Recommended Method | Rationale |
|---|---|---|
| Well-conditioned, n >> p, need speed | Normal Equations (Cholesky) | Fastest, safe when κ(X) is small |
| General-purpose, moderate size | QR Decomposition | Safe default, 2× slower is acceptable |
| Ill-conditioned or rank-deficient | SVD or Pivoted QR | Need rank-revealing properties |
| Ridge regression (any λ > 0) | Normal Equations | Regularization stabilizes conditioning |
| Huge n, streaming data | Normal Equations (streaming) | Only O(p²) memory required |
| Many targets (same X) | Pre-factor with QR or Cholesky | Amortize factorization cost |
| Need singular values anyway | SVD | Get regression + PCA info together |
| Unknown conditioning, production | QR (default) | Never wrong, modest overhead |
```python
import numpy as np
import time
from numpy.linalg import cond, qr, lstsq, solve
from scipy.linalg import cholesky, solve_triangular

def select_and_solve(X, y, regularization=0.0, verbose=True):
    """
    Intelligently select and apply the best least squares algorithm.

    Parameters:
    -----------
    X : ndarray of shape (n, p)
        Design matrix
    y : ndarray of shape (n,)
        Target vector
    regularization : float
        Ridge regularization parameter (λ)
    verbose : bool
        Print diagnostic information

    Returns:
    --------
    beta : ndarray of shape (p,)
        Least squares solution
    info : dict
        Diagnostic information
    """
    n, p = X.shape
    info = {'n': n, 'p': p, 'method': None, 'time': None}

    # Conditioning estimate: exact cond() is itself an O(np²) SVD,
    # so only compute it for small problems; otherwise stay cautious
    def estimate_condition():
        if n < 5000 and p < 500:
            return cond(X)
        return None  # Unknown, be cautious

    kappa = estimate_condition()
    info['condition_number'] = kappa

    start = time.time()

    # Decision logic
    if regularization > 0:
        # Ridge regression: normal equations are stabilized
        if verbose:
            print(f"Using regularized normal equations (λ={regularization})")
        XtX = X.T @ X + regularization * np.eye(p)
        Xty = X.T @ y
        L = cholesky(XtX, lower=True)
        z = solve_triangular(L, Xty, lower=True)
        beta = solve_triangular(L.T, z)
        info['method'] = 'ridge_cholesky'
    elif kappa is not None and kappa < 1e4:
        # Well-conditioned: normal equations are safe
        if verbose:
            print(f"κ(X) = {kappa:.2e} (well-conditioned) → Normal Equations")
        XtX = X.T @ X
        Xty = X.T @ y
        L = cholesky(XtX, lower=True)
        z = solve_triangular(L, Xty, lower=True)
        beta = solve_triangular(L.T, z)
        info['method'] = 'normal_cholesky'
    elif kappa is not None and kappa < 1e8:
        # Moderate conditioning: use QR
        if verbose:
            print(f"κ(X) = {kappa:.2e} (moderate) → QR Decomposition")
        Q, R = qr(X, mode='reduced')
        beta = solve_triangular(R, Q.T @ y)
        info['method'] = 'qr'
    else:
        # Ill-conditioned or unknown conditioning: safest option
        if verbose:
            print("Ill-conditioned or unknown κ → SVD (safest default)")
        beta = lstsq(X, y, rcond=None)[0]
        info['method'] = 'svd_lstsq'

    info['time'] = time.time() - start
    if verbose:
        print(f"Solution computed in {info['time']*1000:.2f} ms")
    return beta, info

def benchmark_methods(n, p, condition_number):
    """
    Benchmark all methods for given problem dimensions.
    """
    np.random.seed(42)

    # Create a matrix with the specified condition number; economy-size QR
    # avoids a wasteful full n×n factorization for the orthonormal factor
    U, _ = qr(np.random.randn(n, p), mode='reduced')
    V, _ = qr(np.random.randn(p, p))
    s = np.logspace(0, -np.log10(condition_number), p)
    X = U @ np.diag(s) @ V
    beta_true = np.ones(p)
    y = X @ beta_true

    print(f"=== Benchmark: n={n}, p={p}, κ={condition_number:.0e} ===")

    def qr_method():
        # Factor once per call (the original factored twice)
        Q, R = qr(X, mode='reduced')
        return solve_triangular(R, Q.T @ y)

    methods = {
        'Normal (Cholesky)': lambda: solve(X.T @ X, X.T @ y),
        'QR': qr_method,
        'SVD (lstsq)': lambda: lstsq(X, y, rcond=None)[0],
    }

    for name, method in methods.items():
        try:
            start = time.time()
            beta = method()
            elapsed = time.time() - start
            error = np.linalg.norm(beta - beta_true)
            print(f"  {name:<20}: {elapsed*1000:8.2f} ms, error = {error:.2e}")
        except Exception as e:
            print(f"  {name:<20}: FAILED ({e})")

# Run benchmarks
benchmark_methods(10000, 100, 1e2)   # Easy
benchmark_methods(10000, 100, 1e8)   # Hard
benchmark_methods(10000, 100, 1e14)  # Very hard
```

What's Next:
We've analyzed batch algorithms that see all data at once. But what if data arrives incrementally—one sample at a time—and we need to update our model continuously? The next page covers online updates for linear regression, including recursive least squares and streaming algorithms.
You now understand the computational trade-offs between different least squares algorithms. This knowledge enables you to choose the right tool for your problem size, stability requirements, and computational constraints.