In the pristine world of pure mathematics, the formula β = (XᵀX)⁻¹Xᵀy provides an exact solution to linear regression. Given a design matrix X and target vector y, we simply compute the inverse and multiply—problem solved.
But computers don't live in that pristine world.
Every computation happens with finite precision. Numbers are represented with approximately 16 significant decimal digits in standard double-precision floating-point format. Operations that seem mathematically innocuous—matrix inversion, solving linear systems, even simple subtraction—can catastrophically amplify tiny representation errors into results that are completely wrong.
A linear regression implementation that passes all unit tests can still produce completely wrong results in production. The difference? Real-world data often has subtle numerical properties that expose catastrophic instabilities in naive implementations. This page teaches you to recognize, diagnose, and prevent these failures.
By the end of this page, you will understand: (1) how floating-point arithmetic introduces errors, (2) what condition numbers reveal about numerical sensitivity, (3) why the normal equations approach fails for ill-conditioned problems, and (4) how to diagnose numerical stability issues before they corrupt your results.
To understand numerical instability, we must first understand how computers represent real numbers. The IEEE 754 double-precision floating-point format is the standard representation used in virtually all scientific computing.
The Anatomy of a Float:
A 64-bit double-precision floating-point number consists of three components:

- Sign: 1 bit indicating whether the number is positive or negative
- Exponent: 11 bits storing a biased exponent (bias 1023)
- Mantissa (fraction): 52 bits storing the fractional part of the significand
The value represented is:
$$(-1)^{\text{sign}} \times 1.\text{mantissa} \times 2^{\text{exponent} - 1023}$$
Notice the '1.' before the mantissa. Since normalized binary numbers always start with 1, this leading 1 is implicit and not stored—giving us effectively 53 bits of precision for the significand. This is called the 'hidden bit' optimization.
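To make this layout concrete, here is a minimal sketch using Python's standard struct module to unpack a float64 into its three fields and reconstruct the value from the formula above. The helper name decode_double is illustrative, and the reconstruction assumes a normalized, finite number.

```python
import struct

def decode_double(x: float):
    """Split a float64 into IEEE 754 sign, biased exponent, and mantissa fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # raw 64-bit pattern
    sign = bits >> 63                      # 1 bit
    exponent = (bits >> 52) & 0x7FF        # 11 bits, biased by 1023
    mantissa = bits & ((1 << 52) - 1)      # 52 stored fraction bits
    # Reconstruct using the implicit leading 1 (valid for normalized numbers only)
    value = (-1) ** sign * (1 + mantissa / 2**52) * 2.0 ** (exponent - 1023)
    return sign, exponent, mantissa, value

# 6.5 = 1.101₂ × 2², so the biased exponent is 2 + 1023 = 1025
print(decode_double(6.5))  # (0, 1025, ..., 6.5)
```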
Machine Epsilon: The Fundamental Precision Limit
The most important constant in numerical computing is machine epsilon (ε_mach)—the smallest positive number such that:
$$1 + \varepsilon_{\text{mach}} \neq 1 \text{ in floating-point}$$
For double precision: ε_mach ≈ 2.22 × 10⁻¹⁶ (approximately 2⁻⁵²)
This means that every individual arithmetic operation carries a relative rounding error of at most about 10⁻¹⁶, which corresponds to roughly 16 significant decimal digits. The precision of the common floating-point formats is summarized below:
| Format | Total Bits | Mantissa Bits | Machine Epsilon | Decimal Digits |
|---|---|---|---|---|
| Half (float16) | 16 | 10 | ≈ 9.77 × 10⁻⁴ | ~3-4 |
| Single (float32) | 32 | 23 | ≈ 1.19 × 10⁻⁷ | ~7 |
| Double (float64) | 64 | 52 | ≈ 2.22 × 10⁻¹⁶ | ~16 |
| Extended (float80) | 80 | 64 | ≈ 1.08 × 10⁻¹⁹ | ~19 |
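These limits can be verified directly with NumPy's finfo; a minimal sketch:

```python
import numpy as np

# Machine epsilon for half, single, and double precision
# (extended precision support is platform-dependent, so it is omitted here)
for dtype in (np.float16, np.float32, np.float64):
    eps = np.finfo(dtype).eps
    print(f"{np.dtype(dtype).name}: eps = {eps:.3e} (~{-np.log10(eps):.0f} decimal digits)")

# The defining property of machine epsilon, checked in float64
eps64 = np.finfo(np.float64).eps
print(1.0 + eps64 != 1.0)      # True: eps is just large enough to register
print(1.0 + eps64 / 2 == 1.0)  # True: half of eps is rounded away
```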
Catastrophic Cancellation: When Subtraction Destroys Precision
The most dangerous operation in floating-point arithmetic is subtraction of nearly equal numbers. Consider computing:
$$f(x) = \sqrt{x+1} - \sqrt{x}$$
For x = 10²⁰, this is mathematically about 5 × 10⁻¹¹ (since √(x+1) − √x ≈ 1/(2√x)). But in floating-point, 10²⁰ + 1 rounds back to 10²⁰ (the gap between adjacent doubles near 10²⁰ is roughly 16,000), so both square roots evaluate to exactly 10¹⁰ and the computed difference is 0.0.

This is catastrophic cancellation—we lost all significant digits.
```python
import numpy as np

def demonstrate_catastrophic_cancellation():
    """
    Demonstrates how subtracting nearly equal numbers destroys precision.
    """
    # Case 1: sqrt(x+1) - sqrt(x) for large x
    x = 1e20

    # Naive computation (WRONG for large x)
    naive_result = np.sqrt(x + 1) - np.sqrt(x)

    # Algebraically equivalent but numerically stable form:
    # sqrt(x+1) - sqrt(x) = 1 / (sqrt(x+1) + sqrt(x))
    stable_result = 1.0 / (np.sqrt(x + 1) + np.sqrt(x))

    # True value (computed with arbitrary precision)
    true_value = 5e-11  # approximately

    print(f"x = {x:.0e}")
    print(f"Naive result:  {naive_result:.10e}")   # Shows 0.0!
    print(f"Stable result: {stable_result:.10e}")  # Correct answer
    print(f"Relative error (naive): {abs(naive_result - stable_result) / stable_result:.2%}")

    # Case 2: The classic quadratic formula failure
    # Solving x² - 10⁸x + 1 = 0
    # Roots are approximately 10⁸ and 10⁻⁸
    a, b, c = 1.0, -1e8, 1.0
    discriminant = b**2 - 4*a*c

    # Standard formula (fails for the small root)
    x1_standard = (-b + np.sqrt(discriminant)) / (2*a)
    x2_standard = (-b - np.sqrt(discriminant)) / (2*a)

    # Stable computation using alternative form for the small root
    x1_stable = (-b + np.sqrt(discriminant)) / (2*a)
    x2_stable = c / (a * x1_stable)  # Uses Vieta's formula: x1 * x2 = c/a

    print(f"\nQuadratic formula comparison:")
    print(f"Large root - Standard: {x1_standard:.10e}, Stable: {x1_stable:.10e}")
    print(f"Small root - Standard: {x2_standard:.10e}, Stable: {x2_stable:.10e}")
    print(f"True small root ≈ 1e-8")

demonstrate_catastrophic_cancellation()
```

Never subtract nearly equal floating-point numbers if you can avoid it. Algebraically rearrange formulas to use addition instead, or use alternative formulations that avoid the cancellation entirely.
The condition number is one of the most important concepts in numerical linear algebra. It quantifies how sensitive the output of a computation is to perturbations in the input—essentially measuring the inherent difficulty of a problem, independent of any algorithm used to solve it.
Definition: Matrix Condition Number
For a square, invertible matrix A, the condition number with respect to the 2-norm is:
$$\kappa(A) = \|A\|_2 \, \|A^{-1}\|_2 = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$$
where σ_max and σ_min are the largest and smallest singular values of A.
For symmetric positive definite matrices (like XᵀX in least squares):
$$\kappa(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}$$
where λ_max and λ_min are the largest and smallest eigenvalues.
A condition number tells you how much accuracy you should expect to lose: roughly log₁₀ κ(A) decimal digits. If κ(A) = 10⁸, about 8 digits are lost to the conditioning alone, so with 16-digit double precision you can only trust about 8 digits of your result.
The Error Amplification Bound:
Consider solving the linear system Ax = b. If we perturb b by a small amount δb, the solution changes by δx where:
$$\frac{\|\delta x\|}{\|x\|} \leq \kappa(A) \cdot \frac{\|\delta b\|}{\|b\|}$$
This inequality is tight—there exist perturbations that achieve this worst case.
Interpretation: a relative error in the input can be amplified by up to a factor of κ(A) in the output. Since floating-point rounding already perturbs the data at the level of ε_mach ≈ 10⁻¹⁶, the best relative accuracy you can hope for in the solution is roughly κ(A) × 10⁻¹⁶.
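Here is a small sketch of the bound in action, using an 8×8 Hilbert matrix as a classic ill-conditioned example; the matrix choice and perturbation size are illustrative assumptions.

```python
import numpy as np

# Demonstrate ||δx||/||x|| <= κ(A)·||δb||/||b|| on an 8×8 Hilbert matrix
n = 8
A = 1.0 / (np.arange(1, n + 1)[:, None] + np.arange(n)[None, :])  # A[i, j] = 1/(i+j+1)
kappa = np.linalg.cond(A)

rng = np.random.default_rng(0)
b = A @ rng.standard_normal(n)
x = np.linalg.solve(A, b)

# Perturb the right-hand side by a tiny relative amount and re-solve
delta_b = 1e-12 * np.linalg.norm(b) * rng.standard_normal(n)
x_pert = np.linalg.solve(A, b + delta_b)

rel_db = np.linalg.norm(delta_b) / np.linalg.norm(b)
rel_dx = np.linalg.norm(x_pert - x) / np.linalg.norm(x)

print(f"κ(A)                   = {kappa:.2e}")
print(f"relative change in b   = {rel_db:.2e}")
print(f"relative change in x   = {rel_dx:.2e}")
print(f"observed amplification = {rel_dx / rel_db:.2e}  (worst case: κ(A))")
```

For a random perturbation the observed amplification typically sits below κ(A); perturbations aligned with the smallest singular direction of A attain the bound.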
Classification of Conditioning:
| Condition Number κ(A) | Classification | Expected Precision Loss | Practical Interpretation |
|---|---|---|---|
| 1 - 10 | Well-conditioned | < 1 digit | Numerically safe, standard algorithms work |
| 10 - 10⁴ | Moderate conditioning | 1-4 digits | Generally safe, monitor precision |
| 10⁴ - 10⁸ | Ill-conditioned | 4-8 digits | Use numerically stable algorithms only |
| 10⁸ - 10¹² | Severely ill-conditioned | 8-12 digits | Results questionable, consider reformulation |
| > 10¹² | Near-singular | > 12 digits | Standard double precision unreliable |
Condition Number of XᵀX vs X:
Here's a critical insight for linear regression. The normal equations require forming XᵀX, but:
$$\kappa(X^\top X) = \kappa(X)^2$$
This is devastating! If your design matrix has κ(X) = 10⁴ (moderately ill-conditioned but manageable), then κ(XᵀX) = 10⁸, which is severely ill-conditioned.
This squaring of the condition number is the fundamental reason why solving normal equations directly is often numerically unstable, and why we prefer methods like QR decomposition that work with X directly.
```python
import numpy as np
from numpy.linalg import cond, svd, lstsq

def analyze_condition_numbers():
    """
    Demonstrates condition number computation and its implications.
    """
    np.random.seed(42)

    # Create a design matrix with controlled condition number
    n, p = 100, 5

    # Well-conditioned case: X with orthonormal-ish columns
    X_good = np.random.randn(n, p)
    X_good, _ = np.linalg.qr(X_good)  # Make columns orthonormal
    X_good *= np.sqrt(n)              # Scale back

    # Ill-conditioned case: nearly collinear features
    X_bad = np.random.randn(n, p)
    X_bad[:, -1] = X_bad[:, 0] + 1e-8 * np.random.randn(n)  # Near-duplicate column

    print("=== Condition Number Analysis ===\n")

    for name, X in [("Well-conditioned", X_good), ("Ill-conditioned", X_bad)]:
        # Compute singular values
        U, singular_values, Vt = svd(X, full_matrices=False)

        # Condition numbers
        kappa_X = cond(X)
        kappa_XtX = cond(X.T @ X)

        print(f"{name} Matrix:")
        print(f"  Singular values: {singular_values}")
        print(f"  κ(X)   = {kappa_X:.2e}")
        print(f"  κ(XᵀX) = {kappa_XtX:.2e}")
        print(f"  κ(XᵀX) / κ(X)² = {kappa_XtX / kappa_X**2:.2f} (should be ≈1)")
        print(f"  Expected digits lost solving normal equations: ~{np.log10(kappa_XtX):.1f}")
        print()

        # Demonstrate actual error amplification:
        # create true coefficients and generate data
        beta_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
        y = X @ beta_true

        # Solve via normal equations (potentially unstable)
        try:
            beta_normal = np.linalg.solve(X.T @ X, X.T @ y)
        except np.linalg.LinAlgError:
            beta_normal = np.full(p, np.nan)

        # Solve via SVD (stable)
        beta_svd = lstsq(X, y, rcond=None)[0]

        error_normal = np.linalg.norm(beta_normal - beta_true)
        error_svd = np.linalg.norm(beta_svd - beta_true)

        print(f"  Reconstruction error (Normal Equations): {error_normal:.2e}")
        print(f"  Reconstruction error (SVD): {error_svd:.2e}")
        print()

analyze_condition_numbers()
```

The normal equations provide a direct formula for the least squares solution:
$$X^\top X \beta = X^\top y \implies \beta = (X^\top X)^{-1} X^\top y$$
While mathematically elegant, this approach has multiple numerical pitfalls that can lead to catastrophic failures in practice.
A Concrete Failure Case:
Consider polynomial regression with high-degree polynomials. The design matrix has the form:
$$X = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ 1 & x_2 & x_2^2 & \cdots & x_2^d \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^d \end{bmatrix}$$
This is a Vandermonde matrix, and Vandermonde matrices are notoriously ill-conditioned. For even moderate degree d ≈ 10-15 and typical data ranges, κ(X) can exceed 10¹⁶, making XᵀX essentially rank-deficient in floating-point arithmetic.
```python
import numpy as np
from numpy.linalg import cond, solve, lstsq

def demonstrate_normal_equations_failure():
    """
    Shows how normal equations fail for polynomial regression
    while QR/SVD methods remain stable.
    """
    np.random.seed(42)

    # Generate simple data
    n = 50
    x = np.linspace(0, 1, n)
    y_true = np.sin(2 * np.pi * x)
    y = y_true + 0.1 * np.random.randn(n)

    degrees = [2, 5, 8, 10, 12, 14, 16]

    print("Polynomial Regression: Normal Equations vs SVD")
    print("=" * 65)
    print(f"{'Degree':<8} {'κ(X)':<12} {'κ(XᵀX)':<15} {'MSE Normal':<12} {'MSE SVD':<12}")
    print("-" * 65)

    for d in degrees:
        # Build Vandermonde matrix
        X = np.vander(x, d + 1, increasing=True)

        # Condition numbers
        kappa_X = cond(X)
        XtX = X.T @ X

        # Check if XtX is numerically singular
        try:
            kappa_XtX = cond(XtX)
        except np.linalg.LinAlgError:
            kappa_XtX = np.inf

        # Normal equations (potentially fails)
        try:
            Xty = X.T @ y
            beta_normal = solve(XtX, Xty)
            y_pred_normal = X @ beta_normal
            mse_normal = np.mean((y_pred_normal - y)**2)
            mse_normal_str = f"{mse_normal:.2e}"
        except np.linalg.LinAlgError:
            mse_normal_str = "FAILED"

        # SVD-based solution (robust)
        beta_svd = lstsq(X, y, rcond=None)[0]
        y_pred_svd = X @ beta_svd
        mse_svd = np.mean((y_pred_svd - y)**2)

        print(f"{d:<8} {kappa_X:<12.2e} {kappa_XtX:<15.2e} {mse_normal_str:<12} {mse_svd:<12.2e}")

    print()
    print("Observation: As degree increases, κ(XᵀX) explodes exponentially.")
    print("Normal equations become unreliable around degree 10-12.")
    print("SVD remains stable regardless of conditioning.")

demonstrate_normal_equations_failure()
```

Even if you do use the normal equations, never compute (XᵀX)⁻¹ explicitly. At minimum, use a Cholesky decomposition: factor XᵀX = LLᵀ, then solve Lz = Xᵀy followed by Lᵀβ = z. This still loses roughly twice as many digits as necessary, since its accuracy is governed by κ(XᵀX) = κ(X)² rather than κ(X), but it avoids the numerical disaster of explicit inversion.
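For reference, here is a minimal sketch of that Cholesky route, assuming SciPy is available for the triangular solves; lstsq_via_cholesky is an illustrative name, not a library routine.

```python
import numpy as np
from scipy.linalg import solve_triangular  # assumes SciPy is installed

def lstsq_via_cholesky(X, y):
    """Least squares via Cholesky on the normal equations.

    Avoids forming (XᵀX)⁻¹ explicitly, but accuracy is still governed by κ(X)².
    """
    XtX = X.T @ X
    Xty = X.T @ y
    L = np.linalg.cholesky(XtX)                   # XᵀX = L Lᵀ
    z = solve_triangular(L, Xty, lower=True)      # solve L z = Xᵀy
    beta = solve_triangular(L.T, z, lower=False)  # solve Lᵀ β = z
    return beta

# Sanity check on a well-conditioned toy problem
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
beta_true = np.arange(1.0, 6.0)
y = X @ beta_true
print(np.allclose(lstsq_via_cholesky(X, y), beta_true))  # True
```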
Understanding why design matrices become ill-conditioned helps you prevent the problem before it occurs. Here are the most common causes in real-world machine learning:
1. Multicollinearity: Correlated Features
When features are highly correlated (or nearly linearly dependent), the columns of X point in nearly the same direction. The closer the columns come to exact linear dependence, the smaller the minimum singular value of X and the larger κ(X).
Example: Including both Celsius and Fahrenheit temperature as features. Or GDP and GDP-per-capita in a country-level model.
2. Scale Disparities: Features on Different Scales
If one feature ranges from 0-1 and another from 0-10⁶, the columns have vastly different magnitudes. This creates artificial ill-conditioning that's easily fixed by standardization (a short sketch follows the summary table below).
Example: Using age (0-100) and annual income ($0-$10M) as raw features without scaling.
3. Polynomial and Interaction Features
As we saw, polynomial features create Vandermonde-like structures that are inherently ill-conditioned. Higher-order interaction terms exacerbate this.
Example: Including x, x², x³, ..., x¹⁵ as features.
4. High-Dimensional Settings (p ≈ n)
When the number of features approaches or exceeds the number of samples, XᵀX approaches singularity. The matrix may be technically invertible but with enormous condition number.
Example: Gene expression data with 20,000 genes and 100 patients.
5. Redundant or Constant Features
Features that are constant (zero variance) or exact linear combinations of other features create singular or near-singular XᵀX.
Example: Including a feature that's always 1 alongside one-hot encoded categories that sum to 1.
| Source | Detection Method | Remedy |
|---|---|---|
| Multicollinearity | Correlation matrix, VIF > 10 | Remove correlated features, PCA, Ridge regression |
| Scale disparity | Feature variance comparison | Standardization (z-score) or normalization |
| Polynomial features | Condition number of Vandermonde | Orthogonal polynomials (Legendre, Chebyshev) |
| High dimensionality | p/n ratio > 0.5 | Regularization (Ridge, Lasso), dimensionality reduction |
| Redundant features | Rank deficiency check | Remove constant/duplicate columns, use pivoted QR |
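As a quick illustration of the scale-disparity row above, here is a minimal sketch with made-up age and income features, showing how z-scoring the non-constant columns collapses an artificially large condition number:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
age = rng.uniform(18, 90, n)         # feature on a 0-100 scale
income = rng.uniform(0, 1e7, n)      # feature on a 0-10^7 scale
X_raw = np.column_stack([np.ones(n), age, income])  # intercept + raw features

# Standardize every column except the intercept
X_std = X_raw.copy()
X_std[:, 1:] = (X_raw[:, 1:] - X_raw[:, 1:].mean(axis=0)) / X_raw[:, 1:].std(axis=0)

print(f"κ(raw):          {np.linalg.cond(X_raw):.2e}")  # large, driven purely by scale
print(f"κ(standardized): {np.linalg.cond(X_std):.2e}")  # modest
```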
The diagnostic routine below pulls these checks, rank, singular-value spectrum, column-norm disparity, and near-dependencies, into one function:

```python
import numpy as np
from numpy.linalg import matrix_rank, cond, svd

def diagnose_conditioning(X, feature_names=None):
    """
    Comprehensive diagnostic for design matrix conditioning.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        Design matrix to analyze
    feature_names : list of str, optional
        Names of features for interpretable output

    Returns
    -------
    dict : Diagnostic information
    """
    n, p = X.shape
    if feature_names is None:
        feature_names = [f"x{i}" for i in range(p)]

    # Basic properties
    U, s, Vt = svd(X, full_matrices=False)
    rank = matrix_rank(X)
    kappa = cond(X)

    # Identify problematic directions:
    # small singular values indicate near-linear-dependence
    sv_threshold = s[0] * 1e-10  # Relative threshold
    problematic_idx = np.where(s < sv_threshold)[0]

    # Analyze which features contribute to ill-conditioning:
    # look at right singular vectors corresponding to small singular values
    V = Vt.T

    diagnostics = {
        "shape": (n, p),
        "rank": rank,
        "condition_number": kappa,
        "singular_values": s,
        "log10_condition": np.log10(kappa) if kappa < np.inf else np.inf,
    }

    print("=== Design Matrix Conditioning Diagnostics ===")
    print(f"Shape: {n} samples × {p} features")
    print(f"Numerical rank: {rank} (expecting {min(n, p)})")
    print(f"Condition number: {kappa:.2e}")
    print(f"Log₁₀(κ): {diagnostics['log10_condition']:.1f}")
    print()

    # Singular value analysis
    print("Singular Value Spectrum:")
    print(f"  Largest:  {s[0]:.4e}")
    print(f"  Smallest: {s[-1]:.4e}")
    print(f"  Ratio:    {s[0]/s[-1]:.4e}")
    print()

    # Check for scale disparity
    col_norms = np.linalg.norm(X, axis=0)
    scale_ratio = col_norms.max() / col_norms.min()
    print(f"Column Norm Disparity: {scale_ratio:.2e}")
    if scale_ratio > 100:
        print("  ⚠️ Consider standardizing features!")
    print()

    # Identify potential multicollinearity
    if len(problematic_idx) > 0:
        print("⚠️ Near-linear dependencies detected!")
        for idx in problematic_idx:
            # Which features contribute most to this null-space direction?
            v = V[:, idx]
            contributors = np.argsort(np.abs(v))[-3:][::-1]
            print(f"  σ_{idx} = {s[idx]:.2e}: involves {[feature_names[i] for i in contributors]}")
    else:
        print("✓ No near-exact linear dependencies detected.")

    # Guidance
    print()
    print("Recommendations:")
    if kappa > 1e12:
        print("  🔴 CRITICAL: Condition number too high for reliable double-precision.")
        print("     → Use regularization OR reformulate the problem")
    elif kappa > 1e8:
        print("  🟠 WARNING: Use QR or SVD, avoid normal equations entirely.")
    elif kappa > 1e4:
        print("  🟡 CAUTION: Use stable algorithms (QR preferred over normal equations).")
    else:
        print("  🟢 GOOD: Standard algorithms should work reliably.")

    return diagnostics


# Example usage
np.random.seed(42)
n, p = 100, 5
X = np.random.randn(n, p)
X[:, 4] = X[:, 0] + X[:, 1] + 1e-8 * np.random.randn(n)  # Introduce collinearity
diagnose_conditioning(X, ["A", "B", "C", "D", "A+B (collinear)"])
```

Before diving into specific algorithms (QR decomposition, SVD), let's consolidate the general principles of numerical stability that every practitioner should internalize.
Backward Stability: The Gold Standard
A fundamental concept in numerical analysis is backward stability. An algorithm for computing f(x) is backward stable if, for any input x, it produces a result f̂(x) that is the exact answer to a slightly perturbed input:
$$\hat{f}(x) = f(x + \delta x) \text{ where } \frac{\|\delta x\|}{\|x\|} = O(\varepsilon_{\text{mach}})$$
Why this matters: If your algorithm is backward stable, and your problem is well-conditioned, the result will be accurate. If the problem is ill-conditioned, the result will be inaccurate—but no better algorithm could do better with the same precision.
QR factorization and the SVD are backward stable. Solving normal equations via Cholesky (even with pivoting) is not backward stable for least squares because the conditioning of XᵀX, not X, dominates.
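Here is a small sketch of what backward stability buys in practice; the degree-14 polynomial setup is an illustrative assumption. On an ill-conditioned problem, a backward-stable solver returns coefficients with a sizeable forward error, yet reproduces the data to near machine precision, the practical signature of having solved a nearby problem exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
x_pts = np.linspace(0, 1, 50)
X = np.vander(x_pts, 15, increasing=True)   # ill-conditioned Vandermonde basis
beta_true = rng.standard_normal(15)
y = X @ beta_true                           # exact, noise-free targets

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # backward-stable solver

forward_error = np.linalg.norm(beta_hat - beta_true) / np.linalg.norm(beta_true)
rel_residual = np.linalg.norm(X @ beta_hat - y) / np.linalg.norm(y)

print(f"κ(X)              = {np.linalg.cond(X):.2e}")
print(f"forward error     = {forward_error:.2e}")  # far above ε_mach: limited by conditioning
print(f"relative residual = {rel_residual:.2e}")   # near ε_mach: limited only by rounding
```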
| Method | Stability | Condition Dependence | When to Use |
|---|---|---|---|
| Normal Equations (Cholesky) | Not backward stable | κ(XᵀX) = κ(X)² | Only when X is very well-conditioned and n >> p |
| QR Decomposition | Backward stable | κ(X) | Default choice for moderate-sized problems |
| SVD | Backward stable | κ(X) | Rank-deficient problems, need pseudoinverse |
| Modified Gram-Schmidt | Less stable than Householder QR | κ(X), can degrade | Educational purposes, iterative refinement available |
| Householder QR | Fully backward stable | κ(X) | Standard library implementation (LAPACK) |
In 99% of cases: use np.linalg.lstsq() or scipy.linalg.lstsq() which internally use QR or SVD. Never write beta = np.linalg.inv(X.T @ X) @ X.T @ y. The latter is a numerical disaster waiting to happen.
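A minimal sketch of the recommendation next to the anti-pattern, on a deliberately well-conditioned toy problem (data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.01 * rng.standard_normal(200)

# Recommended: backward-stable least squares (SVD-based under the hood)
beta, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)

# Anti-pattern, shown only for contrast; never do this in production code
beta_bad = np.linalg.inv(X.T @ X) @ X.T @ y

# The two agree here only because X is well-conditioned
print(np.allclose(beta, beta_bad))  # True
```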
We've covered the foundational concepts of numerical stability in linear regression: floating-point representation and machine epsilon, catastrophic cancellation, condition numbers, and why the normal equations square the conditioning of the problem. These ideas are crucial for any production-quality implementation.
What's Next:
Now that we understand why numerical stability matters and why normal equations fail, the next page introduces the QR decomposition—the industry-standard approach for solving least squares problems stably and efficiently. We'll derive the method from first principles and understand why it preserves conditioning where normal equations fail.
You now understand the numerical challenges of linear regression and why naive implementations fail. This knowledge is essential for implementing reliable ML systems—the difference between a model that works in testing and one that works in production at scale.