In the pristine world of pure mathematics, the formula β = (XᵀX)⁻¹Xᵀy provides an exact solution to linear regression. Given a design matrix X and target vector y, we simply compute the inverse and multiply—problem solved.
But computers don't live in that pristine world.
Every computation happens with finite precision. Numbers are represented with approximately 16 significant decimal digits in standard double-precision floating-point format. Operations that seem mathematically innocuous—matrix inversion, solving linear systems, even simple subtraction—can catastrophically amplify tiny representation errors into results that are completely wrong.
A linear regression implementation that passes all unit tests can still produce completely wrong results in production. The difference? Real-world data often has subtle numerical properties that expose catastrophic instabilities in naive implementations. This page teaches you to recognize, diagnose, and prevent these failures.
By the end of this page, you will understand: (1) how floating-point arithmetic introduces errors, (2) what condition numbers reveal about numerical sensitivity, (3) why the normal equations approach fails for ill-conditioned problems, and (4) how to diagnose numerical stability issues before they corrupt your results.
To understand numerical instability, we must first understand how computers represent real numbers. The IEEE 754 double-precision floating-point format is the standard representation used in virtually all scientific computing.
The Anatomy of a Float:
A 64-bit double-precision floating-point number consists of three components:

- Sign: 1 bit indicating whether the number is positive or negative
- Exponent: 11 bits storing a biased exponent (bias 1023)
- Mantissa (fraction): 52 bits storing the fractional part of the significand
The value represented is:
$$(-1)^{\text{sign}} \times 1.\text{mantissa} \times 2^{\text{exponent} - 1023}$$
Notice the '1.' before the mantissa. Since normalized binary numbers always start with 1, this leading 1 is implicit and not stored—giving us effectively 53 bits of precision for the significand. This is called the 'hidden bit' optimization.
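To make this layout concrete, here is a minimal sketch using Python's standard struct module to unpack a float64 into its three fields and reconstruct the value from the formula above. The helper name decode_double is illustrative, and the reconstruction assumes a normalized, finite number.

```python
import struct

def decode_double(x: float):
    """Split a float64 into IEEE 754 sign, biased exponent, and mantissa fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # raw 64-bit pattern
    sign = bits >> 63                      # 1 bit
    exponent = (bits >> 52) & 0x7FF        # 11 bits, biased by 1023
    mantissa = bits & ((1 << 52) - 1)      # 52 stored fraction bits
    # Reconstruct using the implicit leading 1 (valid for normalized numbers only)
    value = (-1) ** sign * (1 + mantissa / 2**52) * 2.0 ** (exponent - 1023)
    return sign, exponent, mantissa, value

# 6.5 = 1.101₂ × 2², so the biased exponent is 2 + 1023 = 1025
print(decode_double(6.5))  # (0, 1025, ..., 6.5)
```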
Machine Epsilon: The Fundamental Precision Limit
The most important constant in numerical computing is machine epsilon (ε_mach)—the smallest positive number such that:
$$1 + \varepsilon_{\text{mach}} \neq 1 \text{ in floating-point}$$
For double precision: ε_mach ≈ 2.22 × 10⁻¹⁶ (approximately 2⁻⁵²)
This means that every individual arithmetic operation carries a relative rounding error of at most about 10⁻¹⁶, which corresponds to roughly 16 significant decimal digits. The precision of the common floating-point formats is summarized below:
| Format | Total Bits | Mantissa Bits | Machine Epsilon | Decimal Digits |
|---|---|---|---|---|
| Half (float16) | 16 | 10 | ≈ 9.77 × 10⁻⁴ | ~3-4 |
| Single (float32) | 32 | 23 | ≈ 1.19 × 10⁻⁷ | ~7 |
| Double (float64) | 64 | 52 | ≈ 2.22 × 10⁻¹⁶ | ~16 |
| Extended (float80) | 80 | 64 | ≈ 1.08 × 10⁻¹⁹ | ~19 |
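These limits can be verified directly with NumPy's finfo; a minimal sketch:

```python
import numpy as np

# Machine epsilon for half, single, and double precision
# (extended precision support is platform-dependent, so it is omitted here)
for dtype in (np.float16, np.float32, np.float64):
    eps = np.finfo(dtype).eps
    print(f"{np.dtype(dtype).name}: eps = {eps:.3e} (~{-np.log10(eps):.0f} decimal digits)")

# The defining property of machine epsilon, checked in float64
eps64 = np.finfo(np.float64).eps
print(1.0 + eps64 != 1.0)      # True: eps is just large enough to register
print(1.0 + eps64 / 2 == 1.0)  # True: half of eps is rounded away
```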
Catastrophic Cancellation: When Subtraction Destroys Precision
The most dangerous operation in floating-point arithmetic is subtraction of nearly equal numbers. Consider computing:
$$f(x) = \sqrt{x+1} - \sqrt{x}$$
For x = 10²⁰, this is mathematically about 5 × 10⁻¹¹ (since √(x+1) − √x ≈ 1/(2√x)). But in floating-point, 10²⁰ + 1 rounds back to 10²⁰ (the gap between adjacent doubles near 10²⁰ is roughly 16,000), so both square roots evaluate to exactly 10¹⁰ and the computed difference is 0.0.

This is catastrophic cancellation—we lost all significant digits.
```python
import numpy as np

def demonstrate_catastrophic_cancellation():
    """
    Demonstrates how subtracting nearly equal numbers destroys precision.
    """
    # Case 1: sqrt(x+1) - sqrt(x) for large x
    x = 1e20

    # Naive computation (WRONG for large x)
    naive_result = np.sqrt(x + 1) - np.sqrt(x)

    # Algebraically equivalent but numerically stable form:
    # sqrt(x+1) - sqrt(x) = 1 / (sqrt(x+1) + sqrt(x))
    stable_result = 1.0 / (np.sqrt(x + 1) + np.sqrt(x))

    # True value (computed with arbitrary precision)
    true_value = 5e-11  # approximately

    print(f"x = {x:.0e}")
    print(f"Naive result:  {naive_result:.10e}")   # Shows 0.0!
    print(f"Stable result: {stable_result:.10e}")  # Correct answer
    print(f"Relative error (naive): {abs(naive_result - stable_result) / stable_result:.2%}")

    # Case 2: The classic quadratic formula failure
    # Solving x² - 10⁸x + 1 = 0
    # Roots are approximately 10⁸ and 10⁻⁸
    a, b, c = 1.0, -1e8, 1.0
    discriminant = b**2 - 4*a*c

    # Standard formula (fails for the small root)
    x1_standard = (-b + np.sqrt(discriminant)) / (2*a)
    x2_standard = (-b - np.sqrt(discriminant)) / (2*a)

    # Stable computation using alternative form for the small root
    x1_stable = (-b + np.sqrt(discriminant)) / (2*a)
    x2_stable = c / (a * x1_stable)  # Uses Vieta's formula: x1 * x2 = c/a

    print(f"\nQuadratic formula comparison:")
    print(f"Large root - Standard: {x1_standard:.10e}, Stable: {x1_stable:.10e}")
    print(f"Small root - Standard: {x2_standard:.10e}, Stable: {x2_stable:.10e}")
    print(f"True small root ≈ 1e-8")

demonstrate_catastrophic_cancellation()
```

Never subtract nearly equal floating-point numbers if you can avoid it. Algebraically rearrange formulas to use addition instead, or use alternative formulations that avoid the cancellation entirely.
The condition number is one of the most important concepts in numerical linear algebra. It quantifies how sensitive the output of a computation is to perturbations in the input—essentially measuring the inherent difficulty of a problem, independent of any algorithm used to solve it.
Definition: Matrix Condition Number
For a square, invertible matrix A, the condition number with respect to the 2-norm is:
$$\kappa(A) = \|A\|_2 \, \|A^{-1}\|_2 = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$$
where σ_max and σ_min are the largest and smallest singular values of A.
For symmetric positive definite matrices (like XᵀX in least squares):
$$\kappa(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}$$
where λ_max and λ_min are the largest and smallest eigenvalues.
A condition number tells you how much accuracy you should expect to lose: roughly log₁₀ κ(A) decimal digits. If κ(A) = 10⁸, about 8 digits are lost to the conditioning alone, so with 16-digit double precision you can only trust about 8 digits of your result.
The Error Amplification Bound:
Consider solving the linear system Ax = b. If we perturb b by a small amount δb, the solution changes by δx where:
$$\frac{\|\delta x\|}{\|x\|} \leq \kappa(A) \cdot \frac{\|\delta b\|}{\|b\|}$$
This inequality is tight—there exist perturbations that achieve this worst case.
Interpretation: a relative error in the input can be amplified by up to a factor of κ(A) in the output. Since floating-point rounding already perturbs the data at the level of ε_mach ≈ 10⁻¹⁶, the best relative accuracy you can hope for in the solution is roughly κ(A) × 10⁻¹⁶.
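Here is a small sketch of the bound in action, using an 8×8 Hilbert matrix as a classic ill-conditioned example; the matrix choice and perturbation size are illustrative assumptions.

```python
import numpy as np

# Demonstrate ||δx||/||x|| <= κ(A)·||δb||/||b|| on an 8×8 Hilbert matrix
n = 8
A = 1.0 / (np.arange(1, n + 1)[:, None] + np.arange(n)[None, :])  # A[i, j] = 1/(i+j+1)
kappa = np.linalg.cond(A)

rng = np.random.default_rng(0)
b = A @ rng.standard_normal(n)
x = np.linalg.solve(A, b)

# Perturb the right-hand side by a tiny relative amount and re-solve
delta_b = 1e-12 * np.linalg.norm(b) * rng.standard_normal(n)
x_pert = np.linalg.solve(A, b + delta_b)

rel_db = np.linalg.norm(delta_b) / np.linalg.norm(b)
rel_dx = np.linalg.norm(x_pert - x) / np.linalg.norm(x)

print(f"κ(A)                   = {kappa:.2e}")
print(f"relative change in b   = {rel_db:.2e}")
print(f"relative change in x   = {rel_dx:.2e}")
print(f"observed amplification = {rel_dx / rel_db:.2e}  (worst case: κ(A))")
```

For a random perturbation the observed amplification typically sits below κ(A); perturbations aligned with the smallest singular direction of A attain the bound.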
Classification of Conditioning:
| Condition Number κ(A) | Classification | Expected Precision Loss | Practical Interpretation |
|---|---|---|---|
| 1 - 10 | Well-conditioned | < 1 digit | Numerically safe, standard algorithms work |
| 10 - 10⁴ | Moderate conditioning | 1-4 digits | Generally safe, monitor precision |
| 10⁴ - 10⁸ | Ill-conditioned | 4-8 digits | Use numerically stable algorithms only |
| 10⁸ - 10¹² | Severely ill-conditioned | 8-12 digits | Results questionable, consider reformulation |
| > 10¹² | Near-singular | > 12 digits | Standard double precision unreliable |
Condition Number of XᵀX vs X:
Here's a critical insight for linear regression. The normal equations require forming XᵀX, but:
$$\kappa(X^\top X) = \kappa(X)^2$$
This is devastating! If your design matrix has κ(X) = 10⁴ (moderately ill-conditioned but manageable), then κ(XᵀX) = 10⁸, which is severely ill-conditioned.
This squaring of the condition number is the fundamental reason why solving normal equations directly is often numerically unstable, and why we prefer methods like QR decomposition that work with X directly.
```python
import numpy as np
from numpy.linalg import cond, svd, lstsq

def analyze_condition_numbers():
    """
    Demonstrates condition number computation and its implications.
    """
    np.random.seed(42)

    # Create a design matrix with controlled condition number
    n, p = 100, 5

    # Well-conditioned case: X with orthonormal-ish columns
    X_good = np.random.randn(n, p)
    X_good, _ = np.linalg.qr(X_good)  # Make columns orthonormal
    X_good *= np.sqrt(n)              # Scale back

    # Ill-conditioned case: nearly collinear features
    X_bad = np.random.randn(n, p)
    X_bad[:, -1] = X_bad[:, 0] + 1e-8 * np.random.randn(n)  # Near-duplicate column

    print("=== Condition Number Analysis ===\n")

    for name, X in [("Well-conditioned", X_good), ("Ill-conditioned", X_bad)]:
        # Compute singular values
        U, singular_values, Vt = svd(X, full_matrices=False)

        # Condition numbers
        kappa_X = cond(X)
        kappa_XtX = cond(X.T @ X)

        print(f"{name} Matrix:")
        print(f"  Singular values: {singular_values}")
        print(f"  κ(X)   = {kappa_X:.2e}")
        print(f"  κ(XᵀX) = {kappa_XtX:.2e}")
        print(f"  κ(XᵀX) / κ(X)² = {kappa_XtX / kappa_X**2:.2f} (should be ≈1)")
        print(f"  Expected digits lost solving normal equations: ~{np.log10(kappa_XtX):.1f}")
        print()

        # Demonstrate actual error amplification:
        # create true coefficients and generate data
        beta_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
        y = X @ beta_true

        # Solve via normal equations (potentially unstable)
        try:
            beta_normal = np.linalg.solve(X.T @ X, X.T @ y)
        except np.linalg.LinAlgError:
            beta_normal = np.full(p, np.nan)

        # Solve via SVD (stable)
        beta_svd = lstsq(X, y, rcond=None)[0]

        error_normal = np.linalg.norm(beta_normal - beta_true)
        error_svd = np.linalg.norm(beta_svd - beta_true)

        print(f"  Reconstruction error (Normal Equations): {error_normal:.2e}")
        print(f"  Reconstruction error (SVD): {error_svd:.2e}")
        print()

analyze_condition_numbers()
```

The normal equations provide a direct formula for the least squares solution:
$$X^\top X \beta = X^\top y \implies \beta = (X^\top X)^{-1} X^\top y$$
While mathematically elegant, this approach has multiple numerical pitfalls that can lead to catastrophic failures in practice.
A Concrete Failure Case:
Consider polynomial regression with high-degree polynomials. The design matrix has the form:
$$X = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ 1 & x_2 & x_2^2 & \cdots & x_2^d \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^d \end{bmatrix}$$
This is a Vandermonde matrix, and Vandermonde matrices are notoriously ill-conditioned. For even moderate degree d ≈ 10-15 and typical data ranges, κ(X) can exceed 10¹⁶, making XᵀX essentially rank-deficient in floating-point arithmetic.
```python
import numpy as np
from numpy.linalg import cond, solve, lstsq

def demonstrate_normal_equations_failure():
    """
    Shows how normal equations fail for polynomial regression
    while QR/SVD methods remain stable.
    """
    np.random.seed(42)

    # Generate simple data
    n = 50
    x = np.linspace(0, 1, n)
    y_true = np.sin(2 * np.pi * x)
    y = y_true + 0.1 * np.random.randn(n)

    degrees = [2, 5, 8, 10, 12, 14, 16]

    print("Polynomial Regression: Normal Equations vs SVD")
    print("=" * 65)
    print(f"{'Degree':<8} {'κ(X)':<12} {'κ(XᵀX)':<15} {'MSE Normal':<12} {'MSE SVD':<12}")
    print("-" * 65)

    for d in degrees:
        # Build Vandermonde matrix
        X = np.vander(x, d + 1, increasing=True)

        # Condition numbers
        kappa_X = cond(X)
        XtX = X.T @ X

        # Check if XtX is numerically singular
        try:
            kappa_XtX = cond(XtX)
        except np.linalg.LinAlgError:
            kappa_XtX = np.inf

        # Normal equations (potentially fails)
        try:
            Xty = X.T @ y
            beta_normal = solve(XtX, Xty)
            y_pred_normal = X @ beta_normal
            mse_normal = np.mean((y_pred_normal - y)**2)
            mse_normal_str = f"{mse_normal:.2e}"
        except np.linalg.LinAlgError:
            mse_normal_str = "FAILED"

        # SVD-based solution (robust)
        beta_svd = lstsq(X, y, rcond=None)[0]
        y_pred_svd = X @ beta_svd
        mse_svd = np.mean((y_pred_svd - y)**2)

        print(f"{d:<8} {kappa_X:<12.2e} {kappa_XtX:<15.2e} {mse_normal_str:<12} {mse_svd:<12.2e}")

    print()
    print("Observation: As degree increases, κ(XᵀX) explodes exponentially.")
    print("Normal equations become unreliable around degree 10-12.")
    print("SVD remains stable regardless of conditioning.")

demonstrate_normal_equations_failure()
```

Even if you do use the normal equations, never compute (XᵀX)⁻¹ explicitly. At minimum, use a Cholesky decomposition: factor XᵀX = LLᵀ, then solve Lz = Xᵀy followed by Lᵀβ = z. This still loses roughly twice as many digits as necessary, since its accuracy is governed by κ(XᵀX) = κ(X)² rather than κ(X), but it avoids the numerical disaster of explicit inversion.
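For reference, here is a minimal sketch of that Cholesky route, assuming SciPy is available for the triangular solves; lstsq_via_cholesky is an illustrative name, not a library routine.

```python
import numpy as np
from scipy.linalg import solve_triangular  # assumes SciPy is installed

def lstsq_via_cholesky(X, y):
    """Least squares via Cholesky on the normal equations.

    Avoids forming (XᵀX)⁻¹ explicitly, but accuracy is still governed by κ(X)².
    """
    XtX = X.T @ X
    Xty = X.T @ y
    L = np.linalg.cholesky(XtX)                   # XᵀX = L Lᵀ
    z = solve_triangular(L, Xty, lower=True)      # solve L z = Xᵀy
    beta = solve_triangular(L.T, z, lower=False)  # solve Lᵀ β = z
    return beta

# Sanity check on a well-conditioned toy problem
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
beta_true = np.arange(1.0, 6.0)
y = X @ beta_true
print(np.allclose(lstsq_via_cholesky(X, y), beta_true))  # True
```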
Understanding why design matrices become ill-conditioned helps you prevent the problem before it occurs. Here are the most common causes in real-world machine learning:
1. Multicollinearity: Correlated Features
When features are highly correlated (or nearly linearly dependent), the columns of X point in nearly the same direction. The closer the columns come to exact linear dependence, the smaller the minimum singular value of X and the larger κ(X).
Example: Including both Celsius and Fahrenheit temperature as features. Or GDP and GDP-per-capita in a country-level model.
2. Scale Disparities: Features on Different Scales
If one feature ranges from 0-1 and another from 0-10⁶, the columns have vastly different magnitudes. This creates artificial ill-conditioning that's easily fixed by standardization (a short sketch follows the summary table below).
Example: Using age (0-100) and annual income ($0-$10M) as raw features without scaling.
3. Polynomial and Interaction Features
As we saw, polynomial features create Vandermonde-like structures that are inherently ill-conditioned. Higher-order interaction terms exacerbate this.
Example: Including x, x², x³, ..., x¹⁵ as features.
4. High-Dimensional Settings (p ≈ n)
When the number of features approaches or exceeds the number of samples, XᵀX approaches singularity. The matrix may be technically invertible but with enormous condition number.
Example: Gene expression data with 20,000 genes and 100 patients.
5. Redundant or Constant Features
Features that are constant (zero variance) or exact linear combinations of other features create singular or near-singular XᵀX.
Example: Including a feature that's always 1 alongside one-hot encoded categories that sum to 1.
| Source | Detection Method | Remedy |
|---|---|---|
| Multicollinearity | Correlation matrix, VIF > 10 | Remove correlated features, PCA, Ridge regression |
| Scale disparity | Feature variance comparison | Standardization (z-score) or normalization |
| Polynomial features | Condition number of Vandermonde | Orthogonal polynomials (Legendre, Chebyshev) |
| High dimensionality | p/n ratio > 0.5 | Regularization (Ridge, Lasso), dimensionality reduction |
| Redundant features | Rank deficiency check | Remove constant/duplicate columns, use pivoted QR |
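As a quick illustration of the scale-disparity row above, here is a minimal sketch with made-up age and income features, showing how z-scoring the non-constant columns collapses an artificially large condition number:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
age = rng.uniform(18, 90, n)         # feature on a 0-100 scale
income = rng.uniform(0, 1e7, n)      # feature on a 0-10^7 scale
X_raw = np.column_stack([np.ones(n), age, income])  # intercept + raw features

# Standardize every column except the intercept
X_std = X_raw.copy()
X_std[:, 1:] = (X_raw[:, 1:] - X_raw[:, 1:].mean(axis=0)) / X_raw[:, 1:].std(axis=0)

print(f"κ(raw):          {np.linalg.cond(X_raw):.2e}")  # large, driven purely by scale
print(f"κ(standardized): {np.linalg.cond(X_std):.2e}")  # modest
```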
The diagnostic routine below pulls these checks, rank, singular-value spectrum, column-norm disparity, and near-dependencies, into one function:

```python
import numpy as np
from numpy.linalg import matrix_rank, cond, svd

def diagnose_conditioning(X, feature_names=None):
    """
    Comprehensive diagnostic for design matrix conditioning.

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        Design matrix to analyze
    feature_names : list of str, optional
        Names of features for interpretable output

    Returns
    -------
    dict : Diagnostic information
    """
    n, p = X.shape
    if feature_names is None:
        feature_names = [f"x{i}" for i in range(p)]

    # Basic properties
    U, s, Vt = svd(X, full_matrices=False)
    rank = matrix_rank(X)
    kappa = cond(X)

    # Identify problematic directions:
    # small singular values indicate near-linear-dependence
    sv_threshold = s[0] * 1e-10  # Relative threshold
    problematic_idx = np.where(s < sv_threshold)[0]

    # Analyze which features contribute to ill-conditioning:
    # look at right singular vectors corresponding to small singular values
    V = Vt.T

    diagnostics = {
        "shape": (n, p),
        "rank": rank,
        "condition_number": kappa,
        "singular_values": s,
        "log10_condition": np.log10(kappa) if kappa < np.inf else np.inf,
    }

    print("=== Design Matrix Conditioning Diagnostics ===")
    print(f"Shape: {n} samples × {p} features")
    print(f"Numerical rank: {rank} (expecting {min(n, p)})")
    print(f"Condition number: {kappa:.2e}")
    print(f"Log₁₀(κ): {diagnostics['log10_condition']:.1f}")
    print()

    # Singular value analysis
    print("Singular Value Spectrum:")
    print(f"  Largest:  {s[0]:.4e}")
    print(f"  Smallest: {s[-1]:.4e}")
    print(f"  Ratio:    {s[0]/s[-1]:.4e}")
    print()

    # Check for scale disparity
    col_norms = np.linalg.norm(X, axis=0)
    scale_ratio = col_norms.max() / col_norms.min()
    print(f"Column Norm Disparity: {scale_ratio:.2e}")
    if scale_ratio > 100:
        print("  ⚠️ Consider standardizing features!")
    print()

    # Identify potential multicollinearity
    if len(problematic_idx) > 0:
        print("⚠️ Near-linear dependencies detected!")
        for idx in problematic_idx:
            # Which features contribute most to this null-space direction?
            v = V[:, idx]
            contributors = np.argsort(np.abs(v))[-3:][::-1]
            print(f"  σ_{idx} = {s[idx]:.2e}: involves {[feature_names[i] for i in contributors]}")
    else:
        print("✓ No near-exact linear dependencies detected.")

    # Guidance
    print()
    print("Recommendations:")
    if kappa > 1e12:
        print("  🔴 CRITICAL: Condition number too high for reliable double-precision.")
        print("     → Use regularization OR reformulate the problem")
    elif kappa > 1e8:
        print("  🟠 WARNING: Use QR or SVD, avoid normal equations entirely.")
    elif kappa > 1e4:
        print("  🟡 CAUTION: Use stable algorithms (QR preferred over normal equations).")
    else:
        print("  🟢 GOOD: Standard algorithms should work reliably.")

    return diagnostics


# Example usage
np.random.seed(42)
n, p = 100, 5
X = np.random.randn(n, p)
X[:, 4] = X[:, 0] + X[:, 1] + 1e-8 * np.random.randn(n)  # Introduce collinearity
diagnose_conditioning(X, ["A", "B", "C", "D", "A+B (collinear)"])
```

Before diving into specific algorithms (QR decomposition, SVD), let's consolidate the general principles of numerical stability that every practitioner should internalize.
Backward Stability: The Gold Standard
A fundamental concept in numerical analysis is backward stability. An algorithm for computing f(x) is backward stable if, for any input x, it produces a result f̂(x) that is the exact answer to a slightly perturbed input:
$$\hat{f}(x) = f(x + \delta x) \text{ where } \frac{\|\delta x\|}{\|x\|} = O(\varepsilon_{\text{mach}})$$
Why this matters: If your algorithm is backward stable, and your problem is well-conditioned, the result will be accurate. If the problem is ill-conditioned, the result will be inaccurate—but no better algorithm could do better with the same precision.
QR factorization and the SVD are backward stable. Solving normal equations via Cholesky (even with pivoting) is not backward stable for least squares because the conditioning of XᵀX, not X, dominates.
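Here is a small sketch of what backward stability buys in practice; the degree-14 polynomial setup is an illustrative assumption. On an ill-conditioned problem, a backward-stable solver returns coefficients with a sizeable forward error, yet reproduces the data to near machine precision, the practical signature of having solved a nearby problem exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
x_pts = np.linspace(0, 1, 50)
X = np.vander(x_pts, 15, increasing=True)   # ill-conditioned Vandermonde basis
beta_true = rng.standard_normal(15)
y = X @ beta_true                           # exact, noise-free targets

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # backward-stable solver

forward_error = np.linalg.norm(beta_hat - beta_true) / np.linalg.norm(beta_true)
rel_residual = np.linalg.norm(X @ beta_hat - y) / np.linalg.norm(y)

print(f"κ(X)              = {np.linalg.cond(X):.2e}")
print(f"forward error     = {forward_error:.2e}")  # far above ε_mach: limited by conditioning
print(f"relative residual = {rel_residual:.2e}")   # near ε_mach: limited only by rounding
```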
| Method | Stability | Condition Dependence | When to Use |
|---|---|---|---|
| Normal Equations (Cholesky) | Not backward stable | κ(XᵀX) = κ(X)² | Only when X is very well-conditioned and n >> p |
| QR Decomposition | Backward stable | κ(X) | Default choice for moderate-sized problems |
| SVD | Backward stable | κ(X) | Rank-deficient problems, need pseudoinverse |
| Modified Gram-Schmidt | Less stable than Householder QR | κ(X), can degrade | Educational purposes, iterative refinement available |
| Householder QR | Fully backward stable | κ(X) | Standard library implementation (LAPACK) |
In 99% of cases: use np.linalg.lstsq() or scipy.linalg.lstsq() which internally use QR or SVD. Never write beta = np.linalg.inv(X.T @ X) @ X.T @ y. The latter is a numerical disaster waiting to happen.
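A minimal sketch of the recommendation next to the anti-pattern, on a deliberately well-conditioned toy problem (data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.01 * rng.standard_normal(200)

# Recommended: backward-stable least squares (SVD-based under the hood)
beta, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)

# Anti-pattern, shown only for contrast; never do this in production code
beta_bad = np.linalg.inv(X.T @ X) @ X.T @ y

# The two agree here only because X is well-conditioned
print(np.allclose(beta, beta_bad))  # True
```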
We've covered the foundational concepts of numerical stability in linear regression: floating-point representation and machine epsilon, catastrophic cancellation, condition numbers, and why the normal equations square the conditioning of the problem. These ideas are crucial for any production-quality implementation.
What's Next:
Now that we understand why numerical stability matters and why normal equations fail, the next page introduces the QR decomposition—the industry-standard approach for solving least squares problems stably and efficiently. We'll derive the method from first principles and understand why it preserves conditioning where normal equations fail.
You now understand the numerical challenges of linear regression and why naive implementations fail. This knowledge is essential for implementing reliable ML systems—the difference between a model that works in testing and one that works in production at scale.