When matrices have special structure, specialized algorithms can exploit that structure for dramatic gains in efficiency and stability. Cholesky decomposition is the gold standard for symmetric positive definite (SPD) matrices—a class that includes covariance matrices, kernel/Gram matrices, Hessians at minima, and precision matrices in probabilistic models.
Cholesky factors an SPD matrix A into A = LLᵀ, where L is lower triangular with positive diagonal entries. This simple form has profound practical implications:
From sampling multivariate Gaussians to solving kernel regression, from computing log-determinants in Bayesian inference to preconditioning optimization algorithms, Cholesky decomposition is the workhorse of probabilistic machine learning.
By the end of this page, you will understand: (1) The definition and existence conditions for Cholesky decomposition, (2) The algorithm and its O(n³/3) complexity advantage, (3) Connection to LU and LDLT factorizations, (4) Applications in solving linear systems, computing determinants, and sampling, (5) Numerical stability and updating techniques, and (6) Critical ML applications in Gaussian processes and optimization.
The Cholesky Decomposition:
For a symmetric positive definite matrix A ∈ ℝⁿˣⁿ, there exists a unique lower triangular matrix L with positive diagonal entries such that:
$$A = LL^T$$
Equivalently, using upper triangular form: A = RᵀR where R = Lᵀ.
Prerequisites for Cholesky:
1. Symmetric: A = Aᵀ. Every entry satisfies aᵢⱼ = aⱼᵢ, which is necessary for the LLᵀ structure.
2. Positive Definite: xᵀAx > 0 for all non-zero x. Equivalently, all eigenvalues of A are strictly positive.
For positive semi-definite (PSD) matrices: a Cholesky-type factorization still exists (possibly requiring symmetric pivoting), but L may have zero diagonal entries. The factorization is no longer unique, and numerical issues can arise.
Existence and uniqueness theorem:
If A is symmetric positive definite, then there exists exactly one lower triangular matrix L with strictly positive diagonal entries such that A = LLᵀ.
The proof follows from the LU factorization of SPD matrices, which is guaranteed to exist without pivoting.
Think of Cholesky as computing the 'matrix square root': L is the unique lower triangular matrix such that L × Lᵀ = A. Just as √a is unique for positive a, L is unique for positive definite A. This analogy extends to operations: solving Ax = b becomes solving 'L²x = b' via two triangular solves.
Detecting positive definiteness:
Cholesky decomposition doubles as a positive-definiteness test: run the factorization, and it succeeds if and only if A is positive definite; in practice this is much cheaper than computing eigenvalues.
Common SPD matrices in ML:
| Matrix Type | Definition | Why SPD? |
|---|---|---|
| Covariance matrix | Σ = E[(X-μ)(X-μ)ᵀ] | xᵀΣx = E[(xᵀ(X-μ))²] ≥ 0 |
| Gram matrix | G = XᵀX | xᵀGx = ‖Xx‖² ≥ 0 |
| Kernel matrix | Kᵢⱼ = k(xᵢ, xⱼ) | Mercer's theorem |
| Hessian at minimum | ∇²f(x*) | Second-order optimality |
| Regularized matrix | A + λI (λ > 0) | Shifts eigenvalues up |
```python
import numpy as np
from scipy.linalg import cholesky, cho_solve, cho_factor

# Create a positive definite matrix
np.random.seed(42)
n = 5

# Method 1: X^T X is always positive semi-definite;
# adding a small diagonal ensures positive definiteness
X = np.random.randn(n + 5, n)
A = X.T @ X + 0.01 * np.eye(n)  # Regularization ensures SPD

print("Matrix A (symmetric positive definite):")
print(np.round(A, 3))

# Verify symmetry
print(f"Symmetric: {np.allclose(A, A.T)}")

# Verify positive definiteness via eigenvalues
eigenvalues = np.linalg.eigvalsh(A)
print(f"Eigenvalues: {eigenvalues.round(4)}")
print(f"All positive: {np.all(eigenvalues > 0)}")

# Cholesky decomposition
L = cholesky(A, lower=True)
print("Cholesky factor L (lower triangular):")
print(np.round(L, 4))

# Verify reconstruction
A_reconstructed = L @ L.T
print(f"Reconstruction error ||A - LL^T||: {np.linalg.norm(A - A_reconstructed):.2e}")

# Verify L is lower triangular with positive diagonal
print(f"L is lower triangular: {np.allclose(L, np.tril(L))}")
print(f"Diagonal of L: {np.diag(L).round(4)}")
print(f"All diagonal entries positive: {np.all(np.diag(L) > 0)}")

# Test what happens with a non-SPD matrix
print("=== Testing Non-SPD Matrix ===")
A_bad = np.array([[1, 2], [2, 1]])  # Eigenvalues: 3, -1
eigenvalues_bad = np.linalg.eigvalsh(A_bad)
print(f"Matrix eigenvalues: {eigenvalues_bad}")
print(f"Positive definite: {np.all(eigenvalues_bad > 0)}")

try:
    L_bad = cholesky(A_bad, lower=True)
except np.linalg.LinAlgError as e:
    print(f"Cholesky failed (as expected): {e}")
```

The Cholesky algorithm computes L column by column, exploiting the triangular structure for efficiency.
Derivation from A = LLᵀ:
Comparing entry (i, j) on both sides (for i ≥ j): $$a_{ij} = \sum_{k=1}^{j} l_{ik} l_{jk} = \sum_{k=1}^{j-1} l_{ik} l_{jk} + l_{ij} l_{jj}$$
Solving for L entries:
For diagonal entries (i = j): $$l_{jj} = \sqrt{a_{jj} - \sum_{k=1}^{j-1} l_{jk}^2}$$
For off-diagonal entries (i > j): $$l_{ij} = \frac{1}{l_{jj}} \left( a_{ij} - \sum_{k=1}^{j-1} l_{ik} l_{jk} \right)$$
The algorithm:
For j = 1 to n: # Column by column
# Diagonal entry
l_jj = sqrt(a_jj - sum(l_jk^2, k=1 to j-1))
# Off-diagonal entries
For i = j+1 to n:
l_ij = (a_ij - sum(l_ik * l_jk, k=1 to j-1)) / l_jj
Cholesky requires n³/3 floating-point operations, compared to 2n³/3 for LU decomposition—exactly half the work! This is because we only compute the lower triangle (the upper triangle is just Lᵀ), and the symmetry reduces the number of distinct computations.
Computational considerations:
Operation count: n³/3 multiply-add pairs plus n square roots, half the cost of LU.
Memory: the factor L can overwrite the lower triangle of A in place, so no extra storage is needed.
Numerical stability: Cholesky is backward stable without any pivoting; positive definiteness bounds element growth automatically.
Block Cholesky:
For large matrices, block algorithms improve cache utilization by factoring in panels rather than single columns.
```python
import numpy as np
import time
from scipy.linalg import cholesky

def cholesky_manual(A):
    """
    Compute Cholesky decomposition L such that A = LL^T.

    Parameters:
        A: n x n symmetric positive definite matrix

    Returns:
        L: n x n lower triangular matrix
    """
    n = A.shape[0]
    L = np.zeros((n, n))

    for j in range(n):
        # Compute diagonal entry
        sum_sq = np.sum(L[j, :j]**2)
        diag_val = A[j, j] - sum_sq
        if diag_val <= 0:
            raise ValueError(f"Matrix is not positive definite (at position {j})")
        L[j, j] = np.sqrt(diag_val)

        # Compute off-diagonal entries in column j
        for i in range(j + 1, n):
            sum_prod = np.sum(L[i, :j] * L[j, :j])
            L[i, j] = (A[i, j] - sum_prod) / L[j, j]

    return L

def cholesky_vectorized(A):
    """
    Cholesky decomposition with vectorized inner loops.
    More efficient than the pure loop version.
    """
    n = A.shape[0]
    L = np.zeros((n, n))

    for j in range(n):
        # Diagonal entry (vectorized sum)
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])

        # Off-diagonal entries (vectorized)
        if j + 1 < n:
            L[j+1:, j] = (A[j+1:, j] - L[j+1:, :j] @ L[j, :j]) / L[j, j]

    return L

# Test implementations
np.random.seed(42)
n = 100

# Create SPD matrix
X = np.random.randn(n + 10, n)
A = X.T @ X + 0.1 * np.eye(n)

# Compare implementations
print("Comparing Cholesky Implementations")
print("=" * 50)

# Manual implementation
start = time.time()
L_manual = cholesky_manual(A)
time_manual = time.time() - start
print(f"Manual: {time_manual:.4f}s")

# Vectorized implementation
start = time.time()
L_vectorized = cholesky_vectorized(A)
time_vectorized = time.time() - start
print(f"Vectorized: {time_vectorized:.4f}s")

# NumPy/SciPy implementation
start = time.time()
L_scipy = cholesky(A, lower=True)
time_scipy = time.time() - start
print(f"SciPy: {time_scipy:.4f}s")

# Verify all give the same result
print(f"Manual vs SciPy error: {np.linalg.norm(L_manual - L_scipy):.2e}")
print(f"Vectorized vs SciPy error: {np.linalg.norm(L_vectorized - L_scipy):.2e}")

# Verify reconstruction
print(f"Reconstruction error ||A - LL^T||: {np.linalg.norm(A - L_scipy @ L_scipy.T):.2e}")

# Operation count comparison with LU
print("=== Complexity Comparison ===")
print(f"Matrix size: {n} x {n}")
print(f"Cholesky ops: ~{n**3 // 3:,} (n³/3)")
print(f"LU ops: ~{2 * n**3 // 3:,} (2n³/3)")
print("Speedup: 2x")
```

One of the most common uses of Cholesky is solving linear systems Ax = b where A is SPD. The factorization enables efficient solution through two triangular solves.
The solution process:
Given Ax = b with A = LLᵀ: first forward-solve Ly = b, then back-solve Lᵀx = y.
Why triangular solves are efficient:
For lower triangular L:
y₁ = b₁ / L₁₁
y₂ = (b₂ - L₂₁y₁) / L₂₂
...
yᵢ = (bᵢ - Σⱼ₌₁^(i-1) Lᵢⱼyⱼ) / Lᵢᵢ
Each equation involves only previously computed values—purely sequential, O(n²) total.
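The recurrence above can be sketched directly in a few lines (a minimal version of what `scipy.linalg.solve_triangular` does under the hood; names are illustrative):

```python
import numpy as np

def forward_substitution(L, b):
    """Solve Ly = b for lower-triangular L in O(n^2)."""
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        # y[i] depends only on the already-computed y[0..i-1]
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

L = np.array([[2.0, 0.0],
              [1.0, 3.0]])
b = np.array([4.0, 5.0])
y = forward_substitution(L, b)
print(y)  # [2. 1.]
```

Back substitution for Lᵀx = y is the mirror image, running from the last row upward.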
Multiple right-hand sides:
If you need to solve Ax = b for many different b vectors, factor A = LLᵀ once at O(n³) cost; each subsequent b then requires only two O(n²) triangular solves.
This is much faster than naively recomputing for each b.
Factoring is worth it whenever you solve multiple systems with the same A: the O(n³) factorization is paid once, and every additional right-hand side costs only O(n²) in triangular solves. For Gaussian processes with n training points and k test points, this is nearly always beneficial.
Computing the inverse (when needed):
While explicitly computing A⁻¹ is usually avoided, when it is genuinely needed: invert the triangular factor and form A⁻¹ = L⁻ᵀL⁻¹.
Or equivalently, solve AX = I column by column using the Cholesky solve.
Computing log-determinant:
The log-determinant of A is crucial in Bayesian inference (appears in Gaussian log-likelihoods):
$$\log|A| = \log|LL^T| = \log|L|^2 = 2\log|L| = 2\sum_{i=1}^{n} \log(L_{ii})$$
This is numerically stable (avoids computing the potentially huge determinant directly) and costs only O(n) after factorization.
Solving the precision-weighted system:
In Gaussian inference, we often need (Σ⁻¹ + Q⁻¹)⁻¹ or solutions involving sums/products of covariances. Cholesky of the precision matrix enables efficient computation without explicit inversions.
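A minimal sketch of this pattern, assuming hypothetical SPD covariances Σ and Q and computing the posterior-style mean m = (Σ⁻¹ + Q⁻¹)⁻¹Q⁻¹y; every inverse is applied via `cho_solve` rather than `np.linalg.inv`:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n = 4
# Hypothetical SPD covariances (illustrative)
S1 = rng.standard_normal((n + 2, n)); Sigma = S1.T @ S1 + 0.5 * np.eye(n)
S2 = rng.standard_normal((n + 2, n)); Q = S2.T @ S2 + 0.5 * np.eye(n)
y = rng.standard_normal(n)

cS = cho_factor(Sigma)
cQ = cho_factor(Q)

# Build Lambda = Sigma^{-1} + Q^{-1} from Cholesky solves against the identity
Lam = cho_solve(cS, np.eye(n)) + cho_solve(cQ, np.eye(n))

# m = Lambda^{-1} Q^{-1} y, again via Cholesky solves
m = cho_solve(cho_factor(Lam), cho_solve(cQ, y))
```

Each `cho_solve` costs O(n²) per right-hand side once the factors are in hand, and Λ itself is SPD, so it gets its own Cholesky factor.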
```python
import numpy as np
import time
from scipy.linalg import cholesky, solve_triangular, cho_solve, cho_factor

def solve_with_cholesky(A, b):
    """
    Solve Ax = b using Cholesky decomposition.
    A must be symmetric positive definite.
    """
    # Factor
    L = cholesky(A, lower=True)

    # Forward substitution: Ly = b
    y = solve_triangular(L, b, lower=True)

    # Back substitution: L^T x = y
    x = solve_triangular(L.T, y, lower=False)

    return x, L

def log_determinant_cholesky(L):
    """
    Compute log|A| from Cholesky factor L where A = LL^T.
    log|A| = 2 * sum(log(L_ii))
    """
    return 2.0 * np.sum(np.log(np.diag(L)))

# Example: Solving a system
np.random.seed(42)
n = 100

# Create SPD matrix
X = np.random.randn(n + 10, n)
A = X.T @ X + 0.5 * np.eye(n)
b = np.random.randn(n)

# Solve with Cholesky
x_chol, L = solve_with_cholesky(A, b)

# Verify solution
residual = np.linalg.norm(A @ x_chol - b)
print(f"Solution residual ||Ax - b||: {residual:.2e}")

# Compare with numpy.linalg.solve (uses LU)
x_np = np.linalg.solve(A, b)
print(f"Difference from np.solve: {np.linalg.norm(x_chol - x_np):.2e}")

# Multiple right-hand sides
print("=== Multiple Right-Hand Sides ===")
num_rhs = 50
B = np.random.randn(n, num_rhs)

# Method 1: Naive (solve each separately)
start = time.time()
X_naive = np.column_stack([np.linalg.solve(A, B[:, i]) for i in range(num_rhs)])
time_naive = time.time() - start

# Method 2: Cholesky + triangular solves
start = time.time()
c, low = cho_factor(A, lower=True)
X_chol = cho_solve((c, low), B)
time_chol = time.time() - start

print(f"Naive (solve each): {time_naive:.4f}s")
print(f"Cholesky factored: {time_chol:.4f}s")
print(f"Speedup: {time_naive / time_chol:.1f}x")

# Log-determinant computation
print("=== Log-Determinant ===")
log_det_chol = log_determinant_cholesky(L)
log_det_direct = np.linalg.slogdet(A)[1]  # Returns (sign, logdet)

print(f"Log|A| via Cholesky: {log_det_chol:.6f}")
print(f"Log|A| via slogdet: {log_det_direct:.6f}")
print(f"Difference: {abs(log_det_chol - log_det_direct):.2e}")

# Timing comparison for log-det
large_n = 500
X_large = np.random.randn(large_n + 10, large_n)
A_large = X_large.T @ X_large + np.eye(large_n)

start = time.time()
L_large = cholesky(A_large, lower=True)
_ = log_determinant_cholesky(L_large)
time_chol_logdet = time.time() - start

start = time.time()
_ = np.linalg.slogdet(A_large)
time_slogdet = time.time() - start

print(f"For {large_n}x{large_n} matrix:")
print(f"Cholesky + sum logs: {time_chol_logdet:.4f}s")
print(f"np.slogdet: {time_slogdet:.4f}s")
```

Several variants of Cholesky decomposition address different needs: avoiding square roots, handling indefinite matrices, and numerical robustness.
LDLᵀ Decomposition:
For symmetric (but not necessarily positive definite) matrices: $$A = LDL^T$$
where L is unit lower triangular (ones on the diagonal) and D is diagonal.
Advantages of LDLᵀ: no square roots are needed, and with symmetric pivoting it extends to indefinite symmetric matrices where plain Cholesky breaks down.
Connection to Cholesky: If A is SPD and A = LDLᵀ with D > 0, then: $$A = L D L^T = L D^{1/2} D^{1/2} L^T = (LD^{1/2})(LD^{1/2})^T = \tilde{L}\tilde{L}^T$$
So LDLᵀ is just Cholesky factored into 'no-square-root' form.
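This relationship is easy to verify numerically. Below is a small no-pivot LDLᵀ sketch (valid for SPD input; names are illustrative): scaling column j of the unit factor by √Dⱼ recovers the Cholesky factor:

```python
import numpy as np
from scipy.linalg import cholesky

def ldlt_nopivot(A):
    """Plain LDL^T without pivoting (safe for SPD A)."""
    n = A.shape[0]
    L, D = np.eye(n), np.zeros(n)
    for j in range(n):
        D[j] = A[j, j] - (L[j, :j]**2) @ D[:j]
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - (L[i, :j] * L[j, :j]) @ D[:j]) / D[j]
    return L, D

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 4))
A = X.T @ X + 0.1 * np.eye(4)          # SPD

L_unit, D = ldlt_nopivot(A)
L_tilde = L_unit * np.sqrt(D)          # scale column j by sqrt(D_j)
assert np.allclose(L_tilde, cholesky(A, lower=True))
```

The broadcast `L_unit * np.sqrt(D)` is exactly the L·D^{1/2} product from the identity above.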
Use LDLᵀ when: (1) You want to avoid square roots (faster, more numerically stable for some problems), (2) The matrix might be indefinite (Hessian at non-minimum, covariance with round-off errors), (3) You need to check inertia (number of positive, negative, zero eigenvalues) which D reveals directly.
Modified Cholesky:
For optimization, we often encounter Hessians that should be positive definite at a minimum but may be indefinite during iteration. Modified Cholesky computes: $$A + E = L L^T$$
where E is a 'small' positive semi-definite perturbation that makes A+E positive definite.
Applications of modified Cholesky: guaranteeing descent directions in Newton's method when the Hessian is indefinite, and robustly factoring covariance matrices that have lost definiteness to round-off.
Incomplete Cholesky:
For large sparse matrices, incomplete Cholesky computes an approximate factorization by dropping small elements: $$A \approx \tilde{L}\tilde{L}^T$$
The approximation serves as a preconditioner for iterative methods (conjugate gradient), dramatically speeding convergence.
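A zero-fill-in sketch (often called IC(0)) makes the idea concrete; the 2D Laplacian test matrix here is a standard illustrative choice, and the preconditioner is applied through two triangular solves:

```python
import numpy as np
from scipy.linalg import solve_triangular
from scipy.sparse import diags, identity, kron
from scipy.sparse.linalg import LinearOperator, cg

def incomplete_cholesky(A):
    """IC(0): the Cholesky recurrence, keeping only entries where A is nonzero."""
    n = A.shape[0]
    L = np.zeros((n, n))
    for j in range(n):
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
        for i in range(j + 1, n):
            if A[i, j] != 0:
                L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L

# 2D Laplacian on an m x m grid (SPD, sparse)
m = 10
T = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
A = (kron(identity(m), T) + kron(T, identity(m))).toarray()
b = np.ones(A.shape[0])

L = incomplete_cholesky(A)

def apply_prec(r):
    # Approximate A^{-1} r via the incomplete factor: (LL^T)^{-1} r
    return solve_triangular(L.T, solve_triangular(L, r, lower=True), lower=False)

M = LinearOperator(A.shape, matvec=apply_prec)

it_plain, it_prec = [], []
x_plain, info_plain = cg(A, b, callback=lambda xk: it_plain.append(0))
x_prec, info_prec = cg(A, b, M=M, callback=lambda xk: it_prec.append(0))

print(f"CG iterations without preconditioner: {len(it_plain)}")
print(f"CG iterations with IC(0): {len(it_prec)}")
```

On this problem the IC(0)-preconditioned solve typically converges in a fraction of the plain CG iteration count.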
Block Cholesky:
For partitioned matrices: $$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{pmatrix}$$
This is the basis for efficient block algorithms and parallel implementations.
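A sketch of the 2×2 block recurrence, using uniqueness of the factor to check it against the unblocked result (block sizes are illustrative):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(2)
X = rng.standard_normal((12, 6))
A = X.T @ X + 0.5 * np.eye(6)
A11, A21, A22 = A[:3, :3], A[3:, :3], A[3:, 3:]

# Block recurrence: L11 = chol(A11), L21 = A21 L11^{-T},
# L22 = chol(A22 - L21 L21^T)   (Cholesky of the Schur complement)
L11 = cholesky(A11, lower=True)
L21 = solve_triangular(L11, A21.T, lower=True).T   # solves L11 L21^T = A21^T
L22 = cholesky(A22 - L21 @ L21.T, lower=True)

L = np.block([[L11, np.zeros((3, 3))], [L21, L22]])
assert np.allclose(L, cholesky(A, lower=True))
```

Each step is itself a Cholesky factorization or a triangular solve, which is why the scheme parallelizes and blocks so well.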
```python
import numpy as np
from scipy.linalg import ldl

def ldlt_decomposition(A):
    """
    Compute LDL^T decomposition of symmetric matrix A.

    Returns:
        L: unit lower triangular
        D: diagonal
    """
    n = A.shape[0]
    L = np.eye(n)
    D = np.zeros(n)

    for j in range(n):
        # Compute D[j]
        D[j] = A[j, j] - np.sum(L[j, :j]**2 * D[:j])

        # Compute off-diagonal entries of L
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - np.sum(L[i, :j] * L[j, :j] * D[:j])) / D[j]

    return L, D

def modified_cholesky(A, beta=1e-4):
    """
    Modified Cholesky that handles indefinite matrices by adding
    to the diagonal to ensure positive definiteness.

    Returns:
        L: lower triangular factor
        E_diag: diagonal perturbation added (E = diag(E_diag))
    """
    n = A.shape[0]
    L = np.zeros((n, n))
    E_diag = np.zeros(n)

    for j in range(n):
        # What diagonal entry would we get?
        sum_sq = np.sum(L[j, :j]**2)
        diag_val = A[j, j] - sum_sq

        # If not positive enough, add a perturbation
        if diag_val < beta:
            E_diag[j] = beta - diag_val
            diag_val = beta

        L[j, j] = np.sqrt(diag_val)

        # Off-diagonal entries
        for i in range(j + 1, n):
            sum_prod = np.sum(L[i, :j] * L[j, :j])
            L[i, j] = (A[i, j] - sum_prod) / L[j, j]

    return L, E_diag

# Test LDL^T on SPD matrix
np.random.seed(42)
n = 5
X = np.random.randn(n + 2, n)
A_spd = X.T @ X + 0.1 * np.eye(n)

L, D = ldlt_decomposition(A_spd)
print("LDL^T Decomposition (SPD matrix)")
print("=" * 50)
print(f"D (diagonal): {D.round(4)}")
print(f"All positive (SPD): {np.all(D > 0)}")

# Verify reconstruction
A_reconstructed = L @ np.diag(D) @ L.T
print(f"Reconstruction error: {np.linalg.norm(A_spd - A_reconstructed):.2e}")

# Test on an indefinite matrix
print("\n" + "=" * 50)
print("LDL^T on Indefinite Matrix")
A_indef = np.array([
    [4, 2, -2],
    [2, -1, -1],   # negative diagonal entry makes this genuinely indefinite
    [-2, -1, 5]
])
eigenvalues_indef = np.linalg.eigvalsh(A_indef)
print(f"Eigenvalues: {eigenvalues_indef.round(4)}")
print(f"Indefinite: {np.any(eigenvalues_indef < 0) and np.any(eigenvalues_indef > 0)}")

# SciPy's LDL works for indefinite matrices
lu, d, perm = ldl(A_indef)
print(f"D from LDL: {np.diag(d).round(4)}")

# Modified Cholesky on the indefinite matrix
print("\n" + "=" * 50)
print("Modified Cholesky (making indefinite matrix SPD)")
L_mod, E_diag = modified_cholesky(A_indef)
print(f"Perturbation added to diagonal: {E_diag.round(4)}")

# Verify: A + E = LL^T
E = np.diag(E_diag)
A_modified = A_indef + E
print(f"Modified matrix eigenvalues: {np.linalg.eigvalsh(A_modified).round(4)}")
print(f"||A + E - LL^T||: {np.linalg.norm(A_modified - L_mod @ L_mod.T):.2e}")
```

When data arrives incrementally or changes, we often need to update the Cholesky factor without recomputing from scratch. Rank-1 updates handle adding or removing a single outer product.
Rank-1 Update (Cholupdate):
Given L with A = LLᵀ, compute L' such that: $$A' = A + xx^T = L'L'^T$$
This is a rank-1 update (adding outer product xxᵀ).
The algorithm (simplified):
Process the update column by column using Givens rotations:
For k = 1 to n:
r = sqrt(L[k,k]² + x[k]²)
c = L[k,k] / r
s = x[k] / r
L[k,k] = r
# Rotate remaining entries
For i = k+1 to n:
[L[i,k], x[i]] = [c*L[i,k] + s*x[i], -s*L[i,k] + c*x[i]]
Complexity: O(n²) instead of O(n³) for full refactorization!
Rank-1 Downdate:
Given L with A = LLᵀ, compute L' such that: $$A' = A - xx^T = L'L'^T$$
This is more delicate—the result may not be positive definite if x is 'too large'.
Cholesky downdating (removing a rank-1 component) is numerically less stable than updating. If A - xxᵀ is close to singular or indefinite, the algorithm can fail or produce inaccurate results. Always check that the updated factor is valid by verifying reconstruction error or checking diagonal positivity.
Applications of incremental updates:
1. Online learning: As new data points arrive, update the covariance matrix: $$\Sigma_{new} = \Sigma_{old} + x_{new} x_{new}^T$$
With rank-1 update: O(n²) per point vs O(n³) to refactor.
2. Leave-one-out cross-validation: For each left-out point, downdate the covariance and recompute predictions. With direct updates: O(n²) per fold vs O(n³).
3. Sliding window models: As new data enters and old data leaves, update and downdate simultaneously.
4. Gaussian process conditioning: When adding inducing points or updating observations, incremental Cholesky updates enable efficient model updates.
Row/Column Addition:
When adding a row and column (expanding the matrix): $$A' = \begin{pmatrix} A & b \\ b^T & c \end{pmatrix}$$
The new Cholesky factor is: $$L' = \begin{pmatrix} L & 0 \\ l^T & \ell \end{pmatrix}$$
where Ll = b (solve triangular system) and ℓ = √(c - lᵀl).
Complexity: O(n²) for the triangular solve + O(n) for ℓ.
```python
import numpy as np
import time
from scipy.linalg import cholesky, solve_triangular

def cholesky_update(L, x):
    """
    Rank-1 update: Given L with A = LL^T, compute L' with
    A + xx^T = L'L'^T.

    Uses Givens rotations for numerical stability.
    Works on copies and returns the updated factor.
    """
    n = len(x)
    L = L.copy()
    x = x.copy()

    for k in range(n):
        r = np.sqrt(L[k, k]**2 + x[k]**2)
        c = L[k, k] / r
        s = x[k] / r
        L[k, k] = r

        if k < n - 1:
            # Rotate the rest of column k together with x
            L_col = L[k+1:, k].copy()
            x_rest = x[k+1:].copy()
            L[k+1:, k] = c * L_col + s * x_rest
            x[k+1:] = -s * L_col + c * x_rest

    return L

def cholesky_downdate(L, x):
    """
    Rank-1 downdate: Given L with A = LL^T, compute L' with
    A - xx^T = L'L'^T.

    Warning: May fail if A - xx^T is not positive definite.
    """
    n = len(x)
    L = L.copy()
    x = x.copy()

    for k in range(n):
        r_sq = L[k, k]**2 - x[k]**2
        if r_sq <= 0:
            raise ValueError("Matrix becomes non-positive definite")
        r = np.sqrt(r_sq)
        c = r / L[k, k]
        s = x[k] / L[k, k]
        L[k, k] = r

        if k < n - 1:
            L_col = L[k+1:, k].copy()
            x_rest = x[k+1:].copy()
            L[k+1:, k] = (L_col - s * x_rest) / c
            x[k+1:] = c * x_rest - s * L_col

    return L

def add_row_column(L, b, c):
    """
    Add a row and column to the Cholesky factor.
    If A = LL^T, compute L' for A' = [[A, b], [b^T, c]].
    """
    n = L.shape[0]

    # Solve L l = b for l
    l = solve_triangular(L, b, lower=True)

    # Compute the new diagonal element
    ell_sq = c - l @ l
    if ell_sq <= 0:
        raise ValueError("Extended matrix is not positive definite")
    ell = np.sqrt(ell_sq)

    # Build the new L'
    L_new = np.zeros((n + 1, n + 1))
    L_new[:n, :n] = L
    L_new[n, :n] = l
    L_new[n, n] = ell

    return L_new

# Demonstration
np.random.seed(42)
n = 50

# Initial matrix
X = np.random.randn(n + 10, n)
A = X.T @ X + 0.5 * np.eye(n)
L = cholesky(A, lower=True)

# Rank-1 update
x = np.random.randn(n)
L_updated = cholesky_update(L, x)

# Verify via direct computation
A_updated = A + np.outer(x, x)
L_direct = cholesky(A_updated, lower=True)

print("Cholesky Rank-1 Update")
print("=" * 50)
print(f"Update error ||L_updated - L_direct||: {np.linalg.norm(L_updated - L_direct):.2e}")
print(f"Reconstruction error: {np.linalg.norm(A_updated - L_updated @ L_updated.T):.2e}")

# Row/column addition
b = np.random.randn(n)
c = b @ np.linalg.solve(A, b) + 1.0  # Ensures positive definiteness
L_extended = add_row_column(L, b, c)

# Verify
A_extended = np.block([[A, b.reshape(-1, 1)], [b.reshape(1, -1), np.array([[c]])]])
L_extended_direct = cholesky(A_extended, lower=True)

print("Row/Column Addition")
print(f"Extension error: {np.linalg.norm(L_extended - L_extended_direct):.2e}")

# Timing comparison
n_large = 500
X_large = np.random.randn(n_large + 10, n_large)
A_large = X_large.T @ X_large + 0.5 * np.eye(n_large)
L_large = cholesky(A_large, lower=True)
x_large = np.random.randn(n_large)

# Incremental update
start = time.time()
L_incr = cholesky_update(L_large, x_large)
time_incr = time.time() - start

# Full recomputation
start = time.time()
L_full = cholesky(A_large + np.outer(x_large, x_large), lower=True)
time_full = time.time() - start

print(f"Timing ({n_large}x{n_large} matrix):")
print(f"Incremental update: {time_incr:.4f}s")
print(f"Full recomputation: {time_full:.4f}s")
print(f"Speedup: {time_full / time_incr:.1f}x")
```

Cholesky decomposition is the computational backbone of many ML algorithms involving Gaussian distributions, kernel methods, and positive definite structures.
1. Gaussian Process Regression:
For GP regression with n training points, every prediction requires solving (K + σ²I)α = y, computing predictive variances, and evaluating log|K + σ²I| for the marginal likelihood.
With Cholesky: factor K + σ²I = LLᵀ once at O(n³); each of these quantities then reduces to O(n²) triangular solves, and the log-determinant is simply 2Σᵢ log Lᵢᵢ.
2. Sampling Multivariate Gaussians:
To sample from N(μ, Σ): factor Σ = LLᵀ, draw z ~ N(0, I), and return x = μ + Lz.
Why this works: Cov(Lz) = L·I·Lᵀ = LLᵀ = Σ
3. Linear Regression with Ridge Regularization:
For regularized least squares: minimize ||Xβ - y||² + λ||β||²
Normal equations: (XᵀX + λI)β = Xᵀy
The matrix XᵀX + λI is always SPD (XᵀX is PSD, λI adds positive definiteness). Cholesky solve is optimal here.
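A short sketch of ridge regression solved this way (sizes and names are illustrative), cross-checked against a generic dense solve:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 10))
beta_true = rng.standard_normal(10)
y = X @ beta_true + 0.1 * rng.standard_normal(200)

lam = 1.0
# Normal equations (X^T X + lam*I) beta = X^T y, solved via Cholesky
c = cho_factor(X.T @ X + lam * np.eye(10))
beta = cho_solve(c, X.T @ y)

# Cross-check against a generic dense solve
beta_ref = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)
assert np.allclose(beta, beta_ref)
```

Reusing the factor `c` also makes sweeping over many response vectors y (or leave-one-out residuals) an O(n²) operation per solve.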
For Gaussian processes, the O(n³) Cholesky factorization is the dominant computational cost. This limits vanilla GPs to ~10,000 training points. Scalable GP methods (inducing points, SKI, etc.) work by approximating K to have efficient Cholesky structure.
4. Mahalanobis Distance:
The Mahalanobis distance measures distance under a covariance structure: $$d_M(x, y) = \sqrt{(x-y)^T \Sigma^{-1} (x-y)}$$
With Cholesky: L = chol(Σ), solve Lz = (x-y), then d = ||z||
No explicit inverse needed!
5. Whitening Transformation:
Transform data to have identity covariance: with Σ = LLᵀ, the whitened variable z = L⁻¹(x − μ) satisfies Cov(z) = L⁻¹ Σ L⁻ᵀ = I.
This is used in preprocessing for ICA, decorrelating features before distance-based methods, and whitening steps in vision and signal-processing pipelines.
6. Constrained Optimization (Interior Point Methods):
Newton steps in interior point methods require solving: $$(H + A^T D A) \Delta x = -g$$
where H is Hessian, A is constraint matrix, D is diagonal scaling.
The left-hand side is often SPD, making Cholesky the solver of choice.
7. Fisher Information and Natural Gradient:
The Fisher information matrix F in natural gradient descent is SPD. Computing the natural gradient requires solving Fδ = ∇L, which is efficiently done via Cholesky.
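A minimal sketch, using a damped empirical Fisher built from stand-in per-example gradients (all names here are illustrative):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(4)
d = 8
# Stand-in per-example gradients; damping keeps the empirical Fisher SPD
G = rng.standard_normal((100, d))
F = G.T @ G / 100 + 1e-3 * np.eye(d)
grad = rng.standard_normal(d)

# Natural gradient step: solve F @ delta = grad via Cholesky
delta = cho_solve(cho_factor(F), grad)
```

The factorization also exposes log|F| and F-weighted inner products for free, which is useful when the same F is reused across several gradient steps.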
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def sample_multivariate_gaussian(mean, cov, n_samples=1):
    """
    Sample from multivariate Gaussian N(mean, cov) using Cholesky.
    """
    d = len(mean)
    L = cholesky(cov, lower=True)

    # Sample from standard normal
    z = np.random.randn(n_samples, d)

    # Transform: x = mean + L @ z
    samples = mean + (L @ z.T).T

    return samples

def gaussian_process_predict(X_train, y_train, X_test, kernel_fn, noise_var=0.1):
    """
    Gaussian Process regression using Cholesky.

    Returns:
        mean: predictive mean at test points
        var: predictive variance at test points
        log_marginal: log marginal likelihood
    """
    n_train = len(X_train)

    # Compute kernel matrices
    K = kernel_fn(X_train, X_train) + noise_var * np.eye(n_train)
    K_s = kernel_fn(X_train, X_test)
    K_ss = kernel_fn(X_test, X_test)

    # Cholesky factorization
    L = cholesky(K, lower=True)

    # Solve K^{-1} y via triangular solves
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))

    # Predictive mean: K_s^T @ alpha
    mean = K_s.T @ alpha

    # Predictive variance: K_ss - K_s^T @ K^{-1} @ K_s
    v = solve_triangular(L, K_s, lower=True)
    var = np.diag(K_ss) - np.sum(v**2, axis=0)

    # Log marginal likelihood
    log_marginal = -0.5 * y_train @ alpha
    log_marginal -= np.sum(np.log(np.diag(L)))
    log_marginal -= 0.5 * n_train * np.log(2 * np.pi)

    return mean, var, log_marginal

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Radial basis function (squared exponential) kernel."""
    sq_dist = np.sum(X1**2, axis=1, keepdims=True) + np.sum(X2**2, axis=1) - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

# Demo: Sampling from a multivariate Gaussian
np.random.seed(42)
mean = np.array([1, 2])
cov = np.array([[2, 0.8], [0.8, 1]])

samples = sample_multivariate_gaussian(mean, cov, n_samples=1000)
print("Multivariate Gaussian Sampling")
print("=" * 50)
print(f"True mean: {mean}")
print(f"Sample mean: {samples.mean(axis=0).round(3)}")
print(f"True covariance:\n{cov}")
print(f"Sample covariance:\n{np.cov(samples.T).round(3)}")

# Demo: Gaussian Process Regression
print("\n" + "=" * 50)
print("Gaussian Process Regression")

# Training data
X_train = np.linspace(0, 10, 20).reshape(-1, 1)
y_train = np.sin(X_train.flatten()) + 0.1 * np.random.randn(20)

# Test data
X_test = np.linspace(0, 10, 100).reshape(-1, 1)

# Predictions
mean_pred, var_pred, log_ml = gaussian_process_predict(
    X_train, y_train, X_test,
    lambda X1, X2: rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0),
    noise_var=0.01
)

print(f"Log marginal likelihood: {log_ml:.4f}")
print(f"Mean prediction range: [{mean_pred.min():.3f}, {mean_pred.max():.3f}]")
print(f"Predictive std range: [{np.sqrt(var_pred).min():.3f}, {np.sqrt(var_pred).max():.3f}]")

# Demo: Mahalanobis distance
print("\n" + "=" * 50)
print("Mahalanobis Distance")

# Points to compare
x1 = np.array([0, 0])
x2 = np.array([2, 2])
Sigma = np.array([[4, 2], [2, 3]])  # Correlation structure

L = cholesky(Sigma, lower=True)
diff = x2 - x1
z = solve_triangular(L, diff, lower=True)
mahal_dist = np.linalg.norm(z)

euclidean_dist = np.linalg.norm(diff)

print(f"Euclidean distance: {euclidean_dist:.4f}")
print(f"Mahalanobis distance: {mahal_dist:.4f}")
print("(Mahalanobis accounts for correlation structure)")
```

We've explored Cholesky decomposition comprehensively, from its elegant definition through efficient algorithms to critical ML applications.
What's next:
We'll explore Applications of Matrix Decompositions in ML—synthesizing SVD, eigendecomposition, QR, and Cholesky into a unified toolkit for solving dimensionality reduction, recommendation systems, signal processing, and optimization problems in machine learning.
You now understand Cholesky decomposition in depth—when it exists, how to compute it efficiently, its variants, incremental updates, and central role in probabilistic ML. Combined with SVD, eigendecomposition, and QR, you have a complete matrix decomposition toolkit.