When matrices have special structure, specialized algorithms can exploit that structure for dramatic gains in efficiency and stability. Cholesky decomposition is the gold standard for symmetric positive definite (SPD) matrices—a class that includes covariance matrices, kernel/Gram matrices, Hessians at minima, and precision matrices in probabilistic models.
Cholesky factors an SPD matrix A into A = LLᵀ, where L is lower triangular with positive diagonal entries. This simple form has profound practical implications:
From sampling multivariate Gaussians to solving kernel regression, from computing log-determinants in Bayesian inference to preconditioning optimization algorithms, Cholesky decomposition is the workhorse of probabilistic machine learning.
By the end of this page, you will understand: (1) The definition and existence conditions for Cholesky decomposition, (2) The algorithm and its O(n³/3) complexity advantage, (3) Connection to LU and LDLT factorizations, (4) Applications in solving linear systems, computing determinants, and sampling, (5) Numerical stability and updating techniques, and (6) Critical ML applications in Gaussian processes and optimization.
The Cholesky Decomposition:
For a symmetric positive definite matrix A ∈ ℝⁿˣⁿ, there exists a unique lower triangular matrix L with positive diagonal entries such that:
$$A = LL^T$$
Equivalently, using upper triangular form: A = RᵀR where R = Lᵀ.
Prerequisites for Cholesky:
1. Symmetric: A = Aᵀ. Every entry satisfies aᵢⱼ = aⱼᵢ, which is necessary for the LLᵀ structure.
2. Positive Definite: xᵀAx > 0 for all non-zero x. Equivalently, all eigenvalues of A are strictly positive.
For positive semi-definite (PSD) matrices: a Cholesky-type factorization still exists (possibly requiring symmetric pivoting), but L may have zero diagonal entries. The factorization is no longer unique, and numerical issues can arise.
Existence and uniqueness theorem:
If A is symmetric positive definite, then there exists exactly one lower triangular matrix L with strictly positive diagonal entries such that A = LLᵀ.
The proof follows from the LU factorization of SPD matrices, which is guaranteed to exist without pivoting.
Think of Cholesky as computing the 'matrix square root': L is the unique lower triangular matrix such that L × Lᵀ = A. Just as √a is unique for positive a, L is unique for positive definite A. This analogy extends to operations: solving Ax = b becomes solving 'L²x = b' via two triangular solves.
Detecting positive definiteness:
Cholesky decomposition doubles as a positive-definiteness test: run the factorization, and it succeeds if and only if A is positive definite; in practice this is much cheaper than computing eigenvalues.
Common SPD matrices in ML:
| Matrix Type | Definition | Why SPD? |
|---|---|---|
| Covariance matrix | Σ = E[(X-μ)(X-μ)ᵀ] | xᵀΣx = E[(xᵀ(X-μ))²] ≥ 0 |
| Gram matrix | G = XᵀX | xᵀGx = ‖Xx‖² ≥ 0 |
| Kernel matrix | Kᵢⱼ = k(xᵢ, xⱼ) | Mercer's theorem |
| Hessian at minimum | ∇²f(x*) | Second-order optimality |
| Regularized matrix | A + λI (λ > 0) | Shifts eigenvalues up |
```python
import numpy as np
from scipy.linalg import cholesky, cho_solve, cho_factor

# Create a positive definite matrix
np.random.seed(42)
n = 5

# Method 1: X^T X is always positive semi-definite;
# adding a small diagonal ensures positive definiteness
X = np.random.randn(n + 5, n)
A = X.T @ X + 0.01 * np.eye(n)  # Regularization ensures SPD

print("Matrix A (symmetric positive definite):")
print(np.round(A, 3))

# Verify symmetry
print(f"Symmetric: {np.allclose(A, A.T)}")

# Verify positive definiteness via eigenvalues
eigenvalues = np.linalg.eigvalsh(A)
print(f"Eigenvalues: {eigenvalues.round(4)}")
print(f"All positive: {np.all(eigenvalues > 0)}")

# Cholesky decomposition
L = cholesky(A, lower=True)
print("Cholesky factor L (lower triangular):")
print(np.round(L, 4))

# Verify reconstruction
A_reconstructed = L @ L.T
print(f"Reconstruction error ||A - LL^T||: {np.linalg.norm(A - A_reconstructed):.2e}")

# Verify L is lower triangular with positive diagonal
print(f"L is lower triangular: {np.allclose(L, np.tril(L))}")
print(f"Diagonal of L: {np.diag(L).round(4)}")
print(f"All diagonal entries positive: {np.all(np.diag(L) > 0)}")

# Test what happens with a non-SPD matrix
print("=== Testing Non-SPD Matrix ===")
A_bad = np.array([[1, 2], [2, 1]])  # Eigenvalues: 3, -1
eigenvalues_bad = np.linalg.eigvalsh(A_bad)
print(f"Matrix eigenvalues: {eigenvalues_bad}")
print(f"Positive definite: {np.all(eigenvalues_bad > 0)}")

try:
    L_bad = cholesky(A_bad, lower=True)
except np.linalg.LinAlgError as e:
    print(f"Cholesky failed (as expected): {e}")
```

The Cholesky algorithm computes L column by column, exploiting the triangular structure for efficiency.
Derivation from A = LLᵀ:
Comparing entry (i, j) on both sides (for i ≥ j): $$a_{ij} = \sum_{k=1}^{j} l_{ik} l_{jk} = \sum_{k=1}^{j-1} l_{ik} l_{jk} + l_{ij} l_{jj}$$
Solving for L entries:
For diagonal entries (i = j): $$l_{jj} = \sqrt{a_{jj} - \sum_{k=1}^{j-1} l_{jk}^2}$$
For off-diagonal entries (i > j): $$l_{ij} = \frac{1}{l_{jj}} \left( a_{ij} - \sum_{k=1}^{j-1} l_{ik} l_{jk} \right)$$
The algorithm:
For j = 1 to n: # Column by column
# Diagonal entry
l_jj = sqrt(a_jj - sum(l_jk^2, k=1 to j-1))
# Off-diagonal entries
For i = j+1 to n:
l_ij = (a_ij - sum(l_ik * l_jk, k=1 to j-1)) / l_jj
Cholesky requires n³/3 floating-point operations, compared to 2n³/3 for LU decomposition—exactly half the work! This is because we only compute the lower triangle (the upper triangle is just Lᵀ), and the symmetry reduces the number of distinct computations.
Computational considerations:
Operation count: n³/3 multiply-add pairs plus n square roots, half the cost of LU.
Memory: the factor L can overwrite the lower triangle of A in place, so no extra storage is needed.
Numerical stability: Cholesky is backward stable without any pivoting; positive definiteness bounds element growth automatically.
Block Cholesky:
For large matrices, block algorithms improve cache utilization by factoring in panels rather than single columns.
```python
import numpy as np
import time
from scipy.linalg import cholesky

def cholesky_manual(A):
    """
    Compute Cholesky decomposition L such that A = LL^T.

    Parameters:
        A: n x n symmetric positive definite matrix

    Returns:
        L: n x n lower triangular matrix
    """
    n = A.shape[0]
    L = np.zeros((n, n))

    for j in range(n):
        # Compute diagonal entry
        sum_sq = np.sum(L[j, :j]**2)
        diag_val = A[j, j] - sum_sq
        if diag_val <= 0:
            raise ValueError(f"Matrix is not positive definite (at position {j})")
        L[j, j] = np.sqrt(diag_val)

        # Compute off-diagonal entries in column j
        for i in range(j + 1, n):
            sum_prod = np.sum(L[i, :j] * L[j, :j])
            L[i, j] = (A[i, j] - sum_prod) / L[j, j]

    return L

def cholesky_vectorized(A):
    """
    Cholesky decomposition with vectorized inner loops.
    More efficient than the pure loop version.
    """
    n = A.shape[0]
    L = np.zeros((n, n))

    for j in range(n):
        # Diagonal entry (vectorized sum)
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])

        # Off-diagonal entries (vectorized)
        if j + 1 < n:
            L[j+1:, j] = (A[j+1:, j] - L[j+1:, :j] @ L[j, :j]) / L[j, j]

    return L

# Test implementations
np.random.seed(42)
n = 100

# Create SPD matrix
X = np.random.randn(n + 10, n)
A = X.T @ X + 0.1 * np.eye(n)

# Compare implementations
print("Comparing Cholesky Implementations")
print("=" * 50)

# Manual implementation
start = time.time()
L_manual = cholesky_manual(A)
time_manual = time.time() - start
print(f"Manual: {time_manual:.4f}s")

# Vectorized implementation
start = time.time()
L_vectorized = cholesky_vectorized(A)
time_vectorized = time.time() - start
print(f"Vectorized: {time_vectorized:.4f}s")

# NumPy/SciPy implementation
start = time.time()
L_scipy = cholesky(A, lower=True)
time_scipy = time.time() - start
print(f"SciPy: {time_scipy:.4f}s")

# Verify all give the same result
print(f"Manual vs SciPy error: {np.linalg.norm(L_manual - L_scipy):.2e}")
print(f"Vectorized vs SciPy error: {np.linalg.norm(L_vectorized - L_scipy):.2e}")

# Verify reconstruction
print(f"Reconstruction error ||A - LL^T||: {np.linalg.norm(A - L_scipy @ L_scipy.T):.2e}")

# Operation count comparison with LU
print("=== Complexity Comparison ===")
print(f"Matrix size: {n} x {n}")
print(f"Cholesky ops: ~{n**3 // 3:,} (n³/3)")
print(f"LU ops: ~{2 * n**3 // 3:,} (2n³/3)")
print("Speedup: 2x")
```

One of the most common uses of Cholesky is solving linear systems Ax = b where A is SPD. The factorization enables efficient solution through two triangular solves.
The solution process:
Given Ax = b with A = LLᵀ: first forward-solve Ly = b, then back-solve Lᵀx = y.
Why triangular solves are efficient:
For lower triangular L:
y₁ = b₁ / L₁₁
y₂ = (b₂ - L₂₁y₁) / L₂₂
...
yᵢ = (bᵢ - Σⱼ₌₁^(i-1) Lᵢⱼyⱼ) / Lᵢᵢ
Each equation involves only previously computed values—purely sequential, O(n²) total.
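The recurrence above can be sketched directly in a few lines (a minimal version of what `scipy.linalg.solve_triangular` does under the hood; names are illustrative):

```python
import numpy as np

def forward_substitution(L, b):
    """Solve Ly = b for lower-triangular L in O(n^2)."""
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        # y[i] depends only on the already-computed y[0..i-1]
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

L = np.array([[2.0, 0.0],
              [1.0, 3.0]])
b = np.array([4.0, 5.0])
y = forward_substitution(L, b)
print(y)  # [2. 1.]
```

Back substitution for Lᵀx = y is the mirror image, running from the last row upward.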
Multiple right-hand sides:
If you need to solve Ax = b for many different b vectors, factor A = LLᵀ once at O(n³) cost; each subsequent b then requires only two O(n²) triangular solves.
This is much faster than naively recomputing for each b.
Factoring is worth it whenever you solve multiple systems with the same A: the O(n³) factorization is paid once, and every additional right-hand side costs only O(n²) in triangular solves. For Gaussian processes with n training points and k test points, this is nearly always beneficial.
Computing the inverse (when needed):
While explicitly computing A⁻¹ is usually avoided, when it is genuinely needed: invert the triangular factor and form A⁻¹ = L⁻ᵀL⁻¹.
Or equivalently, solve AX = I column by column using the Cholesky solve.
Computing log-determinant:
The log-determinant of A is crucial in Bayesian inference (appears in Gaussian log-likelihoods):
$$\log|A| = \log|LL^T| = \log|L|^2 = 2\log|L| = 2\sum_{i=1}^{n} \log(L_{ii})$$
This is numerically stable (avoids computing the potentially huge determinant directly) and costs only O(n) after factorization.
Solving the precision-weighted system:
In Gaussian inference, we often need (Σ⁻¹ + Q⁻¹)⁻¹ or solutions involving sums/products of covariances. Cholesky of the precision matrix enables efficient computation without explicit inversions.
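A minimal sketch of this pattern, assuming hypothetical SPD covariances Σ and Q and computing the posterior-style mean m = (Σ⁻¹ + Q⁻¹)⁻¹Q⁻¹y; every inverse is applied via `cho_solve` rather than `np.linalg.inv`:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n = 4
# Hypothetical SPD covariances (illustrative)
S1 = rng.standard_normal((n + 2, n)); Sigma = S1.T @ S1 + 0.5 * np.eye(n)
S2 = rng.standard_normal((n + 2, n)); Q = S2.T @ S2 + 0.5 * np.eye(n)
y = rng.standard_normal(n)

cS = cho_factor(Sigma)
cQ = cho_factor(Q)

# Build Lambda = Sigma^{-1} + Q^{-1} from Cholesky solves against the identity
Lam = cho_solve(cS, np.eye(n)) + cho_solve(cQ, np.eye(n))

# m = Lambda^{-1} Q^{-1} y, again via Cholesky solves
m = cho_solve(cho_factor(Lam), cho_solve(cQ, y))
```

Each `cho_solve` costs O(n²) per right-hand side once the factors are in hand, and Λ itself is SPD, so it gets its own Cholesky factor.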
```python
import numpy as np
import time
from scipy.linalg import cholesky, solve_triangular, cho_solve, cho_factor

def solve_with_cholesky(A, b):
    """
    Solve Ax = b using Cholesky decomposition.
    A must be symmetric positive definite.
    """
    # Factor
    L = cholesky(A, lower=True)

    # Forward substitution: Ly = b
    y = solve_triangular(L, b, lower=True)

    # Back substitution: L^T x = y
    x = solve_triangular(L.T, y, lower=False)

    return x, L

def log_determinant_cholesky(L):
    """
    Compute log|A| from Cholesky factor L where A = LL^T.
    log|A| = 2 * sum(log(L_ii))
    """
    return 2.0 * np.sum(np.log(np.diag(L)))

# Example: Solving a system
np.random.seed(42)
n = 100

# Create SPD matrix
X = np.random.randn(n + 10, n)
A = X.T @ X + 0.5 * np.eye(n)
b = np.random.randn(n)

# Solve with Cholesky
x_chol, L = solve_with_cholesky(A, b)

# Verify solution
residual = np.linalg.norm(A @ x_chol - b)
print(f"Solution residual ||Ax - b||: {residual:.2e}")

# Compare with numpy.linalg.solve (uses LU)
x_np = np.linalg.solve(A, b)
print(f"Difference from np.solve: {np.linalg.norm(x_chol - x_np):.2e}")

# Multiple right-hand sides
print("=== Multiple Right-Hand Sides ===")
num_rhs = 50
B = np.random.randn(n, num_rhs)

# Method 1: Naive (solve each separately)
start = time.time()
X_naive = np.column_stack([np.linalg.solve(A, B[:, i]) for i in range(num_rhs)])
time_naive = time.time() - start

# Method 2: Cholesky + triangular solves
start = time.time()
c, low = cho_factor(A, lower=True)
X_chol = cho_solve((c, low), B)
time_chol = time.time() - start

print(f"Naive (solve each): {time_naive:.4f}s")
print(f"Cholesky factored: {time_chol:.4f}s")
print(f"Speedup: {time_naive / time_chol:.1f}x")

# Log-determinant computation
print("=== Log-Determinant ===")
log_det_chol = log_determinant_cholesky(L)
log_det_direct = np.linalg.slogdet(A)[1]  # Returns (sign, logdet)

print(f"Log|A| via Cholesky: {log_det_chol:.6f}")
print(f"Log|A| via slogdet: {log_det_direct:.6f}")
print(f"Difference: {abs(log_det_chol - log_det_direct):.2e}")

# Timing comparison for log-det
large_n = 500
X_large = np.random.randn(large_n + 10, large_n)
A_large = X_large.T @ X_large + np.eye(large_n)

start = time.time()
L_large = cholesky(A_large, lower=True)
_ = log_determinant_cholesky(L_large)
time_chol_logdet = time.time() - start

start = time.time()
_ = np.linalg.slogdet(A_large)
time_slogdet = time.time() - start

print(f"For {large_n}x{large_n} matrix:")
print(f"Cholesky + sum logs: {time_chol_logdet:.4f}s")
print(f"np.slogdet: {time_slogdet:.4f}s")
```

Several variants of Cholesky decomposition address different needs: avoiding square roots, handling indefinite matrices, and numerical robustness.
LDLᵀ Decomposition:
For symmetric (but not necessarily positive definite) matrices: $$A = LDL^T$$
where L is unit lower triangular (ones on the diagonal) and D is diagonal.
Advantages of LDLᵀ: no square roots are needed, and with symmetric pivoting it extends to indefinite symmetric matrices where plain Cholesky breaks down.
Connection to Cholesky: If A is SPD and A = LDLᵀ with D > 0, then: $$A = L D L^T = L D^{1/2} D^{1/2} L^T = (LD^{1/2})(LD^{1/2})^T = \tilde{L}\tilde{L}^T$$
So LDLᵀ is just Cholesky factored into 'no-square-root' form.
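This relationship is easy to verify numerically. Below is a small no-pivot LDLᵀ sketch (valid for SPD input; names are illustrative): scaling column j of the unit factor by √Dⱼ recovers the Cholesky factor:

```python
import numpy as np
from scipy.linalg import cholesky

def ldlt_nopivot(A):
    """Plain LDL^T without pivoting (safe for SPD A)."""
    n = A.shape[0]
    L, D = np.eye(n), np.zeros(n)
    for j in range(n):
        D[j] = A[j, j] - (L[j, :j]**2) @ D[:j]
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - (L[i, :j] * L[j, :j]) @ D[:j]) / D[j]
    return L, D

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 4))
A = X.T @ X + 0.1 * np.eye(4)          # SPD

L_unit, D = ldlt_nopivot(A)
L_tilde = L_unit * np.sqrt(D)          # scale column j by sqrt(D_j)
assert np.allclose(L_tilde, cholesky(A, lower=True))
```

The broadcast `L_unit * np.sqrt(D)` is exactly the L·D^{1/2} product from the identity above.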
Use LDLᵀ when: (1) You want to avoid square roots (faster, more numerically stable for some problems), (2) The matrix might be indefinite (Hessian at non-minimum, covariance with round-off errors), (3) You need to check inertia (number of positive, negative, zero eigenvalues) which D reveals directly.
Modified Cholesky:
For optimization, we often encounter Hessians that should be positive definite at a minimum but may be indefinite during iteration. Modified Cholesky computes: $$A + E = L L^T$$
where E is a 'small' positive semi-definite perturbation that makes A+E positive definite.
Applications of modified Cholesky: guaranteeing descent directions in Newton's method when the Hessian is indefinite, and robustly factoring covariance matrices that have lost definiteness to round-off.
Incomplete Cholesky:
For large sparse matrices, incomplete Cholesky computes an approximate factorization by dropping small elements: $$A \approx \tilde{L}\tilde{L}^T$$
The approximation serves as a preconditioner for iterative methods (conjugate gradient), dramatically speeding convergence.
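A zero-fill-in sketch (often called IC(0)) makes the idea concrete; the 2D Laplacian test matrix here is a standard illustrative choice, and the preconditioner is applied through two triangular solves:

```python
import numpy as np
from scipy.linalg import solve_triangular
from scipy.sparse import diags, identity, kron
from scipy.sparse.linalg import LinearOperator, cg

def incomplete_cholesky(A):
    """IC(0): the Cholesky recurrence, keeping only entries where A is nonzero."""
    n = A.shape[0]
    L = np.zeros((n, n))
    for j in range(n):
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
        for i in range(j + 1, n):
            if A[i, j] != 0:
                L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L

# 2D Laplacian on an m x m grid (SPD, sparse)
m = 10
T = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
A = (kron(identity(m), T) + kron(T, identity(m))).toarray()
b = np.ones(A.shape[0])

L = incomplete_cholesky(A)

def apply_prec(r):
    # Approximate A^{-1} r via the incomplete factor: (LL^T)^{-1} r
    return solve_triangular(L.T, solve_triangular(L, r, lower=True), lower=False)

M = LinearOperator(A.shape, matvec=apply_prec)

it_plain, it_prec = [], []
x_plain, info_plain = cg(A, b, callback=lambda xk: it_plain.append(0))
x_prec, info_prec = cg(A, b, M=M, callback=lambda xk: it_prec.append(0))

print(f"CG iterations without preconditioner: {len(it_plain)}")
print(f"CG iterations with IC(0): {len(it_prec)}")
```

On this problem the IC(0)-preconditioned solve typically converges in a fraction of the plain CG iteration count.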
Block Cholesky:
For partitioned matrices: $$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{pmatrix}$$
This is the basis for efficient block algorithms and parallel implementations.
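A sketch of the 2×2 block recurrence, using uniqueness of the factor to check it against the unblocked result (block sizes are illustrative):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(2)
X = rng.standard_normal((12, 6))
A = X.T @ X + 0.5 * np.eye(6)
A11, A21, A22 = A[:3, :3], A[3:, :3], A[3:, 3:]

# Block recurrence: L11 = chol(A11), L21 = A21 L11^{-T},
# L22 = chol(A22 - L21 L21^T)   (Cholesky of the Schur complement)
L11 = cholesky(A11, lower=True)
L21 = solve_triangular(L11, A21.T, lower=True).T   # solves L11 L21^T = A21^T
L22 = cholesky(A22 - L21 @ L21.T, lower=True)

L = np.block([[L11, np.zeros((3, 3))], [L21, L22]])
assert np.allclose(L, cholesky(A, lower=True))
```

Each step is itself a Cholesky factorization or a triangular solve, which is why the scheme parallelizes and blocks so well.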
```python
import numpy as np
from scipy.linalg import ldl

def ldlt_decomposition(A):
    """
    Compute LDL^T decomposition of symmetric matrix A.

    Returns:
        L: unit lower triangular
        D: diagonal
    """
    n = A.shape[0]
    L = np.eye(n)
    D = np.zeros(n)

    for j in range(n):
        # Compute D[j]
        D[j] = A[j, j] - np.sum(L[j, :j]**2 * D[:j])

        # Compute off-diagonal entries of L
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - np.sum(L[i, :j] * L[j, :j] * D[:j])) / D[j]

    return L, D

def modified_cholesky(A, beta=1e-4):
    """
    Modified Cholesky that handles indefinite matrices by adding
    to the diagonal to ensure positive definiteness.

    Returns:
        L: lower triangular factor
        E_diag: diagonal perturbation added (E = diag(E_diag))
    """
    n = A.shape[0]
    L = np.zeros((n, n))
    E_diag = np.zeros(n)

    for j in range(n):
        # What diagonal entry would we get?
        sum_sq = np.sum(L[j, :j]**2)
        diag_val = A[j, j] - sum_sq

        # If not positive enough, add a perturbation
        if diag_val < beta:
            E_diag[j] = beta - diag_val
            diag_val = beta

        L[j, j] = np.sqrt(diag_val)

        # Off-diagonal entries
        for i in range(j + 1, n):
            sum_prod = np.sum(L[i, :j] * L[j, :j])
            L[i, j] = (A[i, j] - sum_prod) / L[j, j]

    return L, E_diag

# Test LDL^T on SPD matrix
np.random.seed(42)
n = 5
X = np.random.randn(n + 2, n)
A_spd = X.T @ X + 0.1 * np.eye(n)

L, D = ldlt_decomposition(A_spd)
print("LDL^T Decomposition (SPD matrix)")
print("=" * 50)
print(f"D (diagonal): {D.round(4)}")
print(f"All positive (SPD): {np.all(D > 0)}")

# Verify reconstruction
A_reconstructed = L @ np.diag(D) @ L.T
print(f"Reconstruction error: {np.linalg.norm(A_spd - A_reconstructed):.2e}")

# Test on an indefinite matrix
print("\n" + "=" * 50)
print("LDL^T on Indefinite Matrix")
A_indef = np.array([
    [4, 2, -2],
    [2, -1, -1],   # negative diagonal entry makes this genuinely indefinite
    [-2, -1, 5]
])
eigenvalues_indef = np.linalg.eigvalsh(A_indef)
print(f"Eigenvalues: {eigenvalues_indef.round(4)}")
print(f"Indefinite: {np.any(eigenvalues_indef < 0) and np.any(eigenvalues_indef > 0)}")

# SciPy's LDL works for indefinite matrices
lu, d, perm = ldl(A_indef)
print(f"D from LDL: {np.diag(d).round(4)}")

# Modified Cholesky on the indefinite matrix
print("\n" + "=" * 50)
print("Modified Cholesky (making indefinite matrix SPD)")
L_mod, E_diag = modified_cholesky(A_indef)
print(f"Perturbation added to diagonal: {E_diag.round(4)}")

# Verify: A + E = LL^T
E = np.diag(E_diag)
A_modified = A_indef + E
print(f"Modified matrix eigenvalues: {np.linalg.eigvalsh(A_modified).round(4)}")
print(f"||A + E - LL^T||: {np.linalg.norm(A_modified - L_mod @ L_mod.T):.2e}")
```

When data arrives incrementally or changes, we often need to update the Cholesky factor without recomputing from scratch. Rank-1 updates handle adding or removing a single outer product.
Rank-1 Update (Cholupdate):
Given L with A = LLᵀ, compute L' such that: $$A' = A + xx^T = L'L'^T$$
This is a rank-1 update (adding outer product xxᵀ).
The algorithm (simplified):
Process the update column by column using Givens rotations:
For k = 1 to n:
r = sqrt(L[k,k]² + x[k]²)
c = L[k,k] / r
s = x[k] / r
L[k,k] = r
# Rotate remaining entries
For i = k+1 to n:
[L[i,k], x[i]] = [c*L[i,k] + s*x[i], -s*L[i,k] + c*x[i]]
Complexity: O(n²) instead of O(n³) for full refactorization!
Rank-1 Downdate:
Given L with A = LLᵀ, compute L' such that: $$A' = A - xx^T = L'L'^T$$
This is more delicate—the result may not be positive definite if x is 'too large'.
Cholesky downdating (removing a rank-1 component) is numerically less stable than updating. If A - xxᵀ is close to singular or indefinite, the algorithm can fail or produce inaccurate results. Always check that the updated factor is valid by verifying reconstruction error or checking diagonal positivity.
Applications of incremental updates:
1. Online learning: As new data points arrive, update the covariance matrix: $$\Sigma_{new} = \Sigma_{old} + x_{new} x_{new}^T$$
With rank-1 update: O(n²) per point vs O(n³) to refactor.
2. Leave-one-out cross-validation: For each left-out point, downdate the covariance and recompute predictions. With direct updates: O(n²) per fold vs O(n³).
3. Sliding window models: As new data enters and old data leaves, update and downdate simultaneously.
4. Gaussian process conditioning: When adding inducing points or updating observations, incremental Cholesky updates enable efficient model updates.
Row/Column Addition:
When adding a row and column (expanding the matrix): $$A' = \begin{pmatrix} A & b \\ b^T & c \end{pmatrix}$$
The new Cholesky factor is: $$L' = \begin{pmatrix} L & 0 \\ l^T & \ell \end{pmatrix}$$
where Ll = b (solve triangular system) and ℓ = √(c - lᵀl).
Complexity: O(n²) for the triangular solve + O(n) for ℓ.
```python
import numpy as np
import time
from scipy.linalg import cholesky, solve_triangular

def cholesky_update(L, x):
    """
    Rank-1 update: Given L with A = LL^T, compute L' with
    A + xx^T = L'L'^T.

    Uses Givens rotations for numerical stability.
    Works on copies and returns the updated factor.
    """
    n = len(x)
    L = L.copy()
    x = x.copy()

    for k in range(n):
        r = np.sqrt(L[k, k]**2 + x[k]**2)
        c = L[k, k] / r
        s = x[k] / r
        L[k, k] = r

        if k < n - 1:
            # Rotate the rest of column k together with x
            L_col = L[k+1:, k].copy()
            x_rest = x[k+1:].copy()
            L[k+1:, k] = c * L_col + s * x_rest
            x[k+1:] = -s * L_col + c * x_rest

    return L

def cholesky_downdate(L, x):
    """
    Rank-1 downdate: Given L with A = LL^T, compute L' with
    A - xx^T = L'L'^T.

    Warning: May fail if A - xx^T is not positive definite.
    """
    n = len(x)
    L = L.copy()
    x = x.copy()

    for k in range(n):
        r_sq = L[k, k]**2 - x[k]**2
        if r_sq <= 0:
            raise ValueError("Matrix becomes non-positive definite")
        r = np.sqrt(r_sq)
        c = r / L[k, k]
        s = x[k] / L[k, k]
        L[k, k] = r

        if k < n - 1:
            L_col = L[k+1:, k].copy()
            x_rest = x[k+1:].copy()
            L[k+1:, k] = (L_col - s * x_rest) / c
            x[k+1:] = c * x_rest - s * L_col

    return L

def add_row_column(L, b, c):
    """
    Add a row and column to the Cholesky factor.
    If A = LL^T, compute L' for A' = [[A, b], [b^T, c]].
    """
    n = L.shape[0]

    # Solve L l = b for l
    l = solve_triangular(L, b, lower=True)

    # Compute the new diagonal element
    ell_sq = c - l @ l
    if ell_sq <= 0:
        raise ValueError("Extended matrix is not positive definite")
    ell = np.sqrt(ell_sq)

    # Build the new L'
    L_new = np.zeros((n + 1, n + 1))
    L_new[:n, :n] = L
    L_new[n, :n] = l
    L_new[n, n] = ell

    return L_new

# Demonstration
np.random.seed(42)
n = 50

# Initial matrix
X = np.random.randn(n + 10, n)
A = X.T @ X + 0.5 * np.eye(n)
L = cholesky(A, lower=True)

# Rank-1 update
x = np.random.randn(n)
L_updated = cholesky_update(L, x)

# Verify via direct computation
A_updated = A + np.outer(x, x)
L_direct = cholesky(A_updated, lower=True)

print("Cholesky Rank-1 Update")
print("=" * 50)
print(f"Update error ||L_updated - L_direct||: {np.linalg.norm(L_updated - L_direct):.2e}")
print(f"Reconstruction error: {np.linalg.norm(A_updated - L_updated @ L_updated.T):.2e}")

# Row/column addition
b = np.random.randn(n)
c = b @ np.linalg.solve(A, b) + 1.0  # Ensures positive definiteness
L_extended = add_row_column(L, b, c)

# Verify
A_extended = np.block([[A, b.reshape(-1, 1)], [b.reshape(1, -1), np.array([[c]])]])
L_extended_direct = cholesky(A_extended, lower=True)

print("Row/Column Addition")
print(f"Extension error: {np.linalg.norm(L_extended - L_extended_direct):.2e}")

# Timing comparison
n_large = 500
X_large = np.random.randn(n_large + 10, n_large)
A_large = X_large.T @ X_large + 0.5 * np.eye(n_large)
L_large = cholesky(A_large, lower=True)
x_large = np.random.randn(n_large)

# Incremental update
start = time.time()
L_incr = cholesky_update(L_large, x_large)
time_incr = time.time() - start

# Full recomputation
start = time.time()
L_full = cholesky(A_large + np.outer(x_large, x_large), lower=True)
time_full = time.time() - start

print(f"Timing ({n_large}x{n_large} matrix):")
print(f"Incremental update: {time_incr:.4f}s")
print(f"Full recomputation: {time_full:.4f}s")
print(f"Speedup: {time_full / time_incr:.1f}x")
```

Cholesky decomposition is the computational backbone of many ML algorithms involving Gaussian distributions, kernel methods, and positive definite structures.
1. Gaussian Process Regression:
For GP regression with n training points, every prediction requires solving (K + σ²I)α = y, computing predictive variances, and evaluating log|K + σ²I| for the marginal likelihood.
With Cholesky: factor K + σ²I = LLᵀ once at O(n³); each of these quantities then reduces to O(n²) triangular solves, and the log-determinant is simply 2Σᵢ log Lᵢᵢ.
2. Sampling Multivariate Gaussians:
To sample from N(μ, Σ): factor Σ = LLᵀ, draw z ~ N(0, I), and return x = μ + Lz.
Why this works: Cov(Lz) = L·I·Lᵀ = LLᵀ = Σ
3. Linear Regression with Ridge Regularization:
For regularized least squares: minimize ||Xβ - y||² + λ||β||²
Normal equations: (XᵀX + λI)β = Xᵀy
The matrix XᵀX + λI is always SPD (XᵀX is PSD, λI adds positive definiteness). Cholesky solve is optimal here.
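A short sketch of ridge regression solved this way (sizes and names are illustrative), cross-checked against a generic dense solve:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 10))
beta_true = rng.standard_normal(10)
y = X @ beta_true + 0.1 * rng.standard_normal(200)

lam = 1.0
# Normal equations (X^T X + lam*I) beta = X^T y, solved via Cholesky
c = cho_factor(X.T @ X + lam * np.eye(10))
beta = cho_solve(c, X.T @ y)

# Cross-check against a generic dense solve
beta_ref = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)
assert np.allclose(beta, beta_ref)
```

Reusing the factor `c` also makes sweeping over many response vectors y (or leave-one-out residuals) an O(n²) operation per solve.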
For Gaussian processes, the O(n³) Cholesky factorization is the dominant computational cost. This limits vanilla GPs to ~10,000 training points. Scalable GP methods (inducing points, SKI, etc.) work by approximating K to have efficient Cholesky structure.
4. Mahalanobis Distance:
The Mahalanobis distance measures distance under a covariance structure: $$d_M(x, y) = \sqrt{(x-y)^T \Sigma^{-1} (x-y)}$$
With Cholesky: L = chol(Σ), solve Lz = (x-y), then d = ||z||
No explicit inverse needed!
5. Whitening Transformation:
Transform data to have identity covariance: with Σ = LLᵀ, the whitened variable z = L⁻¹(x − μ) satisfies Cov(z) = L⁻¹ Σ L⁻ᵀ = I.
This is used in preprocessing for ICA, decorrelating features before distance-based methods, and whitening steps in vision and signal-processing pipelines.
6. Constrained Optimization (Interior Point Methods):
Newton steps in interior point methods require solving: $$(H + A^T D A) \Delta x = -g$$
where H is Hessian, A is constraint matrix, D is diagonal scaling.
The left-hand side is often SPD, making Cholesky the solver of choice.
7. Fisher Information and Natural Gradient:
The Fisher information matrix F in natural gradient descent is SPD. Computing the natural gradient requires solving Fδ = ∇L, which is efficiently done via Cholesky.
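A minimal sketch, using a damped empirical Fisher built from stand-in per-example gradients (all names here are illustrative):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(4)
d = 8
# Stand-in per-example gradients; damping keeps the empirical Fisher SPD
G = rng.standard_normal((100, d))
F = G.T @ G / 100 + 1e-3 * np.eye(d)
grad = rng.standard_normal(d)

# Natural gradient step: solve F @ delta = grad via Cholesky
delta = cho_solve(cho_factor(F), grad)
```

The factorization also exposes log|F| and F-weighted inner products for free, which is useful when the same F is reused across several gradient steps.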
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def sample_multivariate_gaussian(mean, cov, n_samples=1):
    """
    Sample from multivariate Gaussian N(mean, cov) using Cholesky.
    """
    d = len(mean)
    L = cholesky(cov, lower=True)

    # Sample from standard normal
    z = np.random.randn(n_samples, d)

    # Transform: x = mean + L @ z
    samples = mean + (L @ z.T).T

    return samples

def gaussian_process_predict(X_train, y_train, X_test, kernel_fn, noise_var=0.1):
    """
    Gaussian Process regression using Cholesky.

    Returns:
        mean: predictive mean at test points
        var: predictive variance at test points
        log_marginal: log marginal likelihood
    """
    n_train = len(X_train)

    # Compute kernel matrices
    K = kernel_fn(X_train, X_train) + noise_var * np.eye(n_train)
    K_s = kernel_fn(X_train, X_test)
    K_ss = kernel_fn(X_test, X_test)

    # Cholesky factorization
    L = cholesky(K, lower=True)

    # Solve K^{-1} y via triangular solves
    alpha = solve_triangular(L.T, solve_triangular(L, y_train, lower=True))

    # Predictive mean: K_s^T @ alpha
    mean = K_s.T @ alpha

    # Predictive variance: K_ss - K_s^T @ K^{-1} @ K_s
    v = solve_triangular(L, K_s, lower=True)
    var = np.diag(K_ss) - np.sum(v**2, axis=0)

    # Log marginal likelihood
    log_marginal = -0.5 * y_train @ alpha
    log_marginal -= np.sum(np.log(np.diag(L)))
    log_marginal -= 0.5 * n_train * np.log(2 * np.pi)

    return mean, var, log_marginal

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Radial basis function (squared exponential) kernel."""
    sq_dist = np.sum(X1**2, axis=1, keepdims=True) + np.sum(X2**2, axis=1) - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

# Demo: Sampling from a multivariate Gaussian
np.random.seed(42)
mean = np.array([1, 2])
cov = np.array([[2, 0.8], [0.8, 1]])

samples = sample_multivariate_gaussian(mean, cov, n_samples=1000)
print("Multivariate Gaussian Sampling")
print("=" * 50)
print(f"True mean: {mean}")
print(f"Sample mean: {samples.mean(axis=0).round(3)}")
print(f"True covariance:\n{cov}")
print(f"Sample covariance:\n{np.cov(samples.T).round(3)}")

# Demo: Gaussian Process Regression
print("\n" + "=" * 50)
print("Gaussian Process Regression")

# Training data
X_train = np.linspace(0, 10, 20).reshape(-1, 1)
y_train = np.sin(X_train.flatten()) + 0.1 * np.random.randn(20)

# Test data
X_test = np.linspace(0, 10, 100).reshape(-1, 1)

# Predictions
mean_pred, var_pred, log_ml = gaussian_process_predict(
    X_train, y_train, X_test,
    lambda X1, X2: rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0),
    noise_var=0.01
)

print(f"Log marginal likelihood: {log_ml:.4f}")
print(f"Mean prediction range: [{mean_pred.min():.3f}, {mean_pred.max():.3f}]")
print(f"Predictive std range: [{np.sqrt(var_pred).min():.3f}, {np.sqrt(var_pred).max():.3f}]")

# Demo: Mahalanobis distance
print("\n" + "=" * 50)
print("Mahalanobis Distance")

# Points to compare
x1 = np.array([0, 0])
x2 = np.array([2, 2])
Sigma = np.array([[4, 2], [2, 3]])  # Correlation structure

L = cholesky(Sigma, lower=True)
diff = x2 - x1
z = solve_triangular(L, diff, lower=True)
mahal_dist = np.linalg.norm(z)

euclidean_dist = np.linalg.norm(diff)

print(f"Euclidean distance: {euclidean_dist:.4f}")
print(f"Mahalanobis distance: {mahal_dist:.4f}")
print("(Mahalanobis accounts for correlation structure)")
```

We've explored Cholesky decomposition comprehensively, from its elegant definition through efficient algorithms to critical ML applications.
What's next:
We'll explore Applications of Matrix Decompositions in ML—synthesizing SVD, eigendecomposition, QR, and Cholesky into a unified toolkit for solving dimensionality reduction, recommendation systems, signal processing, and optimization problems in machine learning.
You now understand Cholesky decomposition in depth—when it exists, how to compute it efficiently, its variants, incremental updates, and central role in probabilistic ML. Combined with SVD, eigendecomposition, and QR, you have a complete matrix decomposition toolkit.