Having mastered the four fundamental matrix decompositions—SVD, Eigendecomposition, QR, and Cholesky—we now synthesize this knowledge into a unified toolkit for solving real machine learning problems. These decompositions aren't academic exercises; they form the computational backbone of systems you use every day.
From Netflix's recommendation engine to Google's image search, from speech recognition to drug discovery, matrix decompositions power critical components of production ML systems. Understanding when and how to apply each decomposition—and recognizing problem structures that call for matrix methods—separates practitioners who can solve textbook problems from engineers who can build systems at scale.
This page connects the theoretical foundations we've built to practical applications: dimensionality reduction, collaborative filtering, signal processing, feature extraction, and optimization. We'll see how the same mathematical tools manifest across seemingly different domains, revealing the deep unity underlying diverse ML techniques.
By the end of this page, you will understand: (1) How to choose the right decomposition for your problem, (2) PCA and dimensionality reduction in depth, (3) Collaborative filtering and recommendation systems, (4) Image compression and signal processing, (5) Numerical stability in optimization, and (6) Modern applications in deep learning and NLP.
Each matrix decomposition has distinct strengths. Knowing when to apply each is essential for efficient, stable ML implementations.
Decision framework:
| Problem Type | Primary Decomposition | Why? |
|---|---|---|
| Dimensionality reduction | SVD (or eigendecomposition of covariance) | Optimal low-rank approximation |
| Recommendation systems | SVD / matrix factorization | Latent factor discovery |
| Linear system Ax = b (general) | LU / QR | Stable, efficient for dense |
| Linear system Ax = b (SPD) | Cholesky | Fastest, most stable for SPD |
| Eigenvalues (dynamics, spectra) | Eigendecomposition | Reveals system modes |
| Least squares | QR / SVD | Numerically stable |
| Sampling Gaussians | Cholesky | Most efficient for covariance |
| Graph analysis | Eigendecomposition of Laplacian | Spectral properties |
| Signal denoising | SVD / eigendecomposition | Low-rank approximation |
| Matrix completion | SVD / nuclear norm | Low-rank structure |
Matrix structure guide:
**Any rectangular matrix → SVD.** SVD is the universal tool. When in doubt, try SVD first. It works on any matrix and provides optimal low-rank approximations.

**Square symmetric → Eigendecomposition or Cholesky.** Symmetric matrices have real eigenvalues and orthogonal eigenvectors. If positive definite, Cholesky is fastest for solving systems; eigendecomposition reveals spectral information.

**Square general → LU or Schur.** For eigenvalues of a general square matrix, use the QR algorithm (which produces the Schur form) or direct eigendecomposition; for solving systems, use LU with pivoting.

**Overdetermined system (m > n) → QR.** Least squares via QR is more stable than the normal equations. Use SVD for rank-deficient cases.

**Sparse large → Iterative methods.** For large sparse matrices, iterative methods (Lanczos, Arnoldi) compute dominant eigenpairs without a full factorization.
This covers 90% of practical cases.
Principal Component Analysis (PCA) is perhaps the most widely used dimensionality reduction technique, directly powered by matrix decompositions.
The goal: Given high-dimensional data X ∈ ℝⁿˣᵈ (n samples, d features), find a lower-dimensional representation that preserves maximum variance.
Two equivalent formulations:
1. Maximum variance directions: Find orthonormal vectors w₁, w₂, ... such that projecting data onto wₖ captures maximum remaining variance.
2. Minimum reconstruction error: Find k-dimensional subspace that minimizes ||X - X_reconstructed||².
Both lead to the same solution: eigenvectors of the covariance matrix.
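This equivalence is easy to verify numerically; a minimal sketch on synthetic correlated data (the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))  # correlated features
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)             # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Route 2: SVD of the centered data (no covariance matrix ever formed)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = s**2 / (len(Xc) - 1)

print(np.allclose(eigvals, svd_vals))  # same variances
# Directions agree up to sign: |v_i . V_i| = 1 for each component
print(all(np.isclose(abs(eigvecs[:, i] @ Vt[i]), 1.0) for i in range(5)))
```

Both routes produce the same variances and the same directions (up to sign), which is why the SVD route below is preferred purely on numerical grounds.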
The PCA algorithm:
SVD-based PCA (preferred):
Compute SVD of centered data directly: X_centered = UΣVᵀ
This avoids forming XᵀX, improving numerical stability.
Computing XᵀX squares the condition number: if κ(X) = 10⁸, then κ(XᵀX) = 10¹⁶—a complete loss of precision in double-precision arithmetic. Taking the SVD of X directly keeps the condition number at κ(X), preserving numerical accuracy.
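A quick numerical check of the squaring effect, using a milder κ(X) = 10⁴ so both quantities remain representable (the matrix is synthetic, built with prescribed singular values):

```python
import numpy as np

rng = np.random.default_rng(1)
# Build a 50x5 matrix with singular values spanning 4 orders of magnitude
U, _ = np.linalg.qr(rng.standard_normal((50, 5)))
V, _ = np.linalg.qr(rng.standard_normal((5, 5)))
s = np.array([1e0, 1e-1, 1e-2, 1e-3, 1e-4])
X = U @ np.diag(s) @ V.T

kappa_X = np.linalg.cond(X)           # ~1e4 by construction
kappa_XtX = np.linalg.cond(X.T @ X)   # ~1e8: the condition number squared
print(f"kappa(X)    = {kappa_X:.2e}")
print(f"kappa(X^TX) = {kappa_XtX:.2e}")
```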
Choosing the number of components:
1. Explained variance threshold: Keep k components capturing ≥ 95% (or 99%) of total variance: $$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \geq 0.95$$
2. Elbow method: Plot eigenvalues and look for the 'elbow' where values drop sharply.
3. Cross-validation: Choose k that minimizes held-out reconstruction error.
4. Task-specific: Use downstream task performance (classification accuracy with k components) to select k.
When PCA struggles:
- Nonlinear structure: PCA finds linear subspaces; data lying on curved manifolds calls for kernel PCA or nonlinear dimensionality reduction.
- Scaling sensitivity: features with large variance dominate the components—standardize first.
- Outliers: variance is outlier-sensitive, so a few extreme points can tilt the principal directions.
- Variance ≠ relevance: the highest-variance directions are not always the most discriminative for a downstream task.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

def pca_from_scratch(X, n_components=None):
    """
    PCA implementation using SVD (numerically stable).

    Parameters:
        X: n x d data matrix
        n_components: number of components (default: all)

    Returns:
        Z: projected data (n x n_components)
        components: principal component directions (d x n_components)
        explained_variance_ratio: fraction of variance per component
    """
    n, d = X.shape

    # Center data
    mean = X.mean(axis=0)
    X_centered = X - mean

    # SVD (full_matrices=False for efficiency)
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

    # Explained variance (singular values squared, normalized)
    variance = s**2 / (n - 1)
    explained_variance_ratio = variance / variance.sum()

    # Select components
    if n_components is None:
        n_components = min(n, d)

    # Principal components are rows of Vt (columns of V)
    components = Vt[:n_components].T  # d x n_components

    # Project data
    Z = X_centered @ components  # Equivalent to U[:, :n_components] * s[:n_components]

    return Z, components, explained_variance_ratio[:n_components], mean

def analyze_pca(X, max_components=20):
    """Analyze PCA with visualization."""
    _, _, var_ratio, _ = pca_from_scratch(X)
    cumulative_var = np.cumsum(var_ratio)

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Individual explained variance
    k = min(max_components, len(var_ratio))
    axes[0].bar(range(1, k+1), var_ratio[:k], alpha=0.7)
    axes[0].set_xlabel('Principal Component')
    axes[0].set_ylabel('Explained Variance Ratio')
    axes[0].set_title('Variance Explained by Each Component')

    # Cumulative explained variance
    axes[1].plot(range(1, k+1), cumulative_var[:k], 'bo-')
    axes[1].axhline(y=0.95, color='r', linestyle='--', label='95% threshold')
    axes[1].set_xlabel('Number of Components')
    axes[1].set_ylabel('Cumulative Explained Variance')
    axes[1].set_title('Cumulative Variance Explained')
    axes[1].legend()
    axes[1].grid(True)

    plt.tight_layout()
    plt.show()

    # Report key statistics
    k_95 = np.argmax(cumulative_var >= 0.95) + 1
    k_99 = np.argmax(cumulative_var >= 0.99) + 1
    print(f"Components for 95% variance: {k_95}")
    print(f"Components for 99% variance: {k_99}")
    print(f"Top component explains: {var_ratio[0]*100:.1f}%")

    return var_ratio, cumulative_var

# Load digits dataset (8x8 images = 64 features)
digits = load_digits()
X = digits.data  # 1797 samples, 64 features
y = digits.target

print("Digits Dataset PCA Analysis")
print("=" * 50)
print(f"Shape: {X.shape} (samples x features)")

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Analyze variance
var_ratio, _ = analyze_pca(X_scaled)

# Project to 2D for visualization
Z, components, _, _ = pca_from_scratch(X_scaled, n_components=2)

plt.figure(figsize=(10, 8))
scatter = plt.scatter(Z[:, 0], Z[:, 1], c=y, cmap='tab10', alpha=0.6, s=20)
plt.colorbar(scatter, label='Digit')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('Digits Projected onto First Two Principal Components')
plt.show()

# Reconstruction example
print("\n" + "=" * 50)
print("Reconstruction at Different Component Counts")
k_values = [5, 10, 20, 40, 64]
sample_idx = 0

fig, axes = plt.subplots(1, len(k_values) + 1, figsize=(15, 3))
axes[0].imshow(X[sample_idx].reshape(8, 8), cmap='gray')
axes[0].set_title('Original')
axes[0].axis('off')

for i, k in enumerate(k_values):
    Z_k, comp_k, _, mean_k = pca_from_scratch(X_scaled, n_components=k)
    X_reconstructed = Z_k @ comp_k.T + mean_k
    X_orig_space = scaler.inverse_transform(X_reconstructed)
    axes[i+1].imshow(X_orig_space[sample_idx].reshape(8, 8), cmap='gray')
    axes[i+1].set_title(f'k={k}')
    axes[i+1].axis('off')

plt.suptitle('Reconstruction Quality vs Number of Components', fontsize=12)
plt.tight_layout()
plt.show()
```

The Netflix Prize competition (2006-2009) brought matrix factorization to widespread attention. The winning approaches used SVD and related techniques to predict user preferences from sparse rating data.
The problem setup:
Given a user-item rating matrix R ∈ ℝᵐˣⁿ where:
- rows index m users and columns index n items,
- Rᵢⱼ is user i's rating of item j,
- the vast majority of entries are missing (a typical user rates only a tiny fraction of items).

Goal: Predict missing ratings to recommend items users will likely enjoy.
Matrix Factorization Approach:
Assume the rating matrix has low-rank structure: $$R \approx UV^T$$
where U ∈ ℝᵐˣᵏ holds user factors, V ∈ ℝⁿˣᵏ holds item factors, and k ≪ min(m, n).

Each row of U is a k-dimensional representation of a user's preferences. Each row of V is a k-dimensional representation of an item's characteristics.
Prediction: R̂ᵢⱼ = Uᵢ · Vⱼ (dot product of latent vectors)
Connection to SVD:
If R were fully observed: R = UΣVᵀ (SVD)
User factors: UΣ^(1/2), Item factors: VΣ^(1/2) → R = (UΣ^(1/2))(VΣ^(1/2))ᵀ
With missing data, we minimize reconstruction error on observed entries only: $$\min_{U, V} \sum_{(i,j) \in \Omega} (R_{ij} - U_i \cdot V_j)^2 + \lambda(\|U\|_F^2 + \|V\|_F^2)$$
where Ω is the set of observed entries and λ controls regularization.
The intuition: if user A and user B rate similar movies similarly, they probably share movie preferences (captured by similar latent vectors). If movie X and movie Y are rated similarly by many users, they're probably similar movies. The latent factors encode these implicit relationships through shared dimensions of taste.
Training algorithms:
1. Alternating Least Squares (ALS): Fix U, solve for V (convex); fix V, solve for U (convex). Iterate until convergence. Each step is a ridge regression problem.
2. Stochastic Gradient Descent (SGD): Process observed ratings one at a time, updating U_i and V_j to reduce prediction error. Scales to massive datasets.
3. SVD++ and extensions: Add biases (per-user, per-item), implicit feedback, temporal dynamics. These extensions won the Netflix Prize.
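As a sketch of the ALS idea, here is a minimal implementation on a small dense matrix. This is illustrative only: a real recommender restricts the loss to observed entries, and the shapes and hyperparameters here are arbitrary.

```python
import numpy as np

def als_step(R, F, reg):
    """One ALS half-step: with factor F fixed, the optimal other factor
    is a ridge regression solution, G = R F (F^T F + reg*I)^{-1}."""
    k = F.shape[1]
    A = F.T @ F + reg * np.eye(k)
    return np.linalg.solve(A, F.T @ R.T).T

rng = np.random.default_rng(0)
R = rng.standard_normal((30, 20))      # dense "ratings" for illustration
k, reg = 5, 0.1
U = rng.standard_normal((30, k))
V = rng.standard_normal((20, k))

err0 = np.linalg.norm(R - U @ V.T) / np.linalg.norm(R)
for _ in range(20):
    U = als_step(R, V, reg)            # fix V, solve for U (convex)
    V = als_step(R.T, U, reg)          # fix U, solve for V (convex)
err = np.linalg.norm(R - U @ V.T) / np.linalg.norm(R)
print(f"relative error: {err0:.3f} -> {err:.3f}")
```

Each half-step solves a closed-form linear system, which is why ALS parallelizes well across users and items.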
Evaluation metrics: RMSE and MAE for rating prediction; precision@k, recall@k, and NDCG for ranking quality on held-out interactions.

Modern extensions: factorization machines, neural collaborative filtering, and implicit-feedback models (clicks, watch time) that replace explicit ratings.
```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def svd_recommendation(R, k=10):
    """
    SVD-based recommendation for fully observed matrix.
    In practice, use matrix factorization for sparse data.
    """
    # Full SVD
    U, s, Vt = np.linalg.svd(R, full_matrices=False)

    # Truncate to k factors
    U_k = U[:, :k]
    s_k = s[:k]
    V_k = Vt[:k, :].T

    # Reconstruct: predictions for all user-item pairs
    R_pred = U_k @ np.diag(s_k) @ V_k.T

    return R_pred, U_k @ np.diag(np.sqrt(s_k)), V_k @ np.diag(np.sqrt(s_k))

def matrix_factorization_sgd(R_sparse, k=10, learning_rate=0.01, reg=0.02,
                             epochs=50, verbose=True):
    """
    Matrix factorization via SGD for sparse rating matrix.

    Parameters:
        R_sparse: csr_matrix of observed ratings
        k: latent dimension
        learning_rate: SGD step size
        reg: L2 regularization
        epochs: number of passes over data
    """
    m, n = R_sparse.shape

    # Initialize with small random values
    np.random.seed(42)
    U = np.random.randn(m, k) * 0.1
    V = np.random.randn(n, k) * 0.1

    # Get non-zero entries
    rows, cols = R_sparse.nonzero()
    n_ratings = len(rows)

    history = []
    for epoch in range(epochs):
        # Shuffle training data
        indices = np.random.permutation(n_ratings)
        total_error = 0

        for idx in indices:
            i, j = rows[idx], cols[idx]
            r_ij = R_sparse[i, j]

            # Prediction
            pred = U[i] @ V[j]
            error = r_ij - pred
            total_error += error**2

            # Gradient updates
            U[i] += learning_rate * (error * V[j] - reg * U[i])
            V[j] += learning_rate * (error * U[i] - reg * V[j])

        rmse = np.sqrt(total_error / n_ratings)
        history.append(rmse)
        if verbose and (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}: RMSE = {rmse:.4f}")

    return U, V, history

def recommend_top_k(user_id, U, V, R_observed, k=5):
    """Get top-k recommendations for a user (items not yet rated)."""
    # Predicted ratings for this user
    predictions = U[user_id] @ V.T

    # Mask already-rated items
    rated_items = set(R_observed[user_id].nonzero()[1])
    for item in rated_items:
        predictions[item] = -np.inf

    # Top-k unrated items
    top_items = np.argsort(predictions)[::-1][:k]
    top_scores = predictions[top_items]

    return top_items, top_scores

# Create synthetic recommendation scenario
np.random.seed(42)
n_users, n_items = 200, 100
k_true = 5  # True latent dimension

# Generate low-rank rating matrix with noise
U_true = np.random.randn(n_users, k_true)
V_true = np.random.randn(n_items, k_true)
R_true = U_true @ V_true.T
R_true = (R_true - R_true.min()) / (R_true.max() - R_true.min()) * 4 + 1  # Scale to 1-5

# Add noise
R_noisy = R_true + 0.5 * np.random.randn(n_users, n_items)
R_noisy = np.clip(R_noisy, 1, 5)

# Create sparse version (only 10% observed)
mask = np.random.random((n_users, n_items)) < 0.10
R_sparse = csr_matrix(R_noisy * mask)

print("Recommendation System via Matrix Factorization")
print("=" * 60)
print(f"Users: {n_users}, Items: {n_items}")
print(f"Observed ratings: {mask.sum()} ({mask.mean()*100:.1f}%)")
print(f"True latent dimension: {k_true}")

# Train model
k_model = 10
U_learned, V_learned, history = matrix_factorization_sgd(
    R_sparse, k=k_model, epochs=50, learning_rate=0.01, reg=0.02)

# Evaluate on held-out data
test_mask = np.random.random((n_users, n_items)) < 0.05
test_mask = test_mask & ~mask  # Only truly unobserved
R_pred = U_learned @ V_learned.T

test_rmse = np.sqrt(np.mean((R_noisy[test_mask] - R_pred[test_mask])**2))
print(f"Test RMSE: {test_rmse:.4f}")

# Example recommendations
print("\n=== Recommendations for User 0 ===")
top_items, top_scores = recommend_top_k(0, U_learned, V_learned, R_sparse, k=5)
for item, score in zip(top_items, top_scores):
    true_rating = R_noisy[0, item]
    print(f"Item {item}: Predicted {score:.2f}, True {true_rating:.2f}")
```

Matrix decompositions are fundamental tools in image processing and signal analysis, enabling compression, denoising, and feature extraction.
Image Compression with SVD:
A grayscale image is a matrix of pixel intensities. Color images are three matrices (RGB channels). SVD reveals that many images are effectively low-rank—they can be well-approximated with far fewer parameters.
The math: For m × n image I, SVD gives I = UΣVᵀ.
Quality vs. compression tradeoff: More components → better quality → larger file size. The singular value spectrum determines how much compression is possible without visual degradation.
Image Denoising:
Noisy images have the form: I_noisy = I_true + noise.
If I_true is approximately low-rank, truncating the SVD of the noisy image keeps the dominant components (mostly signal) and discards the small ones (mostly noise).
The assumption: signal lives in low-rank subspace; noise is distributed across all dimensions.
Modern image compression (JPEG, WebP) uses DCT/wavelet transforms, not SVD—they're faster and exploit local structure better. However, SVD remains valuable for: (1) semantic compression (face recognition eigenfaces), (2) video compression (temporal correlations), (3) hyperspectral imaging (spectral correlations), and (4) scientific imaging where interpretability matters.
Eigenfaces for Face Recognition:
A classic application of PCA/eigendecomposition:
The eigenfaces capture principal modes of variation in face appearance: lighting, expression, identity.
Background Subtraction (Video Analysis):
For surveillance video where the background is static, stack the frames as columns of a matrix M and decompose M = L + S: the low-rank part L captures the static background, while the sparse part S captures moving foreground objects.
This is Robust PCA, solved via nuclear norm minimization.
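A minimal Robust PCA sketch using the standard alternating thresholding scheme (an inexact augmented-Lagrangian / principal component pursuit variant; the λ = 1/√max(m,n) default follows the usual heuristic, and the test matrix is synthetic):

```python
import numpy as np

def robust_pca(M, lam=None, mu=None, n_iter=200):
    """Split M into low-rank L plus sparse S by alternating
    singular-value thresholding (L) and soft thresholding (S)."""
    m, n = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = mu if mu is not None else (m * n) / (4.0 * np.abs(M).sum())
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    for _ in range(n_iter):
        # Low-rank update: shrink singular values by 1/mu
        U, s, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(s - 1.0 / mu, 0)) @ Vt
        # Sparse update: entrywise soft thresholding
        T = M - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0)
        # Dual update pushes toward M = L + S
        Y += mu * (M - L - S)
    return L, S

rng = np.random.default_rng(0)
m = n = 60
L0 = rng.standard_normal((m, 2)) @ rng.standard_normal((2, n))  # rank-2 "background"
S0 = np.zeros((m, n))
spots = rng.random((m, n)) < 0.05
S0[spots] = 10 * rng.standard_normal(spots.sum())               # sparse corruption
M = L0 + S0

L, S = robust_pca(M)
rel_err = np.linalg.norm(L - L0) / np.linalg.norm(L0)
print(f"low-rank recovery error: {rel_err:.4f}")
```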
Audio and Signal Processing:
For audio spectrograms and time-frequency representations, low-rank factorizations (SVD, or the related non-negative matrix factorization) separate dominant spectral patterns—useful for denoising and source separation.
```python
import numpy as np
import matplotlib.pyplot as plt

def svd_compress_image(image, k):
    """Compress grayscale image using rank-k SVD approximation."""
    U, s, Vt = np.linalg.svd(image, full_matrices=False)

    # Truncate to k components
    U_k = U[:, :k]
    s_k = s[:k]
    Vt_k = Vt[:k, :]

    # Reconstruct
    compressed = U_k @ np.diag(s_k) @ Vt_k
    return np.clip(compressed, 0, 255).astype(np.uint8)

def svd_denoise(image, threshold_ratio=0.1):
    """Denoise image by zeroing small singular values."""
    U, s, Vt = np.linalg.svd(image, full_matrices=False)

    # Zero out singular values below threshold
    threshold = threshold_ratio * s[0]
    s_denoised = s * (s > threshold)

    # Reconstruct
    denoised = U @ np.diag(s_denoised) @ Vt
    return np.clip(denoised, 0, 255).astype(np.uint8), s, s_denoised

def create_test_image(size=256):
    """Create a test image with geometric patterns."""
    x = np.linspace(0, 4*np.pi, size)
    y = np.linspace(0, 4*np.pi, size)
    X, Y = np.meshgrid(x, y)
    image = 128 + 50*np.sin(X) + 30*np.cos(Y) + 20*np.sin(X+Y)
    return image.astype(np.float64)

# Create test image
image = create_test_image(256)

# Analyze singular value spectrum
U, s, Vt = np.linalg.svd(image, full_matrices=False)

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.semilogy(s, 'b-')
plt.xlabel('Component')
plt.ylabel('Singular value (log scale)')
plt.title('Singular Value Spectrum')
plt.grid(True)

plt.subplot(1, 2, 2)
cumulative = np.cumsum(s**2) / np.sum(s**2)
plt.plot(cumulative, 'g-')
plt.axhline(y=0.99, color='r', linestyle='--', label='99%')
plt.xlabel('Number of components')
plt.ylabel('Cumulative energy')
plt.title('Cumulative Explained Variance')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

# Compression comparison
k_values = [1, 5, 10, 25, 50, 256]  # 256 = full rank
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for idx, k in enumerate(k_values):
    compressed = svd_compress_image(image, k)

    # Compression ratio
    m, n = image.shape
    original_size = m * n
    compressed_size = k * (m + n + 1)
    ratio = original_size / compressed_size

    # Error metrics
    mse = np.mean((image - compressed)**2)
    psnr = 10 * np.log10(255**2 / mse) if mse > 0 else np.inf

    axes[idx].imshow(compressed, cmap='gray', vmin=0, vmax=255)
    axes[idx].set_title(f'k={k}\nRatio: {ratio:.1f}x, PSNR: {psnr:.1f}dB')
    axes[idx].axis('off')

plt.suptitle('SVD Image Compression at Different Ranks', fontsize=14)
plt.tight_layout()
plt.show()

# Denoising demonstration
print("=== Image Denoising ===")
noise_level = 30
noisy_image = image + noise_level * np.random.randn(*image.shape)

denoised, s_noisy, s_clean = svd_denoise(noisy_image, threshold_ratio=0.05)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].imshow(image, cmap='gray', vmin=0, vmax=255)
axes[0].set_title('Original')
axes[0].axis('off')

axes[1].imshow(np.clip(noisy_image, 0, 255), cmap='gray', vmin=0, vmax=255)
axes[1].set_title(f'Noisy (σ={noise_level})')
axes[1].axis('off')

axes[2].imshow(denoised, cmap='gray', vmin=0, vmax=255)
mse_noisy = np.mean((image - noisy_image)**2)
mse_denoised = np.mean((image - denoised)**2)
axes[2].set_title(f'Denoised (MSE: {mse_noisy:.1f} → {mse_denoised:.1f})')
axes[2].axis('off')

plt.tight_layout()
plt.show()
```

Matrix decompositions play critical roles in making optimization algorithms stable and efficient—essential for training ML models.
Newton's Method and Hessian Decomposition:
Newton's method updates: x ← x - H⁻¹∇f
Directly inverting the Hessian H is problematic: explicit inversion costs O(n³), amplifies rounding error compared to solving a linear system, and fails outright when H is singular or indefinite.
Cholesky-based Newton: If H is SPD, factor H = LLᵀ, solve Ly = ∇f by forward substitution, then Lᵀd = y by back substitution, and update x ← x − d.
Cost: O(n³) factorization + O(n²) solves = same as inversion, but more stable.
Handling indefinite Hessians:
At saddle points, H has negative eigenvalues. Solutions: add a damping term λI until H + λI is positive definite (modified Cholesky), flip the sign of negative eigenvalues in the eigendecomposition, or use a trust-region method.
Quasi-Newton and Matrix Updates:
BFGS maintains an approximation B ≈ H (or its inverse) updated via rank-2 modifications: $$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}$$
L-BFGS stores only the last m update vectors, implicitly representing the inverse Hessian approximation through a two-loop recursion—no n × n matrix is ever formed.
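A sketch of the two-loop recursion (the classic formulation; variable names are illustrative). With a single stored pair it satisfies the secant condition H⁻¹yₖ = sₖ exactly, which the demo checks:

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: apply the implicit L-BFGS inverse Hessian to grad
    using only stored curvature pairs (s_k, y_k); no matrix is ever formed."""
    q = grad.astype(float).copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    if s_list:  # scale by gamma = s'y / y'y (initial Hessian guess)
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):
        rho = 1.0 / (y @ s)
        b = rho * (y @ q)
        q += (a - b) * s
    return q  # approximates H^{-1} @ grad

rng = np.random.default_rng(0)
s = rng.standard_normal(8)
y = rng.standard_normal(8)
if y @ s < 0:        # enforce the curvature condition s'y > 0
    y = -y
d = lbfgs_direction(y, [s], [y])
print(np.allclose(d, s))
```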
In deep learning, Hessians can have condition numbers of 10⁶ or higher. This causes: (1) Slow convergence of gradient descent, (2) Exploding/vanishing gradients, (3) Sensitivity to learning rate. Preconditioning (rescaling by approximate H⁻¹) dramatically accelerates convergence.
Preconditioning:
Instead of solving Ax = b directly, solve: $$M^{-1}Ax = M^{-1}b$$
where M approximates A but is easy to invert.
Common preconditioners: Jacobi (the diagonal of A), incomplete Cholesky or incomplete LU (sparse approximate factors), and block-diagonal approximations.
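To see preconditioning in action, here is a hand-rolled preconditioned conjugate gradient with a Jacobi preconditioner on a badly scaled SPD system. This is a minimal sketch—production code would use a library solver—and the test matrix is synthetic:

```python
import numpy as np

def pcg(A, b, M_inv_diag, tol=1e-8, max_iter=2000):
    """Conjugate gradient with a diagonal (Jacobi) preconditioner.
    Pass np.ones(n) as M_inv_diag for plain, unpreconditioned CG."""
    x = np.zeros_like(b)
    r = b.copy()
    z = M_inv_diag * r
    p = z.copy()
    rz = r @ z
    for it in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, it
        z = M_inv_diag * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

# SPD system with a badly scaled diagonal: kappa ~ 1e6
rng = np.random.default_rng(0)
n = 200
d = np.logspace(0, 6, n)
S = rng.standard_normal((n, n))
S = S + S.T
S *= 0.4 / np.linalg.norm(S, 2)   # small symmetric perturbation
A = np.diag(d) + S                 # SPD: smallest eigenvalue >= 1 - 0.4
b = rng.standard_normal(n)

x_plain, it_plain = pcg(A, b, np.ones(n))
x_jac, it_jac = pcg(A, b, 1.0 / np.diag(A))
print(f"plain CG: {it_plain} iterations, Jacobi PCG: {it_jac} iterations")
```

Because the preconditioned system is close to the identity, Jacobi PCG converges in a handful of iterations while plain CG grinds through hundreds.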
SVD for Regularization:
Truncated SVD provides implicit regularization: discarding directions with small singular values filters out the components where noise is most amplified, much as ridge regression shrinks them.
Least squares via SVD: For potentially rank-deficient systems: $$x = V \Sigma^+ U^T b$$
where Σ⁺ has 1/σᵢ for 'large' singular values, 0 for small ones. This gives the minimum-norm solution with controlled noise amplification.
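A minimal sketch of this truncated-SVD pseudoinverse, checked against NumPy's `lstsq` on a deliberately rank-deficient system:

```python
import numpy as np

def svd_least_squares(A, b, rcond=1e-10):
    """Minimum-norm least squares via truncated SVD: x = V Sigma^+ U^T b.
    Singular values below rcond * sigma_max are treated as zero."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    cutoff = rcond * s[0]
    s_inv = np.where(s > cutoff, 1.0 / s, 0.0)
    return Vt.T @ (s_inv * (U.T @ b))

# Rank-deficient system: column 2 duplicates column 1
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))
A[:, 2] = A[:, 1]
b = rng.standard_normal(20)

x = svd_least_squares(A, b, rcond=1e-8)
x_ref = np.linalg.lstsq(A, b, rcond=1e-8)[0]
print(np.allclose(x, x_ref))
```

Both routes return the same minimum-norm solution; a naive solve of the normal equations would fail here because AᵀA is singular.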
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
import matplotlib.pyplot as plt

def newton_step_naive(H, grad):
    """Naive Newton step: directly solve H @ d = grad."""
    return np.linalg.solve(H, grad)

def newton_step_cholesky(H, grad):
    """Newton step via Cholesky (stable for SPD)."""
    try:
        L = cholesky(H, lower=True)
        y = solve_triangular(L, grad, lower=True)
        d = solve_triangular(L.T, y, lower=False)
        return d
    except np.linalg.LinAlgError:
        raise ValueError("Hessian not positive definite")

def newton_step_modified_cholesky(H, grad, beta=1e-6):
    """
    Newton step with modified Cholesky for potentially indefinite H.
    Adds diagonal perturbation to ensure positive definiteness.
    """
    n = H.shape[0]
    # Try Cholesky; if it fails, add to the diagonal
    for i in range(20):
        try:
            L = cholesky(H + beta * np.eye(n), lower=True)
            y = solve_triangular(L, grad, lower=True)
            d = solve_triangular(L.T, y, lower=False)
            return d, beta
        except np.linalg.LinAlgError:
            beta *= 10  # Increase perturbation
    raise ValueError("Could not make Hessian positive definite")

def compare_condition_numbers():
    """Demonstrate effect of condition number on optimization."""
    np.random.seed(42)
    n = 50

    # Create Hessians with different condition numbers
    kappas = [1, 10, 100, 1000, 10000]
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    for kappa in kappas:
        # Create SPD matrix with specified condition number
        Q, _ = np.linalg.qr(np.random.randn(n, n))
        eigenvalues = np.logspace(0, np.log10(kappa), n)[::-1]
        H = Q @ np.diag(eigenvalues) @ Q.T

        # Gradient descent simulation
        x0 = np.random.randn(n)
        x_opt = np.zeros(n)  # True optimum

        # Optimal learning rate for quadratic with Hessian H
        lr = 2 / (eigenvalues[0] + eigenvalues[-1])

        x = x0.copy()
        errors = [np.linalg.norm(x - x_opt)]
        for _ in range(100):
            grad = H @ (x - x_opt)
            x = x - lr * grad
            errors.append(np.linalg.norm(x - x_opt))

        axes[0].semilogy(errors, label=f'κ={kappa}')

    axes[0].set_xlabel('Iteration')
    axes[0].set_ylabel('Error ||x - x*|| (log scale)')
    axes[0].set_title('Gradient Descent Convergence vs Condition Number')
    axes[0].legend()
    axes[0].grid(True)

    # Show eigenvalue spectra
    for kappa in [10, 1000]:
        Q, _ = np.linalg.qr(np.random.randn(n, n))
        eigenvalues = np.logspace(0, np.log10(kappa), n)[::-1]
        axes[1].plot(eigenvalues, 'o-', markersize=3, label=f'κ={kappa}')

    axes[1].set_xlabel('Index')
    axes[1].set_ylabel('Eigenvalue')
    axes[1].set_title('Eigenvalue Spectra')
    axes[1].set_yscale('log')
    axes[1].legend()
    axes[1].grid(True)

    plt.tight_layout()
    plt.show()

# Run comparison
compare_condition_numbers()

# Demonstrate modified Cholesky on indefinite Hessian
print("=== Modified Cholesky for Indefinite Hessian ===")
np.random.seed(42)
n = 10

# Create indefinite symmetric matrix (saddle point Hessian)
Q, _ = np.linalg.qr(np.random.randn(n, n))
eigenvalues = np.array([5, 3, 2, 1, 0.5, -0.3, -0.5, -1, -2, -3])  # Mixed signs
H_indefinite = Q @ np.diag(eigenvalues) @ Q.T

grad = np.random.randn(n)

print(f"Hessian eigenvalues: {np.linalg.eigvalsh(H_indefinite).round(2)}")
print(f"Positive definite: {np.all(np.linalg.eigvalsh(H_indefinite) > 0)}")

try:
    d_chol = newton_step_cholesky(H_indefinite, grad)
    print("Standard Cholesky: SUCCESS (unexpected)")
except ValueError as e:
    print(f"Standard Cholesky: FAILED ({e})")

d_modified, beta_used = newton_step_modified_cholesky(H_indefinite, grad)
print(f"Modified Cholesky: SUCCESS with β = {beta_used:.2e}")
print(f"Modified matrix eigenvalues: "
      f"{np.linalg.eigvalsh(H_indefinite + beta_used * np.eye(n)).round(2)}")
```

Matrix decompositions remain relevant in modern deep learning, appearing in architecture design, efficient implementations, and interpretability.
Low-Rank Factorization of Weight Matrices:
For large neural network layers with weight matrix W ∈ ℝᵐˣⁿ, factor W ≈ AB with A ∈ ℝᵐˣᵏ and B ∈ ℝᵏˣⁿ, so a forward pass computes A(Bx) instead of Wx.
When k << min(m, n), this dramatically reduces model size and inference cost.
Applications: compressing trained models for deployment, parameter-efficient fine-tuning (LoRA), and speeding up inference on memory-bound hardware.
Word Embeddings and SVD:
The famous word2vec can be understood as implicit matrix factorization: skip-gram with negative sampling implicitly factorizes a (shifted) pointwise mutual information matrix of word-context co-occurrences.
GloVe explicitly constructs and factors this matrix.
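A toy sketch of the PPMI-plus-SVD pipeline (the tiny corpus and window size are illustrative; real embeddings need large corpora):

```python
import numpy as np

# Toy corpus; a real pipeline would use a large text collection
sentences = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog played",
    "the dog chased the cat",
]
tokens = [s.split() for s in sentences]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/-2 word window
C = np.zeros((len(vocab), len(vocab)))
window = 2
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                C[idx[w], idx[sent[j]]] += 1

# Positive PMI: max(log P(w,c) / (P(w) P(c)), 0)
total = C.sum()
Pw = C.sum(axis=1) / total
with np.errstate(divide="ignore"):
    pmi = np.log((C / total) / np.outer(Pw, Pw))
ppmi = np.maximum(pmi, 0)  # -inf entries (zero counts) map to 0

# Truncated SVD of the PPMI matrix yields dense word embeddings
U, s, Vt = np.linalg.svd(ppmi)
k = 3
embeddings = U[:, :k] * np.sqrt(s[:k])
print(embeddings.shape)
```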
LoRA (Low-Rank Adaptation) freezes the original weight matrix W and trains only low-rank updates: W' = W + AB. For a 7B parameter LLM, LoRA can reduce trainable parameters to ~0.1% while maintaining performance. This is SVD thinking applied to modern AI!
Transformer Efficiency:
Attention: Attention(Q, K, V) = softmax(QK^T/√d)V
The QK^T matrix is n × n for sequence length n—quadratic cost!
Linear attention approximations: replace the softmax with kernel feature maps φ so attention becomes φ(Q)(φ(K)ᵀV), computable in O(n) memory (e.g., Performer, linear transformers).

SVD-Based Approaches: project keys and values to a lower dimension (e.g., Linformer), exploiting the empirical low-rank structure of attention matrices.
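A sketch contrasting standard softmax attention with a kernelized linear-attention variant (the elu(x)+1 feature map popularized by linear transformers; dimensions are illustrative, and the two outputs approximate rather than match each other):

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: materializes the O(n^2) score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized linear attention: phi(Q) (phi(K)^T V) costs
    O(n d^2) time and O(d^2) extra memory — no n x n matrix."""
    phi = lambda x: np.where(x > 0, x + 1, np.exp(x))  # elu(x) + 1 > 0
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                   # d x d_v summary, built once
    Z = Qp @ Kp.sum(axis=0)         # per-query normalizer
    return (Qp @ KV) / (Z[:, None] + eps)

rng = np.random.default_rng(0)
n, d = 64, 16
Q, K, V = (rng.standard_normal((n, d)) * 0.1 for _ in range(3))
out_soft = softmax_attention(Q, K, V)
out_lin = linear_attention(Q, K, V)
print(out_soft.shape, out_lin.shape)
```

Both produce an n × d output, but the linear variant never forms the n × n score matrix, which is the point for long sequences.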
Batch Normalization and Whitening:
Batch norm's success relates to decorrelation: it normalizes each feature's mean and variance, while full whitening would additionally decorrelate features using the eigendecomposition (or Cholesky factor) of the covariance matrix, as in ZCA whitening.
Weight Initialization:
Xavier/He initialization maintains variance through layers: Xavier draws weights with variance 2/(n_in + n_out) for tanh-like activations, while He uses 2/n_in for ReLU. Orthogonal initialization (QR of a random Gaussian matrix) exactly preserves norms through linear layers.
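A sketch of these initializers, checking that activation scale survives a deep ReLU stack under He initialization (depth and widths are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(n_in, n_out):
    """He initialization for ReLU layers: Var(W) = 2 / n_in."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

def xavier_init(n_in, n_out):
    """Xavier/Glorot initialization: Var(W) = 2 / (n_in + n_out)."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / (n_in + n_out))

# Activation scale through a deep ReLU stack stays roughly constant with He init
n, depth = 512, 20
h = rng.standard_normal((256, n))
rms_in = np.sqrt(np.mean(h**2))
for _ in range(depth):
    h = np.maximum(h @ he_init(n, n), 0)   # linear layer + ReLU
rms_out = np.sqrt(np.mean(h**2))
print(f"RMS activation: {rms_in:.3f} -> {rms_out:.3f}")
```

With a naive Var(W) = 1/n initialization the same stack would shrink activations by roughly 2^(depth/2); the 2/n_in factor exactly compensates for ReLU halving the second moment.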
```python
import numpy as np
import time

def low_rank_layer_forward(x, A, B):
    """
    Forward pass through low-rank factorized layer.
    Original: y = Wx where W = AB
    Factorized: y = A(Bx) — fewer operations if rank << min(m, n)
    """
    return A @ (B @ x)

def compare_layer_sizes(m, n, ranks):
    """Compare full vs low-rank layer efficiency."""
    print(f"Layer: {m} x {n}")
    print("-" * 50)
    print(f"{'Rank':<10} {'Params':<15} {'Compression':<15} {'Time (ms)':<12}")
    print("-" * 50)

    # Full rank
    W_full = np.random.randn(m, n)
    x = np.random.randn(n, 100)  # 100 samples

    start = time.time()
    for _ in range(100):
        y_full = W_full @ x
    time_full = (time.time() - start) * 10  # ms per forward

    full_params = m * n
    print(f"{'Full':<10} {full_params:<15} {'1.0x':<15} {time_full:.2f}")

    for rank in ranks:
        A = np.random.randn(m, rank)
        B = np.random.randn(rank, n)

        start = time.time()
        for _ in range(100):
            y_low_rank = low_rank_layer_forward(x, A, B)
        time_low_rank = (time.time() - start) * 10

        low_rank_params = rank * (m + n)
        compression = full_params / low_rank_params
        print(f"{rank:<10} {low_rank_params:<15} {compression:.1f}x{'':<11} {time_low_rank:.2f}")

# Compare layer sizes
compare_layer_sizes(1024, 4096, [32, 64, 128, 256])

def svd_model_compression(W, rank):
    """
    Compress a weight matrix using SVD truncation.

    Returns:
        A, B: low-rank factors such that W ≈ AB
        compression_ratio: original_params / compressed_params
        reconstruction_error: ||W - AB|| / ||W||
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)

    # Truncate
    A = U[:, :rank] @ np.diag(np.sqrt(s[:rank]))
    B = np.diag(np.sqrt(s[:rank])) @ Vt[:rank, :]

    # Metrics
    W_approx = A @ B
    error = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
    m, n = W.shape
    compression = (m * n) / (rank * (m + n))

    return A, B, compression, error

# Demonstrate compression
print("=== SVD Model Compression ===")
np.random.seed(42)
m, n = 768, 3072  # Typical transformer FFN dimensions

# Create a weight matrix with approximately low-rank structure
true_rank = 50
U_true = np.random.randn(m, true_rank)
V_true = np.random.randn(n, true_rank)
W = U_true @ V_true.T + 0.1 * np.random.randn(m, n)  # Low-rank + noise

print(f"Weight matrix: {m} x {n}")
print(f"Original parameters: {m * n:,}")

for rank in [16, 32, 64, 128]:
    A, B, compression, error = svd_model_compression(W, rank)
    print(f"Rank {rank}: {compression:.1f}x compression, {error*100:.2f}% relative error")

# LoRA-style adaptation demonstration
print("=== LoRA Adaptation Simulation ===")

class LoRALayer:
    """Simulates LoRA adapter for efficient fine-tuning."""

    def __init__(self, W_frozen, rank=16, alpha=16):
        self.W_frozen = W_frozen
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        m, n = W_frozen.shape
        # Initialize low-rank adapters
        self.A = np.random.randn(m, rank) * 0.01
        self.B = np.zeros((rank, n))  # Initialize B to zero

    def forward(self, x):
        # Frozen forward + low-rank update
        return self.W_frozen @ x + self.scaling * (self.A @ (self.B @ x))

    def trainable_params(self):
        m, n = self.W_frozen.shape
        return self.rank * (m + n)

    def total_params(self):
        m, n = self.W_frozen.shape
        return m * n

# Create base model layer
W_base = np.random.randn(4096, 4096)  # Large weight matrix
lora = LoRALayer(W_base, rank=16)

print(f"Base layer params: {lora.total_params():,}")
print(f"LoRA trainable params: {lora.trainable_params():,}")
print(f"Trainable fraction: {100 * lora.trainable_params() / lora.total_params():.2f}%")
```

We've synthesized our knowledge of matrix decompositions into practical applications across machine learning. Let's consolidate the key insights from this module.
Module complete!
You now have a comprehensive understanding of the four fundamental matrix decompositions—SVD, eigendecomposition, QR, and Cholesky—and how they power modern machine learning systems. This linear algebra foundation will serve you across all areas of ML, from classical methods to cutting-edge deep learning.
Next, we'll explore Norms and Distance Metrics—understanding how to measure vector and matrix magnitudes, define similarity, and apply regularization in machine learning.
Congratulations! You've mastered Matrix Decompositions—the computational backbone of machine learning. From SVD's optimal low-rank approximations to Cholesky's efficient positive definite solves, you now have the linear algebra toolkit essential for understanding and implementing ML algorithms at scale.