Having mastered the four fundamental matrix decompositions—SVD, Eigendecomposition, QR, and Cholesky—we now synthesize this knowledge into a unified toolkit for solving real machine learning problems. These decompositions aren't academic exercises; they form the computational backbone of systems you use every day.
From Netflix's recommendation engine to Google's image search, from speech recognition to drug discovery, matrix decompositions power critical components of production ML systems. Understanding when and how to apply each decomposition—and recognizing problem structures that call for matrix methods—separates practitioners who can solve textbook problems from engineers who can build systems at scale.
This page connects the theoretical foundations we've built to practical applications: dimensionality reduction, collaborative filtering, signal processing, feature extraction, and optimization. We'll see how the same mathematical tools manifest across seemingly different domains, revealing the deep unity underlying diverse ML techniques.
By the end of this page, you will understand: (1) How to choose the right decomposition for your problem, (2) PCA and dimensionality reduction in depth, (3) Collaborative filtering and recommendation systems, (4) Image compression and signal processing, (5) Numerical stability in optimization, and (6) Modern applications in deep learning and NLP.
Each matrix decomposition has distinct strengths. Knowing when to apply each is essential for efficient, stable ML implementations.
Decision framework:
| Problem Type | Primary Decomposition | Why? |
|---|---|---|
| Dimensionality reduction | SVD (or eigendecomposition of covariance) | Optimal low-rank approximation |
| Recommendation systems | SVD / matrix factorization | Latent factor discovery |
| Linear system Ax = b (general) | LU / QR | Stable, efficient for dense |
| Linear system Ax = b (SPD) | Cholesky | Fastest, most stable for SPD |
| Eigenvalues (dynamics, spectra) | Eigendecomposition | Reveals system modes |
| Least squares | QR / SVD | Numerically stable |
| Sampling Gaussians | Cholesky | Most efficient for covariance |
| Graph analysis | Eigendecomposition of Laplacian | Spectral properties |
| Signal denoising | SVD / eigendecomposition | Low-rank approximation |
| Matrix completion | SVD / nuclear norm | Low-rank structure |
Matrix structure guide:
**Any rectangular matrix → SVD.** SVD is the universal tool. When in doubt, try SVD first. It works on any matrix and provides optimal low-rank approximations.

**Square symmetric → Eigendecomposition or Cholesky.** Symmetric matrices have real eigenvalues and orthogonal eigenvectors. If positive definite, Cholesky is fastest for solving systems; eigendecomposition reveals spectral information.

**Square general → LU or Schur.** For eigenvalues of a general square matrix, use the QR algorithm (which produces the Schur form) or direct eigendecomposition; for solving systems, use LU with pivoting.

**Overdetermined system (m > n) → QR.** Least squares via QR is more stable than the normal equations. Use SVD for rank-deficient cases.

**Sparse large → Iterative methods.** For large sparse matrices, iterative methods (Lanczos, Arnoldi) compute dominant eigenpairs without a full factorization.
This covers 90% of practical cases.
Principal Component Analysis (PCA) is perhaps the most widely used dimensionality reduction technique, directly powered by matrix decompositions.
The goal: Given high-dimensional data X ∈ ℝⁿˣᵈ (n samples, d features), find a lower-dimensional representation that preserves maximum variance.
Two equivalent formulations:
1. Maximum variance directions: Find orthonormal vectors w₁, w₂, ... such that projecting data onto wₖ captures maximum remaining variance.
2. Minimum reconstruction error: Find k-dimensional subspace that minimizes ||X - X_reconstructed||².
Both lead to the same solution: eigenvectors of the covariance matrix.
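This equivalence is easy to verify numerically; a minimal sketch on synthetic correlated data (the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))  # correlated features
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)             # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Route 2: SVD of the centered data (no covariance matrix ever formed)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = s**2 / (len(Xc) - 1)

print(np.allclose(eigvals, svd_vals))  # same variances
# Directions agree up to sign: |v_i . V_i| = 1 for each component
print(all(np.isclose(abs(eigvecs[:, i] @ Vt[i]), 1.0) for i in range(5)))
```

Both routes produce the same variances and the same directions (up to sign), which is why the SVD route below is preferred purely on numerical grounds.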
The PCA algorithm:
SVD-based PCA (preferred):
Compute SVD of centered data directly: X_centered = UΣVᵀ
This avoids forming XᵀX, improving numerical stability.
Computing XᵀX squares the condition number: if κ(X) = 10⁸, then κ(XᵀX) = 10¹⁶—a complete loss of precision in double-precision arithmetic. Taking the SVD of X directly keeps the condition number at κ(X), preserving numerical accuracy.
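A quick numerical check of the squaring effect, using a milder κ(X) = 10⁴ so both quantities remain representable (the matrix is synthetic, built with prescribed singular values):

```python
import numpy as np

rng = np.random.default_rng(1)
# Build a 50x5 matrix with singular values spanning 4 orders of magnitude
U, _ = np.linalg.qr(rng.standard_normal((50, 5)))
V, _ = np.linalg.qr(rng.standard_normal((5, 5)))
s = np.array([1e0, 1e-1, 1e-2, 1e-3, 1e-4])
X = U @ np.diag(s) @ V.T

kappa_X = np.linalg.cond(X)           # ~1e4 by construction
kappa_XtX = np.linalg.cond(X.T @ X)   # ~1e8: the condition number squared
print(f"kappa(X)    = {kappa_X:.2e}")
print(f"kappa(X^TX) = {kappa_XtX:.2e}")
```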
Choosing the number of components:
1. Explained variance threshold: Keep k components capturing ≥ 95% (or 99%) of total variance: $$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \geq 0.95$$
2. Elbow method: Plot eigenvalues and look for the 'elbow' where values drop sharply.
3. Cross-validation: Choose k that minimizes held-out reconstruction error.
4. Task-specific: Use downstream task performance (classification accuracy with k components) to select k.
When PCA struggles:
- Nonlinear structure: PCA finds linear subspaces; data lying on curved manifolds calls for kernel PCA or nonlinear dimensionality reduction.
- Scaling sensitivity: features with large variance dominate the components—standardize first.
- Outliers: variance is outlier-sensitive, so a few extreme points can tilt the principal directions.
- Variance ≠ relevance: the highest-variance directions are not always the most discriminative for a downstream task.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

def pca_from_scratch(X, n_components=None):
    """
    PCA implementation using SVD (numerically stable).

    Parameters:
        X: n x d data matrix
        n_components: number of components (default: all)

    Returns:
        Z: projected data (n x n_components)
        components: principal component directions (d x n_components)
        explained_variance_ratio: fraction of variance per component
    """
    n, d = X.shape

    # Center data
    mean = X.mean(axis=0)
    X_centered = X - mean

    # SVD (full_matrices=False for efficiency)
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

    # Explained variance (singular values squared, normalized)
    variance = s**2 / (n - 1)
    explained_variance_ratio = variance / variance.sum()

    # Select components
    if n_components is None:
        n_components = min(n, d)

    # Principal components are rows of Vt (columns of V)
    components = Vt[:n_components].T  # d x n_components

    # Project data
    Z = X_centered @ components  # Equivalent to U[:, :n_components] * s[:n_components]

    return Z, components, explained_variance_ratio[:n_components], mean

def analyze_pca(X, max_components=20):
    """Analyze PCA with visualization."""
    _, _, var_ratio, _ = pca_from_scratch(X)
    cumulative_var = np.cumsum(var_ratio)

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Individual explained variance
    k = min(max_components, len(var_ratio))
    axes[0].bar(range(1, k+1), var_ratio[:k], alpha=0.7)
    axes[0].set_xlabel('Principal Component')
    axes[0].set_ylabel('Explained Variance Ratio')
    axes[0].set_title('Variance Explained by Each Component')

    # Cumulative explained variance
    axes[1].plot(range(1, k+1), cumulative_var[:k], 'bo-')
    axes[1].axhline(y=0.95, color='r', linestyle='--', label='95% threshold')
    axes[1].set_xlabel('Number of Components')
    axes[1].set_ylabel('Cumulative Explained Variance')
    axes[1].set_title('Cumulative Variance Explained')
    axes[1].legend()
    axes[1].grid(True)

    plt.tight_layout()
    plt.show()

    # Report key statistics
    k_95 = np.argmax(cumulative_var >= 0.95) + 1
    k_99 = np.argmax(cumulative_var >= 0.99) + 1
    print(f"Components for 95% variance: {k_95}")
    print(f"Components for 99% variance: {k_99}")
    print(f"Top component explains: {var_ratio[0]*100:.1f}%")

    return var_ratio, cumulative_var

# Load digits dataset (8x8 images = 64 features)
digits = load_digits()
X = digits.data  # 1797 samples, 64 features
y = digits.target

print("Digits Dataset PCA Analysis")
print("=" * 50)
print(f"Shape: {X.shape} (samples x features)")

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Analyze variance
var_ratio, _ = analyze_pca(X_scaled)

# Project to 2D for visualization
Z, components, _, _ = pca_from_scratch(X_scaled, n_components=2)

plt.figure(figsize=(10, 8))
scatter = plt.scatter(Z[:, 0], Z[:, 1], c=y, cmap='tab10', alpha=0.6, s=20)
plt.colorbar(scatter, label='Digit')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('Digits Projected onto First Two Principal Components')
plt.show()

# Reconstruction example
print("\n" + "=" * 50)
print("Reconstruction at Different Component Counts")
k_values = [5, 10, 20, 40, 64]
sample_idx = 0

fig, axes = plt.subplots(1, len(k_values) + 1, figsize=(15, 3))
axes[0].imshow(X[sample_idx].reshape(8, 8), cmap='gray')
axes[0].set_title('Original')
axes[0].axis('off')

for i, k in enumerate(k_values):
    Z_k, comp_k, _, mean_k = pca_from_scratch(X_scaled, n_components=k)
    X_reconstructed = Z_k @ comp_k.T + mean_k
    X_orig_space = scaler.inverse_transform(X_reconstructed)
    axes[i+1].imshow(X_orig_space[sample_idx].reshape(8, 8), cmap='gray')
    axes[i+1].set_title(f'k={k}')
    axes[i+1].axis('off')

plt.suptitle('Reconstruction Quality vs Number of Components', fontsize=12)
plt.tight_layout()
plt.show()
```

The Netflix Prize competition (2006-2009) brought matrix factorization to widespread attention. The winning approaches used SVD and related techniques to predict user preferences from sparse rating data.
The problem setup:
Given a user-item rating matrix R ∈ ℝᵐˣⁿ where:
- rows index m users and columns index n items,
- Rᵢⱼ is user i's rating of item j,
- the vast majority of entries are missing (a typical user rates only a tiny fraction of items).

Goal: Predict missing ratings to recommend items users will likely enjoy.
Matrix Factorization Approach:
Assume the rating matrix has low-rank structure: $$R \approx UV^T$$
where U ∈ ℝᵐˣᵏ holds user factors, V ∈ ℝⁿˣᵏ holds item factors, and k ≪ min(m, n).

Each row of U is a k-dimensional representation of a user's preferences. Each row of V is a k-dimensional representation of an item's characteristics.
Prediction: R̂ᵢⱼ = Uᵢ · Vⱼ (dot product of latent vectors)
Connection to SVD:
If R were fully observed: R = UΣVᵀ (SVD)
User factors: UΣ^(1/2), Item factors: VΣ^(1/2) → R = (UΣ^(1/2))(VΣ^(1/2))ᵀ
With missing data, we minimize reconstruction error on observed entries only: $$\min_{U, V} \sum_{(i,j) \in \Omega} (R_{ij} - U_i \cdot V_j)^2 + \lambda(\|U\|_F^2 + \|V\|_F^2)$$
where Ω is the set of observed entries and λ controls regularization.
The intuition: if user A and user B rate similar movies similarly, they probably share movie preferences (captured by similar latent vectors). If movie X and movie Y are rated similarly by many users, they're probably similar movies. The latent factors encode these implicit relationships through shared dimensions of taste.
Training algorithms:
1. Alternating Least Squares (ALS): Fix U, solve for V (convex); fix V, solve for U (convex). Iterate until convergence. Each step is a ridge regression problem.
2. Stochastic Gradient Descent (SGD): Process observed ratings one at a time, updating U_i and V_j to reduce prediction error. Scales to massive datasets.
3. SVD++ and extensions: Add biases (per-user, per-item), implicit feedback, temporal dynamics. These extensions won the Netflix Prize.
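As a sketch of the ALS idea, here is a minimal implementation on a small dense matrix. This is illustrative only: a real recommender restricts the loss to observed entries, and the shapes and hyperparameters here are arbitrary.

```python
import numpy as np

def als_step(R, F, reg):
    """One ALS half-step: with factor F fixed, the optimal other factor
    is a ridge regression solution, G = R F (F^T F + reg*I)^{-1}."""
    k = F.shape[1]
    A = F.T @ F + reg * np.eye(k)
    return np.linalg.solve(A, F.T @ R.T).T

rng = np.random.default_rng(0)
R = rng.standard_normal((30, 20))      # dense "ratings" for illustration
k, reg = 5, 0.1
U = rng.standard_normal((30, k))
V = rng.standard_normal((20, k))

err0 = np.linalg.norm(R - U @ V.T) / np.linalg.norm(R)
for _ in range(20):
    U = als_step(R, V, reg)            # fix V, solve for U (convex)
    V = als_step(R.T, U, reg)          # fix U, solve for V (convex)
err = np.linalg.norm(R - U @ V.T) / np.linalg.norm(R)
print(f"relative error: {err0:.3f} -> {err:.3f}")
```

Each half-step solves a closed-form linear system, which is why ALS parallelizes well across users and items.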
Evaluation metrics: RMSE and MAE for rating prediction; precision@k, recall@k, and NDCG for ranking quality on held-out interactions.

Modern extensions: factorization machines, neural collaborative filtering, and implicit-feedback models (clicks, watch time) that replace explicit ratings.
```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def svd_recommendation(R, k=10):
    """
    SVD-based recommendation for fully observed matrix.
    In practice, use matrix factorization for sparse data.
    """
    # Full SVD
    U, s, Vt = np.linalg.svd(R, full_matrices=False)

    # Truncate to k factors
    U_k = U[:, :k]
    s_k = s[:k]
    V_k = Vt[:k, :].T

    # Reconstruct: predictions for all user-item pairs
    R_pred = U_k @ np.diag(s_k) @ V_k.T

    return R_pred, U_k @ np.diag(np.sqrt(s_k)), V_k @ np.diag(np.sqrt(s_k))

def matrix_factorization_sgd(R_sparse, k=10, learning_rate=0.01, reg=0.02,
                             epochs=50, verbose=True):
    """
    Matrix factorization via SGD for sparse rating matrix.

    Parameters:
        R_sparse: csr_matrix of observed ratings
        k: latent dimension
        learning_rate: SGD step size
        reg: L2 regularization
        epochs: number of passes over data
    """
    m, n = R_sparse.shape

    # Initialize with small random values
    np.random.seed(42)
    U = np.random.randn(m, k) * 0.1
    V = np.random.randn(n, k) * 0.1

    # Get non-zero entries
    rows, cols = R_sparse.nonzero()
    n_ratings = len(rows)

    history = []
    for epoch in range(epochs):
        # Shuffle training data
        indices = np.random.permutation(n_ratings)
        total_error = 0

        for idx in indices:
            i, j = rows[idx], cols[idx]
            r_ij = R_sparse[i, j]

            # Prediction
            pred = U[i] @ V[j]
            error = r_ij - pred
            total_error += error**2

            # Gradient updates
            U[i] += learning_rate * (error * V[j] - reg * U[i])
            V[j] += learning_rate * (error * U[i] - reg * V[j])

        rmse = np.sqrt(total_error / n_ratings)
        history.append(rmse)
        if verbose and (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}: RMSE = {rmse:.4f}")

    return U, V, history

def recommend_top_k(user_id, U, V, R_observed, k=5):
    """Get top-k recommendations for a user (items not yet rated)."""
    # Predicted ratings for this user
    predictions = U[user_id] @ V.T

    # Mask already-rated items
    rated_items = set(R_observed[user_id].nonzero()[1])
    for item in rated_items:
        predictions[item] = -np.inf

    # Top-k unrated items
    top_items = np.argsort(predictions)[::-1][:k]
    top_scores = predictions[top_items]

    return top_items, top_scores

# Create synthetic recommendation scenario
np.random.seed(42)
n_users, n_items = 200, 100
k_true = 5  # True latent dimension

# Generate low-rank rating matrix with noise
U_true = np.random.randn(n_users, k_true)
V_true = np.random.randn(n_items, k_true)
R_true = U_true @ V_true.T
R_true = (R_true - R_true.min()) / (R_true.max() - R_true.min()) * 4 + 1  # Scale to 1-5

# Add noise
R_noisy = R_true + 0.5 * np.random.randn(n_users, n_items)
R_noisy = np.clip(R_noisy, 1, 5)

# Create sparse version (only 10% observed)
mask = np.random.random((n_users, n_items)) < 0.10
R_sparse = csr_matrix(R_noisy * mask)

print("Recommendation System via Matrix Factorization")
print("=" * 60)
print(f"Users: {n_users}, Items: {n_items}")
print(f"Observed ratings: {mask.sum()} ({mask.mean()*100:.1f}%)")
print(f"True latent dimension: {k_true}")

# Train model
k_model = 10
U_learned, V_learned, history = matrix_factorization_sgd(
    R_sparse, k=k_model, epochs=50, learning_rate=0.01, reg=0.02)

# Evaluate on held-out data
test_mask = np.random.random((n_users, n_items)) < 0.05
test_mask = test_mask & ~mask  # Only truly unobserved
R_pred = U_learned @ V_learned.T

test_rmse = np.sqrt(np.mean((R_noisy[test_mask] - R_pred[test_mask])**2))
print(f"Test RMSE: {test_rmse:.4f}")

# Example recommendations
print("\n=== Recommendations for User 0 ===")
top_items, top_scores = recommend_top_k(0, U_learned, V_learned, R_sparse, k=5)
for item, score in zip(top_items, top_scores):
    true_rating = R_noisy[0, item]
    print(f"Item {item}: Predicted {score:.2f}, True {true_rating:.2f}")
```

Matrix decompositions are fundamental tools in image processing and signal analysis, enabling compression, denoising, and feature extraction.
Image Compression with SVD:
A grayscale image is a matrix of pixel intensities. Color images are three matrices (RGB channels). SVD reveals that many images are effectively low-rank—they can be well-approximated with far fewer parameters.
The math: For m × n image I, SVD gives I = UΣVᵀ.
Quality vs. compression tradeoff: More components → better quality → larger file size. The singular value spectrum determines how much compression is possible without visual degradation.
Image Denoising:
Noisy images have the form: I_noisy = I_true + noise.
If I_true is approximately low-rank, truncating the SVD of the noisy image keeps the dominant components (mostly signal) and discards the small ones (mostly noise).
The assumption: signal lives in low-rank subspace; noise is distributed across all dimensions.
Modern image compression (JPEG, WebP) uses DCT/wavelet transforms, not SVD—they're faster and exploit local structure better. However, SVD remains valuable for: (1) semantic compression (face recognition eigenfaces), (2) video compression (temporal correlations), (3) hyperspectral imaging (spectral correlations), and (4) scientific imaging where interpretability matters.
Eigenfaces for Face Recognition:
A classic application of PCA/eigendecomposition:
The eigenfaces capture principal modes of variation in face appearance: lighting, expression, identity.
Background Subtraction (Video Analysis):
For surveillance video where the background is static, stack the frames as columns of a matrix M and decompose M = L + S: the low-rank part L captures the static background, while the sparse part S captures moving foreground objects.
This is Robust PCA, solved via nuclear norm minimization.
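A minimal Robust PCA sketch using the standard alternating thresholding scheme (an inexact augmented-Lagrangian / principal component pursuit variant; the λ = 1/√max(m,n) default follows the usual heuristic, and the test matrix is synthetic):

```python
import numpy as np

def robust_pca(M, lam=None, mu=None, n_iter=200):
    """Split M into low-rank L plus sparse S by alternating
    singular-value thresholding (L) and soft thresholding (S)."""
    m, n = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = mu if mu is not None else (m * n) / (4.0 * np.abs(M).sum())
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    for _ in range(n_iter):
        # Low-rank update: shrink singular values by 1/mu
        U, s, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(s - 1.0 / mu, 0)) @ Vt
        # Sparse update: entrywise soft thresholding
        T = M - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0)
        # Dual update pushes toward M = L + S
        Y += mu * (M - L - S)
    return L, S

rng = np.random.default_rng(0)
m = n = 60
L0 = rng.standard_normal((m, 2)) @ rng.standard_normal((2, n))  # rank-2 "background"
S0 = np.zeros((m, n))
spots = rng.random((m, n)) < 0.05
S0[spots] = 10 * rng.standard_normal(spots.sum())               # sparse corruption
M = L0 + S0

L, S = robust_pca(M)
rel_err = np.linalg.norm(L - L0) / np.linalg.norm(L0)
print(f"low-rank recovery error: {rel_err:.4f}")
```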
Audio and Signal Processing:
For audio spectrograms and time-frequency representations, low-rank factorizations (SVD, or the related non-negative matrix factorization) separate dominant spectral patterns—useful for denoising and source separation.
```python
import numpy as np
import matplotlib.pyplot as plt

def svd_compress_image(image, k):
    """Compress grayscale image using rank-k SVD approximation."""
    U, s, Vt = np.linalg.svd(image, full_matrices=False)

    # Truncate to k components
    U_k = U[:, :k]
    s_k = s[:k]
    Vt_k = Vt[:k, :]

    # Reconstruct
    compressed = U_k @ np.diag(s_k) @ Vt_k
    return np.clip(compressed, 0, 255).astype(np.uint8)

def svd_denoise(image, threshold_ratio=0.1):
    """Denoise image by zeroing small singular values."""
    U, s, Vt = np.linalg.svd(image, full_matrices=False)

    # Zero out singular values below threshold
    threshold = threshold_ratio * s[0]
    s_denoised = s * (s > threshold)

    # Reconstruct
    denoised = U @ np.diag(s_denoised) @ Vt
    return np.clip(denoised, 0, 255).astype(np.uint8), s, s_denoised

def create_test_image(size=256):
    """Create a test image with geometric patterns."""
    x = np.linspace(0, 4*np.pi, size)
    y = np.linspace(0, 4*np.pi, size)
    X, Y = np.meshgrid(x, y)
    image = 128 + 50*np.sin(X) + 30*np.cos(Y) + 20*np.sin(X+Y)
    return image.astype(np.float64)

# Create test image
image = create_test_image(256)

# Analyze singular value spectrum
U, s, Vt = np.linalg.svd(image, full_matrices=False)

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.semilogy(s, 'b-')
plt.xlabel('Component')
plt.ylabel('Singular value (log scale)')
plt.title('Singular Value Spectrum')
plt.grid(True)

plt.subplot(1, 2, 2)
cumulative = np.cumsum(s**2) / np.sum(s**2)
plt.plot(cumulative, 'g-')
plt.axhline(y=0.99, color='r', linestyle='--', label='99%')
plt.xlabel('Number of components')
plt.ylabel('Cumulative energy')
plt.title('Cumulative Explained Variance')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

# Compression comparison
k_values = [1, 5, 10, 25, 50, 256]  # 256 = full rank
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for idx, k in enumerate(k_values):
    compressed = svd_compress_image(image, k)

    # Compression ratio
    m, n = image.shape
    original_size = m * n
    compressed_size = k * (m + n + 1)
    ratio = original_size / compressed_size

    # Error metrics
    mse = np.mean((image - compressed)**2)
    psnr = 10 * np.log10(255**2 / mse) if mse > 0 else np.inf

    axes[idx].imshow(compressed, cmap='gray', vmin=0, vmax=255)
    axes[idx].set_title(f'k={k}\nRatio: {ratio:.1f}x, PSNR: {psnr:.1f}dB')
    axes[idx].axis('off')

plt.suptitle('SVD Image Compression at Different Ranks', fontsize=14)
plt.tight_layout()
plt.show()

# Denoising demonstration
print("=== Image Denoising ===")
noise_level = 30
noisy_image = image + noise_level * np.random.randn(*image.shape)

denoised, s_noisy, s_clean = svd_denoise(noisy_image, threshold_ratio=0.05)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].imshow(image, cmap='gray', vmin=0, vmax=255)
axes[0].set_title('Original')
axes[0].axis('off')

axes[1].imshow(np.clip(noisy_image, 0, 255), cmap='gray', vmin=0, vmax=255)
axes[1].set_title(f'Noisy (σ={noise_level})')
axes[1].axis('off')

axes[2].imshow(denoised, cmap='gray', vmin=0, vmax=255)
mse_noisy = np.mean((image - noisy_image)**2)
mse_denoised = np.mean((image - denoised)**2)
axes[2].set_title(f'Denoised (MSE: {mse_noisy:.1f} → {mse_denoised:.1f})')
axes[2].axis('off')

plt.tight_layout()
plt.show()
```

Matrix decompositions play critical roles in making optimization algorithms stable and efficient—essential for training ML models.
Newton's Method and Hessian Decomposition:
Newton's method updates: x ← x - H⁻¹∇f
Directly inverting the Hessian H is problematic: explicit inversion costs O(n³), amplifies rounding error compared to solving a linear system, and fails outright when H is singular or indefinite.
Cholesky-based Newton: If H is SPD, factor H = LLᵀ, solve Ly = ∇f by forward substitution, then Lᵀd = y by back substitution, and update x ← x − d.
Cost: O(n³) factorization + O(n²) solves = same as inversion, but more stable.
Handling indefinite Hessians:
At saddle points, H has negative eigenvalues. Solutions: add a damping term λI until H + λI is positive definite (modified Cholesky), flip the sign of negative eigenvalues in the eigendecomposition, or use a trust-region method.
Quasi-Newton and Matrix Updates:
BFGS maintains an approximation B ≈ H (or its inverse) updated via rank-2 modifications: $$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}$$
L-BFGS stores only the last m update vectors, implicitly representing the inverse Hessian approximation through a two-loop recursion—no n × n matrix is ever formed.
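A sketch of the two-loop recursion (the classic formulation; variable names are illustrative). With a single stored pair it satisfies the secant condition H⁻¹yₖ = sₖ exactly, which the demo checks:

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: apply the implicit L-BFGS inverse Hessian to grad
    using only stored curvature pairs (s_k, y_k); no matrix is ever formed."""
    q = grad.astype(float).copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    if s_list:  # scale by gamma = s'y / y'y (initial Hessian guess)
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):
        rho = 1.0 / (y @ s)
        b = rho * (y @ q)
        q += (a - b) * s
    return q  # approximates H^{-1} @ grad

rng = np.random.default_rng(0)
s = rng.standard_normal(8)
y = rng.standard_normal(8)
if y @ s < 0:        # enforce the curvature condition s'y > 0
    y = -y
d = lbfgs_direction(y, [s], [y])
print(np.allclose(d, s))
```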
In deep learning, Hessians can have condition numbers of 10⁶ or higher. This causes: (1) Slow convergence of gradient descent, (2) Exploding/vanishing gradients, (3) Sensitivity to learning rate. Preconditioning (rescaling by approximate H⁻¹) dramatically accelerates convergence.
Preconditioning:
Instead of solving Ax = b directly, solve: $$M^{-1}Ax = M^{-1}b$$
where M approximates A but is easy to invert.
Common preconditioners: Jacobi (the diagonal of A), incomplete Cholesky or incomplete LU (sparse approximate factors), and block-diagonal approximations.
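To see preconditioning in action, here is a hand-rolled preconditioned conjugate gradient with a Jacobi preconditioner on a badly scaled SPD system. This is a minimal sketch—production code would use a library solver—and the test matrix is synthetic:

```python
import numpy as np

def pcg(A, b, M_inv_diag, tol=1e-8, max_iter=2000):
    """Conjugate gradient with a diagonal (Jacobi) preconditioner.
    Pass np.ones(n) as M_inv_diag for plain, unpreconditioned CG."""
    x = np.zeros_like(b)
    r = b.copy()
    z = M_inv_diag * r
    p = z.copy()
    rz = r @ z
    for it in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, it
        z = M_inv_diag * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

# SPD system with a badly scaled diagonal: kappa ~ 1e6
rng = np.random.default_rng(0)
n = 200
d = np.logspace(0, 6, n)
S = rng.standard_normal((n, n))
S = S + S.T
S *= 0.4 / np.linalg.norm(S, 2)   # small symmetric perturbation
A = np.diag(d) + S                 # SPD: smallest eigenvalue >= 1 - 0.4
b = rng.standard_normal(n)

x_plain, it_plain = pcg(A, b, np.ones(n))
x_jac, it_jac = pcg(A, b, 1.0 / np.diag(A))
print(f"plain CG: {it_plain} iterations, Jacobi PCG: {it_jac} iterations")
```

Because the preconditioned system is close to the identity, Jacobi PCG converges in a handful of iterations while plain CG grinds through hundreds.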
SVD for Regularization:
Truncated SVD provides implicit regularization: discarding directions with small singular values filters out the components where noise is most amplified, much as ridge regression shrinks them.
Least squares via SVD: For potentially rank-deficient systems: $$x = V \Sigma^+ U^T b$$
where Σ⁺ has 1/σᵢ for 'large' singular values, 0 for small ones. This gives the minimum-norm solution with controlled noise amplification.
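A minimal sketch of this truncated-SVD pseudoinverse, checked against NumPy's `lstsq` on a deliberately rank-deficient system:

```python
import numpy as np

def svd_least_squares(A, b, rcond=1e-10):
    """Minimum-norm least squares via truncated SVD: x = V Sigma^+ U^T b.
    Singular values below rcond * sigma_max are treated as zero."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    cutoff = rcond * s[0]
    s_inv = np.where(s > cutoff, 1.0 / s, 0.0)
    return Vt.T @ (s_inv * (U.T @ b))

# Rank-deficient system: column 2 duplicates column 1
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))
A[:, 2] = A[:, 1]
b = rng.standard_normal(20)

x = svd_least_squares(A, b, rcond=1e-8)
x_ref = np.linalg.lstsq(A, b, rcond=1e-8)[0]
print(np.allclose(x, x_ref))
```

Both routes return the same minimum-norm solution; a naive solve of the normal equations would fail here because AᵀA is singular.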
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
import matplotlib.pyplot as plt

def newton_step_naive(H, grad):
    """Naive Newton step: directly solve H @ d = grad."""
    return np.linalg.solve(H, grad)

def newton_step_cholesky(H, grad):
    """Newton step via Cholesky (stable for SPD)."""
    try:
        L = cholesky(H, lower=True)
        y = solve_triangular(L, grad, lower=True)
        d = solve_triangular(L.T, y, lower=False)
        return d
    except np.linalg.LinAlgError:
        raise ValueError("Hessian not positive definite")

def newton_step_modified_cholesky(H, grad, beta=1e-6):
    """
    Newton step with modified Cholesky for potentially indefinite H.
    Adds diagonal perturbation to ensure positive definiteness.
    """
    n = H.shape[0]
    # Try Cholesky; if it fails, add to the diagonal
    for i in range(20):
        try:
            L = cholesky(H + beta * np.eye(n), lower=True)
            y = solve_triangular(L, grad, lower=True)
            d = solve_triangular(L.T, y, lower=False)
            return d, beta
        except np.linalg.LinAlgError:
            beta *= 10  # Increase perturbation
    raise ValueError("Could not make Hessian positive definite")

def compare_condition_numbers():
    """Demonstrate effect of condition number on optimization."""
    np.random.seed(42)
    n = 50

    # Create Hessians with different condition numbers
    kappas = [1, 10, 100, 1000, 10000]
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    for kappa in kappas:
        # Create SPD matrix with specified condition number
        Q, _ = np.linalg.qr(np.random.randn(n, n))
        eigenvalues = np.logspace(0, np.log10(kappa), n)[::-1]
        H = Q @ np.diag(eigenvalues) @ Q.T

        # Gradient descent simulation
        x0 = np.random.randn(n)
        x_opt = np.zeros(n)  # True optimum

        # Optimal learning rate for quadratic with Hessian H
        lr = 2 / (eigenvalues[0] + eigenvalues[-1])

        x = x0.copy()
        errors = [np.linalg.norm(x - x_opt)]
        for _ in range(100):
            grad = H @ (x - x_opt)
            x = x - lr * grad
            errors.append(np.linalg.norm(x - x_opt))

        axes[0].semilogy(errors, label=f'κ={kappa}')

    axes[0].set_xlabel('Iteration')
    axes[0].set_ylabel('Error ||x - x*|| (log scale)')
    axes[0].set_title('Gradient Descent Convergence vs Condition Number')
    axes[0].legend()
    axes[0].grid(True)

    # Show eigenvalue spectra
    for kappa in [10, 1000]:
        Q, _ = np.linalg.qr(np.random.randn(n, n))
        eigenvalues = np.logspace(0, np.log10(kappa), n)[::-1]
        axes[1].plot(eigenvalues, 'o-', markersize=3, label=f'κ={kappa}')

    axes[1].set_xlabel('Index')
    axes[1].set_ylabel('Eigenvalue')
    axes[1].set_title('Eigenvalue Spectra')
    axes[1].set_yscale('log')
    axes[1].legend()
    axes[1].grid(True)

    plt.tight_layout()
    plt.show()

# Run comparison
compare_condition_numbers()

# Demonstrate modified Cholesky on indefinite Hessian
print("=== Modified Cholesky for Indefinite Hessian ===")
np.random.seed(42)
n = 10

# Create indefinite symmetric matrix (saddle point Hessian)
Q, _ = np.linalg.qr(np.random.randn(n, n))
eigenvalues = np.array([5, 3, 2, 1, 0.5, -0.3, -0.5, -1, -2, -3])  # Mixed signs
H_indefinite = Q @ np.diag(eigenvalues) @ Q.T

grad = np.random.randn(n)

print(f"Hessian eigenvalues: {np.linalg.eigvalsh(H_indefinite).round(2)}")
print(f"Positive definite: {np.all(np.linalg.eigvalsh(H_indefinite) > 0)}")

try:
    d_chol = newton_step_cholesky(H_indefinite, grad)
    print("Standard Cholesky: SUCCESS (unexpected)")
except ValueError as e:
    print(f"Standard Cholesky: FAILED ({e})")

d_modified, beta_used = newton_step_modified_cholesky(H_indefinite, grad)
print(f"Modified Cholesky: SUCCESS with β = {beta_used:.2e}")
print(f"Modified matrix eigenvalues: "
      f"{np.linalg.eigvalsh(H_indefinite + beta_used * np.eye(n)).round(2)}")
```

Matrix decompositions remain relevant in modern deep learning, appearing in architecture design, efficient implementations, and interpretability.
Low-Rank Factorization of Weight Matrices:
For large neural network layers with weight matrix W ∈ ℝᵐˣⁿ, factor W ≈ AB with A ∈ ℝᵐˣᵏ and B ∈ ℝᵏˣⁿ, so a forward pass computes A(Bx) instead of Wx.
When k << min(m, n), this dramatically reduces model size and inference cost.
Applications: compressing trained models for deployment, parameter-efficient fine-tuning (LoRA), and speeding up inference on memory-bound hardware.
Word Embeddings and SVD:
The famous word2vec can be understood as implicit matrix factorization: skip-gram with negative sampling implicitly factorizes a (shifted) pointwise mutual information matrix of word-context co-occurrences.
GloVe explicitly constructs and factors this matrix.
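A toy sketch of the PPMI-plus-SVD pipeline (the tiny corpus and window size are illustrative; real embeddings need large corpora):

```python
import numpy as np

# Toy corpus; a real pipeline would use a large text collection
sentences = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog played",
    "the dog chased the cat",
]
tokens = [s.split() for s in sentences]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/-2 word window
C = np.zeros((len(vocab), len(vocab)))
window = 2
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                C[idx[w], idx[sent[j]]] += 1

# Positive PMI: max(log P(w,c) / (P(w) P(c)), 0)
total = C.sum()
Pw = C.sum(axis=1) / total
with np.errstate(divide="ignore"):
    pmi = np.log((C / total) / np.outer(Pw, Pw))
ppmi = np.maximum(pmi, 0)  # -inf entries (zero counts) map to 0

# Truncated SVD of the PPMI matrix yields dense word embeddings
U, s, Vt = np.linalg.svd(ppmi)
k = 3
embeddings = U[:, :k] * np.sqrt(s[:k])
print(embeddings.shape)
```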
LoRA (Low-Rank Adaptation) freezes the original weight matrix W and trains only low-rank updates: W' = W + AB. For a 7B parameter LLM, LoRA can reduce trainable parameters to ~0.1% while maintaining performance. This is SVD thinking applied to modern AI!
Transformer Efficiency:
Attention: Attention(Q, K, V) = softmax(QK^T/√d)V
The QK^T matrix is n × n for sequence length n—quadratic cost!
Linear attention approximations: replace the softmax with kernel feature maps φ so attention becomes φ(Q)(φ(K)ᵀV), computable in O(n) memory (e.g., Performer, linear transformers).

SVD-Based Approaches: project keys and values to a lower dimension (e.g., Linformer), exploiting the empirical low-rank structure of attention matrices.
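A sketch contrasting standard softmax attention with a kernelized linear-attention variant (the elu(x)+1 feature map popularized by linear transformers; dimensions are illustrative, and the two outputs approximate rather than match each other):

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: materializes the O(n^2) score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized linear attention: phi(Q) (phi(K)^T V) costs
    O(n d^2) time and O(d^2) extra memory — no n x n matrix."""
    phi = lambda x: np.where(x > 0, x + 1, np.exp(x))  # elu(x) + 1 > 0
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                   # d x d_v summary, built once
    Z = Qp @ Kp.sum(axis=0)         # per-query normalizer
    return (Qp @ KV) / (Z[:, None] + eps)

rng = np.random.default_rng(0)
n, d = 64, 16
Q, K, V = (rng.standard_normal((n, d)) * 0.1 for _ in range(3))
out_soft = softmax_attention(Q, K, V)
out_lin = linear_attention(Q, K, V)
print(out_soft.shape, out_lin.shape)
```

Both produce an n × d output, but the linear variant never forms the n × n score matrix, which is the point for long sequences.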
Batch Normalization and Whitening:
Batch norm's success relates to decorrelation: it normalizes each feature's mean and variance, while full whitening would additionally decorrelate features using the eigendecomposition (or Cholesky factor) of the covariance matrix, as in ZCA whitening.
Weight Initialization:
Xavier/He initialization maintains variance through layers: Xavier draws weights with variance 2/(n_in + n_out) for tanh-like activations, while He uses 2/n_in for ReLU. Orthogonal initialization (QR of a random Gaussian matrix) exactly preserves norms through linear layers.
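A sketch of these initializers, checking that activation scale survives a deep ReLU stack under He initialization (depth and widths are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(n_in, n_out):
    """He initialization for ReLU layers: Var(W) = 2 / n_in."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

def xavier_init(n_in, n_out):
    """Xavier/Glorot initialization: Var(W) = 2 / (n_in + n_out)."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / (n_in + n_out))

# Activation scale through a deep ReLU stack stays roughly constant with He init
n, depth = 512, 20
h = rng.standard_normal((256, n))
rms_in = np.sqrt(np.mean(h**2))
for _ in range(depth):
    h = np.maximum(h @ he_init(n, n), 0)   # linear layer + ReLU
rms_out = np.sqrt(np.mean(h**2))
print(f"RMS activation: {rms_in:.3f} -> {rms_out:.3f}")
```

With a naive Var(W) = 1/n initialization the same stack would shrink activations by roughly 2^(depth/2); the 2/n_in factor exactly compensates for ReLU halving the second moment.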
```python
import numpy as np
import time

def low_rank_layer_forward(x, A, B):
    """
    Forward pass through low-rank factorized layer.
    Original: y = Wx where W = AB
    Factorized: y = A(Bx) — fewer operations if rank << min(m, n)
    """
    return A @ (B @ x)

def compare_layer_sizes(m, n, ranks):
    """Compare full vs low-rank layer efficiency."""
    print(f"Layer: {m} x {n}")
    print("-" * 50)
    print(f"{'Rank':<10} {'Params':<15} {'Compression':<15} {'Time (ms)':<12}")
    print("-" * 50)

    # Full rank
    W_full = np.random.randn(m, n)
    x = np.random.randn(n, 100)  # 100 samples

    start = time.time()
    for _ in range(100):
        y_full = W_full @ x
    time_full = (time.time() - start) * 10  # ms per forward

    full_params = m * n
    print(f"{'Full':<10} {full_params:<15} {'1.0x':<15} {time_full:.2f}")

    for rank in ranks:
        A = np.random.randn(m, rank)
        B = np.random.randn(rank, n)

        start = time.time()
        for _ in range(100):
            y_low_rank = low_rank_layer_forward(x, A, B)
        time_low_rank = (time.time() - start) * 10

        low_rank_params = rank * (m + n)
        compression = full_params / low_rank_params
        print(f"{rank:<10} {low_rank_params:<15} {compression:.1f}x{'':<11} {time_low_rank:.2f}")

# Compare layer sizes
compare_layer_sizes(1024, 4096, [32, 64, 128, 256])

def svd_model_compression(W, rank):
    """
    Compress a weight matrix using SVD truncation.

    Returns:
        A, B: low-rank factors such that W ≈ AB
        compression_ratio: original_params / compressed_params
        reconstruction_error: ||W - AB|| / ||W||
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)

    # Truncate
    A = U[:, :rank] @ np.diag(np.sqrt(s[:rank]))
    B = np.diag(np.sqrt(s[:rank])) @ Vt[:rank, :]

    # Metrics
    W_approx = A @ B
    error = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
    m, n = W.shape
    compression = (m * n) / (rank * (m + n))

    return A, B, compression, error

# Demonstrate compression
print("=== SVD Model Compression ===")
np.random.seed(42)
m, n = 768, 3072  # Typical transformer FFN dimensions

# Create a weight matrix with approximately low-rank structure
true_rank = 50
U_true = np.random.randn(m, true_rank)
V_true = np.random.randn(n, true_rank)
W = U_true @ V_true.T + 0.1 * np.random.randn(m, n)  # Low-rank + noise

print(f"Weight matrix: {m} x {n}")
print(f"Original parameters: {m * n:,}")

for rank in [16, 32, 64, 128]:
    A, B, compression, error = svd_model_compression(W, rank)
    print(f"Rank {rank}: {compression:.1f}x compression, {error*100:.2f}% relative error")

# LoRA-style adaptation demonstration
print("=== LoRA Adaptation Simulation ===")

class LoRALayer:
    """Simulates LoRA adapter for efficient fine-tuning."""

    def __init__(self, W_frozen, rank=16, alpha=16):
        self.W_frozen = W_frozen
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        m, n = W_frozen.shape
        # Initialize low-rank adapters
        self.A = np.random.randn(m, rank) * 0.01
        self.B = np.zeros((rank, n))  # Initialize B to zero

    def forward(self, x):
        # Frozen forward + low-rank update
        return self.W_frozen @ x + self.scaling * (self.A @ (self.B @ x))

    def trainable_params(self):
        m, n = self.W_frozen.shape
        return self.rank * (m + n)

    def total_params(self):
        m, n = self.W_frozen.shape
        return m * n

# Create base model layer
W_base = np.random.randn(4096, 4096)  # Large weight matrix
lora = LoRALayer(W_base, rank=16)

print(f"Base layer params: {lora.total_params():,}")
print(f"LoRA trainable params: {lora.trainable_params():,}")
print(f"Trainable fraction: {100 * lora.trainable_params() / lora.total_params():.2f}%")
```

We've synthesized our knowledge of matrix decompositions into practical applications across machine learning. Let's consolidate the key insights from this module.
Module complete!
You now have a comprehensive understanding of the four fundamental matrix decompositions—SVD, eigendecomposition, QR, and Cholesky—and how they power modern machine learning systems. This linear algebra foundation will serve you across all areas of ML, from classical methods to cutting-edge deep learning.
Next, we'll explore Norms and Distance Metrics—understanding how to measure vector and matrix magnitudes, define similarity, and apply regularization in machine learning.
Congratulations! You've mastered Matrix Decompositions—the computational backbone of machine learning. From SVD's optimal low-rank approximations to Cholesky's efficient positive definite solves, you now have the linear algebra toolkit essential for understanding and implementing ML algorithms at scale.