If you were to learn only one matrix decomposition technique for machine learning, Singular Value Decomposition (SVD) would be the unambiguous choice. It is, without exaggeration, the most versatile and widely-used matrix factorization in applied mathematics and data science.
SVD reveals the fundamental structure of any matrix—not just square matrices with special properties, but any matrix of any shape. It decomposes a matrix into three simpler matrices that expose the underlying geometry: the directions of maximum variance, the magnitudes of stretching along those directions, and how inputs and outputs relate through the transformation.
From Netflix's recommendation engine that powered its famous million-dollar prize competition to Google's latent semantic indexing for search, from image compression algorithms to noise reduction in signal processing, SVD is the mathematical workhorse operating behind countless real-world systems.
By the end of this page, you will understand: (1) The formal definition and existence theorem of SVD, (2) The geometric interpretation of what SVD reveals about linear transformations, (3) How to compute SVD step-by-step, (4) The relationship between SVD and eigendecomposition, (5) Low-rank approximations and the Eckart-Young-Mirsky theorem, and (6) Why SVD is computationally preferred in many applications.
Let's begin with the precise mathematical statement. Given any matrix A ∈ ℝ^(m×n), there exist orthogonal matrices U ∈ ℝ^(m×m) and V ∈ ℝ^(n×n), and a diagonal matrix Σ ∈ ℝ^(m×n) such that:
$$A = U \Sigma V^T$$
This factorization is called the Singular Value Decomposition of A.
Let's unpack each component:
The matrix U (Left Singular Vectors): an m × m orthogonal matrix whose columns u₁, ..., u_m are the left singular vectors; they form an orthonormal basis for the output space ℝ^m.
The matrix Σ (Singular Values): an m × n rectangular diagonal matrix whose diagonal entries σ₁ ≥ σ₂ ≥ ... ≥ 0 are the singular values, conventionally sorted in decreasing order.
The matrix V (Right Singular Vectors): an n × n orthogonal matrix whose columns v₁, ..., v_n are the right singular vectors; they form an orthonormal basis for the input space ℝ^n.
A remarkable property of SVD is that it exists for EVERY matrix—rectangular or square, full-rank or rank-deficient, symmetric or asymmetric. This universality is what makes SVD so powerful. The eigendecomposition, by contrast, applies only to square matrices, fails entirely for defective (non-diagonalizable) ones, and may require complex numbers even for real inputs. SVD always works with real numbers for real matrices.
Understanding the dimensions:
For a matrix A of size m × n with rank r: there are exactly r nonzero singular values, σ₁ ≥ σ₂ ≥ ... ≥ σᵣ > 0, and the remaining min(m, n) − r singular values are zero; only the first r columns of U and V are needed to reconstruct A.
This leads to different forms of SVD:
Full SVD: U is m × m, Σ is m × n, V is n × n
Economy (Thin) SVD: U is m × min(m, n), Σ is min(m, n) × min(m, n), V is n × min(m, n); the compact SVD goes further and keeps only the r columns corresponding to nonzero singular values.
The economy SVD is far more storage-efficient when m and n differ greatly (and the compact form when r << min(m, n)); it is what numpy.linalg.svd and scipy.linalg.svd return when called with full_matrices=False.
```python
import numpy as np
from scipy.linalg import svd

# Create a sample matrix
A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]
])

print("Original Matrix A:")
print(A)
print(f"Shape: {A.shape}")  # (4, 3) - 4 rows, 3 columns

# Compute Full SVD
U_full, s_full, Vt_full = svd(A, full_matrices=True)
print(f"\nFull SVD:")
print(f"U shape: {U_full.shape}")    # (4, 4)
print(f"s shape: {s_full.shape}")    # (3,) - just the diagonal values
print(f"Vt shape: {Vt_full.shape}")  # (3, 3)

# Compute Economy (Thin) SVD
U_thin, s_thin, Vt_thin = svd(A, full_matrices=False)
print(f"\nEconomy SVD:")
print(f"U shape: {U_thin.shape}")    # (4, 3)
print(f"s shape: {s_thin.shape}")    # (3,)
print(f"Vt shape: {Vt_thin.shape}")  # (3, 3)

# Verify reconstruction: A = U @ Σ @ V^T
Sigma_full = np.zeros((4, 3))
np.fill_diagonal(Sigma_full, s_full)
A_reconstructed = U_full @ Sigma_full @ Vt_full
print(f"\nReconstruction error: {np.linalg.norm(A - A_reconstructed):.2e}")

# Singular values reveal the rank
print(f"\nSingular values: {s_full}")
print(f"Matrix rank: {np.sum(s_full > 1e-10)}")
```

The power of SVD lies not just in its existence but in what it tells us about the geometry of linear transformations. To understand this, let's think about what happens when we multiply a vector x by the matrix A.
The transformation A = UΣV^T as three steps:
When we compute Ax, the SVD decomposes this into three sequential operations:
Step 1: V^T rotates (and possibly reflects) the input space. The matrix V^T rotates the input vector x from the standard basis to a new coordinate system defined by the right singular vectors. This aligns the vector with the 'principal directions' of the transformation.
Step 2: Σ scales along each axis. The diagonal matrix Σ stretches or compresses along each axis by the corresponding singular value σᵢ. Directions with larger singular values get amplified more; directions with σᵢ = 0 are collapsed entirely (these correspond to the null space).
Step 3: U rotates into the output space. Finally, U rotates the scaled result into the output coordinate system, aligning it with the left singular vectors.
The sphere-to-ellipse transformation:
Perhaps the most illuminating visualization: consider what happens when A acts on the unit sphere. If we take all vectors x with ||x|| = 1 (a sphere in n dimensions), and apply A to each, we get an ellipsoid in m dimensions. The semi-axes of this ellipsoid point along the left singular vectors uᵢ, and their lengths are the singular values σᵢ.
In machine learning, data matrices represent collections of samples (rows) with features (columns). The SVD of a data matrix reveals: (1) The directions of maximum variance in the data, (2) How much variance lies along each direction, and (3) Which features contribute most to each variance direction. This is exactly what we need for dimensionality reduction, noise filtering, and understanding data structure.
Connecting singular vectors to fundamental subspaces:
SVD provides orthonormal bases for all four fundamental subspaces of a matrix A with rank r:
| Subspace | Basis from SVD | Dimension |
|---|---|---|
| Column space (Range) of A | First r columns of U | r |
| Left null space of A | Last m-r columns of U | m-r |
| Row space of A | First r columns of V | r |
| Null space of A | Last n-r columns of V | n-r |
This elegant connection shows why SVD is considered the 'ultimate' matrix decomposition—it reveals the complete structure of any linear transformation.
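To make the table concrete, here is a small sketch (reusing the 4 × 3 rank-2 matrix from the earlier example; the tolerance-based rank estimate is this snippet's own choice) that slices U and V from a full SVD into bases for the four subspaces:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9],
              [10, 11, 12]], dtype=float)  # 4 x 3, rank 2

U, s, Vt = np.linalg.svd(A, full_matrices=True)
m, n = A.shape

# Estimate the numerical rank from the singular values
tol = max(m, n) * np.finfo(float).eps * s[0]
r = int(np.sum(s > tol))

col_space  = U[:, :r]     # basis for the column space of A (r columns)
left_null  = U[:, r:]     # basis for the left null space (m - r columns)
row_space  = Vt[:r, :].T  # basis for the row space of A (r columns)
null_space = Vt[r:, :].T  # basis for the null space (n - r columns)

# Sanity checks: A annihilates the null space, A^T annihilates the left null space
print("rank:", r)
print("||A @ null_space|| =", np.linalg.norm(A @ null_space))
print("||A.T @ left_null|| =", np.linalg.norm(A.T @ left_null))
```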
Interpretation of singular values: each σᵢ measures how strongly A stretches space along the input direction vᵢ (toward the output direction uᵢ). The largest singular value σ₁ equals the spectral norm ||A||₂, the number of nonzero singular values equals the rank, and for a full-rank matrix the ratio σ₁/σ_min is the condition number that governs numerical sensitivity.
```python
import numpy as np
import matplotlib.pyplot as plt

def visualize_svd_transformation(A, ax1, ax2):
    """Visualize how SVD transforms the unit circle to an ellipse."""
    U, s, Vt = np.linalg.svd(A)

    # Points on unit circle
    theta = np.linspace(0, 2*np.pi, 100)
    circle = np.vstack([np.cos(theta), np.sin(theta)])

    # Transform through A
    ellipse = A @ circle

    # Plot unit circle with right singular vectors
    ax1.plot(circle[0], circle[1], 'b-', label='Unit circle')
    for i in range(len(s)):
        ax1.arrow(0, 0, Vt[i, 0]*0.8, Vt[i, 1]*0.8,
                  head_width=0.1, color=f'C{i+1}',
                  label=f'v{i+1} (input direction)')
    ax1.set_xlim(-1.5, 1.5)
    ax1.set_ylim(-1.5, 1.5)
    ax1.set_aspect('equal')
    ax1.axhline(y=0, color='k', linewidth=0.5)
    ax1.axvline(x=0, color='k', linewidth=0.5)
    ax1.legend(fontsize=8)
    ax1.set_title('Input Space: Unit Circle + Right Singular Vectors')

    # Plot ellipse with left singular vectors scaled by singular values
    ax2.plot(ellipse[0], ellipse[1], 'r-', label='Transformed ellipse')
    for i in range(len(s)):
        ax2.arrow(0, 0, U[0, i]*s[i]*0.8, U[1, i]*s[i]*0.8,
                  head_width=0.2, color=f'C{i+1}',
                  label=f'σ{i+1}·u{i+1} (σ={s[i]:.2f})')
    max_val = max(s) * 1.5
    ax2.set_xlim(-max_val, max_val)
    ax2.set_ylim(-max_val, max_val)
    ax2.set_aspect('equal')
    ax2.axhline(y=0, color='k', linewidth=0.5)
    ax2.axvline(x=0, color='k', linewidth=0.5)
    ax2.legend(fontsize=8)
    ax2.set_title('Output Space: Ellipse + Scaled Left Singular Vectors')

# Example: A transformation that stretches and rotates
A = np.array([[3, 1],
              [1, 2]])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
visualize_svd_transformation(A, ax1, ax2)
plt.suptitle('SVD Geometric Interpretation: Circle → Ellipse', fontsize=14)
plt.tight_layout()
plt.show()

# Verify the geometry
U, s, Vt = np.linalg.svd(A)
print("Singular values (ellipse semi-axis lengths):", s)
print("\nLeft singular vectors (ellipse axis directions):")
print(U)
print("\nRight singular vectors (input directions mapped to axes):")
print(Vt.T)
```

Understanding how SVD is computed illuminates its deep connection to eigenvalue problems. While production implementations use sophisticated numerical methods (like the Golub-Kahan algorithm), the conceptual derivation reveals essential insights.
Deriving SVD from eigenvalues:
The key insight is that if we have A = UΣV^T, then:
$$A^T A = V \Sigma^T U^T U \Sigma V^T = V \Sigma^T \Sigma V^T = V \Lambda_V V^T$$
$$A A^T = U \Sigma V^T V \Sigma^T U^T = U \Sigma \Sigma^T U^T = U \Lambda_U U^T$$
where Λ_V = Σ^TΣ and Λ_U = ΣΣ^T are diagonal matrices containing σ₁², σ₂², ...
This reveals that: the columns of V are the eigenvectors of AᵀA, the columns of U are the eigenvectors of AAᵀ, and the singular values are the square roots of their shared nonzero eigenvalues, σᵢ = √λᵢ.
Since A^TA and AA^T are symmetric positive semi-definite, their eigenvalues are real and non-negative. This guarantees that singular values are always real and non-negative.
The computational algorithm (conceptual): (1) form AᵀA and compute its eigendecomposition AᵀA = VΛVᵀ; (2) sort the eigenvalues in decreasing order and set σᵢ = √λᵢ; (3) recover each left singular vector from uᵢ = Avᵢ/σᵢ for σᵢ > 0 (completing U with any orthonormal basis of the remaining space if the full form is needed).
While the derivation above is mathematically correct, computing SVD by explicitly forming A^TA is numerically unstable—it squares the condition number and can lose precision. Production algorithms like the Golub-Kahan bidiagonalization avoid this by working directly with A. Always use library functions (numpy.linalg.svd, scipy.linalg.svd) rather than implementing via eigendecomposition.
Computational complexity:
For an m × n matrix (assume m ≥ n): computing the SVD costs on the order of O(mn²) floating-point operations (O(n³) for a square matrix); the full form is somewhat more expensive than the thin form because it must also build the complete m × m matrix U.
This makes SVD relatively expensive for large matrices, motivating the use of approximate and randomized algorithms for big data applications.
Uniqueness of SVD:
SVD is essentially unique with the following caveats: the singular values themselves are always uniquely determined; each matched pair of singular vectors (uᵢ, vᵢ) can be flipped in sign simultaneously without changing the product (demonstrated below); and when a singular value is repeated, any orthonormal basis of the corresponding subspace is a valid choice of singular vectors.
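A minimal illustration of the sign ambiguity, using an arbitrary 2 × 2 example matrix: flipping the sign of a matched column of U and row of Vᵀ leaves the product unchanged.

```python
import numpy as np

A = np.array([[3., 1.], [1., 2.]])
U, s, Vt = np.linalg.svd(A)

# Flip the sign of the first left singular vector AND the first right singular vector
U2 = U.copy();   U2[:, 0] *= -1
Vt2 = Vt.copy(); Vt2[0, :] *= -1

A1 = U @ np.diag(s) @ Vt
A2 = U2 @ np.diag(s) @ Vt2
print(np.allclose(A1, A2))  # True: both are valid SVDs of A
```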
```python
import numpy as np

def compute_svd_via_eigendecomposition(A):
    """
    Compute SVD using eigendecomposition of A^T A.

    WARNING: This is for educational purposes only!
    In practice, use numpy.linalg.svd which uses
    numerically stable algorithms.
    """
    m, n = A.shape

    # Form A^T A (n x n symmetric positive semi-definite matrix)
    ATA = A.T @ A

    # Compute eigenvalues and eigenvectors of A^T A
    # eigenvalues are squared singular values
    eigenvalues, V = np.linalg.eigh(ATA)  # eigh for symmetric matrices

    # Sort in descending order (eigh returns ascending)
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    V = V[:, idx]

    # Singular values are square roots of eigenvalues
    # Clip to avoid numerical issues with tiny negative values
    singular_values = np.sqrt(np.maximum(eigenvalues, 0))

    # Find rank (number of non-trivially zero singular values)
    tol = max(m, n) * np.finfo(float).eps * singular_values[0]
    rank = np.sum(singular_values > tol)

    # Compute U from the relationship: u_i = (1/σ_i) A v_i
    U = np.zeros((m, rank))
    for i in range(rank):
        U[:, i] = A @ V[:, i] / singular_values[i]

    return U, singular_values[:rank], V[:, :rank].T

# Test with a sample matrix
A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]], dtype=float)

# Our implementation
U_ours, s_ours, Vt_ours = compute_svd_via_eigendecomposition(A)

# NumPy's implementation
U_np, s_np, Vt_np = np.linalg.svd(A, full_matrices=False)

print("Our SVD (via eigendecomposition):")
print(f"Singular values: {s_ours}")
print(f"Rank: {len(s_ours)}")

print("\nNumPy's SVD:")
print(f"Singular values: {s_np}")

print(f"\nSingular value match: {np.allclose(s_ours, s_np[:len(s_ours)])}")

# Verify reconstruction
S_diag = np.diag(s_ours)
A_reconstructed = U_ours @ S_diag @ Vt_ours
print(f"Reconstruction error: {np.linalg.norm(A - A_reconstructed):.2e}")
```

One of the most important applications of SVD is finding the best low-rank approximation to a matrix. This is fundamental to dimensionality reduction, data compression, noise reduction, and matrix completion.
The problem statement:
Given a matrix A of rank r, find the rank-k matrix Aₖ (where k < r) that is closest to A. 'Closest' means minimizing either the Frobenius norm ||A − Aₖ||_F or the spectral (operator 2-) norm ||A − Aₖ||₂.
The Eckart-Young-Mirsky Theorem (1936):
The optimal rank-k approximation Aₖ is obtained by truncating the SVD:
$$A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T = U_k \Sigma_k V_k^T$$
where Uₖ contains the first k columns of U, Σₖ is the k × k diagonal matrix of the top k singular values, and Vₖ contains the first k columns of V.
The approximation errors are:
$$\|A - A_k\|_2 = \sigma_{k+1}, \qquad \|A - A_k\|_F = \sqrt{\sigma_{k+1}^2 + \sigma_{k+2}^2 + \cdots + \sigma_r^2}$$
This is a profound result: among all rank-k matrices, the truncated SVD is provably optimal in both common matrix norms!
Think of each singular value as measuring the 'importance' of a direction. The first singular vector captures the most significant pattern; the second captures the most significant remaining pattern orthogonal to the first; and so on. By keeping only the top k directions, we retain the most important structure while discarding less significant variations (often noise).
Practical implications:
1. Data Compression: Instead of storing m × n numbers, store: the k columns of Uₖ (mk numbers), the k singular values (k numbers), and the k columns of Vₖ (nk numbers).
Total: k(m + n + 1) numbers. If k << min(m,n), this is massive compression.
2. Noise Reduction: If data = signal + noise, and the signal is effectively low-rank while noise is full-rank, truncating small singular values removes noise while preserving signal.
3. Choosing k: common strategies are keeping enough components to reach a target fraction of the total variance (e.g., 95% of Σσᵢ²), looking for an 'elbow' where the singular value spectrum flattens, or selecting k by cross-validation on the downstream task.
Outer product form of SVD:
The SVD can be written as a sum of rank-1 matrices: $$A = \sum_{i=1}^{r} \sigma_i u_i v_i^T$$
Each term σᵢuᵢvᵢᵀ is a rank-1 'building block'. The truncated SVD simply takes the first k of these blocks.
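Here is a minimal sketch of the outer-product view on a small random matrix (the matrix and sizes are arbitrary): summing the first k rank-1 blocks reproduces the rank-k truncation, and the leftover spectral error matches the next singular value, as the Eckart-Young-Mirsky theorem predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Build A as a running sum of rank-1 blocks σ_i · u_i v_i^T
A_k = np.zeros_like(A)
for i in range(len(s)):
    A_k += s[i] * np.outer(U[:, i], Vt[i, :])
    # Eckart-Young: the spectral error of the rank-(i+1) truncation is the next singular value
    expected = s[i + 1] if i + 1 < len(s) else 0.0
    print(f"rank {i+1}: ||A - A_k||_2 = {np.linalg.norm(A - A_k, 2):.4f}  (theory: {expected:.4f})")
```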
```python
import numpy as np
import matplotlib.pyplot as plt

def analyze_low_rank_approximation(A, show_plot=True):
    """
    Analyze the low-rank approximation quality for matrix A.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    rank = len(s)

    # Compute various approximation metrics
    total_variance = np.sum(s**2)
    cumulative_variance = np.cumsum(s**2) / total_variance

    errors_frobenius = []
    errors_spectral = []
    compression_ratios = []

    m, n = A.shape
    original_size = m * n

    for k in range(1, rank + 1):
        # Reconstruction with rank k
        Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

        # Frobenius error
        error_f = np.linalg.norm(A - Ak, 'fro')
        errors_frobenius.append(error_f)

        # Spectral error (equals σ_{k+1}, or 0 if k=rank)
        error_s = s[k] if k < rank else 0
        errors_spectral.append(error_s)

        # Compression ratio
        compressed_size = k * (m + n + 1)
        compression_ratios.append(original_size / compressed_size)

    if show_plot:
        fig, axes = plt.subplots(2, 2, figsize=(12, 10))

        # Singular value spectrum
        axes[0, 0].semilogy(range(1, rank+1), s, 'bo-')
        axes[0, 0].set_xlabel('Index i')
        axes[0, 0].set_ylabel('Singular value σᵢ (log scale)')
        axes[0, 0].set_title('Singular Value Spectrum')
        axes[0, 0].grid(True)

        # Cumulative explained variance
        axes[0, 1].plot(range(1, rank+1), cumulative_variance * 100, 'go-')
        axes[0, 1].axhline(y=95, color='r', linestyle='--', label='95% threshold')
        axes[0, 1].set_xlabel('Number of components k')
        axes[0, 1].set_ylabel('Cumulative explained variance (%)')
        axes[0, 1].set_title('Explained Variance Ratio')
        axes[0, 1].legend()
        axes[0, 1].grid(True)

        # Approximation error
        axes[1, 0].plot(range(1, rank+1), errors_frobenius, 'r-', label='Frobenius')
        axes[1, 0].plot(range(1, rank+1), errors_spectral, 'b--', label='Spectral')
        axes[1, 0].set_xlabel('Rank k')
        axes[1, 0].set_ylabel('Approximation error ||A - Aₖ||')
        axes[1, 0].set_title('Approximation Error vs Rank')
        axes[1, 0].legend()
        axes[1, 0].grid(True)

        # Compression ratio
        axes[1, 1].plot(range(1, rank+1), compression_ratios, 'm-o')
        axes[1, 1].set_xlabel('Rank k')
        axes[1, 1].set_ylabel('Compression ratio')
        axes[1, 1].set_title('Compression Ratio vs Rank')
        axes[1, 1].grid(True)

        plt.tight_layout()
        plt.show()

    return s, cumulative_variance, errors_frobenius

# Example: Create a matrix with structure + noise
np.random.seed(42)
true_rank = 5
m, n = 100, 50

# Low-rank component (the "signal")
U_true = np.random.randn(m, true_rank)
V_true = np.random.randn(n, true_rank)
signal = U_true @ V_true.T

# Add Gaussian noise
noise_level = 0.5
A = signal + noise_level * np.random.randn(m, n)

print(f"Matrix A: {m} x {n}")
print(f"True underlying rank: {true_rank}")
print(f"Noise level: {noise_level}")

s, var_explained, errors = analyze_low_rank_approximation(A)

# Find k that explains 95% of variance
k_95 = np.argmax(var_explained >= 0.95) + 1
print(f"\nComponents needed for 95% variance: {k_95}")
print(f"Top {k_95} singular values: {s[:k_95].round(2)}")
```

SVD and eigendecomposition are related but distinct factorizations. Understanding when to use each is crucial for ML practitioners.
Eigendecomposition recap:
For a square matrix A ∈ ℝⁿˣⁿ, the eigendecomposition is: $$A = Q \Lambda Q^{-1}$$
where Λ is diagonal containing eigenvalues, and Q contains eigenvectors.
For symmetric matrices A = Aᵀ, this simplifies to: $$A = Q \Lambda Q^T$$
where Q is orthogonal (Q⁻¹ = Qᵀ).
Key differences:
| Property | SVD | Eigendecomposition |
|---|---|---|
| Matrix shape | Any m × n | Square n × n only |
| Existence | Always exists | May not exist (defective matrices) |
| Values | Always real, non-negative (σᵢ ≥ 0) | Can be complex, negative, or zero |
| Vectors | Two sets: U (left) and V (right) | One set: eigenvectors Q |
| Orthogonality | U, V always orthogonal | Q orthogonal only for symmetric A |
| Ordering convention | σ₁ ≥ σ₂ ≥ ... ≥ 0 | Often unordered |
| Computational cost | O(mn²) for m ≥ n | O(n³) |
| Numerical stability | Stable standard algorithms | Can be unstable for non-symmetric |
Relationship for symmetric matrices:
For a symmetric positive semi-definite matrix A: the two factorizations coincide: the singular values equal the eigenvalues, and U = V = Q (up to ordering and sign).
For a symmetric matrix with negative eigenvalues: the singular values are the absolute values of the eigenvalues (σᵢ = |λᵢ|), and for each negative eigenvalue the corresponding left and right singular vectors differ by a sign, so U ≠ V.
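A small sketch of this relationship, using an arbitrary symmetric 2 × 2 matrix with one negative eigenvalue, comparing numpy.linalg.eigh with numpy.linalg.svd:

```python
import numpy as np

# Symmetric matrix with one negative eigenvalue
A = np.array([[2.,  3.],
              [3., -1.]])

eigvals, Q = np.linalg.eigh(A)   # eigendecomposition (ascending eigenvalues)
U, s, Vt = np.linalg.svd(A)      # SVD (descending singular values)

print("eigenvalues:     ", eigvals)   # one negative, one positive
print("singular values: ", s)         # absolute values of the eigenvalues
print("match |λ| vs σ:  ", np.allclose(np.sort(np.abs(eigvals))[::-1], s))

# For the negative eigenvalue, the left and right singular vectors differ by a sign
print("U == V?          ", np.allclose(U, Vt.T))  # False here
```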
When to use SVD: rectangular data matrices, rank-deficient or ill-conditioned problems, least-squares and pseudoinverse computations, low-rank approximation, and any setting where numerical robustness matters.
When to use eigendecomposition: square matrices whose eigenstructure is itself the object of study, such as covariance matrices, Markov transition matrices, graph Laplacians, and analyses of repeated application Aᵏ (power iteration, dynamical systems).
Principal Component Analysis (PCA) can be computed two ways: (1) Eigendecomposition of the covariance matrix C = XᵀX/(n-1), or (2) SVD of the centered data matrix X directly. The SVD approach is preferred because it avoids explicitly forming XᵀX, which can lose numerical precision. The principal components are the right singular vectors of X, and the explained variances are σᵢ²/(n-1).
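A brief sketch with synthetic data (sizes and names are illustrative) verifying that the two routes agree on the explained variances and, up to sign, on the principal directions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))  # 200 samples, 5 features
X = X - X.mean(axis=0)                                           # center the data
n = X.shape[0]

# Route 1: eigendecomposition of the covariance matrix
C = X.T @ X / (n - 1)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)
variances_svd = s**2 / (n - 1)

print("explained variances match:", np.allclose(eigvals, variances_svd))
# Principal directions agree up to sign
print("components match (up to sign):", np.allclose(np.abs(eigvecs), np.abs(Vt.T)))
```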
SVD is not just a theoretical tool—it's the computational backbone of many ML systems. Let's explore the major applications with practical details.
1. Principal Component Analysis (PCA):
PCA finds the orthogonal directions of maximum variance in data. Given centered data matrix X (n samples × d features), compute SVD: X = UΣVᵀ. The principal directions are the columns of V (rows of Vᵀ), the projected coordinates (scores) are UΣ, and the variance captured by component i is σᵢ²/(n − 1).
2. Latent Semantic Analysis (LSA):
For text analysis, create a term-document matrix A where Aᵢⱼ = frequency of term i in document j. SVD reveals latent 'topics': the left singular vectors group terms that tend to co-occur, the right singular vectors group documents that use similar vocabulary, and the singular values rank these topics by importance. Documents and queries can then be compared in the compact topic space instead of the sparse term space (see the sketch below).
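A minimal LSA sketch using sklearn's CountVectorizer and TruncatedSVD; the four-document toy corpus and the choice of two components are assumptions made purely for illustration. Note that it builds a document-term matrix (the transpose of the term-document convention above), which simply swaps the roles of U and V.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Tiny toy corpus: two "topics" (space vs. cooking)
docs = [
    "rocket launch orbit satellite",
    "satellite orbit space rocket",
    "recipe oven bake bread",
    "bread oven flour recipe",
]

X = CountVectorizer().fit_transform(docs)      # document-term count matrix (4 x vocab)
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsa.fit_transform(X)              # documents embedded in 2-D topic space

print(np.round(doc_topics, 2))
# Documents about the same subject land close together in topic space
```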
3. Recommendation Systems:
The Netflix Prize made matrix factorization famous. For a user-item rating matrix R: factor R ≈ UₖΣₖVₖᵀ so that each user and each item gets a k-dimensional vector of latent factors, and predicted ratings are inner products of those vectors. Because R is mostly missing, production systems learn the factors with regularized iterative methods rather than one exact SVD, but a truncated SVD of a mean-filled matrix is a useful baseline (sketched below).
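A hedged baseline sketch, not the Netflix-winning method: fill missing ratings with the global mean, take a rank-2 truncated SVD, and read predictions off the reconstruction. The tiny rating matrix and the choice k = 2 are made up for illustration.

```python
import numpy as np

# Toy user-item rating matrix (0 marks a missing rating)
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

mask = R > 0
R_filled = np.where(mask, R, R[mask].mean())   # naive imputation with the global mean

# Rank-2 truncated SVD as a latent-factor model
U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(R_hat, 2))  # predictions, including for the missing entries
print("RMSE on observed ratings:",
      round(float(np.sqrt(np.mean((R_hat[mask] - R[mask])**2))), 3))
```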
4. Image Compression:
Treat an image as a matrix (or separately for each color channel). Truncated SVD gives lossy compression: storing k(m+n+1) values instead of mn, with provably minimal reconstruction error.
```python
import numpy as np
import matplotlib.pyplot as plt

def compress_image_svd(image_array, k):
    """
    Compress an image using rank-k SVD approximation.
    Works on grayscale; for color, apply to each channel.
    """
    U, s, Vt = np.linalg.svd(image_array, full_matrices=False)

    # Truncate to rank k
    compressed = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    # Clip values to valid range
    compressed = np.clip(compressed, 0, 255)
    return compressed.astype(np.uint8)

def analyze_compression(image_array):
    """Analyze compression quality at different ranks."""
    U, s, Vt = np.linalg.svd(image_array, full_matrices=False)

    m, n = image_array.shape
    original_size = m * n

    results = []
    k_values = [1, 5, 10, 20, 50, 100, min(m, n)]

    for k in k_values:
        if k > len(s):
            continue

        compressed = compress_image_svd(image_array, k)

        # Compute metrics (cast to float to avoid uint8 wraparound)
        compressed_size = k * (m + n + 1)
        compression_ratio = original_size / compressed_size
        mse = np.mean((image_array.astype(float) - compressed.astype(float))**2)
        psnr = 10 * np.log10(255**2 / mse) if mse > 0 else float('inf')
        explained_var = np.sum(s[:k]**2) / np.sum(s**2) * 100

        results.append({
            'k': k,
            'compression_ratio': compression_ratio,
            'psnr': psnr,
            'explained_variance': explained_var,
            'image': compressed
        })

    return results

# Create a sample grayscale image (gradient + pattern)
x = np.linspace(0, 255, 256)
y = np.linspace(0, 255, 256)
X, Y = np.meshgrid(x, y)
image = np.clip(0.5 * X + 0.3 * Y + 50 * np.sin(X/20) * np.cos(Y/30), 0, 255).astype(np.uint8)

# Analyze compression
results = analyze_compression(image)

print("Image Compression Analysis")
print("=" * 60)
for r in results:
    print(f"Rank {r['k']:3d}: Compression {r['compression_ratio']:5.1f}x, "
          f"PSNR {r['psnr']:5.1f} dB, Explained var {r['explained_variance']:5.1f}%")

# Visualization
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()

axes[0].imshow(image, cmap='gray')
axes[0].set_title('Original')
axes[0].axis('off')

for i, r in enumerate(results[:7]):
    axes[i+1].imshow(r['image'], cmap='gray')
    axes[i+1].set_title(f"k={r['k']} ({r['compression_ratio']:.1f}x)")
    axes[i+1].axis('off')

plt.suptitle('SVD Image Compression at Different Ranks', fontsize=14)
plt.tight_layout()
plt.show()
```

5. Pseudoinverse and Least Squares:
The Moore-Penrose pseudoinverse A⁺ is elegantly computed via SVD: $$A^+ = V \Sigma^+ U^T$$
where Σ⁺ has 1/σᵢ on the diagonal for non-zero singular values (and 0 elsewhere).
This solves least squares problems even when A is rank-deficient: $$x = A^+ b$$
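A short sketch on a random overdetermined system (the sizes and seed are arbitrary) showing that the SVD-built pseudoinverse agrees with numpy.linalg.pinv and numpy.linalg.lstsq:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 3))   # overdetermined: 8 equations, 3 unknowns
b = rng.standard_normal(8)

# Pseudoinverse built directly from the SVD
U, s, Vt = np.linalg.svd(A, full_matrices=False)
tol = max(A.shape) * np.finfo(float).eps * s[0]
s_inv = np.where(s > tol, 1.0 / s, 0.0)        # invert only the nonzero singular values
A_pinv = Vt.T @ np.diag(s_inv) @ U.T

x_svd   = A_pinv @ b
x_pinv  = np.linalg.pinv(A) @ b
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

print(np.allclose(x_svd, x_pinv), np.allclose(x_svd, x_lstsq))  # True True
```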
6. Noise Reduction:
If measurements are corrupted by noise, and the true signal is approximately low-rank: compute the SVD of the noisy data matrix, keep only the components with large singular values, and reconstruct the matrix from them.
The small singular values typically correspond to noise, while large ones capture the signal.
7. Matrix Completion:
Given a matrix with missing entries (e.g., user-movie ratings), find a low-rank matrix that matches the observed entries. Nuclear norm minimization (sum of singular values) is a convex relaxation that often recovers the true matrix. SVD is central to algorithms like Soft-Impute.
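A heavily simplified sketch in the spirit of Soft-Impute, not a reference implementation; the toy rating matrix, the threshold λ = 1.0, and the iteration count are illustrative assumptions. The idea: repeatedly fill the missing entries with the current estimate and soft-threshold the singular values.

```python
import numpy as np

# Toy matrix with missing entries (np.nan marks unobserved values)
R = np.array([
    [5, 4, np.nan, 1],
    [4, np.nan, 1, 1],
    [1, 1, np.nan, 4],
    [np.nan, 1, 4, 5],
], dtype=float)
observed = ~np.isnan(R)

X = np.where(observed, R, 0.0)   # initial guess: zeros in the missing slots
lam = 1.0                        # soft-threshold level for the singular values (illustrative)

for _ in range(100):
    # Fill missing entries with the current estimate, keep observed entries fixed
    filled = np.where(observed, R, X)
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)   # singular value soft-thresholding
    X = U @ np.diag(s_shrunk) @ Vt

print(np.round(X, 2))  # low-rank estimate, including the previously missing entries
```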
Across all these applications, SVD provides the same superpower: decomposing complex, high-dimensional data into ordered components of decreasing importance. By focusing on the top components and discarding the rest, we achieve dimensionality reduction, noise removal, and computational efficiency—all with theoretical guarantees of optimality.
Computing full SVD on large matrices is expensive. Understanding computational trade-offs is essential for practical ML applications.
Complexity Analysis:
| Algorithm | Complexity | When to use |
|---|---|---|
| Full SVD | O(mn²) for m ≥ n | Small matrices, need all components |
| Truncated SVD | O(mnk) | Need only top k components |
| Randomized SVD | O(mn log k + (m+n)k²) | Large matrices, approximate top k |
| Incremental SVD | O(nk²) per row update | Streaming data |
Truncated SVD:
When you only need the top k singular values/vectors (typical in ML), use iterative methods that never compute the full decomposition: Lanczos/Arnoldi-style solvers such as scipy.sparse.linalg.svds (ARPACK) or sklearn's TruncatedSVD, both of which work directly on sparse matrices.
Randomized SVD:
For very large matrices, randomized algorithms provide excellent approximations much faster: multiply A by a thin random matrix to sample its dominant range, orthonormalize the result, and then compute an exact SVD of the small projected matrix (see the sketch below).
This trades exact computation for probabilistic guarantees with much lower cost.
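A minimal randomized range-finder sketch along the lines of the Halko-Martinsson-Tropp recipe; the oversampling, power-iteration count, and test matrix here are illustrative choices rather than tuned defaults.

```python
import numpy as np

def randomized_svd_sketch(A, k, oversample=10, n_iter=2, seed=0):
    """Approximate top-k SVD of A via a randomized range finder."""
    rng = np.random.default_rng(seed)
    m, n = A.shape

    # 1. Sample the range of A with a random Gaussian test matrix
    Omega = rng.standard_normal((n, k + oversample))
    Y = A @ Omega

    # 2. A few power iterations sharpen the approximation when the spectrum decays slowly
    for _ in range(n_iter):
        Y = A @ (A.T @ Y)

    # 3. Orthonormal basis Q for the sampled range, then exact SVD of the small matrix B = Q^T A
    Q, _ = np.linalg.qr(Y)
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], s[:k], Vt[:k, :]

# Compare against the exact singular values on a synthetic matrix with decaying spectrum
rng = np.random.default_rng(42)
A = rng.standard_normal((2000, 300)) @ np.diag(np.logspace(0, -3, 300)) @ rng.standard_normal((300, 300))

U, s, Vt = randomized_svd_sketch(A, k=10)
s_exact = np.linalg.svd(A, compute_uv=False)[:10]
print(np.round(s, 3))
print(np.round(s_exact, 3))
```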
```python
import numpy as np
import time
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds
from sklearn.decomposition import TruncatedSVD
from sklearn.utils.extmath import randomized_svd

def benchmark_svd_methods(m, n, k, sparsity=0.01):
    """
    Compare different SVD methods for a sparse matrix.
    """
    print(f"Matrix size: {m} x {n} (sparsity: {sparsity*100:.1f}%)")
    print(f"Target rank: {k}")
    print("=" * 60)

    # Create sparse matrix
    A_sparse = sparse_random(m, n, density=sparsity, format='csr')
    A_dense = A_sparse.toarray()

    results = {}

    # Method 1: Full SVD (only if matrix is small enough)
    if m * n < 1e6:
        start = time.time()
        U_full, s_full, Vt_full = np.linalg.svd(A_dense, full_matrices=False)
        time_full = time.time() - start
        results['Full SVD'] = {
            'time': time_full,
            'top_k_singular': s_full[:k]
        }
        print(f"Full SVD: {time_full:.3f}s")

    # Method 2: Sparse truncated SVD (scipy)
    start = time.time()
    U_sparse, s_sparse, Vt_sparse = svds(A_sparse, k=k, which='LM')
    time_sparse = time.time() - start
    # Note: svds returns in ascending order, need to reverse
    idx = np.argsort(s_sparse)[::-1]
    s_sparse = s_sparse[idx]
    results['Sparse SVDs'] = {'time': time_sparse, 'top_k_singular': s_sparse}
    print(f"Sparse SVDs: {time_sparse:.3f}s")

    # Method 3: TruncatedSVD (sklearn) - works on dense or sparse
    start = time.time()
    tsvd = TruncatedSVD(n_components=k, algorithm='arpack')
    tsvd.fit(A_sparse)
    time_tsvd = time.time() - start
    results['TruncatedSVD'] = {
        'time': time_tsvd,
        'top_k_singular': tsvd.singular_values_
    }
    print(f"TruncatedSVD: {time_tsvd:.3f}s")

    # Method 4: Randomized SVD
    start = time.time()
    U_rand, s_rand, Vt_rand = randomized_svd(
        A_sparse, n_components=k, n_iter=5, random_state=42
    )
    time_rand = time.time() - start
    results['Randomized SVD'] = {'time': time_rand, 'top_k_singular': s_rand}
    print(f"Randomized SVD: {time_rand:.3f}s")

    # Compare singular values
    print("\nTop singular values comparison:")
    for method, data in results.items():
        sv_str = ', '.join(f'{s:.2f}' for s in data['top_k_singular'][:5])
        print(f"  {method:15s}: {sv_str}, ...")

    return results

# Run benchmark
np.random.seed(42)
results = benchmark_svd_methods(m=1000, n=500, k=10, sparsity=0.05)

print("\n" + "=" * 60)
print("For even larger matrices, use out-of-core methods or distributed")
print("computing frameworks like Spark MLlib or Dask.")
```

For an m × n matrix, full SVD requires O(mn + m² + n²) memory to store A, U, and V. For large matrices (e.g., million-user recommendation systems), this is prohibitive. Truncated and streaming methods that only maintain the top-k factors are essential for production ML systems.
We've covered SVD in depth—from its formal definition through geometric interpretation, computation, and practical applications. The key insights: A = UΣVᵀ exists for every matrix; it splits any linear map into rotation, axis-aligned scaling, and rotation; its truncation is the provably optimal low-rank approximation (Eckart-Young-Mirsky); and it underlies PCA, LSA, recommender systems, image compression, the pseudoinverse, and noise reduction.
What's next:
Having mastered SVD, we'll explore Eigendecomposition in depth—understanding when and why to decompose square matrices by their eigenvalues and eigenvectors. While SVD is the general workhorse, eigendecomposition is essential for understanding matrix dynamics, spectral graph theory, and many probabilistic models.
You now have a deep understanding of Singular Value Decomposition—the most versatile matrix factorization in machine learning. From theoretical foundations to practical implementation, you're equipped to apply SVD across dimensionality reduction, recommendation systems, signal processing, and beyond.