On the previous page, we learned how to measure the 'size' of vectors using norms. But machine learning is fundamentally about transformations—weight matrices that transform inputs to outputs, covariance matrices that describe data distributions, and Jacobians that capture how changes propagate through networks.
How do we measure the 'size' of a matrix? This question is more nuanced than it first appears. A matrix is not just a collection of numbers; it represents a linear transformation. Different matrix norms capture different aspects of this transformation: how much it can stretch vectors, how 'spread out' its values are, or how close it is to being singular.
Matrix norms are essential for understanding numerical stability, convergence of iterative algorithms, condition numbers, regularization in deep learning, and the analysis of neural network generalization.
By the end of this page, you will understand induced (operator) norms, element-wise (Frobenius) norms, and spectral norms. You'll learn how these relate to eigenvalues and singular values, and why matrix norms are critical for analyzing the stability and convergence of machine learning algorithms.
A matrix norm is a function $|\cdot|: \mathbb{R}^{m \times n} \to \mathbb{R}$ that satisfies the same three axioms as vector norms, applied to the space of $m \times n$ matrices:
Formal Definition:
A function $|\cdot|$ on matrices is a norm if for all matrices $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}$ and all scalars $\alpha \in \mathbb{R}$:

1. Non-negativity and definiteness: $|\mathbf{A}| \geq 0$, with $|\mathbf{A}| = 0$ if and only if $\mathbf{A} = \mathbf{0}$
2. Absolute homogeneity: $|\alpha \mathbf{A}| = |\alpha| \, |\mathbf{A}|$
3. Triangle inequality: $|\mathbf{A} + \mathbf{B}| \leq |\mathbf{A}| + |\mathbf{B}|$
Additional Property for Square Matrices:
For norms on square matrices $\mathbb{R}^{n \times n}$, we often require an additional property:
Submultiplicativity: $|\mathbf{A}\mathbf{B}| \leq |\mathbf{A}| |\mathbf{B}|$
This property is essential for analyzing products of matrices, such as compositions of linear layers in neural networks or powers of matrices in iterative algorithms. A norm satisfying this property is called a submultiplicative norm or ring norm.
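As a quick numerical sketch, the three norm axioms and submultiplicativity can all be checked for the Frobenius norm on arbitrary random matrices (the matrices and tolerance here are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
alpha = -2.5

fro = lambda M: np.linalg.norm(M, 'fro')

# Non-negativity and definiteness
assert fro(A) > 0 and fro(np.zeros((4, 4))) == 0
# Absolute homogeneity: ||alpha A|| = |alpha| ||A||
assert np.isclose(fro(alpha * A), abs(alpha) * fro(A))
# Triangle inequality: ||A + B|| <= ||A|| + ||B||
assert fro(A + B) <= fro(A) + fro(B) + 1e-12
# Submultiplicativity: ||AB|| <= ||A|| ||B||
assert fro(A @ B) <= fro(A) * fro(B) + 1e-12
print("All four properties hold for the Frobenius norm")
```

A numerical check like this is no substitute for a proof, but it is a handy sanity test when implementing a custom norm.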
Matrix norms fall into two broad categories: Induced (Operator) Norms that measure how much a matrix can 'stretch' vectors, and Entry-wise (Element) Norms that treat the matrix as a flattened vector of entries. Each perspective is useful for different applications.
The most geometrically meaningful matrix norms are induced norms (also called operator norms or subordinate norms). They measure how much a matrix can 'stretch' vectors when viewed as a linear transformation.
Definition:
Given a vector norm $|\cdot|$ on $\mathbb{R}^n$, the induced matrix norm on $\mathbb{R}^{m \times n}$ is:
$$|\mathbf{A}| = \max_{\mathbf{x} \neq \mathbf{0}} \frac{|\mathbf{A}\mathbf{x}|}{|\mathbf{x}|} = \max_{|\mathbf{x}|=1} |\mathbf{A}\mathbf{x}|$$
This is the maximum stretching factor—the largest ratio by which $\mathbf{A}$ can stretch any vector. Equivalently, it's the maximum norm of $\mathbf{Ax}$ over all unit vectors $\mathbf{x}$.
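This maximization can be illustrated numerically: sample many random unit vectors, record the stretch factor for each, and observe that no sample exceeds the operator norm while the best sample approaches it (a sketch with an arbitrary 2×2 matrix and sample count):

```python
import numpy as np

rng = np.random.default_rng(42)
A = np.array([[1.0, 2.0], [3.0, 4.0]])
op_norm = np.linalg.norm(A, ord=2)  # exact induced 2-norm via SVD

# Sample many unit vectors and record the stretch ||Ax|| / ||x||
x = rng.standard_normal((2, 100_000))
x /= np.linalg.norm(x, axis=0)          # normalize each column to unit length
stretch = np.linalg.norm(A @ x, axis=0)  # ||Ax|| for unit x

print(f"Exact operator norm:  {op_norm:.6f}")
print(f"Best sampled stretch: {stretch.max():.6f}")  # approaches op_norm from below
```

Every sampled stretch is bounded above by the operator norm, which is exactly what the definition as a maximum over unit vectors guarantees.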
Key Induced Norms:
1. Matrix 1-Norm (Maximum Column Sum):
$$|\mathbf{A}|_1 = \max_{j} \sum_{i=1}^{m} |a_{ij}|$$
The maximum absolute column sum. Induced by the vector L1 norm.
2. Matrix ∞-Norm (Maximum Row Sum):
$$|\mathbf{A}|_\infty = \max_{i} \sum_{j=1}^{n} |a_{ij}|$$
The maximum absolute row sum. Induced by the vector L∞ norm.
3. Matrix 2-Norm (Spectral Norm):
$$|\mathbf{A}|_2 = \sigma_{\max}(\mathbf{A}) = \sqrt{\lambda_{\max}(\mathbf{A}^T\mathbf{A})}$$
The largest singular value of $\mathbf{A}$. Induced by the vector L2 norm. This is the most important matrix norm in machine learning.
Induced norms automatically satisfy $|\mathbf{AB}| \leq |\mathbf{A}| |\mathbf{B}|$. Proof: $|\mathbf{ABx}| \leq |\mathbf{A}| |\mathbf{Bx}| \leq |\mathbf{A}| |\mathbf{B}| |\mathbf{x}|$, so dividing by $|\mathbf{x}|$ and taking max gives the result. This property is crucial for analyzing products of weight matrices in deep networks.
```python
import numpy as np

# Define a matrix
A = np.array([[1, 2, 3], [4, 5, 6]])
print(f"Matrix A:\n{A}")

# Matrix 1-norm: maximum column sum
norm_1 = np.linalg.norm(A, ord=1)
col_sums = np.sum(np.abs(A), axis=0)
print(f"Column sums: {col_sums}")
print(f"Matrix 1-norm (max column sum): {norm_1}")  # max(5, 7, 9) = 9

# Matrix ∞-norm: maximum row sum
norm_inf = np.linalg.norm(A, ord=np.inf)
row_sums = np.sum(np.abs(A), axis=1)
print(f"Row sums: {row_sums}")
print(f"Matrix ∞-norm (max row sum): {norm_inf}")  # max(6, 15) = 15

# Matrix 2-norm (spectral norm): largest singular value
norm_2 = np.linalg.norm(A, ord=2)
singular_values = np.linalg.svd(A, compute_uv=False)
print(f"Singular values: {singular_values}")
print(f"Matrix 2-norm (spectral): {norm_2}")
print(f"σ_max(A) = {singular_values[0]:.6f}")

# Verify submultiplicativity
B = np.array([[1, 0], [0, 1], [1, 1]])
AB = A @ B
print(f"Verifying ||AB|| ≤ ||A|| ||B||:")
print(f"||A||_2 = {np.linalg.norm(A, 2):.4f}")
print(f"||B||_2 = {np.linalg.norm(B, 2):.4f}")
print(f"||AB||_2 = {np.linalg.norm(AB, 2):.4f}")
print(f"||A||_2 * ||B||_2 = {np.linalg.norm(A, 2) * np.linalg.norm(B, 2):.4f}")
print(f"Submultiplicativity holds: {np.linalg.norm(AB, 2) <= np.linalg.norm(A, 2) * np.linalg.norm(B, 2) + 1e-10}")
```

The spectral norm (matrix 2-norm) is arguably the most important matrix norm in machine learning. It equals the largest singular value and measures the maximum geometric 'stretch' a matrix applies to any vector.
Definition:
$$|\mathbf{A}|_2 = \sigma_{\max}(\mathbf{A}) = \max_{\mathbf{x} \neq \mathbf{0}} \frac{|\mathbf{A}\mathbf{x}|_2}{|\mathbf{x}|_2}$$
Computing the Spectral Norm:
For any matrix $\mathbf{A}$, the spectral norm equals the largest singular value $\sigma_{\max}(\mathbf{A})$, i.e., the square root of the largest eigenvalue of $\mathbf{A}^T\mathbf{A}$ (equivalently, of $\mathbf{A}\mathbf{A}^T$). In practice it is computed via the SVD or estimated with iterative methods.
Special Case: Symmetric Matrices
For symmetric matrices $\mathbf{A} = \mathbf{A}^T$, the spectral norm simplifies beautifully:
$$|\mathbf{A}|_2 = \rho(\mathbf{A}) = \max_{i} |\lambda_i|$$
The spectral norm equals the spectral radius—the largest absolute eigenvalue. This is because for symmetric matrices, singular values equal absolute eigenvalues.
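This relationship is easy to verify numerically: for a symmetric matrix, the sorted absolute eigenvalues coincide with the sorted singular values (a sketch with an arbitrary symmetric tridiagonal matrix):

```python
import numpy as np

# An arbitrary symmetric matrix for illustration
S = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])

eigvals = np.linalg.eigvalsh(S)            # real eigenvalues (symmetric input)
sing = np.linalg.svd(S, compute_uv=False)  # singular values

print(f"Eigenvalues:     {eigvals}")
print(f"Singular values: {sing}")
print(f"max|λ|  = {np.max(np.abs(eigvals)):.6f}")
print(f"||S||_2 = {np.linalg.norm(S, 2):.6f}")
```

For non-symmetric matrices this identity fails in general: the spectral radius can be strictly smaller than the spectral norm.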
Unlike the 1-norm and ∞-norm (which can be computed in O(mn) time), the spectral norm requires computing the largest singular value, which is O(min(m,n)·mn) via SVD or iterative methods like power iteration. For large matrices, this can be a bottleneck.
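Power iteration on $\mathbf{A}^T\mathbf{A}$ is one such iterative method. The sketch below (the function name and iteration count are illustrative choices, not a standard API) avoids forming $\mathbf{A}^T\mathbf{A}$ explicitly, which matters when $\mathbf{A}$ is large or sparse:

```python
import numpy as np

def spectral_norm_power_iter(A, n_iters=1000, seed=0):
    """Estimate sigma_max(A) by power iteration on A^T A,
    applying A and A.T rather than forming A^T A explicitly."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        v = A.T @ (A @ v)        # one power-iteration step on A^T A
        v /= np.linalg.norm(v)
    return np.linalg.norm(A @ v)  # ||Av|| for the converged unit vector v

A = np.random.default_rng(1).standard_normal((200, 100))
est = spectral_norm_power_iter(A)
exact = np.linalg.norm(A, 2)
print(f"Power iteration estimate: {est:.6f}")
print(f"Exact (via SVD):          {exact:.6f}")
```

In practice a handful of iterations often suffices; spectral normalization layers in deep learning libraries typically run just one power-iteration step per training update, reusing the vector across steps.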
Spectral Norm in Machine Learning:
Spectral Normalization: In GANs and discriminator networks, dividing weight matrices by their spectral norm ($\mathbf{W} \to \mathbf{W}/|\mathbf{W}|_2$) controls Lipschitz continuity and stabilizes training.
Generalization Bounds: Many generalization bounds for neural networks involve the product of spectral norms across layers: $\prod_l |\mathbf{W}^{(l)}|_2$.
Gradient Analysis: The spectral norm of the Jacobian bounds how much small input changes become large output changes.
Low-Rank Approximation: Truncating small singular values (those below a threshold relative to $|\mathbf{A}|_2$) gives optimal low-rank approximations.
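The low-rank approximation point can be made concrete with the Eckart–Young theorem: truncating the SVD after $k$ singular values gives the best rank-$k$ approximation, with spectral-norm error exactly equal to the first discarded singular value. A sketch (matrix size and $k$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation

# Eckart-Young: the spectral-norm error equals the first dropped singular value
err = np.linalg.norm(A - A_k, 2)
print(f"||A - A_k||_2 = {err:.6f}")
print(f"sigma_(k+1)   = {s[k]:.6f}")
```

The same truncation is also optimal in the Frobenius norm, where the squared error is the sum of the squared discarded singular values.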
The Frobenius norm treats the matrix as a long vector of $mn$ entries and computes the vector L2 norm. It is the most commonly used 'element-wise' matrix norm.
Definition:
$$|\mathbf{A}|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2} = \sqrt{\text{trace}(\mathbf{A}^T\mathbf{A})}$$
Equivalently, in terms of singular values:
$$|\mathbf{A}|_F = \sqrt{\sum_{k=1}^{\min(m,n)} \sigma_k^2} = \sqrt{\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2}$$
where $r = \text{rank}(\mathbf{A})$.
Relationship Between Frobenius and Spectral Norms:
Two important inequalities connect these norms:
$$|\mathbf{A}|_2 \leq |\mathbf{A}|_F \leq \sqrt{\text{rank}(\mathbf{A})} \cdot |\mathbf{A}|_2$$
```python
import numpy as np

# Define a matrix
A = np.array([[1, 2, 3], [4, 5, 6]])
print(f"Matrix A:\n{A}")

# Frobenius norm: multiple equivalent calculations
frob_manual = np.sqrt(np.sum(A**2))
frob_trace = np.sqrt(np.trace(A.T @ A))
frob_builtin = np.linalg.norm(A, 'fro')

print(f"Frobenius norm calculations:")
print(f"  Manual (√Σa²): {frob_manual:.6f}")
print(f"  Via trace: {frob_trace:.6f}")
print(f"  Built-in: {frob_builtin:.6f}")

# In terms of singular values
sigma = np.linalg.svd(A, compute_uv=False)
frob_svd = np.sqrt(np.sum(sigma**2))
print(f"  Via singular values √(Σσ²): {frob_svd:.6f}")
print(f"  Singular values: {sigma}")

# Compare with spectral norm
spectral = np.linalg.norm(A, 2)
rank_A = np.linalg.matrix_rank(A)
print(f"Spectral norm: {spectral:.6f}")
print(f"Frobenius norm: {frob_builtin:.6f}")
print(f"Matrix rank: {rank_A}")
print(f"Verifying: ||A||_2 ≤ ||A||_F ≤ √rank · ||A||_2")
print(f"  {spectral:.4f} ≤ {frob_builtin:.4f} ≤ {np.sqrt(rank_A) * spectral:.4f}")

# Weight decay in neural networks uses the squared Frobenius norm
lambda_reg = 0.01
W = np.random.randn(100, 50)
l2_regularization = lambda_reg * np.sum(W**2)  # = λ ||W||_F²
print(f"L2 regularization (weight decay): λ||W||_F² = {l2_regularization:.4f}")
```

In deep learning, 'L2 regularization' or 'weight decay' adds $\lambda \sum_{i,j} w_{ij}^2 = \lambda |\mathbf{W}|_F^2$ to the loss. This penalizes large weights, encouraging the network to use smaller, distributed representations. It's the squared Frobenius norm, not the spectral norm.
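The penalty $\lambda |\mathbf{W}|_F^2$ is popular partly because its gradient is simply $2\lambda \mathbf{W}$, so each gradient step shrinks every weight toward zero. A sketch verifying this with central differences (matrix size, $\lambda$, and step size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 4))
lam = 0.01

# Penalty: lam * ||W||_F^2 ; its gradient w.r.t. W is 2 * lam * W
grad = 2 * lam * W

# Numerical check of the gradient for one entry via central differences
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
num_grad = (lam * np.sum(Wp**2) - lam * np.sum(Wm**2)) / (2 * eps)

print(f"Analytic grad[0,0]:  {grad[0, 0]:.8f}")
print(f"Numerical grad[0,0]: {num_grad:.8f}")
```

This is why the technique is called weight decay: the update $\mathbf{W} \leftarrow \mathbf{W} - \eta \cdot 2\lambda\mathbf{W}$ multiplies the weights by a factor slightly below one at every step.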
The nuclear norm (also called trace norm or Schatten 1-norm) is the sum of singular values. It plays the same role for matrices that the L1 norm plays for vectors: it promotes sparsity in singular values, i.e., low rank.
Definition:
$$|\mathbf{A}|_* = \sum_{k=1}^{\min(m,n)} \sigma_k = \sigma_1 + \sigma_2 + \cdots + \sigma_r$$
where $\sigma_k$ are the singular values of $\mathbf{A}$.
Alternatively:
$$|\mathbf{A}|_* = \text{trace}(\sqrt{\mathbf{A}^T\mathbf{A}})$$
Why Nuclear Norm Promotes Low Rank:
Just as L1 regularization on a vector encourages many entries to be exactly zero, nuclear norm regularization on a matrix encourages many singular values to be exactly zero—which means low rank.
The nuclear norm is the tightest convex relaxation of the rank function: on the set of matrices with $|\mathbf{A}|_2 \leq 1$, it is the convex envelope of $\text{rank}(\mathbf{A})$, i.e., the largest convex function that never exceeds the rank there.
Since minimizing rank directly is NP-hard, nuclear norm minimization is used as a tractable convex surrogate.
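The workhorse of nuclear norm minimization is singular value thresholding, the proximal operator of the nuclear norm: soft-threshold the singular values, zeroing out the small ones and thereby reducing rank. A sketch (the function name, threshold, and noise level are illustrative choices):

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding: the proximal operator of tau * ||.||_*.
    Soft-thresholds the singular values, zeroing out those below tau."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return U @ np.diag(s_shrunk) @ Vt

rng = np.random.default_rng(0)
# A noisy observation of an underlying rank-2 matrix
L = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 20))
A = L + 0.1 * rng.standard_normal((30, 20))
A_svt = svt(A, tau=2.0)

print(f"Rank before SVT: {np.linalg.matrix_rank(A)}")
print(f"Rank after SVT:  {np.linalg.matrix_rank(A_svt)}")
```

Iterating this operator inside a proximal gradient loop is the core of classic matrix completion algorithms: the large singular values carrying the signal survive, while the small noise-driven ones are clipped to zero.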
```python
import numpy as np

# Define a matrix (rank 2: one singular value is ~0)
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"Matrix A:\n{A}")

# Nuclear norm: sum of singular values
sigma = np.linalg.svd(A, compute_uv=False)
nuclear = np.sum(sigma)
nuclear_builtin = np.linalg.norm(A, 'nuc')

print(f"Singular values: {sigma}")
print(f"Rank (numerical): {np.sum(sigma > 1e-10)}")
print(f"Nuclear norm: {nuclear:.6f}")
print(f"Built-in nuclear norm: {nuclear_builtin:.6f}")

# Compare all major matrix norms
print(f"Comparing matrix norms for A:")
print(f"  Spectral (2-norm): {np.linalg.norm(A, 2):.4f} = σ_max")
print(f"  Frobenius: {np.linalg.norm(A, 'fro'):.4f} = √(Σσ²)")
print(f"  Nuclear: {np.linalg.norm(A, 'nuc'):.4f} = Σσ")

# Low-rank vs full-rank comparison
print(f"--- Low-rank matrix ---")
L = np.outer([1, 2, 3], [1, 1, 1])  # Rank 1
print(f"Rank-1 matrix L:\n{L}")
sigma_L = np.linalg.svd(L, compute_uv=False)
print(f"Singular values: {sigma_L}")
print(f"Nuclear norm: {np.linalg.norm(L, 'nuc'):.4f}")
print(f"Frobenius norm: {np.linalg.norm(L, 'fro'):.4f}")
print(f"Ratio Nuclear/Frobenius: {np.linalg.norm(L, 'nuc')/np.linalg.norm(L, 'fro'):.4f}")
# For rank-1 matrices, nuclear = Frobenius since there is only one nonzero singular value
```

The nuclear norm is the dual norm of the spectral norm: $|\mathbf{A}|_* = \max_{|\mathbf{B}|_2 \leq 1} \text{trace}(\mathbf{B}^T\mathbf{A})$. This duality is analogous to L1 and L∞ being duals for vectors, and it plays a crucial role in optimization algorithms.
The condition number of a matrix measures how sensitively the solution to a system of linear equations changes with small perturbations in the input. It's defined using matrix norms and is fundamental to numerical analysis.
Definition:
For an invertible matrix $\mathbf{A}$:
$$\kappa(\mathbf{A}) = |\mathbf{A}| \cdot |\mathbf{A}^{-1}|$$
For the spectral norm (most common):
$$\kappa_2(\mathbf{A}) = |\mathbf{A}|_2 \cdot |\mathbf{A}^{-1}|_2 = \frac{\sigma_{\max}(\mathbf{A})}{\sigma_{\min}(\mathbf{A})}$$
The condition number is always $\geq 1$, with equality for scalar multiples of orthogonal matrices.
Interpretation:
Consider solving $\mathbf{Ax} = \mathbf{b}$. If we perturb $\mathbf{b}$ slightly to $\mathbf{b} + \Delta\mathbf{b}$, how much does the solution change?
$$\frac{|\Delta\mathbf{x}|}{|\mathbf{x}|} \leq \kappa(\mathbf{A}) \cdot \frac{|\Delta\mathbf{b}|}{|\mathbf{b}|}$$
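This bound can be checked directly (a sketch using an arbitrary 2×2 system and perturbation size):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 2.0], [3.0, 7.0]])
b = np.array([1.0, 1.0])
x = np.linalg.solve(A, b)
kappa = np.linalg.cond(A)

# Perturb b slightly and re-solve
db = 1e-6 * rng.standard_normal(2)
x_new = np.linalg.solve(A, b + db)

rel_x = np.linalg.norm(x_new - x) / np.linalg.norm(x)
rel_b = np.linalg.norm(db) / np.linalg.norm(b)
print(f"Relative change in x:          {rel_x:.3e}")
print(f"kappa * relative change in b:  {kappa * rel_b:.3e}")
```

The relative change in the solution never exceeds the condition number times the relative change in the right-hand side; for well-conditioned matrices the two sides are close in magnitude, while for ill-conditioned ones the actual amplification can approach the bound.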
| Condition Number | Classification | Numerical Reliability |
|---|---|---|
| κ ≈ 1 | Perfectly conditioned | Excellent; solutions are stable |
| κ ≈ 10 | Well-conditioned | Very good; minor precision loss |
| κ ≈ 10³ | Moderately conditioned | Acceptable; ~3 digits of precision may be lost |
| κ ≈ 10⁶ | Ill-conditioned | Problematic; significant precision loss |
| κ ≈ 10¹⁵ or higher | Severely ill-conditioned | Unreliable; near numerical singularity |
```python
import numpy as np

# Well-conditioned matrix: orthogonal matrices have condition number 1
theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(f"Rotation matrix Q:\n{Q}")
print(f"Condition number: {np.linalg.cond(Q):.6f} (ideal = 1)")

# Moderately conditioned
A = np.array([[1, 2], [3, 7]])
print(f"Matrix A:\n{A}")
print(f"Condition number: {np.linalg.cond(A):.2f}")

# Ill-conditioned: Hilbert matrix (notorious example)
def hilbert(n):
    return np.array([[1/(i+j+1) for j in range(n)] for i in range(n)])

for n in [3, 5, 7, 10]:
    H = hilbert(n)
    kappa = np.linalg.cond(H)
    print(f"Hilbert matrix H_{n}: condition number = {kappa:.2e}")

# Demonstrating ill-conditioning effects
print(f"--- Demonstrating ill-conditioning ---")
n = 7
H = hilbert(n)
x_true = np.ones(n)
b = H @ x_true

# Solve with slight perturbation
b_perturbed = b + 1e-8 * np.random.randn(n)
x_computed = np.linalg.solve(H, b)
x_perturbed = np.linalg.solve(H, b_perturbed)

print(f"Perturbation in b: {np.linalg.norm(b_perturbed - b):.2e}")
print(f"Change in solution: {np.linalg.norm(x_perturbed - x_computed):.2e}")
print(f"Condition number: {np.linalg.cond(H):.2e}")
print("Ill-conditioning amplifies small errors dramatically!")
```

In machine learning, ill-conditioned matrices appear frequently: (1) Gram matrices $\mathbf{X}^T\mathbf{X}$ with correlated features, (2) Hessians near saddle points, (3) deep network Jacobians. This is why regularization (adding $\lambda\mathbf{I}$) helps—it bounds the condition number by $\frac{\sigma_{\max} + \lambda}{\sigma_{\min} + \lambda}$.
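The effect of adding $\lambda\mathbf{I}$ is easy to demonstrate on a Gram matrix built from nearly duplicate features (a sketch; the feature correlation and $\lambda$ are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two highly correlated features -> nearly singular Gram matrix
x1 = rng.standard_normal((100, 1))
X = np.hstack([x1, x1 + 1e-4 * rng.standard_normal((100, 1))])
G = X.T @ X

lam = 1e-2
G_reg = G + lam * np.eye(2)  # ridge-style regularization

print(f"cond(X^T X):      {np.linalg.cond(G):.2e}")
print(f"cond(X^T X + λI): {np.linalg.cond(G_reg):.2e}")
```

Shifting every eigenvalue up by $\lambda$ leaves the large eigenvalues nearly unchanged while lifting the tiny ones off zero, which is exactly why ridge regression remains numerically stable where ordinary least squares breaks down.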
Let's consolidate all the matrix norms we've covered into a comprehensive reference. For a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ with singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$ where $r = \text{rank}(\mathbf{A})$:
| Norm | Definition | In Terms of SVD | Primary Use |
|---|---|---|---|
| 1-norm $|\mathbf{A}|_1$ | max column sum | — | Sparse optimization |
| ∞-norm $|\mathbf{A}|_\infty$ | max row sum | — | Row-wise bounds |
| Spectral $|\mathbf{A}|_2$ | max $|\mathbf{Ax}|_2/|\mathbf{x}|_2$ | $\sigma_1$ | Lipschitz bounds, stability |
| Frobenius $|\mathbf{A}|_F$ | $\sqrt{\sum_{i,j} a_{ij}^2}$ | $\sqrt{\sum_k \sigma_k^2}$ | Regularization (weight decay) |
| Nuclear $|\mathbf{A}|_*$ | trace$(\sqrt{\mathbf{A}^T\mathbf{A}})$ | $\sum_k \sigma_k$ | Low-rank recovery |
Key Relationships:
$$|\mathbf{A}|_2 \leq |\mathbf{A}|_F \leq \sqrt{r} |\mathbf{A}|_2$$
$$|\mathbf{A}|_2 \leq |\mathbf{A}|_* \leq r |\mathbf{A}|_2$$
$$|\mathbf{A}|_* \leq \sqrt{r} |\mathbf{A}|_F$$
These inequalities show how the choice of norm affects the magnitude and how rank plays a role.
What's Next:
We've covered how to measure vectors (norms) and matrices (matrix norms). The next page introduces distance metrics—how to measure the 'difference' or 'similarity' between two objects. While norms measure absolute size, distances and similarities are crucial for clustering, nearest neighbors, and retrieval systems.
You now understand the theory and practice of matrix norms, from induced norms that capture transformation behavior to entry-wise norms used in regularization. These tools are essential for analyzing neural network dynamics, understanding optimization landscapes, and building stable numerical algorithms.