Imagine you're building a recommendation system, and you need to determine how similar two users are based on their preferences. Or perhaps you're training a neural network and need to measure how far your predictions are from the true values. In both cases, you face a fundamental question: How do we measure the 'size' or 'length' of a vector?
This seemingly simple question has profound implications for nearly every machine learning algorithm. The answer lies in vector norms—mathematical tools that generalize our intuitive notion of length to high-dimensional spaces.
Vector norms are not mere abstract concepts; they are the workhorses of machine learning. They appear in loss functions, regularization terms, optimization convergence criteria, similarity measures, and error metrics. Understanding norms deeply is understanding the mathematical language in which machine learning speaks.
By the end of this page, you will understand the rigorous mathematical definition of norms, the geometric and algebraic properties of L1, L2, and L∞ norms, their relationships to each other, and why choosing the right norm can mean the difference between a model that generalizes beautifully and one that overfits catastrophically.
Before exploring specific norms, we must establish what makes a function qualify as a norm. A norm is a function that assigns a non-negative real number to each vector in a vector space, intuitively representing the vector's 'magnitude' or 'length.'
Formal Definition:
A function $|\cdot|: V \to \mathbb{R}$ on a vector space $V$ over the field $\mathbb{R}$ (or $\mathbb{C}$) is called a norm if and only if it satisfies the following three axioms for all vectors $\mathbf{x}, \mathbf{y} \in V$ and all scalars $\alpha \in \mathbb{R}$:

1. **Non-negativity (positive definiteness):** $|\mathbf{x}| \geq 0$, with $|\mathbf{x}| = 0$ if and only if $\mathbf{x} = \mathbf{0}$.
2. **Absolute homogeneity:** $|\alpha \mathbf{x}| = |\alpha| \, |\mathbf{x}|$, where $|\alpha|$ denotes the absolute value of the scalar.
3. **Triangle inequality:** $|\mathbf{x} + \mathbf{y}| \leq |\mathbf{x}| + |\mathbf{y}|$.
These axioms aren't arbitrary mathematical formalism. Non-negativity ensures that 'distance' is never negative. Homogeneity ensures consistent scaling behavior. The triangle inequality ensures that our notion of 'shortest path' is geometrically sensible. Machine learning algorithms implicitly rely on these properties when using norms in optimization and loss functions.
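The three axioms are easy to spot-check numerically. The sketch below (the helper `check_norm_axioms` is ours, not a library function) verifies all three for NumPy's built-in L1, L2, and L∞ norms on concrete vectors:

```python
import numpy as np

def check_norm_axioms(norm, x, y, alpha, tol=1e-12):
    """Numerically spot-check the three norm axioms for given inputs."""
    # 1. Non-negativity, with definiteness at the zero vector
    assert norm(x) >= 0
    assert norm(np.zeros_like(x)) == 0
    # 2. Absolute homogeneity: ||alpha * x|| = |alpha| * ||x||
    assert abs(norm(alpha * x) - abs(alpha) * norm(x)) < tol
    # 3. Triangle inequality: ||x + y|| <= ||x|| + ||y||
    assert norm(x + y) <= norm(x) + norm(y) + tol
    return True

x = np.array([3.0, -4.0, 1.0])
y = np.array([-1.0, 2.0, 5.0])

for ord_ in (1, 2, np.inf):
    check_norm_axioms(lambda v: np.linalg.norm(v, ord=ord_), x, y, alpha=-2.5)
print("All three axioms hold for L1, L2, and L∞ on these inputs")
```

A check like this doesn't prove the axioms, of course, but it is a useful sanity test when implementing a custom norm.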
The p-Norm Family:
The most important family of norms in machine learning is the Lp-norm (also written as $\ell_p$-norm), defined for $p \geq 1$ as:
$$|\mathbf{x}|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$$
where $\mathbf{x} = (x_1, x_2, \ldots, x_n)^T \in \mathbb{R}^n$.
For $p \geq 1$, this formula satisfies all three norm axioms. The cases $p = 1$, $p = 2$, and $p = \infty$ are by far the most commonly used in machine learning, and we will study each in depth.
When $0 < p < 1$, the formula $\left( \sum_i |x_i|^p \right)^{1/p}$ does NOT satisfy the triangle inequality, and is therefore NOT a true norm. However, these 'quasi-norms' are still useful in sparse optimization (e.g., the L0 'norm' which counts non-zero entries). This distinction is critical for theoretical understanding.
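To see the failure concretely, here is a minimal counterexample for $p = 1/2$: each standard basis vector has 'norm' 1, but their sum has 'norm' 4, violating the triangle inequality.

```python
import numpy as np

def quasi_norm(x, p):
    """The Lp formula for any p > 0 (a true norm only when p >= 1)."""
    return np.sum(np.abs(x)**p)**(1/p)

p = 0.5
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

lhs = quasi_norm(e1 + e2, p)                 # (1^0.5 + 1^0.5)^2 = 4
rhs = quasi_norm(e1, p) + quasi_norm(e2, p)  # 1 + 1 = 2

print(f"||e1 + e2||_{p} = {lhs}")            # 4.0
print(f"||e1||_{p} + ||e2||_{p} = {rhs}")    # 2.0
print(f"Triangle inequality violated: {lhs > rhs}")
```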
The L2 norm, also known as the Euclidean norm, is the most natural generalization of our intuitive notion of 'length' from 2D and 3D geometry to arbitrary dimensions. It corresponds to the straight-line distance from the origin to a point.
Definition:
$$|\mathbf{x}|_2 = \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$
Alternatively, using the inner product:
$$|\mathbf{x}|_2 = \sqrt{\mathbf{x}^T \mathbf{x}} = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle}$$
This connection to the inner product is fundamental—the L2 norm is the unique norm induced by the standard Euclidean inner product.
Geometric Interpretation:
In 2D, the set of all points with $|\mathbf{x}|_2 = 1$ forms a circle centered at the origin. In 3D, it's a sphere. In $n$ dimensions, it's a hypersphere.
This circular symmetry is crucial: the L2 norm treats all directions equally. Rotating a vector doesn't change its L2 norm, a property called rotational invariance.
Mathematical Properties:
In practice, we often minimize $|\mathbf{x}|_2^2$ rather than $|\mathbf{x}|_2$. The squared norm is differentiable everywhere (including at zero) and eliminates the square root, simplifying gradients and enabling closed-form solutions in regression. The minimizers are identical, so this is purely a computational convenience.
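As a sanity check on this convenience, the sketch below verifies the standard gradients $\nabla |\mathbf{x}|_2^2 = 2\mathbf{x}$ (defined everywhere) and $\nabla |\mathbf{x}|_2 = \mathbf{x} / |\mathbf{x}|_2$ (undefined at the origin) against central finite differences:

```python
import numpy as np

def grad_sq_l2(x):
    """Gradient of ||x||_2^2 = sum(x_i^2): simply 2x, defined everywhere."""
    return 2 * x

def grad_l2(x):
    """Gradient of ||x||_2 = x / ||x||_2; undefined at the origin."""
    return x / np.linalg.norm(x)

x = np.array([3.0, 4.0])
eps = 1e-6

# Compare each analytic gradient to a central finite-difference estimate
for f, grad in [(lambda v: np.sum(v**2), grad_sq_l2),
                (np.linalg.norm, grad_l2)]:
    numeric = np.array([(f(x + eps*e) - f(x - eps*e)) / (2*eps)
                        for e in np.eye(len(x))])
    assert np.allclose(numeric, grad(x), atol=1e-5)

print("grad ||x||_2^2 =", grad_sq_l2(x))   # [6. 8.]
print("grad ||x||_2   =", grad_l2(x))      # [0.6 0.8]
```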
```python
import numpy as np

# Define a vector
x = np.array([3, 4])

# L2 norm calculation - multiple equivalent approaches
l2_manual = np.sqrt(np.sum(x**2))        # Manual formula
l2_dot = np.sqrt(np.dot(x, x))           # Using dot product
l2_builtin = np.linalg.norm(x)           # NumPy built-in (default is L2)
l2_explicit = np.linalg.norm(x, ord=2)   # Explicitly specifying L2

print(f"Vector x = {x}")
print(f"L2 norm (manual): ||x||_2 = √(3² + 4²) = √25 = {l2_manual}")
print(f"L2 norm (dot prod): ||x||_2 = √(x·x) = {l2_dot}")
print(f"L2 norm (built-in): ||x||_2 = {l2_builtin}")

# Higher dimensional example
y = np.array([1, 2, 2, 4])
print(f"Vector y = {y}")
print(f"L2 norm: ||y||_2 = √(1 + 4 + 4 + 16) = √25 = {np.linalg.norm(y)}")

# Verifying rotational invariance
theta = np.pi/4  # 45 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # Rotation matrix

x_rotated = R @ x
print(f"Original vector: {x}, L2 norm = {np.linalg.norm(x)}")
print(f"After 45° rotation: {x_rotated}, L2 norm = {np.linalg.norm(x_rotated):.6f}")
print(f"Rotational invariance verified: norms are equal")
```

L2 Norm in Machine Learning:
The L2 norm appears throughout machine learning:

- **Mean squared error (MSE):** the squared L2 norm of the residual vector is the standard regression loss.
- **Ridge (L2) regularization:** penalizing $|\mathbf{w}|_2^2$ shrinks weights smoothly toward zero.
- **Euclidean distance:** $|\mathbf{x} - \mathbf{y}|_2$ underlies k-nearest neighbors and k-means clustering.
- **Embedding normalization:** dividing by the L2 norm yields unit vectors, so dot products become cosine similarities.
The L1 norm, also called the Manhattan norm or taxicab norm, measures distance as if you were navigating a city grid—you can only travel along axes, never diagonally.
Definition:
$$|\mathbf{x}|_1 = \sum_{i=1}^{n} |x_i| = |x_1| + |x_2| + \cdots + |x_n|$$
The name 'Manhattan' comes from the grid-like street layout of Manhattan, where the shortest walking distance between two points requires traveling along streets (horizontal/vertical) rather than cutting through buildings diagonally.
Geometric Interpretation:
The unit ball of the L1 norm (all points with $|\mathbf{x}|_1 \leq 1$) forms a rhombus in 2D (diamond shape), an octahedron in 3D, and a cross-polytope in higher dimensions.
Unlike the smooth, curved boundary of the L2 ball, the L1 ball has sharp corners that lie exactly on the coordinate axes. This geometry is the key to understanding why L1 regularization promotes sparsity—a property we'll explore in depth.
Why L1 Induces Sparsity (Geometric Intuition):
Consider minimizing a loss function $L(\mathbf{w})$ subject to an L1 ball constraint $|\mathbf{w}|_1 \leq t$. Graphically, we're finding where the loss contours first touch the constraint region.
The key insight: the L1 ball has corners on the axes, and the loss contours (ellipses for least squares) are most likely to touch these corners first. At a corner, one or more coordinates are exactly zero.
Contrast with L2: the L2 ball is smooth everywhere. The touching point can be anywhere on the sphere—there's no preference for axis-aligned points, so solutions are typically dense with no exact zeros.
This geometric difference explains why Lasso (L1) produces sparse models while Ridge (L2) produces models with small but non-zero coefficients.
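This difference can be seen on synthetic data. The sketch below (an illustrative setup, not a benchmark) solves Ridge in closed form and Lasso with ISTA, a standard proximal-gradient method; only 3 of 20 features actually matter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression: only 3 of 20 features are truly relevant
n, d = 100, 20
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[[0, 5, 12]] = [2.0, -3.0, 1.5]
y = X @ w_true + 0.1 * rng.standard_normal(n)

lam = 5.0

# Ridge: closed-form solution (X^T X + lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Lasso: ISTA (proximal gradient descent with soft-thresholding)
step = 1.0 / np.linalg.norm(X, ord=2)**2   # 1 / Lipschitz constant of the gradient
w_lasso = np.zeros(d)
for _ in range(2000):
    grad = X.T @ (X @ w_lasso - y)
    z = w_lasso - step * grad
    w_lasso = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0)  # soft-threshold

print(f"Ridge exact zeros: {np.sum(np.abs(w_ridge) < 1e-8)} of {d}")  # typically 0
print(f"Lasso exact zeros: {np.sum(np.abs(w_lasso) < 1e-8)} of {d}")  # most coefficients
```

On runs like this, Lasso recovers the three relevant features and sets the irrelevant coefficients exactly to zero, while Ridge leaves all twenty small but non-zero.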
```python
import numpy as np

# Define a vector
x = np.array([3, -4, 5])

# L1 norm calculation
l1_manual = np.sum(np.abs(x))          # Sum of absolute values
l1_builtin = np.linalg.norm(x, ord=1)  # NumPy with ord=1

print(f"Vector x = {x}")
print(f"L1 norm: ||x||_1 = |3| + |-4| + |5| = 3 + 4 + 5 = {l1_manual}")

# Comparing L1 vs L2 norm behavior with outliers
normal_errors = np.array([1, 2, 1, 2, 1])
with_outlier = np.array([1, 2, 1, 2, 20])  # One outlier

print(f"Normal errors: {normal_errors}")
print(f"  L1 norm: {np.linalg.norm(normal_errors, 1):.2f}")
print(f"  L2 norm: {np.linalg.norm(normal_errors, 2):.2f}")
print(f"  L2 squared: {np.sum(normal_errors**2):.2f}")

print(f"With outlier: {with_outlier}")
print(f"  L1 norm: {np.linalg.norm(with_outlier, 1):.2f}")    # Increases linearly
print(f"  L2 norm: {np.linalg.norm(with_outlier, 2):.2f}")
print(f"  L2 squared: {np.sum(with_outlier**2):.2f}")          # Dominated by outlier!

# Raising the last error from 1 to 20 adds 19 to the L1 norm,
# but 20² - 1² = 399 to the squared L2 norm — hence MSE's outlier sensitivity
```

While $|x|$ isn't differentiable at $x=0$, we can use the subgradient: any value in $[-1, 1]$ is a valid subgradient at zero. This enables algorithms like proximal gradient descent and subgradient methods to handle L1 optimization effectively. The LASSO solution often uses soft-thresholding: $S_\lambda(x) = \text{sign}(x) \cdot \max(|x| - \lambda, 0)$.
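A minimal implementation of this operator makes its sparsifying effect visible: every component with magnitude at most $\lambda$ is set exactly to zero, and the rest shrink toward zero by $\lambda$.

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of lam * ||x||_1: S_lam(x) = sign(x) * max(|x| - lam, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([3.0, -0.4, 1.2, 0.05, -2.5])
# Components with |x_i| <= 0.5 become exactly zero; others shrink by 0.5
print(soft_threshold(x, lam=0.5))
```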
The L∞ norm, also called the maximum norm, Chebyshev norm, or uniform norm, measures the largest absolute component of a vector.
Definition:
$$|\mathbf{x}|_\infty = \max_{i=1,\ldots,n} |x_i|$$
The name 'L∞' comes from taking the limit as $p \to \infty$ in the Lp norm:
$$\lim_{p \to \infty} |\mathbf{x}|_p = \max_i |x_i|$$
This limit result is mathematically elegant: as $p$ increases, larger components contribute proportionally more, and in the limit, only the largest component matters.
Geometric Interpretation:
The unit ball of the L∞ norm in 2D is a square aligned with the axes. In 3D, it's a cube. In $n$ dimensions, it's an n-dimensional hypercube.
This shape is the dual of the L1 ball—and there's deep mathematical duality between L1 and L∞ norms that we'll explore.
Mathematical Properties:
```python
import numpy as np

# Define a vector
x = np.array([3, -7, 5, -2])

# L∞ norm calculation
linf_manual = np.max(np.abs(x))               # Maximum absolute value
linf_builtin = np.linalg.norm(x, ord=np.inf)  # NumPy with ord=inf

print(f"Vector x = {x}")
print(f"L∞ norm: ||x||_∞ = max(|3|, |-7|, |5|, |-2|) = max(3, 7, 5, 2) = {linf_manual}")

# Demonstrating the limit definition: Lp → L∞ as p → ∞
x = np.array([3, 7, 5, 2])
print(f"Vector for limit demonstration: {x}")
print(f"True L∞ norm: {np.linalg.norm(x, ord=np.inf)}")

for p in [1, 2, 4, 8, 16, 32, 64, 128]:
    lp = np.linalg.norm(x, ord=p)
    print(f"  L{p} norm: {lp:.6f}")

print(f"As p → ∞, the Lp norm converges to L∞ = 7")

# Practical use: gradient clipping with L∞
gradient = np.array([0.5, 2.1, -0.3, 0.8, -3.5])
threshold = 1.0

if np.linalg.norm(gradient, ord=np.inf) > threshold:
    # Clip using L∞ norm (element-wise clipping)
    clipped = np.clip(gradient, -threshold, threshold)
    print(f"Original gradient: {gradient}")
    print(f"L∞ clipped gradient: {clipped}")
    print(f"L∞ norm after clipping: {np.linalg.norm(clipped, ord=np.inf)}")
```

L∞ Norm in Machine Learning:
While less common than L1 and L2, the L∞ norm appears in:

- **Adversarial robustness:** perturbations are often constrained to an L∞ ball ($|\boldsymbol{\delta}|_\infty \leq \epsilon$), bounding the change to each feature or pixel.
- **Gradient clipping:** element-wise clipping bounds the L∞ norm of the gradient.
- **Convergence criteria:** stopping when the maximum parameter change falls below a tolerance.
- **Minimax (Chebyshev) approximation:** minimizing the worst-case error rather than the average.
L1 and L∞ are dual norms: $|\mathbf{x}|_1 = \max_{|\mathbf{y}|_\infty \leq 1} \mathbf{x}^T \mathbf{y}$ and $|\mathbf{x}|_\infty = \max_{|\mathbf{y}|_1 \leq 1} \mathbf{x}^T \mathbf{y}$. Similarly, L2 is self-dual. This duality appears in optimization (primal-dual methods) and in Hölder's inequality.
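Both identities can be verified numerically, since the maximizers are known in closed form: $\mathbf{y} = \text{sign}(\mathbf{x})$ for the first, and a signed standard basis vector at the largest component for the second.

```python
import numpy as np

x = np.array([3.0, -7.0, 5.0, -2.0])

# Dual of L∞: max of x·y over ||y||_∞ <= 1 is attained at y = sign(x)
y_sign = np.sign(x)
assert np.isclose(x @ y_sign, np.linalg.norm(x, ord=1))   # 17 = L1 norm

# Dual of L1: max of x·y over ||y||_1 <= 1 is attained at ±e_j, j = argmax |x_j|
j = np.argmax(np.abs(x))
y_star = np.zeros_like(x)
y_star[j] = np.sign(x[j])
assert np.isclose(x @ y_star, np.linalg.norm(x, ord=np.inf))  # 7 = L∞ norm

# Random feasible points never beat these maximizers
rng = np.random.default_rng(1)
Y = rng.uniform(-1, 1, size=(10000, 4))   # all rows satisfy ||y||_∞ <= 1
assert (Y @ x).max() <= np.linalg.norm(x, ord=1) + 1e-12

print("Dual-norm identities verified")
```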
All norms on a finite-dimensional vector space are equivalent in the sense that they define the same topology—the same notion of convergence and continuity. However, they are not interchangeable in practice: the multiplicative constants relating them can grow with the dimension $n$.
Norm Equivalence Theorem:
For any two norms $|\cdot|_a$ and $|\cdot|_b$ on $\mathbb{R}^n$, there exist constants $c, C > 0$ such that:
$$c |\mathbf{x}|_a \leq |\mathbf{x}|_b \leq C |\mathbf{x}|_a \quad \text{for all } \mathbf{x} \in \mathbb{R}^n$$
For the standard Lp norms, we have explicit inequalities:
| Inequality | Interpretation |
|---|---|
| $|\mathbf{x}|_\infty \leq |\mathbf{x}|_2 \leq \sqrt{n} |\mathbf{x}|_\infty$ | L2 is between L∞ and √n times L∞ |
| $|\mathbf{x}|_2 \leq |\mathbf{x}|_1 \leq \sqrt{n} |\mathbf{x}|_2$ | L1 is between L2 and √n times L2 |
| $|\mathbf{x}|_\infty \leq |\mathbf{x}|_1 \leq n |\mathbf{x}|_\infty$ | L1 is between L∞ and n times L∞ |
| $|\mathbf{x}|_q \leq |\mathbf{x}|_p$ for $p \leq q$ | Higher p gives smaller (or equal) norm |
The General Ordering:
For $1 \leq p \leq q \leq \infty$:
$$|\mathbf{x}|_\infty \leq |\mathbf{x}|_q \leq |\mathbf{x}|_p \leq |\mathbf{x}|_1$$
With equality throughout if and only if $\mathbf{x}$ has at most one non-zero component.
Practical Implications:

- A bound proven in one norm transfers to any other norm, at the cost of a dimension-dependent factor (up to $\sqrt{n}$ or $n$ for the Lp norms above).
- Convergence of an iterative algorithm in one norm implies convergence in every norm.
- In high dimensions these factors matter: for $n = 10^6$, the L1 norm of a vector can exceed its L2 norm by a factor of $1000$.
```python
import numpy as np

def compare_norms(x, name="x"):
    """Compare L1, L2, and L∞ norms with inequality verification."""
    n = len(x)
    l1 = np.linalg.norm(x, ord=1)
    l2 = np.linalg.norm(x, ord=2)
    linf = np.linalg.norm(x, ord=np.inf)

    print(f"{name} = {x}")
    print(f"Dimension n = {n}, sqrt(n) = {np.sqrt(n):.4f}")
    print(f"  L∞ norm: {linf:.4f}")
    print(f"  L2 norm: {l2:.4f}")
    print(f"  L1 norm: {l1:.4f}")

    # Verify inequalities
    print(f"Verifying inequalities:")
    print(f"  L∞ ≤ L2? {linf:.4f} ≤ {l2:.4f}? {linf <= l2 + 1e-10}")
    print(f"  L2 ≤ L1? {l2:.4f} ≤ {l1:.4f}? {l2 <= l1 + 1e-10}")
    print(f"  L2 ≤ √n·L∞? {l2:.4f} ≤ {np.sqrt(n)*linf:.4f}? {l2 <= np.sqrt(n)*linf + 1e-10}")
    print(f"  L1 ≤ √n·L2? {l1:.4f} ≤ {np.sqrt(n)*l2:.4f}? {l1 <= np.sqrt(n)*l2 + 1e-10}")

# Test with various vectors
compare_norms(np.array([1, 0, 0, 0, 0]), "sparse (one nonzero)")
compare_norms(np.array([1, 1, 1, 1, 1]), "uniform (all equal)")
compare_norms(np.array([1, 2, 3, 4, 5]), "increasing")
compare_norms(np.random.randn(100), "random 100D")
```

The unit ball of a norm, defined as $B_p = \{\mathbf{x} : |\mathbf{x}|_p \leq 1\}$, provides powerful geometric intuition for understanding norm behavior. The shape of the unit ball determines many properties of the norm and its applications.
Visualization in 2D:
The Sparsity Story Through Geometry:
Consider a convex optimization problem:
$$\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to} \quad |\mathbf{x}|_p \leq t$$
The solution lies where the level sets of $f$ (contours of equal cost) first touch the constraint ball.
For L1 (diamond): The touching point is most likely to be at a corner, where one or more coordinates are exactly zero. Hence, sparse solutions.
For L2 (circle): The touching point can be anywhere on the smooth boundary. No preference for sparse solutions.
For L∞ (square): The touching point is likely to be on a face (many coordinates equal in magnitude). This encourages coordinates to have similar magnitudes.
This geometric insight is the key to understanding why different regularizers produce qualitatively different solutions.
The L0 'norm' counts non-zero entries: $|\mathbf{x}|_0 = |\{i : x_i \neq 0\}|$. It's not actually a norm (violates homogeneity), and its 'unit ball' is the union of coordinate subspaces—a non-convex, disconnected set. Optimizing with L0 is NP-hard; L1 is the tightest convex relaxation, which is why LASSO is so important.
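A quick illustration of the L0 count and of the homogeneity failure (the helper `l0` is ours): scaling a vector changes every true norm proportionally, but leaves the non-zero count untouched.

```python
import numpy as np

def l0(x):
    """The L0 'norm': number of non-zero entries (not a true norm)."""
    return np.count_nonzero(x)

x = np.array([0.0, 3.0, 0.0, -1.5, 0.0])
print(f"||x||_0 = {l0(x)}")   # 2 non-zero entries

# Homogeneity fails: ||alpha * x||_0 = ||x||_0, not |alpha| * ||x||_0
alpha = 10.0
assert l0(alpha * x) == l0(x)
assert l0(alpha * x) != abs(alpha) * l0(x)
print("Homogeneity violated: scaling leaves the count unchanged")
```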
```python
import numpy as np
import matplotlib.pyplot as plt

def plot_unit_balls():
    """Visualize unit balls for various Lp norms in 2D."""
    fig, ax = plt.subplots(1, 1, figsize=(10, 10))

    norms_to_plot = [0.5, 1, 1.5, 2, 3, 5, 10, np.inf]
    colors = plt.cm.viridis(np.linspace(0, 1, len(norms_to_plot)))

    for p, color in zip(norms_to_plot, colors):
        if p == np.inf:
            # L∞ unit ball is a square
            x = np.array([-1, 1, 1, -1, -1])
            y = np.array([-1, -1, 1, 1, -1])
            label = 'L∞'
        else:
            # Lp boundary: |x|^p + |y|^p = 1 (non-convex when p < 1)
            t = np.linspace(0, 1, 250)
            x_pos = t
            y_pos = (1 - t**p)**(1/p)
            # Construct all four quadrants
            x = np.concatenate([x_pos, -x_pos[::-1], -x_pos, x_pos[::-1]])
            y = np.concatenate([y_pos, y_pos[::-1], -y_pos, -y_pos[::-1]])
            label = f'L{p}'

        ax.plot(x, y, color=color, linewidth=2, label=label)

    ax.set_xlim(-1.5, 1.5)
    ax.set_ylim(-1.5, 1.5)
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)
    ax.legend(loc='upper right', fontsize=12)
    ax.set_title('Unit Balls for Various Lp Norms', fontsize=14)
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)

    plt.tight_layout()
    plt.savefig('unit_balls.png', dpi=150, bbox_inches='tight')
    plt.show()

plot_unit_balls()

# Note: This code generates a visualization showing L1 (diamond), L2 (circle),
# L∞ (square), and intermediate Lp norms. The shape smoothly interpolates as p changes.
```

Vector norms are not abstract mathematical curiosities—they are fundamental tools that appear throughout machine learning. Understanding when and why to use each norm is essential for practitioners.
Choosing the Right Norm:
| Situation | Recommended Norm | Rationale |
|---|---|---|
| Standard regression/classification loss | L2 (squared) | Smooth, differentiable, natural for Gaussian noise assumption |
| Outlier-robust regression | L1 (MAE) | Linear penalty reduces outlier influence; median-optimal |
| Feature selection / sparse models | L1 regularization | Drives coefficients to exact zero |
| General regularization without sparsity | L2 regularization | Smooth, closed-form solution, stable |
| Both sparsity and grouping | Elastic Net (L1 + L2) | Combines benefits of both |
| Adversarial robustness analysis | L∞ | Bounds maximum perturbation per feature |
| Embedding normalization | L2 | Unit vectors have nice geometric properties |
Regularization: L1 vs L2
Regularization adds a penalty to the loss function based on the norm of the model parameters:
$$\min_{\mathbf{w}} L(\mathbf{w}; \text{data}) + \lambda R(\mathbf{w})$$
where $R(\mathbf{w}) = |\mathbf{w}|_1$ (Lasso/L1) or $R(\mathbf{w}) = |\mathbf{w}|_2^2$ (Ridge/L2).
Key Differences:

- **Sparsity:** L1 drives many weights to exactly zero (implicit feature selection); L2 only shrinks them.
- **Solution method:** Ridge has a closed-form solution; Lasso requires iterative methods such as coordinate descent or proximal gradient.
- **Correlated features:** Lasso tends to select one feature from a correlated group; Ridge spreads weight across the group; Elastic Net combines both behaviors.
In deep learning, L2 regularization is equivalent to weight decay: adding $\lambda \cdot w$ to each gradient update. L1 regularization is less common but promotes sparse networks. However, the L1 and L2 norms of gradients are used for gradient clipping: if $|\nabla L|_2 > \text{threshold}$, rescale to prevent exploding gradients.
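For plain SGD, the weight-decay equivalence is a one-line calculation. The sketch below (using the $\frac{\lambda}{2}|\mathbf{w}|_2^2$ convention, so the penalty gradient is exactly $\lambda \mathbf{w}$) checks it numerically, and adds a minimal L2 gradient clip:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(5)
grad_loss = rng.standard_normal(5)   # stand-in for dL/dw from a minibatch
lr, lam = 0.1, 0.01

# Path A: add the gradient of the L2 penalty (lam/2 * ||w||^2) to the loss gradient
w_a = w - lr * (grad_loss + lam * w)

# Path B: weight decay — shrink the weights directly, then take the plain step
w_b = (1 - lr * lam) * w - lr * grad_loss

assert np.allclose(w_a, w_b)   # identical updates for plain SGD

# Gradient clipping by global L2 norm: rescale, preserving direction
def clip_by_l2(g, max_norm):
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g

big_grad = np.array([30.0, -40.0])           # L2 norm = 50
clipped = clip_by_l2(big_grad, max_norm=5.0)
print(np.linalg.norm(clipped))               # ~5.0, up to rounding
print(clipped)                               # same direction as big_grad
```

Note that for adaptive optimizers like Adam, the two paths are no longer equivalent, which is why decoupled weight decay (AdamW) exists as a separate method.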
We've covered the mathematical foundations of vector norms—the essential tools for measuring 'size' in machine learning.
What's Next:
We've focused on vector norms—how to measure a single vector's magnitude. The next page extends these concepts to matrix norms, which measure the 'size' of linear transformations. Matrix norms are essential for understanding condition numbers, stability of algorithms, and spectral analysis.
You now understand the mathematical underpinnings of vector norms and their central role in machine learning. These concepts will recur constantly—in loss functions, regularization, convergence analysis, and algorithm design. Master these, and you've mastered a fundamental language of ML.