Imagine you're building a recommendation system, and you need to determine how similar two users are based on their preferences. Or perhaps you're training a neural network and need to measure how far your predictions are from the true values. In both cases, you face a fundamental question: How do we measure the 'size' or 'length' of a vector?
This seemingly simple question has profound implications for nearly every machine learning algorithm. The answer lies in vector norms—mathematical tools that generalize our intuitive notion of length to high-dimensional spaces.
Vector norms are not mere abstract concepts; they are the workhorses of machine learning. They appear in loss functions, regularization terms, optimization convergence criteria, similarity measures, and error metrics. Understanding norms deeply is understanding the mathematical language in which machine learning speaks.
By the end of this page, you will understand the rigorous mathematical definition of norms, the geometric and algebraic properties of L1, L2, and L∞ norms, their relationships to each other, and why choosing the right norm can mean the difference between a model that generalizes beautifully and one that overfits catastrophically.
Before exploring specific norms, we must establish what makes a function qualify as a norm. A norm is a function that assigns a non-negative real number to each vector in a vector space, intuitively representing the vector's 'magnitude' or 'length.'
Formal Definition:
A function $|\cdot|: V \to \mathbb{R}$ on a vector space $V$ over the field $\mathbb{R}$ (or $\mathbb{C}$) is called a norm if and only if it satisfies the following three axioms for all vectors $\mathbf{x}, \mathbf{y} \in V$ and all scalars $\alpha \in \mathbb{R}$:

1. **Non-negativity (positive definiteness):** $|\mathbf{x}| \geq 0$, with $|\mathbf{x}| = 0$ if and only if $\mathbf{x} = \mathbf{0}$.
2. **Absolute homogeneity:** $|\alpha \mathbf{x}| = |\alpha| \, |\mathbf{x}|$, where $|\alpha|$ denotes the absolute value of the scalar.
3. **Triangle inequality:** $|\mathbf{x} + \mathbf{y}| \leq |\mathbf{x}| + |\mathbf{y}|$.
These axioms aren't arbitrary mathematical formalism. Non-negativity ensures that 'distance' is never negative. Homogeneity ensures consistent scaling behavior. The triangle inequality ensures that our notion of 'shortest path' is geometrically sensible. Machine learning algorithms implicitly rely on these properties when using norms in optimization and loss functions.
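The three axioms are easy to spot-check numerically. The sketch below (the helper `check_norm_axioms` is ours, not a library function) verifies all three for NumPy's built-in L1, L2, and L∞ norms on concrete vectors:

```python
import numpy as np

def check_norm_axioms(norm, x, y, alpha, tol=1e-12):
    """Numerically spot-check the three norm axioms for given inputs."""
    # 1. Non-negativity, with definiteness at the zero vector
    assert norm(x) >= 0
    assert norm(np.zeros_like(x)) == 0
    # 2. Absolute homogeneity: ||alpha * x|| = |alpha| * ||x||
    assert abs(norm(alpha * x) - abs(alpha) * norm(x)) < tol
    # 3. Triangle inequality: ||x + y|| <= ||x|| + ||y||
    assert norm(x + y) <= norm(x) + norm(y) + tol
    return True

x = np.array([3.0, -4.0, 1.0])
y = np.array([-1.0, 2.0, 5.0])

for ord_ in (1, 2, np.inf):
    check_norm_axioms(lambda v: np.linalg.norm(v, ord=ord_), x, y, alpha=-2.5)
print("All three axioms hold for L1, L2, and L∞ on these inputs")
```

A check like this doesn't prove the axioms, of course, but it is a useful sanity test when implementing a custom norm.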
The p-Norm Family:
The most important family of norms in machine learning is the Lp-norm (also written as $\ell_p$-norm), defined for $p \geq 1$ as:
$$|\mathbf{x}|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$$
where $\mathbf{x} = (x_1, x_2, \ldots, x_n)^T \in \mathbb{R}^n$.
For $p \geq 1$, this formula satisfies all three norm axioms. The cases $p = 1$, $p = 2$, and $p = \infty$ are by far the most commonly used in machine learning, and we will study each in depth.
When $0 < p < 1$, the formula $\left( \sum_i |x_i|^p \right)^{1/p}$ does NOT satisfy the triangle inequality, and is therefore NOT a true norm. However, these 'quasi-norms' are still useful in sparse optimization (e.g., the L0 'norm' which counts non-zero entries). This distinction is critical for theoretical understanding.
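To see the failure concretely, here is a minimal counterexample for $p = 1/2$: each standard basis vector has 'norm' 1, but their sum has 'norm' 4, violating the triangle inequality.

```python
import numpy as np

def quasi_norm(x, p):
    """The Lp formula for any p > 0 (a true norm only when p >= 1)."""
    return np.sum(np.abs(x)**p)**(1/p)

p = 0.5
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

lhs = quasi_norm(e1 + e2, p)                 # (1^0.5 + 1^0.5)^2 = 4
rhs = quasi_norm(e1, p) + quasi_norm(e2, p)  # 1 + 1 = 2

print(f"||e1 + e2||_{p} = {lhs}")            # 4.0
print(f"||e1||_{p} + ||e2||_{p} = {rhs}")    # 2.0
print(f"Triangle inequality violated: {lhs > rhs}")
```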
The L2 norm, also known as the Euclidean norm, is the most natural generalization of our intuitive notion of 'length' from 2D and 3D geometry to arbitrary dimensions. It corresponds to the straight-line distance from the origin to a point.
Definition:
$$|\mathbf{x}|_2 = \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$
Alternatively, using the inner product:
$$|\mathbf{x}|_2 = \sqrt{\mathbf{x}^T \mathbf{x}} = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle}$$
This connection to the inner product is fundamental—the L2 norm is the unique norm induced by the standard Euclidean inner product.
Geometric Interpretation:
In 2D, the set of all points with $|\mathbf{x}|_2 = 1$ forms a circle centered at the origin. In 3D, it's a sphere. In $n$ dimensions, it's a hypersphere.
This circular symmetry is crucial: the L2 norm treats all directions equally. Rotating a vector doesn't change its L2 norm, a property called rotational invariance.
Mathematical Properties:
In practice, we often minimize $|\mathbf{x}|_2^2$ rather than $|\mathbf{x}|_2$. The squared norm is differentiable everywhere (including at zero) and eliminates the square root, simplifying gradients and enabling closed-form solutions in regression. The minimizers are identical, so this is purely a computational convenience.
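As a sanity check on this convenience, the sketch below verifies the standard gradients $\nabla |\mathbf{x}|_2^2 = 2\mathbf{x}$ (defined everywhere) and $\nabla |\mathbf{x}|_2 = \mathbf{x} / |\mathbf{x}|_2$ (undefined at the origin) against central finite differences:

```python
import numpy as np

def grad_sq_l2(x):
    """Gradient of ||x||_2^2 = sum(x_i^2): simply 2x, defined everywhere."""
    return 2 * x

def grad_l2(x):
    """Gradient of ||x||_2 = x / ||x||_2; undefined at the origin."""
    return x / np.linalg.norm(x)

x = np.array([3.0, 4.0])
eps = 1e-6

# Compare each analytic gradient to a central finite-difference estimate
for f, grad in [(lambda v: np.sum(v**2), grad_sq_l2),
                (np.linalg.norm, grad_l2)]:
    numeric = np.array([(f(x + eps*e) - f(x - eps*e)) / (2*eps)
                        for e in np.eye(len(x))])
    assert np.allclose(numeric, grad(x), atol=1e-5)

print("grad ||x||_2^2 =", grad_sq_l2(x))   # [6. 8.]
print("grad ||x||_2   =", grad_l2(x))      # [0.6 0.8]
```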
```python
import numpy as np

# Define a vector
x = np.array([3, 4])

# L2 norm calculation - multiple equivalent approaches
l2_manual = np.sqrt(np.sum(x**2))        # Manual formula
l2_dot = np.sqrt(np.dot(x, x))           # Using dot product
l2_builtin = np.linalg.norm(x)           # NumPy built-in (default is L2)
l2_explicit = np.linalg.norm(x, ord=2)   # Explicitly specifying L2

print(f"Vector x = {x}")
print(f"L2 norm (manual): ||x||_2 = √(3² + 4²) = √25 = {l2_manual}")
print(f"L2 norm (dot prod): ||x||_2 = √(x·x) = {l2_dot}")
print(f"L2 norm (built-in): ||x||_2 = {l2_builtin}")

# Higher dimensional example
y = np.array([1, 2, 2, 4])
print(f"Vector y = {y}")
print(f"L2 norm: ||y||_2 = √(1 + 4 + 4 + 16) = √25 = {np.linalg.norm(y)}")

# Verifying rotational invariance
theta = np.pi/4  # 45 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # Rotation matrix

x_rotated = R @ x
print(f"Original vector: {x}, L2 norm = {np.linalg.norm(x)}")
print(f"After 45° rotation: {x_rotated}, L2 norm = {np.linalg.norm(x_rotated):.6f}")
print(f"Rotational invariance verified: norms are equal")
```

L2 Norm in Machine Learning:
The L2 norm appears throughout machine learning:

- **Mean squared error (MSE):** the squared L2 norm of the residual vector is the standard regression loss.
- **Ridge (L2) regularization:** penalizing $|\mathbf{w}|_2^2$ shrinks weights smoothly toward zero.
- **Euclidean distance:** $|\mathbf{x} - \mathbf{y}|_2$ underlies k-nearest neighbors and k-means clustering.
- **Embedding normalization:** dividing by the L2 norm yields unit vectors, so dot products become cosine similarities.
The L1 norm, also called the Manhattan norm or taxicab norm, measures distance as if you were navigating a city grid—you can only travel along axes, never diagonally.
Definition:
$$|\mathbf{x}|_1 = \sum_{i=1}^{n} |x_i| = |x_1| + |x_2| + \cdots + |x_n|$$
The name 'Manhattan' comes from the grid-like street layout of Manhattan, where the shortest walking distance between two points requires traveling along streets (horizontal/vertical) rather than cutting through buildings diagonally.
Geometric Interpretation:
The unit ball of the L1 norm (all points with $|\mathbf{x}|_1 \leq 1$) forms a rhombus in 2D (diamond shape), an octahedron in 3D, and a cross-polytope in higher dimensions.
Unlike the smooth, curved boundary of the L2 ball, the L1 ball has sharp corners that lie exactly on the coordinate axes. This geometry is the key to understanding why L1 regularization promotes sparsity—a property we'll explore in depth.
Why L1 Induces Sparsity (Geometric Intuition):
Consider minimizing a loss function $L(\mathbf{w})$ subject to an L1 ball constraint $|\mathbf{w}|_1 \leq t$. Graphically, we're finding where the loss contours first touch the constraint region.
The key insight: the L1 ball has corners on the axes, and the loss contours (ellipses for least squares) are most likely to touch these corners first. At a corner, one or more coordinates are exactly zero.
Contrast with L2: the L2 ball is smooth everywhere. The touching point can be anywhere on the sphere—there's no preference for axis-aligned points, so solutions are typically dense with no exact zeros.
This geometric difference explains why Lasso (L1) produces sparse models while Ridge (L2) produces models with small but non-zero coefficients.
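This difference can be seen on synthetic data. The sketch below (an illustrative setup, not a benchmark) solves Ridge in closed form and Lasso with ISTA, a standard proximal-gradient method; only 3 of 20 features actually matter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression: only 3 of 20 features are truly relevant
n, d = 100, 20
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[[0, 5, 12]] = [2.0, -3.0, 1.5]
y = X @ w_true + 0.1 * rng.standard_normal(n)

lam = 5.0

# Ridge: closed-form solution (X^T X + lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Lasso: ISTA (proximal gradient descent with soft-thresholding)
step = 1.0 / np.linalg.norm(X, ord=2)**2   # 1 / Lipschitz constant of the gradient
w_lasso = np.zeros(d)
for _ in range(2000):
    grad = X.T @ (X @ w_lasso - y)
    z = w_lasso - step * grad
    w_lasso = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0)  # soft-threshold

print(f"Ridge exact zeros: {np.sum(np.abs(w_ridge) < 1e-8)} of {d}")  # typically 0
print(f"Lasso exact zeros: {np.sum(np.abs(w_lasso) < 1e-8)} of {d}")  # most coefficients
```

On runs like this, Lasso recovers the three relevant features and sets the irrelevant coefficients exactly to zero, while Ridge leaves all twenty small but non-zero.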
```python
import numpy as np

# Define a vector
x = np.array([3, -4, 5])

# L1 norm calculation
l1_manual = np.sum(np.abs(x))          # Sum of absolute values
l1_builtin = np.linalg.norm(x, ord=1)  # NumPy with ord=1

print(f"Vector x = {x}")
print(f"L1 norm: ||x||_1 = |3| + |-4| + |5| = 3 + 4 + 5 = {l1_manual}")

# Comparing L1 vs L2 norm behavior with outliers
normal_errors = np.array([1, 2, 1, 2, 1])
with_outlier = np.array([1, 2, 1, 2, 20])  # One outlier

print(f"Normal errors: {normal_errors}")
print(f"  L1 norm: {np.linalg.norm(normal_errors, 1):.2f}")
print(f"  L2 norm: {np.linalg.norm(normal_errors, 2):.2f}")
print(f"  L2 squared: {np.sum(normal_errors**2):.2f}")

print(f"With outlier: {with_outlier}")
print(f"  L1 norm: {np.linalg.norm(with_outlier, 1):.2f}")    # Increases linearly
print(f"  L2 norm: {np.linalg.norm(with_outlier, 2):.2f}")
print(f"  L2 squared: {np.sum(with_outlier**2):.2f}")          # Dominated by outlier!

# Raising the last error from 1 to 20 adds 19 to the L1 norm,
# but 20² - 1² = 399 to the squared L2 norm — hence MSE's outlier sensitivity
```

While $|x|$ isn't differentiable at $x=0$, we can use the subgradient: any value in $[-1, 1]$ is a valid subgradient at zero. This enables algorithms like proximal gradient descent and subgradient methods to handle L1 optimization effectively. The LASSO solution often uses soft-thresholding: $S_\lambda(x) = \text{sign}(x) \cdot \max(|x| - \lambda, 0)$.
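A minimal implementation of this operator makes its sparsifying effect visible: every component with magnitude at most $\lambda$ is set exactly to zero, and the rest shrink toward zero by $\lambda$.

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of lam * ||x||_1: S_lam(x) = sign(x) * max(|x| - lam, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([3.0, -0.4, 1.2, 0.05, -2.5])
# Components with |x_i| <= 0.5 become exactly zero; others shrink by 0.5
print(soft_threshold(x, lam=0.5))
```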
The L∞ norm, also called the maximum norm, Chebyshev norm, or uniform norm, measures the largest absolute component of a vector.
Definition:
$$|\mathbf{x}|_\infty = \max_{i=1,\ldots,n} |x_i|$$
The name 'L∞' comes from taking the limit as $p \to \infty$ in the Lp norm:
$$\lim_{p \to \infty} |\mathbf{x}|_p = \max_i |x_i|$$
This limit result is mathematically elegant: as $p$ increases, larger components contribute proportionally more, and in the limit, only the largest component matters.
Geometric Interpretation:
The unit ball of the L∞ norm in 2D is a square aligned with the axes. In 3D, it's a cube. In $n$ dimensions, it's an n-dimensional hypercube.
This shape is the dual of the L1 ball—and there's deep mathematical duality between L1 and L∞ norms that we'll explore.
Mathematical Properties:
```python
import numpy as np

# Define a vector
x = np.array([3, -7, 5, -2])

# L∞ norm calculation
linf_manual = np.max(np.abs(x))               # Maximum absolute value
linf_builtin = np.linalg.norm(x, ord=np.inf)  # NumPy with ord=inf

print(f"Vector x = {x}")
print(f"L∞ norm: ||x||_∞ = max(|3|, |-7|, |5|, |-2|) = max(3, 7, 5, 2) = {linf_manual}")

# Demonstrating the limit definition: Lp → L∞ as p → ∞
x = np.array([3, 7, 5, 2])
print(f"Vector for limit demonstration: {x}")
print(f"True L∞ norm: {np.linalg.norm(x, ord=np.inf)}")

for p in [1, 2, 4, 8, 16, 32, 64, 128]:
    lp = np.linalg.norm(x, ord=p)
    print(f"  L{p} norm: {lp:.6f}")

print(f"As p → ∞, the Lp norm converges to L∞ = 7")

# Practical use: gradient clipping with L∞
gradient = np.array([0.5, 2.1, -0.3, 0.8, -3.5])
threshold = 1.0

if np.linalg.norm(gradient, ord=np.inf) > threshold:
    # Clip using L∞ norm (element-wise clipping)
    clipped = np.clip(gradient, -threshold, threshold)
    print(f"Original gradient: {gradient}")
    print(f"L∞ clipped gradient: {clipped}")
    print(f"L∞ norm after clipping: {np.linalg.norm(clipped, ord=np.inf)}")
```

L∞ Norm in Machine Learning:
While less common than L1 and L2, the L∞ norm appears in:

- **Adversarial robustness:** perturbations are often constrained to an L∞ ball ($|\boldsymbol{\delta}|_\infty \leq \epsilon$), bounding the change to each feature or pixel.
- **Gradient clipping:** element-wise clipping bounds the L∞ norm of the gradient.
- **Convergence criteria:** stopping when the maximum parameter change falls below a tolerance.
- **Minimax (Chebyshev) approximation:** minimizing the worst-case error rather than the average.
L1 and L∞ are dual norms: $|\mathbf{x}|_1 = \max_{|\mathbf{y}|_\infty \leq 1} \mathbf{x}^T \mathbf{y}$ and $|\mathbf{x}|_\infty = \max_{|\mathbf{y}|_1 \leq 1} \mathbf{x}^T \mathbf{y}$. Similarly, L2 is self-dual. This duality appears in optimization (primal-dual methods) and in Hölder's inequality.
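Both identities can be verified numerically, since the maximizers are known in closed form: $\mathbf{y} = \text{sign}(\mathbf{x})$ for the first, and a signed standard basis vector at the largest component for the second.

```python
import numpy as np

x = np.array([3.0, -7.0, 5.0, -2.0])

# Dual of L∞: max of x·y over ||y||_∞ <= 1 is attained at y = sign(x)
y_sign = np.sign(x)
assert np.isclose(x @ y_sign, np.linalg.norm(x, ord=1))   # 17 = L1 norm

# Dual of L1: max of x·y over ||y||_1 <= 1 is attained at ±e_j, j = argmax |x_j|
j = np.argmax(np.abs(x))
y_star = np.zeros_like(x)
y_star[j] = np.sign(x[j])
assert np.isclose(x @ y_star, np.linalg.norm(x, ord=np.inf))  # 7 = L∞ norm

# Random feasible points never beat these maximizers
rng = np.random.default_rng(1)
Y = rng.uniform(-1, 1, size=(10000, 4))   # all rows satisfy ||y||_∞ <= 1
assert (Y @ x).max() <= np.linalg.norm(x, ord=1) + 1e-12

print("Dual-norm identities verified")
```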
All norms on a finite-dimensional vector space are equivalent in the sense that they define the same topology—the same notion of convergence and continuity. However, they are not interchangeable in practice: the multiplicative constants relating them can grow with the dimension $n$.
Norm Equivalence Theorem:
For any two norms $|\cdot|_a$ and $|\cdot|_b$ on $\mathbb{R}^n$, there exist constants $c, C > 0$ such that:
$$c |\mathbf{x}|_a \leq |\mathbf{x}|_b \leq C |\mathbf{x}|_a \quad \text{for all } \mathbf{x} \in \mathbb{R}^n$$
For the standard Lp norms, we have explicit inequalities:
| Inequality | Interpretation |
|---|---|
| $|\mathbf{x}|_\infty \leq |\mathbf{x}|_2 \leq \sqrt{n} |\mathbf{x}|_\infty$ | L2 is between L∞ and √n times L∞ |
| $|\mathbf{x}|_2 \leq |\mathbf{x}|_1 \leq \sqrt{n} |\mathbf{x}|_2$ | L1 is between L2 and √n times L2 |
| $|\mathbf{x}|_\infty \leq |\mathbf{x}|_1 \leq n |\mathbf{x}|_\infty$ | L1 is between L∞ and n times L∞ |
| $|\mathbf{x}|_q \leq |\mathbf{x}|_p$ for $p \leq q$ | Higher p gives smaller (or equal) norm |
The General Ordering:
For $1 \leq p \leq q \leq \infty$:
$$|\mathbf{x}|_\infty \leq |\mathbf{x}|_q \leq |\mathbf{x}|_p \leq |\mathbf{x}|_1$$
With equality throughout if and only if $\mathbf{x}$ has at most one non-zero component.
Practical Implications:

- A bound proven in one norm transfers to any other norm, at the cost of a dimension-dependent factor (up to $\sqrt{n}$ or $n$ for the Lp norms above).
- Convergence of an iterative algorithm in one norm implies convergence in every norm.
- In high dimensions these factors matter: for $n = 10^6$, the L1 norm of a vector can exceed its L2 norm by a factor of $1000$.
```python
import numpy as np

def compare_norms(x, name="x"):
    """Compare L1, L2, and L∞ norms with inequality verification."""
    n = len(x)
    l1 = np.linalg.norm(x, ord=1)
    l2 = np.linalg.norm(x, ord=2)
    linf = np.linalg.norm(x, ord=np.inf)

    print(f"{name} = {x}")
    print(f"Dimension n = {n}, sqrt(n) = {np.sqrt(n):.4f}")
    print(f"  L∞ norm: {linf:.4f}")
    print(f"  L2 norm: {l2:.4f}")
    print(f"  L1 norm: {l1:.4f}")

    # Verify inequalities
    print(f"Verifying inequalities:")
    print(f"  L∞ ≤ L2? {linf:.4f} ≤ {l2:.4f}? {linf <= l2 + 1e-10}")
    print(f"  L2 ≤ L1? {l2:.4f} ≤ {l1:.4f}? {l2 <= l1 + 1e-10}")
    print(f"  L2 ≤ √n·L∞? {l2:.4f} ≤ {np.sqrt(n)*linf:.4f}? {l2 <= np.sqrt(n)*linf + 1e-10}")
    print(f"  L1 ≤ √n·L2? {l1:.4f} ≤ {np.sqrt(n)*l2:.4f}? {l1 <= np.sqrt(n)*l2 + 1e-10}")

# Test with various vectors
compare_norms(np.array([1, 0, 0, 0, 0]), "sparse (one nonzero)")
compare_norms(np.array([1, 1, 1, 1, 1]), "uniform (all equal)")
compare_norms(np.array([1, 2, 3, 4, 5]), "increasing")
compare_norms(np.random.randn(100), "random 100D")
```

The unit ball of a norm, defined as $B_p = \{\mathbf{x} : |\mathbf{x}|_p \leq 1\}$, provides powerful geometric intuition for understanding norm behavior. The shape of the unit ball determines many properties of the norm and its applications.
Visualization in 2D:
The Sparsity Story Through Geometry:
Consider a convex optimization problem:
$$\min_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to} \quad |\mathbf{x}|_p \leq t$$
The solution lies where the level sets of $f$ (contours of equal cost) first touch the constraint ball.
For L1 (diamond): The touching point is most likely to be at a corner, where one or more coordinates are exactly zero. Hence, sparse solutions.
For L2 (circle): The touching point can be anywhere on the smooth boundary. No preference for sparse solutions.
For L∞ (square): The touching point is likely to be on a face (many coordinates equal in magnitude). This encourages coordinates to have similar magnitudes.
This geometric insight is the key to understanding why different regularizers produce qualitatively different solutions.
The L0 'norm' counts non-zero entries: $|\mathbf{x}|_0 = |\{i : x_i \neq 0\}|$. It's not actually a norm (violates homogeneity), and its 'unit ball' is the union of coordinate subspaces—a non-convex, disconnected set. Optimizing with L0 is NP-hard; L1 is the tightest convex relaxation, which is why LASSO is so important.
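A quick illustration of the L0 count and of the homogeneity failure (the helper `l0` is ours): scaling a vector changes every true norm proportionally, but leaves the non-zero count untouched.

```python
import numpy as np

def l0(x):
    """The L0 'norm': number of non-zero entries (not a true norm)."""
    return np.count_nonzero(x)

x = np.array([0.0, 3.0, 0.0, -1.5, 0.0])
print(f"||x||_0 = {l0(x)}")   # 2 non-zero entries

# Homogeneity fails: ||alpha * x||_0 = ||x||_0, not |alpha| * ||x||_0
alpha = 10.0
assert l0(alpha * x) == l0(x)
assert l0(alpha * x) != abs(alpha) * l0(x)
print("Homogeneity violated: scaling leaves the count unchanged")
```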
```python
import numpy as np
import matplotlib.pyplot as plt

def plot_unit_balls():
    """Visualize unit balls for various Lp norms in 2D."""
    fig, ax = plt.subplots(1, 1, figsize=(10, 10))

    norms_to_plot = [0.5, 1, 1.5, 2, 3, 5, 10, np.inf]
    colors = plt.cm.viridis(np.linspace(0, 1, len(norms_to_plot)))

    for p, color in zip(norms_to_plot, colors):
        if p == np.inf:
            # L∞ unit ball is a square
            x = np.array([-1, 1, 1, -1, -1])
            y = np.array([-1, -1, 1, 1, -1])
            label = 'L∞'
        else:
            # Lp boundary: |x|^p + |y|^p = 1 (non-convex when p < 1)
            t = np.linspace(0, 1, 250)
            x_pos = t
            y_pos = (1 - t**p)**(1/p)
            # Construct all four quadrants
            x = np.concatenate([x_pos, -x_pos[::-1], -x_pos, x_pos[::-1]])
            y = np.concatenate([y_pos, y_pos[::-1], -y_pos, -y_pos[::-1]])
            label = f'L{p}'

        ax.plot(x, y, color=color, linewidth=2, label=label)

    ax.set_xlim(-1.5, 1.5)
    ax.set_ylim(-1.5, 1.5)
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)
    ax.legend(loc='upper right', fontsize=12)
    ax.set_title('Unit Balls for Various Lp Norms', fontsize=14)
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)

    plt.tight_layout()
    plt.savefig('unit_balls.png', dpi=150, bbox_inches='tight')
    plt.show()

plot_unit_balls()

# Note: This code generates a visualization showing L1 (diamond), L2 (circle),
# L∞ (square), and intermediate Lp norms. The shape smoothly interpolates as p changes.
```

Vector norms are not abstract mathematical curiosities—they are fundamental tools that appear throughout machine learning. Understanding when and why to use each norm is essential for practitioners.
Choosing the Right Norm:
| Situation | Recommended Norm | Rationale |
|---|---|---|
| Standard regression/classification loss | L2 (squared) | Smooth, differentiable, natural for Gaussian noise assumption |
| Outlier-robust regression | L1 (MAE) | Linear penalty reduces outlier influence; median-optimal |
| Feature selection / sparse models | L1 regularization | Drives coefficients to exact zero |
| General regularization without sparsity | L2 regularization | Smooth, closed-form solution, stable |
| Both sparsity and grouping | Elastic Net (L1 + L2) | Combines benefits of both |
| Adversarial robustness analysis | L∞ | Bounds maximum perturbation per feature |
| Embedding normalization | L2 | Unit vectors have nice geometric properties |
Regularization: L1 vs L2
Regularization adds a penalty to the loss function based on the norm of the model parameters:
$$\min_{\mathbf{w}} L(\mathbf{w}; \text{data}) + \lambda R(\mathbf{w})$$
where $R(\mathbf{w}) = |\mathbf{w}|_1$ (Lasso/L1) or $R(\mathbf{w}) = |\mathbf{w}|_2^2$ (Ridge/L2).
Key Differences:

- **Sparsity:** L1 drives many weights to exactly zero (implicit feature selection); L2 only shrinks them.
- **Solution method:** Ridge has a closed-form solution; Lasso requires iterative methods such as coordinate descent or proximal gradient.
- **Correlated features:** Lasso tends to select one feature from a correlated group; Ridge spreads weight across the group; Elastic Net combines both behaviors.
In deep learning, L2 regularization is equivalent to weight decay: adding $\lambda \cdot w$ to each gradient update. L1 regularization is less common but promotes sparse networks. However, the L1 and L2 norms of gradients are used for gradient clipping: if $|\nabla L|_2 > \text{threshold}$, rescale to prevent exploding gradients.
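For plain SGD, the weight-decay equivalence is a one-line calculation. The sketch below (using the $\frac{\lambda}{2}|\mathbf{w}|_2^2$ convention, so the penalty gradient is exactly $\lambda \mathbf{w}$) checks it numerically, and adds a minimal L2 gradient clip:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(5)
grad_loss = rng.standard_normal(5)   # stand-in for dL/dw from a minibatch
lr, lam = 0.1, 0.01

# Path A: add the gradient of the L2 penalty (lam/2 * ||w||^2) to the loss gradient
w_a = w - lr * (grad_loss + lam * w)

# Path B: weight decay — shrink the weights directly, then take the plain step
w_b = (1 - lr * lam) * w - lr * grad_loss

assert np.allclose(w_a, w_b)   # identical updates for plain SGD

# Gradient clipping by global L2 norm: rescale, preserving direction
def clip_by_l2(g, max_norm):
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g

big_grad = np.array([30.0, -40.0])           # L2 norm = 50
clipped = clip_by_l2(big_grad, max_norm=5.0)
print(np.linalg.norm(clipped))               # ~5.0, up to rounding
print(clipped)                               # same direction as big_grad
```

Note that for adaptive optimizers like Adam, the two paths are no longer equivalent, which is why decoupled weight decay (AdamW) exists as a separate method.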
We've covered the mathematical foundations of vector norms—the essential tools for measuring 'size' in machine learning.
What's Next:
We've focused on vector norms—how to measure a single vector's magnitude. The next page extends these concepts to matrix norms, which measure the 'size' of linear transformations. Matrix norms are essential for understanding condition numbers, stability of algorithms, and spectral analysis.
You now understand the mathematical underpinnings of vector norms and their central role in machine learning. These concepts will recur constantly—in loss functions, regularization, convergence analysis, and algorithm design. Master these, and you've mastered a fundamental language of ML.