Imagine you're lost in a foggy mountain landscape, trying to find the lowest valley. If the terrain is arbitrary—with cliffs, plateaus, and hidden ravines—your task is nearly impossible. You might descend into a local depression, only to realize you're still far above the true valley floor hidden somewhere beyond.
But now imagine the terrain forms a single, smooth bowl. No matter where you start, if you simply walk downhill, you're guaranteed to reach the absolute lowest point. There are no false bottoms, no deceptive dips—just one inevitable destination.
This is the essence of convexity—the mathematical property that transforms intractable optimization into a guaranteed journey to the global optimum. In machine learning, convexity is not merely a convenience; it's the fundamental distinction between algorithms that provably work and algorithms that hope for the best.
By the end of this page, you will understand: (1) the precise mathematical definition of convex sets, (2) how to verify set convexity through intersection and projection operations, (3) the definition and characterization of convex functions, (4) Jensen's inequality and its far-reaching implications, and (5) why convexity guarantees that local optima are global optima—the holy grail of optimization.
A convex set captures a simple but powerful geometric idea: if you can see from any point to any other point without leaving the set, the set is convex.
Formal Definition:
A set $C \subseteq \mathbb{R}^n$ is convex if for any two points $x, y \in C$ and any scalar $\theta \in [0, 1]$, the point
$$\theta x + (1 - \theta) y \in C$$
In other words, the entire line segment connecting $x$ and $y$ lies within $C$. This seemingly simple property has profound implications for optimization.
Think of a convex set as a set with no "dents" or "holes." If you stretch a rubber band between any two points inside the set, the entire rubber band stays inside. A circle is convex. A star is not—lines between some points would exit the star's boundary.
Understanding the Convex Combination:
The expression $\theta x + (1 - \theta) y$ with $\theta \in [0, 1]$ is called a convex combination of $x$ and $y$. Geometrically, $\theta = 1$ recovers $x$, $\theta = 0$ recovers $y$, and intermediate values of $\theta$ sweep out the line segment between them.
The convex combination weights always sum to 1: $\theta + (1 - \theta) = 1$. This constraint ensures we are describing points on the line segment itself, not the extended line through $x$ and $y$.
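As a concrete (if minimal) illustration, the short sketch below sweeps $\theta$ from 0 to 1 for two arbitrary points and prints the resulting convex combinations:

```python
import numpy as np

x = np.array([2.0, 0.0])
y = np.array([0.0, 2.0])

# theta = 1 recovers x, theta = 0 recovers y, theta = 0.5 is the midpoint
for theta in [0.0, 0.25, 0.5, 0.75, 1.0]:
    z = theta * x + (1 - theta) * y
    print(f"theta = {theta:.2f} -> z = {z}")
```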
| Set | Convex? | Reasoning |
|---|---|---|
| Line segment | ✓ Yes | By definition—it's a convex combination of its endpoints |
| Ball/Sphere interior | ✓ Yes | Any line segment between interior points stays inside |
| Hyperplane $\{x : a^T x = b\}$ | ✓ Yes | If $a^T x = b$ and $a^T y = b$, then $a^T(\theta x + (1-\theta)y) = b$ |
| Halfspace $\{x : a^T x \leq b\}$ | ✓ Yes | If $a^T x \leq b$ and $a^T y \leq b$, then $a^T(\theta x + (1-\theta)y) \leq b$ |
| Polyhedron (intersection of halfspaces) | ✓ Yes | Intersection of convex sets is convex |
| Two disjoint balls | ✗ No | Line segment between points in different balls exits the set |
| Star-shaped region | ✗ No | Line segments between some points exit through the indentations |
| Donut (annulus) | ✗ No | Line segments can pass through the hole |
Understanding which operations preserve convexity allows us to build complex convex sets from simpler ones—a crucial skill for formulating optimization problems in machine learning.
Fundamental Convexity-Preserving Operations:
- Intersection: if $C_1$ and $C_2$ are convex, so is $C_1 \cap C_2$ (and this extends to arbitrary intersections).
- Affine images and preimages: if $C$ is convex, so are $\{Ax + b : x \in C\}$ and $\{x : Ax + b \in C\}$.
- Projection: projecting a convex set onto a subset of its coordinates yields a convex set.
- Cartesian products: $C_1 \times C_2$ is convex whenever $C_1$ and $C_2$ are.
The union of convex sets is generally NOT convex. Consider two disjoint balls: each is convex, but their union fails the line segment test. This is why optimization over "either A or B" constraints is fundamentally harder than optimization over "A and B" constraints.
Why Intersection is Central:
The fact that intersection preserves convexity is enormously powerful. Many important convex sets in ML can be expressed as intersections of simpler convex sets: a polyhedron is an intersection of halfspaces, the probability simplex is the intersection of the hyperplane $\{x : \mathbf{1}^T x = 1\}$ with the non-negative orthant, and the feasible region of a convex program is the intersection of its constraint sets.
This compositional view allows us to verify convexity of complex sets by decomposing them into simpler convex building blocks.
```python
import numpy as np

def is_in_convex_hull(point, vertices):
    """
    Check if a point lies in the convex hull of given vertices.

    Uses linear programming: point is in convex hull iff we can
    express it as a convex combination of vertices.

    Args:
        point: numpy array of shape (n,)
        vertices: numpy array of shape (k, n) where k = number of vertices

    Returns:
        bool: True if point is in convex hull
    """
    from scipy.optimize import linprog

    n_vertices = vertices.shape[0]
    n_dims = vertices.shape[1]

    # We want to find lambdas such that:
    #   sum(lambda_i * vertex_i) = point
    #   sum(lambda_i) = 1
    #   lambda_i >= 0
    # This is a feasibility LP. We minimize a constant (0).
    c = np.zeros(n_vertices)  # Dummy objective

    # Equality constraints: V^T @ lambda = point, sum(lambda) = 1
    A_eq = np.vstack([vertices.T, np.ones(n_vertices)])
    b_eq = np.append(point, 1.0)

    # lambda_i >= 0 is handled by bounds
    bounds = [(0, None) for _ in range(n_vertices)]

    result = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method='highs')
    return result.success


def verify_convexity_by_sampling(set_membership_fn, dim, n_pairs=1000, n_interpolations=10):
    """
    Empirically verify convexity of a set by random sampling.

    This is NOT a proof but a useful diagnostic for complex sets
    defined procedurally.

    Args:
        set_membership_fn: Function that returns True if point is in set
        dim: Dimension of the space
        n_pairs: Number of random point pairs to test
        n_interpolations: Points to check along each line segment

    Returns:
        dict: Results including any counterexamples found
    """
    counterexamples = []

    for _ in range(n_pairs):
        # Generate random points and check if they're in the set
        x = np.random.randn(dim)
        y = np.random.randn(dim)

        if not (set_membership_fn(x) and set_membership_fn(y)):
            continue  # Both points must be in the set

        # Check points along the line segment
        for theta in np.linspace(0, 1, n_interpolations):
            z = theta * x + (1 - theta) * y
            if not set_membership_fn(z):
                counterexamples.append({
                    'x': x.copy(),
                    'y': y.copy(),
                    'theta': theta,
                    'z': z.copy()
                })
                break

    return {
        'is_likely_convex': len(counterexamples) == 0,
        'counterexamples': counterexamples,
        'pairs_tested': n_pairs
    }


# Example: Verify that L2 ball is convex
def l2_ball_membership(x, radius=1.0):
    return np.linalg.norm(x) <= radius

result = verify_convexity_by_sampling(l2_ball_membership, dim=5, n_pairs=2000)
print(f"L2 ball is likely convex: {result['is_likely_convex']}")
print(f"Counterexamples found: {len(result['counterexamples'])}")
```

While convex sets describe where we can search, convex functions describe what we're optimizing. A convex function is one where the graph lies below any line segment connecting two points on the graph—it "curves upward."
Formal Definition:
A function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is convex if its domain $\text{dom}(f)$ is a convex set and for all $x, y \in \text{dom}(f)$ and $\theta \in [0, 1]$:
$$f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y)$$
The left side is the function value at the interpolated point; the right side is the linear interpolation of function values. For convex functions, the actual value is always at or below the interpolation.
If you draw a chord (straight line) between any two points on the graph of a convex function, the entire chord lies on or above the graph. The function "bows down" relative to its chords. For a strictly convex function, the chord is strictly above except at endpoints.
Strict Convexity:
A function is strictly convex if the inequality is strict for $\theta \in (0, 1)$ and $x \neq y$:
$$f(\theta x + (1 - \theta) y) < \theta f(x) + (1 - \theta) f(y)$$
Strict convexity guarantees a unique global minimum (if one exists), not just that local minima are global.
Strong Convexity:
A function $f$ is $m$-strongly convex (with $m > 0$) if:
$$f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y) - \frac{m}{2} \theta (1 - \theta) \|x - y\|_2^2$$
Strong convexity implies strict convexity but adds a quantitative "curvature" bound. This has crucial implications for optimization convergence rates.
| Condition | Definition | Key Implication |
|---|---|---|
| Convex | $f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y)$ | Local minima are global minima |
| Strictly Convex | Strict inequality for $\theta \in (0,1)$, $x \neq y$ | At most one global minimum |
| $m$-Strongly Convex | Convexity + curvature $\geq m > 0$ | Exactly one minimum + fast convergence |
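The following small numeric sketch (the test points and the constant $m = 2$ are arbitrary choices) probes all three conditions at the midpoint $\theta = \tfrac{1}{2}$: $x^2$ meets the $m = 2$ strong-convexity bound with equality, $x^4$ fails it near the origin where its curvature vanishes (strictly but not strongly convex), and $|x|$ produces a zero gap for aligned points (convex but not strictly convex).

```python
import numpy as np

def midpoint_gap(f, x, y):
    """Chord value at the midpoint minus the function value there.
    Non-negative for convex f; strictly positive (x != y) for strictly convex f."""
    return 0.5 * (f(x) + f(y)) - f(0.5 * (x + y))

functions = {"x^2": lambda t: t ** 2,
             "x^4": lambda t: t ** 4,
             "|x|": lambda t: abs(t)}

m = 2.0  # candidate strong-convexity constant
for (x, y) in [(-0.1, 0.1), (1.0, 2.0)]:
    # At theta = 1/2, m-strong convexity requires gap >= (m/2) * (1/4) * (x - y)^2
    required = (m / 8) * (x - y) ** 2
    print(f"\nPair (x, y) = ({x}, {y}), required strong-convexity margin: {required:.4f}")
    for name, f in functions.items():
        gap = midpoint_gap(f, x, y)
        print(f"  {name:4s}: gap = {gap:.6f}, "
              f"convex: {gap >= -1e-12}, m=2-strongly convex here: {gap >= required - 1e-12}")
```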
Concave Functions:
A function $f$ is concave if $-f$ is convex. Equivalently:
$$f(\theta x + (1 - \theta) y) \geq \theta f(x) + (1 - \theta) f(y)$$
Since minimizing a convex function is equivalent to maximizing a concave function, all our convexity theory applies to maximization problems with concave objectives.
Building intuition requires familiarity with common convex functions. These form the building blocks for constructing convex objectives in machine learning.
| Function | Domain | Convexity Type | ML Application |
|---|---|---|---|
| $f(x) = x^2$ | $\mathbb{R}$ | Strongly convex | Squared loss, L2 regularization |
| $f(x) = |x|$ | $\mathbb{R}$ | Convex (not strict) | L1 regularization, robust loss |
| $f(x) = e^x$ | $\mathbb{R}$ | Strictly convex | Exponential loss, softmax components |
| $f(x) = -\log(x)$ | $x > 0$ | Strictly convex | Log loss, barrier methods |
| $f(x) = x \log x$ | $x > 0$ | Strictly convex | Entropy, KL divergence |
| $f(x) = \max_i x_i$ | $\mathbb{R}^n$ | Convex (not strict) | Max-margin, hinge loss |
| $f(x) = \|x\|_p$ for $p \geq 1$ | $\mathbb{R}^n$ | Convex | Norm regularization |
| $f(X) = -\log \det(X)$ | $X \succ 0$ | Strictly convex | Covariance estimation |
Understanding Through Key Examples:
1. Quadratic Functions:
The function $f(x) = \frac{1}{2} x^T Q x + b^T x + c$, where $Q$ is a symmetric matrix, has Hessian $\nabla^2 f(x) = Q$ everywhere. It is convex if and only if $Q \succeq 0$ (positive semidefinite), and strictly convex if and only if $Q \succ 0$ (positive definite), in which case it is also $\lambda_{\min}(Q)$-strongly convex.
This is why the Hessian (second derivative matrix) plays such a central role in convexity verification.
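A minimal sketch of this classification (the example matrices are arbitrary); note that for quadratics, strict convexity coincides with strong convexity, with $m = \lambda_{\min}(Q)$:

```python
import numpy as np

def classify_quadratic(Q, tol=1e-10):
    """Classify f(x) = 0.5 x^T Q x + b^T x + c via the eigenvalues of its Hessian Q."""
    eigs = np.linalg.eigvalsh((Q + Q.T) / 2)  # symmetrize for numerical safety
    lam_min = eigs.min()
    if lam_min > tol:
        return f"strongly convex (m = {lam_min:.3f})"
    if lam_min >= -tol:
        return "convex but not strictly convex"
    return "not convex"

print(classify_quadratic(np.array([[2.0, 0.0], [0.0, 1.0]])))   # positive definite
print(classify_quadratic(np.array([[1.0, 0.0], [0.0, 0.0]])))   # positive semidefinite
print(classify_quadratic(np.array([[1.0, 0.0], [0.0, -1.0]])))  # indefinite
```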
2. Norms:
Every norm on $\mathbb{R}^n$ is convex. This follows from the triangle inequality:
$$\|\theta x + (1 - \theta) y\| \leq \|\theta x\| + \|(1 - \theta) y\| = \theta \|x\| + (1 - \theta) \|y\|$$
This is why L1 and L2 regularization terms are convex, making regularized optimization tractable.
3. Log-Sum-Exp (Softmax):
$$f(x) = \log\left(\sum_{i=1}^{n} e^{x_i}\right)$$
This is convex and serves as a smooth approximation to $\max_i x_i$. It's fundamental to softmax classification and attention mechanisms.
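A short sketch of the "smooth max" property, using the standard sandwich bound $\max_i x_i \leq \log\sum_i e^{x_i} \leq \max_i x_i + \log n$ (the input vector here is arbitrary):

```python
import numpy as np

def log_sum_exp(x):
    """Numerically stable log(sum(exp(x_i)))."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

rng = np.random.default_rng(0)
x = rng.normal(size=6)

lse, mx = log_sum_exp(x), np.max(x)
print(f"max(x)       = {mx:.4f}")
print(f"log-sum-exp  = {lse:.4f}")
print(f"max + log(n) = {mx + np.log(len(x)):.4f}")
assert mx <= lse <= mx + np.log(len(x))
```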
If $g$ is convex and non-decreasing, and $f$ is convex, then $g(f(x))$ is convex. If $g$ is convex and non-increasing, and $f$ is concave, then $g(f(x))$ is convex. These composition rules let us build complex convex functions from simpler ones.
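For example:
$$e^u \text{ convex and non-decreasing}, \quad \|x\|_2^2 \text{ convex} \;\;\Longrightarrow\;\; h(x) = e^{\|x\|_2^2} \text{ is convex},$$
and composing $-\log u$ (convex, non-increasing on $u > 0$) with a positive concave function $f$ shows that $-\log f(x)$ is convex, a common route to convex negative log-likelihoods.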
Just as certain operations preserve set convexity, analogous operations preserve function convexity: non-negative weighted sums, composition with affine maps ($x \mapsto f(Ax + b)$), and pointwise maxima or suprema all yield convex functions from convex ingredients. These rules are essential for constructing and verifying convex objectives.
Products of convex functions are NOT generally convex. For example, $f(x) = x$ and $g(x) = x^2$ are both convex on $\mathbb{R}$, but their product $f(x) g(x) = x^3$ is not convex (it is concave for $x < 0$). Similarly, $\min_i f_i(x)$ is NOT generally convex even when each $f_i$ is convex.
The Power of Pointwise Maximum:
The fact that $\max$ preserves convexity is surprisingly powerful and underlies many ML loss functions:
Hinge Loss: $\ell(y, \hat{y}) = \max(0, 1 - y \cdot \hat{y})$
This is the maximum of two affine functions (0 and $1 - y \cdot \hat{y}$), hence convex.
ReLU Activation: $\text{ReLU}(x) = \max(0, x)$
While individual ReLUs are convex, compositions of linear layers and ReLUs create non-convex networks—an important distinction.
Piecewise Linear Convex Functions:
Any piecewise linear convex function can be written as a maximum of affine functions: $$f(x) = \max_{i=1,\ldots,k} (a_i^T x + b_i)$$
This representation is central to problems like linear programming robustness and the design of activation functions.
```python
import numpy as np

# Demonstrating convexity-preserving operations

def squared_norm(x):
    """L2 squared - strongly convex"""
    return np.sum(x ** 2)

def l1_norm(x):
    """L1 norm - convex but not strictly convex"""
    return np.sum(np.abs(x))

def composed_quadratic(x, A, b):
    """f(Ax + b) where f is squared norm - still convex"""
    return squared_norm(A @ x + b)

def sum_of_convex(x, alpha=1.0, beta=0.1):
    """alpha * ||x||^2 + beta * ||x||_1 - convex (elastic net)"""
    return alpha * squared_norm(x) + beta * l1_norm(x)

def pointwise_max_example(x, affine_funcs):
    """
    max_i (a_i^T x + b_i) - convex as max of convex

    Args:
        x: input vector
        affine_funcs: list of (a_i, b_i) tuples
    """
    values = [a @ x + b for a, b in affine_funcs]
    return max(values)

def hinge_loss(y_true, y_pred):
    """
    Hinge loss: max(0, 1 - y * y_hat)
    Convex as max of two affine functions in y_pred
    """
    return np.maximum(0, 1 - y_true * y_pred)

def log_sum_exp(x):
    """
    log(sum(exp(x_i))) - convex approximation to max
    Numerically stable implementation
    """
    max_x = np.max(x)
    return max_x + np.log(np.sum(np.exp(x - max_x)))

# Verify convexity empirically
def verify_function_convexity(f, x, y, n_points=100):
    """
    Verify f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y)
    for many values of theta in [0, 1]
    """
    f_x, f_y = f(x), f(y)
    thetas = np.linspace(0, 1, n_points)

    violations = []
    for theta in thetas:
        z = theta * x + (1 - theta) * y
        f_z = f(z)
        interpolation = theta * f_x + (1 - theta) * f_y

        if f_z > interpolation + 1e-10:  # Small tolerance
            violations.append({
                'theta': theta,
                'f(z)': f_z,
                'interpolation': interpolation,
                'gap': f_z - interpolation
            })

    return {
        'is_convex': len(violations) == 0,
        'violations': violations
    }

# Example verification
np.random.seed(42)
x = np.random.randn(5)
y = np.random.randn(5)

print("Verifying convexity of L2 squared:")
print(verify_function_convexity(squared_norm, x, y))

print("\nVerifying convexity of log-sum-exp:")
print(verify_function_convexity(log_sum_exp, x, y))
```

Jensen's Inequality is arguably the single most important result in convex analysis. It generalizes the definition of convexity from two points to arbitrary weighted combinations.
Theorem (Jensen's Inequality):
Let $f$ be a convex function, let $x_1, \ldots, x_k$ be points in the domain of $f$, and let $\lambda_1, \ldots, \lambda_k$ be non-negative weights summing to 1. Then:
$$f\left(\sum_{i=1}^{k} \lambda_i x_i\right) \leq \sum_{i=1}^{k} \lambda_i f(x_i)$$
In words: the function value at the weighted average is at most the weighted average of function values.
If $X$ is a random variable and $f$ is convex, then $f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$. This is the form most commonly used in machine learning, especially in variational inference and information theory.
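A quick Monte Carlo illustration of the probabilistic form (the distribution and the convex function below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=2.0, size=1_000_000)  # any random variable works

f = np.exp  # a convex function

print(f"f(E[X]) = {f(X.mean()):.4f}")   # close to exp(1) ~ 2.72
print(f"E[f(X)] = {f(X).mean():.4f}")   # close to exp(1 + 2^2/2) ~ 20.1 (lognormal mean)
assert f(X.mean()) <= f(X).mean()       # Jensen: f(E[X]) <= E[f(X)]
```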
Proof Intuition:
Jensen's inequality follows from repeated application of the basic convexity definition. For two points, it's exactly the definition. For three points, we first combine two of them:
$$f(\lambda_1 x_1 + \lambda_2 x_2 + \lambda_3 x_3) = f\left((\lambda_1 + \lambda_2) \cdot \frac{\lambda_1 x_1 + \lambda_2 x_2}{\lambda_1 + \lambda_2} + \lambda_3 x_3\right)$$
Then apply convexity twice, using the fact that $\frac{\lambda_1 x_1 + \lambda_2 x_2}{\lambda_1 + \lambda_2}$ is itself a convex combination.
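Spelled out, the two applications of the two-point definition (assuming $\lambda_1 + \lambda_2 > 0$) are:
$$
\begin{aligned}
f(\lambda_1 x_1 + \lambda_2 x_2 + \lambda_3 x_3)
&\leq (\lambda_1 + \lambda_2)\, f\!\left(\frac{\lambda_1 x_1 + \lambda_2 x_2}{\lambda_1 + \lambda_2}\right) + \lambda_3 f(x_3) \\
&\leq (\lambda_1 + \lambda_2)\left[\frac{\lambda_1}{\lambda_1 + \lambda_2} f(x_1) + \frac{\lambda_2}{\lambda_1 + \lambda_2} f(x_2)\right] + \lambda_3 f(x_3) \\
&= \lambda_1 f(x_1) + \lambda_2 f(x_2) + \lambda_3 f(x_3).
\end{aligned}
$$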
Applications in Machine Learning:
Jensen's inequality appears throughout ML: it drives the derivation of the evidence lower bound (ELBO) in variational inference, justifies the lower-bounding step of the EM algorithm, and yields the non-negativity of the KL divergence, which we prove next.
Example: Proving KL Divergence is Non-negative
The Kullback-Leibler divergence is:
$$D_{KL}(p \| q) = \mathbb{E}_p\left[\log \frac{p(x)}{q(x)}\right] = -\mathbb{E}_p\left[\log \frac{q(x)}{p(x)}\right]$$
Since $-\log$ is convex and $\frac{q(x)}{p(x)}$ is a ratio of densities with $\mathbb{E}_p\left[\frac{q(x)}{p(x)}\right] = \int p(x) \cdot \frac{q(x)}{p(x)} dx = \int q(x) dx = 1$:
$$D_{KL}(p \| q) = -\mathbb{E}_p\left[\log \frac{q(x)}{p(x)}\right] \geq -\log \mathbb{E}_p\left[\frac{q(x)}{p(x)}\right] = -\log(1) = 0$$
Jensen's gives us this fundamental non-negativity of KL divergence.
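As a sanity check, the short sketch below draws random discrete distributions and confirms the bound numerically (illustrative only, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_distribution(k, rng):
    """Random discrete distribution with full support on k outcomes."""
    w = rng.random(k)
    return w / w.sum()

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions with full support."""
    return np.sum(p * np.log(p / q))

for _ in range(5):
    p = random_distribution(10, rng)
    q = random_distribution(10, rng)
    d = kl_divergence(p, q)
    print(f"D_KL(p || q) = {d:.4f}")
    assert d >= -1e-12  # non-negative, as Jensen's inequality guarantees
```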
The connection between convex functions and convex sets runs deeper than analogy—there's a precise equivalence through the concept of epigraphs.
Definition: Epigraph
The epigraph of a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is the set of points "on or above" the graph:
$$\text{epi}(f) = \{(x, t) \in \mathbb{R}^{n+1} : x \in \text{dom}(f),\ t \geq f(x)\}$$
Fundamental Theorem:
A function $f$ is convex if and only if its epigraph $\text{epi}(f)$ is a convex set.
This equivalence is powerful: it means we can translate any statement about convex functions into a statement about convex sets, and vice versa.
For a 1D function, the epigraph is the region above the curve extending to infinity. For $f(x) = x^2$, the epigraph is the region above the parabola—a convex set. For $f(x) = -x^2$, the epigraph is everything above an upside-down parabola—not convex.
Sublevel Sets:
The $\alpha$-sublevel set of $f$ is:
$$S_\alpha = \{x \in \text{dom}(f) : f(x) \leq \alpha\}$$
Key Property: If $f$ is convex, all its sublevel sets are convex.
Proof: Let $x, y \in S_\alpha$, so $f(x) \leq \alpha$ and $f(y) \leq \alpha$. For any $\theta \in [0, 1]$:
$$f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y) \leq \theta \alpha + (1 - \theta) \alpha = \alpha$$
So $\theta x + (1 - \theta) y \in S_\alpha$, proving $S_\alpha$ is convex.
Caution: The converse is FALSE—having convex sublevel sets does not imply the function is convex. The function $f(x) = \sqrt{|x|}$ has convex sublevel sets (intervals centered at 0) but is not convex (it is concave on each of the half-lines $x \geq 0$ and $x \leq 0$).
Quasiconvexity:
A function with convex sublevel sets is called quasiconvex. This is a strictly weaker condition than convexity:
$$f \text{ convex} \Rightarrow f \text{ quasiconvex}$$
but the converse doesn't hold. Quasiconvex functions still have nice optimization properties (local minima are still global for strictly quasiconvex functions), but the analysis is more delicate.
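The sketch below (illustrative only) checks both halves of the $f(x) = \sqrt{|x|}$ example from the caution above: the chord inequality fails at the midpoint of $0$ and $1$, yet every sublevel set is an interval $[-\alpha^2, \alpha^2]$ and hence convex.

```python
import numpy as np

f = lambda x: np.sqrt(np.abs(x))

# Chord test at the midpoint of x = 0 and y = 1: convexity would require
# f(0.5) <= 0.5 * (f(0) + f(1)) = 0.5, but f(0.5) ~ 0.707 -> not convex.
x, y = 0.0, 1.0
print(f"f(midpoint) = {f(0.5 * (x + y)):.4f}  vs  chord value = {0.5 * (f(x) + f(y)):.4f}")

# Sublevel sets: {x : sqrt(|x|) <= alpha} = [-alpha^2, alpha^2], an interval -> convex.
alpha = 0.8
grid = np.linspace(-2, 2, 2001)
inside = grid[f(grid) <= alpha]
print(f"Sublevel set for alpha = {alpha}: [{inside.min():.2f}, {inside.max():.2f}]")
```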
Practical Use of Sublevel Sets:
In constrained optimization, we often restrict our search to a sublevel set:
$$\min_x \; g(x) \quad \text{subject to} \quad f(x) \leq \alpha$$
If $f$ is convex, the feasible region $S_\alpha$ is convex, preserving the tractability of the optimization.
When functions are differentiable, we gain powerful calculus-based characterizations of convexity that are often easier to verify than the basic definition.
First-Order Characterization:
If $f$ is differentiable, then $f$ is convex if and only if for all $x, y$ in the domain:
$$f(y) \geq f(x) + \nabla f(x)^T (y - x)$$
This says the function lies above its tangent hyperplane at every point. The tangent is a global underestimator.
For a convex function, if you stand at any point and draw the tangent line (or hyperplane), the entire function graph lies on or above that tangent. This is what makes first-order information so powerful for convex functions: the gradient at a single point gives a global lower bound on the function, and if $\nabla f(x) = 0$, that bound immediately certifies $x$ as a global minimizer.
Second-Order Characterization:
If $f$ is twice differentiable, then $f$ is convex if and only if the Hessian is positive semidefinite everywhere:
$$\nabla^2 f(x) \succeq 0 \quad \forall x \in \text{dom}(f)$$
For strict convexity, we need $\nabla^2 f(x) \succ 0$ (positive definite).
For $m$-strong convexity, we need $\nabla^2 f(x) \succeq m I$ (eigenvalues at least $m$).
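As a worked check against the table of common convex functions above: $x^2$ has constant curvature $2$, making it $2$-strongly convex, while the others have strictly positive but vanishing curvature somewhere on their domains, making them strictly convex but not strongly convex.
$$
\frac{d^2}{dx^2}\,x^2 = 2, \qquad
\frac{d^2}{dx^2}\,e^x = e^x, \qquad
\frac{d^2}{dx^2}\big(-\log x\big) = \frac{1}{x^2}, \qquad
\frac{d^2}{dx^2}\big(x \log x\big) = \frac{1}{x} \quad (x > 0)
$$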
Practical Hessian Analysis:
For functions of the form $f(x) = g(Ax)$ where $g$ is convex:
$$\nabla^2 f(x) = A^T \nabla^2 g(Ax) A$$
This is PSD whenever $\nabla^2 g \succeq 0$, confirming that composition with affine transformations preserves convexity.
```python
import numpy as np
from scipy.linalg import eigvalsh

def verify_convexity_via_hessian(hessian_fn, test_points, tolerance=1e-10):
    """
    Verify convexity by checking Hessian positive semidefiniteness.

    Args:
        hessian_fn: Function that returns Hessian matrix at a point
        test_points: List of points to check
        tolerance: Numerical tolerance for eigenvalue check

    Returns:
        dict: Verification results
    """
    results = {
        'is_psd_everywhere': True,
        'min_eigenvalues': [],
        'problem_points': []
    }

    for x in test_points:
        H = hessian_fn(x)
        eigenvalues = eigvalsh(H)  # Real eigenvalues for symmetric matrix
        min_eig = np.min(eigenvalues)
        results['min_eigenvalues'].append(min_eig)

        if min_eig < -tolerance:
            results['is_psd_everywhere'] = False
            results['problem_points'].append({
                'point': x,
                'min_eigenvalue': min_eig
            })

    return results

def verify_first_order_condition(f, grad_f, x, y, tolerance=1e-8):
    """
    Verify f(y) >= f(x) + grad_f(x)^T (y - x)
    For convex functions, this should hold for all x, y.
    """
    f_y = f(y)
    f_x = f(x)
    grad_x = grad_f(x)

    lower_bound = f_x + np.dot(grad_x, y - x)
    gap = f_y - lower_bound

    return {
        'f(y)': f_y,
        'lower_bound': lower_bound,
        'gap': gap,
        'is_satisfied': gap >= -tolerance
    }

# Example: Verify logistic loss convexity
def logistic_loss(w, X, y):
    """Logistic regression loss (without regularization)"""
    z = y * (X @ w)
    return np.mean(np.log(1 + np.exp(-z)))

def logistic_loss_hessian(w, X, y):
    """Hessian of logistic loss"""
    n = len(y)
    z = X @ w
    p = 1 / (1 + np.exp(-y * z))
    D = np.diag(p * (1 - p))
    return (X.T @ D @ X) / n

# Generate sample data
np.random.seed(42)
n_samples, n_features = 100, 5
X = np.random.randn(n_samples, n_features)
y = np.sign(np.random.randn(n_samples))

# Test at random points
test_points = [np.random.randn(n_features) for _ in range(10)]
hessian_fn = lambda w: logistic_loss_hessian(w, X, y)

result = verify_convexity_via_hessian(hessian_fn, test_points)
print("Logistic loss convexity verification:")
print(f"  Is PSD everywhere: {result['is_psd_everywhere']}")
print(f"  Min eigenvalues: {[f'{e:.6f}' for e in result['min_eigenvalues'][:5]]}")
```

We've established the foundational concepts of convex analysis that underpin all tractable optimization in machine learning. Let's consolidate the key insights:
- A set is convex when it contains the line segment between any two of its points; intersections, affine images, and projections preserve this property.
- A function is convex when it lies on or below its chords; strict and strong convexity strengthen this, guaranteeing uniqueness of minimizers and quantitative curvature bounds.
- Non-negative weighted sums, affine compositions, and pointwise maxima of convex functions are convex, which is how most ML objectives are assembled.
- Jensen's inequality extends convexity to arbitrary weighted averages and expectations, underpinning results such as the non-negativity of KL divergence.
- A function is convex exactly when its epigraph is a convex set, and differentiable convexity can be certified by the tangent-plane bound or a positive semidefinite Hessian.
What's Next:
With the definitions of convex sets and functions established, we'll explore why convexity matters for optimization. The next page examines local vs. global optima — showing that for convex problems, we never need to worry about getting stuck in suboptimal solutions.
You now understand convex sets, convex functions, and the key operations that preserve convexity. This vocabulary and these tools will be essential as we explore optimality conditions and their applications to machine learning algorithms.