Imagine you're lost in a foggy mountain landscape, trying to find the lowest valley. If the terrain is arbitrary—with cliffs, plateaus, and hidden ravines—your task is nearly impossible. You might descend into a local depression, only to realize you're still far above the true valley floor hidden somewhere beyond.
But now imagine the terrain forms a single, smooth bowl. No matter where you start, if you simply walk downhill, you're guaranteed to reach the absolute lowest point. There are no false bottoms, no deceptive dips—just one inevitable destination.
This is the essence of convexity—the mathematical property that transforms intractable optimization into a guaranteed journey to the global optimum. In machine learning, convexity is not merely a convenience; it's the fundamental distinction between algorithms that provably work and algorithms that hope for the best.
By the end of this page, you will understand: (1) the precise mathematical definition of convex sets, (2) how to verify set convexity through intersection and projection operations, (3) the definition and characterization of convex functions, (4) Jensen's inequality and its far-reaching implications, and (5) why convexity guarantees that local optima are global optima—the holy grail of optimization.
A convex set captures a simple but powerful geometric idea: if you can see from any point to any other point without leaving the set, the set is convex.
Formal Definition:
A set $C \subseteq \mathbb{R}^n$ is convex if for any two points $x, y \in C$ and any scalar $\theta \in [0, 1]$, the point
$$\theta x + (1 - \theta) y \in C$$
In other words, the entire line segment connecting $x$ and $y$ lies within $C$. This seemingly simple property has profound implications for optimization.
Think of a convex set as a set with no "dents" or "holes." If you stretch a rubber band between any two points inside the set, the entire rubber band stays inside. A circle is convex. A star is not—lines between some points would exit the star's boundary.
Understanding the Convex Combination:
The expression $\theta x + (1 - \theta) y$ with $\theta \in [0, 1]$ is called a convex combination of $x$ and $y$. Geometrically, $\theta = 1$ recovers $x$, $\theta = 0$ recovers $y$, and intermediate values of $\theta$ sweep out the line segment between them.
The convex combination weights always sum to 1: $\theta + (1 - \theta) = 1$. This constraint ensures we are describing points on the line segment itself, not the extended line through $x$ and $y$.
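As a concrete (if minimal) illustration, the short sketch below sweeps $\theta$ from 0 to 1 for two arbitrary points and prints the resulting convex combinations:

```python
import numpy as np

x = np.array([2.0, 0.0])
y = np.array([0.0, 2.0])

# theta = 1 recovers x, theta = 0 recovers y, theta = 0.5 is the midpoint
for theta in [0.0, 0.25, 0.5, 0.75, 1.0]:
    z = theta * x + (1 - theta) * y
    print(f"theta = {theta:.2f} -> z = {z}")
```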
| Set | Convex? | Reasoning |
|---|---|---|
| Line segment | ✓ Yes | By definition—it's a convex combination of its endpoints |
| Ball/Sphere interior | ✓ Yes | Any line segment between interior points stays inside |
| Hyperplane $\{x : a^T x = b\}$ | ✓ Yes | If $a^T x = b$ and $a^T y = b$, then $a^T(\theta x + (1-\theta)y) = b$ |
| Halfspace $\{x : a^T x \leq b\}$ | ✓ Yes | If $a^T x \leq b$ and $a^T y \leq b$, then $a^T(\theta x + (1-\theta)y) \leq b$ |
| Polyhedron (intersection of halfspaces) | ✓ Yes | Intersection of convex sets is convex |
| Two disjoint balls | ✗ No | Line segment between points in different balls exits the set |
| Star-shaped region | ✗ No | Line segments between some points exit through the indentations |
| Donut (annulus) | ✗ No | Line segments can pass through the hole |
Understanding which operations preserve convexity allows us to build complex convex sets from simpler ones—a crucial skill for formulating optimization problems in machine learning.
Fundamental Convexity-Preserving Operations:
- Intersection: if $C_1$ and $C_2$ are convex, so is $C_1 \cap C_2$ (and this extends to arbitrary intersections).
- Affine images and preimages: if $C$ is convex, so are $\{Ax + b : x \in C\}$ and $\{x : Ax + b \in C\}$.
- Projection: projecting a convex set onto a subset of its coordinates yields a convex set.
- Cartesian products: $C_1 \times C_2$ is convex whenever $C_1$ and $C_2$ are.
The union of convex sets is generally NOT convex. Consider two disjoint balls: each is convex, but their union fails the line segment test. This is why optimization over "either A or B" constraints is fundamentally harder than optimization over "A and B" constraints.
Why Intersection is Central:
The fact that intersection preserves convexity is enormously powerful. Many important convex sets in ML can be expressed as intersections of simpler convex sets: a polyhedron is an intersection of halfspaces, the probability simplex is the intersection of the hyperplane $\{x : \mathbf{1}^T x = 1\}$ with the non-negative orthant, and the feasible region of a convex program is the intersection of its constraint sets.
This compositional view allows us to verify convexity of complex sets by decomposing them into simpler convex building blocks.
```python
import numpy as np

def is_in_convex_hull(point, vertices):
    """
    Check if a point lies in the convex hull of given vertices.

    Uses linear programming: point is in convex hull iff we can
    express it as a convex combination of vertices.

    Args:
        point: numpy array of shape (n,)
        vertices: numpy array of shape (k, n) where k = number of vertices

    Returns:
        bool: True if point is in convex hull
    """
    from scipy.optimize import linprog

    n_vertices = vertices.shape[0]
    n_dims = vertices.shape[1]

    # We want to find lambdas such that:
    #   sum(lambda_i * vertex_i) = point
    #   sum(lambda_i) = 1
    #   lambda_i >= 0
    # This is a feasibility LP. We minimize a constant (0).
    c = np.zeros(n_vertices)  # Dummy objective

    # Equality constraints: V^T @ lambda = point, sum(lambda) = 1
    A_eq = np.vstack([vertices.T, np.ones(n_vertices)])
    b_eq = np.append(point, 1.0)

    # lambda_i >= 0 is handled by bounds
    bounds = [(0, None) for _ in range(n_vertices)]

    result = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method='highs')
    return result.success


def verify_convexity_by_sampling(set_membership_fn, dim, n_pairs=1000, n_interpolations=10):
    """
    Empirically verify convexity of a set by random sampling.

    This is NOT a proof but a useful diagnostic for complex sets
    defined procedurally.

    Args:
        set_membership_fn: Function that returns True if point is in set
        dim: Dimension of the space
        n_pairs: Number of random point pairs to test
        n_interpolations: Points to check along each line segment

    Returns:
        dict: Results including any counterexamples found
    """
    counterexamples = []

    for _ in range(n_pairs):
        # Generate random points and check if they're in the set
        x = np.random.randn(dim)
        y = np.random.randn(dim)

        if not (set_membership_fn(x) and set_membership_fn(y)):
            continue  # Both points must be in the set

        # Check points along the line segment
        for theta in np.linspace(0, 1, n_interpolations):
            z = theta * x + (1 - theta) * y
            if not set_membership_fn(z):
                counterexamples.append({
                    'x': x.copy(),
                    'y': y.copy(),
                    'theta': theta,
                    'z': z.copy()
                })
                break

    return {
        'is_likely_convex': len(counterexamples) == 0,
        'counterexamples': counterexamples,
        'pairs_tested': n_pairs
    }


# Example: Verify that L2 ball is convex
def l2_ball_membership(x, radius=1.0):
    return np.linalg.norm(x) <= radius

result = verify_convexity_by_sampling(l2_ball_membership, dim=5, n_pairs=2000)
print(f"L2 ball is likely convex: {result['is_likely_convex']}")
print(f"Counterexamples found: {len(result['counterexamples'])}")
```

While convex sets describe where we can search, convex functions describe what we're optimizing. A convex function is one where the graph lies below any line segment connecting two points on the graph—it "curves upward."
Formal Definition:
A function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is convex if its domain $\text{dom}(f)$ is a convex set and for all $x, y \in \text{dom}(f)$ and $\theta \in [0, 1]$:
$$f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y)$$
The left side is the function value at the interpolated point; the right side is the linear interpolation of function values. For convex functions, the actual value is always at or below the interpolation.
If you draw a chord (straight line) between any two points on the graph of a convex function, the entire chord lies on or above the graph. The function "bows down" relative to its chords. For a strictly convex function, the chord is strictly above except at endpoints.
Strict Convexity:
A function is strictly convex if the inequality is strict for $\theta \in (0, 1)$ and $x \neq y$:
$$f(\theta x + (1 - \theta) y) < \theta f(x) + (1 - \theta) f(y)$$
Strict convexity guarantees a unique global minimum (if one exists), not just that local minima are global.
Strong Convexity:
A function $f$ is $m$-strongly convex (with $m > 0$) if:
$$f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y) - \frac{m}{2} \theta (1 - \theta) \|x - y\|_2^2$$
Strong convexity implies strict convexity but adds a quantitative "curvature" bound. This has crucial implications for optimization convergence rates.
| Condition | Definition | Key Implication |
|---|---|---|
| Convex | $f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y)$ | Local minima are global minima |
| Strictly Convex | Strict inequality for $\theta \in (0,1)$, $x \neq y$ | At most one global minimum |
| $m$-Strongly Convex | Convexity + curvature $\geq m > 0$ | Exactly one minimum + fast convergence |
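The following small numeric sketch (the test points and the constant $m = 2$ are arbitrary choices) probes all three conditions at the midpoint $\theta = \tfrac{1}{2}$: $x^2$ meets the $m = 2$ strong-convexity bound with equality, $x^4$ fails it near the origin where its curvature vanishes (strictly but not strongly convex), and $|x|$ produces a zero gap for aligned points (convex but not strictly convex).

```python
import numpy as np

def midpoint_gap(f, x, y):
    """Chord value at the midpoint minus the function value there.
    Non-negative for convex f; strictly positive (x != y) for strictly convex f."""
    return 0.5 * (f(x) + f(y)) - f(0.5 * (x + y))

functions = {"x^2": lambda t: t ** 2,
             "x^4": lambda t: t ** 4,
             "|x|": lambda t: abs(t)}

m = 2.0  # candidate strong-convexity constant
for (x, y) in [(-0.1, 0.1), (1.0, 2.0)]:
    # At theta = 1/2, m-strong convexity requires gap >= (m/2) * (1/4) * (x - y)^2
    required = (m / 8) * (x - y) ** 2
    print(f"\nPair (x, y) = ({x}, {y}), required strong-convexity margin: {required:.4f}")
    for name, f in functions.items():
        gap = midpoint_gap(f, x, y)
        print(f"  {name:4s}: gap = {gap:.6f}, "
              f"convex: {gap >= -1e-12}, m=2-strongly convex here: {gap >= required - 1e-12}")
```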
Concave Functions:
A function $f$ is concave if $-f$ is convex. Equivalently:
$$f(\theta x + (1 - \theta) y) \geq \theta f(x) + (1 - \theta) f(y)$$
Since minimizing a convex function is equivalent to maximizing a concave function, all our convexity theory applies to maximization problems with concave objectives.
Building intuition requires familiarity with common convex functions. These form the building blocks for constructing convex objectives in machine learning.
| Function | Domain | Convexity Type | ML Application |
|---|---|---|---|
| $f(x) = x^2$ | $\mathbb{R}$ | Strongly convex | Squared loss, L2 regularization |
| $f(x) = |x|$ | $\mathbb{R}$ | Convex (not strict) | L1 regularization, robust loss |
| $f(x) = e^x$ | $\mathbb{R}$ | Strictly convex | Exponential loss, softmax components |
| $f(x) = -\log(x)$ | $x > 0$ | Strictly convex | Log loss, barrier methods |
| $f(x) = x \log x$ | $x > 0$ | Strictly convex | Entropy, KL divergence |
| $f(x) = \max_i x_i$ | $\mathbb{R}^n$ | Convex (not strict) | Max-margin, hinge loss |
| $f(x) = \|x\|_p$ for $p \geq 1$ | $\mathbb{R}^n$ | Convex | Norm regularization |
| $f(X) = -\log \det(X)$ | $X \succ 0$ | Strictly convex | Covariance estimation |
Understanding Through Key Examples:
1. Quadratic Functions:
The function $f(x) = \frac{1}{2} x^T Q x + b^T x + c$, where $Q$ is a symmetric matrix, has Hessian $\nabla^2 f(x) = Q$ everywhere. It is convex if and only if $Q \succeq 0$ (positive semidefinite), and strictly convex if and only if $Q \succ 0$ (positive definite), in which case it is also $\lambda_{\min}(Q)$-strongly convex.
This is why the Hessian (second derivative matrix) plays such a central role in convexity verification.
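A minimal sketch of this classification (the example matrices are arbitrary); note that for quadratics, strict convexity coincides with strong convexity, with $m = \lambda_{\min}(Q)$:

```python
import numpy as np

def classify_quadratic(Q, tol=1e-10):
    """Classify f(x) = 0.5 x^T Q x + b^T x + c via the eigenvalues of its Hessian Q."""
    eigs = np.linalg.eigvalsh((Q + Q.T) / 2)  # symmetrize for numerical safety
    lam_min = eigs.min()
    if lam_min > tol:
        return f"strongly convex (m = {lam_min:.3f})"
    if lam_min >= -tol:
        return "convex but not strictly convex"
    return "not convex"

print(classify_quadratic(np.array([[2.0, 0.0], [0.0, 1.0]])))   # positive definite
print(classify_quadratic(np.array([[1.0, 0.0], [0.0, 0.0]])))   # positive semidefinite
print(classify_quadratic(np.array([[1.0, 0.0], [0.0, -1.0]])))  # indefinite
```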
2. Norms:
Every norm on $\mathbb{R}^n$ is convex. This follows from the triangle inequality:
$$\|\theta x + (1 - \theta) y\| \leq \|\theta x\| + \|(1 - \theta) y\| = \theta \|x\| + (1 - \theta) \|y\|$$
This is why L1 and L2 regularization terms are convex, making regularized optimization tractable.
3. Log-Sum-Exp (Softmax):
$$f(x) = \log\left(\sum_{i=1}^{n} e^{x_i}\right)$$
This is convex and serves as a smooth approximation to $\max_i x_i$. It's fundamental to softmax classification and attention mechanisms.
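A short sketch of the "smooth max" property, using the standard sandwich bound $\max_i x_i \leq \log\sum_i e^{x_i} \leq \max_i x_i + \log n$ (the input vector here is arbitrary):

```python
import numpy as np

def log_sum_exp(x):
    """Numerically stable log(sum(exp(x_i)))."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

rng = np.random.default_rng(0)
x = rng.normal(size=6)

lse, mx = log_sum_exp(x), np.max(x)
print(f"max(x)       = {mx:.4f}")
print(f"log-sum-exp  = {lse:.4f}")
print(f"max + log(n) = {mx + np.log(len(x)):.4f}")
assert mx <= lse <= mx + np.log(len(x))
```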
If $g$ is convex and non-decreasing, and $f$ is convex, then $g(f(x))$ is convex. If $g$ is convex and non-increasing, and $f$ is concave, then $g(f(x))$ is convex. These composition rules let us build complex convex functions from simpler ones.
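For example:
$$e^u \text{ convex and non-decreasing}, \quad \|x\|_2^2 \text{ convex} \;\;\Longrightarrow\;\; h(x) = e^{\|x\|_2^2} \text{ is convex},$$
and composing $-\log u$ (convex, non-increasing on $u > 0$) with a positive concave function $f$ shows that $-\log f(x)$ is convex, a common route to convex negative log-likelihoods.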
Just as certain operations preserve set convexity, analogous operations preserve function convexity: non-negative weighted sums, composition with affine maps ($x \mapsto f(Ax + b)$), and pointwise maxima or suprema all yield convex functions from convex ingredients. These rules are essential for constructing and verifying convex objectives.
Products of convex functions are NOT generally convex. For example, $f(x) = x$ and $g(x) = x^2$ are both convex on $\mathbb{R}$, but their product $f(x) g(x) = x^3$ is not convex (it is concave for $x < 0$). Similarly, $\min_i f_i(x)$ is NOT generally convex even when each $f_i$ is convex.
The Power of Pointwise Maximum:
The fact that $\max$ preserves convexity is surprisingly powerful and underlies many ML loss functions:
Hinge Loss: $\ell(y, \hat{y}) = \max(0, 1 - y \cdot \hat{y})$
This is the maximum of two affine functions (0 and $1 - y \cdot \hat{y}$), hence convex.
ReLU Activation: $\text{ReLU}(x) = \max(0, x)$
While individual ReLUs are convex, compositions of linear layers and ReLUs create non-convex networks—an important distinction.
Piecewise Linear Convex Functions:
Any piecewise linear convex function can be written as a maximum of affine functions: $$f(x) = \max_{i=1,\ldots,k} (a_i^T x + b_i)$$
This representation is central to problems like linear programming robustness and the design of activation functions.
```python
import numpy as np

# Demonstrating convexity-preserving operations

def squared_norm(x):
    """L2 squared - strongly convex"""
    return np.sum(x ** 2)

def l1_norm(x):
    """L1 norm - convex but not strictly convex"""
    return np.sum(np.abs(x))

def composed_quadratic(x, A, b):
    """f(Ax + b) where f is squared norm - still convex"""
    return squared_norm(A @ x + b)

def sum_of_convex(x, alpha=1.0, beta=0.1):
    """alpha * ||x||^2 + beta * ||x||_1 - convex (elastic net)"""
    return alpha * squared_norm(x) + beta * l1_norm(x)

def pointwise_max_example(x, affine_funcs):
    """
    max_i (a_i^T x + b_i) - convex as max of convex

    Args:
        x: input vector
        affine_funcs: list of (a_i, b_i) tuples
    """
    values = [a @ x + b for a, b in affine_funcs]
    return max(values)

def hinge_loss(y_true, y_pred):
    """
    Hinge loss: max(0, 1 - y * y_hat)
    Convex as max of two affine functions in y_pred
    """
    return np.maximum(0, 1 - y_true * y_pred)

def log_sum_exp(x):
    """
    log(sum(exp(x_i))) - convex approximation to max
    Numerically stable implementation
    """
    max_x = np.max(x)
    return max_x + np.log(np.sum(np.exp(x - max_x)))

# Verify convexity empirically
def verify_function_convexity(f, x, y, n_points=100):
    """
    Verify f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y)
    for many values of theta in [0, 1]
    """
    f_x, f_y = f(x), f(y)
    thetas = np.linspace(0, 1, n_points)

    violations = []
    for theta in thetas:
        z = theta * x + (1 - theta) * y
        f_z = f(z)
        interpolation = theta * f_x + (1 - theta) * f_y

        if f_z > interpolation + 1e-10:  # Small tolerance
            violations.append({
                'theta': theta,
                'f(z)': f_z,
                'interpolation': interpolation,
                'gap': f_z - interpolation
            })

    return {
        'is_convex': len(violations) == 0,
        'violations': violations
    }

# Example verification
np.random.seed(42)
x = np.random.randn(5)
y = np.random.randn(5)

print("Verifying convexity of L2 squared:")
print(verify_function_convexity(squared_norm, x, y))

print("\nVerifying convexity of log-sum-exp:")
print(verify_function_convexity(log_sum_exp, x, y))
```

Jensen's Inequality is arguably the single most important result in convex analysis. It generalizes the definition of convexity from two points to arbitrary weighted combinations.
Theorem (Jensen's Inequality):
Let $f$ be a convex function, let $x_1, \ldots, x_k$ be points in the domain of $f$, and let $\lambda_1, \ldots, \lambda_k$ be non-negative weights summing to 1. Then:
$$f\left(\sum_{i=1}^{k} \lambda_i x_i\right) \leq \sum_{i=1}^{k} \lambda_i f(x_i)$$
In words: the function value at the weighted average is at most the weighted average of function values.
If $X$ is a random variable and $f$ is convex, then $f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$. This is the form most commonly used in machine learning, especially in variational inference and information theory.
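A quick Monte Carlo illustration of the probabilistic form (the distribution and the convex function below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=2.0, size=1_000_000)  # any random variable works

f = np.exp  # a convex function

print(f"f(E[X]) = {f(X.mean()):.4f}")   # close to exp(1) ~ 2.72
print(f"E[f(X)] = {f(X).mean():.4f}")   # close to exp(1 + 2^2/2) ~ 20.1 (lognormal mean)
assert f(X.mean()) <= f(X).mean()       # Jensen: f(E[X]) <= E[f(X)]
```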
Proof Intuition:
Jensen's inequality follows from repeated application of the basic convexity definition. For two points, it's exactly the definition. For three points, we first combine two of them:
$$f(\lambda_1 x_1 + \lambda_2 x_2 + \lambda_3 x_3) = f\left((\lambda_1 + \lambda_2) \cdot \frac{\lambda_1 x_1 + \lambda_2 x_2}{\lambda_1 + \lambda_2} + \lambda_3 x_3\right)$$
Then apply convexity twice, using the fact that $\frac{\lambda_1 x_1 + \lambda_2 x_2}{\lambda_1 + \lambda_2}$ is itself a convex combination.
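Spelled out, the two applications of the two-point definition (assuming $\lambda_1 + \lambda_2 > 0$) are:
$$
\begin{aligned}
f(\lambda_1 x_1 + \lambda_2 x_2 + \lambda_3 x_3)
&\leq (\lambda_1 + \lambda_2)\, f\!\left(\frac{\lambda_1 x_1 + \lambda_2 x_2}{\lambda_1 + \lambda_2}\right) + \lambda_3 f(x_3) \\
&\leq (\lambda_1 + \lambda_2)\left[\frac{\lambda_1}{\lambda_1 + \lambda_2} f(x_1) + \frac{\lambda_2}{\lambda_1 + \lambda_2} f(x_2)\right] + \lambda_3 f(x_3) \\
&= \lambda_1 f(x_1) + \lambda_2 f(x_2) + \lambda_3 f(x_3).
\end{aligned}
$$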
Applications in Machine Learning:
Jensen's inequality appears throughout ML: it drives the derivation of the evidence lower bound (ELBO) in variational inference, justifies the lower-bounding step of the EM algorithm, and yields the non-negativity of the KL divergence, which we prove next.
Example: Proving KL Divergence is Non-negative
The Kullback-Leibler divergence is:
$$D_{KL}(p \| q) = \mathbb{E}_p\left[\log \frac{p(x)}{q(x)}\right] = -\mathbb{E}_p\left[\log \frac{q(x)}{p(x)}\right]$$
Since $-\log$ is convex and $\frac{q(x)}{p(x)}$ is a ratio of densities with $\mathbb{E}_p\left[\frac{q(x)}{p(x)}\right] = \int p(x) \cdot \frac{q(x)}{p(x)} dx = \int q(x) dx = 1$:
$$D_{KL}(p \| q) = -\mathbb{E}_p\left[\log \frac{q(x)}{p(x)}\right] \geq -\log \mathbb{E}_p\left[\frac{q(x)}{p(x)}\right] = -\log(1) = 0$$
Jensen's gives us this fundamental non-negativity of KL divergence.
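As a sanity check, the short sketch below draws random discrete distributions and confirms the bound numerically (illustrative only, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_distribution(k, rng):
    """Random discrete distribution with full support on k outcomes."""
    w = rng.random(k)
    return w / w.sum()

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions with full support."""
    return np.sum(p * np.log(p / q))

for _ in range(5):
    p = random_distribution(10, rng)
    q = random_distribution(10, rng)
    d = kl_divergence(p, q)
    print(f"D_KL(p || q) = {d:.4f}")
    assert d >= -1e-12  # non-negative, as Jensen's inequality guarantees
```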
The connection between convex functions and convex sets runs deeper than analogy—there's a precise equivalence through the concept of epigraphs.
Definition: Epigraph
The epigraph of a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is the set of points "on or above" the graph:
$$\text{epi}(f) = \{(x, t) \in \mathbb{R}^{n+1} : x \in \text{dom}(f),\ t \geq f(x)\}$$
Fundamental Theorem:
A function $f$ is convex if and only if its epigraph $\text{epi}(f)$ is a convex set.
This equivalence is powerful: it means we can translate any statement about convex functions into a statement about convex sets, and vice versa.
For a 1D function, the epigraph is the region above the curve extending to infinity. For $f(x) = x^2$, the epigraph is the region above the parabola—a convex set. For $f(x) = -x^2$, the epigraph is everything above an upside-down parabola—not convex.
Sublevel Sets:
The $\alpha$-sublevel set of $f$ is:
$$S_\alpha = \{x \in \text{dom}(f) : f(x) \leq \alpha\}$$
Key Property: If $f$ is convex, all its sublevel sets are convex.
Proof: Let $x, y \in S_\alpha$, so $f(x) \leq \alpha$ and $f(y) \leq \alpha$. For any $\theta \in [0, 1]$:
$$f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y) \leq \theta \alpha + (1 - \theta) \alpha = \alpha$$
So $\theta x + (1 - \theta) y \in S_\alpha$, proving $S_\alpha$ is convex.
Caution: The converse is FALSE—having convex sublevel sets does not imply the function is convex. The function $f(x) = \sqrt{|x|}$ has convex sublevel sets (intervals centered at 0) but is not convex (it is concave on each of the half-lines $x \geq 0$ and $x \leq 0$).
Quasiconvexity:
A function with convex sublevel sets is called quasiconvex. This is a strictly weaker condition than convexity:
$$f \text{ convex} \Rightarrow f \text{ quasiconvex}$$
but the converse doesn't hold. Quasiconvex functions still have nice optimization properties (local minima are still global for strictly quasiconvex functions), but the analysis is more delicate.
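The sketch below (illustrative only) checks both halves of the $f(x) = \sqrt{|x|}$ example from the caution above: the chord inequality fails at the midpoint of $0$ and $1$, yet every sublevel set is an interval $[-\alpha^2, \alpha^2]$ and hence convex.

```python
import numpy as np

f = lambda x: np.sqrt(np.abs(x))

# Chord test at the midpoint of x = 0 and y = 1: convexity would require
# f(0.5) <= 0.5 * (f(0) + f(1)) = 0.5, but f(0.5) ~ 0.707 -> not convex.
x, y = 0.0, 1.0
print(f"f(midpoint) = {f(0.5 * (x + y)):.4f}  vs  chord value = {0.5 * (f(x) + f(y)):.4f}")

# Sublevel sets: {x : sqrt(|x|) <= alpha} = [-alpha^2, alpha^2], an interval -> convex.
alpha = 0.8
grid = np.linspace(-2, 2, 2001)
inside = grid[f(grid) <= alpha]
print(f"Sublevel set for alpha = {alpha}: [{inside.min():.2f}, {inside.max():.2f}]")
```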
Practical Use of Sublevel Sets:
In constrained optimization, we often restrict our search to a sublevel set:
$$\min_x \; g(x) \quad \text{subject to} \quad f(x) \leq \alpha$$
If $f$ is convex, the feasible region $S_\alpha$ is convex, preserving the tractability of the optimization.
When functions are differentiable, we gain powerful calculus-based characterizations of convexity that are often easier to verify than the basic definition.
First-Order Characterization:
If $f$ is differentiable, then $f$ is convex if and only if for all $x, y$ in the domain:
$$f(y) \geq f(x) + \nabla f(x)^T (y - x)$$
This says the function lies above its tangent hyperplane at every point. The tangent is a global underestimator.
For a convex function, if you stand at any point and draw the tangent line (or hyperplane), the entire function graph lies on or above that tangent. This is what makes first-order information so powerful for convex functions: the gradient at a single point gives a global lower bound on the function, and if $\nabla f(x) = 0$, that bound immediately certifies $x$ as a global minimizer.
Second-Order Characterization:
If $f$ is twice differentiable, then $f$ is convex if and only if the Hessian is positive semidefinite everywhere:
$$\nabla^2 f(x) \succeq 0 \quad \forall x \in \text{dom}(f)$$
For strict convexity, we need $\nabla^2 f(x) \succ 0$ (positive definite).
For $m$-strong convexity, we need $\nabla^2 f(x) \succeq m I$ (eigenvalues at least $m$).
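As a worked check against the table of common convex functions above: $x^2$ has constant curvature $2$, making it $2$-strongly convex, while the others have strictly positive but vanishing curvature somewhere on their domains, making them strictly convex but not strongly convex.
$$
\frac{d^2}{dx^2}\,x^2 = 2, \qquad
\frac{d^2}{dx^2}\,e^x = e^x, \qquad
\frac{d^2}{dx^2}\big(-\log x\big) = \frac{1}{x^2}, \qquad
\frac{d^2}{dx^2}\big(x \log x\big) = \frac{1}{x} \quad (x > 0)
$$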
Practical Hessian Analysis:
For functions of the form $f(x) = g(Ax)$ where $g$ is convex:
$$\nabla^2 f(x) = A^T \nabla^2 g(Ax) A$$
This is PSD whenever $\nabla^2 g \succeq 0$, confirming that composition with affine transformations preserves convexity.
```python
import numpy as np
from scipy.linalg import eigvalsh

def verify_convexity_via_hessian(hessian_fn, test_points, tolerance=1e-10):
    """
    Verify convexity by checking Hessian positive semidefiniteness.

    Args:
        hessian_fn: Function that returns Hessian matrix at a point
        test_points: List of points to check
        tolerance: Numerical tolerance for eigenvalue check

    Returns:
        dict: Verification results
    """
    results = {
        'is_psd_everywhere': True,
        'min_eigenvalues': [],
        'problem_points': []
    }

    for x in test_points:
        H = hessian_fn(x)
        eigenvalues = eigvalsh(H)  # Real eigenvalues for symmetric matrix
        min_eig = np.min(eigenvalues)
        results['min_eigenvalues'].append(min_eig)

        if min_eig < -tolerance:
            results['is_psd_everywhere'] = False
            results['problem_points'].append({
                'point': x,
                'min_eigenvalue': min_eig
            })

    return results

def verify_first_order_condition(f, grad_f, x, y, tolerance=1e-8):
    """
    Verify f(y) >= f(x) + grad_f(x)^T (y - x)
    For convex functions, this should hold for all x, y.
    """
    f_y = f(y)
    f_x = f(x)
    grad_x = grad_f(x)

    lower_bound = f_x + np.dot(grad_x, y - x)
    gap = f_y - lower_bound

    return {
        'f(y)': f_y,
        'lower_bound': lower_bound,
        'gap': gap,
        'is_satisfied': gap >= -tolerance
    }

# Example: Verify logistic loss convexity
def logistic_loss(w, X, y):
    """Logistic regression loss (without regularization)"""
    z = y * (X @ w)
    return np.mean(np.log(1 + np.exp(-z)))

def logistic_loss_hessian(w, X, y):
    """Hessian of logistic loss"""
    n = len(y)
    z = X @ w
    p = 1 / (1 + np.exp(-y * z))
    D = np.diag(p * (1 - p))
    return (X.T @ D @ X) / n

# Generate sample data
np.random.seed(42)
n_samples, n_features = 100, 5
X = np.random.randn(n_samples, n_features)
y = np.sign(np.random.randn(n_samples))

# Test at random points
test_points = [np.random.randn(n_features) for _ in range(10)]
hessian_fn = lambda w: logistic_loss_hessian(w, X, y)

result = verify_convexity_via_hessian(hessian_fn, test_points)
print("Logistic loss convexity verification:")
print(f"  Is PSD everywhere: {result['is_psd_everywhere']}")
print(f"  Min eigenvalues: {[f'{e:.6f}' for e in result['min_eigenvalues'][:5]]}")
```

We've established the foundational concepts of convex analysis that underpin all tractable optimization in machine learning. Let's consolidate the key insights:
- A set is convex when it contains the line segment between any two of its points; intersections, affine images, and projections preserve this property.
- A function is convex when it lies on or below its chords; strict and strong convexity strengthen this, guaranteeing uniqueness of minimizers and quantitative curvature bounds.
- Non-negative weighted sums, affine compositions, and pointwise maxima of convex functions are convex, which is how most ML objectives are assembled.
- Jensen's inequality extends convexity to arbitrary weighted averages and expectations, underpinning results such as the non-negativity of KL divergence.
- A function is convex exactly when its epigraph is a convex set, and differentiable convexity can be certified by the tangent-plane bound or a positive semidefinite Hessian.
What's Next:
With the definitions of convex sets and functions established, we'll explore why convexity matters for optimization. The next page examines local vs. global optima — showing that for convex problems, we never need to worry about getting stuck in suboptimal solutions.
You now understand convex sets, convex functions, and the key operations that preserve convexity. This vocabulary and these tools will be essential as we explore optimality conditions and their applications to machine learning algorithms.