Consider two hikers, each blindfolded, trying to reach the lowest point of a mountain range at night. The first hiker walks through the Alps—a jagged landscape of peaks, valleys, ridges, and hidden ravines. They might descend into a valley that seems like the lowest point, only to learn later that a much deeper valley exists beyond the next ridge. Without global information, they're trapped by local geography.
The second hiker walks through a single giant bowl—a smooth, convex depression with one obvious lowest point. No matter where they start, following the downward slope leads inexorably to the same destination: the true bottom.
This metaphor captures the essence of local vs. global optimization. In machine learning, we're often the blindfolded hiker, using local gradient information to navigate loss landscapes we cannot fully observe. Whether we find the true optimum or get stuck in a suboptimal basin depends fundamentally on the geometry of our optimization problem.
This page will rigorously define local and global optima, prove why convexity guarantees equivalence between them, examine the pathologies of non-convex optimization, and build intuition for recognizing when these guarantees apply in machine learning.
Before proving anything, we need precise definitions. These definitions apply to minimization; for maximization, simply reverse the inequalities.
Definition: Global Minimum
A point $x^* \in \text{dom}(f)$ is a global minimum of $f$ if:
$$f(x^*) \leq f(x) \quad \forall x \in \text{dom}(f)$$
The function value at $x^*$ is the smallest possible over the entire domain.
Definition: Local Minimum
A point $x^*$ is a local minimum of $f$ if there exists $\epsilon > 0$ such that:
$$f(x^*) \leq f(x) \quad \forall x \text{ with } \|x - x^*\| < \epsilon$$
The function value at $x^*$ is the smallest in some neighborhood—but perhaps not globally.
| Type | Definition | Uniqueness |
|---|---|---|
| Global minimum | $f(x^*) \leq f(x)$ for all $x$ in domain | May not be unique |
| Strict global minimum | $f(x^*) < f(x)$ for all $x \neq x^*$ | Unique if exists |
| Local minimum | $f(x^*) \leq f(x)$ in some neighborhood | May be many |
| Strict local minimum | $f(x^*) < f(x)$ nearby, $x \neq x^*$ | Isolated |
The Fundamental Challenge:
Every global minimum is also a local minimum (any $\epsilon$ works). But the converse fails spectacularly in general: a function can have many local minima, only some of which are global.
For optimization algorithms that only use local information (gradients, Hessians), there's no way to distinguish a local minimum from a global minimum without exploring the entire domain—which is computationally intractable for high-dimensional problems.
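To make this concrete, here is a minimal sketch in Python (the function $f(x) = x^4 - 4x^2 + x$ is an illustrative choice, not from any particular application): gradient descent started on opposite sides of the origin lands in different minima, and nothing in the local gradient information reveals which one is global.

```python
import numpy as np

def f(x):
    """1D function with two basins: a global minimum near x ~ -1.47
    and a merely local minimum near x ~ 1.35."""
    return x**4 - 4 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 8 * x + 1

def descend(x, lr=0.01, steps=2000):
    """Plain gradient descent from a single starting point."""
    for _ in range(steps):
        x = x - lr * grad_f(x)
    return x

x_left = descend(-2.0)   # falls into the global minimum
x_right = descend(2.0)   # trapped in the local minimum
print(f"start -2: x = {x_left:.3f}, f = {f(x_left):.3f}")    # f ~ -5.44
print(f"start +2: x = {x_right:.3f}, f = {f(x_right):.3f}")  # f ~ -2.62
```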
This is why convexity is so precious:
For convex functions, every local minimum is a global minimum.
This single theorem transforms optimization from an impossibly hard search problem into a tractable computation.
A stationary point has $\nabla f(x) = 0$. For non-convex functions, stationary points can be local minima, local maxima, or saddle points (minima in some directions, maxima in others). For convex functions, every stationary point is a global minimum—no ambiguity.
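A tiny illustration of the non-convex case (a standard toy example): $f(x, y) = x^2 - y^2$ is stationary at the origin, yet the origin is a saddle, not a minimum.

```python
import numpy as np

# f(x, y) = x^2 - y^2: the gradient (2x, -2y) vanishes at the origin,
# but the Hessian diag(2, -2) has mixed-sign eigenvalues,
# so the origin is a saddle point, not a minimum.
grad = lambda p: np.array([2 * p[0], -2 * p[1]])
hessian = np.diag([2.0, -2.0])

print(grad(np.zeros(2)))            # [0. 0.]  -> stationary point
print(np.linalg.eigvalsh(hessian))  # [-2.  2.] -> indefinite: saddle
```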
We now prove the central result that makes convex optimization tractable.
Theorem: Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ be a convex function. If $x^*$ is a local minimum of $f$, then $x^*$ is a global minimum of $f$.
Proof:
Suppose $x^*$ is a local minimum but not a global minimum. Then there exists some point $y$ with $f(y) < f(x^*)$.
Since $x^*$ is a local minimum, there exists $\epsilon > 0$ such that $f(x^*) \leq f(x)$ for all $x$ with $\|x - x^*\| < \epsilon$.
Consider the line segment from $x^*$ toward $y$. For small enough $\theta > 0$, the point:
$$z = \theta y + (1 - \theta) x^* = x^* + \theta(y - x^*)$$
satisfies $\|z - x^*\| = \theta \|y - x^*\| < \epsilon$ (by choosing $\theta < \frac{\epsilon}{\|y - x^*\|}$).
By convexity of $f$:
$$f(z) = f(\theta y + (1 - \theta) x^*) \leq \theta f(y) + (1 - \theta) f(x^*)$$
Since $f(y) < f(x^*)$:
$$f(z) < \theta f(x^*) + (1 - \theta) f(x^*) = f(x^*)$$
But $z$ is within the $\epsilon$-neighborhood of $x^*$, and $f(z) < f(x^*)$, contradicting that $x^*$ is a local minimum.
Conclusion: Our assumption was wrong; $x^*$ must be a global minimum. $\square$
The proof uses the "shortcut" property of convexity. If there were a better point $y$, we could always find a point on the line toward $y$ that's better than $x^*$ but still arbitrarily close. Convexity prevents $x^*$ from being "locally good but globally bad."
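A quick numerical sanity check of the shortcut step, using the convex function $f(x) = \|x\|^2$ (an arbitrary choice for illustration):

```python
import numpy as np

f = lambda x: np.sum(x**2)  # convex

x_star = np.array([2.0, 0.0])  # candidate "local minimum"
y = np.array([0.5, 0.5])       # a strictly better point: f(y) < f(x_star)

# Points z on the segment toward y beat x_star for every theta in (0, 1],
# including theta small enough that z is arbitrarily close to x_star.
for theta in [0.5, 0.1, 0.01]:
    z = x_star + theta * (y - x_star)
    bound = theta * f(y) + (1 - theta) * f(x_star)  # convexity upper bound
    print(f"theta={theta}: f(z)={f(z):.4f} <= {bound:.4f} < f(x*)={f(x_star):.4f}")
```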
Corollary: Strict Convexity Implies Uniqueness
If $f$ is strictly convex and has a minimum, that minimum is unique.
Proof: Suppose $x^*$ and $y^*$ are both global minima with $x^* \neq y^*$, so $f(x^*) = f(y^*) = f^*$.
Consider $z = \frac{1}{2}x^* + \frac{1}{2}y^*$. By strict convexity:
$$f(z) < \frac{1}{2}f(x^*) + \frac{1}{2}f(y^*) = f^*$$
This contradicts $f^*$ being the minimum. $\square$
Corollary: Strong Convexity Guarantees Existence and Uniqueness
If $f$ is strongly convex, it has exactly one global minimum (assuming the domain is all of $\mathbb{R}^n$ or is closed and convex).
Strong convexity also provides convergence rate guarantees for optimization algorithms—we'll see this when studying gradient descent.
To appreciate convexity, we must understand what can go wrong without it. Non-convex optimization is notoriously difficult, and gradient-based methods offer no guarantees of finding global optima.
Common Pathologies in Non-Convex Landscapes:
- Multiple local minima: descent can terminate in a suboptimal basin.
- Saddle points: stationary points that curve down in some directions and up in others.
- Plateaus: nearly flat regions where gradients vanish and progress stalls.
- Ill-conditioned ravines: long, narrow valleys that force gradient descent to zigzag.
In deep learning, the ratio of saddle points to local minima grows exponentially with dimension. A critical point is a local minimum only if the Hessian has no negative eigenvalues. With millions of parameters, a random critical point satisfying this is vanishingly unlikely.
The Rastrigin Function: A Classic Nightmare
The Rastrigin function is a standard benchmark for global optimization:
$$f(x) = 10n + \sum_{i=1}^{n} \left[ x_i^2 - 10 \cos(2\pi x_i) \right]$$
This function has:
- a global minimum at $x = 0$ with $f(0) = 0$,
- a regular grid of local minima (roughly one near each integer lattice point) created by the cosine ripples.
For $n = 10$, there are roughly 10 billion local minima. Gradient descent from a random starting point almost certainly finds a local (not global) minimum.
Non-Convexity in ML:
Many important ML problems are non-convex:
- training deep neural networks,
- k-means and other clustering objectives,
- low-rank matrix factorization,
- fitting mixture models (e.g., via EM).
For these problems, we use heuristics, multiple random restarts, and theoretical results about "benign" non-convexity (e.g., certain neural networks have no spurious local minima).
```python
import numpy as np

def rastrigin(x):
    """Non-convex function with many local minima"""
    n = len(x)
    return 10 * n + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

def quadratic(x, Q=None):
    """Convex quadratic - has unique global minimum"""
    if Q is None:
        Q = np.eye(len(x))
    return 0.5 * x @ Q @ x

def gradient_descent(f, grad_f, x0, lr=0.01, max_iters=1000, tol=1e-8):
    """Standard gradient descent"""
    x = x0.copy()
    history = [x.copy()]
    for i in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - lr * g
        history.append(x.copy())
    return x, np.array(history)

def rastrigin_grad(x):
    """Gradient of Rastrigin function"""
    return 2 * x + 20 * np.pi * np.sin(2 * np.pi * x)

def quadratic_grad(x, Q=None):
    """Gradient of quadratic"""
    if Q is None:
        Q = np.eye(len(x))
    return Q @ x

# Demonstrate local minima trapping
np.random.seed(42)
n_trials = 20
final_values_rastrigin = []
final_values_quadratic = []

for _ in range(n_trials):
    x0 = np.random.uniform(-5, 5, size=2)

    # Rastrigin (non-convex)
    x_final, _ = gradient_descent(rastrigin, rastrigin_grad, x0, lr=0.001)
    final_values_rastrigin.append(rastrigin(x_final))

    # Quadratic (convex)
    x_final, _ = gradient_descent(quadratic, quadratic_grad, x0, lr=0.1)
    final_values_quadratic.append(quadratic(x_final))

print("Rastrigin (non-convex) final values:")
print(f"  Range: [{min(final_values_rastrigin):.4f}, {max(final_values_rastrigin):.4f}]")
print(f"  Unique values: {len(set(round(v, 2) for v in final_values_rastrigin))}")
print(f"  Global minimum: 0.0")

print("\nQuadratic (convex) final values:")
print(f"  Range: [{min(final_values_quadratic):.10f}, {max(final_values_quadratic):.10f}]")
print(f"  All converge to global minimum: {all(v < 1e-6 for v in final_values_quadratic)}")
```

For convex functions, we can characterize global minima using first-order (gradient) and sometimes second-order (Hessian) conditions.
First-Order Condition (Unconstrained):
For a differentiable convex function $f$ on $\mathbb{R}^n$, $x^*$ is a global minimum if and only if:
$$\nabla f(x^*) = 0$$
Proof: We showed earlier that $f(y) \geq f(x) + \nabla f(x)^T(y - x)$ for convex $f$. At $x^*$ with $\nabla f(x^*) = 0$, this becomes $f(y) \geq f(x^*)$ for all $y$—exactly the definition of global minimum.
Conversely, if $x^*$ is a global minimum and $\nabla f(x^*) \neq 0$, we can decrease $f$ by moving in the direction $-\nabla f(x^*)$.
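As a concrete instance, least squares $f(w) = \tfrac{1}{2}\|Xw - y\|^2$ is convex, so solving $\nabla f(w) = X^T(Xw - y) = 0$ (the normal equations) certifies a global minimum. A short sketch with synthetic data of my own construction:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=50)

# First-order condition: grad f(w*) = X^T (X w* - y) = 0 (normal equations)
w_star = np.linalg.solve(X.T @ X, X.T @ y)

grad_at_star = X.T @ (X @ w_star - y)
print(np.linalg.norm(grad_at_star))  # ~0 (machine precision): stationary, hence globally optimal
```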
For non-convex functions, $\nabla f(x^*) = 0$ is necessary but not sufficient for even local optimality. The point could be a saddle point (some directions curve up, others curve down) or a local maximum.
First-Order Condition (Constrained):
For minimizing convex $f$ over convex set $C$, $x^*$ is a global minimum if and only if:
$$\nabla f(x^*)^T (y - x^*) \geq 0 \quad \forall y \in C$$
This says the gradient must point "outward" from the constraint set at $x^*$: any feasible direction is non-descent.
For an interior point ($x^*$ strictly inside $C$), this reduces to $\nabla f(x^*) = 0$.
For a boundary point, the gradient can be nonzero but must point away from the feasible region.
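A one-dimensional toy example (my own) of the boundary case: minimize $f(x) = (x - 0.5)^2$ over $C = [1, 2]$. The unconstrained minimizer $0.5$ is infeasible, so the optimum is the boundary point $x^* = 1$, where the gradient is nonzero but every feasible direction is non-descent.

```python
# f(x) = (x - 0.5)**2 over C = [1, 2]; the optimum sits on the boundary at x* = 1
x_star = 1.0
grad = 2 * (x_star - 0.5)  # f'(1) = 1 > 0: nonzero at the constrained optimum

# Variational inequality: grad * (y - x_star) >= 0 for every feasible y
for y in [1.0, 1.25, 1.5, 2.0]:
    print(y, grad * (y - x_star))  # all values are >= 0
```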
Second-Order Condition:
For twice-differentiable convex $f$, if $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$ (positive definite), then $x^*$ is a strict local minimum—and by convexity, a strict global minimum.
For strongly convex $f$ (where $\nabla^2 f \succeq m I$ everywhere with $m > 0$), any stationary point is automatically a strict global minimum.
| Function Type | Condition for Global Min | Additional Guarantee |
|---|---|---|
| Convex, differentiable | $\nabla f(x^*) = 0$ | Every stationary point is global min |
| Strictly convex, differentiable | $\nabla f(x^*) = 0$ | At most one global min |
| Strongly convex | $\nabla f(x^*) = 0$ | Exactly one global min |
| Convex, constrained | $\nabla f(x^*)^T(y - x^*) \geq 0$ for all feasible $y$ | Gradient points out of feasible set |
The "loss landscape" metaphor is powerful but must be used carefully. Let's refine our intuition.
Convex Landscapes:
A convex function's sublevel sets $\{x : f(x) \leq \alpha\}$ are convex (nested "bowls"). As you descend, the region below your current level shrinks but remains well-behaved:
- every sublevel set is convex and connected, so there are no isolated pockets to get trapped in,
- any direction of local descent makes genuine global progress,
- the set of global minimizers is itself a convex set.
Non-Convex Landscapes:
Non-convex functions exhibit complex topology:
- sublevel sets can be disconnected, splitting the landscape into separate basins of attraction,
- saddle points and plateaus can stall gradient-based methods far from any minimum,
- local minima can be numerous, with widely varying function values.
Our 2D/3D intuition about landscapes breaks down in high dimensions. In $\mathbb{R}^{1000}$, a point can be a minimum in 999 directions but a maximum in 1 direction—making it a saddle point. Saddle points vastly outnumber local minima in high dimensions.
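A crude Monte Carlo sketch of this effect, under the idealized assumption that the Hessian at a random critical point looks like a random symmetric Gaussian matrix: the chance that all eigenvalues are positive collapses as dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 2000

for n in [2, 5, 10, 30]:
    all_positive = 0
    for _ in range(trials):
        A = rng.normal(size=(n, n))
        H = (A + A.T) / 2  # random symmetric "Hessian"
        if np.linalg.eigvalsh(H)[0] > 0:  # is the smallest eigenvalue positive?
            all_positive += 1
    print(f"n={n:3d}: P(all eigenvalues > 0) ~ {all_positive / trials:.4f}")
```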
Benign Non-Convexity:
Recent ML research has identified cases where non-convex problems are still tractable:
No spurious local minima: All local minima are global (even though the function isn't convex). Examples: certain matrix sensing problems, some neural network architectures.
Saddle points are escapable: With noise or second-order methods, saddle points can be escaped efficiently.
Local minima are "good enough": Even if not global, local minima may generalize well in practice.
Overparameterization helps: Neural networks with more parameters than data points often have loss landscapes where gradient descent finds global minima.
These results don't make non-convex optimization equivalent to convex—the guarantees are weaker and problem-specific—but they explain why deep learning works in practice.
Understanding local vs. global optima has direct implications for ML practice.
Convex ML Problems (Global Optimality Guaranteed):
- Linear regression (least squares)
- Ridge and Lasso regression
- Logistic regression
- Support vector machines (hinge loss with L2 regularization)
Non-Convex ML Problems (No Guarantees):
- Deep neural network training
- k-means clustering
- Low-rank matrix factorization
- Gaussian mixture models fit via EM
For mission-critical applications (finance, healthcare, autonomous systems), convex formulations provide guarantees that non-convex methods cannot. When your model must be certifiably optimal or when debugging is difficult, the interpretability and reliability of convex optimization is invaluable.
Design Principles:
Start convex when possible: If a convex model achieves acceptable accuracy, prefer it over non-convex alternatives.
Convexify when you can: Sometimes a non-convex problem can be relaxed to a convex one (e.g., nuclear norm relaxation for rank minimization).
Understand your landscape: For non-convex problems, analyze whether the landscape is "benign" (no bad local minima) or challenging.
Use multiple initializations: For non-convex problems, run optimization from multiple starting points and take the best result.
Regularize toward convexity: Adding L2 regularization makes objectives "more convex" (increases minimum curvature).
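To illustrate the last principle, here is a minimal sketch (the ill-conditioned design matrix is a synthetic construction for illustration): adding an L2 penalty $\tfrac{\lambda}{2}\|w\|^2$ shifts every eigenvalue of the least-squares Hessian up by $\lambda$, which strictly lowers the condition number.

```python
import numpy as np

rng = np.random.default_rng(1)
# Ill-conditioned least-squares design: one column scaled 100x
X = rng.normal(size=(200, 5)) * np.array([1, 1, 1, 1, 100.0])

H = X.T @ X                  # Hessian of 0.5 * ||X w - y||^2
lam = 10.0
H_reg = H + lam * np.eye(5)  # Hessian with added L2 penalty (lam/2) * ||w||^2

print(f"cond(H)     = {np.linalg.cond(H):.1f}")
print(f"cond(H_reg) = {np.linalg.cond(H_reg):.1f}")  # strictly smaller
```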
The connection between convexity and optimization goes deeper through the lens of curvature.
Curvature and the Hessian:
The Hessian matrix $\nabla^2 f(x)$ captures the local curvature of $f$ at $x$. Its eigenvalues determine:
- the local shape: all eigenvalues positive means locally convex curvature, all negative locally concave, mixed signs a saddle;
- the steepness: each eigenvalue measures how sharply $f$ curves along the corresponding eigendirection;
- the conditioning: the spread between the largest and smallest eigenvalues determines how difficult the region is for gradient descent.
Condition Number:
For a strongly convex function such that $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x$, the condition number is:
$$\kappa = \frac{M}{m}$$
This measures how "stretched" the landscape is. A well-conditioned problem ($\kappa$ close to 1) is approximately isotropic—all directions have similar curvature. A poorly conditioned problem ($\kappa \gg 1$) has very different curvatures in different directions, causing gradient descent to zigzag.
```python
import numpy as np
from scipy.linalg import eigvalsh

def analyze_curvature(hessian):
    """
    Analyze the curvature of a function through its Hessian.

    Returns:
        dict with eigenvalue statistics and convexity classification
    """
    eigenvalues = eigvalsh(hessian)
    min_eig = np.min(eigenvalues)
    max_eig = np.max(eigenvalues)

    # Determine convexity type
    if min_eig > 1e-10:
        convexity = "strongly convex"
        condition_number = max_eig / min_eig
    elif min_eig >= -1e-10:
        convexity = "convex (possibly degenerate)"
        condition_number = np.inf if min_eig < 1e-10 else max_eig / min_eig
    else:
        n_negative = np.sum(eigenvalues < -1e-10)
        n_positive = np.sum(eigenvalues > 1e-10)
        if n_negative > 0 and n_positive > 0:
            convexity = f"saddle point (indefinite): {n_negative} negative, {n_positive} positive"
        elif n_negative > 0:
            convexity = "locally concave (negative definite)"
        else:
            convexity = "degenerate"
        condition_number = np.nan

    return {
        'eigenvalues': eigenvalues,
        'min_eigenvalue': min_eig,
        'max_eigenvalue': max_eig,
        'convexity': convexity,
        'condition_number': condition_number
    }

# Example: Compare well-conditioned vs ill-conditioned
I = np.eye(5)
well_conditioned = I                            # condition number = 1
ill_conditioned = np.diag([1, 1, 1, 1, 1000])   # condition number = 1000
saddle_hessian = np.diag([1, 1, -1, 1, 1])      # Has a negative eigenvalue

print("Well-conditioned Hessian:")
print(analyze_curvature(well_conditioned))

print("\nIll-conditioned Hessian:")
print(analyze_curvature(ill_conditioned))

print("\nSaddle point Hessian:")
print(analyze_curvature(saddle_hessian))
```

Convergence Rate Implications:
For gradient descent with step size $\eta = 1/M$, minimizing an $m$-strongly convex function with an $M$-Lipschitz gradient:
$$f(x_k) - f(x^*) \leq \left(1 - \frac{m}{M}\right)^k \left(f(x_0) - f(x^*)\right)$$
The convergence rate depends on $\kappa = M/m$. Low condition number means fast convergence; high condition number means slow convergence.
This is why ill-conditioned problems (long narrow valleys) are hard for gradient descent: they have high condition numbers, so the convergence factor $(1 - 1/\kappa)$ is close to 1.
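A quick empirical check of the rate on a toy quadratic (my own setup, with Hessian eigenvalues $m = 1$ and $M = 100$ and step size $\eta = 1/M$): the observed per-step contraction of $f(x_k) - f^*$ stays below the theoretical factor $1 - m/M$.

```python
import numpy as np

m, M = 1.0, 100.0
Q = np.diag([m, M])              # Hessian eigenvalues m and M
f = lambda x: 0.5 * x @ Q @ x    # strongly convex quadratic, f* = 0

x = np.array([1.0, 1.0])
eta = 1 / M                      # standard step size for the (1 - m/M)^k rate
vals = []
for _ in range(200):
    x = x - eta * (Q @ x)        # gradient step: grad f(x) = Q x
    vals.append(f(x))

# Observed per-step ratio ~ (1 - m/M)^2 ~ 0.98, below the bound factor 0.99
print(f"observed per-step ratio:  {vals[-1] / vals[-2]:.4f}")
print(f"theoretical bound factor: {1 - m / M:.4f}")
```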
We've established one of the most important results in optimization—the equivalence of local and global minima for convex functions. Let's consolidate:
- For convex $f$, every local minimum is global, so local search is enough.
- Strict convexity adds uniqueness of the minimizer; strong convexity adds existence, uniqueness, and convergence-rate guarantees.
- Without convexity, stationary points may be minima, maxima, or saddle points, and gradient methods offer no global guarantees.
- The condition number $\kappa = M/m$ controls how fast gradient descent converges on strongly convex problems.
What's Next:
Knowing that local minima are global is powerful, but how do we find them? The next pages will develop the first-order and second-order optimality conditions that characterize minima for convex functions, providing actionable stopping criteria for optimization algorithms.
You now understand why convexity transforms optimization from an intractable search into a reliable computation. The guarantee that any local minimum is global means gradient-based methods can be trusted to find the optimum—no restarts, no luck, just mathematics.