Consider two hikers, each blindfolded, trying to reach the lowest point of a mountain range at night. The first hiker walks through the Alps—a jagged landscape of peaks, valleys, ridges, and hidden ravines. They might descend into a valley that seems like the lowest point, only to learn later that a much deeper valley exists beyond the next ridge. Without global information, they're trapped by local geography.
The second hiker walks through a single giant bowl—a smooth, convex depression with one obvious lowest point. No matter where they start, following the downward slope leads inexorably to the same destination: the true bottom.
This metaphor captures the essence of local vs. global optimization. In machine learning, we're often the blindfolded hiker, using local gradient information to navigate loss landscapes we cannot fully observe. Whether we find the true optimum or get stuck in a suboptimal basin depends fundamentally on the geometry of our optimization problem.
This page will rigorously define local and global optima, prove why convexity guarantees equivalence between them, examine the pathologies of non-convex optimization, and build intuition for recognizing when these guarantees apply in machine learning.
Before proving anything, we need precise definitions. These definitions apply to minimization; for maximization, simply reverse the inequalities.
Definition: Global Minimum
A point $x^* \in \text{dom}(f)$ is a global minimum of $f$ if:
$$f(x^*) \leq f(x) \quad \forall x \in \text{dom}(f)$$
The function value at $x^*$ is the smallest possible over the entire domain.
Definition: Local Minimum
A point $x^*$ is a local minimum of $f$ if there exists $\epsilon > 0$ such that:
$$f(x^*) \leq f(x) \quad \forall x \text{ with } \|x - x^*\| < \epsilon$$
The function value at $x^*$ is the smallest in some neighborhood—but perhaps not globally.
| Type | Definition | Uniqueness |
|---|---|---|
| Global minimum | $f(x^*) \leq f(x)$ for all $x$ in domain | May not be unique |
| Strict global minimum | $f(x^*) < f(x)$ for all $x \neq x^*$ | Unique if exists |
| Local minimum | $f(x^*) \leq f(x)$ in some neighborhood | May be many |
| Strict local minimum | $f(x^*) < f(x)$ nearby, $x \neq x^*$ | Isolated |
The Fundamental Challenge:
Every global minimum is also a local minimum (any $\epsilon$ works). But the converse fails spectacularly in general: a function can have many local minima, only some of which are global.
For optimization algorithms that only use local information (gradients, Hessians), there's no way to distinguish a local minimum from a global minimum without exploring the entire domain—which is computationally intractable for high-dimensional problems.
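To make this concrete, here is a minimal sketch in Python (the function $f(x) = x^4 - 4x^2 + x$ is an illustrative choice, not from any particular application): gradient descent started on opposite sides of the origin lands in different minima, and nothing in the local gradient information reveals which one is global.

```python
import numpy as np

def f(x):
    """1D function with two basins: a global minimum near x ~ -1.47
    and a merely local minimum near x ~ 1.35."""
    return x**4 - 4 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 8 * x + 1

def descend(x, lr=0.01, steps=2000):
    """Plain gradient descent from a single starting point."""
    for _ in range(steps):
        x = x - lr * grad_f(x)
    return x

x_left = descend(-2.0)   # falls into the global minimum
x_right = descend(2.0)   # trapped in the local minimum
print(f"start -2: x = {x_left:.3f}, f = {f(x_left):.3f}")    # f ~ -5.44
print(f"start +2: x = {x_right:.3f}, f = {f(x_right):.3f}")  # f ~ -2.62
```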
This is why convexity is so precious:
For convex functions, every local minimum is a global minimum.
This single theorem transforms optimization from an impossibly hard search problem into a tractable computation.
A stationary point has $\nabla f(x) = 0$. For non-convex functions, stationary points can be local minima, local maxima, or saddle points (minima in some directions, maxima in others). For convex functions, every stationary point is a global minimum—no ambiguity.
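A tiny illustration of the non-convex case (a standard toy example): $f(x, y) = x^2 - y^2$ is stationary at the origin, yet the origin is a saddle, not a minimum.

```python
import numpy as np

# f(x, y) = x^2 - y^2: the gradient (2x, -2y) vanishes at the origin,
# but the Hessian diag(2, -2) has mixed-sign eigenvalues,
# so the origin is a saddle point, not a minimum.
grad = lambda p: np.array([2 * p[0], -2 * p[1]])
hessian = np.diag([2.0, -2.0])

print(grad(np.zeros(2)))            # [0. 0.]  -> stationary point
print(np.linalg.eigvalsh(hessian))  # [-2.  2.] -> indefinite: saddle
```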
We now prove the central result that makes convex optimization tractable.
Theorem: Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ be a convex function. If $x^*$ is a local minimum of $f$, then $x^*$ is a global minimum of $f$.
Proof:
Suppose $x^*$ is a local minimum but not a global minimum. Then there exists some point $y$ with $f(y) < f(x^*)$.
Since $x^*$ is a local minimum, there exists $\epsilon > 0$ such that $f(x^*) \leq f(x)$ for all $x$ with $\|x - x^*\| < \epsilon$.
Consider the line segment from $x^*$ toward $y$. For small enough $\theta > 0$, the point:
$$z = \theta y + (1 - \theta) x^* = x^* + \theta(y - x^*)$$
satisfies $\|z - x^*\| = \theta \|y - x^*\| < \epsilon$ (by choosing $\theta < \frac{\epsilon}{\|y - x^*\|}$).
By convexity of $f$:
$$f(z) = f(\theta y + (1 - \theta) x^*) \leq \theta f(y) + (1 - \theta) f(x^*)$$
Since $f(y) < f(x^*)$:
$$f(z) < \theta f(x^*) + (1 - \theta) f(x^*) = f(x^*)$$
But $z$ is within the $\epsilon$-neighborhood of $x^*$, and $f(z) < f(x^*)$, contradicting that $x^*$ is a local minimum.
Conclusion: Our assumption was wrong; $x^*$ must be a global minimum. $\square$
The proof uses the "shortcut" property of convexity. If there were a better point $y$, we could always find a point on the line toward $y$ that's better than $x^*$ but still arbitrarily close. Convexity prevents $x^*$ from being "locally good but globally bad."
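A quick numerical sanity check of the shortcut step, using the convex function $f(x) = \|x\|^2$ (an arbitrary choice for illustration):

```python
import numpy as np

f = lambda x: np.sum(x**2)  # convex

x_star = np.array([2.0, 0.0])  # candidate "local minimum"
y = np.array([0.5, 0.5])       # a strictly better point: f(y) < f(x_star)

# Points z on the segment toward y beat x_star for every theta in (0, 1],
# including theta small enough that z is arbitrarily close to x_star.
for theta in [0.5, 0.1, 0.01]:
    z = x_star + theta * (y - x_star)
    bound = theta * f(y) + (1 - theta) * f(x_star)  # convexity upper bound
    print(f"theta={theta}: f(z)={f(z):.4f} <= {bound:.4f} < f(x*)={f(x_star):.4f}")
```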
Corollary: Strict Convexity Implies Uniqueness
If $f$ is strictly convex and has a minimum, that minimum is unique.
Proof: Suppose $x^*$ and $y^*$ are both global minima with $x^* \neq y^*$, so $f(x^*) = f(y^*) = f^*$.
Consider $z = \frac{1}{2}x^* + \frac{1}{2}y^*$. By strict convexity:
$$f(z) < \frac{1}{2}f(x^*) + \frac{1}{2}f(y^*) = f^*$$
This contradicts $f^*$ being the minimum. $\square$
Corollary: Strong Convexity Guarantees Existence and Uniqueness
If $f$ is strongly convex, it has exactly one global minimum (assuming the domain is all of $\mathbb{R}^n$ or is closed and convex).
Strong convexity also provides convergence rate guarantees for optimization algorithms—we'll see this when studying gradient descent.
To appreciate convexity, we must understand what can go wrong without it. Non-convex optimization is notoriously difficult, and gradient-based methods offer no guarantees of finding global optima.
Common Pathologies in Non-Convex Landscapes:
- Multiple local minima: descent can terminate in a suboptimal basin.
- Saddle points: stationary points that curve down in some directions and up in others.
- Plateaus: nearly flat regions where gradients vanish and progress stalls.
- Ill-conditioned ravines: long, narrow valleys that force gradient descent to zigzag.
In deep learning, the ratio of saddle points to local minima grows exponentially with dimension. A critical point is a local minimum only if the Hessian has no negative eigenvalues. With millions of parameters, a random critical point satisfying this is vanishingly unlikely.
The Rastrigin Function: A Classic Nightmare
The Rastrigin function is a standard benchmark for global optimization:
$$f(x) = 10n + \sum_{i=1}^{n} \left[ x_i^2 - 10 \cos(2\pi x_i) \right]$$
This function has:
- a global minimum at $x = 0$ with $f(0) = 0$,
- a regular grid of local minima (roughly one near each integer lattice point) created by the cosine ripples.
For $n = 10$, there are roughly 10 billion local minima. Gradient descent from a random starting point almost certainly finds a local (not global) minimum.
Non-Convexity in ML:
Many important ML problems are non-convex:
- training deep neural networks,
- k-means and other clustering objectives,
- low-rank matrix factorization,
- fitting mixture models (e.g., via EM).
For these problems, we use heuristics, multiple random restarts, and theoretical results about "benign" non-convexity (e.g., certain neural networks have no spurious local minima).
```python
import numpy as np

def rastrigin(x):
    """Non-convex function with many local minima"""
    n = len(x)
    return 10 * n + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

def quadratic(x, Q=None):
    """Convex quadratic - has unique global minimum"""
    if Q is None:
        Q = np.eye(len(x))
    return 0.5 * x @ Q @ x

def gradient_descent(f, grad_f, x0, lr=0.01, max_iters=1000, tol=1e-8):
    """Standard gradient descent"""
    x = x0.copy()
    history = [x.copy()]
    for i in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - lr * g
        history.append(x.copy())
    return x, np.array(history)

def rastrigin_grad(x):
    """Gradient of Rastrigin function"""
    return 2 * x + 20 * np.pi * np.sin(2 * np.pi * x)

def quadratic_grad(x, Q=None):
    """Gradient of quadratic"""
    if Q is None:
        Q = np.eye(len(x))
    return Q @ x

# Demonstrate local minima trapping
np.random.seed(42)
n_trials = 20
final_values_rastrigin = []
final_values_quadratic = []

for _ in range(n_trials):
    x0 = np.random.uniform(-5, 5, size=2)

    # Rastrigin (non-convex)
    x_final, _ = gradient_descent(rastrigin, rastrigin_grad, x0, lr=0.001)
    final_values_rastrigin.append(rastrigin(x_final))

    # Quadratic (convex)
    x_final, _ = gradient_descent(quadratic, quadratic_grad, x0, lr=0.1)
    final_values_quadratic.append(quadratic(x_final))

print("Rastrigin (non-convex) final values:")
print(f"  Range: [{min(final_values_rastrigin):.4f}, {max(final_values_rastrigin):.4f}]")
print(f"  Unique values: {len(set(round(v, 2) for v in final_values_rastrigin))}")
print(f"  Global minimum: 0.0")

print("\nQuadratic (convex) final values:")
print(f"  Range: [{min(final_values_quadratic):.10f}, {max(final_values_quadratic):.10f}]")
print(f"  All converge to global minimum: {all(v < 1e-6 for v in final_values_quadratic)}")
```

For convex functions, we can characterize global minima using first-order (gradient) and sometimes second-order (Hessian) conditions.
First-Order Condition (Unconstrained):
For a differentiable convex function $f$ on $\mathbb{R}^n$, $x^*$ is a global minimum if and only if:
$$\nabla f(x^*) = 0$$
Proof: We showed earlier that $f(y) \geq f(x) + \nabla f(x)^T(y - x)$ for convex $f$. At $x^*$ with $\nabla f(x^*) = 0$, this becomes $f(y) \geq f(x^*)$ for all $y$—exactly the definition of global minimum.
Conversely, if $x^*$ is a global minimum and $\nabla f(x^*) \neq 0$, we can decrease $f$ by moving in the direction $-\nabla f(x^*)$.
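As a concrete instance, least squares $f(w) = \tfrac{1}{2}\|Xw - y\|^2$ is convex, so solving $\nabla f(w) = X^T(Xw - y) = 0$ (the normal equations) certifies a global minimum. A short sketch with synthetic data of my own construction:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=50)

# First-order condition: grad f(w*) = X^T (X w* - y) = 0 (normal equations)
w_star = np.linalg.solve(X.T @ X, X.T @ y)

grad_at_star = X.T @ (X @ w_star - y)
print(np.linalg.norm(grad_at_star))  # ~0 (machine precision): stationary, hence globally optimal
```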
For non-convex functions, $\nabla f(x^*) = 0$ is necessary but not sufficient for even local optimality. The point could be a saddle point (some directions curve up, others curve down) or a local maximum.
First-Order Condition (Constrained):
For minimizing convex $f$ over convex set $C$, $x^*$ is a global minimum if and only if:
$$\nabla f(x^*)^T (y - x^*) \geq 0 \quad \forall y \in C$$
This says the gradient must point "outward" from the constraint set at $x^*$: any feasible direction is non-descent.
For an interior point ($x^*$ strictly inside $C$), this reduces to $\nabla f(x^*) = 0$.
For a boundary point, the gradient can be nonzero but must point away from the feasible region.
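A one-dimensional toy example (my own) of the boundary case: minimize $f(x) = (x - 0.5)^2$ over $C = [1, 2]$. The unconstrained minimizer $0.5$ is infeasible, so the optimum is the boundary point $x^* = 1$, where the gradient is nonzero but every feasible direction is non-descent.

```python
# f(x) = (x - 0.5)**2 over C = [1, 2]; the optimum sits on the boundary at x* = 1
x_star = 1.0
grad = 2 * (x_star - 0.5)  # f'(1) = 1 > 0: nonzero at the constrained optimum

# Variational inequality: grad * (y - x_star) >= 0 for every feasible y
for y in [1.0, 1.25, 1.5, 2.0]:
    print(y, grad * (y - x_star))  # all values are >= 0
```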
Second-Order Condition:
For twice-differentiable convex $f$, if $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$ (positive definite), then $x^*$ is a strict local minimum—and by convexity, a strict global minimum.
For strongly convex $f$ (where $\nabla^2 f \succeq m I$ everywhere with $m > 0$), any stationary point is automatically a strict global minimum.
| Function Type | Condition for Global Min | Additional Guarantee |
|---|---|---|
| Convex, differentiable | $\nabla f(x^*) = 0$ | Every stationary point is global min |
| Strictly convex, differentiable | $\nabla f(x^*) = 0$ | At most one global min |
| Strongly convex | $\nabla f(x^*) = 0$ | Exactly one global min |
| Convex, constrained | $\nabla f(x^*)^T(y - x^*) \geq 0$ for all feasible $y$ | Gradient points out of feasible set |
The "loss landscape" metaphor is powerful but must be used carefully. Let's refine our intuition.
Convex Landscapes:
A convex function's sublevel sets $\{x : f(x) \leq \alpha\}$ are convex (nested "bowls"). As you descend, the region below your current level shrinks but remains well-behaved:
- every sublevel set is convex and connected, so there are no isolated pockets to get trapped in,
- any direction of local descent makes genuine global progress,
- the set of global minimizers is itself a convex set.
Non-Convex Landscapes:
Non-convex functions exhibit complex topology:
- sublevel sets can be disconnected, splitting the landscape into separate basins of attraction,
- saddle points and plateaus can stall gradient-based methods far from any minimum,
- local minima can be numerous, with widely varying function values.
Our 2D/3D intuition about landscapes breaks down in high dimensions. In $\mathbb{R}^{1000}$, a point can be a minimum in 999 directions but a maximum in 1 direction—making it a saddle point. Saddle points vastly outnumber local minima in high dimensions.
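A crude Monte Carlo sketch of this effect, under the idealized assumption that the Hessian at a random critical point looks like a random symmetric Gaussian matrix: the chance that all eigenvalues are positive collapses as dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 2000

for n in [2, 5, 10, 30]:
    all_positive = 0
    for _ in range(trials):
        A = rng.normal(size=(n, n))
        H = (A + A.T) / 2  # random symmetric "Hessian"
        if np.linalg.eigvalsh(H)[0] > 0:  # is the smallest eigenvalue positive?
            all_positive += 1
    print(f"n={n:3d}: P(all eigenvalues > 0) ~ {all_positive / trials:.4f}")
```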
Benign Non-Convexity:
Recent ML research has identified cases where non-convex problems are still tractable:
No spurious local minima: All local minima are global (even though the function isn't convex). Examples: certain matrix sensing problems, some neural network architectures.
Saddle points are escapable: With noise or second-order methods, saddle points can be escaped efficiently.
Local minima are "good enough": Even if not global, local minima may generalize well in practice.
Overparameterization helps: Neural networks with more parameters than data points often have loss landscapes where gradient descent finds global minima.
These results don't make non-convex optimization equivalent to convex—the guarantees are weaker and problem-specific—but they explain why deep learning works in practice.
Understanding local vs. global optima has direct implications for ML practice.
Convex ML Problems (Global Optimality Guaranteed):
- Linear regression (least squares)
- Ridge and Lasso regression
- Logistic regression
- Support vector machines (hinge loss with L2 regularization)
Non-Convex ML Problems (No Guarantees):
- Deep neural network training
- k-means clustering
- Low-rank matrix factorization
- Gaussian mixture models fit via EM
For mission-critical applications (finance, healthcare, autonomous systems), convex formulations provide guarantees that non-convex methods cannot. When your model must be certifiably optimal or when debugging is difficult, the interpretability and reliability of convex optimization is invaluable.
Design Principles:
Start convex when possible: If a convex model achieves acceptable accuracy, prefer it over non-convex alternatives.
Convexify when you can: Sometimes a non-convex problem can be relaxed to a convex one (e.g., nuclear norm relaxation for rank minimization).
Understand your landscape: For non-convex problems, analyze whether the landscape is "benign" (no bad local minima) or challenging.
Use multiple initializations: For non-convex problems, run optimization from multiple starting points and take the best result.
Regularize toward convexity: Adding L2 regularization makes objectives "more convex" (increases minimum curvature).
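To illustrate the last principle, here is a minimal sketch (the ill-conditioned design matrix is a synthetic construction for illustration): adding an L2 penalty $\tfrac{\lambda}{2}\|w\|^2$ shifts every eigenvalue of the least-squares Hessian up by $\lambda$, which strictly lowers the condition number.

```python
import numpy as np

rng = np.random.default_rng(1)
# Ill-conditioned least-squares design: one column scaled 100x
X = rng.normal(size=(200, 5)) * np.array([1, 1, 1, 1, 100.0])

H = X.T @ X                  # Hessian of 0.5 * ||X w - y||^2
lam = 10.0
H_reg = H + lam * np.eye(5)  # Hessian with added L2 penalty (lam/2) * ||w||^2

print(f"cond(H)     = {np.linalg.cond(H):.1f}")
print(f"cond(H_reg) = {np.linalg.cond(H_reg):.1f}")  # strictly smaller
```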
The connection between convexity and optimization goes deeper through the lens of curvature.
Curvature and the Hessian:
The Hessian matrix $\nabla^2 f(x)$ captures the local curvature of $f$ at $x$. Its eigenvalues determine:
- the local shape: all eigenvalues positive means locally convex curvature, all negative locally concave, mixed signs a saddle;
- the steepness: each eigenvalue measures how sharply $f$ curves along the corresponding eigendirection;
- the conditioning: the spread between the largest and smallest eigenvalues determines how difficult the region is for gradient descent.
Condition Number:
For a strongly convex function such that $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x$, the condition number is:
$$\kappa = \frac{M}{m}$$
This measures how "stretched" the landscape is. A well-conditioned problem ($\kappa$ close to 1) is approximately isotropic—all directions have similar curvature. A poorly conditioned problem ($\kappa \gg 1$) has very different curvatures in different directions, causing gradient descent to zigzag.
```python
import numpy as np
from scipy.linalg import eigvalsh

def analyze_curvature(hessian):
    """
    Analyze the curvature of a function through its Hessian.

    Returns:
        dict with eigenvalue statistics and convexity classification
    """
    eigenvalues = eigvalsh(hessian)
    min_eig = np.min(eigenvalues)
    max_eig = np.max(eigenvalues)

    # Determine convexity type
    if min_eig > 1e-10:
        convexity = "strongly convex"
        condition_number = max_eig / min_eig
    elif min_eig >= -1e-10:
        convexity = "convex (possibly degenerate)"
        condition_number = np.inf if min_eig < 1e-10 else max_eig / min_eig
    else:
        n_negative = np.sum(eigenvalues < -1e-10)
        n_positive = np.sum(eigenvalues > 1e-10)
        if n_negative > 0 and n_positive > 0:
            convexity = f"saddle point (indefinite): {n_negative} negative, {n_positive} positive"
        elif n_negative > 0:
            convexity = "locally concave (negative definite)"
        else:
            convexity = "degenerate"
        condition_number = np.nan

    return {
        'eigenvalues': eigenvalues,
        'min_eigenvalue': min_eig,
        'max_eigenvalue': max_eig,
        'convexity': convexity,
        'condition_number': condition_number
    }

# Example: Compare well-conditioned vs ill-conditioned
I = np.eye(5)
well_conditioned = I                            # condition number = 1
ill_conditioned = np.diag([1, 1, 1, 1, 1000])   # condition number = 1000
saddle_hessian = np.diag([1, 1, -1, 1, 1])      # Has a negative eigenvalue

print("Well-conditioned Hessian:")
print(analyze_curvature(well_conditioned))

print("\nIll-conditioned Hessian:")
print(analyze_curvature(ill_conditioned))

print("\nSaddle point Hessian:")
print(analyze_curvature(saddle_hessian))
```

Convergence Rate Implications:
For gradient descent with step size $\eta = 1/M$, minimizing an $m$-strongly convex function with an $M$-Lipschitz gradient:
$$f(x_k) - f(x^*) \leq \left(1 - \frac{m}{M}\right)^k \left(f(x_0) - f(x^*)\right)$$
The convergence rate depends on $\kappa = M/m$. Low condition number means fast convergence; high condition number means slow convergence.
This is why ill-conditioned problems (long narrow valleys) are hard for gradient descent: they have high condition numbers, so the convergence factor $(1 - 1/\kappa)$ is close to 1.
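A quick empirical check of the rate on a toy quadratic (my own setup, with Hessian eigenvalues $m = 1$ and $M = 100$ and step size $\eta = 1/M$): the observed per-step contraction of $f(x_k) - f^*$ stays below the theoretical factor $1 - m/M$.

```python
import numpy as np

m, M = 1.0, 100.0
Q = np.diag([m, M])              # Hessian eigenvalues m and M
f = lambda x: 0.5 * x @ Q @ x    # strongly convex quadratic, f* = 0

x = np.array([1.0, 1.0])
eta = 1 / M                      # standard step size for the (1 - m/M)^k rate
vals = []
for _ in range(200):
    x = x - eta * (Q @ x)        # gradient step: grad f(x) = Q x
    vals.append(f(x))

# Observed per-step ratio ~ (1 - m/M)^2 ~ 0.98, below the bound factor 0.99
print(f"observed per-step ratio:  {vals[-1] / vals[-2]:.4f}")
print(f"theoretical bound factor: {1 - m / M:.4f}")
```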
We've established one of the most important results in optimization—the equivalence of local and global minima for convex functions. Let's consolidate:
- For convex $f$, every local minimum is global, so local search is enough.
- Strict convexity adds uniqueness of the minimizer; strong convexity adds existence, uniqueness, and convergence-rate guarantees.
- Without convexity, stationary points may be minima, maxima, or saddle points, and gradient methods offer no global guarantees.
- The condition number $\kappa = M/m$ controls how fast gradient descent converges on strongly convex problems.
What's Next:
Knowing that local minima are global is powerful, but how do we find them? The next pages will develop the first-order and second-order optimality conditions that characterize minima for convex functions, providing actionable stopping criteria for optimization algorithms.
You now understand why convexity transforms optimization from an intractable search into a reliable computation. The guarantee that any local minimum is global means gradient-based methods can be trusted to find the optimum—no restarts, no luck, just mathematics.