When we run gradient descent and it converges, how do we know we've found an optimum? When does a zero gradient guarantee we're at the best point? And when constraints are present, what does optimality look like at the boundary of the feasible region?
First-order optimality conditions answer these questions using only gradient information—no second derivatives required. For convex problems, these conditions are both necessary and sufficient, providing a complete characterization of global optimality.
This page develops the mathematical framework for first-order conditions, from the simple unconstrained case to the more nuanced constrained optimization setting. These tools form the theoretical foundation for stopping criteria in optimization algorithms and for understanding when machine learning training has truly converged.
By the end of this page, you will master: (1) the stationary point condition for unconstrained optimization, (2) the geometric interpretation of gradient conditions, (3) first-order conditions for constrained convex optimization, (4) subdifferentials for non-differentiable convex functions, and (5) connections to ML training convergence criteria.
For unconstrained optimization of a differentiable function, the first-order condition is elegantly simple.
Theorem (First-Order Necessary Condition):
If $x^*$ is a local minimum of differentiable $f: \mathbb{R}^n \rightarrow \mathbb{R}$, then:
$$\nabla f(x^*) = 0$$
Proof: Suppose $\nabla f(x^*) \neq 0$. Consider the direction $d = -\nabla f(x^*)$. By definition of the gradient:
$$\lim_{t \to 0^+} \frac{f(x^* + td) - f(x^*)}{t} = \nabla f(x^*)^T d = -|\nabla f(x^*)|^2 < 0$$
For small enough $t > 0$, we have $f(x^* + td) < f(x^*)$, contradicting that $x^*$ is a local minimum. $\square$
Definition: A point where $\nabla f(x^*) = 0$ is called a stationary point or critical point.
For general (non-convex) functions, $\nabla f(x^*) = 0$ is necessary but NOT sufficient for a local minimum. The point could be a local maximum (like the top of a hill) or a saddle point (like a mountain pass). Convexity is what makes the condition sufficient.
Theorem (First-Order Sufficient Condition for Convex Functions):
If $f$ is convex and differentiable, and $\nabla f(x^*) = 0$, then $x^*$ is a global minimum.
Proof: By the first-order characterization of convexity, for any $y$:
$$f(y) \geq f(x^*) + \nabla f(x^*)^T (y - x^*) = f(x^*) + 0 = f(x^*)$$
So $x^*$ achieves the minimum value. $\square$
Combining Necessary and Sufficient:
For convex functions, we have a complete characterization:
$$x^* \text{ is a global minimum of convex } f \iff \nabla f(x^*) = 0$$
This is remarkable: a purely local condition (the gradient) characterizes a global property (minimality). No need to search the entire domain—just find where the gradient vanishes.
| Function Type | $\nabla f(x^*) = 0$ | Implication |
|---|---|---|
| General smooth | Necessary for local min | Could be min, max, or saddle |
| Convex | Necessary and sufficient for global min | Guaranteed global minimum |
| Strictly convex | Necessary and sufficient | Unique global minimum |
| Concave | Necessary and sufficient for global max | Guaranteed global maximum |
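To make the table concrete, here is a minimal numerical sketch (the two example functions are illustrative choices): both have a zero gradient at the origin, but only the convex one is minimized there.

```python
import numpy as np

# f_convex(x, y) = x^2 + y^2 is convex; f_saddle(x, y) = x^2 - y^2 has a saddle at the origin.
def f_convex(x):
    return x[0]**2 + x[1]**2

def f_saddle(x):
    return x[0]**2 - x[1]**2

def grad_convex(x):
    return np.array([2 * x[0], 2 * x[1]])

def grad_saddle(x):
    return np.array([2 * x[0], -2 * x[1]])

origin = np.zeros(2)
print(np.linalg.norm(grad_convex(origin)), np.linalg.norm(grad_saddle(origin)))  # both 0: stationary

# Probe random unit directions around the origin.
rng = np.random.default_rng(0)
dirs = rng.normal(size=(1000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
t = 1e-2
print(all(f_convex(origin + t * d) >= f_convex(origin) for d in dirs))  # True: global minimum
print(any(f_saddle(origin + t * d) < f_saddle(origin) for d in dirs))   # True: saddle, not a minimum
```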
The condition $\nabla f(x^*) = 0$ has a beautiful geometric meaning that illuminates why it characterizes optimality.
The Gradient as Steepest Ascent:
The gradient $\nabla f(x)$ points in the direction of steepest increase of $f$ at $x$. Its magnitude $|\nabla f(x)|$ gives the rate of increase in that direction.
At an optimum, there should be no direction of descent. This means the steepest ascent direction has zero magnitude—the gradient is the zero vector.
Tangent Hyperplane Interpretation:
At any point $x$, the first-order Taylor approximation of $f$ is:
$$f(y) \approx f(x) + \nabla f(x)^T (y - x)$$
This defines a tangent hyperplane to the graph of $f$. When $\nabla f(x^*) = 0$, this hyperplane becomes horizontal:
$$f(y) \approx f(x^*)$$
The function is "flat" to first order at $x^*$—consistent with being at a minimum (or maximum or saddle).
The gradient is always perpendicular to level sets (contours of constant function value). At a minimum, the level sets shrink to a point, and there's no well-defined "perpendicular direction"—the gradient must be zero.
Directional Derivatives:
The directional derivative of $f$ at $x$ in direction $d$ (with $|d| = 1$) is:
$$D_d f(x) = \nabla f(x)^T d$$
At a minimum $x^*$, the directional derivative must be non-negative in every direction (you can't go down). For interior points, this requires:
$$\nabla f(x^*)^T d \geq 0 \quad \forall d$$
The only vector $g$ satisfying $g^T d \geq 0$ for all $d$ is $g = 0$. Hence $\nabla f(x^*) = 0$.
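This last step is easy to check numerically; a short sketch (with an arbitrary nonzero vector $g$ chosen for illustration):

```python
import numpy as np

# For any nonzero g, the unit direction d = -g/||g|| gives g^T d = -||g|| < 0,
# so requiring g^T d >= 0 for every direction d forces g = 0.
g = np.array([0.3, -1.2, 2.0])   # arbitrary nonzero vector (illustrative)
d = -g / np.linalg.norm(g)
print(g @ d)                                   # equals -||g||, strictly negative
print(np.isclose(g @ d, -np.linalg.norm(g)))   # True
```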
Visualization in 2D:
For a 2D convex function like $f(x, y) = x^2 + y^2$, the level sets are concentric circles centered at the origin. The gradient $\nabla f(x, y) = (2x, 2y)$ is perpendicular to each circle, points away from the center, and shrinks to the zero vector at the unique minimum $(0, 0)$. The sketch below draws this picture.
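A small plotting sketch of this picture, assuming matplotlib is available (grid sizes and styling are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

# Contours and gradient field of f(x, y) = x^2 + y^2.
xs = np.linspace(-2, 2, 200)
X, Y = np.meshgrid(xs, xs)
F = X**2 + Y**2

fig, ax = plt.subplots(figsize=(5, 5))
ax.contour(X, Y, F, levels=10, cmap='viridis')

# Gradient field on a coarser grid: grad f = (2x, 2y) is perpendicular to the contours.
xq = np.linspace(-2, 2, 12)
XQ, YQ = np.meshgrid(xq, xq)
ax.quiver(XQ, YQ, 2 * XQ, 2 * YQ, angles='xy')

ax.plot(0, 0, 'r*', markersize=12)   # unique stationary point = global minimum
ax.set_title(r'$f(x, y) = x^2 + y^2$: the gradient vanishes at the minimum')
plt.show()
```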
When optimization is constrained to a set $C$, the condition $\nabla f(x^*) = 0$ is no longer appropriate. The optimum might occur at a boundary of $C$ where the gradient is nonzero but points "outside" the feasible region.
Problem Setup:
$$\min_{x \in C} f(x)$$
where $f$ is convex and $C \subseteq \mathbb{R}^n$ is a convex set.
Theorem (First-Order Condition for Constrained Optimization):
Let $f$ be convex and differentiable, and let $C$ be a convex set. Then $x^* \in C$ is a global minimum of $f$ over $C$ if and only if:
$$\nabla f(x^*)^T (y - x^*) \geq 0 \quad \forall y \in C$$
The condition says that the gradient must make a non-obtuse (acute or right) angle with every feasible direction $y - x^*$. Equivalently, the negative gradient (the descent direction) makes an obtuse or right angle with every feasible direction—there's no feasible way to decrease $f$ to first order.
Proof of Sufficiency:
Assume the condition holds. By first-order convexity characterization:
$$f(y) \geq f(x^*) + \nabla f(x^*)^T (y - x^*)$$
For any $y \in C$, the condition gives $\nabla f(x^*)^T (y - x^*) \geq 0$, so:
$$f(y) \geq f(x^*)$$
Thus $x^*$ is optimal. $\square$
Proof of Necessity:
Assume $x^*$ is optimal but the condition fails for some $y \in C$:
$$\nabla f(x^*)^T (y - x^*) < 0$$
Consider $z_t = x^* + t(y - x^*)$ for small $t > 0$. Since $C$ is convex and $x^*, y \in C$, we have $z_t \in C$.
$$f(z_t) = f(x^*) + t \nabla f(x^*)^T (y - x^*) + o(t)$$
For small $t$, the linear term dominates and is negative, so $f(z_t) < f(x^*)$, contradicting optimality. $\square$
Special Cases:
Interior optimum ($x^* \in \text{int}(C)$):
If $x^*$ is in the interior of $C$, then for any direction $d$, both $x^* + \epsilon d$ and $x^* - \epsilon d$ are in $C$ for small $\epsilon$. The condition becomes:
$$\nabla f(x^*)^T d \geq 0 \text{ and } \nabla f(x^*)^T (-d) \geq 0$$
This forces $\nabla f(x^*) = 0$—recovering the unconstrained condition.
Boundary optimum:
At the boundary, the gradient can be nonzero as long as it points "outward." This is formalized by the normal cone:
$$-\nabla f(x^*) \in N_C(x^*)$$
where $N_C(x^*)$ is the normal cone to $C$ at $x^*$.
Let's apply first-order conditions to concrete problems.
Example 1: Unconstrained Quadratic
Minimize $f(w) = \frac{1}{2}|Xw - y|^2$ (linear regression).
Setting $\nabla f(w) = X^T(Xw - y) = 0$:
$$X^T X w = X^T y$$
These are the normal equations. If $X^T X$ is invertible:
$$w^* = (X^T X)^{-1} X^T y$$
The first-order condition directly gives the closed-form solution.
Example 2: L2-Regularized Regression
Minimize $f(w) = \frac{1}{2}|Xw - y|^2 + \frac{\lambda}{2}|w|^2$.
Setting $\nabla f(w) = X^T(Xw - y) + \lambda w = 0$:
$$(X^T X + \lambda I) w = X^T y$$
$$w^* = (X^T X + \lambda I)^{-1} X^T y$$
The regularization term makes $X^T X + \lambda I$ always invertible.
```python
import numpy as np

def linear_regression_closed_form(X, y):
    """
    Solve linear regression using the first-order condition.
    min_w ||Xw - y||^2
    Solution: (X^T X)^(-1) X^T y
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge_regression_closed_form(X, y, lambda_reg):
    """
    Solve ridge regression using the first-order condition.
    min_w ||Xw - y||^2 + lambda * ||w||^2
    Solution: (X^T X + lambda I)^(-1) X^T y
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lambda_reg * np.eye(n_features), X.T @ y)

def verify_first_order_condition(grad_f, x_star, tol=1e-8):
    """Verify that the gradient is (numerically) zero at the claimed optimum."""
    grad = grad_f(x_star)
    norm = np.linalg.norm(grad)
    return {
        'gradient': grad,
        'gradient_norm': norm,
        'is_stationary': norm < tol
    }

# Example verification
np.random.seed(42)
n, d = 100, 5
X = np.random.randn(n, d)
y = np.random.randn(n)
lambda_reg = 0.1

# Find solution
w_star = ridge_regression_closed_form(X, y, lambda_reg)

# Define gradient function
def grad_ridge(w):
    return X.T @ (X @ w - y) + lambda_reg * w

# Verify
result = verify_first_order_condition(grad_ridge, w_star)
print(f"Gradient norm at solution: {result['gradient_norm']:.2e}")
print(f"Is stationary point: {result['is_stationary']}")
```

Example 3: Constrained to a Box
Minimize $f(x) = (x_1 - 3)^2 + (x_2 - 3)^2$ subject to $0 \leq x_1, x_2 \leq 2$.
The unconstrained minimum is at $(3, 3)$, which is outside the box.
Gradient: $\nabla f(x) = (2(x_1 - 3), 2(x_2 - 3))$
At the unconstrained minimum, $\nabla f = (0, 0)$.
But $(3, 3) \notin [0, 2]^2$, so we look at the boundary. The optimum is at the corner $(2, 2)$.
At $(2, 2)$: $\nabla f = (-2, -2)$.
The feasible directions from $(2, 2)$ are those pointing into the box: $y - x^* = (y_1 - 2, y_2 - 2)$ with $y_1, y_2 \in [0, 2]$.
Check: $\nabla f(2,2)^T (y - (2,2)) = -2(y_1 - 2) - 2(y_2 - 2) = -2y_1 - 2y_2 + 8$
For $y_1, y_2 \in [0, 2]$: minimum value is $-2(2) - 2(2) + 8 = 0 \geq 0$. ✓
The constrained first-order condition is satisfied.
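A quick numerical confirmation of this check, sampling feasible points over the box (the grid resolution is an arbitrary choice):

```python
import numpy as np

# Verify grad f(x*)^T (y - x*) >= 0 for all feasible y in [0, 2]^2 (Example 3).
def grad_f(x):
    return np.array([2 * (x[0] - 3), 2 * (x[1] - 3)])

x_star = np.array([2.0, 2.0])
g = grad_f(x_star)                          # (-2, -2)

grid = np.linspace(0, 2, 21)
Y1, Y2 = np.meshgrid(grid, grid)
inner = g[0] * (Y1 - x_star[0]) + g[1] * (Y2 - x_star[1])   # -2(y1 - 2) - 2(y2 - 2)

print(inner.min())                  # 0.0, attained at y = (2, 2) itself
print(np.all(inner >= -1e-12))      # True: the constrained condition holds
```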
Many important convex functions in ML are not differentiable everywhere—notably the L1 norm, hinge loss, and ReLU. First-order conditions still apply through the concept of subdifferentials.
Motivation:
Consider $f(x) = |x|$. At $x = 0$, the function has a "kink"—no unique tangent line. But there are many lines that stay below the graph. The set of all valid slopes at $x = 0$ is $[-1, 1]$.
Definition: Subgradient
A vector $g$ is a subgradient of convex $f$ at $x$ if:
$$f(y) \geq f(x) + g^T(y - x) \quad \forall y$$
This generalizes the gradient condition $f(y) \geq f(x) + \nabla f(x)^T(y - x)$ to non-differentiable functions.
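The motivating example above can be checked directly. A short sketch, testing candidate slopes $g$ for $f(x) = |x|$ at $x = 0$ (the grid of test points is an arbitrary choice):

```python
import numpy as np

# g is a subgradient of f(x) = |x| at 0 iff |y| >= g*y for all y, i.e. g in [-1, 1].
ys = np.linspace(-3, 3, 601)

for g in (-1.0, -0.3, 0.0, 0.7, 1.0, 1.5):
    is_subgradient = bool(np.all(np.abs(ys) >= g * ys - 1e-12))
    print(f"g = {g:+.1f}: subgradient of |x| at 0? {is_subgradient}")
# True for g in [-1, 1]; False for g = 1.5.
```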
Definition: Subdifferential
The subdifferential of $f$ at $x$, denoted $\partial f(x)$, is the set of all subgradients at $x$:
$$\partial f(x) = \{g : f(y) \geq f(x) + g^T(y - x) \text{ for all } y\}$$
For convex $f$: (1) $\partial f(x)$ is always non-empty, closed, and convex. (2) If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$—a singleton. (3) At non-differentiable points, $\partial f(x)$ contains multiple vectors.
First-Order Condition with Subdifferentials:
For convex (possibly non-differentiable) $f$, $x^*$ is a global minimum if and only if:
$$0 \in \partial f(x^*)$$
This generalizes the condition $\nabla f(x^*) = 0$ perfectly.
Common Subdifferentials:
L1 norm at origin: $\partial |\cdot|_1(0) = [-1, 1]^n$ (the unit $L_\infty$ ball)
L1 norm elsewhere: If $x_i \neq 0$, the $i$-th component of any subgradient is $\text{sign}(x_i)$.
Max function: For $f(x) = \max_i x_i$, the subdifferential at $x$ is the convex hull of basis vectors $e_i$ for indices $i$ achieving the maximum.
ReLU: $\partial\,\text{ReLU}(x) = \begin{cases} \{0\} & x < 0 \\ [0, 1] & x = 0 \\ \{1\} & x > 0 \end{cases}$
```python
import numpy as np
from sklearn.linear_model import Lasso

def l1_subdifferential(x):
    """
    Compute the subdifferential of the L1 norm at x.
    Returns (g_min, g_max) bounds for each component.
    """
    n = len(x)
    g_min = np.zeros(n)
    g_max = np.zeros(n)
    for i in range(n):
        if x[i] > 0:
            g_min[i] = g_max[i] = 1.0
        elif x[i] < 0:
            g_min[i] = g_max[i] = -1.0
        else:  # x[i] == 0
            g_min[i] = -1.0
            g_max[i] = 1.0
    return g_min, g_max

def is_zero_in_subdifferential(subgrad_bounds_fn, x):
    """
    Check if 0 is in the subdifferential at x.
    subgrad_bounds_fn returns (g_min, g_max) componentwise.
    """
    g_min, g_max = subgrad_bounds_fn(x)
    # 0 is in the subdifferential iff g_min[i] <= 0 <= g_max[i] for each component i
    return np.all((g_min <= 0) & (0 <= g_max))

def lasso_optimality_check(X, y, w, lambda_reg, tol=1e-6):
    """
    Check optimality conditions for the Lasso:
        min_w (1/2)||Xw - y||^2 + lambda * ||w||_1
    Optimality: 0 in X^T(Xw - y) + lambda * subdiff(||w||_1)
    """
    residual_grad = X.T @ (X @ w - y)
    violations = []
    for i in range(len(w)):
        if abs(w[i]) > tol:
            # At a non-zero coordinate the subdifferential is a singleton
            full_grad = residual_grad[i] + lambda_reg * np.sign(w[i])
            if abs(full_grad) > tol:
                violations.append((i, 'non-zero with nonzero gradient', full_grad))
        else:
            # At zero, the subdifferential term is the interval [-lambda, lambda]
            if abs(residual_grad[i]) > lambda_reg + tol:
                violations.append((i, 'zero but gradient outside [-λ,λ]', residual_grad[i]))
    return {
        'is_optimal': len(violations) == 0,
        'violations': violations
    }

# Example: check Lasso optimality
np.random.seed(42)
n, d = 50, 10
X = np.random.randn(n, d)
true_w = np.array([1, -2, 0, 0, 3, 0, 0, 0, -1, 0])
y = X @ true_w + 0.1 * np.random.randn(n)
lambda_reg = 0.5

# Solve with sklearn for comparison. sklearn's Lasso minimizes
# (1/(2n))||Xw - y||^2 + alpha*||w||_1, so alpha = lambda_reg / n matches our objective.
# A tight tolerance makes the optimality check below meaningful.
lasso = Lasso(alpha=lambda_reg / n, fit_intercept=False, tol=1e-10, max_iter=100_000)
lasso.fit(X, y)
w_star = lasso.coef_

result = lasso_optimality_check(X, y, w_star, lambda_reg)
print(f"Lasso solution is optimal: {result['is_optimal']}")
if result['violations']:
    print(f"Violations: {result['violations']}")
```

First-order conditions directly inform how we decide when to stop training machine learning models.
Gradient Norm as Stopping Criterion:
For smooth convex objectives, the natural stopping criterion is:
$$|\nabla f(w)| < \epsilon$$
When this holds, we're at an approximate stationary point, which for convex $f$ means an approximate global minimum.
Why This Works:
For $m$-strongly convex $f$, the gradient norm bounds the distance to the optimum:
$$|w - w^*| \leq \frac{1}{m} |\nabla f(w)|$$
So $|\nabla f(w)| < \epsilon$ implies $|w - w^*| < \epsilon/m$.
For the function value:
$$f(w) - f(w^*) \leq \frac{1}{2m} |\nabla f(w)|^2$$
So $|\nabla f(w)| < \epsilon$ implies $f(w) - f(w^*) < \epsilon^2/(2m)$.
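Both bounds are easy to verify on a strongly convex quadratic. A sketch (the matrix and query point are arbitrary choices; here $m$ is the smallest eigenvalue of the Hessian):

```python
import numpy as np

# f(w) = 0.5 * w^T A w with A positive definite is m-strongly convex, m = lambda_min(A),
# and is minimized at w* = 0 with f(w*) = 0.
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)            # positive definite
m = np.linalg.eigvalsh(A).min()    # strong convexity constant

def f(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

w = rng.normal(size=5)             # arbitrary query point
g_norm = np.linalg.norm(grad(w))

print(np.linalg.norm(w) <= g_norm / m + 1e-12)    # ||w - w*|| <= ||grad f(w)|| / m
print(f(w) <= g_norm**2 / (2 * m) + 1e-12)        # f(w) - f(w*) <= ||grad f(w)||^2 / (2m)
```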
| Criterion | When to Use | Caveats |
|---|---|---|
| $|\nabla f(w)| < \epsilon$ | Smooth convex objectives | Not applicable to L1-regularized |
| $|f(w_k) - f(w_{k-1})| < \epsilon$ | General, practical | Can be slow near optimum |
| $|w_k - w_{k-1}| < \epsilon$ | When parameters matter | Doesn't ensure function convergence |
| Validation loss plateaus | Real ML training | May indicate overfitting, not convergence |
| $0 \in \partial f(w) + \epsilon B$ | Non-smooth objectives | Requires subdifferential computation |
In practice, we often use multiple criteria together: gradient norm below threshold AND function value change below threshold AND maximum iterations not exceeded. This handles various pathological cases and ensures the algorithm terminates.
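A sketch of such a combined rule inside a plain gradient descent loop (the test problem, step size, and tolerances are illustrative choices):

```python
import numpy as np

def gradient_descent(f, grad_f, w0, step=0.1, grad_tol=1e-6, f_tol=1e-10, max_iter=10_000):
    """Gradient descent that stops on gradient norm, function-value change, or iteration cap."""
    w = w0.copy()
    f_prev = f(w)
    for k in range(max_iter):
        g = grad_f(w)
        if np.linalg.norm(g) < grad_tol:        # approximate first-order condition
            return w, k, 'gradient norm below tolerance'
        w = w - step * g
        f_curr = f(w)
        if abs(f_prev - f_curr) < f_tol:        # function value has stopped changing
            return w, k + 1, 'function value change below tolerance'
        f_prev = f_curr
    return w, max_iter, 'maximum iterations reached'

# Example on the convex quadratic f(w) = 0.5 ||w - c||^2.
c = np.array([1.0, -2.0, 3.0])
w_out, iters, reason = gradient_descent(
    f=lambda w: 0.5 * np.sum((w - c)**2),
    grad_f=lambda w: w - c,
    w0=np.zeros(3))
print(iters, reason, np.linalg.norm(w_out - c))
```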
Stochastic Gradient Descent:
In SGD, we don't compute the true gradient but an unbiased estimate. The first-order condition $\nabla f(w) = 0$ still characterizes the optimum, but individual minibatch gradients are noisy and do not vanish there, so a single stochastic gradient norm is an unreliable stopping signal. In practice, one monitors averaged or full-batch gradient norms, the training loss, or validation metrics, and uses decreasing step sizes (or iterate averaging) to suppress gradient noise near the optimum.
Proximal Methods for Non-Smooth Objectives:
For $f(w) = g(w) + h(w)$ where $g$ is smooth and $h$ is non-smooth (like L1 regularization), the optimality condition becomes:
$$0 \in \nabla g(w^*) + \partial h(w^*)$$
Proximal gradient methods are designed to find such points efficiently, with convergence guaranteed to the global optimum for convex $g$ and $h$.
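A minimal proximal gradient (ISTA) sketch for the L1-regularized least-squares case, followed by a direct check of this inclusion (the data, penalty, and iteration count are illustrative choices):

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(X, y, lam, n_iter=5000):
    """Proximal gradient for g(w) = 0.5||Xw - y||^2, h(w) = lam * ||w||_1."""
    step = 1.0 / np.linalg.norm(X, 2)**2     # 1 / Lipschitz constant of grad g
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = soft_threshold(w - step * X.T @ (X @ w - y), step * lam)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ np.array([1, -2, 0, 0, 3, 0, 0, 0, -1, 0]) + 0.1 * rng.normal(size=50)
lam = 0.5
w = ista(X, y, lam)

# Check 0 in grad g(w) + lam * subdiff(||w||_1):
#   grad g(w)_i = -lam * sign(w_i)  where w_i != 0
#   |grad g(w)_i| <= lam            where w_i == 0
grad_g = X.T @ (X @ w - y)
nonzero = np.abs(w) > 1e-8
print(np.allclose(grad_g[nonzero], -lam * np.sign(w[nonzero]), atol=1e-4))
print(np.all(np.abs(grad_g[~nonzero]) <= lam + 1e-4))
```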
An important application of first-order conditions is characterizing the projection onto a convex set—the nearest point in the set.
Definition: Projection
The projection of $z$ onto convex set $C$ is:
$$\text{proj}_C(z) = \arg\min_{x \in C} |x - z|^2$$
First-Order Characterization:
$x^* = \text{proj}_C(z)$ if and only if $x^* \in C$ and:
$$(z - x^*)^T (y - x^*) \leq 0 \quad \forall y \in C$$
Proof: Apply the constrained first-order condition to $f(x) = |x - z|^2$ with $\nabla f(x) = 2(x - z)$:
$$2(x^* - z)^T(y - x^*) \geq 0 \iff (z - x^*)^T(y - x^*) \leq 0$$
The condition says the vector from $x^*$ to $z$ makes an obtuse (or right) angle with every feasible direction. Equivalently, $z - x^*$ is in the normal cone to $C$ at $x^*$. The projection "finds the closest point where the set blocks further movement toward $z$."
Properties of Projection:
Uniqueness: For convex $C$, the projection exists and is unique (follows from strict convexity of $|\cdot|^2$).
Non-expansiveness: $|\text{proj}_C(z_1) - \text{proj}_C(z_2)| \leq |z_1 - z_2|$. Projection doesn't increase distances.
Firm non-expansiveness: $(\text{proj}_C(z_1) - \text{proj}_C(z_2))^T(z_1 - z_2) \geq |\text{proj}_C(z_1) - \text{proj}_C(z_2)|^2$.
These properties are crucial for analyzing projected gradient descent and related algorithms.
Common Projections:
- Box $\{x : l \leq x \leq u\}$: componentwise clipping, $\text{proj}(z)_i = \min(\max(z_i, l_i), u_i)$.
- Euclidean ball of radius $r$ centered at the origin: $\text{proj}(z) = z$ if $|z| \leq r$, otherwise $r\, z / |z|$.
- Nonnegative orthant: $\text{proj}(z)_i = \max(z_i, 0)$.
- Hyperplane $\{x : a^T x = b\}$: $\text{proj}(z) = z - \frac{a^T z - b}{|a|^2} a$.

A short sketch implementing two of these and checking the first-order characterization follows.
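Here is that sketch (a minimal illustration; the query point and sample counts are arbitrary choices):

```python
import numpy as np

def project_box(z, lo, hi):
    """Projection onto the box [lo, hi]^n: componentwise clipping."""
    return np.clip(z, lo, hi)

def project_ball(z, radius):
    """Projection onto the Euclidean ball of the given radius centered at 0."""
    norm = np.linalg.norm(z)
    return z if norm <= radius else (radius / norm) * z

rng = np.random.default_rng(0)
z = np.array([3.0, -1.5, 0.5])

# Box [0, 2]^3: check (z - x*)^T (y - x*) <= 0 for sampled feasible y.
x_star = project_box(z, 0.0, 2.0)
ys = rng.uniform(0.0, 2.0, size=(1000, 3))
print(np.all((ys - x_star) @ (z - x_star) <= 1e-12))    # True

# Unit ball: same check with samples pushed into the ball.
x_star = project_ball(z, 1.0)
ys = rng.normal(size=(1000, 3))
ys /= np.maximum(np.linalg.norm(ys, axis=1, keepdims=True), 1.0)
print(np.all((ys - x_star) @ (z - x_star) <= 1e-12))    # True
```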
First-order conditions provide a complete characterization of optimality for convex problems using only gradient information. Let's consolidate:
- Unconstrained, differentiable convex $f$: $x^*$ is a global minimum if and only if $\nabla f(x^*) = 0$.
- Constrained over a convex set $C$: $x^* \in C$ is optimal if and only if $\nabla f(x^*)^T (y - x^*) \geq 0$ for all $y \in C$; at interior points this reduces to $\nabla f(x^*) = 0$.
- Non-differentiable convex $f$: $x^*$ is optimal if and only if $0 \in \partial f(x^*)$.
- Projection onto a convex set: $x^* = \text{proj}_C(z)$ if and only if $(z - x^*)^T(y - x^*) \leq 0$ for all $y \in C$.
- These conditions justify practical stopping criteria such as $|\nabla f(w)| < \epsilon$ and their subdifferential analogues.
What's Next:
First-order conditions tell us whether a point is optimal but don't characterize the local curvature. The next page develops second-order conditions—using Hessian information to understand the shape of the function near an optimum and to distinguish between strong and weak convexity.
You now have a complete toolkit for verifying optimality using gradient information. For convex problems, a zero gradient (or appropriate subdifferential condition) guarantees global optimality—the foundation of all gradient-based optimization algorithms.