When we run gradient descent and it converges, how do we know we've found an optimum? When does a zero gradient guarantee we're at the best point? And when constraints are present, what does optimality look like at the boundary of the feasible region?
First-order optimality conditions answer these questions using only gradient information—no second derivatives required. For convex problems, these conditions are both necessary and sufficient, providing a complete characterization of global optimality.
This page develops the mathematical framework for first-order conditions, from the simple unconstrained case to the more nuanced constrained optimization setting. These tools form the theoretical foundation for stopping criteria in optimization algorithms and for understanding when machine learning training has truly converged.
By the end of this page, you will master: (1) the stationary point condition for unconstrained optimization, (2) the geometric interpretation of gradient conditions, (3) first-order conditions for constrained convex optimization, (4) subdifferentials for non-differentiable convex functions, and (5) connections to ML training convergence criteria.
For unconstrained optimization of a differentiable function, the first-order condition is elegantly simple.
Theorem (First-Order Necessary Condition):
If $x^*$ is a local minimum of differentiable $f: \mathbb{R}^n \rightarrow \mathbb{R}$, then:
$$\nabla f(x^*) = 0$$
Proof: Suppose $\nabla f(x^*) \neq 0$. Consider the direction $d = -\nabla f(x^*)$. By definition of the gradient:
$$\lim_{t \to 0^+} \frac{f(x^* + td) - f(x^*)}{t} = \nabla f(x^*)^T d = -|\nabla f(x^*)|^2 < 0$$
For small enough $t > 0$, we have $f(x^* + td) < f(x^*)$, contradicting that $x^*$ is a local minimum. $\square$
Definition: A point where $\nabla f(x^*) = 0$ is called a stationary point or critical point.
For general (non-convex) functions, $\nabla f(x^*) = 0$ is necessary but NOT sufficient for a local minimum. The point could be a local maximum (like the top of a hill) or a saddle point (like a mountain pass). Convexity is what makes the condition sufficient.
Theorem (First-Order Sufficient Condition for Convex Functions):
If $f$ is convex and differentiable, and $\nabla f(x^*) = 0$, then $x^*$ is a global minimum.
Proof: By the first-order characterization of convexity, for any $y$:
$$f(y) \geq f(x^*) + \nabla f(x^*)^T (y - x^*) = f(x^*) + 0 = f(x^*)$$
So $x^*$ achieves the minimum value. $\square$
Combining Necessary and Sufficient:
For convex functions, we have a complete characterization:
$$x^* \text{ is a global minimum of convex } f \iff \nabla f(x^*) = 0$$
This is remarkable: a purely local condition (the gradient) characterizes a global property (minimality). No need to search the entire domain—just find where the gradient vanishes.
| Function Type | $\nabla f(x^*) = 0$ | Implication |
|---|---|---|
| General smooth | Necessary for local min | Could be min, max, or saddle |
| Convex | Necessary and sufficient for global min | Guaranteed global minimum |
| Strictly convex | Necessary and sufficient | Unique global minimum |
| Concave | Necessary and sufficient for global max | Guaranteed global maximum |
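To make the table concrete, here is a minimal numerical sketch (the two example functions are illustrative choices): both have a zero gradient at the origin, but only the convex one is minimized there.

```python
import numpy as np

# f_convex(x, y) = x^2 + y^2 is convex; f_saddle(x, y) = x^2 - y^2 has a saddle at the origin.
def f_convex(x):
    return x[0]**2 + x[1]**2

def f_saddle(x):
    return x[0]**2 - x[1]**2

def grad_convex(x):
    return np.array([2 * x[0], 2 * x[1]])

def grad_saddle(x):
    return np.array([2 * x[0], -2 * x[1]])

origin = np.zeros(2)
print(np.linalg.norm(grad_convex(origin)), np.linalg.norm(grad_saddle(origin)))  # both 0: stationary

# Probe random unit directions around the origin.
rng = np.random.default_rng(0)
dirs = rng.normal(size=(1000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
t = 1e-2
print(all(f_convex(origin + t * d) >= f_convex(origin) for d in dirs))  # True: global minimum
print(any(f_saddle(origin + t * d) < f_saddle(origin) for d in dirs))   # True: saddle, not a minimum
```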
The condition $\nabla f(x^*) = 0$ has a beautiful geometric meaning that illuminates why it characterizes optimality.
The Gradient as Steepest Ascent:
The gradient $\nabla f(x)$ points in the direction of steepest increase of $f$ at $x$. Its magnitude $|\nabla f(x)|$ gives the rate of increase in that direction.
At an optimum, there should be no direction of descent. This means the steepest ascent direction has zero magnitude—the gradient is the zero vector.
Tangent Hyperplane Interpretation:
At any point $x$, the first-order Taylor approximation of $f$ is:
$$f(y) \approx f(x) + \nabla f(x)^T (y - x)$$
This defines a tangent hyperplane to the graph of $f$. When $\nabla f(x^*) = 0$, this hyperplane becomes horizontal:
$$f(y) \approx f(x^*)$$
The function is "flat" to first order at $x^*$—consistent with being at a minimum (or maximum or saddle).
The gradient is always perpendicular to level sets (contours of constant function value). At a minimum, the level sets shrink to a point, and there's no well-defined "perpendicular direction"—the gradient must be zero.
Directional Derivatives:
The directional derivative of $f$ at $x$ in direction $d$ (with $|d| = 1$) is:
$$D_d f(x) = \nabla f(x)^T d$$
At a minimum $x^*$, the directional derivative must be non-negative in every direction (you can't go down). For interior points, this requires:
$$\nabla f(x^*)^T d \geq 0 \quad \forall d$$
The only vector $g$ satisfying $g^T d \geq 0$ for all $d$ is $g = 0$. Hence $\nabla f(x^*) = 0$.
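This last step is easy to check numerically; a short sketch (with an arbitrary nonzero vector $g$ chosen for illustration):

```python
import numpy as np

# For any nonzero g, the unit direction d = -g/||g|| gives g^T d = -||g|| < 0,
# so requiring g^T d >= 0 for every direction d forces g = 0.
g = np.array([0.3, -1.2, 2.0])   # arbitrary nonzero vector (illustrative)
d = -g / np.linalg.norm(g)
print(g @ d)                                   # equals -||g||, strictly negative
print(np.isclose(g @ d, -np.linalg.norm(g)))   # True
```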
Visualization in 2D:
For a 2D convex function like $f(x, y) = x^2 + y^2$, the level sets are concentric circles centered at the origin. The gradient $\nabla f(x, y) = (2x, 2y)$ is perpendicular to each circle, points away from the center, and shrinks to the zero vector at the unique minimum $(0, 0)$. The sketch below draws this picture.
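A small plotting sketch of this picture, assuming matplotlib is available (grid sizes and styling are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

# Contours and gradient field of f(x, y) = x^2 + y^2.
xs = np.linspace(-2, 2, 200)
X, Y = np.meshgrid(xs, xs)
F = X**2 + Y**2

fig, ax = plt.subplots(figsize=(5, 5))
ax.contour(X, Y, F, levels=10, cmap='viridis')

# Gradient field on a coarser grid: grad f = (2x, 2y) is perpendicular to the contours.
xq = np.linspace(-2, 2, 12)
XQ, YQ = np.meshgrid(xq, xq)
ax.quiver(XQ, YQ, 2 * XQ, 2 * YQ, angles='xy')

ax.plot(0, 0, 'r*', markersize=12)   # unique stationary point = global minimum
ax.set_title(r'$f(x, y) = x^2 + y^2$: the gradient vanishes at the minimum')
plt.show()
```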
When optimization is constrained to a set $C$, the condition $\nabla f(x^*) = 0$ is no longer appropriate. The optimum might occur at a boundary of $C$ where the gradient is nonzero but points "outside" the feasible region.
Problem Setup:
$$\min_{x \in C} f(x)$$
where $f$ is convex and $C \subseteq \mathbb{R}^n$ is a convex set.
Theorem (First-Order Condition for Constrained Optimization):
Let $f$ be convex and differentiable, and let $C$ be a convex set. Then $x^* \in C$ is a global minimum of $f$ over $C$ if and only if:
$$\nabla f(x^*)^T (y - x^*) \geq 0 \quad \forall y \in C$$
The condition says that the gradient must make a non-obtuse (acute or right) angle with every feasible direction $y - x^*$. Equivalently, the negative gradient (the descent direction) makes an obtuse or right angle with every feasible direction—there's no feasible way to decrease $f$ to first order.
Proof of Sufficiency:
Assume the condition holds. By first-order convexity characterization:
$$f(y) \geq f(x^*) + \nabla f(x^*)^T (y - x^*)$$
For any $y \in C$, the condition gives $\nabla f(x^*)^T (y - x^*) \geq 0$, so:
$$f(y) \geq f(x^*)$$
Thus $x^*$ is optimal. $\square$
Proof of Necessity:
Assume $x^*$ is optimal but the condition fails for some $y \in C$:
$$\nabla f(x^*)^T (y - x^*) < 0$$
Consider $z_t = x^* + t(y - x^*)$ for small $t > 0$. Since $C$ is convex and $x^*, y \in C$, we have $z_t \in C$.
$$f(z_t) = f(x^*) + t \nabla f(x^*)^T (y - x^*) + o(t)$$
For small $t$, the linear term dominates and is negative, so $f(z_t) < f(x^*)$, contradicting optimality. $\square$
Special Cases:
Interior optimum ($x^* \in \text{int}(C)$):
If $x^*$ is in the interior of $C$, then for any direction $d$, both $x^* + \epsilon d$ and $x^* - \epsilon d$ are in $C$ for small $\epsilon$. The condition becomes:
$$\nabla f(x^*)^T d \geq 0 \text{ and } \nabla f(x^*)^T (-d) \geq 0$$
This forces $\nabla f(x^*) = 0$—recovering the unconstrained condition.
Boundary optimum:
At the boundary, the gradient can be nonzero as long as it points "outward." This is formalized by the normal cone:
$$-\nabla f(x^*) \in N_C(x^*)$$
where $N_C(x^*)$ is the normal cone to $C$ at $x^*$.
Let's apply first-order conditions to concrete problems.
Example 1: Unconstrained Quadratic
Minimize $f(w) = \frac{1}{2}|Xw - y|^2$ (linear regression).
Setting $\nabla f(w) = X^T(Xw - y) = 0$:
$$X^T X w = X^T y$$
These are the normal equations. If $X^T X$ is invertible:
$$w^* = (X^T X)^{-1} X^T y$$
The first-order condition directly gives the closed-form solution.
Example 2: L2-Regularized Regression
Minimize $f(w) = \frac{1}{2}|Xw - y|^2 + \frac{\lambda}{2}|w|^2$.
Setting $\nabla f(w) = X^T(Xw - y) + \lambda w = 0$:
$$(X^T X + \lambda I) w = X^T y$$
$$w^* = (X^T X + \lambda I)^{-1} X^T y$$
The regularization term makes $X^T X + \lambda I$ always invertible.
```python
import numpy as np

def linear_regression_closed_form(X, y):
    """
    Solve linear regression using the first-order condition.
    min_w ||Xw - y||^2
    Solution: (X^T X)^(-1) X^T y
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge_regression_closed_form(X, y, lambda_reg):
    """
    Solve ridge regression using the first-order condition.
    min_w ||Xw - y||^2 + lambda * ||w||^2
    Solution: (X^T X + lambda I)^(-1) X^T y
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lambda_reg * np.eye(n_features), X.T @ y)

def verify_first_order_condition(grad_f, x_star, tol=1e-8):
    """Verify that the gradient is (numerically) zero at the claimed optimum."""
    grad = grad_f(x_star)
    norm = np.linalg.norm(grad)
    return {
        'gradient': grad,
        'gradient_norm': norm,
        'is_stationary': norm < tol
    }

# Example verification
np.random.seed(42)
n, d = 100, 5
X = np.random.randn(n, d)
y = np.random.randn(n)
lambda_reg = 0.1

# Find solution
w_star = ridge_regression_closed_form(X, y, lambda_reg)

# Define gradient function
def grad_ridge(w):
    return X.T @ (X @ w - y) + lambda_reg * w

# Verify
result = verify_first_order_condition(grad_ridge, w_star)
print(f"Gradient norm at solution: {result['gradient_norm']:.2e}")
print(f"Is stationary point: {result['is_stationary']}")
```

Example 3: Constrained to a Box
Minimize $f(x) = (x_1 - 3)^2 + (x_2 - 3)^2$ subject to $0 \leq x_1, x_2 \leq 2$.
The unconstrained minimum is at $(3, 3)$, which is outside the box.
Gradient: $\nabla f(x) = (2(x_1 - 3), 2(x_2 - 3))$
At the unconstrained minimum, $\nabla f = (0, 0)$.
But $(3, 3) \notin [0, 2]^2$, so we look at the boundary. The optimum is at the corner $(2, 2)$.
At $(2, 2)$: $\nabla f = (-2, -2)$.
The feasible directions from $(2, 2)$ are those pointing into the box: $y - x^* = (y_1 - 2, y_2 - 2)$ with $y_1, y_2 \in [0, 2]$.
Check: $\nabla f(2,2)^T (y - (2,2)) = -2(y_1 - 2) - 2(y_2 - 2) = -2y_1 - 2y_2 + 8$
For $y_1, y_2 \in [0, 2]$: minimum value is $-2(2) - 2(2) + 8 = 0 \geq 0$. ✓
The constrained first-order condition is satisfied.
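A quick numerical confirmation of this check, sampling feasible points over the box (the grid resolution is an arbitrary choice):

```python
import numpy as np

# Verify grad f(x*)^T (y - x*) >= 0 for all feasible y in [0, 2]^2 (Example 3).
def grad_f(x):
    return np.array([2 * (x[0] - 3), 2 * (x[1] - 3)])

x_star = np.array([2.0, 2.0])
g = grad_f(x_star)                          # (-2, -2)

grid = np.linspace(0, 2, 21)
Y1, Y2 = np.meshgrid(grid, grid)
inner = g[0] * (Y1 - x_star[0]) + g[1] * (Y2 - x_star[1])   # -2(y1 - 2) - 2(y2 - 2)

print(inner.min())                  # 0.0, attained at y = (2, 2) itself
print(np.all(inner >= -1e-12))      # True: the constrained condition holds
```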
Many important convex functions in ML are not differentiable everywhere—notably the L1 norm, hinge loss, and ReLU. First-order conditions still apply through the concept of subdifferentials.
Motivation:
Consider $f(x) = |x|$. At $x = 0$, the function has a "kink"—no unique tangent line. But there are many lines that stay below the graph. The set of all valid slopes at $x = 0$ is $[-1, 1]$.
Definition: Subgradient
A vector $g$ is a subgradient of convex $f$ at $x$ if:
$$f(y) \geq f(x) + g^T(y - x) \quad \forall y$$
This generalizes the gradient condition $f(y) \geq f(x) + \nabla f(x)^T(y - x)$ to non-differentiable functions.
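The motivating example above can be checked directly. A short sketch, testing candidate slopes $g$ for $f(x) = |x|$ at $x = 0$ (the grid of test points is an arbitrary choice):

```python
import numpy as np

# g is a subgradient of f(x) = |x| at 0 iff |y| >= g*y for all y, i.e. g in [-1, 1].
ys = np.linspace(-3, 3, 601)

for g in (-1.0, -0.3, 0.0, 0.7, 1.0, 1.5):
    is_subgradient = bool(np.all(np.abs(ys) >= g * ys - 1e-12))
    print(f"g = {g:+.1f}: subgradient of |x| at 0? {is_subgradient}")
# True for g in [-1, 1]; False for g = 1.5.
```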
Definition: Subdifferential
The subdifferential of $f$ at $x$, denoted $\partial f(x)$, is the set of all subgradients at $x$:
$$\partial f(x) = \{g : f(y) \geq f(x) + g^T(y - x) \text{ for all } y\}$$
For convex $f$: (1) $\partial f(x)$ is always non-empty, closed, and convex. (2) If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$—a singleton. (3) At non-differentiable points, $\partial f(x)$ contains multiple vectors.
First-Order Condition with Subdifferentials:
For convex (possibly non-differentiable) $f$, $x^*$ is a global minimum if and only if:
$$0 \in \partial f(x^*)$$
This generalizes the condition $\nabla f(x^*) = 0$ perfectly.
Common Subdifferentials:
L1 norm at origin: $\partial |\cdot|_1(0) = [-1, 1]^n$ (the unit $L_\infty$ ball)
L1 norm elsewhere: If $x_i \neq 0$, the $i$-th component of any subgradient is $\text{sign}(x_i)$.
Max function: For $f(x) = \max_i x_i$, the subdifferential at $x$ is the convex hull of basis vectors $e_i$ for indices $i$ achieving the maximum.
ReLU: $\partial\,\text{ReLU}(x) = \begin{cases} \{0\} & x < 0 \\ [0, 1] & x = 0 \\ \{1\} & x > 0 \end{cases}$
```python
import numpy as np
from sklearn.linear_model import Lasso

def l1_subdifferential(x):
    """
    Compute the subdifferential of the L1 norm at x.
    Returns (g_min, g_max) bounds for each component.
    """
    n = len(x)
    g_min = np.zeros(n)
    g_max = np.zeros(n)
    for i in range(n):
        if x[i] > 0:
            g_min[i] = g_max[i] = 1.0
        elif x[i] < 0:
            g_min[i] = g_max[i] = -1.0
        else:  # x[i] == 0
            g_min[i] = -1.0
            g_max[i] = 1.0
    return g_min, g_max

def is_zero_in_subdifferential(subgrad_bounds_fn, x):
    """
    Check if 0 is in the subdifferential at x.
    subgrad_bounds_fn returns (g_min, g_max) componentwise.
    """
    g_min, g_max = subgrad_bounds_fn(x)
    # 0 is in the subdifferential iff g_min[i] <= 0 <= g_max[i] for each component i
    return np.all((g_min <= 0) & (0 <= g_max))

def lasso_optimality_check(X, y, w, lambda_reg, tol=1e-6):
    """
    Check optimality conditions for the Lasso:
        min_w (1/2)||Xw - y||^2 + lambda * ||w||_1
    Optimality: 0 in X^T(Xw - y) + lambda * subdiff(||w||_1)
    """
    residual_grad = X.T @ (X @ w - y)
    violations = []
    for i in range(len(w)):
        if abs(w[i]) > tol:
            # At a non-zero coordinate the subdifferential is a singleton
            full_grad = residual_grad[i] + lambda_reg * np.sign(w[i])
            if abs(full_grad) > tol:
                violations.append((i, 'non-zero with nonzero gradient', full_grad))
        else:
            # At zero, the subdifferential term is the interval [-lambda, lambda]
            if abs(residual_grad[i]) > lambda_reg + tol:
                violations.append((i, 'zero but gradient outside [-λ,λ]', residual_grad[i]))
    return {
        'is_optimal': len(violations) == 0,
        'violations': violations
    }

# Example: check Lasso optimality
np.random.seed(42)
n, d = 50, 10
X = np.random.randn(n, d)
true_w = np.array([1, -2, 0, 0, 3, 0, 0, 0, -1, 0])
y = X @ true_w + 0.1 * np.random.randn(n)
lambda_reg = 0.5

# Solve with sklearn for comparison. sklearn's Lasso minimizes
# (1/(2n))||Xw - y||^2 + alpha*||w||_1, so alpha = lambda_reg / n matches our objective.
# A tight tolerance makes the optimality check below meaningful.
lasso = Lasso(alpha=lambda_reg / n, fit_intercept=False, tol=1e-10, max_iter=100_000)
lasso.fit(X, y)
w_star = lasso.coef_

result = lasso_optimality_check(X, y, w_star, lambda_reg)
print(f"Lasso solution is optimal: {result['is_optimal']}")
if result['violations']:
    print(f"Violations: {result['violations']}")
```

First-order conditions directly inform how we decide when to stop training machine learning models.
Gradient Norm as Stopping Criterion:
For smooth convex objectives, the natural stopping criterion is:
$$|\nabla f(w)| < \epsilon$$
When this holds, we're at an approximate stationary point, which for convex $f$ means an approximate global minimum.
Why This Works:
For $m$-strongly convex $f$, the gradient norm bounds the distance to the optimum:
$$|w - w^*| \leq \frac{1}{m} |\nabla f(w)|$$
So $|\nabla f(w)| < \epsilon$ implies $|w - w^*| < \epsilon/m$.
For the function value:
$$f(w) - f(w^*) \leq \frac{1}{2m} |\nabla f(w)|^2$$
So $|\nabla f(w)| < \epsilon$ implies $f(w) - f(w^*) < \epsilon^2/(2m)$.
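Both bounds are easy to verify on a strongly convex quadratic. A sketch (the matrix and query point are arbitrary choices; here $m$ is the smallest eigenvalue of the Hessian):

```python
import numpy as np

# f(w) = 0.5 * w^T A w with A positive definite is m-strongly convex, m = lambda_min(A),
# and is minimized at w* = 0 with f(w*) = 0.
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)            # positive definite
m = np.linalg.eigvalsh(A).min()    # strong convexity constant

def f(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

w = rng.normal(size=5)             # arbitrary query point
g_norm = np.linalg.norm(grad(w))

print(np.linalg.norm(w) <= g_norm / m + 1e-12)    # ||w - w*|| <= ||grad f(w)|| / m
print(f(w) <= g_norm**2 / (2 * m) + 1e-12)        # f(w) - f(w*) <= ||grad f(w)||^2 / (2m)
```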
| Criterion | When to Use | Caveats |
|---|---|---|
| $|\nabla f(w)| < \epsilon$ | Smooth convex objectives | Not applicable to L1-regularized |
| $|f(w_k) - f(w_{k-1})| < \epsilon$ | General, practical | Can be slow near optimum |
| $|w_k - w_{k-1}| < \epsilon$ | When parameters matter | Doesn't ensure function convergence |
| Validation loss plateaus | Real ML training | May indicate overfitting, not convergence |
| $0 \in \partial f(w) + \epsilon B$ | Non-smooth objectives | Requires subdifferential computation |
In practice, we often use multiple criteria together: gradient norm below threshold AND function value change below threshold AND maximum iterations not exceeded. This handles various pathological cases and ensures the algorithm terminates.
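A sketch of such a combined rule inside a plain gradient descent loop (the test problem, step size, and tolerances are illustrative choices):

```python
import numpy as np

def gradient_descent(f, grad_f, w0, step=0.1, grad_tol=1e-6, f_tol=1e-10, max_iter=10_000):
    """Gradient descent that stops on gradient norm, function-value change, or iteration cap."""
    w = w0.copy()
    f_prev = f(w)
    for k in range(max_iter):
        g = grad_f(w)
        if np.linalg.norm(g) < grad_tol:        # approximate first-order condition
            return w, k, 'gradient norm below tolerance'
        w = w - step * g
        f_curr = f(w)
        if abs(f_prev - f_curr) < f_tol:        # function value has stopped changing
            return w, k + 1, 'function value change below tolerance'
        f_prev = f_curr
    return w, max_iter, 'maximum iterations reached'

# Example on the convex quadratic f(w) = 0.5 ||w - c||^2.
c = np.array([1.0, -2.0, 3.0])
w_out, iters, reason = gradient_descent(
    f=lambda w: 0.5 * np.sum((w - c)**2),
    grad_f=lambda w: w - c,
    w0=np.zeros(3))
print(iters, reason, np.linalg.norm(w_out - c))
```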
Stochastic Gradient Descent:
In SGD, we don't compute the true gradient but an unbiased estimate. The first-order condition $\nabla f(w) = 0$ still characterizes the optimum, but individual minibatch gradients are noisy and do not vanish there, so a single stochastic gradient norm is an unreliable stopping signal. In practice, one monitors averaged or full-batch gradient norms, the training loss, or validation metrics, and uses decreasing step sizes (or iterate averaging) to suppress gradient noise near the optimum.
Proximal Methods for Non-Smooth Objectives:
For $f(w) = g(w) + h(w)$ where $g$ is smooth and $h$ is non-smooth (like L1 regularization), the optimality condition becomes:
$$0 \in \nabla g(w^*) + \partial h(w^*)$$
Proximal gradient methods are designed to find such points efficiently, with convergence guaranteed to the global optimum for convex $g$ and $h$.
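A minimal proximal gradient (ISTA) sketch for the L1-regularized least-squares case, followed by a direct check of this inclusion (the data, penalty, and iteration count are illustrative choices):

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(X, y, lam, n_iter=5000):
    """Proximal gradient for g(w) = 0.5||Xw - y||^2, h(w) = lam * ||w||_1."""
    step = 1.0 / np.linalg.norm(X, 2)**2     # 1 / Lipschitz constant of grad g
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = soft_threshold(w - step * X.T @ (X @ w - y), step * lam)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ np.array([1, -2, 0, 0, 3, 0, 0, 0, -1, 0]) + 0.1 * rng.normal(size=50)
lam = 0.5
w = ista(X, y, lam)

# Check 0 in grad g(w) + lam * subdiff(||w||_1):
#   grad g(w)_i = -lam * sign(w_i)  where w_i != 0
#   |grad g(w)_i| <= lam            where w_i == 0
grad_g = X.T @ (X @ w - y)
nonzero = np.abs(w) > 1e-8
print(np.allclose(grad_g[nonzero], -lam * np.sign(w[nonzero]), atol=1e-4))
print(np.all(np.abs(grad_g[~nonzero]) <= lam + 1e-4))
```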
An important application of first-order conditions is characterizing the projection onto a convex set—the nearest point in the set.
Definition: Projection
The projection of $z$ onto convex set $C$ is:
$$\text{proj}_C(z) = \arg\min_{x \in C} |x - z|^2$$
First-Order Characterization:
$x^* = \text{proj}_C(z)$ if and only if $x^* \in C$ and:
$$(z - x^*)^T (y - x^*) \leq 0 \quad \forall y \in C$$
Proof: Apply the constrained first-order condition to $f(x) = |x - z|^2$ with $\nabla f(x) = 2(x - z)$:
$$2(x^* - z)^T(y - x^*) \geq 0 \iff (z - x^*)^T(y - x^*) \leq 0$$
The condition says the vector from $x^*$ to $z$ makes an obtuse (or right) angle with every feasible direction. Equivalently, $z - x^*$ is in the normal cone to $C$ at $x^*$. The projection "finds the closest point where the set blocks further movement toward $z$."
Properties of Projection:
Uniqueness: For convex $C$, the projection exists and is unique (follows from strict convexity of $|\cdot|^2$).
Non-expansiveness: $|\text{proj}_C(z_1) - \text{proj}_C(z_2)| \leq |z_1 - z_2|$. Projection doesn't increase distances.
Firm non-expansiveness: $(\text{proj}_C(z_1) - \text{proj}_C(z_2))^T(z_1 - z_2) \geq |\text{proj}_C(z_1) - \text{proj}_C(z_2)|^2$.
These properties are crucial for analyzing projected gradient descent and related algorithms.
Common Projections:
- Box $\{x : l \leq x \leq u\}$: componentwise clipping, $\text{proj}(z)_i = \min(\max(z_i, l_i), u_i)$.
- Euclidean ball of radius $r$ centered at the origin: $\text{proj}(z) = z$ if $|z| \leq r$, otherwise $r\, z / |z|$.
- Nonnegative orthant: $\text{proj}(z)_i = \max(z_i, 0)$.
- Hyperplane $\{x : a^T x = b\}$: $\text{proj}(z) = z - \frac{a^T z - b}{|a|^2} a$.

A short sketch implementing two of these and checking the first-order characterization follows.
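Here is that sketch (a minimal illustration; the query point and sample counts are arbitrary choices):

```python
import numpy as np

def project_box(z, lo, hi):
    """Projection onto the box [lo, hi]^n: componentwise clipping."""
    return np.clip(z, lo, hi)

def project_ball(z, radius):
    """Projection onto the Euclidean ball of the given radius centered at 0."""
    norm = np.linalg.norm(z)
    return z if norm <= radius else (radius / norm) * z

rng = np.random.default_rng(0)
z = np.array([3.0, -1.5, 0.5])

# Box [0, 2]^3: check (z - x*)^T (y - x*) <= 0 for sampled feasible y.
x_star = project_box(z, 0.0, 2.0)
ys = rng.uniform(0.0, 2.0, size=(1000, 3))
print(np.all((ys - x_star) @ (z - x_star) <= 1e-12))    # True

# Unit ball: same check with samples pushed into the ball.
x_star = project_ball(z, 1.0)
ys = rng.normal(size=(1000, 3))
ys /= np.maximum(np.linalg.norm(ys, axis=1, keepdims=True), 1.0)
print(np.all((ys - x_star) @ (z - x_star) <= 1e-12))    # True
```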
First-order conditions provide a complete characterization of optimality for convex problems using only gradient information. Let's consolidate:
- Unconstrained, differentiable convex $f$: $x^*$ is a global minimum if and only if $\nabla f(x^*) = 0$.
- Constrained over a convex set $C$: $x^* \in C$ is optimal if and only if $\nabla f(x^*)^T (y - x^*) \geq 0$ for all $y \in C$; at interior points this reduces to $\nabla f(x^*) = 0$.
- Non-differentiable convex $f$: $x^*$ is optimal if and only if $0 \in \partial f(x^*)$.
- Projection onto a convex set: $x^* = \text{proj}_C(z)$ if and only if $(z - x^*)^T(y - x^*) \leq 0$ for all $y \in C$.
- These conditions justify practical stopping criteria such as $|\nabla f(w)| < \epsilon$ and their subdifferential analogues.
What's Next:
First-order conditions tell us whether a point is optimal but don't characterize the local curvature. The next page develops second-order conditions—using Hessian information to understand the shape of the function near an optimum and to distinguish between strong and weak convexity.
You now have a complete toolkit for verifying optimality using gradient information. For convex problems, a zero gradient (or appropriate subdifferential condition) guarantees global optimality—the foundation of all gradient-based optimization algorithms.