First-order conditions tell us whether a point is a stationary point. But gradients only capture the "slope" of a function—they tell us nothing about the curvature. Are we at the bottom of a narrow valley or a wide basin? Is the function curving upward (convex) or downward (concave)? How quickly do gradients change as we move?
Second-order conditions answer these questions using the Hessian matrix—the matrix of second partial derivatives. This curvature information is fundamental to classifying stationary points (minimum, maximum, or saddle), verifying convexity, and understanding how quickly optimization algorithms converge.
This page develops the mathematical theory of second-order conditions, from the basic definitions to their applications in machine learning optimization.
By the end of this page, you will understand: (1) the Hessian matrix and its interpretation, (2) second-order necessary and sufficient conditions for optimality, (3) how the Hessian characterizes convexity, (4) the relationship between eigenvalues and curvature, and (5) applications to strong convexity and conditioning.
For a twice-differentiable function $f: \mathbb{R}^n \rightarrow \mathbb{R}$, the Hessian matrix $\nabla^2 f(x)$ (or $H_f(x)$) is the $n \times n$ matrix of second partial derivatives:
$$(\nabla^2 f(x))_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(x)$$
Explicitly:
$$\nabla^2 f(x) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}$$
For twice continuously differentiable functions (functions with continuous second derivatives), Schwarz's theorem guarantees the Hessian is symmetric: $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$. This symmetry is crucial—it ensures the Hessian has real eigenvalues and orthogonal eigenvectors.
Interpretation: Local Curvature
The Hessian captures how the gradient changes as we move through space:
$$\nabla f(x + \Delta x) \approx \nabla f(x) + \nabla^2 f(x) \Delta x$$
If we move in direction $v$, the rate of change of the gradient in direction $v$ is:
$$v^T \nabla^2 f(x) v$$
This quantity is the directional second derivative—the curvature of $f$ along direction $v$.
Second-Order Taylor Expansion:
Near point $x$, the function can be approximated as:
$$f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \frac{1}{2} \Delta x^T \nabla^2 f(x) \Delta x$$
The first three terms: constant, linear (gradient), and quadratic (Hessian). At a stationary point where $\nabla f(x) = 0$, the local behavior is entirely determined by the Hessian.
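To see the quadratic model in action, here is a minimal numerical sketch; the test function $f(x) = x_1^4 + 3x_1x_2 + x_2^2$ and the displacement direction are arbitrary choices for illustration. The approximation error shrinks like $\|\Delta x\|^3$, confirming that the constant, gradient, and Hessian terms capture everything up to second order.

```python
import numpy as np

# Illustrative test function with hand-computed gradient and Hessian
def f(x):
    return x[0]**4 + 3*x[0]*x[1] + x[1]**2

def grad_f(x):
    return np.array([4*x[0]**3 + 3*x[1], 3*x[0] + 2*x[1]])

def hess_f(x):
    return np.array([[12*x[0]**2, 3.0],
                     [3.0,        2.0]])

x = np.array([1.0, -0.5])
for step in [1e-1, 1e-2, 1e-3]:
    dx = step * np.array([1.0, 1.0]) / np.sqrt(2)   # small displacement
    exact = f(x + dx)
    quad_model = f(x) + grad_f(x) @ dx + 0.5 * dx @ hess_f(x) @ dx
    # The error of the quadratic model should shrink roughly like ||dx||^3
    print(f"||dx|| = {np.linalg.norm(dx):.0e}, model error = {abs(exact - quad_model):.2e}")
```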
| Function | Gradient | Hessian |
|---|---|---|
| $f(x) = \frac{1}{2}x^T Q x$ | $Qx$ | $Q$ |
| $f(x) = \|x\|^2$ | $2x$ | $2I$ |
| $f(x) = x^T A^T A x$ | $2A^T A x$ | $2A^T A$ |
| $f(x) = \log(1 + e^x)$ (softplus) | $\sigma(x)$ | $\sigma(x)(1-\sigma(x))$ |
| $f(w) = \frac{1}{n}\sum (x_i^T w - y_i)^2$ | $\frac{2}{n}X^T(Xw - y)$ | $\frac{2}{n}X^T X$ |
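As a sanity check on the last row of the table, the sketch below (random data, purely illustrative) compares the closed-form gradient and Hessian of the mean-squared-error objective against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

f = lambda w: np.mean((X @ w - y) ** 2)          # (1/n) * sum_i (x_i^T w - y_i)^2
grad = lambda w: (2.0 / n) * X.T @ (X @ w - y)   # closed-form gradient from the table
H = (2.0 / n) * X.T @ X                          # closed-form (constant) Hessian

w0, eps = rng.standard_normal(d), 1e-5
E = np.eye(d)

# Central finite differences of f reproduce the gradient...
g_numeric = np.array([(f(w0 + eps * e) - f(w0 - eps * e)) / (2 * eps) for e in E])
# ...and central finite differences of the gradient reproduce the Hessian.
H_numeric = np.column_stack([(grad(w0 + eps * e) - grad(w0 - eps * e)) / (2 * eps) for e in E])

print("gradient error:", np.max(np.abs(g_numeric - grad(w0))))
print("Hessian  error:", np.max(np.abs(H_numeric - H)))
```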
The Hessian supplies information the gradient cannot: at a stationary point it describes how the function curves, which is exactly what we need to classify that point.
Theorem (Second-Order Necessary Conditions):
If $x^*$ is a local minimum of twice-differentiable $f: \mathbb{R}^n \rightarrow \mathbb{R}$, then:
(1) $\nabla f(x^*) = 0$ (the first-order necessary condition), and
(2) $\nabla^2 f(x^*) \succeq 0$ (the Hessian is positive semidefinite).
Proof of (2):
Suppose $\nabla^2 f(x^*) \not\succeq 0$. Then there exists a direction $v$ with $\|v\| = 1$ such that:
$$v^T \nabla^2 f(x^*) v < 0$$
Using the Taylor expansion at $x^*$ (where $\nabla f(x^*) = 0$):
$$f(x^* + tv) = f(x^*) + \frac{t^2}{2} v^T \nabla^2 f(x^*) v + o(t^2)$$
For small $t$, the $t^2$ term dominates the $o(t^2)$ remainder. Since $v^T \nabla^2 f(x^*) v < 0$:
$$f(x^* + tv) < f(x^*)$$
This contradicts $x^*$ being a local minimum. $\square$
The condition $\nabla^2 f(x^*) \succeq 0$ is necessary but NOT sufficient for a local minimum (even combined with $\nabla f(x^*) = 0$). Consider $f(x) = x^3$ at $x = 0$: the gradient is 0, the second derivative is 0 (so PSD), but $x = 0$ is an inflection point, not a minimum.
Understanding Positive Semidefiniteness:
A symmetric matrix $H$ is positive semidefinite ($H \succeq 0$) if any of these equivalent conditions hold: (1) $v^T H v \geq 0$ for all $v \in \mathbb{R}^n$; (2) all eigenvalues of $H$ are non-negative; (3) $H = B^T B$ for some matrix $B$.
At a local minimum, condition (1) is the geometric requirement: we can't have negative curvature in any direction, or that direction would lead downhill.
Saddle Points:
At a saddle point, $\nabla f(x^*) = 0$ but $\nabla^2 f(x^*)$ has both positive and negative eigenvalues—positive curvature in some directions, negative in others. The point is a minimum in some directions and a maximum in others.
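A concrete example: $f(x, y) = x^2 - y^2$ has $\nabla f = (2x, -2y)$, so the origin is the only stationary point, and

$$\nabla^2 f = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix}$$

has eigenvalues $+2$ and $-2$. Along the $x$-axis the origin behaves like a minimum; along the $y$-axis it behaves like a maximum, which is exactly the saddle geometry described above.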
To guarantee a local minimum (not just rule out ascent directions), we need a stronger condition.
Theorem (Second-Order Sufficient Conditions):
If $f$ is twice continuously differentiable, $\nabla f(x^*) = 0$, and $\nabla^2 f(x^*) \succ 0$ (positive definite), then $x^*$ is a strict local minimum.
Proof:
Since $\nabla^2 f(x^*)$ is positive definite, let $\lambda_{\min} > 0$ be its smallest eigenvalue. For any unit vector $v$:
$$v^T \nabla^2 f(x^*) v \geq \lambda_{\min} > 0$$
By continuity of the Hessian, there exists $\epsilon > 0$ such that for all $x$ with $\|x - x^*\| < \epsilon$ and all unit vectors $v$:
$$v^T \nabla^2 f(x) v \geq \frac{\lambda_{\min}}{2} > 0$$
For any $y$ with $\|y - x^*\| < \epsilon$, consider the path from $x^*$ to $y$:
$$f(y) = f(x^*) + \int_0^1 \nabla f(x^* + t(y-x^*))^T (y - x^*) \, dt$$
Using $\nabla f(x^*) = 0$ and expanding:
$$f(y) = f(x^*) + \int_0^1 \int_0^s (y - x^*)^T \nabla^2 f(x^* + \tau(y-x^*)) (y - x^*) \, d\tau \, ds$$
$$\geq f(x^*) + \frac{\lambda_{\min}}{4} \|y - x^*\|^2 > f(x^*)$$
So $x^*$ is a strict local minimum. $\square$
Necessary: $\nabla^2 f(x^*) \succeq 0$ (positive semidefinite). Sufficient: $\nabla^2 f(x^*) \succ 0$ (positive definite). The gap is when the Hessian is PSD but not PD (has zero eigenvalues). This "borderline" case requires higher-order analysis.
| Condition at critical point $x^*$ | Classification |
|---|---|
| $\nabla^2 f(x^*) \succ 0$ | Strict local minimum (sufficient) |
| $\nabla^2 f(x^*) \prec 0$ | Strict local maximum |
| $\nabla^2 f(x^*)$ has pos. and neg. eigenvalues | Saddle point |
| $\nabla^2 f(x^*) \succeq 0$ with zero eigenvalues | Inconclusive (could be min, saddle, or inflection) |
The Hessian provides a powerful way to verify global convexity, not just local behavior at critical points.
Theorem (Second-Order Characterization of Convexity):
Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ be twice continuously differentiable. Then:
$$f \text{ is convex} \iff \nabla^2 f(x) \succeq 0 \text{ for all } x \in \text{dom}(f)$$
Proof ($\Rightarrow$):
If $f$ is convex, consider any point $x$ and direction $v$. The function $g(t) = f(x + tv)$ is a convex function of $t$. Its second derivative is:
$$g''(t) = v^T \nabla^2 f(x + tv) v$$
Since $g$ is convex and twice differentiable, $g''(t) \geq 0$ for all $t$; evaluating at $t = 0$ gives $v^T \nabla^2 f(x) v \geq 0$ for every $v$, which means $\nabla^2 f(x) \succeq 0$.
Proof ($\Leftarrow$):
Assume $\nabla^2 f(x) \succeq 0$ everywhere. For any $x, y$, consider the path from $x$ to $y$:
$$f(y) = f(x) + \nabla f(x)^T(y-x) + \int_0^1 (1-t)(y-x)^T \nabla^2 f(x + t(y-x))(y-x) \, dt$$
Since $(y-x)^T \nabla^2 f(\cdot)(y-x) \geq 0$ everywhere:
$$f(y) \geq f(x) + \nabla f(x)^T(y-x)$$
This is the first-order characterization of convexity. $\square$
Strict convexity: $\nabla^2 f(x) \succ 0$ for all $x$ guarantees strict convexity. Strong convexity: $\nabla^2 f(x) \succeq mI$ for some $m > 0$ everywhere guarantees $m$-strong convexity. These conditions give progressively stronger guarantees about uniqueness and convergence.
Practical Convexity Verification:
To check whether a function is convex, compute the Hessian (in closed form when possible) and verify that it is positive semidefinite at every point of the domain. When no closed form is available, checking the smallest eigenvalue of the Hessian at many sample points gives strong evidence of convexity, though not a proof; the utilities later on this page implement these checks numerically.
Example: Logistic Regression
For the logistic loss $f(w) = \sum_{i} \log(1 + e^{-y_i x_i^T w})$ with labels $y_i \in \{-1, +1\}$, the Hessian is:
$$\nabla^2 f(w) = \sum_{i} \sigma(y_i x_i^T w)(1 - \sigma(y_i x_i^T w)) x_i x_i^T = X^T D X$$
where $\sigma$ is the sigmoid, $\sigma_i = \sigma(y_i x_i^T w)$, and $D$ is diagonal with $D_{ii} = \sigma_i(1 - \sigma_i) \geq 0$.
Since $X^T D X$ is a weighted sum of outer products with non-negative weights, it's PSD. Hence logistic loss is convex.
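Before the general-purpose utilities below, a quick numerical confirmation is easy to run. This is a hedged sketch on synthetic data (the random $X$, labels, and weight vector are made up for illustration): it builds $X^T D X$ directly and checks its smallest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
w = rng.standard_normal(d)

sigma = 1.0 / (1.0 + np.exp(-y * (X @ w)))      # sigma(y_i x_i^T w)
D = np.diag(sigma * (1.0 - sigma))              # non-negative diagonal weights
H = X.T @ D @ X                                 # logistic-loss Hessian at w

print("min eigenvalue:", np.linalg.eigvalsh(H).min())  # >= 0 up to round-off
```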
```python
import numpy as np
from scipy.linalg import eigvalsh, cholesky, LinAlgError


def check_psd(H, tol=1e-10):
    """
    Check if matrix H is positive semidefinite.
    Uses eigenvalue decomposition.
    """
    eigenvalues = eigvalsh(H)
    min_eig = np.min(eigenvalues)
    return {
        'is_psd': min_eig >= -tol,
        'min_eigenvalue': min_eig,
        'eigenvalues': eigenvalues
    }


def check_pd(H, tol=1e-10):
    """
    Check if matrix H is positive definite.
    Attempts Cholesky decomposition (faster than eigenvalue).
    """
    try:
        # Cholesky succeeds exactly when H is positive definite
        L = cholesky(H, lower=True)
        return {'is_pd': True, 'cholesky': L}
    except LinAlgError:
        eigs = eigvalsh(H)
        return {
            'is_pd': False,
            'min_eigenvalue': np.min(eigs),
            'reason': 'Cholesky failed - not positive definite'
        }


def verify_convexity_at_points(hessian_fn, test_points, tol=1e-10):
    """
    Verify convexity by checking Hessian PSD at multiple points.
    """
    results = []
    all_psd = True
    for i, x in enumerate(test_points):
        H = hessian_fn(x)
        psd_check = check_psd(H, tol)
        results.append({
            'point_index': i,
            'is_psd': psd_check['is_psd'],
            'min_eigenvalue': psd_check['min_eigenvalue']
        })
        if not psd_check['is_psd']:
            all_psd = False
    return {
        'is_likely_convex': all_psd,
        'details': results
    }


def classify_critical_point(hessian, tol=1e-10):
    """
    Classify a critical point based on Hessian eigenvalues.
    """
    eigenvalues = eigvalsh(hessian)
    n_positive = np.sum(eigenvalues > tol)
    n_negative = np.sum(eigenvalues < -tol)
    n_zero = len(eigenvalues) - n_positive - n_negative

    if n_negative == 0 and n_zero == 0:
        return 'strict_local_minimum'
    elif n_positive == 0 and n_zero == 0:
        return 'strict_local_maximum'
    elif n_negative == 0:
        return 'local_minimum_candidate'  # Some zero eigenvalues
    elif n_positive == 0:
        return 'local_maximum_candidate'
    else:
        return f'saddle_point ({n_positive}+, {n_negative}-, {n_zero} zero)'


# Example: Analyze quadratic function
def quadratic_hessian(x, Q):
    """Hessian of f(x) = 0.5 * x^T Q x"""
    return Q

# Well-conditioned convex
Q_convex = np.array([[2, 0.5], [0.5, 1]])
print("Convex quadratic:")
print(check_psd(Q_convex))
print(f"Classification: {classify_critical_point(Q_convex)}")

# Saddle
Q_saddle = np.array([[1, 0], [0, -1]])
print("\nSaddle point:")
print(check_psd(Q_saddle))
print(f"Classification: {classify_critical_point(Q_saddle)}")

# Degenerate (PSD but not PD)
Q_degenerate = np.array([[1, 1], [1, 1]])  # Rank 1
print("\nDegenerate (rank-deficient):")
print(check_psd(Q_degenerate))
print(f"Classification: {classify_critical_point(Q_degenerate)}")
```

The eigenstructure of the Hessian provides deep geometric insight into the optimization landscape.
Eigenvalue Interpretation:
For symmetric Hessian $H = \nabla^2 f(x)$ with eigendecomposition $H = V \Lambda V^T$:
- The eigenvectors (columns of $V$) are the principal curvature directions.
- The eigenvalue $\lambda_i$ is the curvature along eigenvector $v_i$: $\lambda_i > 0$ means $f$ curves upward in that direction, $\lambda_i < 0$ means it curves downward, and $\lambda_i = 0$ means it is locally flat along $v_i$.
Curvature in Direction $d$:
$$\kappa_d = \frac{d^T H d}{\|d\|^2}$$
This is bounded by eigenvalues: $\lambda_{\min} \leq \kappa_d \leq \lambda_{\max}$.
For a 2D quadratic with Hessian $H$, the level sets $x^T H x = c$ are ellipses. The eigenvectors give the ellipse's principal axes, and eigenvalues determine how stretched it is along each axis. Large eigenvalue = tight curvature; small eigenvalue = gradual curvature.
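A short sketch of this picture (the $2 \times 2$ Hessian below is an arbitrary illustrative choice): `np.linalg.eigh` returns the principal curvatures and axes, and the curvature along any other direction falls between $\lambda_{\min}$ and $\lambda_{\max}$.

```python
import numpy as np

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])                       # illustrative symmetric Hessian
eigenvalues, eigenvectors = np.linalg.eigh(H)    # columns of eigenvectors = principal axes

print("principal curvatures:", eigenvalues)
for lam, v in zip(eigenvalues, eigenvectors.T):
    # Curvature along a unit eigenvector equals its eigenvalue
    print(f"direction {np.round(v, 3)}: v^T H v = {v @ H @ v:.3f} (lambda = {lam:.3f})")

# Curvature in an arbitrary unit direction lies between lambda_min and lambda_max
d = np.array([1.0, 1.0]) / np.sqrt(2)
print("curvature along d:", d @ H @ d)
```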
Condition Number:
The condition number of a positive definite Hessian (at a point, or the worst case over the domain) is:
$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}$$
This measures the "elongation" of the level sets: when $\kappa \approx 1$ they are nearly spherical and the negative gradient points almost directly at the minimum; when $\kappa \gg 1$ they are long, narrow ellipsoids and gradient steps zig-zag across the narrow axis.
Why Condition Number Matters:
For gradient descent on a quadratic with Hessian $H$:
$$\text{Convergence rate} \propto \left(\frac{\kappa - 1}{\kappa + 1}\right)^2$$
For $\kappa = 100$ (moderately ill-conditioned): rate $\approx 0.96$, meaning 96% of the error remains each iteration. You need many iterations to converge.
For $\kappa = 10000$ (severely ill-conditioned): rate $\approx 0.9996$, meaning thousands of iterations are needed for even modest progress.
| $\kappa$ | Iterations to halve error | Description |
|---|---|---|
| 1 | ~1 | Perfect conditioning, instant convergence |
| 10 | ~8 | Well-conditioned, fast |
| 100 | ~70 | Moderate conditioning |
| 1000 | ~700 | Ill-conditioned, slow |
| 10000+ | ~7000+ | Severely ill-conditioned |
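The effect is easy to reproduce. The sketch below (a hypothetical diagonal quadratic, plain gradient descent with the standard $1/L$ step size) counts how many iterations are needed to shrink $f$ by a factor of $10^6$; the counts grow roughly linearly with $\kappa$, in line with the table above.

```python
import numpy as np

def gd_iterations(kappa, tol=1e-6, max_iters=200000):
    """Gradient descent on f(x) = 0.5*(x1^2 + kappa*x2^2) until f <= tol * f(x0)."""
    eigs = np.array([1.0, kappa])      # Hessian eigenvalues: mu = 1, L = kappa
    step = 1.0 / eigs.max()            # standard 1/L step size
    x = np.ones(2)
    f = lambda x: 0.5 * np.sum(eigs * x**2)
    f0 = f(x)
    for k in range(1, max_iters + 1):
        x = x - step * eigs * x        # gradient of the diagonal quadratic is eigs * x
        if f(x) <= tol * f0:
            return k
    return max_iters

for kappa in [1, 10, 100, 1000]:
    print(f"kappa = {kappa:>5}: {gd_iterations(kappa)} iterations")
```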
Preconditioning:
The solution to ill-conditioning is preconditioning: transforming the problem so the effective Hessian is better conditioned.
If we have an easily invertible approximation $P \approx H$, then instead of solving $Hx = b$ directly we solve the preconditioned system:
$$P^{-1} H x = P^{-1} b$$
The effective Hessian is $P^{-1} H$, which ideally has condition number close to 1.
Newton's method uses $P = H$ exactly, achieving perfect conditioning (but at computational cost). Quasi-Newton methods (BFGS, L-BFGS) build approximations to $H$ incrementally.
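As a small illustration of the idea (not any particular library's preconditioner), the sketch below applies a Jacobi (diagonal) preconditioner to a deliberately badly scaled least-squares Hessian; the symmetric form $P^{-1/2} H P^{-1/2}$ has the same spectrum as $P^{-1}H$ and shows a large drop in condition number.

```python
import numpy as np

rng = np.random.default_rng(0)
scales = np.array([1.0, 2.0, 5.0, 20.0, 100.0])        # deliberately bad column scaling
A = rng.standard_normal((30, 5)) * scales
H = A.T @ A                                            # Hessian of 0.5*||Ax - b||^2

def cond(M):
    eigs = np.linalg.eigvalsh(M)
    return eigs.max() / eigs.min()

P_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(H)))        # Jacobi preconditioner P = diag(H)
H_precond = P_inv_sqrt @ H @ P_inv_sqrt                # same spectrum as P^{-1} H

print(f"condition number before: {cond(H):.1f}")
print(f"condition number after : {cond(H_precond):.1f}")
```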
Strong convexity quantifies a lower bound on curvature, providing the strongest optimization guarantees.
Definition via Hessian:
Function $f$ is $m$-strongly convex if:
$$\nabla^2 f(x) \succeq m I \quad \forall x$$
Equivalently, all eigenvalues of the Hessian are at least $m$ everywhere.
Equivalent Definitions:
Quadratic lower bound: $$f(y) \geq f(x) + \nabla f(x)^T(y-x) + \frac{m}{2}\|y - x\|^2$$
Convex-combination inequality: $$f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y) - \frac{m}{2}\theta(1-\theta)\|x-y\|^2$$
Shifted convexity: $f(x) - \frac{m}{2}\|x\|^2$ is convex; subtracting the quadratic removes exactly $m$ units of curvature in every direction, so the remainder stays convex precisely when $f$ has at least that much curvature everywhere.
Adding L2 regularization $\frac{\lambda}{2}|w|^2$ to any convex loss adds $\lambda I$ to the Hessian, making the total objective $\lambda$-strongly convex. This is one reason regularization helps optimization, not just generalization.
```python
import numpy as np
from scipy.linalg import eigvalsh


def estimate_strong_convexity(hessian_fn, sample_points):
    """
    Estimate strong convexity parameter m by finding minimum
    eigenvalue of Hessian over sample points.
    """
    min_eigenvalue = float('inf')
    for x in sample_points:
        H = hessian_fn(x)
        eigs = eigvalsh(H)
        min_eig = np.min(eigs)
        min_eigenvalue = min(min_eigenvalue, min_eig)
    return {
        'strong_convexity_param': max(0, min_eigenvalue),
        'is_strongly_convex': min_eigenvalue > 1e-10
    }


def convergence_factor(m, L):
    """
    Per-iteration error factor 1 - 1/kappa for gradient descent with step 1/L.
    The error is multiplied by this factor each iteration.
    """
    if m <= 0 or L <= 0:
        return 1.0
    kappa = L / m
    return 1 - 1/kappa


def iterations_to_epsilon(m, L, epsilon, initial_error=1.0):
    """
    Estimate iterations needed to reduce error to epsilon.
    """
    factor = convergence_factor(m, L)
    if factor >= 1:
        return float('inf')
    # initial_error * factor^k < epsilon
    # k > log(epsilon/initial_error) / log(factor)
    return int(np.ceil(np.log(epsilon / initial_error) / np.log(factor)))


# Example: Effect of regularization on strong convexity
np.random.seed(42)
n, d = 100, 10
X = np.random.randn(n, d)

# Unregularized linear regression: may not be strongly convex
H_unreg = X.T @ X / n
eigs_unreg = eigvalsh(H_unreg)
print(f"Unregularized: min eigenvalue = {np.min(eigs_unreg):.6f}")
print(f"  Condition number: {np.max(eigs_unreg)/np.min(eigs_unreg):.1f}")

# L2 regularized: guaranteed strongly convex
lambda_reg = 0.1
H_reg = X.T @ X / n + lambda_reg * np.eye(d)
eigs_reg = eigvalsh(H_reg)
print(f"\nRegularized (λ={lambda_reg}): min eigenvalue = {np.min(eigs_reg):.6f}")
print(f"  Condition number: {np.max(eigs_reg)/np.min(eigs_reg):.1f}")

# Convergence analysis
m = np.min(eigs_reg)
L = np.max(eigs_reg)
print(f"\nConvergence factor: {convergence_factor(m, L):.6f}")
print(f"Iterations to 1e-6 error: {iterations_to_epsilon(m, L, 1e-6)}")
```

Second-order conditions and Hessian analysis have direct applications throughout machine learning.
Newton's Method:
Newton's method uses the Hessian to take optimal steps for quadratic functions:
$$x_{k+1} = x_k - (\nabla^2 f(x_k))^{-1} \nabla f(x_k)$$
For quadratics, this converges in one step. For general smooth functions, it achieves quadratic convergence near the optimum:
$$\|x_{k+1} - x^*\| = O(\|x_k - x^*\|^2)$$
The error squares each iteration—convergence is extremely fast once close.
Trade-off: Newton's method requires computing and inverting the Hessian: $O(n^3)$ per iteration. For large-scale ML, this is prohibitive.
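For concreteness, here is a minimal Newton iteration (a generic sketch, not a specific library routine) applied to the illustrative convex function $f(x) = \sum_i \left(e^{x_i} + \tfrac{1}{2}x_i^2\right)$; each step solves the linear system $\nabla^2 f(x_k)\, \Delta x = \nabla f(x_k)$ rather than forming an explicit inverse.

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iters=50):
    """Basic (undamped) Newton's method for smooth, strictly convex problems."""
    x = x0.astype(float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve the Newton system H * step = g instead of inverting H
        step = np.linalg.solve(hess(x), g)
        x = x - step
    return x

grad = lambda x: np.exp(x) + x                 # gradient of sum_i (e^{x_i} + 0.5 x_i^2)
hess = lambda x: np.diag(np.exp(x) + 1.0)      # diagonal, positive definite Hessian

x_star = newton(grad, hess, np.array([2.0, -3.0, 0.5]))
print("solution:", x_star)                     # each coordinate solves e^x + x = 0 (~ -0.567)
print("gradient norm:", np.linalg.norm(grad(x_star)))
```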
Quasi-Newton Methods (BFGS, L-BFGS):
These methods approximate the Hessian (or its inverse) using only gradient information:
- BFGS maintains a dense approximation that is updated after each step so that it satisfies the secant condition $B_{k+1}(x_{k+1} - x_k) = \nabla f(x_{k+1}) - \nabla f(x_k)$.
- L-BFGS keeps only the most recent gradient/step pairs and applies the approximate inverse Hessian implicitly, using $O(mn)$ memory for $m$ stored pairs.
L-BFGS is the workhorse for medium-scale ML optimization (logistic regression, CRFs, etc.).
For neural networks with millions of parameters, the Hessian is a million-by-million matrix. Even storing it requires terabytes. Approximations like K-FAC and Shampoo use structured Hessian approximations, but SGD variants remain dominant due to simplicity and implicit regularization.
Hessian-Free Optimization:
Instead of forming the full Hessian, compute Hessian-vector products $Hv$ efficiently using automatic differentiation:
$$Hv = \nabla_x (\nabla_x f \cdot v)$$
This costs about 2x a gradient computation. Combined with conjugate gradient, we can approximately solve $Hx = g$ without ever forming $H$.
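A cheap stand-in for the exact automatic-differentiation identity is a forward finite difference of the gradient, which likewise costs roughly one extra gradient evaluation. The sketch below (illustrative least-squares problem) compares it with the exact product $X^T X v$.

```python
import numpy as np

def hvp_finite_diff(grad, x, v, eps=1e-6):
    """Approximate Hessian-vector product without ever forming the Hessian."""
    return (grad(x + eps * v) - grad(x)) / eps

# Least-squares test problem: f(w) = 0.5*||Xw - y||^2, whose exact Hessian is X^T X
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
y = rng.standard_normal(100)
grad = lambda w: X.T @ (X @ w - y)

w = rng.standard_normal(8)
v = rng.standard_normal(8)
print("approx Hv:", hvp_finite_diff(grad, w, v)[:3])
print("exact  Hv:", (X.T @ (X @ v))[:3])       # X^T X v for comparison
```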
Gauss-Newton Approximation:
For least-squares problems $f(w) = \frac{1}{2}\|r(w)\|^2$, where $r(w)$ is the residual vector with Jacobian $J$, the Hessian is:
$$\nabla^2 f = J^T J + \sum_i r_i \nabla^2 r_i$$
The Gauss-Newton approximation drops the second term:
$$\nabla^2 f \approx J^T J$$
This is always PSD (guaranteed descent direction) and cheaper to compute. It's the basis for Levenberg-Marquardt optimization.
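A minimal sketch of the idea (a made-up one-parameter exponential fit, not a production Levenberg-Marquardt implementation): each step solves $(J^T J + \mu I)\,\Delta w = J^T r$, and dropping the damping $\mu$ recovers the pure Gauss-Newton step.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 20)
y = np.exp(0.7 * t) + 0.01 * rng.standard_normal(20)    # data generated near w = 0.7

def residual(w):
    return np.exp(t * w) - y                            # r_i(w) = exp(t_i * w) - y_i

def jacobian(w):
    return (t * np.exp(t * w)).reshape(-1, 1)           # dr_i/dw

w = np.array([0.0])
damping = 1e-3                                          # Levenberg-Marquardt damping mu
for _ in range(10):
    r, J = residual(w[0]), jacobian(w[0])
    # Solve (J^T J + mu*I) step = J^T r; Gauss-Newton uses mu = 0
    step = np.linalg.solve(J.T @ J + damping * np.eye(1), J.T @ r)
    w = w - step.ravel()
print("estimated w:", w[0])                             # should land near 0.7
```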
Natural Gradient:
In probabilistic models, the Fisher information matrix plays the role of the Hessian in parameter space. Natural gradient descent uses Fisher (not Euclidean) geometry:
$$\theta_{k+1} = \theta_k - \eta F^{-1} \nabla \ell(\theta_k)$$
This gives parameter-independent convergence and is connected to second-order optimization.
The Hessian provides crucial curvature information that goes beyond what gradients can tell us. Let's consolidate:
- The Hessian $\nabla^2 f(x)$ collects all second partial derivatives; $v^T \nabla^2 f(x) v$ is the curvature of $f$ along direction $v$.
- Necessary conditions at a local minimum: $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succeq 0$. Sufficient conditions: $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$.
- $f$ is convex if and only if $\nabla^2 f(x) \succeq 0$ everywhere; $\nabla^2 f(x) \succeq mI$ gives $m$-strong convexity.
- The eigenvalues of the Hessian are directional curvatures; their ratio, the condition number, governs how quickly gradient descent converges.
- Newton, quasi-Newton, Hessian-free, and Gauss-Newton methods all exploit exact or approximate curvature information.
What's Next:
With convex sets, functions, and optimality conditions established, we'll explore how convexity manifests across machine learning. The next page examines convexity in ML—which algorithms enjoy convex guarantees, which don't, and how to leverage convexity for reliable training.
You now understand how the Hessian characterizes local and global structure of optimization problems. These second-order tools—from verifying convexity to analyzing conditioning to designing Newton-type algorithms—are essential for understanding why and how optimization works in machine learning.