First-order conditions tell us whether a point is a stationary point. But gradients only capture the "slope" of a function—they tell us nothing about the curvature. Are we at the bottom of a narrow valley or a wide basin? Is the function curving upward (convex) or downward (concave)? How quickly do gradients change as we move?
Second-order conditions answer these questions using the Hessian matrix—the matrix of second partial derivatives. This curvature information is fundamental to classifying stationary points (minimum, maximum, or saddle), verifying convexity, and understanding how quickly optimization algorithms converge.
This page develops the mathematical theory of second-order conditions, from the basic definitions to their applications in machine learning optimization.
By the end of this page, you will understand: (1) the Hessian matrix and its interpretation, (2) second-order necessary and sufficient conditions for optimality, (3) how the Hessian characterizes convexity, (4) the relationship between eigenvalues and curvature, and (5) applications to strong convexity and conditioning.
For a twice-differentiable function $f: \mathbb{R}^n \rightarrow \mathbb{R}$, the Hessian matrix $\nabla^2 f(x)$ (or $H_f(x)$) is the $n \times n$ matrix of second partial derivatives:
$$(\nabla^2 f(x))_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(x)$$
Explicitly:
$$\nabla^2 f(x) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}$$
For twice continuously differentiable functions (functions with continuous second derivatives), Schwarz's theorem guarantees the Hessian is symmetric: $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$. This symmetry is crucial—it ensures the Hessian has real eigenvalues and orthogonal eigenvectors.
Interpretation: Local Curvature
The Hessian captures how the gradient changes as we move through space:
$$\nabla f(x + \Delta x) \approx \nabla f(x) + \nabla^2 f(x) \Delta x$$
If we move in direction $v$, the rate of change of the gradient in direction $v$ is:
$$v^T \nabla^2 f(x) v$$
This quantity is the directional second derivative—the curvature of $f$ along direction $v$.
Second-Order Taylor Expansion:
Near point $x$, the function can be approximated as:
$$f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \frac{1}{2} \Delta x^T \nabla^2 f(x) \Delta x$$
The first three terms: constant, linear (gradient), and quadratic (Hessian). At a stationary point where $\nabla f(x) = 0$, the local behavior is entirely determined by the Hessian.
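To see the quadratic model in action, here is a minimal numerical sketch; the test function $f(x) = x_1^4 + 3x_1x_2 + x_2^2$ and the displacement direction are arbitrary choices for illustration. The approximation error shrinks like $\|\Delta x\|^3$, confirming that the constant, gradient, and Hessian terms capture everything up to second order.

```python
import numpy as np

# Illustrative test function with hand-computed gradient and Hessian
def f(x):
    return x[0]**4 + 3*x[0]*x[1] + x[1]**2

def grad_f(x):
    return np.array([4*x[0]**3 + 3*x[1], 3*x[0] + 2*x[1]])

def hess_f(x):
    return np.array([[12*x[0]**2, 3.0],
                     [3.0,        2.0]])

x = np.array([1.0, -0.5])
for step in [1e-1, 1e-2, 1e-3]:
    dx = step * np.array([1.0, 1.0]) / np.sqrt(2)   # small displacement
    exact = f(x + dx)
    quad_model = f(x) + grad_f(x) @ dx + 0.5 * dx @ hess_f(x) @ dx
    # The error of the quadratic model should shrink roughly like ||dx||^3
    print(f"||dx|| = {np.linalg.norm(dx):.0e}, model error = {abs(exact - quad_model):.2e}")
```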
| Function | Gradient | Hessian |
|---|---|---|
| $f(x) = \frac{1}{2}x^T Q x$ | $Qx$ | $Q$ |
| $f(x) = \|x\|^2$ | $2x$ | $2I$ |
| $f(x) = x^T A^T A x$ | $2A^T A x$ | $2A^T A$ |
| $f(x) = \log(1 + e^x)$ (softplus) | $\sigma(x)$ | $\sigma(x)(1-\sigma(x))$ |
| $f(w) = \frac{1}{n}\sum (x_i^T w - y_i)^2$ | $\frac{2}{n}X^T(Xw - y)$ | $\frac{2}{n}X^T X$ |
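As a sanity check on the last row of the table, the sketch below (random data, purely illustrative) compares the closed-form gradient and Hessian of the mean-squared-error objective against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

f = lambda w: np.mean((X @ w - y) ** 2)          # (1/n) * sum_i (x_i^T w - y_i)^2
grad = lambda w: (2.0 / n) * X.T @ (X @ w - y)   # closed-form gradient from the table
H = (2.0 / n) * X.T @ X                          # closed-form (constant) Hessian

w0, eps = rng.standard_normal(d), 1e-5
E = np.eye(d)

# Central finite differences of f reproduce the gradient...
g_numeric = np.array([(f(w0 + eps * e) - f(w0 - eps * e)) / (2 * eps) for e in E])
# ...and central finite differences of the gradient reproduce the Hessian.
H_numeric = np.column_stack([(grad(w0 + eps * e) - grad(w0 - eps * e)) / (2 * eps) for e in E])

print("gradient error:", np.max(np.abs(g_numeric - grad(w0))))
print("Hessian  error:", np.max(np.abs(H_numeric - H)))
```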
The Hessian supplies information the gradient cannot: at a stationary point it describes how the function curves, which is exactly what we need to classify that point.
Theorem (Second-Order Necessary Conditions):
If $x^*$ is a local minimum of twice-differentiable $f: \mathbb{R}^n \rightarrow \mathbb{R}$, then:
(1) $\nabla f(x^*) = 0$ (the first-order necessary condition), and
(2) $\nabla^2 f(x^*) \succeq 0$ (the Hessian is positive semidefinite).
Proof of (2):
Suppose $\nabla^2 f(x^*) \not\succeq 0$. Then there exists a direction $v$ with $\|v\| = 1$ such that:
$$v^T \nabla^2 f(x^*) v < 0$$
Using the Taylor expansion at $x^*$ (where $\nabla f(x^*) = 0$):
$$f(x^* + tv) = f(x^*) + \frac{t^2}{2} v^T \nabla^2 f(x^*) v + o(t^2)$$
For small $t$, the $t^2$ term dominates the $o(t^2)$ remainder. Since $v^T \nabla^2 f(x^*) v < 0$:
$$f(x^* + tv) < f(x^*)$$
This contradicts $x^*$ being a local minimum. $\square$
The condition $\nabla^2 f(x^*) \succeq 0$ is necessary but NOT sufficient for a local minimum (even combined with $\nabla f(x^*) = 0$). Consider $f(x) = x^3$ at $x = 0$: the gradient is 0, the second derivative is 0 (so PSD), but $x = 0$ is an inflection point, not a minimum.
Understanding Positive Semidefiniteness:
A symmetric matrix $H$ is positive semidefinite ($H \succeq 0$) if any of these equivalent conditions hold: (1) $v^T H v \geq 0$ for all $v \in \mathbb{R}^n$; (2) all eigenvalues of $H$ are non-negative; (3) $H = B^T B$ for some matrix $B$.
At a local minimum, condition (1) is the geometric requirement: we can't have negative curvature in any direction, or that direction would lead downhill.
Saddle Points:
At a saddle point, $\nabla f(x^*) = 0$ but $\nabla^2 f(x^*)$ has both positive and negative eigenvalues—positive curvature in some directions, negative in others. The point is a minimum in some directions and a maximum in others.
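A concrete example: $f(x, y) = x^2 - y^2$ has $\nabla f = (2x, -2y)$, so the origin is the only stationary point, and

$$\nabla^2 f = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix}$$

has eigenvalues $+2$ and $-2$. Along the $x$-axis the origin behaves like a minimum; along the $y$-axis it behaves like a maximum, which is exactly the saddle geometry described above.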
To guarantee a local minimum (not just rule out ascent directions), we need a stronger condition.
Theorem (Second-Order Sufficient Conditions):
If $f$ is twice continuously differentiable, $\nabla f(x^*) = 0$, and $\nabla^2 f(x^*) \succ 0$ (positive definite), then $x^*$ is a strict local minimum.
Proof:
Since $\nabla^2 f(x^*)$ is positive definite, let $\lambda_{\min} > 0$ be its smallest eigenvalue. For any unit vector $v$:
$$v^T \nabla^2 f(x^*) v \geq \lambda_{\min} > 0$$
By continuity of the Hessian, there exists $\epsilon > 0$ such that for all $x$ with $\|x - x^*\| < \epsilon$ and all unit vectors $v$:
$$v^T \nabla^2 f(x) v \geq \frac{\lambda_{\min}}{2} > 0$$
For any $y$ with $\|y - x^*\| < \epsilon$, consider the path from $x^*$ to $y$:
$$f(y) = f(x^*) + \int_0^1 \nabla f(x^* + t(y-x^*))^T (y - x^*) \, dt$$
Using $\nabla f(x^*) = 0$ and expanding:
$$f(y) = f(x^*) + \int_0^1 \int_0^s (y - x^*)^T \nabla^2 f(x^* + \tau(y-x^*)) (y - x^*) \, d\tau \, ds$$
$$\geq f(x^*) + \frac{\lambda_{\min}}{4} \|y - x^*\|^2 > f(x^*)$$
So $x^*$ is a strict local minimum. $\square$
Necessary: $\nabla^2 f(x^*) \succeq 0$ (positive semidefinite). Sufficient: $\nabla^2 f(x^*) \succ 0$ (positive definite). The gap is when the Hessian is PSD but not PD (has zero eigenvalues). This "borderline" case requires higher-order analysis.
| Condition at critical point $x^*$ | Classification |
|---|---|
| $\nabla^2 f(x^*) \succ 0$ | Strict local minimum (sufficient) |
| $\nabla^2 f(x^*) \prec 0$ | Strict local maximum |
| $\nabla^2 f(x^*)$ has pos. and neg. eigenvalues | Saddle point |
| $\nabla^2 f(x^*) \succeq 0$ with zero eigenvalues | Inconclusive (could be min, saddle, or inflection) |
The Hessian provides a powerful way to verify global convexity, not just local behavior at critical points.
Theorem (Second-Order Characterization of Convexity):
Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ be twice continuously differentiable. Then:
$$f \text{ is convex} \iff \nabla^2 f(x) \succeq 0 \text{ for all } x \in \text{dom}(f)$$
Proof ($\Rightarrow$):
If $f$ is convex, consider any point $x$ and direction $v$. The function $g(t) = f(x + tv)$ is a convex function of $t$. Its second derivative is:
$$g''(t) = v^T \nabla^2 f(x + tv) v$$
Since $g$ is convex and twice differentiable, $g''(t) \geq 0$ for all $t$; evaluating at $t = 0$ gives $v^T \nabla^2 f(x) v \geq 0$ for every $v$, which means $\nabla^2 f(x) \succeq 0$.
Proof ($\Leftarrow$):
Assume $\nabla^2 f(x) \succeq 0$ everywhere. For any $x, y$, consider the path from $x$ to $y$:
$$f(y) = f(x) + \nabla f(x)^T(y-x) + \int_0^1 (1-t)(y-x)^T \nabla^2 f(x + t(y-x))(y-x) \, dt$$
Since $(y-x)^T \nabla^2 f(\cdot)(y-x) \geq 0$ everywhere:
$$f(y) \geq f(x) + \nabla f(x)^T(y-x)$$
This is the first-order characterization of convexity. $\square$
Strict convexity: $\nabla^2 f(x) \succ 0$ for all $x$ guarantees strict convexity. Strong convexity: $\nabla^2 f(x) \succeq mI$ for some $m > 0$ everywhere guarantees $m$-strong convexity. These conditions give progressively stronger guarantees about uniqueness and convergence.
Practical Convexity Verification:
To check whether a function is convex, compute the Hessian (in closed form when possible) and verify that it is positive semidefinite at every point of the domain. When no closed form is available, checking the smallest eigenvalue of the Hessian at many sample points gives strong evidence of convexity, though not a proof; the utilities later on this page implement these checks numerically.
Example: Logistic Regression
For the logistic loss $f(w) = \sum_{i} \log(1 + e^{-y_i x_i^T w})$ with labels $y_i \in \{-1, +1\}$, the Hessian is:
$$\nabla^2 f(w) = \sum_{i} \sigma(y_i x_i^T w)(1 - \sigma(y_i x_i^T w)) x_i x_i^T = X^T D X$$
where $\sigma$ is the sigmoid, $\sigma_i = \sigma(y_i x_i^T w)$, and $D$ is diagonal with $D_{ii} = \sigma_i(1 - \sigma_i) \geq 0$.
Since $X^T D X$ is a weighted sum of outer products with non-negative weights, it's PSD. Hence logistic loss is convex.
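Before the general-purpose utilities below, a quick numerical confirmation is easy to run. This is a hedged sketch on synthetic data (the random $X$, labels, and weight vector are made up for illustration): it builds $X^T D X$ directly and checks its smallest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
w = rng.standard_normal(d)

sigma = 1.0 / (1.0 + np.exp(-y * (X @ w)))      # sigma(y_i x_i^T w)
D = np.diag(sigma * (1.0 - sigma))              # non-negative diagonal weights
H = X.T @ D @ X                                 # logistic-loss Hessian at w

print("min eigenvalue:", np.linalg.eigvalsh(H).min())  # >= 0 up to round-off
```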
```python
import numpy as np
from scipy.linalg import eigvalsh, cholesky, LinAlgError


def check_psd(H, tol=1e-10):
    """
    Check if matrix H is positive semidefinite.
    Uses eigenvalue decomposition.
    """
    eigenvalues = eigvalsh(H)
    min_eig = np.min(eigenvalues)
    return {
        'is_psd': min_eig >= -tol,
        'min_eigenvalue': min_eig,
        'eigenvalues': eigenvalues
    }


def check_pd(H, tol=1e-10):
    """
    Check if matrix H is positive definite.
    Attempts Cholesky decomposition (faster than eigenvalue).
    """
    try:
        # Cholesky succeeds exactly when H is positive definite
        L = cholesky(H, lower=True)
        return {'is_pd': True, 'cholesky': L}
    except LinAlgError:
        eigs = eigvalsh(H)
        return {
            'is_pd': False,
            'min_eigenvalue': np.min(eigs),
            'reason': 'Cholesky failed - not positive definite'
        }


def verify_convexity_at_points(hessian_fn, test_points, tol=1e-10):
    """
    Verify convexity by checking Hessian PSD at multiple points.
    """
    results = []
    all_psd = True
    for i, x in enumerate(test_points):
        H = hessian_fn(x)
        psd_check = check_psd(H, tol)
        results.append({
            'point_index': i,
            'is_psd': psd_check['is_psd'],
            'min_eigenvalue': psd_check['min_eigenvalue']
        })
        if not psd_check['is_psd']:
            all_psd = False
    return {
        'is_likely_convex': all_psd,
        'details': results
    }


def classify_critical_point(hessian, tol=1e-10):
    """
    Classify a critical point based on Hessian eigenvalues.
    """
    eigenvalues = eigvalsh(hessian)
    n_positive = np.sum(eigenvalues > tol)
    n_negative = np.sum(eigenvalues < -tol)
    n_zero = len(eigenvalues) - n_positive - n_negative

    if n_negative == 0 and n_zero == 0:
        return 'strict_local_minimum'
    elif n_positive == 0 and n_zero == 0:
        return 'strict_local_maximum'
    elif n_negative == 0:
        return 'local_minimum_candidate'  # Some zero eigenvalues
    elif n_positive == 0:
        return 'local_maximum_candidate'
    else:
        return f'saddle_point ({n_positive}+, {n_negative}-, {n_zero} zero)'


# Example: Analyze quadratic function
def quadratic_hessian(x, Q):
    """Hessian of f(x) = 0.5 * x^T Q x"""
    return Q

# Well-conditioned convex
Q_convex = np.array([[2, 0.5], [0.5, 1]])
print("Convex quadratic:")
print(check_psd(Q_convex))
print(f"Classification: {classify_critical_point(Q_convex)}")

# Saddle
Q_saddle = np.array([[1, 0], [0, -1]])
print("\nSaddle point:")
print(check_psd(Q_saddle))
print(f"Classification: {classify_critical_point(Q_saddle)}")

# Degenerate (PSD but not PD)
Q_degenerate = np.array([[1, 1], [1, 1]])  # Rank 1
print("\nDegenerate (rank-deficient):")
print(check_psd(Q_degenerate))
print(f"Classification: {classify_critical_point(Q_degenerate)}")
```

The eigenstructure of the Hessian provides deep geometric insight into the optimization landscape.
Eigenvalue Interpretation:
For symmetric Hessian $H = \nabla^2 f(x)$ with eigendecomposition $H = V \Lambda V^T$:
- The eigenvectors (columns of $V$) are the principal curvature directions.
- The eigenvalue $\lambda_i$ is the curvature along eigenvector $v_i$: $\lambda_i > 0$ means $f$ curves upward in that direction, $\lambda_i < 0$ means it curves downward, and $\lambda_i = 0$ means it is locally flat along $v_i$.
Curvature in Direction $d$:
$$\kappa_d = \frac{d^T H d}{\|d\|^2}$$
This is bounded by eigenvalues: $\lambda_{\min} \leq \kappa_d \leq \lambda_{\max}$.
For a 2D quadratic with Hessian $H$, the level sets $x^T H x = c$ are ellipses. The eigenvectors give the ellipse's principal axes, and eigenvalues determine how stretched it is along each axis. Large eigenvalue = tight curvature; small eigenvalue = gradual curvature.
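A short sketch of this picture (the $2 \times 2$ Hessian below is an arbitrary illustrative choice): `np.linalg.eigh` returns the principal curvatures and axes, and the curvature along any other direction falls between $\lambda_{\min}$ and $\lambda_{\max}$.

```python
import numpy as np

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])                       # illustrative symmetric Hessian
eigenvalues, eigenvectors = np.linalg.eigh(H)    # columns of eigenvectors = principal axes

print("principal curvatures:", eigenvalues)
for lam, v in zip(eigenvalues, eigenvectors.T):
    # Curvature along a unit eigenvector equals its eigenvalue
    print(f"direction {np.round(v, 3)}: v^T H v = {v @ H @ v:.3f} (lambda = {lam:.3f})")

# Curvature in an arbitrary unit direction lies between lambda_min and lambda_max
d = np.array([1.0, 1.0]) / np.sqrt(2)
print("curvature along d:", d @ H @ d)
```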
Condition Number:
The condition number of a positive definite Hessian (at a point, or the worst case over the domain) is:
$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}$$
This measures the "elongation" of the level sets: when $\kappa \approx 1$ they are nearly spherical and the negative gradient points almost directly at the minimum; when $\kappa \gg 1$ they are long, narrow ellipsoids and gradient steps zig-zag across the narrow axis.
Why Condition Number Matters:
For gradient descent on a quadratic with Hessian $H$:
$$\text{Convergence rate} \propto \left(\frac{\kappa - 1}{\kappa + 1}\right)^2$$
For $\kappa = 100$ (moderately ill-conditioned): rate $\approx 0.96$, meaning 96% of the error remains each iteration. You need many iterations to converge.
For $\kappa = 10000$ (severely ill-conditioned): rate $\approx 0.9996$, meaning thousands of iterations are needed for even modest progress.
| $\kappa$ | Iterations to halve error | Description |
|---|---|---|
| 1 | ~1 | Perfect conditioning, instant convergence |
| 10 | ~8 | Well-conditioned, fast |
| 100 | ~70 | Moderate conditioning |
| 1000 | ~700 | Ill-conditioned, slow |
| 10000+ | ~7000+ | Severely ill-conditioned |
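The effect is easy to reproduce. The sketch below (a hypothetical diagonal quadratic, plain gradient descent with the standard $1/L$ step size) counts how many iterations are needed to shrink $f$ by a factor of $10^6$; the counts grow roughly linearly with $\kappa$, in line with the table above.

```python
import numpy as np

def gd_iterations(kappa, tol=1e-6, max_iters=200000):
    """Gradient descent on f(x) = 0.5*(x1^2 + kappa*x2^2) until f <= tol * f(x0)."""
    eigs = np.array([1.0, kappa])      # Hessian eigenvalues: mu = 1, L = kappa
    step = 1.0 / eigs.max()            # standard 1/L step size
    x = np.ones(2)
    f = lambda x: 0.5 * np.sum(eigs * x**2)
    f0 = f(x)
    for k in range(1, max_iters + 1):
        x = x - step * eigs * x        # gradient of the diagonal quadratic is eigs * x
        if f(x) <= tol * f0:
            return k
    return max_iters

for kappa in [1, 10, 100, 1000]:
    print(f"kappa = {kappa:>5}: {gd_iterations(kappa)} iterations")
```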
Preconditioning:
The solution to ill-conditioning is preconditioning: transforming the problem so the effective Hessian is better conditioned.
If we have an easily invertible approximation $P \approx H$, then instead of solving $Hx = b$ directly we solve the preconditioned system:
$$P^{-1} H x = P^{-1} b$$
The effective Hessian is $P^{-1} H$, which ideally has condition number close to 1.
Newton's method uses $P = H$ exactly, achieving perfect conditioning (but at computational cost). Quasi-Newton methods (BFGS, L-BFGS) build approximations to $H$ incrementally.
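As a small illustration of the idea (not any particular library's preconditioner), the sketch below applies a Jacobi (diagonal) preconditioner to a deliberately badly scaled least-squares Hessian; the symmetric form $P^{-1/2} H P^{-1/2}$ has the same spectrum as $P^{-1}H$ and shows a large drop in condition number.

```python
import numpy as np

rng = np.random.default_rng(0)
scales = np.array([1.0, 2.0, 5.0, 20.0, 100.0])        # deliberately bad column scaling
A = rng.standard_normal((30, 5)) * scales
H = A.T @ A                                            # Hessian of 0.5*||Ax - b||^2

def cond(M):
    eigs = np.linalg.eigvalsh(M)
    return eigs.max() / eigs.min()

P_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(H)))        # Jacobi preconditioner P = diag(H)
H_precond = P_inv_sqrt @ H @ P_inv_sqrt                # same spectrum as P^{-1} H

print(f"condition number before: {cond(H):.1f}")
print(f"condition number after : {cond(H_precond):.1f}")
```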
Strong convexity quantifies a lower bound on curvature, providing the strongest optimization guarantees.
Definition via Hessian:
Function $f$ is $m$-strongly convex if:
$$\nabla^2 f(x) \succeq m I \quad \forall x$$
Equivalently, all eigenvalues of the Hessian are at least $m$ everywhere.
Equivalent Definitions:
Quadratic lower bound: $$f(y) \geq f(x) + \nabla f(x)^T(y-x) + \frac{m}{2}\|y - x\|^2$$
Convex-combination inequality: $$f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y) - \frac{m}{2}\theta(1-\theta)\|x-y\|^2$$
Shifted convexity: $f(x) - \frac{m}{2}\|x\|^2$ is convex; subtracting the quadratic removes exactly $m$ units of curvature in every direction, so the remainder stays convex precisely when $f$ has at least that much curvature everywhere.
Adding L2 regularization $\frac{\lambda}{2}|w|^2$ to any convex loss adds $\lambda I$ to the Hessian, making the total objective $\lambda$-strongly convex. This is one reason regularization helps optimization, not just generalization.
```python
import numpy as np
from scipy.linalg import eigvalsh


def estimate_strong_convexity(hessian_fn, sample_points):
    """
    Estimate strong convexity parameter m by finding minimum
    eigenvalue of Hessian over sample points.
    """
    min_eigenvalue = float('inf')
    for x in sample_points:
        H = hessian_fn(x)
        eigs = eigvalsh(H)
        min_eig = np.min(eigs)
        min_eigenvalue = min(min_eigenvalue, min_eig)
    return {
        'strong_convexity_param': max(0, min_eigenvalue),
        'is_strongly_convex': min_eigenvalue > 1e-10
    }


def convergence_factor(m, L):
    """
    Per-iteration error factor 1 - 1/kappa for gradient descent with step 1/L.
    The error is multiplied by this factor each iteration.
    """
    if m <= 0 or L <= 0:
        return 1.0
    kappa = L / m
    return 1 - 1/kappa


def iterations_to_epsilon(m, L, epsilon, initial_error=1.0):
    """
    Estimate iterations needed to reduce error to epsilon.
    """
    factor = convergence_factor(m, L)
    if factor >= 1:
        return float('inf')
    # initial_error * factor^k < epsilon
    # k > log(epsilon/initial_error) / log(factor)
    return int(np.ceil(np.log(epsilon / initial_error) / np.log(factor)))


# Example: Effect of regularization on strong convexity
np.random.seed(42)
n, d = 100, 10
X = np.random.randn(n, d)

# Unregularized linear regression: may not be strongly convex
H_unreg = X.T @ X / n
eigs_unreg = eigvalsh(H_unreg)
print(f"Unregularized: min eigenvalue = {np.min(eigs_unreg):.6f}")
print(f"  Condition number: {np.max(eigs_unreg)/np.min(eigs_unreg):.1f}")

# L2 regularized: guaranteed strongly convex
lambda_reg = 0.1
H_reg = X.T @ X / n + lambda_reg * np.eye(d)
eigs_reg = eigvalsh(H_reg)
print(f"\nRegularized (λ={lambda_reg}): min eigenvalue = {np.min(eigs_reg):.6f}")
print(f"  Condition number: {np.max(eigs_reg)/np.min(eigs_reg):.1f}")

# Convergence analysis
m = np.min(eigs_reg)
L = np.max(eigs_reg)
print(f"\nConvergence factor: {convergence_factor(m, L):.6f}")
print(f"Iterations to 1e-6 error: {iterations_to_epsilon(m, L, 1e-6)}")
```

Second-order conditions and Hessian analysis have direct applications throughout machine learning.
Newton's Method:
Newton's method uses the Hessian to take optimal steps for quadratic functions:
$$x_{k+1} = x_k - (\nabla^2 f(x_k))^{-1} \nabla f(x_k)$$
For quadratics, this converges in one step. For general smooth functions, it achieves quadratic convergence near the optimum:
$$\|x_{k+1} - x^*\| = O(\|x_k - x^*\|^2)$$
The error squares each iteration—convergence is extremely fast once close.
Trade-off: Newton's method requires computing and inverting the Hessian: $O(n^3)$ per iteration. For large-scale ML, this is prohibitive.
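For concreteness, here is a minimal Newton iteration (a generic sketch, not a specific library routine) applied to the illustrative convex function $f(x) = \sum_i \left(e^{x_i} + \tfrac{1}{2}x_i^2\right)$; each step solves the linear system $\nabla^2 f(x_k)\, \Delta x = \nabla f(x_k)$ rather than forming an explicit inverse.

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iters=50):
    """Basic (undamped) Newton's method for smooth, strictly convex problems."""
    x = x0.astype(float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve the Newton system H * step = g instead of inverting H
        step = np.linalg.solve(hess(x), g)
        x = x - step
    return x

grad = lambda x: np.exp(x) + x                 # gradient of sum_i (e^{x_i} + 0.5 x_i^2)
hess = lambda x: np.diag(np.exp(x) + 1.0)      # diagonal, positive definite Hessian

x_star = newton(grad, hess, np.array([2.0, -3.0, 0.5]))
print("solution:", x_star)                     # each coordinate solves e^x + x = 0 (~ -0.567)
print("gradient norm:", np.linalg.norm(grad(x_star)))
```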
Quasi-Newton Methods (BFGS, L-BFGS):
These methods approximate the Hessian (or its inverse) using only gradient information:
- BFGS maintains a dense approximation that is updated after each step so that it satisfies the secant condition $B_{k+1}(x_{k+1} - x_k) = \nabla f(x_{k+1}) - \nabla f(x_k)$.
- L-BFGS keeps only the most recent gradient/step pairs and applies the approximate inverse Hessian implicitly, using $O(mn)$ memory for $m$ stored pairs.
L-BFGS is the workhorse for medium-scale ML optimization (logistic regression, CRFs, etc.).
For neural networks with millions of parameters, the Hessian is a million-by-million matrix. Even storing it requires terabytes. Approximations like K-FAC and Shampoo use structured Hessian approximations, but SGD variants remain dominant due to simplicity and implicit regularization.
Hessian-Free Optimization:
Instead of forming the full Hessian, compute Hessian-vector products $Hv$ efficiently using automatic differentiation:
$$Hv = \nabla_x (\nabla_x f \cdot v)$$
This costs about 2x a gradient computation. Combined with conjugate gradient, we can approximately solve $Hx = g$ without ever forming $H$.
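A cheap stand-in for the exact automatic-differentiation identity is a forward finite difference of the gradient, which likewise costs roughly one extra gradient evaluation. The sketch below (illustrative least-squares problem) compares it with the exact product $X^T X v$.

```python
import numpy as np

def hvp_finite_diff(grad, x, v, eps=1e-6):
    """Approximate Hessian-vector product without ever forming the Hessian."""
    return (grad(x + eps * v) - grad(x)) / eps

# Least-squares test problem: f(w) = 0.5*||Xw - y||^2, whose exact Hessian is X^T X
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
y = rng.standard_normal(100)
grad = lambda w: X.T @ (X @ w - y)

w = rng.standard_normal(8)
v = rng.standard_normal(8)
print("approx Hv:", hvp_finite_diff(grad, w, v)[:3])
print("exact  Hv:", (X.T @ (X @ v))[:3])       # X^T X v for comparison
```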
Gauss-Newton Approximation:
For least-squares problems $f(w) = \frac{1}{2}\|r(w)\|^2$, where $r(w)$ is the residual vector with Jacobian $J$, the Hessian is:
$$\nabla^2 f = J^T J + \sum_i r_i \nabla^2 r_i$$
The Gauss-Newton approximation drops the second term:
$$\nabla^2 f \approx J^T J$$
This is always PSD (guaranteed descent direction) and cheaper to compute. It's the basis for Levenberg-Marquardt optimization.
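A minimal sketch of the idea (a made-up one-parameter exponential fit, not a production Levenberg-Marquardt implementation): each step solves $(J^T J + \mu I)\,\Delta w = J^T r$, and dropping the damping $\mu$ recovers the pure Gauss-Newton step.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 20)
y = np.exp(0.7 * t) + 0.01 * rng.standard_normal(20)    # data generated near w = 0.7

def residual(w):
    return np.exp(t * w) - y                            # r_i(w) = exp(t_i * w) - y_i

def jacobian(w):
    return (t * np.exp(t * w)).reshape(-1, 1)           # dr_i/dw

w = np.array([0.0])
damping = 1e-3                                          # Levenberg-Marquardt damping mu
for _ in range(10):
    r, J = residual(w[0]), jacobian(w[0])
    # Solve (J^T J + mu*I) step = J^T r; Gauss-Newton uses mu = 0
    step = np.linalg.solve(J.T @ J + damping * np.eye(1), J.T @ r)
    w = w - step.ravel()
print("estimated w:", w[0])                             # should land near 0.7
```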
Natural Gradient:
In probabilistic models, the Fisher information matrix plays the role of the Hessian in parameter space. Natural gradient descent uses Fisher (not Euclidean) geometry:
$$\theta_{k+1} = \theta_k - \eta F^{-1} \nabla \ell(\theta_k)$$
This gives parameter-independent convergence and is connected to second-order optimization.
The Hessian provides crucial curvature information that goes beyond what gradients can tell us. Let's consolidate:
- The Hessian $\nabla^2 f(x)$ collects all second partial derivatives; $v^T \nabla^2 f(x) v$ is the curvature of $f$ along direction $v$.
- Necessary conditions at a local minimum: $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succeq 0$. Sufficient conditions: $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$.
- $f$ is convex if and only if $\nabla^2 f(x) \succeq 0$ everywhere; $\nabla^2 f(x) \succeq mI$ gives $m$-strong convexity.
- The eigenvalues of the Hessian are directional curvatures; their ratio, the condition number, governs how quickly gradient descent converges.
- Newton, quasi-Newton, Hessian-free, and Gauss-Newton methods all exploit exact or approximate curvature information.
What's Next:
With convex sets, functions, and optimality conditions established, we'll explore how convexity manifests across machine learning. The next page examines convexity in ML—which algorithms enjoy convex guarantees, which don't, and how to leverage convexity for reliable training.
You now understand how the Hessian characterizes local and global structure of optimization problems. These second-order tools—from verifying convexity to analyzing conditioning to designing Newton-type algorithms—are essential for understanding why and how optimization works in machine learning.