The gradient tells us direction—which way is downhill. But it doesn't tell us how fast the landscape is changing, or whether we're approaching a bowl-shaped minimum or a saddle point.
For this, we need second-order information: how do the first derivatives themselves change? This is captured by the Hessian matrix—a square matrix of all second partial derivatives that encodes the complete curvature information of a function.
By the end of this page, you'll understand the Hessian matrix structure, interpret its eigenvalues for curvature analysis, classify critical points, and see how second-order methods accelerate optimization.
Formal Definition:
For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian is the $n \times n$ matrix of second partial derivatives:
$$H_f = \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$
Key Property: Symmetry
By Schwarz's theorem (for smooth functions): $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$
Thus $H_f = H_f^T$—the Hessian is symmetric. By the spectral theorem, this guarantees real eigenvalues and an orthonormal set of eigenvectors.
Example:
For $f(x, y) = x^3 - 2xy + y^2$:
$$H_f = \begin{bmatrix} 6x & -2 \\ -2 & 2 \end{bmatrix}$$
At point $(1, 1)$: $H_f = \begin{bmatrix} 6 & -2 \\ -2 & 2 \end{bmatrix}$
```python
import numpy as np

def hessian(f, point, h=1e-5):
    """Compute Hessian matrix numerically."""
    n = len(point)
    H = np.zeros((n, n))
    point = np.array(point, dtype=float)
    for i in range(n):
        for j in range(n):
            # Mixed partial: [f(x+hi+hj) - f(x+hi-hj) - f(x-hi+hj) + f(x-hi-hj)] / 4h²
            pp = point.copy()
            pm = point.copy()
            mp = point.copy()
            mm = point.copy()
            pp[i] += h; pp[j] += h
            pm[i] += h; pm[j] -= h
            mp[i] -= h; mp[j] += h
            mm[i] -= h; mm[j] -= h
            H[i, j] = (f(pp) - f(pm) - f(mp) + f(mm)) / (4 * h * h)
    return H

# Example: f(x,y) = x³ - 2xy + y²
def f(p):
    x, y = p[0], p[1]
    return x**3 - 2*x*y + y**2

def hessian_analytical(p):
    x, y = p[0], p[1]
    return np.array([
        [6*x, -2],
        [-2, 2]
    ])

point = np.array([1.0, 1.0])
H_num = hessian(f, point)
H_ana = hessian_analytical(point)

print("f(x,y) = x³ - 2xy + y² at point (1, 1)")
print("Numerical Hessian:")
print(H_num)
print("Analytical Hessian:")
print(H_ana)

# Eigenvalue analysis
eigenvalues = np.linalg.eigvalsh(H_ana)
print(f"Eigenvalues: {eigenvalues}")
print(f"All positive? {np.all(eigenvalues > 0)} → {'Local minimum' if np.all(eigenvalues > 0) else 'Not a minimum'}")
```

Eigenvalues Encode Curvature:
The eigenvalues of the Hessian $\lambda_1, \lambda_2, \ldots, \lambda_n$ represent principal curvatures:
| Eigenvalue Sign | Curvature | Surface Shape |
|---|---|---|
| $\lambda > 0$ | Positive (curves up) | Bowl-like |
| $\lambda < 0$ | Negative (curves down) | Dome-like |
| $\lambda = 0$ | Flat | Linear in that direction |
Quadratic Approximation:
Near a point $\mathbf{a}$, the Taylor expansion gives: $$f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^T(\mathbf{x}-\mathbf{a}) + \frac{1}{2}(\mathbf{x}-\mathbf{a})^T H(\mathbf{a})(\mathbf{x}-\mathbf{a})$$
The Hessian determines the quadratic term—how the function curves away from its linear approximation.
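To see this concretely, here's a minimal sketch using the running example $f(x, y) = x^3 - 2xy + y^2$, expanded around $\mathbf{a} = (1, 1)$ with its analytical gradient and Hessian:

```python
import numpy as np

def f(p):
    x, y = p
    return x**3 - 2*x*y + y**2

a = np.array([1.0, 1.0])
grad_a = np.array([3*a[0]**2 - 2*a[1], -2*a[0] + 2*a[1]])  # ∇f(a) = (1, 0)
H_a = np.array([[6*a[0], -2.0], [-2.0, 2.0]])              # Hessian at a

def quad_approx(p):
    """Second-order Taylor approximation of f around a."""
    d = np.array(p) - a
    return f(a) + grad_a @ d + 0.5 * d @ H_a @ d

x = np.array([1.1, 0.9])
print(f(x))            # 0.161
print(quad_approx(x))  # 0.16
```

The small gap between the two values is the cubic remainder term, which shrinks rapidly as $\mathbf{x}$ approaches $\mathbf{a}$.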
The condition number $\kappa = \frac{|\lambda_{max}|}{|\lambda_{min}|}$ measures how 'elongated' the curvature is. High $\kappa$ (ill-conditioned) means very different curvatures in different directions, making gradient descent oscillate and converge slowly.
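A quick way to see the effect of conditioning is to run plain gradient descent on two diagonal quadratics $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T H \mathbf{x}$; the step size ($1/\lambda_{max}$) and tolerance below are illustrative choices, not canonical ones:

```python
import numpy as np

def gd_steps(H, x0, tol=1e-6, max_iter=100_000):
    """Iterations of gradient descent on f(x) = 0.5 xᵀHx until ‖∇f‖ < tol."""
    lr = 1.0 / np.max(np.linalg.eigvalsh(H))  # step size tied to largest curvature
    x = x0.copy()
    for k in range(max_iter):
        g = H @ x                 # gradient of the quadratic
        if np.linalg.norm(g) < tol:
            return k
        x -= lr * g
    return max_iter

x0 = np.array([1.0, 1.0])
print(gd_steps(np.diag([2.0, 1.0]), x0))    # κ = 2: converges in a few dozen steps
print(gd_steps(np.diag([100.0, 1.0]), x0))  # κ = 100: orders of magnitude more steps
```

The same tolerance takes vastly longer to reach in the ill-conditioned case, because progress along the flat direction is throttled by the steep one.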
The Second Derivative Test (Multivariate):
At a critical point where $\nabla f = \mathbf{0}$, the Hessian determines the nature:
| Hessian Property | Eigenvalues | Classification |
|---|---|---|
| Positive definite | All $\lambda_i > 0$ | Local minimum |
| Negative definite | All $\lambda_i < 0$ | Local maximum |
| Indefinite | Mixed signs | Saddle point |
| Singular | Some $\lambda_i = 0$ | Inconclusive |
Definiteness Checks:
```python
import numpy as np

def classify_critical_point(H):
    """Classify critical point based on Hessian."""
    eigenvalues = np.linalg.eigvalsh(H)
    if np.all(eigenvalues > 0):
        return "Local Minimum", eigenvalues
    elif np.all(eigenvalues < 0):
        return "Local Maximum", eigenvalues
    elif np.any(eigenvalues == 0):
        return "Inconclusive (singular)", eigenvalues
    else:
        return "Saddle Point", eigenvalues

# Example 1: Bowl (minimum)
H_bowl = np.array([[4, 0], [0, 2]])
print("Bowl: f = 2x² + y²")
result, eigvals = classify_critical_point(H_bowl)
print(f"  Eigenvalues: {eigvals}")
print(f"  Classification: {result}")

# Example 2: Dome (maximum)
H_dome = np.array([[-4, 0], [0, -2]])
print("Dome: f = -2x² - y²")
result, eigvals = classify_critical_point(H_dome)
print(f"  Eigenvalues: {eigvals}")
print(f"  Classification: {result}")

# Example 3: Saddle
H_saddle = np.array([[2, 0], [0, -3]])
print("Saddle: f = x² - 1.5y²")
result, eigvals = classify_critical_point(H_saddle)
print(f"  Eigenvalues: {eigvals}")
print(f"  Classification: {result}")

# Example 4: Ill-conditioned (elongated bowl)
H_ill = np.array([[100, 0], [0, 1]])
print("Ill-conditioned: f = 50x² + 0.5y²")
result, eigvals = classify_critical_point(H_ill)
print(f"  Eigenvalues: {eigvals}")
print(f"  Condition number: {max(eigvals)/min(eigvals):.0f}")
print(f"  Classification: {result}")
```

Newton's Method for Optimization:
Instead of just using gradient direction, Newton's method uses curvature to take smarter steps:
$$\mathbf{x}_{t+1} = \mathbf{x}_t - H(\mathbf{x}_t)^{-1} \nabla f(\mathbf{x}_t)$$
Why This Works:
For a quadratic function, Newton converges in one step. For general functions, it converges much faster than gradient descent near the minimum.
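A sketch of the one-step property on a quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} - \mathbf{b}^T \mathbf{x}$, which has gradient $A\mathbf{x} - \mathbf{b}$ and constant Hessian $A$ (the values of $A$, $\mathbf{b}$, and the start point are arbitrary illustrations):

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])  # Hessian of the quadratic (positive definite)
b = np.array([1.0, 2.0])

x = np.array([10.0, -7.0])              # arbitrary starting point
grad = A @ x - b

# One Newton step: x - H⁻¹∇f, computed via a linear solve (never form the inverse)
x_new = x - np.linalg.solve(A, grad)

print(x_new)                  # the exact minimizer, A⁻¹b
print(np.linalg.solve(A, b))  # same point
```

Solving $H\mathbf{d} = \nabla f$ rather than inverting $H$ is both cheaper and numerically safer, and it is how Newton steps are computed in practice.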
The Tradeoff:
| Method | Per-iteration | Iterations | Best For |
|---|---|---|---|
| Gradient Descent | $O(n)$ | Many | Large-scale ML |
| Newton's Method | $O(n^3)$ | Few | Small problems |
Methods like BFGS and L-BFGS approximate the inverse Hessian from gradient information, achieving near-Newton convergence without computing actual second derivatives. L-BFGS is the standard for many classical ML problems.
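In practice you rarely implement these yourself. A minimal sketch using SciPy's L-BFGS-B implementation on the built-in Rosenbrock function (this assumes SciPy is installed; only function values and gradients are supplied, and the inverse-Hessian approximation is built internally):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0])  # classic hard starting point for Rosenbrock

# L-BFGS: quasi-Newton steps from gradient history, no explicit Hessian
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")

print(res.x)    # ≈ [1, 1], the global minimum
print(res.nit)  # iteration count
```

Despite never evaluating a second derivative, the solver reaches the minimum in a modest number of iterations, far fewer than plain gradient descent would need on this ill-conditioned valley.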
Why the Hessian Matters for ML:
The Saddle Point Problem:
In high-dimensional neural networks, critical points are almost always saddle points, not local minima. The Hessian has both positive and negative eigenvalues. Gradient descent can get stuck near saddles where the gradient is small. This is why momentum and adaptive methods (Adam) help—they escape saddles faster.
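To see why a small gradient near a saddle stalls progress, here's a toy sketch with $f(x, y) = x^2 - y^2$, whose origin is a saddle point; the starting offset and learning rate are illustrative choices:

```python
import numpy as np

def grad(p):
    """Gradient of f(x, y) = x² - y²."""
    return np.array([2 * p[0], -2 * p[1]])

p = np.array([1.0, 1e-8])  # almost exactly on the saddle's stable direction
lr = 0.1
for step in range(200):
    p = p - lr * grad(p)
    if abs(p[1]) > 1.0:    # finally escaping along the negative-curvature direction
        break
print(step, p)
```

The $x$ component collapses toward the saddle almost immediately, but roughly a hundred iterations pass with tiny gradients before the negative-curvature direction amplifies the offset enough to escape; momentum-style methods accumulate that escape velocity much sooner.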
You've mastered the differential calculus foundations for ML: derivatives, chain rule, gradients, directional derivatives, and Hessians. You now have the complete toolkit for understanding how optimization algorithms work and why they succeed or fail.