The gradient tells us direction—which way is downhill. But it doesn't tell us how fast the landscape is changing, or whether we're approaching a bowl-shaped minimum or a saddle point.
For this, we need second-order information: how do the first derivatives themselves change? This is captured by the Hessian matrix—a square matrix of all second partial derivatives that encodes the complete curvature information of a function.
By the end of this page, you'll understand the Hessian matrix structure, interpret its eigenvalues for curvature analysis, classify critical points, and see how second-order methods accelerate optimization.
Formal Definition:
For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian is the $n \times n$ matrix of second partial derivatives:
$$H_f = \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$
Key Property: Symmetry
By Schwarz's theorem (for smooth functions): $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$
Thus $H_f = H_f^T$—the Hessian is symmetric. By the spectral theorem, this guarantees real eigenvalues and an orthonormal set of eigenvectors.
Example:
For $f(x, y) = x^3 - 2xy + y^2$:
$$H_f = \begin{bmatrix} 6x & -2 \\ -2 & 2 \end{bmatrix}$$
At point $(1, 1)$: $H_f = \begin{bmatrix} 6 & -2 \\ -2 & 2 \end{bmatrix}$
```python
import numpy as np

def hessian(f, point, h=1e-5):
    """Compute Hessian matrix numerically."""
    n = len(point)
    H = np.zeros((n, n))
    point = np.array(point, dtype=float)
    for i in range(n):
        for j in range(n):
            # Mixed partial: [f(x+hi+hj) - f(x+hi-hj) - f(x-hi+hj) + f(x-hi-hj)] / 4h²
            pp = point.copy()
            pm = point.copy()
            mp = point.copy()
            mm = point.copy()
            pp[i] += h; pp[j] += h
            pm[i] += h; pm[j] -= h
            mp[i] -= h; mp[j] += h
            mm[i] -= h; mm[j] -= h
            H[i, j] = (f(pp) - f(pm) - f(mp) + f(mm)) / (4 * h * h)
    return H

# Example: f(x,y) = x³ - 2xy + y²
def f(p):
    x, y = p[0], p[1]
    return x**3 - 2*x*y + y**2

def hessian_analytical(p):
    x, y = p[0], p[1]
    return np.array([
        [6*x, -2],
        [-2, 2]
    ])

point = np.array([1.0, 1.0])
H_num = hessian(f, point)
H_ana = hessian_analytical(point)

print("f(x,y) = x³ - 2xy + y² at point (1, 1)")
print("Numerical Hessian:")
print(H_num)
print("Analytical Hessian:")
print(H_ana)

# Eigenvalue analysis
eigenvalues = np.linalg.eigvalsh(H_ana)
print(f"Eigenvalues: {eigenvalues}")
print(f"All positive? {np.all(eigenvalues > 0)} → {'Local minimum' if np.all(eigenvalues > 0) else 'Not a minimum'}")
```

Eigenvalues Encode Curvature:
The eigenvalues of the Hessian $\lambda_1, \lambda_2, \ldots, \lambda_n$ represent principal curvatures:
| Eigenvalue Sign | Curvature | Surface Shape |
|---|---|---|
| $\lambda > 0$ | Positive (curves up) | Bowl-like |
| $\lambda < 0$ | Negative (curves down) | Dome-like |
| $\lambda = 0$ | Flat | Linear in that direction |
Quadratic Approximation:
Near a point $\mathbf{a}$, the Taylor expansion gives: $$f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^T(\mathbf{x}-\mathbf{a}) + \frac{1}{2}(\mathbf{x}-\mathbf{a})^T H(\mathbf{a})(\mathbf{x}-\mathbf{a})$$
The Hessian determines the quadratic term—how the function curves away from its linear approximation.
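To see this concretely, here's a minimal sketch using the running example $f(x, y) = x^3 - 2xy + y^2$, expanded around $\mathbf{a} = (1, 1)$ with its analytical gradient and Hessian:

```python
import numpy as np

def f(p):
    x, y = p
    return x**3 - 2*x*y + y**2

a = np.array([1.0, 1.0])
grad_a = np.array([3*a[0]**2 - 2*a[1], -2*a[0] + 2*a[1]])  # ∇f(a) = (1, 0)
H_a = np.array([[6*a[0], -2.0], [-2.0, 2.0]])              # Hessian at a

def quad_approx(p):
    """Second-order Taylor approximation of f around a."""
    d = np.array(p) - a
    return f(a) + grad_a @ d + 0.5 * d @ H_a @ d

x = np.array([1.1, 0.9])
print(f(x))            # 0.161
print(quad_approx(x))  # 0.16
```

The small gap between the two values is the cubic remainder term, which shrinks rapidly as $\mathbf{x}$ approaches $\mathbf{a}$.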
The condition number $\kappa = \frac{|\lambda_{max}|}{|\lambda_{min}|}$ measures how 'elongated' the curvature is. High $\kappa$ (ill-conditioned) means very different curvatures in different directions, making gradient descent oscillate and converge slowly.
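A quick way to see the effect of conditioning is to run plain gradient descent on two diagonal quadratics $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T H \mathbf{x}$; the step size ($1/\lambda_{max}$) and tolerance below are illustrative choices, not canonical ones:

```python
import numpy as np

def gd_steps(H, x0, tol=1e-6, max_iter=100_000):
    """Iterations of gradient descent on f(x) = 0.5 xᵀHx until ‖∇f‖ < tol."""
    lr = 1.0 / np.max(np.linalg.eigvalsh(H))  # step size tied to largest curvature
    x = x0.copy()
    for k in range(max_iter):
        g = H @ x                 # gradient of the quadratic
        if np.linalg.norm(g) < tol:
            return k
        x -= lr * g
    return max_iter

x0 = np.array([1.0, 1.0])
print(gd_steps(np.diag([2.0, 1.0]), x0))    # κ = 2: converges in a few dozen steps
print(gd_steps(np.diag([100.0, 1.0]), x0))  # κ = 100: orders of magnitude more steps
```

The same tolerance takes vastly longer to reach in the ill-conditioned case, because progress along the flat direction is throttled by the steep one.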
The Second Derivative Test (Multivariate):
At a critical point where $\nabla f = \mathbf{0}$, the Hessian determines the nature:
| Hessian Property | Eigenvalues | Classification |
|---|---|---|
| Positive definite | All $\lambda_i > 0$ | Local minimum |
| Negative definite | All $\lambda_i < 0$ | Local maximum |
| Indefinite | Mixed signs | Saddle point |
| Singular | Some $\lambda_i = 0$ | Inconclusive |
Definiteness Checks:
```python
import numpy as np

def classify_critical_point(H):
    """Classify critical point based on Hessian."""
    eigenvalues = np.linalg.eigvalsh(H)
    if np.all(eigenvalues > 0):
        return "Local Minimum", eigenvalues
    elif np.all(eigenvalues < 0):
        return "Local Maximum", eigenvalues
    elif np.any(eigenvalues == 0):
        return "Inconclusive (singular)", eigenvalues
    else:
        return "Saddle Point", eigenvalues

# Example 1: Bowl (minimum)
H_bowl = np.array([[4, 0], [0, 2]])
print("Bowl: f = 2x² + y²")
result, eigvals = classify_critical_point(H_bowl)
print(f"  Eigenvalues: {eigvals}")
print(f"  Classification: {result}")

# Example 2: Dome (maximum)
H_dome = np.array([[-4, 0], [0, -2]])
print("Dome: f = -2x² - y²")
result, eigvals = classify_critical_point(H_dome)
print(f"  Eigenvalues: {eigvals}")
print(f"  Classification: {result}")

# Example 3: Saddle
H_saddle = np.array([[2, 0], [0, -3]])
print("Saddle: f = x² - 1.5y²")
result, eigvals = classify_critical_point(H_saddle)
print(f"  Eigenvalues: {eigvals}")
print(f"  Classification: {result}")

# Example 4: Ill-conditioned (elongated bowl)
H_ill = np.array([[100, 0], [0, 1]])
print("Ill-conditioned: f = 50x² + 0.5y²")
result, eigvals = classify_critical_point(H_ill)
print(f"  Eigenvalues: {eigvals}")
print(f"  Condition number: {max(eigvals)/min(eigvals):.0f}")
print(f"  Classification: {result}")
```

Newton's Method for Optimization:
Instead of just using gradient direction, Newton's method uses curvature to take smarter steps:
$$\mathbf{x}_{t+1} = \mathbf{x}_t - H(\mathbf{x}_t)^{-1} \nabla f(\mathbf{x}_t)$$
Why This Works:
For a quadratic function, Newton converges in one step. For general functions, it converges much faster than gradient descent near the minimum.
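A sketch of the one-step property on a quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} - \mathbf{b}^T \mathbf{x}$, which has gradient $A\mathbf{x} - \mathbf{b}$ and constant Hessian $A$ (the values of $A$, $\mathbf{b}$, and the start point are arbitrary illustrations):

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])  # Hessian of the quadratic (positive definite)
b = np.array([1.0, 2.0])

x = np.array([10.0, -7.0])              # arbitrary starting point
grad = A @ x - b

# One Newton step: x - H⁻¹∇f, computed via a linear solve (never form the inverse)
x_new = x - np.linalg.solve(A, grad)

print(x_new)                  # the exact minimizer, A⁻¹b
print(np.linalg.solve(A, b))  # same point
```

Solving $H\mathbf{d} = \nabla f$ rather than inverting $H$ is both cheaper and numerically safer, and it is how Newton steps are computed in practice.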
The Tradeoff:
| Method | Per-iteration | Iterations | Best For |
|---|---|---|---|
| Gradient Descent | $O(n)$ | Many | Large-scale ML |
| Newton's Method | $O(n^3)$ | Few | Small problems |
Methods like BFGS and L-BFGS approximate the inverse Hessian from gradient information, achieving near-Newton convergence without computing actual second derivatives. L-BFGS is the standard for many classical ML problems.
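In practice you rarely implement these yourself. A minimal sketch using SciPy's L-BFGS-B implementation on the built-in Rosenbrock function (this assumes SciPy is installed; only function values and gradients are supplied, and the inverse-Hessian approximation is built internally):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0])  # classic hard starting point for Rosenbrock

# L-BFGS: quasi-Newton steps from gradient history, no explicit Hessian
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")

print(res.x)    # ≈ [1, 1], the global minimum
print(res.nit)  # iteration count
```

Despite never evaluating a second derivative, the solver reaches the minimum in a modest number of iterations, far fewer than plain gradient descent would need on this ill-conditioned valley.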
Why the Hessian Matters for ML:
The Saddle Point Problem:
In high-dimensional neural networks, critical points are almost always saddle points, not local minima. The Hessian has both positive and negative eigenvalues. Gradient descent can get stuck near saddles where the gradient is small. This is why momentum and adaptive methods (Adam) help—they escape saddles faster.
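To see why a small gradient near a saddle stalls progress, here's a toy sketch with $f(x, y) = x^2 - y^2$, whose origin is a saddle point; the starting offset and learning rate are illustrative choices:

```python
import numpy as np

def grad(p):
    """Gradient of f(x, y) = x² - y²."""
    return np.array([2 * p[0], -2 * p[1]])

p = np.array([1.0, 1e-8])  # almost exactly on the saddle's stable direction
lr = 0.1
for step in range(200):
    p = p - lr * grad(p)
    if abs(p[1]) > 1.0:    # finally escaping along the negative-curvature direction
        break
print(step, p)
```

The $x$ component collapses toward the saddle almost immediately, but roughly a hundred iterations pass with tiny gradients before the negative-curvature direction amplifies the offset enough to escape; momentum-style methods accumulate that escape velocity much sooner.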
You've mastered the differential calculus foundations for ML: derivatives, chain rule, gradients, directional derivatives, and Hessians. You now have the complete toolkit for understanding how optimization algorithms work and why they succeed or fail.