Gradient descent treats all directions equally, descending purely based on slope. But loss landscapes are curved—some directions have gentle slopes while others plummet steeply. Some regions are nearly flat; others curve sharply. Ignoring this curvature information is like navigating a mountain using only a compass, without considering the terrain's shape.
The Hessian matrix captures this curvature information. It tells us:
- how sharply the loss curves along each direction (the second derivatives),
- whether a critical point is a minimum, maximum, or saddle,
- how well-conditioned the landscape is, and therefore how fast gradient descent can converge.
Understanding the Hessian transforms optimization from a blind descent into an informed navigation of the loss landscape. While computing the full Hessian is expensive for neural networks, the conceptual understanding guides the design of practical approximations like Adam, natural gradient, and K-FAC.
This page provides a comprehensive treatment of second-order approximations and the Hessian, completing our foundation in multivariate calculus for machine learning.
By the end of this page, you will understand the Hessian matrix in depth: its computation, spectral properties, role in critical point classification, connection to conditioning and convergence, and practical approaches for leveraging second-order information in large-scale optimization.
Let's formalize and deepen our understanding of the Hessian matrix.
Formal Definition:
For f: ℝⁿ → ℝ with continuous second partial derivatives, the Hessian H_f(x) ∈ ℝⁿˣⁿ is:
$$[\mathbf{H}_f(\mathbf{x})]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$
Computing Second Derivatives:
Second partials are computed by differentiating twice:
$$\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial}{\partial x_i}\left( \frac{\partial f}{\partial x_j} \right)$$
Schwarz's Theorem (Symmetry):
If second partials are continuous, mixed partials are equal:
$$\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$$
Therefore H = Hᵀ (Hessian is symmetric). This guarantees real eigenvalues and orthogonal eigenvectors.
Alternative Definitions:
The Hessian is the Jacobian of the gradient: H_f(x) = J_{∇f}(x). It is also commonly written ∇²f(x).
Size Considerations:
For a neural network with n parameters:
- the Hessian has n² entries, so memory grows quadratically with model size,
- inverting it (or solving a linear system with it) costs O(n³),
- even a single dense Hessian-vector multiply costs O(n²).
Explicit storage is impossible for modern networks. This drives research into Hessian-free and diagonal/structured approximations.
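To make the quadratic blow-up concrete, here is a quick back-of-the-envelope calculation (the parameter counts are illustrative, assuming float32 storage):

```python
# Rough memory needed to store a dense float32 Hessian for n parameters.
# The parameter counts below are illustrative, not tied to any specific model.
for n in [1e3, 1e6, 1e8]:
    bytes_needed = n * n * 4          # n² entries, 4 bytes each
    print(f"n = {n:>8.0e}: {bytes_needed / 1e9:,.3f} GB")
# n = 1e3  ->  ~0.004 GB (trivial)
# n = 1e6  ->  ~4,000 GB (already impractical)
# n = 1e8  ->  ~40,000,000 GB (hopeless)
```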
```python
import numpy as np
import torch
import torch.nn as nn

# ================================================
# Method 1: Analytical Hessian for simple function
# ================================================

def example_function(x, y, z):
    """f(x, y, z) = x²y + yz² + exp(xyz)"""
    return x**2 * y + y * z**2 + np.exp(x * y * z)

def analytical_hessian(x, y, z):
    """Analytically derived Hessian of the example function."""
    e = np.exp(x * y * z)

    # Diagonal entries (∂²f/∂xᵢ²)
    H_xx = 2*y + (y*z)**2 * e
    H_yy = (x*z)**2 * e
    H_zz = 2*y + (x*y)**2 * e

    # Off-diagonal entries (∂²f/∂xᵢ∂xⱼ)
    H_xy = 2*x + z*e + x*y*z**2 * e
    H_xz = y*e + x*y**2*z * e
    H_yz = 2*z + x*e + x**2*y*z * e

    return np.array([
        [H_xx, H_xy, H_xz],
        [H_xy, H_yy, H_yz],
        [H_xz, H_yz, H_zz]
    ])

# ================================================
# Method 2: Numerical Hessian via finite differences
# ================================================

def numerical_hessian(f, point, h=1e-5):
    """
    Compute Hessian numerically using central differences.

    H_ij ≈ (f(x + h*eᵢ + h*eⱼ) - f(x + h*eᵢ - h*eⱼ)
            - f(x - h*eᵢ + h*eⱼ) + f(x - h*eᵢ - h*eⱼ)) / (4h²)
    """
    n = len(point)
    H = np.zeros((n, n))

    for i in range(n):
        for j in range(n):
            # Create perturbation vectors
            ei = np.zeros(n)
            ei[i] = h
            ej = np.zeros(n)
            ej[j] = h

            # Central difference formula for mixed partial
            f_pp = f(*(point + ei + ej))
            f_pm = f(*(point + ei - ej))
            f_mp = f(*(point - ei + ej))
            f_mm = f(*(point - ei - ej))

            H[i, j] = (f_pp - f_pm - f_mp + f_mm) / (4 * h**2)

    return H

# ================================================
# Method 3: Automatic differentiation (PyTorch)
# ================================================

def torch_hessian(f, x):
    """
    Compute Hessian using PyTorch autograd.
    Requires n gradient passes (one per input dimension).
    """
    x = x.clone().requires_grad_(True)
    y = f(x)

    # First, compute gradient
    grad = torch.autograd.grad(y, x, create_graph=True)[0]

    # Then compute Jacobian of gradient (= Hessian)
    n = x.numel()
    H = torch.zeros(n, n)
    for i in range(n):
        # Compute gradient of grad[i] w.r.t. x
        grad2 = torch.autograd.grad(grad[i], x, retain_graph=True)[0]
        H[i] = grad2

    return H

# Test the methods
point = np.array([1.0, 2.0, 0.5])

print("Hessian Computation Methods Comparison")
print("=" * 60)
print(f"Point: {point}")
print()

# Analytical
H_analytical = analytical_hessian(*point)
print("Analytical Hessian:")
print(H_analytical.round(4))
print()

# Numerical
H_numerical = numerical_hessian(example_function, point)
print("Numerical Hessian (finite differences):")
print(H_numerical.round(4))
print()

# Symmetry check
print(f"Symmetry check (max |H - Hᵀ|): {np.max(np.abs(H_analytical - H_analytical.T)):.2e}")
print()

# Eigenvalue analysis
eigenvalues, eigenvectors = np.linalg.eigh(H_analytical)
print("Eigenvalue Analysis:")
print(f"Eigenvalues: {eigenvalues.round(4)}")
print(f"All positive (positive definite)? {all(eigenvalues > 0)}")

# Condition number
cond_number = max(abs(eigenvalues)) / min(abs(eigenvalues))
print(f"Condition number: {cond_number:.2f}")

# ================================================
# Method 4: Hessian-vector product (efficient)
# ================================================

def hessian_vector_product(f, x, v):
    """
    Compute H @ v without forming full Hessian.
    Uses two backward passes but never stores H.
    """
    x = x.clone().requires_grad_(True)
    v = v.clone().detach()

    # Forward pass
    y = f(x)

    # First backward: compute gradient
    grad = torch.autograd.grad(y, x, create_graph=True)[0]

    # Second backward: compute ∂(grad · v)/∂x = Hv
    Hv = torch.autograd.grad(torch.dot(grad, v), x)[0]

    return Hv

print("\n" + "=" * 60)
print("Hessian-Vector Product (efficient method)")
print("=" * 60)

def f_torch(x):
    return x[0]**2 * x[1] + x[1] * x[2]**2 + torch.exp(x[0] * x[1] * x[2])

x_torch = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)
v = torch.tensor([1.0, 0.0, 0.0])  # First standard basis vector

Hv = hessian_vector_product(f_torch, x_torch, v)
print(f"H @ e₁ (first column of H): {Hv.detach().numpy().round(4)}")
print(f"Compare to analytical first column: {H_analytical[:, 0].round(4)}")
```

We can compute Hv (Hessian times a vector) in O(n) time using two backpropagation passes, without ever forming the full O(n²) Hessian. This enables Hessian-free optimization and power iteration for eigenvalue estimation in large-scale settings.
The eigenvalue decomposition of the Hessian reveals the curvature structure of the loss landscape along different directions.
Eigendecomposition:
Since H is symmetric, it admits the decomposition:
$$\mathbf{H} = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^\top = \sum_{i=1}^{n} \lambda_i \mathbf{v}_i \mathbf{v}_i^\top$$
where:
- V = [v₁, ..., vₙ] is an orthogonal matrix whose columns are the eigenvectors,
- Λ = diag(λ₁, ..., λₙ) holds the corresponding eigenvalues.
Curvature Along Eigenvectors:
The eigenvalue λᵢ is the curvature (second derivative) of f along direction vᵢ:
$$\frac{d^2}{dt^2} f(\mathbf{x} + t\mathbf{v}_i)\bigg|_{t=0} = \mathbf{v}_i^\top \mathbf{H} \mathbf{v}_i = \lambda_i$$
General curvature in direction u (unit vector):
$$\text{curvature along } \mathbf{u} = \mathbf{u}^\top \mathbf{H} \mathbf{u} \in [\lambda_{\min}, \lambda_{\max}]$$
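A quick numerical check of this bound, using a random symmetric matrix as a stand-in Hessian (a minimal sketch, not tied to any model):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = (A + A.T) / 2                      # arbitrary symmetric "Hessian"
lam = np.linalg.eigvalsh(H)            # eigenvalues, sorted ascending

# The Rayleigh quotient uᵀHu for any unit vector u lies in [λ_min, λ_max]
for _ in range(5):
    u = rng.standard_normal(5)
    u /= np.linalg.norm(u)
    q = u @ H @ u
    print(f"uᵀHu = {q:+.4f}  (λ_min = {lam[0]:+.4f}, λ_max = {lam[-1]:+.4f})")
```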
Spectrum Characterization:
The collection of eigenvalues (the spectrum) characterizes the Hessian:
| Spectrum Property | Hessian Type | Loss Landscape Shape |
|---|---|---|
| All λᵢ > 0 | Positive definite | Bowl (convex), unique minimum |
| All λᵢ < 0 | Negative definite | Inverted bowl, unique maximum |
| λᵢ > 0 and λⱼ < 0 exist | Indefinite | Saddle point |
| All λᵢ ≥ 0, some = 0 | Positive semidefinite | Flat directions exist |
| λ_max >> λ_min > 0 | Ill-conditioned | Elongated elliptical contours |
The Condition Number:
$$\kappa = \frac{|\lambda_{\max}|}{|\lambda_{\min}|}$$
The condition number measures the eccentricity of the loss landscape: κ ≈ 1 means nearly circular contours (all directions curve similarly), while κ ≫ 1 means long, narrow valleys in which some directions are far steeper than others.
Impact on Gradient Descent:
For a quadratic loss, GD achieves:
$$\text{error after } k \text{ steps} \propto \left(\frac{\kappa - 1}{\kappa + 1}\right)^k$$
With κ = 1000, we need thousands of iterations. With κ = 10, only tens.
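The scaling is easy to reproduce on a toy diagonal quadratic. The sketch below (illustrative step size 1/λ_max and tolerance) counts gradient-descent iterations; the absolute counts depend on the tolerance, but the roughly linear growth with κ is the point:

```python
import numpy as np

def gd_iterations_to_converge(kappa, tol=1e-6, max_iters=100_000):
    """Run GD on f(θ) = ½(θ₁² + κ·θ₂²) and count iterations until ‖θ‖ < tol."""
    H = np.diag([1.0, kappa])
    theta = np.array([1.0, 1.0])
    lr = 1.0 / kappa                       # step size ~ 1/λ_max for stability
    for k in range(max_iters):
        if np.linalg.norm(theta) < tol:
            return k
        theta = theta - lr * (H @ theta)   # gradient of the quadratic is Hθ
    return max_iters

for kappa in [10, 100, 1000]:
    print(f"κ = {kappa:5d}: {gd_iterations_to_converge(kappa)} iterations")
```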
Spectrum of Neural Network Losses:
Empirical studies of neural network Hessians reveal:
- a large "bulk" of eigenvalues concentrated near zero (many nearly flat directions),
- a small number of large outlier eigenvalues (a few sharp directions),
- as a result, very large condition numbers.
This structure motivates adaptive optimizers that scale per-direction.
Neural network Hessians often have condition numbers κ > 10⁶. This explains why vanilla gradient descent fails: the largest eigenvalue limits the learning rate (to prevent divergence), but then progress along small-eigenvalue directions is glacial. Adaptive methods like Adam address this by implicitly rescaling directions.
A critical point (or stationary point) is where the gradient vanishes: ∇f(x*) = 0. The Hessian at a critical point tells us its nature.
The Second Derivative Test:
At a critical point x* where ∇f(x*) = 0, the Taylor expansion becomes:
$$f(\mathbf{x}^* + \Delta\mathbf{x}) \approx f(\mathbf{x}^*) + \frac{1}{2} \Delta\mathbf{x}^\top \mathbf{H}(\mathbf{x}^*) \Delta\mathbf{x}$$
The quadratic term determines whether x* is locally a minimum, maximum, or saddle.
Classification Rules:
- H(x*) positive definite (all λᵢ > 0): x* is a local minimum.
- H(x*) negative definite (all λᵢ < 0): x* is a local maximum.
- H(x*) indefinite (mixed signs): x* is a saddle point.
- H(x*) singular (some λᵢ = 0): the test is inconclusive; higher-order terms decide.
Practical Tests for Definiteness:
Eigenvalue check: Compute eigenvalues and check signs. O(n³).
Sylvester's criterion: For positive definiteness, all leading principal minors must be positive: the determinant of every top-left k×k submatrix, for k = 1, ..., n.
Cholesky decomposition: H is positive definite iff Cholesky factorization H = LLᵀ succeeds.
Quadratic form sampling: Check the sign of vᵀHv for random v. (Can miss degenerate directions.)
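A minimal sketch of the eigenvalue and Cholesky tests (using NumPy's `np.linalg.cholesky`, which raises `LinAlgError` when the matrix is not positive definite):

```python
import numpy as np

def is_positive_definite_cholesky(H):
    """Positive definite iff the Cholesky factorization succeeds."""
    try:
        np.linalg.cholesky(H)
        return True
    except np.linalg.LinAlgError:
        return False

H_min    = np.array([[2.0, 0.0], [0.0, 4.0]])    # bowl
H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])   # saddle

for name, H in [("bowl", H_min), ("saddle", H_saddle)]:
    eigs = np.linalg.eigvalsh(H)
    print(f"{name}: eigenvalues {eigs}, "
          f"Cholesky says PD = {is_positive_definite_cholesky(H)}")
```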
High-Dimensional Reality:
At a random critical point in n dimensions, each Hessian eigenvalue is roughly equally likely to be positive or negative. The probability of all being positive (local minimum) is approximately 2⁻ⁿ—exponentially rare!
For n = 100 dimensions, P(local min) ≈ 10⁻³⁰. Almost all critical points are saddle points.
This is actually good news for optimization: gradient descent is rarely trapped at saddle points, because there is almost always a negative-curvature direction leading away from them—though progress can slow down nearby.
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def analyze_critical_point(H):
    """
    Analyze a Hessian matrix at a critical point.
    Returns classification and eigenvalue information.
    """
    eigenvalues = np.linalg.eigvalsh(H)  # Real eigenvalues for symmetric H

    pos = np.sum(eigenvalues > 1e-10)
    neg = np.sum(eigenvalues < -1e-10)
    zero = np.sum(np.abs(eigenvalues) <= 1e-10)

    if pos == len(eigenvalues):
        classification = "Local Minimum (positive definite)"
    elif neg == len(eigenvalues):
        classification = "Local Maximum (negative definite)"
    elif pos > 0 and neg > 0:
        classification = f"Saddle Point ({pos}+, {neg}-, {zero} zero)"
    elif zero > 0:
        classification = f"Degenerate ({pos}+, {neg}-, {zero} zero)"
    else:
        classification = "Unknown"

    condition_number = np.abs(eigenvalues).max() / max(np.abs(eigenvalues).min(), 1e-10)

    return {
        'eigenvalues': eigenvalues,
        'classification': classification,
        'condition_number': condition_number,
        'pos_count': pos,
        'neg_count': neg,
        'zero_count': zero
    }

# Example 1: Bowl (minimum)
print("Example 1: Quadratic Bowl f(x,y) = x² + 2y²")
print("-" * 50)
H_bowl = np.array([[2, 0], [0, 4]])
result = analyze_critical_point(H_bowl)
print(f"Hessian:\n{H_bowl}")
print(f"Eigenvalues: {result['eigenvalues']}")
print(f"Classification: {result['classification']}")
print(f"Condition number: {result['condition_number']:.2f}")
print()

# Example 2: Saddle point
print("Example 2: Hyperbolic Paraboloid f(x,y) = x² - y²")
print("-" * 50)
H_saddle = np.array([[2, 0], [0, -2]])
result = analyze_critical_point(H_saddle)
print(f"Hessian:\n{H_saddle}")
print(f"Eigenvalues: {result['eigenvalues']}")
print(f"Classification: {result['classification']}")
print()

# Example 3: Monkey saddle (higher-order saddle)
print("Example 3: f(x,y) = x³ - 3xy² at (0,0)")
print("-" * 50)
H_monkey = np.array([[0, 0], [0, 0]])  # Hessian is zero!
result = analyze_critical_point(H_monkey)
print(f"Hessian:\n{H_monkey}")
print(f"Eigenvalues: {result['eigenvalues']}")
print(f"Classification: {result['classification']}")
print("(Need higher-order analysis for degenerate case)")
print()

# Example 4: Ill-conditioned minimum
print("Example 4: Elongated bowl f(x,y) = x² + 100y²")
print("-" * 50)
H_ellipse = np.array([[2, 0], [0, 200]])
result = analyze_critical_point(H_ellipse)
print(f"Hessian:\n{H_ellipse}")
print(f"Eigenvalues: {result['eigenvalues']}")
print(f"Classification: {result['classification']}")
print(f"Condition number: {result['condition_number']:.2f}")
print("(High condition number = slow GD convergence)")
print()

# Visualization
fig = plt.figure(figsize=(15, 5))

examples = [
    (lambda x, y: x**2 + 2*y**2, "Minimum (Bowl)", "viridis"),
    (lambda x, y: x**2 - y**2, "Saddle Point", "RdBu"),
    (lambda x, y: x**2 + 100*y**2, "Ill-Conditioned Min", "viridis"),
]

for idx, (f, title, cmap) in enumerate(examples, 1):
    ax = fig.add_subplot(1, 3, idx, projection='3d')

    x = np.linspace(-2, 2, 50)
    y = np.linspace(-2, 2, 50)
    X, Y = np.meshgrid(x, y)
    Z = f(X, Y)

    ax.plot_surface(X, Y, Z, cmap=cmap, alpha=0.8)
    ax.scatter([0], [0], [f(0, 0)], color='red', s=100)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('f(x,y)')
    ax.set_title(title)

plt.tight_layout()
plt.savefig('critical_points.png', dpi=150, bbox_inches='tight')
plt.show()

# High-dimensional saddle point prevalence
print("\n" + "=" * 60)
print("Saddle Point Prevalence in High Dimensions")
print("=" * 60)

for n in [2, 5, 10, 20, 50, 100]:
    # Probability of local min = 2^(-n)
    p_min = 2**(-n)
    print(f"Dimension n={n:3d}: P(random critical point is local min) ≈ {p_min:.2e}")
```

The second-order Taylor expansion gives a quadratic model that we can minimize exactly, leading to Newton's method.
The Quadratic Model:
Approximate f near current point θ:
$$m(\Delta\theta) = f(\theta) + \nabla f^\top \Delta\theta + \frac{1}{2} \Delta\theta^\top \mathbf{H} \Delta\theta$$
To minimize this quadratic, differentiate and set to zero:
$$\nabla_\Delta m = \nabla f + \mathbf{H} \Delta\theta = \mathbf{0}$$
Newton Step:
$$\Delta\theta^* = -\mathbf{H}^{-1} \nabla f$$ $$\theta_{\text{new}} = \theta - \mathbf{H}^{-1} \nabla f$$
Geometric Interpretation:
Newton's method jumps directly to the minimum of the local quadratic approximation. If f is exactly quadratic, one step reaches the minimum. For general f, the method converges quadratically near a minimum.
Convergence Analysis:
Near a strict local minimum θ* with positive definite Hessian:
$$\|\theta_{k+1} - \theta^*\| \leq C \|\theta_k - \theta^*\|^2$$
This quadratic convergence means the number of correct digits roughly doubles each iteration. Compare to GD's linear convergence:
$$\|\theta_{k+1} - \theta^*\| \leq \rho \|\theta_k - \theta^*\|$$
with rate ρ = (κ-1)/(κ+1) for quadratic functions.
Pure Newton can diverge far from the minimum. Practical variants include: (1) Damped Newton: θ ← θ - α·H⁻¹∇f with line search for α. (2) Trust region: minimize quadratic model within a ball ‖Δθ‖ ≤ r. Both ensure progress even when the quadratic model is inaccurate.
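A minimal sketch of damped Newton with Armijo backtracking on a small test function (the function `f` and the constants are chosen purely for illustration):

```python
import numpy as np

def f(x):        # illustrative non-quadratic test function
    return x[0]**4 + x[0]**2 + 5*x[1]**2 + x[0]*x[1]

def grad(x):
    return np.array([4*x[0]**3 + 2*x[0] + x[1], 10*x[1] + x[0]])

def hess(x):
    return np.array([[12*x[0]**2 + 2, 1.0], [1.0, 10.0]])

x = np.array([2.0, 2.0])
for it in range(20):
    g, H = grad(x), hess(x)
    if np.linalg.norm(g) < 1e-10:
        break
    step = np.linalg.solve(H, -g)           # Newton direction: -H⁻¹∇f
    alpha = 1.0
    # Backtracking (Armijo): shrink α until the step gives sufficient decrease
    while f(x + alpha * step) > f(x) + 1e-4 * alpha * (g @ step):
        alpha *= 0.5
    x = x + alpha * step

print(f"Converged to x = {x.round(6)}, ‖∇f‖ = {np.linalg.norm(grad(x)):.2e}")
```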
Full Newton is impractical for neural networks, but various approximations capture curvature information at manageable cost.
1. Quasi-Newton Methods (BFGS, L-BFGS)
Idea: Build an approximation B ≈ H⁻¹ incrementally from gradient changes.
BFGS update: $$\mathbf{B}_{k+1} = \left(\mathbf{I} - \frac{\mathbf{s}_k \mathbf{y}_k^\top}{\mathbf{y}_k^\top \mathbf{s}_k}\right) \mathbf{B}_k \left(\mathbf{I} - \frac{\mathbf{y}_k \mathbf{s}_k^\top}{\mathbf{y}_k^\top \mathbf{s}_k}\right) + \frac{\mathbf{s}_k \mathbf{s}_k^\top}{\mathbf{y}_k^\top \mathbf{s}_k}$$
where sₖ = θₖ₊₁ − θₖ and yₖ = ∇fₖ₊₁ − ∇fₖ.
L-BFGS: Limited-memory variant stores only last m (s, y) pairs. O(n) per iteration.
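In practice one rarely hand-codes the update; a minimal sketch using SciPy's `L-BFGS-B` implementation on the Rosenbrock test function (assuming SciPy is available):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# L-BFGS-B only needs function values and gradients; curvature is
# approximated internally from the last m (s, y) pairs.
x0 = np.array([-1.2, 1.0, -1.2, 1.0])
result = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B",
                  options={"maxcor": 10})   # maxcor = memory size m

print(f"Minimum found at {result.x.round(4)} after {result.nit} iterations")
```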
2. Gauss-Newton (for Least Squares)
For loss L = ½‖r(θ)‖² (residual sum of squares):
H ≈ J_r^⊤ J_r
Ignores second derivatives of residuals. Guaranteed positive semidefinite!
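A minimal sketch of Gauss-Newton iterations for a tiny one-parameter fit y ≈ exp(a·t); the data and model are illustrative:

```python
import numpy as np

# Synthetic data for the illustrative model y = exp(a·t) with true a = 0.5
t = np.linspace(0, 2, 20)
y = np.exp(0.5 * t)

a = 0.0                                     # initial guess
for it in range(10):
    r = np.exp(a * t) - y                   # residuals r(a)
    J = (t * np.exp(a * t))[:, None]        # Jacobian ∂r/∂a, shape (20, 1)
    # Gauss-Newton: solve (JᵀJ) Δa = -Jᵀr  (ignores second derivatives of r)
    delta = np.linalg.solve(J.T @ J, -J.T @ r)
    a += delta.item()

print(f"Estimated a = {a:.6f} (true value 0.5)")
```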
3. Natural Gradient
Use the Fisher Information Matrix F instead of Hessian:
$$\theta \leftarrow \theta - \alpha \mathbf{F}^{-1} \nabla L$$
F = 𝔼[∇log p(y|x,θ) ∇log p(y|x,θ)^⊤]
The natural gradient accounts for the geometry of probability distributions.
4. K-FAC (Kronecker-Factored Approximate Curvature)
Approximate Fisher per layer using Kronecker products:
F_layer ≈ A ⊗ G
where A captures input covariance and G captures gradient covariance. Inversion costs O(layer_dim³) instead of O(total_params³).
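The computational win rests on the identity (A ⊗ G)⁻¹ = A⁻¹ ⊗ G⁻¹: only the small per-layer factors are ever inverted. A minimal numerical check using random SPD matrices as stand-ins for the factors:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(d):
    """Random symmetric positive definite matrix (stand-in for a K-FAC factor)."""
    M = rng.standard_normal((d, d))
    return M @ M.T + d * np.eye(d)

A = random_spd(3)    # input (activation) covariance factor
G = random_spd(4)    # output-gradient covariance factor

# Inverting the 12×12 Kronecker product directly...
direct = np.linalg.inv(np.kron(A, G))
# ...equals the Kronecker product of the small inverses (3×3 and 4×4).
factored = np.kron(np.linalg.inv(A), np.linalg.inv(G))

print("Max abs difference:", np.max(np.abs(direct - factored)))
```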
5. Diagonal Approximations
Simplest: approximate H ≈ diag(h₁, ..., hₙ). Then H⁻¹∇f is just element-wise division.
AdaGrad, RMSprop, Adam implicitly use diagonal curvature estimates based on accumulated squared gradients.
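A minimal sketch of a diagonal-preconditioned update in the RMSprop/Adam family, applied to an ill-conditioned quadratic (learning rate and decay constants are illustrative):

```python
import numpy as np

def rmsprop_like_step(theta, grad, v, lr=1e-2, beta=0.9, eps=1e-8):
    """
    Diagonal preconditioning: keep a running average of squared gradients
    per coordinate and divide the update element-wise — a cheap stand-in
    for multiplying by an (approximate) inverse diagonal Hessian.
    """
    v = beta * v + (1 - beta) * grad**2
    theta = theta - lr * grad / (np.sqrt(v) + eps)
    return theta, v

# Ill-conditioned quadratic: f(θ) = ½(θ₁² + 100·θ₂²), curvatures 1 and 100
H = np.diag([1.0, 100.0])
theta = np.array([1.0, 1.0])
v = np.zeros(2)
for _ in range(500):
    theta, v = rmsprop_like_step(theta, H @ theta, v)

# Both coordinates shrink at comparable rates despite the 100× curvature gap.
print(f"θ after 500 steps: {theta.round(4)}")
```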
| Method | Curvature Approximation | Memory | Compute/Step | Best For |
|---|---|---|---|---|
| Full Newton | Exact H | O(n²) | O(n³) | Small problems, convex |
| L-BFGS | Low-rank H⁻¹ | O(mn) | O(mn) | Medium-scale, smooth |
| Gauss-Newton | J^T J | O(nm) | O(nm²) | Least squares |
| Natural Gradient | Fisher F | O(n²) | O(n³) | Probability models |
| K-FAC | Kronecker factors | O(layers) | O(layer³) | Deep networks |
| Adam/AdaGrad | Diagonal estimate | O(n) | O(n) | Deep learning (general) |
Adam maintains running estimates of the gradient mean (mₜ) and the squared gradient mean (vₜ). The update θ ← θ - α·mₜ/√vₜ rescales each coordinate by a rough diagonal curvature estimate, acting like an approximate diagonal preconditioner. This is a large part of why Adam works well with relatively little tuning—it implicitly adapts step sizes to local curvature.
Beyond optimization, the Hessian reveals deep information about neural network loss landscapes.
Sharpness and Generalization:
Emerging research connects Hessian eigenvalues to generalization:
- "Sharp" minima (large top eigenvalues) are highly sensitive to parameter perturbations.
- "Flat" minima (small eigenvalues) tolerate perturbations with little change in loss.
Empirical observation: Flat minima often generalize better. Intuition: flat minima are stable under weight perturbation and noise.
Sharpness-Aware Minimization (SAM):
Minimize both loss and "sharpness":
$$\min_\theta \max_{\|\boldsymbol{\epsilon}\| \leq \rho} L(\theta + \boldsymbol{\epsilon})$$
This finds parameters where even worst-case perturbations don't increase loss much—encouraging flat minima.
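A minimal NumPy sketch of the standard first-order SAM approximation (ascend to the worst-case neighbor within radius ρ, then descend using the gradient computed there), on an illustrative toy loss:

```python
import numpy as np

def loss(theta):                     # illustrative toy loss
    return theta[0]**2 + 10 * theta[1]**2

def grad(theta):
    return np.array([2 * theta[0], 20 * theta[1]])

def sam_step(theta, lr=0.05, rho=0.05):
    """One SAM step (first-order approximation of the inner max)."""
    g = grad(theta)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # ascend to the worst-case neighbor
    g_sam = grad(theta + eps)                     # gradient at the perturbed point
    return theta - lr * g_sam                     # descend using that gradient

theta = np.array([1.0, 1.0])
for _ in range(100):
    theta = sam_step(theta)
print(f"θ after 100 SAM steps: {theta.round(4)}, loss = {loss(theta):.6f}")
```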
Hessian Spectral Density:
For large networks, the full eigendecomposition is infeasible, so the eigenvalue distribution (spectral density) is estimated with iterative methods—power iteration and Lanczos—that need only Hessian-vector products (see the code below). The typical picture is a bulk of near-zero eigenvalues plus a handful of large outliers.
Local Convexity:
At any point, the loss is locally convex iff Hessian is positive semidefinite. Studies show neural network losses are often locally convex along the optimization trajectory, even though globally non-convex.
Mode Connectivity:
Different minima found by different training runs are often connected by paths of near-constant loss. The Hessian structure along these paths reveals information about the loss landscape's "tube" structure.
```python
import numpy as np
import matplotlib.pyplot as plt

def power_iteration_largest_eigenvalue(H, num_iterations=100):
    """
    Estimate largest eigenvalue of H using power iteration.
    O(n²) per iteration, can use Hv products for O(n).
    """
    n = H.shape[0]
    v = np.random.randn(n)
    v = v / np.linalg.norm(v)

    for i in range(num_iterations):
        Hv = H @ v
        v = Hv / np.linalg.norm(Hv)

    # Rayleigh quotient gives eigenvalue estimate
    lambda_max = v @ H @ v
    return lambda_max, v

def lanczos_eigenvalue_spectrum(H, k=50):
    """
    Estimate top-k eigenvalues using Lanczos algorithm.
    More efficient than full eigendecomposition for large matrices.
    """
    n = H.shape[0]
    k = min(k, n)

    # Lanczos algorithm
    V = np.zeros((n, k+1))
    T = np.zeros((k, k))  # Tridiagonal matrix

    v = np.random.randn(n)
    V[:, 0] = v / np.linalg.norm(v)

    for j in range(k):
        w = H @ V[:, j]
        if j > 0:
            w = w - T[j-1, j] * V[:, j-1]

        T[j, j] = V[:, j] @ w
        w = w - T[j, j] * V[:, j]

        if j < k - 1:
            T[j, j+1] = np.linalg.norm(w)
            T[j+1, j] = T[j, j+1]
            if T[j, j+1] > 1e-10:
                V[:, j+1] = w / T[j, j+1]
            else:
                # Breakdown - pick random vector
                V[:, j+1] = np.random.randn(n)
                V[:, j+1] = V[:, j+1] / np.linalg.norm(V[:, j+1])

    # Eigenvalues of T approximate eigenvalues of H
    eigenvalues = np.linalg.eigvalsh(T)
    return np.sort(eigenvalues)[::-1]

# Simulate a Hessian with neural-network-like spectrum
def generate_nn_like_hessian(n, num_large=5, bulk_scale=0.01, large_scale=1.0):
    """
    Generate a Hessian with spectral properties similar to
    neural network loss landscapes.
    - Most eigenvalues near zero (flat directions)
    - A few large eigenvalues (sharp directions)
    """
    # Start with near-zero (positive) bulk
    eigenvalues = np.abs(np.random.randn(n)) * bulk_scale

    # Add a few large eigenvalues
    eigenvalues[:num_large] = np.abs(np.random.randn(num_large)) * large_scale + large_scale

    # Random orthogonal eigenvectors
    V, _ = np.linalg.qr(np.random.randn(n, n))

    # Construct Hessian
    H = V @ np.diag(eigenvalues) @ V.T
    return H, eigenvalues

# Generate and analyze
np.random.seed(42)
n = 200
H, true_eigenvalues = generate_nn_like_hessian(n, num_large=10, bulk_scale=0.01, large_scale=2.0)

print("Simulated Neural Network Hessian Analysis")
print("=" * 60)
print(f"Dimension: {n}")

# Power iteration for largest eigenvalue
lambda_max_est, v_max = power_iteration_largest_eigenvalue(H)
print(f"\nLargest eigenvalue (power iteration): {lambda_max_est:.4f}")
print(f"True largest eigenvalue: {max(true_eigenvalues):.4f}")

# Full spectrum
full_eigenvalues = np.linalg.eigvalsh(H)

# Lanczos approximation
lanczos_eigenvalues = lanczos_eigenvalue_spectrum(H, k=30)

# Plotting
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Full spectrum
ax1 = axes[0]
ax1.hist(full_eigenvalues, bins=50, edgecolor='black', alpha=0.7)
ax1.set_xlabel('Eigenvalue')
ax1.set_ylabel('Count')
ax1.set_title('Full Hessian Spectrum')
ax1.axvline(x=0, color='red', linestyle='--', label='Zero')
ax1.legend()

# Plot 2: Log-scale spectrum (to see both bulk and outliers)
ax2 = axes[1]
ax2.hist(np.log10(full_eigenvalues + 1e-10), bins=50, edgecolor='black', alpha=0.7)
ax2.set_xlabel('log₁₀(Eigenvalue)')
ax2.set_ylabel('Count')
ax2.set_title('Log-Scale Spectrum')

# Plot 3: Sorted eigenvalues
ax3 = axes[2]
sorted_eigs = np.sort(full_eigenvalues)[::-1]
ax3.semilogy(range(len(sorted_eigs)), sorted_eigs + 1e-10)
ax3.set_xlabel('Index')
ax3.set_ylabel('Eigenvalue (log scale)')
ax3.set_title('Sorted Eigenvalues')
ax3.axhline(y=1e-2, color='red', linestyle='--', label='Bulk threshold')
ax3.legend()

plt.tight_layout()
plt.savefig('hessian_spectrum.png', dpi=150, bbox_inches='tight')
plt.show()

# Summary statistics
print(f"\nSpectrum Statistics:")
print(f"  Max eigenvalue: {max(full_eigenvalues):.4f}")
print(f"  Min eigenvalue: {min(full_eigenvalues):.6f}")
print(f"  Condition number: {max(full_eigenvalues) / max(min(full_eigenvalues), 1e-10):.2f}")
print(f"  Eigenvalues > 0.1: {np.sum(full_eigenvalues > 0.1)}")
print(f"  Safe learning rate < 2/λ_max: {2/max(full_eigenvalues):.4f}")
```

We can use second-order information without ever forming the full Hessian by exploiting Hessian-vector products.
Key Insight:
To solve HΔθ = -∇f for the Newton step, we don't need H explicitly—we only need to compute Hv for arbitrary vectors v. This is enough for iterative solvers.
Computing Hv Without H:
Recall from page 3 that we can compute:
$$\mathbf{H}\mathbf{v} = \nabla_\theta (\nabla_\theta f \cdot \mathbf{v})$$
This requires:
- one forward/backward pass to obtain ∇f with the computation graph retained,
- a second backward pass through the scalar ∇f · v.
Cost: ~2× one backward pass. Memory: O(n).
Conjugate Gradient (CG) Method:
CG solves Hx = b using only matrix-vector products Hv:
```
Initialize: x = 0, r = b, p = r
Repeat until convergence:
    α = (r · r) / (p · Hp)
    x = x + α p
    r_new = r - α Hp
    β = (r_new · r_new) / (r · r)
    p = r_new + β p
    r = r_new
```
Each iteration needs one Hv product. For n-dimensional problems, CG converges in at most n iterations (often much fewer).
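A minimal NumPy sketch of CG in which the Hessian is only accessed through a matrix-vector callback; the SPD test matrix is illustrative—in a real Hessian-free setting the callback would be the two-pass Hv product shown earlier:

```python
import numpy as np

def conjugate_gradient(hvp, b, tol=1e-10, max_iters=None):
    """Solve H x = b given only a Hessian-vector product callback hvp(v)."""
    n = b.shape[0]
    max_iters = max_iters or n
    x = np.zeros(n)
    r = b.copy()            # residual b - Hx (x = 0 initially)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Illustrative SPD "Hessian" — never materialized by CG itself.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
H = A @ A.T + 50 * np.eye(50)
grad = rng.standard_normal(50)

newton_step = conjugate_gradient(lambda v: H @ v, -grad)
print("Residual ‖H·Δθ + ∇f‖:", np.linalg.norm(H @ newton_step + grad))
```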
Hessian-Free Optimization Algorithm:
1. Compute the gradient ∇f(θ) on a (mini)batch.
2. Approximately solve HΔθ = -∇f with CG, using only Hessian-vector products (typically with damping: (H + λI)Δθ = -∇f).
3. Update θ ← θ + Δθ, optionally with a line search, and adjust the damping.
Advantages:
- Uses curvature information with only O(n) memory.
- Never forms or stores the full Hessian.
- CG can be truncated early, trading accuracy for compute.
Challenges:
- Each update requires many Hv products (many extra backward passes).
- The true Hessian can be indefinite, which breaks CG—hence damping or the Gauss-Newton approximation.
- Mini-batch noise makes curvature estimates less reliable.
For losses of the form L = ½‖r(θ)‖² (sum of squared residuals), the Gauss-Newton approximation H ≈ JᵀJ is always positive semidefinite, making CG well-behaved. This approximation ignores the second derivative of residuals, which is often small or zero anyway.
The Hessian matrix and second-order approximations provide the mathematical framework for understanding loss landscape curvature—essential for both optimization algorithms and theoretical analysis.
Module Synthesis:
Across this module on multivariate calculus, we've built a comprehensive toolkit: multivariable functions and partial derivatives, gradients and directional derivatives, Jacobians and the chain rule, Taylor approximations, and now the Hessian and second-order methods.
The Bigger Picture:
Multivariate calculus provides the mathematical language of machine learning optimization: gradients define descent directions, Jacobians propagate derivatives through composed functions (backpropagation), and Hessians describe the curvature that governs conditioning and convergence.
This foundation prepares you for advanced topics: convergence analysis of optimizers, natural-gradient and other second-order methods, and research on loss landscape geometry and generalization.
Practical Wisdom:
While we rarely compute full Hessians in deep learning, the concepts pervade practice: the largest eigenvalue bounds usable learning rates, adaptive optimizers approximate diagonal curvature, Hessian-vector products power spectrum analysis and Hessian-free methods, and the sharpness of minima informs thinking about generalization.
The mathematics of curvature isn't just theory—it's the lens through which we understand learning dynamics.
Congratulations! You've completed the Multivariate Calculus module. You now understand multivariable functions, gradients, Jacobians, Taylor series, and the Hessian—the complete first and second-order derivative toolkit for machine learning. These tools enable deep analysis of optimization algorithms and loss landscapes.