Gradient descent treats all directions equally, descending purely based on slope. But loss landscapes are curved—some directions have gentle slopes while others plummet steeply. Some regions are nearly flat; others curve sharply. Ignoring this curvature information is like navigating a mountain using only a compass, without considering the terrain's shape.
The Hessian matrix captures this curvature information. It tells us:
- how sharply the loss curves along each direction (the second derivatives),
- whether a critical point is a minimum, maximum, or saddle,
- how well-conditioned the landscape is, and therefore how fast gradient descent can converge.
Understanding the Hessian transforms optimization from a blind descent into an informed navigation of the loss landscape. While computing the full Hessian is expensive for neural networks, the conceptual understanding guides the design of practical approximations like Adam, natural gradient, and K-FAC.
This page provides a comprehensive treatment of second-order approximations and the Hessian, completing our foundation in multivariate calculus for machine learning.
By the end of this page, you will understand the Hessian matrix in depth: its computation, spectral properties, role in critical point classification, connection to conditioning and convergence, and practical approaches for leveraging second-order information in large-scale optimization.
Let's formalize and deepen our understanding of the Hessian matrix.
Formal Definition:
For f: ℝⁿ → ℝ with continuous second partial derivatives, the Hessian H_f(x) ∈ ℝⁿˣⁿ is:
$$[\mathbf{H}_f(\mathbf{x})]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$
Computing Second Derivatives:
Second partials are computed by differentiating twice:
$$\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial}{\partial x_i}\left( \frac{\partial f}{\partial x_j} \right)$$
Schwarz's Theorem (Symmetry):
If second partials are continuous, mixed partials are equal:
$$\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$$
Therefore H = Hᵀ (Hessian is symmetric). This guarantees real eigenvalues and orthogonal eigenvectors.
Alternative Definitions:
The Hessian is the Jacobian of the gradient: H_f(x) = J_{∇f}(x). It is also commonly written ∇²f(x).
Size Considerations:
For a neural network with n parameters:
- the Hessian has n² entries, so memory grows quadratically with model size,
- inverting it (or solving a linear system with it) costs O(n³),
- even a single dense Hessian-vector multiply costs O(n²).
Explicit storage is impossible for modern networks. This drives research into Hessian-free and diagonal/structured approximations.
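To make the quadratic blow-up concrete, here is a quick back-of-the-envelope calculation (the parameter counts are illustrative, assuming float32 storage):

```python
# Rough memory needed to store a dense float32 Hessian for n parameters.
# The parameter counts below are illustrative, not tied to any specific model.
for n in [1e3, 1e6, 1e8]:
    bytes_needed = n * n * 4          # n² entries, 4 bytes each
    print(f"n = {n:>8.0e}: {bytes_needed / 1e9:,.3f} GB")
# n = 1e3  ->  ~0.004 GB (trivial)
# n = 1e6  ->  ~4,000 GB (already impractical)
# n = 1e8  ->  ~40,000,000 GB (hopeless)
```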
```python
import numpy as np
import torch
import torch.nn as nn

# ================================================
# Method 1: Analytical Hessian for simple function
# ================================================

def example_function(x, y, z):
    """f(x, y, z) = x²y + yz² + exp(xyz)"""
    return x**2 * y + y * z**2 + np.exp(x * y * z)

def analytical_hessian(x, y, z):
    """Analytically derived Hessian of the example function."""
    e = np.exp(x * y * z)

    # Diagonal entries (∂²f/∂xᵢ²)
    H_xx = 2*y + (y*z)**2 * e
    H_yy = (x*z)**2 * e
    H_zz = 2*y + (x*y)**2 * e

    # Off-diagonal entries (∂²f/∂xᵢ∂xⱼ)
    H_xy = 2*x + z*e + x*y*z**2 * e
    H_xz = y*e + x*y**2*z * e
    H_yz = 2*z + x*e + x**2*y*z * e

    return np.array([
        [H_xx, H_xy, H_xz],
        [H_xy, H_yy, H_yz],
        [H_xz, H_yz, H_zz]
    ])

# ================================================
# Method 2: Numerical Hessian via finite differences
# ================================================

def numerical_hessian(f, point, h=1e-5):
    """
    Compute Hessian numerically using central differences.

    H_ij ≈ (f(x + h*eᵢ + h*eⱼ) - f(x + h*eᵢ - h*eⱼ)
            - f(x - h*eᵢ + h*eⱼ) + f(x - h*eᵢ - h*eⱼ)) / (4h²)
    """
    n = len(point)
    H = np.zeros((n, n))

    for i in range(n):
        for j in range(n):
            # Create perturbation vectors
            ei = np.zeros(n)
            ei[i] = h
            ej = np.zeros(n)
            ej[j] = h

            # Central difference formula for mixed partial
            f_pp = f(*(point + ei + ej))
            f_pm = f(*(point + ei - ej))
            f_mp = f(*(point - ei + ej))
            f_mm = f(*(point - ei - ej))

            H[i, j] = (f_pp - f_pm - f_mp + f_mm) / (4 * h**2)

    return H

# ================================================
# Method 3: Automatic differentiation (PyTorch)
# ================================================

def torch_hessian(f, x):
    """
    Compute Hessian using PyTorch autograd.
    Requires n gradient passes (one per input dimension).
    """
    x = x.clone().requires_grad_(True)
    y = f(x)

    # First, compute gradient
    grad = torch.autograd.grad(y, x, create_graph=True)[0]

    # Then compute Jacobian of gradient (= Hessian)
    n = x.numel()
    H = torch.zeros(n, n)
    for i in range(n):
        # Compute gradient of grad[i] w.r.t. x
        grad2 = torch.autograd.grad(grad[i], x, retain_graph=True)[0]
        H[i] = grad2

    return H

# Test the methods
point = np.array([1.0, 2.0, 0.5])

print("Hessian Computation Methods Comparison")
print("=" * 60)
print(f"Point: {point}")
print()

# Analytical
H_analytical = analytical_hessian(*point)
print("Analytical Hessian:")
print(H_analytical.round(4))
print()

# Numerical
H_numerical = numerical_hessian(example_function, point)
print("Numerical Hessian (finite differences):")
print(H_numerical.round(4))
print()

# Symmetry check
print(f"Symmetry check (max |H - Hᵀ|): {np.max(np.abs(H_analytical - H_analytical.T)):.2e}")
print()

# Eigenvalue analysis
eigenvalues, eigenvectors = np.linalg.eigh(H_analytical)
print("Eigenvalue Analysis:")
print(f"Eigenvalues: {eigenvalues.round(4)}")
print(f"All positive (positive definite)? {all(eigenvalues > 0)}")

# Condition number
cond_number = max(abs(eigenvalues)) / min(abs(eigenvalues))
print(f"Condition number: {cond_number:.2f}")

# ================================================
# Method 4: Hessian-vector product (efficient)
# ================================================

def hessian_vector_product(f, x, v):
    """
    Compute H @ v without forming full Hessian.
    Uses two backward passes but never stores H.
    """
    x = x.clone().requires_grad_(True)
    v = v.clone().detach()

    # Forward pass
    y = f(x)

    # First backward: compute gradient
    grad = torch.autograd.grad(y, x, create_graph=True)[0]

    # Second backward: compute ∂(grad · v)/∂x = Hv
    Hv = torch.autograd.grad(torch.dot(grad, v), x)[0]

    return Hv

print("\n" + "=" * 60)
print("Hessian-Vector Product (efficient method)")
print("=" * 60)

def f_torch(x):
    return x[0]**2 * x[1] + x[1] * x[2]**2 + torch.exp(x[0] * x[1] * x[2])

x_torch = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)
v = torch.tensor([1.0, 0.0, 0.0])  # First standard basis vector

Hv = hessian_vector_product(f_torch, x_torch, v)
print(f"H @ e₁ (first column of H): {Hv.detach().numpy().round(4)}")
print(f"Compare to analytical first column: {H_analytical[:, 0].round(4)}")
```

We can compute Hv (Hessian times a vector) in O(n) time using two backpropagation passes, without ever forming the full O(n²) Hessian. This enables Hessian-free optimization and power iteration for eigenvalue estimation in large-scale settings.
The eigenvalue decomposition of the Hessian reveals the curvature structure of the loss landscape along different directions.
Eigendecomposition:
Since H is symmetric, it admits the decomposition:
$$\mathbf{H} = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^\top = \sum_{i=1}^{n} \lambda_i \mathbf{v}_i \mathbf{v}_i^\top$$
where:
- V = [v₁, ..., vₙ] is an orthogonal matrix whose columns are the eigenvectors,
- Λ = diag(λ₁, ..., λₙ) holds the corresponding eigenvalues.
Curvature Along Eigenvectors:
The eigenvalue λᵢ is the curvature (second derivative) of f along direction vᵢ:
$$\frac{d^2}{dt^2} f(\mathbf{x} + t\mathbf{v}_i)\bigg|_{t=0} = \mathbf{v}_i^\top \mathbf{H} \mathbf{v}_i = \lambda_i$$
General curvature in direction u (unit vector):
$$\text{curvature along } \mathbf{u} = \mathbf{u}^\top \mathbf{H} \mathbf{u} \in [\lambda_{\min}, \lambda_{\max}]$$
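A quick numerical check of this bound, using a random symmetric matrix as a stand-in Hessian (a minimal sketch, not tied to any model):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = (A + A.T) / 2                      # arbitrary symmetric "Hessian"
lam = np.linalg.eigvalsh(H)            # eigenvalues, sorted ascending

# The Rayleigh quotient uᵀHu for any unit vector u lies in [λ_min, λ_max]
for _ in range(5):
    u = rng.standard_normal(5)
    u /= np.linalg.norm(u)
    q = u @ H @ u
    print(f"uᵀHu = {q:+.4f}  (λ_min = {lam[0]:+.4f}, λ_max = {lam[-1]:+.4f})")
```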
Spectrum Characterization:
The collection of eigenvalues (the spectrum) characterizes the Hessian:
| Spectrum Property | Hessian Type | Loss Landscape Shape |
|---|---|---|
| All λᵢ > 0 | Positive definite | Bowl (convex), unique minimum |
| All λᵢ < 0 | Negative definite | Inverted bowl, unique maximum |
| λᵢ > 0 and λⱼ < 0 exist | Indefinite | Saddle point |
| All λᵢ ≥ 0, some = 0 | Positive semidefinite | Flat directions exist |
| λ_max >> λ_min > 0 | Ill-conditioned | Elongated elliptical contours |
The Condition Number:
$$\kappa = \frac{|\lambda_{\max}|}{|\lambda_{\min}|}$$
The condition number measures the eccentricity of the loss landscape: κ ≈ 1 means nearly circular contours (all directions curve similarly), while κ ≫ 1 means long, narrow valleys in which some directions are far steeper than others.
Impact on Gradient Descent:
For a quadratic loss, GD achieves:
$$\text{error after } k \text{ steps} \propto \left(\frac{\kappa - 1}{\kappa + 1}\right)^k$$
With κ = 1000, we need thousands of iterations. With κ = 10, only tens.
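The scaling is easy to reproduce on a toy diagonal quadratic. The sketch below (illustrative step size 1/λ_max and tolerance) counts gradient-descent iterations; the absolute counts depend on the tolerance, but the roughly linear growth with κ is the point:

```python
import numpy as np

def gd_iterations_to_converge(kappa, tol=1e-6, max_iters=100_000):
    """Run GD on f(θ) = ½(θ₁² + κ·θ₂²) and count iterations until ‖θ‖ < tol."""
    H = np.diag([1.0, kappa])
    theta = np.array([1.0, 1.0])
    lr = 1.0 / kappa                       # step size ~ 1/λ_max for stability
    for k in range(max_iters):
        if np.linalg.norm(theta) < tol:
            return k
        theta = theta - lr * (H @ theta)   # gradient of the quadratic is Hθ
    return max_iters

for kappa in [10, 100, 1000]:
    print(f"κ = {kappa:5d}: {gd_iterations_to_converge(kappa)} iterations")
```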
Spectrum of Neural Network Losses:
Empirical studies of neural network Hessians reveal:
- a large "bulk" of eigenvalues concentrated near zero (many nearly flat directions),
- a small number of large outlier eigenvalues (a few sharp directions),
- as a result, very large condition numbers.
This structure motivates adaptive optimizers that scale per-direction.
Neural network Hessians often have condition numbers κ > 10⁶. This explains why vanilla gradient descent fails: the largest eigenvalue limits the learning rate (to prevent divergence), but then progress along small-eigenvalue directions is glacial. Adaptive methods like Adam address this by implicitly rescaling directions.
A critical point (or stationary point) is where the gradient vanishes: ∇f(x*) = 0. The Hessian at a critical point tells us its nature.
The Second Derivative Test:
At a critical point x* where ∇f(x*) = 0, the Taylor expansion becomes:
$$f(\mathbf{x}^* + \Delta\mathbf{x}) \approx f(\mathbf{x}^*) + \frac{1}{2} \Delta\mathbf{x}^\top \mathbf{H}(\mathbf{x}^*) \Delta\mathbf{x}$$
The quadratic term determines whether x* is locally a minimum, maximum, or saddle.
Classification Rules:
- H(x*) positive definite (all λᵢ > 0): x* is a local minimum.
- H(x*) negative definite (all λᵢ < 0): x* is a local maximum.
- H(x*) indefinite (mixed signs): x* is a saddle point.
- H(x*) singular (some λᵢ = 0): the test is inconclusive; higher-order terms decide.
Practical Tests for Definiteness:
Eigenvalue check: Compute eigenvalues and check signs. O(n³).
Sylvester's criterion: For positive definiteness, all leading principal minors must be positive: the determinant of every top-left k×k submatrix, for k = 1, ..., n.
Cholesky decomposition: H is positive definite iff Cholesky factorization H = LLᵀ succeeds.
Quadratic form sampling: Check the sign of vᵀHv for random v. (Can miss degenerate directions.)
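A minimal sketch of the eigenvalue and Cholesky tests (using NumPy's `np.linalg.cholesky`, which raises `LinAlgError` when the matrix is not positive definite):

```python
import numpy as np

def is_positive_definite_cholesky(H):
    """Positive definite iff the Cholesky factorization succeeds."""
    try:
        np.linalg.cholesky(H)
        return True
    except np.linalg.LinAlgError:
        return False

H_min    = np.array([[2.0, 0.0], [0.0, 4.0]])    # bowl
H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])   # saddle

for name, H in [("bowl", H_min), ("saddle", H_saddle)]:
    eigs = np.linalg.eigvalsh(H)
    print(f"{name}: eigenvalues {eigs}, "
          f"Cholesky says PD = {is_positive_definite_cholesky(H)}")
```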
High-Dimensional Reality:
At a random critical point in n dimensions, each Hessian eigenvalue is roughly equally likely to be positive or negative. The probability of all being positive (local minimum) is approximately 2⁻ⁿ—exponentially rare!
For n = 100 dimensions, P(local min) ≈ 10⁻³⁰. Almost all critical points are saddle points.
This is actually good news for optimization: gradient descent is rarely trapped at saddle points, because there is almost always a negative-curvature direction leading away from them—though progress can slow down nearby.
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def analyze_critical_point(H):
    """
    Analyze a Hessian matrix at a critical point.
    Returns classification and eigenvalue information.
    """
    eigenvalues = np.linalg.eigvalsh(H)  # Real eigenvalues for symmetric H

    pos = np.sum(eigenvalues > 1e-10)
    neg = np.sum(eigenvalues < -1e-10)
    zero = np.sum(np.abs(eigenvalues) <= 1e-10)

    if pos == len(eigenvalues):
        classification = "Local Minimum (positive definite)"
    elif neg == len(eigenvalues):
        classification = "Local Maximum (negative definite)"
    elif pos > 0 and neg > 0:
        classification = f"Saddle Point ({pos}+, {neg}-, {zero} zero)"
    elif zero > 0:
        classification = f"Degenerate ({pos}+, {neg}-, {zero} zero)"
    else:
        classification = "Unknown"

    condition_number = np.abs(eigenvalues).max() / max(np.abs(eigenvalues).min(), 1e-10)

    return {
        'eigenvalues': eigenvalues,
        'classification': classification,
        'condition_number': condition_number,
        'pos_count': pos,
        'neg_count': neg,
        'zero_count': zero
    }

# Example 1: Bowl (minimum)
print("Example 1: Quadratic Bowl f(x,y) = x² + 2y²")
print("-" * 50)
H_bowl = np.array([[2, 0], [0, 4]])
result = analyze_critical_point(H_bowl)
print(f"Hessian:\n{H_bowl}")
print(f"Eigenvalues: {result['eigenvalues']}")
print(f"Classification: {result['classification']}")
print(f"Condition number: {result['condition_number']:.2f}")
print()

# Example 2: Saddle point
print("Example 2: Hyperbolic Paraboloid f(x,y) = x² - y²")
print("-" * 50)
H_saddle = np.array([[2, 0], [0, -2]])
result = analyze_critical_point(H_saddle)
print(f"Hessian:\n{H_saddle}")
print(f"Eigenvalues: {result['eigenvalues']}")
print(f"Classification: {result['classification']}")
print()

# Example 3: Monkey saddle (higher-order saddle)
print("Example 3: f(x,y) = x³ - 3xy² at (0,0)")
print("-" * 50)
H_monkey = np.array([[0, 0], [0, 0]])  # Hessian is zero!
result = analyze_critical_point(H_monkey)
print(f"Hessian:\n{H_monkey}")
print(f"Eigenvalues: {result['eigenvalues']}")
print(f"Classification: {result['classification']}")
print("(Need higher-order analysis for degenerate case)")
print()

# Example 4: Ill-conditioned minimum
print("Example 4: Elongated bowl f(x,y) = x² + 100y²")
print("-" * 50)
H_ellipse = np.array([[2, 0], [0, 200]])
result = analyze_critical_point(H_ellipse)
print(f"Hessian:\n{H_ellipse}")
print(f"Eigenvalues: {result['eigenvalues']}")
print(f"Classification: {result['classification']}")
print(f"Condition number: {result['condition_number']:.2f}")
print("(High condition number = slow GD convergence)")
print()

# Visualization
fig = plt.figure(figsize=(15, 5))

examples = [
    (lambda x, y: x**2 + 2*y**2, "Minimum (Bowl)", "viridis"),
    (lambda x, y: x**2 - y**2, "Saddle Point", "RdBu"),
    (lambda x, y: x**2 + 100*y**2, "Ill-Conditioned Min", "viridis"),
]

for idx, (f, title, cmap) in enumerate(examples, 1):
    ax = fig.add_subplot(1, 3, idx, projection='3d')

    x = np.linspace(-2, 2, 50)
    y = np.linspace(-2, 2, 50)
    X, Y = np.meshgrid(x, y)
    Z = f(X, Y)

    ax.plot_surface(X, Y, Z, cmap=cmap, alpha=0.8)
    ax.scatter([0], [0], [f(0, 0)], color='red', s=100)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('f(x,y)')
    ax.set_title(title)

plt.tight_layout()
plt.savefig('critical_points.png', dpi=150, bbox_inches='tight')
plt.show()

# High-dimensional saddle point prevalence
print("\n" + "=" * 60)
print("Saddle Point Prevalence in High Dimensions")
print("=" * 60)

for n in [2, 5, 10, 20, 50, 100]:
    # Probability of local min = 2^(-n)
    p_min = 2**(-n)
    print(f"Dimension n={n:3d}: P(random critical point is local min) ≈ {p_min:.2e}")
```

The second-order Taylor expansion gives a quadratic model that we can minimize exactly, leading to Newton's method.
The Quadratic Model:
Approximate f near current point θ:
$$m(\Delta\theta) = f(\theta) + \nabla f^\top \Delta\theta + \frac{1}{2} \Delta\theta^\top \mathbf{H} \Delta\theta$$
To minimize this quadratic, differentiate and set to zero:
$$\nabla_\Delta m = \nabla f + \mathbf{H} \Delta\theta = \mathbf{0}$$
Newton Step:
$$\Delta\theta^* = -\mathbf{H}^{-1} \nabla f$$ $$\theta_{\text{new}} = \theta - \mathbf{H}^{-1} \nabla f$$
Geometric Interpretation:
Newton's method jumps directly to the minimum of the local quadratic approximation. If f is exactly quadratic, one step reaches the minimum. For general f, the method converges quadratically near a minimum.
Convergence Analysis:
Near a strict local minimum θ* with positive definite Hessian:
$$\|\theta_{k+1} - \theta^*\| \leq C \|\theta_k - \theta^*\|^2$$
This quadratic convergence means the number of correct digits roughly doubles each iteration. Compare to GD's linear convergence:
$$\|\theta_{k+1} - \theta^*\| \leq \rho \|\theta_k - \theta^*\|$$
with rate ρ = (κ-1)/(κ+1) for quadratic functions.
Pure Newton can diverge far from the minimum. Practical variants include: (1) Damped Newton: θ ← θ - α·H⁻¹∇f with line search for α. (2) Trust region: minimize quadratic model within a ball ‖Δθ‖ ≤ r. Both ensure progress even when the quadratic model is inaccurate.
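A minimal sketch of damped Newton with Armijo backtracking on a small test function (the function `f` and the constants are chosen purely for illustration):

```python
import numpy as np

def f(x):        # illustrative non-quadratic test function
    return x[0]**4 + x[0]**2 + 5*x[1]**2 + x[0]*x[1]

def grad(x):
    return np.array([4*x[0]**3 + 2*x[0] + x[1], 10*x[1] + x[0]])

def hess(x):
    return np.array([[12*x[0]**2 + 2, 1.0], [1.0, 10.0]])

x = np.array([2.0, 2.0])
for it in range(20):
    g, H = grad(x), hess(x)
    if np.linalg.norm(g) < 1e-10:
        break
    step = np.linalg.solve(H, -g)           # Newton direction: -H⁻¹∇f
    alpha = 1.0
    # Backtracking (Armijo): shrink α until the step gives sufficient decrease
    while f(x + alpha * step) > f(x) + 1e-4 * alpha * (g @ step):
        alpha *= 0.5
    x = x + alpha * step

print(f"Converged to x = {x.round(6)}, ‖∇f‖ = {np.linalg.norm(grad(x)):.2e}")
```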
Full Newton is impractical for neural networks, but various approximations capture curvature information at manageable cost.
1. Quasi-Newton Methods (BFGS, L-BFGS)
Idea: Build an approximation B ≈ H⁻¹ incrementally from gradient changes.
BFGS update: $$\mathbf{B}_{k+1} = \left(\mathbf{I} - \frac{\mathbf{s}_k \mathbf{y}_k^\top}{\mathbf{y}_k^\top \mathbf{s}_k}\right) \mathbf{B}_k \left(\mathbf{I} - \frac{\mathbf{y}_k \mathbf{s}_k^\top}{\mathbf{y}_k^\top \mathbf{s}_k}\right) + \frac{\mathbf{s}_k \mathbf{s}_k^\top}{\mathbf{y}_k^\top \mathbf{s}_k}$$
where sₖ = θₖ₊₁ − θₖ and yₖ = ∇fₖ₊₁ − ∇fₖ.
L-BFGS: Limited-memory variant stores only last m (s, y) pairs. O(n) per iteration.
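In practice one rarely hand-codes the update; a minimal sketch using SciPy's `L-BFGS-B` implementation on the Rosenbrock test function (assuming SciPy is available):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# L-BFGS-B only needs function values and gradients; curvature is
# approximated internally from the last m (s, y) pairs.
x0 = np.array([-1.2, 1.0, -1.2, 1.0])
result = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B",
                  options={"maxcor": 10})   # maxcor = memory size m

print(f"Minimum found at {result.x.round(4)} after {result.nit} iterations")
```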
2. Gauss-Newton (for Least Squares)
For loss L = ½‖r(θ)‖² (residual sum of squares):
H ≈ J_r^⊤ J_r
Ignores second derivatives of residuals. Guaranteed positive semidefinite!
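A minimal sketch of Gauss-Newton iterations for a tiny one-parameter fit y ≈ exp(a·t); the data and model are illustrative:

```python
import numpy as np

# Synthetic data for the illustrative model y = exp(a·t) with true a = 0.5
t = np.linspace(0, 2, 20)
y = np.exp(0.5 * t)

a = 0.0                                     # initial guess
for it in range(10):
    r = np.exp(a * t) - y                   # residuals r(a)
    J = (t * np.exp(a * t))[:, None]        # Jacobian ∂r/∂a, shape (20, 1)
    # Gauss-Newton: solve (JᵀJ) Δa = -Jᵀr  (ignores second derivatives of r)
    delta = np.linalg.solve(J.T @ J, -J.T @ r)
    a += delta.item()

print(f"Estimated a = {a:.6f} (true value 0.5)")
```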
3. Natural Gradient
Use the Fisher Information Matrix F instead of Hessian:
$$\theta \leftarrow \theta - \alpha \mathbf{F}^{-1} \nabla L$$
F = 𝔼[∇log p(y|x,θ) ∇log p(y|x,θ)^⊤]
The natural gradient accounts for the geometry of probability distributions.
4. K-FAC (Kronecker-Factored Approximate Curvature)
Approximate Fisher per layer using Kronecker products:
F_layer ≈ A ⊗ G
where A captures input covariance and G captures gradient covariance. Inversion costs O(layer_dim³) instead of O(total_params³).
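The computational win rests on the identity (A ⊗ G)⁻¹ = A⁻¹ ⊗ G⁻¹: only the small per-layer factors are ever inverted. A minimal numerical check using random SPD matrices as stand-ins for the factors:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(d):
    """Random symmetric positive definite matrix (stand-in for a K-FAC factor)."""
    M = rng.standard_normal((d, d))
    return M @ M.T + d * np.eye(d)

A = random_spd(3)    # input (activation) covariance factor
G = random_spd(4)    # output-gradient covariance factor

# Inverting the 12×12 Kronecker product directly...
direct = np.linalg.inv(np.kron(A, G))
# ...equals the Kronecker product of the small inverses (3×3 and 4×4).
factored = np.kron(np.linalg.inv(A), np.linalg.inv(G))

print("Max abs difference:", np.max(np.abs(direct - factored)))
```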
5. Diagonal Approximations
Simplest: approximate H ≈ diag(h₁, ..., hₙ). Then H⁻¹∇f is just element-wise division.
AdaGrad, RMSprop, Adam implicitly use diagonal curvature estimates based on accumulated squared gradients.
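A minimal sketch of a diagonal-preconditioned update in the RMSprop/Adam family, applied to an ill-conditioned quadratic (learning rate and decay constants are illustrative):

```python
import numpy as np

def rmsprop_like_step(theta, grad, v, lr=1e-2, beta=0.9, eps=1e-8):
    """
    Diagonal preconditioning: keep a running average of squared gradients
    per coordinate and divide the update element-wise — a cheap stand-in
    for multiplying by an (approximate) inverse diagonal Hessian.
    """
    v = beta * v + (1 - beta) * grad**2
    theta = theta - lr * grad / (np.sqrt(v) + eps)
    return theta, v

# Ill-conditioned quadratic: f(θ) = ½(θ₁² + 100·θ₂²), curvatures 1 and 100
H = np.diag([1.0, 100.0])
theta = np.array([1.0, 1.0])
v = np.zeros(2)
for _ in range(500):
    theta, v = rmsprop_like_step(theta, H @ theta, v)

# Both coordinates shrink at comparable rates despite the 100× curvature gap.
print(f"θ after 500 steps: {theta.round(4)}")
```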
| Method | Curvature Approximation | Memory | Compute/Step | Best For |
|---|---|---|---|---|
| Full Newton | Exact H | O(n²) | O(n³) | Small problems, convex |
| L-BFGS | Low-rank H⁻¹ | O(mn) | O(mn) | Medium-scale, smooth |
| Gauss-Newton | J^T J | O(nm) | O(nm²) | Least squares |
| Natural Gradient | Fisher F | O(n²) | O(n³) | Probability models |
| K-FAC | Kronecker factors | O(layers) | O(layer³) | Deep networks |
| Adam/AdaGrad | Diagonal estimate | O(n) | O(n) | Deep learning (general) |
Adam maintains running estimates of the gradient mean (mₜ) and the squared gradient mean (vₜ). The update θ ← θ - α·mₜ/√vₜ rescales each coordinate by a rough diagonal curvature estimate, acting like an approximate diagonal preconditioner. This is a large part of why Adam works well with relatively little tuning—it implicitly adapts step sizes to local curvature.
Beyond optimization, the Hessian reveals deep information about neural network loss landscapes.
Sharpness and Generalization:
Emerging research connects Hessian eigenvalues to generalization:
- "Sharp" minima (large top eigenvalues) are highly sensitive to parameter perturbations.
- "Flat" minima (small eigenvalues) tolerate perturbations with little change in loss.
Empirical observation: Flat minima often generalize better. Intuition: flat minima are stable under weight perturbation and noise.
Sharpness-Aware Minimization (SAM):
Minimize both loss and "sharpness":
$$\min_\theta \max_{\|\boldsymbol{\epsilon}\| \leq \rho} L(\theta + \boldsymbol{\epsilon})$$
This finds parameters where even worst-case perturbations don't increase loss much—encouraging flat minima.
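A minimal NumPy sketch of the standard first-order SAM approximation (ascend to the worst-case neighbor within radius ρ, then descend using the gradient computed there), on an illustrative toy loss:

```python
import numpy as np

def loss(theta):                     # illustrative toy loss
    return theta[0]**2 + 10 * theta[1]**2

def grad(theta):
    return np.array([2 * theta[0], 20 * theta[1]])

def sam_step(theta, lr=0.05, rho=0.05):
    """One SAM step (first-order approximation of the inner max)."""
    g = grad(theta)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # ascend to the worst-case neighbor
    g_sam = grad(theta + eps)                     # gradient at the perturbed point
    return theta - lr * g_sam                     # descend using that gradient

theta = np.array([1.0, 1.0])
for _ in range(100):
    theta = sam_step(theta)
print(f"θ after 100 SAM steps: {theta.round(4)}, loss = {loss(theta):.6f}")
```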
Hessian Spectral Density:
For large networks, the full eigendecomposition is infeasible, so the eigenvalue distribution (spectral density) is estimated with iterative methods—power iteration and Lanczos—that need only Hessian-vector products (see the code below). The typical picture is a bulk of near-zero eigenvalues plus a handful of large outliers.
Local Convexity:
At any point, the loss is locally convex iff Hessian is positive semidefinite. Studies show neural network losses are often locally convex along the optimization trajectory, even though globally non-convex.
Mode Connectivity:
Different minima found by different training runs are often connected by paths of near-constant loss. The Hessian structure along these paths reveals information about the loss landscape's "tube" structure.
```python
import numpy as np
import matplotlib.pyplot as plt

def power_iteration_largest_eigenvalue(H, num_iterations=100):
    """
    Estimate largest eigenvalue of H using power iteration.
    O(n²) per iteration, can use Hv products for O(n).
    """
    n = H.shape[0]
    v = np.random.randn(n)
    v = v / np.linalg.norm(v)

    for i in range(num_iterations):
        Hv = H @ v
        v = Hv / np.linalg.norm(Hv)

    # Rayleigh quotient gives eigenvalue estimate
    lambda_max = v @ H @ v
    return lambda_max, v

def lanczos_eigenvalue_spectrum(H, k=50):
    """
    Estimate top-k eigenvalues using Lanczos algorithm.
    More efficient than full eigendecomposition for large matrices.
    """
    n = H.shape[0]
    k = min(k, n)

    # Lanczos algorithm
    V = np.zeros((n, k+1))
    T = np.zeros((k, k))  # Tridiagonal matrix

    v = np.random.randn(n)
    V[:, 0] = v / np.linalg.norm(v)

    for j in range(k):
        w = H @ V[:, j]
        if j > 0:
            w = w - T[j-1, j] * V[:, j-1]

        T[j, j] = V[:, j] @ w
        w = w - T[j, j] * V[:, j]

        if j < k - 1:
            T[j, j+1] = np.linalg.norm(w)
            T[j+1, j] = T[j, j+1]
            if T[j, j+1] > 1e-10:
                V[:, j+1] = w / T[j, j+1]
            else:
                # Breakdown - pick random vector
                V[:, j+1] = np.random.randn(n)
                V[:, j+1] = V[:, j+1] / np.linalg.norm(V[:, j+1])

    # Eigenvalues of T approximate eigenvalues of H
    eigenvalues = np.linalg.eigvalsh(T)
    return np.sort(eigenvalues)[::-1]

# Simulate a Hessian with neural-network-like spectrum
def generate_nn_like_hessian(n, num_large=5, bulk_scale=0.01, large_scale=1.0):
    """
    Generate a Hessian with spectral properties similar to
    neural network loss landscapes.
    - Most eigenvalues near zero (flat directions)
    - A few large eigenvalues (sharp directions)
    """
    # Start with near-zero (positive) bulk
    eigenvalues = np.abs(np.random.randn(n)) * bulk_scale

    # Add a few large eigenvalues
    eigenvalues[:num_large] = np.abs(np.random.randn(num_large)) * large_scale + large_scale

    # Random orthogonal eigenvectors
    V, _ = np.linalg.qr(np.random.randn(n, n))

    # Construct Hessian
    H = V @ np.diag(eigenvalues) @ V.T
    return H, eigenvalues

# Generate and analyze
np.random.seed(42)
n = 200
H, true_eigenvalues = generate_nn_like_hessian(n, num_large=10, bulk_scale=0.01, large_scale=2.0)

print("Simulated Neural Network Hessian Analysis")
print("=" * 60)
print(f"Dimension: {n}")

# Power iteration for largest eigenvalue
lambda_max_est, v_max = power_iteration_largest_eigenvalue(H)
print(f"\nLargest eigenvalue (power iteration): {lambda_max_est:.4f}")
print(f"True largest eigenvalue: {max(true_eigenvalues):.4f}")

# Full spectrum
full_eigenvalues = np.linalg.eigvalsh(H)

# Lanczos approximation
lanczos_eigenvalues = lanczos_eigenvalue_spectrum(H, k=30)

# Plotting
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Full spectrum
ax1 = axes[0]
ax1.hist(full_eigenvalues, bins=50, edgecolor='black', alpha=0.7)
ax1.set_xlabel('Eigenvalue')
ax1.set_ylabel('Count')
ax1.set_title('Full Hessian Spectrum')
ax1.axvline(x=0, color='red', linestyle='--', label='Zero')
ax1.legend()

# Plot 2: Log-scale spectrum (to see both bulk and outliers)
ax2 = axes[1]
ax2.hist(np.log10(full_eigenvalues + 1e-10), bins=50, edgecolor='black', alpha=0.7)
ax2.set_xlabel('log₁₀(Eigenvalue)')
ax2.set_ylabel('Count')
ax2.set_title('Log-Scale Spectrum')

# Plot 3: Sorted eigenvalues
ax3 = axes[2]
sorted_eigs = np.sort(full_eigenvalues)[::-1]
ax3.semilogy(range(len(sorted_eigs)), sorted_eigs + 1e-10)
ax3.set_xlabel('Index')
ax3.set_ylabel('Eigenvalue (log scale)')
ax3.set_title('Sorted Eigenvalues')
ax3.axhline(y=1e-2, color='red', linestyle='--', label='Bulk threshold')
ax3.legend()

plt.tight_layout()
plt.savefig('hessian_spectrum.png', dpi=150, bbox_inches='tight')
plt.show()

# Summary statistics
print(f"\nSpectrum Statistics:")
print(f"  Max eigenvalue: {max(full_eigenvalues):.4f}")
print(f"  Min eigenvalue: {min(full_eigenvalues):.6f}")
print(f"  Condition number: {max(full_eigenvalues) / max(min(full_eigenvalues), 1e-10):.2f}")
print(f"  Eigenvalues > 0.1: {np.sum(full_eigenvalues > 0.1)}")
print(f"  Safe learning rate < 2/λ_max: {2/max(full_eigenvalues):.4f}")
```

We can use second-order information without ever forming the full Hessian by exploiting Hessian-vector products.
Key Insight:
To solve HΔθ = -∇f for the Newton step, we don't need H explicitly—we only need to compute Hv for arbitrary vectors v. This is enough for iterative solvers.
Computing Hv Without H:
Recall from page 3 that we can compute:
$$\mathbf{H}\mathbf{v} = \nabla_\theta (\nabla_\theta f \cdot \mathbf{v})$$
This requires:
- one forward/backward pass to obtain ∇f with the computation graph retained,
- a second backward pass through the scalar ∇f · v.
Cost: ~2× one backward pass. Memory: O(n).
Conjugate Gradient (CG) Method:
CG solves Hx = b using only matrix-vector products Hv:
```
Initialize: x = 0, r = b, p = r
Repeat until convergence:
    α = (r · r) / (p · Hp)
    x = x + α p
    r_new = r - α Hp
    β = (r_new · r_new) / (r · r)
    p = r_new + β p
    r = r_new
```
Each iteration needs one Hv product. For n-dimensional problems, CG converges in at most n iterations (often much fewer).
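A minimal NumPy sketch of CG in which the Hessian is only accessed through a matrix-vector callback; the SPD test matrix is illustrative—in a real Hessian-free setting the callback would be the two-pass Hv product shown earlier:

```python
import numpy as np

def conjugate_gradient(hvp, b, tol=1e-10, max_iters=None):
    """Solve H x = b given only a Hessian-vector product callback hvp(v)."""
    n = b.shape[0]
    max_iters = max_iters or n
    x = np.zeros(n)
    r = b.copy()            # residual b - Hx (x = 0 initially)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Illustrative SPD "Hessian" — never materialized by CG itself.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
H = A @ A.T + 50 * np.eye(50)
grad = rng.standard_normal(50)

newton_step = conjugate_gradient(lambda v: H @ v, -grad)
print("Residual ‖H·Δθ + ∇f‖:", np.linalg.norm(H @ newton_step + grad))
```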
Hessian-Free Optimization Algorithm:
1. Compute the gradient ∇f(θ) on a (mini)batch.
2. Approximately solve HΔθ = -∇f with CG, using only Hessian-vector products (typically with damping: (H + λI)Δθ = -∇f).
3. Update θ ← θ + Δθ, optionally with a line search, and adjust the damping.
Advantages:
- Uses curvature information with only O(n) memory.
- Never forms or stores the full Hessian.
- CG can be truncated early, trading accuracy for compute.
Challenges:
- Each update requires many Hv products (many extra backward passes).
- The true Hessian can be indefinite, which breaks CG—hence damping or the Gauss-Newton approximation.
- Mini-batch noise makes curvature estimates less reliable.
For losses of the form L = ½‖r(θ)‖² (sum of squared residuals), the Gauss-Newton approximation H ≈ JᵀJ is always positive semidefinite, making CG well-behaved. This approximation ignores the second derivative of residuals, which is often small or zero anyway.
The Hessian matrix and second-order approximations provide the mathematical framework for understanding loss landscape curvature—essential for both optimization algorithms and theoretical analysis.
Module Synthesis:
Across this module on multivariate calculus, we've built a comprehensive toolkit: multivariable functions and partial derivatives, gradients and directional derivatives, Jacobians and the chain rule, Taylor approximations, and now the Hessian and second-order methods.
The Bigger Picture:
Multivariate calculus provides the mathematical language of machine learning optimization: gradients define descent directions, Jacobians propagate derivatives through composed functions (backpropagation), and Hessians describe the curvature that governs conditioning and convergence.
This foundation prepares you for advanced topics: convergence analysis of optimizers, natural-gradient and other second-order methods, and research on loss landscape geometry and generalization.
Practical Wisdom:
While we rarely compute full Hessians in deep learning, the concepts pervade practice: the largest eigenvalue bounds usable learning rates, adaptive optimizers approximate diagonal curvature, Hessian-vector products power spectrum analysis and Hessian-free methods, and the sharpness of minima informs thinking about generalization.
The mathematics of curvature isn't just theory—it's the lens through which we understand learning dynamics.
Congratulations! You've completed the Multivariate Calculus module. You now understand multivariable functions, gradients, Jacobians, Taylor series, and the Hessian—the complete first and second-order derivative toolkit for machine learning. These tools enable deep analysis of optimization algorithms and loss landscapes.