In our exploration of optimization for neural networks, we've established gradient descent as the workhorse algorithm that enables deep learning. Yet, as we watch gradient descent slowly navigate the loss landscape, a natural question emerges: Can we do better?
The answer lies in a mathematical treasure dating back to the 17th century—Newton's method. While Isaac Newton developed this technique for finding roots of equations, its extension to optimization provides a framework that dramatically accelerates convergence by exploiting not just the gradient, but the curvature of the loss surface.
This page represents your gateway to second-order optimization methods. We will develop Newton's method from first principles, understand its theoretical foundations, analyze its convergence properties, and critically examine why—despite its elegance—it presents formidable challenges for training modern neural networks.
By the end of this page, you will: (1) Derive Newton's method from Taylor expansion principles, (2) Understand the role of the Hessian matrix in capturing curvature information, (3) Analyze the quadratic convergence properties and their conditions, (4) Recognize the computational barriers that limit direct application to neural networks, and (5) Appreciate how Newton's method provides the theoretical foundation for practical second-order methods.
To understand Newton's method, we must first establish the mathematical framework that motivates it. The key insight comes from Taylor's theorem, which allows us to approximate a function locally using polynomial terms.
Consider a scalar function $f: \mathbb{R} \to \mathbb{R}$ that we wish to minimize. Taylor's theorem tells us that near any point $x_0$, we can approximate $f$ as:
$$f(x) \approx f(x_0) + f'(x_0)(x - x_0) + \frac{1}{2}f''(x_0)(x - x_0)^2 + \mathcal{O}((x - x_0)^3)$$
This quadratic approximation captures three essential pieces of information: the function's value $f(x_0)$ at the expansion point, its slope $f'(x_0)$, and its curvature $f''(x_0)$.
Gradient descent uses only the first two terms. Newton's method exploits all three.
Consider walking through a valley. The gradient tells you which way is downhill, but the curvature tells you how steep the walls are. A narrow valley has high curvature perpendicular to the path—small steps sideways cause large changes in height. Newton's method accounts for this, allowing larger steps in flat directions and smaller steps in curved directions.
For a function $f: \mathbb{R}^n \to \mathbb{R}$ with parameter vector $\boldsymbol{\theta} \in \mathbb{R}^n$, the Taylor expansion becomes:
$$f(\boldsymbol{\theta}) \approx f(\boldsymbol{\theta}_0) + \nabla f(\boldsymbol{\theta}_0)^T (\boldsymbol{\theta} - \boldsymbol{\theta}_0) + \frac{1}{2}(\boldsymbol{\theta} - \boldsymbol{\theta}_0)^T \mathbf{H}(\boldsymbol{\theta}_0)(\boldsymbol{\theta} - \boldsymbol{\theta}_0)$$
where $\nabla f(\boldsymbol{\theta}_0) \in \mathbb{R}^n$ is the gradient and $\mathbf{H}(\boldsymbol{\theta}_0) \in \mathbb{R}^{n \times n}$ is the Hessian matrix of second partial derivatives, $H_{ij} = \partial^2 f / \partial \theta_i \partial \theta_j$, both evaluated at $\boldsymbol{\theta}_0$.
The Hessian is a symmetric matrix (assuming continuous second derivatives) that encodes the curvature of $f$ in all directions. Its eigenvalues and eigenvectors describe the principal axes and magnitudes of curvature.
| Object | Symbol | Dimension | Information Captured |
|---|---|---|---|
| Gradient | $\nabla f$ | $n \times 1$ | Direction of steepest ascent |
| Hessian | $\mathbf{H}$ | $n \times n$ | Local curvature in all directions |
| Eigenvalues of $\mathbf{H}$ | $\lambda_i$ | Scalar | Curvature magnitude along eigenvector |
| Condition number | $\kappa = \lambda_{max}/\lambda_{min}$ | Scalar | Ratio of largest to smallest curvature |
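To make these objects concrete, here is a minimal NumPy sketch (the two-dimensional quadratic and its coefficient matrix are illustrative choices, not taken from the text) that computes the gradient, Hessian eigenvalues, and condition number.

```python
import numpy as np

# Illustrative "narrow valley": f(theta) = 0.5 * theta^T A theta, with very
# different curvature along its two axes.
A = np.array([[100.0, 0.0],
              [0.0,   1.0]])

def f(theta):
    return 0.5 * theta @ A @ theta

def grad_f(theta):
    return A @ theta              # n x 1: direction of steepest ascent

def hessian_f(theta):
    return A                      # n x n: constant curvature for a quadratic

theta = np.array([1.0, 1.0])
eigvals = np.linalg.eigvalsh(hessian_f(theta))   # curvature along each eigenvector
kappa = eigvals.max() / eigvals.min()            # condition number

print("gradient:", grad_f(theta))
print("eigenvalues:", eigvals)
print("condition number:", kappa)                # 100: an ill-conditioned valley
```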
Newton's method emerges naturally when we ask: What is the optimal step to take if we trust our quadratic approximation perfectly?
Define the quadratic approximation of $f$ around the current point $\boldsymbol{\theta}_k$:
$$m_k(\boldsymbol{\theta}) = f(\boldsymbol{\theta}_k) + \nabla f(\boldsymbol{\theta}_k)^T (\boldsymbol{\theta} - \boldsymbol{\theta}_k) + \frac{1}{2}(\boldsymbol{\theta} - \boldsymbol{\theta}_k)^T \mathbf{H}_k(\boldsymbol{\theta} - \boldsymbol{\theta}_k)$$
where $\mathbf{H}_k = \mathbf{H}(\boldsymbol{\theta}_k)$ is the Hessian evaluated at $\boldsymbol{\theta}_k$.
To minimize this quadratic, we take the derivative with respect to $\boldsymbol{\theta}$ and set it to zero:
$$\nabla m_k(\boldsymbol{\theta}) = \nabla f(\boldsymbol{\theta}_k) + \mathbf{H}_k(\boldsymbol{\theta} - \boldsymbol{\theta}_k) = 0$$
Solving for $\boldsymbol{\theta}$:
$$\boldsymbol{\theta} - \boldsymbol{\theta}_k = -\mathbf{H}_k^{-1} \nabla f(\boldsymbol{\theta}_k)$$
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \mathbf{H}_k^{-1} \nabla f(\boldsymbol{\theta}_k)$$
This is the Newton update: we step in the direction given by the inverse Hessian multiplied by the negative gradient. Compare with gradient descent: $\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \alpha \nabla f(\boldsymbol{\theta}_k)$, where we simply scale the gradient by a learning rate $\alpha$.
The Newton direction $\mathbf{d}_k = -\mathbf{H}_k^{-1} \nabla f_k$ has a profound geometric interpretation: the inverse Hessian rescales and rotates the gradient to account for curvature, shrinking the step along directions of high curvature (large eigenvalues) and stretching it along directions of low curvature (small eigenvalues).
This adaptive scaling is precisely what gradient descent lacks. In a long, narrow valley (high condition number), gradient descent oscillates because it takes equal-sized steps in all directions. Newton's method "flattens" the landscape by accounting for curvature, making the path directly toward the minimum.
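A minimal sketch of this contrast, using an illustrative ill-conditioned quadratic (not taken from the text): Newton's update reaches the minimizer of a quadratic in a single step, while gradient descent with its best fixed step size is still far away after many iterations. The listing that follows then applies the same Newton update, iterated, to the non-quadratic Rosenbrock function.

```python
import numpy as np

# Narrow-valley quadratic f(theta) = 0.5 * theta^T A theta, minimum at the origin
A = np.array([[100.0, 0.0],
              [0.0,   1.0]])
theta0 = np.array([1.0, 1.0])

# Newton step: theta - H^{-1} grad; for a quadratic this lands exactly on the minimum
theta_newton = theta0 - np.linalg.solve(A, A @ theta0)
print("Newton after 1 step:", theta_newton)       # [0. 0.]

# Gradient descent with the best fixed step size alpha = 2 / (lambda_max + lambda_min)
alpha = 2.0 / (100.0 + 1.0)
theta_gd = theta0.copy()
for _ in range(100):
    theta_gd = theta_gd - alpha * (A @ theta_gd)
print("Gradient descent after 100 steps:", theta_gd)  # error still around 0.13
```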
```python
import numpy as np
from numpy.linalg import inv, norm

def newtons_method(f, grad_f, hessian_f, theta_init, tol=1e-8, max_iter=100):
    """
    Newton's method for unconstrained optimization.

    Args:
        f: Objective function f(theta) -> scalar
        grad_f: Gradient function grad_f(theta) -> n-dimensional vector
        hessian_f: Hessian function hessian_f(theta) -> n x n matrix
        theta_init: Initial parameter vector
        tol: Convergence tolerance on gradient norm
        max_iter: Maximum number of iterations

    Returns:
        theta: Final parameter vector
        history: List of (theta, f_value, grad_norm) at each iteration
    """
    theta = theta_init.copy()
    history = []

    for iteration in range(max_iter):
        # Compute gradient and Hessian at current point
        grad = grad_f(theta)
        H = hessian_f(theta)

        # Check convergence
        grad_norm = norm(grad)
        history.append((theta.copy(), f(theta), grad_norm))
        if grad_norm < tol:
            print(f"Converged in {iteration} iterations")
            break

        # Compute Newton direction: solve H @ d = -grad
        # Using inverse (inefficient but clear; use solve() in practice)
        newton_direction = -inv(H) @ grad

        # Update parameters (pure Newton step, no line search)
        theta = theta + newton_direction

    return theta, history


# Example: Minimize Rosenbrock function (a challenging test function)
def rosenbrock(x):
    """Rosenbrock function: f(x,y) = (a-x)^2 + b(y-x^2)^2, with a=1, b=100"""
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    """Gradient of Rosenbrock function"""
    df_dx = -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2)
    df_dy = 200 * (x[1] - x[0]**2)
    return np.array([df_dx, df_dy])

def rosenbrock_hessian(x):
    """Hessian of Rosenbrock function"""
    H = np.array([
        [2 - 400 * (x[1] - 3 * x[0]**2), -400 * x[0]],
        [-400 * x[0], 200]
    ])
    return H

# Run Newton's method
x0 = np.array([-1.5, 2.0])
x_opt, history = newtons_method(rosenbrock, rosenbrock_grad, rosenbrock_hessian, x0)
print(f"Optimal point: {x_opt}")
print(f"Function value: {rosenbrock(x_opt):.2e}")
```

Newton's method is celebrated for its quadratic convergence—a property that gradient descent conspicuously lacks. Understanding this convergence behavior is essential to appreciating both the power and limitations of second-order methods.
A sequence $\{\boldsymbol{\theta}_k\}$ converges quadratically to $\boldsymbol{\theta}^*$ if there exists a constant $C > 0$ such that:
$$\|\boldsymbol{\theta}_{k+1} - \boldsymbol{\theta}^*\| \leq C \|\boldsymbol{\theta}_k - \boldsymbol{\theta}^*\|^2$$
This means the error is squared at each iteration. If you're at distance $10^{-2}$ from the optimum, the next iteration brings you to $\approx 10^{-4}$, then $10^{-8}$, then $10^{-16}$. Just a few iterations can achieve machine precision.
Compare with linear convergence (typical for gradient descent):
$$\|\boldsymbol{\theta}_{k+1} - \boldsymbol{\theta}^*\| \leq \rho \|\boldsymbol{\theta}_k - \boldsymbol{\theta}^*\|$$
for some $\rho < 1$. The error decreases by a constant factor each iteration—much slower when $\rho$ is close to 1.
| Iteration | Gradient Descent error ($\rho=0.9$) | Newton's Method error |
|---|---|---|
| 0 | $10^{-1}$ | $10^{-1}$ |
| 1 | $9 \times 10^{-2}$ | $10^{-2}$ |
| 2 | $8.1 \times 10^{-2}$ | $10^{-4}$ |
| 5 | $5.9 \times 10^{-2}$ | $10^{-32}$ |
| 10 | $3.5 \times 10^{-2}$ | Machine precision |
| 50 | $5.2 \times 10^{-4}$ | — |
| 100 | $2.7 \times 10^{-6}$ | — |
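The table can be reproduced with a short simulation of the two idealized error recurrences; a minimal sketch, assuming $C = 1$, $\rho = 0.9$, and an initial error of $10^{-1}$ for both methods.

```python
# Idealized error recurrences: linear (gradient descent) vs. quadratic (Newton)
e_gd, e_newton = 1e-1, 1e-1
rho, C = 0.9, 1.0

for k in range(1, 11):
    e_gd = rho * e_gd              # error shrinks by a constant factor
    e_newton = C * e_newton**2     # error is squared each iteration
    # Newton's error underflows to 0.0 within a few iterations (below machine precision)
    print(f"iter {k:3d}   gradient descent: {e_gd:.1e}   Newton: {e_newton:.1e}")
```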
Newton's quadratic convergence is not guaranteed universally. The following conditions must hold:
1. The Hessian is Positive Definite at the Minimum
At a local minimum $\boldsymbol{\theta}^*$, the Hessian $\mathbf{H}(\boldsymbol{\theta}^*)$ must have all positive eigenvalues. This ensures that the Hessian is invertible and that, near $\boldsymbol{\theta}^*$, the Newton direction is a descent direction pointing toward a true minimum rather than a saddle point or maximum.
2. The Starting Point is Sufficiently Close
Newton's method has only local convergence guarantees. If $\boldsymbol{\theta}_0$ is too far from $\boldsymbol{\theta}^*$, the quadratic approximation may be poor, and the method can diverge or converge to a different critical point.
3. The Hessian is Lipschitz Continuous
There exists $L > 0$ such that $\|\mathbf{H}(\boldsymbol{\theta}) - \mathbf{H}(\boldsymbol{\theta}')\| \leq L \|\boldsymbol{\theta} - \boldsymbol{\theta}'\|$ for all $\boldsymbol{\theta}, \boldsymbol{\theta}'$ in a neighborhood of $\boldsymbol{\theta}^*$. This technical condition ensures the curvature doesn't change too rapidly.
Pure Newton's method can fail catastrophically far from the solution. It may: (1) Diverge entirely if the Hessian is indefinite, (2) Step to a saddle point or maximum, (3) Overshoot and oscillate. In practice, we combine Newton's method with safeguards like line search or trust regions to ensure global convergence while preserving fast local convergence.
Theorem (Newton's Method Local Convergence): Let $f: \mathbb{R}^n \to \mathbb{R}$ be twice continuously differentiable. Suppose $\boldsymbol{\theta}^*$ is a point where $\nabla f(\boldsymbol{\theta}^*) = 0$ and $\mathbf{H}(\boldsymbol{\theta}^*)$ is positive definite. Suppose further that $\mathbf{H}$ is Lipschitz continuous in a neighborhood of $\boldsymbol{\theta}^*$. Then there exists $\delta > 0$ such that if $\|\boldsymbol{\theta}_0 - \boldsymbol{\theta}^*\| < \delta$, the Newton iterates $\{\boldsymbol{\theta}_k\}$ converge to $\boldsymbol{\theta}^*$ quadratically.
The proof relies on showing that the error $\boldsymbol{e}_k = \boldsymbol{\theta}_k - \boldsymbol{\theta}^*$ satisfies a recurrence of the form $\|\boldsymbol{e}_{k+1}\| \leq C \|\boldsymbol{e}_k\|^2$ using Taylor expansion and the Lipschitz property of the Hessian.
The elegance of Newton's method masks a computational reality that limits its practicality for large-scale machine learning: the Hessian is expensive.
Consider a neural network with $n$ parameters. The computational costs are:
| Operation | Cost |
|---|---|
| Gradient computation | $\mathcal{O}(n)$ per sample |
| Hessian storage | $\mathcal{O}(n^2)$ |
| Hessian computation | $\mathcal{O}(n^2)$ per sample |
| Solving $\mathbf{H}\mathbf{d} = -\nabla f$ | $\mathcal{O}(n^3)$ |
For a modest network with $n = 10^6$ parameters, the Hessian has $10^{12}$ entries (roughly 4 TB in single precision), and a direct solve of the Newton system costs on the order of $10^{18}$ operations.
Even for networks orders of magnitude smaller, forming and inverting the full Hessian is prohibitive.
A ResNet-50 has ~25 million parameters. Its Hessian would have about 625 trillion entries, consuming roughly 5 petabytes of storage in double precision. Forming and directly inverting such a matrix is far beyond the reach of current hardware. This is why we need Hessian-free methods.
Despite the prohibitive costs, understanding how the Hessian can be computed is valuable:
Method 1: Analytic Differentiation
For simple functions, we can derive the Hessian analytically: $$H_{ij} = \frac{\partial^2 f}{\partial \theta_i \partial \theta_j}$$
This is tractable for small networks but becomes error-prone and tedious for complex architectures.
Method 2: Automatic Differentiation
Modern frameworks support second-order derivatives through automatic differentiation. However, computing the full Hessian still requires $\mathcal{O}(n)$ backward passes—one per parameter—making it $\mathcal{O}(n^2)$ overall.
Method 3: Finite Differences
Approximate each Hessian column via finite differences of the gradient: $$\frac{\partial^2 f}{\partial \theta_i \partial \theta_j} \approx \frac{\nabla_j f(\boldsymbol{\theta} + \epsilon \mathbf{e}_i) - \nabla_j f(\boldsymbol{\theta})}{\epsilon}$$
This requires on the order of $n$ additional gradient evaluations for the full Hessian (about $2n$ if central differences are used for accuracy)—expensive and numerically sensitive to the choice of $\epsilon$.
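A minimal finite-difference sketch, assuming the rosenbrock_grad and rosenbrock_hessian functions from the NumPy listing earlier are in scope and using an illustrative step size $\epsilon = 10^{-5}$:

```python
import numpy as np

def hessian_finite_diff(grad_f, theta, eps=1e-5):
    """Approximate the Hessian one column at a time from gradient differences."""
    n = theta.size
    H = np.zeros((n, n))
    g0 = grad_f(theta)
    for i in range(n):
        e_i = np.zeros(n)
        e_i[i] = eps
        H[:, i] = (grad_f(theta + e_i) - g0) / eps   # i-th column of the Hessian
    return 0.5 * (H + H.T)                           # symmetrize to reduce noise

theta = np.array([-1.5, 2.0])
print(hessian_finite_diff(rosenbrock_grad, theta))   # approximate Hessian
print(rosenbrock_hessian(theta))                     # analytic Hessian for comparison
```

The PyTorch listing below takes the automatic-differentiation route instead, and also previews the Hessian-vector product idea developed afterward.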
```python
import torch
import torch.nn as nn

def compute_hessian_autograd(model, loss_fn, x, y):
    """
    Compute full Hessian using PyTorch autograd.
    WARNING: Only feasible for very small models!

    Args:
        model: PyTorch model
        loss_fn: Loss function
        x, y: Input data and labels

    Returns:
        hessian: Full Hessian matrix (n x n)
    """
    # Get all parameters as a single vector
    params = torch.cat([p.flatten() for p in model.parameters()])
    n = params.numel()

    # Compute loss
    output = model(x)
    loss = loss_fn(output, y)

    # Compute gradient (create graph for second derivatives)
    grad = torch.autograd.grad(loss, model.parameters(), create_graph=True)
    grad_flat = torch.cat([g.flatten() for g in grad])

    # Compute Hessian column by column
    hessian = torch.zeros(n, n)
    for i in range(n):
        # Compute gradient of gradient[i] w.r.t. all parameters
        grad2 = torch.autograd.grad(
            grad_flat[i], model.parameters(), retain_graph=True
        )
        hessian[i] = torch.cat([g.flatten() for g in grad2])

    return hessian


def hessian_vector_product(model, loss_fn, x, y, v):
    """
    Compute Hessian-vector product H @ v efficiently.
    Uses only O(n) computation, not O(n^2)!
    This is the key insight for Hessian-free optimization.
    """
    # Forward pass
    output = model(x)
    loss = loss_fn(output, y)

    # First backward pass: compute gradient
    grad = torch.autograd.grad(loss, model.parameters(), create_graph=True)
    grad_flat = torch.cat([g.flatten() for g in grad])

    # Compute gradient dot v
    grad_v = torch.sum(grad_flat * v)

    # Second backward pass: gradient of (grad · v) = Hv
    hv = torch.autograd.grad(grad_v, model.parameters())
    return torch.cat([h.flatten() for h in hv])


# Example showing the dramatic difference
class TinyMLP(nn.Module):
    def __init__(self, input_dim=10, hidden_dim=5, output_dim=2):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

# Count parameters
model = TinyMLP()
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params}")
print(f"Hessian size: {n_params} x {n_params} = {n_params**2} entries")
print(f"Memory for Hessian: {n_params**2 * 4 / 1e6:.2f} MB (float32)")
```

A crucial observation enables practical second-order methods: we rarely need the full Hessian—we need its action on vectors.
The product $\mathbf{H}\mathbf{v}$ can be computed in $\mathcal{O}(n)$ time using Pearlmutter's trick (1994): because $\mathbf{H}\mathbf{v} = \nabla_{\boldsymbol{\theta}}\left(\nabla f(\boldsymbol{\theta})^T \mathbf{v}\right)$, one additional backward pass through the scalar $\nabla f(\boldsymbol{\theta})^T \mathbf{v}$ yields the Hessian-vector product without ever forming $\mathbf{H}$—exactly what the hessian_vector_product function above does.
This is the foundation of Hessian-free optimization, which we'll explore in the next page. Instead of solving $\mathbf{H}\mathbf{d} = -\mathbf{g}$ directly, we use iterative methods (like conjugate gradient) that only require Hessian-vector products.
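As a preview, here is a minimal conjugate-gradient sketch that solves $\mathbf{H}\mathbf{d} = -\mathbf{g}$ using only a Hessian-vector-product callback (for instance the hessian_vector_product function above, or any hvp(v) returning $\mathbf{H}\mathbf{v}$); the tolerance, iteration cap, and the small explicit matrix in the usage example are illustrative assumptions.

```python
import numpy as np

def newton_direction_cg(hvp, grad, tol=1e-10, max_iter=50):
    """Solve H d = -grad by conjugate gradient, touching H only through hvp(v) = H @ v."""
    d = np.zeros_like(grad)
    r = -grad - hvp(d)             # residual of H d = -grad; equals -grad at d = 0
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        d = d + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return d

# Usage with a small explicit matrix standing in for the HVP callback
H = np.array([[3.0, 1.0], [1.0, 2.0]])
g = np.array([1.0, -1.0])
print(newton_direction_cg(lambda v: H @ v, g))   # matches np.linalg.solve(H, -g)
```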
Applying Newton's method to neural network training introduces additional challenges beyond computational cost. Understanding these issues is essential for appreciating why we need more sophisticated second-order methods.
Neural network loss surfaces differ fundamentally from the well-behaved convex functions where Newton's method excels:
1. Non-Convexity
Neural network loss functions are highly non-convex, with numerous local minima, saddle points, and broad flat regions (plateaus).
At saddle points, Newton's method behaves poorly. If $\mathbf{H}$ is indefinite (has negative eigenvalues), the Newton direction may point toward a saddle or maximum rather than a minimum.
2. Near-Singular Hessians
Deep networks often have Hessians that are nearly singular—with many eigenvalues close to zero. This occurs because modern networks are heavily over-parameterized: redundant weights, architectural symmetries, and flat directions in the loss surface all produce near-zero curvature.
When $\mathbf{H}$ is singular or nearly so, $\mathbf{H}^{-1}$ doesn't exist or is numerically unstable.
In high dimensions, saddle points vastly outnumber local minima. For a random function in n dimensions, the probability that all n eigenvalues of the Hessian are positive (local minimum) decreases exponentially with n. Newton's method can get attracted to saddle points, where it struggles because the Hessian is indefinite.
Deep learning training uses stochastic gradients computed on mini-batches. This introduces additional issues:
Gradient Noise: The stochastic gradient $\mathbf{g}_B$ (computed on batch $B$) is an unbiased but noisy estimate of the true gradient $\nabla f$.
Hessian Noise: The stochastic Hessian is even noisier. The variance of Hessian estimates is higher than gradient estimates, making second-order information less reliable.
Sample Efficiency: If we compute the Hessian (or Hessian-vector products) on the same batch as the gradient, we're overfitting to that batch's curvature. Computing on different batches requires more data passes.
To address the challenges above, practical Newton-like methods incorporate regularization:
Levenberg-Marquardt Damping: Replace $\mathbf{H}$ with $\mathbf{H} + \lambda \mathbf{I}$, where $\lambda > 0$ ensures positive definiteness:
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - (\mathbf{H}_k + \lambda \mathbf{I})^{-1} \nabla f_k$$
When $\lambda$ is large, this approaches gradient descent (with learning rate $1/\lambda$). When $\lambda$ is small, it approaches Newton's method. This interpolation provides stability far from the solution while preserving fast convergence nearby.
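A minimal sketch of a damped Newton step; the adaptive rule for $\lambda$ below is an illustrative heuristic (shrink after a successful step, grow after a failed one), not a prescription from the text.

```python
import numpy as np

def damped_newton_step(grad, H, lam):
    """Levenberg-Marquardt step: solve (H + lam * I) d = -grad."""
    return np.linalg.solve(H + lam * np.eye(grad.size), -grad)

def update_damping(lam, f_old, f_new):
    """Illustrative schedule: trust the quadratic model more after success, less after failure."""
    return max(lam * 0.5, 1e-8) if f_new < f_old else lam * 10.0

# With an indefinite Hessian a pure Newton step would be unsafe,
# but enough damping restores positive definiteness.
H = np.array([[2.0, 0.0], [0.0, -1.0]])
g = np.array([1.0, 1.0])
print(damped_newton_step(g, H, lam=10.0))   # close to -g / lam (gradient-descent-like)
print(damped_newton_step(g, H, lam=2.0))    # H + 2I is positive definite
```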
Trust Region Methods: Instead of modifying the Hessian, constrain the step size. Solve:
$$\min_\mathbf{d} \quad m_k(\boldsymbol{\theta}_k + \mathbf{d}) \quad \text{subject to} \quad \|\mathbf{d}\| \leq \Delta_k$$
The trust region radius $\Delta_k$ is adjusted based on how well the quadratic model predicts the actual function decrease.
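A minimal sketch of the radius-update logic; the thresholds (0.25, 0.75) and scaling factors are conventional illustrative choices, not values given in the text.

```python
def update_trust_radius(delta, rho, step_norm, delta_max=10.0):
    """
    Update the trust-region radius from the agreement ratio
        rho = (actual decrease in f) / (decrease predicted by the quadratic model m_k).
    """
    if rho < 0.25:                                 # poor model fit: shrink the region
        return 0.25 * delta
    if rho > 0.75 and abs(step_norm - delta) < 1e-12:
        return min(2.0 * delta, delta_max)         # good fit, step hit the boundary: expand
    return delta                                   # otherwise keep the radius
```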
Understanding Newton's method's historical development illuminates why it remains foundational despite its computational limitations.
Newton originally developed his method for finding roots of equations—solving $g(x) = 0$. The iteration:
$$x_{k+1} = x_k - \frac{g(x_k)}{g'(x_k)}$$
finds where a function crosses zero by repeatedly linearizing and solving. The extension to optimization comes from applying this to $g = \nabla f$—we find where the gradient is zero, which (for convex functions) corresponds to the minimum.
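For instance, a few iterations of the original root-finding scheme recover $\sqrt{2}$ by solving $g(x) = x^2 - 2 = 0$ (a minimal sketch):

```python
def newton_root(g, g_prime, x0, n_steps=6):
    """Root finding: x_{k+1} = x_k - g(x_k) / g'(x_k)."""
    x = x0
    for _ in range(n_steps):
        x = x - g(x) / g_prime(x)
    return x

# g(x) = x^2 - 2 has root sqrt(2) ~ 1.41421356; convergence is quadratic near the root
print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))
```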
Joseph Raphson independently discovered the method, and its formalization for optimization came later through the work of mathematicians studying calculus of variations and numerical analysis. The key insight—that second-order information accelerates convergence—has driven optimization research for centuries.
Newton's method directly inspired quasi-Newton methods such as BFGS and L-BFGS, the Gauss-Newton and Levenberg-Marquardt algorithms, Hessian-free (truncated Newton) optimization, and natural-gradient methods such as K-FAC.
Understanding Newton's method is essential for understanding all of these.
Even though we rarely apply pure Newton's method to neural networks, its conceptual framework permeates modern optimization. Every discussion of curvature, conditioning, and convergence rates builds on Newton's insights. The methods in subsequent pages are all, in some sense, attempts to capture Newton's benefits without paying Newton's costs.
We have developed a comprehensive understanding of Newton's method for optimization—the theoretical foundation upon which practical second-order methods are built.
Newton's method reveals a fundamental tension in optimization: we want curvature information for fast convergence, but computing it exactly is prohibitively expensive. The next page explores Hessian-free optimization, which resolves this tension by using iterative linear solvers combined with efficient Hessian-vector products. We'll see how to capture much of Newton's benefit at a fraction of the cost.
You now have a rigorous understanding of Newton's method—its derivation, convergence properties, computational requirements, and limitations in the neural network context. This foundation prepares you for the practical second-order methods that follow: Hessian-free optimization, natural gradient descent, and K-FAC.