In our exploration of optimization for neural networks, we've established gradient descent as the workhorse algorithm that enables deep learning. Yet, as we watch gradient descent slowly navigate the loss landscape, a natural question emerges: Can we do better?
The answer lies in a mathematical treasure dating back to the 17th century—Newton's method. While Isaac Newton developed this technique for finding roots of equations, its extension to optimization provides a framework that dramatically accelerates convergence by exploiting not just the gradient, but the curvature of the loss surface.
This page represents your gateway to second-order optimization methods. We will develop Newton's method from first principles, understand its theoretical foundations, analyze its convergence properties, and critically examine why—despite its elegance—it presents formidable challenges for training modern neural networks.
By the end of this page, you will: (1) Derive Newton's method from Taylor expansion principles, (2) Understand the role of the Hessian matrix in capturing curvature information, (3) Analyze the quadratic convergence properties and their conditions, (4) Recognize the computational barriers that limit direct application to neural networks, and (5) Appreciate how Newton's method provides the theoretical foundation for practical second-order methods.
To understand Newton's method, we must first establish the mathematical framework that motivates it. The key insight comes from Taylor's theorem, which allows us to approximate a function locally using polynomial terms.
Consider a scalar function $f: \mathbb{R} \to \mathbb{R}$ that we wish to minimize. Taylor's theorem tells us that near any point $x_0$, we can approximate $f$ as:
$$f(x) \approx f(x_0) + f'(x_0)(x - x_0) + \frac{1}{2}f''(x_0)(x - x_0)^2 + \mathcal{O}((x - x_0)^3)$$
This quadratic approximation captures three essential pieces of information: the function's value $f(x_0)$ at the expansion point, its slope $f'(x_0)$, and its curvature $f''(x_0)$.
Gradient descent uses only the first two terms. Newton's method exploits all three.
Consider walking through a valley. The gradient tells you which way is downhill, but the curvature tells you how steep the walls are. A narrow valley has high curvature perpendicular to the path—small steps sideways cause large changes in height. Newton's method accounts for this, allowing larger steps in flat directions and smaller steps in curved directions.
For a function $f: \mathbb{R}^n \to \mathbb{R}$ with parameter vector $\boldsymbol{\theta} \in \mathbb{R}^n$, the Taylor expansion becomes:
$$f(\boldsymbol{\theta}) \approx f(\boldsymbol{\theta}_0) + \nabla f(\boldsymbol{\theta}_0)^T (\boldsymbol{\theta} - \boldsymbol{\theta}_0) + \frac{1}{2}(\boldsymbol{\theta} - \boldsymbol{\theta}_0)^T \mathbf{H}(\boldsymbol{\theta}_0)(\boldsymbol{\theta} - \boldsymbol{\theta}_0)$$
where $\nabla f(\boldsymbol{\theta}_0) \in \mathbb{R}^n$ is the gradient and $\mathbf{H}(\boldsymbol{\theta}_0) \in \mathbb{R}^{n \times n}$ is the Hessian matrix of second partial derivatives, $H_{ij} = \partial^2 f / \partial \theta_i \partial \theta_j$, both evaluated at $\boldsymbol{\theta}_0$.
The Hessian is a symmetric matrix (assuming continuous second derivatives) that encodes the curvature of $f$ in all directions. Its eigenvalues and eigenvectors describe the principal axes and magnitudes of curvature.
| Object | Symbol | Dimension | Information Captured |
|---|---|---|---|
| Gradient | $\nabla f$ | $n \times 1$ | Direction of steepest ascent |
| Hessian | $\mathbf{H}$ | $n \times n$ | Local curvature in all directions |
| Eigenvalues of $\mathbf{H}$ | $\lambda_i$ | Scalar | Curvature magnitude along eigenvector |
| Condition number | $\kappa = \lambda_{max}/\lambda_{min}$ | Scalar | Ratio of largest to smallest curvature |
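To make these objects concrete, here is a minimal NumPy sketch (the two-dimensional quadratic and its coefficient matrix are illustrative choices, not taken from the text) that computes the gradient, Hessian eigenvalues, and condition number.

```python
import numpy as np

# Illustrative "narrow valley": f(theta) = 0.5 * theta^T A theta, with very
# different curvature along its two axes.
A = np.array([[100.0, 0.0],
              [0.0,   1.0]])

def f(theta):
    return 0.5 * theta @ A @ theta

def grad_f(theta):
    return A @ theta              # n x 1: direction of steepest ascent

def hessian_f(theta):
    return A                      # n x n: constant curvature for a quadratic

theta = np.array([1.0, 1.0])
eigvals = np.linalg.eigvalsh(hessian_f(theta))   # curvature along each eigenvector
kappa = eigvals.max() / eigvals.min()            # condition number

print("gradient:", grad_f(theta))
print("eigenvalues:", eigvals)
print("condition number:", kappa)                # 100: an ill-conditioned valley
```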
Newton's method emerges naturally when we ask: What is the optimal step to take if we trust our quadratic approximation perfectly?
Define the quadratic approximation of $f$ around the current point $\boldsymbol{\theta}_k$:
$$m_k(\boldsymbol{\theta}) = f(\boldsymbol{\theta}_k) + \nabla f(\boldsymbol{\theta}_k)^T (\boldsymbol{\theta} - \boldsymbol{\theta}_k) + \frac{1}{2}(\boldsymbol{\theta} - \boldsymbol{\theta}_k)^T \mathbf{H}_k(\boldsymbol{\theta} - \boldsymbol{\theta}_k)$$
where $\mathbf{H}_k = \mathbf{H}(\boldsymbol{\theta}_k)$ is the Hessian evaluated at $\boldsymbol{\theta}_k$.
To minimize this quadratic, we take the derivative with respect to $\boldsymbol{\theta}$ and set it to zero:
$$\nabla m_k(\boldsymbol{\theta}) = \nabla f(\boldsymbol{\theta}_k) + \mathbf{H}_k(\boldsymbol{\theta} - \boldsymbol{\theta}_k) = 0$$
Solving for $\boldsymbol{\theta}$:
$$\boldsymbol{\theta} - \boldsymbol{\theta}_k = -\mathbf{H}_k^{-1} \nabla f(\boldsymbol{\theta}_k)$$
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \mathbf{H}_k^{-1} \nabla f(\boldsymbol{\theta}_k)$$
This is the Newton update: we step in the direction given by the inverse Hessian multiplied by the negative gradient. Compare with gradient descent: $\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \alpha \nabla f(\boldsymbol{\theta}_k)$, where we simply scale the gradient by a learning rate $\alpha$.
The Newton direction $\mathbf{d}_k = -\mathbf{H}_k^{-1} \nabla f_k$ has a profound geometric interpretation: the inverse Hessian rescales and rotates the gradient to account for curvature, shrinking the step along directions of high curvature (large eigenvalues) and stretching it along directions of low curvature (small eigenvalues).
This adaptive scaling is precisely what gradient descent lacks. In a long, narrow valley (high condition number), gradient descent oscillates because it takes equal-sized steps in all directions. Newton's method "flattens" the landscape by accounting for curvature, making the path directly toward the minimum.
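A minimal sketch of this contrast, using an illustrative ill-conditioned quadratic (not taken from the text): Newton's update reaches the minimizer of a quadratic in a single step, while gradient descent with its best fixed step size is still far away after many iterations. The listing that follows then applies the same Newton update, iterated, to the non-quadratic Rosenbrock function.

```python
import numpy as np

# Narrow-valley quadratic f(theta) = 0.5 * theta^T A theta, minimum at the origin
A = np.array([[100.0, 0.0],
              [0.0,   1.0]])
theta0 = np.array([1.0, 1.0])

# Newton step: theta - H^{-1} grad; for a quadratic this lands exactly on the minimum
theta_newton = theta0 - np.linalg.solve(A, A @ theta0)
print("Newton after 1 step:", theta_newton)       # [0. 0.]

# Gradient descent with the best fixed step size alpha = 2 / (lambda_max + lambda_min)
alpha = 2.0 / (100.0 + 1.0)
theta_gd = theta0.copy()
for _ in range(100):
    theta_gd = theta_gd - alpha * (A @ theta_gd)
print("Gradient descent after 100 steps:", theta_gd)  # error still around 0.13
```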
```python
import numpy as np
from numpy.linalg import inv, norm

def newtons_method(f, grad_f, hessian_f, theta_init, tol=1e-8, max_iter=100):
    """
    Newton's method for unconstrained optimization.

    Args:
        f: Objective function f(theta) -> scalar
        grad_f: Gradient function grad_f(theta) -> n-dimensional vector
        hessian_f: Hessian function hessian_f(theta) -> n x n matrix
        theta_init: Initial parameter vector
        tol: Convergence tolerance on gradient norm
        max_iter: Maximum number of iterations

    Returns:
        theta: Final parameter vector
        history: List of (theta, f_value, grad_norm) at each iteration
    """
    theta = theta_init.copy()
    history = []

    for iteration in range(max_iter):
        # Compute gradient and Hessian at current point
        grad = grad_f(theta)
        H = hessian_f(theta)

        # Check convergence
        grad_norm = norm(grad)
        history.append((theta.copy(), f(theta), grad_norm))
        if grad_norm < tol:
            print(f"Converged in {iteration} iterations")
            break

        # Compute Newton direction: solve H @ d = -grad
        # Using inverse (inefficient but clear; use solve() in practice)
        newton_direction = -inv(H) @ grad

        # Update parameters (pure Newton step, no line search)
        theta = theta + newton_direction

    return theta, history


# Example: Minimize Rosenbrock function (a challenging test function)
def rosenbrock(x):
    """Rosenbrock function: f(x,y) = (a-x)^2 + b(y-x^2)^2, with a=1, b=100"""
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    """Gradient of Rosenbrock function"""
    df_dx = -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2)
    df_dy = 200 * (x[1] - x[0]**2)
    return np.array([df_dx, df_dy])

def rosenbrock_hessian(x):
    """Hessian of Rosenbrock function"""
    H = np.array([
        [2 - 400 * (x[1] - 3 * x[0]**2), -400 * x[0]],
        [-400 * x[0], 200]
    ])
    return H

# Run Newton's method
x0 = np.array([-1.5, 2.0])
x_opt, history = newtons_method(rosenbrock, rosenbrock_grad, rosenbrock_hessian, x0)
print(f"Optimal point: {x_opt}")
print(f"Function value: {rosenbrock(x_opt):.2e}")
```

Newton's method is celebrated for its quadratic convergence—a property that gradient descent conspicuously lacks. Understanding this convergence behavior is essential to appreciating both the power and limitations of second-order methods.
A sequence $\{\boldsymbol{\theta}_k\}$ converges quadratically to $\boldsymbol{\theta}^*$ if there exists a constant $C > 0$ such that:
$$\|\boldsymbol{\theta}_{k+1} - \boldsymbol{\theta}^*\| \leq C \|\boldsymbol{\theta}_k - \boldsymbol{\theta}^*\|^2$$
This means the error is squared at each iteration. If you're at distance $10^{-2}$ from the optimum, the next iteration brings you to $\approx 10^{-4}$, then $10^{-8}$, then $10^{-16}$. Just a few iterations can achieve machine precision.
Compare with linear convergence (typical for gradient descent):
$$\|\boldsymbol{\theta}_{k+1} - \boldsymbol{\theta}^*\| \leq \rho \|\boldsymbol{\theta}_k - \boldsymbol{\theta}^*\|$$
for some $\rho < 1$. The error decreases by a constant factor each iteration—much slower when $\rho$ is close to 1.
| Iteration | Gradient Descent error ($\rho=0.9$) | Newton's Method error |
|---|---|---|
| 0 | $10^{-1}$ | $10^{-1}$ |
| 1 | $9 \times 10^{-2}$ | $10^{-2}$ |
| 2 | $8.1 \times 10^{-2}$ | $10^{-4}$ |
| 5 | $5.9 \times 10^{-2}$ | $10^{-32}$ |
| 10 | $3.5 \times 10^{-2}$ | Machine precision |
| 50 | $5.2 \times 10^{-4}$ | — |
| 100 | $2.7 \times 10^{-6}$ | — |
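The table can be reproduced with a short simulation of the two idealized error recurrences; a minimal sketch, assuming $C = 1$, $\rho = 0.9$, and an initial error of $10^{-1}$ for both methods.

```python
# Idealized error recurrences: linear (gradient descent) vs. quadratic (Newton)
e_gd, e_newton = 1e-1, 1e-1
rho, C = 0.9, 1.0

for k in range(1, 11):
    e_gd = rho * e_gd              # error shrinks by a constant factor
    e_newton = C * e_newton**2     # error is squared each iteration
    # Newton's error underflows to 0.0 within a few iterations (below machine precision)
    print(f"iter {k:3d}   gradient descent: {e_gd:.1e}   Newton: {e_newton:.1e}")
```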
Newton's quadratic convergence is not guaranteed universally. The following conditions must hold:
1. The Hessian is Positive Definite at the Minimum
At a local minimum $\boldsymbol{\theta}^*$, the Hessian $\mathbf{H}(\boldsymbol{\theta}^*)$ must have all positive eigenvalues. This ensures that the Hessian is invertible and that, near $\boldsymbol{\theta}^*$, the Newton direction is a descent direction pointing toward a true minimum rather than a saddle point or maximum.
2. The Starting Point is Sufficiently Close
Newton's method has only local convergence guarantees. If $\boldsymbol{\theta}_0$ is too far from $\boldsymbol{\theta}^*$, the quadratic approximation may be poor, and the method can diverge or converge to a different critical point.
3. The Hessian is Lipschitz Continuous
There exists $L > 0$ such that $\|\mathbf{H}(\boldsymbol{\theta}) - \mathbf{H}(\boldsymbol{\theta}')\| \leq L \|\boldsymbol{\theta} - \boldsymbol{\theta}'\|$ for all $\boldsymbol{\theta}, \boldsymbol{\theta}'$ in a neighborhood of $\boldsymbol{\theta}^*$. This technical condition ensures the curvature doesn't change too rapidly.
Pure Newton's method can fail catastrophically far from the solution. It may: (1) Diverge entirely if the Hessian is indefinite, (2) Step to a saddle point or maximum, (3) Overshoot and oscillate. In practice, we combine Newton's method with safeguards like line search or trust regions to ensure global convergence while preserving fast local convergence.
Theorem (Newton's Method Local Convergence): Let $f: \mathbb{R}^n \to \mathbb{R}$ be twice continuously differentiable. Suppose $\boldsymbol{\theta}^*$ is a point where $\nabla f(\boldsymbol{\theta}^*) = 0$ and $\mathbf{H}(\boldsymbol{\theta}^*)$ is positive definite. Suppose further that $\mathbf{H}$ is Lipschitz continuous in a neighborhood of $\boldsymbol{\theta}^*$. Then there exists $\delta > 0$ such that if $\|\boldsymbol{\theta}_0 - \boldsymbol{\theta}^*\| < \delta$, the Newton iterates $\{\boldsymbol{\theta}_k\}$ converge to $\boldsymbol{\theta}^*$ quadratically.
The proof relies on showing that the error $\boldsymbol{e}_k = \boldsymbol{\theta}_k - \boldsymbol{\theta}^*$ satisfies a recurrence of the form $\|\boldsymbol{e}_{k+1}\| \leq C \|\boldsymbol{e}_k\|^2$ using Taylor expansion and the Lipschitz property of the Hessian.
The elegance of Newton's method masks a computational reality that limits its practicality for large-scale machine learning: the Hessian is expensive.
Consider a neural network with $n$ parameters. The computational costs are:
| Operation | Cost |
|---|---|
| Gradient computation | $\mathcal{O}(n)$ per sample |
| Hessian storage | $\mathcal{O}(n^2)$ |
| Hessian computation | $\mathcal{O}(n^2)$ per sample |
| Solving $\mathbf{H}\mathbf{d} = -\nabla f$ | $\mathcal{O}(n^3)$ |
For a modest network with $n = 10^6$ parameters, the Hessian has $10^{12}$ entries (roughly 4 TB in single precision), and a direct solve of the Newton system costs on the order of $10^{18}$ operations.
Even for networks orders of magnitude smaller, forming and inverting the full Hessian is prohibitive.
A ResNet-50 has ~25 million parameters. Its Hessian would have about 625 trillion entries, consuming roughly 5 petabytes of storage in double precision. Forming and directly inverting such a matrix is far beyond the reach of current hardware. This is why we need Hessian-free methods.
Despite the prohibitive costs, understanding how the Hessian can be computed is valuable:
Method 1: Analytic Differentiation
For simple functions, we can derive the Hessian analytically: $$H_{ij} = \frac{\partial^2 f}{\partial \theta_i \partial \theta_j}$$
This is tractable for small networks but becomes error-prone and tedious for complex architectures.
Method 2: Automatic Differentiation
Modern frameworks support second-order derivatives through automatic differentiation. However, computing the full Hessian still requires $\mathcal{O}(n)$ backward passes—one per parameter—making it $\mathcal{O}(n^2)$ overall.
Method 3: Finite Differences
Approximate each Hessian column via finite differences of the gradient: $$\frac{\partial^2 f}{\partial \theta_i \partial \theta_j} \approx \frac{\nabla_j f(\boldsymbol{\theta} + \epsilon \mathbf{e}_i) - \nabla_j f(\boldsymbol{\theta})}{\epsilon}$$
This requires on the order of $n$ additional gradient evaluations for the full Hessian (about $2n$ if central differences are used for accuracy)—expensive and numerically sensitive to the choice of $\epsilon$.
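A minimal finite-difference sketch, assuming the rosenbrock_grad and rosenbrock_hessian functions from the NumPy listing earlier are in scope and using an illustrative step size $\epsilon = 10^{-5}$:

```python
import numpy as np

def hessian_finite_diff(grad_f, theta, eps=1e-5):
    """Approximate the Hessian one column at a time from gradient differences."""
    n = theta.size
    H = np.zeros((n, n))
    g0 = grad_f(theta)
    for i in range(n):
        e_i = np.zeros(n)
        e_i[i] = eps
        H[:, i] = (grad_f(theta + e_i) - g0) / eps   # i-th column of the Hessian
    return 0.5 * (H + H.T)                           # symmetrize to reduce noise

theta = np.array([-1.5, 2.0])
print(hessian_finite_diff(rosenbrock_grad, theta))   # approximate Hessian
print(rosenbrock_hessian(theta))                     # analytic Hessian for comparison
```

The PyTorch listing below takes the automatic-differentiation route instead, and also previews the Hessian-vector product idea developed afterward.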
```python
import torch
import torch.nn as nn

def compute_hessian_autograd(model, loss_fn, x, y):
    """
    Compute full Hessian using PyTorch autograd.
    WARNING: Only feasible for very small models!

    Args:
        model: PyTorch model
        loss_fn: Loss function
        x, y: Input data and labels

    Returns:
        hessian: Full Hessian matrix (n x n)
    """
    # Get all parameters as a single vector
    params = torch.cat([p.flatten() for p in model.parameters()])
    n = params.numel()

    # Compute loss
    output = model(x)
    loss = loss_fn(output, y)

    # Compute gradient (create graph for second derivatives)
    grad = torch.autograd.grad(loss, model.parameters(), create_graph=True)
    grad_flat = torch.cat([g.flatten() for g in grad])

    # Compute Hessian column by column
    hessian = torch.zeros(n, n)
    for i in range(n):
        # Compute gradient of gradient[i] w.r.t. all parameters
        grad2 = torch.autograd.grad(
            grad_flat[i], model.parameters(), retain_graph=True
        )
        hessian[i] = torch.cat([g.flatten() for g in grad2])

    return hessian


def hessian_vector_product(model, loss_fn, x, y, v):
    """
    Compute Hessian-vector product H @ v efficiently.
    Uses only O(n) computation, not O(n^2)!
    This is the key insight for Hessian-free optimization.
    """
    # Forward pass
    output = model(x)
    loss = loss_fn(output, y)

    # First backward pass: compute gradient
    grad = torch.autograd.grad(loss, model.parameters(), create_graph=True)
    grad_flat = torch.cat([g.flatten() for g in grad])

    # Compute gradient dot v
    grad_v = torch.sum(grad_flat * v)

    # Second backward pass: gradient of (grad · v) = Hv
    hv = torch.autograd.grad(grad_v, model.parameters())
    return torch.cat([h.flatten() for h in hv])


# Example showing the dramatic difference
class TinyMLP(nn.Module):
    def __init__(self, input_dim=10, hidden_dim=5, output_dim=2):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

# Count parameters
model = TinyMLP()
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params}")
print(f"Hessian size: {n_params} x {n_params} = {n_params**2} entries")
print(f"Memory for Hessian: {n_params**2 * 4 / 1e6:.2f} MB (float32)")
```

A crucial observation enables practical second-order methods: we rarely need the full Hessian—we need its action on vectors.
The product $\mathbf{H}\mathbf{v}$ can be computed in $\mathcal{O}(n)$ time using Pearlmutter's trick (1994): because $\mathbf{H}\mathbf{v} = \nabla_{\boldsymbol{\theta}}\left(\nabla f(\boldsymbol{\theta})^T \mathbf{v}\right)$, one additional backward pass through the scalar $\nabla f(\boldsymbol{\theta})^T \mathbf{v}$ yields the Hessian-vector product without ever forming $\mathbf{H}$—exactly what the hessian_vector_product function above does.
This is the foundation of Hessian-free optimization, which we'll explore in the next page. Instead of solving $\mathbf{H}\mathbf{d} = -\mathbf{g}$ directly, we use iterative methods (like conjugate gradient) that only require Hessian-vector products.
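As a preview, here is a minimal conjugate-gradient sketch that solves $\mathbf{H}\mathbf{d} = -\mathbf{g}$ using only a Hessian-vector-product callback (for instance the hessian_vector_product function above, or any hvp(v) returning $\mathbf{H}\mathbf{v}$); the tolerance, iteration cap, and the small explicit matrix in the usage example are illustrative assumptions.

```python
import numpy as np

def newton_direction_cg(hvp, grad, tol=1e-10, max_iter=50):
    """Solve H d = -grad by conjugate gradient, touching H only through hvp(v) = H @ v."""
    d = np.zeros_like(grad)
    r = -grad - hvp(d)             # residual of H d = -grad; equals -grad at d = 0
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        d = d + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return d

# Usage with a small explicit matrix standing in for the HVP callback
H = np.array([[3.0, 1.0], [1.0, 2.0]])
g = np.array([1.0, -1.0])
print(newton_direction_cg(lambda v: H @ v, g))   # matches np.linalg.solve(H, -g)
```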
Applying Newton's method to neural network training introduces additional challenges beyond computational cost. Understanding these issues is essential for appreciating why we need more sophisticated second-order methods.
Neural network loss surfaces differ fundamentally from the well-behaved convex functions where Newton's method excels:
1. Non-Convexity
Neural network loss functions are highly non-convex, with numerous local minima, saddle points, and broad flat regions (plateaus).
At saddle points, Newton's method behaves poorly. If $\mathbf{H}$ is indefinite (has negative eigenvalues), the Newton direction may point toward a saddle or maximum rather than a minimum.
2. Near-Singular Hessians
Deep networks often have Hessians that are nearly singular—with many eigenvalues close to zero. This occurs because modern networks are heavily over-parameterized: redundant weights, architectural symmetries, and flat directions in the loss surface all produce near-zero curvature.
When $\mathbf{H}$ is singular or nearly so, $\mathbf{H}^{-1}$ doesn't exist or is numerically unstable.
In high dimensions, saddle points vastly outnumber local minima. For a random function in n dimensions, the probability that all n eigenvalues of the Hessian are positive (local minimum) decreases exponentially with n. Newton's method can get attracted to saddle points, where it struggles because the Hessian is indefinite.
Deep learning training uses stochastic gradients computed on mini-batches. This introduces additional issues:
Gradient Noise: The stochastic gradient $\mathbf{g}_B$ (computed on batch $B$) is an unbiased but noisy estimate of the true gradient $\nabla f$.
Hessian Noise: The stochastic Hessian is even noisier. The variance of Hessian estimates is higher than gradient estimates, making second-order information less reliable.
Sample Efficiency: If we compute the Hessian (or Hessian-vector products) on the same batch as the gradient, we're overfitting to that batch's curvature. Computing on different batches requires more data passes.
To address the challenges above, practical Newton-like methods incorporate regularization:
Levenberg-Marquardt Damping: Replace $\mathbf{H}$ with $\mathbf{H} + \lambda \mathbf{I}$, where $\lambda > 0$ ensures positive definiteness:
$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - (\mathbf{H}_k + \lambda \mathbf{I})^{-1} \nabla f_k$$
When $\lambda$ is large, this approaches gradient descent (with learning rate $1/\lambda$). When $\lambda$ is small, it approaches Newton's method. This interpolation provides stability far from the solution while preserving fast convergence nearby.
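A minimal sketch of a damped Newton step; the adaptive rule for $\lambda$ below is an illustrative heuristic (shrink after a successful step, grow after a failed one), not a prescription from the text.

```python
import numpy as np

def damped_newton_step(grad, H, lam):
    """Levenberg-Marquardt step: solve (H + lam * I) d = -grad."""
    return np.linalg.solve(H + lam * np.eye(grad.size), -grad)

def update_damping(lam, f_old, f_new):
    """Illustrative schedule: trust the quadratic model more after success, less after failure."""
    return max(lam * 0.5, 1e-8) if f_new < f_old else lam * 10.0

# With an indefinite Hessian a pure Newton step would be unsafe,
# but enough damping restores positive definiteness.
H = np.array([[2.0, 0.0], [0.0, -1.0]])
g = np.array([1.0, 1.0])
print(damped_newton_step(g, H, lam=10.0))   # close to -g / lam (gradient-descent-like)
print(damped_newton_step(g, H, lam=2.0))    # H + 2I is positive definite
```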
Trust Region Methods: Instead of modifying the Hessian, constrain the step size. Solve:
$$\min_\mathbf{d} \quad m_k(\boldsymbol{\theta}_k + \mathbf{d}) \quad \text{subject to} \quad \|\mathbf{d}\| \leq \Delta_k$$
The trust region radius $\Delta_k$ is adjusted based on how well the quadratic model predicts the actual function decrease.
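A minimal sketch of the radius-update logic; the thresholds (0.25, 0.75) and scaling factors are conventional illustrative choices, not values given in the text.

```python
def update_trust_radius(delta, rho, step_norm, delta_max=10.0):
    """
    Update the trust-region radius from the agreement ratio
        rho = (actual decrease in f) / (decrease predicted by the quadratic model m_k).
    """
    if rho < 0.25:                                 # poor model fit: shrink the region
        return 0.25 * delta
    if rho > 0.75 and abs(step_norm - delta) < 1e-12:
        return min(2.0 * delta, delta_max)         # good fit, step hit the boundary: expand
    return delta                                   # otherwise keep the radius
```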
Understanding Newton's method's historical development illuminates why it remains foundational despite its computational limitations.
Newton originally developed his method for finding roots of equations—solving $g(x) = 0$. The iteration:
$$x_{k+1} = x_k - \frac{g(x_k)}{g'(x_k)}$$
finds where a function crosses zero by repeatedly linearizing and solving. The extension to optimization comes from applying this to $g = \nabla f$—we find where the gradient is zero, which (for convex functions) corresponds to the minimum.
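For instance, a few iterations of the original root-finding scheme recover $\sqrt{2}$ by solving $g(x) = x^2 - 2 = 0$ (a minimal sketch):

```python
def newton_root(g, g_prime, x0, n_steps=6):
    """Root finding: x_{k+1} = x_k - g(x_k) / g'(x_k)."""
    x = x0
    for _ in range(n_steps):
        x = x - g(x) / g_prime(x)
    return x

# g(x) = x^2 - 2 has root sqrt(2) ~ 1.41421356; convergence is quadratic near the root
print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))
```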
Joseph Raphson independently discovered the method, and its formalization for optimization came later through the work of mathematicians studying calculus of variations and numerical analysis. The key insight—that second-order information accelerates convergence—has driven optimization research for centuries.
Newton's method directly inspired quasi-Newton methods such as BFGS and L-BFGS, the Gauss-Newton and Levenberg-Marquardt algorithms, Hessian-free (truncated Newton) optimization, and natural-gradient methods such as K-FAC.
Understanding Newton's method is essential for understanding all of these.
Even though we rarely apply pure Newton's method to neural networks, its conceptual framework permeates modern optimization. Every discussion of curvature, conditioning, and convergence rates builds on Newton's insights. The methods in subsequent pages are all, in some sense, attempts to capture Newton's benefits without paying Newton's costs.
We have developed a comprehensive understanding of Newton's method for optimization—the theoretical foundation upon which practical second-order methods are built.
Newton's method reveals a fundamental tension in optimization: we want curvature information for fast convergence, but computing it exactly is prohibitively expensive. The next page explores Hessian-free optimization, which resolves this tension by using iterative linear solvers combined with efficient Hessian-vector products. We'll see how to capture much of Newton's benefit at a fraction of the cost.
You now have a rigorous understanding of Newton's method—its derivation, convergence properties, computational requirements, and limitations in the neural network context. This foundation prepares you for the practical second-order methods that follow: Hessian-free optimization, natural gradient descent, and K-FAC.