Throughout this module, we've explored the theoretical elegance and power of second-order optimization methods. Newton's method offers quadratic convergence. Natural gradient descent provides parameterization invariance. K-FAC makes these benefits practical for neural networks.
Yet, despite these compelling advantages, first-order methods dominate deep learning practice. SGD with momentum, Adam, and their variants train the vast majority of production models. Why?
This page confronts the practical limitations of second-order methods honestly. We'll examine the computational trade-offs, the scenarios where second-order methods shine or struggle, and provide guidance for choosing the right optimization strategy for your specific problem. This pragmatic perspective is essential for making informed decisions in real-world machine learning engineering.
By the end of this page, you will: (1) Understand the computational and memory costs of second-order methods at scale, (2) Recognize the scenarios where second-order methods provide clear benefits, (3) Identify when first-order methods are preferable, (4) Learn how to make principled decisions about optimizer selection, and (5) Appreciate the ongoing research addressing these limitations.
The fundamental challenge of second-order methods is their computational complexity. Let's quantify this precisely.
Consider a neural network with $n$ parameters:
| Method | Memory for Curvature | Typical Model (100M params) |
|---|---|---|
| SGD | — | 0 |
| Adam | $\mathcal{O}(n)$ | ~400 MB |
| Full Newton | $\mathcal{O}(n^2)$ | ~40 PB (infeasible) |
| Hessian-Free | $\mathcal{O}(n)$ | ~400 MB |
| K-FAC | $\mathcal{O}(\sum_l d_l^2)$ | ~1-10 GB |
K-FAC's memory depends on layer dimensions: for a weight matrix $\mathbf{W}_l \in \mathbb{R}^{d_{out} \times d_{in}}$, K-FAC stores Kronecker factors of size $d_{in} \times d_{in}$ and $d_{out} \times d_{out}$, for a total of $\sum_l (d_{in,l}^2 + d_{out,l}^2)$ entries. For a transformer with $L$ layers of width $d_{model}$, this scales as $\mathcal{O}(L \cdot d_{model}^2)$, with the MLP expansion contributing the largest factors.
For GPT-2 medium ($L=24$, $d_{model}=1024$), this works out to several hundred million to roughly a billion entries depending on which weight matrices are preconditioned, i.e. on the order of 1-4 GB in FP32.
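As a rough sanity check, the sketch below tallies Kronecker-factor entries for a simplified transformer stack. The layer layout (separate Q/K/V/output projections plus a 4x MLP) is an assumption for illustration; embeddings, normalization layers, and biases are ignored.

```python
def kfac_factor_entries(d_in: int, d_out: int) -> int:
    """Entries in the two Kronecker factors: A is d_in x d_in, G is d_out x d_out."""
    return d_in * d_in + d_out * d_out

def transformer_kfac_gb(num_layers: int, d_model: int, include_mlp: bool = True) -> float:
    """Approximate K-FAC factor memory in GB (FP32) for a simplified transformer."""
    attn = 4 * kfac_factor_entries(d_model, d_model)         # Q, K, V, output projections
    mlp = (kfac_factor_entries(d_model, 4 * d_model)         # MLP up-projection
           + kfac_factor_entries(4 * d_model, d_model))      # MLP down-projection
    per_block = attn + (mlp if include_mlp else 0)
    return num_layers * per_block * 4 / 1e9                  # 4 bytes per FP32 entry

print(f"GPT-2 medium, attention only:  {transformer_kfac_gb(24, 1024, False):.1f} GB")
print(f"GPT-2 medium, attention + MLP: {transformer_kfac_gb(24, 1024, True):.1f} GB")
```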
For larger models (wider layers and large vocabulary embeddings), K-FAC's factor memory grows into the tens of gigabytes and becomes prohibitive. Per-iteration compute shows the same hierarchy:
| Method | Forward/Backward | Curvature | Update | Total (relative) |
|---|---|---|---|---|
| SGD | $\mathcal{O}(n)$ | — | $\mathcal{O}(n)$ | 1x |
| Adam | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | 1.5x |
| HF (k CG iters) | $\mathcal{O}(n)$ | $\mathcal{O}(kn)$ | $\mathcal{O}(n)$ | ~(k+1)x |
| K-FAC | $\mathcal{O}(n)$ | $\mathcal{O}(\sum d_l^2)$ | $\mathcal{O}(\sum d_l^3)$ | ~2-10x |
| Full Newton | $\mathcal{O}(n)$ | $\mathcal{O}(n^2)$ | $\mathcal{O}(n^3)$ | Infeasible |
K-FAC's per-layer $\mathcal{O}(d^3)$ matrix inversion becomes problematic for wide layers. A single 4096×4096 inversion takes roughly 10 ms on a GPU. With 100 such layers and inverses refreshed every 10 steps, that still amounts to about 100 ms of amortized inversion overhead per step, which is significant next to a ~50 ms forward-backward pass.
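The 10 ms figure is indicative only; a quick way to measure it on your own hardware (this sketch assumes PyTorch and an available CUDA device):

```python
import time
import torch

d = 4096
A = torch.randn(d, d, device="cuda")
A = A @ A.T + d * torch.eye(d, device="cuda")   # symmetric positive definite, like a damped factor

for _ in range(3):                               # warm-up (kernel compilation, caching)
    torch.linalg.inv(A)
torch.cuda.synchronize()

reps = 10
start = time.perf_counter()
for _ in range(reps):
    torch.linalg.inv(A)
torch.cuda.synchronize()
print(f"{(time.perf_counter() - start) / reps * 1e3:.1f} ms per {d}x{d} inversion")
```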
Modern accelerators (GPUs, TPUs) are optimized for the dense matrix operations in forward and backward passes. Second-order methods introduce operations that may not utilize hardware as efficiently:
Hessian-Free (CG iterations): CG requires $k$ sequential steps, each with one Hessian-vector product. Each product can use data parallelism internally, but the iterations themselves cannot be overlapped, so the critical path of every update grows with $k$.
K-FAC inversions: Matrix inversions are memory-bound operations that don't benefit as much from tensor core acceleration as matrix multiplications.
Communication overhead: In distributed training, K-FAC requires additional all-reduce operations for covariances, adding synchronization points.
The result: even when second-order methods take fewer iterations, wall-clock time may not improve proportionally due to hardware inefficiencies.
Second-order methods rely on curvature estimates that are inherently noisier than gradient estimates. This creates fundamental challenges in the stochastic optimization setting.
The gradient has variance that scales as $\mathcal{O}(1/B)$ where $B$ is batch size. But second-order information has higher variance:
Each element of the Hessian diagonal also has variance $\mathcal{O}(1/B)$, but the curvature estimate enters the update through an inverse (or inverse square root), so noise in small entries is strongly amplified; for the full Hessian there are $\mathcal{O}(n^2)$ entries to estimate rather than $n$.
For K-FAC covariances, each entry of $\mathbf{A} = \mathbb{E}[\mathbf{a}\mathbf{a}^T]$ has variance $\mathcal{O}(\mathrm{Var}(a_i a_j)/B)$. These are fourth moments of the activations, which can be much larger than the second moments being estimated, especially with non-normalized activations.
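The fourth-moment point is easy to see in a toy simulation (purely illustrative, with synthetic activations): at the same batch size, the estimate of a single diagonal entry of $\mathbf{A}$ is noticeably noisier when activations are heavy-tailed.

```python
import numpy as np

rng = np.random.default_rng(0)
B, trials = 128, 2000

def rel_std_of_second_moment(sampler):
    """Relative std (across repeated batches) of the estimate of E[a^2],
    i.e. a single diagonal entry of the K-FAC factor A."""
    ests = np.array([np.mean(sampler((B,)) ** 2) for _ in range(trials)])
    return ests.std() / ests.mean()

gaussian = lambda shape: rng.standard_normal(shape)
heavy = lambda shape: rng.standard_t(df=5, size=shape)   # heavier tails, larger 4th moment

print(f"Gaussian activations:     {rel_std_of_second_moment(gaussian):.3f}")
print(f"Heavy-tailed activations: {rel_std_of_second_moment(heavy):.3f}")
# The second-moment estimate is noisier when fourth moments are large,
# which is exactly the regime of unnormalized activations.
```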
Consider the information content of a gradient versus a curvature estimate drawn from the same batch: the gradient estimate may already be reliable while the curvature estimate is still dominated by noise. Preconditioning with a noisy Hessian can be worse than ignoring curvature entirely.
Diagonal approximations (like Adam's second moment) are biased (they ignore off-diagonal curvature) but low variance (easy to estimate per-element). Full curvature methods are unbiased but high variance. K-FAC sits in between—some structure but still needs careful averaging.
To reduce noise, K-FAC and similar methods use exponential moving averages of curvature statistics. But this introduces curvature staleness: the averaged statistics describe the loss surface around earlier parameter values rather than the current ones. This is particularly problematic early in training, when the parameters and the local landscape change rapidly.
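A minimal sketch of this moving-average update for one fully connected layer, assuming access to the layer's input activations `a` and pre-activation gradients `g` (the names and standalone-function structure are illustrative, not any particular library's API):

```python
import torch

def update_kfac_factors(A_ema, G_ema, a, g, decay=0.95):
    """EMA update of the K-FAC factors for one linear layer.

    a: (batch, d_in)  layer input activations
    g: (batch, d_out) gradients w.r.t. the layer's pre-activations
    With decay=0.95 the factors average roughly the last 1/(1-0.95) = 20 batches,
    so they describe curvature at slightly older parameter values (staleness).
    """
    batch = a.shape[0]
    A_batch = a.T @ a / batch      # current-batch estimate of E[a a^T]
    G_batch = g.T @ g / batch      # current-batch estimate of E[g g^T]
    A_ema.mul_(decay).add_(A_batch, alpha=1 - decay)
    G_ema.mul_(decay).add_(G_batch, alpha=1 - decay)
    return A_ema, G_ema
```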
Empirically, second-order methods often require larger batch sizes to achieve their theoretical benefits:
| Method | Recommended Min Batch | Rationale |
|---|---|---|
| SGD | 32-128 | Noise provides regularization |
| Adam | 32-256 | Diagonal estimates stabilize quickly |
| K-FAC | 256-1024+ | Covariance matrices need many samples |
| Hessian-Free | 512-2048+ | CG needs stable Hv products |
For applications where small batches are necessary (streaming, limited memory), first-order methods may be the only viable choice.
Second-order methods don't uniformly outperform first-order methods. Their advantage depends critically on problem characteristics.
1. Ill-Conditioned Problems
When the loss landscape has highly variable curvature (high condition number), gradient descent struggles with step size selection. Second-order methods handle this naturally.
Example: Physics-informed neural networks often have loss functions with multiple terms at different scales, creating ill-conditioning.
2. Small to Medium Scale
For networks with thousands to low millions of parameters, the computational overhead of K-FAC or Hessian-free methods is manageable, and the iteration savings translate to wall-clock improvements.
Example: Specialized scientific computing networks, reinforcement policy networks.
3. Near-Quadratic Objectives
When the loss is locally well-approximated by a quadratic (near a minimum, or for certain network architectures), second-order methods achieve their theoretical potential.
Example: Fine-tuning pretrained models on small datasets.
4. Recurrent Networks
RNNs and LSTMs have notoriously difficult optimization landscapes. Hessian-free methods showed early success on deep RNNs where vanilla SGD failed entirely.
Second-order methods often shine during fine-tuning. The model starts close to a good solution, curvature is relatively stable, and we want fast convergence rather than exploration. K-FAC can reduce fine-tuning iterations by 2-5x compared to Adam, with total time savings on appropriately-sized models.
1. Very Large Scale
For models with billions of parameters (modern LLMs), even K-FAC's reduced complexity is prohibitive. The curvature matrices for embedding layers with vocabularies of 50K+ tokens are infeasible.
2. Highly Stochastic Settings
When noise dominates the training signal (small batches, high data variability), curvature estimates are unreliable. SGD's implicit regularization from noise can even be beneficial.
3. Exploration-Heavy Training
Early in training, we're far from any minimum, and the landscape is highly non-quadratic. Careful curvature following matters less than moving quickly through parameter space.
4. Simple Architectures
For well-designed modern architectures (ResNets, Transformers with normalization), first-order methods work remarkably well. These architectures were co-evolved with SGD/Adam optimization.
5. Memory-Constrained Settings
On edge devices or with limited GPU memory, the overhead of storing curvature statistics may not be affordable.
| Scenario | Recommended Method | Rationale |
|---|---|---|
| Large language model pretraining | AdamW | Scale prohibits second-order; well-tuned LR schedules suffice |
| ResNet on ImageNet | SGD+momentum | Architecture well-conditioned; proven baseline |
| Scientific computing (PDEs) | L-BFGS or K-FAC | Often ill-conditioned; smaller scale |
| Fine-tuning BERT | Adam or K-FAC | Near-quadratic; fast convergence valuable |
| Reinforcement learning | Adam or Natural Gradient | Non-stationarity; Fisher provides stability |
| GAN training | Adam (careful tuning) | Non-convexity; second-order less helpful |
| RNN/LSTM training | Adam or Hessian-Free | Vanishing gradients; curvature awareness helps |
Beyond algorithmic concerns, second-order methods impose significant engineering overhead.
A correct, efficient implementation of K-FAC requires:

- per-layer hooks to capture input activations and pre-activation gradients,
- exponentially averaged covariance (Kronecker factor) estimates,
- periodic damped inversions (or eigendecompositions) of those factors,
- special handling for layer types beyond standard dense and convolutional layers (weight sharing, embeddings),
- and, in distributed training, extra synchronization of the factor statistics.
Compare with Adam:
```python
# Adam update (simplified)
m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad ** 2
param -= lr * m / (sqrt(v) + eps)
```
The simplicity of first-order methods reduces bugs, eases debugging, and simplifies integration with training pipelines.
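For contrast, here is a heavily simplified sketch of just the update step of K-FAC for a single linear layer; the function names and structure are illustrative assumptions, not any library's API, and all of the bookkeeping listed above is left out.

```python
import torch

def refresh_inverses(A_ema, G_ema, damping=1e-2):
    """The expensive O(d^3) part; real implementations run this only every N steps."""
    A_inv = torch.linalg.inv(A_ema + damping * torch.eye(A_ema.shape[0]))
    G_inv = torch.linalg.inv(G_ema + damping * torch.eye(G_ema.shape[0]))
    return A_inv, G_inv

def kfac_update(W, grad_W, A_inv, G_inv, lr=0.01):
    """Precondition the gradient of a (d_out x d_in) weight matrix and take a step:
    the Kronecker-factored natural gradient is G^{-1} @ grad_W @ A^{-1}."""
    W -= lr * (G_inv @ grad_W @ A_inv)
    return W
```

Capturing `a` and `g` with hooks, maintaining the EMAs, scheduling the inverse refreshes, adapting the damping, and synchronizing factors across workers all sit outside these few lines, and that is where most of the engineering effort goes.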
When training diverges with K-FAC, identifying the cause is substantially harder than with Adam. Is it: damping too low? Covariance estimates unstable? Inversion numerical issues? Layer statistics dominated by outliers? The debugging surface area is much larger.
Second-order methods introduce new hyperparameters:
| Hyperparameter | Typical Range | Sensitivity |
|---|---|---|
| Base learning rate | 0.001-0.1 | High |
| Damping | 0.001-1.0 | High |
| Covariance EMA decay | 0.9-0.999 | Medium |
| Inverse update frequency | 5-100 steps | Low |
| Trust region / LM adaptation | On/Off + parameters | Medium |
While second-order methods are often claimed to be "less sensitive to learning rate," they trade this for sensitivity to damping and other parameters. The total hyperparameter tuning burden may not decrease.
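As a concrete example of that extra tuning surface, many Hessian-free and K-FAC implementations adapt the damping with a Levenberg-Marquardt-style rule. The sketch below uses the common 1/4 and 3/4 thresholds and a factor of 3/2, though the exact constants vary by implementation:

```python
def adjust_damping(damping, actual_reduction, predicted_reduction, min_damping=1e-8):
    """Levenberg-Marquardt-style damping adaptation.

    rho compares the loss reduction actually observed after a step with the
    reduction predicted by the local quadratic model. Good agreement means the
    quadratic model is trustworthy (lower the damping); poor agreement means it
    is not (raise the damping and behave more like gradient descent).
    """
    rho = actual_reduction / (predicted_reduction + 1e-12)
    if rho > 0.75:
        damping = max(damping * 2 / 3, min_damping)
    elif rho < 0.25:
        damping = damping * 3 / 2
    return damping
```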
Popular frameworks have varying support for second-order methods:
| Method | PyTorch | JAX | TensorFlow |
|---|---|---|---|
| SGD/Adam | Native | Native | Native |
| L-BFGS | torch.optim.LBFGS | optax.lbfgs | tfp.optimizer.lbfgs_minimize (TF Probability) |
| Hessian-Free | Manual | Manual (Hv products via jax.jvp/jax.vjp) | Manual |
| K-FAC | Third-party (e.g. KFAC-PyTorch) | Third-party (kfac-jax) | TF1-era tf.contrib.kfac, later a standalone package |
The lack of first-class support means users must maintain compatibility across framework versions, a significant maintenance burden.
Given the trade-offs we've discussed, here's a practical framework for choosing optimizers.
```python
def recommend_optimizer(
    num_parameters: int,
    batch_size: int,
    problem_type: str,
    training_phase: str,
    computational_budget: str,
    implementation_expertise: str,
) -> str:
    """
    Recommend an optimizer based on problem characteristics.

    This encodes practical wisdom about when different optimizers work well.
    """
    # Scale thresholds
    SMALL_MODEL = 1_000_000          # 1M params
    MEDIUM_MODEL = 100_000_000       # 100M params
    LARGE_MODEL = 1_000_000_000      # 1B params
    SMALL_BATCH = 64
    LARGE_BATCH = 512

    # Very large models: first-order only
    if num_parameters > LARGE_MODEL:
        return "AdamW with careful LR schedule and warmup"

    # Small batches: first-order preferred
    if batch_size < SMALL_BATCH:
        return "Adam with standard settings"

    # Problem-specific recommendations
    if problem_type == "reinforcement_learning":
        if num_parameters < SMALL_MODEL:
            return "Natural Gradient (TRPO/PPO style) or Adam"
        else:
            return "Adam with entropy regularization"

    if problem_type == "scientific_computing" or problem_type == "pde_solving":
        if num_parameters < MEDIUM_MODEL:
            return "L-BFGS or K-FAC (ill-conditioning likely)"
        else:
            return "Adam with aggressive LR decay"

    if problem_type == "sequence_modeling" and "rnn" in str(training_phase).lower():
        return "Adam or Hessian-Free (gradient pathologies common)"

    # Training phase considerations
    if training_phase == "fine_tuning":
        if batch_size >= LARGE_BATCH and num_parameters < MEDIUM_MODEL:
            return "K-FAC (fast convergence, stable curvature)"
        else:
            return "AdamW with low LR"

    if training_phase == "early_training":
        return "Adam or SGD+momentum (exploration phase, curvature unreliable)"

    # Computational budget
    if computational_budget == "unlimited" and num_parameters < MEDIUM_MODEL:
        if implementation_expertise == "high":
            return "K-FAC (potential iteration savings)"
        else:
            return "Adam (reliable, well-understood)"

    # Default: Adam is a strong baseline
    return "AdamW with cosine LR schedule"


# Example usage
print(recommend_optimizer(
    num_parameters=50_000_000,
    batch_size=256,
    problem_type="vision",
    training_phase="early_training",
    computational_budget="limited",
    implementation_expertise="medium",
))
# Output: "Adam or SGD+momentum (exploration phase, curvature unreliable)"
```

The limitations of current second-order methods are well-understood, and active research addresses many of them.
Shampoo (Google, 2018): Uses matrix roots instead of inverses, computed via iterative methods. Scales better to large layers and shows competitive results at scale.
LARS/LAMB (2017-2019): Large-batch optimizers that use per-layer learning rate scaling based on gradient/weight norms—a very lightweight form of curvature adaptation.
AdaHessian (2020): Estimates the Hessian diagonal with Hutchinson's method plus spatial averaging and momentum, bridging Adam and true second-order methods.
Sophia (2023): Preconditions updates with a lightweight estimate of the diagonal Hessian (e.g. Hutchinson's estimator), combined with element-wise clipping. Reported to converge faster than Adam on LLM pretraining.
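Both AdaHessian and Sophia build on Hutchinson-style estimates of the Hessian diagonal. A minimal PyTorch sketch of the estimator itself (the `loss_fn(params)` signature is a placeholder assumption, and the parameters must require gradients):

```python
import torch

def hutchinson_diag_hessian(loss_fn, params, num_samples=1):
    """Estimate diag(H) as E[z * (H z)] with Rademacher probe vectors z."""
    diag_est = [torch.zeros_like(p) for p in params]
    for _ in range(num_samples):
        loss = loss_fn(params)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        zs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]   # entries in {-1, +1}
        # Hessian-vector product: differentiate (grad . z) w.r.t. the parameters.
        hvps = torch.autograd.grad(
            sum((g * z).sum() for g, z in zip(grads, zs)), params
        )
        for est, z, hvp in zip(diag_est, zs, hvps):
            est += z * hvp / num_samples   # z * (H z) has expectation diag(H)
    return diag_est
```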
Future accelerators may be designed with second-order methods in mind, for example with better support for the batched matrix inversions, eigendecompositions, and small dense solves that these methods rely on.
As hardware evolves, the computational balance between first and second-order methods may shift.
Adam's success is partly paradoxical: it uses a diagonal approximation to the Fisher, which should be crude. Yet it works remarkably well. Recent theory suggests Adam's effectiveness comes from implicit regularization and adaptivity to gradient distribution, not from accurate curvature capture. This insight is guiding new optimizer designs.
The future may lie in hybrid methods that combine first and second-order elements:
1. First-order start, second-order finish: Use Adam for early training, switch to K-FAC for fine-tuning.
2. Periodic curvature updates: Compute expensive curvature every N steps and use it to precondition many gradient steps in between (see the sketch after this list).
3. Curvature-aware learning rate: Use cheap curvature estimates (diagonal Hessian) to schedule learning rates per-parameter or per-layer.
4. Meta-learned optimizers: Neural networks that learn to optimize, potentially discovering novel algorithms that combine aspects of both approaches.
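A minimal sketch of idea (2), periodic curvature updates, where `compute_preconditioner` and `apply_preconditioner` are hypothetical callables supplied by the user (for example, a K-FAC factor refresh and the corresponding Kronecker-factored solve):

```python
def train_with_periodic_curvature(params, grad_fn, compute_preconditioner,
                                  apply_preconditioner, lr=1e-3,
                                  refresh_every=50, num_steps=1000):
    """Refresh an expensive curvature-based preconditioner every `refresh_every`
    steps and reuse it to precondition the cheap gradient steps in between."""
    precond = None
    for step in range(num_steps):
        grads = grad_fn(params)
        if step % refresh_every == 0:
            precond = compute_preconditioner(params)      # the expensive part
        updates = apply_preconditioner(precond, grads)    # cheap, reused many times
        params = [p - lr * u for p, u in zip(params, updates)]
    return params
```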
A fundamental open question: Does second-order optimization find different solutions than first-order methods?
Some evidence suggests that the implicit noise in SGD biases it toward flatter minima that generalize well, while aggressive curvature-following can converge quickly to sharper minima; other studies find the differences largely disappear once regularization and learning-rate schedules are tuned. Understanding the generalization properties of different optimizers remains an active research area.
This module has taken you on a journey through second-order optimization methods—from the foundational theory of Newton's method to the practical realities of training deep neural networks.
For most deep learning practitioners, the honest recommendation is:
Start with Adam/AdamW. It's robust, well-understood, and works across a wide range of problems.
Consider SGD+momentum for vision. CNNs trained on large datasets often benefit from SGD's implicit regularization.
Explore second-order methods when the problem is ill-conditioned, the model is small to medium scale, you can afford large batches, or you are fine-tuning close to a good solution.
Stay informed about new developments. Methods like Sophia and Shampoo are pushing second-order optimization to larger scales.
Second-order optimization methods represent some of the most elegant mathematics in machine learning—Newton's insights from the 17th century, information geometry from the 20th, and modern approximations from the 21st. Understanding these methods deeply, including their limitations, makes you a more complete machine learning engineer.
The gap between theoretical elegance and practical utility is a recurring theme in optimization. Methods that dominate in theory (full Newton) are impractical, while methods that dominate in practice (Adam) are theoretically approximate. Bridging this gap remains one of the great challenges in the field.
Congratulations! You have completed the Second-Order Optimization Methods module. You now have a rigorous understanding of Newton's method, Hessian-free optimization, natural gradient descent, K-FAC, and the practical considerations for choosing optimization strategies. This knowledge will serve you in diagnosing optimization challenges and making informed decisions about training neural networks.