Throughout this module, we've explored the theoretical elegance and power of second-order optimization methods. Newton's method offers quadratic convergence. Natural gradient descent provides parameterization invariance. K-FAC makes these benefits practical for neural networks.
Yet, despite these compelling advantages, first-order methods dominate deep learning practice. SGD with momentum, Adam, and their variants train the vast majority of production models. Why?
This page confronts the practical limitations of second-order methods honestly. We'll examine the computational trade-offs, the scenarios where second-order methods shine or struggle, and provide guidance for choosing the right optimization strategy for your specific problem. This pragmatic perspective is essential for making informed decisions in real-world machine learning engineering.
By the end of this page, you will: (1) Understand the computational and memory costs of second-order methods at scale, (2) Recognize the scenarios where second-order methods provide clear benefits, (3) Identify when first-order methods are preferable, (4) Learn how to make principled decisions about optimizer selection, and (5) Appreciate the ongoing research addressing these limitations.
The fundamental challenge of second-order methods is their computational complexity. Let's quantify this precisely.
Consider a neural network with $n$ parameters:
| Method | Memory for Curvature | Typical Model (100M params) |
|---|---|---|
| SGD | — | 0 |
| Adam | $\mathcal{O}(n)$ | ~400 MB |
| Full Newton | $\mathcal{O}(n^2)$ | ~40 PB (infeasible) |
| Hessian-Free | $\mathcal{O}(n)$ | ~400 MB |
| K-FAC | $\mathcal{O}(\sum_l d_l^2)$ | ~1-10 GB |
K-FAC's memory depends on layer dimensions: for a weight matrix $\mathbf{W}_l \in \mathbb{R}^{d_{out} \times d_{in}}$, K-FAC stores Kronecker factors of size $d_{in} \times d_{in}$ and $d_{out} \times d_{out}$, for a total of $\sum_l (d_{in,l}^2 + d_{out,l}^2)$ entries. For a transformer with $L$ layers of width $d_{model}$, this scales as $\mathcal{O}(L \cdot d_{model}^2)$, with the MLP expansion contributing the largest factors.
For GPT-2 medium ($L=24$, $d_{model}=1024$), this works out to several hundred million to roughly a billion entries depending on which weight matrices are preconditioned, i.e. on the order of 1-4 GB in FP32.
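As a rough sanity check, the sketch below tallies Kronecker-factor entries for a simplified transformer stack. The layer layout (separate Q/K/V/output projections plus a 4x MLP) is an assumption for illustration; embeddings, normalization layers, and biases are ignored.

```python
def kfac_factor_entries(d_in: int, d_out: int) -> int:
    """Entries in the two Kronecker factors: A is d_in x d_in, G is d_out x d_out."""
    return d_in * d_in + d_out * d_out

def transformer_kfac_gb(num_layers: int, d_model: int, include_mlp: bool = True) -> float:
    """Approximate K-FAC factor memory in GB (FP32) for a simplified transformer."""
    attn = 4 * kfac_factor_entries(d_model, d_model)         # Q, K, V, output projections
    mlp = (kfac_factor_entries(d_model, 4 * d_model)         # MLP up-projection
           + kfac_factor_entries(4 * d_model, d_model))      # MLP down-projection
    per_block = attn + (mlp if include_mlp else 0)
    return num_layers * per_block * 4 / 1e9                  # 4 bytes per FP32 entry

print(f"GPT-2 medium, attention only:  {transformer_kfac_gb(24, 1024, False):.1f} GB")
print(f"GPT-2 medium, attention + MLP: {transformer_kfac_gb(24, 1024, True):.1f} GB")
```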
For larger models (wider layers and large vocabulary embeddings), K-FAC's factor memory grows into the tens of gigabytes and becomes prohibitive. Per-iteration compute shows the same hierarchy:
| Method | Forward/Backward | Curvature | Update | Total (relative) |
|---|---|---|---|---|
| SGD | $\mathcal{O}(n)$ | — | $\mathcal{O}(n)$ | 1x |
| Adam | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | 1.5x |
| HF (k CG iters) | $\mathcal{O}(n)$ | $\mathcal{O}(kn)$ | $\mathcal{O}(n)$ | ~(k+1)x |
| K-FAC | $\mathcal{O}(n)$ | $\mathcal{O}(\sum d_l^2)$ | $\mathcal{O}(\sum d_l^3)$ | ~2-10x |
| Full Newton | $\mathcal{O}(n)$ | $\mathcal{O}(n^2)$ | $\mathcal{O}(n^3)$ | Infeasible |
K-FAC's per-layer $\mathcal{O}(d^3)$ matrix inversion becomes problematic for wide layers. A single 4096×4096 inversion takes roughly 10 ms on a GPU. With 100 such layers and inverses refreshed every 10 steps, that still amounts to about 100 ms of amortized inversion overhead per step, which is significant next to a ~50 ms forward-backward pass.
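The 10 ms figure is indicative only; a quick way to measure it on your own hardware (this sketch assumes PyTorch and an available CUDA device):

```python
import time
import torch

d = 4096
A = torch.randn(d, d, device="cuda")
A = A @ A.T + d * torch.eye(d, device="cuda")   # symmetric positive definite, like a damped factor

for _ in range(3):                               # warm-up (kernel compilation, caching)
    torch.linalg.inv(A)
torch.cuda.synchronize()

reps = 10
start = time.perf_counter()
for _ in range(reps):
    torch.linalg.inv(A)
torch.cuda.synchronize()
print(f"{(time.perf_counter() - start) / reps * 1e3:.1f} ms per {d}x{d} inversion")
```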
Modern accelerators (GPUs, TPUs) are optimized for the dense matrix operations in forward and backward passes. Second-order methods introduce operations that may not utilize hardware as efficiently:
Hessian-Free (CG iterations): CG requires $k$ sequential steps, each with one Hessian-vector product. Each product can use data parallelism internally, but the iterations themselves cannot be overlapped, so the critical path of every update grows with $k$.
K-FAC inversions: Matrix inversions are memory-bound operations that don't benefit as much from tensor core acceleration as matrix multiplications.
Communication overhead: In distributed training, K-FAC requires additional all-reduce operations for covariances, adding synchronization points.
The result: even when second-order methods take fewer iterations, wall-clock time may not improve proportionally due to hardware inefficiencies.
Second-order methods rely on curvature estimates that are inherently noisier than gradient estimates. This creates fundamental challenges in the stochastic optimization setting.
The gradient has variance that scales as $\mathcal{O}(1/B)$ where $B$ is batch size. But second-order information has higher variance:
Each element of the Hessian diagonal also has variance $\mathcal{O}(1/B)$, but the curvature estimate enters the update through an inverse (or inverse square root), so noise in small entries is strongly amplified; for the full Hessian there are $\mathcal{O}(n^2)$ entries to estimate rather than $n$.
For K-FAC covariances, each entry of $\mathbf{A} = \mathbb{E}[\mathbf{a}\mathbf{a}^T]$ has variance $\mathcal{O}(\mathrm{Var}(a_i a_j)/B)$. These are fourth moments of the activations, which can be much larger than the second moments being estimated, especially with non-normalized activations.
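The fourth-moment point is easy to see in a toy simulation (purely illustrative, with synthetic activations): at the same batch size, the estimate of a single diagonal entry of $\mathbf{A}$ is noticeably noisier when activations are heavy-tailed.

```python
import numpy as np

rng = np.random.default_rng(0)
B, trials = 128, 2000

def rel_std_of_second_moment(sampler):
    """Relative std (across repeated batches) of the estimate of E[a^2],
    i.e. a single diagonal entry of the K-FAC factor A."""
    ests = np.array([np.mean(sampler((B,)) ** 2) for _ in range(trials)])
    return ests.std() / ests.mean()

gaussian = lambda shape: rng.standard_normal(shape)
heavy = lambda shape: rng.standard_t(df=5, size=shape)   # heavier tails, larger 4th moment

print(f"Gaussian activations:     {rel_std_of_second_moment(gaussian):.3f}")
print(f"Heavy-tailed activations: {rel_std_of_second_moment(heavy):.3f}")
# The second-moment estimate is noisier when fourth moments are large,
# which is exactly the regime of unnormalized activations.
```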
Consider the information content of a gradient versus a curvature estimate drawn from the same batch: the gradient estimate may already be reliable while the curvature estimate is still dominated by noise. Preconditioning with a noisy Hessian can be worse than ignoring curvature entirely.
Diagonal approximations (like Adam's second moment) are biased (they ignore off-diagonal curvature) but low variance (easy to estimate per-element). Full curvature methods are unbiased but high variance. K-FAC sits in between—some structure but still needs careful averaging.
To reduce noise, K-FAC and similar methods use exponential moving averages of curvature statistics. But this introduces curvature staleness: the averaged statistics describe the loss surface around earlier parameter values rather than the current ones. This is particularly problematic early in training, when the parameters and the local landscape change rapidly.
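A minimal sketch of this moving-average update for one fully connected layer, assuming access to the layer's input activations `a` and pre-activation gradients `g` (the names and standalone-function structure are illustrative, not any particular library's API):

```python
import torch

def update_kfac_factors(A_ema, G_ema, a, g, decay=0.95):
    """EMA update of the K-FAC factors for one linear layer.

    a: (batch, d_in)  layer input activations
    g: (batch, d_out) gradients w.r.t. the layer's pre-activations
    With decay=0.95 the factors average roughly the last 1/(1-0.95) = 20 batches,
    so they describe curvature at slightly older parameter values (staleness).
    """
    batch = a.shape[0]
    A_batch = a.T @ a / batch      # current-batch estimate of E[a a^T]
    G_batch = g.T @ g / batch      # current-batch estimate of E[g g^T]
    A_ema.mul_(decay).add_(A_batch, alpha=1 - decay)
    G_ema.mul_(decay).add_(G_batch, alpha=1 - decay)
    return A_ema, G_ema
```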
Empirically, second-order methods often require larger batch sizes to achieve their theoretical benefits:
| Method | Recommended Min Batch | Rationale |
|---|---|---|
| SGD | 32-128 | Noise provides regularization |
| Adam | 32-256 | Diagonal estimates stabilize quickly |
| K-FAC | 256-1024+ | Covariance matrices need many samples |
| Hessian-Free | 512-2048+ | CG needs stable Hv products |
For applications where small batches are necessary (streaming, limited memory), first-order methods may be the only viable choice.
Second-order methods don't uniformly outperform first-order methods. Their advantage depends critically on problem characteristics.
1. Ill-Conditioned Problems
When the loss landscape has highly variable curvature (high condition number), gradient descent struggles with step size selection. Second-order methods handle this naturally.
Example: Physics-informed neural networks often have loss functions with multiple terms at different scales, creating ill-conditioning.
2. Small to Medium Scale
For networks with thousands to low millions of parameters, the computational overhead of K-FAC or Hessian-free methods is manageable, and the iteration savings translate to wall-clock improvements.
Example: Specialized scientific computing networks, reinforcement policy networks.
3. Near-Quadratic Objectives
When the loss is locally well-approximated by a quadratic (near a minimum, or for certain network architectures), second-order methods achieve their theoretical potential.
Example: Fine-tuning pretrained models on small datasets.
4. Recurrent Networks
RNNs and LSTMs have notoriously difficult optimization landscapes. Hessian-free methods showed early success on deep RNNs where vanilla SGD failed entirely.
Second-order methods often shine during fine-tuning. The model starts close to a good solution, curvature is relatively stable, and we want fast convergence rather than exploration. K-FAC can reduce fine-tuning iterations by 2-5x compared to Adam, with total time savings on appropriately-sized models.
1. Very Large Scale
For models with billions of parameters (modern LLMs), even K-FAC's reduced complexity is prohibitive. The curvature matrices for embedding layers with vocabularies of 50K+ tokens are infeasible.
2. Highly Stochastic Settings
When noise dominates the training signal (small batches, high data variability), curvature estimates are unreliable. SGD's implicit regularization from noise can even be beneficial.
3. Exploration-Heavy Training
Early in training, we're far from any minimum, and the landscape is highly non-quadratic. Careful curvature following matters less than moving quickly through parameter space.
4. Simple Architectures
For well-designed modern architectures (ResNets, Transformers with normalization), first-order methods work remarkably well. These architectures were co-evolved with SGD/Adam optimization.
5. Memory-Constrained Settings
On edge devices or with limited GPU memory, the overhead of storing curvature statistics may not be affordable.
| Scenario | Recommended Method | Rationale |
|---|---|---|
| Large language model pretraining | AdamW | Scale prohibits second-order; well-tuned LR schedules suffice |
| ResNet on ImageNet | SGD+momentum | Architecture well-conditioned; proven baseline |
| Scientific computing (PDEs) | L-BFGS or K-FAC | Often ill-conditioned; smaller scale |
| Fine-tuning BERT | Adam or K-FAC | Near-quadratic; fast convergence valuable |
| Reinforcement learning | Adam or Natural Gradient | Non-stationarity; Fisher provides stability |
| GAN training | Adam (careful tuning) | Non-convexity; second-order less helpful |
| RNN/LSTM training | Adam or Hessian-Free | Vanishing gradients; curvature awareness helps |
Beyond algorithmic concerns, second-order methods impose significant engineering overhead.
A correct, efficient implementation of K-FAC requires:

- per-layer hooks to capture input activations and pre-activation gradients,
- exponentially averaged covariance (Kronecker factor) estimates,
- periodic damped inversions (or eigendecompositions) of those factors,
- special handling for layer types beyond standard dense and convolutional layers (weight sharing, embeddings),
- and, in distributed training, extra synchronization of the factor statistics.
Compare with Adam:
```python
# Adam update (simplified)
m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad ** 2
param -= lr * m / (sqrt(v) + eps)
```
The simplicity of first-order methods reduces bugs, eases debugging, and simplifies integration with training pipelines.
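For contrast, here is a heavily simplified sketch of just the update step of K-FAC for a single linear layer; the function names and structure are illustrative assumptions, not any library's API, and all of the bookkeeping listed above is left out.

```python
import torch

def refresh_inverses(A_ema, G_ema, damping=1e-2):
    """The expensive O(d^3) part; real implementations run this only every N steps."""
    A_inv = torch.linalg.inv(A_ema + damping * torch.eye(A_ema.shape[0]))
    G_inv = torch.linalg.inv(G_ema + damping * torch.eye(G_ema.shape[0]))
    return A_inv, G_inv

def kfac_update(W, grad_W, A_inv, G_inv, lr=0.01):
    """Precondition the gradient of a (d_out x d_in) weight matrix and take a step:
    the Kronecker-factored natural gradient is G^{-1} @ grad_W @ A^{-1}."""
    W -= lr * (G_inv @ grad_W @ A_inv)
    return W
```

Capturing `a` and `g` with hooks, maintaining the EMAs, scheduling the inverse refreshes, adapting the damping, and synchronizing factors across workers all sit outside these few lines, and that is where most of the engineering effort goes.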
When training diverges with K-FAC, identifying the cause is substantially harder than with Adam. Is it: damping too low? Covariance estimates unstable? Inversion numerical issues? Layer statistics dominated by outliers? The debugging surface area is much larger.
Second-order methods introduce new hyperparameters:
| Hyperparameter | Typical Range | Sensitivity |
|---|---|---|
| Base learning rate | 0.001-0.1 | High |
| Damping | 0.001-1.0 | High |
| Covariance EMA decay | 0.9-0.999 | Medium |
| Inverse update frequency | 5-100 steps | Low |
| Trust region / LM adaptation | On/Off + parameters | Medium |
While second-order methods are often claimed to be "less sensitive to learning rate," they trade this for sensitivity to damping and other parameters. The total hyperparameter tuning burden may not decrease.
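As a concrete example of that extra tuning surface, many Hessian-free and K-FAC implementations adapt the damping with a Levenberg-Marquardt-style rule. The sketch below uses the common 1/4 and 3/4 thresholds and a factor of 3/2, though the exact constants vary by implementation:

```python
def adjust_damping(damping, actual_reduction, predicted_reduction, min_damping=1e-8):
    """Levenberg-Marquardt-style damping adaptation.

    rho compares the loss reduction actually observed after a step with the
    reduction predicted by the local quadratic model. Good agreement means the
    quadratic model is trustworthy (lower the damping); poor agreement means it
    is not (raise the damping and behave more like gradient descent).
    """
    rho = actual_reduction / (predicted_reduction + 1e-12)
    if rho > 0.75:
        damping = max(damping * 2 / 3, min_damping)
    elif rho < 0.25:
        damping = damping * 3 / 2
    return damping
```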
Popular frameworks have varying support for second-order methods:
| Method | PyTorch | JAX | TensorFlow |
|---|---|---|---|
| SGD/Adam | Native | Native | Native |
| L-BFGS | torch.optim.LBFGS | optax.lbfgs | tfp.optimizer.lbfgs_minimize (TF Probability) |
| Hessian-Free | Manual | Manual (Hv products via jax.jvp/jax.vjp) | Manual |
| K-FAC | Third-party (e.g. KFAC-PyTorch) | Third-party (kfac-jax) | TF1-era tf.contrib.kfac, later a standalone package |
The lack of first-class support means users must maintain compatibility across framework versions, a significant maintenance burden.
Given the trade-offs we've discussed, here's a practical framework for choosing optimizers.
```python
def recommend_optimizer(
    num_parameters: int,
    batch_size: int,
    problem_type: str,
    training_phase: str,
    computational_budget: str,
    implementation_expertise: str,
) -> str:
    """
    Recommend an optimizer based on problem characteristics.

    This encodes practical wisdom about when different optimizers work well.
    """
    # Scale thresholds
    SMALL_MODEL = 1_000_000          # 1M params
    MEDIUM_MODEL = 100_000_000       # 100M params
    LARGE_MODEL = 1_000_000_000      # 1B params
    SMALL_BATCH = 64
    LARGE_BATCH = 512

    # Very large models: first-order only
    if num_parameters > LARGE_MODEL:
        return "AdamW with careful LR schedule and warmup"

    # Small batches: first-order preferred
    if batch_size < SMALL_BATCH:
        return "Adam with standard settings"

    # Problem-specific recommendations
    if problem_type == "reinforcement_learning":
        if num_parameters < SMALL_MODEL:
            return "Natural Gradient (TRPO/PPO style) or Adam"
        else:
            return "Adam with entropy regularization"

    if problem_type == "scientific_computing" or problem_type == "pde_solving":
        if num_parameters < MEDIUM_MODEL:
            return "L-BFGS or K-FAC (ill-conditioning likely)"
        else:
            return "Adam with aggressive LR decay"

    if problem_type == "sequence_modeling" and "rnn" in str(training_phase).lower():
        return "Adam or Hessian-Free (gradient pathologies common)"

    # Training phase considerations
    if training_phase == "fine_tuning":
        if batch_size >= LARGE_BATCH and num_parameters < MEDIUM_MODEL:
            return "K-FAC (fast convergence, stable curvature)"
        else:
            return "AdamW with low LR"

    if training_phase == "early_training":
        return "Adam or SGD+momentum (exploration phase, curvature unreliable)"

    # Computational budget
    if computational_budget == "unlimited" and num_parameters < MEDIUM_MODEL:
        if implementation_expertise == "high":
            return "K-FAC (potential iteration savings)"
        else:
            return "Adam (reliable, well-understood)"

    # Default: Adam is a strong baseline
    return "AdamW with cosine LR schedule"


# Example usage
print(recommend_optimizer(
    num_parameters=50_000_000,
    batch_size=256,
    problem_type="vision",
    training_phase="early_training",
    computational_budget="limited",
    implementation_expertise="medium",
))
# Output: "Adam or SGD+momentum (exploration phase, curvature unreliable)"
```

The limitations of current second-order methods are well-understood, and active research addresses many of them.
Shampoo (Google, 2018): Uses matrix roots instead of inverses, computed via iterative methods. Scales better to large layers and shows competitive results at scale.
LARS/LAMB (2017-2019): Large-batch optimizers that use per-layer learning rate scaling based on gradient/weight norms—a very lightweight form of curvature adaptation.
AdaHessian (2020): Estimates the Hessian diagonal with Hutchinson's method plus spatial averaging and momentum, bridging Adam and true second-order methods.
Sophia (2023): Preconditions updates with a lightweight estimate of the diagonal Hessian (e.g. Hutchinson's estimator), combined with element-wise clipping. Reported to converge faster than Adam on LLM pretraining.
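Both AdaHessian and Sophia build on Hutchinson-style estimates of the Hessian diagonal. A minimal PyTorch sketch of the estimator itself (the `loss_fn(params)` signature is a placeholder assumption, and the parameters must require gradients):

```python
import torch

def hutchinson_diag_hessian(loss_fn, params, num_samples=1):
    """Estimate diag(H) as E[z * (H z)] with Rademacher probe vectors z."""
    diag_est = [torch.zeros_like(p) for p in params]
    for _ in range(num_samples):
        loss = loss_fn(params)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        zs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]   # entries in {-1, +1}
        # Hessian-vector product: differentiate (grad . z) w.r.t. the parameters.
        hvps = torch.autograd.grad(
            sum((g * z).sum() for g, z in zip(grads, zs)), params
        )
        for est, z, hvp in zip(diag_est, zs, hvps):
            est += z * hvp / num_samples   # z * (H z) has expectation diag(H)
    return diag_est
```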
Future accelerators may be designed with second-order methods in mind, for example with better support for the batched matrix inversions, eigendecompositions, and small dense solves that these methods rely on.
As hardware evolves, the computational balance between first and second-order methods may shift.
Adam's success is partly paradoxical: it uses a diagonal approximation to the Fisher, which should be crude. Yet it works remarkably well. Recent theory suggests Adam's effectiveness comes from implicit regularization and adaptivity to gradient distribution, not from accurate curvature capture. This insight is guiding new optimizer designs.
The future may lie in hybrid methods that combine first and second-order elements:
1. First-order start, second-order finish: Use Adam for early training, switch to K-FAC for fine-tuning.
2. Periodic curvature updates: Compute expensive curvature every N steps and use it to precondition many gradient steps in between (see the sketch after this list).
3. Curvature-aware learning rate: Use cheap curvature estimates (diagonal Hessian) to schedule learning rates per-parameter or per-layer.
4. Meta-learned optimizers: Neural networks that learn to optimize, potentially discovering novel algorithms that combine aspects of both approaches.
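A minimal sketch of idea (2), periodic curvature updates, where `compute_preconditioner` and `apply_preconditioner` are hypothetical callables supplied by the user (for example, a K-FAC factor refresh and the corresponding Kronecker-factored solve):

```python
def train_with_periodic_curvature(params, grad_fn, compute_preconditioner,
                                  apply_preconditioner, lr=1e-3,
                                  refresh_every=50, num_steps=1000):
    """Refresh an expensive curvature-based preconditioner every `refresh_every`
    steps and reuse it to precondition the cheap gradient steps in between."""
    precond = None
    for step in range(num_steps):
        grads = grad_fn(params)
        if step % refresh_every == 0:
            precond = compute_preconditioner(params)      # the expensive part
        updates = apply_preconditioner(precond, grads)    # cheap, reused many times
        params = [p - lr * u for p, u in zip(params, updates)]
    return params
```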
A fundamental open question: Does second-order optimization find different solutions than first-order methods?
Some evidence suggests that the implicit noise in SGD biases it toward flatter minima that generalize well, while aggressive curvature-following can converge quickly to sharper minima; other studies find the differences largely disappear once regularization and learning-rate schedules are tuned. Understanding the generalization properties of different optimizers remains an active research area.
This module has taken you on a journey through second-order optimization methods—from the foundational theory of Newton's method to the practical realities of training deep neural networks.
For most deep learning practitioners, the honest recommendation is:
Start with Adam/AdamW. It's robust, well-understood, and works across a wide range of problems.
Consider SGD+momentum for vision. CNNs trained on large datasets often benefit from SGD's implicit regularization.
Explore second-order methods when the problem is ill-conditioned, the model is small to medium scale, you can afford large batches, or you are fine-tuning close to a good solution.
Stay informed about new developments. Methods like Sophia and Shampoo are pushing second-order optimization to larger scales.
Second-order optimization methods represent some of the most elegant mathematics in machine learning—Newton's insights from the 17th century, information geometry from the 20th, and modern approximations from the 21st. Understanding these methods deeply, including their limitations, makes you a more complete machine learning engineer.
The gap between theoretical elegance and practical utility is a recurring theme in optimization. Methods that dominate in theory (full Newton) are impractical, while methods that dominate in practice (Adam) are theoretically approximate. Bridging this gap remains one of the great challenges in the field.
Congratulations! You have completed the Second-Order Optimization Methods module. You now have a rigorous understanding of Newton's method, Hessian-free optimization, natural gradient descent, K-FAC, and the practical considerations for choosing optimization strategies. This knowledge will serve you in diagnosing optimization challenges and making informed decisions about training neural networks.