Newton's method achieves quadratic convergence but requires computing and inverting the Hessian—O(n²) storage and O(n³) computation per iteration. Gradient descent requires only O(n) storage and computation but converges slowly. Quasi-Newton methods bridge this gap, achieving superlinear convergence (faster than linear, approaching quadratic) while requiring only gradient information and O(n²) storage (or O(n) for limited-memory variants).
The key insight is both simple and profound: rather than computing the Hessian at each iteration, we can approximate and update it using gradient differences observed during optimization. The Hessian encodes second-order information, and gradients at consecutive points implicitly contain this information—we just need to extract it cleverly.
By the end of this page, you will understand the secant condition that underlies quasi-Newton updates, derive the BFGS update formula, appreciate why BFGS maintains positive definiteness, implement L-BFGS for large-scale optimization, and critically evaluate when quasi-Newton methods outperform alternatives.
Quasi-Newton methods are built on a single fundamental requirement: the Hessian approximation should satisfy the secant equation (also called the quasi-Newton condition).
Derivation from first principles:
For a smooth function f(x), consider the Taylor expansion of the gradient around x_{k+1}:
∇f(x_k) ≈ ∇f(x_{k+1}) + ∇²f(x_{k+1})(x_k - x_{k+1})
Rearranging:
∇²f(x_{k+1})(x_{k+1} - x_k) ≈ ∇f(x_{k+1}) - ∇f(x_k)
Defining the step s_k = x_{k+1} - x_k and the gradient difference y_k = ∇f(x_{k+1}) - ∇f(x_k), we obtain the secant equation:
B_{k+1} s_k = y_k
where B_{k+1} is our approximation to the Hessian H(x_{k+1}).
The secant condition says: 'Our Hessian approximation B, when applied to the step we just took, should produce the gradient change we observed.' This is a consistency requirement—the approximation should at least explain recent observations.
The inverse Hessian variant:
Since Newton's method uses H⁻¹∇f, it's often more convenient to maintain an approximation to the inverse Hessian. Let H_k ≈ B_k⁻¹. The secant equation becomes:
H_{k+1} y_k = s_k
This form avoids explicitly inverting B_{k+1} at each iteration.
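Before moving on, it helps to see the secant relation concretely. For a quadratic f(x) = ½xᵀAx the gradient is linear, so the relation holds exactly: y_k = A s_k. The short sketch below (with an arbitrary made-up matrix A) verifies both the direct and inverse forms numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random symmetric positive definite "Hessian" for the quadratic f(x) = 0.5 x^T A x
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)

x_k = rng.standard_normal(5)
x_next = rng.standard_normal(5)

grad = lambda x: A @ x          # ∇f(x) = A x for this quadratic
s = x_next - x_k                # step s_k
y = grad(x_next) - grad(x_k)    # gradient difference y_k

# Secant equation B s = y holds exactly with B = A ...
assert np.allclose(A @ s, y)
# ... and the inverse form H y = s holds with H = A^{-1}
assert np.allclose(np.linalg.solve(A, y), s)
print("Secant equation verified for the quadratic model.")
```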
Underdetermination and the update problem:
The secant equation provides only n constraints (one vector equation), but the approximation has n² entries (or n(n+1)/2 if we require symmetry). With infinitely many matrices satisfying the secant condition, how do we choose?
Different quasi-Newton methods correspond to different ways of resolving this underdetermination; BFGS, described next, makes the choice that has proved most robust in practice.
BFGS is the most widely used quasi-Newton method, prized for its excellent numerical behavior and self-correcting properties. The update formula can be derived by solving an optimization problem over positive definite matrices.
Derivation (for the inverse Hessian approximation H):
We seek H_{k+1} that satisfies the secant equation H_{k+1} y_k = s_k, is symmetric, and stays as close as possible to the current approximation H_k, with closeness measured in a weighted Frobenius norm.
The solution to this constrained optimization problem is the BFGS update:
H_{k+1} = (I - ρ_k s_k y_kᵀ) H_k (I - ρ_k y_k s_kᵀ) + ρ_k s_k s_kᵀ
where ρ_k = 1 / (y_kᵀ s_k)
This formula is a rank-2 update: the difference H_{k+1} - H_k is a matrix of rank at most 2.
```python
import numpy as np
from typing import Tuple


def bfgs_update(
    H: np.ndarray,
    s: np.ndarray,
    y: np.ndarray,
    eps: float = 1e-10
) -> np.ndarray:
    """
    Compute BFGS update of inverse Hessian approximation.

    H_{k+1} = (I - ρ·s·yᵀ) H_k (I - ρ·y·sᵀ) + ρ·s·sᵀ

    Parameters:
    -----------
    H : np.ndarray
        Current inverse Hessian approximation (n, n)
    s : np.ndarray
        Step vector: s = x_{k+1} - x_k (n,)
    y : np.ndarray
        Gradient difference: y = ∇f_{k+1} - ∇f_k (n,)
    eps : float
        Small constant for numerical stability

    Returns:
    --------
    H_new : np.ndarray
        Updated inverse Hessian approximation (n, n)
    """
    ys = np.dot(y, s)

    # Curvature condition check
    if ys < eps:
        # Skip update if curvature condition violated
        return H

    rho = 1.0 / ys
    n = len(s)
    I = np.eye(n)

    # BFGS formula: H = (I - ρsyᵀ) H (I - ρysᵀ) + ρssᵀ
    V = I - rho * np.outer(s, y)
    H_new = V @ H @ V.T + rho * np.outer(s, s)

    return H_new


def bfgs_direct_hessian_update(
    B: np.ndarray,
    s: np.ndarray,
    y: np.ndarray,
    eps: float = 1e-10
) -> np.ndarray:
    """
    Direct BFGS update for Hessian approximation B.

    B_{k+1} = B_k - (B_k s sᵀ B_k)/(sᵀ B_k s) + (y yᵀ)/(yᵀ s)

    This is the Sherman-Morrison-Woodbury dual of the inverse update.
    """
    ys = np.dot(y, s)
    if ys < eps:
        return B

    Bs = B @ s
    sBs = np.dot(s, Bs)
    if abs(sBs) < eps:
        return B

    B_new = B - np.outer(Bs, Bs) / sBs + np.outer(y, y) / ys
    return B_new
```

The curvature condition:
For the BFGS update to maintain positive definiteness, we require:
y_kᵀ s_k > 0 (curvature condition)
This condition ensures ρ_k is well-defined and positive. Geometrically, it requires that moving in direction s_k increases the directional derivative—the function has positive curvature along the step direction.
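As a quick numerical check, the sketch below feeds arbitrary made-up vectors satisfying yᵀs > 0 into the bfgs_update function from the listing above and confirms that the secant equation holds and positive definiteness is preserved.

```python
import numpy as np

# Assumes bfgs_update from the listing above; the vectors here are arbitrary test data.
rng = np.random.default_rng(1)
n = 4
H = np.eye(n)                      # start from the identity
s = rng.standard_normal(n)
y = rng.standard_normal(n)
y = y if np.dot(y, s) > 0 else -y  # flip sign so the curvature condition yᵀs > 0 holds

H_new = bfgs_update(H, s, y)

eigvals = np.linalg.eigvalsh(H_new)
print("secant residual:", np.linalg.norm(H_new @ y - s))  # ~0: secant equation satisfied
print("min eigenvalue: ", eigvals.min())                  # > 0: positive definiteness preserved
```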
When is the curvature condition satisfied? It holds automatically whenever f is strictly convex; for general smooth functions it is guaranteed by a line search that satisfies the Wolfe conditions, discussed below.
Self-correcting property:
BFGS has a remarkable self-correcting property: even starting from a poor initial approximation (like H₀ = I), the approximation improves as gradients are observed. Historical gradient information accumulates in H, improving the approximation over time.
DFP was the first quasi-Newton method (1959-1963) but BFGS (1970) largely superseded it. BFGS is more stable numerically—DFP can degrade when line searches are imprecise, while BFGS remains robust. The Broyden class parameterizes the family between them: φ=0 gives BFGS, φ=1 gives DFP.
The full BFGS algorithm combines inverse Hessian updates with a line search satisfying the Wolfe conditions. These conditions ensure both sufficient decrease and sufficient curvature—critical for guaranteeing the curvature condition y_kᵀ s_k > 0.
Wolfe Conditions:
Given a descent direction d_k, the step size α must satisfy:
1. Sufficient decrease (Armijo): f(x_k + α d_k) ≤ f(x_k) + c₁ α ∇f(x_k)ᵀ d_k
2. Curvature condition: ∇f(x_k + α d_k)ᵀ d_k ≥ c₂ ∇f(x_k)ᵀ d_k
with 0 < c₁ < c₂ < 1. The strong Wolfe variant bounds |∇f(x_k + α d_k)ᵀ d_k| ≤ c₂ |∇f(x_k)ᵀ d_k|, which is what the implementation below checks. Typical values: c₁ = 10⁻⁴, c₂ = 0.9.
The curvature condition ensures we don't stop too early, guaranteeing sufficient curvature for the BFGS update.
```python
import numpy as np
from typing import Callable, Tuple, Optional


def wolfe_line_search(
    f: Callable[[np.ndarray], float],
    grad_f: Callable[[np.ndarray], np.ndarray],
    x: np.ndarray,
    d: np.ndarray,
    grad: np.ndarray,
    c1: float = 1e-4,
    c2: float = 0.9,
    alpha_init: float = 1.0,
    max_iter: int = 25
) -> Tuple[float, np.ndarray]:
    """
    Line search satisfying strong Wolfe conditions.

    Returns step size α and gradient at new point.
    Critical for BFGS to maintain positive definiteness.
    """
    f_0 = f(x)
    g_0 = np.dot(grad, d)  # Directional derivative at α=0

    alpha_lo, alpha_hi = 0.0, np.inf
    alpha = alpha_init

    for _ in range(max_iter):
        x_new = x + alpha * d
        f_new = f(x_new)
        grad_new = grad_f(x_new)
        g_new = np.dot(grad_new, d)

        # Check Armijo condition
        if f_new > f_0 + c1 * alpha * g_0:
            alpha_hi = alpha
            alpha = 0.5 * (alpha_lo + alpha_hi)
            continue

        # Check curvature condition (strong Wolfe)
        if abs(g_new) > c2 * abs(g_0):
            if g_new * g_0 < 0:
                alpha_hi = alpha
            else:
                alpha_lo = alpha
            if alpha_hi < np.inf:
                alpha = 0.5 * (alpha_lo + alpha_hi)
            else:
                alpha = 2 * alpha
            continue

        # Both conditions satisfied
        return alpha, grad_new

    return alpha, grad_f(x + alpha * d)


def bfgs_optimize(
    f: Callable[[np.ndarray], float],
    grad_f: Callable[[np.ndarray], np.ndarray],
    x0: np.ndarray,
    tol: float = 1e-8,
    max_iter: int = 1000,
    verbose: bool = False
) -> Tuple[np.ndarray, list]:
    """
    BFGS quasi-Newton optimization with Wolfe line search.

    Parameters:
    -----------
    f : callable
        Objective function f(x) -> scalar
    grad_f : callable
        Gradient function ∇f(x) -> n-vector
    x0 : np.ndarray
        Starting point (n,)
    tol : float
        Gradient norm tolerance for convergence
    max_iter : int
        Maximum number of iterations
    verbose : bool
        Print iteration information

    Returns:
    --------
    x : np.ndarray
        Approximate minimizer
    history : list
        List of (x, f(x), ||grad||) at each iteration
    """
    n = len(x0)
    x = x0.copy().astype(float)
    H = np.eye(n)  # Initial inverse Hessian: identity
    grad = grad_f(x)
    history = [(x.copy(), f(x), np.linalg.norm(grad))]

    for k in range(max_iter):
        grad_norm = np.linalg.norm(grad)
        if verbose and k % 10 == 0:
            print(f"Iter {k:4d}: f = {f(x):.6e}, ||∇f|| = {grad_norm:.6e}")

        # Check convergence
        if grad_norm < tol:
            if verbose:
                print(f"Converged in {k} iterations")
            break

        # Compute search direction
        d = -H @ grad

        # Line search (Wolfe conditions)
        alpha, grad_new = wolfe_line_search(f, grad_f, x, d, grad)

        # Compute step and gradient difference
        s = alpha * d
        y = grad_new - grad

        # Update position and gradient
        x = x + s
        grad = grad_new

        # BFGS update
        ys = np.dot(y, s)
        if ys > 1e-10:  # Curvature condition
            rho = 1.0 / ys
            I = np.eye(n)
            V = I - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)

        history.append((x.copy(), f(x), np.linalg.norm(grad)))

    return x, history


# Example: Rosenbrock function
def rosenbrock(x):
    """Rosenbrock function (extended to n dimensions)"""
    return sum(100 * (x[i+1] - x[i]**2)**2 + (1 - x[i])**2
               for i in range(len(x) - 1))


def rosenbrock_grad(x):
    """Gradient of Rosenbrock function"""
    n = len(x)
    grad = np.zeros(n)
    for i in range(n - 1):
        grad[i] += -400 * x[i] * (x[i+1] - x[i]**2) - 2 * (1 - x[i])
        grad[i+1] += 200 * (x[i+1] - x[i]**2)
    return grad


if __name__ == "__main__":
    # 10-dimensional Rosenbrock
    x0 = np.zeros(10)
    x_opt, history = bfgs_optimize(
        rosenbrock, rosenbrock_grad, x0, verbose=True
    )
    print(f"\nOptimum: {x_opt}")
    print(f"Optimal value: {rosenbrock(x_opt):.2e}")
    print(f"Total iterations: {len(history)}")
```

Starting with H₀ = I (identity) works but can be improved.
A common trick: after the first iteration, rescale H₀ = (y₀ᵀs₀)/(y₀ᵀy₀) × I. This scales the initial approximation to match the observed curvature along the first step direction.
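A minimal sketch of that rescaling (the function name and variables are illustrative, not from a library):

```python
import numpy as np

def rescaled_initial_inverse_hessian(s0: np.ndarray, y0: np.ndarray) -> np.ndarray:
    """Return H_0 = (y0·s0)/(y0·y0) * I, matching the curvature observed on the first step."""
    gamma = np.dot(y0, s0) / np.dot(y0, y0)
    return gamma * np.eye(len(s0))
```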
BFGS achieves superlinear convergence under reasonable conditions—faster than gradient descent's linear rate, though not quite Newton's quadratic rate.
Formal convergence result (Dennis-Moré conditions):
If f is twice continuously differentiable with Lipschitz continuous Hessian, and {x_k} converges to x* with H(x*) positive definite, then:
lim_{k→∞} ‖(H_k - H(x*)⁻¹)∇f_k‖ / ‖∇f_k‖ = 0
This implies superlinear convergence:
‖x_{k+1} - x*‖ / ‖x_k - x*‖ → 0 as k → ∞
The ratio of successive errors goes to zero: each iteration reduces the error by a larger and larger relative factor as the iterates approach the solution.
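You can observe this empirically. The sketch below reuses bfgs_optimize and rosenbrock from the listing above and prints the ratio of successive distances to the known minimizer x* = (1, ..., 1); the ratio should shrink toward zero in the final iterations.

```python
import numpy as np

# Assumes bfgs_optimize, rosenbrock, and rosenbrock_grad from the BFGS listing above.
x_star = np.ones(10)  # known minimizer of the Rosenbrock function
x_opt, history = bfgs_optimize(rosenbrock, rosenbrock_grad, np.zeros(10))

errors = [np.linalg.norm(x - x_star) for x, _, _ in history]
for k in range(len(errors) - 1):
    if errors[k] > 0:
        # Superlinear convergence: this ratio should tend toward 0 near the end
        print(f"iter {k:3d}  ||e_k|| = {errors[k]:.3e}  ratio = {errors[k+1] / errors[k]:.3f}")
```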
| Method | Rate Type | Typical iterations (n=100) | Typical iterations (n=1,000) | Typical iterations (n=10,000) |
|---|---|---|---|---|
| Gradient Descent | Linear O(κ log 1/ε) | ~10,000 | ~100,000 | ~1,000,000 |
| Conjugate Gradient | Superlinear | ~100 | ~1,000 | ~10,000 |
| BFGS | Superlinear | ~50-100 | ~100-200 | ~200-500 |
| L-BFGS (m=10) | Superlinear | ~75-150 | ~150-300 | ~300-800 |
| Newton | Quadratic | ~10 | ~10 | Infeasible (n³ cost) |
Global convergence:
With a line search satisfying the Wolfe conditions, BFGS is globally convergent on smooth convex functions that are bounded below. For general nonconvex functions it typically reaches a stationary point in practice, although a complete theoretical guarantee is still lacking.
The condition number independence:
Unlike gradient descent, whose convergence rate degrades with the condition number κ = λ_max/λ_min of the Hessian, BFGS's rate is largely independent of κ. The inverse Hessian approximation automatically adapts to different curvatures along different directions.
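The sketch below illustrates this on a made-up diagonal quadratic with condition number κ = 10⁴, reusing bfgs_optimize from above and comparing against fixed-step gradient descent.

```python
import numpy as np

# Assumes bfgs_optimize from the BFGS listing above.
d = np.logspace(0, 4, 50)               # eigenvalues from 1 to 1e4  ->  κ = 1e4
f = lambda x: 0.5 * np.dot(x, d * x)    # ill-conditioned diagonal quadratic
grad_f = lambda x: d * x

x0 = np.ones(50)

# BFGS: iteration count is nearly insensitive to κ
_, hist = bfgs_optimize(f, grad_f, x0, tol=1e-6)
print("BFGS iterations:            ", len(hist) - 1)

# Fixed-step gradient descent with step 1/λ_max: rate degrades with κ
x, k = x0.copy(), 0
while np.linalg.norm(grad_f(x)) > 1e-6 and k < 500_000:
    x -= (1.0 / d.max()) * grad_f(x)
    k += 1
print("Gradient descent iterations:", k)
```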
Why not quadratic convergence?
BFGS doesn't achieve Newton's quadratic rate because H_k only approximates H⁻¹—the approximation error prevents true quadratic steps. However, the approximation becomes increasingly accurate near the solution (Dennis-Moré conditions), explaining the superlinear acceleration.
For most ML optimization problems requiring moderate precision (tol ≈ 10⁻⁶), BFGS often converges in 10-100 iterations regardless of problem size. This makes it highly practical for problems where gradient evaluation dominates and O(n²) storage is affordable.
Standard BFGS requires O(n²) storage for the inverse Hessian approximation—prohibitive when n exceeds a few thousand. L-BFGS (Limited-memory BFGS) addresses this by storing only the last m pairs (s_k, y_k) and computing the search direction on-the-fly using a two-loop recursion.
Key insight:
We don't actually need the full matrix H_k. We only need the product H_k ∇f_k (the search direction). This product can be computed efficiently using just the stored vector pairs, without ever forming H_k explicitly.
Storage complexity:
Standard BFGS stores the dense n×n matrix H_k: O(n²) numbers. L-BFGS stores only the m most recent pairs (s_i, y_i): 2mn numbers, i.e. O(mn).
For n = 1,000,000 and m = 10 with single-precision (4-byte) floats: BFGS needs ~4 TB, while L-BFGS needs ~80 MB!
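The arithmetic behind those numbers, as a quick sanity check (assuming 4-byte single-precision floats):

```python
n, m = 1_000_000, 10
bytes_per_float = 4                      # single precision

full_bfgs = n * n * bytes_per_float      # dense n x n inverse Hessian approximation
lbfgs = 2 * m * n * bytes_per_float      # m pairs of n-vectors (s_i, y_i)

print(f"BFGS:   {full_bfgs / 1e12:.1f} TB")   # ~4.0 TB
print(f"L-BFGS: {lbfgs / 1e6:.0f} MB")        # ~80 MB
```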
```python
import numpy as np
from collections import deque
from typing import Callable, Tuple, List


def lbfgs_two_loop_recursion(
    grad: np.ndarray,
    s_history: List[np.ndarray],
    y_history: List[np.ndarray],
    rho_history: List[float],
    gamma: float = 1.0
) -> np.ndarray:
    """
    L-BFGS two-loop recursion to compute search direction.

    Computes H_k @ grad without forming H_k explicitly.
    Uses only the m most recent (s, y, ρ) triplets.

    Parameters:
    -----------
    grad : np.ndarray
        Current gradient (n,)
    s_history : list
        Recent s_i = x_{i+1} - x_i vectors
    y_history : list
        Recent y_i = ∇f_{i+1} - ∇f_i vectors
    rho_history : list
        Recent ρ_i = 1/(y_i·s_i) scalars
    gamma : float
        Scaling for initial H_0 = gamma * I

    Returns:
    --------
    d : np.ndarray
        Search direction -H_k @ grad (n,)
    """
    q = grad.copy()
    m = len(s_history)

    if m == 0:
        # No history: steepest descent
        return -gamma * grad

    # First loop: right-to-left
    alpha = np.zeros(m)
    for i in range(m - 1, -1, -1):
        alpha[i] = rho_history[i] * np.dot(s_history[i], q)
        q = q - alpha[i] * y_history[i]

    # Initial direction: H_0 @ q where H_0 = gamma * I
    r = gamma * q

    # Second loop: left-to-right
    for i in range(m):
        beta = rho_history[i] * np.dot(y_history[i], r)
        r = r + (alpha[i] - beta) * s_history[i]

    return -r


def lbfgs_optimize(
    f: Callable[[np.ndarray], float],
    grad_f: Callable[[np.ndarray], np.ndarray],
    x0: np.ndarray,
    m: int = 10,
    tol: float = 1e-8,
    max_iter: int = 1000,
    verbose: bool = False
) -> Tuple[np.ndarray, list]:
    """
    L-BFGS optimization with limited memory.

    Parameters:
    -----------
    f : callable
        Objective function f(x) -> scalar
    grad_f : callable
        Gradient function ∇f(x) -> n-vector
    x0 : np.ndarray
        Starting point (n,)
    m : int
        Number of (s, y) pairs to store (memory parameter)
    tol : float
        Gradient norm tolerance for convergence
    max_iter : int
        Maximum number of iterations
    verbose : bool
        Print iteration information

    Returns:
    --------
    x : np.ndarray
        Approximate minimizer
    history : list
        List of (x, f(x), ||grad||) at each iteration
    """
    x = x0.copy().astype(float)

    # History storage (using deque for efficient pop-left)
    s_history = deque(maxlen=m)
    y_history = deque(maxlen=m)
    rho_history = deque(maxlen=m)

    grad = grad_f(x)
    history = [(x.copy(), f(x), np.linalg.norm(grad))]

    for k in range(max_iter):
        grad_norm = np.linalg.norm(grad)
        if verbose and k % 20 == 0:
            print(f"Iter {k:4d}: f = {f(x):.6e}, ||∇f|| = {grad_norm:.6e}")

        # Check convergence
        if grad_norm < tol:
            if verbose:
                print(f"Converged in {k} iterations")
            break

        # Compute initial H_0 scaling
        if len(y_history) > 0:
            # Barzilai-Borwein scaling
            gamma = (np.dot(s_history[-1], y_history[-1])
                     / np.dot(y_history[-1], y_history[-1]))
        else:
            gamma = 1.0 / grad_norm  # Scale by inverse gradient norm initially

        # Compute search direction using two-loop recursion
        d = lbfgs_two_loop_recursion(
            grad, list(s_history), list(y_history), list(rho_history), gamma
        )

        # Line search (simplified backtracking for brevity)
        alpha = 1.0
        c1 = 1e-4
        f_x = f(x)
        grad_d = np.dot(grad, d)
        for _ in range(20):
            if f(x + alpha * d) <= f_x + c1 * alpha * grad_d:
                break
            alpha *= 0.5

        # Update
        x_new = x + alpha * d
        grad_new = grad_f(x_new)

        s = x_new - x
        y = grad_new - grad
        ys = np.dot(y, s)

        # Store new (s, y, ρ) if curvature condition satisfied
        if ys > 1e-10:
            s_history.append(s)
            y_history.append(y)
            rho_history.append(1.0 / ys)

        x = x_new
        grad = grad_new
        history.append((x.copy(), f(x), np.linalg.norm(grad)))

    return x, history


if __name__ == "__main__":
    # Test on high-dimensional Rosenbrock
    n = 1000

    def rosenbrock_nd(x):
        return sum(100 * (x[i+1] - x[i]**2)**2 + (1 - x[i])**2
                   for i in range(len(x) - 1))

    def rosenbrock_grad_nd(x):
        n = len(x)
        grad = np.zeros(n)
        for i in range(n - 1):
            grad[i] += -400 * x[i] * (x[i+1] - x[i]**2) - 2 * (1 - x[i])
            grad[i+1] += 200 * (x[i+1] - x[i]**2)
        return grad

    x0 = np.zeros(n)
    x_opt, history = lbfgs_optimize(
        rosenbrock_nd, rosenbrock_grad_nd, x0,
        m=20, verbose=True, max_iter=500
    )
    print(f"\nFinal gradient norm: {np.linalg.norm(rosenbrock_grad_nd(x_opt)):.2e}")
    print(f"Optimal value: {rosenbrock_nd(x_opt):.2e}")
    print(f"Total iterations: {len(history)}")
```

Typical values are m = 3-20. Larger m means better Hessian approximation but more storage and computation per iteration. For most problems, m = 10-20 captures enough curvature information. Very large m (>50) rarely helps and can even hurt on non-quadratic functions.
Production L-BFGS implementations include several refinements beyond the basic two-loop recursion:
1. Initial Hessian Scaling (γ_k):
The scaling γ_k = (s_{k-1}ᵀ y_{k-1}) / (y_{k-1}ᵀ y_{k-1}) is crucial. It approximates the average curvature along the most recent direction, providing a much better initial approximation than γ = 1.
2. Wolfe Line Search:
Full Wolfe conditions (not just Armijo) are essential for guaranteeing the curvature condition yᵀs > 0 on non-convex functions. The strong Wolfe conditions also prevent cycling.
3. Gradient Scaling:
On the first iteration (empty history), scaling the initial direction as d = -grad/‖grad‖ or d = -γ·grad with γ based on problem-specific knowledge prevents excessively large initial steps.
4. Handling Curvature Condition Violations:
When yᵀs ≤ 0 (which can happen with an imprecise line search on non-convex functions), common options are to skip the update (keep the existing history), to apply Powell damping (replace y with a damped combination of y and B s so the curvature condition holds), or to discard the stored history and restart.
5. Memory Management:
Using circular buffers (deque in Python) with maxlen=m automatically evicts the oldest pairs, maintaining O(1) insertion. The vectors s_i and y_i should be stored contiguously for cache efficiency.
6. Parallel Gradient Computation:
In ML applications, gradient computation typically dominates. L-BFGS can parallelize gradient evaluation across data batches while the two-loop recursion runs serially (it's inherently sequential but fast: O(mn) flops).
| Parameter | Typical Value | Effect of Increase | When to Change |
|---|---|---|---|
| m (memory) | 10-20 | Better approximation, more memory | Increase for ill-conditioned problems |
| c₁ (Armijo) | 10⁻⁴ | Stricter decrease requirement | Rarely changed |
| c₂ (curvature) | 0.9 | Looser curvature requirement | Decrease (0.1-0.5) for non-convex |
| α_init | 1.0 | Larger initial step size | Reduce if steps overshoot |
| max_iter | 1000 | More iterations allowed | Increase for large-scale problems |
If the objective function changes (e.g., when regularization weight changes), the stored (s, y) pairs become invalid—they encode curvature of the old function. Clear the history when the function changes significantly, or use damped updates that gradually forget old information.
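A small sketch of this pattern (made-up least-squares data, reusing lbfgs_optimize from the listing above): each new regularization weight gets a fresh, empty curvature history, while the solution itself is warm-started from the previous solve.

```python
import numpy as np

# Assumes lbfgs_optimize from the L-BFGS listing above. Data are made up for illustration.
rng = np.random.default_rng(2)
A, b = rng.standard_normal((200, 50)), rng.standard_normal(200)

x = np.zeros(50)
for reg in [1.0, 0.1, 0.01]:
    # Each call starts with an empty (s, y) history: curvature pairs gathered under the
    # old regularization weight do not describe the new objective.
    f = lambda x, reg=reg: 0.5 * np.sum((A @ x - b) ** 2) + 0.5 * reg * np.dot(x, x)
    grad_f = lambda x, reg=reg: A.T @ (A @ x - b) + reg * x
    x, hist = lbfgs_optimize(f, grad_f, x, m=10)   # x warm-started from the previous solve
    print(f"reg={reg:5.2f}: {len(hist) - 1:3d} iterations, f = {f(x):.4f}")
```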
L-BFGS is the workhorse optimizer for many machine learning applications, particularly those with smooth objectives and deterministic gradients.
Classical ML (L-BFGS is often the best choice): logistic regression, conditional random fields, Gaussian process hyperparameter tuning, and other smooth objectives with deterministic gradients, as in the examples below.
```python
import numpy as np
from scipy.optimize import minimize


# Example 1: Logistic Regression with L-BFGS
def logistic_loss_and_grad(w, X, y, reg=0.01):
    """
    Compute L2-regularized logistic loss and gradient.

    Loss = Σ log(1 + exp(-y_i * wᵀx_i)) + (reg/2)||w||²
    """
    N = len(y)
    z = y * (X @ w)

    # Numerically stable log-sum-exp
    loss = np.sum(np.logaddexp(0, -z)) / N + 0.5 * reg * np.dot(w, w)

    # Gradient
    sigmoid_neg = 1 / (1 + np.exp(z))  # σ(-z) = 1 - σ(z)
    grad = -X.T @ (y * sigmoid_neg) / N + reg * w

    return loss, grad


def train_logistic_lbfgs(X, y, reg=0.01):
    """Train logistic regression using scipy's L-BFGS-B."""
    n_features = X.shape[1]
    w0 = np.zeros(n_features)

    result = minimize(
        logistic_loss_and_grad,
        w0,
        args=(X, y, reg),
        method='L-BFGS-B',
        jac=True,  # Function returns (loss, gradient)
        options={'maxiter': 1000, 'disp': True}
    )
    return result.x


# Example 2: Neural Network training with L-BFGS
def mlp_forward_backward(params, X, y, hidden_size=100):
    """
    Two-layer MLP: X -> hidden -> softmax output
    Returns loss and flattened gradient.
    """
    n_features = X.shape[1]
    n_classes = np.max(y) + 1

    # Unpack parameters
    W1_size = n_features * hidden_size
    b1_size = hidden_size
    W2_size = hidden_size * n_classes

    W1 = params[:W1_size].reshape(n_features, hidden_size)
    b1 = params[W1_size:W1_size + b1_size]
    W2 = params[W1_size + b1_size:W1_size + b1_size + W2_size].reshape(hidden_size, n_classes)
    b2 = params[W1_size + b1_size + W2_size:]

    # Forward pass
    h = np.maximum(0, X @ W1 + b1)  # ReLU
    logits = h @ W2 + b2

    # Softmax and cross-entropy
    logits_max = logits.max(axis=1, keepdims=True)
    exp_logits = np.exp(logits - logits_max)
    probs = exp_logits / exp_logits.sum(axis=1, keepdims=True)

    N = len(y)
    loss = -np.log(probs[np.arange(N), y] + 1e-8).mean()

    # Backward pass
    dlogits = probs.copy()
    dlogits[np.arange(N), y] -= 1
    dlogits /= N

    dW2 = h.T @ dlogits
    db2 = dlogits.sum(axis=0)
    dh = dlogits @ W2.T
    dh[h <= 0] = 0  # ReLU gradient
    dW1 = X.T @ dh
    db1 = dh.sum(axis=0)

    # Pack gradients
    grad = np.concatenate([dW1.ravel(), db1, dW2.ravel(), db2])

    return loss, grad


# Example 3: Using PyTorch's L-BFGS
"""
import torch
from torch.optim import LBFGS

model = MyModel()
optimizer = LBFGS(
    model.parameters(),
    lr=1.0,
    max_iter=20,
    history_size=10,
    line_search_fn='strong_wolfe'
)

def closure():
    optimizer.zero_grad()
    output = model(input)
    loss = criterion(output, target)
    loss.backward()
    return loss

# L-BFGS requires calling optimizer.step(closure) with a closure
for epoch in range(num_epochs):
    optimizer.step(closure)
"""
```

Why L-BFGS struggles with deep learning:
Stochastic gradients: Standard L-BFGS assumes exact gradients. Mini-batch noise violates the curvature condition, corrupting the Hessian approximation.
Non-smooth activations: ReLU, max-pooling, and dropout create non-differentiable points where Taylor approximations break down.
Massive scale: even the O(mn) history is heavy at deep-learning sizes; for m = 10 and n = 10⁸ parameters, the stored pairs alone occupy several gigabytes and the two-loop recursion strains memory bandwidth.
Saddle points: BFGS maintains a positive definite approximation, so it cannot exploit negative curvature; near saddle points the curvature condition frequently fails and progress can stall.
Stochastic L-BFGS attempts:
Recent research has developed stochastic variants, for example online L-BFGS (oLBFGS), stochastic quasi-Newton methods that build curvature pairs from subsampled Hessian-vector products, and multi-batch L-BFGS with overlapping samples.
These remain active research areas and are not yet production-ready for large neural networks.
Choosing between Newton, BFGS, L-BFGS, and first-order methods depends on problem characteristics. Here's a decision framework:
| Characteristic | Best Method | Rationale |
|---|---|---|
| n < 500, cheap Hessian | Newton | Quadratic convergence outweighs O(n³) cost |
| 500 < n < 5000, deterministic gradients | BFGS | Good convergence, O(n²) affordable |
| 5000 < n < 10⁶, deterministic gradients | L-BFGS | O(mn) storage, superlinear convergence |
| n > 10⁶, deterministic gradients | L-BFGS or Adam | Depends on structure and accuracy needs |
| Any n, stochastic gradients | Adam/SGD | Quasi-Newton corrupted by noise |
| Severely ill-conditioned | Newton/BFGS | First-order methods too slow |
| Sparse Hessian structure | Specialized Newton | Exploit structure for efficiency |
For classical ML (logistic regression, SVM, CRF, GP): start with L-BFGS. For deep learning: start with Adam. For very small problems: try Newton. Only switch if you have evidence the default choice is suboptimal for your specific problem.
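If it helps, the decision table can be condensed into a rough rule of thumb; the helper below is hypothetical and its thresholds are only the heuristics from the table, not hard rules.

```python
def suggest_optimizer(n: int, stochastic_gradients: bool, cheap_hessian: bool = False) -> str:
    """Rough optimizer suggestion mirroring the decision table above; thresholds are heuristics."""
    if stochastic_gradients:
        return "Adam/SGD"    # quasi-Newton curvature estimates are corrupted by gradient noise
    if n < 500 and cheap_hessian:
        return "Newton"      # quadratic convergence outweighs O(n^3) cost at this scale
    if n < 5000:
        return "BFGS"        # O(n^2) storage still affordable
    return "L-BFGS"          # O(mn) storage, superlinear convergence


print(suggest_optimizer(n=100_000, stochastic_gradients=False))  # -> "L-BFGS"
```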
Quasi-Newton methods represent the practical sweet spot between gradient descent's simplicity and Newton's power. BFGS and L-BFGS have become indispensable tools in the ML optimization toolkit.
What's next:
The next page explores Coordinate Descent—a fundamentally different approach that optimizes one coordinate (or block of coordinates) at a time. This strategy excels for problems with separable structure or when coordinate-wise updates admit closed-form solutions.
You now understand quasi-Newton methods: the secant condition, BFGS updates, L-BFGS memory efficiency, and practical application guidelines. This knowledge equips you to select and tune optimizers effectively for a wide range of machine learning problems.