Gradient descent, while foundational, is a first-order method—it uses only gradient information (first derivatives) to navigate the loss landscape. This limitation manifests as slow convergence near optima, sensitivity to learning rates, and difficulty with ill-conditioned problems.
Newton's method fundamentally changes the optimization paradigm by incorporating second-order information—the curvature of the objective function. Rather than taking small steps proportional to the gradient, Newton's method uses a local quadratic approximation to jump directly toward the minimum. This single insight transforms optimization from iterative trial to geometric precision.
By the end of this page, you will understand the mathematical derivation of Newton's method from Taylor expansion, analyze its quadratic convergence properties, implement the algorithm from first principles, and critically evaluate when Newton's method excels versus when simpler methods suffice.
Newton's method for optimization is an extension of Newton-Raphson iteration for root finding, first described by Isaac Newton in 1669 and later refined by Joseph Raphson in 1690. The connection is elegant: finding where a function achieves its minimum is equivalent to finding where its gradient equals zero.
The root-finding perspective:
If we want to minimize a scalar function f(x), we seek x* where f'(x*) = 0. Newton-Raphson finds roots of g(x) = 0 by iterating:
x_{n+1} = x_n - g(x_n) / g'(x_n)
Applying this to g(x) = f'(x), we obtain:
x_{n+1} = x_n - f'(x_n) / f''(x_n)
This is Newton's method for scalar optimization. The multivariate generalization replaces the scalar derivative with the gradient and the second derivative with the Hessian matrix.
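To see the scalar update in action, here is a minimal sketch. The test function f(x) = x² + eˣ is an illustrative choice (not taken from the text above); its second derivative is positive everywhere, so the iteration x_{n+1} = x_n - f'(x_n) / f''(x_n) is well defined.

```python
# Minimal sketch of scalar Newton's method: x_{n+1} = x_n - f'(x_n) / f''(x_n).
# The test function f(x) = x**2 + exp(x) is an illustrative choice; it is
# strictly convex, so f''(x) > 0 everywhere and the iteration is well defined.
import math

def f_prime(x):
    return 2 * x + math.exp(x)       # f'(x)

def f_double_prime(x):
    return 2 + math.exp(x)           # f''(x)

x = 0.0                              # starting guess
for n in range(10):
    step = f_prime(x) / f_double_prime(x)
    x = x - step                     # Newton update
    print(f"iter {n}: x = {x:.12f}, |f'(x)| = {abs(f_prime(x)):.2e}")
    if abs(step) < 1e-12:            # stop when the update is negligible
        break
```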
At each iteration, Newton's method fits a parabola (or paraboloid in higher dimensions) to the local shape of the function and jumps directly to that parabola's minimum. If the function is actually quadratic, Newton's method finds the exact minimum in one step. For non-quadratic functions, repeated quadratic approximations rapidly home in on the true minimum.
Why curvature matters:
Consider two scenarios where the gradient has the same magnitude but different curvatures: in one, the function curves sharply upward (a narrow valley); in the other, it is nearly flat.
Gradient descent cannot distinguish these scenarios—it takes steps proportional only to gradient magnitude. Newton's method naturally incorporates this distinction through the Hessian. When curvature is high (large Hessian eigenvalues), the Hessian inverse shrinks the step. When curvature is low (small eigenvalues), the step is amplified.
This curvature-adaptive step sizing is the fundamental innovation of second-order methods.
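A tiny numerical sketch of this point, using made-up one-dimensional numbers: with the same gradient, gradient descent takes the same step regardless of curvature, while the Newton step -f'(x)/f''(x) shrinks where curvature is high and grows where it is low.

```python
# Two 1-D points with the same gradient f'(x) = 4 but different curvatures.
# Gradient descent with a fixed learning rate takes the same step at both;
# the Newton step -f'(x) / f''(x) adapts automatically. (Illustrative numbers.)
grad = 4.0

for curvature in (100.0, 0.5):            # high curvature vs. nearly flat
    gd_step = -0.1 * grad                 # fixed learning rate 0.1
    newton_step = -grad / curvature       # curvature-scaled step
    print(f"f'' = {curvature:6.1f}: gradient-descent step = {gd_step:+.3f}, "
          f"Newton step = {newton_step:+.3f}")
```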
The rigorous derivation of Newton's method proceeds from the second-order Taylor expansion of the objective function. Let f: ℝⁿ → ℝ be a twice continuously differentiable function. Around a point x_k, the Taylor expansion is:
f(x_k + Δx) ≈ f(x_k) + ∇f(x_k)ᵀΔx + ½Δxᵀ H(x_k) Δx
where ∇f(x_k) ∈ ℝⁿ is the gradient at x_k, H(x_k) ∈ ℝⁿˣⁿ is the Hessian matrix of second partial derivatives, and Δx is the candidate step.
This quadratic approximation defines a local quadratic model m_k(Δx) of the objective function.
```python
import numpy as np

# Local quadratic model at point x_k
def quadratic_model(delta_x, f_k, grad_k, hessian_k):
    """
    Computes the quadratic approximation:
    m_k(Δx) = f(x_k) + ∇f(x_k)ᵀΔx + ½Δxᵀ H(x_k) Δx

    Parameters:
    -----------
    delta_x : np.ndarray
        Step direction (n,)
    f_k : float
        Function value at x_k
    grad_k : np.ndarray
        Gradient at x_k (n,)
    hessian_k : np.ndarray
        Hessian matrix at x_k (n, n)

    Returns:
    --------
    float : Quadratic model value
    """
    linear_term = np.dot(grad_k, delta_x)
    quadratic_term = 0.5 * np.dot(delta_x, hessian_k @ delta_x)
    return f_k + linear_term + quadratic_term
```

Minimizing the quadratic model:
To find the optimal step Δx*, we set the gradient of the quadratic model to zero:
∇_{Δx} m_k(Δx) = ∇f(x_k) + H(x_k)Δx = 0
Solving for Δx:
Δx* = -H(x_k)⁻¹ ∇f(x_k)
This is the Newton direction. The full Newton update is:
x_{k+1} = x_k - H(x_k)⁻¹ ∇f(x_k)
This formula encapsulates the essence of Newton's method: the step is the negative gradient, preconditioned (rescaled and rotated) by the inverse Hessian.
The inverse Hessian H⁻¹ acts as a variable metric—it stretches space along directions of low curvature and compresses it along directions of high curvature. This normalization makes Newton's method invariant to affine transformations of the coordinate system, a property first-order methods lack.
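The affine-invariance claim can be checked numerically. The sketch below uses an illustrative quadratic f(x) = ½xᵀQx - bᵀx and an arbitrary invertible matrix A (both chosen for illustration, not from the text): one Newton step in the original coordinates and one in the transformed coordinates y = A⁻¹x produce corresponding iterates, x₁ = Ay₁, up to rounding error.

```python
# Numerical check of affine invariance (a sketch with an illustrative quadratic):
# one Newton step on f(x) = 0.5 x^T Q x - b^T x versus one Newton step on the
# reparameterized g(y) = f(A y). The iterates correspond via x = A y.
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([[3.0, 1.0], [1.0, 2.0]])        # symmetric positive definite
b = np.array([1.0, -1.0])
A = np.array([[2.0, 0.5], [0.0, 1.5]])        # invertible change of variables

def newton_step(x, grad, hess):
    return x - np.linalg.solve(hess, grad)

x0 = rng.standard_normal(2)
y0 = np.linalg.solve(A, x0)                   # same starting point in y-coordinates

# Newton step in original coordinates: ∇f = Qx - b, ∇²f = Q
x1 = newton_step(x0, Q @ x0 - b, Q)

# Newton step in transformed coordinates: ∇g = Aᵀ(Q A y - b), ∇²g = Aᵀ Q A
y1 = newton_step(y0, A.T @ (Q @ (A @ y0) - b), A.T @ Q @ A)

print("x1     :", x1)
print("A @ y1 :", A @ y1)                     # matches x1 up to rounding error
```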
Second-order optimality conditions:
For the Newton direction to be a descent direction (moving toward a minimum rather than a maximum), the Hessian must be positive definite. Recall that at a stationary point, a positive definite Hessian (all eigenvalues positive) indicates a local minimum, a negative definite Hessian indicates a local maximum, and an indefinite Hessian (eigenvalues of mixed sign) indicates a saddle point.
If H is not positive definite at x_k, the Newton direction may point toward a maximum or saddle point. This is a fundamental challenge addressed by modifications we'll discuss later.
The remarkable property of Newton's method is its quadratic convergence rate near a solution. This means the number of correct digits roughly doubles at each iteration—exponentially faster than the linear convergence of gradient descent.
Formal statement (Local Convergence Theorem):
Let f: ℝⁿ → ℝ be twice continuously differentiable with Lipschitz continuous Hessian near a local minimum x*. If ∇f(x*) = 0, the Hessian H(x*) is positive definite, and the starting point x₀ is sufficiently close to x*, then Newton's method converges quadratically:
‖x_{k+1} - x*‖ ≤ C · ‖x_k - x*‖²
for some constant C > 0.
| Method | Convergence Rate | Error Reduction per Iteration | Iterations for 10⁻¹² Error |
|---|---|---|---|
| Gradient Descent | Linear: O(ρᵏ) | Constant factor ρ < 1 | ~260 (for ρ = 0.9) |
| Accelerated GD (Nesterov) | Linear, rate depends on √κ instead of κ | Improved constant | ~50-80 |
| Newton's Method | Quadratic: ‖e_{k+1}‖ = O(‖e_k‖²) | Error squared each step | ~4-6 (near convergence) |
| Conjugate Gradient | Super-linear | Depends on spectrum | ~20-40 |
Understanding quadratic convergence:
Consider a concrete example. Suppose at iteration k the error is ‖x_k - x*‖ = 10⁻². With quadratic convergence (taking C ≈ 1), the error falls to roughly 10⁻⁴, then 10⁻⁸, then 10⁻¹⁶ over the next three iterations.
Three iterations take us from 1% error to machine precision! This dramatic acceleration explains why Newton's method, despite its per-iteration cost, often achieves convergence in far fewer total operations.
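The digit-doubling behavior is easy to observe numerically. The sketch below uses the illustrative scalar function f(x) = eˣ - 2x (not from the text), whose minimizer x* = ln 2 is known exactly, and prints the error at each Newton iteration.

```python
# Tracking the error of Newton's method on f(x) = exp(x) - 2x, whose minimizer
# is x* = ln(2). The exponent of the error roughly doubles at each iteration.
import math

x_star = math.log(2.0)
x = 1.0                                   # start within the convergence basin
for k in range(6):
    error = abs(x - x_star)
    print(f"iter {k}: error = {error:.3e}")
    # Newton update: x - f'(x)/f''(x) with f'(x) = e^x - 2 and f''(x) = e^x
    x = x - (math.exp(x) - 2.0) / math.exp(x)
```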
The catch—local convergence:
The quadratic rate only holds when starting "sufficiently close" to x*. From far away, Newton's method may diverge, oscillate, or head toward a saddle point or maximum rather than a minimum.
This local convergence property motivates numerous modifications and globalizations.
The "sufficiently close" requirement is not just theoretical pedantry. In practice, pure Newton's method can fail spectacularly from poor starting points. The convergence basin can be surprisingly small for non-convex functions, and identifying whether you're inside it is generally impossible without already knowing x*.
Implementing Newton's method correctly requires careful attention to numerical stability, especially in computing the Newton direction. We'll develop the algorithm from first principles.
The Pure Newton Algorithm:
Input: Starting point x₀, tolerance ε, max iterations K
Output: Approximate minimizer x*
1. k ← 0
2. While ‖∇f(x_k)‖ > ε and k < K:
a. Compute gradient g_k = ∇f(x_k)
b. Compute Hessian H_k = ∇²f(x_k)
c. Solve linear system: H_k d_k = -g_k for Newton direction d_k
d. Update: x_{k+1} = x_k + d_k
e. k ← k + 1
3. Return x_k
The critical computational step is solving the linear system in step 2c. Never compute H⁻¹ explicitly—this is numerically unstable and computationally wasteful.
```python
import numpy as np
from scipy.linalg import cho_solve, cho_factor, solve
from typing import Callable, Tuple, Optional


def newtons_method(
    f: Callable[[np.ndarray], float],
    grad_f: Callable[[np.ndarray], np.ndarray],
    hess_f: Callable[[np.ndarray], np.ndarray],
    x0: np.ndarray,
    tol: float = 1e-8,
    max_iter: int = 100,
    verbose: bool = False
) -> Tuple[np.ndarray, list]:
    """
    Pure Newton's method for unconstrained optimization.

    Parameters:
    -----------
    f : callable
        Objective function f(x) -> scalar
    grad_f : callable
        Gradient function ∇f(x) -> n-vector
    hess_f : callable
        Hessian function ∇²f(x) -> n×n matrix
    x0 : np.ndarray
        Starting point (n,)
    tol : float
        Gradient norm tolerance for convergence
    max_iter : int
        Maximum number of iterations
    verbose : bool
        Print iteration information

    Returns:
    --------
    x : np.ndarray
        Approximate minimizer
    history : list
        List of (x, f(x), ||grad||) at each iteration
    """
    x = x0.copy().astype(float)
    history = []

    for k in range(max_iter):
        # Compute gradient and record progress
        grad = grad_f(x)
        grad_norm = np.linalg.norm(grad)
        f_val = f(x)
        history.append((x.copy(), f_val, grad_norm))

        if verbose:
            print(f"Iter {k:3d}: f = {f_val:.6e}, ||∇f|| = {grad_norm:.6e}")

        # Check convergence
        if grad_norm < tol:
            if verbose:
                print(f"Converged in {k} iterations")
            break

        # Compute Hessian
        H = hess_f(x)

        # Solve Newton system: H @ d = -grad
        # Using Cholesky factorization (assumes H is positive definite)
        try:
            L, lower = cho_factor(H)
            d = cho_solve((L, lower), -grad)
        except np.linalg.LinAlgError:
            # Hessian not positive definite, fall back to general solver
            d = solve(H, -grad)

        # Newton update (full step)
        x = x + d

    return x, history


# Example: Minimize Rosenbrock function
def rosenbrock(x):
    """Rosenbrock function: f(x,y) = (a-x)² + b(y-x²)²"""
    a, b = 1.0, 100.0
    return (a - x[0])**2 + b * (x[1] - x[0]**2)**2


def rosenbrock_grad(x):
    """Gradient of Rosenbrock function"""
    a, b = 1.0, 100.0
    dx = -2*(a - x[0]) - 4*b*x[0]*(x[1] - x[0]**2)
    dy = 2*b*(x[1] - x[0]**2)
    return np.array([dx, dy])


def rosenbrock_hess(x):
    """Hessian of Rosenbrock function"""
    b = 100.0
    h11 = 2 - 4*b*x[1] + 12*b*x[0]**2
    h12 = -4*b*x[0]
    h22 = 2*b
    return np.array([[h11, h12],
                     [h12, h22]])


if __name__ == "__main__":
    # Starting point far from minimum at (1, 1)
    x0 = np.array([-1.0, 1.0])

    x_opt, history = newtons_method(
        rosenbrock, rosenbrock_grad, rosenbrock_hess, x0, verbose=True
    )

    print(f"\nOptimum found at: {x_opt}")
    print(f"Function value: {rosenbrock(x_opt):.2e}")
```

Always solve H·d = -g rather than computing d = H⁻¹·(-g). Use Cholesky factorization (cho_solve) when H is positive definite—it's twice as fast as LU and numerically more stable. For indefinite H, use LDL factorization or regularization techniques.
When the Hessian is not positive definite, the Newton direction may not be a descent direction, leading to convergence toward saddle points or maxima. Several strategies address this challenge:
1. Hessian Modification (Regularization):
Add a multiple of the identity matrix to ensure positive definiteness:
H̃_k = H_k + τI
where τ > 0 is chosen large enough that H̃_k is positive definite. This is equivalent to adding L2 regularization to the local quadratic model.
Selection strategies for τ: either increase τ geometrically until a Cholesky factorization of H_k + τI succeeds (sketched below), or compute the smallest eigenvalue λ_min of H_k and set τ = δ - λ_min for a small δ > 0, the approach used in modified_newton_direction later in this section.
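As a rough sketch of the Cholesky-retry strategy (the starting value of τ and the growth factor of 10 are illustrative choices, not prescribed values):

```python
# Sketch of "add τI until Cholesky succeeds": try to factor H + τI, and grow τ
# geometrically whenever the factorization fails. Initial τ and the growth
# factor are illustrative, not prescribed.
import numpy as np

def regularized_newton_direction(grad, H, tau0=1e-4, growth=10.0, max_tries=20):
    tau = 0.0
    for _ in range(max_tries):
        try:
            # Cholesky succeeds only if H + τI is positive definite
            L = np.linalg.cholesky(H + tau * np.eye(H.shape[0]))
            # Solve (H + τI) d = -grad using the triangular factors
            d = np.linalg.solve(L.T, np.linalg.solve(L, -grad))
            return d
        except np.linalg.LinAlgError:
            tau = tau0 if tau == 0.0 else tau * growth
    raise RuntimeError("Could not regularize Hessian to positive definiteness")
```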
2. Modified Cholesky Factorization:
More sophisticated algorithms (like the Gill-Murray modification) compute a modified Cholesky factorization directly, perturbing the factorization minimally to ensure positive definiteness while preserving the Cholesky structure.
```python
def modified_newton_direction(
    grad: np.ndarray,
    H: np.ndarray,
    min_eigval: float = 1e-8
) -> np.ndarray:
    """
    Compute modified Newton direction with Hessian regularization.

    If H is not positive definite, add τI to make it so.

    Parameters:
    -----------
    grad : np.ndarray
        Gradient vector (n,)
    H : np.ndarray
        Hessian matrix (n, n)
    min_eigval : float
        Minimum eigenvalue threshold for positive definiteness

    Returns:
    --------
    d : np.ndarray
        Modified Newton direction (n,)
    """
    n = len(grad)

    # Compute eigendecomposition and find the minimum eigenvalue
    eigenvalues, _ = np.linalg.eigh(H)
    lambda_min = eigenvalues.min()

    if lambda_min > min_eigval:
        # H is already positive definite
        tau = 0.0
    else:
        # Regularize to ensure positive definiteness
        tau = -lambda_min + min_eigval

    # Regularized Hessian
    H_reg = H + tau * np.eye(n)

    # Solve modified Newton system
    d = np.linalg.solve(H_reg, -grad)

    return d


def trust_region_cauchy_point(
    grad: np.ndarray,
    H: np.ndarray,
    delta: float
) -> np.ndarray:
    """
    Compute Cauchy point for trust region subproblem.

    The Cauchy point minimizes the quadratic model along the
    steepest descent direction within the trust region.

    Parameters:
    -----------
    grad : np.ndarray
        Gradient vector (n,)
    H : np.ndarray
        Hessian matrix (n, n)
    delta : float
        Trust region radius

    Returns:
    --------
    p_c : np.ndarray
        Cauchy point (n,)
    """
    grad_norm = np.linalg.norm(grad)

    if grad_norm < 1e-12:
        return np.zeros_like(grad)

    # Curvature along the steepest descent direction
    gHg = np.dot(grad, H @ grad)

    if gHg <= 0:
        # Negative or zero curvature: go to the trust region boundary
        tau_star = delta / grad_norm
    else:
        # Positive curvature: unconstrained minimizer or boundary
        tau_unc = grad_norm**2 / gHg
        tau_star = min(tau_unc, delta / grad_norm)

    return -tau_star * grad
```

3. Trust Region Methods:
Rather than modifying the Hessian directly, trust region methods constrain the Newton step to a region where the quadratic model is trusted:
minimize m_k(p) = f_k + ∇f_kᵀp + ½pᵀH_kp subject to ‖p‖ ≤ Δ_k
where Δ_k is the trust region radius. This subproblem always has a well-defined solution, even when H_k is indefinite. The trust region radius adapts based on how well the quadratic model predicts actual function decrease.
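The radius update is typically a simple ratio test comparing actual to predicted decrease. The sketch below uses the conventional thresholds 0.25 and 0.75 and shrink/expand factors, all of which are illustrative tuning choices rather than fixed constants.

```python
# Sketch of trust-region radius adaptation: compare the actual decrease to the
# decrease predicted by the quadratic model and adjust Δ accordingly.
# Thresholds (0.25, 0.75) and scale factors are conventional illustrative values.
def update_trust_region(f_old, f_new, predicted_decrease, step_norm, delta, delta_max=10.0):
    actual_decrease = f_old - f_new
    rho = actual_decrease / max(predicted_decrease, 1e-16)   # agreement ratio

    if rho < 0.25:
        delta = 0.25 * delta                    # poor model: shrink the region
    elif rho > 0.75 and abs(step_norm - delta) < 1e-12:
        delta = min(2.0 * delta, delta_max)     # good model, step hit boundary: expand
    # otherwise keep delta unchanged

    accept_step = rho > 0.0                     # accept only if the function decreased
    return delta, accept_step
```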
4. Line Search with Negative Curvature Direction:
When H_k has a negative eigenvalue, we can extract the corresponding eigenvector and use it as a descent direction. The function decreases along directions of negative curvature, potentially escaping saddle points.
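A minimal sketch of extracting such a direction from an indefinite Hessian; orienting the eigenvector so it is non-ascending is a standard but illustrative sign convention.

```python
# Sketch of extracting a negative-curvature direction from an indefinite Hessian:
# take the eigenvector of the most negative eigenvalue and orient it downhill.
import numpy as np

def negative_curvature_direction(grad, H):
    eigenvalues, eigenvectors = np.linalg.eigh(H)   # eigenvalues in ascending order
    if eigenvalues[0] >= 0:
        return None                                  # no negative curvature available
    v = eigenvectors[:, 0]                           # eigenvector of most negative eigenvalue
    # Orient the direction so it is non-ascending (v and -v are both eigenvectors)
    if np.dot(grad, v) > 0:
        v = -v
    return v
```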
For strictly convex problems (like linear regression with regularization), the Hessian is always positive definite. For neural networks and other non-convex models, indefinite Hessians are the norm. Production optimizers like L-BFGS inherently maintain positive definite approximations, sidestepping this issue entirely.
Pure Newton's method takes full steps in the Newton direction, which can lead to divergence from poor starting points. Damped Newton methods introduce a step size α_k ∈ (0, 1]:
x_{k+1} = x_k + α_k d_k
where d_k = -H_k⁻¹∇f_k is the Newton direction.
Backtracking line search (Armijo rule):
The most common approach finds the largest α ≤ 1 satisfying the sufficient decrease condition:
f(x_k + αd_k) ≤ f(x_k) + c₁ α ∇f_kᵀd_k
where c₁ ∈ (0, 1) is typically 10⁻⁴. This ensures the function actually decreases, providing global convergence guarantees while preserving fast local convergence.
```python
def backtracking_line_search(
    f: Callable[[np.ndarray], float],
    x: np.ndarray,
    d: np.ndarray,
    grad: np.ndarray,
    alpha_init: float = 1.0,
    c1: float = 1e-4,
    rho: float = 0.5,
    max_iter: int = 50
) -> float:
    """
    Backtracking line search satisfying Armijo condition.

    Finds step size α such that:
    f(x + α·d) ≤ f(x) + c1·α·(∇f)ᵀd

    Parameters:
    -----------
    f : callable
        Objective function
    x : np.ndarray
        Current point
    d : np.ndarray
        Search direction (should be descent direction)
    grad : np.ndarray
        Gradient at x
    alpha_init : float
        Initial step size (1.0 for Newton methods)
    c1 : float
        Armijo parameter (typically 1e-4)
    rho : float
        Backtracking factor (typically 0.5)
    max_iter : int
        Maximum backtracking iterations

    Returns:
    --------
    alpha : float
        Step size satisfying Armijo condition
    """
    alpha = alpha_init
    f_x = f(x)
    grad_d = np.dot(grad, d)  # Directional derivative

    if grad_d >= 0:
        raise ValueError("d is not a descent direction (∇fᵀd ≥ 0)")

    for _ in range(max_iter):
        if f(x + alpha * d) <= f_x + c1 * alpha * grad_d:
            return alpha
        alpha *= rho

    return alpha


def damped_newton_method(
    f: Callable[[np.ndarray], float],
    grad_f: Callable[[np.ndarray], np.ndarray],
    hess_f: Callable[[np.ndarray], np.ndarray],
    x0: np.ndarray,
    tol: float = 1e-8,
    max_iter: int = 100,
    verbose: bool = False
) -> Tuple[np.ndarray, list]:
    """
    Damped Newton's method with backtracking line search.

    Combines global convergence of line search with local
    quadratic convergence of Newton's method.
    """
    x = x0.copy().astype(float)
    history = []

    for k in range(max_iter):
        grad = grad_f(x)
        grad_norm = np.linalg.norm(grad)
        f_val = f(x)
        history.append((x.copy(), f_val, grad_norm))

        if verbose:
            print(f"Iter {k:3d}: f = {f_val:.6e}, ||∇f|| = {grad_norm:.6e}")

        if grad_norm < tol:
            if verbose:
                print(f"Converged in {k} iterations")
            break

        # Compute Newton direction with regularization
        H = hess_f(x)
        d = modified_newton_direction(grad, H)

        # Line search for step size
        alpha = backtracking_line_search(f, x, d, grad)

        if verbose and alpha < 1.0:
            print(f"  Step size reduced to α = {alpha:.4f}")

        # Damped Newton update
        x = x + alpha * d

    return x, history
```

Convergence properties of damped Newton:
With appropriate line search: the method converges globally to a stationary point under mild assumptions, and near a strong local minimum the full step α = 1 is eventually accepted, recovering the local quadratic convergence rate.
This combination makes damped Newton practical for real-world optimization where initialization quality cannot be guaranteed.
Newton's method's superior convergence rate comes at significant computational cost. Understanding when this tradeoff is favorable requires careful complexity analysis.
Per-iteration costs:
| Operation | Cost (n parameters) | Dominant When |
|---|---|---|
| Gradient computation | O(n) to O(n·m) | Small n, large data (m samples) |
| Hessian computation | O(n²) to O(n²·m) | Moderate n |
| Hessian storage | O(n²) memory | Always for dense Hessian |
| Linear system solve | O(n³) time | Large n (thousands) |
| Total per iteration | O(n³) or O(n²·m) | Depends on problem structure |
Comparison with gradient descent:
For an optimization problem with n parameters, the total cost is (number of iterations) × (cost per iteration): gradient descent pays roughly O(n) per iteration over many iterations, while Newton's method pays O(n³) (or O(n²·m)) per iteration over far fewer iterations.
The crossover point depends on: the dimension n, the conditioning of the problem, the accuracy required, and how expensive the Hessian is to form and factor.
When Newton's method wins: problems with small-to-moderate n, high accuracy requirements, or severe ill-conditioning, where the Hessian automatically corrects for poor scaling.
Neural networks typically have n > 10⁶ parameters, making O(n²) Hessian storage alone prohibitive. This explains why deep learning universally uses first-order methods (SGD, Adam). Quasi-Newton methods like L-BFGS bridge the gap by approximating second-order information with O(n) storage, as we'll see in the next page.
Despite its cubic complexity, Newton's method remains valuable in specific machine learning contexts where its advantages outweigh computational costs.
Logistic Regression (IRLS):
Logistic regression is typically trained using Iteratively Reweighted Least Squares (IRLS), which is exactly Newton's method for maximum likelihood estimation. For N samples with d features, each iteration costs O(Nd) for the gradient, O(Nd²) to form the Hessian XᵀWX, and O(d³) to solve the Newton system.
For d << N (typical), this is highly efficient.
```python
def logistic_regression_newton(
    X: np.ndarray,
    y: np.ndarray,
    max_iter: int = 20,
    tol: float = 1e-8
) -> np.ndarray:
    """
    Logistic regression using Newton's method (IRLS).

    For binary classification:
    P(y=1|x) = σ(wᵀx) = 1/(1+exp(-wᵀx))

    Parameters:
    -----------
    X : np.ndarray
        Feature matrix (N, d)
    y : np.ndarray
        Binary labels (N,) with values in {0, 1}
    max_iter : int
        Maximum Newton iterations
    tol : float
        Convergence tolerance on weight change

    Returns:
    --------
    w : np.ndarray
        Optimal weights (d,)
    """
    N, d = X.shape
    w = np.zeros(d)  # Initialize weights

    for iteration in range(max_iter):
        # Compute predictions
        z = X @ w
        p = 1 / (1 + np.exp(-z))  # Sigmoid

        # Gradient of negative log-likelihood
        # ∇L = Xᵀ(p - y)
        gradient = X.T @ (p - y)

        # Hessian of negative log-likelihood
        # H = Xᵀ W X, where W = diag(p * (1-p))
        W = p * (1 - p)  # Diagonal weights

        # Efficient Hessian assembly without forming diag(W) explicitly
        XtW = X.T * W    # Broadcasting: (d, N) * (N,) = (d, N)
        H = XtW @ X      # (d, N) @ (N, d) = (d, d)

        # Add small regularization for numerical stability
        H += 1e-8 * np.eye(d)

        # Newton direction: solve H @ delta = -gradient
        delta = np.linalg.solve(H, -gradient)

        # Update weights
        w_new = w + delta

        # Check convergence
        if np.linalg.norm(delta) < tol:
            print(f"Converged in {iteration + 1} iterations")
            return w_new

        w = w_new

    print(f"Max iterations ({max_iter}) reached")
    return w


# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification

    # Generate synthetic binary classification data
    X, y = make_classification(
        n_samples=1000, n_features=20, n_informative=10,
        n_redundant=5, random_state=42
    )

    w_optimal = logistic_regression_newton(X, y)

    # Check accuracy
    predictions = (X @ w_optimal > 0).astype(int)
    accuracy = np.mean(predictions == y)
    print(f"Training accuracy: {accuracy:.2%}")
```

Gaussian Process Posterior:
GP inference requires computing K⁻¹y, where K is the kernel matrix. Newton's method efficiently optimizes hyperparameters by leveraging closed-form gradients and Hessians of the marginal likelihood.
Natural Gradient Methods:
In probabilistic models, the natural gradient preconditions the standard gradient with the Fisher information matrix (expected Hessian under the current distribution). This achieves approximate Newton-like adaptation without computing the true Hessian.
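Schematically, the natural-gradient update replaces the Hessian with the Fisher matrix F. In the sketch below, F is assumed to be supplied by the model, and the damping term is an illustrative addition for numerical stability.

```python
# Schematic natural-gradient step: precondition the gradient with the Fisher
# information matrix F instead of the Hessian. F is assumed to be supplied by
# the model; the damping term is an illustrative stabilizer.
import numpy as np

def natural_gradient_step(params, grad, fisher, lr=0.1, damping=1e-4):
    F_damped = fisher + damping * np.eye(len(params))
    delta = np.linalg.solve(F_damped, grad)    # F⁻¹ ∇f, the natural gradient
    return params - lr * delta
```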
Second-order optimization for neural networks:
Recent advances approximate the Hessian efficiently, for example Hessian-free optimization (which uses Hessian-vector products with conjugate gradient) and Kronecker-factored curvature approximations such as K-FAC.
These methods achieve better convergence than SGD for certain problems while maintaining tractable complexity.
For least-squares problems f(x) = ½‖r(x)‖², the Gauss-Newton method approximates the Hessian as H ≈ JᵀJ (ignoring second derivatives of residuals). This simplification is exact when residuals are small near the optimum and forms the basis for Levenberg-Marquardt and trust-region methods widely used in nonlinear regression.
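A brief sketch of one Gauss-Newton step (with λ > 0, a Levenberg-Marquardt step) for f(x) = ½‖r(x)‖²; the residual function and its Jacobian are assumed to be supplied by the caller.

```python
# Sketch of one Gauss-Newton / Levenberg-Marquardt step for f(x) = 0.5 * ||r(x)||^2.
# The Hessian is approximated as JᵀJ; λ > 0 gives the Levenberg-Marquardt damping.
# The residual function r and its Jacobian J are assumed to be provided.
import numpy as np

def gauss_newton_step(x, residual_fn, jacobian_fn, lam=0.0):
    r = residual_fn(x)                      # residual vector, shape (m,)
    J = jacobian_fn(x)                      # Jacobian, shape (m, n)
    A = J.T @ J + lam * np.eye(J.shape[1])  # approximate (damped) Hessian
    g = J.T @ r                             # gradient of 0.5 * ||r||^2
    return x - np.linalg.solve(A, g)        # solve (JᵀJ + λI) d = -Jᵀr
```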
Newton's method represents a fundamental paradigm in optimization—exploiting second-order information to achieve rapid convergence. While its computational cost limits direct application to large-scale ML, it provides the theoretical foundation for many practical algorithms.
What's next:
The next page explores Quasi-Newton methods—especially BFGS and L-BFGS—which achieve Newton-like convergence without computing the Hessian explicitly. These methods approximate the inverse Hessian using only gradient information, making second-order optimization practical for problems with thousands of parameters.
You now understand Newton's method: its derivation from Taylor expansion, quadratic convergence properties, implementation with line search and regularization, and application to machine learning problems. This foundation prepares you for quasi-Newton methods that scale these ideas to larger problems.