Gradient descent, while foundational, is a first-order method—it uses only gradient information (first derivatives) to navigate the loss landscape. This limitation manifests as slow convergence near optima, sensitivity to learning rates, and difficulty with ill-conditioned problems.
Newton's method fundamentally changes the optimization paradigm by incorporating second-order information—the curvature of the objective function. Rather than taking small steps proportional to the gradient, Newton's method uses a local quadratic approximation to jump directly toward the minimum. This single insight transforms optimization from iterative trial to geometric precision.
By the end of this page, you will understand the mathematical derivation of Newton's method from Taylor expansion, analyze its quadratic convergence properties, implement the algorithm from first principles, and critically evaluate when Newton's method excels versus when simpler methods suffice.
Newton's method for optimization is an extension of Newton-Raphson iteration for root finding, first described by Isaac Newton in 1669 and later refined by Joseph Raphson in 1690. The connection is elegant: finding where a function achieves its minimum is equivalent to finding where its gradient equals zero.
The root-finding perspective:
If we want to minimize a scalar function f(x), we seek x* where f'(x*) = 0. Newton-Raphson finds roots of g(x) = 0 by iterating:
x_{n+1} = x_n - g(x_n) / g'(x_n)
Applying this to g(x) = f'(x), we obtain:
x_{n+1} = x_n - f'(x_n) / f''(x_n)
This is Newton's method for scalar optimization. The multivariate generalization replaces the scalar derivative with the gradient and the second derivative with the Hessian matrix.
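To see the scalar update in action, here is a minimal sketch. The test function f(x) = x² + eˣ is an illustrative choice (not taken from the text above); its second derivative is positive everywhere, so the iteration x_{n+1} = x_n - f'(x_n) / f''(x_n) is well defined.

```python
# Minimal sketch of scalar Newton's method: x_{n+1} = x_n - f'(x_n) / f''(x_n).
# The test function f(x) = x**2 + exp(x) is an illustrative choice; it is
# strictly convex, so f''(x) > 0 everywhere and the iteration is well defined.
import math

def f_prime(x):
    return 2 * x + math.exp(x)       # f'(x)

def f_double_prime(x):
    return 2 + math.exp(x)           # f''(x)

x = 0.0                              # starting guess
for n in range(10):
    step = f_prime(x) / f_double_prime(x)
    x = x - step                     # Newton update
    print(f"iter {n}: x = {x:.12f}, |f'(x)| = {abs(f_prime(x)):.2e}")
    if abs(step) < 1e-12:            # stop when the update is negligible
        break
```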
At each iteration, Newton's method fits a parabola (or paraboloid in higher dimensions) to the local shape of the function and jumps directly to that parabola's minimum. If the function is actually quadratic, Newton's method finds the exact minimum in one step. For non-quadratic functions, repeated quadratic approximations rapidly home in on the true minimum.
Why curvature matters:
Consider two scenarios where the gradient has the same magnitude but different curvatures: in one, the function curves sharply upward (a narrow valley); in the other, it is nearly flat.
Gradient descent cannot distinguish these scenarios—it takes steps proportional only to gradient magnitude. Newton's method naturally incorporates this distinction through the Hessian. When curvature is high (large Hessian eigenvalues), the Hessian inverse shrinks the step. When curvature is low (small eigenvalues), the step is amplified.
This curvature-adaptive step sizing is the fundamental innovation of second-order methods.
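A tiny numerical sketch of this point, using made-up one-dimensional numbers: with the same gradient, gradient descent takes the same step regardless of curvature, while the Newton step -f'(x)/f''(x) shrinks where curvature is high and grows where it is low.

```python
# Two 1-D points with the same gradient f'(x) = 4 but different curvatures.
# Gradient descent with a fixed learning rate takes the same step at both;
# the Newton step -f'(x) / f''(x) adapts automatically. (Illustrative numbers.)
grad = 4.0

for curvature in (100.0, 0.5):            # high curvature vs. nearly flat
    gd_step = -0.1 * grad                 # fixed learning rate 0.1
    newton_step = -grad / curvature       # curvature-scaled step
    print(f"f'' = {curvature:6.1f}: gradient-descent step = {gd_step:+.3f}, "
          f"Newton step = {newton_step:+.3f}")
```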
The rigorous derivation of Newton's method proceeds from the second-order Taylor expansion of the objective function. Let f: ℝⁿ → ℝ be a twice continuously differentiable function. Around a point x_k, the Taylor expansion is:
f(x_k + Δx) ≈ f(x_k) + ∇f(x_k)ᵀΔx + ½Δxᵀ H(x_k) Δx
where ∇f(x_k) ∈ ℝⁿ is the gradient at x_k, H(x_k) ∈ ℝⁿˣⁿ is the Hessian matrix of second partial derivatives, and Δx is the candidate step.
This quadratic approximation defines a local quadratic model m_k(Δx) of the objective function.
```python
import numpy as np

# Local quadratic model at point x_k
def quadratic_model(delta_x, f_k, grad_k, hessian_k):
    """
    Computes the quadratic approximation:
    m_k(Δx) = f(x_k) + ∇f(x_k)ᵀΔx + ½Δxᵀ H(x_k) Δx

    Parameters:
    -----------
    delta_x : np.ndarray
        Step direction (n,)
    f_k : float
        Function value at x_k
    grad_k : np.ndarray
        Gradient at x_k (n,)
    hessian_k : np.ndarray
        Hessian matrix at x_k (n, n)

    Returns:
    --------
    float : Quadratic model value
    """
    linear_term = np.dot(grad_k, delta_x)
    quadratic_term = 0.5 * np.dot(delta_x, hessian_k @ delta_x)
    return f_k + linear_term + quadratic_term
```

Minimizing the quadratic model:
To find the optimal step Δx*, we set the gradient of the quadratic model to zero:
∇_{Δx} m_k(Δx) = ∇f(x_k) + H(x_k)Δx = 0
Solving for Δx:
Δx* = -H(x_k)⁻¹ ∇f(x_k)
This is the Newton direction. The full Newton update is:
x_{k+1} = x_k - H(x_k)⁻¹ ∇f(x_k)
This formula encapsulates the essence of Newton's method: the step is the negative gradient, preconditioned (rescaled and rotated) by the inverse Hessian.
The inverse Hessian H⁻¹ acts as a variable metric—it stretches space along directions of low curvature and compresses it along directions of high curvature. This normalization makes Newton's method invariant to affine transformations of the coordinate system, a property first-order methods lack.
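The affine-invariance claim can be checked numerically. The sketch below uses an illustrative quadratic f(x) = ½xᵀQx - bᵀx and an arbitrary invertible matrix A (both chosen for illustration, not from the text): one Newton step in the original coordinates and one in the transformed coordinates y = A⁻¹x produce corresponding iterates, x₁ = Ay₁, up to rounding error.

```python
# Numerical check of affine invariance (a sketch with an illustrative quadratic):
# one Newton step on f(x) = 0.5 x^T Q x - b^T x versus one Newton step on the
# reparameterized g(y) = f(A y). The iterates correspond via x = A y.
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([[3.0, 1.0], [1.0, 2.0]])        # symmetric positive definite
b = np.array([1.0, -1.0])
A = np.array([[2.0, 0.5], [0.0, 1.5]])        # invertible change of variables

def newton_step(x, grad, hess):
    return x - np.linalg.solve(hess, grad)

x0 = rng.standard_normal(2)
y0 = np.linalg.solve(A, x0)                   # same starting point in y-coordinates

# Newton step in original coordinates: ∇f = Qx - b, ∇²f = Q
x1 = newton_step(x0, Q @ x0 - b, Q)

# Newton step in transformed coordinates: ∇g = Aᵀ(Q A y - b), ∇²g = Aᵀ Q A
y1 = newton_step(y0, A.T @ (Q @ (A @ y0) - b), A.T @ Q @ A)

print("x1     :", x1)
print("A @ y1 :", A @ y1)                     # matches x1 up to rounding error
```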
Second-order optimality conditions:
For the Newton direction to be a descent direction (moving toward a minimum rather than a maximum), the Hessian must be positive definite. Recall that at a stationary point, a positive definite Hessian (all eigenvalues positive) indicates a local minimum, a negative definite Hessian indicates a local maximum, and an indefinite Hessian (eigenvalues of mixed sign) indicates a saddle point.
If H is not positive definite at x_k, the Newton direction may point toward a maximum or saddle point. This is a fundamental challenge addressed by modifications we'll discuss later.
The remarkable property of Newton's method is its quadratic convergence rate near a solution. This means the number of correct digits roughly doubles at each iteration—exponentially faster than the linear convergence of gradient descent.
Formal statement (Local Convergence Theorem):
Let f: ℝⁿ → ℝ be twice continuously differentiable with Lipschitz continuous Hessian near a local minimum x*. If ∇f(x*) = 0, the Hessian H(x*) is positive definite, and the starting point x₀ is sufficiently close to x*, then Newton's method converges quadratically:
‖x_{k+1} - x*‖ ≤ C · ‖x_k - x*‖²
for some constant C > 0.
| Method | Convergence Rate | Error Reduction per Iteration | Iterations for 10⁻¹² Error |
|---|---|---|---|
| Gradient Descent | Linear: O(ρᵏ) | Constant factor ρ < 1 | ~260 (for ρ = 0.9) |
| Accelerated GD (Nesterov) | Linear, rate depends on √κ instead of κ | Improved constant | ~50-80 |
| Newton's Method | Quadratic: ‖e_{k+1}‖ = O(‖e_k‖²) | Error squared each step | ~4-6 (near convergence) |
| Conjugate Gradient | Super-linear | Depends on spectrum | ~20-40 |
Understanding quadratic convergence:
Consider a concrete example. Suppose at iteration k the error is ‖x_k - x*‖ = 10⁻². With quadratic convergence (taking C ≈ 1), the error falls to roughly 10⁻⁴, then 10⁻⁸, then 10⁻¹⁶ over the next three iterations.
Three iterations take us from 1% error to machine precision! This dramatic acceleration explains why Newton's method, despite its per-iteration cost, often achieves convergence in far fewer total operations.
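The digit-doubling behavior is easy to observe numerically. The sketch below uses the illustrative scalar function f(x) = eˣ - 2x (not from the text), whose minimizer x* = ln 2 is known exactly, and prints the error at each Newton iteration.

```python
# Tracking the error of Newton's method on f(x) = exp(x) - 2x, whose minimizer
# is x* = ln(2). The exponent of the error roughly doubles at each iteration.
import math

x_star = math.log(2.0)
x = 1.0                                   # start within the convergence basin
for k in range(6):
    error = abs(x - x_star)
    print(f"iter {k}: error = {error:.3e}")
    # Newton update: x - f'(x)/f''(x) with f'(x) = e^x - 2 and f''(x) = e^x
    x = x - (math.exp(x) - 2.0) / math.exp(x)
```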
The catch—local convergence:
The quadratic rate only holds when starting "sufficiently close" to x*. From far away, Newton's method may diverge, oscillate, or head toward a saddle point or maximum rather than a minimum.
This local convergence property motivates numerous modifications and globalizations.
The "sufficiently close" requirement is not just theoretical pedantry. In practice, pure Newton's method can fail spectacularly from poor starting points. The convergence basin can be surprisingly small for non-convex functions, and identifying whether you're inside it is generally impossible without already knowing x*.
Implementing Newton's method correctly requires careful attention to numerical stability, especially in computing the Newton direction. We'll develop the algorithm from first principles.
The Pure Newton Algorithm:
Input: Starting point x₀, tolerance ε, max iterations K
Output: Approximate minimizer x*
1. k ← 0
2. While ‖∇f(x_k)‖ > ε and k < K:
a. Compute gradient g_k = ∇f(x_k)
b. Compute Hessian H_k = ∇²f(x_k)
c. Solve linear system: H_k d_k = -g_k for Newton direction d_k
d. Update: x_{k+1} = x_k + d_k
e. k ← k + 1
3. Return x_k
The critical computational step is solving the linear system in step 2c. Never compute H⁻¹ explicitly—this is numerically unstable and computationally wasteful.
```python
import numpy as np
from scipy.linalg import cho_solve, cho_factor, solve
from typing import Callable, Tuple, Optional


def newtons_method(
    f: Callable[[np.ndarray], float],
    grad_f: Callable[[np.ndarray], np.ndarray],
    hess_f: Callable[[np.ndarray], np.ndarray],
    x0: np.ndarray,
    tol: float = 1e-8,
    max_iter: int = 100,
    verbose: bool = False
) -> Tuple[np.ndarray, list]:
    """
    Pure Newton's method for unconstrained optimization.

    Parameters:
    -----------
    f : callable
        Objective function f(x) -> scalar
    grad_f : callable
        Gradient function ∇f(x) -> n-vector
    hess_f : callable
        Hessian function ∇²f(x) -> n×n matrix
    x0 : np.ndarray
        Starting point (n,)
    tol : float
        Gradient norm tolerance for convergence
    max_iter : int
        Maximum number of iterations
    verbose : bool
        Print iteration information

    Returns:
    --------
    x : np.ndarray
        Approximate minimizer
    history : list
        List of (x, f(x), ||grad||) at each iteration
    """
    x = x0.copy().astype(float)
    history = []

    for k in range(max_iter):
        # Compute gradient and record progress
        grad = grad_f(x)
        grad_norm = np.linalg.norm(grad)
        f_val = f(x)
        history.append((x.copy(), f_val, grad_norm))

        if verbose:
            print(f"Iter {k:3d}: f = {f_val:.6e}, ||∇f|| = {grad_norm:.6e}")

        # Check convergence
        if grad_norm < tol:
            if verbose:
                print(f"Converged in {k} iterations")
            break

        # Compute Hessian
        H = hess_f(x)

        # Solve Newton system: H @ d = -grad
        # Using Cholesky factorization (assumes H is positive definite)
        try:
            L, lower = cho_factor(H)
            d = cho_solve((L, lower), -grad)
        except np.linalg.LinAlgError:
            # Hessian not positive definite, fall back to general solver
            d = solve(H, -grad)

        # Newton update (full step)
        x = x + d

    return x, history


# Example: Minimize Rosenbrock function
def rosenbrock(x):
    """Rosenbrock function: f(x,y) = (a-x)² + b(y-x²)²"""
    a, b = 1.0, 100.0
    return (a - x[0])**2 + b * (x[1] - x[0]**2)**2


def rosenbrock_grad(x):
    """Gradient of Rosenbrock function"""
    a, b = 1.0, 100.0
    dx = -2*(a - x[0]) - 4*b*x[0]*(x[1] - x[0]**2)
    dy = 2*b*(x[1] - x[0]**2)
    return np.array([dx, dy])


def rosenbrock_hess(x):
    """Hessian of Rosenbrock function"""
    b = 100.0
    h11 = 2 - 4*b*x[1] + 12*b*x[0]**2
    h12 = -4*b*x[0]
    h22 = 2*b
    return np.array([[h11, h12],
                     [h12, h22]])


if __name__ == "__main__":
    # Starting point far from minimum at (1, 1)
    x0 = np.array([-1.0, 1.0])

    x_opt, history = newtons_method(
        rosenbrock, rosenbrock_grad, rosenbrock_hess, x0, verbose=True
    )

    print(f"\nOptimum found at: {x_opt}")
    print(f"Function value: {rosenbrock(x_opt):.2e}")
```

Always solve H·d = -g rather than computing d = H⁻¹·(-g). Use Cholesky factorization (cho_solve) when H is positive definite—it's twice as fast as LU and numerically more stable. For indefinite H, use LDL factorization or regularization techniques.
When the Hessian is not positive definite, the Newton direction may not be a descent direction, leading to convergence toward saddle points or maxima. Several strategies address this challenge:
1. Hessian Modification (Regularization):
Add a multiple of the identity matrix to ensure positive definiteness:
H̃_k = H_k + τI
where τ > 0 is chosen large enough that H̃_k is positive definite. This is equivalent to adding L2 regularization to the local quadratic model.
Selection strategies for τ: either increase τ geometrically until a Cholesky factorization of H_k + τI succeeds (sketched below), or compute the smallest eigenvalue λ_min of H_k and set τ = δ - λ_min for a small δ > 0, the approach used in modified_newton_direction later in this section.
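As a rough sketch of the Cholesky-retry strategy (the starting value of τ and the growth factor of 10 are illustrative choices, not prescribed values):

```python
# Sketch of "add τI until Cholesky succeeds": try to factor H + τI, and grow τ
# geometrically whenever the factorization fails. Initial τ and the growth
# factor are illustrative, not prescribed.
import numpy as np

def regularized_newton_direction(grad, H, tau0=1e-4, growth=10.0, max_tries=20):
    tau = 0.0
    for _ in range(max_tries):
        try:
            # Cholesky succeeds only if H + τI is positive definite
            L = np.linalg.cholesky(H + tau * np.eye(H.shape[0]))
            # Solve (H + τI) d = -grad using the triangular factors
            d = np.linalg.solve(L.T, np.linalg.solve(L, -grad))
            return d
        except np.linalg.LinAlgError:
            tau = tau0 if tau == 0.0 else tau * growth
    raise RuntimeError("Could not regularize Hessian to positive definiteness")
```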
2. Modified Cholesky Factorization:
More sophisticated algorithms (like the Gill-Murray modification) compute a modified Cholesky factorization directly, perturbing the factorization minimally to ensure positive definiteness while preserving the Cholesky structure.
```python
def modified_newton_direction(
    grad: np.ndarray,
    H: np.ndarray,
    min_eigval: float = 1e-8
) -> np.ndarray:
    """
    Compute modified Newton direction with Hessian regularization.

    If H is not positive definite, add τI to make it so.

    Parameters:
    -----------
    grad : np.ndarray
        Gradient vector (n,)
    H : np.ndarray
        Hessian matrix (n, n)
    min_eigval : float
        Minimum eigenvalue threshold for positive definiteness

    Returns:
    --------
    d : np.ndarray
        Modified Newton direction (n,)
    """
    n = len(grad)

    # Compute eigendecomposition and find the minimum eigenvalue
    eigenvalues, _ = np.linalg.eigh(H)
    lambda_min = eigenvalues.min()

    if lambda_min > min_eigval:
        # H is already positive definite
        tau = 0.0
    else:
        # Regularize to ensure positive definiteness
        tau = -lambda_min + min_eigval

    # Regularized Hessian
    H_reg = H + tau * np.eye(n)

    # Solve modified Newton system
    d = np.linalg.solve(H_reg, -grad)

    return d


def trust_region_cauchy_point(
    grad: np.ndarray,
    H: np.ndarray,
    delta: float
) -> np.ndarray:
    """
    Compute Cauchy point for trust region subproblem.

    The Cauchy point minimizes the quadratic model along the
    steepest descent direction within the trust region.

    Parameters:
    -----------
    grad : np.ndarray
        Gradient vector (n,)
    H : np.ndarray
        Hessian matrix (n, n)
    delta : float
        Trust region radius

    Returns:
    --------
    p_c : np.ndarray
        Cauchy point (n,)
    """
    grad_norm = np.linalg.norm(grad)

    if grad_norm < 1e-12:
        return np.zeros_like(grad)

    # Curvature along the steepest descent direction
    gHg = np.dot(grad, H @ grad)

    if gHg <= 0:
        # Negative or zero curvature: go to the trust region boundary
        tau_star = delta / grad_norm
    else:
        # Positive curvature: unconstrained minimizer or boundary
        tau_unc = grad_norm**2 / gHg
        tau_star = min(tau_unc, delta / grad_norm)

    return -tau_star * grad
```

3. Trust Region Methods:
Rather than modifying the Hessian directly, trust region methods constrain the Newton step to a region where the quadratic model is trusted:
minimize m_k(p) = f_k + ∇f_kᵀp + ½pᵀH_kp subject to ‖p‖ ≤ Δ_k
where Δ_k is the trust region radius. This subproblem always has a well-defined solution, even when H_k is indefinite. The trust region radius adapts based on how well the quadratic model predicts actual function decrease.
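The radius update is typically a simple ratio test comparing actual to predicted decrease. The sketch below uses the conventional thresholds 0.25 and 0.75 and shrink/expand factors, all of which are illustrative tuning choices rather than fixed constants.

```python
# Sketch of trust-region radius adaptation: compare the actual decrease to the
# decrease predicted by the quadratic model and adjust Δ accordingly.
# Thresholds (0.25, 0.75) and scale factors are conventional illustrative values.
def update_trust_region(f_old, f_new, predicted_decrease, step_norm, delta, delta_max=10.0):
    actual_decrease = f_old - f_new
    rho = actual_decrease / max(predicted_decrease, 1e-16)   # agreement ratio

    if rho < 0.25:
        delta = 0.25 * delta                    # poor model: shrink the region
    elif rho > 0.75 and abs(step_norm - delta) < 1e-12:
        delta = min(2.0 * delta, delta_max)     # good model, step hit boundary: expand
    # otherwise keep delta unchanged

    accept_step = rho > 0.0                     # accept only if the function decreased
    return delta, accept_step
```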
4. Line Search with Negative Curvature Direction:
When H_k has a negative eigenvalue, we can extract the corresponding eigenvector and use it as a descent direction. The function decreases along directions of negative curvature, potentially escaping saddle points.
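A minimal sketch of extracting such a direction from an indefinite Hessian; orienting the eigenvector so it is non-ascending is a standard but illustrative sign convention.

```python
# Sketch of extracting a negative-curvature direction from an indefinite Hessian:
# take the eigenvector of the most negative eigenvalue and orient it downhill.
import numpy as np

def negative_curvature_direction(grad, H):
    eigenvalues, eigenvectors = np.linalg.eigh(H)   # eigenvalues in ascending order
    if eigenvalues[0] >= 0:
        return None                                  # no negative curvature available
    v = eigenvectors[:, 0]                           # eigenvector of most negative eigenvalue
    # Orient the direction so it is non-ascending (v and -v are both eigenvectors)
    if np.dot(grad, v) > 0:
        v = -v
    return v
```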
For strictly convex problems (like linear regression with regularization), the Hessian is always positive definite. For neural networks and other non-convex models, indefinite Hessians are the norm. Production optimizers like L-BFGS inherently maintain positive definite approximations, sidestepping this issue entirely.
Pure Newton's method takes full steps in the Newton direction, which can lead to divergence from poor starting points. Damped Newton methods introduce a step size α_k ∈ (0, 1]:
x_{k+1} = x_k + α_k d_k
where d_k = -H_k⁻¹∇f_k is the Newton direction.
Backtracking line search (Armijo rule):
The most common approach finds the largest α ≤ 1 satisfying the sufficient decrease condition:
f(x_k + αd_k) ≤ f(x_k) + c₁ α ∇f_kᵀd_k
where c₁ ∈ (0, 1) is typically 10⁻⁴. This ensures the function actually decreases, providing global convergence guarantees while preserving fast local convergence.
```python
def backtracking_line_search(
    f: Callable[[np.ndarray], float],
    x: np.ndarray,
    d: np.ndarray,
    grad: np.ndarray,
    alpha_init: float = 1.0,
    c1: float = 1e-4,
    rho: float = 0.5,
    max_iter: int = 50
) -> float:
    """
    Backtracking line search satisfying Armijo condition.

    Finds step size α such that:
    f(x + α·d) ≤ f(x) + c1·α·(∇f)ᵀd

    Parameters:
    -----------
    f : callable
        Objective function
    x : np.ndarray
        Current point
    d : np.ndarray
        Search direction (should be descent direction)
    grad : np.ndarray
        Gradient at x
    alpha_init : float
        Initial step size (1.0 for Newton methods)
    c1 : float
        Armijo parameter (typically 1e-4)
    rho : float
        Backtracking factor (typically 0.5)
    max_iter : int
        Maximum backtracking iterations

    Returns:
    --------
    alpha : float
        Step size satisfying Armijo condition
    """
    alpha = alpha_init
    f_x = f(x)
    grad_d = np.dot(grad, d)  # Directional derivative

    if grad_d >= 0:
        raise ValueError("d is not a descent direction (∇fᵀd ≥ 0)")

    for _ in range(max_iter):
        if f(x + alpha * d) <= f_x + c1 * alpha * grad_d:
            return alpha
        alpha *= rho

    return alpha


def damped_newton_method(
    f: Callable[[np.ndarray], float],
    grad_f: Callable[[np.ndarray], np.ndarray],
    hess_f: Callable[[np.ndarray], np.ndarray],
    x0: np.ndarray,
    tol: float = 1e-8,
    max_iter: int = 100,
    verbose: bool = False
) -> Tuple[np.ndarray, list]:
    """
    Damped Newton's method with backtracking line search.

    Combines global convergence of line search with local
    quadratic convergence of Newton's method.
    """
    x = x0.copy().astype(float)
    history = []

    for k in range(max_iter):
        grad = grad_f(x)
        grad_norm = np.linalg.norm(grad)
        f_val = f(x)
        history.append((x.copy(), f_val, grad_norm))

        if verbose:
            print(f"Iter {k:3d}: f = {f_val:.6e}, ||∇f|| = {grad_norm:.6e}")

        if grad_norm < tol:
            if verbose:
                print(f"Converged in {k} iterations")
            break

        # Compute Newton direction with regularization
        H = hess_f(x)
        d = modified_newton_direction(grad, H)

        # Line search for step size
        alpha = backtracking_line_search(f, x, d, grad)

        if verbose and alpha < 1.0:
            print(f"  Step size reduced to α = {alpha:.4f}")

        # Damped Newton update
        x = x + alpha * d

    return x, history
```

Convergence properties of damped Newton:
With appropriate line search: the method converges globally to a stationary point under mild assumptions, and near a strong local minimum the full step α = 1 is eventually accepted, recovering the local quadratic convergence rate.
This combination makes damped Newton practical for real-world optimization where initialization quality cannot be guaranteed.
Newton's method's superior convergence rate comes at significant computational cost. Understanding when this tradeoff is favorable requires careful complexity analysis.
Per-iteration costs:
| Operation | Cost (n parameters) | Dominant When |
|---|---|---|
| Gradient computation | O(n) to O(n·m) | Small n, large data (m samples) |
| Hessian computation | O(n²) to O(n²·m) | Moderate n |
| Hessian storage | O(n²) memory | Always for dense Hessian |
| Linear system solve | O(n³) time | Large n (thousands) |
| Total per iteration | O(n³) or O(n²·m) | Depends on problem structure |
Comparison with gradient descent:
For an optimization problem with n parameters, the total cost is (number of iterations) × (cost per iteration): gradient descent pays roughly O(n) per iteration over many iterations, while Newton's method pays O(n³) (or O(n²·m)) per iteration over far fewer iterations.
The crossover point depends on: the dimension n, the conditioning of the problem, the accuracy required, and how expensive the Hessian is to form and factor.
When Newton's method wins: problems with small-to-moderate n, high accuracy requirements, or severe ill-conditioning, where the Hessian automatically corrects for poor scaling.
Neural networks typically have n > 10⁶ parameters, making O(n²) Hessian storage alone prohibitive. This explains why deep learning universally uses first-order methods (SGD, Adam). Quasi-Newton methods like L-BFGS bridge the gap by approximating second-order information with O(n) storage, as we'll see in the next page.
Despite its cubic complexity, Newton's method remains valuable in specific machine learning contexts where its advantages outweigh computational costs.
Logistic Regression (IRLS):
Logistic regression is typically trained using Iteratively Reweighted Least Squares (IRLS), which is exactly Newton's method for maximum likelihood estimation. For N samples with d features, each iteration costs O(Nd) for the gradient, O(Nd²) to form the Hessian XᵀWX, and O(d³) to solve the Newton system.
For d << N (typical), this is highly efficient.
```python
def logistic_regression_newton(
    X: np.ndarray,
    y: np.ndarray,
    max_iter: int = 20,
    tol: float = 1e-8
) -> np.ndarray:
    """
    Logistic regression using Newton's method (IRLS).

    For binary classification:
    P(y=1|x) = σ(wᵀx) = 1/(1+exp(-wᵀx))

    Parameters:
    -----------
    X : np.ndarray
        Feature matrix (N, d)
    y : np.ndarray
        Binary labels (N,) with values in {0, 1}
    max_iter : int
        Maximum Newton iterations
    tol : float
        Convergence tolerance on weight change

    Returns:
    --------
    w : np.ndarray
        Optimal weights (d,)
    """
    N, d = X.shape
    w = np.zeros(d)  # Initialize weights

    for iteration in range(max_iter):
        # Compute predictions
        z = X @ w
        p = 1 / (1 + np.exp(-z))  # Sigmoid

        # Gradient of negative log-likelihood
        # ∇L = Xᵀ(p - y)
        gradient = X.T @ (p - y)

        # Hessian of negative log-likelihood
        # H = Xᵀ W X, where W = diag(p * (1-p))
        W = p * (1 - p)  # Diagonal weights

        # Efficient Hessian assembly without forming diag(W) explicitly
        XtW = X.T * W    # Broadcasting: (d, N) * (N,) = (d, N)
        H = XtW @ X      # (d, N) @ (N, d) = (d, d)

        # Add small regularization for numerical stability
        H += 1e-8 * np.eye(d)

        # Newton direction: solve H @ delta = -gradient
        delta = np.linalg.solve(H, -gradient)

        # Update weights
        w_new = w + delta

        # Check convergence
        if np.linalg.norm(delta) < tol:
            print(f"Converged in {iteration + 1} iterations")
            return w_new

        w = w_new

    print(f"Max iterations ({max_iter}) reached")
    return w


# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification

    # Generate synthetic binary classification data
    X, y = make_classification(
        n_samples=1000, n_features=20, n_informative=10,
        n_redundant=5, random_state=42
    )

    w_optimal = logistic_regression_newton(X, y)

    # Check accuracy
    predictions = (X @ w_optimal > 0).astype(int)
    accuracy = np.mean(predictions == y)
    print(f"Training accuracy: {accuracy:.2%}")
```

Gaussian Process Posterior:
GP inference requires computing K⁻¹y, where K is the kernel matrix. Newton's method efficiently optimizes hyperparameters by leveraging closed-form gradients and Hessians of the marginal likelihood.
Natural Gradient Methods:
In probabilistic models, the natural gradient preconditions the standard gradient with the Fisher information matrix (expected Hessian under the current distribution). This achieves approximate Newton-like adaptation without computing the true Hessian.
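Schematically, the natural-gradient update replaces the Hessian with the Fisher matrix F. In the sketch below, F is assumed to be supplied by the model, and the damping term is an illustrative addition for numerical stability.

```python
# Schematic natural-gradient step: precondition the gradient with the Fisher
# information matrix F instead of the Hessian. F is assumed to be supplied by
# the model; the damping term is an illustrative stabilizer.
import numpy as np

def natural_gradient_step(params, grad, fisher, lr=0.1, damping=1e-4):
    F_damped = fisher + damping * np.eye(len(params))
    delta = np.linalg.solve(F_damped, grad)    # F⁻¹ ∇f, the natural gradient
    return params - lr * delta
```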
Second-order optimization for neural networks:
Recent advances approximate the Hessian efficiently, for example Hessian-free optimization (which uses Hessian-vector products with conjugate gradient) and Kronecker-factored curvature approximations such as K-FAC.
These methods achieve better convergence than SGD for certain problems while maintaining tractable complexity.
For least-squares problems f(x) = ½‖r(x)‖², the Gauss-Newton method approximates the Hessian as H ≈ JᵀJ (ignoring second derivatives of residuals). This simplification is exact when residuals are small near the optimum and forms the basis for Levenberg-Marquardt and trust-region methods widely used in nonlinear regression.
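A brief sketch of one Gauss-Newton step (with λ > 0, a Levenberg-Marquardt step) for f(x) = ½‖r(x)‖²; the residual function and its Jacobian are assumed to be supplied by the caller.

```python
# Sketch of one Gauss-Newton / Levenberg-Marquardt step for f(x) = 0.5 * ||r(x)||^2.
# The Hessian is approximated as JᵀJ; λ > 0 gives the Levenberg-Marquardt damping.
# The residual function r and its Jacobian J are assumed to be provided.
import numpy as np

def gauss_newton_step(x, residual_fn, jacobian_fn, lam=0.0):
    r = residual_fn(x)                      # residual vector, shape (m,)
    J = jacobian_fn(x)                      # Jacobian, shape (m, n)
    A = J.T @ J + lam * np.eye(J.shape[1])  # approximate (damped) Hessian
    g = J.T @ r                             # gradient of 0.5 * ||r||^2
    return x - np.linalg.solve(A, g)        # solve (JᵀJ + λI) d = -Jᵀr
```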
Newton's method represents a fundamental paradigm in optimization—exploiting second-order information to achieve rapid convergence. While its computational cost limits direct application to large-scale ML, it provides the theoretical foundation for many practical algorithms.
What's next:
The next page explores Quasi-Newton methods—especially BFGS and L-BFGS—which achieve Newton-like convergence without computing the Hessian explicitly. These methods approximate the inverse Hessian using only gradient information, making second-order optimization practical for problems with thousands of parameters.
You now understand Newton's method: its derivation from Taylor expansion, quadratic convergence properties, implementation with line search and regularization, and application to machine learning problems. This foundation prepares you for quasi-Newton methods that scale these ideas to larger problems.