Newton's method achieves quadratic convergence but requires computing and inverting the Hessian—O(n²) storage and O(n³) computation per iteration. Gradient descent requires only O(n) storage and computation but converges slowly. Quasi-Newton methods bridge this gap, achieving superlinear convergence (faster than linear, approaching quadratic) while requiring only gradient information and O(n²) storage (or O(n) for limited-memory variants).
The key insight is both simple and profound: rather than computing the Hessian at each iteration, we can approximate and update it using gradient differences observed during optimization. The Hessian encodes second-order information, and gradients at consecutive points implicitly contain this information—we just need to extract it cleverly.
By the end of this page, you will understand the secant condition that underlies quasi-Newton updates, derive the BFGS update formula, appreciate why BFGS maintains positive definiteness, implement L-BFGS for large-scale optimization, and critically evaluate when quasi-Newton methods outperform alternatives.
Quasi-Newton methods are built on a single fundamental requirement: the Hessian approximation should satisfy the secant equation (also called the quasi-Newton condition).
Derivation from first principles:
For a smooth function f(x), consider the Taylor expansion of the gradient around x_{k+1}:
∇f(x_k) ≈ ∇f(x_{k+1}) + ∇²f(x_{k+1})(x_k - x_{k+1})
Rearranging:
∇²f(x_{k+1})(x_{k+1} - x_k) ≈ ∇f(x_{k+1}) - ∇f(x_k)
Defining the step s_k = x_{k+1} - x_k and the gradient difference y_k = ∇f(x_{k+1}) - ∇f(x_k), we obtain the secant equation:
B_{k+1} s_k = y_k
where B_{k+1} is our approximation to the Hessian H(x_{k+1}).
The secant condition says: 'Our Hessian approximation B, when applied to the step we just took, should produce the gradient change we observed.' This is a consistency requirement—the approximation should at least explain recent observations.
The inverse Hessian variant:
Since Newton's method uses H⁻¹∇f, it's often more convenient to maintain an approximation to the inverse Hessian. Let H_k ≈ B_k⁻¹. The secant equation becomes:
H_{k+1} y_k = s_k
This form avoids explicitly inverting B_{k+1} at each iteration.
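Before moving on, it helps to see the secant relation concretely. For a quadratic f(x) = ½xᵀAx the gradient is linear, so the relation holds exactly: y_k = A s_k. The short sketch below (with an arbitrary made-up matrix A) verifies both the direct and inverse forms numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random symmetric positive definite "Hessian" for the quadratic f(x) = 0.5 x^T A x
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)

x_k = rng.standard_normal(5)
x_next = rng.standard_normal(5)

grad = lambda x: A @ x          # ∇f(x) = A x for this quadratic
s = x_next - x_k                # step s_k
y = grad(x_next) - grad(x_k)    # gradient difference y_k

# Secant equation B s = y holds exactly with B = A ...
assert np.allclose(A @ s, y)
# ... and the inverse form H y = s holds with H = A^{-1}
assert np.allclose(np.linalg.solve(A, y), s)
print("Secant equation verified for the quadratic model.")
```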
Underdetermination and the update problem:
The secant equation provides only n constraints (one vector equation), but the approximation has n² entries (or n(n+1)/2 if we require symmetry). With infinitely many matrices satisfying the secant condition, how do we choose?
Different quasi-Newton methods correspond to different ways of resolving this underdetermination; BFGS, described next, makes the choice that has proved most robust in practice.
BFGS is the most widely used quasi-Newton method, prized for its excellent numerical behavior and self-correcting properties. The update formula can be derived by solving an optimization problem over positive definite matrices.
Derivation (for the inverse Hessian approximation H):
We seek H_{k+1} that satisfies the secant equation H_{k+1} y_k = s_k, is symmetric, and stays as close as possible to the current approximation H_k, with closeness measured in a weighted Frobenius norm.
The solution to this constrained optimization problem is the BFGS update:
H_{k+1} = (I - ρ_k s_k y_kᵀ) H_k (I - ρ_k y_k s_kᵀ) + ρ_k s_k s_kᵀ
where ρ_k = 1 / (y_kᵀ s_k)
This formula is a rank-2 update: the difference H_{k+1} - H_k is a matrix of rank at most 2.
```python
import numpy as np
from typing import Tuple


def bfgs_update(
    H: np.ndarray,
    s: np.ndarray,
    y: np.ndarray,
    eps: float = 1e-10
) -> np.ndarray:
    """
    Compute BFGS update of inverse Hessian approximation.

    H_{k+1} = (I - ρ·s·yᵀ) H_k (I - ρ·y·sᵀ) + ρ·s·sᵀ

    Parameters:
    -----------
    H : np.ndarray
        Current inverse Hessian approximation (n, n)
    s : np.ndarray
        Step vector: s = x_{k+1} - x_k (n,)
    y : np.ndarray
        Gradient difference: y = ∇f_{k+1} - ∇f_k (n,)
    eps : float
        Small constant for numerical stability

    Returns:
    --------
    H_new : np.ndarray
        Updated inverse Hessian approximation (n, n)
    """
    ys = np.dot(y, s)

    # Curvature condition check
    if ys < eps:
        # Skip update if curvature condition violated
        return H

    rho = 1.0 / ys
    n = len(s)
    I = np.eye(n)

    # BFGS formula: H = (I - ρsyᵀ) H (I - ρysᵀ) + ρssᵀ
    V = I - rho * np.outer(s, y)
    H_new = V @ H @ V.T + rho * np.outer(s, s)

    return H_new


def bfgs_direct_hessian_update(
    B: np.ndarray,
    s: np.ndarray,
    y: np.ndarray,
    eps: float = 1e-10
) -> np.ndarray:
    """
    Direct BFGS update for Hessian approximation B.

    B_{k+1} = B_k - (B_k s sᵀ B_k)/(sᵀ B_k s) + (y yᵀ)/(yᵀ s)

    This is the Sherman-Morrison-Woodbury dual of the inverse update.
    """
    ys = np.dot(y, s)
    if ys < eps:
        return B

    Bs = B @ s
    sBs = np.dot(s, Bs)
    if abs(sBs) < eps:
        return B

    B_new = B - np.outer(Bs, Bs) / sBs + np.outer(y, y) / ys
    return B_new
```

The curvature condition:
For the BFGS update to maintain positive definiteness, we require:
y_kᵀ s_k > 0 (curvature condition)
This condition ensures ρ_k is well-defined and positive. Geometrically, it requires that moving in direction s_k increases the directional derivative—the function has positive curvature along the step direction.
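As a quick numerical check, the sketch below feeds arbitrary made-up vectors satisfying yᵀs > 0 into the bfgs_update function from the listing above and confirms that the secant equation holds and positive definiteness is preserved.

```python
import numpy as np

# Assumes bfgs_update from the listing above; the vectors here are arbitrary test data.
rng = np.random.default_rng(1)
n = 4
H = np.eye(n)                      # start from the identity
s = rng.standard_normal(n)
y = rng.standard_normal(n)
y = y if np.dot(y, s) > 0 else -y  # flip sign so the curvature condition yᵀs > 0 holds

H_new = bfgs_update(H, s, y)

eigvals = np.linalg.eigvalsh(H_new)
print("secant residual:", np.linalg.norm(H_new @ y - s))  # ~0: secant equation satisfied
print("min eigenvalue: ", eigvals.min())                  # > 0: positive definiteness preserved
```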
When is the curvature condition satisfied? It holds automatically whenever f is strictly convex; for general smooth functions it is guaranteed by a line search that satisfies the Wolfe conditions, discussed below.
Self-correcting property:
BFGS has a remarkable self-correcting property: even starting from a poor initial approximation (like H₀ = I), the approximation improves as gradients are observed. Historical gradient information accumulates in H, improving the approximation over time.
DFP was the first quasi-Newton method (1959-1963) but BFGS (1970) largely superseded it. BFGS is more stable numerically—DFP can degrade when line searches are imprecise, while BFGS remains robust. The Broyden class parameterizes the family between them: φ=0 gives BFGS, φ=1 gives DFP.
The full BFGS algorithm combines inverse Hessian updates with a line search satisfying the Wolfe conditions. These conditions ensure both sufficient decrease and sufficient curvature—critical for guaranteeing the curvature condition y_kᵀ s_k > 0.
Wolfe Conditions:
Given a descent direction d_k, the step size α must satisfy:
1. Sufficient decrease (Armijo): f(x_k + α d_k) ≤ f(x_k) + c₁ α ∇f(x_k)ᵀ d_k
2. Curvature condition: ∇f(x_k + α d_k)ᵀ d_k ≥ c₂ ∇f(x_k)ᵀ d_k
with 0 < c₁ < c₂ < 1. The strong Wolfe variant bounds |∇f(x_k + α d_k)ᵀ d_k| ≤ c₂ |∇f(x_k)ᵀ d_k|, which is what the implementation below checks. Typical values: c₁ = 10⁻⁴, c₂ = 0.9.
The curvature condition ensures we don't stop too early, guaranteeing sufficient curvature for the BFGS update.
```python
import numpy as np
from typing import Callable, Tuple, Optional


def wolfe_line_search(
    f: Callable[[np.ndarray], float],
    grad_f: Callable[[np.ndarray], np.ndarray],
    x: np.ndarray,
    d: np.ndarray,
    grad: np.ndarray,
    c1: float = 1e-4,
    c2: float = 0.9,
    alpha_init: float = 1.0,
    max_iter: int = 25
) -> Tuple[float, np.ndarray]:
    """
    Line search satisfying strong Wolfe conditions.

    Returns step size α and gradient at new point.
    Critical for BFGS to maintain positive definiteness.
    """
    f_0 = f(x)
    g_0 = np.dot(grad, d)  # Directional derivative at α=0

    alpha_lo, alpha_hi = 0.0, np.inf
    alpha = alpha_init

    for _ in range(max_iter):
        x_new = x + alpha * d
        f_new = f(x_new)
        grad_new = grad_f(x_new)
        g_new = np.dot(grad_new, d)

        # Check Armijo condition
        if f_new > f_0 + c1 * alpha * g_0:
            alpha_hi = alpha
            alpha = 0.5 * (alpha_lo + alpha_hi)
            continue

        # Check curvature condition (strong Wolfe)
        if abs(g_new) > c2 * abs(g_0):
            if g_new * g_0 < 0:
                alpha_hi = alpha
            else:
                alpha_lo = alpha
            if alpha_hi < np.inf:
                alpha = 0.5 * (alpha_lo + alpha_hi)
            else:
                alpha = 2 * alpha
            continue

        # Both conditions satisfied
        return alpha, grad_new

    return alpha, grad_f(x + alpha * d)


def bfgs_optimize(
    f: Callable[[np.ndarray], float],
    grad_f: Callable[[np.ndarray], np.ndarray],
    x0: np.ndarray,
    tol: float = 1e-8,
    max_iter: int = 1000,
    verbose: bool = False
) -> Tuple[np.ndarray, list]:
    """
    BFGS quasi-Newton optimization with Wolfe line search.

    Parameters:
    -----------
    f : callable
        Objective function f(x) -> scalar
    grad_f : callable
        Gradient function ∇f(x) -> n-vector
    x0 : np.ndarray
        Starting point (n,)
    tol : float
        Gradient norm tolerance for convergence
    max_iter : int
        Maximum number of iterations
    verbose : bool
        Print iteration information

    Returns:
    --------
    x : np.ndarray
        Approximate minimizer
    history : list
        List of (x, f(x), ||grad||) at each iteration
    """
    n = len(x0)
    x = x0.copy().astype(float)
    H = np.eye(n)  # Initial inverse Hessian: identity
    grad = grad_f(x)
    history = [(x.copy(), f(x), np.linalg.norm(grad))]

    for k in range(max_iter):
        grad_norm = np.linalg.norm(grad)
        if verbose and k % 10 == 0:
            print(f"Iter {k:4d}: f = {f(x):.6e}, ||∇f|| = {grad_norm:.6e}")

        # Check convergence
        if grad_norm < tol:
            if verbose:
                print(f"Converged in {k} iterations")
            break

        # Compute search direction
        d = -H @ grad

        # Line search (Wolfe conditions)
        alpha, grad_new = wolfe_line_search(f, grad_f, x, d, grad)

        # Compute step and gradient difference
        s = alpha * d
        y = grad_new - grad

        # Update position and gradient
        x = x + s
        grad = grad_new

        # BFGS update
        ys = np.dot(y, s)
        if ys > 1e-10:  # Curvature condition
            rho = 1.0 / ys
            I = np.eye(n)
            V = I - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)

        history.append((x.copy(), f(x), np.linalg.norm(grad)))

    return x, history


# Example: Rosenbrock function
def rosenbrock(x):
    """Rosenbrock function (extended to n dimensions)"""
    return sum(100 * (x[i+1] - x[i]**2)**2 + (1 - x[i])**2
               for i in range(len(x) - 1))


def rosenbrock_grad(x):
    """Gradient of Rosenbrock function"""
    n = len(x)
    grad = np.zeros(n)
    for i in range(n - 1):
        grad[i] += -400 * x[i] * (x[i+1] - x[i]**2) - 2 * (1 - x[i])
        grad[i+1] += 200 * (x[i+1] - x[i]**2)
    return grad


if __name__ == "__main__":
    # 10-dimensional Rosenbrock
    x0 = np.zeros(10)
    x_opt, history = bfgs_optimize(
        rosenbrock, rosenbrock_grad, x0, verbose=True
    )
    print(f"\nOptimum: {x_opt}")
    print(f"Optimal value: {rosenbrock(x_opt):.2e}")
    print(f"Total iterations: {len(history)}")
```

Starting with H₀ = I (identity) works but can be improved.
A common trick: after the first iteration, rescale H₀ = (y₀ᵀs₀)/(y₀ᵀy₀) × I. This scales the initial approximation to match the observed curvature along the first step direction.
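A minimal sketch of that rescaling (the function name and variables are illustrative, not from a library):

```python
import numpy as np

def rescaled_initial_inverse_hessian(s0: np.ndarray, y0: np.ndarray) -> np.ndarray:
    """Return H_0 = (y0·s0)/(y0·y0) * I, matching the curvature observed on the first step."""
    gamma = np.dot(y0, s0) / np.dot(y0, y0)
    return gamma * np.eye(len(s0))
```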
BFGS achieves superlinear convergence under reasonable conditions—faster than gradient descent's linear rate, though not quite Newton's quadratic rate.
Formal convergence result (Dennis-Moré conditions):
If f is twice continuously differentiable with Lipschitz continuous Hessian, and {x_k} converges to x* with H(x*) positive definite, then:
lim_{k→∞} ‖(H_k - H(x*)⁻¹)∇f_k‖ / ‖∇f_k‖ = 0
This implies superlinear convergence:
‖x_{k+1} - x*‖ / ‖x_k - x*‖ → 0 as k → ∞
The ratio of successive errors goes to zero: each iteration reduces the error by a larger and larger relative factor as the iterates approach the solution.
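You can observe this empirically. The sketch below reuses bfgs_optimize and rosenbrock from the listing above and prints the ratio of successive distances to the known minimizer x* = (1, ..., 1); the ratio should shrink toward zero in the final iterations.

```python
import numpy as np

# Assumes bfgs_optimize, rosenbrock, and rosenbrock_grad from the BFGS listing above.
x_star = np.ones(10)  # known minimizer of the Rosenbrock function
x_opt, history = bfgs_optimize(rosenbrock, rosenbrock_grad, np.zeros(10))

errors = [np.linalg.norm(x - x_star) for x, _, _ in history]
for k in range(len(errors) - 1):
    if errors[k] > 0:
        # Superlinear convergence: this ratio should tend toward 0 near the end
        print(f"iter {k:3d}  ||e_k|| = {errors[k]:.3e}  ratio = {errors[k+1] / errors[k]:.3f}")
```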
| Method | Rate Type | Typical iterations (n=100) | Typical iterations (n=1,000) | Typical iterations (n=10,000) |
|---|---|---|---|---|
| Gradient Descent | Linear O(κ log 1/ε) | ~10,000 | ~100,000 | ~1,000,000 |
| Conjugate Gradient | Superlinear | ~100 | ~1,000 | ~10,000 |
| BFGS | Superlinear | ~50-100 | ~100-200 | ~200-500 |
| L-BFGS (m=10) | Superlinear | ~75-150 | ~150-300 | ~300-800 |
| Newton | Quadratic | ~10 | ~10 | Infeasible (n³ cost) |
Global convergence:
With a line search satisfying the Wolfe conditions, BFGS is globally convergent on smooth convex functions that are bounded below. For general nonconvex functions it typically reaches a stationary point in practice, although a complete theoretical guarantee is still lacking.
The condition number independence:
Unlike gradient descent, whose convergence rate degrades with the condition number κ = λ_max/λ_min of the Hessian, BFGS's rate is largely independent of κ. The inverse Hessian approximation automatically adapts to different curvatures along different directions.
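The sketch below illustrates this on a made-up diagonal quadratic with condition number κ = 10⁴, reusing bfgs_optimize from above and comparing against fixed-step gradient descent.

```python
import numpy as np

# Assumes bfgs_optimize from the BFGS listing above.
d = np.logspace(0, 4, 50)               # eigenvalues from 1 to 1e4  ->  κ = 1e4
f = lambda x: 0.5 * np.dot(x, d * x)    # ill-conditioned diagonal quadratic
grad_f = lambda x: d * x

x0 = np.ones(50)

# BFGS: iteration count is nearly insensitive to κ
_, hist = bfgs_optimize(f, grad_f, x0, tol=1e-6)
print("BFGS iterations:            ", len(hist) - 1)

# Fixed-step gradient descent with step 1/λ_max: rate degrades with κ
x, k = x0.copy(), 0
while np.linalg.norm(grad_f(x)) > 1e-6 and k < 500_000:
    x -= (1.0 / d.max()) * grad_f(x)
    k += 1
print("Gradient descent iterations:", k)
```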
Why not quadratic convergence?
BFGS doesn't achieve Newton's quadratic rate because H_k only approximates H⁻¹—the approximation error prevents true quadratic steps. However, the approximation becomes increasingly accurate near the solution (Dennis-Moré conditions), explaining the superlinear acceleration.
For most ML optimization problems requiring moderate precision (tol ≈ 10⁻⁶), BFGS often converges in 10-100 iterations regardless of problem size. This makes it highly practical for problems where gradient evaluation dominates and O(n²) storage is affordable.
Standard BFGS requires O(n²) storage for the inverse Hessian approximation—prohibitive when n exceeds a few thousand. L-BFGS (Limited-memory BFGS) addresses this by storing only the last m pairs (s_k, y_k) and computing the search direction on-the-fly using a two-loop recursion.
Key insight:
We don't actually need the full matrix H_k. We only need the product H_k ∇f_k (the search direction). This product can be computed efficiently using just the stored vector pairs, without ever forming H_k explicitly.
Storage complexity:
Standard BFGS stores the dense n×n matrix H_k: O(n²) numbers. L-BFGS stores only the m most recent pairs (s_i, y_i): 2mn numbers, i.e. O(mn).
For n = 1,000,000 and m = 10 with single-precision (4-byte) floats: BFGS needs ~4 TB, while L-BFGS needs ~80 MB!
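The arithmetic behind those numbers, as a quick sanity check (assuming 4-byte single-precision floats):

```python
n, m = 1_000_000, 10
bytes_per_float = 4                      # single precision

full_bfgs = n * n * bytes_per_float      # dense n x n inverse Hessian approximation
lbfgs = 2 * m * n * bytes_per_float      # m pairs of n-vectors (s_i, y_i)

print(f"BFGS:   {full_bfgs / 1e12:.1f} TB")   # ~4.0 TB
print(f"L-BFGS: {lbfgs / 1e6:.0f} MB")        # ~80 MB
```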
```python
import numpy as np
from collections import deque
from typing import Callable, Tuple, List


def lbfgs_two_loop_recursion(
    grad: np.ndarray,
    s_history: List[np.ndarray],
    y_history: List[np.ndarray],
    rho_history: List[float],
    gamma: float = 1.0
) -> np.ndarray:
    """
    L-BFGS two-loop recursion to compute search direction.

    Computes H_k @ grad without forming H_k explicitly.
    Uses only the m most recent (s, y, ρ) triplets.

    Parameters:
    -----------
    grad : np.ndarray
        Current gradient (n,)
    s_history : list
        Recent s_i = x_{i+1} - x_i vectors
    y_history : list
        Recent y_i = ∇f_{i+1} - ∇f_i vectors
    rho_history : list
        Recent ρ_i = 1/(y_i·s_i) scalars
    gamma : float
        Scaling for initial H_0 = gamma * I

    Returns:
    --------
    d : np.ndarray
        Search direction -H_k @ grad (n,)
    """
    q = grad.copy()
    m = len(s_history)

    if m == 0:
        # No history: steepest descent
        return -gamma * grad

    # First loop: right-to-left
    alpha = np.zeros(m)
    for i in range(m - 1, -1, -1):
        alpha[i] = rho_history[i] * np.dot(s_history[i], q)
        q = q - alpha[i] * y_history[i]

    # Initial direction: H_0 @ q where H_0 = gamma * I
    r = gamma * q

    # Second loop: left-to-right
    for i in range(m):
        beta = rho_history[i] * np.dot(y_history[i], r)
        r = r + (alpha[i] - beta) * s_history[i]

    return -r


def lbfgs_optimize(
    f: Callable[[np.ndarray], float],
    grad_f: Callable[[np.ndarray], np.ndarray],
    x0: np.ndarray,
    m: int = 10,
    tol: float = 1e-8,
    max_iter: int = 1000,
    verbose: bool = False
) -> Tuple[np.ndarray, list]:
    """
    L-BFGS optimization with limited memory.

    Parameters:
    -----------
    f : callable
        Objective function f(x) -> scalar
    grad_f : callable
        Gradient function ∇f(x) -> n-vector
    x0 : np.ndarray
        Starting point (n,)
    m : int
        Number of (s, y) pairs to store (memory parameter)
    tol : float
        Gradient norm tolerance for convergence
    max_iter : int
        Maximum number of iterations
    verbose : bool
        Print iteration information

    Returns:
    --------
    x : np.ndarray
        Approximate minimizer
    history : list
        List of (x, f(x), ||grad||) at each iteration
    """
    x = x0.copy().astype(float)

    # History storage (using deque for efficient pop-left)
    s_history = deque(maxlen=m)
    y_history = deque(maxlen=m)
    rho_history = deque(maxlen=m)

    grad = grad_f(x)
    history = [(x.copy(), f(x), np.linalg.norm(grad))]

    for k in range(max_iter):
        grad_norm = np.linalg.norm(grad)
        if verbose and k % 20 == 0:
            print(f"Iter {k:4d}: f = {f(x):.6e}, ||∇f|| = {grad_norm:.6e}")

        # Check convergence
        if grad_norm < tol:
            if verbose:
                print(f"Converged in {k} iterations")
            break

        # Compute initial H_0 scaling
        if len(y_history) > 0:
            # Barzilai-Borwein scaling
            gamma = (np.dot(s_history[-1], y_history[-1])
                     / np.dot(y_history[-1], y_history[-1]))
        else:
            gamma = 1.0 / grad_norm  # Scale by inverse gradient norm initially

        # Compute search direction using two-loop recursion
        d = lbfgs_two_loop_recursion(
            grad, list(s_history), list(y_history), list(rho_history), gamma
        )

        # Line search (simplified backtracking for brevity)
        alpha = 1.0
        c1 = 1e-4
        f_x = f(x)
        grad_d = np.dot(grad, d)
        for _ in range(20):
            if f(x + alpha * d) <= f_x + c1 * alpha * grad_d:
                break
            alpha *= 0.5

        # Update
        x_new = x + alpha * d
        grad_new = grad_f(x_new)

        s = x_new - x
        y = grad_new - grad
        ys = np.dot(y, s)

        # Store new (s, y, ρ) if curvature condition satisfied
        if ys > 1e-10:
            s_history.append(s)
            y_history.append(y)
            rho_history.append(1.0 / ys)

        x = x_new
        grad = grad_new
        history.append((x.copy(), f(x), np.linalg.norm(grad)))

    return x, history


if __name__ == "__main__":
    # Test on high-dimensional Rosenbrock
    n = 1000

    def rosenbrock_nd(x):
        return sum(100 * (x[i+1] - x[i]**2)**2 + (1 - x[i])**2
                   for i in range(len(x) - 1))

    def rosenbrock_grad_nd(x):
        n = len(x)
        grad = np.zeros(n)
        for i in range(n - 1):
            grad[i] += -400 * x[i] * (x[i+1] - x[i]**2) - 2 * (1 - x[i])
            grad[i+1] += 200 * (x[i+1] - x[i]**2)
        return grad

    x0 = np.zeros(n)
    x_opt, history = lbfgs_optimize(
        rosenbrock_nd, rosenbrock_grad_nd, x0,
        m=20, verbose=True, max_iter=500
    )
    print(f"\nFinal gradient norm: {np.linalg.norm(rosenbrock_grad_nd(x_opt)):.2e}")
    print(f"Optimal value: {rosenbrock_nd(x_opt):.2e}")
    print(f"Total iterations: {len(history)}")
```

Typical values are m = 3-20. Larger m means better Hessian approximation but more storage and computation per iteration. For most problems, m = 10-20 captures enough curvature information. Very large m (>50) rarely helps and can even hurt on non-quadratic functions.
Production L-BFGS implementations include several refinements beyond the basic two-loop recursion:
1. Initial Hessian Scaling (γ_k):
The scaling γ_k = (s_{k-1}ᵀ y_{k-1}) / (y_{k-1}ᵀ y_{k-1}) is crucial. It approximates the average curvature along the most recent direction, providing a much better initial approximation than γ = 1.
2. Wolfe Line Search:
Full Wolfe conditions (not just Armijo) are essential for guaranteeing the curvature condition yᵀs > 0 on non-convex functions. The strong Wolfe conditions also prevent cycling.
3. Gradient Scaling:
On the first iteration (empty history), scaling the initial direction as d = -grad/‖grad‖ or d = -γ·grad with γ based on problem-specific knowledge prevents excessively large initial steps.
4. Handling Curvature Condition Violations:
When yᵀs ≤ 0 (which can happen with an imprecise line search on non-convex functions), common options are to skip the update (keep the existing history), to apply Powell damping (replace y with a damped combination of y and B s so the curvature condition holds), or to discard the stored history and restart.
5. Memory Management:
Using circular buffers (deque in Python) with maxlen=m automatically evicts the oldest pairs, maintaining O(1) insertion. The vectors s_i and y_i should be stored contiguously for cache efficiency.
6. Parallel Gradient Computation:
In ML applications, gradient computation typically dominates. L-BFGS can parallelize gradient evaluation across data batches while the two-loop recursion runs serially (it's inherently sequential but fast: O(mn) flops).
| Parameter | Typical Value | Effect of Increase | When to Change |
|---|---|---|---|
| m (memory) | 10-20 | Better approximation, more memory | Increase for ill-conditioned problems |
| c₁ (Armijo) | 10⁻⁴ | Stricter decrease requirement | Rarely changed |
| c₂ (curvature) | 0.9 | Looser curvature requirement | Decrease (0.1-0.5) for non-convex |
| α_init | 1.0 | Larger initial step size | Reduce if steps overshoot |
| max_iter | 1000 | More iterations allowed | Increase for large-scale problems |
If the objective function changes (e.g., when regularization weight changes), the stored (s, y) pairs become invalid—they encode curvature of the old function. Clear the history when the function changes significantly, or use damped updates that gradually forget old information.
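A small sketch of this pattern (made-up least-squares data, reusing lbfgs_optimize from the listing above): each new regularization weight gets a fresh, empty curvature history, while the solution itself is warm-started from the previous solve.

```python
import numpy as np

# Assumes lbfgs_optimize from the L-BFGS listing above. Data are made up for illustration.
rng = np.random.default_rng(2)
A, b = rng.standard_normal((200, 50)), rng.standard_normal(200)

x = np.zeros(50)
for reg in [1.0, 0.1, 0.01]:
    # Each call starts with an empty (s, y) history: curvature pairs gathered under the
    # old regularization weight do not describe the new objective.
    f = lambda x, reg=reg: 0.5 * np.sum((A @ x - b) ** 2) + 0.5 * reg * np.dot(x, x)
    grad_f = lambda x, reg=reg: A.T @ (A @ x - b) + reg * x
    x, hist = lbfgs_optimize(f, grad_f, x, m=10)   # x warm-started from the previous solve
    print(f"reg={reg:5.2f}: {len(hist) - 1:3d} iterations, f = {f(x):.4f}")
```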
L-BFGS is the workhorse optimizer for many machine learning applications, particularly those with smooth objectives and deterministic gradients.
Classical ML (L-BFGS is often the best choice): logistic regression, conditional random fields, Gaussian process hyperparameter tuning, and other smooth objectives with deterministic gradients, as in the examples below.
```python
import numpy as np
from scipy.optimize import minimize


# Example 1: Logistic Regression with L-BFGS
def logistic_loss_and_grad(w, X, y, reg=0.01):
    """
    Compute L2-regularized logistic loss and gradient.

    Loss = Σ log(1 + exp(-y_i * wᵀx_i)) + (reg/2)||w||²
    """
    N = len(y)
    z = y * (X @ w)

    # Numerically stable log-sum-exp
    loss = np.sum(np.logaddexp(0, -z)) / N + 0.5 * reg * np.dot(w, w)

    # Gradient
    sigmoid_neg = 1 / (1 + np.exp(z))  # σ(-z) = 1 - σ(z)
    grad = -X.T @ (y * sigmoid_neg) / N + reg * w

    return loss, grad


def train_logistic_lbfgs(X, y, reg=0.01):
    """Train logistic regression using scipy's L-BFGS-B."""
    n_features = X.shape[1]
    w0 = np.zeros(n_features)

    result = minimize(
        logistic_loss_and_grad,
        w0,
        args=(X, y, reg),
        method='L-BFGS-B',
        jac=True,  # Function returns (loss, gradient)
        options={'maxiter': 1000, 'disp': True}
    )
    return result.x


# Example 2: Neural Network training with L-BFGS
def mlp_forward_backward(params, X, y, hidden_size=100):
    """
    Two-layer MLP: X -> hidden -> softmax output
    Returns loss and flattened gradient.
    """
    n_features = X.shape[1]
    n_classes = np.max(y) + 1

    # Unpack parameters
    W1_size = n_features * hidden_size
    b1_size = hidden_size
    W2_size = hidden_size * n_classes

    W1 = params[:W1_size].reshape(n_features, hidden_size)
    b1 = params[W1_size:W1_size + b1_size]
    W2 = params[W1_size + b1_size:W1_size + b1_size + W2_size].reshape(hidden_size, n_classes)
    b2 = params[W1_size + b1_size + W2_size:]

    # Forward pass
    h = np.maximum(0, X @ W1 + b1)  # ReLU
    logits = h @ W2 + b2

    # Softmax and cross-entropy
    logits_max = logits.max(axis=1, keepdims=True)
    exp_logits = np.exp(logits - logits_max)
    probs = exp_logits / exp_logits.sum(axis=1, keepdims=True)

    N = len(y)
    loss = -np.log(probs[np.arange(N), y] + 1e-8).mean()

    # Backward pass
    dlogits = probs.copy()
    dlogits[np.arange(N), y] -= 1
    dlogits /= N

    dW2 = h.T @ dlogits
    db2 = dlogits.sum(axis=0)
    dh = dlogits @ W2.T
    dh[h <= 0] = 0  # ReLU gradient
    dW1 = X.T @ dh
    db1 = dh.sum(axis=0)

    # Pack gradients
    grad = np.concatenate([dW1.ravel(), db1, dW2.ravel(), db2])

    return loss, grad


# Example 3: Using PyTorch's L-BFGS
"""
import torch
from torch.optim import LBFGS

model = MyModel()
optimizer = LBFGS(
    model.parameters(),
    lr=1.0,
    max_iter=20,
    history_size=10,
    line_search_fn='strong_wolfe'
)

def closure():
    optimizer.zero_grad()
    output = model(input)
    loss = criterion(output, target)
    loss.backward()
    return loss

# L-BFGS requires calling optimizer.step(closure) with a closure
for epoch in range(num_epochs):
    optimizer.step(closure)
"""
```

Why L-BFGS struggles with deep learning:
Stochastic gradients: Standard L-BFGS assumes exact gradients. Mini-batch noise violates the curvature condition, corrupting the Hessian approximation.
Non-smooth activations: ReLU, max-pooling, and dropout create non-differentiable points where Taylor approximations break down.
Massive scale: even the O(mn) history is heavy at deep-learning sizes; for m = 10 and n = 10⁸ parameters, the stored pairs alone occupy several gigabytes and the two-loop recursion strains memory bandwidth.
Saddle points: BFGS maintains a positive definite approximation, so it cannot exploit negative curvature; near saddle points the curvature condition frequently fails and progress can stall.
Stochastic L-BFGS attempts:
Recent research has developed stochastic variants, for example online L-BFGS (oLBFGS), stochastic quasi-Newton methods that build curvature pairs from subsampled Hessian-vector products, and multi-batch L-BFGS with overlapping samples.
These remain active research areas and are not yet production-ready for large neural networks.
Choosing between Newton, BFGS, L-BFGS, and first-order methods depends on problem characteristics. Here's a decision framework:
| Characteristic | Best Method | Rationale |
|---|---|---|
| n < 500, cheap Hessian | Newton | Quadratic convergence outweighs O(n³) cost |
| 500 < n < 5000, deterministic gradients | BFGS | Good convergence, O(n²) affordable |
| 5000 < n < 10⁶, deterministic gradients | L-BFGS | O(mn) storage, superlinear convergence |
| n > 10⁶, deterministic gradients | L-BFGS or Adam | Depends on structure and accuracy needs |
| Any n, stochastic gradients | Adam/SGD | Quasi-Newton corrupted by noise |
| Severely ill-conditioned | Newton/BFGS | First-order methods too slow |
| Sparse Hessian structure | Specialized Newton | Exploit structure for efficiency |
For classical ML (logistic regression, SVM, CRF, GP): start with L-BFGS. For deep learning: start with Adam. For very small problems: try Newton. Only switch if you have evidence the default choice is suboptimal for your specific problem.
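If it helps, the decision table can be condensed into a rough rule of thumb; the helper below is hypothetical and its thresholds are only the heuristics from the table, not hard rules.

```python
def suggest_optimizer(n: int, stochastic_gradients: bool, cheap_hessian: bool = False) -> str:
    """Rough optimizer suggestion mirroring the decision table above; thresholds are heuristics."""
    if stochastic_gradients:
        return "Adam/SGD"    # quasi-Newton curvature estimates are corrupted by gradient noise
    if n < 500 and cheap_hessian:
        return "Newton"      # quadratic convergence outweighs O(n^3) cost at this scale
    if n < 5000:
        return "BFGS"        # O(n^2) storage still affordable
    return "L-BFGS"          # O(mn) storage, superlinear convergence


print(suggest_optimizer(n=100_000, stochastic_gradients=False))  # -> "L-BFGS"
```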
Quasi-Newton methods represent the practical sweet spot between gradient descent's simplicity and Newton's power. BFGS and L-BFGS have become indispensable tools in the ML optimization toolkit.
What's next:
The next page explores Coordinate Descent—a fundamentally different approach that optimizes one coordinate (or block of coordinates) at a time. This strategy excels for problems with separable structure or when coordinate-wise updates admit closed-form solutions.
You now understand quasi-Newton methods: the secant condition, BFGS updates, L-BFGS memory efficiency, and practical application guidelines. This knowledge equips you to select and tune optimizers effectively for a wide range of machine learning problems.