Many machine learning objectives combine a smooth loss with a non-smooth regularizer:
minimize f(x) + g(x)
where f is differentiable (e.g., squared error) and g is convex but non-differentiable (e.g., L1 norm). Standard gradient descent stumbles on such objectives—the gradient of g doesn't exist everywhere, and subgradients lead to slow convergence.
Proximal methods provide an elegant solution. Rather than computing gradients of non-smooth terms, they replace gradient steps with proximal operators—a generalization that handles non-differentiability naturally while preserving convergence guarantees. The result is algorithms that are as simple as gradient descent but work seamlessly with sparsity-inducing regularizers.
By the end of this page, you will understand the proximal operator and its geometric interpretation, derive proximal operators for common regularizers, implement proximal gradient descent (ISTA) and its accelerated variant (FISTA), analyze convergence rates of proximal methods, and apply proximal algorithms to composite optimization problems in ML.
Definition:
For a convex function g: ℝⁿ → ℝ ∪ {+∞} and parameter λ > 0, the proximal operator is:
prox_{λg}(v) = argmin_x { g(x) + (1/2λ)‖x - v‖² }
This operator finds the point x that balances two goals: making g(x) small, and staying close to the reference point v (as measured by the quadratic penalty). The parameter λ controls this tradeoff—larger λ gives more weight to minimizing g, smaller λ keeps x closer to v.
Key properties:
- Well-defined: for convex g, the subproblem objective g(x) + (1/2λ)‖x - v‖² is strongly convex, so prox_{λg}(v) exists and is unique.
- Fixed points: x* minimizes g if and only if prox_{λg}(x*) = x*.
- Nonexpansive: ‖prox_{λg}(u) - prox_{λg}(v)‖ ≤ ‖u - v‖, which is what makes proximal iterations stable.
Think of prox_{λg}(v) as a 'denoising' operation. Starting from point v, we ask: where should we move to reduce g while not straying too far? For L1 norm (g = ‖·‖₁), this becomes soft-thresholding—small coordinates are set to zero, large coordinates are shrunk.
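As a quick sanity check of the definition (a minimal sketch assuming the one-dimensional case g(x) = |x| and that SciPy is available), the prox subproblem can be minimized numerically and compared against the soft-threshold formula:

```python
import numpy as np
from scipy.optimize import minimize_scalar

v, lam = 2.3, 0.8

# Solve argmin_x { |x| + (1/2λ)(x - v)² } numerically
numeric = minimize_scalar(lambda x: abs(x) + (x - v) ** 2 / (2 * lam)).x

# Closed-form soft-threshold: sign(v) * max(|v| - λ, 0)
closed_form = np.sign(v) * max(abs(v) - lam, 0)

print(numeric, closed_form)  # both ≈ 1.5
```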
Relationship to gradient descent:
For smooth g, the gradient descent step is:
x^{new} = x - λ∇g(x)
For small λ, the proximal step matches this gradient step to first order:
prox_{λg}(x) = x - λ∇g(x) + O(λ²)
As λ → 0, they coincide. But for non-smooth g (where ∇g doesn't exist), only the proximal form makes sense. The proximal operator is thus a generalization of gradient steps to non-differentiable functions.
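A small numerical illustration of this relationship, using the smooth choice g(x) = (1/2)‖x‖₂² whose proximal operator is v/(1 + λ) (see the table below): the gap between the proximal step and the gradient step shrinks like λ².

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5])
for lam in [0.5, 0.1, 0.01]:
    prox_step = x / (1 + lam)   # prox_{λg}(x) for g = ½‖x‖²
    grad_step = x - lam * x     # x - λ∇g(x), since ∇g(x) = x
    # The difference is x · λ²/(1+λ), i.e. O(λ²)
    print(lam, np.linalg.norm(prox_step - grad_step))
```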
Why 'proximal'?
The name comes from 'proximity'—we measure distance to v in the quadratic penalty. Other distance measures (Bregman divergences) lead to mirror descent and related methods.
For proximal methods to be practical, proximal operators must be efficiently computable. Fortunately, many regularizers used in ML have closed-form proximal operators.
1. L1 Norm (Lasso): g(x) = ‖x‖₁
prox_{λg}(v) = soft-threshold(v, λ) where:
S_λ(v)ᵢ = sign(vᵢ) · max(|vᵢ| - λ, 0)
This is component-wise soft-thresholding—the foundation of sparse optimization.
2. L2 Norm: g(x) = ‖x‖₂
prox_{λg}(v) = v · max(0, 1 - λ/‖v‖₂) (if v ≠ 0); 0 otherwise
This is block soft-thresholding—the vector is shrunk toward zero while maintaining its direction.
3. Indicator Function (Constraints): g(x) = I_C(x)
I_C(x) = 0 if x ∈ C; +∞ otherwise
prox_{λg}(v) = Proj_C(v) = argmin_{x ∈ C} ‖x - v‖²
The proximal operator of an indicator function is exactly projection onto the constraint set, so constrained optimization fits directly into the proximal framework.
```python
import numpy as np

# ============================================
# Proximal Operators for Common Regularizers
# ============================================

def prox_l1(v: np.ndarray, lambda_: float) -> np.ndarray:
    """
    Proximal operator for L1 norm: g(x) = λ||x||₁

    This is soft-thresholding: S_λ(v)_i = sign(v_i) max(|v_i| - λ, 0)
    Used in: Lasso, sparse coding, compressed sensing
    """
    return np.sign(v) * np.maximum(np.abs(v) - lambda_, 0)


def prox_l2(v: np.ndarray, lambda_: float) -> np.ndarray:
    """
    Proximal operator for L2 norm: g(x) = λ||x||₂

    This is block soft-thresholding: v * max(0, 1 - λ/||v||)
    Used in: Group Lasso, nuclear norm minimization
    """
    v_norm = np.linalg.norm(v)
    if v_norm == 0:
        return np.zeros_like(v)
    return v * max(0, 1 - lambda_ / v_norm)


def prox_l2_squared(v: np.ndarray, lambda_: float) -> np.ndarray:
    """
    Proximal operator for squared L2 norm: g(x) = (λ/2)||x||₂²

    This is shrinkage: v / (1 + λ)
    Used in: Ridge regression, Tikhonov regularization
    """
    return v / (1 + lambda_)


def prox_elastic_net(v: np.ndarray, lambda1: float, lambda2: float) -> np.ndarray:
    """
    Proximal operator for Elastic Net: g(x) = λ₁||x||₁ + (λ₂/2)||x||₂²

    Composition: first soft-threshold, then shrink
    prox(v) = S_{λ₁}(v) / (1 + λ₂)
    """
    return prox_l1(v, lambda1) / (1 + lambda2)


def prox_nuclear_norm(V: np.ndarray, lambda_: float) -> np.ndarray:
    """
    Proximal operator for nuclear norm: g(X) = λ||X||_*

    Singular value soft-thresholding: U S_λ(Σ) V^T
    Used in: Matrix completion, low-rank approximation
    """
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    s_thresh = np.maximum(s - lambda_, 0)
    return U @ np.diag(s_thresh) @ Vt


def prox_box(v: np.ndarray, lower: float, upper: float) -> np.ndarray:
    """
    Proximal operator for box constraints: C = [lower, upper]^n

    This is just clipping: project onto the box
    """
    return np.clip(v, lower, upper)


def prox_simplex(v: np.ndarray) -> np.ndarray:
    """
    Projection onto the probability simplex: Δ = {x ≥ 0, Σx = 1}

    Uses the efficient O(n log n) algorithm of Duchi et al.
    Used in: Constrained optimization, optimal transport
    """
    n = len(v)
    u = np.sort(v)[::-1]  # Sort descending
    cssv = np.cumsum(u)
    rho = np.nonzero(u > (cssv - 1) / np.arange(1, n + 1))[0][-1]
    theta = (cssv[rho] - 1) / (rho + 1)
    return np.maximum(v - theta, 0)


def prox_l_inf(v: np.ndarray, lambda_: float) -> np.ndarray:
    """
    Proximal operator for L∞ norm: g(x) = λ||x||_∞

    Uses Moreau decomposition: prox_g(v) = v - prox_{g*}(v)
    where g* is the conjugate (indicator of the L1 ball of radius λ)
    """
    # Project v onto the L1 ball of radius lambda_
    if np.sum(np.abs(v)) <= lambda_:
        return np.zeros_like(v)  # v lies inside the ball: projection is v, prox is 0
    proj = np.sign(v) * lambda_ * prox_simplex(np.abs(v) / lambda_)
    return v - proj


# Test examples
if __name__ == "__main__":
    np.random.seed(42)
    v = np.array([3.0, -1.0, 0.5, -0.2, 2.0])

    print("Original:", v)
    print("Prox L1 (λ=0.5):", prox_l1(v, 0.5))
    print("Prox L2 (λ=0.5):", prox_l2(v, 0.5))
    print("Prox L2² (λ=0.5):", prox_l2_squared(v, 0.5))
    print("Prox ElasticNet:", prox_elastic_net(v, 0.3, 0.2))

    # Matrix case
    M = np.random.randn(4, 3)
    print("\nOriginal matrix rank:", np.linalg.matrix_rank(M))
    M_prox = prox_nuclear_norm(M, lambda_=1.0)
    print("After nuclear norm prox:", np.linalg.matrix_rank(M_prox, tol=1e-6))
```

| Function g(x) | prox_{λg}(v) | Computation |
|---|---|---|
| ‖x‖₁ (L1) | sign(v) ⊙ max(|v| - λ, 0) | O(n) |
| ‖x‖₂ (L2) | v · max(0, 1 - λ/‖v‖) | O(n) |
| (1/2)‖x‖₂² (L2²) | v / (1 + λ) | O(n) |
| I_C(x) (indicator) | Proj_C(v) | Depends on C |
| ‖X‖_* (nuclear) | U S_λ(Σ) Vᵀ | O(min(m,n)²·max(m,n)) |
| ‖x‖_∞ (L∞) | via Moreau decomposition | O(n log n) |
For any convex g with conjugate g*, we have v = prox_{λg}(v) + λ·prox_{g*/λ}(v/λ). This means knowing prox_g gives prox_{g*} for free. The L∞ proximal operator uses this: L∞ is conjugate to the L1 ball indicator.
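A quick numerical check of this identity for g = ‖·‖₁, whose conjugate g* is the indicator of the unit L∞ ball, so prox_{g*/λ} is simply clipping to [-1, 1]. The snippet reuses prox_l1 from the listing above:

```python
import numpy as np

def prox_l1(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0)

lam = 0.7
v = np.array([3.0, -1.0, 0.5, -0.2, 2.0])

# Moreau decomposition: v = prox_{λg}(v) + λ·prox_{g*/λ}(v/λ)
reconstructed = prox_l1(v, lam) + lam * np.clip(v / lam, -1.0, 1.0)
print(np.allclose(v, reconstructed))  # True
```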
Composite optimization:
Consider the problem:
minimize F(x) = f(x) + g(x)
where:
- f is convex and L-smooth (its gradient is L-Lipschitz), and
- g is convex, possibly non-smooth, but has an inexpensive proximal operator.
Proximal Gradient Descent (also called ISTA, for Iterative Shrinkage-Thresholding Algorithm) alternates two steps:
1. Gradient step on f: z^k = x^k - (1/L)∇f(x^k)
2. Proximal step on g: x^{k+1} = prox_{g/L}(z^k)
Or combined into a single update:
x^{k+1} = prox_{g/L}(x^k - (1/L)∇f(x^k))
The step size 1/L ensures descent. Intuitively: we take a gradient step to reduce f, then apply the proximal operator to incorporate g's structure.
```python
import numpy as np
from typing import Callable, Tuple

def proximal_gradient_descent(
    f: Callable[[np.ndarray], float],
    grad_f: Callable[[np.ndarray], np.ndarray],
    prox_g: Callable[[np.ndarray, float], np.ndarray],
    x0: np.ndarray,
    L: float,
    max_iter: int = 1000,
    tol: float = 1e-8,
    verbose: bool = False
) -> Tuple[np.ndarray, list]:
    """
    Proximal gradient descent (ISTA) for composite minimization.

    Minimizes: F(x) = f(x) + g(x)
    where f is L-smooth and g has a computable proximal operator.

    Parameters:
    -----------
    f : callable
        Smooth objective component f(x)
    grad_f : callable
        Gradient of f: ∇f(x)
    prox_g : callable
        Proximal operator: prox_{λg}(v, λ) for non-smooth g
    x0 : np.ndarray
        Starting point
    L : float
        Lipschitz constant of ∇f
    max_iter : int
        Maximum iterations
    tol : float
        Convergence tolerance on ||x^{k+1} - x^k||

    Returns:
    --------
    x : np.ndarray
        Approximate minimizer
    history : list
        Objective values at each iteration
    """
    x = x0.copy().astype(float)
    step_size = 1.0 / L
    history = []

    for k in range(max_iter):
        x_old = x.copy()

        # Gradient step on f
        grad = grad_f(x)
        z = x - step_size * grad

        # Proximal step on g
        x = prox_g(z, step_size)

        # Record history (optional: compute full objective)
        # history.append(f(x) + g(x))  # if g's value is computable

        # Convergence check
        if np.linalg.norm(x - x_old) < tol:
            if verbose:
                print(f"Converged in {k + 1} iterations")
            break

        if verbose and k % 100 == 0:
            print(f"Iter {k}: ||step|| = {np.linalg.norm(x - x_old):.2e}")

    return x, history


def proximal_gd_backtracking(
    f: Callable[[np.ndarray], float],
    grad_f: Callable[[np.ndarray], np.ndarray],
    prox_g: Callable[[np.ndarray, float], np.ndarray],
    g: Callable[[np.ndarray], float],
    x0: np.ndarray,
    L_init: float = 1.0,
    beta: float = 2.0,
    max_iter: int = 1000,
    tol: float = 1e-8
) -> Tuple[np.ndarray, list]:
    """
    Proximal gradient descent with backtracking line search.

    Adaptively finds L (step size 1/L) without knowing it a priori.
    Uses the sufficient decrease condition for composite functions.
    """
    x = x0.copy().astype(float)
    L = L_init
    history = []

    for k in range(max_iter):
        x_old = x.copy()
        f_x = f(x)
        grad = grad_f(x)

        while True:
            # Try a step with the current L
            step_size = 1.0 / L
            z = x - step_size * grad
            x_new = prox_g(z, step_size)

            # Check sufficient decrease (composite version)
            f_new = f(x_new)
            quadratic_bound = (f_x + np.dot(grad, x_new - x)
                               + (L / 2) * np.linalg.norm(x_new - x)**2)

            if f_new <= quadratic_bound:
                break

            # Increase L (decrease step size)
            L *= beta

        x = x_new
        history.append(f(x) + g(x))

        if np.linalg.norm(x - x_old) < tol:
            break

        # Optionally decrease L for the next iteration
        L = max(L / beta, L_init)

    return x, history


# Example: Lasso via proximal gradient descent
def solve_lasso_pgd(X, y, lambda_, max_iter=1000, tol=1e-6):
    """
    Solve Lasso: min (1/2)||y - Xw||² + λ||w||₁
    using proximal gradient descent.
    """
    n_samples, n_features = X.shape

    # f(w) = (1/2)||y - Xw||² is smooth with L = ||X||₂² (largest eigenvalue of X^T X)
    L = np.linalg.norm(X, ord=2) ** 2

    def f(w):
        return 0.5 * np.sum((y - X @ w) ** 2)

    def grad_f(w):
        return -X.T @ (y - X @ w)

    def prox_g(w, t):
        # g(w) = λ||w||₁; its proximal operator is soft-thresholding with threshold t*λ
        return np.sign(w) * np.maximum(np.abs(w) - t * lambda_, 0)

    w0 = np.zeros(n_features)
    w_opt, history = proximal_gradient_descent(
        f, grad_f, prox_g, w0, L, max_iter, tol, verbose=True
    )
    return w_opt


if __name__ == "__main__":
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    # Generate problem
    X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                           noise=1.0, random_state=42)
    lambda_ = 1.0

    # Our implementation
    w_pgd = solve_lasso_pgd(X, y, lambda_)

    # Sklearn reference (sklearn's Lasso uses a 1/n factor on the loss)
    lasso = Lasso(alpha=lambda_ / len(y), fit_intercept=False, max_iter=10000)
    lasso.fit(X, y)

    print(f"\nPGD solution sparsity: {np.sum(np.abs(w_pgd) > 1e-6)} non-zeros")
    print(f"Sklearn solution sparsity: {np.sum(np.abs(lasso.coef_) > 1e-6)} non-zeros")
    print(f"Difference: {np.linalg.norm(w_pgd - lasso.coef_):.2e}")
```

Convergence analysis:
For L-smooth, convex f and convex g, proximal gradient descent achieves:
F(x^k) - F(x*) ≤ L‖x⁰ - x*‖² / (2k)
This is O(1/k) convergence—the same rate as gradient descent on smooth problems. The non-smoothness of g doesn't slow convergence!
For strongly convex f (with parameter μ):
F(x^k) - F(x*) ≤ (1 - μ/L)^k · [F(x⁰) - F(x*)]
Linear convergence with rate 1 - μ/L, exactly matching gradient descent.
Just as Nesterov's momentum accelerates gradient descent, FISTA (Fast Iterative Shrinkage-Thresholding Algorithm) accelerates proximal gradient descent from O(1/k) to O(1/k²)—the optimal rate for first-order methods on smooth + convex problems.
FISTA Algorithm:
Initialize: x⁰ = y¹, t₁ = 1
For k = 1, 2, ...:
1. x^k = prox_{g/L}(y^k - (1/L)∇f(y^k)) [proximal gradient step]
2. t_{k+1} = (1 + √(1 + 4t_k²)) / 2 [momentum coefficient]
3. y^{k+1} = x^k + ((t_k - 1)/t_{k+1})(x^k - x^{k-1}) [momentum step]
The momentum term y^{k+1} extrapolates from x^k in the direction of recent progress, enabling the algorithm to 'overshoot' and accelerate convergence.
```python
import numpy as np
from typing import Callable, Tuple

def fista(
    f: Callable[[np.ndarray], float],
    grad_f: Callable[[np.ndarray], np.ndarray],
    prox_g: Callable[[np.ndarray, float], np.ndarray],
    x0: np.ndarray,
    L: float,
    max_iter: int = 1000,
    tol: float = 1e-8,
    verbose: bool = False,
    restart: bool = True
) -> Tuple[np.ndarray, list]:
    """
    FISTA: Fast Iterative Shrinkage-Thresholding Algorithm.

    Accelerated proximal gradient descent achieving O(1/k²) convergence.

    Parameters:
    -----------
    f : callable
        Smooth objective component
    grad_f : callable
        Gradient of f
    prox_g : callable
        Proximal operator for non-smooth g
    x0 : np.ndarray
        Starting point
    L : float
        Lipschitz constant of ∇f
    max_iter : int
        Maximum iterations
    tol : float
        Convergence tolerance
    restart : bool
        Use adaptive restart heuristic for faster convergence

    Returns:
    --------
    x : np.ndarray
        Approximate minimizer
    history : list
        Objective values at each iteration
    """
    x = x0.copy().astype(float)
    y = x.copy()
    t = 1.0
    step_size = 1.0 / L
    history = []

    for k in range(max_iter):
        x_old = x.copy()

        # Proximal gradient step at the extrapolated point y
        grad = grad_f(y)
        z = y - step_size * grad
        x = prox_g(z, step_size)

        # Restart condition (O'Donoghue & Candès):
        # if the update direction opposes the momentum direction, reset momentum
        if restart and np.dot(y - x, x - x_old) > 0:
            t = 1.0
            y = x.copy()
            continue

        # Momentum update
        t_new = (1 + np.sqrt(1 + 4 * t**2)) / 2
        y = x + ((t - 1) / t_new) * (x - x_old)
        t = t_new

        # Record history
        # history.append(f(x) + g(x))

        # Convergence check
        step_norm = np.linalg.norm(x - x_old)
        if step_norm < tol:
            if verbose:
                print(f"FISTA converged in {k + 1} iterations")
            break

        if verbose and k % 100 == 0:
            print(f"Iter {k}: ||step|| = {step_norm:.2e}, t = {t:.2f}")

    return x, history


def fista_backtracking(
    f: Callable[[np.ndarray], float],
    grad_f: Callable[[np.ndarray], np.ndarray],
    prox_g: Callable[[np.ndarray, float], np.ndarray],
    g: Callable[[np.ndarray], float],
    x0: np.ndarray,
    L_init: float = 1.0,
    eta: float = 2.0,
    max_iter: int = 1000,
    tol: float = 1e-8
) -> Tuple[np.ndarray, list]:
    """
    FISTA with backtracking line search.

    Finds an appropriate L adaptively at each iteration.
    """
    x = x0.copy().astype(float)
    y = x.copy()
    t = 1.0
    L = L_init
    history = []

    for k in range(max_iter):
        x_old = x.copy()

        while True:
            step_size = 1.0 / L
            grad = grad_f(y)
            z = y - step_size * grad
            x_new = prox_g(z, step_size)
            f_new = f(x_new)
            f_y = f(y)

            # Check sufficient decrease
            Q_L = (f_y + np.dot(grad, x_new - y)
                   + (L / 2) * np.linalg.norm(x_new - y)**2)
            if f_new <= Q_L:
                x = x_new
                break
            L *= eta

        # Momentum update
        t_new = (1 + np.sqrt(1 + 4 * t**2)) / 2
        y = x + ((t - 1) / t_new) * (x - x_old)
        t = t_new

        L = max(L / eta, L_init)  # Try a smaller L next iteration
        history.append(f(x) + g(x))

        if np.linalg.norm(x - x_old) < tol:
            break

    return x, history


# Comparison: ISTA vs FISTA
if __name__ == "__main__":
    # Ill-conditioned Lasso problem
    np.random.seed(42)
    n, d = 500, 200
    X = np.random.randn(n, d)
    X[:, :10] *= 10  # Create ill-conditioning
    w_true = np.zeros(d)
    w_true[:10] = np.random.randn(10)
    y = X @ w_true + 0.1 * np.random.randn(n)

    L = np.linalg.norm(X, ord=2) ** 2
    lambda_ = 0.1 * np.max(np.abs(X.T @ y))

    def f(w):
        return 0.5 * np.sum((y - X @ w)**2)

    def grad_f(w):
        return -X.T @ (y - X @ w)

    def prox_g(w, t):
        return np.sign(w) * np.maximum(np.abs(w) - t * lambda_, 0)

    def g(w):
        return lambda_ * np.sum(np.abs(w))

    # Run both algorithms (proximal_gradient_descent is defined in the previous listing)
    w0 = np.zeros(d)
    w_ista, hist_ista = proximal_gradient_descent(
        f, grad_f, prox_g, w0, L, max_iter=500
    )
    w_fista, hist_fista = fista(
        f, grad_f, prox_g, w0, L, max_iter=500
    )

    print(f"ISTA final objective: {f(w_ista) + g(w_ista):.6f}")
    print(f"FISTA final objective: {f(w_fista) + g(w_fista):.6f}")
    print(f"True objective: {f(w_true) + g(w_true):.6f}")
```

FISTA can oscillate on strongly convex problems. Adaptive restart—resetting momentum when the gradient and momentum point in opposite directions—gives FISTA linear convergence on strongly convex problems while maintaining O(1/k²) on merely convex ones. Always enable restart in practice.
Convergence rates:
| Algorithm | Convex f | Strongly Convex f |
|---|---|---|
| ISTA (Proximal GD) | O(1/k) | O((1-μ/L)^k) linear |
| FISTA (no restart) | O(1/k²) | O(1/k²) |
| FISTA (with restart) | O(1/k²) | O((1-√(μ/L))^k) accelerated linear |
FISTA with restart is the recommended default—it's never worse than ISTA and often dramatically faster.
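To see the effect of the restart flag, a short usage sketch (assuming the problem setup f, grad_f, prox_g, g, L, and d from the ISTA vs FISTA comparison above is in scope): run fista twice and compare.

```python
# Toggle the adaptive-restart heuristic on the same Lasso problem; with
# verbose=True, the "converged in N iterations" messages show the difference.
w0 = np.zeros(d)
w_plain, _ = fista(f, grad_f, prox_g, w0, L, max_iter=2000, tol=1e-8,
                   verbose=True, restart=False)
w_rst, _ = fista(f, grad_f, prox_g, w0, L, max_iter=2000, tol=1e-8,
                 verbose=True, restart=True)
print("objective, no restart:  ", f(w_plain) + g(w_plain))
print("objective, with restart:", f(w_rst) + g(w_rst))
```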
ADMM is a powerful algorithm for solving problems of the form:
minimize f(x) + g(z) subject to Ax + Bz = c
It combines the decomposability of dual ascent with the convergence properties of the augmented Lagrangian method. ADMM is particularly effective when f and g have simple proximal operators individually, even if f + g composed is difficult.
The ADMM Iteration:
Forming the augmented Lagrangian with penalty ρ and scaled dual variable u:
L_ρ(x, z, u) = f(x) + g(z) + (ρ/2)‖Ax + Bz - c + u‖² - (ρ/2)‖u‖²
Repeat:
x^{k+1} = argmin_x { f(x) + (ρ/2)||Ax + Bz^k - c + u^k||² } [x-update]
z^{k+1} = argmin_z { g(z) + (ρ/2)||Ax^{k+1} + Bz - c + u^k||² } [z-update]
u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c [dual update]
In this scaled form (u = μ/ρ, where μ is the unscaled dual variable), and with the common splitting A = I, B = -I, c = 0 (i.e., the constraint x = z), the x- and z-updates are exactly proximal operators: x^{k+1} = prox_{f/ρ}(z^k - u^k) and z^{k+1} = prox_{g/ρ}(x^{k+1} + u^k).
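Before the concrete Lasso example below, here is a minimal sketch of scaled ADMM for the x = z splitting, written purely in terms of the two proximal operators (using the same prox(v, λ) calling convention as the earlier listings; admm_prox_splitting is an illustrative helper, not a library routine):

```python
import numpy as np

def admm_prox_splitting(prox_f, prox_g, x0, rho=1.0, max_iter=500, tol=1e-8):
    """Minimize f(x) + g(x) by splitting x = z and applying scaled ADMM."""
    x = x0.copy()
    z = x0.copy()
    u = np.zeros_like(x0)  # scaled dual variable
    for _ in range(max_iter):
        x = prox_f(z - u, 1.0 / rho)   # x-update: prox of f with parameter 1/ρ
        z_old = z
        z = prox_g(x + u, 1.0 / rho)   # z-update: prox of g with parameter 1/ρ
        u = u + x - z                  # scaled dual update
        # Stop when both primal and dual residuals are small
        if (np.linalg.norm(x - z) < tol
                and rho * np.linalg.norm(z - z_old) < tol):
            break
    return z
```

This form makes explicit that ADMM only needs the two proximal operators, never the prox of f + g jointly.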
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from typing import Tuple

def admm_lasso(
    X: np.ndarray,
    y: np.ndarray,
    lambda_: float,
    rho: float = 1.0,
    max_iter: int = 1000,
    tol: float = 1e-6,
    verbose: bool = False
) -> Tuple[np.ndarray, dict]:
    """
    ADMM for Lasso: min (1/2)||y - Xw||² + λ||w||₁

    Reformulate as: min (1/2)||y - Xw||² + λ||z||₁  s.t.  w - z = 0
    Then apply ADMM, splitting between the smooth (w) and non-smooth (z) parts.

    Parameters:
    -----------
    X : np.ndarray
        Feature matrix (n, d)
    y : np.ndarray
        Target vector (n,)
    lambda_ : float
        L1 regularization strength
    rho : float
        ADMM penalty parameter
    max_iter : int
        Maximum ADMM iterations
    tol : float
        Tolerance for primal and dual residuals

    Returns:
    --------
    z : np.ndarray
        Sparse solution
    history : dict
        Convergence history
    """
    n, d = X.shape

    # Initialize
    w = np.zeros(d)
    z = np.zeros(d)
    u = np.zeros(d)  # Scaled dual variable

    # Precompute X^T X and X^T y for the w-update
    XtX = X.T @ X
    Xty = X.T @ y

    # Factorize (X^T X + ρI) once (expensive for large d, but amortized over iterations)
    chol = cho_factor(XtX + rho * np.eye(d))

    history = {'primal_res': [], 'dual_res': [], 'objective': []}

    for k in range(max_iter):
        # w-update: argmin (1/2)||y - Xw||² + (ρ/2)||w - z + u||²
        # Normal equations: (X^T X + ρI) w = X^T y + ρ(z - u)
        rhs = Xty + rho * (z - u)
        w = cho_solve(chol, rhs)

        # z-update: argmin λ||z||₁ + (ρ/2)||w - z + u||²
        # Solution: soft-threshold(w + u, λ/ρ)
        z_old = z.copy()
        z = np.sign(w + u) * np.maximum(np.abs(w + u) - lambda_ / rho, 0)

        # u-update (dual ascent)
        u = u + w - z

        # Compute residuals
        primal_res = np.linalg.norm(w - z)
        dual_res = rho * np.linalg.norm(z - z_old)

        history['primal_res'].append(primal_res)
        history['dual_res'].append(dual_res)
        history['objective'].append(
            0.5 * np.sum((y - X @ w)**2) + lambda_ * np.sum(np.abs(z))
        )

        if verbose and k % 50 == 0:
            print(f"Iter {k}: primal_res={primal_res:.2e}, dual_res={dual_res:.2e}")

        # Convergence check
        if primal_res < tol and dual_res < tol:
            if verbose:
                print(f"ADMM converged in {k + 1} iterations")
            break

    return z, history  # Return z (the sparse solution)


def admm_consensus(
    f_list: list,
    prox_list: list,
    x0: np.ndarray,
    rho: float = 1.0,
    max_iter: int = 500,
    tol: float = 1e-6
) -> np.ndarray:
    """
    Consensus ADMM for a sum of functions: min Σᵢ fᵢ(x)

    Each worker optimizes fᵢ independently, then solutions are averaged
    to reach consensus. Useful for distributed optimization where data
    is partitioned across workers.
    """
    n_workers = len(f_list)
    d = len(x0)

    # Local copies
    x_local = [x0.copy() for _ in range(n_workers)]
    # Global consensus variable
    z = x0.copy()
    # Dual variables
    u = [np.zeros(d) for _ in range(n_workers)]

    for k in range(max_iter):
        z_old = z.copy()

        # Local updates (parallelizable)
        for i in range(n_workers):
            # x_i = prox_{f_i/ρ}(z - u_i)
            x_local[i] = prox_list[i](z - u[i], 1.0 / rho)

        # Global average (consensus)
        z = np.mean([x_local[i] + u[i] for i in range(n_workers)], axis=0)

        # Dual updates
        for i in range(n_workers):
            u[i] = u[i] + x_local[i] - z

        # Check convergence
        if np.linalg.norm(z - z_old) < tol:
            break

    return z


if __name__ == "__main__":
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    # Test ADMM Lasso
    X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                           random_state=42)
    lambda_ = 0.1 * np.max(np.abs(X.T @ y))

    w_admm, history = admm_lasso(X, y, lambda_, rho=1.0, verbose=True)

    # Reference
    lasso = Lasso(alpha=lambda_ / len(y), fit_intercept=False)
    lasso.fit(X, y)

    print(f"\nADMM sparsity: {np.sum(np.abs(w_admm) > 1e-6)}")
    print(f"Sklearn sparsity: {np.sum(np.abs(lasso.coef_) > 1e-6)}")
    print(f"Difference: {np.linalg.norm(w_admm - lasso.coef_):.2e}")
```

Choosing ρ affects convergence speed significantly. Too small: slow primal convergence. Too large: slow dual convergence. Adaptive ρ schemes (increasing ρ when the primal residual dominates, decreasing it when the dual residual dominates) are common in practice. A good initial guess: ρ ≈ λ for Lasso.
Proximal methods are the foundation of modern sparse learning algorithms. Here's how they're used in practice:
Sparse Linear Models: Lasso, Elastic Net, group Lasso, and L1-regularized logistic regression all reduce to soft-thresholding (or its block variant) inside ISTA/FISTA or coordinate descent.
Matrix Problems: matrix completion and low-rank approximation minimize the nuclear norm via singular value soft-thresholding, exactly the prox_nuclear_norm operator above.
Deep Learning: proximal updates apply sparsity- or constraint-enforcing operators (soft-thresholding, clipping, projection onto norm balls) to the weights after each gradient step, as in pruning and structured-sparsity training; a minimal sketch follows below.
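A minimal NumPy sketch of that last idea, not tied to any particular framework: prox_sgd_step is a hypothetical helper, and the gradient here is a random placeholder standing in for a real backpropagated gradient.

```python
import numpy as np

def prox_sgd_step(w, grad_w, lr, lambda_):
    """One proximal-SGD-style update: gradient step on the smooth loss,
    then soft-thresholding to promote sparse weights."""
    w = w - lr * grad_w                                           # gradient step
    return np.sign(w) * np.maximum(np.abs(w) - lr * lambda_, 0)   # prox of lr·λ‖·‖₁

# Toy usage with a random "gradient"
rng = np.random.default_rng(0)
w = rng.normal(size=100)
w = prox_sgd_step(w, grad_w=rng.normal(size=100), lr=0.01, lambda_=5.0)
print("non-zero weights:", np.count_nonzero(w))
```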
| Problem Type | Recommended Algorithm | Notes |
|---|---|---|
| Lasso (moderate size) | FISTA + restart | Fast, simple, reliable |
| Lasso (large scale) | Coordinate Descent | Even faster for sparse solutions |
| Group Lasso | FISTA with block prox | Block soft-thresholding |
| Elastic Net | FISTA or Coordinate Descent | Both work well |
| Constrained problems | ADMM or Projected GD | ADMM for complex constraints |
| Distributed data | Consensus ADMM | Natural parallelization |
| Matrix completion | Proximal SVD thresholding | Singular value soft-threshold |
```python
import numpy as np

def sparse_logistic_regression_fista(
    X: np.ndarray,
    y: np.ndarray,
    lambda_: float,
    max_iter: int = 500,
    tol: float = 1e-6
) -> np.ndarray:
    """
    L1-regularized logistic regression via FISTA.

    Minimizes: (1/n)Σᵢ log(1 + exp(-yᵢ wᵀxᵢ)) + λ||w||₁
    where y ∈ {-1, +1}.
    """
    n, d = X.shape

    # Lipschitz constant of the gradient: L = ||X||₂²/(4n)
    # (the logistic loss has Hessian bounded by 1/4)
    L = np.linalg.norm(X, ord=2)**2 / (4 * n)
    step_size = 1.0 / L

    def log_loss_grad(w):
        """Gradient of the averaged logistic loss"""
        z = y * (X @ w)
        # Numerically stable sigmoid
        sig = np.where(z >= 0,
                       1 / (1 + np.exp(-z)),
                       np.exp(z) / (1 + np.exp(z)))
        return -X.T @ (y * (1 - sig)) / n

    def soft_threshold(v, thresh):
        return np.sign(v) * np.maximum(np.abs(v) - thresh, 0)

    # FISTA
    w = np.zeros(d)
    v = w.copy()
    t = 1.0

    for k in range(max_iter):
        w_old = w.copy()

        # Gradient step on the log-loss at the extrapolated point v
        grad = log_loss_grad(v)
        z = v - step_size * grad

        # Proximal step (soft-thresholding)
        w = soft_threshold(z, step_size * lambda_)

        # Momentum
        t_new = (1 + np.sqrt(1 + 4 * t**2)) / 2
        v = w + ((t - 1) / t_new) * (w - w_old)
        t = t_new

        if np.linalg.norm(w - w_old) < tol:
            print(f"Converged in {k+1} iterations")
            break

    return w


def matrix_completion_pgd(
    M_observed: np.ndarray,
    mask: np.ndarray,
    lambda_: float,
    max_iter: int = 200,
    tol: float = 1e-5
) -> np.ndarray:
    """
    Matrix completion via nuclear norm minimization.

    Minimizes: (1/2)||P_Ω(X - M)||²_F + λ||X||_*
    where P_Ω is projection onto the observed entries.
    """
    m, n = M_observed.shape

    # Initialize with the observed entries
    X = M_observed * mask

    # Lipschitz constant is 1 (the masking projection is non-expansive)
    L = 1.0
    step_size = 1.0 / L

    for k in range(max_iter):
        X_old = X.copy()

        # Gradient: P_Ω(X - M), where Ω is the observed set
        grad = (X - M_observed) * mask

        # Gradient step
        Z = X - step_size * grad

        # Proximal step: singular value soft-thresholding
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s_thresh = np.maximum(s - step_size * lambda_, 0)
        X = U @ np.diag(s_thresh) @ Vt

        # Check convergence
        if np.linalg.norm(X - X_old, 'fro') < tol:
            print(f"Converged in {k+1} iterations")
            break

    return X
```

For Lasso with dense features, FISTA is simple and effective. For Lasso with sparse solutions (many zero coefficients), coordinate descent with active set strategies is typically 5-10× faster. For general problems where coordinate updates don't have closed forms, FISTA is the default choice.
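To make that comparison concrete, here is a minimal cyclic coordinate descent for the same Lasso objective (1/2)||y - Xw||² + λ||w||₁. It is a sketch only, without the active-set and screening tricks that production solvers use; lasso_coordinate_descent is an illustrative helper, not a library routine. Each coordinate update is a scalar soft-threshold, which is why per-coordinate iterations are so cheap.

```python
import numpy as np

def lasso_coordinate_descent(X, y, lambda_, max_iter=100, tol=1e-8):
    """Cyclic coordinate descent for (1/2)||y - Xw||² + λ||w||₁."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = np.sum(X**2, axis=0)   # precompute ||X_j||² for each column
    r = y - X @ w                    # running residual

    for _ in range(max_iter):
        w_old = w.copy()
        for j in range(d):
            if col_sq[j] == 0:
                continue
            r += X[:, j] * w[j]      # remove coordinate j's contribution
            rho_j = X[:, j] @ r
            # Scalar soft-threshold gives the exact coordinate minimizer
            w[j] = np.sign(rho_j) * max(abs(rho_j) - lambda_, 0) / col_sq[j]
            r -= X[:, j] * w[j]      # add it back with the new value
        if np.linalg.norm(w - w_old) < tol:
            break
    return w
```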
Proximal methods provide a unified framework for optimizing composite objectives with non-smooth components. Their elegance lies in replacing gradients with proximal operators, enabling efficient algorithms for problems central to modern machine learning.
What's next:
The next page explores Optimization Landscapes—the geometry of loss surfaces in machine learning. Understanding landscapes helps explain why optimization succeeds or fails, particularly for non-convex deep learning objectives.
You now understand proximal methods: proximal operators, ISTA/FISTA algorithms, ADMM for splitting, and applications to sparse ML. This completes the algorithmic toolkit for composite optimization, enabling you to efficiently solve problems with non-smooth regularizers.