Many machine learning objectives combine a smooth loss with a non-smooth regularizer:
minimize f(x) + g(x)
where f is differentiable (e.g., squared error) and g is convex but non-differentiable (e.g., L1 norm). Standard gradient descent stumbles on such objectives—the gradient of g doesn't exist everywhere, and subgradients lead to slow convergence.
Proximal methods provide an elegant solution. Rather than computing gradients of non-smooth terms, they replace gradient steps with proximal operators—a generalization that handles non-differentiability naturally while preserving convergence guarantees. The result is algorithms that are as simple as gradient descent but work seamlessly with sparsity-inducing regularizers.
By the end of this page, you will understand the proximal operator and its geometric interpretation, derive proximal operators for common regularizers, implement proximal gradient descent (ISTA) and its accelerated variant (FISTA), analyze convergence rates of proximal methods, and apply proximal algorithms to composite optimization problems in ML.
Definition:
For a convex function g: ℝⁿ → ℝ ∪ {+∞} and parameter λ > 0, the proximal operator is:
prox_{λg}(v) = argmin_x { g(x) + (1/2λ)‖x - v‖² }
This operator finds the point x that balances two goals: making g(x) small, and staying close to the reference point v (as measured by the quadratic penalty). The parameter λ controls this tradeoff—larger λ gives more weight to minimizing g, smaller λ keeps x closer to v.
Key properties:
- Well-defined: for convex g, the subproblem objective g(x) + (1/2λ)‖x - v‖² is strongly convex, so prox_{λg}(v) exists and is unique.
- Fixed points: x* minimizes g if and only if prox_{λg}(x*) = x*.
- Nonexpansive: ‖prox_{λg}(u) - prox_{λg}(v)‖ ≤ ‖u - v‖, which is what makes proximal iterations stable.
Think of prox_{λg}(v) as a 'denoising' operation. Starting from point v, we ask: where should we move to reduce g while not straying too far? For L1 norm (g = ‖·‖₁), this becomes soft-thresholding—small coordinates are set to zero, large coordinates are shrunk.
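As a quick sanity check of the definition (a minimal sketch assuming the one-dimensional case g(x) = |x| and that SciPy is available), the prox subproblem can be minimized numerically and compared against the soft-threshold formula:

```python
import numpy as np
from scipy.optimize import minimize_scalar

v, lam = 2.3, 0.8

# Solve argmin_x { |x| + (1/2λ)(x - v)² } numerically
numeric = minimize_scalar(lambda x: abs(x) + (x - v) ** 2 / (2 * lam)).x

# Closed-form soft-threshold: sign(v) * max(|v| - λ, 0)
closed_form = np.sign(v) * max(abs(v) - lam, 0)

print(numeric, closed_form)  # both ≈ 1.5
```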
Relationship to gradient descent:
For smooth g, the gradient descent step is:
x^{new} = x - λ∇g(x)
For small λ, the proximal step matches this gradient step to first order:
prox_{λg}(x) = x - λ∇g(x) + O(λ²)
As λ → 0, they coincide. But for non-smooth g (where ∇g doesn't exist), only the proximal form makes sense. The proximal operator is thus a generalization of gradient steps to non-differentiable functions.
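A small numerical illustration of this relationship, using the smooth choice g(x) = (1/2)‖x‖₂² whose proximal operator is v/(1 + λ) (see the table below): the gap between the proximal step and the gradient step shrinks like λ².

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5])
for lam in [0.5, 0.1, 0.01]:
    prox_step = x / (1 + lam)   # prox_{λg}(x) for g = ½‖x‖²
    grad_step = x - lam * x     # x - λ∇g(x), since ∇g(x) = x
    # The difference is x · λ²/(1+λ), i.e. O(λ²)
    print(lam, np.linalg.norm(prox_step - grad_step))
```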
Why 'proximal'?
The name comes from 'proximity'—we measure distance to v in the quadratic penalty. Other distance measures (Bregman divergences) lead to mirror descent and related methods.
For proximal methods to be practical, proximal operators must be efficiently computable. Fortunately, many regularizers used in ML have closed-form proximal operators.
1. L1 Norm (Lasso): g(x) = ‖x‖₁
prox_{λg}(v) = soft-threshold(v, λ) where:
S_λ(v)ᵢ = sign(vᵢ) · max(|vᵢ| - λ, 0)
This is component-wise soft-thresholding—the foundation of sparse optimization.
2. L2 Norm: g(x) = ‖x‖₂
prox_{λg}(v) = v · max(0, 1 - λ/‖v‖₂) (if v ≠ 0); 0 otherwise
This is block soft-thresholding—the vector is shrunk toward zero while maintaining its direction.
3. Indicator Function (Constraints): g(x) = I_C(x)
I_C(x) = 0 if x ∈ C; +∞ otherwise
prox_{λg}(v) = Proj_C(v) = argmin_{x ∈ C} ‖x - v‖²
The proximal operator of an indicator function is exactly projection onto the constraint set, so constrained optimization fits directly into the proximal framework.
```python
import numpy as np

# ============================================
# Proximal Operators for Common Regularizers
# ============================================

def prox_l1(v: np.ndarray, lambda_: float) -> np.ndarray:
    """
    Proximal operator for L1 norm: g(x) = λ||x||₁

    This is soft-thresholding: S_λ(v)_i = sign(v_i) max(|v_i| - λ, 0)
    Used in: Lasso, sparse coding, compressed sensing
    """
    return np.sign(v) * np.maximum(np.abs(v) - lambda_, 0)


def prox_l2(v: np.ndarray, lambda_: float) -> np.ndarray:
    """
    Proximal operator for L2 norm: g(x) = λ||x||₂

    This is block soft-thresholding: v * max(0, 1 - λ/||v||)
    Used in: Group Lasso, nuclear norm minimization
    """
    v_norm = np.linalg.norm(v)
    if v_norm == 0:
        return np.zeros_like(v)
    return v * max(0, 1 - lambda_ / v_norm)


def prox_l2_squared(v: np.ndarray, lambda_: float) -> np.ndarray:
    """
    Proximal operator for squared L2 norm: g(x) = (λ/2)||x||₂²

    This is shrinkage: v / (1 + λ)
    Used in: Ridge regression, Tikhonov regularization
    """
    return v / (1 + lambda_)


def prox_elastic_net(v: np.ndarray, lambda1: float, lambda2: float) -> np.ndarray:
    """
    Proximal operator for Elastic Net: g(x) = λ₁||x||₁ + (λ₂/2)||x||₂²

    Composition: first soft-threshold, then shrink
    prox(v) = S_{λ₁}(v) / (1 + λ₂)
    """
    return prox_l1(v, lambda1) / (1 + lambda2)


def prox_nuclear_norm(V: np.ndarray, lambda_: float) -> np.ndarray:
    """
    Proximal operator for nuclear norm: g(X) = λ||X||_*

    Singular value soft-thresholding: U S_λ(Σ) V^T
    Used in: Matrix completion, low-rank approximation
    """
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    s_thresh = np.maximum(s - lambda_, 0)
    return U @ np.diag(s_thresh) @ Vt


def prox_box(v: np.ndarray, lower: float, upper: float) -> np.ndarray:
    """
    Proximal operator for box constraints: C = [lower, upper]^n

    This is just clipping: project onto the box
    """
    return np.clip(v, lower, upper)


def prox_simplex(v: np.ndarray) -> np.ndarray:
    """
    Projection onto the probability simplex: Δ = {x ≥ 0, Σx = 1}

    Uses the efficient O(n log n) algorithm of Duchi et al.
    Used in: Constrained optimization, optimal transport
    """
    n = len(v)
    u = np.sort(v)[::-1]  # Sort descending
    cssv = np.cumsum(u)
    rho = np.nonzero(u > (cssv - 1) / np.arange(1, n + 1))[0][-1]
    theta = (cssv[rho] - 1) / (rho + 1)
    return np.maximum(v - theta, 0)


def prox_l_inf(v: np.ndarray, lambda_: float) -> np.ndarray:
    """
    Proximal operator for L∞ norm: g(x) = λ||x||_∞

    Uses Moreau decomposition: prox_g(v) = v - prox_{g*}(v)
    where g* is the conjugate (indicator of the L1 ball of radius λ)
    """
    # Project v onto the L1 ball of radius lambda_
    if np.sum(np.abs(v)) <= lambda_:
        return np.zeros_like(v)  # v lies inside the ball: projection is v, prox is 0
    proj = np.sign(v) * lambda_ * prox_simplex(np.abs(v) / lambda_)
    return v - proj


# Test examples
if __name__ == "__main__":
    np.random.seed(42)
    v = np.array([3.0, -1.0, 0.5, -0.2, 2.0])

    print("Original:", v)
    print("Prox L1 (λ=0.5):", prox_l1(v, 0.5))
    print("Prox L2 (λ=0.5):", prox_l2(v, 0.5))
    print("Prox L2² (λ=0.5):", prox_l2_squared(v, 0.5))
    print("Prox ElasticNet:", prox_elastic_net(v, 0.3, 0.2))

    # Matrix case
    M = np.random.randn(4, 3)
    print("\nOriginal matrix rank:", np.linalg.matrix_rank(M))
    M_prox = prox_nuclear_norm(M, lambda_=1.0)
    print("After nuclear norm prox:", np.linalg.matrix_rank(M_prox, tol=1e-6))
```

| Function g(x) | prox_{λg}(v) | Computation |
|---|---|---|
| ‖x‖₁ (L1) | sign(v) ⊙ max(|v| - λ, 0) | O(n) |
| ‖x‖₂ (L2) | v · max(0, 1 - λ/‖v‖) | O(n) |
| (1/2)‖x‖₂² (L2²) | v / (1 + λ) | O(n) |
| I_C(x) (indicator) | Proj_C(v) | Depends on C |
| ‖X‖_* (nuclear) | U S_λ(Σ) Vᵀ | O(min(m,n)²·max(m,n)) |
| ‖x‖_∞ (L∞) | via Moreau decomposition | O(n log n) |
For any convex g with conjugate g*, we have v = prox_{λg}(v) + λ·prox_{g*/λ}(v/λ). This means knowing prox_g gives prox_{g*} for free. The L∞ proximal operator uses this: L∞ is conjugate to the L1 ball indicator.
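A quick numerical check of this identity for g = ‖·‖₁, whose conjugate g* is the indicator of the unit L∞ ball, so prox_{g*/λ} is simply clipping to [-1, 1]. The snippet reuses prox_l1 from the listing above:

```python
import numpy as np

def prox_l1(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0)

lam = 0.7
v = np.array([3.0, -1.0, 0.5, -0.2, 2.0])

# Moreau decomposition: v = prox_{λg}(v) + λ·prox_{g*/λ}(v/λ)
reconstructed = prox_l1(v, lam) + lam * np.clip(v / lam, -1.0, 1.0)
print(np.allclose(v, reconstructed))  # True
```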
Composite optimization:
Consider the problem:
minimize F(x) = f(x) + g(x)
where:
- f is convex and L-smooth (its gradient is L-Lipschitz), and
- g is convex, possibly non-smooth, but has an inexpensive proximal operator.
Proximal Gradient Descent (also called ISTA, for Iterative Shrinkage-Thresholding Algorithm) alternates two steps:
1. Gradient step on f: z^k = x^k - (1/L)∇f(x^k)
2. Proximal step on g: x^{k+1} = prox_{g/L}(z^k)
Or combined into a single update:
x^{k+1} = prox_{g/L}(x^k - (1/L)∇f(x^k))
The step size 1/L ensures descent. Intuitively: we take a gradient step to reduce f, then apply the proximal operator to incorporate g's structure.
```python
import numpy as np
from typing import Callable, Tuple

def proximal_gradient_descent(
    f: Callable[[np.ndarray], float],
    grad_f: Callable[[np.ndarray], np.ndarray],
    prox_g: Callable[[np.ndarray, float], np.ndarray],
    x0: np.ndarray,
    L: float,
    max_iter: int = 1000,
    tol: float = 1e-8,
    verbose: bool = False
) -> Tuple[np.ndarray, list]:
    """
    Proximal gradient descent (ISTA) for composite minimization.

    Minimizes: F(x) = f(x) + g(x)
    where f is L-smooth and g has a computable proximal operator.

    Parameters:
    -----------
    f : callable
        Smooth objective component f(x)
    grad_f : callable
        Gradient of f: ∇f(x)
    prox_g : callable
        Proximal operator: prox_{λg}(v, λ) for non-smooth g
    x0 : np.ndarray
        Starting point
    L : float
        Lipschitz constant of ∇f
    max_iter : int
        Maximum iterations
    tol : float
        Convergence tolerance on ||x^{k+1} - x^k||

    Returns:
    --------
    x : np.ndarray
        Approximate minimizer
    history : list
        Objective values at each iteration
    """
    x = x0.copy().astype(float)
    step_size = 1.0 / L
    history = []

    for k in range(max_iter):
        x_old = x.copy()

        # Gradient step on f
        grad = grad_f(x)
        z = x - step_size * grad

        # Proximal step on g
        x = prox_g(z, step_size)

        # Record history (optional: compute full objective)
        # history.append(f(x) + g(x))  # if g's value is computable

        # Convergence check
        if np.linalg.norm(x - x_old) < tol:
            if verbose:
                print(f"Converged in {k + 1} iterations")
            break

        if verbose and k % 100 == 0:
            print(f"Iter {k}: ||step|| = {np.linalg.norm(x - x_old):.2e}")

    return x, history


def proximal_gd_backtracking(
    f: Callable[[np.ndarray], float],
    grad_f: Callable[[np.ndarray], np.ndarray],
    prox_g: Callable[[np.ndarray, float], np.ndarray],
    g: Callable[[np.ndarray], float],
    x0: np.ndarray,
    L_init: float = 1.0,
    beta: float = 2.0,
    max_iter: int = 1000,
    tol: float = 1e-8
) -> Tuple[np.ndarray, list]:
    """
    Proximal gradient descent with backtracking line search.

    Adaptively finds L (step size 1/L) without knowing it a priori.
    Uses the sufficient decrease condition for composite functions.
    """
    x = x0.copy().astype(float)
    L = L_init
    history = []

    for k in range(max_iter):
        x_old = x.copy()
        f_x = f(x)
        grad = grad_f(x)

        while True:
            # Try a step with the current L
            step_size = 1.0 / L
            z = x - step_size * grad
            x_new = prox_g(z, step_size)

            # Check sufficient decrease (composite version)
            f_new = f(x_new)
            quadratic_bound = (f_x + np.dot(grad, x_new - x)
                               + (L / 2) * np.linalg.norm(x_new - x)**2)

            if f_new <= quadratic_bound:
                break

            # Increase L (decrease step size)
            L *= beta

        x = x_new
        history.append(f(x) + g(x))

        if np.linalg.norm(x - x_old) < tol:
            break

        # Optionally decrease L for the next iteration
        L = max(L / beta, L_init)

    return x, history


# Example: Lasso via proximal gradient descent
def solve_lasso_pgd(X, y, lambda_, max_iter=1000, tol=1e-6):
    """
    Solve Lasso: min (1/2)||y - Xw||² + λ||w||₁
    using proximal gradient descent.
    """
    n_samples, n_features = X.shape

    # f(w) = (1/2)||y - Xw||² is smooth with L = ||X||₂² (largest eigenvalue of X^T X)
    L = np.linalg.norm(X, ord=2) ** 2

    def f(w):
        return 0.5 * np.sum((y - X @ w) ** 2)

    def grad_f(w):
        return -X.T @ (y - X @ w)

    def prox_g(w, t):
        # g(w) = λ||w||₁; its proximal operator is soft-thresholding with threshold t*λ
        return np.sign(w) * np.maximum(np.abs(w) - t * lambda_, 0)

    w0 = np.zeros(n_features)
    w_opt, history = proximal_gradient_descent(
        f, grad_f, prox_g, w0, L, max_iter, tol, verbose=True
    )
    return w_opt


if __name__ == "__main__":
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    # Generate problem
    X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                           noise=1.0, random_state=42)
    lambda_ = 1.0

    # Our implementation
    w_pgd = solve_lasso_pgd(X, y, lambda_)

    # Sklearn reference (sklearn's Lasso uses a 1/n factor on the loss)
    lasso = Lasso(alpha=lambda_ / len(y), fit_intercept=False, max_iter=10000)
    lasso.fit(X, y)

    print(f"\nPGD solution sparsity: {np.sum(np.abs(w_pgd) > 1e-6)} non-zeros")
    print(f"Sklearn solution sparsity: {np.sum(np.abs(lasso.coef_) > 1e-6)} non-zeros")
    print(f"Difference: {np.linalg.norm(w_pgd - lasso.coef_):.2e}")
```

Convergence analysis:
For L-smooth, convex f and convex g, proximal gradient descent achieves:
F(x^k) - F(x*) ≤ L‖x⁰ - x*‖² / (2k)
This is O(1/k) convergence—the same rate as gradient descent on smooth problems. The non-smoothness of g doesn't slow convergence!
For strongly convex f (with parameter μ):
F(x^k) - F(x*) ≤ (1 - μ/L)^k · [F(x⁰) - F(x*)]
Linear convergence with rate 1 - μ/L, exactly matching gradient descent.
Just as Nesterov's momentum accelerates gradient descent, FISTA (Fast Iterative Shrinkage-Thresholding Algorithm) accelerates proximal gradient descent from O(1/k) to O(1/k²)—the optimal rate for first-order methods on smooth + convex problems.
FISTA Algorithm:
Initialize: x⁰ = y¹, t₁ = 1
For k = 1, 2, ...:
1. x^k = prox_{g/L}(y^k - (1/L)∇f(y^k)) [proximal gradient step]
2. t_{k+1} = (1 + √(1 + 4t_k²)) / 2 [momentum coefficient]
3. y^{k+1} = x^k + ((t_k - 1)/t_{k+1})(x^k - x^{k-1}) [momentum step]
The momentum term y^{k+1} extrapolates from x^k in the direction of recent progress, enabling the algorithm to 'overshoot' and accelerate convergence.
```python
import numpy as np
from typing import Callable, Tuple

def fista(
    f: Callable[[np.ndarray], float],
    grad_f: Callable[[np.ndarray], np.ndarray],
    prox_g: Callable[[np.ndarray, float], np.ndarray],
    x0: np.ndarray,
    L: float,
    max_iter: int = 1000,
    tol: float = 1e-8,
    verbose: bool = False,
    restart: bool = True
) -> Tuple[np.ndarray, list]:
    """
    FISTA: Fast Iterative Shrinkage-Thresholding Algorithm.

    Accelerated proximal gradient descent achieving O(1/k²) convergence.

    Parameters:
    -----------
    f : callable
        Smooth objective component
    grad_f : callable
        Gradient of f
    prox_g : callable
        Proximal operator for non-smooth g
    x0 : np.ndarray
        Starting point
    L : float
        Lipschitz constant of ∇f
    max_iter : int
        Maximum iterations
    tol : float
        Convergence tolerance
    restart : bool
        Use adaptive restart heuristic for faster convergence

    Returns:
    --------
    x : np.ndarray
        Approximate minimizer
    history : list
        Objective values at each iteration
    """
    x = x0.copy().astype(float)
    y = x.copy()
    t = 1.0
    step_size = 1.0 / L
    history = []

    for k in range(max_iter):
        x_old = x.copy()

        # Proximal gradient step at the extrapolated point y
        grad = grad_f(y)
        z = y - step_size * grad
        x = prox_g(z, step_size)

        # Restart condition (O'Donoghue & Candès):
        # if the update direction opposes the momentum direction, reset momentum
        if restart and np.dot(y - x, x - x_old) > 0:
            t = 1.0
            y = x.copy()
            continue

        # Momentum update
        t_new = (1 + np.sqrt(1 + 4 * t**2)) / 2
        y = x + ((t - 1) / t_new) * (x - x_old)
        t = t_new

        # Record history
        # history.append(f(x) + g(x))

        # Convergence check
        step_norm = np.linalg.norm(x - x_old)
        if step_norm < tol:
            if verbose:
                print(f"FISTA converged in {k + 1} iterations")
            break

        if verbose and k % 100 == 0:
            print(f"Iter {k}: ||step|| = {step_norm:.2e}, t = {t:.2f}")

    return x, history


def fista_backtracking(
    f: Callable[[np.ndarray], float],
    grad_f: Callable[[np.ndarray], np.ndarray],
    prox_g: Callable[[np.ndarray, float], np.ndarray],
    g: Callable[[np.ndarray], float],
    x0: np.ndarray,
    L_init: float = 1.0,
    eta: float = 2.0,
    max_iter: int = 1000,
    tol: float = 1e-8
) -> Tuple[np.ndarray, list]:
    """
    FISTA with backtracking line search.

    Finds an appropriate L adaptively at each iteration.
    """
    x = x0.copy().astype(float)
    y = x.copy()
    t = 1.0
    L = L_init
    history = []

    for k in range(max_iter):
        x_old = x.copy()

        while True:
            step_size = 1.0 / L
            grad = grad_f(y)
            z = y - step_size * grad
            x_new = prox_g(z, step_size)
            f_new = f(x_new)
            f_y = f(y)

            # Check sufficient decrease
            Q_L = (f_y + np.dot(grad, x_new - y)
                   + (L / 2) * np.linalg.norm(x_new - y)**2)
            if f_new <= Q_L:
                x = x_new
                break
            L *= eta

        # Momentum update
        t_new = (1 + np.sqrt(1 + 4 * t**2)) / 2
        y = x + ((t - 1) / t_new) * (x - x_old)
        t = t_new

        L = max(L / eta, L_init)  # Try a smaller L next iteration
        history.append(f(x) + g(x))

        if np.linalg.norm(x - x_old) < tol:
            break

    return x, history


# Comparison: ISTA vs FISTA
if __name__ == "__main__":
    # Ill-conditioned Lasso problem
    np.random.seed(42)
    n, d = 500, 200
    X = np.random.randn(n, d)
    X[:, :10] *= 10  # Create ill-conditioning
    w_true = np.zeros(d)
    w_true[:10] = np.random.randn(10)
    y = X @ w_true + 0.1 * np.random.randn(n)

    L = np.linalg.norm(X, ord=2) ** 2
    lambda_ = 0.1 * np.max(np.abs(X.T @ y))

    def f(w):
        return 0.5 * np.sum((y - X @ w)**2)

    def grad_f(w):
        return -X.T @ (y - X @ w)

    def prox_g(w, t):
        return np.sign(w) * np.maximum(np.abs(w) - t * lambda_, 0)

    def g(w):
        return lambda_ * np.sum(np.abs(w))

    # Run both algorithms (proximal_gradient_descent is defined in the previous listing)
    w0 = np.zeros(d)
    w_ista, hist_ista = proximal_gradient_descent(
        f, grad_f, prox_g, w0, L, max_iter=500
    )
    w_fista, hist_fista = fista(
        f, grad_f, prox_g, w0, L, max_iter=500
    )

    print(f"ISTA final objective: {f(w_ista) + g(w_ista):.6f}")
    print(f"FISTA final objective: {f(w_fista) + g(w_fista):.6f}")
    print(f"True objective: {f(w_true) + g(w_true):.6f}")
```

FISTA can oscillate on strongly convex problems. Adaptive restart—resetting momentum when the gradient and momentum point in opposite directions—gives FISTA linear convergence on strongly convex problems while maintaining O(1/k²) on merely convex ones. Always enable restart in practice.
Convergence rates:
| Algorithm | Convex f | Strongly Convex f |
|---|---|---|
| ISTA (Proximal GD) | O(1/k) | O((1-μ/L)^k) linear |
| FISTA (no restart) | O(1/k²) | O(1/k²) |
| FISTA (with restart) | O(1/k²) | O((1-√(μ/L))^k) accelerated linear |
FISTA with restart is the recommended default—it's never worse than ISTA and often dramatically faster.
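To see the effect of the restart flag, a short usage sketch (assuming the problem setup f, grad_f, prox_g, g, L, and d from the ISTA vs FISTA comparison above is in scope): run fista twice and compare.

```python
# Toggle the adaptive-restart heuristic on the same Lasso problem; with
# verbose=True, the "converged in N iterations" messages show the difference.
w0 = np.zeros(d)
w_plain, _ = fista(f, grad_f, prox_g, w0, L, max_iter=2000, tol=1e-8,
                   verbose=True, restart=False)
w_rst, _ = fista(f, grad_f, prox_g, w0, L, max_iter=2000, tol=1e-8,
                 verbose=True, restart=True)
print("objective, no restart:  ", f(w_plain) + g(w_plain))
print("objective, with restart:", f(w_rst) + g(w_rst))
```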
ADMM is a powerful algorithm for solving problems of the form:
minimize f(x) + g(z) subject to Ax + Bz = c
It combines the decomposability of dual ascent with the convergence properties of the augmented Lagrangian method. ADMM is particularly effective when f and g have simple proximal operators individually, even if f + g composed is difficult.
The ADMM Iteration:
Forming the augmented Lagrangian with penalty ρ and scaled dual variable u:
L_ρ(x, z, u) = f(x) + g(z) + (ρ/2)‖Ax + Bz - c + u‖² - (ρ/2)‖u‖²
Repeat:
x^{k+1} = argmin_x { f(x) + (ρ/2)||Ax + Bz^k - c + u^k||² } [x-update]
z^{k+1} = argmin_z { g(z) + (ρ/2)||Ax^{k+1} + Bz - c + u^k||² } [z-update]
u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c [dual update]
In this scaled form (u = μ/ρ, where μ is the unscaled dual variable), and with the common splitting A = I, B = -I, c = 0 (i.e., the constraint x = z), the x- and z-updates are exactly proximal operators: x^{k+1} = prox_{f/ρ}(z^k - u^k) and z^{k+1} = prox_{g/ρ}(x^{k+1} + u^k).
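Before the concrete Lasso example below, here is a minimal sketch of scaled ADMM for the x = z splitting, written purely in terms of the two proximal operators (using the same prox(v, λ) calling convention as the earlier listings; admm_prox_splitting is an illustrative helper, not a library routine):

```python
import numpy as np

def admm_prox_splitting(prox_f, prox_g, x0, rho=1.0, max_iter=500, tol=1e-8):
    """Minimize f(x) + g(x) by splitting x = z and applying scaled ADMM."""
    x = x0.copy()
    z = x0.copy()
    u = np.zeros_like(x0)  # scaled dual variable
    for _ in range(max_iter):
        x = prox_f(z - u, 1.0 / rho)   # x-update: prox of f with parameter 1/ρ
        z_old = z
        z = prox_g(x + u, 1.0 / rho)   # z-update: prox of g with parameter 1/ρ
        u = u + x - z                  # scaled dual update
        # Stop when both primal and dual residuals are small
        if (np.linalg.norm(x - z) < tol
                and rho * np.linalg.norm(z - z_old) < tol):
            break
    return z
```

This form makes explicit that ADMM only needs the two proximal operators, never the prox of f + g jointly.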
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from typing import Tuple

def admm_lasso(
    X: np.ndarray,
    y: np.ndarray,
    lambda_: float,
    rho: float = 1.0,
    max_iter: int = 1000,
    tol: float = 1e-6,
    verbose: bool = False
) -> Tuple[np.ndarray, dict]:
    """
    ADMM for Lasso: min (1/2)||y - Xw||² + λ||w||₁

    Reformulate as: min (1/2)||y - Xw||² + λ||z||₁  s.t.  w - z = 0
    Then apply ADMM, splitting between the smooth (w) and non-smooth (z) parts.

    Parameters:
    -----------
    X : np.ndarray
        Feature matrix (n, d)
    y : np.ndarray
        Target vector (n,)
    lambda_ : float
        L1 regularization strength
    rho : float
        ADMM penalty parameter
    max_iter : int
        Maximum ADMM iterations
    tol : float
        Tolerance for primal and dual residuals

    Returns:
    --------
    z : np.ndarray
        Sparse solution
    history : dict
        Convergence history
    """
    n, d = X.shape

    # Initialize
    w = np.zeros(d)
    z = np.zeros(d)
    u = np.zeros(d)  # Scaled dual variable

    # Precompute X^T X and X^T y for the w-update
    XtX = X.T @ X
    Xty = X.T @ y

    # Factorize (X^T X + ρI) once (expensive for large d, but amortized over iterations)
    chol = cho_factor(XtX + rho * np.eye(d))

    history = {'primal_res': [], 'dual_res': [], 'objective': []}

    for k in range(max_iter):
        # w-update: argmin (1/2)||y - Xw||² + (ρ/2)||w - z + u||²
        # Normal equations: (X^T X + ρI) w = X^T y + ρ(z - u)
        rhs = Xty + rho * (z - u)
        w = cho_solve(chol, rhs)

        # z-update: argmin λ||z||₁ + (ρ/2)||w - z + u||²
        # Solution: soft-threshold(w + u, λ/ρ)
        z_old = z.copy()
        z = np.sign(w + u) * np.maximum(np.abs(w + u) - lambda_ / rho, 0)

        # u-update (dual ascent)
        u = u + w - z

        # Compute residuals
        primal_res = np.linalg.norm(w - z)
        dual_res = rho * np.linalg.norm(z - z_old)

        history['primal_res'].append(primal_res)
        history['dual_res'].append(dual_res)
        history['objective'].append(
            0.5 * np.sum((y - X @ w)**2) + lambda_ * np.sum(np.abs(z))
        )

        if verbose and k % 50 == 0:
            print(f"Iter {k}: primal_res={primal_res:.2e}, dual_res={dual_res:.2e}")

        # Convergence check
        if primal_res < tol and dual_res < tol:
            if verbose:
                print(f"ADMM converged in {k + 1} iterations")
            break

    return z, history  # Return z (the sparse solution)


def admm_consensus(
    f_list: list,
    prox_list: list,
    x0: np.ndarray,
    rho: float = 1.0,
    max_iter: int = 500,
    tol: float = 1e-6
) -> np.ndarray:
    """
    Consensus ADMM for a sum of functions: min Σᵢ fᵢ(x)

    Each worker optimizes fᵢ independently, then solutions are averaged
    to reach consensus. Useful for distributed optimization where data
    is partitioned across workers.
    """
    n_workers = len(f_list)
    d = len(x0)

    # Local copies
    x_local = [x0.copy() for _ in range(n_workers)]
    # Global consensus variable
    z = x0.copy()
    # Dual variables
    u = [np.zeros(d) for _ in range(n_workers)]

    for k in range(max_iter):
        z_old = z.copy()

        # Local updates (parallelizable)
        for i in range(n_workers):
            # x_i = prox_{f_i/ρ}(z - u_i)
            x_local[i] = prox_list[i](z - u[i], 1.0 / rho)

        # Global average (consensus)
        z = np.mean([x_local[i] + u[i] for i in range(n_workers)], axis=0)

        # Dual updates
        for i in range(n_workers):
            u[i] = u[i] + x_local[i] - z

        # Check convergence
        if np.linalg.norm(z - z_old) < tol:
            break

    return z


if __name__ == "__main__":
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    # Test ADMM Lasso
    X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                           random_state=42)
    lambda_ = 0.1 * np.max(np.abs(X.T @ y))

    w_admm, history = admm_lasso(X, y, lambda_, rho=1.0, verbose=True)

    # Reference
    lasso = Lasso(alpha=lambda_ / len(y), fit_intercept=False)
    lasso.fit(X, y)

    print(f"\nADMM sparsity: {np.sum(np.abs(w_admm) > 1e-6)}")
    print(f"Sklearn sparsity: {np.sum(np.abs(lasso.coef_) > 1e-6)}")
    print(f"Difference: {np.linalg.norm(w_admm - lasso.coef_):.2e}")
```

Choosing ρ affects convergence speed significantly. Too small: slow primal convergence. Too large: slow dual convergence. Adaptive ρ schemes (increasing ρ when the primal residual dominates, decreasing it when the dual residual dominates) are common in practice. A good initial guess: ρ ≈ λ for Lasso.
Proximal methods are the foundation of modern sparse learning algorithms. Here's how they're used in practice:
Sparse Linear Models: Lasso, Elastic Net, group Lasso, and L1-regularized logistic regression all reduce to soft-thresholding (or its block variant) inside ISTA/FISTA or coordinate descent.
Matrix Problems: matrix completion and low-rank approximation minimize the nuclear norm via singular value soft-thresholding, exactly the prox_nuclear_norm operator above.
Deep Learning: proximal updates apply sparsity- or constraint-enforcing operators (soft-thresholding, clipping, projection onto norm balls) to the weights after each gradient step, as in pruning and structured-sparsity training; a minimal sketch follows below.
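A minimal NumPy sketch of that last idea, not tied to any particular framework: prox_sgd_step is a hypothetical helper, and the gradient here is a random placeholder standing in for a real backpropagated gradient.

```python
import numpy as np

def prox_sgd_step(w, grad_w, lr, lambda_):
    """One proximal-SGD-style update: gradient step on the smooth loss,
    then soft-thresholding to promote sparse weights."""
    w = w - lr * grad_w                                           # gradient step
    return np.sign(w) * np.maximum(np.abs(w) - lr * lambda_, 0)   # prox of lr·λ‖·‖₁

# Toy usage with a random "gradient"
rng = np.random.default_rng(0)
w = rng.normal(size=100)
w = prox_sgd_step(w, grad_w=rng.normal(size=100), lr=0.01, lambda_=5.0)
print("non-zero weights:", np.count_nonzero(w))
```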
| Problem Type | Recommended Algorithm | Notes |
|---|---|---|
| Lasso (moderate size) | FISTA + restart | Fast, simple, reliable |
| Lasso (large scale) | Coordinate Descent | Even faster for sparse solutions |
| Group Lasso | FISTA with block prox | Block soft-thresholding |
| Elastic Net | FISTA or Coordinate Descent | Both work well |
| Constrained problems | ADMM or Projected GD | ADMM for complex constraints |
| Distributed data | Consensus ADMM | Natural parallelization |
| Matrix completion | Proximal SVD thresholding | Singular value soft-threshold |
```python
import numpy as np

def sparse_logistic_regression_fista(
    X: np.ndarray,
    y: np.ndarray,
    lambda_: float,
    max_iter: int = 500,
    tol: float = 1e-6
) -> np.ndarray:
    """
    L1-regularized logistic regression via FISTA.

    Minimizes: (1/n)Σᵢ log(1 + exp(-yᵢ wᵀxᵢ)) + λ||w||₁
    where y ∈ {-1, +1}.
    """
    n, d = X.shape

    # Lipschitz constant of the gradient: L = ||X||₂²/(4n)
    # (the logistic loss has Hessian bounded by 1/4)
    L = np.linalg.norm(X, ord=2)**2 / (4 * n)
    step_size = 1.0 / L

    def log_loss_grad(w):
        """Gradient of the averaged logistic loss"""
        z = y * (X @ w)
        # Numerically stable sigmoid
        sig = np.where(z >= 0,
                       1 / (1 + np.exp(-z)),
                       np.exp(z) / (1 + np.exp(z)))
        return -X.T @ (y * (1 - sig)) / n

    def soft_threshold(v, thresh):
        return np.sign(v) * np.maximum(np.abs(v) - thresh, 0)

    # FISTA
    w = np.zeros(d)
    v = w.copy()
    t = 1.0

    for k in range(max_iter):
        w_old = w.copy()

        # Gradient step on the log-loss at the extrapolated point v
        grad = log_loss_grad(v)
        z = v - step_size * grad

        # Proximal step (soft-thresholding)
        w = soft_threshold(z, step_size * lambda_)

        # Momentum
        t_new = (1 + np.sqrt(1 + 4 * t**2)) / 2
        v = w + ((t - 1) / t_new) * (w - w_old)
        t = t_new

        if np.linalg.norm(w - w_old) < tol:
            print(f"Converged in {k+1} iterations")
            break

    return w


def matrix_completion_pgd(
    M_observed: np.ndarray,
    mask: np.ndarray,
    lambda_: float,
    max_iter: int = 200,
    tol: float = 1e-5
) -> np.ndarray:
    """
    Matrix completion via nuclear norm minimization.

    Minimizes: (1/2)||P_Ω(X - M)||²_F + λ||X||_*
    where P_Ω is projection onto the observed entries.
    """
    m, n = M_observed.shape

    # Initialize with the observed entries
    X = M_observed * mask

    # Lipschitz constant is 1 (the masking projection is non-expansive)
    L = 1.0
    step_size = 1.0 / L

    for k in range(max_iter):
        X_old = X.copy()

        # Gradient: P_Ω(X - M), where Ω is the observed set
        grad = (X - M_observed) * mask

        # Gradient step
        Z = X - step_size * grad

        # Proximal step: singular value soft-thresholding
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s_thresh = np.maximum(s - step_size * lambda_, 0)
        X = U @ np.diag(s_thresh) @ Vt

        # Check convergence
        if np.linalg.norm(X - X_old, 'fro') < tol:
            print(f"Converged in {k+1} iterations")
            break

    return X
```

For Lasso with dense features, FISTA is simple and effective. For Lasso with sparse solutions (many zero coefficients), coordinate descent with active set strategies is typically 5-10× faster. For general problems where coordinate updates don't have closed forms, FISTA is the default choice.
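To make that comparison concrete, here is a minimal cyclic coordinate descent for the same Lasso objective (1/2)||y - Xw||² + λ||w||₁. It is a sketch only, without the active-set and screening tricks that production solvers use; lasso_coordinate_descent is an illustrative helper, not a library routine. Each coordinate update is a scalar soft-threshold, which is why per-coordinate iterations are so cheap.

```python
import numpy as np

def lasso_coordinate_descent(X, y, lambda_, max_iter=100, tol=1e-8):
    """Cyclic coordinate descent for (1/2)||y - Xw||² + λ||w||₁."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = np.sum(X**2, axis=0)   # precompute ||X_j||² for each column
    r = y - X @ w                    # running residual

    for _ in range(max_iter):
        w_old = w.copy()
        for j in range(d):
            if col_sq[j] == 0:
                continue
            r += X[:, j] * w[j]      # remove coordinate j's contribution
            rho_j = X[:, j] @ r
            # Scalar soft-threshold gives the exact coordinate minimizer
            w[j] = np.sign(rho_j) * max(abs(rho_j) - lambda_, 0) / col_sq[j]
            r -= X[:, j] * w[j]      # add it back with the new value
        if np.linalg.norm(w - w_old) < tol:
            break
    return w
```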
Proximal methods provide a unified framework for optimizing composite objectives with non-smooth components. Their elegance lies in replacing gradients with proximal operators, enabling efficient algorithms for problems central to modern machine learning.
What's next:
The next page explores Optimization Landscapes—the geometry of loss surfaces in machine learning. Understanding landscapes helps explain why optimization succeeds or fails, particularly for non-convex deep learning objectives.
You now understand proximal methods: proximal operators, ISTA/FISTA algorithms, ADMM for splitting, and applications to sparse ML. This completes the algorithmic toolkit for composite optimization, enabling you to efficiently solve problems with non-smooth regularizers.