While gradient descent and Newton's method update all parameters simultaneously, coordinate descent takes a radically different approach: it optimizes one coordinate (parameter) at a time, holding all others fixed. This seemingly restrictive strategy turns out to be remarkably powerful for specific problem structures.
The key insight is that many optimization problems become dramatically simpler when restricted to a single variable. Complex multivariate objectives reduce to one-dimensional subproblems that often admit closed-form solutions. By cycling through coordinates, each update is exact within its dimension, leading to algorithms that are simple to implement, highly parallelizable, and often faster than full-gradient methods for problems with the right structure.
By the end of this page, you will understand cyclic and randomized coordinate descent strategies, derive closed-form coordinate updates for common ML objectives, analyze convergence rates and conditions for coordinate descent, implement efficient coordinate descent for Lasso and other L1-regularized models, and recognize when coordinate descent outperforms gradient-based alternatives.
Basic Algorithm:
Given an objective f: ℝⁿ → ℝ to minimize:
1. Initialize x⁰ ∈ ℝⁿ
2. For k = 0, 1, 2, ...:
a. Select coordinate i ∈ {1, ..., n}
b. Solve: xᵢᵏ⁺¹ = argmin_{t} f(x₁ᵏ, ..., xᵢ₋₁ᵏ, t, xᵢ₊₁ᵏ, ..., xₙᵏ)
c. Set xⱼᵏ⁺¹ = xⱼᵏ for j ≠ i
3. Until convergence
The magic happens in step 2b: the multivariate problem becomes a univariate optimization in variable t. For many ML objectives, this univariate problem has an explicit solution.
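To make the loop concrete, here is a minimal sketch of the generic cyclic scheme. The names (`cyclic_coordinate_descent`, `argmin_1d`) are placeholders of my own; `argmin_1d` stands for whatever closed-form or numerical univariate solver your objective admits in step 2b.

```python
import numpy as np

def cyclic_coordinate_descent(argmin_1d, x0, n_sweeps=100, tol=1e-8):
    """Generic cyclic CD: argmin_1d(x, i) returns the scalar t that minimizes
    f along coordinate i with the other coordinates of x held fixed."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_sweeps):
        x_prev = x.copy()
        for i in range(len(x)):               # step (a): cycle through coordinates
            x[i] = argmin_1d(x, i)            # step (b): exact univariate minimization
        if np.max(np.abs(x - x_prev)) < tol:  # step 3: stop when a full sweep barely moves
            break
    return x

# Example: f(x) = 0.5 * x^T Q x - b^T x has the closed-form coordinate update
# x_i = (b_i - sum_{j != i} Q_ij x_j) / Q_ii
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
argmin_quad = lambda x, i: (b[i] - Q[i] @ x + Q[i, i] * x[i]) / Q[i, i]
print(cyclic_coordinate_descent(argmin_quad, np.zeros(2)))  # ≈ np.linalg.solve(Q, b)
```

Swapping the inner `for i in range(len(x))` loop for a random or greedy choice of i gives the randomized and greedy variants summarized next.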
Coordinate selection strategies:
| Strategy | Advantages | Disadvantages | Best For |
|---|---|---|---|
| Cyclic | Simple, deterministic, cache-friendly | Can be slow for correlated coordinates | Dense, well-conditioned problems |
| Randomized | Better theoretical guarantees, breaks symmetry | Overhead from random number generation | Theoretical analysis, parallel settings |
| Greedy | Fastest per-iteration progress | O(n) cost to find max gradient | Small n, sparse updates |
| Importance | Adapts to problem structure | Requires Lipschitz constants | Non-uniform coordinate scaling |
Coordinate descent works because exact minimization along any direction (including a coordinate direction) never increases the objective. For smooth convex functions, cycling through all coordinates drives the iterates to the global minimum. The efficiency comes from cheap univariate solutions, not from choosing optimal directions.
The convergence of coordinate descent depends critically on the objective function's structure. We analyze conditions guaranteeing convergence and characterize convergence rates.
Sufficient conditions for convergence: coordinate descent reaches the global minimum when f is convex and continuously differentiable, and more generally when f(x) = g(x) + Σᵢ hᵢ(xᵢ) with g smooth and convex and each hᵢ convex but possibly non-smooth. The crucial requirement is that any non-smooth part be separable across coordinates.
When coordinate descent can fail:
For non-smooth, non-separable functions, coordinate descent can get stuck. Consider:
f(x, y) = |x - 1| + |y - 1| + |x + y - 2|
The unique global minimum is (1, 1), where f = 0. Now consider the point (0, 2), where f = 2: moving x alone gives f(t, 2) = |t - 1| + 1 + |t| ≥ 2, and moving y alone gives f(0, t) = 1 + |t - 1| + |t - 2| ≥ 2, so no single-coordinate move can decrease the objective. The subdifferential at (0, 2) does not contain 0, so it is not a minimum. Yet:
Coordinate descent is stuck at a non-optimal point!
```python
import numpy as np
import matplotlib.pyplot as plt

# Demo: coordinate descent stuck on a non-smooth, non-separable function
def f_stuck(x, y):
    """Non-smooth function whose global minimum is f(1, 1) = 0,
    but which has coordinate-wise minima elsewhere (e.g. at (0, 2))."""
    return abs(x - 1) + abs(y - 1) + abs(x + y - 2)

# Visualize the contours
x_range = np.linspace(-1, 3, 100)
y_range = np.linspace(-1, 3, 100)
X, Y = np.meshgrid(x_range, y_range)
Z = np.abs(X - 1) + np.abs(Y - 1) + np.abs(X + Y - 2)
plt.contour(X, Y, Z, levels=20)  # call plt.show() to display

# (0, 2) is a coordinate-wise minimum: no single-coordinate move improves it,
# yet the global minimum is at (1, 1) with value 0
print("f(1, 1) =", f_stuck(1, 1))   # = 0 (global minimum)
print("f(0, 2) =", f_stuck(0, 2))   # = 2 (coordinate descent fixed point)
print("best single-coordinate move from (0, 2):",
      min(min(f_stuck(t, 2) for t in x_range),
          min(f_stuck(0, t) for t in y_range)))  # ≈ 2: no improvement possible

# For smooth strongly convex functions, CD converges
def f_quadratic(x, y, Q):
    """Quadratic: 0.5 * [x,y] @ Q @ [x,y]^T"""
    v = np.array([x, y])
    return 0.5 * v @ Q @ v

# Condition number affects convergence rate
def cd_quadratic(Q, x0, n_iters=100):
    """Cyclic coordinate descent on a quadratic"""
    x = np.array(x0, dtype=float)
    history = [x.copy()]
    for _ in range(n_iters):
        for i in range(len(x)):
            # Univariate minimization for a quadratic:
            # d/dx_i (0.5 x^T Q x) = (Qx)_i = 0
            # Q_ii * x_i + sum_{j≠i} Q_ij * x_j = 0
            # x_i = -sum_{j≠i} Q_ij * x_j / Q_ii
            s = sum(Q[i, j] * x[j] for j in range(len(x)) if j != i)
            x[i] = -s / Q[i, i]
        history.append(x.copy())
    return np.array(history)

# Well-conditioned: fast convergence
Q_good = np.array([[2.0, 0.5], [0.5, 2.0]])
history_good = cd_quadratic(Q_good, [1.0, 1.0], 20)

# Ill-conditioned: slow convergence
Q_bad = np.array([[1.0, 0.99], [0.99, 1.0]])
history_bad = cd_quadratic(Q_bad, [1.0, 1.0], 100)

def sweeps_to_tol(history, tol=1e-6):
    """First sweep whose iterate is within tol of the optimum (the origin), or None."""
    hits = np.where(np.linalg.norm(history, axis=1) < tol)[0]
    return hits[0] if len(hits) else None

print(f"Well-conditioned (κ={np.linalg.cond(Q_good):.1f}): "
      f"reaches ||x|| < 1e-6 after {sweeps_to_tol(history_good)} sweeps")
print(f"Ill-conditioned (κ={np.linalg.cond(Q_bad):.1f}): "
      f"sweeps needed = {sweeps_to_tol(history_bad)} (None: not converged within 100 sweeps)")
```

Convergence rate for smooth, strongly convex functions:
For a function f that is L-smooth and μ-strongly convex, randomized coordinate descent achieves:
E[f(x^k)] - f(x*) ≤ (1 - μ/(nL))^k · [f(x⁰) - f(x*)]
Comparing to gradient descent's rate (1 - μ/L)^k: the contraction factor is weaker by a factor of n, so randomized coordinate descent needs roughly n times as many coordinate updates as gradient descent needs full-gradient steps to reach the same accuracy. Since each coordinate update costs roughly 1/n of a full gradient step, the total work is comparable, and it is often lower in practice because the univariate updates have cheap closed forms and can exploit sparsity.
Coordinate Lipschitz constants:
When per-coordinate Lipschitz constants Lᵢ vary significantly, importance sampling with probabilities pᵢ ∝ Lᵢ can dramatically accelerate convergence. The iteration complexity becomes O((Σᵢ Lᵢ)/μ · log(1/ε)), compared with O(n·maxᵢ Lᵢ/μ · log(1/ε)) for uniform sampling, a substantial improvement when maxᵢ Lᵢ >> (1/n)Σᵢ Lᵢ.
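As a sketch of how importance sampling changes only the way the coordinate index is drawn (the function name and the toy quadratic below are my own, and the exact coordinate step grad/Lᵢ is assumed):

```python
import numpy as np

def importance_sampled_cd(Q, b, n_iters=2000, seed=0):
    """Randomized CD on f(x) = 0.5 x^T Q x - b^T x, drawing coordinate i with
    probability proportional to its coordinate Lipschitz constant L_i = Q_ii."""
    rng = np.random.default_rng(seed)
    L = np.diag(Q).copy()          # coordinate-wise Lipschitz constants
    probs = L / L.sum()            # p_i ∝ L_i
    x = np.zeros(len(b))
    for _ in range(n_iters):
        i = rng.choice(len(b), p=probs)
        grad_i = Q[i] @ x - b[i]   # partial derivative along coordinate i
        x[i] -= grad_i / L[i]      # exact minimization along coordinate i
    return x

Q = np.diag([100.0, 1.0, 1.0]) + 0.1   # very uneven coordinate scaling
b = np.ones(3)
print(importance_sampled_cd(Q, b))      # ≈ np.linalg.solve(Q, b)
```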
When coordinates are highly correlated (high off-diagonal Hessian entries), coordinate descent takes many iterations because progress in one coordinate is undone by subsequent updates in correlated coordinates. This is the 'zigzagging' phenomenon. Preconditioning or block coordinate descent can help.
The power of coordinate descent lies in exploiting closed-form solutions for univariate subproblems. Let's derive these for common ML objectives.
1. Least Squares Regression:
Objective: f(w) = ½‖y - Xw‖²
Holding all wⱼ (j ≠ i) fixed, the univariate problem in wᵢ is:
min_{wᵢ} ½‖y - Σⱼ≠ᵢ wⱼxⱼ - wᵢxᵢ‖² = min_{wᵢ} ½‖rᵢ - wᵢxᵢ‖²
where rᵢ = y - Σⱼ≠ᵢ wⱼxⱼ is the partial residual. Setting the derivative to zero:
xᵢᵀ(rᵢ - wᵢxᵢ) = 0 → wᵢ = xᵢᵀrᵢ / ‖xᵢ‖²
2. Ridge Regression (L2-regularized):
Objective: f(w) = ½‖y - Xw‖² + (λ/2)‖w‖²
The coordinate update becomes:
wᵢ = xᵢᵀrᵢ / (‖xᵢ‖² + λ)
The L2 penalty simply adds λ to the denominator—elegantly shrinking the estimate.
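A minimal sketch of these two updates in code (the helper name `cd_ridge` is mine; setting λ = 0 recovers the plain least-squares update):

```python
import numpy as np

def cd_ridge(X, y, lam=1.0, n_sweeps=100):
    """Cyclic coordinate descent for 0.5*||y - Xw||^2 + (lam/2)*||w||^2,
    using the closed-form update w_i = x_i^T r_i / (||x_i||^2 + lam)."""
    n, d = X.shape
    w = np.zeros(d)
    r = y - X @ w                    # full residual, maintained incrementally
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for i in range(d):
            r += X[:, i] * w[i]      # partial residual r_i = y - sum_{j != i} w_j x_j
            w[i] = (X[:, i] @ r) / (col_sq[i] + lam)
            r -= X[:, i] * w[i]      # restore the full residual with the new w_i
    return w

X = np.random.randn(50, 5)
y = X @ np.array([1.0, -2.0, 0.0, 3.0, 0.5]) + 0.1 * np.random.randn(50)
w_cd = cd_ridge(X, y, lam=1.0)
w_exact = np.linalg.solve(X.T @ X + 1.0 * np.eye(5), X.T @ y)
print(np.allclose(w_cd, w_exact, atol=1e-6))
```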
```python
import numpy as np
from typing import Tuple

def soft_threshold(x: float, threshold: float) -> float:
    """
    Soft-thresholding operator (proximal of L1 norm).
    S_λ(x) = sign(x) * max(|x| - λ, 0)
    """
    return np.sign(x) * max(abs(x) - threshold, 0)

def cd_lasso(
    X: np.ndarray,
    y: np.ndarray,
    lambda_: float,
    max_iter: int = 1000,
    tol: float = 1e-6,
    warm_start: np.ndarray = None
) -> Tuple[np.ndarray, list]:
    """
    Coordinate descent for Lasso (L1-regularized least squares).

    Minimizes: (1/2n)||y - Xw||² + λ||w||₁

    Parameters:
    -----------
    X : np.ndarray
        Feature matrix (n_samples, n_features)
    y : np.ndarray
        Target vector (n_samples,)
    lambda_ : float
        L1 regularization strength
    max_iter : int
        Maximum number of full passes over coordinates
    tol : float
        Convergence tolerance on weight change
    warm_start : np.ndarray, optional
        Initial weights (defaults to zero)

    Returns:
    --------
    w : np.ndarray
        Optimal weights
    history : list
        Objective value at each iteration
    """
    n_samples, n_features = X.shape

    # Normalize lambda for sample size
    lambda_scaled = lambda_ * n_samples

    # Precompute column norms (for denominator)
    col_norms_sq = np.sum(X ** 2, axis=0)

    # Initialize weights
    if warm_start is not None:
        w = warm_start.copy()
    else:
        w = np.zeros(n_features)

    # Initialize residual: r = y - Xw
    residual = y - X @ w

    history = []

    for iteration in range(max_iter):
        w_old = w.copy()

        # Cyclic coordinate descent
        for i in range(n_features):
            if col_norms_sq[i] < 1e-10:
                continue  # Skip zero columns

            # Add back contribution of current w_i to residual
            # (Efficiently compute partial residual)
            residual += X[:, i] * w[i]

            # Compute correlation: x_i^T * r_i
            rho_i = X[:, i] @ residual

            # Soft-thresholding update for Lasso
            w[i] = soft_threshold(rho_i, lambda_scaled) / col_norms_sq[i]

            # Update residual with new w_i
            residual -= X[:, i] * w[i]

        # Compute objective for history
        loss = 0.5 * np.sum(residual ** 2) / n_samples + lambda_ * np.sum(np.abs(w))
        history.append(loss)

        # Check convergence
        if np.max(np.abs(w - w_old)) < tol:
            print(f"Converged in {iteration + 1} iterations")
            break

    return w, history

def cd_elastic_net(
    X: np.ndarray,
    y: np.ndarray,
    lambda1: float,
    lambda2: float,
    max_iter: int = 1000,
    tol: float = 1e-6
) -> np.ndarray:
    """
    Coordinate descent for Elastic Net.

    Minimizes: (1/2n)||y - Xw||² + λ₁||w||₁ + (λ₂/2)||w||₂²

    The update rule combines soft-thresholding (L1) with shrinkage (L2).
    """
    n_samples, n_features = X.shape
    col_norms_sq = np.sum(X ** 2, axis=0)
    w = np.zeros(n_features)
    residual = y.copy()

    for _ in range(max_iter):
        w_old = w.copy()

        for i in range(n_features):
            if col_norms_sq[i] < 1e-10:
                continue

            residual += X[:, i] * w[i]
            rho_i = X[:, i] @ residual

            # Elastic Net update: soft-threshold then shrink
            # w_i = S_{λ₁n}(x_i^T r) / (||x_i||² + λ₂n)
            w[i] = soft_threshold(rho_i, lambda1 * n_samples) / (col_norms_sq[i] + lambda2 * n_samples)

            residual -= X[:, i] * w[i]

        if np.max(np.abs(w - w_old)) < tol:
            break

    return w

if __name__ == "__main__":
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    # Generate sparse regression problem
    X, y, true_coef = make_regression(
        n_samples=200, n_features=100, n_informative=10,
        noise=10, coef=True, random_state=42
    )

    # Our implementation
    lambda_ = 0.1
    w_cd, history = cd_lasso(X, y, lambda_)

    # Sklearn reference
    lasso_sklearn = Lasso(alpha=lambda_, fit_intercept=False)
    lasso_sklearn.fit(X, y)

    print(f"CD non-zero weights: {np.sum(np.abs(w_cd) > 1e-6)}")
    print(f"SKL non-zero weights: {np.sum(np.abs(lasso_sklearn.coef_) > 1e-6)}")
    print(f"Max difference: {np.max(np.abs(w_cd - lasso_sklearn.coef_)):.2e}")
```

3. Lasso Regression (L1-regularized):
Objective: f(w) = ½‖y - Xw‖² + λ‖w‖₁
The L1 penalty is non-smooth, but the univariate subproblem still has a closed-form solution via the soft-thresholding operator:
S_λ(x) = sign(x) · max(|x| - λ, 0)
The coordinate update becomes:
wᵢ = S_λ(xᵢᵀrᵢ) / ‖xᵢ‖²
This elegant formula is why coordinate descent is the standard solver for Lasso. The soft-thresholding induces exact zeros at optimum, providing automatic feature selection.
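A quick numerical illustration of those exact zeros (the correlation values below are arbitrary):

```python
import numpy as np

def soft_threshold(x, lam):
    """S_lam(x) = sign(x) * max(|x| - lam, 0)"""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rho = np.array([3.0, 0.4, -0.2, -2.5])   # correlations x_i^T r_i
print(soft_threshold(rho, 1.0))          # [ 2.   0.  -0.  -1.5]: small entries snap to exactly 0
```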
The key implementation trick is maintaining the residual r = y - Xw incrementally. When updating wᵢ from wᵢᵒˡᵈ to wᵢⁿᵉʷ, update r ← r - (wᵢⁿᵉʷ - wᵢᵒˡᵈ)xᵢ in O(n) time. This avoids recomputing the full residual at O(nd) cost, making each coordinate update O(n) instead of O(nd).
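A small check of this bookkeeping, using arbitrary random data purely for the demonstration:

```python
import numpy as np

# One coordinate update with an incrementally maintained residual r = y - Xw.
X = np.random.randn(1000, 200)
y = np.random.randn(1000)
w = np.random.randn(200)
r = y - X @ w                      # computed once up front, O(n*d)

i, w_new = 7, 0.3                  # update coordinate i to a new value
r -= (w_new - w[i]) * X[:, i]      # O(n) incremental residual update
w[i] = w_new

print(np.allclose(r, y - X @ w))   # True: matches a full O(n*d) recomputation
```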
When coordinates are naturally grouped or highly correlated within groups, block coordinate descent updates an entire block of variables simultaneously. This is a powerful generalization that can capture within-block dependencies.
Block structure:
Partition coordinates into blocks: {1, ..., n} = B₁ ∪ B₂ ∪ ... ∪ Bᴷ. Each iteration selects a block Bⱼ, minimizes f jointly over the variables in Bⱼ with all other blocks held fixed, and leaves the remaining blocks unchanged.
Applications in ML:
Matrix Factorization (UV decomposition): alternately fix the item factors V and solve for the user factors U, then fix U and solve for V; each block subproblem is a regularized least-squares problem (Alternating Least Squares).
Group Lasso: the penalty λ Σ_g ‖w_g‖₂ couples the coordinates within each group, so an entire group must be updated jointly via block soft-thresholding.
Multi-task Learning: the parameters of each task form a natural block; blocks are updated one task at a time while shared structure couples them.
```python
import numpy as np
from typing import Tuple

def als_matrix_factorization(
    R: np.ndarray,
    k: int,
    lambda_reg: float = 0.1,
    max_iter: int = 50,
    tol: float = 1e-4
) -> Tuple[np.ndarray, np.ndarray, list]:
    """
    Alternating Least Squares (ALS) for matrix factorization.

    Block coordinate descent: alternately fix U and update V,
    then fix V and update U.

    Minimizes: ||R - U @ V^T||²_F + λ(||U||²_F + ||V||²_F)
    over observed entries only.

    Parameters:
    -----------
    R : np.ndarray
        Rating matrix (n_users, n_items), NaN for missing entries
    k : int
        Latent dimension (rank)
    lambda_reg : float
        L2 regularization strength
    max_iter : int
        Maximum alternating iterations
    tol : float
        Convergence tolerance on reconstruction error

    Returns:
    --------
    U : np.ndarray
        User factors (n_users, k)
    V : np.ndarray
        Item factors (n_items, k)
    history : list
        Reconstruction error at each iteration
    """
    n_users, n_items = R.shape
    observed = ~np.isnan(R)            # Mask of observed entries
    R_obs = np.where(observed, R, 0)   # Replace NaN with 0 for computation

    # Initialize factors randomly
    np.random.seed(42)
    U = np.random.randn(n_users, k) * 0.1
    V = np.random.randn(n_items, k) * 0.1

    history = []

    for it in range(max_iter):
        # Block 1: Fix V, solve for each row of U
        for i in range(n_users):
            observed_items = observed[i, :]
            V_obs = V[observed_items, :]        # Items rated by user i
            R_obs_i = R_obs[i, observed_items]
            if len(R_obs_i) == 0:
                continue
            # Solve: (V_obs^T V_obs + λI) u_i = V_obs^T r_i
            A = V_obs.T @ V_obs + lambda_reg * np.eye(k)
            b = V_obs.T @ R_obs_i
            U[i, :] = np.linalg.solve(A, b)

        # Block 2: Fix U, solve for each row of V
        for j in range(n_items):
            observed_users = observed[:, j]
            U_obs = U[observed_users, :]        # Users who rated item j
            R_obs_j = R_obs[observed_users, j]
            if len(R_obs_j) == 0:
                continue
            # Solve: (U_obs^T U_obs + λI) v_j = U_obs^T r_j
            A = U_obs.T @ U_obs + lambda_reg * np.eye(k)
            b = U_obs.T @ R_obs_j
            V[j, :] = np.linalg.solve(A, b)

        # Compute reconstruction error on observed entries only
        pred = U @ V.T
        error = np.sqrt(np.mean((R[observed] - pred[observed]) ** 2))
        history.append(error)

        if it > 0 and abs(history[-1] - history[-2]) < tol:
            print(f"Converged in {it + 1} iterations")
            break

    return U, V, history

def group_lasso_cd(
    X: np.ndarray,
    y: np.ndarray,
    groups: list,
    lambda_: float,
    max_iter: int = 1000,
    tol: float = 1e-6
) -> np.ndarray:
    """
    Block coordinate descent for Group Lasso.

    Minimizes: (1/2n)||y - Xw||² + λ Σ_g ||w_g||₂

    Each group's weights are updated together using a block
    soft-thresholding operator (exact when the group's columns are
    orthonormal; an approximation otherwise).

    Parameters:
    -----------
    X : np.ndarray
        Feature matrix (n_samples, n_features)
    y : np.ndarray
        Target vector (n_samples,)
    groups : list
        List of index arrays, one per group
    lambda_ : float
        Group regularization strength

    Returns:
    --------
    w : np.ndarray
        Optimal weights
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    residual = y.copy()

    for _ in range(max_iter):
        w_old = w.copy()

        for g_idx in groups:
            g_idx = np.array(g_idx)
            X_g = X[:, g_idx]

            # Add back group contribution
            residual += X_g @ w[g_idx]

            # Compute group update direction
            z = X_g.T @ residual
            z_norm = np.linalg.norm(z)

            # Block soft-thresholding
            if z_norm > lambda_ * n_samples:
                # Solve (X_g^T X_g + εI) w_g = z, then shrink the whole group
                gram = X_g.T @ X_g
                shrinkage = 1 - (lambda_ * n_samples) / z_norm
                w[g_idx] = shrinkage * np.linalg.solve(
                    gram + 1e-6 * np.eye(len(g_idx)), z
                )
            else:
                # Entire group set to zero
                w[g_idx] = 0

            # Update residual
            residual -= X_g @ w[g_idx]

        if np.max(np.abs(w - w_old)) < tol:
            break

    return w
```

Larger blocks capture more coordinate dependencies but require solving larger subproblems. The extreme case of a single block containing every coordinate amounts to solving the original problem in one shot (and, if each block is updated with a gradient step rather than an exact minimization, block coordinate descent reduces to ordinary gradient descent). Choose block sizes that balance subproblem complexity against the benefit of joint updates.
Stochastic Coordinate Descent (SCD) combines the coordinate-wise framework with stochastic approximation, enabling scaling to massive datasets that don't fit in memory.
Basic idea:
Instead of computing the exact coordinate gradient, use a stochastic estimate from a mini-batch:
∂f/∂wᵢ ≈ (1/|B|) Σ_{j∈B} ∂fⱼ/∂wᵢ
For least squares with f(w) = (1/n)Σⱼ (yⱼ - wᵀxⱼ)²:
∂f/∂wᵢ ≈ (1/|B|) Σ_{j∈B} -2xⱼᵢ(yⱼ - wᵀxⱼ)
Doubly stochastic coordinate descent: at each step, sample both a random coordinate i and a random example (or mini-batch) j, and update wᵢ using only the partial derivative ∂fⱼ/∂wᵢ.
With per-sample residuals cached and updated incrementally, this achieves O(1) cost per update, versus a cost proportional to the number of samples for an exact coordinate update and proportional to the number of features for a stochastic gradient step.
```python
import numpy as np

def stochastic_cd_least_squares(
    X: np.ndarray,
    y: np.ndarray,
    n_iters: int = 10000,
    step_size: float = 0.01,
    batch_size: int = 1
) -> np.ndarray:
    """
    Stochastic Coordinate Descent for least squares.

    Doubly stochastic: samples both coordinate and data points.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)

    for t in range(n_iters):
        # Sample coordinate uniformly
        i = np.random.randint(n_features)

        # Sample mini-batch
        batch_idx = np.random.choice(n_samples, batch_size, replace=False)
        X_batch = X[batch_idx, :]
        y_batch = y[batch_idx]

        # Stochastic gradient for coordinate i
        residuals = y_batch - X_batch @ w
        grad_i = -2 * np.mean(X_batch[:, i] * residuals)

        # Decaying step size for convergence
        eta_t = step_size / (1 + 0.001 * t)

        # Update single coordinate
        w[i] -= eta_t * grad_i

    return w

def sdca_least_squares(
    X: np.ndarray,
    y: np.ndarray,
    lambda_: float,
    n_epochs: int = 50
) -> np.ndarray:
    """
    Stochastic Dual Coordinate Ascent (SDCA) for L2-regularized least squares.

    Works in the dual space where coordinates correspond to training
    examples. Often more efficient than primal SCD.

    Minimizes: (λ/2)||w||² + (1/n)Σᵢ (1/2)(y_i - wᵀx_i)²
    """
    n_samples, n_features = X.shape

    # Dual variables (one per sample)
    alpha = np.zeros(n_samples)

    # Primal variable: w = (1/λn) Σᵢ αᵢ xᵢ
    w = np.zeros(n_features)

    # Precompute ||x_i||² for each sample
    x_norms_sq = np.sum(X ** 2, axis=1)

    for epoch in range(n_epochs):
        # Random permutation of samples
        perm = np.random.permutation(n_samples)

        for i in perm:
            x_i = X[i, :]
            y_i = y[i]

            # Compute optimal dual update for coordinate i
            # Δαᵢ = (yᵢ - wᵀxᵢ - αᵢ) / (1 + ||xᵢ||²/(λn))
            residual = y_i - np.dot(x_i, w) - alpha[i]
            denom = 1 + x_norms_sq[i] / (lambda_ * n_samples)
            delta_alpha = residual / denom

            # Update dual and primal variables
            alpha[i] += delta_alpha
            w += delta_alpha * x_i / (lambda_ * n_samples)

    return w

# Comparison
if __name__ == "__main__":
    from sklearn.datasets import make_regression

    X, y = make_regression(n_samples=10000, n_features=100,
                           noise=0.1, random_state=42)

    # Optimal solution for comparison
    from sklearn.linear_model import Ridge
    ridge = Ridge(alpha=0.1, fit_intercept=False)
    ridge.fit(X, y)
    w_opt = ridge.coef_

    # SDCA
    w_sdca = sdca_least_squares(X, y, lambda_=0.1 / len(y), n_epochs=20)
    print(f"SDCA error vs optimal: {np.linalg.norm(w_sdca - w_opt):.4f}")
```

SDCA: Stochastic Dual Coordinate Ascent:
For regularized empirical risk minimization, working in the dual space is often more efficient. In the dual, coordinates correspond to training examples rather than features. SDCA achieves linear convergence for smooth losses (and O(1/ε)-type rates for general Lipschitz losses), and its duality gap can be computed cheaply and used as a certificate of accuracy when deciding to stop.
When to use stochastic coordinate descent: when both the number of examples and the number of features are very large, when the data cannot be held in memory and must be streamed, or when only a moderately accurate solution is needed. For small and medium problems, exact cyclic coordinate descent is usually simpler and faster.
Coordinate descent's simplicity makes it amenable to parallelization, though care is needed to maintain correctness when multiple workers update simultaneously.
Naive parallelization problems:
If two workers simultaneously update coordinates i and j, both compute their updates against the same stale residual. When the columns xᵢ and xⱼ are correlated, the combined step double-counts the shared direction and can overshoot, slowing convergence or even increasing the objective.
Hogwild! (asynchronous parallel):
For sparse problems, Hogwild! ignores conflicts: workers read and write the shared parameter vector without any locking. Because each update touches only the few coordinates present in a sparse example, collisions between workers are rare, and the occasional stale read or lost update behaves like bounded noise, so near-linear speedups are possible with essentially unchanged convergence.
```python
import numpy as np
import threading

def parallel_cd_chunked(
    X: np.ndarray,
    y: np.ndarray,
    lambda_: float,
    n_workers: int = 4,
    n_epochs: int = 10
) -> np.ndarray:
    """
    Parallel coordinate descent with coordinate partitioning.

    Each worker owns a disjoint set of coordinates and updates them
    in each epoch. Workers synchronize between epochs.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)

    # Partition coordinates among workers
    coords_per_worker = np.array_split(np.arange(n_features), n_workers)

    def worker_update(worker_id, w_shared, residual_shared):
        """Update coordinates owned by this worker"""
        my_coords = coords_per_worker[worker_id]
        for i in my_coords:
            col_norm_sq = np.sum(X[:, i] ** 2)
            if col_norm_sq < 1e-10:
                continue
            # Read current residual (slightly stale is OK)
            rho = X[:, i] @ residual_shared
            # Compute update
            w_old = w_shared[i]
            w_new = soft_threshold(rho, lambda_ * n_samples) / col_norm_sq
            # Apply update
            w_shared[i] = w_new
            # Update residual (atomic operation in practice)
            residual_shared -= (w_new - w_old) * X[:, i]

    residual = y - X @ w

    for epoch in range(n_epochs):
        threads = []
        for wid in range(n_workers):
            t = threading.Thread(target=worker_update, args=(wid, w, residual))
            threads.append(t)
            t.start()
        for t in threads:
            t.join()

        # Full residual recomputation for stability
        if (epoch + 1) % 5 == 0:
            residual = y - X @ w

    return w

def distributed_cd_admm(
    X_partitions: list,
    y_partitions: list,
    lambda_: float,
    rho: float = 1.0,
    n_iters: int = 50
) -> np.ndarray:
    """
    Distributed Lasso via ADMM (Alternating Direction Method of Multipliers).

    Each worker has a partition of the data and maintains a local copy
    of w. Workers solve local subproblems then average and project globally.

    Minimizes: (1/2n) Σ_k ||y_k - X_k w||² + λ||w||₁
    where k indexes data partitions across workers.
    """
    n_workers = len(X_partitions)
    n_features = X_partitions[0].shape[1]

    # Local copies of w for each worker
    w_local = [np.zeros(n_features) for _ in range(n_workers)]
    # Global consensus variable
    z = np.zeros(n_features)
    # Dual variables (scaled form)
    u = [np.zeros(n_features) for _ in range(n_workers)]

    for it in range(n_iters):
        # Worker updates (can be parallelized)
        for k in range(n_workers):
            X_k = X_partitions[k]
            y_k = y_partitions[k]
            n_k = len(y_k)
            # Solve: min (1/2n_k)||y_k - X_k w||² + (ρ/2)||w - z + u_k||²
            # This is a ridge regression problem
            A = X_k.T @ X_k / n_k + rho * np.eye(n_features)
            b = X_k.T @ y_k / n_k + rho * (z - u[k])
            w_local[k] = np.linalg.solve(A, b)

        # Global consensus update (average + soft-thresholding)
        w_avg = np.mean([w_local[k] + u[k] for k in range(n_workers)], axis=0)
        z_new = soft_threshold_vec(w_avg, lambda_ / (rho * n_workers))

        # Dual update
        for k in range(n_workers):
            u[k] = u[k] + w_local[k] - z_new

        z = z_new

    return z

def soft_threshold(x: float, threshold: float) -> float:
    return np.sign(x) * max(abs(x) - threshold, 0)

def soft_threshold_vec(x: np.ndarray, threshold: float) -> np.ndarray:
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0)
```

ADMM (Alternating Direction Method of Multipliers) enables distributed coordinate descent by introducing consensus constraints. Each worker solves a local subproblem on its data partition, then workers average their solutions and project onto the constraint set. This framework naturally handles regularization and scales to many workers.
Implementing coordinate descent efficiently requires attention to several practical details:
1. Feature normalization:
Coordinate descent implicitly assumes comparable scales across coordinates. If feature j has range [0, 1000] while feature k has range [0, 1], updates to j will be much smaller. Standardize features to unit variance before optimization.
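For example, a plain NumPy sketch of column standardization (sklearn's StandardScaler performs the equivalent transformation):

```python
import numpy as np

def standardize_columns(X):
    """Center each feature and scale it to unit variance so that
    coordinate updates operate on comparable scales."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0            # leave constant columns unscaled
    return (X - mean) / std, mean, std

X = np.column_stack([np.random.uniform(0, 1000, 500), np.random.uniform(0, 1, 500)])
X_std, mean, std = standardize_columns(X)
print(X_std.std(axis=0))           # ≈ [1. 1.]
```

Remember to map coefficients back to the original scale (divide by each column's standard deviation) after fitting on standardized features.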
2. Warm starting:
For regularization path computation (e.g., Lasso for varying λ), initialize w(λ_new) from the solution at w(λ_old). The solutions change continuously, so warm starting dramatically reduces iterations needed.
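For instance, reusing the cd_lasso function sketched earlier (which accepts a warm_start argument), a regularization path could be computed along these lines:

```python
import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

# Solve from the largest lambda (sparsest solution) down to the smallest,
# warm-starting each solve from the previous solution on the path.
lambdas = np.logspace(1, -2, 20)
w = np.zeros(X.shape[1])
path = []
for lam in lambdas:
    w, _ = cd_lasso(X, y, lam, warm_start=w)   # cd_lasso as defined above
    path.append(w.copy())

print("non-zeros along the path:", [int(np.sum(np.abs(w) > 1e-6)) for w in path])
```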
3. Active set strategies:
For Lasso, many coordinates become exactly zero. An 'active set' tracks potentially non-zero coordinates. After initial passes, only update active coordinates, periodically checking if inactive ones should enter. This reduces work by 10-100x for sparse solutions.
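A simplified sketch of the idea, using the same (1/2n)-scaled Lasso objective as the implementation above (the function name is mine, and production solvers add KKT-based checks before declaring convergence):

```python
import numpy as np

def cd_lasso_active_set(X, y, lam, n_sweeps=100, tol=1e-6):
    """Lasso CD for (1/2n)||y - Xw||^2 + lam*||w||_1 that mostly sweeps only the
    currently non-zero ('active') coordinates, with periodic full sweeps so that
    inactive coordinates get a chance to (re)enter the active set."""
    n, d = X.shape
    w = np.zeros(d)
    r = y.astype(float).copy()                     # residual y - Xw (w starts at zero)
    col_sq = (X ** 2).sum(axis=0)

    for t in range(n_sweeps):
        full_sweep = (t % 10 == 0)
        coords = range(d) if full_sweep else np.flatnonzero(w)
        max_change = 0.0
        for i in coords:
            if col_sq[i] == 0:
                continue
            rho = X[:, i] @ r + col_sq[i] * w[i]   # x_i^T (partial residual)
            w_new = np.sign(rho) * max(abs(rho) - lam * n, 0) / col_sq[i]
            r += X[:, i] * (w[i] - w_new)          # incremental residual update
            max_change = max(max_change, abs(w_new - w[i]))
            w[i] = w_new
        if full_sweep and max_change < tol:        # only trust convergence on a full sweep
            break
    return w

X = np.random.randn(500, 2000)
y = X[:, :10] @ np.ones(10) + 0.1 * np.random.randn(500)
print(np.count_nonzero(cd_lasso_active_set(X, y, lam=0.05)))
```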
| Optimization | Benefit | When to Use |
|---|---|---|
| Feature normalization | Balanced coordinate updates | Always |
| Warm starting | 10-100x fewer iterations | Regularization paths |
| Active set | Skip zero coordinates | L1-regularized problems |
| Residual caching | O(n) vs O(nd) updates | Least squares objectives |
| Coordinate ordering | Exploit data locality | Large problems with cache effects |
| Early stopping | Avoid over-optimization | When approximate solution suffices |
4. Convergence monitoring:
Monitor the duality gap for problems with dual formulations. For Lasso:
Primal: P(w) = ½‖y - Xw‖² + λ‖w‖₁
Dual: D(θ) = ½‖y‖² - ½‖y - θ‖², subject to ‖Xᵀθ‖_∞ ≤ λ
The duality gap P(w) - D(θ) provides a certificate of optimality. Stop when gap < ε.
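A sketch of that check for the formulation above, using the standard construction of a dual-feasible θ by rescaling the residual (the function name is mine):

```python
import numpy as np

def lasso_duality_gap(X, y, w, lam):
    """Duality gap for P(w) = 0.5*||y - Xw||^2 + lam*||w||_1, using the
    dual-feasible point theta = scaled residual."""
    r = y - X @ w
    primal = 0.5 * r @ r + lam * np.abs(w).sum()
    # Scale the residual so that ||X^T theta||_inf <= lam (dual feasibility)
    scale = min(1.0, lam / max(np.abs(X.T @ r).max(), 1e-12))
    theta = scale * r
    dual = 0.5 * y @ y - 0.5 * (y - theta) @ (y - theta)
    return primal - dual

X = np.random.randn(100, 20)
y = np.random.randn(100)
print(lasso_duality_gap(X, y, np.zeros(20), lam=1.0))  # gap shrinks toward 0 as w approaches the optimum
```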
5. Safe screening rules:
For Lasso, 'safe screening rules' identify coordinates guaranteed to be zero at optimum before running the algorithm. The SAFE rule and Sequential Strong Rules can eliminate 90%+ of coordinates for large λ, reducing the problem size massively.
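As an illustration, a minimal sketch of the basic SAFE test for ½‖y - Xw‖² + λ‖w‖₁, under the assumption that the rule takes the form |xⱼᵀy| < λ - ‖xⱼ‖‖y‖(λmax - λ)/λmax with λmax = ‖Xᵀy‖_∞:

```python
import numpy as np

def safe_screen_lasso(X, y, lam):
    """Basic SAFE rule (assumed form): returns a boolean mask of features
    guaranteed to be zero at the Lasso optimum for this lam."""
    correlations = np.abs(X.T @ y)
    lam_max = correlations.max()           # smallest lam for which w* = 0
    col_norms = np.linalg.norm(X, axis=0)
    bound = lam - col_norms * np.linalg.norm(y) * (lam_max - lam) / lam_max
    return correlations < bound            # True = safe to discard

X = np.random.randn(200, 1000)
y = np.random.randn(200)
lam = 0.95 * np.abs(X.T @ y).max()         # large lam: most features screened out
print(f"discarded {safe_screen_lasso(X, y, lam).sum()} of {X.shape[1]} features")
```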
Coordinate descent is a fundamental optimization paradigm that exploits problem structure by decomposing multivariate optimization into simple univariate subproblems. Its efficiency for Lasso and related problems makes it indispensable in the ML toolkit.
What's next:
The next page explores Proximal Methods—a powerful framework that elegantly handles non-smooth objectives like L1 regularization. Proximal gradient descent generalizes both gradient descent and coordinate descent, providing a unified view of first-order optimization for composite objectives.
You now understand coordinate descent: its framework, convergence properties, closed-form updates for ML objectives, block and stochastic extensions, and practical implementation. This knowledge is essential for efficiently solving L1-regularized problems and understanding the optimization landscape of modern ML.