While gradient descent and Newton's method update all parameters simultaneously, coordinate descent takes a radically different approach: it optimizes one coordinate (parameter) at a time, holding all others fixed. This seemingly restrictive strategy turns out to be remarkably powerful for specific problem structures.
The key insight is that many optimization problems become dramatically simpler when restricted to a single variable. Complex multivariate objectives reduce to one-dimensional subproblems that often admit closed-form solutions. By cycling through coordinates, each update is exact within its dimension, leading to algorithms that are simple to implement, highly parallelizable, and often faster than full-gradient methods for problems with the right structure.
By the end of this page, you will understand cyclic and randomized coordinate descent strategies, derive closed-form coordinate updates for common ML objectives, analyze convergence rates and conditions for coordinate descent, implement efficient coordinate descent for Lasso and other L1-regularized models, and recognize when coordinate descent outperforms gradient-based alternatives.
Basic Algorithm:
Given an objective f: ℝⁿ → ℝ to minimize:
1. Initialize x⁰ ∈ ℝⁿ
2. For k = 0, 1, 2, ...:
a. Select coordinate i ∈ {1, ..., n}
b. Solve: xᵢᵏ⁺¹ = argmin_{t} f(x₁ᵏ, ..., xᵢ₋₁ᵏ, t, xᵢ₊₁ᵏ, ..., xₙᵏ)
c. Set xⱼᵏ⁺¹ = xⱼᵏ for j ≠ i
3. Until convergence
The magic happens in step 2b: the multivariate problem becomes a univariate optimization in variable t. For many ML objectives, this univariate problem has an explicit solution.
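To make the loop concrete, here is a minimal sketch of the generic cyclic scheme. The names (`cyclic_coordinate_descent`, `argmin_1d`) are placeholders of my own; `argmin_1d` stands for whatever closed-form or numerical univariate solver your objective admits in step 2b.

```python
import numpy as np

def cyclic_coordinate_descent(argmin_1d, x0, n_sweeps=100, tol=1e-8):
    """Generic cyclic CD: argmin_1d(x, i) returns the scalar t that minimizes
    f along coordinate i with the other coordinates of x held fixed."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_sweeps):
        x_prev = x.copy()
        for i in range(len(x)):               # step (a): cycle through coordinates
            x[i] = argmin_1d(x, i)            # step (b): exact univariate minimization
        if np.max(np.abs(x - x_prev)) < tol:  # step 3: stop when a full sweep barely moves
            break
    return x

# Example: f(x) = 0.5 * x^T Q x - b^T x has the closed-form coordinate update
# x_i = (b_i - sum_{j != i} Q_ij x_j) / Q_ii
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
argmin_quad = lambda x, i: (b[i] - Q[i] @ x + Q[i, i] * x[i]) / Q[i, i]
print(cyclic_coordinate_descent(argmin_quad, np.zeros(2)))  # ≈ np.linalg.solve(Q, b)
```

Swapping the inner `for i in range(len(x))` loop for a random or greedy choice of i gives the randomized and greedy variants summarized next.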
Coordinate selection strategies:
| Strategy | Advantages | Disadvantages | Best For |
|---|---|---|---|
| Cyclic | Simple, deterministic, cache-friendly | Can be slow for correlated coordinates | Dense, well-conditioned problems |
| Randomized | Better theoretical guarantees, breaks symmetry | Overhead from random number generation | Theoretical analysis, parallel settings |
| Greedy | Fastest per-iteration progress | O(n) cost to find max gradient | Small n, sparse updates |
| Importance | Adapts to problem structure | Requires Lipschitz constants | Non-uniform coordinate scaling |
Coordinate descent works because exact minimization along any direction (including a coordinate direction) never increases the objective. For smooth convex functions, cycling through all coordinates drives the iterates to the global minimum. The efficiency comes from cheap univariate solutions, not from choosing optimal directions.
The convergence of coordinate descent depends critically on the objective function's structure. We analyze conditions guaranteeing convergence and characterize convergence rates.
Sufficient conditions for convergence: coordinate descent reaches the global minimum when f is convex and continuously differentiable, and more generally when f(x) = g(x) + Σᵢ hᵢ(xᵢ) with g smooth and convex and each hᵢ convex but possibly non-smooth. The crucial requirement is that any non-smooth part be separable across coordinates.
When coordinate descent can fail:
For non-smooth, non-separable functions, coordinate descent can get stuck. Consider:
f(x, y) = |x - 1| + |y - 1| + |x + y - 2|
The unique global minimum is (1, 1), where f = 0. Now consider the point (0, 2), where f = 2: moving x alone gives f(t, 2) = |t - 1| + 1 + |t| ≥ 2, and moving y alone gives f(0, t) = 1 + |t - 1| + |t - 2| ≥ 2, so no single-coordinate move can decrease the objective. The subdifferential at (0, 2) does not contain 0, so it is not a minimum. Yet:
Coordinate descent is stuck at a non-optimal point!
```python
import numpy as np
import matplotlib.pyplot as plt

# Demo: coordinate descent stuck on a non-smooth, non-separable function
def f_stuck(x, y):
    """Non-smooth function whose global minimum is f(1, 1) = 0,
    but which has coordinate-wise minima elsewhere (e.g. at (0, 2))."""
    return abs(x - 1) + abs(y - 1) + abs(x + y - 2)

# Visualize the contours
x_range = np.linspace(-1, 3, 100)
y_range = np.linspace(-1, 3, 100)
X, Y = np.meshgrid(x_range, y_range)
Z = np.abs(X - 1) + np.abs(Y - 1) + np.abs(X + Y - 2)
plt.contour(X, Y, Z, levels=20)  # call plt.show() to display

# (0, 2) is a coordinate-wise minimum: no single-coordinate move improves it,
# yet the global minimum is at (1, 1) with value 0
print("f(1, 1) =", f_stuck(1, 1))   # = 0 (global minimum)
print("f(0, 2) =", f_stuck(0, 2))   # = 2 (coordinate descent fixed point)
print("best single-coordinate move from (0, 2):",
      min(min(f_stuck(t, 2) for t in x_range),
          min(f_stuck(0, t) for t in y_range)))  # ≈ 2: no improvement possible

# For smooth strongly convex functions, CD converges
def f_quadratic(x, y, Q):
    """Quadratic: 0.5 * [x,y] @ Q @ [x,y]^T"""
    v = np.array([x, y])
    return 0.5 * v @ Q @ v

# Condition number affects convergence rate
def cd_quadratic(Q, x0, n_iters=100):
    """Cyclic coordinate descent on a quadratic"""
    x = np.array(x0, dtype=float)
    history = [x.copy()]
    for _ in range(n_iters):
        for i in range(len(x)):
            # Univariate minimization for a quadratic:
            # d/dx_i (0.5 x^T Q x) = (Qx)_i = 0
            # Q_ii * x_i + sum_{j≠i} Q_ij * x_j = 0
            # x_i = -sum_{j≠i} Q_ij * x_j / Q_ii
            s = sum(Q[i, j] * x[j] for j in range(len(x)) if j != i)
            x[i] = -s / Q[i, i]
        history.append(x.copy())
    return np.array(history)

# Well-conditioned: fast convergence
Q_good = np.array([[2.0, 0.5], [0.5, 2.0]])
history_good = cd_quadratic(Q_good, [1.0, 1.0], 20)

# Ill-conditioned: slow convergence
Q_bad = np.array([[1.0, 0.99], [0.99, 1.0]])
history_bad = cd_quadratic(Q_bad, [1.0, 1.0], 100)

def sweeps_to_tol(history, tol=1e-6):
    """First sweep whose iterate is within tol of the optimum (the origin), or None."""
    hits = np.where(np.linalg.norm(history, axis=1) < tol)[0]
    return hits[0] if len(hits) else None

print(f"Well-conditioned (κ={np.linalg.cond(Q_good):.1f}): "
      f"reaches ||x|| < 1e-6 after {sweeps_to_tol(history_good)} sweeps")
print(f"Ill-conditioned (κ={np.linalg.cond(Q_bad):.1f}): "
      f"sweeps needed = {sweeps_to_tol(history_bad)} (None: not converged within 100 sweeps)")
```

Convergence rate for smooth, strongly convex functions:
For a function f that is L-smooth and μ-strongly convex, randomized coordinate descent achieves:
E[f(x^k)] - f(x*) ≤ (1 - μ/(nL))^k · [f(x⁰) - f(x*)]
Comparing to gradient descent's rate (1 - μ/L)^k: the contraction factor is weaker by a factor of n, so randomized coordinate descent needs roughly n times as many coordinate updates as gradient descent needs full-gradient steps to reach the same accuracy. Since each coordinate update costs roughly 1/n of a full gradient step, the total work is comparable, and it is often lower in practice because the univariate updates have cheap closed forms and can exploit sparsity.
Coordinate Lipschitz constants:
When per-coordinate Lipschitz constants Lᵢ vary significantly, importance sampling with probabilities pᵢ ∝ Lᵢ can dramatically accelerate convergence. The iteration complexity becomes O((Σᵢ Lᵢ)/μ · log(1/ε)), compared with O(n·maxᵢ Lᵢ/μ · log(1/ε)) for uniform sampling, a substantial improvement when maxᵢ Lᵢ >> (1/n)Σᵢ Lᵢ.
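As a sketch of how importance sampling changes only the way the coordinate index is drawn (the function name and the toy quadratic below are my own, and the exact coordinate step grad/Lᵢ is assumed):

```python
import numpy as np

def importance_sampled_cd(Q, b, n_iters=2000, seed=0):
    """Randomized CD on f(x) = 0.5 x^T Q x - b^T x, drawing coordinate i with
    probability proportional to its coordinate Lipschitz constant L_i = Q_ii."""
    rng = np.random.default_rng(seed)
    L = np.diag(Q).copy()          # coordinate-wise Lipschitz constants
    probs = L / L.sum()            # p_i ∝ L_i
    x = np.zeros(len(b))
    for _ in range(n_iters):
        i = rng.choice(len(b), p=probs)
        grad_i = Q[i] @ x - b[i]   # partial derivative along coordinate i
        x[i] -= grad_i / L[i]      # exact minimization along coordinate i
    return x

Q = np.diag([100.0, 1.0, 1.0]) + 0.1   # very uneven coordinate scaling
b = np.ones(3)
print(importance_sampled_cd(Q, b))      # ≈ np.linalg.solve(Q, b)
```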
When coordinates are highly correlated (high off-diagonal Hessian entries), coordinate descent takes many iterations because progress in one coordinate is undone by subsequent updates in correlated coordinates. This is the 'zigzagging' phenomenon. Preconditioning or block coordinate descent can help.
The power of coordinate descent lies in exploiting closed-form solutions for univariate subproblems. Let's derive these for common ML objectives.
1. Least Squares Regression:
Objective: f(w) = ½‖y - Xw‖²
Holding all wⱼ (j ≠ i) fixed, the univariate problem in wᵢ is:
min_{wᵢ} ½‖y - Σⱼ≠ᵢ wⱼxⱼ - wᵢxᵢ‖² = min_{wᵢ} ½‖rᵢ - wᵢxᵢ‖²
where rᵢ = y - Σⱼ≠ᵢ wⱼxⱼ is the partial residual. Setting the derivative to zero:
xᵢᵀ(rᵢ - wᵢxᵢ) = 0 → wᵢ = xᵢᵀrᵢ / ‖xᵢ‖²
2. Ridge Regression (L2-regularized):
Objective: f(w) = ½‖y - Xw‖² + (λ/2)‖w‖²
The coordinate update becomes:
wᵢ = xᵢᵀrᵢ / (‖xᵢ‖² + λ)
The L2 penalty simply adds λ to the denominator—elegantly shrinking the estimate.
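A minimal sketch of these two updates in code (the helper name `cd_ridge` is mine; setting λ = 0 recovers the plain least-squares update):

```python
import numpy as np

def cd_ridge(X, y, lam=1.0, n_sweeps=100):
    """Cyclic coordinate descent for 0.5*||y - Xw||^2 + (lam/2)*||w||^2,
    using the closed-form update w_i = x_i^T r_i / (||x_i||^2 + lam)."""
    n, d = X.shape
    w = np.zeros(d)
    r = y - X @ w                    # full residual, maintained incrementally
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for i in range(d):
            r += X[:, i] * w[i]      # partial residual r_i = y - sum_{j != i} w_j x_j
            w[i] = (X[:, i] @ r) / (col_sq[i] + lam)
            r -= X[:, i] * w[i]      # restore the full residual with the new w_i
    return w

X = np.random.randn(50, 5)
y = X @ np.array([1.0, -2.0, 0.0, 3.0, 0.5]) + 0.1 * np.random.randn(50)
w_cd = cd_ridge(X, y, lam=1.0)
w_exact = np.linalg.solve(X.T @ X + 1.0 * np.eye(5), X.T @ y)
print(np.allclose(w_cd, w_exact, atol=1e-6))
```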
```python
import numpy as np
from typing import Tuple

def soft_threshold(x: float, threshold: float) -> float:
    """
    Soft-thresholding operator (proximal of L1 norm).
    S_λ(x) = sign(x) * max(|x| - λ, 0)
    """
    return np.sign(x) * max(abs(x) - threshold, 0)

def cd_lasso(
    X: np.ndarray,
    y: np.ndarray,
    lambda_: float,
    max_iter: int = 1000,
    tol: float = 1e-6,
    warm_start: np.ndarray = None
) -> Tuple[np.ndarray, list]:
    """
    Coordinate descent for Lasso (L1-regularized least squares).

    Minimizes: (1/2n)||y - Xw||² + λ||w||₁

    Parameters:
    -----------
    X : np.ndarray
        Feature matrix (n_samples, n_features)
    y : np.ndarray
        Target vector (n_samples,)
    lambda_ : float
        L1 regularization strength
    max_iter : int
        Maximum number of full passes over coordinates
    tol : float
        Convergence tolerance on weight change
    warm_start : np.ndarray, optional
        Initial weights (defaults to zero)

    Returns:
    --------
    w : np.ndarray
        Optimal weights
    history : list
        Objective value at each iteration
    """
    n_samples, n_features = X.shape

    # Normalize lambda for sample size
    lambda_scaled = lambda_ * n_samples

    # Precompute column norms (for denominator)
    col_norms_sq = np.sum(X ** 2, axis=0)

    # Initialize weights
    if warm_start is not None:
        w = warm_start.copy()
    else:
        w = np.zeros(n_features)

    # Initialize residual: r = y - Xw
    residual = y - X @ w

    history = []

    for iteration in range(max_iter):
        w_old = w.copy()

        # Cyclic coordinate descent
        for i in range(n_features):
            if col_norms_sq[i] < 1e-10:
                continue  # Skip zero columns

            # Add back contribution of current w_i to residual
            # (Efficiently compute partial residual)
            residual += X[:, i] * w[i]

            # Compute correlation: x_i^T * r_i
            rho_i = X[:, i] @ residual

            # Soft-thresholding update for Lasso
            w[i] = soft_threshold(rho_i, lambda_scaled) / col_norms_sq[i]

            # Update residual with new w_i
            residual -= X[:, i] * w[i]

        # Compute objective for history
        loss = 0.5 * np.sum(residual ** 2) / n_samples + lambda_ * np.sum(np.abs(w))
        history.append(loss)

        # Check convergence
        if np.max(np.abs(w - w_old)) < tol:
            print(f"Converged in {iteration + 1} iterations")
            break

    return w, history

def cd_elastic_net(
    X: np.ndarray,
    y: np.ndarray,
    lambda1: float,
    lambda2: float,
    max_iter: int = 1000,
    tol: float = 1e-6
) -> np.ndarray:
    """
    Coordinate descent for Elastic Net.

    Minimizes: (1/2n)||y - Xw||² + λ₁||w||₁ + (λ₂/2)||w||₂²

    The update rule combines soft-thresholding (L1) with shrinkage (L2).
    """
    n_samples, n_features = X.shape
    col_norms_sq = np.sum(X ** 2, axis=0)
    w = np.zeros(n_features)
    residual = y.copy()

    for _ in range(max_iter):
        w_old = w.copy()

        for i in range(n_features):
            if col_norms_sq[i] < 1e-10:
                continue

            residual += X[:, i] * w[i]
            rho_i = X[:, i] @ residual

            # Elastic Net update: soft-threshold then shrink
            # w_i = S_{λ₁n}(x_i^T r) / (||x_i||² + λ₂n)
            w[i] = soft_threshold(rho_i, lambda1 * n_samples) / (col_norms_sq[i] + lambda2 * n_samples)

            residual -= X[:, i] * w[i]

        if np.max(np.abs(w - w_old)) < tol:
            break

    return w

if __name__ == "__main__":
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    # Generate sparse regression problem
    X, y, true_coef = make_regression(
        n_samples=200, n_features=100, n_informative=10,
        noise=10, coef=True, random_state=42
    )

    # Our implementation
    lambda_ = 0.1
    w_cd, history = cd_lasso(X, y, lambda_)

    # Sklearn reference
    lasso_sklearn = Lasso(alpha=lambda_, fit_intercept=False)
    lasso_sklearn.fit(X, y)

    print(f"CD non-zero weights: {np.sum(np.abs(w_cd) > 1e-6)}")
    print(f"SKL non-zero weights: {np.sum(np.abs(lasso_sklearn.coef_) > 1e-6)}")
    print(f"Max difference: {np.max(np.abs(w_cd - lasso_sklearn.coef_)):.2e}")
```

3. Lasso Regression (L1-regularized):
Objective: f(w) = ½‖y - Xw‖² + λ‖w‖₁
The L1 penalty is non-smooth, but the univariate subproblem still has a closed-form solution via the soft-thresholding operator:
S_λ(x) = sign(x) · max(|x| - λ, 0)
The coordinate update becomes:
wᵢ = S_λ(xᵢᵀrᵢ) / ‖xᵢ‖²
This elegant formula is why coordinate descent is the standard solver for Lasso. The soft-thresholding induces exact zeros at optimum, providing automatic feature selection.
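A quick numerical illustration of those exact zeros (the correlation values below are arbitrary):

```python
import numpy as np

def soft_threshold(x, lam):
    """S_lam(x) = sign(x) * max(|x| - lam, 0)"""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rho = np.array([3.0, 0.4, -0.2, -2.5])   # correlations x_i^T r_i
print(soft_threshold(rho, 1.0))          # [ 2.   0.  -0.  -1.5]: small entries snap to exactly 0
```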
The key implementation trick is maintaining the residual r = y - Xw incrementally. When updating wᵢ from wᵢᵒˡᵈ to wᵢⁿᵉʷ, update r ← r - (wᵢⁿᵉʷ - wᵢᵒˡᵈ)xᵢ in O(n) time. This avoids recomputing the full residual at O(nd) cost, making each coordinate update O(n) instead of O(nd).
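A small check of this bookkeeping, using arbitrary random data purely for the demonstration:

```python
import numpy as np

# One coordinate update with an incrementally maintained residual r = y - Xw.
X = np.random.randn(1000, 200)
y = np.random.randn(1000)
w = np.random.randn(200)
r = y - X @ w                      # computed once up front, O(n*d)

i, w_new = 7, 0.3                  # update coordinate i to a new value
r -= (w_new - w[i]) * X[:, i]      # O(n) incremental residual update
w[i] = w_new

print(np.allclose(r, y - X @ w))   # True: matches a full O(n*d) recomputation
```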
When coordinates are naturally grouped or highly correlated within groups, block coordinate descent updates an entire block of variables simultaneously. This is a powerful generalization that can capture within-block dependencies.
Block structure:
Partition coordinates into blocks: {1, ..., n} = B₁ ∪ B₂ ∪ ... ∪ Bᴷ. Each iteration selects a block Bⱼ, minimizes f jointly over the variables in Bⱼ with all other blocks held fixed, and leaves the remaining blocks unchanged.
Applications in ML:
Matrix Factorization (UV decomposition): alternately fix the item factors V and solve for the user factors U, then fix U and solve for V; each block subproblem is a regularized least-squares problem (Alternating Least Squares).
Group Lasso: the penalty λ Σ_g ‖w_g‖₂ couples the coordinates within each group, so an entire group must be updated jointly via block soft-thresholding.
Multi-task Learning: the parameters of each task form a natural block; blocks are updated one task at a time while shared structure couples them.
```python
import numpy as np
from typing import Tuple

def als_matrix_factorization(
    R: np.ndarray,
    k: int,
    lambda_reg: float = 0.1,
    max_iter: int = 50,
    tol: float = 1e-4
) -> Tuple[np.ndarray, np.ndarray, list]:
    """
    Alternating Least Squares (ALS) for matrix factorization.

    Block coordinate descent: alternately fix U and update V,
    then fix V and update U.

    Minimizes: ||R - U @ V^T||²_F + λ(||U||²_F + ||V||²_F)
    over observed entries only.

    Parameters:
    -----------
    R : np.ndarray
        Rating matrix (n_users, n_items), NaN for missing entries
    k : int
        Latent dimension (rank)
    lambda_reg : float
        L2 regularization strength
    max_iter : int
        Maximum alternating iterations
    tol : float
        Convergence tolerance on reconstruction error

    Returns:
    --------
    U : np.ndarray
        User factors (n_users, k)
    V : np.ndarray
        Item factors (n_items, k)
    history : list
        Reconstruction error at each iteration
    """
    n_users, n_items = R.shape
    observed = ~np.isnan(R)            # Mask of observed entries
    R_obs = np.where(observed, R, 0)   # Replace NaN with 0 for computation

    # Initialize factors randomly
    np.random.seed(42)
    U = np.random.randn(n_users, k) * 0.1
    V = np.random.randn(n_items, k) * 0.1

    history = []

    for it in range(max_iter):
        # Block 1: Fix V, solve for each row of U
        for i in range(n_users):
            observed_items = observed[i, :]
            V_obs = V[observed_items, :]        # Items rated by user i
            R_obs_i = R_obs[i, observed_items]
            if len(R_obs_i) == 0:
                continue
            # Solve: (V_obs^T V_obs + λI) u_i = V_obs^T r_i
            A = V_obs.T @ V_obs + lambda_reg * np.eye(k)
            b = V_obs.T @ R_obs_i
            U[i, :] = np.linalg.solve(A, b)

        # Block 2: Fix U, solve for each row of V
        for j in range(n_items):
            observed_users = observed[:, j]
            U_obs = U[observed_users, :]        # Users who rated item j
            R_obs_j = R_obs[observed_users, j]
            if len(R_obs_j) == 0:
                continue
            # Solve: (U_obs^T U_obs + λI) v_j = U_obs^T r_j
            A = U_obs.T @ U_obs + lambda_reg * np.eye(k)
            b = U_obs.T @ R_obs_j
            V[j, :] = np.linalg.solve(A, b)

        # Compute reconstruction error on observed entries only
        pred = U @ V.T
        error = np.sqrt(np.mean((R[observed] - pred[observed]) ** 2))
        history.append(error)

        if it > 0 and abs(history[-1] - history[-2]) < tol:
            print(f"Converged in {it + 1} iterations")
            break

    return U, V, history

def group_lasso_cd(
    X: np.ndarray,
    y: np.ndarray,
    groups: list,
    lambda_: float,
    max_iter: int = 1000,
    tol: float = 1e-6
) -> np.ndarray:
    """
    Block coordinate descent for Group Lasso.

    Minimizes: (1/2n)||y - Xw||² + λ Σ_g ||w_g||₂

    Each group's weights are updated together using a block
    soft-thresholding operator (exact when the group's columns are
    orthonormal; an approximation otherwise).

    Parameters:
    -----------
    X : np.ndarray
        Feature matrix (n_samples, n_features)
    y : np.ndarray
        Target vector (n_samples,)
    groups : list
        List of index arrays, one per group
    lambda_ : float
        Group regularization strength

    Returns:
    --------
    w : np.ndarray
        Optimal weights
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    residual = y.copy()

    for _ in range(max_iter):
        w_old = w.copy()

        for g_idx in groups:
            g_idx = np.array(g_idx)
            X_g = X[:, g_idx]

            # Add back group contribution
            residual += X_g @ w[g_idx]

            # Compute group update direction
            z = X_g.T @ residual
            z_norm = np.linalg.norm(z)

            # Block soft-thresholding
            if z_norm > lambda_ * n_samples:
                # Solve (X_g^T X_g + εI) w_g = z, then shrink the whole group
                gram = X_g.T @ X_g
                shrinkage = 1 - (lambda_ * n_samples) / z_norm
                w[g_idx] = shrinkage * np.linalg.solve(
                    gram + 1e-6 * np.eye(len(g_idx)), z
                )
            else:
                # Entire group set to zero
                w[g_idx] = 0

            # Update residual
            residual -= X_g @ w[g_idx]

        if np.max(np.abs(w - w_old)) < tol:
            break

    return w
```

Larger blocks capture more coordinate dependencies but require solving larger subproblems. The extreme case of a single block containing every coordinate amounts to solving the original problem in one shot (and, if each block is updated with a gradient step rather than an exact minimization, block coordinate descent reduces to ordinary gradient descent). Choose block sizes that balance subproblem complexity against the benefit of joint updates.
Stochastic Coordinate Descent (SCD) combines the coordinate-wise framework with stochastic approximation, enabling scaling to massive datasets that don't fit in memory.
Basic idea:
Instead of computing the exact coordinate gradient, use a stochastic estimate from a mini-batch:
∂f/∂wᵢ ≈ (1/|B|) Σ_{j∈B} ∂fⱼ/∂wᵢ
For least squares with f(w) = (1/n)Σⱼ (yⱼ - wᵀxⱼ)²:
∂f/∂wᵢ ≈ (1/|B|) Σ_{j∈B} -2xⱼᵢ(yⱼ - wᵀxⱼ)
Doubly stochastic coordinate descent: at each step, sample both a random coordinate i and a random example (or mini-batch) j, and update wᵢ using only the partial derivative ∂fⱼ/∂wᵢ.
With per-sample residuals cached and updated incrementally, this achieves O(1) cost per update, versus a cost proportional to the number of samples for an exact coordinate update and proportional to the number of features for a stochastic gradient step.
```python
import numpy as np

def stochastic_cd_least_squares(
    X: np.ndarray,
    y: np.ndarray,
    n_iters: int = 10000,
    step_size: float = 0.01,
    batch_size: int = 1
) -> np.ndarray:
    """
    Stochastic Coordinate Descent for least squares.

    Doubly stochastic: samples both coordinate and data points.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)

    for t in range(n_iters):
        # Sample coordinate uniformly
        i = np.random.randint(n_features)

        # Sample mini-batch
        batch_idx = np.random.choice(n_samples, batch_size, replace=False)
        X_batch = X[batch_idx, :]
        y_batch = y[batch_idx]

        # Stochastic gradient for coordinate i
        residuals = y_batch - X_batch @ w
        grad_i = -2 * np.mean(X_batch[:, i] * residuals)

        # Decaying step size for convergence
        eta_t = step_size / (1 + 0.001 * t)

        # Update single coordinate
        w[i] -= eta_t * grad_i

    return w

def sdca_least_squares(
    X: np.ndarray,
    y: np.ndarray,
    lambda_: float,
    n_epochs: int = 50
) -> np.ndarray:
    """
    Stochastic Dual Coordinate Ascent (SDCA) for L2-regularized least squares.

    Works in the dual space where coordinates correspond to training
    examples. Often more efficient than primal SCD.

    Minimizes: (λ/2)||w||² + (1/n)Σᵢ (1/2)(y_i - wᵀx_i)²
    """
    n_samples, n_features = X.shape

    # Dual variables (one per sample)
    alpha = np.zeros(n_samples)

    # Primal variable: w = (1/λn) Σᵢ αᵢ xᵢ
    w = np.zeros(n_features)

    # Precompute ||x_i||² for each sample
    x_norms_sq = np.sum(X ** 2, axis=1)

    for epoch in range(n_epochs):
        # Random permutation of samples
        perm = np.random.permutation(n_samples)

        for i in perm:
            x_i = X[i, :]
            y_i = y[i]

            # Compute optimal dual update for coordinate i
            # Δαᵢ = (yᵢ - wᵀxᵢ - αᵢ) / (1 + ||xᵢ||²/(λn))
            residual = y_i - np.dot(x_i, w) - alpha[i]
            denom = 1 + x_norms_sq[i] / (lambda_ * n_samples)
            delta_alpha = residual / denom

            # Update dual and primal variables
            alpha[i] += delta_alpha
            w += delta_alpha * x_i / (lambda_ * n_samples)

    return w

# Comparison
if __name__ == "__main__":
    from sklearn.datasets import make_regression

    X, y = make_regression(n_samples=10000, n_features=100,
                           noise=0.1, random_state=42)

    # Optimal solution for comparison
    from sklearn.linear_model import Ridge
    ridge = Ridge(alpha=0.1, fit_intercept=False)
    ridge.fit(X, y)
    w_opt = ridge.coef_

    # SDCA
    w_sdca = sdca_least_squares(X, y, lambda_=0.1 / len(y), n_epochs=20)
    print(f"SDCA error vs optimal: {np.linalg.norm(w_sdca - w_opt):.4f}")
```

SDCA: Stochastic Dual Coordinate Ascent:
For regularized empirical risk minimization, working in the dual space is often more efficient. In the dual, coordinates correspond to training examples rather than features. SDCA achieves linear convergence for smooth losses (and O(1/ε)-type rates for general Lipschitz losses), and its duality gap can be computed cheaply and used as a certificate of accuracy when deciding to stop.
When to use stochastic coordinate descent: when both the number of examples and the number of features are very large, when the data cannot be held in memory and must be streamed, or when only a moderately accurate solution is needed. For small and medium problems, exact cyclic coordinate descent is usually simpler and faster.
Coordinate descent's simplicity makes it amenable to parallelization, though care is needed to maintain correctness when multiple workers update simultaneously.
Naive parallelization problems:
If two workers simultaneously update coordinates i and j, both compute their updates against the same stale residual. When the columns xᵢ and xⱼ are correlated, the combined step double-counts the shared direction and can overshoot, slowing convergence or even increasing the objective.
Hogwild! (asynchronous parallel):
For sparse problems, Hogwild! ignores conflicts: workers read and write the shared parameter vector without any locking. Because each update touches only the few coordinates present in a sparse example, collisions between workers are rare, and the occasional stale read or lost update behaves like bounded noise, so near-linear speedups are possible with essentially unchanged convergence.
```python
import numpy as np
import threading

def parallel_cd_chunked(
    X: np.ndarray,
    y: np.ndarray,
    lambda_: float,
    n_workers: int = 4,
    n_epochs: int = 10
) -> np.ndarray:
    """
    Parallel coordinate descent with coordinate partitioning.

    Each worker owns a disjoint set of coordinates and updates them
    in each epoch. Workers synchronize between epochs.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)

    # Partition coordinates among workers
    coords_per_worker = np.array_split(np.arange(n_features), n_workers)

    def worker_update(worker_id, w_shared, residual_shared):
        """Update coordinates owned by this worker"""
        my_coords = coords_per_worker[worker_id]
        for i in my_coords:
            col_norm_sq = np.sum(X[:, i] ** 2)
            if col_norm_sq < 1e-10:
                continue
            # Read current residual (slightly stale is OK)
            rho = X[:, i] @ residual_shared
            # Compute update
            w_old = w_shared[i]
            w_new = soft_threshold(rho, lambda_ * n_samples) / col_norm_sq
            # Apply update
            w_shared[i] = w_new
            # Update residual (atomic operation in practice)
            residual_shared -= (w_new - w_old) * X[:, i]

    residual = y - X @ w

    for epoch in range(n_epochs):
        threads = []
        for wid in range(n_workers):
            t = threading.Thread(target=worker_update, args=(wid, w, residual))
            threads.append(t)
            t.start()
        for t in threads:
            t.join()

        # Full residual recomputation for stability
        if (epoch + 1) % 5 == 0:
            residual = y - X @ w

    return w

def distributed_cd_admm(
    X_partitions: list,
    y_partitions: list,
    lambda_: float,
    rho: float = 1.0,
    n_iters: int = 50
) -> np.ndarray:
    """
    Distributed Lasso via ADMM (Alternating Direction Method of Multipliers).

    Each worker has a partition of the data and maintains a local copy
    of w. Workers solve local subproblems then average and project globally.

    Minimizes: (1/2n) Σ_k ||y_k - X_k w||² + λ||w||₁
    where k indexes data partitions across workers.
    """
    n_workers = len(X_partitions)
    n_features = X_partitions[0].shape[1]

    # Local copies of w for each worker
    w_local = [np.zeros(n_features) for _ in range(n_workers)]
    # Global consensus variable
    z = np.zeros(n_features)
    # Dual variables (scaled form)
    u = [np.zeros(n_features) for _ in range(n_workers)]

    for it in range(n_iters):
        # Worker updates (can be parallelized)
        for k in range(n_workers):
            X_k = X_partitions[k]
            y_k = y_partitions[k]
            n_k = len(y_k)
            # Solve: min (1/2n_k)||y_k - X_k w||² + (ρ/2)||w - z + u_k||²
            # This is a ridge regression problem
            A = X_k.T @ X_k / n_k + rho * np.eye(n_features)
            b = X_k.T @ y_k / n_k + rho * (z - u[k])
            w_local[k] = np.linalg.solve(A, b)

        # Global consensus update (average + soft-thresholding)
        w_avg = np.mean([w_local[k] + u[k] for k in range(n_workers)], axis=0)
        z_new = soft_threshold_vec(w_avg, lambda_ / (rho * n_workers))

        # Dual update
        for k in range(n_workers):
            u[k] = u[k] + w_local[k] - z_new

        z = z_new

    return z

def soft_threshold(x: float, threshold: float) -> float:
    return np.sign(x) * max(abs(x) - threshold, 0)

def soft_threshold_vec(x: np.ndarray, threshold: float) -> np.ndarray:
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0)
```

ADMM (Alternating Direction Method of Multipliers) enables distributed coordinate descent by introducing consensus constraints. Each worker solves a local subproblem on its data partition, then workers average their solutions and project onto the constraint set. This framework naturally handles regularization and scales to many workers.
Implementing coordinate descent efficiently requires attention to several practical details:
1. Feature normalization:
Coordinate descent implicitly assumes comparable scales across coordinates. If feature j has range [0, 1000] while feature k has range [0, 1], updates to j will be much smaller. Standardize features to unit variance before optimization.
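For example, a plain NumPy sketch of column standardization (sklearn's StandardScaler performs the equivalent transformation):

```python
import numpy as np

def standardize_columns(X):
    """Center each feature and scale it to unit variance so that
    coordinate updates operate on comparable scales."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0            # leave constant columns unscaled
    return (X - mean) / std, mean, std

X = np.column_stack([np.random.uniform(0, 1000, 500), np.random.uniform(0, 1, 500)])
X_std, mean, std = standardize_columns(X)
print(X_std.std(axis=0))           # ≈ [1. 1.]
```

Remember to map coefficients back to the original scale (divide by each column's standard deviation) after fitting on standardized features.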
2. Warm starting:
For regularization path computation (e.g., Lasso for varying λ), initialize w(λ_new) from the solution at w(λ_old). The solutions change continuously, so warm starting dramatically reduces iterations needed.
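For instance, reusing the cd_lasso function sketched earlier (which accepts a warm_start argument), a regularization path could be computed along these lines:

```python
import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

# Solve from the largest lambda (sparsest solution) down to the smallest,
# warm-starting each solve from the previous solution on the path.
lambdas = np.logspace(1, -2, 20)
w = np.zeros(X.shape[1])
path = []
for lam in lambdas:
    w, _ = cd_lasso(X, y, lam, warm_start=w)   # cd_lasso as defined above
    path.append(w.copy())

print("non-zeros along the path:", [int(np.sum(np.abs(w) > 1e-6)) for w in path])
```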
3. Active set strategies:
For Lasso, many coordinates become exactly zero. An 'active set' tracks potentially non-zero coordinates. After initial passes, only update active coordinates, periodically checking if inactive ones should enter. This reduces work by 10-100x for sparse solutions.
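A simplified sketch of the idea, using the same (1/2n)-scaled Lasso objective as the implementation above (the function name is mine, and production solvers add KKT-based checks before declaring convergence):

```python
import numpy as np

def cd_lasso_active_set(X, y, lam, n_sweeps=100, tol=1e-6):
    """Lasso CD for (1/2n)||y - Xw||^2 + lam*||w||_1 that mostly sweeps only the
    currently non-zero ('active') coordinates, with periodic full sweeps so that
    inactive coordinates get a chance to (re)enter the active set."""
    n, d = X.shape
    w = np.zeros(d)
    r = y.astype(float).copy()                     # residual y - Xw (w starts at zero)
    col_sq = (X ** 2).sum(axis=0)

    for t in range(n_sweeps):
        full_sweep = (t % 10 == 0)
        coords = range(d) if full_sweep else np.flatnonzero(w)
        max_change = 0.0
        for i in coords:
            if col_sq[i] == 0:
                continue
            rho = X[:, i] @ r + col_sq[i] * w[i]   # x_i^T (partial residual)
            w_new = np.sign(rho) * max(abs(rho) - lam * n, 0) / col_sq[i]
            r += X[:, i] * (w[i] - w_new)          # incremental residual update
            max_change = max(max_change, abs(w_new - w[i]))
            w[i] = w_new
        if full_sweep and max_change < tol:        # only trust convergence on a full sweep
            break
    return w

X = np.random.randn(500, 2000)
y = X[:, :10] @ np.ones(10) + 0.1 * np.random.randn(500)
print(np.count_nonzero(cd_lasso_active_set(X, y, lam=0.05)))
```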
| Optimization | Benefit | When to Use |
|---|---|---|
| Feature normalization | Balanced coordinate updates | Always |
| Warm starting | 10-100x fewer iterations | Regularization paths |
| Active set | Skip zero coordinates | L1-regularized problems |
| Residual caching | O(n) vs O(nd) updates | Least squares objectives |
| Coordinate ordering | Exploit data locality | Large problems with cache effects |
| Early stopping | Avoid over-optimization | When approximate solution suffices |
4. Convergence monitoring:
Monitor the duality gap for problems with dual formulations. For Lasso:
Primal: P(w) = ½‖y - Xw‖² + λ‖w‖₁
Dual: D(θ) = ½‖y‖² - ½‖y - θ‖², subject to ‖Xᵀθ‖_∞ ≤ λ
The duality gap P(w) - D(θ) provides a certificate of optimality. Stop when gap < ε.
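A sketch of that check for the formulation above, using the standard construction of a dual-feasible θ by rescaling the residual (the function name is mine):

```python
import numpy as np

def lasso_duality_gap(X, y, w, lam):
    """Duality gap for P(w) = 0.5*||y - Xw||^2 + lam*||w||_1, using the
    dual-feasible point theta = scaled residual."""
    r = y - X @ w
    primal = 0.5 * r @ r + lam * np.abs(w).sum()
    # Scale the residual so that ||X^T theta||_inf <= lam (dual feasibility)
    scale = min(1.0, lam / max(np.abs(X.T @ r).max(), 1e-12))
    theta = scale * r
    dual = 0.5 * y @ y - 0.5 * (y - theta) @ (y - theta)
    return primal - dual

X = np.random.randn(100, 20)
y = np.random.randn(100)
print(lasso_duality_gap(X, y, np.zeros(20), lam=1.0))  # gap shrinks toward 0 as w approaches the optimum
```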
5. Safe screening rules:
For Lasso, 'safe screening rules' identify coordinates guaranteed to be zero at optimum before running the algorithm. The SAFE rule and Sequential Strong Rules can eliminate 90%+ of coordinates for large λ, reducing the problem size massively.
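As an illustration, a minimal sketch of the basic SAFE test for ½‖y - Xw‖² + λ‖w‖₁, under the assumption that the rule takes the form |xⱼᵀy| < λ - ‖xⱼ‖‖y‖(λmax - λ)/λmax with λmax = ‖Xᵀy‖_∞:

```python
import numpy as np

def safe_screen_lasso(X, y, lam):
    """Basic SAFE rule (assumed form): returns a boolean mask of features
    guaranteed to be zero at the Lasso optimum for this lam."""
    correlations = np.abs(X.T @ y)
    lam_max = correlations.max()           # smallest lam for which w* = 0
    col_norms = np.linalg.norm(X, axis=0)
    bound = lam - col_norms * np.linalg.norm(y) * (lam_max - lam) / lam_max
    return correlations < bound            # True = safe to discard

X = np.random.randn(200, 1000)
y = np.random.randn(200)
lam = 0.95 * np.abs(X.T @ y).max()         # large lam: most features screened out
print(f"discarded {safe_screen_lasso(X, y, lam).sum()} of {X.shape[1]} features")
```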
Coordinate descent is a fundamental optimization paradigm that exploits problem structure by decomposing multivariate optimization into simple univariate subproblems. Its efficiency for Lasso and related problems makes it indispensable in the ML toolkit.
What's next:
The next page explores Proximal Methods—a powerful framework that elegantly handles non-smooth objectives like L1 regularization. Proximal gradient descent generalizes both gradient descent and coordinate descent, providing a unified view of first-order optimization for composite objectives.
You now understand coordinate descent: its framework, convergence properties, closed-form updates for ML objectives, block and stochastic extensions, and practical implementation. This knowledge is essential for efficiently solving L1-regularized problems and understanding the optimization landscape of modern ML.