So far, we've studied L1, L2, and Elastic Net regularization at fixed regularization strengths. But in practice, we never know the optimal λ in advance—we must explore a range of values and select based on cross-validation or other criteria.
The regularization path is the set of all solutions $\hat{\boldsymbol{\beta}}(\lambda)$ as $\lambda$ varies from 0 (no regularization, the MLE solution) to infinity (complete shrinkage, all coefficients zero). Understanding this path provides deep insight into how coefficients shrink, the order in which features enter the model, and how to choose the regularization strength.
This page develops the theory and practice of computing, visualizing, and interpreting regularization paths.
Early path algorithms (LARS, homotopy methods) computed the exact Lasso path by identifying "breakpoints" where features enter or leave the active set. Modern approaches use coordinate descent at a dense grid of λ values—less elegant mathematically but more practical and generalizable to logistic regression and other models.
For a given regularization type (L1, L2, or Elastic Net), the regularization path is the function:
$$\hat{\boldsymbol{\beta}}: [0, \infty) \to \mathbb{R}^p$$ $$\lambda \mapsto \hat{\boldsymbol{\beta}}(\lambda) = \arg\min_{\boldsymbol{\beta}} \left[ -\mathcal{L}(\boldsymbol{\beta}) + \lambda \cdot \text{Penalty}(\boldsymbol{\beta}) \right]$$
Boundary conditions: at $\lambda = 0$ the path returns the unregularized MLE, $\hat{\boldsymbol{\beta}}(0) = \hat{\boldsymbol{\beta}}_{\text{MLE}}$; as $\lambda \to \infty$, all penalized coefficients shrink to zero, $\hat{\boldsymbol{\beta}}(\lambda) \to \mathbf{0}$. The path interpolates between these extremes, with different behaviors depending on the penalty type.
L2 (Ridge) Path: coefficients shrink smoothly toward zero as λ grows but never reach exactly zero, so all features remain in the model at every λ.

L1 (Lasso) Path: coefficients hit exactly zero at finite λ values, so the active set changes along the path; this is what makes the Lasso a feature selection method.

Elastic Net Path: intermediate behavior; the L1 component produces sparsity while the L2 component smooths the path and stabilizes it when features are correlated.
For linear regression, the Lasso path is exactly piecewise linear. For logistic regression, this is not true—the path is smooth due to the nonlinear log-likelihood. However, it still exhibits the "feature selection" behavior where coefficients reach exactly zero at discrete λ values.
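The piecewise-linear structure can be inspected directly: scikit-learn's `lars_path` computes the exact Lasso path for linear regression, returning the breakpoint values (`alphas`) at which the active set changes. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
true_beta = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_beta + 0.1 * rng.standard_normal(100)

# alphas: breakpoints (decreasing); coefs[:, k] is the solution at alphas[k].
# Between consecutive breakpoints the coefficients change linearly in alpha.
alphas, active, coefs = lars_path(X, y, method='lasso')

print(f"{len(alphas)} breakpoints; path starts all-zero: "
      f"{np.allclose(coefs[:, 0], 0.0)}")
```

Evaluating the path at any intermediate alpha is then just linear interpolation between the two neighboring breakpoint solutions.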
The most common approach is to solve the regularized problem at a sequence of λ values:
$$\lambda_1 > \lambda_2 > \cdots > \lambda_K > 0$$
Key insight: Starting from large λ (sparse solution) and decreasing is much faster than the reverse, because each solution warm-starts the next: the optimum changes only slightly between adjacent λ values, and the early, sparse fits are cheap to compute.
Grid construction: use a log-spaced grid from λ_max down to λ_min = ε·λ_max, with ε typically 10⁻² to 10⁻⁴ and 50–100 grid points.
Before computing the path, we need λ_max: the smallest λ that sets all coefficients to zero. For L1 regularization:
$$\lambda_{\max} = \max_j \left| \nabla_j \mathcal{L}(\boldsymbol{\beta} = \mathbf{0}) \right|$$
This is the maximum absolute gradient at the zero solution. Any λ > λ_max produces the zero solution (all features excluded).
For logistic regression with standardized features:
$$\lambda_{\max} = \frac{1}{n} \max_j \left| \sum_{i=1}^n x_{ij}(y_i - \bar{y}) \right|$$
where $\bar{y}$ is the mean of the binary outcomes.
Why this works: The L1 optimality conditions require $|\nabla_j \mathcal{L}| \leq \lambda$ for $\beta_j = 0$. When λ exceeds all gradient magnitudes, the zero solution is optimal.
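A quick numeric sanity check (a sketch, with one assumption worth flagging: scikit-learn minimizes $\|w\|_1 + C\sum_i \ell_i$, i.e. the penalty weights the *sum* of losses, so for the average-loss formulation used here the mapping is $C = 1/(n\lambda)$):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize
y = (rng.random(n) < 0.5).astype(float)

# lambda_max: largest absolute gradient of the average log-likelihood at beta=0
lam_max = np.max(np.abs(X.T @ (y - y.mean()))) / n

# Fit just above lambda_max; all coefficients should be exactly zero
# (saga's proximal updates produce exact zeros under the L1 penalty)
model = LogisticRegression(penalty='l1', C=1 / (n * 1.01 * lam_max),
                           solver='saga', max_iter=5000, tol=1e-8)
model.fit(X, y)
print(np.count_nonzero(model.coef_))  # expect 0
```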
```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt


def compute_lambda_max(X, y, fit_intercept=True):
    """
    Compute the smallest lambda that sets all coefficients to zero.
    This is the maximum absolute gradient of the (average) log-likelihood
    at beta = 0.
    """
    n = len(y)
    # At beta=0, predictions are all 0.5 (no intercept) or the class mean
    # (with an optimally fitted intercept)
    p_mean = np.mean(y) if fit_intercept else 0.5
    # Gradient at beta=0: (1/n) X'(p - y), with p = p_mean for all samples
    gradients = X.T @ (p_mean - y)
    return np.max(np.abs(gradients)) / n


def compute_lasso_path(X, y, n_lambdas=100, lambda_ratio=1e-4):
    """
    Compute the L1 regularization path.

    Parameters
    ----------
    X : array, shape (n, p)
        Feature matrix (should be standardized)
    y : array, shape (n,)
        Binary labels
    n_lambdas : int
        Number of lambda values in the grid
    lambda_ratio : float
        Ratio of lambda_min to lambda_max

    Returns
    -------
    lambdas : array
        Lambda values (decreasing)
    coef_path : array, shape (p, n_lambdas)
        Coefficients at each lambda
    """
    n, p = X.shape

    lambda_max = compute_lambda_max(X, y)
    lambda_min = lambda_max * lambda_ratio

    # Log-spaced grid (decreasing)
    lambdas = np.logspace(np.log10(lambda_max), np.log10(lambda_min), n_lambdas)

    # sklearn minimizes ||w||_1 + C * sum of losses, so for the
    # average-loss formulation used here, C = 1 / (n * lambda)
    Cs = 1 / (n * lambdas)

    coef_path = np.zeros((p, n_lambdas))

    # Reuse one estimator with warm_start=True so that each fit starts
    # from the previous solution on the path
    model = LogisticRegression(
        penalty='l1', solver='saga', max_iter=1000,
        tol=1e-4, warm_start=True, fit_intercept=True
    )
    for i, C in enumerate(Cs):
        model.set_params(C=C)
        model.fit(X, y)
        coef_path[:, i] = model.coef_.ravel()

    return lambdas, coef_path


def visualize_regularization_path(lambdas, coef_path, feature_names=None,
                                  top_k=10, title="Regularization Path"):
    """
    Visualize the regularization path: coefficient trajectories on the
    left, sparsity (number of non-zero coefficients) on the right.
    """
    p, n_lambdas = coef_path.shape
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Plot 1: coefficient values vs log10(lambda)
    ax1 = axes[0]
    for j in range(p):
        ax1.plot(np.log10(lambdas), coef_path[j, :], linewidth=1)
    ax1.axhline(0, color='gray', linestyle='--', linewidth=0.5)
    ax1.axvline(np.log10(lambdas[n_lambdas // 2]), color='red',
                linestyle='--', alpha=0.5, label='Reference λ')
    ax1.set_xlabel('log₁₀(λ)', fontsize=12)
    ax1.set_ylabel('Coefficient Value', fontsize=12)
    ax1.set_title(f'{title}: Coefficients', fontsize=14)
    ax1.legend()

    # Plot 2: number of non-zero coefficients vs log10(lambda)
    ax2 = axes[1]
    n_nonzero = np.sum(np.abs(coef_path) > 1e-6, axis=0)
    ax2.plot(np.log10(lambdas), n_nonzero, 'b-', linewidth=2)
    ax2.fill_between(np.log10(lambdas), 0, n_nonzero, alpha=0.3)
    ax2.set_xlabel('log₁₀(λ)', fontsize=12)
    ax2.set_ylabel('Number of Non-zero Coefficients', fontsize=12)
    ax2.set_title(f'{title}: Sparsity', fontsize=14)
    ax2.set_ylim(0, p + 1)

    plt.tight_layout()
    plt.savefig('regularization_path.png', dpi=150)
    plt.show()

    # Identify top features (largest |coefficient| at the smallest lambda)
    final_coef = np.abs(coef_path[:, -1])
    top_indices = np.argsort(final_coef)[-top_k:][::-1]
    print(f"Top {top_k} features (by final coefficient magnitude):")
    print("-" * 40)
    for rank, idx in enumerate(top_indices, 1):
        name = feature_names[idx] if feature_names else f"Feature {idx}"
        print(f"{rank:2d}. {name}: {coef_path[idx, -1]:.4f}")


def compute_elastic_net_path_comparison(X, y, l1_ratios=[0.2, 0.5, 0.8, 1.0]):
    """
    Compare regularization paths for different l1_ratios (Elastic Net).
    """
    n, p = X.shape
    n_lambdas = 50

    # Use the same lambda range for all ratios
    lambda_max = compute_lambda_max(X, y) * 2  # slightly larger to ensure zeros
    lambda_min = lambda_max * 1e-3
    lambdas = np.logspace(np.log10(lambda_max), np.log10(lambda_min), n_lambdas)
    Cs = 1 / (n * lambdas)

    paths = {}
    for l1_ratio in l1_ratios:
        coef_path = np.zeros((p, n_lambdas))
        for i, C in enumerate(Cs):
            if l1_ratio == 1.0:
                model = LogisticRegression(penalty='l1', C=C,
                                           solver='saga', max_iter=1000)
            elif l1_ratio == 0.0:
                model = LogisticRegression(penalty='l2', C=C,
                                           solver='lbfgs', max_iter=1000)
            else:
                model = LogisticRegression(penalty='elasticnet', C=C,
                                           l1_ratio=l1_ratio,
                                           solver='saga', max_iter=1000)
            model.fit(X, y)
            coef_path[:, i] = model.coef_.ravel()
        paths[l1_ratio] = coef_path

    return lambdas, paths


# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Generate data with a sparse true model
    n, p = 300, 30
    X = np.random.randn(n, p)
    true_coef = np.zeros(p)
    true_coef[:5] = [2.0, -1.5, 1.0, -0.8, 0.5]
    prob = expit(X @ true_coef)
    y = (np.random.rand(n) < prob).astype(float)

    # Standardize
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Compute and visualize the path
    print("Computing Lasso regularization path...")
    lambdas, coef_path = compute_lasso_path(X_scaled, y, n_lambdas=80)
    feature_names = [f"x{i+1}" for i in range(p)]
    visualize_regularization_path(lambdas, coef_path, feature_names,
                                  title="Lasso Logistic Regression")
```

The regularization path reveals which features are most strongly associated with the outcome. As λ decreases from λ_max, features enter the model one at a time, roughly in order of their predictive strength.
Entry order as importance ranking: Features that enter early can be considered "more important" in a regularization-robust sense. This provides a form of feature ranking that's more stable than p-values or single-split importance measures.
Entry order can be misleading with correlated features. If two features are highly correlated, only one enters first. This doesn't mean the second is unimportant—it's just redundant given the first. For grouped importance, use Elastic Net paths or stability selection.
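One way to operationalize this ranking (a sketch; `entry_lambdas` is a hypothetical helper, not a library function) is to record, for each feature, the largest λ at which its coefficient is non-zero along a computed path:

```python
import numpy as np

def entry_lambdas(lambdas, coef_path, tol=1e-6):
    """For each feature, return the largest lambda at which its coefficient
    is non-zero (np.nan if it never enters). Assumes `lambdas` is decreasing
    and `coef_path` has shape (p, len(lambdas))."""
    active = np.abs(coef_path) > tol
    entry = np.full(coef_path.shape[0], np.nan)
    for j, row in enumerate(active):
        idx = np.flatnonzero(row)
        if idx.size:
            entry[j] = lambdas[idx[0]]
    return entry

# Toy path: feature 0 enters at lambda=0.5, feature 1 at 0.25, feature 2 never
lambdas = np.array([1.0, 0.5, 0.25])
coef_path = np.array([[0.0, 0.1, 0.2],
                      [0.0, 0.0, 0.3],
                      [0.0, 0.0, 0.0]])
print(entry_lambdas(lambdas, coef_path))  # feature 0: 0.5, feature 1: 0.25, feature 2: nan
```

Sorting features by entry λ in decreasing order then gives the "earliest entrants first" ranking described above.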
A stable path shows consistent behavior: smooth coefficient trajectories, a fixed entry order, and no features repeatedly entering and leaving the active set.

Signs of instability: coefficients that oscillate or flip sign as λ decreases, features that enter and exit the active set multiple times, and trajectories that change sharply between adjacent λ values.

Unstable paths suggest potential issues: strong collinearity among features, too few observations relative to the number of features, or separation in the data.
Remedy: Increase regularization (use smaller C), add L2 component (switch to Elastic Net), or remove highly collinear features.
A standard regularization path plot shows coefficients (y-axis) vs log(λ) (x-axis):
Left side (large λ): strong regularization; few or no features active. Right side (small λ): weak regularization; approaching the MLE, with many features active.
Key observations:
| Pattern | Interpretation | Action |
|---|---|---|
| Feature enters early | Strong marginal predictor | Likely important; include in model |
| Feature enters late | Weak or conditional predictor | May be redundant; consider exclusion |
| Parallel trajectories | Correlated features | Consider grouping or Elastic Net |
| Coefficient oscillates | Unstable, possibly collinear | Add L2 regularization |
| Coefficient sign flip | Complex interaction | Investigate feature relationships |
The most common approach to selecting λ is cross-validation: split the data into folds, fit the path on each training split, evaluate every λ on the held-out fold, and choose the λ with the best average validation performance.

Common criteria: held-out log-loss (deviance), classification accuracy, and AUC. Log-loss is usually preferred because it varies smoothly with λ and directly matches the fitted likelihood.
Implementation:
- `LogisticRegressionCV` in scikit-learn automates this process
- the `Cs` parameter specifies the λ grid (actually 1/λ values)
- `refit=True` fits the final model on the full data at the optimal λ
```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt


def compare_lambda_selection_methods(X, y, cv=5):
    """
    Compare different methods for selecting lambda on the regularization path:
    1. Cross-validation (minimum error)
    2. Cross-validation (1SE rule)
    3. BIC approximation
    """
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    n, p = X_scaled.shape

    # Lambda grid, expressed as C = 1/lambda for sklearn
    n_lambdas = 50
    Cs = np.logspace(-3, 2, n_lambdas)

    # Methods 1 & 2: cross-validation
    cv_model = LogisticRegressionCV(
        Cs=Cs, penalty='l1', solver='saga', cv=cv,
        scoring='neg_log_loss', max_iter=2000, refit=True
    )
    cv_model.fit(X_scaled, y)

    # scores_ is a dict keyed by class; values have shape (n_folds, n_Cs)
    cv_scores = cv_model.scores_[1]
    mean_scores = np.mean(cv_scores, axis=0)  # average across folds
    std_scores = np.std(cv_scores, axis=0)    # spread across folds

    # CV-optimal C (minimum error)
    C_cv_min = cv_model.C_[0]
    lambda_cv_min = 1 / C_cv_min

    # 1SE rule: most regularized model within one SE of the best score
    # (scores are negative log-loss, so higher is better)
    best_idx = np.argmax(mean_scores)
    threshold = mean_scores[best_idx] - std_scores[best_idx]
    valid = np.flatnonzero(mean_scores >= threshold)
    idx_1se = valid[0] if valid.size else best_idx  # smallest C = largest lambda
    C_1se = Cs[idx_1se]
    lambda_1se = 1 / C_1se

    # Method 3: BIC approximation
    # BIC = -2 * log_lik + k * log(n), k = number of non-zero coefficients
    bic_values = []
    for C in Cs:
        model = LogisticRegression(penalty='l1', C=C, solver='saga',
                                   max_iter=2000)
        model.fit(X_scaled, y)
        probs = np.clip(model.predict_proba(X_scaled)[:, 1], 1e-15, 1 - 1e-15)
        log_lik = np.sum(y * np.log(probs) + (1 - y) * np.log(1 - probs))
        k = np.sum(np.abs(model.coef_) > 1e-6) + 1  # +1 for the intercept
        bic_values.append(-2 * log_lik + k * np.log(n))
    bic_values = np.array(bic_values)
    C_bic = Cs[np.argmin(bic_values)]
    lambda_bic = 1 / C_bic

    # Report results
    print("Lambda Selection Comparison")
    print("=" * 60)
    print(f"{'Method':<30} {'λ':>12} {'C (=1/λ)':>12}")
    print("-" * 60)
    print(f"{'CV (minimum error)':<30} {lambda_cv_min:>12.4f} {C_cv_min:>12.4f}")
    print(f"{'CV (1SE rule)':<30} {lambda_1se:>12.4f} {C_1se:>12.4f}")
    print(f"{'BIC':<30} {lambda_bic:>12.4f} {C_bic:>12.4f}")

    # Fit final models at each selected lambda and compare
    print("Model Comparison at Selected λ")
    print("-" * 60)
    for name, C in [("CV-min", C_cv_min), ("CV-1SE", C_1se), ("BIC", C_bic)]:
        model = LogisticRegression(penalty='l1', C=C, solver='saga',
                                   max_iter=2000)
        model.fit(X_scaled, y)
        n_features = np.sum(np.abs(model.coef_) > 1e-6)
        train_acc = model.score(X_scaled, y)
        print(f"{name:<10}: {n_features:3d} features, "
              f"train accuracy = {train_acc:.4f}")

    return {
        'Cs': Cs,
        'lambdas': 1 / Cs,
        'mean_cv_scores': mean_scores,
        'bic_values': bic_values,
        'selected': {
            'cv_min': lambda_cv_min,
            'cv_1se': lambda_1se,
            'bic': lambda_bic
        }
    }


def visualize_selection(results):
    """Visualize the model selection process."""
    lambdas = results['lambdas']
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    # CV error (log-loss is the negative of the stored scores)
    ax1 = axes[0]
    ax1.plot(np.log10(lambdas), -results['mean_cv_scores'], 'b-', linewidth=2)
    ax1.set_xlabel('log₁₀(λ)', fontsize=11)
    ax1.set_ylabel('Cross-Validation Log-Loss', fontsize=11)
    ax1.set_title('Cross-Validation Error', fontsize=12)
    for name, lam, color in [
        ('CV-min', results['selected']['cv_min'], 'green'),
        ('CV-1SE', results['selected']['cv_1se'], 'orange')
    ]:
        ax1.axvline(np.log10(lam), color=color, linestyle='--', label=name)
    ax1.legend()

    # BIC
    ax2 = axes[1]
    ax2.plot(np.log10(lambdas), results['bic_values'], 'r-', linewidth=2)
    ax2.axvline(np.log10(results['selected']['bic']), color='purple',
                linestyle='--', label='BIC optimal')
    ax2.set_xlabel('log₁₀(λ)', fontsize=11)
    ax2.set_ylabel('BIC', fontsize=11)
    ax2.set_title('Bayesian Information Criterion', fontsize=12)
    ax2.legend()

    plt.tight_layout()
    plt.savefig('lambda_selection.png', dpi=150)
    plt.show()


# Example
if __name__ == "__main__":
    from scipy.special import expit

    np.random.seed(42)
    n, p = 200, 30
    X = np.random.randn(n, p)
    true_coef = np.zeros(p)
    true_coef[:5] = [1.5, -1.0, 0.8, -0.5, 0.3]
    prob = expit(X @ true_coef)
    y = (np.random.rand(n) < prob).astype(float)

    results = compare_lambda_selection_methods(X, y)
    visualize_selection(results)
```

A practical heuristic for selecting λ is the 1-standard-error (1SE) rule: choose the largest λ whose CV error is within one standard error of the minimum.
Rationale: Models with similar CV performance are statistically indistinguishable. Among them, prefer the simpler (more regularized) one. This embodies Occam's razor in model selection.
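The rule itself is only a few lines. A minimal sketch, assuming a decreasing λ grid, "lower is better" errors, and a per-grid-point standard error of the mean CV error:

```python
import numpy as np

def lambda_1se(lambdas, mean_err, se_err):
    """Largest lambda whose mean CV error is within one standard error
    of the minimum. Assumes `lambdas` is sorted in decreasing order."""
    best = np.argmin(mean_err)
    threshold = mean_err[best] + se_err[best]
    within = np.flatnonzero(mean_err <= threshold)
    return lambdas[within[0]]  # first qualifying index = largest lambda

lambdas = np.array([1.0, 0.5, 0.1])
mean_err = np.array([0.30, 0.21, 0.20])
se_err = np.array([0.02, 0.02, 0.02])
print(lambda_1se(lambdas, mean_err, se_err))  # 0.5
```

Here the minimum error (0.20 at λ=0.1) sets a threshold of 0.22, and λ=0.5 is the most regularized model under that threshold.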
When to use: when you value parsimony and interpretability, when n is small so CV error estimates are noisy, or when downstream costs favor carrying fewer features.

When NOT to use: when pure predictive accuracy is the goal, since the 1SE model deliberately trades a little performance for simplicity and can underfit.
Information criteria provide alternatives to cross-validation for model selection:
Akaike Information Criterion (AIC): $$\text{AIC} = -2 \log \mathcal{L}(\hat{\boldsymbol{\beta}}) + 2k$$
Bayesian Information Criterion (BIC): $$\text{BIC} = -2 \log \mathcal{L}(\hat{\boldsymbol{\beta}}) + k \log n$$
where $k$ is the number of non-zero parameters (including intercept).
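Both criteria are cheap to evaluate once a model is fit; a minimal sketch of the arithmetic:

```python
import numpy as np

def aic_bic(log_lik, k, n):
    """AIC and BIC from the maximized log-likelihood, the number of
    non-zero parameters k (including the intercept), and sample size n."""
    return -2.0 * log_lik + 2.0 * k, -2.0 * log_lik + k * np.log(n)

# Example: log-likelihood -100 with k=5 parameters on n=100 samples
aic, bic = aic_bic(-100.0, 5, 100)
print(f"AIC={aic:.1f}, BIC={bic:.1f}")  # AIC=210.0, BIC=223.0
```

Note that with n=100, log n ≈ 4.6 > 2, so BIC charges more than twice AIC's per-parameter price.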
Comparison: AIC penalizes complexity less and tends to select larger models that predict well; BIC penalizes more heavily (for n ≥ 8, log n > 2) and is consistent for recovering the true model when it is among the candidates.
For regularized models, the effective "degrees of freedom" isn't simply the number of non-zero coefficients. The penalty shrinks estimates, reducing effective model complexity. More accurate formulas exist (e.g., trace of the hat matrix), but counting non-zeros is a common approximation that works reasonably well in practice.
When $p$ is large (potentially $p > n$), standard BIC can underpenalize complexity. The Extended BIC (EBIC) adds an additional penalty:
$$\text{EBIC}_\gamma = -2 \log \mathcal{L}(\hat{\boldsymbol{\beta}}) + k \log n + 2\gamma \log \binom{p}{k}$$
where $\gamma \in [0, 1]$ controls the additional penalty based on model size relative to $p$.
Intuition: With many potential features, finding a good-fitting model by chance is easier. EBIC penalizes for this "search space" effect.
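The binomial coefficient in the EBIC formula can overflow for large $p$, so a log-gamma implementation is safer. A sketch (`ebic` and `log_binom` are hypothetical helpers, not library functions):

```python
import numpy as np
from scipy.special import gammaln

def log_binom(p, k):
    """log(p choose k), computed stably via log-gamma."""
    return gammaln(p + 1) - gammaln(k + 1) - gammaln(p - k + 1)

def ebic(log_lik, k, n, p, gamma=0.5):
    """Extended BIC with search-space penalty 2*gamma*log(p choose k)."""
    return -2.0 * log_lik + k * np.log(n) + 2.0 * gamma * log_binom(p, k)

# With gamma=0 the search-space term vanishes and EBIC reduces to ordinary BIC
print(f"{ebic(-100.0, 5, 100, 50, gamma=0.0):.1f}")  # 223.0
```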
Common choices: γ = 0 recovers standard BIC; γ = 0.5 is a common default; γ = 1 gives the strongest penalty, appropriate when p greatly exceeds n.
Rather than selecting a single λ, stability selection uses the regularization path to identify robustly selected features: repeatedly subsample the data, fit the Lasso path on each subsample, and record how often each feature is selected.
Key insight: A feature that is consistently selected across subsamples is likely a true signal, not a spurious correlation. Features that are selected only occasionally are likely noise or unstable due to correlation.
Parameters: the subsample fraction (typically 0.5), the number of subsamples (typically 50–100), and the selection threshold τ (typically 0.6–0.9).
Selection probability curve: For each feature, plot the probability of selection across the λ range. Stable features have high selection probability across a wide range; unstable features are sensitive to λ.
Theoretical guarantee: Under mild conditions (Meinshausen and Bühlmann, 2010), stability selection with threshold $\tau \in (\tfrac{1}{2}, 1]$ bounds the expected number of falsely selected features $V$:

$$E[V] \leq \frac{1}{2\tau - 1} \cdot \frac{\hat{q}^2}{p}$$

where $\hat{q}$ is the average number of features selected per subsample.
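The Meinshausen–Bühlmann bound $E[V] \leq \hat{q}^2 / ((2\tau - 1)\,p)$ for $\tau \in (\tfrac{1}{2}, 1]$ is easy to evaluate in practice; a sketch (`stability_fdr_bound` is a hypothetical helper):

```python
def stability_fdr_bound(tau, q_avg, p):
    """Upper bound on the expected number of falsely selected features,
    E[V] <= q_avg**2 / ((2*tau - 1) * p), valid for tau in (0.5, 1]."""
    assert 0.5 < tau <= 1.0
    return q_avg ** 2 / ((2.0 * tau - 1.0) * p)

# e.g. threshold 0.75, ~10 features selected per subsample, p=100 candidates
print(stability_fdr_bound(0.75, 10, 100))  # 2.0
```

Raising τ or lowering the average model size $\hat{q}$ (by not pushing λ too low on the path) both tighten the bound.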
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from scipy.special import expit
import matplotlib.pyplot as plt


def stability_selection_path(X, y, n_bootstrap=100, sample_fraction=0.5,
                             n_lambdas=30, lambda_ratio=1e-2):
    """
    Stability selection across the regularization path.

    For each subsample, records which features are selected at each
    lambda value, and returns the resulting selection probabilities.

    Parameters
    ----------
    X : array, shape (n, p)
        Feature matrix (should be standardized)
    y : array, shape (n,)
        Binary labels
    n_bootstrap : int
        Number of subsampling iterations
    sample_fraction : float
        Fraction of samples used in each iteration
    n_lambdas : int
        Number of lambda values in the grid
    lambda_ratio : float
        Ratio of lambda_min to lambda_max

    Returns
    -------
    lambdas : array
        Lambda values (decreasing)
    selection_probs : array, shape (p, n_lambdas)
        Selection probability at each (feature, lambda) pair
    max_selection_prob : array, shape (p,)
        Maximum selection probability across lambdas for each feature
    """
    n, p = X.shape
    n_subsample = int(n * sample_fraction)

    # Lambda range, slightly enlarged so the path starts all-zero
    lambda_max = np.max(np.abs(X.T @ (y - y.mean()))) / n * 1.5
    lambda_min = lambda_max * lambda_ratio
    lambdas = np.logspace(np.log10(lambda_max), np.log10(lambda_min), n_lambdas)
    Cs = 1 / (n_subsample * lambdas)  # sklearn's C for the average-loss form

    selection_counts = np.zeros((p, n_lambdas))
    for b in range(n_bootstrap):
        # Subsample without replacement
        indices = np.random.choice(n, size=n_subsample, replace=False)
        X_sub, y_sub = X[indices], y[indices]

        # Compute the path on this subsample and record selections
        for i, C in enumerate(Cs):
            model = LogisticRegression(penalty='l1', C=C, solver='saga',
                                       max_iter=1000, tol=1e-4)
            model.fit(X_sub, y_sub)
            selection_counts[:, i] += np.abs(model.coef_.ravel()) > 1e-6

    selection_probs = selection_counts / n_bootstrap
    return lambdas, selection_probs, np.max(selection_probs, axis=1)


def visualize_stability_selection(lambdas, selection_probs, max_probs,
                                  threshold=0.6, feature_names=None):
    """Visualize stability selection results."""
    p = len(max_probs)
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    stable_features = np.where(max_probs >= threshold)[0]
    unstable_features = np.where(max_probs < threshold)[0]

    # Plot 1: selection probability curves
    ax1 = axes[0]
    for j in unstable_features:  # unstable features in gray
        ax1.plot(np.log10(lambdas), selection_probs[j, :],
                 color='gray', alpha=0.3, linewidth=0.5)
    colors = plt.cm.tab10(np.linspace(0, 1, max(len(stable_features), 1)))
    for idx, j in enumerate(stable_features):  # stable features in color
        name = feature_names[j] if feature_names else f"x{j+1}"
        ax1.plot(np.log10(lambdas), selection_probs[j, :],
                 color=colors[idx], linewidth=2, label=name)
    ax1.axhline(threshold, color='red', linestyle='--',
                label=f'Threshold={threshold}')
    ax1.set_xlabel('log₁₀(λ)', fontsize=12)
    ax1.set_ylabel('Selection Probability', fontsize=12)
    ax1.set_title('Stability Selection Curves', fontsize=14)
    ax1.legend(loc='lower right', fontsize=8)
    ax1.set_ylim(-0.05, 1.05)

    # Plot 2: distribution of max selection probabilities
    ax2 = axes[1]
    sorted_indices = np.argsort(max_probs)[::-1]
    sorted_probs = max_probs[sorted_indices]
    top_n = min(20, p)
    bar_colors = ['green' if prob >= threshold else 'gray'
                  for prob in sorted_probs[:top_n]]
    ax2.barh(range(top_n), sorted_probs[:top_n], color=bar_colors)
    ax2.axvline(threshold, color='red', linestyle='--', label='Threshold')
    ax2.set_xlabel('Max Selection Probability', fontsize=12)
    ax2.set_ylabel('Feature (sorted)', fontsize=12)
    ax2.set_title(f'Top {top_n} Features (Stable: {len(stable_features)})',
                  fontsize=14)
    labels = [feature_names[j] if feature_names else f"x{j+1}"
              for j in sorted_indices[:top_n]]
    ax2.set_yticks(range(top_n))
    ax2.set_yticklabels(labels)
    ax2.invert_yaxis()
    ax2.legend()

    plt.tight_layout()
    plt.savefig('stability_selection.png', dpi=150)
    plt.show()

    # Report stable features
    print(f"Stable Features (selection prob >= {threshold}):")
    print("-" * 50)
    for j in sorted_indices:
        if max_probs[j] >= threshold:
            name = feature_names[j] if feature_names else f"x{j+1}"
            print(f"{name}: {max_probs[j]:.3f}")

    return stable_features


# Example
if __name__ == "__main__":
    np.random.seed(42)

    # Generate sparse data with some correlated features
    n, p = 300, 50
    X = np.random.randn(n, p)
    X[:, 5:8] = X[:, 0:1] + 0.1 * np.random.randn(n, 3)

    true_coef = np.zeros(p)
    true_coef[:5] = [2.0, -1.5, 1.0, -0.8, 0.6]
    prob = expit(X @ true_coef)
    y = (np.random.rand(n) < prob).astype(float)

    # Standardize
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Run stability selection
    print("Running stability selection...")
    lambdas, sel_probs, max_probs = stability_selection_path(
        X_scaled, y, n_bootstrap=50, sample_fraction=0.5
    )
    feature_names = [f"x{i+1}" for i in range(p)]
    stable_features = visualize_stability_selection(
        lambdas, sel_probs, max_probs, threshold=0.6,
        feature_names=feature_names
    )
    print("True non-zero features: x1, x2, x3, x4, x5")
    print(f"Detected stable features: "
          f"{[feature_names[j] for j in stable_features]}")
```

The final page of this module covers hyperparameter selection in depth, including grid search, random search, and efficient optimization strategies for tuning both λ and the L1 ratio in Elastic Net models.