So far, we've studied L1, L2, and Elastic Net regularization at fixed regularization strengths. But in practice, we never know the optimal λ in advance—we must explore a range of values and select based on cross-validation or other criteria.
The regularization path is the set of all solutions $\hat{\boldsymbol{\beta}}(\lambda)$ as $\lambda$ varies from 0 (no regularization, the MLE solution) to infinity (complete shrinkage, all coefficients zero). Understanding this path provides deep insight into how coefficients shrink, the order in which features enter the model, and how to choose the regularization strength.
This page develops the theory and practice of computing, visualizing, and interpreting regularization paths.
Early path algorithms (LARS, homotopy methods) computed the exact Lasso path by identifying "breakpoints" where features enter or leave the active set. Modern approaches use coordinate descent at a dense grid of λ values—less elegant mathematically but more practical and generalizable to logistic regression and other models.
For a given regularization type (L1, L2, or Elastic Net), the regularization path is the function:
$$\hat{\boldsymbol{\beta}}: [0, \infty) \to \mathbb{R}^p$$ $$\lambda \mapsto \hat{\boldsymbol{\beta}}(\lambda) = \arg\min_{\boldsymbol{\beta}} \left[ -\mathcal{L}(\boldsymbol{\beta}) + \lambda \cdot \text{Penalty}(\boldsymbol{\beta}) \right]$$
Boundary conditions: at $\lambda = 0$ the path returns the unregularized MLE, $\hat{\boldsymbol{\beta}}(0) = \hat{\boldsymbol{\beta}}_{\text{MLE}}$; as $\lambda \to \infty$, all penalized coefficients shrink to zero, $\hat{\boldsymbol{\beta}}(\lambda) \to \mathbf{0}$. The path interpolates between these extremes, with different behaviors depending on the penalty type.
L2 (Ridge) Path: coefficients shrink smoothly toward zero as λ grows but never reach exactly zero, so all features remain in the model at every λ.

L1 (Lasso) Path: coefficients hit exactly zero at finite λ values, so the active set changes along the path; this is what makes the Lasso a feature selection method.

Elastic Net Path: intermediate behavior; the L1 component produces sparsity while the L2 component smooths the path and stabilizes it when features are correlated.
For linear regression, the Lasso path is exactly piecewise linear. For logistic regression, this is not true—the path is smooth due to the nonlinear log-likelihood. However, it still exhibits the "feature selection" behavior where coefficients reach exactly zero at discrete λ values.
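The piecewise-linear structure can be inspected directly: scikit-learn's `lars_path` computes the exact Lasso path for linear regression, returning the breakpoint values (`alphas`) at which the active set changes. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
true_beta = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_beta + 0.1 * rng.standard_normal(100)

# alphas: breakpoints (decreasing); coefs[:, k] is the solution at alphas[k].
# Between consecutive breakpoints the coefficients change linearly in alpha.
alphas, active, coefs = lars_path(X, y, method='lasso')

print(f"{len(alphas)} breakpoints; path starts all-zero: "
      f"{np.allclose(coefs[:, 0], 0.0)}")
```

Evaluating the path at any intermediate alpha is then just linear interpolation between the two neighboring breakpoint solutions.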
The most common approach is to solve the regularized problem at a sequence of λ values:
$$\lambda_1 > \lambda_2 > \cdots > \lambda_K > 0$$
Key insight: Starting from large λ (sparse solution) and decreasing is much faster than the reverse, because each solution warm-starts the next: the optimum changes only slightly between adjacent λ values, and the early, sparse fits are cheap to compute.
Grid construction: use a log-spaced grid from λ_max down to λ_min = ε·λ_max, with ε typically 10⁻² to 10⁻⁴ and 50–100 grid points.
Before computing the path, we need λ_max: the smallest λ that sets all coefficients to zero. For L1 regularization:
$$\lambda_{\max} = \max_j \left| \nabla_j \mathcal{L}(\boldsymbol{\beta} = \mathbf{0}) \right|$$
This is the maximum absolute gradient at the zero solution. Any λ > λ_max produces the zero solution (all features excluded).
For logistic regression with standardized features:
$$\lambda_{\max} = \frac{1}{n} \max_j \left| \sum_{i=1}^n x_{ij}(y_i - \bar{y}) \right|$$
where $\bar{y}$ is the mean of the binary outcomes.
Why this works: The L1 optimality conditions require $|\nabla_j \mathcal{L}| \leq \lambda$ for $\beta_j = 0$. When λ exceeds all gradient magnitudes, the zero solution is optimal.
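A quick numeric sanity check (a sketch, with one assumption worth flagging: scikit-learn minimizes $\|w\|_1 + C\sum_i \ell_i$, i.e. the penalty weights the *sum* of losses, so for the average-loss formulation used here the mapping is $C = 1/(n\lambda)$):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize
y = (rng.random(n) < 0.5).astype(float)

# lambda_max: largest absolute gradient of the average log-likelihood at beta=0
lam_max = np.max(np.abs(X.T @ (y - y.mean()))) / n

# Fit just above lambda_max; all coefficients should be exactly zero
# (saga's proximal updates produce exact zeros under the L1 penalty)
model = LogisticRegression(penalty='l1', C=1 / (n * 1.01 * lam_max),
                           solver='saga', max_iter=5000, tol=1e-8)
model.fit(X, y)
print(np.count_nonzero(model.coef_))  # expect 0
```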
```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt


def compute_lambda_max(X, y, fit_intercept=True):
    """
    Compute the smallest lambda that sets all coefficients to zero.
    This is the maximum absolute gradient of the (average) log-likelihood
    at beta = 0.
    """
    n = len(y)
    # At beta=0, predictions are all 0.5 (no intercept) or the class mean
    # (with an optimally fitted intercept)
    p_mean = np.mean(y) if fit_intercept else 0.5
    # Gradient at beta=0: (1/n) X'(p - y), with p = p_mean for all samples
    gradients = X.T @ (p_mean - y)
    return np.max(np.abs(gradients)) / n


def compute_lasso_path(X, y, n_lambdas=100, lambda_ratio=1e-4):
    """
    Compute the L1 regularization path.

    Parameters
    ----------
    X : array, shape (n, p)
        Feature matrix (should be standardized)
    y : array, shape (n,)
        Binary labels
    n_lambdas : int
        Number of lambda values in the grid
    lambda_ratio : float
        Ratio of lambda_min to lambda_max

    Returns
    -------
    lambdas : array
        Lambda values (decreasing)
    coef_path : array, shape (p, n_lambdas)
        Coefficients at each lambda
    """
    n, p = X.shape

    lambda_max = compute_lambda_max(X, y)
    lambda_min = lambda_max * lambda_ratio

    # Log-spaced grid (decreasing)
    lambdas = np.logspace(np.log10(lambda_max), np.log10(lambda_min), n_lambdas)

    # sklearn minimizes ||w||_1 + C * sum of losses, so for the
    # average-loss formulation used here, C = 1 / (n * lambda)
    Cs = 1 / (n * lambdas)

    coef_path = np.zeros((p, n_lambdas))

    # Reuse one estimator with warm_start=True so that each fit starts
    # from the previous solution on the path
    model = LogisticRegression(
        penalty='l1', solver='saga', max_iter=1000,
        tol=1e-4, warm_start=True, fit_intercept=True
    )
    for i, C in enumerate(Cs):
        model.set_params(C=C)
        model.fit(X, y)
        coef_path[:, i] = model.coef_.ravel()

    return lambdas, coef_path


def visualize_regularization_path(lambdas, coef_path, feature_names=None,
                                  top_k=10, title="Regularization Path"):
    """
    Visualize the regularization path: coefficient trajectories on the
    left, sparsity (number of non-zero coefficients) on the right.
    """
    p, n_lambdas = coef_path.shape
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Plot 1: coefficient values vs log10(lambda)
    ax1 = axes[0]
    for j in range(p):
        ax1.plot(np.log10(lambdas), coef_path[j, :], linewidth=1)
    ax1.axhline(0, color='gray', linestyle='--', linewidth=0.5)
    ax1.axvline(np.log10(lambdas[n_lambdas // 2]), color='red',
                linestyle='--', alpha=0.5, label='Reference λ')
    ax1.set_xlabel('log₁₀(λ)', fontsize=12)
    ax1.set_ylabel('Coefficient Value', fontsize=12)
    ax1.set_title(f'{title}: Coefficients', fontsize=14)
    ax1.legend()

    # Plot 2: number of non-zero coefficients vs log10(lambda)
    ax2 = axes[1]
    n_nonzero = np.sum(np.abs(coef_path) > 1e-6, axis=0)
    ax2.plot(np.log10(lambdas), n_nonzero, 'b-', linewidth=2)
    ax2.fill_between(np.log10(lambdas), 0, n_nonzero, alpha=0.3)
    ax2.set_xlabel('log₁₀(λ)', fontsize=12)
    ax2.set_ylabel('Number of Non-zero Coefficients', fontsize=12)
    ax2.set_title(f'{title}: Sparsity', fontsize=14)
    ax2.set_ylim(0, p + 1)

    plt.tight_layout()
    plt.savefig('regularization_path.png', dpi=150)
    plt.show()

    # Identify top features (largest |coefficient| at the smallest lambda)
    final_coef = np.abs(coef_path[:, -1])
    top_indices = np.argsort(final_coef)[-top_k:][::-1]
    print(f"Top {top_k} features (by final coefficient magnitude):")
    print("-" * 40)
    for rank, idx in enumerate(top_indices, 1):
        name = feature_names[idx] if feature_names else f"Feature {idx}"
        print(f"{rank:2d}. {name}: {coef_path[idx, -1]:.4f}")


def compute_elastic_net_path_comparison(X, y, l1_ratios=[0.2, 0.5, 0.8, 1.0]):
    """
    Compare regularization paths for different l1_ratios (Elastic Net).
    """
    n, p = X.shape
    n_lambdas = 50

    # Use the same lambda range for all ratios
    lambda_max = compute_lambda_max(X, y) * 2  # slightly larger to ensure zeros
    lambda_min = lambda_max * 1e-3
    lambdas = np.logspace(np.log10(lambda_max), np.log10(lambda_min), n_lambdas)
    Cs = 1 / (n * lambdas)

    paths = {}
    for l1_ratio in l1_ratios:
        coef_path = np.zeros((p, n_lambdas))
        for i, C in enumerate(Cs):
            if l1_ratio == 1.0:
                model = LogisticRegression(penalty='l1', C=C,
                                           solver='saga', max_iter=1000)
            elif l1_ratio == 0.0:
                model = LogisticRegression(penalty='l2', C=C,
                                           solver='lbfgs', max_iter=1000)
            else:
                model = LogisticRegression(penalty='elasticnet', C=C,
                                           l1_ratio=l1_ratio,
                                           solver='saga', max_iter=1000)
            model.fit(X, y)
            coef_path[:, i] = model.coef_.ravel()
        paths[l1_ratio] = coef_path

    return lambdas, paths


# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Generate data with a sparse true model
    n, p = 300, 30
    X = np.random.randn(n, p)
    true_coef = np.zeros(p)
    true_coef[:5] = [2.0, -1.5, 1.0, -0.8, 0.5]
    prob = expit(X @ true_coef)
    y = (np.random.rand(n) < prob).astype(float)

    # Standardize
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Compute and visualize the path
    print("Computing Lasso regularization path...")
    lambdas, coef_path = compute_lasso_path(X_scaled, y, n_lambdas=80)
    feature_names = [f"x{i+1}" for i in range(p)]
    visualize_regularization_path(lambdas, coef_path, feature_names,
                                  title="Lasso Logistic Regression")
```

The regularization path reveals which features are most strongly associated with the outcome. As λ decreases from λ_max, features enter the model one at a time, roughly in order of their predictive strength.
Entry order as importance ranking: Features that enter early can be considered "more important" in a regularization-robust sense. This provides a form of feature ranking that's more stable than p-values or single-split importance measures.
Entry order can be misleading with correlated features. If two features are highly correlated, only one enters first. This doesn't mean the second is unimportant—it's just redundant given the first. For grouped importance, use Elastic Net paths or stability selection.
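One way to operationalize this ranking (a sketch; `entry_lambdas` is a hypothetical helper, not a library function) is to record, for each feature, the largest λ at which its coefficient is non-zero along a computed path:

```python
import numpy as np

def entry_lambdas(lambdas, coef_path, tol=1e-6):
    """For each feature, return the largest lambda at which its coefficient
    is non-zero (np.nan if it never enters). Assumes `lambdas` is decreasing
    and `coef_path` has shape (p, len(lambdas))."""
    active = np.abs(coef_path) > tol
    entry = np.full(coef_path.shape[0], np.nan)
    for j, row in enumerate(active):
        idx = np.flatnonzero(row)
        if idx.size:
            entry[j] = lambdas[idx[0]]
    return entry

# Toy path: feature 0 enters at lambda=0.5, feature 1 at 0.25, feature 2 never
lambdas = np.array([1.0, 0.5, 0.25])
coef_path = np.array([[0.0, 0.1, 0.2],
                      [0.0, 0.0, 0.3],
                      [0.0, 0.0, 0.0]])
print(entry_lambdas(lambdas, coef_path))  # feature 0: 0.5, feature 1: 0.25, feature 2: nan
```

Sorting features by entry λ in decreasing order then gives the "earliest entrants first" ranking described above.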
A stable path shows consistent behavior: smooth coefficient trajectories, a fixed entry order, and no features repeatedly entering and leaving the active set.

Signs of instability: coefficients that oscillate or flip sign as λ decreases, features that enter and exit the active set multiple times, and trajectories that change sharply between adjacent λ values.

Unstable paths suggest potential issues: strong collinearity among features, too few observations relative to the number of features, or separation in the data.
Remedy: Increase regularization (use smaller C), add L2 component (switch to Elastic Net), or remove highly collinear features.
A standard regularization path plot shows coefficients (y-axis) vs log(λ) (x-axis):
Left side (large λ): strong regularization; few or no features active. Right side (small λ): weak regularization; approaching the MLE, with many features active.
Key observations:
| Pattern | Interpretation | Action |
|---|---|---|
| Feature enters early | Strong marginal predictor | Likely important; include in model |
| Feature enters late | Weak or conditional predictor | May be redundant; consider exclusion |
| Parallel trajectories | Correlated features | Consider grouping or Elastic Net |
| Coefficient oscillates | Unstable, possibly collinear | Add L2 regularization |
| Coefficient sign flip | Complex interaction | Investigate feature relationships |
The most common approach to selecting λ is cross-validation: split the data into folds, fit the path on each training split, evaluate every λ on the held-out fold, and choose the λ with the best average validation performance.

Common criteria: held-out log-loss (deviance), classification accuracy, and AUC. Log-loss is usually preferred because it varies smoothly with λ and directly matches the fitted likelihood.
Implementation:
- `LogisticRegressionCV` in scikit-learn automates this process
- the `Cs` parameter specifies the λ grid (actually 1/λ values)
- `refit=True` fits the final model on the full data at the optimal λ
```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt


def compare_lambda_selection_methods(X, y, cv=5):
    """
    Compare different methods for selecting lambda on the regularization path:
    1. Cross-validation (minimum error)
    2. Cross-validation (1SE rule)
    3. BIC approximation
    """
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    n, p = X_scaled.shape

    # Lambda grid, expressed as C = 1/lambda for sklearn
    n_lambdas = 50
    Cs = np.logspace(-3, 2, n_lambdas)

    # Methods 1 & 2: cross-validation
    cv_model = LogisticRegressionCV(
        Cs=Cs, penalty='l1', solver='saga', cv=cv,
        scoring='neg_log_loss', max_iter=2000, refit=True
    )
    cv_model.fit(X_scaled, y)

    # scores_ is a dict keyed by class; values have shape (n_folds, n_Cs)
    cv_scores = cv_model.scores_[1]
    mean_scores = np.mean(cv_scores, axis=0)  # average across folds
    std_scores = np.std(cv_scores, axis=0)    # spread across folds

    # CV-optimal C (minimum error)
    C_cv_min = cv_model.C_[0]
    lambda_cv_min = 1 / C_cv_min

    # 1SE rule: most regularized model within one SE of the best score
    # (scores are negative log-loss, so higher is better)
    best_idx = np.argmax(mean_scores)
    threshold = mean_scores[best_idx] - std_scores[best_idx]
    valid = np.flatnonzero(mean_scores >= threshold)
    idx_1se = valid[0] if valid.size else best_idx  # smallest C = largest lambda
    C_1se = Cs[idx_1se]
    lambda_1se = 1 / C_1se

    # Method 3: BIC approximation
    # BIC = -2 * log_lik + k * log(n), k = number of non-zero coefficients
    bic_values = []
    for C in Cs:
        model = LogisticRegression(penalty='l1', C=C, solver='saga',
                                   max_iter=2000)
        model.fit(X_scaled, y)
        probs = np.clip(model.predict_proba(X_scaled)[:, 1], 1e-15, 1 - 1e-15)
        log_lik = np.sum(y * np.log(probs) + (1 - y) * np.log(1 - probs))
        k = np.sum(np.abs(model.coef_) > 1e-6) + 1  # +1 for the intercept
        bic_values.append(-2 * log_lik + k * np.log(n))
    bic_values = np.array(bic_values)
    C_bic = Cs[np.argmin(bic_values)]
    lambda_bic = 1 / C_bic

    # Report results
    print("Lambda Selection Comparison")
    print("=" * 60)
    print(f"{'Method':<30} {'λ':>12} {'C (=1/λ)':>12}")
    print("-" * 60)
    print(f"{'CV (minimum error)':<30} {lambda_cv_min:>12.4f} {C_cv_min:>12.4f}")
    print(f"{'CV (1SE rule)':<30} {lambda_1se:>12.4f} {C_1se:>12.4f}")
    print(f"{'BIC':<30} {lambda_bic:>12.4f} {C_bic:>12.4f}")

    # Fit final models at each selected lambda and compare
    print("Model Comparison at Selected λ")
    print("-" * 60)
    for name, C in [("CV-min", C_cv_min), ("CV-1SE", C_1se), ("BIC", C_bic)]:
        model = LogisticRegression(penalty='l1', C=C, solver='saga',
                                   max_iter=2000)
        model.fit(X_scaled, y)
        n_features = np.sum(np.abs(model.coef_) > 1e-6)
        train_acc = model.score(X_scaled, y)
        print(f"{name:<10}: {n_features:3d} features, "
              f"train accuracy = {train_acc:.4f}")

    return {
        'Cs': Cs,
        'lambdas': 1 / Cs,
        'mean_cv_scores': mean_scores,
        'bic_values': bic_values,
        'selected': {
            'cv_min': lambda_cv_min,
            'cv_1se': lambda_1se,
            'bic': lambda_bic
        }
    }


def visualize_selection(results):
    """Visualize the model selection process."""
    lambdas = results['lambdas']
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    # CV error (log-loss is the negative of the stored scores)
    ax1 = axes[0]
    ax1.plot(np.log10(lambdas), -results['mean_cv_scores'], 'b-', linewidth=2)
    ax1.set_xlabel('log₁₀(λ)', fontsize=11)
    ax1.set_ylabel('Cross-Validation Log-Loss', fontsize=11)
    ax1.set_title('Cross-Validation Error', fontsize=12)
    for name, lam, color in [
        ('CV-min', results['selected']['cv_min'], 'green'),
        ('CV-1SE', results['selected']['cv_1se'], 'orange')
    ]:
        ax1.axvline(np.log10(lam), color=color, linestyle='--', label=name)
    ax1.legend()

    # BIC
    ax2 = axes[1]
    ax2.plot(np.log10(lambdas), results['bic_values'], 'r-', linewidth=2)
    ax2.axvline(np.log10(results['selected']['bic']), color='purple',
                linestyle='--', label='BIC optimal')
    ax2.set_xlabel('log₁₀(λ)', fontsize=11)
    ax2.set_ylabel('BIC', fontsize=11)
    ax2.set_title('Bayesian Information Criterion', fontsize=12)
    ax2.legend()

    plt.tight_layout()
    plt.savefig('lambda_selection.png', dpi=150)
    plt.show()


# Example
if __name__ == "__main__":
    from scipy.special import expit

    np.random.seed(42)
    n, p = 200, 30
    X = np.random.randn(n, p)
    true_coef = np.zeros(p)
    true_coef[:5] = [1.5, -1.0, 0.8, -0.5, 0.3]
    prob = expit(X @ true_coef)
    y = (np.random.rand(n) < prob).astype(float)

    results = compare_lambda_selection_methods(X, y)
    visualize_selection(results)
```

A practical heuristic for selecting λ is the 1-standard-error (1SE) rule: choose the largest λ whose CV error is within one standard error of the minimum.
Rationale: Models with similar CV performance are statistically indistinguishable. Among them, prefer the simpler (more regularized) one. This embodies Occam's razor in model selection.
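The rule itself is only a few lines. A minimal sketch, assuming a decreasing λ grid, "lower is better" errors, and a per-grid-point standard error of the mean CV error:

```python
import numpy as np

def lambda_1se(lambdas, mean_err, se_err):
    """Largest lambda whose mean CV error is within one standard error
    of the minimum. Assumes `lambdas` is sorted in decreasing order."""
    best = np.argmin(mean_err)
    threshold = mean_err[best] + se_err[best]
    within = np.flatnonzero(mean_err <= threshold)
    return lambdas[within[0]]  # first qualifying index = largest lambda

lambdas = np.array([1.0, 0.5, 0.1])
mean_err = np.array([0.30, 0.21, 0.20])
se_err = np.array([0.02, 0.02, 0.02])
print(lambda_1se(lambdas, mean_err, se_err))  # 0.5
```

Here the minimum error (0.20 at λ=0.1) sets a threshold of 0.22, and λ=0.5 is the most regularized model under that threshold.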
When to use: when you value parsimony and interpretability, when n is small so CV error estimates are noisy, or when downstream costs favor carrying fewer features.

When NOT to use: when pure predictive accuracy is the goal, since the 1SE model deliberately trades a little performance for simplicity and can underfit.
Information criteria provide alternatives to cross-validation for model selection:
Akaike Information Criterion (AIC): $$\text{AIC} = -2 \log \mathcal{L}(\hat{\boldsymbol{\beta}}) + 2k$$
Bayesian Information Criterion (BIC): $$\text{BIC} = -2 \log \mathcal{L}(\hat{\boldsymbol{\beta}}) + k \log n$$
where $k$ is the number of non-zero parameters (including intercept).
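Both criteria are cheap to evaluate once a model is fit; a minimal sketch of the arithmetic:

```python
import numpy as np

def aic_bic(log_lik, k, n):
    """AIC and BIC from the maximized log-likelihood, the number of
    non-zero parameters k (including the intercept), and sample size n."""
    return -2.0 * log_lik + 2.0 * k, -2.0 * log_lik + k * np.log(n)

# Example: log-likelihood -100 with k=5 parameters on n=100 samples
aic, bic = aic_bic(-100.0, 5, 100)
print(f"AIC={aic:.1f}, BIC={bic:.1f}")  # AIC=210.0, BIC=223.0
```

Note that with n=100, log n ≈ 4.6 > 2, so BIC charges more than twice AIC's per-parameter price.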
Comparison: AIC penalizes complexity less and tends to select larger models that predict well; BIC penalizes more heavily (for n ≥ 8, log n > 2) and is consistent for recovering the true model when it is among the candidates.
For regularized models, the effective "degrees of freedom" isn't simply the number of non-zero coefficients. The penalty shrinks estimates, reducing effective model complexity. More accurate formulas exist (e.g., trace of the hat matrix), but counting non-zeros is a common approximation that works reasonably well in practice.
When $p$ is large (potentially $p > n$), standard BIC can underpenalize complexity. The Extended BIC (EBIC) adds an additional penalty:
$$\text{EBIC}_\gamma = -2 \log \mathcal{L}(\hat{\boldsymbol{\beta}}) + k \log n + 2\gamma \log \binom{p}{k}$$
where $\gamma \in [0, 1]$ controls the additional penalty based on model size relative to $p$.
Intuition: With many potential features, finding a good-fitting model by chance is easier. EBIC penalizes for this "search space" effect.
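The binomial coefficient in the EBIC formula can overflow for large $p$, so a log-gamma implementation is safer. A sketch (`ebic` and `log_binom` are hypothetical helpers, not library functions):

```python
import numpy as np
from scipy.special import gammaln

def log_binom(p, k):
    """log(p choose k), computed stably via log-gamma."""
    return gammaln(p + 1) - gammaln(k + 1) - gammaln(p - k + 1)

def ebic(log_lik, k, n, p, gamma=0.5):
    """Extended BIC with search-space penalty 2*gamma*log(p choose k)."""
    return -2.0 * log_lik + k * np.log(n) + 2.0 * gamma * log_binom(p, k)

# With gamma=0 the search-space term vanishes and EBIC reduces to ordinary BIC
print(f"{ebic(-100.0, 5, 100, 50, gamma=0.0):.1f}")  # 223.0
```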
Common choices: γ = 0 recovers standard BIC; γ = 0.5 is a common default; γ = 1 gives the strongest penalty, appropriate when p greatly exceeds n.
Rather than selecting a single λ, stability selection uses the regularization path to identify robustly selected features: repeatedly subsample the data, fit the Lasso path on each subsample, and record how often each feature is selected.
Key insight: A feature that is consistently selected across subsamples is likely a true signal, not a spurious correlation. Features that are selected only occasionally are likely noise or unstable due to correlation.
Parameters: the subsample fraction (typically 0.5), the number of subsamples (typically 50–100), and the selection threshold τ (typically 0.6–0.9).
Selection probability curve: For each feature, plot the probability of selection across the λ range. Stable features have high selection probability across a wide range; unstable features are sensitive to λ.
Theoretical guarantee: Under mild conditions (Meinshausen and Bühlmann, 2010), stability selection with threshold $\tau \in (\tfrac{1}{2}, 1]$ bounds the expected number of falsely selected features $V$:

$$E[V] \leq \frac{1}{2\tau - 1} \cdot \frac{\hat{q}^2}{p}$$

where $\hat{q}$ is the average number of features selected per subsample.
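The Meinshausen–Bühlmann bound $E[V] \leq \hat{q}^2 / ((2\tau - 1)\,p)$ for $\tau \in (\tfrac{1}{2}, 1]$ is easy to evaluate in practice; a sketch (`stability_fdr_bound` is a hypothetical helper):

```python
def stability_fdr_bound(tau, q_avg, p):
    """Upper bound on the expected number of falsely selected features,
    E[V] <= q_avg**2 / ((2*tau - 1) * p), valid for tau in (0.5, 1]."""
    assert 0.5 < tau <= 1.0
    return q_avg ** 2 / ((2.0 * tau - 1.0) * p)

# e.g. threshold 0.75, ~10 features selected per subsample, p=100 candidates
print(stability_fdr_bound(0.75, 10, 100))  # 2.0
```

Raising τ or lowering the average model size $\hat{q}$ (by not pushing λ too low on the path) both tighten the bound.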
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from scipy.special import expit
import matplotlib.pyplot as plt


def stability_selection_path(X, y, n_bootstrap=100, sample_fraction=0.5,
                             n_lambdas=30, lambda_ratio=1e-2):
    """
    Stability selection across the regularization path.

    For each subsample, records which features are selected at each
    lambda value, and returns the resulting selection probabilities.

    Parameters
    ----------
    X : array, shape (n, p)
        Feature matrix (should be standardized)
    y : array, shape (n,)
        Binary labels
    n_bootstrap : int
        Number of subsampling iterations
    sample_fraction : float
        Fraction of samples used in each iteration
    n_lambdas : int
        Number of lambda values in the grid
    lambda_ratio : float
        Ratio of lambda_min to lambda_max

    Returns
    -------
    lambdas : array
        Lambda values (decreasing)
    selection_probs : array, shape (p, n_lambdas)
        Selection probability at each (feature, lambda) pair
    max_selection_prob : array, shape (p,)
        Maximum selection probability across lambdas for each feature
    """
    n, p = X.shape
    n_subsample = int(n * sample_fraction)

    # Lambda range, slightly enlarged so the path starts all-zero
    lambda_max = np.max(np.abs(X.T @ (y - y.mean()))) / n * 1.5
    lambda_min = lambda_max * lambda_ratio
    lambdas = np.logspace(np.log10(lambda_max), np.log10(lambda_min), n_lambdas)
    Cs = 1 / (n_subsample * lambdas)  # sklearn's C for the average-loss form

    selection_counts = np.zeros((p, n_lambdas))
    for b in range(n_bootstrap):
        # Subsample without replacement
        indices = np.random.choice(n, size=n_subsample, replace=False)
        X_sub, y_sub = X[indices], y[indices]

        # Compute the path on this subsample and record selections
        for i, C in enumerate(Cs):
            model = LogisticRegression(penalty='l1', C=C, solver='saga',
                                       max_iter=1000, tol=1e-4)
            model.fit(X_sub, y_sub)
            selection_counts[:, i] += np.abs(model.coef_.ravel()) > 1e-6

    selection_probs = selection_counts / n_bootstrap
    return lambdas, selection_probs, np.max(selection_probs, axis=1)


def visualize_stability_selection(lambdas, selection_probs, max_probs,
                                  threshold=0.6, feature_names=None):
    """Visualize stability selection results."""
    p = len(max_probs)
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    stable_features = np.where(max_probs >= threshold)[0]
    unstable_features = np.where(max_probs < threshold)[0]

    # Plot 1: selection probability curves
    ax1 = axes[0]
    for j in unstable_features:  # unstable features in gray
        ax1.plot(np.log10(lambdas), selection_probs[j, :],
                 color='gray', alpha=0.3, linewidth=0.5)
    colors = plt.cm.tab10(np.linspace(0, 1, max(len(stable_features), 1)))
    for idx, j in enumerate(stable_features):  # stable features in color
        name = feature_names[j] if feature_names else f"x{j+1}"
        ax1.plot(np.log10(lambdas), selection_probs[j, :],
                 color=colors[idx], linewidth=2, label=name)
    ax1.axhline(threshold, color='red', linestyle='--',
                label=f'Threshold={threshold}')
    ax1.set_xlabel('log₁₀(λ)', fontsize=12)
    ax1.set_ylabel('Selection Probability', fontsize=12)
    ax1.set_title('Stability Selection Curves', fontsize=14)
    ax1.legend(loc='lower right', fontsize=8)
    ax1.set_ylim(-0.05, 1.05)

    # Plot 2: distribution of max selection probabilities
    ax2 = axes[1]
    sorted_indices = np.argsort(max_probs)[::-1]
    sorted_probs = max_probs[sorted_indices]
    top_n = min(20, p)
    bar_colors = ['green' if prob >= threshold else 'gray'
                  for prob in sorted_probs[:top_n]]
    ax2.barh(range(top_n), sorted_probs[:top_n], color=bar_colors)
    ax2.axvline(threshold, color='red', linestyle='--', label='Threshold')
    ax2.set_xlabel('Max Selection Probability', fontsize=12)
    ax2.set_ylabel('Feature (sorted)', fontsize=12)
    ax2.set_title(f'Top {top_n} Features (Stable: {len(stable_features)})',
                  fontsize=14)
    labels = [feature_names[j] if feature_names else f"x{j+1}"
              for j in sorted_indices[:top_n]]
    ax2.set_yticks(range(top_n))
    ax2.set_yticklabels(labels)
    ax2.invert_yaxis()
    ax2.legend()

    plt.tight_layout()
    plt.savefig('stability_selection.png', dpi=150)
    plt.show()

    # Report stable features
    print(f"Stable Features (selection prob >= {threshold}):")
    print("-" * 50)
    for j in sorted_indices:
        if max_probs[j] >= threshold:
            name = feature_names[j] if feature_names else f"x{j+1}"
            print(f"{name}: {max_probs[j]:.3f}")

    return stable_features


# Example
if __name__ == "__main__":
    np.random.seed(42)

    # Generate sparse data with some correlated features
    n, p = 300, 50
    X = np.random.randn(n, p)
    X[:, 5:8] = X[:, 0:1] + 0.1 * np.random.randn(n, 3)

    true_coef = np.zeros(p)
    true_coef[:5] = [2.0, -1.5, 1.0, -0.8, 0.6]
    prob = expit(X @ true_coef)
    y = (np.random.rand(n) < prob).astype(float)

    # Standardize
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Run stability selection
    print("Running stability selection...")
    lambdas, sel_probs, max_probs = stability_selection_path(
        X_scaled, y, n_bootstrap=50, sample_fraction=0.5
    )
    feature_names = [f"x{i+1}" for i in range(p)]
    stable_features = visualize_stability_selection(
        lambdas, sel_probs, max_probs, threshold=0.6,
        feature_names=feature_names
    )
    print("True non-zero features: x1, x2, x3, x4, x5")
    print(f"Detected stable features: "
          f"{[feature_names[j] for j in stable_features]}")
```

The final page of this module covers hyperparameter selection in depth, including grid search, random search, and efficient optimization strategies for tuning both λ and the L1 ratio in Elastic Net models.