L1 regularization (Lasso) produces sparse models but struggles with correlated features. L2 regularization (Ridge) handles correlation gracefully but cannot eliminate features. What if we could have both? Elastic Net achieves exactly this by combining L1 and L2 penalties, enabling sparse solutions that respect feature correlation structure.
Introduced by Zou and Hastie in 2005, Elastic Net has become the default regularization approach in many machine learning pipelines. It's particularly valuable in genomics, finance, and other domains where features are naturally grouped and correlated, yet model simplicity is crucial.
This page develops the complete theory of Elastic Net for logistic regression, covering the mathematical formulation, optimization, the "grouping effect," and practical tuning strategies.
The name "Elastic Net" evokes the idea of a flexible net that can stretch and contract. The L2 component provides smoothness and stability (like rubber), while the L1 component provides sharp feature selection (like knots in the net). Together, they create a regularization that adapts to data structure.
Elastic Net combines L1 and L2 penalties through a weighted sum:
$$J(\boldsymbol{\beta}) = -\mathcal{L}(\boldsymbol{\beta}) + \lambda \left[ \alpha |\boldsymbol{\beta}|_1 + \frac{(1-\alpha)}{2} |\boldsymbol{\beta}|_2^2 \right]$$
where:

- $\mathcal{L}(\boldsymbol{\beta})$ is the log-likelihood of the logistic model
- $\lambda \geq 0$ is the overall regularization strength
- $\alpha \in [0, 1]$ is the mixing parameter balancing the L1 and L2 components
Special cases:

- $\alpha = 1$: pure Lasso (L1 only)
- $\alpha = 0$: pure Ridge (L2 only)
- $0 < \alpha < 1$: a genuine mixture, combining sparsity with grouping
Different software packages use different parameterizations:
Zou-Hastie (Original): $$\text{Penalty} = \lambda_1 |\boldsymbol{\beta}|_1 + \lambda_2 |\boldsymbol{\beta}|_2^2$$
with $\lambda_1 = \lambda \alpha$ and $\lambda_2 = \lambda(1-\alpha)/2$.
scikit-learn: Uses $C = 1/\lambda$ (inverse regularization) with l1_ratio = $\alpha$:
$$J = C^{-1} \left[ \alpha |\boldsymbol{\beta}|_1 + \frac{(1-\alpha)}{2} |\boldsymbol{\beta}|_2^2 \right] + \text{loss}$$
glmnet (R): Uses $\alpha$ directly as the L1 ratio and $\lambda$ as overall strength, normalized by sample size.
When comparing results across packages, verify the parameterization. In glmnet, alpha=0.5 means a 50% L1 / 50% L2 mix, while in scikit-learn's ElasticNet regressor, alpha is the overall penalty strength and l1_ratio controls the mix. The same parameter name can mean different things in different libraries, so always consult the documentation.
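To make the correspondence concrete, here is a minimal sketch (the helper names `to_sklearn` and `from_sklearn` are mine) mapping between the Zou-Hastie $(\lambda, \alpha)$ parameterization and LogisticRegression's `(C, l1_ratio)`:

```python
def to_sklearn(lam, alpha):
    """Map the Zou-Hastie (lambda, alpha) parameterization to
    scikit-learn LogisticRegression's (C, l1_ratio).

    Caveat: packages also differ in whether the loss is summed or
    averaged over samples, so matching glmnet output exactly may
    require an extra factor of n on top of this mapping."""
    return 1.0 / lam, alpha        # C = 1/lambda, l1_ratio = alpha

def from_sklearn(C, l1_ratio):
    """Inverse mapping: recover (lambda, alpha) from (C, l1_ratio)."""
    return 1.0 / C, l1_ratio

print(to_sklearn(2.0, 0.5))    # (0.5, 0.5)
print(from_sklearn(0.5, 0.5))  # (2.0, 0.5)
```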
The Elastic Net constraint region combines the L1 diamond and L2 ball:
$$\alpha |\beta_1| + \alpha |\beta_2| + \frac{(1-\alpha)}{2}(\beta_1^2 + \beta_2^2) \leq t$$
Geometric interpretation (2D case):
The Elastic Net constraint region has corners (from L1, enabling sparsity) but they are smoothed (from L2, improving stability). This geometry explains why Elastic Net achieves sparse solutions while being more stable than pure Lasso.
```python
import numpy as np
import matplotlib.pyplot as plt

def elastic_net_constraint(alpha, t=1.0, n_points=1000):
    """
    Generate the boundary of the Elastic Net constraint region in 2D.

    The constraint is:
        alpha * (|b1| + |b2|) + (1 - alpha)/2 * (b1^2 + b2^2) <= t

    We parameterize by angle and solve for the radius.
    """
    theta = np.linspace(0, 2 * np.pi, n_points)

    # For each angle, find the radius r such that the constraint is
    # satisfied with equality. Let b1 = r*cos(theta), b2 = r*sin(theta):
    # alpha * r * (|cos| + |sin|) + (1 - alpha)/2 * r^2 = t
    abs_cos = np.abs(np.cos(theta))
    abs_sin = np.abs(np.sin(theta))

    if alpha == 1:
        # Pure L1: r * (|cos| + |sin|) = t
        r = t / (abs_cos + abs_sin)
    elif alpha == 0:
        # Pure L2: r^2 / 2 = t  =>  r = sqrt(2t)
        r = np.sqrt(2 * t) * np.ones_like(theta)
    else:
        # Quadratic in r: (1 - alpha)/2 * r^2 + alpha * (|cos| + |sin|) * r - t = 0
        a_coef = (1 - alpha) / 2
        b_coef = alpha * (abs_cos + abs_sin)
        c_coef = -t
        # Quadratic formula (positive root)
        discriminant = b_coef**2 - 4 * a_coef * c_coef
        r = (-b_coef + np.sqrt(discriminant)) / (2 * a_coef)

    x = r * np.cos(theta)
    y = r * np.sin(theta)
    return x, y

def visualize_constraint_regions():
    """
    Visualize how the constraint region changes with mixing parameter alpha.
    """
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    alpha_values = [1.0, 0.5, 0.0]
    titles = ['Lasso (α=1)', 'Elastic Net (α=0.5)', 'Ridge (α=0)']

    for ax, alpha, title in zip(axes, alpha_values, titles):
        x, y = elastic_net_constraint(alpha, t=1.0)
        ax.fill(x, y, alpha=0.3, color='blue')
        ax.plot(x, y, 'b-', linewidth=2)
        ax.axhline(0, color='gray', linestyle='-', linewidth=0.5)
        ax.axvline(0, color='gray', linestyle='-', linewidth=0.5)
        ax.set_xlim(-2, 2)
        ax.set_ylim(-2, 2)
        ax.set_aspect('equal')
        ax.set_xlabel('β₁', fontsize=12)
        ax.set_ylabel('β₂', fontsize=12)
        ax.set_title(title, fontsize=14)

        # Mark corners if they exist
        if alpha > 0:
            # Corners approximately on axes
            corners_x = [1, 0, -1, 0]
            corners_y = [0, 1, 0, -1]
            ax.scatter(corners_x, corners_y, s=50, c='red', zorder=5,
                       label='Corner (sparse)')
            ax.legend()

    plt.tight_layout()
    plt.savefig('elastic_net_constraints.png', dpi=150)
    plt.show()

if __name__ == "__main__":
    visualize_constraint_regions()
```

Recall the Lasso's behavior with correlated features: among a group of similarly predictive features, Lasso tends to select one arbitrarily and exclude the others. This is problematic because:

- the choice among correlated features is essentially arbitrary, so the selected set is unstable under resampling
- domain experts often expect all members of a correlated group to be reported together
- discarding near-duplicates of a selected feature throws away information that could stabilize the fit
Example: Consider three features $x_1, x_2, x_3$ where $x_2 \approx x_3$ (highly correlated) and both predict $y$:

- Lasso typically gives a large coefficient to one of $x_2, x_3$ and zero to the other, and which one wins can flip under small perturbations of the data
- Ridge gives $x_2$ and $x_3$ similar coefficients but keeps every feature, including irrelevant ones
- Elastic Net gives $x_2$ and $x_3$ similar, nonzero coefficients while still zeroing out irrelevant features
Elastic Net exhibits a grouping effect: highly correlated features tend to have similar coefficients. Formally, for features $x_i$ and $x_j$ with sample correlation $\rho_{ij}$:
$$|\hat{\beta}_i - \hat{\beta}_j| \leq \frac{1}{\lambda(1-\alpha)} \cdot \sqrt{2(1-\rho_{ij})} \cdot |\mathbf{y}|_2$$
Implications:

- As $\rho_{ij} \to 1$, the bound forces $\hat{\beta}_i \approx \hat{\beta}_j$: nearly identical features receive nearly identical coefficients
- The bound tightens as the L2 weight $\lambda(1-\alpha)$ grows, so more L2 means stronger grouping
- With pure Lasso ($\alpha = 1$) the bound is vacuous, consistent with Lasso's arbitrary selection among correlated features
The grouping effect is driven by the L2 component. Ridge regression naturally distributes weight among correlated predictors, and this property carries over to Elastic Net.
If you want strong grouping behavior (correlated features have similar coefficients), use smaller α values (more L2). If you want more aggressive sparsity at the expense of grouping, use larger α values (more L1). Typical defaults: α = 0.5 (balanced) or α = 0.95 (mostly L1 with slight L2 stabilization).
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from scipy.special import expit

def demonstrate_grouping_effect():
    """
    Show how Elastic Net groups correlated features while Lasso doesn't.
    """
    np.random.seed(42)
    n = 500

    # Create a base feature and two copies with noise
    x_base = np.random.randn(n)
    x1 = x_base + 0.05 * np.random.randn(n)  # Nearly identical
    x2 = x_base + 0.05 * np.random.randn(n)  # Nearly identical
    x3 = np.random.randn(n)                  # Independent
    x4 = np.random.randn(n)                  # Independent noise feature

    X = np.column_stack([x1, x2, x3, x4])

    # True relationship uses x1 + x2 + x3 (ignores x4)
    true_signal = 1.0 * x1 + 1.0 * x2 + 0.5 * x3
    prob = expit(true_signal)
    y = (np.random.rand(n) < prob).astype(int)

    # Standardize
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Correlation matrix
    corr_matrix = np.corrcoef(X.T)
    print("Feature Correlations:")
    print("-" * 40)
    feature_names = ['x1', 'x2', 'x3', 'x4']
    for i in range(4):
        for j in range(i + 1, 4):
            print(f"corr({feature_names[i]}, {feature_names[j]}) = {corr_matrix[i, j]:.3f}")
    print()

    # Compare Lasso, Ridge, and Elastic Net
    models = {
        'Lasso (α=1.0)': LogisticRegression(
            penalty='l1', C=1.0, solver='saga', max_iter=5000
        ),
        'Elastic Net (α=0.5)': LogisticRegression(
            penalty='elasticnet', C=1.0, solver='saga',
            l1_ratio=0.5, max_iter=5000
        ),
        'Elastic Net (α=0.2)': LogisticRegression(
            penalty='elasticnet', C=1.0, solver='saga',
            l1_ratio=0.2, max_iter=5000
        ),
        'Ridge (α=0)': LogisticRegression(
            penalty='l2', C=1.0, solver='lbfgs', max_iter=5000
        ),
    }

    print("Coefficient Comparison (x1 and x2 are nearly identical)")
    print("=" * 70)
    print(f"{'Model':<25} {'β₁':>10} {'β₂':>10} {'|β₁-β₂|':>10} {'β₃':>10} {'β₄':>10}")
    print("-" * 70)

    for name, model in models.items():
        model.fit(X_scaled, y)
        coef = model.coef_[0]
        diff = np.abs(coef[0] - coef[1])
        print(f"{name:<25} {coef[0]:>10.4f} {coef[1]:>10.4f} {diff:>10.4f} "
              f"{coef[2]:>10.4f} {coef[3]:>10.4f}")

    print("-" * 70)
    print("Observations:")
    print("- Lasso: Large |β₁ - β₂| (one dominates, arbitrary selection)")
    print("- Elastic Net: Smaller |β₁ - β₂| (grouping effect)")
    print("- Ridge: β₁ ≈ β₂ (perfect grouping, but no sparsity)")
    print("- All methods correctly shrink β₄ (noise feature) toward zero")

def analyze_grouping_vs_alpha():
    """
    Show how grouping strength varies with the mixing parameter alpha.
    """
    np.random.seed(42)
    n = 500
    x_base = np.random.randn(n)

    # Create pairs of correlated features with different correlations
    correlations = [0.99, 0.9, 0.7, 0.5]
    noise_levels = [0.05, 0.3, 0.7, 1.0]
    features = []
    for noise in noise_levels:
        x_noisy = x_base + noise * np.random.randn(n)
        features.append(x_noisy)

    X = np.column_stack(features)
    true_signal = X @ np.array([1.0, 1.0, 1.0, 1.0])
    prob = expit(true_signal)
    y = (np.random.rand(n) < prob).astype(int)

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    alpha_values = [1.0, 0.8, 0.5, 0.2, 0.1]

    print("Grouping Effect vs Alpha")
    print("=" * 60)
    print(f"{'Alpha':<10} {'β₁':>10} {'β₂':>10} {'β₃':>10} {'β₄':>10}")
    print("-" * 60)

    for alpha in alpha_values:
        if alpha == 1.0:
            model = LogisticRegression(penalty='l1', C=1.0,
                                       solver='saga', max_iter=5000)
        elif alpha == 0.0:
            model = LogisticRegression(penalty='l2', C=1.0,
                                       solver='lbfgs', max_iter=5000)
        else:
            model = LogisticRegression(
                penalty='elasticnet', C=1.0, solver='saga',
                l1_ratio=alpha, max_iter=5000
            )
        model.fit(X_scaled, y)
        coef = model.coef_[0]
        print(f"{alpha:<10.1f} {coef[0]:>10.4f} {coef[1]:>10.4f} "
              f"{coef[2]:>10.4f} {coef[3]:>10.4f}")

if __name__ == "__main__":
    demonstrate_grouping_effect()
    analyze_grouping_vs_alpha()
```

The Elastic Net objective can be written as:
$$J(\boldsymbol{\beta}) = \underbrace{-\mathcal{L}(\boldsymbol{\beta}) + \frac{\lambda(1-\alpha)}{2}|\boldsymbol{\beta}|_2^2}_{g(\boldsymbol{\beta}) \text{ (smooth)}} + \underbrace{\lambda\alpha|\boldsymbol{\beta}|_1}_{h(\boldsymbol{\beta}) \text{ (non-smooth)}}$$
The smooth part $g$ now includes both the negative log-likelihood AND the L2 penalty, making it strongly convex. The non-smooth part $h$ is the L1 penalty, handled via soft thresholding.
Gradient of smooth part: $$\nabla g(\boldsymbol{\beta}) = -\nabla \mathcal{L}(\boldsymbol{\beta}) + \lambda(1-\alpha)\boldsymbol{\beta}$$
The L2 term makes the Hessian of $g$ more positive definite, improving convergence.
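This conditioning effect can be checked numerically. The sketch below (my own construction, not from the implementation later on this page) builds a near-singular logistic Hessian from two almost-collinear features and shows that the L2 term shifts every eigenvalue up by exactly $\lambda(1-\alpha)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.standard_normal(n)
# Two nearly collinear columns make X^T W X close to singular
X = np.column_stack([x, x + 0.01 * rng.standard_normal(n)])
W = 0.25 * np.eye(n)            # logistic weights p(1-p), here at p = 0.5
H = X.T @ W @ X                 # Hessian of the negative log-likelihood

lam, alpha = 1.0, 0.5
H_reg = H + lam * (1 - alpha) * np.eye(2)   # add the L2 curvature

eig_min = np.linalg.eigvalsh(H)[0]
eig_min_reg = np.linalg.eigvalsh(H_reg)[0]
print(f"min eigenvalue without L2: {eig_min:.6f}")
print(f"min eigenvalue with L2:    {eig_min_reg:.6f}")
# Adding lam*(1-alpha)*I shifts every eigenvalue up by exactly lam*(1-alpha),
# so the smooth part stays strongly convex even under near-collinearity.
```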
Coordinate descent adapts naturally to Elastic Net: the algorithm cycles through the coordinates $j$, solving a one-dimensional problem for each.
Univariate subproblem: For coefficient $\beta_j$, minimize:
$$\tilde{g}_j(\beta_j) + \lambda\alpha|\beta_j| + \frac{\lambda(1-\alpha)}{2}\beta_j^2$$
where $\tilde{g}_j$ is the quadratic approximation of the log-likelihood.
Solution: $$\hat{\beta}_j = \frac{S_{\lambda\alpha}(z_j)}{1 + \lambda(1-\alpha)/H_{jj}}$$
where $z_j$ is the unconstrained Newton step, $S$ is soft thresholding, and $H_{jj}$ is the relevant Hessian diagonal.
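As a sanity check on this closed-form update, the following sketch (the helper names `soft_threshold` and `coord_update` are mine, and the thresholding level is written as $\lambda\alpha/H$ because here $z$ is the Newton step rather than a weighted inner product) compares it against brute-force minimization of the same one-dimensional objective:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft thresholding operator S_t(z)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def coord_update(z, H, lam, alpha):
    """Closed-form minimizer of
        H/2 * (b - z)^2 + lam*alpha*|b| + lam*(1-alpha)/2 * b^2,
    i.e. the Elastic Net coordinate update for a quadratic approximation
    with curvature H and unconstrained Newton step z."""
    return soft_threshold(z, lam * alpha / H) / (1.0 + lam * (1 - alpha) / H)

# Compare against brute-force minimization of the same 1-D objective
z, H, lam, alpha = 0.8, 2.0, 1.0, 0.5
closed = coord_update(z, H, lam, alpha)

grid = np.linspace(-2, 2, 400001)
obj = H / 2 * (grid - z) ** 2 + lam * alpha * np.abs(grid) \
      + lam * (1 - alpha) / 2 * grid ** 2
brute = grid[np.argmin(obj)]
print(closed, brute)  # both ≈ 0.44
```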
```python
import numpy as np
from scipy.special import expit

class ElasticNetLogisticRegression:
    """
    Logistic Regression with Elastic Net penalty using coordinate descent.

    Minimizes:
        -log_likelihood + lambda * [alpha * ||beta||_1 + (1-alpha)/2 * ||beta||_2^2]
    """

    def __init__(self, lambda_reg=1.0, alpha=0.5, max_iter=1000, tol=1e-6):
        """
        Parameters
        ----------
        lambda_reg : float
            Overall regularization strength
        alpha : float in [0, 1]
            Elastic net mixing parameter (1 = Lasso, 0 = Ridge)
        max_iter : int
            Maximum iterations
        tol : float
            Convergence tolerance
        """
        self.lambda_reg = lambda_reg
        self.alpha = alpha
        self.max_iter = max_iter
        self.tol = tol
        self.coef_ = None
        self.intercept_ = None
        self.n_iter_ = 0

    def _soft_threshold(self, z, threshold):
        """Soft thresholding operator."""
        return np.sign(z) * np.maximum(np.abs(z) - threshold, 0)

    def fit(self, X, y):
        """Fit the Elastic Net logistic regression model."""
        n, p = X.shape

        # Initialize
        intercept = 0.0
        coef = np.zeros(p)

        # L1 and L2 weights
        l1_weight = self.lambda_reg * self.alpha
        l2_weight = self.lambda_reg * (1 - self.alpha)

        for iteration in range(self.max_iter):
            coef_old = coef.copy()
            intercept_old = intercept

            # Compute current predictions
            linear_pred = X @ coef + intercept
            prob = expit(linear_pred)

            # Weights (diagonal of W matrix)
            weights = prob * (1 - prob)
            weights = np.maximum(weights, 1e-10)

            # Update intercept (no regularization)
            grad_intercept = np.sum(prob - y)
            hess_intercept = np.sum(weights)
            intercept = intercept - grad_intercept / hess_intercept

            # Update each coefficient
            for j in range(p):
                # Gradient from log-likelihood
                residual = prob - y
                grad_j = np.dot(X[:, j], residual)

                # Hessian diagonal
                hess_j = np.dot(X[:, j] ** 2, weights)
                if hess_j < 1e-10:
                    continue

                # Newton step for smooth part (including L2)
                # The L2 penalty adds l2_weight to the effective Hessian
                effective_hess = hess_j + l2_weight

                # Unconstrained minimizer of quadratic approximation
                z = coef[j] - grad_j / effective_hess

                # Soft threshold for L1 component
                threshold = l1_weight / effective_hess
                coef[j] = self._soft_threshold(z, threshold)

                # Update predictions for next coordinate
                linear_pred = X @ coef + intercept
                prob = expit(linear_pred)
                weights = prob * (1 - prob)
                weights = np.maximum(weights, 1e-10)

            # Check convergence
            coef_change = np.max(np.abs(coef - coef_old))
            intercept_change = np.abs(intercept - intercept_old)
            if max(coef_change, intercept_change) < self.tol:
                self.n_iter_ = iteration + 1
                break
        else:
            self.n_iter_ = self.max_iter

        self.coef_ = coef
        self.intercept_ = intercept
        return self

    def predict_proba(self, X):
        """Predict class probabilities."""
        linear_pred = X @ self.coef_ + self.intercept_
        prob_1 = expit(linear_pred)
        return np.column_stack([1 - prob_1, prob_1])

    def predict(self, X, threshold=0.5):
        """Predict class labels."""
        return (self.predict_proba(X)[:, 1] >= threshold).astype(int)

    @property
    def n_nonzero(self):
        """Number of non-zero coefficients."""
        if self.coef_ is None:
            return 0
        return np.sum(np.abs(self.coef_) > 1e-6)

# Compare with sklearn
if __name__ == "__main__":
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    np.random.seed(42)

    # Generate data
    n, p = 300, 20
    X = np.random.randn(n, p)
    true_coef = np.zeros(p)
    true_coef[:5] = [2.0, -1.5, 1.0, -0.5, 0.5]
    prob = expit(X @ true_coef)
    y = (np.random.rand(n) < prob).astype(float)

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Our implementation
    our_model = ElasticNetLogisticRegression(lambda_reg=1.0, alpha=0.5)
    our_model.fit(X_scaled, y)

    # sklearn implementation
    sk_model = LogisticRegression(
        penalty='elasticnet', C=1.0, solver='saga',
        l1_ratio=0.5, max_iter=5000
    )
    sk_model.fit(X_scaled, y)

    print("Elastic Net Comparison")
    print("=" * 50)
    print(f"Our implementation: {our_model.n_nonzero} non-zero, "
          f"{our_model.n_iter_} iterations")
    print(f"sklearn: {np.sum(np.abs(sk_model.coef_[0]) > 1e-6)} non-zero")

    # Check coefficient agreement
    max_diff = np.max(np.abs(our_model.coef_ - sk_model.coef_[0]))
    print(f"Max coefficient difference: {max_diff:.6f}")
```

The L2 component in Elastic Net makes the optimization problem strongly convex, which improves convergence rates. Pure Lasso can converge slowly when features are highly correlated; Elastic Net typically converges faster due to the conditioning effect of the L2 penalty.
Elastic Net corresponds to MAP estimation with a prior that is a product of Laplace and Gaussian:
$$p(\beta_j) \propto \exp\left( -\lambda\alpha|\beta_j| - \frac{\lambda(1-\alpha)}{2}\beta_j^2 \right)$$
This can be viewed as a scale mixture: the L1 component (Laplace) induces sparsity while the L2 component (Gaussian) provides regularization.
Properties:

- The Laplace factor keeps a sharp peak (a non-differentiable point) at zero, so the MAP estimate can set coefficients exactly to zero
- The Gaussian factor makes the negative log-prior strongly convex, stabilizing estimates when features are correlated
- The MAP estimate under this prior is exactly the Elastic Net solution
An alternative Bayesian view places a Gamma hyperprior on the Lasso penalty:
$$\beta_j | \tau_j \sim \mathcal{N}(0, \tau_j)$$ $$\tau_j \sim \text{Gamma}$$
Marginalizing over $\tau_j$ can produce prior shapes similar to Elastic Net. This hierarchical perspective motivates more sophisticated priors like the horseshoe prior, which achieves even stronger sparsity while preserving large signals.
Practical implication: Elastic Net is a computationally efficient approximation to more complex Bayesian sparse regression models. For production systems, Elastic Net offers most of the benefits without the computational overhead of full Bayesian inference.
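The two ingredients of this prior can be checked directly: the Laplace factor puts a kink in the negative log-density at zero (which is what allows exact zeros at the MAP estimate), while the Gaussian factor dominates the tails. A small sketch, with my own helper `neg_log_prior`:

```python
import numpy as np

lam, alpha = 1.0, 0.5

def neg_log_prior(b):
    # Elastic Net negative log-prior, up to the normalizing constant
    return lam * (alpha * np.abs(b) + (1 - alpha) / 2 * b ** 2)

# Kink at zero: the one-sided slopes differ, which is what lets the MAP
# estimate sit exactly at zero (sparsity from the Laplace factor)
eps = 1e-8
slope_right = (neg_log_prior(eps) - neg_log_prior(0.0)) / eps
slope_left = (neg_log_prior(0.0) - neg_log_prior(-eps)) / eps
print(slope_right, slope_left)   # ≈ +0.5 and -0.5

# Gaussian tails: for large |b| the quadratic term dominates the penalty
b = 100.0
tail_ratio = neg_log_prior(b) / (lam * (1 - alpha) / 2 * b ** 2)
print(tail_ratio)                # ≈ 1.02, i.e. essentially quadratic
```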
| Prior Type | Regularization | Properties |
|---|---|---|
| Gaussian (N) | L2 (Ridge) | Smooth, no sparsity, fast computation |
| Laplace (DE) | L1 (Lasso) | Peaked, sparse, arbitrary group selection |
| Gaussian + Laplace | Elastic Net | Peaked, sparse, grouped selection |
| Spike-and-Slab | Discrete selection | Exact sparsity, computationally intensive |
| Horseshoe | Continuous | Strong sparsity, preserves large signals |
Elastic Net excels in situations that challenge both pure Lasso and pure Ridge:
| Scenario | Recommended | Rationale |
|---|---|---|
| True sparsity, uncorrelated features | Lasso | Maximally sparse, stable selection |
| No sparsity, correlated features | Ridge | Best prediction, handles multicollinearity |
| Sparsity + correlated features | Elastic Net | Groups correlated features, still sparse |
| p >> n | Elastic Net | Overcomes Lasso's n feature limit |
| Fast optimization needed | Ridge | Smooth problem, Newton methods work |
| Quick baseline model | Elastic Net (α=0.5) | Reasonable default for most problems |
A practical approach: start with Elastic Net (α=0.5), then compare with Lasso (α=1) and Ridge (α=0) via cross-validation. If all three perform similarly, prefer the simplest interpretation (Ridge for stability, Lasso for sparsity). Use Elastic Net when it meaningfully outperforms both extremes.
scikit-learn provides Elastic Net logistic regression through the LogisticRegression class with penalty='elasticnet':
```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import numpy as np

def create_elastic_net_pipeline(C=1.0, l1_ratio=0.5):
    """
    Create a complete Elastic Net logistic regression pipeline.

    Parameters
    ----------
    C : float
        Inverse regularization strength (C = 1/lambda)
    l1_ratio : float
        L1 mixing ratio (alpha in our notation)
        l1_ratio = 1 -> Lasso
        l1_ratio = 0 -> Ridge
        0 < l1_ratio < 1 -> Elastic Net
    """
    return Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LogisticRegression(
            penalty='elasticnet',
            C=C,
            l1_ratio=l1_ratio,
            solver='saga',  # Required for elasticnet
            max_iter=5000,
            random_state=42,
            n_jobs=-1
        ))
    ])

def tune_elastic_net(X, y, cv=5):
    """
    Tune both C and l1_ratio using grid search cross-validation.

    This is the recommended approach for Elastic Net:
    tune BOTH hyperparameters.
    """
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LogisticRegression(
            penalty='elasticnet',
            solver='saga',
            max_iter=5000,
            random_state=42
        ))
    ])

    # Parameter grid
    param_grid = {
        'classifier__C': [0.01, 0.1, 1.0, 10.0],
        'classifier__l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
    }

    grid_search = GridSearchCV(
        pipeline, param_grid, cv=cv,
        scoring='roc_auc', n_jobs=-1, verbose=1
    )
    grid_search.fit(X, y)

    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best CV AUC: {grid_search.best_score_:.4f}")

    return grid_search

def analyze_regularization_path(X, y):
    """
    Examine how coefficients change across regularization strengths
    at different l1_ratios.
    """
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    l1_ratios = [0.2, 0.5, 0.8, 1.0]
    C_values = np.logspace(-2, 1, 20)

    results = {l1_ratio: [] for l1_ratio in l1_ratios}

    for l1_ratio in l1_ratios:
        for C in C_values:
            if l1_ratio == 1.0:
                # Pure Lasso
                model = LogisticRegression(
                    penalty='l1', C=C, solver='saga', max_iter=5000
                )
            else:
                model = LogisticRegression(
                    penalty='elasticnet', C=C, l1_ratio=l1_ratio,
                    solver='saga', max_iter=5000
                )
            model.fit(X_scaled, y)
            results[l1_ratio].append({
                'C': C,
                'lambda': 1 / C,
                'n_nonzero': np.sum(np.abs(model.coef_[0]) > 1e-6),
                'coef_norm': np.linalg.norm(model.coef_[0])
            })

    return results

# Full example
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Generate data with correlated features
    X, y = make_classification(
        n_samples=500,
        n_features=50,
        n_informative=10,
        n_redundant=20,  # Correlated with informative features
        n_clusters_per_class=2,
        random_state=42
    )

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Tune hyperparameters
    print("Tuning Elastic Net...")
    grid_search = tune_elastic_net(X_train, y_train)

    # Evaluate on test set
    best_model = grid_search.best_estimator_
    test_score = best_model.score(X_test, y_test)

    coef = best_model.named_steps['classifier'].coef_[0]
    n_selected = np.sum(np.abs(coef) > 1e-6)

    print(f"Test accuracy: {test_score:.4f}")
    print(f"Features selected: {n_selected} of {X.shape[1]}")
```

For the Elastic Net penalty in scikit-learn, you must use solver='saga'. Other solvers (lbfgs, newton-cg, liblinear) don't support the mixed L1/L2 penalty. The SAGA solver is a variant of stochastic average gradient descent with efficient support for L1 penalties.
The next page explores the regularization path—how solutions change as we vary the regularization strength λ. Understanding this path is crucial for hyperparameter selection and provides insight into which features matter most for prediction.