In our journey through regularization techniques, we've encountered two powerful but distinct approaches: Ridge Regression (L2) with its smooth shrinkage properties, and Lasso Regression (L1) with its ability to produce sparse models through automatic feature selection. Each excels in different scenarios, but each also carries fundamental limitations.
What if we could harness the strengths of both while mitigating their individual weaknesses? This is precisely what Elastic Net accomplishes—a regularization technique that elegantly combines L1 and L2 penalties into a unified framework that often outperforms either approach alone.
Elastic Net was introduced by Hui Zou and Trevor Hastie in their seminal 2005 paper 'Regularization and Variable Selection via the Elastic Net', specifically designed to address scenarios where Lasso struggles: high-dimensional settings with correlated features, situations where we expect many small but non-zero effects, and cases where we want both shrinkage and selection.
By the end of this page, you will understand the complete mathematical formulation of Elastic Net, grasp how the combined penalty creates a 'stretchy' regularization that adapts to data characteristics, and develop intuition for why this synthesis overcomes the individual limitations of Ridge and Lasso regression.
Before introducing Elastic Net, we must understand precisely why combining L1 and L2 penalties is necessary. Both Ridge and Lasso have fundamental limitations that become critical in modern high-dimensional settings.
Ridge Regression's Limitation: No Feature Selection
Ridge regression applies uniform shrinkage to all coefficients, pushing them toward zero but never exactly to zero. This means every feature remains in the model, no matter how irrelevant it is.
Mathematically, Ridge solves:
$$\hat{\boldsymbol{\beta}}_{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2 \right\}$$
The squared L2 penalty is strictly convex and differentiable everywhere, meaning coefficients approach zero asymptotically but never reach it.
In genomics, finance, and text analysis, we often have thousands or millions of features. Ridge regression includes all of them in predictions, making the model computationally expensive to deploy and impossible to interpret. We need a method that can identify which features matter.
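To make this concrete, here is a minimal sketch (using scikit-learn; the data, penalty strengths, and zero-threshold are illustrative choices) contrasting how Ridge keeps every coefficient non-zero while Lasso zeroes out irrelevant ones:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
# Only the first three features actually influence y
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + 0.1 * rng.standard_normal(n)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

ridge_nonzero = int(np.sum(np.abs(ridge.coef_) > 1e-8))
lasso_nonzero = int(np.sum(np.abs(lasso.coef_) > 1e-8))
print(f"Ridge keeps {ridge_nonzero} of {p} coefficients non-zero")
print(f"Lasso keeps {lasso_nonzero} of {p} coefficients non-zero")
```

Ridge reports all 20 coefficients as non-zero even though 17 features are pure noise; Lasso prunes most of the irrelevant ones.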
Lasso's Limitation: Instability with Correlated Features
Lasso performs automatic feature selection by driving coefficients exactly to zero. However, this sparsity-inducing property creates a critical problem:
$$\hat{\boldsymbol{\beta}}_{\text{lasso}} = \arg\min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1 \right\}$$
When features are highly correlated, Lasso exhibits arbitrary selection behavior: it tends to pick one feature from a correlated group essentially at random and zero out the rest, and which one it picks can change with small perturbations of the data.
Theoretical Bound on Selected Features:
Zou and Hastie proved that for the standard Lasso, the number of selected features is bounded:
$$\left| \{ j : \hat{\beta}_j \neq 0 \} \right| \leq \min(n, p)$$
where $n$ is the number of observations and $p$ is the number of features. In the $p \gg n$ regime (more features than samples), Lasso can select at most $n$ features—a severe limitation when many features might be relevant.
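A quick sketch shows this bound in action in the $p \gg n$ regime (using scikit-learn's LARS-based Lasso solver on synthetic data; the sizes and penalty value are illustrative):

```python
import numpy as np
from sklearn.linear_model import LassoLars

rng = np.random.default_rng(0)
n, p = 20, 100  # far more features than observations
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Even with a nearly vanishing penalty, the Lasso path can activate
# at most n features when p >> n
model = LassoLars(alpha=1e-4, fit_intercept=False).fit(X, y)
n_selected = int(np.sum(np.abs(model.coef_) > 1e-10))
print(f"Selected features: {n_selected} (bound min(n, p) = {min(n, p)})")
```

No matter how weak the penalty, the number of active features never exceeds 20 here, even though 100 candidates are available.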
| Property | Ridge (L2) | Lasso (L1) | Desired Behavior |
|---|---|---|---|
| Feature Selection | Never (all coefficients non-zero) | Yes (sparse solutions) | Selective when appropriate |
| Correlated Features | Stable (similar coefficients) | Arbitrary (selects one) | Stable grouping |
| Max Selected Features | All p features | At most min(n, p) | Unlimited |
| Coefficient Shrinkage | Smooth, proportional | Discontinuous jumps | Smooth with selection |
| Uniqueness of Solution | Always unique | May be non-unique | Prefer uniqueness |
| Computational Complexity | Closed form O(p³) | Iterative algorithms | Efficient |
Imagine predicting house prices with features 'square_feet' and 'square_meters'—perfectly correlated. Ridge would give both similar weights (correct behavior). Lasso would arbitrarily pick one and set the other to zero (problematic). We need a method that recognizes they should be treated as a group.
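This behavior is easy to reproduce. The sketch below (scikit-learn, synthetic prices, illustrative penalty values) fits all three estimators to two perfectly collinear size features; after standardization the two columns are numerically identical:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
n = 200
sqft = rng.uniform(50, 300, n)
sqm = sqft * 0.092903  # same quantity in different units: perfectly correlated
price = 2.0 * sqft + 5.0 * rng.standard_normal(n)

# After standardization the two columns become numerically identical
X = np.column_stack([sqft, sqm])
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = price - price.mean()

coefs = {}
for name, model in [("Ridge", Ridge(alpha=1.0)),
                    ("Lasso", Lasso(alpha=1.0)),
                    ("ElasticNet", ElasticNet(alpha=1.0, l1_ratio=0.5))]:
    model.fit(X, y)
    coefs[name] = model.coef_.copy()
    print(f"{name:>10}: square_feet = {coefs[name][0]:8.2f}, "
          f"square_meters = {coefs[name][1]:8.2f}")
```

Ridge splits the weight equally between the duplicated features, Lasso gives all the weight to one and exactly zero to the other, and Elastic Net (previewing the next sections) shares the weight between them.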
The Elastic Net elegantly combines both penalty terms into a single objective function. The general formulation is:
$$\hat{\boldsymbol{\beta}}_{\text{enet}} = \arg\min_{\boldsymbol{\beta}} \left\{ \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \left( \alpha \|\boldsymbol{\beta}\|_1 + \frac{1-\alpha}{2} \|\boldsymbol{\beta}\|_2^2 \right) \right\}$$
Let's unpack this formulation carefully:
The Mixing Parameter α ∈ [0, 1]: α controls the balance between the two penalties. At α = 1 the penalty reduces to pure Lasso; at α = 0 it reduces to pure Ridge; intermediate values blend the two behaviors.
The Regularization Strength λ ≥ 0: λ controls the overall amount of shrinkage. λ = 0 recovers ordinary least squares; as λ grows, all coefficients are driven toward zero.
The Combined Penalty Term:
$$P_{\alpha}(\boldsymbol{\beta}) = \alpha \|\boldsymbol{\beta}\|_1 + \frac{1-\alpha}{2} \|\boldsymbol{\beta}\|_2^2 = \alpha \sum_{j=1}^{p} |\beta_j| + \frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^2$$
This penalty is a convex combination of L1 and L2 norms, ensuring the overall optimization problem remains convex.
The factor of 1/2 in front of the L2 term and the 1/2n factor in the loss are common conventions that simplify derivatives. Different implementations may use slightly different scaling—always check the documentation. The relative behavior remains the same; only the scale of λ changes.
Alternative Parameterization (λ₁, λ₂):
Some formulations use separate regularization parameters for each penalty:
$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda_1 \|\boldsymbol{\beta}\|_1 + \lambda_2 \|\boldsymbol{\beta}\|_2^2 \right\}$$
The relationship between parameterizations: up to the differing loss and penalty scaling conventions, λ₁ = λα and λ₂ = λ(1 − α); equivalently, α = λ₁ / (λ₁ + λ₂) and λ = λ₁ + λ₂.
The (λ, α) parameterization is preferred because α lives in the bounded interval [0, 1], which makes it easy to grid-search, and because it cleanly separates the type of regularization (α) from its overall strength (λ).
```python
import numpy as np


def elastic_net_objective(beta, X, y, lambda_val, alpha):
    """
    Compute the Elastic Net objective function value.

    Parameters
    ----------
    beta : array of shape (p,)
        Coefficient vector
    X : array of shape (n, p)
        Feature matrix
    y : array of shape (n,)
        Target vector
    lambda_val : float
        Overall regularization strength
    alpha : float in [0, 1]
        Mixing parameter (1 = Lasso, 0 = Ridge)

    Returns
    -------
    float : Objective function value
    """
    n = len(y)

    # Residual sum of squares (RSS)
    residuals = y - X @ beta
    rss = (1 / (2 * n)) * np.sum(residuals ** 2)

    # L1 penalty (Lasso component)
    l1_penalty = alpha * np.sum(np.abs(beta))

    # L2 penalty (Ridge component)
    l2_penalty = ((1 - alpha) / 2) * np.sum(beta ** 2)

    # Combined objective
    return rss + lambda_val * (l1_penalty + l2_penalty)


def compute_penalty_contributions(beta, lambda_val, alpha):
    """
    Decompose the penalty into L1 and L2 contributions.
    Useful for understanding regularization behavior.
    """
    l1_contribution = lambda_val * alpha * np.sum(np.abs(beta))
    l2_contribution = lambda_val * (1 - alpha) / 2 * np.sum(beta ** 2)
    return {
        'l1_penalty': l1_contribution,
        'l2_penalty': l2_contribution,
        'total_penalty': l1_contribution + l2_contribution,
        'l1_fraction': l1_contribution / (l1_contribution + l2_contribution + 1e-10),
    }


# Example: Comparing objectives for different alpha values
np.random.seed(42)
n, p = 100, 10
X = np.random.randn(n, p)
y = np.random.randn(n)
beta = np.random.randn(p)
lambda_val = 0.1

print("Objective values for different mixing parameters:")
print("-" * 50)
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    obj = elastic_net_objective(beta, X, y, lambda_val, alpha)
    penalties = compute_penalty_contributions(beta, lambda_val, alpha)
    print(f"α = {alpha:.2f}: Objective = {obj:.4f}")
    print(f"  L1 penalty = {penalties['l1_penalty']:.4f}")
    print(f"  L2 penalty = {penalties['l2_penalty']:.4f}")
    print()
```

Understanding regularization geometrically provides profound intuition about how Elastic Net combines L1 and L2 properties. Let's analyze the constraint regions defined by each penalty.
The Constrained Optimization View:
The Elastic Net objective with penalty $\lambda P_\alpha(\boldsymbol{\beta})$ is equivalent to solving:
$$\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 \quad \text{subject to} \quad \alpha \|\boldsymbol{\beta}\|_1 + \frac{1-\alpha}{2} \|\boldsymbol{\beta}\|_2^2 \leq t$$
for some constraint bound $t$ that depends on $\lambda$. The shape of this constraint region determines the regularization behavior.
Constraint Region Shapes in 2D:
Elastic Net (0 < α < 1): Rounded Diamond
The Elastic Net constraint region is a hybrid shape—a diamond with rounded corners:
$$\alpha (|\beta_1| + |\beta_2|) + \frac{1-\alpha}{2}(\beta_1^2 + \beta_2^2) \leq t$$
This shape has remarkable properties:
Retains Corners (from L1): The corners on the coordinate axes remain, enabling sparse solutions when the loss contours touch these corners.
Rounded Edges (from L2): The edges between corners are curved (strictly convex), not flat. This adds the strictly convex property that Lasso lacks.
The "Stretchy" Effect: The L2 component allows the constraint region to 'stretch' to accommodate correlated features, distributing weight among them rather than selecting just one.
Why Strict Convexity Matters:
The L2 component ensures that the Elastic Net objective is strictly convex, guaranteeing a unique global minimum and a solution that varies continuously with the data, even when features are perfectly correlated.
```python
import numpy as np
import matplotlib.pyplot as plt


def elastic_net_constraint(beta1, beta2, alpha):
    """
    Compute the Elastic Net penalty value at (beta1, beta2).
    The point is inside the constraint region when the value is <= t.
    """
    l1_part = alpha * (np.abs(beta1) + np.abs(beta2))
    l2_part = (1 - alpha) / 2 * (beta1**2 + beta2**2)
    return l1_part + l2_part


def plot_constraint_regions():
    """Visualize constraint regions for different alpha values."""
    fig, axes = plt.subplots(1, 5, figsize=(20, 4))
    alphas = [0.0, 0.25, 0.5, 0.75, 1.0]

    # Create grid
    beta_range = np.linspace(-1.5, 1.5, 500)
    B1, B2 = np.meshgrid(beta_range, beta_range)

    for ax, alpha in zip(axes, alphas):
        # Compute constraint values
        Z = elastic_net_constraint(B1, B2, alpha)

        # Plot constraint region (where Z <= 1)
        ax.contourf(B1, B2, Z, levels=[0, 1], colors=['lightblue'], alpha=0.7)
        ax.contour(B1, B2, Z, levels=[1], colors=['blue'], linewidths=2)

        # Add coordinate axes
        ax.axhline(y=0, color='gray', linestyle='--', linewidth=0.5)
        ax.axvline(x=0, color='gray', linestyle='--', linewidth=0.5)

        # Labels
        ax.set_xlabel(r'$\beta_1$', fontsize=12)
        ax.set_ylabel(r'$\beta_2$', fontsize=12)
        if alpha == 0:
            title = f'Ridge (α={alpha})'
        elif alpha == 1:
            title = f'Lasso (α={alpha})'
        else:
            title = f'Elastic Net (α={alpha})'
        ax.set_title(title, fontsize=14)
        ax.set_aspect('equal')
        ax.set_xlim(-1.5, 1.5)
        ax.set_ylim(-1.5, 1.5)

    plt.tight_layout()
    plt.savefig('elastic_net_constraint_regions.png', dpi=150)
    plt.show()


# Generate the visualization
plot_constraint_regions()

# Demonstrate the corner property
print("Corner Analysis:")
print("-" * 50)
for alpha in [0.0, 0.5, 1.0]:
    # Penalty value at the corner (1, 0) and at the 45-degree point
    corner_val = elastic_net_constraint(1, 0, alpha)
    edge_val = elastic_net_constraint(0.707, 0.707, alpha)
    print(f"α = {alpha:.1f}: Corner (1,0) = {corner_val:.3f}, "
          f"45° point (0.707, 0.707) = {edge_val:.3f}")
```

When visualizing regularization, imagine the RSS loss function as elliptical contours centered at the OLS solution. The regularized solution is where these contours first touch the constraint region. Corners (from L1) enable sparse solutions; rounded edges (from L2) ensure stability and uniqueness.
The Elastic Net enjoys several important mathematical properties that make it particularly useful in practice. Understanding these properties helps you predict when Elastic Net will outperform alternatives.
Property 1: Strict Convexity (for α < 1)
For any α < 1, the L2 component makes the Elastic Net objective function strictly convex.
Strict convexity means: $$f(t\boldsymbol{\beta}_1 + (1-t)\boldsymbol{\beta}_2) < tf(\boldsymbol{\beta}_1) + (1-t)f(\boldsymbol{\beta}_2)$$
for distinct $\boldsymbol{\beta}_1, \boldsymbol{\beta}_2$ and $t \in (0,1)$.
Implication: The global minimum is unique. Unlike Lasso, there's exactly one solution.
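The strict-convexity inequality can be checked numerically. This is a minimal sketch with random data; λ, α, and the interpolation weight t are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam, mix = 0.5, 0.5  # λ and α < 1, so the penalty has a strictly convex L2 part


def objective(beta):
    rss = np.sum((y - X @ beta) ** 2) / (2 * n)
    l1 = mix * np.sum(np.abs(beta))
    l2 = (1 - mix) / 2 * np.sum(beta ** 2)
    return rss + lam * (l1 + l2)


# For two distinct points, the value at an interior convex combination
# must lie strictly below the chord
b1 = rng.standard_normal(p)
b2 = rng.standard_normal(p)
t = 0.3
lhs = objective(t * b1 + (1 - t) * b2)
rhs = t * objective(b1) + (1 - t) * objective(b2)
print(f"f(t*b1 + (1-t)*b2) = {lhs:.4f} < {rhs:.4f} = t*f(b1) + (1-t)*f(b2)")
```

The strict inequality holds for any pair of distinct coefficient vectors because the squared L2 term is strictly convex and the remaining terms are convex.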
Pure Lasso (α = 1) can have infinitely many optimal solutions when features are perfectly correlated. Consider X₁ = X₂ perfectly correlated; any combination β₁ + β₂ = c achieves the same L1 penalty. Elastic Net's L2 term breaks this degeneracy, preferring the solution with β₁ ≈ β₂.
Property 2: Sparsity Preservation
Despite adding the L2 term, Elastic Net retains the ability to set coefficients exactly to zero. The sparsity pattern is determined by the subgradient conditions:
For feature $j$, the coefficient $\hat{\beta}_j = 0$ if and only if:
$$\left| \frac{1}{n} \mathbf{x}_j^T (\mathbf{y} - \mathbf{X}_{-j}\hat{\boldsymbol{\beta}}_{-j}) \right| \leq \lambda \alpha$$
The L1 component provides the threshold mechanism for zeroing coefficients, while the L2 component provides stability among selected features.
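We can verify this condition empirically. The sketch below (data sizes and penalties are illustrative) fits an Elastic Net with a tight tolerance and checks that exactly the zeroed coefficients satisfy the subgradient bound; in scikit-learn's parameterization, `alpha` plays the role of λ and `l1_ratio` the role of α:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)
n, p = 150, 15
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.3 * rng.standard_normal(n)

lam, mix = 0.2, 0.7  # λ and α in the text's notation
model = ElasticNet(alpha=lam, l1_ratio=mix, fit_intercept=False,
                   max_iter=100_000, tol=1e-10).fit(X, y)

# For beta_j = 0, X_{-j} beta_{-j} equals X beta, so the partial
# correlation reduces to |x_j' (y - X beta)| / n
resid = y - X @ model.coef_
partial = np.abs(X.T @ resid) / n
zero_mask = np.abs(model.coef_) < 1e-10
threshold = lam * mix

print(f"threshold λα = {threshold}")
print(f"zeroed coefficients: {int(zero_mask.sum())}, "
      f"max |partial corr| among them = {partial[zero_mask].max():.4f}")
print(f"active coefficients: {int((~zero_mask).sum())}, "
      f"min |partial corr| among them = {partial[~zero_mask].min():.4f}")
```

Every zeroed coefficient sits below the λα threshold, while every active coefficient sits at or above it, exactly as the subgradient condition predicts.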
Property 3: The Naive Elastic Net Problem
Direct optimization of the Elastic Net penalty reveals an interesting structure. Define augmented data:
$$\mathbf{X}^* = \frac{1}{\sqrt{1 + \lambda_2}} \begin{pmatrix} \mathbf{X} \\ \sqrt{\lambda_2} \mathbf{I}_p \end{pmatrix}, \quad \mathbf{y}^* = \begin{pmatrix} \mathbf{y} \\ \mathbf{0}_p \end{pmatrix}$$
Then the Elastic Net is equivalent to Lasso on the augmented problem:
$$\hat{\boldsymbol{\beta}}^* = \arg\min_{\boldsymbol{\beta}^*} \left\{ \|\mathbf{y}^* - \mathbf{X}^* \boldsymbol{\beta}^*\|_2^2 + \frac{\lambda_1}{\sqrt{1+\lambda_2}} \|\boldsymbol{\beta}^*\|_1 \right\}$$
with $\hat{\boldsymbol{\beta}}_{\text{enet}} = (1 + \lambda_2) \hat{\boldsymbol{\beta}}^*$
This augmentation trick enables using fast Lasso solvers for Elastic Net!
```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet


def elastic_net_via_augmentation(X, y, lambda1, lambda2):
    """
    Solve Elastic Net by converting to a Lasso problem on augmented data.

    This demonstrates the mathematical equivalence between Elastic Net
    and Lasso on an augmented dataset.

    Parameters
    ----------
    X : array of shape (n, p)
    y : array of shape (n,)
    lambda1 : float, L1 penalty coefficient
    lambda2 : float, L2 penalty coefficient

    Returns
    -------
    beta_enet : array of shape (p,), Elastic Net coefficients
    """
    n, p = X.shape

    # Create augmented data
    scale_factor = 1.0 / np.sqrt(1 + lambda2)
    X_augmented = np.vstack([
        scale_factor * X,
        np.sqrt(lambda2) * np.eye(p)
    ])
    y_augmented = np.concatenate([y, np.zeros(p)])

    # Solve Lasso on augmented data
    # Note: sklearn's alpha parameter is lambda / (2 * n_augmented)
    n_aug = len(y_augmented)
    lasso_alpha = lambda1 * scale_factor / (2 * n_aug)
    lasso = Lasso(alpha=lasso_alpha, fit_intercept=False, max_iter=10000)
    lasso.fit(X_augmented, y_augmented)

    # Rescale to get Elastic Net solution
    return (1 + lambda2) * lasso.coef_


# Verify equivalence
np.random.seed(42)
n, p = 100, 20
X = np.random.randn(n, p)
beta_true = np.array([3, -2, 1.5] + [0.0] * 17)
y = X @ beta_true + 0.5 * np.random.randn(n)

# Elastic Net parameters
alpha = 0.5          # Mixing parameter
lambda_total = 0.1
lambda1 = alpha * lambda_total
lambda2 = (1 - alpha) * lambda_total

# Method 1: Direct Elastic Net
enet = ElasticNet(alpha=lambda_total, l1_ratio=alpha,
                  fit_intercept=False, max_iter=10000)
enet.fit(X, y)
beta_direct = enet.coef_

# Method 2: Via augmentation (for demonstration)
# Note: Exact equivalence requires careful parameter matching
print("Elastic Net Coefficients Comparison")
print("-" * 50)
print(f"{'Feature':<10} {'True':>10} {'Elastic Net':>12}")
print("-" * 50)
for j in range(min(10, p)):
    print(f"β_{j:<7} {beta_true[j]:>10.3f} {beta_direct[j]:>12.3f}")

# Count non-zero coefficients
n_nonzero = np.sum(np.abs(beta_direct) > 1e-6)
print(f"Non-zero coefficients: {n_nonzero} / {p}")
```

Property 4: Double Shrinkage and Rescaling
A subtlety of the naive Elastic Net solution (from direct optimization) is that it suffers from double shrinkage: the L1 component first soft-thresholds each coefficient, and the L2 component then shrinks the survivors again.
The cumulative effect can over-shrink coefficients, leading to excessive bias. Zou and Hastie addressed this by proposing rescaling:
$$\hat{\boldsymbol{\beta}}_{\text{enet}} = (1 + \lambda_2) \hat{\boldsymbol{\beta}}_{\text{naive}}$$
This rescaling partially counteracts the L2 shrinkage, yielding coefficients with better predictive performance. Most software implementations (including scikit-learn) apply this rescaling automatically.
Different software packages implement Elastic Net with varying conventions for scaling and parameterization. Always consult documentation to understand exactly what objective is being minimized. The coefficient magnitude can differ by factors of (1 + λ₂) between implementations.
Understanding how Elastic Net is optimized reveals insight into its behavior. The key tool is the soft-thresholding operator, which appears naturally when solving the Elastic Net subproblem.
The Soft-Thresholding Operator:
$$S(z, \gamma) = \text{sign}(z) \cdot \max(|z| - \gamma, 0) = \begin{cases} z - \gamma & \text{if } z > \gamma \\ 0 & \text{if } |z| \leq \gamma \\ z + \gamma & \text{if } z < -\gamma \end{cases}$$
This operator shrinks its argument toward zero by γ and maps anything within γ of zero exactly to zero, which is precisely the mechanism that produces sparsity.
Coordinate Descent for Elastic Net:
The most efficient algorithm for Elastic Net is coordinate descent, which updates one coefficient at a time while holding others fixed.
For the Elastic Net with standardized features ($\mathbf{x}_j^T \mathbf{x}_j = n$), the update for coordinate $j$ is:
$$\hat{\beta}_j \leftarrow \frac{S\left(\frac{1}{n}\mathbf{x}_j^T(\mathbf{y} - \mathbf{X}_{-j}\hat{\boldsymbol{\beta}}_{-j}), \lambda\alpha\right)}{1 + \lambda(1-\alpha)}$$
where $\mathbf{X}_{-j}$ denotes all columns except $j$, and $\hat{\boldsymbol{\beta}}_{-j}$ denotes all coefficients except $\hat{\beta}_j$.
```python
import numpy as np


def soft_threshold(z, gamma):
    """
    Soft-thresholding operator S(z, gamma).
    This is the proximal operator of the L1 norm.
    """
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0)


def coordinate_descent_elastic_net(X, y, lambda_val, alpha,
                                   max_iter=1000, tol=1e-6):
    """
    Solve Elastic Net using coordinate descent.

    Parameters
    ----------
    X : array of shape (n, p), standardized (columns mean 0, squared norm n)
    y : array of shape (n,), centered
    lambda_val : float, overall regularization strength
    alpha : float in [0, 1], mixing parameter (1 = Lasso, 0 = Ridge)
    max_iter : int, maximum iterations
    tol : float, convergence tolerance

    Returns
    -------
    beta : array of shape (p,), coefficient estimates
    history : list of objective values
    """
    n, p = X.shape
    beta = np.zeros(p)
    residual = y.copy()  # Current residual: y - X @ beta
    history = []

    for iteration in range(max_iter):
        beta_old = beta.copy()

        for j in range(p):
            # Temporarily add back contribution of beta_j
            residual += X[:, j] * beta[j]

            # Partial correlation of feature j with the residual
            # (uses the assumption x_j' x_j = n from standardization)
            rho_j = X[:, j] @ residual / n

            # Closed-form update: soft-threshold, then L2-shrink
            numerator = soft_threshold(rho_j, lambda_val * alpha)
            denominator = 1 + lambda_val * (1 - alpha)
            beta[j] = numerator / denominator

            # Update residual with new beta_j
            residual -= X[:, j] * beta[j]

        # Compute objective for monitoring
        rss = np.sum(residual ** 2) / (2 * n)
        l1_term = lambda_val * alpha * np.sum(np.abs(beta))
        l2_term = lambda_val * (1 - alpha) / 2 * np.sum(beta ** 2)
        history.append(rss + l1_term + l2_term)

        # Check convergence
        if np.max(np.abs(beta - beta_old)) < tol:
            print(f"Converged at iteration {iteration + 1}")
            break

    return beta, history


# Demonstration
np.random.seed(42)
n, p = 200, 50

# Create standardized features
X = np.random.randn(n, p)
X = X - X.mean(axis=0)                      # Center columns
X = X / np.sqrt(np.sum(X**2, axis=0) / n)   # Scale so x_j' x_j = n

# True sparse coefficients
beta_true = np.zeros(p)
beta_true[:5] = [3, -2.5, 2, -1.5, 1]

# Generate centered response
y = X @ beta_true + 0.5 * np.random.randn(n)
y = y - y.mean()

# Solve with coordinate descent
beta_hat, history = coordinate_descent_elastic_net(X, y, lambda_val=0.1, alpha=0.5)

print("Coefficient Recovery:")
print(f"{'Feature':<10} {'True':>10} {'Estimated':>12} {'Error':>10}")
print("-" * 45)
for j in range(10):
    error = abs(beta_true[j] - beta_hat[j])
    print(f"β_{j:<7} {beta_true[j]:>10.3f} {beta_hat[j]:>12.3f} {error:>10.3f}")

print(f"Non-zero coefficients: {np.sum(np.abs(beta_hat) > 1e-6)}")
print(f"True non-zero: {np.sum(np.abs(beta_true) > 1e-6)}")
```

The numerator S(ρⱼ, λα) applies sparse selection via soft-thresholding. The denominator (1 + λ(1−α)) applies additional L2 shrinkage. This two-stage process (threshold, then shrink) is how Elastic Net achieves both selection and stability.
Computational Complexity:
Coordinate descent for Elastic Net costs $O(np)$ per full pass over the coordinates, hence $O(np \cdot k)$ for $k$ passes until convergence.
For sparse solutions (many zero coefficients), active set strategies can reduce this to $O(ns \cdot k)$ where $s$ is the number of non-zero coefficients.
Warm Starting:
When computing solutions across a regularization path (sequence of λ values), using the solution at $\lambda_{i}$ as the starting point for $\lambda_{i+1}$ dramatically accelerates convergence. This is standard in implementations like glmnet and sklearn.linear_model.ElasticNet.
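As a sketch of the idea (scikit-learn's `ElasticNet` exposes a `warm_start` flag; the data and the λ grid here are illustrative), fitting along a decreasing λ sequence while reusing the previous coefficients:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3, -2.5, 2, -1.5, 1]
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# Strongest penalty first; each fit starts from the previous solution
lambdas = np.logspace(0, -3, 20)
model = ElasticNet(l1_ratio=0.5, warm_start=True,
                   fit_intercept=False, max_iter=10_000)

nonzero_along_path = []
for lam in lambdas:
    model.set_params(alpha=lam)
    model.fit(X, y)  # warm_start=True reuses model.coef_ as the initial point
    nonzero_along_path.append(int(np.sum(np.abs(model.coef_) > 1e-8)))

print("Non-zero coefficients along the path:", nonzero_along_path)
```

Successive λ values yield similar solutions, so each warm-started fit typically converges in far fewer coordinate sweeps than a cold start would; the printed path also shows the model growing denser as λ shrinks.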
The Elastic Net provides a continuous interpolation between Ridge and Lasso, enabling us to select the regularization type that best fits our data. Let's consolidate our understanding with a unified comparison.
| Property | Ridge (α=0) | Elastic Net (0<α<1) | Lasso (α=1) |
|---|---|---|---|
| Penalty Term | $\lambda \Vert\boldsymbol{\beta}\Vert_2^2 / 2$ | $\lambda (\alpha \Vert\boldsymbol{\beta}\Vert_1 + \frac{1-\alpha}{2}\Vert\boldsymbol{\beta}\Vert_2^2)$ | $\lambda \Vert\boldsymbol{\beta}\Vert_1$ |
| Sparsity | No (all β ≠ 0) | Yes (some β = 0) | Yes (sparse solutions) |
| Solution Uniqueness | Always unique | Always unique | May have multiple solutions |
| Correlated Features | Similar coefficients | Grouped selection | Arbitrary selection of one |
| Max Features Selected | All p | Up to all p (not capped at n) | At most min(n, p) |
| Closed-Form Solution | Yes | No | No |
| Constraint Shape | Sphere | Rounded polytope | Cross-polytope |
| Bayesian Prior | Gaussian | Mixed Gaussian-Laplace | Laplace (double exponential) |
| Best When | Many small effects, no sparsity expected | Correlated features, moderate sparsity | True sparsity, independent features |
The Regularization Path Perspective:
A powerful way to understand these methods is through the regularization path—how coefficient estimates change as λ varies from ∞ to 0. Ridge paths shrink all coefficients smoothly toward zero; Lasso paths drive coefficients to zero one at a time; Elastic Net paths combine smooth shrinkage with exact zeros, and correlated features tend to enter or leave the model together.
Decision Framework:
Choosing between methods often depends on your beliefs about the true data-generating process: if you expect a few strong predictors among roughly independent features, lean toward Lasso (α near 1); if you expect many small effects with no true sparsity, lean toward Ridge (α near 0); when unsure, or when features form correlated groups, Elastic Net with an intermediate α is a robust default.
In practice, Elastic Net with α ∈ [0.1, 0.9] often outperforms pure Ridge or Lasso. A common starting point is α = 0.5 (equal weighting), then tuning via cross-validation. The extra hyperparameter (α) is worth it for the robustness gained.
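In scikit-learn, this joint search over α and λ is handled by `ElasticNetCV` (a sketch with synthetic data; the grid values and sizes are illustrative):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
n, p = 200, 30
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [2, -2, 1.5, -1, 1]
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# Cross-validate jointly over the mixing parameter (l1_ratio = α)
# and the strength (alpha = λ, chosen automatically from n_alphas values)
cv_model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], n_alphas=50,
                        cv=5, max_iter=10_000).fit(X, y)

n_nonzero = int(np.sum(np.abs(cv_model.coef_) > 1e-8))
print(f"Selected l1_ratio (α): {cv_model.l1_ratio_}")
print(f"Selected alpha (λ):    {cv_model.alpha_:.4f}")
print(f"Non-zero coefficients: {n_nonzero} / {p}")
```

Supplying a small list of `l1_ratio` candidates keeps the search cheap; `ElasticNetCV` builds the λ grid itself and uses warm-started paths internally.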
We've established the complete mathematical foundation of Elastic Net regularization: a convex combination of L1 and L2 penalties that delivers sparsity, a unique solution, and stable handling of correlated features, optimized efficiently by coordinate descent.
What's Next:
Now that we understand the Elastic Net formulation, the next page explores one of its most remarkable properties: the grouping effect. We'll see how Elastic Net handles correlated features in a principled way, automatically assigning similar coefficients to related variables—behavior that emerges naturally from the combined penalty structure.
You now understand the mathematical formulation of Elastic Net, its geometric interpretation, key theoretical properties, and how coordinate descent optimizes the objective. Next, we'll examine the grouping effect that makes Elastic Net particularly powerful for correlated features.