Every machine learning practitioner faces a fundamental tension: models must be complex enough to capture patterns, but not so complex that they memorize noise. This is the essence of the bias-variance tradeoff, and regularization is our primary tool for navigating it.
Regularization works by adding a penalty term to the loss function that discourages overly complex models. The most common penalties are based on norms of the model parameters—the same norms we've studied throughout this module.
By penalizing the L1 or L2 norm of weights, we constrain the model's capacity, reduce overfitting, and often improve generalization. But the choice of norm has profound implications: L1 promotes sparsity (feature selection), L2 promotes small but dense weights, and combinations offer different tradeoffs.
By the end of this page, you will understand the mathematical foundations of norm-based regularization, the geometric intuition for why L1 induces sparsity, the tradeoffs between L1 (Lasso), L2 (Ridge), and Elastic Net, practical guidelines for choosing regularization strength, and how these concepts extend to deep learning.
In supervised learning, we minimize a loss function $L(\mathbf{w}; \mathcal{D})$ that measures how well our model with parameters $\mathbf{w}$ fits the training data $\mathcal{D}$. Regularization modifies this objective by adding a penalty on the parameters:
General Form:
$$\min_{\mathbf{w}} \underbrace{L(\mathbf{w}; \mathcal{D})}_{\text{Data Fidelity}} + \underbrace{\lambda R(\mathbf{w})}_{\text{Regularization Penalty}}$$
where:
- $L(\mathbf{w}; \mathcal{D})$ is the data-fidelity term (e.g., squared error or cross-entropy),
- $R(\mathbf{w})$ is the penalty that measures model complexity,
- $\lambda \geq 0$ is the regularization strength controlling the balance between the two.
The Tradeoff:
- Small $\lambda$: the data term dominates; the fit stays close to the unregularized solution and may overfit.
- Large $\lambda$: the penalty dominates; weights shrink toward zero and the model may underfit.
Constraint Formulation (Equivalent View):
The penalized objective is mathematically equivalent to a constrained optimization:
$$\min_{\mathbf{w}} L(\mathbf{w}; \mathcal{D}) \quad \text{subject to} \quad R(\mathbf{w}) \leq t$$
For every $\lambda$, there exists a corresponding $t$ (via Lagrangian duality) where the solutions are identical. This constraint view provides geometric intuition: we're minimizing loss while staying within a 'budget' on model complexity.
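This equivalence can be checked numerically. The sketch below (synthetic data; the budget `t` is simply taken from the penalized solution's norm, an assumption for illustration) solves the ridge-penalized problem in closed form and confirms that a generic constrained solver, given the matching budget, recovers the same weights:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(60)

# Penalized form with R(w) = ||w||_2^2: closed-form ridge solution
lam = 1.0
w_pen = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Constrained form: min ||y - Xw||^2 subject to ||w||_2^2 <= t,
# with the budget t set to the penalized solution's squared norm
t = w_pen @ w_pen
res = minimize(lambda w: np.sum((y - X @ w) ** 2), np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda w: t - w @ w}])

print(np.round(w_pen, 4))  # penalized solution
print(np.round(res.x, 4))  # constrained solution (approximately the same)
```

For any $\lambda > 0$ the constraint is active at the optimum, and the KKT multiplier of the constrained problem is exactly $\lambda$.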
Norms are natural complexity measures: larger ||w|| means weights are 'bigger' and the model is potentially more complex. L1 and L2 norms are convex, preserving the convexity of many ML problems. They have well-understood optimization properties and interpretable effects on solutions.
| Name | Penalty $R(\mathbf{w})$ | Effect |
|---|---|---|
| L2 / Ridge / Tikhonov | $\Vert\mathbf{w}\Vert_2^2 = \sum_i w_i^2$ | Shrinks all weights; stable; closed-form solution |
| L1 / Lasso | $\Vert\mathbf{w}\Vert_1 = \sum_i \lvert w_i \rvert$ | Drives some weights to exactly zero (sparsity) |
| Elastic Net | $\alpha\Vert\mathbf{w}\Vert_1 + (1-\alpha)\Vert\mathbf{w}\Vert_2^2$ | Sparsity plus grouping of correlated features |
| Weight Decay (DL) | $\tfrac{1}{2}\Vert\mathbf{W}\Vert_F^2$ | Frobenius norm on weight matrices |
L2 regularization (also called Ridge regression or Tikhonov regularization) adds the squared L2 norm of weights to the loss:
$$\min_{\mathbf{w}} \frac{1}{n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_2^2$$
Closed-Form Solution:
For linear regression, the regularized normal equations have a beautiful closed form:
$$\hat{\mathbf{w}}_{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$
The addition of $\lambda\mathbf{I}$ to the Gram matrix has profound effects:
Invertibility: Even if $\mathbf{X}^T\mathbf{X}$ is singular (more features than samples, or multicollinearity), $\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$ is always invertible for $\lambda > 0$.
Condition Number: The condition number drops from $\sigma_{\max}^2/\sigma_{\min}^2$ to $(\sigma_{\max}^2 + \lambda)/(\sigma_{\min}^2 + \lambda)$, improving numerical stability.
No Sparsity: The solution typically has all weights non-zero; they're shrunk but not eliminated.
Geometric Interpretation:
The unit ball of the L2 norm is a sphere. When the level curves of the loss function (elliptical for least squares) meet this spherical constraint, the contact point generically lies away from the coordinate axes: the sphere has no corners, so no coordinate is singled out to be exactly zero.
Result: Weights are shrunk proportionally toward zero, but rarely become exactly zero.
SVD Perspective:
If $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$ (SVD), then:
$$\hat{\mathbf{w}}_{\text{ridge}} = \mathbf{V}\text{diag}\left(\frac{\sigma_i}{\sigma_i^2 + \lambda}\right)\mathbf{U}^T\mathbf{y}$$
Each singular direction $\mathbf{v}_i$ gets shrunk by factor $\sigma_i/(\sigma_i^2 + \lambda)$. Small singular values (noisy directions) are shrunk more aggressively.
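As a quick sanity check (a small sketch on synthetic data, not part of the derivation), the SVD form can be compared against the regularized normal equations; the two should agree to numerical precision:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
y = rng.standard_normal(30)
lam = 0.5

# Ridge via regularized normal equations
w_direct = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# Ridge via SVD: shrink each singular direction by sigma_i / (sigma_i^2 + lam)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

print(np.allclose(w_direct, w_svd))  # True
```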
```python
import numpy as np
from sklearn.linear_model import Ridge

# Generate data with collinearity
np.random.seed(42)
n, p = 100, 5
X = np.random.randn(n, p)
X[:, 1] = X[:, 0] + 0.01 * np.random.randn(n)  # x1 ≈ x0 (collinear)
true_w = np.array([1, 2, 0.5, -1, 0.3])
y = X @ true_w + 0.5 * np.random.randn(n)

print("Ridge Regression Analysis")
print("=" * 50)

# Solve with different lambda values
lambdas = [0, 0.01, 0.1, 1, 10, 100]
print(f"{'Lambda':<10} {'||w||_2':<10} {'Weights...'}")
print("-" * 60)

for lam in lambdas:
    if lam == 0:
        # OLS solution
        w = np.linalg.lstsq(X, y, rcond=None)[0]
    else:
        # Ridge solution: (X'X + λI)^(-1) X'y
        XtX = X.T @ X
        I = np.eye(p)
        w = np.linalg.solve(XtX + lam * I, X.T @ y)
    norm_w = np.linalg.norm(w)
    print(f"{lam:<10} {norm_w:<10.4f} {np.round(w, 3)}")

# Demonstrate improved conditioning
print("--- Effect on Condition Number ---")
XtX = X.T @ X
print(f"κ(X'X) = {np.linalg.cond(XtX):.2f} (ill-conditioned due to collinearity)")

for lam in [0.01, 0.1, 1]:
    cond = np.linalg.cond(XtX + lam * np.eye(p))
    print(f"κ(X'X + {lam}I) = {cond:.2f}")

# Using sklearn
model = Ridge(alpha=1.0)
model.fit(X, y)
print(f"sklearn Ridge (α=1): {np.round(model.coef_, 3)}")
print(f"True weights: {true_w}")
```

L2 regularization corresponds to a Gaussian prior on weights: $p(\mathbf{w}) \propto \exp(-\lambda\|\mathbf{w}\|_2^2)$. Ridge regression is equivalent to MAP estimation with this prior. The regularization strength λ is inversely proportional to the prior variance.
L1 regularization (also called Lasso - Least Absolute Shrinkage and Selection Operator) adds the L1 norm of weights to the loss:
$$\min_{\mathbf{w}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_1$$
The defining property of Lasso is sparsity: it tends to set some coefficients to exactly zero, effectively performing feature selection.
Why L1 Induces Sparsity (Geometric Intuition):
This is one of the most important insights in regularization theory. In the constraint view, the L1 ball $\{\mathbf{w} : \|\mathbf{w}\|_1 \leq t\}$ is a diamond (cross-polytope) whose corners sit on the coordinate axes. The elliptical level curves of the loss expand until they first touch this region, and that first contact typically occurs at a corner or edge, where one or more coordinates are exactly zero. In high dimensions the ball has many such low-dimensional corners and edges, making sparse solutions even more likely.
Contrast with L2: the L2 ball is a sphere with no corners, so the contact point generically has all coordinates non-zero. Weights are shrunk, not eliminated.
Soft Thresholding:
For simple settings (orthonormal design), Lasso has a closed-form solution involving the soft thresholding operator:
$$\hat{w}_j = S_\lambda(\hat{w}_j^{\text{OLS}}) = \text{sign}(\hat{w}_j^{\text{OLS}}) \cdot \max(|\hat{w}_j^{\text{OLS}}| - \lambda, 0)$$
This shrinks coefficients toward zero and sets them exactly to zero if they're smaller than $\lambda$ in absolute value.
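The operator itself is a one-liner. A minimal sketch (the helper name `soft_threshold` is ours, not a standard API) showing the shrink-and-clip behavior:

```python
import numpy as np

def soft_threshold(w, lam):
    """S_lambda(w): shrink each entry toward zero by lam, clipping at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w_ols = np.array([2.5, -0.3, 0.8, -1.7, 0.05])
print(soft_threshold(w_ols, 0.5))
# Entries with |w| <= 0.5 become exactly 0; the rest move 0.5 toward zero.
```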
```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Generate sparse ground truth
np.random.seed(42)
n, p = 100, 20
X = np.random.randn(n, p)

# Only 5 features are truly relevant
true_w = np.zeros(p)
true_w[:5] = [3, -2, 1.5, -1, 0.5]  # First 5 features matter

y = X @ true_w + 0.5 * np.random.randn(n)

print("Lasso vs Ridge: Sparsity Comparison")
print("=" * 60)
print(f"True weights (first 10): {true_w[:10]}")
print(f"Number of true non-zero weights: {np.sum(true_w != 0)}")

# Lasso (L1)
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Ridge (L2)
ridge = Ridge(alpha=0.1)
ridge.fit(X, y)

print(f"{'Method':<10} {'Non-zeros':<12} {'||w||_1':<10} {'||w||_2':<10}")
print("-" * 55)
print(f"{'Lasso':<10} {np.sum(np.abs(lasso.coef_) > 1e-6):<12} "
      f"{np.linalg.norm(lasso.coef_, 1):<10.4f} "
      f"{np.linalg.norm(lasso.coef_, 2):<10.4f}")
print(f"{'Ridge':<10} {np.sum(np.abs(ridge.coef_) > 1e-6):<12} "
      f"{np.linalg.norm(ridge.coef_, 1):<10.4f} "
      f"{np.linalg.norm(ridge.coef_, 2):<10.4f}")

print(f"Lasso coefficients (first 10): {np.round(lasso.coef_[:10], 3)}")
print(f"Ridge coefficients (first 10): {np.round(ridge.coef_[:10], 3)}")

# Regularization path
print("--- Lasso Regularization Path ---")
alphas = [0.01, 0.05, 0.1, 0.5, 1.0]
print(f"{'Alpha':<10} {'Non-zeros':<12} {'Selected features'}")
for alpha in alphas:
    lasso = Lasso(alpha=alpha)
    lasso.fit(X, y)
    nonzeros = np.where(np.abs(lasso.coef_) > 1e-6)[0]
    print(f"{alpha:<10} {len(nonzeros):<12} {list(nonzeros)}")
```

L1 regularization corresponds to a Laplace (double exponential) prior on weights: $p(w) \propto \exp(-\lambda|w|)$. The Laplace distribution has heavier tails than a Gaussian, allowing large weights, but a sharp peak at zero, encouraging sparsity. Lasso is MAP estimation with this prior.
Elastic Net combines L1 and L2 regularization, addressing weaknesses of both:
$$\min_{\mathbf{w}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\left(\alpha\|\mathbf{w}\|_1 + (1-\alpha)\|\mathbf{w}\|_2^2\right)$$
Alternatively written as:
$$\min_{\mathbf{w}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda_1\|\mathbf{w}\|_1 + \lambda_2\|\mathbf{w}\|_2^2$$
where $\alpha \in [0, 1]$ controls the mix:
- $\alpha = 1$: pure Lasso (L1 only)
- $\alpha = 0$: pure Ridge (L2 only)
- intermediate values trade sparsity against the stability and grouping behavior of L2.
Geometric Intuition:
The Elastic Net constraint region is between the L1 diamond and L2 sphere—a 'rounded diamond.' It has the corners of L1 (enabling sparsity) but with smoothed edges from L2 (providing stability).
```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Generate data with correlated features
np.random.seed(42)
n, p = 100, 10
base = np.random.randn(n)
X = np.zeros((n, p))

# Create groups of correlated features
X[:, 0] = base + 0.1 * np.random.randn(n)
X[:, 1] = base + 0.1 * np.random.randn(n)  # Correlated with X[:, 0]
X[:, 2] = base + 0.1 * np.random.randn(n)  # Correlated with X[:, 0], X[:, 1]
X[:, 3:] = np.random.randn(n, p - 3)  # Independent features

# True weights: all correlated features matter equally
true_w = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_w + 0.5 * np.random.randn(n)

print("Elastic Net: Handling Correlated Features")
print("=" * 60)
print("Features 0, 1, 2 are correlated (all have true weight 1)")
print(f"True weights: {true_w}")

# Compare methods
alpha = 0.1

lasso = Lasso(alpha=alpha)
ridge = Ridge(alpha=alpha)
elastic = ElasticNet(alpha=alpha, l1_ratio=0.5)

lasso.fit(X, y)
ridge.fit(X, y)
elastic.fit(X, y)

print(f"{'Method':<15} {'w[0]':<8} {'w[1]':<8} {'w[2]':<8} {'Non-zeros':<10}")
print("-" * 60)
print(f"{'Lasso':<15} {lasso.coef_[0]:<8.3f} {lasso.coef_[1]:<8.3f} "
      f"{lasso.coef_[2]:<8.3f} {np.sum(np.abs(lasso.coef_) > 0.01)}")
print(f"{'Ridge':<15} {ridge.coef_[0]:<8.3f} {ridge.coef_[1]:<8.3f} "
      f"{ridge.coef_[2]:<8.3f} {np.sum(np.abs(ridge.coef_) > 0.01)}")
print(f"{'Elastic Net':<15} {elastic.coef_[0]:<8.3f} {elastic.coef_[1]:<8.3f} "
      f"{elastic.coef_[2]:<8.3f} {np.sum(np.abs(elastic.coef_) > 0.01)}")

print("Notice: Lasso may arbitrarily pick one of the correlated features,")
print("while Ridge and Elastic Net spread weight across all three.")

# Vary l1_ratio
print("--- Effect of l1_ratio (α) ---")
for l1_ratio in [0.1, 0.3, 0.5, 0.7, 0.9]:
    en = ElasticNet(alpha=0.1, l1_ratio=l1_ratio)
    en.fit(X, y)
    nonzeros = np.sum(np.abs(en.coef_) > 0.01)
    print(f"l1_ratio={l1_ratio}: {nonzeros} non-zeros, coef[:3] = {np.round(en.coef_[:3], 3)}")
```

The regularization strength $\lambda$ (or $\alpha$ in scikit-learn) is a critical hyperparameter. Too small, and regularization has no effect; too large, and the model underfits.
Cross-Validation:
The standard approach is to select $\lambda$ via cross-validation: fit the model over a logarithmic grid of candidate $\lambda$ values, estimate out-of-sample error for each with K-fold cross-validation, and pick $\lambda$ according to one of the strategies below.
Common Strategies:
| Strategy | Description | When to Use |
|---|---|---|
| $\lambda_{\min}$ | λ with lowest CV error | Maximum predictive accuracy |
| $\lambda_{1se}$ | Largest λ within 1 SE of minimum | Simpler model, similar performance; prefer for interpretability |
| AIC/BIC | Information criteria (penalize complexity) | Model selection theory; Bayesian justification |
| Stability selection | Features selected across multiple subsamples | Robust feature selection |
```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV

# Generate data
np.random.seed(42)
n, p = 200, 50
X = np.random.randn(n, p)
true_w = np.zeros(p)
true_w[:10] = np.random.randn(10)
y = X @ true_w + 0.5 * np.random.randn(n)

print("Automatic Lambda Selection with Cross-Validation")
print("=" * 60)

# LassoCV automatically finds optimal lambda
lasso_cv = LassoCV(cv=5, alphas=np.logspace(-4, 1, 50), random_state=42)
lasso_cv.fit(X, y)

print("Lasso with CV:")
print(f"  Best α: {lasso_cv.alpha_:.6f}")
print(f"  Non-zero coefficients: {np.sum(np.abs(lasso_cv.coef_) > 1e-6)}")
print(f"  True non-zero: {np.sum(true_w != 0)}")

# Ridge CV
ridge_cv = RidgeCV(cv=5, alphas=np.logspace(-4, 4, 50))
ridge_cv.fit(X, y)
print("Ridge with CV:")
print(f"  Best α: {ridge_cv.alpha_:.6f}")

# Elastic Net CV
enet_cv = ElasticNetCV(cv=5, l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95],
                       alphas=np.logspace(-4, 1, 50), random_state=42)
enet_cv.fit(X, y)
print("Elastic Net with CV:")
print(f"  Best α: {enet_cv.alpha_:.6f}")
print(f"  Best l1_ratio: {enet_cv.l1_ratio_:.2f}")
print(f"  Non-zero coefficients: {np.sum(np.abs(enet_cv.coef_) > 1e-6)}")

# Regularization path
print("--- Regularization Path (Lasso) ---")
print(f"{'Alpha':<12} {'Non-zeros':<12} {'CV MSE (mean)':<15}")
print("-" * 45)
for alpha, mse_path in zip(lasso_cv.alphas_[::10], lasso_cv.mse_path_[::10]):
    mean_mse = np.mean(mse_path)
    n_nonzero = np.sum(np.abs(lasso_cv.coef_) > 1e-6) if alpha == lasso_cv.alpha_ else "—"
    print(f"{alpha:<12.6f} {str(n_nonzero):<12} {mean_mse:<15.4f}")
```

Start with a wide logarithmic range for λ (e.g., 10⁻⁴ to 10²). For Elastic Net, also search over l1_ratio (e.g., 0.1, 0.5, 0.7, 0.9, 0.99). Use at least 5-fold CV. Prefer λ_1se over λ_min if interpretability/simplicity matters. Always standardize features before regularization so that λ penalizes all features comparably.
Norm-based regularization extends to deep learning, though the landscape is more complex due to the non-convex loss surfaces and the unique properties of neural networks.
Weight Decay (L2):
In deep learning, L2 regularization is typically implemented as weight decay—adding a fraction of the weights to the gradient update:
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta(\nabla L + \lambda \mathbf{w}_t) = (1 - \eta\lambda)\mathbf{w}_t - \eta\nabla L$$
The $(1 - \eta\lambda)$ factor 'decays' weights toward zero each step.
Note: With adaptive optimizers like Adam, weight decay and L2 regularization are NOT equivalent! Use decoupled weight decay (AdamW) for the intended behavior.
```python
import torch
import torch.nn as nn

# Example: Adding weight decay in PyTorch

# Method 1: In optimizer (applies to all parameters)
model = nn.Sequential(
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)

optimizer_with_wd = torch.optim.SGD(
    model.parameters(),
    lr=0.01,
    weight_decay=0.01  # L2 regularization strength
)

# Method 2: AdamW for proper weight decay (decoupled from adaptive learning rate)
optimizer_adamw = torch.optim.AdamW(
    model.parameters(),
    lr=0.001,
    weight_decay=0.01  # Proper decoupled weight decay
)

# Method 3: Different decay for different parameter groups
optimizer_custom = torch.optim.AdamW([
    {'params': model[0].parameters(), 'weight_decay': 0.01},   # First layer
    {'params': model[2].parameters(), 'weight_decay': 0.001}   # Second layer (less decay)
], lr=0.001)

# Manual L2 regularization (equivalent to weight_decay in SGD)
def train_step_manual_l2(model, optimizer, x, y, l2_lambda=0.01):
    optimizer.zero_grad()
    output = model(x)
    loss = nn.functional.cross_entropy(output, y)
    # Add L2 penalty manually
    l2_reg = sum(param.pow(2).sum() for param in model.parameters())
    loss = loss + l2_lambda * l2_reg
    loss.backward()
    optimizer.step()
    return loss.item()

print("Weight decay in PyTorch:")
print("- Use weight_decay parameter in optimizer")
print("- Prefer AdamW over Adam + L2 for proper decoupled weight decay")
print("- Can specify different decay for different parameter groups")
print("- Typically weight_decay ~ 0.01 to 0.1 for transformers")
```

In the Adam optimizer, weight decay and L2 regularization are NOT equivalent due to adaptive learning rates: the L2 penalty's gradient is scaled by Adam's per-parameter step size, diluting its effect. AdamW fixes this by applying weight decay directly to the weights, not through the gradient. Always prefer AdamW for consistent regularization.
After understanding the theory, here are actionable guidelines for applying regularization in practice.
| Method | Parameter | Typical Range | Notes |
|---|---|---|---|
| Ridge | λ (alpha) | 10⁻⁴ to 10⁴ | Wide range; scale depends on data |
| Lasso | λ (alpha) | 10⁻⁴ to 10¹ | Larger values → more zeros |
| Elastic Net | λ (alpha) | 10⁻⁴ to 10¹ | Combined overall strength |
| Elastic Net | l1_ratio | 0.1 to 0.99 | Higher → more sparsity |
| Deep Learning | weight_decay | 10⁻⁶ to 10⁻¹ | 0.01-0.1 common for transformers |
Think of regularization as expressing a prior belief: 'All else equal, I prefer simpler models with smaller/sparser weights.' The regularization strength λ quantifies how strongly you hold this belief. Cross-validation lets the data tell you the appropriate strength.
Module Complete:
This concludes our module on Norms and Distance Metrics. We've journeyed from the abstract axioms of norms through their concrete applications in measuring vectors, matrices, distances, similarities, and finally controlling model complexity through regularization.
These concepts form a mathematical foundation that appears throughout machine learning—in loss functions, in optimization algorithms, in generalization theory, and in practical model tuning. Mastery of these ideas equips you to understand why algorithms work and how to make them work better.
You've mastered the theory and practice of norms, distances, similarities, and regularization. These tools for measuring 'size' and 'difference' are fundamental to virtually every ML algorithm—from linear regression to deep neural networks to clustering to retrieval systems. You're now equipped to choose appropriate metrics for your problems and apply regularization effectively.