Having established why regularization is essential and how it works theoretically, we now survey the specific regularization techniques available to practitioners. Each regularizer embodies different assumptions about desirable model properties—sparsity, smoothness, group structure—and understanding these differences is essential for choosing the right tool for each problem.
This page provides a comprehensive tour of regularization methods, from classical norm penalties to modern techniques like dropout and batch normalization, equipping you with the knowledge to select, combine, and tune regularizers effectively.
By the end of this page, you will understand: (1) L2 (Ridge) regularization in depth, (2) L1 (Lasso) and its sparsity-inducing properties, (3) Elastic Net and when to use it, (4) Group Lasso for structured sparsity, (5) modern neural network regularizers (Dropout, Batch Normalization), (6) data augmentation as regularization, and (7) practical guidelines for regularizer selection.
L2 regularization is the most widely used regularizer, adding a penalty on the squared magnitude of parameters.
$$\mathcal{L}_{\text{reg}}(w) = \mathcal{L}(w) + \lambda |w|_2^2 = \mathcal{L}(w) + \lambda \sum_{j=1}^d w_j^2$$
Alternative names: weight decay (in neural networks), Ridge regression (in linear models), and Tikhonov regularization (in inverse problems).
Note: In neural networks, the penalty $\lambda |w|_2^2$ is usually implemented as weight decay. The exact relationship depends on the optimizer: for plain SGD, the L2 penalty shrinks the weights by a factor $(1 - 2\eta\lambda)$ per step, so the equivalent decoupled weight-decay coefficient is $2\eta\lambda$ (with $\eta$ the learning rate). For adaptive optimizers the two are no longer equivalent, which is why decoupled weight decay (as in AdamW) is treated separately.
For linear regression, Ridge has a beautiful closed form:
$$\hat{w}_{\text{Ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$
Comparison to OLS: $(X^\top X + \lambda I)^{-1}$ vs. $(X^\top X)^{-1}$
Adding $\lambda I$ ensures the matrix is invertible, even if $X^\top X$ is singular or ill-conditioned.
SVD perspective: If $X = UDV^\top$, then: $$\hat{w}_{\text{Ridge}} = V \cdot \text{diag}\left(\frac{d_j}{d_j^2 + \lambda}\right) \cdot U^\top y$$
Each singular value direction is shrunk by factor $d_j/(d_j^2 + \lambda)$. Small singular values (noisy directions) are shrunk more.
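To make the closed form and the SVD shrinkage concrete, here is a minimal NumPy sketch (the toy data and variable names are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))             # toy design matrix
y = X @ np.array([1.0, 0.5, 0.0, -2.0, 3.0]) + 0.1 * rng.normal(size=50)
lam = 1.0                                # regularization strength

# Closed-form Ridge solution: (X^T X + lambda I)^{-1} X^T y
d = X.shape[1]
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Equivalent SVD view: each singular direction shrunk by d_j / (d_j^2 + lambda)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

print(np.allclose(w_ridge, w_svd))       # True: the two forms agree
```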
| Property | Description | Implication |
|---|---|---|
| Uniform shrinkage | All weights shrink proportionally | No feature selection |
| Smooth penalty | Differentiable everywhere | Standard gradient optimization |
| Gaussian prior | MAP with $w \sim \mathcal{N}(0, \tau^2 I)$ | Bayesian interpretation |
| Stability | Small changes in data → small changes in $w$ | Robust generalization |
| Closed form (linear) | Explicit solution available | Computational efficiency |
| Non-sparse | Weights shrink but never equal zero | All features retained |
Use L2 regularization when: (1) you believe all features contribute somewhat, (2) features are correlated (L2 handles collinearity well), (3) you don't need feature selection, (4) computational efficiency matters (closed form for linear models), or (5) as a default choice when unsure.
L1 regularization penalizes the sum of absolute parameter values, inducing sparsity.
$$\mathcal{L}_{\text{reg}}(w) = \mathcal{L}(w) + \lambda |w|_1 = \mathcal{L}(w) + \lambda \sum_{j=1}^d |w_j|$$
The Lasso (Least Absolute Shrinkage and Selection Operator) introduced by Tibshirani (1996) revolutionized high-dimensional statistics by enabling simultaneous estimation and feature selection.
Key result: For sufficiently large $\lambda$, many $\hat{w}_j = 0$ exactly.
Why sparsity occurs: the L1 ball has corners on the coordinate axes, so the constrained optimum often lands where some coordinates are exactly zero. Equivalently, the subgradient of $|w_j|$ at zero covers the whole interval $[-\lambda, \lambda]$, so small correlations with the residual are absorbed rather than producing a nonzero weight.
Soft-thresholding: For the simple problem $\min_w \frac{1}{2}(y - w)^2 + \lambda|w|$: $$\hat{w} = \text{sign}(y) \cdot \max(|y| - \lambda, 0)$$
Values within $[-\lambda, \lambda]$ are set exactly to zero.
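A direct way to see this is to implement the soft-thresholding operator; the sketch below (pure NumPy, illustrative names) reproduces the formula above:

```python
import numpy as np

def soft_threshold(y, lam):
    """Solve min_w 0.5*(y - w)^2 + lam*|w| via soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

print(soft_threshold(np.array([-3.0, -0.5, 0.2, 2.0]), lam=1.0))
# [-2. -0.  0.  1.]  -- values inside [-1, 1] are zeroed exactly
```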
The non-differentiability of L1 requires specialized methods:
Coordinate Descent: optimize one coordinate at a time; each one-dimensional subproblem has a closed-form soft-thresholding solution (the workhorse behind glmnet and scikit-learn's Lasso).
Proximal Gradient (ISTA, FISTA): take a gradient step on the smooth loss, then apply soft-thresholding as the proximal step; FISTA adds momentum for faster convergence.
LARS (Least Angle Regression): traces the entire Lasso solution path as $\lambda$ varies, adding one feature at a time.
For Lasso, the coordinate descent update for $w_j$ is: $w_j^{\text{new}} = S_\lambda(\rho_j) / (\sum_i x_{ij}^2)$ where $\rho_j = \sum_i x_{ij}(y_i - \sum_{k \neq j} w_k x_{ik})$ and $S_\lambda$ is soft-thresholding. This iterates until convergence, typically very fast for sparse solutions.
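Below is a compact, self-contained sketch of this coordinate descent loop (illustrative code, not a production solver):

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimize 0.5*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)                 # sum_i x_ij^2 for each feature
    for _ in range(n_iters):
        for j in range(d):
            # Partial residual excluding feature j's current contribution
            r_j = y - X @ w + X[:, j] * w[j]
            rho_j = X[:, j] @ r_j
            # Soft-threshold, then rescale by the feature's squared norm
            w[j] = np.sign(rho_j) * max(abs(rho_j) - lam, 0.0) / col_sq[j]
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = X[:, 0] * 3.0 - X[:, 3] * 2.0 + 0.1 * rng.normal(size=100)
print(np.round(lasso_coordinate_descent(X, y, lam=10.0), 2))  # mostly exact zeros
```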
Elastic Net combines L1 and L2 penalties, getting the best of both worlds.
$$\mathcal{L}_{\text{reg}}(w) = \mathcal{L}(w) + \lambda_1 |w|_1 + \lambda_2 |w|_2^2$$
Often parameterized as: $$\mathcal{L}_{\text{reg}}(w) = \mathcal{L}(w) + \lambda \left[ \alpha |w|_1 + (1-\alpha) |w|_2^2 \right]$$
where $\lambda \ge 0$ sets the overall regularization strength and $\alpha \in [0, 1]$ sets the mix: $\alpha = 1$ recovers the Lasso, $\alpha = 0$ recovers Ridge, and intermediate values blend the two.
The grouping effect: Lasso has an undesirable property with correlated features — it tends to select one and ignore others arbitrarily. Elastic Net's L2 component encourages correlated features to have similar weights.
Mathematical explanation: the L2 term makes the penalty strictly convex; in the extreme case of two identical features, it forces their coefficients to be equal at the optimum, whereas the pure Lasso can split the same total weight between them arbitrarily.
Practical benefit: More stable feature selection; correlated features are selected together.
| Property | L2 (Ridge) | L1 (Lasso) | Elastic Net |
|---|---|---|---|
| Sparsity | No | Yes | Yes |
| Grouping effect | Yes | No | Yes |
| Max features (p > n) | All | ≤ n | Up to p |
| Stability | High | Low | Medium |
| Correlated features | All retained | One selected | Group selected |
| Optimization | Easy (differentiable) | Harder | Harder |
Use Elastic Net when: (1) features are correlated and you want stable selection, (2) you need both sparsity and grouping, (3) p >> n but you expect many non-zero features, (4) Lasso selection is unstable across samples, or (5) as a safe default that includes Lasso and Ridge as special cases.
Elastic Net has two hyperparameters:
$\lambda$ (overall strength): Tune via cross-validation, similar to Lasso/Ridge.
$\alpha$ (L1/L2 ratio): usually tuned over a small grid (e.g., $\{0.1, 0.5, 0.7, 0.9, 0.95, 1.0\}$); values near 1 behave like Lasso (more sparsity), values near 0 behave like Ridge (more grouping and shrinkage).
Practical note: Many implementations (scikit-learn, glmnet) efficiently compute the solution path for a grid of $\lambda$ values at each $\alpha$.
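For example, with scikit-learn one can tune both hyperparameters jointly along the regularization path (a minimal sketch; the synthetic data and the parameter grid are illustrative):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = X[:, :5] @ np.ones(5) + 0.5 * rng.normal(size=200)   # only 5 true features

# Note: scikit-learn's l1_ratio is the alpha in the text; its alpha is the lambda.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0],
                     n_alphas=100, cv=5)
model.fit(X, y)
print(model.l1_ratio_, model.alpha_)                      # selected mix and strength
print(np.sum(model.coef_ != 0), "non-zero coefficients")
```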
When features have known group structure, specialized regularizers can exploit this information.
When features are organized into groups $\mathcal{G}_1, \ldots, \mathcal{G}_K$:
$$\Omega(w) = \sum_{k=1}^K \sqrt{p_k} |w_{\mathcal{G}_k}|_2$$
where $p_k = |\mathcal{G}_k|$ is the group size and $w_{\mathcal{G}_k}$ are the weights in group $k$.
Effect: Entire groups are set to zero or retained together.
Examples of groups: dummy variables encoding a single categorical feature, polynomial or spline terms derived from one raw variable, genes in the same biological pathway, or all weights entering one neuron.
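The mechanism behind group-level selection is block soft-thresholding: the proximal operator of the Group Lasso penalty shrinks each group's weight vector as a unit and zeroes out entire groups whose norm falls below the threshold. A minimal NumPy sketch (groups and values are illustrative):

```python
import numpy as np

def group_soft_threshold(w, groups, lam):
    """Proximal step for the penalty lam * sum_k sqrt(p_k) * ||w_Gk||_2."""
    w = w.copy()
    for idx in groups:
        block = w[idx]
        thresh = lam * np.sqrt(len(idx))            # sqrt(p_k) scaling
        norm = np.linalg.norm(block)
        if norm <= thresh:
            w[idx] = 0.0                            # entire group eliminated
        else:
            w[idx] = block * (1.0 - thresh / norm)  # group shrunk as a unit
    return w

groups = [np.array([0, 1, 2]), np.array([3, 4]), np.array([5])]
w = np.array([0.2, -0.1, 0.15, 3.0, -2.0, 0.4])
print(group_soft_threshold(w, groups, lam=0.5))     # small-norm groups zeroed, large group shrunk
```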
Combines group sparsity with within-group sparsity:
$$\Omega(w) = \lambda_1 \sum_k \sqrt{p_k} |w_{\mathcal{G}_k}|_2 + \lambda_2 |w|_1$$
Effect: Some groups are eliminated entirely; within retained groups, some individual weights may be zero.
Use case: When you want both group selection and feature selection within groups.
Fused Lasso: $$\Omega(w) = \lambda_1 |w|_1 + \lambda_2 \sum_{j=1}^{d-1} |w_{j+1} - w_j|$$
Encourages both sparsity and contiguity — adjacent parameters should be similar. Useful for spatial/temporal data.
Nuclear Norm (Trace Norm): $$\Omega(W) = |W|_* = \sum_i \sigma_i(W)$$
Sum of singular values for matrix-valued parameters. Encourages low-rank structure.
Total Variation: $$\Omega(w) = \sum_{i,j} |w_{i+1,j} - w_{i,j}| + |w_{i,j+1} - w_{i,j}|$$
For 2D structured parameters (images). Encourages piecewise constant solutions.
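These structured penalties are straightforward to evaluate; the sketch below computes the fused lasso, nuclear norm, and (anisotropic) total variation penalties for toy parameters, with all penalty weights set to 1 (illustrative code only):

```python
import numpy as np

w = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])       # piecewise-constant 1D parameters
W = np.outer(np.arange(4.0), np.arange(5.0))        # rank-1 matrix parameter
P = np.pad(np.ones((3, 3)), 1)                      # 5x5 "image" with a flat patch

fused     = np.sum(np.abs(w)) + np.sum(np.abs(np.diff(w)))   # sparsity + contiguity
nuclear   = np.sum(np.linalg.svd(W, compute_uv=False))       # sum of singular values
total_var = (np.sum(np.abs(np.diff(P, axis=0)))
             + np.sum(np.abs(np.diff(P, axis=1))))            # 2D piecewise constancy

print(fused, nuclear, total_var)
```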
Structured regularization is powerful when the assumed structure matches reality. Group Lasso helps when feature groups are meaningful. Fused Lasso helps with ordered features. Nuclear norm helps with matrix completion. Mismatched structure assumptions can hurt rather than help — domain knowledge is essential.
Dropout is a powerful regularization technique for neural networks, introduced by Hinton et al. (2012).
Training: For each training batch, randomly "drop" (set to zero) each neuron activation with probability $p$ (typically 0.5).
$$\tilde{h}_i = \begin{cases} 0 & \text{with probability } p \\ h_i / (1-p) & \text{with probability } 1-p \end{cases}$$
The division by $(1-p)$ maintains the expected activation magnitude (inverted dropout).
Inference: use all neurons with no dropout; with inverted dropout, no additional rescaling of the weights is needed at test time.
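A minimal NumPy sketch of inverted dropout on a batch of activations (illustrative; real frameworks provide this as a layer):

```python
import numpy as np

def dropout(h, p, training=True, rng=np.random.default_rng()):
    """Inverted dropout: zero each activation with prob p, rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return h                                  # inference: identity, no rescaling
    mask = rng.random(h.shape) >= p               # keep with probability 1 - p
    return h * mask / (1.0 - p)

h = np.ones((2, 4))
print(dropout(h, p=0.5))                          # surviving units scaled to 2.0
print(dropout(h, p=0.5, training=False))          # unchanged at inference
```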
Ensemble interpretation: Each training step uses a different "thinned" network. Final network averages predictions of exponentially many sub-networks.
Co-adaptation prevention: Neurons can't rely on specific other neurons, forcing redundancy and robust features.
Implicit L2 effect: Dropout has been shown to approximate L2 regularization for linear models.
Bayesian connection: Can be interpreted as approximate Bayesian inference with a specific prior.
| Parameter | Typical Value | Effect of Increasing |
|---|---|---|
| Drop probability $p$ | 0.5 for hidden layers | More regularization, may underfit |
| Input dropout | 0.1-0.2 for inputs | Robustness to input noise |
| Layer-specific rates | Lower for early layers | Preserve low-level features |
Standard dropout can hurt with batch normalization. Spatial Dropout drops entire feature maps (for CNNs). DropConnect drops individual weights instead of activations. Concrete Dropout learns drop rates. Alpha Dropout is designed for self-normalizing networks (SELU activation).
When to use: large fully connected layers, limited training data, or clear signs of overfitting on the validation set.
When to avoid/modify: very small networks, layers already heavily regularized by BatchNorm, or convolutional feature maps (prefer Spatial Dropout there).
Tuning: treat the drop rate as a hyperparameter; start around 0.5 for hidden layers and 0.1-0.2 for inputs, and lower it if the model underfits.
Batch Normalization (Ioffe & Szegedy, 2015) was originally proposed to address internal covariate shift, but has significant regularization effects.
For each mini-batch, normalize activations:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
Then apply learned affine transformation: $$y_i = \gamma \hat{x}_i + \beta$$
where $\mu_B, \sigma_B^2$ are batch mean and variance, and $\gamma, \beta$ are learnable parameters.
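A minimal NumPy sketch of the training-time forward pass (per-feature statistics over the batch; illustrative only, omitting the running averages used at inference):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then apply the learned affine map."""
    mu = x.mean(axis=0)                    # batch mean per feature
    var = x.var(axis=0)                    # batch variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # learned scale and shift

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per feature
```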
Noise injection: Batch statistics vary between batches, adding stochastic noise similar to dropout.
Gradient smoothing: Normalizing activations prevents gradient explosion/vanishing, enabling more stable training.
Implicit regularization: Changes the optimization landscape in ways that favor flatter minima.
Reduced sensitivity to initialization: Less dependence on careful weight initialization.
Note: BatchNorm's regularization effect means you may need less dropout when using it.
Layer Normalization: normalizes across the features of each example rather than across the batch, so it is independent of batch size; standard in Transformers and recurrent networks.
Weight Normalization: reparameterizes each weight vector as a direction times a learned magnitude, decoupling scale from direction.
Spectral Normalization: constrains the largest singular value of each weight matrix, bounding the layer's Lipschitz constant; widely used in GAN discriminators.
Mixup: trains on convex combinations of pairs of inputs and their labels, encouraging linear behavior between training examples.
Modern networks often combine multiple regularizers: BatchNorm + mild Dropout + Weight Decay + Data Augmentation. The interaction effects are complex — when adding BatchNorm, often reduce dropout rate. When using strong augmentation, may need less weight decay. Tune the combination, not just individual components.
Beyond explicit penalty terms, several techniques provide implicit regularization.
Artificially expand training data by applying label-preserving transformations.
Image augmentation examples: random flips, crops, rotations, scaling, color jitter, and cutout/random erasing.
Text augmentation examples: synonym replacement, random word deletion or swapping, and back-translation.
Audio augmentation examples: time shifting, pitch shifting, added background noise, and time/frequency masking (as in SpecAugment).
Variance reduction: Seeing more variations reduces sensitivity to specific training examples.
Implicit invariances: Augmentations encode prior knowledge (rotated digit is still the same digit).
Vicinal risk minimization: Augmented examples fill in "vicinity" of real examples, smoothing the decision boundary.
Effective data increase: Acts like having more training data, reducing generalization gap.
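As a concrete illustration, here is a tiny NumPy sketch of two label-preserving image augmentations (random horizontal flip and random crop with padding); real pipelines typically use a library such as torchvision or albumentations:

```python
import numpy as np

def augment(img, pad=4, rng=np.random.default_rng()):
    """Random horizontal flip + random crop of an (H, W, C) image; label unchanged."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                         # horizontal flip
    h, w, _ = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top, left = rng.integers(0, 2 * pad + 1, size=2)
    return padded[top:top + h, left:left + w, :]      # same size, shifted content

img = np.random.default_rng(0).random((32, 32, 3))    # stand-in for a small RGB image
print(augment(img).shape)                             # (32, 32, 3)
```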
Stop training before convergence based on validation performance.
The mechanism: train while monitoring a held-out validation metric; when it stops improving for a set number of epochs (the patience), stop and restore the best checkpoint.
Why it regularizes: starting from small initial weights, each gradient step can only move the parameters a bounded distance, so stopping early limits the effective complexity of the fitted model, much like a norm constraint.
Connection to L2: for gradient descent on a quadratic loss, early stopping is approximately equivalent to L2 regularization, with $1/(\eta t)$ playing the role of $\lambda$ ($\eta$ the learning rate, $t$ the number of steps).
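A minimal sketch of patience-based early stopping (the training and validation callables are placeholders for whatever your framework provides):

```python
def train_with_early_stopping(train_one_epoch, validate, max_epochs=100, patience=5):
    """Stop when the validation loss has not improved for `patience` epochs."""
    best_loss, best_epoch, epochs_without_improvement = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = validate()
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            epochs_without_improvement = 0    # in practice, also checkpoint the model here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_loss

# Toy usage: validation loss improves, then plateaus and degrades.
losses = iter([1.0, 0.8, 0.7, 0.72, 0.75, 0.74, 0.76, 0.8, 0.9])
print(train_with_early_stopping(lambda: None, lambda: next(losses)))  # best epoch 2
```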
With many regularization options, how do you choose? Here's a practical guide.
Step 1: Identify the problem type (linear vs. deep model, how the sample size n compares to the number of features p, and whether features are correlated or naturally grouped).
Step 2: Identify goals (pure predictive accuracy, interpretability or feature selection, stability of the selected features, compute budget).
| Scenario | Recommended Regularizer | Why |
|---|---|---|
| Default linear model | Ridge (L2) | Safe, stable, handles correlation |
| Feature selection needed | Lasso (L1) | Produces sparse models |
| Correlated features + selection | Elastic Net | Groups correlated features |
| Known feature groups | Group Lasso | Leverages group structure |
| Deep fully-connected network | Dropout + Weight Decay | Standard practice |
| Deep CNN | BatchNorm + Augmentation + Weight Decay | Modern best practice |
| Limited labeled data | Strong augmentation + Dropout | Maximize data utilization |
| Quick training / limited compute | Early stopping | Simple, saves time |
General approach: start from the recommended default for your scenario, tune the regularization strength by cross-validation on a logarithmic grid, and add or strengthen regularizers only when validation performance indicates overfitting.
Default starting points: for linear models, tune $\lambda$ over a log-spaced grid; for deep networks, common defaults are dropout around 0.5 on large hidden layers and a small weight decay (on the order of $10^{-4}$), adjusted from there.
"When in doubt, regularize more." It's usually safer to err on the side of more regularization — you can always reduce it if you see underfitting. The cost of underfitting (high bias) is usually less severe than undetected overfitting. Start with stronger regularization and relax as needed.
We have surveyed the landscape of regularization techniques available to practitioners. The key insights: L2 shrinks all weights and handles correlated features; L1 induces exact sparsity and performs feature selection; Elastic Net combines the two; structured penalties exploit known group, ordering, or low-rank structure; and dropout, batch normalization, data augmentation, and early stopping regularize neural networks implicitly. The right choice is driven by problem structure and modeling goals.
You have now completed a comprehensive study of Regularization Theory: why regularization is needed, its constraint-based and Bayesian (prior) interpretations, its effect on generalization, and the practical techniques surveyed on this page.
This knowledge forms a crucial foundation for designing, training, and understanding machine learning models that generalize well beyond their training data.
Congratulations! You have mastered Regularization Theory from multiple perspectives — motivation, constraint, prior, generalization effects, and practical techniques. You can now select, combine, and tune regularizers effectively, understanding both the intuition and the rigorous theory behind each approach. This completes Module 5 of Statistical Learning Theory.