Having established why regularization is essential and how it works theoretically, we now survey the specific regularization techniques available to practitioners. Each regularizer embodies different assumptions about desirable model properties—sparsity, smoothness, group structure—and understanding these differences is essential for choosing the right tool for each problem.
This page provides a comprehensive tour of regularization methods, from classical norm penalties to modern techniques like dropout and batch normalization, equipping you with the knowledge to select, combine, and tune regularizers effectively.
By the end of this page, you will understand: (1) L2 (Ridge) regularization in depth, (2) L1 (Lasso) and its sparsity-inducing properties, (3) Elastic Net and when to use it, (4) Group Lasso for structured sparsity, (5) modern neural network regularizers (Dropout, Batch Normalization), (6) data augmentation as regularization, and (7) practical guidelines for regularizer selection.
L2 regularization is the most widely used regularizer, adding a penalty on the squared magnitude of parameters.
$$\mathcal{L}_{\text{reg}}(w) = \mathcal{L}(w) + \lambda |w|_2^2 = \mathcal{L}(w) + \lambda \sum_{j=1}^d w_j^2$$
Alternative names: weight decay (in neural networks), Ridge regression (in linear models), and Tikhonov regularization (in inverse problems).
Note: In neural networks, the penalty $\lambda |w|_2^2$ is usually implemented as weight decay. The exact relationship depends on the optimizer: for plain SGD, the L2 penalty shrinks the weights by a factor $(1 - 2\eta\lambda)$ per step, so the equivalent decoupled weight-decay coefficient is $2\eta\lambda$ (with $\eta$ the learning rate). For adaptive optimizers the two are no longer equivalent, which is why decoupled weight decay (as in AdamW) is treated separately.
For linear regression, Ridge has a beautiful closed form:
$$\hat{w}_{\text{Ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$
Comparison to OLS: $(X^\top X + \lambda I)^{-1}$ vs. $(X^\top X)^{-1}$
Adding $\lambda I$ ensures the matrix is invertible, even if $X^\top X$ is singular or ill-conditioned.
SVD perspective: If $X = UDV^\top$, then: $$\hat{w}_{\text{Ridge}} = V \cdot \text{diag}\left(\frac{d_j}{d_j^2 + \lambda}\right) \cdot U^\top y$$
Each singular value direction is shrunk by factor $d_j/(d_j^2 + \lambda)$. Small singular values (noisy directions) are shrunk more.
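To make the closed form and the SVD shrinkage concrete, here is a minimal NumPy sketch (the toy data and variable names are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))             # toy design matrix
y = X @ np.array([1.0, 0.5, 0.0, -2.0, 3.0]) + 0.1 * rng.normal(size=50)
lam = 1.0                                # regularization strength

# Closed-form Ridge solution: (X^T X + lambda I)^{-1} X^T y
d = X.shape[1]
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Equivalent SVD view: each singular direction shrunk by d_j / (d_j^2 + lambda)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

print(np.allclose(w_ridge, w_svd))       # True: the two forms agree
```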
| Property | Description | Implication |
|---|---|---|
| Uniform shrinkage | All weights shrink proportionally | No feature selection |
| Smooth penalty | Differentiable everywhere | Standard gradient optimization |
| Gaussian prior | MAP with $w \sim \mathcal{N}(0, \tau^2 I)$ | Bayesian interpretation |
| Stability | Small changes in data → small changes in $w$ | Robust generalization |
| Closed form (linear) | Explicit solution available | Computational efficiency |
| Non-sparse | Weights shrink but never equal zero | All features retained |
Use L2 regularization when: (1) you believe all features contribute somewhat, (2) features are correlated (L2 handles collinearity well), (3) you don't need feature selection, (4) computational efficiency matters (closed form for linear models), or (5) as a default choice when unsure.
L1 regularization penalizes the sum of absolute parameter values, inducing sparsity.
$$\mathcal{L}_{\text{reg}}(w) = \mathcal{L}(w) + \lambda |w|_1 = \mathcal{L}(w) + \lambda \sum_{j=1}^d |w_j|$$
The Lasso (Least Absolute Shrinkage and Selection Operator) introduced by Tibshirani (1996) revolutionized high-dimensional statistics by enabling simultaneous estimation and feature selection.
Key result: For sufficiently large $\lambda$, many $\hat{w}_j = 0$ exactly.
Why sparsity occurs: the L1 ball has corners on the coordinate axes, so the constrained optimum often lands where some coordinates are exactly zero. Equivalently, the subgradient of $|w_j|$ at zero covers the whole interval $[-\lambda, \lambda]$, so small correlations with the residual are absorbed rather than producing a nonzero weight.
Soft-thresholding: For the simple problem $\min_w \frac{1}{2}(y - w)^2 + \lambda|w|$: $$\hat{w} = \text{sign}(y) \cdot \max(|y| - \lambda, 0)$$
Values within $[-\lambda, \lambda]$ are set exactly to zero.
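A direct way to see this is to implement the soft-thresholding operator; the sketch below (pure NumPy, illustrative names) reproduces the formula above:

```python
import numpy as np

def soft_threshold(y, lam):
    """Solve min_w 0.5*(y - w)^2 + lam*|w| via soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

print(soft_threshold(np.array([-3.0, -0.5, 0.2, 2.0]), lam=1.0))
# [-2. -0.  0.  1.]  -- values inside [-1, 1] are zeroed exactly
```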
The non-differentiability of L1 requires specialized methods:
Coordinate Descent: optimize one coordinate at a time; each one-dimensional subproblem has a closed-form soft-thresholding solution (the workhorse behind glmnet and scikit-learn's Lasso).
Proximal Gradient (ISTA, FISTA): take a gradient step on the smooth loss, then apply soft-thresholding as the proximal step; FISTA adds momentum for faster convergence.
LARS (Least Angle Regression): traces the entire Lasso solution path as $\lambda$ varies, adding one feature at a time.
For Lasso, the coordinate descent update for $w_j$ is: $w_j^{\text{new}} = S_\lambda(\rho_j) / (\sum_i x_{ij}^2)$ where $\rho_j = \sum_i x_{ij}(y_i - \sum_{k \neq j} w_k x_{ik})$ and $S_\lambda$ is soft-thresholding. This iterates until convergence, typically very fast for sparse solutions.
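Below is a compact, self-contained sketch of this coordinate descent loop (illustrative code, not a production solver):

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimize 0.5*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)                 # sum_i x_ij^2 for each feature
    for _ in range(n_iters):
        for j in range(d):
            # Partial residual excluding feature j's current contribution
            r_j = y - X @ w + X[:, j] * w[j]
            rho_j = X[:, j] @ r_j
            # Soft-threshold, then rescale by the feature's squared norm
            w[j] = np.sign(rho_j) * max(abs(rho_j) - lam, 0.0) / col_sq[j]
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = X[:, 0] * 3.0 - X[:, 3] * 2.0 + 0.1 * rng.normal(size=100)
print(np.round(lasso_coordinate_descent(X, y, lam=10.0), 2))  # mostly exact zeros
```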
Elastic Net combines L1 and L2 penalties, getting the best of both worlds.
$$\mathcal{L}_{\text{reg}}(w) = \mathcal{L}(w) + \lambda_1 |w|_1 + \lambda_2 |w|_2^2$$
Often parameterized as: $$\mathcal{L}_{\text{reg}}(w) = \mathcal{L}(w) + \lambda \left[ \alpha |w|_1 + (1-\alpha) |w|_2^2 \right]$$
where $\lambda \ge 0$ sets the overall regularization strength and $\alpha \in [0, 1]$ sets the mix: $\alpha = 1$ recovers the Lasso, $\alpha = 0$ recovers Ridge, and intermediate values blend the two.
The grouping effect: Lasso has an undesirable property with correlated features — it tends to select one and ignore others arbitrarily. Elastic Net's L2 component encourages correlated features to have similar weights.
Mathematical explanation: the L2 term makes the penalty strictly convex; in the extreme case of two identical features, it forces their coefficients to be equal at the optimum, whereas the pure Lasso can split the same total weight between them arbitrarily.
Practical benefit: More stable feature selection; correlated features are selected together.
| Property | L2 (Ridge) | L1 (Lasso) | Elastic Net |
|---|---|---|---|
| Sparsity | No | Yes | Yes |
| Grouping effect | Yes | No | Yes |
| Max features (p > n) | All | ≤ n | Up to p |
| Stability | High | Low | Medium |
| Correlated features | All retained | One selected | Group selected |
| Optimization | Easy (differentiable) | Harder | Harder |
Use Elastic Net when: (1) features are correlated and you want stable selection, (2) you need both sparsity and grouping, (3) p >> n but you expect many non-zero features, (4) Lasso selection is unstable across samples, or (5) as a safe default that includes Lasso and Ridge as special cases.
Elastic Net has two hyperparameters:
$\lambda$ (overall strength): Tune via cross-validation, similar to Lasso/Ridge.
$\alpha$ (L1/L2 ratio): usually tuned over a small grid (e.g., $\{0.1, 0.5, 0.7, 0.9, 0.95, 1.0\}$); values near 1 behave like Lasso (more sparsity), values near 0 behave like Ridge (more grouping and shrinkage).
Practical note: Many implementations (scikit-learn, glmnet) efficiently compute the solution path for a grid of $\lambda$ values at each $\alpha$.
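For example, with scikit-learn one can tune both hyperparameters jointly along the regularization path (a minimal sketch; the synthetic data and the parameter grid are illustrative):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = X[:, :5] @ np.ones(5) + 0.5 * rng.normal(size=200)   # only 5 true features

# Note: scikit-learn's l1_ratio is the alpha in the text; its alpha is the lambda.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0],
                     n_alphas=100, cv=5)
model.fit(X, y)
print(model.l1_ratio_, model.alpha_)                      # selected mix and strength
print(np.sum(model.coef_ != 0), "non-zero coefficients")
```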
When features have known group structure, specialized regularizers can exploit this information.
When features are organized into groups $\mathcal{G}_1, \ldots, \mathcal{G}_K$:
$$\Omega(w) = \sum_{k=1}^K \sqrt{p_k} |w_{\mathcal{G}_k}|_2$$
where $p_k = |\mathcal{G}_k|$ is the group size and $w_{\mathcal{G}_k}$ are the weights in group $k$.
Effect: Entire groups are set to zero or retained together.
Examples of groups: dummy variables encoding a single categorical feature, polynomial or spline terms derived from one raw variable, genes in the same biological pathway, or all weights entering one neuron.
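The mechanism behind group-level selection is block soft-thresholding: the proximal operator of the Group Lasso penalty shrinks each group's weight vector as a unit and zeroes out entire groups whose norm falls below the threshold. A minimal NumPy sketch (groups and values are illustrative):

```python
import numpy as np

def group_soft_threshold(w, groups, lam):
    """Proximal step for the penalty lam * sum_k sqrt(p_k) * ||w_Gk||_2."""
    w = w.copy()
    for idx in groups:
        block = w[idx]
        thresh = lam * np.sqrt(len(idx))            # sqrt(p_k) scaling
        norm = np.linalg.norm(block)
        if norm <= thresh:
            w[idx] = 0.0                            # entire group eliminated
        else:
            w[idx] = block * (1.0 - thresh / norm)  # group shrunk as a unit
    return w

groups = [np.array([0, 1, 2]), np.array([3, 4]), np.array([5])]
w = np.array([0.2, -0.1, 0.15, 3.0, -2.0, 0.4])
print(group_soft_threshold(w, groups, lam=0.5))     # small-norm groups zeroed, large group shrunk
```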
Combines group sparsity with within-group sparsity:
$$\Omega(w) = \lambda_1 \sum_k \sqrt{p_k} |w_{\mathcal{G}_k}|_2 + \lambda_2 |w|_1$$
Effect: Some groups are eliminated entirely; within retained groups, some individual weights may be zero.
Use case: When you want both group selection and feature selection within groups.
Fused Lasso: $$\Omega(w) = \lambda_1 |w|_1 + \lambda_2 \sum_{j=1}^{d-1} |w_{j+1} - w_j|$$
Encourages both sparsity and contiguity — adjacent parameters should be similar. Useful for spatial/temporal data.
Nuclear Norm (Trace Norm): $$\Omega(W) = |W|_* = \sum_i \sigma_i(W)$$
Sum of singular values for matrix-valued parameters. Encourages low-rank structure.
Total Variation: $$\Omega(w) = \sum_{i,j} |w_{i+1,j} - w_{i,j}| + |w_{i,j+1} - w_{i,j}|$$
For 2D structured parameters (images). Encourages piecewise constant solutions.
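These structured penalties are straightforward to evaluate; the sketch below computes the fused lasso, nuclear norm, and (anisotropic) total variation penalties for toy parameters, with all penalty weights set to 1 (illustrative code only):

```python
import numpy as np

w = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])       # piecewise-constant 1D parameters
W = np.outer(np.arange(4.0), np.arange(5.0))        # rank-1 matrix parameter
P = np.pad(np.ones((3, 3)), 1)                      # 5x5 "image" with a flat patch

fused     = np.sum(np.abs(w)) + np.sum(np.abs(np.diff(w)))   # sparsity + contiguity
nuclear   = np.sum(np.linalg.svd(W, compute_uv=False))       # sum of singular values
total_var = (np.sum(np.abs(np.diff(P, axis=0)))
             + np.sum(np.abs(np.diff(P, axis=1))))            # 2D piecewise constancy

print(fused, nuclear, total_var)
```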
Structured regularization is powerful when the assumed structure matches reality. Group Lasso helps when feature groups are meaningful. Fused Lasso helps with ordered features. Nuclear norm helps with matrix completion. Mismatched structure assumptions can hurt rather than help — domain knowledge is essential.
Dropout is a powerful regularization technique for neural networks, introduced by Hinton et al. (2012).
Training: For each training batch, randomly "drop" (set to zero) each neuron activation with probability $p$ (typically 0.5).
$$\tilde{h}_i = \begin{cases} 0 & \text{with probability } p \\ h_i / (1-p) & \text{with probability } 1-p \end{cases}$$
The division by $(1-p)$ maintains the expected activation magnitude (inverted dropout).
Inference: use all neurons with no dropout; with inverted dropout, no additional rescaling of the weights is needed at test time.
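A minimal NumPy sketch of inverted dropout on a batch of activations (illustrative; real frameworks provide this as a layer):

```python
import numpy as np

def dropout(h, p, training=True, rng=np.random.default_rng()):
    """Inverted dropout: zero each activation with prob p, rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return h                                  # inference: identity, no rescaling
    mask = rng.random(h.shape) >= p               # keep with probability 1 - p
    return h * mask / (1.0 - p)

h = np.ones((2, 4))
print(dropout(h, p=0.5))                          # surviving units scaled to 2.0
print(dropout(h, p=0.5, training=False))          # unchanged at inference
```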
Ensemble interpretation: Each training step uses a different "thinned" network. Final network averages predictions of exponentially many sub-networks.
Co-adaptation prevention: Neurons can't rely on specific other neurons, forcing redundancy and robust features.
Implicit L2 effect: Dropout has been shown to approximate L2 regularization for linear models.
Bayesian connection: Can be interpreted as approximate Bayesian inference with a specific prior.
| Parameter | Typical Value | Effect of Increasing |
|---|---|---|
| Drop probability $p$ | 0.5 for hidden layers | More regularization, may underfit |
| Input dropout | 0.1-0.2 for inputs | Robustness to input noise |
| Layer-specific rates | Lower for early layers | Preserve low-level features |
Standard dropout can hurt with batch normalization. Spatial Dropout drops entire feature maps (for CNNs). DropConnect drops individual weights instead of activations. Concrete Dropout learns drop rates. Alpha Dropout is designed for self-normalizing networks (SELU activation).
When to use: large fully connected layers, limited training data, or clear signs of overfitting on the validation set.
When to avoid/modify: very small networks, layers already heavily regularized by BatchNorm, or convolutional feature maps (prefer Spatial Dropout there).
Tuning: treat the drop rate as a hyperparameter; start around 0.5 for hidden layers and 0.1-0.2 for inputs, and lower it if the model underfits.
Batch Normalization (Ioffe & Szegedy, 2015) was originally proposed to address internal covariate shift, but has significant regularization effects.
For each mini-batch, normalize activations:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
Then apply learned affine transformation: $$y_i = \gamma \hat{x}_i + \beta$$
where $\mu_B, \sigma_B^2$ are batch mean and variance, and $\gamma, \beta$ are learnable parameters.
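A minimal NumPy sketch of the training-time forward pass (per-feature statistics over the batch; illustrative only, omitting the running averages used at inference):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then apply the learned affine map."""
    mu = x.mean(axis=0)                    # batch mean per feature
    var = x.var(axis=0)                    # batch variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # learned scale and shift

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per feature
```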
Noise injection: Batch statistics vary between batches, adding stochastic noise similar to dropout.
Gradient smoothing: Normalizing activations prevents gradient explosion/vanishing, enabling more stable training.
Implicit regularization: Changes the optimization landscape in ways that favor flatter minima.
Reduced sensitivity to initialization: Less dependence on careful weight initialization.
Note: BatchNorm's regularization effect means you may need less dropout when using it.
Layer Normalization: normalizes across the features of each example rather than across the batch, so it is independent of batch size; standard in Transformers and recurrent networks.
Weight Normalization: reparameterizes each weight vector as a direction times a learned magnitude, decoupling scale from direction.
Spectral Normalization: constrains the largest singular value of each weight matrix, bounding the layer's Lipschitz constant; widely used in GAN discriminators.
Mixup: trains on convex combinations of pairs of inputs and their labels, encouraging linear behavior between training examples.
Modern networks often combine multiple regularizers: BatchNorm + mild Dropout + Weight Decay + Data Augmentation. The interaction effects are complex — when adding BatchNorm, often reduce dropout rate. When using strong augmentation, may need less weight decay. Tune the combination, not just individual components.
Beyond explicit penalty terms, several techniques provide implicit regularization.
Artificially expand training data by applying label-preserving transformations.
Image augmentation examples: random flips, crops, rotations, scaling, color jitter, and cutout/random erasing.
Text augmentation examples: synonym replacement, random word deletion or swapping, and back-translation.
Audio augmentation examples: time shifting, pitch shifting, added background noise, and time/frequency masking (as in SpecAugment).
Variance reduction: Seeing more variations reduces sensitivity to specific training examples.
Implicit invariances: Augmentations encode prior knowledge (rotated digit is still the same digit).
Vicinal risk minimization: Augmented examples fill in "vicinity" of real examples, smoothing the decision boundary.
Effective data increase: Acts like having more training data, reducing generalization gap.
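As a concrete illustration, here is a tiny NumPy sketch of two label-preserving image augmentations (random horizontal flip and random crop with padding); real pipelines typically use a library such as torchvision or albumentations:

```python
import numpy as np

def augment(img, pad=4, rng=np.random.default_rng()):
    """Random horizontal flip + random crop of an (H, W, C) image; label unchanged."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                         # horizontal flip
    h, w, _ = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top, left = rng.integers(0, 2 * pad + 1, size=2)
    return padded[top:top + h, left:left + w, :]      # same size, shifted content

img = np.random.default_rng(0).random((32, 32, 3))    # stand-in for a small RGB image
print(augment(img).shape)                             # (32, 32, 3)
```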
Stop training before convergence based on validation performance.
The mechanism: train while monitoring a held-out validation metric; when it stops improving for a set number of epochs (the patience), stop and restore the best checkpoint.
Why it regularizes: starting from small initial weights, each gradient step can only move the parameters a bounded distance, so stopping early limits the effective complexity of the fitted model, much like a norm constraint.
Connection to L2: for gradient descent on a quadratic loss, early stopping is approximately equivalent to L2 regularization, with $1/(\eta t)$ playing the role of $\lambda$ ($\eta$ the learning rate, $t$ the number of steps).
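A minimal sketch of patience-based early stopping (the training and validation callables are placeholders for whatever your framework provides):

```python
def train_with_early_stopping(train_one_epoch, validate, max_epochs=100, patience=5):
    """Stop when the validation loss has not improved for `patience` epochs."""
    best_loss, best_epoch, epochs_without_improvement = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = validate()
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            epochs_without_improvement = 0    # in practice, also checkpoint the model here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_loss

# Toy usage: validation loss improves, then plateaus and degrades.
losses = iter([1.0, 0.8, 0.7, 0.72, 0.75, 0.74, 0.76, 0.8, 0.9])
print(train_with_early_stopping(lambda: None, lambda: next(losses)))  # best epoch 2
```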
With many regularization options, how do you choose? Here's a practical guide.
Step 1: Identify the problem type (linear vs. deep model, how the sample size n compares to the number of features p, and whether features are correlated or naturally grouped).
Step 2: Identify goals (pure predictive accuracy, interpretability or feature selection, stability of the selected features, compute budget).
| Scenario | Recommended Regularizer | Why |
|---|---|---|
| Default linear model | Ridge (L2) | Safe, stable, handles correlation |
| Feature selection needed | Lasso (L1) | Produces sparse models |
| Correlated features + selection | Elastic Net | Groups correlated features |
| Known feature groups | Group Lasso | Leverages group structure |
| Deep fully-connected network | Dropout + Weight Decay | Standard practice |
| Deep CNN | BatchNorm + Augmentation + Weight Decay | Modern best practice |
| Limited labeled data | Strong augmentation + Dropout | Maximize data utilization |
| Quick training / limited compute | Early stopping | Simple, saves time |
General approach: start from the recommended default for your scenario, tune the regularization strength by cross-validation on a logarithmic grid, and add or strengthen regularizers only when validation performance indicates overfitting.
Default starting points: for linear models, tune $\lambda$ over a log-spaced grid; for deep networks, common defaults are dropout around 0.5 on large hidden layers and a small weight decay (on the order of $10^{-4}$), adjusted from there.
"When in doubt, regularize more." It's usually safer to err on the side of more regularization — you can always reduce it if you see underfitting. The cost of underfitting (high bias) is usually less severe than undetected overfitting. Start with stronger regularization and relax as needed.
We have surveyed the landscape of regularization techniques available to practitioners. The key insights: L2 shrinks all weights and handles correlated features; L1 induces exact sparsity and performs feature selection; Elastic Net combines the two; structured penalties exploit known group, ordering, or low-rank structure; and dropout, batch normalization, data augmentation, and early stopping regularize neural networks implicitly. The right choice is driven by problem structure and modeling goals.
You have now completed a comprehensive study of Regularization Theory: why regularization is needed, its constraint-based and Bayesian (prior) interpretations, its effect on generalization, and the practical techniques surveyed on this page.
This knowledge forms a crucial foundation for designing, training, and understanding machine learning models that generalize well beyond their training data.
Congratulations! You have mastered Regularization Theory from multiple perspectives — motivation, constraint, prior, generalization effects, and practical techniques. You can now select, combine, and tune regularizers effectively, understanding both the intuition and the rigorous theory behind each approach. This completes Module 5 of Statistical Learning Theory.